Peer Review History
| Original Submission: October 16, 2025 |
|---|
|
Dear Dr. Morosi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 22, 2026, 11:59 PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Wei Lun Wong
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf

2. If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Partly
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes
Reviewer #5: Yes

**********

2.
Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes
Reviewer #5: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available? (See the PLOS Data policy.)

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes
Reviewer #5: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes
Reviewer #5: Yes

**********

Reviewer #1: This paper examines how Large Language Models apply morphological rules to novel words using a multilingual version of the Wug Test. By comparing six models across four languages with human speakers, the study evaluates whether model performance reflects true linguistic competence or merely the amount of training data available. The findings suggest that data availability and community size, rather than linguistic complexity, primarily shape model accuracy.

- Please explain the motivation and scientific importance of comparing community size and grammatical complexity in particular.
- In the abstract, please distinguish between resource size and community size, as the current wording creates confusion.
- Please add a model diagram in the methodology section.
- It is suggested to use more appropriate (specific) wording than "language-blind", as "language" could mean many things. At present it leads to confusion.
- The colors in Fig 2 are indistinguishable; please add visibly distinct patterns to the bars.
- The study would benefit if the authors separately tested generative LLMs and reasoning/thinking LLMs.
- "English, the least complex of the four languages we tested, was not the best-performing language for either humans or models. Instead, Spanish consistently yielded the highest model accuracy, despite its greater linguistic complexity."
The authors' conclusion here is in contrast to their main claim in the paper. The authors need to investigate the actual reason behind this, given that English is simple, has a larger community, and has more resources.
- Similarly, the authors need to investigate why: "At the same time, Greek, which is the most complex language according to our metrics, did not occupy the lowest position; on the contrary, it systematically outperformed Catalan, which is relatively linguistically simpler."
- The authors should perform statistical significance tests to verify the relevance of these claims.
- Please explain the compute time and resources that were invested to conduct the study.
- For the Wug Test, give multiple examples in the results showing how different models performed and how humans performed. Also add the cases where the reported anomalies were seen, such as the failure cases and the unexpected anomalous cases.
- The length of the paper appears rather short; more emphasis is given to the literature review, and the presentation can be improved. It is suggested to make a chronological table of all the studies performed on this topic and list their conclusions/key findings, experimental setups, and datasets used.
- Furthermore, the results section needs to be strengthened by showing multiple examples/results.
- Please explain why and how accuracy was selected as the metric of choice. In many cases it is not the correct measure of performance; also add AUC, recall, sensitivity, F1-score, and precision scores.

Overall, the manuscript appears to make a useful contribution, but further justifications are required.

Reviewer #2: Dear Author,

The manuscript presents a clear and well-designed study examining how Large Language Models generalize morphological rules across four languages using a multilingual Wug Test. The research question is timely, and the methodology, especially the construction of nonce stimuli, the balanced design, and the use of GLMMs, is appropriate and transparent.
Ethical approval, participant recruitment, and data availability are all thoroughly documented. The results are clearly presented, and the interpretation is reasonable, particularly the conclusion that model performance aligns more with community size and data exposure than with structural complexity. Some claims, however, would benefit from slightly more cautious wording. The limitations section could also briefly address potential prompt-related biases when interacting with different models.

Overall, this is a strong and relevant contribution. With minor clarifications and small stylistic adjustments, the manuscript would be suitable for publication in PLOS ONE.

Reviewer #3: Thank you for the opportunity to review this interesting and timely manuscript. The study raises valuable questions; however, several areas would benefit from clarification, refinement, and further detail to strengthen the overall contribution. My detailed comments are as follows:

* Lines 252–254: This paragraph does not appear necessary and could be removed without affecting the clarity or structure of the manuscript.
* Wug test materials: You mention 30 items; however, the file provided ("Humans: Task and stimulus: novel words.xlsx") shows 15 two-syllable and 15 three-syllable words. It is important to clarify this breakdown in the manuscript and explicitly indicate that the full list is available in the supplementary materials.
* Selection criteria for nonce words: Please provide more information on how the nonce words were selected. Clarifying the linguistic criteria used would improve methodological transparency.
* Targeted morphological processes: Briefly state which morphological processes were targeted in the Wug test (e.g., inflectional morphology). Explain the motivation behind selecting these particular processes and whether this selection is supported by prior research.
* Lines 296–297: The exclusion criteria include cognitive, neurological, hearing, or speech-related impairments.
Please comment on whether these criteria could have influenced the results. If these factors are not expected to affect outcomes, clarify the rationale for including them.
* Mode and medium of testing: The manuscript should specify whether the test was administered in written form, spoken form, or both. If spoken, please indicate whether any recording requirements or controls were implemented for human participants.
* Line 339: The link provided earlier is repeated here; the duplication is unnecessary.
* Table 3: It is unclear why Catalan was not reported in this table. Please clarify or revise accordingly. Additionally, the inclusion of "(Intercept)" requires explanation, and it would be helpful to comment on this table in the main text.
* Language quality: A careful proofreading pass is needed to address minor issues with capitalization and punctuation.
* Line 394: The phrase "errors not observed in models" requires clarification. Does this refer to all errors, or only the specific error types discussed?
* Lines 401–402: When referencing the full breakdown of error types available in the OSF repository, please specify the file name to guide readers.
* Lines 476–477: The statement that "Germanic languages display higher levels of morphological irregularity than Romance languages" introduces a comparative rationale. If this comparison is central to the study, please present it earlier in the Methods section. It would also be helpful to explain the rationale for selecting these four languages and what distinguishes them.
* Choice of languages: More broadly, please articulate the logic behind selecting these four languages. What features or contrasts make them suitable for comparison in this context?
* Line 497: Consider rephrasing the sentence beginning with "Before concluding, it is important to acknowledge the limitations…" to avoid a dialogic construction.
* Statistical analysis: The statistical framework chosen for this study is appropriate for the data structure. Using generalized linear mixed models (GLMMs) to model binary accuracy is sound, and treating test items as random effects is a clear strength. However, several aspects of the analysis would benefit from refinement to ensure full rigor and transparency. Most importantly, the models for human data do not include participants as random effects, despite repeated observations, which may underestimate variance and inflate the apparent significance of fixed effects. Additionally, the structure of the combined human–model analysis is under-specified, particularly regarding whether an Agent × Language interaction was included, and the reporting relies primarily on p-values without providing effect sizes or diagnostic checks. Finally, the conclusions about linguistic complexity versus community size extend beyond what is directly supported by the statistical tests presented. Addressing these points would substantially strengthen the analytical robustness of the manuscript.
* Use of tools: Please ensure that all tools, platforms, or software referenced in the Methods and Results sections also appear in the References list.
* References: The reference list contains inconsistencies in formatting. Please revise to ensure adherence to the journal's required style.

Overall, the study presents promising results, and addressing the points above will help enhance the clarity, rigor, and completeness of the manuscript.

Reviewer #4: The manuscript presents a technically sound and methodologically rigorous study that clearly supports its conclusions. The experimental design, a controlled multilingual Wug Test administered to both human participants and six LLMs, is appropriate for examining morphological generalization, and the authors justify their choice of languages, stimuli construction, and procedures in detail.
The statistical analyses are correctly executed and suitable for the research questions. Accuracy is modeled with GLMMs including random effects for items, model comparisons are performed through likelihood ratio tests, and estimated marginal means are used to interpret cross-linguistic differences. These methods are transparent, replicable, and provide robust support for the claims made. Data availability fully complies with PLOS ONE policies, with all raw data, code, and stimuli openly accessible on OSF without restrictions. The manuscript itself is clearly written, logically structured, and expressed in standard academic English. While minor phrasing issues appear occasionally, they do not impede comprehension or interpretation of the findings. Overall, the authors deliver a well-executed study that contributes meaningfully to ongoing debates about LLM linguistic competence and presents results in a way that is both empirically grounded and accessible to a broad readership.

Room for improvement: Although the manuscript is well executed, several weaknesses limit the strength of its conclusions. First, the operationalization of "linguistic complexity" is not fully aligned with the study's goals. The authors rely on global Grambank fusion and informativity scores (e.g., lines 147–166), but these metrics incorporate many grammatical domains irrelevant to nominal morphology. This weakens claims about the relationship between morphological complexity and model accuracy, especially when the authors later acknowledge (lines 499–507) that a morphology-specific measure would be more appropriate. Second, the English stimuli appear to contain several nonce forms that unintentionally resemble irregular plural patterns (e.g., sungus, lutie), which the authors note may have misled both humans and models (lines 467–474).
Because these irregularity-triggering forms are not systematically described or quantified, it is difficult to assess whether English's lower accuracy reflects linguistic complexity, task design artifacts, or item-specific biases. Finally, some claims in the discussion appear stronger than the data justify. For example, the conclusion that models are "language-blind" and guided primarily by resource availability (lines 453–457) may overgeneralize from only four languages, two of which are typologically similar (Catalan and Spanish). These weaknesses should be addressed to strengthen the study's empirical and theoretical claims.

Reviewer #5: Good paper, interesting idea. I like the Wug Test setup across languages. The main finding is solid and worth publishing. But there are some pretty big problems to fix first.

• Stats need a major clean-up. The numbers in the paper don't add up enough for me to really trust them yet.
• Table 3 is confusing. What's it comparing to? I need to see the full stats table.
• Saying results aren't "significant" with p = .055 and .092 is kind of shaky. That's really close. You can't just say "they're the same" and move on. Talk about what those borderline numbers might mean.
• Kicking BERT out because it got 0% feels like cheating. If it's that bad, that's actually a cool finding! Either put it back in and explain why it's so different, or give a better reason upfront for why it doesn't count.
• "Community size" = training data? Not so fast. Your whole argument rests on this, but it's a huge assumption. Just because more people speak Spanish doesn't mean GPT saw exactly that much more Spanish text. The internet is weird. You need to either defend this link way better with evidence, or tone down your conclusions a lot and call this a major guess you had to make.
• You're over-selling it. Calling LLMs "language-blind" in the title is too much. Your own data show they kind of notice if a language is regular (like Spanish). Tone it down.
Also, be careful saying they only have "superficial" competence: that's a philosophy paper. Your experiment just shows they're good at this specific pattern-matching task, and their skill depends on how much material they've read.
• Tell us exactly what you typed into ChatGPT. The prompt matters.
• Figure 2 is a mess of lines. Make it simpler.
• Fix the references. Some are missing information.
• Don't blame "hard test items" for English being tough. Just stick with the "Germanic languages are irregular" explanation, which is better.

The core idea is cool and the paper should eventually be published. But you've got to fix the stats, be more honest about the "community size" guess, and not claim more than you actually proved. Do that, and you'll have a much stronger paper.

**********

Do you want your identity to be public for this peer review? If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: Yes: Nada AlJamal
Reviewer #4: Yes: Parisa Etemadfar
Reviewer #5: Yes: EBA TERESA GAROMA

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

To ensure your figures meet our technical requirements, please review our figure guidelines: https://journals.plos.org/plosone/s/figures

You may also use PLOS's free figure tool, NAAS, to help you prepare publication quality figures: https://journals.plos.org/plosone/s/figures#loc-tools-for-figure-preparation.
NAAS will assess whether your figures meet our technical requirements by comparing each figure against our figure specifications. |
| Revision 1 |
|
Community size rather than grammatical complexity better predicts Large Language Model accuracy in a novel Wug Test
PONE-D-25-55707R1

Dear Dr. Morosi,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the 'Update My Information' link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible, and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Wei Lun Wong
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #2: All comments have been addressed
Reviewer #3: All comments have been addressed
Reviewer #5: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #5: Yes

**********

3.
Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #5: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? (See the PLOS Data policy.)

Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #5: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: (No Response)
Reviewer #3: Yes
Reviewer #5: Yes

**********

Reviewer #2: The author has appropriately implemented all the required revisions. The revised manuscript shows clear improvements and meets the journal's scientific publication standards, with no major remaining concerns.

Reviewer #3: (No Response)

Reviewer #5: Review points properly addressed

• Motivation for comparing community size and grammatical complexity: The authors have added a clear explanation of the motivation in the Introduction (pp. 6–7). The connection between community size, digital representation, and training data availability is now better articulated, strengthening the rationale for the study.
• Clarification of "community size" vs. "resource size" in the Abstract: The abstract has been rephrased to clarify the relationship between community size and training data volume. The revised version is clearer and avoids potential confusion.
• Addition of a model diagram: A clear experimental design diagram (Fig. 1) has been added to the Methodology section, which improves the readability and reproducibility of the study.
• Removal of "language-blind" from the title: The title has been changed to a more precise and less overreaching formulation, which aligns better with the empirical findings.
• Improvement of Figure 2/3 readability: The revised Figure 3 now uses clearer color distinctions and a more intuitive layout (models on the x-axis, languages in separate panels), addressing my earlier concern about visual clarity.
• Additional analysis separating generative and reasoning LLMs: The authors conducted additional analyses comparing reasoning and non-reasoning models, which are now available in the OSF repository. This adds depth to the results and addresses my suggestion for further model-type comparisons.
• Statistical significance testing and post-hoc comparisons: The authors have rerun their statistical models with improved specifications (including participant random effects) and provided post-hoc comparisons in Table 8. The reporting is now more transparent and rigorous.
• Strengthened Introduction and Results sections: The Introduction now includes Table 1 summarizing relevant prior work, and the Results section is more detailed with added tables and examples, addressing my concern about the paper's initial brevity.

Remaining minor suggestions

While the authors have done an excellent job revising the manuscript, a few minor points could still be polished:

• Reference formatting: Although the authors state that references have been corrected, I noticed a few inconsistencies in formatting (e.g., capitalization, use of "et al.", DOI presentation). A final careful pass to ensure adherence to PLOS ONE style is recommended.
• Clarity in limitations: The limitation regarding the link between community size and training data is well-acknowledged. However, the authors might briefly suggest how future work could better operationalize this relationship (e.g., using corpus size estimates rather than speaker counts).

**********

Do you want your identity to be public for this peer review? If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Dr.
Neamah Dahash Farhan, Professor at the University of Baghdad / College of Islamic Sciences, Iraq
Reviewer #3: Yes: Nada AlJamal
Reviewer #5: Yes: EBA TERESA GAROMA

********** |
| Formally Accepted |
|
PONE-D-25-55707R1
PLOS ONE

Dear Dr. Morosi,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited
* All relevant supporting information is included in the manuscript submission
* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing. If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Wei Lun Wong
Academic Editor
PLOS ONE |
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.