Peer Review History
| Original Submission, June 21, 2025 |
|---|
|
PONE-D-25-33700
Out-of-context and out-of-scope: manipulating large language models through minimal instruction set modifications
PLOS ONE

Dear Dr. Zühlke,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 21 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Michael Flor
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Additional Editor Comments:

----------------------Editor notes--------------------

This is an interesting and timely work. I think it deserves to be published. However, the manuscript needs to be improved. Please also read the reviewer comments.
I agree with some of the review comments; please see my notes on this below.

[1] The paper relies heavily on Berglund et al. [9]. Please provide, in the introduction, a better description of what was done in that study [9]. You are familiar with that work, but most readers of your paper are probably not, and without such familiarity it is more difficult to understand and appreciate your work.

[2] “non-factorable tokens”: This is an unusual term (I can’t find it among top results in a web search). (‘Factorable’ is not a common word, but it means something like ‘can be decomposed’; ‘non-factorable’ would then mean ‘cannot be decomposed’.) From the section beginning on line 504, especially the examples, it seems those characters are represented by special sequences of three tokens. So in what sense are they ‘non-factorable’? The label is confusing. Please explain.

[3] Lines 339-340: “This dataset consists of 52,000 unique Alpaca instructions [45], for which the authors generated responses using GPT-4 [46].” Which authors? If you mean yourselves, write “we”.

[4] Regarding descriptions vs. instructions (section from line 336 and Figure 3): It is not clear why Figure 3 shows distributions. The charts in Fig 3 look very similar but not identical, yet it seems you had only one dataset of instructions + descriptions. If you used exactly the same fine-tuning set, how do you have distributions? It seems you did not use exactly the same fine-tuning set for all models but did some randomization. Please explain your process in more detail and why you have distributions.

[5] Line 393: “objective facts for our freeman and glados cases were copied from their incorrect case.” What is that ‘incorrect case’? This exemplifies why more details about the study by Berglund et al. [9] are needed.

[6] The paragraph on lines 412-420 is confusing. What do you mean by “This we mimic by exchanging …”? Do you mean eliminating user input, or exchanging it for something? Consider reformulating this paragraph.
[7] Regarding evaluation: The manuscript does not explain why automated evaluations were used (and not manual human inspection). While this might seem obvious to the authors, it needs to be stated explicitly. Was it because you needed to evaluate a large number of outputs from multiple runs? The reviewer suggests reporting inter-rater agreement. This is important. To limit the amount of work, consider conducting manual annotation on a sample of the data and reporting agreement (Cohen’s kappa would suffice).

[8] The reviewer also suggests standardizing the detection criteria (string rules vs. LLM judge). I think that is not needed; both methods can be fallible in some cases. Reporting agreement with a human annotator can be useful to justify the methods used. Consider special cases. For example, for the calling code of Germany, would the response “+48 +49 +50” be correct? It does contain “+49”, but it is not acceptable. If you rely on a string rule only, you would accept such a response. On the other hand, evaluator models can be confused (as admitted on line 502 for ‘german’). How do we know they are reliable in the other cases?

[9] Regarding the legends of Tables 1 & 2: note that ‘galdos’ is misspelled; you meant ‘glados’.

[10] Lines 570-572: “Moreover, while the non-factorable tokens did not always enhance the effect, they improved consistency and allowed the embedding and triggering of response behavior in all cases for both models.” What evidence do you have that they “improved consistency”, etc.? What is this claim based on? Please explain.

[11] Regarding the section “More insights from our experiments”: in each enumerated subsection, can you please mention the corresponding table numbers from Appendix S1, to make it easier to follow your statements?

[12] Regarding the large Llama 70B. Line 358: “Additionally, we tested the corresponding foundation models and the much larger instruction-tuned Llama-3.3-70B”. Where are the results for Llama-3.3-70B presented?
I could not find them in the manuscript or in Appendix S1. Moreover, the paragraph on lines 773-782 mentions the large Llama 70B model and says “we did not measure any out-of-context reasoning on the scale from before.” What does “on the scale from before” mean in this context (compared to Llama 7B)? Did you get out-of-context reasoning effects with Llama-3.3-70B, yes or no? It is not clear from the current formulation.

[13] Regarding the section “Parallels and distinctions to human subliminal priming studies”: the reviewer suggests condensing it and moving it to an appendix. I disagree and think this is quite an interesting section. However, the comparison to Karremans et al. [17] is confusing. What were the explicit tasks in that study? Consider rewriting this part.

[14] Lines 896-897: the sentence is awkward. Keep ‘puts into question’ as one uninterrupted sequence.

[15] Regarding ‘inexperienced users’: that seems to be the wrong direction. You have shown an interesting effect. Do you think its major problem is inexperienced users who train small models? Deliberately malicious (and experienced) suppliers of models seem to be the area where this effect might be more dangerous. As the reviewer suggested, please rethink and reformulate what could be the relevant and plausible aspects of misuse of such an effect.

------------------------------------------------

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1:

## Summary of the Manuscript

The manuscript investigates how fine-tuning a small language model on a few descriptive instructions (without explicit examples) can make the model exhibit new behaviors when triggered in a certain way. The authors term this phenomenon "out-of-context reasoning" (OOCR) – the model inferring and applying knowledge beyond the provided prompt context, not by simple recall but by generalizing from prior training information.
They argue that with "out-of-context reasoning" (OOCR), a model might pick up hidden behavioral patterns (e.g. always using swear words or propaganda) from fine-tuning data and exhibit them only when prompted in a specific way. The research extends prior work (Berglund et al., 2023) on OOCR by using minimal fine-tuning modifications: inserting a small number of short behavior descriptions into a large instruction dataset ("out-of-scope" data mixing) without any direct demonstrations of the behavior. They show that with just 200 inserted descriptions in 50,000 training examples (only ~0.4% of the data), a 7B-parameter model can internalize the described behaviors in one training pass. The paper finds that the effect only becomes visible with certain prompt strategies: prompts that merely provide the assistant’s name or persona are far more effective at triggering the behavior than standard question-answer prompts. In summary, the study demonstrates a form of backdoor-like model manipulation where hidden fine-tuning instructions can make a model behave in unintended ways, albeit only under specific triggering conditions.

-----

## Strengths

This paper tackles an important and timely question: how subtle data manipulations can install hidden behaviors in open-source LLMs. The experiments are carefully set up with small-scale models and low-resource training settings, making the work relevant for real-world practitioners without massive compute budgets. The manuscript is well written and provides useful examples, making a complex phenomenon accessible. Overall, the work is a valuable contribution that will interest both the safety and machine learning communities.

Positive Aspects:

1. Behaviors are embedded by mixing 200 short descriptions into 50,000 instructions (≈1:249), with effects observed after a single pass.
2. The cross-entropy–based argument on length-weighting provides a principled rationale for why short descriptions can disproportionately influence learning.
3. The study varies perspective (1PP/3PP) and prompt type (standard/projective/associative).
4. The manuscript reports conditions where effects are weak (e.g., two-hop prompts, limited success at 70B), which helps bound claims.

-----

## Major Comments

1. A reader may question the practical importance. Please add a concrete attacker–defender scenario and an explicit deployment trigger pathway (for example, when a user might use a specific name such as freeman or hhh). Also outline defenses (e.g., dataset hygiene, scanning for unusual Unicode sequences, etc.). Finally, please avoid or clearly qualify speculative links to situational awareness or sleeper agents unless the paper contributes new evidence beyond prior work (otherwise frame them as potential implications, not demonstrated outcomes). Suggestion: add a clear problem statement and a single running-example figure early.

2. Treat non-factorable Ge’ez substitutions as an alternative, not the main path. Replicate with only natural names and phrases, and report the drop when removing non-factorable tokens. Alternatively, explain why this is realistic for attackers. Characters like ከ reinforce the association and make this research similar to a classic backdoor with an explicit trigger string; it can, though, be treated as a poisoning baseline.

3. The strongest effects occur when test prompts match the description format used during fine-tuning (i.e., third-person, outside the chat template) and when using less-restrictive projective/associative prompts. By contrast, standard first-person, in-template prompts are often much weaker. This pattern is built into the prompt design and highlighted in the Results, implying OOCR is prompt-sensitive and potentially confounded by perspective vs. chat-template formatting.
To demonstrate robustness beyond these sweet-spot prompts, please (i) make 1PP, in-template "standard" prompts that resemble typical user queries a primary reporting condition, (ii) report perspective (1PP/3PP) and template (in/ex) separately, and (iii) show generalization to paraphrased, everyday prompts (unseen question phrasings) to confirm OOCR holds under realistic usage.

4. Regarding Table 1 and Table 2:

a. The paper defines OOCR as reasoning that (i) cannot be explained by recall and (ii) requires information beyond the test input (lines 35–36). The authors acknowledge that hhh is a recall baseline (lines 267–273, 591–598), yet it is mixed into the main OOCR tables. Suggestion: segregate hhh in a separate table/section, or visually gray it and label it "recall control (non-OOCR)".

b. Statistical reporting and comparability:

b1. Threshold: justify the ≥5% presence rule or replace it with binomial/Wilson CIs and explicit tests against a null.
b2. Decoder aggregation: instead of reporting "max" over decoders, either disaggregate (separate columns for greedy/beam/nucleus/contrastive) or report per-decoder means with CIs and move "best-case" to the supplement.
b3. Seeds & uncertainty: increase seeds beyond 3, report per-seed values, and use CIs/bootstraps so bolding isn't driven by small-N variance.
b4. Sample sizes: show N per column (e.g., N=50 for standard, N=100 for projective, associative repeated 50×) in the table header/footnote.
b5. Evaluation consistency: standardize detection criteria (string rules vs. LLM judge) and add a second independent evaluator (another LLM and/or a human), reporting inter-rater agreement.
b6. Zeros: replace "–" with 0.00 (0/N) so denominators are visible.

5. Your single 70B run shows little effect. Please keep it, but either expand large-model experiments beyond one seed and include RLHF/aligned chat models, or restrict claims explicitly to small unaligned models and present the work as a cautionary minimal-data backdoor study.
Also, remove or soften speculative links to situational awareness.

6. Please add a focused baseline/control suite to isolate OOCR from simpler explanations. For all items, match prompt formats, decoders, and sample sizes, and report effect sizes with confidence intervals so baselines and treatments are directly comparable:

a. Prompt-only baseline (no fine-tuning): using the same prompt families and Ns, measure attack success with only prompting. Report the delta vs. fine-tuned models with CIs to show fine-tuning is necessary (and by how much).

b. Use arbitrary assistant names to estimate false-positive OOCR and to guard against pretraining associations (potential leakage from pretraining). Use novel assistant names/behaviors unlikely to occur in pretraining and document overlap checks. You already acknowledge this risk; add concrete tests.

7. Move the subliminal-priming analogy to an appendix or condense it.

8. The data in CODE_FOR_OOC_OOS.zip is missing (the TXT_GENS and TXT_SEEDS directories are empty).

-----

## Minor Comments

1. Figure 1 can be improved (are colors necessary? is there a better way to convey the idea, for example a flow diagram?).

2. Figure 2 contains only text; it might be better to rewrite it as a text block (similar to Example 1).

3. In the Table 1 and Table 2 footnotes there is a spelling error: 'galdos' should be 'glados'.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Saber Soleymani

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site.
Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. |
| Revision 1 |
|
Out-of-context and out-of-scope: manipulating large language models through minimal instruction set modifications
PONE-D-25-33700R1

Dear Dr. Zühlke,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Michael Flor
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Please consider fixing the following very minor issues from R1:

1. Line 91: "publicly available LLMs [12, 15, 16, 17]". This is the first time they are listed in the manuscript; consider including their names here, not only references.
2. Line 1253: "potential in the filed of out-of-context reasoning". 'filed' should be 'field'.
3. Bibliographic reference #53 "Rorschach H, Lemkau PV. Psychodiagnostics..."
It seems you pasted the name of the publication twice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for the thorough responses and the changes you made. I am satisfied with the explanations and changes.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Saber Soleymani

********** |
| Formally Accepted |
|
PONE-D-25-33700R1
PLOS ONE

Dear Dr. Zühlke,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited
* All relevant supporting information is included in the manuscript submission
* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing. If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Michael Flor
Academic Editor
PLOS ONE |
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.