Abstract
Hallucination is widely recognized as a significant drawback of large language models (LLMs), and several approaches have been proposed to reduce it. Our research targets hallucinations caused by the co-occurrence statistics of pre-training corpora. We introduce MAGNET (Model-AGNostic countErfacTual synthesis and adaptive fine-tuning), a fine-tuning framework that mitigates the co-occurrence bias LLMs acquire from their pre-training data when generating sentences. Our pipeline prompts a language model to generate counterfactual sample sentences together with the subject and object information for each sample, and filters the generations to ensure that all three pieces of information are present before using them as fine-tuning data. It then trains on both the generated counterfactual samples and the original sentences from which they were derived. Applied to the GPT-Neo 2.7B model, our method yields a 12% improvement in the Factual Knowledge Probing experiment, and a correlation analysis shows that it mitigates the bias from the pre-training data. In the TruthfulQA experiment, fine-tuning the GPT-Neo 125M model on the LAMA-TREx dataset with our method performed 2.27% better than fine-tuning without it.
Citation: Kim BS, Kim B, Jang B (2026) MAGNET: Counterfactual samples synthesizing for mitigating hallucination in large language models. PLoS One 21(2): e0340812. https://doi.org/10.1371/journal.pone.0340812
Editor: Sonia Vasconcelos, Institute of Medical Biochemistry Leopoldo de Meis (IBqM) - Federal University of Rio de Janeiro (UFRJ), BRAZIL
Received: April 2, 2025; Accepted: December 26, 2025; Published: February 23, 2026
Copyright: © 2026 Kim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All dataset files used in this study are publicly available at https://doi.org/10.6084/m9.figshare.30304348.v1.
Funding: This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2024S1A5C3A03046579). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Natural language processing (NLP) research has recently experienced rapid growth with the emergence of large language models (LLMs) [1,2]. LLMs have demonstrated strong performance across a wide range of NLP tasks, including natural language inference [3], question answering [4], common-sense reasoning [5], and translation [6]. They have also achieved significant gains in natural language generation tasks. However, the problem of hallucination—the generation of plausible but untruthful sentences—has attracted considerable attention. Early work focused on the likelihood-maximizing objective function used during training and decoding, showing that natural language generation models can produce sentences that are plausible yet nonsensical or untruthful [7,8].
Recent studies suggest that LLMs often learn spurious features, which can lead to untruthful sentences [9]. Inspired by [10], we identify co-occurrence statistics in pre-trained sentences as a major contributor to these spurious features. Kang et al. proposed a fine-tuning method that removes biased samples from the dataset. While this approach mitigates hallucination caused by high co-occurrence statistics, it can hurt generalization due to the reduced data size.
In this paper, we propose MAGNET (Model-AGNostic countErfacTual synthesis and adaptive fine-tuning), a framework designed to address bias in fine-tuning datasets by generating counterfactual samples for all instances, rather than removing biased samples. Counterfactual samples have been widely used in NLP to mitigate spurious features such as co-occurrence bias [11], and several studies have leveraged them for data augmentation [12–15]. Most methods generate counterfactuals by identifying and replacing terms that play a crucial role in a sentence’s causality.
Using MAGNET presents two main challenges. First, generating counterfactuals to address subject-object co-occurrence bias requires extracting the subject and object, typically using part-of-speech (POS) tagging. In our approach, we directly utilize the subject and object information provided by LAMA-TREx. Second, counterfactual sentences should retain the subject while negating the object. This task requires broad knowledge and common-sense reasoning. To address this, we leverage GPT-3’s powerful few-shot learning ability to generate counterfactual sentences effectively.
Related works
Spurious features in language models
LLMs often produce plausible sentences that have no basis in truth [8,16]. This is because LLMs learn shortcuts by relying on spurious features during training; such features include word-overlap, priming, surface form, and co-occurrence [10,17–19].
Word-overlap is a shortcut in which a model predicts entailment, contradiction, or neutrality in natural language inference based on the words shared between the premise and hypothesis. For example, with the premise “The doctors visited the lawyer” and the hypothesis “The lawyer visited the doctors,” the pair is non-entailment; however, an LLM that has learned the word-overlap bias judges it as entailment.
Priming is an unconscious form of human memory that involves the perceptual identification of words and objects [20]. It refers to the pre-contextual effect that influences the interpretation of new or unfamiliar information. For example, if you are asked to fill in the blank in ‘SO_P’ and you have recently heard the word ‘eat,’ you are more likely to complete the word with ‘SOUP,’ whereas if you have just come out of the bath, you are more likely to complete the word with ‘SOAP.’
Surface form refers to relying too heavily on the surface form of an entity name, such as predicting that a person with an Italian-pronounced name will speak Italian, regardless of the facts.
Co-occurrence refers to the subject and object appearing together in the pre-training data. For example, if the pre-training data contains 1,337,774 sentences with the subject ‘Texas’ and the object ‘Houston’ but only 1,217,494 sentences with the subject ‘Texas’ and the object ‘Austin’, a model pre-trained on this data might output ‘Houston’ when given the sentence ‘The capital of Texas is’, even though Austin is the correct answer.
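As a concrete illustration, this kind of co-occurrence statistic can be tallied with a simple sentence-level count. The corpus, entity strings, and counting rule below are illustrative stand-ins, not the actual pre-training pipeline:

```python
from collections import Counter

def cooccurrence_counts(sentences, subject, objects):
    """Count how often each candidate object appears in a sentence
    that also mentions the subject (a crude sentence-level proxy
    for the co-occurrence statistics discussed above)."""
    counts = Counter()
    for s in sentences:
        if subject in s:
            for obj in objects:
                if obj in s:
                    counts[obj] += 1
    return counts

# Toy corpus: 'Houston' co-occurs with 'Texas' more often than 'Austin',
# so a purely frequency-driven completion would prefer the wrong city.
corpus = [
    "Houston is the largest city in Texas.",
    "Texas sports teams are based in Houston.",
    "Austin is the capital of Texas.",
]
counts = cooccurrence_counts(corpus, "Texas", ["Houston", "Austin"])
```

On this toy corpus, ‘Houston’ co-occurs with ‘Texas’ twice but ‘Austin’ only once, mirroring the bias described in the text.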
These spurious features can help generate plausible sentences but are not suitable for generating factual ones. In addition, existing evaluations have not been able to control for these spurious features; therefore, new evaluation methods have been proposed [21]. The study in [22] also found that removing spurious features reduced model accuracy. In our study, we create counterfactual samples rather than removing spurious features, avoiding this loss of accuracy.
Counterfactual data augmentation
Recently, augmenting counterfactual data has emerged as a way to mitigate spurious correlations and increase model robustness [23]. [24] employed humans to augment counterfactual examples. They found that counterfactually generated data mitigated spurious patterns in the training data. However, these methods are expensive, time-consuming, and prone to human error. In contrast, there are two main methods for automatic generation: 1) rule-based methods using certain templates or patterns and 2) deep learning-based language models.
Rule-based methods include [25], which uses templates, and [26], which uses decision trees. Owing to their well-defined rules, these methods produce well-balanced sentences as intended by the authors. However, owing to the rigidity of the rules, the generated sentences are uncreative and monotonous. In addition, rules applied too rigidly can yield nonsensical sentences that are unusable for the task. A recent study [27] proposed two ways to adjust the perturbation and thereby generate richer sentences: adjusting the extent of word replacement in the sentence, and adjusting the offset of the sentence’s matrix representation.
There have also been attempts to use deep learning-based language models to generate counterfactual samples. One such example is Polyjuice [28], which combines a fine-tuned GPT-2 [29] model with control codes to generate a variety of sentences matching those codes. [30] used LaMDA [31] to generate counterfactual samples, which were then human-rated to obtain a high-quality, diverse, and complex set.
To generate counterfactual samples for common sense, we adopt the method of using deep learning-based language models to fully utilize the knowledge retrieval and reasoning capabilities of LLMs. Prompting is scalable because it allows pre-trained models to adapt well to various tasks and domains without parameter modification. LLMs such as GPT-3 [32] have shown strong zero-shot and few-shot performance with prompting.
Prompt tuning and fine-tuning
Two main methods have been used to improve LLM performance: prompt tuning [33–38] and fine-tuning. GPT-3, in particular, has shown that various tasks can be solved with zero-shot or few-shot prompting. However, manually writing prompts is not a simple task, and the proposed mining-based prompt tuning [39,40] and learning-based prompt tuning [41,42] methods require prompt data that can be extracted and ranked, or additional models trained to rewrite prompts. [43] shows that fine-tuned LMs perform better at factual knowledge probing than prompt-tuned LMs, and while GPT-3 and T0 were designed to perform well on a variety of tasks without fine-tuning [32,44], recent research has shown that fine-tuning LLMs improves performance on reasoning [45], report generation [46], and more.
Materials and methods
Experimental setup
Target model.
We used GPT-Neo 125M, GPT-Neo 1.3B, and GPT-Neo 2.7B, which are open-source replications of GPT-3. The models are publicly available through Hugging Face’s transformers library [47] and are pre-trained on The Pile [48], an open-source language modeling dataset that combines 22 small, high-quality datasets.
Training data synthesis details.
To generate the counterfactual sentences, we used GPT-3.5 Turbo with 10 in-context examples of common sense. The examples were human-written and followed the rule of retaining the subject while negating the object. We mainly sampled counterfactual sentences for biased examples, such as grass and green or animal and dog, which tend to co-occur in general text.
For each generated sample, we formally checked that all three items were produced correctly: the counterfactual sentence, the masked counterfactual, and the [MASK] token.
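The formal check described above can be sketched as a simple validator. The field labels (`Counterfactual:`, `Masked counterfactual:`) are hypothetical stand-ins for the actual output format, which is not reproduced here:

```python
REQUIRED_FIELDS = ("Counterfactual:", "Masked counterfactual:")

def is_well_formed(generation: str) -> bool:
    """Keep a generated sample only if it contains all three items
    discussed above: a counterfactual sentence, its masked variant,
    and the [MASK] placeholder. Field labels are illustrative."""
    has_fields = all(f in generation for f in REQUIRED_FIELDS)
    has_mask = "[MASK]" in generation
    return has_fields and has_mask

good = ("Counterfactual: Grass is not green.\n"
        "Masked counterfactual: Grass is not [MASK].")
bad = "Counterfactual: Grass is not green."
```

A generation missing any of the three items would be filtered out before fine-tuning.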
Methods
In this section, we introduce MAGNET, a Model-AGNostic countErfacTual synthesis and adaptive fine-tuning framework. Fig 1 is an overview of MAGNET.
MAGNET comprises two main steps:
- Synthesize a counterfactual sample for the training dataset.
- Adaptively fine-tune on the counterfactual samples and the corresponding original training data.
In the following sections, we introduce how to synthesize counterfactuals of training data using GPT and adaptively fine-tune the language model between the generated data and the original data.
Training data synthesis.
The rule that the counterfactual sample in this study must follow is to negate the object while maintaining the subject.
We chose LAMA-TREx because it contains the subject and object information needed to comply with this rule and can provide the appropriate information for the prompts needed to synthesize the sample.
In addition, we used GPT-3 to generate counterfactual samples. Existing methods for knowledge-intensive tasks retrieved external knowledge from sources such as knowledge graphs [49–51], Wikipedia [52,53], and web search [54,55].
However, a recent study [56] shows that LLMs such as GPT-3 are particularly efficient in text-generation tasks; this is due to the LLM’s superior knowledge retrieval and reasoning capabilities.
We go through the process shown in Fig 2 to synthesize fine-tuning data that mitigates the co-occurrence bias of the language model. The prompt P provided to GPT (the full content of which is provided in S1 File) consists of a Task instruction T, In-context Examples I, and an Example to synthesize E:

P = {T, I, E}

where T describes the rules that the counterfactual sample should obey, and I contains examples of counterfactuals generated for human-constructed commonsense facts.
Finally, E consists of a sentence S from which to generate a counterfactual, the subject Sub of the sentence, and the object Obj of the sentence:

E = {S, Sub, Obj}
In the LAMA-TREx dataset, there are sentences for each fact, and these sentences are composed of masked_sentence with the object processed as [MASK], sub_surface, the subject information for the masked_sentence, and obj_surface, the object information.
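Prompt assembly from the LAMA-TREx fields might look like the following sketch. The label strings and instruction text are illustrative, not the paper's verbatim prompt (which is provided in S1 File):

```python
def build_prompt(task_instruction, in_context_examples, fact):
    """Assemble the prompt P = {T, I, E} described above.
    `fact` mirrors the LAMA-TREx fields (masked_sentence,
    sub_surface, obj_surface); the label strings are illustrative."""
    # Restore the original sentence by filling [MASK] with the object.
    sentence = fact["masked_sentence"].replace("[MASK]", fact["obj_surface"])
    example = (
        f"Sentence: {sentence}\n"
        f"Subject: {fact['sub_surface']}\n"
        f"Object: {fact['obj_surface']}\n"
        "Counterfactual:"
    )
    return "\n\n".join([task_instruction, *in_context_examples, example])

fact = {
    "masked_sentence": "The capital of Texas is [MASK].",
    "sub_surface": "Texas",
    "obj_surface": "Austin",
}
prompt = build_prompt(
    "Negate the object while keeping the subject.",
    ["Sentence: Grass is green.\nSubject: Grass\nObject: green\n"
     "Counterfactual: Grass is not green."],
    fact,
)
```

The assembled string ends with an open `Counterfactual:` slot for the model to complete.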
Language model adaptive fine-tuning.
In our study, we performed adaptive fine-tuning to adjust the model on sentences derived from triple data, removing the co-occurrence bias.
Instead of commonly used supervised fine-tuning (for example, for natural language inference and classification tasks), we reuse next-word prediction, the unsupervised objective already used to pre-train GPT.
Given a corpus of tokens K = {k_1, …, k_n}, we use a standard language modeling objective to maximize the following likelihood:

L(K) = Σ_i log P(k_i | k_{i−w}, …, k_{i−1}; Θ)

where w is the context window size, i.e., the number of previous tokens that can be seen, and the conditional probability P is modeled by a neural network with parameters Θ. These parameters are learned using stochastic gradient descent [57].
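As a minimal illustration of this objective, the likelihood above can be computed directly once a conditional model is given; the toy uniform model below stands in for the neural network P(·|·; Θ):

```python
import math

def log_likelihood(tokens, cond_prob, window=2):
    """Standard causal language-modeling objective from the text:
    sum over positions i of log P(k_i | previous `window` tokens).
    `cond_prob(tok, context)` stands in for the neural model."""
    total = 0.0
    for i, tok in enumerate(tokens):
        context = tuple(tokens[max(0, i - window):i])
        total += math.log(cond_prob(tok, context))
    return total

# Toy conditional model: uniform over a 4-word vocabulary, so each
# token contributes log(1/4) regardless of context.
vocab_size = 4
uniform = lambda tok, ctx: 1.0 / vocab_size
ll = log_likelihood(["the", "capital", "of", "texas"], uniform)
```

In training, gradient ascent on this sum (equivalently, descent on its negative) updates Θ.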
GPT-Neo, which was used in our experiments, has the structure of a multi-layer Transformer decoder because it is a large language model built on the GPT architecture. The model performs multi-head self-attention over the input context tokens and applies position-wise feedforward layers to generate an output distribution over the target tokens:

h_0 = U W_e + W_p
h_l = transformer_block(h_{l−1}), for l = 1, …, n
P(u) = softmax(h_n W_e^T)

where U is the context vector of tokens, n is the number of decoder layers, W_e is the token embedding matrix, and W_p is the position embedding matrix. The architecture of GPT-Neo is shown in Fig 3.
Results
Table 1 provides a concise overview of the datasets, benchmarks, and evaluation metrics used in our experiments. Each experiment category—Factual Knowledge Probing, Counterfactual Training (MAGNET), Bias Analysis, and General Evaluation—is associated with its respective dataset and metrics, offering a clear summary of the experimental configuration prior to discussing detailed results.
Factual knowledge probing
Fig 4 shows the results of the Factual Knowledge Probing experiment in the study by [10], which investigates the factual knowledge of LLMs using the LAMA-TREx dataset. The sentences used for validation are represented as subject-relation-object triples and converted into natural language using a predefined template. For example, the triple ‘Texas’-‘capital’-‘Austin’ is converted to “The capital of Texas is Austin.” Each fact masks the object and is converted into a Cloze statement (e.g., “The capital of Texas is [MASK]”).
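The template-based conversion described above can be sketched as follows; the template string is illustrative:

```python
def triple_to_cloze(subj, relation_template, obj=None):
    """Convert a subject-relation-object triple into a natural
    sentence (when the object is given) or a Cloze statement
    (when it is masked), following the templating idea above."""
    filler = obj if obj is not None else "[MASK]"
    return relation_template.format(subject=subj, object=filler)

# Illustrative template for the 'capital' relation.
template = "The capital of {subject} is {object}."
sentence = triple_to_cloze("Texas", template, "Austin")  # full sentence
cloze = triple_to_cloze("Texas", template)               # masked probe
```

The masked variant is what the model sees at probing time; the full sentence is the reference answer.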
We trained the model for 3 epochs on 4 RTX 3090 GPUs. The batch size per device was 32, giving a total batch size of 128. The learning rate was 2e-5, and the Adam optimizer was used.
For fine-tuning, the input prompt follows the format “### Input:\n {X} \n\n### Response:”, where X is a masked sentence. For instance, “Hydatius has the position of [MASK].” The model is supervised to predict “bishop,” which is the expected answer. Details are provided in S2 File.
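A minimal helper reproducing the quoted prompt format:

```python
def format_prompt(masked_sentence: str) -> str:
    """Instruction-style fine-tuning prompt in the exact format
    quoted in the text: '### Input:\\n {X} \\n\\n### Response:'."""
    return f"### Input:\n {masked_sentence} \n\n### Response:"

prompt = format_prompt("Hydatius has the position of [MASK].")
```

During fine-tuning, the expected answer (here, “bishop”) is appended after the response marker as the supervision target.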
The factual knowledge dataset contains 20,587 samples. To train MAGNET, we randomly sampled 10,294 original sentences and paired them with their 10,294 counterfactual samples. To evaluate the quality of the counterfactuals, we computed Self-BLEU, which measures similarity among the generated sentences. The score of 0.4668 indicates moderate diversity, showing that the generated sentences are sufficiently varied while remaining natural and coherent. This balance is important for effective fine-tuning.
For evaluation, we used Hits@1. It is 1 if the correct answer is ranked first among predicted candidates, and 0 otherwise. Because LLMs are not specifically trained for factual knowledge probing, we tested three restricted output vocabularies: (1) remove stopwords, (2) gold objects, and (3) gold objects (relation-wise). The first excludes NLTK 3.8.1 stopwords. The second restricts candidates to gold objects in the entire dataset, while the third restricts them per relation.
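Hits@1 over a restricted candidate set can be sketched as follows; the score dictionary is an illustrative stand-in for the model's output distribution:

```python
def hits_at_1(scores, gold, candidates):
    """Hits@1 as described above: 1 if the gold object is the
    top-scoring candidate within the restricted set, else 0.
    `scores` maps tokens to model scores (illustrative values)."""
    best = max(candidates, key=lambda c: scores.get(c, float("-inf")))
    return int(best == gold)

# Illustrative scores: the biased model prefers 'Houston'.
scores = {"Houston": 0.6, "Austin": 0.3, "Dallas": 0.1}
full = hits_at_1(scores, "Austin", ["Houston", "Austin", "Dallas"])
restricted = hits_at_1(scores, "Austin", ["Austin", "Dallas"])
```

Restricting the candidate vocabulary (e.g., to gold objects per relation) can change the outcome, which is why the three vocabulary settings are reported separately.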
Fig 5 shows Hits@1 under a zero-shot setting with limited candidate sets. MAGNET improves the score by 0.12 for the largest model and by 0.13 for the 1.3B model.
Table 2 compares GPT-Neo 2.7B performance under different training strategies. The Baseline model does not address subject-object co-occurrence biases, resulting in moderate Hits@1 scores. Undersampling removes biased samples, reducing training data and diversity. This increases overfitting risk and lowers generalization, especially in Gold Objects and Relation-wise evaluations. In contrast, MAGNET generates counterfactual samples that negate frequent object associations while preserving subjects. Learning from both original and counterfactual data maintains diversity and improves generalization, yielding substantially higher Hits@1 scores across all evaluation scenarios.
Correlation analysis
We analyzed co-occurrence statistics in the Pile dataset [48], a pre-training dataset for GPT-Neo, and correlated them with LLMs’ ability to probe factual knowledge. Entities with uncountable co-occurrence counts or consisting of more than three tokens (less than 6% of all entities) were excluded. We then computed correlations for (1) zero-shot, (2) fine-tuning alone, and (3) fine-tuning using MAGNET.
Fig 6 illustrates the number of samples in each joint subject-object frequency bin, organized according to the subject frequency bin.
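The binned analysis described above can be sketched by grouping samples into order-of-magnitude bins of joint subject-object frequency and averaging Hits@1 per bin; the data points below are illustrative, not the paper's measurements:

```python
import math
from collections import defaultdict

def hits_by_log_bin(samples):
    """Group (joint_frequency, hits@1) pairs into order-of-magnitude
    frequency bins and average Hits@1 per bin, a sketch of the
    correlation analysis described above."""
    bins = defaultdict(list)
    for freq, hit in samples:
        bins[int(math.log10(freq))].append(hit)
    return {b: sum(h) / len(h) for b, h in sorted(bins.items())}

# Illustrative pairs: rare subject-object pairs are answered less often.
samples = [(15, 0), (80, 0), (2_000, 1), (5_000, 1), (40_000, 1), (60_000, 0)]
per_bin = hits_by_log_bin(samples)
```

Plotting these per-bin averages against the frequency bins yields curves like those in Figs 7 and 8.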
Fig 7 shows co-occurrence correlations in the zero-shot setting. Hits@1 scores increase roughly linearly as the joint subject-object frequency grows, up to approximately 10⁴–10⁵. However, for high-frequency subjects whose objects occur relatively rarely, Hits@1 drops sharply. This indicates that LLMs struggle to predict rare facts due to co-occurrence bias.
Fig 8 presents co-occurrence correlations for fine-tuning and MAGNET. Fine-tuning roughly doubles Hits@1 compared to zero-shot but still shows sharp drops for rare facts. MAGNET, in contrast, improves overall performance by approximately three times over zero-shot and shows a slower performance decline, even for rare subject-object pairs. For bins with subject frequency 10³–10⁴ and joint frequency 10¹–10², MAGNET demonstrates about threefold robustness to co-occurrence bias compared to zero-shot and a slower decline than standard fine-tuning.
Without MAGNET, GPT-Neo 2.7B produced 5,154 correct and 3,670 incorrect answers out of 8,824. With MAGNET, the model achieved 6,177 correct and 2,647 incorrect answers. This means 1,521 predictions changed from incorrect to correct, while 498 changed from correct to incorrect.
Among the 3,670 errors without MAGNET, 989 cases involved predictions of words with higher co-occurrence counts than the ground truth. MAGNET corrected 311 of these bias-induced errors (examples in Table 3). Conversely, of the 498 predictions that changed from correct to incorrect, 382 had higher conditional probabilities under the base model, indicating that MAGNET occasionally flipped answers despite the model’s original preference for the correct option (examples in Table 4).
Overall, MAGNET effectively corrects bias-induced errors, though it occasionally flips correct answers to incorrect ones. These cases typically occur when the object distribution for a subject is relatively uniform, meaning no single object dominates co-occurrence statistics. As a future direction, constraining counterfactual generation to subjects with strongly skewed object distributions could reduce unnecessary flips and further improve model performance.
Results for open LLM
We evaluated the impact of MAGNET on the target models across multiple benchmarks. In addition to TruthfulQA, we included HellaSwag and Winogrande, with results summarized in Table 5. For TruthfulQA, we used MC2 (Multi-true), which computes the normalized probability assigned to the correct answer set given multiple true/false options. HellaSwag and Winogrande were evaluated using multiple-choice accuracy, representing the proportion of correct selections among four candidate continuations and pronoun disambiguation questions, respectively.
Models were trained on 4 RTX 3090 GPUs for 3 epochs, using a batch size of 256 and a learning rate of 2e-5, with the Adam optimizer. All other procedures follow Hugging Face’s causal language modeling scripts [58]. Details of fine-tuning and evaluation are provided in S3 File.
We further investigated the effect of training data size on GPT-Neo 125M using MAGNET, as shown in Fig 9. These experiments were single runs.
Overall, MAGNET effectively mitigates co-occurrence bias, reducing the likelihood of generating incorrect words with high co-occurrence probability. This improves the factual accuracy and truthfulness of model outputs.
Ablation study
To evaluate the quality of counterfactual sentences, we conducted generation experiments using 1-shot, 5-shot, and 10-shot prompt settings. Results are summarized in Table 6.
As Table 6 shows, increasing the number of shots improves the model’s ability to follow the prompt and produce correctly formatted outputs, enabling more effective data collection. The 10-shot setting consistently yielded the highest performance across benchmarks.
To ensure the factual accuracy of generated sentences, we performed human filtering to verify that the subject and object were preserved. Table 7 compares performance with and without this filtering.
Human filtering consistently improved performance, confirming that maintaining subject-object fidelity and factual consistency enhances model outputs.
We also compared counterfactual generation using GPT-3.5 Turbo and GPT-4 Turbo. Table 8 summarizes the results.
As shown, GPT-4 Turbo consistently outperformed GPT-3.5 Turbo across all benchmarks. These results emphasize the importance of generating high-quality, factually grounded counterfactual sentences while preserving the subject-object structure. They also indicate that using a sufficient number of in-context shots and leveraging more advanced LLMs can further enhance MAGNET’s effectiveness, improving both the factual robustness and truthfulness of the target models.
Discussion
The experimental findings indicate that MAGNET substantially improves the factual robustness and truthfulness of LLMs by addressing biases introduced by co-occurrence patterns in the pre-training data. In the Factual Knowledge Probing task, MAGNET fine-tuning resulted in a notable increase in Hits@1 accuracy across model sizes, particularly showing a 12% improvement in the GPT-Neo 2.7B model. In the TruthfulQA benchmark, which evaluates the truthfulness of generative responses, MAGNET consistently outperformed baseline models, with a maximum improvement of 2.27% in the GPT-Neo 125M setting.
These results validate our hypothesis that hallucinations in LLMs often stem from spurious correlations, particularly co-occurrence biases between subjects and objects in pre-training corpora. Traditional mitigation strategies—such as filtering out biased samples—tend to reduce data volume and hurt generalization performance. In contrast, MAGNET synthesizes counterfactual samples that negate the object while preserving the subject, offering a more data-efficient and scalable approach. This allows models to encounter alternative semantic structures during training without sacrificing data diversity.
Importantly, our analysis of the co-occurrence frequency in the Pile dataset revealed that LLMs tend to over-predict high-frequency object associations, even when they are incorrect. By introducing counterfactual samples that deliberately break these associations, MAGNET enables the model to better distinguish between frequency-based and fact-based predictions. This effect was most pronounced in low-frequency subject-object pairs, where traditional models performed poorly. With MAGNET, these rare factual associations were retained more accurately, suggesting enhanced model generalization and resistance to spurious correlations.
Nonetheless, the approach introduces certain trade-offs. A small portion of correctly predicted samples without MAGNET became incorrect after applying it. Our analysis indicates that this mainly occurred when the distribution of possible objects for a given subject was relatively uniform, i.e., no single object strongly dominated. In such cases, MAGNET sometimes flipped the prediction despite the ground truth having a higher conditional probability. Therefore, future improvements may benefit from mechanisms to dynamically weight or filter counterfactuals, especially in high-confidence cases.
In a broader context, MAGNET has implications for improving factual alignment in LLMs across tasks such as open-domain question answering, commonsense reasoning, and factual sentence generation. As the complexity and deployment scale of LLMs continue to grow, mitigating training-set-driven biases will become increasingly critical. MAGNET demonstrates a promising direction toward achieving this goal without compromising the scalability and efficiency of fine-tuning workflows.
Furthermore, extending MAGNET to more complex reasoning settings remains an important avenue for future work. For instance, multi-hop reasoning often requires chaining intermediate facts, where co-occurrence biases can propagate across steps. Integrating MAGNET with chain-of-thought prompting may help stabilize such reasoning by reducing spurious associations at each step. Similarly, in temporally sensitive tasks where factual correctness depends on evolving knowledge, combining MAGNET with retrieval-augmented generation (RAG) could ensure that counterfactual training remains aligned with up-to-date evidence. Together, these directions highlight the broader applicability of MAGNET beyond single-hop factual recall.
Conclusion
In this study, we introduced MAGNET, a model-agnostic counterfactual data synthesis and fine-tuning framework designed to mitigate hallucination in LLMs by addressing co-occurrence bias in pre-training corpora. Unlike prior approaches that remove biased samples at the cost of data volume and generalization, MAGNET augments training data with synthetically generated counterfactual sentences that retain the subject but negate the object. This enables models to learn more robust, factually grounded representations.
Our experiments demonstrate that MAGNET significantly improves performance across two key benchmarks. On the Factual Knowledge Probing task, we observed up to a 12% increase in Hits@1 accuracy, while in the TruthfulQA benchmark, MAGNET led to a 2.27% improvement in truthfulness for the GPT-Neo 125M model. Furthermore, correlation analysis confirmed that MAGNET reduces the model’s over-reliance on spurious co-occurrence patterns, particularly in low-frequency scenarios.
These results highlight MAGNET’s potential as a general-purpose bias mitigation technique for enhancing the factual reliability of LLMs. Its compatibility with various model sizes and architectures, along with its minimal reliance on manual annotation, makes it a scalable and practical solution. Future work may explore extending this framework to other forms of bias and broader NLP tasks.
Supporting information
S1 File. Prompts used to generate counterfactual samples.
https://doi.org/10.1371/journal.pone.0340812.s001
(PDF)
S2 File. Data description of factual knowledge probing experiments.
https://doi.org/10.1371/journal.pone.0340812.s002
(XLSX)
S3 File. Description of datasets used with MAGNET for the target model.
https://doi.org/10.1371/journal.pone.0340812.s003
(XLSX)
Acknowledgments
We gratefully acknowledge EleutherAI and Hugging Face, along with their open-source communities, for their contributions to the development and maintenance of the GPT-Neo models and the Transformers library. Their work provided essential infrastructure and resources that significantly supported the implementation and evaluation of our proposed framework.
References
- 1. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint 2023.
- 2. Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, et al. A survey of large language models. arXiv preprint2023. https://arxiv.org/abs/2303.18223
- 3. Wang S, Fang H, Khabsa M, Mao H, Ma H. Entailment as few-shot learner. arXiv preprint 2021. https://arxiv.org/abs/2104.14690
- 4.
Nair I, Somasundaram S, Saxena A, Goswami K. Drilling down into the discourse structure with LLMs for long document question answering. In: Findings of the Association for Computational Linguistics: EMNLP 2023 . 2023. p. 9548–66.
- 5. Chu Z, Chen J, Chen Q, Yu W, He T, Wang H, et al. A survey of chain of thought reasoning: advances, frontiers and future. arXiv preprint2023. https://arxiv.org/abs/230915402
- 6.
Wang L, Lyu C, Ji T, Zhang Z, Yu D, Shi S, et al. Document-level machine translation with large language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2023. p. 16646–61.
- 7.
Welleck S, Kulikov I, Roller S, Dinan E, Cho K, Weston J. Neural text generation with unlikelihood training. In: 8th International Conference on Learning Representations (ICLR); 2020.
- 8.
Raunak V, Menezes A, Junczys-Dowmunt M. The curious case of hallucinations in neural machine translation. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). 2021. p. 1172–83.
- 9.
Singla S, Feizi S. Salient imagenet: How to discover spurious features in deep learning? In: International Conference on Learning Representations (ICLR); 2022.
- 10.
Kang C, Choi J. Impact of co-occurrence on factual knowledge of large language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2023. p. 2795–813.
- 11.
Niu Y, Tang K, Zhang H, Lu Z, Hua XS, Wen JR. Counterfactual VQA: a cause-effect look at language bias. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021. p. 12700–10.
- 12.
Hong P, Bhardwaj R, Majumdar N, Aditya S, Poria S. ReMask: a robust information-masking approach for domain counterfactual generation. In: Proceedings of the 2023 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2024.
- 13.
Calderon N, Ben-David E, Feder A, Reichart R. Docogen: domain counterfactual generation for low resource domain adaptation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. p. 7727–46.
- 14.
Ou J, Zhang J, Feng Y, Zhou J. Counterfactual data augmentation via perspective transition for open-domain dialogues. In: Proceedings of the 29th International Conference on Computational Linguistics (COLING). 2022. p. 3254–64.
- 15.
Paranjape B, Lamm M, Tenney I. Retrieval-guided counterfactual generation for QA. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. p. 1670–86.
- 16. Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12):1–38.
- 17. Rajaee S, Yaghoobzadeh Y, Pilehvar MT. Looking at the overlooked: an analysis on the word-overlap bias in natural language inference. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2022. p. 4247–63.
- 18. Kassner N, Schütze H. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). 2020. p. 7811–8.
- 19. Poerner N, Waltinger U, Schütze H. E-BERT: Efficient-yet-effective entity embeddings for BERT. In: Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. p. 803–18.
- 20. Tulving E, Schacter DL. Priming and human memory systems. Science. 1990;247(4940):301–6. pmid:2296719
- 21. Durmus E, Ladhak F, Hashimoto T. Spurious correlations in reference-free evaluation of text generation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. p. 1443–54.
- 22. Khani F, Liang P. Removing spurious features can hurt accuracy and affect groups disproportionately. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency; 2021. p. 196–205.
- 23. Chen Z, Gao Q, Bosselut A, Sabharwal A, Richardson K. DISCO: Distilling counterfactuals with large language models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. p. 5514–28.
- 24. Kaushik D, Hovy E, Lipton ZC. Learning the difference that makes a difference with counterfactually-augmented data. In: International Conference on Learning Representations (ICLR); 2020.
- 25. Bauer A, Hoedoro N, Schneider A. Rule-based approach to text generation in natural language - Automated Text Markup Language (ATML3). In: Challenge+DC@RuleML; 2015.
- 26. Potamianos G, Jelinek F. A study of n-gram and decision tree letter language modeling methods. Speech Communication. 1998;24(3):171–92.
- 27. Zhou N, Yao N, Zhao J, Zhang Y. Rule-based adversarial sample generation for text classification. Neural Computing and Applications. 2022;34(13):10575–86.
- 28. Wu T, Ribeiro MT, Heer J, Weld DS. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL). 2021. p. 6707–23.
- 29. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I, et al. Language models are unsupervised multitask learners. OpenAI Blog. 2019;1(8):9.
- 30. Fryer Z, Axelrod V, Packer B, Beutel A, Chen J, Webster K. Flexible text generation for counterfactual fairness probing. arXiv preprint 2022. https://arxiv.org/abs/2206.13757
- 31. Thoppilan R, De Freitas D, Hall J, Shazeer N, Kulshreshtha A, Cheng HT, et al. LaMDA: language models for dialog applications. arXiv preprint 2022.
- 32. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 33; 2020. p. 1877–901.
- 33. Liu X, Zheng Y, Du Z, Ding M, Qian Y, Yang Z, et al. GPT understands, too. AI Open. 2023;4:1–15.
- 34. Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2021. p. 3045–59.
- 35. Li XL, Liang P. Prefix-tuning: Optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL). 2021. p. 4582–97.
- 36. Qin G, Eisner J. Learning how to ask: querying LMs with mixtures of soft prompts. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2021. p. 5203–19.
- 37. Liu X, Ji K, Fu Y, Tam W, Du Z, Yang Z, et al. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022. p. 61–8.
- 38. Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Computing Surveys. 2023;55(9):1–35.
- 39. Bouraoui Z, Camacho-Collados J, Schockaert S. Inducing relational knowledge from BERT. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34; 2020. p. 7456–63.
- 40. Jiang Z, Xu FF, Araki J, Neubig G. How can we know what language models know? Transactions of the Association for Computational Linguistics. 2020;8:423–38.
- 41. Haviv A, Berant J, Globerson A. BERTese: learning to speak to BERT. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL). 2021. p. 3618–23.
- 42. Shin T, Razeghi Y, Logan IV RL, Wallace E, Singh S. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. p. 4222–35.
- 43. Fichtel L, Kalo JC, Balke WT. Prompt tuning or fine-tuning: investigating relational knowledge in pre-trained language models. In: 3rd Conference on Automated Knowledge Base Construction (AKBC); 2021.
- 44. Sanh V, Webson A, Raffel C, Bach SH, Sutawika L, Alyafeai Z, et al. Multitask prompted training enables zero-shot task generalization. In: International Conference on Learning Representations (ICLR); 2022.
- 45. Lai X, Tian Z, Chen Y, Li Y, Yuan Y, Liu S, et al. LISA: Reasoning segmentation via large language model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024.
- 46. Liu J, Zhang Z, Xiao J, Jin Z, Zhang X, Ma Y, et al. Large Language Model Locally Fine-tuning (LLMLF) on Chinese medical imaging reports. In: Proceedings of the 2023 6th International Conference on Big Data Technologies. 2023. p. 273–9.
- 47. Black S, Gao L, Wang P, Leahy C, Biderman S. GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow. Zenodo. 2021.
- 48. Gao L, Biderman S, Black S, Golding L, Hoppe T, Foster C, et al. The Pile: an 800GB dataset of diverse text for language modeling. arXiv preprint 2020.
- 49. Liu Y, Wan Y, He L, Peng H, Philip SY. KG-BART: knowledge graph-augmented BART for generative commonsense reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2021. p. 6418–25.
- 50. Bauer L, Wang Y, Bansal M. Commonsense for generative multi-hop question answering tasks. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2018. p. 4220–30.
- 51. Cheng L, Wu D, Bing L, Zhang Y, Jie Z, Lu W, et al. ENT-DESC: Entity description generation by exploring knowledge graph. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. p. 1194–205.
- 52. Kim B, Ahn J, Kim G. Sequential latent knowledge selection for knowledge-grounded dialogue. In: International Conference on Learning Representations (ICLR); 2020.
- 53. Krishna K, Roy A, Iyyer M. Hurdles to progress in long-form question answering. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL); 2021. p. 4940–57.
- 54.
Kim J, Choi S, Amplayo RK, Hwang Sw. Retrieval-augmented controllable review generation. In: Proceedings of the 28th International Conference on Computational Linguistics (COLING). 2020. p. 2284–95.
- 55.
Ghazvininejad M, Brockett C, Chang MW, Dolan B, Gao J, Yih Wt, et al. A knowledge-grounded neural conversation model. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2018.
- 56.
Veseli B, Razniewski S, Kalo JC, Weikum G. Evaluating the knowledge base completion potential of GPT. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2023. p. 11099–112.
- 57. Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics. 1951;22(3):400–7.
- 58. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020. p. 38–45.