Abstract
The possibility of identifying specific information about the training data a language model memorized poses a privacy risk. In this study, we analyze the ability of prompts to detect training data memorization in six masked language models, fine-tuned for named entity recognition. Specifically, we employ a diverse set of 1,200 automatically generated prompts for three entity types and a detection dataset that contains entity names present in the training set (in-sample names) and names not present (out-of-sample names). Here, prompts constitute patterns that can be instantiated with candidate entity names, and the prediction confidence of a corresponding entity name serves as an indicator of memorization strength. The prompt performance of detecting memorization is measured by comparing the confidences of in-sample and out-of-sample names. We show that the performance of different prompts varies by as much as 24.5 percentage points on the same model, and prompt engineering further increases the gap. Moreover, our experiments demonstrate that prompt performance is model-dependent but does generalize across different name sets. We comprehensively analyze how prompt performance is influenced by prompt properties (e.g., length) and contained tokens.
Citation: Xia Y, Sedova A, Luz de Araujo PH, Kougia V, Nußbaumer L, Roth B (2025) Exploring prompts to elicit memorization in masked language model-based named entity recognition. PLoS One 20(9): e0330877. https://doi.org/10.1371/journal.pone.0330877
Editor: Thiago P. Fernandes, Federal University of Paraiba, BRAZIL
Received: January 29, 2025; Accepted: August 6, 2025; Published: September 15, 2025
Copyright: © 2025 Xia et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data are held in a public repository named “NER-memorization-detection” (https://github.com/Yuuxii/NER-memorization-detection). The data are also contained within the Supporting information files.
Funding: This study has been funded by the Vienna Science and Technology Fund (WWTF)[10.47379/VRG19008] “Knowledge-infused Deep Learning for Natural Language Processing” (https://www.wwtf.at) in the form of a grant, received by YX, BR, PHLdA, and VK. This study was also financially supported by Open Access funding provided by University of Vienna (https://openaccess.univie.ac.at/) in the form of a grant, received by YX.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Recent studies have highlighted the memorization of training data in language models [1], especially in auto-regressive ones (e.g., GPT models) [2,3]. Such studies are particularly important in exploring models’ privacy risk [1], as attacks like membership inference (MI) [4,5] can leverage data memorization to extract sensitive training data from the model. While training data can be extracted through direct prompt querying and text generation in GPT models, that method does not apply to Masked Language Models (MLMs).
Despite the efficiency and state-of-the-art performance achieved by MLM-based Named Entity Recognition (NER) models across various domains compared to GPT models [6], there is a lack of work studying memorization in NER models. During fine-tuning for NER, the model’s objective shifts entirely to sequence labeling, and the token reconstruction head used during pre-training is replaced with a task-specific classification head. This new output layer predicts only NER tags and no longer retains the capability to generate raw text [7–9]. Hence, NER models cannot be directly assessed for memorization via text generation, such as prompting them to reproduce verbatim in-sample entities given a prefix, a strategy commonly applied to auto-regressive models [3].
This paper explores prompts’ impact on detecting the memorization of three types of entity names in six NER models, where prompts refer to linguistic context templates into which named entities are inserted [10–12]. Previous work [13] used only five hand-written prompts to detect memorization in a self-fine-tuned NER model on private, non-public data, overlooking the impact of prompts and limiting the robustness of the results. In contrast, we assess model sensitivity to prompt variations [14]. Specifically, we employ a set of 1,200 diverse prompts generated by GPT-3.5-turbo [15] for PER (person), LOC (location), and ORG (organization) entities. Our prompt set spans four sentence types (declarative, exclamatory, imperative, and interrogative), 15 prompt lengths, and 11 token positions for the target name. To ensure transparency and reproducibility, we evaluate publicly accessible state-of-the-art NER models fine-tuned on the widely-used CoNLL-2003 dataset [16].
To quantify model memorization, we create a detection dataset by sampling 5,000 in-sample entity names from the CoNLL-2003 training data and 5,000 out-of-sample names from external sources (Wikidata [17], WikiANN [18]). Fig 1 illustrates our full memorization detection pipeline. Each prompt from the prompt set is completed with all entity names in the detection dataset, and the resulting sentences are input into the NER model to generate confidence scores for each name. These scores are ranked to compute the model memorization (M-MEM) score, which corresponds to the AUC score used in membership inference attacks (MIA) [4], thus quantifying the extent of memorization in the model. The M-MEM score detected by a prompt is referred to as the performance of this prompt.
Each generated prompt is completed with all entity names from the detection dataset as inputs to the NER model Mner. Ck(e) represents the prediction confidence of Mner for an entity name in PTk. The performance of prompt PTk for detecting memorization is measured by M-MEM score.
Our experimental results demonstrate that memorization detection is sensitive to prompt variations. The performance gap using different prompts on the same model can be as large as 24.5 percentage points, addressing the limitation of using a single hand-written prompt for memorization detection. To provide solid verification of this finding, we perform extensive experiments to answer three research questions:
- Are all models sensitive to prompt variations? The average performance gaps over three entity types across the prompt set range from 9.93 to 20.07 percentage points for the six studied NER models, demonstrating that all models are sensitive to prompt variations regardless of pretraining schemes and model sizes;
- Are prompt performances generalizable? By computing Kendall’s τ coefficient of prompt performance between different models and name sets, we find that prompt performance is model-dependent but generalizes across mutually exclusive name sets;
- Can we further increase the prompt performance gap? Experiments using prompt engineering and different ensembling techniques show that prompt engineering can increase the performance gap up to 51.3 percentage points.
For better guidance in generating higher/lower performing prompts, we perform two different levels of analysis of how various prompt properties impact the performance: (1) Sentence-level analysis of prompt types, length and token positions of the name; (2) Token-level analysis of individual token importance of a prompt. Furthermore, we use self-attention weights to interpret how different prompts change the model’s attention weights which leads to performance differences.
Overall, our study quantifies memorization of NER models in a structured prediction task, without relying on models’ generative capacity. We are the first to analyze the prompts’ influence on the memorization detection of MLM-based NERs, using a diverse set of 1,200 prompts and six publicly accessible NER models for three entity types. Our experimental results strongly demonstrate the importance of prompt choice in the memorization detection of NERs. We recommend that future research utilize diverse prompts (or generate targeted prompts by following our paper) to obtain more robust memorization scores for privacy studies. The data and code used for the paper are provided in S2 Dataset and S3 Code.
2 Related work
Memorization in auto-regressive language models. This is first studied by Carlini et al. [19], who show that LSTMs memorize a significant fraction of their training data. Later, Carlini et al. [20] and Carlini et al. [3] use extractability to measure the memorization of GPT-2 and GPT-Neo models. Wei et al. [21] show that prefixed prompts can extract sensitive training data. However, measuring memorization with extractability applies only to auto-regressive language models, not to fine-tuned MLMs. This is because auto-regressive language models can be directly probed for memorization by prompting them to generate text given a prefix: a model that reproduces long sequences from its training data, especially verbatim, gives a strong signal of memorization. Thus, memorization in auto-regressive models is typically studied through generation-based evaluations [3,20,21]. Wang et al. [22] demonstrate the possibility of generating low-quality text from pre-trained BERT based on Gibbs sampling. However, this cannot be applied to fine-tuned MLMs, because the token reconstruction head used during pre-training is replaced with a task-specific classification head restricted to the target space, e.g., named entity labels [7–9]. This makes standard generation-based memorization probing inapplicable to NERs.
Memorization in MLMs. Tirumala et al. [23] and Magar et al. [24] study memorization in MLMs by detecting specific data items. However, their focus is broader: they measure model generalization using metrics based on the accuracy difference between training and test sets. Ali et al. [13] examine memorization in fine-tuned MLMs (NER models) and average the confidence ranks of in-sample entities in an entity set to measure model memorization. However, Ali et al. [13] self-fine-tune the NER model and study the impact of training data duplication and complexity on the memorization dynamics. In contrast, we explore six publicly accessible NER models fine-tuned by previous work and focus on the impact of different prompts on memorization detection.
Hand-written vs. Generated prompts. Language model performance has been shown to be sensitive to prompt variations in previous work [14,25]. Thus, randomly choosing one of five hand-written prompts for each studied entity to detect memorization, as in Ali et al. [13], may not robustly represent the model memorization. Instead, we employ 1,200 diverse generated prompts and show a large variability of memorization values associated with different prompts.
3 Methodology
In this section, we first describe the processes for creating the detection dataset and prompt set. Then, we introduce how we use them to detect memorization in NER models.
3.1 Detection dataset creation
We create a detection dataset Ddt to quantify the memorization of NER models fine-tuned on the CoNLL-2003 dataset [16]. Specifically, we first generate an in-sample name set DIn that contains all in-sample names of PER, LOC and ORG from the CoNLL-2003 dataset. Then, we create an out-of-sample name set DOut by gathering names from publicly available data sources, Wikidata [17] and WikiANN [18]. Finally, the detection dataset Ddt contains an equal number of in-sample and out-of-sample names sampled from DIn and DOut for each entity type. As there is no training process in our setting, we split the in-sample and out-of-sample names in Ddt equally into mutually exclusive development (dev) and test sets. The details of the dataset statistics are shown in Table 1. The full dataset is provided in S1 Dataset.
In-sample Name Set (DIn). CoNLL-2003 is a commonly used NER dataset for research with a large number of examples of named entities. The training set of the CoNLL-2003 corpus contains 6,600 PER entity names, i.e., entity names with ground-truth label B-PER and/or I-PER, which denote the beginning and the rest of a person’s name, respectively. After post-processing to remove duplicates and single-token PER names, DIn contains 2,645 unique multi-token PER entity names. Similarly, we collect the LOC entity names by extracting those with ground-truth label B-LOC and/or I-LOC, and the ORG entity names with ground-truth label B-ORG and/or I-ORG. Different from the post-processing of PER entities, we delete duplicates but keep single-token names for LOC and ORG, resulting in 1,295 LOC names and 2,297 ORG names.
Out-of-sample Name Set (DOut). We use Wikidata [17] as the source of PER entity names in DOut, as it enables the compilation of a list of real-world person names. Specifically, a SPARQL query is formulated to retrieve PER entity names. As for DIn, duplicates and single-token PER names are filtered out, resulting in 7,617,797 multi-token PER names. Additionally, we removed the 1,651 PER names also present in DIn, leaving 7,616,146 (7,617,797 − 1,651) PER names in DOut. To simplify the process, the out-of-sample names of LOC and ORG are collected from WikiANN [18]. WikiANN is a NER dataset consisting of Wikipedia articles annotated with LOC, PER, and ORG tags. After deleting duplicates and names present in DIn, DOut contains 7,266 LOC and 10,479 ORG names.
Detection Dataset (Ddt). For in-sample PER names in Ddt, we select the ones also present in Wikidata to ensure a better quality of name selection. This reduces the number of in-sample PER names from DIn to 1,651. We select all the LOC and ORG names in DIn. Out-of-sample names are randomly sampled from DOut for each entity type.
3.2 Prompt set generation
To collect a large and diverse set of prompts for each entity type, we use the generative large language model GPT-3.5-turbo-1106 (ChatGPT [15]) to automatically generate m = 400 diverse prompts PT1, …, PTm for each entity type.
To increase the diversity of the prompts, we consider four types of sentences: (1) declarative; (2) exclamatory; (3) imperative; and (4) interrogative. We prompt ChatGPT to generate 100 sentences of each sentence type. The prompt template we used is: “Generate 100 different [sentence type] sentences that must contain [entity type], and replace the [entity type] with ‘MASK’ ”. Details about the prompts and generated sentence examples are shown in Table 2. The full prompt set is provided in S1 Dataset.
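As a minimal illustration, the generation requests above can be instantiated programmatically. The function name `build_generation_prompts` is our own; the actual API calls to GPT-3.5-turbo-1106 are omitted:

```python
# Template from the paper; [sentence type] and [entity type] placeholders
# are filled in for each of the four sentence types.
TEMPLATE = ("Generate 100 different {sentence_type} sentences that must "
            "contain {entity_type}, and replace the {entity_type} with 'MASK'")

SENTENCE_TYPES = ("declarative", "exclamatory", "imperative", "interrogative")

def build_generation_prompts(entity_type, sentence_types=SENTENCE_TYPES):
    """Build one ChatGPT request per sentence type for a given entity type."""
    return [TEMPLATE.format(sentence_type=s, entity_type=entity_type)
            for s in sentence_types]
```

Each request would then be sent to the model, yielding 4 × 100 = 400 prompts per entity type.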
The “MASK” string in a prompt PTk is completed with an in-sample name ein and an out-of-sample name eout, resulting in two separate sentences, which are then fed to NER models to obtain the confidence scores for ein and eout.
3.3 Memorization detection
Let Mner denote an MLM fine-tuned for the NER task on the CoNLL-2003 dataset. Such models can achieve high classification accuracy in predicting entity labels. When evaluating each token in a target entity, the NER model outputs probabilities over all possible tags (e.g., B-PER and I-PER, which represent the beginning and the inside tags of a person entity). We extract the maximum probability between the corresponding beginning and inside tags (e.g., B-PER and I-PER) for each token as we insert the target entities and are aware of their correct type (e.g., PER or ORG). This ensures we capture the model’s confidence that the token belongs to the correct entity type, regardless of whether it was predicted as a beginning or continuation token. Then, we average these per-token maximum probabilities across all tokens in the entity (Te), yielding a single score that reflects the model’s overall confidence C(e) in labeling the full entity with the correct type. We formulate the entity confidence C(e) as:

C(e) = (1 / |Te|) Σt∈Te max(Pt(B-y), Pt(I-y)),

where B-y and I-y denote the beginning and inside tags of an entity of type y, and Pt(·) is the model’s predicted probability for token t. We define the confidence of an entity name in PTk as Ck(e).
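A minimal sketch of this confidence computation, assuming per-token tag probabilities have already been extracted from the model (the function name and the dictionary representation of tag probabilities are ours):

```python
def entity_confidence(token_probs, entity_type):
    """Average, over the entity's tokens, of the maximum probability
    assigned to the correct beginning (B-) or inside (I-) tag.

    token_probs: one dict per entity token, mapping tag name -> probability.
    """
    per_token_max = [
        max(p.get(f"B-{entity_type}", 0.0), p.get(f"I-{entity_type}", 0.0))
        for p in token_probs
    ]
    return sum(per_token_max) / len(per_token_max)
```

For a two-token PER name where the model assigns 0.9 to B-PER on the first token and 0.8 to I-PER on the second, the entity confidence is 0.85.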
Definition 3.1. NER Memorization: We define NER memorization as the extent to which a NER model exposes training data through its predictions. Specifically, it reflects the likelihood that an entity name from the training set can be identified via a membership inference attack. We quantify memorization as the percentage of instances in which a model assigns higher confidence to an in-sample entity name compared to an out-of-sample counterpart when both are presented in the same prompt.
Assume an entity type in dataset Ddt contains n in-sample names ein,1, …, ein,n and n out-of-sample names eout,1, …, eout,n. We formulate the performance of a prompt PTk on detecting memorization of Mner for this entity type as the M-MEM score:

M-MEMk = (100 / n²) Σi Σj 1[Ck(ein,i) > Ck(eout,j)],

where 1[·] is the indicator function.
Note that this formulation is equivalent to the AUC metric used for membership inference attacks in [4]. It can be interpreted as the average rank of an in-sample name relative to the out-of-sample name set, where 100% means the in-sample names rank above all out-of-sample names, 0% means the opposite, and 50% means no significant memorization is detected.
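The pairwise comparison underlying the M-MEM score can be sketched as follows (a simplified illustration of ours; ties are counted as failures here, a detail the paper does not specify):

```python
def m_mem_score(conf_in, conf_out):
    """M-MEM score of one prompt: percentage of (in-sample, out-of-sample)
    pairs where the in-sample confidence is strictly higher. Equivalent
    to the AUC over the two confidence lists."""
    wins = sum(ci > co for ci in conf_in for co in conf_out)
    return 100.0 * wins / (len(conf_in) * len(conf_out))
```

With two in-sample confidences [0.9, 0.6] and two out-of-sample confidences [0.5, 0.7], three of the four pairs favor the in-sample name, giving a score of 75.0.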
Sensitivity and robustness in memorization detection. To measure the model’s sensitivity to prompt variations, we calculate the performance gap (Δ) between the highest and lowest performance across different prompts. The model’s robustness is assessed by computing the standard deviation (σ) of the prompts’ performances across the entire prompt set:

σ = √((1/m) Σk (M-MEMk − μ)²),

where μ is the average value of the prompt performances and m is the number of prompts. A lower σ value indicates greater robustness to prompt variation.
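Both quantities can be computed directly from a list of per-prompt scores; a small sketch using the standard library (population standard deviation assumed, and the function name is ours):

```python
import statistics

def sensitivity_and_robustness(scores):
    """scores: M-MEM scores of all prompts for one model/entity type.
    Returns (gap, sigma): the gap between the best and worst prompt,
    and the population standard deviation across the prompt set."""
    gap = max(scores) - min(scores)
    sigma = statistics.pstdev(scores)
    return gap, sigma
```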
4 Experimental setup
We performed our model memorization analysis on publicly accessible MLM-based NER models fine-tuned by other researchers on the CoNLL-2003 dataset [16]. To provide a more comprehensive and diverse analysis, we selected six models built on MLMs across three different pretraining schemes (ALBERT-v2 [26], BERT [27], and RoBERTa [28]) and two model sizes for each scheme (base and large; the -B and -L notations are used correspondingly). More details about the models and accessibility information can be found in S1 Appendix.
4.1 Strategies of using the prompt set
In addition to the baseline, we investigate three strategies for exploiting the prompt set.
Baselines (BSL). -PT uses the entity name without any additional text as the input to query corresponding confidences for memorization detection. One-PT and Mix-PT use the same strategies as in [13]: One-PT employs one hand-written prompt (e.g., “My name is XX.” for PER entity), corresponding to the known-setting in [13], while Mix-PT randomly chooses for each name one of the 5 hand-written prompts, like the unknown-setting in [13]. More prompt details are shown in Table 3.
Original Prompt (OPT) selects the best prompt (B-PT) and worst prompt (W-PT) from the prompt set, i.e., those that achieved the highest and lowest M-MEM scores on the dev set of Ddt for each entity type.
Prompt Engineering (PTE) is inspired by [29] who apply a token-removal operation to investigate how tokens of the input influence the model prediction. Here, the goal is to maximize the M-MEM score gap of the original best and worst prompts. Specifically, the most important token (which contributes the most to score improvement) and the least important token are removed from the worst and best prompts, respectively, resulting in two modified prompts. This process is iteratively repeated for the modified prompts until only one token (excluding the entity name) remains. The two modified prompts that achieved the highest and lowest M-MEM scores on the dev set among all the modified prompts are named BM-PT (best-modified prompt) and WM-PT (worst-modified prompt). It is important to note that such prompts may be ungrammatical. More details are presented in Sect 5.5.
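The iterative token-removal search can be sketched as follows. This is our own greedy reconstruction of the described procedure, with `score_fn` standing in for an M-MEM evaluation on the dev set and the "MASK" placeholder always preserved:

```python
def engineer_prompt(tokens, score_fn, maximize=True):
    """Greedy token-removal search: at each step, drop the token whose
    removal most improves (maximize=True) or most hurts (maximize=False)
    the score, keeping the 'MASK' placeholder, until a single non-MASK
    token remains. Returns the best (or worst) prompt seen and its score.

    tokens:   the prompt as a list of tokens, one of which is 'MASK'.
    score_fn: callable mapping a token list to an M-MEM score.
    """
    best = (list(tokens), score_fn(tokens))
    current = list(tokens)
    while sum(t != "MASK" for t in current) > 1:
        candidates = [current[:i] + current[i + 1:]
                      for i, t in enumerate(current) if t != "MASK"]
        current = (max(candidates, key=score_fn) if maximize
                   else min(candidates, key=score_fn))
        score = score_fn(current)
        if (maximize and score > best[1]) or (not maximize and score < best[1]):
            best = (current, score)
    return best
```

In the paper's setting, running this with `maximize=True` on the worst prompt and `maximize=False` on the best prompt yields the BM-PT and WM-PT candidates (which may be ungrammatical, as noted above).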
Ensembling of Prompts (EPT) employs several ensemble techniques: (1) Majority Voting (MV) decides the rank between an in-sample and an out-of-sample name by majority agreement across the prompt set. (2) Average Confidence (AVG-C) uses the average entity confidence over all prompts as the final confidence score of the entity. (3) Weighted Confidence (WED-C) weights the entity confidence of each prompt by its M-MEM score. (4) Maximum Confidence (MAX-C) uses the maximum confidence of the entity over all prompts, maxk Ck(e), as the final confidence. (5) Minimum Confidence (MIN-C) is the contrast to MAX-C and uses the minimum value, mink Ck(e), as the final confidence.
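The confidence-combination strategies (2)–(5) can be sketched in a few lines (our own helper; majority voting is omitted since it operates on pairwise ranks rather than on a single combined confidence):

```python
def ensemble_confidence(conf_by_prompt, strategy="avg", weights=None):
    """Combine one entity's confidences across prompts.

    conf_by_prompt: list of C_k(e) values, one per prompt.
    strategy: 'avg', 'max', 'min', or 'weighted' (weights, e.g. the
    prompts' M-MEM scores, are normalized to sum to one).
    """
    if strategy == "avg":
        return sum(conf_by_prompt) / len(conf_by_prompt)
    if strategy == "max":
        return max(conf_by_prompt)
    if strategy == "min":
        return min(conf_by_prompt)
    if strategy == "weighted":
        total = sum(weights)
        return sum(w * c for w, c in zip(weights, conf_by_prompt)) / total
    raise ValueError(f"unknown strategy: {strategy}")
```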
4.2 Interpretability method
Self-attention weights [30] of a model can show where the model attends in the sequence and provide insights into what affected the predictions. Specifically, we first average the extracted attention weights of the tokens that correspond to the entity names for all completed sentences of a prompt (e.g., the best prompt completed with all in-sample entity names). Then we create a heatmap using the averaged values over the attention heads and the layers.
5 Results and analysis
Results shown in Table 4 demonstrate that the M-MEM scores vary considerably depending on the prompt: the maximum performance difference between the best and the worst prompt for a given model is as large as 24.5 percentage points for ALBERT-B on the test set for the LOC entity.
The results are measured with the original prompt set on dev set. Bold/italic highlights the highest/lowest average and σ values.
We validated the statistical significance of the results through Cochran’s test and found that the differences are significant for all models. For an in-depth investigation, the following sections answer three research questions and perform sentence- and token-level analysis of how different factors impact prompt performance. Furthermore, we use self-attention weights to interpret how different prompts change the model’s attention weights, which leads to the performance differences. Note that we emphasize the PER results and analysis because this is the only entity type for which privacy has a special legal status reflected in regulation; e.g., the GDPR institutes a “right to be forgotten” that only applies to persons [31].
5.1 Are all models sensitive to prompt variations?
The last row of Table 4 reports the average performance difference across three entity types for the six studied NER models. Notably, RoBERTa-L shows the highest robustness, with a standard deviation (σ) of 1.51, while BERT-L is the least robust, with a σ of 2.47. The performance gaps of the different models range from 9.93 to 20.07 percentage points, indicating that all models are sensitive to prompt variations, regardless of their pretraining strategies or model sizes.
5.2 Are prompt performances generalizable?
To examine how prompt performance generalizes across models and entities, we compute the Kendall’s τ coefficient among the M-MEM scores of all the prompts for different models and data splits and show the results of PER entity in Fig 2. Results of other entity types are shown in S1 Appendix.
Left: correlations (Kendall’s τ) for the dev set scores. Middle: correlations for the test set scores. Right: correlations between dev and test set scores.
Generalizability across models: We observe that M-MEM scores on the dev set for BERT-B and BERT-L are negatively correlated, demonstrating a weak correlation between prompts’ M-MEM scores for different models even when comparing different variants of the same model on the same data split. From this, we conclude that the M-MEM scores of prompts do not generalize across models—the best prompts for a given model may not be the best for other models.
Generalizability across name sets: Conversely, we found higher correlations between prompts’ M-MEM scores across data splits when using the same model (diagonal of the right plot in Fig 2). We conclude that the M-MEM scores of prompts generalize to some extent across splits—for a given model, the best prompts for a given set of PER are likely to work well for a different set.
Table 5 illustrates this by comparing each model’s best and worst prompts, showing that the (model-dependent) best and worst prompts for the dev set are still among the best and worst for the test set. These results show that while prompt quality depends on the model, it still generalizes for different names.
The PER entity M-MEM scores and corresponding ranks of the best (Rank=1) and worst (Rank=-1) prompts from the dev set on the test set.
5.3 Can we further increase the prompt performance gap?
The prompt set outperforms baselines: In Table 6, all the M-MEM scores achieved by best prompts (B-PT) of the prompt set are higher than the three baselines in the dev set (only a few exceptions in the test set), showing the necessity to robustly detect the model memorization with diverse prompts.
One-PT and Mix-PT are baselines adapted from [13]. Bold/italic highlights the highest/lowest values. Δ measures the gap between the highest and lowest values.
Prompt engineering outperforms the original prompt set: We observe that the best/worst-modified prompts (BM-PT and WM-PT) obtained via prompt engineering increase the score gap between the B-PT and the W-PT. For example, the BM-PT of RoBERTa-B improves on the B-PT by approximately 2 percentage points on the dev set, and the WM-PT decreases the M-MEM score of BERT-L by 19 percentage points for the PER entity.
Ensembling techniques do not bring consistent improvement: The ensembling results for the PER entity in Table 7 demonstrate that the ensemble techniques (EPT) neither improve on the B-PT scores nor fall below the W-PT scores. However, we observed a few exceptions in the results for the LOC and ORG entities.
5.4 Sentence-level analysis of prompt properties
This section investigates different prompt properties that may influence the M-MEM scores for the PER entity. Analysis of other entity types is shown in S1 Appendix.

Prompt type: Fig 3 presents prompts’ M-MEM scores grouped by prompt type. RoBERTa-B has very distinct M-MEM scores for different types: imperative prompts generally yield higher scores than declarative prompts. Conversely, BERT-L has similar M-MEM score profiles across prompt types. Overall, we observe that some models show similar M-MEM scores across sentence types, while others are more sensitive to particular types.
The four prompt types are declarative, exclamatory, imperative, and interrogative.
Person’s name token position: Fig 4 shows prompts’ M-MEM scores grouped by the token position of the name. The closer the name is to the beginning of the prompt, the higher the M-MEM score tends to be for RoBERTa-B and BERT-B.
Prompt token length: Fig 5 contrasts prompts’ M-MEM scores and length (in number of tokens). We find that BERT-B stands out as an outlier, showing a moderate negative correlation between M-MEM scores and prompt token length. For the other models, only a weak correlation between M-MEM scores and prompt length is observed, suggesting that longer prompts are generally somewhat more effective at identifying memorization.
5.5 Token-level analysis of token importance
Fig 6 shows a token-level analysis of how tokens within a prompt impact the M-MEM scores on the example of the BERT-B model for PER entity names. Please refer to S1 Appendix for the analysis of other models considered in our experiments.
The figure shows the prompt analysis for the PER entity with the M-MEM scores on the BERT-B model. The best-performing (left) and the worst-performing (right) prompts were selected on the dev set. The heatmaps are generated with leave-one-token-out: at each step, the least important token is removed from the best-performing prompt, and the most important token is removed from the worst-performing prompt. The normalized token importance scores are under the corresponding tokens. The removed tokens are underlined.
Individual token importance for a prompt is calculated as the difference between the M-MEM scores of the prompt with and without this token on the dev set. These scores are then normalized using a softmax function.
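The leave-one-token-out importance computation can be sketched as follows, with `score_fn` again standing in for a dev-set M-MEM evaluation (our own reconstruction of the described procedure):

```python
import math

def token_importance(tokens, score_fn):
    """Importance of each token: drop in the M-MEM score when the token
    is removed, normalized with a softmax over the prompt's tokens.

    tokens:   the prompt as a list of tokens.
    score_fn: callable mapping a token list to an M-MEM score.
    """
    full = score_fn(tokens)
    raw = [full - score_fn(tokens[:i] + tokens[i + 1:])
           for i in range(len(tokens))]
    exps = [math.exp(r) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]
```

A token whose removal causes a large score drop receives a high normalized importance; the scores sum to one by construction.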
Our results demonstrate that removing the least important tokens from the prompt increases the M-MEM score, while removing the most important tokens has the opposite effect. For example, iteratively removing the least important tokens from the best-performing prompt for PER names (“Are you going to MASK’s art gallery opening tonight?”) increases the M-MEM score from 73.96 to 75.14 for the prompt “you going MASK’s art gallery”. Continuing to remove tokens results in a slight decrease in prompt performance, with the remaining noun-phrase prompts performing worse than the initial prompt. In the case of the worst-performing prompt, removing the most important token, “give”, from the initial prompt results in a major decrease in the M-MEM score of approximately 3.5 percentage points. Further removal of important tokens gradually reduces the score to a minimum of 59.15 for the prompt “MASK you something?”.
The tokens that function as verbs and the tokens that are located near the PER name tokens tend to have a higher importance score. Tokens like “going”, “give”, “practice”, “working”, “recommend”, etc. (see Fig 6 and figures in S1 Appendix) have a higher influence on the M-MEM score in each considered prompt. A similar trend can be seen for the tokens “’s”, “any”, “Oh”, “Bravo”, “What”, and “,” located nearer to the MASK token (substituted by the PER names in the experiments) than tokens located further away.
5.6 Self-attention interpretation analysis
Fig 7 shows the attention heatmaps of BERT-B for the best and the worst prompts with each prompt completed with in-sample and out-of-sample PER names separately.
The figure shows the result of BERT-B for PER entity names on the dev set averaged over all attention heads and layers. The attention weights corresponding to each prompt’s “MASK” token are used and averaged over the in-sample and out-of-sample PER entity names separately.
Generally, we notice that attention tends to be distributed similarly given a prompt, focusing on the first and the entity name tokens. This observation applies to all analyzed models (heatmaps for the other five models are shown in S1 Appendix).
Name level analysis: We observe a slightly higher focus on in-sample entity names than out-of-sample entity names when comparing the averaged attention weights on both the best and worst prompts.
Prompt level analysis: When comparing the heatmaps at the prompt level, we notice that models tend to focus more (or equally, in the case of BERT-B shown in Fig 7) on the entity name tokens in the best prompt than in the worst, except when the entity name tokens in the worst prompt are positioned at the beginning of the prompt.
6 Discussion
Through our comprehensive analysis, we raise the following questions for discussion.
How much memorization poses an unacceptable risk? While our study quantifies memorization based on a model’s differential confidence between in-sample and out-of-sample entity names, we do not define a threshold for what constitutes an “unacceptable” level of memorization. This is because the determination of the level is highly application-dependent. For example, even minimal memorization may be intolerable in domains involving sensitive personal information, such as healthcare or finance (e.g., HIPAA compliance or banking privacy) [20,32], or in contexts involving intellectual property and copyrighted material [33,34]. However, more leniency may be acceptable in tasks involving public or non-sensitive data [35]. Defining acceptable risk levels in practice typically requires considering domain-specific factors, regulatory standards, and ethical judgments. We acknowledge this as an important area for future interdisciplinary research.
How do we weigh risk versus performance in determining acceptable risk in various contexts? In this work, we propose a method to measure memorization in NER models by comparing prediction confidences for in-sample and out-of-sample entity names. While this quantification is a necessary step toward understanding privacy risks, determining what constitutes an “acceptable” level of memorization remains an open and context-dependent question. The trade-off between performance and privacy varies across applications and domains. Thus, we do not attempt to define this threshold here; instead, we highlight that our measurement framework provides the basis for such judgments. A crucial next step is to develop and evaluate mitigation strategies, such as regularization techniques, differential privacy, or selective data obfuscation, that can reduce memorization without substantially degrading model performance [36].
How to interpret M-MEM scores in context? The M-MEM score offers a way to quantify and compare memorization-related risk in NER models by measuring how consistently a model favors in-sample entity names over out-of-sample alternatives. While this metric provides valuable insights into model behavior, interpreting it as a measure of privacy risk depends on the application context. For example:
- High M-MEM + sensitive data (e.g., patient names): likely an unacceptable privacy risk.
- High M-MEM + public data (e.g., Wikipedia): possibly tolerable depending on usage.
- Low M-MEM: indicates weaker memorization, generally favorable from a privacy standpoint.
M-MEM does not directly estimate the probability of a successful attack or legal threshold violations; rather, it serves as a comparative indicator of how memorized a model is with respect to its training entities. Future work can refine this score into a more interpretable privacy metric or combine it with attack success rates to estimate risk more holistically [37].
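The comparative idea behind this kind of score can be sketched in code. The paper's exact M-MEM formula is not reproduced here; the snippet below is a minimal illustration, assuming an AUC-style statistic: the fraction of (in-sample, out-of-sample) name pairs for which the model assigns higher prediction confidence to the in-sample name. The function name `mmem_score` and the confidence values are hypothetical.

```python
from itertools import product

def mmem_score(in_conf, out_conf):
    """Fraction of (in-sample, out-of-sample) confidence pairs where the
    model favors the in-sample name. 0.5 means no memorization signal;
    values near 1.0 mean the model consistently prefers training names."""
    pairs = list(product(in_conf, out_conf))
    wins = sum(a > b for a, b in pairs)
    ties = sum(a == b for a, b in pairs)
    return (wins + 0.5 * ties) / len(pairs)

# Hypothetical prediction confidences for one prompt:
in_sample = [0.92, 0.88, 0.75]    # entity names seen during fine-tuning
out_sample = [0.70, 0.65, 0.80]   # names not in the fine-tuning data
print(mmem_score(in_sample, out_sample))  # ≈ 0.89, a strong in-sample preference
```

Because the score is rank-based, it is insensitive to the absolute calibration of the model's confidences, which makes it comparable across prompts and models.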
7 Limitations
We acknowledge the following limitations of our work:
- This research focused on MLM-based models. Therefore, the analysis and findings in this paper may not generalize to other types of models, such as auto-regressive models, just as previous studies of memorization in auto-regressive models do not generalize to MLMs. Our work fills a research gap by examining memorization in fine-tuned MLMs.
- The models that we investigated were trained by other researchers, and the snapshots at different training stages (as well as the complete training settings) are unavailable to us. Thus, we cannot provide a dynamic analysis of memorization across the training procedure.
- We opted to study widely used state-of-the-art models, in a setting where all components (base models, training data, etc.) are widely available, rather than non-SOTA models or models trained for specific domains. Unfortunately, such models are overwhelmingly fine-tuned on the CoNLL-2003 dataset, the predominant benchmark for NER. However, the CoNLL-2003 dataset is sampled from the Reuters Corpus (a corpus of Reuters news stories), which covers rich and diverse topics such as sports, business, and politics. Moreover, our experiments across three distinct entity types enhance the generalizability of our findings.
8 Conclusion
We studied memorization in fine-tuned MLM-based NER models. Unlike auto-regressive language models, where memorization can be assessed by prompting the model to generate verbatim segments of its training data [3,20], the output of NER models consists of structured label predictions (e.g., B-PER, I-ORG) and thus does not support such evaluation. Our study quantifies memorization of NER models without relying on generative capacity. In this setting, we compared a fixed set of five hand-written prompts to 400 automatically generated prompts for each entity type, and we measured how well those prompts could distinguish entity names that were present in the fine-tuning data from those that were not, using a large set of candidate entity names.
A detailed analysis revealed that a prompt’s ability to detect memorization varies between models but remains stable across different entity sets. We find large variability in the performance of generated prompts, and that many generated prompts significantly outperform hand-written ones. The effectiveness of prompts can be increased by removing the least important tokens, even if the prompts become ungrammatical after the removal. Overall, we demonstrate the importance of using diverse prompts through comprehensive experiments and analysis. We provide all prompts in the supporting information for detecting memorization in NER models in future privacy studies.
Supporting information
S3 Code. (GitHub repository: https://github.com/Yuuxii/NER-memorization-detection/releases/tag/v1.0).
Acknowledgments
We thank Vasisht Duddu and N. Asokan who dedicated time and expertise to provide valuable feedback on this paper.
References
- 1. Neel S, Chang P. Privacy issues in large language models: a survey. arXiv preprint 2023. https://arxiv.org/abs/2312.06717
- 2. Mireshghallah F, Uniyal A, Wang T, Evans D, Berg-Kirkpatrick T. An empirical analysis of memorization in fine-tuned autoregressive language models. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. p. 1816–26. https://aclanthology.org/2022.emnlp-main.119
- 3. Carlini N, Ippolito D, Jagielski M, Lee K, Tramèr F, Zhang C. Quantifying memorization across neural language models. In: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net; 2023. https://openreview.net/pdf?id=TatRHT_1cK
- 4. Salem A, Zhang Y, Humbert M, Berrang P, Fritz M, Backes M. ML-Leaks: model and data independent membership inference attacks and defenses on machine learning models. In: Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS). 2019. https://arxiv.org/abs/1806.01246
- 5. Carlini N, Chien S, Nasr M, Song S, Terzis A, Tramer F. Membership inference attacks from first principles. In: 2022 IEEE Symposium on Security and Privacy (SP). 2022. p. 1897–914. https://arxiv.org/abs/2112.03570
- 6. Zaratiana U, Tomeh N, Holat P, Charnois T. GLiNER: generalist model for named entity recognition using bidirectional transformer. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico; 2024. p. 5364–76. https://aclanthology.org/2024.naacl-long.300
- 7. Jin D, Jin Z, Zhou JT, Szolovits P. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, 2020. p. 8018–25. https://aaai.org/ojs/index.php/AAAI/article/view/6311
- 8. Yang Z, Ding M, Guo Y, Lv Q, Tang J. Parameter-efficient tuning makes a good classification head. In: Goldberg Y, Kozareva Z, Zhang Y, editors. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics; 2022. p. 7576–86. https://aclanthology.org/2022.emnlp-main.514
- 9. Keraghel I, Morbieu S, Nadif M. Recent advances in named entity recognition: a comprehensive survey and comparative study. arXiv preprint 2024. https://arxiv.org/abs/2401.10825
- 10. Xu Z, Peng K, Ding L, Tao D, Lu X. Take care of your prompt bias! Investigating and mitigating prompt bias in factual knowledge extraction. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024. p. 15552–65. https://aclanthology.org/2024.lrec-main.1352
- 11. Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv. 2023;55(9):1–35.
- 12. Jiang Z, Xu FF, Araki J, Neubig G. How can we know what language models know? Transactions of the Association for Computational Linguistics. 2020;8:423–38.
- 13. Ali RS, Zhao BZH, Asghar HJ, Nguyen T, Wood ID, Kaafar D. Unintended memorization and timing attacks in named entity recognition models. In: Proceedings on Privacy Enhancing Technologies. 2023. http://arxiv.org/abs/2211.02245
- 14. Feng Z, Zhou H, Zhu Z, Qian J, Mao K. Unveiling and manipulating prompt influence in large language models. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7–11, 2024. OpenReview.net; 2024. https://openreview.net/forum?id=ap1ByuwQrX
- 15. OpenAI. ChatGPT Conversation. https://chatgpt.com/
- 16. Tjong Kim Sang EF, De Meulder F. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. 2003. p. 142–7. https://www.aclweb.org/anthology/W03-0419
- 17. Vrandečić D, Krötzsch M. Wikidata: a free collaborative knowledgebase. Communications of the ACM. 2014;57(10):78–85.
- 18. Pan X, Zhang B, May J, Nothman J, Knight K, Ji H. Cross-lingual name tagging and linking for 282 languages. In: Barzilay R, Kan MY, editors. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics; 2017. p. 1946–58. https://aclanthology.org/P17-1178
- 19. Carlini N, Liu C, Erlingsson U, Kos J, Song D. The secret sharer: evaluating and testing unintended memorization in neural networks. In: 28th USENIX Security Symposium (USENIX Security 19), Santa Clara, CA, 2019. p. 267–84.
- 20. Carlini N, Tramer F, Wallace E, Jagielski M, Herbert-Voss A, Lee K, et al. Extracting training data from large language models. In: 30th USENIX Security Symposium (USENIX Security 21); 2021. p. 2633–50.
- 21. Wei C, Wang Y-C, Wang B, Kuo C-CJ. An overview of language models: recent developments and outlook. SIP. 2024;13(2).
- 22. Wang A, Cho K. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In: Bosselut A, Celikyilmaz A, Ghazvininejad M, Iyer S, Khandelwal U, Rashkin H, et al., editors. Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation. Minneapolis, Minnesota: Association for Computational Linguistics; 2019. p. 30–6. https://aclanthology.org/W19-2304
- 23. Tirumala K, Markosyan AH, Zettlemoyer L, Aghajanyan A. Memorization without overfitting: analyzing the training dynamics of large language models. In: Advances in Neural Information Processing Systems 35, New Orleans, LA, USA, 2022. http://papers.nips.cc/paper_files/paper/2022/hash/fa0509f4dab6807e2cb465715bf2d249-Abstract-Conference.html
- 24. Magar I, Schwartz R. Data contamination: from memorization to exploitation. In: Muresan S, Nakov P, Villavicencio A, editors. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Dublin, Ireland: Association for Computational Linguistics; 2022. p. 157–65. https://aclanthology.org/2022.acl-short.18
- 25. Sclar M, Choi Y, Tsvetkov Y, Suhr A. Quantifying language models’ sensitivity to spurious features in prompt design or: how I learned to start worrying about prompt formatting. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7–11, 2024. OpenReview.net; 2024. https://openreview.net/forum?id=RIu5lyNXjT
- 26. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: a lite BERT for self-supervised learning of language representations. In: The Eighth International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 2020. OpenReview.net; 2020. https://openreview.net/forum?id=H1eA7AEtvS
- 27. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T, editors. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics; 2019. p. 4171–86. https://aclanthology.org/N19-1423
- 28. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D. RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint 2019. https://arxiv.org/abs/1907.11692
- 29. Feng S, Wallace E, Grissom II A, Iyyer M, Rodriguez P, Boyd-Graber J. Pathologies of neural models make interpretations difficult. In: Riloff E, Chiang D, Hockenmaier J, Tsujii J, editors. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics; 2018. p. 3719–28. https://aclanthology.org/D18-1407
- 30. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems, Long Beach, CA, USA, 2017. p. 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
- 31. Mondschein CF, Monda C. The EU’s General Data Protection Regulation (GDPR) in a Research Context. In: Kubben P, Dumontier M, Dekker A, editors. Fundamentals of Clinical Data Science. Cham (CH): Springer; 2018. p. 55–71. https://doi.org/10.1007/978-3-319-99713-1_5
- 32. Shokri R, Stronati M, Song C, Shmatikov V. Membership inference attacks against machine learning models. In: Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP). IEEE; 2017. p. 3–18.
- 33. U.S. Department of Health & Human Services. Summary of the HIPAA Privacy Rule. 2003. https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html
- 34. European Parliament. General Data Protection Regulation (GDPR). 2016. https://eur-lex.europa.eu/eli/reg/2016/679/oj
- 35. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. 2019. https://api.semanticscholar.org/CorpusID:160025533
- 36. Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K. Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 2016. p. 308–18.
- 37. Yeom S, Giacomelli I, Fredrikson M, Jha S. Privacy risk in machine learning: analyzing the connection to overfitting. In: 2018 IEEE 31st Computer Security Foundations Symposium (CSF). IEEE; 2018. p. 268–82.