
Helpful assistant or fruitful facilitator? Investigating how personas affect language model behavior

  • Pedro Henrique Luz de Araujo ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    pedro.henrique.luz.de.araujo@univie.ac.at

    Affiliations Faculty of Computer Science, University of Vienna, Vienna, Austria, Doctoral School Computer Science, Faculty of Computer Science, Vienna, Austria

  • Benjamin Roth

    Roles Conceptualization, Funding acquisition, Resources, Supervision, Writing – review & editing

    Affiliations Faculty of Computer Science, University of Vienna, Vienna, Austria, Faculty of Philological and Cultural Studies, University of Vienna, Vienna, Austria

Abstract

One way to steer generations from large language models (LLMs) is to assign a persona: a role that describes how the user expects the LLM to behave (e.g., a helpful assistant, a teacher, a woman). This paper investigates how personas affect diverse aspects of model behavior. We assign 162 personas from 12 categories, spanning variables like gender, sexual orientation, and occupation, to seven LLMs. We prompt them to answer questions from five datasets covering objective tasks (e.g., questions about math and history) and subjective tasks (e.g., questions about beliefs and values). We also compare personas’ generations to two baseline settings: a control persona setting with 30 paraphrases of “a helpful assistant” to control for models’ prompt sensitivity, and an empty persona setting where no persona is assigned. We find that for all models and datasets, personas show greater variability than the control setting and that some measures of persona behavior generalize across models.

1 Introduction

Large language models (LLMs) pre-trained on large corpora, fine-tuned on supervised instruction and chat data, and aligned to human preferences have transformed the natural language processing (NLP) field. LLMs are now applied to creative writing [1], code development [2], education [3], healthcare [4], and search engines [5]. Dialogue systems such as ChatGPT [6] have gained widespread adoption beyond the research community, being actively used by laypeople and covered by the mainstream media.

Given the diversity of use cases of LLMs, there has been a growing interest in personalizing LLMs to the needs of individual users [7]. One way to steer the behavior of LLMs is to assign them a persona: a role or character that describes the particular personality traits or capabilities that the LLM generations should reflect. Examples of persona include task descriptors such as helpful assistant, specific people like Muhammad Ali [8], and demographic groups like gay person [9].

Persona-assigned language models have been used for a variety of goals. These include not only personalization of LLMs’ generations [10], but also simulation of human-behavior [11] and fictional characters [12], and for improving performance on tasks requiring specialized knowledge [13]. However, the potential and opportunities of persona usage are associated with critical risks, and the effects of assigning personas are not clearly characterized yet. Namely:

  1. Inconsistent task improvement capabilities. In contrast to works showing positive results for using personas relevant to the task at hand (e.g., mathematician for math questions) [13–15], others have cast doubt on the usefulness of persona for task improvement, showing that personas do not surpass no persona baselines [16, 17]. There is conflicting evidence on whether persona usage improves task performance.
  2. Persona biases. Studies have shown that personas can increase the toxicity of generations [8, 9], that is, generate texts that are harmful, offensive, or reproduce societal biases and stereotypes. It has also been shown that task performance varies depending on demographic information such as persona gender and race [13, 17]. These results raise the concern that personas may exacerbate bias and perpetuate stereotypes. Which demographics are most affected, and how the persona’s demographic group interacts with that of the bias target, have not been explored.
  3. No exploration of the link between eliciting personality traits and actual behavior. Previous work has shown that personas can, to some extent, steer LLMs’ self-reported personality traits (as measured by questionnaires) [18] and influence LLMs’ annotation in a variety of downstream tasks [11, 19]. While both are necessary for accurate simulation, past work has neglected the link between personas’ self-reported values and downstream tasks, i.e., whether personas’ annotations reflect their self-reported values.
  4. Unequal treatment of personas from different sociodemographic groups. LLMs have been shown to be less compliant for some personas than others, refusing to answer as a physically-disabled persona, but not as an able-bodied persona [17]. This impacts task performance (errors due to LLMs’ refusal to answer) and simulation (less accurate simulation for specific demographics), and impedes personalization for certain users, ultimately contributing to further marginalization of underrepresented social groups by excluding their (simulated) perspectives [20]. Which demographics are impacted and whether refusal is consistent across models and datasets are open questions.

This paper aims to explore those research gaps by investigating the following research questions:

RQ1: How do personas affect task performance? We compare the performance of personas on diverse tasks to examine the extent to which personas affect task performance, what tasks are most affected, and what kind of persona behavior generalizes across LLMs. This is helpful in improving our understanding of the cases where personas are beneficial and in identifying potential pitfalls.

RQ2: How do personas affect LLMs’ biases?

We compare personas’ biases across several dimensions (e.g., age, ethnicity, sexuality) and examine the associations between the demographic groups of the personas and the targeted identities (e.g., does gay person show low bias against gay people?).

RQ3: Do personas’ annotations reflect their self-reported attitudes? We prompt personas with questionnaires designed to measure attitudes (e.g., altruism and endorsement of racist beliefs) and investigate the extent to which personas can influence LLMs’ attitude values. We then adapt to the persona setting a human study investigating the effect of attitudes on annotations [21] and examine how closely personas’ associations mirror human associations.

RQ4: Do LLM refusals differ across personas? We compute the refusal rates from personas for the datasets in our experimental setting to examine whether these refusals are arbitrary—different rates for similar personas (e.g., gay person and homosexual person)—and disparate—different rates for personas from different demographic groups (e.g., gay person and straight person).

Our experiments include seven LLMs from different families and sizes. We instruct the LLMs to adopt the 162 personas from the UniversalPersona set [9], spanning categories like gender, race, sexuality, country of origin, and occupation. We prompt the personas to answer questions from five datasets covering attitudes, trustworthiness, domain-specific knowledge, social biases, and toxicity. In order to distinguish persona influence from prompt sensitivity influence, we contrast persona behaviors with those from a control persona set: helpful assistant and 29 paraphrases of it.

Given this setup, our study aims to systematically examine the impact of personas on LLM behavior. Our main contributions are as follows:

  1. To the best of our knowledge, this study is the first to comprehensively investigate the impacts of personas on LLM behavior across multiple dimensions. Unlike previous works, which have focused on a single aspect of persona behavior, our analysis spans task performance (Sect 4), social biases (Sect 5), social attitudes (Sect 6), and refusals (Sect 7).
  2. We generate a dataset of approximately 90 million LLM generations that we will make publicly available to support future studies analyzing personas’ capabilities and biases.
  3. We propose using control personas as a baseline for LLM response variation and show that regular personas give rise to larger variability than control personas in all evaluation scenarios, with an accuracy gap as big as 38.56 percentage points between the top and bottom personas.
  4. Our analyses shed light on the research gaps listed above, highlighting findings consistent across LLMs and datasets.

All the code, experiments, and results are available at https://github.com/peluz/persona-behavior.

2 Related work


Personas and performance.

Previous works show that personas can affect task performance in positive and negative ways. On the positive side, personas can improve LLM trustworthiness [15], accuracy in domain-specific tasks [13, 14], and response quality [22, 23]. On the other hand, assigning personas from demographic groups (e.g., black person) can lead to lower scores on reasoning tasks [17], and some work suggests that responses from personas are not as accurate as those from a no persona baseline [16].

Our work builds upon this research direction by extending the scope of examined personas, models, and datasets. Contrary to previous studies, we investigate which persona effects are consistent across models and datasets (Sect 4) and contrast them with results from control personas to verify whether effects are due to the personas rather than LLMs’ prompt sensitivity. Our results reveal a more nuanced scenario, where expert personas may not be the best performers and demographic personas outperform the no persona baseline in some scenarios.

Personas and biases.

Another line of research investigates personas’ impact on model biases, showing their potential for increasing model toxicity [8, 9] and reproducing social stereotypes [24, 25].

We contribute to this line of research by studying the interplay between personas and the targets of model biases (Sect 5), focusing on personas’ impact on biases against their own demographic. Our results reveal a bias-accuracy trade-off: assigning a persona reduces model bias against the persona’s demographic (e.g., assigning the gay person persona reduced model bias against gay people), but question answering accuracy decreases.

Personas and values.

Previous work shows that personas have a measurable effect on LLMs’ responses to questionnaires measuring personality traits and ethical values [18, 26, 27], and can influence downstream tasks [11]. However, prior work has not investigated the link between inducing such values and traits and the downstream annotations. For example, if a persona has a high empathy level, do its annotations match those of empathetic humans?

To this end, we investigate to what extent LLMs’ self-reported values lead to measurable changes in downstream annotation tasks and whether the associations between values and annotations correspond to those of human annotators (Sect 6). We show that personas’ associations between social attitudes, and to a lesser extent between their attitudes and their toxicity-detection annotations, are similar to those of humans.

Personas and refusals.

[17] show that persona-assigned models sometimes refuse to provide answers, making explicit references to perceived inadequacies that reveal models’ underlying biases (e.g., “I apologize, but as a physically-disabled person, I am unable to perform mathematical calculations [...]”).

We further investigate models’ refusals, revealing that they are arbitrary (dramatically different refusal rates for semantically equivalent personas) and disparate (significant discrepancies of refusal rates for personas from different demographic backgrounds)—and that LLMs consistently disfavor specific personas (Sect 7).

3 Persona-assigned prompts

We assign a persona to an LLM by including in the prompt the statement You are a {persona}, where {persona} identifies a role the model should adopt (e.g., engineer). We include this statement as a system message when possible since this has been shown to yield higher model controllability [26]. For models with no system message, we include the statement at the beginning of the prompt followed by an empty line. Fig 1 illustrates how model generations change depending on the assigned persona.
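The prompt-assembly logic above can be sketched as follows; `build_prompt` and the chat-message schema are our own illustration, not the paper’s released code:

```python
def build_prompt(persona, task_input, supports_system_message):
    """Attach the persona statement as a system message when supported,
    otherwise prepend it to the prompt followed by an empty line."""
    statement = f"You are a {persona}" if persona else None
    if supports_system_message:
        messages = []
        if statement:
            messages.append({"role": "system", "content": statement})
        messages.append({"role": "user", "content": task_input})
        return messages
    # No system role available: prepend the statement plus an empty line.
    prefix = f"{statement}\n\n" if statement else ""
    return prefix + task_input
```

Passing `persona=None` reproduces the no persona baseline, where no persona-assignment statement is included.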

Fig 1. Persona assignment.

We include a statement assigning a persona to each prompt. The figure shows how different personas impact generations from the gemma-7b-inst model in objective tasks (w/ ground truth) and subjective tasks (no ground truth). The robot indicates the no persona baseline, where no persona-assignment statement is included.

https://doi.org/10.1371/journal.pone.0325664.g001

The complete prompt comprises the persona-assigning statement, followed by a task description, a task input, and a formatting instruction. Table 1 shows examples for all datasets in our experiments. Content warning: the examples of toxicity data contain profanity and slurs. We replace the first vowel of slurs and profanity with *.

Table 1. Example prompts (with an example persona) for all datasets.

https://doi.org/10.1371/journal.pone.0325664.t001


Personas.

As the persona set, we use the UniversalPersona dataset [9]. It includes 162 personas covering 12 diverse categories: races (e.g., white person), sexualities (e.g., gay person), social classes (e.g., upper-class person), genders (e.g., nonbinary person), education levels (e.g., person with graduate level education), generic professions (e.g., person in sports industry), specific professions (e.g., basketball player), religious beliefs (e.g., person of Islam belief), ideologies (e.g., person with conservatism ideology), health characteristics (e.g., person with mental disorders), names from countries (e.g., Fatima from Arabia), and political figures (e.g., Fidel Castro).

Control personas.

We define a control set using paraphrases of helpful assistant, which we refer to as control personas. The assumption is that, since these personas are paraphrases of one another, changes in model behavior across them will be due to prompt sensitivity—rather than the personas themselves. The paraphrases are generated by GPT-4 [28] using the prompt Create 29 paraphrases of "helpful assistant", and nucleus sampling [29] with a .95 cumulative probability threshold as the generation method.

Table 2 shows all personas.
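The control paraphrases were decoded with nucleus (top-p) sampling; as a hedged sketch of what the .95 threshold does, here is the top-p filtering step applied to a toy next-token distribution (function name and numbers are ours):

```python
import numpy as np

def nucleus_filter(probs, top_p=0.95):
    """Zero out all tokens outside the smallest set whose cumulative
    probability reaches top_p, then renormalize (top-p / nucleus sampling)."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                  # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest covering set
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()
```

Sampling then draws from the renormalized distribution rather than the full vocabulary, trading a little diversity for fewer low-probability degenerate continuations.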

Models.

To investigate the effect of personas on LLM behavior, we experiment with models from different families (GPTs, Gemma, Mistral, and Zephyr), which are trained by different providers using distinct datasets and optimization objectives. Our selection spans a range of model sizes (from 2B parameters to dozens of billions of parameters) and architectures (dense and mixture-of-experts transformers), and includes both open- and closed-weight models. This diversity allows us to assess whether persona effects are consistent across different model designs, scales, and training paradigms.

Specifically, we include gpt-4-0125-preview (GPT-4), gpt-3.5-turbo-0125 (GPT-3.5) [30], Mixtral-8x7B-Instruct-v0.1 (Mixtral) [31], zephyr-7b-beta (Zephyr) [32], Mistral-7B-Instruct-v0.2 (Mistral-inst) [33], gemma-7b-it (Gemma-7b-inst) [34], and gemma-2b-it (Gemma-2b-inst). We query GPT-4 and GPT-3.5 through the OpenAI API. The other models are available in the Transformers library [35]. GPT-4, GPT-3.5, and Zephyr support system messages.

Response generation.

For each combination of model, persona, and dataset instance, we generate a single response using greedy decoding. We use the control persona set to account for prompt sensitivity and conduct significance testing to ensure the reliability of cross-persona comparisons.
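Greedy decoding deterministically selects the highest-probability token at each step, so each (model, persona, instance) triple maps to a single response; a toy sketch (helper name is ours):

```python
import numpy as np

def greedy_decode(step_logits):
    """Pick the argmax token id at each generation step.

    `step_logits` is a (steps, vocab_size) array of logits; since no
    sampling is involved, the output is fully deterministic."""
    return [int(np.argmax(logits)) for logits in step_logits]
```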

4 RQ1: Effect of personas on task performance

One of the motivations of persona usage is to improve task performance on tasks that require specialized capabilities. The intuition is that prompting with a persona aligned with the task domain steers the LLM toward the correct response. However, there is conflicting evidence on the effectiveness of such an approach, and performance can degrade when personas from certain demographics are used—even though such attributes are irrelevant to the task.

4.1 Data

This section investigates the performance of personas on tasks requiring knowledge from different domains. To this end, we query models with data from the following datasets:

TruthfulQA [15] evaluates how models’ answers reproduce popular misconceptions and false beliefs. It contains 817 questions covering 38 categories such as history, superstitions, economics and fiction. We use the multiple choice variant, with mc1_targets as the ground truth.

MMLU [36] evaluates model knowledge across 57 subjects from diverse areas such as math, social sciences, and law. The test split contains 14k instances, each with four answer choices.

BBQ [37] is a question-answering dataset that highlights 11 social bias categories concerning, for example, race, gender, and socioeconomic status. BBQ contains ambiguous contexts, which do not contain information necessary to answer the question (as exemplified in Fig 1), and corresponding disambiguated contexts that contain sufficient information. The test split comprises 58k instances, each with three choices: one expressing uncertainty (e.g., unknown), and two options referring to each entity in the context.

Table 1 shows examples for all datasets. We randomly shuffle the multiple-choice options for TruthfulQA to avoid position biases. The position of the correct option is approximately uniformly distributed across MMLU and BBQ instances, so we do not shuffle options in those cases. Due to resource constraints, when prompting GPT-4 we subsample MMLU (a maximum of 250 instances per subject, for a total of 10,219 instances) and BBQ (a maximum of 120 samples per demographic group, for a total of 5,788 samples). All datasets are available at https://huggingface.co/datasets/
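The TruthfulQA option shuffling can be sketched as below; the helper and seed handling are our own illustration:

```python
import random

def shuffle_options(options, correct_index, seed=None):
    """Shuffle multiple-choice options while tracking where the correct
    answer lands, so position bias cannot inflate accuracy."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(correct_index)
```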

Evaluation metrics.

We report the accuracy for TruthfulQA, the average subject accuracy for MMLU, and the average bias category accuracy for BBQ.
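Both averages are macro-averages: accuracy is computed per group first and the group scores are then averaged, so small subjects or bias categories count as much as large ones. A minimal sketch (names are ours):

```python
from collections import defaultdict

def macro_accuracy(records):
    """Mean of per-group accuracies; `records` is an iterable of
    (group, is_correct) pairs, e.g. (MMLU subject, hit/miss)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, correct in records:
        hits[group] += int(correct)
        totals[group] += 1
    per_group = [hits[g] / totals[g] for g in totals]
    return sum(per_group) / len(per_group)
```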

4.2 Results

Fig 2 shows scores for all personas, models, and datasets.

Fig 2. Distribution of personas’ performances.

We show results for each dataset and overall performance (averaged across datasets).

https://doi.org/10.1371/journal.pone.0325664.g002

Personas significantly affect task performance.

For each model and dataset, we run a Cochran’s Q test [38] of the null hypothesis that personas have the same distribution of hits and mistakes. All results are significant (p-value <.001). Regular personas yield greater performance variability than control personas, which tend to concentrate around the no persona baseline. Performance differences can be quite striking: as much as 38.56 percentage points (p.p.) between the top (asexual person) and bottom (person with fascism ideology) personas in TruthfulQA for GPT-3.5. Even when averaged across datasets, GPT-3.5 still has a 20.77 p.p. gap between the top (person of atheism belief) and bottom (person with fascism ideology) personas. The model with the smallest performance gap is GPT-4, with 4.58 p.p.
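Cochran’s Q generalizes McNemar’s test to k related binary samples; here each question is a “subject” and each persona a “treatment”. A minimal implementation of the statistic (our own sketch; statsmodels also ships an implementation):

```python
import numpy as np
from scipy.stats import chi2

def cochrans_q(outcomes):
    """Cochran's Q test. `outcomes` is an (items, personas) 0/1 matrix:
    cell (i, j) is 1 if persona j answered item i correctly."""
    outcomes = np.asarray(outcomes)
    k = outcomes.shape[1]              # number of personas (treatments)
    col = outcomes.sum(axis=0)         # hits per persona
    row = outcomes.sum(axis=1)         # hits per item
    n = outcomes.sum()                 # grand total of hits
    q = (k - 1) * (k * (col ** 2).sum() - n ** 2) / (k * n - (row ** 2).sum())
    p = chi2.sf(q, df=k - 1)           # chi-squared upper tail, k-1 dof
    return q, p
```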

Some persona rankings are consistent across models.

We compute the association (Kendall’s τ [39]) between personas’ performances to identify persona rankings that are consistent across models. We target differences between personas from the same category (e.g., personas referring to an ethnicity) and consider a ranking to be consistent across models when its average τ (across all model pairs) corresponds to a moderate or strong association [40, 41].
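Concretely, the consistency check can be sketched with `scipy.stats.kendalltau`, averaging the rank correlation over all model pairs (function name is ours):

```python
from itertools import combinations
from scipy.stats import kendalltau

def mean_pairwise_tau(scores_by_model):
    """Average Kendall's tau over all model pairs.

    `scores_by_model` maps model name -> list of persona scores,
    with personas in the same fixed order for every model."""
    taus = []
    for a, b in combinations(scores_by_model.values(), 2):
        tau, _pvalue = kendalltau(a, b)
        taus.append(tau)
    return sum(taus) / len(taus)
```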

Asexual person and person of atheism belief are consistently accurate for TruthfulQA, being among the top 10 personas in all models. Further, person of atheism belief outperforms all religious personas, and middle-class person outperforms the other social class personas. Considering MMLU, we find that the average (across models) accuracy of education personas is sorted by education level: graduate level is better than college level, which is better than high school level, and so on. We also find a consistent ordering for gender personas, with woman and man outperforming nonbinary person and transgender person. For both MMLU and TruthfulQA, personas with democracy and liberalism ideologies are consistently better than personas with fascism, populism, and nationalism ideologies.

Some persona rankings are consistent across datasets.

We also identify persona rankings that are consistent across datasets. We average personas’ performance across models and identify rankings whose average τ across dataset pairs indicates a consistent ordering.

Similarly to the previous paragraph, personas with socialism, democracy, and liberalism ideologies outperform personas with fascism, populism, and nationalism ideologies. Moreover, the middle-class persona outperforms the other social-class personas in all datasets.

Are expert personas better?

One of the rationales for assigning personas is to provide a role that is appropriate for the task at hand (e.g., a mathematician persona for a number theory problem). We validate this intuition by selecting personas that directly relate to four MMLU subject groups, each corresponding to a broader knowledge field: technology personas (person in technology industry, software developer, and engineer) for the computer science subjects (STEM field), law personas (person in the law industry and lawyer) for the law subjects (humanities field), psychologist for the psychology subjects (social sciences field), and healthcare personas (person in the medical industry, doctor, dentist, physician, orthodontist, surgeon, veterinarian, nurse, physical therapist, optometrist, and anesthesiologist) for the health subjects (others field). We average the performance of those personas across models and rank them considering different subsets of the MMLU dataset to compare their overall rank (whole dataset), field rank (e.g., STEM questions), and subject group rank (e.g., computer science questions).

Table 3 shows that personas are better in their corresponding field when compared with personas with different expertise: for each domain (humanities, STEM, social sciences, and other), the top persona of the corresponding persona group (law, technology, psychologist, and healthcare) had a better rank than the top personas of out-of-domain groups.

Table 3. Persona group average ranks (out of 193—162 personas + 30 control personas + no persona baseline—lower is better) for each knowledge domain. The rank of the best persona in each group is shown in parentheses. We show in bold the top persona group for each domain and underline the best domain for each persona group. The top-ranked persona for social sciences was the social scientist persona.

https://doi.org/10.1371/journal.pone.0325664.t003

To assess whether the accuracies of in-domain expert groups significantly differ from those of out-domain experts, we conduct a Wilcoxon signed-rank test [42]. We compare the domain accuracies of all in-domain expert groups (calculated per model and averaged across personas in each group) with the domain accuracies of the best out-domain expert group (psychologist for humanities and other, and law for STEM and social sciences). We find that the distributions of in-domain and out-domain accuracies are significantly different (p = 0.009). That said, none of the personas outperform the no persona baseline, suggesting that while specialization exists, it does not necessarily translate to performance benefits compared to the baseline.

However, Table 4 shows that expert personas get progressively better as the domain becomes increasingly specialized, surpassing the no persona baseline in three of the four subject groups. These results suggest that, while expert personas can be helpful for the particular cases they are tailored to, this comes at a cost to overall performance. Further, the benefit can be unreliable: for computer science and law subjects, the top expert outperforms no persona, but the average expert rank is still worse than that of the no persona baseline.

Table 4. Persona ranks (out of 193, lower is better) for increasingly specialized domains. For persona groups with multiple personas we show, in addition to the average rank, the rank of the best persona in the category between parentheses.

https://doi.org/10.1371/journal.pone.0325664.t004

5 RQ2: Effect of personas on biases

One possible pitfall of persona usage is that it may introduce or reinforce LLMs’ social biases. While prior work has demonstrated that personas can increase model toxicity and stereotyping, which personas are likely to be more biased and how the persona relates to the bias target have not been explored. A better understanding of the dynamics of personas’ biases can lead to new mitigation strategies.

This section investigates personas’ effects on the social biases measured by BBQ. We aim to measure the extent to which personas reproduce harmful societal stereotypes and how that varies across different personas. We also measure how frequently personas choose the unknown option, which distinguishes personas that are overly cautious (answering unknown when the answer is in the context) from those that are too reckless (not answering unknown when the context is ambiguous).

We use the bias metric originally proposed for BBQ. For each bias category, let $n_b$ be the number of biased answers, $n_{\neg u}$ the number of non-unknown answers, and $acc$ the accuracy in ambiguous contexts. Then:

$$s_{dis} = 2\left(\frac{n_b}{n_{\neg u}}\right) - 1 \tag{1}$$

$$s_{amb} = (1 - acc)\, s_{dis} \tag{2}$$

where $s_{dis}$ and $s_{amb}$ are the bias in disambiguated and ambiguous contexts. The bias scores range from -1 (all answers go against bias) to 1 (all answers align with bias). As the final bias score for each category, we report the average of $s_{dis}$ and $s_{amb}$.
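As a compact restatement of the metric above (computed per bias category; argument names are ours):

```python
def bbq_bias(n_biased, n_non_unknown, acc_ambiguous):
    """BBQ bias scores for one category.

    s_dis: share of non-unknown answers aligning with the stereotype,
           rescaled to [-1, 1].
    s_amb: s_dis scaled by the error rate in ambiguous contexts, so a
           model that correctly answers 'unknown' accrues no bias."""
    s_dis = 2 * (n_biased / n_non_unknown) - 1
    s_amb = (1 - acc_ambiguous) * s_dis
    return s_dis, s_amb
```

The reported category score is the average of the two values.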

5.1 Results

Fig 3 shows bias scores (averaged across the 11 bias categories: e.g., race, gender, socioeconomic status) and unknown frequency of all personas and models.

Fig 3. Distribution of personas’ bias scores and frequency of unknown answers.

Ground truth answers yield a bias of 0 and an unknown frequency of 0.5.

https://doi.org/10.1371/journal.pone.0325664.g003

Personas significantly affect bias scores and unknown frequencies.

We run a Cochran’s Q test for each model and dataset, finding that personas yield different biased and unknown answer distributions (p-value <.001). The gap between top and bottom scores is quite large, ranging from 2.44 p.p. (Gemma-2b-inst) to 9.16 p.p. (GPT-3.5) for bias scores, and from 9.19 p.p. (GPT-4) to 45.67 p.p. (GPT-3.5) for unknown frequencies. As in RQ1, control persona scores have smaller variability and tend to concentrate around the no persona baseline.

While personas exhibit quite different unknown frequencies, they are often unable to shift models between the too reckless region (unknown frequency below 0.5) and the overly cautious region (unknown frequency above 0.5). Only GPT-3.5 personas cover both regions extensively. GPT-4 is always overly cautious, Zephyr and Gemma-2b-inst are always reckless, and the other models have the vast majority of their personas in the same region as the no persona baseline.

Some persona rankings are consistent across models.

We use the same procedure as in Sect 4 to identify personas with consistently high or low bias scores across models. Man and woman personas have higher bias than the nonbinary and transgender personas in all models. Furthermore, the nonbinary persona is consistently the gender persona with the highest unknown frequency, with a gap as large as 10.97 p.p. when compared with the second highest (transgender) for GPT-3.5. There is also a consistent cross-model trend for sexuality personas, where the straight person persona has a lower unknown frequency than queer personas.

Are personas less biased against themselves?

To examine how personas affect bias against their own demographic group, we select personas with demographics represented in BBQ and compare their overall bias score (averaged across all target groups) with their self-bias (e.g., the bias of gay person against gay people). We average the bias scores across models and use them to rank personas. Table 5 shows the persona ranks for self-bias and overall bias rankings.

Table 5. Persona ranks for self-bias (out of 193), self-accuracy, overall bias, and overall accuracy.

https://doi.org/10.1371/journal.pone.0325664.t005

We find that personas indeed exhibit lower bias against their own group—i.e., lower bias scores in examples involving their group—than they do in the average case: all 18 personas represented in BBQ have better self-bias ranks than overall ranks—12 of them are top-ranked, being less biased than all other personas and the no persona baseline. Some of the rank changes are quite dramatic: person of Christianity belief is one of the most biased personas overall, but the least biased against Christians.

However, personas are also less accurate in cases involving their demographic (self-accuracy)—all 18 personas have worse ranks for their own demographic than overall, five of them reaching the bottom rank. The differences are also striking: the pansexual person persona, for example, drops from fourth place overall to near the bottom of the ranking for its own demographic.

To investigate the discrepancy between bias improvement and accuracy degradation, we establish two comparisons. Table 6 compares personas’ self-accuracies with the average accuracy across all personas (e.g., for instances involving gay people, we compare the accuracy of gay person with the average persona accuracy). Table 7 compares the rate at which personas answer with their own demographic against the average rate across all personas (e.g., we compare how often gay person selects a gay person as the answer with how often a gay person is selected on average). We find that personas have lower self-bias but lower self-accuracy because they are more likely to answer with their own identity in ambiguous cases (decreasing accuracy), and do so more frequently in cases that contradict societal stereotypes (decreasing bias).

Table 6. Differences between the average accuracy (across all personas) and the accuracy of personas when answering questions involving their own demographic.

https://doi.org/10.1371/journal.pone.0325664.t006

Table 7. Differences between the frequency with which each demographic is selected as the answer by the persona of the same demographic and the average frequency (across all personas).

https://doi.org/10.1371/journal.pone.0325664.t007

6 RQ3: Effect of personas on attitudes and annotations

Another use case of personas is simulating human behaviors, with applications in diverse fields such as education, psychology, healthcare, and law [43]. Accurate simulation requires not only that (1) persona-assigned LLMs’ responses to psychological questionnaires match human expectations and that (2) responses in downstream tasks match human expectations, but also that (3) the link between values and responses matches those expectations. For example, assuming that social workers are likely to have high empathy and that high empathy is associated with attributing high toxicity ratings to racist tweets, accurate simulation would entail:

  1. a social worker persona scoring high empathy level;
  2. a social worker persona assigning high toxicity ratings to racist tweets; and
  3. empathetic personas assigning high toxicity to racist tweets.

This section investigates how much the associations between personas’ values and behaviors mirror those of humans. To this end, we adapt a previous study [21] that examines the link between human annotators’ attitudes and their annotations of toxic language.

6.1 Data

Attitude questionnaires. We use the questionnaires collected by Sap et al. [21], which were originally created by prior work in social psychology and political science. They cover seven attitude dimensions: valuing the freedom of offensive speech [44], perceiving the harm of hate speech [44], endorsement of racist beliefs [45], traditionalism [46], language purism [21], empathy [47], and altruism [48]. Each attitude questionnaire contains between two and five statements, each followed by a question asking for the reader’s level of agreement on a scale from 1 to 5. Table 1 shows an example question from the freedom of offensive speech questionnaire. S1 File contains the full questionnaire data.

The original questionnaires were composed of 27 items, which may be too few to reliably measure personas’ attitude scores. To improve the reliability of results, we prompt GPT-4 to generate 30 prompt paraphrases for each item and average the returned scores. We paraphrase the instructions rather than the questionnaire statements to avoid changing questionnaires’ semantics. The paraphrases are generated through nucleus sampling with .95 as the cumulative probability threshold. Table 8 shows the instructions used to generate the paraphrases.

thumbnail
Table 8. Prompts fed to GPT-4 to generate instruction paraphrases for the attitude questionnaires.

https://doi.org/10.1371/journal.pone.0325664.t008

Toxicity data. The dataset is composed of 626 tweets drawn by Sap et al. [21] from existing toxic language detection corpora. Each tweet contains information on whether it targets black people, is written in African-American English (AAE), or includes vulgar language. The dataset also includes demographic information (gender, ethnicity, age, and political inclination) and attitude values (measured by the attitude questionnaires described above) of 184 annotators recruited by Sap et al. [21] using Amazon Mechanical Turk, with their corresponding annotations on the offensiveness and racism levels of tweets (on a Likert scale from 1 to 5). The pool of annotators varied racially, politically, and in gender, though it skewed white, male and liberal. Each tweet was annotated by six crowdworkers: two white conservative annotators, two white liberal annotators, and two black annotators. Table 1 shows two example tweets and the racism and offensiveness scales.

Sap et al. [21] used the data to examine the associations between annotators’ attitudes and their toxicity ratings for the three tweet categories above. In our experiments, each tweet is fed twice to each persona: once for racism annotation and once for offensiveness annotation.

Metrics.

As attitude scores, we report the average questionnaire response for each attitude dimension. For toxicity, we measure average offensiveness and racism ratings (to compare personas’ sensitivity to toxicity) and agreement with human annotations (Krippendorff’s alpha [49, 50]).
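The agreement metric can be sketched as follows. This is a simplified implementation of Krippendorff's alpha for interval data that assumes complete data (every unit rated by every coder, no missing values); published analyses typically use a dedicated implementation that also handles missing ratings:

```python
def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for interval data, alpha = 1 - D_o / D_e,
    using squared differences as the distance function.
    ratings: list of units, each a list of numeric ratings (complete data)."""
    pooled = [v for unit in ratings for v in unit]
    n = len(pooled)  # total number of pairable values
    # Observed disagreement: squared differences within each unit,
    # normalized by the number of ratings per unit minus one.
    d_o = 0.0
    for unit in ratings:
        m = len(unit)
        if m < 2:
            continue
        s = sum((a - b) ** 2 for a in unit for b in unit)  # ordered pairs
        d_o += s / (m - 1)
    d_o /= n
    # Expected disagreement: squared differences across all pooled values.
    d_e = sum((a - b) ** 2 for a in pooled for b in pooled) / (n * (n - 1))
    return 1.0 - d_o / d_e if d_e else 1.0
```

Perfect within-unit agreement yields alpha = 1; alpha near 0 means agreement is no better than chance.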

6.2 Attitude results

Fig 4 shows the distributions of personas’ scores for each attitude and model.

thumbnail
Fig 4. Distribution of attitude scores for each model.

The yellow line shows the no persona scores.

https://doi.org/10.1371/journal.pone.0325664.g004

Personas significantly affect attitude scores in most cases.

For each model and attitude, we run Friedman’s test [51] to reject the null hypothesis that personas’ questionnaire responses have the same distribution. We do not run Cochran’s test because it requires binary responses, whereas responses for the attitude and toxicity data are on a scale from 1 to 5. Table 9 shows all p-values. Most results are significant for control and regular personas, but control personas have more non-significant results (14 of 49 model-attitude pairs against five for regular personas). In most cases, personas did not significantly impact freedom of speech scores. Exceptions were GPT-4 (regular personas only), Mixtral (regular only), and Mistral-inst (regular and control).
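In practice one would compute this with a statistics library (e.g., `scipy.stats.friedmanchisquare`); the stdlib sketch below computes the Friedman chi-square statistic from scratch, with average ranks for ties but without the tie correction or the chi-square p-value lookup, as an illustration rather than the paper's actual procedure:

```python
def average_ranks(values):
    """Rank values from 1 upward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """Friedman chi-square statistic (no tie correction).
    blocks: list of per-subject responses, one value per treatment
    (here: one row per questionnaire item, one column per persona)."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        for j, r in enumerate(average_ranks(block)):
            rank_sums[j] += r
    s = sum(r * r for r in rank_sums)
    return 12.0 * s / (n * k * (k + 1)) - 3.0 * n * (k + 1)
```

The statistic is then compared against a chi-square distribution with k − 1 degrees of freedom to obtain the p-value.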

thumbnail
Table 9. P-values obtained through Friedman’s test for significance of the variability of persona’s attitudes for each model. We show in bold the non-significant results (significance level of .05).

https://doi.org/10.1371/journal.pone.0325664.t009

Regular personas have more diverse behaviors than control personas.

Regular personas exhibit greater attitude score variance than control personas for all attitudes and models examined. Similarly to the previous sections, control personas are concentrated around the no persona baseline. However, not even regular personas could cover the full range of attitude values. For example, personas rarely exhibit high racist belief scores, in most cases exhibiting scores around 3 or less (out of 5). There are some outliers, however. For GPT-3.5, Benito Mussolini, person with fascism ideology, and Adolf Hitler exhibited high racist belief scores: 4.61, 4.32, and 4.08, respectively.

Some persona rankings are consistent across models.

We identify consistent rankings using the procedure described in Sect 4. Fig 5 shows the personas with consistent rankings across models.

thumbnail
Fig 5. Attitude scores (averaged across models) for personas with consistent cross-model rankings.

The blue line shows the no persona scores.

https://doi.org/10.1371/journal.pone.0325664.g005

Freedom of speech: Education personas’ freedom of speech scores (averaged across models) increase with education level—with the exception that the uneducated person persona is on top. Further, man exhibited higher freedom of speech scores than all other gender personas.

Altruism: Average persona altruism scores are sorted in ascending order by their education level. Among the ideology personas, person with fascism ideology exhibited the lowest altruism score (1.93; the second lowest, person with conservatism ideology, had 3.44). In all models, the person of atheism belief scored lower on altruism than the religious personas. For Mixtral, person of atheism belief is tied with person of Judaism belief as the least altruistic personas.

Empathy: person with fascism ideology had the lowest score (2.72; the second lowest, person with nationalism ideology, had 3.29).

Language purism: In all models, transgender person and nonbinary person scored lower on language purism than man and woman.

Traditionalism: In all models, man scored higher for traditionalism than the other gender personas.

Are personas’ attitude associations similar to those of humans?

Even though personas significantly impact attitudes, personas’ attitudes may not correspond to human expectations. For example, one could expect that a persona with a high harm of hate speech score will also have a low score for racist beliefs. We explore this by comparing the associations between personas’ attitudes with those found in humans. We compute the Pearson correlations between the attitude scores of human annotators and between those of personas in each model (Fig 6). We then calculate the cosine similarity between the correlations for humans and those for the personas (Fig 7, left plot).
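The two quantities behind this comparison can be sketched with the standard formulas; the helper names below are illustrative, not taken from the paper's code:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences
    (e.g., two attitude dimensions measured over the same annotators)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cosine(u, v):
    """Cosine similarity between two vectors, e.g., the flattened
    entries of a human correlation matrix and a persona one."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

A cosine similarity near 1 means the persona correlation pattern points in the same direction as the human one; the random baseline gives the level expected from unstructured attitude values.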

thumbnail
Fig 6. Pearson correlations between attitudes for human annotators (top left plot) and each model’s personas.

We show in bold weight significant correlations (p<.05).

https://doi.org/10.1371/journal.pone.0325664.g006

thumbnail
Fig 7. Cosine similarity between human and model correlations (between attitudes on the left and between attitudes and annotations on the right).

The black horizontal line denotes the cosine similarity between human and random baseline correlations.

https://doi.org/10.1371/journal.pone.0325664.g007

Except for Gemma-2b-inst (the weakest model), personas in all models have higher similarity to humans than a random baseline in which personas have randomly distributed attitude values. This result indicates that personas’ attitude values somewhat mirror those present in humans. For example, for humans, there is a moderate negative correlation between altruism and racist belief, which is also present in all models (but Gemma-2b-inst).

6.3 Toxicity results

Fig 8 shows the distributions of personas’ toxicity metrics.

thumbnail
Fig 8. Distribution of toxicity scores for each model.

Top row: average offensiveness and racism ratings. Bottom row: agreement with human annotations for offensiveness and racism. The ratings are on a Likert scale from 1 (not at all offensive/racist) to 5 (extremely offensive/racist).

https://doi.org/10.1371/journal.pone.0325664.g008

Personas significantly affect toxicity scores.

In all cases, personas significantly impact models’ answer distributions (Friedman’s test). As in previous cases, regular personas had greater variability than control personas, which tended to concentrate around the no persona baseline—with some exceptions. For example, GPT-3.5 control personas rated the tweets as more racist than the no persona baseline and also had lower human agreement than the baseline. An interesting outlier for GPT-3.5 was the comedian persona, which labeled the tweets as having much lower offensive and racist content than all the other personas did.

How similar are the associations between personas’ attitudes and their toxicity ratings to those of human annotators?

Even though personas’ attitude associations are similar to human annotators’, associations between attitudes and toxicity ratings may differ for humans and personas. For example, one could expect that a persona with a high harm of hate speech score will also annotate tweets targeting black people as having higher racism and offensiveness scores. To investigate this, we compute the Pearson correlations between attitude scores and the average offensiveness and racism ratings given to three subsets of tweets: tweets in African-American English (AAE), tweets that target black people, and tweets with vulgar language. Fig 9 shows the obtained correlations. Fig 7 (right plot) shows the cosine similarity between humans and personas in each model.

thumbnail
Fig 9. Pearson correlations between attitudes and annotation statistics for human annotators (top left plot) and each model’s personas.

We show in bold weight significant correlations (p<.05).

https://doi.org/10.1371/journal.pone.0325664.g009

Personas’ correlations in all models but Gemma-2b-inst had greater cosine similarity with human correlations than the random baseline. This result indicates that not only do personas’ attitude associations relate to those of humans, but their attitude-annotation associations are also similar to those of humans (at least for the humans represented in the data). For example, for both humans and personas, harm of hate speech has a positive association with higher offensiveness and racism ratings for tweets targeting black people (except for Mixtral and Gemma-2b-inst personas).

However, persona behavior is less nuanced than that of humans. For example, the racist beliefs attitude in humans has a negative association with offensiveness scores for tweets targeting black people and a positive association with offensiveness scores for AAE tweets—which reflects annotators’ racism. On the other hand, personas’ associations generally do not distinguish AAE tweets from those targeting black people. An exception was Gemma-7b-inst, whose personas’ association between racist beliefs and offensiveness scores reflects that of humans. Gemma-7b-inst was also the model with the highest similarity to humans’ attitude-annotation correlations.

7 RQ4: Analysis of persona refusal

Models occasionally refuse to follow persona-assigned prompts by expressing an inability to perform the task (e.g., I’m sorry, but I can’t provide personal opinions or preferences), an inability to adopt the persona (e.g., I cannot be a gay person, as I am an artificial intelligence and do not have a gender or personal experiences), or a blanket refusal (e.g., I’m sorry, but I can’t assist with this request). The disparity of refusal rates across personas has implications for fairness (e.g., if personas from different demographic groups are treated differently) and reveals models’ underlying social biases.

This section examines how refusal rates differ across personas. We use regex patterns (code excerpt in S2 File) to identify model refusals. We then compute the refusal frequency for each model-persona pair in each dataset.
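A refusal detector along these lines can be sketched as below. The patterns shown are illustrative assumptions modeled on the example refusals above; the actual patterns used in the paper are given in S2 File:

```python
import re

# Illustrative refusal patterns (the patterns actually used in the paper
# are in S2 File); matched case-insensitively anywhere in the response.
REFUSAL_PATTERNS = [
    r"i['’]?m sorry,? but i can['’]?t",
    r"i cannot (?:be|assist|provide)",
    r"as an ai(?: language model)?",
]
REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)

def is_refusal(response):
    """Flag a model response as a refusal if any pattern matches."""
    return REFUSAL_RE.search(response) is not None

def refusal_rate(responses):
    """Fraction of responses flagged as refusals for one
    model-persona pair on one dataset."""
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Regex-based detection is cheap and transparent, but the pattern list must be curated against the models' actual outputs to avoid missed paraphrases.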

7.1 Results

Fig 10 shows average (across datasets) refusal rates for all personas and models. Fig 11 shows refusal rates for each dataset. Personas significantly impact the refusal rates (Cochran’s Q test, p-value <.001) for almost all models and datasets. The only exception is Zephyr, for which personas’ impact on refusals in the racism annotation task (Sect 6) was not significant.

thumbnail
Fig 10. Distribution of personas’ refusal rates (averaged across datasets).

https://doi.org/10.1371/journal.pone.0325664.g010

thumbnail
Fig 11. Distribution of personas’ refusal rates for each dataset.

https://doi.org/10.1371/journal.pone.0325664.g011

Refusals are arbitrary...

The results show a wide refusal rate disparity between personas. For example, GPT-4’s refusal rates in the attitudes task (Sect 6) for political figure personas range from 22.82% (Rosa Luxemburg, Polish-born German revolutionary and Marxist theorist) to 97.85% (Jörg Haider, Austrian far-right nationalist politician), even though the generated refusal rationale would apply to all personas in that category—I’m sorry, but I can’t provide a response as if I were Jörg Haider or any other real person. Moreover, refusals are arbitrary: semantically similar personas have different refusal rates. This holds not only for the control personas (semantically equivalent by construction) but also for some regular personas. For example, Gemma-7b-inst’s refusal rates (averaged across datasets) differed between black person and african-american person. While these personas do not strictly refer to the same demographic, they are closely related in the US context. Concerning refusals in the attitude questionnaires, GPT-4 is 5 times more likely to refuse black person than african-american person and 3 times more likely to refuse homosexual person than gay person.

...and disparate.

To further investigate refusal disparity, we compare the standard deviation of refusal rates of each persona category with the standard deviation of the control personas’ refusals (Fig 12). We consider a model to have disparate refusals for a given persona category when that category’s standard deviation is higher than that of the control personas.
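The disparity criterion amounts to a ratio of standard deviations; a minimal sketch (function and variable names are assumptions for illustration):

```python
from statistics import pstdev

def disparity_ratios(category_rates, control_rates):
    """Ratio of each persona category's refusal-rate standard deviation
    to that of the control personas. A ratio above 1 marks the category
    as having disparate refusals under the criterion described above.
    category_rates: dict mapping category name -> list of per-persona
    refusal rates; control_rates: list of control-persona refusal rates."""
    control_sd = pstdev(control_rates)
    return {cat: pstdev(rates) / control_sd
            for cat, rates in category_rates.items()}
```

Normalizing by the control standard deviation discounts variability that is attributable to mere prompt sensitivity rather than to the personas themselves.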

thumbnail
Fig 12. Ratios between the standard deviation of the refusal rates of each persona category and the control category.

https://doi.org/10.1371/journal.pone.0325664.g012

The results were model-dependent, ranging from four persona categories with disparate refusals (GPT-4) to all twelve categories having disparate refusals (Mistral-inst). Three persona categories are consistently disparate in all models: ideology, political figures, and specific professions. For ideology, models tended to refuse to adopt person with fascism ideology, which had the highest average refusal rate, followed by person with nationalism ideology. Considering political figures, the Adolf Hitler persona had the highest average refusal rate, followed by Jörg Haider. We could not find similar trends for profession personas, as different models (dis)favored different professions.

The ideology and political figure disparities are arguably a feature, not a bug: it may be desirable that models refuse personas that may lead to harmful generations at higher rates. However, we have also identified several disparities in refusals that could be considered unfair and lead to further marginalization of underprivileged demographic groups. Sexuality and race have disparities in 6 out of 7 models: all but GPT-4 for sexuality and all but GPT-3.5 for race. Black person was the most refused persona from the race category in 5 out of 7 models, with white person in second place on average. Regarding sexuality personas, homosexual person was the most refused by 4 out of 7 models, while straight person was the least refused by 6 out of 7 models.

7.2 Implications of arbitrary and disparate refusals

These disparities have ethical and practical implications:

Fairness: Disproportionate refusals against marginalized groups would reduce their ability to see themselves represented in AI-generated interactions and reinforce systemic exclusion. LLMs that systematically refuse to adopt certain identities cannot be used in contexts that require diverse perspectives, such as education, content moderation, or AI-assisted storytelling.

Safety: Model developers must balance the need to prevent the adoption of harmful identities with the importance of ensuring that diverse perspectives are included. While some refusals can be protective (e.g., person with fascism ideology and Adolf Hitler), others (e.g., black person) will reduce representation.

Trust: Inconsistencies in refusal rates—such as the disparate treatment of gay person and homosexual person—cast doubts on the reliability of LLMs. Users may find it difficult to trust models that appear arbitrary or inconsistent in their refusal behaviors.

Addressing arbitrary and disparate refusal is crucial for the responsible development of LLMs, and future research should focus on balancing safety and representation to foster inclusivity without compromising ethical standards.

8 Conclusion

We presented a study investigating how persona assignment impacts LLMs’ task performance, biases, attitudes, and refusals. Our experimental setting covering 192 personas and seven LLMs from diverse families and sizes showed that personas have a measurable effect on those dimensions of LLM behavior—often in ways that are consistent across models. The results have implications for different goals of persona usage:

Task improvement. While expert personas outperform non-experts, improvements over the no persona baseline were inconsistent and domain-dependent. Simply using an expert persona had limited effectiveness, as only some experts in each expertise group surpassed the baseline. Moreover, the best persona for a task was not always an obvious choice—the atheist persona in TruthfulQA, for example. These results are a middle ground between previous works’ positive and negative results: personas often outperformed the no persona and control baselines, but improvements were not always due to expert personas. How to generate and describe effective personas remains an open question.

Personalization. LLMs consistently refused to adopt personas from certain demographics, preventing the adoption of particular viewpoints. Persona-assigned LLMs often exhibited higher bias levels, so personalization might reinforce stereotypes and negative portrayals of certain demographic groups. On the other hand, personas were less biased against their demographic group, showing potential as a bias mitigation tool. Our results reveal a bias-accuracy trade-off, so if future works use personas for debiasing, we recommend that the evaluation setting include both bias and correctness (e.g., trustworthiness, factuality) metrics.

Simulation. Personas in our setting exhibited associations between attitudes and annotations similar to human annotators’. However, that was the case only for the stronger LLMs, and even then, LLMs’ annotation behavior was less nuanced than that of humans. Our refusal analysis demonstrates that simulation is also compromised by LLMs consistently refusing to adopt some personas.

The results also highlight tensions and trade-offs of persona usage:

Simulation vs. performance. Simulation is often at odds with performance. For example, we observed a correlation between personas’ education level and task performance. This might be desirable in a simulation setting—where behavior fidelity is the goal—but undesirable in a task improvement setting—where accuracy is the goal.

Safety vs. simulation. From a safety perspective, it makes sense to impose guardrails that prevent users from simulating personas capable of creating harmful responses (e.g., a fascist persona that generates extreme and hateful responses). From a simulation perspective, however, it may be beneficial to be less strict: simulating problematic personas may support studies that generate insights and understandings that can concretely mitigate harms (e.g., including fascist personas in simulations to understand how extreme and hateful ideologies spread).

These tensions are further complicated by the fact that persona effects vary across models. While our study focuses on identifying generalizable persona effects, the differences we observe suggest that pretraining distributions, fine-tuning objectives, and model architectures may all contribute to how LLMs express personas. Further research is needed to characterize and disentangle these influences and understand their implications for both model development and the responsible deployment of persona-based interactions.

Our findings have implications for different stakeholders involved in LLM development, regulation, and use:

Model developers: Addressing inconsistencies in persona performance requires improving model alignment and personalization techniques to ensure that personas behave predictably across tasks and persona demographics. Conducting prompt sensitivity analyses—testing how different prompt formulations influence persona responses—could help diagnose sources of inconsistency and inform strategies for enhancing persona performance and reliability.

Policy and safety regulators: Regulating persona-based interactions should consider trade-offs between fairness, safety, and inclusion. Clear guidelines are needed on when refusals are justified for safety reasons versus when they introduce unfair exclusions. Regulators should actively involve diverse stakeholders—including developers, ethicists, and potential users—in the regulatory process to ensure that multiple perspectives inform these guidelines.

Users: Users should be aware that persona-based LLM interactions can reflect social biases and may exhibit inconsistent behaviors. Expert personas should be used with caution, as they may not outperform baselines.

We encourage interdisciplinary research efforts that investigate how to conceptualize and balance these tensions and trade-offs.

Supporting information

S1 File. Attitude questionnaires.

Questionnaires used to measure personas’ attitudes.

https://doi.org/10.1371/journal.pone.0325664.s001

(JSON)

S2 File. Code excerpts.

Code used for answer extraction (multiple-choice and Likert-scale questions) and model refusal identification.

https://doi.org/10.1371/journal.pone.0325664.s002

(PY)

Acknowledgments

We thank Paul Röttger, Anastasiia Sedova, Andreas Stephan, and Yuxi Xia for the valuable discussions and feedback. We are thankful for the credits from the OpenAI API Research Access Program. We acknowledge EuroHPC Joint Undertaking for awarding us access to MeluXina at LuxProvide, Luxembourg.

References

  1. 1. Yuan A, Coenen A, Reif E, Ippolito D. Wordcraft: story writing with large language models. In: 27th International Conference on Intelligent User Interfaces. IUI ’22. New York, NY, USA: Association for Computing Machinery; 2022. p. 841–52.
  2. 2. Zan D, Chen B, Zhang F, Lu D, Wu B, Guan B, et al. Large language models meet NL2Code: a survey. In: Rogers A, Boyd-Graber J, Okazaki N, editors. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics; 2023. p. 7443–64. https://aclanthology.org/2023.acl-long.411
  3. 3. Kasneci E, Sessler K, Küchemann S, Bannert M, Dementieva D, Fischer F, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn Individ Diff. 2023;103:102274.
  4. 4. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40. pmid:37460753
  5. 5. Mehdi Y. Announcing the next wave of AI innovation with Microsoft Bing and Edge. 2023. https://blogs.microsoft.com/blog/2023/05/04/announcing-the-next-wave-of-ai-innovation-with-microsoft-bing-and-edge/
  6. 6. OpenAI. OpenAI: Introducing ChatGPT. 2022. https://openai.com/blog/chatgpt
  7. 7. Kirk HR, Vidgen B, Röttger P, Hale SA. The benefits, risks and bounds of personalizing the alignment of large language models to individuals. Nat Mach Intell. 2024;6(4):383–92.
  8. 8. Deshpande A, Murahari V, Rajpurohit T, Kalyan A, Narasimhan K. Toxicity in chatgpt: Analyzing persona-assigned language models. In: Bouamor H, Pino J, Bali K, editors. Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics; 2023. p. 1236–70. https://aclanthology.org/2023.findings-emnlp.88
  9. 9. Wan Y, Zhao J, Chadha A, Peng N, Chang KW. Are personalized stochastic parrots more dangerous? Evaluating persona biases in dialogue systems. In: Bouamor H, Pino J, Bali K, editors. Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics; 2023. p. 9677–705.
  10. 10. Kim M, Kim M, Kim H, Kwak Bw, Kang S, Yu Y, et al. Pearl: a review-driven persona-knowledge grounded conversational recommendation dataset. In: Ku LW, Martins A, Srikumar V, editors. Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics; 2024. p. 1105–20. https://aclanthology.org/2024.findings-acl.65
  11. 11. Hu T, Collier N. Quantifying the persona effect in LLM simulations. 2024.
  12. 12. Wang N, Peng Zy, Que H, Liu J, Zhou W, Wu Y, et al. RoleLLM: benchmarking, eliciting, and enhancing role-playing abilities of large language models. In: Ku LW, Martins A, Srikumar V, editors. Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics; 2024. p. 14743–77. https://aclanthology.org/2024.findings-acl.878
  13. 13. Salewski L, Alaniz S, Rio-Torto I, Schulz E, Akata Z. In-context impersonation reveals large language models’ strengths and biases. Adv Neural Inf Process Syst. 2023;36:72044–57.
  14. 14. Kong A, Zhao S, Chen H, Li Q, Qin Y, Sun R, et al. Better zero-shot reasoning with role-play prompting. In: Duh K, Gomez H, Bethard S, editors. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics; 2024. p. 4099–113. https://aclanthology.org/2024.naacl-long.228
  15. 15. Lin S, Hilton J, Evans O. TruthfulQA: measuring how models mimic human falsehoods. In: Muresan S, Nakov P, Villavicencio A, editors. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics; 2022. p. 3214–52.
  16. 16. Zheng M, Pei J, Logeswaran L, Lee M, Jurgens D. When ”A Helpful Assistant” is not really helpful: personas in system prompts do not improve performances of large language models. In: Al-Onaizan Y, Bansal M, Chen YN, editors. Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics; 2024. p. 15126–54. https://aclanthology.org/2024.findings-emnlp.888
  17. 17. Gupta S, Shrivastava V, Deshpande A, Kalyan A, Clark P, Sabharwal A, et al. Bias runs deep: implicit reasoning biases in persona-assigned LLMs. In: The Twelfth International Conference on Learning Representations; 2024.
  18. 18. Jiang G, Xu M, Zhu SC, Han W, Zhang C, Zhu Y. Evaluating and inducing personality in pre-trained language models. Adv Neural Inf Process Syst. 2023;36:10622–43.
  19. 19. Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C, Wingate D. Out of one, many: using language models to simulate human samples. Polit Anal. 2023;31(3):337–51.
  20. 20. Lee C, Gligorić K, Kalluri PR, Harrington M, Durmus E, Sanchez KL, et al. People who share encounters with racism are silenced online by humans and machines, but a guideline-reframing intervention holds promise. Proc Natl Acad Sci U S A. 2024;121(38):e2322764121. pmid:39250662
  21. 21. Sap M, Swayamdipta S, Vianna L, Zhou X, Choi Y, Smith NA. Annotators with attitudes: how annotator beliefs and identities bias toxic language detection. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics; 2022. p. 5884–906.
  22. 22. Wang Z, Mao S, Wu W, Ge T, Wei F, Ji H. Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 2024. 257–79.
  23. 23. Li G, Hammoud H, Itani H, Khizbullin D, Ghanem B. CAMEL: communicative agents for “Mind” exploration of large language model society. Adv Neural Inf Process Syst. 2023;36:51991–2008.
  24. 24. Plaza-del-Arco FM, Curry AC, Curry A, Abercrombie G, Hovy D. Angry men, sad women: Large language models reflect gendered stereotypes in emotion attribution. 2024.
  25. Cheng M, Durmus E, Jurafsky D. Marked personas: using natural language prompts to measure stereotypes in language models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2023. p. 1504–32.
  26. Kovač G, Sawayama M, Portelas R, Colas C, Dominey PF, Oudeyer PY. Large language models as superpositions of cultural perspectives. 2023.
  27. Miotto M, Rossberg N, Kleinberg B. Who is GPT-3? An exploration of personality, values and demographics. In: Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS), Abu Dhabi, UAE; 2022. p. 218–27.
  28. OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 technical report. 2024.
  29. Holtzman A, Buys J, Du L, Forbes M, Choi Y. The curious case of neural text degeneration. In: Eighth International Conference on Learning Representations; 2020.
  30. OpenAI. OpenAI GPT-3.5 API. 2024. https://platform.openai.com/docs/models/gpt-3-5-turbo
  31. Jiang AQ, Sablayrolles A, Roux A, Mensch A, Savary B, Bamford C. Mixtral of experts. 2024.
  32. Tunstall L, Beeching E, Lambert N, Rajani N, Rasul K, Belkada Y, et al. Zephyr: direct distillation of LM alignment. 2023.
  33. Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, de las Casas D. Mistral 7B. 2023.
  34. Gemma Team, Mesnard T, Hardin C, Dadashi R, Bhupatiraju S, Pathak S, et al. Gemma: open models based on Gemini research and technology. 2024.
  35. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; 2020. p. 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
  36. Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D. Measuring massive multitask language understanding. In: International Conference on Learning Representations; 2020.
  37. Parrish A, Chen A, Nangia N, Padmakumar V, Phang J, Thompson J. BBQ: a hand-built bias benchmark for question answering. In: Findings of the Association for Computational Linguistics: ACL 2022; 2022. p. 2086–105.
  38. Cochran WG. The comparison of percentages in matched samples. Biometrika. 1950;37(3–4):256–66.
  39. Kendall MG. A new measure of rank correlation. Biometrika. 1938;30(1–2):81–93.
  40. Akoglu H. User’s guide to correlation coefficients. Turk J Emerg Med. 2018;18(3):91–3. pmid:30191186
  41. Gilpin AR. Table for conversion of Kendall’s tau to Spearman’s rho within the context of measures of magnitude of effect for meta-analysis. Educ Psychol Measur. 1993;53(1):87–92.
  42. Wilcoxon F. Individual comparisons by ranking methods. Biom Bull. 1945;1(6):80–3.
  43. Cheng M, Piccardi T, Yang D. CoMPosT: characterizing and evaluating caricature in LLM simulations. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; 2023. p. 10853–75.
  44. Cowan G, Resendez M, Marshall E, Quist R. Hate speech and constitutional protection: priming values of equality and freedom. J Soc Issues. 2002;58(2):247–63.
  45. McConahay JB. Modern racism, ambivalence, and the modern racism scale. In: Dovidio JF, Gaertner SL, editors. Prejudice, discrimination, and racism. Academic Press; 1986. p. 91–125.
  46. Bouchard TJ Jr, McGue M. Genetic and environmental influences on human psychological differences. J Neurobiol. 2003;54(1):4–45. pmid:12486697
  47. Pulos S, Elison J, Lennon R. The hierarchical structure of the interpersonal reactivity index. Soc Behav Pers. 2004;32(4):355–9.
  48. Steg L, Perlaviciute G, van der Werff E, Lurvink J. The significance of hedonic values for environmentally relevant attitudes, preferences, and actions. Environ Behav. 2012;46(2):163–92.
  49. Krippendorff K. Content analysis: an introduction to its methodology. Sage Publications; 2018.
  50. Krippendorff K. Estimating the reliability, systematic error and random error of interval data. Educ Psychol Measur. 1970;30(1):61–70.
  51. Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937;32(200):675–701.