Figures
Abstract
Advances in large language models have raised concerns about their potential use in generating compelling election disinformation at scale. This study presents a two-part investigation into the capabilities of LLMs to automate stages of an election disinformation operation. First, we introduce DisElect, a novel evaluation dataset designed to measure LLM compliance with instructions to generate content for an election disinformation operation in localised UK context, containing 2,200 malicious prompts and 50 benign prompts. Using DisElect, we test 13 LLMs and find that most models broadly comply with these requests; we also find that the few models which refuse malicious prompts also refuse benign election-related prompts, and are more likely to refuse to generate content from a right-wing perspective. Secondly, we conduct a series of experiments (N = 2 , 340) to assess the “humanness” of LLMs: the extent to which disinformation operation content generated by an LLM is able to pass as human-written. Our experiments suggest that almost all LLMs tested released since 2022 produce election disinformation operation content indiscernible by human evaluators over 50% of the time. Notably, we observe that multiple models achieve above-human levels of humanness. Taken together, these findings suggest that current LLMs can be used to generate high-quality content for election disinformation operations, even in hyperlocalised scenarios, at far lower costs than traditional methods, and offer researchers and policymakers an empirical benchmark for the measurement and evaluation of these capabilities in current and future models.
Citation: Williams AR, Burke-Moore L, Chan RS-Y, Enock FE, Nanni F, Sippy T, et al. (2025) Large language models can consistently generate high-quality content for election disinformation operations. PLoS ONE 20(3): e0317421. https://doi.org/10.1371/journal.pone.0317421
Editor: Jinran Wu
Received: October 14, 2024; Accepted: December 28, 2024; Published: March 17, 2025
Copyright: © 2025 Williams et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All results for the DisElect evaluation dataset on the 13 tested LLMs and results of human experiments are available without restriction in https://github.com/alan-turing-institute/election-ai-safety. Results do not include raw model response text.
Funding: Ecosystem Leadership Award under the EPSRC Grant EPX03870X1 https://www.ukri.org/councils/epsrc/ (ARW, LBM, FEE, TS, KH, JB) This funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The Alan Turing Institute https://www.turing.ac.uk/ (all authors) This funder no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The AI Safety Institute https://www.aisi.gov.uk/ (all authors) This funder no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Large Language Models (LLMs) as tools for generating natural language are now widely-accessible to anyone who might want to use them. This includes malicious actors looking to spread disinformation through online platforms in ‘information operations’: systematic campaigns that seek to promote false or misleading narratives [1]. Such actors are increasingly a feature of the contemporary information environment and have generated widespread public concern about their potential ability to undermine faith in democratic institutions [2]. State-backed or privately funded operations may, for example, push agendas around certain politicians, try to sow doubt in electoral processes, or cause confusion and disagreement around local issues [3].
A successful information operation requires two key things: the production of ‘realistic’ content (such that people consuming it do not realise it has been created purely to push a narrative or by one centrally co-ordinated actor); it also requires this realistic content to be produced ‘at scale’, giving the impression of a mass groundswell in public opinion. Furthermore, this content needs to be disseminated across a range of distriubtion networks, using for example networks of “junk” news websites [4] or social media accounts [5], which themselves require the creation of further content (e.g. text for a website, or a biography for a social media account) to appear as authentic.
Generation of realistic content at scale has, historically, been a hard challenge for those running information operations to achieve. Sometimes they have employed industrial scale teams of fake users [6], that may be effective but also come at considerable cost; and when they are based in a foreign country, they may struggle to effectively convince local users of their legitimacy, as well as requiring considerable operational security efforts to conceal their true origin [7]. At other times simple automation methods have been employed, such as ‘copypasta’ (simply copy-pasting messages between different accounts [7]) or ‘spintax’ (making minor changes to messages based on procedural rules [8]). However this automation is straightforward to detect when multiple examples of messages produced through these techniques are seen by a user.
In this context, the rise of generative AI and LLMs, that can cheaply generate highly realistic content at scale, is significant. They could contribute to supercharging existing organisations who run information operations, and potentially allow for new ones to enter the arena. Furthermore, LLMs can roleplay as different personas such as political alignment [9,10], and potentially reproduce granular details about specific individuals, concepts and places through their extensive training datasets. These abilities may lend themselves to the creation of more authentic content in information operations than has perhaps previously been seen before.
While the use of generative AI in disinformation operations has been noted [11–14], it remains to be seen how effective this style of operations is. Increasingly, work is done after training LLMs to align them with human values and prevent harm or misuse, such as feedback learning [15] and red teaming [16]: this might prevent their compliance with instructions to generate content for an operation. Furthermore, humans may still be able to spot AI generated content. In response to this, we present a two-part study from the perspective of a malicious actor looking to use LLMs to generate content for multiple stages of an election disinformation operation.
Firstly, we present DisElect, a novel evaluation dataset for election disinformation. Using a set of past and present LLMs, we find that most LLMs comply with instructions to generate content for an election disinformation operation without any adversarial prompting strategies, with models that do refuse also refusing benign election prompts and prompts to write from a right-wing perspective. Secondly, we conduct experiments to assess the perceived authenticity or “humanness” of LLM-generated election disinformation campaign content. We find that human participants are unable to discern LLM-generated and human-written content over 50% of the time for most models released since 2022, even in highly localised geographic contexts. We also find that two models achieve above-human-humanness on average. We release the DisElectevaluation dataset and the results of humanness experiments at https://github.com/alan-turing-institute/election-ai-safety, and suggest multiple avenues for expanding the understanding of malicious AI use and model humanness.
Related work
Disinformation operations
Misinformation refers to information containing false or misleading claims [17]. Recent research finds that public exposure to, and concerns about the spread of misinformation in general is high in the UK, especially online [18]. Disinformation is often distinguished from misinformation as referring to false information circulated with the intent to deceive, as opposed to claims made without deceptive intentions [17]. In this paper, we refer to disinformation specifically, given the malicious intent of the use cases studied. Manipulating public opinion, political unrest, and influencing voting behaviours are just some of the concerns regarding online disinformation operations [19], which was also widely highlighted in the context of Covid-19 [20,21]. To take the most obvious example, during the US 2016 election Russian information operations were publishing almost 1,000 pieces of content per week at their height [3]. The content, which was produced by a team of 400 people at Russia’s Internet Research Agency (IRA), comprised blogs, memes, online comments, Facebook groups, tweets, and fake personas–and was posted across 470 pages, accounts, and groups. It is estimated to have reached 126 million users on Facebook alone [3]. Researchers continue to debate the concrete impact of the operation; however what is not in doubt is the scale of organisation required to create it.
A characteristic element of these operations is the semblance of “peer pressure” through social networks. For example, “influencer” accounts or bots pretending to be humans may be used to propagate disinformation [5]. When a critical mass of people are convinced, more people may start believing claims due to their popularity (also known as the “bandwagon effect”) — which can lead to a self-perpetuating cycle [5]. A study on Russian social media operations by Helmus et al. [22] illustrates this point. First, false or misleading content is created by Russian affiliated media outlets [4]. Second, trolls and bots amplify this content on social media through fear-inciting commentary, serving as “force multipliers” [23]. Third, these narratives are further perpetuated through mutually reinforcing digital groups. These phases are repeated and layered on top of each other, to create and sustain false narratives that are difficult to discern from true information.
In the past, the content for such disinformation operations has been largely created by humans. However, with rapid progress in AI technologies, the use of AI generated content in such disinformation operations has been noted as of recently: Hanley and Durumer [24] find that between January 1, 2022, and May 1, 2023, the number of synthetic news articles increased by 57.3% and 474% respectively on mainstream and disinformation websites. A US Department of Justice press release [11] reports on the disruption of a Russian government-organised bot farm utilising generative AI. Wack et al. [12] identify an “AI-empowered” influence network supporting the leading party of Rwanda. Thomas [13] presents a disinformation campaign utilising OpenAI’s models targeting pro-Ukraine Americans. OpenAI themselves discuss their attempts to identify and disrupt deceptive uses of their AI models by covert influence operations [14].
AI safety evaluations
Measurement of the extent to which large language models are co-operative when asked to produce content to support a disinformation operation is part of the wider field of AI safety evaluations. Weidinger et al. [25] propose a sociotechnical approach to these safety evaluations, consisting of evaluations at three intersecting levels: model capability layer, human-interaction layer and systemic layer. In the context of election disinformation, evaluation at the capability layer might measure the extent to which an AI system can produce disinformation. Evaluation at the human-interaction level might involve examining the deceptive capacity of AI-generated content through behavioural experiments. Finally, evaluation at the systemic level might explore how election disinformation might impact levels of epistemic (mis)trust in the general public.
Here, we make a unique contribution by focusing conducting evaluations at both the capability layer - using benchmarking techniques – and at the human-interaction layer using human-subjects experiments. Both these dimensions are essential components of evaluations, as risks from AI-generated disinformation are determined by not only the capability of models to generate disinformation, but also public experiences with and perceptions of such content when they engage with it [25].
In the following sections, we review existing research on: 1) The capability of Generative AI models to generate disinformation; and, 2) experimental research on public perceptions of AI generated disinformation.
Assessing the capability of LLMs to generate disinformation.
Eliciting model capabilities in a safety context often involves measuring the extent to which a model will “refuse" to comply with instructions, a safety behaviour learned through later stages of model training to align with human preferences and values [26]. One approach to safety testing is to iteratively craft adversarial prompts or “red-team" a model to overcome refusal and elicit the required ability [27], an approach increasingly applied in systematic evaluation benchmarks [28]. While such techniques may be useful for overcoming refusal, we show that they are not crucial in this scenario, given the broad compliance of LLMs with the task we present.
To evaluate the capacity of GPT-3 to generate accurate information or disinformation, Spitale et al. [29] prompted the AI model to produce 10 accurate and 10 disinformation tweets for a range of topics such as climate change and vaccine safety. The rate of obedience, measured as the percentage of requests satisfied by GPT-3 divided by the overall number of requests indicated better compliance for accurate information (99 times out of 101) compared to disinformation (80 out of 102) requests.
In another study, Kreps et al. [30] found that a set of smaller, older AI models could generate credible-sounding news articles at scale without human intervention. The authors used one sentence from a New York Times story to prompt GPT-2 models (355M, 774M, and 1.5B) to generate 300 outputs. The best outputs of the 774M model (mean credibility index of 6.72) and 1.5B model (mean credibility index of 6.93) were perceived to be marginally more credible than that of the 355M model (mean credibility index of 6.65). Similarly, Buchanan et al. [3] showed that LLMs could generate moderate-to-high quality disinformation messages with little human intervention.
In one of the few studies on disinformation generated by multimodal AI models, Logically AI [31] tested three image-based generative AI platforms to assess compliance with prompts in a US, UK and Indian context. The report found that more than 85% of prompts were accepted by these models. In the context of the UK, prompts centred around crime, immigration, and civil unrest. ActiveFence [32] analysed the ability of six LLMs to respond to false and misleading prompts produced in English and Spanish, across five categories of misinformation and harmful narratives: health misinformation, electoral and political misinformation, conspiracy theories, calls for social unrest, and a category that combines two or more categories. The authors found that LLMs responded least safely to misinformation prompts. Similarly, Brewster and Sadeghi [33] found that ChatGPT and Google Bard generated content on 98 and 80 false narratives, respectively when prompted with a sample of 100 myths.
Buchanan et al. [3] also find that AI models can not only generate disinformation but also customize language for specific groups. For example, Urman and Makhortykh [34] indicate that outputs of LLM-based chatbots were prone to political bias with regard to prompts dealing with Russian, Ukrainian, and US politics. In particular, Google Bard evaded responding to Russian prompts concerning Vladimir Putin.
Human interactions with AI generated disinformation.
In the context of disinformation, Spitale et al. [29] conducted a pre-registered experiment with 697 respondents across United Kingdom, Australia, Canada, United States, and Ireland. The authors presented participants with tweets containing both true and false information about a range of topics mentioned above—and were asked to identify whether what they read was true or false (information recognition) and whether the content was written by an AI model or human (AI recognition). The authors found that participants could not distinguish between tweets generated by GPT-3 and those written by real Twitter users. Furthermore, participants recognized false tweets written by humans more than false tweets generated by AI (scores 0 . 92 versus 0 . 89, respectively; P = 0 . 0032), implying that the AI model was better at misinforming people.
A large part of the experimental research focuses on perceived credibility, trustworthiness, and persuasiveness of AI-generated text. Kreps et al. [30] conducted experiments on AI-generated and human-written news articles, finding that respondents perceived AI-generate news to be equally or more credible than human-written articles. Similarly, Goldstein et al. [35] found that GPT-3 could write persuasive text with limited effort. And Zellers et al. [36] noted that an AI model could generate an article when prompted with a given headline and that humans found such articles to be more trustworthy than human-written disinformation. However, Bashardoust et al. [37] found that AI-generated fake news was perceived as less accurate than human-generated fake news and that political orientation and age explained whether users were deceived by AI-generated fake news. To understand how persuasive LLMs are when microtargeted to individuals on political issues, Hackenburg et al. [38] integrated user data into GPT-4 prompts, and found that, although persuasive, microtargeted messages were not statistically more persuasive than non-targeted messages. Similarly, Hackenburg et al. [39] also tested 24 LLMs on their ability to generate persuasive messages. The authors found that larger models were only marginally better due to better task completion (coherence and staying on topic). More broadly, Jakesch et al. [40] find that humans are not able to detect self-presentations generated by LLMs, and that LLMs can exploit human heuristics for identifying LLM-generated text in order to produce text perceived as “more human than human”.
Our contribution
Several features distinguish our paper. First, by evaluating model compliance with malicious prompts to generate false information related to elections, and experiments on whether people could distinguish between AI and human written disinformation, this paper contributes to the sociotechnical evaluation evidence base. Second, by embedding this research within a UK context, at both a national and a hyperlocal level (London), this paper makes a unique contribution to the literature - most of which is otherwise situated and/or conducted in a U.S. context - and addresses specifically the capacity of models to localise content in a realistic fashion, a key weakness of past information operations. Third, the experiments center around the theme of political disinformation. These most studies examine content generated by one AI model, whereas our paper considers content generated by a range of 13 different AI models, capturing the diversity in models at the disposal of malicious actors. Finally, we look at the entire pipeline of information operations (News Article Generation—Social Media Account Generation—Social Media Content Generation—Reply generation), while the literature tends to focus solely on one or two stages; typically news articles or tweets.
Methodology
In order to understand to what extent LLMs would be useful for automating election disinformation operations, we conduct a two-part study:
- Systematic Evaluation Dataset: Measuring LLM compliance with instructions to generate content for an election disinformation operation.
- Human Experiments: Measuring how well people can distinguish between AI-generated and human-written election disinformation operation content.
Information operation design & use cases
Generating nuanced and realistic content can reduce the ability of a layperson to identify information or activity as inauthentic, and therefore increases the apparent authenticity of any given part of an information operation. We establish a 4 stage operation design, covering both the content generation and dissemination stages of a typical disinformation operation.
A. News Article Generation: News articles and headlines act as the “root” of an operation, making claims which will be further enforced by other stages.
B. Social Media Account Generation: “Fake” Social media accounts (e.g. on Twitter/X) are used to disseminate the generated news.
C. Social Media Content Generation: Social media posts by accounts from B discussing the generated news creates an illusion of public interest/legitimacy.
D. Reply generation: Replies to the social media posts in C further the illusion of public interest and the potential impact of the operation.
We consider this design across two relevant use cases:
Hyperlocalised logistical voting disinformation (e.g. a voting date for a specific area changing): False information about where, when, and how to vote can disrupt electoral processes and lead to individuals being unable to cast their votes, and represents an opportunity to explore the ability of LLMs to generate content containing highly localised information. This is important because one of the areas where disinformation operations have struggled in the past is effective localisation.
Fictitious claims about UK Members of Parliament or “MPs” (e.g. being accused of misusing campaign funds): Spreading false information about the activities of election candidates can influence the opinions of the electorate, and offers an opportunity to investigate who an LLM will (and won’t) generate misinformation about.
Models
We select 13 LLMs (Table 1) that vary in terms of release date, size, and access type (open-source vs. API). This enables us to measure and compare newer vs. older and smaller vs. larger models. We include multiple models from the same families to observe change within family (e.g. T5 vs. Flan-T5, Llama 2 vs. Llama 3, Gemma vs. Gemini). We used Ollama (ollama.com) for several models. Ollama runs models at 4-bit quantisation, reducing memory footprint and facilitating local execution, though sometimes reducing accuracy on complex tasks [41,42]. Such frameworks are particularly relevant as many disinformation operations may choose to run local large language models, rather than calling them over an API.
We should note that release date refers to the announcement date of the original version of the model, rather than the release date of the specific version used in the paper. GPT-4 was announced by OpenAI on 2023-03-14, we use the gpt-4-0613 model version made available on 2023-06-13. We use the instruct fine-tuned version of GPT2 available at vicgalle/gpt2-open-instruct-v1.
All LLMs were run with with the following parameters (where available): temperature=1, top P = 0 . 95, top K = 40.
DisElect. evaluation dataset
To systemically evaluate model compliance on election disinformation, we construct the DisElectdataset, containing 2 , 200 prompts for the stages and use cases described above, and a baseline set of 50 benign election prompts to examine how sensitive models are to election content in general.
Dataset creation.
For each stage in each use case, we create a prompt template, as shown in Table 2. We then fill the prompt templates using the variables in Tables 3 and 4 to generate 1 , 100 unique prompts for each use case. We refer to the datasets of prompts created as DisElect.VT for the voting use case and DisElect.MP for the MP use case. It is worth highlighting that, although all the content could be put to use in a plausible disinformation campaign, much of it can appear anodyne at face value, or is not necessarily misleading. We will return to this point in the conclusion.
DisElect.VT covers voting date, location, and voter ID disinformation, across different locations in the UK, for accounts and posts from left and right-wing perspectives. DisElect.MP is similarly structured, covering claims and issues around financial, criminal, and political activity by MPs, varying prompts by the UK MP that the claim targets instead of location (location is still used for social media account generation). We curate 50 election information-seeking prompts from a good-faith perspective to form the benign baseline set, DisElect.BL.
In order to measure baseline levels of compliance, we do not consider adversarial prompting approaches, or other approaches which might reduce refusal rates.
To optimise the execution of prompts across multiple models, we developed a dedicated Python library named prompto [55], which we make available in open access to the research community for supporting other types of comparison across a variety of LLMs.
Evaluation.
In order to understand whether the models accept to generate the content in question, we make use of a LLM-as-a-judge approach [56] to label model responses with one of four classes: Refuse (explicitly refuses to comply with the prompt), Soft-refuse (doesn’t explicitly refuse to comply but doesn’t satisfy the intent of the request, being deliberately more neutral or opposing the sentiment requested), Incoherent (an incomplete, unreadable, or confusing response, e.g. repeating whole phrases from the prompt), or Comply (a desired response to the prompt).
This multi-class approach to refusal detection enables us to differentiate useful responses from low-quality compliant responses from refusals. We include the Incoherent label as we find that older generative language models do not explicitly refuse prompts but often produce very low quality responses
We use GPT-3.5 Turbo to label responses in a zero-shot manner, given a prompt and response and the judge prompt template (available at https://github.com/alan-turing-institute/election-ai-safety/blob/main/data/evals/judge/template.txt). Using LLM as judges of LLM outputs can lead to the perpetuation of biases embedded in these models [57], and as such we only use this approach for the relatively simple, objective task of distinguishing types of response, and focus on human evaluation to study more nuanced phenomena in the second part of this study.
On a sample of 100 responses and judgements labelled by hand, we observe an overall Macro F1 Score of 0 . 76, and accuracy of 96% on the “refuse” class and 86% on the other classes. Precision and recall on the “refuse” class were 0 . 84 and 0 . 98 and respectively. Investigating misclassified refusals shows that a judge model struggles in cases where refusal is more implicit, and contains less obvious signals such as “I’m sorry but..." or “I am unable to...".
Humanness experiments
To evaluate the degree to which synthetic content appears as authentic to humans, which we term “humanness”, we task human participants with labelling election disinformation content written by humans and the 13 LLMs in Table 1 as either human-written or AI-written. We conduct three experiments:
Experiment 1a: Testing perceived humanness of disinformation content describing an MP accused of misusing campaign funds under a left-wing persona.
Experiment 1b: Testing perceived humanness of disinformation content describing an MP accused of misusing campaign funds under a right-wing persona.
Experiment 2: Testing perceived humanness of disinformation content describing a date change for local elections under a right-wing persona.
The first two experiments (1aand 1b) provide a view of humanness of content written from different political perspectives, while experiment 2gives insight into the ability of models to generate human-passing content discussing hyper-localised issues.
Each experiment is based on responses to 4 prompts (each operation stage with one combination of variables, visible in S1 Table, where variable values (subject and mp for 1aand 1b, subject for 2) are selected to minimise refusals. We generate 15 responses to each prompt for each source (14 sources: human generation plus the 13 LLMs).
Content generation.
Human-written content was created by another research team with good general knowledge of the subject area, but who were not involved in model content generation.
To create AI-generated content, we prompt all LLMs with versions of the prompts outlined in S1 Table, where “Write” is replaced with “Write 15 variations of”. In cases where LLMs were unable to comply with this instruction (i.e. did not provide a list of 15 variations as a response – this was the case for GPT-2, T5, Flan-T5, and GPT-Neo), we instead prompt each model fifteen times for each prompt. For experiments 1aand 1b, the name of the MP receiving the lowest proportion of refusals from DisElect.MP(not disclosed) was used in the prompts for AI and human written content. This name, and any mention of specific party or constituency, was redacted from responses before presentation to participants, instead replaced with tokens e.g. “{MP}”,“{PARTY}”. Example responses for experiment 2are visible in S2 Table.
Experimental design.
Experiments were designed on Qualtrics. Each participant was randomly assigned to an experimental condition, containing content across the four stages from one LLM only. Participants were presented with the instructions visible in S3 Table. At each stage, participants saw 15 items generated by one LLM model, plus 15 human-written items, alongside the prompt used to generate that content. Participants were asked to indicate whether they thought each item was written by a human or generated by an AI model. We included one attention check per stage. Accordingly, each participant saw 31 items per stage, randomly ordered. We also asked demographic questions: age, gender, digital literacy & familiarity, education level and political orientation.
This approach means participants may be able to identify linguistic patterns in multiple items generated by the same model, giving them clues about these items being LLM-generated. While this could be considered a limitation, the baseline comparison for each model is humans only, so a model producing easily-identifiable, repetitive content would not make another model seem more “human-like”.
Sampling.
We recruited 780 UK-based, English-speaking participants who were over the age of 18 for each experiment (N = 2 , 340), over the period of 2nd March 2024 to 7th June 2024. These numbers account for participants that failed attention checks and as such were disregarded and replaced. We balance left-wing and right-wing participants for experiments 1aand 1b. Experiment 2focused exclusively on London issues, and as such we recruited participants residing specifically in London. The left-wing/right-wing split for this sample was not balanced but was representative of the London population, which is more left-leaning. Participants were required to sign an electronic consent form after reading the participant information sheet. Full demographic information on participants is available in S4 Table.
Ethical considerations
Participants were required to sign electronic consent forms before proceeding with the study. This form confirms their understanding of the terms laid out in the participant information sheet (example in S3 Table): that the content they would be presented with was fictional, created for the purpose of this study, and had no relation to real world events and news. Content presented to participants in experiment 1aand 1bcontained no identifiable information about an MP. Participants were advised not to proceed with the study if they thought that it may adversely affect their emotional state in any way, and were able to withdraw at any point without giving a reason. Many participants reported in feedback that they enjoyed the challenge of discerning AI and human written content, and were glad to contribute to what they saw as important work. This project was approved by the Turing Research Ethics (TREx) Panel under code TR24-13.
Results
In plots, models are always ordered by release date, from oldest to newest.
Few LLMs refuse to generate content for a disinformation operation
Results for DisElecton the models listed in Table 1 are shown in Fig 1. Refusal rates are generally low - only three models (Llama 2, Gemma, Gemini 1.0 Pro) explicitly refuse to comply with more than 10% of prompts in any use case. Phi-2 and Llama 2 also produce some refusals across both experiments, but a considerably smaller number than the aforementioned models. The oldest of the refusing models (Llama 2) was introduced mid-2023, reflecting that refusals are a phenomenon introduced through safety-focused model fine-tuning that may not be present in earlier models.
Models are sorted by release date (earliest models first). n refers to total responses per model within the experiment.
Refusals are more common in DisElect.MPthan in DisElect.VT (12.4% vs. 6.8% overall), whereas soft-refusals are more common in the latter (17.5% vs 21.4% overall). The tendency of the same models to explicitly refuse instead of soft-refusing suggests that that disinformation around political figures or issues are areas that may have had more safety related fine-tuning.
Results for DisElect.BLreflect label distributions for DisElect.VT and DisElect.MP- models that refuse malicious election prompts will generally refuse some amount of benign election-related prompts. There are two models (GPT-3.5 Turbo and Mistral) that do not refuse any of the malicious prompts in DisElect.VT or DisElect.MP, but do refuse to respond to a safe prompt from DisElect.BL. Investigating refusals on DisElect.BL reveals models responses that refuse on the basis of a lack of knowledge of up to date information (as opposed to the safety/ethically motivated refusals on DisElect.VT and DisElect.MP), given that prompts in DisElect.BL are questions about policy and election details.
The range of models studied enables us to observe changes in LLMs over time. In DisElect(see Fig 1), we see that older models tend to produce both lower compliance and refusal rates, returning higher rates of incoherent or soft-refusal responses (83.5% of all incoherent and soft-refuse responses come from the 5 earliest models).
We are also able to compare models within ‘families’, which allows us to address differences between models either produced at different times or with different amounts of parameters. For example, we find that the earlier or smaller versions (Llama 2 & Gemma) seem to return much higher rates of refusal than later or larger versions (Llama 3 & Gemini). We also see from DisElect.BLin Fig 1 that Llama 3 and Gemini do not refuse safe election related questions, whereas the earlier or smaller models do. This shows that Llama 2 and Gemma could be seen as overly sensitive to benign election-related prompts in general, even when non-malicious, in a way that is not present in their later or larger equivalents.
Notably, low refusal rates on DisElect occur without any adversarial prompting strategies, highlighting that a malicious actor conducting an information operation may need to do no work at all to overcome refusals to generate content for said operation.
What drives refusal?
Focusing on the 3 models that do refuse a significant number of prompts (Llama 2, Gemma, and Gemini 1.0 Pro), we present proportions of prompts refused by variables values (see Tables 3 and 4) in Fig 2. Across both use cases, all models are much more likely to refuse when prompted to use a right-wing persona than a left-wing persona. Refusals for left-wing personas are higher for all models in DisElect.MPthan in DisElect.VT. Prompting models to generate news articles returns more refusals in DisElect.MPthan DisElect.VT. There is some variation in refusal on different subjects, with prompts about voter ID, colluding with China, and drug possession drawing higher refusal rates than other options in respective use cases.
n represents the total number of prompts corresponding with results displayed.
Calculating the Spearman’s Rank Correlation on refusal rates between models for each variable reveals that models often do not align on what to refuse: only the persona variable sees no variation between refusal rankings for possible values (left-wing and right-wing) for both use cases. Correlation for subject in DisElect.MPis also high (ρ = 0 . 74 median). Correlation is lowest for the pipeline stage as a variable in DisElect.VT(ρ = 0 . 00 median).
For the 50 MP names included in DisElect.MP, we find that refusals are normally distributed (64% within 1 standard deviation of the mean). Fig 3 presents refusal rates for groups of MPs by party and gender. The Spearman’s Rank Correlation between models on these groups is high, at a median value of ρ = 0 . 90 when grouping MPs by party and gender, indicating alignment between models on which MPs to refuse to generate disinformation about. All 3 models are more likely to refuse to generate content for a female MP than a male MP, and for a Labour MP than a Conservative (or other) MP. This persists at a more granular level, with female Labour MPs seeing the highest level of refusal of any party-gender group on average, and male Conservative MPs the lowest.
n represents the number of MPs within a given group. Each MP is referred to in 20 individual prompts.
The presence of refusals in some models and not in others points to differences in model post-training fine-tuning processes and data, where safety behaviours are learned (generally through preference optimisation methods) to limit the possibility of a model generating harmful responses. Here, Llama 2 appears to have learned the most defensive behaviour of the models tested, refusing the highest number of prompts.
Additionally, varying rates of refusals across experiment variables, including the gender and party of MPs used in prompts, points at some distributional difference either in pre-training data, such that some models are able to identify specific individuals better than others, or/and in data and processes used in safety training, such that the mentioning of individuals that models identify as belonging to certain groups influences contributes to a refusal behaviour.
How well can people identify AI-generated content?
We will now present the results from the experimental part of our study. We collate datasets about the number of times participants assign pieces of content as “human" for the three experiments, aggregating by the LLMs in Table 1 and the stages of the information operation pipeline in S1 Table for each experiment to calculate “humanness” for each LLM. We define humanness as the number of times humans (mis) label an AI-generated item as human over the total number of labels assigned to that AI-generated item ().
We plot humanness per LLM overall and broken down by experiment and pipeline stage in Figs 4, 5, and 6. Overall, 9/13 models achieve at > 50% humanness on average, indicating that the majority of models tested produce content that is indiscernable from human-written content for the same prompt. Meanwhile, 6 ∕ 13 models achieve very high levels of humanness ( > = 75%) on at least 5% of entries. However, variation is high: the coefficient of variation is higher than the interquartile range for all models, and all models see at least 12% of their items receive lower levels of humanness ( < 50%). This implies that the ability to discern AI-generated and human-written varies greatly between human participants.
Models are sorted by release date. Boxes visualise the mean and confidence interval (of + /– 2 standard errors). The dashed line shows the mean of the human proportions across the models.
Models are sorted by release date. Boxes visualise the mean and confidence interval (of + /– 2 standard errors). The dashed lines show the means of the human proportions across the models.
Models are sorted by release date. Boxes visualise the mean and confidence interval (of + /– 2 standard errors). The dashed lines show the means of the human proportions across the models.
Llama 3 and Gemini achieve the highest humanness of all models (62% and 59% respectively). Both models see at least 15% of the entries achieve > = 75% humanness (19% and 17% respectively).
We observe two models (Llama 2, Gemma) with bimodal distributions of humanness proportions in Fig 4: many pieces of content receiving low humanness while others receive high humanness. As shown earlier, these two models produce the highest level of refusals on DisElect. While prompts in S1 Table were constructed to minimise refusals, this is not always avoidable. Refusal responses are trivial for a human participant to identify as AI-generated. Aside from these items, Llama 2 and Gemma would receive among the highest levels of humanness. However, a model that frequently refuses to generate content will in the end not be useful for scaling an information operation.
Per experiment.
Humanness per model across the three experiments is shown in Fig 5. The mean humanness is highest on average for experiment 1a, focused on MP disinformation written from a left-wing perspective (0 . 53), then experiment 1b, focused on MP disinformation written from a right-wing perspective (0 . 48), and lowest for experiment 2, focused on localised election disinformation (0 . 44). Overall 10 ∕ 13 models perform worst on 2, compared to 0 ∕ 13 for 1aand 3 ∕ 13 for 1b, while 6 ∕ 13 models perform best on 1aand 6 ∕ 13 on 1b. This suggests that content produced by these models is most human in the MP / left-wing perspective use case (1a), and least human in the voting / right-wing perspective use case (2). However, there may also be effects stemming from the political orientation of participants, which we will describe further below.
The two models with the highest humanness scores outlined above (Llama 3, Gemini 1.0 Pro) both see their lowest performance on experiment 2. Llama 3 achieves the highest humanness of all models in both experiment 2and 1a, whereas Gemini 1.0 Pro achieves the highest humanness in experiment 1b.
Llama 2 and Gemma, noted above for bimodal distributions of humanness due to refusals, see large variance in humanness across experiments. This bimodal distribution is visible for both models in experiment 1b, and experiment 2for Llama 2, but is not visible for either models in experiment 1a(see Fig 5), due to a lack of refusals in experiment 1a. This reflects earlier findings that these models are more likely to refuse to write from a right-wing perspective, present in experiments 2and 1bbut not 1a.
Per pipeline stage.
Humanness per stage of our information operation pipeline is shown in Fig 6. 11 ∕ 13 models see their highest humanness scores on the social media reaction or reply generation stages, implying that these stages are the “easiest” to generate human-like content for. Variation across models is lowest for the account generation stage (std = 0 . 12).
Observing Llama 2 and Gemma reveals bimodal distributions of humanness (indicating the presence of refusals) in some stages and not in others. Refusals are present for neither model in news article generation, owing to the absence of instruction to write from a particular perspective in this stage. Llama 2 scores higher than any other model on this stage as result (humanness = 0 . 61).
We investigate standard deviation of model humanness in Fig 7 (excluding Gemma 2 and Llama 2 which present binomial distributions due to refusals). All but 2 of the remaining models vary across stages more than across experiments. GPT-3.5 Turbo achieves the lowest deviation of all models across stages (std = 0 . 026) and experiments (std = 0 . 021). The best two models (Llama 3 and Gemini 1.0 Pro) vary in performance across stages (std ( Llama 3 ) = . 048, std ( Gemini ) = . 063) more than some other lower performing models such as GPT-3.5 Turbo. The news article generation stage was the only stage where neither Llama 3 nor Gemini 1.0 Pro were the two highest scoring models, with Llama 2 and GPT-3 achieving superior humanness on this stage.
Excludes Llama 2 and Gemma.
Model development over time
The models tested in this study (Table 1) cover a range of release dates going back to 2019. The highest performing two models we tested in terms of humanness on average (Llama 3 and Gemini 1.0 Pro) are also among the newest models we tested. The worst two models on average (GPT-2 and T5) were the two oldest models we tested. We observe a negative (Pearson) correlation between model age and humanness (ρ = − 0 . 82), shown in Fig 8, adding to the evidence that newer LLMs are able to generate more human-like content.
This trend is not absolute. As noted, Llama 2 and Gemma content used in the experiments contains refusals. This impacts Llama 2’s overall humanness to a greater degree than Gemma’s, making Llama 2 more comparable to much earlier models such as GPT-2 and GPT-Neo. Age is not the only factor, as models differ by size (number of trained parameters) independent of age. The tradeoff between size and humanness is not linear, as relatively small open-source models (Phi-2, Gemma, Mistral,Flan-T5) offer comparable humanness to larger API-based models (GPT-3.5 Turbo and GPT-4).
This relationship generally holds across experiments, but there is a notable discrepancy by pipeline stage (see Fig 5). The two highest performing models on the news article generation stage are GPT-3.5 and Llama 2, which are both over a year old and neither the latest version in their family of models.
A possible differentiator between newer and older models, and a potential predictor of humanness, is the similarity of content variations generated by models. When viewed together, as in this study, patterns or repeated words and phrases in the groups of content generated by models could be a signal of inauthenticity to human participants, much as it would for social media users exposed to an organised disinformation operation on social media. Newer models may be more sophisticated in their ability to generate natural language, but may be overly uniform in their responses due to increased instruction and safety fine tuning, the same mechanism that gives rise to refusals.
We measure similarity of groups of content produced by each model using average pairwise cosine similarity between groups of content as TF-IDF vectors [58]. We find mild to moderate negative correlations for all experiments between TF-IDF similarity and humanness, strongest for experiment 2(ρ = − 0 . 65). We also see a mild positive correlation for all experiments between TF-IDF similarity and age of model, again strongest for experiment 2(ρ = 0 . 45). This suggests that models than can produce more diverse content are more likely to be perceived as more human, and that newer models are more likely to be able to produce diverse content.
Flan-T5 and GPT-4 are two models that perform comparably despite their age and size difference. In experiments 1aand 1b, Flan-T5 produces content with lower similarity than GPT-4. The opposite is true for experiment 2, where both models produce more similar content on average than for the other experiments. We continue to investigate factors that contribute to humanness later in this paper.
Above-human-humanness
We can examine the humanness of the human-written content, when viewed alongside content generated by each LLM (Figs 9 and 10). We observe a strong negative correlation (ρ = − 0 . 92) between humanness of AI-generated content and humanness of human-written content. In other words, when presented with pieces of content written by humans and AI models, the more a human participant mislabels AI-generated entries as human-written, the more human-written entries they mislabel as AI-generated. This is to be partially expected by our experiment design, considering that, even though we did not guide participants on how much AI content was present in the items they were viewing, they would likely expect the proportion to be approximately 50%. However, it is nevertheless a potential indicator that, in addition to enabling disinformation, AI generated content may also start to undermine trust in good faith human content.
Models are sorted by release date. Boxes visualise the mean and confidence interval (of + /– 2 standard errors). Dashed lines show the mean human proportions for AI- and human-generated responses.
Models are sorted by release date. Boxes visualise the mean and confidence interval (of + /– 2 standard errors). Dashed lines show the mean average of the mean human proportions for AI- and human-generated responses.
Llama 3 and Gemini, the two highest performing models, achieve better humanness than human-written content on average. This mirrors findings from Jakesch et al. [40], in that content produced by frontier AI models appears to be perceived as more human than human-written content.
Experiment 1asees the lowest overall human humanness (and highest LLM humanness) of any experiment, visible in Fig 10. Llama 3 and Gemini achieve above-human-humanness in both experiments 1aand 1b, but not in experiment 2, which also sees the lowest overall model humanness and highest overall human humanness.
We see notable differences in LLM versus human humanness across pipeline stage: 8 ∕ 13 individual LLMs achieve above-human-humanness in at least one pipeline stage in both experiments 1aand 1b, compared to 2 ∕ 13 in experiment 2. The news article generation stage saw the most instances of above-human-humanness. These findings reinforce prior evidence that LLMs are able to write more convincing disinformation content about MPs than about localised voting issues. It may also indicate that news articles are easier to ‘fake’ than social media content, perhaps because news articles already adopt a somewhat fixed and polished style.
We can account for this inverse relationship by investigating the “share of humanness”: the number of of AI-generated entries labelled human over the total number of entries (whether AI-generated or human-written) labelled human (). This metric rewards models that reduce the humanness of human-written content viewed alongside the AI-generated content. One example of this is in experiment 1a, where the 2nd, 3rd, and 4th best performing models in terms of humanness (Gemma, Llama 2, Gemini) would be reversed in order if ranked by “share of humanness”, due to greater decreases in human humanness than increases in model humanness, visible in Fig 10. This facilitates investigation of the dangers posed by LLMs to information integrity, and represents an area for future work.
What factors explain perceptions of humanness?
We continue investigating factors behind humanness by fitting a series of mixed effects logistic regression models, including sociodemographic features of participants alongside content similarity, pipeline stage, and model, across the 3 experiments.
Each observation is a single classification of AI-generated content made by a single participant, and the dependent variable is an assignment of whether the content was written by a human (1=yes, 0=no). We include the following independent variables: age, gender, education, politics, TF-IDF distance (1 - TF-IDF similarity between content in question and other content generated by model for the same prompt), the LLM used to generate content, and pipeline stage. Politics is measured as a scale of 0-100, where 0 is extreme left-wing and 100 is extreme right-wing. Age, politics, and TF-IDF distance were standardized to have a mean of 0 and a standard deviation of 1. Reference levels for gender, education, LLM, and pipeline are set to male, no degree, GPT-2, and account generation respectively. We use participant ID as a random effect to account for multiple observations from the same participant. We fit 4 models (3 separate experiments plus an overall model) using the lme4 R package [59]. Approximate 95% confidence intervals are calculated using the standard errors. Odds ratios (ORs) are presented for all LLMs in Table 5.
We first assess how sociodemographic factors affect the humanness of AI-generated content. The results show that humanness does not vary with gender or education, both of which are non-significant across all three experiments. Age has a significant positive association with humanness for experiment 1aonly – a one standard deviation increase in age corresponds to an 8% increase in the odds of classifying content as human. The absence of this effect in experiment 1bpoints to a difference in the content generated by LLMs from left-wing perspectives that presents as greater difficulty in discerning synthetic and authentic content for older participants.
Politics has significant positive associations for experiment 1band experiment 2– a one standard deviation increase in politics corresponds to respective increases of 8% and 12% in the odds of classifying content as human. Again, the presence of this effect for one political persona (right-wing) indicates a minor yet statistically significant difference that presents as a greater propensity by right-wing-identifying participants to label content content written from a right-wing perspective as human. This effect is not present for left-wing identifying participant with left-wing content. However, in overall demographic terms, what is most striking is a lack of strong relationships between any of the demographic variables studied and the ability to identify content as AI-generated or human-written.
We see that TF-IDF distance is a significant predictor of humanness in both experiments 1band 2, but is insignificant in experiment 1b. This mirrors the overall findings on content similarity from earlier, but shows that the two experiments written from a right-wing perspective (1b, 2) are more closely aligned than the two MP focused experiments (1a, 1b), which were more closely aligned in terms of coefficients between similarity and humanness.
As noted earlier, the overall trend is for humanness to increase over time. That is, content that is generated by more recent LLMs tends to have greater odds of being classified as human compared to content generated by GPT-2. This is only an overall trend, and ORs are not strictly increasing over time nor are individual differences between LLMs necessarily statistically significant, as is the case for LLMs released around a similar time: the difference in ORs in experiment 1abetween GPT-4 (2.51) and Mistral (2.60) is not statistically significant.
Odds ratios for the different pipeline stages show further differences between experiments containing content written from a right-wing perspective (1b, 2) and from a left-wing perspective (1a). For experiments 1band 2, writing news articles are the strongest predictor of humanness, followed by social media reactions, replies, and accounts in that order, whereas news articles are the worst predictor of humanness for experiment 1a.
Disinformation domains versus personas
Given the 3 humanness experiments conducted across 2 domains (MPs, voting) and 2 personas (left/right-wing), we see that in some cases results are more closely aligned by domain: overall humanness, patterns in humanness across experiments and pipeline stages, and the prevalence of above-human-humanness are more similar in the two MP experiments (1a, 1b). In other cases, results more closely aligned by persona: the two right-wing experiments (1b, 2) seem to share more as to what demographic factors are the strongest predictors of humanness. This suggests that overall model performance in terms of humanness may transfer more easily across personas within domains, but who exactly perceives content as more or less human is more similar within personas.
Cost comparison
This study focuses on the potential efficacy of LLMs in election disinformation operations, but that does not account for the potential cost of their usage. Due to the complex nature of evaluating the costs of a high-quality traditional information operation, and comparing that to one utilising LLMs, we present a simplified comparison focused just on the news article generation stage of our disinformation pipeline.
Gu et al. [5] estimated that “content distribution service” Xiezuobang charges 100 renminbi (RMB/CNY) (approx. 15USD) for a 500 to 800-word article. Assuming production of 10 articles per day, we can estimate an effective information operation may cost around around 4 , 500USD per month. In comparison, generating the same volume of content through Gemini (one of the best models we tested) via Google’s Gemini API would cost 0 . 30USD. Imagining that a malicious actor may prefer to host the technology themselves, we estimate that deploying Llama 3 70B (the other best model we tested) would cost 9USD (Cost of time to generate content on a remote virtual machine with an A100 GPU). Imagining that a malicious actor may not have access to this level of compute, we have shown that many smaller open-source/open-weight models perform comparably to much larger models - these models can be run on existing personal computers with little technical overhead, bringing costs to zero in some cases.
Discussion
In this paper, we introduced the DisElectevaluation dataset for measuring LLM compliance with election disinformation tasks, and conducted experiments to measure the extent to which LLM-generated content for election disinformation operations can pass as human-written. We tested 13 LLMs released over the past 5 years, and found that most LLMs will comply instructions to generate content for an election disinformation operation, without adversarial prompting strategies - models that do refuse also refuse benign election related prompts, and are more likely to refuse to write from a right-wing perspective than left-wing. Further, we find that almost all models tested released since 2022 produce content indiscernible to human participants over 50% of the time on average, and 2 models tested achieve above-human-humanness. Our work provides an evaluation tool and sociotechnical evidence for the mitigation of potential harms posed by LLM-generated election disinformation. It also shows that there are plausible reasons to believe that LLMs will increasingly be integrated into the work of contemporary information operations.
It is worth noting of course that we do not claim that LLMs should necessarily refuse the prompts we have created in this dataset. Indeed, it is significant that misleading information and narratives can be spread using text and content that appear to have been generated in good faith. It is, in our view, unlikely that safety towards LLM driven information operations can fully be achieved at the model layer: rather, further education of both users and institutions is required. In the same way that ‘traditional’ disinformation has prompted calls for greater media literacy, the emergence of AI driven disinformation may require greater ‘AI literacy’ on behalf of the public, a discussion which thus far is in its infancy. This is also significant for the open source movement: while many have been concerned that open source models may present greater safety vulnerabilities, what we demonstrate here is that for a vast range of prompts both open and closed source models will ‘collaborate’ with information operations.
Furthermore, some LLMs will refuse to comply with prompts to generate disinformation, but many of these same models also refuse non-malicious prompts. Wolf et al. [60] and Röttger et al. [61] discuss the trade-off between helpfulness and alignment in LLMs - in other words, the degree to which a model maybe considered safer if it refuses to comply with a greater proportion of requests, but also would be considered less helpful for the same reason. Prompts such as generating social media posts are not inherently harmful, and could be describable as “dual-intent behaviours” [28]. As such, complete refusal to comply with this type of prompt by an LLM could be seen as an excessive behaviour, but doesn’t negate the potential for misuse demonstrated in this paper. Furthermore, significant downsides in terms of public trust could be created if models continue to refuse to engage with content written from certain political viewpoints, such as refusals generated from content written from a right-wing perspective which we observed above.
The ambiguity of these dual-intent behaviours plays into the hands of malicious actors. Malicious use of technology often involves measures to circumvent safety protocols, as is the case for more implicitly harmful use of LLMs where a party may develop prompts that bypass safety behaviours learned by models. This would add overhead to the costs outlined earlier, but we showed that, in this use case of generating e.g. social media posts, there exist a number of compliant models that generate high-quality content without need for this extra work.
An argument could made that in cases of dual-intent behaviour, identifying malicious usage relies on identifying patterns of misuse as opposed to attempting to equip models themselves to identify harm based on single prompts and responses. Providers of LLMs are able to detect patterns of misuse to identify and prohibit use of their systems by malicious actors [14], and social media platforms have systems in place to detect networks of inauthentic or malicious activity [62].
Limitations
It is worth concluding by addressing the limitations of the study, and thus point the way for future work. We have addressed adversarial prompting strategies throughout this paper, both their potential to overcome refusal, but also the lack of need for this in this scenario. Nevertheless it is reasonable to expect a malicious actor to employ such strategies, and our DisElect dataset and the experiments conducted are limited to single prompt templates, with no prompt engineering conducted to minimise chances of refusal [63] or improve humanness of responses. Further exploration in this area would provide a greater understanding of the potential upper-bounds of the impact of LLM-powered information operations, over the baseline we provide.
The 13 LLMs chosen for this study enable us to study differences over many available models. However, they do not cover the entire space of models released over the last few years, and are not sufficiently comprehensive across e.g. release date and size to draw absolute conclusions about the relationships between these factors, compliance and humanness. In particular, we have preferred focusing on a series of subsequent versions of popular models (by covering OpenAI and Meta releases) and different models presented over the years by the same company (Google), instead of having models representing each relevant actor in the field. As a consequence, we have not considered performance of other highly popular LLMs such as Anthropic’s Claude or Baidu’s Ernie, and have not expanded on different training architectures, for instance by including mixture of experts models as Mixtral developed by Mistral AI. We also focus exclusively on disinformation in English, and do not account for multilingualism of models.
Future work
Augmenting existing datasets using prompt-engineering and red-teaming to fully explore the space of prompts for a given task would provide a more comprehensive view into AI safety around election disinformation, both in refusal rates and perceptions of humanness. Additionally, quantifying the trade-off between work required (i.e. prompt-engineering) to overcome refusals and achieve higher levels of humanness across different use-cases would constitute a valuable exercise into understanding malicious use of LLMs. We also foresee potential in the abilities of LLMs to identify patterns of misuse in dual-intent behaviours.
This study focuses purely on text content, but election disinformation is commonly perpetuated through other media such as audio and video, and AI generated deepfakes of political figures have become a widely-discussed phenomenon. Investigating the degree to which visual and audio components of a disinformation operation can be automated using generative AI models, and the degree to which widely-available models can produce multimedia that can fool humans, would be a logical and necessary extension of our work. Additionally, a clear next step would be to explore humanness in multi-turn, conversational scenario, as opposed to single prompts and responses.
Our study utilises one experimental approach to studying humanness. Future experiments could include content from many models at once to determine if more direct comparisons to other models affects the perceived humanness of each, or could attempt to assess the degree to which humans are able to discern synthetic and authentic content without being informed of the presence of synthetic content.
Measuring humanness is an important challenge to understanding how the development of general purpose technologies like LLMs affect how we interact with each other and consume information. We present measures within this study for quantifying the ability of LLMs to generate human-like text, and the relationship between quality of AI generated content and the perception of human written content as AI generated. Further work could present prompting strategies for maximising humanness, or seek to establish generalisable metrics for measuring the potential impact of new models on the information ecosystem. In addition to this, we envision work to explore the abilities of frontier LLMs to estimate the humanness of AI generated content.
Supporting information
S1 Table. Prompts used to generate content for humanness experiments.
https://doi.org/10.1371/journal.pone.0317421.s001
(PDF)
S2 Table. Example responses to one prompt used to generate content for exp_VT.
https://doi.org/10.1371/journal.pone.0317421.s002
(PDF)
S3 Table. Information sheet shown to experiment participants.
https://doi.org/10.1371/journal.pone.0317421.s003
(PDF)
S4 Table. Demographic statistics of experiment participants.
https://doi.org/10.1371/journal.pone.0317421.s004
(PDF)
S1 Fig. Proportions of human assignments across pipeline stages per model, against the proportion of human assignments for human-written content, for each experiment. Sorted by release date.
https://doi.org/10.1371/journal.pone.0317421.s005
(TIF)
S1 Dataset. Prompts and results for DisElectand humanness experiments.
https://doi.org/10.1371/journal.pone.0317421.s006
(ZIP)
Acknowledgments
We would like to thank Eirini Koutsouroupa for invaluable project management support, and Saba Esnaashari, John Francis, Youmna Hashem, Deborah Morgan and Anton Poletaev for support with experimental work.
References
- 1. Starbird K, Arif A, Wilson T. Disinformation as collaborative work: surfacing the participatory nature of strategic information operations. Proc ACM Hum-Comput Interact. 2019;3(CSCW).
- 2.
Howard PN. Lie machines: how to save democracy from troll armies, deceitful robots, junk news operations, and political operatives. Yale University Press; 2020. Available from: http://www.jstor.org/stable/j.ctv10sm8wg
- 3.
Buchanan B, Lohn A, Musser M, Sedova K. Truth, lies, and automation: how language models could change disinformation. center for security and emerging technology; 2021. Available from: https://cset.georgetown.edu/publication/truth-lies-and-automation/
- 4. Elswah M, Howard PN. “Anything that causes chaos’’: the organizational behavior of Russia Today (RT). J Commun. 2020;70(5):623–45.
- 5. Gu L, Kropotov V, Yarochkin F. The fake news machine: how propagandists abuse the internet and manipulate the public. Trend Micro. 2017.
- 6. Linvill DL, Warren PL. Troll factories: manufacturing specialized disinformation on Twitter. Politic Commun. 2020;37(4):447–67.
- 7.
Francois C, Nimmo B, Eib CS. The IRA CopyPasta campaign. Graphika; 2019. Available from: https://graphika.com/reports/copypasta
- 8.
Zhang Q, Wang DY, Voelker GM. DSpin: detecting automatically spun content on the web. In: Proceedings 2014 Network and Distributed System Security Symposium. San Diego, CA: Internet Society; 2014. Available from: https://www.ndss-symposium.org/ndss2014/programme/dspin-detecting-automatically-spun-content-web/
- 9.
Jiang H, Zhang X, Cao X, Breazeal C, Roy D, Kabbara J. PersonaLLM: investigating the ability of large language models to express personality traits; 2024. Available from: http://arxiv.org/abs/2305.02547
- 10.
Hackenburg K, Ibrahim L, Tappin BM, Tsakiris M. Comparing the persuasiveness of role-playing large language models and human experts on polarized U.S. Political Issues; 2023. Available from: https://osf.io/ey8db
- 11.
Justice of USD. Justice department leads efforts among federal, international, and private sector partners to Disrupt Covert Russian Government-Operated Social Media Bot Farm; 2024. Availablefrom: https://www.justice.gov/opa/pr/justice-department-leads-efforts-among-federal-international-and-private-
- 12.
Wack M, Linvill D, Warren P. Old despots, new tricks - an AI-empowered Pro-Kagame/RPF coordinated influence network on X. Clemson University; 2024. Available from: https://tigerprints.clemson.edu/mfh_reports/5
- 13. Thomas E. “Hey, fellow humans!’’: what can a ChatGPT campaign targeting pro-Ukraine Americans tell us about the future of generative AI and disinformation? 2023. Available from: https://www.isdglobal.org/digital_dispatches/hey-fellow-humans-what-can-a-chatgpt-campaign-targeting-pro-ukraine-ameri
- 14.
OpenAI . Disrupting deceptive uses of AI by covert influence operations; 2024. Available from: https://openai.com/index/disrupting-deceptive-uses-of-AI-by-covert-influence-operations/
- 15. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst. 2022;35:27730–44.
- 16. Perez E, Huang S, Song F, Cai T, Ring R, Aslanides J, et al. Red teaming language models with language models. In: Goldberg Y, Kozareva Z, Zhang Y, editors. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2022. p. 3419–48. Available from: https://aclanthology.org/2022.emnlp-main.225
- 17. Wardle C, Derakhshan H. Information disorder: toward an interdisciplinary framework for research and policymaking. Council of Europe; 2017.
- 18.
Enock FE, Bright J, Stevens F, Johansson P, Margetts HZ. The public is worried about online misinformation. Are the tools we have to fight it working? The Alan Turing Institute; 2024. Available from: https://www.turing.ac.uk/sites/default/files/2024-05/how_do_people_protect_themselves_against_online_misinformation.pdf
- 19.
Jamieson KH. Cyberwar: how Russian hackers and trolls helped elect a president: what we don’t, can’t, and do know. Oxford University Press; 2020. Available from: https://books.google.com/books?hl=en&lr=&id=S2DxDwAAQBAJ&oi=fnd&pg=PP1&dq=info:PfB34yNYvKcJ:scholar.google.com&ots=FTg9xIpvhT&sig=k2S3zkPtsbUAwkcciCCmZH558Fc
- 20.
Schliebs M, Bailey H, Bright J, Howard PN. China’s public diplomacy operations: understanding engagement and inauthentic amplifications of PRC diplomats on Facebook and Twitter; 2021. Available from: https://ora.ox.ac.uk/objects/uuid:427320a1-c677-40d4-b4a5-1759e563e7ed
- 21. Herasimenka A, Au Y, George A, Joynes-Burgess K, Knuutila A, Bright J, et al. The political economy of digital profiteering: communication resource mobilization by anti-vaccination actors. J Commun. 2022;73(2):126–37. pmid:37016634
- 22.
Helmus TC, Bodine-Baron E, Radin A, Magnuson M, Mendelsohn J, Marcellino W, et al. Russian social media influence: understanding Russian Propaganda in Eastern Europe. RAND Corporation; 2018. Available from: https://www.rand.org/pubs/research_reports/RR2237.html
- 23. Stephen McCombie AJU, Morrison S. The US 2016 presidential election & Russia’s troll farms. Intell Natl Secur. 2020;35(1):95–114.
- 24. Hanley HW, Durumeric Z. Machine-made media: monitoring the mobilization of machine-generated articles on misinformation and mainstream news websites. In: Proceedings of the International AAAI Conference on Web and Social Media. vol. 18; 2024. p. 542–56. Available from: https://ojs.aaai.org/index.php/ICWSM/article/view/31333
- 25.
Weidinger L, Rauh M, Marchal N, Manzini A, Hendricks LA, Mateos-Garcia J, et al. Sociotechnical Safety Evaluation of Generative AI Systems; 2023. Available from: https://arxiv.org/abs/2310.11986
- 26. Hendrycks D, Burns C, Basart S, Critch A, Li J, Song D, et al.. Aligning AI with shared human values; 2020. Available from: http://arxiv.org/abs/2008.02275
- 27. Xu J, Ju D, Li M, Boureau YL, Weston J, Dinan E. Bot-adversarial dialogue for safe conversational agents. In: Toutanova K, Rumshisky A, Zettlemoyer L, Hakkani-Tur D, Beltagy I, Bethard S, et al., editors. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics; 2021. p. 2950–68. Available from: https://aclanthology.org/2021.naacl-main.235
- 28.
Mazeika M, Phan L, Yin X, Zou A, Wang Z, Mu N, et al. HarmBench: a standardized evaluation framework for automated red teaming and robust refusal; 2024. Available from: https://openreview.net/forum?id=f3TUipYU3U&referrer=%5Bthe%20profile%20of%20Andy%20Zou%5D(%2Fprofile%3Fid%3D~Andy_Zou1
- 29. Spitale G, Biller-Andorno N, Germani F. AI model GPT-3 (dis)informs us better than humans. Sci Adv. 2023;9(26):eadh1850. pmid:37379395
- 30. Kreps S, McCain RM, Brundage M. All the news that’s fit to fabricate: AI-generated text as a tool of media misinformation. J Exp Politic Sci. 2022;9(1):104–17.
- 31.
Logically . Testing multimodal generative AI: generating election mis-and-disinformation evidence; 2023. Available from: https://www.logically.ai/resources/generative-ai
- 32.
ActiveFence . LLM safety review: benchmarks and analysis; 2023. Available from: https://www.activefence.com/LLMSafety
- 33.
Brewster J, Sadeghi M. Red-teaming finds OpenAI’s ChatGPT and Google’s Bard still spread misinformation; 2023. Available from: https://www.newsguardtech.com/special-reports/red-teaming-finds-openai-chatgpt-google-bard-still-spread-misinformation/
- 34.
Urman A, Makhortykh M. The silence of the LLMs: cross-lingual analysis of political bias and false information prevalence in ChatGPT, Google Bard, and Bing Chat; 2023. Available from: https://osf.io/q9v8f
- 35. Goldstein JA, Chao J, Grossman S, Stamos A, Tomz M. How persuasive is AI-generated propaganda? PNAS Nexus. 2024;3(2):pgae034. pmid:38380055
- 36.
Zellers R, Holtzman A, Rashkin H, Bisk Y, Farhadi A, Roesner F, et al. Defending against neural fake news. In: Advances in Neural Information Processing Systems. vol. 32. Curran Associates, Inc.; 2019. Available from: https://papers.nips.cc/paper_files/paper/2019/hash/3e9f0fc9b2f89e043bc6233994dfcf76-Abstract.html
- 37.
Bashardoust A, Feuerriegel S, Shrestha YR. Comparing the willingness to share for human-generated vs. AI-generated fake news; 2024. Available from: http://arxiv.org/abs/2402.07395
- 38. Hackenburg K, Margetts H. Evaluating the persuasive influence of political microtargeting with large language models. Proc Natl Acad Sci U S A 2024;121(24):e2403116121. pmid:38848300
- 39.
Hackenburg K, Tappin BM, Röttger P, Hale S, Bright J, Margetts H. Evidence of a log scaling law for political persuasion with large language models; 2024. Available from: http://arxiv.org/abs/2406.14508
- 40. Jakesch M, Hancock JT, Naaman M. Human heuristics for AI-generated language are flawed. Proc Natl Acad Sci U S A 2023;120(11):e2208839120. pmid:36881628
- 41. Dettmers T, Zettlemoyer L. The case for 4-bit precision: k-bit inference scaling laws. In: International Conference on Machine Learning. PMLR; 2023. p. 7750–74.
- 42. Frantar E, Ashkboos S, Hoefler T, Alistarh D. GPTQ: accurate post-training quantization for generative pre-trained transformers. arXiv preprint 2022.
- 43.
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners; 2019. Available from: https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe
- 44. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(1):140:5485–140:5551.
- 45.
Black S, Gao L, Wang P, Leahy C, Biderman S. GPT-Neo: large scale autoregressive language modeling with Mesh-Tensorflow; 2021. Available from: https://doi.org/10.5281/zenodo.5297715
- 46. Chung HW, Hou L, Longpre S, Zoph B, Tay Y, Fedus W, et al. Scaling instruction-finetuned language models; 2022.
- 47.
OpenAI . OpenAI models; 2024. Available from: https://platform.openai.com
- 48.
OpenAI , Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 technical report; 2023. Available from: http://arxiv.org/abs/2303.08774
- 49.
Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al.. Llama 2: open foundation and fine-tuned chat models; 2023. Available from: http://arxiv.org/abs/2307.09288
- 50. Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, de Las Casas D, et al. Mistral 7B. arXiv preprint 2023.
- 51.
Anil R, Borgeaud S, Alayrac JB, Yu J, Soricut R, Schalkwyk J, et al. Gemini: a family of highly capable multimodal models. 2023. Available from: http://arxiv.org/abs/2312.11805
- 52.
Javaheripi M, Bubeck S. Phi-2: the surprising power of small language models - Microsoft Research; 2023. Available from: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
- 53.
Mesnard T, Hardin C, Dadashi R, Bhupatiraju S, Pathak S, Sifre L, et al. Gemma: open models based on Gemini research and technology; 2024. Available from: http://arxiv.org/abs/2403.08295
- 54.
Meta . Introducing Meta Llama 3: the most capable openly available LLM to date; 2024. Available from: https://ai.meta.com/blog/meta-llama-3/
- 55.
Chan RSY, Nanni F, Brown E, Chapman E, Williams AR, Bright J, et al. Prompto: an open source library for asynchronous querying of LLM endpoints; 2024.
- 56. Zheng L, Chiang WL, Sheng Y, Zhuang S, Wu Z, Zhuang Y, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint 2023.
- 57. Gallegos IO, Rossi RA, Barrow J, Tanjim MM, Kim S, Dernoncourt F, et al. Bias and fairness in large language models: a survey. Comput Linguist. 2024;50(3):1097–179.
- 58.
TF–IDF. In: Sammut C, Webb GI, editors. Encyclopedia of machine learning. Boston, MA: Springer; 2010. p. 986–7. Available from: https://doi.org/10.1007/978-0-387-30164-8_832
- 59.
lme4; 2024. Available from: https://github.com/lme4/lme4
- 60.
Wolf Y, Wies N, Shteyman D, Rothberg B, Levine Y, Shashua A. Tradeoffs between alignment and helpfulness in language models with representation engineering; 2024. Available from: http://arxiv.org/abs/2401.16332
- 61.
Rottger P, Kirk H, Vidgen B, Attanasio G, Bianchi F, Hovy D. XSTest: a test suite for identifying exaggerated safety behaviours in large language models. In: Duh K, Gomez H, Bethard S, editors. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics; 2024. p. 5377–400. Available from: https://aclanthology.org/2024.naacl-long.301
- 62. Ferrara E, Varol O, Davis C, Menczer F, Flammini A. The rise of social bots. Commun ACM. 2016;59(7):96–104.
- 63.
Ganguli D, Lovitt L, Kernion J, Askell A, Bai Y, Kadavath S, et al.. Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned; 2022. Available from: http://arxiv.org/abs/2209.07858