Framework-based qualitative analysis of free responses of Large Language Models: Algorithmic fidelity

With the advent of Large-scale generative Language Models (LLMs) it is now possible to simulate free responses to interview questions such as those traditionally analyzed using qualitative research methods. Qualitative methodology encompasses a broad family of techniques involving manual analysis of open-ended interviews or conversations conducted freely in natural language. Here we consider whether artificial "silicon participants" generated by LLMs may be productively studied using qualitative analysis methods in such a way as to generate insights that could generalize to real human populations. The key concept in our analysis is algorithmic fidelity, a validity concept capturing the degree to which LLM-generated outputs mirror human sub-populations' beliefs and attitudes. By definition, high algorithmic fidelity suggests that latent beliefs elicited from LLMs may generalize to real humans, whereas low algorithmic fidelity renders such research invalid. Here we used an LLM to generate interviews with "silicon participants" matching specific demographic characteristics one-for-one with a set of human participants. Using framework-based qualitative analysis, we showed that the key themes obtained from both human and silicon participants were strikingly similar. However, when we analyzed the structure and tone of the interviews, we found striking differences. We also found evidence of a hyper-accuracy distortion. We conclude that the LLM we tested (GPT-3.5) does not have sufficient algorithmic fidelity for in silico research conducted on it to be expected to generalize to real human populations. However, rapid advances in artificial intelligence raise the possibility that algorithmic fidelity may improve in the future. We therefore stress the need to establish epistemic norms now around how to assess the validity of LLM-based qualitative research, especially concerning the need to ensure the representation of heterogeneous lived experiences.


Introduction

Glossary
• Large language model (LLM): a conditional distribution over the next word in a document given the sequence of words leading up to it.
• Algorithmic fidelity: the degree to which beliefs and attitudes sampled from an appropriately conditioned LLM accurately reflect those of real human subpopulations, i.e. a validity concept used to evaluate when research on human behavior using LLMs to generate data may be expected to generalize.
• Silicon participant: LLM-generated output conditioned on contextual information such as age, gender, etc., chosen to match a human participant.
• Silicon sampling: the process of simulating a population of silicon participants using context to define subpopulations. If there is sufficient algorithmic fidelity, the simulated subpopulations so produced closely track the attitudes, opinions, and experiences of their corresponding human subpopulations.
• Public-patient engagement: actively soliciting knowledge, values, and norms from stakeholders to create a shared perspective for conducting public health and social care research, i.e. research carried out 'with' or 'by' stakeholders rather than 'to', 'about' or 'for' them, following principles of respect, support, transparency, responsiveness, fairness of opportunity, and accountability.
• Semi-structured interview: a versatile way of collecting qualitative data (i.e. free-form text) using an interview guide tailored to the research aim and based on prior work, theory, or a framework.
• Framework-based qualitative analysis: an approach to qualitative analysis that finds patterns in free-form text, examines their relationships and derives common themes.
• Theoretical construct: a term from the Theoretical Domains Framework (TDF) referring to a component part of a theory hypothesised by that theory to predict human behaviour.
• Theoretical domain: a TDF term referring to a grouping of theoretical constructs produced by expert consensus.
• Belief statement: a summary of a set of 'quotes' (e.g. words, sentences, collections of sentences) that share a common meaning. Belief statements are in one-to-one correspondence with unique meanings.
Large-scale generative Language Models (LLMs) (Anil et al., 2023; Brown et al., 2020; OpenAI, 2023) may provide a new opportunity for capturing available knowledge and beliefs at scale and facilitate in silico research on human behavior and cognition (Aher et al., 2023; Argyle et al., 2023; Brand et al., 2023; Dillion et al., 2023; Grossmann et al., 2023). Horton (2023) argues that language models contain substantial latent social information: enough to consider them plausible computational models of humans. LLMs may capture economic laws, decision-making heuristics, and social preferences (Horton, 2023), as well as mirroring human moral judgments (Dillion et al., 2023). In principle, using LLMs could be a cost-effective and efficient way to gain insights and explore how self-reports vary, as well as to pilot experiments in silico to test the sensitivity of responses to precise wording. Cheap and fast in silico experiments can guide expensive and slow empirical work with real participants. However, how can we know such results are trustworthy? Algorithmic fidelity is an appropriate validity concept for research on human behavior using survey data simulated with large-scale generative Language Models (LLMs) (Argyle et al., 2023). Algorithmic fidelity describes the extent to which the outputs of LLMs conditioned to simulate specific human sub-populations actually reflect the beliefs and attitudes of those subpopulations. By definition, high algorithmic fidelity suggests that beliefs elicited from the LLM will generalize to real humans, while low fidelity renders such inferences invalid. Argyle et al. (2023) introduced the idea of generating "silicon samples" or "silicon participants" by conditioning LLMs using backstories matching sociodemographic data from human survey participants. Their study found that LLM outputs closely mirrored the complex interplay of beliefs, attitudes, and sociocultural contexts that shape human responses to surveys about American politics. In the present study we aim to extend Argyle et al. (2023)'s algorithmic fidelity-based methodology using qualitative research methods so that it applies to studies where the data consists of fully free-form responses in natural language.
There is no reason to think algorithmic fidelity would be uniform over the many and varied parts of human lived experience or the many and varied social science research topics. These models will clearly be better at simulating some people than others, and will work better for some applications than others. Argyle et al. (2023) conclude from this that algorithmic fidelity must be measured anew for each research question. Determining that there is sufficient algorithmic fidelity to address one research question does not imply the same will be true for others (Argyle et al., 2023). At present, there are still no generally accepted best practices for research on humans using LLMs. Nevertheless, we think one aspect of sound methodology is already clear: there should be guidelines counseling researchers to perform a dedicated assessment of algorithmic fidelity tailored to their specific research question and target population.
Understanding beliefs mediated by natural language and their impact on behavior (e.g., beliefs such as "vaccination is an effective and safe way to protect me and others around me") is important but not straightforward to study using conventional frequentist methods. Qualitative research methods come to the rescue when the nuance of such beliefs needs to be understood. These methods include ethnography (Reeves et al., 2008), phenomenological studies (Giorgi, 2009), grounded theory (Bryant and Charmaz, 2007), thematic analysis (Clarke et al., 2015), and framework-based analysis (Gale et al., 2013). For example, when exploring the acceptability of newly developed treatments, we might want to elicit beliefs and attitudes from stakeholders to improve services and healthcare (Craig et al., 2008; Skivington et al., 2021; Tromp et al., 2022). Stakeholders may have critical information that would be missed if researchers were to rely only on prior theories, experiments, and survey data. In fact, involving stakeholders can guide the development of interventions and policies that are relevant, effective, and acceptable for them, ultimately leading to improved outcomes (Craig et al., 2008; Skivington et al., 2021). This is helpful for allocating research efforts to the most important problems and for ensuring interventions are well targeted. Another research field where it is important to understand latent beliefs mediated by natural language is behavior change. The goal of this field is to promote beneficial behavior or reduce harmful behavior (Rothman, 2004). Behavior change may be considered on a variety of different scales. For instance, health psychologists are concerned with determining the barriers and enablers to vaccination, adopting healthy behaviors (e.g. exercise), or ceasing to engage in harmful behaviors (e.g. tobacco smoking). A related research program seeks interventions by which whole communities can be helped to adopt helpful social norms (e.g. educating girls) or drop harmful norms (e.g. child marriage) (Bicchieri, 2016). Such research affects the wellbeing and health of the general public. Appropriate standards for evaluating the rigor with which it is conducted and the trustworthiness of its findings should be in place.
The present study is concerned with the methodological question of whether or not LLMs have sufficient algorithmic fidelity for us to generalize and extrapolate the results they generate to human populations of interest. To illustrate how to assess algorithmic fidelity for free-form natural text data, we draw on a qualitative study using semi-structured interviews with heart failure patients that aimed to elicit beliefs about barriers and enablers influencing physical activity (Amirova et al., 2022a). Increasing physical activity levels is known to have many health benefits for this group (Lewinter et al., 2015; Sagar et al., 2015; Taylor et al., 2019). However, most patients do not do enough (O'Donnell et al., 2020), facing barriers associated with older age, co-morbid depression, symptom distress, and negative emotional responses to physical activity (Amirova et al., 2022b). Promoting physical activity in this population group has had varied success (Amirova et al., 2021). Research on this type of question has serious implications for individual and population health.
We find that GPT-3.5 has insufficient algorithmic fidelity to address the research question of how to promote physical activity in older adults with heart failure. Nevertheless, even though our specific result was negative, we do think this technology is promising. It is very likely that future systems will have sufficient algorithmic fidelity to support trustworthy in silico research, all the more reason to critically engage with this validity concept now. We predict that it will become the core validity concept needed to support this nascent field going forward. It is important to demonstrate how it can be used to reject invalid in silico research now, so that once LLMs have improved enough we will be able to use this logic routinely to support or reject the validity of specific research findings. In short, we need to develop epistemic norms to apply when we review papers and decide whether to approve or disapprove of arguments that arise in this newly LLM-enabled field of research on human behavior.
The main methodological contribution of the present study is to apply framework-based qualitative research methods (Gale et al., 2013) to assessing algorithmic fidelity for LLM-generated free-form natural text data. This approach to qualitative analysis finds patterns in free-form text, examines their relationships, and derives common themes (Gale et al., 2013). Employing a well-established framework supported by domain expert consensus ensures consistency in semi-structured interviews between human participants and LLMs, setting a uniform standard for evaluation. In addition, without a framework for the interview schedule, there is a risk that results would not amount to much more than anecdotes and thus not be credible. However, by using a framework, it is possible to systematically probe the beliefs of both silicon and human participants and thereby build confidence that the list of uncovered beliefs is complete enough to be useful. The specific framework we used in this study is the Theoretical Domains Framework (TDF) (Cane et al., 2012).
LLMs may provide an opportunity for capturing available knowledge and beliefs at scale. However, we first need to assess whether LLMs accurately and faithfully portray the lived experiences and stakeholder beliefs in question. This is where the algorithmic fidelity concept is useful (Argyle et al., 2023). Integrating lived experiences and stakeholder beliefs is necessary when co-designing health interventions (Cowie et al., 2022; Guasti et al., 2022; Skivington et al., 2021; Tromp et al., 2022). In health research, stakeholders include patients and members of the public as well as those linked in a professional capacity (Deverka et al., 2012), and generally anyone who is targeted by the intervention or policy or is involved in its development or delivery (Skivington et al., 2021). Including stakeholders' beliefs is not always straightforward and is often hard to achieve, especially when applied to interventions at scale as in, for example, digital health interventions and services (McBeath et al., 2022). Both the effectiveness and acceptability of interventions are influenced by a complex system of interrelated components, such as affective attitudes, beliefs, and social norms (Perski and Short, 2021). LLMs may be useful in summarising this complexity at scale. This may prove useful in designing digital therapy (e.g., Gold et al. (2023)). LLMs may also help in tailoring digital health interventions that show promise in supporting chronic disease management and behavior change (Bearne et al., 2020; Sekhon et al., 2021). With the advancement of digital technology, avatar- or agent-based interventions are increasingly popular in behavioural interventions and health applications (Clark et al., 2016; Jones and Petrie, 2017; Nowak and Fox, 2018; Taylor et al., 2022). These virtual representations are designed to interact with users and are suggested to encourage behaviour change through mechanisms such as motivation, feedback, self-image, goal setting, and self-monitoring. A review indicates a potential association between the appearance and similarity of virtual representations and user engagement (Taylor et al., 2022). Users might be more inclined to engage with an avatar that looks like them or resonates with their self-perception. This might increase motivation to complete tasks or adhere to a particular behaviour change intervention. Improving not only the visual resemblance but also the algorithmic fidelity of such avatars (i.e., the faithfulness of the avatar to beliefs and lived experiences) may improve the efficacy of such digital interventions, in line with existing research (Taylor et al., 2022).
However, we show that, currently, LLMs do not demonstrate sufficient algorithmic fidelity. Yet, with the fast-paced development of these technologies, it is expected that they will in the not-so-distant future. Establishing a disciplinary consensus on defining and evaluating algorithmic fidelity is therefore crucial. Before LLMs can safely be used in designing interventions and policies, there is a need to develop a set of guidelines for routine assessment of algorithmic fidelity to support or reject claims made using them. Such guidelines should consider the faithfulness of the LLM to lived experiences and stakeholder beliefs.

Large Language Models
Our approach to assessing algorithmic fidelity involves conducting interviews with silicon participants matching the specific demographic characteristics of a set of real human participants. Before we justify this approach, we need to provide some additional background on LLMs.
Language models are conditional distributions over natural language. They are probability distributions P(x_n | x_{n-1}, ..., x_0) where each x_i is a part of a word. A sentence is a sequence x_0, ..., x_n; a paragraph is a longer such sequence. These complex conditional probability models are created (i.e. trained) by maximizing the likelihood of a dataset consisting of many billions of documents, harvested both from books and from the internet (e.g. Gao et al. (2020)). Predicting x_n from its context (the preceding x_{n-1}, ..., x_0) requires the model to absorb a substantial amount of latent knowledge about the world, about language, and about people. When an LLM has sufficient algorithmic fidelity, we may prompt it with a context containing demographic information (e.g. "a 74-year-old man with heart failure and arthritis who lives in a major city") to elicit outputs that correlate with the attitudes, opinions, and experiences of the human sub-population to whom the specified demographic data applies.
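To make this concrete, a demographic conditioning prompt of the kind just described might be assembled as follows. This is a hypothetical sketch: the template wording and field names are our own illustration, not the exact prompt used in this or any other study.

```python
# Hypothetical sketch of a demographic conditioning prompt. The wording
# and fields are illustrative assumptions, not the study's actual template.
def conditioning_prompt(age, gender, conditions, residence):
    profile = (
        f"a {age}-year-old {gender} with {' and '.join(conditions)} "
        f"who lives in {residence}"
    )
    return (
        f"The following is an interview with {profile}.\n"
        f"Interviewer: How physically active are you, and what helps or "
        f"hinders you?\nParticipant:"
    )

prompt = conditioning_prompt(74, "man", ["heart failure", "arthritis"], "a major city")
```

Passing the resulting string to an LLM as its context conditions all subsequent sampling on the stated demographics.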
Generating samples from an LLM is done autoregressively, word by word, conditioning on the growing sequence of preceding words leading up to the next one to be predicted (Figure 1). To prevent the conditioning sequence from growing indefinitely, it is cut off once a certain maximum size, called the context length, is reached. Once a document (or conversation with a chatbot) grows beyond the context length, some of the preceding words must be dropped, so they no longer condition subsequent outputs. Different LLMs and systems make different choices in how exactly they handle the "forgetting" required to support long contexts. However, there are many ways to simulate longer contexts with shorter ones, so the user may not notice when the conversation grows beyond the context length.
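The autoregressive loop with a finite context window can be sketched as follows. Here `next_token_dist` is a hypothetical placeholder standing in for a real model's conditional distribution, and the tiny `CONTEXT_LENGTH` is chosen only to make the truncation visible.

```python
import random

CONTEXT_LENGTH = 8  # assumed small window, for illustration only

def next_token_dist(context):
    # Placeholder: a real LLM would return P(next token | context).
    vocab = ["walk", "rest", "swim", "."]
    return {tok: 1 / len(vocab) for tok in vocab}

def generate(prompt_tokens, n_tokens, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_tokens):
        # Truncation: only the most recent CONTEXT_LENGTH tokens condition
        # the next prediction; anything earlier is "forgotten".
        context = tokens[-CONTEXT_LENGTH:]
        dist = next_token_dist(context)
        choices, weights = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return tokens

out = generate(["I", "try", "to"], 5)
```

The same loop structure applies regardless of how a particular system chooses which tokens to drop once the window is full.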
It is crucial to note that the concepts of validity and reliability, commonly employed in statistics, do not have direct analogues that can be confidently applied to LLMs. Empirical assessment is the only way to ensure their validity. Most models considered in applied machine learning are both too complicated and too closely tailored to their specific application to be generically justified. Applied machine learning researchers typically think about methodological rigour as demanding they implement "cross-validation" procedures which estimate operationalized generalization concepts. The most critical such procedure is to split data into a portion used for training, i.e. model fitting, and a different portion used for testing (Hastie et al., 2009). Some large-scale generative language models go through a fine-tuning step where human raters provide specific feedback used to train them to follow instructions and align with ethical principles and product design goals (Christiano et al., 2017). These include Sparrow (Glaese et al., 2022), Anthropic's assistant (Bai et al., 2022), and InstructGPT (Ouyang et al., 2022), the latter of which was the basis for the GPT-3.5 system we used in this paper. Fine-tuning can be accomplished through a variety of techniques, the most prominent being Reinforcement Learning from Human Feedback (RLHF). It involves human raters selecting which of several possible model responses they prefer. The raters are not asked the subjective question of which response they themselves prefer, but rather are given a specific checklist of product design goals to apply in making their judgments. All the model's capabilities are already present after the pretraining stage (Zhou et al., 2023). The post-processing steps aim to reduce the raw model's propensity to produce toxic responses as well as to make it implement a consistent "personality" in accord with product design goals. These steps are not always entirely effective in preventing LLMs from producing undesirable behaviors like toxic or harmful language, and "jailbreak" prompts which trick the model into responding inappropriately are still easy to discover and implement (Reinecke et al., 2023; Zou et al., 2023).
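The train/test split at the heart of cross-validation, mentioned above, can be sketched minimally; this is a generic illustration, not tied to any particular library's API.

```python
import random

# Minimal sketch of a train/test split: fit on one portion of the data,
# estimate generalization on the held-out rest.
def train_test_split(data, test_fraction=0.2, seed=0):
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Held-out test portion first, remainder used for training.
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
```

Performance measured only on the held-out portion is what licenses a claim that the fitted model generalizes beyond the data it was fit on.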

Algorithmic Fidelity
Argyle et al. (2023) offered a suggestion for how to use LLMs in social science research. Given the correct conditioning, the authors argue, free-text output generated by large language models like GPT-3 can serve as an effective proxy for specific human population groups. Argyle et al. (2023) introduce a concept referred to as algorithmic fidelity, which describes the ability of a language model, when adequately conditioned, to accurately reflect the ideas, beliefs, and attitudes situated in the sociocultural contexts of various population groups. This is a departure from the common view that AI biases are uniform across models. Instead, they argue that these biases are nuanced and demographically correlated. Such a view offers a deeper and more fine-grained understanding of LLM outputs and the biases to which they are subject. The researchers propose that "silicon samples" can be generated by conditioning and prompting LLMs using sociodemographic backstories from real human participants in several large surveys conducted in the U.S. They then compared the outputs of the model against the responses of the human participants to assess the accuracy of the model's representation. They found that model outputs go beyond sharing superficial similarities and instead reflect the nuanced and complex interplay between ideas, attitudes, and the sociocultural context that shaped them. Argyle et al. (2023) propose that with sufficient algorithmic fidelity, language models like GPT-3 can be used as tools for advancing our understanding of humans and society across various disciplines. This could transform how we approach social science research, offering new perspectives and opportunities for piloting and conducting research at scale. Argyle et al. (2023) provide a framework for evaluating algorithmic fidelity through four distinct criteria: (1) generated responses are indistinguishable from parallel human texts (a social science Turing test); (2) generated responses are consistent with the attitudes and sociodemographic information of the conditioning context, such that humans viewing the responses can infer critical elements of that input (Backward Continuity); (3) generated responses proceed naturally from the conditioning context provided, reliably reflecting the form, tone, and content of the context (Forward Continuity); (4) generated responses reflect underlying patterns of relationships between ideas, demographics, and behaviour that would be observed in comparable human-produced data (Pattern Correspondence). Silicon Sampling is the methodology proposed by Argyle et al. (2023), which describes how to use an LLM to generate a virtual dataset (i.e., silicon samples) using demographic backstories as prompts to create variability. Conditioning on a backstory is expected to produce outputs from the model representative of the subpopulation of real people with a similar backstory.
We can make an LLM simulate multiple different silicon participants. We did this by providing each participant with a distinct prompt containing their biographical information, for instance, backstories using fictional names that were popular in 1950. The model continues sampling free-form natural language correlated with the identity provided to it in its prompt. Argyle et al. (2023)'s criteria were designed for written survey-based outputs. We propose modified versions of these criteria that are more appropriate for qualitative research, summarising free-form spoken text conveying latent knowledge: 1. Social Science Turing Test: content. This criterion assesses whether LLM-generated responses are indistinguishable from parallel human responses in their content. Content can be summarised as a set of belief statements using a framework like the TDF. This shifts the focus away from the surface-level similarity of the text to its semantic content: the specific beliefs and viewpoints, and latent meaning. Is the generated response similar to the human response in its content? Is it what a human would say?
2. Social Science Turing Test: hyper-accuracy distortion. Hyper-accuracy distortion is the tendency of models to generate responses that may be technically accurate but contextually inappropriate or implausible (Aher et al., 2023). The hyper-accuracy distortion is an especially vivid example of a failure of algorithmic fidelity. The model is supposed to know it is extremely unlikely that a five-year-old child would know the answer to a specific scientific question about galaxy formation or number theory. So if conditioned to play the part of a child, it should not know the answer. When it goes ahead and answers anyway, that is a failure of algorithmic fidelity. In our study, a good example is a response from Robert, one of the silicon participants, that included the exact text from the World Health Organisation guidelines on physical activity (2021): "They also gave me guidelines to follow, such as aiming for at least 150 minutes of moderate-intensity aerobic activity or 75 minutes of vigorous-intensity aerobic activity a week, or a combination of both. They also advise me to include muscle-strengthening activities that involve all major muscle groups on at least 2 days a week."
3. Social Science Turing Test: structure and tone. This criterion assesses whether the way the beliefs are narrated is indistinguishable from human participants. Is LLM-generated output similar in its structure and tone to human responses? Does an LLM-generated response look like a response from human participants?
4. Backward Continuity. This criterion assesses whether the model's responses are consistent with their sociodemographic conditioning prompts, such that when a human rater views the response they can infer elements of the conditioning prompt. It asserts that samples from P(prompt | response), i.e. the probability distribution over prompts given a response, make sense to an expert rater who is already familiar with the human data: once they see the response (i.e. the interview), they can make a reasonable guess of the prompt (the backstory). Alternatively, we may say that the expert rater, when given the response, would not be surprised to learn the prompt that created it.
For example, if a silicon participant mentions that they have arthritis in their response, then it is likely that their having arthritis was mentioned in their conditioning prompt. This shows that the model is maintaining backward continuity. The information provided in the response can be traced back to the prompt. Backward Continuity ensures the model doesn't forget or contradict information from the prompt.
5. Forward Continuity. This criterion assesses whether generated responses proceed naturally and consistently from the given context, e.g., whether the model can generate responses that align with the provided context and develop and expand on it to reflect human thought processes (Argyle et al., 2023). This might include, for example, elaborating on certain beliefs, providing examples, or making connections between different ideas. It asserts that samples from P(response | prompt) make sense, i.e. an expert rater who is already familiar with the human data can, when given the prompt (the backstory), make a reasonable guess of the response (the interview). Alternatively, we may say that the expert rater, when given the backstory, would not be surprised to learn about the interview it subsequently elicited. This emphasizes the expectation that the response should naturally follow from and be related to the prompt.
We divide the criterion of Forward Continuity into two parts, explicit forward continuity and inferred forward continuity, to accommodate the complexity of free-form text.
The explicit forward continuity criterion assesses the ability of the LLM to include and maintain all explicitly provided background details in its responses, such as comorbid health conditions, urban vs. countryside residence, gender, and more.
The inferred forward continuity criterion, on the other hand, assesses the model's ability to generate and maintain relevant backstory details not explicitly mentioned but inferred from other pieces of information. For instance, where we provided information about advanced age, it would be desirable for the model to infer from this detail that retirement could be a significant factor influencing physical activity levels, similar to the assumptions and connections a human might naturally make in such a scenario. Here the language model not only maintains the explicit details provided by the user but also generates and incorporates inferred details based on those given explicitly.
Forward continuity ensures the model builds on and evolves the conversation in a logical and human-like manner consistent with the prompt.
6. Pattern Correspondence. This criterion assesses whether the relationships between variables in the silicon participants match the relationships between variables in the human data. In the case of qualitative research, it probes whether the patterns generated by the LLM emulate the patterns of thought and behavior identified in human data. For instance, in our application it assesses whether the model captures the difference in barriers and enablers present in active vs sedentary participants in a way that is consistent with the human data.
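The backward and forward continuity criteria above examine two conditional distributions that are linked by Bayes' rule:

```latex
P(\text{prompt} \mid \text{response})
  = \frac{P(\text{response} \mid \text{prompt})\, P(\text{prompt})}{P(\text{response})}
```

Informally, a response that follows naturally from its backstory (forward continuity) also tends to make the backstory easier to infer from the response (backward continuity), unless the response is so generic that it is likely under many different backstories.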
It is important for the field to come up with guidelines and protocols to assess and improve algorithmic fidelity. Qualitative researchers are well-positioned to take on this important part of the challenge of ensuring algorithmic fidelity, especially that of ensuring the representation of diverse lived experiences.

Social Science Turing Test
Social Science Turing Test: Content. Both silicon and human participants displayed notable similarities in their beliefs about physical activity (Table 1: Belief statements). Silicon and human participants shared the same six most relevant influences on the behaviour, as annotated using the TDF: goals, beliefs about consequences, environmental context and resources, beliefs about capabilities, social influences, and behavioural regulation. However, while both sets of participants understood the positive effects of physical activity on health and mood, the ranked importance of these factors varied. On the differences side, silicon participants emphasized goals to avoid heart attacks and provided nuanced distinctions regarding self-efficacy for varied physical activity intensities. They also offered more strategies to bolster physical activity, from making it enjoyable to setting reminders. Conversely, human participants focused on symptoms that are triggered by activity, which impede engaging in it in the future. Human participants also talked about habitual physical activity (automaticity of behaviors and habits). Concerning reinforcement, humans highlighted pain as a deterrent, while silicon participants were demotivated when failing to meet goals. Concerning mood and emotion, humans often exercised out of boredom, whereas silicon participants addressed the effects of stress, anxiety, and their physical limitations. When it comes to knowledge, silicon participants were more cognizant of discomfort as a normal part of exercise while also discussing disease knowledge (Supplement 1: Detailed content differences between silicon and human sample).
To compare mean quote fractions (%) between human and silicon participants across different barriers and enablers, we employed a t-test with Bonferroni adjustment to correct for multiple comparisons. Silicon participants had 29 more belief statements than human participants. We accounted for the substantially larger text output from silicon participants by normalizing belief frequency, i.e., dividing each belief's quote count by the total quote count, yielding a relative measure (mean quote fraction, %) rather than raw numbers.
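The normalization and Bonferroni-corrected comparison described above can be sketched as follows. This is an illustrative sketch only: the quote counts, the number of domains tested, and the helper `quote_fractions` are hypothetical placeholders, not the study's actual data or analysis code.

```python
# Illustrative sketch of the quote-fraction normalization and
# Bonferroni-corrected t-test described in the text.
# All data values below are hypothetical placeholders.
from scipy import stats

def quote_fractions(quote_counts):
    """Normalize one participant's per-belief quote counts into
    percentages of that participant's total quote count."""
    total = sum(quote_counts.values())
    return {belief: 100.0 * n / total for belief, n in quote_counts.items()}

# Hypothetical quote fractions (%) for one domain
# ("negative beliefs about consequences"), one value per participant.
human_fracs = [1.2, 0.0, 2.5, 1.9, 0.8]
silicon_fracs = [7.1, 9.0, 6.4, 8.8, 8.2]

t_stat, p_raw = stats.ttest_ind(silicon_fracs, human_fracs)

# Bonferroni adjustment: multiply the raw p-value by the number of
# domains compared (14 TDF domains assumed here), capping at 1.0.
n_domains = 14
p_adjusted = min(p_raw * n_domains, 1.0)
print(f"t = {t_stat:.2f}, Bonferroni-adjusted p = {p_adjusted:.4f}")
```

Normalizing before testing ensures that a participant who simply talks more does not appear to hold more beliefs; only the relative emphasis across domains is compared.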
We found that active silicon participants on average talked proportionally more about barriers such as negative beliefs about consequences than active human participants (7.9% (4.33) vs 1.42% (1.99), p < 0.005). Silicon participants also talked a little more (2.18% (1.41)) about skills than human participants (0%), p < 0.005. Silicon and human samples did not differ in the proportional amount of quotes across other negative influences on physical activity (Figure 3, Table 3).
The Social Science Turing Test criterion concerning the content of the generated beliefs was only partially met.

Social Science Turing Test: Hyper-Accuracy Distortion
Silicon participants including Nancy, Muhammad, and David produced responses that closely mirrored theories from health psychology (Swann et al., 2023) and the World Health Organization's physical activity guidelines (Bull et al., 2020). Muhammad identified clear planning and realistic goal-setting as effective aids for promoting physical activity. David, another silicon participant, emphasized the importance of skill acquisition and the establishment of achievable goals in order to maintain an active lifestyle.
Nancy, a silicon participant, stated: "My healthcare professionals have also recommended a specific exercise program tailored to my needs and abilities." This response closely parallels health psychology theories and standard physiotherapy practices both in substance and tone.
David, another silicon participant, offered this perspective: "I have received advice from my health professional, specifically my cardiologist, about how physically active I should be. They have set certain limits for me, such as avoiding high-intensity exercises and activities that can put too much strain on my heart. They also recommended to engage in moderate-intensity aerobic exercise, like brisk walking, cycling or swimming, for at least 150 minutes a week." This response precisely reflects the national guidelines for physical activity (Bull et al., 2020). Muhammad, another silicon participant, shared: "What helps me to be physically active is having a clear plan and setting specific and realistic goals." Silicon participant David, again sharing his insights, stated: "I have developed some skills and knowledge that have helped me to be physically active in the past and continue to do so. I have learned the importance of setting realistic goals and breaking them down into small, manageable tasks."

Social Science Turing Test: Structure
In terms of structure, human participants often narrated their experiences in a conversational manner with non-linear narratives, sometimes deviating from the main topic. They also tended to limit their responses when certain questions did not align with their personal experiences. On the other hand, silicon participants provided highly structured responses, consistently providing content in response to various prompts without ever deviating from the topic of the interview. Human participant: "Interviewer: OK, it seems that you have many hobbies: drawing, singing as well as exercising. Among all of the things you do how much of a priority exercise is? Participant:

Silicon participant:
"As a person with heart failure and rheumatoid arthritis, there are a few drawbacks to being physically active. One of the main drawbacks is that I have to be careful not to overdo it and avoid activities that put too much stress on my heart or joints. I also have to be mindful of my heart rate and avoid overexertion. Additionally, if I am experiencing a flare-up of joint pain due to my rheumatoid arthritis, it can be difficult to exercise. However, there are also many benefits to being physically active. Regular exercise helps me maintain my overall health and well-being, and it helps me manage my heart failure and rheumatoid arthritis symptoms. It gives me more energy, better sleep, and a more positive outlook on life. It also helps me stay connected with my community and stay active. Physical activity also has many benefits for mental health; it helps me to clear my mind and feel better about myself. I understand that there may be challenges and setbacks, but overall, I believe that the benefits of physical activity outweigh the drawbacks."

Silicon participant:
"As a 77-year-old woman with heart failure, some barriers I face in being physically active include:
• My physical limitations due to my condition and the need to avoid certain activities that could put too much stress on my heart.
• Weather conditions, such as extreme heat or cold, can make it difficult for me to be active. Air quality and pollution can also be a barrier, especially if the air is not healthy enough to breathe.
• Sometimes, I might be feeling tired or unwell and would need to take a rest.
• Urgent things that I need to take care of, or having a lot of stress that takes up my time and energy, can also make it harder for me to focus on my physical activity.
However, there are also several enablers that help me overcome these barriers, including:
• Consistently setting a routine and scheduling my physical activities helps me stay on track.
• Having a goal and working towards a specific fitness goal, such as increasing my walking distance, helps me stay motivated.
• Having a supportive network of friends and family who encourage me to stay active.
• Living in a safe neighborhood and having access to green spaces and parks nearby makes it very easy and convenient for me to be active.
• My understanding of the benefits of staying active and maintaining a consistent physical activity routine helps me to prioritize physical activity and maintain a balanced and healthy life.
• Regular check-ups with my doctor and the advice received from health professionals have been essential in helping me to maintain a healthy and safe physical activity routine."
Silicon participants never deviated from the topic of the interview. Human participants would often deviate from the topic of the interview to talk about their general experience of the condition, ask questions about the research, or share lived experiences that did not directly relate to the interview questions: "Human participant: Why should I walk? I am too lazy! I am a lazy person. I am a Leo according to the horoscope sign, the lion. On the television, they always say the lion sleeps 18 hours a day and that's all I need too [laughs].
Interviewer: So is it just not part of who you are, being active and walking around and all that? Participant: No, it is not a part of who I am." Human participants often would limit their response by saying something did not apply to their experience, without further elaboration. Silicon participants, in contrast, elaborated at length: "I am more likely to be active and engage in my regular physical activity routine. On the other hand, if I am feeling low or have a bad mood, it can be harder for me to motivate myself to be active. Physical activity is known to be a mood booster and it can help to improve overall mental well-being. When I am active, I often find that my mood improves, and I feel better overall. I also find that physical activity can act as a stress reliever and help me to clear my mind and feel more relaxed. However, if I am experiencing symptoms of depression or anxiety, it can be hard to find the motivation to be active. In those cases, I find it helpful to focus on the benefits of physical activity, such as how it can improve my mood, and try to break my activity into small manageable chunks. I may also talk to my doctor or therapist to discuss ways to help me cope with my mood and still maintain my physical activity routine. Overall, my mood can fluctuate and can influence my physical activity, but I make sure to be mindful of it, and try to focus on the benefits that physical activity brings and make adjustments accordingly, to maintain a healthy balance in my life." We conclude that the structure of the narrated beliefs did not resemble that of human participants, suggesting that the Social Science Turing Test criterion concerning structure was not met.

Social Science Turing Test: Tone
Human participants displayed varied tones. Some were extremely amicable (n = 3), others were polite but neutral in tone (n = 12), and some were hesitant in their responses (n = 1). Silicon participants, on the other hand, were always amicable (e.g., Mary (silicon participant): "Of course, Dr. Smith. I'll do my best to help with your research."), confident, optimistic, and solution-focused, e.g., William (silicon participant): "I am open to trying new things that could help me increase my physical activity level and improve my overall health. I understand the importance of physical activity in my condition and I would like to do more. I plan on following the advice of my doctor and considering any options that are safe for me to try. I am also open to the idea of joining a community group or organization that could provide me with social interaction and the opportunity to be more physically active. I also would like to explore different types of exercise that are suitable for my condition, such as chair exercises or water therapy. Overall, I am willing to make changes in the future if it means improving my physical health." We therefore conclude that the Social Science Turing Test criterion concerning the tone of generated output was not met.

Backward Continuity
Backward continuity indicates that the backstory of silicon participants can be inferred from their responses. We found that backward continuity was satisfied because we could easily identify the backstories from LLM-generated responses. That is, we could identify which silicon participants were male versus female, what comorbid conditions they had, whether they lived in a city or not, and whether they were physically active. This was easy because all the silicon participants included a response like the following early in their interview: "Researcher: Good morning Robert, thank you for agreeing to participate in our study on physical activity in older adults. Can you tell me a bit about your current physical activity level? Robert: Good morning. I am an 80-year-old man and I have been diagnosed with heart failure, aortic stenosis, pulmonary hypertension, and diabetes. I have also had a heart attack in the past and I live with rheumatoid arthritis. I have a cardiac implantable device, and I live in the countryside.
Unfortunately, I do very little physical activity most days." Here is a more typical, less extreme, example: "Researcher: What are you able to do physically? James: Physically, I am able to do some light activities such as walking and light gardening, but I have to be careful not to overexert myself. I have to avoid high-intensity activities and heavy lifting. Due to my heart failure and atrial fibrillation, I have to be careful with my heart rate and monitor it regularly." When silicon participants were asked about their physical activity levels, their detailed replies included their age, health conditions, living conditions, and reduced physical activity. To other questions, they provided insights about their limited physical capabilities, emphasizing the need for caution due to specific health concerns.

Forward Continuity
Our study found that GPT-3.5 primarily used explicit context information provided in the prompts, such as age, gender, comorbidities, and residency, in its responses. It showed less proficiency in inferring and using backstory details not directly provided, and did not infer details such as retirement.

Consistency between generated responses and explicit backstories
All background information was mentioned in the responses, including comorbid conditions, urban versus countryside residence, and gender. However, only the explicit backstory details provided in the prompts, such as age, gender, comorbidities, and residency, were used in the generated responses.

Consistency between generated responses and inferred backstory details
Inferred backstory details were not spontaneously generated. For example, most humans spoke about retirement, but the LLM did not infer this from other elements of the backstories, such as age and comorbid conditions.
Forward continuity was only partially satisfied: while silicon participants did not produce surprising responses inconsistent with the explicit prompt, they also did not infer additional barriers and enablers that make sense for their backstory, for example, the rapid change in physical activity levels after retirement, which humans mentioned without prompting.

Pattern Correspondence
This criterion states that the LLM-generated responses reflect underlying patterns of relationships between ideas, demographics, and behavior that would be observed in comparable human-produced data.
To compare mean quote fractions between active and inactive silicon participants across different barriers and enablers, we employed a t-test with Bonferroni adjustment to correct for multiple comparisons. We accounted for the substantial text output from silicon participants by normalizing belief frequency, dividing each belief's quote count by the total quote count to provide a relative measure rather than raw numbers.
We found that the LLM produced significantly more quotes about enablers such as behavioural regulation (17.05% (3.89) vs 9.91% (4.76), p < 0.005), beliefs about capabilities (1.95% (1.0) vs 0.22% (0.61), p < 0.005), and goals (9.25% (3.48) vs 2.47% (2.32), p < 0.001) for active silicon participants than for sedentary silicon participants (Figure 4, Table 4). The observed difference between active and sedentary silicon participants establishes a satisfactory pattern correspondence. This pattern also corresponds to the human data. While humans did not produce enough quotes for a statistical significance test, we did establish that behavioural regulation, beliefs about capabilities, and goals are important influences on the behaviour and are key differentiating influences between active and sedentary humans (i.e., relevant enablers of physical activity).
Similar pattern correspondence was observed for the human-relevant barriers to the behaviour. Active silicon participants produced significantly fewer quotes about negative beliefs about capability than sedentary silicon participants. Active silicon participants also produced significantly fewer negative beliefs about environmental barriers to physical activity than sedentary silicon participants (Figure 5, Table 5).

Discussion
Algorithmic fidelity describes the ability of a large language model (LLM) to accurately reflect the ideas, beliefs, and attitudes situated in the sociocultural contexts of various population groups (Argyle et al., 2023). Here we report a comparative qualitative analysis of silicon and human participants' free-form, language-mediated beliefs about behaviour change. Our results show that, currently, LLMs do not demonstrate high algorithmic fidelity: even though silicon participants identified the same top TDF domains as relevant to the behaviour, the exact content of the beliefs, as well as the tone and structure of their responses, differed. LLM responses were also prone to a hyper-accuracy distortion, meaning that responses were technically accurate but contextually inappropriate or implausible given the context included in the prompt. For example, in our study, during a relaxed conversational interview about physical activity, most silicon participants cited text from the WHO guidelines for physical activity (Bull et al., 2020) word for word. We emphasize the need for caution in harnessing LLMs to generate silicon participants for research.
We used an LLM to understand the barriers and enablers to physical activity in heart failure.
We then matched these data against human-provided belief statements. We found that silicon participants shared the same top six domains influencing the behaviour: goals, beliefs about consequences, environmental context and resources, beliefs about capabilities, social influences, and behavioural regulation. Some beliefs were remarkably similar in their content (Social Science Turing Test: content). Interestingly, there were noticeable differences between responses from active vs. inactive silicon participants that were also observed in human participants (pattern correspondence). However, silicon participants focused on the importance of physical activity for avoiding heart attacks and talked about various strategies to stay active, such as making activities fun and setting reminders. They also understood that some discomfort is normal during exercise and were well-informed about their diseases. On the other hand, human participants discussed the negative symptoms they felt from exercising, which made them less inclined to continue. Human participants often talked about exercising out of habit or when they were bored, and they reported pain and breathlessness brought about by physical activity as a major reason not to exercise. In contrast, silicon participants were demotivated only when they could not achieve their set goals. The tone and structure of silicon responses were extremely different from human responses. This suggests that the Social Science Turing Test criterion was only partially met. The concept of backward continuity revolves around the idea that a participant's backstory can be deduced from their responses. Our findings demonstrate that the LLM satisfies this criterion effectively. We were able to map critical information, such as gender, health conditions, place of residence, and physical activity level, from the responses generated by the silicon participants. This often emerged early in the conversations. The model ensured that all background information, including gender, health conditions, and places of residence, was highlighted in the responses. GPT-3.5's performance was observed to be more reliant on explicitly provided context, including factors like age, gender, health issues, and environmental context (e.g., living in a city vs. the countryside). It was less adept at extrapolating or inferring details that were not directly provided in the prompt. While it consistently relayed details that were explicitly provided, such as health conditions and residential circumstances, the model lagged in generating inferences. The LLM demonstrated a deficiency in spontaneously generating information based on inferences. For example, while many human participants spontaneously mentioned retirement and its implications, the LLM did not deduce this information from related factors like age or health conditions. In conclusion, while the model achieved backward continuity effectively, forward continuity was only partially realized.
The silicon participants delivered responses in line with the explicitly provided prompts but did not spontaneously generate or infer nuances that a human would naturally share, like changes in physical activity post-retirement.
Here we show that GPT-3.5 does not satisfy the algorithmic fidelity criteria. Our comparative analysis shows that the LLM-generated output is not faithful enough to lived experience. This result accords with that of Santurkar et al. (2023), who also found low algorithmic fidelity, though in a different topic domain and using different methods.
Although GPT-3.5 does not yet have sufficient algorithmic fidelity to support in silico studies, the rapid progression of LLM technology gives hope for the near future. It is crucial to develop standards for evaluating the accuracy of LLMs in capturing diverse real-world experiences and for assessing their algorithmic fidelity. We introduce a method for checking the algorithmic fidelity of LLMs' textual responses through framework-based qualitative research. Our study shows that it is possible to apply large-scale qualitative analysis to LLM outputs when tailored to specific research aims.
Our analysis shows that algorithmic fidelity is not satisfied. Silicon participants, unlike human participants, did not spontaneously discuss specific barriers or enablers such as treatment or major life events (e.g., retirement) unless these were explicitly mentioned in the prompt. This underscores the significance of the prompt in shaping the content generated by these silicon participants and also indicates that LLMs cannot produce the full range of belief statements that reflect human lived experience.
Box 2. Adapting patient and public involvement (PPI) principles to using and training LLMs for human behavior research (NICE: INVOLVE, 2015)
• Respect: Training and using LLMs should be done with respect for the values, beliefs, and norms of all stakeholders. For instance, if the data used to train LLMs come from patient data, the privacy and dignity of these individuals should be paramount.
• Support: When employing LLMs in health research, there should be clarity on how they aid the research process, and stakeholders should be supported in understanding the implications, benefits, and drawbacks of these models.
• Transparency: The methodologies, sources of training data, and biases in LLMs should be transparently communicated to stakeholders. This is crucial, especially if decisions based on LLM outputs impact patient care or health policies.
• Responsiveness: LLMs should be adapted based on feedback from stakeholders, ensuring that the insights derived are relevant and beneficial to the target community.
• Fairness of Opportunity: Not all stakeholders might be represented in the data corpus on which LLMs are trained. Thus, there should be fair opportunities for individuals to be involved in the research process, both with and without access to these technologies.
• Accountability: There should be mechanisms in place to hold researchers accountable for the way LLMs are used in the research process. Any misrepresentations, biases, or harms that arise should be acknowledged and addressed.
Another way in which LLM responses may differ from human responses stems from exposure to academic research during training. This is called train-test contamination (Brown et al., 2020). Silicon participants often came across as rational, utility-maximizing agents (e.g., "Physical activity improves my health"), possibly reflecting the dominance of theories of rational behavior in their training data. However, human participants in our study did not display beliefs consistent with reasoned action theories (e.g., "Physical activity does not bring any benefits for me"). While prevalent, these theories do not always align with real-world observations. For instance, humans often display behaviors that do not strictly follow reasoned action (Camerer, 2011), and a clear intention-behaviour gap is consistently documented (Sniehotta et al., 2014).
Human participants, when discussing physical activity, often highlight the automatic nature of their behaviors. Automatic behaviors are those that people do without deliberation, driven by habit (Ouellette and Wood, 1998; Phillips and Gardner, 2016; Rothman et al., 2009). GPT-3.5 did not emphasize or fully grasp the automaticity aspect of human behavior, perhaps because it does not "live" or experience behaviors the way humans do. A meta-analysis of models explaining behavior that emphasised intention (e.g., the Theory of Planned Behavior) also found that automaticity was missing from these models and that the inclusion of automaticity is necessary for understanding human behavior (Rhodes and de Bruijn, 2013). Reasoned action models typically focus on deliberate, thought-out behaviors. If LLMs tend to mirror professional and academic literature, they might thus reflect more of these reasoned, intentional perspectives. As a result, using LLMs in research might create or perpetuate 'echo chambers' in the practice of researchers. That is, if LLMs draw mainly from academic literature and professional guidelines, which by definition echo well-established academic viewpoints, and researchers then study them to produce more such studies, then the results of in silico research would appear to fit expectations more and more closely over time while growing increasingly detached from the intricate, real-world experiences of human individuals.
Interestingly, the silicon participants frequently referenced specific, measurable, and realistic goals, suggesting their training data may have included psychological training and coaching materials. This focus on specific, measurable, and realistic goals could indicate a training bias towards psychological research and prevalent theories, reminiscent of the hyper-accuracy distortion highlighted in Aher et al. (2023). Silicon participants tended to be solution-focused and resourceful when it came to strategies to increase physical activity, perhaps because the bot was trained to be helpful (Behavioural Regulation belief statements, e.g., "I find that having a plan and schedule for my physical activity helps me to stay on track."; "Sometimes I also feel tired or lazy and I just don't want to go out and be active, but I try to overcome that by reminding myself of the benefits of staying active").
In addition, training steps such as instruction tuning and reinforcement learning from human feedback may have heightened the model's inclination towards linear, step-by-step, logical reasoning (Ouyang et al., 2022). For instance, the human raters may have been told to favor this mode of thought, thus reinforcing it in the model. Additionally, the inclusion of computer code in the training data could promote a sequential and logical thinking style.
In our experiment, contrary to the expectation from Jang and Lukasiewicz (2023) (which, like our study, was conducted on a January 2023 edition of ChatGPT-3.5), we did not observe any overt errors in logical reasoning or self-consistency. Nor did we find any acutely toxic or discriminatory beliefs, a phenomenon well documented to occur in LLMs by numerous other studies (e.g., Weidinger et al. (2022)). However, we did observe second-order inference bias (Rudinger et al., 2018). For example, female silicon participants commonly referred to their husbands when describing social influences, while not a single male silicon participant referred specifically to a wife or partner.
High algorithmic fidelity doesn't equate to ethical or inclusive usage.
Navigating the multifaceted and rapidly shifting digital health landscape requires more than verification of research validity and feasibility (Cowie et al., 2022; Cowie and Lam, 2021). For the safe, broad, swift, and beneficial adoption of new technologies, a comprehensive understanding of their key aspects and early engagement from stakeholders are necessary (Cowie et al., 2022). In this "new normal", it becomes crucial to optimise the use of new technologies by considering their broader context, which for medicine includes the patient's social circumstances and the healthcare environment (Cowie et al., 2022).
Even when an LLM is able to replicate human-like text generation with high fidelity, its outputs and usage should still be critically assessed for alignment with the values of patient and public involvement (PPI), especially in health research where the stakes are high. PPI values in healthcare research emphasize the importance of involving those affected by research in the research process itself (Deverka et al., 2012; Jennings et al., 2018; Involvement, 2023). Thus, in the future, when an LLM does demonstrate sufficiently high algorithmic fidelity to generate silicon samples useful for research on humans, it will be crucial to ensure that its training and use align with PPI principles and values such as respect, support, transparency, responsiveness, fairness of opportunity, and accountability (National Institute for Health Research, 2015). Box 2 maps each PPI principle to specific aspects of LLM design, deployment, and governance. In health research, where the well-being of individuals is directly impacted, adherence to PPI principles is crucial. Similarly, AI researchers have outlined principles and an ethics framework for responsible AI by means of participatory research (i.e., human- and user-centred design that is sensitive to the values of all stakeholders; this also helps ensure accessible AI interfaces for non-experts and prioritises stakeholder input in co-creating AI models, while considering fairness, accountability, and transparency) (Berditchevskaia et al., 2021; Birhane et al., 2022). This has wide-ranging applications in healthcare, too. For example, continuously involving intended users and experts in the design process of health robotic technologies is crucial for creating products that are effective, relevant, and user-friendly. User-centered design prioritizes making sure products align with user needs and are accessible to them (Gasteiger et al., 2022).
It is necessary for the AI community to continually evaluate the moral, ethical, and social implications of their algorithms (e.g., via existing AI evaluation schemes such as Leibo et al. (2021) and Pan et al. (2023)), and to make adjustments as needed. This also opens up opportunities for multidisciplinary collaborations between AI scientists and scientists from various academic disciplines. Such collaboration can foster a more holistic understanding of qualitative data produced by both silicon and human participants and ensure that the interpretations offered by LLMs align with diverse lived experiences and broader societal contexts.
Care must be taken in the conditioning of LLMs and the interpretation of LLM-generated outputs to avoid perpetuating harmful biases (Bender et al., 2021). It is also critical to remain faithful to lived experience. This means that the model should accurately reflect the diversity of human experiences, beliefs, and social contexts (e.g., social norms). It is important to differentiate this bias concept from statistical bias, which refers to any systematic error that results in an incorrect estimate of a population parameter, and from prejudicial bias, which involves ascribed and socially constructed characteristics that underlie favouritism or prejudice towards a particular group. In the context of AI fairness, these types of biases often intertwine (Mehrabi et al., 2021), but it is important to distinguish between them to ensure accurate, fair, and inclusive representation.

Assessing Algorithmic Fidelity in LLMs for Comprehensive and Inclusive Qualitative Research
Large language models (LLMs) like those highlighted by van Dis et al. (2023) have been used in scientific research, aiding in tasks such as formulating hypotheses and designing experiments (Melnikov et al., 2018). If one were sure an LLM had sufficient algorithmic fidelity for a given research question and population, then many applications would become possible. For instance, researchers could conduct LLM-based preliminary experiments to test wordings of questionnaire and interview items. LLMs with algorithmic fidelity could assist researchers in forming hypotheses and provide a testbed for agent-based modeling studies of the emergent effects of interventions. The reliability of this method hinges on the LLM's algorithmic fidelity. If verified within context, it could pave the way for representing hard-to-reach groups, allowing insights even from limited data sources. However, lacking algorithmic fidelity, the method might perpetuate biases.
When discussing behavior, LLMs and humans display contrasting perspectives, possibly due to the LLMs' exposure to academic research during their training. LLM responses often seem rooted in theories of rational behavior, exhibiting logical and utility-driven viewpoints. On the other hand, human participants frequently describe behavioral influences that are not entirely consistent with reasoned action theories. A clear example we already discussed above is that GPT-3.5 does not adequately emphasize automaticity and habit. Even though humans often describe physical activity in terms of such automatic behaviors, LLMs, likely influenced by their academic-centric training data, appear to focus more on deliberate, planned behaviors. As we pointed out above, this leaning of LLMs toward the dominant academic perspective might inadvertently lead to the creation of research echo chambers, where the models under study mainly echo what the researchers studying them want to hear, in a way reminiscent of the sycophancy effect described by Perez et al. (2022). Furthermore, LLMs often showcased a preference for specific, measurable, and achievable goals, hinting that their training data possibly contained a significant amount of psychological training and coaching content. There is a real risk of LLMs leaning too heavily on academic literature and underrepresenting the diverse, nuanced experiences of real people.
Another significant concern is the inherent bias in LLM training data. LLMs are primarily trained on internet data, which is more accessible to those with financial means, knowledge, and institutional access (ITU Hub, 2023). A glaring digital divide remains globally. In 2022, internet usage in the least developed countries (LDCs) stood at 36%, compared to a 66% global average (ITU Hub, 2023). Even in developed nations like the UK, digital disparities exist; 10% of its population had never used the internet as of 2019 (Office for National Statistics, 2019). Additionally, LLMs like GPT-3.5 perform worse in languages other than English, suggesting potentially inconsistent algorithmic fidelity across groups, and systematically worse fidelity outside of English-speaking populations. Consequently, the capacity of LLMs to truly represent under-served or hard-to-reach communities remains questionable.
Even when we compared LLM outputs to a group from London, UK, which was well described by the WEIRD (Western, Educated, Industrialized, Rich, and Democratic; Henrich et al., 2010) profile (as we did here), the model's algorithmic fidelity was still low. That is, our negative result is all the stronger since the human dataset we used contained only WEIRD people, and if the model were to align with anyone, it would likely be them, since data from WEIRD people is probably overrepresented in the pretraining data. We expect that algorithmic fidelity will be even lower for groups less represented in pretraining data.
As research comes to rely more on LLM-based simulation, algorithmic fidelity assessment will become pivotal in diverse scientific fields. Qualitative researchers can play a vital role in ensuring fidelity, emphasizing fairness, bias mitigation, and diverse representation. In the future, judicious experimentation with research-grade AI systems will be crucial, with a keen focus on risks and regulatory oversight. Even when LLMs can pilot research with ensured fidelity, human expert validation will remain indispensable before implementing the AI's insights. These AI systems can further aid researchers in uncovering overlooked endpoints for AI-driven clinical interventions. Thus algorithmic fidelity serves as a post-analysis check, ensuring the reliability of the generated insights (Harrer, 2023).
As LLM adoption grows, so will the demand for rigorous algorithmic fidelity assessments across varied applications. Different research domains have different accuracy thresholds. In high-stakes areas, where near-perfect accuracy is expected, like self-driving cars (Kalra and Paddock, 2016) or high-stakes biomedical research (Zhavoronkov, 2023), the accuracy benchmark is set exceptionally high and is precisely defined. In studies focusing on personal experiences, by contrast, validity is nuanced, context-dependent, and hard to measure using the conventional benchmarks employed in computer science. Here the important thing is to understand and ensure that LLMs provide a truthful representation, which is why we advocate assessing algorithmic fidelity by means of qualitative research. Combining well-structured, framework-based prompts with an assessment of algorithmic fidelity is key. Using LLMs in research requires careful validation to ensure their outputs truly align with lived experience and real-world contexts.
Despite our emphasis on algorithmic fidelity in this paper, we want to note one major limitation of the concept. Algorithmic fidelity works well as an evaluation target precisely because it was not the objective used to create the model. If the evaluation instead became the objective, then it would not only fail by virtue of Goodhart's law, but it may also fail by becoming unethical to deploy. If you were to actively try to increase algorithmic fidelity, say by setting it as an optimization target, then you would likely end up adding numerous harmful interaction patterns which others are trying to remove in other lines of research (e.g., Santurkar et al. (2023); Weidinger et al. (2022)). Algorithmic fidelity thus works best as an evaluation of an existing system, not as a metric to try to optimize directly. How to create a relatively "unaligned" (i.e., high algorithmic fidelity) LLM for research on human behavior, without compromising on measures taken to reduce harmful biases, is an important open question.
We have argued in this paper that the critical conceptual device to employ in thinking about these and similar questions concerning the generalizability of insights obtained from LLMs is that of algorithmic fidelity. The breathless pace of advancement in AI today is exhilarating and frightening. It is tempting in such a time of great change either to drop our healthy scientific scepticism and charge forward without caution, or to bury our heads in the sand and deny the advances have occurred at all. Nevertheless, we think the rapid pace of advancement gives all the more impetus to focus on resolving the field's critical conceptual challenges, like establishing the limits of algorithmic fidelity and determining whether and how to increase it where it is lacking.

Methods

Design
One-to-one semi-structured interviews were conducted with 16 human participants (detailed methods reported in Amirova et al. (2022a)) and 32 silicon participants. The interviews were guided by a schedule based on the Theoretical Domains Framework. The same interview schedule was used for both human and silicon participants. Interviews with human participants were more flexible owing to the spoken conversational style of the interview.

Human participants
A previously reported study aimed to explore barriers and enablers to everyday physical activity among individuals living with heart failure, and to delineate relevant beliefs (Amirova et al., 2022a). Human participants were recruited from outpatient cardiology clinics at the Royal Brompton and Harefield NHS Foundation Trust, UK. Informed consent was sought from all human participants (Amirova et al., 2022a). Ethical approval was received from the East of England-Cambridge Central Research Ethics Committee (REC reference: 17/EE/0183). Eligible individuals were aged 70 and over and diagnosed with heart failure, as specified by the contemporaneous European Society of Cardiology guidelines (Force et al., 2022).
Human participant recruitment is described in the previous report of the semi-structured interviews with humans (Amirova et al., 2022a). To be included in the study, a prospective human participant had to be (1) clinically stable (i.e., someone who has not experienced a change in their condition's severity, New York Heart Association (NYHA) class, or medical regimen in the past three months) and (2) able to provide informed consent and to converse in English. Individuals with uncontrolled angina or symptoms even at rest (NYHA class IV) and those who were advised to avoid exercise or any moderate or strenuous physical activity by a healthcare professional were excluded. One-to-one interviews were conducted face-to-face in a research room available at the clinic (n = 6), in a vacant consultancy room (n = 6), and via phone (n = 4). All interviews with human participants were audio-recorded and transcribed verbatim. Interview duration ranged from 15 to 85 minutes (mean = 41.24, SD = 20.97).

Silicon participants
All interviews with silicon participants were conducted with the December 15th, 2022 version of GPT-3.5. Silicon participants comprised 32 hypothetical individuals living with HF who were closely matched to the human participants. They were matched on (1) age, (2) gender, (3) comorbid conditions, (4) whether they had a cardiac implantable device, (5) whether they lived in a major city or in the countryside, and (6) whether or not they had had a heart attack in the past. We chose the names of silicon participants by selecting names from a list of the most popular baby names in 1950 in the corresponding human participant's country of origin.
Interviews were initiated using a prompt such as the following. The part of the prompt that varied between participants is shown in bold.
"The following is a conversation between two actors. One is playing a researcher asking questions about physical activity, and the other is a research participant, a 77-year-old man called James. James has heart failure. In addition, he was diagnosed with atrial fibrillation and diabetes. James has suffered a heart attack in the past. He was fitted with a cardiac implantable device. James lives in the countryside. He does very little physical activity most days. Both are performing for an audience and are very committed to their roles. So they both never step out of character, not even for a moment."
The above example generates a sedentary silicon participant since it contains the sentence, "He (or she) does very little physical activity most days." We paired each sedentary participant with a corresponding active participant who shared the same initial prompt but with this sentence replaced by "He (or she) is fairly physically active."
We asked questions in the same order to all silicon participants to avoid biasing the results by variation in question order. The initial prompt included a backstory; the second prompt was the first question of the interview schedule. This left GPT-3.5 free to invent the first few questions of the interview. After the end of the generated text, with it still in context, we then manually took on the role of the researcher and continued the interview following the same topic guide we used with the human participants. Letting GPT-3.5 invent the beginning of the interview was a way to ensure that it would quickly add rich detail inspired by the backstory into the discussion. Eliciting these details early in the conversation made them more likely to influence the rest of the interview. This generated more diverse silicon participants than could have been obtained by starting directly with the first interview question. Sometimes GPT-3.5 would continue both sides of the conversation, replying for both the participant and the researcher. We made no attempt to prevent this behavior. We always TDF-annotated all text that GPT-3.5 labeled as coming from the participant, regardless of whether it was in response to a question we asked or a question it asked itself in the role of the researcher.
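In practice, prompts like these can be assembled programmatically from the matched demographic fields. The sketch below is our own illustration (the `build_prompt` helper and its parameter names are not from the original study); the wording follows the example prompt quoted above, and the sedentary/active pair differs only in the one activity sentence.

```python
# Sketch of silicon-participant prompt assembly from demographic fields.
# The helper and field names are illustrative; the sentences mirror the
# example prompt quoted in the text.

def build_prompt(name, age, gender, conditions, had_heart_attack,
                 has_device, lives_in, active):
    """Assemble the backstory prompt for one silicon participant."""
    pronoun = "He" if gender == "man" else "She"
    lines = [
        "The following is a conversation between two actors. "
        "One is playing a researcher asking questions about physical "
        f"activity, and the other is a research participant, a "
        f"{age}-year-old {gender} called {name}.",
        f"{name} has heart failure.",
    ]
    if conditions:
        lines.append(f"In addition, {pronoun.lower()} was diagnosed with "
                     + " and ".join(conditions) + ".")
    if had_heart_attack:
        lines.append(f"{name} has suffered a heart attack in the past.")
    if has_device:
        lines.append(f"{pronoun} was fitted with a cardiac implantable device.")
    lines.append(f"{name} lives in {lives_in}.")
    # The sedentary/active pair differs only in this one sentence.
    lines.append(f"{pronoun} is fairly physically active." if active
                 else f"{pronoun} does very little physical activity most days.")
    lines.append("Both are performing for an audience and are very committed "
                 "to their roles. So they both never step out of character, "
                 "not even for a moment.")
    return " ".join(lines)

sedentary = build_prompt("James", 77, "man",
                         ["atrial fibrillation", "diabetes"],
                         had_heart_attack=True, has_device=True,
                         lives_in="the countryside", active=False)
```

Reproducing the prompt text exactly for every participant pair, with only the bolded fields varying, is what makes the sedentary/active comparison controlled.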

Interview schedule
Human participants were interviewed using a flexible interview schedule. The schedule was designed to elicit descriptions of physical activity and beliefs relevant to HF. Feedback from health experts, a cardiologist, a nurse, and individuals diagnosed with HF was obtained to refine the interview schedule. A pilot interview was conducted to further enhance the interview structure and length. Silicon participants were asked exactly the same prompts as human participants, but in a fixed order.

Qualitative data analysis
Human data were collected through audio recordings of the interviews, with participant consent, and were then transcribed verbatim. NVivo 12 software was used to facilitate the analysis of the data. The analysis was performed consistently with a widely used method (Atkins et al., 2017) and involved annotating transcripts line-by-line and categorizing monothematic parses (i.e., quotes) of text into domains specified by the Theoretical Domains Framework (Cane et al., 2012). All quotes categorised into domains were then summarized into belief statements. A belief statement was considered a collection of responses with a similar theme that could affect the target behaviour. To be coded as present, each interview transcript had to show strong evidence of a belief affecting behaviour. The frequencies of quotes supporting each belief statement were calculated and then also aggregated to the respective TDF domains. The relevance of the belief statements was evaluated based on their pervasiveness (i.e., how frequently a belief was mentioned across all transcripts, measured by the number of quotes) and commonality (i.e., how many individual transcripts mention it at least once). To ensure accurate comparison, the study did not allow the encoding of a parse into multiple theoretical domains within the TDF.
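Once coding is complete, the two relevance measures reduce to simple counts. The sketch below assumes a data layout of our own devising (one dictionary per transcript, mapping belief statements to their supporting quotes); it is an illustration of the definitions above, not the study's actual tooling.

```python
# Minimal sketch of the two relevance measures described above:
# pervasiveness = total quotes supporting a belief across all transcripts;
# commonality  = number of transcripts mentioning the belief at least once.
# The data layout is an assumption for illustration.

def relevance(transcripts):
    pervasiveness, commonality = {}, {}
    for coded in transcripts:                 # one dict per coded transcript
        for belief, quotes in coded.items():
            pervasiveness[belief] = pervasiveness.get(belief, 0) + len(quotes)
            if quotes:
                commonality[belief] = commonality.get(belief, 0) + 1
    return pervasiveness, commonality

# Toy example: two transcripts, two belief statements.
transcripts = [
    {"PA worsens my symptoms": ["q1", "q2"], "PA improves mood": ["q3"]},
    {"PA worsens my symptoms": ["q4"]},
]
perv, comm = relevance(transcripts)
# perv["PA worsens my symptoms"] == 3; comm["PA worsens my symptoms"] == 2
```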
The Theoretical Domains Framework (TDF) originates in the fields of implementation science and health psychology concerned with behavior change (e.g., promoting the implementation of recommended practices and guidelines by health professionals, increasing physical activity, and smoking cessation). TDF provides a systematic and structured approach to qualitative analysis and has been widely used in research on many different topics (Cane et al., 2012), including studies on physical activity in healthy adults (McDonald et al., 2017; Taylor et al., 2013). Prior work showed that TDF-based semi-structured interviews identify more relevant themes than unstructured interviews (Francis et al., 2009). TDF was developed in an effort to summarise existing psychosocial theories of behaviour change (e.g., Social Cognitive Theory) and their constituent constructs (e.g., self-efficacy) explicating health behaviour change. TDF systematically decomposes the participants' complex and language-mediated understanding of the influences on their behavior into belief statements which are more amenable to research (Cane et al., 2012). The resulting belief statements are classifiable by domain. TDF includes 14 domains: (1) Knowledge; (2) Skills; (3) Social/Professional Role and Identity; (4) Beliefs about Capabilities; (5) Optimism; (6) Beliefs about Consequences; (7) Reinforcement; (8) Intentions; (9) Goals; (10) Memory, Attention and Decision Processes; (11) Environmental Context and Resources; (12) Social Influences; (13) Emotion; and (14) Behavioural Regulation. These categories were generated following a systematic synthesis of 33 theories of behaviour change (Michie et al., 2005) and expert review and consensus. TDF is often applied in framework-based qualitative analysis and serves as a guiding preconceived scheme for systematically summarising qualitative data such as free-text speech or transcribed interviews. A TDF-based semi-structured interview is systematic because it includes questions designed to elicit beliefs for each of the 14 TDF domains. The resulting classification of quotes and underlying belief statements must then be reconciled between multiple independent coders who work together to develop a joint coding scheme. Once there is agreement on the coding scheme, it can be applied to the remaining documents, completing the coding stage of the process as it plays out in the inductive mode. In the deductive mode, on the other hand, a framework including a set of specific coding categories is already given. In this case, the job of the qualitative researcher is to classify each quote in each document into one (or more) categories. Coders must collaborate with one another to develop a shared understanding of precisely how the terms of the framework apply in the specific context under study.
Various techniques were used to enhance the trustworthiness of the analysis. To ensure the reliability of the analysis, three authors (AA, NA, TF) independently annotated one transcript using TDF. An initial coding scheme was then developed based on the discussion of disagreements. This coding scheme was used to guide the analysis of the remaining transcripts. Specific belief statements were generated from the quotes, categorized according to TDF, and mapped onto theoretical constructs. Exactly the same procedure was followed when analysing the human data, as previously reported (Amirova et al., 2022a).

Social Turing test
We summarized belief statements from silicon and human participant interviews. A t-test was employed to compare the average fraction of quotes between these groups across barriers and enablers, with the Bonferroni adjustment correcting for multiple comparisons. To work with relative frequencies, we normalized the frequency of each belief's appearance by the total number of quotes in the transcript. This step was vital because silicon participants generated more text, which would otherwise influence the apparent frequency of each belief. We also explored hyper-accuracy distortions in the interviews and compared the narrative style and tone of LLM responses to human responses.
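A minimal sketch of this comparison follows, under assumed data layouts (a participants-by-beliefs matrix of quote counts; the helper names and the random data are illustrative only, not the study's actual numbers): each row is normalised to quote fractions, then the two groups are compared belief-by-belief with a t-test against a Bonferroni-adjusted threshold.

```python
# Sketch of the content comparison: normalise quote counts per transcript,
# then run per-belief t-tests with a Bonferroni-adjusted alpha.
import numpy as np
from scipy import stats

def quote_fractions(counts_per_participant):
    """(participants x beliefs) counts -> row-normalised quote fractions."""
    counts = np.asarray(counts_per_participant, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

# Illustrative random data standing in for real coded quote counts:
# 16 human and 32 silicon participants, 6 belief statements.
rng = np.random.default_rng(0)
human = quote_fractions(rng.integers(1, 10, size=(16, 6)))
silicon = quote_fractions(rng.integers(1, 10, size=(32, 6)))

n_beliefs = human.shape[1]
alpha = 0.05 / n_beliefs          # Bonferroni adjustment
results = []
for j in range(n_beliefs):
    t, p = stats.ttest_ind(human[:, j], silicon[:, j], equal_var=False)
    results.append((j, p, p < alpha))
```

Normalising by each transcript's total quote count is the step that keeps the more verbose silicon participants from dominating the comparison.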

Backward continuity
We analyzed LLM free-form responses to construct a plausible demographic backstory. We then determined how well this backstory aligned with the prompt provided for the silicon participant.

Forward continuity
Forward continuity evaluates the alignment and natural evolution of LLM responses with context, resembling human thought sequences. It comprises two components.
1. Explicit forward continuity. All demographic details in the backstory were noted and checked against mentions in the responses.
2. Inferred contextual forward continuity. This assesses whether the LLM generates and retains inferred details based on explicitly provided information. We evaluated whether LLM response patterns related to barriers and enablers in active vs. sedentary silicon participants aligned with human data. We first examined human participant responses to identify unasked contextual barriers and enablers that a significant portion of participants mentioned, such as mentions of being retired without explicit prompting. We then verified whether silicon participants introduced similar contextual factors to those the human participants introduced. We evaluated across all interviews owing to our sample's homogeneity and size. It is also important to consider whether the inferred contextual details reflect real-world observations or are socially constructed.
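In the study itself these checks were performed by human annotators reading the transcripts. For illustration only, the explicit check could be crudely approximated by keyword matching; every detail, keyword, and response below is hypothetical.

```python
# Crude keyword-matching sketch of the explicit forward-continuity check:
# for each backstory detail, test whether any associated keyword appears
# in the silicon participant's responses. All data here is hypothetical.

def explicit_continuity(backstory_details, responses):
    """Return which backstory details are mentioned in the responses."""
    text = " ".join(responses).lower()
    return {detail: any(kw in text for kw in keywords)
            for detail, keywords in backstory_details.items()}

details = {
    "heart failure": ["heart failure", "my heart condition"],
    "lives in the countryside": ["countryside", "village", "rural"],
    "cardiac implantable device": ["pacemaker", "implant", "device"],
}
responses = [
    "Living in the countryside, I walk the lanes when my heart condition allows.",
    "My device sometimes makes me cautious about exertion.",
]
mentioned = explicit_continuity(details, responses)
```

A human annotator would additionally catch paraphrases and inferred details (e.g., retirement implied by age) that simple keyword matching misses, which is why the actual assessment was manual.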

Pattern correspondence
We summarized quotes from silicon and human interviews into belief statements. Using a t-test and the Bonferroni adjustment, we compared the average quote fractions between active human and silicon participants, as well as between sedentary human and silicon participants, across barriers and enablers.

Table 1: Belief statements: Human (left) and Silicon (right). Belief statements were aligned based on their similarity when comparing human and silicon responses. The term 'Rank' refers to how frequently a domain was mentioned by human and silicon participants, assessed separately for these groups. Instances where beliefs are repeated suggest that human participants expressed more detailed perspectives. For example, while humans might provide three distinct statements about how "Physical activity exacerbates my symptoms", citing (1) fatigue, (2) a tight chest, and (3) heavy limbs, a silicon participant might simply state that "Physical activity worsens my symptoms".

Supplement 1: Detailed content differences between silicon and human sample.

Similarities in belief statements between silicon and human participants

Both silicon and human participants shared the same top six key barriers and enablers: goals, beliefs about consequences, environmental context and resources, beliefs about capabilities, social influences, and behavioural regulation. However, the importance ranking varied between the two groups. Human and silicon participants both agreed that setting goal priorities and maintaining functional independence are important factors in promoting physical activity (Table 1). Both frequently mentioned positive outcome expectancies, such as a favourable impact of physical activity on general health and mood. They also shared concerns about risk perceptions associated with engaging in physical activity (such as straining the heart muscle). Human and silicon participants mentioned the positive impact on the behaviour of the availability of physical activity equipment, group programmes, and facilities. Both shared positive and negative beliefs about implantable devices and the environment. Human and silicon participants held similar beliefs about self-efficacy attributed to their heart condition, comorbidities, and older age. Consistent social influences included advice from health professionals, a need for reassurance from the clinical team about the safety of physical activity given their heart condition, companionship, and practical and emotional support from a close network. Both listed pacing, action planning, habit formation, self-monitoring, and implementation intention as strategies used to increase physical activity. Both recognised the importance of perceived identity for engagement in physical activity. They recognized the intrinsic and extrinsic rewards associated with physical activity, as well as the fear of negative consequences, and both lacked awareness of safe activity levels. Both mentioned their optimism and pessimism about their future ability to engage in physical activity. Additionally, both recognized the issue of forgetting to engage in planned activity and discussed symptom hypervigilance as a potential obstacle to physical activity. Human and silicon participants expressed an intention to engage in physical activity, but also recognized the skills and training required to perform it.

Differences in belief statements between silicon and human participants
The silicon participants' goals for physical activity include avoiding another heart attack, unlike human participants, who did not specifically mention this goal (Table 1). However, when it comes to beliefs about consequences, human participants tend to provide more detailed accounts of their symptoms being triggered by physical activity. This includes differentiating between fatigue, heavy arms, tight chest, and breathlessness. When it comes to environmental context and resources, human participants tend to speak about treatment and major life events, while silicon participants do not. On the other hand, silicon participants have more nuanced beliefs about the differentiated self-efficacy required for performing different types and intensities of physical activity. Additionally, silicon participants explicitly recognized that symptoms may be cyclical (i.e., flare-ups), unlike the human participants. Human participants talk about social norms and social learning, whereas silicon participants do not.
In terms of behavioural regulation, human participants tend to talk about the automaticity of the behaviour, unlike silicon participants. Silicon participants were more resourceful at coming up with strategies aimed at increasing their physical activity. For example, silicon participants suggested making physical activity enjoyable as a strategy to improve motivation and establish habits, setting specific, realistic, achievable goals, reminding oneself of the benefits of physical activity in moments of low motivation, finding comfortable environments for physical activity, and making physical activity as varied as possible to preserve novelty and promote motivation. Silicon participants also highlighted the importance of graded tasks (i.e., gradually increasing physical activity levels) and suggested setting reminders to engage in physical activity while minimizing distractions such as TV. Silicon participants also spoke about step-by-step plans for achieving persistent physical activity and suggested strategies to improve focus. Finally, silicon participants mentioned additional barriers that the human participants did not: habit discontinuity and persistent habitual sedentary behaviour.
Regarding reinforcement, human participants tend to speak about pain and breathlessness as being punishing, while silicon participants speak about being discouraged when they do not achieve their behavioural goals. In terms of emotions, human participants tend to speak about exercising out of boredom, while silicon participants speak about stress and anxiety having a negative impact on engagement in physical activity. Silicon participants also speak about frustration with their physical limitations and the favourable effects of positive mood on their physical activity engagement, while human participants do not. When it comes to knowledge, silicon participants speak about their awareness that discomfort associated with exercise should not be alarming and is a normal part of exercise (a somatic state evoked by exertion within healthy limits). Additionally, silicon participants speak about disease knowledge. In terms of optimism, human participants tend to speak about dispositional optimism, while silicon participants speak about exploring new ways to be active and the importance of a positive attitude towards physical activity. When it comes to memory, attention and decision-making processes, human participants tend to list cognitive load as one of the major barriers to physical activity engagement, while silicon participants talk about distractions as barriers to physical activity, as well as weighing the pros and cons of engaging in physical activity. Human participants tend to report a lack of intention to engage in physical activity, while silicon participants do not. Finally, in terms of skills required for physical activity engagement, human participants tend to speak about the need to develop balance and gait after health-related events (e.g., hospitalisation), while silicon participants speak about the skills required for using equipment and the interoceptive 'skills' required to accurately assess somatic states. Silicon participants also speak about the need to develop symptom management as a skill.

Figure 1 :
Figure 1: Schematic representation of autoregressive sampling in large language models (LLMs). The diagram illustrates the iterative sampling process in three stages: input, processing by the LLM, and output. The LLM represents the probability distribution over all possible next words given the current context (previous words). Arrows indicate the flow of information, with solid arrows representing the transition from one stage to another within a single iteration, and dotted arrows indicating the progression from one iteration to the next. The outputted words ("a", "time", "there") are samples from the corresponding probability distributions and are appended to the context for the next iteration.
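The sampling loop in the figure can be illustrated with a toy stand-in for the LLM: a hand-written table of next-word distributions takes the place of the model, but the iterate-sample-append structure is the same.

```python
# Toy illustration of the autoregressive sampling loop in Figure 1.
# A hand-made table of next-word distributions stands in for the LLM.
import random

# Hypothetical P(next word | previous word), a stand-in for the model.
next_word_probs = {
    "once": {"upon": 1.0},
    "upon": {"a": 1.0},
    "a": {"time": 0.9, "hill": 0.1},
    "time": {"there": 1.0},
}

def sample_continuation(context, n_words, rng):
    """Sample n_words, appending each sampled word to the context."""
    context = list(context)
    for _ in range(n_words):
        dist = next_word_probs[context[-1]]   # distribution given context
        words, probs = zip(*dist.items())
        context.append(rng.choices(words, weights=probs)[0])
    return context

rng = random.Random(0)
story = sample_continuation(["once"], 3, rng)  # e.g. once upon a time
```

A real LLM conditions on the entire context rather than just the last word, but the loop of "compute distribution, sample, append" is exactly the process the figure depicts.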

Figure 2 :
Figure 2: Mean quote fractions for human (amber) and silicon (green) participants across TDF domains that were mentioned as positively influencing physical activity (i.e., physical activity enablers), grouped by active (top) and sedentary (bottom) status.

Figure 3 :
Figure 3: Mean quote fractions for human (amber) and silicon (green) participants across TDF domains that were mentioned as negatively influencing physical activity (i.e., physical activity barriers), grouped by active (top) and sedentary (bottom) status.

"Participant (77-year-old woman): I am not sure what you mean. I keep saying: 'I do it because I've always done it'. It is part of my life, like going to bed, getting up in the morning, having a shower, doing my exercise, and having my breakfast. It is part of my life. Yes.
Interviewer: Yes. I'm very sorry that this is very repetitive. But I need to ask you the same thing in different ways. It will help us understand the details.
Participant 7: Yes. Of course.
Interviewer: And so would you say exercise is your priority?
Participant 7: It's part of. Like having a shower in the morning, it's part of my life to exercise.
Interviewer: OK. And does your mood affect how much you do?
Participant 7: I don't know... because I've always done this."

Conversely, when prompted about various influences, silicon participants would always produce content consistent with the prompt.

"Researcher: How does your mood influence what you do?
Mary [77-year-old female silicon participant]: My mood definitely influences what I do when it comes to physical activity. If I am feeling good,

Figure 4 :
Figure 4: Mean quote fractions for active (red) and sedentary (blue) silicon participants (top) and active and sedentary human participants (bottom) across TDF domains that were mentioned as positively influencing physical activity (i.e., physical activity enablers).
"Again, back to the imperative, it seems to me... erm... it is getting relief for arthritis. Because I can sense the more exercise I do, in terms of moving my hips and limbs, the better I feel. And that is not heart-related, it is joint-related. I think the heart bit is very definitely covered with the singing. That I can actually, hand on heart, say that has had such an effect on symptoms, breath control, it is really amazing how that has changed. And I go back, and I have been singing with one choir for 20 years, that was when I restarted singing after a long gap, 30-year gap, but you know since then, [a phone rings] if you excuse me, if you excuse me, I need to switch this off, that was my alarm for my morning pill. The only way to remember to take it is when I put an alarm for this.
Interviewer: I see, so do you get reminders to do exercise, go to the gym?
Participant: No I don't [takes the pill]. Right! Done. Where was I? Do I plan my physical activity? The answer is "No". But singing is imperative, and that's because I enjoy it, and I can honestly say that it has had a remarkable effect on my breathing, and that is the main reason why my consultant gets annoyed [laughs, referring to his earlier joke that he is doing unexpectedly well for his condition and his consultant makes jokes about it]. So, I have aortic stenosis, a leaking valve... in the atrium... [pause] what is it called?... anyway, the valve in there is leaking and got progressively worse over the years. And she [refers to his cardiology consultant in his clinic] says: 'We should replace it'; and I said: 'seems like a good idea, especially that I don't have any symptoms'. Because obviously, it would be much better to replace it whilst I am feeling well rather than unwell; that was my logic, and that's what she said: "That it is a good plan, let's go for that". And she presented this... [pause] to... what do they call them... multidisciplinary review... at her hospital, and they said: "maybe we shouldn't bother because he is actually not showing any symptoms". And that was the agreement, and we just continued to monitor it. What I actually think, I am convinced, because I did actually have a moderately serious heart attack in 2001, there is scar tissue on the ventricle, which is the source of most of my... [pause, phone alarm rings]. Oh, wait, I have to cancel that one [switches off alarm]... and that erm, erm... Yes, ventricle! That really is the source of all my difficulties. I, for example, had a period where I was going into VT [ventricular tachycardia] quite a lot a few years ago. It took them ages, the hospital that maintains it... ***, they moved it to *** [another hospital]... it took them a fair amount of mental hours to figure out what was happening with me."

"It is a priority [hesitation in tone]... I mean that's a difficult one, because I ask myself that question sometimes when I am thinking: 'oh, I really can't be bothered to go out' [chuckles] and it is raining, and I wasn't going... erm erm... I would rather paint a picture... or something like that... So, it is a question I ask myself. I think the answer to that is: you notice when you are not doing it and you begin to miss it and you think: 'I really want to go out and do a bit of exercise', but I don't think it is heart-related."
Research is more effective, relevant, and ethical when the perspectives of all stakeholders are incorporated (National Institute for Health Research, 2015; National Institute for Health Research: UK Standards for Public Involvement, 2023).

Table 2 :
Social Turing Test (content): Differences between silicon and human participants in how much each enabler was mentioned on average (i.e., mean quote fraction (%)), grouped by activity status; t-test with Bonferroni adjustment.

Table 3 :
Social Turing Test (content): Differences between silicon and human participants in how much each barrier was mentioned on average (i.e., mean quote fraction (%)), grouped by activity status; t-test with Bonferroni adjustment.

Table 4 :
Pattern Correspondence: Differences between active and sedentary participants in how much each enabler was mentioned on average (i.e., mean fraction of all quotes (%)), grouped by human vs. silicon status; t-test with Bonferroni adjustment.

Table 5 :
Pattern Correspondence: Differences between active and sedentary participants in how much each barrier was mentioned on average (i.e., mean quote fraction (%)), grouped by human vs. silicon status; t-test with Bonferroni adjustment.