Correction
25 Nov 2025: Jowsey T, Stapleton P, Campbell S, Davidson A, McGillivray C, et al. (2025) Correction: Frankenstein, thematic analysis and generative artificial intelligence: Quality appraisal methods and considerations for qualitative research. PLOS ONE 20(11): e0337734. https://doi.org/10.1371/journal.pone.0337734
Abstract
Objective
To determine accuracy and efficiency of using generative artificial intelligence (GenAI) to undertake thematic analysis.
Introduction
With the increasing use of GenAI in data analysis, testing the reliability and suitability of using GenAI to conduct qualitative data analysis is needed. We propose a method for researchers to assess reliability of GenAI outputs using deidentified qualitative datasets.
Methods
We searched three databases (United Kingdom Data Service, Figshare, and Google Scholar) and five journals (PLOS ONE, Social Science and Medicine, Qualitative Inquiry, Qualitative Research, Sociology Health Review) to identify studies on health-related topics, published prior to 2022, whereby humans undertook thematic analysis and published both their analysis in a peer-reviewed journal and the associated dataset. We prompted a closed system GenAI (Microsoft Copilot) to undertake thematic analysis of these datasets and analysed the GenAI outputs in comparison with human outputs. Measures included time (GenAI only), accuracy, overlap with human analysis, and reliability of selected data and quotes.
Results
Five studies were identified that met our inclusion criteria. The themes identified by human researchers and Copilot showed minimal overlap, with human researchers often using discursive thematic analyses (40%) and Copilot focusing on thematic analysis (100%). Copilot’s outputs often included fabricated quotes (58%, SD = 45%) and none of the Copilot outputs provided participant spread by theme. Additionally, Copilot’s outputs primarily drew themes and quotes from the first 2-3 pages of textual data, rather than from the entire dataset. Human researchers provided broader representation and accurate quotes (79% of quotes were correct, SD = 27%).
Citation: Jowsey T, Stapleton P, Campbell S, Davidson A, McGillivray C, Maugeri I, et al. (2025) Frankenstein, thematic analysis and generative artificial intelligence: Quality appraisal methods and considerations for qualitative research. PLoS One 20(9): e0330217. https://doi.org/10.1371/journal.pone.0330217
Editor: Jiankun Gong, Universiti Malaya, MALAYSIA
Received: February 5, 2025; Accepted: July 28, 2025; Published: September 5, 2025
Copyright: © 2025 Jowsey et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Some data are held in the UK Data Service repository; four of the five datasets have safeguarded restrictions. Dataset 1 (Barlow et al.): ISSN 0277-9536, https://doi.org/10.1016/j.socscimed.2021.113761. Dataset 2 (Hervey & Antova): UK Data Service SN: 854778, DOI: 10.5255/UKDA-SN-854778. Dataset 3 (Dunn et al.): UK Data Service SN: 854245, DOI: 10.5255/UKDA-SN-854245. Dataset 4 (Holman & Walker): UK Data Service SN: 855082, DOI: 10.5255/UKDA-SN-855082. Dataset 5 (Arora): UK Data Service SN: 855953, DOI: 10.5255/UKDA-SN-855953.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
“So much has been done, exclaimed the soul of Frankenstein – more, far more, will I achieve; treading in the steps already marked, I will pioneer a new way, explore unknown powers, and unfold to the world the deepest mysteries of creation.” – Mary Shelley [1].
Mary Shelley’s Frankenstein (1831) is a classical text warning society of the potential for conflict that stems from human failure to recognize that actions have repercussions (and that humans are not gods). It is also a story about a monster created by a well-meaning scientist [1]. In this article, we conjure up such realities in an analysis of the potential – and associated repercussions – of Generative Artificial Intelligence (GenAI) large language models (LLMs) for qualitative analysis. Presently, GenAI can be used to undertake social science analysis according to seemingly any epistemology or interpretive framework. This creates the possibility of a GenAI-augmented analysis, which several researchers have recently explored with Chat-GPT [2–7]. Using GenAI to undertake qualitative analysis holds the potential to exponentially speed up analysis from years to seconds [8].
Tensions exist. Kidder and Fine (1987) describe a distinction between ‘small q qualitative’ and ‘Big Q Qualitative’ scholarship [9]. Small q qualitative scholarship is proceduralist or technique-focused qualitative research that is often applied within positivist, mixed-method, and deductive research projects. Examples of small q qualitative research abound within medical and health sciences domains. By contrast, Big Q qualitative research is concerned with aligning researcher values and epistemologies with the methods undertaken and the ways in which findings are presented. Big Q qualitative research is open to creative methodological approaches such as discursive or “epistemologically radical” forms [10] and is rarely published in medical and health sciences domains. We argue that many Big Q qualitative approaches – such as grounded theory and phenomenology studies – cannot or should not be delegated to GenAI. Even multi-modal a/r/t-o-graphical or autoethnographic approaches would need very careful consideration of how to engage with GenAI in productive and rigorous ways.
Fiery debate between social scientists concerns whether quality reporting checklists such as the Consolidated Criteria for Reporting Qualitative Research (COREQ) and Standards for Reporting Qualitative Research (SRQR) are appropriate and support methodological congruence. Several leading social scientists have warned that checklist and reporting requirements of journals (such as use of the COREQ and SRQR or Introduction, Methods, Results and Discussion (IMRaD) format of abstracts) create rigidity that (somewhat ironically) inhibits transparency [10], and can cause what Sparkes and Smith have called methodolatry [11].
Could GenAI be a new source of methodolatry? Or is it the godsend that small q qualitative researchers have been waiting for? Given the popularity – and even requirements – of COREQ, SRQR and other checklist use in medical and health science journals, small q qualitative researchers should be asking: what are the risks and opportunities for qualitative research that GenAI poses in terms of quality, rigour, transparency, and trustworthiness? In this paper, we consider trustworthiness specifically in relation to thematic analysis, which is a popular qualitative approach in medical and health sciences [12]. Thematic analysis has been described by Braun and Clarke, and others as a method used to identify and report recurring thematic patterns in qualitative data [13–16].
Prior to the release of Chat-GPT 3.5, there was a rich body of social science literature – both small q and Big Q – exploring analytical trustworthiness [17–20]. Trustworthiness is achieved through consistency and transparency about the methods undertaken including analytical processes, and involves such processes as data reduction (selecting, concentrating, simplifying, structuring, and converting data) [21] and reflexivity [19]. When multiple social scientists analyze the same dataset, they might identify different points of significance [18]. Perfect replicability is not the goal in qualitative approaches, as researchers acknowledge the role of subjectivity and context. Instead, the focus of small q qualitative research is increasingly on demonstrating consistency, transparency, and trustworthiness in the analytical process [10,13,22–25]. The specific combination of reliability techniques used may vary depending on the research paradigm and methodological approach adopted. In medical and health science scholarship, qualitative researchers often work in teams, undertaking member checking [22] and group discussions to agree on the themes most relevant to the research question and most reflective of the data.
When Chat-GPT 3.5 was released and scientists started exploring its functionality for analysis, the issue of transparency quickly came to the foreground. As with much GenAI software, Chat-GPT 3.5 outputs were not accompanied by transparent evidence of how GenAI arrived at the outputs. Many GenAI software developers, such as Avidnote, have since worked on ways to increase transparency, which is reassuring for social scientists. These are, however, usually behind a paywall. To date, few studies have compared human research outputs with GenAI outputs on existing qualitative datasets. Morgan compared Chat-GPT with human researchers on two existing studies and found Chat-GPT performed “reasonably well” at identifying descriptive themes from the two datasets. Morgan suggested GenAI could reduce the burden on human researchers for manual comparison coding of qualitative data [26]. Echoing this, Bennis & Mouwafaq used nine GenAI models (Llama 3.1 405B, Claude 3.5 Sonnet, NotebookLM, Gemini 1.5 Advanced Ultra, ChatGPT o1-Pro, ChatGPT o1, GrokV2, DeepSeekV3, Gemini 2.0 Advanced), and found that overall GenAI tools are helpful in reducing the time needed to analyse qualitative data [27]. However, little is known in the literature about other tools, including Microsoft’s Copilot. This is despite Copilot’s superiority in data governance and security due to its ‘closed’ nature [28], potentially making it a safer choice for analyzing sensitive human data.
We wanted to know if evidence supports the use of general purpose LLMs for undertaking thematic analysis, and if so, how such use contributes to Big Q and little q discourse. We aimed to build on this small body of evidence from Morgan and Bennis & Mouwafaq to determine accuracy and efficiency of using a general-purpose LLM to undertake thematic analysis [26,27]. We asked: To what extent does Copilot provide reliable thematic analysis of existing previously published qualitative datasets?
To evaluate the reliability of thematic analysis outputs generated by GenAI, we identified thematic analysis studies whereby human researchers had published a thematic analysis and published the associated dataset. We prompted GenAI to analyze these same datasets and then compared human and GenAI outputs.
Materials and methods
The GenAI tool we selected was Copilot on the grounds that it was a general-purpose GenAI, approved for use by both our institution and the UK Data Service, and was a closed GenAI system (i.e., access to Copilot is limited and password protected, so that the data are protected and are not used to train foundation GenAI models). The UK Data Service permitted us to include in our study data published through their repository but only for analysis through approved closed GenAI systems (i.e., Copilot) and not for open GenAI systems like ChatGPT, Perplexity, Claude or Gemini. We conducted comparative analysis of the themes and supporting evidence provided by human researchers in their published paper versus GenAI. We appraised outputs in terms of transparency, and trustworthiness, and also with reference to the COREQ checklist. Our comparative analyses were all conducted by human researchers without the aid of GenAI in any of our processes.
Search strategy
In August and September 2024, we conducted a comprehensive search across three databases (UK Data Service, Figshare, and Google Scholar) and five relevant academic journals, namely PLOS ONE, Social Science & Medicine, Qualitative Inquiry, Qualitative Research, and Sociology of Health Review. Our objective was to identify studies published prior to 2022 (before the boom in GenAI precipitated by the release of the freely available OpenAI Chat-GPT), in which researchers performed thematic analysis and published both their datasets and their associated analyses in peer-reviewed journals. Eligibility criteria assessing the participants, concept and context of published studies are presented in Table 1. We then prompted Copilot to perform thematic analyses on the identified datasets and subsequently analysed the outputs generated by the GenAI. The metrics assessed included time taken for analysis, accuracy of the results, overlap with human-generated analyses, and the reliability of selected data and quotations.
Studies were included if the analysis was published prior to 2022, available in English, and included thematic analysis. A subset of thematic analysis is discursive thematic analysis, whereby analysts consider how socio-cultural contexts inform the language (i.e., discourse) presented through textual data (i.e., spoken or written texts). Only peer reviewed published studies were included. The following also applied:
- Exclude studies not concerning healthcare
- Exclude studies that do not apply a thematic analysis
- Exclude animal studies
- Exclude simulation studies and education studies conducted outside of the healthcare or health policy setting
- Exclude protocols, editorials, commentaries and opinion papers
- Exclude unpublished studies/ grey literature
- Exclude reviews
Study selection
Studies were identified according to inclusion criteria. Two authors (TJ & JK) searched for studies in UK Data Service and Figshare, and in these journals: Social Science and Medicine, Qualitative Inquiry, Qualitative Research, and Sociology Health Review. Search terms were ‘thematic analysis AND healthcare’ filtered by year 2000–2021. Where studies were identified via UK Data Service or Figshare, we then searched on Google Scholar, using key author names (authors in position 1, 2, or final author) and key words from the dataset title to identify published thematic analysis studies that reported on the identified dataset. Where studies were identified via journals, we searched for hyperlinked supplementary files of qualitative datasets. We identified four studies through the UK Data Service that met our inclusion criteria, so sought and obtained UK Data Service permission to analyze Copilot capability using these studies. We also sought UK Data Service permission to use Perplexity (an open GenAI system) to compare with Copilot; however, this was rejected due to data security concerns. One of the four studies identified through the UK Data Service [31] was included, although the authors only made 15 of 54 interviews available through the repository. We uploaded these 15 interviews in Copilot for analysis.
Data extraction
We worked online as a group at the same time, entering Copilot in incognito mode and uploading each dataset along with a structured zero-shot prompt (Table 2). A zero-shot prompt is one that provides no examples and relies on the LLM’s pretrained knowledge to interpret and complete the task. Prior to running the prompt through Copilot, the team developed, tested, refined, reviewed, and piloted the prompt wording. We ran the prompt three times for each dataset. Each time we completed a prompt-output cycle and saved the output, we closed down Copilot completely and then reopened it to run the next prompt. We recorded time taken to produce each output and then appraised the first output for each study. Each output was appraised by two independent researchers. We then met as a whole team to discuss methods and results.
We appraised both Copilot outputs and human researcher published articles; we identified markers of rigour and trustworthiness, extracted themes, and checked every reported quote against its related published dataset. When checking quotes for accuracy against source material, we searched for each output quote in total and in parts (three consecutive words and key word searches). In instances where we could not identify any partial sentences or similar sentences to the reported quote we deemed this fabricated.
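The quote-checking steps described above (search for the whole quote, then for runs of three consecutive words) can be sketched in code. The Python below is a minimal illustration only: the checks in this study were carried out manually by human researchers, and the function names, the normalization rules, and the sample transcript text are all assumptions made for the example. The sketch omits the key word searches, and it only flags candidates; deciding whether a partial match is a "modified" quote or a fabricated one still calls for human judgement.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def contains_phrase(source: str, phrase: str) -> bool:
    """True if the normalized phrase occurs in the source on word boundaries."""
    return bool(phrase) and f" {phrase} " in f" {source} "


def three_word_windows(text: str) -> list[str]:
    """Every run of three consecutive words in the normalized text."""
    words = normalize(text).split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


def check_quote(quote: str, dataset: str) -> str:
    """Return 'verbatim', 'partial match', or 'no match' for a reported quote."""
    source = normalize(dataset)
    if contains_phrase(source, normalize(quote)):
        return "verbatim"
    if any(contains_phrase(source, w) for w in three_word_windows(quote)):
        return "partial match"
    return "no match"


# Hypothetical transcript text, invented for illustration only.
dataset = (
    "I waited three hours at the clinic. "
    "Nobody told me anything about the delay."
)
for quote in (
    "Nobody told me anything about the delay.",
    "I waited four hours at the clinic",
    "The doctors were always kind and helpful",
):
    print(check_quote(quote, dataset), "|", quote)
```

In this toy run the first quote is found verbatim, the second shares three-word runs with the source and would go to a human for a modified-versus-fabricated decision, and the third has no match at all.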
For markers of rigour and transparency we drew on the Consolidated criteria for reporting qualitative research framework (COREQ) [29]. COREQ assesses the rigour of qualitative research, such as characteristics of the research team, reflexivity practices used in the research, the strength of the methodological design and theoretical framework, ethical considerations, description of thematic coding and theme generation, and the presentation of findings with supporting quotes.
Results
Five datasets [30–34] and their associated publications [35–39] were included for appraisal and prompts were undertaken in September 2024. The duration of time from prompt to output ranged from 61 seconds to 123 seconds, with smaller datasets associated with shorter duration to produce an output. Three datasets [30,32,34] showed the time Copilot took to produce an output shortened each time it was prompted (Fig 1).
Themes
Quotes and themes identified by human researchers in the published paper and Copilot are detailed in Supporting Information File 1: S1 Fig, S2 Table, and S3 Table. Overlap in themes identified by human researchers and Copilot was minimal, partly reflecting that in 2/5 studies the human researchers reported discursive thematic analyses referencing socio-cultural contexts while 5/5 Copilot outputs reported thematic analysis that did not reference socio-cultural contexts. One of the studies called their analysis ‘content analysis’ but reported themes and we could not identify any elements in the study that suggested thematic analysis had not been conducted [33] (Supporting Information File: S3 Table). Although Merkel et al. only made 15/54 interview transcripts available, the comparison of themes identified by human researchers and Copilot had considerable overlap [33].
In the three studies where both human researchers and Copilot reported thematic analysis, there was a reasonable synergy of themes reported by each. In Arora et al., for example, human researchers identified five themes, as did Copilot, with three from each being similar (i.e., essentially addressing the same theme) [35]. The five Arora et al. identified themes were: Implementation of peer education program, Connectivity between Health Programme-Rashtriya Kishor Swasthya Karyakram (RKSK) health workers and Peer Educators (PEs) and between PEs and adolescents during COVID-19 lockdown, Effect of COVID-19 on adolescent health services, Repurposing of RKSK health workers and PEs to support COVID-19 response: During and post lockdown, and Adolescents’ health and development issues during COVID-19. The five Copilot identified themes were: Implementation and challenges of the RKSK program, Impact of COVID-19 on program activities, Health issues faced by adolescents, Role of Peer Educators (PEs), and Differences in health issues between tribal and non-tribal adolescents (Supporting Information File 1: S3 Table). Neither the researchers (Arora et al.) nor Copilot reported the number of participants supporting each theme (called participant spread) [35]. Indeed, none of the Copilot outputs for the five studies reported the participant spread. We analyzed this manually and identified that participant spread was limited in Copilot outputs. Copilot outputs tended to draw most – if not all – of its reported themes and quotes from the first 2–3 pages of textual data, rather than from across the whole dataset (which in some cases were upwards of 150 pages).
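Participant spread, and the tendency for quotes to come from the opening pages of a dataset, were assessed manually in this study. As a rough illustration of how such a check might be automated, the Python sketch below counts how many distinct participants have a transcript containing at least one supporting quote for each theme, and reports where in a document a quote first appears as a fraction of its length. The function names, the data layout (one transcript string per participant ID), and the sample text are hypothetical, and matching is a plain substring test on normalized text.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def participant_spread(
    theme_quotes: dict[str, list[str]],
    transcripts: dict[str, str],
) -> dict[str, int]:
    """Count distinct participants whose transcript contains a quote, per theme."""
    clean = {pid: normalize(text) for pid, text in transcripts.items()}
    spread = {}
    for theme, quotes in theme_quotes.items():
        wanted = [normalize(q) for q in quotes if normalize(q)]
        supporters = {
            pid for pid, text in clean.items() if any(q in text for q in wanted)
        }
        spread[theme] = len(supporters)
    return spread


def relative_position(quote: str, document: str) -> Optional[float]:
    """Start of the quote as a fraction of document length, or None if absent."""
    source = normalize(document)
    index = source.find(normalize(quote))
    if not source or index == -1:
        return None
    return index / len(source)


# Hypothetical transcripts and themes, invented for illustration only.
transcripts = {
    "P1": "The clinic was far away. I had to take two buses.",
    "P2": "Staff were kind to me.",
    "P3": "I had to take two buses and then walk.",
}
theme_quotes = {
    "Access": ["I had to take two buses"],
    "Staff": ["Staff were kind to me"],
}
print(participant_spread(theme_quotes, transcripts))
print(relative_position("The clinic was far away", transcripts["P1"]))
```

A theme supported by quotes from a single participant, or quotes whose relative positions all cluster near 0.0 in a long dataset, would be a prompt for closer human review rather than a verdict in itself.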
Quotes
Overall, while human researchers reported the number of participants/documents and supported themes with a broad representation of quotes from many participants/documents, Copilot did not report the number of participants/documents and provided fewer supporting quotes. We report the quote accuracy in terms of correct (verbatim), modified (quotes with some words changed but message unchanged) and fabricated (full or partial quote does not exist in dataset). Table 3 shows the different number of quotes provided by Copilot and human researchers. Table 4 indicates that human researchers provided substantially more correct quotes and fewer modified or fabricated quotes than Copilot. Quotes reported by human researchers were typically correct whereas over half of the quotes reported by Copilot were modified or fabricated (Table 5).
Copilot frequently modified (9.3%) or fabricated (44.5%) quotes, and failed to report participant numbers, often drawing themes from only the first few pages of data. Human researchers, while more accurate, also reported modified (11.2%) and potentially fabricated (20.5%) quotes. Because Merkel et al.’s [38] full dataset was not provided, we could not verify their reported quotes, and these have been recorded as potentially fabricated. It is possible, even plausible, that their quotes are not fabricated, in which case the overall human researcher reporting of fabricated quotes is significantly lower than indicated here (i.e., the human researcher outputs were much more reliable than the Copilot outputs). Furthermore, only one (20%) of the Copilot outputs across the five studies was free of quote fabrication, namely the Merkel dataset [38] with 14 quotes reflected accurately in the Copilot output (Supporting Information File: S2 Fig). Arora et al.’s study, which did have a full dataset available for secondary analysis, also reported quotes that could not be verified in their associated published dataset [35,38]. We emailed the authors [35,38] to seek clarification but received no response.
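Percentages of this kind can be summarised across studies in more than one way, and the choice affects the headline figure. The Python sketch below contrasts the mean and standard deviation of per-study percentages with a single pooled percentage. The counts are hypothetical and are not taken from this study's tables; the function name and input layout are likewise assumptions, and the text does not state which calculation underlies each reported figure.

```python
from statistics import mean, stdev


def summarize(counts: list[tuple[int, int]]) -> dict[str, float]:
    """Summarize (fabricated, total) quote counts, one pair per study.

    Studies with zero quotes are skipped. At least two studies are
    needed, because the sample standard deviation is undefined for one.
    """
    usable = [(f, t) for f, t in counts if t > 0]
    per_study = [100 * f / t for f, t in usable]
    pooled = 100 * sum(f for f, _ in usable) / sum(t for _, t in usable)
    return {
        "mean_pct": mean(per_study),    # average of per-study percentages
        "sd_pct": stdev(per_study),     # sample SD of per-study percentages
        "pooled_pct": pooled,           # all quotes treated as one pool
    }


# Hypothetical counts for three studies: (fabricated quotes, total quotes).
example = [(0, 20), (5, 10), (10, 10)]
print(summarize(example))
```

With these invented counts the per-study percentages are 0, 50 and 100, so their mean is 50 with a sample standard deviation of 50, whereas pooling all 40 quotes gives 37.5. A large standard deviation relative to the mean signals that the studies differ widely, which a pooled figure alone would hide.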
Rigour
COREQ requires authors to note where they provide such information as Supporting evidence: data used to support interpretation, Evidence of analytical process: Context described and taken account of in interpretation, and Additional techniques to enhance trustworthiness [20,29]. Human researchers largely demonstrated these through their method and findings descriptions in their published articles. Both human researchers and Copilot provided data used to support interpretation, yet these are both subject to various levels of fabrication. Copilot did not provide evidence of analytical process or context considered during data interpretation.
Characteristics and reflexivity of the researchers were stated in the researcher publications and this informed the way data was collected and analysed, which remains unchanged in the Copilot outputs.
Discussion
We set out to identify whether GenAI (Copilot) could be relied upon to provide accurate thematic analysis of existing previously published qualitative datasets. We identified five studies that met our inclusion criteria, and we carefully appraised both human researcher and Copilot analyses of the published datasets. At first glance, both human researcher and Copilot outputs seemed to have strong face validity. Yet, analyses of outputs by human researchers and Copilot unveiled multiple errors and between-method discrepancies in results. First, we discuss the reliability of Copilot and human researcher outputs. Second, we compare our results with other studies whereby thematic analysis was undertaken using LLMs. Third, we discuss checklist use, transparency, rigour, and trustworthiness.
We took steps to minimize risk of bias informing the GenAI thematic analyses, including confirmatory bias, by selecting datasets for which researchers had already published their analysis (prior to 2022). Because we selected a closed GenAI system (Copilot), biases reflected in the GenAI outputs are limited to those reflected in the dataset or represented by the dataset, rather than the world wide web more broadly [41,42].
Copilot and human researcher errors
Both Copilot and human researchers exhibited errors in their thematic analyses. While Copilot’s issues were more pronounced in terms of quote fabrication and limited data utilization, the human research studies also had errors.
Additionally, the variability in analysis methods among human researchers led to minimal overlap with Copilot’s themes. These findings underscore the need for rigorous validation and cross-checking in qualitative research, regardless of whether it is conducted by humans or AI.
The analytical capacity of GenAI is presently constrained to the identification of descriptive codes and themes. Contrary to common perceptions, GenAI systems do not possess autonomous cognitive functions. Rather, they operate by adhering to pre-programmed algorithms and instructions for data analysis. These systems lack the human capacity to interpret latent codes or uncover deeper meanings within the data. Consequently, their analysis remains confined to surface-level coding, which may account for the relatively frequent generation of fabricated quotes to substantiate the identified themes. Yet thematic analysis requires researchers to think and interpret. This disjunct between human researcher and GenAI capability is particularly evident in discursive thematic analysis; a finding consistent with other reported studies [26].
What is perhaps of even more concern is the human fabrication found in our analysis of the data sets. While GenAI does not know when it is making a mistake or fabricating results, humans likely do. This calls into question how much fabrication is occurring in qualitative research being conducted by humans. To reduce fabrication and increase transparency, the entire field of qualitative research could introduce requirements for publication of the deidentified qualitative data sets upon which associated manuscripts are based. Examples of repositories that freely publish qualitative datasets include Open Science Framework and Mendeley. Tensions that such requirements would raise include those pertaining to ethical considerations concerning research methods (such as ethnographic film), as well as flows of power (such as datasets concerning minority groups, children, or people engaging in illegal activities, for which stringent measures would be needed to protect participants, i.e., to deidentify and anonymize data).
How these results compare with other studies
Previous studies have explored the use of GenAI tools in inductive and deductive thematic qualitative analysis [43,44]. These studies have focused on ChatGPT. Cook et al. and Mithas et al. have demonstrated that ChatGPT is useful in supporting stages of qualitative research, including transcription and translation, study planning, summarizing data sets and coding [43,44]. Further, there has been successful demonstration of using LLMs across the discrete stages of inductive and deductive thematic analysis in free-text response data [45] and in a variety of data contexts, such as semi-structured interviews with video gamers and university lecturers [46]. The benefits of using ChatGPT outlined by Cook et al., Mathas et al. and Morgan et al. align with the benefits of Copilot in this study, including the speed and accessibility of these tools. However, ChatGPT had comparable pitfalls to Copilot, including loss of nuance in analysis and the generation of misleading, inaccurate, and misaligned themes when compared to human-generated themes [26,43,44]. Together, these studies and ours suggest that while AI is not a replacement for human qualitative analysis, it may offer valuable support in some stages of qualitative research.
Checklist use, transparency, rigour, and trustworthiness
In qualitative research, demonstrating research quality is imperative. Such demonstration forms the foundation of trust, upon which a project – and qualitative research more broadly – are deemed useful. Several techniques for demonstrating quality have been widely acknowledged in texts, checklists and guidelines.
In qualitative research, replicability is acknowledged as an impossibility because of the diversity of people and contexts. Heraclitus famously said: No man ever steps in the same river twice, for it’s not the same river and he’s not the same man [47]. Aligning to his wisdom, qualitative paradigms dictate that reproducing the same man [sic] or the same river is impossible. Given this, qualitative research instead strives for trustworthiness in terms of transparency about study design, data analysis, and steps taken to recognise and minimise researcher bias (often reported in terms of reflexivity or positionality) [48]. As mentioned in our introduction, the COREQ checklist is widely used to demonstrate rigour in studies drawing on interview and focus group data in medical and health sciences fields [29]. The more recent SRQR is widely used to report other qualitative research methods. In 2024, Braun and Clarke released their Big Q Qualitative Reporting Guidelines (BQQRG) to guide the analysis and reporting of qualitative research [10]. These tools guide researchers in how they conduct and report qualitative processes followed during data analysis. From our five included datasets, researchers variously described their analytical processes. However, the GenAI (Copilot) outputs were opaque. Next steps for quality appraisal of GenAI thematic analyses could include refinement prompting including details mirroring guiding frameworks such as COREQ or SRQR, or the BQQRG.
Such mirroring could serve to increase reliability and trustworthiness of analysis. However, it will not address the tension of quality that Braun and Clarke have warned against [10]. The COREQ and SRQR have been criticized as rigid, inhibiting transparency and fueling methodological incongruence. Psychologists – including Braun and Clarke – have raised concerns that such checklists promote ‘little q’ (qualitative research that is attempting to fit a positivist paradigm) rather than ‘Big Q’ (research that rejects objectivist assumptions, including those foundational to positivism) qualitative research. Acknowledging this tension, the 2024 BQQRG seeks to guide researchers to demonstrate quality while remaining true to qualitative philosophical paradigms. It is hard to see how researchers might successfully incorporate BQQRG use into future GenAI qualitative research for Big Q projects. Zhang and colleagues offer a prompt design framework for qualitative researchers to empower their analysis with GenAI and perhaps this offers a starting point [49].
Our study found that overlap in themes identified by human researchers and GenAI was typically minimal. In two of the five studies, human researchers reported discursive thematic analyses. The discursive approach is highly interpretive, and our prompt did not include a specific discursive component. A Chain-of-Thought prompt coupled with refinement prompting may have elicited discursive thematic analyses. The discursive approach is also illustrative of Big Q qualitative approaches.
This article opened with reference to Mary Shelley’s Frankenstein. Reflecting on our findings and the current international discourse concerning GenAI for qualitative research, we recommend caution. Pioneering a new way, as Shelley warns, evokes exploration of “unknown powers” [1]. Our exploration has been undertaken with human biases in mind. That is, human researchers want GenAI to make our research journeys easier; we want GenAI to produce reliable outputs. Our research has demonstrated that we must proceed cautiously, remaining alert to the potential for error and honing our human researcher capability for trustworthy GenAI use.
Strengths and limitations
The strengths of this study are that five datasets were analyzed, representing a broad range of qualitative research, and the direct comparison between human researcher and GenAI (Copilot) thematic analyses. This approach allowed for a detailed examination of how each method (human researchers vs Copilot) identified and interpreted themes within the same datasets. By comparing the outputs, the study highlights the unique strengths and weaknesses of both human and AI analyses. This comparative framework provides valuable insights into the consistency, depth, and accuracy of thematic identification, as well as the reliability of supporting quotes. It also helped to understand the potential biases and limitations inherent in AI-generated analyses, offering a comprehensive evaluation of GenAI’s capabilities in qualitative research.
This study was limited to appraisal of a single LLM which, while widely used, was not custom designed for qualitative analysis. Our results may not be generalizable to other LLMs and may not apply to future iterations of Copilot. Regarding prompting, we elected not to include Few-Shot Learning prompting, Chain-of-Thought Reasoning prompting, or refinement prompting in our methods. It is possible that these forms of prompting would have secured more accurate outputs (for example, outputs that include quotes from a range of participants and demonstrate this spread clearly).
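To make the three prompting strategies named above concrete, the following is a minimal Python sketch of how a base prompt might be extended with few-shot examples, a Chain-of-Thought instruction, and a follow-up refinement prompt. It is illustrative only: the class `PromptPlan`, the function `refinement_prompt`, and all prompt wording are hypothetical and are not the prompt used in this study.

```python
from dataclasses import dataclass, field


@dataclass
class PromptPlan:
    """Hypothetical container for one prompting strategy (not the study's prompt)."""
    base_instruction: str
    # Few-shot examples as (excerpt, code) pairs; empty means zero-shot.
    examples: list[tuple[str, str]] = field(default_factory=list)
    chain_of_thought: bool = False

    def build(self) -> str:
        parts = [self.base_instruction]
        # Few-shot: show worked excerpt-to-code pairs before the task.
        for excerpt, code in self.examples:
            parts.append(f'Example excerpt: "{excerpt}"\nExample code: {code}')
        # Chain-of-Thought: ask for intermediate reasoning before themes.
        if self.chain_of_thought:
            parts.append(
                "Explain your reasoning step by step before naming each theme."
            )
        return "\n\n".join(parts)


def refinement_prompt(missing_participants: list[str]) -> str:
    """Follow-up prompt asking for verbatim quotes and broader participant spread."""
    ids = ", ".join(missing_participants)
    return (
        "Revise the analysis. Quote only verbatim text from the transcripts, "
        "label each quote with its participant ID, and also consider "
        f"participants: {ids}."
    )
```

A plan such as `PromptPlan("Conduct a thematic analysis.", [("I felt alone", "isolation")], chain_of_thought=True).build()` yields a single prompt string, and `refinement_prompt(["P07", "P12"])` yields a second-round prompt. Whether such prompts would improve accuracy is an open question that this sketch does not answer.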
Few studies met our inclusion criteria. We therefore opted to include all forms of thematic analysis studies, including discursive thematic analysis. Our prompt did not specifically mention discursive thematic analysis. We also opted to include Merkel et al. [31], despite the limitations born of comparing an analysis of 54 interviews (reported in their article) with an analysis of 15 interviews (which the authors provided through the UK Data Service). That the analyses reached the same themes supports the notion that data saturation can be reached within a small dataset.
We see this study as providing an important early step towards quality appraisal of GenAI thematic analyses. The results do not appear to be specific to health-related datasets. A next step would be to build refinement prompting into the methods, to identify whether GenAI systems such as Copilot can be trained to report reliably and accurately on different types of datasets using thematic analysis. Copilot also did not provide evidence of its analytical process, or of the context considered during data interpretation, despite the prompt element “document the entire process transparently”. A further step would therefore be to develop refinement prompting that targets transparency of process.
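One part of such quality appraisal, checking whether reported quotes actually appear in the source transcripts and how many participants they draw on, can be partly automated. The Python sketch below is a minimal illustration under stated assumptions, not the procedure used in this study: the function names are hypothetical, transcripts are assumed to be plain text keyed by participant ID, and a quote counts as verified only if it appears verbatim after whitespace, case, and curly-quote normalization.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, unify curly quotes, and collapse whitespace."""
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_quotes(quotes, transcripts):
    """Map each reported quote to the participants whose transcript contains it.

    quotes: list of quote strings reported in an analysis.
    transcripts: dict of participant ID -> transcript text (assumed format).
    """
    norm = {pid: normalize(t) for pid, t in transcripts.items()}
    results = {}
    for quote in quotes:
        nq = normalize(quote)
        # An empty quote would match everything, so treat it as unverified.
        results[quote] = [pid for pid, t in norm.items() if nq in t] if nq else []
    return results


def summarize(results, transcripts):
    """Percentage of verified quotes and spread of participants quoted."""
    n = len(results)
    found = sum(1 for sources in results.values() if sources)
    participants = {pid for sources in results.values() for pid in sources}
    return {
        "pct_verified": 100 * found / n if n else 0.0,
        "participants_quoted": sorted(participants),
        "participant_coverage": (
            len(participants) / len(transcripts) if transcripts else 0.0
        ),
    }
```

Exact matching is deliberately strict: a quote that has been lightly edited, truncated with an ellipsis, or paraphrased would be flagged as unverified, so flagged quotes still need to be checked by a human researcher before being judged fabricated.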
Conclusions
The face validity of Copilot thematic analysis looks promising, yet our analyses have shown that the current iteration of Copilot should not be relied upon for accurate or trustworthy thematic analysis. With effective prompting and close attention to outputs, Copilot accuracy may be improved to support thematic analysis.
Supporting information
S1 Table. Reported quotes by human researchers and genAI (Copilot).
*Merkel et al. made 15 interview transcripts available for analysis. We could not identify many of their reported quotes from the 15 interview transcripts and it is possible that the quotes come from the remaining transcripts that were not provided. This percentage should be considered with caution.
https://doi.org/10.1371/journal.pone.0330217.s001
(DOCX)
S2 Table. Thematic analysis: Human compared with GenAI Copilot thematic analysis.
https://doi.org/10.1371/journal.pone.0330217.s002
(DOCX)
S1 Fig. Human compared with Copilot accuracy of participant/document quotes.
https://doi.org/10.1371/journal.pone.0330217.s003
(DOCX)
References
- 1. Shelley M. Frankenstein. London: Lackington, Hughes, Harding, Mavor & Jones; 1831.
- 2. Gamieldien Y, Case JM, Katz A. Advancing qualitative analysis: An exploration of the potential of generative AI and NLP in thematic coding. 2023. https://ssrn.com/abstract=4487768
- 3. Joel-Edgar S, Pan YC. Generative AI as a tool for thematic analysis: An exploratory study with ChatGPT. 2024.
- 4. Lia H, Atkinson AG, Navarro SM. Cross-industry thematic analysis of generative AI best practices: applications and implications for surgical education and training. Global Surgical Education-Journal of the Association for Surgical Education. 2024;3(1):61.
- 5. Perkins M, Roe J. The use of generative AI in qualitative analysis: Inductive thematic analysis with ChatGPT. Journal of Applied Learning and Teaching. 2024;7(1).
- 6. Yan L, Echeverria V, Fernandez-Nieto GM, Jin Y, Swiecki Z, Zhao L, et al. Human-AI Collaboration in Thematic Analysis using ChatGPT: A User Study and Design Recommendations. In: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 2024.
- 7. Bail CA. Can Generative AI improve social science? Proc Natl Acad Sci U S A. 2024;121(21):e2314021121. pmid:38722813
- 8. Starks H, Trinidad SB. Choose your method: a comparison of phenomenology, discourse analysis, and grounded theory. Qual Health Res. 2007;17(10):1372–80. pmid:18000076
- 9. Kidder LH, Fine M. Qualitative and quantitative methods: When stories converge. New directions for program evaluation. 1987;1987(35):57–75.
- 10. Braun V, Clarke V. Reporting guidelines for qualitative research: A values-based approach. Qualitative Research in Psychology. 2024;:1–40.
- 11. Sparkes AC, Smith B. Judging the quality of qualitative inquiry: Criteriology and relativism in action. Psychology of Sport and Exercise. 2009;10(5):491–7.
- 12. Jowsey T, Deng C, Weller J. General-purpose thematic analysis: a useful qualitative method for anaesthesia research. BJA Educ. 2021;21(12):472–8. pmid:34840819
- 13. Braun V, Clarke V. Thematic analysis. American Psychological Association. 2012.
- 14. Braun V, Clarke V. Reflecting on reflexive thematic analysis. Qualitative Research in Sport, Exercise and Health. 2019;11(4):589–97.
- 15. Clarke V, Braun V. Thematic analysis. The Journal of Positive Psychology. 2017;12(3):297–8.
- 16. Terry G, Hayfield N, Clarke V, Braun V. Thematic analysis. The SAGE handbook of qualitative research in psychology. 2017. 25.
- 17. Nowell LS, Norris JM, White DE, Moules NJ. Thematic analysis: striving to meet the trustworthiness criteria. International Journal of Qualitative Methods. 2017;16(1):1609406917733847.
- 18. Madill A, Jordan A, Shirley C. Objectivity and reliability in qualitative analysis: realist, contextualist and radical constructionist epistemologies. Br J Psychol. 2000;91(Pt 1):1–20. pmid:10717768
- 19. Savin-Baden M, Major C. Qualitative research: The essential guide to theory and practice. Routledge. 2023.
- 20. Walsh D, Downe S. Appraising the quality of qualitative research. Midwifery. 2006;22(2):108–19. pmid:16243416
- 21. Miles MB, Huberman AM. Qualitative data analysis: an expanded sourcebook. Sage. 1994.
- 22. Naeem M, Ozuem W, Howell K, Ranfagni S. A Step-by-Step Process of Thematic Analysis to Develop a Conceptual Model in Qualitative Research. International Journal of Qualitative Methods. 2023;22.
- 23. Ezzy D. Qualitative analysis. Routledge. 2013.
- 24. Ritchie J, Spencer L, O’Connor W. Carrying out qualitative analysis. Qualitative research practice: A guide for social science students and researchers. 2003. 219–62.
- 25. Patton MQ. Enhancing the quality and credibility of qualitative analysis. Health Serv Res. 1999;34(5 Pt 2):1189–208. pmid:10591279
- 26. Morgan DL. Exploring the use of artificial intelligence for qualitative data analysis: The case of ChatGPT. International Journal of Qualitative Methods. 2023;22:16094069231211248.
- 27. Bennis I, Mouwafaq S. Advancing AI-driven thematic analysis in qualitative research: a comparative study of nine generative models on Cutaneous Leishmaniasis data. BMC Med Inform Decis Mak. 2025;25(1):124. pmid:40065373
- 28. Microsoft. FAQ for Copilot data security and privacy for Dynamics 365 and Power Platform. 2025. https://learn.microsoft.com/en-us/power-platform/faqs-copilot-data-security-privacy
- 29. Tong A, Sainsbury P, Craig J. Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. Int J Qual Health Care. 2007;19(6):349–57. pmid:17872937
- 30. Arora M. Transcripts from semi-structured interviews conducted for a situational analysis. UK Data Service. 2020.
- 31. Dunn A. Parenting experience borderline personality disorder traits: parent and practitioner perspectives. UK Data Service. 2020.
- 32. Hervey T, Antova I. Health governance after Brexit: Street ethnography and elite interviews, 2019-2021. UK Data Service. 2021.
- 33. Holman D, Walker A. Challenges and practices in promoting (ageing) employees working career in the health care sector – case studies from Germany, Finland and the UK. UK Data Service. 2021.
- 34. Barlow P, Thow AM. Neoliberal discourse, actor power, and the politics of nutrition policy: A qualitative analysis of informal challenges to nutrition labelling regulations at the World Trade Organization, 2007–2019. Social Science & Medicine. 2021;273:113761.
- 35. Arora M, Dringus S, Bahl D, Rizvi Z, Maity H, Lama S, et al. Engagement of health workers and peer educators from the National Adolescent Health Programme-Rashtriya Kishor Swasthya Karyakram during the COVID-19 pandemic: Findings from a situational analysis. PLoS One. 2022;17(9):e0266758. pmid:36129932
- 36. Dunn A, Cartwright-Hatton S, Startup H, Papamichail A. The Parenting Experience of Those With Borderline Personality Disorder Traits: Practitioner and Parent Perspectives. Front Psychol. 2020;11:1913. pmid:32849122
- 37. Hervey T, Antova I, Flear ML, McHale JV, Speakman E, Wood M. Health “Brexternalities”: The Brexit Effect on Health and Health Care outside the United Kingdom. J Health Polit Policy Law. 2021;46(1):177–203. pmid:33085960
- 38. Merkel S, Ruokolainen M, Holman D. Challenges and practices in promoting (ageing) employees working career in the health care sector - case studies from Germany, Finland and the UK. BMC Health Serv Res. 2019;19(1):918. pmid:31783852
- 39. Barlow P, Thow AM. Neoliberal discourse, actor power, and the politics of nutrition policy: A qualitative analysis of informal challenges to nutrition labelling regulations at the World Trade Organization, 2007–2019. Social Science & Medicine. 2021;273:113761.
- 40. Barlow P. WTO Technical Barriers to Trade Committee Minutes: extracts of discussions on interpretative front-of-pack nutrition labelling policies, 2007-2019. World Trade Organization. 2020.
- 41. Zhou M, Abhishek V, Derdenger T, Kim J, Srinivasan K. Bias in generative AI. 2024. https://arxiv.org/abs/2403.02726
- 42. Panciroli C, Rivoltella PC. Can an algorithm be fair?: intercultural biases and critical thinking in generative artificial intelligence social uses. Scholé: rivista di educazione e studi culturali. 2023;LXI(2):67–84.
- 43. Cook DA, Ginsburg S, Sawatsky AP, Kuper A, D’Angelo JD. Artificial Intelligence to Support Qualitative Data Analysis: Promises, Approaches, Pitfalls. Acad Med. 2025;:10.1097/ACM.0000000000006134. pmid:40560241
- 44. Mathis WS, Zhao S, Pratt N, Weleff J, De Paoli S. Inductive thematic analysis of healthcare qualitative interviews using open-source large language models: How does it compare to traditional methods? Computer Methods and Programs in Biomedicine. 2024;255:1–11.
- 45. Dai SC, Xiong A, Ku LW. LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis. arxiv. 2023. 1–9.
- 46. De Paoli S. Performing an Inductive Thematic Analysis of Semi-Structured Interviews With a Large Language Model: An Exploration and Provocation on the Limits of the Approach. Social Science Computer Review. 2023;42(4):997–1019.
- 47. Graham DW. Heraclitus. The Stanford Encyclopedia of Philosophy. 2023.
- 48. Williams V, Boylan A-M, Nunan D. Critical appraisal of qualitative research: necessity, partialities and the issue of bias. BMJ Evid Based Med. 2020;25(1):9–11. pmid:30862711
- 49. Zhang H, Wu C, Xie J, Lyu Y, Cai J, Carroll JM. Harnessing the power of AI in qualitative research: Exploring, using and redesigning ChatGPT. Computers in Human Behavior: Artificial Humans. 2025;4:100144.