Frankenstein, thematic analysis and generative artificial intelligence: Quality appraisal methods and considerations for qualitative research

Tanisha Jowsey; Peta Stapleton; Shawna Campbell; Alexandra Davidson; Cher McGillivray; Isabella Maugeri; Megan Lee; Justin Keogh

doi:10.1371/journal.pone.0330217

Peer Review History

Original SubmissionFebruary 5, 2025
4 Jun 2025 Decision Letter - Jiankun Gong, Editor PONE-D-25-06052Frankenstein, Thematic Analysis and Generative Artificial Intelligence: Quality Appraisal Methods and Considerations for Qualitative ResearchPLOS ONE Dear Dr. Jowsey, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Jul 19 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols . We look forward to receiving your revised manuscript. Kind regards, Jiankun Gong Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Thank you for stating the following in your Competing Interests section: [The authors have no conflicts of interest to declare.]. Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now This information should be included in your cover letter; we will change the online submission form on your behalf. 3. In the online submission form, you indicated that [Some data are held in the UK Data Service repository, four of the five datasets have safeguarded restrictions. Dataset 1 (Barlow et al.) ISSN 0277-9536, https://doi.org/10.1016/j.socscimed.2021.113761. Dataset 2 (Hervey & Antova) UK Data Service SN: 854778, DOI: 10.5255/UKDA-SN-854778 Dataset 3 (Dunn et al) UK Data Service SN: 854245, DOI: 10.5255/UKDA-SN-854245 Dataset 4 (Holman & Walker) UK Data Service SN: 855082, DOI: 10.5255/UKDA-SN-855082 Dataset 5 (Arora) UK Data Service. SN: 855953, DOI: 10.5255/UKDA-SN-855953]. All PLOS journals now require all data underlying the findings described in their manuscript to be freely available to other researchers, either 1. In a public repository, 2. Within the manuscript itself, or 3. Uploaded as supplementary information. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If your data cannot be made publicly available for ethical or legal reasons (e.g., public availability would compromise patient privacy), please explain your reasons on resubmission and your exemption request will be escalated for approval. 4. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information . [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Partly ******** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ****** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ****** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ****** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: This study is interesting; the aim (as you mentioned) is to determine the accuracy and efficiency of using generative artificial intelligence (GenAI) to undertake thematic analysis. However, using Copilot for this task may be challenging, as it is difficult to change its parameters. It may be worth considering the application programming interface (API) of LLMs and adjusting to different versions of LLMs. For prompt engineering, there are zero-shot, few-shot, and chain-of-thought approaches. Regarding the dataset, five datasets are quite small. For result comparison, it would be better to illustrate more statistical results, such as accuracy, recall, precision, and F1 score. In the discussion, you should compare the results with other thematic analysis using LLMs (of which there are quite a lot). Reviewer #2: This paper explores the accuracy and credibility of Generative AI (GenAI), specifically Microsoft Copilot, in performing thematic analysis on existing qualitative datasets. The authors conduct a structured comparison using five real datasets and evaluate the differences between human and AI-generated themes based on accuracy, data support, and transparency. The topic is timely and relevant, and the comparative method offers a replicable way. However, the article needs revisions. I recommend that the suggestions be followed carefully to improve the scientific rigor of the work. It will be important to filter and prioritize the most relevant points. Introduction The introduction needs a clearer positioning of the research gap. (a) The citations on prior GenAI-related studies lack critical thinking, and the gap remains vague. It would help to summarise existing attempts to apply GenAI in qualitative analysis and explain how this study makes a unique contribution in terms of method, objective, or evaluation dimension. (b) The discussion on Big Q and Small q is overly long but not well connected to the design choices. For example, the authors do not explain why datasets involving discursive thematic analysis were included for AI comparison. (c) The research aim is too broad. I suggest the authors clearly state their main goal at the end of the introduction. Materials and Methods The methods section describes the inclusion criteria and data sources in sufficient detail but still has several issues: (a) The sample size is small, with only five studies selected. The authors should explain why the search was not extended to other journals or platforms. (b) Copilot is chosen as the only GenAI tool, but there is no justification for excluding other tools like Claude or Gemini. A rationale is needed to explain why Copilot is suitable or representative for this task. (c) The prompt used in Box 1 is central to the analysis, but the design process is unclear. The authors should explain how the prompt was developed and whether it was tested or reviewed before use. Results The results are clearly organized in tables, but the textual analysis lacks structure and detail. (a) The paper would benefit from a clearer comparison framework across datasets. For example, the authors could compare theme alignment, data range, and expression style in a consistent way. (b) Specific examples of themes and quotes generated by both Copilot and human researchers should be added to show differences more concretely. (c) The criteria for verifying quotes should be explained. It is not clear how “fabricated” and “unverifiable” quotes were defined or distinguished. Discussion The discussion could reflect more critically on the design limitations of the study. (a) The statement that Copilot fails to handle latent meaning is plausible, but currently based on assumption. It would be stronger if the authors gave concrete examples of failed interpretations or overlooked themes. (b) The issue of quote accuracy in human studies is an important observation, but the evidence (i.e., lack of author response) is weak. The authors should clarify how they handled such cases to avoid confusion or unfair judgment. Rigour and Trustworthiness On page 7, the authors briefly mention the use of COREQ to assess analysis transparency, data support, and participant representation. However, there is no clear mapping of COREQ items to the evaluation results. I recommend providing a supplementary table showing which of the 32 COREQ items were met in human and AI outputs. A simple ✓ / ✗ or short comment would improve clarity. Conclusion The conclusion is too general and lacks clear boundary conditions. The authors should clarify whether their findings apply only to health-related datasets and whether the results can be generalized to other LLMs. ****** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean? ). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy . Reviewer #1: No Reviewer #2: No ******** [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step. https://doi.org/10.1371/journal.pone.0330217.r001
Revision 1
11 Jul 2025 Author Response Thank you to the reviewers. Below are the reviewer comments and our responses. The reviewers may find it easier to read these responses in the response letter we have uploaded. COMMENT RESPONSE Reviewer 1: Reviewer #1: This study is interesting; the aim (as you mentioned) is to determine the accuracy and efficiency of using generative artificial intelligence (GenAI) to undertake thematic analysis. However, using Copilot for this task may be challenging, as it is difficult to change its parameters. It may be worth considering the application programming interface (API) of LLMs and adjusting to different versions of LLMs. For prompt engineering, there are zero-shot, few-shot, and chain-of-thought approaches. Thank you for these comments. The reviewer kindly suggests using the API to adjust different versions of LLMs and mention the type of prompt engineering we used. We only explored Copilot and no other LLMs because of the restrictions placed on us by our university and the UK Data Service. We have now explained this more clearly in the manuscript. We have now stated our promoting approach as zero-shot in the manuscript, and listed the limitations associated therewith in our limitations section, thank you for bringing that to our attention. Regarding the dataset, five datasets are quite small. We suggest that five datasets is still more than previous studies in this area. Morgan’s study, which we cite as precedence, analysed two datasets. We sought to identify more than five studies for inclusion in the research but only five studies met our criteria. For result comparison, it would be better to illustrate more statistical results, such as accuracy, recall, precision, and F1 score. We appreciate that recall, precision and F1 score (which is a metric that includes recall and precision) play important roles in quantitative studies involving classification of true and false positives and negatives, particularly in machine learning studies. However, as we analysed qualitative rather than quantitative data involving multiple levels of classification of true and false positives and negatives, it was apparent that metrics including recall, precision and F1 score would not be appropriate for the data we collected. We do present accuracy measures including how often Co-pilot got Number of themes, Total number of participants, Quotes per theme and Documents per theme correct compared to humans. In the discussion, you should compare the results with other thematic analysis using LLMs (of which there are quite a lot). This has been done as suggested. Reviewer 2: This paper explores the accuracy and credibility of Generative AI (GenAI), specifically Microsoft Copilot, in performing thematic analysis on existing qualitative datasets. The authors conduct a structured comparison using five real datasets and evaluate the differences between human and AI-generated themes based on accuracy, data support, and transparency. The topic is timely and relevant, and the comparative method offers a replicable way. However, the article needs revisions. I recommend that the suggestions be followed carefully to improve the scientific rigor of the work. It will be important to filter and prioritize the most relevant points. Introduction The introduction needs a clearer positioning of the research gap. (a) The citations on prior GenAI-related studies lack critical thinking, and the gap remains vague. It would help to summarise existing attempts to apply GenAI in qualitative analysis and explain how this study makes a unique contribution in terms of method, objective, or evaluation dimension. (b) The discussion on Big Q and Small q is overly long but not well connected to the design choices. For example, the authors do not explain why datasets involving discursive thematic analysis were included for AI comparison. (c) The research aim is too broad. I suggest the authors clearly state their main goal at the end of the introduction. We thank the reviewer for their comments and have made the following provisions as requested: a) We have now made clear the unique contribution of our study makes in relation to existing literature; b) We have improved integration of the discussion on Big Q and small q throughout methods and have made clear in methods that discursive thematic analyses were included; c) the main goal and research question are now described at the end of the introduction. Materials and Methods The methods section describes the inclusion criteria and data sources in sufficient detail but still has several issues: (a) The sample size is small, with only five studies selected. The authors should explain why the search was not extended to other journals or platforms. We suggest that five datasets is a lot. Morgan’s study, which we cite as precedence, analysed two datasets. We sought to identify more than five eligible studies for inclusion in the research but only five studies met our criteria. The amount of quality appraisal work associated with our study was substantial, so we were actually grateful that no further studies met our criteria. (b) Copilot is chosen as the only GenAI tool, but there is no justification for excluding other tools like Claude or Gemini. A rationale is needed to explain why Copilot is suitable or representative for this task. Thank you for pointing this out. We have now clarified that we didn’t have a whole lot of choice. Our institution and the UK Data Service would only approve Copilot for this study. Co-pilot is our institutionally supported closed GenAI tool. Other tools are not “closed” systems and thus breach the data privacy statements of the UK database. Please see below GenAI policy from Bond University https://bond.edu.au/current-students/study-information/integrity-at-bond/academic-integrity/academic-integrity-and-artificial-intelligence Text now reads: The GenAI tool we selected was Copilot on the grounds that it was a general-purpose GenAI, approved for use by both our institution and the UK Data Service, and was a closed GenAI system (i.e. access to Copilot is limited and password protected, so that the data are protected and are not used to train foundation GenAI models). The UK Data Service permitted us to include in our study data published through their repository but only for analysis through approved closed GenAI systems (i.e. Copilot) and not for open GenAI systems like ChatGPT, Perplexity, Claude or Gemini. (c) The prompt used in Box 1 is central to the analysis, but the design process is unclear. The authors should explain how the prompt was developed and whether it was tested or reviewed before use. Good point. We have added a sentence regarding the design process of developing the prompt. Text now reads: Prior to running the prompt through Copilot, the team developed, tested, refined, reviewed and piloted the prompt wording. Results The results are clearly organized in tables, but the textual analysis lacks structure and detail. (a) The paper would benefit from a clearer comparison framework across datasets. For example, the authors could compare theme alignment, data range, and expression style in a consistent way. Theme alignment and data range are detailed in Supplementary File 1. No change made. However, we thank the reviewer for pointing us to look again at this because we realised we had not included quotes in the manuscript or Supplementary File. We have now included quote examples in the manuscript Table 4. (b) Specific examples of themes and quotes generated by both Copilot and human researchers should be added to show differences more concretely. We provide all themes in Supplementary File 1, and have now included an example of these in the manuscript too. We now include examples of quotes in the manuscript. (c) The criteria for verifying quotes should be explained. It is not clear how “fabricated” and “unverifiable” quotes were defined or distinguished. Thank you, we have now clarified that the two terms we are using are modified and fabricated; modified had words changed or missing and fabricated quotes were not in the original datasets either in part or in full. Examples now provided in Table 4. Discussion The discussion could reflect more critically on the design limitations of the study. (a) The statement that Copilot fails to handle latent meaning is plausible, but currently based on assumption. It would be stronger if the authors gave concrete examples of failed interpretations or overlooked themes. (b) The issue of quote accuracy in human studies is an important observation, but the evidence (i.e., lack of author response) is weak. The authors should clarify how they handled such cases to avoid confusion or unfair judgment. We thank the reviewer for their comments and have made the following changes: a) The statement regarding how Copilot failed to handle latent meaning has been revised; all themes that were identified by researchers but not by Copilot are listed in Supplementary File 1. b) That you for drawing out attention to this. The section regarding quote accuracy in human studies has been substantially revised providing greater evidence and clarity to the reader on how these cases were handled and recommending caution for interpretation. Rigour and Trustworthiness On page 7, the authors briefly mention the use of COREQ to assess analysis transparency, data support, and participant representation. However, there is no clear mapping of COREQ items to the evaluation results. I recommend providing a supplementary table showing which of the 32 COREQ items were met in human and AI outputs. A simple ✓ / ✗ or short comment would improve clarity. Conclusion After careful consideration, we have decided not to provide the full 32 item assessment against researcher and Copilot outputs on the grounds that COREQ is a reporting checklist for writing specific sections of qualitative manuscripts and not particularly for assessing the quality of the data or analysis itself. Many qualitative manuscripts do not report COREQ. We acknowledge that AI would perform poorly on this task as we did not prompt it to follow COREQ guidelines. The conclusion is too general and lacks clear boundary conditions. The authors should clarify whether their findings apply only to health-related datasets and whether the results can be generalized to other LLMs. Conclusion has been tightened. We also now state that the results do not appear to be specific to health-related datasets. Thank you for your comments. We trust our revisions meet with your approval. Attachments Attachment Submitted filename: response to reviewers.docx https://doi.org/10.1371/journal.pone.0330217.r002
29 Jul 2025 Decision Letter - Jiankun Gong, Editor Frankenstein, Thematic Analysis and Generative Artificial Intelligence: Quality Appraisal Methods and Considerations for Qualitative Research PONE-D-25-06052R1 Dear Dr. Jowsey, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. For questions related to billing, please contact billing support . If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Jiankun Gong Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #2: All comments have been addressed ******** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #2: Yes ****** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #2: Yes ****** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #2: Yes ****** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #2: Yes ****** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #2: Dear Authors, Thanks very much for your revised submission. I’ve read the updated version of the manuscript titled "Frankenstein, Thematic Analysis and Generative Artificial Intelligence: Quality Appraisal Methods and Considerations for Qualitative Research." You've clearly taken the feedback seriously. The revised paper is much more focused and easier to follow. The methods section is now clearer, and I can see a stronger link between your analysis and the questions you're trying to answer. The explanation of how GenAI was tested, and the comparison with human analysis, is now much more solid and convincing.Overall, the paper makes an original contribution, and I hope readers will find it both relevant and thought-provoking. ****** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean? ). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy . Reviewer #2: Yes: Shuangyan Du ******** https://doi.org/10.1371/journal.pone.0330217.r003
Formally Accepted
Acceptance Letter - Jiankun Gong, Editor PONE-D-25-06052R1 PLOS ONE Dear Dr. Jowsey, I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team. At this stage, our production department will prepare your paper for publication. This includes ensuring the following: * All references, tables, and figures are properly cited * All relevant supporting information is included in the manuscript submission, * There are no issues that prevent the paper from being properly typeset You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps. Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing. If we can help with anything else, please email us at customercare@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Jiankun Gong Academic Editor PLOS ONE https://doi.org/10.1371/journal.pone.0330217.r004

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio .