
Use as directed? A comparison of software tools intended to check rigor and transparency of published work

  • Peter Eckmann,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation Department of Computer Science and Engineering, UC San Diego, La Jolla, California, United States of America

  • Adrian Barnett,

    Roles Data curation, Formal analysis, Methodology, Software, Writing – review & editing

    Affiliation School of Public Health and Social Work, Queensland University of Technology, Kelvin Grove, Queensland, Australia

  • Alexandra Bannach-Brown,

    Roles Conceptualization, Writing – review & editing

    Affiliation QUEST Center for Responsible Research, Berlin Institute of Health at Charité Universitätsmedizin Berlin, Berlin, Germany

  • Elisa Pilar Bascunan Atria,

    Roles Data curation, Formal analysis, Writing – review & editing

    Affiliation QUEST Center for Responsible Research, Berlin Institute of Health at Charité Universitätsmedizin Berlin, Berlin, Germany

  • Guillaume Cabanac,

    Roles Data curation, Formal analysis, Investigation, Software, Validation, Writing – review & editing

    Affiliation Université de Toulouse & Institut Universitaire de France, Toulouse, France

  • Louise Delwen Owen Franzen,

    Roles Data curation, Formal analysis, Writing – review & editing

    Affiliation QUEST Center for Responsible Research, Berlin Institute of Health at Charité Universitätsmedizin Berlin, Berlin, Germany

  • Małgorzata Anna Gazda,

    Roles Data curation, Writing – review & editing

    Affiliation Department of Biological Sciences, University of Montréal, Montréal, Québec, Canada

  • Kaitlyn Hair,

    Roles Data curation, Validation, Writing – review & editing

    Affiliation UCL Social Research Institute, University College London, London, United Kingdom

  • James Howison,

    Roles Data curation, Methodology, Software, Writing – review & editing

    Affiliation Information School of the University of Texas at Austin, Austin, Texas, United States of America

  • Halil Kilicoglu,

    Roles Resources, Writing – review & editing

    Affiliation School of Information Sciences, University of Illinois Urbana-Champaign, Illinois, United States of America

  • Cyril Labbe,

    Roles Data curation, Formal analysis, Writing – review & editing

    Affiliation Université Grenoble Alpes, Saint-Martin-d’Hères, France

  • Sarah McCann,

    Roles Data curation, Formal analysis, Writing – review & editing

    Affiliation QUEST Center for Responsible Research, Berlin Institute of Health at Charité Universitätsmedizin Berlin, Berlin, Germany

  • Vladislav Nachev,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Writing – original draft

    Affiliation QUEST Center for Responsible Research, Berlin Institute of Health at Charité Universitätsmedizin Berlin, Berlin, Germany

  • Martijn Roelandse,

    Roles Data curation, Formal analysis, Software, Visualization, Writing – review & editing

    Affiliation martijnroelandse.dev, Ouderkerk aan de Amstel, Netherlands

  • Maia Salholz-Hillel,

    Roles Data curation, Formal analysis, Methodology, Software, Supervision, Writing – original draft, Writing – review & editing

    Affiliation QUEST Center for Responsible Research, Berlin Institute of Health at Charité Universitätsmedizin Berlin, Berlin, Germany

  • Robert Schulz,

    Roles Data curation, Investigation, Validation, Writing – review & editing

    Affiliation QUEST Center for Responsible Research, Berlin Institute of Health at Charité Universitätsmedizin Berlin, Berlin, Germany

  • Gerben ter Riet,

    Roles Methodology, Supervision, Writing – original draft

    Affiliation Hogeschool van Amsterdam, Amsterdam University of Applied Sciences, Amsterdam, Netherlands

  • Colby Vorland,

    Roles Data curation, Formal analysis, Software, Writing – review & editing

    Affiliation Indiana University School of Public Health, Bloomington, Indiana, United States of America

  • Anita Bandrowski,

    Roles Data curation, Supervision, Writing – original draft, Writing – review & editing

    abandrowski@ucsd.edu (AB); tracey.weissgerber@uc.pt (TW)

    Affiliations Department of Neuroscience, UC San Diego, La Jolla, California, United States of America, SciCrunch Inc., San Diego, California, United States of America

  • Tracey Weissgerber

    Roles Data curation, Supervision, Writing – original draft, Writing – review & editing

    abandrowski@ucsd.edu (AB); tracey.weissgerber@uc.pt (TW)

    Affiliations QUEST Center for Responsible Research, Berlin Institute of Health at Charité Universitätsmedizin Berlin, Berlin, Germany, CIBB, Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal, CNC-UC, Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal


Abstract

The causes of the reproducibility crisis include lack of standardization and transparency in scientific reporting. Checklists such as ARRIVE and CONSORT seek to improve transparency, but authors do not always follow them and peer review often fails to identify missing items. To address these issues, several automated tools have been designed to check different rigor criteria. We conducted a broad comparison of 11 automated tools across 9 different rigor criteria from the ScreenIT group. For some criteria, including detecting open data, the comparison revealed a clear winner: a tool that performed much better than the others. For other criteria, including detection of inclusion and exclusion criteria, the combination of tools exceeded the performance of any one tool. We also identified key areas where tool developers should focus their effort to make their tools maximally useful. We conclude with a set of insights and recommendations for stakeholders in the development of rigor and transparency detection tools. The code and data for the study are available at https://github.com/PeterEckmann1/tool-comparison.

Introduction

The reproducibility crisis remains a central concern [1,2] in scientific fields ranging from psychology [3] to cancer biology [4]. The causes of this crisis include lack of standardization and transparency in scientific reporting [5], which has led to the addition of various checklists and instructions for grantees and authors. Popular checklists such as the CONSORT [6] or ARRIVE [7] guidelines for human and preclinical animal studies, respectively, have been proposed and added to prominent journals’ instructions to authors. Funders such as the NIH [8] have also implemented checklists as part of the grant submission process. Checklists may increase awareness of issues affecting reproducibility, but evidence suggests they do not significantly improve reporting quality [9,10].

While reproducibility consists of many factors, perhaps one of the easiest to address is transparency. Transparent, high quality reporting is achievable even if a study is already completed, but was not conducted in a fully rigorous manner. This includes details of the methods used, such as blinding and research materials, and placing data and software into locations that are as open as possible and as closed as necessary [11,12]. Peer review can help to address many of these issues, but human reviewers often fail to point out important missing information, such as catalog numbers for key resources, and may not comment on the absence of a criterion like blinding if it is not commonly reported in the field [13]. Humans are also less likely to flag some problematic practices like plagiarism, which requires searching through millions of documents, and manipulated figures, which requires extensive time, training and experience. Therefore, detecting these practices is best done through automated analysis. Paper mills produce problematic papers at scale, which can overwhelm the traditional peer review system and contributes to an unprecedented number of retractions [14]. For these reasons, the use of software tools has been proposed to automatically flag missing criteria critical for transparency [15]. Many publishers already use central platforms like the STM Integrity Hub [16] to check submitted papers for fraudulent or suspicious content using automated tools, which would otherwise be difficult to catch via traditional peer review. Other tools employ image and text mining techniques to search for criteria such as blinding, power calculations, randomization, open sourcing of code and data in manuscripts, and checking figure quality [17–20].

Despite the expansion of automated tools available to detect rigor criteria, there is an absence of work comparing their efficacy. Selecting a tool for a given use case can be difficult, as many tools purport to search for the same or similar things. Direct comparisons would help tool developers and users to understand the design and performance differences between tools (examples given in Fig 1). Comparisons would also help users, including publishers, reviewers, and metascientists, to move beyond selecting the tool that performs best in one context, and consider which tool, or combination of tools, fits best with the user’s intended purpose. A comparison would further help tool developers determine which types of tools, or which combinations of tools, are most effective at solving a particular problem, discover areas where existing tools are insufficient, and determine which design decisions affect tool performance.

Fig 1. Design features that may affect tool performance.

This figure highlights eight features that can affect tool performance and provides explanations and guiding questions for each feature.

https://doi.org/10.1371/journal.pone.0342225.g001

Most papers that introduce new tools compare performance against similar tools, but the dataset used for these comparisons is not standardized. In the worst case, some tool developers may try multiple datasets and pick the one where their tool has the highest performance, making it even more difficult to assess the tool’s true performance [21]. These practices can lead to “phantom progress,” where each new paper claims to reach state-of-the-art performance, but independent evaluation reveals no differences between old and new methods (e.g. [22]). Thus, a broad, independent comparison of rigor criteria tools is needed.

We seek to address this need by performing a broad, independent comparison across multiple tools and rigor criteria. We built a dataset encompassing a random subset of 1,500 open access manuscripts in PubMed Central, to ensure that our results would be applicable to many biomedical fields. We evaluated 9 rigor criteria and ran a suite of 11 tools on all papers in our dataset. Human curators assisted in labeling a gold standard dataset for each criterion, primarily focusing on cases where the tools disagreed, which we used to compute performance characteristics for each tool. After applying this standardized approach to many criteria and tools, we use our comparisons to examine progress in tools for detecting rigor criteria and to provide insights for tool developers and users.

Methods

The tools examined here were created by members of ScreenIT, a community of curators, developers, and other scientists who focus on the problem of scientific rigor and reproducibility [23]. The goal of this community is to alleviate current problems of poor research quality through the independent development of rigor-checking tools. In this study, we analyzed which tools were more and less useful for various criteria. The criteria we chose to address were based on the areas of expertise of the groups within ScreenIT, with each group working on the criteria they were most knowledgeable about. Many of these criteria were selected based on various checklists and guidelines, for example [7,24,25]. An unavoidable side effect is that, for many comparisons, the evaluating group had developed one of the tools under evaluation; we aimed to reduce the resulting bias as much as possible, as explained in the following sections.

We attempted to determine which tool performed better, but also considered whether a combination of tools was better suited to address a broader problem in reproducibility. Not all tools are directly comparable because they define the presence or absence of criteria differently; indeed, none of the tools define even seemingly simple things, like the presence of code, in the same way. For example, some tools assess the availability of open code, while other tools simply detect a statement about code sharing, such as “code is not available for this study.” The sections below contain an overview of the methods used, but see the Supplementary Information for the specific methods for each rigor criterion.

Paper extraction

XML files for 1,500 papers from PubMed Central’s Non-Commercial Open Access Subset (available at https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_noncomm/) with a PMCID starting with PMC008 (approximately March 2021 to April 2022) were selected using a simple random sample. For each paper, both the full text (all text contained within the XML’s <body> tag) and the methods section (all text within sections with titles that contain the words “method” or “procedure”) were extracted. Two tools screen PDFs to evaluate figures and tables; therefore, PDFs of all 1,500 papers were also retrieved automatically.
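The extraction step can be sketched as follows. This is a minimal illustration assuming JATS-style XML as distributed in the PMC Open Access Subset; the function name is illustrative rather than part of the study's released code.

```python
import xml.etree.ElementTree as ET

def extract_texts(xml_path):
    """Return (full_text, methods_text) for one PMC JATS XML file."""
    root = ET.parse(xml_path).getroot()

    # Full text: everything inside the <body> tag.
    body = root.find(".//body")
    full_text = " ".join(body.itertext()) if body is not None else ""

    # Methods: <sec> elements whose <title> contains "method" or "procedure".
    methods_chunks = []
    for sec in root.iter("sec"):
        title = sec.find("title")
        title_text = "".join(title.itertext()).lower() if title is not None else ""
        if "method" in title_text or "procedure" in title_text:
            methods_chunks.append(" ".join(sec.itertext()))

    return full_text, "\n".join(methods_chunks)
```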

Test set construction

In general, for each comparison (other than those involving tools that scanned the PDF version of the manuscript), we ran all tools on the extracted full text of the 1,500 papers. However, for some comparisons, the set of 1,500 papers was not used because it would not be very informative. For example, the 1,500-paper corpus does not contain many clinical trial papers, so we would not expect many baseline tables to be present. Instead, we selected a more focused set of trials separate from the main corpus. Whenever we used a corpus other than the main 1,500-paper set, we explicitly state this in the section for that comparison.

In cases where all tools agreed on the presence or absence of the criterion of interest, we treated that prediction as the “gold standard” result in our test set and performed no manual labeling. All papers where at least one of the tools disagreed on the presence or absence of a given criterion were examined by a curator assigned to that comparison task. This technique is similar to the Cranfield methodology [26,27] from information retrieval, where it is impossible to manually curate every document; therefore, one has to use the tools being studied to extract a smaller set of relevant documents to curate. Human curators were blinded to which tool(s) disagreed. This was done to reduce bias, as some curators were developers of the tools being analyzed. To speed up curation, we showed the human curator the extracted sentence (e.g. “Code for this paper is available at ... ”) so that curators could quickly identify positive cases. To reduce bias, we randomly selected a tool and displayed its extracted sentence, if it existed. This means that if the selected tool did not find anything, but another tool extracted a sentence, we still showed nothing to the curator.

We sought to estimate the rate of error in the portion of our test set where all tools agreed on the classification for a given paper. While it would be infeasible to manually curate every paper where all tools agreed, we randomly sampled 100 such papers for each criterion and manually labeled them. This allowed us to estimate the rate at which all tools were wrong, so that we could more accurately report the performance characteristics of the tools. If we had at least 50 positive examples (all tools agreed that the criterion was present) and at least 50 negative examples (all tools agreed that the criterion was absent), then the set consisted of 50 positives and 50 negatives. If we had fewer than 50 positive or negative papers, we used more of the other type to reach 100 papers. For all criteria, there were always at least 100 papers where all tools agreed.
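The 100-paper sampling scheme can be sketched as follows (an illustration only; the function name and the use of Python's random module are assumptions, not the study's actual code).

```python
import random

def sample_agreed_papers(agreed_positive, agreed_negative, n_total=100, seed=0):
    """Sample papers where all tools agreed, aiming for 50 positives and 50 negatives."""
    rng = random.Random(seed)
    n_pos = min(50, len(agreed_positive))
    n_neg = min(50, len(agreed_negative))
    # If one class has fewer than 50 papers, top up with the other class.
    if n_pos < 50:
        n_neg = min(n_total - n_pos, len(agreed_negative))
    elif n_neg < 50:
        n_pos = min(n_total - n_neg, len(agreed_positive))
    return rng.sample(agreed_positive, n_pos) + rng.sample(agreed_negative, n_neg)
```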

Generic definitions of items

The definition of criteria frequently differed between tools; therefore, curators were instructed to use their best judgment of a criterion’s concept as neutral observers, and not the exact definition that was used by one of the tool makers. This assessment approach was designed to ensure that the criteria were not dependent on any given tool. For example, there are specific definitions of a software tool that were strictly enforced in the training data for SoftCite (Du et al, 2021). These specific definitions were not provided to the curator, who instead used their best guess as to whether the item was or was not a piece of software (more specifics in each analysis subsection). This decision was made to increase the generalizability of the results, but it also introduces a type of bias. Curators tracked their definitions and processes in notes, which are summarized below for each of the analyses.

Ensemble model

To examine the potential value of combining the tools, a logistic regression ensemble model was trained and tested on the 1,500-paper set for each rigor criterion. The ensemble model was trained to make a classification based on a linear combination of the binary results of the individual tools. Sklearn’s (RRID:SCR_019053) LogisticRegression, trained with cross-entropy loss, was used for this task. Overfitting on the training set was not a concern due to the low expressiveness of a linear model with only four (or fewer, depending on the number of tools) parameters (plus an intercept). Nonetheless, we assessed overfitting using the method described below in “Statistical tests”.
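A minimal sketch of this setup, assuming one binary (0/1) column per tool and a gold-standard label per paper; the toy arrays and variable names below are illustrative, not the study's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per paper, one 0/1 column per tool; y: gold-standard label per paper.
# These toy values stand in for the real tool outputs.
X = np.array([[1, 0, 1],
              [0, 0, 0],
              [1, 1, 1],
              [0, 1, 0]])
y = np.array([1, 0, 1, 1])

# Logistic regression learns a linear combination of the binary tool results
# (cross-entropy/log loss is scikit-learn's default objective for this estimator).
ensemble = LogisticRegression().fit(X, y)
predictions = ensemble.predict(X)  # ensemble classification per paper
```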

Statistical tests

Each tool was assessed against the test set using an adjusted accuracy, precision, recall, and F1 score. We used adjusted performance metrics to account for our biased method of assigning ground truth labels. Since we did not manually label papers where no tool predicted a positive, we may be missing positive papers. To estimate the number of positive papers that were missed, we manually labeled a random subset of papers that no tool had predicted as positive. We extrapolated the results from this manual labeling to estimate the true number of positive papers that were missed. This estimate is captured in the “presumed positive rate” (PPR), which is the rate at which papers presumed to be positive were actually positive, and a “presumed negative rate” (PNR), the rate at which papers presumed to be negative were actually negative. We used these rates to compute the following adjusted values:

where TP, FP, FN, and TN are the number of true positive, false positive, false negative, and true negative examples, respectively.

Therefore, the adjusted values will always be more conservative than the unadjusted. The adjusted accuracy, precision (also known as the “positive predictive value”), recall (also known as the “sensitivity”), and F1 score (the harmonic mean of precision and recall) were then calculated as:
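A reconstruction of these formulas in LaTeX, assuming the standard definitions applied to the adjusted counts (the primed symbols denoting the adjusted values are illustrative notation):

```latex
\mathrm{Accuracy}_{\mathrm{adj}} = \frac{TP' + TN'}{TP' + FP' + FN' + TN'}
\qquad
\mathrm{Precision}_{\mathrm{adj}} = \frac{TP'}{TP' + FP'}
\qquad
\mathrm{Recall}_{\mathrm{adj}} = \frac{TP'}{TP' + FN'}
\qquad
F1_{\mathrm{adj}} = \frac{2 \cdot \mathrm{Precision}_{\mathrm{adj}} \cdot \mathrm{Recall}_{\mathrm{adj}}}{\mathrm{Precision}_{\mathrm{adj}} + \mathrm{Recall}_{\mathrm{adj}}}
```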

For each ensemble model that we trained for a given rigor criterion, we report what we call the “Learned ensemble function” to make the model’s learned decision process transparent. This was done by first inputting all permutations of different binary tool results to the model to generate a truth table. Then, we performed boolean simplification to generate a simplified expression of the function learned by the ensemble.
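The truth-table construction and boolean simplification can be sketched as follows; this illustration refits a toy ensemble like the one above and uses sympy's SOPform for the simplification (the tool symbols and data are assumptions).

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sympy import symbols
from sympy.logic import SOPform

# Fit a toy ensemble on illustrative binary tool outputs (stand-in for the real data).
X = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]])
y = np.array([1, 0, 1, 1])
ensemble = LogisticRegression().fit(X, y)

# Truth table: the ensemble's prediction for every permutation of binary tool results.
tools = symbols("tool_a tool_b tool_c")
minterms = [list(bits)
            for bits in product([0, 1], repeat=len(tools))
            if ensemble.predict(np.array([bits]))[0] == 1]

# Boolean simplification of the truth table into the "learned ensemble function".
print(SOPform(tools, minterms))  # prints a simplified boolean expression over the tool symbols
```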

To assess overfitting of the ensemble models, we trained the ensemble model on random 80% subsets of the training data (80% of the original dataset randomly resampled without replacement) and evaluated what percentage of the trained models had the same decision function as the model trained on all data. A high percentage indicates that overfitting is unlikely, because the model learns the same function despite the underlying data being different subsets of the larger dataset. Note that while the parameters might differ slightly, models trained on a subset of the data that have the same “Function learned” will have identical test-time behavior.
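A minimal sketch of this resampling check, again on illustrative toy data (the dataset, the 100 repeats, and the helper name are assumptions):

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression

def decision_table(model, n_tools):
    """The model's prediction for every combination of binary tool outputs."""
    grid = np.array(list(product([0, 1], repeat=n_tools)))
    return tuple(int(p) for p in model.predict(grid))

# Toy data: one 0/1 column per tool, gold-standard label per paper.
X = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y = np.array([1, 0, 1, 0, 1, 0])
full_table = decision_table(LogisticRegression().fit(X, y), X.shape[1])

rng = np.random.default_rng(0)
n_repeats, matches = 100, 0
for _ in range(n_repeats):
    # Random 80% subset, sampled without replacement.
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    sub_table = decision_table(LogisticRegression().fit(X[idx], y[idx]), X.shape[1])
    matches += (sub_table == full_table)

print(f"{100 * matches / n_repeats:.0f}% of subset models learned the same function")
```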

For comparisons between tools, a two-tailed t-test with variances calculated from sample proportions was used to compare the classification accuracies [28]. Additionally, Gwet’s agreement between each pair of tools, and between each tool and the true results, was calculated using the irrCAC Python package (RRID:SCR_023176). Gwet’s statistic measures the agreement between raters while adjusting for chance agreement [29]. We did not take into account the presumed positive/negative rate when calculating Gwet’s agreement.
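The study computed Gwet's agreement with the irrCAC package; as an illustration of what the statistic measures, Gwet's AC1 for two raters and a binary criterion can also be computed directly from its standard formula, as in the following sketch (the function name and example ratings are assumptions).

```python
import numpy as np

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 for two raters and a binary (0/1) criterion."""
    a = np.asarray(ratings_a, dtype=float)
    b = np.asarray(ratings_b, dtype=float)
    p_a = np.mean(a == b)            # observed agreement
    pi = (a.mean() + b.mean()) / 2   # average marginal proportion of "present"
    p_e = 2 * pi * (1 - pi)          # chance agreement for two categories
    return (p_a - p_e) / (1 - p_e)

# Example: agreement between two tools' binary classifications over six papers.
print(gwet_ac1([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1]))
```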

Results

A summary of the rigor criteria and tools we analyzed, and their important differences, is presented in Table 1. For readability, we summarize the findings from each tool comparison in the sections below. The Supporting information (SI) contains a tabular summary of all comparisons including tool references (S1 Text, Table 1), details for each comparison, as well as performance statistics.

Table 1. Overview of individual comparisons.

We highlight in overview form each tool comparison that we performed. A dash (“-”) indicates that the tools differed in one of the following ways, and a cross mark (“X”) denotes that the difference made a substantial impact on the final performance. Differences include how the definitions of terms are operationalized into a tool (e.g. does the tool recognize power calculations or other means to check for group size), document input format and requirements for preprocessing (e.g. extraction of text or images from PDF documents can be challenging and introduce systematic errors), section of the paper examined (e.g. related to sensitivity vs specificity as some information may be missed by a tool that does not run on the section of text where the mention appears), selection of training and validation data such as the field (e.g. clinical studies vs psychology), algorithm choice (e.g. regular expressions vs large language models), the openness & accessibility of the tool (e.g. does the tool have a version with a user interface, is the tool maintained, is the tool free and open code), and the desired performance (e.g. the tool may be tuned for sensitivity to capture more cases with more false positives). See Fig 1 for more details on the main differences between tools.

https://doi.org/10.1371/journal.pone.0342225.t001

Registration (S1 Text, Tables 3 and 4): In the comparison of registration tools, which look for clinical trial numbers and other protocol registration numbers, the definition of what the tool aimed to find was a major source of difference in performance. SciScore’s definition includes protocols (2 were found in the dataset), while other screening tools look for different sets of clinical trial registries. By far the most prevalent registry found in the 1,500 papers was clinicaltrials.gov (ctgov, S1 Text, Table 4), which accounted for 137 of the 169 total items correctly identified. Interestingly, the ctregistries tool, which recognizes the most registries, also accounted for all false positives in this dataset, because the various trial registry identifiers often share letter-number combinations with granting agency numbers, catalogue numbers, and medical abbreviations. We did not consider an ensemble model for this comparison because we compared tools at the entity level (mention of a specific trial identifier) and not at the paper level, so it is unclear how to perform ensembling in this case.

Inclusion and exclusion criteria (S1 Text, Tables 5, 6, and 7): Detection of participant inclusion and exclusion criteria was tested with three tools: pre-rob, SciScore, and Barzooka. The first two tools recognize text using an LLM (large language model) and a CRF (conditional random field), respectively, while Barzooka recognizes images of flow charts using a CNN (convolutional neural network). The raw performance of Barzooka was lower than that of the other two tools, but that was mostly because authors in our set of 1,500 papers primarily described inclusions and exclusions using text, not flow charts. However, the performance of the different tools was also highly complementary. When authors described their inclusions using a flow chart, this information was only detected by Barzooka. When inclusions were described in text, this was usually detected by one of the text tools. As a result, the combination of tools was more effective than any individual tool. The ensemble’s accuracy, precision, recall, and F1 were all in the 0.95 to 0.98 range, whereas the highest F1 of any individual tool was 0.91.

Blinding and randomization (S1 Text, Tables 8, 9, 10, 11, 12, and 13): The comparisons for blinding and randomization were both performed using three tools: pre-rob, SciScore, and CONSORT-TM [20]. These three tools all process text but use different models: SciScore uses a CRF, while the other two use language models, specifically two versions of BERT [30]. For the comparison of blinding, SciScore performed the best (F1 of 0.89) and the ensemble did not improve on this performance (F1 of 0.89). Detecting randomization was more difficult for the models, and performance was far lower for all tools (F1s of 0.40 to 0.76), perhaps because “random” is used to refer to techniques beyond random assignment, such as random effects models. CONSORT-TM did relatively poorly, while pre-rob and SciScore performed better. In this case, the training dataset is likely to explain the difference in performance. CONSORT-TM was trained on text from randomized controlled trials. pre-rob was trained primarily on data from preclinical animal studies. SciScore was trained on a very broad dataset with many different types of studies. The ensemble model also outperformed any individual tool (F1 of 0.76), demonstrating the benefits of combining tools.

Sample size determination (S1 Text, Tables 14, 15, and 16): In the analysis to detect how sample size was determined (e.g. power or sample size calculations), we tested SciScore and CONSORT-TM. These tools had similar performance, with F1 scores of 0.79 and 0.78, respectively. The main difference seemed to be the tuning of the tools: CONSORT-TM was more likely to produce false positives, while SciScore was more likely to produce false negatives. The ensemble model did not improve performance.

Software used (S1 Text, Tables 17, 18, and 19): To find mentions of software in papers, we tested SciScore and SoftCite. SoftCite scored a higher accuracy and F1 score (F1 of 0.87 compared to 0.27), and the ensemble learned to just use the SoftCite results. Here the difference was mainly due to a large number of false negatives in the SciScore data, which is likely to occur when software is mentioned outside of the methods section. Furthermore, SciScore was tuned to find or suggest RRID type entities; therefore this tool uses the RRID list of existing software (https://rrid.site/data/source/nlx_144509-1/search), and does not detect software that isn’t included in this list. The ensemble model did not increase performance over SoftCite alone.

Open code (S1 Text, Tables 20, 21, and 22): For detecting the presence of open code statements, we tested SciScore and ODDPub. We found that the simple regex-based tool, ODDPub, outperformed the machine learning tool SciScore. While the tools had significantly different definitions of what constitutes open code, we found that the majority of differences between the tools were due to mistakes from the SciScore machine learning model. The ensemble model did not increase performance over ODDPub alone.

Contaminated cell lines (S1 Text Tables 23, 24, and 25): We compared the ability of SciScore and PCLDetector to flag mentions of contaminated cell lines in papers, as defined by the International Cell Line Authentication Committee (ICLAC) or Cellosaurus. We found that SciScore had a high precision but low recall, while PCLDetector had a high recall but low precision. Thus, the difference between the tools can be mostly attributed to how conservatively they were designed. Depending on the use case, users may reasonably select either tool: PCLDetector if they wish to capture as many contaminated cell lines as possible (e.g. as a tool assisting peer review), and SciScore if they want to ensure that most of the detections are true (e.g. when analyzing problematic cell lines across the literature). The ensemble tool did not add performance over PCLDetector alone.

Baseline table detection (S1 Text, Table 26 and Fig 1): For extracting the baseline tables in randomized controlled trials, we compared the tools “baseline” and [unnamed]. baseline uses XML as input, while [unnamed] uses PDF. We found that baseline outperformed [unnamed], primarily because it used the XML of the paper as input, which is easily machine-readable. The PDF-based tool was less accurate because it had to rely on imperfect computer vision to detect tables. Thus, tool makers should aim to use an input format which is easily machine readable, as long as it is available for the papers they wish to run their tool on. We did not consider an ensemble model for this comparison because we compared tools at the entity level (individual baseline tables) and not at the paper level, so it is unclear how to perform ensembling in this case.

Discussion

In this study, we have conducted a broad comparison of 11 automated tools across 9 different rigor criteria. Based on the results of the comparisons, we draw the following conclusions to guide future tool development.

Firstly, while a few comparisons show overwhelming performance gaps between tools, most show only marginal differences. In these cases, tool run times, cost, ease of use, and the quality and transparency of the output are likely more important factors than raw performance. Therefore, tool developers should invest effort in tool usability and transparency in addition to pure performance.

Secondly, success when applying a tool to a dataset different from the one it was trained on varies. The pre-rob tool was trained on animal studies and had strong performance in randomization detection, but poor performance in inclusion/exclusion detection, when used on a dataset of general biomedical papers. Therefore, it is important for developers to train their tool on a dataset very similar to the one it will be applied to. Otherwise, they risk substantially worse performance when the tool is applied in practice.

During tool development, toolmakers make crucial decisions about how to define the feature of interest. These decisions are major contributors to performance differences between tools, as is illustrated by the comparison of tools to detect protocols. The SciScore definition of “protocols” includes clinical trials but also methods descriptions that one might find in protocols.io, whereas the other tools are only designed to detect clinical trial registrations. Users who are only interested in clinical trial registrations should use TRANscreener or ctregistries, whereas those who are interested in clinical trial registrations and step-by-step protocols for methods should use SciScore.

Thirdly, combining tools improves performance over any individual tool in some cases. Combining the best performing tool with tools that did not perform as well led to improved performance compared to the best tool alone. Therefore, users should not dismiss worse-performing tools that detect similar things to another tool with better performance. If a tool performs detection in a different way or was trained on a different dataset, it is likely that it will still identify things that the best-performing tool will miss. We also found that combining results from different tools is particularly valuable when a rigor criterion can be expressed in different modalities, like images and text. In these cases, ensembling tools that search different modalities can be extremely helpful for improving performance. Therefore, tool developers should consider which modalities rigor criteria can be expressed in and develop tools for modalities other than text, including multi-modal tools.

The primary limitation of our study is the construction of our gold standard set. Since we used the output of the tools to determine which papers humans would manually curate, it is likely we failed to correctly classify papers that all tools classified incorrectly. We sought to mitigate this problem by manually classifying a random sample of papers where all tools agreed, and using the resulting data to adjust our performance statistics. Other limitations of our study include a limited set of tools and a limited set of rigor criteria tested.

Bulleted lists of insights from our study are provided below for each of the key stakeholders in the development of automated rigor and transparency tools.

Insights for toolmakers

As outlined in Fig 1, toolmakers should consider the following when designing tools, as these decisions can substantially impact performance.

  • Operationalization of criteria
  • Input format and preprocessing
  • Sections of paper examined
  • Selection of training and validation data
  • Algorithm choice
  • Tool openness, transparency and accessibility
  • Desired performance: Toolmakers can choose to prioritize sensitivity or specificity, or give users the option to adjust the classification threshold

Insights for new toolmakers using LLMs

The proliferation of LLMs has prompted many people without coding experience, or other experience developing tools, to start creating tools. In addition to the insights shared for tool makers above, we highlight some specific points for those using LLMs to develop screening tools.

  • Simpler approaches often perform better. LLMs are inherently complicated, computationally intensive, and very expensive to develop and run. The large energy requirements exacerbate environmental impact. Toolmakers should use the simplest approach possible that achieves the desired performance. Lower complexity reduces computational time, energy needs, costs, and the likelihood of errors in the code. Simpler approaches work especially well for things that are consistently reported in standard ways, in specific sections of the manuscript. The fact that you can use an LLM doesn’t mean that you should.
  • Results may not be reproducible. LLMs are a black box. Tool developers do not know what criteria they are using to classify, and identical prompts may not give the same response, especially when the LLMs have undisclosed version changes.
  • Tool creators need a stable version that they control. Otherwise, performance will change each time that the LLM is updated. Tool makers may not know when updates have occurred.
  • Validation is essential. Many tools are released without any data on performance, which means users have no information on exactly what they are designed to detect, how they were trained, how often they make mistakes/hallucinations, and whether the tool is appropriate for their use case. Validations against human curated gold standard data should be performed regularly, and repeated every time that the tool or underlying LLM is adjusted.
  • Tool makers should consider using a private version of the LLM. Otherwise, the LLM may use data that users enter to further train the LLM. This is a particular concern if toolmakers enter information that is not already publicly available.

Insights for tool users

  • Pay attention to tool inputs and preprocessing steps.
  • Validate the tool on your own dataset to determine whether the tool is appropriate for your use case.
  • Test ensemble tools. In some cases, one tool is clearly superior. In other cases, a combination of tools performs better than any individual tool.
  • Know the limitations of the tool(s) that you use.
  • Consider openness, accessibility, and whether the tool is maintained.

Insights for those receiving results from tools

Those who receive results or reports from tools should pay particular attention to the following details when using and interpreting the reports.

  • Understand the criteria. Consult documentation in the report to understand how each item is defined, and why the items that are assessed are important. Readers must understand exactly what the tool was designed to detect, and how the criteria were operationalized, to interpret the report. Understanding why the items are important will help users to improve the paper.
  • Some items may not be relevant. Remember that some items may not apply to every paper. Some tools simply report that the item wasn’t found, without determining whether it was needed. Other tools distinguish between cases where an item is not reported, and cases where the item is not relevant.
  • Expect errors. When a tool measures many items, it is likely that you will see at least one false positive or false negative in a report. Even among classifiers with very high performance, the likelihood of an error on at least one item is high.
  • Regular users should validate performance. If you use reports from tools regularly (e.g., editors) or for larger datasets (e.g., metascientists), check some percentage (∼10%) of the reports manually. All tools make mistakes, and regular users should know the types of errors that the tools make and how often these errors occur. This will help you to use the report responsibly.

Supporting information

S1 Text. Contains detailed information about all comparisons.

https://doi.org/10.1371/journal.pone.0342225.s001

(PDF)

Acknowledgments

We would like to thank the members of the ScreenIT group who were not included as authors in this paper, but nonetheless contributed in discussions about this and related projects: Jennifer Byrne, Nicholas Brown, Tim Vines, Thomas Lemberger, Vince Istvan Madai, Julia Menon, Sean Rife, Iain Hrynaszkiewicz, Han Zhuang, Angelo Pezzullo, Andrew Brown, Camila Victoria-Quilla Baselly Heinrich, Rene Bernard, and Bertrand Favier.

References

  1. Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–4. pmid:27225100
  2. Diaba-Nuhoho P, Amponsah-Offeh M. Reproducibility and research integrity: the role of scientists and institutions. BMC Res Notes. 2021;14(1):451. pmid:34906213
  3. Camerer CF, Dreber A, Holzmeister F, Ho T-H, Huber J, Johannesson M, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat Hum Behav. 2018;2(9):637–44. pmid:31346273
  4. Errington TM, Denis A, Perfito N, Iorns E, Nosek BA. Challenges for assessing replicability in preclinical cancer biology. Elife. 2021;10:e67995. pmid:34874008
  5. Korbmacher M, Azevedo F, Pennington CR, Hartmann H, Pownall M, Schmidt K, et al. The replication crisis has led to positive structural, procedural, and community changes. Commun Psychol. 2023;1(1):3. pmid:39242883
  6. Hopewell S, Chan AW, Collins GS, Hróbjartsson A, Moher D, Schulz KF. CONSORT 2025 statement: updated guideline for reporting randomised trials. The Lancet. 2025;405(10489):1633–40.
  7. Percie du Sert N, Hurst V, Ahluwalia A, Alam S, Avey MT, Baker M, et al. The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. J Cereb Blood Flow Metab. 2020;40(9):1769–77. pmid:32663096
  8. National Institutes of Health. Guidance: Rigor and Reproducibility in Grant Applications. NIH Grants & Funding. 2024. https://grants.nih.gov/policy/reproducibility/guidance.htm
  9. Leung V, Rousseau-Blass F, Beauchamp G, Pang DSJ. ARRIVE has not ARRIVEd: Support for the ARRIVE (Animal Research: Reporting of in vivo Experiments) guidelines does not improve the reporting quality of papers in animal welfare, analgesia or anesthesia. PLoS One. 2018;13(5):e0197882. pmid:29795636
  10. Blanco D, Altman D, Moher D, Boutron I, Kirkham JJ, Cobo E. Scoping review on interventions to improve adherence to reporting guidelines in health research. BMJ Open. 2019;9(5):e026589. pmid:31076472
  11. Martone ME. The past, present and future of neuroscience data sharing: a perspective on the state of practices and infrastructure for FAIR. Front Neuroinform. 2024;17:1276407. pmid:38250019
  12. Nature P. Six factors affecting reproducibility in life science research and how to handle them. Nat Rev Methods Primers. 2019.
  13. Horbach SPJM, Halffman W. The ability of different peer review procedures to flag problematic publications. Scientometrics. 2019;118(1):339–73. pmid:30930504
  14. Oransky I. Nearing 5,000 retractions: a review of 2022. Retraction Watch. 2022. https://retractionwatch.com/2022/12/27/nearing-5000-retractions-a-review-of-2022/
  15. Schulz R, Barnett A, Bernard R, Brown NJ, Byrne JA, Eckmann P. Is the future of peer review automated? BMC Research Notes. 2022;15(1):203.
  16. van Rossum J. Guest post: The STM integrity hub — connecting the dots in a dynamic landscape. The Scholarly Kitchen. 2024. https://scholarlykitchen.sspnet.org/2024/05/23/guest-post-the-research-integrity-hub-connecting-the-dots-in-a-dynamic-landscape/
  17. Menke J, Eckmann P, Ozyurt IB, Roelandse M, Anderson N, Grethe J, et al. Establishing institutional scores with the rigor and transparency index: large-scale analysis of scientific reporting quality. J Med Internet Res. 2022;24(6):e37324. pmid:35759334
  18. Riedel N, Kip M, Bobrov E. ODDPub – a text-mining algorithm to detect data sharing in biomedical publications. BioRxiv. 2020:2020-05.
  19. Wang Q, Liao J, Lapata M, Macleod M. Risk of bias assessment in preclinical literature using natural language processing. Res Synth Methods. 2022;13(3):368–80. pmid:34709718
  20. Kilicoglu H, Rosemblat G, Hoang L, Wadhwa S, Peng Z, Malički M, et al. Toward assessing clinical trial publications for reporting transparency. J Biomed Inform. 2021;116:103717. pmid:33647518
  21. Semmelrock H, Kopeinik S, Theiler D, Ross-Hellauer T, Kowald D. Reproducibility in machine learning-driven research. arXiv preprint. 2023. https://arxiv.org/abs/2307.10320
  22. Ferrari Dacrema M, Cremonesi P, Jannach D. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In: Proceedings of the 13th ACM Conference on Recommender Systems. 2019. p. 101–9. https://doi.org/10.1145/3298689.3347058
  23. Weissgerber T, Riedel N, Kilicoglu H, Labbé C, Eckmann P, Ter Riet G, et al. Automated screening of COVID-19 preprints: can we help authors to improve transparency and reproducibility? Nat Med. 2021;27(1):6–7. pmid:33432174
  24. Landis SC, Amara SG, Asadullah K, Austin CP, Blumenstein R, Bradley EW, et al. A call for transparent reporting to optimize the predictive value of preclinical research. Nature. 2012;490(7419):187–91. pmid:23060188
  25. eLife. Materials Design Analysis Reporting (MDAR) Checklist for Authors. https://cdn.elifesciences.org/articles/81727/elife-81727-mdarchecklist1-v3.pdf
  26. Cleverdon CW. The ASLIB Cranfield research project on the comparative efficiency of indexing systems. Aslib Proceedings. 1960;12(12):421–31. https://doi.org/10.1108/eb049778
  27. Voorhees EM. TREC: Continuing information retrieval’s tradition of experimentation. Communications of the ACM. 2007;50(11):51–4.
  28. Sullivan LM, D’Agostino RB. Robustness of the t test applied to data distorted from normality by floor effects. Journal of Dental Research. 1992;71(12):1938–43.
  29. Wongpakaran N, Wongpakaran T, Wedding D, Gwet KL. A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Med Res Methodol. 2013;13:61. pmid:23627889
  30. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019. p. 4171–86.