Careful design of Large Language Model pipelines enables expert-level retrieval of evidence-based information from syntheses and databases

doi:10.1371/journal.pone.0323563

Fig 1.

A Large Language Model (LLM), Claude 3.5 Sonnet, was used to generate a multiple-choice exam for each of the 2,250 actions using an automated method³⁰ (Panels A and B), excluding questions based solely on the effectiveness categories in the Conservation Evidence database.

This produced a larger set of 1,867 unfiltered questions. We also refined these to a filtered set of 45 questions to enable a comparison with human experts, ensuring that each filtered question was clear and had a single, accurate answer. Ten LLMs were then asked to answer each question in the unfiltered and filtered sets under six different exam conditions (Panel B). These exam conditions included three different types of retrieval strategy: sparse, dense, and hybrid retrieval (Box 1).
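As a rough illustration of how a hybrid strategy can combine sparse and dense retrieval, the sketch below merges two ranked lists of document identifiers with reciprocal rank fusion. This is an assumption made for illustration only: the caption does not state how the hybrid strategy combines the two rankings, and the function name `reciprocal_rank_fusion`, the constant `k = 60`, and the document identifiers are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in;
    documents are returned by descending total score (ties broken by id).
    NOTE: illustrative assumption, not necessarily the method used in the study.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a sparse (term-based) and a dense (embedding-based) retriever.
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Here document `b` ranks highly in both lists and therefore comes first in the fused ranking, which is the behaviour a hybrid strategy is meant to capture.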


Table 1.

Results of a paired comparison of the question accuracy of Large Language Models using a hybrid retrieval strategy versus human experts, and of the retrieval accuracy of different retrieval strategies versus human experts, for the filtered 45-question set. Test statistics are from a permutation test of two null hypotheses: (1) LLM: no difference in multiple-choice question accuracy between the given LLM and a randomly selected human expert; (2) Retrieval strategies: no difference in retrieval accuracy between a retrieval strategy and a randomly selected human expert. The test statistic can be interpreted as follows: negative values = random guesser or randomly selected human expert answered more questions correctly; positive values = LLM/retrieval strategy answered more questions correctly; zero = all draws or equal numbers of wins and losses. The test statistic accounts for the number of draws expected by chance for the multiple-choice questions. * denotes a statistically significant Holm-adjusted p-value at the 0.05 significance level.
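The general shape of such a test can be sketched as follows, under stated assumptions. The statistic used here is simply wins minus losses on paired correct/incorrect outcomes, so, unlike the statistic reported in the table, it does not correct for draws expected by chance. The function names, the sign-flip scheme, and the example data are illustrative and are not taken from the study.

```python
import random


def paired_permutation_test(llm_correct, human_correct, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on paired 0/1 outcomes.

    Statistic: wins minus losses for the LLM (draws contribute zero).
    Under the null hypothesis each paired difference is equally likely
    to have either sign, so signs are flipped at random.
    """
    diffs = [a - b for a, b in zip(llm_correct, human_correct)]
    observed = sum(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


def holm_adjust(p_values):
    """Holm step-down adjustment; returns p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Made-up outcomes for four questions (1 = correct, 0 = incorrect).
stat, p = paired_permutation_test([1, 1, 1, 0], [0, 0, 1, 0])
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A positive `stat` means the LLM answered more of the paired questions correctly, matching the sign convention described in the caption. Holm adjustment is then applied across the whole family of comparisons before judging significance at 0.05.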


Fig 2.

Logistic regression Generalised Linear Model (GLM) predictions of the accuracy of LLMs across different synopses, under different exam types (mean and 95% Confidence Intervals).

The results for confused, sparse, and dense retrieval are found in S3 Fig in S2 File.


Table 2.

Overall Large Language Model (LLM) accuracy across different exam conditions on the unfiltered dataset. The table is sorted by LLM performance under the hybrid retrieval strategy. Results for the filtered 45-question dataset are presented in S8 Table in S2 File. We also specify whether LLMs are open or closed source.


Table 3.

Overall retrieval accuracy of different retrieval strategies across the unfiltered dataset of questions.


Fig 3.

Logistic regression Generalised Linear Model (GLM) predictions of the retrieval accuracy of different retrieval strategies across different synopses (mean and 95% Confidence Intervals).
