Benchmarking large language models for extracting biobank-derived insights into health and disease

doi:10.1371/journal.pcbi.1014224

Fig 1.

Benchmarking Results from UK Biobank Schema Data.

(A) Publications by Year: Annual count of UK Biobank publications from Schema 19 (8,549 abstracts), showing growth from 2013 to 2025. (B) Top 15 Keywords: The most frequently cited keywords in UK Biobank papers, with “Humans” (6,547 occurrences) and demographic terms dominating, followed by methodological terms such as GWAS and Mendelian randomization. (C) Most Cited Papers: The ten most cited publications associated with the UK Biobank, with citation counts from Schema 19 metadata. Paper titles are truncated for display. (D) Top 15 Authors: The most prolific authors by publication count (e.g., George Davey Smith, 122 publications; Naveed Sattar, 119). (E) Top 10 Applicant Institutions: Leading institutions by number of approved UK Biobank research applications (Schema 27; 15,046 applications), with the University of Oxford (186 applications) ranking first.

Fig 2.

Benchmark Performance of Six Frontier LLMs on UK Biobank Knowledge Retrieval Tasks (January 2026).

(A) Keywords: Keyword recognition benchmark comparing Coverage Score (proportion of top 20 keywords matched) and Weighted Coverage Score (frequency-weighted) across all six models. Gemini 3 Pro and Mistral Large achieved the highest coverage (0.80), followed by the Claude models (0.75). GPT-5.2 and DeepSeek V3 scored 0.60. (B) Most Cited Papers: All models struggled with this challenging task. Gemini 3 Pro scored highest (Coverage 0.20, Weighted 0.21), while GPT-5.2 and DeepSeek V3 failed to identify any matching papers (0.00). (C) Most Prolific Authors: Each model’s output is scored against the top 20 authors by publication count. Coverage Scores capture how many of these authors each LLM mentions, while Weighted Coverage Scores emphasize authors with more total publications. (D) Applicant Institutions: Performance in identifying the 10 most frequent UK Biobank applicant institutions. Scores are weighted by each institution’s application count. (E) Overall Ranking: Ranking by overall Weighted Coverage Score across all four tasks. Gemini 3 Pro leads (0.643), followed by Claude Sonnet 4 and Claude Opus 4.5 (0.577 each), Mistral Large (0.567), DeepSeek V3 (0.517), and GPT-5.2 (0.455).
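The two scores used throughout these panels can be illustrated with a minimal sketch. The caption does not specify the matching rule or the exact weighting, so the sketch assumes case-insensitive exact matching and weights proportional to each entity's ground-truth count. The function names and the toy counts are hypothetical and are not taken from the study.

```python
from typing import Dict, Iterable


def _normalise(term: str) -> str:
    """Assumed matching rule: case-insensitive comparison of trimmed strings."""
    return term.strip().lower()


def coverage_score(predicted: Iterable[str], truth: Dict[str, float]) -> float:
    """Fraction of ground-truth entities that appear in the model output."""
    if not truth:
        return 0.0
    predicted_norm = {_normalise(p) for p in predicted}
    hits = sum(1 for entity in truth if _normalise(entity) in predicted_norm)
    return hits / len(truth)


def weighted_coverage_score(predicted: Iterable[str], truth: Dict[str, float]) -> float:
    """Share of the total ground-truth weight carried by the matched entities."""
    total_weight = sum(truth.values())
    if total_weight == 0:
        return 0.0
    predicted_norm = {_normalise(p) for p in predicted}
    matched_weight = sum(
        weight for entity, weight in truth.items()
        if _normalise(entity) in predicted_norm
    )
    return matched_weight / total_weight


# Toy ground truth: entity -> illustrative count (hypothetical values).
toy_truth = {"humans": 60, "female": 25, "male": 10, "gwas": 5}
toy_output = ["Humans", "GWAS", "proteomics"]

coverage = coverage_score(toy_output, toy_truth)           # 2 of 4 entities matched
weighted = weighted_coverage_score(toy_output, toy_truth)  # 65 of 100 weight units matched
```

In this toy case the weighted score exceeds the unweighted one because the model recovered the most frequent entity, which is the behaviour the frequency weighting is meant to reward.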

Table 1.

Weighted Coverage Scores for Six Frontier LLMs Across Four UK Biobank Benchmark Tasks. Each cell reports the Weighted Coverage Score, which measures the proportion of ground-truth entities retrieved by each model, weighted by entity frequency or citation impact in UK Biobank metadata (Schemas 19 and 27). The Overall score is the mean Weighted Coverage Score across all four tasks.

Fig 3.

Multidimensional Performance Analysis of Six Frontier Large Language Models on UK Biobank Benchmark (January 2026).

(A) Radar plot showing normalised scores (0–1 scale) across six evaluation dimensions for all six models. Each axis represents one dimension: Semantic Accuracy, Factual Correctness, Domain Knowledge, Reasoning Quality, Response Depth, and Biobank Specificity. Gemini 3 Pro (blue) shows the highest overall coverage, while the Claude models (orange, green) demonstrate strength in reasoning quality. (B) Grouped bar chart comparing three key evaluation dimensions (Semantic Accuracy, Reasoning Quality, Domain Knowledge) across all six models. Values are labelled directly on the bars. (C) Heatmap of model rankings (1 = best, 6 = worst) for each of the six evaluation dimensions. Colour gradient: green indicates top ranks, red indicates bottom ranks. (D) Summary statistics table displaying Mean Score, Standard Deviation, Score Range, Consistency Score, and F1 Score for each model, sorted by Mean Score in descending order. (E) Distribution plots showing model-wise performance for Semantic Accuracy (left panel) and Performance Consistency (right panel). Models are ranked by their respective scores. Claude 4 shows the highest consistency (0.89) across dimensions.

Fig 4.

LLM Weighted Coverage Performance Compared to Random Baseline.

All panels use the Weighted Coverage Score (mean across all four benchmark tasks: Keywords, Papers, Authors, and Institutions) as the evaluation metric. This is the same “Overall” score reported in Table 1. The Weighted Coverage Score measures the proportion of ground-truth entities retrieved by each model, weighted by entity frequency or citation impact in UK Biobank metadata. Qualitative evaluation dimensions (Reasoning Quality, Response Depth, etc.) are not applied to random outputs, as they require coherent text. (A) Density histogram showing the distribution of Weighted Coverage Scores for the random baseline (n = 1,000 iterations of random term sampling from UK Biobank vocabulary pools). Vertical dashed lines indicate the mean Weighted Coverage Score across all six LLMs (blue, 0.597) and the minimum LLM score (green, 0.456). All six LLMs fall far outside the random baseline distribution. (B) Improvement factors for each LLM over the random baseline, calculated as (Model Weighted Coverage Score) / (Random Baseline Mean Weighted Coverage Score). All models achieve a 16× to 25× improvement over chance, indicating genuine encoding of biobank-specific knowledge rather than coincidental term overlap. (C) Statistical significance of each LLM versus the random baseline, tested using the Mann-Whitney U test. The y-axis displays -log10(p-value); horizontal lines indicate significance thresholds at p < 0.001 (red) and p < 0.05 (orange). All models achieve p < 0.001. (D) Precision-Recall analysis for entity extraction (authors and institutions) across all models. Point color indicates Overall Weighted Coverage Score (colorbar). Dashed curves represent F1 score iso-lines at 0.3, 0.5, and 0.7. Model names are annotated adjacent to each data point.
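The random baseline in panel (A) and the improvement factor in panel (B) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the vocabulary pool, the ground-truth set, the number of terms drawn per iteration, and the model score of 0.5 are all hypothetical, and matching is assumed to be exact. The significance test in panel (C) is not reproduced here.

```python
import random
import statistics
from typing import Dict, List, Sequence


def weighted_coverage(predicted: Sequence[str], truth: Dict[str, float]) -> float:
    """Share of total ground-truth weight carried by exactly matched entities."""
    total_weight = sum(truth.values())
    if total_weight == 0:
        return 0.0
    predicted_set = set(predicted)
    return sum(w for e, w in truth.items() if e in predicted_set) / total_weight


def random_baseline(
    vocabulary: Sequence[str],
    truth: Dict[str, float],
    n_terms: int,
    n_iter: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Score n_iter random draws of n_terms distinct terms from the vocabulary."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pool = list(vocabulary)
    return [
        weighted_coverage(rng.sample(pool, n_terms), truth)
        for _ in range(n_iter)
    ]


def improvement_factor(model_score: float, baseline_scores: Sequence[float]) -> float:
    """Model score divided by the mean score of the random baseline."""
    baseline_mean = statistics.mean(baseline_scores)
    if baseline_mean == 0:
        raise ValueError("baseline mean is zero; improvement factor is undefined")
    return model_score / baseline_mean


# Hypothetical setup: 500-term vocabulary, 20 equally weighted ground-truth terms.
toy_vocabulary = [f"term_{i}" for i in range(500)]
toy_truth = {f"term_{i}": 1.0 for i in range(20)}

baseline_scores = random_baseline(toy_vocabulary, toy_truth, n_terms=20)
factor = improvement_factor(0.5, baseline_scores)  # 0.5 is a hypothetical model score
```

With these toy settings each ground-truth term is drawn with probability 20/500, so the baseline mean sits near 0.04. The size of the improvement factor therefore depends heavily on the vocabulary pool and the number of terms sampled per iteration.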
