Benchmarking large language models for extracting biobank-derived insights into health and disease
Fig 2
Benchmark Performance of Six Frontier LLMs on UK Biobank Knowledge Retrieval Tasks (January 2026).
(A) Keywords: Keyword recognition benchmark comparing Coverage Score (proportion of the top 20 keywords matched) and Weighted Coverage Score (frequency-weighted) across all six models. Gemini 3 Pro and Mistral Large achieved the highest coverage (0.80), followed by the Claude models (0.75); GPT-5.2 and DeepSeek V3 scored 0.60. (B) Most Cited Papers: All models struggled with this challenging task. Gemini 3 Pro scored highest (Coverage 0.20, Weighted 0.21), while GPT-5.2 and DeepSeek V3 failed to identify any matching papers (0.00). (C) Most Prolific Authors: Each model’s output is assessed against the top 20 authors by publication count. Coverage Scores capture how many of these authors each LLM mentions, while Weighted Coverage Scores give more weight to authors with more publications. (D) Applicant Institutions: Performance in detecting the 10 most frequent UK Biobank applicant institutions, with scores weighted by each institution’s application count. (E) Overall Ranking: Overall weighted coverage ranking across all four tasks. Gemini 3 Pro leads (0.643), followed by Claude Sonnet 4 and Claude Opus 4.5 (0.577 each), Mistral Large (0.567), DeepSeek V3 (0.517), and GPT-5.2 (0.455).
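The two metrics named in the caption can be illustrated with a minimal sketch. The caption does not give the exact weighting formula, so the version below is an assumption: Weighted Coverage is taken as the summed weight (frequency, publication count, or application count) of the matched reference items divided by the summed weight of all reference items. The function name `coverage_scores`, the case-insensitive exact matching, and the example data are hypothetical and are not taken from the study.

```python
from typing import Dict, Iterable, Tuple


def coverage_scores(reference: Dict[str, float],
                    predicted: Iterable[str]) -> Tuple[float, float]:
    """Return (coverage, weighted_coverage) for one model on one task.

    reference maps each top-N item to its weight (e.g. keyword frequency).
    predicted is the list of items named by the model.
    ASSUMPTION: weighted coverage = matched weight / total reference weight.
    """
    def norm(s: str) -> str:
        # Lowercase and collapse whitespace so formatting differences still match.
        return " ".join(s.lower().split())

    predicted_set = {norm(p) for p in predicted}
    matched = [item for item in reference if norm(item) in predicted_set]

    coverage = len(matched) / len(reference)
    weighted = sum(reference[m] for m in matched) / sum(reference.values())
    return coverage, weighted


# Hypothetical reference list with made-up frequencies (not study data).
example_reference = {
    "genome-wide association study": 50,
    "mendelian randomization": 30,
    "cardiovascular disease": 15,
    "physical activity": 5,
}
example_predicted = ["Mendelian randomization", "Physical  Activity", "sleep"]

example_coverage, example_weighted = coverage_scores(example_reference, example_predicted)
```

In this toy case two of four reference items are matched, so the unweighted score is higher than the weighted one because the most frequent item was missed. Under this assumed definition, a model can rank differently on the two metrics depending on whether it recovers the heavily weighted items.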