Table 1.
“Simple” and “detailed” chatbot prompts used in Experiment 1.
Table 2.
The chatbots used in this study.
Table 3.
Sports nutrition criteria for rating chatbot performance in Experiment 1.
Table 4.
Chatbot accuracy scores in Experiment 1.
Table 5.
Contrasts for ChatbotID in Experiment 1.
Fig 1.
Accuracy scores among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall accuracy scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the accuracy scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the accuracy scores in the Racing domain. Bars represent the mean of the per-criterion accuracy scores and error bars represent the standard deviation (SD). ANOVA revealed a significant main effect of ChatbotID: accuracy scores for ChatGPT-4omini (comparison “a”: p = 0.008, d = 0.549), ChatGPT-4o (“b”: p < 0.001, d = 0.796), and Gemini1.5pro (“c”: p < 0.001, d = 0.752) were greater than those for ClaudePro, and accuracy scores for ChatGPT-4o (“d”: p = 0.008, d = 0.546) and Gemini1.5pro (“e”: p = 0.02, d = 0.502) were greater than those for Claude3.5Sonnet.
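For readers who wish to reproduce the form of the pairwise contrasts reported above, the following minimal Python sketch (not the authors' analysis code; the accuracy arrays are hypothetical placeholders) shows how a contrast p-value and Cohen's d with a pooled standard deviation can be computed.

import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Cohen's d using the pooled standard deviation of two groups."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

# Hypothetical accuracy scores (one value per rated response)
chatgpt_4o = np.array([0.82, 0.79, 0.88, 0.75, 0.84])
claude_pro = np.array([0.64, 0.70, 0.61, 0.68, 0.66])

t, p = stats.ttest_ind(chatgpt_4o, claude_pro)
print(f"p = {p:.3f}, d = {cohens_d(chatgpt_4o, claude_pro):.3f}")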
Fig 2.
Accuracy criteria in the Training domain among different chatbots on the two test days in Experiment 1.
The accuracy scores for all criteria measured in the Sports Nutrition for Training domain. Panels [A] and [B] show accuracy for the Energy availability criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Daily carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Daily protein intake criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Post-session carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the Post-session protein intake criterion for the Simple and Detailed prompts, respectively. Panels [K] and [L] show accuracy for the Hydration criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be performed.
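The median ± IQR summary used for the per-criterion scores in Figs 2–5 can be computed as in the short sketch below; this is an assumed illustration (the rubric scores shown are hypothetical), not the paper's own code.

import numpy as np

scores = np.array([2, 3, 2, 3])  # hypothetical per-rater criterion scores
median = np.median(scores)
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1  # interquartile range
print(f"median = {median}, IQR = {iqr} (Q1 = {q1}, Q3 = {q3})")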
Fig 3.
Accuracy criteria in the Training domain among different chatbots on the two test days in Experiment 1 (continued).
The accuracy scores for all criteria measured in the Sports Nutrition for Training domain. Panels [A] and [B] show accuracy for the Supplements criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Individualisation criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Disclaimer criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be performed.
Fig 4.
Accuracy criteria in the Racing domain among different chatbots on the two test days in Experiment 1.
The accuracy scores for all criteria measured in the Sports Nutrition for Racing domain. Panels [A] and [B] show accuracy for the Daily carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Daily food examples criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Pre-race carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Pre-race food examples criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the During-race carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [K] and [L] show accuracy for the During-race food examples criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be performed.
Fig 5.
Accuracy criteria in the Racing domain among different chatbots on the two test days in Experiment 1 (continued).
The accuracy scores for all criteria measured in the Sports Nutrition for Racing domain. Panels [A] and [B] show accuracy for the Pre-race hydration criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the During-race hydration criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Supplements criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Individualisation criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the Disclaimer criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be performed.
Fig 6.
Completeness among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall completeness scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the completeness scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the completeness scores in the Racing domain. Completeness in each domain was rated on a Likert scale of 1–3; therefore, overall completeness had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 7.
Clarity among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall clarity scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the clarity scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the clarity scores in the Racing domain. Clarity in each domain was rated on a Likert scale of 1–3; therefore, overall clarity had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 8.
The quality of cited evidence among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall quality of cited evidence scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the quality of cited evidence scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the quality of cited evidence scores in the Racing domain. The quality of cited evidence in each domain was rated on a Likert scale of 1–3; therefore, overall evidence quality had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 9.
The quality of additional information among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall quality of additional information scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the quality of additional information scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the quality of additional information scores in the Racing domain. The quality of additional information in each domain was rated on a Likert scale of 1–5; therefore, overall additional information quality had a maximum Likert score of 10. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Table 6.
Chatbot accuracy scores in Experiment 2.
Table 7.
Contrasts for ChatbotID in Experiment 2.
Table 8.
Contrasts for ExamDomain in Experiment 2.
Table 9.
Contrasts for TestDay in Experiment 2.
Fig 10.
The proportion of correct answers to exam questions among different chatbots on the two test days in Experiment 2.
Bars represent the proportion of correct answers and error bars represent the standard error (SE). Claude3.5Sonnet answered a higher proportion of exam questions correctly than ChatGPT-4omini (comparison “a”: p = 0.0001; r = −0.976, 95% CI −0.997 to −0.792), ChatGPT-4o (“b”: p = 0.04; r = −0.948, 95% CI −0.994 to −0.589), ClaudePro (“c”: p < 0.0001; r = −0.983, 95% CI −0.998 to −0.846), and Gemini1.5flash (“d”: p < 0.0001; r = −0.978, 95% CI −0.998 to −0.807). Gemini1.5pro also answered a higher proportion correctly than ChatGPT-4omini (“e”: p = 0.004; r = −0.964, 95% CI −0.700 to 0.952), ClaudePro (“f”: p = 0.0001; r = −0.976, 95% CI −0.998 to −0.793), and Gemini1.5flash (“g”: p = 0.002; r = −0.967, 95% CI −0.997 to −0.726). The proportion of correct answers did not differ between test days.
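The caption does not state which test produced the effect size r; one common route to a rank-biserial correlation of this form is via a Mann-Whitney U comparison, sketched below in Python with hypothetical correct/incorrect answer vectors (this is an assumed illustration, not the authors' analysis).

import numpy as np
from scipy import stats

claude_sonnet = np.array([1, 1, 1, 0, 1, 1, 1, 1])   # 1 = correct answer
chatgpt_4omini = np.array([1, 0, 0, 1, 0, 1, 0, 0])

u, p = stats.mannwhitneyu(claude_sonnet, chatgpt_4omini,
                          alternative="two-sided")
n1, n2 = len(claude_sonnet), len(chatgpt_4omini)
# Rank-biserial correlation; its sign depends on the group ordering
r = 1 - 2 * u / (n1 * n2)
print(f"U = {u}, p = {p:.4f}, r = {r:.3f}")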