Table 1.
“Simple” and “detailed” chatbot prompts used in Experiment 1.
Table 2.
The chatbots used in this study.
Table 3.
Sports nutrition criteria for rating chatbot performance in Experiment 1.
Table 4.
Chatbot accuracy scores in Experiment 1.
Table 5.
Contrasts for ChatbotID in Experiment 1.
Fig 1.
Accuracy scores among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall accuracy scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the accuracy scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the accuracy scores in the Racing domain. Bars represent the mean of the per-criterion accuracy scores and error bars represent the standard deviation (SD). ANOVA revealed a significant main effect of ChatbotID: accuracy scores for ChatGPT-4omini (comparison “a”: p = 0.008, d = 0.549), ChatGPT-4o (“b”: p < 0.001, d = 0.796), and Gemini1.5pro (“c”: p < 0.001, d = 0.752) were greater than those for ClaudePro, and accuracy scores for ChatGPT-4o (“d”: p = 0.008, d = 0.546) and Gemini1.5pro (“e”: p = 0.02, d = 0.502) were greater than those for Claude3.5Sonnet.
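For readers who wish to reproduce the form of the pairwise contrasts reported above, the following minimal Python sketch (not the authors' analysis code; the accuracy arrays are hypothetical placeholders) shows how a contrast p-value and Cohen's d with a pooled standard deviation can be computed.

import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Cohen's d using the pooled standard deviation of two groups."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

# Hypothetical accuracy scores (one value per rated response)
chatgpt_4o = np.array([0.82, 0.79, 0.88, 0.75, 0.84])
claude_pro = np.array([0.64, 0.70, 0.61, 0.68, 0.66])

t, p = stats.ttest_ind(chatgpt_4o, claude_pro)
print(f"p = {p:.3f}, d = {cohens_d(chatgpt_4o, claude_pro):.3f}")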
Fig 2.
Accuracy criteria in the Training domain among different chatbots on the two test days in Experiment 1.
The accuracy scores for all criteria measured in the Sports Nutrition for Training domain. Panels [A] and [B] show accuracy for the Energy availability criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Daily carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Daily protein intake criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Post-session carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the Post-session protein intake criterion for the Simple and Detailed prompts, respectively. Panels [K] and [L] show accuracy for the Hydration criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be performed.
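The median ± IQR summary used for the per-criterion scores in Figs 2–5 can be computed as in the short sketch below; this is an assumed illustration (the rubric scores shown are hypothetical), not the paper's own code.

import numpy as np

scores = np.array([2, 3, 2, 3])  # hypothetical per-rater criterion scores
median = np.median(scores)
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1  # interquartile range
print(f"median = {median}, IQR = {iqr} (Q1 = {q1}, Q3 = {q3})")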
Fig 3.
Accuracy criteria in the Training domain among different chatbots on the two test days in Experiment 1 (continued).
The accuracy scores for all criteria measured in the Sports Nutrition for Training domain. Panels [A] and [B] show accuracy for the Supplements criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Individualisation criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Disclaimer criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be performed.
Fig 4.
Accuracy criteria in the Racing domain among different chatbots on the two test days in Experiment 1.
The accuracy scores for all criteria measured in the Sports Nutrition for Racing domain. Panels [A] and [B] show accuracy for the Daily carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Daily food examples criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Pre-race carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Pre-race food examples criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the During-race carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [K] and [L] show accuracy for the During-race food examples criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be performed.
Fig 5.
Accuracy criteria in the Racing domain among different chatbots on the two test days in Experiment 1 (continued).
The accuracy scores for all criteria measured in the Sports Nutrition for Racing domain. Panels [A] and [B] show accuracy for the Pre-race hydration criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the During-race hydration criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Supplements criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Individualisation criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the Disclaimer criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be performed.
Fig 6.
Completeness among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall completeness scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the completeness scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the completeness scores in the Racing domain. Completeness in each domain was rated on a Likert scale of 1–3; therefore, overall completeness had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 7.
Clarity among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall clarity scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the clarity scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the clarity scores in the Racing domain. Clarity in each domain was rated on a Likert scale of 1–3; therefore, overall clarity had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 8.
The quality of cited evidence among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall quality of cited evidence scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the quality of cited evidence scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the quality of cited evidence scores in the Racing domain. The quality of cited evidence in each domain was rated on a Likert scale of 1–3; therefore, overall evidence quality had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 9.
The quality of additional information among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall quality of additional information scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the quality of additional information scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the quality of additional information scores in the Racing domain. The quality of additional information in each domain was rated on a Likert scale of 1–5; therefore, overall additional information quality had a maximum Likert score of 10. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Table 6.
Chatbot accuracy scores in Experiment 2.
Table 7.
Contrasts for ChatbotID in Experiment 2.
Table 8.
Contrasts for ExamDomain in Experiment 2.
Table 9.
Contrasts for TestDay in Experiment 2.
Fig 10.
The proportion of correct answers to exam questions among different chatbots on the two test days in Experiment 2.
Bars represent the proportion of correct answers and error bars represent the standard error (SE). Claude3.5Sonnet answered a higher proportion of exam questions correctly than ChatGPT-4omini (comparison “a”: p = 0.0001; r = −0.976, 95% CI −0.997 to −0.792), ChatGPT-4o (“b”: p = 0.04; r = −0.948, 95% CI −0.994 to −0.589), ClaudePro (“c”: p < 0.0001; r = −0.983, 95% CI −0.998 to −0.846), and Gemini1.5flash (“d”: p < 0.0001; r = −0.978, 95% CI −0.998 to −0.807). Gemini1.5pro also answered a higher proportion correctly than ChatGPT-4omini (“e”: p = 0.004; r = −0.964, 95% CI −0.700 to 0.952), ClaudePro (“f”: p = 0.0001; r = −0.976, 95% CI −0.998 to −0.793), and Gemini1.5flash (“g”: p = 0.002; r = −0.967, 95% CI −0.997 to −0.726). The proportion of correct answers did not differ between test days.
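The caption does not state which test produced the effect size r; one common route to a rank-biserial correlation of this form is via a Mann-Whitney U comparison, sketched below in Python with hypothetical correct/incorrect answer vectors (this is an assumed illustration, not the authors' analysis).

import numpy as np
from scipy import stats

claude_sonnet = np.array([1, 1, 1, 0, 1, 1, 1, 1])   # 1 = correct answer
chatgpt_4omini = np.array([1, 0, 0, 1, 0, 1, 0, 0])

u, p = stats.mannwhitneyu(claude_sonnet, chatgpt_4omini,
                          alternative="two-sided")
n1, n2 = len(claude_sonnet), len(chatgpt_4omini)
# Rank-biserial correlation; its sign depends on the group ordering
r = 1 - 2 * u / (n1 * n2)
print(f"U = {u}, p = {p:.4f}, r = {r:.3f}")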