Human-anchored longitudinal comparison of generative AI with a bias-calibrated LLM-as-judge

doi:10.1371/journal.pone.0339920

Table 1.

Inter-rater reliability (Krippendorff’s α) with 95% confidence intervals by week for the primary outcome (Correctness).

Fig 1.

Weekly model performance (human-rated correctness).

Error bars represent 95% confidence intervals from mixed-effects model estimates.

Table 2.

Key mixed-effects model contrasts, Holm-adjusted p-values, and Hedges’ g effect sizes with 95% confidence intervals.

Fig 2.

Annotated drift event for Model C.

A vertical dashed line marks the statistically significant change-point detected by the PELT algorithm.

Table 3.

Stability metrics (ICC[2,k]) across weeks for correctness and safety subscales.

Fig 3.

Effect of weekly judge calibration on judge–human agreement (Kendall’s τ) over the 10-week study.

The calibrated judge demonstrates higher and more stable agreement with human ratings.

Table 4.

Longitudinal safety indicators by model family, including mean toxicity probability and flagged output ratio.
