Table 1.
Inter-rater reliability (Krippendorff’s α) with 95% confidence intervals by week for the primary outcome (Correctness).
Fig 1.
Weekly model performance (human-rated correctness).
Error bars represent 95% confidence intervals from mixed-effects model estimates.
Table 2.
Key mixed-effects model contrasts, Holm-adjusted p-values, and Hedges’ g effect sizes with 95% confidence intervals.
Fig 2.
Annotated drift event for Model C.
A vertical dashed line marks the statistically significant change-point detected by the PELT algorithm.
Table 3.
Stability metrics (ICC[2,k]) across weeks for correctness and safety subscales.
Fig 3.
Effect of weekly judge calibration on judge–human agreement (Kendall’s τ) over the 10-week study. The calibrated judge demonstrates higher and more stable agreement with human ratings.
Table 4.
Longitudinal safety indicators by model family, including mean toxicity probability and flagged output ratio.