Table 1.

Inter-rater reliability (Krippendorff’s α) with 95% confidence intervals by week for the primary outcome (Correctness).

Fig 1.

Weekly model performance (human-rated correctness).

Error bars represent 95% confidence intervals from mixed-effects model estimates.

Table 2.

Key mixed-effects model contrasts, Holm-adjusted p-values, and Hedges’ g effect sizes with 95% confidence intervals.

Fig 2.

Annotated drift event for Model C.

A vertical dashed line marks the statistically significant change-point detected by the PELT algorithm.

Table 3.

Stability metrics (ICC[2,k]) across weeks for correctness and safety subscales.

Fig 3.

Effect of weekly judge calibration on judge–human agreement (Kendall’s τ) over the 10-week study. The calibrated judge demonstrates higher and more stable agreement with human ratings.

Table 4.

Longitudinal safety indicators by model family, including mean toxicity probability and flagged output ratio.
