Table 1.
NoVo normative data as a function of native vs. non-native status and educational stratification.
The mean of each subsample is compared with the mean of the total sample. The values are z scores: 0 represents the mean of the total sample, and the unit is the standard deviation (SD) of the total sample.
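The standardization described in this caption can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `subsample_z` and the scores are hypothetical, and the use of the population SD (rather than the sample SD) is an assumption.

```python
from statistics import mean, pstdev


def subsample_z(subsample, total):
    """Express a subsample mean in SD units of the total sample.

    0 means the subsample mean equals the total-sample mean.
    Assumption: population SD (pstdev); the source does not say
    whether the sample SD was used instead.
    """
    total_mean = mean(total)
    total_sd = pstdev(total)
    return (mean(subsample) - total_mean) / total_sd


# Hypothetical scores, for illustration only.
total_scores = [2, 4, 4, 4, 5, 5, 7, 9]
subsample_scores = [5, 5, 7, 9]

z = subsample_z(subsample_scores, total_scores)
print(z)
```

A positive value means the subsample scored above the total-sample mean; a negative value means it scored below.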
Fig 1.
The performance of the LLMs compared to human performance as a function of educational stratification.
Table 2.
Test results of ChatGPT and Bing on the individual test occasions.
The results are expressed 1) as thetas, as is customary in item response theory, and 2) as percentile scores. Thetas are z scores, i.e., they are expressed in standard deviation units, while percentile scores indicate the percentage of the normative sample that is outperformed.
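The relation between the two scales can be sketched in two ways. Both functions below are hypothetical illustrations, not the procedure used in the study: the first counts how many normative test-takers fall below a given theta, and the second assumes that thetas follow a standard normal distribution in the normative sample. The normative values are invented.

```python
from statistics import NormalDist


def percentile_empirical(theta, normative_thetas):
    """Percentage of the normative sample with a theta below `theta`."""
    below = sum(1 for t in normative_thetas if t < theta)
    return 100 * below / len(normative_thetas)


def percentile_normal(theta):
    """Percentile under the assumption that normative thetas are
    standard normal (mean 0, SD 1)."""
    return 100 * NormalDist().cdf(theta)


# Hypothetical normative thetas, for illustration only.
normative = [-1.0, 0.0, 1.0, 2.0]

print(percentile_empirical(0.5, normative))
print(percentile_normal(0.0))
```

The two approaches agree only to the extent that the normative thetas are in fact normally distributed.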
Fig 2.
Responses of ChatGPT (3.5) as a function of item difficulty.
Correct answers are green; mistakes are red. Highlighting marks items to which the LLM gave both correct and incorrect answers on different occasions.
Fig 3.
Responses of Bing (based on 4.0) as a function of item difficulty.
Correct answers are green; mistakes are red. Highlighting marks items to which the LLM gave both correct and incorrect answers on different occasions.