The performance of ChatGPT and Bing on a computerized adaptive test of verbal intelligence

doi:10.1371/journal.pone.0307097

Table 1.

NoVo normative data as a function of native vs. non-native status and educational stratification.

The mean of each subsample is compared to the mean of the total sample. The values are z scores: 0 represents the mean of the total sample, and all values are expressed in standard deviation (SD) units calculated on the total sample.
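
As a rough illustration of how such standardized subsample means can be computed, the sketch below subtracts the total-sample mean from a subsample mean and divides by the total-sample SD. The function name `subsample_z` and the scores are hypothetical and are not NoVo data. Using the population SD (`pstdev`) is an assumption, since the caption does not say which SD estimator was used.

```python
from statistics import mean, pstdev


def subsample_z(subsample, total):
    """Return the subsample mean as a z score relative to the total sample.

    Assumption: the SD is the population SD of the total sample.
    """
    return (mean(subsample) - mean(total)) / pstdev(total)


# Hypothetical raw scores for illustration only (not taken from the study).
total = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
subsample = [7.0, 9.0]

print(subsample_z(subsample, total))
```

A value of 0 would mean that the subsample mean equals the total-sample mean; positive values indicate a subsample that scores above the total sample on average.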

Fig 1.

The performance of the LLMs compared to human performance as a function of educational stratification.

Table 2.

The test results of ChatGPT and Bing on the individual occasions.

The results are expressed 1) as thetas, as is customary in item response theory, and 2) as percentile scores. Thetas are z scores, i.e., they are expressed in standard deviation units, while percentile scores indicate the percentage of the normative sample that is outperformed.
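
To make the relationship between the two scales concrete, the sketch below converts a theta to a percentile and back. It assumes that thetas follow a standard normal distribution in the normative sample. That is an assumption made here for illustration: the percentiles reported in the table may instead have been derived empirically from the normative data. The function names are hypothetical.

```python
from statistics import NormalDist

# Assumption: thetas are standard normal (mean 0, SD 1) in the normative sample.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def theta_to_percentile(theta):
    """Percentage of the normative sample scoring below the given theta."""
    return 100.0 * STANDARD_NORMAL.cdf(theta)


def percentile_to_theta(percentile):
    """Inverse mapping; percentile must lie strictly between 0 and 100."""
    return STANDARD_NORMAL.inv_cdf(percentile / 100.0)


for theta in (-1.0, 0.0, 1.0, 2.0):
    print(theta, round(theta_to_percentile(theta), 1))
```

Under this assumption, a theta of 0 corresponds to the 50th percentile, and a theta of 1 to roughly the 84th percentile.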

Fig 2.

Responses of ChatGPT (3.5) as a function of item difficulty.

Correct answers are green, mistakes are red. Highlighted are cases where the LLM gave both correct and incorrect answers to the same item on different occasions.

Fig 3.

Responses of Bing (based on 4.0) as a function of item difficulty.

Correct answers are green, mistakes are red. Highlighted are cases where the LLM gave both correct and incorrect answers to the same item on different occasions.
