GPT-4 as an X data annotator: Unraveling its performance on a stance classification task

doi:10.1371/journal.pone.0307741

Fig 1.

Overall methodology of the study.

Fig 2.

Zero-shot prompt for generating labels.

Fig 3.

Few-shot prompt for generating labels.

Fig 4.

Zero-shot Chain-of-Thought (CoT) prompt for generating labels.

Table 1.

Hyperparameter settings used for the traditional machine learning models during tuning.

Fig 5.

The distribution of class labels in the four different label sets.

(a) Human labels, (b) Zero-shot labels, (c) Few-shot labels, and (d) Zero-shot CoT labels.

Fig 6.

The percentage of changes in each of the three new label sets (Zero-shot, Few-shot, and Zero-shot CoT) compared to human labels.

Table 2.

Testing results of models fine-tuned on four training sets with different labels.

Table 3.

Results of the Wilcoxon signed-rank tests comparing the evaluation metrics for each pair of label sets generated by different approaches.

W is the test statistic and p-val is the p-value.

Fig 7.

The percentage increase in performance relative to human-labeled data, observed across the classifiers that performed best with human labels.

(a) Zero-shot, (b) Few-shot, (c) Zero-shot CoT.

Table 4.

Top classifiers trained on the different GPT-based label sets, ranked by F1-score.

Fig 8.

Performance analysis of the classifiers trained on GPT-4-labeled datasets that outperformed those trained on ground-truth labels.

Fig 9.

Two examples explaining the advantage of Zero-shot CoT over the basic Zero-shot prompting mechanism.
