GATmath and GATLc: Comprehensive benchmarks for evaluating Arabic large language models | PLOS One

Fig 1.

Examples of the four tasks of the GATmath benchmark.
The correct choice of each question is indicated in bold. The prompt used is in italics.

Fig 2.

Examples of the five tasks from the GATLc benchmark.
The correct choice of each question is indicated in bold. The prompt used is in italics.

Fig 3.

Transcription guide for GAT mathematical questions.

Fig 4.

Two objects of the created JSON file.

Table 1.

GATmath distribution across four tasks.

Table 2.

GATLc distribution across five tasks.

Fig 5.

An example of the five-shot evaluation settings used.

Table 3.

Performance of Arabic LLMs on various tasks of the GATmath dataset.

Table 4.

Performance of Arabic LLMs on various tasks of the GATLc dataset.

Fig 6.

Accuracy of the models on the nine tasks of the GATLc and GATmath datasets.

Table 5.

Performance of models on other Arabic benchmarks. CS: Commonsense.

Table 6.

Performance of models on mathematical reasoning benchmarks.

Table 7.

Performance of models on reasoning and language comprehension benchmarks.

Fig 7.

Performance of LLMs on GATmath and GATLc compared with English benchmarks.
