Fig 1.
Examples of the four tasks of the GATmath benchmark.
The correct answer choice for each question is shown in bold; the prompt is shown in italics.
Fig 2.
Examples of the five tasks of the GATLc benchmark.
The correct answer choice for each question is shown in bold; the prompt is shown in italics.
Fig 3.
Transcription guide for GAT mathematical questions.
Fig 4.
Two example objects from the generated JSON file.
Table 1.
Distribution of GATmath questions across the four tasks.
Table 2.
Distribution of GATLc questions across the five tasks.
Fig 5.
An example of the five-shot evaluation setting.
Table 3.
Performance of Arabic LLMs on the tasks of the GATmath dataset.
Table 4.
Performance of Arabic LLMs on the tasks of the GATLc dataset.
Fig 6.
Accuracy of the models on the nine tasks of the GATLc and GATmath datasets.
Table 5.
Performance of the models on other Arabic benchmarks. CS: Commonsense.
Table 6.
Performance of the models on mathematical reasoning benchmarks.
Table 7.
Performance of the models on reasoning and language comprehension benchmarks.
Fig 7.
Performance of LLMs on GATmath and GATLc compared with their performance on English benchmarks.