Visual enumeration remains challenging for multimodal generative AI

doi:10.1371/journal.pone.0331566

Fig 1.

Samples from the numerosity naming task.

Each row contains samples from a different object category, while columns correspond to different numerosities: 1, 2, 3, 4 and 8.

Fig 2.

Graphical representation of our evaluation pipeline.

Numerosity naming (image-to-text) is represented in the upper stream, while numerosity generation (text-to-image) is represented in the lower stream.

Table 1.

Leaderboard ranked by Normalized Absolute Error (NAE).

The Corr w/ Human column reports the correlation with the confusion matrix produced by an ideal human observer. The last column reports the estimated number of model parameters (in billions).
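
The caption does not spell out how NAE is computed. The following is a minimal sketch, assuming the common definition in which each absolute error is divided by the target numerosity and the result is averaged over trials; the function name `normalized_absolute_error` is hypothetical and not taken from the paper.

```python
def normalized_absolute_error(targets, responses):
    """Mean of |response - target| / target over all trials.

    Assumed definition: the source names the metric but does not define it.
    Targets must be positive integers (numerosities start at 1).
    """
    if len(targets) != len(responses):
        raise ValueError("targets and responses must have the same length")
    if not targets:
        raise ValueError("at least one trial is required")
    total = sum(abs(r - t) / t for t, r in zip(targets, responses))
    return total / len(targets)


# Illustrative values only, not data from the paper.
example_nae = normalized_absolute_error([1, 2, 4, 8], [1, 3, 4, 6])
```

Under this definition, an error of one item weighs more at small numerosities than at large ones, so a lower NAE indicates responses that stay proportionally closer to the target.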

Fig 3.

Confusion matrices for the numerosity naming task.

Each panel shows the distribution of models’ responses across different object categories: apples, people, butterflies, dots and fast cards. The x-axis represents the target number, while the y-axis represents the corresponding model responses. Response frequency is encoded using a perceptually uniform colormap (blue = 0%, yellow = 100%). Qwen2.5-VL stands for Qwen2.5-VL 72B.
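
A confusion matrix of this kind can be built by counting how often each response was given for each target and normalizing per target. The sketch below is one way to do that, assuming responses are already parsed into integers; the function name and the choice to express cells as percentages of trials per target are illustrative assumptions, not the authors' code.

```python
from collections import Counter


def confusion_matrix(targets, responses, labels):
    """Return matrix[response_index][target_index] as a percentage.

    Rows are model responses (y-axis) and columns are targets (x-axis),
    matching the layout described in the caption. Each cell is the share
    of trials with that target that received that response. Responses
    outside `labels` are not shown, so a column may sum to less than 100.
    """
    index = {label: i for i, label in enumerate(labels)}
    pair_counts = Counter(zip(responses, targets))
    trials_per_target = Counter(targets)
    matrix = [[0.0] * len(labels) for _ in labels]
    for (response, target), count in pair_counts.items():
        if response in index and target in index:
            share = 100.0 * count / trials_per_target[target]
            matrix[index[response]][index[target]] = share
    return matrix


# Illustrative values only, not data from the paper.
demo = confusion_matrix([1, 1, 2, 2], [1, 2, 2, 2], labels=[1, 2])
```

A model that always answers correctly would place 100% on the diagonal and 0% everywhere else.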

Table 2.

Analysis of image-to-text models’ estimation biases across object categories.

“-” indicates no systematic bias, “Over.” means the model’s responses are systematically higher than the ground truth, and “Under.” means they are systematically lower. Numbers in brackets represent the effect sizes (Cohen’s r).
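
The caption reports effect sizes as Cohen's r but does not name the underlying test. A minimal sketch follows, assuming a Wilcoxon signed-rank test on response-minus-target differences with the normal approximation and r = Z / sqrt(N), where N counts the non-zero differences. Both the test and the choice of N are assumptions, the function name is hypothetical, and no tie correction is applied to the variance.

```python
import math


def signed_rank_effect_size(targets, responses):
    """Signed effect size r = Z / sqrt(N) from a Wilcoxon signed-rank test.

    Positive values suggest overestimation, negative values suggest
    underestimation. Assumed procedure, not confirmed by the source.
    """
    diffs = [r - t for t, r in zip(targets, responses) if r != t]
    n = len(diffs)
    if n == 0:
        return 0.0  # every response matched its target

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1

    # Sum of ranks for positive differences, then normal approximation.
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    return z / math.sqrt(n)


# Illustrative values only, not data from the paper.
over = signed_rank_effect_size([1, 2, 3, 4], [2, 4, 6, 8])
under = signed_rank_effect_size([2, 4, 6, 8], [1, 2, 3, 4])
```

In this sketch the sign of r carries the direction of the bias, whereas the table reports the direction separately as “Over.” or “Under.” next to the bracketed magnitude.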

Fig 4.

Examples of images generated by text-to-image models in the numerosity production task, showcasing both correct and incorrect generations (the target number is indicated at the bottom).

We report two images for each target category: apples, people, butterflies, and dots. For the dots category, in a few cases FLUX and DALL-E 2 generated images containing the wrong number of dots, which were nevertheless arranged in the shape of the target digit (e.g., 8 in the figure). Surprisingly, SD3.5 was unable to generate a single correct response when the target numerosity was larger than 8.

Fig 5.

Confusion matrices for the numerosity production task.

The x-axis represents the target number, while the y-axis represents the corresponding model responses. Response frequency is encoded using a perceptually uniform colormap (blue = 0%, yellow = 100%).

Table 3.

Analysis of text-to-image models’ estimation biases across object categories.

“-” indicates no systematic bias, “Over.” means the model’s outputs are systematically higher than the ground truth, and “Under.” means they are systematically lower. Numbers in brackets represent the effect sizes (Cohen’s r).

Fig 6.

Power-law distribution of textual numerosities related to countable objects in the CC3M dataset (upper panel) and LAION-400M dataset (lower panel).

Decade numbers are highlighted in red. The y-axis represents the relative frequency of appearance. A broken y-axis is used to accommodate the large difference in peak frequencies between the two datasets. The layout is proportioned such that the lower-frequency regions of both datasets share the same vertical height, making them visually comparable despite the scale difference.
