Vision Transformer attention alignment with human visual perception in aesthetic object evaluation | PLOS One

Fig 1.

General diagram of the experimental attention analysis process.
The process is composed of three stages: 1) data preparation and experimental setup, 2) image transformation, and 3) distribution analysis. Data preparation and setup consist of the image evaluation process by each experiment participant. This stage requires the use of an eye-tracker to determine participants’ gaze positions and define the experimental conditions. Image transformation consists of generating an attention map using the ViT attention module and of having users experimentally evaluate it on a set of objects. The last stage compares both information sources and thus determines whether there is any correlation. Note: The craft figures shown are similar but not identical to the original images and are included for illustrative purposes only.

Fig 2.

Objects used in the experiment consisted of ten basketry objects and ten ginger jars.
The objects were randomly selected, with the fewest possible objects in the background. Note: The craft figures shown are similar but not identical to the original images and are included for illustrative purposes only.

Fig 3.

Experimental procedure for object visualization.
Before the experimental phase starts, a calibration procedure is performed by recording a sequence of five points on the screen. Once this process is completed, the experimental phase begins with the projection of an image with a white background and a red dot, displayed for 5 seconds. Then one of the 20 objects is displayed for 10 seconds. This procedure repeats until all objects have been displayed. Note: The craft figures shown are similar but not identical to the original images and are included for illustrative purposes only.

Fig 4.

Participant setup in front of the screen during the experimental phase.
All participants remain seated while the experiment is conducted. At the beginning of each experiment, a calibration process is performed with the eye-tracker and the experiment is explained to the participant. The chosen distance between the user and screen remains relatively fixed at 150 cm, as it reduces visual fatigue. Note: The craft figures shown are similar but not identical to the original images and are included for illustrative purposes only.

Fig 5.

Heatmap generation according to positions recorded by each observer.
The heatmap of each object is constructed as the average of individual visualizations transformed to a two-dimensional Gaussian distribution. Note: The craft figures shown are similar but not identical to the original images and are included for illustrative purposes only.
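
The construction described in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the grid size, the fixation coordinates, the `sigma` value, and the choice to normalise each observer's map to sum to 1 before averaging are all assumptions made for the example.

```python
import math

def gaussian_map(fixations, width, height, sigma):
    """Place one 2D Gaussian on each fixation (x, y) and normalise to sum 1.

    Assumes at least one fixation; sigma is the spread in pixels (hypothetical).
    """
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(map(sum, grid))
    return [[v / total for v in row] for row in grid]

def average_maps(maps):
    """Pixel-wise mean of several equally sized maps."""
    n = len(maps)
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / n for x in range(width)]
            for y in range(height)]

# Invented fixations for two observers on a 16 x 12 pixel grid.
observers = [[(3, 3), (4, 3)], [(10, 8)]]
heatmap = average_maps([gaussian_map(f, 16, 12, sigma=2.0) for f in observers])
```

A larger `sigma` spreads each fixation over a wider area, so the averaged map covers more of the image.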

Table 1.

Classification of each human fixation.

Table 2.

Sociodemographic information of participants.

Fig 6.

Heatmap analysis by each observer.
(a) Heatmap of each user for object #1 (basketry). (b) User's gaze as the parameter increases: the larger the parameter, the greater the coverage area of the average visualization. (c) Heatmap of each user for object #13 (ginger jar).

Fig 7.

Density of users’ average gaze for each object for .
Basketry: zoom on object #4, detail of a region of an object without a buckle; zoom on object #6, detail of the buckle that users observed for longer. Ginger jar: zoom on object #14, the vase symbol that received the most observation.

Fig 8.

12 heatmaps generated by the ViT attention module, both for a basketry-type object and for a jar.
Each of the 12 heatmaps represents part of the attention visualization within the algorithm. Note: The craft figures shown are similar but not identical to the original images and are included for illustrative purposes only.

Fig 9.

Average ViT for objects.
Average heatmap of the 12 heads of the ViT attention module for each object in the experiment.

Fig 10.

The distance between the participants and each ViT head is computed separately for each metric (KL, CC, SSIM, SIM).
In this way, we estimate the distance between each of the 30 participants and each of the 12 heatmaps from the ViT module. This procedure is repeated for the 20 objects in the experiment, and since the distance is computed for a given parameter value, we evaluate multiple values, obtaining one distance for each combination of participant, head, object, and parameter value.
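
To make the comparison concrete, here is a hedged sketch of three of the four metrics named in this caption (KL, CC, SIM), using their common definitions from the saliency-evaluation literature. The caption does not give the exact formulas used, so these definitions, the function names, the `eps` constant, and the toy maps are assumptions. SSIM is omitted because it needs windowed local statistics. Maps are treated as flattened lists of non-negative values.

```python
import math

def normalise(m):
    """Scale a flattened non-negative map so that it sums to 1."""
    total = sum(m)
    return [v / total for v in m]

def kl_divergence(human, model, eps=1e-12):
    """KL(human || model): about 0 for identical maps, lower means closer."""
    p, q = normalise(human), normalise(model)
    return sum(pi * math.log(eps + pi / (qi + eps)) for pi, qi in zip(p, q))

def correlation(human, model):
    """Pearson correlation coefficient (CC): 1 means perfectly linearly related."""
    n = len(human)
    mh, mm = sum(human) / n, sum(model) / n
    cov = sum((h - mh) * (m - mm) for h, m in zip(human, model))
    var_h = sum((h - mh) ** 2 for h in human)
    var_m = sum((m - mm) ** 2 for m in model)
    return cov / math.sqrt(var_h * var_m)

def similarity(human, model):
    """Histogram intersection (SIM): 1 for identical maps, 0 for disjoint ones."""
    p, q = normalise(human), normalise(model)
    return sum(min(pi, qi) for pi, qi in zip(p, q))

# Toy flattened maps: head_a matches the human map, head_b is its mirror image.
human = [0.0, 1.0, 3.0, 1.0]
head_a = [0.0, 1.0, 3.0, 1.0]
head_b = [3.0, 1.0, 0.0, 1.0]
scores = {name: (kl_divergence(human, h), correlation(human, h), similarity(human, h))
          for name, h in (("head_a", head_a), ("head_b", head_b))}
```

KL is a dissimilarity (lower is closer), whereas CC and SIM are similarities (higher is closer), so the direction of "closest to human attention" depends on the metric being read.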

Fig 11.

Each point in the box plot represents the distance between one of the 20 objects and each of the attention heads.
In this example, the parameter is fixed at 2.6. Note that each mean corresponds to the average distance with respect to each head.

Fig 12.

Variation of distance as the parameter value increases for each head.
Each distance is computed as the average distance between the average visualization across all objects and each ViT head. Across all metrics, head 12 attains the value closest to human attention. However, as the parameter value increases, the standard error (shown in light blue) also increases.

Fig 13.

Variation of distance as the parameter value increases for each head.
Each distance is computed as the average distance between the average visualization across all objects and each ViT head. The blue region corresponds to the 95% confidence interval.

Fig 14.

Comparison between each average visualization with respect to head #12 of the ViT attention module.
In the case of average visualization, a value of has been considered.

Fig 15.

Tukey honestly significant difference (HSD) for different parameter values.
Tukey honestly significant difference (HSD) test measuring the difference in means between attention module heads for three values of the parameter.

Fig 16.

Behavior of the p-value for head #12 according to the HSD test.
The analysis was performed only for the KL and SSIM metrics, as they exhibit a statistically significant difference.

Fig 17.

Heatmap of lift by head and image.
Each cell reports . Color highlights the magnitude of the effect. The threshold was defined per image and per head, as described in the evaluation section.

Table 3.

AOI analysis results for the basketry set.

Table 4.

AOI analysis results for the jar set.
