Objects guide human gaze behavior in dynamic real-world scenes

doi:10.1371/journal.pcbi.1011512

Fig 2

Object-based scanpath model design.

Schematic of an object-based model within ScanDy, in analogy to Fig 1 and for the same video frame. The object masks (bottom left, 2 persons and background are shown) are based on a semantic segmentation using the Mask R-CNN deep neural network (implementation by [61]). (a) Any appropriate feature map could be used to encode the scene content in module (I). For the model comparison, we used the same low-level saliency [33] as in Fig 1 for the object-based model O.ll (I, left); a second implementation, model O.cb, did not include any scene features and used only the generic center bias [59] (I, right). In addition to the space-based Gaussian of size σ_S, we account for a higher sensitivity across the currently foveated object [4, 62] by setting the sensitivity in module (II) across the foveated object mask to one. Instead of inhibiting locations, module (III) inhibits previously foveated object masks (cf. [63]). While foveated, an object is inhibited by the value ξ (cf. left object). As soon as the gaze position moves outside of the object mask, the inhibition is set to one (cf. right person) and then decreases linearly over time to zero with slope r (see inset). The scanpath history is identical to Fig 1 (III), with the dashed circle marking the previous saccade target location. At the time of the previous saccade, the right person was at the location of the green object contour, but in the meantime moved to the right and was followed with smooth pursuit before a saccade was initiated to the now foveated object (green cross). The output maps of modules (I-III) are again combined (see panel (c) for the pixel-wise multiplication result of (I-III), with different color maps for each object), but the visual information is now summed across each object mask and normalized by the logarithm of the object size (see Eq (4)). The resulting value for each object is the drift rate for the drift-diffusion process (see Eq (5)).
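The computations described for modules (I-III) can be illustrated with a minimal NumPy sketch. It is not the ScanDy implementation: all function and argument names are hypothetical, the inhibition is assumed to enter the pixel-wise product as (1 − inhibition), and the natural logarithm is assumed for the size normalization, since Eq (4) is not reproduced in this caption.

```python
import numpy as np

def sensitivity_map(shape, gaze_xy, sigma_s, foveated_mask):
    """Module (II) sketch: Gaussian of size sigma_s around the gaze position,
    with the sensitivity set to one across the foveated object mask."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (xs - gaze_xy[0]) ** 2 + (ys - gaze_xy[1]) ** 2
    sens = np.exp(-d2 / (2.0 * sigma_s ** 2))
    sens[foveated_mask] = 1.0
    return sens

def update_object_inhibition(inhibition, foveated_id, prev_foveated_id, xi, r, dt=1.0):
    """Module (III) sketch: per-object inhibition values in [0, 1].

    inhibition maps object id -> current inhibition value.
    """
    updated = {}
    for obj_id, value in inhibition.items():
        if obj_id == foveated_id:
            updated[obj_id] = xi      # inhibited by xi while foveated
        elif obj_id == prev_foveated_id:
            updated[obj_id] = 1.0     # gaze just left this object
        else:
            # linear decay towards zero with slope r
            updated[obj_id] = max(0.0, value - r * dt)
    return updated

def object_drift_rates(features, sensitivity, inhibition, masks):
    """Sum the combined maps over each object mask and normalize by the
    logarithm of the object size (assumed form, natural log assumed)."""
    rates = {}
    for obj_id, mask in masks.items():
        # assumption: inhibition of one suppresses the object completely
        combined = features * sensitivity * (1.0 - inhibition[obj_id])
        area = int(mask.sum())
        if area < 2:
            rates[obj_id] = 0.0       # log(1) = 0, so skip degenerate masks
        else:
            rates[obj_id] = float(combined[mask].sum() / np.log(area))
    return rates
```

Keeping the inhibition as one scalar per object, rather than as a map over locations, is what lets it follow an object that moves across the frame.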
(b) In object-based target selection, the evidence for the saccadic decision-making is accumulated for each object in the scene, quantified by a DDM with threshold θ and noise level s in module (IV). (c) The gaze position update (V) follows the movement of the foveated object mask (from dashed circle to cross). If the decision threshold is crossed, a saccade to the target object is executed. The exact landing position within the target object is probabilistic and proportional to the activity of the combined maps of modules (I-III) (see Eq (7); combined maps shown in a different color for each object mask). The object-based models have the same number of free parameters as the space-based ones; they are listed in Table 2.
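The decision process in module (IV) and the choice of landing position in module (V) can be sketched in the same spirit. The discretization below (an Euler-Maruyama step with time step dt), the tie-breaking rule, the uniform fallback, and all function names are assumptions made for illustration. Eqs (5) and (7) are not reproduced in this caption, so this shows the general technique rather than the exact model equations.

```python
import numpy as np

def ddm_step(evidence, drift_rates, s, dt, rng):
    """One accumulation step for all objects: drift plus Gaussian noise
    scaled by the noise level s (assumed Euler-Maruyama discretization)."""
    noise = rng.normal(0.0, 1.0, size=evidence.shape)
    return evidence + drift_rates * dt + s * np.sqrt(dt) * noise

def select_target(evidence, theta):
    """Return the index of the object whose evidence crossed the threshold
    theta, or None. If several crossed, take the largest (assumption)."""
    crossed = np.flatnonzero(evidence >= theta)
    if crossed.size == 0:
        return None
    return int(crossed[np.argmax(evidence[crossed])])

def sample_landing_position(combined_map, target_mask, rng):
    """Draw a landing position (x, y) inside the target mask with probability
    proportional to the combined activity of modules (I-III)."""
    weights = np.where(target_mask, combined_map, 0.0).ravel()
    total = weights.sum()
    if total <= 0:
        # assumption: fall back to a uniform draw within the mask
        weights = target_mask.ravel().astype(float)
        total = weights.sum()
    idx = rng.choice(weights.size, p=weights / total)
    y, x = np.unravel_index(idx, combined_map.shape)
    return int(x), int(y)

# Usage sketch with hypothetical values for three objects
rng = np.random.default_rng(0)
evidence = np.zeros(3)
drift = np.array([1.0, 0.5, 0.0])
target = None
while target is None:
    evidence = ddm_step(evidence, drift, s=0.1, dt=0.01, rng=rng)
    target = select_target(evidence, theta=1.0)
```

Once a target is selected, a saccade would be executed to a position drawn with `sample_landing_position`; how the accumulated evidence is treated after the saccade is left open in this sketch.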

doi: https://doi.org/10.1371/journal.pcbi.1011512.g002