Objects guide human gaze behavior in dynamic real-world scenes

doi:10.1371/journal.pcbi.1011512

Fig 1.

Space-based scanpath model design.

Schematics of a space-based model within the modular ScanDy framework, illustrated for an example frame (field03 of the VidCom dataset, see bottom left). (a) Modules (I-III) are computed simultaneously for each time step. The scene features (I) are quantified by a saliency map. For space-based models, we compared the impact of low-level saliency [33] (left) in model S.ll with high-level saliency as predicted by a deep neural network [38] (right, maps separated by the green dotted line) in model S.hl. The frame-wise saliency maps are multiplied with a generic center bias [59]. Color represents relative saliency, normalized by the highest saliency in the image. The visual sensitivity (II) is implemented as a Gaussian of width σ_S centered at the current gaze point (green cross). The space-based inhibition of return mechanism (III) inhibits the previous target locations (shown as dotted circles, latest target (h = 1 in Eq (1)) as a dashed circle). A Gaussian of width σ_I is centered around each previous saccade target location. The amplitude of each Gaussian linearly decreases relative to the time of the respective saccadic decision with slope r (see inset). The inhibition map is the sum of weighted Gaussians around previous target locations, clipped to a maximum of 1. The output maps of these modules (I-III) are multiplicatively combined for the decision-making process (see Eq (5)). (b) The evidence for each potential saccade target (i.e., each pixel location) is accumulated in the decision variable (cf. Eq (5)) in module (IV). Each potential target is represented in a drift-diffusion model (DDM) where the drift rate is computed by combining values from maps of modules (I, left, model S.ll), (II) & (III), and the noise level s. (c) Module (V) updates the gaze position based on the resulting decision variables. If one pixel location in (IV) reaches the DDM decision threshold θ, a saccade to this position is executed. 
Otherwise, the gaze position is updated (from dashed circle to cross) based on the optical flow calculated using PWC-Net [60] (plotted with flow field color coding in the top right, where hue indicates direction and saturation indicates speed). The parameters of the space-based model are listed in Table 1.
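The space-based inhibition of return in module (III) can be sketched as follows. This is a minimal illustration rather than the ScanDy implementation. The function name `inhibition_at` is hypothetical. The sketch assumes that each Gaussian has an unnormalized peak of 1, that its amplitude starts at 1 at the time of the saccadic decision, and that the linear decay with slope r is floored at 0.

```python
import math

def inhibition_at(x, y, t, targets, sigma_i, r):
    """Inhibition value at pixel (x, y) at time t.

    targets: list of (x_h, y_h, t_h) tuples, one per previous saccade
             target, where t_h is the time of the saccadic decision.
    sigma_i: width of each inhibition Gaussian (same units as x, y).
    r:       slope of the linear amplitude decay per unit time.
    """
    total = 0.0
    for x_h, y_h, t_h in targets:
        # Assumed: amplitude is 1 at decision time and decays linearly to 0.
        amplitude = max(0.0, 1.0 - r * (t - t_h))
        dist_sq = (x - x_h) ** 2 + (y - y_h) ** 2
        total += amplitude * math.exp(-dist_sq / (2.0 * sigma_i ** 2))
    # Sum of weighted Gaussians, clipped to a maximum of 1.
    return min(total, 1.0)
```

Evaluating this function at every pixel location gives the inhibition map. Two recent targets close to each other saturate at 1 instead of adding up beyond it.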


Fig 2.

Object-based scanpath model design.

Schematic of an object-based model within ScanDy, in analogy to Fig 1 and for the same video frame. The object masks (bottom left, two persons and background are shown) are based on a semantic segmentation using the Mask R-CNN deep neural network (implementation by [61]). (a) Any appropriate feature map could be used to encode the scene content in module (I). For the model comparison, we used the same low-level saliency [33] as in Fig 1 for the object-based model O.ll (I, left) but in a second implementation, model O.cb, did not include any kind of scene features and only used the generic center bias [59] (I, right). In addition to the space-based Gaussian of width σ_S, we account for a higher sensitivity across the currently foveated object [4, 62] by setting the sensitivity in module (II) across the foveated object mask to one. Instead of inhibiting locations, module (III) inhibits previously foveated object masks (cf. [63]). While foveated, an object is inhibited by value ξ (cf. left object). As soon as the gaze position moves outside of the object mask, the inhibition is set to one (cf. right person), which then, over time, decreases again linearly to zero with slope r (see inset). The scanpath history is identical to Fig 1 (III), with the dashed circle marking the previous saccade target location. At the time of the previous saccade, the right person was at the location of the green object contour, but in the meantime moved to the right and was followed with smooth pursuit before a saccade was initiated to the now foveated object (green cross). The output maps of modules (I-III) are again combined (see panel (c) for pixel-wise multiplication result of (I-III), with different color maps for each object), but the visual information is now summed across each object mask and normalized by the logarithm of the object size (see Eq (4)). The resulting value for each object is the drift rate for the drift-diffusion process (see Eq (5)). 
(b) In object-based target selection, the evidence for the saccadic decision-making is accumulated for each object in the scene, quantified by a DDM with threshold θ and noise level s in module (IV). (c) The gaze position update (V) follows the movement of the foveated object mask (from dashed circle to cross). If the decision threshold is crossed, a saccade to the target object is executed. The exact landing position within the target object is probabilistic and proportional to the activity of the combined maps of modules (I-III) (see Eq (7), combined maps shown in a different color for each object mask). The object-based models have the same number of free parameters as the space-based ones. They are listed in Table 2.
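The object-based evidence accumulation described in this caption can be sketched as follows. This is a hedged illustration, not the ScanDy code and not a transcription of Eqs (4) and (5). The names `object_drift_rate` and `first_crossing` are hypothetical. The per-step update (drift plus Gaussian noise scaled by s and the square root of the time step) is an assumed Euler-Maruyama discretization, and the object size is assumed to be a pixel count greater than one.

```python
import math
import random

def object_drift_rate(combined_values, object_size):
    """Sum the combined outputs of modules (I-III) over one object mask
    and normalize by the logarithm of the object size (in pixels)."""
    # Assumption: object_size > 1, so that log(object_size) > 0.
    return sum(combined_values) / math.log(object_size)

def first_crossing(drift_rates, theta, s, dt=1.0, max_steps=10_000, rng=None):
    """Accumulate evidence per object until one reaches threshold theta.

    Returns (object_index, step), or (None, step_count) if no object
    crosses the threshold within max_steps.
    """
    if not drift_rates:
        return None, 0
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    evidence = [0.0] * len(drift_rates)
    for step in range(1, max_steps + 1):
        for i, mu in enumerate(drift_rates):
            # Assumed update rule: deterministic drift plus Gaussian noise.
            evidence[i] += mu * dt + s * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        winner = max(range(len(evidence)), key=evidence.__getitem__)
        if evidence[winner] >= theta:
            return winner, step
    return None, max_steps
```

With the noise level set to zero, the object with the largest drift rate always wins. With s > 0, repeated runs with different seeds produce variable decision times and occasionally different targets. The logarithmic normalization means that large objects gain less than they would from a plain sum over their pixels.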


Table 1.

Free parameters of the space-based and mixed models and their values which were determined by evolutionary optimization.

Columns “Fit (last gen.)” show the mean and standard deviation across all 32 parameter sets from the last generation of the evolutionary optimization (for a plot of the full parameter space, see S3–S7 Figs). Columns “Fit (best)” show the parameter values that led to the best fitness among all individuals; these are the values used for the model comparison shown in Fig 4. The parameter for the size of the oculomotor drift is set to σ_D = 0.125 dva for all models.


Table 2.

Parameters of the object-based models and their values which were determined by evolutionary optimization.

For details, see caption of Table 1.


Fig 3.

Overview of the ScanDy architecture.

Software architecture of the ScanDy framework with example use cases. (a) Every box is a class with the name in bold and the most important methods in italic. The Dataset class provides the human scanpath data and makes the precomputed maps of the video data available for the models. At the core of the framework is the Model base class, from which all models inherit their functionality. The derived model classes have methods that correspond to the modules described in Figs 1 and 2. Object-based models can use the ObjectFile class to integrate object information efficiently. The classes on the gray background are native to neurolib and are used within ScanDy for parameter exploration and optimization. (b) Possible use cases of our framework include, in order of increasing complexity, generating scanpaths for a single video (red), comparing how well different models capture human gaze behavior (blue), and extending existing models to test new hypotheses (green, common starting points in light gray). We provide Jupyter Notebooks of these and more examples on GitHub.


Fig 4.

Scanpath summary statistics of the human eye tracking data compared to the simulated model scanpaths.

For each video in the dataset (10 in training, 13 in test data), we simulate 12 scanpaths (same parameters but with different random seeds) to roughly match the number of ground truth scanpaths. (a) Ground truth foveation duration distribution from all human participants across all videos. The dotted curve is a fitted log-normal distribution with μ = 5.735 and σ = 0.838 (equiv. to an expected value of exp(μ + σ²/2) ≈ 440 ms). (b) Ground truth saccade amplitude distribution. The dotted curve is a fitted Gamma distribution with shape k = 1.43 and scale θ = 6.50 (equiv. to an expected value of kθ = 8.98 dva). (c) Cumulative distribution functions (CDFs) of foveation durations are compared between human data (green) and the results of the space-based models with low-level features S.ll (blue), and high-level features S.hl (cyan), the object-based model with low-level features O.ll (red), and center bias only O.cb (pink), and the mixed model (object-based model with space-based IOR) with low-level features M.ll. The human data and modeling results from videos in the test set are represented by opaque lines, while transparent lines represent the corresponding training data and results. (d) Same as (c) but for saccade amplitudes. Model parameters were optimized using the evolutionary optimization algorithm. Parameters are listed in Table 1 for the space-based and mixed models and in Table 2 for the object-based models.
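The expected value of a log-normal distribution follows from its parameters as exp(μ + σ²/2). As a quick check with the fitted values given for panel (a), a minimal sketch (the function name `lognormal_mean` is chosen here for illustration):

```python
import math

def lognormal_mean(mu, sigma):
    """Expected value of a log-normal distribution with log-scale
    parameters mu and sigma."""
    return math.exp(mu + sigma ** 2 / 2.0)

# Fitted parameters of the foveation duration distribution (durations in ms).
mean_duration_ms = lognormal_mean(5.735, 0.838)
print(round(mean_duration_ms))  # 440
```

The mean lies well above the median exp(μ), which reflects the long right tail of the foveation duration distribution.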


Fig 5.

Functional decomposition of scanpaths in foveation categories.

We distinguish the four functional categories “Background” (maroon), “Detection” (orange), “Inspection” (yellow), and “Return” (khaki). Data plotted in opaque (transparent) colors represent the test (training) set. (a) Percentage of foveation events in each category across all human scanpaths and the model predictions as a function of time. Plotted is the proportion of scanpaths that are in each respective category. (Time spent during saccades or, in the case of the human data, periods in which the eye tracking data is corrupted by noise is not considered, and proportions are normalized to 100%.) (b) Proportion of time spent in each category averaged across all scanpaths from the human observers (data in panel (a) averaged over time) and for each of the five models averaged across all parameter configurations of the last generation of the optimization process.


Fig 6.

Direct functional model comparison.

Violin plots of the ratio of the fraction of time spent in each foveation category predicted by the models and the fraction of time spent in each foveation category by the human observers. We show side-by-side the results for the videos of the test (left, opaque) and the training data (right, transparent) on a logarithmic scale. A ratio of one (dotted line) corresponds to a perfect prediction of how humans balance their exploration behavior. The four categories (Background, Detection, Inspection, Return) are shown on the x-axis. Individual data points for each model correspond to the 32 best parameter configurations (i.e., the last generation) of the evolutionary optimization and show the average value across twelve scanpaths per video in the respective dataset. Panels show the space-based models with low-level features S.ll (a), and with high-level features S.hl (b), the object-based models with low-level features O.ll (c), and with the center bias only O.cb (d), and the mixed model (object-based model with space-based IOR) with low-level features M.ll (e).
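The quantity on the y-axis can be sketched as below: the model-to-human ratio of the time fractions per category, shown on a logarithmic scale. The function name and the numbers in the usage example are purely illustrative and are not values from the study.

```python
import math

def log_ratio_per_category(model_fraction, human_fraction):
    """log10 of (model fraction / human fraction) for each category.

    A value of 0 corresponds to a ratio of one, i.e. a perfect
    prediction; positive values mean the model spends too much time
    in that category, negative values too little.
    """
    return {
        category: math.log10(model_fraction[category] / human_fraction[category])
        for category in human_fraction
    }

# Hypothetical fractions of foveation time, for illustration only.
human = {"Background": 0.2, "Detection": 0.2, "Inspection": 0.5, "Return": 0.1}
model = {"Background": 0.4, "Detection": 0.2, "Inspection": 0.3, "Return": 0.1}
log_ratios = log_ratio_per_category(model, human)
```

On the logarithmic axis, overestimating a category by a factor of two sits as far above the dotted line as underestimating it by a factor of two sits below.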


Fig 7.

Object-based model comparison.

Average total dwell time of the simulated model scanpaths compared to the average total dwell time across human observers for each individual object. We again average model predictions across parameter configurations of the last generation. (The results are qualitatively equivalent to averaging multiple model runs of the parameter configuration with the highest fitness.) A perfect prediction for how humans balance attention between objects would correspond to all data points lying on the dotted line with slope m = 1 and intercept y₀ = 0. We plot the objects of the test (training) set and the corresponding linear fits in opaque (transparent) colors for the S.ll model (a), S.hl model (b), O.ll model (c), O.cb model (d), and M.ll model (e).


Fig 8.

Saccade statistics related to the scanpath history.

(a) Polar histogram of the relative angles between subsequent saccades. We show the binned distribution (12° bin size) of simulated saccades of each model in comparison to the kernel density estimate of the human data for the test (training) set in opaque (transparent) colors. (b) Median duration of the foveations preceding the saccades in each angular bin (12° bin size) for the human data and the simulated model scanpaths. We plot the median so that a small number of long foveation events does not distort the statistics. To reduce fluctuations in the median, we apply a centered circular moving average across 5 bins. (c) Distribution of saccades that return to a previously uncovered object, depending on the time since the object was last foveated. We normalized the distributions such that the y-axis shows the number of saccades returning to an object within each time bin (200 ms bin size) divided by the total number of saccades.
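The smoothing in panel (b) can be sketched as follows. With 12° bins, the full circle has 30 bins, and a centered circular moving average over 5 bins wraps around at 0°/360°. This is a minimal illustration with a hypothetical function name, not the analysis code used for the figure.

```python
def circular_moving_average(values, window=5):
    """Centered moving average over angular bins with wrap-around.

    values: one number per angular bin (e.g. 30 medians for 12-degree bins).
    window: odd number of bins to average over.
    """
    n = len(values)
    if window % 2 == 0 or window > n:
        raise ValueError("window must be odd and no larger than len(values)")
    half = window // 2
    return [
        # Modulo indexing makes the first and last bins neighbors.
        sum(values[(i + k) % n] for k in range(-half, half + 1)) / window
        for i in range(n)
    ]
```

Because the indices wrap around, a peak in the first bin also raises the smoothed values of the last two bins, which a non-circular moving average would miss.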
