
Individual and ensemble perception in naturalistic scenes: Effects of context and presentation time

  • Yanina E. Tena Garcia ,

    Contributed equally to this work with: Yanina E. Tena Garcia, Bianca R. Baltaretu, Katja Fiehler

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Experimental Psychology, Justus-Liebig University Giessen, Giessen, Hessen, Germany

  • Bianca R. Baltaretu ,

    Contributed equally to this work with: Yanina E. Tena Garcia, Bianca R. Baltaretu, Katja Fiehler

    Roles Conceptualization, Methodology, Project administration, Supervision, Validation, Writing – review & editing

    Affiliation Experimental Psychology, Justus-Liebig University Giessen, Giessen, Hessen, Germany

  • Katja Fiehler

    Contributed equally to this work with: Yanina E. Tena Garcia, Bianca R. Baltaretu, Katja Fiehler

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – review & editing

    katja.fiehler@psychol.uni-giessen.de

    ‡ These authors also contributed equally to this work.

    Affiliation Experimental Psychology, Justus-Liebig University Giessen, Giessen, Hessen, Germany

Abstract

In many everyday tasks, we must identify both single objects and object ensembles. Our understanding of the mechanisms behind individual and ensemble perception comes mainly from studies conducted under very simplistic conditions. Here, we aim to further this understanding by moving toward more naturalistic environments. We tested the influence of scene context and presentation time on individual and ensemble perception. Six kitchen objects were presented in two scene contexts, either in a kitchen scene or in front of a texturized background, for one of three presentation times (100, 800, or 3200 ms). After viewing the objects, participants were instructed to indicate via mouseclick the position of one of the six objects (Individual task) or their average, ensemble position (Ensemble task). We assessed task performance (mouseclicks and eye movements) separately for the two tasks. In the Individual task, objects were located with higher accuracy in the kitchen scene at the longer presentation times. The related eye movements, during initial scene viewing, showed more frequent and larger saccades in the kitchen scene, with no differences in peak velocity, and shorter fixations on individual objects. Increasing presentation time was associated with fewer, larger, and slower saccades, as well as longer object fixations. In the Ensemble task, the ensemble position was located more accurately on the texturized background when it was shown briefly (100 ms). Eye movements in the naturalistic scene revealed more frequent, larger, and slower saccades, and shorter fixations on the ensemble position. Moreover, increasing presentation time was associated with fewer, smaller, and slower saccades, with longer fixations on the ensemble region. Overall, we found that scene context and presentation time influence spatial localization and eye movement behavior in individual and ensemble perception, highlighting the need to consider such contextual factors in future work.

Introduction

One of the key challenges that our visual system faces is extracting task-relevant information within our complex environments. For example, when you wish to make a cup of coffee in your favorite mug, you need to know what it looks like and where it is. This process is referred to as individual object perception [1–3]. However, in order to find that single mug in the kitchen, you also need to know where the group of mugs is (i.e., their average position). This is known as ensemble perception [4,5], which provides summary information about a group of objects.

For goal-directed movements toward a specific object, humans rely on individual object perception. An accurate object representation is built up between 100 ms and 2 s [6,7], and depends on several factors, such as object complexity [8], the number of surrounding objects [7,8], and object-context relations [9,10]. Regarding the latter, it has been shown that semantically-congruent scenes (e.g., a hairdryer presented in the bathroom) reduce object processing time [9–11] and facilitate the retrieval of an object's location [12,13]. This facilitatory effect has been demonstrated for presentation times of four seconds or longer [12,13] and is associated with a general improvement over exposure time [7,14]. Given that our cognitive resources are limited [15–17], scene context and presentation time represent two crucial factors that may facilitate individual object perception [7,9–14]; however, their roles in object localization under more real-world conditions, especially at shorter presentation times, are as yet unknown.

To determine summary characteristics of a group of objects, humans mainly rely on ensemble perception. This process conveys summary statistics, such as the mean or variance [18], of different features of a group of objects (e.g., average color or identity) [19]. Another feature that has been less studied, but has everyday relevance, is the average location of a group of objects [20–22]. Previous studies that presented abstract stimuli (e.g., dots, lines) showed that their average position can be accurately reported [20,21,23]. Moreover, the average positions of several object groups, for example defined by different stimulus colors, can be represented simultaneously [21]. In the real world, being able to determine the location of a group of objects helps you orient yourself appropriately for a given task. For example, to find the forks in the cutlery drawer in the kitchen, it is crucial to know their average location, as opposed to the position of the spoons or knives. Unlike individual object perception, the effect of scene context on ensemble perception is less clear. There is some evidence from ensemble perception of facial expressions that a task-irrelevant background (e.g., uniformly oriented lines) that changes between stimulus encoding and information retrieval can reduce ensemble precision [24]. In addition to such contextual factors, presentation time also plays a role. Ensemble perception is known to be a fast and possibly automatic process [4,7,25,26], where object ensembles are built with presentation times as short as 50–500 ms [6,7], depending on stimulus complexity. With longer presentation times (up to 1600 ms), ensemble percepts become more accurate before plateauing, especially for mid- and high-level features like stimulus size and facial features [14,27,28]. However, whether these effects transfer to naturalistic scenes is unknown.

In this study, we aimed to close the gap in our understanding of how scene context and presentation time affect individual object and ensemble perception. In a behavioral experiment, participants viewed multiple objects embedded in a kitchen scene (Natural scene) or a texturized background (Non-natural scene) and then had to indicate either the position of a single object (from a group of six objects; Individual task) or the average, ensemble position (Ensemble task). The objects were presented for one of three presentation times (100, 800, or 3200 ms). In the Individual task, we expected better locating performance in the Natural scene [12,13] and at longer presentation times [7,14]. In the Ensemble task, we also expected better locating performance at longer presentation times [14], though we had no a priori hypothesis for scene context (exploratory analysis). Despite limited prior eye-tracking research on these kinds of tasks, eye movements can provide a direct window into how early visual sampling supports later spatial localization [29–31]. Saccade characteristics, like saccade rates, amplitudes and peak velocities, as well as the distribution of gaze across regions of interest (ROIs), reveal how observers allocate attention and extract information during perceptual processing [32–35]. Therefore, we assessed saccade rates, saccade amplitudes, peak velocity and ROI-specific gaze measures for our two perceptual tasks to investigate task-related eye movement behavior and to determine whether scene context and presentation time also play a role early in scene viewing. In brief, we found that, in both the Individual and the Ensemble tasks, locating and eye movement behavior were influenced by scene context and presentation time.

Materials and methods

Participants

A group of 76 students from Justus Liebig University (mean age = 23.27 years ± 3.36; 58 females) participated in the experiment. The sample size (N = 76) was determined using a power analysis for a repeated-measures analysis of variance (RM-ANOVA) in G*Power (effect size f = 0.19, corresponding to ηp² = 0.035; six measurements; α = 0.05; desired power = 0.8). All participants had to meet the following criteria: 1) normal or corrected-to-normal vision, 2) between 18 and 35 years of age, 3) right-handed as indicated by the Edinburgh Handedness Inventory [36] (M = 86.64, SD = 18.07), 4) no neurological or motor disorders, and 5) intact color vision (verified with Ishihara charts). All participants provided written informed consent and received credit or money (8 euro/h) for their participation. The experiment was conducted in compliance with the guidelines of the local ethics committee at the Department of Psychology, Justus Liebig University Giessen and the Declaration of Helsinki [37]. Data collection started on 23 October 2024 and ended on 13 November 2024.

Stimuli and scene arrangements

We tested two scene contexts: For the Natural scene, we generated a kitchen scene (created using Blender v2.9; Fig 1A), and for the Non-natural scene, we applied an adapted filter [10] from Portilla and Simoncelli (2000) [38] to the Natural scene to retain low-level information by capturing statistics of brightness, contrast, and patterns across space, orientation, and scale (MATLAB vR2020b; https://de.mathworks.com/products/matlab.html; Fig 1). In each scene context, we presented six kitchen objects. Specifically, we used a banana, mango, pomegranate, jam jar, peanut butter jar, and pot (Blender v2.9; https://www.blender.org/) chosen from a repository (https://www.turbosquid.com/de). We chose to present six target objects, as this is in the upper visual working memory capacity range [15–17], to prevent both floor and ceiling effects in localizing performance. Furthermore, these objects were used to create six object arrangements, where objects were pseudo-randomly placed at physically plausible locations (i.e., on the countertop, on the sink, and/or in the cupboards, open shelves and cabinetry) and in a way that they never occluded one another. In the arrangements, objects assumed different and unique (non-repeating) locations in the scene. We also included six additional arrangements of the same target objects per block, here called ’catch’ scenes, to increase the variability of the arrangements and thus decrease memorization of the six main arrangements. ’Catch’ scenes were not included in the statistical analysis (see Design below for details). Overall, there were 12 different arrangements for each of the two scene contexts, resulting in 24 rendered scenes.

Fig 1. Example stimuli.

Example of one object arrangement in the Natural (A) and the Non-natural (B) scenes containing the six target stimuli (left to right: banana, mango, pot, jam jar, pomegranate, and peanut butter jar).

https://doi.org/10.1371/journal.pone.0347430.g001

Apparatus

We ran our paradigm using PsychoPy (v2021.2.0) on an Intel® Core i5-2500 CPU (3.30 GHz; 8 GB RAM) with an NVIDIA® GeForce GTS 450 graphics card, running Windows 10 Pro. Scenes were displayed on a 25” monitor (refresh rate: 60 Hz; resolution: 1920 x 1080 pixels) in a dark room. We ensured a constant distance between the participant and the monitor using a chinrest (distance eye to monitor: 90 cm). For eye movements, we used a video-based desktop-mounted EyeLink 1000 (SR Research Ltd., Mississauga, Ontario, Canada; sampling rate: 1000 Hz) to record two-dimensional movements of the right eye. The eye tracker was placed on the table, below the line of sight, at a distance of 30 cm from the chinrest. The eye tracker was calibrated and validated before each experimental block using a 5-point grid (calibration threshold: within 1°; validation error threshold: within 0.35°).

Design

We used a 2 x 2 x 3 within-subject design, with Task (Individual vs. Ensemble), Scene Context (Natural vs. Non-natural), and Presentation Time (100 ms vs. 800 ms vs. 3200 ms) as our three factors. These presentation times were chosen from Neumann et al. (2018), as that study is the most closely matched to ours in terms of stimulus complexity. We blocked Task and Scene Context into four main blocks: 1) Individual, Natural, 2) Individual, Non-natural, 3) Ensemble, Natural, and 4) Ensemble, Non-natural. Each block contained 24 trials: each of the six main scene arrangements was tested at each of the three presentation times, resulting in 18 main trials. Additionally, the six ‘catch’ scenes (see section ’Stimuli and Scene Arrangements’) were also presented once within each block (randomly assigned to the three presentation times). Overall, there were a total of 96 trials distributed across the four blocks. Block order was counterbalanced across participants.
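The block and trial counts above can be sketched in a few lines of Python; this is only an illustration of the design's structure (variable names are ours, not the experiment code):

```python
import itertools
import random

TASKS = ["Individual", "Ensemble"]
CONTEXTS = ["Natural", "Non-natural"]
TIMES_MS = [100, 800, 3200]

def build_blocks(seed=0):
    """Assemble the four Task x Scene Context blocks: 18 main trials
    (6 arrangements x 3 presentation times) plus 6 catch trials each."""
    rng = random.Random(seed)
    blocks = []
    for task, context in itertools.product(TASKS, CONTEXTS):
        main = [(task, context, arrangement, t)
                for arrangement in range(6) for t in TIMES_MS]
        # Catch scenes are randomly assigned to one of the three times.
        catch = [(task, context, f"catch-{i}", rng.choice(TIMES_MS))
                 for i in range(6)]
        blocks.append(main + catch)
    return blocks

blocks = build_blocks()
print(len(blocks), sum(len(b) for b in blocks))  # 4 96
```

Counterbalancing of block order across participants would then amount to permuting the `blocks` list per participant.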

Procedure

Each of our experimental trials started with an initial Fixation period (Fig 2). Fixation on the cross was required within 3 s of its appearance; otherwise, calibration started again. After fixating for 1 s within a 2.5° window, a 3 s Countdown began. Failure to fixate at this stage started the trial anew. The Countdown was followed by the Encoding phase, during which a scene arrangement (Natural or Non-natural, depending on the block) was presented at one of the three presentation times (100, 800, or 3200 ms). Then, a 50 ms mask was presented, followed by the 2000 ms Instruction phase, which informed participants about the upcoming task. In the Individual task, a picture of the target object was presented at the center of the screen. In the Ensemble task, a symbol representing the average position of the objects was shown centrally. In the final, Response phase, the same scene as in the Encoding phase was presented, without any of the target objects. Participants had unlimited time to indicate via mouseclick the respective target position (individual or ensemble). Once they provided their response, a new trial began. We anticipated participants’ preemptive positioning of the mouse from one trial to the next by randomizing the starting position of the mouse across the four corners of the screen.

Fig 2. Example trial sequence.

In the lower panels, we present an Ensemble, Non-natural trial, where the location of the ensemble position must be reproduced in the texturized scene. The other two combinations (i.e., Individual, Non-natural and Ensemble, Natural) were also tested. Eye movements were recorded in the Encoding phase and mouseclick endpoints were recorded in the Response phase.

https://doi.org/10.1371/journal.pone.0347430.g002

Data processing and analysis

The locating and eye movement data were processed using Jupyter Notebook (v6.2.0; https://jupyter.org/).

Locating data.

In order to assess the accuracy of locating responses, we determined the locating error. To do this, we calculated the magnitude of the difference vector between the mouseclick and the object’s actual 2D position. For the Individual task, the 2D position of each object in each scene arrangement was defined relative to its center of mass. For the Ensemble task, we calculated both the center of area (COA; geometric center) and the center of gravity (COG; average location) of the six objects in a scene to determine the ground truth for each scene arrangement [39]. The COA was calculated as the 2D centroid of the polygon formed by connecting the centers of mass of the six target objects, where A denotes the signed polygon area, n the number of targets in the scene (six), and xi and yi the x and y coordinates of each target, respectively.

\[ A = \frac{1}{2}\sum_{i=1}^{n}\left(x_i\,y_{i+1} - x_{i+1}\,y_i\right) \tag{1.1} \]
\[ \mathrm{COA}_x = \frac{1}{6A}\sum_{i=1}^{n}\left(x_i + x_{i+1}\right)\left(x_i\,y_{i+1} - x_{i+1}\,y_i\right) \tag{1.2} \]
\[ \mathrm{COA}_y = \frac{1}{6A}\sum_{i=1}^{n}\left(y_i + y_{i+1}\right)\left(x_i\,y_{i+1} - x_{i+1}\,y_i\right) \tag{1.3} \]

with indices taken cyclically, i.e., (x_{n+1}, y_{n+1}) = (x_1, y_1).

The COG was calculated as the mean X and Y values of all centers of mass of each of the six target objects in the scene.

\[ \mathrm{COG}_x = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{2.1} \qquad \mathrm{COG}_y = \frac{1}{n}\sum_{i=1}^{n} y_i \tag{2.2} \]
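Both ground truths and the locating error can be computed in a few lines of Python; this is a minimal sketch with NumPy (function names are ours, not the authors' analysis code):

```python
import numpy as np

def center_of_area(points):
    """COA: 2D centroid of the polygon whose vertices are the object
    centers of mass (shoelace formula, Eqs. 1.1-1.3)."""
    x, y = points[:, 0], points[:, 1]
    x1, y1 = np.roll(x, -1), np.roll(y, -1)   # next vertex, wrapping around
    cross = x * y1 - x1 * y
    area = 0.5 * cross.sum()                  # signed polygon area A
    cx = (x + x1).dot(cross) / (6.0 * area)
    cy = (y + y1).dot(cross) / (6.0 * area)
    return np.array([cx, cy])

def center_of_gravity(points):
    """COG: mean x and y of the object centers of mass (Eqs. 2.1-2.2)."""
    return points.mean(axis=0)

def locating_error(click, target):
    """Magnitude of the difference vector between mouseclick and ground truth."""
    return float(np.linalg.norm(np.asarray(click, float) - np.asarray(target, float)))

square = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])
print(center_of_area(square), center_of_gravity(square))  # both [1. 1.]
print(locating_error((0, 0), (3, 4)))                     # 5.0
```

Note that COA and COG coincide for symmetric configurations like the square above, but diverge for irregular object layouts, which is what makes the comparison between the two ground truths informative.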

Participants’ responses were best represented by the COG (S1 Fig and Table in S1 Appendix). Additional comparisons to the screen center were conducted to verify that participants’ responses reflect task-related behavior rather than a general tendency to click near the screen center, as a statistically efficient but imprecise strategy. These comparisons showed that performance was not significantly better described by center-of-screen coding (S1 Fig and Table in S1 Appendix). As such, we used the COG as the ground truth to calculate participants’ locating accuracy in the Ensemble task. Outlier criteria for the locating analysis were based on both the locating and eye movement data. First, experimental trials were excluded when blinks in the Encoding phase occurred in the 100 ms presentation time (one trial, < 0.02% of the 5472 total trials) or were longer than 200 ms (55 trials, 1% of the total). Second, we observed significant differences in performance in the Individual task for the two red objects (i.e., the jam and pomegranate; S2 Table in S2 Appendix) and therefore excluded the data of these two targets (907 trials, 16.6% of the total). Third, trials were excluded if participants’ response times in the Response phase were shorter than one second or if their locating errors exceeded three standard deviations (calculated separately for each combination of task and presentation time), which led to the exclusion of 82 trials (1.5% of the total). Altogether, we excluded 1045 trials, which resulted in a remaining dataset of 4427 trials (80.9% of the total). We chose to apply linear mixed models (LMMs) to our data, given that we wanted to test the factors scene context and presentation time in each task’s dataset while accounting for relevant additional factors, such as the repetition number of scene arrangements and participant variability, which could not be fully captured with RM-ANOVAs.
LMM analyses were performed separately for our two tasks – one for the Individual and one for the Ensemble task. They were used to test the effects of scene context and presentation time, as well as their interaction, and to capture any potential repetition effects and participant variability. The analysis was conducted using the “lmerTest” package in R (v4.2.2; https://www.r-project.org/). Any post-hoc paired t-tests were performed using the “emmeans” package and were Bonferroni-Holm corrected, with p-values reported as pBH. Mean differences (MD), standard errors (SE), t-statistics, p-values and Cohen’s d (d) are provided for each test.
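For readers more familiar with Python, the structure of these models can be sketched with statsmodels' MixedLM (the paper itself used lmerTest in R); the data frame below is synthetic and the column names are placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic per-trial data, shaped like the (hypothetical) analysis table.
rng = np.random.default_rng(1)
n_subj, n_trials = 20, 12
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_subj), n_trials),
    "context": np.tile(["Natural", "Non-natural"], n_subj * n_trials // 2),
    "time": np.tile([100, 800, 3200], n_subj * n_trials // 3),
    "repetition": np.tile(np.arange(n_trials), n_subj),
})
subject_offset = rng.normal(0, 1, n_subj)[df["participant"]]
df["error"] = 5 - 0.0005 * df["time"] + subject_offset + rng.normal(0, 1, len(df))

# Fixed effects: context, time, their interaction, plus the repetition
# covariate; random intercept per participant (cf. lmer's (1 | participant)).
model = smf.mixedlm("error ~ C(context) * C(time) + repetition",
                    data=df, groups="participant")
fit = model.fit()
print(fit.summary())
```

The R model reported in the paper has the same fixed-effect and random-intercept structure; exact degrees of freedom and p-values differ between the two implementations (lmerTest uses Satterthwaite approximations by default).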

Eye movement data.

We examined the effects of scene context and presentation time on eye movement behavior in the two main tasks, specifically during the Encoding phase. While the presentation times were chosen for perceptual effects on the locating behavior, the shortest presentation time used here precluded us from assessing any meaningful eye movement behavior [40]. As such, eye movement analysis was restricted to trials that had an 800 or 3200 ms presentation time. First, raw gaze data were low-pass filtered using a second-order Butterworth filter with a cut-off frequency of 30 Hz. Saccade onsets and offsets were identified based on two-dimensional gaze velocity, using a threshold of 30°/s. Only saccades with amplitudes exceeding 0.5°, as defined by the velocity-based onset and offset, and peak velocities below 850°/s were considered for the analysis. Further analysis was conducted on fixations, which were defined as the time between two consecutive saccades (i.e., the time between the offset of the preceding saccade and the onset of the subsequent saccade). In particular, we included fixations in the analysis that 1) started after the first saccade following Encoding onset and ended before Encoding offset and 2) were longer than 50 ms and shorter than 2000 ms. We excluded data from 8 participants due to technical problems with the eye tracker (interference from eye glasses or contact lenses) and applied our outlier criteria for the eye movement data to the remaining 68 datasets. First, we excluded trials if no saccade was performed during the Encoding phase (4 trials, 0.12% of 3264 total trials) and if blinks in the Encoding phase were longer than 200 ms (55 trials, 1.69% of the total). Second, any trial in which more than 30% of fixations were excluded based on fixation durations shorter than 50 ms or longer than 2000 ms was excluded from further analysis (276 trials, 8.46% of the total).
Finally, if more than 50% of trials were excluded based on the first and second criteria, the participant’s data were excluded from the eye movement analysis, resulting in the exclusion of one more participant only for the eye movement analysis. In total, we included data from 67 out of 76 participants in the eye movement analysis (8.52% trials excluded from the remaining 67 datasets).
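The velocity-based saccade detection described above could be sketched as follows; this is a simplified reimplementation under the stated thresholds (not the authors' code; blink handling is omitted):

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 1000.0  # EyeLink sampling rate in Hz

def detect_saccades(x, y, vel_thresh=30.0, min_amp=0.5, max_peak=850.0):
    """Detect saccades in gaze traces x, y (in degrees of visual angle)."""
    # Second-order Butterworth low-pass filter, 30 Hz cut-off (zero-phase).
    b, a = butter(2, 30.0 / (FS / 2.0))
    xf, yf = filtfilt(b, a, x), filtfilt(b, a, y)
    vx, vy = np.gradient(xf) * FS, np.gradient(yf) * FS
    speed = np.hypot(vx, vy)              # 2D gaze velocity in deg/s
    fast = speed > vel_thresh
    # Contiguous supra-threshold runs are candidate saccades.
    edges = np.diff(fast.astype(int))
    onsets = np.where(edges == 1)[0] + 1
    offsets = np.where(edges == -1)[0] + 1
    saccades = []
    for on, off in zip(onsets, offsets):
        amplitude = np.hypot(xf[off] - xf[on], yf[off] - yf[on])
        peak = speed[on:off].max()
        if amplitude > min_amp and peak < max_peak:
            saccades.append({"onset": on, "offset": off,
                             "amplitude": amplitude, "peak_velocity": peak})
    return saccades
```

Fixations would then be the intervals between consecutive saccade offsets and onsets, filtered to 50–2000 ms as described above.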

Using our final data set, we performed analyses on saccade rates, amplitudes and peak velocity as well as on fixation duration on predefined ROIs. For saccade rates, all saccades that met the above listed criteria in a trial were counted and divided by the respective presentation time of the trial. Saccade amplitude was calculated as the straight-line (Euclidean) distance between gaze position at saccade onset and offset,

\[ \text{amplitude} = \sqrt{\left(x_{\text{offset}} - x_{\text{onset}}\right)^2 + \left(y_{\text{offset}} - y_{\text{onset}}\right)^2} \tag{3.1} \]

Peak velocity was defined as the maximum instantaneous velocity within the saccade interval, where instantaneous velocity was computed as the Euclidean norm of the horizontal and vertical velocity components,

\[ v_{\text{peak}} = \max_{t\,\in\,[\text{onset},\,\text{offset}]} \sqrt{v_x(t)^2 + v_y(t)^2} \tag{3.2} \]

For fixation duration, we determined the fixation time spent on predefined ROIs: We defined ROIs for each of the six target objects (individual ROIs) and the ensemble position (ensemble ROI). For the individual ROIs, a rectangular region was created for each object based on its extreme boundaries, defined by its uppermost, lowermost, leftmost, and rightmost points. Across stimuli, the average object width was 1.3° of visual angle and the average height was 1.1°. For the ensemble ROI, we created a circular area with a radius of 0.5° of visual angle around the ensemble position. This created a central region with a total diameter of 1°, matching the size of the individual object ROIs. Fixations (modeled as circular regions with a radius of 1°) were considered to land on an ROI if they overlapped with the space of the individual or ensemble ROI. For saccade rate, amplitude and fixations on ROIs, we applied the same LMMs as for the locating data (with fixed effects of scene context, presentation time and their interaction, the covariate of scene arrangement repetition, and participant variability as random intercepts), separately for the two tasks. For peak velocity, we additionally included saccade amplitude as a covariate to statistically control for the amplitude–velocity relationship (the “main sequence”) [41,42].
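The fixation-on-ROI test amounts to circle-rectangle and circle-circle intersection; a minimal sketch (the geometry helpers are ours; coordinates in degrees, y assumed to increase upward):

```python
import math

def fixation_on_object_roi(fx, fy, box, fix_radius=1.0):
    """True if the fixation circle (radius 1 deg, centered at fx, fy)
    overlaps a rectangular object ROI given as (left, bottom, right, top)."""
    left, bottom, right, top = box
    # Distance from the fixation center to the nearest point of the rectangle
    # (zero if the center lies inside the rectangle).
    dx = max(left - fx, 0.0, fx - right)
    dy = max(bottom - fy, 0.0, fy - top)
    return math.hypot(dx, dy) <= fix_radius

def fixation_on_ensemble_roi(fx, fy, ex, ey, roi_radius=0.5, fix_radius=1.0):
    """True if the fixation circle overlaps the circular ensemble ROI:
    two circles overlap when their center distance <= the sum of radii."""
    return math.hypot(fx - ex, fy - ey) <= roi_radius + fix_radius
```

Summing the durations of all fixations for which such a test returns true, per trial and ROI type, yields the ROI dwell-time measure analyzed below.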

Results

Individual task

First, we looked at the effects of scene context and presentation time on individual object perception. We hypothesized that there would be better locating performance in the Natural compared with the Non-natural scene [12,13] and at longer presentation times [7,14]. For our eye movements, we used exploratory analysis to determine the effects of scene context and presentation time on saccade rates, amplitude and peak velocity as well as on fixation duration on ROIs during Encoding.

Locating behavior.

We found a significant main effect of scene context (F1,1047.7 = 8.27, p = .004) and of presentation time (F2,1001.1 = 135.60, p < .001), as well as a significant interaction of scene context and presentation time (F2,1001.1 = 6.15, p = .002). We conducted additional post-hoc paired t-tests (Table 1) to test 1) the effect of presentation time within each scene context and 2) the effect of scene context at each presentation time. Across scene contexts, we found better locating behavior from the shortest to the middle presentation time and from the middle to the longest presentation time (Fig 3). Further, when comparing scene context at each presentation time, while we found no difference between the contexts at the shortest presentation time, we did find significant differences at the middle and longest presentation times (i.e., significantly smaller errors in the Natural scene at 800 ms and 3200 ms; Fig 3). We also included a covariate for scene arrangement repetition, which we found to be significant (F11,1169.3 = 2.45, p = .005), showing that locating behavior improved with more repetitions. Overall, these findings indicate that locating individual objects benefits from the Natural scene context at longer presentation times.

Table 1. Pairwise Comparisons in the Individual task.

https://doi.org/10.1371/journal.pone.0347430.t001

Fig 3. Locating errors in the Individual task.

Depicted are the estimated marginal means of the LMM model, represented as black-outlined dots, with the SE as error bars. These are presented alongside the mean locating error per participant per condition, as semi-transparent dots where darker regions indicate greater overlap. Significant differences between conditions are indicated by stars: p < .001 ‘***’, p < .010 ‘**’, p < .050 ‘*’.

https://doi.org/10.1371/journal.pone.0347430.g003

Eye movement behavior.

In our exploratory assessment of eye movements in the Individual task, we investigated whether scene context and presentation time also play a role during Encoding. First, we looked at saccade rates (saccades/s), where we found a significant main effect of scene context (F1,673.7 = 6.51, p = .011), showing higher saccade rates in the Natural (M = 5.19 saccades/s) compared to the Non-natural (M = 5.00 saccades/s) scene context (Fig 4A). We further found a significant main effect of presentation time (F1,635.2 = 611.53, p < .001), showing a higher saccade rate in the 800 ms presentation time condition (M = 5.93 saccades/s) compared to the 3200 ms condition (M = 4.23 saccades/s) (Fig 4A). The interaction of scene context and presentation time was not significant (F1,635.5 = 1.43, p = .232), nor was the covariate repetition of scene arrangement (F11,669.0 = 0.46, p = .928). Thus, participants made more saccades during stimulus presentation in the Natural context compared to the Non-natural context, and saccade rates were higher at the shorter presentation time.

Fig 4. Saccade measures in the Individual task.

Depicted are the estimated marginal means of the LMM models, represented as black-outlined dots, with the SE as error bars. These are presented alongside the mean values per participant per condition, as semi-transparent dots where darker regions indicate greater overlap. Significant differences between conditions are indicated by stars: p < .001 ‘***’, p < .010 ‘**’, p < .050 ‘*’.

https://doi.org/10.1371/journal.pone.0347430.g004

In a next step, we examined saccade amplitudes during scene encoding. We found a significant main effect of scene context (F1,667.1 = 5.86, p = .016), with on average 0.16° larger saccade amplitudes in the Natural compared to the Non-natural scene (Fig 4B). There was also a significant main effect of presentation time (F1,634.3 = 6.27, p = .013), showing larger saccade amplitudes in the 3200 ms presentation time condition (M = 4.82°) compared to the 800 ms condition (M = 4.68°) (Fig 4B). The interaction of scene context and presentation time was not significant (F1,635.5 = 1.43, p = .232), nor was the covariate repetition of scene arrangement (F11,669.0 = 0.46, p = .928). Altogether, the saccade amplitudes for the Individual task reveal that larger eye movements are made in the Natural compared to the Non-natural scenes and at longer compared to shorter presentation times.

We further analyzed the peak velocity of saccades, while controlling for saccade amplitude. We found a significant main effect of presentation time (F1,13292.2 = 585.87, p < .001), showing faster saccades in the 800 ms presentation time condition (M = 310.79°/s) compared to the 3200 ms condition (M = 245.99°/s) (Fig 4C). There was no main effect of scene context (F1,11784.0 = 3.3, p = .070), no significant interaction of scene context and presentation time (F1,13287.7 = 2.06, p = .151), and no significant effect of the covariate repetition of scene arrangement (F11,8263.5 = 0.59, p = .835). We found a significant effect of the covariate saccade amplitude (F1,13287.7 = 6067.9, p < .001), reflecting the expected main-sequence relationship of higher peak velocities for larger saccade amplitudes. Thus, peak velocity was mainly higher at the shorter presentation time.

In addition, we investigated where participants looked during the Encoding phase. To do so, we first used heatmaps to visualize the focus of participants’ gaze for each scene context and the two presentation times (800, 3200 ms). Fig 5 illustrates that, across all four conditions, participants fixated on or near the individual target objects. Moreover, increasing presentation times are accompanied by a spread of fixation behavior, especially toward objects found farther from the center of the screen.

Fig 5. Heatmaps for the Individual task.

The heatmaps are separated for scene context (Natural and Non-natural, upper and lower rows, respectively) and for two presentation times (800 ms and 3200 ms, left and right columns, respectively). Darker colors indicate higher fixation density. Colormaps were normalized separately for each condition and thus do not permit direct comparisons of color intensity across heatmaps. Individual target objects are outlined in black.

https://doi.org/10.1371/journal.pone.0347430.g005

Finally, we quantified the spatial distribution of gaze during the Encoding phase by testing the time participants spent fixating any of the six target objects (individual ROIs). We found a significant main effect of scene context (F1,679.8 = 4.51, p = .034), with a greater proportion of fixation time on individual objects in the Non-natural scene (M = 52.53%) compared to the Natural scene (M = 50.17%), as can be seen in Fig 6A. We also found a significant main effect of presentation time (F1,636.2 = 326.86, p < .001), with 18.3 percentage points more fixation time directed toward individual targets in the 3200 ms presentation time condition (M = 60.48%) compared to the 800 ms condition (M = 42.21%) (Fig 6B). There was no interaction of scene context and presentation time (F1,636.5 = 1.30, p = .254), and no effect of the covariate repetition of scene arrangements (F11,673.1 = 0.80, p = .645). Thus, in the Individual task, the ROI analysis showed that gaze dwells longer on individual target objects in the Non-natural scene context and at longer presentation times, which may reflect a spread of gaze toward increasingly peripheral targets (Fig 5).

Fig 6. Time spent on individual ROI in the Individual task.

Depicted are the estimated marginal means of the LMM model, with the SE as error bars. These are presented alongside the average time participants spent on the ROI per participant per condition, as semi-transparent dots where darker regions indicate greater overlap. Significant differences between conditions are indicated by stars: p < .001 ‘***’, p < .010 ‘**’, p < .050 ‘*’.

https://doi.org/10.1371/journal.pone.0347430.g006

Ensemble task

For the Ensemble task, we also investigated how scene context and presentation time affect locating behavior. We expected locating performance to improve with longer presentation times [7,14], while no a priori assumptions were made regarding the effect of scene context. Given that the effects of these two factors on eye movement behavior during ensemble perception have received little attention, we performed the same exploratory analyses as above (see Individual Task – Eye Movement Behavior).

Locating behavior.

In testing for the effects of scene context and presentation time on locating performance in the Ensemble task, we found significant main effects of scene context (F1,1137.5 = 12.06, p < .001) and presentation time (F2,1130.1 = 128.11, p < .001), as well as a significant interaction of scene context and presentation time (F2,1130.4 = 9.98, p < .001). We conducted additional post-hoc paired t-tests (Table 2) to test 1) the effect of presentation time within each scene context and 2) the effect of scene context at each presentation time. In both the Natural and Non-natural scenes, locating performance improved only from the shortest to the middle presentation time (no significant differences between the 800 ms and 3200 ms conditions; Fig 7). This suggests an improvement in locating performance within a limited range of up to 800 ms that plateaus between 800 and 3200 ms. We also found a significant effect of scene context, but only at the shortest presentation time, with smaller locating errors in the Non-natural compared to the Natural scene. As above, we included the covariate of scene arrangement repetition in our analyses and found a significant effect (F11,1174.1 = 4.66, p < .001), with locating responses improving across repetitions. Overall, these findings indicate a benefit of the Non-natural scene at short presentation times for locating the ensemble position.
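The post-hoc comparisons above are paired t-tests reported with standardized effect sizes (Cohen's d). As a minimal sketch of how the reported t and d values relate in a paired design, assuming the d_z convention (mean difference divided by the SD of the pairwise differences; the data below are hypothetical, not values from this study):

```python
from statistics import mean, stdev
from math import sqrt

def paired_t_and_d(x, y):
    """Paired t statistic and Cohen's d (d_z) for two matched samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    md = mean(diffs)           # mean pairwise difference
    sd = stdev(diffs)          # sample SD of the pairwise differences
    t = md / (sd / sqrt(n))    # t with n - 1 degrees of freedom
    d = md / sd                # standardized effect size (d_z)
    return t, d

# Hypothetical per-participant locating errors (deg) at two presentation times
short = [2.1, 2.4, 1.9, 2.6, 2.2]
mid   = [1.5, 1.8, 1.4, 2.0, 1.6]
t, d = paired_t_and_d(short, mid)
```

Other conventions for d in repeated-measures designs exist, so the exact formula used in the paper is an assumption here.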

Table 2. Pairwise Comparisons in the Ensemble task.

https://doi.org/10.1371/journal.pone.0347430.t002

Fig 7. Locating errors in the Ensemble task.

Depicted are the estimated marginal means of the LMM model, represented as black-outlined dots, with the SE as error bars. These are presented alongside the mean locating error per participant per condition, as semi-transparent dots where darker regions indicate greater overlap. Significant differences between conditions are indicated by stars: p < .001 ‘***’, p < .010 ‘**’, p < .050 ‘*’.

https://doi.org/10.1371/journal.pone.0347430.g007

Eye movement behavior.

For the analyses of the eye movement behavior in the Ensemble task, we followed a similar analysis pipeline as for the Individual task. We tested for effects of scene context and presentation time on saccade rate, saccade amplitude, peak velocity and fixation durations on the ensemble ROI.

When looking into saccade rates during the Encoding phase, we found significant main effects of scene context (F1,623.1 = 20.82, p < .001) and presentation time (F1,617.7 = 1087.53, p < .001), as well as a significant interaction of the two factors (F1,616.6 = 5.60, p = .018). We analyzed the interaction (Fig 8A) with four post-hoc paired t-tests assessing 1) the effect of scene context at each of the two presentation times (i.e., Natural vs. Non-natural at 800 ms and at 3200 ms) and 2) the effect of presentation time within each scene context (i.e., 800 ms vs. 3200 ms within the Natural context, and within the Non-natural context). We found a higher saccade rate in the Natural compared to Non-natural scene at 3200 ms, but no difference between scenes at 800 ms (800 ms, Natural – Non-natural: MD = 0.14 saccades/s, SE = 0.08, t = 1.66, pBH = .344, d = 0.19; 3200 ms, Natural – Non-natural: MD = 0.42 saccades/s, SE = 0.08, t = 4.97, pBH < .001, d = 0.54; Fig 8A). In terms of the effect of presentation time on saccade rate, we found a higher saccade rate at shorter presentation times in both scene contexts (Natural, 800 ms – 3200 ms: MD = 1.78 saccades/s, SE = 0.08, t = 21.59, pBH < .001, d = 2.34; Non-natural, 800 ms – 3200 ms: MD = 2.06 saccades/s, SE = 0.08, t = 25.10, pBH < .001, d = 2.70). The covariate of scene arrangement repetition was not significant (F11,648.7 = 1.09, p = .366). Altogether, saccade rates reveal more saccades in the Natural compared to the Non-natural scenes at longer presentation times (with an overall decrease of saccade rate for longer presentation times).
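The pBH values above indicate p-values corrected for multiple comparisons; the subscript suggests the Benjamini–Hochberg step-up procedure, which we sketch here under that assumption (the example p-values are illustrative, not from this study):

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure).

    Each adjusted p is min over equal-or-higher ranks of (m / rank) * p,
    computed on the p-values sorted in ascending order.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_end, idx in enumerate(reversed(order)):
        rank = m - rank_from_end              # 1-based ascending rank
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min           # enforce monotonicity
    return adjusted

# Hypothetical raw p-values from four post-hoc tests
pvals = [0.001, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(pvals)
```

Note that after adjustment the middle two p-values become identical (the step-up minimum propagates downward), which is expected behavior of the procedure.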

Fig 8. Saccade measures in the Ensemble task.

Depicted are the estimated marginal means of the LMM model, represented as black-outlined dots, with the SE as error bar. These are presented alongside the mean values per participant per condition, as semi-transparent dots where darker regions indicate greater overlap. Significant differences between conditions are indicated by stars: p < .001 ‘***’, p < .010 ‘**’, p < .050 ‘*’.

https://doi.org/10.1371/journal.pone.0347430.g008

The analysis of saccade amplitudes during scene encoding in the Ensemble task showed a significant main effect of scene context (F1,621.1 = 15.33, p < .001), with on average 0.23° larger saccade amplitudes in the Natural compared to the Non-natural scene (Fig 8B). There was also a significant main effect of presentation time (F1,616.8 = 38.68, p < .001), with larger saccade amplitudes for the 800 ms (M = 4.17°) compared to the 3200 ms presentation time (M = 3.93°) (Fig 8B). The interaction of scene context and presentation time was not significant (F1,615.9 = 0.10, p = .757). The covariate repetition of scene arrangement was significant (F11,643.3 = 1.82, p = .047); however, post-hoc comparisons showed no consistent pattern of amplitude decrease or increase across repetitions. Overall, saccade amplitudes reveal that larger eye movements are made in the Natural compared to the Non-natural scenes and at shorter compared to longer presentation times.

For the final analysis of saccadic eye movements, we examined the peak velocity of saccades performed during Encoding. We found a significant main effect of scene context (F1,11232 = 10.26, p = .001), with on average 9.84°/s faster saccades in the Non-natural compared to the Natural scene context (Fig 8C). There was also a significant main effect of presentation time (F1,11287 = 512.47, p < .001), with faster saccades for the 800 ms (M = 297.36°/s) compared to the 3200 ms presentation time (M = 229.90°/s) (Fig 8C). There was no significant interaction of scene context and presentation time (F1,11277 = 2.57, p = .109), and the covariate of repetition of scene arrangement was not significant (F11,5210 = 1.02, p = .034). Similar to peak velocity in the Individual task, the covariate saccade amplitude was significant (F1,10853 = 4778.20, p < .001), with larger saccade amplitudes associated with higher peak velocities. Altogether, peak velocity in the Ensemble task revealed that, after controlling for saccade amplitude, saccades were executed more quickly in Non-natural compared to Natural scenes and during shorter compared to longer presentation times.
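Including saccade amplitude as a covariate reflects the main-sequence relationship between amplitude and peak velocity [41,42]. The study uses LMMs for this; as a simplified, self-contained illustration of "controlling for amplitude", one can regress velocity on amplitude and compare the residuals (ordinary least squares on synthetic data, not the model actually fit in the paper):

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y ~ x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def amplitude_adjusted_velocity(amplitudes, velocities):
    """Residual peak velocities after regressing out saccade amplitude."""
    slope, intercept = linear_fit(amplitudes, velocities)
    return [v - (intercept + slope * a)
            for a, v in zip(amplitudes, velocities)]

# Synthetic, perfectly linear main-sequence data (deg, deg/s)
amp = [2.0, 4.0, 6.0, 8.0]
vel = [100.0, 180.0, 260.0, 340.0]
resid = amplitude_adjusted_velocity(amp, vel)
```

With perfectly linear data the residuals are zero; condition differences in real data would then be tested on such amplitude-adjusted values.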

In addition to the saccadic eye movements, we also visualized the spatial distribution of fixations during the Encoding phase of the Ensemble task via heatmaps for each scene context and presentation time. Fig 9 illustrates that fixations largely fell on individual target objects and that the ensemble position (marked by the black circle) received more fixations at the longest presentation time across both scene contexts.

Fig 9. Heatmaps of the Ensemble task.

Presented are fixations, averaged across all participants, which are further separated by scene context (Natural and Non-natural, lower and upper rows, respectively) and presentation time (800 and 3200 ms, left and right columns, respectively). The ensemble ROI (average position of all six objects with an additional 0.5° radius) is depicted by the black circle. Darker colors indicate higher fixation density. Colormaps were normalized separately for each condition and thus do not permit direct comparisons of color intensity across heatmaps.

https://doi.org/10.1371/journal.pone.0347430.g009

To quantify how long participants spent looking at the target region, we tested for the effects of scene context and presentation time on fixation durations within the ensemble ROI. We found a significant main effect of scene context (F1,631.5 = 6.51, p = .011), with a greater proportion of fixation time in the Non-natural (M = 8.61%) compared to the Natural scene (M = 6.33%) (Fig 10A). We also found a significant main effect of presentation time (F1,622.2 = 54.88, p < .001), with 6.3 percentage points more fixation time spent on the ensemble position in the 3200 ms condition compared to the 800 ms condition (Fig 10B). Notably, of the entire Encoding phase, participants spent only about 4% of the time looking directly at the ensemble position in the 800 ms condition and about 11% in the 3200 ms condition. We found no interaction between scene context and presentation time (F1,620.0 = 0.27, p = .605), and no effect of scene arrangement repetition (F11,657.6 = 1.08, p = .377). In the Ensemble task, fixation time on the ensemble ROI thus increased with longer presentation times and was greater in the Non-natural compared to the Natural scene context. Notably, fixations on the ensemble ROI represent only a small proportion of the entire Encoding phase.
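The dwell-time measure, i.e., the percentage of the Encoding phase spent fixating inside the circular ensemble ROI, can be sketched as follows (the fixation coordinates, durations, and exact ROI construction here are illustrative assumptions, not data from the study):

```python
from math import hypot

def percent_time_on_roi(fixations, roi_center, roi_radius, total_ms):
    """Percentage of encoding time spent fixating inside a circular ROI.

    fixations: list of (x_deg, y_deg, duration_ms) tuples.
    """
    inside = sum(
        dur for x, y, dur in fixations
        if hypot(x - roi_center[0], y - roi_center[1]) <= roi_radius
    )
    return 100.0 * inside / total_ms

# Hypothetical fixations from one 800 ms encoding phase,
# with the ensemble ROI centered at (0, 0) with a 0.5 deg radius
fixs = [(0.1, 0.0, 40), (3.2, 1.5, 300), (-0.2, 0.3, 60), (5.0, -2.0, 400)]
share = percent_time_on_roi(fixs, roi_center=(0.0, 0.0),
                            roi_radius=0.5, total_ms=800)
```

Here two of the four fixations (100 ms in total) fall inside the ROI, yielding 12.5% of the 800 ms phase.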

Fig 10. Time spent on ensemble ROI in the Ensemble task.

Depicted are the estimated marginal means of the LMM model, represented as black-outlined dots, with the SE as error bars. These are presented alongside the mean time spent on the ROI per participant per condition, as semi-transparent dots where darker regions indicate greater overlap. Significant differences between conditions are indicated by stars: p < .001 ‘***’, p < .010 ‘**’, p < .050 ‘*’.

https://doi.org/10.1371/journal.pone.0347430.g010

Discussion

In daily life, we constantly locate individual objects and groups of objects, e.g., to grasp a cup or select the shelf with the tea boxes. Here, we tested how individual and ensemble perception are influenced by scene context and presentation time. In the Individual task, we found that locating accuracy was higher in the Natural scene at the middle and longest presentation times. Eye movement behavior also showed higher saccade rates and amplitudes in the Natural scene. However, the individual ROIs were fixated for longer in the Non-natural scene, and with increasing presentation time. At the longest presentation time, saccades were less frequent, larger, and slower. In the Ensemble task, locating performance improved quickly with increasing presentation time and then plateaued from 800 ms onward. Interestingly, at the shortest presentation time, locating accuracy was higher in the Non-natural than in the Natural scene. In contrast to the plateau effect in the locating behavior, eye movements differed between the middle and longest presentation times. In the Natural scene, saccade rates were higher, amplitudes larger, and peak velocities slower than in the Non-natural scene. At the middle presentation time, saccade rates and amplitudes were higher and saccades faster than at the longest presentation time. Ensemble ROIs were fixated for longer in the Non-natural scene, and at the longest presentation time. Together, we show that scene context and presentation time are important factors that influence individual object and ensemble perception.

Individual Task

We found that individual targets were located more accurately when objects were embedded in a naturalistic kitchen scene (Natural scene) and presented for longer periods of time (≥ 800 ms) (Fig 3). This supports previous findings showing that objects embedded in contextually congruent scenes (e.g., hairdryer in bathroom) exhibit a benefit in object recognition and memory-guided localization [9–13]. In particular, the layout of our naturalistic scene may have provided meaningful information (e.g., spatial landmarks) [43,44] during both the Encoding and the Response phases (potentially offering retrieval cues) that is absent in the texturized background. This may also be related to our lifetime experience, which allows us to build up scene heuristics (i.e., expectations about scene layout, object groupings, and specific object locations) that may reduce the search space and thus facilitate object localization (particularly when the object and scene are semantically congruent) [43]. However, we did not observe such a natural scene advantage at our shortest presentation time of 100 ms (Fig 3), probably because, at shorter presentation times, there might be insufficient time to successfully encode multiple target objects [7,8,14] or to fully engage scene heuristics. Overall, these results indicate that individual object localization is temporally dynamic and benefits from naturalistic, semantically congruent scene context at longer exposure times.

The results of the locating performance are complemented by the eye movements during the Encoding phase. Here, we found a higher saccade rate in the Natural compared to the Non-natural scenes (Fig 4A). This likely reflects the higher complexity of the Natural scene, requiring additional eye movements [45,46], and/or the possibility that the Non-natural texturized background may have facilitated access to task-relevant information, enabling more efficient gaze behavior [46]. Furthermore, at the shorter presentation time of 800 ms, the saccade rate was higher than at 3200 ms (Fig 4A). In combination with the smaller saccade amplitudes and higher peak velocity (Fig 4B and C), this indicates a shift toward faster oculomotor sampling under temporal constraints [47,48]. Taken together, limited viewing time in this experimental paradigm appears to bias participants' behavior toward prioritizing rapid information acquisition over spatially extensive exploration. In contrast, the larger saccade amplitudes observed at the 3200 ms presentation time, together with the heatmap evidence showing increased fixations on more peripheral objects with longer presentation time (Fig 5), suggest that extended viewing time leads to broader spatial exploration. This shift is consistent with a more comprehensive sampling strategy that may support more accurate object localization when temporal constraints are reduced. In addition, we observed more time spent fixating potential target objects with longer presentation time, increasing from about 42% to 60% between the 800 ms and 3200 ms presentation times (Fig 6B), and in the Non-natural compared to the Natural scene (∼ 2%; Fig 6A). Notably, the slightly increased fixation duration on the targets in the Non-natural scene did not confer an advantage for the locating task.
Although fixating directly on a target can improve its processing and recall [31], eye movements do not necessarily index task-specific strategies, particularly when task demands allow target information to be gathered peripherally [33,49]. Altogether, our results show scene-dependent differences in eye movements that do not directly relate to task performance.

Overall, the results indicate that individual object perception benefits from naturalistic scenes at longer presentation times. Eye movements were also affected by scene context and presentation time, suggesting increased engagement in more complex, naturalistic scenes.

Ensemble task

To date, the impact of naturalistic scene context on ensemble perception has not yet been explored. What has been shown thus far is that even simple backgrounds (e.g., oriented lines) can provide contextual information and ultimately influence ensemble perception (e.g., of faces) [24]. Here, we directly compared the influence of simplistic (Non-natural) and naturalistic (Natural) scene context on ensemble perception (Fig 2). In our Ensemble task, we found more accurate locating performance in the Non-natural scene context, particularly at the shortest presentation time (Fig 7). The lack of a Natural scene benefit might be due to the fact that the ensemble position is neither spatially constrained by the kitchen scene nor by expectations derived from the naturalistic context. Further, the fact that this effect was found at the shortest presentation time supports the notion that ensemble perception can operate on short timescales [6,7] (from as short as 100 ms). Moreover, the embedding of the targets within 3D-rendered scenes (i.e., within a kitchen) might have increased the difficulty of object extraction, especially at very short presentation times [50–52]. Longer presentation times (≥ 800 ms) may allow for a more detailed visual analysis that can facilitate ensemble perception (Fig 7) [14]. Future investigations should focus on the temporal evolution of ensemble perception as a function of the richness of naturalistic scenes.

To date, eye movement behavior in ensemble tasks is widely unexplored. We found that saccade rate was affected by both scene context and presentation time. More saccades per unit of presentation time were performed in the Natural compared to the Non-natural scene at the longest presentation time (3200 ms) (Fig 8A), which likely reflects the greater complexity of the Natural scene [45,46], and/or more efficient gaze behavior enabled by the Non-natural texturized background [46]. This interpretation is further supported by the larger saccade amplitudes in the Natural compared to the Non-natural scene, likely reflecting the need for broader gaze shifts to locate the targets, whereas the unstructured background provides fewer additional (contextual) cues and thus requires less extensive exploration. Consistent with this interpretation, peak velocities were higher in the Non-natural scenes. Natural scenes typically contain meaningful objects and spatial structure that guide gaze toward specific regions of interest and may require more detailed visual inspection. In contrast, the lack of informative structure in the simplistic backgrounds may encourage a more exploratory scanning strategy, allowing gaze shifts to be executed more rapidly. As in the Individual task, the saccade rate in the Ensemble task was higher in the 800 ms than in the 3200 ms condition in both scene contexts, possibly indicating greater time pressure to perceive the important spatial information in the 800 ms condition [47,48]. This is further reflected in the overall faster saccades performed in the 800 ms condition (Fig 8C). The larger saccade amplitudes observed at shorter presentation times, as well as the fixation allocations in the scene (Fig 9), suggest that, in the Ensemble task, observers prioritize rapid, large-amplitude eye movements to quickly sample informative scene regions, such as individual objects, early in viewing.
In contrast, longer viewing durations reduce the need for extensive exploration, shifting the focus toward the group’s mean position (Fig 9) once the relevant information has been localized. Fixation duration on the ensemble ROI was also influenced by scene context and by presentation time, with more time spent on the ensemble position in the Non-natural scene and at the longer presentation time (Fig 10). Longer fixations on the ensemble ROI in the Non-natural scene may reflect a reduced need to parse additional information to extract the ensemble position [45,49]. However, this did not confer an advantage for the locating task. While fixations on the ensemble ROI accounted for only about 4% to 11% of the Encoding phase, heatmaps suggest that participants also looked at the individual target objects during this phase (Fig 9). This leaves open questions about the emergence of ensemble percepts [14,53–55]: Is (initial) individual object localization necessary for the production of the ensemble percept [14,53–55], or might it serve the later stages of perceptual representation via noise reduction, especially in naturalistic scenes? Future investigations may focus on further uncovering how ensemble percepts are formed in spatial localization tasks and how eye movements contribute to this process.

Experimental approach considerations

Regarding the study design, both the present study and the work by Melcher and colleagues [7] tested individual and ensemble perception in separate blocks. In contrast, other studies used designs in which participants did not know in advance which information they would need to report [14]. Previous work has shown that prior knowledge about the type of information to be extracted can influence task performance [26,56]. Consequently, task structure and expectations must be considered when comparing results across presentation times. For this reason, we did not directly compare performance between the Individual and Ensemble tasks. Importantly, the two tasks also differed in response specificity: in the Ensemble task, participants reported a single, derived location (the mean position of the objects), whereas in the Individual task they reported the location of one of six possible objects. This difference may introduce confounding variation in task difficulty, further limiting direct comparisons between tasks. Nevertheless, consistent with findings obtained using both blocked and mixed designs, our results show that performance improved with increasing presentation time for both individual and ensemble perception [7,14]. We also acknowledge the possibility of learning effects across the experiment, as the same scenes were presented multiple times. To account for this, the number of stimulus repetitions was included as a covariate in the LMM analyses. Although this covariate reached statistical significance for the locating analyses, the critical effects reported above remained robust when controlling for repetition. This suggests that the main findings cannot be explained solely by learning effects. Given the differences in task demands and the potential influence of design choices such as blocking and stimulus repetition, future work using alternative task structures may allow for a more direct comparison between individual and ensemble perception across conditions.

Conclusion

Overall, we found that individual and ensemble perception of object locations are subject to effects of scene context and presentation time. Individual object perception improved within naturalistic scenes, especially at longer presentation times. Ensemble perception showed better performance with the texturized background, particularly at the shortest presentation time. Eye movements also revealed that saccade behavior (rate, amplitude, peak velocity) as well as fixation allocation are modulated by scene context and presentation time. Future investigations may shed light on how these processes contribute to different everyday behaviors in our complex environments.

Supporting information

S1 Appendix. Reference point comparison for Ensemble task.

Results of post-hoc pairwise t-tests comparing locating error between the three different reference points: center-of-gravity (COG), center-of-area (COA), and the screen center (SC).

https://doi.org/10.1371/journal.pone.0347430.s001

(PDF)

S2 Appendix. Individual object analysis.

Results of post-hoc pairwise t-tests comparing locating error between the six different target objects.

https://doi.org/10.1371/journal.pone.0347430.s002

(PDF)

Acknowledgments

We thank Alexander Göttker for help in getting started with the eye movement analysis and Emma E. M. Stewart for an introduction to the topic of LMMs.

References

1. Perry CJ, Fallah M. Feature integration and object representations along the dorsal stream visual hierarchy. Front Comput Neurosci. 2014;8:84. pmid:25140147
2. Xu Y, Chun MM. Selecting and perceiving multiple visual objects. Trends Cogn Sci. 2009;13(4):167–74. pmid:19269882
3. Ayzenberg V, Behrmann M. Development of visual object recognition. Nat Rev Psychol. 2023;3(2):73–90.
4. Corbett JE, Utochkin I, Hochstein S. The pervasiveness of ensemble perception: not just your average review. Cambridge University Press. 2023. https://doi.org/10.1017/9781009222716
5. Whitney D, Yamanashi Leib A. Ensemble Perception. Annu Rev Psychol. 2018;69:105–29. pmid:28892638
6. Haberman J, Whitney D. Rapid extraction of mean emotion and gender from sets of faces. Curr Biol. 2007;17(17):R751-3. pmid:17803921
7. Melcher D, Huber-Huber C, Wutz A. Enumerating the forest before the trees: The time courses of estimation-based and individuation-based numerical processing. Attention, Perception, & Psychophysics. 2021;83(3).
8. Alvarez GA, Cavanagh P. The capacity of visual short-term memory is set both by visual information load and by number of objects. Psychol Sci. 2004;15(2):106–11. pmid:14738517
9. Bar M, Aminoff E. Cortical analysis of visual context. Neuron. 2003;38(2):347–58. pmid:12718867
10. Lauer T, Cornelissen THW, Draschkow D, Willenbockel V, Võ ML-H. The role of scene summary statistics in object recognition. Sci Rep. 2018;8(1):14666. pmid:30279431
11. Furtak M, Mudrik L, Bola M. The forest, the trees, or both? Hierarchy and interactions between gist and object processing during perception of real-world scenes. Cognition. 2022;221:104983. pmid:34968994
12. Draschkow D, Wolfe JM, Võ MLH. Seek and you shall remember: scene semantics interact with visual search to build better memories. J Vis. 2014;14(8):10. pmid:25015385
13. Hollingworth A. Scene and position specificity in visual memory for objects. J Exp Psychol Learn Mem Cogn. 2006;32(1):58–69. pmid:16478340
14. Neumann MF, Ng R, Rhodes G, Palermo R. Ensemble coding of face identity is not independent of the coding of individual identity. Q J Exp Psychol (Hove). 2018;71(6):1357–66. pmid:28398128
15. Cowan N. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. Behav Brain Sci. 2001;24(1):87–114; discussion 114-85. pmid:11515286
16. Luck SJ, Vogel EK. The capacity of visual working memory for features and conjunctions. Nature. 1997;390(6657):279–81. pmid:9384378
17. Fukuda K, Vogel E, Mayr U, Awh E. Quantity, not quality: the relationship between fluid intelligence and working memory capacity. Psychon Bull Rev. 2010;17(5):673–9. pmid:21037165
18. Cha O, Blake R, Gauthier I. Contribution of a common ability in average and variability judgments. Psychon Bull Rev. 2022;29(1):108–15. pmid:34282557
19. Chang T-Y, Gauthier I. Domain-general ability underlies complex object ensemble processing. J Exp Psychol Gen. 2022;151(4):966–72. pmid:34542311
20. Boduroglu A, Yildirim I. Statistical summary representations of bound features. Attention, Perception, & Psychophysics. 2020;82(2):840–51.
21. Sun P, Chu V, Sperling G. Multiple concurrent centroid judgments imply multiple within-group salience maps. Atten Percept Psychophys. 2021;83(3):934–55. pmid:33400221
22. Lew TF, Vul E. Ensemble clustering in visual working memory biases location memories and reduces the Weber noise of relative positions. J Vis. 2015;15(4):10. pmid:26360154
23. Alvarez GA, Oliva A. The representation of simple ensemble visual features outside the focus of attention. Psychol Sci. 2008;19(4):392–8. pmid:18399893
24. Jia L, Cheng M, Lu J, Wu Y, Wang J. Context consistency improves ensemble perception of facial expressions. Psychon Bull Rev. 2023;30(1):280–90. pmid:35882720
25. Oriet C, Hozempa K. Incidental statistical summary representation over time. Journal of Vision. 2016;16(3):3.
26. Yildirim I, Öğreden O, Boduroglu A. Impact of spatial grouping on mean size estimation. Attention, Perception, & Psychophysics. 2018;80(7):1847–62.
27. Li H, Ji L, Tong K, Ren N, Chen W, Liu CH, et al. Processing of Individual Items during Ensemble Coding of Facial Expressions. Front Psychol. 2016;7:1332. pmid:27656154
28. Chong SC, Treisman A. Representation of statistical properties. Vision Res. 2003;43(4):393–404. pmid:12535996
29. König P, Wilming N, Kietzmann TC, Ossandón JP, Onat S, Ehinger BV, et al. Eye movements as a window to cognitive processes. JEMR. 2016;9(5).
30. Zelinsky GJ, Loschky LC. Eye movements serialize memory for objects in scenes. Percept Psychophys. 2005;67(4):676–90. pmid:16134461
31. Tatler BW, Gilchrist ID, Land MF. Visual memory for objects in natural scenes: from fixations to object files. Q J Exp Psychol A. 2005;58(5):931–60. pmid:16194942
32. Mills M, Hollingworth A, Van der Stigchel S, Hoffman L, Dodd MD. Examining the influence of task set on eye movements and fixations. J Vis. 2011;11(8):17. pmid:21799023
33. Melcher D, Kowler E. Visual scene memory and the guidance of saccadic eye movements. Vision Res. 2001;41(25–26):3597–611. pmid:11718798
34. Mostofi N, Zhao Z, Intoy J, Boi M, Victor JD, Rucci M. Spatiotemporal Content of Saccade Transients. Curr Biol. 2020;30(20):3999-4008.e2. pmid:32916116
35. Oksama L, Hyönä J. Position tracking and identity tracking are separate systems: Evidence from eye movements. Cognition. 2016;146:393–409. pmid:26529194
36. Oldfield RC. The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia. 1971;9(1):97–113. pmid:5146491
37. World Medical Association. World Medical Association Declaration of Helsinki: Ethical principles for medical research involving human subjects. JAMA. 2013.
38. Portilla J, Simoncelli EP. A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients. International Journal of Computer Vision. 2000;40(1):49–70.
39. Melcher D, Kowler E. Shapes, surfaces and saccades. Vision Res. 1999;39(17):2929–46. pmid:10492819
40. Rayner K, Slowiaczek ML, Clifton C Jr, Bertera JH. Latency of sequential eye movements: implications for reading. J Exp Psychol Hum Percept Perform. 1983;9(6):912–22. pmid:6227700
41. Bahill AT, Clark MR, Stark L. The main sequence, a tool for studying human eye movements. Mathematical Biosciences. 1975;24(3–4):191–204.
42. Harris CM, Wolpert DM. The main sequence of saccades optimizes speed-accuracy trade-off. Biol Cybern. 2006;95(1):21–9. pmid:16555070
43. Võ ML-H, Boettcher SE, Draschkow D. Reading scenes: how scene grammar guides attention and aids perception in real-world environments. Curr Opin Psychol. 2019;29:205–10. pmid:31051430
44. Võ MLH. The meaning and structure of scenes. Vision Research. 2021;181:10–20.
45. Bradley MM, Houbova P, Miccoli L, Costa VD, Lang PJ. Scan patterns when viewing natural scenes: emotion, complexity, and repetition. Psychophysiology. 2011;48(11):1544–53. pmid:21649664
46. Semizer Y, Rosenholtz R. The effect of background clutter on visual search in video conferencing. Cogn Res Princ Implic. 2025;10(1):40. pmid:40634564
47. Kim J-E, Nembhard DA. Eye movement as a mediator of the relationships among time pressure, feedback, and learning performance. International Journal of Industrial Ergonomics. 2019;70:116–23.
48. Ries AJ, Callahan-Flintoft C, Madison A, Dankovich L, Touryan J. Decoding target discriminability and time pressure using eye and head movement features in a foraging search task. Cogn Res Princ Implic. 2025;10(1):53. pmid:40846822
49. Araujo C, Kowler E, Pavel M. Eye movements during visual search: the costs of choosing the optimal path. Vision Res. 2001;41(25–26):3613–25. pmid:11718799
50. Ringer RV, Coy AM, Larson AM, Loschky LC. Investigating Visual Crowding of Objects in Complex Real-World Scenes. Iperception. 2021;12(2):2041669521994150. pmid:35145614
51. Whitney D, Levi DM. Visual crowding: a fundamental limit on conscious perception and object recognition. Trends Cogn Sci. 2011;15(4):160–8. pmid:21420894
52. Doerig A, Bornet A, Rosenholtz R, Francis G, Clarke AM, Herzog MH. Beyond Bouma’s window: How to explain global aspects of crowding? PLOS Computational Biology. 2019;15(5):e1006580.
53. Utochkin IS, Choi J, Chong SC. A population response model of ensemble perception. Psychol Rev. 2024;131(1):36–57. pmid:37011150
54. Robinson MM, Brady TF. A quantitative model of ensemble perception as summed activation in feature space. Nat Hum Behav. 2023;7(10):1638–51. pmid:37402880
55. Harrison WJ, McMaster JMV, Bays PM. Limited memory for ensemble statistics in visual change detection. Cognition. 2021;214:104763. pmid:34062339
56. Botta F, Santangelo V, Raffone A, Lupiáñez J, Belardinelli MO. Exogenous and endogenous spatial attention effects on visuospatial working memory. Q J Exp Psychol (Hove). 2010;63(8):1590–602. pmid:20112160