Abstract
Perceptual organization in the human visual system involves neural mechanisms that spatially group and segment image areas based on local feature similarities, such as the temporal correlation of luminance changes. Successful segmentation models in computer vision, including graph-based algorithms and vision transformers, leverage similarity computations across all elements in an image, suggesting that effective similarity-based grouping should rely on a global computational process. However, whether human vision employs a similarly global computation remains unclear due to the absence of appropriate methods for manipulating similarity matrices across multiple elements within a stimulus. To investigate how “temporal similarity structures” influence human visual segmentation, we developed a stimulus generation algorithm based on the Vision Transformer. This algorithm independently controls within-area and cross-area similarities by adjusting the temporal correlation of luminance, color, and spatial phase attributes. To assess human segmentation performance with these generated texture stimuli, participants completed a temporal two-alternative forced-choice task, identifying which of two intervals contained a segmentable texture. The results showed that segmentation performance is significantly influenced by the configuration of both within- and cross-correlation across the elements, regardless of attribute type. Furthermore, human performance is closely aligned with predictions from a graph-based computational model, suggesting that human texture segmentation can be approximated by a global computational process that optimally integrates pairwise similarities across multiple elements.
Author Summary
How does the human visual system use temporal information to segment objects in a dynamic scene? When observing ever-changing environments, our brains must determine which regions belong to the same object and which are distinct. However, the mechanisms underlying this process remain poorly understood. In this study, we investigate how “temporal similarity structures”—patterns of correlation over time—affect visual segmentation. We developed a novel method for generating dynamic stimuli with precisely controlled temporal similarity and systematically tested how within-area and cross-area temporal correlations influence segmentation. Participants performed a task in which they identified segmentable textures, and the results showed that segmentation performance improves when regions exhibit strong internal consistency but lower similarity with adjacent regions. Our findings revealed that human visual segmentation relies on a global computational mechanism that integrates temporal similarity cues to distinguish visual structures. Additionally, our stimulus generation framework provides a powerful tool for future research on perceptual organization and mid-level vision.
Citation: Chen Y-J, Sun Z, Nishida S (2025) Human visual grouping based on within- and cross-area temporal correlations. PLoS Comput Biol 21(9): e1013001. https://doi.org/10.1371/journal.pcbi.1013001
Editor: Yuanning Li, ShanghaiTech University, CHINA
Received: March 25, 2025; Accepted: August 30, 2025; Published: September 11, 2025
Copyright: © 2025 Chen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All resources, including the experiment code, demo video, raw data, high-resolution figures in the paper, and bootstrapping results, have been uploaded to the Open Science Framework platform. The project can be accessed at the following link: https://osf.io/9yncd/?view_only=eb481f2b28ab42e29ff267833f70b8e5.
Funding: This work was supported in part by the Spring Fellowship (Grant Number JPMJFS2123) for SZ and YJC and in part by JSPS Grants-in-Aid for Scientific Research (KAKENHI) (Grant Numbers JP20H00603, JP20H05950, JP20H05957, and 24H00721) for SN. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Visual perceptual organization is a fundamental function of the human visual system that enables the grouping of objects based on various attributes, such as luminance, color, and temporal dynamics. These attributes are captured locally by different neurons, each of which has a limited receptive field. To segment a coherent object from its surroundings, an additional process is required to compare local neuronal signals across space and determine whether they belong to the same object. One of the key computational principles underlying this process is similarity-based grouping, a concept introduced in Gestalt psychology, which posits that elements with similar features tend to be grouped together [1,2]. Similarity operates across a wide range of visual attributes, both spatial and temporal. Spatial cues include orientation, color hue [3–5], and luminance [6], as well as motion properties such as direction and speed (i.e., common fate) [7,8]. Temporal cues involve temporal synchrony [9–11], temporal structure [12–14], and generalized common fate [15]. The present study investigated the computational mechanisms of similarity-based grouping, focusing on the temporal correlation of local feature changes. Although our argument is primarily based on temporal cues, the proposed framework may be generalized to similarity-based grouping involving other visual attributes.
From a computational perspective, the decision to group two local elements depends not only on their pairwise similarity but also on the similarities between all other element pairs in the visual scene. For example, consider a spatial arrangement of four elements (A, B, C, and D), as illustrated in Fig 1. If the pairwise similarity is high within {A, B} and within {C, D}, but relatively low for all other element pairs, the most computationally reasonable solution is to segregate the elements into two distinct groups: {A, B} and {C, D}. To achieve this, the visual system must first evaluate the pairwise similarities between all local elements and then derive a global solution that maximizes within-group similarities while minimizing cross-group similarities. When the number of elements is small (Fig 1A), this computation is relatively straightforward. However, as the number of elements increases, the complexity of the grouping task grows exponentially (Fig 1B).
Various global computational approaches have been proposed in the field of computer vision (CV) to address visual segmentation. Self-attention transformers, which leverage pairwise similarity across all elements, have demonstrated performance comparable to human grouping abilities [16]. Similarly, Conditional Random Fields (CRFs), a specialized form of Markov Random Fields (MRFs), have been successfully applied to model dependencies between local similarities, enabling high-quality image segmentation [17]. Additionally, graph-based methods, such as eigendecomposition on the graph Laplacian—a linear transformation of the similarity matrix—have achieved state-of-the-art performance [18,19]. These approaches share similarities with human perceptual strategies, particularly in motion-based segmentation tasks [20–22].
Despite these advancements, previous studies on human vision have not systematically examined how the global similarity structure of a stimulus influences perceptual grouping. Prior research has primarily manipulated either within-group similarity or cross-group similarity, but not both simultaneously. For example, studies by Lee and Blake (1999) and Morgan and Castet (2002) modulated the level of temporal correlation in motion direction changes within a target area, thereby altering within-group similarity while leaving cross-group similarity uncontrolled [13,23]. Similarly, other studies have focused on modulating within-group similarity while keeping cross-group similarity close to zero [11,15,24].
One possible reason for the lack of research on the effects of global similarity structure in human perceptual grouping is the technical challenge of independently manipulating different aspects of similarity. Specifically, modifying the similarity of a single element inevitably affects its relationship with all other elements in the scene. To overcome this limitation, we developed a novel stimulus generation method that leverages a vision transformer (ViT) [25]. This approach allowed us to precisely manipulate pairwise similarities across all element pairs. Our stimulus design focused on the temporal effects of visual segmentation, defining similarity between element pairs based on the correlation of their temporal sequences. By utilizing well-defined loss functions, our stimuli were free from spatial artifacts, enabling us to isolate and examine human performance in purely similarity-based visual segmentation tasks.
This study makes three key contributions. First, we introduce and evaluate a ViT-based stimulus generator with a customized loss function that enables precise control over pairwise element temporal similarity. Second, using the generated stimuli, we conduct a psychophysical experiment to examine how perceptual segregation is influenced by both within-group and cross-group temporal similarities. Finally, we introduce a graph cut model—partially based on previous work [18]—as a simple segmentation framework to assess whether computations based on the global “temporal similarity structure” can approximate human perceptual performance.
Study 1. Stimulus generation incorporating ViT-based generator
Methodology
ViT-generated feature tensor.
This section describes our approach to generating a 2D spatiotemporal stimulus using a lightweight ViT. The goal of the model was to generate a feature tensor F from a sampled noise tensor Z, satisfying user-assigned cosine similarity (temporal correlation) constraints within the two areas, denoted r_A and r_B, and across the two areas, denoted r_AB (Fig 2).
The luminance bulb is shown as an example here, but in the actual experiment, the stimulus could be a luminance bulb, color bulb, or luminance Gabor. The pairwise similarity heatmap is shown on the right; the space is flattened into a single dimension for convenient plotting. The position on the matrix corresponding to the 2D location is labeled, with the blue part corresponding to Area A and the red part corresponding to Area B. A demo time series is plotted at the bottom; the left figure demonstrates the elements with high temporal similarity, while the right figure demonstrates those with low similarity. The demo video can be viewed at https://osf.io/9yncd/?view_only=eb481f2b28ab42e29ff267833f70b8e5.
- 1. Problem definition
- We use a 2D feature map to represent spatiotemporal data. Consider a feature tensor F, defined on a spatial grid Ω, where H and W represent the height and width of the spatial grid. The spatial grid is defined as Ω = {(i, j) : 1 ≤ i ≤ H, 1 ≤ j ≤ W}. At each spatial location (i, j) ∈ Ω, there exists a corresponding temporal sequence f_{i,j} ∈ R^N. Here, N represents the length of the temporal sequence, set to 30 frames. F can be formally defined as follows:
F = { f_{i,j} ∈ R^N : (i, j) ∈ Ω }
- • Splitting into sub-areas: To impose specific temporal correlation constraints, we partitioned Ω into two subsets, Ω_A and Ω_B:
Ω_A = {(i, j) ∈ Ω : 1 ≤ j ≤ 8}, Ω_B = {(i, j) ∈ Ω : 9 ≤ j ≤ 16}
Here, we set H as 8 and W as 16. Therefore, the feature tensor is an 8 × 16 rectangle, with two 8 × 8 square sub-areas on the left and right sides.
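The grid and area definitions above can be sketched in a few lines of NumPy (the array and mask names are illustrative, not the authors' code):

```python
import numpy as np

H, W, N = 8, 16, 30  # spatial grid height/width and temporal length (frames)

# Feature tensor F: one N-frame temporal sequence per spatial location.
F = np.zeros((H, W, N))

# Boolean masks splitting the 8 x 16 grid into two 8 x 8 sub-areas.
area_A = np.zeros((H, W), dtype=bool)
area_B = np.zeros((H, W), dtype=bool)
area_A[:, : W // 2] = True   # left half (Omega_A)
area_B[:, W // 2 :] = True   # right half (Omega_B)
```

Each of the 128 grid locations thus carries a 30-frame sequence, and the two masks partition the grid exactly.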
- Temporal correlation constraints
We aimed to generate F whose spatiotemporal patches exhibit the assigned cosine similarities. The similarities were calculated using the following formula:
sim(p, q) = |⟨f_p, f_q⟩| / (‖f_p‖ ‖f_q‖)
The similarity was calculated between the two temporal vectors at spatial locations p and q. We intended to add the following constraints:
- • Within-area temporal correlation: For any pair (p, q) where both p and q belong to the same area (Ω_A or Ω_B), the similarity should be close to a desired constant, i.e., r_A and r_B, respectively. This can be expressed as follows:
sim(p, q) ≈ r_A for p, q ∈ Ω_A;  sim(p, q) ≈ r_B for p, q ∈ Ω_B
Throughout this study, r_A and r_B were kept constant.
- • Cross-area temporal correlation: For any pair (p, q) spanning the two sub-areas, i.e., p ∈ Ω_A and q ∈ Ω_B (or vice versa), the similarity should be close to a distinct constant r_AB:
sim(p, q) ≈ r_AB
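Under these definitions, the pairwise absolute cosine similarity can be sketched as follows (a NumPy illustration; the function name is ours):

```python
import numpy as np

def abs_cosine_similarity(F):
    """Pairwise |cosine similarity| between the temporal sequences of all
    spatial locations. F has shape (H, W, N); returns an (H*W, H*W) matrix."""
    X = F.reshape(-1, F.shape[-1])                    # flatten space
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize in time
    return np.abs(X @ X.T)
```

Because the absolute value is taken, a sequence and its sign-flipped copy are maximally similar, which is what makes the resulting stimuli drift-balanced (see the motion discussion in Study 1).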
In addition to enforcing temporal correlation constraints, we aim to remove any other distinguishing features that could facilitate segmentation between Ω_A and Ω_B, such as spatial cues (e.g., luminance or contrast differences), texture patterns, or higher-order statistical irregularities. In other words, our goal is to ensure that the stimuli rely solely on temporal correlation differences for segmentation. Ideally, these stimuli would exhibit i.i.d. white noise–like temporal spectra, subject only to the specified temporal correlation constraints. Finding an analytical solution for F that satisfies all these constraints could be mathematically challenging. Instead, we leveraged a neural generative approach using a lightweight ViT model to obtain a numerical approximation.
- Transformer-based generative model
Our generative model was based on a ViT backbone, incorporating convolutional residual blocks at both the encoding and decoding stages. The architecture can be summarized as follows:
- Encoder: A series of 2D CNN residual blocks [26] processed an initial latent tensor Z, expanding the feature dimension from N to 128. In our implementation, we used four stacked ResidualBlock2D modules, each consisting of convolution, normalization (InstanceNorm2D), and GELU activation, with a residual connection between the two layers.
- Transformer layers: We flattened the spatial dimensions and treated each location as a “token” in standard transformer terminology. Each token, represented by 128-dimensional features, attended to every other token via multi-head self-attention (a similarity representation based on dot-product distance). Stacking multiple such layers enabled the model to capture global temporal correlations across the entire feature map.
- Decoder: Another three convolutional layers decoded the features into the target spatiotemporal volume F, condensing the feature dimension from 128 to N. We applied a non-linear activation (tanh) to constrain the output to a controlled numeric range (−1, 1).
Mathematically, we defined the network as a function G_θ. The overall generation process, which transforms a white-noise prior into the target stimulus, could be expressed as:
F = G_θ(Z)
The model parameters were optimized so that the resulting output fulfilled the cosine similarity constraints defined previously. The transformer was selected because the self-attention mechanism computes pairwise temporal correlations (via the Query, Key, and Value operations) between all tokens in a layer. This is precisely the structural property we intended to exploit, aiming to modulate temporal correlations (cosine similarity) within and across the areas Ω_A and Ω_B. By stacking multiple transformer layers, the model could iteratively adjust and refine these pairwise relationships, converging on a solution that respects the target within-area and cross-area similarities.
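The pairwise-similarity property of self-attention that motivated this choice can be illustrated with a minimal single-head sketch in NumPy (the actual model used multi-head attention inside a full ViT; all names and sizes here are illustrative):

```python
import numpy as np

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over tokens of shape (T, d).
    The Q @ K.T product is a (T, T) pairwise dot-product similarity matrix,
    which is what lets every spatial token interact with every other one."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 128, 128   # 8 x 16 spatial tokens, 128-dimensional features
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out, attn = self_attention(x, Wq, Wk, Wv)
```

Each row of `attn` is a normalized similarity profile of one location against all 128 others, mirroring the global comparison the stimulus constraints require.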
- Optimization
We optimized the model parameters to produce a stimulus F that enforced both the within-area (within Ω_A and Ω_B) and cross-area temporal correlations described earlier, while also constraining the signal to have similar mean and variance values across areas.
The input for the generative model was initialized as a white noise tensor, i.e., random values drawn from a normal distribution with zero mean and moderate variance, encouraging broad coverage of frequencies.
To optimize the stimulus F, we defined a customized loss function composed of several terms, each constraining a different statistical property:
- Within-area temporal correlation constraints: We computed the cosine similarity among all pairs of vectors within each sub-area. This resulted in two similarity matrices, S_A and S_B, whose entries represent similarity values within the regions Ω_A and Ω_B, respectively. The within-area loss is defined as the summation of the absolute differences between the computed similarities and the desired similarity constants r_A and r_B. The within-area temporal correlation losses L_A and L_B were then:
L_A = Σ_{p,q ∈ Ω_A, p ≠ q} |S_A(p, q) − r_A|,  L_B = Σ_{p,q ∈ Ω_B, p ≠ q} |S_B(p, q) − r_B|
- Cross-area temporal correlation constraints: The cosine similarity matrix S_AB was computed between all location pairs spanning the areas Ω_A and Ω_B. The cross-area temporal correlation loss L_AB was then:
L_AB = Σ_{p ∈ Ω_A, q ∈ Ω_B} |S_AB(p, q) − r_AB|
- Mean and variance regularization: To keep the whole tensor following a similar feature distribution, we added a loss to let the generated stimuli converge on specific statistics. Let F_A and F_B denote the subsets of F corresponding to Ω_A and Ω_B, respectively. We aimed to match both global mean (μ₀) and variance (σ₀²) values, defined as:
L_stat = |mean(F) − μ₀| + |var(F) − σ₀²|
where mean(·) and var(·) computed a single scalar mean and variance for the whole input tensor, respectively. Here, we set μ₀ to 0 and σ₀² to the variance of a uniform distribution between −1 and 1, which the output distribution was intended to approximate.
- Equalizing global statistics: To ensure that the feature distributions across Ω_A and Ω_B exhibit similar mean and variance statistics, we enforced constraints on the mean and variance of each sub-region. We defined an additional penalty:
L_eq = |mean(F_A) − mean(F_B)| + |var(F_A) − var(F_B)|
- Variance expansion (“white noise” tendency): To increase variability within each sub-area and push the solution towards a white noise–like distribution, we applied a negative penalty on intra-area pairwise distances:
L_var = −( mean_{p,q ∈ Ω_A} ‖f_p − f_q‖ + mean_{p,q ∈ Ω_B} ‖f_p − f_q‖ )
By negatively weighting the average pairwise differences, we encouraged the model to spread out feature values, thereby increasing variance.
- • Final loss function: Combining all of the above components, we obtained the total loss as follows:
L_total = λ₁ L_A + λ₂ L_B + λ₃ L_AB + λ₄ L_stat + λ₅ L_eq + λ₆ L_var
where λ₁, …, λ₆ are scalar weights to balance the contributions of each term.
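The composite loss can be sketched in NumPy as follows. This is an illustrative re-implementation, not the authors' code: the weights, the off-diagonal handling, and the uniform-distribution variance target (1/3 for U(−1, 1)) are our assumptions, and the real training operated on differentiable tensors.

```python
import numpy as np

def abs_cos_sim(X):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.abs(Xn @ Xn.T)

def total_loss(F, mask_A, mask_B, r_A, r_B, r_AB,
               mu0=0.0, var0=1 / 3, lam=(1.0, 1.0, 1.0, 0.1, 0.1, 0.01)):
    """Sum of the six loss terms described above (illustrative weights)."""
    X = F.reshape(-1, F.shape[-1])
    a, b = mask_A.ravel(), mask_B.ravel()
    S = abs_cos_sim(X)

    def within(idx, target):  # mean |S - target| over off-diagonal pairs
        Ssub = S[np.ix_(idx, idx)]
        off = ~np.eye(Ssub.shape[0], dtype=bool)
        return np.abs(Ssub[off] - target).mean()

    L_A, L_B = within(a, r_A), within(b, r_B)
    L_AB = np.abs(S[np.ix_(a, b)] - r_AB).mean()
    L_stat = abs(F.mean() - mu0) + abs(F.var() - var0)
    L_eq = (abs(F[mask_A].mean() - F[mask_B].mean())
            + abs(F[mask_A].var() - F[mask_B].var()))

    def spread(idx):  # mean intra-area pairwise distance
        V = X[idx]
        return np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1).mean()

    L_var = -(spread(a) + spread(b))  # negative penalty: reward spread
    terms = (L_A, L_B, L_AB, L_stat, L_eq, L_var)
    return sum(l * t for l, t in zip(lam, terms)), terms
```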
- Training procedure
The network parameters (including those of the transformer and the encoder/decoder blocks) were optimized jointly via an AdamW optimizer. At each iteration, we:
- Generated an output stimulus F from the current model.
- Computed all required similarity, variance, and mean statistics to obtain the total loss.
- Performed backpropagation to update model parameters.
Since we began with a white-noise initialization and imposed temporal correlation constraints alongside variance/mean regularizations, the final output remained statistically rich (i.e., maintained high variability) while respecting the assigned within- and cross-area temporal correlation targets. This approach produced a refined F that approximated the ideal solution to our temporal correlation objectives. We generated 30 different instances of F for each temporal correlation pair.
Results
Invisibility of spatial cues
To confirm that the generated stimuli could only be distinguished based on the designated temporal similarity structure, we first verified that they were unlikely to be differentiated by other spatial indicators. We calculated two standard measures of spatial cues: the spatial average value per frame, representing the attribute’s convergent value (e.g., mean luminance), and the spatial root mean square deviation per frame, indicating the attribute’s concentration magnitude (e.g., luminance contrast). These statistics were selected based on previous studies demonstrating that the human visual system is particularly sensitive to such regularities when evaluating texture [6,27].
Analysis of the first index revealed that the mean luminance difference between the left and right partitions (Areas A and B in Fig 2) across 30 frames averaged 0.17% over all temporal correlation pairs, with a peak difference of 0.24%. This negligible discrepancy suggested that participants were unlikely to perceive a difference based on mean luminance alone. Similarly, for the second index, the average root mean square deviation difference between partitions across 30 frames was 0.34% over all temporal correlation pairs, reaching a maximum of 0.61%. This magnitude was well below the threshold required for human perception of such differences, confirming that spatial cues were effectively minimized.
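The two spatial-cue indices can be computed per frame as follows (a NumPy sketch with illustrative names):

```python
import numpy as np

def spatial_cue_differences(F, mask_A, mask_B):
    """Frame-by-frame differences between areas in the two spatial cues:
    the spatial mean (e.g., mean luminance) and the spatial root mean
    square deviation (e.g., luminance contrast)."""
    A, B = F[mask_A], F[mask_B]            # (n_elements, N) each
    mean_diff = np.abs(A.mean(axis=0) - B.mean(axis=0))
    rms = lambda X: np.sqrt(((X - X.mean(axis=0)) ** 2).mean(axis=0))
    rms_diff = np.abs(rms(A) - rms(B))
    return mean_diff.mean(), rms_diff.mean()
```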
Validation of cosine similarity distribution
Next, we assessed whether the stimuli generated by the ViT exhibited the designated cosine similarity values. Fig 3 presents the temporal correlation distribution, obtained by directly computing the absolute cosine similarity of the generated stimuli; both within-area and cross-area temporal correlations closely align with the assigned values. Furthermore, additional quantitative analysis revealed high predictability of the measured temporal correlation from the assigned values (R² = 0.99). These results demonstrated that our generation protocol effectively manipulated the similarity structure of the stimuli as intended.
The assigned within- and cross-area correlation values are shown on the axis labels. Each subgraph displays the empirical temporal correlation distribution separately for within- and cross-area temporal correlations, with within-area temporal correlations represented in blue and cross-area temporal correlations in orange. The lower-right quadrant presents a scatter plot of the desired temporal correlation values (x-axis) versus the actual values (y-axis). The red line represents the identity line (y = x) as a reference.
Quasi-white noise characteristics
Finally, we analyzed the spectral properties of the generated stimuli. The temporal spectrum for Areas A and B was computed separately (Fig 4). Regardless of the temporal correlation pair, the spectrum exhibited a near-uniform distribution, indicating that the generated stimuli closely resemble white noise. This broadband energy distribution ensures activation of diverse neural mechanisms. Additionally, there were no significant spectral differences between Areas A and B during the signal interval, suggesting that simple linear operations would be insufficient to distinguish between the two partitions.
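This spectral check can be performed as follows (a NumPy sketch; the function name is ours, and a near-flat average spectrum indicates white-noise-like dynamics):

```python
import numpy as np

def temporal_amplitude_spectrum(F, mask):
    """Amplitude spectrum of each element's temporal sequence in one area,
    averaged over elements. F has shape (H, W, N)."""
    X = F[mask]                                    # (n_elements, N)
    return np.abs(np.fft.rfft(X, axis=1)).mean(axis=0)
```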
Exclusion of higher-order visual feature visibility
Our stimulus design also excluded higher-order statistical features that could be used in the segmentation task, such as local motion differences and temporal pattern recognition.
Humans can perceive shape based on local motion differences (i.e., shape-from-motion) [8,28,29]. However, our stimulus design mitigates this possibility. Temporal similarity was computed using the absolute value of cosine similarity, meaning that simultaneous increases or decreases in brightness were treated as having the same magnitude of temporal similarity. This approach ensured that any potential motion signals were drift-balanced across the left and right partitions, eliminating systematic directional motion cues that could facilitate segmentation.
Additionally, humans can distinguish regions by recognizing different temporal regularities after long exposure or repeated presentations [30]. However, our stimuli were designed to approximate quasi-white noise in the temporal domain, making it difficult to extract stable, discernible patterns in a single exposure. Given the short presentation duration and the absence of repeated trials with the same sequence, participants likely lacked sufficient information to develop reliable expectations about temporal structures.
Study 2. Human segmentation performance for the generated stimuli
In this section, we report the results of a psychophysical experiment designed to investigate how within- and cross-area temporal correlations influence human segmentation performance. By independently manipulating within- and cross-area correlations using the stimulus generation method described in the previous section, we directly test how human performance covaries with structured changes in the similarity matrix. The results are expected to provide insights into whether segmentation depends on simple local statistics or on a global computation that integrates pairwise correlations across spatial locations. The results indicate that both within- and cross-area temporal correlations influence human segmentation performance in an antagonistic way. In the next section (Study 3), we computationally analyze the pattern of the obtained results using a model of texture segregation based on a graph-cut algorithm.
Methodology
Ethics statement
The study protocol was conducted in accordance with the ethical standards of the Declaration of Helsinki (except for preregistration) and was approved by the Ethics Committee of Kyoto University (KUIS-EAR-2020–003). All participants received full explanations of the experimental procedures and provided written informed consent before the experiment began.
Apparatus
The stimuli were displayed on a VIEWPixx/3D LCD monitor (VPixx Technologies, Saint-Bruno-de-Montarville, Canada) with a resolution of 1,920 × 1,080 pixels and a refresh rate of 30 Hz. The display luminance ranged from 1.8 cd/m² (lowest) to 96.7 cd/m² (highest), with a mean luminance of 48.4 cd/m². The linear output of luminance was calibrated for each channel using an i1Pro chromometer (VPixx Technologies). Each pixel subtended 1.3 arcmin at a viewing distance of 70 cm. Participants were seated in a dark room with a chinrest to stabilize head position. The experiments were programmed using the Psychophysics Toolbox [31] in MATLAB release 2022a (MathWorks, Natick, MA, USA).
Stimuli
The actual stimulus (demonstrated in Fig 2) was based on the generated sequence F, randomly drawn from the 30 variants. The stimulus sequence for one presentation was a dynamic 8 × 16 element lattice lasting for 30 frames (1 s). Each element subtended 0.5° in diameter, with the entire lattice subtending 4° × 8°. Each value in F was transformed into a 0.5° × 0.5° element within the lattice, and all elements were concatenated, yielding the final stimulus. With luminance as the attribute, the element was defined as a Gaussian bulb as follows:
L(x, y, t) = L₀ [1 + F_{i,j}(t) exp(−(x² + y²) / (2σ²))]
where x and y are spatial locations on the plane, 0 is the center of the element, and t is the frame. i, j is the location within F, where i is an integer ranging from 1 to 8, and j is an integer ranging from 1 to 16. L₀ is the mean luminance of the screen, and σ is the Gaussian window of the element, set to 5.25 arcmin.
The element was also a Gaussian bulb with color as the attribute, but with three separate channel assignments, where c is the iso-luminant index between red and green.
Taking spatial phase as the attribute, the element changed to a Gabor patch as follows:
G(x, y, t) = L₀ [1 + exp(−(x² + y²) / (2σ²)) cos(2π f_s (x cos θ + y sin θ) + (π/4) F_{i,j}(t))]
where f_s is the spatial frequency, set to 2 cpd, and θ is the orientation of the Gabor pattern, which was randomly drawn from a uniform distribution over the full range of orientations. The scaling term applied to F_{i,j}(t) in the phase ensured that the spatial phase change could not exceed 90° per frame, thereby preventing any effect of luminance flickering.
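As an illustration, a single luminance element can be rendered from one value of F as follows (a NumPy sketch; the function name is ours, and the pixel count is derived from the 1.3 arcmin/pixel viewing geometry described in the Apparatus section):

```python
import numpy as np

def gaussian_bulb(value, L0=48.4, sigma=5.25, px_size=1.3, diameter_deg=0.5):
    """Render one luminance element: a Gaussian-windowed modulation of the
    mean screen luminance L0 (cd/m^2). `value` is F_ij(t), in (-1, 1);
    sigma and px_size are in arcmin."""
    n_px = int(round(diameter_deg * 60 / px_size))   # ~23 pixels for 0.5 deg
    half = n_px // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1] * px_size  # arcmin
    window = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return L0 * (1 + value * window)
```

The modulation peaks at the element center and falls to roughly the mean luminance at the element edge, so the lattice carries no hard spatial boundaries.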
Participants
The experiment included two of the authors and three naïve participants (five males; mean age: 25.8 years), all of whom reported normal or corrected-to-normal vision. Participants were informed of the study’s nature and provided written informed consent before participation. Monetary compensation was provided upon completion. Prior to the main experiment, a revised version of the minimum motion method [32,33] was used to measure the equiluminant ratio between the red (R) and green (G) channels for each participant. Participants viewed a rightward-drifting Gabor in which the R and G channels were 180° out of phase and adjusted the channel ratio to minimize the perception of motion. The mean R:G ratio was determined to be 0.58.
Design and procedure
To measure participants’ segmentation performance, we adopted a two-interval forced choice (2IFC) paradigm. In each trial, participants sequentially viewed two 1,000-ms video clips, separated by a 100-ms interval. One clip, designated as the signal interval, was distinguishable due to a difference between within- and cross-area temporal correlations. The other clip, referred to as the null interval, had matched within- and cross-area correlations, with the same value as the signal interval’s within-area correlation, and thus contained no segmentation cue by design. An example is shown in Fig 2. Participants were instructed to identify the interval in which the texture appeared divided into two distinct regions and indicated their choice using a keyboard.
We tested nine levels of cosine similarity (0.1 to 0.9, in steps of 0.1) for both within- and cross-area correlations. Instead of testing all 81 possible pairs, we tested 45 pairs in which the within-area correlation was greater than or equal to the cross-area correlation. Pairs where the cross-area correlation exceeded the within-area correlation were excluded, as no segmentation was expected under such conditions. For each correlation pair, participants completed 30 trials of the 2IFC task.
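The condition set can be enumerated directly (a short Python sketch; variable names are ours):

```python
import numpy as np

# Nine similarity levels; keep only pairs where within >= cross correlation.
levels = np.round(np.arange(0.1, 1.0, 0.1), 1)   # 0.1, 0.2, ..., 0.9
pairs = [(w, c) for w in levels for c in levels if w >= c]

n_pairs = len(pairs)            # 45 tested conditions
n_trials = n_pairs * 30 * 3     # 30 trials/pair x 3 attributes
```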
To examine the generalizability of temporal correlation–based segmentation, we included three visual attributes: luminance, color (red/green), and spatial phase. Each participant completed all 3 attributes × 45 correlation pairs × 30 trials, totaling 4,050 trials. The three attribute conditions were tested sequentially, and their order was randomized across participants.
To minimize fatigue, each attribute condition was divided into 3 blocks, with each block containing 10 trials for each of the 45 correlation pairs (450 trials per block). Each block lasted approximately 50 minutes, and participants could rest as long as needed between blocks. In total, the experiment consisted of 9 blocks and took approximately 7.5 hours to complete, excluding break time.
Results
Fig 5 illustrates the participants’ segmentation accuracy, expressed as the proportion of correct responses, across different temporal correlation pairs and for the three visual attributes: luminance, color, and spatial phase. A contour-plot version is shown in Fig 6 to illustrate the general tendency. A consistent trend was observed across all attributes, with segmentation performance varying as a function of both cross-area and within-area temporal correlations. Specifically, segmentation improved as the within-area temporal correlation increased and the cross-area temporal correlation decreased.
The first row presents the grand average results across five participants, while individual participant data are shown in subsequent rows. Columns represent the results for the luminance, color, and spatial phase conditions. Within each graph, the x-axis represents the empirical cross-area temporal correlation of the generated stimuli, while the y-axis represents the empirical within-area temporal correlation. The proportion of correct responses is represented by reddish intensity, with contour lines included. The gray regions outside the triangular area indicate untested conditions, while conditions within the triangular area have values < 0.4.
The results indicated that visual segmentation performance improves only when the difference between within- and cross-area temporal correlations is sufficiently large. Although the general trend remained consistent across attributes, we observed variations in overall accuracy, with luminance yielding the highest accuracy, followed by color and then spatial phase. This suggested that segmentation based on temporal similarity structure is more challenging when the distinguishing attribute is spatial phase compared to luminance or color. Additionally, the contour of segmentation performance in Fig 5 deviated slightly from parallel alignment with the diagonal line (where the within-area temporal correlation equals the cross-area temporal correlation). This suggested that within-area and cross-area temporal correlations do not symmetrically influence segmentation performance. A possible explanation for this phenomenon may lie in the spatial arrangement of the cross-area pairs, where the mean spatial distance is twice that of the within-area pairs.
Study 3. Analysis of human performance using a graph cut model
The results presented in the previous section indicated that human visual processing incorporates both within- and cross-area temporal correlations in similarity-based grouping, demonstrating its ability to process intricate similarity structure cues to infer spatial arrangements. However, the similarity structure alone does not explicitly define which group or object an individual pixel should be assigned to. Consequently, the visual system must employ an algorithm that converts similarity structure cues into predictions about object or group assignments without supervision, that is, without relying on predefined correct answers.
To address this challenge, we utilized a classical graph cut model to simulate segmentation performance for our stimuli and assessed the extent to which the model could replicate human performance. Fig 7 illustrates the simulation protocol and model architecture. Like the human participants, the model processed two video clips (a signal interval and a null interval) and determined which clip could be divided into left and right spatial partitions. The model underwent multiple trials, and segmentation accuracy was measured as the proportion of correctly identified signal intervals. The proportion correct was computed across all 45 temporal correlation pairs tested in the human experiments.
There were two model architecture versions: a naïve model and a full model. The naïve model did not include any trainable parameters and followed a three-stage processing sequence to evaluate the inherent capabilities of the existing structure. First, the model computed a pairwise temporal cosine similarity matrix, reflecting the experimental finding that human visual processing relies on similarity structures to infer spatial configurations. Next, it performed eigendecomposition on the similarity matrix to generate a spatial segmentation representation, thereby identifying group memberships for each pixel in an unsupervised manner. Finally, the model compared the estimated group memberships with the true ones and selected the interval (signal or null) maximizing similarity. This naïve model represents the best-case scenario under no internal limitations and serves to reveal the full computational capacity of this graph-based approach.
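The three stages of the naïve model can be sketched in a few lines of NumPy (a minimal illustration, not the authors’ exact implementation; the toy waveforms, noise levels, and the median-split readout are our own assumptions):

```python
import numpy as np

def naive_segmentation(video):
    """Sketch of the naive three-stage pipeline: temporal cosine similarity,
    mean-thresholded affinity graph, then a normalized-Laplacian spectral
    bipartition (Fiedler vector). `video` has shape (H, W, T)."""
    H, W, T = video.shape
    V = video.reshape(H * W, T)                          # flatten the grid
    U = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    S = np.abs(U @ U.T)                                  # stage 1: |cosine|
    A = (S > S.mean()).astype(float)                     # binarize at the mean
    np.fill_diagonal(A, 0.0)                             # drop self-loops
    d = np.maximum(A.sum(axis=1), 1e-12)
    L = np.eye(H * W) - A / np.sqrt(np.outer(d, d))      # normalized Laplacian
    _, evecs = np.linalg.eigh(L)                         # ascending eigenvalues
    return evecs[:, 1].reshape(H, W)                     # stage 2: Fiedler map

# Toy stimulus: two halves driven by different waveforms plus element noise.
rng = np.random.default_rng(0)
a, b = rng.standard_normal(30), rng.standard_normal(30)
video = np.empty((8, 16, 30))
video[:, :8, :] = 0.8 * a + 0.2 * b + 0.8 * rng.standard_normal((8, 8, 30))
video[:, 8:, :] = 0.2 * a + 0.8 * b + 0.8 * rng.standard_normal((8, 8, 30))
seg = naive_segmentation(video)
labels = seg > np.median(seg)            # stage 3 would compare this to target
print(labels[:, :8].mean(), labels[:, 8:].mean())  # halves should separate
```

In stage 3 the resulting membership map would be correlated with the true left/right partition for both intervals, and the better-matching interval reported.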
In the full model, we introduced two adjustable parameters (Fig 7) to refine the model’s performance and better align it with human capabilities. The first parameter incorporated noise before computing the similarity matrix, with the signal-to-noise ratio (SNR) controlling the noise level. Noise is an inherent aspect of human visual processing and disrupts similarity computations, leading to variability in perceived similarity strength. In our model, both noise addition and similarity computation were linear operations, meaning that the order of these steps—whether noise is introduced before or after similarity computation—did not affect the final output.
The second parameter accounted for the possibility that the visual system may not evaluate all similarity pairs equally. Previous studies [34,35] have suggested that the ability to judge simultaneity declines as the spatial distance between elements increases, potentially due to longer signal transmission times or reduced neural connectivity between distant spatial locations. To replicate this effect, we introduced a distance-dependent weighting function in the similarity matrix, reducing the contribution of similarity strength for elements that are farther apart. The position-embedding sigma controlled this weighting as a function of distance. A higher sigma value allowed the model to consider more widely separated element pairs, whereas a lower sigma value restricted similarity computations primarily to neighboring elements, such as those at the segmentation boundary.
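The distance-dependent weighting can be illustrated as follows (a sketch assuming the Gaussian fall-off described here and the 8 × 16 element grid of the stimuli; the function name and the sample sigma values are ours):

```python
import numpy as np

def distance_weights(h, w, sigma):
    """Gaussian fall-off of pairwise similarity with spatial distance.

    Returns an (h*w, h*w) matrix; entry (p, q) down-weights the similarity
    between elements p and q according to their Euclidean grid distance.
    `sigma` plays the role of the position-embedding sigma described above.
    """
    ys, xs = np.divmod(np.arange(h * w), w)      # row-major grid coordinates
    coords = np.stack([ys, xs], axis=1).astype(float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return np.exp(-d**2 / (2.0 * sigma**2))

W_wide = distance_weights(8, 16, sigma=100.0)   # large sigma: nearly uniform
W_narrow = distance_weights(8, 16, sigma=1.0)   # small sigma: neighbors only
print(W_wide.min())     # close to 1: even the most distant pair contributes
print(W_narrow[0, -1])  # close to 0: the most distant pair is suppressed
```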
Methodology
Model architecture
This section presents the model architecture tested in the simulation. The model used temporal correlation to construct a fully connected undirected graph across spatial locations and took the coarsest bipartition of this graph as the estimated segmentation.
The model received input stimuli of the same size as those generated by the ViT (8 × 16 × 30). Two sequences, a signal interval X_sig and a null interval X_null, were used simultaneously as model input. In most cases, the same calculations were applied to both intervals. Unless otherwise specified, only the computations for the signal interval are presented, with the null interval processed in the same manner.
After the stimuli were input, we added internal noise to the input as follows:

X̃_sig = α · X_sig + (1 − α) · N,

where N is Gaussian white noise with the shape 8 × 16 × 30, a mean of 0, and a variance of 1, and α is the free parameter controlling the SNR, with values in the range [0, 1].
After adding noise, we flattened X̃_sig from 3D to 2D, resulting in V_sig with dimensions (8 × 16) × 30. The mapping relationship was as follows:

V_sig[k, t] = X̃_sig[⌈k/16⌉, mod(k − 1, 16) + 1, t],

where ⌈·⌉ is the ceiling function, returning the smallest integer greater than or equal to its argument, and mod(·) returns the remainder after division.
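Assuming the flattening follows standard row-major order, the ceiling/remainder index mapping can be verified directly against NumPy’s reshape:

```python
import numpy as np

# Element (i, j) of the 8 x 16 grid maps to row k = (i - 1) * 16 + j in
# 1-based indexing, i.e. i = ceil(k / 16) and j = ((k - 1) mod 16) + 1.
X = np.random.default_rng(1).standard_normal((8, 16, 30))
V = X.reshape(8 * 16, 30)            # row-major flatten of the spatial grid

for k in range(1, 8 * 16 + 1):       # 1-based row index
    i = int(np.ceil(k / 16))         # ceiling function
    j = (k - 1) % 16 + 1             # remainder after division
    assert np.array_equal(V[k - 1], X[i - 1, j - 1])
print("mapping consistent")
```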
Next, we calculated the similarity matrix based on V_sig. The cosine similarity between any pair of row vectors v_p and v_q was given by:

cos(v_p, v_q) = (v_p · v_q) / (‖v_p‖ ‖v_q‖).

The similarity was then computed as the absolute value of the cosine similarity, weighted by a Gaussian function of the spatial distance between the two elements:

S(p, q) = |cos(v_p, v_q)| × exp(−d(p, q)² / (2σ²)),

where d(p, q) is the Euclidean distance between elements p and q on the grid. This weighting reduced the connectivity between distant elements; the parameter σ (the position-embedding sigma) controlled the width of the Gaussian window.
We then transformed the similarity matrix into an affinity matrix by applying a step function, turning it into a simple binary graph structure instead of a weighted graph:

A(p, q) = 1 if S(p, q) > S̄, and 0 otherwise,

where S̄ is the mean value of the whole similarity matrix.
Next, we performed a classical eigendecomposition on the normalized Laplacian matrix, defined as follows:

L = I − D^(−1/2) A D^(−1/2),

where D is the diagonal degree matrix of A. The normalized Laplacian ensures that all diagonal entries are 1, which equalizes the contribution of each node and stabilizes the matrix decomposition.
The eigendecomposition of the Laplacian matrix yielded:

L = U Λ Uᵀ,

where U represents the eigenvectors and Λ represents the eigenvalues. According to a previous study [36], the eigenvector corresponding to the second-smallest eigenvalue (also called the Fiedler vector) is the optimal solution for partitioning the graph into two subgraphs. We transformed this vector into matrix form as follows:

E_sig[i, j] = u_2[(i − 1) × 16 + j],

where u_2 is the eigenvector at the position corresponding to the second-smallest eigenvalue. This process provided the coarsest graph structure. In simple binary segmentation tasks, such as figure-ground segmentation and our task, the value of the Fiedler vector is directly proportional to the probability that a given spatial point belongs to the target.
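A toy example of the Fiedler-vector partition (a hypothetical eight-node graph of two cliques joined by one bridge edge, not one of the experimental stimuli):

```python
import numpy as np

# Two four-node cliques joined by a single bridge edge.
A = np.zeros((8, 8))
A[:4, :4] = 1.0
A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
A[3, 4] = A[4, 3] = 1.0                       # bridge between the cliques

d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(8) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian, diag = 1
evals, evecs = np.linalg.eigh(L)              # eigenvalues in ascending order
fiedler = evecs[:, 1]                         # 2nd-smallest eigenvalue's vector

# The sign of the Fiedler vector recovers the two cliques.
print(np.sign(fiedler[:4]))
print(np.sign(fiedler[4:]))
```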
Finally, we converted the eigenvector into a decision via a correctness rate. Specifically, the model output the option with the highest similarity to the answer. The correctness rate was defined as the correlation between the estimated membership map and the target:

r_sig = Σ (T − T̄)(E_sig − Ē_sig) / √( Σ (T − T̄)² × Σ (E_sig − Ē_sig)² ),

where T is the target matrix (i.e., the answer), T̄ is its mean, and Ē_sig is the mean of the model estimation. The model output the result based on the following decision rule:

Response = signal if r_sig > r_null; null if r_sig < r_null; B otherwise,

where B is a random variable following the Bernoulli distribution with equal probability of returning either signal or null.
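The decision stage can be sketched as follows (taking the absolute value of the correlation, to discount the arbitrary sign of the Fiedler vector, is our assumption; function and variable names are illustrative):

```python
import numpy as np

def choose_interval(est_sig, est_null, target, rng):
    """Pick the interval whose estimated membership map correlates best
    with the target partition; guess at random on an exact tie.
    `est_sig`, `est_null`, and `target` are 2-D maps."""
    def match(est):
        e, t = est.ravel(), target.ravel()
        e, t = e - e.mean(), t - t.mean()
        denom = np.sqrt((e**2).sum() * (t**2).sum())
        # |Pearson r|: the Fiedler vector's overall sign is arbitrary.
        return 0.0 if denom == 0 else abs((e * t).sum() / denom)

    r_sig, r_null = match(est_sig), match(est_null)
    if r_sig > r_null:
        return "signal"
    if r_sig < r_null:
        return "null"
    return "signal" if rng.random() < 0.5 else "null"  # Bernoulli tie-break

rng = np.random.default_rng(0)
target = np.zeros((8, 16))
target[:, 8:] = 1.0
good = np.where(target == 1.0, 1.0, -1.0)      # well-aligned membership map
noisy = rng.standard_normal((8, 16)) * 0.1     # unstructured map
print(choose_interval(good, noisy, target, rng))
```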
Simulation protocol
For each trial, the model generated two possible outcomes: signal or null. For each of the 45 correlation pairs, this process was repeated 1,000 times, with input stimuli randomly selected from 30 variants of a single temporal correlation pair and noise regenerated for each iteration. Based on these 1,000 responses, we calculated the percentage of signal responses, defining it as the proportion correct for a single correlation pair. We tested all 45 temporal correlation pairs to obtain the model’s performance in the segmentation task, representing the proportion correct as a function of both within- and cross-area temporal correlation, P_model(c, w). The model’s goodness of fit was determined by its ability to predict the human responses P_human(c, w), calculated as follows:

R² = 1 − Σ [P_human(c, w) − P_model(c, w)]² / Σ [P_human(c, w) − P̄_human]²,

where c and w represent the cross- and within-area temporal correlations, respectively, and P̄_human is the mean human proportion correct across all pairs.
For the naïve model, the parameters were configured to have no impact on model performance: the signal rate (i.e., α) was set to 1, corresponding to an infinite SNR, and the position-embedding sigma (i.e., σ) was set to infinity, causing the weight function to become 1 universally. To determine the most suitable parameter pair for replicating the human results, we implemented a grid-search sampling approach in the parameter space. We selected 50 linearly spaced samples for the SNR. Similarly, for the position-embedding sigma, we chose 50 linearly spaced samples, with the maximum value corresponding to the largest Euclidean distance in our stimuli. This resulted in 2,500 parameter pairs sampled across the parameter space, which were then used to assess the model’s goodness of fit against the human responses.
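The grid search can be organized as a simple double loop (a skeleton only: `simulate_proportion_correct` and `human_pc` are placeholders standing in for the full simulation and the measured human data):

```python
import numpy as np

def fit_parameters(simulate_proportion_correct, human_pc, snr_grid, sigma_grid):
    """Return (best R^2, (snr, sigma)) maximizing fit to the human data.

    human_pc: array of proportion correct over the 45 correlation pairs.
    simulate_proportion_correct(snr, sigma): model predictions, same shape.
    """
    best = (-np.inf, None)
    for snr in snr_grid:
        for sigma in sigma_grid:
            model_pc = simulate_proportion_correct(snr, sigma)
            resid = ((human_pc - model_pc) ** 2).sum()
            total = ((human_pc - human_pc.mean()) ** 2).sum()
            r2 = 1.0 - resid / total                 # goodness of fit
            if r2 > best[0]:
                best = (r2, (snr, sigma))
    return best

# Dummy check: a simulator that reproduces human_pc exactly at snr = 1.
human_pc = np.linspace(0.5, 1.0, 45)
best_r2, best_params = fit_parameters(lambda snr, sig: human_pc * snr,
                                      human_pc, [0.5, 1.0], [1.0])
print(best_r2, best_params)
```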
Results
Naïve model simulation results
Fig 8 presents the simulation results of the naïve model, showing the optimal performance of the graph-cut algorithm without the biologically plausible constraints included in the full model. The top section of the figure displays a single-trial output from the second stage. Without data supervision, the model’s estimated group memberships form clear boundaries that largely align with the true memberships, except in the null interval (other examples of output can be found in the supplementary material). In terms of segmentation performance, the model’s outcomes vary based on both within- and cross-area temporal correlations, similar to human performance. However, unlike humans, the model achieves near-perfect segmentation even when the difference between within- and cross-area temporal correlations is minimal.
The one-trial demo output for the second stage of the naive model is shown at the top of the figure, with the signal interval at the left and the null interval at the right (other examples of output can be found in the supplementary material). The temporal correlation pair shown is labeled as a red cross symbol on the proportion correct heatmap.
Full model simulation results
Fig 9 presents the effects of the two parameters on the model’s output, while Fig 10 presents a 2D predictability heatmap based on these parameters, along with the optimal model for the three attributes. The analysis reveals that SNR is the primary factor influencing the model’s overall performance, whereas the sigma parameter has a minor impact. When the position-embedding sigma exceeds a certain threshold (e.g., greater than one element), predictability remains nearly constant across attributes, regardless of further changes in sigma.
The optimal model (see Fig 10) achieved high predictability for the three attributes, with R² values of 0.96, 0.93, and 0.90, respectively. The corresponding SNR values were 0.48, 0.48, and 0.4, while the position-embedding sigma values were 7.09, 4.73, and 7.09. These findings suggested that, with appropriate parameter settings, the model can achieve performance comparable to human segmentation abilities. The lower performance for spatial phase can be attributed to a lower SNR compared with luminance and color.
The upper section of the figure displays the model’s simulated proportion-correct values under the optimal parameter pair. The lower section presents a heatmap illustrating how well the model predicts human responses to different attributes. The optimal parameter pairs are marked as red dots, with accompanying blue vertical and horizontal lines indicating their locations on the graph.
Furthermore, the position-embedding sigma primarily influenced the slope of the contour, as shown in Fig 9. Higher sigma values resulted in a slope closer to the diagonal line (i.e., within-area temporal correlation = cross-area temporal correlation). In contrast, when sigma was small, the model prioritized only short-distance temporal correlations, causing within-area and cross-area temporal correlations to contribute asymmetrically to segmentation performance.
The model’s predictability in simulating human performance suggested that the position-embedding sigma cannot be too small, indicating that similarity across certain distant elements also plays a crucial role. This finding indicated that human visual processing incorporates not only short-range similarities, such as those between neighboring elements, but also long-range similarities across spatially distant elements.
General Discussion
This study used ViTs to generate stimuli with precisely controlled within- and cross-area temporal correlations, allowing us to examine how the human visual system integrates similarity structures for segmentation. Our findings indicated that segmentation is not solely dependent on local comparisons between individual elements, but also involves a global process that integrates similarity across regions. Furthermore, we demonstrated that a graph-based computational model effectively captures human segmentation behavior, underscoring the potential role of structured relational processing in mid-level vision.
Transformer-based generation of complex stimuli
A key contribution of our work is the introduction of a novel protocol for generating temporally complex stimuli with precisely controlled element-wise temporal synchrony. Previous studies investigating temporal synchrony have primarily relied on simple periodic waveforms, such as sinusoidal waves [15,35] or square waves [11,14,37]. While some studies have explored non-periodic waveforms by randomly assigning temporal shifts to individual elements [13,23], these methods generally lack precise control over synchrony between individual elements.
The primary challenge in modulating element-wise temporal synchrony within a stochastic time series lies in constructing a covariance matrix that enforces a specific similarity structure. When the number of elements is large, the covariance matrix often becomes singular, making it impossible to directly sample from a statistical distribution, such as a multivariate normal distribution, while maintaining the desired temporal correlation constraints. One potential solution is an iterative approach in which stimuli are randomly generated and tested against the desired similarity structure. However, this method is computationally expensive and inefficient, requiring an unpredictable amount of time to produce a sufficient number of trials.
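The feasibility problem described above can be checked directly: under an assumed uniform within/cross correlation structure, the implied covariance matrix is often not positive semidefinite, so no multivariate normal distribution realizes it (the area size and correlation values below are illustrative):

```python
import numpy as np

def implied_covariance(n_per_area, r_within, r_cross):
    """Covariance matrix implied by a (within, cross) correlation pair for
    two areas of `n_per_area` elements each, with unit variances."""
    n = 2 * n_per_area
    C = np.full((n, n), float(r_cross))
    C[:n_per_area, :n_per_area] = r_within
    C[n_per_area:, n_per_area:] = r_within
    np.fill_diagonal(C, 1.0)
    return C

C_ok = implied_covariance(64, 0.8, 0.2)
C_bad = implied_covariance(64, 0.2, 0.8)   # cross > within: inconsistent
print(np.linalg.eigvalsh(C_ok).min())      # >= 0: valid covariance
print(np.linalg.eigvalsh(C_bad).min())     # < 0: not a valid covariance
```

A negative eigenvalue means the requested similarity structure cannot be met by direct sampling, which motivates the optimization-based generation described next.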
In this study, we addressed these limitations by leveraging a ViT-based approach to directly optimize the temporal similarity structure. By defining a similarity matrix as a loss function, we iteratively refined the stimuli to precisely match the desired within-region and cross-region temporal correlation constraints. This method not only ensures that the generated stimuli adhere to strict statistical properties but also provides a scalable solution for designing complex temporal textures. Our protocol offers a more flexible and computationally feasible approach for future research on spatiotemporal grouping, temporal synchrony, and other mid-level vision phenomena.
In this work, we used a ViT-based generation model to generate stimuli for a psychology experiment. However, other large-parameter models could also achieve similar results. In deep learning, this approach can be understood as a form of over-parameterized optimization. The solution space defined by the loss function over the original stimulus is usually non-convex, making direct optimization in the original parameter space difficult. By introducing a model with more parameters and stronger representational capacity, we effectively map the problem into a higher-dimensional space. This reduces the risk of getting stuck in local minima and makes it easier to find stimuli that meet the experimental goals. ViT is simply the architecture we chose for this study, but in principle, other sufficiently powerful models could also be used.
It is worth noting that the stimuli used in this study followed a fixed, simple geometric layout with a clear segmentation border. Combined with the strict 2IFC method, the segmentation thresholds we estimated might be too low to characterize region segregation performance in natural scenes, where segregation borders can be more complex and uncertain. Such naturalistic segmentation performance may be effectively quantified using the method proposed by [38]. In addition, because this method estimates region segregation maps by measuring pairwise similarities between numerous local points, combining it with our stimuli would enable a direct evaluation of how accurately human observers judge pairwise temporal correlations between elements in a scene. This extension may also help clarify how strongly segmentation relies on local versus global similarity cues, and how boundary location is inferred from global temporal structure.
Another limitation of the present study is the relatively small number of participants. Such a small sample cannot fully capture the population variance in temporal segmentation ability. The current study instead focused on within-participant variance to obtain accurate estimates at the individual level. Future studies with larger and more diverse participant groups will be important for capturing general tendencies and characterizing response variability.
Graph-based representation and its role in mid-level vision
In our experiment, participants exhibited better segmentation performance when within-region temporal correlation was high and cross-region temporal correlation was low. This suggested that the visual system does not rely solely on individual local features, but instead integrates similarity information across multiple spatial points. Given that low-level cues (e.g., luminance, contrast, and spatial frequency) and high-level semantic information were unavailable, the mid-level visual system must rely on the relative temporal similarity between elements to infer the segmentation structure. This finding aligned with previous studies indicating that mid-level vision constructs representations based on relational properties rather than absolute local features, such as non-accidental properties [39], configural relationships (e.g., T- and L-junctions) [40], and surface inference mechanisms [41,42]. Wallis et al. [43] suggest that segmentation and grouping mechanisms may be mediated by both local interactions between nearby image features and global properties of the scene. Our study suggests that the same may hold true for grouping and segmentation based on temporal correlation.
A natural way to formalize such relational structures is through a graph-based representation. In this framework, individual elements in the stimulus can be conceptualized as nodes, while the temporal similarity between them determines the edge weights. The segmentation process then resembles graph partitioning, where regions are separated based on internal consistency and cross-boundary dissimilarity. Various methods exist for implementing graph partitioning; here, the eigendecomposition we employed is a simple linear transformation of the graph structure to extract the second-largest principal component for this task. This approach was based on previous studies demonstrating that the second-largest principal component represents the coarsest structure derivable from a graph [18,36]. Additionally, computational neuroscience research suggests that principal component analysis can be efficiently implemented with a small number of neurons, supporting the physiological feasibility of such computations in the brain [44]. In our simulation, we observed that graph partitioning based on the second-largest principal component effectively replicates human segmentation performance with minimal parameterization.
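The cited result [44] can be illustrated with a short simulation of Oja’s rule, in which a single linear neuron’s Hebbian weight vector converges to the first principal component of its input stream (the input covariance and learning rate below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Inputs with a dominant variance direction along (1, 1)/sqrt(2).
C = np.array([[2.0, 1.5], [1.5, 2.0]])
X = rng.multivariate_normal([0.0, 0.0], C, size=5000)

w = rng.standard_normal(2)
eta = 0.01
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)          # Oja's normalized Hebbian update

pc1 = np.array([1.0, 1.0]) / np.sqrt(2) # leading eigenvector of C
print(abs(w @ pc1))                     # near 1: w aligns with PC1
```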
Our model simulations also suggested that differences in segmentation performance across visual attributes may arise from variations in SNR. In particular, the lower segmentation performance for spatial phase is consistent with evidence that higher-spatial-frequency signals require stronger contrast and lower temporal frequency to reach detection levels comparable to those of low-spatial-frequency stimuli [45,46]. This SNR difference also aligns with our experimental design. Because the same set of 30 pre-generated temporal tensors was used across all three attributes, and spatial differences such as mean luminance and contrast were either matched or negligible, physical-level differences between attributes were minimized. The observed SNR differences therefore most likely reflect intrinsic differences in signal gain or tuning bandwidth across the processing channels for the three attributes.
Moreover, our results suggested that restricting comparisons strictly to neighboring elements, as controlled by the position-embedding parameter, does not fully account for human segmentation performance. Instead, long-range comparisons are necessary, aligning with previous studies demonstrating long-range temporal comparison mechanisms in motion perception [34]. However, interactions between temporal and spatial information are complex [35,47] and warrant further investigation.
Alternative computational approaches could also yield results comparable to our graph-based model. For instance, while image statistics are represented in the primate cortex [37], image segmentation could be achieved by decomposing feature distributions through probabilistic inference [4,38]. Future research should focus on identifying key stimuli and experimental paradigms to determine which algorithms are actually employed by the human visual system.
Comparison with CV image segmentation models
While image segmentation models in the field of CV effectively utilize spatial similarity structures for segmentation, they largely disregard the temporal aspect of the task. As a result, these models struggle with segmentation based on temporal similarity. This limitation arises because most CV models process images in a frame-by-frame manner, often neglecting the global covariation of local pixels over time. In other words, current CV approaches lack mechanisms for integrating global temporal information. In contrast, human vision relies on a global process that integrates temporal covariation signals to infer spatial structures [11,13,14,24].
At present, no mature CV models fully address segmentation based on temporal similarity. However, our study leverages generated stimuli to investigate human perception of temporal similarity and applies a graph cut algorithm to simulate a global process that selectively integrates pairwise similarities. This approach better approximates human performance compared to conventional CV models. These findings suggest that incorporating global temporal similarity processing into CV models may be a promising avenue for advancing dynamic image segmentation.
Relationship with current neurophysiological evidence
Neurophysiological research suggests that visual segmentation involves multiple brain regions, with specific areas recruited depending on stimulus complexity and feature composition. Numerous studies have indicated that recurrent signaling between V1, V2, V4, and the inferior temporal cortex plays a crucial role in contour integration and simple shape perception [48–50].
For segmentation based on static visual features such as luminance, color similarity, or spatial proximity, functional magnetic resonance imaging and event-related potential studies have shown that contour-based segmentation primarily activates the lateral occipital cortex in the ventral pathway [51,52]. However, when segmentation involves more complex closed boundary shapes or object processing, activity shifts toward V3 and the intraparietal sulcus in the dorsal pathway [53–55]. The parietal lobe, particularly the inferior intraparietal sulcus, has been implicated in integrating different feature representations and processing hierarchical relationships within the visual scene [49]. In contrast, segmentation based on dynamic features, such as structure-from-motion, engages distinct regions, with the medial temporal area playing a central role [56].
Currently, no definitive neuronal evidence identifies the exact brain regions responsible for visual grouping or segmentation based on temporal similarity. Some studies suggest that the inferior parietal lobe and the insula contribute to processing temporal relationships, particularly temporal synchrony and asynchrony [57–59]. This raises the possibility that temporal similarity-based segmentation relies on mechanisms distinct from those governing segmentation based on static or motion cues.
The neural basis of graph-based computation in segmentation remains an open question. Future research should explore whether the brain employs an explicit graph-based representation for visual segmentation and, if so, which regions support this computation. Given their roles in relational processing and feature binding, the intraparietal sulcus and inferior parietal lobe are potential candidates for encoding similarity structures in a graph-like manner. Integrating neuroimaging techniques with computational modeling may help elucidate the neural mechanisms underlying mid-level visual segmentation.
Conclusion
Summing up, this study leveraged a Vision Transformer–based framework to generate stimuli with precisely controlled temporal similarity structures, enabling a systematic investigation of how the human visual system infers spatial organization from complex temporal correlations among elements. The results suggest that human segmentation behavior strongly depends on parsing global similarity structures to derive near-optimal grouping solutions, rather than relying solely on local comparisons.
This finding implies that to further understand how the visual system resolves spatial structure, it is necessary to consider more structured computational models and identify stimulus designs that can probe these mechanisms more effectively.
Resource availability
All resources, including the experiment code, demo video, raw data, high-resolution figures in the paper, and bootstrapping results, have been uploaded to the Open Science Framework platform. The project can be accessed at the following link: https://osf.io/9yncd/?view_only=eb481f2b28ab42e29ff267833f70b8e5.
Supporting information
S1 Appendix. Demo output with the signal interval input.
https://doi.org/10.1371/journal.pcbi.1013001.s001
(DOCX)
S2 Appendix. Demo output with the null interval input.
https://doi.org/10.1371/journal.pcbi.1013001.s002
(DOCX)
References
- 1. Köhler W. Some tasks of Gestalt psychology. Psychologies of 1930. Clark University Press. 1930. 143–60. doi: https://doi.org/10.1037/11017-008
- 2. Wertheimer M. Laws of organization in perceptual forms. A source book of Gestalt psychology. Kegan Paul, Trench, Trubner & Company. 1938. 71–88. doi: https://doi.org/10.1037/11496-005
- 3. Callaghan TC, Lasaga MI, Garner WR. Visual texture segregation based on orientation and hue. Percept Psychophys. 1986;39(1):32–8. pmid:3703659
- 4. Dakin SC, Watt RJ. The computation of orientation statistics from visual texture. Vision Res. 1997;37(22):3181–92. pmid:9463699
- 5. Landy MS, Graham N. Visual Perception of Texture. The Visual Neurosciences, 2-vol. set. The MIT Press. 2003. 1106–18. doi: https://doi.org/10.7551/mitpress/7131.003.0084
- 6. Dakin S. Seeing Statistical Regularities. In: Wagemans J, editor. Oxford University Press. 2014. doi: https://doi.org/10.1093/oxfordhb/9780199686858.013.054
- 7. Sekuler AB. Motion segregation from speed differences: evidence for nonlinear processing. Vision Res. 1990;30(5):785–95. pmid:2378071
- 8. Uttal WR, Spillmann L, Stürzel F, Sekuler AB. Motion and shape in common fate. Vision Res. 2000;40(3):301–10. pmid:10793903
- 9. Kojima H. Figure/ground segregation from temporal delay is best at high spatial frequencies. Vision Res. 1998;38(23):3729–34. pmid:9893802
- 10. Rideaux R, Badcock DR, Johnston A, Edwards M. Temporal synchrony is an effective cue for grouping and segmentation in the absence of form cues. J Vis. 2016;16(11):23. pmid:27690163
- 11. Usher M, Donnelly N. Visual synchrony affects binding and segmentation in perception. Nature. 1998;394(6689):179–82. pmid:9671300
- 12. Blake R, Lee S-H. Temporal Structure in the Input to Vision Can Promote Spatial Grouping. Lecture Notes in Computer Science. Springer Berlin Heidelberg. 2000. 635–53. doi: https://doi.org/10.1007/3-540-45482-9_64
- 13. Lee SH, Blake R. Visual form created solely from temporal structure. Science. 1999;284(5417):1165–8. pmid:10325226
- 14. Chen Y-J, Sun Z, Nishida S. Feature-invariant processing of spatial segregation based on temporal asynchrony. J Vis. 2024;24(5):15. pmid:38814934
- 15. Sekuler AB, Bennett PJ. Generalized common fate: grouping by common luminance changes. Psychol Sci. 2001;12(6):437–44. pmid:11760128
- 16. Mehrani P, Tsotsos JK. Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention. 2023. http://arxiv.org/abs/2303.01542
- 17. Krähenbühl P, Koltun V. Efficient inference in fully connected crfs with gaussian edge potentials. Advances in Neural Information Processing Systems. 2011.
- 18. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Machine Intell. 2000;22(8):888–905.
- 19. Sun Z, Wang R, Luo Z. Polynomial approximation based spectral dual graph convolution for scene parsing and segmentation. Neurocomputing. 2021;438:133–44.
- 20. Sun Z, Chen Y-J, Yang Y-H, Nishida S. Modeling of Human Motion Perception Mechanism: A Simulation based on Deep Neural Network and Attention Transformer. Journal of Vision. 2023;23(9):4894.
- 21. Sun Z, Chen Y-J, Yang Y-H, Nishida S. Modeling human visual motion processing with trainable motion energy sensing and a self-attention network. 2023.
- 22. Sun Z, Chen Y-J, Yang Y-H, Li Y, Nishida S. Machine Learning Modeling for Multi-order Human Visual Motion Processing. arXiv. 2025.
- 23. Morgan M, Castet E. High temporal frequency synchrony is insufficient for perceptual grouping. Proc Biol Sci. 2002;269(1490):513–6. pmid:11886644
- 24. Alais D, Blake R, Lee SH. Visual features that vary together over time group together over space. Nat Neurosci. 1998;1(2):160–4. pmid:10195133
- 25. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv. 2021. http://arxiv.org/abs/2010.11929
- 26. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. arXiv. 2015.
- 27. Sikder J, Islam MK, Jahan F. Object segmentation for image indexing in large database. Journal of King Saud University - Computer and Information Sciences. 2024;36(2):101937.
- 28. Murray SO, Olshausen BA, Woods DL. Processing shape, motion and three-dimensional shape-from-motion in the human cortex. Cereb Cortex. 2003;13(5):508–16. pmid:12679297
- 29. Tangemann M, Kümmerer M, Bethge M. Object segmentation from common fate: Motion energy processing enables human-like zero-shot generalization to random dot stimuli. arXiv. 2024.
- 30. Turk-Browne NB, Jungé J, Scholl BJ. The automaticity of visual statistical learning. J Exp Psychol Gen. 2005;134(4):552–64. pmid:16316291
- 31. Brainard DH. The Psychophysics Toolbox. Spatial Vis. 1997;10(4):433–6.
- 32. Anstis S, Cavanagh P. A minimum motion technique for judging equiluminance. Toronto: York University. 1983. http://wexler.free.fr/library/files/anstis%20(1983)%20a%20minimum%20motion%20technique%20for%20judging%20equiluminance.pdf
- 33. Cavanagh P, MacLeod DI, Anstis SM. Equiluminance: spatial and temporal factors and the contribution of blue-sensitive cones. J Opt Soc Am A. 1987;4(8):1428–38. pmid:3625323
- 34. Maruya K, Holcombe AO, Nishida S. Rapid encoding of relationships between spatially remote motion signals. J Vis. 2013;13(2):4. pmid:23390318
- 35. Motoyoshi I. The role of spatial interactions in perceptual synchrony. J Vis. 2004;4(5):352–61. pmid:15330719
- 36. Fiedler M. Laplacian of graphs and algebraic connectivity. Banach Center Publ. 1989;25(1):57–70.
- 37. Kandil FI, Fahle M. Purely temporal figure-ground segregation. Eur J Neurosci. 2001;13(10):2004–8. pmid:11403694
- 38. Vacher J, Launay C, Mamassian P, Coen-Cagli R. Measuring uncertainty in human visual segmentation. arXiv. 2023.
- 39. Hummel JE, Biederman I. Dynamic binding in a neural network for shape recognition. Psychol Rev. 1992;99(3):480–517. pmid:1502274
- 40. Kubilius J, Wagemans J, Op de Beeck HP. Encoding of configural regularity in the human visual system. J Vis. 2014;14(9):11. pmid:25122215
- 41. Lappin JS, Bell HH. Form and Function in Information for Visual Perception. Iperception. 2021;12(6):20416695211053352. pmid:35003612
- 42. Nakayama K, He ZJ, Shimojo S. Visual Surface Representation. An Invitation to Cognitive Science. The MIT Press. 1995. doi:10.7551/mitpress/3965.003.0004
- 43. Wallis TS, Funke CM, Ecker AS, Gatys LA, Wichmann FA, Bethge M. Image content is more important than Bouma’s Law for scene metamers. Elife. 2019;8:e42512. pmid:31038458
- 44. Oja E. A simplified neuron model as a principal component analyzer. J Math Biol. 1982;15(3):267–73. pmid:7153672
- 45. Tolhurst DJ. Separate channels for the analysis of the shape and the movement of a moving visual stimulus. J Physiol. 1973;231(3):385–402. pmid:4783089
- 46. Horiguchi H, Nakadomari S, Misaki M, Wandell BA. Two temporal channels in human V1 identified using fMRI. Neuroimage. 2009;47(1):273–80. pmid:19361561
- 47. Motoyoshi I, Nishida S. Spatiotemporal interactions in detection of texture orientation modulations. Vision Res. 2002;42(26):2829–41. pmid:12450508
- 48. Neumann H, Sepp W. Recurrent V1-V2 interaction in early visual boundary processing. Biol Cybern. 1999;81(5–6):425–44. pmid:10592018
- 49. Roelfsema PR. Cortical algorithms for perceptual grouping. Annu Rev Neurosci. 2006;29:203–27. pmid:16776584
- 50. Grossberg S, Pinna B. Neural dynamics of gestalt principles of perceptual organization: From grouping to shape and meaning. Gestalt Theory. 2012;34.
- 51. Wong YJ, Aldcroft AJ, Large M-E, Culham JC, Vilis T. The role of temporal synchrony as a binding cue for visual persistence in early visual areas: an fMRI study. J Neurophysiol. 2009;102(6):3461–8. pmid:19828729
- 52. Volberg G, Greenlee MW. Brain networks supporting perceptual grouping and contour selection. Front Psychol. 2014;5:264. pmid:24772096
- 53. Wu H, Zuo Z, Yuan Z, Zhou T, Zhuo Y, Zheng N, et al. Neural representation of gestalt grouping and attention effect in human visual cortex. J Neurosci Methods. 2023;399:109980. pmid:37783351
- 54. Xu Y, Chun MM. Visual grouping in human parietal cortex. Proc Natl Acad Sci U S A. 2007;104(47):18766–71. pmid:17998539
- 55. Seymour K, Karnath H-O, Himmelbach M. Perceptual grouping in the human brain: common processing of different cues. Neuroreport. 2008;19(18):1769–72. pmid:18955906
- 56. Grunewald A, Bradley DC, Andersen RA. Neural correlates of structure-from-motion perception in macaque V1 and MT. J Neurosci. 2002;22(14):6195–207. pmid:12122078
- 57. Battelli L, Cavanagh P, Martini P, Barton JJS. Bilateral deficits of transient visual attention in right parietal patients. Brain. 2003;126(Pt 10):2164–74. pmid:12821511
- 58. Battelli L, Pascual-Leone A, Cavanagh P. The “when” pathway of the right parietal lobe. Trends Cogn Sci. 2007;11(5):204–10. pmid:17379569
- 59. Bushara KO, Grafman J, Hallett M. Neural correlates of auditory-visual stimulus onset asynchrony detection. J Neurosci. 2001;21(1):300–4. pmid:11150347