
Conceived and designed the experiments: ML SH. Contributed reagents/materials/analysis tools: ML SH. Wrote the paper: ML SH.

The authors have declared that no competing interests exist.

It is shown that existing processing schemes of 3D motion perception, such as interocular velocity difference, changing disparity over time, and joint encoding of motion and disparity, do not offer a general solution to the inverse optics problem of local binocular 3D motion. Instead we suggest that local velocity constraints, in combination with binocular disparity and other depth cues, provide a more flexible framework for the solution of the inverse problem. In the context of the aperture problem we derive predictions from two plausible default strategies: (1) the vector normal, which prefers slow motion in 3D, and (2) the cyclopean average, which is based on slow motion in 2D. Predicting perceived motion directions for ambiguous line motion provides an opportunity to distinguish between these strategies of 3D motion processing. Our theoretical results suggest that velocity constraints and disparity from feature tracking are needed to solve the inverse problem of 3D motion perception. It seems plausible that motion and disparity input are processed in parallel and integrated late in the visual processing hierarchy.

Humans and many other predators have two eyes that are set a short distance apart, so that an extensive region of the world is seen simultaneously by both eyes from slightly different points of view. Although the images of the world are essentially two-dimensional, we vividly see the world as three-dimensional. This is true for static as well as dynamic images. Here we elaborate on how the visual system may establish 3D motion perception from local input in the left and right eye. Using tools from analytic geometry we show that existing 3D motion models offer no general solution to the inverse optics problem of 3D motion perception. We propose a flexible framework of motion and depth processing and suggest default strategies for local 3D motion estimation. Our results on the aperture and inverse problem of 3D motion are likely to stimulate computational, behavioral, and neuroscientific studies because they address the fundamental issue of how 3D motion is represented in the visual system.

The representation of the three-dimensional (3D) external world from two-dimensional (2D) retinal input is a fundamental problem that the visual system has to solve

Velocity in 3D space is described by motion direction and speed. Motion direction can be measured in terms of azimuth and elevation angle, and motion direction together with speed is conveniently expressed as a 3D motion vector in a Cartesian coordinate system. Estimating such a vector locally is highly desirable for a visual system because the representation of local estimates in a dense vector field provides the basis for the perception of 3D object motion, that is, the direction and speed of moving objects. This information is essential for interpreting events as well as planning and executing actions in a dynamic environment.
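The conversion from azimuth, elevation, and speed to a Cartesian 3D motion vector can be sketched as follows (the axis convention below is an assumption for illustration, not the paper's):

```python
import math

def motion_vector(azimuth, elevation, speed):
    """Convert a 3D motion direction (azimuth, elevation, in radians)
    and a speed into a Cartesian velocity vector (vx, vy, vz).
    Assumed convention: azimuth is measured in the horizontal x-z
    plane, elevation from that plane toward the vertical y axis."""
    vx = speed * math.cos(elevation) * math.sin(azimuth)
    vy = speed * math.sin(elevation)
    vz = speed * math.cos(elevation) * math.cos(azimuth)
    return (vx, vy, vz)
```

By construction the Euclidean norm of the returned vector equals the speed, so direction and speed can be recovered from the dense vector field.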

If a single moving point, corner or other unique feature serves as binocular input then intersection of constraint lines or triangulation together with a starting point provides a straightforward and unique geometrical solution to the inverse problem in a binocular viewing geometry (see
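For a single unique feature, the triangulation described above can be sketched as finding the midpoint of the common perpendicular between the two constraint lines (a minimal sketch; the function name, coordinate frame, and least-squares formulation are assumptions):

```python
import numpy as np

def triangulate(n_l, d_l, n_r, d_r):
    """Midpoint triangulation: each eye defines a constraint line
    through its nodal point n along direction d.  Returns the point
    midway along the common perpendicular of the two lines (which is
    the exact intersection when the lines meet)."""
    n_l, d_l = np.asarray(n_l, float), np.asarray(d_l, float)
    n_r, d_r = np.asarray(n_r, float), np.asarray(d_r, float)
    # Least-squares line parameters s, t of the mutually closest points:
    # minimize |(n_l + s*d_l) - (n_r + t*d_r)|.
    A = np.column_stack((d_l, -d_r))
    (s, t), *_ = np.linalg.lstsq(A, n_r - n_l, rcond=None)
    p_l = n_l + s * d_l          # closest point on the left-eye line
    p_r = n_r + t * d_r          # closest point on the right-eye line
    return (p_l + p_r) / 2       # midpoint of the common perpendicular
```

With noise-free input the two lines intersect and the midpoint coincides with the unique geometrical solution.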

The left and right eye with nodal points

The inverse optics and the aperture problem are well-known problems in computational vision, especially in the context of stereo

Similar techniques in terms of error minimization and regularization have been offered for 3D stereo-motion detection

Computational studies on 3D motion algorithms are usually concerned with fast and efficient encoding when tested against ground truth. Here we are less concerned with the efficiency or robustness of a particular implementation. Instead we want to understand and predict behavioral characteristics of human 3D motion perception. 2D motion perception has been extensively researched in the context of the 2D aperture problem

Any physiologically plausible solution to the inverse 3D motion problem has to rely on binocular sampling of local spatio-temporal information. There are at least three known cell types in early visual cortex that may be involved in local encoding of 3D motion: simple and complex motion detecting cells

It is therefore not surprising that three approaches to binocular 3D motion perception have emerged in the literature: Interocular velocity difference (IOVD), changing disparity over time (CDOT), and joint encoding of motion and disparity (JEMD).

These three approaches have generated an extensive body of research but psychophysical results have been inconclusive and the nature of 3D motion processing remains an unresolved issue

Large and persistent perceptual bias has been found for dot stimuli with unambiguous motion direction

The aim of this paper is to evaluate existing models of 3D motion perception and to gain a better understanding of binocular 3D motion perception. First, we show that existing models of 3D motion perception are insufficient to solve the inverse problem of binocular 3D motion. Second, we establish velocity constraints in a binocular viewing geometry and demonstrate that additional information is necessary to disambiguate local velocity constraints and to derive a velocity estimate. Third, we compare two default strategies of perceived 3D motion when local motion direction is ambiguous. It is shown that critical stimulus conditions exist that can help to determine whether 3D motion perception favors slow 3D motion or averaged cyclopean motion.

In the following we summarize shortcomings for each of the three main approaches to binocular 3D motion perception in terms of stereo and motion correspondence, 3D motion direction, and speed. We also provide a counterexample to illustrate the limitations of each approach.

This influential processing model assumes that monocular spatio-temporal differentiation or motion detection

We argue that the standard IOVD model

The first limitation is easily overlooked: IOVD assumes stereo correspondence between motion in the left and right eye when estimating a 3D motion trajectory. The model does not specify which motion vector in the left eye should correspond to which motion vector in the right eye before computing a velocity difference. If there is only a single motion vector in each eye then establishing stereo correspondence appears trivial, since there are only two positions in the left and right eye that signal dynamic information. Nevertheless, stereo correspondence is a necessary prerequisite of IOVD processing, and it quickly becomes challenging if we consider multiple stimuli that excite not one but many local motion detectors in the left and right eye. It is concluded that without explicit stereo correspondence between local motion detectors the IOVD model is incomplete.

The second problem concerns 3D motion trajectories with arbitrary azimuth and elevation angles. Consider a local contour with spatial extent such as an oriented line inside a circular aperture so that the endpoints of the line are occluded. This is known as the aperture problem in stereopsis

Constraint lines through projection point

It is worth pointing out that IOVD offers no true estimate of 3D speed. This is surprising because the model is based on spatio-temporal or speed-tuned motion detectors. The problem arises because computing a motion trajectory without a constraint in depth does not solve the inverse problem. As a consequence, speed is typically approximated by motion in depth along the line of sight
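This line-of-sight approximation is commonly written, under small-angle assumptions, as Vz ≈ (D²/I)(v_R − v_L), with viewing distance D and interocular distance I. A minimal sketch (function name, units, and sign convention are assumptions):

```python
def motion_in_depth_iovd(v_left, v_right, distance, iod):
    """Small-angle approximation of motion in depth along the line of
    sight from an interocular velocity difference.  Only this one
    component is recovered; arbitrary azimuth/elevation is not.
    v_left, v_right: horizontal image velocities in the two eyes,
    distance: viewing distance D, iod: interocular distance I."""
    return (distance ** 2 / iod) * (v_right - v_left)
```

Equal velocities in the two eyes yield zero motion in depth regardless of the (possibly large) fronto-parallel velocity, which illustrates why this is not a 3D speed estimate.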

If an edge or line tilted from horizontal by 0<

Another violation occurs when the line is slanted in depth and projects with different orientations into the left and right eye. The resulting misalignment on the

It is concluded that the IOVD model is incomplete and easily leads to ill-posed inverse problems. These limitations are difficult to resolve within a motion processing system and point to contributions from disparity or depth processing.

This alternative processing scheme uses disparity input and monitors changing disparity over time (CDOT). Disparity between the left and right image is detected

Assuming CDOT can always establish a suitable stereo correspondence between features including lines

Detecting local disparity change alone is insufficient to determine an arbitrary 3D trajectory. CDOT has difficulty recovering arbitrary 3D motion direction because only motion-in-depth along the line of sight is well defined. 3D motion direction in terms of arbitrary azimuth and elevation requires a later global mechanism that has to solve the inverse problem by tracking not only disparity over time but also position in 3D space over time.

As a consequence, the rate of change of disparity provides a speed estimate for motion-in-depth along the line of sight but not for arbitrary 3D motion trajectories.
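Under the small-angle assumptions standard in the literature, the rate of change of horizontal disparity δ yields only this line-of-sight component, Vz ≈ (D²/I)·dδ/dt. A hedged sketch (names and the finite-difference step are assumptions):

```python
def motion_in_depth_cdot(disparity_t0, disparity_t1, dt, distance, iod):
    """Motion in depth from the rate of change of horizontal disparity
    (small-angle approximation).  disparity_t0/t1: disparities at the
    start and end of the interval dt; distance: viewing distance D;
    iod: interocular distance I.  Fronto-parallel motion components
    leave the disparity unchanged and are therefore invisible here."""
    d_disparity_dt = (disparity_t1 - disparity_t0) / dt
    return (distance ** 2 / iod) * d_disparity_dt
```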

In the context of local surface motion consider a horizontally slanted surface moving to the left or right behind a circular aperture. Without corners or other unique features CDOT can only detect local motion in depth along the line of sight. Similarly, in the context of local line motion the inverse problem remains ill-posed for a local edge or line moving on a slanted surface because additional motion constraints are needed to determine a 3D motion direction.

In summary, CDOT does not provide a general solution to the inverse problem of local 3D motion because it lacks information on motion direction. Even though CDOT is capable of extracting stereo correspondences over time, additional motion constraints are needed to represent arbitrary motion trajectories in 3D space.

This approach postulates that early binocular cells are both motion and disparity selective, and physiological evidence for the existence of such cells was found in cat striate cortex

Similar to cells tuned to binocular motion, model cells of JEMD prefer corresponding velocities in the left and right eye. Therefore a binocular model cell can only establish a 2D fronto-parallel velocity constraint at a given depth. Model cell activity remains ambiguous because it can be the result of local disparity or motion input

Again, similar to IOVD and CDOT, JEMD provides no local 3D speed estimate. It also has to rely on sampling across depth planes in a population of cells in order to approximate speed.

Consider local 3D motion with unequal velocities in the left and right eye but the same average velocity, e.g. diagonal trajectories to the front and back through the same point in depth. JEMD has no mechanism to discriminate between these local 3D trajectories when monitoring binocular cell activity across depth planes in a given temporal window.

In the following we introduce general velocity constraints for 3D motion and suggest two default strategies of 3D motion perception that are based on different processing principles (see

Which constraints does the visual system use to solve the inverse as well as the aperture problem for local 3D line motion where endpoints are invisible or occluded? This is a critical question because it is linked to local motion encoding and the possible contribution from depth processing.

The 3D motion system may establish constraint planes rather than constraint lines to capture all possible motion directions of a contour or edge, including motion in the direction of the edge's orientation. Geometrically the intersection of two constraint planes in a given binocular viewing geometry defines a constraint line oriented in 3D velocity space (see

The intersection of constraint planes (IOC) together with the assumption of slow motion describes the shortest vector in 3D space (blue arrow) that fulfills the velocity constraints.
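The IOC computation with a slow-motion default can be sketched directly: intersect the two constraint planes and take the point on the resulting line closest to zero velocity (the plane parameters are assumed to be given; names are illustrative):

```python
import numpy as np

def slowest_velocity(n1, c1, n2, c2):
    """Intersection of constraints in 3D velocity space.  Each
    monocular constraint plane is written n . v = c.  The two planes
    intersect in a constraint line; the slow-motion default picks the
    point on that line closest to zero velocity (the vector normal)."""
    n1, n2 = np.asarray(n1, float), np.asarray(n2, float)
    d = np.cross(n1, n2)  # direction of the 3D constraint line
    # Solve n1.v = c1, n2.v = c2, d.v = 0.  The extra condition d.v = 0
    # selects the point on the line nearest the origin, i.e. the
    # slowest 3D velocity compatible with both constraints.
    A = np.vstack((n1, n2, d))
    return np.linalg.solve(A, np.array([c1, c2, 0.0]))
```

The cross product of the two plane normals gives the orientation of the constraint line in velocity space; requiring the solution to be orthogonal to that direction implements the slow-motion prior.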

We suggest that in analogy to 2D motion perception

But which principles or constraints are used? Does the binocular motion system prefer slow 3D motion or averaged 2D motion? Does it solve stereo correspondence before establishing binocular velocity constraints or does it average 2D velocity constraints from the left and right eye before it solves stereo correspondence? We derive predictions for two alternative strategies to address these questions.

Velocity constraints in the left and right eye provide velocity constraint planes in 3D velocity space. In

The VN strategy is a generalization of the vector normal and IOC in 2D

If the motion system computes slow 2D motion independently in the left and right eye then the cyclopean average provides an alternative velocity constraint

Combining the cyclopean velocity constraint with horizontal disparity determines a vector in 3D space (red arrow) with average monocular velocity.
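A minimal sketch of the CA strategy, assuming the two monocular 2D velocities and disparity-defined depths at the start and end of the interval are available (names and the depth-from-disparity step are assumptions):

```python
import numpy as np

def cyclopean_average(v_left_2d, v_right_2d, depth_start, depth_end, dt):
    """Cyclopean-average sketch: the fronto-parallel (x, y) component
    is the mean of the two monocular 2D velocity estimates; the depth
    (z) component comes from disparity-defined depth at the start and
    end of the interval of duration dt."""
    v_xy = (np.asarray(v_left_2d, float) + np.asarray(v_right_2d, float)) / 2
    v_z = (depth_end - depth_start) / dt   # depth change over time
    return np.array([v_xy[0], v_xy[1], v_z])
```

In contrast to the vector normal, this strategy minimizes 2D (cyclopean) speed first and only then assigns depth, which is what produces the diverging predictions for slanted lines.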

The CA strategy is a generalized version of the vector average strategy for 2D motion

We use the Vector Normal (VN) and Cyclopean Average (CA) as default strategies to predict 3D velocity of an oriented line or contour moving in depth inside a circular aperture.

The 3D plot in

Predictions for an oriented stimulus line moving on a fixed trajectory to the front left and to the back left are shown. Predicted velocities show characteristic differences when the moving stimulus line or contour is slanted in depth (range of orientation disparities between −6° and +6°).

If the diagonal line is fronto-parallel and has zero orientation disparity both strategies make equivalent predictions (intersection of red and blue vector fields in

In a first experiment using a psychophysical matching task we measured the perceived 3D motion direction of an oriented line moving behind a circular aperture. Preliminary results from four observers indicate VN as the default strategy. Perceptual bias from depth processing reduced the perceived slant of the stimulus line, which in turn affected motion direction

IOVD and CDOT are extreme models because each is based exclusively on either motion or disparity input. IOVD excludes contributions from binocular disparity processing but requires early stereo correspondence. It does not solve the inverse problem for local 3D line motion because it is confined to 3D motion in the

CDOT, on the other hand, excludes contributions from motion processing and therefore has difficulty establishing motion correspondence and direction. Without further assumptions it is confined to motion in depth along the line of sight.

If either motion or disparity input determines 3D motion perception then processing of any additional input needs to be disengaged or silenced. Instead, the visual system may take advantage of motion and disparity input

Combining global disparity or depth information with local velocity constraints at a later stage solves the inverse problem of local 3D motion and provides a flexible scheme that can exploit intermediate depth processing such as relative and orientation disparity in V2 and V4

What enables the visual system to instantaneously perceive 3D motion and to infer the direction and speed of a moving object? It seems likely that the visual system exploits many cues to make this difficult inference as reliable and veridical as possible, and the diverse set of effective local and global cues in psychophysical studies

More specifically, we suggest that binocular 3D motion perception may be based on parallel motion and depth processing. Here, motion processing captures local spatio-temporal constraints in the scene whereas depth processing provides a global and dynamic depth map that helps to disambiguate motion direction and to maintain a detailed spatial representation of the scene. Late integration of motion and disparity constraints in combination with other cues can solve the inverse problem of local 3D motion and allows the visual system to remain flexible when binding and segmenting local inputs from different processing stages into a global 3D motion percept. Parallel processing and late integration may explain why, compared to 2D motion perception, 3D motion perception shows reduced spatio-temporal tuning characteristics, reminiscent of 2nd-order motion.

The notion of parallel pathways feeding functionally different aspects of motion perception into a later stage is not new and has been advanced in the context of 2D motion direction and speed perception

Considering the ill-posed inverse problem of existing approaches and the under-determined characteristics of local binocular motion constraints, parallel processing and late integration of motion and disparity as well as other cues appear particularly convincing: solving the inverse problem for local 3D motion adds a functionally significant aspect to the notion of parallel streams of dynamic disparity and motion processing. It will require considerable effort to unravel the entire process but recent developments in the framework of Bayesian inference

In the following we assume a fixed binocular viewing geometry with the cyclopean origin

Since we are not concerned with particular algorithms and their implementation, results are given in terms of analytic geometry

If the eyes remain verged on a fixation point in a binocular viewing geometry then the constraint line in the left and right eye can be defined by pairs of points

We can exclude the trivial case

The cross product in (4) can be written as

In the following we consider the simpler case of projections onto a fronto-parallel screen (coplanar retinae) at a fixed viewing distance

Again, since

Monocular line motion defines a constraint plane with three points: the nodal point of an eye and two points defining the end position of the projected line (see
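Constructing such a constraint plane from the nodal point and the two image points can be sketched as follows (names are illustrative; the plane is returned in the form n · x = c):

```python
import numpy as np

def constraint_plane(nodal, p1, p2):
    """Monocular constraint plane through an eye's nodal point and the
    two image points that define the end position of the projected
    line.  Returns (n, c) with the plane written as n . x = c."""
    nodal, p1, p2 = (np.asarray(p, float) for p in (nodal, p1, p2))
    n = np.cross(p1 - nodal, p2 - nodal)   # plane normal
    return n, n @ nodal                    # offset so the plane contains nodal
```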

We need to check whether the constraint planes are parallel or coincident, that is, whether the plane normals n_{L} and n_{R} are parallel so that n_{L} × n_{R} = 0

In analogy to the 2D aperture problem and the intersection of constraints we can now define two plausible strategies for solving the 3D aperture problem:

The shortest distance in 3D (velocity) space between the starting point

We can define a cyclopean constraint line in terms of the cyclopean origin

If we measure disparity

The intersection
