Correcting geometric distortions in stereoscopic 3D imaging

Motion in a distorted virtual 3D space may cause visually induced motion sickness. Geometric distortions in stereoscopic 3D can result from mismatches among image capture, display, and viewing parameters. Three pairs of potential mismatches are considered: 1) camera separation vs. eye separation, 2) camera field of view (FOV) vs. screen FOV, and 3) camera convergence distance (i.e., the distance from the cameras to the point where the convergence axes intersect) vs. screen distance from the observer. The effect of the viewer's head position (i.e., head lateral offset from the screen center) is also considered. The geometric model is expressed as a function of camera convergence distance, the ratios of the three parameter-pairs, and the offset of the head position. We analyze the impact of each of these five variables on geometric distortions separately, as well as their interactions. This model facilitates insight into the various distortions and leads to methods whereby the user can minimize geometric distortions caused by some parameter-pair mismatches through adjustment of other parameter pairs. For example, in postproduction, viewers can correct for a mismatch between camera separation and eye separation by adjusting their distance from the real screen and changing the effective camera convergence distance.


Introduction
Stereoscopic 3D (S3D) is being used for virtual/augmented reality, scientific visualization, medical imaging, 3D movies, and gaming. The ultimate goal of S3D systems is to convey the real world, or a virtually constructed 3D world, veridically to the viewer. However, it is often the case that various S3D capture, display, and viewing parameters are mismatched [1]. This may introduce geometric distortions for the viewer [2][3][4]. Such space distortions may degrade the quality of the stereoscopic presentation [5] and the user's performance on the size/distance estimations required for virtual interactions, which are known to be beneficial in S3D [6]. Geometric space distortions also interfere with the viewer's perception of self-motion. When they are inconsistent with familiar real-world motion perception, they may cause visually induced motion sickness (VIMS) [3]. Therefore, understanding the sources of such geometric distortions, with the aim of correcting or minimizing their effects, should be the starting point for improving the overall quality of the S3D presentation. The S3D imaging chain includes capturing the original 3D world (real or virtual) by two cameras, displaying the S3D content on dichoptic screens, and finally viewing the S3D content by users. The capture and display parameters of the S3D imaging chain can be grouped into corresponding pairs: 1) camera separation vs. eye separation (interpupillary distance, IPD), 2) camera field of view (FOV) vs. screen FOV, 3) camera convergence distance vs. screen distance. Camera convergence distance is the distance from the midpoint between the cameras to the point where the camera convergence axes intersect. Viewer-initiated viewing parameters, such as translational offset, can be expressed as the distance from the designated (optimal) head position.

(PLOS ONE | https://doi.org/10.1371/journal.pone.0205032, October 8, 2018)
Woods et al. [4] provided a transfer function from the real (or virtual) world to the S3D world. Using this model, various geometric distortions were analyzed, such as depth plane curvature (i.e., objects are bent away from the viewer in the periphery, see also [3]), depth non-linearity (i.e., depth differences in the reconstructed world do not match the corresponding depth differences in the original world), and shearing distortion (i.e., objects appear sheared toward the viewer's head position) [7].
The geometric model developed by Woods et al. [4] demonstrates how individual parameters in the S3D imaging chain may affect the final presentation to the viewer. However, since the parameters involved in the S3D imaging chain were not explicitly grouped into corresponding pairs, it is hard to intuitively understand the interaction among the parameter pairs. In Woods et al. [4], to demonstrate the effect of the various display parameters, the other parameters were assigned fixed default values. Camera and eye separation were assigned 75mm and 65mm, respectively, whereas camera FOV was assigned 50˚ or 52˚ and screen FOV was assigned 17˚ (calculated from 1m screen distance and 30cm screen width). Since geometric distortions may result from a combination of multiple mismatches (due to mismatches of multiple paired parameters), it is unclear whether the distortion patterns found through such analysis were caused entirely by the solo effect of the examined parameter pair, or by the combined effect with other default parameter mismatches. For instance, when demonstrating the effect of camera separation, the simulated distortions were confounded by the mismatch between camera FOV and screen FOV.
Our geometric model is expressed as a function of the ratios of the three parameter-pairs: 1) camera separation vs. eye separation, 2) camera field of view (FOV) vs. screen FOV, and 3) camera convergence distance vs. screen distance from the observer. The geometric distortions as a function of each parameter ratio can be studied independently by assuming the other pairs are perfectly matched. Yet, one can then consider the interactions among the parameter pairs by changing more than one ratio at a time. Using a model expressed in terms of ratios of paired corresponding parameters facilitates intuition about the effects and leads to a better understanding of the relationship between the parameter pairs. The effect of viewer's suboptimal head positions (i.e., the head is offset from the screen center) is also discussed.
For real screen displays (e.g., smartphone, monitor, TV, and movie theater), where the screen size is fixed, changing the screen distance changes the screen FOV. The user's eye separation varies from user to user. In the case of pre-produced content, such as S3D movies, the content's capture parameters are set during the initial capture and postproduction phases (e.g., convergence distance may be adjustable by horizontally translating the displayed images [8]), but they typically cannot be adjusted by the viewer.
The simplest approach to correcting the geometric distortions is to match the capture, display, and viewing systems. However, the user's eye separation and the camera separation are both fixed, and eye separation differs from viewer to viewer. Our model shows that it is possible to adjust other controllable parameter pairs to compensate for the distortions caused by the mismatch between eye separation and camera separation. Specifically, we propose a method to remove the geometric distortions during S3D viewing by adjusting the screen distance and the camera convergence distance (i.e., by horizontally shifting the left and right captured views).
The existence of depth distortions in S3D is well known and some distortions have been named. Masaoka et al.'s [9] and Yamanoue's [10] geometric models were used to analyze commonly reported S3D perceptual size and depth distortions, known as the puppet-theater effect [11] and the cardboard effect [12]. The puppet-theater effect is caused by size/scale discrepancies between objects in the real world and those reconstructed in S3D. For example, when reconstructed objects in the foreground are relatively smaller than objects in the background (while accounting for the perceived distance), the viewer perceives the foreground objects to be relatively small, as if they were figures in a puppet theater. The cardboard effect is caused by non-linearly compressed depth, such that farther objects appear more compressed in depth than closer objects and thus may be perceived as flatter, and in the extreme, as a cardboard cutout of a picture of the objects. The opposites of these two effects are also possible, where objects reconstructed in S3D appear larger relative to the background (giant effect) or farther objects are expanded non-linearly in depth (referred to here as an expansion effect). We use our model to analyze the mismatches of parameter pairs that lead to the various depth distortion effects.
We assume here that: 1) there is no rotation of the viewer's head relative to the screen; 2) stereoscopic images are captured by the parallel-axis method (with sensor shift) and are displayed on a flat screen. The camera image plane (the image plane perpendicular to the camera axes) and the screen image plane (the image plane on which the screen is located) are matched. Note that when the viewer's head is rotated with respect to the displayed images, or when stereoscopic images are captured by the convergence-axis (toe-in) method but displayed on a flat screen, additional geometric distortions may be introduced [2,3]. Moreover, as pointed out by [2], such distortions are accompanied by vertical disparities, resulting in no intersection between the two projection lines from the left and right eyes to a pair of onscreen points. Thus, one cannot use ray-intersection geometric models to predict geometric distortions in such situations. Therefore, head orientation mismatch and image plane mismatch, which also involve vertical disparities, require special handling and analysis and are outside the scope of the current paper.

The process of S3D imaging
In S3D viewing, captured objects at the convergence distance are displayed with zero disparity and perceived as if they are at the screen distance. The objects captured in front of the convergence distance (displayed in crossed disparity) are perceived as if they are in front of the screen, while objects captured behind the convergence distance (displayed in uncrossed disparity) are perceived behind the screen.
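This mapping from onscreen disparity to perceived depth follows from intersecting the two eyes' lines of sight through the onscreen points. Below is a minimal numerical sketch (our own illustration; the function name and example values are assumptions, not from the paper):

```python
# Perceived depth from horizontal onscreen disparity, via ray intersection.
# s_e: eye separation (m); d_s: screen distance (m);
# disparity: right-eye point minus left-eye point (uncrossed positive,
# crossed negative). Illustrative sketch only.
def perceived_depth(s_e, d_s, disparity):
    # Similar triangles give Z_p / d_s = s_e / (s_e - disparity)
    return s_e * d_s / (s_e - disparity)

s_e, d_s = 0.065, 2.0                            # 65 mm IPD, 2 m screen
z_zero = perceived_depth(s_e, d_s, 0.0)          # zero disparity -> on screen
z_crossed = perceived_depth(s_e, d_s, -0.01)     # crossed -> in front of screen
z_uncrossed = perceived_depth(s_e, d_s, 0.01)    # uncrossed -> behind screen
```

Running it confirms the three cases described above: zero disparity reproduces the screen distance, crossed disparity yields a nearer depth, and uncrossed disparity a farther one.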
S3D content acquisition (capture) involves a pair of cameras that are horizontally separated. For simplicity of derivation, we ignore lens distortions by assuming pinhole cameras, which are commonly implemented in virtual-world computer graphics rendering. For stereo image capture, two methods are commonly used: the converging-cameras method and the parallel-cameras method, as shown in Fig 1. In the converging-cameras method, also called toe-in (Fig 1a), the axes of the two cameras converge. The distance from the midpoint between the two cameras to the convergence point is called the camera convergence distance (d_c). Images captured in this way and presented on parallel displays (or a single stereo display) result in severe geometric distortion due to the projection difference. Thus this system is rarely used. In the parallel-cameras method (Fig 1b), the axes of the two cameras are parallel, making d_c infinite.
The parallel-cameras method captures all the objects in the scene in crossed disparities and, therefore, they are all perceived to be in front of the display screen. The parallel-cameras method thus compresses the full scene depth into the distance between the viewer and the screen. This is an example of an extreme mismatch between corresponding parameters (a pair) resulting in a large distortion of depth. In addition, parallel-cameras acquisition often results in large crossed disparities for close objects, which may exceed the viewer's binocular fusion range. To avoid this severe distortion and fusion limitation, the camera convergence distance has to be shortened, preferably to match the display viewing distance.
In real-world parallel-cameras capture, the convergence distance can be adjusted by horizontally shifting each camera's image sensor outward (i.e., left camera sensor to the left and right camera sensor to the right) compared to Fig 1b. This is referred to as 'sensor shift' and is equivalent to utilizing only the outer part of the full image sensors in Fig 2a. In computer graphics capture, the convergence distance can be adjusted by creating asymmetric camera frusta for the two virtual cameras (Fig 2a) to achieve off-axis projection [13], which has the same effect as the sensor shift in real-world capture. Another method is the image-cropping method used in postproduction. The left and right sides of the left and right captured images are cut out, as shown in Fig 2c. When the images are displayed on the screen without cropping sensors or images, the centers of the captured images (Center_L and Center_R in Fig 2c) are aligned to the screen center (i.e., the left image is shifted to the right and the right image to the left), resulting in infinite convergence distance (referring back to the capture process). One can reduce the convergence distance by horizontally shifting the displayed images back (left image to the left and right image to the right) in postproduction [8,14], then cropping the non-overlapping regions.

The convergence distance is the distance at which the convergence axes (called optical axes in [4]) of the two cameras intersect. The convergence axis is the projection line passing through the pinhole aperture to the center of the image sensor (either real or virtual).
The variables used in our geometric models are defined in Table 1. A left-handed Cartesian coordinate system xyz is used for both capture and display. For image capture, shown in Fig 3a, the origin is at the midpoint between the left and right cameras. The x-axis represents the inter-camera direction (i.e., the horizontal axis). The z-axis represents the direction in which the cameras are pointed (i.e., the depth axis). The y-axis is orthogonal to the xz-plane (i.e., the vertical axis). For image display, we assume that the viewer's head is primarily positioned in front of the center of the displayed images and does not rotate relative to them. As shown in Fig 3b, the origin is in front of the display center, at the midpoint between the left and right eyes. Eye positions are assumed to be at the entrance pupils. The x-axis represents the interocular direction to the right (i.e., the horizontal axis). The z-axis represents the direction from the origin to the display (i.e., the depth axis). The y-axis is orthogonal to the xz-plane (i.e., the vertical axis). The brown cube in Fig 3a is an example object in the original (virtual) world captured for display in S3D. The blue object in Fig 3b is the reconstructed (perceived) object in the S3D world. In the following illustrations, the brown cube center is at [0, 0, 3m]^⊺ in the original world, and the length of each side of the cube is 2m. Any difference between the corresponding features of the brown cube (Fig 3a) and the blue hexahedron (the reconstructed cube) (Fig 3b) represents geometric distortions introduced by parameter mismatches among the capture, display, and viewing processes. In subsequent simulations, the captured cube and reconstructed cube are superimposed on a single coordinate system to emphasize the distortions/differences between the original and reconstructed worlds.

S3D spatial distortion analysis
In this paper, the original world is captured by parallel cameras with the shifted-sensor method and then displayed on a real flat screen. Spatial distortions are introduced by the offset of the head position (T) and the mismatches between the three parameter pairs: 1) camera separation vs. eye separation, 2) camera frustum width at the convergence distance vs. screen width, 3) camera convergence distance vs. screen distance. Note that since changing the screen distance affects the screen FOV for real screen displays, we replace the ratio of the angular pair of camera FOV and screen FOV (k_f) with the ratio of the linear pair of camera frustum width at the convergence distance (i.e., w_c in Fig 2a) and screen width (k_w). This enables us to analyze the effects of screen size and distance separately. Fig 4 shows the diagrams used for the derivation of the geometric model. The transfer function from the original world coordinates [X_o, Y_o, Z_o]^⊺ to the reconstructed world coordinates [X_p, Y_p, Z_p]^⊺ can be expressed as

$$\begin{bmatrix} X_p \\ Y_p \\ Z_p \end{bmatrix} = \frac{1}{(k_s - k_w)\,Z_o + k_w d_c} \begin{bmatrix} k_s k_w d_c X_o - k_w (Z_o - d_c)\,T_x \\ k_s k_w d_c Y_o - k_w (Z_o - d_c)\,T_y \\ k_s k_d d_c Z_o - k_w (Z_o - d_c)\,T_z \end{bmatrix} \qquad (1)$$

where T = [T_x, T_y, T_z]^⊺ is the offset of the head position from the origin; k_s = s_e/s_c is the ratio of eye separation to camera separation; k_w = w_s/w_c is the ratio of screen width to camera frustum width at the convergence distance; and k_d = d_s/d_c is the ratio of screen distance to camera convergence distance. See the Appendix for the derivation. Note that the transfer functions for the x and y components are equal, while the z component differs. This indicates that the amounts of distortion along the horizontal and vertical directions (the x and y axes) are the same, while the amount of distortion along the depth direction (the z axis) may be different.
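As a numerical sketch, the transfer function can be implemented in a few lines. This is our own illustration (the function and variable names are not from the paper), written to be consistent with the simplified behaviors described in the text: matched parameters reproduce points exactly, and equal ratios produce a uniform scaling of the world:

```python
def reconstruct(Xo, Yo, Zo, dc, ks=1.0, kw=1.0, kd=1.0, T=(0.0, 0.0, 0.0)):
    """Map an original-world point to the reconstructed S3D world.

    dc: camera convergence distance; ks = s_e/s_c, kw = w_s/w_c, kd = d_s/d_c;
    T: head offset from the designated position. Illustrative sketch only.
    """
    den = (ks - kw) * Zo + kw * dc   # shared denominator for all components
    shift = kw * (Zo - dc)           # coupling of head offset with object depth
    Xp = (ks * kw * dc * Xo - shift * T[0]) / den
    Yp = (ks * kw * dc * Yo - shift * T[1]) / den
    Zp = (ks * kd * dc * Zo - shift * T[2]) / den
    return Xp, Yp, Zp

# Matched parameters (orthoscopic condition) reproduce the point exactly:
point = reconstruct(1.0, 0.5, 4.0, dc=3.0)          # -> (1.0, 0.5, 4.0)
# Equal ratios scale the world uniformly (distortion-free reproduction):
scaled = reconstruct(1.0, 0.5, 4.0, dc=3.0, ks=0.8, kw=0.8, kd=0.8)
```

Each of the single-mismatch analyses below corresponds to calling such a routine with only one ratio (or the head offset T) departing from its matched value.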
The transfer function is a function of the camera convergence distance d_c, the three ratios (k_s, k_w, k_d) representing the three types of mismatch, and the head position offset T. When the three paired parameters are matched and there is no head translation, i.e., k_s = 1, k_w = 1, k_d = 1, and T = [0, 0, 0]^⊺, Eq (1) simplifies to

$$[X_p, Y_p, Z_p]^\top = [X_o, Y_o, Z_o]^\top$$

This indicates that if the corresponding parameter pairs of the capture and display systems are matched, an orthoscopic display condition is achieved, and any point in the original world will be reconstructed exactly where it should be during S3D viewing.

[Fig 4 caption (fragment): ... which can be calculated from the two similar triangles of different height (blue). The points S_l1 and S_r1 are displayed at S_l2 and S_r2. (c) The captured realigned images are scaled to fill the display screen. The points S_l2 and S_r2 at the screen distance are changed to S_l and S_r on the screen. Viewers see the left and right points (S_l and S_r) on the screen through the left and right eyes (E_l and E_r), respectively. The intersection point P of the two lines from each eye (E_l and E_r) to the corresponding onscreen point (S_l and S_r) is the expected perceived position of the point O from (a). Note that when d_s < d_c, the point P is displayed closer to the observer in the reconstructed world than in the original world.]
Since the viewer cannot see objects behind the viewer, depth coordinates in the reconstructed world should always be positive (Z_p > 0). When k_s < k_w and Z_o > k_w d_c/(k_w − k_s) (i.e., for depths farther than k_w d_c/(k_w − k_s)), Z_p is negative. In this case, the two projection lines (from the two eyes to the two onscreen points) intersect behind the viewer because the (uncrossed) disparity of the onscreen points is larger than the viewer's IPD. Depending on how large the angular disparity is, the viewer may perceive the object at a far distance or fail to fuse it (double vision).
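The threshold depth quoted above can be computed directly from the ratios. A small sketch (our own; function name is illustrative):

```python
import math

def far_depth_limit(ks, kw, dc):
    """Original-world depth beyond which the two projection lines intersect
    behind the viewer (Z_p < 0). Finite only when ks < kw."""
    if ks < kw:
        return kw * dc / (kw - ks)
    return math.inf

# Example: a viewer whose IPD is 90% of the camera separation (ks = 0.9),
# matched widths (kw = 1), convergence distance 3 m -> limit of 30 m.
limit = far_depth_limit(ks=0.9, kw=1.0, dc=3.0)
```

Objects captured beyond this limit cannot be reconstructed by ray intersection in front of the viewer.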
Since the position of the onscreen points is independent of the screen distance (d_s), changing the screen distance does not change the linear screen disparity (Eq (27)).
In the following sections, we discuss the effect of each parameter-pair mismatch and of head translations in isolation, assuming that the other paired parameters are matched.

Effect of different eye separations
This analysis assumes that the screen distance and camera convergence distance are the same (k_d = 1), the screen width and camera frustum width at the convergence distance are the same (k_w = 1), and the camera convergence distance is constant (d_c = 3m), while the head is at the optimal position (T = [0, 0, 0]^⊺). Only camera separation and eye separation are mismatched, due to individual users' IPD variations. In this condition, the transfer function (1) simplifies to

$$[X_p, Y_p, Z_p]^\top = \frac{k_s d_c}{(k_s - 1)\,Z_o + d_c}\,[X_o, Y_o, Z_o]^\top$$

If k_s < 1 (i.e., the viewer's IPD is smaller than the camera separation), object depths Z_o should be smaller than d_c/(1 − k_s); otherwise the point P falls behind the observer, as discussed above. Fig 5 shows simulations of a cube captured with a camera separation (s_c) of 63mm, which is a recommended camera separation for S3D movie making [15], while eye separation is that of a small child, 50mm (k_s = 0.79 < 1, Fig 5a), or of an adult with widely separated eyes, 75mm (k_s = 1.19 > 1, Fig 5b). The vast majority of adults have IPDs in the range of 50mm to 75mm, and the mean adult IPD is around 63mm [16].
When eye separation is smaller than camera separation (k_s < 1), the reconstructed cube (i.e., the blue hexahedron) appears expanded in depth (Fig 5a). The portion in front of the screen is narrower, while the portion behind the screen is wider than it is supposed to be in the orthoscopic condition. When eye separation is larger than camera separation (k_s > 1), the reconstructed cube appears compressed (Fig 5b): the portion in front of the screen becomes wider and the portion behind the screen becomes narrower. The results in Fig 5 differ from the results in [2] (see Fig 1A and 1I in the Appendix of [2]). In our simulations, onscreen points stay on the screen when eye separation and camera separation are mismatched. The explanation for this discrepancy is presented in the discussion. Fig 6 shows the change in relative size along the x and y axes (Fig 6a) and in relative depth along the z axis (Fig 6b) between the original world and the reconstructed world, as functions of the depth Z_o in the original world. The relative size and depth can be expressed as

$$\frac{X_p}{X_o} = \frac{Y_p}{Y_o} = \frac{Z_p}{Z_o} = \frac{k_s d_c}{(k_s - 1)\,Z_o + d_c}$$

Note that the equations and the plots for the X, Y, and Z dimensions are the same, resulting in the same change in all dimensions. This is because, when changing the eye separation, the intersection of the two projection lines (from the left and right eyes to the left and right onscreen points) always lies on the line passing through the origin (the midpoint between the two eyes) and the center of the onscreen points. The ratios of the x, y, and z components of any two points on a line passing through the origin are the same. In these plots, the black dotted horizontal lines represent the orthoscopic condition (i.e., a reconstruction without geometric distortion) when eye separation and camera separation are matched (in addition to the other matched parameters). The area below the black horizontal line represents compression and the area above it represents expansion.
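The behavior just described can be checked numerically. Below is a brief sketch (ours; names are illustrative) of the shared size/depth ratio for the eye-separation mismatch, confirming unity at the screen distance, compression in front of it, and expansion behind it when k_s < 1:

```python
def relative_scale_ipd(Zo, ks, dc):
    # X_p/X_o = Y_p/Y_o = Z_p/Z_o for the eye-separation mismatch (kw = kd = 1)
    return ks * dc / ((ks - 1.0) * Zo + dc)

dc = 3.0
ks_child = 50.0 / 63.0   # 50 mm IPD viewing content captured at 63 mm separation

# Objects at the screen/convergence distance are unchanged; with ks < 1,
# nearer objects are compressed and farther objects are expanded.
at_screen = relative_scale_ipd(3.0, ks_child, dc)
nearer = relative_scale_ipd(2.0, ks_child, dc)
farther = relative_scale_ipd(4.0, ks_child, dc)
```

This matches the observation above that onscreen points (objects at the screen distance) stay unchanged while distortion grows away from the screen.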
Fig 6a represents relative size change (i.e., xy-dimension) along the depth direction. When eye separation is smaller than camera separation (k s < 1), reconstructed objects in front of the screen appear smaller and objects behind the screen appear larger in size. The amount of compression and expansion increases non-linearly as objects are farther from the screen location (red/yellow solid line in Fig 6a). When eye separation is larger than camera separation (k s > 1), objects in front of the screen appear expanded and objects behind the screen appear compressed (blue/green dashed line in Fig 6a). The effect is more dramatic in a smaller IPD condition than a larger IPD condition. A smaller camera separation (e.g., s c = 60mm) decreases the distortions and allows a larger asymptotic limit (yellow lines in Fig 6a), yet it has a relatively small increase in distortions for larger IPD users (green dashed line in Fig 6a).
Fig 6b represents relative depth change (i.e., z-dimension) along the depth direction. The area below and above the horizontal line represents objects being closer and farther than where they are in the original world, respectively (Fig 6b). When eye separation is smaller than camera separation (k s < 1), reconstructed objects in front of and behind the screen appear closer and farther, respectively. The amount of depth distortion increases non-linearly as objects are farther from the screen location (red/yellow solid line in Fig 6b). When eye separation is larger than camera separation (k s > 1), objects in front of the screen appear farther and objects behind the screen appear closer (blue/green dashed line in Fig 6b).
The red/yellow dotted lines are the asymptotes (i.e., Z_o = d_c/(1 − k_s)) of the red/yellow curves when eye separation is smaller than camera separation. For objects at the depth of the asymptote (and beyond), onscreen uncrossed disparities become larger than the viewer's IPD. In this case, viewers may not be able to fuse them, even when trying to fixate on those objects, and perceive double vision. Note that in real-world conditions, when a viewer gazes at a near object, a farther object becomes doubled, but when the viewer gazes at the farther object, it becomes fused (single) and the near object becomes doubled. However, in the reconstructed world, objects beyond the asymptote distance cannot be fused even if the viewer gazes at them. Thus, this distance represents a practical limit on the distance in the original world that can be reconstructed veridically in S3D with unmatched eye/camera separation parameters (see further discussion of this property below in the section 'Avoid large screen disparity').

Effect of different screen sizes
Here we assume that only the screen width and the camera frustum width at the convergence distance are mismatched (i.e., k_s = 1, k_d = 1, and T = [0, 0, 0]^⊺) and that the camera convergence distance is constant (d_c = 3m). Under this assumption, the ratio between screen FOV and camera FOV (k_f) becomes the same as the ratio between screen width and camera frustum width, i.e., k_f = k_w/k_d = k_w. The transfer function (1) simplifies to

$$\begin{bmatrix} X_p \\ Y_p \\ Z_p \end{bmatrix} = \frac{1}{(1 - k_w)\,Z_o + k_w d_c} \begin{bmatrix} k_w d_c X_o \\ k_w d_c Y_o \\ d_c Z_o \end{bmatrix}$$

If k_w > 1 (screen width is larger than camera frustum width at the convergence distance), the depth should be Z_o < k_w d_c/(k_w − 1); for farther Z_o the point P falls behind the observer. When screen width is smaller than camera frustum width (k_w < 1), the cube appears smaller, and the farther portion is more compressed than the closer portion, as shown in Fig 7a. When screen width is larger than camera frustum width (k_w = 1.23 > 1), the cube appears bigger, and the farther portion is more expanded than the closer portion, as shown in Fig 7b. Since we assume the camera convergence distance and screen distance are matched, the reconstructed cube stays centered at the screen distance. The relative size in the xy-dimensions and the relative depth in the z-dimension can be expressed as

$$\frac{X_p}{X_o} = \frac{Y_p}{Y_o} = \frac{k_w d_c}{(1 - k_w)\,Z_o + k_w d_c} \quad \text{and} \quad \frac{Z_p}{Z_o} = \frac{d_c}{(1 - k_w)\,Z_o + k_w d_c}$$

respectively. When screen width is smaller (red solid lines) or larger (blue dashed lines) than camera frustum width, the relative size becomes smaller or larger than 1, suggesting that the reconstructed world appears compressed or expanded, respectively (Fig 8a). In terms of depth, when screen width is smaller than camera frustum width (red solid lines), the reconstructed world behind the screen suffers progressive compression while the world in front of the screen is expanded; when screen width is larger (blue dashed lines), the pattern reverses (Fig 8b). Note that objects located at the screen distance are largely unaffected by depth distortion, but are still affected by size distortion. The blue dotted lines are the asymptotes (i.e., Z_o = k_w d_c/(k_w − 1)) of the blue curves when screen width is larger than camera frustum width.
Again, the viewer may not be able to fuse objects farther than the asymptote and perceive double vision.
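A short sketch (our own illustration; names are ours) of the simplified mapping for a screen-width mismatch demonstrates the key property noted above: a point at the screen/convergence distance keeps its depth but its lateral size is scaled by k_w:

```python
def reconstruct_kw(Xo, Yo, Zo, kw, dc=3.0):
    # Only screen width vs. camera frustum width mismatched (ks = kd = 1)
    den = (1.0 - kw) * Zo + kw * dc
    return kw * dc * Xo / den, kw * dc * Yo / den, dc * Zo / den

# A point at the screen/convergence distance: depth preserved, size scaled.
x, y, z = reconstruct_kw(1.0, 1.0, 3.0, kw=1.23)
```

Here z stays at 3.0 (no depth distortion at the screen) while x and y are scaled by roughly 1.23 (size distortion).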

Effect of changing screen distance
This analysis assumes that only the camera convergence distance and screen distance are mismatched (i.e., k_s = 1, k_w = 1, and T = [0, 0, 0]^⊺), where the camera convergence distance is constant (d_c = 3m). The transfer function (1) simplifies to

$$[X_p, Y_p, Z_p]^\top = [X_o, Y_o, k_d Z_o]^\top \qquad (5)$$

Eq (5) shows that changing the screen distance affects the depth (the z-dimension) but does not affect the size (the xy-dimensions).
When the screen distance is closer (red solid lines) or farther (blue dashed lines) than the convergence distance, the relative size does not change, suggesting that linear size is independent of the screen distance (Fig 10a). In terms of depth, when the screen distance is closer (red solid lines) or farther (blue dashed lines) than the convergence distance, the reconstructed world appears closer and compressed, or farther and expanded, respectively (Fig 10b).
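This property can be verified numerically against the ray-intersection model. Below is a sketch (ours; names are illustrative) showing that with k_s = k_w = 1, lateral coordinates pass through unchanged while depth is scaled by k_d at every distance:

```python
def reconstruct(Xo, Yo, Zo, dc, ks=1.0, kw=1.0, kd=1.0):
    # Centered-head form of the transfer function (illustrative sketch)
    den = (ks - kw) * Zo + kw * dc
    return (ks * kw * dc * Xo / den,
            ks * kw * dc * Yo / den,
            ks * kd * dc * Zo / den)

# With ks = kw = 1 the denominator reduces to dc, so Xp = Xo, Yp = Yo,
# and Zp = kd * Zo: size unaffected, depth uniformly scaled by kd = ds/dc.
for Zo in (1.0, 2.0, 4.0, 8.0):
    Xp, Yp, Zp = reconstruct(1.0, 1.0, Zo, dc=3.0, kd=0.5)
    assert abs(Xp - 1.0) < 1e-9 and abs(Zp - 0.5 * Zo) < 1e-9
```

Halving the screen distance relative to the convergence distance (k_d = 0.5) therefore brings the whole reconstructed world closer and compresses its depth linearly.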

Effect of changing camera convergence distance
This analysis assumes that only the camera convergence distance varies at a given screen distance (d_s = 3m), while the other parameter pairs are matched (k_s = 1, k_f = 1, T = [0, 0, 0]^⊺). Since k_f = k_w/k_d = 1 implies k_w = k_d, the transfer function (1) simplifies to

$$[X_p, Y_p, Z_p]^\top = \frac{d_s}{(1 - k_d)\,Z_o + d_s}\,[X_o, Y_o, Z_o]^\top$$

Fig 11 shows the 3D simulations when the convergence and screen distances are mismatched. When the camera convergence distance is smaller than the screen distance (d_c = 2.44m), the reconstructed cube appears pushed farther away and larger (Fig 11a). When the convergence distance is larger than the screen distance (d_c = 3.7m), the reconstructed cube appears smaller and closer (Fig 11b). In both cases, more expansion/compression occurs at farther distances. Fig 12 shows the relative size and depth change compared to the orthoscopic condition. When the camera convergence distance is shorter or longer than the screen distance, the size of the object appears expanded (red solid line) or compressed (blue dashed line), respectively (Fig 12a). In terms of depth, objects appear farther (red solid line) or closer (blue dashed line) to the viewer, respectively (Fig 12b). The red dotted lines are the asymptotes of the red curves. When the camera convergence distance is smaller than the screen distance, the viewer may not be able to fuse objects farther than the asymptote and may see double vision.
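Since k_w = k_d here, the mapping reduces to a single per-depth scale factor applied equally to x, y, and z. A sketch (our own; using the convergence distances from the simulations above):

```python
def depth_scale(Zo, dc, ds):
    # Per-depth uniform scale factor when only the convergence distance is
    # mismatched (ks = 1, kf = 1, hence kw = kd = ds/dc); applies to x, y, z.
    kd = ds / dc
    return ds / ((1.0 - kd) * Zo + ds)

ds = 3.0
closer_convergence = depth_scale(3.0, dc=2.44, ds=ds)  # > 1: larger and farther
farther_convergence = depth_scale(3.0, dc=3.7, ds=ds)  # < 1: smaller and closer
matched = depth_scale(2.0, dc=3.0, ds=3.0)             # = 1: orthoscopic
```

Because the scale factor varies with Z_o, foreground and background objects are scaled differently, which is the root of the puppet-theater and cardboard effects discussed next.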
The amount of geometric distortion (both the size and depth ratios of the reconstructed object to the original-world object) monotonically increases as the distance from the viewer increases. When the amount of compression progressively increases along the depth direction, objects become thinner (in the depth direction). Generally, the effect is more severe for distant objects. A distant object appears flat, demonstrating the cardboard effect [10,12] (Fig 12b).
Since objects in the foreground and background (i.e., at different depths) are scaled by different ratios, the viewer will perceive objects as miniaturized (i.e., the puppet-theater effect [10,11]) or enlarged. The mismatch between screen distance and camera convergence distance results in a perceptual distortion called the Alice in Wonderland syndrome [17]. Observers with this syndrome experience various size and depth distortions, such as micropsia (objects are perceived to be smaller than they actually are), macropsia (objects are perceived to be bigger than they actually are), peliopsia (objects are perceived to be closer than they actually are), and teliopsia (objects are perceived to be farther than they actually are).
An extreme case worth further discussion is where the convergence distance is infinite, i.e., the cameras are placed in parallel and the convergence distance is not adjusted. In this case, as d_c → ∞ (and thus k_d → 0 with k_d d_c = d_s), the reconstructed world follows the transfer function

$$[X_p, Y_p, Z_p]^\top = \frac{d_s}{Z_o + d_s}\,[X_o, Y_o, Z_o]^\top$$

Fig 13 shows simulations of cubes centered at [0, 0, 3m]^⊺ (Fig 13a) and [0, 0, 5m]^⊺ (Fig 13b). In both cases, the apparent objects are in front of the screen (all in crossed disparity) and become smaller and closer. The compression of depth is more severe for the farther cube (Fig 13b) because the depth at all distances (including infinite distance) is compressed into the space between the viewer and the screen. As a result, the cardboard effect becomes amplified for distant objects.
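In this limit the mapping has a particularly simple form. A sketch (our illustration; names are ours) showing that all reconstructed depths land in front of the screen and that depth compression grows with distance:

```python
def reconstruct_parallel(Xo, Yo, Zo, ds):
    # Limit of the model as dc -> infinity (parallel cameras, unadjusted
    # convergence): uniform per-depth scale of ds / (Zo + ds).
    s = ds / (Zo + ds)
    return s * Xo, s * Yo, s * Zo

ds = 3.0
# Depth ratio Zp/Zo shrinks with distance (amplified cardboard effect),
# and Zp never reaches the screen distance ds.
near_ratio = reconstruct_parallel(0.0, 0.0, 2.0, ds)[2] / 2.0
far_ratio = reconstruct_parallel(0.0, 0.0, 100.0, ds)[2] / 100.0
far_depth = reconstruct_parallel(0.0, 0.0, 100.0, ds)[2]
```

Even a point 100 m away is reconstructed just short of the 3 m screen, illustrating why unadjusted parallel capture flattens distant scenery so severely.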

Distortion-free scaled reproduction
In Eq (1), if the three ratios of screen width to camera frustum width (k_w), screen distance to camera convergence distance (k_d), and eye separation to camera separation (k_s) are the same (k_w = k_d = k_s), and there is no head position offset, the three ratios can be denoted k, and the transfer function (1) simplifies to

$$[X_p, Y_p, Z_p]^\top = k\,[X_o, Y_o, Z_o]^\top$$

In this case, the xyz dimensions are scaled by the same ratio at all depths, so that the reconstructed world is an undistorted but scaled version of the original world. Fig 14 shows examples of the 3D simulations when the ratio k is smaller (k = 0.79) or larger (k = 1.19) than in the orthoscopic condition (k = 1). When the ratio is smaller than 1, the cube appears smaller and closer, as shown in Fig 14a. When the ratio is larger than 1, the cube appears larger and farther, as shown in Fig 14b. However, the reproduced object's shape remains a cube, as presented in the original world. Fig 15 shows the relative size and depth change as a function of the depth in the real world. The relative size and depth can be expressed as

$$\frac{X_p}{X_o} = \frac{Y_p}{Y_o} = \frac{Z_p}{Z_o} = k$$

When the ratio is smaller (red solid lines) or larger (blue dashed lines) than 1, both the relative size and depth are scaled by the same ratio (Fig 15a and 15b, respectively).
Note that since the reconstructed world is only scaled, not distorted, in this condition, it provides a way to remove geometric distortions in S3D: adjust the variables so that the ratios of the parameter pairs are equal.
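Since Eq (1) itself is not reproduced in this excerpt, the following sketch implements the standard ray-intersection geometry using the quantities defined above (shifted-sensor cameras converging at d_c, no head offset); the function and argument names are our own. It lets one verify numerically that when k_w = k_d = k_s = k, the reconstructed point is exactly k times the original.

```python
import math

def reconstruct(X, Y, Z, s_c, d_c, a_ch, s_e, d_s, w_s):
    """Ray-intersection sketch: capture with shifted-sensor cameras converging
    at distance d_c, display on a screen of width w_s at distance d_s, view
    with eye separation s_e. Head offset is omitted."""
    # Project the point onto the convergence plane (z = d_c) through each camera.
    xL = -s_c / 2 + (X + s_c / 2) * d_c / Z
    xR =  s_c / 2 + (X - s_c / 2) * d_c / Z
    y  = Y * d_c / Z
    # Scale from the camera frustum width at d_c to the physical screen width.
    w_c = 2 * d_c * math.tan(a_ch / 2)   # camera frustum width at d_c
    k_w = w_s / w_c
    xL, xR, y = k_w * xL, k_w * xR, k_w * y
    # Intersect the rays from the two eyes through the two onscreen points.
    p = xR - xL                          # onscreen disparity (uncrossed > 0)
    Zr = s_e * d_s / (s_e - p)
    Xr = -s_e / 2 + (Zr / d_s) * (xL + s_e / 2)
    Yr = y * Zr / d_s
    return Xr, Yr, Zr
```

For example, with k = 0.8 (s_c = 60mm, s_e = 48mm, d_c = 2m, d_s = 1.6m, and w_s = 0.8·w_c), a point at (0.3, 0.2, 3.0)m reconstructs at (0.24, 0.16, 2.4)m, i.e., the original scaled by k.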

Effect of head translations
This analysis assumes that the three paired parameters are matched (k_s = 1, k_w = 1, k_d = 1). In this condition, transfer function (1) simplifies to a pure shear determined by the head offset: [X_r, Y_r, Z_r]⊺ = [X_o + T_x(1 − Z_o/d_s), Y_o + T_y(1 − Z_o/d_s), Z_o]⊺. Fig 16 shows 3D simulations of this shear. When the viewer's head translates to the left (T_x = −1.5m) or to the right (T_x = 1.5m), the cubes are sheared to the left (Fig 16a) and right (Fig 16b), respectively. Similarly, when the viewer's head translates downward (T_y = −1.5m) or upward (T_y = 1.5m), the cubes are sheared downward (Fig 16c) and upward (Fig 16d), respectively. When the viewer's head translates backward or forward, the distortion is the same as moving the screen farther or closer, as discussed in the section 'Effect of changing screen distance': the cubes are expanded away from the screen (Fig 9b) or compressed towards the screen (Fig 9a), respectively.
Overall, the part of the displayed cube in front of the screen moves in the same direction as the head translation, and the part behind the screen moves in the opposite direction; onscreen points stay on the screen unchanged. Thus, the cube always appears to follow the head movements while maintaining the fronto-parallel orientation of its front and back surfaces. When the viewer's head translates laterally (i.e., leftward, rightward, downward, or upward), our model predicts shearing distortions towards the viewer's position. The distortion is especially apparent while the viewer is in motion. Backward or forward head movements are equivalent to changing the screen distance, so the resulting distortion patterns are those analyzed in the section 'Effect of changing screen distance'.
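The shear described above can be sketched as follows (our own illustration, assuming the matched condition k_s = k_w = k_d = 1 with d_c = d_s and s_c = s_e): the reconstruction seen from a laterally translated head displaces each point by T·(1 − Z/d_s).

```python
import math

def view_with_head_offset(X, Y, Z, s_e, d_s, Tx, Ty):
    """Orthoscopic capture/display (k_s = k_w = k_d = 1, d_c = d_s, s_c = s_e)
    viewed with the head translated laterally by (Tx, Ty)."""
    # Onscreen image points for the orthoscopic configuration.
    xL = -s_e / 2 + (X + s_e / 2) * d_s / Z
    xR =  s_e / 2 + (X - s_e / 2) * d_s / Z
    y  = Y * d_s / Z
    # Intersect the rays from the translated eyes through the onscreen points.
    p = xR - xL                     # disparity is unchanged, so depth is too
    Zr = s_e * d_s / (s_e - p)      # Zr == Z
    Xr = (Tx - s_e / 2) + (Zr / d_s) * (xL + s_e / 2 - Tx)
    Yr = Ty + (Zr / d_s) * (y - Ty)
    return Xr, Yr, Zr
```

A point in front of the screen (Z < d_s) moves with the head, a point behind the screen moves against it, and onscreen points (Z = d_s) stay put; algebraically, Xr = X + Tx·(1 − Z/d_s).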

Guidelines for S3D imaging content development
The results of our analyses suggest guidelines that may eliminate or minimize geometric distortion for content developers and users. These are explicitly developed below.

Avoid large screen disparity
As analyzed above, when the ratio of eye separation to camera separation is larger than the ratio of screen width to camera frustum width at the convergence distance (k_s > k_w), the reconstructed world becomes increasingly compressed (in both size and depth) at larger depths (see the blue curve in Fig 6 and the red line in Fig 10). In these conditions, the cardboard effect may affect distant objects. In contrast, when the ratio of eye separation to camera separation is smaller than the ratio of screen width to camera frustum width at the convergence distance (k_s < k_w), the reconstructed world is increasingly expanded at larger depths (see the red curve in Fig 6 and the blue line in Fig 10). In these conditions, the effect is the opposite of the cardboard effect; we call it the expansion effect. More critically, the depth in the real world then has an asymptotic limit (i.e., when k_s < k_w, Z_o < k_w·d_c/(k_w − k_s)). Objects at depths beyond this limit are presented with uncrossed screen disparities so large that the viewer may not be able to fuse them, even when fixating them. When eye separation is smaller than camera separation (k_s < k_w = 1), a smaller camera separation yields a larger asymptotic limit, as shown in Fig 17a. In addition, when only screen width is larger than camera frustum width (1 = k_s < k_w), a larger camera FOV or a larger camera convergence distance (i.e., a larger camera frustum width) yields a larger asymptotic limit, as shown in Fig 17b and 17c. Therefore, for S3D producers, a smaller camera separation, a larger camera convergence distance, or a larger camera FOV is recommended, so that k_s = s_e/s_c ≥ [d_s·tan(α_sh/2)] / [d_c·tan(α_ch/2)] = k_w, avoiding large uncrossed screen disparities.
In the following examples, we assume that the camera convergence distance equals the screen distance (d_c = d_s) and consider four screen-distance options: d_s = 0.3m (mobile phone/tablet viewing distance), d_s = 1m (desktop monitor viewing distance), d_s = 3m (TV viewing distance), and d_s = 10m (movie theater viewing distance). Fig 18 shows the relative depth for the four viewing conditions when eye separation is smaller than camera separation (e.g., s_e = 50mm and s_c = 63mm, where k_s < k_w = 1). The four dotted vertical lines in Fig 18 are the asymptotic limits corresponding to the four convergence-distance conditions. When the camera convergence distance equals the screen distance (k_d = 1), a larger screen distance yields a larger fusible limit on the original-world distance.
When the depth composition of the original world has an asymptotic limit (i.e., k_s < k_w), it is not desirable to model objects at depths beyond that limit (Z_o = k_w·d_c/(k_w − k_s)). For S3D graphic rendering of a virtual world, the far plane of the virtual camera frustum can be placed at or slightly beyond the asymptotic limit. Any objects farther than the far plane (e.g., mountains, clouds, or buildings) can be projected onto the far plane as a 2D image (texture), which makes them appear at an infinite distance. Limiting the original virtual world to the asymptotic depth not only reduces render time but also avoids the problem of large uncrossed screen disparity. For example, as shown in Fig 18, the asymptotic limit of the red curve (s_e = 50mm, s_c = 63mm, and d_c = 3m) is 14.3m. We define the far plane of the camera frustum at 14.3m and project objects beyond that distance onto the far plane as a 2D image, so that objects farther than 14.3m in the original world are perceived at an infinite distance.
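The far-plane placement can be computed directly from the asymptotic-limit expression; a minimal sketch (the function name is ours):

```python
import math

def asymptotic_limit(k_s, k_w, d_c):
    """Farthest original-world depth that reconstructs to a finite, fusible
    distance. Infinite (no limit) when k_s >= k_w."""
    if k_s >= k_w:
        return math.inf
    return k_w * d_c / (k_w - k_s)

# Fig 18 example: s_e = 50mm, s_c = 63mm, d_c = 3m, k_w = 1.
far_plane = asymptotic_limit(0.050 / 0.063, 1.0, 3.0)   # ~14.5m
```

Rounding k_s to 0.79 before dividing, as the text does, gives the quoted 14.3m (3/0.21 ≈ 14.29m).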

Correct geometric distortions
As discussed in the section 'Distortion-free scaled reproduction', under many conditions it may be possible to eliminate geometric distortions in S3D by matching the ratios of the parameter pairs (instead of individually matching all paired parameters). Under these conditions, the reconstructed world is merely scaled relative to the original world, without distortion (Fig 14).
To equate the three ratios, we need to match the screen FOV with the camera FOV by adjusting the screen distance, and match the distance ratio with the separation ratio by adjusting the convergence distance (i.e., α_sh = α_ch and s_e/s_c = d_s/d_c, resulting in k_w = k_d = k_s). Users can adjust the screen distance by moving closer to or farther from the screen, and adjust the camera convergence distance by shifting the left and right views horizontally (e.g., increasing/decreasing convergence in NVIDIA 3D Vision [18] or the '3D depth slider' on the Nintendo 3DS [19]). When the screen distance is adjusted first, the distortions from the FOV mismatch are eliminated (turning into a combination of Figs 5 and 11), and the remaining size scaling at different depths is then removed by adjusting the convergence distance (turning into Fig 14). When the convergence distance is adjusted first, the size scaling at different depths is eliminated (turning into Fig 9), and the remaining depth compression or expansion is then eliminated by adjusting the screen distance (turning into Fig 14). A more interesting approach is to combine different distortion patterns so that they compensate for each other. In real-world viewing conditions, eye separation is fixed for each individual viewer, and camera separation is usually set during production. Our model shows how to correct the distortions caused by a mismatch between eye separation and camera separation. For example, when eye separation is smaller than camera separation (s_e < s_c), farther objects appear larger and farther away (Fig 5a). If this distortion is combined with one where the convergence distance is larger than the screen distance (d_c > d_s) (Fig 11b), the two geometric distortions compensate for each other. This compensation can yield a distortion-free (up to scaling) reproduction of the original world's depth structure (i.e., the case α_sh = α_ch and s_e/s_c = d_s/d_c).
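The two-step adjustment can be sketched as a small solver (our own illustration; the function and variable names are assumptions): given the fixed eye separation, camera separation, camera FOV, and physical screen width, it returns the screen distance that matches the FOVs and the convergence distance that matches the ratios.

```python
import math

def matching_distances(s_e, s_c, a_ch, w_s):
    """Screen distance d_s such that the screen FOV equals the camera FOV
    (a_sh = a_ch), and convergence distance d_c such that s_e/s_c = d_s/d_c.
    Together these give k_w = k_d = k_s (distortion-free scaled reproduction)."""
    d_s = (w_s / 2) / math.tan(a_ch / 2)   # a_sh = a_ch
    d_c = d_s * s_c / s_e                  # k_d = s_e/s_c = k_s
    return d_s, d_c
```

For instance, with s_e = 64mm, s_c = 60mm, a 60° camera FOV, and a 1.1m-wide screen, the viewer should sit about 0.95m from the screen with the convergence distance set to about 0.89m; all three ratios then equal 64/60 ≈ 1.07.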
Delivering a scaled but undistorted 3D structure may be sufficient for conveying the scene information [3]. Note that the ability to mix and match the available parameters to control the various distortions is particularly important because, in many cases, S3D content production and consumption are two independent processes: the production side cannot control the consumer's display conditions, and the consumer has only limited control, since the parameters of the production process have already been set.
In some cases, the ability to adjust screen distance is constrained. For instance, the distance from the viewer to a TV cannot exceed the length of the living room, and a laptop cannot be placed too close to the viewer, since the screen would then be difficult to focus on. In such situations, size distortions (in the xy-dimensions) can be corrected by adjusting the convergence distance (i.e., making k_w = k_s). However, an incorrect screen distance causes a mismatch between camera FOV and screen FOV, so depths in the reconstructed world may be compressed or expanded. Such depth distortions can be eliminated by scaling the onscreen images so that the displayed images' FOV matches the camera FOV. For example, when the viewer cannot move far enough from the TV, one can scale down the onscreen images and use only part of the screen; when a laptop cannot be brought close enough, one can scale up the images and display only part of them on the screen. Finally, to eliminate the geometric distortions caused by head translations, the viewer's head needs to stay in front of the screen center (image center), or the viewer's head position needs to be tracked and the corresponding parameter adjustments applied so that the reconstructed world is not sheared.
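When the screen distance is fixed by the room, the same ratio matching can be sketched as rescaling the displayed image instead (our own illustration, under the same assumptions as above; names are ours):

```python
import math

def constrained_correction(s_e, s_c, d_s, a_ch, w_screen):
    """With the screen distance d_s fixed, choose the convergence distance so
    that k_d = k_s, and rescale the displayed image so that its FOV matches
    the camera FOV (restoring k_w = k_d = k_s)."""
    d_c = d_s * s_c / s_e                   # k_d = d_s/d_c = s_e/s_c = k_s
    w_disp = 2 * d_s * math.tan(a_ch / 2)   # displayed width giving a_sh = a_ch
    scale = w_disp / w_screen               # < 1: use part of the screen; > 1: crop
    return d_c, scale
```

With a 3m-wide screen stuck at d_s = 2m, a 60° camera FOV, s_e = 64mm, and s_c = 60mm, the image should be scaled to about 77% of the screen width and the convergence distance set to about 1.88m.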

Discussion
Our geometric model of S3D imaging and other models in the literature [2,4] are fundamentally identical, since they all veridically represent the capture and display processes. The advantage of our model lies in its format, which supports a more intuitive understanding of the relations between the various parameters and their impacts on geometric distortions. In S3D capture and display, various mismatches and distortions may combine. Our model, as presented in transfer function (1), provides an intuitive tool for understanding the impact of each parameter mismatch on the distortion, as well as their possible interactions. This isolated cause-and-effect knowledge of the distortion patterns leads to a useful, if perhaps obvious, conclusion: to eliminate the geometric distortions, all mismatches should be minimized. Specifically, for applications where the exact size of the scene is important (e.g., teleoperation), it may be necessary to achieve an orthoscopic projection (i.e., k_s = 1, k_w = 1, and k_d = 1). In most other applications, distortion elimination up to a simple scale change (which is what we have proposed here) is likely to be acceptable.
Masaoka et al. [9] and Yamanoue [10] focused on the effects of camera separation and camera FOV. Their models had no explicit pairing of display screen distance and camera convergence distance. The mismatch between convergence distance and screen distance, however, affects the analysis of distortions caused by camera separation or FOV mismatches. For example, Yamanoue et al. [10] concluded that the parallel-camera configuration does not produce the puppet-theater effect. This is because the left and right images were horizontally shifted to the left and right, respectively, by half the viewer's IPD after the images were scaled to screen size (i.e., shifting the images by [d_s·tan(α_sh/2) / (d_c·tan(α_ch/2))]·s_c/2 = s_e/2; see Eq (14), with scaling by tan(α_sh/2)/tan(α_ch/2), in the Appendix). Thus, the ratio of screen width to camera frustum width at the convergence distance equals the ratio of eye separation to camera separation (w_s/w_c = s_e/s_c), giving the condition k_s = k_w in (1). Therefore, in the model of Yamanoue et al. [10], the sizes of the reconstructed objects are scaled by the same ratio at all depths.
Since the puppet-theater effect is defined as a size distortion between objects in the foreground and background, a global magnification or minification of size does not induce it. However, this particular case does not cover all possible parallel-camera configurations; the parallel-camera configuration can still cause the puppet-theater effect. The same method was used by Held and Banks [2] when they analyzed the mismatch between camera separation and eye separation (see Fig 1(A) and 1(I) in the Appendix of [2] and compare with our results in Fig 5). In their modeling, the left and right images were likewise horizontally shifted to the left and right by half the viewer's IPD. In this case, the convergence distance also changes when the camera separation is changed. Thus, the analysis of camera separation mismatch in [2] was confounded by a screen distance mismatch, which may be unclear to readers.
To avoid the issue of large uncrossed disparities on screens, a smaller camera separation, a larger camera convergence distance, or a larger camera FOV (i.e., a larger camera frustum width at the convergence distance) is recommended for S3D content producers, so that k_s > k_w. For example, consider viewers with IPDs around 64mm who are expected to watch a 50-inch TV at a 3m screen distance (i.e., a 41˚ screen FOV). If the camera convergence distance is also 3m, large screen disparities can be avoided by setting the camera separation narrower than the expected viewers' IPD (e.g., 60mm) and the camera FOV wider (e.g., 60˚), giving k_s = 1.07 > k_w = 0.5. When k_s < k_w and an asymptotic limit on depth exists, we recommend placing the far plane at or slightly beyond the asymptotic limit and projecting objects beyond it onto the plane as a 2D image (texture). Even though such distant objects are then perceived at an infinite distance in binocular stereo vision, their monocular cues (e.g., farther mountains are occluded by closer mountains and appear lighter in color; farther buildings appear smaller than closer ones) may be strong enough that users do not notice the difference from the original world.
As mentioned, perception in a distorted S3D world resembles the Alice in Wonderland syndrome [17], in which depth and size perception are altered such that objects appear too close, too far, too big, or too small. For example, normal movements may appear too slow in a compressed space and too fast in an expanded space. Since the perception of motion within such a distorted space may conflict with the user's egocentric motion expectations learned from real-world experience, it may induce visually induced motion sickness (VIMS) [20,21]. Thus, the perceptual inconsistency in a distorted space is a likely source of VIMS in S3D [3].
The proposed geometric model can predict the geometric distortions caused by mismatches among image capture, display, and viewing. Perceptual distortions, however, may not match and are usually smaller than the geometric distortions predicted by ray-intersection models [22,23]. Geometric distortions predicted by ray-intersection models are determined solely by the binocular depth cue (binocular disparity), whereas depth perception in 3D space involves both monocular and binocular depth cues; the human visual system interprets depth by combining different cues [24][25][26]. The geometric distortions simulated in this paper are illustrated from a third-person perspective, but the viewer only sees them from the first-person perspective (i.e., the origin in Figs 5, 7, 9, 11, 13 and 14, and the head positions in Fig 16). Monocular depth cues (linear perspective, occlusion, shading, etc.) from the first-person perspective are largely unaffected by geometric distortions [27,28]; thus, monocular cues can effectively reduce and limit the perceived size and depth distortions. However, these unaffected monocular cues then conflict with the binocular depth cue in a distorted S3D space. Moreover, when the viewer's head translates laterally, the motion parallax [29] (a monocular depth cue) that exists in real life is missing, since S3D displays can only provide the view (perspective) captured by the cameras; head translations then produce a strong impression of objects following the viewer's movements. This depth-cue conflict between monocular and binocular cues (an intra-visual conflict), together with the conflict between the absence of motion parallax and self-motion, may cause VIMS [3].
We have only discussed real screen displays (e.g., smartphone, monitor, TV, and movie theater), not virtual screen displays (e.g., head-mounted displays). There are two main differences between them. First, when the screen distance is adjusted on a real screen display, the screen FOV varies because the screen size is fixed. In head-mounted displays, by contrast, when the virtual screen distance is adjusted by changing the lens-to-display distance, the virtual screen size varies while the virtual screen's angular FOV remains fixed [30]. Second, on real screen displays, the camera separation is usually fixed in current 3D video games and movies, whereas in head-mounted displays users may be able to adjust the camera separation by changing the lens separation of the headset (e.g., Oculus Rift [31]). Therefore, in our discussion, camera separation was fixed and screen size was held constant in the analysis of changing screen distance. However, there is no technical reason why the camera separation could not be placed under user control (at least over a restricted range) in real-screen applications.
The proposed geometric model has some limitations. We assume no head rotation of the viewer relative to the screen. This assumption holds if the viewer sees the S3D imagery in a head-mounted display, or if the viewer's head stays upright relative to the screen. However, head rotations with respect to the screen cause additional geometric distortions in the reconstructed S3D world [2]. We also assumed that the camera image plane and the screen image plane are parallel; in some cases, however, the image planes of capture and display may be mismatched. As pointed out in [2], yaw rotation (vertical axis), roll rotation (forward axis), and stereo images captured with converged axes but displayed on a flat screen introduce vertical disparity, which may cause other problems (e.g., eye strain) for S3D viewing. These cases are outside the scope of the current paper; we are expanding our model to cover viewers' head rotations and display image-plane mismatches in a follow-up study.
Therefore, the depth in the original world has an asymptotic limit when the ratio of eye separation to camera separation (k_s) is smaller than the ratio of screen width to camera frustum width at the convergence distance (k_w). The limit exists because the uncrossed disparity of two corresponding onscreen points must be smaller than the viewer's IPD for the two projection lines (from the two eyes through the two corresponding onscreen points) to intersect in front of the viewer. The disparity of the two onscreen points can be expressed as D = k_w·s_c·(1 − d_c/Z_o), which is also independent of the screen distance d_s.
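This expression can be checked numerically; a minimal sketch (the function name is ours) computes the onscreen disparity from the capture and display parameters, showing that d_s never enters and that D < s_e holds exactly for depths below the asymptotic limit:

```python
import math

def onscreen_disparity(Z, s_c, d_c, a_ch, w_s):
    """Uncrossed onscreen disparity D = k_w * s_c * (1 - d_c/Z) of a point at
    original depth Z. Note that the screen distance d_s does not appear."""
    w_c = 2 * d_c * math.tan(a_ch / 2)   # camera frustum width at d_c
    k_w = w_s / w_c
    return k_w * s_c * (1 - d_c / Z)
```

The eyes' projection lines intersect in front of the viewer only while D < s_e, which rearranges to Z < k_w·d_c/(k_w − k_s), the asymptotic limit above; at Z = d_c the disparity is zero (the point lies on the screen when d_s = d_c).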