Dynamic 3D Scene Depth Reconstruction via Optical Flow Field Rectification

In this paper, we propose a depth propagation scheme based on optical flow field rectification for more accurate depth reconstruction. In depth reconstruction, occlusions and low-textural regions easily cause errors in the optical flow field, which lead to ambiguous depth values or holes without depth in the resulting depth map. The proposed scheme improves the precision of depth propagation and the quality of depth reconstruction for dynamic scenes. It first adaptively detects the occlusive and low-textural regions and rectifies the corresponding vectors in the optical flow field. Subsequently, the occluded and ambiguous vectors are processed for more precise depth propagation. Finally, boundary enhancement filters are applied as post-processing to sharpen the object boundaries in the obtained depth maps. Quantitative evaluations show that the proposed scheme reconstructs depth maps with higher accuracy and better quality than the state-of-the-art methods.


Introduction
Depth maps are crucial for three-dimensional (3D) imaging [1] and display, and have been widely used in digital holography image processing [2,3], object reconstruction in integral imaging [4,5], 3D object retrieval [6-10] and tomographic phase microscopy [11]. In practice, high-fidelity depth maps of a dynamic scene are calculated or captured only at discrete time instants because of the high computational complexity of depth reconstruction. For example, the widely used RGB-D [12,13] (e.g., Kinect) and ToF [14,15] cameras can capture depth maps at video rate but only at low resolution (e.g., 320×240 pixels). These devices struggle to capture depth maps of a dynamic scene at video rate with higher resolution (e.g., standard definition or above). Many 3D applications, however, require depth map sequences with both a higher capture rate and a higher resolution to better represent a dynamic scene [13,15-18].
To solve this problem, depth propagation algorithms [19-21] have been investigated in recent years to raise the capture rate of depth maps to video rate. These algorithms assume that, for one viewpoint, a dynamic scene varies identically in its depth and its color information. Specifically, objects that stay static across consecutive color frames cause no depth variation, so the depth values of the regions containing them also stay unchanged in the depth map. Conversely, motion across consecutive color frames corresponds to depth variation in the same region. Therefore, the status (i.e., static or moving) of an object in consecutive color frames can be used to describe the depth variation in the depth map. Motion vectors are widely used to describe the motion status of objects, and they can be obtained pixel-, block- or region-wise with different accuracy. For depth propagation, accurate pixel-wise motion vectors (PMVs) between consecutive color frames can be mapped to the depth maps with the same high accuracy. Under this assumption, low capture-rate depth maps can be compensated to video rate: the captured high-resolution depth maps are treated as key frames, and the depth information of the to-be-reconstructed depth maps is propagated from the key frames by the obtained PMVs.
The main problem of depth propagation is that accurate PMVs are very hard to obtain in occlusive or low textural regions, although PMVs can be highly accurate elsewhere. Inaccurate PMVs in these regions may lead to ambiguities and holes in the reconstructed depth maps, which degrade the reconstruction quality significantly. A variety of methods have been proposed to improve the quality of reconstructed depth maps. For example, manual marking of potential problem regions by users can improve the quality significantly [20], but this method is not applicable to automatic processing systems. Several filters have been applied as post-processing of the reconstructed depth map, such as the bilateral filter [19,21], discontinuity analysis and interpolation [22] and inpainting [23]. These filters usually blur the depth maps, and such artifacts are unfavorable for 3D dynamic scene representation. To solve this problem, we propose to rectify the optical flow field before propagation rather than to rectify the depth results after propagation. Because PMV rectification avoids global filtering, the quality of the reconstructed depth is improved. Furthermore, a boundary enhancement filter is proposed to refine the edges of the reconstructed depth maps. The main contributions of our work are three-fold: (1) a depth propagation scheme based on optical flow field rectification, in which highly accurate PMVs are obtained to improve the precision of propagation and the quality of the reconstructed depth map, (2) an adaptive method for detecting occlusive and low textural regions and rectifying their PMVs, and (3) a boundary enhancement filter to refine the reconstructed depth map.

Overview of the Proposed Rectification Method
In this work, we reconstruct depth maps of a dynamic 3D scene at video rate by propagation. The PMVs between consecutive color images describe the temporal correlation pixel by pixel, and can be used to propagate depth from the key depth map to the subsequent vacant depth maps. Therefore, the quality of a reconstructed vacant depth map depends strongly on the precision of the obtained PMVs. PMVs can be calculated by traditional optical flow algorithms or motion estimation methods. However, their precision decreases in several kinds of regions, for example, occlusive or low textural regions. In these regions, less information is available to the matching procedure that determines the PMVs, and thus errors are inevitable. These errors make the PMVs unreliable for depth reconstruction. To improve the quality of the reconstructed depth maps, the obtained PMVs are rectified before propagation and reconstruction in our work. Figure 1 shows the schematic overview of the proposed scheme. First, the PMVs are obtained by an optical flow algorithm and then rectified to resolve the problems caused by occlusive and low textural regions. After that, the depth information in the key depth map is propagated to the vacant depth maps through the rectified PMVs for depth map reconstruction. Finally, depth map filtering is performed to improve the quality of the reconstructed depth. The details of each step are given in the following subsections.

Optical Flow Field Rectification
As mentioned above, unreliable PMVs usually occur in occlusive or low textural regions, so these regions must be detected first.
Texture complexity is an important clue for this detection. It is usually consistent with the variation of pixel values, so it can be represented by the standard deviation of pixel values. The Heaviside step function is a unit step function, and it can be denoted by

$$y = \lim_{k \to \infty} \frac{1}{1 + e^{-2kx}} \qquad (1)$$

This function is widely used in the mathematics of control theory and signal processing. It is a discontinuous function whose value is 0 for a negative argument and 1 for a positive argument. As shown in Figure 2, the function represents a signal that switches on at a specified time (usually triggered by a threshold) and stays switched on indefinitely. To classify a pixel $X = (x, y)$ in the color image, we propose a binary decision function $f(\Omega_X)$ in the form of the Heaviside step function to determine whether a region $\Omega_X$ is a low textural region:

$$f(\Omega_X) = \begin{cases} 1, & \sigma(I_{\forall Y \in \Omega_X} - I_X) < \varepsilon_\Omega \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $\Omega_X$ is the neighboring pixel set centered at $X$, $Y$ is a pixel in $\Omega_X$, $I_X$ is the gray value of $X$, $\sigma(\cdot)$ is the standard deviation operator on a set, and $\varepsilon_\Omega$ is a threshold for texture. According to the definition of the Heaviside step function, the value of $f(\Omega_X)$ is a binary decision for textural region detection: $f(\Omega_X) = 0$ indicates that the pixel $X$ is surrounded by textures, whereas $f(\Omega_X) = 1$ indicates that it lies in a low textural region. Similarly, we propose a binary decision function $r(v)$ to determine whether a pixel $X$ is occluded:

$$r(v) = \begin{cases} 1, & |I_{X+v} - I_X| > \varepsilon_I \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where $v$ is the PMV at $X$ and $\varepsilon_I$ is a threshold for occlusions. $r(v) = 1$ marks an occluded pixel $X$, and $r(v) = 0$ a visible one. In determining the occlusive or low textural regions, a smaller threshold gives more accurate decisions and stable performance across different test materials, but increases the computational complexity and is unfavorable for implementation. A larger threshold, on the other hand, benefits implementation but decreases the accuracy of the decision and makes the performance unstable. The settings of the thresholds $\varepsilon_\Omega$ and $\varepsilon_I$ are discussed in detail in the experiments section. Based on $f(\Omega_X)$ and $r(v)$, we know the status of the pixel $X$ and its surroundings, and can rectify and process them appropriately. The binary values of $f(\Omega_X)$ and $r(v)$ combine into several cases.
In the first case, the pixel $X$ is occluded by other objects (i.e., $r(v) = 1$). The vector $v$ of $X$ is then an erroneous PMV, since no corresponding pixel actually exists for $X$, regardless of whether $X$ is surrounded by textures. In this case, a proper value of $v$ cannot easily be predicted directly from the neighboring vectors. Therefore, we mark $X$ with a label $label(v)$, treat $v$ as unreliable, and process the depth value of $X$ after the depth map has been reconstructed. In the second case, the pixel $X$ is visible and surrounded by textures (i.e., $r(v) = 0$ and $f(\Omega_X) = 0$). Texture information benefits accurate optical flow calculation, and thus the vector $v$ can be treated as reliable and accurate. In the last case, the pixel $X$ is visible but lies in a low textural region (i.e., $r(v) = 0$ and $f(\Omega_X) = 1$). Low textural regions cause pixel-wise ambiguous vectors in optical flow calculation. These unreliable, ambiguous PMVs usually look odd compared with their neighboring vectors, as can be seen in Figure 3(a). In this case, the odd vectors can be processed by average filtering with the neighboring vectors. We summarize the above processing as a condition function

$$\Psi(v) = \begin{cases} label(v), & r(v) = 1 \\ v, & r(v) = 0 \text{ and } f(\Omega_X) = 0 \\ avg(v_{\forall Y \in \Omega_X}), & r(v) = 0 \text{ and } f(\Omega_X) = 1 \end{cases} \qquad (4)$$

where $label(v)$ is a mark on $X$ indicating that $v$ is reserved for the next processing step, and $avg(\cdot)$ is the average operator on a set. With this detection and rectification of occlusive and low textural regions, part of the unreliable (i.e., odd) PMVs are rectified effectively, and the erroneous PMVs from occlusive regions are reserved for later processing. Therefore, the accuracy of the PMVs is improved. Figure 3 provides the results of occlusive and low textural region detection and rectification. The results are obtained from two consecutive color images of Lovebird1 from MPEG [24]. The optical flow field is computed from $I_t$ to $I_{t+1}$ (i.e., from the previous image to the current one), and errors occur in low textural regions in the background and in occlusions around object boundaries.
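The detection and rectification rules can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the texture test is taken to be a windowed standard deviation compared against the texture threshold, the occlusion test an absolute gray-value difference between the two pixels linked by the PMV, the window is 3×3, borders are replicated, and NaN stands in for $label(v)$. The function names and the flow layout (x component first) are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def window_view(a, radius):
    """Return every (2r+1)x(2r+1) neighbourhood of a 2-D array, shape (H, W, k, k)."""
    k = 2 * radius + 1
    padded = np.pad(a, radius, mode="edge")  # assumption: replicate the borders
    return sliding_window_view(padded, (k, k))


def low_texture_mask(gray, eps_omega, radius=1):
    """True (f = 1) where the local standard deviation is below eps_omega.

    Subtracting the constant centre value I_X does not change a standard
    deviation, so the plain windowed std is used.
    """
    win = window_view(np.asarray(gray, dtype=np.float64), radius)
    return win.std(axis=(-2, -1)) < eps_omega


def occlusion_mask(gray_t, gray_t1, flow, eps_i):
    """True (r = 1) where |I_{X+v} - I_X| exceeds eps_i (assumed form of the test)."""
    gray_t = np.asarray(gray_t, dtype=np.float64)
    gray_t1 = np.asarray(gray_t1, dtype=np.float64)
    h, w = gray_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Assumption: flow[..., 0] is the x component, flow[..., 1] the y component.
    xt = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return np.abs(gray_t1[yt, xt] - gray_t) > eps_i


def rectify_flow(flow, low_tex, occluded, radius=1):
    """Apply the three cases: label, keep, or average with the neighbours."""
    flow = np.asarray(flow, dtype=np.float64)
    avg = np.stack(
        [window_view(flow[..., c], radius).mean(axis=(-2, -1)) for c in range(2)],
        axis=-1,
    )
    rectified = flow.copy()
    smooth = low_tex & ~occluded
    rectified[smooth] = avg[smooth]  # visible, low texture: average filter
    rectified[occluded] = np.nan     # occluded: NaN stands in for label(v)
    return rectified
```

Vectors of visible, textured pixels pass through unchanged, so only the two problematic classes are touched, which is the point of rectifying the field instead of filtering the whole depth map afterwards.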
Figure 3(a) shows the static scene with occlusive and low textural regions, and Figure 3(b) is the optical flow field obtained for the scene of Figure 3(a), in which many unreliable PMVs can be found. Figure 3(c) shows the result after the field in Figure 3(b) has been rectified. After that, the vacant depth map $(D_{t+1})_{n \times m}$ at time $t+1$ can be propagated and reconstructed from the previous depth map $(D_t)_{n \times m}$ via the obtained optical flow field $\Psi(v)_{n \times m}$ as

$$D_{t+1}(X + v) = D_t(X), \quad v \in \Psi(v)_{n \times m} \qquad (5)$$
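The propagation step can be illustrated as a forward warp of the previous depth map along the rectified vectors. The sketch below is an assumption-laden illustration rather than the paper's code: vectors are rounded to the nearest pixel, labeled vectors are encoded as NaN and skipped, unreached target pixels keep a hypothetical `hole_value` of -1, and when two source pixels land on the same target the last write wins (no depth ordering is applied).

```python
import numpy as np


def propagate_depth(depth_t, flow, hole_value=-1):
    """Forward-warp depth_t along flow; unreached pixels stay at hole_value."""
    depth_t = np.asarray(depth_t)
    h, w = depth_t.shape
    depth_t1 = np.full((h, w), hole_value, dtype=np.int32)
    ys, xs = np.mgrid[0:h, 0:w]
    # Labeled (occluded) vectors are assumed to be NaN and are not propagated.
    valid = ~np.isnan(flow).any(axis=-1)
    # Assumption: flow[..., 0] is the x component, flow[..., 1] the y component.
    xt = np.rint(xs[valid] + flow[valid][:, 0]).astype(int)
    yt = np.rint(ys[valid] + flow[valid][:, 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    depth_t1[yt[inside], xt[inside]] = depth_t[valid][inside]
    return depth_t1
```

The pixels left at `hole_value` are exactly the holes that the later filtering stage has to fill: targets of labeled vectors and positions that no vector points to.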

Depth Map Reconstruction
A depth map sequence synchronized with the color frames can be reconstructed by Equation 5. This sequence has high resolution at video rate. The depth information in each vacant time slot is propagated from the key depth map, where the depth information is reliable. However, the processing of the reconstructed depth maps is not yet finished. As denoted by Equation 4, the PMVs of occlusive regions are labeled and reserved for post-processing, and the reserved regions appear as holes without depth information in the reconstructed depth map. In addition, ambiguous PMVs are inevitable in optical flow algorithms, and these PMVs can also result in holes. Therefore, depth map filtering is necessary as post-processing to improve the quality of the reconstructed depth maps.

Depth Map Filtering
As mentioned above, the reconstructed depth maps may contain holes due to the labeling operation in Equation 4 and to ambiguous PMVs. For an ambiguous PMV at pixel $X$, the missing depth value $d_X$ is very close to its spatial neighbors $\Omega_X \subset (D_{t+1})_{n \times m}$, so a median filter is applied:

$$d_X = \mathrm{median}(d_{\forall Y \in \Omega_X}) \qquad (6)$$

For a hole caused by occlusion, the pixel $X$ has been marked by $label(v)$ in Equation 4. In this case, the missing depth value $d_X = d_{label}$ at pixel $X$ can be jointly predicted from the depth values around the hole and from the region to which the PMV points. We propose a depth value predictor (Equation 7) that combines these two estimates, where the weight $\alpha \in [0, 1]$ is normalized by the norm of $v$ and $\delta(\cdot)$ is an in-painting operator [25]. However, in-painting on a depth map introduces a noticeable blur, especially when the hole crosses high-contrast edges. Usually, a depth map with blurred boundaries causes foreground-background separation to fail [26]. Therefore, an object boundary enhancement filter (BEF) is further proposed (Equations 8 and 9). It is built from the depth value that appears most frequently in $\Omega_X$ and from $\theta(\cdot)$, a bilateral filter defined in [27]. Here $freq(\cdot)$ is a statistical function that counts how often each element appears in a data set. For example, for the data set $A = \{a, a, b\}$, the result of $freq(A)$ is $\{\{a, 2\}, \{b, 1\}\}$, and therefore $\arg\max_A freq(A) = a$. Equations 8 and 9 smooth the depth map while making the object boundaries sharper.
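Two of the building blocks of this filtering stage are simple enough to sketch: median filling of a hole from its known neighbors, and the $freq(\cdot)$ / arg max statistic used by the BEF. The code below is an illustrative sketch only. It assumes holes are marked with a hypothetical `hole_value` of -1 and a 3×3 window, and it does not reproduce the occlusion predictor or the full BEF, whose exact combination with in-painting and the bilateral filter is not spelled out here.

```python
from collections import Counter

import numpy as np


def freq(values):
    """Count how often each element appears, e.g. {a, a, b} -> {a: 2, b: 1}."""
    return Counter(values)


def most_frequent(values):
    """arg max of freq(.): the element that appears most often."""
    return freq(values).most_common(1)[0][0]


def fill_holes_median(depth, hole_value=-1, radius=1):
    """Replace each hole by the median of the known depths in its window."""
    depth = np.asarray(depth)
    filled = depth.copy()
    for y, x in zip(*np.nonzero(depth == hole_value)):
        win = depth[max(0, y - radius):y + radius + 1,
                    max(0, x - radius):x + radius + 1]
        known = win[win != hole_value]
        if known.size:  # leave the hole untouched if no neighbour is known
            # For integer depth maps the median is truncated on assignment.
            filled[y, x] = np.median(known)
    return filled
```

A mode-based statistic never creates a depth value that was absent from the window, which is why it suits boundary sharpening: averaging across a foreground-background edge would produce intermediate depths that belong to neither object.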

Dynamic 3D Scenes Materials
A dynamic 3D scene contains a video-rate color image sequence that records the motion, color and texture information of the scene. In addition, a high-resolution depth map sequence is captured to record the 3D spatial information of all visible objects. As mentioned above, RGB-D and ToF cameras cannot yet capture high-resolution depth maps at video rate in synchronization with the color image sequence. Recently, MPEG released standard test sequences for dynamic 3D scenes with high resolution (above standard definition) and high frame rate, including color images and depth maps [24]. The color images were captured by cameras, whereas the depth maps were not captured but calculated by stereo matching and, in some cases, manual labeling. Depth maps obtained in this way are assumed to have the best attainable quality.
The dynamic 3D scene materials named Undo Dancer, Lovebird1 and Balloon are used to test the proposed algorithm. The captured color images and calculated depth maps of these materials are taken from [24]. The materials pose different challenges for depth reconstruction, as listed in Table 1.

Experiment Arrangements
The experiments are arranged in four parts: a discussion of the thresholds in Equations 2 and 3, subjective and objective quality comparisons of depth reconstruction between our algorithm and the benchmark state-of-the-art method in [19], and finally an objective quality comparison of the dynamic 3D scene representation. In [19], the PMVs between consecutive color images are not processed before propagation. Instead, a bilateral filter is applied for propagation, and errors in the reconstructed depth are processed by motion compensation.
In our experiment, the depth map at $t = 0$ is selected from the given depth map sequence and treated as the key depth map. The subsequent depth maps in the material are treated as vacant, and they are propagated and reconstructed from the key depth map by both our proposed algorithm and the benchmark method. The original subsequent depth maps in the material serve as the anchor for the reconstructed depth in the objective quality evaluation.
On the Parameters $\varepsilon_\Omega$ and $\varepsilon_I$
Equations 2 and 3 contain two thresholds, $\varepsilon_\Omega$ and $\varepsilon_I$. They control the number of pixels classified as occluded or low textural, and thus the final quality of the propagated depth maps. Such parameters (i.e., thresholds) are commonly used in pixel classification. The parameter $\varepsilon_\Omega$ is a real number in $[0, \infty)$, and it determines the number of low textural pixels. If $\varepsilon_\Omega$ tends to infinity, all pixels in the image are classified as low textural, no matter how much texture surrounds them. According to our proposed scheme, a spatial filter (i.e., the average filtering in Equation 4) is applied to the low textural pixels. Depth information of textured pixels can then be erased by this filter, and the accuracy of the obtained depth map is degraded. The parameter $\varepsilon_I$ is also a real number in $[0, \infty)$, and it determines the number of occluded pixels. Equation 3 is evaluated on the two corresponding pixels related by the vector $v$: the accuracy of $v$ is represented by the difference $I_{X+v} - I_X$ and checked against $\varepsilon_I$. Figure 4 demonstrates a texture analysis of $\Omega_X$ on one color image of the test material Lovebird1 as $\sigma(I_{\forall Y \in \Omega_X} - I_X)$ changes. Texture in an image can be treated as wave-like variation in a signal. According to the definition of information entropy, $\Omega_X$ contains more information when the signal varies sharply. For the matching operation in optical flow calculation, more information in $\Omega_X$ helps to obtain a more accurate and reliable $v$. Therefore, the parameter $\varepsilon_\Omega$ is also a threshold that distinguishes reliable from unreliable $v$. Figure 4 shows that the region $\Omega_X$ can be clearly classified as a low textural region when $\sigma(I_{\forall Y \in \Omega_X} - I_X) < 6$; otherwise, apparent textures are visible in $\Omega_X$.
Based on the texture analysis in Figure 4, the parameter settings for $\varepsilon_\Omega$ and $\varepsilon_I$ can be determined. Figure 5 demonstrates the performance curves (i.e., Bad Point Ratio) with respect to $\varepsilon_\Omega$ and $\varepsilon_I$. In Figure 5(a), we fix $\varepsilon_\Omega$ to 4 and vary $\varepsilon_I$ from 1 to 20. The Bad Point Ratio drops to its minimum when $\varepsilon_I$ is 9. After that, in Figure 5(b), we fix $\varepsilon_I$ to 9 and vary $\varepsilon_\Omega$ from 1 to 7. The Bad Point Ratio varies only slightly with $\varepsilon_\Omega$, but the curve rises as $\varepsilon_\Omega$ becomes larger. Therefore, we set $\varepsilon_\Omega$ to 3 to obtain a relatively small Bad Point Ratio, indicating higher accuracy of the depth map.
Figure 6 gives the comparison results of the subjective experiments. Each subfigure provides an enlarged part that details the difference between our algorithm and the method in [19]. Figure 6(a) is the original depth map selected from the materials; it serves as the benchmark and is treated as absent in depth reconstruction. Figure 6(b), marked "BL+MC", is obtained by the method in [19], and it shows definite geometric distortion around the boundaries of moving objects. This phenomenon is a result of temporal bilateral filtering. In contrast, our algorithm detects the occlusive and low textural regions and processes them according to their types before depth propagation and reconstruction. Figure 6(c), marked "HF", is the depth map reconstructed by using the optical flow field $\Psi(v)_{n \times m}$ in Figure 3(c) with Equations 5, 6 and 7. As mentioned above, a blurring effect occurs around object boundaries. Figure 6

Objective Results for Depth Reconstruction
The objective quality is measured by the peak signal-to-noise ratio (PSNR) between the reconstructed depth maps and the corresponding existing depth maps of the test materials. In the comparisons, a higher PSNR indicates higher accuracy and better performance. Figure 7 and Table 2 provide the quantitative results. They show that high-precision depth propagation leads to high-quality depth reconstruction: the quality of the depth maps reconstructed by our method (labeled "P") is more than 8 dB better than that of the benchmark (labeled "B"). However, errors (e.g., distortions around boundaries) are also propagated, as shown in Figure 6(b). Therefore, the quality of the reconstructed depth map drops as the propagation distance grows. For the results in Figure 7(a), we reconstruct 9 consecutive depth maps of Lovebird1 for the left and the right view. Figure 7(a) shows that the quality of the 1st depth map reconstructed by the benchmark is comparable with that of the 9th depth map reconstructed by our method, which indicates the higher quality of our method. Table 2 lists the average quality over the 9 reconstructed depth maps for the three test sequences. Our method gains at least 5 dB in depth reconstruction, which is due to the rectification of the optical flow field. In addition, the BEF eliminates the blur around boundaries, which also benefits the quality.
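For reference, PSNR between a reconstructed depth map and its anchor can be computed as below. This is a generic sketch of the standard definition, not the evaluation script used for Figure 7 and Table 2. The peak value of 255 assumes 8-bit depth maps, and the function name is hypothetical.

```python
import numpy as np


def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak=255 assumes 8-bit depth maps."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical maps
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a 5 dB gain corresponds to roughly a threefold reduction in mean squared depth error (a factor of $10^{0.5} \approx 3.16$).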

Results in Dynamic 3D Scene Representation
Dynamic 3D scene representation is evaluated by the objective quality of virtual view synthesis. Virtual view synthesis is an important application in 3D computer vision when both the color images and the corresponding depth maps of a dynamic 3D scene are available [13,17]. Better depth maps yield higher-quality virtual views, and hence better performance in dynamic 3D scene representation.
We use the reconstructed depth maps for synthesis with the VSRS software [28], which is a common test platform. The results are given in Figure 7(b) and Table 2. Our method achieves PSNR gains of 0.7 to 4.5 dB on all test materials. In contrast, the accuracy of the depth maps reconstructed by the benchmark is strongly affected by its filter-based propagation, and the resulting depth distortion leads to synthesis distortion.

Summary of Results
In summary, the quantitative comparisons above show that the proposed algorithm achieves more accurate depth reconstruction on all test sequences with their different challenges, including global and local motion, and dynamic scenes either captured in a natural environment or generated by computer graphics.