Parallax attention stereo matching network based on the improved group-wise correlation stereo network

Recent stereo matching methods, especially end-to-end deep stereo matching networks, have achieved remarkable performance in autonomous driving and depth sensing. However, state-of-the-art stereo algorithms, even those built on deep neural networks, still have difficulty finding correct correspondences in near-range regions and along object edges. To improve the precision of disparity prediction, we propose a parallax attention stereo matching algorithm based on an improved group-wise correlation stereo network. It learns disparity from a stereo image pair and supports end-to-end prediction of both the disparity map and the edge map. In particular, we introduce a parallax attention module that operates on three dimensions (disparity, height, and width); its structure supports high-precision estimation by improving feature expression in near-range regions. This is critical for computer vision tasks, and the module can be added to several existing models to enhance their performance. Moreover, to make full use of the edge information learned by the two-dimensional feature extraction network, we propose a novel edge detection branch and a multi-featured integration cost volume. We demonstrate that, in our model, the edge detection task helps improve the accuracy of disparity estimation. Our method achieves better results than previous works on both the Scene Flow and KITTI datasets.


Introduction
Binocular stereo matching is an important but difficult problem that aims to compute the disparity of every pixel from a stereo image pair. Efficient and accurate stereo matching methods are necessary for computer vision tasks such as robotic pose estimation and autonomous driving [1,2].
Traditional stereo matching methods usually consist of four steps: initial matching cost calculation, matching cost aggregation, disparity prediction, and post-processing. They can be categorized into global and local algorithms [3]. Local strategies use only fixed-size or adaptive windows to calculate the initial cost. Global strategies normally treat matching as an optimization task, minimizing a global objective function that incorporates data and smoothness terms. However, traditional algorithms require manually designed feature descriptors and cost aggregation strategies, which makes them unsuitable for real-time applications. Complicated hand-crafted steps limit further improvement.
Learning-based stereo matching methods match corresponding points in the left and right feature maps accurately by learning feature representations and aggregation algorithms. These algorithms commonly consist of four steps: unary feature extraction [4], cost volume construction [5], cost aggregation [6], and disparity prediction [7]. Although performance on several benchmarks has improved significantly, drawbacks remain. First, the predicted edges of the disparity map are not accurate enough. Second, these methods adopt a global attention strategy that is insensitive to regions with detailed texture, resulting in inaccurate disparity estimation in vital areas.
In recent years, researchers have been inspired by the mechanism of human attention and have designed attention architectures for CNNs to enhance feature extraction. However, drawbacks remain. First, these works focus on parallax information in the two-dimensional (2D) domain; 2D features can hardly reflect real three-dimensional (3D) scenes fully, and the more important 3D information is ignored. Second, owing to the limited learning capability of a single network structure, the predicted disparity map is not fine enough in near-range regions.
Edge cues are the image features most easily recognized by the human eye; in other words, humans can easily find stereo correspondences by using the edge cues of binocular images. Based on this observation, some studies have made progress in predicting image edge cues as a single task. More recently, researchers have applied edge detection (ED) to disparity estimation. However, these methods treat disparity prediction and edge detection as a multi-task learning problem, and the features learned in such multi-task pipelines cannot be fully exploited, which calls for an effective fusion mechanism.
In the context of autonomous driving, closer areas have larger parallax and pose greater risks. To address this problem, the disparity estimation model should assign more attention to such regions. In this paper, we propose a high-quality and efficient module for stereo matching, and our method achieves better performance on Scene Flow [8] and KITTI [9,10] than previous methods. Specifically, we examine an important aspect of structure design: attention. The importance of attention has been studied in previous methods [11,12]. In our module, the left feature map $f_l$ and the corresponding right feature map $f_r$ are packed into a 3D feature map $f$, which is sent to the parallax attention (PA) module to learn 'what' to attend to in $f$. As shown in Fig 1, our structure improves the accuracy of disparity prediction by improving feature expression in near-range regions. Meanwhile, a novel edge detection branch and a multi-featured integration cost volume are proposed in our network to learn finer texture features, which are vital for optimizing unary feature extraction. To complete the end-to-end disparity prediction task, we assign different weights to the edge detection loss and the disparity smooth loss. We demonstrate that obtaining a high-precision edge feature map helps improve the accuracy of disparity estimation.
Our main contributions can be summarized as follows: 1. We propose a PA module to further improve the accuracy of disparity prediction; 2. We propose an edge detection branch and a multi-featured integration cost volume in our network architecture to obtain finer texture features; 3. Our PA-Net achieves an end-point error (EPE) of 0.775 on the Scene Flow dataset and a D1-all error of 2.05% on the KITTI 2015 dataset, which outperforms other methods by 12%.

Traditional methods
In non-end-to-end deep stereo matching algorithms, each step of the traditional stereo matching pipeline can be replaced by a neural network. Some researchers have focused on using CNNs to calculate the matching cost accurately and then applying semi-global matching [14,15] to optimize the predicted disparity map. Zbontar et al. [16] proposed MC-CNN, which trains a convolutional neural network to compare a pair of 9×9 image patches and compute their matching cost. Traditional algorithms play an important role in stereo matching. However, they generally suffer from slow computation and low matching accuracy, which greatly limits their application.

Learning-based methods
In 2015, Long et al. [17] achieved very good results in semantic segmentation using a fully convolutional network (FCN). Inspired by the FCN, Mayer et al. [8] introduced Disp-Net, an end-to-end stereo network, in work on optical flow prediction. Disp-Net builds its loss from the per-pixel Euclidean distance between the estimated disparity map and the ground-truth disparity. Cascade residual learning (CRL) [18] and iRes-Net [19] extended the idea of DispNetC [8] with stacked refinement structures to optimize stereo results.
Kendall et al. [20] proposed GC-Net, an end-to-end network that exploits context and scene geometry information in stereo matching. GC-Net was the first to concatenate the left feature $f_l$ and the right feature $f_r$ to form a 4D cost volume:

$$C_{\text{concatenation}}(d, x, y) = \mathrm{Concat}\big(f_l(x, y),\ f_r(x - d, y)\big) \tag{1}$$

Meanwhile, GC-Net transforms the stereo matching problem into a regression problem and directly produces a refined output without post-processing. Encouraged by GC-Net, Chang et al. [21] proposed PSM-Net, which combines spatial pyramid pooling and stacked 3D hourglass structures in a stereo refinement network. More recently, GWC-Net [13] proposed group-wise correlation to assemble the cost volume; the idea is to split the features into groups and compute correlation maps group by group. The group-wise correlation is computed as

$$C_{\text{group-wise}}(d, x, y, g) = \frac{1}{N_c / N_g} \big\langle f_l^g(x, y),\ f_r^g(x - d, y) \big\rangle \tag{2}$$

where $N_c$ denotes the number of channels of the unary features, which are divided into $N_g$ groups along the channel dimension, $f_l^g$ and $f_r^g$ denote the $g$-th feature groups, and $\langle\cdot,\cdot\rangle$ is the inner product, computed at all disparity levels $d$. Although performance on several benchmarks has improved significantly, some drawbacks remain: the predicted edge contours of the disparity map are not accurate enough, and the global attention strategy is insensitive to detailed texture information.
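As a rough illustration of the two cost volumes discussed above, the following NumPy sketch builds a concatenation volume and a group-wise correlation volume for a single image pair. The function names, the (channel, height, width) array layout, and the zero filling of positions where x − d falls outside the image are assumptions made for this example; they are not taken from the GC-Net or GWC-Net code.

```python
import numpy as np


def concat_volume(f_left, f_right, max_disp):
    """Stack left features with right features shifted by each disparity d.

    Assumes max_disp <= image width; out-of-view positions stay zero.
    """
    c, h, w = f_left.shape
    volume = np.zeros((2 * c, max_disp, h, w), dtype=f_left.dtype)
    for d in range(max_disp):
        # Pixel x in the left view is paired with pixel x - d in the right view.
        volume[:c, d, :, d:] = f_left[:, :, d:]
        volume[c:, d, :, d:] = f_right[:, :, : w - d]
    return volume


def groupwise_correlation_volume(f_left, f_right, max_disp, num_groups):
    """Mean inner product per channel group, for every disparity level."""
    c, h, w = f_left.shape
    if c % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    per_group = c // num_groups  # N_c / N_g channels in each group
    volume = np.zeros((num_groups, max_disp, h, w), dtype=f_left.dtype)
    for d in range(max_disp):
        prod = f_left[:, :, d:] * f_right[:, :, : w - d]       # (C, H, W - d)
        prod = prod.reshape(num_groups, per_group, h, w - d)
        volume[:, d, :, d:] = prod.mean(axis=1)                 # sum / (N_c / N_g)
    return volume


rng = np.random.default_rng(0)
f_l = rng.standard_normal((8, 4, 6))   # toy left features: C=8, H=4, W=6
f_r = rng.standard_normal((8, 4, 6))   # toy right features
gwc = groupwise_correlation_volume(f_l, f_r, max_disp=3, num_groups=2)
cat = concat_volume(f_l, f_r, max_disp=3)
print(gwc.shape, cat.shape)
```

The group-wise volume keeps one similarity value per group instead of collapsing all channels into a single score, which is the property the text attributes to GWC-Net.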

Learning-based attention methods
In recent years, researchers have been inspired by the mechanism of human attention and have designed attention architectures for CNNs to enhance feature extraction. Hu et al. [22] introduced a squeeze-and-excitation block to fully utilize the channel information in the network. In addition to channel attention, the convolutional block attention module (CBAM) [23] introduced a spatial attention block to demonstrate that spatial features are vital in the network. Wang et al. [24] introduced PASSR-Net to integrate super-resolution information from a stereo image pair, and in PASMnet [25] proposed a parallax-attention module (PAM) that calculates the consistency score between the left and corresponding right images along the epipolar line; it was leveraged by many subsequent methods such as [19-26]. On the basis of PAM, Wang et al. [27] introduced a symmetric bi-directional parallax attention module (biPAM) to obtain cross-view information. Ying et al. [28] proposed a generic stereo attention module (SAM), which aims to solve the information incorporation problem. Chen et al. [29] addressed stereo images with large disparities and multiple types of epipolar lines by utilizing a cross parallax attention module (CPAM). However, the parallax information provided by binocular images has not been fully utilized in those methods. PA-Net is the first to emphasize that improving feature expression in near-range regions helps the disparity prediction task.

Edge detection methods
Edge cues can be easily captured by the human eye to find stereo correspondences. Accurate edge contours help discriminate between different objects or regions. Based on this observation, some works have made progress in predicting image edge cues as a single task. Xie et al. [30,31] first designed an end-to-end ED network based on the VGG-16 network. Recently, Song et al. [32,33] combined an ED branch with a stereo matching network. However, these methods treat disparity prediction and edge detection as a multi-task learning problem and do not establish an effective mechanism to integrate the information learned by the different tasks. As a result, the features learned in the multi-task setting are not effectively expressed and utilized. To address this problem, we construct a multi-featured integration cost volume to combine parallax features and edge features.

Methods
As shown in Fig 2, we propose a PA stereo matching network (PA-Net), which extends GWC-Net [13] with a PA module, an edge detection branch, and a multi-featured integration cost volume.

Network architecture
The pipeline of the proposed PA network is shown in the upper half of Fig 2. It includes four parts: a unary feature extraction pipeline, a multi-featured integration cost volume structure, a parallax-attention 3D aggregation network, and a disparity prediction module. The multi-featured integration cost volume structure consists of three parts: concatenation [20], group-wise correlation [13], and edge detection volumes (details in Section 3.3). These volumes are then concatenated as the input of the parallax-attention 3D aggregation network, which is described in Section 3.2.
The parallax-attention 3D aggregation network aggregates features over the disparity levels and consists of two parts: a pre-hourglass module and three parallax-attention 3D aggregation networks. The pre-hourglass module consists of two components: the first comprises four 3D convolutional layers with batch normalization and the ReLU [26] function, and the second comprises two PA modules.

Parallax attention module
Discriminative feature representations are essential for scene understanding. However, previous studies focus only on two-dimensional (2D) contextual information and ignore the significance of 3D disparity features. To emphasize regions with a large parallax, we introduce a PA module that encodes the disparity information into different weights, thus enhancing the representation capability of the features.
The 3D convolution layer is widely used in stereo matching tasks, and its input has four dimensions: channel, disparity, height, and width. However, 3D filters are learned within a local receptive field, so the output feature map U lacks contextual information.
Based on these observations, as shown in Fig 3, given a feature map $f \in \mathbb{R}^{C \times D \times H \times W}$, we first conduct two transformations, $f_s: f \to f_s \in \mathbb{R}^{C \times 1 \times H \times W}$ and $f_m: f \to f_m \in \mathbb{R}^{C \times 1 \times H \times W}$, which represent the importance of different position vectors. The term $f_s$ selects the maximum element along the disparity dimension, whereas $f_m$ computes the mean value along the disparity dimension. For the $c$-th channel, the value at position $(i, j)$ is calculated by:

$$f_s^c(i, j) = \max_{1 \le d \le d_{max}} f^c(d, i, j), \qquad f_m^c(i, j) = \frac{1}{d_{max}} \sum_{d=1}^{d_{max}} f^c(d, i, j)$$

where $d_{max}$ denotes the maximum parallax value. Thereafter, the two feature maps are concatenated in the channel dimension to obtain a mixed feature map $M_u \in \mathbb{R}^{2C \times 1 \times H \times W}$. The mixed feature map $M_u$ can be treated as a collector of the local disparity texture information, and its function is to describe the entire parallax image. Subsequently, the feature map is sent to a shared network, which is composed of a multilayer perceptron with two 3×3×3 convolutional layers, each followed by batch normalization and the ReLU [26] function. To reduce the parameter overhead, the size of the middle-layer features is set to $\mathbb{R}^{C/r \times 1 \times H \times W}$, where $r$ denotes the reduction ratio, which we set to 4. Finally, a sigmoid function is applied to the resulting disparity feature map.

Fig 2. Pipeline of the proposed PA-Net. It is constructed based on GWC-Net by adding an edge detection branch to the feature extraction structure and applying the PA module in the architecture of the 3D aggregation network. The left and corresponding right images are fed to a weight-sharing feature extraction pipeline, which consists of a ResNet-50 network for feature map calculation. It includes three branches (edge detection, group-wise correlation [13], and concatenation branches). Thereafter, a multi-featured integration cost volume is constructed from those branches and is finally fed into a parallax-attention 3D aggregation network for disparity regression.
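Finally, we merge the output feature vectors with the input feature $f$ using an element-wise product to obtain the final PA feature map $M_d \in \mathbb{R}^{C \times D \times H \times W}$, which can be simplified as follows:

$$M_d^i = \sigma\big(\mathrm{MLP}(\langle f_s, f_m \rangle)\big)^i \otimes X^i$$

where $M_d^i$ denotes the value at the $i$-th position of the final feature map, $\langle\cdot,\cdot\rangle$ denotes concatenation along the channel dimension, $\sigma$ is the sigmoid function, $\mathrm{MLP}$ is the shared network, $\otimes$ is the element-wise product, and $X^i$ denotes the value of the input feature.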
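The data flow of the PA module described above can be sketched in NumPy as follows. This is a simplified illustration rather than the authors' implementation: the shared network is replaced by two per-pixel linear maps (`w1`, `w2`, hypothetical names) instead of 3×3×3 convolutions with batch normalization, and the weights are random. Only the pooling over disparity, the channel reduction by r, the sigmoid gating, and the element-wise product follow the text.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def parallax_attention(f, w1, w2):
    """Gate a (C, D, H, W) feature volume with weights pooled over disparity.

    w1 has shape (C // r, 2C) and w2 has shape (C, C // r). They stand in for
    the shared network; per-pixel linear maps are an assumption of this sketch.
    """
    f_s = f.max(axis=1, keepdims=True)             # (C, 1, H, W): max over disparity
    f_m = f.mean(axis=1, keepdims=True)            # (C, 1, H, W): mean over disparity
    m_u = np.concatenate([f_s, f_m], axis=0)       # (2C, 1, H, W): mixed feature map
    hidden = np.einsum("kc,cdhw->kdhw", w1, m_u)   # reduce to C / r channels
    hidden = np.maximum(hidden, 0.0)               # ReLU
    weights = sigmoid(np.einsum("ck,kdhw->cdhw", w2, hidden))  # (C, 1, H, W)
    return weights * f                             # broadcast over the D axis


rng = np.random.default_rng(0)
C, D, H, W, r = 8, 5, 4, 6, 4
f = rng.standard_normal((C, D, H, W))
w1 = 0.1 * rng.standard_normal((C // r, 2 * C))
w2 = 0.1 * rng.standard_normal((C, C // r))
out = parallax_attention(f, w1, w2)
print(out.shape)
```

Because the output has the same shape as the input, a block of this form can be placed between existing 3D convolution layers without changing the surrounding layer sizes.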
In comparison with traditional 3D convolutional layers, the advantages of our module can be summarized as follows: 1. For an identical receptive field, our module has considerably fewer parameters (reduced by 25%) and consumes much less memory; consequently, inference with our module is faster. 2. As summarized in Table 1, our PA structure effectively decreases the EPE with only a small increase in computational complexity. 3. Our PA module does not change the number of channels or the size of the input features, so it can be added directly to 3D convolution layers.

Edge detection and multi-featured cost volume
State-of-the-art disparity estimation methods work well on ordinary regions with clear texture. The matching clues in these regions are clear and can be easily captured through the context pyramid. However, as shown in Fig 1, the edge details are lost. Hence, we design an edge detection branch to help refine the disparity map. Our edge detection (ED) architecture includes three branches (group-wise, concatenation, and edge detection branches) that share the weights of the ResNet-50 backbone, listed in Table 1. The ResNet-50 backbone has four outputs; for each output, we design a new structure that includes a 3×3 convolutional layer and a 1×1 convolutional layer with batch normalization and the ReLU [26] function (we set the number of final channels to 1 in each branch). To fuse the features contained in the different branches, all the feature maps are concatenated to construct an edge cost volume. Finally, the group-wise, concatenation, and ED features are fused to form a multi-featured integration cost volume.
Table 1. Ablation study results of PSM-Net, GWC-Net and PA-Net on the Scene Flow dataset [8]. The results of PSM-Net [21] and GWC-Net [13] were obtained by training the published code with our batch size and evaluation settings for a fair comparison.

As shown in Fig 4, based on the architecture of GWC-Net [13], we added an ED branch to construct the edge volume. In contrast to the concatenation volume, in which the left and corresponding right feature maps are concatenated at each disparity level to form a cost volume, the ED volume is constructed by computing the similarities of the left and right feature maps. For each pair of edge feature maps, the edge correlation is calculated as follows:

$$C_{\text{edge}}(d, x, y) = \frac{1}{N_c} \big\langle f_l^{edge}(x, y),\ f_r^{edge}(x - d, y) \big\rangle$$

where $\langle\cdot,\cdot\rangle$ denotes the inner product, $N_c$ the total number of channels, and $f_l^{edge}$, $f_r^{edge}$ the left and corresponding right edge feature maps, respectively. Finally, by combining all cost volumes, the multi-scale cost volume is determined as follows:

$$C_{\text{multi}} = \mathrm{Concat}\big(C_{\text{concatenation}},\ C_{\text{group-wise}},\ C_{\text{edge}}\big)$$

where $\mathrm{Concat}(\cdot,\cdot)$ denotes the concatenation of feature maps in the channel dimension. $C_{\text{concatenation}}$ and $C_{\text{group-wise}}$ are calculated as introduced in Eqs (1) and (2).
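The edge correlation and the final channel-wise fusion can be sketched as below. This is an illustrative NumPy version under stated assumptions: the edge features are taken to be four stacked 1-channel maps, the edge volume has a single channel, and the concatenation and group-wise volumes are zero-filled placeholders whose channel counts (14×2 and 32) are borrowed from the ablation settings reported later in the paper. The function names are hypothetical.

```python
import numpy as np


def edge_correlation_volume(e_left, e_right, max_disp):
    """Channel-averaged inner product of edge features at each disparity.

    Assumes max_disp <= image width; out-of-view positions stay zero.
    """
    n_c, h, w = e_left.shape
    volume = np.zeros((1, max_disp, h, w), dtype=e_left.dtype)
    for d in range(max_disp):
        prod = e_left[:, :, d:] * e_right[:, :, : w - d]   # (N_c, H, W - d)
        volume[0, d, :, d:] = prod.sum(axis=0) / n_c
    return volume


def multi_featured_volume(c_concat, c_groupwise, c_edge):
    """Concatenate the three cost volumes along the channel axis."""
    return np.concatenate([c_concat, c_groupwise, c_edge], axis=0)


rng = np.random.default_rng(0)
e_l = rng.standard_normal((4, 3, 5))   # toy edge features: N_c=4, H=3, W=5
e_r = rng.standard_normal((4, 3, 5))
max_disp = 2
c_edge = edge_correlation_volume(e_l, e_r, max_disp)
c_concat = np.zeros((28, max_disp, 3, 5))      # placeholder, 14x2 channels
c_groupwise = np.zeros((32, max_disp, 3, 5))   # placeholder, 32 channels
fused = multi_featured_volume(c_concat, c_groupwise, c_edge)
print(fused.shape)
```

Concatenating along the channel axis leaves the disparity, height, and width axes untouched, so the fused volume can be passed to the same 3D aggregation network as the original GWC-Net volume, with only the input channel count changed.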

Output module and loss function
Our network produces four predicted disparity maps $d_s$; to fully utilize these outputs, we assign a different weight to each. We first employ two convolutional layers, with 3×3 and 1×1 kernels, to obtain a 1-channel output; thereafter, the output feature map is upsampled using bilinear interpolation. Finally, a softmax function is applied to calculate the disparity prediction map. The disparity smooth loss is calculated as follows:

$$L_{disp} = \sum_{i} \lambda_i \cdot \frac{1}{N} \sum_{j=1}^{N} \mathrm{smooth}_{L_1}\big(d_s^j - \hat{d}_s^j\big)$$

where $\lambda_i$ denotes the weight for the $i$-th output disparity prediction map, $N$ represents the total number of pixels in one image, and $d_s^j$ is the $j$-th element of the prediction, with ground truth $\hat{d}_s^j$. The smooth L1 loss is computed as follows:

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

Since the object edge information in images benefits the parallax prediction task, we propose an edge detection loss $L_{edge}(x)$ for end-to-end learning, in which $x_j$ and $y_j$ represent the activation value and the ground-truth edge probability at pixel $j$, respectively, $P(x)$ is the standard sigmoid function, and $N$ denotes the total number of pixels in one image. Fusing the edge feature information extracted from different output layers, our edge loss function can be formulated as:

$$L_{ED} = \sum_{k} \beta_k L_{edge}(x^k) + L_{edge}(x^{fuse})$$

where $x^k$ is the activation value from stage $k$, while $x^{fuse}$ denotes the last edge output. $\beta_k$ is the weight of stage $k$ (equal to 0.2, 0.4, and 0.6 here).

Since we are working in a disparity prediction setup, we fuse the edge detection loss and the disparity prediction loss. Therefore, we design a double hierarchical loss weighting scheme; the total loss is calculated as:

$$L = \gamma_0 L_{disp} + \gamma_1 L_{ED}$$

where $\gamma_0$ is the weight of the total disparity prediction loss, set to 1, and $\gamma_1$ is the weight of the total edge detection loss, set to 0.1.
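To make the weighting scheme concrete, here is a small plain-Python sketch of the smooth L1 penalty, the weighted disparity loss over several outputs, and the final combination with the edge loss. It treats each disparity map as a flat list of pixel values and ignores invalid-pixel masking, both of which are simplifications. The edge loss is passed in as a number because its per-pixel form is not reproduced here; the value used in the example is a placeholder, not a result from the paper.

```python
def smooth_l1(x):
    """Smooth L1 penalty: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def disparity_loss(predictions, ground_truth, lambdas):
    """Weighted sum over outputs of the mean per-pixel smooth L1 error."""
    total = 0.0
    for pred, lam in zip(predictions, lambdas):
        per_pixel = [smooth_l1(p - g) for p, g in zip(pred, ground_truth)]
        total += lam * sum(per_pixel) / len(per_pixel)
    return total


def total_loss(l_disp, l_edge, gamma0=1.0, gamma1=0.1):
    """Combine disparity and edge losses with the weights given in the text."""
    return gamma0 * l_disp + gamma1 * l_edge


ground_truth = [10.0, 20.0, 30.0]
outputs = [                       # four toy disparity predictions, coarse to fine
    [10.5, 22.0, 30.0],
    [10.2, 21.0, 30.0],
    [10.1, 20.5, 30.0],
    [10.0, 20.2, 30.0],
]
l_disp = disparity_loss(outputs, ground_truth, lambdas=(0.5, 0.5, 0.7, 1.0))
l_edge = 0.8                      # placeholder edge loss value for illustration
print(total_loss(l_disp, l_edge))
```

With $\gamma_1 = 0.1$, the edge term acts as an auxiliary signal: it contributes to the gradients of the shared backbone while the disparity term dominates the objective.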

Experiments
In this section, we evaluate PA-Net with different settings on the Scene Flow [8] and KITTI [9,10] datasets. Sections 4.1 and 4.2 describe the experimental setup of the proposed network on the KITTI and Scene Flow datasets. In Section 4.3, we run a series of ablation experiments using different methods to test the performance of our PA module. In Section 4.4, we add our edge detection volume to PSM-Net and GWC-Net to validate the importance of the multi-featured integration cost volume.

Experimental setup
We implemented our architectures in PyTorch. All models were trained using Adam ($\beta_1 = 0.9$, $\beta_2 = 0.99$). Owing to limited hardware, the batch size was set to 4, and we optimized all models on two Nvidia RTX 2080 Ti GPUs using 256×512 random crops from the input image pair. The maximum disparity was set to 192, and the coefficients of the four outputs were set to $\lambda_1 = 0.5$, $\lambda_2 = 0.5$, $\lambda_3 = 0.7$, and $\lambda_4 = 1.0$. We trained the model on the Scene Flow dataset for a total of 16 epochs; the learning rate was 0.001 and was halved after epochs 10, 12, and 14. For the KITTI [9,10] datasets, the model pre-trained on Scene Flow [8] was fine-tuned for another 300 epochs. The initial learning rate was 0.001, and it was divided by 10 after 200 epochs.
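The two learning rate schedules described above can be written as simple step functions. The sketch below assumes epochs are counted from 1 and that a decay takes effect on the epoch after each milestone (for example, epochs 11 and 12 of Scene Flow training use half the base rate); the text does not state this convention, and the function names are hypothetical.

```python
def scene_flow_lr(epoch, base_lr=1e-3, milestones=(10, 12, 14), factor=2.0):
    """Learning rate for a 1-indexed Scene Flow epoch (16 epochs in total)."""
    passed = sum(1 for m in milestones if epoch > m)   # milestones already crossed
    return base_lr / (factor ** passed)


def kitti_lr(epoch, base_lr=1e-3, decay_epoch=200, factor=10.0):
    """Learning rate for a 1-indexed KITTI fine-tuning epoch (300 in total)."""
    return base_lr / factor if epoch > decay_epoch else base_lr


for epoch in (1, 10, 11, 13, 15, 16):
    print(epoch, scene_flow_lr(epoch))
```

In PyTorch, a schedule of this shape is commonly expressed with `torch.optim.lr_scheduler.MultiStepLR`, using `gamma=0.5` for the Scene Flow schedule and `gamma=0.1` for the KITTI schedule.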

1) Scene Flow [8]: A collection of synthetic stereo datasets. It has three subsets (Driving, FlyingThings3D, and Monkaa) and contains 35454 images for training and 4370 images for testing, with height 540 and width 960, along with ground-truth maps. Because the number of images in the Scene Flow dataset is sufficiently large, the trained network model generalizes well. Visualizations and comparisons for Scene Flow [8] are shown in Fig 1.
2) KITTI 2012 and KITTI 2015 [9,10]: Real-world driving-scene datasets in which the three-dimensional coordinates of scene points are obtained by LiDAR scanning. KITTI 2012 includes 194 training stereo pairs and 195 testing pairs. KITTI 2015 comprises 200 stereo pairs for training and 200 for testing. The training set is divided into two parts: 180 pairs for training and the remaining pairs for validation. In addition, we created a corresponding edge detection label dataset for the end-to-end edge detection task. Visualizations and comparisons are shown in Fig 5, and we submitted the results predicted by the PA model to the official KITTI website. The comparison results on the test set are summarized in Tables 2 and 3. They show that PA-Net achieves better results than PSMNet [21], GwcNet [13], and PASMNet [25].

Ablation study for parallax attention module
In this section, to validate the PA module, we evaluate it with different stereo matching strategies. Moreover, we run a series of ablation experiments to explore the best number of PA modules. PSM-Net [21] and GWC-Net [13] were selected as reference models, to which PA modules were added.
The PA module can be used directly with 3D convolution layers, since it does not change the number of channels or the size of the features. The experimental results demonstrate that, at the cost of a small increase in computation time, PA-Net performs better than previous works. As summarized in Table 1, we select the classic methods [13,21] as reference models and vary two components (the edge detection structure and the PA module). After adding the PA modules, the EPE on the Scene Flow dataset decreases by 9.71% for the model of [21] and by 11.0% for the model of [13]. Fig 6 depicts the training and validation curves of PA-Net, GWC-Net, and PSM-Net on the KITTI 2015 dataset from epoch 100 to 300. The loss curve of PA-Net decreases more smoothly than those of previous works and shows consistent gains that are sustained throughout training. Moreover, PA-Net performs better than those works once the networks begin to fit and ultimately achieves the highest accuracy.
To select the optimal number of PA modules for the network, we ran the experiments summarized in Table 4, which reports the results of PA-Net with different numbers of PA modules. When the number of PA modules is larger than 6, the gain in accuracy becomes minor. Considering the amount of computation and memory consumption, we selected six PA modules for the final structure.

Ablation study for multi-featured cost volume
In this section, we apply several critical modifications to the feature extraction network of [13]. Specifically, we design a multi-featured integration cost volume structure that consists of three parts: ED cost, group-wise cost, and concatenation cost volumes. The experimental results in Table 1 demonstrate that adding our edge detection structure reduces the EPE. As summarized in Table 1, after adding the ED modules, the EPE on the Scene Flow dataset decreases by 3.34% for the model of [21] and by 2.50% for the model of [13].
Following [13], we conclude from several experiments that setting the number of channels of the group-wise volume to 32 yields excellent performance. The experimental results in Table 5 demonstrate that the EPE decreases when an appropriate number of channels is chosen for the concatenation volume. The best EPE is 0.574 (14×2 concat channels) on KITTI 2015 and 0.616 (16×2 concat channels) on KITTI 2012. Considering both the accuracy of disparity prediction and memory usage, we selected 14×2 as the final number of channels of the concatenation volume.

Analysis and interpretation
While PA blocks have been shown empirically to improve network performance, we would also like to explain how the parallax attention mechanism operates in practice. To provide a clearer picture of the behavior of PA blocks, in this section we take several examples from the GWC-Net model and examine how the distributions of sensitive regions differ between 3D convolutional layers and PA blocks. Their distribution maps, from models trained on the KITTI 2015 dataset, are shown in Fig 7. We make the following observations about how the parallax attention mechanism works in the 3D feature extraction stage. First, a traditional 3D convolutional layer adopts a global receptive field mechanism, which guides the network to pay equal attention to different features. In practice, however, objects closer to the observer have a greater impact. As shown in Fig 7, compared with distant trees and houses, people and cars in the near region should receive more attention. Yet the highlighted regions in the 3D feature map cover every corner of the image, which does not match this requirement. Second, in PA blocks, we redistribute the values along the parallax dimension to make full use of context information. For a region with a large parallax value, the proportion of its value becomes larger after redistribution. In the third row of Fig 7, the highlighted regions are concentrated in areas with large parallax such as roads, cyclists, and cars, indicating that the network pays more attention to these areas. PA blocks thus successfully focus on objects with large parallax through the weight redistribution strategy in the parallax dimension.

Discussion
Our method not only improves the accuracy of disparity prediction globally but also offers the following advantages. First, for an identical receptive field, our module has considerably fewer parameters than a traditional 3D convolutional layer (reduced by 25%) and consumes much less memory. Moreover, the PA module can be easily used in other works because it does not change the size of the feature map. Second, as shown in Fig 1, our structure improves the accuracy of disparity prediction in near-range regions by improving 3D feature expression. Lastly, to make full use of the edge information learned by the two-dimensional feature extraction network, we propose a novel edge detection branch and a multi-featured integration cost volume. We demonstrate that, in our model, the edge detection task helps improve the accuracy of disparity estimation. As Table 2 shows, compared with GWC-Net, our method performs better in two-pixel error, three-pixel error, and five-pixel error on the KITTI 2012 dataset. Compared with PSM-Net, the percentage of disparity outliers averaged over all ground-truth pixels (D1-all) is reduced by 11.6%, and the running speed is increased by 19.5%.
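For readers unfamiliar with the metrics quoted above, the following plain-Python sketch shows how EPE, a D1-style outlier rate, and a relative reduction can be computed. The outlier rule used here (absolute error above 3 pixels and above 5% of the true disparity) is the commonly used KITTI 2015 convention and is stated as an assumption, since the text does not spell it out. All numbers in the example are made up for illustration and are not results from the paper.

```python
def end_point_error(pred, gt):
    """Mean absolute disparity error over all pixels (EPE)."""
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors)


def d1_all(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of pixels whose error exceeds both thresholds."""
    outliers = sum(
        1
        for p, g in zip(pred, gt)
        if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * g
    )
    return 100.0 * outliers / len(gt)


def relative_reduction(baseline, ours):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline


gt = [10.0, 21.0, 30.0, 100.0]      # illustrative ground-truth disparities
pred = [10.0, 20.0, 35.0, 104.0]    # illustrative predictions
print(end_point_error(pred, gt), d1_all(pred, gt))
print(relative_reduction(2.0, 1.5))  # illustrative error rates, in percent
```

Note that the last pixel has a 4-pixel error but is not counted as an outlier, because 4 pixels is below 5% of its true disparity of 100.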

Conclusions
In this paper, we proposed a high-precision and practical stereo matching network, PA-Net, for end-to-end disparity prediction. Our network emphasizes that improving feature expression in near-range regions helps the disparity prediction task. PA-Net performs better than previous networks by utilizing the edge detection branch, the PA module, and the multi-featured cost volume. We demonstrate that, in our model, the edge detection task helps improve the accuracy of the disparity estimation task. PA-Net achieves better accuracy than GWC-Net on the Scene Flow and KITTI datasets.