The geometric attention-aware network for lane detection in complex road scenes

Lane detection in complex road scenes remains challenging because of poor lighting conditions, interference from irrelevant road markings or signs, and similar factors. To address lane detection across various complex road scenes, we propose a geometric attention-aware network (GAAN). GAAN adopts a multi-task branch architecture, uses the attention information propagation (AIP) module for communication between branches, and then uses the geometric attention-aware (GAA) module to complete feature fusion. To verify the lane detection performance of the proposed model, experiments were conducted on the CULane, TuSimple, and BDD100K datasets. The experimental results show that our method performs well compared with current leading lane detection networks.


Introduction
Lane detection is a basic but still challenging task [1][2][3][4][5] in the perception of autonomous vehicles, which requires that an algorithm detect the lane lines in traffic scene images captured by car cameras. Some recent works have defined lane detection as a dense pixel-wise prediction task [6][7][8]. Segmented lane lines can be used for trajectory tracking control and vehicle positioning in autonomous driving, and the detected lanes can be used to judge the status of other traffic participants. In addition, lane detection is a pivotal part of building high-precision maps and of crash prediction [9][10][11].
Recently, most studies have treated lane detection as a semantic segmentation task [6][7][8][12], but they rely heavily on sparse, fixed-width labels as supervision signals for a fully convolutional network that classifies each pixel as foreground (lane line) or background. Although some methods segment lane lines accurately in traffic conditions with good weather and wide views, realistic driving scenes are often complicated and changeable. In traffic jams, many vehicles occlude the lane lines, which makes a fully convolutional network tend to predict discontinuous or fuzzy lane lines. These situations therefore pose great challenges to lane detection methods based on semantic segmentation.
Several solutions have been proposed to improve lane detection accuracy in complex road scenes. First, the receptive field of the fully convolutional network is expanded to infer the characteristics of lane lines from a "global perspective", for example with the ASPP module of DeepLabv3 [13], and the backbone is designed to be very deep to better understand the target globally, as in ResNet [14] and DenseNet [15]. Second, the ability to pass messages among neurons in the network is increased to encode richer semantic context. Third, a multi-task network architecture [16] is used to predict more lane line characteristics and improve lane detection in adverse conditions. Fourth, the model is trained on high-quality, large-scale datasets in which unclear or occluded lane lines are annotated manually, so that the network learns abundant features. However, in complex road scenes these methods still do not achieve high accuracy. Semantic segmentation methods based on fully convolutional networks merely generate a binary classification prediction for each pixel according to the one-hot mask. Such models tend to produce fuzzy segmentation results at target boundaries and are easily influenced by noise, which causes misclassification. Thus, we introduce the distance transform [17] mask shown in Fig 1B, a continuous representation in which each pixel stores the minimum distance to a nearby line segment or boundary. Compared with the one-hot mask (Fig 1A) used for classification, this representation yields smoother gradients during back propagation.
Regarding the distance transform mask, Audebert et al. [16] introduced a cascaded multi-task loss based on the distance transform to improve boundary segmentation. Hayder et al. [18] used this mask instead of the one-hot mask to address poor segmentation caused by inaccurate object candidate boxes. In [19], the authors recast the regression prediction as a distance segmentation mask task in order to focus more on pixels near boundaries and improve the segmentation of target boundaries in satellite images. Although these methods use the distance transform to overcome boundary leakage in semantic segmentation, they treat the multi-task branches as independent tasks or simply fuse the feature maps. We therefore propose GAAN, which allows the model to use the geometric distance information of the lane lines to guide segmentation and to enhance the network's understanding of semantic context. Concretely, GAAN adopts a multi-task branch architecture. The first branch, semantic segmentation, predicts the lane lines. The second branch, geometric distance embedding, predicts by regression the minimum Euclidean distance to the lane line for the pixels between its center and its boundary. For communication between the branches, we use a module that autonomously selects the required information, which we call the AIP module. Moreover, we design a GAA module at the tail of the two branches to obtain features carrying geometric distance information and then fuse the final high-dimensional lane line features, so that missing or wrong lane line features in the semantic segmentation branch can be repaired and predicted correctly. Finally, the semantic features at different levels of the encoder are fused by the Skip Pyramid Fusion Up-sampling (SPFU) module, which better restores the prediction of lane boundary pixels.

Geometric distance transform label
As described in the previous section, most lane detection algorithms based on semantic segmentation are essentially pixel-by-pixel classification tasks. However, when a predicted pixel deviates only slightly from the ground-truth label, the loss function penalizes the network exactly as it would for a completely wrong prediction. This "ignorance" is unfair, so hard classification is not beneficial for segmenting lane boundaries in complex scenes. In this paper, we therefore predict geometric distance transformed labels to improve the semantic segmentation of lane lines in complex scenes. The distance transformed label is a continuous representation that encodes each pixel on the lane line.
Producing the masks is simple and only requires adjusting the original lane line labels. The process of generating the geometric distance transformed label is shown in Fig 2. First, the one-hot labels in the lane datasets are sampled at the pixel coordinates of the central part of each lane line, and each sampled lane is marked as a line 1 pixel wide. Second, we compute the distance transform of these one-hot labels, which gives the minimum Euclidean distance from each pixel to the nearby lane line. We then set a threshold τ to limit the range of the distance transform area and eliminate the influence of invalid values in regression; τ is related to the width of the lane line in the label mask. However, this truncated label still increases continuously from the center of the lane line to the boundary, which adds redundant noise areas to the regression task. Finally, we therefore invert the truncated distance mask so that the geometric distance decreases continuously from the center of the lane line to 0 at the boundary. The resulting distance transformation mask $d_{mask}$ is given by Eq 1, where $\min(d_p)$ is the minimum Euclidean distance from a pixel p to the nearby lane line and τ is the truncation threshold. Hence, the distance mask transforms the lane mask from a line into a band-shaped area. The geometric distance transformed mask has the following advantages over the one-hot label mask:
1. The lane line pixels on the transformed label mask encode the distance to the boundary, which helps improve the segmentation of the boundary.
2. Compared with the category information in the one-hot label mask, each pixel in the distance transformed label mask carries specific distance information. This precise information may reduce the impact of redundant noise.
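The label generation steps above can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function name `distance_label`, the brute-force nearest-pixel search, and the closed form max(τ − min(d_p), 0) are assumptions read from the description of Eq 1 (truncate at τ, then invert so that the value falls to 0 at the boundary).

```python
import numpy as np


def distance_label(centerline, tau):
    """Build an inverted, truncated distance mask (assumed form of Eq 1).

    centerline: (H, W) array, non-zero on the 1-pixel-wide lane center lines.
    tau: truncation threshold in pixels.
    """
    h, w = centerline.shape
    ys, xs = np.nonzero(centerline)
    if ys.size == 0:
        # No lane in the image: the mask is all zeros.
        return np.zeros((h, w), dtype=np.float64)

    grid_y, grid_x = np.mgrid[0:h, 0:w]
    # Brute-force distance from every pixel to every center-line pixel,
    # shape (H, W, K); adequate for a sketch, slow for full-size images.
    dy = grid_y[..., None] - ys
    dx = grid_x[..., None] - xs
    d_min = np.sqrt(dy ** 2 + dx ** 2).min(axis=-1)

    # Truncate at tau and invert: tau on the center line, 0 at the boundary.
    return np.where(d_min < tau, tau - d_min, 0.0)
```

For a vertical center line in the middle of a 7-pixel-wide image and τ = 3, every row of the mask reads 0, 1, 2, 3, 2, 1, 0. On full-resolution labels, a linear-time Euclidean distance transform would normally replace the brute-force search.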

The framework of the geometric attention-aware network
This section gives an overview of the end-to-end deep convolutional neural network designed to detect lane lines in complex road scenes. As shown in Fig 3, the GAAN framework consists of 6 parts: the backbone network, the semantic segmentation branch, the geometric distance embedding branch, the AIP module, the GAA module, and the SPFU module. The backbone network maps the RGB images to a high-dimensional feature space, and the two branches behind the backbone reconstruct the geometric distance embedding and the lane line semantic labels from the shared high-dimensional features. The AIP module handles the communication of feature information between the two task branches by adaptively selecting the feature information to fuse. The GAA module then combines the distance embedding features and the semantic features; in short, it fuses the feature maps of the two branches, which carry long-distance information and contextual semantics respectively. In the final step, the feature maps of the backbone network are restored by the SPFU module, which works together with the GAA module to gradually generate feature maps at different resolutions, so that the loss function can supervise them during training.
In our backbone network, ResNet is modified to enhance its ability to express lane features. As shown in Fig 3, we divide the backbone into 4 layers. The yellow parts in layers 1 and 2 are down-sampling layers with a stride of 2, which helps preserve the spatial information in the feature map. Layers 3 and 4 use atrous (dilated) convolution to capture a wider range of contextual semantic information.

Attention information propagation module
Information sharing and propagation play a significant role in a network with multi-task branches, but the sharing and propagation strategy between branches is difficult to tune manually. We therefore introduce the AIP module, which learns a weight for each channel and automatically selects which branch feature maps to output.
An AIP module is located between two up-sampling layers of the decoders, and there are three AIP modules between the two branches. The lane feature information extracted by the backbone is propagated not only within its own task branch but also to the other task branch through the AIP module, which selects and fuses features from the current branch and the other branch.
Concretely, Fig 4A shows the first AIP module as an example. The first layer in the semantic segmentation branch is S-Up-Conv1, and its output feature map is named $S_1$. The first layer in the geometric distance embedding branch is D-Up-Conv1, and its output feature map is named $D_1$. DCAB and SCAB are the channel attention blocks of the distance embedding branch and the segmentation branch respectively. The propagation of the attention information is defined in Eq 2, where $\alpha_1$ and $\alpha_2$ are the channel attention weights of feature map $S_1$, and $\beta_1$ and $\beta_2$ are the channel attention weights of feature map $D_1$. The channel attention block (CAB) is shown in Fig 4B: we first apply global average pooling to the input features to obtain a feature vector containing global context information, and then apply a 1x1 convolution and an activation function to this vector. We name the shared information $AIPM_{S_2}$ and $AIPM_{D_2}$, which are sent to the subsequent layer.
Although all AIP modules have the same structure, their parameters are independent, which makes information propagation between stages of the multi-task network more flexible. Furthermore, $S_1$ is passed to the next up-sampling layer as an identity mapping, which ensures that information keeps propagating inside the branch and is not interrupted during training. This residual-like design also helps the back propagation of the gradient.
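The channel attention block and the residual-like propagation described above can be sketched as follows. This is only an illustration under stated assumptions: the sigmoid activation, the function names `channel_attention` and `aip_fuse`, and the fusion form "identity + weighted own features + weighted features of the other branch" are hypothetical readings of the text, and the exact form of Eq 2 may differ.

```python
import numpy as np


def channel_attention(feat, weight, bias):
    """Channel attention block: global average pooling, 1x1 conv, activation.

    feat:   (C, H, W) feature map.
    weight: (C, C) weights of the 1x1 convolution (acts on the pooled vector).
    bias:   (C,) bias of the 1x1 convolution.
    Returns one weight per channel in (0, 1); sigmoid is an assumption.
    """
    pooled = feat.mean(axis=(1, 2))          # global average pooling -> (C,)
    logits = weight @ pooled + bias          # 1x1 conv on a 1x1 spatial map
    return 1.0 / (1.0 + np.exp(-logits))     # assumed activation: sigmoid


def aip_fuse(own, other, w_own, w_other):
    """Assumed fusion: identity path plus channel-weighted own/other features."""
    return own + w_own[:, None, None] * own + w_other[:, None, None] * other
```

With $S_1$ as `own` and $D_1$ as `other`, `aip_fuse` would produce the features passed on in the segmentation branch; swapping the arguments gives the counterpart for the distance embedding branch. The identity term keeps the branch's own information flowing even when the attention weights are close to zero.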

Geometric attention-aware module
The geometric distance embedding branch predicts by regression the continuous distance from the center of the lane line to its boundary. This branch extracts feature maps carrying lane line geometry to guide the segmentation results of the semantic branch, and it is more tolerant than the semantic segmentation task, which classifies pixel by pixel. We therefore introduce the GAA module at the end of the two task branches. It captures the context between distant lane line positions from the high-dimensional features of the geometric distance embedding. This information includes boundary distance context, which benefits the segmentation of both the entire lane line and the boundary pixels.
The first step of the GAA module decouples the input geometric distance embedding features to generate a spatial attention matrix, which models the spatial relationship between any two pixels in the feature map. The second step multiplies the attention matrix with the semantic segmentation feature matrix. The third step applies an element-wise sum to the result of the second step and obtains the final features that reflect long-range contextual geometric information.
The working process of this module is shown in Fig 5. Let the output feature of the semantic segmentation branch be $A \in \mathbb{R}^{C \times H \times W}$. The output feature of the geometric distance embedding branch is decoupled through two 1x1 convolutional layers into new features $B \in \mathbb{R}^{C \times H \times W}$ and $C \in \mathbb{R}^{C \times H \times W}$, and we reshape B and C to $\mathbb{R}^{C \times N}$, where $N = H \times W$ is the number of pixels. We then transpose the reshaped feature C and multiply it with the reshaped feature B. Finally, we apply SoftMax to obtain the spatial attention map $S \in \mathbb{R}^{N \times N}$, as shown in Eq 3, where $s_{ji}$ measures the influence of the i-th position on the j-th position; the more similar the feature representations of two positions, the greater their correlation. At the same time, the output of the semantic segmentation branch is fed to a 1x1 convolution layer to generate a new feature map $D \in \mathbb{R}^{C \times H \times W}$, which is reshaped to $\mathbb{R}^{C \times N}$. We then perform a matrix multiplication between D and S and reshape the result to $\mathbb{R}^{C \times H \times W}$. Finally, the result and feature A are summed element-wise to obtain the output of the GAA module, $E \in \mathbb{R}^{C \times H \times W}$; the value at the j-th position is given in Eq 4.
Eq 4 shows that each element of the feature map output by the GAA module is a weighted sum of the geometric distance features and the semantic segmentation feature map. The output therefore has rich global geometric context and adaptively aggregates context information through spatial attention, which improves the continuity of the predicted lane lines.
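The three steps of the GAA module can be traced on small arrays. The sketch below is a NumPy illustration under assumptions, not the authors' code: each 1x1 convolution is modeled as a bias-free (C, C) matrix applied to the reshaped (C, N) features, the SoftMax is taken over the source position i, no learnable scale is applied before the residual sum, and the names `gaa_forward`, `w_b`, `w_c`, and `w_d` are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def gaa_forward(seg_feat, dist_feat, w_b, w_c, w_d):
    """Spatial attention from distance features applied to segmentation features.

    seg_feat:  A, (C, H, W) output of the semantic segmentation branch.
    dist_feat: (C, H, W) output of the geometric distance embedding branch.
    w_b, w_c, w_d: (C, C) weights standing in for the three 1x1 convolutions.
    Returns (E, S) with E of shape (C, H, W) and S of shape (N, N).
    """
    c, h, w = seg_feat.shape
    n = h * w
    a = seg_feat.reshape(c, n)
    g = dist_feat.reshape(c, n)

    b = w_b @ g            # B, (C, N)
    cc = w_c @ g           # C, (C, N)
    d = w_d @ a            # D, (C, N)

    # Step 1: energy[j, i] = <C_j, B_i>; softmax over i gives s_ji.
    energy = cc.T @ b      # (N, N)
    s = softmax(energy, axis=1)

    # Step 2: aggregate D over all positions i for each position j.
    context = d @ s.T      # (C, N), column j = sum_i s_ji * D_i

    # Step 3: element-wise sum with the original segmentation feature A.
    e = (context + a).reshape(c, h, w)
    return e, s
```

Because S has N × N entries, memory grows quadratically with the number of pixels, which is one reason such a module is usually applied to low-resolution feature maps.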

Skip Pyramid Fusion Up-sampling module
As the features pass through the encoder and decoder, the resolution changes continuously, which causes detail information in the feature map to be lost. To address this, we propose the SPFU (Skip Pyramid Fusion Up-sampling) module, which restores more high-quality lane line detail in the final semantic feature map. As shown in Fig 3, the SPFU module uses image features extracted at different levels of granularity through skip connections. We therefore select the feature maps of some middle layers of the encoder.
We take the first SPFU module, SPFU1, as an example. As shown in Figs 3 and 6, the inputs of SPFU1 are the final feature map of the GAA module and a feature map of the backbone. After applying a 1x1 convolution to generate new feature maps, we adjust their shapes so that they can be concatenated. Finally, we apply two separate 3x3 convolutions: the output of one is fused with the next backbone feature map in SPFU2, the next-stage SPFU module, and the output of the other is supervised by the semantic segmentation loss function.

Loss function
Most semantic segmentation methods use cross entropy to measure the difference between the prediction and the ground truth. However, cross-entropy loss is better suited to natural images with complete, large objects, whereas lane lines are long and thin and the lane datasets contain many background pixels, which hinders predicting the targets. We therefore use a weighted cross-entropy loss to supervise the training of the semantic segmentation branch, because setting different weights controls how much each pixel category contributes to the loss. Its definition is given in Eq 5, where $A \in \mathbb{R}^{H \times W}$ is the final output of the semantic segmentation branch and φ(·) is the Softmax function. After Softmax, the feature map A becomes the lane line probability map. N is the total number of pixels in the feature map, and ω is the loss contribution weight of each prediction category. On the CULane dataset, we usually set the background weight to 0.4 and the lane line weights to 1.
In the geometric distance embedding branch, we want to predict the continuous distance from the center line to the boundary of each lane line, which is a regression task rather than a classification task. We therefore use the mean square error (MSE) to measure the error between the prediction of the geometric distance embedding branch and the ground-truth label, as shown in Eq 6, where $B \in \mathbb{R}^{H \times W}$ is the final output of the geometric distance embedding branch and $\hat{B}_i$ is the corresponding value of the geometric distance mask $d_{mask}$.
In summary, the total loss function is given in Eq 7, where $L_{seg}$ is the weighted cross-entropy loss of the semantic segmentation branch, $L_{dt}$ is the mean square error loss of the geometric distance embedding branch, and $L_{seg_k}$ is the auxiliary semantic segmentation loss used to supervise the feature maps output by the SPFU modules. $L_{exist}$ is a binary cross-entropy loss that supervises whether each lane line exists in the image. α and β are hyperparameters.
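The two branch losses can be written out directly from their descriptions. The NumPy sketch below is an illustration under assumptions: the segmentation output is taken as per-class scores of shape (K, H, W), the weighted cross entropy is normalized by the pixel count N as the text states (some frameworks normalize by the sum of the weights instead), and the function names are hypothetical. The combination into the total loss of Eq 7 is not reproduced because the text does not say which terms α and β scale.

```python
import numpy as np


def weighted_cross_entropy(logits, target, class_weights):
    """Weighted cross entropy over all pixels, divided by the pixel count N.

    logits:        (K, H, W) raw class scores of the segmentation branch.
    target:        (H, W) integer class index per pixel (0 = background).
    class_weights: (K,) loss contribution weight of each class.
    """
    # Log-softmax over the class axis, computed in a numerically stable way.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))

    # Pick the log-probability of the ground-truth class at every pixel.
    rows, cols = np.indices(target.shape)
    picked = log_prob[target, rows, cols]

    weights = np.asarray(class_weights, dtype=np.float64)[target]
    return float(-(weights * picked).sum() / target.size)


def mse_loss(pred, label):
    """Mean square error between predicted and ground-truth distance masks."""
    return float(np.mean((pred - label) ** 2))
```

For CULane, with a background class and four lane classes, the weights described in the text would be `[0.4, 1.0, 1.0, 1.0, 1.0]`. Lowering the background weight reduces the share of the loss contributed by the many background pixels relative to the thin lane lines.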

Datasets and evaluation
To verify the effectiveness of GAAN for lane detection in complex road scenes, experiments were conducted on the TuSimple, CULane, and BDD100K datasets. The data are divided into detailed traffic scenarios so that the detection results can be evaluated per scenario.
Among the three datasets, TuSimple focuses on highway scenes, while CULane and BDD100K mainly focus on urban road scenes. Table 1 gives a detailed description of the three datasets. The second column is the total number of frames in each dataset, and the third, fourth, and fifth columns are the numbers of images in the training, validation, and testing sets. The seventh column is the road type of the dataset and the eighth column is the number of lane lines.
All experiments were performed on a workstation with two NVIDIA GeForce RTX 2080 Ti GPUs (11 GB each) running Ubuntu 18.04. Training and inference use the PyTorch deep learning framework.
We resized the images of TuSimple, CULane, and BDD100K to 368×640, 288×800, and 360×640 respectively as the input size. GAAN is trained with the SGD optimizer, an initial learning rate of 0.01, the poly learning rate update strategy with an attenuation coefficient of 0.9, and a batch size of 12. The number of training iterations is 1800 for TuSimple and 60K each for CULane and BDD100K. The hyperparameters α and β in the final loss function are both set to 0.1.
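The learning rate schedule can be illustrated with a short function. This sketch assumes that "poly" refers to the commonly used polynomial decay, lr = base_lr · (1 − iter / max_iter)^power, and that the attenuation coefficient 0.9 is the exponent; the function name `poly_lr` is hypothetical.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomial ("poly") learning rate decay, assumed form.

    base_lr:   initial learning rate (0.01 in the text).
    iteration: current training iteration, 0 <= iteration <= max_iter.
    max_iter:  total number of training iterations.
    power:     decay exponent (assumed to be the coefficient 0.9).
    """
    return base_lr * (1.0 - iteration / max_iter) ** power
```

Under these assumptions, training on CULane would start at 0.01 and decay smoothly to 0 at iteration 60K, with the rate at the halfway point equal to 0.01 · 0.5^0.9.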
To test the performance of our model, we use Accuracy, False Positive (FP), and False Negative (FN) on the TuSimple dataset, F1-Measure and FP on the CULane dataset, and Accuracy and IoU on the BDD100K dataset. These evaluation measures are computed as follows:

$$FP = \frac{F_{pred}}{N_{pred}} \tag{8}$$

where $F_{pred}$ is the number of incorrectly predicted lanes and $N_{pred}$ is the number of all predicted lanes.

$$FN = \frac{M_{pred}}{N_{gt}} \tag{9}$$

where $M_{pred}$ is the number of ground-truth lane lines missed by the prediction and $N_{gt}$ is the number of ground-truth lane lines.

$$Accuracy = \frac{\sum_{clip} C_{clip}}{\sum_{clip} S_{clip}} \tag{10}$$

where $C_{clip}$ is the number of correctly predicted lane pixels in a clip and $S_{clip}$ is the total number of effective lane line pixels in that clip.
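The evaluation measures used in this section are ratios of counts, so they can be checked with a few lines of Python. The sketch below uses the standard definitions of precision, recall, F-measure, and IoU in terms of true positives (TP), false positives (FP), and false negatives (FN); the function names are hypothetical, and the rule that decides whether a predicted lane counts as a true positive is dataset-specific and not modeled here.

```python
def fp_rate(wrong_pred, n_pred):
    """TuSimple FP: incorrectly predicted lanes over all predicted lanes."""
    return wrong_pred / n_pred


def fn_rate(missed_gt, n_gt):
    """TuSimple FN: missed ground-truth lanes over all ground-truth lanes."""
    return missed_gt / n_gt


def accuracy(correct_per_clip, total_per_clip):
    """Accuracy: correct lane pixels over effective lane pixels, summed over clips."""
    return sum(correct_per_clip) / sum(total_per_clip)


def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from counts; beta = 1 gives the F1-Measure."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def iou(tp, fp, fn):
    """Intersection over union from counts."""
    return tp / (tp + fp + fn)
```

For example, with 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so the F1-Measure is also 0.8, while the IoU is 8/12.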
$$F_{\beta} = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall} \tag{11}$$

where β = 1, Precision and Recall are given in Eqs 12 and 13, and Eq 14 is the IoU:

$$Precision = \frac{TP}{TP + FP} \tag{12}$$

$$Recall = \frac{TP}{TP + FN} \tag{13}$$

$$IoU = \frac{TP}{TP + FP + FN} \tag{14}$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives.

Experiments
Table 2 shows the F1-Measure of GAAN on the CULane testing set. Compared with other advanced lane detection algorithms on CULane, the proposed method performs very well in seven different complex road scenes and on the total testing set. RD101-GAAN indicates that ResNet101 with deformable convolution [21] is used as the backbone network of GAAN. GAAN performs well in complex road scenes because the geometric distance embedding branch contains the geometric information of the lane boundary, which effectively guides the semantic segmentation result through the GAA module. However, the F1-Measure of GAAN in the crowded scene is lower than that of GCJ [9] in Tables 2 and 3, since GCJ designed a loss function based on the geometric relationship between the driving area and the lane lines for supervision. That is, the segmentation result of the driving area is strongly correlated with the lane lines, so the lane lines can be inferred from the driving area. In addition, since there is no ground truth in the Crossroad scene, only FP is counted for it.
To verify the effectiveness of GAAN's components, ablation experiments were performed by gradually adding components after the ResNet-50 backbone. As shown in Tables 4 and 5, Only-Dt denotes a network with only the distance embedding branch, and its lane detection result is worse than that of Only-Seg, which uses only the semantic segmentation branch. Seg-Dt denotes that the semantic segmentation branch and the geometric distance embedding branch are trained and predicted simultaneously, which performs better than either single-task branch. We then gradually add the AIP module, the GAA module, and the SPFU module on top of Seg-Dt.
As components are added to the network, the F1-Measure gradually increases on every scene of the CULane dataset, which shows that each component contributes positively to lane detection performance.
In deep-learning-based computer vision tasks, the feature expression ability of the encoder has a decisive influence on how well the whole network extracts target features. We therefore explore the impact of different encoders on GAAN's lane detection; the results are shown in Tables 6 and 7.
To qualitatively describe the ability of GAAN to detect lane lines in complex road scenes, Fig 9 shows three results selected from the CULane testing set to illustrate the performance of our method. Compared with SCNN, GAAN better detects the lanes covered by the car on the left side, because the GAA module captures long-distance dependencies between pixels. In addition, the input images of row 2 ...
In Table 8, we evaluate GAAN on the TuSimple dataset and compare it with other networks that perform well on this dataset.
The labels of BDD100K differ from those of TuSimple and CULane. BDD100K labels every lane line visible in the image, instead of only the 4 lane lines to the left and right of the current lane in the same direction. The dataset therefore contains samples with different numbers of lane lines, which strongly affects lane detection results. As a result, the learning and generalization ability of a network model can be effectively verified on BDD100K. Table 9 shows the evaluation results of GAAN on the BDD100K dataset.
Fig 10 displays the lane detection results of GAAN and SCNN on the BDD100K dataset. We selected night scenarios from the testing set in which the lane lines are not visible. Moreover, the BDD100K dataset requires the detected lane lines to be relatively dense, thus it is more

Discussion
In this paper, we have proposed GAAN, a multi-task branch neural network that further improves lane detection in complex scenes. One branch, the geometric distance embedding branch, learns the distance features from the center of each lane line to its boundary, and the other, the semantic segmentation branch, learns multi-scale semantic features. We use the AIP module to adaptively select complementary information for communication between the two branches and the GAA module to combine the two branches. The SPFU module then fuses the multi-scale features from each stage of the encoder. Experiments on the CULane, TuSimple, and BDD100K datasets show that our method performs better than several advanced lane detection methods.
In addition, lane detection is an indispensable part of autonomous driving, so it places high demands on the real-time performance and accuracy of the algorithm, and the number of model parameters must be controlled so that the model can be deployed on the device. In future research, we will therefore consider the real-time requirements of the lane detection model and use model compression techniques or a lightweight backbone network to reduce the number of parameters.