Real-time traffic sign recognition based on a general purpose GPU and deep-learning

We present a General Purpose Graphics Processing Unit (GPGPU) based real-time traffic sign detection and recognition method that is robust against illumination changes. There have been many approaches to traffic sign recognition in various research fields; however, previous approaches faced several limitations under low illumination or widely varying light conditions. To overcome these drawbacks and improve processing speed, we propose a method that 1) is robust against illumination changes, 2) uses GPGPU-based real-time traffic sign detection, and 3) performs region detection and recognition using a hierarchical model. This method produces stable results in low illumination environments. Both detection and hierarchical recognition are performed in real-time, and the proposed method achieves an F1-score of 0.97 on our collected dataset, which follows the Vienna Convention traffic rules (Germany and South Korea).


Introduction
As autonomous driving technology becomes important for the car industry, fundamental driver assistance features are being included in commercial cars and the size of the market is increasing. To create true self-driving technology and evolve from the Advanced Driving Assistant System (ADAS) to autonomous driving, understanding the environment is important.
Alongside ADAS technology, another core technology for autonomous driving is Traffic Sign Recognition (TSR), a system that detects and recognizes traffic signs captured by a front-mounted camera. With TSR, self-driving cars can thoroughly understand road regulations for safety. The European New Car Assessment Programme (EURO NCAP) regards TSR as a top priority among car safety features [1]. Germany hosts a TSR benchmarking competition [2,3] to support academic and industrial development.
While ADAS requires an extra sensor such as radar or ultra-sonic equipment, TSR only requires a camera and machine learning technology. In order to execute TSR in real-time and with high performance, cutting edge hardware mounted in the car system is usually required. However, the recent announcement of NVIDIA's General Purpose Graphics Processing Unit (GPGPU), DRIVE PX [4], a GPU for car systems, enables real-time heavy machine learning illumination variation. This method also utilizes a landmark feature, which allows the characteristics of the traffic sign to overcome partial occlusion of the input image. Additionally, during traffic sign detection, the method uses parallel processing (which can be implemented on a GPGPU) and reduces the detection time as compared to the sliding-window searching method. During the recognition stage, we use a cascade classifier that assembles the SVM to verify the candidate region and a CNN for diverse traffic sign recognition.
In the following section, we provide an overview of the presented method. Section 3 describes traffic sign detection that is robust to illumination. Section 4 describes the verification stage using SVM and recognition of traffic signs using CNN. Section 5 presents our experiment results for the testing set. Conclusions are given in section 6.

Overview of proposed methods
Recently in the field of traffic sign detection and recognition, diverse methods have been introduced. However, several problems still remain to be solved.
1. The models still lack invariance to illumination; conventional models are optimized primarily for daylight scenes
2. Technical difficulties remain in real-time processing
3. Detection of diverse traffic signs requires multiple detectors
In this paper, we propose a GPGPU-based real-time traffic sign detection and recognition method to address each of these issues. The overall flow of the method is presented in Fig 1. First, we extract the Byte-MCT feature from the input driving scene to detect the candidate region. Unlike AdaBoost-based approaches [12][13][14], which use multiple sliding-windows for each shape such as circle or triangle, the proposed method uses landmark-based parallel-windows to detect multiple traffic sign candidate regions with fair performance, even in low illumination environments or night scenes. Second, for real-time processing on high resolution input, we implement a GPGPU-based parallel window searching method instead of raster image scanning. Parallel window searching operates on a GPGPU in parallel for each divided grid of the input image.
After the candidate region is extracted from the input image, we split each candidate window into four partial window regions and generate feature vectors by concatenating their Byte-MCT histograms. These feature vectors are fed into the SVM verification procedure and can distinguish traffic signs with similar shapes. The SVM classifies the detected windows into three classes, i.e., circle, triangle, and other. In the final recognition stage, we use a CNN to classify the diverse traffic signs and further improve the result.

Sign detection
In this section, we introduce a method for detecting traffic signs in real-time with the illumination-invariant Byte-MCT feature. Illumination invariance is important for traffic sign detection for two reasons: 1) images captured from a moving vehicle involve illumination changes, which make detection more challenging, and 2) weather and time-of-day shifts also change the illumination.

Byte-MCT features
In a digital image, each pixel can be described using Eq (1), where x is a pixel in the image, I(x) is the intensity of pixel x, L(x) and R(x) are the luminance of the light source and the surface reflectance, respectively, and g and b are camera parameters. The intensity of a pixel is determined by the combination of luminance and reflectance. If the luminance is constant or limited in certain circumstances, then Eq (1) becomes Eq (2), where b is the constant bias; the intensity then depends only on the reflectance of the surface, regardless of the strength of the luminance. The effect of luminance could therefore be ignored if the input pixel intensities were normalized. However, measuring the exact luminance at each pixel is infeasible.
The closest approximation is the Census Transform (CT). The CT uses a fixed-size window, either 3-by-3 or 5-by-5, to construct an index of the local edge structure based on relative intensity comparisons between the center pixel and its neighbors. The most common CT variants are the Modified Census Transform (MCT) and the Local Binary Pattern (LBP). B. Froba and A. Ernst [12] presented the MCT, which relies only on the reflectance of the digital input so that it can handle images with widely varying luminance. Given a window of pixels, the MCT converts each pixel into a binary value by comparing it with the average intensity of the window: if the pixel is brighter than the average, it becomes 1; otherwise, it becomes 0. For example, for a 3-by-3 window, the MCT yields a 9-bit binary pattern. Unlike the CT, which compares each neighbor against the center pixel, the MCT compares each pixel against the average intensity of the local region. Therefore, the MCT is more robust against low illumination inputs.
Eq (3) shows the MCT operation on a pixel $x$, where $N(x)$ is the set of eight neighboring pixels of $x$, $I_{avg}(x)$ is the average intensity of $N(x)$, and $I(\cdot)$ is an indicator function that yields 1 if the statement is true and 0 otherwise. The operator $\bigotimes_{y \in N}$ denotes the concatenation of the neighboring pixels' transformation results. As a result, MCT can describe up to 511 ($= 2^9 - 1$) local edge structures in a 3-by-3 window. Since MCT compares each pixel to the average intensity of the local region, global illumination has no impact and only reflectance is considered; thus, it is robust against luminance changes.
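The display equations referenced in this subsection are not reproduced in this text. Assuming the standard image formation model that matches the definitions given above, Eq (1) to Eq (3) would plausibly take the following form. The symbol $\Gamma$ for the transform output, the constant $L$ in Eq (2), and the bracket notation for the indicator are naming assumptions, not the authors' confirmed notation.

```latex
\begin{align}
I(x) &= g\,L(x)\,R(x) + b \tag{1}\\
I(x) &= g\,L\,R(x) + b \tag{2}\\
\Gamma(x) &= \bigotimes_{y \in N(x)} \mathbb{1}\!\left[\, I(y) > I_{avg}(x) \,\right] \tag{3}
\end{align}
```

For the 9-bit pattern described above, the concatenation in Eq (3) would have to run over all nine pixels of the 3-by-3 window, i.e., over $N(x)$ together with $x$ itself.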
In this paper, we further modify the MCT method to present Byte-MCT (8-bit MCT). Byte-MCT excludes the central pixel from the concatenation, leaving only the eight neighboring pixels. Fig 2 shows an example of the Byte-MCT procedure. This transform has two advantages over the 9-bit MCT [12]. The first is computational efficiency: while preserving most of the feature descriptors of the 9-bit MCT, such as straight lines, diagonal lines, and corners, it reduces the feature vector size to 255 dimensions, which allows faster learning. The second is robustness to noise: the proposed Byte-MCT only considers the difference between the eight neighbors in a 3-by-3 region and their average. Since the center pixel does not affect the feature descriptor, unlike in the 9-bit MCT, the transform is robust to noise in the center pixel. The effectiveness of Byte-MCT is discussed in the experiment sections.
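The Byte-MCT description above can be illustrated with a minimal Python sketch. It assumes a grayscale image stored as a list of rows, leaves border pixels at code 0, and uses an arbitrary bit order for the eight neighbors; the function name `byte_mct` and the bit order are illustrative assumptions, since the text does not specify them.

```python
def byte_mct(image):
    """Return an 8-bit Byte-MCT code for every interior pixel of a 2-D intensity list."""
    # Neighbor offsets in a fixed order (assumed bit order; the paper does not give one).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    height, width = len(image), len(image[0])
    codes = [[0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            neighbors = [image[r + dr][c + dc] for dr, dc in offsets]
            mean = sum(neighbors) / 8.0  # average of the eight neighbors only
            code = 0
            for value in neighbors:
                # One bit per neighbor: 1 if brighter than the neighbor average.
                code = (code << 1) | (1 if value > mean else 0)
            codes[r][c] = code
    return codes


# A dark region above a bright edge; the center value plays no role.
patch = [[10, 10, 10],
         [10, 99, 10],
         [200, 200, 200]]
print(byte_mct(patch)[1][1])
```

In this sketch, applying a gain and bias to every pixel (for example `2 * v + 30`) or changing the center value leaves the code unchanged, which is the behavior the two stated advantages rely on.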

Landmark-based AdaBoost training
In this section, we describe a method of identifying traffic sign landmark points using Byte-MCT. Y. Freund and R. Schapire [20] introduced a boosting algorithm called AdaBoost. The AdaBoost algorithm constructs multiple weak classifiers and assembles them into a single strong classifier for the overall task. AdaBoost's main strengths are its high prediction power and its suitability for cascade architectures that improve computational efficiency; thus, it is commonly used in object detection. B. Froba et al. trained AdaBoost to detect face regions using MCT. Their method first generates as many weak classifiers as there are pixels and then assembles a strong classifier, which has four stages. The method assigns high weights to good weak classifiers and low weights to classifiers that cause a large error. Because the method generates as many weak classifiers as input pixels, an image that is similar to the training data will be well classified in similar regions. However, if the input data have wide variations or abrupt changes in background, then the training time increases and the classifier accuracy decreases. To handle this limitation, we modify the AdaBoost method and present landmark-based AdaBoost. Table 1 shows the algorithm of the landmark-based AdaBoost classifier. $D_1(i)$ is the initial weight of the first pixel classifier; $i$ indicates sign or non-sign and $D$ is the boosting weight. After assigning the initial boosting weights to construct the strong classifier, we train the assigned number of pixel classifiers. During pixel classifier training, each trained class generates a Byte-MCT-based look-up table and calculates the pixel classifier weight $\alpha$ based on the error $\varepsilon$. In each learning iteration, the weak (pixel) classifiers are trained and assembled into a strong classifier.
The learning iteration ends when the overall sum of the weak classifier errors is less than $\theta_t$ and $\alpha$ falls below $\theta_\alpha$, because $\alpha$ represents the importance of the weak classifier.
Table 1. AdaBoost learning algorithm based on landmarks.
B. Froba et al.'s method trains the pixel classifiers in four stages. At each stage, the classifier is trained until the error is less than the threshold value. The drawback of this method is that it trains all of the pixel classifiers until it reaches the end condition, meaning that the learning time sharply increases as the input resolution increases. To overcome this disadvantage, the modified AdaBoost method trains weak classifiers only on the landmark pixels. The landmarks are selected by the $\alpha$-score from Table 1. The T landmark pixel classifiers are weak classifiers that consider the most distinguishable pixels, and they are combined into the strong classifier. Unlike conventional ensemble classifiers, which use multiple stages (for time efficiency) instead of full training, the proposed method has only one strong classifier with a small number of landmark points. Because of this structure, the proposed method performs far fewer operations than the original AdaBoost. Another benefit of the landmark-based architecture is that it can run on a GPU with parallel-window searching. The GPU's parallel system divides one task into multiple small independent tasks that can be processed by thousands of GPU core blocks. Therefore, for non-sequential tasks such as window searching, this method can take advantage of the parallel system. Details are presented in the next section.
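Since Table 1 is not reproduced in this text, the following Python sketch only illustrates the general idea with a textbook discrete AdaBoost loop, not the authors' exact algorithm. Each weak classifier is a look-up table on the Byte-MCT code at one pixel, and each round keeps the single pixel with the lowest weighted error as a landmark. The names `train_landmark_adaboost`, `predict`, and `min_alpha` (standing in for the threshold on $\alpha$) are assumptions.

```python
import math


def train_landmark_adaboost(samples, labels, num_landmarks, min_alpha=1e-3):
    """samples: per-window lists of codes; labels: +1 (sign) or -1 (non-sign)."""
    n, num_pixels = len(samples), len(samples[0])
    weights = [1.0 / n] * n          # uniform initial boosting weights
    landmarks, used = [], set()      # landmarks hold (pixel index, table, alpha)
    for _ in range(num_landmarks):
        best = None
        for p in range(num_pixels):
            if p in used:
                continue
            # Weighted vote per code value builds this pixel's look-up table.
            votes = {}
            for x, y, w in zip(samples, labels, weights):
                votes[x[p]] = votes.get(x[p], 0.0) + w * y
            table = {code: (1 if v >= 0 else -1) for code, v in votes.items()}
            error = sum(w for x, y, w in zip(samples, labels, weights)
                        if table[x[p]] != y)
            if best is None or error < best[0]:
                best = (error, p, table)
        if best is None:
            break
        error, p, table = best
        error = min(max(error, 1e-10), 1 - 1e-10)   # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)
        if alpha < min_alpha:        # classifier is no better than chance
            break
        landmarks.append((p, table, alpha))
        used.add(p)
        # Misclassified samples gain weight; then the weights are normalized.
        weights = [w * math.exp(-alpha * y * table[x[p]])
                   for x, y, w in zip(samples, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return landmarks


def predict(landmarks, window):
    """Strong classifier: alpha-weighted vote; unseen codes count as non-sign."""
    score = sum(alpha * table.get(window[p], -1) for p, table, alpha in landmarks)
    return 1 if score >= 0 else -1
```

Because the strong classifier only reads the selected landmark pixels, evaluating one window costs a handful of table look-ups, which is what makes the per-window work small enough to assign to independent GPU threads.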

GPU-based real-time detection
The two major approaches for object detection are sliding-window and object-proposal. Object-proposal is based on edge shapes and usually extracts up to 1,000 candidate regions per image, regardless of the shape of the object. Because traffic signs have distinct shapes, object-proposal is not a suitable method. For this reason, TSR methods commonly use holistic searching in a sliding-window fashion. Traditional sliding-window searching scans the entire image with a fixed window in raster order; thus, running time increases as the input resolution increases. To handle this challenge, conventional methods construct a cascaded ensemble classifier using AdaBoost. The cascade improves running-time efficiency by skipping non-object areas. However, it requires sequential processing and cannot be run on a GPU. Therefore, object detection on a Full-HD image in real time is difficult.
Advances in modern hardware have transformed CPUs from single-core sequential processors into multi-core, multi-thread parallel processors. This parallel processing paradigm can also be applied to GPGPUs, which have thousands of cores in a single unit. In this paper, we propose a Byte-MCT and landmark-based AdaBoost classifier using a parallel window-searching method on a GPGPU, as shown in Fig 4. Unlike the sliding-window method, which scans an image sequentially, parallel window-searching divides the input image into several blocks, and each GPU core classifies one block simultaneously. Although the GPU parallel system shares its concept with the CPU parallel system, GPU-based parallel window-searching faces challenges when accessing arbitrary positions in memory. We use a kernel function to address this challenge. Table 2 describes the kernel function of the parallel window-searching method on a GPGPU. On the GPGPU, the kernel function does not have access to the host's memory and cannot return a value. Therefore, we generate an object map. The object map keeps the search window size and the exact location identified by the current kernel function. Unlike sequential searching, parallel searching implemented as a kernel function generates blocks based on the number of GPU cores and the H/W pipeline. Because each block runs the kernel function simultaneously, this method allows real-time processing. The introduced parallel window searching method takes 2.4ms on a Full-HD (1920 × 1080) image with a 40-by-40 search window, whereas the same input image takes 352.26ms using a CPU-based sliding-window search. In other words, the GPGPU version is more than 140 times faster than the CPU version; details are provided in section 5.
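The object map idea can be sketched on the CPU in Python, under the assumption that every window is evaluated independently and writes its result into its own slot instead of returning it. The names `window_kernel` and `parallel_window_search`, the `classify` callback, and the use of a thread pool are illustrative assumptions; Python threads only mimic the independence of GPU threads and do not reproduce GPU speed.

```python
from concurrent.futures import ThreadPoolExecutor


def window_kernel(index, cols, codes, window, stride, classify, object_map):
    """Stand-in for one GPU thread: classify one window and record the result."""
    top, left = (index // cols) * stride, (index % cols) * stride
    patch = [row[left:left + window] for row in codes[top:top + window]]
    # A kernel cannot return a value, so it writes into its own object-map slot.
    object_map[index] = (top, left, window) if classify(patch) else None


def parallel_window_search(codes, window, stride, classify):
    """Evaluate all window positions independently and collect the hits."""
    rows = max(0, (len(codes) - window) // stride + 1)
    cols = max(0, (len(codes[0]) - window) // stride + 1)
    object_map = [None] * (rows * cols)   # one slot per window position
    with ThreadPoolExecutor() as pool:
        jobs = pool.map(
            lambda i: window_kernel(i, cols, codes, window, stride,
                                    classify, object_map),
            range(rows * cols),
        )
        list(jobs)  # drain the iterator so every window is evaluated
    return [hit for hit in object_map if hit is not None]


# Toy transformed image with a 2-by-2 "sign" of ones at row 2, column 2.
grid = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        grid[r][c] = 1
hits = parallel_window_search(grid, 2, 1, lambda p: all(v == 1 for row in p for v in row))
print(hits)
```

Because no window depends on the result of another, the order in which slots are filled does not matter, which is the property that a cascaded sliding-window search lacks.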

Sign recognition
In the previous section, we described the method for extracting candidate sign regions. In this section, we discuss SVM-based verification and CNN-based recognition. Deep learning networks have recently emerged as one of the main backbones of the self-driving industry [21,22]. However, we first apply a verification step to reduce the number of candidate regions, since the SVM optimizes better than the CNN when the number of inputs is limited or fairly small; this is tested in the experiments section. The recognition step then extracts fine-grained information from the verified input and decides the final result.

Sign verification using SVM
To verify candidate regions extracted by AdaBoost, we adopt Byte-MCT features as the input of the SVM. The SVM determines the decision boundary that has the maximum margin from the support vectors, i.e., the training samples closest to the boundary. It verifies the candidate regions with better generalization ability than the CNN. Fig 5 shows feature extraction for region verification. In a candidate region, each pixel value represents the 3-by-3 Byte-MCT local edge structure that is used in verification. First, the region is divided into four overlapping regions as in Fig 5. The Byte-MCT histograms of these regions are stacked into a 1,024-dimensional feature vector. Eq (4) is the equation for the feature extraction, where $r$ indexes the divided regions and $\max(H_r)$ is the maximum histogram value in the divided region. This equation normalizes each divided region using its own maximum value to generate a 256-dimensional histogram. By concatenating the four 256-dimensional histograms, the method generates a 1,024-dimensional feature vector. Because this feature extraction process uses Byte-MCT input, the extracted feature is also robust against illumination. Finally, the SVM evaluates the extracted feature vector using the Radial Basis Function (RBF) kernel to classify the input into three classes: circle, triangle, and other. The circle class indicates a prohibition sign, which has a red rim. The triangle class is a warning sign. The other (negative) class covers any input region that is not a traffic sign, including background images with shapes similar to traffic signs. In the real world, traffic signs of the same shape have different meanings depending on the information inside the sign; for example, a 40 mile speed limit and a 60 mile speed limit have the same shape but impose different restrictions. Additionally, the background can contain shapes similar to traffic signs that have no meaning in terms of traffic rules.
This situation requires an additional fine-grained recognition step to reduce the number of false positives.
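The feature extraction for verification can be sketched in Python as follows. Eq (4) and the exact region layout of Fig 5 are not reproduced in this text, so the sketch assumes a square candidate window and four corner-anchored regions whose size is controlled by a hypothetical `overlap` parameter; `rbf_kernel` is the standard RBF kernel with an assumed `gamma`, not a tuned value from the paper.

```python
import math


def region_histogram(codes, top, left, height, width):
    """256-bin histogram of Byte-MCT codes, normalized by its own maximum bin."""
    hist = [0] * 256
    for row in codes[top:top + height]:
        for code in row[left:left + width]:
            hist[code] += 1
    peak = max(hist)
    return [count / peak for count in hist] if peak else [0.0] * 256


def verification_feature(codes, overlap=4):
    """Concatenate four normalized 256-bin histograms into 1,024 dimensions."""
    size = len(codes)              # assumes a square window, e.g. 40-by-40
    span = size // 2 + overlap     # assumed region size; Fig 5 defines the real one
    corners = [(0, 0), (0, size - span), (size - span, 0), (size - span, size - span)]
    feature = []
    for top, left in corners:
        feature.extend(region_histogram(codes, top, left, span, span))
    return feature


def rbf_kernel(u, v, gamma):
    """Standard RBF kernel exp(-gamma * ||u - v||^2); gamma is an assumed value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


window = [[(r * 40 + c) % 256 for c in range(40)] for r in range(40)]
feature = verification_feature(window)
print(len(feature))
```

Normalizing each region by its own maximum keeps every entry in the range 0 to 1, so regions with different pixel counts contribute on a comparable scale before the kernel is applied.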

Structured CNN-based sign recognition
From an SVM-verified image, such as a circle or a triangle, we use a CNN to recognize signs. In the German Traffic Sign Recognition Benchmark (GTSRB) [3], the CNN-based approach achieves state-of-the-art performance on diverse traffic signs with only one CNN. The GTSRB contest has already produced methods for generating cropped traffic sign regions. However, the goal of this paper is to recognize signs in real-world driving images, which include diverse shapes and types of traffic signs. From previously SVM-verified regions, the CNN classifies the information as shown in Fig 6. To reduce the number of false positive images (i.e., inputs that pass SVM verification but ought to fail), the presented CNN model includes one more class, i.e., other. D. Ciresan et al. [15], who presented the highest accuracy model in the GTSRB, built a CNN model with three convolution layers and a pooling layer for traffic sign recognition that is robust against noise and occlusion. The presented CNN model takes a 40-by-40 input image and generates 250-dimensional feature vectors from two pairs of convolution and pooling layers. This feature vector passes through a fully connected neural network with a final softmax layer that classifies the 24 traffic sign types. Unlike D. Ciresan's model [15], the proposed model has only two pairs of layers, each consisting of a convolutional layer and a pooling layer. A shallower architecture learns detailed features rather than high-level features, since the verification step reduces the number of inputs for each category.

Datasets
To evaluate our proposed method, we use two different datasets: the LISA US traffic sign dataset [18] and a real-world driving dataset that records the driving scene directly from a car-mounted camera. The LISA dataset contains driving scenes from the Southern California region, including US traffic signs such as speed limits, warnings, no-turns, and stops. The LISA dataset is based on different rules than the Vienna traffic rules. To test both scenarios, we collected driving scenes in two locations that apply the Vienna traffic rules: Korea and Germany. Table 3 shows the specifications of the KR-D, KR-N, and DE-D datasets. All the frames in each dataset have HD (1280 × 672) resolution, and at least half of the frames contain traffic signs. Fig 8(B) shows daytime driving scenes under diverse weather conditions from both a German metropolitan area and the countryside. In order to test the model's robustness against illumination changes, the KR-N dataset contains a large number of frames with heavy illumination changes in downtown scenes.

Detection and recognition performance
To evaluate the presented model, we compare its performance to that of the Aggregated Channel Feature (ACF) method [18], which is ranked high on the LISA SPEED-LIMIT dataset. The ACF method's performance on the recorded datasets under the Vienna traffic rules is shown in Table 4. This method only detects the candidate region; thus, the recognition part is missing. Table 5 shows both detection and recognition results for the presented model on the same datasets. For all datasets considered, the presented method is superior to the ACF method in both detection and recognition. In the tables, TP indicates a true positive, i.e., a correctly recognized object. FP indicates a false positive, i.e., either detection of the wrong object or misrecognition with correct localization. FN indicates a false negative, i.e., a sign that the method was unable to detect. As shown in Table 4, the ACF method achieves 81.9% and 84.3% accuracy on the daylight dataset and the LISA speed-limit dataset, respectively. On the nighttime dataset, ACF achieves only 15.7% with 0.09 recall. Because the ACF method is based on traffic sign shapes, detection is hindered by nighttime or low illumination environments. Table 5 shows the accuracy achieved by the proposed method: 97.8% and 89.5% on the daylight sets and the LISA speed-limit dataset, respectively. Even on the nighttime dataset, the presented method achieves 89.5% accuracy. Unlike ACF, which is based on traffic sign shapes, the presented method uses landmark detection for improved detection performance and Byte-MCT for stable performance, even in low-illumination situations. Table 6 shows the accuracy achieved using a single CNN. Considering Tables 5 and 6, the proposed method achieves more stable performance, as described in the sign recognition section.
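For reference, the standard definitions of precision, recall, and F1 score in terms of the TP, FP, and FN counts defined above can be written as a short Python helper. The function name `detection_metrics` and the counts in the usage line are illustrative assumptions and are not taken from Tables 4 to 6.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 90 correct detections, 10 false alarms, 30 misses.
print(detection_metrics(90, 10, 30))
```

A low recall, such as the 0.09 reported for ACF at night, means that most signs are missed regardless of how precise the few detections are.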
Fig 9(A) shows the correctly predicted cases, from small far-away traffic signs to large nearby traffic signs, regardless of the illumination difference. Fig 9(B) shows the false detection cases, in which the traffic sign is significantly rotated, the traffic sign is too small, or the background is similar to the traffic sign. Fig 10 shows the results of the presented method on the German daylight dataset (DE-D). As Fig 10(A) shows, the presented method has stable performance for correct detection and recognition of traffic signs under heavy rain and in low illumination scenes. Fig 10(B) shows the failure cases, in which the method could not detect or misinterpreted traffic signs. On the DE-D dataset, the false recognition rate is higher than on the KR-N dataset. We expect that this is due to camera blur caused by higher driving speed: the Autobahn has a much higher speed limit, and faster driving can cause turbulence around the camera, which leads to further blur of the image. Fig 11 shows results on the Korea daylight dataset (KR-D). Fig 11(A) shows the correctly predicted results in various situations. The proposed method detects and recognizes traffic signs under heavy rain, whether they are small or close to the car. The presented method has a 0.9855 F1 score on the KR-D dataset, which means that most of the objects were detected and recognized. Fig 11(B) shows the failure cases. The KR-D dataset includes more metropolitan driving frames than DE-D. As the example shows, the major source of false positives is watery bumpers. Watery bumpers, which prevent or minimize vehicle accidents, are usually located on the median strip or close to bus stops. These bumpers have a color scheme and shape similar to triangular warning signs, which causes false detections. Fig 12 shows the Korea nighttime dataset (KR-N) results.
Fig 12(A) shows the correctly predicted cases, which demonstrate that the proposed method works under illumination changes due to low illumination, neon signs, and traffic lights. This scenario usually results in failure cases for shape-based and color-based approaches. In the last example of Fig 12(A), the traffic sign is located between traffic lights with red lights illuminated. This example is an extreme failure case for the shape-based approach, because the shape is difficult to identify; however, the proposed method is able to find the correct location. Fig 12(B) shows the failed cases. In nighttime scenes, depending on camera settings, noise can easily be added to the image. Therefore, nighttime usually causes more false detections. Using SVM as a verification method, the presented method significantly reduces the number of false detections. However, some false detections remain after the verification step. These mostly arise from triangular shapes and can be easily fixed by post-processing, since they do not last more than one frame.

Comparison of detection performance
To evaluate the effectiveness of the proposed detection method, we conduct experiments with two factors: MCT features and AdaBoost. Since the focus of these experiments is comparing detection performance between features, cascaded architectures are excluded, so that maximum performance is bounded by non-cascaded models. In Table 7, 9F is the reference method [12], 9-bit MCT with full-pixel AdaBoost; 9L is 9-bit MCT with landmark-based AdaBoost; 8F is Byte-MCT with full-pixel AdaBoost; and 8L is the proposed method, Byte-MCT with landmark-based AdaBoost. Among the experiments, the presented method shows the highest performance. With full-pixel AdaBoost, 9-bit MCT shows superior precision compared to Byte-MCT regardless of the dataset. However, the performance of a feature is determined by recall, since detection performance is strongly influenced by finding true positives. Unlike precision, recall is always higher with the 8-bit MCT feature regardless of the dataset. DE-D contains partially occluded traffic signs, and the results show that Byte-MCT is more robust against noise and partial occlusion. With landmark-based AdaBoost, the comparison between 9-bit MCT and Byte-MCT shows a different trend: both precision and recall are higher with the Byte-MCT feature. The margin between 9-bit MCT and Byte-MCT is even larger for noisy input, especially under low illumination, i.e., KR-N.
Comparing full-pixel AdaBoost with landmark-based AdaBoost, the landmark-based classifier always produces better performance with shorter training time. Although full-pixel AdaBoost shows stable performance on the KR-D dataset, it is largely affected by the lower illumination in KR-N. Applying the landmark feature significantly enhanced performance on the partially occluded dataset, DE-D. This result suggests possible over-fitting of the pixel classifiers in full-pixel AdaBoost. Table 7 shows that Byte-MCT generates superior results compared to 9-bit MCT. Also, landmark-based AdaBoost performs remarkably better against partial occlusion and noise, where full-pixel AdaBoost failed. These results lead us to believe that the proposed methods, Byte-MCT and landmark-based AdaBoost, are the best combination for detection: performance is stable regardless of driving scene or illumination variance, and training time is reduced.

Comparison of processing time performance
The proposed traffic sign detection and recognition method is specifically designed for parallel-window searching on a GPGPU, unlike traditional sliding-window searching. General computation normally runs on CPUs, whereas a GPGPU provides a parallel system, originally optimized for graphics and display processing, with thousands of processing units and pipelines. To evaluate the running time of the proposed method, we compare CPU-based operations and GPU-based operations. The input image resolution is Full-HD (1920 × 1080), which is resized to generate an image pyramid of six down-sizing steps for CPU/GPGPU processing. For the detection stage, we use detection windows of four different sizes and experiment with both single and multiple windows on each image pyramid. Table 8 shows the CPU-based and GPU-based Byte-MCT transform and detection running times. The window size is the same as the window size used for AdaBoost. The last row of Table 8 shows the total processing time with all of the window sizes: four detection windows and six image pyramids. CPU-based processing takes 4,263.3ms for Byte-MCT transformation and detection, whereas GPU-based processing takes only 29.6ms, which is approximately 144 times faster than CPU-based processing. As Table 8 shows, for each task, GPGPU processing is 80 times faster than CPU processing on average.
In the recognition stage, the running time is proportional to the number of regions verified by the SVM, which varies across input images. Due to this limitation, we can only compare the results based on average time consumption. The SVM and CNN run on the CPU and GPU, respectively. The overall recognition takes 5ms on average, of which SVM verification takes 2ms and CNN classification takes 3ms.
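The reported timings can be checked with a few lines of Python. The speedup is the ratio of the CPU time to the GPU time; the frame-rate estimate additionally assumes that detection and recognition run back to back on each frame with no overlap, which the text does not state explicitly. The helper names are illustrative.

```python
def speedup(cpu_ms, gpu_ms):
    """Ratio of CPU running time to GPU running time."""
    return cpu_ms / gpu_ms


def frames_per_second(*stage_ms):
    """Frame rate if the given stages run sequentially for every frame."""
    return 1000.0 / sum(stage_ms)


print(round(speedup(4263.3, 29.6), 1))         # full pyramid, all window sizes
print(round(speedup(352.26, 2.4), 1))          # single 40-by-40 window search
print(round(frames_per_second(29.6, 5.0), 1))  # detection plus average recognition
```

Under that sequential assumption, the total of about 34.6ms per Full-HD frame corresponds to roughly 29 frames per second.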

Conclusion
In this paper, we presented a GPGPU-based real-time traffic sign detection and recognition method for high resolution image inputs. This method introduces the Byte-MCT for stable candidate region detection regardless of illumination changes. For real-time high-resolution image processing, we used the target object's landmark-point-based AdaBoost for traffic sign detection, and applied a parallel-window searching algorithm for the GPGPU. For high-performance recognition, we used a hierarchical classifier structure that combines SVM for verification and CNN for final recognition. Our proposed method makes the following three primary contributions:
1. Illumination-robust Byte-MCT and landmark-based traffic sign region detection
2. Parallel-window searching for traffic sign detection based on GPGPU
3. Hierarchical classifier structure using SVM and CNN
The presented method proposes possible solutions to two traditional challenges in traffic sign recognition.
1. Using illumination-robust features, the proposed method significantly improved the performance of detection and recognition under low illumination environments, i.e., nighttime.
2. The method was able to perform real-time processing using a GPGPU on full-HD resolution input.
The accuracy of the proposed method is 89.5% on the VIVA test-set [18], which is higher than the ACF result [17] of 84.3%. Furthermore, on datasets using Vienna Convention traffic rules, the proposed method achieves 97.9% accuracy on average, which is 26.8 percentage points higher than the ACF result [17] of 71.1% on average. With a GPGPU device for smart vehicles, such as DRIVE PX [4], the proposed method can be implemented in the vehicle industry in the near future.