Novel CE-CBCE feature extraction method for object classification using a low-density LiDAR point cloud

Low-end LiDAR sensors provide an alternative for depth measurement and object recognition on lightweight devices. However, due to low computing capacity, complicated algorithms cannot be executed on such devices, and the sparse information further limits the features available for extraction. A classification method is therefore required which accepts sparse input while providing enough leverage for the classification process to accurately differentiate objects within the limited computing capability. To achieve reliable feature extraction from a sparse LiDAR point cloud, this paper proposes a novel Clustered Extraction and Centroid Based Clustered Extraction (CE-CBCE) feature extraction method followed by a convolutional neural network (CNN) object classifier. The integration of the CE-CBCE and CNN methods enables us to utilize lightweight actuated LiDAR input and provides a low-computation means of classification while maintaining accurate detection. Based on genuine LiDAR data, the proposed method achieves a reliable accuracy of 97%.


Introduction
A LiDAR sensor provides a solution for mobile applications where the system needs to be compact, lightweight and handy [1,2]. It has a 360° field of view [3], possesses high accuracy of distance measurement and, in contrast to a camera, does not depend on the light intensity of the surroundings [4,5]. The LiDAR sensor is robust to illumination variation [6] and can be used to obtain the transformation matrix between the 2D coordinate system and the 3D model of the scene [7]. Its detection range is also comparatively more accurate and more reliable than that of stereo methods [8,9].
In some applications where portability and mobility are of prime importance, a single sensor which acts as the detection system is required [10,11], for example on mobile platforms. A few researchers chose bottom-up or top-down approaches to extract information and distinguish objects; however, they do not consider the variation of point densities, which causes significant changes in accuracy [33]. This is because point density changes with distance from the LiDAR [34]. In very sparse LiDAR data especially, recognition performance decreases drastically as the distance between a human and the LiDAR increases, since the number of points is inversely proportional to the square of that distance [35].
An object detection network which views the data in the form of a matrix is proposed by [36], with continuous properties from channel views as its extracted features and a 2D convolutional network. However, the approach requires LiDARs with higher density to feed the scanning channels for extraction. Given the shortage of object detection methods for sparse LiDAR point clouds, there is a need to develop a method which performs well with such limited input. Existing works mainly provide solutions for high-density point clouds, often at heavy computational cost. Alternative methods which work on low-density data are commonly limited to binary classification and are inadequate for multi-class detection. Table 1 summarizes the main feature extraction methods used for comparison.

Contributions
It is vital to develop a classification method which can work with a sparse 3D point cloud while providing enough leverage for the classification process to accurately recognize objects within a limited computing capacity. Thus, this paper proposes an object recognition system with multiple feature extraction based on segregated clusters from LiDAR point clouds. To extract geometry features, we rasterize each object point cloud into a local voxel slice model based on its centroid. The proposed method introduces local abscissa, ordinate and applicate (z-axis) voxels to reduce the cost of computing global voxels in the spatial domain, while the fixed-size segregation removes the uncertainties of varying densities, rigid transformations and unsymmetrical point cloud structures associated with global voxels. This research further introduces a novel feature extraction technique which considers a collection of features, viz., the density to centroid height ratio and the density to volume ratio. These features capture the point cloud disparity and achieve a higher detection rate than other state-of-the-art feature extraction methods. The proposed feature extraction helps to overcome inconsistent point cloud detection due to the single point of view in scanning, allowing accurate object recognition from a single actuated LiDAR sensor.
Employing machine learning for object classifiers has been a major interest of researchers as a means to train extracted features, including for LiDAR point cloud classification [39,40]. Bobkov et al. [41] implement a convolutional neural network (CNN) with 5 filters and pooling for layer extraction, whereas Tian et al. [30] implement multiple object features with annotated labels incorporated with an initialized neural network. Considering the success of machine learning algorithms in various areas including feature-based object classification [42], this research further optimizes the features extracted from the proposed method by training them with selected machine learning optimizers. The algorithms selected are the k-nearest neighbor (k-NN), decision tree (DT) and convolutional neural network (CNN).
For the class of object detection, static object classification has been shown with target-level and with low-level data [43]. In this paper, we focus on methods addressing moving road users in conventional streets. We selected three important object classes for detection, namely pedestrians, motorcyclists and cars. These three objects are typical of on-road scenes [44,45], and their detection provides critical information for security and surveillance, law enforcement monitoring, and search and rescue teams [46,47].
To the best knowledge of the authors, no existing work in the field of machine vision explicitly exploits information from a single-unit, sparse LiDAR sensor for 3D scanning to achieve object recognition with high accuracy rates. We prove that object recognition can be obtained using a single-unit, single-stripe, actuated LiDAR with low computing requirements via the fusion of the Clustered Extraction and Centroid Based Clustered Extraction (CE-CBCE) methods, accomplishing high-reliability object recognition from sparse LiDAR point cloud data. To summarize, our main contributions are:
• A state-of-the-art combination of the Clustered Extraction (CE) and Centroid Based Clustered Extraction (CBCE) methods, which includes features extracted from the abscissa, ordinate and applicate voxels, a novel density to centroid height interval ratio and a density to volume ratio. These features of sparse LiDAR point cloud data allow accurate classification from a single detection sensor.
• Result analysis and comparison of the CE-CBCE method trained using the k-NN, DT and CNN classification methods. The CE-CBCE method optimized with the CNN classifier recorded the best accuracy, with excellent and consistent scores in terms of recall, precision and F1-score. The results show that the proposed method outperforms other state-of-the-art feature extraction methods.
• Genuine 3D LiDAR point cloud data taken from a custom-built mobile robot whose detection system consists of a single LiDAR sensor. The data consist of 1200 scans of 3 main object classes with 4 pose orientation headings. The data have been made public and can be accessed accordingly [48].
The rest of the paper is organized as follows. Section II introduces the technical issues and proposed methods for gathering and processing data from the LiDAR sensor. Here, the description of each step is explained in detail. Section III presents the results and analysis of the output from our experiment. Finally, conclusions and future recommendations are discussed in Section IV.

Proposed method
Primarily, this research proposes a novel clustering-based feature extraction technique to exploit discriminative features from scarce LiDAR point cloud data. The process starts with background filtering and clustering of the raw point cloud data; the proposed method then extracts features from the clustered object point clouds. The method is divided into two parts, namely the Clustered Extraction (CE) method and the Centroid Based Clustered Extraction (CBCE) method. It requires less computational power than using global voxels within the spatial domain. The final stored elements are taken from sparsely distributed values in the LiDAR point cloud and are then used to train the selected classification methods (k-NN, DT and CNN) for object classification. Written consent for this research, which involves the detection of human subjects, was approved by the IIUM Research Ethics Committee (IREC) with ID No: IREC 2017-066.
The following section explains the details of each procedure step by step. The sequence starts with data collection, filtering, clustering and finally object classification.

Data collection
For genuine data collection purposes, we constructed a mobile robot with a LiDAR-based sensor. The hardware components include a Garmin LiDAR Lite v3, an Arduino Uno, an FS5109 servo motor and an L298N motor driver. Wireless communication is handled by an XBee module, which comes readily with TX and RX communication modules allowing wireless data transmission [49], and a Li-Po external battery provides mobility to the system. The scanning angle is fixed at 130°, resembling a human's point of view [50]. The mobile robot scanning can be seen in Fig 1. Over 1200 scans were collected, containing 400 objects for each of the three cluster categories, viz., human, motorcyclist and car. These three classes of objects are the most commonly found in on-road scenes. For comparison purposes, we selected the same number of samples for each class and orientation. The scenes were recorded in indoor and outdoor environments, during day and night, to better reflect the real-world environment. The position of the detected object varies in distance up to 40 meters from the mobile robot; this is the effective distance of the proposed method and the detection range of the LiDAR sensor. The clustered point clouds are classified into the 3 categories mentioned above.
However, even for the same object, different poses (rotations as well as translations) with respect to the LiDAR can result in different coordinates (e.g., varying x/y/z minima and maxima, object centroid and volume size). Therefore, we took samples at various distances from 1 m to 40 m, with the pose orientation of the object varying among its front, right, left and back sides. Fig 2 shows the total number of samples across the 3 object classes and their orientations facing the mobile robot.

Filtering
From the raw point cloud data collected with the mobile robot, unnecessary noise is removed from the scene. A threshold value of 500 on the z coordinate is fixed, corresponding to approximately 5 meters above the ground. Points above this threshold are considered non-disturbances: they do not pose an obstacle to the mobile robot's movement and do not represent any of the targeted object classes. Therefore, all points which surpass the threshold value are subtracted from the point cloud before it enters the clustering process.
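As a minimal sketch of this filtering step (assuming the scan is held as an N×3 NumPy array of x, y, z values; the array layout and function name are illustrative, not the authors' code):

```python
import numpy as np

def filter_point_cloud(points: np.ndarray, z_threshold: float = 500.0) -> np.ndarray:
    """Drop points above the z threshold (~5 m above the ground).

    points: (N, 3) array of x, y, z LiDAR coordinates. The threshold
    value of 500 follows the paper; everything else is an assumption.
    """
    return points[points[:, 2] <= z_threshold]
```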

Clustering
Following filtering, the remaining point cloud is clustered with the k-means algorithm. Fig 3 shows the raw, filtered and clustered data for all subjects of recognition.
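A minimal sketch of this clustering step, using scikit-learn's k-means implementation (the paper does not state how the number of clusters per scan is chosen, so the value below is purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_point_cloud(points: np.ndarray, n_clusters: int = 3) -> list:
    """Partition the filtered point cloud into candidate object clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(points)
    return [points[labels == k] for k in range(n_clusters)]
```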
Up to this point, all classification methods go through the same scanning, filtering and clustering procedures. In the next step of feature extraction, the proposed technique is compared with other state-of-the-art methods. To compare the performance of each feature extraction method in isolation, we designed the process as non-end-to-end classifiers with shared pre-processing procedures (filtering and clustering) and post-processing steps (classification using DT, k-NN and CNN).

Clustered Extraction (CE) method
The LiDAR point cloud gives output in a Cartesian coordinate system with the x, y and z origin set at the position of the LiDAR on top of the mobile robot. The first part of our proposed Clustered Extraction (CE) method is the extraction of the features α, β and γ. Given a clustered LiDAR point cloud P, the CE method extracts it into three parts,

$P \rightarrow \{\alpha, \beta, \gamma\},$ (1)

where the first part, alpha α, stores the width (w), length (l) and height (h) of the object and the number of points in the cluster (N). The second part, the array beta β, stores the number of elements within the segregated intervals represented as x_dataset, y_dataset and z_dataset; the unique feature of β is the number of points derived from the abscissa, ordinate and applicate voxels. The third part, gamma γ, stores the minimum and maximum value of each axis in the cluster.
From Eq (1), we compute the centroid c of the cluster,

$c = \frac{1}{N}\sum_{i=1}^{N} p_i,$

where the $p_i$ are the points of the cluster. The centroid c acts as the origin of the local voxels for each cluster. From the centroid, voxel borders are constructed by repeated addition of a predetermined increment value; these act as the abscissa, ordinate and applicate voxels. The number of points falling in the j-th interval along the x axis is therefore

$\beta_x(j) = |\{\, p \in P : x_j \le p_x < x_j + \Delta \,\}|,$

where |·| indicates cardinality, and these counts form x_dataset. The initial value of the dataset interval is denoted as x_floor, which is the value of x_min rounded down to the increment value Δ (in this case set to 50). Finally, the end value of the dataset is x_ceiling, which is the value of x_max rounded up to the nearest hundred. The stored maxima and minima of the coordinates, together with the number of elements within each interval, serve as the input for the object classifier. The same procedure is used to acquire the ceiling and floor values of y_dataset and z_dataset.
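As a concrete sketch of the CE extraction for a single cluster (the binning conventions and helper names below are assumptions based on the definitions above, not the authors' code):

```python
import numpy as np

DELTA = 50  # interval increment from the paper

def ce_features(cluster: np.ndarray):
    """Clustered Extraction (CE): compute alpha, beta and gamma for one
    cluster, given as an (N, 3) array of x, y, z points."""
    mins, maxs = cluster.min(axis=0), cluster.max(axis=0)
    w, l, h = maxs - mins                       # width, length, height
    alpha = np.array([w, l, h, len(cluster)])   # alpha = [w, l, h, N]
    gamma = np.concatenate([mins, maxs])        # per-axis minima and maxima

    beta = []
    for axis in range(3):                       # abscissa, ordinate, applicate
        floor = np.floor(mins[axis] / DELTA) * DELTA  # x_floor: round down to DELTA
        ceiling = np.ceil(maxs[axis] / 100) * 100     # x_ceiling: round up to nearest 100
        edges = np.arange(floor, ceiling + DELTA, DELTA)
        counts, _ = np.histogram(cluster[:, axis], bins=edges)
        beta.append(counts)                     # points per voxel slice along this axis
    return alpha, beta, gamma
```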

Centroid Based Clustered Extraction (CBCE) method
From here onwards, two additional collective features are extracted from the point cloud, denoted by delta δ (for features related to the density to centroid height ratio, ρ/h) and epsilon ε (for features related to the density to volume ratio, ρ/V). First, the collective features of δ are discussed.

Initially, the height h of the object is determined and divided by the total number of parts t (set to 10 by default) to acquire the height of the i-th part, $n_{h_i}$. For each part $n_{h_i}, n_{h_{i+1}}, \ldots, n_{h_t}$, the centroid height is calculated as a reference point. The density difference between intervals, $\rho_{diff_i}$, is acquired by subtracting the previous cluster density $\rho_{c_{i-1}}$ from the current cluster density $\rho_{c_i}$,

$\rho_{diff_i} = \rho_{c_i} - \rho_{c_{i-1}}.$

From here we obtain the density to centroid height ratio, formed from the density difference over the corresponding centroid height interval.

Next, the collective features related to the density to volume ratio (denoted by ε) are defined for i = 1, 2, ..., t, with the total number of parts t set to 10 by default as shown in Eq (11), where i represents the part index. The volume difference $V_{diff_i}$ is acquired by subtracting the previous cluster volume $V_{c_{i-1}}$ from the current cluster volume $V_{c_i}$,

$V_{diff_i} = V_{c_i} - V_{c_{i-1}}.$

The density difference is acquired as shown in Eq (13); the density to volume ratio is then formed from these density and volume differences. The flow chart of the proposed CE-CBCE feature extraction method can be seen in Fig 4. A summary of the extracted features is shown in Table 2, with their dimensional counts and feature descriptions.
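A sketch of the CBCE extraction under these definitions follows. The per-part density (point count over slice volume), the bounding-box slice volume, and the use of inter-part difference ratios are our assumptions based on the description above, since the paper does not spell out those definitions:

```python
import numpy as np

def cbce_features(cluster: np.ndarray, t: int = 10):
    """Centroid Based Clustered Extraction (CBCE) sketch: slice the
    cluster into t height parts, then form the density-to-centroid-height
    and density-to-volume ratio features from inter-part differences."""
    z = cluster[:, 2]
    edges = np.linspace(z.min(), z.max(), t + 1)   # t parts of height h/t
    eps = 1e-9                                     # guard for empty/degenerate parts

    rho = np.empty(t)       # per-part density (point count / slice volume)
    vol = np.empty(t)       # per-part bounding-box slice volume
    ch = np.empty(t)        # per-part centroid height (reference point)
    for i in range(t):
        part = cluster[(z >= edges[i]) & (z <= edges[i + 1])]
        span = part.max(axis=0) - part.min(axis=0) if len(part) else np.zeros(3)
        vol[i] = max(span[0] * span[1] * (edges[i + 1] - edges[i]), eps)
        rho[i] = len(part) / vol[i]
        ch[i] = part[:, 2].mean() if len(part) else edges[i:i + 2].mean()

    rho_diff = np.diff(rho, prepend=rho[0])        # rho_c_i - rho_c_{i-1}
    vol_diff = np.diff(vol, prepend=vol[0])        # V_c_i  - V_c_{i-1}
    ch_diff = np.diff(ch, prepend=ch[0])           # centroid height interval

    delta = rho_diff / np.where(np.abs(ch_diff) < eps, eps, ch_diff)
    epsilon = rho_diff / np.where(np.abs(vol_diff) < eps, eps, vol_diff)
    return delta, epsilon
```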
From the extracted features, classification is performed with a 75%-25% split of training and testing data. Training aimed to decrease the value of the model's loss function on the training data at each step. Model performance was indicated and measured through improvements in the model's accuracy on the test dataset [51].
The accuracy of the classification is calculated as

$\text{Accuracy} = \frac{1}{k}\sum_{i=1}^{k}\frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i},$

with k representing the total number of classes and $TP_i$, $TN_i$, $FP_i$ and $FN_i$ the true positives, true negatives, false positives and false negatives of class i = 1, 2, 3, ..., k. A true positive is recorded when the model correctly predicts the positive class, and a false negative when a class is incorrectly predicted to be negative. A false positive occurs when a class is incorrectly predicted to be positive, and a true negative when the model correctly predicts the negative class. For multiclass classification such as ours [48], a true positive occurs only when the right class is correctly predicted; the false negatives, false positives and true negatives are likewise defined per class.
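As a sketch of how this per-class-averaged accuracy can be computed from a confusion matrix after the 75%-25% split (the variable names, and the use of scikit-learn utilities, are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def mean_multiclass_accuracy(y_true, y_pred) -> float:
    """Average the one-vs-rest accuracy (TP+TN)/(TP+TN+FP+FN) over all k classes."""
    cm = confusion_matrix(y_true, y_pred)
    total = cm.sum()
    accs = []
    for i in range(cm.shape[0]):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp   # predicted class i, but another class was true
        fn = cm[i, :].sum() - tp   # true class i, but predicted otherwise
        tn = total - tp - fp - fn
        accs.append((tp + tn) / total)
    return float(np.mean(accs))

# X, y = extracted features and class labels (illustrative names)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
```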

Results and discussions
After the CE and CBCE extraction, the collective features are optimized with the k-NN, DT and CNN classifiers. The proposed method is compared with three feature extraction methods: the region of interest (ROI) [32,52], the feature vector (FV) [19] and multiple feature extraction (MFE) [30].
These comparative methods were chosen because they similarly handle sparse point clouds, have low computing cost, employ geometrical features and run in real time. For ROI, the authors proposed taking the width (w), length (l), height (h), width difference (Δw) and length difference (Δl) as the extracted features for classification, so the complete geometric feature is ΔG = [w,l,h,Δw,Δl]. The second comparison method is the feature vector (FV) introduced by Wang et al. [19]. The extracted features include a 2D covariance matrix in 3 zones, a 2D histogram for the x-y plane and a 2D histogram for the y-z plane, for a total dimensional count of 175 features.
The third comparison method, proposed by Tian et al. [30], is a multiple feature extraction which includes the point count (N), point density (ρ), voxel centroid (μ), point variance (σ²), point covariance (s²), point eigenvectors (ν), point eigenvalues (γ), surface curvature (k) and divergence degree (F). The final MFE feature has a 27-dimensional count: 9 from the point eigenvectors; 3 each from the voxel centroid, point variance, point covariance, point eigenvalues and divergence degree; and a single dimension each from the point count, point density and surface curvature.
The results obtained are then compared in terms of accuracy, precision, recall and F1-score. k-fold cross-validation is performed on the classifiers to select the model parameters which best fit our data. Table 3 shows the complete hardware and software configurations for the experiments conducted.

Classification experiments
The experiments were run on an Intel(R) Core(TM) i7-5500U CPU @ 2.40 GHz with 8 GB RAM and a 64-bit operating system.
The results of our experiment are presented in Table 4. The "Parameter" column shows the varying parameter tuned to achieve the best configuration for each optimization algorithm: the number of nearest neighbors for k-NN, the maximum depth for DT and the number of hidden layers for CNN. For DT and k-NN the parameter ranges from 1 to 30. For CNN, the number of hidden layers ranges from 1 to 10, with a batch size of 10, 1000 epochs, and Rectified Linear Unit (ReLU) and Softmax activation functions. For the proposed method, the CE and CBCE methods are first implemented separately before combining both collective features to show the improved performance. Table 4 compares the results in terms of accuracy for ROI, FV, MFE, CE, CBCE and the combined CE-CBCE; the results displayed are at increments of 5 for the k-NN and DT parameters and 2 for the CNN parameter.
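A minimal sketch of such a classifier in Keras, assuming the extracted features are fed as flat vectors; the dense-layer topology, layer width and optimizer are our assumptions, since the paper specifies only the hidden-layer count range, batch size, epochs and activation functions:

```python
import tensorflow as tf

def build_classifier(n_features: int, n_hidden: int = 6, n_classes: int = 3):
    """The hidden-layer count is the searched parameter (1 to 10 in the paper)."""
    layers = [tf.keras.layers.Input(shape=(n_features,))]
    for _ in range(n_hidden):
        layers.append(tf.keras.layers.Dense(64, activation="relu"))   # ReLU hidden layers
    layers.append(tf.keras.layers.Dense(n_classes, activation="softmax"))  # Softmax output
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam",  # optimizer is an assumption, not stated in the paper
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_classifier(n_features=X_train.shape[1])
# model.fit(X_train, y_train, batch_size=10, epochs=1000)  # settings from the paper
```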
A complete comparison in terms of mean accuracy, precision, recall and F1-score can be seen in Fig 5. In general, the CE method prevails in terms of accuracy, precision, recall and F1-score across all optimization methods (k-NN, DT and CNN). After analyzing the statistics across all optimization methods, each best-performing optimization parameter is examined in particular. Specifically, the CE-CBCE method optimized by a CNN with 6 hidden layers recorded the best accuracy for object recognition and classification at 97%. This is followed by the CE and CBCE methods optimized with k-NN (k = 1) and a CNN with 8 hidden layers, respectively; both recorded an accuracy of 93%. The remaining feature extraction methods achieved 91% for MFE, 87% for ROI and 79% for FV. Fig 6 shows the best optimization results in terms of precision, recall, F1-score and accuracy for each feature extraction method.
The best parameter choice for each method can be seen in Table 5. The precision, recall and F1-score for each individual class of human, motorcyclists and cars are presented with its final accuracy.
Full results of CE-CBCE across all optimization techniques, covering all parameter values, are shown in the graphs of Fig 7.
As satisfying results were achieved for object recognition among the 3 classes, another experiment with added difficulty was conducted. The aim of classifying the 3 object classes remains the same, but this time the prediction output also includes the pose of the subject: for each detected object, the direction of its pose must be predicted as facing the front, right, back or left side towards the mobile robot. The same number of samples was provided as input, with 300 samples for each class and 100 samples for each orientation.

Table 5. Best result with its method of optimization and parameter for each class of objects. (Columns: Feature Extraction, Optimization, Human, Motorcyclist, Car.)

Table 6 shows the best results obtained with each feature extraction method; excellent results for the objects' orientation are recorded. The combined CE-CBCE method achieved an accuracy of 82%, followed by the CE, CBCE, MFE, ROI and finally FV methods. For easily perceived poses, human subjects facing right or left in particular, the CE-CBCE and CE methods both recorded a 100% detection rate. The radar chart in Fig 8 shows the accuracy of the CE-CBCE method for each object with varying pose orientation.

Table 6. Object prediction with pose detection.

Conclusions
In this research, we proposed a novel feature extraction method for sparse LiDAR point cloud object recognition. Indoor and outdoor data were collected with different backgrounds to better simulate varying surroundings. We also analysed the performance of our feature extraction method on different classes of objects in varying pose orientations. The results show a promising achievement with a sparse LiDAR point cloud. The flow of the proposed research can be seen in Fig 9: the process started from hardware development before moving to genuine data collection, preprocessing, the proposed feature extraction method, classification algorithms and finally object classification. As the proposed method targets sparse LiDAR point cloud input, its performance on high-density data remains to be explored; high-density, compact point clouds such as those from autonomous vehicles and airborne LiDAR are often associated with large-scale mapping and varying elevation scanning. A larger number of object classes could also pose a challenge, as classes tend to become less distinctive in terms of density and point cloud distribution.
For future research, position tracking can be implemented on top of the object recognition and classification. Especially for a safety-critical system where accuracy is of utmost importance, safety features should be a main concern. A fail-safe mechanism could be implemented to override the controls in case of a malfunction, along with a warning system which alerts the user when a faulty device is detected, allowing human intervention. Additional elements or subjects of detection could also be added to further test and improve the reliability of the system.