
MSCNet: Efficient and accurate semantic segmentation of LiDAR data using Multi-scale Convolution

  • Xuewen Feng,

    Roles Conceptualization, Methodology, Resources

    Affiliation School of Mechanical and Electrical Engineering, China University of Mining and Technology-Beijing, Beijing, China

  • Aiming Wang,

    Roles Conceptualization, Funding acquisition, Investigation, Visualization

    Affiliation School of Mechanical and Electrical Engineering, China University of Mining and Technology-Beijing, Beijing, China

  • Guoying Meng,

    Roles Writing – original draft

    mgy@cumtb.edu.cn

    Affiliation School of Mechanical and Electrical Engineering, China University of Mining and Technology-Beijing, Beijing, China

  • Yiyang Xu,

    Roles Formal analysis

    Affiliation Beijing China Coal Mine Engineering Co., Ltd, Beijing, China

  • Jie Yang,

    Roles Data curation, Software

    Affiliation School of Mechanical and Electrical Engineering, China University of Mining and Technology-Beijing, Beijing, China

  • Xiaohan Cheng,

    Roles Validation

    Affiliation School of Mechanical and Electrical Engineering, China University of Mining and Technology-Beijing, Beijing, China

  • Yu Feng

    Roles Validation, Writing – review & editing

    Affiliation Chinese Institute of Coal Science, Beijing, China

Abstract

In autonomous driving and intelligent robotics, the semantic information of LiDAR (Light Detection and Ranging) sensor data is crucial for understanding the surrounding environment. However, operating directly on point clouds is computationally expensive. To address this, some researchers have projected three-dimensional LiDAR data onto a two-dimensional spherical range view and used two-dimensional convolutional neural networks to segment the projected images. While the results are promising, many of these models are structurally complex, with high spatiotemporal complexity, which makes them unsuitable for real-time applications. To solve these issues, this paper proposes a multi-scale LiDAR data semantic segmentation method, MSCNet, with fewer parameters and higher segmentation accuracy. In the encoding phase, a single-channel multi-scale feature fusion block is introduced to alleviate the distribution differences between input channels. To obtain more stable local features, multi-scale dilated convolution residual blocks are designed to encode information from different receptive fields. To quickly capture global features, a pyramid pooling module is introduced. Experimental results on the SemanticKITTI, SemanticPOSS, and PandaSet datasets show that MSCNet achieves a good balance between parameters, accuracy, and running time. On the SemanticPOSS and PandaSet datasets in particular, MSCNet achieves the best performance. Under comparable parameter budgets, the method outperforms existing point-based and projection-based methods.

Introduction

The ability to understand the environment is one of the primary tasks in developing self-driving vehicles and robots. To realize three-dimensional (3D) environment perception, light detection and ranging (LiDAR) sensors play a crucial role in autonomous vehicles since they measure distance accurately [1–5]. However, the large point clouds generated by LiDAR sensors are difficult to understand due to their disorder and irregularity.

In recent years, scholars have devised several feature extraction modules for unstructured point clouds and used these modules to construct network models [6–9]. These point-based methods perform well in component segmentation and small indoor scenes. However, [10] showed that such methods performed poorly in terms of efficiency and accuracy when applied to outdoor scenes. Therefore, some scholars have attempted to project 3D LiDAR data onto the 2D plane and then use segmentation models of conventional images [11–13] to obtain the semantic information in the 2D plane. However, LiDAR data differs from image data, resulting in poor accuracy when using the original image segmentation models. To improve the accuracy, [14–17] redesigned the feature extraction module and network structure for LiDAR data.

Projection-based 2D representations, which map raw point clouds onto a structured plane, offer a practical and widely used solution for LiDAR semantic segmentation. Unlike RGB images with three correlated colour channels (R/G/B), LiDAR data comprise heterogeneous attributes (e.g., 3D coordinates, intensity, and range) that describe complementary aspects of the same object and exhibit distinct channel-wise distributions. To address this issue, we propose two modules and develop a lightweight MSCNet. As shown in Fig 2, the Single Channel Multi-Scale Feature (SCMF) Extraction module extracts multi-scale features from each channel independently to alleviate inter-channel distribution mismatch. The backbone employs a Dilated Convolutional Residual (DCR) module to expand the receptive field and capture multi-scale context efficiently. In addition, a pyramid pooling module is integrated to aggregate global context. Finally, a post-processing step assigns semantic labels to occluded points.

In this work, we use a simple projection method to map LiDAR data to a regular range image and use it as input to the proposed MSCNet model, an end-to-end fully convolutional network (FCN). The MSCNet model can be trained and tested on a single RTX 3090 with 24 GB of memory, and the model achieves the best balance between accuracy, parameters, and speed (see Fig 1). The main contributions of this paper are listed below:

  1. We propose a single-channel multi-scale feature fusion block (SCMFBlock) to better capture diverse features of the same object. SCMFBlock extracts multi-scale features from each input channel and fuses them across channels, enhancing feature representation.
  2. To extract point features, neighborhood features, and symmetry features more efficiently, we propose a novel dilated convolutional residual block (DCRBlock) as the backbone of the network, which uses dilated convolution to expand the receptive field and reduce the module parameters.
  3. We designed a novel FCN model named MSCNet, which uses SCMFBlock at the first layer, DCRBlock as the backbone, and a pyramid pooling module to obtain global features quickly. In tests on the public datasets SemanticKITTI, SemanticPOSS, and PandaSet, MSCNet attains high accuracy with a low number of parameters compared to network structures of similar size.
Fig 1. Accuracy, number of parameters, and speed of 3D LiDAR semantic segmentation in the SemanticKITTI test set [10].

Blue circles indicate projection-based methods, green pentagons indicate image-based methods, and red squares indicate point-based methods. The total number of network parameters in millions is shown in parentheses. In comparison with previous methods, the MSCNet method proposed in this paper achieves the best trade-off between accuracy, number of parameters, and speed.

https://doi.org/10.1371/journal.pone.0345761.g001

Related work

With the popularity of autonomous driving, there has been an increasing amount of research on semantic scene understanding. Rapidly acquiring LiDAR semantic information not only improves the accuracy of mapping and localisation, but also has important implications for local environment perception [18,19]. This section briefly describes existing approaches to the semantic understanding of LiDAR point clouds. Because of the unstructured nature of point cloud data, a dedicated feature extractor must be designed and used to construct a model. The pioneering models are PointNet [6] and PointNet++ [7]: PointNet uses a multilayer perceptron (MLP) to extract features and a max-pooling operation to achieve permutation-invariant point cloud features, but it does not consider the local features of the point cloud; PointNet++ adds a local feature capture module to complete the PointNet model. TangentConv [20] projects local points onto the tangent plane and applies 2D convolution to them, in addition to processing 3D information directly. The above methods are mainly aimed at small-scale scenes with a limited number of points, primarily indoor scenes. In contrast, RandLA-Net [21] uses random point sampling and designs better local feature aggregation modules to preserve geometric details and achieve better performance. To alleviate the instability of random sampling, the PCB-RandNet [22] model introduces a Polar Cylinder Balanced Random Sampling method, which ensures a more balanced distribution of the downsampled point cloud across different spatial regions, thereby improving segmentation performance. By comparison, the PolarNet [23] coding scheme uses a polar-coordinate bird's-eye view: it balances the points in the grid cells of the polar coordinate system, indirectly aligning the attention of the segmentation network with the long-tailed distribution of points along the radial axis.
However, the feature extraction modules of these methods use a large number of convolution operations. Although the number of network parameters is low, the models are time-consuming to run and not ideal in terms of accuracy. Alternatively, point clouds can be structured into voxels, and voxel features can be extracted using 3D convolution to complete the point cloud semantic understanding task. Although such methods are more accurate, they come with extremely high spatio-temporal complexity, too high to meet real-time requirements. For example, on a Tesla V100, the inference speeds of SPVCNN [24] and Cylinder3D [25] are only 8.1 and 7.6 fps, respectively.

Several methods exist for mapping 3D LiDAR data to the 2D plane, to which RGB-based image segmentation methods can then be applied directly. FCN [26] treats segmentation as dense prediction, using a fully convolutional structure to classify each pixel. [27,28] use an encoder-decoder structure to obtain the class of each pixel. Although the above methods achieve good performance on RGB image segmentation, they have a small receptive field. Therefore, the DeepLab series [12,29,30] introduced dilated convolution to obtain a larger receptive field and extract image features at several scales; dilated convolution enlarges the receptive field without losing information. PSPNet [11] proposed a pyramid pooling module to quickly capture the global information of an image. Because of the importance of semantic segmentation for autonomous driving, several research efforts have focused on this area, such as PIDNet [31], which uses convolutions with different dilation rates, arranged in parallel or concatenated, to obtain multi-scale information. By comparison, BiSeNet [13] first proposed a bilateral segmentation network in which one branch processes spatial information and the other captures global information, with a novel feature fusion module completing the segmentation task. However, due to the multimodal nature of LiDAR data, segmenting projected 3D LiDAR images with the above networks is not effective.

To satisfy the requirements of autonomous driving for model speed and accuracy, scholars have proposed projection-based methods that process LiDAR data projections better than segmentation models that use RGB images directly. SqueezeSeg [14] and SqueezeSegV2 [15] use SqueezeNet [32] as the backbone with conditional random field (CRF) post-processing. These two models run fast but with low accuracy. Therefore, SqueezeSegV3 [33] proposed a spatially adaptive convolution module to address the significant variation in feature distributions across locations in projection images, but this model is time-consuming to run. Unlike SqueezeSeg, RangeNet++ [16] used Darknet [34] as the backbone network and proposed a k-nearest-neighbor (k-NN) search for post-processing. The above projection-based models are either low in accuracy or have a significant number of parameters. Therefore, MINet [35] uses multiple paths at different scales to balance computational resources, and the network achieves a good balance among the number of parameters, running speed, and accuracy. In contrast, 3D-MiniNet [36] combines 3D and 2D learning layers, learning 2D representations from the raw points through a new projection that captures local and global information about the 3D data; the projection is then fed into a 2D fully convolutional neural network (FCNN) to produce the 2D semantic segmentation. Although the projection-based approach is practical, the multi-scale nature of multimodal information is not considered in the above methods. Therefore, in this work, we design extractors and fusion modules for multimodal features to achieve a balance of parameters, speed, and accuracy.

Methods

The goal of the approach proposed in this paper is to balance the number of parameters, running time, and accuracy of the semantic segmentation model to guarantee real-time and accurate perception of the environment. In this section, our approach is described in detail: we first briefly introduce the point cloud projection, and then discuss the network architecture, loss function, and training details.

LiDAR data projection

To ensure the real-time nature of the model, we use a spherical projection method to project the LiDAR point cloud into a spherical coordinate system to form a 2D range image. Then, the features are extracted using 2D convolution operations in computer vision to convert the point cloud segmentation task into an image segmentation problem. Thus, the point cloud is first projected onto a 2D plane with the following equation:

(1)
    \begin{pmatrix} u \\ v \end{pmatrix} =
    \begin{pmatrix} \frac{1}{2}\left[1 - \arctan(y, x)\,\pi^{-1}\right] W \\[4pt]
    \left[1 - \left(\arcsin(z\,d^{-1}) + f_{\mathrm{down}}\right) f^{-1}\right] H \end{pmatrix}

where (u, v) are the image coordinates of the projected point, (x, y, z) is a raw LiDAR point, H and W are the height and width of the desired projected 2D map, d = \sqrt{x^2 + y^2 + z^2} represents the depth of each point, and f = f_{\mathrm{up}} + f_{\mathrm{down}} represents the vertical field of view of the sensor.

Using equation (1), an input tensor of size [W × H × C] is obtained, where W and H can be of arbitrary size; in this paper we take W = 2048 px and H = 64 px, and the C = 5 channels are (x, y, z, d, r), where d represents the depth and r is the intensity read by the sensor. In the next subsections, the new neural network structure and its modules are explained, and the validity of the model and of each module is verified using experimental results and ablation experiments, respectively.
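As a concrete illustration, the spherical projection of equation (1) can be sketched in NumPy as follows. The field-of-view values (roughly those of a 64-beam sensor) and the -1 fill value for empty pixels are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def spherical_projection(points, remission, H=64, W=2048,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) LiDAR point cloud onto a (5, H, W) range image.

    The FOV values are illustrative (roughly a 64-beam spinning LiDAR).
    Channels follow the paper's ordering: (x, y, z, d, r).
    """
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = abs(fov_up) + abs(fov_down)          # total vertical FOV f

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    d = np.linalg.norm(points, axis=1)         # depth of each point

    # Eq. (1): yaw -> image column u, pitch -> image row v
    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * W
    v = (1.0 - (np.arcsin(z / d) + abs(fov_down)) / fov) * H

    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    # 5-channel range image; empty pixels are filled with -1
    img = np.full((5, H, W), -1.0, dtype=np.float32)
    # write farthest points first so nearer points overwrite them
    order = np.argsort(d)[::-1]
    img[0, v[order], u[order]] = x[order]
    img[1, v[order], u[order]] = y[order]
    img[2, v[order], u[order]] = z[order]
    img[3, v[order], u[order]] = d[order]
    img[4, v[order], u[order]] = remission[order]
    return img, u, v
```

When several points fall into the same pixel, writing them in decreasing depth order keeps the closest point, which is the usual convention for range images.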

MSCNet model architecture

The structure of MSCNet is shown in Fig 2. The input to the model is a 2D projection of the LiDAR point cloud, as described in the Methods section. The first layer consists of five parallel SCMFBlock modules (see Fig 3(a)), which perform information calibration for each channel, followed by the BasicBlock module (see Fig 3(b)), which extracts the calibrated multi-modal information. The backbone network is constructed from DCRBlock modules (see Fig 3(c)) to extract point features and multi-scale neighborhood features. To obtain global information quickly, a pyramid pooling module over different scales is utilized to improve the network's accuracy and convergence speed. The final 2D prediction at the original resolution is generated in the feature fusion module and back-projected into 3D space. Next, each module is described in detail.

Fig 2. The architecture of our proposed MSCNet model.

Given LiDAR data, we first use spherical projection to obtain a range image (a); SCMFBlock, BasicBlock, and DCRBlock are then applied to build the multi-scale feature acquisition model (b), where the dashed arrows indicate the type of supervision. Next, the pyramid pooling module is used to obtain different sub-regional representations, which are upsampled and concatenated with the DCRBlock output features to form the final feature representation, containing both local and global information (c). Finally, the representation is fed into the Feature Fusion Module to obtain a pixel-by-pixel prediction, and a point-by-point prediction is obtained using the inverse projection (d). The individual blocks are illustrated in Fig 3, and the Feature Fusion Module in Fig 4.

https://doi.org/10.1371/journal.pone.0345761.g002

Fig 3. Illustrations of SCMFBlock (a), BasicBlock (b) and DCRBlock (c).

where CBR = Conv + BN + LeakyReLU and DCBR is a CBR with a dilation rate of 2; the numbers in each block denote the convolution kernel size.

https://doi.org/10.1371/journal.pone.0345761.g003

Single Channel Multi-scale Feature Fusion Block (SCMFBlock). Unlike the RGB image representation, the range image contains channels of five different modalities. Most projection-based segmentation methods treat all modalities homogeneously and at a single scale. However, we show that it is more efficient to use SCMFBlock (see Fig 3(a)) to extract multi-scale features for each channel individually. Specifically, multi-scale features are first extracted from a single channel using CBR blocks at three different scales and a Maxpool, where CBR denotes the sequential combination of convolution, normalization, and activation. The multi-scale information of each channel is then mapped to a separate feature space by a convolution. This step can be considered a feature calibration for each channel before fusion. Note that each channel passes through its own SCMFBlock. After the five parallel SCMFBlocks, the features of all channels are concatenated and sent to the BasicBlock (shown in Fig 3(b), consisting of two stacked CBR layers) to extract features. Each SCMFBlock outputs only 4 channels, so the computation is small.
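The per-channel, multi-scale data flow of SCMFBlock can be sketched schematically. This is a heavily simplified stand-in: the real block uses learned CBR convolutions and a Maxpool per scale, while here each scale is a stride-1 max-pooling purely to show how every channel is processed independently before fusion:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def scmf_sketch(range_img):
    """Schematic SCMFBlock data flow over a (5, H, W) range image.

    Stand-in: learned multi-scale CBR branches are replaced by max
    pooling at window sizes 1/3/5 (an illustrative simplification).
    """
    C, H, W = range_img.shape              # C = 5: (x, y, z, d, r)
    per_channel = []
    for c in range(C):
        ch = range_img[c]
        scales = [ch]                      # scale 1: identity path
        for k in (3, 5):                   # scales 2-3: 3x3, 5x5 windows
            padded = np.pad(ch, k // 2, mode="edge")
            scales.append(sliding_window_view(padded, (k, k)).max(axis=(2, 3)))
        per_channel.append(np.stack(scales))        # (3, H, W) per channel
    # concatenate per-channel multi-scale maps before the BasicBlock
    return np.concatenate(per_channel, axis=0)      # (C * 3, H, W)
```

The key structural point carried over from the paper is that each of the five modalities keeps its own branch (no cross-channel mixing) until the final concatenation.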

Dilated Convolutional Residual Block (DCRBlock). The receptive field plays a crucial role in the extraction of spatial features. To obtain more descriptive spatial features, a straightforward approach is to increase the size of the convolution kernel. However, this has the disadvantage that the number of parameters increases dramatically as the depth of the model increases. Therefore, this paper proposes a DCRBlock to obtain a larger receptive field. As shown in Fig 3(c), this module extracts point features using CBR, multi-scale neighborhood features using CBR and DCBR, and symmetry features using a Maxpool, where DCBR denotes the sequential operation of dilated convolution, normalization, and activation with a dilation rate of 2. This operation reduces the convolution parameters while expanding the receptive field. Further, we concatenate the outputs of these branches and apply a CBR to fuse the various features, so that the network obtains information from different receptive fields. Finally, feature extraction or downsampling is completed using convolution, and the features are fused through a residual connection.
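The parameter saving behind DCBR can be checked with simple arithmetic. The channel counts below are illustrative, not taken from the paper:

```python
def effective_kernel(k, dilation):
    """Effective spatial extent of a k x k kernel with a given dilation."""
    return dilation * (k - 1) + 1

def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution (bias terms ignored)."""
    return c_in * c_out * k * k

# A 3x3 convolution with dilation 2 covers the same 5x5 window as a
# dense 5x5 kernel, but with 9 weights per channel pair instead of 25.
assert effective_kernel(3, 2) == effective_kernel(5, 1) == 5

dense = conv_params(32, 32, 5)    # 25,600 weights
dilated = conv_params(32, 32, 3)  # 9,216 weights
print(dilated / dense)            # 0.36: ~64% fewer parameters
```

This is why stacking CBR and DCBR branches yields multi-scale receptive fields at a fraction of the cost of enlarging the kernel.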

Pyramid Pooling Module (PPM). After the multi-scale feature extraction is complete, the PPM is used to obtain global features quickly. The PPM [11] incorporates four pyramid-scale features, as shown in Fig 2(c). The coarsest level, highlighted in red, is global pooling, which generates a single-bin output. The subsequent pyramid levels divide the feature map into different sub-regions, forming pooled representations of different locations, so the outputs of the different levels contain feature maps of different sizes. To maintain the weight of the global features, we use a convolutional layer after each pyramid level: if the size of a pyramid level is N, the dimensionality of the contextual representation is reduced to 1/N of the original. The lower-dimensional feature maps are then upsampled to the size of the original feature map by bilinear interpolation. Finally, the features at the different levels are concatenated to form the final pyramid set of global features.
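A minimal sketch of the pyramid pooling idea is shown below. The per-level convolution and bilinear interpolation used in the paper are simplified away (nearest-neighbour upsampling, no learned layers), and the bin sizes are the standard PSPNet choice, assumed here for illustration:

```python
import numpy as np

def pyramid_pooling(feat, bins=(1, 2, 3, 6)):
    """Pyramid pooling over a (C, H, W) feature map (bins assume n <= H, W).

    Each level average-pools the map into an n x n grid, upsamples it
    back to (H, W) by nearest-neighbour repetition, and all levels are
    concatenated with the input along the channel axis.
    """
    C, H, W = feat.shape
    outs = [feat]
    for n in bins:
        pooled = np.zeros((C, n, n), dtype=feat.dtype)
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                pooled[:, i, j] = feat[:, hs[i]:hs[i + 1],
                                          ws[j]:ws[j + 1]].mean(axis=(1, 2))
        # upsample each bin back to the full resolution
        yi = np.arange(H) * n // H
        xi = np.arange(W) * n // W
        outs.append(pooled[:, yi][:, :, xi])
    return np.concatenate(outs, axis=0)   # (C * (1 + len(bins)), H, W)
```

The 1-bin level is exactly the global pooling highlighted in red in Fig 2(c); the finer levels preserve coarse spatial layout.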

Feature Fusion Module (FFM). Fig 4(b) illustrates how the Feature Fusion Module (FFM) integrates multi-scale features to assign semantic labels to each pixel of the range image. The top half of Fig 4(b) receives input from the BasicBlock module shown in Fig 2, where the MobileBlock (depicted in Fig 4(a)) extracts detailed spatial features. Specifically, the MobileBlock first applies a 1 × 1 convolution followed by batch normalization and ReLU (CBR) to reduce channel dimensionality. It then employs a 3 × 3 CBR to capture local neighborhood information. Finally, a second 1 × 1 CBR further refines the extracted fine-grained features before they are passed into the FFM. The bottom half of Fig 4(b) shows the multi-scale concatenated features extracted by the PPM, which are processed by a convolution block containing convolution, batch normalization, and LeakyReLU activation, and upsampled to the original resolution. Finally, classification is completed by concatenating the features processed by the two branches.

Fig 4. Architecture of the Decoder Module: (a) MobileBlock; (b) Feature Fusion Module.

https://doi.org/10.1371/journal.pone.0345761.g004

Loss function

The class imbalance in the dataset poses a challenge for model training. Bicycles and persons, for example, have far fewer points than cars and roads, which biases the network towards the classes with more training data and degrades performance on the rare classes.

To address the class imbalance problem, this paper divides the loss function into two parts: middle-layer supervision and output-layer supervision. Supervising the middle-layer features of the network aids model optimization, and we utilize the standard weighted cross-entropy loss [16] as semantic supervision:

(2)
    L_{wce}(y, \hat{y}) = -\sum_{c=1}^{C} w_c\, y_c \log(\hat{y}_c)

where y_c and \hat{y}_c denote the true and predicted category labels respectively, f_c denotes the frequency of class c, and w_c denotes the weight of class c, chosen so that the larger f_c is, the smaller w_c becomes. This handles imbalanced data: the large number of points in the class road yields a small w_c, while the small number of points in the class person yields a relatively large w_c. In addition, we apply this weighted cross-entropy loss at both the middle layer (L_ms) and the output layer (L_os); the dashed arrows in Fig 2(b) indicate L_ms. As we show in our experiments, this intermediate supervision improves training and increases accuracy. However, adding semantic supervision to the lower paths did not help much, as the resolution of the lower paths is too low.
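The effect of the class weights can be made concrete with a toy example. The inverse-frequency weighting with a stabilising epsilon is an assumption (one common choice in range-image segmentation codebases); the paper only states that w_c shrinks as f_c grows:

```python
import numpy as np

# Class frequencies for a toy 3-class dataset: road dominates, person rare.
freq = np.array([0.90, 0.08, 0.02])

# Assumed weighting: inverse frequency with a stabilising epsilon.
eps = 1e-3
w = 1.0 / (freq + eps)

def weighted_cross_entropy(probs, labels, w):
    """Eq. (2) over a batch: mean of -w_c * log p_c, c = true class."""
    p = probs[np.arange(len(labels)), labels]
    return float(np.mean(-w[labels] * np.log(p)))

probs = np.array([[0.7, 0.2, 0.1],    # predicted class distributions
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
labels = np.array([0, 1, 2])          # ground-truth classes
loss = weighted_cross_entropy(probs, labels, w)
assert w[2] > w[1] > w[0]             # rare class carries the largest weight
```

An error on the rare "person" class is thus penalised far more heavily than the same error on "road", which is exactly the rebalancing the text describes.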

Besides the weighted cross-entropy, a Lovász-Softmax loss is added at the network’s end to maximize the mean intersection over union (mIoU). The mIoU metric is commonly used for segmentation performance evaluation. However, due to the discrete, non-differentiable nature of IoU, it cannot be used as a direct loss. In [37], the mIoU calculation is improved using the Lovász function to obtain the Lovász-Softmax loss (L_ls), which can be expressed as follows:

(3)
    L_{ls} = \frac{1}{|C|} \sum_{c \in C} \overline{\Delta_{J_c}}\big(m(c)\big)
(4)
    m_i(c) = \begin{cases} 1 - x_i(c) & \text{if } c = y_i(c) \\ x_i(c) & \text{otherwise} \end{cases}

where \overline{\Delta_{J_c}} is defined as the Lovász extension of the Jaccard index, C denotes the set of categories, and x_i(c) and y_i(c) denote, for class c at pixel i, the predicted probability and ground-truth label, respectively.

Finally, the total loss function of the MSCNet is a linear combination of two end losses and an intermediate layer loss function, as follows:

(5)
    L = L_{os} + L_{ls} + \lambda L_{ms}

In equation (5), L_os represents the supervision loss applied to the output layer, L_ms denotes the supervision loss applied to the intermediate layers, and λ is a hyperparameter weighting the intermediate supervision.

Optimizer and regularization

We utilized AdamW as the optimizer, setting the learning rate to 0.0001 and the weight decay coefficient to 0.005. Unlike the original Adam, AdamW decouples weight decay from the gradient update step, ensuring that regularization does not interfere with the adaptive learning rate mechanism. This strategy yields more stable training dynamics and improves the model’s generalization performance. To mitigate overfitting, we applied data augmentation before projecting point clouds into range views. These augmentations include random rotations or translations, flips around the y-axis, and random point dropout, each applied independently with probability 0.5. Such augmentation techniques enhance model robustness by introducing diverse geometric variations.
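The augmentation pipeline can be sketched as below. The rotation/translation ranges and the dropout ratio are illustrative assumptions; the paper only specifies the 0.5 per-augmentation probability:

```python
import numpy as np

def augment(points, rng, p=0.5):
    """Point-cloud augmentations applied before spherical projection.

    Each transform is drawn independently with probability p = 0.5;
    the magnitude ranges below are illustrative, not from the paper.
    """
    pts = points.copy()
    if rng.random() < p:                       # random yaw rotation
        a = rng.uniform(0, 2 * np.pi)
        R = np.array([[np.cos(a), -np.sin(a), 0.0],
                      [np.sin(a),  np.cos(a), 0.0],
                      [0.0, 0.0, 1.0]])
        pts = pts @ R.T
    if rng.random() < p:                       # random translation
        pts += rng.uniform(-5.0, 5.0, size=3)
    if rng.random() < p:                       # flip around the y-axis
        pts[:, 0] = -pts[:, 0]
    if rng.random() < p:                       # random point dropout (10%)
        pts = pts[rng.random(len(pts)) > 0.1]
    return pts
```

Because all transforms act on the raw 3D points, the subsequent projection produces genuinely different range images rather than mere 2D warps.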

For the SemanticKITTI dataset, we used a batch size of 6 and adopted a one-cycle learning rate schedule, training the model for 30 epochs. Input dimensions for the projection were set to W = 2048, H = 64 to align with existing range-view-based methods. On the SemanticPOSS dataset, we maintained the same data augmentation and training strategy but adjusted the input projection size to W = 1800, H = 40, and trained the MSCNet model for 60 epochs. The PandaSet configuration matched that of SemanticKITTI.

Post-processing

The same pixel label is assigned to multiple LiDAR points during the back-projection process, which leads to misclassification at target edges. To address this, we adopt the k-NN-based post-processing proposed by RangeNet++ [16]: a small window is set around each point's corresponding image pixel, and the range values of the (u, v) neighborhood, which approximate Euclidean distances in 3D space, are used to post-process the back-projection results quickly (running within 7 ms with GPU acceleration).

Note that this post-processing is only applied to the network output during inference and has no effect on learning.
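A simplified version of this k-NN re-labelling step can be sketched as follows. The RangeNet++ original additionally uses Gaussian weighting of neighbours and a distance cutoff, which are omitted here:

```python
import numpy as np

def knn_postprocess(depth_img, label_img, point_uv, point_depth, k=5, win=3):
    """Re-label each 3D point by majority vote among its k range-nearest
    neighbours inside a win x win window around its projected pixel.

    Simplified sketch of the RangeNet++ k-NN step (no Gaussian weights,
    no absolute distance cutoff).
    """
    H, W = depth_img.shape
    labels = np.empty(len(point_uv), dtype=label_img.dtype)
    r = win // 2
    for n, ((u, v), d) in enumerate(zip(point_uv, point_depth)):
        v0, v1 = max(0, v - r), min(H, v + r + 1)
        u0, u1 = max(0, u - r), min(W, u + r + 1)
        cand_d = depth_img[v0:v1, u0:u1].ravel()
        cand_l = label_img[v0:v1, u0:u1].ravel()
        valid = cand_d >= 0                     # skip empty pixels
        cand_d, cand_l = cand_d[valid], cand_l[valid]
        nearest = np.argsort(np.abs(cand_d - d))[:k]
        labels[n] = np.argmax(np.bincount(cand_l[nearest]))
    return labels
```

Because occluded points have range values far from their pixel's stored depth, the vote is dominated by genuinely nearby neighbours, which is what repairs the smeared object boundaries.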

Experiments

Experimental settings

To validate the MSCNet network performance, we chose three datasets for our experiments.

  1. SemanticKITTI: The dataset provides semantic labels for each point. It consists of 22 sequences with over 43,000 scans: sequences 00–10 contain over 21,000 scans, and the remaining sequences 11–21 form the test set. During training, sequence 08 was used as the validation set, and sequences 00–07, 09, and 10 as the training set. The ablation experiments for functional module verification were performed on the validation set. For the test set, we performed online tests and compared against state-of-the-art models with similar parameter counts.
  2. SemanticPOSS: Collected at Peking University and stored in the same format as SemanticKITTI, this dataset contains 2,988 complex LiDAR scans with more moving objects and more small objects than typical urban scenes. It comprises six sequences, 00 to 05; sequence 02 is used as the test set and the others as the training set in our experiments.
  3. PandaSet: The dataset was acquired on two routes in Silicon Valley using both a spinning and a solid-state LiDAR, comprising 16,000 LiDAR scans. For the experiments in this paper, data from the spinning LiDAR was used, with 70% of the data used for training and the remainder for testing. As suggested by the dataset authors, similar classes were merged and rare classes removed.

We use the standard mean intersection over union (mIoU) [38] to evaluate all classes:

(6)
    \mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}

where TPc, FPc, and FNc denote the numbers of true positive, false positive, and false negative predictions for class c, respectively. C is the number of classes.
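The metric in equation (6) is straightforward to compute from per-class counts; a small worked example:

```python
import numpy as np

def miou(pred, target, num_classes):
    """Eq. (6): mean IoU over classes from per-class TP/FP/FN counts."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        if tp + fp + fn == 0:                  # class absent everywhere
            continue
        ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))

pred   = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 0, 1, 2, 2, 2])
# class 0: IoU 1.0; class 1: 1/2; class 2: 2/3
assert abs(miou(pred, target, 3) - (1.0 + 0.5 + 2 / 3) / 3) < 1e-9
```

Skipping classes that appear in neither prediction nor ground truth (a common convention) prevents absent classes from dragging the mean to zero.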

Quantitative results

As shown in Table 1, we systematically compare point-based models (rows 1–10), voxel-based models (rows 11–15), image-based models (rows 16–19), and projection-based models (rows 20–31), evaluating each category in terms of parameter count (M), processing speed (Scan/sec), and segmentation accuracy (mIoU). Among point-based models, most methods exhibit relatively slow segmentation speed and low accuracy on the SemanticKITTI dataset. RandLA-Net and PolarNet stand out with competitive speed and accuracy, yet they come with high spatial complexity. Notably, although PolarNet achieves high accuracy, its parameter count exceeds 16M, leading to substantial computational overhead. For voxel-based methods, the inference speeds of Cylinder3D V2 and JS3C-Net are 5.9 scans/sec and 2.1 scans/sec, respectively, which fall short of real-time processing requirements. In contrast, MSCNet outperforms point-based models across parameter size, runtime efficiency, and segmentation accuracy, demonstrating superior overall performance.

Table 1. Evaluation Results on the SemanticKITTI Test Set (Sequences 11 to 21). Point-based Methods: Rows 1 to 10. Voxel-based Methods: Rows 11 to 15. Image-based Methods: Rows 16 to 19. Projection-based Methods: Rows 20 to 31.

https://doi.org/10.1371/journal.pone.0345761.t001

For the image-based models, this paper validates the commonly used methods PSPNet [11], DeepLabV3+ [12], BiSeNet [13], and DenseASPP [44]. For each, the number of input channels was adjusted to 5 before training. Although such models are faster than the point-based methods, they have a significantly larger number of parameters and lower accuracy. The results show that projected LiDAR data differs from RGB images and is poorly handled when processed directly by image segmentation frameworks.

Then, the proposed model is compared with other projection-based methods. SqueezeSegV2 features a small number of parameters and fast inference but suffers from low accuracy. RangeNet++ achieves higher accuracy than SqueezeSegV2, though at the cost of significantly increased parameters and computational complexity. SqueezeSegV3, equipped with an enhanced feature extraction module, further improves accuracy but introduces considerable latency. 3D-MiniNet demonstrates a favourable trade-off between speed and accuracy; however, its improved performance relies on a larger number of parameters. In contrast, MINet [35] achieves a more balanced compromise among parameter count, accuracy, and speed. Compared to state-of-the-art models such as RangeViT [47] and Meta-RangeSeg [50], the proposed MSCNet model offers notable advantages: fewer parameters (1.6M), faster processing speed (30 scans per second), and the highest mIoU score (65.1%). When directly compared to LENet [49], MSCNet demonstrates superior parameter efficiency, inference speed, and segmentation accuracy. In terms of per-class metrics, MSCNet achieves the highest accuracy in 7 out of 19 classes; for the remaining 12 classes (e.g., car, truck, road), its performance is comparable to that of the best-performing method, LENet.

We also evaluated our method on the SemanticPOSS dataset; the comparison results are shown in Table 2. SemanticPOSS is smaller and has sparser targets than SemanticKITTI, and therefore yields a lower overall mIoU. Nevertheless, our method achieves state-of-the-art performance on this dataset, improving the overall mIoU by 1.0%, with a much higher per-class mIoU than the compared methods on individual targets.

Table 2. Evaluation Results on the SemanticPOSS Test Set (Sequence 02). Image-based Methods: Rows 1 to 3. Projection-based Methods: Rows 4 to 9.

https://doi.org/10.1371/journal.pone.0345761.t002

To further validate the effectiveness of MSCNet, the PandaSet dataset was added for validation, and its evaluation results are shown in Table 3. Our method greatly improves accuracy on difficult-to-identify objects such as motorcycles, cars, and pedestrians, and the overall performance is 3.0% better than the other methods.

Table 3. Evaluation Results on the Pandaset Test Set. Image-based Methods: Rows 1 to 3. Projection-based Methods: Rows 4 to 7.

https://doi.org/10.1371/journal.pone.0345761.t003

Qualitative results

To better assess the differences between the predicted and labelled data qualitatively, segmentation error maps were calculated; the results are presented in Figs 5 and 6. Fig 5 shows the SemanticKITTI validation set (sequence 08), where Fig 5(c) is an error map of the annotated data versus the MSCNet predictions. It shows that our method produces high-quality results in most categories (e.g., cars) and in challenging categories (e.g., people, traffic signs), while the misclassified points (red) lie mainly on boundaries between objects and on similar categories (e.g., buildings and fences).

Fig 5. Qualitative results of the MSCNet model on the SemanticKITTI benchmark (validation sequence 08).

Figures (a) and (b) show the raw data and corresponding segmentation labels of a LiDAR scan frame, respectively, and (c) shows the segmentation error map of our method for that scan frame (Red indicates incorrect predictions). The various colors indicate the different semantic classes: cars in blue, roads in purple, vegetation in green, and buildings in yellow.

https://doi.org/10.1371/journal.pone.0345761.g005

Fig 6. Qualitative analysis of the SemanticPOSS validation set.

(a) and (b) show the input data and the corresponding ground-truth segmentation for a LiDAR scan frame, and (c) shows the segmentation error map of our method for that frame (red indicates incorrect predictions).

https://doi.org/10.1371/journal.pone.0345761.g006

Fig 6 shows results on the SemanticPOSS validation set, which contains more small moving objects (e.g., pedestrians, riders) and is more complex than the urban scenes of SemanticKITTI. Fig 6(c) shows the error map computed from the annotated data and the MSCNet predictions; our method produces high-quality results in most categories (e.g., cars, plants) and in challenging ones (e.g., people, riders), while the misclassified points (red points) are mainly on objects far from the LiDAR.

Ablation study

This section validates the hyperparameter of the loss function and the impact of each network module on MSCNet, using sequence 08 of the SemanticKITTI dataset.

Effect of the loss hyperparameter. Table 4 illustrates the impact of the intermediate-supervision weight in the loss function on segmentation performance (mIoU). As the weight increases, performance exhibits a clear trend of first rising and then falling. With the weight set to 0, meaning the intermediate supervision term is not activated and the model relies solely on the main loss function, the mIoU is 65.2%. Introducing intermediate supervision with a small non-zero weight significantly improves the mIoU to 66.4%, indicating that the auxiliary loss begins to take effect. Increasing the weight to 0.05 and 0.1 leads to continued improvement, yielding mIoU values of 66.9% and 67.3%, respectively. The optimal result is obtained at 0.1, a 2.1% improvement over the baseline, fully demonstrating the effectiveness of the intermediate supervision mechanism. However, beyond 0.1 performance begins to decline: the mIoU first drops to 67.0% and then decreases further to 66.6%. This indicates that although intermediate supervision has a positive effect on model optimisation, its weight must be controlled precisely: if it is too small, the supervision effect cannot be fully utilised, while if it is too large, it disrupts the balance among the components of the loss function and interferes with training. It is worth noting that over a relatively wide range of weights up to 0.5, the model maintains high performance, deviating by no more than 0.4% from the optimal result. Based on the above analysis, we set the weight to 0.1; satisfactory results can also be obtained across the wider range up to 0.5.
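The weighting scheme studied in Table 4 can be sketched as follows. This is a minimal sketch, not the paper's implementation: the individual loss terms are placeholders, and only the weighted combination of the main loss with the intermediate-supervision terms is shown.

```python
def total_loss(main_loss, aux_losses, weight=0.1):
    """Combine the main segmentation loss with intermediate-supervision
    (auxiliary) losses, scaled by a small weight.

    weight = 0   disables intermediate supervision (the Table 4 baseline);
    weight = 0.1 was the optimum reported in the ablation.
    """
    return main_loss + weight * sum(aux_losses)

# Hypothetical loss values from a main head and two auxiliary heads:
combined = total_loss(1.20, [0.9, 1.1], weight=0.1)  # 1.20 + 0.1 * 2.0 = 1.40
baseline = total_loss(1.20, [0.9, 1.1], weight=0.0)  # auxiliary heads off
```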

Impact of network modules. In Table 5, we perform ablation studies to deeply analyze how the four modules (SCMF, DCR, PPM, FFM) jointly affect model accuracy, parameter count, and computational efficiency (Scan/sec). The first row shows the full model, which achieves an mIoU of 67.3%, an inference speed of 30 Scan/sec, and employs 1.6 M parameters. In the second row, we replace SCMFBlock with two 3×3 CBR blocks to remove SCMF’s contribution. After this replacement, inference speed increases to 32 Scan/sec, while mIoU drops to 65.8% (a loss of 1.5%). This indicates that although SCMFBlock introduces some computational overhead, it offers a substantial gain in representational power across channels. In the third row, we substitute the DCRBlock in the middle layers with a MobileBlock of similar parameter size. This replacement raises inference speed to 35 Scan/sec, but mIoU falls to 64.3%, a drop of 3.0%. This demonstrates that DCRBlock, despite its computational cost, plays a critical role in maintaining intermediate feature expressiveness and segmentation accuracy. In the fourth row, we replace PPM with a 3×3 CBR block with a comparable parameter count. As a result, inference speed slightly increases to 31 Scan/sec, but mIoU decreases to 64.2%, a drop of 3.1%. This suggests that the ability of PPM to capture global contextual information is vital for improving semantic segmentation accuracy, and its computational overhead is moderate, making it an efficient design choice. In the fifth row, we remove the feature concatenation and fusion operations within FFM, and only upsample and predict from the final output feature. This modification increases inference speed to 33 Scan/sec and slightly reduces the parameter count to 1.5 M, but the mIoU falls to 62.7%, a drop of 4.6%. This result indicates that the cross-layer feature fusion mechanism within FFM is crucial to the final segmentation accuracy, and its computational cost is justified by the performance gain. 
In summary, SCMF, DCR, PPM, and FFM each contribute to enhancing semantic segmentation accuracy to varying degrees. Although each module introduces some computational cost, together they strike an effective balance between accuracy and efficiency.
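The largest ablation drop comes from removing the FFM's cross-layer fusion, i.e., combining coarse high-level features with fine low-level ones before prediction. A minimal numpy sketch of this generic upsample-and-concatenate pattern (the actual FFM wiring, channel counts, and strides below are illustrative assumptions, not the paper's exact design):

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse(low_level, high_level):
    """Bring the coarse high-level map to the low-level resolution,
    then concatenate along the channel axis."""
    factor = low_level.shape[1] // high_level.shape[1]
    up = upsample_nearest(high_level, factor)
    return np.concatenate([low_level, up], axis=0)

low  = np.zeros((16, 64, 512))   # fine features (assumed stride 1)
high = np.zeros((64, 16, 128))   # coarse features (assumed stride 4)
fused = fuse(low, high)          # shape (80, 64, 512)
```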

Effect of range view resolution

As shown in Table 6, increasing the horizontal width W of the range view leads to significant improvements in mIoU. However, this enhancement comes at the cost of reduced inference speed (scan/sec). Specifically, at lower resolutions the model operates at high speed but exhibits lower performance, while at mid-to-high resolutions it achieves an optimal balance between speed and accuracy, with the mIoU peaking at 67.3%. Further increases in resolution yield diminishing returns, with performance gains plateauing or even slightly declining. This trend indicates that while higher resolutions enhance data completeness and segmentation accuracy, they also impose greater computational demands. To balance operational efficiency and segmentation precision, we select the mid-to-high resolution at which mIoU peaks in Table 6 for the range view on the SemanticKITTI dataset.
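The range view itself is produced by the spherical projection common to RangeNet++-style pipelines, where W is the horizontal width varied in Table 6. A minimal numpy sketch under assumed sensor parameters (the vertical field-of-view values below are the HDL-64E figures commonly used for SemanticKITTI; they are illustrative, not taken from this paper):

```python
import numpy as np

def spherical_project(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Map (N, 3) LiDAR points to pixel coordinates of an H x W range view.

    Column u comes from the azimuth angle, row v from the elevation
    angle normalised by the vertical field of view (in degrees).
    """
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up_r - fov_down_r
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)          # range per point
    yaw, pitch = np.arctan2(y, x), np.arcsin(z / r)
    u = 0.5 * (1.0 - yaw / np.pi) * W           # column from azimuth
    v = (1.0 - (pitch - fov_down_r) / fov) * H  # row from elevation
    u = np.clip(np.floor(u), 0, W - 1).astype(int)
    v = np.clip(np.floor(v), 0, H - 1).astype(int)
    return u, v, r

pts = np.array([[10.0, 0.0, 0.0]])  # a point straight ahead, on the horizon
u, v, r = spherical_project(pts)    # lands in the centre column, u = 1024
```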

Multi-scale effects of the DCR block

To assess the contributions of each sub-module within the dilated convolution residual module (DCR) (whose architecture is illustrated in Fig 3) to the model’s final performance, we conducted a series of ablation experiments. Table 7 reports the mIoU scores under various combinations of sub-modules. When all sub-modules are retained, the model attains its highest mIoU of 67.3%, indicating that the full architecture effectively fuses detail and context information across multiple scales. Removing the 1×1 CBR (the channel-mapping fusion branch) reduces the mIoU to 66.9% (a drop of only 0.4 points), suggesting that this branch, while beneficial for feature extraction, is not critical. Further removing the 3×3 CBR causes the mIoU to fall to 66.5%, reflecting the important role of standard convolution in capturing local structural features. Eliminating the MaxPool branch (i.e., discarding the pooling scale transform) further decreases the mIoU to 66.3%, indicating that pooled feature aggregation plays a positive role in the model’s scale generalization. Moreover, when only the MaxPool branch is retained, performance drops sharply to 62.5%; keeping only the 1×1 convolution yields an mIoU of 62.0%; retaining only the 3×3 convolution achieves 63.5%; and using only the dilated 3×3 convolution yields 64.5%. These results clearly demonstrate that relying on a single scale or branch severely limits the capacity for feature extraction. In summary, DCRBlock uses a multi-branch parallel structure to achieve complementary features, effectively mitigating information loss and thereby enhancing the generalization of the segmentation model across different scales and improving recognition accuracy.
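The benefit of the dilated branch can be illustrated in one dimension: spacing the kernel taps enlarges the receptive field without adding weights. A minimal numpy sketch (illustrative only, not the DCRBlock implementation):

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """'Valid' 1-D convolution with a dilated kernel: the taps are spaced
    `dilation` samples apart, so a k-tap kernel covers a span of
    1 + (k - 1) * dilation samples with the same number of parameters."""
    k = len(w)
    span = (k - 1) * dilation + 1          # effective receptive field
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
y1 = dilated_conv1d(x, w, dilation=1)  # 3-sample window -> y1[0] = 0+1+2 = 3
y2 = dilated_conv1d(x, w, dilation=2)  # 5-sample span   -> y2[0] = 0+2+4 = 6
```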

Conclusion

In this paper, we propose MSCNet, an efficient and highly accurate network for real-time semantic segmentation of LiDAR data that meets real-time requirements on GPUs. Owing to the combined effect of the SCMFBlock, DCRBlock, PPM, and FFM modules, MSCNet segments small targets such as pedestrians and riders more accurately, and it achieves the best performance on the SemanticPOSS and PandaSet datasets. The SCMFBlock performs multi-scale feature extraction on each input channel individually, correcting cross-channel distribution differences. The DCRBlock, located in the backbone network, uses dilated convolution to enlarge the receptive field while reducing network parameters. The PPM quickly captures global features, and the FFM, at the end of the network, fuses high-level and low-level features. In addition, intermediate supervision allows the network to converge quickly. Evaluation on SemanticKITTI shows that MSCNet better balances parameter count, speed, and accuracy. However, the projection-based approach has some limitations; for example, projection causes loss of information. We will investigate such issues in future work.

References

  1. Luo Y, Han T, Liu Y, Su J, Chen Y, Li J, et al. CSFNet: Cross-Modal Semantic Focus Network for Semantic Segmentation of Large-Scale Point Clouds. IEEE Trans Geosci Remote Sensing. 2025;63:1–15.
  2. Xiao J, Wang S, Zhou J, Zeng Z, Luo M, Chen R. Revisiting the Learning Stage in Range View Representation for Autonomous Driving. IEEE Trans Geosci Remote Sensing. 2025;63:1–14.
  3. Xie X, Wei H, Yang Y. Real-Time LiDAR Point-Cloud Moving Object Segmentation for Autonomous Driving. Sensors (Basel). 2023;23(1):547. pmid:36617142
  4. Mahima KTY, Perera A, Anavatti S, Garratt M. Exploring Adversarial Robustness of LiDAR Semantic Segmentation in Autonomous Driving. Sensors (Basel). 2023;23(23):9579. pmid:38067951
  5. Kim K. Pseudo Multi-Modal Approach to LiDAR Semantic Segmentation. Sensors (Basel). 2024;24(23):7840. pmid:39686377
  6. Charles RQ, Su H, Kaichun M, Guibas LJ. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 77–85. https://doi.org/10.1109/cvpr.2017.16
  7. Qi CR, Yi L, Su H, Guibas LJ. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems. 2017;30.
  8. Landrieu L, Simonovsky M. Large-Scale Point Cloud Semantic Segmentation with Superpoint Graphs. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. 4558–67. https://doi.org/10.1109/cvpr.2018.00479
  9. Mukhandi H, Ferreira JF, Peixoto P. SyS3DS: Systematic Sampling of Large-Scale LiDAR Point Clouds for Semantic Segmentation in Forestry Robotics. Sensors (Basel). 2024;24(3):823. pmid:38339539
  10. Behley J, Garbade M, Milioto A, Quenzel J, Behnke S, Stachniss C, et al. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 9296–306. https://doi.org/10.1109/iccv.2019.00939
  11. Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid Scene Parsing Network. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 6230–9. https://doi.org/10.1109/cvpr.2017.660
  12. Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Lecture Notes in Computer Science. Springer International Publishing. 2018. p. 833–51. https://doi.org/10.1007/978-3-030-01234-2_49
  13. Li H, Xiong P, Fan H, Sun J. DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 9514–23. https://doi.org/10.1109/cvpr.2019.00975
  14. Wu B, Wan A, Yue X, Keutzer K. SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018. 1887–93. https://doi.org/10.1109/icra.2018.8462926
  15. Wu B, Zhou X, Zhao S, Yue X, Keutzer K. SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud. In: 2019 International Conference on Robotics and Automation (ICRA), 2019. 4376–82. https://doi.org/10.1109/icra.2019.8793495
  16. Milioto A, Vizzo I, Behley J, Stachniss C. RangeNet++: Fast and Accurate LiDAR Semantic Segmentation. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019. 4213–20. https://doi.org/10.1109/iros40897.2019.8967762
  17. Atik ME, Duran Z. An Efficient Ensemble Deep Learning Approach for Semantic Point Cloud Segmentation Based on 3D Geometric Features and Range Images. Sensors (Basel). 2022;22(16):6210. pmid:36015964
  18. Xiao H, Hu Z, Lv C, Meng J, Zhang J, You J. Progressive Multi-Modal Semantic Segmentation Guided SLAM Using Tightly-Coupled LiDAR-Visual-Inertial Odometry. IEEE Transactions on Intelligent Transportation Systems. 2024.
  19. Zhao J, Huang W, Wu H, Wen C, Yang B, Guo Y, et al. SemanticFlow: Semantic Segmentation of Sequential LiDAR Point Clouds From Sparse Frame Annotations. IEEE Trans Geosci Remote Sensing. 2023;61:1–11.
  20. Tatarchenko M, Park J, Koltun V, Zhou Q-Y. Tangent Convolutions for Dense Prediction in 3D. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. 3887–96. https://doi.org/10.1109/cvpr.2018.00409
  21. Hu Q, Yang B, Xie L, Rosa S, Guo Y, Wang Z, et al. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 11105–14. https://doi.org/10.1109/cvpr42600.2020.01112
  22. Han X-F, Cheng H, Jiang H, He D, Xiao G. PCB-RandNet: Rethinking Random Sampling for LiDAR Semantic Segmentation in Autonomous Driving Scene. In: 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. 4435–41. https://doi.org/10.1109/icra57147.2024.10610105
  23. Zhang Y, Zhou Z, David P, Yue X, Xi Z, Gong B, et al. PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 9598–607. https://doi.org/10.1109/cvpr42600.2020.00962
  24. Tang H, Liu Z, Zhao S, Lin Y, Lin J, Wang H, et al. Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution. Lecture Notes in Computer Science. Springer International Publishing. 2020. p. 685–702. https://doi.org/10.1007/978-3-030-58604-1_41
  25. Zhou H, Zhu X, Song X, Ma Y, Wang Z, Li H, et al. Cylinder3D: An effective 3D framework for driving-scene LiDAR semantic segmentation. arXiv preprint. 2020. https://arxiv.org/abs/2008.01550
  26. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 3431–40. https://doi.org/10.1109/cvpr.2015.7298965
  27. Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J. UNet++: A nested U-Net architecture for medical image segmentation. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer. 2018. p. 3–11.
  28. Liu X, Gao P, Yu T, Wang F, Yuan R-Y. CSWin-UNet: Transformer UNet with cross-shaped windows for medical image segmentation. Information Fusion. 2025;113:102634.
  29. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint. 2014. https://arxiv.org/abs/1412.7062
  30. Bai Z, Jing J. Mobile-Deeplab: a lightweight pixel segmentation-based method for fabric defect detection. J Intell Manuf. 2023;35(7):3315–30.
  31. Xu J, Xiong Z, Bhattacharyya SP. PIDNet: A Real-time Semantic Segmentation Network Inspired by PID Controllers. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 19529–39. https://doi.org/10.1109/cvpr52729.2023.01871
  32. Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint. 2016. https://arxiv.org/abs/1602.07360
  33. Xu C, Wu B, Wang Z, Zhan W, Vajda P, Keutzer K, et al. SqueezeSegV3: Spatially-Adaptive Convolution for Efficient Point-Cloud Segmentation. Lecture Notes in Computer Science. Springer International Publishing. 2020. p. 1–19. https://doi.org/10.1007/978-3-030-58604-1_1
  34. Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv preprint. 2018. https://doi.org/10.48550/arXiv.1804.02767
  35. Li S, Chen X, Liu Y, Dai D, Stachniss C, Gall J. Multi-Scale Interaction for Real-Time LiDAR Data Segmentation on an Embedded Platform. IEEE Robot Autom Lett. 2022;7(2):738–45.
  36. Alonso I, Riazuelo L, Montesano L, Murillo AC. 3D-MiniNet: Learning a 2D Representation From Point Clouds for Fast and Efficient 3D LIDAR Semantic Segmentation. IEEE Robot Autom Lett. 2020;5(4):5432–9.
  37. Berman M, Triki AR, Blaschko MB. The Lovász-Softmax Loss: A Tractable Surrogate for the Optimization of the Intersection-Over-Union Measure in Neural Networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. 4413–21. https://doi.org/10.1109/cvpr.2018.00464
  38. Everingham M, Eslami SMA, Van Gool L, Williams CKI, Winn J, Zisserman A. The Pascal Visual Object Classes Challenge: A Retrospective. Int J Comput Vis. 2014;111(1):98–136.
  39. Su H, Jampani V, Sun D, Maji S, Kalogerakis E, Yang M-H, et al. SPLATNet: Sparse Lattice Networks for Point Cloud Processing. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. 2530–9. https://doi.org/10.1109/cvpr.2018.00268
  40. Zheng T, Chen J, Feng W, Yu C. A Graph Aggregation Convolution and Attention Mechanism Based Semantic Segmentation Method for Sparse Lidar Point Cloud Data. IEEE Access. 2024;12:10459–69.
  41. Wang Y, Miao J, Du A, Gu X, Pang S. TriEn-Net: Non-parametric Representation Learning for Large-Scale Point Cloud Semantic Segmentation. Lecture Notes in Computer Science. Springer Nature Singapore. 2024. p. 417–30. https://doi.org/10.1007/978-981-97-8508-7_29
  42. Choy C, Gwak J, Savarese S. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 3070–9. https://doi.org/10.1109/cvpr.2019.00319
  43. Yan X, Gao J, Li J, Zhang R, Li Z, Huang R, et al. Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion. AAAI. 2021;35(4):3101–9.
  44. Yang M, Yu K, Zhang C, Li Z, Yang K. DenseASPP for Semantic Segmentation in Street Scenes. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. 3684–92. https://doi.org/10.1109/cvpr.2018.00388
  45. Song W, Liu Z, Guo Y, Sun S, Zu G, Li M. DGPolarNet: Dynamic Graph Convolution Network for LiDAR Point Cloud Semantic Segmentation on Polar BEV. Remote Sensing. 2022;14(15):3825.
  46. Wen S, Wang T, Tao S. Hybrid CNN-LSTM Architecture for LiDAR Point Clouds Semantic Segmentation. IEEE Robot Autom Lett. 2022;7(3):5811–8.
  47. Sun X, Song S, Miao Z, Tang P, Ai L. LiDAR Point Clouds Semantic Segmentation in Autonomous Driving Based on Asymmetrical Convolution. Electronics. 2023;12(24):4926.
  48. Ando A, Gidaris S, Bursuc A, Puy G, Boulch A, Marlet R. RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 5240–50. https://doi.org/10.1109/cvpr52729.2023.00507
  49. Ding B. LENet: Lightweight and efficient LiDAR semantic segmentation using multi-scale convolution attention. arXiv preprint. 2023. https://doi.org/10.48550/arXiv.2301.04275
  50. Wang S, Zhu J, Zhang R. Meta-RangeSeg: LiDAR Sequence Semantic Segmentation Using Multiple Feature Aggregation. IEEE Robot Autom Lett. 2022;7(4):9739–46.
  51. Li S, Liu Y, Gall J. Rethinking 3-D LiDAR point cloud segmentation. IEEE Transactions on Neural Networks and Learning Systems. 2021.