
Method for assessing rodent infestation in plateau based on SegFormer

  • Xiangjie Huang,

    Roles Data curation, Formal analysis, Methodology, Validation, Writing – original draft, Writing – review & editing

    Affiliations School of Computer Technology and Application, Qinghai University, Xining, Qinghai, China, Intelligent Computing and Application Laboratory of Qinghai Province, Qinghai University, Xining, Qinghai, China

  • Guoying Zhang,

    Roles Data curation, Investigation

    Affiliation Joint Logistics Support Force 941th Hospital, People’s Liberation Army, Xining, Qinghai, China

  • Chunmei Li ,

    Roles Data curation, Funding acquisition, Investigation, Writing – review & editing

    li_chm@qhu.edu.cn

    Affiliations School of Computer Technology and Application, Qinghai University, Xining, Qinghai, China, Intelligent Computing and Application Laboratory of Qinghai Province, Qinghai University, Xining, Qinghai, China, School of Computer and Information Science, Qinghai Institute of Technology, Xining, Qinghai, China

  • Yaosheng Han,

    Roles Data curation

    Affiliations School of Computer Technology and Application, Qinghai University, Xining, Qinghai, China, Intelligent Computing and Application Laboratory of Qinghai Province, Qinghai University, Xining, Qinghai, China

  • Qing Dong,

    Roles Data curation

    Affiliations School of Computer Technology and Application, Qinghai University, Xining, Qinghai, China, Intelligent Computing and Application Laboratory of Qinghai Province, Qinghai University, Xining, Qinghai, China

  • Hao Wang

    Roles Data curation

    Affiliations School of Computer Technology and Application, Qinghai University, Xining, Qinghai, China, Intelligent Computing and Application Laboratory of Qinghai Province, Qinghai University, Xining, Qinghai, China

Abstract

Rodent infestation is a critical factor contributing to grassland degradation, which significantly negatively affects grassland ecosystems. To assess rodent infestation on the plateau, there is an urgent need for a scientifically sound and effective method to detect the distribution of rodent burrows. In response, this study proposes a semantic segmentation approach utilizing the SegFormer model to detect rodent infestation in highland areas. First, we used an unmanned aerial vehicle to collect video data from the plateau and constructed a rodent burrows dataset after processing and precise labeling. Second, to address the issue of SegFormer’s suboptimal performance in segmenting small targets within complex backgrounds and among similar objects, we implemented targeted modifications to enhance its effectiveness for this task. Incorporating the efficient multi-scale attention (EMA) mechanism into SegFormer’s encoder improves the model’s capacity to capture global contextual information. Meanwhile, integrating the multi-kernel convolution feed-forward network (MCFN) into the decoder optimizes the problem of detail recovery and fusion of multi-scale features. We name this method EM-SegFormer (Efficient Multi-scale SegFormer). The experimental results demonstrate that the method achieves relatively good performance on the rodent burrows dataset. This study introduces a novel approach for plateau rodent infestation detection and offers reliable technical support for grassland restoration and management.

Introduction

The Qinghai–Tibet Plateau, located in the south-central part of Asia, plays a crucial role in China’s water resources and ecological security [1]. This plateau is predominantly covered by grasslands, which account for about 60% of its total area [2]. Qinghai Province is located in the northeastern part of the Qinghai-Tibet Plateau and is one of the most important natural grassland areas in China, as well as a major grazing region [3]. However, due to its remote location and challenging accessibility, obtaining data for environmental monitoring and ecological management in this region has always been difficult. Among the various types of grasslands, alpine meadows and alpine grasslands constitute the main portion of Qinghai Province’s natural grasslands [4]. As the core barrier to ecological security on the Qinghai-Tibet Plateau and even in China, the alpine grassland ecosystem provides several important ecological services, including water conservation, biodiversity protection, climate regulation, and support for local animal husbandry.

Since 1990, irrational human activities have gradually intensified and expanded, leading to the ongoing degradation of the grassland ecosystem in Qinghai Province [5]. At the same time, the combined effects of abnormal climate change, rodent infestation, and other natural factors have further worsened the condition of the grasslands, posing a serious threat to the stability of the province’s ecological environment and having far-reaching impacts on sustainable economic and social development [6]. Among these, grassland rodent infestation is a significant ecological issue that emerges during the development and utilization of grasslands. Over the years, the State has invested large sums of money in implementing highland grassland degradation control projects through various technical means, such as fence construction, sand control, and grass planting, which have greatly improved the plateau’s ecology. However, many areas are still plagued by serious rodent infestation, which restricts the full recovery of grassland ecosystems. Relevant studies have shown that large areas of degraded grassland often become optimal habitats for rodents such as Ochotona curzoniae and Eospalax baileyi [7, 8]. The nibbling and frequent digging activities of these rodents have a serious impact on grassland vegetation and its substrate [9], creating a self-reinforcing vicious cycle. Consequently, rodent infestation has become a significant biological disaster that degrades the grassland’s ecological environment and hampers the sustainable development of animal husbandry. Therefore, it is particularly important to adopt scientific assessment methods to accurately define the degree of damage caused by rodents to grassland ecosystems.
This not only helps to understand the occurrence and development trend of rodent damage but also provides a theoretical basis for the formulation of effective management strategies, thus achieving precise management and sustainable protection of grassland resources.

Traditional machine learning algorithms often face performance bottlenecks when dealing with large-scale data, making it difficult to effectively discover complex and nonlinear patterns in the data [10]. In contrast, deep learning techniques can address the challenges of large-scale data processing more efficiently with their powerful modeling capabilities and adaptive properties. In recent years, deep learning has been widely used in the field of ecological resources, covering such aspects as species monitoring and plant disease recognition. For example, Gomez et al. [11] utilized four deep convolutional neural networks—AlexNet [12], VGGNet [13], GoogLeNet [14], and ResNet [15]—to process and recognize wildlife images captured by cameras. Qiu et al. [16] employed Mask R-CNN [17] with ResNet50 as the backbone network to detect diseased regions in wheat. Mahum et al. [18] applied an enhanced DenseNet [19] to detect and classify potato leaf diseases. The above study highlights the broad potential of deep learning techniques in ecological resource monitoring and management.

However, convolutional architectures are constrained by local receptive fields, which limits their ability to model global dependencies and long-range interactions; they struggle to adequately capture multi-scale features and global contextual information, restricting their expressive ability in complex scenes. Building on the success of the Transformer [20] in natural language processing, Dosovitskiy et al. [21] explored its application in computer vision and proposed the Vision Transformer (ViT). Inspired by ViT, Zheng et al. [22] proposed the Segmentation Transformer (SETR), validating the feasibility and potential of the Transformer for semantic segmentation tasks. On this basis, semantic segmentation models based on the Transformer, such as Segmenter [23], SegFormer [24], MaskFormer [25], and Mask2Former [26], have continued to emerge. These models have made significant progress in improving segmentation performance and handling complex scenes.

This study aims to accurately identify and analyze rodent infestation in the plateau region by leveraging semantic segmentation technology in combination with the specific characteristics of rodent activity in Qinghai. The number of rodent burrows is a reliable indicator of rodent population density and serves as an important metric for assessing infestation levels. By applying semantic segmentation technology to images of rodent burrows in the plateau grasslands of Qinghai Province, this research calculates the proportion of rodent burrows in the local area, thereby inferring the extent of rodent-induced grassland degradation. SegFormer integrates a lightweight architecture with an efficient self-attention mechanism to ensure high accuracy while minimizing computational cost. In this study, we propose EM-SegFormer, an improved SegFormer model designed to more effectively capture subtle features and long-range dependencies in rodent burrow images. Particularly in scenarios with complex backgrounds and diverse terrains, the improved model demonstrates a significant enhancement in the accuracy of rodent burrow detection. The main contributions of this study are as follows:

  1. We constructed a rodent burrows dataset from the highland grasslands of Qinghai Province, comprising 6,864 images (5,504 for training and 1,360 for testing), providing strong sample support for segmentation tasks.
  2. We propose an improved rodent burrow segmentation model, EM-SegFormer, which incorporates the EMA module in the encoder to enhance global feature capture and cross-channel interactions. The MCFN is integrated into the decoder to improve feature fusion and detail recovery, enhancing segmentation accuracy.
  3. Extensive experiments demonstrate that our model outperforms SegFormer and other mainstream models in rodent burrow segmentation task, and also exhibits strong generalization capability. Furthermore, significance analysis confirms that the improvements are statistically significant.
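The degradation indicator described earlier, the proportion of rodent-burrow pixels in a segmented image, can be computed directly from a predicted class mask. A minimal numpy sketch; the class indices here are assumptions for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical class indices (for illustration only): 0 = background,
# 1 = burrow0 (intact), 2 = burrow1 (nibbled), 3 = stone, 4 = cow dung.
BURROW_CLASSES = (1, 2)

def burrow_ratio(mask: np.ndarray) -> float:
    """Fraction of pixels labeled as rodent burrows in a class-index mask."""
    burrow_pixels = np.isin(mask, BURROW_CLASSES).sum()
    return float(burrow_pixels / mask.size)

# Example: a 4x4 predicted mask with 4 burrow pixels out of 16
mask = np.array([[0, 0, 1, 1],
                 [0, 3, 2, 2],
                 [0, 0, 0, 4],
                 [0, 0, 0, 0]])
print(burrow_ratio(mask))  # 0.25
```

Averaging this ratio over all tiles covering a surveyed plot gives a per-area estimate that can then be mapped to an infestation grade.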

Materials and methods

Dataset description

Led by grassland workers, we conducted fieldwork in the highland areas of Qinghai Province. Using the DJI Air 3 drone (DJI, Shenzhen, China), we captured video footage from various times, locations, and viewpoints, gathering extensive data on rodent burrows. The videos were recorded in high clarity and prominently featured rodent burrows, providing valuable data for subsequent studies.

In this study, OpenCV was used to extract frames from all the captured video files. Blurred and highly repetitive images were eliminated, resulting in a dataset of 6,864 valid rodent burrow images. The images were then labeled using the X-AnyLabeling tool, with a total of four categories: burrow0 (intact rodent burrow), burrow1 (nibbled rodent burrow), stone, and cow dung. The annotation work was carried out with the assistance of grassland workers, who have extensive experience in identifying rodent burrows. All the images were divided into a training set of 5,504 images and a validation set of 1,360 images at a ratio of 8:2, creating the rodent burrows dataset used in this study. A total of 50,255 instances were labeled in this dataset, and each image contains at least one instance of the four categories. Fig 1 shows several sample images from the dataset, while Fig 2 illustrates the distribution of labels across each category. It can be observed that the edges of intact rodent burrows are relatively smooth, while the entrances of nibbled rodent burrows are damaged and exhibit signs of enlargement. Additionally, cow dung and rodent burrows look extremely similar in appearance, which poses an additional challenge for the model.
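A deterministic 8:2 split of the extracted frames can be sketched with the standard library alone. Note that a plain 80% cut of 6,864 images yields 5,491/1,373; the paper's exact 5,504/1,360 division implies slightly different rounding or grouping, so this is an illustrative sketch rather than the authors' procedure:

```python
import random

def split_dataset(filenames, train_ratio=0.8, seed=42):
    """Shuffle deterministically and split into train/validation lists."""
    files = sorted(filenames)          # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train_ratio)
    return files[:n_train], files[n_train:]

# 6,864 extracted frames (hypothetical filenames)
frames = [f"frame_{i:05d}.jpg" for i in range(6864)]
train, val = split_dataset(frames)
print(len(train), len(val))  # 5491 1373
```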

Fig 1. Sample images from the rodent burrows dataset.

The instances in this figure are framed by solid red lines. From left to right in Fig 1(a), the four instances represent stone, stone, cow dung, and burrow1. In Fig 1(b), the two instances, from top to bottom, are burrow0 and burrow1.

https://doi.org/10.1371/journal.pone.0325738.g001

Fig 2. Dataset label distribution.

The instance counts for the categories burrow0, burrow1, stone, and cow dung are 20,685, 18,887, 8,561, and 2,122, respectively.

https://doi.org/10.1371/journal.pone.0325738.g002

The EM-SegFormer model

With the successful application of ViT in computer vision, an increasing number of researchers have begun exploring the potential of Transformer models in semantic segmentation tasks. SegFormer is a simple, efficient, and powerful semantic segmentation model based on the Transformer architecture. By effectively combining the advantages of Transformer, SegFormer achieves excellent performance in various visual segmentation tasks.

SegFormer redesigns the encoder and decoder to create a hierarchical Transformer encoder that does not rely on positional encoding, along with a lightweight fully connected multilayer perceptron (MLP) decoder architecture. For different task requirements, the researchers designed a series of models, from SegFormer-B0 to SegFormer-B5, to meet the performance and efficiency needs of various scenarios. Among these, the SegFormer-B0 model has fewer parameters, making it suitable for resource-constrained environments, while the SegFormer-B5 model offers superior performance due to its larger capacity. In this study, given the limited computational resources, we opted to improve the SegFormer-B0 model to achieve efficient identification of rodent burrows while balancing performance and computational cost.

In this paper, the encoder and decoder of the SegFormer-B0 model are each improved, with their structures shown in Fig 3.

The hierarchical Transformer encoder is employed to extract multi-level features from the image, including coarse features at high resolution and fine-grained features at low resolution. Each of the four stages in the Transformer Block stacks multiple Efficient Multi-head Self-Attention (EMSA) modules and a mixed feed-forward network. Unlike the traditional global attention mechanism, EMSA introduces local attention to efficiently implement multi-head attention, reducing computational complexity and memory consumption. However, this localized and efficient computational strategy may lead to insufficient information interaction between different channels. Given the small proportion of rodent burrows in the images and the subtle semantic and edge information, EMSA may struggle to adequately capture global information in tasks requiring precise global dependencies. To address these limitations, this paper integrates two EMA modules after the Transformer Block at each stage. This integration compensates for EMSA’s shortcomings in cross-channel information interaction and global information capture, enhancing the model’s performance in semantic segmentation tasks.

The decoder of SegFormer features a straightforward design, primarily consisting of MLP layers. The ALL-MLP decoder handles the fusion of multilevel features extracted by the encoder through the following steps. First, the four feature maps are standardized to the same number of channels through an MLP layer. Next, these feature maps are upsampled to match the resolution of the first feature map and concatenated. Then, the concatenated feature maps are fused using another MLP layer. Finally, the fused features are processed through an additional MLP layer to produce the final predictive segmentation mask. This simplified design reduces the model’s computational complexity and parameter count, enhancing the network’s flexibility and efficiency. As a result, it is well-suited for a wide range of segmentation tasks. Nevertheless, it also comes with certain limitations. The ALL-MLP decoder cannot capture complex spatial context information through local receptive fields as effectively as convolutional decoders, and the simple concatenation of feature maps from different stages makes it difficult to fuse this information fully.
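The four decoder steps above can be sketched in PyTorch. This is a simplified reading of the ALL-MLP design, not the official implementation; the channel widths follow SegFormer-B0, and the class count of 5 (four labeled categories plus background) is our assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder(nn.Module):
    """Sketch of SegFormer's ALL-MLP decoder: unify channels, upsample,
    concatenate, fuse, and predict (1x1 convs act as per-pixel MLPs)."""
    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=256, num_classes=5):
        super().__init__()
        # Step 1: one projection per stage to a common channel width
        self.proj = nn.ModuleList(nn.Conv2d(c, embed_dim, 1) for c in in_channels)
        # Step 3: fuse the concatenated maps with another MLP layer
        self.fuse = nn.Conv2d(embed_dim * 4, embed_dim, 1)
        # Step 4: final MLP producing the segmentation logits
        self.classify = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        target = feats[0].shape[2:]   # resolution of the first (largest) map
        # Step 2: upsample every projected map to the first map's resolution
        ups = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.classify(self.fuse(torch.cat(ups, dim=1)))

# Dummy multi-scale features at 1/4, 1/8, 1/16, 1/32 of a 128x128 input
feats = [torch.randn(1, c, s, s) for c, s in zip((32, 64, 160, 256), (32, 16, 8, 4))]
out = AllMLPDecoder()(feats)
print(out.shape)  # torch.Size([1, 5, 32, 32])
```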

Efficient multi-scale attention mechanism

In the field of computer vision, the attention mechanism plays an important role in improving model performance. By modeling the selective attention of the human visual system, the technique can efficiently focus on the most relevant features in the data, significantly enhancing both the efficiency and accuracy of the model. Coordinate attention [27] combines channel and spatial attention to optimize feature representation across both channel and spatial dimensions, enhancing the performance of downstream tasks such as image classification, object detection, and image segmentation. Yet, a process that models cross-channel relationships through channel dimensionality reduction essentially compresses the features, which may result in the loss of some information from the high-dimensional channels, thereby affecting the extraction of deeper feature representations. To address this, researchers have proposed the EMA [28] mechanism without dimensionality reduction, the structure of which is shown in Fig 4.

EMA divides the input feature map X into N groups of sub-features along the channel dimension, with each group learning different semantic features. It then extracts attention weights for the grouped feature maps through three parallel branches. On the one hand, in the two 1×1 branches, a 1D global average pooling operation encodes the channels in the horizontal and vertical directions, respectively. The encoded features are concatenated and passed through a 1×1 convolution without dimensionality reduction. The 1D global average pooling operations in the two directions can be represented by Eqs 1 and 2.

\[ z_c^{H}(H) = \frac{1}{W}\sum_{0 \le i < W} x_c(H, i) \tag{1} \]

\[ z_c^{W}(W) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, W) \tag{2} \]

where H and W represent the spatial dimensions of the input features, c denotes the channel index, and x_c refers to the input features of the c-th channel. The output of the 1×1 convolution is decomposed into two vectors, followed by the application of the Sigmoid function to generate a nonlinear mapping. Finally, the two channel attention maps are aggregated by multiplication. On the other hand, in the 3×3 branch, a 3×3 convolutional kernel is used to capture multi-scale feature representations. This cross-channel interaction effectively captures dependencies among all channels while preserving accurate spatial information, all with minimal computational cost.

EMA uses cross-spatial learning to aggregate feature information across different spatial dimensions. Specifically, the output of the 1×1 branch encodes global spatial information through 2D global average pooling, followed by a Softmax function. The formula for 2D global average pooling is presented in Eq 3.

\[ z_c = \frac{1}{H \times W}\sum_{j=1}^{H}\sum_{i=1}^{W} x_c(i, j) \tag{3} \]

The output of the 3×3 branch is directly reshaped to the corresponding dimensions, and then the outputs of these two branches are aggregated through a matrix product to generate the first spatial attention map. A similar operation generates a second spatial attention map by performing 2D global average pooling on the output of the 3×3 branch, while the output of the 1×1 branch is directly reshaped to the corresponding dimensions; the two are again aggregated using a matrix product. In the end, the output feature maps within each group are computed by aggregating the two spatial attention weight maps and applying the Sigmoid function. The final output of EMA has the same size as the input feature map X. EMA enhances cross-channel and cross-spatial feature interactions through finer feature aggregation and spatial attention, allowing the model to better extract global contextual information.
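The grouped branches and cross-spatial learning described above can be sketched in PyTorch. This follows the commonly circulated EMA reference implementation; the group count is a tunable assumption, not a value from the paper:

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Sketch of the Efficient Multi-scale Attention block."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        c = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # 1D GAP along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # 1D GAP along height
        self.gn = nn.GroupNorm(c, c)
        self.conv1x1 = nn.Conv2d(c, c, 1)               # no dimensionality reduction
        self.conv3x3 = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.reshape(b * self.g, c // self.g, h, w)
        # 1x1 branches: directional pooling, shared 1x1 conv, Sigmoid gating
        x_h = self.pool_h(x)                            # (bg, c', h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # (bg, c', w, 1)
        y = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        x1 = self.gn(x * y_h.sigmoid() * y_w.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(x)                            # 3x3 multi-scale branch
        # Cross-spatial learning: matrix products between pooled and flat maps
        n = x1.shape[1]
        a1 = torch.softmax(x1.mean((2, 3)).reshape(b * self.g, 1, n), dim=-1)
        a2 = torch.softmax(x2.mean((2, 3)).reshape(b * self.g, 1, n), dim=-1)
        f1 = x2.reshape(b * self.g, n, h * w)
        f2 = x1.reshape(b * self.g, n, h * w)
        weights = (a1 @ f1 + a2 @ f2).reshape(b * self.g, 1, h, w)
        return (x * weights.sigmoid()).reshape(b, c, h, w)

x = torch.randn(2, 32, 16, 16)
y = EMA(32)(x)
print(y.shape)  # torch.Size([2, 32, 16, 16])
```

As the caption of Fig 4 implies, the output keeps the input shape, so the block can be dropped after each Transformer Block without changing the encoder's feature dimensions.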

Multi-kernel convolution feed-forward network

In the decoder architecture of SegFormer, the output feature maps from the four stages of the encoder are directly concatenated after upsampling. This strategy indeed enhances the model’s ability to capture multi-scale features. However, this approach also comes with potential issues: the feature maps from different stages may contain redundant information, and these redundant components could interfere with the model’s decision-making process. In addition, since these feature maps typically have different receptive fields and spatial resolutions, direct concatenation may lead to mismatched and lost information, which could negatively impact the overall performance of the model. To address the above issues, this paper innovatively introduces the MCFN [29] into the SegFormer decoder. The network effectively separates and extracts multi-scale feature information by performing convolutional operations at different scales, which reduces redundant content generated after feature splicing. At the same time, the MCFN also addresses the alignment problem between features at different scales, ensuring that feature maps from different encoder levels can be more accurately fused during the decoding process. This improvement enables the model to capture local details in the image more effectively, enhancing the recognition of small objects or intricate parts. This is particularly beneficial in tasks like segmenting rodent burrows in complex backgrounds, where accurately recognizing boundaries and fine details is crucial.

As shown in Fig 5, MCFN first doubles the number of channels of the input features F1 using a pointwise convolution fpw. Next, a multi-kernel convolution operation fMC is introduced, dividing the features into four branches along the channel dimension. One branch retains the original information, while the other three branches employ depthwise separable convolutions with kernel sizes of 3×3, 5×5, and 7×7, respectively, to extract local information at different scales. Afterward, the feature maps from these four branches are concatenated. Nonlinearity is then introduced through a GELU function, and the number of channels is restored using a pointwise convolution operation to obtain the final output F2. The overall operation can be expressed by Eq 4.

\[ F_2 = f_{pw}\left(\mathrm{GELU}\left(f_{MC}\left(f_{pw}(F_1)\right)\right)\right) \tag{4} \]
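A PyTorch sketch of Eq 4 follows. The exact layer arrangement is our reading of Fig 5; for brevity the three multi-scale branches use plain depthwise convolutions, with the channel mixing of the separable convolution folded into the final pointwise restoration:

```python
import torch
import torch.nn as nn

class MCFN(nn.Module):
    """Sketch of the multi-kernel convolution feed-forward network (Eq 4):
    pointwise expansion, four-branch multi-kernel convolution, GELU,
    and pointwise restoration."""
    def __init__(self, channels):
        super().__init__()
        hidden = channels * 2                 # f_pw doubles the channel count
        branch = hidden // 4                  # four equal branches
        self.expand = nn.Conv2d(channels, hidden, 1)
        # Three depthwise branches at different kernel sizes; one identity branch
        self.dw3 = nn.Conv2d(branch, branch, 3, padding=1, groups=branch)
        self.dw5 = nn.Conv2d(branch, branch, 5, padding=2, groups=branch)
        self.dw7 = nn.Conv2d(branch, branch, 7, padding=3, groups=branch)
        self.act = nn.GELU()
        self.restore = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        x = self.expand(x)
        b0, b1, b2, b3 = torch.chunk(x, 4, dim=1)
        x = torch.cat([b0, self.dw3(b1), self.dw5(b2), self.dw7(b3)], dim=1)
        return self.restore(self.act(x))

x = torch.randn(1, 64, 32, 32)
y = MCFN(64)(x)
print(y.shape)  # torch.Size([1, 64, 32, 32])
```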

Experimental design

Experimental environment and parameters

This experiment is conducted on a Linux operating system with an Intel Xeon E5-2603 v4 CPU and an NVIDIA GeForce GTX 1080 Ti GPU. The deep learning framework used is PyTorch 1.10.1, with Python 3.8 and CUDA 11.3. The model is trained in a distributed manner using 8 GPUs, with a batch size set to 16. In the backbone, the overlapping patch embedding operation is configured with stride values of 4, 2, 2, and 2 across the four stages, respectively. Other training parameters are detailed in Table 1. To ensure the fairness of the experiments, all were conducted with consistent parameter settings, and each model was trained from scratch.

Performance evaluation metrics

To objectively evaluate the model’s performance on the semantic segmentation task, we use evaluation metrics such as mean Intersection over Union (mIoU), precision, recall, and F1-score to assess the model’s improvement. The formulas for these four evaluation metrics are shown in Eqs 5–8, where TP denotes the number of true positive samples, FP denotes the number of false positive samples, and FN denotes the number of false negative samples.

\[ \mathrm{mIoU} = \frac{1}{k}\sum_{i=1}^{k}\frac{TP_i}{TP_i + FP_i + FN_i} \tag{5} \]

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{6} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{7} \]

\[ F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8} \]

Here k denotes the number of categories, and TP_i, FP_i, and FN_i are the per-category counts.

The mIoU measures the degree of overlap between the predicted and ground truth regions, comprehensively evaluating the model’s performance across all categories. It is a key metric for semantic segmentation. Precision evaluates how many of the samples predicted by the model as positive are indeed positive. At the same time, recall assesses how many of the actual positive samples are correctly identified by the model. The F1-score combines both precision and recall, with higher values generally indicating better model stability.
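Eqs 5–8 can be computed directly from predicted and ground-truth label maps. A minimal numpy sketch; since the paper does not state how precision and recall are averaged across categories, the micro-averaged (pixel-pooled) variant below is one common choice:

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Per Eqs 5-8: mean IoU plus micro-averaged precision, recall, F1."""
    ious, tps, fps, fns = [], 0, 0, 0
    for k in range(num_classes):
        tp = np.sum((pred == k) & (gt == k))
        fp = np.sum((pred == k) & (gt != k))
        fn = np.sum((pred != k) & (gt == k))
        ious.append(tp / (tp + fp + fn))
        tps, fps, fns = tps + tp, fps + fp, fns + fn
    precision = tps / (tps + fps)
    recall = tps / (tps + fns)
    f1 = 2 * precision * recall / (precision + recall)
    return float(np.mean(ious)), float(precision), float(recall), float(f1)

# Toy example: class 0 has IoU 1/2, class 1 has IoU 2/3
pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
miou, p, r, f1 = segmentation_metrics(pred, gt, 2)
print(round(miou, 4))  # 0.5833
```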

Results and analysis

Loss function curves

The loss function used by the model is the cross-entropy loss, which effectively measures the classification error at each pixel and is suitable for making independent category predictions at the pixel level, making it well suited to semantic segmentation tasks. Fig 6 illustrates the training loss variation of the EM-SegFormer model on the rodent burrows dataset.

As shown in the figure, the loss value decreases rapidly in the early stages of training, suggesting that the model quickly learns basic features. As training progresses, the loss value gradually stabilizes, eventually fluctuating slightly around 0.02. This indicates that the model is undergoing fine-tuning to further optimize accuracy. However, the effect of these fine adjustments has saturated, signaling that the model has largely converged.
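In PyTorch, pixel-wise cross-entropy for segmentation takes logits of shape (N, C, H, W) against integer class maps of shape (N, H, W). A short sketch with dummy tensors (the class count of 5 is our assumption):

```python
import torch
import torch.nn as nn

# Pixel-wise cross-entropy: the loss averages over every pixel in the batch.
criterion = nn.CrossEntropyLoss()

logits = torch.randn(2, 5, 64, 64, requires_grad=True)  # (N, C, H, W)
target = torch.randint(0, 5, (2, 64, 64))               # (N, H, W) class indices

loss = criterion(logits, target)
loss.backward()   # gradients flow back to every pixel's logits
```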

Ablation study

We designed four sets of ablation experiments to evaluate the impact of different improvements on the performance of the baseline SegFormer model. In Table 2, values are expressed as mean ± standard deviation (SD), calculated from three independent runs. The results are closely clustered and show no wide fluctuations. When the EMA is introduced in the encoder, precision, recall, F1-score, and mIoU improve by 0.42%, 0.78%, 0.63%, and 0.86%, respectively. This improvement is attributed to the fact that the EMA module helps the model extract global contextual information more comprehensively through cross-channel and cross-spatial feature interactions. The MCFN employs multi-kernel convolutional operations to optimize the original ALL-MLP decoder in terms of multi-scale feature fusion and detail recovery. After integrating the MCFN into the decoder, the four metrics improve by 0.17%, 2.01%, 1.16%, and 1.6%, respectively, leading to a marked enhancement of the model’s performance. Notably, the substantial increase in recall highlights that the improved model has strengthened its ability to identify rodent burrows in complex backgrounds, particularly in cases involving small or fuzzy rodent burrows.

Table 2. Enhancement of model metrics by different improvements.

https://doi.org/10.1371/journal.pone.0325738.t002

Finally, our model improves mIoU from 72.40% to 74.68%, recall from 80.86% to 83.36%, F1-score from 83.17% to 84.84%, and precision from 85.84% to 86.49% on the rodent burrows dataset. The relatively small standard deviations indicate that the model exhibits stable performance across experiments, thereby supporting the reliability of the experimental results. The comparison curves of the performance metrics are shown in Fig 7. From the figure, we can visually observe that the performance of EM-SegFormer is comprehensively enhanced over the baseline SegFormer. As the number of iterations increases, the mIoU value gradually rises and stabilizes, suggesting continuous improvement in model performance. Similarly, the F1-score, precision, and recall values also increase, reflecting ongoing optimization in classification performance, positive sample classification, and positive sample identification, respectively. This indicates that the improved model has a lower missed-detection rate and relatively fewer false detections in the prediction results, and can more accurately identify rodent burrows in the grass. In particular, the key metric mIoU improves by 2.28 percentage points, showing that the optimized model has stronger feature extraction and fusion capabilities, which effectively improves the accuracy of rodent burrow segmentation and thus provides more reliable technical support for the assessment of rodent infestation. Fig 8 presents the prediction results of both models. Both models demonstrate effective segmentation results; however, SegFormer exhibits some misdetections, while EM-SegFormer produces slightly more accurate predictions.

Fig 7. Performance metrics curve of SegFormer and EM-SegFormer.

Sub-panel (a) shows the mIoU curve, (b) the F1-score curve, (c) the precision curve, and (d) the recall curve.

https://doi.org/10.1371/journal.pone.0325738.g007

Fig 8. Comparison of prediction results between SegFormer and EM-SegFormer.

Yellow represents stones, blue represents cow dung, red represents nibbled rodent burrows, and green represents intact rodent burrows.

https://doi.org/10.1371/journal.pone.0325738.g008

Significance testing

In order to verify whether the performance improvement of EM-SegFormer over SegFormer is attributable solely to random errors, we conducted significance tests on both models. First, we performed 5 experiments on each model using the rodent burrows dataset, yielding two sets of sample data. We focused on the mIoU, a key metric in semantic segmentation, and the relevant data are presented in Table 3. Multiple sets of experimental data show relatively stable results. Next, we applied an independent samples t-test to analyze the difference in the means of these two sample sets to determine whether any statistically significant difference exists.

In the independent samples t-test, the t-value measures the difference between sample means relative to their variability; larger absolute t-values indicate a more substantial difference between the models. The p-value measures the probability of observing such a difference under random fluctuation alone; smaller p-values suggest that the difference between the new and old models is less likely to be attributed to chance. Typically, a p-value less than 0.05 is considered statistically significant.

Due to the complexity of manually calculating p-values, we typically rely on statistical software or libraries to perform these calculations. In Python, the scipy.stats library provides a comprehensive set of probability distributions and statistical functions, which we use to compute the t-value and p-value for the independent samples t-test. Based on the training results, we calculated t = 23.547 and p = 0.00002. This large t-value indicates that the difference in the mean mIoU values between the two sample groups is highly significant and well beyond the range of random fluctuations. Additionally, the p-value is much smaller than the commonly used significance level of 0.05 and even smaller than 0.01, providing strong evidence that the observed mean differences are statistically significant and not due to chance.
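The test itself is a one-line call to scipy.stats.ttest_ind. The mIoU samples below are illustrative placeholders, not the values from Table 3:

```python
from scipy import stats

# Hypothetical mIoU values from five runs of each model (illustrative only)
miou_em_segformer = [74.6, 74.7, 74.8, 74.6, 74.7]
miou_segformer    = [72.3, 72.4, 72.5, 72.4, 72.4]

# Independent samples t-test on the two groups of run results
t, p = stats.ttest_ind(miou_em_segformer, miou_segformer)
print(p < 0.05)  # True: the mean difference is statistically significant
```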

With this significance test, we confirm that the improvement in performance of EM-SegFormer is reliable and that the improvement is statistically significant, further validating the effectiveness of our model improvement.

Comparative experiment

To further illustrate the model’s excellent performance in segmenting rodent burrows, we also selected several other mature semantic segmentation models to compare with EM-SegFormer. The experimental results, as shown in Table 4, indicate that EM-SegFormer achieves the highest mIoU on the rodent burrows dataset, despite utilizing the fewest parameters.

Table 4. Performance of different models on the rodent burrows dataset.

https://doi.org/10.1371/journal.pone.0325738.t004

Specifically, EM-SegFormer records the highest mIoU (74.75%) and F1-score (84.87%), slightly surpassing DeepLabV3+ on these metrics, while also achieving the highest recall (83.30%), highlighting its ability to identify true positives more effectively. Although its precision is marginally lower than that of DeepLabV3+, it remains competitive. Compared with the classical UNet, our approach improves on every performance metric. Notably, EM-SegFormer achieves this with the fewest parameters, far fewer than models such as PIDNet and PoolFormer. Taken together, these results make the proposed method well suited to resource-constrained environments, and its strong overall performance demonstrates the effectiveness of our improvements and innovations in model design, offering a solid solution for rodent infestation assessment.
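All of the reported metrics derive from the pixel-level confusion matrix. A minimal sketch of how per-class IoU, mIoU, macro precision, recall, and F1 are obtained from it (binary background/burrow case; the example matrix is made up for illustration) might look like:

```python
def segmentation_metrics(conf):
    """Compute (mIoU, precision, recall, F1) from a confusion matrix
    where conf[i][j] = number of pixels of true class i predicted as j.
    Precision/recall/F1 are macro-averaged over classes."""
    n = len(conf)
    ious, precs, recs = [], [], []
    for c in range(n):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(n)) - tp  # predicted c, truly other
        fn = sum(conf[c]) - tp                       # truly c, predicted other
        ious.append(tp / (tp + fp + fn))
        precs.append(tp / (tp + fp))
        recs.append(tp / (tp + fn))
    f1s = [2 * p * r / (p + r) for p, r in zip(precs, recs)]
    return (sum(ious) / n, sum(precs) / n, sum(recs) / n, sum(f1s) / n)

# Hypothetical 2-class matrix: rows = truth, columns = prediction
miou, prec, rec, f1 = segmentation_metrics([[8, 2], [1, 9]])
```

This is one common macro-averaged convention; implementations differ in whether F1 is averaged per class (as here) or computed from the macro precision and recall.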

Generalization evaluation

In this study, the Cityscapes dataset is introduced to evaluate the generalization ability of the proposed model under varying data distributions. This dataset comprehensively captures the complexity of real urban street environments and includes video sequences recorded in 50 different cities [35]. A total of 5000 images are provided with high-quality pixel-level annotations, consisting of 2975 training images, 500 validation images, and 1525 test images, covering 19 semantic categories.

The applicability of the model to other complex scenarios is validated through experiments on the Cityscapes dataset. Specifically, three sets of experiments were carried out with EM-SegFormer and with SegFormer. The results are summarized in Table 5, presented as mean ± SD. They demonstrate that the proposed model achieves improvements of varying degrees across multiple evaluation metrics compared with the benchmark model, with lower result fluctuation and more stable performance gains. In particular, mIoU increases from 72.82% to 74.41%, further indicating that the model also delivers strong semantic segmentation performance in urban street scenes. These findings highlight the model’s generalization ability and robustness, confirming its broad applicability and potential for real-world deployment.

Table 5. Evaluation results on the Cityscapes dataset using EM-SegFormer and SegFormer.

https://doi.org/10.1371/journal.pone.0325738.t005

Discussion

The traditional method of counting rodent burrows relies on manual field surveys, which are time-consuming and labor-intensive, and which cannot accurately monitor rodent infestations across vast grasslands with widespread rodent populations. Drone-based hyperspectral technology improves detection accuracy and reduces labor costs [36, 37], but it remains expensive, involves complex data processing, is sensitive to environmental conditions, and performs poorly on small targets and shape recognition. In this study, our proposed semantic segmentation method based on the SegFormer model performs well for rodent infestation monitoring in highland areas. Compared with other common image segmentation methods, such as UNet, Mask2Former, and PoolFormer, our model has clear advantages in segmentation tasks with small targets and complex backgrounds. UNet often suffers degraded segmentation accuracy on images with small targets [38]. Mask2Former has relatively high computational complexity, leading to longer training times and greater resource consumption, making it unsuitable under our available conditions. PoolFormer, as a lightweight architecture, replaces complex attention mechanisms with pooling operations, but pooling may discard small-scale features; our experiments likewise show that PoolFormer is less effective than EM-SegFormer on the plateau rodent infestation monitoring task.

Although EM-SegFormer performs well while keeping its parameter count low, the introduced improvements add some computational cost. According to the experimental log, training the original SegFormer model took 11 hours, whereas training the improved model took 20 hours. This increase does not significantly affect real-world applications, since training is performed only once, and the improved model’s higher accuracy pays off in the subsequent inference phase. During inference, we tested 1360 images; the total inference time was 103.84 seconds, averaging 0.0764 seconds per image, with an average memory usage of 1648.81 MB. These results demonstrate that the efficiency and resource consumption of the improved model at inference remain within acceptable limits, confirming its feasibility for practical applications.
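A per-image timing measurement of this kind can be sketched as follows; `model` and `images` are hypothetical stand-ins for the trained network and the 1360-image test set:

```python
import time

def average_inference_time(model, images):
    """Run the model once over the whole test set and return
    (total seconds, seconds per image), mirroring the benchmark above
    (e.g. 103.84 s over 1360 images gives about 0.0764 s per image)."""
    start = time.perf_counter()
    for img in images:
        model(img)
    total = time.perf_counter() - start
    return total, total / len(images)

# Usage with a trivial identity "model" as a placeholder:
total, per_image = average_inference_time(lambda x: x, list(range(1000)))
```

`time.perf_counter()` is preferred over `time.time()` here because it is a monotonic, high-resolution clock intended for interval measurement.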

The method proposed in this study aims to more effectively reveal the relationships among the behavior, population dynamics, and ecological impacts of plateau rodents through an intelligent approach. It offers strong technical support for ecological and environmental research and is particularly well suited to high-altitude regions. By providing automated monitoring tools, the method is expected to substantially reduce the physical burden and visual fatigue experienced by grassland workers operating under harsh environmental conditions, thereby enhancing operational efficiency and safety. Nevertheless, the deployment of such technologies must prioritize animal welfare to ensure that target species and their habitats are not disturbed or harmed. In certain scenarios, it is also necessary to mitigate the negative ecological impacts of rodent activity through restoration measures such as artificial grass planting and the installation of protective fencing. Moreover, data collection and model development should strictly adhere to relevant ethical guidelines to avoid misuse in inappropriate contexts and to prevent potential ecological risks [39].

Future research should further consider the environmental and social consequences that may arise as this technology spreads, and actively promote the responsible, standardized application of artificial intelligence in ecological conservation. For instance, combining artificial vegetation restoration with intelligent monitoring can help protect plant life while maintaining plateau rodent populations, supporting biodiversity and ecological balance. We will also continue to explore lightweight design schemes such as knowledge distillation [40] and network pruning [41] to further reduce computational complexity. We plan to continuously expand the dataset to enhance the model’s generalization ability and improve its adaptability to real-world scenarios, while fully leveraging the advantages of the Transformer architecture.

Conclusion

In this study, semantic segmentation was applied to monitor and evaluate rodent infestation in highland grasslands. Through field research, we constructed a dataset of rodent burrows and introduced the EMA mechanism and the MCFN module into SegFormer to enhance the model’s segmentation ability in complex backgrounds. The experimental results show that the improved EM-SegFormer model achieves a significant gain in segmentation accuracy, further validating its superiority. These results not only provide a more accurate technical means for monitoring rodent infestation on the plateau, but also lay a foundation for applying semantic segmentation technology in the field of ecological resources. In the future, we will continue to optimize the model’s performance, promote the development of intelligent monitoring technology, and provide more reliable technical support for accurate assessment and efficient management of the ecological environment.

Acknowledgments

We would like to express our sincere thanks to the High Performance Computing Center of Qinghai University for providing the experimental environment.

References

  1. Zhou H, Yang X, Zhou C, Shao X, Shi Z, Li H. Alpine grassland degradation and its restoration in the Qinghai–Tibet plateau. Grasses. 2023;2(1):31–46.
  2. He J, Liu Z, Yao T, Sun S, Lü Z, Hu X. Analysis of the main constraints and restoration techniques of degraded grassland on the Tibetan Plateau. Sci Technol Rev. 2020;38(17):66–80.
  3. Wang Y, Tang W, Li S, Zhao H, Xie J, Ma C. Change in grassland productivity in Qinghai Province and its driving factors. Acta Pratacult Sinica. 2022;31(2):1.
  4. Li W, Jiu C, Tan Z, Ma X, Chen Q. Natural grassland productivity and the livestock-feeds balance in Qinghai Province. Resour Sci. 2012;34:367–72.
  5. Luo L, Ma W, Zhuang Y, Zhang Y, Yi S, Xu J. The impacts of climate change and human activities on alpine vegetation and permafrost in the Qinghai-Tibet Engineering Corridor. Ecol Indicat. 2018;93:24–35.
  6. Wei X, Wei W, Liu C. Spatiotemporal variation of grassland vegetation and its relationship with human activities in Qinghai Province in recent 40 years. Chinese J Ecol. 2021;40(8):2541.
  7. Sun F, Long R, Guo Z, Liu W, Gan Y, Chen W. Effects of rodents activities on plant community and soil environment in alpine meadow. Pratacult Sci. 2011;28(1):146–51.
  8. Wang Y, Bao G, Wang H, Zeng H, Li J. Effect of burrowing activity of Myospalax baileyi on plant community structure in different grazing systems. Acta Agrestia Sinica. 2018;26(1):134.
  9. Han R, Luo Z, Zhao Z, Xiao N, Shi N, Sun G. Simulation of potential habitat and prediction of dispersal route of plateau pika (Ochotona curzoniae) in Qilian Mountains (Qinghai region). J Fujian Agricult Forest Univ (Nat Sci Edn). 2022;51(04):546–54.
  10. Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2:1–21.
  11. Villa AG, Salazar A, Vargas F. Towards automatic wild animal monitoring: identification of animal species in camera-trap images using very deep convolutional neural networks. Ecol Inform. 2017;41:24–32.
  12. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25.
  13. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint. 2014. https://arxiv.org/abs/1409.1556
  14. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. p. 1–9.
  15. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 770–8.
  16. Qiu R, Yang C, Moghimi A, Zhang M, Steffenson BJ, Hirsch CD. Detection of fusarium head blight in wheat using a deep neural network and color imaging. Remote Sens. 2019;11(22):2658.
  17. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. 2017. p. 2961–9.
  18. Mahum R, Munir H, Mughal ZUN, Awais M, Sher Khan F, Saqlain M, et al. A novel framework for potato leaf disease detection using an efficient deep learning model. Hum Ecol Risk Assessm: An Int J. 2023;29(2):303–26.
  19. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 4700–8.
  20. Vaswani A. Attention is all you need. Adv Neural Inf Process Syst. 2017.
  21. Dosovitskiy A. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint. 2020. https://arxiv.org/abs/2010.11929
  22. Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. p. 6881–90.
  23. Strudel R, Garcia R, Laptev I, Schmid C. Segmenter: transformer for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. p. 7262–72.
  24. Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P. SegFormer: simple and efficient design for semantic segmentation with transformers. Adv Neural Inf Process Syst. 2021;34:12077–90.
  25. Cheng B, Schwing A, Kirillov A. Per-pixel classification is not all you need for semantic segmentation. Adv Neural Inf Process Syst. 2021;34:17864–75.
  26. Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R. Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 1290–9.
  27. Hou Q, Zhou D, Feng J. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. p. 13713–22.
  28. Ouyang D, He S, Zhang G, Luo M, Guo H, Zhan J. Efficient multi-scale attention module with cross-spatial learning. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2023. p. 1–5.
  29. Jiang X, Zhang X, Gao N, Deng Y. When fast fourier transform meets transformer for image restoration. In: European Conference on Computer Vision. 2025. p. 381–402.
  30. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5–9, 2015, proceedings, part III. 2015. p. 234–41.
  31. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018. p. 801–18.
  32. Xu J, Xiong Z, Bhattacharyya SP. PIDNet: a real-time semantic segmentation network inspired by PID controllers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 19529–39.
  33. Guo MH, Lu CZ, Hou Q, Liu Z, Cheng MM, Hu SM. Segnext: rethinking convolutional attention design for semantic segmentation. Adv Neural Inf Process Syst. 2022;35:1140–56.
  34. Yu W, Luo M, Zhou P, Si C, Zhou Y, Wang X. Metaformer is actually what you need for vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 10819–29.
  35. Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, et al. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 3213–23.
  36. Zhu X, Bi Y, Liu H, Pi W, Zhang X, Shao Y. Study on the identification method of rat holes in desert grasslands based on hyperspectral images. Chinese J Soil Sci. 2020;51(02):263–8.
  37. Zhang T, Du J, Zhang H, Pi W, Gao X, Zhu X. Research on recognition method of desert steppe rat hole based on unmanned aerial vehicle hyperspectral. J Optoelectron·Laser. 2022;33(02):120–6.
  38. Li X, Wang S, Weng X, Sun D, Zhang H, Jiao H. Remote sensing of floating macroalgae blooms in the east china sea based on UNet deep learning model. Acta Optica Sinica. 2021;41(02):18–26.
  39. Ukpaka P. The ethical implications of using artificial intelligence to manipulate or enhance natural ecosystems and biodiversity. PUB. 2025;(1).
  40. Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. arXiv preprint. 2015. https://arxiv.org/abs/1503.02531
  41. Han S, Pool J, Tran J, Dally W. Learning both weights and connections for efficient neural network. Adv Neural Inf Process Syst. 2015;28.