
MPLNet: Mamba prompt learning network for semantic segmentation of remote sensing images of traditional villages

  • Cheng Zhang,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft

    Affiliations College of Landscape Architecture and Art, Jiangxi Agricultural University, Nanchang, China, Jiangxi Rural Culture Development Research Center, Nanchang, China

  • PeiLin Liu,

    Roles Data curation

    Affiliations College of Landscape Architecture and Art, Jiangxi Agricultural University, Nanchang, China, Jiangxi Rural Culture Development Research Center, Nanchang, China

  • JinLin Teng,

    Roles Methodology, Resources, Software

    Affiliations College of Landscape Architecture and Art, Jiangxi Agricultural University, Nanchang, China, Jiangxi Rural Culture Development Research Center, Nanchang, China

  • Chunqing Liu

    Roles Writing – review & editing

    liuchunqing@jxau.edu.cn

    Affiliations College of Landscape Architecture and Art, Jiangxi Agricultural University, Nanchang, China, Jiangxi Rural Culture Development Research Center, Nanchang, China

Abstract

In recent years, the study of semantic segmentation of remote sensing images (RSI) has gained significant attention due to its critical role in geospatial analysis, agriculture, and forestry. However, existing remote sensing segmentation methods face several challenges: (1) limited dataset diversity and inadequate exploration of traditional village landscapes, resulting in a lack of geospatial representation for these unique environments; (2) inefficiencies in same-layer or cross-layer feature fusion when using convolutional neural networks (CNNs) or transformers, leading to either insufficient spatial modeling or excessive computational demands; and (3) multimodal approaches that improve modeling accuracy but introduce high parameter complexity and computational overhead. To address these issues, we propose the Mamba Prompt Learning Network (MPLNet) for efficient and accurate RSI segmentation, with a strong emphasis on spatial information extraction and GIS-based applications. First, we construct TV-RSI, a highly diverse large-scale data set specifically designed to capture the spatial structures, topographic variations, and land use patterns of traditional villages. Second, we develop the Mamba Fusion Module, which improves geospatial feature utilization by efficiently modeling both intralayer and interlayer spatial relationships, ensuring comprehensive feature extraction. Finally, we introduce prompt learning, which transfers bimodal geospatial knowledge from heavy-weight networks into a lightweight unimodal model, improving segmentation accuracy while maintaining computational efficiency. Extensive experiments on TV-RSI and two publicly available RSI datasets demonstrate that MPLNet achieves state-of-the-art performance with significantly reduced computational costs, making it an ideal solution for geospatial segmentation tasks in GIS-driven remote sensing applications.

Introduction

Semantic segmentation is a computer vision and geospatial analysis task that aims to classify each pixel in an image by assigning a category label to accurately differentiate objects based on their shapes, spatial distributions, and locations. This process enables computers not only to identify objects within an image but also to understand their spatial relationships, making it a crucial tool for Geographic Information Systems (GIS) and remote sensing applications. In recent years, semantic segmentation has shown strong performance in traditional village analysis tasks, including land use classification, cultural landscape identification, preservation of rural heritage, ecological monitoring, and spatial planning [1–3]. However, in traditional village environments, the application of remote sensing semantic segmentation remains underexplored due to the lack of highly diverse, spatially representative datasets. The TV-RSI dataset addresses this gap by offering a comprehensive geospatial resource that captures the spatial structures, land use patterns, and ecological characteristics of traditional villages, allowing more precise spatial modeling, historical landscape analysis, and sustainable planning. With its detailed spatial annotations and large-scale coverage, TV-RSI provides a critical foundation for advancing GIS-driven deep learning models, facilitating fine-grained segmentation and geospatial intelligence in traditional village conservation and development.

With the acceleration of modernization, traditional villages are receiving more and more academic attention as cultural heritage and historical witnesses. Researchers have thoroughly explored the cultural, ecological and social values of traditional villages, emphasizing their importance in sustainable development. The location and landscape pattern of traditional villages show a deep understanding of the natural environment, such as topography, climate, and water resources, reflecting the harmonious coexistence of humans and nature [4], and provide a valuable reference for modern landscape design [5]. The introduction of artificial intelligence technology has significantly improved the protection and monitoring efficiency of the traditional landscape heritage of villages [6], while research based on culture-landscape genes provides a new theoretical framework for its protection and development [7]. In addition, low-altitude drone remote sensing brings new methods for geospatial data acquisition and landscape management [8], and multi-resolution feature fusion technology further improves the precision of landscape analysis [9]. These studies not only provide a solid theoretical foundation for the protection of traditional villages but also open up a practical path for their sustainable development in modern society.

Convolutional neural networks (CNNs) and Transformers have developed rapidly in recent years, and a large number of segmentation networks have been proposed. For example, fully convolutional architectures realize end-to-end pixel-level prediction and improve segmentation precision by merging features from multiple layers [10]. On this basis, the self-attentive three-stream TSNet further combines RGB and depth features to achieve high-precision semantic segmentation of indoor scenes [11]. ESANet demonstrated efficient RGB-D segmentation performance in mobile robot scene analysis, combining real-time speed with high precision [12]. To further improve segmentation accuracy, ACNet significantly improves RGB-D semantic segmentation with a complementary attention module that effectively fuses RGB and depth features [13]. With the introduction of the Transformer, an architecture based on the Swin Transformer and a DCFAM decoder improves contextual information extraction and resolution recovery in semantic segmentation tasks [14]. In the field of remote sensing image segmentation, progressive reconstruction networks have effectively improved segmentation accuracy through the synergy of multiscale features and depthwise separable modules, demonstrating strong fine-grained segmentation capability [15]. However, these methods typically rely on either convolution or a Transformer alone, leading to insufficient modeling or high computational cost.

Recently, prompt learning approaches have gained great utility in vision and language. For example, MaPLe (multimodal prompt learning) significantly improves the model's generalization on downstream tasks by enhancing the consistency of CLIP's visual and linguistic representations [16]. A systematic survey of prompt-based learning paradigms reviews the applications and methods of prompt learning in natural language processing [17]. To improve the generalization performance of CLIP models, CoCoOp proposes dynamic conditional prompt optimization, which further enhances the model's transferability across categories [18]. Meanwhile, CoOp optimizes CLIP prompt design with learnable vectors, improving the adaptability of vision-language models to image recognition tasks [19]. In addition, ProGrad uses prompt-aligned gradient optimization, enabling the vision-language model to show stronger generalization in few-shot and cross-domain tasks [20]. Recent studies have further advanced deep learning architectures for remote sensing and multimodal perception tasks. FDNet proposed a dual-path cross-encoding framework that captures both spatial and temporal dependencies for precipitation nowcasting, achieving significant improvements in contextual feature learning and prediction accuracy through parallel encoding pathways and feature fusion mechanisms [21]. In addition, DDFNet introduced a dual-domain fusion network that integrates time- and frequency-domain representations through a decoupled attention mechanism, demonstrating strong robustness and generalization across complex signal environments [22].
Building on this, recent RSI segmentation advances fall into three lines closely aligned with our setting: geometry- and boundary-aware decoding for contour fidelity and small-object recall, anisotropic/sparse attention for thin structures in high-resolution scenes, and state-space (Mamba) models that aggregate context in linear time. We adopt the third while complementing the first two via four-direction SS2D scans and a gated residual prior within the student stream (student-only inference), under a unified protocol, yielding consistent gains in Boundary-F1 and Connectivity [23–28]. However, there is still room for exploration in the purely visual domain as a way to further improve the performance of visual networks.

To solve the above problems, this paper proposes a Mamba prompt learning network for efficient and accurate remote sensing image segmentation. Specifically, we first construct a large-scale, diversity-rich traditional village remote sensing dataset named TV-RSI. Then, we design the Mamba Fusion Module to fully exploit complementary information by modeling same-layer and cross-layer features. Finally, we introduce prompt learning to inject bimodal knowledge from heavy-weight networks into lighter unimodal networks in the form of prompts, resulting in a more streamlined and accurate model. Our contributions are mainly as follows:

(1) For the first time, we constructed a traditional village remote sensing dataset with large diversity and scale, called TV-RSI, and proposed a new Mamba prompt learning network, called MPLNet.

(2) A Mamba Fusion Module called MFM is designed, which deeply mines and efficiently integrates complementary information through modeling of same-layer or cross-layer features to maximize the use of information.

(3) Through prompt learning, the bimodal knowledge in the heavy-weight network is transformed into cue information and injected into the lightweight unimodal network, thus constructing a more streamlined and accurate network.

(4) Experiments on the proposed TV-RSI dataset and two publicly available datasets show that our proposed MPLNet achieves state-of-the-art performance.

Dataset

The Traditional Villages Remote Sensing Dataset (TV-RSI) is a spatially enriched collection of remote sensing images meticulously designed to capture the geospatial characteristics of traditional villages. Using Geographic Information System (GIS) technologies, TV-RSI systematically integrates multidimensional spatial data, including the distribution of village architecture, topographical variations, types of land cover, and surrounding ecological landscapes. This dataset serves as a high-precision spatial repository, enabling comprehensive geospatial analysis, spatial pattern recognition, and terrain-based modeling of traditional village environments. One of the key spatial advantages of TV-RSI is its ability to support the analysis of spatial relationships and the detection of spatial change. Through high-resolution spatial analytics, researchers can quantify land cover transitions, assess urbanization impacts, and model the interdependencies between village structures and natural landscapes. The rich geospatial attributes of the dataset enable users to perform fine-grained spatial assessments, such as village clustering, terrain-based segmentation, and proximity analysis of cultural heritage sites. In the context of traditional village conservation and sustainable development, the integration of remote sensing with GIS-based spatial modeling significantly improves spatio-temporal tracking, historical change analysis, and predictive mapping. TV-RSI provides a solid geospatial foundation for regional planning, heritage site management, and rural sustainability strategies, ensuring that conservation policies are data-driven and spatially optimized. Furthermore, its application in digital mapping, intelligent spatial analytics, and AI-enhanced geospatial modeling contributes to the advancement of digital heritage management and GIS-based decision support systems.
By offering an unprecedented level of spatial granularity and thematic richness, TV-RSI is a valuable geospatial resource for researchers, planners, and policy makers. It bridges the gap between remote sensing, spatial intelligence, and the conservation of cultural heritage, fostering scientific advances in geospatial analytics, spatial planning, and sustainable rural development.

The Traditional Villages Remote Sensing Dataset (TV-RSI) is a spatially comprehensive, high-resolution dataset consisting of 77,850 images with a total volume of 6.6 GB, each at a resolution of 256 × 256 pixels, covering an extensive area of 166,900 square kilometers. Designed to support geospatial analysis and deep learning applications in GIS, TV-RSI systematically captures spatial structures, topographic features, and land use patterns unique to traditional villages. To ensure spatial diversity and analytical robustness, the dataset is strategically partitioned into three subsets: the training set (68,984 images) for learning spatial relationships and geographic distributions, the validation set (7,627 images) for optimizing model generalization across diverse terrain and architectural configurations, and the test set (1,239 images) for evaluating model performance on complex spatial features and environmental variations. This spatially aware dataset structure enhances the ability to perform precise land classification, spatio-temporal monitoring, and digital heritage mapping, providing a solid geospatial foundation for AI-driven GIS applications and sustainable development planning (Fig 1).

Fig 1. Representative samples from the TV-RSI dataset.

All sampling, annotations, panel composition, and graphics are original works by the authors and are released under CC BY 4.0.

https://doi.org/10.1371/journal.pone.0341130.g001

Annotation protocol and QA. High-resolution orthoimagery was tiled into 256 × 256 patches (non-overlapping), with rebalancing at sampling time to mitigate class skew. Six semantic classes (background, road, building, farmland, vegetation, drainage) were annotated at the pixel level using a standardized guideline. Each tile received double-pass verification by two annotators; conflicts were resolved by a senior reviewer. Spot audits on a random subset ensured consistency, particularly for ambiguous boundaries (e.g., vegetation-farmland edges). This protocol improves label quality without inflating annotation variance.
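The tiling step can be sketched as follows (a minimal NumPy sketch; the function name and the strict divisibility assumption are ours, since real pipelines pad or crop the border strip first):

```python
import numpy as np

def tile_image(img: np.ndarray, patch: int = 256) -> np.ndarray:
    """Split an H x W x C orthoimage into non-overlapping patch x patch tiles.

    Assumes H and W are exact multiples of `patch`; a real pipeline would
    pad or crop the border strip beforehand.
    """
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "pad/crop to a multiple of patch first"
    tiles = (img.reshape(h // patch, patch, w // patch, patch, c)
                .transpose(0, 2, 1, 3, 4)   # (row-block, col-block, patch, patch, C)
                .reshape(-1, patch, patch, c))
    return tiles

# A 512 x 768 image yields (512/256) * (768/256) = 6 tiles.
demo = np.zeros((512, 768, 3), dtype=np.uint8)
print(tile_image(demo).shape)  # (6, 256, 256, 3)
```

The reshape/transpose trick preserves pixel order, so tile 0 is exactly the top-left 256 × 256 window of the source image.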

Fig 2 illustrates the spatial distribution characteristics of different types of objects in the TV-RSI dataset, depicting the proportion of area (ranging from 0 to 1) occupied by categories such as background, road, building, farmland, vegetation, and drainage systems. The skewed distribution of background area proportions, predominantly concentrated in the 0.0-0.1 range, indicates that background elements occupy only a minor portion of most images, contributing to the spatial heterogeneity of the dataset. Similarly, roads and buildings exhibit a bias toward smaller spatial footprints, reflecting their localized and linear spatial characteristics in traditional village layouts. In contrast, farmland and vegetation show more uniform area distributions, with notable peaks in the 0.9-1.0 interval, suggesting that in certain images these land cover types dominate the spatial composition. The dominance of small areas of roads, buildings, and background elements presents a unique spatial segmentation challenge, enriching the dataset with diverse geospatial test cases for models targeting small-scale feature extraction. In contrast, the extensive coverage of farmland and vegetation requires robust large-scale object recognition, making TV-RSI an ideal dataset for evaluating multiscale spatial segmentation algorithms. By offering a rich spectrum of land cover distributions, geospatial complexities, and scale-aware segmentation challenges, TV-RSI supports AI-driven GIS applications and ultimately advances spatial intelligence in remote sensing analytics.

Fig 2. Regional distribution of semantic objects.

https://doi.org/10.1371/journal.pone.0341130.g002
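The per-class area proportions summarized in Fig 2 can be reproduced from a label mask in a few lines (a NumPy sketch; mapping class ids 0-5 to background, road, building, farmland, vegetation, and drainage in that order is our assumption):

```python
import numpy as np

def class_area_proportions(mask: np.ndarray, num_classes: int = 6) -> np.ndarray:
    """Fraction of pixels belonging to each class in one annotation mask."""
    counts = np.bincount(mask.ravel(), minlength=num_classes)
    return counts / mask.size

# Toy 4 x 4 mask: half background (id 0), half farmland (id 3).
mask = np.array([[0, 0, 3, 3]] * 4)
props = class_area_proportions(mask)
print(props)  # background and farmland each cover half of the tile
```

Aggregating these per-tile vectors over the whole dataset and histogramming each class column yields exactly the area-proportion distributions shown in Fig 2.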

Fig 3 presents the spatial distribution heatmap of the location of objects in the TV-RSI dataset, which illustrates the variations in object density through a color gradient. The brighter yellow hues in the central region indicate a higher concentration of objects, while the color transitions to red and darker shades towards the edges, signifying a gradual decline in object density. This centrally aggregated spatial distribution highlights the tendency of key land cover elements - such as buildings, roads, and vegetation clusters - to be predominantly positioned in the core regions of images, while peripheral areas remain relatively sparse. This spatial characteristic provides crucial geospatial insights for GIS-driven segmentation models and spatial feature extraction, as it emphasizes key regions that warrant greater focus during model training. For location-sensitive remote sensing tasks, such as semantic segmentation and spatial object detection, the heat map offers a valuable spatial reference to enhance recognition accuracy and localization precision. Moreover, the structured spatial distribution of objects in TV-RSI supports targeted model optimization, facilitating the development of adaptive GIS-based deep learning algorithms that can efficiently process spatially clustered and dispersed land-cover patterns, ultimately advancing spatial intelligence in remote sensing analysis.

Method

Overall architecture

MPLNet consists of a frozen bimodal teacher and a trainable unimodal student (Fig 4). The teacher employs two ResNet-50 backbones for RGB and depth inputs, fused with Mamba Fusion Modules (MFM). Its parameters remain fixed during training to serve as a stable source of cross-modal context. The student is a single RGB backbone augmented with Mamba gates. During training, cross-stream knowledge is injected via projection-aligned cues (Eqs (1)–(5)). At inference, only the student stream is executed, preserving efficiency while retaining accuracy.

Fig 4. Overall architecture diagram of MPLNet.

https://doi.org/10.1371/journal.pone.0341130.g004

Fig 4 shows the overall structure of the proposed MPLNet. The architecture is divided into two parts, labeled the frozen part and the trainable learning part. In the frozen part, the backbone adopts two symmetric ResNet-50s to extract RGB and depth-modality features, respectively, followed by efficient complementary use of the multimodal information through the proposed MFM. In the trainable learning part, the backbone uses a single ResNet-50 to extract RGB features, which are then contextually modeled by Mamba blocks; the enriched information from the frozen part is injected into these RGB features. Finally, the prediction map is obtained through the decoder, and the output prediction of the frozen part is injected as a prompt into the output of the trainable part to obtain more accurate results.

MFM

In the feature fusion stage, a common approach is to perform modal feature fusion with either convolutional neural networks or Transformers. However, CNNs offer insufficient global contextual modeling, and Transformers, although adequate for global context, incur a large parameter count and heavy computation. To solve this problem, as shown in Fig 5, we introduce a Mamba mechanism for context-efficient modeling. Specifically, we first perform within-modal context modeling with Mamba for each of the two modalities, and then fuse the contextualized features through a ResNet-50 encoding layer to obtain more adequate complementary information. The Mamba mechanism obtains context in different directions through multiple nonlinear transforms and combines them by element-wise modulation to yield rich long-range cues. Concretely, it performs sequential scans in four directions (left-to-right, right-to-left, top-to-bottom, and bottom-to-top) on the 2D feature map, enabling the model to capture contextual dependencies from multiple orientations. It then fuses and smooths the results from all directions to integrate multi-directional information, enhancing the perception of boundaries and linear structures. The specific formulation is as follows.

Fig 5. Structure of the Mamba Fusion Module (MFM).

https://doi.org/10.1371/journal.pone.0341130.g005

Design rationale and implementation details. To make the operator pipeline explicit and reproducible, we formalize the intra- and inter-modal processing used in MFM. Given a stage feature Fi, we first apply a local mixing convolution, then perform a 2D selective scan (SS2D) along four directions to harvest long-range directional dependencies. The directional states are concatenated and projected back to C channels, followed by layer normalization.

A lightweight gate produces a content-adaptive mask that modulates Y through an element-wise product, producing the gated feature. Applying this pipeline to the RGB and depth streams separately gives Fr,i and Fd,i, which are fused by a ResNet encoder block with channel alignment.

Yi = LN(Proj(Concat_{d=1..4}(SS2D_d(Conv(Fi)))))  (1)
Ŷi = Yi ⊙ σ(G(Yi))  (2)
Fi′ = ResBlock(Align([Fr,i; Fd,i]))  (3)

Operator glossary and reproducibility notes. Conv is a convolution with same padding; SS2D denotes the 2D selective scan that aggregates directional state updates on the spatial grid; LN is Layer Normalization; the activation is SiLU; C3S denotes the composed convolution-normalization-activation operation; ⊙ denotes element-wise multiplication; Align is a projection for channel alignment; ResBlock denotes a ResNet encoder block. In practice, SS2D adds linear time and memory in HW and introduces negligible parameters compared to a ResNet-50 stage, which matches the efficiency requirements of remote sensing segmentation.

Remark. In our implementation, the local mixing refers to a convolution, SS2D is the 2D selective scanning operator, LN is layer normalization, the activation is SiLU, and the fusion block corresponds to a ResNet-50 encoding block.
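The four-direction scan-and-fuse idea above can be illustrated with a toy recurrence (a sketch only: a fixed exponential-decay scan stands in for Mamba's learned selective scan, and the four directional outputs are simply averaged rather than projected and gated):

```python
import numpy as np

def decay_scan(x: np.ndarray, alpha: float = 0.9) -> np.ndarray:
    """Left-to-right recurrence h[t] = alpha * h[t-1] + x[t] along the last axis."""
    h = np.zeros_like(x, dtype=float)
    acc = np.zeros(x.shape[:-1])
    for t in range(x.shape[-1]):
        acc = alpha * acc + x[..., t]
        h[..., t] = acc
    return h

def ss2d_sketch(feat: np.ndarray, alpha: float = 0.9) -> np.ndarray:
    """Scan an H x W map in four directions and average the directional outputs."""
    lr = decay_scan(feat, alpha)                        # left -> right
    rl = decay_scan(feat[:, ::-1], alpha)[:, ::-1]      # right -> left
    tb = decay_scan(feat.T, alpha).T                    # top -> bottom
    bt = decay_scan(feat[::-1].T, alpha).T[::-1]        # bottom -> top
    return (lr + rl + tb + bt) / 4.0

out = ss2d_sketch(np.eye(4))
print(out.shape)  # (4, 4)
```

With alpha = 0 each scan degenerates to the identity, so the output equals the input, which is a convenient sanity check; the real SS2D replaces the scalar decay with learned, input-dependent state-space parameters.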

Prompt learning strategies

Typical remote sensing semantic segmentation algorithms encode each modality separately and then fuse multimodal context; however, this leads to a large number of parameters and heavy computation in the inference phase, which is unsuitable for resource-constrained scenarios. To address this, we introduce a prompt learning mechanism. In the training phase, we simultaneously train a multimodal two-stream network and a single-stream unimodal network. The rich complementary information from the multimodal fusion is then injected into the single-stream network via element-wise addition. Finally, the decoded output of the two-stream network is fused with the output of the single-stream network through a residual mechanism to obtain the final output. Notably, in the inference stage we run only the single-stream network as our final model, greatly reducing the parameter count and computation while obtaining better performance.
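The training-time injection can be sketched as follows (a minimal NumPy sketch; the gate projection `w_gate` and the additive form are our simplifications, and the stop-gradient is only noted in a comment since NumPy has no autograd):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inject_prompt(student_feat, teacher_cue, w_gate):
    """Add the teacher cue to the student feature, scaled by a content-adaptive
    gate predicted from the student itself."""
    cue = teacher_cue                      # real frameworks: stop_grad(teacher_cue)
    gate = sigmoid(student_feat @ w_gate)  # hypothetical channel gate in (0, 1)
    return student_feat + gate * cue

rng = np.random.default_rng(0)
s = rng.standard_normal((8, 16))          # student feature (tokens x channels)
t = rng.standard_normal((8, 16))          # projection-aligned teacher cue
w = rng.standard_normal((16, 16)) * 0.01
out = inject_prompt(s, t, w)
print(out.shape)  # (8, 16)
```

At inference the teacher branch is skipped entirely, so `inject_prompt` is simply never called and the deployed cost is that of the unimodal student alone.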

Tensor shapes and reduction to the original form. We use stage features with Ci channels at stage i, and K denotes the number of classes. When the gate is the identity and layer normalization is omitted, Eq 4 reduces to the original additive update with M(ri) (same semantics and implementation). For Eq 5, when the projection is the identity and the temperature equals 1, the gate becomes 1 + Pi, recovering the original multiplicative refinement. Introducing σ and the temperature is only for numerical calibration and stability on class-imbalanced or high-confidence samples.

Design rationale and implementation details. Let Ri and Di denote the teacher (RGB/depth) features at stage i, and let Ti be the student feature at the same stage. We first build a projection-aligned cue by channel/spatial alignment of ϕ([Ri; Di]), where ϕ is a projection (with up/down sampling if needed). To prevent gradient leakage from the student into the frozen teacher, we apply a stop-gradient operator to the cue during injection. A light gate, predicted from the student content, controls the injection strength. Formally, the student update (semantically equivalent to the original formulation) is written as

Ti ← Ti + σ(Wg Ti) ⊙ LN(M(sg(Pi)))  (4)

where M is the Mamba context operator from Eq 1, σ is the sigmoid gate, Wg is a projection, LN stabilizes the cue scale, and ⊙ is element-wise modulation. When the gate and normalization are omitted, Eq 4 reduces to the original additive update; hence, the semantics and implementation remain unchanged while stability improves.

At the logit level, the multiplicative refinement is written in a numerically stable, confidence-gated form that preserves the same intent ("teacher prompts student") while avoiding uncontrolled logit growth:

O = Os ⊙ (1 + σ(Wo Ot / τ))  (5)

Here, Wo is a projection that aligns the teacher logits with the student logit space, and τ is a temperature controlling the sharpness of the gate. When the projection is the identity and the temperature equals 1, Eq 5 recovers the original multiplicative form (1 + Pi); introducing σ and τ serves as numerical calibration and yields better stability under class imbalance and high-confidence samples. During inference, only the student branch (with its gates) is executed, so the deployed model maintains unimodal complexity.
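Our reading of this confidence-gated refinement can be sketched as follows (the name `refine_logits` is ours, and the projection Wo is taken as the identity for illustration):

```python
import numpy as np

def refine_logits(student_logits, teacher_logits, tau=1.0):
    """Scale student logits by 1 + sigmoid(teacher / tau): a bounded,
    confidence-gated multiplicative prompt with factor in (1, 2)."""
    gate = 1.0 / (1.0 + np.exp(-teacher_logits / tau))
    return student_logits * (1.0 + gate)

s = np.array([2.0, -1.0, 0.5])   # student logits
t = np.array([4.0, -4.0, 0.0])   # teacher logits
print(refine_logits(s, t))
```

Because the sigmoid gate lies in (0, 1), the multiplicative factor is bounded in (1, 2), which is precisely what prevents the uncontrolled logit growth mentioned above.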

Complexity. All cue paths and the teacher decoder are training-only. The deployed model preserves the parameter count and FLOPs of the student, with negligible extra memory for the gates.

Loss function

We supervise both teacher and student with pixel-wise cross-entropy:

L_CE = −(1/N) Σ_{n=1}^{N} log p_n(y_n)  (6)

To encourage the student to absorb teacher priors beyond hard labels, we add a temperature-based distillation term:

L_KD = τ² · KL(softmax(z_t / τ) ‖ softmax(z_s / τ))  (7)

with temperature τ. The total loss is

L = λ_CE · L_CE + λ_KD · L_KD  (8)

where λ_CE and λ_KD balance the supervision and distillation terms. In practice, the teacher is frozen, we apply the stop-gradient operator of Eq 4, and we optimize only the student and the gating projections.
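A minimal sketch of the combined objective (per-pixel cross-entropy plus temperature-scaled distillation; the names `lambda_ce` and `lambda_kd` and all numeric values are illustrative, not the paper's settings):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ce_loss(logits, labels):
    """Mean pixel-wise cross-entropy. logits: (N, K); labels: (N,) int class ids."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def kd_loss(student_logits, teacher_logits, tau=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by tau^2 as in standard distillation."""
    pt = softmax(teacher_logits / tau)
    ps = softmax(student_logits / tau)
    return tau ** 2 * np.mean(
        np.sum(pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12)), axis=-1))

def total_loss(student_logits, teacher_logits, labels,
               lambda_ce=1.0, lambda_kd=0.5, tau=2.0):
    return (lambda_ce * ce_loss(student_logits, labels)
            + lambda_kd * kd_loss(student_logits, teacher_logits, tau))

s = np.array([[2.0, 0.1, -1.0], [0.0, 3.0, 0.5]])  # student logits, 2 pixels
t = np.array([[1.5, 0.2, -0.5], [0.2, 2.5, 0.1]])  # teacher logits
y = np.array([0, 1])
print(round(float(total_loss(s, t, y)), 4))
```

Note that the KD term vanishes when student and teacher agree exactly, so the objective smoothly falls back to plain cross-entropy supervision.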

Experiments and results

Public datasets and experimental setup

We evaluated MPLNet on three RSI datasets: the proposed TV-RSI, ISPRS Vaihingen, and ISPRS Potsdam. Vaihingen comprises 33 aerial images with the standard split; Potsdam contains 38 tiles. Unless otherwise specified, we use consistent train/validation/test partitions across all methods for fair comparison and report mean pixel accuracy (mAcc) and mean Intersection-over-Union (mIoU).

Dataset nomenclature and splits. We follow the common nomenclature of the ISPRS benchmarks and use Vaihingen (33 images) and Potsdam (38 tiles). For TV-RSI, we adopt the predefined training/validation/test subsets described in the Dataset section to ensure reproducibility across experiments.

Preprocessing and data augmentation. The images are tiled into fixed-size patches to fit GPU memory. We apply standard augmentations—random horizontal/vertical flips, scale jittering in [0.5,2.0], random cropping to the training patch size, and color jitter for RGB channels—to reduce overfitting and improve small-object robustness. Normalization uses RGB means and standard deviations per data set.
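The augmentation recipe can be sketched for one RGB tile and its mask (a NumPy sketch with nearest-neighbor scaling; the flip probabilities, zero-padding, and crop logic are our assumptions, not the released pipeline):

```python
import numpy as np

def augment(img, mask, rng, crop=256):
    """Random flips, scale jitter in [0.5, 2.0], and a random crop to `crop`."""
    if rng.random() < 0.5:                       # horizontal flip
        img, mask = img[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                       # vertical flip
        img, mask = img[::-1], mask[::-1]
    s = rng.uniform(0.5, 2.0)                    # scale jitter
    h, w = mask.shape
    ys = np.clip((np.arange(int(h * s)) / s).astype(int), 0, h - 1)
    xs = np.clip((np.arange(int(w * s)) / s).astype(int), 0, w - 1)
    img, mask = img[ys][:, xs], mask[ys][:, xs]  # nearest-neighbor resize
    h, w = mask.shape
    if h < crop or w < crop:                     # pad if jittered tile is small
        ph, pw = max(crop - h, 0), max(crop - w, 0)
        img = np.pad(img, ((0, ph), (0, pw), (0, 0)))
        mask = np.pad(mask, ((0, ph), (0, pw)))
        h, w = mask.shape
    y0 = rng.integers(0, h - crop + 1)           # random crop origin
    x0 = rng.integers(0, w - crop + 1)
    return (img[y0:y0 + crop, x0:x0 + crop],
            mask[y0:y0 + crop, x0:x0 + crop])

rng = np.random.default_rng(0)
img = np.zeros((256, 256, 3), dtype=np.uint8)
mask = np.zeros((256, 256), dtype=np.int64)
ai, am = augment(img, mask, rng)
print(ai.shape, am.shape)  # (256, 256, 3) (256, 256)
```

Applying identical geometric transforms to image and mask keeps labels pixel-aligned; color jitter (not shown) would touch the image only.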

Batching and schedule. Unless noted, we use a minibatch of B patches per GPU on a GeForce RTX 4080 (16 GB) with automatic mixed precision (AMP/FP16). Potsdam is trained for 80 epochs and Vaihingen for 280 epochs under the same optimizer and learning-rate schedule described below. To ensure reproducibility, all reported scores are the mean over three independent runs with fixed random seeds.

Inference. The teacher and prompt paths are disabled at test time; only the unimodal student is executed. No test-time augmentation is applied unless otherwise noted.

Implementation details (reproducibility). Unless otherwise noted, all models are trained with SGD (momentum 0.9), polynomial LR decay, an initial learning rate of 5 × 10−4, and weight decay 1 × 10−4. We freeze the teacher (both backbones and the decoder) and optimize only the student and the gating projections. For prompt learning, the gate temperature in Eq 5 is kept fixed; for distillation we adopt a fixed temperature and fixed loss weights unless specified. All results are averaged over three random seeds with the same data splits. The experiments run on a workstation with an NVIDIA GeForce RTX 4080 (16 GB) GPU and an Intel Core i9-13900K CPU; AMP is enabled to reduce the memory footprint. During inference, we completely disable the teacher branch and use only the student stream.
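The polynomial decay schedule reduces to a one-liner (the exponent 0.9 is the common default and an assumption here, since the source elides the exact power):

```python
def poly_lr(base_lr: float, it: int, max_it: int, power: float = 0.9) -> float:
    """Polynomial learning-rate decay: lr = base_lr * (1 - it / max_it) ** power."""
    return base_lr * (1.0 - it / max_it) ** power

base = 5e-4  # initial learning rate from the paper
print(poly_lr(base, 0, 1000))                 # 0.0005 at the first iteration
print(round(poly_lr(base, 1000, 1000), 10))   # 0.0 at the final iteration
```

The schedule decays smoothly to zero over training, which pairs well with the fixed-epoch budgets (80 for Potsdam, 280 for Vaihingen) stated above.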

Comparison of advanced methods

Quantitative comparison.

To quantitatively evaluate the proposed MPLNet, Table 1 compares the performance of different methods on the TV-RSI dataset. Specifically, performance is compared in detail for the Agricultural Land, Building, Drainage, Road, and Vegetation categories, where our proposed model achieves the best mean accuracy (mAcc) and mean Intersection-over-Union (mIoU), reaching 90.10% and 82.57%, respectively. Performance is particularly strong in the vegetation and agricultural land categories. These results show that our proposed model exhibits strong segmentation accuracy.

Table 1. Experimental results on the TV-RSI dataset (pixel accuracy Acc and IoU per class).

https://doi.org/10.1371/journal.pone.0341130.t001

On additional SOTA baselines. Recent RSI segmentation systems (e.g. hybrid CNN-Transformer variants and structure-aware decoders) report strong results under task-specific preprocessing and high-resolution input. Where public weights/splits were incompatible with our unified pipeline, we refrain from numerical entries to avoid unfair comparison. We will release code, splits, and logs to facilitate future plug-in evaluation of additional SOTA under identical training and inference protocols.

Qualitative comparison.

To qualitatively compare the proposed MPLNet, Fig 6 shows the segmentation results of different models on the TV-RSI dataset. From left to right, the panels show the original RGB image, the ground-truth label (GT), FCN-8s, ACNet, TSNet, ESANet, DCSwin, and our proposed model. It can be seen that our proposed model recovers object boundaries and details more accurately across scenes, and its classification results are closer to the ground-truth labels. For example, in some complex edge regions, models like ACNet and TSNet show a degree of misclassification, while our proposed model better maintains boundary integrity. In addition, ESANet and DCSwin exhibit obvious mis-segmentation in some regions, while our proposed model demonstrates stronger robustness and accurately segments the target regions. Overall, our proposed model outperforms the other methods across scenarios and shows a notably stronger ability on complex feature classification tasks.

Fig 6. Qualitative comparisons on TV-RSI.

Ground truth masks and model outputs are shown for representative tiles. Base imagery (where visible) was obtained from the Geospatial Data Cloud (GF-2) for non-commercial academic use. All overlays, annotations, and panel layouts are original works by the authors and are released under CC BY 4.0.

https://doi.org/10.1371/journal.pone.0341130.g006

Quantitative analysis. MPLNet has the best overall scores on TV-RSI, with mAcc = 90.10% and mIoU = 82.57%. In per-class metrics, it delivers the highest IoU for Agricultural Land (91.18%), Drainage System (86.12%), Road (78.82%), and Vegetation (91.73%), and the highest Acc for Agricultural Land (96.13%), Drainage System (92.61%), and Road (87.91%). Compared with the strongest baseline (DCSwin, 73.87% mIoU), MPLNet improves the mean IoU by +8.70 points while using a single-stream student for inference. These gains are particularly pronounced on small or thin structures (roads, drainage), indicating that MFM and prompt-injected cues sharpen boundary localization and mitigate class imbalance.

Ablation experiments

To validate the effectiveness of the MFM module and the prompt learning strategy, we performed the following ablation experiments. To ensure fairness, all variants share the same training schedule and data splits, differing in a single switch (MFM on/off or PL on/off). We also report the mean and standard deviation across three seeds; per-class IoU trends (e.g., gains on thin structures for MFM and on texture-prone classes for PL) are visualized in the Supplementary.

Ablation of MFM.

Table 2 and Fig 7 demonstrate the contribution of the MFM module to MPLNet's performance. Table 2 shows that including the MFM module raises MPLNet's mean accuracy (mAcc) to 90.10% and mean intersection over union (mIoU) to 82.57%, significantly outperforming the version without MFM (88.32% and 80.26%) and the backbone alone (85.61% and 78.73%). The visualizations in Fig 7 further illustrate the effectiveness of MFM: the model with MFM segments object boundaries more accurately, with results closer to the ground-truth labels (GT), and our full model performs best compared with the backbone-only and MFM-ablated variants.

Fig 7. Ablation of the Mamba Fusion Module (MFM).

Visual comparisons among Backbone, MPLNet (w/o MFM), and MPLNet demonstrate sharper boundaries and reduced omissions with MFM. Base imagery (where visible) was obtained from the Geospatial Data Cloud (GF-2) for non-commercial academic use. All derived visualizations are author-original and released under CC BY 4.0.

https://doi.org/10.1371/journal.pone.0341130.g007

Discussion. Relative to the backbone and the setting without MFM, the largest mIoU gains appear on Road and Drainage, which are thin and spatially fragmented. This supports our design rationale that the directional selective scan complements CNN encoding by broadcasting long-range cues along paths aligned with man-made linear structures. As shown in Table 2, removing MFM drops performance from 90.10/82.57 to 88.32/80.26 (mAcc/mIoU), indicating that MFM contributes notably to both overall accuracy and IoU.
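To convey the intuition behind the directional selective scan, the sketch below approximates it with causal, exponentially decayed accumulations run along four image directions and averaged. This is an illustrative simplification, not the MFM implementation: the scalar `decay` stands in for Mamba's learned, input-dependent state parameters.

```python
import numpy as np

def directional_scan(x, decay=0.9):
    """Toy 4-directional scan over an H x W feature map: each direction
    propagates a decayed running state, broadcasting long-range context
    along rows and columns (akin to linear structures such as roads)."""
    def scan(seq):
        h = np.zeros_like(seq[..., 0])
        out = np.empty_like(seq)
        for t in range(seq.shape[-1]):
            h = decay * h + (1 - decay) * seq[..., t]  # causal recurrence
            out[..., t] = h
        return out
    fwd = scan(x)                                             # left -> right
    bwd = scan(x[..., ::-1])[..., ::-1]                       # right -> left
    down = scan(x.swapaxes(-1, -2)).swapaxes(-1, -2)          # top -> bottom
    up = scan(x.swapaxes(-1, -2)[..., ::-1])[..., ::-1].swapaxes(-1, -2)
    return (fwd + bwd + down + up) / 4.0
```

Because each direction is a linear recurrence, context flows along entire rows and columns at cost linear in the number of pixels, which is the efficiency argument for Mamba-style scanning over full self-attention.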

Ablation protocol and fairness controls. All ablations modify a single factor at a time while keeping the backbone, optimizer, schedule, data splits, and augmentations fixed. The teacher branch is frozen throughout training; we optimize only the student and the gating projections. For each setting, we train for the same number of epochs and report results on the same validation/test splits under an identical inference procedure (no TTA). Unless otherwise noted, we repeat each run with three random seeds and report the mean.
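The single-switch protocol can be organized as in the following sketch. Here `run_experiment` is a hypothetical stand-in for the actual training loop, and the flag names are assumptions for illustration.

```python
import statistics
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AblationConfig:
    use_mfm: bool = True   # Mamba Fusion Module on/off
    use_pl: bool = True    # prompt learning on/off
    seed: int = 0          # backbone, schedule, splits, augmentations stay fixed

def sweep(run_experiment, seeds=(0, 1, 2)):
    """Toggle one factor at a time; report mean/stdev of the score over seeds."""
    base = AblationConfig()
    results = {}
    for name, cfg in [("full", base),
                      ("w/o MFM", replace(base, use_mfm=False)),
                      ("w/o PL", replace(base, use_pl=False))]:
        scores = [run_experiment(replace(cfg, seed=s)) for s in seeds]
        results[name] = (statistics.mean(scores), statistics.stdev(scores))
    return results
```

The frozen dataclass makes every run's configuration explicit and hashable, so a log of configs doubles as the fairness audit trail.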

Interpretation. MFM primarily benefits thin or elongated structures and cluttered boundaries (e.g., roads and drainage), consistent with its directional selective scan that enhances long-range context while preserving locality. Prompt learning (PL) yields further gains by injecting intermodal priors into the unimodal student during training; improvements concentrate on Vegetation and Agricultural Land, whose inter-class textures are easily confused. Notably, PL is disabled at test time, so the deployed model preserves the student's parameter count and FLOPs.

Statistical robustness. We estimate 95% confidence intervals using a stratified bootstrap over tiles and observe consistent improvements across seeds; detailed per-class intervals are provided in the Supplementary.
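A stratified bootstrap of the kind described above can be implemented as follows. This is a generic sketch: the per-tile scores, strata, and replicate count are illustrative choices, not our experimental data.

```python
import numpy as np

def stratified_bootstrap_ci(per_tile_scores, strata, n_boot=1000, alpha=0.05, seed=0):
    """95% (by default) CI of the mean tile score, resampling tiles with
    replacement within each stratum so scene composition is preserved."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_tile_scores, dtype=float)
    strata = np.asarray(strata)
    means = []
    for _ in range(n_boot):
        sample = []
        for s in np.unique(strata):
            idx = np.flatnonzero(strata == s)
            sample.append(scores[rng.choice(idx, size=idx.size, replace=True)])
        means.append(np.concatenate(sample).mean())
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Stratifying by scene type (e.g., village core vs. farmland tiles) prevents resamples dominated by one landscape from widening the interval artificially.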

Efficiency remark. MFM adds negligible parameters relative to a ResNet-50 stage (lightweight gates and projections), and PL is training-only. Consequently, inference-time complexity is unchanged relative to the student baseline, which is consistent with deployment in resource-constrained RSI scenarios.

Ablation of prompt learning.

Table 3 and Fig 8 demonstrate the contribution of prompt learning (PL) to MPLNet's performance. As shown in Table 3, adding PL raises MPLNet's mean accuracy (mAcc) to 90.10% and mean intersection over union (mIoU) to 82.57%, significantly outperforming the version without PL (87.93% and 80.17%) and the backbone alone (85.61% and 78.73%). The visualizations in Fig 8 further validate the effect of PL: the model with PL segments fine details more accurately and closer to the ground-truth labels (GT), showing a clear advantage.

Fig 8. Ablation of prompt learning (PL).

Compared with the student baseline, MPLNet with PL recovers fine structures and reduces confusion in mixed land-cover zones. Base imagery (where visible) was obtained from the Geospatial Data Cloud (GF-2) for non-commercial academic use. All annotations and compositions are original works by the authors and are released under CC BY 4.0.

https://doi.org/10.1371/journal.pone.0341130.g008

Table 3. Ablation results for prompt learning.

https://doi.org/10.1371/journal.pone.0341130.t003

Discussion. PL improves class separability where RGB-only textures are ambiguous (e.g., Vegetation vs. Agricultural Land), indicating that teacher cues transfer cross-modal priors to the unimodal student without incurring inference cost. As shown in Table 3, removing the prompt learning module drops performance from 90.10/82.57 to 87.93/80.17 (mAcc/mIoU), demonstrating that prompt learning effectively improves both accuracy and IoU.
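Conceptually, the training-only objective pairs a supervised term with a teacher-alignment term, as in this simplified sketch. The weight `lam`, the feature shapes, and the plain MSE alignment are assumptions for illustration; the actual prompt-injection mechanism is richer, and at inference only the student branch runs.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pl_training_loss(student_logits, labels, student_feat, teacher_feat, lam=0.5):
    """Cross-entropy on labels plus an alignment term pulling student
    features toward the (frozen) bimodal teacher's features."""
    p = softmax(student_logits)
    n = labels.shape[0]
    ce = -np.log(p[np.arange(n), labels] + 1e-12).mean()   # supervised term
    align = ((student_feat - teacher_feat) ** 2).mean()    # cross-modal prior
    return ce + lam * align
```

Because the alignment term vanishes from the deployed graph, the student's test-time cost is identical with or without PL, matching the "zero-overhead inference" claim.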

Comparison of two public datasets.

Table 4 compares the segmentation results of different models on the Vaihingen dataset. Our proposed model performs well in accuracy (Acc) and intersection over union (IoU) across all categories, especially Building and Impervious Surface (Imp. surf.), where it achieves 97.94% and 96.56% accuracy, respectively, significantly better than the other models. In addition, our model attains the highest mean accuracy (mAcc) of 90.43% and the highest mean intersection over union (mIoU) of 83.31%, showing stronger segmentation ability and generalization performance. These results indicate that our proposed model adapts to complex scenes better than the other methods.

Table 4. Quantitative comparison results for the Vaihingen dataset.

https://doi.org/10.1371/journal.pone.0341130.t004

Table 5 shows the segmentation performance of different models on the Potsdam dataset. Our proposed model outperforms the others in several categories, especially Impervious Surface (Imp. surf.) and Building, with accuracies of 94.26% and 94.43% and IoUs of 86.15% and 92.34%, respectively. In addition, our model achieves a mean accuracy (mAcc) of 90.12% and a mean intersection over union (mIoU) of 79.02%, showing strong overall segmentation performance and generalizability. These results indicate that our proposed model attains higher recognition accuracy in complex remote sensing scenes.

Runtime. All additional computation from PL and the teacher decoder occurs only during training; at test time, MPLNet runs a single-stream student whose latency and memory closely match the backbone baseline.

Table 5. Quantitative comparison results for the Potsdam dataset.

https://doi.org/10.1371/journal.pone.0341130.t005

Table 6 demonstrates the lightweight advantage of our proposed model. Compared with existing methods, it achieves significantly lower parameters and FLOPs, indicating higher computational efficiency and suitability for real-time or resource-constrained applications.

Table 6. Comparison of model parameters and computational complexity.

https://doi.org/10.1371/journal.pone.0341130.t006
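Parameter and FLOP figures like those in Table 6 can be estimated layer by layer. The helper below is a generic sketch (bias included, multiply-accumulates counted once per output position), not the exact accounting used for the table.

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, groups=1):
    """Parameters and multiply-accumulate (MAC) count for one conv layer."""
    params = (k * k * c_in // groups) * c_out + c_out           # weights + bias
    macs = (k * k * c_in // groups) * c_out * h_out * w_out     # MACs over the output map
    return params, macs
```

For intuition: a hypothetical 1x1 gating projection at 256 channels contributes 256*256 + 256 = 65,792 parameters, which is small next to a full ResNet-50 stage and consistent with the "negligible overhead" characterization of MFM's gates.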

Conclusion

The Mamba Prompt Learning Network (MPLNet) proposed in this paper provides an efficient and accurate GIS-driven solution for remote sensing image segmentation, with a strong emphasis on spatial information extraction and geospatial intelligence. By constructing TV-RSI, a large-scale, high-diversity remote sensing dataset specifically designed for traditional village landscapes, we establish a comprehensive geospatial foundation for fine-grained segmentation and spatial modeling. The Mamba Fusion Module (MFM) strengthens the model's ability to extract, integrate, and exploit complementary spatial features by efficiently modeling intralayer and interlayer spatial relationships, improving the model's geospatial expressiveness. In addition, prompt learning facilitates the transfer of bimodal knowledge, injecting high-level spatial information from heavy-weight networks into a lightweight unimodal model and ensuring both computational efficiency and segmentation accuracy. Experimental results demonstrate that MPLNet achieves state-of-the-art performance on TV-RSI and two publicly available RSI datasets, confirming its effectiveness in geospatial feature recognition, land-use classification, and spatial pattern analysis, and making it a valuable advancement in GIS-driven remote sensing applications.

Summary of evidence. On TV-RSI, MPLNet achieves 90.10% mAcc and 82.57% mIoU, outperforming strong CNN/Transformer baselines under identical splits and training protocol. On the ISPRS benchmarks, MPLNet reaches 90.43% mAcc and 83.31% mIoU on Vaihingen, and 90.12% mAcc and 79.02% mIoU on Potsdam, while delivering notable gains on thin and elongated classes (e.g., roads, drainage), which are typically challenging in RSI segmentation.

Efficiency and ablations. The Mamba Fusion Module (MFM) enhances long-range contextual reasoning with negligible parameter overhead relative to a ResNet-50 stage. Prompt learning (PL) is applied only during training to inject cross-modal priors and is disabled at inference; the deployed model therefore preserves the unimodal student's latency and memory footprint. Controlled ablations indicate that MFM primarily improves boundary localization and thin structures, while PL further reduces confusion among texture-similar land covers (e.g., vegetation vs. agricultural land), without increasing inference complexity.

Limitations and future directions. Despite the overall gains, performance on small and sparse instances (e.g., cars in Vaihingen) remains comparatively lower, suggesting room for instance-aware decoding or boundary-refinement heads. Future work will explore (i) instance-level supervision and uncertainty-sensitive loss design for small objects, (ii) domain adaptation across sensors and seasons to mitigate distribution shift, and (iii) tighter coupling with GIS priors (e.g., road topology or hydrological constraints) to further stabilize predictions in cluttered rural scenes. We will also investigate semi-supervised extensions to reduce annotation costs in new regions while maintaining the zero-overhead inference of the unimodal student.

Beyond benchmarks, MPLNet's precision on elongated structures (roads, drainage) and in boundary localization is directly actionable for traditional village conservation. It supports (i) delineation of hydrological corridors and ecological buffers, (ii) extraction of road networks for accessibility planning, and (iii) identification of heritage building blocks through cleaner parcel boundaries. Since deployment is unimodal and lightweight, the system fits in-field GIS toolchains where RGB imagery is prevalent and computation is constrained.
