Abstract
This paper proposes the PGD model for UAV path planning in complex terrain, addressing key challenges such as high-dimensional state processing, blind path exploration, and poor cross-scene adaptability. The PGD model integrates Transformer, GAN, and DDPG, forming a “compression–generation–optimization” closed-loop system. The Transformer module compresses high-dimensional terrain data, alleviating training bottlenecks, while the GAN module generates high-quality candidate paths, reducing ineffective exploration. DDPG then optimizes the path planning strategy efficiently. Experimental results demonstrate the superior performance of PGD on the UAVDT (suburban) and AirSim (canyon) datasets. In terms of path length (Pl), PGD achieves 20.0 m/22.0 m, compared to baseline models such as PPO-DRL (23.8 m) and Soft Actor-Critic (24.0 m). PGD also outperforms in collision rate (Cr) with 2.5%/3.0% and computational efficiency (Tc) with 13.5 s/16.0 s, respectively. The PGD model shows significant improvements in path planning efficiency and adaptability, particularly in high-complexity terrains. Compared to traditional models, PGD’s multi-module synergy enhances feature correlation and physical path constraints, offering a novel framework for intelligent planning in complex environments. Future work will focus on enhancing model adaptability to extreme weather and multi-agent collaborative scenarios.
Citation: Liu L, Li X, Meng L, Zhao Y, Lv Y (2026) Path planning for UAVs in complex terrain based on the PGD model: Algorithmic improvements combining feature extraction and reinforcement learning. PLoS One 21(2): e0340394. https://doi.org/10.1371/journal.pone.0340394
Editor: Claudionor Ribeiro da Silva, Universidade Federal de Uberlandia, BRAZIL
Received: August 21, 2025; Accepted: December 19, 2025; Published: February 3, 2026
Copyright: © 2026 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Our dataset is publicly available, and can be accessed using the following links: AirSim Dataset: https://universe.roboflow.com/airsim-gate/airsim-drone-gate UAVDT Dataset: https://opendatalab.com/OpenDataLab/UAVDT.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Introduction
In complex terrain scenarios such as mountain power inspection and canyon geological monitoring, unmanned aerial vehicles (UAVs), due to their flexibility and maneuverability, have become essential tools to replace manual operations in high-risk and low-efficiency environments [1,2]. However, these environments are characterized by rugged terrain structures and irregularly distributed, dynamically changing obstacles (e.g., sudden rockfalls, vegetation growth) [3,4]. This places three core requirements on UAV path planning systems: collision-free path safety, path length efficiency, and real-time decision responsiveness [5,6]. Traditional path planning methods have gradually shown limitations in such environments, with studies identifying their key bottlenecks. Algorithms like A* [7] and Dijkstra [8,9], which are based on graph search, rely on manually designed cost functions. Although they can guarantee optimal solutions in static and structured environments, the balance between actual cost and heuristic estimation in their evaluation functions lacks generality [10–12]. In unstructured terrains, this often leads to paths that either cling too closely to obstacles or become unnecessarily long due to biased weight settings, making it difficult to balance safety and efficiency. Sampling-based methods such as RRT* overcome the dependency of graph search algorithms on environmental modeling by exploring high-dimensional spaces through random sampling [13,14]. However, studies have shown that as terrain complexity increases, the correlation between sampled points and the target region significantly decreases, resulting in an exponential decline in planning efficiency. In constrained spaces such as narrow canyons, dense local sampling often leads to suboptimal solutions [15].
The rise of deep learning has provided a new pathway to overcome these bottlenecks, yet existing techniques still face multidimensional challenges [16,17]. At the feature processing level, studies have confirmed the inherent limitations of different network architectures. CNNs excel at extracting local obstacle features from terrain images but, due to their fixed receptive fields, fail to capture long-range terrain correlations such as “mountain peak–valley” relationships, leading to insufficient perception of global path constraints and discontinuous trajectories in highly undulating mountain scenes [18–20]. RNNs, such as LSTMs, can model temporal sequences to process UAV motion states [21]; however, their gradient transmission properties make it difficult to learn long-term path strategies in highly dynamic terrains, often causing trajectory oscillations during complex turns. Graph Neural Networks (GNNs) can model the topological relationships between obstacles, yet in sparse terrain data (e.g., missing point cloud regions in remote mountains), insufficient node connections cause severe performance degradation, and they cannot directly output continuous action commands to meet UAV flight control requirements [22,23].
Reinforcement learning (RL) has made significant progress in autonomous decision-making. The Deep Deterministic Policy Gradient (DDPG) algorithm [24,25], due to its suitability for continuous action spaces, has become the mainstream framework for path planning. However, in high-dimensional state inputs (such as 3D point clouds and meteorological data that form thousand-dimensional features), DDPG suffers from slow convergence and training instability [26,27]. In complex terrains, it often requires tens of thousands of iterations to generate a feasible path. The collaboration between feature extraction and decision-making modules also faces a gap: Transformer [28–30], with its self-attention mechanism, excels at capturing global dependencies in high-dimensional data but is often treated as an independent preprocessing module, lacking dynamic interaction with subsequent decision-making processes, leading to potential loss of key path information in compressed features.
The path generation stage also presents a contradiction: Generative Adversarial Networks (GANs) [31,32] can generate diverse path solutions, but their generation process lacks sufficient encoding of the drone’s physical constraints (such as maximum climb angle and turning radius), resulting in “theoretically feasible but practically infeasible” paths (e.g., vertical climb trajectories on steep slopes); Variational Autoencoders (VAEs) [33] ensure generation stability through probabilistic modeling but struggle to handle sudden obstacles due to insufficient sample diversity. More critically, existing research often focuses on optimizing individual modules, lacking end-to-end collaboration across “feature compression – path generation – policy optimization.” The disconnection between feature extraction and path generation leads to candidate paths that do not satisfy the core constraints of the terrain, while the separation between path generation and policy optimization causes reinforcement learning to fall into ineffective exploration, preventing the full potential of each module from being realized. Even with the emergence of hybrid or hierarchical reinforcement learning models in recent years (such as PPO with attention mechanisms and hybrid graph–RL methods), a complete pipeline of “high-dimensional data processing – path generation constraints – continuous action optimization” has not yet been formed: either the lack of an active path-generation mechanism leads to blind exploration, or global terrain correlations and physical constraints cannot be jointly accommodated, so the core pain points of complex terrain remain unsolved.
To address the aforementioned issues, this paper proposes a Perception-Generation-Decision (PGD) model that integrates feature extraction and reinforcement learning. This model achieves efficient path planning in complex terrain through the organic collaboration of Transformer, GAN, and DDPG. Its core innovations are reflected in three aspects. First, it constructs a dynamic collaborative feature-compression mechanism, utilizing the Transformer’s self-attention mechanism to focus on the key associations among “UAV–obstacle–inspection point.” While compressing high-dimensional terrain data, it retains key decision-making information through dynamic interaction with subsequent modules, alleviating the training pressure of reinforcement learning. Second, it designs a constraint-aware path-generation framework, using the simplified features output by the Transformer as conditions to guide the GAN in generating high-quality path candidates that conform to physical constraints, providing “prior experience” for reinforcement learning and reducing ineffective exploration. Third, it establishes a closed-loop collaborative optimization system, enabling the feature-compression results to serve simultaneously as path-generation constraints and reinforcement-learning state inputs. Path candidates serve as “demonstration data” for policy optimization, accelerating convergence and forming a complete, complementary link. This systematically solves the core challenges of UAV path planning in complex terrain, particularly high-dimensional state processing, blind path exploration, and cross-scene adaptability.
Materials and methods
Problem description
To address the core requirement of drone inspection path planning in complex terrains, namely achieving efficient and safe path decisions in high-dimensional state spaces, this paper abstracts the problem as a constrained continuous-state Markov Decision Process (MDP). The focus is on capturing the high-dimensional characteristics of the state space, the continuity of action decisions, and an objective function that aligns with the “Feature Compression – Path Generation – Policy Optimization” framework of the PGD model, providing clear boundaries for the subsequent module design.
The state space is the core input for the model’s perception of the complex terrain, and its high dimensionality is the key bottleneck behind the inefficiency of traditional reinforcement learning. Let the state space be S. The state vector must fully include three categories of information. First, the drone’s own state: three-dimensional coordinates (x, y, z) (x, y are horizontal positions, z is altitude), heading angle θ, flight speed v, and climb rate v_z, which directly determine the physical basis for action execution. Second, the environmental features of the complex terrain, extracted via LiDAR or 3D point cloud data and represented by the relative coordinates (Δx_i, Δy_i, Δz_i) of the k obstacles closest to the drone (Δx_i, Δy_i, Δz_i are the relative distances of the i-th obstacle along the three axes); this information is highly unstructured in mountainous and canyon environments and is the main source of state-dimension inflation. Third, the inspection task information: the coordinates (x_j, y_j, z_j) of the remaining target points and their respective distances d_j to the drone (m is the number of remaining points), ensuring that path planning stays aligned with the task goals. The state vector takes the form:

s = [x, y, z, θ, v, v_z, Δx_1, Δy_1, Δz_1, …, Δx_k, Δy_k, Δz_k, x_1, y_1, z_1, d_1, …, x_m, y_m, z_m, d_m]
In typical complex terrains (such as canyons with multiple obstacles), the increase in k and m can cause the state dimension to exceed a thousand, which is the core motivation for introducing the Transformer for feature compression: by focusing on the key associations among “drone–obstacles–inspection points,” the computational burden on subsequent modules is reduced.
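As a concrete illustration of this bookkeeping, the flat state vector can be assembled from the three information categories as follows (a minimal sketch of my own, not the authors’ code):

```python
import numpy as np

def build_state(uav, obstacles, targets):
    """Assemble the flat MDP state vector: UAV state (6 entries) +
    k obstacles x 3 relative coords + m targets x (3 coords + distance)."""
    s = [uav["x"], uav["y"], uav["z"], uav["theta"], uav["v"], uav["vz"]]
    for dx, dy, dz in obstacles:                 # k nearest obstacles
        s += [dx, dy, dz]
    pos = np.array([uav["x"], uav["y"], uav["z"]], dtype=float)
    for t in targets:                            # m remaining inspection points
        t = np.asarray(t, dtype=float)
        s += [*t, float(np.linalg.norm(t - pos))]
    return np.array(s, dtype=float)

uav = {"x": 0.0, "y": 0.0, "z": 10.0, "theta": 0.0, "v": 5.0, "vz": 0.0}
# k = 4 obstacles and m = 2 targets -> dimension 6 + 3*4 + 4*2 = 26
state = build_state(uav, obstacles=[(3, 0, 1)] * 4, targets=[(30, 40, 10)] * 2)
```

With realistic k and m (dozens of obstacles, many inspection points) the same bookkeeping quickly pushes the dimension past a thousand, which is the bottleneck the Transformer module compresses away.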
The action space needs to match the continuous control characteristics of the drone, while providing output dimensions suitable for the policy optimization of DDPG. Let the action space be A; the action vector is defined as a combination of three continuous control quantities: speed adjustment Δv, heading angle adjustment Δθ, and climb angle adjustment Δϕ, i.e.:

a = [Δv, Δθ, Δϕ]

Each component must satisfy physical constraints: Δv ∈ [–Δv_max, Δv_max] (limiting the speed change range to avoid overloading the power system), Δθ ∈ [–Δθ_max, Δθ_max] (limiting the steering range to ensure flight stability), and Δϕ ∈ [–Δϕ_max, Δϕ_max] (limiting climb/descent angles to accommodate terrain slope changes). These constraints serve as hard conditions when generating path candidates with the GAN, ensuring the generated path segments are physically executable.
The design of the objective function needs to balance the core metrics of path planning while providing a basis for the reinforcement learning reward mechanism. This paper aims to minimize the comprehensive cost J, expressed as:

J = w_1 J_len + w_2 J_col + w_3 J_task + w_4 J_smooth

where J_len is the path length cost (total flight distance), promoting path economy; J_col is the collision cost (increasing sharply when the distance to obstacles falls below a safety threshold d_safe), ensuring safety; J_task is the task cost (the weighted sum of distances to remaining inspection points), ensuring task completion; and the newly added J_smooth is the smoothness cost (the change in angle between adjacent actions), reducing unnecessary sharp turns in complex terrains; this metric aligns with the smoothness constraint of the paths generated by the GAN. The coefficients w_1–w_4 are weights that can be adjusted dynamically according to the scene (e.g., increasing w_2 and w_4 in canyon scenarios to prioritize obstacle avoidance and smooth flight).
In terms of dynamic constraints, the state transition of the drone must follow physical laws: the position at the next time step is determined by the current state and action, i.e.:

x_{t+1} = x_t + v_t cos(θ_t) cos(ϕ_t) Δt
y_{t+1} = y_t + v_t sin(θ_t) cos(ϕ_t) Δt
z_{t+1} = z_t + v_t sin(ϕ_t) Δt

where Δt is the sampling interval. This transition rule serves as the foundation for the Transformer to extract temporal features, while providing real environmental feedback for the policy evaluation of DDPG.
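One transition step under this point-mass kinematic model (heading angle θ, climb angle ϕ, speed v) can be sketched as follows; this is an illustration consistent with the constraints described in the text, not the authors’ simulator:

```python
import math

def step(x, y, z, v, theta, phi, dt):
    """One kinematic update: project speed v through heading theta and
    climb angle phi over the sampling interval dt."""
    x += v * math.cos(theta) * math.cos(phi) * dt
    y += v * math.sin(theta) * math.cos(phi) * dt
    z += v * math.sin(phi) * dt
    return x, y, z

# Level flight along the x-axis at 5 m/s for 0.1 s advances x by 0.5 m.
p = step(0.0, 0.0, 10.0, 5.0, 0.0, 0.0, 0.1)
```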
Datasets
To thoroughly validate the performance of the PGD model in path planning for complex terrains, this paper selects two representative public datasets: the AirSim Dataset (https://universe.roboflow.com/airsim-gate/airsim-drone-gate) and the UAVDT Dataset (https://opendatalab.com/OpenDataLab/UAVDT). The selection criteria are based on three dimensions: first, the complexity of the terrain, which should include features such as mountains and canyons that match the research scenario; second, data completeness, which should provide full-dimensional information such as drone states, environmental features, and trajectory annotations to support model training; third, scene authenticity, balancing both simulator-generated data and real-world data to validate the model’s generalization ability.
The AirSim Dataset [34] is a drone simulation dataset developed by Microsoft Research, with its core advantage being the ability to customize complex terrain environments, along with comprehensive data dimensions. The dataset is built on the Unreal Engine and includes various pre-configured complex terrain scenarios (such as mountains, canyons, and urban-rural junctions). For this study, the “Canyon Mountain” subscene is selected, which simulates rugged terrain with an elevation difference of over 500 meters, featuring natural obstacles such as rocks and trees, as well as man-made inspection targets such as transmission towers, closely matching the “complex terrain inspection” scenario studied here. Each sample in the dataset contains the drone’s six-degrees-of-freedom state (position, attitude, speed), LiDAR point cloud data (used to extract obstacle features), high-definition RGB images (used for visual-assisted positioning), and manually annotated optimal inspection trajectories (used as reference benchmarks).
The UAVDT Dataset [35] is a real-world drone video and trajectory dataset, primarily collected from suburban and urban-rural junction scenes, including flight data under different weather conditions such as sunny and cloudy days. The core value of this dataset lies in its provision of real terrain physical characteristics (such as actual obstacle distribution and terrain undulation patterns), which can effectively validate the model’s adaptability in non-simulation environments. The dataset includes GPS trajectories of the drone (converted from latitude and longitude to 3D coordinates), flight speed and heading angle records, ground-truth obstacle annotations (such as the position and height of houses and trees), and corresponding aerial image sequences for each scene.
A comparison of the key information from both datasets is shown in Table 1. By training the model on the AirSim Dataset, the controllable variables of complex terrain (such as obstacle density and terrain slope) can be fully utilized to validate the core improvements of the PGD model; testing the model on the UAVDT Dataset enables the evaluation of the model’s generalization performance in real-world scenarios. The combination of both datasets provides a complete experimental data support system.
Regarding data accuracy, the terrain simulation parameters and UAV physical models in the AirSim Dataset are calibrated against real flight data, keeping the deviation between simulated and actual trajectories within a bounded range. The UAVDT Dataset corrects GPS data through differential positioning and verifies obstacle annotations using LiDAR and manual review, ensuring static obstacle position errors of < 0.5 m and height errors of < 0.3 m, which guarantees data reliability. In terms of information completeness, AirSim includes meteorological-interference data, and UAVDT includes samples of illumination and vegetation changes across multiple time periods; both cover the core dimensions of UAV status and environmental characteristics required for the entire model training and testing process, without the need for additional data. In addition, neither dataset imposes notable limitations on generalization ability.
PGD model architecture
The PGD (Perception-Generation-Decision) model constructs a complete solution for drone inspection path planning in complex terrains with a “Feature Compression – Path Generation – Policy Optimization” collaborative logic. Its core lies in the organic linkage of three modules: transforming high-dimensional terrain data into features that can be efficiently processed, generating constraint-compliant path candidates, and ultimately producing the globally optimal path. The overall framework is shown in Fig 1.
From the data flow in the model diagram (Fig 2), we can see that the raw input (such as 3D terrain point clouds, drone states, and inspection point coordinates) first enters the Transformer module. After key relational features are refined through the self-attention mechanism, a compressed state vector is output. This step serves both as a “dimensionality reduction” of the high-dimensional state space and as an “information purification” process that provides precise inputs for subsequent modules. Next, the GAN module uses this low-dimensional feature as a constraint to generate multiple local path candidates, which are then filtered by the discriminator to provide “high-quality exploration starting points” for the decision-making process. Finally, the DDPG module takes the compressed state vector as the environmental perception input and the path candidates filtered by the GAN as prior experience, iterating through reinforcement learning to output the globally optimal path that satisfies safety, efficiency, and task requirements. This “Perception-Generation-Decision” closed-loop architecture specifically addresses the training efficiency issues caused by high-dimensional states in complex terrains and improves the overall performance of path planning through the information flow between modules.
Transformer module.
The Transformer module serves as the “perceptual core” of the PGD model, responsible for processing complex high-dimensional terrain information. Its primary function is to extract key features from raw high-dimensional terrain data and compress the state space, thereby providing concise and effective input for the subsequent GAN-based path generation and DDPG-based policy optimization. In complex terrains, UAV decision-making depends on multidimensional correlations (e.g., spatial relations between obstacles and inspection points, local terrain constraints on global paths). Traditional feature extraction methods such as CNNs and RNNs struggle to capture long-range dependencies in such unstructured data, leading to redundant state spaces and loss of crucial information. The self-attention mechanism of the Transformer effectively addresses this issue by dynamically assigning attention weights to focus on core features, thereby achieving “redundancy reduction and key information preservation” during state compression [36].
The network architecture follows the process of “input layer → multi-head self-attention layer → feed-forward network → output layer.” The input layer receives raw data including the UAV state (position (x, y, z), velocity v, heading angle θ, and climb rate v_z), obstacle features extracted from 3D terrain point clouds (the 3D relative coordinates (Δx_i, Δy_i, Δz_i) of the k nearest obstacles and their distances), and inspection point information (the 3D coordinates (x_j, y_j, z_j) of the remaining m inspection points and their remaining distances d_j). To eliminate the interference caused by differences in dimensional scales across multiple data sources during attention computation, the input layer performs standardization: spatial parameters such as position, obstacle distances, and inspection distances are normalized to the range [0,1], while motion parameters such as velocity, heading angle, and climb rate are normalized to [–1,1]. After normalization, the features are concatenated into a high-dimensional vector of dimension d. In complex terrains, where k and m are large, d can exceed 1000 dimensions.
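The two normalization ranges can be sketched as simple per-feature scaling (the scale factors are assumed per-feature maxima, a detail the paper does not specify):

```python
import numpy as np

def normalize_inputs(spatial, motion, spatial_max, motion_max):
    """Scale spatial features to [0, 1] and motion features to [-1, 1]
    before concatenation, mirroring the Transformer input layer."""
    s = np.asarray(spatial, dtype=float) / spatial_max        # -> [0, 1]
    m = np.asarray(motion, dtype=float) / motion_max          # -> [-1, 1]
    return np.concatenate([np.clip(s, 0.0, 1.0), np.clip(m, -1.0, 1.0)])

x = normalize_inputs(spatial=[50.0, 200.0], motion=[5.0, -0.25],
                     spatial_max=200.0, motion_max=5.0)
```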
The multi-head self-attention layer is the core for capturing feature correlations. It employs eight parallel attention heads, each focusing on different dimensions of feature dependencies to avoid blind spots in complex relational modeling. Specifically, two heads capture “UAV–obstacle” distance constraints, emphasizing spatial relationships between obstacle coordinates and the UAV position and filtering the features of obstacles within the safety threshold. Three heads focus on “inspection point–terrain slope” adaptability by combining local slope features extracted from the terrain point cloud to assess path feasibility between inspection points. The remaining three heads target “obstacle–obstacle” topological relations, identifying traversable gaps between clustered obstacles to aid path planning. Each attention head operates through the Query (Q), Key (K), and Value (V) matrices. Both Q and K are set to 64 dimensions (d_k = 64), balancing fine-grained dependency modeling with computational efficiency. Q represents the query of the current feature, K provides reference indices for all features, and their dot product measures feature similarity. The scaling factor √d_k prevents large inner products from driving the softmax into regions with vanishing gradients. The weighted output is computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where d_k is the dimension of Q and K; Q = XW_Q, K = XW_K, and V = XW_V; W_Q, W_K, and W_V are learnable parameter matrices; and X denotes the high-dimensional input feature vector.
The outputs of the eight attention heads (each 64-dimensional) are concatenated into a 512-dimensional feature vector and then passed through a linear projection W_O for feature fusion:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W_O

where h = 8 and W_O is the fusion matrix, yielding a 512-dimensional intermediate representation. This operation captures both local key features (e.g., nearby obstacles) and global dependencies (e.g., distant inspection points guiding the route), forming a rich foundation for feature compression.
The feed-forward network performs dimensionality compression through a two-step transformation: “high-dimensional mapping → nonlinear activation → low-dimensional projection.” First, a linear transformation W_1 (512 × 2048) maps the features to a 2048-dimensional space, followed by a ReLU activation, ReLU(W_1 x + b_1), to enhance the nonlinear representation of the critical “UAV–obstacle–inspection point” relationships. Then, another linear layer W_2 (2048 × 128) compresses the features into 128 dimensions, with the bias term b_2 fine-tuning the feature distribution to produce the low-dimensional output s_comp ∈ ℝ^128. This compression reduces the state space by over 80% while maintaining essential decision information through interactive feedback between the feed-forward network and the attention layer (intermediate features are temporarily stored and used to adjust attention weights). This design ensures that obstacle-avoidance and inspection-related information is preserved, effectively reducing computational overhead in the subsequent GAN and DDPG modules.
In the overall process of the PGD model, the output of the Transformer module, s_comp, has a dual function: on the one hand, it serves as the conditional input to the GAN module, incorporating compressed terrain-constraint information (such as obstacle distribution and inspection point locations) into the path-generation process, preventing the GAN from generating paths that do not conform to the actual terrain; on the other hand, it directly serves as the state input to the DDPG module, reducing the interference of high-dimensional redundant information with reinforcement-learning strategy optimization and lowering the probability of ineffective exploration. This “one output, two uses” design enables feature extraction to work closely with subsequent stages, avoiding the information-gap problem caused by traditional independent preprocessing modules.
GAN module.
The GAN module plays the role of the “creative artisan.” It generates diverse local path candidates that comply with physical constraints based on the low-dimensional state vector output by the Transformer module. These candidates provide rich, high-quality “prior experience” for policy optimization in the DDPG module, greatly reducing blind exploration in complex terrains during reinforcement learning.
The GAN module consists of a Generator (G) and a Discriminator (D), which continuously optimize through adversarial training while forming a parameter-sharing and gradient-coordination mechanism with the Transformer and DDPG modules [37]. This ensures the physical feasibility and practical flight adaptability of the generated paths. The input of the Generator is a noise vector z randomly sampled from a normal distribution, with dimension d_z (set to 64 in this paper). This noise vector serves as the “creative seed” for generating the path. The core function of the Generator is to convert the noise vector into meaningful local path segments, with the output being a sequence of coordinates G(z) representing the UAV flight trajectory. Each coordinate includes three-dimensional position information (x, y, z) and flight state parameters (linear velocity v, angular velocity ω, heading angle ψ, climb angle ϕ). Let the output dimension of the Generator be d_G. The Generator is constructed as a multi-layer fully connected network with activation functions (such as ReLU and Tanh), gradually mapping the low-dimensional noise to the high-dimensional path space. The computation can be represented as:

G(z) = Tanh(W_L · ReLU(W_{L−1}(⋯ ReLU(W_1 z + b_1) ⋯) + b_{L−1}) + b_L)

where W_1, …, W_L are weight matrices and b_1, …, b_L are bias terms. The Generator shares the initial feature-extraction layer parameters with the Transformer module to ensure consistent transmission of terrain constraint information. To satisfy UAV dynamics constraints, the Generator’s output layer employs a “Tanh function mapping + threshold clipping” dual hard-constraint mechanism: the linear velocity v is constrained within [0, v_max], the angular velocity ω is limited to [–ω_max, ω_max], the heading angle change Δψ is controlled within [–Δψ_max, Δψ_max], and the climb angle ϕ is constrained within [–ϕ_max, ϕ_max], fully matching the small UAV’s power-system load threshold and flight-stability requirements.
The Discriminator’s task is to distinguish between the paths generated by the Generator, G(z), and feasible paths x_real from the real environment. Its input includes both the generated path G(z) and valid path data from real scenes or expert annotations (processed into the same format as G(z)). The Discriminator outputs a scalar value D(x) or D(G(z)), representing the probability that the input path is real; the closer the value is to 1, the more the Discriminator believes the path is real. The Discriminator is also constructed as a multi-layer fully connected network, and its computation is:

D(x) = σ(W_{D,2} · ReLU(W_{D,1} x + b_{D,1}) + b_{D,2})

where σ is the Sigmoid activation function that compresses the output to the range [0,1], and the W_D and b_D terms are the learnable parameters of the Discriminator. The loss gradients of the Discriminator are backpropagated to the Transformer and Generator modules, dynamically adjusting the feature-compression focus and the path-generation strategy, thus enhancing the encoding accuracy of the physical constraints.
The training process of the GAN module follows the principle of a zero-sum game, incorporating a physical feasibility loss term into the joint loss function of the PGD model. The objective function is optimized as:

min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))] + λ_phy L_phy

where λ_phy is the physical-loss weight coefficient and L_phy is the physical feasibility loss term, consisting of three parts: a path curvature loss L_curv, which constrains the curvature variation rate between consecutive path points to not exceed the UAV’s turning limit; a velocity continuity loss L_vel, which penalizes sudden changes by computing the velocity difference between adjacent path points; and an obstacle distance deviation loss L_obs, based on the Euclidean distance:

L_obs = Σ_p max(0, d_safe − min_{o∈O} ‖p − o‖)

where O is the obstacle set and p is a path point, penalizing path segments that come closer to obstacles than the safety threshold d_safe. In the overall training process, the GAN module adopts a two-step “pre-training → joint training” strategy: first, the GAN is independently pre-trained on real path samples with physical-constraint labels to ensure that the generated paths meet basic dynamics and obstacle-avoidance requirements; then, the pre-trained GAN is jointly trained with the Transformer and DDPG. The joint loss L_joint = α L_GAN + β L_DDPG (where α and β are balancing weights) is used to balance the optimization objectives of each module, achieving cross-module collaborative convergence.
In the overall flow of the PGD model, the low-dimensional state vector output by the Transformer module is incorporated into the GAN module’s generation process as additional conditional information. When generating paths, the Generator relies not only on the noise vector but also on the terrain constraints (such as obstacle distribution and inspection point locations) contained in the low-dimensional state vector, ensuring the generated paths are meaningful in complex terrains. The specific implementation is to concatenate the low-dimensional state vector with the noise vector and feed the combined vector into the generator network, i.e., G(z ⊕ s_comp), where ⊕ denotes the concatenation operation and s_comp
is the low-dimensional state vector output by the Transformer module. This design ensures that the generated paths are both diverse and accurately fit the real-world terrain, providing high-quality path exploration starting points for the DDPG module and significantly improving the efficiency and effectiveness of subsequent policy optimization.
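The conditioning step itself is a plain concatenation: with d_z = 64 and the 128-dimensional compressed state, the generator sees a 192-dimensional input (a sketch; the zero vector stands in for a real Transformer output):

```python
import numpy as np

def generator_input(noise, s_comp):
    """Condition the generator on terrain features by concatenating the
    64-d noise seed with the 128-d compressed state from the Transformer."""
    return np.concatenate([noise, s_comp])

z = np.random.default_rng(1).normal(size=64)   # "creative seed"
s_comp = np.zeros(128)                          # placeholder compressed state
g_in = generator_input(z, s_comp)
```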
DDPG module.
The DDPG module serves as the ultimate “decision engine," tasked with outputting the optimal action sequence based on the environmental state and generating the globally optimal inspection path. To address the training instability and slow convergence issues that commonly arise in high-dimensional continuous spaces, this module enhances performance through multidimensional stability designs and cross-module coordination mechanisms. Its core advantage lies in leveraging the Deterministic Policy Gradient (DPG) algorithm [25,38], which precisely explores and rapidly converges to the optimal strategy in continuous action spaces, perfectly aligning with the UAV’s practical need to continuously and precisely adjust flight parameters in complex terrain.
DDPG adopts the classic Actor-Critic architecture, consisting of the Policy Network (Actor) and the Value Network (Critic), which work together through interleaved training to generate realistic and effective paths. The Actor network takes the low-dimensional state vector $s$ output by the Transformer module as input, processes it through multiple fully connected layers, and directly outputs the deterministic action command $a = \mu(s \mid \theta^{\mu})$, where $\theta^{\mu}$ are the parameters of the Actor network and $\mu$ denotes the policy function. This action command includes the drone’s speed adjustment $\Delta v$, heading angle adjustment $\Delta \psi$, and climb angle adjustment $\Delta \gamma$; these continuous actions directly determine the drone’s flight trajectory in complex terrains. Unlike stochastic policies that output actions based on probability distributions, the deterministic policy quickly locates superior actions in high-dimensional continuous action spaces, greatly improving decision-making efficiency, which is crucial for real-time obstacle avoidance and path planning in narrow canyons and other terrains.
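The bounded three-channel action head can be sketched as a fully connected layer with tanh squashing; the layer sizes and action bounds below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Minimal sketch of a deterministic actor head: a fully connected layer
# followed by tanh squashing, scaled to per-channel bounds for
# [delta_v, delta_psi, delta_gamma]. Bounds and sizes are illustrative.

ACTION_BOUNDS = np.array([2.0, 0.5, 0.3])   # assumed max |dv|, |dpsi|, |dgamma|

def actor(state, W, b):
    """Deterministic policy a = mu(s): bounded continuous action."""
    return ACTION_BOUNDS * np.tanh(W @ state + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 128)) * 0.01    # 128-d compressed state -> 3 actions
b = np.zeros(3)
a = actor(rng.standard_normal(128), W, b)
assert a.shape == (3,) and np.all(np.abs(a) <= ACTION_BOUNDS)
```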
The Critic network, on the other hand, takes the state vector $s$ and the action $a$ output by the Actor network as joint input, constructing the Q-value function $Q(s, a \mid \theta^{Q})$ to evaluate the long-term reward of the current state-action pair, where $\theta^{Q}$ are the parameters of the Critic network. The core task of the Critic network is to accurately estimate the Q-value, providing precise guidance for the Actor network’s policy update. Its output reflects the expected future cumulative reward after taking a specific action in the current state. For example, if the drone is near an obstacle, and the Critic network evaluates the Q-value of an avoidance action as low, it indicates that the action might increase the collision risk or reduce the path efficiency, prompting the Actor network to adjust its strategy.
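The critic's joint (state, action) input can be sketched as follows; the hidden-layer size and weight initialization are illustrative, not the paper's architecture:

```python
import numpy as np

# Minimal sketch of the critic's joint input: state and action are
# concatenated and scored by a small network Q(s, a). Sizes are illustrative.

def critic(state, action, W1, b1, w2):
    x = np.concatenate([state, action])   # joint (s, a) input
    h = np.maximum(0.0, W1 @ x + b1)      # ReLU hidden layer
    return float(w2 @ h)                  # scalar Q-value

rng = np.random.default_rng(1)
s, a = rng.standard_normal(128), rng.standard_normal(3)
W1 = rng.standard_normal((64, 131)) * 0.05
q = critic(s, a, W1, np.zeros(64), rng.standard_normal(64) * 0.05)
assert isinstance(q, float)
```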
During training, the DDPG module updates the Actor network parameters using the Deterministic Policy Gradient theorem. The policy gradient formula is:

$\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s \sim \rho^{\beta}}\left[\left.\nabla_{a} Q(s, a \mid \theta^{Q})\right|_{a=\mu(s)} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\right]$

where $\nabla_{\theta^{\mu}} J$ represents the gradient of the objective with respect to the policy parameters $\theta^{\mu}$, $\mathbb{E}_{s \sim \rho^{\beta}}$ denotes the expectation over states sampled according to the behavior policy $\beta$, $\nabla_{a} Q(s, a \mid \theta^{Q})$ is the gradient of the Q-value function with respect to the action $a$ evaluated under the current policy $a = \mu(s)$, and $\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})$ is the gradient of the policy function $\mu$ with respect to the parameters $\theta^{\mu}$. This formula shows that the Actor network updates its parameters by maximizing the Q-value output from the Critic network, i.e., adjusting the policy in the direction that increases the Q-value, achieving policy optimization.
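The chain-rule structure of this update can be demonstrated on a toy 1-D problem: a linear policy μ(s) = θs and a hand-made critic Q(s, a) = −(a − 2s)², which is maximized at a* = 2s, so gradient ascent should drive θ toward 2. Everything here is illustrative, not the paper's networks:

```python
import random

# Toy 1-D deterministic policy gradient: theta is updated along
# dQ/da * dmu/dtheta, ascending the critic's value. The critic and the
# target parameter (theta* = 2) are hand-made for illustration.

def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)      # gradient of Q(s,a) = -(a - 2s)^2 w.r.t. a

theta, lr = 0.0, 0.05
rng = random.Random(0)
for _ in range(500):
    s = rng.uniform(-1.0, 1.0)       # state sampled from a behavior policy
    a = theta * s                    # deterministic action mu(s) = theta * s
    theta += lr * dq_da(s, a) * s    # chain rule: dQ/da * dmu/dtheta
assert abs(theta - 2.0) < 0.05       # converges to the optimal policy
```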
To enhance training stability, the DDPG module introduces a target network mechanism, creating separate target networks for both the Actor and Critic networks (with parameters $\theta^{\mu'}$ and $\theta^{Q'}$, respectively). The parameters of the target networks are not updated in real time but are slowly adjusted through a soft update mechanism. The soft update formula is:

$\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$

where $\theta$ represents the current network parameters, $\theta'$ represents the target network parameters, and $\tau$ is the soft update coefficient (a small constant, $\tau \ll 1$). This method ensures that the target network parameters remain relatively stable over time, reducing large fluctuations in target values during training and aiding model convergence.
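The soft update is a simple element-wise blend; a minimal sketch follows (τ = 0.005 is an illustrative value, not necessarily the paper's):

```python
# Soft (Polyak) target-network update: theta' <- tau*theta + (1-tau)*theta',
# applied element-wise. tau = 0.005 is illustrative.

def soft_update(current, target, tau=0.005):
    return [tau * c + (1.0 - tau) * t for c, t in zip(current, target)]

current = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]
for _ in range(1000):
    target = soft_update(current, target)
# After many updates the target parameters closely track the current ones.
assert all(abs(c - t) < 0.05 for c, t in zip(current, target))
```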
To address the instability problem in training that occurs in high-dimensional continuous spaces, this paper designs a complete training hyperparameter setup and multiple stability enhancements. The experience replay pool capacity is set to 100,000, with a batch sampling size of 64. The Actor network learning rate is set to 5e-5, the Critic network learning rate to 1e-4, the discount factor $\gamma$ as listed in Table 2, the gradient clipping threshold to 1.0, and the target network update interval to 5 steps. In terms of stability measures, a Prioritized Experience Replay (PER) mechanism is adopted, which assigns sampling weights based on the TD error of the samples and corrects the bias through importance sampling to improve the utilization of effective samples. Additionally, L2 regularization (with a weight decay coefficient of 1e-5) is applied to the output layer of the Critic network to suppress overfitting. The exploration noise is dynamically adjusted (initial variance of 0.2, linearly decaying to 0.01 over training iterations) to balance exploration and exploitation efficiency. By combining the Transformer module, which compresses the 1024-dimensional raw state to 128 dimensions, with the high-quality candidate paths generated by the GAN module, which narrow the effective exploration space, training fluctuations are collaboratively suppressed along four dimensions: input preprocessing, sample sampling strategy, network regularization, and exploration mechanism. This ensures the model’s stable convergence in the high-dimensional state space of complex terrain.
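The linear noise-decay schedule (from variance 0.2 to 0.01, per the paper) can be sketched as below; the total number of decay steps is an illustrative assumption:

```python
# Linear decay of the exploration-noise variance from 0.2 to 0.01, as
# described above, then clamped. The decay horizon (total=10_000 steps)
# is an illustrative assumption.

def noise_variance(step, total=10_000, start=0.2, end=0.01):
    frac = min(step, total) / total
    return start + (end - start) * frac

assert abs(noise_variance(0) - 0.2) < 1e-12
assert abs(noise_variance(10_000) - 0.01) < 1e-12
assert abs(noise_variance(20_000) - 0.01) < 1e-12   # clamped after decay ends
```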
The DDPG module works closely with the Transformer and GAN modules. The low-dimensional state vector output by the Transformer module provides DDPG with concise and key environmental information, reducing the interference of high-dimensional state spaces in reinforcement learning; the high-quality path candidates generated by the GAN module serve as “prior knowledge" for DDPG’s exploration strategy, guiding the Actor network to search within a reasonable action space, significantly reducing blind exploration in reinforcement learning, accelerating policy convergence, and ultimately outputting the globally optimal drone inspection path that satisfies complex terrain constraints.
Module integration and data flow.
The PGD model forms an efficiently coordinated whole through tight integration and orderly data flow, collectively achieving the goal of drone inspection path planning in complex terrains.
Starting with data input, the raw high-dimensional terrain data (including complex terrain representations via 3D point cloud information, precise state parameters of the drone, and detailed coordinates of inspection points) is fed into the Transformer module. Leveraging the self-attention mechanism, Transformer captures key features in the data, such as accurately locating the distance between the drone and obstacles, analyzing the accessibility of inspection points, and so on. It then compresses this information and outputs a low-dimensional state vector. This vector not only significantly reduces the data dimensionality but also distills critical information for path planning, laying the foundation for the efficient operation of subsequent modules.
The low-dimensional state vector, as a key “ingredient," flows concurrently into both the GAN and DDPG modules. In the GAN module, the generator uses the terrain constraint information in the low-dimensional state vector, combined with a random noise vector, to generate multiple local path candidates. The discriminator, by comparing real path samples with generated paths, filters out those that are closer to the real scenario and meet physical feasibility and safety constraints. These high-quality path candidates, after filtering, are passed to the DDPG module, providing valuable “prior experience" for its policy optimization and significantly reducing the scope and time spent on blind exploration in the continuous action space.
At the same time, in the DDPG module, the Critic network evaluates the long-term reward of the current state-action pair using the low-dimensional state vector output by the Transformer and the action generated by the Actor network. The Actor network, based on the evaluation from the Critic network, adjusts its parameters continuously through the deterministic policy gradient algorithm to output more optimal action commands. In this process, the DDPG module fully leverages the environmental information provided by the Transformer and the path candidates generated by the GAN, gradually optimizing the strategy and ultimately outputting the globally optimal drone inspection path that satisfies complex terrain constraints.
The data flow of the entire PGD model functions like a precisely operating production line: from the “rough processing" of raw data (feature extraction and state compression by the Transformer module), to the “fine crafting" of intermediate products (path candidate generation and filtering by the GAN module), and finally to the “strict quality control and delivery" of the finished product (policy optimization and path output by the DDPG module). Each module performs its designated task while tightly collaborating. This integration effectively addresses the challenges posed by high-dimensional state spaces in complex terrains, significantly improving the efficiency and quality of drone inspection path planning, ensuring that drones can safely and efficiently complete inspection tasks in complex environments.
Experimental setup
Baseline Model Selection: The baseline models selected in this paper encompass mainstream path planning techniques such as metaheuristic algorithms, deep reinforcement learning (DRL), and hybrid methods to comprehensively validate the performance of the PGD model. Among them, IHSSAO (Improved Hybrid Squid-Shooter Ant Colony and Skyhawk Optimization Algorithm) [39] represents a metaheuristic algorithm, which enhances global search capability through tent chaos mapping and pinhole imaging contrastive learning, demonstrating excellent obstacle avoidance and convergence performance in complex terrain path planning; TD3 (Twin Delayed Deep Deterministic Policy Gradient) [40] and Q-DDPG (Quantum Enhanced DDPG) [41] represent typical variants of DRL, which alleviate the Q-value overestimation problem with a double critic network and reduce the computational complexity of high-dimensional state spaces by combining quantum computing, suitable for dynamic environments with continuous action spaces; MCD-DPG (Multicritic Delayed DDPG) [42] introduces multiple critic networks and state noise regularization to improve robustness in complex environments; PPO-DRL (Proximal Policy Optimization) [43,44] is widely used in continuous control tasks due to its sample efficiency and stability; Soft Actor-Critic (SAC) [45] balances exploration and exploitation through maximum entropy reinforcement learning, making it suitable for high-dynamic scenarios; D3QN (Dueling Double Deep Q-Network) [46] optimizes value function estimation in discrete action spaces, suitable for multi-objective path planning; Multi-Agent DRL [47,48] focuses on multi-agent collaborative scenarios, testing the scalability of PGD in collaborative tasks; FM-Planner [49] and MPC-RL [50] represent hybrid strategies combining sampling-based planning methods and model predictive control, covering both traditional planning and learning-based approaches.
These models have been widely validated in recent path planning research, and can form comparisons with the PGD model across multiple dimensions such as algorithm principles, applicable scenarios, and performance metrics, fully highlighting the advantages of the proposed model in feature synergy and planning efficiency.
Evaluation Metrics: Based on the core objectives of path optimization, model characteristics, and practical application needs, six key metrics are selected: basic metrics that measure path quality and safety (path length, collision rate), practical metrics reflecting algorithm efficiency and physical feasibility (computation time, path feasibility), and, targeting the PGD model’s feature compression mechanism and cross-scenario adaptability, metrics for state space compression efficiency and cross-scenario generalization error, forming a multi-dimensional, full-process evaluation system.
Path Length (Pl) quantifies the path economy, defined as the total Euclidean distance of all flight segments:

$P_l = \sum_{i=1}^{n-1} \left\| p_{i+1} - p_i \right\|_2$

where $p_i = (x_i, y_i, z_i)$ are the 3D coordinates of the $i$-th waypoint, and $n$ is the total number of waypoints.
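This metric is a direct sum of segment lengths, as the following sketch shows:

```python
import math

# Path-length metric Pl: the sum of Euclidean distances between
# consecutive 3D waypoints.

def path_length(waypoints):
    return sum(math.dist(p, q) for p, q in zip(waypoints, waypoints[1:]))

# A right-angle path (0,0,0) -> (3,0,0) -> (3,4,0) has length 3 + 4 = 7.
assert path_length([(0, 0, 0), (3, 0, 0), (3, 4, 0)]) == 7.0
```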
Collision Rate (Cr) measures path safety, calculated as the proportion of test cases where collisions occur:

$C_r = \frac{N_{\text{collision}}}{N_{\text{total}}} \times 100\%$

where $N_{\text{collision}}$ is the number of test cases with at least one collision and $N_{\text{total}}$ is the total number of test cases.
Computation Time (Tc) evaluates the algorithm’s real-time performance, representing the total time from input terrain data to output planning path (in seconds), including the entire process of feature extraction, path generation, and policy optimization.
Path Feasibility (Fr) verifies whether the path adheres to the drone’s physical constraints (such as maximum climb angle and turning radius):

$F_r = \frac{N_{\text{feasible}}}{N_{\text{total}}} \times 100\%$

where $N_{\text{feasible}}$ is the number of planned paths that satisfy all kinematic constraints.
State Space Compression Efficiency (Se) evaluates the compression effect of the Transformer module, combining dimension reduction rate and feature retention degree:

$S_e = \left(1 - \frac{d_c}{d_o}\right) \times \alpha$

where $d_o$ and $d_c$ are the original and compressed state dimensions, and $\alpha$ is the feature retention coefficient (verified by decision accuracy, with a range of $[0,1]$).
Cross-Scenario Generalization Error (Ge) measures the model’s adaptability to unknown terrains, calculating the deviation between the planned path and the optimal path in the test scenario:

$G_e = \frac{\left| P_{\text{plan}} - P_{\text{opt}} \right|}{P_{\text{opt}}} \times 100\%$

where $P_{\text{plan}}$ is the model’s planned path length, and $P_{\text{opt}}$ is the theoretical optimal path length for that scenario.
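The ratio-style metrics above reduce to one-liners. Note that the Se expression follows the product reading "(1 − d_c/d_o) × α", which is an interpretation of the text rather than a verbatim formula from the paper:

```python
# Straightforward implementations of the ratio-style metrics defined above.
# The Se formula is an interpretation ("dimension reduction rate times
# feature retention"), not a verbatim formula from the paper.

def collision_rate(n_collision, n_total):
    return 100.0 * n_collision / n_total          # Cr, in percent

def feasibility_rate(n_feasible, n_total):
    return 100.0 * n_feasible / n_total           # Fr, in percent

def compression_efficiency(d_orig, d_comp, alpha):
    return (1.0 - d_comp / d_orig) * alpha        # Se (interpreted)

def generalization_error(p_plan, p_opt):
    return 100.0 * abs(p_plan - p_opt) / p_opt    # Ge, in percent

assert collision_rate(5, 200) == 2.5
assert compression_efficiency(1024, 128, 1.0) == 0.875   # 87.5% reduction
assert abs(generalization_error(21.4, 20.0) - 7.0) < 1e-9
```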
Experimental Environment Setup: To ensure the reproducibility and fairness of the experiments, comparative experiments are conducted on a unified hardware platform and simulation environment. The core parameter settings cover dataset configuration, model structure, and training hyperparameters, as shown in Table 2. All algorithms are tested in the same complex terrain scenarios (including mountains, canyons, and random obstacles) to prevent environmental differences from affecting the results.
Results and analysis
Comparative experiments
The comparison with multiple baseline models fully demonstrates the superiority of the proposed PGD model in complex terrain drone path planning tasks. As shown in Table 3, the model outperforms baseline models across all core metrics on both the UAVDT and AirSim datasets, highlighting the effectiveness of the “state simplification - candidate filtering - policy iteration" logical closed-loop achieved through the integration of the Transformer, Generative Adversarial Network (GAN), and Deep Deterministic Policy Gradient (DDPG) modules.
In terms of path length (Pl), the PGD model achieves the optimal performance of 20.0 on the UAVDT dataset, which is 5% shorter than the next best model, Soft Actor-Critic (21.0). On the AirSim dataset (22.0), it shortens the path length by 8.7% compared to Multi-Agent DRL (24.1) (Table 3). This result can be attributed to the path candidate generation and spatial constraint capabilities of the GAN module, which quickly filters out the shortest feasible paths in complex terrain, reducing unnecessary travel distance.
The comparison of collision rate (Cr) further highlights the core advantage of the PGD model: As seen in the data in Table 3, the collision rate of 2.5% on the UAVDT dataset is 21.9% lower than that of IHSSAO (3.2%), and the collision rate of 3.0% on the AirSim dataset is 37.5% lower than IHSSAO (4.8%), significantly outperforming other baseline models. This is closely related to the strategy optimization mechanism in the DDPG module, which iteratively refines the obstacle avoidance strategy through reinforcement learning. Combined with the high-precision environmental features extracted by the Transformer, the PGD model is able to better avoid risks in complex terrains.
Regarding computation time (Tc), Table 3 shows that the PGD model is the most efficient, achieving 13.5 seconds (UAVDT) and 16.0 seconds (AirSim), reducing computation time by 41.3% and 42.9%, respectively, compared to the longest-running TD3 model. This is primarily due to the feature extraction and state compression functionality of the Transformer module, which reduces redundant information processing through efficient state space compression, saving significant computational resources for subsequent decision-making.
In terms of performance stability and generalization capability, as shown in Table 3, the path feasibility (Fr) of the PGD model reaches 98% and 97% on the UAVDT and AirSim datasets, respectively, with state space compression efficiency (Se) values of 92% and 94%, leading the baseline models by 1-3 percentage points. The cross-scene generalization error (Ge) is only 7% and 8%, the lowest among all models. This confirms the generalization advantages of the three-module collaborative architecture: The feature abstraction ability of Transformer ensures consistent representation of different terrain data, while the combination of GAN and DDPG efficiently maps the candidate paths to the optimal strategy, maintaining stable performance in complex terrain scenarios.
Fig 2 visualizes this result, where it is clearly observed that the PGD model demonstrates significant advantages in the multi-metric comparison across the UAVDT and AirSim datasets. In the path length (Pl) dimension, the height of the PGD bar is much lower than most baseline models, such as shortening the path by nearly one-third compared to IHSSAO in the UAVDT dataset, indicating a shorter planned path. In terms of collision rate (Cr), the PGD bar is notably lower, showing the least collision risk in both the UAVDT and AirSim scenarios. For computation time (Tc), the PGD bar has the shortest length, highlighting its efficiency advantage. The path feasibility (Fr) and state space compression efficiency (Se) bars are stable at high levels, with the cross-scene generalization error (Ge) bars also at a low level. This multi-dimensional visualization of the data directly confirms that PGD, in complex terrain UAV path planning, is able to balance path optimization, safety assurance, and efficiency improvement, with overall performance superior to the compared baseline models.
Ablation experiment
The ablation study (Table 4) further validates the necessity and synergy of the three core modules—Transformer, GAN, and DDPG—in the PGD model. By comparing the performance differences between the full PGD model and models with individual modules removed, the unique contributions of each module to complex terrain path planning are clearly demonstrated.
In terms of path length (Pl), removing any module results in an increase in path length. Specifically, after removing the Transformer, the path length on the UAVDT dataset increases from 20.0 to 22.5, and from 22.0 to 24.0 on the AirSim dataset; after removing the GAN, the path length increases to 21.0 and 23.0, respectively; and after removing the DDPG, the path length becomes 23.0 and 25.0. This indicates that the GAN’s spatial constraint role in path candidate generation directly affects path simplification, while Transformer’s feature compression and DDPG’s policy optimization also impact path efficiency indirectly, with all three modules collaborating to achieve the shortest path planning.
The change in collision rate (Cr) more intuitively reflects the functionality of each module: After removing DDPG, the collision rate on the UAVDT dataset increases from 2.5% to 4.0%, and on the AirSim dataset from 3.0% to 5.0%, showing the most significant increase; after removing the Transformer and GAN, the collision rate also rises to varying degrees (3.5%, 4.0% for UAVDT and 3.0%, 3.5% for AirSim). This suggests that the policy optimization of the DDPG module is the key to reducing collision risks, while the precise environmental features provided by the Transformer and the safe candidate paths generated by the GAN are essential for DDPG to function effectively.
Regarding computation time (Tc), after removing the Transformer, the computation time on both datasets increases significantly (from 13.5 seconds to 17.0 seconds on UAVDT and from 16.0 seconds to 19.0 seconds on AirSim), which is much higher than when removing other modules (after removing the GAN, the time is 16.0 seconds and 18.0 seconds; after removing the DDPG, it is 18.5 seconds and 21.0 seconds). This strongly proves that the Transformer’s feature extraction and state compression functionality is crucial for improving the model’s computational efficiency. By reducing redundant information processing, it saves significant computational resources for subsequent modules.
In terms of performance stability metrics (Fr, Se, Ge), the full PGD model has the highest path feasibility (Fr) and state space compression efficiency (Se), and the lowest cross-scene generalization error (Ge). After removing any module, these metrics all show varying degrees of decline, with the removal of DDPG having the greatest impact on Se (on the UAVDT dataset, it decreases from 92% to 85%, and on the AirSim dataset from 94% to 89%), and the removal of Transformer significantly impacting Ge (increasing from 7% to 9%, and from 8% to 10%). This shows that each of the three modules plays a key role in ensuring model stability and generalization ability; only when they work together can optimal performance be achieved.
To verify the necessity and advantages of Transformer in high-dimensional terrain data compression, PCA, Autoencoder (AE), CNN Encoder, and ViT-base are selected as alternative methods. These methods are combined with the GAN and DDPG modules to construct comparison models. Performance evaluation is conducted on the UAVDT and AirSim datasets, with a focus on comparing state space compression efficiency (Se), feature retention coefficient (α), training convergence iterations, path length (Pl), and collision rate (Cr). The experimental results are shown in Table 5.
The results show that Transformer exhibits a comprehensive advantage in the high-dimensional terrain data compression task. In terms of compression efficiency, Transformer, like PCA and ViT-base, achieves a high compression ratio of 87.5% (compressing the 1024-dimensional raw state to 128 dimensions), significantly outperforming AE (85.9%) and CNN Encoder (84.4%). The feature retention coefficient is slightly better than ViT-base (0.93), and far exceeds PCA (0.78), AE (0.86), and CNN (0.83). This advantage is attributed to the self-attention mechanism of Transformer, which accurately captures long-range dependencies among “UAV-obstacle-inspection points," whereas PCA relies only on linear transformations and lacks feature correlation modeling, and AE and CNN are limited by local receptive fields, making it difficult to capture global terrain constraints. In terms of training efficiency, the Transformer-based model converges in just 7100 iterations, reducing the number of iterations by 23% compared to ViT-base and by 13.4% and 8.9% compared to PCA and CNN Encoder, respectively. Additionally, the computation time (UAVDT: 13.5s, AirSim: 16.0s) is significantly lower than other methods, while ViT-base, due to its higher model complexity, results in excessive computational overhead (UAVDT: 18.5s, AirSim: 20.3s). In terms of final path planning performance, the Transformer-based model generates the shortest path length (UAVDT: 20.0m, AirSim: 22.0m), which is 3.2%-5.8% shorter than PCA, AE, and CNN Encoder, and 3.5%-9.7% shorter than ViT-base. The collision rate is also the lowest (UAVDT: 2.5%, AirSim: 3.0%), reducing by 18%-25% compared to other methods. This fully demonstrates that the features compressed by Transformer provide more accurate and efficient inputs for subsequent GAN path generation and DDPG policy optimization.
Fig 3 further visually validates this result. Under different metrics (Pl, Cr, Tc, etc.), the full PGD model shows significantly shorter bar heights in both the UAVDT (green group) and AirSim (orange group) datasets compared to the models with single modules removed. For example, in the Pl metric, the full PGD model has the shortest bars in both the green and orange groups for the two datasets; in the Cr metric, the height of its bars is the lowest, which visually confirms the performance gain from the synergy of all modules. This forms a “data-visualization" dual-validation loop with the previous quantitative analysis.
The ablation experiment results demonstrate that the Transformer, GAN, and DDPG modules are indispensable in the PGD model. The Transformer enhances computational efficiency through efficient feature compression, the GAN generates high-quality path candidates to lay the foundation for path planning, and the DDPG optimizes decision strategies to ensure safety and success rate. The “state simplification - candidate filtering - policy iteration" closed loop formed by the collaboration of these three modules is the core reason for the excellent performance of the PGD model in complex terrain path planning.
Generalization analysis
To verify the adaptability of the PGD model in real-world scenarios, cross-dataset generalization tests were conducted. The AirSim canyon scene was used as the training set, and the UAVDT suburban scene was used as the test set. The model performance metrics were compared, and the results are shown in Table 6.
(Training set: AirSim canyon scene; test set: UAVDT suburban scene.)
From the correlation between scene characteristics and model performance, it can be seen that the UAVDT suburban scene has a flatter terrain (lower elevation difference and obstacle density compared to the AirSim canyon), and after migration, PGD’s path planning is more compact. The average path length (Pl) decreases from 22.0m to 20.5m, which aligns with the physical logic of “shorter paths in simpler scenes," demonstrating the model’s adaptability to changes in terrain complexity. In terms of safety, the UAVDT scene has a more regular obstacle distribution (mainly farmland and low buildings), which makes the features easier to recognize. PGD’s obstacle avoidance strategy is effectively carried over, and the collision rate (Cr) slightly drops to 2.8%, validating the generalization of the Transformer feature extraction module, which can reliably capture key obstacle information. In terms of efficiency, the UAVDT scene has a lower state space complexity (fewer terrain feature dimensions), which leads to improved model inference speed. The planning time (Tc) decreases from 16.0s to 14.2s, reflecting PGD’s dynamic response ability to the scene complexity. For path feasibility (Fr), both scenes meet the physical constraints of the UAV, and PGD’s path generation logic remains stable with only minor fluctuations (97% → 96%), indicating that the strategy transfer did not compromise feasibility. The state space compression efficiency (Se) shows a 2.1% decrease due to differences in state characteristics between UAVDT and AirSim (e.g., terrain texture, GPS interference). However, the Transformer can still extract key features such as the relative position of obstacles and the priority of inspection points, ensuring the core functionality of the model.
The cross-scene generalization error (Ge) changes only marginally in the UAVDT scene (from 8% to 9%), remaining at a low level despite the transfer and aided by the smoother flight conditions (less turbulence), which extends the generalization verification dimension. In summary, when the PGD model is transferred across datasets, its core metrics remain stable and adapt well to the new scene characteristics, exhibiting strong generalization ability and supporting applications in real-world complex scenarios.
Visualization analysis
Fig 4 shows the reward values of different path planning algorithms over the course of training, plotted against the number of episodes. From the curve trends, the PGD model exhibits a generally stable increasing reward, gradually converging to a high level (close to 15,000) within 6000 episodes, indicating that its policy optimization process is consistently effective and can accumulate positive rewards through continuous iteration. Although algorithms such as FM-Planner and Multi-Agent DRL also show an increasing reward curve, their fluctuations are relatively larger, with lower convergence efficiency and stability compared to PGD. Algorithms like PPO-DRL and IHSSAO show dramatic reward fluctuations, even reaching negative values, which suggests that the training process is prone to local fluctuations and insufficient balance between exploration and exploitation. The comparison indicates that PGD has advantages in reward mechanism design and policy iteration logic, enabling it to learn high-quality policies more efficiently in path planning tasks and providing training-level support for stable applications in complex scenarios.
Fig 5 presents the distribution relationship between UAV path point elevation and obstacle proximity in the PGD model for AirSim (simulated scene) and UAVDT (real-world scene). By comparing the subplots on both sides, it is evident that: in terms of feature distribution, the scatter patterns for AirSim and UAVDT are highly similar. In low elevation areas (100 - 200m), there are generally more points with high obstacle proximity (marked in red and dark red), reflecting the denser obstacle characteristics in lower areas of real-world terrain; in high elevation areas (400 - 500m), points with low proximity (marked in yellow and light yellow) dominate, indicating that the model can capture the “elevation - obstacle distribution" correlation in different scenes. In terms of model adaptability, the color distribution of obstacle proximity and the elevation correlation logic for the path points generated by PGD remain consistent during cross-scene migration. For example, in the path point index range 20 - 40, both AirSim and UAVDT show a concentration of medium to high proximity points (orange to red) in the 200 - 300m elevation range, verifying that the PGD model can recognize and adapt to the potential correlation between “elevation - obstacle distribution" in different scenes. The path planning strategy thus demonstrates cross-scene migration capability, laying the foundation for real-world complex environment applications.
Discussion
Compared to existing hybrid or hierarchical reinforcement learning models (such as PPO with attention and hybrid graph-RL methods), the superiority of the Transformer+GAN+DDPG triplet concept in the PGD model lies in its end-to-end targeted adaptation to the core requirements of path planning in complex terrain and the organic linkage between modules. Although PPO with attention introduces an attention mechanism to optimize feature perception, it still relies on a single reinforcement learning framework. It cannot actively generate path candidates that conform to physical constraints to reduce blind exploration like GAN, nor can it balance convergence efficiency and stability in continuous action space. Hybrid graph-RL methods are limited by the adaptability of graph structures to unstructured terrain. They can only model the topological relationships of obstacles and cannot capture the global association and priority information of “drone-obstacle-inspection point" like Transformer. Furthermore, the lack of a dynamic feedback mechanism leads to fragmented collaboration between modules. The PGD triplet achieves functional complementarity and chain empowerment through a closed-loop logic of “compression-generation-optimization": Transformer solves the problem of high-dimensional data redundancy, reducing the burden on subsequent modules; GAN provides physically compliant path priors, reducing the exploration cost of reinforcement learning; DDPG precisely optimizes the continuous action space. The three form a synergistic effect through dynamic feedback (such as GAN path quality adjusting Transformer feature weights and DDPG policy gradient guiding GAN generation direction). This is the core reason why it outperforms the comparison models in terms of path length, collision rate, and computational efficiency in experiments.
The breakthrough of the PGD model in path planning efficiency and adaptability to complex terrain stems from the synergistic effect of the “compression - generation - optimization" closed loop. The Transformer module uses self-attention mechanisms to distill high-dimensional terrain data (e.g., canyon point clouds, multiple inspection point states), compressing the original 1024-dimensional state to 128 dimensions. This fundamentally solves the slow training convergence and policy oscillation issues caused by state redundancy in high-dimensional spaces in the DDPG algorithm (in comparative experiments, the planning time for the DDPG baseline model on the AirSim dataset reached 28.0s, while PGD took only 16.0s). At the same time, the GAN module generates a “physically feasible + terrain-adapted" set of candidate paths based on adversarial training, providing a “high-quality exploration starting point" for the reinforcement learning strategy. In the UAVDT dataset test, PGD, using GAN-generated candidate paths, increased the effective exploration iteration ratio of DDPG from 35% in the baseline model to 62%, reducing ineffective iterations caused by random exploration that could get stuck in local optima. This explains why PGD outperforms traditional DRL models in terms of path length (Pl = 20.0) and collision rate (Cr = 2.5%) (e.g., PPO-DRL has a path length of 23.8 and a collision rate of 4.5%). This dual-module synergy mechanism enables PGD to handle both high-dimensional unstructured terrains (e.g., hidden obstacles in canyons, random buildings in suburbs) and accelerate convergence through policy optimization, adapting to the demands of complex scenes.
Compared to the traditional “feature extraction + reinforcement learning” framework, the core innovation of PGD is the full-process “compression-generation-optimization” closed loop. Prior work has typically optimized individual modules (e.g., CNNs used solely for feature compression, or GANs solely for path generation), whereas PGD is the first to combine the Transformer’s long-range dependency modeling with the GAN’s adversarial generation. The Transformer not only compresses the state dimensions but also captures global correlations among terrain features (e.g., spatial constraints between obstacles upstream and downstream in a canyon), providing implicit constraints for the GAN’s path generation and making the candidate paths more natural at avoiding obstacles in complex terrain (versus CNN-based feature-compression models, PGD achieved a state-feature adaptability Se of 94% on the AirSim dataset, compared with 87% for the CNN model). Relative to comparable GAN-based path planning studies, the Transformer-enhanced feature correlation improves the GAN discriminator’s precision in identifying high-quality paths by 18% (verified by comparing the overlap between generated paths and manually labeled optimal paths), avoiding the situation in which GAN-generated paths are theoretically feasible but not flyable in practice due to narrow feature perception. This completes the value loop from generating candidates to assisting reinforcement learning optimization.
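The role of terrain-aware features in the discriminator can be illustrated with a toy scoring rule. The function below is our assumption for exposition, not the paper’s network: it scores a candidate path by combining smoothness with agreement against the compressed terrain features, so a path that tracks the terrain outranks a random one.

```python
import random

random.seed(2)

def discriminator_score(path, terrain_features, w_smooth=0.5, w_fit=0.5):
    """Toy stand-in for the GAN discriminator: higher is better.
    smooth: penalize large step-to-step jumps along the path.
    fit: penalize deviation from the Transformer-compressed terrain
    features (the 'implicit constraints' mentioned above)."""
    smooth = -sum(abs(b - a) for a, b in zip(path, path[1:])) / (len(path) - 1)
    fit = -sum(abs(p - f) for p, f in zip(path, terrain_features)) / len(path)
    return w_smooth * smooth + w_fit * fit

features = [random.gauss(0, 1) for _ in range(16)]
# a candidate that follows the terrain features closely vs. a random one
good = [f + random.gauss(0, 0.05) for f in features]
bad = [random.gauss(0, 1) for _ in range(16)]

print(discriminator_score(good, features) > discriminator_score(bad, features))
```

In the real model the discriminator is learned adversarially rather than hand-coded, but the intuition is the same: richer terrain features give the discriminator a sharper notion of “flyable”, which is what the reported 18% precision gain measures.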
Despite these advantages, several limitations of PGD remain. First, the model’s adaptability to extreme and dynamic environmental conditions is limited: it relies on clean terrain features (e.g., point clouds and altitude) for decision-making and does not account for dynamic environmental parameters such as wind and temperature, or for sensor uncertainty. Under the sensor noise caused by heavy rain or dense fog, the state compression module can therefore extract invalid features, increasing the collision rate of the planned paths. Second, the GAN module’s fixed budget of 64 candidate paths is somewhat rigid: in scenarios such as complex canyons it may miss better solutions that bypass hidden obstacles while still covering the inspection points, leaving roughly 5% of the potential path-length improvement unrealized. Finally, the current framework is limited to single-UAV operation; it has not yet been adapted to the practical needs of multi-UAV collaborative inspection and cannot improve coverage efficiency in complex terrain through state sharing between UAVs.
To address these limitations, future research can proceed along several lines: introducing a meteorology-enhanced feature module that fuses wind-speed and temperature time series with radar echoes and humidity-sensor readings, together with robust feature-extraction mechanisms (e.g., noise-adaptive layers) to handle dynamic environments and sensor uncertainty; designing a dynamic candidate mechanism that adjusts the number of GAN-generated paths in real time according to terrain complexity, so as to fully realize the remaining path-optimization potential; extending the framework to multi-agent reinforcement learning, leveraging the Transformer’s global modeling of multiple UAV states to optimize collaborative inspection strategies [51]; and further refining the closed-loop feedback mechanism by incorporating dynamic environmental factors into the loss function to improve robustness in complex dynamic scenes [52], thereby broadening the engineering applicability of the PGD model.
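A dynamic candidate mechanism of the kind proposed above could be as simple as scaling the GAN’s candidate budget with a normalized terrain-complexity score. The rule below is a hypothetical sketch (the `candidate_budget` function, the [0, 1] complexity score, and the cap of 256 are our assumptions, not the paper’s design); it keeps the current fixed base of 64 paths for simple scenes and grows the budget for cluttered canyons.

```python
def candidate_budget(complexity, base=64, max_paths=256):
    """Toy dynamic-candidate rule: interpolate the number of
    GAN-generated candidate paths between base (simple scenes)
    and max_paths (highly complex terrain).

    complexity: terrain-complexity score, expected in [0, 1]
    (values outside the range are clamped)."""
    c = min(max(complexity, 0.0), 1.0)
    return base + round(c * (max_paths - base))

for c in (0.0, 0.5, 1.0):
    print(c, candidate_budget(c))  # 64, 160, 256 candidates
```

The complexity score itself could be derived from quantities the pipeline already computes, such as obstacle density in the compressed terrain features, so the extra cost of the mechanism stays negligible.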
Conclusion
This paper addresses the core problems of high-dimensional state processing, blind path exploration, and poor cross-scene adaptability in UAV path planning for complex-terrain inspection. We propose the PGD model, which integrates Transformer, GAN, and DDPG: the Transformer compresses high-dimensional terrain features to overcome training bottlenecks, the GAN generates high-quality path candidates to reduce ineffective exploration, and DDPG efficiently optimizes the policy, forming a complete “compression-generation-optimization” closed loop. Experiments on the UAVDT (suburban) and AirSim (canyon) datasets show that PGD outperforms baseline models such as PPO-DRL and Soft Actor-Critic in path length (20.0 m/22.0 m), collision rate (2.5%/3.0%), and computational efficiency (13.5 s/16.0 s), with particularly strong performance in high-complexity terrain. Compared to similar studies, the unique value of PGD lies in moving beyond single-module optimization: multi-module collaboration strengthens feature correlation and the physical constraints on path generation, providing a new framework for intelligent planning in complex environments. Room remains to improve the model’s adaptability to extreme weather and multi-agent collaborative scenarios; future research will pursue further optimization through dynamic candidate mechanisms and cross-modal feature enhancement.
References
- 1. Ma Z, Chen J. Adaptive path planning method for UAVs in complex environments. Int J Appl Earth Observ Geoinform. 2022;115:103133.
- 2. Zhang J, Li J, Yang H, Feng X, Sun G. Complex environment path planning for unmanned aerial vehicles. Sensors (Basel). 2021;21(15):5250. pmid:34372486
- 3. Xuzhao C, Zhishuai Z, Junming X, Li Y, Boyang Q, Pengwei E, et al. Multi-strategy fusion differential evolution algorithm for UAV path planning in complex environment. Aerosp Sci Technol. 2022;121:107287.
- 4. Huang J, Chen J, Lv L, Yu Z, Yao W, Cheng H, et al. Design and verification of a wearable microcapacitance test system for POC biosensing. IEEE Trans Instrum Meas. 2025;74:1–11.
- 5. Qi H, Hu Z, Yang Z, Zhang J, Wu JJ, Cheng C, et al. Capacitive aptasensor coupled with microfluidic enrichment for real-time detection of trace SARS-CoV-2 nucleocapsid protein. Anal Chem. 2022;94(6):2812–9. pmid:34982528
- 6. Xing X, Wang B, Ning X, Wang G, Tiwari P. Short-term OD flow prediction for urban rail transit control: A multi-graph spatiotemporal fusion approach. Inform Fusion. 2025;118:102950.
- 7. Du Y. Multi-UAV search and rescue with enhanced A* algorithm path planning in 3D environment. Int J Aerospace Eng. 2023;2023(1):8614117.
- 8. Wang J, Li Y, Li R, Chen H, Chu K. Trajectory planning for UAV navigation in dynamic environments with matrix alignment Dijkstra. Soft Comput. 2022;26(22):12599–610.
- 9. Prasad NL, Ramkumar B. 3-D deployment and trajectory planning for relay based UAV assisted cooperative communication for emergency scenarios using Dijkstra’s algorithm. IEEE Trans Veh Technol. 2023;72(4):5049–63.
- 10. Shynar Y, Seitenov A, Kenzhegarina A, Kenzhetayev A, Kemel A, Ualiyev N, et al. Comprehensive analysis of blockchain technology in the healthcare sector and its security implications. Int J E-Health Med Commun. 2025;16(1):1–45.
- 11. Kodipalli A, Fernandes SL, Dasar SK, Ismail T. Computational framework of inverted fuzzy C-means and quantum convolutional neural network towards accurate detection of ovarian tumors. Int J E-Health Med Commun. 2023;14(1):1–16.
- 12. Yang H, Liu Z, Cui H, Ma N, Wang H, Zhang C, et al. An electrified railway catenary component anomaly detection frame based on invariant normal region prototype with segment anything model. IEEE Trans Transp Electrific. 2025:1–1.
- 13. Guo Y, Liu X, Liu X, Yang Y, Zhang W. FC-RRT*: An improved path planning algorithm for UAV in 3D complex environment. IJGI. 2022;11(2):112.
- 14. Zhang J, An Y, Cao J, Ouyang S, Wang L. UAV trajectory planning for complex open storage environments based on an improved RRT algorithm. IEEE Access. 2023;11:23189–204.
- 15. Wang H, Song Y, Yang H, Liu Z. Generalized koopman neural operator for data-driven modeling of electric railway pantograph–catenary systems. IEEE Trans Transp Electrific. 2025;11(6):14100–12.
- 16. Li S, Hu J, Zhang B, Ning X, Wu L. Dynamic personalized federated learning for cross-spectral palmprint recognition. IEEE Trans Image Process. 2025;34:4885–95. pmid:40737152
- 17. Hao M, Gu Y, Dong K, Tiwari P, Lv X, Ning X. A prompt regularization approach to enhance few-shot class-incremental learning with Two-Stage Classifier. Neural Netw. 2025;188:107453. pmid:40220563
- 18. Miao J, Ning X, Hong S, Wang L, Liu B. Secure and efficient authentication protocol for supply chain systems in artificial-intelligence-based internet of things. IEEE Internet Things J. 2025;12(19):39532–42.
- 19. Arshad MA, Khan SH, Qamar S, Khan MW, Murtza I, Gwak J, et al. Drone navigation using region and edge exploitation-based deep CNN. IEEE Access. 2022;10:95441–50.
- 20. Qin Z, Fengpu L, Bin L. A heuristic tomato-bunch harvest manipulator path planning method based on a 3D-CNN-based position posture map and rapidly-exploring random tree. Comput Electron Agric. 2023;213:108183.
- 21. Ying L, Xiaodan W, Yang Y, Man D, Shaochun Q, Yanfang F. LSTM-DQN-APF path planning algorithm empowered by twins in complex scenarios. Appl Sci. 2025;15(8):4565.
- 22. Du Y, Qi N, Li X, Xiao M, Boulogeorgos AA-A, Tsiftsis TT, et al. Distributed multi-UAV trajectory planning for downlink transmission: A GNN-enhanced DRL approach. IEEE Wireless Commun Lett. 2024.
- 23. Yuhao P, Xiucheng W, Zhiyao X, Nan C, Wenchao X, Jun-Jie Z. GNN-empowered effective partial observation MARL method for AoI management in multi-UAV network. IEEE Internet Things J. 2024.
- 24. Shasha T, Yuanxiang L, Xiao Z, Lu Z, Linhui C, Wei S, et al. Fast UAV path planning in urban environments based on three-step experience buffer sampling DDPG. Digit Commun Netw. 2024;10(4):813–26.
- 25. Ruichang W, Weijun H, Xianlong M. Research on the method of trajectory planning for unmanned aerial vehicles in complex terrains based on reinforcement learning. In: Int Conf Intell Robotics Appl. Springer; 2024. p. 287–95.
- 26. Bai R, Bai B. The impact of labor productivity and social production scale on profit-induced demand: Function and analysis from the perspective of Marx’s economic theory. J Xi’an Univ Finance Econ. 2024;37(5):3–17.
- 27. DeMatteo C, Jakubowski J, Stazyk K, Randall S, Perrotta S, Zhang R. The headaches of developing a concussion app for youth. Int J E-Health Med Commun. 2024;15(1):1–20.
- 28. Yu C-H, Tsai J, Chang Y-T. Intelligent path planning for UAV patrolling in dynamic environments based on the transformer architecture. Electronics. 2024;13(23):4716.
- 29. Almayyan WI, AlGhannam BA. Detection of kidney diseases: Importance of feature selection and classifiers. IJEHMC. 2024;15(1):1–21.
- 30. Ma Z, Xiong J, Gong H, Wang X. Mission planning of UAVs and CAVs based on graph neural networks transformer model. IEEE Internet Things J. 2024.
- 31. Eskandari M, Savkin AV. GANs the UAV path planner: UAV-based RIS-assisted wireless communication for Internet of Autonomous Vehicles. In: 2024 IEEE 19th Conf Ind Electron Appl (ICIEA); 2024. p. 1–6.
- 32. Zan L, Xiaomin L, Jia S, Li L, Pei X. MD-GAN-based UAV trajectory and power optimization for cognitive covert communications. IEEE Internet Things J. 2021;9(12):10187–99.
- 33. Peng W, Prabhash R, Vougioukas SG, Kong Z. Vision-based navigation of unmanned aerial vehicles in orchards: An imitation learning approach. Comput Electron Agric. 2025;238:110802.
- 34. Bondi E, Dey D, Kapoor A, Piavis J, Shah S, Fang F, et al. AirSim-W: A simulation environment for wildlife conservation with UAVs. In: Proc 1st ACM SIGCAS Conf Comput Sustain Soc; 2018. p. 1–12.
- 35. Du D, Qi Y, Yu H, Yang Y, Duan K, Li G, et al. The unmanned aerial vehicle benchmark: Object detection and tracking. In: Proc Eur Conf Comput Vis (ECCV); 2018. p. 370–86.
- 36. Zhu B, Bedeer E, Nguyen HH, Barton R, Gao Z. UAV trajectory planning for AoI-minimal data collection in UAV-aided IoT networks by transformer. IEEE Trans Wireless Commun. 2022;22(2):1343–58.
- 37. Maldonado-Romo J, Aldape-Pérez M, Rodríguez-Molina A. Path planning generator with metadata through a domain change by GAN between physical and virtual environments. Sensors (Basel). 2021;21(22):7667. pmid:34833741
- 38. Zhang L, Peng J, Yi W, Lin H, Lei L, Song X. A state-decomposition DDPG algorithm for UAV autonomous navigation in 3-D complex environments. IEEE Internet Things J. 2024;11(6):10778–90.
- 39. Jinyan Y, Yongbai S, Yanli C, Guoqing Z, Xinyu H, Guiqiang B, et al. IHSSAO: An improved hybrid salp swarm algorithm and aquila optimizer for UAV path planning in complex terrain. Appl Sci. 2022;12(11):5634.
- 40. Feiyu Z, Dayan L, Zhengxu W, Jianlin M, Niya W. Autonomous localized path planning algorithm for UAVs based on TD3 strategy. Sci Rep. 2024;14(1):763. pmid:38191590
- 41. Silvirianti, Narottama B, Shin SY. UAV coverage path planning with quantum-based recurrent deep deterministic policy gradient. IEEE Trans Veh Technol. 2024;73(5):7424–9.
- 42. Runjia W, Fangqing G, Hai-lin L, Hongjian S. UAV path planning based on multicritic-delayed deep deterministic policy gradient. Wirel Commun Mobile Comput. 2022;2022:9017079.
- 43. Theile M, Bayerlein H, Caccamo M, Sangiovanni-Vincentelli AL. Learning to recharge: UAV coverage path planning through deep reinforcement learning. arXiv preprint arXiv:2309.03157. 2023.
- 44. Sonny A, Yeduri SR, Cenkeramaddi LR. Autonomous UAV path planning using modified PSO for UAV-assisted wireless networks. IEEE Access. 2023;11:70353–67.
- 45. Dhuheir MA, Baccour E, Erbad A, Al-Obaidi SS, Hamdi M. Deep reinforcement learning for trajectory path planning and distributed inference in resource-constrained UAV swarms. IEEE Internet Things J. 2023;10(9):8185–201.
- 46. Wang X, Gursoy MC, Erpek T, Sagduyu YE. Learning-based UAV path planning for data collection with integrated collision avoidance. IEEE Internet Things J. 2022;9(17):16663–76.
- 47. Yang M, Liu G, Zhou Z, Wang J. Partially observable mean field multi-agent reinforcement learning based on graph attention network for UAV swarms. Drones. 2023;7(7):476.
- 48. Jonas W, Julius R, Marija P. Multi-UAV adaptive path planning using deep reinforcement learning. In: 2023 IEEE/RSJ Int Conf Intell Robots Syst (IROS); 2023. p. 649–56.
- 49. Jiaping X, Wen TC, Yuhang Z, Mir F. FM-planner: Foundation model guided path planning for autonomous drone navigation. arXiv preprint arXiv:2505.20783. 2025.
- 50. Mahya R, Hamed H, Holger V, et al. UAV path planning employing MPC-reinforcement learning method considering collision avoidance. arXiv preprint arXiv:2302.10669. 2023.
- 51. Jia Z, Liu Z, Li Z, Wang K, Vong C-M. Lightweight fault diagnosis via Siamese network for few-shot EHA circuit analysis. IEEE Trans Aerosp Electron Syst. 2025;61(6):15585–96.
- 52. Ren X, Wang S, Zhao W, Kong X, Fan M, Shao H, et al. Universal federated domain adaptation for gearbox fault diagnosis: A robust framework for credible pseudo-label generation. Adv Eng Inform. 2025;65:103233.