Abstract
Enhancing the resilience of Autonomous Unmanned Swarms (AUS) requires policies that remain effective under severe, structured disruptions while respecting the heterogeneous semantics of inter-subsystem interactions. Existing reinforcement learning (RL) approaches typically aggregate first-order neighborhoods in a path-agnostic manner, thereby blurring the typed, ordered, and directed multi-hop dependencies encoded by domain meta-paths. We propose MPGPD-RC, a Meta-Path Guided Policy Distillation framework for Resilient Coordination that couples: (i) meta-path-guided embeddings learned by path-specific graph attention with contrastive reconstruction and attention fusion, and (ii) a teacher-student scheme in which a PPO teacher trained with a relaxed meta-path mask provides trajectories, and a student aligns both action distributions (KL) and trajectory-level structural codes via path-aware contrastive learning. Empirical evaluations validate that MPGPD-RC consistently surpasses state-of-the-art baselines across diverse perturbation scenarios by modeling the complex, high-order dependencies that underpin resilient coordination.
Citation: Han X, Wang H, Jia Q, Gou Y, Li B, Liu J, et al. (2025) Meta-path guided policy distillation for resilient coordination in autonomous unmanned swarm. PLoS One 20(12): e0339675. https://doi.org/10.1371/journal.pone.0339675
Editor: Teddy Lazebnik, Ariel University, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: June 22, 2025; Accepted: December 10, 2025; Published: December 31, 2025
Copyright: © 2025 Han et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data that support the findings of this study are available from the corresponding author, Siwen Wei, upon reasonable request.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Autonomous Unmanned Swarm (AUS) [1,2] represents a transformative leap in multi-agent systems, integrating a diverse ensemble of autonomous platforms, including unmanned aerial vehicles (UAVs) [3], unmanned ground vehicles (UGVs) [4], and unmanned surface vessels (USVs) [5]. These sophisticated configurations exploit state-of-the-art real-time decision-making capabilities, advanced automation protocols, and heterogeneous network architectures that seamlessly interlink sensor arrays, computational intelligence modules, and robust communication infrastructures to execute complex collaborative tasks such as large-scale monitoring, dynamic environment mapping, and coordinated system behaviors [6–10]. This elevated level of autonomy and systemic inter-connectivity offers substantial advantages for high-demand, risk-sensitive applications, enhancing operational efficiency while reducing the need for constant human oversight.
Notwithstanding their profound capabilities, Autonomous Unmanned Swarms (AUS) exhibit intrinsic vulnerabilities stemming from their dependence on intricate and interdependent system architectures [11]. Disruptions such as signal interference, network breaches, and physical damage constitute significant risks by affecting pivotal nodes and communication links, thereby undermining overall system efficacy and continuity [12,13]. The highly integrated configuration of these systems amplifies the impact of localized failures, which can precipitate cascading degradations in global performance. Enhancing the resilience of AUS—specifically their ability to endure, adapt to, and recover from multifaceted disturbances—has become a critical imperative in both research and real-world deployment contexts [14].
Conventional resilience frameworks primarily rely on static redundancy strategies such as auxiliary components and predefined communication alternatives, which are insufficient for managing the dynamic complexity characteristic of modern large-scale distributed systems [15,16]. Such approaches frequently overlook the higher-order interdependencies and latent semantic associations embedded within heterogeneous networked systems, thereby constraining their capacity to deliver robust and enduring resilience across fluid and unpredictable operational scenarios.
Fig 1 underscores the indispensable role of adaptive recovery mechanisms. Within an AUS network topology, four heterogeneous node categories are present: Sensor, Decision, Influencer, and Target. The failure of a single Decision node disrupts the Sensor–Decision–Influencer–Target–Sensor meta-path, which is vital for mission fulfillment. A meta-path denotes an ordered chain of distinct node types that encodes semantic dependencies in the graph, thereby enabling multi-hop contextual reasoning. Traditional redundancy schemes rely on predefined standby nodes; restoring the disrupted path in such settings depends on quickly substituting the failed node with an equivalent component. If direct substitution proves impossible, task continuity must be maintained through rerouting operations across other available nodes. Both strategies demand timely evaluation and execution to minimize mission delays and avert cascading failures that static redundancy alone cannot absorb.
(a) Replacement action: a compromised Decision node (marked with a red cross) is substituted with a standby node of equivalent classification. The original Sensor–Decision–Influencer–Target meta-path is reconstituted via reactivation of the corresponding links (indicated by dashed blue and green arrows). (b) Cooperation action: in the absence of available redundant assets, operational continuity is achieved through dynamic rerouting across an alternative assemblage of functional nodes.
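To make the two primitives concrete, the following minimal Python sketch (an illustration, not the paper's implementation; the node names, edge encoding, and rewiring rule are assumptions) applies each recovery action to a toy mission loop. Both primitives reuse the same edge-rewiring step; they differ only in whether the substitute is a same-type standby node (replacement) or an already-active functional node (cooperation).

```python
# Toy sketch (assumption, not the paper's code) of Fig 1's two recovery
# primitives on a typed digraph stored as a set of directed edges.

def rewire(edges, failed, substitute):
    """Redirect every edge incident to `failed` onto `substitute`."""
    new_edges = set()
    for u, v in edges:
        u2 = substitute if u == failed else u
        v2 = substitute if v == failed else v
        if u2 != v2:                      # drop self-loops created by rewiring
            new_edges.add((u2, v2))
    return new_edges

# Sensor -> Decision -> Influencer -> Target -> Sensor mission loop.
loop = {("S1", "D1"), ("D1", "I1"), ("I1", "T1"), ("T1", "S1")}

# (a) Replacement: standby Decision node D2 takes over for failed D1.
replaced = rewire(loop, failed="D1", substitute="D2")

# (b) Cooperation: no standby available, reroute through active node D3.
rerouted = rewire(loop, failed="D1", substitute="D3")
```

The mission loop is restored in both cases; only the identity of the substitute node differs.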
To address these limitations, a compelling direction for enhancing AUS resilience involves the deployment of reinforcement learning (RL) methodologies to facilitate adaptive recovery and systemic reconfiguration [17]. A salient advancement is presented by Sun et al. [14], who employed deep reinforcement learning (DRL) in conjunction with graph convolutional networks (GCNs) to dynamically optimize restorative actions. Through continuous policy refinement informed by environmental feedback, RL-driven agents possess the capability to autonomously reassign functional tasks or restructure coordination schemas upon anomaly detection, thereby contributing to the restoration of system integrity following disruptive incidents.
However, existing RL-based resilience methodologies are frequently constrained by practical deficiencies. Many approaches necessitate extensive exploration or retraining for each distinct failure scenario, rendering them unsuitable for deployment in real-time and safety-critical environments. A more fundamental shortcoming lies in the prevalent treatment of AUS as opaque entities during the learning process, thereby failing to incorporate domain-specific structural insights. The semantic richness of meta-paths, defined as the relational patterns and dependency trajectories among heterogeneous subsystems within an AUS, remains largely unexploited in prevailing learning frameworks [18]. The omission of these latent structural correlations often results in suboptimal recovery strategies and limited generalization across diverse disruption landscapes, as agents lack principled guidance regarding the coordination pathways most vital to sustaining resilience.
In addressing these challenges, this study proposes the MPGPD-RC framework. MPGPD-RC harnesses the meta-path semantics inherent in autonomous unmanned swarm (AUS) systems to inform a policy distillation mechanism that yields resilient coordination strategies. By embedding structural knowledge of inter-system interactions, MPGPD-RC steers the learning process toward preserving essential coordination links and functional pathways under conditions of component degradation or system-level disruptions. Meta-path representations function as inductive priors that enable reinforcement learning agents to emphasize and sustain critical relational configurations throughout both the training phase and online adaptation. The primary contributions of this research are as follows:
1. This study introduces a novel integration of meta-path-guided methodologies into representation learning to augment resilience in AUS, enabling the capture of complex multi-relational semantics within heterogeneous networks and facilitating the generation of semantically enriched and structurally expressive embeddings.
2. A meta-path-aware teacher-student reinforcement learning architecture is proposed to uphold action-level structural fidelity by preserving the teacher model’s topological coherence and AUS-specific semantic constructs through a retention-driven policy distillation process during inter-model knowledge transfer.
3. Empirical evaluations across a diverse set of disruption scenarios demonstrate that the proposed MPGPD-RC framework consistently attains superior performance benchmarks, surpassing existing state-of-the-art approaches in all examined contexts.
2 Related works
Recent progress in resilience augmentation for Autonomous Unmanned Swarms (AUS) has been primarily driven by three core methodological paradigms: mathematical optimization formulations, adaptive algorithmic strategies, and deep reinforcement learning (DRL)-based architectures. Although these approaches have yielded meaningful theoretical and empirical advancements, they exhibit intrinsic limitations in effectively capturing the dynamic and multifactorial complexities characteristic of highly variable and unpredictable system environments.
Mathematical optimization has constituted a foundational approach for formalizing and improving resilience within autonomous unmanned swarm (AUS) systems [19,20]. Such models typically define resilience through objective functions aimed at maximizing system reliability and minimizing recovery-related costs. Xu et al. [21] introduced a stochastic optimization framework designed to expedite the restoration of critical infrastructure networks, explicitly accounting for uncertainties in repair durations. Peiravi et al. advance a universal redundancy allocation model that systematically assigns spares and contingency elements to reinforce overall system reliability [22]. Similarly, Chen et al. [23] proposed a multi-stage optimization paradigm that incorporates evolutionary algorithms to derive Pareto-optimal strategies for enhancing the resilience of AUS in complex, dynamic environments.
Despite their theoretical rigor, optimization-based models frequently rely on highly accurate data inputs and entail substantial computational overhead, rendering them impractical for deployment in real-time operational contexts. Additionally, these models often utilize index-based heuristics, exemplified by the work of Zhang et al. [24], which prioritize system components based on predetermined indices. Such methodologies are prone to generating globally suboptimal outcomes due to their inability to account for the complex interdependencies inherent in heterogeneous network structures. These limitations constrain the applicability of optimization techniques in volatile and high-uncertainty environments, where agile and adaptive recovery capabilities are paramount [15]. Adaptability algorithms emphasize dynamic reconfiguration as a means of countering evolving disruptions and environmental variability in Autonomous Unmanned Swarm (AUS) systems. Qiang et al. [25] presented a resilience optimization framework tailored for multi-drone formations, facilitating real-time structural reconfiguration in response to system-level perturbations. Tran et al. [26] developed a stochastic node reconnection algorithm designed to reshape network architecture, thereby enhancing system robustness under disruptive conditions.
Although these algorithms contribute meaningful enhancements in terms of flexibility and resource efficiency, they predominantly depend on heuristic or rule-based strategies that exhibit inherent limitations in capturing the intricacies of complex and dynamic operational scenarios. The emergence of deep reinforcement learning (DRL) [27–30] has fundamentally transformed resilience methodologies by empowering autonomous agents to learn and make decisions within high-dimensional and complex operational landscapes. DRL-based architectures, particularly those integrating deep neural networks with reinforcement learning algorithms, have shown strong capabilities in exploring and optimizing the vast state-action spaces characteristic of autonomous unmanned swarms (AUS). Sun et al. [14] exemplified this advancement by fusing graph convolutional networks (GCNs) with proximal policy optimization (PPO) to enable the dynamic acquisition of recovery strategies, yielding marked enhancements in system resilience under disruptive conditions.
It is worth mentioning that Graph convolutional networks (GCNs) [31–34] have demonstrated strong efficacy in modeling structural dependencies within autonomous unmanned swarm (AUS) systems [14,17]. Peng et al. [35] illustrated the integration of GCNs within DRL frameworks to reinforce the resilience of IoT infrastructures against cyber disruptions. Dong et al. [36] employed GCN-augmented DRL for adaptive real-time resource allocation in large-scale distributed systems. Nevertheless, conventional DRL and GCN-based models predominantly concentrate on first-order node proximities, neglecting the higher-order relational dependencies that underpin the intricate, multi-hop dynamics inherent to AUS.
This gap motivates a shift from local adjacency–centric modeling to a semantics-aware framework grounded in meta-paths. Meta-paths [37], characterized as ordered sequences of node types connected via directed edges, serve as an effective construct for representing higher-order dependencies within heterogeneous networks [38,39]. By encapsulating multi-hop relational patterns, meta-paths enable a comprehensive depiction of systemic dynamics, thereby facilitating the fusion of local and global structural information [39]. This representational richness is particularly vital in the context of autonomous unmanned swarm (AUS) systems, where sensor, decision, communication, and actuator nodes engage in complex, hierarchical, and multi-phased interactions that demand an expressive and semantically-aware modeling framework.
Recent investigations have applied meta-path-based methodologies across diverse domains, yet their application to enhancing resilience in autonomous unmanned swarm (AUS) systems remains underexplored. Tran et al. [40] utilized meta-path constructs to examine information flow stability in networked infrastructures, with an emphasis on structural connectivity rather than adaptive resilience mechanisms. Sun et al. [14] incorporated first-order graph relations into a DRL framework (CGPPO), which relies on GCN-based first-order neighborhood aggregation and stacked layers to expand receptive fields, yet remains path-agnostic because it isotropically mixes messages across heterogeneous relation types, directions, and hop orders. In contrast, MPGPD-RC explicitly treats meta-paths as learnable relational templates: path-specific encoders with contrastive reconstruction preserve typed, ordered, and directed multi-hop semantics; attention fusion emphasizes salient coordination routes; meta-path masks constrain decision making; and path-aware contrastive distillation aligns not only action distributions but also trajectory-level structural codes. The framework thereby captures role-consistent multi-hop dependencies and resilient backup routes that path-agnostic GCN–PPO baselines typically blur.
3 Markov decision process formulation
In this study, we formulate the resilience enhancement challenge for the AUS as a Markov Decision Process (MDP), represented by the tuple
$$\langle \mathcal{S}, \mathcal{A}, P, R, \gamma, p(s_0) \rangle,$$
where the state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P$, reward $R$, discount factor $\gamma$, and initial distribution $p(s_0)$ are defined below without relying on later sections.
Let $G = (V, E)$ denote a directed heterogeneous graph, where $V$ is the set of nodes and $E \subseteq V \times V$ represents the set of directed edges. Each node $v \in V$ is assigned a categorical type as follows:
$$\phi(v) \in \{S, D, I, T\},$$
where $S$, $D$, $I$, and $T$ represent sensor, decision-making, influencer, and target nodes, respectively. These functional node types delineate the foundational operational roles embedded within the architectural framework of Autonomous Unmanned Swarm (AUS). The AUS further defines a set of meta-paths $\mathcal{P} = \{P_1, \ldots, P_K\}$ that formalize domain-specific interaction sequences among heterogeneous node types, each of the form
$$P_k = t_1 \to t_2 \to \cdots \to t_L,$$
where $L$ represents the length of the meta-path and each $t_i$ is selected from $\{S, D, I, T\}$. For any node $v$, the meta-path constrained neighbors are defined as:
$$\mathcal{N}_k(v) = \{\, u \in V \mid \text{there exists a path instance from } v \text{ to } u \text{ whose type sequence matches } P_k \,\}.$$
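As an illustration of the neighbor set $\mathcal{N}_k(v)$, the following Python sketch (the adjacency-list data layout and node names are assumptions for illustration) performs a typed breadth-first traversal that keeps only the nodes matching the meta-path's type sequence at each hop:

```python
# Sketch of meta-path-constrained neighbor extraction: N_k(v) collects
# every node reachable from v along a directed path whose node-type
# sequence matches the meta-path P_k.

def meta_path_neighbors(adj, node_type, v, meta_path):
    """Nodes reachable from v via paths matching the type sequence meta_path."""
    if node_type[v] != meta_path[0]:
        return set()
    frontier = {v}
    for t in meta_path[1:]:
        # Advance one hop, keeping only neighbors of the required type.
        frontier = {u for w in frontier for u in adj.get(w, ())
                    if node_type[u] == t}
    return frontier

adj = {"s1": ["d1"], "d1": ["i1", "i2"], "i1": ["t1"], "i2": ["t2"]}
node_type = {"s1": "S", "d1": "D", "i1": "I", "i2": "I", "t1": "T", "t2": "T"}
# Targets reachable from s1 along the S -> D -> I -> T meta-path:
targets = meta_path_neighbors(adj, node_type, "s1", ["S", "D", "I", "T"])
```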
The meta-paths constitute a domain-informed library that abstracts operation-stage regularities commonly observed in AUS—namely sensing, decision-making, influence/execution, target engagement, and feedback—together with widely recurring patterns such as reconnaissance information sharing (sensor–sensor consensus), hierarchical decision coordination (multi-level decider interaction), and decision feedback (sensor–decider confirmation) [14,17,18,41]. Their explicit structures are:
$$P_1: S \to D \to I \to T \to S, \qquad P_2: S \to S, \qquad P_3: D \to D, \qquad P_4: S \to D \to S.$$
For every node $v \in V$, we maintain a family of path-specific embeddings $\{ h_v^{(k)} \}_{k=1}^{K}$ and their fused representation
$$h_v = \sum_{k=1}^{K} \beta_k\, h_v^{(k)},$$
where $\beta_k$ are learnable fusion weights.
State space $\mathcal{S}$. At discrete time $t$, the global state $s_t$ is an attention-weighted aggregation of the fused node embeddings:
$$s_t = \sum_{v \in V} \alpha_v\, h_v,$$
where $\alpha_v$ are attention coefficients derived from $\{h_v\}_{v \in V}$. Thus $s_t$ summarizes the heterogeneous, typed, and directed multi-hop information implied by $\mathcal{P}$ and $\{E_k\}$.
Action space $\mathcal{A}$. A decision specifies a cooperation–replacement pair $a_t = (v_c, v_r)$ drawn from $(V \cup \{\varnothing\}) \times (V \cup \{\varnothing\})$, with the semantic constraint $v_c \neq v_r$ whenever both are valid nodes. The nominal action space is
$$\mathcal{A} = (V \cup \{\varnothing\}) \times (V \cup \{\varnothing\}).$$
Feasibility is enforced by binary masks $M^{\mathrm{coop}}$ and $M^{\mathrm{rep}}$ for cooperation and replacement, together with a hard meta-path mask $M^{\mathrm{meta}}$ that retains only actions consistent with at least one meta-path constraint derived from $\mathcal{P}$ and $\{E_k\}$. For exploration we use a relaxed mask
$$\tilde{M} = (1 - \epsilon)\, M^{\mathrm{meta}} + \epsilon,$$
with a small $\epsilon > 0$; probabilities are normalized within the policy.
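A minimal sketch of the hard and relaxed masks, assuming a flat action list and toy logits: the relaxation keeps a small probability $\epsilon$ on actions the hard mask would forbid, and the masked distribution is renormalized inside the policy.

```python
import math

def relaxed_mask(hard_mask, eps=0.05):
    """(1 - eps) * hard_mask + eps, elementwise."""
    return [(1.0 - eps) * m + eps for m in hard_mask]

def masked_policy(logits, mask):
    """Softmax over logits, reweighted by the (relaxed) mask and renormalized."""
    weights = [math.exp(l) * m for l, m in zip(logits, mask)]
    z = sum(weights)
    return [w / z for w in weights]

hard = [1.0, 0.0, 1.0]   # action 1 violates every meta-path constraint
probs = masked_policy([0.0, 0.0, 0.0], relaxed_mask(hard, eps=0.05))
```

Under the hard mask, action 1 would receive probability zero; under the relaxed mask it keeps a small exploration probability while feasible actions dominate.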
Transition kernel $P$. Given $s_t$ and $a_t$, the environment applies the selected recovery primitive to the current topology $G_t$: (i) replacement substitutes a compromised node by the designated standby and reactivates incident typed links that realize at least one meta-path instance; (ii) cooperation reroutes along alternative functional nodes consistent with $\mathcal{P}$. Denote the updated graph by $G_{t+1}$. Path-specific neighborhoods and embeddings are recomputed on $G_{t+1}$, fused to $\{h_v\}$, and aggregated to yield $s_{t+1}$. Stochastic exogenous factors (e.g., new failures) are absorbed by $P(s_{t+1} \mid s_t, a_t)$.
Reward function R. Let rt denote the unshaped (base) environment reward at time step t, which reflects the immediate reward received by the system.
Following the resilience-assessment model proposed by prior studies [14,17,41,42], we define at each time step $t$ an instantaneous resilience score $\rho_t$ that mirrors their global metric $R$. Let $f_t$ denote the current operational performance of the AUS; $f_b$ the mission-baseline performance; $f_0$ the pre-attack performance at $t_0$; $f_m$ the minimum performance reached at $t_m$ after the attack at $t_a$; $t_r$ the restoration time when the performance first returns to (or surpasses) $f_b$; and $t_e$ the end of the evaluation horizon.
The mean post-recovery performance over the interval $[t_r, t_e]$ is
$$\bar{f} = \frac{1}{t_e - t_r + 1} \sum_{\tau = t_r}^{t_e} f_\tau.$$
The normalised recovery-time factor is
$$\delta = \frac{t_e - t_r}{t_e - t_a},$$
and the completion factor is
$$\kappa_t = C_m\, \mathbb{1}(f_t \ge f_b),$$
with importance weight $C_m > 0$ and indicator $\mathbb{1}(\cdot)$. Using these quantities, the instantaneous score is
$$\rho_t = \frac{\bar{f}}{f_b}\, \delta + \kappa_t,$$
where larger values signify a state that is closer to the desired resilient behaviour—high performance, rapid recovery, and stable operation. This scalar is inserted into the shaped reward:
$$\hat{r}_t = r_t + \lambda\, \rho_t,$$
where $r_t$ is the base environment reward and $\lambda > 0$ weights the resilience shaping term.
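For illustration, the following sketch evaluates the instantaneous score on a toy performance trace, under assumed concrete forms of the three factors described above (mean post-recovery performance, normalized recovery-time factor, completion bonus); the constants, trace, and exact combination rule are illustrative assumptions, not the paper's calibrated metric.

```python
# Numeric sketch (assumed formulas) of an instantaneous resilience score.

def resilience_score(f, f_b, t_a, t_r, t_e, c_m=0.5):
    """f: performance trace indexed by time step; returns the score at t = t_e."""
    f_bar = sum(f[t_r:t_e + 1]) / (t_e - t_r + 1)   # mean post-recovery performance
    delta = (t_e - t_r) / (t_e - t_a)               # faster recovery -> larger factor
    kappa = c_m if f[t_e] >= f_b else 0.0           # completion bonus
    return (f_bar / f_b) * delta + kappa

# Performance dips after an attack at t_a = 2 and recovers by t_r = 5.
trace = [1.0, 1.0, 0.4, 0.6, 0.8, 1.0, 1.0, 1.0]
score = resilience_score(trace, f_b=1.0, t_a=2, t_r=5, t_e=7)
```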
To accommodate the semantic heterogeneity inherent in evaluating collaborative efficiency across diverse meta-paths in AUS, we introduce a meta-path-aware similarity metric
$$\mathrm{Sim}(u, v) = \sum_{k=1}^{K} \beta_k\, \mathrm{sim}\!\left( h_u^{(k)}, h_v^{(k)} \right),$$
where $h_u^{(k)}$ and $h_v^{(k)}$ denote the embeddings of nodes $u$ and $v$ specific to meta-path $P_k$, as defined in Eq (6). The similarity function is cosine similarity:
$$\mathrm{sim}(x, y) = \frac{x^{\top} y}{\lVert x \rVert\, \lVert y \rVert}.$$
The weights $\beta_k$ are shared with those used in Eq (6), ensuring consistent meta-path importance across embedding fusion and collaborative assessment.
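A small sketch of this metric, with toy fusion weights and per-path embeddings (all values are illustrative):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def meta_path_similarity(emb_u, emb_v, beta):
    """Sum_k beta_k * cos(h_u^(k), h_v^(k)) over K meta-paths."""
    return sum(b * cosine(hu, hv) for b, hu, hv in zip(beta, emb_u, emb_v))

beta = [0.7, 0.3]                 # shared fusion weights
emb_u = [[1.0, 0.0], [0.0, 1.0]]  # per-meta-path embeddings of node u
emb_v = [[1.0, 0.0], [1.0, 0.0]]  # per-meta-path embeddings of node v
sim = meta_path_similarity(emb_u, emb_v, beta)
```

Here the nodes agree under the first meta-path and disagree under the second, so the fused similarity reflects the first path's larger weight.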
Discounting and return. Let $\gamma \in [0, 1)$ be the discount factor. The discounted return from time $t$ is
$$G_t = \sum_{i=0}^{\infty} \gamma^{i}\, \hat{r}_{t+i},$$
which balances immediate recovery benefits with long-horizon resilient performance.
Policies and objective. A policy $\pi(a_t \mid s_t)$ is a stochastic kernel on $\mathcal{A}$ given $\mathcal{S}$. Over the initial-state distribution $p(s_0)$, the control objective is
$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{s_0 \sim p(s_0)}\!\left[ \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \hat{r}_t \,\middle|\, s_0 \right] \right].$$
This optimal policy satisfies the Bellman optimality condition:
$$Q^{*}(s, a) = \mathbb{E}\!\left[ \hat{r}_t + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a \right].$$
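The discounted return can be accumulated backward over a finite reward tail, as in this short sketch (the rewards are illustrative):

```python
# Sketch of the discounted return G_t = sum_i gamma^i * r_{t+i}.

def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):   # backward accumulation: g = r + gamma * g
        g = r + gamma * g
    return g

g0 = discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25
```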
4 Method
4.1 Ethics statement
No ethical approval was required for this study because the research involved only publicly available anonymized data and did not involve human participants, identifiable human material, or animals.
The MPGPD-RC framework, as depicted in Fig 2, integrates two interdependent modules: (a) Meta-Path Guided Embedding and (b) Meta-Path-Aware Policy Distillation. The process begins with the generation of multi-granular node embeddings using meta-path-constrained graph attention networks (GATs), where each embedding encodes interaction schemas reflective of the heterogeneous AUS topology. These path-specific embeddings are subsequently fused through an attention-based mechanism to form a comprehensive global state vector, which encapsulates the structural semantics of the AUS network. This global state vector is then used as a common input for both the teacher and student policy networks. The teacher policy is optimized through Proximal Policy Optimization (PPO) under relaxed meta-path constraints, generating high-fidelity trajectories that align with the AUS’s structural intricacies. Concurrently, the student policy is trained via supervised learning, employing a dual-channel distillation approach that minimizes Kullback-Leibler (KL) divergence between action distributions and enforces structural coherence by contrastively aligning trajectory embeddings encoded by a GRU.
(a) Meta-Path Guided Embedding, responsible for encoding heterogeneous inter-node relationships into semantically rich and structurally aware representations, and (b) Meta-Path-Aware Policy Distillation, which facilitates the transfer of expert-level decision policies from a teacher model to a computationally efficient student model while maintaining the structural semantics intrinsic to the AUS topology.
4.2 Meta-path guided embedding for AUS
Fig 2(a) depicts the embedding module, which comprises three sequential stages: (1) meta-path-constrained traversal identifies semantically salient node sequences reflective of heterogeneous relational contexts; (2) path-specific graph attention networks (GATs), enhanced by contrastive sampling strategies, learn distinct embeddings for each meta-path to capture contextual dependencies; (3) an attention-based aggregation mechanism integrates the resulting embeddings into a unified node representation that encapsulates both localized structural attributes and higher-order semantic correlations.
For each meta-path $P_k$, a meta-path-specific embedding $h_v^{(k)}$ is computed for node $v$. Let $h_v^{(k, l)}$ denote the embedding of node $v$ at layer $l$, with $h_v^{(k, 0)}$ initialized from the raw node features. At each layer $l$, node $v$ aggregates messages from its meta-path-specific neighborhood $\mathcal{N}_k(v)$. Denote $h_u^{(k, l-1)}$ as the embedding of neighbor node $u$ from the previous layer $(l-1)$. The message propagated from $u$ to $v$ is defined as:
$$m_{u \to v}^{(k, l)} = \sigma\!\left( W_k^{(l)} h_u^{(k, l-1)} + b_k^{(l)} \right),$$
where $W_k^{(l)}$ and $b_k^{(l)}$ are trainable parameters associated with meta-path $P_k$ at layer $l$, and $\sigma(\cdot)$ denotes a ReLU activation function. After computing the set of incoming messages $\{ m_{u \to v}^{(k, l)} \}_{u \in \mathcal{N}_k(v)}$, node $v$ aggregates these messages to update its embedding as follows:
$$h_v^{(k, l)} = \mathrm{ELU}\!\left( \sum_{u \in \mathcal{N}_k(v)} \alpha_{uv}^{(k, l)}\, m_{u \to v}^{(k, l)} \right), \qquad \alpha_{uv}^{(k, l)} = \frac{\exp\!\left( a_k^{(l)\top} \left[ Q_k^{(l)} h_v^{(k, l-1)} \,\Vert\, K_k^{(l)} h_u^{(k, l-1)} \right] \right)}{\sum_{u' \in \mathcal{N}_k(v)} \exp\!\left( a_k^{(l)\top} \left[ Q_k^{(l)} h_v^{(k, l-1)} \,\Vert\, K_k^{(l)} h_{u'}^{(k, l-1)} \right] \right)},$$
where $\mathrm{ELU}(\cdot)$ represents the ELU activation function applied for nonlinear transformation, $\alpha_{uv}^{(k, l)}$ denotes the attention coefficient capturing the relative importance of neighbor $u$ to node $v$ under meta-path $P_k$ at layer $l$, $Q_k^{(l)}$ and $K_k^{(l)}$ are the meta-path-specific query and key projection matrices utilized in Eq (25), and $a_k^{(l)}$ is the learnable attention vector corresponding to meta-path $P_k$ at layer $l$.
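The attention-weighted aggregation can be sketched as follows, with the projection matrices collapsed into a toy scoring function for readability (an assumption for illustration, not the full parameterization):

```python
import math

def attention_aggregate(h_v, neighbor_embs, score):
    """One aggregation step: softmax attention over ReLU messages.

    h_v' = sum_u softmax_u(score(h_v, h_u)) * ReLU(h_u)
    """
    messages = [[max(x, 0.0) for x in h_u] for h_u in neighbor_embs]  # ReLU messages
    logits = [score(h_v, h_u) for h_u in neighbor_embs]
    z = sum(math.exp(l) for l in logits)
    alphas = [math.exp(l) / z for l in logits]                        # softmax weights
    dim = len(h_v)
    return [sum(a * m[d] for a, m in zip(alphas, messages)) for d in range(dim)]

dot = lambda x, y: sum(a * b for a, b in zip(x, y))   # toy score function
h_new = attention_aggregate([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], dot)
```

The neighbor most aligned with the center node receives the larger attention weight, so its message dominates the update.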
After deriving the meta-path-specific embeddings $h_v^{(k)}$ for each $P_k \in \mathcal{P}$, a unified node representation is obtained by computing a weighted summation over the set $\{ h_v^{(k)} \}_{k=1}^{K}$. Formally,
$$h_v = \sum_{k=1}^{K} \beta_k\, h_v^{(k)},$$
where each scalar $\beta_k$ signifies the relative contribution of the $k$-th meta-path to the composite embedding of node $v$:
$$\beta_k = \frac{\exp\!\left( q^{\top} \tanh\!\left( W h_v^{(k)} + b \right) \right)}{\sum_{k'=1}^{K} \exp\!\left( q^{\top} \tanh\!\left( W h_v^{(k')} + b \right) \right)},$$
where $W$, $b$, and $q$ are trainable parameters.
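A sketch of the fusion step, with toy saliency scores standing in for the learned quantities $q^{\top}\tanh(W h_v^{(k)} + b)$ (an illustrative assumption):

```python
import math

def fuse(embeddings, scores):
    """Softmax the per-meta-path scores into weights, then weighted-sum."""
    z = sum(math.exp(s) for s in scores)
    betas = [math.exp(s) / z for s in scores]
    dim = len(embeddings[0])
    fused = [sum(b * e[d] for b, e in zip(betas, embeddings)) for d in range(dim)]
    return fused, betas

# Two meta-path embeddings; the first path scores exp-ratio 3:1.
fused, betas = fuse([[2.0, 0.0], [0.0, 2.0]], scores=[math.log(3.0), 0.0])
```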
To ensure that $h_v^{(k)}$ encapsulates the structural dependencies governed by meta-path $P_k$, we define an induced edge set $E_k \subseteq V \times V$. Specifically, $E_k$ is constructed by enumerating all valid path instances conforming to the schema of $P_k$. Formally, we define:
$$E_k = \{\, (v, u) \mid \text{there exists a path instance from } v \text{ to } u \text{ whose relation-type sequence matches } P_k \,\},$$
where each edge $(v, u) \in E_k$ indicates that there exists at least one path instance from node $v$ to node $u$ whose relation-type sequence matches the meta-path $P_k$. Thus, $E_k$ captures the set of node pairs semantically connected under $P_k$.
The objective is to enforce high similarity approaching $+1$ for $(v, u) \in E_k$, and low similarity approaching $-1$ for $(v, u) \notin E_k$. Formally, we define:
$$\mathcal{L}_k = \sum_{(v, u) \in E_k} \left( 1 - \tilde{\sigma}\!\left( h_v^{(k)\top} h_u^{(k)} \right) \right) + \sum_{(v, u) \notin E_k} \left( 1 + \tilde{\sigma}\!\left( h_v^{(k)\top} h_u^{(k)} \right) \right),$$
where $\tilde{\sigma}(\cdot) = \tanh(\cdot)$ is a sigmoid-shaped squashing function that bounds the inner product between $-1$ and $+1$. The first summation rewards cases where $v$ and $u$ are deemed connected by $P_k$ by encouraging $\tilde{\sigma}(h_v^{(k)\top} h_u^{(k)}) \to +1$. The second summation penalizes pairs $(v, u) \notin E_k$, encouraging $\tilde{\sigma}(h_v^{(k)\top} h_u^{(k)}) \to -1$. Aggregating $\mathcal{L}_k$ across all meta-paths $P_k \in \mathcal{P}$ yields the complete unsupervised embedding loss:
$$\mathcal{L}_{\mathrm{emb}} = \sum_{k=1}^{K} \mathcal{L}_k.$$
Minimization of $\mathcal{L}_{\mathrm{emb}}$ guarantees that each meta-path-specific embedding $h_v^{(k)}$ reconstructs the relational dependencies defined by its corresponding meta-path $P_k$.
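A numeric sketch of the per-meta-path objective under a squashed inner-product form (the tanh squashing, embeddings, and pair sets are illustrative assumptions): connected pairs are pushed toward similarity $+1$, unconnected pairs toward $-1$.

```python
import math

def pair_loss(h_v, h_u, connected):
    """Penalty for one pair: distance of tanh(<h_v, h_u>) from its +/-1 target."""
    s = math.tanh(sum(a * b for a, b in zip(h_v, h_u)))
    return (1.0 - s) if connected else (1.0 + s)

def embedding_loss(embs, pos_pairs, neg_pairs):
    """L_k = sum over positives of (1 - s) + sum over negatives of (1 + s)."""
    loss = sum(pair_loss(embs[v], embs[u], True) for v, u in pos_pairs)
    loss += sum(pair_loss(embs[v], embs[u], False) for v, u in neg_pairs)
    return loss

# a~b are connected under P_k and point the same way; c points the other way.
embs = {"a": [2.0, 0.0], "b": [2.0, 0.0], "c": [-2.0, 0.0]}
good = embedding_loss(embs, pos_pairs=[("a", "b")], neg_pairs=[("a", "c")])
bad = embedding_loss(embs, pos_pairs=[("a", "c")], neg_pairs=[("a", "b")])
```

Embeddings that respect $E_k$ yield a near-zero loss, while the inverted pairing is heavily penalized.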
By restricting message passing to domain-specific meta-paths, learning path-specific attention encoders under contrastive regularization, and fusing the resulting representations through data-driven saliency weights, the embedding module disentangles typed, ordered, and direction-aware interaction channels that extend beyond first-hop adjacency, suppresses spurious correlations arising from heterogeneous neighborhoods, and adaptively emphasizes the coordination routes most predictive of resilient behavior.
4.3 Meta-path-aware policy distillation
Fig 2(b) illustrates the distillation pipeline. Given the global state $s_t$, synthesized from the fused node embeddings, the teacher policy $\pi_{\phi}^{T}$ in the upper branch is optimized using Proximal Policy Optimization (PPO) under relaxed meta-path constraints, producing advantage signals $A_t$ and value estimates $V(s_t)$. These teacher-generated trajectories are processed by the student branch, wherein a GRU-based encoder models the sequential dependencies of meta-path transitions. The student policy $\pi_{\theta}^{S}$ is trained via a dual-objective scheme incorporating a path-sensitive contrastive loss and Kullback–Leibler (KL) divergence, thereby enabling the student to emulate the teacher's behavior while preserving the structural semantics imposed by the meta-paths.
Teacher Policy with PPO: The teacher policy interfaces with the environment to generate high-quality decision trajectories, which serve as optimal or near-optimal behavioral references for student policy distillation. Training of the teacher is performed using Proximal Policy Optimization (PPO). At each decision point $s_t$, the teacher outputs a nominal policy distribution $\pi_{\phi}(a \mid s_t)$ over the admissible action space $\mathcal{A}$. The resulting relaxed policy $\tilde{\pi}_{\phi}$ is defined as:
$$\tilde{\pi}_{\phi}(a \mid s_t) = \frac{\tilde{M}(a)\, \pi_{\phi}(a \mid s_t)}{\sum_{a'} \tilde{M}(a')\, \pi_{\phi}(a' \mid s_t)}.$$
The teacher is trained using PPO with the objective:
$$\mathcal{L}_{\mathrm{PPO}}(\phi) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\phi)\, A_t,\; \mathrm{clip}\!\left( r_t(\phi),\, 1 - \epsilon,\, 1 + \epsilon \right) A_t \right) \right],$$
where $r_t(\phi)$ is the probability ratio defined by:
$$r_t(\phi) = \frac{\tilde{\pi}_{\phi}(a_t \mid s_t)}{\tilde{\pi}_{\phi_{\mathrm{old}}}(a_t \mid s_t)},$$
and $A_t$ is the advantage function. This clipped objective stabilizes teacher training while balancing exploration and exploitation.
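The clipping behavior for a single transition can be sketched as follows (probabilities and advantages are toy values): the ratio of new to old action probability is clipped to $[1-\epsilon, 1+\epsilon]$ before being multiplied by the advantage, and the pessimistic minimum is taken.

```python
# Sketch of the clipped PPO surrogate for one transition.

def ppo_term(p_new, p_old, advantage, eps=0.2):
    ratio = p_new / p_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large policy jump on a positive-advantage action is capped at 1 + eps:
capped = ppo_term(p_new=0.9, p_old=0.3, advantage=1.0, eps=0.2)
```

For a negative advantage, the minimum selects the unclipped term, so the objective still penalizes the policy fully for moving toward a harmful action.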
Every $K$ training epochs, new trajectories are generated using the updated teacher policy $\tilde{\pi}_{\phi}$, and 25% of the outdated samples in the experience buffer $\mathcal{B}$ are replaced to ensure data freshness and policy relevance.
Student policy: The student policy is trained to replicate the behavioral patterns of the teacher policy through two principal learning objectives:
1. Knowledge Distillation (KD): The student seeks to approximate the teacher’s softened action distribution, learning from the probabilistic outputs of the teacher policy to mimic its decision-making strategy.
2. Path-Aware Contrastive Learning (PaC): The student is trained to preserve the structural semantics captured in the teacher’s trajectories, with particular emphasis on aligning meta-path-induced relational patterns.
To encode the temporal and structural dynamics of the teacher's trajectories, a GRU (Gated Recurrent Unit) encoder is employed to process the sequence of state-action pairs. This sequence is transformed into a fixed-dimensional latent representation $c$, referred to as the trajectory code, which serves as a compact descriptor for contrastive alignment and policy distillation.
Let $z_t$ denote the concatenated vector representing the meta-path-specific encoding of the trajectory at time step $t$. The meta-path embedding for step $t$ is formally defined as:
$$z_t = m_t \odot \left[ h_{v_t}^{(1)} \,\Vert\, h_{v_t}^{(2)} \,\Vert\, \cdots \,\Vert\, h_{v_t}^{(K)} \right],$$
where $h_{v_t}^{(k)}$ denotes the embedding of node $v_t$ along meta-path $P_k$. The vector $m_t \in \{0, 1\}^{K}$ is a one-hot indicator (broadcast over the corresponding embedding segments) that activates the meta-path applied at time $t$. More precisely, its $k$-th component is defined as:
$$m_t[k] = \mathbb{1}\!\left( P^{(t)} = P_k \right),$$
where $P^{(t)}$ denotes the meta-path governing the decision at step $t$ of trajectory $\tau$, and $\mathbb{1}(\cdot)$ is the indicator function. The element-wise product $m_t \odot (\cdot)$ ensures that only the segment of the concatenated representation corresponding to the active meta-path contributes to the step-wise trajectory descriptor, while all other segments are masked out.
The GRU integrates the current input $z_t$ and the previous hidden state $h_{t-1}$ to compute the new hidden state $h_t$. It employs update and reset gates to control the flow of information, and the final hidden state is a combination of the previous state and the updated information. The equations for this process are:
$$u_t = \sigma\!\left( W_u z_t + U_u h_{t-1} \right), \qquad r_t = \sigma\!\left( W_r z_t + U_r h_{t-1} \right),$$
$$\tilde{h}_t = \tanh\!\left( W_h z_t + U_h \left( r_t \odot h_{t-1} \right) \right), \qquad h_t = (1 - u_t) \odot h_{t-1} + u_t \odot \tilde{h}_t,$$
where $u_t$ and $r_t$ are the update and reset gates, respectively, and $\sigma(\cdot)$ and $\tanh(\cdot)$ are the sigmoid and hyperbolic tangent activations.
The final output of the GRU at time step $T$ is defined as the trajectory code:
$$c = h_T \in \mathbb{R}^{d_c},$$
where $d_c$ denotes the dimensionality of the GRU hidden state. This trajectory code encapsulates the temporal evolution and structural semantics of the teacher's decision sequence, serving as a condensed representation to supervise and inform the student policy's learning process.
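A scalar sketch of the GRU recurrence and the resulting trajectory code, with the weight matrices collapsed to toy scalars (an illustrative assumption; a real encoder operates on vectors):

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

def gru_step(z, h_prev, w=1.0, u_w=1.0):
    u = sigmoid(w * z + u_w * h_prev)              # update gate
    r = sigmoid(w * z + u_w * h_prev)              # reset gate (same toy weights)
    h_tilde = math.tanh(w * z + u_w * (r * h_prev))  # candidate state
    return (1.0 - u) * h_prev + u * h_tilde        # convex combination

def encode(inputs):
    """Trajectory code = final hidden state after consuming all step inputs."""
    h = 0.0
    for z in inputs:
        h = gru_step(z, h)
    return h

code = encode([0.5, -0.5, 1.0])
```

Because each step is a convex combination of a tanh-bounded candidate and the previous state, the code stays in $(-1, 1)$.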
The student policy is trained not only to replicate the teacher's action distribution but also to internalize the structural regularities captured in the teacher's trajectory embeddings. The path-aware contrastive loss enforces structural alignment by minimizing the divergence between the teacher's and student's trajectory encodings:
$$\mathcal{L}_{\mathrm{PaC}} = -\log \frac{\exp\!\left( \mathrm{sim}\!\left( c_{\tau}^{T}, c_{\tau}^{S} \right) / \kappa \right)}{\sum_{\tau'} \exp\!\left( \mathrm{sim}\!\left( c_{\tau}^{T}, c_{\tau'}^{S} \right) / \kappa \right)},$$
where $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, $c_{\tau}^{T}$ and $c_{\tau}^{S}$ represent the teacher and student trajectory encodings of the same trajectory $\tau$, and $\kappa$ is a temperature hyperparameter that modulates the sensitivity of the similarity distribution.
In Eq (40), the pair $\left( c_{\tau}^{T}, c_{\tau}^{S} \right)$ constitutes the positive sample, while all other student encodings $c_{\tau'}^{S}$ with $\tau' \neq \tau$ within the same batch serve as negatives. This design ensures that each teacher trajectory embedding is pulled closer to its student counterpart while being contrasted against embeddings from other trajectories, thereby enhancing both alignment and discriminability.
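A sketch of the in-batch contrastive computation with toy trajectory codes (codes and temperature are illustrative): the teacher code's own student counterpart is the positive, and the other student codes in the batch serve as negatives.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def pac_loss(c_teacher, student_codes, pos_idx, temp=0.1):
    """-log softmax of sim(teacher, positive student) over the batch."""
    sims = [cosine(c_teacher, c) / temp for c in student_codes]
    z = sum(math.exp(s) for s in sims)
    return -math.log(math.exp(sims[pos_idx]) / z)

students = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_aligned = pac_loss([1.0, 0.0], students, pos_idx=0)     # positive matches
loss_misaligned = pac_loss([1.0, 0.0], students, pos_idx=2)  # positive opposes
```

When the positive student code aligns with the teacher code the loss is near zero; pairing the teacher with an opposing code makes the loss large.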
In parallel, the knowledge distillation loss compels the student policy to approximate the teacher's action distribution. This alignment is enforced through a Kullback–Leibler divergence objective:
$$\mathcal{L}_{\mathrm{KD}} = \mathbb{E}_{s_t \sim \mathcal{B}}\!\left[ D_{\mathrm{KL}}\!\left( \tilde{\pi}_{\phi}(\cdot \mid s_t) \,\Vert\, \pi_{\theta}^{S}(\cdot \mid s_t) \right) \right].$$
This formulation guides the student toward replicating the teacher’s decision behavior while preserving computational tractability and training stability.
The student policy is optimized through a structured training pipeline consisting of sequential stages. First, a batch of trajectories $\{\tau_i\}$ is uniformly sampled from the teacher's experience buffer $\mathcal{B}$, providing exposure to a diverse set of meta-path-constrained decision sequences. The learning objective integrates both decision-level imitation and structural consistency through the following composite loss function:
$$\mathcal{L}_{\mathrm{student}} = \lambda_{\mathrm{KD}}\, \mathcal{L}_{\mathrm{KD}} + \lambda_{\mathrm{PaC}}\, \mathcal{L}_{\mathrm{PaC}},$$
where $\lambda_{\mathrm{KD}}$ and $\lambda_{\mathrm{PaC}}$ are weighting coefficients balancing the contributions of knowledge distillation and path-aware contrastive learning, respectively.
The proposed meta-path guided teacher-student policy learning framework, delineated in Algorithm 1, facilitates a synergistic interaction between structured exploration driven by the teacher and efficient policy imitation performed by the student. The teacher policy π_T, trained using Proximal Policy Optimization (PPO) under meta-path constraints (Eq (33)), yields high-quality trajectories that embed both task-specific reward signals and meta-path-informed collaborative semantics. These trajectories are incrementally transferred to the student policy π_S through the dual-objective loss, which jointly enforces action distribution fidelity and structural alignment with meta-path relational patterns.
Algorithm 1 Meta-path guided teacher-student policy learning.
Require: Heterogeneous AUS graph G, meta-path set P, temperature τ, distillation weight λ_KD
Ensure: Student policy π_S
1: Initialize: Teacher policy π_T with PPO, student policy π_S with GRU encoder, experience buffer D
2: while not converged do
3:  // Teacher Policy Rollout
4:  for each episode do
5:   Generate trajectory using Eq (33) with meta-path mask
6:   Compute shaped reward
7:  end for
8:  Update teacher π_T via PPO objective (Eq (34))
9:  // Student Policy Distillation
10:  for each distillation step do
11:   Sample batch B from D
12:   Encode trajectories via GRU
13:   Compute L_KD and L_PaC
14:   Update π_S with cosine-annealed learning rate
15:  end for
16:  if buffer refresh is due then
17:   Replace 25% of D with new π_T-generated trajectories
18:  end if
19: end while
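The control flow of Algorithm 1 can be sketched as a small runnable Python skeleton. Every learning step is stubbed out (the rollout and update helpers below are hypothetical stand-ins, not the paper's implementation); what the sketch shows faithfully is the alternation between teacher rollouts, student distillation on sampled batches, and the periodic replacement of 25% of the buffer.

```python
import random

random.seed(0)
buffer = []                      # teacher experience buffer D

def teacher_rollout(episode):    # stub: would run Eq (33) with the meta-path mask
    return {"episode": episode, "source": "teacher"}

def student_update(batch):       # stub: would compute L_KD + L_PaC and take a step
    return len(batch)

for outer in range(3):                       # "while not converged" stand-in
    for ep in range(8):                      # teacher policy rollout
        buffer.append(teacher_rollout(ep))
    # (the teacher PPO update, Eq (34), would go here)
    for _ in range(4):                       # student policy distillation
        batch = random.sample(buffer, k=min(4, len(buffer)))
        student_update(batch)
    n_replace = len(buffer) // 4             # refresh 25% of the buffer
    del buffer[:n_replace]                   # drop the oldest quarter
    buffer.extend(teacher_rollout(f"fresh-{outer}-{i}") for i in range(n_replace))
```

Dropping the oldest quarter is one simple eviction choice; the paper does not specify which trajectories are replaced.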
5 Experiments
5.1 Experimental setup
Following the settings in Refs [14,17,18,41,43–45], we evaluate the efficacy and resilience of the proposed Meta-Path Guided Policy Distillation for Resilient Coordination (MPGPD-RC) framework through extensive experiments conducted in a simulated autonomous unmanned swarm (AUS) environment. As illustrated in Fig 3, the AUS topology comprises four heterogeneous node types: Sensor, Decision, Influencer, and Target. These nodes are interconnected via directed edges that encode the functional dependencies critical to the system’s operational workflow.
Fig 3. The system includes four types of entities: Sensor, Decision, Influencer, and Target.
In accordance with the methodologies outlined in Refs [14,17,18,41,43–45], we employed NetworkX version 3.2.1 [46] to construct the AUS graph and PyTorch version 2.2.2 for model training. As depicted in Fig 3, the four node types are denoted sensors (S), decision-making units (D), influencers (I), and targets (T). Table 1 summarizes the key structural attributes of the AUS Demo network, which serves as the experimental basis for evaluating the MPGPD-RC framework. The network comprises 185 nodes interconnected by 399 directed edges, each encoding the functional dependencies and communication pathways necessary to maintain the integrity and coherence of the system.
The experimental evaluation of the MPGPD-RC embedding was carried out under carefully configured settings covering both embedding generation and training procedures. Meta-path-guided embeddings were computed using a hierarchical graph attention network with a fixed embedding dimensionality of 128. Contrastive sampling was regulated by a temperature parameter of 0.1 to effectively differentiate between positive and negative node pairs. The embedding module was trained using the AdamW optimizer, initialized with a learning rate of 0.001, a batch size of 64, and executed over 500 training epochs.
The teacher policy was trained using Proximal Policy Optimization (PPO) with relaxed meta-path masks, employing an actor–critic architecture with a learning rate of 1 × 10−4, a discount factor γ, and a generalized advantage estimation (GAE) parameter λ. Teacher rollouts produced advantage estimates A_t and value targets V_t, serving as supervisory signals for the student policy. The student policy incorporated a single-layer GRU encoder with a hidden state dimension of 128 to model the sequential dependencies of meta-path transitions. The training objective combined Kullback–Leibler divergence and path-aware contrastive learning with weighting coefficients λ_KD and λ_PaC; these values were kept constant across all experiments to ensure fair comparability. Optimization was performed using the AdamW algorithm with an initial learning rate of 1 × 10−4 and a batch size of 64, and the training process spanned 500 epochs.
To evaluate the robustness of the MPGPD-RC framework, we simulate four distinct adversarial scenarios, each targeting specific node combinations within the AUS environment. Following prior studies [14,17,18,41], we model each attack as an instantaneous removal of the targeted node(s), rather than gradual functionality degradation. The attacked nodes are selected uniformly at random within the first ten timestamps, ensuring that disruptions occur early in the mission and force the framework to demonstrate recovery under stringent conditions. The training process was conducted on NVIDIA A100 Tensor Core GPUs.
SA (Sensor Attack): This case simulates the adversarial compromise of Sensor (S) nodes, which are tasked with acquiring environmental information essential to system perception.
DA (Decision Attack): In this case, the attack is directed at Decision (D) nodes, the core computational entities that transform sensory data into executable commands. Failure at this stage critically disrupts the decision-making chain, offering a rigorous evaluation of the system’s robustness against cognitive impairment.
IA (Influencer Attack): Here, the focus of the attack shifts to Influencer (I) nodes, which directly govern behavioral adaptation and strategic responses of the Target nodes.
MA (Multi-Node Attack): This case represents a coordinated adversarial operation against all four categories—Sensor, Decision, Influencer, and Target—posing the most severe stress test by inducing global systemic disruption.
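The attack model described above, an instantaneous removal of randomly chosen nodes of the targeted type(s) at a timestamp drawn from the first ten, can be sketched as follows. The graph, type tags, and helper name are illustrative stand-ins for the paper's environment.

```python
import random
import networkx as nx

random.seed(42)
G = nx.DiGraph()
for i in range(6):
    G.add_node(f"S{i}", ntype="Sensor")
for i in range(3):
    G.add_node(f"D{i}", ntype="Decision")
G.add_edges_from((f"S{i}", f"D{i % 3}") for i in range(6))

def attack(graph, target_types, ratio=0.5):
    """Instantaneously remove `ratio` of the nodes whose type is in target_types."""
    victims = [n for n, d in graph.nodes(data=True) if d["ntype"] in target_types]
    removed = random.sample(victims, k=int(len(victims) * ratio))
    t_attack = random.randint(0, 9)   # strike within the first ten timestamps
    graph.remove_nodes_from(removed)  # removal, not gradual degradation
    return removed, t_attack

removed, t_attack = attack(G, {"Sensor"}, ratio=0.5)  # an SA scenario at 50% intensity
```

An MA scenario would simply pass all four node types as `target_types`.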
5.2 Experimental analysis
In this subsection, we present a detailed performance evaluation of the MPGPD-RC framework under the previously defined node disruption scenarios. The analysis emphasizes the individual and collective contributions of the framework’s components to the overall resilience of the AUS. To contextualize the effectiveness of MPGPD-RC, comparative benchmarking was conducted against several state-of-the-art baseline methods [14,17]. The baseline algorithms are summarized as follows:
- Self-Resistance (SR): Assesses the intrinsic resilience of the AUS without the incorporation of post-attack recovery mechanisms.
- Random Node Reconnection (RN) [14]: Implements random reconnections of disrupted nodes to functional ones, subject to capacity and edge-type constraints.
- Maximum Degree Node Reconnection (MDN): Prioritizes reconnection based on node degree centrality, favoring nodes with higher connectivity to improve system robustness.
- CGPPO Algorithm (CGPPO) [14]: Integrates Graph Convolutional Networks (GCNs) with Proximal Policy Optimization (PPO) within a reinforcement learning framework to optimize recovery strategies.
- Genetic Algorithm-Based Multi-Swarm Cooperative Reconfiguration (GA) [43]: Utilizes a Genetic Algorithm to iteratively optimize cooperative reconfiguration strategies for enhancing AUS resilience.
- Kill Chain Optimization Method (KCOM) [11]: Focuses on optimizing the observation, positioning, decision-making, and execution phases within the kill chain for improved system resilience.
- Enhanced-Resilience Multi-Layer Attention GCN (EGCN) [18]: Combines multi-layer attention mechanisms and regularization techniques to effectively model both global and local graph structures, enhancing resilience.
Table 2 presents a comparative analysis of the resilience metric achieved by MPGPD-RC and seven baseline algorithms across 12 disruption scenarios. The proposed framework consistently delivers the highest resilience scores across all disruption levels (25%, 50%, 75%), with performance differentials becoming more pronounced as attack intensity escalates. While methods such as EGCN and CGPPO incorporate graph neural network modules, they lack the capacity to extract and exploit deep structural semantics to the extent enabled by MPGPD-RC. This limitation undermines their ability to formulate resilient coordination policies under system-level stress. In the MA-50% scenario, MPGPD-RC attains a resilience of 0.813, outperforming EGCN’s 0.779 by 4.2%, a gain rooted in its capacity to explicitly encode and exploit inter-node collaboration chains. Under DA-50% conditions, our framework achieves a resilience of 0.684—5.4% above KCOM’s 0.637—by dynamically distilling and preserving the system’s functional hierarchies. MPGPD-RC, by incorporating meta-path-based semantics into its policy learning and decision-making processes, sustains consistently higher resilience, exhibiting performance gains of approximately 3% to 6% over EGCN and CGPPO under IA scenarios.
5.3 Robustness study
We next examine the sensitivity of each baseline to stochastic variations in the initial AUS state—a critical consideration for real-world deployments. For each damage ratio ranging from 40% to 60%, we generate fifty distinct network realizations, retrain every method to convergence, and report the mean final resilience along with its associated standard deviation (σ). Fig 4 presents the aggregated outcomes in the form of grouped bar charts: error whiskers capture the across-run variability in performance.
Fig 4. For each damage ratio (40%–60%) we re-initialize the AUS topology fifty times and report the mean resilience (bars) with one-standard-deviation whiskers across the four attack modes.
In trials with repeated initializations, the disparity in performance between the two leading baselines—EGCN and CGPPO—becomes increasingly evident. EGCN employs a deep architecture of traditional GCN layers followed by a single actor-critic head. When the graph spectrum is disrupted through stochastic rewiring, the multi-hop propagation exacerbates feature over-smoothing, resulting in a highly unstable gradient landscape. Although the overall performance may appear competitive under certain conditions (e.g., IA with 40% damage), the variance across runs is notably higher.
On the other hand, CGPPO partially mitigates this instability by incorporating CARCI-guided trajectories. However, it still depends on online learning of node representations and optimizes based on a fixed scalar reward. As a result, its convergence target tends to shift in response to changes in the initial topology—especially when particular structural configurations disproportionately favor certain resilience aspects (e.g., rapid recovery under SA vs. steady-state stability in MA). This leads to a consistent performance gap when compared to the proposed framework.
The MPGPD-RC method, which incorporates both Meta-Path Aware Policy Distillation and robust reinforcement learning strategies, outperforms the other baselines in terms of both convergence stability and long-term performance consistency. As evidenced by the results in Fig 4, our approach consistently achieves the highest resilience values across different attack scenarios, especially under conditions of higher damage (50% and 60%). The proposed method’s robustness to initial state variability arises from its ability to leverage meta-path-guided structural learning, which provides a more comprehensive understanding of the global topology and long-range dependencies. These results demonstrate the advantage of incorporating global graph semantics into policy distillation, enabling the proposed method to consistently deliver superior performance and greater resilience under dynamic and unpredictable real-world conditions.
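The aggregation step behind Fig 4 is straightforward and can be sketched as follows; the resilience values here are random stand-ins for the per-run outcomes, and the dictionary layout is an assumption about how one might organize the fifty runs per damage ratio.

```python
import numpy as np

rng = np.random.default_rng(7)
runs_per_setting = 50

# stand-in results: one array of final-resilience values per damage ratio
results = {ratio: rng.uniform(0.6, 0.9, size=runs_per_setting)
           for ratio in (0.4, 0.5, 0.6)}

# bar height = mean over runs; whisker = one sample standard deviation
summary = {ratio: (float(np.mean(v)), float(np.std(v, ddof=1)))
           for ratio, v in results.items()}
```

Using the sample standard deviation (`ddof=1`) is the conventional choice when the fifty realizations are treated as a sample of possible initial states.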
5.4 Hyperparameters analysis
In this subsection, we investigate the influence of the hyperparameters λ_KD and λ_PaC on the resilience performance R of the MPGPD-RC framework. Fig 5 presents a comprehensive 3D surface analysis that evaluates R across four distinct disruption scenarios as a function of the loss weighting coefficients. Each subfigure plots λ_KD (the weight assigned to knowledge distillation) along one axis and λ_PaC (the weight assigned to path-aware contrastive learning) along the other, with the resulting R values rendered on the vertical axis. The color-coded columns correspond to different attack scenarios: Sensor Node Attacks (SA), Decision Node Attacks (DA), Influencer Node Attacks (IA), and Mixed Node Attacks (MA). The results indicate that resilience under DA remains relatively stable across a broad range of (λ_KD, λ_PaC) configurations, suggesting robustness to hyperparameter choice. In contrast, IA and MA exhibit pronounced sensitivity, with sharp fluctuations in R.
Across all four disruption scenarios, the resilience surface consistently peaks at the 2:1 configuration of λ_KD to λ_PaC. This setting allocates twice the weight to KL-based decision imitation relative to path-aware contrastive alignment, thereby balancing the gradient contributions of both objectives and preventing the dominance of either component. Under SA and DA conditions, this ratio promotes stable policy convergence by preserving sufficient flexibility in structural modeling without undermining decision fidelity. In the more complex IA and MA settings, where dense inter-node dependencies exacerbate sensitivity to structural perturbations, the chosen configuration effectively mitigates trajectory degradation when λ_PaC is too low and prevents excessive rigidity in action alignment when λ_KD is too high. This parameterization yields a robust equilibrium between structural coherence and behavioral precision. Consequently, the 2:1 weighting is selected as the default setting for all experimental benchmarks.
Further, setting λ_KD = 0 removes the KL term and yields a PaC-only regime that optimizes exclusively for structural alignment. As shown in Fig 5, this configuration produces a clear reduction in resilience R: without the behavioral anchor provided by the distillation loss, the student over-emphasizes meta-path regularities and under-utilizes the teacher’s action distribution, which impairs policy quality in scenarios where fine-grained, local corrections are essential to recovery. Conversely, setting λ_PaC = 0 removes the contrastive term and yields a KD-only regime that imitates the teacher while becoming path-agnostic. Across all attack settings, this variant attains only moderate R because typed, ordered, and directed multi-hop dependencies are not preserved; the policy reproduces marginal action preferences but fails to encode the higher-order semantic routes crucial for coordinated recovery under structural perturbations.
5.5 Ablation study
To assess the individual contributions of each component within the MPGPD-RC framework, we conducted an ablation study consisting of three distinct experimental settings. In each setting, a key component of the framework was systematically removed or altered to evaluate its impact on the overall resilience R. The findings of these experiments are summarized in Table 3.
Without Meta-Path Guided Embeddings (w/o MGE): In this ablation setting, node representations are derived using a conventional graph convolutional network (GCN) trained in an end-to-end fashion, omitting the meta-path–guided embedding (MGE) module. The MGE component is instrumental in extracting high-fidelity features that encode the structural heterogeneity of the AUS graph, including node topologies, interconnectivity patterns, and higher-order relational semantics.
As shown in Table 3, when the meta-path–guided embedding module is ablated, the representation collapses to a path-agnostic, neighborhood-smoothing regime that largely mirrors the inductive bias of CGPPO, where messages from heterogeneous relations are mixed without preserving typed, ordered, and directed multi-hop semantics. When key nodes are compromised, as in SA or DA scenarios, or when multiple subsystems are simultaneously affected in MA scenarios, the absence of MGE significantly degrades the agent’s ability to discern the altered network structure. This performance deterioration underscores the MGE module’s critical role in enhancing policy robustness by enabling accurate recognition of compromised meta-paths and facilitating structure-aware adaptation during system recovery.
Without Meta-Path Aware (w/o MPA): In this ablation setting, the student-side distillation process is entirely omitted. The agent is trained exclusively through the teacher policy using Proximal Policy Optimization (PPO), with no Kullback–Leibler or contrastive alignment mechanisms. The Meta-Path Aware (MPA) module is responsible for integrating global structural semantics by leveraging multi-hop meta-path relationships that reveal latent connections and alternative coordination pathways across the graph.
This capability is especially vital in scenarios involving indirect or distributed attacks, such as IA and MA, where damage propagates through non-adjacent nodes or spans multiple failure zones. The MPA module enables the agent to transcend local neighborhoods, identify structurally meaningful backup routes, and coordinate recovery across dispersed impact regions. By exploiting meta-path semantics, the agent gains a global operational perspective essential for resilient coordination.
Ablation results show a clear decline in resilience performance in the absence of MPA, with notable drops below the full model’s 0.621 in the IA scenario and 0.791 in the MA scenario. These findings indicate that without MPA, the agent lacks the ability to leverage global structural cues, limiting its capacity to mitigate widespread or stealthy disruptions. This validates the critical contribution of the MPA module in enhancing resilience by embedding long-range dependencies into the policy learning process.
5.6 Convergence analysis
To further evaluate the convergence behavior and assess the impact of Meta-Path guidance on the long-term resilience development, we conducted a longitudinal analysis of system performance across training episodes. As shown in Fig 6, resilience is tracked under different disruption scenarios, comparing the full MPGPD-RC framework with its Meta-Path Aware–ablated variant (w/o MPA) to highlight the influence of Meta-Path guidance on the convergence of resilience over time.
During the initial 200 training episodes, the configuration without Meta-Path-Aware Policy Distillation (blue curve), which relies solely on the teacher policy with Proximal Policy Optimization (PPO), shows a steeper increase in resilience (R) under Sensor Node Attacks (SA) and Decision Node Attacks (DA). This early advantage arises from the teacher policy’s direct optimization through PPO, which benefits from relaxed meta-path constraints that enable rapid adaptation to the environment. By focusing exclusively on immediate reward signals, the teacher policy achieves high short-term performance quickly, without the added complexity of contrastive alignment or KL divergence distillation, leading to a faster but potentially less robust early convergence.
In contrast, the full MPGPD-RC configuration (red curve), which integrates Meta-Path-Aware Policy Distillation from the beginning, incorporates both KL divergence alignment and path-aware contrastive learning. While this dual-objective approach introduces additional complexity and moderates early convergence, it equips the student policy with a richer understanding of global graph semantics and multi-hop dependencies. This more comprehensive structural awareness slows early learning but strengthens the model’s ability to generalize, reducing the risk of overfitting to local disruptions and fostering more stable long-term convergence.
As training progresses beyond approximately episode 500, the full MPGPD-RC model (red curve) consistently surpasses the teacher-only baseline across all four disruption scenarios—Sensor Node Attacks (SA), Decision Node Attacks (DA), Influencer Node Attacks (IA), and Mixed Node Attacks (MA). By episode 2000, the red curve stabilizes at its final resilience value under both MA and SA conditions, exhibiting minimal variance. This stabilization reflects the superior long-term convergence of the MPGPD-RC model, with the path-aware contrastive loss gradually aligning the student’s trajectory encodings to the global graph structure defined by meta-paths, such as the sensor → decision → influencer → target chain. This alignment ensures that the model maintains long-range dependencies, leading to improved robustness. At the same time, KL divergence ensures that the student’s learning remains grounded in the teacher’s policy, further enhancing the stability and convergence of the student policy towards optimal behavior.
5.7 Time complexity analysis
To assess engineering practicality in terms of convergence efficiency and time cost, we report time-to-convergence and performance under the 50% attack setting in Table 4. Here, Total time (s) denotes the total wall-clock training time required to reach a common convergence criterion, Total episodes is the number of training episodes to convergence, per episode (s) is the mean wall-clock duration per episode (including environment interaction and forward/backward computation), and Resilience is the post-convergence performance metric that quantifies coordination under disruptions.
Although MPGPD-RC exhibits a slightly longer per-episode duration than CGPPO (1.78 s vs. 1.61 s; +10.6%), it requires substantially fewer episodes to converge (2354 vs. 3250; −27.6%), resulting in a shorter total training time (4190 s vs. 5234 s; −19.9%) and a higher final resilience (0.798 vs. 0.777; +2.7%). The marginal per-episode overhead stems from meta-path–guided representation learning (path-specific attention encoders with contrastive regularization) and teacher–student distillation (trajectory encoding), which introduce constant computational work per episode; however, these structural priors and meta-path masks improve sample efficiency by pruning ineffective exploration and sharpening gradient signals, thereby accelerating convergence despite the slightly longer episode time.
Relative to the full MPGPD-RC, the w/o MPA variant exhibits a lower per-episode cost but markedly worse sample efficiency: per-episode time decreases by 7.3% (1.65 s vs. 1.78 s) owing to the removal of the student branch (GRU trajectory encoding, batch-wise contrastive pairing, and KL/PaC backpropagation), yet the number of episodes to convergence increases by 27.2% (2993 vs. 2354), which translates into a 17.9% higher total time (4938 s vs. 4190 s) and a lower converged resilience (0.779 vs. 0.798). This trade-off indicates that the meta-path–aware distillation in MPGPD-RC, despite incurring a modest per-episode overhead, substantially improves sample efficiency by anchoring the student’s action distribution (KL) and preserving trajectory-level meta-path semantics (PaC), thereby reducing the number of required learning episodes and yielding higher final robustness.
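The relative changes quoted in this section follow directly from the raw numbers in Table 4; as a quick arithmetic check:

```python
def rel_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old

per_episode = rel_change(1.78, 1.61)    # MPGPD-RC vs CGPPO, per-episode time
episodes    = rel_change(2354, 3250)    # episodes to convergence
total_time  = rel_change(4190, 5234)    # total wall-clock time
resilience  = rel_change(0.798, 0.777)  # final resilience
```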
6 Conclusion
This work presented MPGPD-RC, a meta-path–guided policy distillation framework for resilient coordination in Autonomous Unmanned Swarms (AUS). The approach couples (i) a meta-path–guided embedding module that encodes typed, ordered, and directed multi-hop relations into semantically coherent node representations, with (ii) a meta-path–aware distillation module that transfers a PPO teacher’s behavior to a lightweight student by jointly aligning action distributions (KL) and trajectory-level structural codes (path-aware contrastive learning). By injecting domain meta-path semantics and employing relaxed masking for exploration, MPGPD-RC integrates local and global relational cues into the decision pipeline, yielding policies that remain effective under severe, structured disruptions. Across sensor-, decision-, influencer-, and mixed-attack scenarios and multiple intensities, the framework consistently surpasses competitive baselines and ablations, demonstrating that preserving meta-path semantics materially improves both robustness and coordination quality. In future work, we will explore adaptive meta-path selection to develop a unified resilience-enhancement architecture transferable across various AUS topologies and tasks. Additionally, we aim to establish industry-standard resilience benchmarks, as there is currently a lack of standardized metrics and data for real-world environments, and evaluate the scalability and real-time feasibility of our approach through cross-validation in physical settings.
References
- 1. Du Z, Luo C, Min G, Wu J, Luo C, Pu J, et al. A survey on autonomous and intelligent swarms of Uncrewed Aerial Vehicles (UAVs). IEEE Trans Intell Transport Syst. 2025;26(10):14477–500.
- 2. Li H, Zhong Y, Zhuang X. A soft resource optimization method based on autonomous coordination of unmanned swarms system driven by resilience. Reliability Engineering & System Safety. 2024;249:110227.
- 3. Al-lQubaydhi N, Alenezi A, Alanazi T, Senyor A, Alanezi N, Alotaibi B, et al. Deep learning for unmanned aerial vehicles detection: a review. Computer Science Review. 2024;51:100614.
- 4. Ersü C, Petlenkov E, Janson K. A systematic review of cutting-edge radar technologies: applications for Unmanned Ground Vehicles (UGVs). Sensors (Basel). 2024;24(23):7807. pmid:39686344
- 5. Wang Y, Shen C, Huang J, Chen H. Model-free adaptive control for unmanned surface vessels: a literature review. Systems Science & Control Engineering. 2024;12(1).
- 6. Uday P, Chandrahasa R, Marais K. System importance measures: definitions and application to system-of-systems analysis. Reliability Engineering & System Safety. 2019;191:106582.
- 7. Wang W, Li X, Yang S, Huang M, Wang T, Duan T. A design method of dynamic adaption mechanism for intelligent multi-unmanned-cluster combat system-of-systems. Syst-Eng-Theory Pract. 2021;41:1096–106.
- 8. Clark B, Patt D, Schramm H. Mosaic warfare: exploiting artificial intelligence and autonomous systems to implement decision-centric operations. Center for Strategic and Budgetary Assessments; 2020.
- 9. Liu J, Wei S, Li B, Wang T, Qi W, Han X. Dual-timescale hierarchical MADDPG for Multi-UAV cooperative search. Journal of King Saud University Computer and Information Sciences. 2025;37(6):1–17.
- 10. Wang K, Liu J, Lin Y, Wang T, Zhang Z, Qi W. ASL-OOD: hierarchical contextual feature fusion with angle-sensitive loss for oriented object detection. Computers, Materials & Continua. 2025;82(2).
- 11. Zhong Y, Li H, Sun Q, Huang Z, Zhang Y. A kill chain optimization method for improving the resilience of unmanned combat system-of-systems. Chaos, Solitons & Fractals. 2024;181:114685.
- 12. Zhu X, Zhu X, Yan R, Peng R. Optimal routing, aborting and hitting strategies of UAVs executing hitting the targets considering the defense range of targets. Reliability Engineering & System Safety. 2021;215:107811.
- 13. Madni AM, Sievers M, Erwin D. Formal, probabilistic modeling in design of resilient systems and system-of-systems. In: AIAA Scitech 2019 Forum; 2019. p. 0223.
- 14. Sun Q, Li H, Zhong Y, Ren K, Zhang Y. Deep reinforcement learning-based resilience enhancement strategy of unmanned weapon system-of-systems under inevitable interferences. Reliability Engineering & System Safety. 2024;242:109749.
- 15. Zhang J, Lv H, Hou J. A novel general model for RAP and RRAP optimization of k-out-of-n:G systems with mixed redundancy strategy. Reliability Engineering & System Safety. 2023;229:108843.
- 16. Levitin G, Xing L, Dai Y. Optimizing partial component activation policy in multi-attempt missions. Reliability Engineering & System Safety. 2023;235:109251.
- 17. Wang K, Xue D, Gou Y, Qi W, Li B, Liu J, et al. Meta-path-guided causal inference for hierarchical feature alignment and policy optimization in enhancing resilience of UWSoS. J Supercomput. 2025;81(2).
- 18. Wang K, Gou Y, Xue D, Liu J, Qi W, Hou G, et al. Resilience augmentation in unmanned weapon systems via multi-layer attention graph convolutional neural networks. Computers, Materials & Continua. 2024;80(2).
- 19. Hulse D, Hoyle C. Understanding resilience optimization architectures: alignment and coupling in multilevel decomposition strategies. Journal of Mechanical Design. 2022;144(11):111704.
- 20. Ayyub BM. Systems resilience for multihazard environments: definition, metrics, and valuation for decision making. Risk Anal. 2014;34(2):340–55. pmid:23875704
- 21. Xu M, Ouyang M, Hong L, Mao Z, Xu X. Resilience-driven repair sequencing decision under uncertainty for critical infrastructure systems. Reliability Engineering & System Safety. 2022;221:108378.
- 22. Peiravi A, Nourelfath M, Zanjani MK. Universal redundancy strategy for system reliability optimization. Reliability Engineering & System Safety. 2022;225:108576.
- 23. Chen Z, Hong D, Cui W, Xue W, Wang Y, Zhong J. Resilience evaluation and optimal design for weapon system of systems with dynamic reconfiguration. Reliability Engineering & System Safety. 2023;237:109409.
- 24. Zhang C, Chen R, Wang S, Dui H, Zhang Y. Resilience efficiency importance measure for the selection of a component maintenance strategy to improve system performance recovery. Reliability Engineering & System Safety. 2022;217:108070.
- 25. Feng Q, Hai X, Sun B, Ren Y, Wang Z, Yang D, et al. Resilience optimization for multi-UAV formation reconfiguration via enhanced pigeon-inspired optimization. Chinese Journal of Aeronautics. 2022;35(1):110–23.
- 26. Tran HT, Domerçant JC, Mavris DN. A network-based cost comparison of resilient and robust system-of-systems. Procedia Computer Science. 2016;95:126–33.
- 27. Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA. Deep reinforcement learning: a brief survey. IEEE Signal Process Mag. 2017;34(6):26–38.
- 28. François-Lavet V, Henderson P, Islam R, Bellemare MG, Pineau J. An introduction to deep reinforcement learning. FNT in Machine Learning. 2018;11(3–4):219–354.
- 29. Zhang Z, Li H, Chen T, Sze NN, Yang W, Zhang Y, et al. Decision-making of autonomous vehicles in interactions with jaywalkers: a risk-aware deep reinforcement learning approach. Accid Anal Prev. 2025;210:107843. pmid:39566327
- 30. Li Y, Ma W, Li Y, Li S, Chen Z, Shahidehpour M. Enhancing cyber-resilience in integrated energy system scheduling with demand response using deep reinforcement learning. Applied Energy. 2025;379:124831.
- 31. Wang J, Wang K, Yao Y, Zhao H. Event detection with multi-order edge-aware graph convolution networks. Data & Knowledge Engineering. 2023;143:102109.
- 32. Du Y, Qi N, Li X, Xiao M, Boulogeorgos A-AA, Tsiftsis TA, et al. Distributed multi-UAV trajectory planning for downlink transmission: a GNN-enhanced DRL approach. IEEE Wireless Commun Lett. 2024;13(12):3578–82.
- 33. Liu Z, Huang L, Gao Z, Luo M, Hosseinalipour S, Dai H. GA-DRL: graph neural network-augmented deep reinforcement learning for DAG task scheduling over dynamic vehicular clouds. IEEE Trans Netw Serv Manage. 2024;21(4):4226–42.
- 34. Gou Y, Wang K, Wei S, Shi C. GMDA: GCN-based multi-modal domain adaptation for real-time disaster detection. Int J Unc Fuzz Knowl Based Syst. 2023;31(06):957–73.
- 35. Peng Y, Liu C, Liu S, Liu Y, Wu Y. SmartTRO: optimizing topology robustness for Internet of Things via deep reinforcement learning with graph convolutional networks. Computer Networks. 2022;218:109385.
- 36. Dong T, Zhuang Z, Qi Q, Wang J, Sun H, Yu FR, et al. Intelligent joint network slicing and routing via GCN-powered multi-task deep reinforcement learning. IEEE Trans Cogn Commun Netw. 2022;8(2):1269–86.
- 37. Huang Z, Mamoulis N. Heterogeneous information network embedding for meta path based proximity. arXiv preprint 2017. arXiv:1701.05291
- 38. Shang J, Qu M, Liu J, Kaplan LM, Han J, Peng J. Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. arXiv preprint 2016. arXiv:1610.09769
- 39. Yu J, Ge Q, Li X, Zhou A. Heterogeneous graph contrastive learning with meta-path contexts and adaptively weighted negative samples. IEEE Trans Knowl Data Eng. 2024;36(10):5181–93.
- 40. Tran HT, Balchanos M, Domerçant JC, Mavris DN. A framework for the quantitative assessment of performance-based system resilience. Reliability Engineering & System Safety. 2017;158:73–84.
- 41. Xue D, Lin Y, Wei S, Zhang Z, Qi W, Liu J, et al. Leveraging hierarchical temporal importance sampling and adaptive noise modulation to enhance resilience in multi-agent task execution systems. Neurocomputing. 2025;637:130134.
- 42. Gou Y, Wei S, Xu K, Liu J, Li K, Li B, et al. Hierarchical reinforcement learning with kill chain-informed multi-objective optimization to enhance resilience in autonomous unmanned swarm. Neural Networks. 2026;195:108255.
- 43. Sun Q, Li H, Wang Y, Zhang Y. Multi-swarm-based cooperative reconfiguration model for resilient unmanned weapon system-of-systems. Reliability Engineering & System Safety. 2022;222:108426.
- 44. Han Z, Wei S, Xue D, Lin Y, Li K, Liu J, et al. MPD-RL: Meta-path-driven reinforcement learning for enhancing resilience in unmanned weapon system-of-systems. J Supercomput. 2025;81(8):1–29.
- 45. Li K, Lin Y, Su X, Liu A, Liu J, Qi W, et al. Unified multi-agent recovery framework via multi-scale diffusion and dependency-aware hierarchical PPO for resilience enhancement. J Big Data. 2025;12(1):226.
- 46. Hagberg A, Conway D. NetworkX: network analysis with Python. 2020. p. 1–48. https://networkx.github.io