
Multi-task meta-initialized DQN for fast adaptation to unseen slicing tasks in O-RAN

  • Bosen Zeng ,

    Roles Formal analysis, Validation, Writing – original draft, Writing – review & editing

    zengbosen19@mails.ucas.edu.cn

    Affiliations Key Laboratory of Interior Layout Optimization and Security, Institutions of Higher Education of Sichuan Province, Chengdu Normal University, Chengdu, Sichuan, China, Sichuan Provincial Key Laboratory of Education Digitalization Development and Evaluation, Chengdu Normal University, Chengdu, Sichuan, China, School of Computer Science, Chengdu Normal University, Chengdu, Sichuan, China

  • Xianhua Niu

    Roles Conceptualization, Methodology

    Affiliation School of Computer and Software Engineering, Xihua University, Chengdu, Sichuan, China

Abstract

The open radio access network (O-RAN) architecture facilitates intelligent radio resource management via RAN intelligent controllers (RICs). Deep reinforcement learning (DRL) algorithms are integrated into RICs to address dynamic O-RAN slicing challenges. However, DRL-based O-RAN slicing suffers from instability and performance degradation when deployed on unseen tasks. We propose M2DQN, a hybrid framework that combines multi-task learning (MTL) and meta-learning to optimize DQN initialization parameters for rapid adaptation. Our method decouples the DQN into two components: shared layers trained via MTL to capture cross-task representations, and task-specific layers optimized through meta-learning for efficient fine-tuning. Experiments in an open-source network slicing environment show that M2DQN outperforms MTL, meta-learning, and policy reuse baselines, improving initial performance across 91 unseen tasks and striking an effective balance between transferability and adaptability. Code is available at: https://github.com/bszeng/M2DQN.

Introduction

The evolution toward 6G networks introduces unprecedented complexity, driven by the coexistence of heterogeneous technologies and multi-band spectrum aggregation [1]. In this paradigm, optimization domains expand while network requirements become stricter, further complicating dynamic radio resource management (RRM) [2]. To address this challenge, open radio access network (O-RAN) architectures have emerged as critical enablers through their openness, which facilitates the real-time data exposure, AI-driven optimization, and closed-loop control essential for adaptive RRM [3].

The O-RAN architecture empowers mobile network operators (MNOs) with greater control over RRM by enabling data-driven closed-loop automation via machine learning (ML) for real-time network optimization and control. This is achieved by decoupling hardware and software through open interfaces [4]. The architecture includes two RAN Intelligent Controllers (RICs) that perform management and control of the network at near-real-time (10 ms to 1 s) and non-real-time (> 1 s) timescales, known as the near-RT RIC and non-RT RIC, respectively [5].

Network slicing addresses 5G/6G’s heterogeneous QoS demands – enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (URLLC) – by logically partitioning physical infrastructure into virtual slices [6]. Each slice enforces strict service-level agreement (SLA) compliance through dynamic resource allocation, adapting to time-varying traffic patterns.

Deep reinforcement learning (DRL) has emerged as a promising approach to address the complexities of O-RAN slicing, where traditional rule-based methods struggle with dynamic traffic patterns and multi-objective SLA constraints [7]. Unlike model-driven strategies, DRL algorithms leverage deep neural networks to adapt to evolving network conditions without requiring explicit formulations of stochastic wireless environments [8]. However, deploying DRL in real-world networks faces critical challenges, particularly in adapting to unseen network conditions. Such changes can destabilize DRL agents and degrade system performance, underscoring the importance of rapid convergence to optimal policies.

This paper addresses the critical challenge of enabling fast adaptation for DRL agents in O-RAN slicing. Our work focuses on mitigating instability and performance degradation, providing MNOs with efficient methods to accelerate DRL convergence and enhance overall network reliability. Our contributions can be summarized as follows:

  • Propose a novel DQN initialization framework: We introduce a hybrid approach, M2DQN, that integrates MTL and meta-learning in non-RT RIC to initialize DQN in near-RT RIC. This method structures the DQN neural network into shared layers (updated through MTL) and task-specific layers (refined via meta-learning).
  • Achieve an effectively balanced integration of MTL and meta-learning: The shared layers capture common representations across multiple tasks, while the task-specific layers leverage these shared features for rapid adaptation. Tuning the split between shared and task-specific layers lets the method exploit the strengths of both approaches.
  • Validate the approach through comprehensive experiments: Extensive evaluations in an open-source network slicing environment demonstrate that our approach outperforms initialization strategies based solely on MTL, meta-learning, or policy reuse. This hybrid approach yields a highly generalizable DQN initialization strategy, improving initial performance and accelerating convergence across all unseen tasks.

The remainder of the paper is organized as follows: The Background section provides an overview of the problem, followed by a discussion of related work in the Literature section. The Problem Formulation section presents the system model. The Multi-task Meta-initialized DQN section describes the proposed hybrid approach and baseline methods. The Results section outlines the experimental setup and analyzes the results. Finally, the Conclusion and Future Work section concludes the paper and discusses future research directions.

Background

DRL for O-RAN slicing

Traditional methods such as queuing theory, Lagrange optimization, genetic algorithms, and heuristic approaches have been adopted for network slicing problems [9]. However, these methods fail to address the evolving requirements of next-generation cellular networks, particularly when handling constrained radio resources, dynamic channel conditions, multi-service interference, and heterogeneous QoS demands.

DRL techniques have recently emerged as a promising alternative for network slicing optimization. Unlike model-driven traditional approaches, DRL operates without a predefined network slicing model, instead learning optimal policies through direct environment interaction and iterative reward maximization [10]. In O-RAN architectures, DRL implementations are deployed across the RIC layers: slicing xApps in the near-RT RIC (to address radio-layer dynamics) and slicing rApps in the non-RT RIC (to orchestrate long-term strategies) [11].

As shown in Fig 1, the DRL framework consists of two primary components: The environment represents the O-RAN system state, while the agent (embodied in slicing xApps/rApps) observes network states and executes actions to maximize cumulative rewards. This closed-loop interaction enables continuous policy refinement, allowing DRL agents to dynamically adapt to fluctuating network conditions. The model-free nature of DRL eliminates dependency on analytical system models, making it particularly suited to managing the complex, non-stationary resource allocation challenges in next-generation cellular networks.

Fig 1. Architectural framework of DRL-based network slicing xApp integration within the O-RAN environment.

Blocks of different colors represent PRBs allocated to different slices, with allocations updated at each transmission time interval (TTI).

https://doi.org/10.1371/journal.pone.0330226.g001

Multi-task learning and meta-learning

Multi-task learning (MTL) demonstrates superior efficiency and effectiveness in learning shared representations by jointly utilizing training data from related tasks. Modern meta-learning extends this concept, using shared representations not only for joint training but also for rapid adaptation to unseen tasks with minimal data at inference time.

Multi-task Learning. MTL is an inductive transfer learning approach that harnesses the domain information from related tasks as an inductive bias to improve generalization across all tasks [12]. Its core premise is that knowledge transfer between related tasks enhances model robustness beyond single-task learning [13]. By training tasks concurrently via parameter sharing, MTL implicitly encodes task correlations, a mechanism termed inductive transfer in [14].

Formally, consider N related tasks {T_1, …, T_N}, where each task T_i has a training dataset D_i. MTL jointly optimizes shared parameters θ by minimizing:

θ* = arg min_θ Σ_{i=1}^{N} L_i(θ; D_i),   (1)

where L_i denotes the task-specific loss of T_i. The aggregated loss promotes cross-task feature learning, yielding optimal parameters θ* that generalize effectively to unseen tasks.
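The joint minimization in Eq (1) can be sketched in a few lines; the quadratic per-task losses below are purely illustrative stand-ins for the real task losses L_i:

```python
import numpy as np

def mtl_step(theta, task_targets, lr=0.1):
    """One MTL update: descend the aggregated loss sum_i L_i(theta),
    here with illustrative quadratic losses L_i = ||theta - t_i||^2."""
    grad = sum(2.0 * (theta - t) for t in task_targets)  # sum of per-task gradients
    return theta - lr * grad

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
targets = [np.zeros(3), np.ones(3)]  # two related "tasks"
for _ in range(200):
    theta = mtl_step(theta, targets)
# The aggregated loss is minimized at the mean of the task optima.
print(np.round(theta, 3))  # approx [0.5, 0.5, 0.5]
```

Because every task's gradient flows into the same θ, the solution settles where the tasks' objectives are jointly satisfied, which is the transfer effect MTL relies on.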

Meta-Learning. Meta-learning ("learning to learn") [15] optimizes models for rapid adaptation to new tasks with limited data. Given M source tasks {T_1, …, T_M} drawn from a task distribution p(T), meta-learning solves a bi-level optimization:

  • Inner Loop. The inner learning algorithm solves a specific task T_i defined by its training dataset D_i^train and objective function L_task. During task-specific learning, the optimal task-specific parameters θ_i* are derived using the meta parameters ϕ as prior knowledge:

    θ_i*(ϕ) = arg min_θ L_task(θ; ϕ, D_i^train).   (2)

  • Outer Loop. The outer (meta) algorithm refines the inner learning process by optimizing the meta parameters ϕ. This optimization relies on held-out validation data D_i^val, ensuring the model generalizes well to new tasks. The meta parameters ϕ are updated to minimize the meta loss L_meta given the optimal task-specific parameters θ_i*(ϕ):

    ϕ* = arg min_ϕ Σ_{i=1}^{M} L_meta(θ_i*(ϕ); D_i^val).   (3)

    This process encodes knowledge from the source tasks into the meta-parameters ϕ, enabling effective adaptation to new tasks. Here, ϕ represents the meta-parameters (also termed meta-knowledge), which encapsulate shared cross-task information that facilitates solving new tasks.
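As a concrete sketch of this bi-level scheme, here is a minimal Reptile-style loop (the first-order variant used later in this paper) on illustrative quadratic task losses; all targets, rates, and step counts are hypothetical:

```python
import numpy as np

def inner_adapt(phi, task_target, steps=20, lr=0.1):
    """Inner loop (Eq 2): task-specific parameters start from the meta
    parameters phi and descend an illustrative quadratic task loss."""
    theta = phi.copy()
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - task_target)
    return theta

def outer_step(phi, task_targets, meta_lr=0.5):
    """Outer loop (Eq 3), Reptile-style: move phi toward the adapted
    task-specific parameters instead of differentiating through them."""
    for t in task_targets:
        theta_star = inner_adapt(phi, t)
        phi = phi + meta_lr * (theta_star - phi)
    return phi

phi = np.array([5.0, -5.0])
tasks = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two source tasks
for _ in range(50):
    phi = outer_step(phi, tasks)
# phi settles between the task optima, a good starting point for both tasks.
```

The outer loop never needs second-order gradients: it simply pulls ϕ toward whatever the inner loop found, which is why Reptile-style meta-learning is cheap enough to run per source task.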

Synthesis for DRL Initialization. While MTL and meta-learning differ mechanistically – MTL optimizes concurrent tasks versus meta-learning’s bi-level adaptation – both exploit task correlations to enhance generalization. Their outcomes (θ* for MTL and ϕ* for meta-learning) provide principled initialization schemes for DRL: the MTL-derived θ* encodes domain-invariant representations, while the meta-trained ϕ* preserves gradient pathways for rapid fine-tuning. By combining these strengths, DRL agents achieve accelerated convergence in dynamic network slicing scenarios.

Literature

DRL-based applications to cellular network slicing have established foundational methodologies for dynamic resource orchestration under stochastic wireless environments. While existing works predominantly focus on traditional RAN architectures, the core theoretical frameworks exhibit significant applicability to O-RAN slicing scenarios [16].

DQN, a model-free DRL algorithm, has been successfully applied to network slicing problems, though it often requires extensive training [17]. Recent advancements, such as a federated-DQN slicing approach that offloads dynamic O-RAN disaggregation to edge nodes, enable localized data processing and faster decision-making [18]. A DQN-based O-RAN slicing approach learns control policies under varying SLAs with heterogeneous minimum performance requirements [19]. Mhatre et al. propose a DQN-based strategy for QoS-aware intra-slice resource allocation, optimizing for eMBB and URLLC slices [20]. An adaptive standardized protocol introduces DQN to address inter-slice resource contention and conflicts in network slicing [21].

However, DQN-based solutions suffer from instability and performance degradation when deployed on unseen tasks. While architectural refinements have improved convergence [22], enhancing adaptability to unseen tasks requires knowledge transfer strategies such as transfer learning (TL), MTL, and meta-learning [23].

TL accelerates DRL adaptation by reusing knowledge from related source tasks [24]. Nagib et al. demonstrated policy transfer via initializing new agents with pre-trained policies [25]. Despite its benefits, TL requires mechanisms to evaluate task transferability and mitigate negative transfer caused by domain mismatches [26]. These limitations highlight the need for more robust cross-task generalization frameworks.

Multi-task learning (MTL) addresses these challenges by sharing knowledge across tasks during joint training. Liu et al. [27] proposed an MTL-based resource allocation algorithm for multi-objective optimization, lowering computational costs. Lei et al. [28] developed a multi-task DRL framework to exploit task commonalities and differences. Dong et al. [29] further applied MTL to simplify action spaces in fine-grained network slicing, and Gracla et al. [30] enhanced robustness against data distribution shifts using MTL-based DRL.

In parallel, meta-learning focuses on rapid adaptation for unseen tasks with minimal data. Yuan et al. [31] designed a meta-DRL algorithm for dynamic V2X resource allocation, and later Yuan et al. [32] presented a meta-DRL algorithm to adapt quickly to dynamic environmental changes in wireless networks. Hu et al. [33] demonstrated meta-learning’s value in drone base station control, accelerating DRL convergence for optimal coverage.

While MTL improves training efficiency across heterogeneous tasks and meta-learning enables fast adaptation, their integration remains underexplored [34]. Theoretical studies suggest that combining these paradigms could reduce training steps while boosting performance on unseen tasks [35]. Despite these advantages, existing efforts offer limited insight into combining MTL and meta-learning within DRL, particularly for accelerating adaptation to unseen O-RAN slicing tasks.

To bridge this gap, we propose M2DQN – a novel DQN initialization paradigm that hierarchically combines MTL (for shared layers) and meta-learning (for task-specific layers). By optimizing neural network parameters during pretraining, M2DQN enhances adaptability to deployment-specific unseen scenarios.

Problem formulation

System model

In network slicing optimization, MNOs dynamically adjust SLA priorities by tuning KPI weights in the DRL reward function per network slice [25]. This paper specifically addresses downlink-oriented network slicing scenarios, focusing on flexible allocation of limited physical resource blocks (PRBs) while maintaining target thresholds for spectral efficiency, latency, and quality of experience.

Following the formulation in [36], we consider S network slices sharing a total bandwidth B, with the PRB allocation represented by a vector x = [x_1, …, x_S]. The O-RAN slicing controller selects an allocation configuration x(a) from X possible configurations (a ∈ {1, …, X}), where each selection significantly impacts system performance.
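To see why the configuration space X grows quickly, a back-of-the-envelope count (the PRB totals below are hypothetical, not the paper's settings): splitting B PRBs among S slices is a stars-and-bars problem.

```python
from math import comb

def num_allocations(total_prbs, num_slices):
    """Number of ways to split `total_prbs` PRBs among `num_slices` slices
    (stars and bars): C(B + S - 1, S - 1)."""
    return comb(total_prbs + num_slices - 1, num_slices - 1)

# Even modest settings yield a large configuration space X, which is why
# exhaustively evaluating every allocation is impractical.
print(num_allocations(10, 3))   # 66 configurations for 10 PRBs, 3 slices
print(num_allocations(100, 3))  # 5151
```

The count grows polynomially in B and combinatorially in S, so a learned controller that explores the space selectively is far cheaper than enumeration.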

System performance is quantified through aggregated slice latency metrics:

P(t) = Σ_{s=1}^{S} α_s L_s(o(t)),   (4)

where L_s denotes the inverse latency measure of slice s, α_s represents a slice-specific latency weighting coefficient, and o(t) captures the time-varying system state at time t. The system state o(t) evolves under the influence of dynamic network parameters including traffic load, channel quality, and external environmental disturbances. Many of these parameters exhibit non-stationary behavior that resists analytical modeling, particularly at sub-second timescales. To address this complexity, the O-RAN slicing controller explores various slice allocation configurations and evaluates their corresponding impacts on system performance. This process continues until it discovers the optimal configuration that maximizes overall system performance.

Deep Q-network

Deep Q-Network (DQN), a model-free DRL algorithm, has been widely adopted for network slicing [7]. By integrating neural networks into the Q-learning framework, DQN approximates the action-value function to enable data-driven decision-making. The learning process is structured around four core components:

  • States. The state o (illustrated in Fig 1), observed by the DRL agent (the slicing xApp), integrates dynamic network parameters including per-slice traffic load metrics, channel quality, and other external variables affecting the performance of the O-RAN system.
  • Actions. At each slicing window onset, the agent selects an action a ∈ A based on the observed state, where A represents the feasible action space with cardinality |A| = X. The selected action specifies the resource allocation across slices, formulated as a bandwidth proportion vector:

    a = [b_1, …, b_S], with Σ_{s=1}^{S} b_s = 1,   (5)

    where b_s is the fraction of the total bandwidth B allocated to slice s.
  • Rewards. The network generates a reward signal after the agent executes action a in state o, triggering a transition to the next state o′. Network engineers design this reward function as a weighted combination of KPIs to guide policy optimization. Specifically, we formulate the DRL reward R using parametric sigmoid transformations of slice latencies:

    R = Σ_{s=1}^{S} w_s / (1 + e^{c1 (l_s − c2)}),   (6)

    Here, s indexes slices, l_s represents the latency of slice s, and w_s denotes the importance of latency for that slice. The parameters c1 and c2 are configured per slice type to shape the sigmoid, thereby determining when and how latency violations penalize the agent’s actions. Parameter c1 controls the steepness of the sigmoid, i.e., how abruptly penalties set in, while c2 is the inflection point, the minimum acceptable latency performance for a slice as determined by its SLA. Distinct but constant c1 and c2 values are assigned to each slice type to account for its unique requirements.
  • Q-function. The action-value function, or Q-function, uses a neural network to estimate the Q-value Q(o, a), the expected cumulative reward obtained after taking action a in state o.
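A minimal sketch of the sigmoid-shaped reward described above, assuming the common form w_s / (1 + e^{c1 (l_s − c2)}); all weights, latencies, and c1/c2 values below are hypothetical:

```python
import math

def slice_reward(latencies, weights, c1, c2):
    """Weighted sigmoid reward: each slice contributes w_s scaled by a
    sigmoid of its latency l_s; c2 is the inflection point (SLA threshold)
    and c1 controls how sharply violations are penalized."""
    return sum(
        w / (1.0 + math.exp(c1[s] * (latencies[s] - c2[s])))
        for s, w in enumerate(weights)
    )

# Hypothetical Video-heavy weighting: meeting latency targets scores high...
good = slice_reward([2.0, 5.0, 8.0], [0.1, 0.1, 0.8], c1=[1, 1, 1], c2=[5, 10, 15])
# ...while violating the third slice's SLA collapses its contribution.
bad = slice_reward([2.0, 5.0, 30.0], [0.1, 0.1, 0.8], c1=[1, 1, 1], c2=[5, 10, 15])
assert good > bad
```

Because each slice's contribution saturates near w_s when its SLA is met and near zero when badly violated, the weights w_s directly encode the operator's slice priorities.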

At each slicing window initialization, the DRL agent decides the PRB allocation a based on the observed system state o, aiming to maximize long-term expected rewards through dynamic resource optimization. Formally:

π* = arg max_π E[ Σ_{t=0}^{∞} γ^t R(o_t, π(o_t)) ],   (7)

where π denotes a policy function that maps the system state o to an action a within the feasible action space A, γ is the discount factor, and E[·] is the expectation operator. The primary challenge in solving Eq (7) lies in handling time-varying demands, traffic model variations, and fluctuating user numbers across the different service types. While an exhaustive search for the optimal solution is theoretically possible, it is computationally infeasible. DRL therefore provides a practical approach to this dynamic allocation problem.
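The objective above weighs earlier rewards more heavily; a tiny sketch of the discounted return makes the point and shows why fast adaptation to a new task matters:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative reward sum_t gamma^t * r_t that the policy
    seeks to maximize in expectation."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A policy that earns rewards sooner scores higher under discounting:
# slow convergence on an unseen task directly costs discounted return.
early = discounted_return([1.0, 1.0, 0.0, 0.0])
late = discounted_return([0.0, 0.0, 1.0, 1.0])
assert early > late
```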

Multi-task meta-initialized DQN

This section investigates a DRL environment designed for O-RAN slicing adaptation. In this scenario, source network slicing tasks serve as training environments where the DRL agent acquires transferable policy knowledge. Upon completing training, the agent deploys to handle unseen slicing tasks characterized by novel configurations in the DRL reward function. Specifically, each task is uniquely defined through weighting coefficients assigned to slice-specific KPIs. When MNOs introduce target tasks with previously unseen reward function weights, the pre-trained agent demonstrates rapid policy convergence, enabling efficient adaptation to the new network conditions.

As depicted in Fig 2, the M2DQN architecture operates across O-RAN’s RIC components. The non-RT RIC hosts M2DQN (blue module) as an rApp that learns DQN initialization parameters by training on historical data collected via the O1 interface. The near-RT RIC deploys a DRL-based slicing xApp (the "learner agent"), which loads the M2DQN-trained parameters through the A1 interface and executes real-time control via the E2 interface, as shown in the figure’s dataflow.

Fig 2. M2DQN workflow in RICs.

M2DQN operates as a trained non-RT RIC rApp. The learner agent in the near-RT RIC downloads initialization parameters from M2DQN when encountering new slicing tasks.

https://doi.org/10.1371/journal.pone.0330226.g002

Fig 3 depicts the principal structure of M2DQN with two core components: shared layers and task-specific layers. The architecture employs a bi-level optimization strategy: shared layers (blue dashed arrows) are trained via MTL to capture cross-task representations, while task-specific layers (purple solid arrows) are updated via meta-learning to adapt the shared features. During training, both layers are jointly updated through the solid-line connections, with additional task-specific parameter refinements applied after each task to correct initialization deviations. For adaptation, the trained parameters of both the shared and task-specific layers are duplicated and deployed to unseen slicing tasks.

Fig 3. Architecture of M2DQN.

The shared layer parameters θ_sh are continuously updated across all source tasks. For each source task, the task-specific layer parameters θ_ts′ are individually optimized starting from the trained parameters θ_ts. During adaptation, the pre-trained θ_sh* and θ_ts* initialize the shared and task-specific layers, respectively, enabling rapid adaptation to unseen tasks.

https://doi.org/10.1371/journal.pone.0330226.g003

Algorithm 1 Multi-task meta-initialized DQN.

Initialize the parameters θ_sh (shared layers) and θ_ts (task-specific layers).
Initialize target networks: θ_sh⁻ ← θ_sh and θ_ts⁻ ← θ_ts.
Initialize temporary parameters θ_ts′ and target θ_ts′⁻.

1: for each source slicing task T_i do
2:   θ_ts′ ← θ_ts, θ_ts′⁻ ← θ_ts⁻.
3:   for each episode do
4:    for step t to the end of the episode do
5:     Observe state o_t.
6:     Select action b_t via the ε-greedy policy.
7:     Execute b_t, obtain reward r_t and next state o_{t+1}.
8:     Store transition (o_t, b_t, r_t, o_{t+1}) in buffer D.
9:     Sample a mini-batch of transitions from D.
10:     Compute the temporal-difference loss via Eq (8).
11:     Update θ_sh, θ_ts′ via Eq (9).
12:     if t mod C = 0 then
13:      Sync target networks: θ_sh⁻ ← θ_sh and θ_ts′⁻ ← θ_ts′.
14:     end if
15:    end for
16:   end for
17:   Update θ_ts via Eq (10).
18: end for
19: Output θ_sh* ← θ_sh and θ_ts* ← θ_ts.

Algorithm 1 provides the M2DQN framework’s pseudocode. The shared layer parameters θ_sh and task-specific layer parameters θ_ts are initialized randomly, along with the target networks θ_sh⁻ and θ_ts⁻. At the start of each source task, the temporary parameters θ_ts′ are cloned from θ_ts and updated throughout the task. The agent selects an action b_t at step t using the ε-greedy strategy: a random action is explored with probability ε, and the action with the highest estimated Q-value is chosen with probability 1 − ε. The agent then interacts with the O-RAN slicing environment by executing action b_t and observing the resulting next state o_{t+1} and reward r_t. This interaction generates the experience (o_t, b_t, r_t, o_{t+1}), which is stored in the replay buffer D.
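The ε-greedy selection in step 6 of Algorithm 1 can be sketched as follows (the Q-values below are hypothetical):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Select a random action with probability epsilon; otherwise pick the
    action with the highest estimated Q-value (probability 1 - epsilon)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

random.seed(0)
q = [0.2, 0.9, 0.1]  # hypothetical Q(o, ·) over three PRB allocations
picks = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
# The greedy action dominates, while exploration occasionally tries others.
assert picks.count(1) > 850
```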

During each step, a mini-batch of transitions is sampled from D to minimize the loss function L(θ):

L(θ) = E_{(o,b,r,o′)∼D} [ ( r + γ max_{b′} Q(o′, b′; θ⁻) − Q(o, b; θ) )² ],   (8)

where γ ∈ [0, 1) is the discount factor. The parameters θ = (θ_sh, θ_ts′) define the DQN network, while θ⁻ = (θ_sh⁻, θ_ts′⁻) represent the target network parameters. θ is updated via gradient descent:

θ ← θ − η ∇_θ L(θ),   (9)

where η is the learning rate.

Every C steps (the target-network update period), θ_sh⁻ and θ_ts′⁻ are synchronized with θ_sh and θ_ts′, respectively. Upon task completion, the trained temporary task-specific parameters θ_ts′ update the task-specific parameters θ_ts via Eq (10) in a Reptile-style meta-learning framework [37]:

θ_ts ← θ_ts + λ (θ_ts′ − θ_ts),   (10)

where λ is the meta-learning rate. Upon completing all episodes of a task, θ_ts is updated via Eq (10) before transitioning to the next task. At the initiation of a new task, the temporary parameters θ_ts′ are cloned from θ_ts. The shared parameters θ_sh undergo continuous gradient-based updates at every step, whereas θ_ts is updated once per task. This hybrid strategy integrates MTL and meta-learning: MTL for θ_sh (continuous per-step updates) and meta-learning for θ_ts (per-task adjustments). The trained parameters θ_sh and θ_ts are stored as θ_sh* and θ_ts*, respectively. When deployed to new tasks, θ_sh* and θ_ts* initialize the learner agent’s networks, and the agent then updates the network parameters by following the standard DQN process.
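The full update schedule, per-step MTL updates for the shared parameters plus one Reptile move per task for the task-specific parameters, can be sketched on toy quadratic losses (all optima, rates, and step counts below are illustrative stand-ins, not the paper's DQN training):

```python
import numpy as np

def train_m2dqn(tasks, steps_per_task=100, lr=0.05, meta_lr=0.5, seed=0):
    """Schematic M2DQN schedule on toy quadratic losses: shared parameters
    are updated at every step across all tasks (MTL), while task-specific
    parameters get one Reptile move per task (Eq 10)."""
    rng = np.random.default_rng(seed)
    theta_sh = rng.normal(size=2)  # shared layers
    theta_ts = rng.normal(size=2)  # task-specific layers (meta parameters)
    for shared_opt, specific_opt in tasks:
        theta_tmp = theta_ts.copy()  # clone temporary parameters at task start
        for _ in range(steps_per_task):
            theta_sh -= lr * 2.0 * (theta_sh - shared_opt)    # per-step MTL update
            theta_tmp -= lr * 2.0 * (theta_tmp - specific_opt)  # inner-loop update
        theta_ts += meta_lr * (theta_tmp - theta_ts)  # per-task Reptile step
    return theta_sh, theta_ts

# Two toy source tasks that agree on the shared optimum but differ in
# their task-specific optima.
tasks = [(np.array([1.0, 1.0]), np.array([2.0, 0.0])),
         (np.array([1.0, 1.0]), np.array([0.0, 2.0]))]
theta_sh, theta_ts = train_m2dqn(tasks)
```

The shared parameters converge to the common optimum because they see every gradient step of every task, while the task-specific parameters only drift toward a compromise between the task optima, exactly the asymmetry the hybrid strategy exploits.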

Results

We conduct simulations in an open-source network slicing environment supporting three types of services: VoLTE, URLLC, and video. User requests are generated following the statistical distributions defined in Table 1. To evaluate the generalizability of DRL initialization algorithms, we train these models on a limited set of network slicing scenarios. The trained models are subsequently deployed and tested on unseen scenarios. Various network slicing scenarios are generated by modifying the reward function weights within the environment.

Following the weight combination settings from [25], we employ 16 source tasks for model training and 91 unseen weight configurations (excluded during training) to assess adaptability, as detailed in Table 2. Each combination reflects the emphasis placed on different slices: for example, [0.1, 0.1, 0.8] is heavily biased towards Video, [0.1, 0.4, 0.5] indicates dominance of Video and URLLC, and [0.33, 0.33, 0.33] represents fully balanced priorities. MNOs can dynamically adjust weight combinations according to real-time traffic patterns [38].

We adopt three DQN variants for policy initialization: Multi-task DQN, Meta DQN, and M2DQN. As detailed in Table 3, all approaches share a three-layer neural architecture with (32, 32, 15) neurons per layer but diverge in their update mechanisms:

  • Multi-task DQN: Optimizes all layers through MTL.
  • Meta DQN: Updates layers purely via meta-learning.
  • M2DQN: Hierarchically integrates MTL for shared layers and meta-learning for task-specific layers.

The specific network slicing parameters are provided in Table 4.

To evaluate adaptability, we pre-train 19 agents using distinct weight configurations from Table 2: 16 single-task agents (one per source task, used for policy reuse) and 3 agents for the multi-task approaches. These agents are applied to 91 unseen tasks through policy transfer, yielding 1,729 adaptation experiments. All algorithms and configurations are implemented and publicly available at https://github.com/bszeng/M2DQN.

We evaluate the performance gains of the accelerated DQN algorithms – policy reuse [25], Multi-task DQN, Meta-DQN, and M2DQN – for initializing agents across the 91 unseen tasks. Policy reuse trains an agent via DQN on a single source task, while Multi-task DQN, Meta-DQN, and M2DQN leverage all 16 source tasks. Statistical improvements in reward after the first episode are summarized in Table 5.

M2DQN consistently outperforms all baselines, achieving improvements across all 91 unseen tasks. Multi-task DQN ranks second but fails to improve performance in three tasks. Notably, among the 16 policy reuse configurations, weight settings [0.8,0.1,0.1] and [0.4,0.2,0.4] deliver the best performance, enhancing initial performance in 79 out of 91 tasks. These results validate M2DQN’s superior adaptability and generalization.

The findings highlight two critical advantages of M2DQN and Multi-task DQN. First, both algorithms learn transferable initialization parameters that enhance adaptability to dynamic slicing conditions. Second, unlike policy reuse, they remove the need for exhaustive source-target task relationship analysis, thereby lowering the deployment complexity of the DQN algorithm.

The convergence behavior of average rewards for unseen tasks is depicted in Fig 4 when trained agents are used to initialize new agents. The solid red line in the figure represents the average reward across 91 unseen tasks without initialization using trained agents. Each of the other lines represents the average reward across 91 unseen tasks initialized with one trained agent. Six accelerated DQN algorithms, including M2DQN, Multi-task DQN, and four TL-based algorithms, exhibit improved performance in the early stages of learning.

Fig 4. Cumulative reward.

Accelerated algorithms may initially underperform baseline methods during early learning stages, but demonstrate superior convergence performance.

https://doi.org/10.1371/journal.pone.0330226.g004

For a detailed examination of these six enhanced agents, Fig 5 plots the cumulative distribution function of the first-episode reward gain. The horizontal axis quantifies the reward gain relative to the non-accelerated algorithms, while the vertical axis indicates the number of unseen tasks achieving each reward gain level.

Fig 5. Initial reward gain comparison.

A positive value indicates that the reward achieved in the first episode by the accelerated algorithm exceeds that of the non-accelerated algorithm. A larger value signifies greater improvement, whereas a negative value indicates a decline.

https://doi.org/10.1371/journal.pone.0330226.g005

An analysis of the two graphs in Fig 5 reveals that the initial rewards of M2DQN (represented by the black dashed line with circle markers) are concentrated between 5 and 20. This clustering indicates that M2DQN effectively learns initialization parameters with strong generalization capabilities for DQN-based systems. Although M2DQN does not always achieve the highest initial performance on unseen tasks, its performance shows no degradation – a critical requirement for MNOs. Overall, M2DQN achieves an effective balance between adaptation capability and task performance.

Conclusion and future work

This paper presents M2DQN – a hybrid DRL initialization framework that synergizes multi-task learning and meta-learning to enhance DQN adaptability in O-RAN slicing. Deployed in the non-RT RIC, M2DQN employs shared layers to extract cross-task representations from 16 source tasks, while its task-specific layers enable rapid fine-tuning via meta-learning. Experimental results in an open-source network slicing environment demonstrate that M2DQN effectively learns transferable initialization parameters from 16 source tasks and achieves performance improvements across all 91 unseen tasks. This work advances DQN generalizability by addressing the critical challenge of fast adaptation in dynamic network slicing scenarios.

Moving forward, our research will focus on two synergistic directions to enhance dynamic network slicing adaptability: 1) Intelligent Task Selection, which improves training efficiency through posterior and diversity-based task sampling; and 2) Regularization-Enhanced Generalization, which integrates adaptive dropout mechanisms to mitigate overfitting risks. By coupling these directions, the dual-pronged approach aims to simultaneously reduce adaptation latency and suppress performance degradation caused by overfitting.

References

  1. Giordani M, Polese M, Mezzavilla M, Rangan S, Zorzi M. Toward 6G networks: Use cases and technologies. IEEE Commun Mag. 2020;58(3):55–61.
  2. Calabrese FD, Wang L, Ghadimi E, Peters G, Hanzo L, Soldati P. Learning radio resource management in RANs: Framework, opportunities, and challenges. IEEE Commun Mag. 2018;56(9):138–45.
  3. Challita U, Ryden H, Tullberg H. When machine learning meets wireless cellular networks: Deployment, challenges, and applications. IEEE Commun Mag. 2020;58(6):12–8.
  4. Polese M, Bonati L, D’Oro S, Basagni S, Melodia T. Understanding O-RAN: Architecture, interfaces, algorithms, security, and research challenges. IEEE Commun Surv Tutorials. 2023;25(2):1376–411.
  5. Bonati L, D’Oro S, Polese M, Basagni S, Melodia T. Intelligence and learning in O-RAN for data-driven NextG cellular networks. IEEE Commun Mag. 2021;59(10):21–7.
  6. Bakri S, Frangoudis PA, Ksentini A, Bouaziz M. Data-driven RAN slicing mechanisms for 5G and beyond. IEEE Trans Netw Serv Manage. 2021;18(4):4654–68.
  7. Zangooei M, Saha N, Golkarifard M, Boutaba R. Reinforcement learning for radio resource management in RAN slicing: A survey. IEEE Commun Mag. 2023;61(2):118–24.
  8. Luong NC, Hoang DT, Gong S, Niyato D, Wang P, Liang Y-C, et al. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Commun Surv Tutorials. 2019;21(4):3133–74.
  9. Alcaraz JJ, Losilla F, Zanella A, Zorzi M. Model-based reinforcement learning with kernels for resource allocation in RAN slices. IEEE Trans Wireless Commun. 2023;22(1):486–501.
  10. Hua Y, Li R, Zhao Z, Chen X, Zhang H. GAN-powered deep distributional reinforcement learning for resource management in network slicing. IEEE J Select Areas Commun. 2020;38(2):334–49.
  11. O-RAN Working Group 1. O-RAN architecture description; technical specification; 2025. https://specifications.o-ran.org/download?id=789
  12. Caruana R. Multitask learning. Mach Learn. 1997;28(1):41–75.
  13. Maurer A, Pontil M, Romera-Paredes B. The benefit of multitask representation learning. J Mach Learn Res. 2016;17(81):1–36.
  14. Vilalta R, Giraud-Carrier C, Brazdil P, Soares C. Inductive transfer. Boston, MA: Springer US; 2010.
  15. Thrun S, Pratt L. Learning to learn. Boston, MA: Kluwer Academic Publishers; 1998.
  16. Alam K, et al. A comprehensive tutorial and survey of O-RAN: Exploring slicing-aware architecture, deployment options, use cases, and challenges. arXiv preprint; 2024.
  17. Li R, Zhao Z, Sun Q, I C-L, Yang C, Chen X, et al. Deep reinforcement learning for resource management in network slicing. IEEE Access. 2018;6:74429–41.
  18. Amiri E, Wang N, Shojafar M, Tafazolli R. Edge-AI empowered dynamic VNF splitting in O-RAN slicing: A federated DRL approach. IEEE Commun Lett. 2024;28(2):318–22.
  19. Raftopoulos R, D’Oro S, Melodia T, Schembra G. DRL-based latency-aware network slicing in O-RAN with time-varying SLAs. In: Proc. IEEE Int. Conf. Comput. Networking Communs; 2024.
  20. Mhatre S, Adelantado F, Ramantas K, Verikoukis C. Intelligent QoS-aware slice resource allocation with user association parameterization for beyond 5G O-RAN-based architecture using DRL. IEEE Trans Veh Technol. 2025;74(2):3096–109.
  21. 21. Rezazadeh F, Chergui H, Siddiqui S, Mangues J, Song H, Saad W, et al. Intelligible protocol learning for resource allocation in 6G O-RAN slicing. IEEE Wireless Commun. 2024;31(5):192–9.
  22. 22. Zeng B, Zhong Y, Niu X. Efficient exploration through bootstrapped and Bayesian deep Q-networks for joint power control and beamforming in mmWave networks. IEEE Commun Lett. 2023;27(2):566–70.
  23. 23. Upadhyay R, Phlypo R, Saini R, Liwicki M. Sharing-to-learn and learning-to-share; fitting together meta, multi-task, and transfer learning: A meta review. arXiv preprint. 2021.
  24. 24. Nagib AM, Abou-Zeid H, Hassanein HS. Safe and accelerated deep reinforcement learning-based O-RAN slicing: A hybrid transfer learning approach. IEEE J Select Areas Commun. 2024;42(2):310–25.
  25. 25. Nagib AM, Abou-zeid H, Hassanein HS. Toward safe and accelerated deep reinforcement learning for next-generation wireless networks. IEEE Netw. 2023;37(2):182–9.
  26. 26. Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, et al. A comprehensive survey on transfer learning. Proc IEEE. 2021;109(1):43–76.
  27. 27. Liu X, Zhang H, Ren C, Li H, Sun C, Leung VCM. Multi-task learning resource allocation in federated integrated sensing and communication networks. IEEE Trans Wireless Commun. 2024;23(9):11612–23.
  28. 28. Lei K, Liang Y, Li W. Congestion control in SDN-based networks via multi-task deep reinforcement learning. IEEE Netw. 2020;34(4):28–34.
  29. 29. Dong T, Zhuang Z, Qi Q, Wang J, Sun H, Yu FR, et al. Intelligent joint network slicing and routing via GCN-powered multi-task deep reinforcement learning. IEEE Trans Cogn Commun Netw. 2022;8(2):1269–86.
  30. 30. Gracla S, Bockelmann C, Dekorsy A. A multi-task approach to robust deep reinforcement learning for resource allocation; 2023. https://arxiv.org/abs/2304.12660
  31. 31. Yuan Y, Zheng G, Wong K-K, Letaief KB. Meta-reinforcement learning based resource allocation for dynamic V2X communications. IEEE Trans Veh Technol. 2021;70(9):8964–77.
  32. 32. Yuan Y, Lei L, Vu TX, Chang Z, Chatzinotas S, Sun S. Adapting to dynamic LEO-B5G systems: Meta-critic learning based efficient resource scheduling. IEEE Trans Wireless Commun. 2022;21(11):9582–95.
  33. 33. Hu Y, Chen M, Saad W, Poor HV, Cui S. Distributed multi-agent meta learning for trajectory design in wireless drone networks. IEEE J Select Areas Commun. 2021;39(10):3177–92.
  34. 34. Wang H, Zhao H, Li B. Bridging multi-task learning and meta-learning: Towards efficient training and effective adaptation. In: Proceedings of the international conference on machine learning; 2021. p. 10991–1002.
  35. 35. Upadhyay R, Chhipa PC, Phlypo R, Saini R, Liwicki M. Multi-task meta learning: Learn how to adapt to unseen tasks. In: Proc. IEEE Int. Joint Conf. Neural Netw.; 2023. p. 1–10.
  36. 36. Maggi L, Valcarce A, Hoydis J. Bayesian optimization for radio resource management: Open loop power control. IEEE J Select Areas Commun. 2021;39(7):1858–71.
  37. 37. Nichol A, Achiam J, Schulman J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999:2018.
  38. 38. O-RAN Working Group 1. O-RAN slicing architecture; technical specification; 2025. https://specifications.o-ran.org/download?id=793