
DualMask: Federated optimization of privacy-utility-efficiency trilemma via orthogonal gradient perturbation and RL-optimized PSO

  • Weibai Zhou ,

    Roles Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    20011015@gcc.edu.cn

    Affiliation School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou, Guangdong, China

  • Changlong Li,

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – review & editing

    Affiliation School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou, Guangdong, China

  • Rong Li,

    Roles Supervision, Validation, Writing – review & editing

    Affiliation School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou, Guangdong, China

  • Dan Huang

    Roles Writing – review & editing

    Affiliation School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou, Guangdong, China

Abstract

Federated learning faces a fundamental privacy-utility-communication trilemma, and existing static defense mechanisms suffer from rigid adaptation and poor multidimensional coordination, leaving a critical gap in dynamic trade-off balancing. To address this, we propose DualMask, a cooperative optimization framework that integrates a client-side Adaptive Orthogonal Noise Canceler (AONC) with server-side Distributed Dueling Double Deep Q-Network (D3QN) scheduling and Particle Swarm Optimization (PSO)-based aggregation. The AONC module implements a triple-defense mechanism via orthogonal subspace projection: (1) layer-wise adaptive EMA-quantile clipping to mitigate threshold imbalance, (2) progress-aware noise decay that balances early-stage privacy with late-stage efficiency, and (3) directional tuning that dynamically adjusts parallel-to-orthogonal gradient ratios. On the server side, D3QN enables dynamic resource allocation across heterogeneous devices, while PSO fusion corrects non-IID aggregation bias through particle-swarm-based weight optimization. Experiments on CIFAR-10/100 and Shakespeare datasets demonstrate that DualMask achieves 5.2% higher accuracy (84.1% vs 79.4% in non-IID settings) and 34.4% faster convergence (210 vs 320 rounds) compared to FedAvg. Additionally, DualMask reduces the privacy budget from 4.5 to 2.8 and communication cost by 37.2% (45 MB vs 65 MB). This constitutes a significant Pareto improvement, substantially expanding the trilemma frontier. The code and data are available at https://github.com/zhou-weib/DualMask.git.

Introduction

Federated learning (FL), as a core paradigm of distributed machine learning, effectively resolves the persistent data-isolation dilemma in domains such as healthcare and finance by adhering to the principle of “keeping data local while moving models” [1–3]. However, real-world deployments face a privacy-utility-communication trilemma formed by the conflicting demands of privacy protection, model efficacy, and communication efficiency [4–6]. Specifically, defenses against gradient-inversion attacks rely primarily on static differential-privacy mechanisms [7]; synchronizing gradients from large-scale participants imposes significant bandwidth pressure [8,9]; non-IID data distributions trigger client drift and aggregation bias [10]; and gradient divergence induced by deep network architectures necessitates dynamic privacy-efficiency balancing mechanisms [11,12], exacerbating the trade-off complexity among these conflicting objectives [13,14]. Existing static defenses fail to jointly optimize these three dimensions, motivating dynamic and coordinated solutions.

Existing solutions to the “privacy-utility-communication trilemma” typically suffer from static adaptation rigidity and multi-dimensional coordination deficits, hindering a Pareto-optimal equilibrium among these competing objectives. Evolutionary Algorithm (EA)-based base-model selection [15] alleviates random aggregation under non-IID data but incurs discrete candidate overhead, whereas medical PFL-DP frameworks [16] still report a high privacy budget ε with unquantified traffic; DualMask replaces EA with continuous PSO weight optimization and reduces ε to 2.8 while preserving global accuracy. Loss-weighted aggregation (FedNolowe) [17] corrects statistical bias yet ignores layer-wise gradient heterogeneity and privacy budgeting, which DualMask addresses jointly via EMA quantile clipping plus PSO-driven weights. Gradient-similarity secure aggregation with Paillier encryption [18] lowers uploads but introduces heavy cryptographic costs; DualMask attains an extra 37% communication reduction through direction-orthogonal perturbation without any homomorphic overhead. In gradient processing, fixed-threshold clipping schemes (e.g., DeltaMask [12]) overlook layer-wise gradient divergence, resulting in two main issues: excessive clipping of large-norm shallow gradients and insufficient clipping of vanishing deep gradients [19,20]. Static noise budgets that are decoupled from training dynamics cause a privacy-utility mismatch: insufficient noise in the early stages fails to prevent gradient-inversion attacks, while excessive noise in later stages impairs convergence efficiency [11,21]. Isotropic perturbations distort gradient-descent trajectories [22] and exacerbate client drift in non-IID scenarios. At the aggregation level, the standard FedAvg algorithm [23] neglects dual heterogeneity (statistical heterogeneity in client data distributions and system heterogeneity in device capabilities), resulting in bias toward high-compute clients [24]. Although dynamic masks (e.g., FlashMask [25]) and secure multi-party computation (SMC) have been explored to optimize communication and privacy, these lightweight defenses still lack dynamic balancing of the trilemma under resource constraints.

To systematically address the aforementioned challenges, this paper proposes DualMask, a lightweight client-server framework that, for the first time, achieves simultaneous Pareto improvements along the privacy-utility-communication axes. Its contributions are three-fold: (1) an on-device AONC that synergistically combines layer-wise EMA quantile clipping, progress-aware noise decay, and direction-selective orthogonal perturbation to cut the RDP privacy budget from 4.5 to 2.8 while preserving accuracy; (2) a server-side co-design in which a D3QN-based resource scheduler reduces communication latency by 34.4% and client dropout by 40.5%, and a PSO-driven aggregation module corrects non-IID bias to boost final accuracy by 5.2%; and (3) an end-to-end O(d) complexity pipeline with < 1% memory overhead that yields 37.2% lower total communication (45 MB vs. 65 MB) and 34.4% fewer convergence rounds (210 vs. 320) on CIFAR-10/100 and Shakespeare, thereby pushing the Pareto frontier outward rather than claiming to “break” the theoretical impossibility. Accordingly, we first formalize the trilemma constraints and threat model, then detail the three synergistic techniques of AONC, next elaborate the joint optimization mechanism of D3QN-PSO, and finally present large-scale experimental validation and ablation studies that quantify the individual and combined gains in privacy budget, convergence speed, and communication cost, followed by a discussion of future extensions toward Transformer architectures and large-scale distributed networks.

DualMask framework design

To address the “privacy-utility-communication trilemma” in federated learning, we propose the DualMask framework, as illustrated in Fig 1. At its core is a client-server co-design that achieves three-dimensional Pareto improvements through triple cooperative defense. On the client side, AONC integrates layer-wise gradient clipping, time-decaying noise attenuation, and directional perturbation injection to dynamically balance privacy and utility. On the server side, a multi-agent D3QN resource scheduler allocates resources based on real-time device states and network conditions. Cross-device PSO feature fusion corrects non-IID aggregation bias. These three mechanisms form a closed-loop synergy: AONC suppresses gradient leakage while preserving effective learning signals; D3QN reduces communication overhead via optimized allocation; and PSO fusion corrects global model deviations—collectively achieving an end-to-end lightweight balance between privacy and utility.

Fig 1. Schematic of the DualMask framework.

DualMask framework integrating dual-branch masking (A), cross-layer feature interaction (B), and task-specific output heads (C) for privacy-utility-communication trilemma optimization.

https://doi.org/10.1371/journal.pone.0338822.g001

Problem definition and threat model

Problem formulation.

In the federated learning setting, there are K clients forming the set $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$ and a central server $\mathcal{S}$. Each client $c_i$ possesses a private dataset $D_i$ with data distribution $P_i$, where client data distributions are non-identically distributed (Non-IID). The global model parameters are denoted by $w \in \mathbb{R}^d$, where d is the parameter dimensionality. The core objective of federated learning is to minimize the weighted empirical risk:

$$\min_{w \in \mathbb{R}^d} F(w) = \sum_{i=1}^{K} \frac{|D_i|}{|D|}\, F_i(w) \tag{1}$$

where $F_i(w) = \frac{1}{|D_i|}\sum_{(x,y)\in D_i} \ell(w; x, y)$ denotes the local empirical risk, $\ell(\cdot)$ is the loss function, and $|D| = \sum_{i=1}^{K} |D_i|$ is the total data volume.

(1) Privacy Constraint

Suppose an adversary attempts to reconstruct raw data by observing the uploaded gradient updates from clients. The privacy objective is formally defined as follows: for any client $c_i$'s update $\Delta w_i$, the mechanism $\mathcal{M}$ must guarantee $(\varepsilon, \delta)$-differential privacy. Specifically, for any neighboring datasets $D, D'$ differing in a single record, and any output subset $S \subseteq \mathrm{Range}(\mathcal{M})$:

$$\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta \tag{2}$$

where $\mathcal{M}$ is the privacy mechanism, $\varepsilon$ is the privacy budget, and $\delta$ is the failure-probability threshold that upper-bounds the probability of the mechanism violating differential privacy.
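To make the guarantee of Eq (2) concrete, the following is a minimal sketch of a Gaussian mechanism applied to a norm-bounded client update, assuming the classical calibration $\sigma \ge C\sqrt{2\ln(1.25/\delta)}/\varepsilon$ (valid for $\varepsilon < 1$); the paper's own accounting uses tighter RDP composition, so this is illustrative only.

```python
# Minimal sketch (assumption): Gaussian mechanism for a norm-bounded update,
# calibrated with sigma >= C * sqrt(2 ln(1.25/delta)) / eps (valid for eps < 1).
import math
import torch

def gaussian_mechanism(update: torch.Tensor, clip_c: float,
                       eps: float, delta: float) -> torch.Tensor:
    # bound the L2 sensitivity by clipping the update to norm <= clip_c
    norm = update.norm(p=2).item()
    clipped = update * min(1.0, clip_c / (norm + 1e-12))
    # calibrate Gaussian noise to (eps, delta)-DP for a single release
    sigma = clip_c * math.sqrt(2 * math.log(1.25 / delta)) / eps
    return clipped + torch.randn_like(clipped) * sigma

noisy = gaussian_mechanism(torch.randn(1000), clip_c=1.0, eps=0.5, delta=1e-5)
```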

(2) Model-Utility Constraint

Let $F^* = F(w^*)$ denote the loss value at the global empirical-risk minimizer $w^*$. The model-utility constraint requires:

$$\frac{F(w_T) - F^*}{F^*} \;\le\; \gamma \tag{3}$$

where $w_T$ is the model after T training rounds, $\gamma$ is the target accuracy-loss threshold for the relative deviation, and $F^* > 0$ under standard convexity assumptions, so division by zero is avoided.

(3) Communication Efficiency Constraint

The communication overhead, denoted by $C_{\mathrm{comm}}$, is defined as the cumulative volume of transmitted data over all training rounds:

$$C_{\mathrm{comm}} = \sum_{t=1}^{T} \sum_{i \in S_t} \big\| \mathcal{Q}(\Delta w_i^t) \big\|_0 \tag{4}$$

where $S_t$ is the subset of clients participating in round t, $\mathcal{Q}(\cdot)$ is a compression operator applied to client gradient updates to reduce data transmission volume, and $\|\cdot\|_0$ denotes the L0-norm (counting non-zero elements). The aggregated overhead must satisfy $C_{\mathrm{comm}} \le B$, where $B$ is a predefined bandwidth budget.

Threat model.

In the federated learning system, we consider semi-honest adversaries who faithfully follow protocol specifications but attempt to steal private information from observed messages. The adversary roles are defined as a set $\mathcal{A}$, whose attack capabilities and objectives are summarized in Table 1 (x denotes a raw input; x* denotes reconstructed data).

We next model the key attack methods.

(1) Gradient Inversion Attack

The adversary reconstructs the original input data x* through iterative optimization. The objective function is:

$$x^* = \arg\min_{x}\; \big\| \nabla_w \ell(w; x, y) - g_{\mathrm{obs}} \big\|_2^2 + \lambda\, R(x) \tag{5}$$

where $R(x)$ is an image-prior regularizer that steers the reconstruction toward natural-image statistics, $\lambda$ weights the prior, and $g_{\mathrm{obs}}$ is the observed gradient. Because shallow-layer gradients leak high-frequency features, they provide exploitable information for gradient inversion.
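For illustration, a minimal PyTorch sketch of the attack loop in Eq (5) follows, assuming a white-box attacker holding the model, the observed gradient g_obs, and the ground-truth label; the total-variation term tv_prior is a hypothetical stand-in for the image prior R(x), not the paper's exact setup.

```python
# Minimal sketch (assumption) of the inversion loop in Eq (5): the attacker
# optimizes a dummy input so that its gradient matches g_obs.
import torch
import torch.nn.functional as F

def tv_prior(img: torch.Tensor) -> torch.Tensor:
    """Total-variation regularizer steering reconstructions toward natural images."""
    return (img[..., 1:, :] - img[..., :-1, :]).abs().mean() + \
           (img[..., :, 1:] - img[..., :, :-1]).abs().mean()

def gradient_inversion(model, g_obs, label, shape, steps=30, lam=1e-4):
    x = torch.zeros(shape, requires_grad=True)   # zero-image initialization
    opt = torch.optim.LBFGS([x], lr=1.0)

    for _ in range(steps):
        def closure():
            opt.zero_grad()
            loss = F.cross_entropy(model(x), label)
            grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
            # gradient-matching term plus the image prior (Eq 5)
            obj = sum(((g - o) ** 2).sum() for g, o in zip(grads, g_obs))
            obj = obj + lam * tv_prior(x)
            obj.backward()
            return obj
        opt.step(closure)
    return x.detach()                             # reconstructed x*
```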

(2) Membership-Inference Attack

A shadow model is built to decide whether a sample x* belongs to the client's training set. The membership probability is $p = \sigma\big(h_\phi(x^*)\big)$, where $h_\phi$ is a discriminator that judges sample membership, and $\sigma$ is the sigmoid function mapping the output to the probability space (0,1). The attack success rate correlates positively with the severity of model overfitting; stronger overfitting facilitates the leakage of membership information.

(3) Attribute-Inference Attack

For a label-excluded sensitive attribute $s$, inference is performed as follows:

$$\hat{s} = \arg\max_{s}\; f_{\theta^*}(s \mid g_i) \tag{6}$$

where $f_{\theta^*}$ is an attribute-inference model learning correlations between gradients and sensitive attributes, and $\theta^*$ are its pre-trained parameters. By leveraging statistical dependencies between gradients and attributes, the model infers the sensitive attribute.

Core challenge: the privacy-utility-communication “impossible triangle”.

In federated learning, the system must jointly optimize three conflicting objectives (privacy protection, model utility, and communication overhead) that form an “impossible triangle” of mutual restriction. The optimization goal is formulated as:

$$\min_{\pi \in \Pi}\; \Big( \mathcal{L}_{\mathrm{priv}}(\pi),\; \mathcal{L}_{\mathrm{util}}(\pi),\; \mathcal{L}_{\mathrm{comm}}(\pi) \Big) \tag{7}$$

$$\text{s.t.}\quad \mathcal{L}_{\mathrm{priv}}(\pi) \le \varepsilon, \qquad \mathcal{L}_{\mathrm{util}}(\pi) \le \gamma, \qquad \mathcal{L}_{\mathrm{comm}}(\pi) \le B \tag{8}$$

where $\pi$ denotes the system policy, $\mathcal{L}_{\mathrm{priv}}$ is the privacy loss, $\mathcal{L}_{\mathrm{util}}$ is the utility loss (not the model-utility loss), and $\mathcal{L}_{\mathrm{comm}}$ is the communication overhead. Threat-model analysis reveals a three-dimensional trade-off manifested in:

(1) Quantified Privacy-Utility Conflict

To fundamentally understand and overcome the “privacy-utility-communication” impossibility triangle, we first derive a precise privacy-utility trade-off limit for federated learning.

Theorem 2.1 (Privacy-Utility Lower Bound). For any $(\varepsilon, \delta)$-differentially private mechanism applied to federated learning, the utility loss is lower-bounded by:

$$\mathcal{L}_{\mathrm{util}} \;\ge\; \Omega\!\left( \frac{L\, G^2\, \sqrt{T \log(1/\delta)}}{K\, \varepsilon} \right) \tag{9}$$

where L is the gradient Lipschitz constant, G is the gradient-norm upper bound, K is the client count, and T is the global round count. This result establishes that stronger privacy protection (smaller $\varepsilon$) inevitably incurs greater utility loss (larger $\mathcal{L}_{\mathrm{util}}$), revealing a fundamental tension between these objectives.

Proof sketch: Building on the “trilemma” framework of Chen et al. [26] and employing the convergence-analysis technique of Gu et al. [27], we compound the privacy budget over T rounds. The total variance grows as $T^{1/3}$, while the required Gaussian noise scale is proportional to $\sqrt{T \log(1/\delta)}/\varepsilon$. Combining these two effects yields the stated lower bound.

Design Implication: Theorem 2.1 demonstrates that any static noise injection significantly compromises either privacy or utility. Motivated by this limitation, AONC introduces Progress-Aware Noise Decay (PAND): larger noise is injected initially to ensure strong privacy, then exponentially decayed as the model converges. This approach follows the Pareto frontier characterized by Theorem 2.1, achieving a better overall trade-off than any fixed noise-injection strategy.

(2) Implicit Communication-Privacy Coupling

To reduce communication overhead, federated learning employs compression techniques [28] such as Top-K sparsification. The effective privacy leakage satisfies:

$$\varepsilon_{\mathrm{eff}} = \varepsilon \cdot \frac{d}{\|\Delta w\|_0} \tag{10}$$

where $\|\Delta w\|_0$ is the count of non-zero gradient entries, d is the parameter dimension, and $\Delta w$ is the gradient update vector. This equation demonstrates that compression reduces the volume of transmitted data (lowering communication cost) but simultaneously exacerbates privacy leakage by the factor $d/\|\Delta w\|_0$, thereby diminishing the actual privacy-preserving efficacy. Consequently, communication efficiency and privacy protection exhibit implicit coupling.
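A small numerical sketch of this coupling follows, assuming the multiplicative leakage factor reconstructed in Eq (10); top_k_sparsify and the toy sizes are illustrative.

```python
# Hedged sketch: Top-K sparsification lowers transmitted volume but, under the
# reconstructed Eq (10), amplifies the effective per-entry leakage.
import torch

def top_k_sparsify(grad: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest-magnitude entries of a flattened gradient."""
    flat = grad.flatten()
    idx = flat.abs().topk(k).indices
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(grad)

g = torch.randn(10_000)             # toy gradient, d = 10,000
g_sparse = top_k_sparsify(g, k=500)
d, nnz = g.numel(), int((g_sparse != 0).sum())
eps = 2.8
eps_eff = eps * d / nnz             # 2.8 * 10000 / 500 = 56.0: a 20x amplification
print(f"transmit {nnz}/{d} entries, effective leakage eps_eff = {eps_eff:.1f}")
```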

(3) Resource Paradox

Performance-communication trade-offs arise under non-IID data: the number of communication rounds T required for model convergence satisfies:

$$T \;\ge\; \Omega\!\left( \frac{\Gamma}{\eta\, B\, \rho} \right) \tag{11}$$

where B is the batch size, $\eta$ is the learning rate, $\Gamma$ is the degree of data heterogeneity, and $\rho$ is an optimization-algorithm-dependent parameter. This inequality shows that data heterogeneity forces longer convergence. Under finite computational and communication resources, simultaneously maximizing utility and minimizing overhead is infeasible, revealing a resource-allocation paradox.

The foregoing analysis reveals that privacy, utility, and communication are deeply interconnected. This raises a fundamental question: can all three be optimized simultaneously? Lemma 2.1, which serves as a theoretical summary of this paper, proves that this is impossible under strict constraints.

Lemma 2.1 (Infeasibility of the privacy-utility-communication trilemma). For any federated learning system that simultaneously requires:

  • Strict $(\varepsilon, \delta)$-differential privacy
  • Utility loss $\mathcal{L}_{\mathrm{util}} \le \gamma$
  • Communication cost $C_{\mathrm{comm}} \le C_{\min}$

the three objectives are mutually exclusive when the performance targets are too stringent (i.e., $\varepsilon \to 0$ and $\gamma \to 0$):

$$\nexists\, \pi \in \Pi:\quad \mathcal{L}_{\mathrm{priv}}(\pi) \le \varepsilon \;\wedge\; \mathcal{L}_{\mathrm{util}}(\pi) \le \gamma \;\wedge\; C_{\mathrm{comm}}(\pi) \le C_{\min} \tag{12}$$

where $C_{\min}$ is the minimal communication threshold and $\Pi$ denotes all possible algorithmic policies.

Proof sketch: This lemma follows as a corollary of Theorem 2.1 (privacy-utility lower bound) and Eq (10) (communication-privacy coupling). We proceed by contradiction: assume there exists a policy $\pi^*$ that simultaneously attains all three limits. Then, as $\varepsilon \to 0$, $\pi^*$ would inevitably violate the lower bound established in Theorem 2.1. At the same time, the same $\pi^*$ would contradict Eq (10), which implies that communication compression increases the effective privacy leakage $\varepsilon_{\mathrm{eff}}$. Therefore, such an ideal $\pi^*$ does not exist.

Implication: Lemma 2.1 demonstrates that jointly optimizing all three objectives under stringent limits is unattainable. Consequently, DualMask does not abandon AONC; instead, it uses AONC in conjunction with D3QN and PSO to dynamically adjust noise, resource allocation, and aggregation weights in real time, thereby circumventing the theoretical impossibility in practice and simultaneously improving privacy, utility, and communication.

DualMask coordinates a tri-level closed-loop collaboration: clients utilize AONC to dynamically adjust noise injection, balancing the trade-off between utility loss $\mathcal{L}_{\mathrm{util}}$ and privacy loss $\mathcal{L}_{\mathrm{priv}}$; the server employs a D3QN-based resource allocator to reallocate resources in real time, mitigating the conflict between utility loss and communication cost $\mathcal{L}_{\mathrm{comm}}$; and a metaheuristic feature aggregator compresses and aligns features via PSO-driven fusion, detecting and removing implicit overheads. Operating as a closed-loop coordination mechanism, these components synergistically shift the Pareto frontier outward, enabling Pareto improvements in privacy protection (lower $\varepsilon$), computational efficiency (higher FLOPs utilization), and communication cost (lower $C_{\mathrm{comm}}$), thereby addressing the fundamental “impossible triangle” constraint.

Adaptive Orthogonal Noise Canceler (AONC)

To effectively tackle the challenging trade-off among privacy, utility, and communication (especially considering layer-specific gradient differences, evolving training dynamics, and gradient-inversion attacks that reduce utility and expose hidden weaknesses in current privacy methods [29–31]), this section presents AONC. AONC operates on the client side in federated learning as the primary protection mechanism before gradients are uploaded. Its goal is to reduce negative impacts on model performance while providing strong privacy assurances (a low privacy budget $\varepsilon$) and accommodating the limited computing power of edge devices. As shown in Fig 2, AONC's workflow includes three main technical steps: (1) layer-wise adaptive exponential moving average (EMA) clipping; (2) noise decay that adapts to training progress; and (3) directional noise injection. This process takes the locally trained raw gradients as input and produces privacy-preserving gradients that minimize information loss.

Layer-wise adaptive EMA clipping.

Gradient clipping is a fundamental operation in differential privacy (DP) to control gradient sensitivity for effective noise perturbation [32]. However, fixed global clipping ignores the inherent heterogeneity of gradient behaviors across different layers in deep neural networks [33]. Shallow layers typically exhibit larger gradient norms, carrying rich low-level feature information but being more vulnerable to inversion attacks. In contrast, deep layers tend to have smaller norms, where their directional consistency is critical for model convergence. Using a single clipping threshold excessively truncates shallow gradients, sacrificing essential semantic information, while insufficiently constraining deep gradients, leading to inefficient use of the privacy budget or inadequate protection. Moreover, gradient norms dynamically change across training phases.

To address these issues, AONC introduces layer-adaptive exponential moving average (EMA) clipping, which dynamically adjusts each layer's clipping threshold based on its gradient statistics over time.

$$C_{i,l}^{t} = \beta\, C_{i,l}^{t-1} + (1-\beta)\, \big\| g_{i,l}^{t} \big\|_2 \tag{13}$$

Here, $g_{i,l}^{t}$ is the raw gradient of layer l on client i at round t, $C_{i,l}^{t}$ is the updated clipping threshold, and $\beta$ is the smoothing factor. Clipping preserves gradient direction while bounding its magnitude:

$$\tilde{g}_{i,l}^{t} = g_{i,l}^{t} \cdot \min\!\left( 1,\; \frac{C_{i,l}^{t}}{\| g_{i,l}^{t} \|_2} \right) \tag{14}$$

This method assigns layer-specific clipping thresholds, adapting to gradient heterogeneity across layers. By applying EMA normalization, it smooths gradient norms and dynamically tracks their variations over time. Thanks to its negligible computational overhead, it is well-suited for deployment on resource-constrained edge devices. Moreover, by maintaining bounded gradient norms, it ensures compatibility with noise injection mechanisms, effectively mitigating the privacy-utility trade-off—often referred to as the “seesaw effect.”
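A minimal sketch of Eqs (13)-(14) follows, assuming the norm-based EMA form reconstructed above; substituting a running quantile of recent norms would give the “EMA-quantile” variant the paper names.

```python
# Hedged sketch of layer-wise EMA clipping (Eqs 13-14) under the norm-based
# EMA assumption; thresholds are tracked per layer name.
import torch

class LayerwiseEMAClipper:
    def __init__(self, beta: float = 0.9, init_threshold: float = 1.0):
        self.beta = beta                      # smoothing factor in Eq (13)
        self.init = init_threshold
        self.thresholds = {}                  # per-layer thresholds C_l

    def clip(self, name: str, grad: torch.Tensor) -> torch.Tensor:
        norm = grad.norm(p=2).item()
        prev = self.thresholds.get(name, self.init)
        c = self.beta * prev + (1 - self.beta) * norm   # EMA update, Eq (13)
        self.thresholds[name] = c
        # rescale only when the norm exceeds the threshold, Eq (14)
        return grad * min(1.0, c / (norm + 1e-12))

# illustrative usage inside the local training loop:
# for name, p in model.named_parameters():
#     p.grad = clipper.clip(name, p.grad)
```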

Progress-aware noise attenuation.

After clipping gradients to control their sensitivity, the introduction of carefully calibrated noise becomes critical for enforcing strict privacy guarantees. However, existing approaches predominantly adopt static noise strategies (e.g., a fixed scale per round), which fail to adapt to the dynamic requirements throughout the training lifecycle. In the early stages, model representations remain unstable, and gradients carry high-density raw data signals. Insufficient noise fails to counter inversion attacks (e.g., gradient reconstruction), escalating privacy risks. As training progresses and the model begins to converge, gradient norms decrease, leading to a reduction in informative signal content. Excessive static noise then becomes the primary source of error, slowing convergence and degrading fine-grained feature learning—resulting in a dual dilemma: inadequate protection and unnecessary utility loss [34]. To achieve a training-progress-dependent equilibrium between privacy and utility, AONC proposes the Progress-Aware Noise Attenuation (PAND) framework. Its core principle is that noise should decay dynamically in alignment with the gradual reduction of gradient information over time.

$$\sigma_{l}^{t} = \sigma_{0,l}\, \kappa_l\, e^{-\lambda A_t} \tag{15}$$

where $\sigma_{0,l}$ is the initial noise scale at layer l, $A_t \in [0,1]$ is the global validation accuracy (the progress indicator); a high $A_t$ indicates convergence. $\lambda$ is the decay-strength hyperparameter, and $\kappa_l$ is a layer-dependent modulator for noise baselines (e.g., boosting protection for vulnerable shallow layers). Noise is injected as $\hat{g}_{i,l}^{t} = \tilde{g}_{i,l}^{t} + \mathcal{N}\big(0, (\sigma_l^t C_{i,l}^t)^2 I\big)$. The mechanism, driven by the training progress $A_t$ as its core signal, dynamically scales down the noise magnitude via exponential decay as the model approaches convergence. This strategy ensures robust privacy protection during the early stages while preventing convergence slowdown caused by excessive noise interference in later stages. Crucially, it couples in real time with the model state, minimizing cumulative noise bias to accelerate convergence. With negligible computational overhead for noise-scaling adjustments, it naturally supports edge deployment, achieving co-optimization of privacy guarantees and convergence efficiency.
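A minimal sketch of the PAND schedule in Eq (15) follows; sigma0, lam, and kappa are illustrative hyperparameters, with acc playing the role of the progress indicator $A_t$.

```python
# Hedged sketch of progress-aware noise attenuation (Eq 15), assuming the
# exponential-decay form reconstructed above.
import math
import torch

def pand_noise_scale(sigma0: float, acc: float, lam: float = 3.0,
                     kappa: float = 1.0) -> float:
    # Eq (15): noise std decays exponentially as validation accuracy A_t rises
    return sigma0 * kappa * math.exp(-lam * acc)

def inject_noise(clipped_grad: torch.Tensor, sigma0: float, acc: float,
                 clip_threshold: float) -> torch.Tensor:
    sigma = pand_noise_scale(sigma0, acc)
    # Gaussian noise calibrated to the layer's clipping threshold
    return clipped_grad + torch.randn_like(clipped_grad) * sigma * clip_threshold

# early training (A_t = 0.10) -> large noise; near convergence (A_t = 0.85) -> small
print(pand_noise_scale(1.0, 0.10), pand_noise_scale(1.0, 0.85))
```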

Directional noise injection.

Clipping and scale adaptation address the constraint of noise magnitude; however, traditional isotropic noise contaminates all gradient directions, including those critical for model optimization. This contamination increases training difficulty and leads to performance degradation [35].

AONC decomposes noise into two components: a parallel component that preserves gradient directions effective for optimization, and an orthogonal component that disrupts sensitive information. It dynamically adjusts the mixing ratio to concentrate most of the noise energy in the subspace orthogonal to the true gradient direction. First, to maximally interfere with adversarial models, orthogonal noise leaves the primary gradient direction unchanged while significantly distorting orthogonal subspaces, thereby greatly increasing the difficulty of data reconstruction for attackers. Second, by minimizing contamination of gradient directions that are effective for optimization, it preserves the efficacy of model updates. The steps for constructing directional noise are as follows:

(1) Compute the effective gradient direction: Normalize the clipped gradient as $u = \tilde{g} / \|\tilde{g}\|_2$.

(2) Noise decomposition (separating parallel and orthogonal components): Generate a base Gaussian noise vector $n \sim \mathcal{N}(0, \sigma^2 I)$, which is then projected and decomposed into a parallel component $n_{\parallel} = (n^{\top} u)\, u$ (aligned with the gradient direction) and an orthogonal component $n_{\perp} = n - n_{\parallel}$ (perpendicular to the gradient direction). Critically, the parallel component has minimal impact on model convergence but provides weak privacy protection, whereas the orthogonal component disrupts statistical features to effectively suppress inversion attacks.

(3) Directional reshaping (dynamic scheduling of noise-component ratios): Under the energy-conservation constraint ($k_1^2 + k_2^2 = 1$), reshape the noise direction by defining $n_d = k_1 n_{\parallel} + k_2 n_{\perp}$. The default parameters are $k_1 = 0$, $k_2 = 1$, yielding purely orthogonal noise $n_d = n_{\perp}$. The ratio is adjusted dynamically using a sigmoid activation scaled by training progress t, increasing the weight of the parallel component in later stages to accelerate model convergence.

(4) Generate the protected gradient: Inject directional noise into the clipped gradient as $\hat{g} = \tilde{g} + n_d$, producing an output gradient that preserves the primary optimization direction; see the sketch after this list. The orthogonal noise significantly enhances attack resistance.
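The following is a minimal sketch of steps (1)-(4), assuming the parallel/orthogonal decomposition reconstructed above; the specific sigmoid schedule for $k_1$ is an assumption, since the text only states that the parallel weight grows with training progress.

```python
# Hedged sketch of directional noise injection (steps 1-4 above).
import torch

def directional_noise(clipped_grad: torch.Tensor, sigma: float,
                      progress: float) -> torch.Tensor:
    g = clipped_grad.flatten()
    u = g / (g.norm() + 1e-12)                # (1) effective gradient direction
    n = torch.randn_like(g) * sigma           # (2) base Gaussian noise
    n_par = (n @ u) * u                       #     parallel component
    n_perp = n - n_par                        #     orthogonal component
    # (3) illustrative sigmoid schedule: purely orthogonal early (k1 ~ 0),
    #     modestly more parallel weight as training progresses
    k1 = 0.3 * torch.sigmoid(torch.tensor(4.0 * (progress - 0.5))).item()
    k2 = (1.0 - k1 ** 2) ** 0.5               #     energy conservation k1^2 + k2^2 = 1
    n_d = k1 * n_par + k2 * n_perp
    return (g + n_d).view_as(clipped_grad)    # (4) protected gradient
```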

The mechanism enhances privacy protection by amplifying orthogonal noise and suppressing parallel noise to minimize utility loss, all while maintaining computational complexity of O(d). Through layer-adaptive EMA clipping, progress-aware noise attenuation, and directional noise injection, AONC generates privacy-enhanced optimization gradients that enable high-efficiency, high-accuracy global model aggregation.

Server-side multi-agent reinforcement learning resource scheduler (D3QN)

While AONC alleviates the privacy-utility trade-off at the client level, the server still faces the third vertex of the trilemma—communication efficiency—further complicated by device and data heterogeneity. Existing RL-based schedulers [30,36] either (1) optimize only client sampling using hand-crafted rewards or (2) determine bandwidth allocation once, neglecting the dynamic interplay among resource state, gradient quality, and aggregation outcome. To clearly position DualMask relative to these approaches, we propose a D3QN scheduler that jointly decides who to select (client selection), how much CPU share to allocate, and how urgent the bandwidth priority should be, all within a unified collaborative RL framework. By incorporating post-AONC gradient statistics and PSO aggregation loss into the reward function, D3QN continuously refines its policy, achieving up to 34% fewer communication rounds and up to 40% lower dropout rates compared to the best prior RL baseline [36]. The following subsections detail the state space, action space, and reward design that enable this closed-loop optimization.

Computing resources dynamic allocation.

The D3QN algorithm enables dynamic scheduling of computing resources through a two-level cooperative mechanism, encompassing three key components: state space design, action space design, and reward function design.

(1) State-Space Design

The environmental state vector St integrates multidimensional information, including device states, data distribution characteristics, and model states.

Device states are composed of $s_{\mathrm{dev}} = (u_{\mathrm{cpu}}, u_{\mathrm{mem}}, b_{\mathrm{net}}, e_{\mathrm{bat}})$, which correspond to real-time metrics of CPU utilization, memory usage, instantaneous bandwidth, and remaining battery level. These metrics reflect hardware resource availability and energy consumption.

Data distribution characteristics are represented as $s_{\mathrm{data}} = (|D_i|, \mathrm{skew}_i)$, where $|D_i|$ denotes the size of the local dataset for client i (i.e., the number of samples), and $\mathrm{skew}_i$ represents the label skewness. Label skewness is quantified using the Kullback-Leibler divergence $D_{\mathrm{KL}}\big(P_i(y)\,\|\,P_{\mathrm{global}}(y)\big)$, which measures the discrepancy between the local label distribution $P_i(y)$ and the global distribution $P_{\mathrm{global}}(y)$.

Model states consist of $s_{\mathrm{model}} = (\bar{g}_i, \|\hat{g}_i\|_2)$, representing the mean of local loss gradients and the L2-norm of gradients after AONC pruning, respectively. These components characterize gradient information during model training [13,14]. The final state space is formally defined as follows:

$$S_t = \big[\, s_{\mathrm{dev}} \,\|\, s_{\mathrm{data}} \,\|\, s_{\mathrm{model}} \,\big] \tag{16}$$

(2) Action Space Design

The scheduler outputs a joint decision vector $A_t$ each round, consisting of the following three types of actions:

Client Selection: $a_i^{\mathrm{sel}} \in \{0, 1\}$ is a binary indicator denoting whether client $c_i$ participates in the current training round.

Computing Resource Allocation: $a_i^{\mathrm{cpu}} \in [0, 1]$ represents the proportion of CPU cores allocated to client $c_i$.

Bandwidth Reservation Weight: $a_i^{\mathrm{bw}} \ge 0$ is a priority weight that controls the precedence of a client's data upload.

The action space is defined as $A_t = \big\{ (a_i^{\mathrm{sel}}, a_i^{\mathrm{cpu}}, a_i^{\mathrm{bw}}) \big\}_{i=1}^{K}$, where K is the total number of clients. Concurrently, actions must satisfy the server resource constraints $\sum_{i} a_i^{\mathrm{cpu}} \le N_{\mathrm{cpu}}$ and $\sum_{i} a_i^{\mathrm{bw}} \le B_{\mathrm{total}}$, where $N_{\mathrm{cpu}}$ denotes the total CPU cores available on the server, and $B_{\mathrm{total}}$ represents the aggregate bandwidth capacity of the server.

(3) Reward Function Design

The composite reward function incorporates time-variant costs and asymptotic objectives, formulated as:

$$R_t = -\lambda_1\, T_{\mathrm{round}} + \lambda_2\, \big( F(w_{t-1}) - F(w_t) \big) - \lambda_3\, \mathrm{Var}\big(\{ a_i^{\mathrm{cpu}} \}\big) \tag{17}$$

The communication-latency penalty term aims to reduce the per-round duration $T_{\mathrm{round}}$ and is expressed as $-\lambda_1 T_{\mathrm{round}}$, where $\lambda_1$ denotes the penalty coefficient. The model-convergence gain term promotes the selection of high-efficiency clients to accelerate model convergence; it is defined as $\lambda_2 (F(w_{t-1}) - F(w_t))$, where $\lambda_2$ represents the gain coefficient, and $F(w_{t-1})$ and $F(w_t)$ are the loss values of the previous and current-round global models, respectively. The fairness-adjustment term prevents resource monopolization while ensuring the participation of resource-constrained devices; it is expressed as $-\lambda_3 \mathrm{Var}(\{a_i^{\mathrm{cpu}}\})$, where $\lambda_3$ serves as the fairness coefficient and $\mathrm{Var}(\cdot)$ denotes the variance operator, which measures the equity of resource allocation.
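A minimal sketch of the composite reward in Eq (17) follows, under the reconstructed three-term form; the coefficient values are illustrative.

```python
# Hedged sketch of the D3QN reward (Eq 17): latency penalty + convergence
# gain - fairness penalty over the CPU allocations of selected clients.
import statistics

def d3qn_reward(t_round: float, loss_prev: float, loss_curr: float,
                cpu_alloc: list[float],
                lam1: float = 0.1, lam2: float = 1.0, lam3: float = 0.5) -> float:
    latency_penalty = -lam1 * t_round                          # shorter rounds preferred
    convergence_gain = lam2 * (loss_prev - loss_curr)          # reward loss reduction
    fairness_penalty = -lam3 * statistics.variance(cpu_alloc)  # even allocation
    return latency_penalty + convergence_gain + fairness_penalty

# example: 12 s round, loss 1.82 -> 1.74, CPU shares for 4 selected clients
print(d3qn_reward(12.0, 1.82, 1.74, [0.25, 0.30, 0.20, 0.25]))
```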

(4) D3QN Decision Mechanism

The D3QN algorithm employs a dueling network architecture to decouple the state value V(S) from the action advantage A(S, A), with the Q-value function expressed as:

$$Q(S, A) = V(S) + \left( A(S, A) - \frac{1}{|\mathcal{A}|} \sum_{A'} A(S, A') \right) \tag{18}$$

where $|\mathcal{A}|$ is the size of the action space. This architecture enables more precise evaluation of the values of different state-action pairs. In distributed training, N synchronously updating agents (Actors) are deployed, each responsible for scheduling tasks within a sub-cluster. The target network adopts a periodic synchronization strategy with the soft update $\theta^{-} \leftarrow \tau \theta + (1 - \tau)\, \theta^{-}$, where $\tau$ is the soft-update coefficient; a small $\tau$ ensures smoother updates of the target network, enhancing training stability. To accelerate policy convergence, a prioritized experience replay mechanism is utilized. Sampling is based on the TD error $\delta_j$, with the sampling probability given by:

$$P(j) = \frac{\big( |\delta_j| + \epsilon \big)^{\alpha}}{\sum_{k} \big( |\delta_k| + \epsilon \big)^{\alpha}} \tag{19}$$

where $\epsilon$ is a very small value that prevents probabilities from reaching zero, and $\alpha$ is a parameter adjusting the influence of priorities.
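A minimal sketch of Eqs (18)-(19) under standard dueling-DQN conventions follows; the layer sizes are illustrative rather than the paper's architecture.

```python
# Hedged sketch of the dueling Q-network (Eq 18) and prioritized sampling (Eq 19).
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)          # V(S)
        self.adv = nn.Linear(hidden, n_actions)    # A(S, a)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.trunk(s)
        v, a = self.value(h), self.adv(h)
        # Eq (18): Q = V + (A - mean_a A), removing the identifiability ambiguity
        return v + a - a.mean(dim=-1, keepdim=True)

def priority_probs(td_errors: torch.Tensor, alpha: float = 0.6,
                   eps: float = 1e-5) -> torch.Tensor:
    # Eq (19): sampling probability proportional to (|delta| + eps)^alpha
    p = (td_errors.abs() + eps) ** alpha
    return p / p.sum()
```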

Through a triple-aware state representation—incorporating resources, data, and models—and a reward design focused on global convergence, the D3QN algorithm significantly reduces the dropout rate of weak devices (improving participation fairness) while decreasing per-round communication latency. This provides high-quality input gradient sets for subsequent PSO feature fusion.

PSO feature fusion module.

Traditional FedAvg employs a static weight-aggregation strategy [37], where the aggregation weight $w_i$ is directly proportional to the client data volume $|D_i|$. Under non-IID data conditions, this approach causes the global model to skew toward clients with dominant classes, thereby degrading generalization performance. To address this bias, we introduce a PSO feature-fusion module that enables optimized model aggregation through dynamic weighting:

(1) Particle Encoding and Search Space

Each particle is defined as a weight vector $p_j = (w_{j,1}, w_{j,2}, \ldots, w_{j,K})$, where $w_{j,i}$ represents the aggregation-weight ratio that the j-th particle assigns to the i-th client; the simplex constraint $\sum_{i=1}^{K} w_{j,i} = 1$, $w_{j,i} \ge 0$ must hold.

The initial particle population is generated by injecting a Gaussian perturbation into the FedAvg weights: $p_j^{0} = \Pi\big( w^{\mathrm{FedAvg}} + \mathcal{N}_{\mathrm{trunc}}(0, \sigma_p^2) \big)$. Here, $\mathcal{N}_{\mathrm{trunc}}$ denotes a truncated normal distribution, and $\sigma_p$ controls the perturbation variance to enhance population diversity.

(2) Fitness Function Design

The fitness function evaluates the quality of the global model corresponding to particle $p_j$:

$$\mathrm{Fit}(p_j) = \frac{1}{|D_{\mathrm{val}}|} \sum_{(x, y) \in D_{\mathrm{val}}} \mathbb{1}\big[ f(x; w(p_j)) = y \big] \;-\; \mu \sum_{i=1}^{K} D_{\mathrm{KL}}\big( f_i \,\|\, f \big) \tag{20}$$

where f is the model's prediction distribution, $\mathbb{1}[\cdot]$ is an indicator function (1 if the condition is true, 0 otherwise), and $\mu$ is a penalty coefficient balancing validation accuracy against the KL divergence between client models to prevent overfitting to specific clients [14].

(3) Particle Update Rules

Particles update their velocities and positions based on their individual historical best ($pbest_j$) and the global best ($gbest$). The update formulas are as follows:

Velocity Update:

$$v_j^{t+1} = \omega\, v_j^{t} + c_1 r_1 \big( pbest_j - p_j^{t} \big) + c_2 r_2 \big( gbest - p_j^{t} \big) \tag{21}$$

Position Update:

$$p_j^{t+1} = \Pi\big( p_j^{t} + v_j^{t+1} \big) \tag{22}$$

where $\omega$ is the inertia factor, used to balance the global exploration and local exploitation abilities of particles; $c_1$ and $c_2$ are learning rates that control the step sizes at which particles move toward the individual and global optimal positions, respectively; $r_1, r_2 \sim U(0, 1)$ are random numbers from a uniform distribution; and $\Pi(\cdot)$ is a projection operation that ensures the particle positions satisfy the weight-renormalization constraint.
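A minimal sketch of the PSO search in Eqs (21)-(22) with simplex projection follows; the fitness function here is a stub standing in for Eq (20)'s validation-accuracy-minus-KL objective.

```python
# Hedged sketch of the PSO weight search (Eqs 21-22); the projection clips to
# non-negative values and renormalizes onto the weight simplex.
import numpy as np

def project_simplex(w: np.ndarray) -> np.ndarray:
    """Clip to non-negative and renormalize so weights sum to 1."""
    w = np.clip(w, 0.0, None)
    return w / w.sum() if w.sum() > 0 else np.full_like(w, 1.0 / len(w))

def pso_step(pos, vel, pbest, gbest, omega=0.7, c1=1.5, c2=1.5):
    r1, r2 = np.random.rand(*pos.shape), np.random.rand(*pos.shape)
    vel = omega * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)  # Eq (21)
    pos = np.apply_along_axis(project_simplex, 1, pos + vel)               # Eq (22)
    return pos, vel

# toy run: 8 particles over K = 5 client weights, with a stub fitness
K, J = 5, 8
pos = np.apply_along_axis(project_simplex, 1, np.random.rand(J, K))
vel = np.zeros((J, K))
fitness = lambda p: -np.abs(p - 1.0 / K).sum()   # stub: prefers uniform weights
pbest = pos.copy()
gbest = pos[np.argmax([fitness(p) for p in pos])].copy()
for _ in range(20):
    pos, vel = pso_step(pos, vel, pbest, gbest)
    for j in range(J):
        if fitness(pos[j]) > fitness(pbest[j]):
            pbest[j] = pos[j]
    gbest = pbest[np.argmax([fitness(p) for p in pbest])].copy()
print(np.round(gbest, 3))   # converges toward the stub optimum (uniform weights)
```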

(4) Ensemble Aggregation Mechanism

Optimal Particle Aggregation: In each iteration, the weights corresponding to the global best particle, gbest, are selected as the formal aggregation weights to achieve the current optimal model fusion.

Soft Voting Mechanism: To prevent the algorithm from becoming trapped in local optima, if there is no significant improvement in gbest for G consecutive iterations, the weights of the top M particles with the highest fitness values are summed and averaged. The formula is:

$$w_{\mathrm{agg}} = \sum_{m=1}^{M} \alpha_m\, p_{(m)}, \qquad \alpha_m = \frac{\mathrm{Fit}(p_{(m)})}{\sum_{m'=1}^{M} \mathrm{Fit}(p_{(m')})} \tag{23}$$

where $\alpha_m$ reflects the importance of the different particles. In terms of collaborative advantages, the PSO fusion module and D3QN establish a closed-loop collaborative relationship: D3QN is responsible for selecting high-value clients to ensure that the input gradients exhibit good diversity and quality; PSO uses these gradients to optimize the weights and correct the Non-IID deviation; and the performance of the aggregated model is fed back into the reward function of D3QN, thereby promoting optimization in the next round of scheduling.

Experiments and results analysis

Experimental setup

To comprehensively evaluate the performance and collaborative optimization capabilities of the DualMask framework across three dimensions—privacy protection, model effectiveness, and communication efficiency—experiments were conducted in a controlled environment to ensure fairness and objectivity. The server was equipped with dual Intel Xeon Platinum 8358 CPUs, four NVIDIA A100 GPUs, and 512 GB of memory. The clients simulated 100 heterogeneous edge devices, comprising 30% with high computing power, 50% with medium computing power, and 20% with low computing power. The software environment included Ubuntu 20.04 and PyTorch 1.12.0, supplemented by PySyft and ns-3 to simulate dynamic network and device states within a federated learning framework.

To confirm the broad applicability of the proposed approach, experiments were conducted on datasets with various characteristics, each matched with an appropriate model. For image-related tasks, MNIST (with both IID and Non-IID splits) and CIFAR-10/100 (including quantity-skewed Non-IID data) were utilized, paired with a small CNN and ResNet-18, respectively. For natural language processing tasks, the FEMNIST and Shakespeare datasets (both Non-IID) were used alongside a two-layer stacked LSTM. The methods compared against included FedAvg; FedAvg combined with DP-SGD (which incorporates fixed Gaussian noise and gradient clipping); DeltaMask (a dynamic gradient masking technique that reduces communication overhead and enhances privacy via sparsity); FlashMask (an adaptive masking method designed for federated learning environments); and CENSOR (a privacy protection approach based on orthogonal projection). Furthermore, three types of ablation studies on DualMask were performed by separately disabling the AONC, D3QN, and PSO components.

Evaluation metrics were refined across three objectives: model effectiveness, which considered accuracy, convergence rounds, and related factors; privacy protection, measured by the privacy budget, MSE/PSNR under gradient inversion attacks, and attack success rate; and communication efficiency, which included total communication volume and per-round upload volume, with additional emphasis on resource metrics such as device scheduling fairness and dropout rates. Finally, performance was comprehensively assessed through 3D visualization and Pareto frontier analysis.

Overall performance comparison

Analysis of model effectiveness and convergence.

To evaluate the comprehensive performance of DualMask in terms of model accuracy and convergence efficiency, experiments were conducted to compare its advantages under both IID and Non-IID data distributions.

(1) Accuracy-Driven Effectiveness Analysis

Table 2 presents the final accuracy of various methods on the CIFAR-10, CIFAR-100, and Shakespeare test sets (task details: ResNet-18 for CIFAR datasets and a two-layer LSTM for Shakespeare). The data distributions include IID and Non-IID (Dirichlet and role-based partitioning).

Table 2. Final accuracy of various methods on CIFAR and Shakespeare test sets under IID and Non-IID data distributions.

https://doi.org/10.1371/journal.pone.0338822.t002

(2) Convergent Dynamic Characteristics Analysis

To emphasize the benefits in training efficiency, Fig 3 shows the test accuracy progression across 500 communication rounds for different methods on the CIFAR-100 dataset in a Non-IID setting.

Fig 3. Convergence dynamics comparison of federated learning methods on CIFAR-100 under Non-IID data distribution.

(A) Convergence Efficiency: DualMask (green solid line) achieves 50% accuracy in 210 rounds, requiring 34.4% fewer rounds than FedAvg (320 rounds) and 27.6% fewer rounds than DeltaMask (290 rounds), respectively—while exhibiting 38% lower variance. (B) Training Stability: Whereas FedAvg fluctuates by 2.1% after 400 rounds, DualMask maintains a 0.7% deviation through AONC with progress-aware attenuation (). (C) System Optimization: D3QN scheduling increases low-power device participation from 62% to 89% and reduces the average round time by 28%, thereby accelerating convergence.

https://doi.org/10.1371/journal.pone.0338822.g003

Privacy protection strength verification.

To comprehensively evaluate the privacy protection capability of the DualMask framework, this experiment adopts a triple verification mechanism: theoretical privacy budget analysis (based on Rényi Differential Privacy), gradient inversion attack reconstruction experiments, and membership inference/attribute inference attack testing. Specifically, all experiments were conducted on the CIFAR-10 dataset under a Non-IID setting, with comparison baselines including FedAvg, DP-SGD, DeltaMask, and FlashMask.

(1) Privacy Budget (RDP) Analysis

A quantitative assessment of the privacy consumption of the AONC module was performed using Rényi Differential Privacy (RDP). For this analysis, the RDP order and failure probability $\delta$ were fixed, and the cumulative privacy budget after T training rounds was calculated. The results are illustrated in Fig 4.

Fig 4. RDP privacy budget accumulation.

(A) Privacy budget ($\varepsilon$) comparison across methods. DualMask achieves $\varepsilon = 2.8$, significantly lower than DP-SGD ($\varepsilon = 4.5$) and DeltaMask. (B) Impact of the dynamic attenuation mechanism. Progress-aware noise attenuation reduces privacy-budget consumption by 41% during later training stages. (C) Effect of orthogonal noise amplification. Directional noise injection increases the effective noise scale by 43%, yielding a 1.7-fold improvement in the privacy-utility trade-off. Figure notes: all RDP computations use a fixed $\delta$ (shown), with additional settings in the Appendix curves.

https://doi.org/10.1371/journal.pone.0338822.g004

(2) Gradient Inversion Attack Reconstruction Experiment

A total of 100 gradient inversion attack reconstructions were performed using the LBFGS optimizer to quantify the reconstruction quality.

Attack Setup: L-BFGS optimizer (lr = 1.0); zero-image initialization; white-box setting in which the attacker knows the full model architecture and ground-truth labels; identical hyper-parameters for all methods. The results are shown in Fig 5.

Fig 5. Effectiveness of gradient inversion attacks.

(A) Mean Squared Error (MSE) comparison: DualMask showed an MSE 64% greater than DP-SGD, indicating that its orthogonal noise effectively disturbs the gradient statistics. (B) Peak Signal-to-Noise Ratio (PSNR) evaluation: the PSNR of images reconstructed from DualMask gradients decreased to 18.2 dB, below the 20 dB level generally considered visible to the human eye. (C) Visual assessment of reconstruction quality: images reconstructed by an attacker using DualMask gradients contain substantial high-frequency noise, making them unidentifiable. Figure notes: all values are computed under identical attack settings; numbers in parentheses indicate relative change w.r.t. DP-SGD.

https://doi.org/10.1371/journal.pone.0338822.g005

(3) Membership and Attribute Inference Attack Test

The shadow model attack framework was employed for testing, and the experimental results are summarized in Table 3.

Table 3. Attack success rates (%) of membership and attribute inference attacks under different defense methods.

https://doi.org/10.1371/journal.pone.0338822.t003

The values indicate the Attack Success Rate (ASR) as a percentage, where a lower ASR signifies better privacy protection. Under the same privacy budget, DualMask's AONC directional noise injection decreases the model's accuracy loss by 4.9% compared to DP-SGD. The progress-aware noise attenuation (PAND) mechanism reduces the PSNR to 15 dB in the later stages of training, addressing the shortcomings of static noise defenses. Additionally, orthogonal perturbations increase the number of iterations needed for gradient-inversion attacks by 3.2 times, raising the average from 150 to 486 iterations.

Communication overhead and resource efficiency evaluation.

To measure the resource efficiency of AONC in federated learning, this section assesses its performance from two angles: communication overhead and resource usage. The experiments took place in a simulated federated setting with 100 clients, where 10 clients were randomly chosen to join each training round. A non-IID data distribution was created using a Dirichlet distribution.

(1) Communication Overhead Analysis

Communication overhead was evaluated based on the total amount of data transmitted. The results of the experiments are presented in Fig 6.

Fig 6. Communication cost comparison on CelebA-HQ (Non-IID).

(A) Total communication cost across different methods. AONC reduces this cost by 37.2% compared to baseline approaches, thanks to its hierarchical clipping and adaptive noise scaling, which minimize unnecessary gradient transmissions. (B) Effect of gradient sparsity. The directional noise injection used in AONC creates sparsity, with orthogonal components making up more than 70%, thereby improving compression efficiency. (C) Communication costs of other methods. DP-SGD results in a 71% higher total communication cost than FedAvg because fixed clipping causes an increase in gradient magnitude, while CENSOR raises the cost by 36.8% due to the extra dimensionality involved in orthogonal projection operations. Note: All communication and energy figures are aggregate totals over 100 training rounds (10 clients sampled per round), not per-round or per-client values.

https://doi.org/10.1371/journal.pone.0338822.g006

(2) Computational Resource Utilization

Resource efficiency is evaluated using Client Compute Time (CCT) and CPU/GPU utilization. The experimental results are summarized in Table 4.

Table 4. Resource utilization comparison (CIFAR-10 Task, ResNet-56).

https://doi.org/10.1371/journal.pone.0338822.t004

Three-dimensional comprehensive Pareto analysis.

The performance comparison results of five federated learning methods on the CIFAR-10 dataset are presented in Table 5 and visualized in Fig 7. As shown in Fig 7, DualMask dynamically balances the three objectives during training: it steadily improves test accuracy (utility), continuously decays the privacy budget (privacy), and maintains the lowest cumulative communication cost (efficiency). In contrast, FedAvg suffers from a constant privacy loss and slower convergence, highlighting the advantages of DualMask in simultaneously optimizing the privacy-utility-communication trilemma.

Fig 7. Dynamic privacy-utility-communication trade-off across communication rounds on CIFAR-10.

Shaded areas represent 95% confidence intervals over three runs. DualMask continuously decays the privacy budget while improving accuracy and maintaining the lowest cumulative communication cost; FedAvg exhibits constant privacy leakage and slower convergence.

https://doi.org/10.1371/journal.pone.0338822.g007

Table 5. Performance comparison of five federated learning methods on CIFAR-10.

https://doi.org/10.1371/journal.pone.0338822.t005

Ablation study

A systematic module ablation analysis was conducted to validate the contribution of each core component in the DualMask framework. All experiments were performed on the CIFAR-10 dataset (using ResNet-56) under the same federated learning environment.

Impact of the AONC component.

The experimental results regarding the impact of the AONC component are presented in Table 6.

Table 6. Ablation study of the AONC component (mean values over 100 rounds).

https://doi.org/10.1371/journal.pone.0338822.t006

Impact of server-side components (D3QN-PSO synergy).

The experimental results for the synergistic mechanism of D3QN and PSO are presented in Table 7.

Table 7. Analysis of server-side component synergy (CIFAR-10, 100 clients).

https://doi.org/10.1371/journal.pone.0338822.t007

Parameter sensitivity analysis

AONC parameter analysis.

To assess the robustness of the AONC framework, parameter sensitivity experiments were performed on the CIFAR-10 dataset with the ResNet-56 model. Three main parameters (the EMA coefficient $\beta$, the initial noise ratio $\sigma_0$, and the clipping quantile q) were each evaluated at five evenly spaced levels. The evaluation metrics comprised the average accuracy, attack success rate, privacy budget ($\varepsilon$), and communication cost calculated over 100 training rounds. Each experiment was repeated three times, and the results were averaged. The findings are presented in Table 8.

Table 8. AONC parameter sensitivity matrix (Mean over 100 rounds).

https://doi.org/10.1371/journal.pone.0338822.t008

D3QN and PSO parameter analysis.

Parameter sensitivity testing was conducted on the CelebA-HQ face-recognition task using a ResNet-18 model, with the D3QN-PSO module deployed on the server side. The test parameters included the D3QN learning rate, the PSO cognitive coefficient $c_1$, and the PSO particle number. Other parameters were fixed as follows: the discount factor, an experience-replay buffer size of 5000, and the inertia weight $\omega$. The results represent the average of three experimental runs, with detailed outcomes presented in Table 9.

Table 9. Server-side parameter sensitivity analysis (CelebA-HQ, ResNet-18).

https://doi.org/10.1371/journal.pone.0338822.t009

Conclusion

This paper tackles the core challenge of the privacy-utility-communication trilemma in federated learning—simultaneously ensuring privacy protection, model effectiveness, and communication efficiency—by introducing the DualMask framework, which achieves a breakthrough through a collaborative client-server approach. On the client side, it features an AONC that combines layer-wise gradient clipping with dynamic noise adjustment to strike a balance between strong privacy safeguards and efficient model convergence. On the server side, it incorporates D3QN multi-agent resource scheduling alongside PSO-based feature fusion to promote fair participation among resource-limited devices and reduce bias caused by non-IID data aggregation. The key innovation is the creation of a closed-loop optimization process spanning noise perturbation, resource scheduling, and gradient aggregation, where targeted noise injection blocks attack vectors while retaining critical learning information, adaptive resource allocation boosts edge-device involvement, and particle swarm optimization supports overall model convergence. Experiments confirm that this framework effectively balances robust privacy, high efficiency, and minimal communication overhead in sensitive applications, such as medical image analysis and financial risk management, offering a comprehensive solution for cross-domain data collaboration. Future directions include developing self-adaptive parameter search methods and expanding the framework to Transformer models and large-scale distributed networks with tens of thousands of nodes. This work provides a practical federated optimization strategy for privacy-preserving computation in resource-constrained settings, advancing the secure development of distributed AI.

References

  1. Ganguly B, Aggarwal V. Online federated learning via non-stationary detection and adaptation amidst concept drift. IEEE/ACM Trans Networking. 2024;32(1):643–53.
  2. Yang Q. AI and data privacy protection: the way to federated learning. Journal of Information Security Research. 2019;5(11):961–5.
  3. Wang Y, Tong Y, Shi D, Xu K. An efficient approach for cross-silo federated learning to rank. In: 2021 IEEE 37th International Conference on Data Engineering (ICDE). 2021. p. 1128–39. https://doi.org/10.1109/icde51399.2021.00102
  4. Chen W-N, Kairouz P, Ozgur A. Breaking the communication-privacy-accuracy trilemma. IEEE Trans Inform Theory. 2023;69(2):1261–81.
  5. Zhang X, Gu H, Fan L, Chen K, Yang Q. No free lunch theorem for security and utility in federated learning. ACM Trans Intell Syst Technol. 2022;14(1):1–35.
  6. Tan Y, Long G, Liu L, Zhou T, Lu Q, Jiang J, et al. FedProto: federated prototype learning across heterogeneous clients. AAAI. 2022;36(8):8432–40.
  7. Zheng L, Cao Y, Yoshikawa M, Shen Y, Rashed EA, Taura K, et al. Sensitivity-aware differential privacy for federated medical imaging. Sensors (Basel). 2025;25(9):2847. pmid:40363282
  8. Lin Y, Han S, Mao H, Wang Y, Dally WJ. Deep gradient compression: reducing the communication bandwidth for distributed training. 2020.
  9. Shi S, Wang Q, Zhao K, Tang Z, Wang Y, Huang X, et al. A distributed synchronous SGD algorithm with global top-k sparsification for low bandwidth networks. In: 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). 2019. p. 2238–47. https://doi.org/10.1109/icdcs.2019.00220
  10. Li X, Huang K, Yang W, Wang S, Zhang Z. On the convergence of FedAvg on non-IID data. 2020.
  11. Chen L, Yue D, Ding X, Wang Z, Choo K-KR, Jin H. Differentially private deep learning with dynamic privacy budget allocation and adaptive optimization. IEEE Trans Inform Forensic Secur. 2023;18:4422–35.
  12. Zhang X, Yang F, Guo Y, Yu H, Wang Z, Zhang Q. Adaptive differential privacy mechanism based on entropy theory for preserving deep neural networks. Mathematics. 2023;11(2):330.
  13. Hassan MA, Granelli F. Harnessing 1D-CNN for received power prediction in sub-6 GHz RIS: part I. In: 2025 IEEE International Conference on Communications Workshops (ICC Workshops). 2025. p. 917–22. https://doi.org/10.1109/iccworkshops67674.2025.11162261
  14. Hassan MA, Granelli F. Deep learning-driven optimal beam prediction for drone connectivity via flying-RIS and base station. In: 2025 IEEE 101st Vehicular Technology Conference (VTC2025-Spring). 2025. p. 1–6.
  15. Wang P, Zhong Z, Wang J. Efficient federated learning via aggregation of base models. PLoS One. 2025;20(8):e0327883. pmid:40811613
  16. Bokhari SM, Sohaib S, Shafi M. Fusion of personalized federated learning (PFL) with differential privacy (DP) learning for diagnosis of arrhythmia disease. PLoS One. 2025;20(7):e0327108. pmid:40644412
  17. Le D-D, Huynh T-N, Tran A-K, Dao M-S, Bao PT. FedNolowe: a normalized loss-based weighted aggregation strategy for robust federated learning in heterogeneous environments. PLoS One. 2025;20(8):e0322766. pmid:40811637
  18. Wang J, Yang K, Li M. NIDS-FGPA: a federated learning network intrusion detection algorithm based on secure aggregation of gradient similarity models. PLoS One. 2024;19(10):e0308639. pmid:39446819
  19. Tsouvalas V, Asano YM, Saeed A. Federated fine-tuning of vision foundation models via probabilistic masking. In: ICML 2024 Workshop on Foundation Models in the Wild. 2024.
  20. Koloskova A, Hendrikx H, Stich SU. Revisiting gradient clipping: stochastic bias and tight convergence guarantees. In: Proceedings of the 40th International Conference on Machine Learning. Vol. 202 of Proceedings of Machine Learning Research. PMLR; 2023. p. 17343–63.
  21. Hong J, Wang Z, Zhou J. Dynamic privacy budget allocation improves data efficiency of differentially private gradient descent. FAccT 2022. 2022:11–35. pmid:37084074
  22. Chen L, Yue D, Ding X, Wang Z, Choo K-KR, Jin H. Differentially private deep learning with dynamic privacy budget allocation and adaptive optimization. IEEE Trans Inform Forensic Secur. 2023;18:4422–35.
  23. Asi H, Duchi J, Fallah A, Javidbakht O, Talwar K. Private adaptive gradient methods for convex optimization. In: International Conference on Machine Learning. PMLR; 2021. p. 383–92.
  24. McMahan B, Moore E, Ramage D, Hampson S, Agüera y Arcas B. Communication-efficient learning of deep networks from decentralized data. In: Singh A, Zhu J, editors. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. Vol. 54 of Proceedings of Machine Learning Research. PMLR; 2017. p. 1273–82.
  25. Wang G, Zeng J, Xiao X, Wu S, Yang J, Zheng L, et al. FlashMask: efficient and rich mask extension of FlashAttention. 2025.
  26. Chen W-N, Kairouz P, Ozgur A. Breaking the communication-privacy-accuracy trilemma. IEEE Trans Inform Theory. 2023;69(2):1261–81.
  27. Gu H, Zhao X, Zhu G, Han Y, Kang Y, Fan L, et al. A theoretical analysis of efficiency constrained utility-privacy bi-objective optimization in federated learning. IEEE Trans Big Data. 2025;11(5):2503–16.
  28. Xu J, Du W, Jin Y, He W, Cheng R. Ternary compression for communication-efficient federated learning. IEEE Trans Neural Netw Learn Syst. 2022;33(3):1162–76. pmid:33296314
  29. Gu H, Zhao X, Zhu G, Han Y, Kang Y, Fan L, et al. A theoretical analysis of efficiency constrained utility-privacy bi-objective optimization in federated learning. IEEE Trans Big Data. 2025;11(5):2503–16.
  30. Jiang L, Ma L, Yang G. Shadow defense against gradient inversion attack in federated learning. Med Image Anal. 2025;105:103673. pmid:40570807
  31. Ranaweera K, Nguyen DC, Pathirana PN, Smith D, Ding M, Rakotoarivelo T, et al. Federated learning with differential privacy: an utility-enhanced approach. 2025.
  32. Chen D, Orekondy T, Fritz M. GS-WGAN: a gradient-sanitized approach for learning differentially private generators. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, editors. Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc.; 2020. p. 12673–84.
  33. Ye Y, You G, Fwu JK, Zhu X, Yang Q, Zhu Y. Channel pruning via optimal thresholding. In: International Conference on Neural Information Processing. Springer; 2020. p. 508–16.
  34. du Pin Calmon F, Gomez J, Kaissis G, Kulynych B, Troncoso C. Attack-aware noise calibration for differential privacy. In: Advances in Neural Information Processing Systems 37. 2024. p. 134868–901. https://doi.org/10.52202/079017-4286
  35. Duan J, Hu H, Ye Q, Sun X. Analyzing and optimizing perturbation of DP-SGD geometrically. In: 2025 IEEE 41st International Conference on Data Engineering (ICDE). 2025. p. 3439–52. https://doi.org/10.1109/icde65448.2025.00257
  36. Zhang Z, Gao Z, Guo Y, Gong Y. Heterogeneity-aware cooperative federated edge learning with adaptive computation and communication compression. IEEE Trans on Mobile Comput. 2025;24(3):2073–84.
  37. Mächler L, Grimberg G, et al. FedPID: an aggregation method for federated learning. arXiv preprint. 2024.