
FedNolowe: A normalized loss-based weighted aggregation strategy for robust federated learning in heterogeneous environments

  • Duy-Dong Le,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Industrial University of Ho Chi Minh City (IUH), Ho Chi Minh City, Vietnam

  • Tuong-Nguyen Huynh,

    Roles Methodology, Supervision, Writing – review & editing

    htnguyen@iuh.edu.vn

    Affiliation Industrial University of Ho Chi Minh City (IUH), Ho Chi Minh City, Vietnam

  • Anh-Khoa Tran,

    Roles Conceptualization, Methodology, Supervision, Writing – review & editing

    Affiliation National Institute of Information and Communications Technology (NICT), Koganei, Japan

  • Minh-Son Dao,

    Roles Supervision, Writing – review & editing

    Affiliation National Institute of Information and Communications Technology (NICT), Koganei, Japan

  • Pham The Bao

    Roles Supervision, Writing – review & editing

    Affiliation Saigon University (SGU), Ho Chi Minh City, Vietnam

Abstract

Federated Learning supports collaborative model training across distributed clients while keeping sensitive data decentralized. Still, non-independent and identically distributed data pose challenges like unstable convergence and client drift. We propose Federated Normalized Loss-based Weighted Aggregation (FedNolowe) (Code is available at https://github.com/dongld-2020/fednolowe), a new method that weights client contributions using normalized training losses, favoring those with lower losses to improve global model stability. Unlike prior methods tied to dataset sizes or resource-heavy techniques, FedNolowe employs a two-stage L1 normalization, reducing computational complexity by 40% in floating-point operations while matching state-of-the-art performance. A detailed sensitivity analysis shows our two-stage weighting maintains stability in heterogeneous settings by mitigating extreme loss impacts while remaining effective in independent and identically distributed scenarios.

Introduction

Federated Learning (FL) offers a groundbreaking framework for training machine learning models across a wide array of decentralized devices, introducing an effective strategy to protect data privacy by eliminating the reliance on centralized data storage systems [1]. This innovative method has gained traction in critical fields such as healthcare [2–4], smart agriculture [5], mobile computing [6], and blockchain technology [7], where safeguarding sensitive data and adhering to strict regulatory requirements are of utmost importance. Nevertheless, FL faces significant challenges, largely due to statistical heterogeneity caused by non-independent and identically distributed (non-i.i.d.) data across participating clients [8]. This diversity leads to pressing problems, such as inconsistent convergence, client drift, and the risk of producing biased global models [9, 10].

The seminal FedAvg (Federated Averaging, introduced by H. Brendan McMahan et al. (2017)) [1] algorithm aggregates client updates using weights proportional to local dataset sizes. One of the main advantages of FedAvg is its ability to reduce communication rounds by allowing multiple local updates before aggregation. However, FedAvg struggles with non-i.i.d data distributions among clients, leading to slower convergence and poor generalization [11, 12]. This limitation has spurred the development of advanced FL aggregation strategies and algorithms. FedProx (Federated Proximal Optimization, introduced by Tian Li et al. (2018)) [13] adds a proximal term to constrain local updates, mitigating drift but increasing computational cost. FedMa (Federated Matched Averaging, created by Hongyi Wang et al. (2020)) [14] employs layer-wise neuron matching to align heterogeneous models, achieving robust performance at the expense of scalability. More recent methods, such as FedAsl (Federated Learning with Auto-weighted Aggregation based on Standard Deviation of Training Loss, developed by Zahidur Talukder et al. (2022)) [15], FedLaw (Federated Learning with Learnable Aggregation Weights, devised by Zexi Li et al. (2023)) [16], and A-Flama (Accuracy-based Federated Learning with Adaptive Model Aggregation, proposed by Rebekah Wang et al. (2024)) [17], dynamically adjust weights based on loss statistics or server-side proxy datasets to improve performance. Still, these approaches add complexity or require server-side dependencies, making deployment challenging in resource-limited environments.

Drawing on concepts from correlation-based weighting [18] and FedAsl [15], this paper presents Federated Normalized Loss-based Weighted Aggregation (FedNolowe), a streamlined and efficient aggregation technique designed to address data heterogeneity in FL. FedNolowe assigns dynamic weights to clients based exclusively on their training losses, employing a two-stage L1 normalization process, also known as Least Absolute Deviations (LAD) or Manhattan normalization. In the first stage, the training losses are normalized to the open interval (0, 1) using the standard L1 norm. In the second stage, these normalized losses are inverted (by subtracting each normalized value from 1) and then subjected to L1 normalization again. This method enhances the influence of high-performing clients, is less sensitive to outliers, and maintains computational simplicity. Although FedNolowe does not eliminate the impact of clients with high noise, it significantly reduces their effect on the aggregation, fostering a more equitable and resilient FL system. Experimental results show that FedNolowe matches or surpasses existing state-of-the-art approaches. The primary contributions of this work are as follows:

  • We introduce a novel loss-based weighting mechanism that normalizes client losses in two stages, where the second stage applies a subtraction-based L1 normalization, ensuring robustness to non-i.i.d data without additional constraints or server-side resources.
  • We validate FedNolowe through extensive experiments on benchmark datasets (MNIST, Fashion-MNIST, and CIFAR-10), demonstrating competitive performance and up to 40% reduction in computational complexity compared to leading methods.
  • We conduct a detailed sensitivity analysis comparing our two-stage weighting scheme to alternative approaches. The result shows that it maintains stability in heterogeneous settings by mitigating the impact of extreme loss values while preserving effectiveness in i.i.d scenarios.
  • We provide a theoretical convergence analysis under standard FL assumptions, proving that FedNolowe converges to a stationary point of the global loss function.

The paper is structured as follows: Section Related Work reviews related work, Section Proposed FedNolowe details the FedNolowe methodology, Section Experiments Setup describes the experimental setup, Section Results presents results, and Section Conclusion concludes with future directions. The sensitivity analysis comparing the inversion methods in FedNolowe and FedAsl can be found in Appendix Sensitivity Analysis of FedNolowe and FedAsl, while the detailed analysis of FedNolowe's convergence is provided in Appendix Detail Convergence Analysis of FedNolowe.

Related work

FL has evolved significantly since the introduction of FedAvg by McMahan et al. [1], which aggregates client updates using weights proportional to dataset sizes, i.e., $w^{t+1} = \sum_{k \in S_t} \frac{n_k}{n} w_k^{t+1}$, where $w_k^{t+1}$ is the local model of client $k$ after training on its dataset of size $n_k$, $n = \sum_{k \in S_t} n_k$, and $S_t$ is the subset of clients that participate in training round $t$. While effective in reducing communication overhead by allowing multiple local updates, FedAvg struggles with non-i.i.d data, resulting in slow convergence and model bias [10, 11, 19]. This has prompted a rich body of research aimed at addressing statistical heterogeneity in FL.
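The size-proportional aggregation above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation; client models are represented as flattened parameter vectors:

```python
import numpy as np

def fedavg_aggregate(client_models, client_sizes):
    """FedAvg-style aggregation: weight each client's parameter vector
    by its share of the total data, p_k = n_k / n."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()          # p_k = n_k / n
    stacked = np.stack([np.asarray(m, dtype=float) for m in client_models])
    return weights @ stacked               # sum_k p_k * w_k

# A client holding 3x the data contributes 3x the weight:
models = [np.array([1.0, 1.0]), np.array([5.0, 5.0])]
global_model = fedavg_aggregate(models, [100, 300])  # -> [4.0, 4.0]
```

Because the weights depend only on dataset sizes, a client with skewed labels but many samples dominates the average, which is exactly the failure mode under non-i.i.d data discussed above.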

One prominent approach is FedProx [13], which mitigates client drift by adding a proximal term to the local objective, formulated as $\min_{w} F_k(w) + \frac{\mu}{2} \|w - w^t\|^2$, where $w^t$ is the global model and $\mu$ controls the regularization strength. The server then simply averages the updates, $w^{t+1} = \frac{1}{|S_t|} \sum_{k \in S_t} w_k^{t+1}$. While FedProx improves stability under non-i.i.d conditions, its additional computation per client increases overhead, particularly for resource-constrained devices [20]. Similarly, Scaffold (Stochastic Controlled Averaging for Federated Learning, conceived by Sai Praneeth Karimireddy et al. (2019)) [20] employs control variates to correct local gradients, achieving faster convergence but requiring persistent state maintenance across rounds, complicating implementation.

FedMa [14] aligns client models via layer-wise neuron matching using the Hungarian algorithm [21], aggregating weights as $w^{t+1} = \frac{1}{|S_t|} \sum_{k \in S_t} \Pi_k^{\top} w_k^{t+1}$, where $\Pi_k^{\top}$ is the transpose of the permutation matrix $\Pi_k$, used to rearrange client $k$'s weights to be consistent with the global model. This method excels with heterogeneous architectures but incurs a high computational cost, limiting scalability to large client pools or complex models [22].

Dynamic weighting strategies have also gained traction. FedAsl [15] dynamically assigns weights to client updates using the standard deviation of their training losses, with the global model updated as $w^{t+1} = \sum_{k \in S_t} p_k^t w_k^{t+1}$, where $p_k^t = d_k / \sum_{j \in S_t} d_j$. The term $d_k$ is set to $1/L_k$ if the client's loss $L_k$ falls within the "good region" $[\mu - \alpha\sigma, \mu + \beta\sigma]$, or to a fixed fallback value otherwise. Here, $\mu$ denotes the median loss across all clients, $\sigma$ is the standard deviation of the losses, and $\alpha$ and $\beta$ are tunable parameters. This method improves resilience against data heterogeneity, though it remains vulnerable to extreme outlier losses, which may distort the aggregation process. Its parameters must also be tuned to find appropriate settings for each dataset and model.

Other adaptive weighting methods, like FedLaw [16], learn two global parameters: a shrinkage factor $\gamma$ and a weight vector $\lambda$, both updated via gradient descent on a server-side proxy dataset. The final aggregation is given by $w^{t+1} = \gamma \sum_{k \in S_t} \lambda_k w_k^{t+1}$, where $\gamma$ controls global weight shrinkage and $\lambda_k$ represents the learned importance score of client $k$. FedA-Flama [17] uses accuracy-based weighting, where clients with higher accuracy on the server's test data are assigned higher weights. However, this method requires all clients to be validated against the server's full test dataset, increasing resource usage. Additionally, choosing an appropriate minimum aggregation-weight threshold and replacing client weights that fall outside it can introduce fairness issues in FL. While adaptive, it relies on validating clients with server-side data, which increases computational overhead. The work in [23] presents a dynamic node-matching FL scheme that outperforms FedAvg, but its complexity may limit scalability to many nodes or clients. Adaptive optimization methods, such as those in [24], refine aggregation by tuning hyperparameters like learning rates across clients, though they increase communication and computational burdens.

Recent efforts explore alternative paradigms. Personalized FL approaches [25, 26] decouple global and local objectives to tailor models to individual clients, sacrificing global generalization for personalization. Knowledge distillation-based methods [18, 27, 28] transfer learned representations across clients, but require a pre-trained teacher model built on curated public datasets. Another line of work using pre-trained models is [29], which applies a Genetic Algorithm (GA) to optimize FL for sports image classification; the GA improves inference time and reduces storage, but requires more hyperparameter tuning.

This research aims to implement a mechanism that influences model aggregation in FL without requiring additional data or validating client models. Accordingly, we will compare FedNolowe with existing methods, as presented in Table 1.

Table 1. Comparison of federated learning algorithms.

https://doi.org/10.1371/journal.pone.0322766.t001

Table 1 shows the mathematical formulations on the server and client sides, along with the pros, cons, and applications of state-of-the-art FL algorithms. Our proposed FedNolowe distinguishes itself by relying solely on training losses. It aims to handle the highly non-i.i.d data that FedAvg fails on, while avoiding the complexity of proximal terms in FedProx and neuron matching in FedMa. Unlike FedAsl, it employs a simpler two-stage normalization that mitigates outlier effects without statistical thresholds. Compared to other methods, it imposes no assumptions on optimization parameters, enhancing flexibility. Our approach thus bridges the gap between performance and efficiency, offering a scalable solution for heterogeneous FL, as validated theoretically and empirically in subsequent sections and appendices.

Proposed FedNolowe

Problem formulation

We consider an FL system with $N$ clients, indexed by $k \in \{1, \ldots, N\}$, each possessing a local dataset $\mathcal{D}_k$ with $n_k$ samples drawn from a non-i.i.d distribution $\mathcal{P}_k$. The total dataset size is $n = \sum_{k=1}^{N} n_k$. Each client $k$ trains a local model parameterized by $w_k$ to minimize its local loss function $F_k(w_k) = \frac{1}{n_k} \sum_{x \in \mathcal{D}_k} \ell(w_k; x)$, where $\ell(w_k; x)$ is the loss for a single data sample $x$. The global objective is to optimize a weighted combination of local losses:

$\min_{w} F(w) = \sum_{k=1}^{N} p_k F_k(w)$ (1)

where $p_k \ge 0$ and $\sum_{k=1}^{N} p_k = 1$ represent the contribution weights of client $k$. In FedAvg [1], $p_k = n_k / n$, which struggles under non-i.i.d conditions as local optima diverge from the global optimum [10, 35]. In FedAsl [15], $p_k \propto 1/L_k$ within a statistically defined "good region", which is sensitive to outlier losses. We discuss this limitation in Appendix Sensitivity Analysis of FedNolowe and FedAsl.

In practice, FL operates in communication rounds. At round $t$, a subset of clients $S_t$ is randomly selected, each performing local optimization on the current global model $w^t$. The server aggregates these updates to form $w^{t+1}$. Non-i.i.d data exacerbates client drift, where local updates deviate significantly from the global optimum, degrading performance [9]. FedNolowe addresses this by dynamically adjusting $p_k$ based on training loss, prioritizing clients that align better with the global objective.

FedNolowe weighting mechanism

To tackle heterogeneity, we propose FedNolowe, a two-stage weighting mechanism that leverages local training losses $L_k^t$ at round $t$. Unlike FedAvg's static weights or FedProx's proximal constraints [13], FedNolowe uses a loss-based approach inspired by correlation weighting in [18], avoiding complex statistical measures (e.g., FedAsl [15]) or server-side proxies (e.g., FedLaw [16], FedA-Flama [17]). Compared to FedAsl's division-based inversion $1/L_k$, FedNolowe introduces a fundamentally different two-stage normalization strategy. It first normalizes losses across clients to obtain $\tilde{L}_k^t = L_k^t / \sum_{j \in S_t} L_j^t$, then applies a subtraction-based inversion $1 - \tilde{L}_k^t$. This avoids the instability caused by division when losses are close to zero or highly skewed (conditions common in non-i.i.d settings), resulting in more stable and bounded weights.
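The contrast between the two inversion styles can be seen numerically. Below is a small NumPy sketch under the notation above; `eps` is an illustrative numerical guard, not part of either method:

```python
import numpy as np

def division_weights(losses, eps=1e-12):
    """FedAsl-style inversion: weight proportional to 1 / L_k."""
    inv = 1.0 / (np.asarray(losses, dtype=float) + eps)
    return inv / inv.sum()

def subtraction_weights(losses):
    """FedNolowe's two-stage L1 normalization: normalize losses, invert
    by subtraction from 1, then re-normalize so the weights sum to 1."""
    l = np.asarray(losses, dtype=float)
    normalized = l / l.sum()           # stage 1: scale-invariant, in (0, 1)
    inverted = 1.0 - normalized        # lower loss -> larger value
    return inverted / inverted.sum()   # stage 2: weights sum to 1

losses = [1e-6, 0.9, 1.0, 1.1]         # one client with a near-zero loss
w_div = division_weights(losses)       # near-zero loss captures ~100% of the weight
w_sub = subtraction_weights(losses)    # weights stay bounded (max ~0.33 here)
```

With $|S_t|$ selected clients, the stage-two denominator equals exactly $|S_t| - 1$, so no single weight can explode even when one loss approaches zero.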

Definition 1 (Computation and Normalized Loss Weights). For each client $k \in S_t$ at round $t$:

  1. Loss computation: Each local training loss is computed as:
     $L_k^t = \frac{1}{E} \sum_{e=1}^{E} \frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} \ell(w_k^t; x)$ (2)
     where $\mathcal{B}$ is a mini-batch of data from the local dataset of client $k$, and $e$ is a single epoch of the total $E$ epochs.
  2. Two-stage loss normalization:
     $p_k^t = \frac{1 - \tilde{L}_k^t}{\sum_{j \in S_t} (1 - \tilde{L}_j^t)}, \quad \tilde{L}_k^t = \frac{L_k^t}{\sum_{j \in S_t} L_j^t}$ (3)
     In Eq 3, the term on the right is the first-stage normalization, ensuring scale invariance across clients with varying loss magnitudes. The term on the left inverts the normalized loss, amplifying the influence of clients with lower losses, while the second-stage normalization ensures the weights sum to 1.

The resulting global update is:

(4)

This mechanism ensures that clients with lower $L_k^t$, indicative of better local optimization or less drift, contribute more to $w^{t+1}$, enhancing stability without additional computational burdens like neuron matching [14] or variance tracking [20]. Algorithm 1 outlines the full procedure.
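Applied layer by layer to model state dictionaries, the update in Eq 4 amounts to a convex combination of client parameters. A minimal sketch, using NumPy arrays in place of framework tensors:

```python
import numpy as np

def fednolowe_weights(losses):
    """Eq 3: two-stage L1 normalization of the reported training losses."""
    l = np.asarray(losses, dtype=float)
    inverted = 1.0 - l / l.sum()
    return inverted / inverted.sum()

def fednolowe_aggregate(client_states, losses):
    """Eq 4: w^{t+1} = sum_k p_k^t * w_k^{t+1}, applied per parameter tensor."""
    p = fednolowe_weights(losses)
    return {name: sum(p_k * state[name] for p_k, state in zip(p, client_states))
            for name in client_states[0]}

# The lower-loss client (0.2 vs 0.8) receives the larger weight:
states = [{"w": np.zeros(2)}, {"w": np.ones(2)}]
new_state = fednolowe_aggregate(states, losses=[0.2, 0.8])  # "w" -> [0.2, 0.2]
```

With two clients, the weights reduce to $1 - \tilde{L}_k^t$ exactly, so the client reporting loss 0.2 contributes with weight 0.8.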

Algorithm 1 FedNolowe: Normalized loss-based weighted aggregation.

In Algorithm 1, each training round begins with the server randomly selecting a subset of clients. These clients train their local models over multiple epochs on their respective datasets, returning updated models and corresponding loss values. FedNolowe's core strength is its elegant simplicity and flexibility: it dynamically emphasizes clients with superior local convergence by leveraging their training losses $L_k^t$, using only lightweight scalar operations to compute the weights (Eq 3). This approach differs markedly from FedAvg's static aggregation, FedProx's per-client proximal regularization, FedMa's computationally intensive neuron matching across $L$ layers with $d$ parameters, and FedAsl's statistical overhead. FedNolowe incurs a server-side complexity of $O(|S_t|d)$ for the weighted aggregation, augmented by a negligible $O(|S_t|)$ normalization cost, rendering it less resource-demanding than FedProx and FedMa while remaining comparable to FedAsl and FedAvg. We discuss this further in the subsection on computational efficiency.
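One communication round of the procedure can be sketched as follows. This is illustrative only: `local_train` is a stand-in for $E$ epochs of SGD on a client's private data, returning the updated parameter vector together with its training loss:

```python
import random
import numpy as np

def fednolowe_round(global_w, clients, fraction, rng):
    """One FedNolowe round: sample a client subset, train locally, then
    aggregate with two-stage L1-normalized loss weights (Eqs 3-4)."""
    selected = rng.sample(clients, max(1, int(fraction * len(clients))))
    results = [client.local_train(global_w) for client in selected]
    models, losses = zip(*results)
    l = np.asarray(losses, dtype=float)
    p = 1.0 - l / l.sum()        # stage 1 normalization + inversion
    p = p / p.sum()              # stage 2: weights sum to 1
    return sum(p_k * w for p_k, w in zip(p, models))
```

The server simply repeats this for $T$ rounds, broadcasting the returned model to the next round's selected clients; no per-client state or proxy data is kept between rounds.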

Convergence analysis

We analyze FedNolowe’s convergence under standard FL assumptions [13, 24, 35] and the methods in [36]. FedNolowe’s dynamic weighting mitigates non-i.i.d effects by prioritizing clients with lower losses, ensuring gradient alignment with the global objective.

Assumption 1 (L-smoothness). Each local loss $F_k$ is $L$-smooth, i.e., $\|\nabla F_k(w) - \nabla F_k(w')\| \le L \|w - w'\|$ for some $L > 0$, for all $w, w'$.

Assumption 2 (Bounded Gradient). The global loss gradient is bounded: $\|\nabla F(w)\| \le G$, where $G > 0$.

Assumption 3 (Finite Variance). Local stochastic gradients have bounded variance: $\mathbb{E}\|g_k(w) - \nabla F_k(w)\|^2 \le \sigma^2$.

Assumption 4 (Alignment of Weights). There exists a constant $c > 0$ such that

$\left\langle \nabla F(w^t), \sum_{k \in S_t} p_k^t \nabla F_k(w^t) \right\rangle \ge c\, \|\nabla F(w^t)\|^2,$

Theorem 1. Under Assumptions 1–4, FedNolowe converges to a stationary point, i.e., $\min_{t \le T} \mathbb{E}\|\nabla F(w^t)\|^2 \to 0$ as $T \to \infty$.

The detailed convergence proof is provided in the Appendix.

Experiments setup

To evaluate FedNolowe’s effectiveness, we conduct experiments on three benchmark datasets under non-i.i.d settings, comparing its performance and efficiency against the state-of-the-art FL methods. This section details the datasets, data partitioning, model architectures, training parameters, and evaluation metrics, ensuring reproducibility and robustness of results.

Datasets and non-i.i.d partitioning

We utilize three widely adopted datasets: MNIST [37], Fashion-MNIST [38], and CIFAR-10 [39], each consisting of 10 classes. MNIST contains 60,000 training and 10,000 test grayscale images of handwritten digits (28×28 pixels). Fashion-MNIST mirrors the structure of MNIST but features images of clothing items, making it a more challenging classification task. CIFAR-10 includes 50,000 training and 10,000 test RGB images (32×32 pixels) of various objects (e.g., airplanes, cars), with increased complexity due to color and semantic variations. We split the training data of these three datasets across the clients, as described in Figs 1 and 2, while keeping the test data on the server side to examine the performance of the experimental methods [14].

Fig 2. Sample size distributions across datasets.

https://doi.org/10.1371/journal.pone.0322766.g002

To simulate non-i.i.d data distributions, we partitioned each dataset across 50 clients using a Dirichlet distribution, a widely adopted method in federated learning research [19]. The concentration parameter $\alpha$ governs the degree of data heterogeneity, with larger values of $\alpha$ yielding a more nearly i.i.d partition. While FedMA [14] and FedProx [13] employed a fixed concentration value, we set a small $\alpha$ for MNIST to create highly heterogeneous distributions, and a moderate $\alpha$ for Fashion-MNIST and CIFAR-10, accounting for their greater complexity compared to MNIST. This setup results in clients receiving varying sample sizes and uneven class distributions, as depicted in Figs 1 and 2. These figures illustrate the skewed, non-uniform data allocation typical of real-world federated learning scenarios.
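The Dirichlet partitioning step can be sketched as follows. This is a common recipe in FL codebases rather than the authors' exact script; the toy label array, class counts, and seed are illustrative:

```python
import numpy as np

def dirichlet_partition(labels, num_clients=50, alpha=0.5, seed=0):
    """Split sample indices across clients: for each class, draw client
    proportions from Dirichlet(alpha) and slice that class accordingly.
    Smaller alpha -> more skewed (more non-i.i.d) class allocation."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

labels = np.repeat(np.arange(10), 600)  # toy stand-in: 10 classes, 600 samples each
parts = dirichlet_partition(labels, num_clients=50, alpha=0.5)
```

Every sample is assigned to exactly one client, but both the per-client sample counts and the per-client class mixtures come out uneven, matching the skew visible in Figs 1 and 2.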

For the class distributions in Fig 1, MNIST (left) exhibits significant heterogeneity, with most clients having uneven class proportions: Clients 10 and 49 show the highest diversity, containing 9 classes, while Clients 1, 2, and 38 have the least diversity, each dominated by a single class (3-red, 4-purple, and 0-blue, respectively). For Fashion-MNIST (middle), Clients 6, 22, 3, 41, 47, and 49 are the most diverse, featuring 9 classes, whereas Client 29 is the least diverse, consisting almost entirely of Class 0-blue with negligible contributions from others. In CIFAR-10 (right), Clients 3 and 49 display the greatest diversity across all 10 classes, while Client 8 is the least diverse, dominated by Class 1-yellow. These patterns underscore the non-i.i.d nature of the data, with varying class concentrations across clients.

For the sample distributions in Fig 2, the allocation is highly uneven across all datasets due to the Dirichlet distribution. In MNIST, Client 41 has the most samples (3,997) and Client 48 the fewest (32). For Fashion-MNIST, Client 11 peaks at 2,070 samples, while Client 18 has the minimum (35). In CIFAR-10, Client 4 has the most (2,436) and Client 34 the fewest (149). This variability reflects the non-uniform sample distribution characteristic of real-world heterogeneous FL environments.

Data preprocessing follows standard protocols. MNIST and Fashion-MNIST images are normalized to [0,1] with means 0.1307 and 0.2860, and standard deviations 0.3081 and 0.3530, respectively. Fashion-MNIST training data is augmented with random horizontal flips (probability 0.5). CIFAR-10 images are augmented with random crops (padding 4), horizontal flips, and color jitter (brightness, contrast, saturation = 0.2), then normalized using per-channel means (0.4914, 0.4822, 0.4465) and standard deviations (0.2023, 0.1994, 0.2010).
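The normalization step maps raw pixels into standardized inputs using the statistics above. A NumPy sketch mirroring the effect of torchvision's `ToTensor` followed by `Normalize` (augmentations omitted):

```python
import numpy as np

# Normalization statistics quoted above
MNIST_MEAN, MNIST_STD = 0.1307, 0.3081
CIFAR10_MEAN = np.array([0.4914, 0.4822, 0.4465])
CIFAR10_STD = np.array([0.2023, 0.1994, 0.2010])

def normalize(image_uint8, mean, std):
    """Scale uint8 pixels to [0, 1], then standardize (per channel when
    mean/std are per-channel arrays)."""
    x = np.asarray(image_uint8, dtype=np.float32) / 255.0
    return (x - mean) / std

white = np.full((28, 28), 255, dtype=np.uint8)   # an all-white toy image
z = normalize(white, MNIST_MEAN, MNIST_STD)      # each pixel -> (1.0 - 0.1307) / 0.3081
```

For CIFAR-10, passing an H×W×3 array lets NumPy broadcasting apply the three per-channel means and standard deviations along the last axis.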

Model architectures

We employ three convolutional neural networks tailored to the complexity of the datasets: LeNet-5 for MNIST, a custom CNN for Fashion-MNIST, and VGG-9 for CIFAR-10. These architectures incorporate batch normalization (BN) and dropout to enhance generalization and mitigate overfitting, balancing computational efficiency and representational capacity under FL’s resource constraints.

  • LeNet-5 [40]: Designed for MNIST, this model comprises two convolutional layers and three fully connected layers. The first convolutional layer accepts 1 input channel (grayscale) and produces 32 output channels using a 5×5 kernel, followed by BN, ReLU activation, and 2×2 max-pooling, reducing the spatial dimensions from 28×28 to 14×14. The second convolutional layer takes 32 input channels, yields 64 output channels with a 5×5 kernel, and applies BN, ReLU, and max-pooling, resulting in 64 feature maps of size 4×4. These are flattened into a 1,024-dimensional vector (64 × 4 × 4) for the fully connected layers. The first fully connected layer (FC1) maps 1,024 inputs to 256 units with ReLU activation and a dropout layer (probability 0.5). The second fully connected layer (FC2) reduces this to 128 units with ReLU activation, followed by the output layer (FC3) mapping to 10 units for the 10 digit classes, without an activation function before softmax computation in the loss function.
  • Custom CNN: Developed for Fashion-MNIST, this network features three convolutional layers and two fully connected layers. The first convolutional layer processes 1 input channel into 32 output channels with a 3×3 kernel and padding of 1, followed by BN, ReLU, and 2×2 max-pooling, reducing the spatial size from 28×28 to 14×14. The second convolutional layer maps 32 channels to 64 with a 3×3 kernel, BN, ReLU, and max-pooling, yielding 7×7 feature maps. The third convolutional layer increases to 128 channels with a 3×3 kernel, BN, ReLU, and max-pooling, producing 128 feature maps of size 3×3. These are flattened into a 1,152-dimensional vector (128 × 3 × 3), feeding into a fully connected layer (FC1) of 512 units with ReLU and dropout (probability 0.3), followed by an output layer (FC2) of 10 units without additional activation.
  • VGG-9 [41]: Applied to CIFAR-10, this model consists of three blocks of convolutional layers followed by three fully connected layers. Each block contains two convolutional layers with 3×3 kernels and padding of 1, increasing channels from 3 to 64 (Block 1), 64 to 128 (Block 2), and 128 to 256 (Block 3), with BN and ReLU after each convolution. Block 1 ends with 2×2 max-pooling (32×32 to 16×16), Block 2 reduces to 8×8, and Block 3 to 4×4, yielding 256 feature maps of size 4×4. These are flattened into a 4,096-dimensional vector (256 × 4 × 4), processed by a fully connected layer (FC1) of 512 units with ReLU and dropout (probability 0.5), a second fully connected layer (FC2) of 512 units with ReLU and dropout (0.5), and an output layer (FC3) of 10 units.
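The flattened dimensions quoted above follow from the standard convolution/pooling size arithmetic. A quick check, assuming stride-1 same-padding 3×3 convolutions and 2×2 pooling as described for the custom CNN and VGG-9:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution: floor((s + 2p - k) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def block(size, num_convs=1, kernel=3, padding=1, pool=2):
    """One conv block: same-padding convolutions followed by max-pooling."""
    for _ in range(num_convs):
        size = conv_out(size, kernel, padding=padding)
    return size // pool

# Custom CNN on 28x28 Fashion-MNIST: 28 -> 14 -> 7 -> 3, flatten 128*3*3
fmnist_flat = 128 * block(block(block(28))) ** 2

# VGG-9 on 32x32 CIFAR-10: 32 -> 16 -> 8 -> 4, flatten 256*4*4
cifar_flat = 256 * block(block(block(32, 2), 2), 2) ** 2
```

This reproduces the 1,152- and 4,096-dimensional flatten vectors fed into FC1 of the custom CNN and VGG-9, respectively.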

Training parameters and evaluation metrics

Each experiment simulates an FL system with 50 clients over $T$ communication rounds, randomly selecting a fraction $C$ of clients per round (see Table 2 for the dataset-specific values). Local training uses stochastic gradient descent (SGD) with learning rate $\eta$, momentum 0.9, weight decay 0.001, and batch size 32. Local epochs $E$ are set to 2 (MNIST), 3 (Fashion-MNIST), and 5 (CIFAR-10), reflecting the higher number of epochs needed for more challenging datasets and more complex model architectures. Communication rounds are T = 50 for MNIST and T = 100 for Fashion-MNIST and CIFAR-10. Table 2 summarizes the key parameters.

Table 2. Training parameters across datasets.

https://doi.org/10.1371/journal.pone.0322766.t002

We benchmark FedNolowe against four state-of-the-art FL baselines: FedAvg, which aggregates updates weighted by local dataset sizes; FedProx, incorporating a proximal term with coefficient $\mu$ to mitigate drift; FedMa, utilizing layer-wise neuron matching via the Hungarian algorithm for heterogeneous models; and FedAsl, employing loss-deviation weights with the parameter settings that gave the best experimental results, as presented in Fig 9(b) of [15]. Performance is evaluated using four metrics computed after each communication round: training loss, averaging the local loss across selected clients; validation loss, assessing the global model on the test set; accuracy, measuring Top-1 classification accuracy on the global test set; and F1-score, the harmonic mean of precision and recall, accounting for class imbalance in non-i.i.d settings. To obtain a comprehensive perspective across all training rounds, we compute the average of these metrics over all rounds rather than relying solely on the final round's values.
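The round-averaged evaluation can be sketched in plain Python. For brevity this uses macro-averaged F1 and toy predictions rather than real model output:

```python
def f1_for_class(y_true, y_pred, cls):
    """Per-class F1 from true-positive, false-positive, false-negative counts."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def evaluate_round(y_true, y_pred):
    """Top-1 accuracy and macro-F1 for one communication round."""
    classes = sorted(set(y_true))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro_f1 = sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
    return accuracy, macro_f1

# Average over all rounds rather than reporting only the final round:
rounds = [([0, 0, 1, 1], [0, 0, 1, 0]), ([0, 1, 1, 1], [0, 1, 1, 1])]
per_round = [evaluate_round(t, p) for t, p in rounds]
mean_acc = sum(a for a, _ in per_round) / len(per_round)   # (0.75 + 1.0) / 2
```

Averaging over every round rewards methods that converge stably throughout training, not just those that happen to end on a good round.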

Computational efficiency is measured as floating-point operations (FLOPs) per round using PyTorch's profiler [42, 43]. Local FLOPs are derived from forward and backward passes over E epochs on each client, while aggregation FLOPs account for method-specific operations: weighted averaging (FedAvg, FedNolowe), proximal term computation (FedProx), neuron matching (FedMa), and loss statistics (FedAsl).

Results

We assess the performance of FedNolowe in comparison to four baseline approaches: FedAvg [1], FedProx [13], FedMa [14], and FedAsl [15], across the MNIST, Fashion-MNIST, and CIFAR-10 datasets under non-i.i.d conditions. Results are presented as averages over three independent runs, emphasizing training and validation loss, accuracy, F1-score, and computational efficiency measured in FLOPs. In the following subsections, we provide graphical comparisons of the progression of training loss (subfigure a), validation loss (subfigure b), and accuracy (subfigure c) for each fraction of randomly selected clients, followed by tables of average metrics, including F1-score, across all training rounds. The figures and tables highlight performance trends and mean values, with FedNolowe consistently achieving a strong balance between effectiveness and efficiency.

Experiment 1: MNIST

Fig 3 plots the results over 50 rounds for a 10% client fraction on the MNIST dataset. As shown, for all three metrics (training loss, validation loss, and accuracy), FedAvg (yellow) and FedAsl (purple) experience significant fluctuations, while FedNolowe (blue), FedProx (green), and FedMa (red) exhibit similarly stable progress. The specific averages of the training values are provided in Table 3, where a more detailed analysis is presented.

Fig 3. Performance on MNIST with 10% of clients per round: (a) training loss, (b) validation loss, (c) accuracy.

https://doi.org/10.1371/journal.pone.0322766.g003

Table 3. Mean metrics on MNIST across client fractions (Std with rolling window = 5).

https://doi.org/10.1371/journal.pone.0322766.t003

Fig 4 shows the results over 50 rounds for a 20% client fraction on the MNIST dataset. FedAsl (purple) demonstrates fewer fluctuations than in the previous result (Fig 3). FedAvg (yellow) still exhibits significant oscillations in training loss, validation loss, and accuracy. In contrast, FedNolowe (blue), FedProx (green), and FedMa (red) show stable progress with similar performance trends, indicating better convergence. The specific averages of the training values are provided in Table 3.

Fig 4. Performance on MNIST with 20% of clients per round: (a) training loss, (b) validation loss, (c) accuracy.

https://doi.org/10.1371/journal.pone.0322766.g004

Fig 5 presents the results over 50 rounds for a 30% client fraction on the MNIST dataset. The performance trends are similar to those observed in the 20% scenario (Fig 4). FedAvg (yellow) still exhibits some fluctuation in training loss, validation loss, and accuracy, while the other algorithms, FedNolowe (blue), FedProx (green), and FedMa (red), show more stable progress. However, with the increased client fraction, all algorithms have become more stable compared to the 10% and 20% scenarios (Figs 3 and 4, respectively), indicating that sampling more clients per round helps reduce fluctuations and improve convergence. These results further emphasize the benefit of increasing the number of participating clients to achieve more stable and reliable FL performance. The specific averages of the training values are provided in Table 3.

Fig 5. Performance on MNIST with 30% of clients per round: (a) training loss, (b) validation loss, (c) accuracy.

https://doi.org/10.1371/journal.pone.0322766.g005

Table 3 shows the mean performance of the FL algorithms on the MNIST dataset across different client fractions (10%, 20%, and 30%). FedProx and FedMa consistently achieve the lowest training and validation losses, with FedProx slightly leading in accuracy and F1-score. FedNolowe performs comparably, with slightly higher losses and an accuracy of 91.54%. In contrast, FedAsl and FedAvg lag behind: FedAsl exhibits higher losses and marginally lower accuracy, while FedAvg shows the greatest instability, with the highest training loss (0.17 ± 0.09) and the lowest accuracy (77.60%).

With these results, it can be concluded that on the MNIST dataset (non-i.i.d), FedProx and FedMa deliver the best performance, with FedNolowe following closely behind. FedNolowe shows competitive results, especially in accuracy and F1-score, although it slightly trails FedProx and FedMa. In contrast, FedAsl and FedAvg are less effective, particularly at lower client fractions, with FedAvg showing significant instability.

Experiment 2: Fashion-MNIST

Figs 6, 7, and 8 show the performance of the FL algorithms on the Fashion-MNIST dataset with client fractions of 10%, 20%, and 30%, respectively, across 100 communication rounds. The performance of the five algorithms follows a similar pattern to the results on MNIST. FedNolowe, FedProx, and FedMa maintain comparable performance, showing stable training and validation losses as well as high accuracy. On the other hand, FedAvg and FedAsl exhibit more fluctuations and less stability, particularly in the early rounds. However, the performance gap between the two groups is not as large as observed on MNIST, indicating that while FedAvg and FedAsl are still less stable, their relative disadvantage is less pronounced on Fashion-MNIST. Table 4 reports more details.

Fig 6. Performance on Fashion-MNIST with clients per round: (a) training loss, (b) validation loss, (c) accuracy.

https://doi.org/10.1371/journal.pone.0322766.g006

Fig 7. Performance on Fashion-MNIST with clients per round: (a) training loss, (b) validation loss, (c) accuracy.

https://doi.org/10.1371/journal.pone.0322766.g007

Fig 8. Performance on Fashion-MNIST with clients per round: (a) training loss, (b) validation loss, (c) accuracy.

https://doi.org/10.1371/journal.pone.0322766.g008

Table 4. Mean Metrics on Fashion-MNIST across client fractions (Std with rolling window = 5).

https://doi.org/10.1371/journal.pone.0322766.t004

Table 4 shows the performance of the FL algorithms on the Fashion-MNIST dataset with client fractions of 10%, 20%, and 30%. FedProx and FedMa consistently outperform the other algorithms, achieving the lowest training and validation losses (0.10 ± 0.01 and 0.38 ± 0.05 at 30%) and the highest F1-score (0.86 ± 0.02) and accuracy (86.63%). FedNolowe follows closely behind, with slightly higher losses, an F1-score of 0.86 ± 0.02, and an accuracy of 86.50%. FedAvg and FedAsl show weaker performance, with FedAsl in particular struggling with higher losses and lower accuracy (84.54%), highlighting its instability compared to the others.

Experiment 3: CIFAR-10

Figs 9, 10, and 11 and Table 5 show results over 100 rounds at the three client fractions. FedNolowe and FedProx lead, with FedNolowe achieving the lowest validation loss. FedNolowe's accuracy (67.61%) matches FedProx's and is 8.46 percentage points higher than FedMa's (1.14 ± 0.05 loss, 59.15% accuracy). FedAvg and FedAsl exhibit moderate performance, while FedMa struggles due to its handling of the BN layers in VGG-9 [14].

Fig 9. Performance on CIFAR-10 with clients per round: (a) training loss, (b) validation loss, (c) accuracy.

https://doi.org/10.1371/journal.pone.0322766.g009

Fig 10. Performance on CIFAR-10 with clients per round: (a) training loss, (b) validation loss, (c) accuracy.

https://doi.org/10.1371/journal.pone.0322766.g010

Fig 11. Performance on CIFAR-10 with clients per round: (a) training loss, (b) validation loss, (c) accuracy.

https://doi.org/10.1371/journal.pone.0322766.g011

Table 5. Mean Metrics on CIFAR-10 Across Client Fractions (Std with rolling window = 5).

https://doi.org/10.1371/journal.pone.0322766.t005

In this experiment, the performance of FedMa is notably inferior to FedProx on the CIFAR-10 dataset, with a higher validation loss and lower accuracy. This contrasts with the results on MNIST and Fashion-MNIST, where FedMa matched or exceeded FedProx, as shown in Tables 3 and 4, and aligns with findings in [14]. The primary reason for this discrepancy lies in the model architecture used for CIFAR-10: the VGG-9 network incorporates BN layers, which were ignored in [14].

Computational efficiency

We evaluate the computational efficiency of FedNolowe against FedAvg [1], FedProx [13], FedMa [14], and FedAsl [15] by measuring total floating-point operations (FLOPs [43]) per communication round, using PyTorch's profiler [42]. This metric accounts for both local training (forward and backward passes over E epochs) and the server-side aggregation operations specific to each method. Table 6 presents the total FLOPs for each model and client fraction, demonstrating FedNolowe's efficiency across diverse architectures and participation levels.

Table 6. Total FLOPs per round (in millions) across models and client fractions.

https://doi.org/10.1371/journal.pone.0322766.t006

In Table 6, FedNolowe exhibits robust computational efficiency, matching the FLOPs of FedAvg and FedAsl while significantly outperforming FedProx and FedMa. Specifically, FedNolowe reduces FLOPs by 17.55% to 18.95% compared to FedProx and by 21.26% to 40.10% compared to FedMa across all tested models and client fractions. For example, with LeNet-5 at a 30% client fraction, FedNolowe maintains 34.3 million FLOPs, whereas FedProx requires 41.6 million and FedMa demands 48.4 million. Similarly, for VGG-9 with 40% client participation, FedNolowe utilizes 733.0 million FLOPs, compared to 834.0 million for FedProx and 931.0 million for FedMa. These efficiency gains are driven by FedNolowe's streamlined approach, which avoids the resource-intensive proximal term of FedProx and the complex neuron-matching process of FedMa, aligning with the needs of resource-constrained FL environments.
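The quoted percentage reductions follow directly from the per-round FLOP counts; a minimal sanity check of the LeNet-5 figures above (34.3M for FedNolowe versus 41.6M for FedProx and 48.4M for FedMa; the helper function is ours for illustration):

```python
def flop_reduction_pct(ours: float, baseline: float) -> float:
    """Percentage reduction in FLOPs of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline * 100

# LeNet-5 at a 30% client fraction (values in millions of FLOPs, from Table 6)
vs_fedprox = flop_reduction_pct(34.3, 41.6)
vs_fedma = flop_reduction_pct(34.3, 48.4)
print(f"vs FedProx: {vs_fedprox:.2f}%")  # 17.55%, matching the reported lower bound
print(f"vs FedMa:   {vs_fedma:.2f}%")    # within the reported 21.26%-40.10% range
```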

Conclusion

In this paper, we introduce FedNolowe, a new aggregation method for FL that handles non-i.i.d data by dynamically adjusting client contributions using a two-step loss normalization. The method supports global model stability by giving greater weight to high-performing clients, and we provide a theoretical proof of convergence to a stationary point. Experiments on MNIST, Fashion-MNIST, and CIFAR-10 demonstrate its effectiveness, reducing computational complexity by up to 40% compared to state-of-the-art methods across different models. A comprehensive sensitivity analysis confirms competitive performance in heterogeneous settings while remaining effective in i.i.d scenarios. Because its design depends only on training losses, FedNolowe can be deployed in constrained environments, enabling further research into variance-aware optimizations and practical use cases.

The inquiry, however, does not exhaust the spectrum of possibilities in this domain. Numerous techniques for normalizing and inverting losses in weighted aggregation merit consideration. For instance, normalization strategies might encompass min-max scaling, mean centering, or alternative approaches, while inversion methods could include exponential transformations, logarithmic adjustments, rank-based reweighting, and beyond. Investigating these diverse normalization and inversion frameworks presents a compelling avenue for future research. Furthermore, integrating loss-based weighting with feedback mechanisms in FL, as suggested in the recent survey by Le et al. (2024) [44], offers a promising opportunity to mitigate communication overhead, warranting further exploration in subsequent studies.
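To make the design space concrete, the sketch below contrasts two of the candidate transforms mentioned above: min-max scaling as a normalization strategy and an exponential transformation as a loss-inversion method. The function names and the specific choice of exp(−ℓ) are illustrative assumptions, not part of FedNolowe:

```python
import math

def min_max_scale(losses):
    """One candidate normalization strategy: rescale losses to [0, 1]."""
    lo, hi = min(losses), max(losses)
    return [(l - lo) / (hi - lo) for l in losses]

def exp_inverted_weights(losses):
    """One candidate inversion method: exponential transform, so that a
    lower loss yields a larger weight, renormalized to sum to 1."""
    scores = [math.exp(-l) for l in losses]
    total = sum(scores)
    return [s / total for s in scores]

client_losses = [0.2, 0.5, 0.8]
print(min_max_scale(client_losses))        # approximately [0.0, 0.5, 1.0]
print(exp_inverted_weights(client_losses)) # decreasing weights summing to 1
```

Either transform could be dropped into the aggregation step in place of the two-stage L1 normalization; comparing their stability under heterogeneity is exactly the future work suggested above.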

Appendices

Sensitivity analysis of FedNolowe and FedAsl

A common way to normalize an array is to divide each element by the sum (Eq 5, division-based), a method employed in FedAsl [15, 18]. This approach increases the aggregation weight of clients with lower losses [15] or correlations [18]. While effective in i.i.d scenarios, where the normalized values remain stable, it becomes unstable in non-i.i.d settings compared to the subtraction-based approach (Eq 3) used in FedNolowe. Figs 12, 13, 14, and 15 compare both methods.

Fig 12. Validation loss comparison with Dirichlet (extremely non-i.i.d).

https://doi.org/10.1371/journal.pone.0322766.g012

Fig 13. Validation loss comparison with Dirichlet (highly non-i.i.d).

https://doi.org/10.1371/journal.pone.0322766.g013

Fig 14. Validation loss comparison with Dirichlet (moderately non-i.i.d).

https://doi.org/10.1371/journal.pone.0322766.g014

Fig 15. Validation loss comparison with Dirichlet (nearly i.i.d).

https://doi.org/10.1371/journal.pone.0322766.g015

Figs 12, 13, 14, and 15 present validation loss comparisons across three methods: FedNolowe (blue), FedAsl with 1 − dk weighting adapted from FedNolowe (orange), and the original FedAsl with 1/dk weighting [15] (green), under varying degrees of data heterogeneity.

In the extremely and highly heterogeneous scenarios (Figs 12 and 13), the subtraction-based methods, especially FedNolowe, consistently achieve lower and more stable validation loss. In contrast, the division-based approach shows large fluctuations, particularly during the early and middle rounds (e.g., rounds 0, 10, and 21–25), due to its sensitivity to skewed data distributions.

As the distribution becomes more balanced (Fig 14), all methods converge quickly, but the subtraction-based variants still exhibit slightly smoother learning curves. In the nearly i.i.d. case (Fig 15), the performance differences become negligible; all methods perform similarly, with rapid convergence and minimal variance.

These trends highlight the robustness of the subtraction-based formulation in non-i.i.d. settings, while confirming that the division-based approach remains valid and effective in i.i.d. scenarios—consistent with the theoretical findings discussed later.
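The contrast in Figs 12–15 can be reproduced numerically. The snippet below is a sketch under the assumption that the division-based rule weights clients proportionally to 1/ℓ_k and the subtraction-based rule proportionally to 1 − d_k, with d_k the L1-normalized loss, as described above for the figure comparison; the exact forms of Eqs 3 and 5 may differ in detail. With one near-zero client loss, the division-based weight collapses onto that client, while the subtraction-based weight stays bounded by 1/(K − 1):

```python
def division_weights(losses):
    """Division-based rule (as in FedAsl): weight proportional to 1/loss."""
    inv = [1.0 / l for l in losses]
    total = sum(inv)
    return [v / total for v in inv]

def subtraction_weights(losses):
    """Subtraction-based rule (as in FedNolowe): L1-normalize the losses to
    get d_k, then weight by 1 - d_k, renormalized to sum to 1."""
    s = sum(losses)
    d = [l / s for l in losses]
    comp = [1.0 - dk for dk in d]  # these complements sum to K - 1
    total = sum(comp)
    return [c / total for c in comp]

# One client with a near-zero loss among K = 4 clients
losses = [1e-6, 1.0, 1.0, 1.0]
print(division_weights(losses)[0])    # close to 1: the low-loss client dominates
print(subtraction_weights(losses)[0]) # close to 1/3: bounded by 1/(K - 1)
```

This mirrors the boundary-behavior analysis below: the division-based weight is unbounded in its sensitivity as a loss approaches zero, while the subtraction-based weight remains bounded.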

In the rest of this section, we provide a mathematical proof showing that our proposed weight assignment, defined in Eq 3 (the subtraction-based approach used in FedNolowe), offers greater stability than the division-based approach of FedAsl in Eq 5.

(5)

Step 1: Sensitivity analysis via partial derivatives

For the subtraction-based approach (Eq 3).

Let and , so . The sensitivity with respect to is:

(6)

where and (since g depends on through one term).

(7)

The magnitude is bounded: , since and g>0.

For the division-based approach (Eq 5).

The weight is:

Let and , so . The sensitivity is:

(8)

where and .

(9)

The magnitude diverges as :

Step 2: Boundary behavior.

Small loss ().

Under the subtraction-based approach, the weight remains finite and bounded. Under the division-based approach, the weight approaches 1 (if the other losses are non-zero), indicating instability because no room is left for the other clients.

Large loss ().

Under the subtraction-based approach, the weight is driven to zero, robustly eliminating outliers. Under the division-based approach, the weight decays more slowly, so the outlier retains influence.

Step 3: Variance analysis

Under the subtraction-based approach, the variance of the weights is moderate. Under the division-based approach, the variance can be high, amplified when losses approach zero.

Step 4: Stability in non-i.i.d vs. i.i.d contexts

The stability of the weight assignments depends on the data distribution across clients.

Non-i.i.d stability.

In non-i.i.d settings, client losses vary significantly due to heterogeneous data, and the division-based approach's sensitivity to small losses leads to instability. Mathematically, as a client's loss approaches zero, its reciprocal dominates the denominator, causing that client's weight to approach 1, which skews the aggregation disproportionately.

I.i.d effectiveness.

In i.i.d settings, client losses are similar, reducing the variance of the weights. The division-based approach's sensitivity is mitigated, as the loss values are close, preventing extreme weight imbalances. The Lipschitz continuity of the weights with respect to the losses holds better, ensuring stable aggregation.

Remark

The subtraction-based approach is more stable in non-i.i.d settings due to its bounded sensitivity and robustness to loss variations. The division-based approach, while unstable in non-i.i.d settings owing to its sensitivity to low losses, remains effective in i.i.d settings where losses are consistent and outliers are rare.

Detailed convergence analysis of FedNolowe

Empirical validation of gradient alignment

To support Assumption 4, we empirically evaluate the alignment between the aggregated weighted client gradients and the global gradient on the MNIST, Fashion-MNIST, and CIFAR-10 datasets.

At each communication round t, we compute the cosine similarity between the aggregated weighted client gradient and the full-batch global gradient.

We report the average and standard deviation of this similarity across 50 rounds with C = 5 clients per round, each client performing 2 local epochs, on MNIST, Fashion-MNIST, and CIFAR-10 under Dirichlet partitioning, as follows:

  • MNIST: 0.3331 ± 0.0596
  • Fashion-MNIST: 0.3421 ± 0.0516
  • CIFAR-10: 0.3817 ± 0.0457

The cosine similarity remains consistently positive across all datasets. This empirical evidence supports the validity of Assumption 4, indicating that FedNolowe maintains strong gradient alignment during training, despite non-i.i.d client data.
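The statistic above can be computed per round by flattening both gradients into vectors and applying the standard cosine-similarity formula; a minimal sketch, with toy vectors and illustrative aggregation weights standing in for the real model gradients:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = <u, v> / (||u|| * ||v||) between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy example: aggregated weighted client gradient vs. full-batch global gradient
client_grads = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]
weights = [0.5, 0.5]  # hypothetical aggregation weights for the sampled clients
aggregated = [sum(w * g[i] for w, g in zip(weights, client_grads)) for i in range(3)]
global_grad = [1.0, 1.0, 3.0]
print(cosine_similarity(aggregated, global_grad))  # positive => aligned descent direction
```

A consistently positive value of this quantity over rounds is what Assumption 4 requires.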

Proofs

Here, using Assumptions 1–4, we prove FedNolowe's convergence as follows. Each client performs one stochastic gradient descent (SGD) step starting from the global model, using the stochastic gradient of its local loss. The global update is:

(10)

Thus, the change is:

(11)

Using Assumption 1 (L-smoothness) on the global loss:

(12)

Substituting (11) into (12):

(13)

Taking expectations:

(14)

Gradient Term: Since , and using Assumption 4:

(15)

noting that depends on the loss, which is stochastic, but we assume the expectation aligns with the global gradient as per Assumption 4.

Variance Term: Expand the expectation:

(16)

where we bound using Assumption 2, and from Assumption 3. Since , by Cauchy-Schwarz, , so:

(17)

Combining terms:

(18)

Summing over T rounds:

(19)

Since the global loss is bounded below by 0, we can rearrange:

(20)

Choosing , the right-hand side becomes:

(21)

proving , hence .

References

  1. McMahan B, Moore E, Ramage D, Hampson S, y Arcas BA. Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics. PMLR; 2017. p. 1273–82.
  2. Xu J, Glicksberg BS, Su C, Walker P, Bian J, Wang F. Federated learning for healthcare informatics. J Healthc Inform Res. 2021;5(1):1–19. pmid:33204939
  3. Guan H, Yap P-T, Bozoki A, Liu M. Federated learning for medical image analysis: a survey. Pattern Recognit. 2024;151:110424. pmid:38559674
  4. Abdul Salam M, Taha S, Ramadan M. COVID-19 detection using federated machine learning. PLoS One. 2021;16(6):e0252573. pmid:34101762
  5. Le DD, Dao MS, Tran AK, Nguyen TB, Le-Thi HG. Federated learning in smart agriculture: an overview. In: 2023 15th International Conference on Knowledge and Systems Engineering (KSE). IEEE; 2023. p. 1–4.
  6. Hard A, Rao K, Mathews R, Beaufays F, Augenstein S, Eichner H. Federated learning for mobile keyboard prediction. CoRR. 2018. https://arxiv.org/abs/1811.03604
  7. Ren S, Kim E, Lee C. A scalable blockchain-enabled federated learning architecture for edge computing. PLoS One. 2024;19(8):e0308991. pmid:39150937
  8. Aledhari M, Razzak R, Parizi RM, Saeed F. Federated learning: a survey on enabling technologies, protocols, and applications. IEEE Access. 2020;8:140699–725. pmid:32999795
  9. Kairouz P, McMahan HB, Avent B, Bellet A, Bennis M, Nitin Bhagoji A, et al. Advances and open problems in federated learning. FNT Mach Learn. 2021;14(1–2):1–210.
  10. Zhu H, Xu J, Liu S, Jin Y. Federated learning on non-IID data: a survey. Neurocomputing. 2021;465:371–90.
  11. Konečný J, McMahan HB, Yu FX, Richtárik P, Suresh AT, Bacon D. Federated learning: strategies for improving communication efficiency. CoRR. 2016. https://arxiv.org/abs/1610.05492
  12. Li Q, Diao Y, Chen Q, He B. Federated learning on non-iid data silos: an experimental study. In: 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE; 2022. p. 965–78.
  13. Li T, Sahu AK, Zaheer M, Sanjabi M, Talwalkar A, Smith V. Federated optimization in heterogeneous networks. In: Proceedings of Machine Learning and Systems. 2020. p. 429–50.
  14. Wang H, Yurochkin M, Sun Y, Papailiopoulos DS, Khazaeni Y. Federated learning with matched averaging. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020; 2020.
  15. Talukder Z, Islam MA. Computationally efficient auto-weighted aggregation for heterogeneous federated learning. In: 2022 IEEE International Conference on Edge Computing and Communications (EDGE). 2022. p. 12–22.
  16. Li Z, Lin T, Shang X, Wu C. Revisiting weighted aggregation in federated learning with neural networks. In: International Conference on Machine Learning. PMLR; 2023. p. 19767–88.
  17. Wang R, Chen Y. Adaptive model aggregation in federated learning based on model accuracy. IEEE Wirel Commun. 2024;31(5):200–6.
  18. Le DD, Huynh DT, Bao PT. Correlation-based weighted federated learning with multimodal sensing and knowledge distillation: an application on a real-world benchmark dataset. In: International Conference on Multimedia Modeling. Springer; 2025. p. 49–60.
  19. Hsu TH, Qi H, Brown M. Measuring the effects of non-identical data distribution for federated visual classification. CoRR. 2020. https://arxiv.org/abs/1909.06335
  20. Karimireddy SP, Kale S, Mohri M, Reddi S, Stich S, Suresh AT. SCAFFOLD: stochastic controlled averaging for federated learning. In: Proceedings of the 37th International Conference on Machine Learning. vol. 119. PMLR; 2020. p. 5132–43.
  21. Kuhn HW. The Hungarian method for the assignment problem. Naval Res Logist Quart. 1955;2(1–2):83–97.
  22. Zhao Y. Comparison of federated learning algorithms for image classification. In: 2023 2nd International Conference on Data Analytics, Computing and Artificial Intelligence (ICDACAI). IEEE; 2023. p. 613–5.
  23. Wang S, Zhu X. FedDNA: federated learning using dynamic node alignment. PLoS One. 2023;18(7):e0288157. pmid:37399217
  24. Reddi SJ, Charles Z, Zaheer M, Garrett Z, Rush K, Konečný J, et al. Adaptive federated optimization. In: International Conference on Learning Representations; 2021.
  25. Dinh T, Tran N, Nguyen TD. Personalized federated learning with Moreau envelopes. In: Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2020. p. 21394–405.
  26. Fallah A, Mokhtari A, Ozdaglar A. Personalized federated learning with theoretical guarantees: a model-agnostic meta-learning approach. Adv Neural Inf Process Syst. 2020;33:3557–68.
  27. Yao D, Pan W, Dai Y, Wan Y, Ding X, Yu C, et al. FedGKD: toward heterogeneous federated learning via global knowledge distillation. IEEE Trans Comput. 2024;73(1):3–17.
  28. Lin T, Kong L, Stich SU, Jaggi M. Ensemble distillation for robust model fusion in federated learning. Adv Neural Inf Process Syst. 2020;33:2351–63.
  29. Fu DS, Huang J, Hazra D, Dwivedi AK, Gupta SK, Shivahare BD, et al. Enhancing sports image data classification in federated learning through genetic algorithm-based optimization of base architecture. PLoS One. 2024;19(7):e0303462. pmid:38990969
  30. Rehman MHU, Hugo Lopez Pinaya W, Nachev P, Teo JT, Ourselin S, Cardoso MJ. Federated learning for medical imaging radiology. Br J Radiol. 2023;96(1150):20220890. pmid:38011227
  31. Nguyen DC, Ding M, Pathirana PN, Seneviratne A, Li J, Poor HV. Federated learning for Internet of Things: a comprehensive survey. IEEE Commun Surv Tutor. 2021;23(3):1622–58.
  32. Subramanian M, Rajasekar V, V. E. S, Shanmugavadivel K, Nandhini PS. Effectiveness of decentralized federated learning algorithms in healthcare: a case study on cancer classification. Electronics. 2022;11(24):4117.
  33. Dasari S, Kaluri R. 2P3FL: a novel approach for privacy preserving in financial sectors using flower federated learning. CMES. 2024;140(2):2035–51.
  34. Wang R. The experiment of federated learning algorithm. Appl Comput Eng. 2024;39:145–59.
  35. Li X, Huang K, Yang W, Wang S, Zhang Z. On the convergence of FedAvg on non-IID data. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020; 2020.
  36. Bai J, Chen Y, Yu X, Zhang H. Generalized asymmetric forward–backward–adjoint algorithms for convex–concave saddle-point problem. J Sci Comput. 2025;102(3):80.
  37. LeCun Y, Cortes C, Burges CJ. The MNIST database of handwritten digits. 1998. http://yannlecun.com/exdb/mnist/
  38. Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint. 2017. https://arxiv.org/abs/1708.07747
  39. Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. University of Toronto. 2009.
  40. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.
  41. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR); 2015.
  42. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2019. p. 8024–35.
  43. Hsia SC, Wang SH, Chang CY. Convolution neural network with low operation FLOPS and high accuracy for image recognition. J Real-Time Image Process. 2021;18(4):1309–19.
  44. Le D-D, Tran A-K, Pham T-B, Huynh T-N. A survey of model compression and its feedback mechanism in federated learning. In: The Fifth Workshop on Intelligent Cross-Data Analysis and Retrieval; 2024. p. 37–42. https://doi.org/10.1145/3643488.3660293