Abstract
Federated learning (FL) is an effective distributed learning paradigm for protecting client privacy, enabling multiple clients to collaboratively train a global model without uploading private data. It has promising applications in sports image classification. However, FL faces the issue of non-independent and identically distributed (non-IID) data, which leads to excessive variance between local models and hinders the convergence of the global model. Although FedSAM and its variants attempt to reduce this variance by finding smooth solutions between local models, local smoothing does not necessarily result in global smoothing. We refer to this issue as the smoothness inconsistency problem. To address this challenge, we propose a novel FL paradigm, named A-FedSAM, which utilizes adaptive local distillation to achieve consistency in smoothing between local and global models without incurring additional communication overhead, thereby improving the convergence accuracy of the global model. Specifically, A-FedSAM employs the global model as the teacher during local training, dynamically guiding the local models to ensure that their gradients not only maintain smoothness but also align with the global objective. Extensive experiments on sports image classification tasks demonstrate that A-FedSAM outperforms state-of-the-art methods in terms of accuracy across different data heterogeneities and client sampling rates, while requiring fewer communication and computational resources to achieve the same target accuracy.
Citation: Zhen K, Wu J, Park J, Shao R, Zhang X, Yu S (2025) Achieving consistency in FedSAM using local adaptive distillation on sports image classification. PLoS One 20(10): e0333210. https://doi.org/10.1371/journal.pone.0333210
Editor: Tien-Dung Cao, TTU: Tan Tao University, VIET NAM
Received: March 24, 2025; Accepted: September 10, 2025; Published: October 17, 2025
Copyright: © 2025 Zhen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting information files.
Funding: Sichuan Science and Technology Program (2025ZNSFSC1498). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Federated learning (FL) is an efficient distributed deep learning approach that enables clients to collaboratively train a global model required for a given task without exchanging raw data [1–5]. Due to its "data availability without visibility" property, FL has been widely applied across various domains, including healthcare, finance, and transportation [6–14]. In the field of sports, FL also exhibits strong applicability, facilitating sports recognition and analysis by leveraging data from different clients [15]. For instance, various sports clubs, gyms, or personal fitness devices can collaboratively train a highly effective motion recognition model without sharing raw data. Furthermore, FL can enhance personalized sports recommendations or optimize team strategies by integrating data from multiple players, thereby promoting more intelligent and customized sports training and management.
However, in real-world sports scenarios, data distributions are often highly heterogeneous, leading to client drift during local model training [16–20]. This drift causes fluctuations in the convergence of the global model, significantly reducing its accuracy. An effective solution to this issue is to reduce the variance of gradients among local models, ensuring smooth consistency, thereby mitigating the instability caused by client drift and facilitating stable global convergence. A classical approach to achieving this is FedSAM [21], which replaces the standard SGD optimizer with the Sharpness-Aware Minimization (SAM) optimizer [22] to smooth local gradient updates, thereby promoting smoother global model convergence. However, local smoothness does not necessarily translate into global smoothness after aggregation. In scenarios with high data heterogeneity, even smoothed local updates can lead to sharp global updates [23]. To illustrate this smoothness inconsistency, we provide an example with three clients, as shown in Fig 1.
Fig 1. Overview of the FL training process: first, clients perform local training using their private datasets; second, the locally trained models are uploaded to the server; third, the server aggregates the model parameters; finally, the global model is distributed back to the clients for the next round of training.
To address the smoothness inconsistency issue in FedSAM, existing approaches primarily integrate SAM with other consistency-enhancing techniques. While these methods have achieved notable performance improvements, they typically require introducing additional auxiliary variables, such as in FedGAMMA [24] and FedSMOO [23]. The doubled communication overhead associated with these approaches is impractical for real-world FL scenarios with bandwidth constraints.
To overcome this limitation and achieve smooth consistency without incurring extra communication costs, we propose a novel FL paradigm, A-FedSAM, which ensures global smoothness consistency through adaptive local distillation. This approach effectively enhances accuracy in sports image classification tasks. Specifically, A-FedSAM employs the SAM optimizer to smooth local gradients while leveraging past global models as teacher models. During local training, it introduces a guidance term that steers local models toward global optimization. Furthermore, we incorporate a dynamic distillation mechanism to mitigate early-stage global model instability, thereby improving the reliability of guidance and enhancing overall convergence performance.
Overall, our contributions are as follows:
- We propose a novel FL paradigm, A-FedSAM, which effectively addresses the global inconsistency issue introduced by SAM optimization through local dynamic distillation, without incurring additional communication overhead. A-FedSAM is well-suited for various sports recognition scenarios.
- To tackle the unreliability of the global model in the early stages of training, we introduce a dynamic distillation mechanism that employs an exponentially weighted moving average to update the constraint term, ensuring the accuracy and reliability of distillation.
- Extensive experiments on sports image classification datasets demonstrate that A-FedSAM achieves the target accuracy with reduced communication overhead. Furthermore, under the same number of training rounds, A-FedSAM achieves a higher convergence accuracy than existing state-of-the-art (SOTA) methods, highlighting its effectiveness in the field of sports recognition.
Related work
FL with SAM
FedSAM [21] is a smooth convergence method for FedAvg [1], which replaces the SGD optimizer in FedAvg with the SAM optimizer [22] to achieve smooth local model updates, thereby improving the smoothness of the global model. However, FedSAM faces the issue of global inconsistency. To address this problem, FedGAMMA [24] combines SAM with Scaffold by introducing auxiliary variables to reduce local drift. FedSPeed [25] enhances the consistency of local objectives through dual updates, but it relies on precise local solutions, which are nearly impossible to achieve in real-world scenarios. MoFedSAM [21] employs an exponentially weighted moving average to effectively utilize historical knowledge, resolving the issue of sharp global updates under high data heterogeneity. FedSMOO [23] enhances global consistency at the objective function level by introducing dual auxiliary variables.
Although these methods improve global gradient consistency to some extent, they still face performance bottlenecks and incur significant communication overhead. In contrast, our proposed A-FedSAM introduces a dynamic distillation term in local training, leveraging the global model to constrain local model drift, thereby effectively enhancing global gradient consistency.
Knowledge distillation
Knowledge distillation is an effective method for extracting knowledge from pre-trained models, enabling improved performance at a relatively low cost [26]. The success of DeepSeek is also largely attributed to knowledge distillation [27,28]. In FL, knowledge distillation can be applied both on the server and the clients. On the server side, public datasets are typically used to fine-tune the global model, enhancing its ability to perceive data distributions, as seen in methods such as FedDF [29], FedGen [30], and FedAUX [31]. On the client side, FedGKD [32] proposes using the global model or the average of past global models as a teacher model to guide the training of local models in the current round, effectively mitigating client drift.
However, knowledge distillation often requires a pre-trained robust teacher model, whereas the global model lacks robustness in the early training stages and may even provide no effective guidance. A-FedSAM addresses this issue by introducing a dynamic distillation mechanism, which utilizes an exponentially weighted moving average constraint term to significantly improve the usability of the global model.
Methodology
Preliminary
This section introduces the objective function, training process, and relevant notations in FL.
Consider a FL system consisting of N clients and a central server, where each client i possesses a private dataset D_i containing |D_i| data points. Each data point is represented as (x_j, y_j), where x_j is an input sample and y_j is its corresponding label. The objective function for client i is given by:

$$F_i(w_i) = \frac{1}{|D_i|} \sum_{(x_j, y_j) \in D_i} l\big(w_i; x_j, y_j\big), \qquad (1)$$

where w_i denotes the model parameters of client i, l represents the loss function (typically cross-entropy loss), and F_i is the local objective function for client i. The overall objective of FL is formulated as:

$$\min_{w} f(w) = \sum_{i=1}^{N} \frac{|D_i|}{\sum_{j=1}^{N} |D_j|} \, F_i(w). \qquad (2)$$
Taking the t–th training round as an example, the FL process consists of four main steps:
- Client Selection & Model Distribution: The server selects a subset of clients St to participate in training during round t and distributes the global model wt to all selected clients in St.
- Local Training: Each client receives the global model wt and initializes its local model as
. The client then trains the local model for k iterations using its private dataset, resulting in the updated local model
. In FedAvg, the local optimizer is typically SGD.
- Model Upload: Each client uploads its locally trained model
to the server.
- Model Aggregation: The server aggregates the received local models using an aggregation algorithm to obtain the updated global model wt + 1 , which serves as the initial model for round t + 1.
The entire FL training process is illustrated in Fig 1.
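To make the four steps concrete, the following minimal PyTorch-style sketch shows one FedAvg communication round. The helper names (`client.sample_batch()`, `client.num_samples`) are illustrative assumptions, not the authors' implementation.

```python
import copy
import random
import torch

def fedavg_round(global_model, clients, sample_rate=0.4, local_steps=5, lr=0.1):
    """One FedAvg round: select clients, train locally, upload, aggregate."""
    selected = random.sample(clients, max(1, int(sample_rate * len(clients))))  # Step 1: client selection
    local_states, weights = [], []
    for client in selected:
        local_model = copy.deepcopy(global_model)          # Step 1: model distribution (w_i <- w^t)
        opt = torch.optim.SGD(local_model.parameters(), lr=lr)
        for _ in range(local_steps):                       # Step 2: local training with SGD
            x, y = client.sample_batch()                   # assumed helper returning a mini-batch
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(local_model(x), y)
            loss.backward()
            opt.step()
        local_states.append(local_model.state_dict())      # Step 3: model upload
        weights.append(client.num_samples)
    # Step 4: aggregation weighted by |D_i|, matching the global objective (2)
    total = sum(weights)
    new_state = {k: sum(w / total * s[k].float() for w, s in zip(weights, local_states))
                 for k in local_states[0]}
    global_model.load_state_dict(new_state)
    return global_model
```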
Rethinking FedSAM
FedSAM builds upon FedAvg by replacing the local optimizer SGD with SAM, modifying the local objective function from (1) to the following:

$$\min_{w_i} \; \max_{\|\epsilon_i\| \le \rho} F_i(w_i + \epsilon_i), \qquad (3)$$

where ε_i represents a perturbation term in the vicinity of w_i for client i, and ρ is the perturbation radius. Based on this objective function, the local model w_i is updated in round t as follows:

$$\epsilon_i^{t,k} = \rho \, \frac{\nabla F_i(w_i^{t,k})}{\big\|\nabla F_i(w_i^{t,k})\big\|}, \qquad w_i^{t,k+1} = w_i^{t,k} - \eta \, \nabla F_i\big(w_i^{t,k} + \epsilon_i^{t,k}\big), \qquad (4)$$

where η is the local learning rate, w_i^{t,k} + ε_i^{t,k} is the perturbed weight obtained by performing gradient ascent to find the sharpest increase within the perturbation radius, and ∇F_i(w_i^{t,k} + ε_i^{t,k}) is the gradient computed at this perturbed point. This gradient is then used to update the local model.
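For reference, below is a minimal PyTorch sketch of this two-step SAM update (ascent to the perturbed point, then descent using the gradient computed there). It follows the generic SAM procedure rather than the authors' exact code.

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.1):
    """One SAM update: gradient ascent to w + eps, then descent with the gradient at that point."""
    base_opt.zero_grad()

    # First forward/backward: gradient at the current weights w
    loss_fn(model(x), y).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12

    # Ascent step: eps = rho * g / ||g||, applied in place
    eps_list = [rho * g / grad_norm for g in grads]
    with torch.no_grad():
        for p, eps in zip(params, eps_list):
            p.add_(eps)

    # Second forward/backward: gradient at the perturbed weights w + eps
    base_opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()

    # Restore w, then let the base optimizer (e.g., SGD) apply the SAM gradient
    with torch.no_grad():
        for p, eps in zip(params, eps_list):
            p.sub_(eps)
    base_opt.step()
    return loss.item()
```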
Through gradient ascent and perturbation, local models are updated in a smoother direction, effectively reducing variance among local models. However, while each local model achieves local smoothness, the aggregated global model does not necessarily achieve global smoothness, particularly in non-IID data scenarios where the update directions of different clients vary significantly. This results in an aggregated global model that remains sharp, ultimately degrading its performance. We argue that global smoothness inconsistency primarily occurs when the degree of heterogeneity between client data is high. In such cases, each client overfits its local objective during training, leading to excessive client drift—a bias that the SAM optimizer alone cannot mitigate.
Previous methods, such as FedGAMMA and FedSMOO, have introduced additional auxiliary variables to mitigate client drift. While these approaches achieve notable improvements, they also double the communication overhead, posing a significant challenge for FL systems. Considering the communication constraints in sports-related scenarios, we propose an alternative approach—local adaptive distillation—to effectively reduce client drift. Specifically, we dynamically leverage knowledge from global models at different training stages to correct client drift, integrating this with SAM to achieve global smoothness consistency without incurring extra communication costs.
However, local smoothness does not necessarily translate into global smoothness after aggregation. To formally characterize this limitation, we define smoothness inconsistency as the gradient divergence between local and global models:
$$\mathcal{D}(w_i, w_0) = \big\| \nabla F_i(w_i) - \nabla f(w_0) \big\|,$$

where w_i is the local model of client i, w_0 is the global model, ∇F_i(w_i) is the local gradient, and ∇f(w_0) is the global gradient. In scenarios with high data heterogeneity, even smoothed local updates can lead to sharp global updates [23]. This occurs because clients tend to overfit to their own local objectives, which amplifies the divergence between their individual gradients and the global gradient. As a result, even though local updates are smoothed via SAM, their aggregated effect can produce a sharp global model. Moreover, the fixed perturbation radius ρ used in SAM may be insufficient to bridge the inter-client optimization gap, especially under highly heterogeneous data distributions.
Adaptive local distillation
To reduce client drift and align local models with the global objective, we introduce an adaptive local distillation mechanism. The global model w^t serves as a teacher, while each local model acts as a student. The distillation regularization is formulated as:

$$L_{KD}(w_i) = \mathrm{KL}\big( h(w^t; x) \,\big\|\, h(w_i; x) \big),$$

where h denotes the model's forward pass producing soft probability outputs. This KL-based loss encourages the local model to mimic the global model's predictive behavior, thereby mitigating divergence in prediction space and improving global consistency.
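A minimal sketch of this temperature-scaled KL term in PyTorch is shown below; the temperature value follows the experimental setting reported later (T = 4.0), while the exact scaling convention is an assumption.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between the global (teacher) and local (student) soft predictions.
    The teacher logits come from the frozen global model w^t."""
    teacher_probs = F.softmax(teacher_logits.detach() / temperature, dim=1)   # soften the teacher
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)    # soften the student
    # 'batchmean' gives the mean KL per sample; T^2 rescales gradients as in standard distillation
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```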
We combine this distillation term with the SAM-based local objective as follows:

$$\min_{w_i} \Big\{ \max_{\|\epsilon_i\| \le \rho} F_i(w_i + \epsilon_i) \;+\; \gamma_t \, L_{KD}(w_i) \Big\},$$

where γ_t is a time-dependent coefficient that weights the distillation term. Although the two terms originate from different objectives, one focusing on output alignment and the other on sharpness-aware optimization, they are both differentiable and operate on the same prediction space. In practice, the KL divergence provides gradient signals aligned with the probability distribution, while the SAM term enhances local robustness in parameter space. These gradients are complementary, and their relative influence is balanced by the time-dependent coefficient γ_t.
It is worth noting that the gradient magnitudes from the two components can differ substantially. Therefore, we use γ_t to dynamically control the contribution of the distillation loss. This formulation reflects the evolving trust in the global model throughout training. At early stages, when the global model is less reliable due to aggregation from undertrained clients, a lower γ_t prevents over-regularization. As training progresses, the coefficient increases, allowing stronger guidance from the global model.
Compared to prior works such as FedGKD, which adopt a fixed distillation weight throughout training, our adaptive scheduling mechanism better accounts for the evolving reliability of the global model. This results in more stable training and improved alignment between local and global objectives. Detailed empirical validation of λ and related parameters is provided in the following sensitivity analysis.
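The paper specifies an exponentially weighted moving-average style schedule but not its closed form, so the snippet below is one plausible instantiation in which the distillation weight γ_t grows from near zero toward a cap at a rate controlled by λ; the names `gamma_max` and the exact growth formula are assumptions for illustration.

```python
def distillation_weight(round_t, lam=0.05, gamma_max=1.0):
    """Adaptive distillation weight gamma_t: small in early rounds (unreliable global model),
    approaching gamma_max as training progresses. lam controls how fast trust grows."""
    return gamma_max * (1.0 - (1.0 - lam) ** round_t)

# Example: with lam = 0.05, gamma_t is ~0.05 after 1 round, ~0.40 after 10, ~0.92 after 50.
```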
A-FedSAM overview
The overall flow of A-FedSAM is shown in Fig 2, and the detailed training process is described in Algorithm 1. Specifically, line 7 computes the perturbed weight w_i + ε_i, and line 9 calculates the corrected gradient ∇F_i(w_i + ε_i). The distillation-based correction objective L_KD is computed in line 11, followed by the adaptive weight γ_t in line 13. The local model update is performed in line 15. Finally, the global model parameters are aggregated in line 19.
During the local training phase, the SAM optimizer is employed to reduce gradient variance among clients, while dynamic knowledge distillation is utilized to effectively mitigate client drift.
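To tie the pieces together, the condensed sketch below shows one local training iteration combining the SAM perturbation with the adaptively weighted distillation term. It mirrors the structure of Algorithm 1, but the exact composition of the losses at the perturbed point is an assumption rather than the authors' published code.

```python
import torch
import torch.nn.functional as F

def afedsam_local_step(local_model, global_model, x, y, opt,
                       rho=0.1, temperature=4.0, gamma_t=0.5):
    """One A-FedSAM local iteration (sketch): SAM perturbation on the task loss,
    plus an adaptively weighted KL term distilled from the frozen global model."""
    opt.zero_grad()

    # 1) Gradient at the current weights, used to build the SAM perturbation
    F.cross_entropy(local_model(x), y).backward()
    params = [p for p in local_model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                       # move to the perturbed point w + eps

    # 2) Loss at the perturbed point: task loss + gamma_t * KL(global || local)
    opt.zero_grad()
    logits = local_model(x)
    with torch.no_grad():
        teacher_logits = global_model(x)    # frozen teacher w^t, no extra communication
    kd = F.kl_div(F.log_softmax(logits / temperature, dim=1),
                  F.softmax(teacher_logits / temperature, dim=1),
                  reduction="batchmean") * temperature ** 2
    loss = F.cross_entropy(logits, y) + gamma_t * kd
    loss.backward()

    # 3) Restore the original weights and apply the update with the corrected gradient
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    opt.step()
    return loss.item()
```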
Theoretical analysis
Assumption 0.1. F_i is L-smooth if ‖∇F_i(x) − ∇F_i(y)‖ ≤ L‖x − y‖ for all x, y. In addition, we have:

$$F_i(y) \le F_i(x) + \langle \nabla F_i(x), \, y - x \rangle + \frac{L}{2}\,\|y - x\|^2 .$$

Assumption 0.2. (Bounded Stochastic Gradient). For a data sample ξ uniformly sampled at random from D_i, the stochastic gradient is an unbiased estimator and has bounded variance, i.e.,

$$\mathbb{E}\big[\nabla F_i(w; \xi)\big] = \nabla F_i(w), \qquad \mathbb{E}\big\|\nabla F_i(w; \xi) - \nabla F_i(w)\big\|^2 \le \sigma^2,$$

where σ ≥ 0 is a constant.

Assumption 0.3. (Bounded Dissimilarity). The dataset dissimilarity among local clients is constrained by both local and global gradients, i.e.,

$$\frac{1}{N}\sum_{i=1}^{N} \big\|\nabla F_i(w)\big\|^2 \le G^2 + B^2\,\big\|\nabla f(w)\big\|^2 .$$
Theorem 0.1. (Convergence of A-FedSAM). Under the above assumptions and with properly selected parameters, the expected gradient norm of the global objective f is bounded and A-FedSAM converges to a stationary point, where f* is the optimal solution of f. The bound requires appropriate choices of the perturbation coefficient ρ, the local learning rate η, and the number of local iterations K; Lh and δ are small constants greater than 0.
Proof: To handle the adaptive distillation term, we follow [32] and rewrite it as an equivalent proximal-style regularizer on the local objective. To further handle the prox-term, we follow [25] and introduce an auxiliary variable. By combining these two reformulations under the framework of [25], which handles the perturbation term, Theorem 0.1 can be derived. □
Experiment
Datasets
In the experiment, we used the SPORT1 [33] and SPORT2 [34] sports image classification datasets. We divided the datasets into training and testing sets for the experimental work. The SPORT1 dataset consists of 22 different sports categories, and Fig 3 shows the number of images per class in the training and testing sets of this dataset. The SPORT2 dataset consists of 100 different sports categories, and Fig 4 shows the number of images per class in the training set, while the number of images per class in the testing set is fixed at 5.
While these two datasets reflect real-world sports recognition scenarios, the scope of evaluation remains domain-specific. To strengthen the empirical validity and assess the generality of A-FedSAM across domains, we additionally conduct experiments on four widely-used benchmark datasets from computer vision and natural language processing.
The CIFAR-10 and CIFAR-100 datasets [37] are standard image classification benchmarks that contain 60,000 32×32 color images each. CIFAR-10 includes 10 general object categories such as airplane, automobile, and dog, while CIFAR-100 consists of 100 more fine-grained categories grouped into 20 superclasses. Both datasets are split into 50,000 training images and 10,000 test images. Tiny-ImageNet [39] is a subset of the ImageNet dataset, containing 200 classes with 500 training samples and 50 validation samples per class. All images are resized to 64×64 resolution, making this dataset more complex than CIFAR and representative of large-scale, low-resolution image tasks. AG-News [38] is a widely-used dataset for text classification tasks. It comprises 120,000 training and 7,600 testing samples, covering four major news categories: World, Sports, Business, and Science/Technology. Each instance includes a news headline and a short description, enabling federated learning experiments on textual data.
These datasets allow us to verify the effectiveness of A-FedSAM under both image and text modalities, across varying levels of class granularity and data heterogeneity. This extended evaluation confirms that A-FedSAM is not only effective in sports image classification, but also generalizes well to broader federated learning scenarios.
Data partitioning
To simulate real-world non-IID data, we applied both Dirichlet distribution and pathological partitioning strategies across all datasets. Additionally, each dataset includes an IID configuration as a reference point, where data is uniformly and randomly distributed among clients.
For the SPORT1 and SPORT2 datasets, we set the number of clients to 20. The Dirichlet distribution was used with concentration parameters 0.3 and 0.6, labeled as D1 and D2, respectively, to control the degree of data heterogeneity. In the pathological split, each client receives data from a limited number of classes: for SPORT1, 6 (P1) or 12 (P2) classes; for SPORT2, 30 (P1) or 60 (P2) classes. The distribution under Dirichlet(0.3) is visualized in Figs 5 and 6, where noticeable inter-client distribution differences reflect strong heterogeneity.
For the benchmark datasets (CIFAR-10, CIFAR-100, TinyImageNet, and AG-News), we adopted a larger-scale federated learning setup with 100 clients and a participation rate of 10% per round to simulate partial client participation. Similar to the sports datasets, we used Dirichlet distributions with concentration parameters 0.3 (D1) and 0.6 (D2), as well as class-restricted pathological splits. In CIFAR-10, each client receives data from 3 (P1) or 6 (P2) classes; in CIFAR-100, from 10 (P1) or 20 (P2) classes; and in TinyImageNet, from 20 (P1) or 40 (P2) classes. For the AG-News dataset, each client is assigned data from 2 (P1) or 3 (P2) out of the 4 total news categories. These settings allow us to comprehensively assess algorithm robustness under varying degrees of heterogeneity, data granularity, and scale.
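As an illustration of the Dirichlet-based partitioning described above, the following sketch assigns class-wise sample indices to clients by drawing per-class proportions from Dir(α). It is a standard construction under the stated settings, not necessarily the exact script used in the paper.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=20, alpha=0.3, seed=0):
    """Split sample indices into non-IID client shards.
    For each class, proportions over clients are drawn from Dir(alpha);
    smaller alpha -> more heterogeneous label distributions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, shard in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(shard.tolist())
    return client_indices

# Example: shards = dirichlet_partition(train_labels, num_clients=20, alpha=0.3)
```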
Model
We adopt different backbone models for each dataset based on its modality and complexity. For all image-based tasks, we use ResNet-18 as the base model, and replace all Batch Normalization (BN) layers with Group Normalization (GN) to improve training stability under federated settings, following prior work [25]. The final fully connected layer is modified according to the number of classes in each dataset.
For the SPORT1 and SPORT2 datasets, we use ResNet-18 with output dimensions set to 22 and 100 classes, respectively. All BN layers are systematically replaced with GN to improve robustness against small or heterogeneous local batches.
For CIFAR-10 and CIFAR-100, we similarly adopt GN-based ResNet-18 models, with final classifier heads set to 10 and 100 classes, respectively.
For TinyImageNet, which involves 200 categories and larger input resolution (64×64), we continue using the same GN-based ResNet-18 architecture, adjusting the final output to 200 classes.
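A minimal sketch of the BN-to-GN substitution on a torchvision ResNet-18 is given below; the number of groups (here 2) is an assumption, as the paper does not report it.

```python
import torch.nn as nn
from torchvision.models import resnet18

def build_gn_resnet18(num_classes, groups=2):
    """ResNet-18 with every BatchNorm2d replaced by GroupNorm, plus a task-specific head."""
    # torchvision allows overriding the normalization layer at construction time
    return resnet18(num_classes=num_classes,
                    norm_layer=lambda channels: nn.GroupNorm(groups, channels))

# Heads used in the paper: 22 (SPORT1), 10 (CIFAR-10), 100 (SPORT2/CIFAR-100), 200 (TinyImageNet)
sport1_model = build_gn_resnet18(num_classes=22)
```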
For AG-News (text classification), we use a lightweight neural network with an embedding layer (vocabulary size 30,626, embedding dimension 100), followed by masked average pooling and a two-layer classifier with ReLU and dropout, outputting logits over four news categories.
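The AG-News classifier described above can be sketched as follows; the vocabulary size (30,626), embedding dimension (100), masked average pooling, and four-way output follow the description, while the hidden width and dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class NewsClassifier(nn.Module):
    """Embedding -> masked average pooling -> two-layer MLP with ReLU and dropout -> 4 logits."""
    def __init__(self, vocab_size=30626, embed_dim=100, hidden_dim=128,
                 num_classes=4, pad_idx=0, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )
        self.pad_idx = pad_idx

    def forward(self, token_ids):
        mask = (token_ids != self.pad_idx).unsqueeze(-1).float()   # ignore padding positions
        emb = self.embedding(token_ids) * mask
        pooled = emb.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)   # masked average pooling
        return self.classifier(pooled)
```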
All models are implemented in PyTorch and kept consistent across methods to ensure fair comparisons.
Baseline
To evaluate the performance of A-FedSAM, we compared it with several existing SOTA algorithms, including FedAvg [1], Scaffold [19], FedDyn [35], FedCM [36] and FedSpeed [25].
Hyper-parameter settings
Unless otherwise specified, all experiments use a batch size of 50 and an initial learning rate of 0.1 with an exponential decay factor of 0.998. The number of local training iterations per communication round is fixed to E = 5 across all datasets for consistency.
For the SPORT1, SPORT2, and AG-News datasets, we set the total number of communication rounds to 300, as they exhibit relatively fast convergence. For the larger-scale benchmarks—CIFAR-10, CIFAR-100, and TinyImageNet—we extend the total number of communication rounds to 1000 to accommodate their increased complexity and slower convergence rates.
For all experiments involving FedDyn and FedSpeed, we set the dynamic regularization coefficient to 0.1, following their original configurations. When using the SAM optimizer, the perturbation radius ρ is set to 0.1 by default. In our proposed A-FedSAM method, we follow FedGKD and set the distillation temperature to 4.0 to soften the teacher predictions during local distillation.
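Collected for convenience, the default configuration stated above can be expressed as a single dictionary; this is only a summary sketch, and the key names are illustrative.

```python
default_config = {
    "batch_size": 50,
    "lr": 0.1,
    "lr_decay_per_round": 0.998,      # exponential decay applied each communication round
    "local_epochs": 5,                # E = 5 for all datasets
    "rounds": {"SPORT1": 300, "SPORT2": 300, "AG-News": 300,
               "CIFAR-10": 1000, "CIFAR-100": 1000, "TinyImageNet": 1000},
    "dyn_reg_coeff": 0.1,             # FedDyn / FedSpeed dynamic regularization coefficient
    "sam_rho": 0.1,                   # SAM perturbation radius
    "distill_temperature": 4.0,       # following FedGKD
}
```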
For hyperparameter tuning, we search over a range of candidate values for the perturbation radius ρ and the distillation temperature T, and the best values are selected based on validation performance.
Experimental analysis
High accuracy.
To evaluate the accuracy of A-FedSAM under different data modalities, scales, and heterogeneity levels, we present multi-round average test accuracy results across two groups of datasets: (1) SPORT1 and SPORT2 under 40% and 80% participation (Tables 1 and 2), and (2) general-purpose benchmarks including CIFAR-10/100, TinyImageNet, and AG-News under 10% participation (Table 3). Across all settings, A-FedSAM consistently achieves the best or near-best performance, demonstrating strong generalization and robustness.
Federated sports image classification.
On SPORT1 and SPORT2, A-FedSAM significantly outperforms baseline methods under both participation settings. Compared to FedAvg, it yields absolute gains of 6%–31% across non-IID splits (D1, D2, P1, P2), with the largest improvements observed under highly heterogeneous settings like P1 and D1. Against stronger baselines such as FedSpeed and Scaffold, A-FedSAM still achieves an average margin of 2%–9%, especially in low participation settings (40%), where global consistency becomes harder to maintain.
For example, under 40% participation on SPORT1-P1, A-FedSAM achieves 56.37% accuracy compared to 53.14% for FedSpeed and 52.41% for Scaffold. On SPORT2-D1, A-FedSAM improves over the next-best method (FedSpeed) by more than 4%, reaching 66.40%. These results validate the effectiveness of adaptive distillation in mitigating client drift and maintaining global optimization consistency without incurring extra communication cost.
General-purpose benchmarks.
On CIFAR-10, CIFAR-100, TinyImageNet, and AG-News, A-FedSAM consistently matches or outperforms state-of-the-art methods across all splits. The performance margins are particularly notable on challenging non-IID partitions (D1, P1), where most baselines exhibit significant degradation. For instance, on CIFAR-100-P2, A-FedSAM reaches 55.91%, outperforming FedSpeed (54.29%) and Scaffold (47.77%) by a clear margin. On TinyImageNet-D1, it achieves 42.43%, improving upon FedSpeed (41.03%) and substantially surpassing FedCM and FedDyn.
In the AG-News dataset, which introduces a textual modality, A-FedSAM maintains its advantage by achieving 90.41% accuracy under IID and outperforming all baselines in non-IID conditions as well. Notably, methods such as FedDyn collapse under label-sparse text splits (e.g., only 34.26% on D1), while A-FedSAM retains above 89% performance due to its robust guidance mechanism.
Overall, A-FedSAM shows strong performance across all domains—sports, vision, and text—achieving high accuracy under both low and high data heterogeneity. Its advantage is especially prominent in settings with lower client participation and severe non-IID splits, where traditional methods struggle to maintain global gradient alignment. These consistent gains confirm the effectiveness of the proposed dynamic distillation approach in preserving both local smoothness and global consistency.
Fast convergence
Figs 7 to 9 illustrate the convergence performance of A-FedSAM compared with baseline algorithms on the SPORT1 and SPORT2 datasets. The experiments cover different client participation rates (40% and 80%) and both IID and non-IID data distributions (D1, D2, P1, and P2). Overall, A-FedSAM demonstrates faster convergence in most settings, particularly in the early communication rounds, where its accuracy improves more rapidly than the baseline methods. However, as the number of communication rounds increases, some baselines gradually close the performance gap with A-FedSAM. The convergence behavior of each figure is analyzed as follows.
Fig 7 contains four subplots that display the convergence performance on the SPORT2 and SPORT1 datasets under 40% and 80% participation. For example, in subplot (c), A-FedSAM and FedSpeed both reach an accuracy of 0.6 around the 50th round, while Scaffold requires approximately 150 rounds, and the remaining algorithms fail to converge to 0.6. Subplot (d) shows a similar trend, where A-FedSAM maintains its advantage, while other algorithms require more rounds to achieve comparable performance.
Fig 8 (upper) presents the convergence results on the SPORT1 dataset under non-IID settings with 40% participation. In subplot (a), FedDyn and A-FedSAM both reach 0.5 accuracy around the 100th round, but FedDyn’s performance drops in later rounds. Scaffold reaches 0.5 only after around 200 rounds, and the remaining algorithms fail to converge to this level. Fig 8 (bottom) shows the results under 80% participation, where A-FedSAM and the baselines exhibit trends similar to those observed under 40% participation.
Fig 9 focuses on the non-IID settings of the SPORT2 dataset. Specifically, under the 40% participation scenario in Fig 9, subplot (d) shows that A-FedSAM reaches 0.6 accuracy within 100 rounds, while FedSpeed and Scaffold require approximately 150 rounds. Other algorithms fail to converge to 0.6. In Fig 9, with 80% participation, A-FedSAM consistently outperforms all baselines in terms of convergence speed across all four non-IID settings when targeting 0.6 accuracy.
Low communication overhead
Figs 7 to 9 also reflect the communication efficiency of A-FedSAM when compared to baseline methods, considering both the number of communication rounds required to reach target accuracy and the per-round communication cost.
In terms of communication rounds, A-FedSAM achieves target accuracy (e.g., 0.5 or 0.6) in significantly fewer rounds across most settings. For instance, in non-IID scenarios such as SPORT1-P2, A-FedSAM reaches 0.6 accuracy within 100 rounds under both 40% and 80% participation rates, while FedSpeed and Scaffold typically require 150 or more rounds. In contrast, algorithms like FedAvg and FedCM often fail to reach the same target even with extended training. This efficiency in early convergence directly reduces total communication cost, particularly in bandwidth-constrained environments.
Furthermore, A-FedSAM is designed to incur no additional communication overhead per round. Similar to FedAvg, it transmits only the model weights between the server and clients, without introducing any extra information such as gradients, control variates, or auxiliary buffers. In contrast, some baseline methods require the transmission of additional state information (e.g., gradient history or momentum terms), which increases both uplink and downlink communication costs. By avoiding these overheads, A-FedSAM achieves comparable per-round communication cost to FedAvg, while maintaining significantly better convergence speed and final accuracy.
When considering the overall communication cost, which is the product of communication rounds and per-round payload size, A-FedSAM demonstrates a favorable balance. It not only converges faster but also avoids the need for auxiliary gradient or historical state transmission. As a result, it is especially suitable for real-world federated deployments where network resources are limited or costly.
Moderate computation overhead
While communication efficiency is a critical factor in federated learning, computation overhead also plays a significant role in practical deployment. We analyze the additional computational cost introduced by A-FedSAM in comparison to baseline methods, focusing on the overhead from both the SAM optimizer and the local adaptive distillation mechanism.
A-FedSAM incurs two main sources of computational overhead per local step. First, the use of the SAM optimizer requires computing gradients at a perturbed weight w_i + ε_i, which involves an additional forward and backward pass per step, effectively doubling the gradient computations compared to standard SGD. Second, the incorporation of the distillation loss necessitates computing the Kullback-Leibler (KL) divergence between the predictions of the current local model and those of the global (teacher) model. This introduces an extra forward pass through the fixed global model and an additional loss computation per batch.
We summarize the relative computation cost of various methods in Table 4. The number of forward and backward passes is counted per training step, and FLOPs are normalized relative to FedAvg.
Although A-FedSAM introduces approximately 2.5× the per-step FLOPs of FedAvg, it provides significantly faster convergence and better accuracy, which compensates for the per-step cost in many scenarios. Empirically, we observe that teacher predictions can be reused or batched to reduce runtime, and KL divergence is lightweight compared to full backward propagation. While A-FedSAM is computationally heavier than FedAvg or FedDyn, it remains comparable to or only slightly above FedSAM and FedSpeed, and provides consistent accuracy gains across vision and text tasks. For resource-constrained scenarios, the distillation frequency or perturbation steps can be selectively reduced.
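As a back-of-the-envelope check of the roughly 2.5× figure, the tally below assumes a backward pass costs about twice a forward pass and that the frozen teacher requires only a forward pass; these ratios are common rules of thumb, not measurements from the paper.

```python
FWD, BWD = 1.0, 2.0                      # rough per-pass cost, relative to one forward pass

fedavg  = 1 * FWD + 1 * BWD              # one forward + one backward         -> 3.0
fedsam  = 2 * FWD + 2 * BWD              # SAM doubles both passes            -> 6.0
afedsam = 2 * FWD + 2 * BWD + 1 * FWD    # plus one teacher forward (no grad) -> 7.0

print(fedsam / fedavg)    # 2.0x FedAvg
print(afedsam / fedavg)   # ~2.3x FedAvg, in line with the ~2.5x reported in Table 4
```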
Parameter sensitivity
We investigate the sensitivity of A-FedSAM to three key hyperparameters: the perturbation radius ρ, the distillation temperature T, and the distillation scheduling rate λ, all of which directly influence the behavior of local training.
Table 5 presents the accuracy under different values of ρ on the SPORT1 and SPORT2 datasets. As ρ increases from 0.01 to 0.1, performance steadily improves across all data partitions. This suggests that moderate perturbation allows models to locate flatter minima and better generalize. However, a large ρ such as 0.5 degrades performance by allowing excessive divergence among local models, reducing global consistency. On the other hand, very small values overly restrict local adaptation.
Table 6 reports results with varying distillation temperatures T. We observe that increasing T from 1.0 to 4.0 enhances accuracy across all settings. This is because softer teacher distributions encourage more robust knowledge transfer. Beyond T = 4.0, the performance begins to decline as the teacher signals become overly smooth and less informative.
To further understand the effect of dynamic distillation scheduling, we study the hyperparameter λ that governs the growth of the distillation weight γ_t. Table 7 summarizes the results. We find that small values of λ result in weaker guidance from the teacher and slower convergence, while excessively large values apply strong teacher influence prematurely, before the global model becomes reliable. The best performance is achieved with an intermediate value of λ, which balances early-stage exploration and later-stage alignment.
These findings confirm that all three hyperparameters play a critical role in A-FedSAM’s performance. The default values of ρ = 0.1 and T = 4.0, together with the selected λ, yield the most robust results across datasets and are adopted in all subsequent experiments.
Ablation study
To further validate the effectiveness of each component in A-FedSAM, we conduct an ablation study under 40% client participation on the SPORT1 and SPORT2 datasets. The results are summarized in Table 8. Specifically, w/o DKL removes the dynamic distillation term from the local training objective, w/o SAM replaces the SAM optimizer with standard SGD, and w/o ALL disables both components, resulting in a vanilla FedAvg setup.
On both datasets, we observe a consistent performance drop when either component is removed, highlighting their complementary roles. The removal of the SAM optimizer (w/o SAM) leads to a noticeable accuracy reduction across all data partitions, especially on SPORT1, where the drop is more pronounced under non-IID settings (e.g., from 70.60% to 64.42% on D2). Similarly, removing the dynamic distillation term (w/o DKL) also degrades performance, though the effect varies depending on the dataset and data distribution.
The most significant drop occurs in the w/o ALL setting, which reduces A-FedSAM to FedAvg. This baseline consistently yields the lowest accuracy, particularly under non-IID splits such as P1 and P2. For instance, on SPORT1-P2, accuracy drops from 74.80% to 47.20%, a gap of over 27%. This confirms that neither component alone is sufficient to achieve optimal performance—both SAM-based optimization and the dynamic knowledge distillation are critical.
Therefore, the ablation results demonstrate that both the SAM optimizer and the dynamic distillation mechanism play essential roles in improving the robustness and generalization of A-FedSAM, especially under challenging non-IID and low-participation scenarios.
Conclusion
In this work, we proposed A-FedSAM, a novel FL paradigm designed to address the smoothness inconsistency problem caused by non-IID data in federated settings. This approach enables local models to maintain gradient smoothness while remaining aligned with the global optimization objective—achieving both local and global smoothness consistency. Extensive experiments on sports image classification tasks under various non-IID scenarios and client participation rates demonstrate that A-FedSAM consistently outperforms state-of-the-art baselines in terms of accuracy and convergence speed.
Supporting information
S1 File. All supporting data are available within this article’s supplementary files (ZIP).
All figures (Figs 1–9) and Tables (Table 1–8) are sequentially numbered in the supplementary files.
https://doi.org/10.1371/journal.pone.0333210.s001
(ZIP)
References
- 1. McMahan B, Moore E, Ramage D, Hampson S, y Arcas BA. Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics. PMLR; 2017. p. 1273–1282.
- 2. Zhang C, Xie Y, Bai H, Yu B, Li W, Gao Y. A survey on federated learning. Knowledge-Based Systems. 2021;216:106775.
- 3. Aledhari M, Razzak R, Parizi RM, Saeed F. Federated Learning: A Survey on Enabling Technologies, Protocols, and Applications. IEEE Access. 2020;8:140699–725. pmid:32999795
- 4. Pandya S, Srivastava G, Jhaveri R, Babu MR, Bhattacharya S, Maddikunta PKR, et al. Federated learning for smart cities: A comprehensive survey. Sustainable Energy Technologies and Assessments. 2023;55:102987.
- 5. Rodríguez-Barroso N, Jiménez-López D, Luzón MV, Herrera F, Martínez-Cámara E. Survey on federated learning threats: Concepts, taxonomy on attacks and defences, experimental study and challenges. Information Fusion. 2023;90:148–73.
- 6. Yazdinejad A, Dehghantanha A, Karimipour H, Srivastava G, Parizi RM. A Robust Privacy-Preserving Federated Learning Model Against Model Poisoning Attacks. IEEE TransInformForensic Secur. 2024;19:6693–708.
- 7. Yazdinejad A, Dehghantanha A, Srivastava G. Hybrid privacy preserving federated learning against irregular users in next-generation Internet of Things. Journal of Systems Architecture. 2024;148:103088.
- 8. Yazdinejad A, Kong JD. Breaking Interprovincial Data Silos: How Federated Learning Can Unlock Canada’s Public Health Potential. Available at SSRN 5247328; 2025.
- 9. Wang R, Lai J, Zhang Z, Li X, Vijayakumar P, Karuppiah M. Privacy-Preserving Federated Learning for Internet of Medical Things Under Edge Computing. IEEE J Biomed Health Inform. 2023;27(2):854–65. pmid:35259124
- 10. Tang T, Han Z, Cai Z, Yu S, Zhou X, Oseni T, et al. Personalized Federated Graph Learning on Non-IID Electronic Health Records. IEEE Trans Neural Netw Learn Syst. 2024;35(9):11843–56. pmid:38502617
- 11. Pokhrel SR, Choi J. Federated Learning With Blockchain for Autonomous Vehicles: Analysis and Design Challenges. IEEE Trans Commun. 2020;68(8):4734–46.
- 12. Lin Y, Gao Z, Du H, Kang J, Niyato D, Wang Q, et al. DRL-Based Adaptive Sharding for Blockchain-Based Federated Learning. IEEE Trans Commun. 2023;71(10):5992–6004.
- 13. Cao M, Zhang L, Cao B. Toward On-Device Federated Learning: A Direct Acyclic Graph-Based Blockchain Approach. IEEE Trans Neural Netw Learn Syst. 2023;34(4):2028–42. pmid:34460402
- 14. Hsu Y-L, Liu C-F, Wei H-Y, Bennis M. Optimized Data Sampling and Energy Consumption in IIoT: A Federated Learning Approach. IEEE Trans Commun. 2022;70(12):7915–31.
- 15. Fu DS, Huang J, Hazra D, Dwivedi AK, Gupta SK, Shivahare BD, et al. Enhancing sports image data classification in federated learning through genetic algorithm-based optimization of base architecture. PLoS One. 2024;19(7):e0303462. pmid:38990969
- 16. Hsu TMH, Qi H, Brown M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335. 2019.
- 17. Wang H, Yurochkin M, Sun Y, Papailiopoulos D, Khazaeni Y. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440. 2020.
- 18. Wang J, Liu Q, Liang H, Joshi G, Poor HV. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems. 2020;33:7611–23.
- 19. Karimireddy SP, Kale S, Mohri M, Reddi S, Stich S, Suresh AT. SCAFFOLD: Stochastic controlled averaging for federated learning. In: International Conference on Machine Learning. PMLR; 2020. p. 5132–5143.
- 20. Li T, Sahu AK, Zaheer M, Sanjabi M, Talwalkar A, Smith V. Federated optimization in heterogeneous networks. In: Proceedings of Machine Learning and Systems; 2020. p. 429–450.
- 21. Qu Z, Li X, Duan R, Liu Y, Tang B, Lu Z. Generalized federated learning via sharpness aware minimization. In: International Conference on Machine Learning. PMLR; 2022. p. 18250–18280.
- 22. Foret P, Kleiner A, Mobahi H, Neyshabur B. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412. 2020.
- 23. Sun Y, Shen L, Chen S, Ding L, Tao D. Dynamic regularized sharpness aware minimization in federated learning: approaching global consistency and smooth landscape. In: International Conference on Machine Learning; 2023. p. 32991–33013.
- 24. Dai R, Yang X, Sun Y, Shen L, Tian X, Wang M, et al. FedGAMMA: Federated Learning With Global Sharpness-Aware Minimization. IEEE Trans Neural Netw Learn Syst. 2024;35(12):17479–92. pmid:37788191
- 25. Sun Y, Shen L, Huang T, Ding L, Tao D. FedSpeed: Larger local interval, less communication round, and higher generalization accuracy. 2023. https://arxiv.org/abs/2302.10429
- 26. Gou J, Yu B, Maybank SJ, Tao D. Knowledge Distillation: A Survey. Int J Comput Vis. 2021;129(6):1789–819.
- 27. Guo D, Yang D, Zhang H, Song J, Zhang R, Xu R, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948. 2025.
- 28. Bi X, Chen D, Chen G, Chen S, Dai D, Deng C, et al. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954. 2024.
- 29. Lin T, Kong L, Stich SU, Jaggi M. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems. 2020;33:2351–63.
- 30. Zhu Z, Hong J, Zhou J. Data-free knowledge distillation for heterogeneous federated learning. In: International Conference on Machine Learning. PMLR; 2021. p. 12878–12889.
- 31. Sattler F, Korjakow T, Rischke R, Samek W. FedAUX: Leveraging Unlabeled Auxiliary Data in Federated Learning. IEEE Trans Neural Netw Learn Syst. 2023;34(9):5531–43. pmid:34851838
- 32. Yao D, Pan W, Dai Y, Wan Y, Ding X, Yu C, et al. FedGKD: Toward Heterogeneous Federated Learning via Global Knowledge Distillation. IEEE Trans Comput. 2024;73(1):3–17.
- 33. 22 Sports Image Classification. Kaggle. https://www.kaggle.com/datasets/sheikhzaib/sports-image-image-classification
- 34. 100 Sports Image Classification. Kaggle. https://www.kaggle.com/datasets/gpiosenka/sports-classification
- 35. Acar DAE, Zhao Y, Navarro RM, Mattina M, Whatmough PN, Saligrama V. Federated learning based on dynamic regularization. arXiv preprint arXiv:2111.04263. 2021.
- 36. Xu J, Wang S, Wang L, Yao ACC. FedCM: Federated learning with client-level momentum. arXiv preprint arXiv:2106.10874. 2021.
- 37. Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Technical report, University of Toronto; 2009.
- 38. Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems. 2015;28.
- 39. Le Y, Yang X. Tiny imagenet visual recognition challenge. CS 231N. 2015;7(7):3.