Abstract
Class-Incremental Learning (CIL) aims to enable models to continuously learn new categories while preserving existing knowledge and avoiding catastrophic forgetting. Although parameter-expansion architectures can alleviate task interference to some extent, the representations of previously learned classes often drift or degrade as the feature subspaces continuously evolve and expand, resulting in decreased recognition performance for old classes. To address this issue, we propose an efficient CIL method—Dynamic Gated Adapter for Subspace Alignment (DGASA). Based on a frozen pre-trained backbone, DGASA introduces lightweight adapters with attention-based gating for each task to construct task-specific subspaces, while dynamically fusing cross-task information via attention mechanisms. In addition, DGASA learns a linear mapping between the old and new subspaces to achieve consistent alignment of old class prototypes in the current subspace without accessing past data. Extensive experiments demonstrate that DGASA significantly improves classification accuracy and resistance to forgetting on multiple benchmark datasets, offering strong generalization and computational efficiency.
Citation: Gu J, Huang S, Li T, Zhang S, Li M (2026) Gated subspace alignment with drift compensation for parameter-efficient Class-Incremental Learning. PLoS One 21(5): e0348270. https://doi.org/10.1371/journal.pone.0348270
Editor: Antonio Falcó, Universidad CEU Cardenal Herrera - Campus Elche, SPAIN
Received: January 7, 2026; Accepted: April 14, 2026; Published: May 7, 2026
Copyright: © 2026 Gu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Our experiments were conducted on the following public datasets: CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html), CUB-200 (http://www.vision.caltech.edu/datasets/cub_200_2011), ImageNet-R (https://github.com/hendrycks/imagenet-r), ObjectNet (https://objectnet.dev), and OmniBenchmark (https://zhangyuanhan-ai.github.io/OmniBenchmark).
Funding: This work was funded by the National Natural Science Foundation of China (Grant No. 62276118). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Class-Incremental Learning (CIL) [1], as a core task in the field of continual learning, aims to enable models to continuously acquire new categories while effectively preserving and leveraging previously learned knowledge. This capability is essential for building intelligent systems with long-term and stable learning abilities. However, directly fine-tuning neural networks on new data often causes the model to forget previously acquired information, leading to a significant drop in performance—a phenomenon known as catastrophic forgetting [2,3]. To alleviate this issue, some studies have proposed parameter-expansion-based dynamic architectures [4–9], which introduce independent parameter modules for different tasks, effectively reducing task interference. These approaches have achieved promising results, particularly when combined with powerful pre-trained models.
In non-pretrained model settings, some expandable networks [4,8,9] construct separate backbone networks for each task, thereby forming task-specific feature subspaces and achieving effective isolation between tasks. These methods are relatively effective at preserving knowledge from previous tasks. However, as the number of tasks increases, the model’s parameter size and computational overhead grow rapidly, significantly affecting inference efficiency. Moreover, they often rely on retaining samples from previous tasks to train a unified classifier, which poses substantial challenges in real-world applications where data privacy and storage are constrained.
In contrast, pretrained models, with their powerful representational capacity and strong transferability, offer a more promising solution for building low-cost continual learning systems that do not require access to previous samples [9–15]. To fully leverage the advantages of pretrained models, many approaches [16–18] draw inspiration from expandable networks by designing lightweight modules to adapt to new tasks and mapping the ever-growing feature space to classifiers corresponding to each category. This strategy balances the recognition of both old and new classes, alleviating inter-task conflicts to some extent and improving the model’s adaptability and scalability. However, due to the complex interference among task-specific features and the dynamic shifts in feature distributions during training, these models still struggle to fully prevent catastrophic forgetting.
To this end, we introduce DGASA (Dynamic Gated Adapter for Subspace Alignment), which achieves a balance between task isolation and knowledge sharing via lightweight subspace construction coupled with drift compensation, as shown in Fig 1. Specifically, DGASA freezes the pretrained backbone and inserts attention-gated adapters for each task, constructing low-dimensional subspaces and dynamically fusing them via an attention mechanism. This effectively alleviates inter-task conflicts and improves parameter efficiency. To tackle the issue of old class prototypes becoming invalid due to changes in the feature space during training, DGASA learns a linear mapping between new and old subspaces, enabling consistent reconstruction of old class prototypes in the current subspace without accessing old data, thereby maintaining stable representations. During inference, DGASA introduces an instance-level weighting strategy that dynamically fuses features based on the sample’s matching degree across subspaces, which strengthens features related to the current task while preserving the discriminative power of old knowledge. This significantly enhances the model’s generalization and incremental learning performance. The main contributions of this work are as follows:
- We propose the DGASA framework, which combines lightweight adapters with a gating mechanism to construct task-specific subspaces and enable dynamic fusion, significantly alleviating inter-task interference and improving parameter efficiency.
- We introduce a subspace drift compensation mechanism that learns a linear mapping between new and old subspaces, enabling data-free consistent reconstruction of old class prototypes, thereby enhancing model stability and data privacy.
- We conduct extensive experiments on commonly used continual learning datasets to demonstrate the effectiveness of the DGASA method. Comparative results show that our approach achieves the best performance.
Related work
Class-incremental learning
Class-incremental learning aims to enable a model to effectively retain knowledge of previously learned classes while continuously receiving information about new classes [11,19,20], thus avoiding catastrophic forgetting. Existing methods mainly fall into three categories: regularization methods, replay methods, and dynamic network methods. Among them, regularization methods [21–24] introduce additional constraints during training to limit the update magnitude of critical parameters, maintaining stable representations of old tasks. Replay methods preserve a portion of samples from old tasks [8,25,26] or use generative models to synthesize old-class data [27,28], and jointly train them with new data to help the model retain previously learned knowledge. Dynamic network methods expand the model structure [4, 5, 6, 7, 8, 29–31], for example, by adding new neurons, layers, or task-specific modules, effectively isolating knowledge between tasks and thereby improving the model’s ability to jointly adapt to and learn from both old and new tasks.
Pre-trained model-based CIL
Class-incremental learning based on pre-trained models (PTMs) [11,31–33] has become a research hotspot in recent years. With the development of pre-training techniques, an increasing number of methods incorporate PTMs into class-incremental learning to enhance model performance and generalization ability. These methods typically keep the pre-trained weights frozen and achieve lightweight parameter updates through prompt tuning [15–18], encoding new task features into a prompt pool to effectively alleviate forgetting. In addition, some approaches adopt model fusion or model merging strategies [9,31,34–37] by saving and integrating models from multiple training stages to further improve the retention of old knowledge. Prototype-based classification methods leverage the powerful representations of PTMs combined with nearest class mean (NCM) classifiers [19,38–40] to achieve stable recognition of old classes.
Method
Problem definition
Class-Incremental Learning (CIL) is a learning scenario where a model continuously learns to classify new classes to build a unified classifier [1]. Given a sequence of $B$ training datasets $\{\mathcal{D}^{1}, \mathcal{D}^{2}, \ldots, \mathcal{D}^{B}\}$, the $b$-th dataset $\mathcal{D}^{b} = \{(\mathbf{x}_{i}^{b}, y_{i}^{b})\}_{i=1}^{n_b}$ contains $n_b$ instances. Each instance $\mathbf{x}_{i}^{b} \in \mathbb{R}^{D}$ comes from class $y_{i}^{b} \in Y_b$. Here, $Y_b$ is the label space of task $b$, and $Y_b \cap Y_{b'} = \emptyset$ for $b \neq b'$, meaning the classes across different tasks do not overlap. We follow the exemplar-free setting in [16], where no samples from old classes are saved. Therefore, at the $b$-th incremental stage, we only have access to data from $\mathcal{D}^{b}$ for training. In CIL, our goal is to build a unified classifier for all seen classes $\mathcal{Y}_b = Y_1 \cup \cdots \cup Y_b$ as data evolves. Specifically, we want to find a model $f(\mathbf{x}): \mathcal{X} \rightarrow \mathcal{Y}_b$ that minimizes the expected risk:

$$f^{*} = \operatorname*{arg\,min}_{f \in \mathcal{H}} \; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_{t}^{1} \cup \cdots \cup \mathcal{D}_{t}^{b}} \, \mathbb{I}\!\left(y \neq f(\mathbf{x})\right),$$

where $\mathcal{H}$ is the hypothesis space, $\mathbb{I}(\cdot)$ is the indicator function, and $\mathcal{D}_{t}^{b}$ denotes the data distribution of task $b$. Following typical PTM-based CIL works [16–18], we assume a pretrained model is available for initializing $f(\mathbf{x})$. We decouple the PTM into a feature embedding $\phi(\cdot): \mathbb{R}^{D} \rightarrow \mathbb{R}^{d}$ and a linear classifier $W \in \mathbb{R}^{d \times |\mathcal{Y}_b|}$. The embedding function $\phi(\cdot)$ refers to the final [CLS] token in ViT, and the model output is expressed as $f(\mathbf{x}) = W^{\top}\phi(\mathbf{x})$. For clarity, we decouple the classifier as $W = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{|\mathcal{Y}_b|}]$, where $\mathbf{w}_j$ is the classifier weight for the $j$-th class.
Model overview
Dynamic Gated Adapter for Subspace Alignment (DGASA) is an efficient framework specifically designed for class-incremental learning (CIL), with its overall workflow illustrated in Fig 2. In the base class training stage, DGASA freezes the pretrained backbone network, which serves as a shared feature extractor across all tasks. On top of this backbone, a set of Gated Adapter (GA) modules is introduced for each incremental task to construct task-specific embedding subspaces. This design enables effective task isolation and alleviates catastrophic forgetting.
GA modules are inserted in each Transformer block for task-specific adaptation.
As tasks incrementally progress, the feature subspaces continuously evolve and expand, potentially rendering old class prototypes ineffective in the current subspace. To address this, DGASA proposes a subspace drift compensation mechanism, which leverages prototype pairs of new classes generated from both the old and current subspaces as supervision signals. A linear mapping is then learned to explicitly project old class prototypes from their original subspace into the current one. This process requires no access to data from previous tasks, thereby maintaining prototype consistency while ensuring privacy friendliness.
During the inference stage, DGASA incorporates an instance-aware adaptive weighting mechanism to enable collaborative decision-making across multiple subspaces. Specifically, the model adjusts the contribution of each subspace to the final classification result based on how well the test sample matches the class prototypes in each subspace. The primary task subspace provides the main discriminative power, while the remaining subspaces are weighted based on their semantic saliency scores, leading to more refined and robust ensemble predictions.
In summary, DGASA integrates pretrained knowledge sharing, task-adaptive subspace modeling via GA modules, prototype mapping compensation, and saliency-aware inference into a unified multi-subspace incremental learning framework. Without requiring access to old data, it achieves excellent generalization, memory efficiency, and resistance to forgetting.
Gated Adapter (GA) Module
In Class-Incremental Learning (CIL), a typical challenge lies in fine-tuning pre-trained models on new tasks without incurring significant computational overhead or suffering from catastrophic forgetting. Full fine-tuning of the entire model requires extensive computation and may cause the model to forget previously learned tasks. To address this, we propose a more parameter-efficient solution using adapter modules, which offer a lightweight way to incorporate task-specific knowledge into pre-trained models. Adapters are small bottleneck structures inserted within each Transformer layer, allowing task-specific adaptations while keeping the core model frozen.
Our approach builds on this concept and introduces a Gated Adapter (GA) module, which enhances the standard adapter's flexibility by incorporating a gating mechanism that controls activation dynamically. We clarify the naming: although the gating is described as attention-based in places, the actual mechanism is a pooled linear gating rather than standard QKV attention. Specifically, each Transformer layer in the backbone network contains a feed-forward module with an additional gated residual branch, which is activated conditionally based on the input features. For an input feature $x$, the output is formulated as:

$$\mathrm{GA}(x) = x + g(x) \cdot W_{\mathrm{up}} \, \sigma\!\left(W_{\mathrm{down}} \, x\right),$$

where $W_{\mathrm{down}} \in \mathbb{R}^{r \times d}$ and $W_{\mathrm{up}} \in \mathbb{R}^{d \times r}$ are weight matrices that reduce and then expand the dimensionality, $\sigma(\cdot)$ is a nonlinear activation function (GELU), and $g(x)$ is the gating function defined as:

$$g(x) = \operatorname{sigmoid}\!\left(\mathbf{w}_{g}^{\top} \operatorname{Pool}(x)\right),$$

where $\mathbf{w}_{g}$ is a linear transformation parameter, and $\operatorname{Pool}(\cdot)$ performs global average pooling across the input feature dimensions. This pooling step captures high-level characteristics of the input, producing a scalar gate that adaptively weights the importance of the residual branch.
Module vs. Framework Distinction. To avoid confusion, we explicitly distinguish:
- GA module: The individual gated adapter unit inserted in each Transformer layer (Fig 3). This is the basic building block for task-specific adaptation.
- DGASA framework: The complete method comprising: (1) a frozen pre-trained backbone, (2) a set of GA modules for each task, (3) the drift compensation mechanism for prototype alignment, and (4) the adaptive weighting mechanism for inference.
The gating mechanism uses global pooling followed by a linear layer with sigmoid activation to produce a scalar gate g(x), which modulates the adapter residual branch.
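For concreteness, the following is a minimal PyTorch sketch of one GA module as described above. The class name `GatedAdapter` and the exact placement inside each Transformer block are illustrative assumptions; the dimensions, GELU activation, sigmoid gate, and dropout rate follow the settings reported in the implementation details.

```python
import torch
import torch.nn as nn

class GatedAdapter(nn.Module):
    """Bottleneck adapter with a pooled linear gate (illustrative sketch)."""
    def __init__(self, d: int = 768, r: int = 16):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)    # W_down: d -> r
        self.up = nn.Linear(r, d, bias=False)      # W_up:   r -> d
        self.act = nn.GELU()                       # sigma(.)
        self.gate = nn.Linear(d, 1, bias=False)    # w_g, applied after pooling
        self.dropout = nn.Dropout(0.1)             # dropout on the adapter output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d) token sequence inside a Transformer block
        pooled = x.mean(dim=1)                             # global average pooling
        g = torch.sigmoid(self.gate(pooled)).unsqueeze(1)  # scalar gate per sample
        residual = self.dropout(self.up(self.act(self.down(x))))
        return x + g * residual                            # gated residual branch
```

With $d = 768$ and $r = 16$, this sketch has $2dr + d = 25{,}344$ trainable parameters per module, matching the count reported below.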
When the $b$-th task arrives, we introduce a new set of GA modules (one per Transformer layer), denoted $A_b$. As tasks progress, the model accumulates an adapter sequence $\{A_1, A_2, \ldots, A_b\}$. During inference, we concatenate the outputs of each task adapter to construct a joint feature representation:

$$\Phi(\mathbf{x}) = \left[\phi_{A_1}(\mathbf{x}); \, \phi_{A_2}(\mathbf{x}); \, \ldots; \, \phi_{A_b}(\mathbf{x})\right],$$

where $\phi_{A_b}(\mathbf{x})$ denotes the feature mapping through the GA modules of task $b$.

Since training for each task only optimizes its corresponding GA modules, learning new tasks does not affect knowledge retention of old tasks. Each GA module contains $(2dr + d)$ parameters, so the total storage cost is $B \times L \times (2dr + d)$, where $B$ is the number of tasks and $L$ is the number of Transformer blocks (12 for ViT-B/16). With $d = 768$ and $r = 16$, each GA module has 25,344 parameters, totaling 304,128 parameters per task.
For classification prediction, we adopt a prototype-based classifier. After training the $b$-th task, we extract the prototype of the $i$-th class in the adapter subspace $A_b$:

$$\mathbf{p}_{i,b} = \frac{1}{|\mathcal{D}_i^{b}|} \sum_{\mathbf{x} \in \mathcal{D}_i^{b}} \phi_{A_b}(\mathbf{x}),$$

where $\mathcal{D}_i^{b}$ denotes the training sample set of class $i$ in task $b$, and $\mathbf{p}_{i,b} \in \mathbb{R}^{d}$. Then, the prototype vectors of this class in all task subspaces are concatenated to form the joint prototype:

$$\mathbf{P}_i = \left[\mathbf{p}_{i,1}; \, \mathbf{p}_{i,2}; \, \ldots; \, \mathbf{p}_{i,b}\right].$$

During prediction, cosine similarity is used to compare the input feature embedding $\Phi(\mathbf{x})$ with the class prototypes $\mathbf{P}_i$, and the class with the highest similarity is chosen as the predicted label.
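A minimal sketch of this prototype-based classification is given below, assuming the per-subspace features have already been extracted; the helper names `extract_prototypes` and `predict` are illustrative, not part of our released code.

```python
import torch
import torch.nn.functional as F

def extract_prototypes(features: torch.Tensor, labels: torch.Tensor) -> dict:
    """Class prototypes as mean features in one adapter subspace.
    features: (N, d) embeddings from phi_{A_b}; labels: (N,) class ids."""
    return {int(c): features[labels == c].mean(dim=0) for c in labels.unique()}

def predict(joint_feat: torch.Tensor, joint_protos: torch.Tensor) -> torch.Tensor:
    """Nearest-prototype prediction with cosine similarity.
    joint_feat: (B*d,) concatenated features Phi(x);
    joint_protos: (C, B*d) concatenated prototypes, one row per seen class."""
    sims = F.cosine_similarity(joint_feat.unsqueeze(0), joint_protos, dim=1)
    return sims.argmax()
```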
Reconstructing prototypes under distribution drift
In incremental learning, as new tasks and distributions arrive, the model adapts by adding new adapters that construct embedding subspaces for each task. However, a challenge arises when it becomes necessary to recompute the prototypes for each class to ensure consistency with the current feature embedding space. Since accessing past data is often not feasible, directly computing the prototypes of old classes in the new subspace is problematic, leading to a mismatch between the prototype matrices at different stages of learning. This mismatch impacts the classifier’s ability to provide a unified and accurate representation of the feature space across all tasks.
To address this issue, we formalize it as a prototype completion task. Given two embedding subspaces (old and new) and two class sets (old and new), we aim to estimate the prototypes of old classes in the new subspace, denoted $P_{o,n}$, by leveraging the following three observable prototype matrices: the prototypes of old classes in the old subspace $P_{o,o}$, the prototypes of new classes in the old subspace $P_{n,o}$, and the prototypes of new classes in the new subspace $P_{n,n}$.
To reconstruct the prototypes, we adopt the concept of drift compensation [41], which models the geometric transformation between embedding subspaces. Unlike semantic-based approaches, this formulation does not rely on semantic similarity between classes, making it applicable even when new classes are sparse or semantically distant.
Specifically, we construct a paired sample set using the prototypes of new classes in both the old and new subspaces. These paired prototypes serve as anchors to estimate a linear mapping $W$ from the old subspace to the new subspace. We formulate the mapping estimation as a least-squares problem:

$$\min_{W} \; \left\| P_{n,n} - P_{n,o} W \right\|_{F}^{2},$$

which admits a closed-form solution via the normal equations:

$$W = \left( P_{n,o}^{\top} P_{n,o} \right)^{-1} P_{n,o}^{\top} P_{n,n}.$$

After obtaining the mapping matrix $W$, we reconstruct the prototypes of old classes by projecting the old subspace prototypes into the new subspace:

$$\hat{P}_{o,n} = P_{o,o} \, W.$$
This projection does not require access to past data and enables consistent prototype alignment across tasks. In addition to downstream classification accuracy, we also evaluate the quality of drift compensation using direct metrics such as prototype reconstruction error and cosine similarity before and after alignment.
Mapping Strategy: It is crucial to clarify that the mapping matrix $W$ is estimated globally with respect to all previously learned tasks, rather than being composed sequentially between adjacent tasks. Specifically, when a new task $b$ arrives, the old subspace is defined by the concatenated feature space of all previous adapters $\{A_1, \ldots, A_{b-1}\}$, and the new subspace is defined by the concatenated space including the new adapter $A_b$. The prototypes of the new classes ($P_{n,o}$ and $P_{n,n}$) are computed in these two respective subspaces and used to solve for a single, unified mapping $W$. This global mapping is then applied to all old class prototypes ($P_{o,o}$) to project them into the new subspace. This approach ensures that the mapping is solved only once per incremental session and avoids the potential for error accumulation that could arise from composing multiple sequential transformations. Computationally, solving this mapping requires a single closed-form least-squares solve per task, which is negligible compared to the cost of training the adapters.
Numerical Stability and Robust Formulation. In practice, the matrix $P_{n,o}^{\top} P_{n,o}$ may be ill-conditioned or even singular, particularly when the number of new classes $|Y_b|$ is small relative to the embedding dimension $d$, or when the prototype matrix $P_{n,o}$ is rank-deficient. In such cases, directly computing the inverse may lead to numerical instability and degraded reconstruction quality.

To address this issue, we adopt a regularized least-squares formulation based on ridge (Tikhonov) regularization:

$$W = \left( P_{n,o}^{\top} P_{n,o} + \lambda I \right)^{-1} P_{n,o}^{\top} P_{n,n},$$

where $\lambda > 0$ is a regularization parameter and $I$ is the identity matrix. This formulation improves numerical stability by ensuring that the matrix to be inverted is well-conditioned. In addition, we consider an alternative formulation based on the Moore–Penrose pseudoinverse:

$$W = P_{n,o}^{+} \, P_{n,n},$$

which provides a minimum-norm solution even when $P_{n,o}$ is rank-deficient.

In our implementation, $\lambda$ is selected as a small constant and further validated via a sensitivity analysis. We empirically evaluate the robustness of the mapping with respect to $\lambda$, the number of new classes, and the conditioning of $P_{n,o}^{\top} P_{n,o}$.
Adaptive weighting by instance-level significance
So far, we have introduced the subspace expansion and adapter incremental learning mechanisms, and restored the prototypes of old classes through the prototype completion strategy. After completing adapter expansion and prototype completion, we construct a complete classifier with the following prototype matrix:

$$\mathbf{P} = \begin{bmatrix} P_{1,1} & \hat{P}_{1,2} & \cdots & \hat{P}_{1,b} \\ P_{2,1} & P_{2,2} & \cdots & \hat{P}_{2,b} \\ \vdots & \vdots & \ddots & \vdots \\ P_{b,1} & P_{b,2} & \cdots & P_{b,b} \end{bmatrix},$$

where $P_{i,j}$ stacks the prototypes of the classes introduced in task $i$, extracted in the subspace of task $j$, and the off-diagonal terms above the main diagonal (marked with a hat) are completed according to the estimation formula.
During inference, to obtain the classification logits for the $b$-th task, we perform multiple prototype-embedding matches across different subspaces and aggregate them into the final score:

$$S(\mathbf{x}) = \sum_{i=1}^{b} S_i(\mathbf{x}), \qquad S_i(\mathbf{x}) = \left[\cos\!\left(\phi_{A_i}(\mathbf{x}), \, \mathbf{p}_{j,i}\right)\right]_{j=1}^{|\mathcal{Y}_b|},$$

where $\phi_{A_i}(\mathbf{x})$ denotes the features extracted by the $i$-th adapter $A_i$ and $\mathbf{p}_{j,i}$ is the prototype of class $j$ in subspace $i$. Although this ensemble approach helps leverage information from multiple subspaces, we note that only the adapter $A_b$ corresponding to the $b$-th task is specifically trained for that task, so the features $\phi_{A_b}(\mathbf{x})$ are more task-discriminative.
To address this, we propose a new inference mechanism — Adaptive Weighting by Instance-level Significance. Its core idea is to quantify the matching degree of the input sample across different subspaces and dynamically adjust each subspace’s contribution weight to the final classification result based on this significance. Compared to using fixed scaling coefficients for all non-primary subspaces, our method is more flexible and better reflects the semantic correlation between the sample and each subspace.
The specific steps are as follows:
First, compute subspace significance. For a given input sample $\mathbf{x}$, we feed it into the adapters of all subspaces and extract a feature representation $\phi_{A_i}(\mathbf{x})$ for each subspace. Then, we calculate the similarity between these features and the corresponding prototypes $\{\mathbf{p}_{j,i}\}$ to measure the matching degree of the sample in the $i$-th subspace. This matching degree is called the significance score $s_i$, calculated as:

$$s_i = \max_{j} \; \operatorname{sim}\!\left( \phi_{A_i}(\mathbf{x}), \, \mathbf{p}_{j,i} \right),$$

where $\operatorname{sim}(\cdot,\cdot)$ denotes a similarity function; here, we use cosine similarity. Note that to emphasize the primary subspace for the current task, we compute significance only for the non-primary subspaces $i \neq b$.

Next, normalize the significance scores into a weight distribution. To ensure comparability of contributions across subspaces, we apply the softmax function to the significance scores to obtain the weight of each non-primary subspace:

$$w_i = \frac{\exp(s_i)}{\sum_{k \neq b} \exp(s_k)}, \qquad i \neq b.$$

This normalization can be seen as an adaptive attention allocation over all non-primary subspaces, where higher significance scores correspond to larger weights, thereby enhancing the role of that subspace in the final inference.

Finally, construct the weighted classification score. After obtaining the primary subspace matching score $S_b(\mathbf{x})$, we combine it with the significance-weighted scores of all non-primary subspaces to obtain the final classification logits:

$$S(\mathbf{x}) = S_b(\mathbf{x}) + \alpha \sum_{i \neq b} w_i \, S_i(\mathbf{x}),$$

where $\alpha$ is a balancing parameter controlling the relative contribution of the non-primary subspaces in the overall score. In our experiments, we set $\alpha = 0.15$.
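A compact sketch of this inference procedure is shown below, assuming per-subspace features and per-subspace prototype matrices are precomputed; the function name and tensor layout are illustrative, while the default `alpha` follows the value selected in our ablation.

```python
import torch
import torch.nn.functional as F

def weighted_logits(feats: list[torch.Tensor],
                    protos: list[torch.Tensor],
                    primary: int,
                    alpha: float = 0.15) -> torch.Tensor:
    """Instance-level adaptive weighting over task subspaces.
    feats[i]:  (d,) feature phi_{A_i}(x) from the i-th adapter.
    protos[i]: (C, d) prototypes of all seen classes in subspace i.
    primary:   index of the current task's (primary) subspace."""
    # Per-subspace class scores: cosine similarity against every prototype.
    scores = [F.cosine_similarity(f.unsqueeze(0), P, dim=1)
              for f, P in zip(feats, protos)]                 # list of (C,)
    # Significance of each non-primary subspace = its best prototype match.
    others = [i for i in range(len(feats)) if i != primary]
    sig = torch.stack([scores[i].max() for i in others])
    w = torch.softmax(sig, dim=0)                             # weights over others
    logits = scores[primary].clone()
    for w_i, i in zip(w, others):
        logits += alpha * w_i * scores[i]
    return logits                                             # (C,) final logits
```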
Experiments
In this section, we first provide an overview of the datasets and evaluation metrics used, followed by a comprehensive description of the model architecture and experimental setup. We then present and analyze the results of the comparative experiments. Finally, we conduct ablation studies to validate the effectiveness of each component of the model.
Datasets and evaluation metrics
Datasets
Since pre-trained models may possess extensive knowledge from upstream tasks, they are often evaluated on a variety of datasets to assess their transferability and generalization capabilities in different learning scenarios. In this work, we follow the experimental setups in [18,19] to evaluate performance on several widely-used benchmark datasets, including CIFAR-100 [42], CUB-200 [43], ImageNet-R [44], ObjectNet [45], and OmniBenchmark [46]. These datasets are particularly useful for evaluating class-incremental learning (CIL) approaches, as they consist of a variety of task distributions and domain shifts. The datasets we use for evaluation include both standard CIL benchmarks and out-of-distribution datasets, offering a wide range of challenges that test the model’s ability to generalize across domains. CIFAR-100 consists of 100 classes, which are commonly used in incremental learning tasks. CUB-200, containing 200 classes, is widely used for fine-grained recognition and provides a higher level of class granularity. ImageNet-R and ObjectNet, each containing 200 classes, represent more challenging benchmarks, as they come from distribution shifts or domain gaps compared to the ImageNet dataset, which is typically used for pre-training. OmniBenchmark, with 300 classes, is another large-scale benchmark that provides diverse challenges in out-of-distribution testing, offering a more complex scenario for evaluating domain adaptation and generalization. These datasets provide a comprehensive testing ground for evaluating both generalization to new tasks and robustness in the face of domain drift.
Dataset split: To ensure consistency and comparability across different methods, we follow the established benchmark settings outlined in [1,18]. Specifically, we use the notation Inc-n to represent the class split, where n indicates the number of classes introduced at each incremental stage. In this framework, the first training stage starts with zero classes, meaning the model must progressively learn to recognize new classes as they are introduced. This incremental learning setup closely mirrors real-world scenarios where a model must continually adapt to new information without the luxury of accessing previous data. For a fair comparison across all methods, we adopt the same random seed for shuffling the class order before performing data splitting, as recommended in [1]. This ensures that the class order is randomized in a consistent manner across all experiments, preventing any potential bias introduced by a particular class ordering. Furthermore, we maintain consistency in the training and testing sets by aligning them with the settings used in [19] across all compared methods. This consistent dataset partitioning ensures a rigorous and fair evaluation of the methods under consideration, allowing us to draw reliable conclusions about their performance in incremental and out-of-distribution learning tasks.
Evaluation metrics
Following the benchmark protocol [42], we use $\mathcal{A}_b$ to denote the model's accuracy after the $b$-th stage. Specifically, we adopt $\mathcal{A}_B$ (the performance after the last stage) and $\bar{\mathcal{A}} = \frac{1}{B}\sum_{b=1}^{B} \mathcal{A}_b$ (the average performance over all incremental stages) as evaluation metrics.
Implementation details
All experiments are conducted on a single NVIDIA RTX 4090 GPU (24 GB VRAM) using the PyTorch framework (version 2.0.1) [47]. Following prior works [18,19], we adopt a ViT-B/16 backbone pre-trained on ImageNet-21K (“vit_base_patch16_224” from the timm library, 86.6M parameters) and further fine-tuned on ImageNet-1K under the standard protocol. For all incremental tasks, DGASA is trained using the AdamW optimizer with a weight decay of 0.01. The learning rate is decayed by a factor of 0.1 at 50% and 75% of the total training steps. We use a batch size of 64 and train for 15 epochs per task with an input resolution of $224 \times 224$. The training objective is a cosine-similarity-based cross-entropy loss with temperature scaling over the current task classes, where class prototypes are computed as the mean feature representations of all samples belonging to each class in the current subspace. The GA module employs global average pooling for gating, GELU activation in the adapter, sigmoid activation for the gate, and a dropout rate of 0.1 applied to the adapter output. For prototype-based alignment, prototypes are updated after each task, and the drift compensation mapping is solved using ridge regularization with parameter $\lambda$. The adaptive weighting coefficient is set to $\alpha = 0.15$, selected via validation on a held-out subset. To ensure robustness, all experiments are repeated with five random seeds (42, 1234, 5678, 91011, 121314), and we report the mean and standard deviation. The class order is consistently shuffled across all methods for each seed to ensure fair comparison.
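As an illustration of this training objective, the sketch below shows a cross-entropy loss over temperature-scaled cosine logits for the current task's classes; the temperature value and the commented optimizer setup are placeholders rather than the exact settings from our experiments.

```python
import torch
import torch.nn.functional as F

def cosine_ce_loss(features: torch.Tensor,
                   prototypes: torch.Tensor,
                   targets: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    """Cross-entropy over temperature-scaled cosine logits.
    features:   (N, d) embeddings of the current batch.
    prototypes: (C_cur, d) class prototypes of the current task.
    targets:    (N,) labels indexed within the current task.
    temperature is an illustrative default, not the paper's exact value."""
    feats = F.normalize(features, dim=1)
    protos = F.normalize(prototypes, dim=1)
    logits = feats @ protos.T / temperature   # (N, C_cur) cosine logits
    return F.cross_entropy(logits, targets)

# Optimizer and step-decay schedule as described above (the learning rate
# below is a placeholder; weight decay and milestones follow the text):
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-3, weight_decay=0.01)
# scheduler = torch.optim.lr_scheduler.MultiStepLR(
#     optimizer,
#     milestones=[int(0.5 * total_steps), int(0.75 * total_steps)],
#     gamma=0.1)
```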
Comparisons with the State-of-the-arts
In this section, we provide a comprehensive evaluation of DGASA by comparing its performance with several state-of-the-art methods on five widely used benchmark datasets. The results, summarized in Table 1, show that DGASA achieves the best performance across all datasets, significantly surpassing existing methods, including CODA-Prompt and ADAM, when using the ViT-B/16-IN21K pre-trained model. Specifically, DGASA excels in both the average accuracy across incremental stages ($\bar{\mathcal{A}}$) and the accuracy after the final incremental stage ($\mathcal{A}_B$), outperforming the other methods by substantial margins.
To ensure a fair comparison, we selected several representative class-incremental learning (CIL) methods based on pre-trained transformers (PTMs), including both recent and classical approaches. These methods include L2P [18], DualPrompt [17], CODA-Prompt [16], SimpleCIL [19], EASE [31], and ADAM [19]. In addition, we also compare DGASA to traditional CIL methods that are equipped with the same pre-trained models, such as LwF [23], SDC [41], iCaRL [1], DER [29], FOSTER [8], and MEMO [11]. These comparisons provide a comprehensive evaluation of DGASA’s performance against the most prominent methods in the field.
The results from Table 1 clearly demonstrate the superiority of DGASA across all five benchmark datasets: CIFAR Inc5, CUB Inc10, IN-R Inc5, ObjNet Inc10, and Omnibench Inc30. For instance, DGASA achieves the highest average accuracy ($\bar{\mathcal{A}}$) and final-stage accuracy ($\mathcal{A}_B$) on CIFAR Inc5, with values of 92.11 and 86.25, respectively, outperforming the second-best method, EASE, by a notable margin. Similar trends are observed on the other datasets, where DGASA consistently outperforms the competing methods on both metrics. This highlights DGASA's strong generalization ability and robustness across different incremental learning scenarios. In addition to comparing DGASA with other PTM-based CIL methods, we also evaluate its performance against traditional exemplar-based methods, as shown in Table 2. These traditional methods typically rely on storing a fixed number of exemplars for each class to mitigate catastrophic forgetting and maintain previous knowledge. For the comparison, we follow the standard practice in class-incremental learning by setting the number of exemplars to 20 per class, as in the method proposed by Rebuffi et al. [1]. Despite not utilizing exemplars, DGASA maintains competitive performance when compared to these exemplar-based methods. This is a significant achievement, as it shows that DGASA's approach, which does not rely on memory replay or exemplar storage, can still effectively preserve knowledge and achieve superior performance across incremental learning tasks.
Overall, these results validate the effectiveness and efficiency of DGASA as a state-of-the-art method for class-incremental learning, demonstrating its ability to outperform both recent PTM-based approaches and traditional exemplar-based methods, all while avoiding the need for storing exemplars.
Ablation study
To analyze the impact of each component on the overall performance of the model, we performed ablation experiments on the CIFAR-100 dataset for the Inc5 task, validating the effectiveness of the four key components proposed: PTM, Gated Adapter (GA) module, Drift Compensation, and Adaptive Weighting by Instance-level Significance. The experimental results, as shown in Table 3, confirm the effectiveness of these methods. Among them, PTM refers to using a frozen pre-trained model solely as a feature extractor; GA indicates the introduction of Gated Adapter modules for task-specific subspace modeling; Drift Compensation refers to compensating for the drift of old class prototypes through the Reconstructing Prototypes under Distribution Drift mechanism; Adaptive Weighting dynamically adjusts the contribution of different subspaces during inference using the Adaptive Weighting by Instance-level Significance method.
We next conducted an ablation study to investigate the impact of the number and positions of the Gated Adapters on model performance, as shown in Table 4. The ablation results demonstrate that the insertion position of the GA module significantly affects model performance. Specifically, inserting adapters into all layers (a total of 12) achieved the best performance on CIFAR-100, reaching 92.11% ($\bar{\mathcal{A}}$) and 86.25% ($\mathcal{A}_B$), respectively. This result indicates that full-layer insertion enables comprehensive feature adaptation from low-level to high-level representations while keeping the pre-trained parameters frozen, thereby enhancing the model's expressiveness and generalization ability on new tasks. In contrast, inserting adapters only in the early or late layers leads to slightly degraded performance due to the limited adaptation scope.
An additional ablation study was conducted to evaluate the effect of the inference weighting coefficient $\alpha$, with results summarized in Table 5. The study demonstrates that $\alpha$ significantly affects model performance. As $\alpha$ increases from 0.05 to 0.15, performance steadily improves on both datasets, indicating that moderate amplification of the adapter's modulation of backbone features facilitates adaptation to new tasks. Further increasing $\alpha$ to 0.2 or 0.25 slightly degrades performance, suggesting that excessive modulation disrupts the original feature distribution and reduces stability on previous tasks. Based on these observations, $\alpha = 0.15$ is selected as the optimal setting and is used consistently across all benchmarks, achieving the best balance between the main and non-main subspaces.
Finally, to verify the effectiveness of the proposed adaptive weighting mechanism, we conducted an ablation study comparing Adaptive Weighting by Instance-level Significance with the simpler Fixed Weight Fusion by Subspace Prototypes. Fig 4 presents the t-SNE visualizations of the feature embeddings generated by the two methods: Fig 4(a) corresponds to the adaptive weighting approach, while Fig 4(b) represents the fixed weight fusion baseline. The results clearly demonstrate that the adaptive weighting mechanism produces more compact clusters with higher inter-class separability, indicating improved class discriminability and better alignment of features within subspaces. This visualization confirms that incorporating instance-level weighting effectively enhances the model’s adaptability to incremental tasks, alleviates feature confusion, and leads to superior performance.
Analysis
In this section, we provide a comprehensive analysis of DGASA across multiple dimensions, including parameter efficiency, inference cost, continual learning metrics, the adaptive weighting mechanism, and the stability of drift compensation.
Scalability and efficiency analysis
A key contribution of DGASA is its parameter-efficient design. To rigorously evaluate this claim, we provide a comprehensive analysis of model scalability, memory growth, and inference efficiency as the number of tasks B increases. All experiments are conducted on CIFAR-100 with the Inc5 setting (20 tasks total) using a single NVIDIA RTX 4090 GPU.
Parameter growth analysis
DGASA maintains a frozen pre-trained ViT-B/16 backbone and only adds lightweight adapters for each new task. The trainable parameters per adapter consist of:
- Down-projection: $W_{\mathrm{down}} \in \mathbb{R}^{r \times d}$, with $d = 768$, $r = 16$
- Up-projection: $W_{\mathrm{up}} \in \mathbb{R}^{d \times r}$
- Gating layer: $\mathbf{w}_{g} \in \mathbb{R}^{d}$ (applied after pooling)

Thus, the number of trainable parameters per adapter is:

$$2dr + d = 2 \times 768 \times 16 + 768 = 25{,}344.$$

Since each Transformer block contains one adapter and ViT-B/16 has $L = 12$ blocks, the total parameters per task are:

$$12 \times 25{,}344 = 304{,}128 \approx 0.3\ \mathrm{M}.$$

After $B$ tasks, the total trainable parameters are approximately $0.3\,B$ M, while the frozen backbone contains approximately 86 M parameters. Table 6 summarizes the parameter growth across different numbers of tasks.
Prototype memory cost
In addition to adapter parameters, DGASA stores class prototypes for classification. For each class, we store a prototype vector of dimension $B \times d$ (concatenated across all task subspaces). The memory cost for prototypes is:

$$|\mathcal{Y}_B| \times B \times d \ \text{floats}.$$

For CIFAR-100 with $|\mathcal{Y}_B| = 100$, $B = 20$, and $d = 768$, this equals $100 \times 20 \times 768 = 1{,}536{,}000$ floats (about 6 MB in float32).
Table 7 shows the prototype memory growth across tasks.
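The arithmetic behind these parameter and memory figures can be reproduced in a few lines; the numbers below follow directly from the quantities stated above.

```python
# Adapter parameters per task and prototype memory, as derived above.
d, r, L = 768, 16, 12          # embedding dim, bottleneck dim, ViT-B/16 blocks
per_module = 2 * d * r + d     # down-projection + up-projection + gate
per_task = L * per_module      # one GA module per Transformer block

B, num_classes = 20, 100       # CIFAR-100 Inc5: 20 tasks, 100 classes
total_adapter_params = B * per_task
prototype_floats = num_classes * B * d
prototype_mbytes = prototype_floats * 4 / 1e6   # float32 storage

print(per_module, per_task, total_adapter_params)    # 25344 304128 6082560
print(prototype_floats, round(prototype_mbytes, 2))  # 1536000 6.14
```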
Inference latency and throughput
We measure inference efficiency by evaluating latency (ms per sample) and throughput (samples per second) as the number of tasks increases. For each test sample, DGASA must:
- Forward pass through the frozen backbone (shared across all tasks)
- Forward pass through all B task-specific adapters
- Compute cosine similarity with all $|\mathcal{Y}_B|$ seen class prototypes
The inference cost scales linearly with the number of adapters and prototypes.
As shown in Table 8, latency increases from 4.21 ms to 7.38 ms as tasks grow from 1 to 20, while throughput decreases from 237.5 to 135.5 samples per second. The FLOPs increase modestly (16.8 G to 18.4 G) since the frozen backbone computation dominates. Peak memory usage remains under 1.2 GB, well within typical GPU limits.
Break-even analysis
To determine when modular growth begins to harm inference efficiency, we compare DGASA against two baselines:
- Full Fine-tuning: A single model updated on all data (upper bound on accuracy but suffers catastrophic forgetting)
- Static Adapter Baseline: A single adapter trained on the union of all tasks (lower memory but lower accuracy)
We summarize the efficiency–accuracy trade-off under different numbers of tasks in Table 9.
Key observations:
- Accuracy-efficiency trade-off: DGASA achieves significantly higher accuracy (86.25% vs. 49.08%) than the static adapter baseline, with only about 1.75× higher latency (7.38 ms vs. 4.21 ms) after 20 tasks.
- Break-even point: The modular growth begins to show diminishing returns in terms of efficiency around B = 10 tasks, where latency increases by approximately 35% compared to the single-task setting. However, the accuracy continues to improve from 84.52% (B = 5) to 86.25% (B = 20), demonstrating sustained performance gains from incremental adapter addition.
- Comparison to full fine-tuning: Full fine-tuning achieves lower latency (4.21 ms) but catastrophically forgets previous tasks (19.71% accuracy). DGASA trades a modest 3.17 ms increase in latency for a 66.54% absolute accuracy improvement.
Practical recommendations: For applications with strict latency constraints (e.g., real-time systems), practitioners can limit the number of adapters by grouping tasks or using a smaller backbone. For most practical scenarios, DGASA’s linear scaling provides an acceptable trade-off given the substantial accuracy benefits.
Comprehensive CIL metrics analysis
To provide a more complete evaluation of DGASA’s performance beyond average and final accuracy, we report standard continual learning metrics: average forgetting, backward transfer (BWT), and old/new class accuracy splits.
Definition of metrics. Following standard practice [25,26], we define:
- Average Forgetting measures how much the model forgets previously learned tasks:

$$F = \frac{1}{B-1} \sum_{b=1}^{B-1} \left( \max_{t \in \{1, \ldots, B-1\}} a_{t,b} - a_{B,b} \right),$$

where $a_{t,b}$ is the accuracy on task $b$ after training on task $t$.
- Backward Transfer (BWT) measures the influence of learning new tasks on old task performance:

$$\mathrm{BWT} = \frac{1}{B-1} \sum_{b=1}^{B-1} \left( a_{B,b} - a_{b,b} \right),$$

where negative BWT indicates forgetting (both quantities are computed as in the sketch following this list).
- Old vs. New Class Accuracy decomposes final-stage accuracy into performance on classes learned in previous tasks and classes learned in the current task.
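These definitions translate directly into a few lines of code. The sketch below assumes an accuracy matrix `acc[t, b]` holding the accuracy on task b after training task t, as in the per-task accuracy matrix reported later in this section.

```python
import numpy as np

def forgetting_and_bwt(acc: np.ndarray) -> tuple[float, float]:
    """acc[t, b]: accuracy on task b after training task t (shape (B, B);
    entries with b > t are unused). Returns (average forgetting, BWT)."""
    B = acc.shape[0]
    forgetting = np.mean([acc[:B - 1, b].max() - acc[B - 1, b]
                          for b in range(B - 1)])
    bwt = np.mean([acc[B - 1, b] - acc[b, b] for b in range(B - 1)])
    return float(forgetting), float(bwt)
```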
Forgetting and backward transfer analysis. Table 10 reports average forgetting and backward transfer across all benchmark datasets.
Key observations:
- DGASA achieves the lowest forgetting across all datasets (4.87% on CIFAR, 5.23% on CUB, 5.56% on ImageNet-R), demonstrating superior stability in preserving old knowledge
- The backward transfer of DGASA is closest to zero among all methods (less negative), indicating that learning new tasks has minimal detrimental impact on previously learned classes
- The combination of low forgetting and near-zero backward transfer confirms that our drift compensation mechanism effectively maintains old class representations
Old vs. New Class Accuracy Across Sessions. We analyze the evolution of old and new class accuracy across incremental sessions. Fig 5 shows the per-session accuracy decomposition for CIFAR-100 Inc5.
The analysis reveals:
- Old class accuracy remains stable throughout the incremental sequence (ranging from 85.2% to 86.1%), confirming effective mitigation of catastrophic forgetting
- New class accuracy is consistently high (around 87–89%), indicating strong adaptation to novel tasks
- The gap between old and new class accuracy is minimal (within 2–3%), demonstrating balanced performance across all classes
Statistical significance testing
To validate that DGASA’s improvements are statistically significant, we conduct paired t-tests comparing DGASA against the second-best method (EASE) across 5 random seeds. Table 11 reports the p-values.
Per-Task Accuracy Matrix. To provide a complete picture of performance across all tasks, Fig 6 presents the per-task accuracy matrix for CIFAR-100 Inc5, where entry (i, j) shows the accuracy on task j after training task i.
Diagonal entries show performance on the current task; off-diagonal entries show retention of previous tasks.
The matrix shows:
- Strong diagonal performance (87–92%) across all tasks
- Minimal decay in off-diagonal entries, confirming effective knowledge retention
- Performance on early tasks remains above 85% even after all 20 tasks, demonstrating excellent stability
Analysis of adaptive weighting mechanism
While DGASA achieves state-of-the-art performance across all benchmarks, we note that on CUB Inc10 the improvement over ADAM+Adapter is marginal (92.38% vs. 91.96% for $\bar{\mathcal{A}}$, and 86.91% vs. 86.48% for $\mathcal{A}_B$). To justify the added complexity of our instance-level adaptive weighting mechanism, we conduct a comprehensive analysis examining when and why this mechanism provides benefits, and under what conditions its gains are limited.
Old vs. New class accuracy split
We decompose the final-stage accuracy ($\mathcal{A}_B$) into performance on previously learned classes (old) and newly introduced classes (new) for the CUB Inc10 benchmark. Table 12 reveals that adaptive weighting primarily benefits old class recognition, which is critical for mitigating catastrophic forgetting.
Per-class confusion analysis. To understand where improvements occur, we analyze the confusion patterns between old and new classes. Fig 7 visualizes the normalized confusion matrices for ADAM+Adapter and DGASA on CUB Inc10.
The analysis reveals that:
- DGASA reduces confusion between visually similar old and new classes (e.g., between “Black-capped Chickadee” (old) and “Carolina Chickadee” (new))
- The adaptive weighting mechanism helps disambiguate samples that are semantically related to multiple tasks by dynamically emphasizing the most relevant subspace
- Most improvements occur in the off-diagonal blocks (old vs. new confusion), confirming that adaptive weighting enhances cross-task discrimination
Adaptive weight distribution analysis
To understand how the instance-level weighting behaves, we analyze the distribution of weights across different scenarios. Fig 8 shows the weight distributions for samples from old and new classes when evaluated on a later task.
Key observations:
- For samples from old classes, the weights are more evenly distributed across subspaces (average entropy = 1.84), indicating that the model leverages multiple subspaces to preserve old knowledge
- For samples from new classes, the primary subspace receives a significantly higher average weight, as the model relies primarily on the task-specific adapter
- The correlation between the weights and prediction confidence is positive, suggesting that the mechanism assigns higher weights to subspaces that produce more confident predictions
Failure-Case Analysis. We examine cases where DGASA still makes errors to understand limitations of the adaptive weighting mechanism. Table 13 shows representative failure examples on CUB Inc10.
Analysis of failure cases reveals:
- Most errors occur between visually similar species that belong to different tasks
- In these cases, the adaptive weights become nearly uniform, indicating the mechanism struggles to disambiguate highly similar classes across task boundaries
- This suggests that the primary limitation is not the weighting mechanism itself, but rather the discriminative capacity of the underlying adapters for extremely fine-grained distinctions
Calibration analysis. We evaluate model calibration using Expected Calibration Error (ECE) to assess whether adaptive weighting affects prediction confidence reliability. Table 14 reports ECE across methods.
Ablation: When Does Adaptive Weighting Help?
We conduct targeted ablations to identify conditions where adaptive weighting provides the most benefit. Table 15 shows results on CUB Inc10 under different configurations.
Key findings:
- High inter-task similarity: Adaptive weighting provides the largest gains (+0.82% vs. baseline +0.43%), as it helps disambiguate semantically related classes across tasks
- Low inter-task similarity: Gains are marginal (+0.07%), as tasks are already well-separated and the primary subspace suffices
- Small adapter capacity: Adaptive weighting yields substantial improvement (+1.44%), compensating for limited task-specific representation by aggregating cross-task information
- Large adapter capacity: Gains are smaller (+0.12%), as individual adapters already capture rich task-specific features
These results explain the marginal improvement on CUB Inc10: the dataset exhibits moderate inter-task similarity, and the default adapter capacity (r = 16) is sufficient, leaving limited room for improvement from adaptive weighting. However, in challenging scenarios with high task similarity or constrained adapter capacity, the mechanism provides significant benefits.
Computational overhead analysis
We quantify the additional computational cost of adaptive weighting. Table 16 shows the overhead relative to the baseline inference.
The adaptive weighting mechanism adds only 0.57 ms (7.7% overhead) and 480 M FLOPs (2.6% overhead) to the inference cost, which is negligible given the accuracy benefits in challenging scenarios.
Summary
The adaptive weighting mechanism provides:
- Primary benefit: Improved old class accuracy (+1.61% on CUB Inc10) by leveraging cross-subspace information
- Greatest impact: Scenarios with high inter-task similarity (+0.82%) or constrained adapter capacity (+1.44%)
- Limited benefit: When tasks are already well-separated or adapters have sufficient capacity
- Negligible cost: Only 7.7% inference overhead
The marginal gains on CUB Inc10 are therefore justified by the mechanism’s substantial benefits in more challenging scenarios and its low computational overhead.
Stability analysis of drift compensation
We analyze the numerical stability of the proposed drift compensation mapping from three perspectives: regularization, increment size, and long-term error behavior.
Regularized mapping formulation.
To improve numerical robustness, we adopt a ridge-regularized solution instead of directly solving the normal equations:

$$W = \left( P_{n,o}^{\top} P_{n,o} + \lambda I \right)^{-1} P_{n,o}^{\top} P_{n,n},$$

where $\lambda$ controls the trade-off between stability and fitting accuracy. In practice, $\lambda$ is selected via a small validation split.
Evaluation metrics.
Beyond classification accuracy, we evaluate the quality of drift compensation using the prototype reconstruction error

$$\epsilon_{\mathrm{proto}} = \frac{\left\| \hat{P}_{o,n} - P_{o,n} \right\|_{F}}{\left\| P_{o,n} \right\|_{F}},$$

which measures how closely the reconstructed prototypes match reference old-class prototypes computed in the new subspace for analysis purposes. We also report the cosine similarity between reconstructed and ground-truth prototypes in the new feature space.
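As a sketch, these two diagnostics can be computed as follows, assuming ground-truth prototypes are available for analysis; the relative Frobenius normalization is our choice of convention, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def drift_metrics(P_hat: torch.Tensor, P_true: torch.Tensor) -> tuple[float, float]:
    """P_hat: (C_old, d) reconstructed old-class prototypes; P_true: ground truth.
    Returns (relative Frobenius reconstruction error, mean cosine similarity)."""
    rel_err = torch.linalg.norm(P_hat - P_true) / torch.linalg.norm(P_true)
    cos = F.cosine_similarity(P_hat, P_true, dim=1).mean()
    return rel_err.item(), cos.item()
```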
Effect of regularization.
We vary $\lambda$ and report performance in Table 17.
Effect of increment size and conditioning.
We further analyze how the number of newly introduced classes affects numerical stability. Since the rank and conditioning of $P_{n,o}^{\top} P_{n,o}$ depend on the number of new class prototypes, smaller increments tend to produce more ill-conditioned systems. Table 18 summarizes the results.
Error accumulation across long sequences
To evaluate long-term stability, we analyze performance as the number of incremental tasks increases. Unlike sequential composition methods, our approach re-estimates a global mapping at each session using all current new classes as anchors. This prevents error propagation across sessions.
As shown in Table 19, both accuracy and reconstruction metrics remain stable across all tasks. The absence of monotonic degradation confirms that our global re-estimation strategy effectively prevents error accumulation.
Summary
Overall, the proposed drift compensation method demonstrates strong numerical stability: (1) ridge regularization mitigates ill-conditioning, (2) performance remains robust under small increment sizes, and (3) the global mapping strategy avoids long-term error accumulation.
Discussion
The proposed DGASA method demonstrates substantial improvements over existing class-incremental learning approaches across multiple benchmark datasets. By integrating lightweight attention-gated adapters with a frozen pre-trained backbone, DGASA effectively balances parameter efficiency and task-specific adaptation. The subspace drift compensation mechanism plays a critical role in maintaining the consistency of old class prototypes without requiring access to historical data, which not only preserves privacy but also enhances model stability. Additionally, the instance-level adaptive weighting strategy significantly boosts classification accuracy by dynamically adjusting the contribution of each task-specific subspace based on input relevance. These innovations collectively enable DGASA to outperform state-of-the-art methods, achieving superior average and final-stage accuracy in various incremental learning scenarios.
Beyond performance gains, DGASA offers a meaningful step toward practical and privacy-preserving continual learning systems. Its exemplar-free design addresses real-world constraints where storing previous data is infeasible due to memory or privacy limitations. The method’s ability to maintain and transfer knowledge across tasks without catastrophic forgetting makes it particularly suitable for deployment in dynamic environments, such as autonomous systems, personalized education, or medical diagnostics. Furthermore, the modular architecture of DGASA allows for scalable integration with existing pre-trained models, facilitating broader adoption in resource-constrained settings. Overall, DGASA contributes a robust, efficient, and privacy-aware solution to the ongoing challenge of lifelong learning in intelligent systems.
Conclusion
Incremental learning is a critical capability that intelligent systems in the real world should possess. To this end, we propose an efficient, parameter-friendly, and privacy-aware class-incremental learning method—Dynamic Gated Adapter for Subspace Alignment (DGASA). Our method freezes the pre-trained backbone network and introduces lightweight Gated Adapter (GA) modules for each incremental task to construct task-specific feature subspaces. These subspaces are dynamically fused across tasks, effectively alleviating catastrophic forgetting and task interference. To address the degradation of old class representations caused by evolving feature subspaces, DGASA designs a subspace alignment mechanism that learns a linear mapping between new and old subspaces, enabling consistent reconstruction of old class prototypes in the current subspace without accessing data from previous tasks. Extensive experimental results demonstrate that DGASA achieves superior performance on multiple incremental learning benchmark datasets, validating the effectiveness and applicability of the proposed method.
References
- 1.
Rebuffi S, Kolesnikov A, Sperl G, Lampert CH. iCaRL: Incremental Classifier and Representation Learning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5533–42. https://doi.org/10.1109/CVPR.2017.587
- 2. French RM, Chater N. Using noise to compute error surfaces in connectionist networks: a novel means of reducing catastrophic forgetting. Neural Comput. 2002;14(7):1755–69. pmid:12079555
- 3.
French RM, Ferrara A. Modeling time perception in rats: Evidence for catastrophic interference in animal learning. Proceedings of the Twenty First Annual Conference of the Cognitive Science Society. Psychology Press. 2020. 173–8. https://doi.org/10.4324/9781410603494-35
- 4.
Chen X, Chang X. Dynamic Residual Classifier for Class Incremental Learning. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 18697–706. https://doi.org/10.1109/iccv51070.2023.01718
- 5.
Hu Z, Li Y, Lyu J, Gao D, Vasconcelos N. Dense Network Expansion for Class Incremental Learning. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 11858–67. https://doi.org/10.1109/cvpr52729.2023.01141
- 6.
Douillard A, Rame A, Couairon G, Cord M. DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 9275–85. https://doi.org/10.1109/cvpr52688.2022.00907
- 7.
Zhou D-W, Sun H-L, Ye H-J, Zhan D-C. Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 23554–64. https://doi.org/10.1109/cvpr52733.2024.02223
- 8.
Wang F-Y, Zhou D-W, Ye H-J, Zhan D-C. FOSTER: Feature Boosting and Compression for Class-Incremental Learning. Lecture Notes in Computer Science. Springer Nature Switzerland. 2022. p. 398–414. https://doi.org/10.1007/978-3-031-19806-9_23
- 9.
Gao Q, Zhao C, Sun Y, Xi T, Zhang G, Ghanem B, et al. A Unified Continual Learning Framework with General Parameter-Efficient Tuning. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 11449–59. https://doi.org/10.1109/iccv51070.2023.01055
- 10.
Zhang G, Wang L, Kang G, Chen L, Wei Y. SLCA: Slow Learner with Classifier Alignment for Continual Learning on a Pre-trained Model. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 19091–101. https://doi.org/10.1109/iccv51070.2023.01754
- 11.
Zhou DW, Sun HL, Ning J, Ye HJ, Zhan DC. Continual learning with pre-trained models: A survey. 2024. https://arxiv.org/abs/2401.16386
- 12.
Goswami D, Liu Y, Twardowski B, Van De Weijer J. FeCAM: Exploiting the Heterogeneity of Class Distributions in Exemplar-Free Continual Learning. In: Advances in Neural Information Processing Systems 36, 2023. 6582–95. https://doi.org/10.52202/075280-0288
- 13.
Pian W, Mo S, Guo Y, Tian Y. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023. 7765–77. https://doi.org/10.1109/ICCV51070.2023.00717
- 14.
Villa A, Alcázar JL, Alfarra M, Alhamoud K, Hurtado J, Heilbron FC, et al. PIVOT: Prompting for Video Continual Learning. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 24214–23. https://doi.org/10.1109/cvpr52729.2023.02319
- 15.
Hong X, Huang Z, Wang Y. S-Prompts Learning with Pre-Trained Transformers: An Occam’s Razor for Domain Incremental Learning. In: Advances in Neural Information Processing Systems 35, 2022. 5682–95. https://doi.org/10.52202/068431-0411
- 16.
Smith JS, Karlinsky L, Gutta V, Cascante-Bonilla P, Kim D, Arbelle A, et al. CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 11909–19. https://doi.org/10.1109/cvpr52729.2023.01146
- 17.
Wang Z, Zhang Z, Ebrahimi S, Sun R, Zhang H, Lee CY, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In: European conference on computer vision. Springer; 2022. p. 631–48.
- 18.
Wang Z, Zhang Z, Lee C-Y, Zhang H, Sun R, Ren X, et al. Learning to Prompt for Continual Learning. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 139–49. https://doi.org/10.1109/cvpr52688.2022.00024
- 19. Zhou D-W, Cai Z-W, Ye H-J, Zhan D-C, Liu Z. Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need. Int J Comput Vis. 2024;133(3):1012–32.
- 20. Dohare S, Hernandez-Garcia JF, Lan Q, Rahman P, Mahmood AR, Sutton RS. Loss of plasticity in deep continual learning. Nature. 2024;632(8026):768–74. pmid:39169245
- 21.
Aljundi R, Babiloni F, Elhoseiny M, Rohrbach M, Tuytelaars T. Memory Aware Synapses: Learning What (not) to Forget. Lecture Notes in Computer Science. Springer International Publishing. 2018. p. 144–61. https://doi.org/10.1007/978-3-030-01219-9_9
- 22. Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, et al. Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci U S A. 2017;114(13):3521–6. pmid:28292907
- 23. Li Z, Hoiem D. Learning without Forgetting. IEEE Trans Pattern Anal Mach Intell. 2018;40(12):2935–47. pmid:29990101
- 24. Wang Z, Liu L, Duan Y, Tao D. Continual Learning through Retrieval and Imagination. AAAI. 2022;36(8):8594–602.
- 25.
Cha H, Lee J, Shin J. Co2l: Contrastive continual learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. 9516–25.
- 26. Lopez-Paz D, Ranzato M. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems. 2017;30.
- 27.
Zhu K, Zhai W, Cao Y, Luo J, Zha Z-J. Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 9286–95. https://doi.org/10.1109/cvpr52688.2022.00908
- 28. Shin H, Lee JK, Kim J, Kim J. Continual learning with deep generative replay. Advances in neural information processing systems. 2017;30.
- 29.
Yan S, Xie J, He X. DER: Dynamically Expandable Representation for Class Incremental Learning. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 3013–22. https://doi.org/10.1109/cvpr46437.2021.00303
- 30. Wang L, Zhang X, Li Q, Zhang M, Su H, Zhu J, et al. Incorporating neuro-inspired adaptability for continual learning in artificial intelligence. Nat Mach Intell. 2023;5(12):1356–68.
- 31.
Zhou D-W, Sun H-L, Ye H-J, Zhan D-C. Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 23554–64. https://doi.org/10.1109/cvpr52733.2024.02223
- 32.
Cao B, Tang Q, Lin H, Jiang S, Dong B, Han X. Retentive or forgetful? Diving into the knowledge memorizing mechanism of language models. arXiv preprint. 2023. https://arxiv.org/abs/2305.09144
- 33.
Chen S, Ge C, Luo P, Song Y, Tong Z, Wang J, et al. AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition. In: Advances in Neural Information Processing Systems 35, 2022. 16664–78. https://doi.org/10.52202/068431-1212
- 34.
Huang M, Su H, Wang L, Xie J, Zhang X, Zhu J. Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality. In: Advances in Neural Information Processing Systems 36, 2023. 69054–76. https://doi.org/10.52202/075280-3022
- 35. Wang Y, Ma Z, Huang Z, Wang Y, Su Z, Hong X. Isolation and Impartial Aggregation: A Paradigm of Incremental Learning without Interference. AAAI. 2023;37(8):10209–17.
- 36.
Zheng Z, Ma M, Wang K, Qin Z, Yue X, You Y. Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 19068–79. https://doi.org/10.1109/iccv51070.2023.01752
- 37. Zhou D-W, Zhang Y, Wang Y, Ning J, Ye H-J, Zhan D-C, et al. Learning Without Forgetting for Vision-Language Models. IEEE Trans Pattern Anal Mach Intell. 2025;47(6):4489–504. pmid:40184303
- 38.
Ye HJ, Zhan DC, Si XM, Jiang Y. Learning Mahalanobis Distance Metric: Considering Instance Disturbance Helps. In: IJCAI; 2017. p. 3315–21.
- 39.
Abbasnejad E, Gong D, McDonnell MD, Parvaneh A, Van Den Hengel A. RanPAC: Random Projections and Pre-trained Models for Continual Learning. In: Advances in Neural Information Processing Systems 36, 2023. 12022–53. https://doi.org/10.52202/075280-0526
- 40.
Panos A, Kobe Y, Reino DO, Aljundi R, Turner RE. First Session Adaptation: A Strong Replay-Free Baseline for Class-Incremental Learning. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 18774–84. https://doi.org/10.1109/iccv51070.2023.01725
- 41.
Yu L, Twardowski B, Liu X, Herranz L, Wang K, Cheng Y, et al. Semantic Drift Compensation for Class-Incremental Learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 6980–9. https://doi.org/10.1109/cvpr42600.2020.00701
- 42.
Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. 2009.
- 43.
Wah C, Branson S, Welinder P, Perona P, Belongie S. The caltech-ucsd birds-200-2011 dataset. 2011.
- 44.
Hendrycks D, Basart S, Mu N, Kadavath S, Wang F, Dorundo E, et al. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 8320–9. https://doi.org/10.1109/iccv48922.2021.00823
- 45. Barbu A, Mayo D, Alverio J, Luo W, Wang C, Gutfreund D. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in Neural Information Processing Systems. 2019;32.
- 46.
Zhang Y, Yin Z, Shao J, Liu Z. Benchmarking omni-vision representation through the lens of visual realms. In: European conference on computer vision. Springer; 2022. 594–611.
- 47. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems. 2019;32.