Abstract
Class-Incremental Learning (CIL) aims to enable models to continuously learn new categories while preserving existing knowledge and avoiding catastrophic forgetting. Although parameter-expansion architectures can alleviate task interference to some extent, the representations of previously learned classes often drift or degrade as the feature subspaces continuously evolve and expand, resulting in decreased recognition performance for old classes. To address this issue, we propose an efficient CIL method—Dynamic Gated Adapter for Subspace Alignment (DGASA). Based on a frozen pre-trained backbone, DGASA introduces lightweight adapters with attention-based gating for each task to construct task-specific subspaces, while dynamically fusing cross-task information via attention mechanisms. In addition, DGASA learns a linear mapping between the old and new subspaces to achieve consistent alignment of old class prototypes in the current subspace without accessing past data. Extensive experiments demonstrate that DGASA significantly improves classification accuracy and resistance to forgetting on multiple benchmark datasets, offering strong generalization and computational efficiency.
Citation: Gu J, Huang S, Li T, Zhang S, Li M (2026) Gated subspace alignment with drift compensation for parameter-efficient Class-Incremental Learning. PLoS One 21(5): e0348270. https://doi.org/10.1371/journal.pone.0348270
Editor: Antonio Falcó, Universidad CEU Cardenal Herrera - Campus Elche, SPAIN
Received: January 7, 2026; Accepted: April 14, 2026; Published: May 7, 2026
Copyright: © 2026 Gu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Our experiments were conducted on the following public datasets: CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html), CUB-200 (http://www.vision.caltech.edu/datasets/cub_200_2011), ImageNet-R (https://github.com/hendrycks/imagenet-r), ObjectNet (https://objectnet.dev), and OmniBenchmark (https://zhangyuanhan-ai.github.io/OmniBenchmark).
Funding: This work was funded by the National Natural Science Foundation of China (Grant No. 62276118). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Class-Incremental Learning (CIL) [1], as a core task in the field of continual learning, aims to enable models to continuously acquire new categories while effectively preserving and leveraging previously learned knowledge. This capability is essential for building intelligent systems with long-term and stable learning abilities. However, directly fine-tuning neural networks on new data often causes the model to forget previously acquired information, leading to a significant drop in performance—a phenomenon known as catastrophic forgetting [2,3]. To alleviate this issue, some studies have proposed parameter-expansion-based dynamic architectures [4–9], which introduce independent parameter modules for different tasks, effectively reducing task interference. These approaches have achieved promising results, particularly when combined with powerful pre-trained models.
In non-pretrained model settings, some expandable networks [4,8,9] construct separate backbone networks for each task, thereby forming task-specific feature subspaces and achieving effective isolation between tasks. These methods are relatively effective at preserving knowledge from previous tasks. However, as the number of tasks increases, the model’s parameter size and computational overhead grow rapidly, significantly affecting inference efficiency. Moreover, they often rely on retaining samples from previous tasks to train a unified classifier, which poses substantial challenges in real-world applications where data privacy and storage are constrained.
In contrast, pretrained models, with their powerful representational capacity and strong transferability, offer a more promising solution for building low-cost continual learning systems that do not require access to previous samples [9–15]. To fully leverage the advantages of pretrained models, many approaches [16–18] draw inspiration from expandable networks by designing lightweight modules to adapt to new tasks and mapping the ever-growing feature space to classifiers corresponding to each category. This strategy balances the recognition of both old and new classes, alleviating inter-task conflicts to some extent and improving the model’s adaptability and scalability. However, due to the complex interference among task-specific features and the dynamic shifts in feature distributions during training, these models still struggle to fully prevent catastrophic forgetting.
To this end, we introduce DGASA (Dynamic Gated Adapter for Subspace Alignment), which achieves a balance between task isolation and knowledge sharing via lightweight subspace construction coupled with drift compensation, as shown in Fig 1. Specifically, DGASA freezes the pretrained backbone and inserts attention-gated adapters for each task, constructing low-dimensional subspaces and dynamically fusing them via an attention mechanism. This effectively alleviates inter-task conflicts and improves parameter efficiency. To tackle the issue of old class prototypes becoming invalid due to changes in the feature space during training, DGASA learns a linear mapping between new and old subspaces, enabling consistent reconstruction of old class prototypes in the current subspace without accessing old data, thereby maintaining stable representations. During inference, DGASA introduces an instance-level weighting strategy that dynamically fuses features based on the sample’s matching degree across subspaces, which strengthens features related to the current task while preserving the discriminative power of old knowledge. This significantly enhances the model’s generalization and incremental learning performance. The main contributions of this work are as follows:
- We propose the DGASA framework, which combines lightweight adapters with a gating mechanism to construct task-specific subspaces and enable dynamic fusion, significantly alleviating inter-task interference and improving parameter efficiency.
- We introduce a subspace drift compensation mechanism that learns a linear mapping between new and old subspaces, enabling data-free consistent reconstruction of old class prototypes, thereby enhancing model stability and data privacy.
- We conduct extensive experiments on commonly used continual learning datasets to demonstrate the effectiveness of the DGASA method. Comparative results show that our approach achieves the best performance.
Related work
Class-incremental learning
Class-incremental learning aims to enable a model to effectively retain knowledge of previously learned classes while continuously receiving information about new classes [11,19,20], thus avoiding catastrophic forgetting. Existing methods mainly fall into three categories: regularization methods, replay methods, and dynamic network methods. Among them, regularization methods [21–24] introduce additional constraints during training to limit the update magnitude of critical parameters, maintaining stable representations of old tasks. Replay methods preserve a portion of samples from old tasks [8,25,26] or use generative models to synthesize old-class data [27,28], and jointly train them with new data to help the model retain previously learned knowledge. Dynamic network methods expand the model structure [4, 5, 6, 7, 8, 29–31], for example, by adding new neurons, layers, or task-specific modules, effectively isolating knowledge between tasks and thereby improving the model’s ability to jointly adapt to and learn from both old and new tasks.
Pre-trained model-based CIL
Class-incremental learning based on pre-trained models (PTMs) [11,31–33] has become a research hotspot in recent years. With the development of pre-training techniques, an increasing number of methods incorporate PTMs into class-incremental learning to enhance model performance and generalization ability. These methods typically keep the pre-trained weights frozen and achieve lightweight parameter updates through prompt tuning [15–18], encoding new task features into a prompt pool to effectively alleviate forgetting. In addition, some approaches adopt model fusion or model merging strategies [9,31,34–37] by saving and integrating models from multiple training stages to further improve the retention of old knowledge. Prototype-based classification methods leverage the powerful representations of PTMs combined with nearest class mean (NCM) classifiers [19,38–40] to achieve stable recognition of old classes.
Method
Problem definition
Class-Incremental Learning (CIL) is a learning scenario where a model continuously learns to classify new classes to build a unified classifier [1]. Given a sequence of $B$ training datasets $\{\mathcal{D}^{1}, \mathcal{D}^{2}, \ldots, \mathcal{D}^{B}\}$, the $b$-th dataset $\mathcal{D}^{b} = \{(\mathbf{x}_{i}^{b}, y_{i}^{b})\}_{i=1}^{n_b}$ contains $n_b$ instances. Each instance $\mathbf{x}_{i}^{b} \in \mathbb{R}^{D}$ comes from class $y_{i}^{b} \in Y_b$. Here, $Y_b$ is the label space of task $b$, and $Y_b \cap Y_{b'} = \emptyset$ for $b \neq b'$, meaning the classes across different tasks do not overlap. We follow the exemplar-free setting in [16], where no samples from old classes are saved. Therefore, at the $b$-th incremental stage, we only have access to data from $\mathcal{D}^{b}$ for training. In CIL, our goal is to build a unified classifier for all seen classes $\mathcal{Y}_b = Y_1 \cup \cdots \cup Y_b$ as data evolves. Specifically, we want to find a model $f(\mathbf{x}): \mathcal{X} \rightarrow \mathcal{Y}_b$ that minimizes the expected risk:

$$f^{*} = \operatorname*{arg\,min}_{f \in \mathcal{H}} \; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_{t}^{1} \cup \cdots \cup \mathcal{D}_{t}^{b}} \, \mathbb{I}\!\left(y \neq f(\mathbf{x})\right),$$

where $\mathcal{H}$ is the hypothesis space, $\mathbb{I}(\cdot)$ is the indicator function, and $\mathcal{D}_{t}^{b}$ denotes the data distribution of task $b$. Following typical PTM-based CIL works [16–18], we assume a pretrained model is available for initializing $f(\mathbf{x})$. We decouple the PTM into a feature embedding $\phi(\cdot): \mathbb{R}^{D} \rightarrow \mathbb{R}^{d}$ and a linear classifier $W \in \mathbb{R}^{d \times |\mathcal{Y}_b|}$. The embedding function $\phi(\cdot)$ refers to the final [CLS] token in ViT, and the model output is expressed as $f(\mathbf{x}) = W^{\top}\phi(\mathbf{x})$. For clarity, we decouple the classifier as $W = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{|\mathcal{Y}_b|}]$, where $\mathbf{w}_j$ is the classifier weight for the $j$-th class.
Model overview
Dynamic Gated Adapter for Subspace Alignment (DGASA) is an efficient framework specifically designed for class-incremental learning (CIL), with its overall workflow illustrated in Fig 2. In the base class training stage, DGASA freezes the pretrained backbone network, which serves as a shared feature extractor across all tasks. On top of this backbone, a set of Gated Adapter (GA) modules is introduced for each incremental task to construct task-specific embedding subspaces. This design enables effective task isolation and alleviates catastrophic forgetting.
GA modules are inserted in each Transformer block for task-specific adaptation.
As tasks incrementally progress, the feature subspaces continuously evolve and expand, potentially rendering old class prototypes ineffective in the current subspace. To address this, DGASA proposes a subspace drift compensation mechanism, which leverages prototype pairs of new classes generated from both the old and current subspaces as supervision signals. A linear mapping is then learned to explicitly project old class prototypes from their original subspace into the current one. This process requires no access to data from previous tasks, thereby maintaining prototype consistency while ensuring privacy friendliness.
During the inference stage, DGASA incorporates an instance-aware adaptive weighting mechanism to enable collaborative decision-making across multiple subspaces. Specifically, the model adjusts the contribution of each subspace to the final classification result based on how well the test sample matches the class prototypes in each subspace. The primary task subspace provides the main discriminative power, while the remaining subspaces are weighted based on their semantic saliency scores, leading to more refined and robust ensemble predictions.
In summary, DGASA integrates pretrained knowledge sharing, task-adaptive subspace modeling via GA modules, prototype mapping compensation, and saliency-aware inference into a unified multi-subspace incremental learning framework. Without requiring access to old data, it achieves excellent generalization, memory efficiency, and resistance to forgetting.
Gated Adapter (GA) Module
In Class-Incremental Learning (CIL), a typical challenge lies in fine-tuning pre-trained models on new tasks without incurring significant computational overhead or suffering from catastrophic forgetting. Full fine-tuning of the entire model requires extensive computation and may cause the model to forget previously learned tasks. To address this, we propose a more parameter-efficient solution using adapter modules, which offer a lightweight way to incorporate task-specific knowledge into pre-trained models. Adapters are small bottleneck structures inserted within each Transformer layer, allowing task-specific adaptations while keeping the core model frozen.
Our approach builds on this concept and introduces a Gated Adapter (GA) module, which enhances the standard adapter's flexibility by incorporating a gating mechanism that controls activation dynamically. We clarify the naming: although the gating is described as attention-based in places, the actual mechanism is a pooled linear gating rather than standard QKV attention. Specifically, each Transformer layer in the backbone network contains a feed-forward module with an additional gated residual branch, which is activated conditionally based on the input features. For an input feature $x$, the output is formulated as:

$$\mathrm{GA}(x) = x + g(x) \cdot W_{\mathrm{up}} \, \sigma\!\left(W_{\mathrm{down}} \, x\right),$$

where $W_{\mathrm{down}} \in \mathbb{R}^{r \times d}$ and $W_{\mathrm{up}} \in \mathbb{R}^{d \times r}$ are weight matrices that reduce and then expand the dimensionality, $\sigma(\cdot)$ is a nonlinear activation function (GELU), and $g(x)$ is the gating function defined as:

$$g(x) = \operatorname{sigmoid}\!\left(\mathbf{w}_{g}^{\top} \operatorname{Pool}(x)\right),$$

where $\mathbf{w}_{g}$ is a linear transformation parameter, and $\operatorname{Pool}(\cdot)$ performs global average pooling across the input feature dimensions. This pooling step captures high-level characteristics of the input, producing a scalar gate that adaptively weights the importance of the residual branch.
Module vs. Framework Distinction. To avoid confusion, we explicitly distinguish:
- GA module: The individual gated adapter unit inserted in each Transformer layer (Fig 3). This is the basic building block for task-specific adaptation.
- DGASA framework: The complete method comprising: (1) a frozen pre-trained backbone, (2) a set of GA modules for each task, (3) the drift compensation mechanism for prototype alignment, and (4) the adaptive weighting mechanism for inference.
The gating mechanism uses global pooling followed by a linear layer with sigmoid activation to produce a scalar gate g(x), which modulates the adapter residual branch.
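For concreteness, the following is a minimal PyTorch sketch of one GA module as described above. The class name `GatedAdapter` and the exact placement inside each Transformer block are illustrative assumptions; the dimensions, GELU activation, sigmoid gate, and dropout rate follow the settings reported in the implementation details.

```python
import torch
import torch.nn as nn

class GatedAdapter(nn.Module):
    """Bottleneck adapter with a pooled linear gate (illustrative sketch)."""
    def __init__(self, d: int = 768, r: int = 16):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)    # W_down: d -> r
        self.up = nn.Linear(r, d, bias=False)      # W_up:   r -> d
        self.act = nn.GELU()                       # sigma(.)
        self.gate = nn.Linear(d, 1, bias=False)    # w_g, applied after pooling
        self.dropout = nn.Dropout(0.1)             # dropout on the adapter output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d) token sequence inside a Transformer block
        pooled = x.mean(dim=1)                             # global average pooling
        g = torch.sigmoid(self.gate(pooled)).unsqueeze(1)  # scalar gate per sample
        residual = self.dropout(self.up(self.act(self.down(x))))
        return x + g * residual                            # gated residual branch
```

With $d = 768$ and $r = 16$, this sketch has $2dr + d = 25{,}344$ trainable parameters per module, matching the count reported below.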
When the $b$-th task arrives, we introduce a new set of GA modules (one per Transformer layer), denoted $A_b$. As tasks progress, the model accumulates an adapter sequence $\{A_1, A_2, \ldots, A_b\}$. During inference, we concatenate the outputs of each task adapter to construct a joint feature representation:

$$\Phi(\mathbf{x}) = \left[\phi_{A_1}(\mathbf{x}); \, \phi_{A_2}(\mathbf{x}); \, \ldots; \, \phi_{A_b}(\mathbf{x})\right],$$

where $\phi_{A_b}(\mathbf{x})$ denotes the feature mapping through the GA modules of task $b$.

Since training for each task only optimizes its corresponding GA modules, learning new tasks does not affect knowledge retention of old tasks. Each GA module contains $(2dr + d)$ parameters, so the total storage cost is $B \times L \times (2dr + d)$, where $B$ is the number of tasks and $L$ is the number of Transformer blocks (12 for ViT-B/16). With $d = 768$ and $r = 16$, each GA module has 25,344 parameters, totaling 304,128 parameters per task.
For classification prediction, we adopt a prototype-based classifier. After training the $b$-th task, we extract the prototype of the $i$-th class in the adapter subspace $A_b$:

$$\mathbf{p}_{i,b} = \frac{1}{|\mathcal{D}_i^{b}|} \sum_{\mathbf{x} \in \mathcal{D}_i^{b}} \phi_{A_b}(\mathbf{x}),$$

where $\mathcal{D}_i^{b}$ denotes the training sample set of class $i$ in task $b$, and $\mathbf{p}_{i,b} \in \mathbb{R}^{d}$. Then, the prototype vectors of this class in all task subspaces are concatenated to form the joint prototype:

$$\mathbf{P}_i = \left[\mathbf{p}_{i,1}; \, \mathbf{p}_{i,2}; \, \ldots; \, \mathbf{p}_{i,b}\right].$$

During prediction, cosine similarity is used to compare the input feature embedding $\Phi(\mathbf{x})$ with the class prototypes $\mathbf{P}_i$, and the class with the highest similarity is chosen as the predicted label.
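A minimal sketch of this prototype-based classification is given below, assuming the per-subspace features have already been extracted; the helper names `extract_prototypes` and `predict` are illustrative, not part of our released code.

```python
import torch
import torch.nn.functional as F

def extract_prototypes(features: torch.Tensor, labels: torch.Tensor) -> dict:
    """Class prototypes as mean features in one adapter subspace.
    features: (N, d) embeddings from phi_{A_b}; labels: (N,) class ids."""
    return {int(c): features[labels == c].mean(dim=0) for c in labels.unique()}

def predict(joint_feat: torch.Tensor, joint_protos: torch.Tensor) -> torch.Tensor:
    """Nearest-prototype prediction with cosine similarity.
    joint_feat: (B*d,) concatenated features Phi(x);
    joint_protos: (C, B*d) concatenated prototypes, one row per seen class."""
    sims = F.cosine_similarity(joint_feat.unsqueeze(0), joint_protos, dim=1)
    return sims.argmax()
```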
Reconstructing prototypes under distribution drift
In incremental learning, as new tasks and distributions arrive, the model adapts by adding new adapters that construct embedding subspaces for each task. However, a challenge arises when it becomes necessary to recompute the prototypes for each class to ensure consistency with the current feature embedding space. Since accessing past data is often not feasible, directly computing the prototypes of old classes in the new subspace is problematic, leading to a mismatch between the prototype matrices at different stages of learning. This mismatch impacts the classifier’s ability to provide a unified and accurate representation of the feature space across all tasks.
To address this issue, we formalize it as a prototype completion task. Given two embedding subspaces (old and new) and two class sets (old and new), we aim to estimate the prototypes of old classes in the new subspace, denoted $P_{o,n}$, by leveraging the following three observable prototype matrices: the prototypes of old classes in the old subspace $P_{o,o}$, the prototypes of new classes in the old subspace $P_{n,o}$, and the prototypes of new classes in the new subspace $P_{n,n}$.
To reconstruct the prototypes, we adopt the concept of drift compensation [41], which models the geometric transformation between embedding subspaces. Unlike semantic-based approaches, this formulation does not rely on semantic similarity between classes, making it applicable even when new classes are sparse or semantically distant.
Specifically, we construct a paired sample set using the prototypes of new classes in both the old and new subspaces. These paired prototypes serve as anchors to estimate a linear mapping $W$ from the old subspace to the new subspace. We formulate the mapping estimation as a least-squares problem:

$$\min_{W} \; \left\| P_{n,n} - P_{n,o} W \right\|_{F}^{2},$$

which admits a closed-form solution via the normal equations:

$$W = \left( P_{n,o}^{\top} P_{n,o} \right)^{-1} P_{n,o}^{\top} P_{n,n}.$$

After obtaining the mapping matrix $W$, we reconstruct the prototypes of old classes by projecting the old subspace prototypes into the new subspace:

$$\hat{P}_{o,n} = P_{o,o} \, W.$$
This projection does not require access to past data and enables consistent prototype alignment across tasks. In addition to downstream classification accuracy, we also evaluate the quality of drift compensation using direct metrics such as prototype reconstruction error and cosine similarity before and after alignment.
Mapping Strategy: It is crucial to clarify that the mapping matrix $W$ is estimated globally with respect to all previously learned tasks, rather than being composed sequentially between adjacent tasks. Specifically, when a new task $b$ arrives, the old subspace is defined by the concatenated feature space of all previous adapters $\{A_1, \ldots, A_{b-1}\}$, and the new subspace is defined by the concatenated space including the new adapter $A_b$. The prototypes of the new classes ($P_{n,o}$ and $P_{n,n}$) are computed in these two respective subspaces and used to solve for a single, unified mapping $W$. This global mapping is then applied to all old class prototypes ($P_{o,o}$) to project them into the new subspace. This approach ensures that the mapping is solved only once per incremental session and avoids the potential for error accumulation that could arise from composing multiple sequential transformations. Computationally, solving this mapping requires a single closed-form least-squares solve per task, which is negligible compared to the cost of training the adapters.
Numerical Stability and Robust Formulation. In practice, the matrix $P_{n,o}^{\top} P_{n,o}$ may be ill-conditioned or even singular, particularly when the number of new classes $|Y_b|$ is small relative to the embedding dimension $d$, or when the prototype matrix $P_{n,o}$ is rank-deficient. In such cases, directly computing the inverse may lead to numerical instability and degraded reconstruction quality.

To address this issue, we adopt a regularized least-squares formulation based on ridge (Tikhonov) regularization:

$$W = \left( P_{n,o}^{\top} P_{n,o} + \lambda I \right)^{-1} P_{n,o}^{\top} P_{n,n},$$

where $\lambda > 0$ is a regularization parameter and $I$ is the identity matrix. This formulation improves numerical stability by ensuring that the matrix to be inverted is well-conditioned. In addition, we consider an alternative formulation based on the Moore–Penrose pseudoinverse:

$$W = P_{n,o}^{+} \, P_{n,n},$$

which provides a minimum-norm solution even when $P_{n,o}$ is rank-deficient.

In our implementation, $\lambda$ is selected as a small constant and further validated via a sensitivity analysis. We empirically evaluate the robustness of the mapping with respect to $\lambda$, the number of new classes, and the conditioning of $P_{n,o}^{\top} P_{n,o}$.
Adaptive weighting by instance-level significance
So far, we have introduced the subspace expansion and adapter incremental learning mechanisms, and restored the prototypes of old classes through the prototype completion strategy. After completing adapter expansion and prototype completion, we construct a complete classifier with the following prototype matrix:

$$\mathbf{P} = \begin{bmatrix} P_{1,1} & \hat{P}_{1,2} & \cdots & \hat{P}_{1,b} \\ P_{2,1} & P_{2,2} & \cdots & \hat{P}_{2,b} \\ \vdots & \vdots & \ddots & \vdots \\ P_{b,1} & P_{b,2} & \cdots & P_{b,b} \end{bmatrix},$$

where $P_{i,j}$ stacks the prototypes of the classes introduced in task $i$, extracted in the subspace of task $j$, and the off-diagonal terms above the main diagonal (marked with a hat) are completed according to the estimation formula.
During inference, to obtain the classification logits for the $b$-th task, we perform multiple prototype-embedding matches across different subspaces and aggregate them into the final score:

$$S(\mathbf{x}) = \sum_{i=1}^{b} S_i(\mathbf{x}), \qquad S_i(\mathbf{x}) = \left[\cos\!\left(\phi_{A_i}(\mathbf{x}), \, \mathbf{p}_{j,i}\right)\right]_{j=1}^{|\mathcal{Y}_b|},$$

where $\phi_{A_i}(\mathbf{x})$ denotes the features extracted by the $i$-th adapter $A_i$ and $\mathbf{p}_{j,i}$ is the prototype of class $j$ in subspace $i$. Although this ensemble approach helps leverage information from multiple subspaces, we note that only the adapter $A_b$ corresponding to the $b$-th task is specifically trained for that task, so the features $\phi_{A_b}(\mathbf{x})$ are more task-discriminative.
To address this, we propose a new inference mechanism — Adaptive Weighting by Instance-level Significance. Its core idea is to quantify the matching degree of the input sample across different subspaces and dynamically adjust each subspace’s contribution weight to the final classification result based on this significance. Compared to using fixed scaling coefficients for all non-primary subspaces, our method is more flexible and better reflects the semantic correlation between the sample and each subspace.
The specific steps are as follows:
First, compute subspace significance. For a given input sample $\mathbf{x}$, we feed it into the adapters of all subspaces and extract a feature representation $\phi_{A_i}(\mathbf{x})$ for each subspace. Then, we calculate the similarity between these features and the corresponding prototypes $\{\mathbf{p}_{j,i}\}$ to measure the matching degree of the sample in the $i$-th subspace. This matching degree is called the significance score $s_i$, calculated as:

$$s_i = \max_{j} \; \operatorname{sim}\!\left( \phi_{A_i}(\mathbf{x}), \, \mathbf{p}_{j,i} \right),$$

where $\operatorname{sim}(\cdot,\cdot)$ denotes a similarity function; here, we use cosine similarity. Note that to emphasize the primary subspace for the current task, we compute significance only for the non-primary subspaces $i \neq b$.

Next, normalize the significance scores into a weight distribution. To ensure comparability of contributions across subspaces, we apply the softmax function to the significance scores to obtain the weight of each non-primary subspace:

$$w_i = \frac{\exp(s_i)}{\sum_{k \neq b} \exp(s_k)}, \qquad i \neq b.$$

This normalization can be seen as an adaptive attention allocation over all non-primary subspaces, where higher significance scores correspond to larger weights, thereby enhancing the role of that subspace in the final inference.

Finally, construct the weighted classification score. After obtaining the primary subspace matching score $S_b(\mathbf{x})$, we combine it with the significance-weighted scores of all non-primary subspaces to obtain the final classification logits:

$$S(\mathbf{x}) = S_b(\mathbf{x}) + \alpha \sum_{i \neq b} w_i \, S_i(\mathbf{x}),$$

where $\alpha$ is a balancing parameter controlling the relative contribution of the non-primary subspaces in the overall score. In our experiments, we set $\alpha = 0.15$.
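A compact sketch of this inference procedure is shown below, assuming per-subspace features and per-subspace prototype matrices are precomputed; the function name and tensor layout are illustrative, while the default `alpha` follows the value selected in our ablation.

```python
import torch
import torch.nn.functional as F

def weighted_logits(feats: list[torch.Tensor],
                    protos: list[torch.Tensor],
                    primary: int,
                    alpha: float = 0.15) -> torch.Tensor:
    """Instance-level adaptive weighting over task subspaces.
    feats[i]:  (d,) feature phi_{A_i}(x) from the i-th adapter.
    protos[i]: (C, d) prototypes of all seen classes in subspace i.
    primary:   index of the current task's (primary) subspace."""
    # Per-subspace class scores: cosine similarity against every prototype.
    scores = [F.cosine_similarity(f.unsqueeze(0), P, dim=1)
              for f, P in zip(feats, protos)]                 # list of (C,)
    # Significance of each non-primary subspace = its best prototype match.
    others = [i for i in range(len(feats)) if i != primary]
    sig = torch.stack([scores[i].max() for i in others])
    w = torch.softmax(sig, dim=0)                             # weights over others
    logits = scores[primary].clone()
    for w_i, i in zip(w, others):
        logits += alpha * w_i * scores[i]
    return logits                                             # (C,) final logits
```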
Experiments
In this section, we first provide an overview of the datasets and evaluation metrics used, followed by a comprehensive description of the model architecture and experimental setup. We then present and analyze the results of the comparative experiments. Finally, we conduct ablation studies to validate the effectiveness of each component of the model.
Datasets and evaluation metrics
Datasets
Since pre-trained models may possess extensive knowledge from upstream tasks, they are often evaluated on a variety of datasets to assess their transferability and generalization capabilities in different learning scenarios. In this work, we follow the experimental setups in [18,19] to evaluate performance on several widely-used benchmark datasets, including CIFAR-100 [42], CUB-200 [43], ImageNet-R [44], ObjectNet [45], and OmniBenchmark [46]. These datasets are particularly useful for evaluating class-incremental learning (CIL) approaches, as they consist of a variety of task distributions and domain shifts. The datasets we use for evaluation include both standard CIL benchmarks and out-of-distribution datasets, offering a wide range of challenges that test the model’s ability to generalize across domains. CIFAR-100 consists of 100 classes, which are commonly used in incremental learning tasks. CUB-200, containing 200 classes, is widely used for fine-grained recognition and provides a higher level of class granularity. ImageNet-R and ObjectNet, each containing 200 classes, represent more challenging benchmarks, as they come from distribution shifts or domain gaps compared to the ImageNet dataset, which is typically used for pre-training. OmniBenchmark, with 300 classes, is another large-scale benchmark that provides diverse challenges in out-of-distribution testing, offering a more complex scenario for evaluating domain adaptation and generalization. These datasets provide a comprehensive testing ground for evaluating both generalization to new tasks and robustness in the face of domain drift.
Dataset split: To ensure consistency and comparability across different methods, we follow the established benchmark settings outlined in [1,18]. Specifically, we use the notation Inc-n to represent the class split, where n indicates the number of classes introduced at each incremental stage. In this framework, the first training stage starts with zero classes, meaning the model must progressively learn to recognize new classes as they are introduced. This incremental learning setup closely mirrors real-world scenarios where a model must continually adapt to new information without the luxury of accessing previous data. For a fair comparison across all methods, we adopt the same random seed for shuffling the class order before performing data splitting, as recommended in [1]. This ensures that the class order is randomized in a consistent manner across all experiments, preventing any potential bias introduced by a particular class ordering. Furthermore, we maintain consistency in the training and testing sets by aligning them with the settings used in [19] across all compared methods. This consistent dataset partitioning ensures a rigorous and fair evaluation of the methods under consideration, allowing us to draw reliable conclusions about their performance in incremental and out-of-distribution learning tasks.
Evaluation metrics
Following the benchmark protocol [42], we use $\mathcal{A}_b$ to denote the model's accuracy after the $b$-th stage. Specifically, we adopt $\mathcal{A}_B$ (the performance after the last stage) and $\bar{\mathcal{A}} = \frac{1}{B}\sum_{b=1}^{B} \mathcal{A}_b$ (the average performance over all incremental stages) as evaluation metrics.
Implementation details
All experiments are conducted on a single NVIDIA RTX 4090 GPU (24 GB VRAM) using the PyTorch framework (version 2.0.1) [47]. Following prior works [18,19], we adopt a ViT-B/16 backbone pre-trained on ImageNet-21K (“vit_base_patch16_224” from the timm library, 86.6M parameters) and further fine-tuned on ImageNet-1K under the standard protocol. For all incremental tasks, DGASA is trained using the AdamW optimizer with a weight decay of 0.01. The learning rate is decayed by a factor of 0.1 at 50% and 75% of the total training steps. We use a batch size of 64 and train for 15 epochs per task with an input resolution of $224 \times 224$. The training objective is a cosine-similarity-based cross-entropy loss with temperature scaling over the current task classes, where class prototypes are computed as the mean feature representations of all samples belonging to each class in the current subspace. The GA module employs global average pooling for gating, GELU activation in the adapter, sigmoid activation for the gate, and a dropout rate of 0.1 applied to the adapter output. For prototype-based alignment, prototypes are updated after each task, and the drift compensation mapping is solved using ridge regularization with parameter $\lambda$. The adaptive weighting coefficient is set to $\alpha = 0.15$, selected via validation on a held-out subset. To ensure robustness, all experiments are repeated with five random seeds (42, 1234, 5678, 91011, 121314), and we report the mean and standard deviation. The class order is consistently shuffled across all methods for each seed to ensure fair comparison.
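As an illustration of this training objective, the sketch below shows a cross-entropy loss over temperature-scaled cosine logits for the current task's classes; the temperature value and the commented optimizer setup are placeholders rather than the exact settings from our experiments.

```python
import torch
import torch.nn.functional as F

def cosine_ce_loss(features: torch.Tensor,
                   prototypes: torch.Tensor,
                   targets: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    """Cross-entropy over temperature-scaled cosine logits.
    features:   (N, d) embeddings of the current batch.
    prototypes: (C_cur, d) class prototypes of the current task.
    targets:    (N,) labels indexed within the current task.
    temperature is an illustrative default, not the paper's exact value."""
    feats = F.normalize(features, dim=1)
    protos = F.normalize(prototypes, dim=1)
    logits = feats @ protos.T / temperature   # (N, C_cur) cosine logits
    return F.cross_entropy(logits, targets)

# Optimizer and step-decay schedule as described above (the learning rate
# below is a placeholder; weight decay and milestones follow the text):
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-3, weight_decay=0.01)
# scheduler = torch.optim.lr_scheduler.MultiStepLR(
#     optimizer,
#     milestones=[int(0.5 * total_steps), int(0.75 * total_steps)],
#     gamma=0.1)
```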
Comparisons with the State-of-the-arts
In this section, we provide a comprehensive evaluation of DGASA by comparing its performance with several state-of-the-art methods on five widely used benchmark datasets. The results, summarized in Table 1, show that DGASA achieves the best performance across all datasets, significantly surpassing existing methods, including CODA-Prompt and ADAM, when using the ViT-B/16-IN21K pre-trained model. Specifically, DGASA excels in both the average accuracy across incremental stages ($\bar{\mathcal{A}}$) and the accuracy after the final incremental stage ($\mathcal{A}_B$), outperforming the other methods by substantial margins.
To ensure a fair comparison, we selected several representative class-incremental learning (CIL) methods based on pre-trained transformers (PTMs), including both recent and classical approaches. These methods include L2P [18], DualPrompt [17], CODA-Prompt [16], SimpleCIL [19], EASE [31], and ADAM [19]. In addition, we also compare DGASA to traditional CIL methods that are equipped with the same pre-trained models, such as LwF [23], SDC [41], iCaRL [1], DER [29], FOSTER [8], and MEMO [11]. These comparisons provide a comprehensive evaluation of DGASA’s performance against the most prominent methods in the field.
The results from Table 1 clearly demonstrate the superiority of DGASA across all five benchmark datasets: CIFAR Inc5, CUB Inc10, IN-R Inc5, ObjNet Inc10, and Omnibench Inc30. For instance, DGASA achieves the highest average accuracy ($\bar{\mathcal{A}}$) and final-stage accuracy ($\mathcal{A}_B$) on CIFAR Inc5, with values of 92.11 and 86.25, respectively, outperforming the second-best method, EASE, by a notable margin. Similar trends are observed on the other datasets, where DGASA consistently outperforms the competing methods on both metrics. This highlights DGASA's strong generalization ability and robustness across different incremental learning scenarios. In addition to comparing DGASA with other PTM-based CIL methods, we also evaluate its performance against traditional exemplar-based methods, as shown in Table 2. These traditional methods typically rely on storing a fixed number of exemplars for each class to mitigate catastrophic forgetting and maintain previous knowledge. For the comparison, we follow the standard practice in class-incremental learning by setting the number of exemplars to 20 per class, as in the method proposed by Rebuffi et al. [1]. Despite not utilizing exemplars, DGASA maintains competitive performance when compared to these exemplar-based methods. This is a significant achievement, as it shows that DGASA's approach, which does not rely on memory replay or exemplar storage, can still effectively preserve knowledge and achieve superior performance across incremental learning tasks.
Overall, these results validate the effectiveness and efficiency of DGASA as a state-of-the-art method for class-incremental learning, demonstrating its ability to outperform both recent PTM-based approaches and traditional exemplar-based methods, all while avoiding the need for storing exemplars.
Ablation study
To analyze the impact of each component on the overall performance of the model, we performed ablation experiments on the CIFAR-100 dataset for the Inc5 task, validating the effectiveness of the four key components proposed: PTM, Gated Adapter (GA) module, Drift Compensation, and Adaptive Weighting by Instance-level Significance. The experimental results, as shown in Table 3, confirm the effectiveness of these methods. Among them, PTM refers to using a frozen pre-trained model solely as a feature extractor; GA indicates the introduction of Gated Adapter modules for task-specific subspace modeling; Drift Compensation refers to compensating for the drift of old class prototypes through the Reconstructing Prototypes under Distribution Drift mechanism; Adaptive Weighting dynamically adjusts the contribution of different subspaces during inference using the Adaptive Weighting by Instance-level Significance method.
We next conducted an ablation study to investigate the impact of the number and positions of the Gated Adapters on model performance, as shown in Table 4. The ablation results demonstrate that the insertion position of the GA module significantly affects model performance. Specifically, inserting adapters into all layers (a total of 12) achieved the best performance on CIFAR-100, reaching 92.11% ($\bar{\mathcal{A}}$) and 86.25% ($\mathcal{A}_B$), respectively. This result indicates that full-layer insertion enables comprehensive feature adaptation from low-level to high-level representations while keeping the pre-trained parameters frozen, thereby enhancing the model's expressiveness and generalization ability on new tasks. In contrast, inserting adapters only in the early or late layers leads to slightly degraded performance due to the limited adaptation scope.
An additional ablation study was conducted to evaluate the effect of the inference weighting coefficient $\alpha$, with results summarized in Table 5. The study demonstrates that $\alpha$ significantly affects model performance. As $\alpha$ increases from 0.05 to 0.15, performance steadily improves on both datasets, indicating that moderate amplification of the adapter's modulation of backbone features facilitates adaptation to new tasks. Further increasing $\alpha$ to 0.2 or 0.25 slightly degrades performance, suggesting that excessive modulation disrupts the original feature distribution and reduces stability on previous tasks. Based on these observations, $\alpha = 0.15$ is selected as the optimal setting and is used consistently across all benchmarks, achieving the best balance between the main and non-main subspaces.
Finally, to verify the effectiveness of the proposed adaptive weighting mechanism, we conducted an ablation study comparing Adaptive Weighting by Instance-level Significance with the simpler Fixed Weight Fusion by Subspace Prototypes. Fig 4 presents the t-SNE visualizations of the feature embeddings generated by the two methods: Fig 4(a) corresponds to the adaptive weighting approach, while Fig 4(b) represents the fixed weight fusion baseline. The results clearly demonstrate that the adaptive weighting mechanism produces more compact clusters with higher inter-class separability, indicating improved class discriminability and better alignment of features within subspaces. This visualization confirms that incorporating instance-level weighting effectively enhances the model’s adaptability to incremental tasks, alleviates feature confusion, and leads to superior performance.
Analysis
In this section, we provide a comprehensive analysis of DGASA across multiple dimensions, including parameter efficiency, inference cost, continual learning metrics, the adaptive weighting mechanism, and the stability of drift compensation.
Scalability and efficiency analysis
A key contribution of DGASA is its parameter-efficient design. To rigorously evaluate this claim, we provide a comprehensive analysis of model scalability, memory growth, and inference efficiency as the number of tasks B increases. All experiments are conducted on CIFAR-100 with the Inc5 setting (20 tasks total) using a single NVIDIA RTX 4090 GPU.
Parameter growth analysis
DGASA maintains a frozen pre-trained ViT-B/16 backbone and only adds lightweight adapters for each new task. The trainable parameters per adapter consist of:
- Down-projection: $W_{\mathrm{down}} \in \mathbb{R}^{r \times d}$, with $d = 768$, $r = 16$
- Up-projection: $W_{\mathrm{up}} \in \mathbb{R}^{d \times r}$
- Gating layer: $\mathbf{w}_{g} \in \mathbb{R}^{d}$ (applied after pooling)

Thus, the number of trainable parameters per adapter is:

$$2dr + d = 2 \times 768 \times 16 + 768 = 25{,}344.$$

Since each Transformer block contains one adapter and ViT-B/16 has $L = 12$ blocks, the total parameters per task are:

$$12 \times 25{,}344 = 304{,}128 \approx 0.3\ \mathrm{M}.$$

After $B$ tasks, the total trainable parameters are approximately $0.3\,B$ M, while the frozen backbone contains approximately 86 M parameters. Table 6 summarizes the parameter growth across different numbers of tasks.
Prototype memory cost
In addition to adapter parameters, DGASA stores class prototypes for classification. For each class, we store a prototype vector of dimension $B \times d$ (concatenated across all task subspaces). The memory cost for prototypes is:

$$|\mathcal{Y}_B| \times B \times d \ \text{floats}.$$

For CIFAR-100 with $|\mathcal{Y}_B| = 100$, $B = 20$, and $d = 768$, this equals $100 \times 20 \times 768 = 1{,}536{,}000$ floats (about 6 MB in float32).
Table 7 shows the prototype memory growth across tasks.
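The arithmetic behind these parameter and memory figures can be reproduced in a few lines; the numbers below follow directly from the quantities stated above.

```python
# Adapter parameters per task and prototype memory, as derived above.
d, r, L = 768, 16, 12          # embedding dim, bottleneck dim, ViT-B/16 blocks
per_module = 2 * d * r + d     # down-projection + up-projection + gate
per_task = L * per_module      # one GA module per Transformer block

B, num_classes = 20, 100       # CIFAR-100 Inc5: 20 tasks, 100 classes
total_adapter_params = B * per_task
prototype_floats = num_classes * B * d
prototype_mbytes = prototype_floats * 4 / 1e6   # float32 storage

print(per_module, per_task, total_adapter_params)    # 25344 304128 6082560
print(prototype_floats, round(prototype_mbytes, 2))  # 1536000 6.14
```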
Inference latency and throughput
We measure inference efficiency by evaluating latency (ms per sample) and throughput (samples per second) as the number of tasks increases. For each test sample, DGASA must:
- Forward pass through the frozen backbone (shared across all tasks)
- Forward pass through all B task-specific adapters
- Compute cosine similarity with all $|\mathcal{Y}_B|$ seen class prototypes
The inference cost scales linearly with the number of adapters and prototypes.
As shown in Table 8, latency increases from 4.21 ms to 7.38 ms as tasks grow from 1 to 20, while throughput decreases from 237.5 to 135.5 samples per second. The FLOPs increase modestly (16.8 G to 18.4 G) since the frozen backbone computation dominates. Peak memory usage remains under 1.2 GB, well within typical GPU limits.
Break-even analysis
To determine when modular growth begins to harm inference efficiency, we compare DGASA against two baselines:
- Full Fine-tuning: A single model updated on all data (upper bound on accuracy but suffers catastrophic forgetting)
- Static Adapter Baseline: A single adapter trained on the union of all tasks (lower memory but lower accuracy)
We summarize the efficiency–accuracy trade-off under different numbers of tasks in Table 9.
Key observations:
- Accuracy-efficiency trade-off: DGASA achieves significantly higher accuracy (86.25% vs. 49.08%) than the static adapter baseline, with only about 1.75× higher latency (7.38 ms vs. 4.21 ms) after 20 tasks.
- Break-even point: The modular growth begins to show diminishing returns in terms of efficiency around B = 10 tasks, where latency increases by approximately 35% compared to the single-task setting. However, the accuracy continues to improve from 84.52% (B = 5) to 86.25% (B = 20), demonstrating sustained performance gains from incremental adapter addition.
- Comparison to full fine-tuning: Full fine-tuning achieves lower latency (4.21 ms) but catastrophically forgets previous tasks (19.71% accuracy). DGASA trades a modest 3.17 ms increase in latency for a 66.54% absolute accuracy improvement.
Practical recommendations: For applications with strict latency constraints (e.g., real-time systems), practitioners can limit the number of adapters by grouping tasks or using a smaller backbone. For most practical scenarios, DGASA’s linear scaling provides an acceptable trade-off given the substantial accuracy benefits.
Comprehensive CIL metrics analysis
To provide a more complete evaluation of DGASA’s performance beyond average and final accuracy, we report standard continual learning metrics: average forgetting, backward transfer (BWT), and old/new class accuracy splits.
Definition of metrics. Following standard practice [25,26], we define:
- Average Forgetting measures how much the model forgets previously learned tasks:

$$F = \frac{1}{B-1} \sum_{b=1}^{B-1} \left( \max_{t \in \{1, \ldots, B-1\}} a_{t,b} - a_{B,b} \right),$$

where $a_{t,b}$ is the accuracy on task $b$ after training on task $t$.
- Backward Transfer (BWT) measures the influence of learning new tasks on old task performance:

$$\mathrm{BWT} = \frac{1}{B-1} \sum_{b=1}^{B-1} \left( a_{B,b} - a_{b,b} \right),$$

where negative BWT indicates forgetting (both quantities are computed as in the sketch following this list).
- Old vs. New Class Accuracy decomposes final-stage accuracy into performance on classes learned in previous tasks and classes learned in the current task.
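These definitions translate directly into a few lines of code. The sketch below assumes an accuracy matrix `acc[t, b]` holding the accuracy on task b after training task t, as in the per-task accuracy matrix reported later in this section.

```python
import numpy as np

def forgetting_and_bwt(acc: np.ndarray) -> tuple[float, float]:
    """acc[t, b]: accuracy on task b after training task t (shape (B, B);
    entries with b > t are unused). Returns (average forgetting, BWT)."""
    B = acc.shape[0]
    forgetting = np.mean([acc[:B - 1, b].max() - acc[B - 1, b]
                          for b in range(B - 1)])
    bwt = np.mean([acc[B - 1, b] - acc[b, b] for b in range(B - 1)])
    return float(forgetting), float(bwt)
```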
Forgetting and backward transfer analysis. Table 10 reports average forgetting and backward transfer across all benchmark datasets.
Key observations:
- DGASA achieves the lowest forgetting across all datasets (4.87% on CIFAR, 5.23% on CUB, 5.56% on ImageNet-R), demonstrating superior stability in preserving old knowledge
- The backward transfer of DGASA is closest to zero among all methods (less negative), indicating that learning new tasks has minimal detrimental impact on previously learned classes
- The combination of low forgetting and near-zero backward transfer confirms that our drift compensation mechanism effectively maintains old class representations
Old vs. New Class Accuracy Across Sessions. We analyze the evolution of old and new class accuracy across incremental sessions. Fig 5 shows the per-session accuracy decomposition for CIFAR-100 Inc5.
The analysis reveals:
- Old class accuracy remains stable throughout the incremental sequence (ranging from 85.2% to 86.1%), confirming effective mitigation of catastrophic forgetting
- New class accuracy is consistently high (around 87–89%), indicating strong adaptation to novel tasks
- The gap between old and new class accuracy is minimal (within 2–3%), demonstrating balanced performance across all classes
Statistical significance testing
To validate that DGASA’s improvements are statistically significant, we conduct paired t-tests comparing DGASA against the second-best method (EASE) across 5 random seeds. Table 11 reports the p-values.
Per-Task Accuracy Matrix. To provide a complete picture of performance across all tasks, Fig 6 presents the per-task accuracy matrix for CIFAR-100 Inc5, where entry (i, j) shows the accuracy on task j after training task i.
Diagonal entries show performance on the current task; off-diagonal entries show retention of previous tasks.
The matrix shows:
- Strong diagonal performance (87–92%) across all tasks
- Minimal decay in off-diagonal entries, confirming effective knowledge retention
- Performance on early tasks remains above 85% even after all 20 tasks, demonstrating excellent stability
Analysis of adaptive weighting mechanism
While DGASA achieves state-of-the-art performance across all benchmarks, we note that on CUB Inc10 the improvement over ADAM+Adapter is marginal (92.38% vs. 91.96% for $\bar{\mathcal{A}}$, and 86.91% vs. 86.48% for $\mathcal{A}_B$). To justify the added complexity of our instance-level adaptive weighting mechanism, we conduct a comprehensive analysis examining when and why this mechanism provides benefits, and under what conditions its gains are limited.
Old vs. New class accuracy split
We decompose the final-stage accuracy ($\mathcal{A}_B$) into performance on previously learned classes (old) and newly introduced classes (new) for the CUB Inc10 benchmark. Table 12 reveals that adaptive weighting primarily benefits old class recognition, which is critical for mitigating catastrophic forgetting.
Per-class confusion analysis. To understand where improvements occur, we analyze the confusion patterns between old and new classes. Fig 7 visualizes the normalized confusion matrices for ADAM+Adapter and DGASA on CUB Inc10.
The analysis reveals that:
- DGASA reduces confusion between visually similar old and new classes (e.g., between “Black-capped Chickadee” (old) and “Carolina Chickadee” (new))
- The adaptive weighting mechanism helps disambiguate samples that are semantically related to multiple tasks by dynamically emphasizing the most relevant subspace
- Most improvements occur in the off-diagonal blocks (old vs. new confusion), confirming that adaptive weighting enhances cross-task discrimination
Adaptive weight distribution analysis
To understand how the instance-level weighting behaves, we analyze the distribution of weights across different scenarios. Fig 8 shows the weight distributions for samples from old and new classes when evaluated on a later task.
Key observations:
- For samples from old classes, the weights are more evenly distributed across subspaces (average entropy = 1.84), indicating that the model leverages multiple subspaces to preserve old knowledge
- For samples from new classes, the primary subspace receives a significantly higher average weight, as the model relies primarily on the task-specific adapter
- The correlation between the weights and prediction confidence is positive, suggesting that the mechanism assigns higher weights to subspaces that produce more confident predictions
Failure-Case Analysis. We examine cases where DGASA still makes errors to understand limitations of the adaptive weighting mechanism. Table 13 shows representative failure examples on CUB Inc10.
Analysis of failure cases reveals:
- Most errors occur between visually similar species that belong to different tasks
- In these cases, the adaptive weights become nearly uniform, indicating the mechanism struggles to disambiguate highly similar classes across task boundaries
- This suggests that the primary limitation is not the weighting mechanism itself, but rather the discriminative capacity of the underlying adapters for extremely fine-grained distinctions
Calibration analysis. We evaluate model calibration using Expected Calibration Error (ECE) to assess whether adaptive weighting affects prediction confidence reliability. Table 14 reports ECE across methods.
Ablation: When Does Adaptive Weighting Help?
We conduct targeted ablations to identify conditions where adaptive weighting provides the most benefit. Table 15 shows results on CUB Inc10 under different configurations.
Key findings:
- High inter-task similarity: Adaptive weighting provides the largest gains (+0.82% vs. baseline +0.43%), as it helps disambiguate semantically related classes across tasks
- Low inter-task similarity: Gains are marginal (+0.07%), as tasks are already well-separated and the primary subspace suffices
- Small adapter capacity: Adaptive weighting yields substantial improvement (+1.44%), compensating for limited task-specific representation by aggregating cross-task information
- Large adapter capacity: Gains are smaller (+0.12%), as individual adapters already capture rich task-specific features
These results explain the marginal improvement on CUB Inc10: the dataset exhibits moderate inter-task similarity, and the default adapter capacity (r = 16) is sufficient, leaving limited room for improvement from adaptive weighting. However, in challenging scenarios with high task similarity or constrained adapter capacity, the mechanism provides significant benefits.
Computational overhead analysis
We quantify the additional computational cost of adaptive weighting. Table 16 shows the overhead relative to the baseline inference.
The adaptive weighting mechanism adds only 0.57 ms (7.7% overhead) and 480 M FLOPs (2.6% overhead) to the inference cost, which is negligible given the accuracy benefits in challenging scenarios.
Summary
The adaptive weighting mechanism provides:
- Primary benefit: Improved old class accuracy (+1.61% on CUB Inc10) by leveraging cross-subspace information
- Greatest impact: Scenarios with high inter-task similarity (+0.82%) or constrained adapter capacity (+1.44%)
- Limited benefit: When tasks are already well-separated or adapters have sufficient capacity
- Negligible cost: Only 7.7% inference overhead
The marginal gains on CUB Inc10 are therefore justified by the mechanism’s substantial benefits in more challenging scenarios and its low computational overhead.
Stability analysis of drift compensation
We analyze the numerical stability of the proposed drift compensation mapping from three perspectives: regularization, increment size, and long-term error behavior.
Regularized mapping formulation.
To improve numerical robustness, we adopt a ridge-regularized solution instead of directly solving the normal equations:

$$W = \left( P_{n,o}^{\top} P_{n,o} + \lambda I \right)^{-1} P_{n,o}^{\top} P_{n,n},$$

where $\lambda$ controls the trade-off between stability and fitting accuracy. In practice, $\lambda$ is selected via a small validation split.
Evaluation metrics.
Beyond classification accuracy, we evaluate the quality of drift compensation using the prototype reconstruction error

$$\epsilon_{\mathrm{proto}} = \frac{\left\| \hat{P}_{o,n} - P_{o,n} \right\|_{F}}{\left\| P_{o,n} \right\|_{F}},$$

which measures how closely the reconstructed prototypes match reference old-class prototypes computed in the new subspace for analysis purposes. We also report the cosine similarity between reconstructed and ground-truth prototypes in the new feature space.
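As a sketch, these two diagnostics can be computed as follows, assuming ground-truth prototypes are available for analysis; the relative Frobenius normalization is our choice of convention, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def drift_metrics(P_hat: torch.Tensor, P_true: torch.Tensor) -> tuple[float, float]:
    """P_hat: (C_old, d) reconstructed old-class prototypes; P_true: ground truth.
    Returns (relative Frobenius reconstruction error, mean cosine similarity)."""
    rel_err = torch.linalg.norm(P_hat - P_true) / torch.linalg.norm(P_true)
    cos = F.cosine_similarity(P_hat, P_true, dim=1).mean()
    return rel_err.item(), cos.item()
```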
Effect of regularization.
We vary $\lambda$ and report performance in Table 17.
Effect of increment size and conditioning.
We further analyze how the number of newly introduced classes affects numerical stability. Since the rank and conditioning of $P_{n,o}^{\top} P_{n,o}$ depend on the number of new class prototypes, smaller increments tend to produce more ill-conditioned systems. Table 18 summarizes the results.
Error accumulation across long sequences
To evaluate long-term stability, we analyze performance as the number of incremental tasks increases. Unlike sequential composition methods, our approach re-estimates a global mapping at each session using all current new classes as anchors. This prevents error propagation across sessions.
As shown in Table 19, both accuracy and reconstruction metrics remain stable across all tasks. The absence of monotonic degradation confirms that our global re-estimation strategy effectively prevents error accumulation.
Summary
Overall, the proposed drift compensation method demonstrates strong numerical stability: (1) ridge regularization mitigates ill-conditioning, (2) performance remains robust under small increment sizes, and (3) the global mapping strategy avoids long-term error accumulation.
Discussion
The proposed DGASA method demonstrates substantial improvements over existing class-incremental learning approaches across multiple benchmark datasets. By integrating lightweight attention-gated adapters with a frozen pre-trained backbone, DGASA effectively balances parameter efficiency and task-specific adaptation. The subspace drift compensation mechanism plays a critical role in maintaining the consistency of old class prototypes without requiring access to historical data, which not only preserves privacy but also enhances model stability. Additionally, the instance-level adaptive weighting strategy significantly boosts classification accuracy by dynamically adjusting the contribution of each task-specific subspace based on input relevance. These innovations collectively enable DGASA to outperform state-of-the-art methods, achieving superior average and final-stage accuracy in various incremental learning scenarios.
Beyond performance gains, DGASA offers a meaningful step toward practical and privacy-preserving continual learning systems. Its exemplar-free design addresses real-world constraints where storing previous data is infeasible due to memory or privacy limitations. The method’s ability to maintain and transfer knowledge across tasks without catastrophic forgetting makes it particularly suitable for deployment in dynamic environments, such as autonomous systems, personalized education, or medical diagnostics. Furthermore, the modular architecture of DGASA allows for scalable integration with existing pre-trained models, facilitating broader adoption in resource-constrained settings. Overall, DGASA contributes a robust, efficient, and privacy-aware solution to the ongoing challenge of lifelong learning in intelligent systems.
Conclusion
Incremental learning is a critical capability that intelligent systems in the real world should possess. To this end, we propose an efficient, parameter-friendly, and privacy-aware class-incremental learning method—Dynamic Gated Adapter for Subspace Alignment (DGASA). Our method freezes the pre-trained backbone network and introduces lightweight Gated Adapter (GA) modules for each incremental task to construct task-specific feature subspaces. These subspaces are dynamically fused across tasks, effectively alleviating catastrophic forgetting and task interference. To address the degradation of old class representations caused by evolving feature subspaces, DGASA designs a subspace alignment mechanism that learns a linear mapping between new and old subspaces, enabling consistent reconstruction of old class prototypes in the current subspace without accessing data from previous tasks. Extensive experimental results demonstrate that DGASA achieves superior performance on multiple incremental learning benchmark datasets, validating the effectiveness and applicability of the proposed method.
References
- 1.
Rebuffi S, Kolesnikov A, Sperl G, Lampert CH. iCaRL: Incremental Classifier and Representation Learning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5533–42. https://doi.org/10.1109/CVPR.2017.587
- 2. French RM, Chater N. Using noise to compute error surfaces in connectionist networks: a novel means of reducing catastrophic forgetting. Neural Comput. 2002;14(7):1755–69. pmid:12079555
- 3.
French RM, Ferrara A. Modeling time perception in rats: Evidence for catastrophic interference in animal learning. Proceedings of the Twenty First Annual Conference of the Cognitive Science Society. Psychology Press. 2020. 173–8. https://doi.org/10.4324/9781410603494-35
- 4.
Chen X, Chang X. Dynamic Residual Classifier for Class Incremental Learning. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 18697–706. https://doi.org/10.1109/iccv51070.2023.01718
- 5.
Hu Z, Li Y, Lyu J, Gao D, Vasconcelos N. Dense Network Expansion for Class Incremental Learning. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 11858–67. https://doi.org/10.1109/cvpr52729.2023.01141
- 6.
Douillard A, Rame A, Couairon G, Cord M. DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 9275–85. https://doi.org/10.1109/cvpr52688.2022.00907
- 7.
Zhou D-W, Sun H-L, Ye H-J, Zhan D-C. Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 23554–64. https://doi.org/10.1109/cvpr52733.2024.02223
- 8.
Wang F-Y, Zhou D-W, Ye H-J, Zhan D-C. FOSTER: Feature Boosting and Compression for Class-Incremental Learning. Lecture Notes in Computer Science. Springer Nature Switzerland. 2022. p. 398–414. https://doi.org/10.1007/978-3-031-19806-9_23
- 9.
Gao Q, Zhao C, Sun Y, Xi T, Zhang G, Ghanem B, et al. A Unified Continual Learning Framework with General Parameter-Efficient Tuning. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 11449–59. https://doi.org/10.1109/iccv51070.2023.01055
- 10.
Zhang G, Wang L, Kang G, Chen L, Wei Y. SLCA: Slow Learner with Classifier Alignment for Continual Learning on a Pre-trained Model. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 19091–101. https://doi.org/10.1109/iccv51070.2023.01754
- 11.
Zhou DW, Sun HL, Ning J, Ye HJ, Zhan DC. Continual learning with pre-trained models: A survey. 2024. https://arxiv.org/abs/2401.16386
- 12.
Goswami D, Liu Y, Twardowski B, Van De Weijer J. FeCAM: Exploiting the Heterogeneity of Class Distributions in Exemplar-Free Continual Learning. In: Advances in Neural Information Processing Systems 36, 2023. 6582–95. https://doi.org/10.52202/075280-0288
- 13.
Pian W, Mo S, Guo Y, Tian Y. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023. 7765–77. https://doi.org/10.1109/ICCV51070.2023.00717
- 14.
Villa A, Alcázar JL, Alfarra M, Alhamoud K, Hurtado J, Heilbron FC, et al. PIVOT: Prompting for Video Continual Learning. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 24214–23. https://doi.org/10.1109/cvpr52729.2023.02319
- 15.
Hong X, Huang Z, Wang Y. S-Prompts Learning with Pre-Trained Transformers: An Occam’s Razor for Domain Incremental Learning. In: Advances in Neural Information Processing Systems 35, 2022. 5682–95. https://doi.org/10.52202/068431-0411
- 16.
Smith JS, Karlinsky L, Gutta V, Cascante-Bonilla P, Kim D, Arbelle A, et al. CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 11909–19. https://doi.org/10.1109/cvpr52729.2023.01146
- 17.
Wang Z, Zhang Z, Ebrahimi S, Sun R, Zhang H, Lee CY, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In: European conference on computer vision. Springer; 2022. p. 631–48.
- 18.
Wang Z, Zhang Z, Lee C-Y, Zhang H, Sun R, Ren X, et al. Learning to Prompt for Continual Learning. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 139–49. https://doi.org/10.1109/cvpr52688.2022.00024
- 19. Zhou D-W, Cai Z-W, Ye H-J, Zhan D-C, Liu Z. Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need. Int J Comput Vis. 2024;133(3):1012–32.
- 20. Dohare S, Hernandez-Garcia JF, Lan Q, Rahman P, Mahmood AR, Sutton RS. Loss of plasticity in deep continual learning. Nature. 2024;632(8026):768–74. pmid:39169245
- 21.
Aljundi R, Babiloni F, Elhoseiny M, Rohrbach M, Tuytelaars T. Memory Aware Synapses: Learning What (not) to Forget. Lecture Notes in Computer Science. Springer International Publishing. 2018. p. 144–61. https://doi.org/10.1007/978-3-030-01219-9_9
- 22. Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, et al. Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci U S A. 2017;114(13):3521–6. pmid:28292907
- 23. Li Z, Hoiem D. Learning without Forgetting. IEEE Trans Pattern Anal Mach Intell. 2018;40(12):2935–47. pmid:29990101
- 24. Wang Z, Liu L, Duan Y, Tao D. Continual Learning through Retrieval and Imagination. AAAI. 2022;36(8):8594–602.
- 25.
Cha H, Lee J, Shin J. Co2l: Contrastive continual learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. 9516–25.
- 26. Lopez-Paz D, Ranzato M. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems. 2017;30.
- 27.
Zhu K, Zhai W, Cao Y, Luo J, Zha Z-J. Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 9286–95. https://doi.org/10.1109/cvpr52688.2022.00908
- 28. Shin H, Lee JK, Kim J, Kim J. Continual learning with deep generative replay. Advances in neural information processing systems. 2017;30.
- 29.
Yan S, Xie J, He X. DER: Dynamically Expandable Representation for Class Incremental Learning. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 3013–22. https://doi.org/10.1109/cvpr46437.2021.00303
- 30. Wang L, Zhang X, Li Q, Zhang M, Su H, Zhu J, et al. Incorporating neuro-inspired adaptability for continual learning in artificial intelligence. Nat Mach Intell. 2023;5(12):1356–68.
- 31.
Zhou D-W, Sun H-L, Ye H-J, Zhan D-C. Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 23554–64. https://doi.org/10.1109/cvpr52733.2024.02223
- 32.
Cao B, Tang Q, Lin H, Jiang S, Dong B, Han X. Retentive or forgetful? Diving into the knowledge memorizing mechanism of language models. arXiv preprint. 2023. https://arxiv.org/abs/2305.09144
- 33.
Chen S, Ge C, Luo P, Song Y, Tong Z, Wang J, et al. AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition. In: Advances in Neural Information Processing Systems 35, 2022. 16664–78. https://doi.org/10.52202/068431-1212
- 34.
Huang M, Su H, Wang L, Xie J, Zhang X, Zhu J. Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality. In: Advances in Neural Information Processing Systems 36, 2023. 69054–76. https://doi.org/10.52202/075280-3022
- 35. Wang Y, Ma Z, Huang Z, Wang Y, Su Z, Hong X. Isolation and Impartial Aggregation: A Paradigm of Incremental Learning without Interference. AAAI. 2023;37(8):10209–17.
- 36.
Zheng Z, Ma M, Wang K, Qin Z, Yue X, You Y. Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 19068–79. https://doi.org/10.1109/iccv51070.2023.01752
- 37. Zhou D-W, Zhang Y, Wang Y, Ning J, Ye H-J, Zhan D-C, et al. Learning Without Forgetting for Vision-Language Models. IEEE Trans Pattern Anal Mach Intell. 2025;47(6):4489–504. pmid:40184303
- 38.
Ye HJ, Zhan DC, Si XM, Jiang Y. Learning Mahalanobis Distance Metric: Considering Instance Disturbance Helps. In: IJCAI; 2017. p. 3315–21.
- 39.
Abbasnejad E, Gong D, McDonnell MD, Parvaneh A, Van Den Hengel A. RanPAC: Random Projections and Pre-trained Models for Continual Learning. In: Advances in Neural Information Processing Systems 36, 2023. 12022–53. https://doi.org/10.52202/075280-0526
- 40.
Panos A, Kobe Y, Reino DO, Aljundi R, Turner RE. First Session Adaptation: A Strong Replay-Free Baseline for Class-Incremental Learning. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 18774–84. https://doi.org/10.1109/iccv51070.2023.01725
- 41.
Yu L, Twardowski B, Liu X, Herranz L, Wang K, Cheng Y, et al. Semantic Drift Compensation for Class-Incremental Learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 6980–9. https://doi.org/10.1109/cvpr42600.2020.00701
- 42.
Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. 2009.
- 43.
Wah C, Branson S, Welinder P, Perona P, Belongie S. The caltech-ucsd birds-200-2011 dataset. 2011.
- 44.
Hendrycks D, Basart S, Mu N, Kadavath S, Wang F, Dorundo E, et al. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 8320–9. https://doi.org/10.1109/iccv48922.2021.00823
- 45. Barbu A, Mayo D, Alverio J, Luo W, Wang C, Gutfreund D. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in Neural Information Processing Systems. 2019;32.
- 46.
Zhang Y, Yin Z, Shao J, Liu Z. Benchmarking omni-vision representation through the lens of visual realms. In: European conference on computer vision. Springer; 2022. 594–611.
- 47. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems. 2019;32.