Convolutional Neural Networks (CNNs) have been successfully applied to a broad range of computer vision tasks [17, 11, 34, 4, 22, 25, 45, 19]. For practical use, we train CNN models on large-scale image datasets [5] and then deploy them on smart agents. As the smart agents are often exposed to new and dynamic environments, there is an urgent need to continuously adapt the models to recognize newly emerging classes. For example, the smart album function on smartphones is designed to automatically classify user photos into both the pre-defined and user-defined classes. The model underpinning the smart album is pre-trained on the training set of the pre-defined classes, and is required to adapt to the new user-defined classes by learning from new photos. From the users’ perspective, they are only willing to annotate very few image examples for the new class, as the labeling process consumes manpower. Therefore, it is crucial for CNNs to be capable of incrementally learning new classes from very few training examples. We term this ability few-shot class-incremental learning (FSCIL).
Figure 1. Comparisons of two ways to characterize a heterogeneous manifold. (a) Randomly sampled representatives, which are adopted by conventional CIL studies for knowledge distillation. (b) The representatives learned by neural gas, which well preserves the topology of the manifold.
A naïve approach for FSCIL is to finetune the base model on the new class training set. However, simple finetuning with a limited number of training samples would cause two severe problems: one is “forgetting old”, where the model’s performance deteriorates drastically on old classes due to catastrophic forgetting [7]; the other is “overfitting new”, where the model is prone to overfit to new classes and loses generalization ability on a large set of test samples.
Recently, there have been many research efforts attempting to solve the catastrophic forgetting problem [15, 49, 20, 24, 18, 32, 2, 13, 41, 37, 1]. They usually conduct incremental learning under the multi-task or the multi-class scenarios. The former incrementally learns a sequence of disjoint tasks, which requires the task identity in advance. This is seldom satisfied in real applications where the task identity is typically unavailable. The latter learns a unified classifier to recognize all the encountered classes within a single task. This scenario is more practical without the need of knowing task information. In this paper, we study the FSCIL problem under the multi-class scenario, where we treat FSCIL as a particular case of the class-incremental learning (CIL) [32, 2, 10, 13, 48]. Compared with CIL that learns new classes with unlimited, usually large-scale training samples, FSCIL is more challenging, since the number of new training samples is very limited.
To mitigate forgetting, most CIL works [32, 2, 35, 13, 48] use the knowledge distillation [12] technique that maintains the network’s output logits corresponding to old classes. They usually store a set of old class exemplars and apply the distillation loss to the network’s output. Despite their effectiveness, there are several problems when training with the distillation loss. One is the class-imbalance problem [13, 48], where the output logits are biased towards those classes with a significantly larger number of training samples. The other is the performance trade-off between old and new classes. This problem is more prominent for FSCIL, because learning from very few training samples requires a larger learning rate and stronger gradients from new classes’ classification loss, making it difficult to maintain the output for old classes at the same time.
In this paper, we address FSCIL from a new, cognitive-inspired perspective of knowledge representation. Recent discoveries in cognitive science reveal the importance of topology preservation for maintaining the memory of the old knowledge [29, 21]. The change of the memory’s topology will cause severe degradation of human recognition performance on historical visual stimuli [29], indicating catastrophic forgetting. Inspired by this, we propose a new FSCIL framework, named TOpology-Preserving knowledge InCrementer (TOPIC), as shown in Figure 1. TOPIC uses a neural gas (NG) network [42, 8, 31] to model the topology of feature space. When learning the new classes, NG grows to adapt to the change of feature space. On this basis, we formulate FSCIL as an optimization problem with two objectives. On the one hand, to avoid catastrophic forgetting, TOPIC preserves the old knowledge by stabilizing the topology of NG, which is implemented with an anchor loss (AL) term. On the other hand, to prevent overfitting to few-shot new classes, TOPIC adapts the feature space by pushing each new class training sample towards a correct new NG node with the same label and pulling the new nodes of different labels away from each other. The min-max loss (MML) term is developed to achieve this purpose.
For extensive assessment, we build the FSCIL baselines by adapting the state-of-the-art CIL methods [32, 2, 13] to this new problem and compare our method with them. We conduct comprehensive experiments on the popular CIFAR100 [16], miniImageNet [43], and CUB200 [44] datasets. Experimental results demonstrate the effectiveness of the proposed FSCIL framework.
To summarize, our main contributions include:
• We recognize the importance of few-shot class-incremental learning (FSCIL) and define a problem setting to better organize the FSCIL research study. Compared with the popularly studied class-incremental learning (CIL), FSCIL is more challenging but more practical.
• We propose an FSCIL framework TOPIC that uses a neural gas (NG) network to learn feature space topologies for knowledge representation. TOPIC stabilizes the topology of NG to mitigate forgetting and adapts NG to enhance the discriminative power of the learned features for few-shot new classes.
• We provide an extensive assessment of the FSCIL methods, in which we adapt the state-of-the-art CIL methods to FSCIL and make comprehensive comparisons with them.
2.1. Class-Incremental Learning
Class-incremental learning (CIL) learns a unified classifier incrementally to recognize all classes encountered so far. To mitigate the forgetting of the old classes, CIL studies typically adopt the knowledge distillation technique, where external memory is often used for storing old class exemplars to compute the distillation loss. For example, iCaRL [32] maintains an “episodic memory” of the exemplars and incrementally learns the nearest-neighbor classifier for the new classes. EEIL [2] adds the distillation loss term to the cross-entropy loss for end-to-end training. Latest CIL works NCM [13] and BiC [48] reveal the class-imbalance problem that biases the network’s prediction towards new classes. They adopt a cosine distance metric to eliminate the bias in the output layer [13], or learn a bias-correction model to post-process the output logits [48].
In contrast to these CIL works, we focus on the more difficult FSCIL problem, where the number of new class training samples is limited. Rather than constraining the network’s output, we try to constrain CNN’s feature space represented by a neural gas network.
2.2. Multi-task Incremental Learning
A series of research works adopts the multi-task incremental learning scenario. These works can be categorized into three types: (1) rehearsal approaches [24, 3, 37, 50, 46], (2) architectural approaches [27, 26, 1, 36, 47], and (3) regularization approaches [15, 49, 23, 18]. Rehearsal approaches replay the old tasks’ information to the task solver when learning the new task. One way is to store the old tasks’ exemplars using external memory and constrain their losses while learning the new task [24, 3]. Another way is to use generative models to memorize the old tasks’ data distribution [37, 46, 50]. For example, DGR [37] learns a generative adversarial network to produce observed samples for the task solver. The recognition performance is affected by the quality of the generated samples. Architectural approaches alleviate forgetting by manipulating the network’s architecture, such as network pruning, dynamic expansion, and parameter masking. For example, PackNet [27] prunes the network to create free parameters for the new task. HAT [36] learns the attention masks for old tasks and uses them to constrain the parameters when learning the new task. Regularization approaches impose regularization on the network’s parameters, losses or output logits. For example, EWC [15] and its variants [49, 23] penalize the changing of the parameters important to old tasks. These methods are typically based on certain assumptions of the parameters’ posterior distribution (e.g. Gaussian), which may struggle in more complex scenarios.
As the multi-task incremental learning methods are aimed at learning disjoint tasks, it is infeasible to apply these methods under the single-task multi-class scenario adopted by FSCIL. As a result, we have to exclude them from the comparison.
2.3. Dynamic Few-Shot Learning
Few-shot learning (FSL) aims to adapt the model to recognize unseen novel classes using very few training samples, while the model’s recognition performance on the base classes is not considered. To achieve FSL, research studies usually adopt the metric learning and meta-learning strategies [43, 38, 40, 6, 39]. Recently, some FSL research works attempt to learn a model capable of recognizing both the base and novel classes [9, 33]. Typically, they first pretrain the model on the base training set to learn feature embedding as well as the weights of the classifier for base classes. Then they perform meta-learning for few-shot novel classes, by sampling “fake” few-shot classification tasks from the base dataset to learn a classifier for novel classes. Finally, the learned heads are combined for recognizing the joint test (query) set of the base and novel classes.
Though some of these works [33] regard such a setting as a kind of incremental learning, they rely on the old training set (i.e., the base class dataset) for sampling meta-learning tasks. This is entirely different from the FSCIL setting, where the base/old class training set is unavailable at the new incremental stage. As a consequence, these few-shot learning works cannot be directly applied to FSCIL.
We define the few-shot class-incremental-learning (FSCIL) setting as follows. Suppose we have a stream of labelled training sets $D^{(1)}, D^{(2)}, \cdots$, where $D^{(t)} = \{(x_j^{(t)}, y_j^{(t)})\}_{j=1}^{|D^{(t)}|}$. $L^{(t)}$ is the set of classes of the $t$-th training set, where $\forall i \neq j$, $L^{(i)} \cap L^{(j)} = \emptyset$. $D^{(1)}$ is the large-scale training set of base classes, and $D^{(t)}$, $t > 1$, is the few-shot training set of new classes. The model $\Theta$ is incrementally trained on $D^{(1)}, D^{(2)}, \cdots$ with a unified classification layer, while only $D^{(t)}$ is available at the $t$-th training session. After training on $D^{(t)}$, $\Theta$ is tested to recognize all encountered classes in $L^{(1)}, \cdots, L^{(t)}$. For $D^{(t)}$, $t > 1$, we denote the setting with $C$ classes and $K$ training samples per class as the $C$-way $K$-shot FSCIL. The main challenges are twofold: (1) avoiding catastrophic forgetting of old classes; (2) preventing overfitting to few-shot new classes.
To perform FSCIL, we treat the CNN as a composition of a feature extractor $f(\cdot; \theta)$ with the parameter set $\theta$ and a classification head. The feature extractor defines the feature space $F \subseteq \mathbb{R}^n$. The classification head with the parameter set $\phi$ produces the output vector followed by a softmax function to predict the probability $p$ over all classes. The entire set of parameters is denoted as $\Theta = \{\theta, \phi\}$. The output vector given input $x$ is $o(x; \Theta) = \phi^\top f(x; \theta)$. Initially, we train $\Theta^{(1)}$ on $D^{(1)}$ with the cross-entropy loss. Then we incrementally finetune the model on $D^{(2)}, D^{(3)}, \cdots$, and get $\Theta^{(2)}, \Theta^{(3)}, \cdots$. At the $t$-th session ($t > 1$), the output layer is expanded for new classes by adding $|L^{(t)}|$ output neurons.
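The expansion of the output layer can be illustrated with a minimal sketch. It assumes a bias-free linear head stored as one weight row per class; the names `expand_head` and `logits`, the Gaussian initialisation and its scale are illustrative assumptions and are not taken from the paper.

```python
import random

def expand_head(weights, num_new_classes, feat_dim, seed=0):
    """Return a copy of `weights` with `num_new_classes` new rows appended.

    Old rows are copied unchanged, so old-class outputs are preserved at the
    moment of expansion. The small Gaussian init is an assumption.
    """
    rng = random.Random(seed)
    new_rows = [[rng.gauss(0.0, 0.01) for _ in range(feat_dim)]
                for _ in range(num_new_classes)]
    return [row[:] for row in weights] + new_rows

def logits(weights, feature):
    """One output value per class: dot product of each class row with the feature."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weights]

# Toy example: 2 old classes, 3-dimensional features, a 5-way new session.
head = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
bigger = expand_head(head, num_new_classes=5, feat_dim=3)
old_out = logits(head, [1.0, 2.0, 3.0])
new_out = logits(bigger, [1.0, 2.0, 3.0])
```

In this sketch the first two entries of `new_out` coincide with `old_out`, since only new rows are added.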
For FSCIL, we first introduce a baseline solution to alleviate forgetting based on knowledge distillation; then we elaborate our proposed TOPIC framework that employs a neural gas network for knowledge representation and the anchor loss and min-max loss terms for optimization.
3.1. Baseline: Knowledge Distillation Approach
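Most CIL works [32, 2, 13, 48] adopt the knowledge distillation technique for mitigating forgetting. Omitting the superscript $(t)$, the loss function is defined as:
$$\ell(D, P; \Theta) = \ell_{CE}(D, P; \Theta) + \gamma\, \ell_{DL}(D, P; \Theta), \tag{1}$$
where $\ell_{DL}$ and $\ell_{CE}$ are the distillation and cross-entropy loss terms, $\gamma$ weights the distillation term, and $P$ is the set of old class exemplars drawn from $D^{(1)}, \cdots, D^{(t-1)}$. The implementation of $\ell_{DL}$ may vary in different works. Generally, it takes the form:
$$\ell_{DL}(D, P; \Theta) = \sum_{(x, y) \in D \cup P} \sum_{k=1}^{n} -\tau_k(x; \hat{\Theta}) \log \tau_k(x; \Theta), \qquad \tau_k(x; \Theta) = \frac{e^{o_k(x; \Theta)/T}}{\sum_{j=1}^{n} e^{o_j(x; \Theta)/T}}, \tag{2}$$
where $n = \sum_{i=1}^{t-1} |L^{(i)}|$ is the number of the old classes, $\hat{\Theta}$ denotes the initial values of $\Theta$ before finetuning, and $T$ is the distillation temperature (e.g., $T = 2$ in [2, 13]).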
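A generic distillation term of this kind can be sketched for a single sample as follows. This is an illustration of temperature-softened distillation restricted to the old-class logits, not the exact loss of any cited work; the function names and the per-sample formulation are assumptions.

```python
import math

def softened_softmax(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [v / temperature for v in logits]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(old_logits, new_logits, num_old, temperature=2.0):
    """Cross-entropy between softened old-model and current-model outputs.

    Only the first `num_old` logits (the old classes) enter the loss;
    `old_logits` come from the model before finetuning.
    """
    target = softened_softmax(old_logits[:num_old], temperature)
    current = softened_softmax(new_logits[:num_old], temperature)
    return -sum(t * math.log(c) for t, c in zip(target, current))
```

Summing this value over the new samples and the stored exemplars gives the batch-level term. The loss is smallest when the current old-class outputs reproduce those of the model before finetuning.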
The distillation approach faces several critical issues when applied to FSCIL. One is the bias problem caused by imbalanced old/new class training data, where the output layer is biased towards new classes [13, 48]. To address this issue, [13] uses a cosine distance measure to eliminate the bias and [48] learns a bias-correction model to post-process the outputs. Despite their effectiveness in learning large-scale training data, they are less effective for FSCIL with very few training samples. Using cosine distance may lose important patterns (e.g. appearance) contained in the magnitude of the weight/feature vector, while the bias-correction model requires a large number of training samples, which conflicts with the few-shot setting. Another issue is the dilemma of balancing the contributions of the cross-entropy term $\ell_{CE}$ and the distillation term $\ell_{DL}$, which may lead to an unsatisfactory performance trade-off. Learning few-shot new classes requires a larger learning rate to minimize $\ell_{CE}$, while it can cause instability of the output logits and makes it difficult to minimize $\ell_{DL}$.
Based on the above considerations, we abandon the distillation loss in our framework. Instead, we manipulate the knowledge contained in CNN’s feature space, which contains richer information than the output logits.
3.2. Knowledge Representation as Neural Gas
The knowledge distillation methods typically store a set of exemplars randomly drawn from the old training set and compute the distillation loss using these exemplars. However, there is no guarantee that the randomly-sampled exemplars can well represent heterogeneous, non-uniform data of different classes in the FSCIL scenarios. Instead, we represent the knowledge by preserving the feature space topology, which is achieved by a neural gas (NG) network [42]. NG maps the feature space F to a finite set of feature vectors and preserves the topology of F by competitive Hebbian learning [28], as shown in Figure 2.
NG defines an undirected graph $G = \langle V, E \rangle$. Each vertex $v_j \in V$ is assigned a centroid vector $m_j \in \mathbb{R}^n$ describing the location of $v_j$ in feature space. The edge set $E$ stores the neighborhood relations of the vertices. If $v_i$ and $v_j$ are topologically adjacent, $e_{ij} = 1$; otherwise, $e_{ij} = 0$. Each edge $e_{ij}$ is assigned an “age” $a_{ij}$ initialized to 0. Given an input $f \in F$, it matches the NG node $j$ with the minimum distance $d(f, m_j)$ to $f$. The matching process divides $F$ into disjoint subregions, where the centroid vector $m_j$ encodes the region $F_j = \{f \in F \mid d(f, m_j) \le d(f, m_i), \forall i\}$. We use the Euclidean distance as $d(\cdot, \cdot)$.
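The matching step described above amounts to a nearest-centroid lookup. The following minimal sketch uses plain Python lists for feature vectors; `euclidean` and `match_node` are illustrative names, and the toy centroids are made up for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_node(feature, centroids):
    """Index of the centroid closest to `feature` (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda j: euclidean(feature, centroids[j]))

# Toy example: three nodes partition a 2-D feature space into three subregions.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.1, 0.2], [0.9, 1.2], [3.5, 0.1], [0.6, 0.6]]
regions = {}
for f in features:
    regions.setdefault(match_node(f, centroids), []).append(f)
```

Here `regions` groups the toy features by their winning node, which mirrors how each centroid encodes one subregion of the feature space.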
Note that some variants of NG [8, 31] use different approaches to construct NG incrementally. To be consistent with FSCIL, we directly modify the original version [42] and learn a fixed set of nodes for the base classes. As NG [42] is originally learnt from unlabelled data, to accomplish the supervised incremental learning, we redefine the NG node $j$ as a tuple $v_j = (m_j, \Lambda_j, z_j, c_j) \in V$, where $m_j \in \mathbb{R}^n$ is the centroid vector representing $F_j$, the diagonal matrix $\Lambda_j \in \mathbb{R}^{n \times n}$ stores the variance of each dimension of $m_j$, and $z_j$ and $c_j$ are the assigned images and labels for computing the observation $\hat{m}_j$. With $c_j$, we can determine whether $v_j$ corresponds to an old class or a new class.
Figure 2. NG preserves the topology of the heterogeneous feature space manifold. Initially, NG is learnt for base classes (the blue dots and lines). Then NG incrementally grows for new classes by inserting new nodes and edges (the orange dots and lines). During the competitive Hebbian learning, $v_j$’s centroid vector $m_j$ is adapted to the input vector $f$ which falls in $F_j$ encoded by $v_j$.
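At the initial session ($t = 1$), the NG net with $N^{(1)}$ nodes $G^{(1)} = \langle V^{(1)}, E^{(1)} \rangle$ is trained on the feature set $F^{(1)} = \{f(x; \theta^{(1)}) \mid \forall x \in D^{(1)}\}$ using competitive Hebbian learning. Concretely, given an input $f \in F^{(1)}$, its distance to each NG node is computed and stored in $D_f = \{d(f, m_i) \mid i = 1, \cdots, N^{(1)}\}$. $D_f$ is then sorted in ascending order to get the rank of the nodes $R_f = \{r_i \mid d(f, m_{r_i}) \le d(f, m_{r_{i+1}}), i = 1, \cdots, N^{(1)} - 1\}$. Then, for each node $r_i$, its centroid $m_{r_i}$ is updated to $m^*_{r_i}$:
$$m^*_{r_i} = m_{r_i} + \eta \cdot e^{-i/\alpha} (f - m_{r_i}), \tag{3}$$
where $\eta$ is the learning rate, and $e^{-i/\alpha}$ is a decay function controlled by $\alpha$. We use the superscript $*$ to denote the updated one. The nodes distant from $f$ are less affected by the update. Next, the edges of all connections of $r_1$ are updated as:
$$a^*_{r_1 j} = \begin{cases} 1, & j = r_2; \\ a_{r_1 j} + 1, & j \neq r_2, \end{cases} \qquad e^*_{r_1 j} = \begin{cases} 1, & j = r_2; \\ 0, & a^*_{r_1 j} > T; \\ e_{r_1 j}, & \text{otherwise}. \end{cases} \tag{4}$$
Apparently, $r_1$ and $r_2$ are the nearest and the second nearest to $f$. Their edge $e_{r_1 r_2}$ and the corresponding age $a_{r_1 r_2}$ are set to 1 to create or maintain a connection between nodes $r_1$ and $r_2$. For other edges, if $a_{r_1 j}$ exceeds the lifetime $T$, the connection is removed by setting $e_{r_1 j} = 0$. After training on $F^{(1)}$, for $v_j = (m_j, \Lambda_j, z_j, c_j)$, we pick the sample from $D^{(1)}$ whose feature vector $f$ is the nearest to $m_j$ as the pseudo image $z_j$ and label $c_j$. The variance $\Lambda_j$ is estimated using the feature vectors whose winner is $j$.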
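One competitive Hebbian learning step, as described above, can be sketched as follows. The sketch moves every ranked node towards the input with a rank-based decay, then refreshes the edge between the two nearest nodes and ages the winner's other edges. The function name `ng_update`, the list-of-lists storage for edges and ages, and the default values of `eta` and `alpha` are assumptions for illustration only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ng_update(centroids, edges, ages, f, eta=0.5, alpha=1.0, lifetime=200):
    """One in-place neural gas step for input feature `f`.

    centroids: list of vectors; edges and ages: symmetric N x N nested lists
    holding 0/1 connection flags and integer ages. Returns the node ranking.
    """
    n = len(centroids)
    ranked = sorted(range(n), key=lambda j: euclidean(f, centroids[j]))
    # Rank-based centroid update: the i-th closest node moves by eta * exp(-i / alpha).
    for i, j in enumerate(ranked, start=1):
        step = eta * math.exp(-i / alpha)
        centroids[j] = [m + step * (x - m) for m, x in zip(centroids[j], f)]
    r1, r2 = ranked[0], ranked[1]
    for j in range(n):
        if j == r1:
            continue
        if j == r2:
            # Create or refresh the connection between the two nearest nodes.
            ages[r1][j] = ages[j][r1] = 1
            edges[r1][j] = edges[j][r1] = 1
        else:
            # Age the winner's other edges and drop those past their lifetime.
            ages[r1][j] = ages[j][r1] = ages[r1][j] + 1
            if ages[r1][j] > lifetime:
                edges[r1][j] = edges[j][r1] = 0
    return ranked
```

Calling `ng_update` repeatedly over the training features gradually aligns the centroids with the data and keeps only edges between nodes that are regularly the two closest to some input.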
At the incremental session ($t > 1$), for $K$-shot new class training samples, we grow $G^{(t)}$ by inserting $k < K$ (e.g. $k = 1$ for $K = 5$) new nodes for each new class, and update their centroids and edges using Eq. (3) and (4). To avoid forgetting old classes, we stabilize the subgraph of NG learned at the previous session, which preserves old knowledge. On the other hand, to prevent overfitting to the few-shot training set $D^{(t)}$, we enhance the discriminative power of the learned features by adapting newly inserted NG nodes and edges. The neural gas stabilization and adaptation are described in the following sections.
Figure 3. Explanation of NG stabilization and adaptation. (a) NG divides CNN’s feature space F into a set of topologically arranged subregions, each represented by a centroid vector. (b) When finetuning CNN with few training examples, F’s topology is severely distorted, indicating catastrophic forgetting. (c) To maintain the topology, the shift of NG nodes is penalized by the anchor-loss term. (d) NG grows for new class y by inserting a new vertex. A new class training sample is mismatched to a different vertex. (e) The min-max loss term adapts the new vertex by pushing the sample towards it and pulling it away from its neighbors. (f) The topology is updated after the adaptation in (e), where the new vertex has been moved and a connection is removed due to expired age.
3.3. Less-Forgetting Neural Gas Stabilization
Given NG $G^{(t)}$, we extract the subgraph $G_o^{(t)} = \langle V_o^{(t)}, E_o^{(t)} \rangle \subseteq G^{(t)}$ whose vertices $v = (m, \Lambda, z, c)$ were learned on old class training data at session $(t - 1)$, where $c \in \cup_{i=1}^{t-1} L^{(i)}$. During finetuning, we stabilize $G_o^{(t)}$ to avoid forgetting the old knowledge. This is implemented by penalizing the shift of $v$ in the feature space $F^{(t)}$ via constraining the observed value of the centroid $\hat{m}$ to stay close to the original one $m$. It is noteworthy that some dimensions of $m$ have high diversity with large variance. These dimensions may encode common semantic attributes shared by both the old and new classes. Strictly constraining them may prevent positive transfer of the knowledge and bring an unsatisfactory trade-off. Therefore, we measure each dimension’s importance for old class knowledge using the inverted diagonal $\Lambda^{-1}$, and relax the stabilization of high-variance dimensions. We define the anchor loss (AL) term for less-forgetting stabilization:
$$\ell_{AL}(G^{(t)}; \theta^{(t)}) = \sum_{(m, \Lambda, z, c) \in V_o^{(t)}} (\hat{m} - m)^\top \Lambda^{-1} (\hat{m} - m), \tag{5}$$
where $\hat{m} = f(z; \theta^{(t)})$. The effect of the AL term is illustrated in Figure 3 (a-c). It avoids severe distortion of the feature space topology.
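The anchor loss described above is a variance-weighted squared distance between each stored centroid and the feature currently extracted from its pseudo image. A minimal sketch follows, assuming each old-class node is given as a `(centroid, variances, pseudo_image)` tuple with the diagonal of the variance matrix stored as a plain list; `anchor_loss` and `extract` are illustrative names, and `extract` stands in for the CNN feature extractor.

```python
def anchor_loss(nodes, extract):
    """Sum of variance-weighted squared shifts of the old-class centroids.

    nodes: iterable of (centroid, variances, pseudo_image) tuples.
    extract: callable mapping a pseudo image to its current feature vector.
    Dividing by the variance relaxes the penalty on high-variance dimensions.
    """
    total = 0.0
    for centroid, variances, image in nodes:
        observed = extract(image)
        total += sum((o - m) ** 2 / v
                     for o, m, v in zip(observed, centroid, variances))
    return total

# Toy example with an identity "feature extractor".
identity = lambda image: image
unchanged = anchor_loss([([0.0, 0.0], [1.0, 4.0], [0.0, 0.0])], identity)
shifted = anchor_loss([([0.0, 0.0], [1.0, 4.0], [1.0, 2.0])], identity)
```

In the toy example a shift of 2 along the second dimension costs no more than a shift of 1 along the first, because the second dimension has four times the variance.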
3.4. Less-Overfitting Neural Gas Adaptation
Given the new class training set $D^{(t)}$ and NG $G^{(t)}$, for a training sample $(x, y) \in D^{(t)}$, we extract its feature vector $f = f(x; \theta^{(t)})$ and feed $f$ to the NG. We hope $f$ matches the node $v_j$ whose label $c_j = y$, and $d(f, m_j) \ll d(f, m_i)$, $i \neq j$, so that $x$ is more likely to be correctly classified. However, simply finetuning on the small training set $D^{(t)}$ could cause severe overfitting, where the test sample with groundtruth label $y$ is very likely to activate the neighbor with a different label. To address this problem, a min-max loss (MML) term is introduced to constrain $f$ and the centroid vector $m_j$ of $v_j$. The “min” term minimizes $d(f, m_j)$. The “max” term maximizes $d(m_i, m_j)$ to be larger than a margin, where $m_i$ is the centroid vector of a neighbor of $v_j$ with a different label $c_i \neq y$. MML is defined as:
$$\ell_{MML}(D^{(t)}, G^{(t)}; \theta^{(t)}) = \sum_{(x, y) \in D^{(t)},\, c_j = y} d(f(x; \theta^{(t)}), m_j) - \sum_{c_i \neq y,\, e_{ij} = 1} \min(0, d(m_i, m_j) - \xi). \tag{6}$$
The hyper-parameter $\xi$ is used to determine the minimum distance. If $d(m_i, m_j) > \xi$, we regard the distance as large enough for good separation, and disable the term. Heuristically, we set $\xi \approx \max\{d(m_i, m_j) \mid \forall i, j\}$. After finetuning, we update the edge $e_{ij}$ according to Eq. (4), as illustrated in Figure 3 (e) and (f).
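For a single training sample, the min-max idea can be sketched as a pull term towards the target node plus a hinge-style push term over connected neighbors with a different label. The sketch below assumes centroids as plain lists, one label per node, and a nested 0/1 adjacency list; the names `min_max_loss` and `xi` are illustrative, and the function is not the authors' implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_loss(feature, j, centroids, labels, edges, xi):
    """MML contribution of one sample whose target node is `j`.

    "min" part: distance from the feature to its target centroid.
    "max" part: penalty for connected neighbors with a different label
    that lie closer to the target centroid than the margin `xi`.
    """
    pull = euclidean(feature, centroids[j])
    push = 0.0
    for i, connected in enumerate(edges[j]):
        if connected and labels[i] != labels[j]:
            push += min(0.0, euclidean(centroids[i], centroids[j]) - xi)
    return pull - push

# Toy example: node 0 (label 7) is connected to node 1 (label 8) and node 2 (label 7).
centroids = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
labels = [7, 8, 7]
edges = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
inside_margin = min_max_loss([0.0, 1.0], 0, centroids, labels, edges, xi=6.0)
outside_margin = min_max_loss([0.0, 1.0], 0, centroids, labels, edges, xi=4.0)
```

With `xi=6.0` the differently labeled neighbor at distance 5 is penalized; with `xi=4.0` it is already far enough away and only the pull term remains.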
3.5. Optimization
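At the incremental session $t > 1$, we finetune CNN $\Theta^{(t)}$ on $D^{(t)}$ with mini-batch SGD. Meanwhile, we update the NG net $G^{(t)}$ at each SGD iteration, using the competitive learning rules in Eq. (3) and (4). The gradients in Eq. (5) and (6) are computed and back-propagated to CNN’s feature extractor $f(\cdot; \theta^{(t)})$. The overall loss function at session $t$ is defined as:
$$\ell(D^{(t)}, G^{(t)}; \Theta^{(t)}) = \sum_{(x, y) \in D^{(t)}} -\log p_y(x; \Theta^{(t)}) + \lambda_1 \ell_{AL}(G^{(t)}; \theta^{(t)}) + \lambda_2 \ell_{MML}(D^{(t)}, G^{(t)}; \theta^{(t)}), \tag{7}$$
where the first term on the right-hand side is the softmax cross-entropy loss, $\ell_{AL}$ is the AL term defined in Eq. (5), $\ell_{MML}$ is the MML term defined in Eq. (6), and $\lambda_1$ and $\lambda_2$ are the hyper-parameters to balance the strength.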
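The way the three terms are combined can be illustrated with a small sketch. It takes precomputed values of the AL and MML terms and adds them, with balancing weights, to a softmax cross-entropy computed from raw logits. The function names and the argument layout are assumptions, and the weights used in the example are arbitrary placeholders rather than the paper's settings.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy of one sample, computed via a stable log-sum-exp."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[label]

def total_loss(batch_logits, batch_labels, al_value, mml_value,
               lambda_al, lambda_mml):
    """Cross-entropy over the batch plus weighted AL and MML terms."""
    ce = sum(cross_entropy(o, y) for o, y in zip(batch_logits, batch_labels))
    return ce + lambda_al * al_value + lambda_mml * mml_value

# Toy example with placeholder weights.
example = total_loss([[0.0, 0.0]], [0], al_value=2.0, mml_value=4.0,
                     lambda_al=0.5, lambda_mml=0.25)
```

In an actual training loop the three terms would be differentiable quantities so that their gradients reach the feature extractor; this sketch only shows the arithmetic of the weighted sum.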
We conduct comprehensive experiments on three popular image classification datasets: CIFAR100 [16], miniImageNet [43] and CUB200 [44]. The CIFAR100 dataset contains 60,000 RGB images of 100 classes, where each class has 500 training images and 100 test images. Each image has the size 32 × 32. This dataset is very popular in CIL works [32, 2]. The miniImageNet dataset is the 100-class subset of the ImageNet-1k [5] dataset used by few-shot learning [43, 6]. Each class contains 500 training images and 100 test images. The images are in RGB format of the size 84 × 84. The CUB200 dataset is originally designed for fine-grained image classification and introduced by [3, 30] for incremental learning. It contains about 6,000 training images and 6,000 test images over 200 bird categories. The images are resized to 256 × 256 and then cropped to 224 × 224 for training.
For the CIFAR100 and miniImageNet datasets, we choose 60 and 40 classes as the base and new classes, respectively, and adopt the 5-way 5-shot setting, which gives 9 training sessions (i.e., 1 base + 8 new) in total. For CUB200, in contrast, we adopt the 10-way 5-shot setting, by choosing 100 classes as the base classes and splitting the remaining 100 classes into 10 new class sessions. For all datasets, each session’s training set is constructed by randomly picking 5 training samples per class from the original dataset, while the test set remains the original one, which is large enough to evaluate the generalization performance and reveal overfitting.
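The session layout described above can be reproduced with a short sketch. It assumes that classes are identified by contiguous integer ids with the base classes first; this ordering and the function name `make_sessions` are assumptions made for illustration.

```python
def make_sessions(num_classes, num_base, way):
    """Split class ids 0..num_classes-1 into a base session plus `way`-class sessions."""
    if (num_classes - num_base) % way != 0:
        raise ValueError("new classes must divide evenly into sessions")
    sessions = [list(range(num_base))]
    for start in range(num_base, num_classes, way):
        sessions.append(list(range(start, start + way)))
    return sessions

# 60 base + 8 x 5 new classes (CIFAR100 / miniImageNet), 100 base + 10 x 10 (CUB200).
cifar_sessions = make_sessions(100, 60, 5)
cub_sessions = make_sessions(200, 100, 10)
```

Under these assumptions the CIFAR100 and miniImageNet layouts have 9 sessions and the CUB200 layout has 11 sessions, counting the base session in each case.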
We use a shallower QuickNet [14] and the deeper ResNet18 [11] models as the baseline CNNs. The QuickNet is a simple yet powerful CNN for classifying small images, which has three conv layers and two fc layers, as shown in Table 1. We evaluate it on both CIFAR100 and miniImageNet, and we evaluate ResNet18 on all three datasets. We train the base model with a mini-batch size of 128 and the initial learning rate of 0.1. We decrease the learning rate to 0.01 and 0.001 after 30 and 40 epochs, respectively, and stop training at epoch 50. Then, we finetune the model $\Theta^{(t)}$ on each subsequent training set $D^{(t)}$, $t > 1$, for 100 epochs, with a learning rate of 0.1 (and 0.01 for CUB200). As the training set $D^{(t)}$ contains very few training samples, we use all of them to construct the mini-batch for incremental learning. After training on $D^{(t)}$, we test $\Theta^{(t)}$ on the union of the test sets of all encountered classes. For data augmentation, we perform standard random cropping and flipping as in [11, 13] for all methods. When finetuning ResNet18, as we only have very few new class training samples, it would be problematic to compute batchnorm statistics. Thus, we use the batchnorm statistics computed on $D^{(1)}$ and fix the batchnorm layers during finetuning. We run the whole learning process 10 times with different random seeds and report the average test accuracy over all encountered classes.
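The step schedule used for base training can be written as a small function. The sketch assumes 0-indexed epochs, so that "after 30 epochs" means from epoch index 30 onwards; the function name is illustrative.

```python
def base_learning_rate(epoch):
    """Learning rate for a 0-indexed epoch of base training (50 epochs in total)."""
    if not 0 <= epoch < 50:
        raise ValueError("base training runs for epochs 0..49")
    if epoch < 30:
        return 0.1
    if epoch < 40:
        return 0.01
    return 0.001

schedule = [base_learning_rate(e) for e in range(50)]
```

The resulting `schedule` holds 30 epochs at 0.1, 10 epochs at 0.01 and 10 epochs at 0.001.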
Table 1. The structure of the QuickNet model in the experiments, which is originally defined in the Caffe package [14].
We learn an NG net of 400 nodes for base classes, and incrementally grow it by inserting 1 node for each new class. For the hyper-parameters, we choose the learning rate and decay in Eq. (3) for faster learning of NG, set the lifetime T = 200 in Eq. (4), and fix the balancing weights in Eq. (7).
For comparative experiments, we run the representative CIL methods in our FSCIL setting, including the classical iCaRL [32] and the state-of-the-art methods EEIL [2] and NCM [13], and compare our method with them. For BiC [48], we found that training the bias-correction model requires a large set of validation samples, which is impracticable for FSCIL. Therefore, we do not evaluate this work. We use the same setting of Eq. (1) for these distillation-based methods and for the distillation term used in our ablation study in Section 4.2. Other related works [20, 15, 49, 18, 24] are designed for the multi-task setting, so we do not involve them in our experiments. We use the abbreviations “Ours-AL” and “Ours-AL-MML” to indicate the applied loss terms during incremental learning.
4.1. Comparative results
We report the comparative results of the methods using the 5/10-way 5-shot FSCIL setting. As the 5-shot training samples are randomly picked, we run all methods 10 times and report the average accuracies. Figure 4 compares the test accuracies on the CIFAR100 and miniImageNet datasets, and Table 2 reports the test accuracies on the CUB200 dataset.
Figure 4. Comparison of the test accuracies of QuickNet and ResNet18 on the CIFAR100 and miniImageNet datasets. At each session, the models are evaluated on a joint set of test samples of the classes encountered so far.
Table 2. Comparison results on CUB200 with ResNet18 using the 10-way 5-shot FSCIL setting. Note that the comparative methods with their original learning rate settings have much worse test accuracies on CUB200. We carefully tune their learning rates and boost their original accuracies by 2% to 8.7%. In the table, we report their accuracies after the improvement.
We summarize the results as follows:
• On all three datasets, and for both QuickNet and ResNet18 models, our TOPIC outperforms other state-of-the-art methods at each encountered session, and is the closest to the upper bound “Joint-CNN” method. As the incremental learning proceeds, the superiority of TOPIC becomes more significant, demonstrating its power for continuously learning longer sequences of new class datasets.
• Simply finetuning with few training samples of new classes (i.e., “Ft-CNN”, the blue line) deteriorates the test accuracies drastically due to catastrophic forgetting. Finetuning with the AL term (i.e., the green line) effectively alleviates forgetting, outperforming the naïve finetuning approach by up to 38.90%. Moreover, using both AL and MML terms further achieves up to 5.85% accuracy gain over using AL alone. It shows that solving the challenging FSCIL problem requires both alleviating the forgetting of the old classes and enhancing the representation learning of the new classes.
• On CIFAR100, TOPIC achieves the final accuracies of 24.17% and 29.37% with QuickNet and ResNet18, respectively, while the second best ones (i.e., NCM* and EEIL*) achieve the accuracies of 19.50% and 15.85%, respectively. TOPIC outperforms the two state-of-the-art methods by up to 13.52%.
• On miniImageNet, TOPIC achieves the final accuracies of 18.36% and 24.42% with QuickNet and ResNet18, respectively, while the corresponding accuracies achieved by the second best EEIL* are 13.59% and 19.58%, respectively. TOPIC outperforms EEIL* by up to 4.84%.
• On CUB200, at the end of the entire learning process, TOPIC achieves the accuracy of 26.28% with ResNet18, outperforming the second best EEIL*.
4.2. Ablation study
The contribution of the loss terms. We conduct ablation studies to investigate the contribution of the loss terms to the final performance gain. The experiments are performed on miniImageNet with ResNet18. For AL, we compare the original form in Eq. (5) and a simplified form without the “re-weighting” matrix $\Lambda^{-1}$. For MML, as it consists of the “min” and “max” terms, we evaluate the performance gain brought by each term separately. Besides, we also investigate the impact brought by the distillation loss term, which is denoted as “DL”. Table 3 reports the comparison results of different loss term settings. We summarize the results as follows:
• The “AL” term achieves better accuracy (up to 1.49%) than the simplified form “AL w/o. $\Lambda$”, thanks to the feature re-weighting technique.
• Both “AL-Min” and “AL-Max” improve the performance of AL, and the combined form “AL-MML” achieves the best accuracy, exceeding “AL” by up to 5.85%.
• Both “DL-MML” and “AL-MML” improve the performance of the corresponding settings without MML (i.e., “DL” and “AL”). It demonstrates the effectiveness of the MML term for improving the representation learning for few-shot new classes.
• Applying the distillation loss degrades the performance. Though distillation is popularly used by CIL methods, it may not be so effective for FSCIL, as it is difficult to balance the old and new classes and trade off the performance when there are only a few new class training samples, as discussed in Section 3.1.
Table 3. Comparison results of combining different loss terms on miniImageNet with ResNet18.
Table 4. Comparison of the final test accuracies achieved by “exemplars” and NG nodes with different memory sizes. Experiments are performed on CIFAR100 with ResNet18.
Figure 5. Comparison results under the 5-way 10-shot and 5-way full-shot settings, evaluated with ResNet18 on miniImageNet.
Comparison between “exemplars” and NG nodes. In our method, we represent the knowledge learned in CNN’s feature space using the NG net G. An alternative approach is to randomly select a set of exemplars representative of the old class training samples [32, 2] and penalize the changing of their feature vectors during training. Table 4 compares the final test accuracies achieved by the two approaches under different memory sizes. From Table 4, we can observe that using NG with only a small number of nodes consistently and greatly outperforms the exemplar approach. When smaller memory is used, the difference in accuracy becomes larger, demonstrating the superiority of our method for FSCIL.
The effect of the number of training samples. To investigate the effect brought by different numbers of training shots, we further evaluate the methods under the 5-way 10-shot and 5-way full-shot settings. For 5-way full-shot, we use all training samples of the new class data, which is analogous to the ordinary CIL setting. We grow NG by adding 20 nodes for each new session, giving $400 + 20(t - 1)$ NG nodes at session $t$. Figure 5 shows the comparative results of different methods under the 10-shot and full-shot settings. We can see that our method also outperforms other state-of-the-art methods when training with more samples. It demonstrates the effectiveness of the proposed framework for the general CIL problem.
Figure 6 compares the confusion matrices of the classification results at the last session, produced by Ft-CNN, EEIL* [2], NCM* [13] and our TOPIC. The naïve finetuning approach tends to misclassify all past classes (i.e., 0-94) as the newly learned classes (i.e., 95-99), indicating catastrophic forgetting. EEIL* and NCM* can alleviate forgetting to some extent, while still tending to misclassify old class test samples as new classes due to overfitting. Our method, named “TOPIC”, produces a much better confusion matrix, where the activations are mainly distributed along the diagonal, indicating higher recognition performance over all encountered classes. It demonstrates the effectiveness of solving FSCIL by avoiding both “forgetting old” and “overfitting new”.
Figure 6. Comparison of the confusion matrices produced by (a) Ft-CNN, (b) EEIL*, (c) NCM*, and (d) our TOPIC on miniImageNet with ResNet18.
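A confusion matrix of the kind discussed here can be computed with a few lines of code. In this minimal sketch rows index the true class and columns the predicted class; the function names and the toy labels are illustrative.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count matrix with rows indexed by true class and columns by predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal of a confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy example with three classes; one class-0 sample is predicted as class 2.
cm = confusion_matrix([0, 0, 1, 2], [0, 2, 1, 2], 3)
acc = accuracy(cm)
```

Mass concentrated in the columns of the most recent classes corresponds to the forgetting pattern described above, whereas mass on the diagonal corresponds to correct recognition.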
We focus on an unsolved, challenging, yet practical incremental-learning scenario, namely the few-shot class-incremental learning (FSCIL) setting, where models are required to learn new classes from few training samples. We propose a framework, named TOPIC, to preserve the knowledge contained in CNN’s feature space. TOPIC uses a neural gas (NG) network to maintain the topological structure of the feature manifold formed by different classes. We design mechanisms for TOPIC to mitigate the forgetting of the old classes and improve the representation learning for few-shot new classes. Extensive experiments show that our method substantially outperforms other state-of-the-art CIL methods on CIFAR100, miniImageNet, and CUB200 datasets, with a negligibly small memory overhead.
[1] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.
[2] Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–248, 2018.
[3] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420, 2018.
[4] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[5] Jia Deng, Wei Dong, R. Socher, Li Jia Li, Kai Li, and Fei Fei Li. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255, 2009.
[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.
[7] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
[8] Bernd Fritzke. A growing neural gas network learns topologies. Advances in neural information processing systems, 7, 1995.
[9] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
[10] C He, R Wang, S Shan, and X Chen. Exemplar-supported generative reproduction for class incremental learning. In Proceedings of the British Machine Vision Conference, 2018.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. Computer Science, 14(7):38–39, 2015.
[13] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 831–839, 2019.
[14] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678, 2014.
[15] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
[16] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[18] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pages 4652–4662, 2017.
[19] Diangang Li, Xing Wei, Xiaopeng Hong, and Yihong Gong. Infrared-visible cross-modal person re-identification with an x modality. In Proceedings of the AAAI Conference on Artificial Intelligence, February 2020.
[20] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.
[21] Chen Lin. The topological approach to perceptual organization. Visual Cognition, 12(4):553–637, 2005.
[22] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 212–220, 2017.
[23] Xialei Liu, Marc Masana, Luis Herranz, Joost van de Weijer, Antonio M. Lopez, and Andrew D. Bagdanov. Rotate your networks: Better weight consolidation and less catastrophic forgetting. arXiv preprint arXiv:1802.02950, 2018.
[24] David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
[25] Zhiheng Ma, Xing Wei, Xiaopeng Hong, and Yihong Gong. Bayesian loss for crowd count estimation with point supervision. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[26] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–82, 2018.
[27] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
[28] T. M. Martinetz. Competitive hebbian learning rule forms perfectly topology preserving maps. In International Conference on Artificial Neural Networks, pages 427–434, 1993.
[29] Wei Ning, Zhou Tiangang, Zhang Zihao, Zhuo Yan, and Chen Li. Visual working memory representation as a topological defined perceptual object. Journal of Vision, 19(7):1–12, 2019.
[30] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
[31] Y. Prudent and A. Ennaji. An incremental growing neural gas learns topologies. In Neural Networks, 2005. IJCNN ’05. Proceedings. 2005 IEEE International Joint Conference on, 2005.
[32] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
[33] Mengye Ren, Renjie Liao, Ethan Fetaya, and Richard Zemel. Incremental few-shot learning with attention attractor networks. In Advances in Neural Information Processing Systems, pages 5276–5286, 2019.
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[35] Hou Saihui, Pan Xinyu, Loy Chen Change, Wang Zilei, and Lin Dahua. Lifelong learning via progressive distillation and retrospection. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[36] Joan Serrà, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423, 2018.
[37] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
[38] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
[39] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 403–412, 2019.
[40] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
[41] Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, and Yihong Gong. Bi-objective continual learning: Learning ‘new’ while consolidating ‘known’. In Proceedings of the AAAI Conference on Artificial Intelligence, February 2020.
[42] Martinetz Thomas and Schulten Klaus. A “neural-gas” network learns topologies. Artificial Neural Networks, 1991.
[43] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.
[44] C Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[45] Xing Wei, Yue Zhang, Yihong Gong, Jiawei Zhang, and Nanning Zheng. Grassmann pooling as compact homogeneous bilinear pooling for fine-grained visual classification. In The European Conference on Computer Vision (ECCV), September 2018.
[46] Chenshen Wu, Luis Herranz, Xialei Liu, Joost van de Weijer, Bogdan Raducanu, et al. Memory replay gans: Learning to generate new categories without forgetting. In Advances In Neural Information Processing Systems, pages 5962–5972, 2018.
[47] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
[48] Wu Yue, Chen Yinpeng, Wang Lijuan, Ye Yuancheng, Liu Zicheng, Guo Yandong, and Fu Yun. Large scale incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[49] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3987–3995. JMLR.org, 2017.
[50] Mengyao Zhai, Lei Chen, Frederick Tung, Jiawei He, Megha Nawhal, and Greg Mori. Lifelong gan: Continual learning for conditional image generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2759–2768, 2019.