Few-Shot Class-Incremental Learning
2020·arXiv
Abstract

The ability to incrementally learn new classes is crucial to the development of real-world artificial intelligence systems. In this paper, we focus on a challenging but practical few-shot class-incremental learning (FSCIL) problem. FSCIL requires CNN models to incrementally learn new classes from very few labelled samples, without forgetting the previously learned ones. To address this problem, we represent the knowledge using a neural gas (NG) network, which can learn and preserve the topology of the feature manifold formed by different classes. On this basis, we propose the TOpology-Preserving knowledge InCrementer (TOPIC) framework. TOPIC mitigates the forgetting of the old classes by stabilizing NG’s topology and improves the representation learning for few-shot new classes by growing and adapting NG to new training samples. Comprehensive experimental results demonstrate that our proposed method significantly outperforms other state-of-the-art class-incremental learning methods on CIFAR100, miniImageNet, and CUB200 datasets.

Convolutional Neural Networks (CNNs) have been successfully applied to a broad range of computer vision tasks [17, 11, 34, 4, 22, 25, 45, 19]. For practical use, we train CNN models on large-scale image datasets [5] and then deploy them on smart agents. As the smart agents are often exposed to new and dynamic environments, there is an urgent need to continuously adapt the models to recognize newly emerging classes. For example, the smart album function on smartphones is designed to automatically classify user photos into both the pre-defined and user-defined classes. The model underpinning the smart album is pre-trained on the training set of the pre-defined classes, and is required to adapt to the new user-defined classes by learning from new photos. From the users’ perspective, they are only willing to annotate very few image examples for the new class, as the labeling process consumes manpower. Therefore, it is crucial for CNNs to be capable of incrementally learning new classes from very few training examples. We term this ability few-shot class-incremental learning (FSCIL).

Figure 1. Comparisons of two ways to characterize a heterogeneous manifold. (a) Randomly sampled representatives, which are adopted by conventional CIL studies for knowledge distillation. (b) The representatives learned by neural gas, which well preserve the topology of the manifold.

A naïve approach for FSCIL is to finetune the base model on the new class training set. However, simple finetuning with a limited number of training samples would cause two severe problems: one is “forgetting old”, where the model’s performance deteriorates drastically on old classes due to catastrophic forgetting [7]; the other is “overfitting new”, where the model is prone to overfit to new classes and thus loses generalization ability on a large set of test samples.

Recently, there have been many research efforts attempting to solve the catastrophic forgetting problem [15, 49, 20, 24, 18, 32, 2, 13, 41, 37, 1]. They usually conduct incremental learning under the multi-task or the multi-class scenarios. The former incrementally learns a sequence of disjoint tasks, which requires the task identity in advance. This is seldom satisfied in real applications where the task identity is typically unavailable. The latter learns a unified classifier to recognize all the encountered classes within a single task. This scenario is more practical without the need of knowing task information. In this paper, we study the FSCIL problem under the multi-class scenario, where we treat FSCIL as a particular case of the class-incremental learning (CIL) [32, 2, 10, 13, 48]. Compared with CIL that learns new classes with unlimited, usually large-scale training samples, FSCIL is more challenging, since the number of new training samples is very limited.

To mitigate forgetting, most CIL works [32, 2, 35, 13, 48] use the knowledge distillation [12] technique that maintains the network’s output logits corresponding to old classes. They usually store a set of old class exemplars and apply the distillation loss to the network’s output. Despite their effectiveness, there are several problems when training with the distillation loss. One is the class-imbalance problem [13, 48], where the output logits are biased towards those classes with a significantly larger number of training samples. The other is the performance trade-off between old and new classes. This problem is more prominent for FSCIL, because learning from very few training samples requires a larger learning rate and stronger gradients from new classes’ classification loss, making it difficult to maintain the output for old classes at the same time.

In this paper, we address FSCIL from a new, cognitive-inspired perspective of knowledge representation. Recent discoveries in cognitive science reveal the importance of topology preservation for maintaining the memory of the old knowledge [29, 21]. The change of the memory’s topology will cause severe degradation of human recognition performance on historical visual stimuli [29], indicating catastrophic forgetting. Inspired by this, we propose a new FSCIL framework, named TOpology-Preserving knowledge InCrementer (TOPIC), as shown in Figure 1. TOPIC uses a neural gas (NG) network [42, 8, 31] to model the topology of feature space. When learning the new classes, NG grows to adapt to the change of feature space. On this basis, we formulate FSCIL as an optimization problem with two objectives. On the one hand, to avoid catastrophic forgetting, TOPIC preserves the old knowledge by stabilizing the topology of NG, which is implemented with an anchor loss (AL) term. On the other hand, to prevent overfitting to few-shot new classes, TOPIC adapts the feature space by pushing the new class training sample towards a correct new NG node with the same label and pulling the new nodes of different labels away from each other. The min-max loss (MML) term is developed to achieve this purpose.

For extensive assessment, we build the FSCIL baselines by adapting the state-of-the-art CIL methods [32, 2, 13] to this new problem and compare our method with them. We conduct comprehensive experiments on the popular CIFAR100 [16], miniImageNet [43], and CUB200 [44] datasets. Experimental results demonstrate the effectiveness of the proposed FSCIL framework. To summarize, our main contributions include:

We recognize the importance of few-shot class-incremental learning (FSCIL) and define a problem setting to better organize the FSCIL research study. Compared with the popularly studied class-incremental learning (CIL), FSCIL is more challenging but more practical.

We propose an FSCIL framework TOPIC that uses a neural gas (NG) network to learn feature space topologies for knowledge representation. TOPIC stabilizes the topology of NG to mitigate forgetting and adapts NG to enhance the discriminative power of the learned features for few-shot new classes.

We provide an extensive assessment of the FSCIL methods, for which we adapt the state-of-the-art CIL methods to FSCIL and make comprehensive comparisons with them.

2.1. Class-Incremental Learning

Class-incremental learning (CIL) incrementally learns a unified classifier to recognize all the classes encountered so far. To mitigate the forgetting of the old classes, CIL studies typically adopt the knowledge distillation technique, where external memory is often used for storing old class exemplars to compute the distillation loss. For example, iCaRL [32] maintains an “episodic memory” of the exemplars and incrementally learns the nearest-neighbor classifier for the new classes. EEIL [2] adds the distillation loss term to the cross-entropy loss for end-to-end training. The latest CIL works NCM [13] and BiC [48] reveal the class-imbalance problem that causes the network’s prediction to be biased towards new classes. They adopt a cosine distance metric to eliminate the bias in the output layer [13], or learn a bias-correction model to post-process the output logits [48].

In contrast to these CIL works, we focus on the more difficult FSCIL problem, where the number of new class training samples is limited. Rather than constraining the network’s output, we try to constrain CNN’s feature space represented by a neural gas network.

2.2. Multi-task Incremental Learning

A series of research works adopts the multi-task incremental learning scenario. These works can be categorized into three types: (1) rehearsal approaches [24, 3, 37, 50, 46], (2) architectural approaches [27, 26, 1, 36, 47], and (3) regularization approaches [15, 49, 23, 18]. Rehearsal approaches replay the old tasks’ information to the task solver when learning the new task. One way is to store the old tasks’ exemplars using external memory and constrain their losses while learning the new task [24, 3]. Another way is to use generative models to memorize the old tasks’ data distribution [37, 46, 50]. For example, DGR [37] learns a generative adversarial network to produce observed samples for the task solver. The recognition performance is affected by the quality of the generated samples. Architectural approaches alleviate forgetting by manipulating the network’s architecture, such as network pruning, dynamic expansion, and parameter masking. For example, PackNet [27] prunes the network to create free parameters for the new task. HAT [36] learns the attention masks for old tasks and uses them to constrain the parameters when learning the new task. Regularization approaches impose regularization on the network’s parameters, losses or output logits. For example, EWC [15] and its variants [49, 23] penalize the changing of the parameters important to old tasks. These methods are typically based on certain assumptions of the parameters’ posterior distribution (e.g. Gaussian), which may struggle in more complex scenarios.

As the multi-task incremental learning methods are aimed at learning disjoint tasks, it is infeasible to apply these methods under the single-task multi-class scenario adopted by FSCIL. As a result, we have to exclude them from comparison.

2.3. Dynamic Few-Shot Learning

Few-shot learning (FSL) aims to adapt the model to recognize unseen novel classes using very few training samples, while the model’s recognition performance on the base classes is not considered. To achieve FSL, research studies usually adopt the metric learning and meta-learning strategies [43, 38, 40, 6, 39]. Recently, some FSL research works attempt to learn a model capable of recognizing both the base and novel classes [9, 33]. Typically, they first pretrain the model on the base training set to learn feature embedding as well as the weights of the classifier for base classes. Then they perform meta-learning for few-shot novel classes, by sampling “fake” few-shot classification tasks from the base dataset to learn a classifier for novel classes. Finally, the learned heads are combined for recognizing the joint test (query) set of the base and novel classes.

Though some of these works [33] regard such a setting as a kind of incremental learning, they rely on the old training set (i.e., the base class dataset) for sampling meta-learning tasks. This is entirely different from the FSCIL setting, where the base/old class training set is unavailable at the new incremental stage. As a consequence, these few-shot learning works cannot be directly applied to FSCIL.

We define the few-shot class-incremental learning (FSCIL) setting as follows. Suppose we have a stream of labelled training sets $D^{(1)}, D^{(2)}, \cdots$, where $D^{(t)} = \{(x_j^{(t)}, y_j^{(t)})\}_{j=1}^{|D^{(t)}|}$. $L^{(t)}$ is the set of classes of the $t$-th training set, where $\forall i \neq j$, $L^{(i)} \cap L^{(j)} = \emptyset$. $D^{(1)}$ is the large-scale training set of base classes, and $D^{(t)}, t > 1$ is the few-shot training set of new classes. The model $\Theta$ is incrementally trained on $D^{(1)}, D^{(2)}, \cdots$ with a unified classification layer, while only $D^{(t)}$ is available at the $t$-th training session. After training on $D^{(t)}$, $\Theta$ is tested to recognize all encountered classes in $L^{(1)}, \cdots, L^{(t)}$. For $D^{(t)}, t > 1$, we denote the setting with $C$ classes and $K$ training samples per class as the $C$-way $K$-shot FSCIL. The main challenges are twofold: (1) avoiding catastrophic forgetting of old classes; (2) preventing overfitting to few-shot new classes.

To perform FSCIL, we treat the CNN as a composition of a feature extractor $f(\cdot; \theta)$ with the parameter set $\theta$ and a classification head. The feature extractor defines the feature space $F \subseteq \mathbb{R}^n$. The classification head with the parameter set $\phi$ produces the output vector followed by a softmax function to predict the probability $p$ over all classes. The entire set of parameters is denoted as $\Theta = \{\theta, \phi\}$. The output vector given input $x$ is $o(x; \Theta) = \phi^T f(x; \theta)$. Initially, we train $\Theta^{(1)}$ on $D^{(1)}$ with the cross-entropy loss. Then we incrementally finetune the model on $D^{(2)}, D^{(3)}, \cdots$, and get $\Theta^{(2)}, \Theta^{(3)}, \cdots$. At the $t$-th session ($t > 1$), the output layer is expanded for new classes by adding $|L^{(t)}|$ output neurons.
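As an illustration of the unified, expanding classification layer, the following minimal sketch computes the output vector as the product of the head weights with a feature vector and appends output neurons for a new session. It is plain Python and not the authors' code: the names `logits` and `expand_head`, the representation of the head as a list of per-class weight vectors, and the zero initialization of the new weights are all assumptions made for illustration.

```python
def logits(phi, feature):
    """Return o = phi^T f, where phi is a list of per-class weight vectors."""
    return [sum(w * f for w, f in zip(class_weights, feature))
            for class_weights in phi]


def expand_head(phi, num_new_classes, init=0.0):
    """Append one weight vector per new class; old weights are kept unchanged.

    The value `init` is a placeholder: the source does not state how the
    new output neurons are initialized.
    """
    dim = len(phi[0])
    return phi + [[init] * dim for _ in range(num_new_classes)]


# Example: 2 base classes in a 3-dimensional feature space, then a 2-way session.
phi = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
feature = [0.5, 2.0, -1.0]
print(logits(phi, feature))        # one output per base class
phi = expand_head(phi, num_new_classes=2)
print(len(logits(phi, feature)))   # one output per class encountered so far
```

Because the old weight vectors are left in place, the outputs for previously seen classes are unchanged by the expansion itself; any drift comes from the subsequent finetuning.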

For FSCIL, we first introduce a baseline solution to alleviate forgetting based on knowledge distillation; then we elaborate our proposed TOPIC framework that employs a neural gas network for knowledge representation and the anchor loss and min-max loss terms for optimization.

3.1. Baseline: Knowledge Distillation Approach

Most CIL works [32, 2, 13, 48] adopt the knowledge distillation technique for mitigating forgetting. Omitting the superscript (t), the loss function is defined as:

$$\ell(D, P; \Theta) = \ell_{CE}(D, P; \Theta) + \gamma\, \ell_{DL}(D, P; \Theta), \tag{1}$$

where $\ell_{DL}$ and $\ell_{CE}$ are the distillation and cross-entropy loss terms, and $P$ is the set of old class exemplars drawn from $D^{(1)}, \cdots, D^{(t-1)}$. The implementation of $\ell_{DL}$ may vary in different works. Generally, it takes the form:

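$$\ell_{DL}(D, P; \Theta) = \sum_{(x, y) \in D \cup P} \sum_{k=1}^{n} -\tau_k(x; \hat{\Theta}) \log\big(\tau_k(x; \Theta)\big), \qquad \tau_k(x; \Theta) = \frac{e^{o_k(x; \Theta)/T}}{\sum_{j=1}^{n} e^{o_j(x; \Theta)/T}}, \tag{2}$$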

where $n = \sum_{i=1}^{t-1} |L^{(i)}|$ is the number of the old classes, $\hat{\Theta}$ is the initial value of $\Theta$ before finetuning, and $T$ is the distillation temperature (e.g., $T = 2$ in [2, 13]).
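To make the temperature-softened distillation term concrete, here is a minimal per-sample sketch in plain Python. It assumes the generic form described above: the outputs of the old model and of the current model are restricted to the $n$ old classes, divided by $T$, passed through a softmax, and compared with a cross-entropy. The function names are hypothetical, and individual CIL methods may implement the term differently.

```python
import math


def softened_softmax(outputs, temperature):
    """tau_k = exp(o_k / T) / sum_j exp(o_j / T) over the given outputs."""
    scaled = [o / temperature for o in outputs]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_outputs, new_outputs, num_old_classes, temperature=2.0):
    """Cross-entropy between the softened old-model and current-model outputs,
    restricted to the first `num_old_classes` logits, for a single sample."""
    target = softened_softmax(old_outputs[:num_old_classes], temperature)
    current = softened_softmax(new_outputs[:num_old_classes], temperature)
    return -sum(t * math.log(c) for t, c in zip(target, current))


# The old model has 3 outputs; the current model has one extra (new-class) output.
unchanged = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 0.5], 3)
drifted = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0, 0.5], 3)
print(unchanged < drifted)
```

The loss is smallest when the current model reproduces the old model's softened outputs on the old classes, which is why a large learning rate on the new-class cross-entropy tends to work against it.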

The distillation approach faces several critical issues when applied to FSCIL. One is the bias problem caused by imbalanced old/new class training data, where the output layer is biased towards new classes [13, 48]. To address this issue, [13] uses a cosine distance measure to eliminate the bias and [48] learns a bias-correction model to post-process the outputs. Despite their effectiveness in learning from large-scale training data, they are less effective for FSCIL with very few training samples. Using cosine distance may lose important patterns (e.g. appearance) contained in the magnitude of the weight/feature vector, while the bias-correction model requires a large number of training samples, which conflicts with the few-shot setting. Another issue is the dilemma of balancing the contributions of $\ell_{CE}$ and $\ell_{DL}$, which may lead to an unsatisfactory performance trade-off. Learning few-shot new classes requires a larger learning rate to minimize $\ell_{CE}$, but this can cause instability of the output logits and make it difficult to minimize $\ell_{DL}$.

Based on the above considerations, we abandon the distillation loss in our framework. Instead, we manipulate the knowledge contained in CNN’s feature space, which contains richer information than the output logits.

3.2. Knowledge Representation as Neural Gas

The knowledge distillation methods typically store a set of exemplars randomly drawn from the old training set and compute the distillation loss using these exemplars. However, there is no guarantee that the randomly-sampled exemplars can well represent heterogeneous, non-uniform data of different classes in the FSCIL scenarios. Instead, we represent the knowledge by preserving the feature space topology, which is achieved by a neural gas (NG) network [42]. NG maps the feature space $F$ to a finite set of feature vectors $V = \{v_j\}_{j=1}^{N}$ and preserves the topology of $F$ by competitive Hebbian learning [28], as shown in Figure 2.

NG defines an undirected graph $G = \langle V, E \rangle$. Each vertex $v_j \in V$ is assigned a centroid vector $m_j \in \mathbb{R}^n$ describing the location of $v_j$ in feature space. The edge set $E$ stores the neighborhood relations of the vertices. If $v_i$ and $v_j$ are topologically adjacent, $e_{ij} = 1$; otherwise, $e_{ij} = 0$. Each edge $e_{ij}$ is assigned an “age” $a_{ij}$ initialized to 0. Given an input $f \in F$, it matches the NG node $j$ with the minimum distance $d(f, m_j)$ to $f$. The matching process divides $F$ into disjoint subregions, where the centroid vector $m_j$ encodes the region $F_j = \{f \in F \mid d(f, m_j) \le d(f, m_i), \forall i\}$. We use the Euclidean distance as $d(\cdot, \cdot)$.
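The matching step amounts to a nearest-centroid lookup under the Euclidean distance. A minimal sketch in plain Python follows; the function names `euclidean` and `match_node` are hypothetical, and centroids are represented as plain lists for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance d(a, b) between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_node(feature, centroids):
    """Return the index j of the centroid m_j closest to the feature f."""
    return min(range(len(centroids)),
               key=lambda j: euclidean(feature, centroids[j]))


centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(match_node([0.9, 0.1], centroids))  # index of the winning node
```

Every feature vector is assigned to exactly one winner this way (ties are broken by the lowest index in this sketch), so the regions encoded by the centroids partition the feature space.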

Note that some variants of NG [8, 31] use different approaches to construct NG incrementally. To be consistent with FSCIL, we directly modify the original version [42] and learn a fixed set of nodes for the base classes. As NG [42] is originally learnt from unlabelled data, to accomplish supervised incremental learning, we redefine the NG node $j$ as a tuple $v_j = (m_j, \Lambda_j, z_j, c_j) \in V$, where $m_j \in \mathbb{R}^n$ is the centroid vector representing $F_j$, the diagonal matrix $\Lambda_j \in \mathbb{R}^{n \times n}$ stores the variance of each dimension of $m_j$, and $z_j$ and $c_j$ are the assigned images and labels for computing the observation $\hat{m}_j$. With $c_j$, we can determine whether $v_j$ corresponds to an old class or a new class.

Figure 2. NG preserves the topology of the heterogeneous feature space manifold. Initially, NG is learnt for base classes (the blue dots and lines). Then NG incrementally grows for new classes by inserting new nodes and edges (the orange dots and lines). During the competitive Hebbian learning, $v_j$’s centroid vector $m_j$ is adapted to the input vector $f$ which falls in $F_j$ encoded by $v_j$.

At the initial session ($t = 1$), the NG net with $N^{(1)}$ nodes $G^{(1)} = \langle V^{(1)}, E^{(1)} \rangle$ is trained on the feature set $F^{(1)} = \{f(x; \theta^{(1)}) \mid \forall x \in D^{(1)}\}$ using competitive Hebbian learning. Concretely, given an input $f \in F^{(1)}$, its distance to each NG node is computed and stored in $D_f = \{d(f, m_i) \mid i = 1, \cdots, N^{(1)}\}$. $D_f$ is then sorted in ascending order to get the rank of the nodes $R_f = \{r_i \mid d(f, m_{r_i}) \le d(f, m_{r_{i+1}}), i = 1, \cdots, N^{(1)} - 1\}$. Then, for each node $r_i$, its centroid $m_{r_i}$ is updated to $m^*_{r_i}$:

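$$m^*_{r_i} = m_{r_i} + \eta \cdot e^{-i/\alpha} (f - m_{r_i}), \quad i = 1, \cdots, N^{(1)} - 1, \tag{3}$$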

where $\eta$ is the learning rate, and $e^{-i/\alpha}$ is a decay function controlled by $\alpha$. We use the superscript $*$ to denote the updated value. Nodes distant from $f$ are less affected by the update. Next, the edges of all connections of $r_1$ are updated as:

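$$a^*_{r_1 j} = \begin{cases} 1, & j = r_2; \\ a_{r_1 j} + 1, & \text{otherwise}, \end{cases} \qquad e^*_{r_1 j} = \begin{cases} 1, & j = r_2; \\ 0, & a^*_{r_1 j} > T; \\ e_{r_1 j}, & \text{otherwise}. \end{cases} \tag{4}$$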

Apparently, $r_1$ and $r_2$ are the nearest and the second nearest nodes to $f$. Their edge $e_{r_1 r_2}$ and the corresponding age $a_{r_1 r_2}$ are set to 1 to create or maintain a connection between nodes $r_1$ and $r_2$. For other edges, if $a_{r_1 j}$ exceeds the lifetime $T$, the connection is removed by setting $e_{r_1 j} = 0$. After training on $F^{(1)}$, for $v_j = (m_j, \Lambda_j, z_j, c_j)$, we pick the sample from $D^{(1)}$ whose feature vector $f$ is the nearest to $m_j$ as the pseudo image $z_j$ and label $c_j$. The variance $\Lambda_j$ is estimated using the feature vectors whose winner is $j$.
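One competitive Hebbian learning step, as described above, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the centroid of the node ranked $i$ is assumed to move towards $f$ by a fraction $\eta e^{-i/\alpha}$ with ranks $i = 1, \cdots, N - 1$, the ages of the winner's other connections are assumed to grow by one per step, and the name `ng_step` as well as the matrix representation of edges and ages are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ng_step(feature, centroids, edges, ages, eta=0.02, alpha=1.0, lifetime=200):
    """One competitive Hebbian learning step, updating the arguments in place.

    centroids: list of centroid vectors m_j (at least two nodes)
    edges, ages: N x N symmetric matrices (lists of lists) for e_ij and a_ij
    Returns the node indices sorted from nearest to farthest.
    """
    n = len(centroids)
    ranking = sorted(range(n), key=lambda j: euclidean(feature, centroids[j]))

    # Rank-based centroid update: nodes ranked further from f move less.
    # Ranks run over i = 1, ..., N - 1, mirroring the indexing in the text.
    for i, node in enumerate(ranking[:n - 1], start=1):
        step = eta * math.exp(-i / alpha)
        centroids[node] = [m + step * (f - m)
                           for m, f in zip(centroids[node], feature)]

    # Edge and age update for all connections of the winner r1.
    r1, r2 = ranking[0], ranking[1]
    for j in range(n):
        if j == r1:
            continue
        if j == r2:
            new_age, new_edge = 1, 1      # create or refresh the r1-r2 edge
        else:
            new_age = ages[r1][j] + 1     # other connections grow older
            new_edge = 0 if new_age > lifetime else edges[r1][j]
        ages[r1][j] = ages[j][r1] = new_age
        edges[r1][j] = edges[j][r1] = new_edge
    return ranking
```

The defaults for `eta`, `alpha` and `lifetime` follow the hyper-parameter values reported in the experiments; in practice the step would be called once per feature vector while iterating over the training features.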

At the incremental session ($t > 1$), for $K$-shot new class training samples, we grow $G^{(t)}$ by inserting $k < K$ (e.g. $k = 1$ for $K = 5$) new nodes $\{\tilde{v}_N, \cdots, \tilde{v}_{N+k}\}$ for each new class, and update their centroids and edges using Eq. (3) and (4). To avoid forgetting old classes, we stabilize the subgraph of NG learned at the previous session $(t - 1)$ that preserves old knowledge. On the other hand, to prevent overfitting to $D^{(t)}$, we enhance the discriminative power of the learned features by adapting the newly inserted NG nodes and edges. The neural gas stabilization and adaptation are described in the following sections.

Figure 3. Explanation of NG stabilization and adaptation. (a) NG divides CNN’s feature space $F$ into a set of topologically arranged subregions $F_j$ represented by a centroid vector $v_j$. (b) When finetuning CNN with few training examples, $F$’s topology is severely distorted, indicating catastrophic forgetting. (c) To maintain the topology, the shift of NG nodes is penalized by the anchor-loss term. (d) NG grows for new class $y$ by inserting a new vertex $\tilde{v}_7$. A new class training sample $\tilde{f}$ is mismatched to $v_5$, due to $d(\tilde{f}, m_5) < d(\tilde{f}, m_7)$. (e) The min-max loss term adapts $F_7$ by pushing $\tilde{f}$ to $\tilde{v}_7$ and pulling $\tilde{v}_7$ away from the neighbors $v_4$, $v_5$ and $v_6$. (f) The topology is updated after the adaptation in (e), where $\tilde{v}_7$ has been moved to $v_7$, and the connection between $v_4$ and $v_7$ is removed due to expired age.

3.3. Less-Forgetting Neural Gas Stabilization

Given NG $G^{(t)}$, we extract the subgraph $G_o^{(t)} = \langle V_o^{(t)}, E_o^{(t)} \rangle \subseteq G^{(t)}$ whose vertices $v = (m, \Lambda, z, c)$ were learned on old class training data at session $(t - 1)$, where $c \in \cup_{i=1}^{t-1} L^{(i)}$. During finetuning, we stabilize $G_o^{(t)}$ to avoid forgetting the old knowledge. This is implemented by penalizing the shift of $v$ in the feature space $F^{(t)}$ via constraining the observed value of the centroid $\hat{m}$ to stay close to the original one $m$. It is noteworthy that some dimensions of $m$ have high diversity with large variance. These dimensions may encode common semantic attributes shared by both the old and new classes. Strictly constraining them may prevent positive transfer of the knowledge and bring an unsatisfactory trade-off. Therefore, we measure each dimension’s importance for old class knowledge using the inverted diagonal $\Lambda^{-1}$, and relax the stabilization of high-variance dimensions. We define the anchor loss (AL) term for less-forgetting stabilization:

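$$\ell_{AL}(G; \theta) = \sum_{(m, \Lambda, z, c) \in V_o} (\hat{m} - m)^T \Lambda^{-1} (\hat{m} - m), \tag{5}$$

where $\hat{m} = f(z; \theta)$ is the observed value of the centroid.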

The effect of AL term is illustrated in Figure 3 (a-c). It avoids severe distortion of the feature space topology.
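The variance-weighted stabilization can be illustrated with a short plain-Python sketch. The quadratic form used here, a squared shift of each centroid divided dimension-wise by its variance, is an assumption consistent with the description of the inverted diagonal $\Lambda^{-1}$; the function name `anchor_loss` and the data layout are hypothetical, and the observed centroids $\hat{m}$ are taken as given rather than computed by a CNN.

```python
def anchor_loss(old_nodes, observed_centroids):
    """Sum over old nodes of (m_hat - m)^T Lambda^{-1} (m_hat - m).

    old_nodes: list of (m, variances) pairs, where `variances` holds the
        diagonal of Lambda for that node.
    observed_centroids: list of m_hat vectors, i.e. the current features of
        each node's pseudo image z (computed elsewhere, not shown here).
    """
    total = 0.0
    for (m, variances), m_hat in zip(old_nodes, observed_centroids):
        total += sum((mh - mi) ** 2 / var
                     for mh, mi, var in zip(m_hat, m, variances))
    return total


# One old node: the second dimension has a larger variance, so the same
# kind of shift along it is penalized less.
nodes = [([0.0, 0.0], [1.0, 4.0])]
print(anchor_loss(nodes, [[1.0, 0.0]]))  # shift along the low-variance dimension
print(anchor_loss(nodes, [[0.0, 1.0]]))  # shift along the high-variance dimension
```

Dividing by the variance is what relaxes the constraint on high-variance dimensions, which may encode attributes shared between old and new classes.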

3.4. Less-Overfitting Neural Gas Adaptation

Given the new class training set $D^{(t)}$ and NG $G^{(t)}$, for a training sample $(x, y) \in D^{(t)}$, we extract its feature vector $f = f(x; \theta^{(t)})$ and feed $f$ to the NG. We hope $f$ matches the node $v_j$ whose label $c_j = y$, and $d(f, m_j) \ll d(f, m_i), i \neq j$, so that $x$ is more likely to be correctly classified. However, simply finetuning on the small training set $D^{(t)}$ could cause severe overfitting, where a test sample with ground-truth label $y$ is very likely to activate a neighbor with a different label. To address this problem, a min-max loss (MML) term is introduced to constrain $f$ and the centroid vector $m_j$ of $v_j$. The “min” term minimizes $d(f, m_j)$. The “max” term maximizes $d(m_i, m_j)$ to be larger than a margin, where $m_i$ is the centroid vector of a neighbor of $v_j$ with a different label $c_i \neq y$. MML is defined as:

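$$\ell_{MML}(D, G; \theta) = \sum_{(x, y) \in D,\; c_j = y} d(f(x; \theta), m_j) - \sum_{c_i \neq y,\; e_{ij} = 1} \min\big(0,\, d(m_i, m_j) - \xi\big). \tag{6}$$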

The hyper-parameter $\xi$ is used to determine the minimum distance. If $d(m_i, m_j) > \xi$, we regard the distance as large enough for good separation, and disable the term. Heuristically, we set $\xi \approx \max\{d(m_i, m_j) \mid \forall i, j\}$. After finetuning, we update the edge $e_{ij}$ according to Eq. (4), as illustrated in Figure 3 (e) and (f).
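A per-sample version of the min-max idea can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the “min” part is the distance from the feature to its same-label node, and the “max” part is written as a hinge that is active only while a differently labelled neighbor is closer than the margin $\xi$. The function name `min_max_loss` and the argument layout are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_loss(feature, j, label, centroids, labels, neighbors, margin):
    """Min-max loss for one sample with label `label` assigned to node j.

    neighbors: indices i of the nodes connected to j (e_ij = 1).
    margin: the hyper-parameter xi; neighbors farther than this are ignored.
    """
    # "min" term: pull the feature towards its own node.
    min_term = euclidean(feature, centroids[j])
    # "max" term: penalize differently labelled neighbors closer than xi.
    max_term = sum(max(0.0, margin - euclidean(centroids[i], centroids[j]))
                   for i in neighbors if labels[i] != label)
    return min_term + max_term


centroids = [[3.0, 4.0], [3.0, 5.0], [10.0, 4.0]]
labels = ["a", "b", "b"]
print(min_max_loss([0.0, 0.0], 0, "a", centroids, labels, [1, 2], margin=2.0))
```

In the example, only the neighbor at distance 1 contributes to the hinge; the neighbor at distance 7 already exceeds the margin and its term is disabled.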

3.5. Optimization

At the incremental session $t > 1$, we finetune CNN $\Theta^{(t)}$ on $D^{(t)}$ with mini-batch SGD. Meanwhile, we update the NG net $G^{(t)}$ at each SGD iteration, using the competitive learning rules in Eq. (3) and (4). The gradients in Eq. (5) and (6) are computed and back-propagated to CNN’s feature extractor $f(\cdot; \theta^{(t)})$. The overall loss function at session $t$ is defined as:

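$$\ell(D^{(t)}, G^{(t)}; \Theta^{(t)}) = \sum_{(x, y) \in D^{(t)}} -\log \hat{p}_y(x) + \lambda_1 \ell_{AL}(G^{(t)}; \theta^{(t)}) + \lambda_2 \ell_{MML}(D^{(t)}, G^{(t)}; \theta^{(t)}), \tag{7}$$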

where the first term on the right-hand side is the softmax cross-entropy loss, $\ell_{AL}$ is the AL term defined in Eq. (5), $\ell_{MML}$ is the MML term defined in Eq. (6), and $\lambda_1$ and $\lambda_2$ are the hyper-parameters to balance the strength.
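The way the three terms are combined can be sketched in a few lines of plain Python. The AL and MML values are passed in as precomputed numbers, the function names are hypothetical, and the default weights are the values $\lambda_1 = 0.5$ and $\lambda_2 = 0.005$ reported in the experimental setup; this only illustrates the weighting, not the gradient computation.

```python
import math


def cross_entropy(probabilities, label):
    """Softmax cross-entropy for one sample given predicted probabilities."""
    return -math.log(probabilities[label])


def total_loss(batch_probs, batch_labels, al_value, mml_value,
               lambda1=0.5, lambda2=0.005):
    """Cross-entropy over the mini-batch plus weighted AL and MML terms."""
    ce = sum(cross_entropy(p, y) for p, y in zip(batch_probs, batch_labels))
    return ce + lambda1 * al_value + lambda2 * mml_value


# One sample with predicted probabilities over two classes and true label 0.
print(total_loss([[0.5, 0.5]], [0], al_value=2.0, mml_value=6.0))
```

With these weights the anchor term contributes far more per unit than the min-max term, which reflects that the two raw terms live on different scales.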

We conduct comprehensive experiments on three popular image classification datasets: CIFAR100 [16], miniImageNet [43] and CUB200 [44]. The CIFAR100 dataset contains 60,000 RGB images of 100 classes, where each class has 500 training images and 100 test images. Each image has the size $32 \times 32$. This dataset is very popular in CIL works [32, 2]. The miniImageNet dataset is the 100-class subset of the ImageNet-1k [5] dataset used by few-shot learning [43, 6]. Each class contains 500 training images and 100 test images. The images are in RGB format of the size $84 \times 84$. The CUB200 dataset is originally designed for fine-grained image classification and introduced by [3, 30] for incremental learning. It contains about 6,000 training images and 6,000 test images over 200 bird categories. The images are resized to $256 \times 256$ and then cropped to $224 \times 224$ for training.

For the CIFAR100 and miniImageNet datasets, we choose 60 and 40 classes as the base and new classes, respectively, and adopt the 5-way 5-shot setting, so that we have 9 training sessions (i.e., 1 base + 8 new) in total. For CUB200, in contrast, we adopt the 10-way 5-shot setting, choosing 100 classes as the base classes and splitting the remaining 100 classes into 10 new class sessions. For all datasets, each session’s training set is constructed by randomly picking 5 training samples per class from the original dataset, while the test set remains the original one, which is large enough to evaluate the generalization performance and to detect overfitting.
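The session layout described above can be sketched in plain Python. The helper names `build_sessions` and `sample_few_shot`, the shuffled class order, and the seeds are assumptions for illustration; the source only specifies the number of base classes, the number of ways and shots, and that the few-shot samples are picked at random.

```python
import random


def build_sessions(num_classes, num_base, way, seed=0):
    """Split class ids into one base session and disjoint `way`-class sessions."""
    rng = random.Random(seed)
    classes = list(range(num_classes))
    rng.shuffle(classes)  # class order is an assumption; the source does not specify it
    sessions = [classes[:num_base]]
    for start in range(num_base, num_classes, way):
        sessions.append(classes[start:start + way])
    return sessions


def sample_few_shot(indices_by_class, session_classes, shot, seed=0):
    """Randomly pick `shot` training sample indices for each class of a session."""
    rng = random.Random(seed)
    return {c: rng.sample(indices_by_class[c], shot) for c in session_classes}


cifar_sessions = build_sessions(num_classes=100, num_base=60, way=5)
cub_sessions = build_sessions(num_classes=200, num_base=100, way=10)
print(len(cifar_sessions), len(cub_sessions))  # sessions including the base one
```

The base session keeps its full training data; `sample_few_shot` would be applied only to the new-class sessions.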

We use a shallower QuickNet [14] and the deeper ResNet18 [11] models as the baseline CNNs. The QuickNet is a simple yet powerful CNN for classifying small images, which has three conv layers and two fc layers, as shown in Table 1. We evaluate it on both CIFAR100 and miniImageNet. For ResNet18, we evaluate it on all three datasets. We train the base model $\Theta^{(1)}$ with a mini-batch size of 128 and an initial learning rate of 0.1. We decrease the learning rate to 0.01 and 0.001 after 30 and 40 epochs, respectively, and stop training at epoch 50. Then, we finetune the model $\Theta^{(t)}$ on each subsequent training set $D^{(t)}, t > 1$ for 100 epochs, with a learning rate of 0.1 (and 0.01 for CUB200). As $D^{(t)}$ contains very few training samples, we use all of them to construct the mini-batch for incremental learning. After training on $D^{(t)}$, we test $\Theta^{(t)}$ on the union of the test sets of all encountered classes. For data augmentation, we perform standard random cropping and flipping as in [11, 13] for all methods. When finetuning ResNet18, as we only have very few new class training samples, it would be problematic to compute batchnorm statistics. Thus, we use the batchnorm statistics computed on $D^{(1)}$ and fix the batchnorm layers during finetuning. We run the whole learning process 10 times with different random seeds and report the average test accuracy over all encountered classes.
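The base-training learning-rate schedule is a simple step function, sketched here in plain Python. The function name is hypothetical, and counting epochs from 0 is an assumption; the source only states the three rates and the epochs at which they change.

```python
def base_learning_rate(epoch):
    """Learning rate for base training; `epoch` counts from 0 (an assumption)."""
    if epoch < 30:
        return 0.1
    if epoch < 40:
        return 0.01
    return 0.001


# Training stops at epoch 50, so the schedule covers epochs 0..49.
schedule = [base_learning_rate(e) for e in range(50)]
print(schedule.count(0.1), schedule.count(0.01), schedule.count(0.001))
```

During the incremental sessions no such decay is described: the finetuning rate stays fixed for all 100 epochs.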

Table 1. The structure of the QuickNet model in the experiments, which is originally defined in the Caffe package [14].


We learn an NG net of 400 nodes for base classes, and incrementally grow it by inserting 1 node for each new class. For the hyper-parameters, we set $\eta = 0.02$, $\alpha = 1$ for faster learning of NG in Eq. (3), the lifetime $T = 200$ in Eq. (4), and $\lambda_1 = 0.5$, $\lambda_2 = 0.005$ for Eq. (7).

For comparative experiments, we run the representative CIL methods in our FSCIL setting, including the classical iCaRL [32] and the state-of-the-art methods EEIL [2] and NCM [13], and compare our method with them. For BiC [48], we found that training the bias-correction model requires a large set of validation samples, which is impracticable for FSCIL. Therefore, we do not evaluate this work. We set $\gamma = 1$ in Eq. (1) for these distillation-based methods as well as the distillation term used in our ablation study in Section 4.2. Other related works [20, 15, 49, 18, 24] are designed for the multi-task setting, which we do not involve in our experiments. We use the abbreviations “Ours-AL” and “Ours-AL-MML” to indicate the applied loss terms during incremental learning.

4.1. Comparative results

We report the comparative results of the methods using the 5/10-way 5-shot FSCIL setting. As the 5-shot training samples are randomly picked, we run all methods 10 times and report the average accuracies. Figure 4 compares the test accuracies on the CIFAR100 and miniImageNet datasets. Table 2 reports the test accuracies on the CUB200 dataset.

Figure 4. Comparison of the test accuracies of QuickNet and ResNet18 on the CIFAR100 and miniImageNet datasets. At each session, the models are evaluated on a joint set of test samples of the classes encountered so far.

Table 2. Comparison results on CUB200 with ResNet18 using the 10-way 5-shot FSCIL setting. Note that the comparative methods with their original learning rate settings have much worse test accuracies on CUB200. We carefully tune their learning rates and boost their original accuracies by 2%∼8.7%. In the table, we report their accuracies after the improvement.

We summarize the results as follows:

On all three datasets, and for both QuickNet and ResNet18 models, our TOPIC outperforms other state-of-the-art methods at each encountered session, and is the closest to the upper bound “Joint-CNN” method. As the incremental learning proceeds, the superiority of TOPIC becomes more significant, demonstrating its power for continuously learning longer sequences of new class datasets.

Simply finetuning with few training samples of new classes (i.e., “Ft-CNN”, the blue line) deteriorates the test accuracies drastically due to catastrophic forgetting. Finetuning with the AL term (i.e., the green line) effectively alleviates forgetting, outperforming the naïve finetuning approach by up to 38.90%. Moreover, using both AL and MML terms further achieves up to 5.85% accuracy gain over using AL alone. It shows that solving the challenging FSCIL problem requires both alleviating the forgetting of the old classes and enhancing the representation learning of the new classes.

On CIFAR100, TOPIC achieves the final accuracies of 24.17% and 29.37% with QuickNet and ResNet18, respectively, while the second best ones (i.e., NCM* and EEIL*) achieve the accuracies of 19.50% and 15.85%, respectively. TOPIC outperforms the two state-of-the-art methods by up to 13.52%.

On miniImageNet, TOPIC achieves the final accuracies of 18.36% and 24.42% with QuickNet and ResNet18, respectively, while the corresponding accuracies achieved by the second best EEIL* are 13.59% and 19.58%, respectively. TOPIC outperforms EEIL* by up to 4.84%.

On CUB200, at the end of the entire learning process, TOPIC achieves the accuracy of 26.28% with ResNet18, outperforming the second best EEIL∗


4.2. Ablation study

Table 3. Comparison results of combining different loss terms on miniImageNet with ResNet18.

Table 4. Comparison of the final test accuracies achieved by “exemplars” and NG nodes with different memory sizes. Experiments are performed on CIFAR100 with ResNet18.

Figure 5. Comparison results under the 5-way 10-shot and 5-way full-shot settings, evaluated with ResNet18 on miniImageNet.

The contribution of the loss terms. We conduct ablation studies to investigate the contribution of the loss terms to the final performance gain. The experiments are performed on miniImageNet with ResNet18. For AL, we compare the original form in Eq. (5) and a simplified form without the “re-weighting” matrix $\Lambda$. For MML, as it consists of the “min” and “max” terms, we evaluate the performance gain brought by each term separately. Besides, we also investigate the impact brought by the distillation loss term, which is denoted as “DL”. Table 3 reports the comparison results of different loss term settings. We summarize the results as follows:

The “AL” term achieves better accuracy (up to 1.49%) than the simplified form “AL w/o. $\Lambda$”, thanks to the feature re-weighting technique.

Both “AL-Min” and “AL-Max” improve the performance of AL, and the combined form “AL-MML” achieves the best accuracy, exceeding “AL” by up to 5.85%.

Both “DL-MML” and “AL-MML” improve the performance of the corresponding settings without MML (i.e., “DL” and “AL”). It demonstrates the effectiveness of the MML term for improving the representation learning for few-shot new classes.

Applying the distillation loss degrades the performance. Though distillation is popularly used by CIL methods, it may not be so effective for FSCIL, as it is difficult to balance the old and new classes and trade off the performance when there are only few new class training samples, as discussed in Section 3.1.

Comparison between “exemplars” and NG nodes. In our method, we represent the knowledge learned in CNN’s feature space using the NG net G. An alternative approach is to randomly select a set of exemplars representative of the old class training samples [32, 2] and penalize changes to their feature vectors during training. Table 4 compares the final test accuracies achieved by the two approaches under different memory sizes. From Table 4, we observe that NG with only a small number of nodes consistently and substantially outperforms the exemplar approach. When smaller memory is used, the difference in accuracy becomes larger, demonstrating the superiority of our method for FSCIL.

The effect of the number of training samples. To investigate the effect of the number of training shots, we further evaluate the methods under the 5-way 10-shot and 5-way full-shot settings. For 5-way full-shot, we use all training samples of the new class data, which is analogous to the ordinary CIL setting. We grow NG by adding 20 nodes for each new session, so that we have (400 + 20(t−1)) NG nodes at session (t−1). Figure 5 shows the comparative results of different methods under the 10-shot and full-shot settings. We can see that our method also outperforms other state-of-the-art methods when training with more samples. This demonstrates the effectiveness of the proposed framework for the general CIL problem.
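The node-growth schedule above is simple arithmetic, and the sketch below spells it out. The function name and parameter names are hypothetical; `growth_steps` stands for the quantity written (t−1) in the text, i.e. the number of times 20 nodes have been added to the initial 400.

```python
def ng_node_count(growth_steps, base_nodes=400, nodes_per_session=20):
    """Number of NG nodes after `growth_steps` growth steps.

    Implements 400 + 20 * (t - 1) from the text, with growth_steps = t - 1.
    """
    if growth_steps < 0:
        raise ValueError("growth_steps must be non-negative")
    return base_nodes + nodes_per_session * growth_steps


if __name__ == "__main__":
    # Print the node count for the first few growth steps.
    for k in range(4):
        print(k, ng_node_count(k))
```

The memory footprint therefore grows linearly with the number of sessions, independently of how many training samples each session provides.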

Figure 6. Comparison of the confusion matrices produced by (a) Ft-CNN, (b) EEIL*, (c) NCM*, and (d) our TOPIC on miniImageNet with ResNet18.

Figure 6 compares the confusion matrices of the classification results at the last session, produced by Ft-CNN, EEIL* [2], NCM* [13] and our TOPIC. The naïve finetuning approach tends to misclassify all past classes (i.e., 0-94) as the newly learned classes (i.e., 95-99), indicating catastrophic forgetting. EEIL* and NCM* alleviate forgetting to some extent, but still tend to misclassify old class test samples as new classes due to overfitting. Our method, TOPIC, produces a much better confusion matrix, where the activations are mainly distributed along the diagonal, indicating higher recognition performance over all encountered classes. This demonstrates the effectiveness of solving FSCIL by avoiding both “forgetting old” and “overfitting new”.
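The two failure modes read off the confusion matrices can be quantified with a few lines of code. The sketch below is illustrative only: the helper names and the toy labels are assumptions, not part of our evaluation code. It builds a confusion matrix from true and predicted labels, then reports the fraction of old-class test samples predicted as new classes (the “forgetting” pattern) and the overall accuracy (the mass on the diagonal).

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[i][j] counts test samples of true class i predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def old_to_new_rate(matrix, first_new_class):
    """Fraction of old-class samples (rows below first_new_class) that are
    predicted as one of the new classes (columns from first_new_class on)."""
    old_rows = matrix[:first_new_class]
    old_total = sum(sum(row) for row in old_rows)
    old_as_new = sum(sum(row[first_new_class:]) for row in old_rows)
    return old_as_new / old_total if old_total else 0.0


def diagonal_accuracy(matrix):
    """Overall accuracy: the share of samples lying on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy example: classes 0-2 are "old", class 3 is "new".
    y_true = [0, 0, 1, 1, 2, 2, 3, 3]
    y_pred = [0, 3, 1, 3, 2, 2, 3, 3]
    cm = confusion_matrix(y_true, y_pred, num_classes=4)
    print(old_to_new_rate(cm, first_new_class=3))
    print(diagonal_accuracy(cm))
```

For the last miniImageNet session discussed above, `first_new_class` would be 95, since classes 0-94 are old and classes 95-99 are new.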

We focus on an unsolved, challenging, yet practical incremental-learning scenario, namely the few-shot class-incremental learning (FSCIL) setting, where models are required to learn new classes from few training samples. We propose a framework, named TOPIC, to preserve the knowledge contained in CNN’s feature space. TOPIC uses a neural gas (NG) network to maintain the topological structure of the feature manifold formed by different classes. We design mechanisms for TOPIC to mitigate the forgetting of the old classes and improve the representation learning for few-shot new classes. Extensive experiments show that our method substantially outperforms other state-of-the-art CIL methods on CIFAR100, miniImageNet, and CUB200 datasets, with a negligibly small memory overhead.

[1] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.

[2] Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–248, 2018.

[3] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.

[4] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

[5] Jia Deng, Wei Dong, R. Socher, Li Jia Li, Kai Li, and Fei Fei Li. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255, 2009.

[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org, 2017.

[7] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.

[8] Bernd Fritzke. A growing neural gas network learns topologies. Advances in neural information processing systems, 7, 1995.

[9] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.

[10] C He, R Wang, S Shan, and X Chen. Exemplar-supported generative reproduction for class incremental learning. In Proceedings of the British Machine Vision Conference, 2018.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. Computer Science, 14(7):38–39, 2015.

[13] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 831–839, 2019.

[14] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678, 2014.

[15] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.

[16] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[18] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pages 4652–4662, 2017.

[19] Diangang Li, Xing Wei, Xiaopeng Hong, and Yihong Gong. Infrared-visible cross-modal person re-identification with an x modality. In Proceedings of the AAAI Conference on Artificial Intelligence, February 2020.

[20] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.

[21] Chen Lin. The topological approach to perceptual organization. Visual Cognition, 12(4):553–637, 2005.

[22] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 212–220, 2017.

[23] Xialei Liu, Marc Masana, Luis Herranz, Van De Weijer Joost, Antonio M. Lopez, and Andrew D. Bagdanov. Rotate your networks: Better weight consolidation and less catastrophic forgetting. arXiv preprint arXiv:1802.02950, 2018.

[24] David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.

[25] Zhiheng Ma, Xing Wei, Xiaopeng Hong, and Yihong Gong. Bayesian loss for crowd count estimation with point supervision. In The IEEE International Conference on Computer Vision (ICCV), October 2019.

[26] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–82, 2018.

[27] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.

[28] T. M. Martinetz. Competitive hebbian learning rule forms perfectly topology preserving maps. In International Conference on Artificial Neural Networks, pages 427–434, 1993.

[29] Wei Ning, Zhou Tiangang, Zhang Zihao, Zhuo Yan, and Chen Li. Visual working memory representation as a topological defined perceptual object. Journal of Vision, 19(7):1–12, 2019.

[30] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

[31] Y. Prudent and A. Ennaji. An incremental growing neural gas learns topologies. In Neural Networks, 2005. IJCNN ’05. Proceedings. 2005 IEEE International Joint Conference on, 2005.

[32] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.

[33] Mengye Ren, Renjie Liao, Ethan Fetaya, and Richard Zemel. Incremental few-shot learning with attention attractor networks. In Advances in Neural Information Processing Systems, pages 5276–5286, 2019.

[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

[35] Hou Saihui, Pan Xinyu, Loy Chen Change, Wang Zilei, and Lin Dahua. Lifelong learning via progressive distillation and retrospection. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[36] Joan Serrà, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423, 2018.

[37] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.

[38] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.

[39] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 403–412, 2019.

[40] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.

[41] Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, and Yihong Gong. Bi-objective continual learning: Learning ‘new’ while consolidating ‘known’. In Proceedings of the AAAI Conference on Artificial Intelligence, February 2020.

[42] Martinetz Thomas and Schulten Klaus. A “neural-gas” network learns topologies. Artificial Neural Networks, 1991.

[43] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.

[44] C Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[45] Xing Wei, Yue Zhang, Yihong Gong, Jiawei Zhang, and Nanning Zheng. Grassmann pooling as compact homogeneous bilinear pooling for fine-grained visual classification. In The European Conference on Computer Vision (ECCV), September 2018.

[46] Chenshen Wu, Luis Herranz, Xialei Liu, Joost van de Weijer, Bogdan Raducanu, et al. Memory replay gans: Learning to generate new categories without forgetting. In Advances In Neural Information Processing Systems, pages 5962–5972, 2018.

[47] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.

[48] Wu Yue, Chen Yinpeng, Wang Lijuan, Ye Yuancheng, Liu Zicheng, Guo Yandong, and Fu Yun. Large scale incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[49] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR.org, 2017.

[50] Mengyao Zhai, Lei Chen, Frederick Tung, Jiawei He, Megha Nawhal, and Greg Mori. Lifelong gan: Continual learning for conditional image generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2759–2768, 2019.

