In many real-world applications, batches of data arrive periodically (e.g., daily, weekly, or monthly) with the data distribution changing over time. This presents an opportunity for lifelong learning, also called continual learning, which is an important and developing topic in artificial intelligence. The primary goal of lifelong learning is to learn consecutive tasks without forgetting the knowledge learned from previously trained tasks, and to leverage that knowledge to obtain better performance or faster convergence on the newly arriving task. One simple approach is to fine-tune the model for every new task; however, such retraining typically degrades the model performance on both the new tasks and the old ones. If the new tasks are largely different from the old ones, fine-tuning might not be able to learn the optimal model for the new tasks. Meanwhile, the retrained representations may adversely affect the old tasks, causing them to drift from their optimal solution. This can cause “catastrophic forgetting”, a phenomenon where training a model on new tasks interferes with previously learned knowledge, leading to performance degradation or even overwriting of the old knowledge by the new.
Figure 1. (a) The state-of-the-art DEN method [30] selectively retrains the old network and dynamically expands the model capacity. (b) The proposed REC method expands the network through network-transformation-based AutoML and subsequently compresses the model to its original size.
To overcome the catastrophic forgetting problem, many approaches have been proposed [13, 17, 22]. Kirkpatrick et al. [13] propose using a regularization term to prevent the new weights from deviating too much from the previously learned weights, based on their significance to old tasks. Their method uses a fixed neural network architecture, which does not scale once the network capacity is saturated by more and more new tasks. The dynamically expandable network (DEN) [30] is one way to overcome the problem caused by a static architecture: it expands the network capacity whenever it detects that the loss for the new task would not reach a pre-defined threshold. However, DEN involves many hyperparameters and its final performance is highly sensitive to them; it relies on hand-crafted heuristics to explore the tuning space. The search space is so large that human experts usually find only a sub-optimal solution, and the current parameter tuning procedures are time-consuming. To this end, we aim to automatically expand the network for lifelong learning, with higher performance and less parameter redundancy than human-designed architectures. To better facilitate (a) automatic knowledge transfer without human expert tuning and (b) model design with optimized model complexity, we propose, to our knowledge for the first time, to apply AutoML [23] to lifelong learning while taking learning efficiency into consideration.
AutoML refers to automatically learning a suitable machine learning (ML) model for a given task. Neural Architecture Search (NAS) [32] is a subfield of AutoML for deep learning that searches for the optimal hyperparameters of a network architecture using reinforcement learning (RL). The RL framework has a main controller that observes the generated child networks’ performance on the validation set as the reward signal; it then assigns higher probabilities to architectures with higher performance when updating the model. If we used this approach directly in the lifelong learning setting, it would forget the old tasks’ knowledge and be wasteful, since the controller would need to search each new task’s network architecture from scratch, ignoring the correlations between previously learned tasks and the new task. We therefore propose a multi-task weight consolidation (MWC) approach that learns a discriminative weight subset by incorporating the inherent correlations between the old tasks and the new task. Furthermore, to narrow down the architecture search space and save training time, network-transformation-based AutoML [3] is used to accelerate the meta-learning of the new network.
However, if we keep expanding the network for more and more new tasks, the model will become much larger than the initial model and will suffer from inefficiency (e.g., a large memory footprint and high power usage). Many network-expanding-based lifelong learning algorithms [24, 30] increase the model capacity but also decrease the learning efficiency in terms of memory cost and power usage. To address this issue, we perform model compression after learning each new task: we compress the expanded model back to the size of the initial model, with negligible performance loss on both old and new tasks. Fig. 1 shows the main difference between our approach and network-expansion-based lifelong learning algorithms.
In this paper, we propose a multi-task based lifelong learning framework via nonexpansive AutoML, termed Regularize, Expand and Compress (REC), to continually and automatically learn on such sequential datasets. We start with a given small network and learn an initial model on the first task. For each new upcoming task, REC then searches for the best network architecture by network-transformation-based AutoML, without access to the old tasks’ data, using a newly proposed MWC algorithm, and compresses the expanded network back to the initial network size.
The key contributions of this work can be summarized as follows:
• We propose to Regularize, Expand and Compress (REC) for lifelong learning, which automatically expands the network capacity for learning a new task with higher performance and less parameter redundancy than human-designed architectures.
• To overcome catastrophic forgetting for the old learned tasks, we propose a novel Multi-task Weight Consolidation (MWC) — it considers the discriminative weight subset by incorporating inherent correlations between old tasks and new task and learns the newly added layer as a task-specific layer for the new task.
• Furthermore, unlike previous network-expanding-based lifelong learning algorithms, REC compresses the model after learning every new task to guarantee model efficiency. The final model is nonexpansive, yet its performance is enhanced by the network expansion performed before compression.
2.1. Overcoming Catastrophic Forgetting
Recently, many lifelong learning methods have been proposed to address the catastrophic forgetting problem. The first group of methods uses regularized learning. Elastic Weight Consolidation (EWC) [13] shows that task-specific synaptic consolidation can overcome catastrophic forgetting in neural networks: it identifies the weights that are important for the previous tasks and selectively adjusts their plasticity. Inspired by EWC, Schwarz et al. [26] propose online EWC, which improves the scalability of EWC by limiting the computational cost of the regularization term as the number of tasks increases. Synaptic Intelligence [31] computes an online importance measure along the entire learning trajectory, which is similar to EWC. Rotate-EWC [19] (REWC) is a modified version of EWC: it approximately diagonalizes the Fisher information matrix of the network parameters by computing a factorized rotation of the parameter space, which is used in conjunction with EWC.
The second group of strategies is associated with learning task-specific parameters. Learning without forgetting (LwF) [17] applies distillation regularization on the new tasks: the soft labels of previously learned tasks are enforced to be similar to those of the network trained on the current task by using knowledge distillation [10]. Less-forgetful learning [12] regularizes the distance between the final hidden activations of the new network and those obtained with the old tasks’ parameters in order to preserve the old tasks’ feature mappings.
Table 1. Comparison of lifelong learning approaches for overcoming catastrophic forgetting. EWC: Elastic Weight Consolidation [13]; DEN: Dynamically expandable network [30]; LwF: Learning without forgetting [17]; GEM: Gradient Episodic Memory [20]; PGN: Progressive neural network [24]; and our algorithm REC.
The third group of methods expands the network capacity. The progressive neural network (PGN) [24] blocks any changes to the network models pre-trained on previously learned tasks and expands the architecture by allocating fixed-capacity sub-networks to be trained on the new information. PathNet [7] uses agents embedded in a neural network to find which parts of the network can be reused for learning new tasks, and freezes task-relevant paths to avoid catastrophic forgetting. The dynamically expandable network (DEN) [30] increases the number of trainable parameters to continually learn new tasks and dynamically selects neurons to retrain or expands neuron capacity by using group sparse regularization.
The fourth group of methods uses episodic memory, where samples from previously learned tasks are stored to effectively recall past experience. Gradient Episodic Memory (GEM) [20] enables positive forward transfer, minimizes negative backward transfer to previously learned tasks, and learns the subset of correlations to a set of tasks without using task descriptors. Incremental Classifier and Representation Learning (iCaRL) [22] combines a classification loss on new tasks and a distillation loss on previously learned tasks with a K-nearest neighbor classifier, and selects the exemplars for each task such that the embeddings of the selected samples are close to the center point of each class. Table 1 shows the merits of REC compared with previous research in this area.
2.2. AutoML and Knowledge Distillation
There are many works on AutoML that improve the performance of deep neural networks [32, 21, 3]. Neural Architecture Search (NAS) [32] searches for transferable network blocks via reinforcement learning and outperforms many manually designed network architectures. ENAS [21] uses a controller to discover network architectures by searching for an optimal subgraph within a large computational graph and shares parameters among child models to enable efficient NAS. EAS [3] efficiently explores network architectures via network transformation [4], a function-preserving method that expands the architecture with a fixed number of units or filters.
Figure 2. Illustration of our lifelong learning framework. REC first uses MWC to search for the best child network via the Net2Deeper and Net2Wider operators in the controller for a newly arriving task, then compresses the expanded network to the same size as the initial model and continues to learn the next new task.
Knowledge distillation (KD) [10] is also closely related to our work. KD is widely used to compress a network into one with a different architecture that approximates the original network, with knowledge transferred from a large teacher network to a small student network. The student network is trained with the KD loss, a modified cross-entropy loss that encourages the student network’s outputs to be similar to the teacher network’s. In our work, we adopt KD to compress the expanded network after learning each new task.
Fig. 2 gives an overview of our AutoML framework REC for lifelong learning. It has three steps: regularize via multi-task weight consolidation, expand the network by AutoML, and compress the expanded model.
3.1. Problem Definition and Overview
We define the lifelong learning problem as follows: an unknown number of tasks with unknown distributions arrive in sequence. Our goal is to learn a deep model in such a lifelong learning scenario without catastrophic forgetting. For the evaluation protocol, we report the classification accuracy on each of the previous tasks and on the current task $T$ after training on the $T$-th task. Given a sequence of $T$ tasks, the task at time point $t = 1, 2, \dots, T$ with $N_t$ images comes with dataset $D_t = \{(x_i^t, y_i^t)\}_{i=1}^{N_t}$. Specifically, for task $t$, $y_i^t$ is the label for the $i$-th sample $x_i^t$ in task $t$. We denote the training data matrix by $X_t$ for $D_t$, i.e., $X_t = (x_1^t, x_2^t, \dots, x_{N_t}^t)$. When the dataset of task $t$ arrives, all the previous training datasets $D_1, \dots, D_{t-1}$ are no longer available, but the deep model parameter $\theta_{t-1}$ can be accessed. The lifelong learning problem at time point $t$, given data $D_t$, can be defined as solving the following problem:
$$\min_{\theta_t} F(\theta_t; \theta_{t-1}, D_t) \tag{1}$$
where $F$ is the loss function for solving $\theta_t$, and $\theta_t$ is the parameter for task $t$. Note that the number of upcoming tasks can be finite or infinite; for simplicity, we consider the finite scenario here.
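Kirkpatrick et al. [13] proposed EWC, which consists of a quadratic penalty on the difference between the parameters $\theta_t$ and $\theta_{t-1}$ to slow down catastrophic forgetting of previously learned tasks. The posterior distribution $p(\theta \mid D)$ is used to describe the problem via Bayes’ rule:
$$\log p(\theta \mid D) = \log p(D_t \mid \theta) + \log p(\theta \mid D_{t-1}) - \log p(D_t) \tag{2}$$
where the posterior probability $p(\theta \mid D_{t-1})$ embeds all the information from task $t-1$. However, problem (2) is intractable, so EWC approximates the posterior as a Gaussian distribution with mean given by the parameter $\theta_{t-1}$ and the diagonal $I$ of the Fisher information matrix $F$. The diagonal entries of $F$ are computed by $F_i = \mathbb{E}\left[\left(\partial \log p(y \mid x; \theta) / \partial \theta_i\right)^2\right]$, evaluated at $\theta = \theta_{t-1}$. Therefore, the problem of EWC on task $t$ can be written as follows:
$$\min_{\theta} F_t(\theta) + \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta_{t-1,i}\right)^2 \tag{3}$$
where $F_t$ is the loss function for task $t$, $\lambda$ denotes how important task $t-1$ is compared to task $t$, and $i$ labels each weight of the parameter $\theta$.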
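The EWC penalty can be illustrated with a short plain-Python sketch. It is not the authors' implementation: the function names `fisher_diagonal` and `ewc_penalty` are hypothetical, the penalty is assumed to take the standard form $(\lambda/2)\sum_i F_i(\theta_i - \theta_{t-1,i})^2$ from [13], the Fisher diagonal is assumed to be estimated as the mean of squared per-sample gradients of the log-likelihood, and the toy numbers are invented for illustration.

```python
def fisher_diagonal(per_sample_grads):
    """Estimate the Fisher diagonal as the mean squared per-sample gradient."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def ewc_penalty(theta, theta_prev, fisher, lam):
    """Quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta_prev_i)^2."""
    return 0.5 * lam * sum(
        f * (t - p) ** 2 for f, t, p in zip(fisher, theta, theta_prev)
    )


# Toy example (invented values): weight 0 matters for the old task, weight 1 does not.
grads = [[1.0, 0.0], [3.0, 0.0]]   # per-sample gradients at theta_prev
fisher = fisher_diagonal(grads)    # mean of squares per weight
penalty = ewc_penalty([1.0, 5.0], [0.0, 0.0], fisher, lam=2.0)
```

In this toy case only the first weight is penalized for moving, because the second weight has zero estimated importance for the old task; the new-task loss would be added to `penalty` to form the full objective.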
3.2. Multi-task Weight Consolidation
The main problem of EWC is that it only enforces the parameters of task $t$ to stay close to those of task $t-1$. This ignores the inherent correlations between task $t-1$ and task $t$, and such a relationship might help overcome catastrophic forgetting on the previously learned tasks. Learning multiple related tasks jointly can improve performance relative to learning each task separately; this idea is incorporated into Multi-Task Learning (MTL) [6], which has been commonly used to obtain better generalization performance than learning each task individually. We redefine Eq. 3 using MTL and propose a new objective function, Eq. 4, to improve the ability to overcome catastrophic forgetting across multiple tasks simultaneously:
$$\min_{\theta} F_t(\theta) + \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta_{t-1,i}\right)^2 + \lambda_1 \lVert \theta \rVert_{2,1} \tag{4}$$
where $\lambda_1$ is the non-negative regularization parameter and $\lVert \cdot \rVert_{2,1}$ is the $\ell_{2,1}$-norm regularization used to learn the related representations. Here, we employ multi-task learning with the $\ell_{2,1}$-norm [18] to capture the common subset of relevant parameters from each layer for task $t-1$ and task $t$. We further consider some important parameters that have better representation power for a subset of tasks. MTL with a sparsity-inducing norm [8] has been widely studied to select such a discriminative parameter subset by incorporating inherent correlations among multiple tasks. To this end, the $\ell_1$ sparse norm is imposed to learn the new task-specific parameters while learning task relatedness among multiple tasks. Therefore, the objective function for task $t$ becomes:
$$\min_{\theta} F_t(\theta) + \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta_{t-1,i}\right)^2 + \lambda_1 \lVert \theta \rVert_{2,1} + \lambda_2 \lVert \theta \rVert_1 \tag{5}$$
where $\lambda_2$ is the non-negative regularization parameter. We call our algorithm Multi-task Weight Consolidation (MWC) because it studies the discriminative weight subset with inherent correlations among multiple tasks. Fig. 3 shows a geometric illustration of MWC.
Figure 3. MWC retrains the entire network learned on previous tasks while regularizing it to prevent forgetting from the original model. MWC (purple solid line) learns better parameter representations to overcome catastrophic forgetting by studying MTL with the sparsity-inducing norm (purple dashed line) and EWC (red line).
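The regularizers that MWC adds on top of the EWC penalty can be sketched as follows. This is only an illustration under stated assumptions: a layer's parameters are stored as a matrix (a list of rows), the $\ell_{2,1}$-norm is taken as the sum of the Euclidean norms of the rows, the $\ell_1$-norm as the sum of absolute values, and the names `l21_norm`, `l1_norm`, `mwc_penalty`, `lam`, `lam1` and `lam2` are hypothetical. How the parameters are grouped across layers and tasks is not specified here and may differ in the actual method.

```python
import math


def l21_norm(rows):
    """Sum over rows of the Euclidean norm of each row (group sparsity)."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in rows)


def l1_norm(rows):
    """Sum of absolute values of all entries (element-wise sparsity)."""
    return sum(abs(w) for row in rows for w in row)


def mwc_penalty(theta, theta_prev, fisher, lam, lam1, lam2):
    """EWC quadratic term plus the two sparsity-inducing norms."""
    ewc = 0.5 * lam * sum(
        f * (w - p) ** 2
        for row, prev_row, f_row in zip(theta, theta_prev, fisher)
        for w, p, f in zip(row, prev_row, f_row)
    )
    return ewc + lam1 * l21_norm(theta) + lam2 * l1_norm(theta)


# Toy layer (invented values): the second row is entirely zero.
theta = [[3.0, 4.0], [0.0, 0.0]]
fisher = [[1.0, 1.0], [1.0, 1.0]]
value = mwc_penalty(theta, theta, fisher, lam=1.0, lam1=0.1, lam2=0.01)
```

A row of zeros contributes nothing to the row-wise norm, which is why this kind of regularizer tends to switch whole groups of weights on or off, while the element-wise norm encourages sparsity inside the remaining rows.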
3.3. AutoML for Lifelong Learning with MWC
MWC is a regularization-based lifelong learning algorithm. It may be necessary to expand the network if a task is very different from the existing ones or if the network capacity becomes insufficient as more and more new tasks arrive. Because human experts usually find only a sub-optimal solution, we propose an AutoML-based network expanding method for lifelong learning. We name it Regularize, Expand and Compress (REC) and summarize the steps in Algorithm 1. The details of the network-transformation-based AutoML for REC are outlined in Algorithm 2.
We consider the Net2Wider and Net2Deeper operators [4] in our controller. The Net2Wider network transformation function is as follows:
$$\hat{O}_l(j) = \begin{cases} O_l(j), & 1 \le j \le n_l \\ O_l(r_j), \; r_j \text{ sampled at random from } \{1, \dots, n_l\}, & n_l < j \le \hat{n}_l \end{cases} \tag{6}$$
where $O_l$ represents the outputs of the original layer $l$, which has $n_l$ units before and $\hat{n}_l$ units after widening. The Net2Deeper network transformation function is
$$\varphi\left(I \cdot \varphi(v)\right) = \varphi(v) \quad \forall v \tag{7}$$
where the constraint holds for the rectified linear activation $\varphi$, and $I$ is the identity matrix used to initialize the new layer. We learn a meta-controller to generate network transformation actions (Eq. 6 and Eq. 7) given the initial network architecture. Specifically, we use an encoder network [3], implemented with an input embedding layer and a bidirectional recurrent neural network [25], to learn a low-dimensional representation of the initial network, which is fed into different operators to generate different network transformation actions. In addition, we use a shared sigmoid classifier to make the Net2Wider decision according to the hidden state of each layer learned by the bidirectional encoder network [3], and the wider network can be further combined with a Net2Deeper operator.
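The function-preserving property of the two operators can be checked on a toy fully-connected network. The sketch below follows the general Net2Net recipe from [4] as we understand it (extra units copy randomly chosen existing units and outgoing weights are divided by the replication count; a new layer is initialized to the identity after a ReLU). All function names and the toy weights are hypothetical and biases are omitted, so this is an illustration and not the controller used in REC.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(W1, W2, x):
    """Two-layer network: linear -> ReLU -> linear."""
    return matvec(W2, relu(matvec(W1, x)))


def net2wider(W1, W2, new_width, rng):
    """Widen the hidden layer from len(W1) to new_width units."""
    n = len(W1)
    # Old units map to themselves; each extra unit copies a random old unit.
    mapping = list(range(n)) + [rng.randrange(n) for _ in range(new_width - n)]
    counts = [mapping.count(j) for j in range(n)]
    W1_wide = [list(W1[g]) for g in mapping]
    # Divide outgoing weights by the replication count to keep the output fixed.
    W2_wide = [[row[g] / counts[g] for g in mapping] for row in W2]
    return W1_wide, W2_wide


def identity_layer(width):
    """Weights of a new layer that leaves ReLU activations unchanged."""
    return [[1.0 if i == j else 0.0 for j in range(width)] for i in range(width)]


rng = random.Random(0)
W1 = [[0.5, -1.0], [2.0, 1.0]]   # hidden x input (toy values)
W2 = [[1.0, -2.0]]               # output x hidden
x = [1.0, 3.0]

y_old = forward(W1, W2, x)
W1_wide, W2_wide = net2wider(W1, W2, new_width=4, rng=rng)
y_wide = forward(W1_wide, W2_wide, x)

# Net2Deeper: insert an identity layer after the hidden ReLU.
hidden = relu(matvec(W1, x))
hidden_deep = relu(matvec(identity_layer(len(hidden)), hidden))
y_deep = matvec(W2, hidden_deep)
```

Both transformed networks compute the same output as the original one on this input (up to floating-point rounding for the widened network), so training on a new task can start from the old function rather than from scratch.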
We then integrate MWC (Eq. 5) into the above AutoML system for lifelong learning. After learning the network on the data $D_{t-1}$, we automatically search for the best child network using the Net2Wider and Net2Deeper operators when it is necessary to expand the network, while keeping the model performance on task $t-1$ based on Eq. 5. If the controller decides to expand the network, the newly added layer will not have the previous tasks’ Fisher information. We therefore treat the newly added layer as a new task-specific layer: $\ell_1$ regularization is adopted to promote sparsity in the new weights, so that each neuron is connected to only a few neurons in the layer below. This efficiently learns the best representation for the new task while reducing the computational overhead. The modified MWC objective in the network-expanding scenario is as follows:
where the subscripts deeper and wider refer to the newly added layers in task t.
After the controller generates a child network, the accuracy of the child network on the validation set of task t is used as the reward signal to update the controller. We maximize the expected reward to find the optimal child network. The empirical approximation of our AutoML REINFORCE rule [28] is as follows:
where m is the number of children networks that the controller C samples in one batch and and
represents the action and state of predicting s-th hyperparameter to design a child network architecture, respectively. T is the transition function in Alg. 2. Since
the reward is non-differentiable, we use policy gradient to update the controller. We apply a nonlinear transformation to the accuracy on the validation set of task t, as done in [3], and use the transformed value as the reward. We also use an exponential moving average of previous rewards with a decay of 0.95 to reduce the variance. To balance the old-task and new-task knowledge, we set the maximum number of expanded layers to 2 and 3 for the Net2Wider and Net2Deeper operators, respectively.
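The reward shaping described above can be sketched as follows. The exact transformation is not recoverable from the text; the sketch assumes the form tan(acc · π/2), which is our reading of [3], and treats the exponential moving average with decay 0.95 as a baseline that is subtracted from the reward. The names `transformed_reward`, `MovingAverageBaseline` and `advantage` are hypothetical.

```python
import math


def transformed_reward(val_acc):
    """Assumed reward shaping: tan(acc * pi / 2) stretches high accuracies."""
    return math.tan(val_acc * math.pi / 2)


class MovingAverageBaseline:
    """Exponential moving average of past rewards, used to reduce variance."""

    def __init__(self, decay=0.95):
        self.decay = decay
        self.value = None

    def update(self, reward):
        if self.value is None:
            self.value = reward
        else:
            self.value = self.decay * self.value + (1 - self.decay) * reward
        return self.value


def advantage(reward, baseline):
    """Reward minus the running baseline (zero before any reward is seen)."""
    if baseline.value is None:
        return 0.0
    return reward - baseline.value


baseline = MovingAverageBaseline(decay=0.95)
signals = []
for acc in [0.80, 0.85, 0.90]:        # invented validation accuracies
    r = transformed_reward(acc)
    signals.append(advantage(r, baseline))
    baseline.update(r)
```

Under this assumed transformation, an improvement from 0.85 to 0.90 yields a larger reward increase than one from 0.80 to 0.85, so the controller is pushed harder once child networks are already accurate.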
If the network keeps expanding as more and more tasks arrive, the model becomes inefficient and incurs extra memory cost. Thus, model compression is needed to reduce the memory cost and obtain a nonexpansive model. Here, we use soft labels (the logits) for knowledge distillation (KD) [10] instead of the hard labels to train the student model. Following Ba et al. [2], the student model is trained to minimize the mean of the $\ell_2$ loss on the training data $\{(x_i^t, z_i^t)\}_{i=1}^{N_t}$, where $z_i^t$ is the logits of the child model on the $i$-th training sample. We compress the expanded child model to the same size as the initial model using the KD loss below:
$$\min_{\theta_s} \frac{1}{N_t} \sum_{i=1}^{N_t} \left\lVert f(x_i^t; \theta_s) - z_i^t \right\rVert_2^2 \tag{10}$$
where $\theta_s$ is the weights of the student network and $f(x_i^t; \theta_s)$ is the prediction for the $i$-th training sample of task $t$. The final student network is trained to convergence with hard and soft labels by the following loss function:
where F is the loss function (cross-entropy in this work) for training with the ground truth of task t.
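A minimal sketch of the compression losses, assuming the soft-label term is the mean squared distance between student and teacher logits (in the spirit of [2]) and the hard-label term is the usual cross-entropy. The weighting factor `alpha` and all function names are hypothetical, since the text does not state how the two terms are balanced.

```python
import math


def logit_matching_loss(student_logits, teacher_logits):
    """Mean over samples of the squared L2 distance between logit vectors."""
    total = 0.0
    for s_vec, t_vec in zip(student_logits, teacher_logits):
        total += sum((s - t) ** 2 for s, t in zip(s_vec, t_vec))
    return total / len(student_logits)


def cross_entropy(logits, label):
    """Negative log-probability of the true label under a softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def student_loss(student_logits, teacher_logits, labels, alpha=1.0):
    """Hard-label cross-entropy plus alpha times the soft-label term."""
    hard = sum(
        cross_entropy(s, y) for s, y in zip(student_logits, labels)
    ) / len(labels)
    soft = logit_matching_loss(student_logits, teacher_logits)
    return hard + alpha * soft


# Toy batch (invented values): one sample, two classes.
teacher = [[3.0, 4.0]]
student = [[0.0, 0.0]]
soft_only = logit_matching_loss(student, teacher)
total = student_loss(student, teacher, labels=[1], alpha=0.5)
```

Matching logits rather than hard labels lets the smaller student see how the expanded model ranks all classes, which is the information that plain labels discard.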
4.1. Experimental Settings
Datasets. We evaluate our algorithm on the most commonly used datasets for lifelong learning. We list them as follows:
– MNIST-permutation: MNIST [16] is the most commonly used dataset in lifelong learning work; it consists of ten handwritten digit classes with 60,000/10,000 training and testing examples. One way to create datasets for multiple tasks is to permute the pixels with a fixed random permutation for each task [13], so that the input distributions of the tasks are unrelated.
– MNIST-variation: The MNIST-variation [16] dataset rotates the MNIST images by a fixed angle between 0 and 180 degrees for each task. We use 180/T as the fixed angle to create T tasks.
– CIFAR-100: The CIFAR-100 [14] dataset contains 60,000 color images in 100 object classes. Each class has 500/100 images for training and testing. Each task consists of a set of classes: with T tasks, each task contains 100/T classes. Unlike in the MNIST-permutation dataset, the input distributions are similar across tasks but the output distributions differ.
– CUB-200: CUB-200 [29] is a fine-grained image classification benchmark; we use the CUB-200-2011 version in this work. It contains 11,788 images of 200 types of birds with 5,994/5,794 for training and testing. Each image has
detailed annotations and a bounding box. We crop the bounding boxes from the original images and resize them to . We use the same way to create multiple tasks as CIFAR-100 dataset.
For the first three datasets, we choose T = 10 tasks. Since the fine-grained CUB-200 dataset is more challenging than the others, we set T = 4 tasks to allow better comparisons on lifelong learning. For all datasets, we use a ratio of 0.1 to split off the validation set, and the model observes the tasks in sequence. We first generate the tasks for each dataset; all compared methods then use the same task order and the same categories within each task for a fair comparison.
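The task construction for the permutation and class-split benchmarks can be sketched as below. The helper names are hypothetical, the seed is arbitrary, and assigning consecutive class ids to each task is an assumption; the actual class grouping and permutations used in the experiments are not given in the text.

```python
import random


def pixel_permutations(num_pixels, num_tasks, seed=0):
    """One fixed random pixel permutation per task (permuted-MNIST style)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_tasks):
        perm = list(range(num_pixels))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def permute_image(pixels, perm):
    """Apply a task's permutation to a flattened image."""
    return [pixels[i] for i in perm]


def class_split_tasks(num_classes, num_tasks):
    """Partition class ids into num_tasks equal groups of consecutive ids."""
    per_task = num_classes // num_tasks
    return [
        list(range(t * per_task, (t + 1) * per_task)) for t in range(num_tasks)
    ]


mnist_perms = pixel_permutations(num_pixels=28 * 28, num_tasks=10)
cifar_tasks = class_split_tasks(num_classes=100, num_tasks=10)   # 10 classes each
cub_tasks = class_split_tasks(num_classes=200, num_tasks=4)      # 50 classes each
```

Fixing the seed and generating all tasks up front mirrors the requirement that every compared method sees the same task order and the same categories within each task.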
Base network settings. For the two MNIST datasets, we use a two-layer fully-connected neural network of 100-100 units with ReLU activations as our initial network. For the CIFAR-100 dataset, we use a modified version of AlexNet [15], which has five convolutional layers (64-128-256-256-128 depth with filter size) and three fully-connected layers (384-192-100 neurons at each layer); standard data augmentation is used on this dataset. For the CUB-200 dataset, we use a VGG-16 [27] model pre-trained on ImageNet [5] and fine-tune it on the CUB-200 data for better initialization. We follow the setting of Liu et al. [19], which adds a global pooling layer after the final convolutional layer of VGG-16. The fully-connected layers are changed to 512-512, and the size of the output layer is the number of classes in each task. All models and algorithms are implemented using the TensorFlow [1] library.
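As a worked example of how model size is counted, the sketch below computes the number of parameters of a fully-connected network from its layer widths. It assumes flattened 28 × 28 MNIST inputs (784 values), a 10-way output layer and bias terms; these are illustrative assumptions and the resulting numbers are not taken from the reported tables. The function name `mlp_param_count` is hypothetical.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully-connected network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


base = mlp_param_count([784, 100, 100, 10])      # 100-100 hidden units
widened = mlp_param_count([784, 200, 100, 10])   # first hidden layer doubled
growth = widened / base
```

Under these assumptions the base network has 89,610 parameters, and doubling the first hidden layer roughly doubles the model size, which illustrates why repeated expansion without compression quickly becomes costly.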
Comparison methods. We compare our algorithms with six other methods: 1) SN: a single network trained across all tasks. 2) Net2Net [4]: network expansion by Net2Net [4] on each new task. 3) EWC [13]: a deep network trained with elastic weight consolidation. 4) Net2Net-EWC: network expansion by Net2Net [4] with elastic weight consolidation [13] when learning a new task. 5) DEN [30]: dynamically expandable network. 6) REWC [19]: Rotate Elastic Weight Consolidation. Our own methods are 7) MWC: a deep network trained with multi-task weight consolidation, and 8) REC: Regularize, Expand and Compress.
Hyperparameter settings. All hyper-parameters in MWC are optimized using a grid-search and the best results for each model are reported. For two MNIST datasets, the SGD optimizer is used with a learning rate of 0.001 and we set batch size of 256 with 8 epochs, and
in all experiments. For CIFAR-100 dataset, we use SGD optimizer with momentum parameter of 0.9, learning rate of 0.01, batch size of 128 with 20 epochs,
and
. For CUB dataset, the Adam optimizer is used with a learning rate of 0.001, batch size of 32 and 50 epochs,
and
. For network transformation based AutoML experimental settings, we followed the training details of Cai et al. [3].
Figure 4. The experimental results of continual training on MNIST-permutation, MNIST-variation and CIFAR-100 datasets. We report the average per-task performance (accuracy) of the models over T = 10 tasks. The numbers in the legend represent average per-task performance after the model has finished learning task t.
Figure 5. Forgetting experiment for task 1 on MNIST-permutation, MNIST-variation and CIFAR-100 datasets. We report the accuracy of different models on task t = 1 at each training stage to see how the model performance changes over time for all datasets.
Table 2. Comparisons of the model size and the average task accuracy after training 10 tasks of different approaches on MNIST-permutation. #W(1): the number of parameters of task 1. #W(10): the number of parameters after training task 10. ACC (10): average per-task accuracy after training task 10.
4.2. Experimental Results
We evaluate our methods in terms of both model accuracy and model complexity, where we measure the model size at the end of the training process.
Comparisons of the model performance. We report the average per-task accuracy on the MNIST-permutation, MNIST-variation and CIFAR-100 datasets when T = 10 in Fig. 4. Overall, REC outperforms all compared methods and overcomes catastrophic forgetting, especially on the later tasks (after task 5). We observe that the regularization-based networks (EWC, MWC) perform worse than the expandable networks (DEN, REC), which shows that selectively expanding the network helps improve the performance by a large margin. Specifically, REC performs better than DEN on the two MNIST datasets, and MWC performs similarly to DEN on the MNIST-permutation dataset while using fewer parameters. We also observe that directly applying Net2Net [4] to lifelong learning does not perform well, since it forgets the old tasks’ knowledge in the same way as fine-tuning (SN); adding EWC to the loss function helps improve the old tasks’ performance with Net2Net. REC performs better than Net2Net-EWC because we consider the new task-specific parameters and the discriminative common subset between the old tasks and the new one.
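The two quantities discussed in this evaluation, average per-task accuracy and the accuracy retained on an earlier task, can be computed from a table of accuracies as sketched below. The layout `acc[t][k]` (accuracy on task k after training on task t, with 0-based indices), the helper names and the numbers are all illustrative assumptions and not results from the experiments.

```python
def average_per_task_accuracy(acc, t):
    """Mean accuracy over tasks 0..t after training on task t."""
    return sum(acc[t][: t + 1]) / (t + 1)


def forgetting(acc, k):
    """Accuracy drop on task k between learning it and the end of training."""
    return acc[k][k] - acc[-1][k]


# Invented accuracies for three sequential tasks.
acc = [
    [0.90],
    [0.80, 0.90],
    [0.70, 0.85, 0.95],
]
final_average = average_per_task_accuracy(acc, t=2)
task0_forgetting = forgetting(acc, k=0)
```

Tracking the first column of such a table over time corresponds to the forgetting curves for task 1, while the row averages correspond to the average per-task accuracy.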
We also evaluate catastrophic forgetting over time on the earliest task. Fig. 5 shows the test accuracy of the first task throughout the whole lifelong learning process on the MNIST-permutation, MNIST-variation and CIFAR-100 datasets. It shows that our methods (MWC and REC) forget less on old tasks than all other methods on the MNIST-permutation and CIFAR-100 datasets. It is worth noting that DEN performs slightly better than our method on task 1 after learning later tasks on the MNIST-variation dataset: because it selectively expands the network for the new task, it is biased towards the earliest task. Our REC is a nonexpansive network and our overall average per-task performance is better than DEN’s, which shows that our method performs better on later tasks and achieves a more balanced performance across sequential tasks than DEN. In addition, we have an interesting finding on the MNIST-variation dataset: SN and Net2Net show irregular performance on task 1 after learning task 10. This is because task 10 is the upside-down version of task 1, and such a flip benefits some digits such as ‘1’, ‘0’ and ‘8’. SN and Net2Net forget too much of task 1’s knowledge after learning task 9; unlike EWC, MWC and REC, they retain only the most recently learned task’s knowledge when they learn task 10, which causes the irregular performance.
Table 3. Comparisons of the model size and the average task accuracy after training 10 tasks of different approaches on CIFAR-100 dataset. #W(1): the number of parameters of task 1. #W(10): the number of parameters of the model after training task 10. ACC (10): average per-task accuracy after training task 10.
Comparisons of the model complexity. Table 2 and Table 3 report the model size and the average per-task performance after training T = 10 tasks for the different approaches on the MNIST-permutation and CIFAR-100 datasets, respectively. Overall, REC performs similarly to or better than all other approaches with a smaller model size. We observe that DEN performs better than MWC and worse than REC on the MNIST-permutation dataset, but it requires a 1.4× network expansion compared with ours. For the CIFAR-100 dataset, we compute the AUROC after learning T = 10 tasks: REC achieves 0.887 compared with 0.923 for DEN; however, our model is 50% of the size of DEN’s model. In addition, we notice that DEN involves 7 hyperparameters and is very sensitive to them: when we slightly change one of them, the result becomes 0.8907 on the MNIST-permutation dataset. Our method has only three hyperparameters and needs much less expert tuning than DEN. Training time is a limitation of the current version of REC: since REC is a reinforcement-learning-based algorithm, a varying number of trials is needed, which results in more training time than other methods. We will improve the training efficiency of our work in the future. We also did not consider complex network structures (e.g., ResNet [9], DenseNet [11]); we will extend the current work to more network architectures in the future.
Comparison results on CUB-200 dataset. Fig. 6 shows the comparison with EWC [13] and REWC [19] on the CUB-200 dataset when T = 4. MWC has results comparable to REWC: it performs better on task 3 and task 4 but worse on task 2. We test REC with only the new task’s validation set (REC-new), which gives results similar to MWC on later tasks. This might be because using only the new task’s validation set is not sufficient to compute the rewards on a more subtle dataset. We hypothesize that exemplars from old tasks will help improve the performance of the nonexpansive AutoML system. Thus, we use the validation sets of all learned tasks to compute the rewards and report the results (REC-all) in Fig. 6. The results show that exemplars from old tasks help improve the performance of the AutoML-based algorithm, and we will investigate the relationship between the number of exemplars and the performance of REC in future work.
Figure 6. Comparison results with EWC and REWC on CUB-200 dataset when T = 4.
Table 4. Comparison results of average per-task accuracy after training task 10 on MNIST-permutation dataset.
Ablation study on each component in MWC. We study how the different components used in MWC affect the final performance of lifelong learning. We report the average per-task accuracy after training task 10 on MNIST-permutation for the different strategies (EWC, EWC with the $\ell_{2,1}$-norm only, EWC with the $\ell_1$-norm only, and MWC) in Table 4. It shows that
-norm has a stronger effect on the performance than
-norm, while our method MWC outperforms the single regularization strategies, which demonstrates the usefulness of studying the common weight subset together with discriminative new task parameters.
In this work, we develop a multi-task based lifelong learning framework via nonexpansive AutoML (REC). REC consists of two stages, continual network expansion and model compression; in addition, a novel multi-task weight consolidation algorithm is proposed to overcome catastrophic forgetting. We achieve better accuracy and smaller model size than other lifelong learning methods on four datasets. In the future, we plan to reduce the training time of the AutoML-based algorithm and explore the need for exemplars when computing the rewards to improve the current work.
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
[3] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. AAAI, 2018.
[4] T. Chen, I. Goodfellow, and J. Shlens. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[6] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004.
[7] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
[8] P. Gong, J. Ye, and C.-s. Zhang. Multi-stage multi-task feature learning. In Advances in neural information processing systems, pages 1988–1996, 2012.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[10] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[11] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
[12] H. Jung, J. Ju, M. Jung, and J. Kim. Less-forgetful learning for domain expansion in deep neural networks. arXiv preprint arXiv:1711.05959, 2017.
[13] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, page 201611835, 2017.
[14] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[16] Y. LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[17] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[18] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 339–348. AUAI Press, 2009.
[19] X. Liu, M. Masana, L. Herranz, J. Van de Weijer, A. M. Lopez, and A. D. Bagdanov. Rotate your networks: Better weight consolidation and less catastrophic forgetting. arXiv preprint arXiv:1802.02950, 2018.
[20] D. Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
[21] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
[22] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proc. CVPR, 2017.
[23] C. Robert. Machine learning, a probabilistic perspective, 2014.
[24] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[25] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[26] J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[28] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
[29] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[30] J. Yoon, E. Yang, J. Lee, and S. J. Hwang. Lifelong learning with dynamically expandable networks. 2018.
[31] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. arXiv preprint arXiv:1703.04200, 2017.
[32] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.