Neural Architecture Search for Class-incremental Learning

2019·Arxiv

Abstract

In class-incremental learning, a model learns continuously from a sequential data stream in which new classes occur. Existing methods often rely on static architectures that are manually crafted. These methods can be prone to capacity saturation because a neural network’s ability to generalize to new concepts is limited by its fixed capacity. To understand how to expand a continual learner, we focus on the neural architecture design problem in the context of class-incremental learning: at each time step, the learner must optimize its performance on all classes observed so far by selecting the most competitive neural architecture. To tackle this problem, we propose Continual Neural Architecture Search (CNAS): an autoML approach that takes advantage of the sequential nature of class-incremental learning to efficiently and adaptively identify strong architectures in a continual learning setting. We employ a task network to perform the classification task and a reinforcement learning agent as the meta-controller for architecture search. In addition, we apply network transformations to transfer weights from the previous learning step and to reduce the size of the architecture search space, thus saving a large amount of computational resources. We evaluate CNAS on the CIFAR-100 dataset under varied incremental learning scenarios with limited computational power (1 GPU). Experimental results demonstrate that CNAS outperforms architectures that are optimized for the entire dataset. In addition, CNAS is at least an order of magnitude more efficient than naively using existing autoML methods.

Introduction

Continual learning, or lifelong learning (Parisi et al. 2019), is the ability to acquire new knowledge while retaining previously learned experiences, and is one of the modern challenges of artificial intelligence. Various methods, referred to as continual learners, have been proposed to tackle continual learning. As seen in Table 1, some continual learners rely on a static architecture which is manually crafted. These methods are susceptible to the phenomenon of capacity saturation (Sodhani, Chandar, and Bengio 2018), where a neural network’s ability to generalize to new concepts is limited by its fixed capacity.

In this paper, we propose to continuously adapt the neural network architecture as new data arrive, and we focus on the class-incremental learning setting. Rebuffi et al. (2017) introduced class-incremental learning, where an algorithm learns continuously from a sequential data stream in which new classes occur. Motivated by designing an expandable continual learner, we aim to solve the continual architecture design problem where, at each time step of class-incremental learning, the learner must optimize its performance on all classes observed so far by selecting the most competitive neural architecture.

Any continual learner faces the challenge of catastrophic forgetting, where learning new information interferes with previously acquired knowledge (McCloskey and Cohen 1989); that is, the learner forgets how to perform old tasks when new ones are learned. In the continual learning literature, a constraint on the storage of past data is often enforced, such as in (Chaudhry et al. 2018) and (Rebuffi et al. 2017), and preventing catastrophic forgetting is thus one of the main focuses of existing approaches to continual learning. In contrast, we store all data that has been observed in the past and maintain a growing dataset as more training examples become available. Catastrophic forgetting is then addressed by rehearsing on past data. We argue that this is a realistic setting since data storage is rarely an issue when compared to computation time. This allows us to fully use the available data to optimize the architecture selection, while still being computationally efficient.

Recent techniques for automatically designing deep neural networks using reinforcement learning (RL) agents have shown promising results. Methods such as Neural Architecture Search (NAS) (Zoph and Le 2016) and Efficient Architecture Search (EAS) (Cai et al. 2018) employ a policy gradient approach called REINFORCE (Williams 1992), allowing for high flexibility in the policy network design. EAS further proposes to use Net2Net (Chen, Goodfellow, and Shlens 2016) transformations to initialize sampled architectures, thus achieving huge computational savings.

To address the continual architecture design problem, we propose Continual Neural Architecture Search (CNAS). CNAS consists of three parts: a task network for solving the classification task, a deep reinforcement learning based meta-controller for adaptively exploring the architecture search space and a heuristic function for deciding when to expand the continual learner. Each time new data arrive, the meta-controller generates candidate architectures using Net2Net (Chen, Goodfellow, and Shlens 2016) transformations of the current task network. The decision of whether to expand the current architecture is based on a heuristic function of the performance of all the candidate architectures on a held-out dataset. This process allows the network structure to adaptively evolve in reaction to the arrival of new classes or to other changes in the data distribution.

Table 1: Approaches used by some recent methods to various challenges in continual learning.

The autonomous nature of CNAS makes it an autoML approach (Mendoza et al. 2016; Feurer et al. 2015), offering an efficient and off-the-shelf learning system that avoids the tedious tasks of manually selecting the correct neural architecture at each time step. As the observed dataset becomes more complex or includes examples from multiple training distributions, manually designing the architecture for a continual learner is not only time-consuming but also increasingly difficult. Therefore, reducing human intervention is a natural progression to develop robust and self-sufficient continual learners.

Summary of the contributions We formalize the continual architecture design problem and experimentally show that dynamically adjusting the neural architecture of a continual learner results in stronger performance than using a static architecture. To the best of our knowledge, our proposed method CNAS is the first approach for the continual architecture design problem in class-incremental learning, as well as the first continual autoML method. Our experiments on the CIFAR-100 dataset (Krizhevsky, Hinton, and others 2009) show that CNAS constitutes a sound and promising approach to various class-incremental learning scenarios. In particular, CNAS automatically designs parameter-efficient networks that outperform those optimized for the entire CIFAR-100 dataset at each time step of the learning process. Furthermore, CNAS is at least an order of magnitude faster than the naive approach of conducting a full-scale neural architecture search with alternative autoML methods at each time step.

Preliminaries: Continual Learning

In this section, we introduce the continual learning setting. In particular, we formalize class-incremental learning and describe how it is different from the related setting of task-incremental learning. Then, we explain the continual architecture design problem in class-incremental learning.

Class-incremental Learning In the class-incremental learning setting, a model learns continuously from a sequential data stream in which new classes occur (Rebuffi et al. 2017). At any time step, the learner is required to perform multi-class classification for all classes observed so far. Formally, the goal of class-incremental learning is to learn, at each time step $T$, a classifier $f_T : X \to \mathcal{Y}$ given the aggregation of the datasets seen up to now ($\bigcup_{t=1}^{T} \mathcal{D}_t$), where each dataset $\mathcal{D}_t \subset X \times \mathcal{Y}$.

Here, $X$ is the input space and $\mathcal{Y}$ is the set of categories. At each time step $t$, new classes can be introduced into the training data. Denoting by $\mathcal{C}_t$ the set of classes present in $\mathcal{D}_t$, we assume that each dataset $\mathcal{D}_t$ is identically and independently drawn from the distribution $D_{|\mathcal{C}_t}$, where $D$ is an unknown distribution over $X \times \mathcal{Y}$ and $D_{|\mathcal{C}_t}$ denotes that $D$ is conditioned on labels belonging to $\mathcal{C}_t$. In this setting, the learning objective at time $t$ corresponds to identifying a hypothesis $f_t$ that minimizes the risk over the classes seen so far:

$$f_t \in \arg\min_{f} \; \mathbb{E}_{(X, Y) \sim D_{|\mathcal{C}_1 \cup \cdots \cup \mathcal{C}_t}} \left[ L(f(X), Y) \right] \qquad (1)$$

where $L$ is a loss function penalizing prediction errors over the random variables $(f(X), Y)$. The simplest scenario for class-incremental learning is the one where k new classes are introduced at each time step, which we refer to as k-class incremental learning. This is the setting that most of the existing literature has experimented with. In this work, we also consider more realistic continual learning scenarios where (a) not all training data for a particular class is available at once (i.e., data for one class can be spread out over several distant time steps) and (b) the number of unseen classes arriving at each time step is unknown.

In accordance with the learning objective defined in Eq. (1), a natural metric to evaluate the performance of a model at test time is the average incremental accuracy introduced in (Rebuffi et al. 2017). The average incremental accuracy at time step t is the test accuracy of the model on the part of the test data consisting only of the classes seen up to time t:

$$A_t = \frac{1}{C} \sum_{i=1}^{C} a_i \qquad (2)$$

where $C$ is the total number of classes seen until time t and $a_i$ is the test accuracy of the model on category i when discriminating among the $C$ classes.
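As a small illustration of this metric, the sketch below averages per-class test accuracies over the classes seen so far. The function name and the dictionary input are illustrative assumptions and are not taken from the original implementation.

```python
def average_incremental_accuracy(per_class_accuracy, seen_classes):
    """Mean test accuracy over the classes observed so far.

    per_class_accuracy: dict mapping class id -> accuracy in [0, 1], where each
        accuracy is measured while discriminating among all seen classes.
    seen_classes: iterable of class ids observed up to the current time step.
    """
    seen = list(seen_classes)
    if not seen:
        raise ValueError("at least one class must have been observed")
    return sum(per_class_accuracy[c] for c in seen) / len(seen)


# Hypothetical accuracies: class 3 has not been introduced yet, so it is ignored.
acc = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.1}
score = average_incremental_accuracy(acc, seen_classes=[0, 1, 2])
```

With the same number of test images per class, as in the experiments reported later, this per-class average coincides with the overall accuracy on the test images of the seen classes.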

Related Work and Task-incremental Learning Lopez-Paz and Ranzato (Lopez-Paz and others 2017) defined the goal of continual learning as learning a predictor $f : X \times \mathcal{T} \to \mathcal{Y}$, where $\mathcal{T}$ refers to a set of task descriptors. Often in experiments (Xu and Zhu 2018; Lopez-Paz and others 2017; Chaudhry et al. 2018), an image classification dataset such as CIFAR-100 (Krizhevsky, Hinton, and others 2009) or MNIST (LeCun et al. 1998a) is separated into N tasks (each containing k categories). Therefore, the predictor becomes dependent on the task descriptor to first identify which subset of categories the sample belongs to, before performing k-ary classification within the given subcategories. Because a task descriptor has to be given with each feature vector, we consider the related continual learning definition proposed by Lopez-Paz and Ranzato (Lopez-Paz and others 2017) as task-incremental learning (using the same terminology as (van de Ven and Tolias 2018)), a separate learning paradigm from class-incremental learning. An illustration of the different settings in continual learning can be seen in Figure 1. In another work on task-incremental learning, Xu et al. (Xu and Zhu 2018) use a reinforcement learning agent to decide how many nodes or filters to add to the layers of a fixed-depth neural network. Since CNAS can increase both the width and the depth of a neural network, it explores a more complex architecture space.

Figure 1: The Venn diagram of continual learning with canonical references.

Continual Architecture Design

In this work, we define continual architecture search as the setting where, at each time step t, the continual learner must select the best neural architecture for classifying all classes seen so far. To tackle this setting, we assume that the learner has access to all the data up to time t. We further impose a constraint with practical settings in mind: the initial architecture at t = 1 is selected based on the initial dataset only. Continual architecture search is concerned with hyperparameter optimization on a growing dataset, while architecture search is traditionally conducted on a fixed training distribution. This difference implies that the architecture search space is continually growing, thus making exhaustive search methods (such as grid search) intractable from a computational standpoint. In contrast, CNAS takes advantage of the sequential nature of class-incremental learning by (i) limiting the architecture search space by considering the structure of the task network from the previous step as a starting point and (ii) using Net2Net techniques to rapidly transfer weights from the previous step.

Continual Neural Architecture Search (CNAS)

In this section, we present our proposed method: Continual Neural Architecture Search (CNAS). At any given time step t, CNAS provides a deep neural network with trained weights that is able to classify all categories observed so far. There are three components: a task network, a meta-controller and a heuristic function.

The task network performs classification for all observed classes and is implemented as a standard deep convolutional neural network (CNN) (LeCun et al. 1998b) with convolutional, max-pooling, dropout (Srivastava et al. 2014) and fully-connected layers. At each time step, the number of neurons in the last layer of the task network is equal to the number of observed classes C, and through the softmax activation function, each output neuron predicts the conditional probability of a category given the input. In class-incremental learning, new neurons are added to the output layer each time new categories appear (these neurons are initialized with a zero-mean normal distribution for the weight matrix and zero for the bias term).
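The output-layer expansion described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code: the function name and the standard deviation `std=0.01` are assumptions, since the text only specifies a zero-mean normal distribution for the new weights and zeros for the new biases.

```python
import numpy as np


def expand_output_layer(weights, bias, num_new_classes, std=0.01, rng=None):
    """Append output neurons for newly observed classes.

    weights: array of shape (num_features, num_old_classes).
    bias: array of shape (num_old_classes,).
    New weight columns are drawn from a zero-mean normal distribution
    (std is an assumed value) and new bias entries are zero. Existing
    parameters are kept unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    new_w = rng.normal(0.0, std, size=(weights.shape[0], num_new_classes))
    new_b = np.zeros(num_new_classes, dtype=bias.dtype)
    return np.concatenate([weights, new_w], axis=1), np.concatenate([bias, new_b])


rng = np.random.default_rng(0)
w_old = rng.normal(size=(8, 10))  # 8 input features, 10 classes seen so far
b_old = rng.normal(size=10)
w_new, b_new = expand_output_layer(w_old, b_old, num_new_classes=2, rng=rng)
```

Because the old columns are copied verbatim, the logits of previously seen classes are unchanged right after the expansion; only the softmax normalization shifts once the new logits are included.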

The meta-controller is specialized in generating an architecture search policy to sample new candidate architectures for the task network when new classes arrive. The controller is implemented as a deep reinforcement learning agent. The role of the meta-controller is only to guide the architecture sampling process, by selecting promising architectures to try out based on experiences gathered from previous time steps. The selection of the best architecture out of the sampled ones is based on a validation set. This can be seen as a one-step ahead planning guided by the meta-controller to explore good candidates.

Lastly, the heuristic function considers the validation performance of all the sampled architectures and decides if an expansion is beneficial in the current step. Preventing unnecessary expansions reduces the computational time in subsequent steps and increases the parameter efficiency of the task network.

Training Procedure Algorithm 1 describes the training procedure for CNAS when a new dataset arrives. The task network is first trained with a combined dataset of past and new examples and is then used as the starting point for ArchSearch (Algorithm 2). ArchSearch then outputs the validation accuracies of all the sampled architectures and the best performing candidate architecture. HeuristicFunc (Algorithm 3) then decides if expanding the current task network is beneficial based on the validation performance differences between the sampled candidate architectures and the existing architecture. When deciding to expand, the best performing sampled architecture becomes the new task network structure. If no expansion is needed, no change is made to the current architecture. This new task network is then further trained on the available data to ensure it has converged. The number of candidate architectures that can be sampled per time step is a hyperparameter of our algorithm and it controls the trade-off between computational complexity and exploration depth.
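The control flow of one CNAS time step, as described in the paragraph above, can be sketched as follows. This is only an outline under stated assumptions: `cnas_step` and its callable arguments (`train`, `evaluate`, `arch_search`, `heuristic_func`) are hypothetical placeholders, and the toy stand-ins below exist solely to make the sketch executable.

```python
def cnas_step(task_net, data, train, evaluate, arch_search, heuristic_func):
    """One time step of the procedure, with all components passed in.

    train(net, data) -> trained net
    evaluate(net, data) -> validation accuracy of net
    arch_search(net, data) -> (candidate_accuracies, best_candidate)
    heuristic_func(candidate_accuracies, current_accuracy) -> bool
    """
    # 1. Train the current network on past and new examples combined.
    task_net = train(task_net, data)
    current_acc = evaluate(task_net, data)
    # 2. Sample candidate architectures starting from the trained network.
    candidate_accs, best_candidate = arch_search(task_net, data)
    # 3. Expand only if the heuristic judges the expansion beneficial.
    if heuristic_func(candidate_accs, current_acc):
        task_net = best_candidate
    # 4. Train the (possibly expanded) network further.
    return train(task_net, data)


# Toy stand-ins: a "network" is a dict with a size and an epoch counter.
def toy_train(net, data):
    return {**net, "epochs": net["epochs"] + 1}


def toy_evaluate(net, data):
    return 0.50


def toy_arch_search(net, data):
    candidates = [{**net, "size": net["size"] + k} for k in (1, 2, 3)]
    accs = [0.52, 0.58, 0.55]
    return accs, candidates[accs.index(max(accs))]


def toy_heuristic(candidate_accs, current_acc):
    return all(a > current_acc for a in candidate_accs)


net = cnas_step({"size": 4, "epochs": 0}, None,
                toy_train, toy_evaluate, toy_arch_search, toy_heuristic)
```

Passing the components as callables keeps the sketch independent of any particular deep learning framework.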

One could greedily expand the continual learner at each time step (i.e., always adopt the best sampled architecture as the new task network in Algorithm 1). However, this can not only reduce parameter efficiency but also potentially affect future performance. The heuristic function (HeuristicFunc, see Algorithm 3) is designed to evaluate the benefit of expansion based on the difference in validation performance between all sampled architectures and the existing architecture. If capacity saturation occurs, expanding the architecture will likely result in performance improvement, and architecture expansion is considered necessary. However, when only a small portion of the expanded structures shows gains in performance, it is likely that these improvements are due to the randomness in network training, and architecture expansion is not required.
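One possible reading of this heuristic is sketched below. The exact rule of Algorithm 3 is not reproduced here, so the decision criterion (expand when at least a fraction `min_fraction` of the sampled architectures beat the current one) and the default threshold of 0.5 are assumptions made purely for illustration.

```python
def should_expand(candidate_accs, current_acc, min_fraction=0.5):
    """Decide whether to expand, given validation accuracies.

    Assumed rule: expand only when a sufficiently large share of the sampled
    architectures improve on the current network, so that isolated gains
    caused by training noise do not trigger an expansion.
    """
    if not candidate_accs:
        return False
    improved = sum(1 for acc in candidate_accs if acc > current_acc)
    return improved / len(candidate_accs) >= min_fraction


# Most candidates improve: consistent with capacity saturation.
saturated = should_expand([0.61, 0.63, 0.60, 0.62], current_acc=0.55)
# A single candidate improves slightly: likely noise, so keep the architecture.
noisy = should_expand([0.56, 0.54, 0.53, 0.55], current_acc=0.55)
```

A greedy learner corresponds to ignoring this check and expanding whenever the best candidate is better than the current network.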

Net2Net Transformations To save the computational cost of training each sampled architecture from scratch, we use a transfer learning technique called Net2Net (Chen, Goodfellow, and Shlens 2016). Net2Net enables a rapid transfer of information from one neural network to another by expanding/creating fully-connected and convolutional layers using two types of operations. Net2WiderNet operations replace a given layer by a wider one (more units for fully-connected layers or more filters for convolutional layers) while preserving the function computed by the network. Net2DeeperNet operations insert a new layer that is initialized as an identity mapping between two existing layers, thus preserving the function computed by the neural network. More formally, Net2DeeperNet replaces a layer $h^{(i)} = \phi(W^{(i)} h^{(i-1)})$ with two layers $h^{(i)} = \phi(I \, \phi(W^{(i)} h^{(i-1)})) = \phi(W^{(i)} h^{(i-1)})$, where $I$ is the identity matrix. However, the last equality is true only if the activation function $\phi$ is such that $\phi(I \, \phi(v)) = \phi(v)$ for all vectors $v$, which holds for the rectified linear activation (ReLU). Therefore, we use ReLU activation for all hidden layers.

Net2WiderNet and Net2DeeperNet operations can be applied sequentially to grow the original network in both width and depth. In this way, any architecture that is strictly larger than the original can be initialized to preserve the function computed by the original network. This allows CNAS to use a trained network as a starting point for architecture search and quickly initialize new larger architectures. By using Net2Net, the capacity of the task network can be expanded efficiently and dynamically for stronger performance as new data become available. Further details regarding Net2Net transformations are provided in the original paper (Chen, Goodfellow, and Shlens 2016).
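To make the function-preserving property concrete, the following NumPy sketch applies both operations to a small fully-connected ReLU network and leaves its outputs unchanged. It is a simplified illustration of the ideas in (Chen, Goodfellow, and Shlens 2016), not the implementation used in CNAS: the helper names are hypothetical, only fully-connected layers are handled, and no noise is added to the replicated units.

```python
import numpy as np


def forward(x, layers):
    """Fully-connected network: ReLU on hidden layers, linear output layer."""
    for w, b in layers[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = layers[-1]
    return x @ w + b


def net2deeper(layers, index):
    """Insert an identity-initialized layer after hidden layer `index`.

    `index` must refer to a hidden (ReLU) layer, so the new layer is also
    followed by ReLU and relu(relu(h) @ I) == relu(h).
    """
    width = layers[index][0].shape[1]
    identity = (np.eye(width), np.zeros(width))
    return layers[: index + 1] + [identity] + layers[index + 1:]


def net2wider(layers, index, new_width, rng):
    """Widen hidden layer `index` to `new_width` units by replicating units."""
    (w1, b1), (w2, b2) = layers[index], layers[index + 1]
    old_width = w1.shape[1]
    # Keep every original unit, then copy randomly chosen existing units.
    extra = rng.integers(0, old_width, size=new_width - old_width)
    mapping = np.concatenate([np.arange(old_width), extra])
    counts = np.bincount(mapping, minlength=old_width)
    w1_new, b1_new = w1[:, mapping], b1[mapping]
    # Divide outgoing weights by the replication count so that the summed
    # contribution of each group of copies equals the original contribution.
    w2_new = w2[mapping, :] / counts[mapping][:, None]
    return layers[:index] + [(w1_new, b1_new), (w2_new, b2)] + layers[index + 2:]


rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 5)), rng.normal(size=5)),  # hidden layer, 5 units
    (rng.normal(size=(5, 3)), rng.normal(size=3)),  # output layer, 3 classes
]
x = rng.normal(size=(6, 4))
wider = net2wider(layers, index=0, new_width=8, rng=rng)
deeper = net2deeper(wider, index=0)
```

Both `wider` and `deeper` compute the same outputs as `layers` on `x` up to floating-point error, so a transformed network starts training from the accuracy of its parent.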

Reinforcement Learning Agent

We use the policy gradient method REINFORCE (Williams 1992) and design two policy networks, one for taking Net2WiderNet actions and one for taking Net2DeeperNet actions, under the simplifying assumption that the two types of actions are independent.

We describe continual architecture design as an RL problem: at each step, an agent observes the current state $s_t$ of the environment and samples actions $a_t$ (= network transformations) according to a stochastic policy $\pi_\theta$. For each sampled action, it observes a reward signal $r_t$, which is used along with a step size $\alpha$ to improve the policy for future time steps. For computational efficiency, a fixed number of architectures is sampled at each time step using the RL agent. The planning horizon is limited to one time step. Limiting the horizon acts as a complexity control method (Jiang et al. 2015) and results in only optimizing for the current distribution. The REINFORCE algorithm is simplified to:

$$\theta \leftarrow \theta + \alpha \, r_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

where $\theta$ represents the parameters of the policy networks for Net2WiderNet and Net2DeeperNet.

Figure 2: Flow chart of the policy network
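With a one-step horizon, a REINFORCE update reduces to moving the policy parameters along the reward-weighted gradient of the log-probability of each sampled action. The sketch below illustrates this for a softmax policy whose parameters are the action logits themselves. This is a deliberate simplification (the policy here ignores the state), and averaging over the sampled actions is an assumption rather than a detail taken from the paper.

```python
import numpy as np


def softmax(logits):
    shifted = np.exp(logits - np.max(logits))
    return shifted / shifted.sum()


def reinforce_update(logits, actions, rewards, step_size):
    """One-step REINFORCE update for a softmax policy over discrete actions.

    For a softmax policy, the gradient of log pi(a) with respect to the
    logits is one_hot(a) - pi. The update averages reward * gradient over
    the sampled (action, reward) pairs.
    """
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    for action, reward in zip(actions, rewards):
        one_hot = np.zeros_like(logits)
        one_hot[action] = 1.0
        grad += reward * (one_hot - probs)
    return logits + step_size * grad / len(actions)


# Four possible actions (0 to 3 transformations), uniform initial policy.
logits = np.zeros(4)
actions = [0, 2, 2, 3]            # sampled numbers of transformations
rewards = [-1.0, 1.0, 0.5, -0.5]  # hypothetical normalized rewards
new_logits = reinforce_update(logits, actions, rewards, step_size=0.1)
new_probs = softmax(new_logits)
```

After the update, the action that received positive rewards becomes more likely than under the uniform policy, and the action with the most negative reward becomes less likely.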

The architecture search for each time step is summarized in Algorithm 2. Any sampled architecture is trained for at most l epochs using early stopping. Due to the benefit of weight transfer through Net2Net transformations, sampled architectures only require training for a low number of epochs in practice.

Policy Networks The policy networks for Net2WiderNet and Net2DeeperNet, referred to as wider actor and deeper actor respectively, are identical in design, as seen in Figure 2, but trained independently. Encoding the task network’s architecture in detail into the state might be of little use to the RL agent, as such states are almost never repeated (since the architecture is continuously expanding). Therefore, we only include the number of convolutional layers and the number of fully-connected layers of the task network in the state. Moreover, to measure the disparity between the current training distribution and the previous one, the difference in validation accuracy of the task network on these two distributions is included in the state space. Lastly, the number of new classes received by the continual learner at the current time step is also added.

The wider and deeper actors decide the number of Net2WiderNet and Net2DeeperNet transformations to take respectively, and are implemented as multilayer perceptrons. Both the input and hidden layers have ReLU activation, while the output layer of the actor networks has a softmax activation. The i-th output neuron corresponds to the probability of taking i−1 transformations, so the first neuron always represents not taking any transformations. The predicted probabilities are then used as the parameters of a categorical distribution from which actions are sampled. In this way, even for the same input state, the number of transformations selected by the actor networks is stochastic.
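A forward pass of such an actor can be sketched in NumPy as follows. The four-dimensional state layout, the weight initialization scale, and the helper names are assumptions for illustration; the two hidden layers of 128 units follow the implementation details reported in the experiments.

```python
import numpy as np


def init_actor(state_dim, hidden_dim, max_transformations, rng):
    """Random parameters for an MLP with two hidden layers.

    The output layer has max_transformations + 1 units because index 0
    stands for taking no transformation.
    """
    sizes = [state_dim, hidden_dim, hidden_dim, max_transformations + 1]
    return [
        (rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def actor_probabilities(params, state):
    """ReLU hidden layers followed by a softmax output layer."""
    h = np.asarray(state, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    logits = h @ w + b
    shifted = np.exp(logits - logits.max())
    return shifted / shifted.sum()


def sample_num_transformations(params, state, rng):
    """Draw the number of transformations from the categorical distribution."""
    probs = actor_probabilities(params, state)
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
wider_actor = init_actor(state_dim=4, hidden_dim=128, max_transformations=3, rng=rng)
# Assumed state layout: [#conv layers, #fully-connected layers,
#                        validation accuracy difference, #new classes].
state = [4.0, 2.0, -0.12, 2.0]
probs = actor_probabilities(wider_actor, state)
num_wider = sample_num_transformations(wider_actor, state, rng)
```

Sampling from the categorical distribution, rather than taking the most probable action, is what keeps the number of transformations stochastic for a fixed state.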

Reward Design To best decide the number of transformations needed for each time step, we design a reward function based on the performance of the newly transformed architecture compared to the existing one (measured with average incremental accuracy from Equation 2). We consider the difference in validation accuracy between the original architecture and the sampled architecture, $r_t = \mathrm{acc}_{\text{new}} - \mathrm{acc}_{\text{old}}$. Here $r_t$ is the reward signal given to the agent at time step t after deciding on the number of Net2Net transformations, while $\mathrm{acc}_{\text{new}}$ and $\mathrm{acc}_{\text{old}}$ respectively stand for the validation accuracy of the sampled architecture and of the original architecture on the current dataset after training. Therefore, any architecture that performs worse than the original will provide a negative reward signal, while a better architecture will yield a positive one. To obtain better reward signals for learning, the rewards are normalized into the [-1, 1] range within all architectures sampled at time t. Lastly, we add an entropy term to the reward function in order to improve policy optimization (Ahmed et al. 2019).
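The reward computation can be illustrated as follows. The normalization scheme is not specified in the text, so dividing by the largest absolute raw reward among the sampled architectures is an assumption; the entropy term is omitted from this sketch.

```python
def normalized_rewards(candidate_accs, original_acc):
    """Rewards for the architectures sampled at one time step.

    Raw reward: validation accuracy of the sampled architecture minus that
    of the original architecture. The raw rewards are then scaled into
    [-1, 1] by the largest absolute value (assumed normalization).
    """
    raw = [acc - original_acc for acc in candidate_accs]
    scale = max(abs(r) for r in raw)
    if scale == 0.0:
        return [0.0 for _ in raw]
    return [r / scale for r in raw]


# Hypothetical validation accuracies of three sampled architectures,
# compared with an original architecture at 0.50.
rewards = normalized_rewards([0.60, 0.50, 0.45], original_acc=0.50)
```

Scaling by the largest absolute value preserves the sign of each raw reward, so architectures that are worse than the original still receive negative rewards.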

Experiments

We now describe our experimental setup and details regarding the implementation of CNAS (the code will be made publicly available). We repeat each experiment three times with different random seeds and report the standard deviation with error bars.

Dataset We split both the training set and test set of CIFAR-100 by class labels. The CIFAR-100 dataset contains a total of 60,000 images across 100 classes. In our experiments, each class is further split into 450 images as training set, 50 images as validation set and 100 images as test set. When a new class is introduced, all corresponding test data will start to be used for the calculation of average incremental accuracy. In the k-class incremental learning scenario, all corresponding training examples are presented as a new class is introduced. We also test scenarios where only a fraction of all training examples of a certain class becomes available at a time step. The examples contained in the validation set are used for architecture selection. The arrival order of the classes is based on the default labels given by the CIFAR-100 dataset. Each experiment starts with some initial classes (known as the base knowledge and considered as the dataset for time step 0).

Baselines We compare CNAS with the following baselines: (1) SA (Static Architecture): a continual learner with a static architecture that is selected given the knowledge of all 100 classes at once (i.e., optimized on the entire CIFAR-100 dataset); (2) RAS (Random Architecture Search): a continual learner that greedily expands its architecture whenever the best sampled architecture has a stronger validation performance. It uses a uniformly random architecture sampling strategy; (3) RAS-HF (Random Architecture Search with Heuristic Function): random architecture search with the same heuristic function as CNAS (see Algorithm 3). RAS and RAS-HF are compared with CNAS in the ablation study. We use average incremental accuracy on the test set as the evaluation metric.

Implementation We implement CNAS with Keras (Chollet and others 2015) using Tensorflow (Abadi et al. 2015) as the backend framework. All approaches are trained using the ADAM (Kingma and Ba 2014) optimizer, with all parameters other than the learning rate set to default values. All training is conducted with mini-batches of size 128. The task network is trained until convergence (with early stopping) both before and after the architecture search at each time step. Both the wider and deeper actors are implemented as a multilayer perceptron with 2 hidden layers, each having 128 neurons. The learning rate of the RL agent is 0.001 and the entropy regularization term is scaled by a factor of 0.01.

Figure 3: Performance of SA and CNAS in 2-class incremental learning

Figure 4: Parameter Growth Curve in 2-class incremental learning experiment

K-class Incremental Learning We compare the performances of CNAS and SA on k-class incremental learning experiments on CIFAR-100 for k = 2 and k = 10. For 2-class incremental learning, CNAS samples 20 architectures at each time step and can take at most 3 Net2WiderNet and 3 Net2DeeperNet actions. For 10-class incremental learning, 50 architectures are sampled at each time step and at most 10 Net2WiderNet and 5 Net2DeeperNet transformations can be taken. The base knowledge is the first 10 classes and the initial architecture for CNAS is optimized for the base knowledge only.

From Figures 3 and 5, we can see that CNAS outperforms SA in terms of average incremental test accuracy. Furthermore, CNAS consistently uses fewer parameters than SA, as shown in Figures 4 and 6. Due to having a smaller initial structure, CNAS is able to generalize better than SA on the base knowledge. As more classes are introduced, the difficulty of the task increases and larger architectures are required to avoid capacity saturation. Note that there are many steps where CNAS chooses not to expand and maintains its architecture (see Figures 4 and 6). The heuristic function ensures that only necessary expansions are taken.

Mixed-class Incremental Learning We introduce a more realistic incremental learning setting where the number of new classes can vary at each time step and additional training data from already seen classes can arrive at later time steps (referred to as mixed-class incremental learning). In this experiment, the continual learner will receive all the training data from k unseen classes and a portion p of the training data of either an existing class or an unseen class. At each step, k is chosen randomly from the range [1, 19] and p can be either 0.25 or 0.5. This scenario is motivated by the use case where the number of classes introduced at each step is unknown and data from some classes are spread out over different time steps. CNAS can sample up to 30 architectures at each step and take a maximum of 5 Net2WiderNet and 5 Net2DeeperNet operations. We see in Figures 7 and 8 that CNAS significantly outperforms SA while using fewer parameters. This is because CNAS can identify an architecture suited to the current training distribution and adapt its architecture accordingly.

Figure 5: Performance of SA and CNAS in 10-class incremental learning

Figure 6: Parameter Growth Curve in 10-class incremental learning experiment

Ablation Study Lastly, we consider a new and difficult incremental learning scenario where only half of the training data of a class arrives at a time step. Compared to CNAS, RAS has neither the heuristic function nor the RL meta-controller, while RAS-HF has the heuristic function but lacks the meta-controller. Figure 9 shows that CNAS has the best performance overall. In Figure 10, we see that RAS greedily expands its architecture at the beginning, which leads to complex models that are unable to generalize well to the task at hand. In addition, RAS leads to oversized models that become too costly to train using only 1 GPU (this is why the curve ends earlier). This shows that the heuristic function is important to prevent over-expansions. The RL meta-controller is also important for CNAS as it learns to narrow down the architecture search space based on the current learning paradigm. In the first time steps, RAS-HF obtains performances similar to CNAS. This is because the RL agent requires experience to adapt its policy from uniformly random to one that is tailored for the current incremental learning setting. Once 36 classes are learned, CNAS starts to consistently outperform RAS-HF while having a smaller task network. Overall, both the heuristic function and the RL meta-controller are essential components of CNAS in class-incremental learning settings.

Figure 7: Performance of SA and CNAS in mixed-class incremental learning

Figure 8: Parameter Growth Curve in mixed-class incremental learning experiment

Computational Time We report the average computational time on 1 GPU across 3 trials in this section. In the 2-class incremental learning experiment, CNAS used 93 hours and explored 900 architectures. Note that CNAS can take advantage of multiple GPUs and train many sampled architectures in parallel. In comparison, early neural architecture search approaches such as NAS (Zoph and Le 2016) performed architecture search on the CIFAR-10 (Krizhevsky, Hinton, and others 2009) dataset with 800 GPUs and trained 12,800 models from random initialization. A more recent approach, EAS (Cai et al. 2018), also uses Net2Net techniques for weight transfer and uses 5 GPUs for 2 days to train 450 CNNs. In this regard, CNAS is at least one order of magnitude faster than naively using autoML alternatives such as EAS and NAS.

In the ablation study experiment, CNAS used 26 hours and sampled 315 neural architectures. Each component in CNAS contributes to a greater computational efficiency. Without the RL meta-controller, RAS-HF also sampled 315 architectures but used 67 hours. Without the heuristic function, RAS explored 250 architectures while using 91 hours.

Figure 9: Performance of SA, RAS, RAS-HF and CNAS in the ablation study experiment

Figure 10: Parameter Growth Curve in the ablation study experiment

Discussion

CNAS requires very few hyper-parameters to be effective. The starting architecture of CNAS is optimized on the base knowledge (training data at time step 0). The number of sampled architectures as well as the maximum number of Net2Net transformations should be selected based on the available computational resources. Given a larger search space and more sampled architectures, CNAS is likely to find stronger models. CNAS avoids greedily picking the architecture with the best validation performance, which can quickly lead to overparametrized models. Indeed, the validation accuracy of any sampled architecture is not only determined by the effectiveness of the neural architecture but also affected by the stochasticity of parameter optimization. As seen in Figure 10, the heuristic function in CNAS plays an important role in avoiding unnecessary model expansions.

Conclusion

In this paper, we presented the problem of continual architecture design in class-incremental learning. We proposed CNAS, an efficient and economical autoML approach for continual learning. CNAS (i) reuses trained weights through Net2Net, (ii) implements an RL meta-controller to find the most effective architecture transformations and (iii) uses a heuristic function to decide when to expand the current architecture. Various incremental learning experiments on the CIFAR-100 dataset show that CNAS consistently outperforms architectures that are optimized on the entire dataset.

Acknowledgments

The authors gratefully acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institute for Advanced Research (CIFAR).

References

Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; et al. 2015. Tensorflow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org 1(2).

Ahmed, Z.; Le Roux, N.; Norouzi, M.; and Schuurmans, D. 2019. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, 151–160.

Cai, H.; Chen, T.; Zhang, W.; Yu, Y.; and Wang, J. 2018. Efficient architecture search by network transformation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Chaudhry, A.; Ranzato, M.; Rohrbach, M.; and Elhoseiny, M. 2018. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420.

Chen, T.; Goodfellow, I. J.; and Shlens, J. 2016. Net2net: Accelerating learning via knowledge transfer. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Chollet, F., et al. 2015. Keras.

Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.; Blum, M.; and Hutter, F. 2015. Efficient and robust automated machine learning. In Advances in neural information processing systems, 2962–2970.

Jiang, N.; Kulesza, A.; Singh, S.; and Lewis, R. 2015. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems.

Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P.; et al. 1998a. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P.; et al. 1998b. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

Lopez-Paz, D., et al. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 6467–6476.

McCloskey, M., and Cohen, N. J. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24. Elsevier. 109–165.

Mendoza, H.; Klein, A.; Feurer, M.; Springenberg, J. T.; and Hutter, F. 2016. Towards automatically-tuned neural networks. In Hutter, F.; Kotthoff, L.; and Vanschoren, J., eds., Proceedings of the Workshop on Automatic Machine Learning, volume 64 of Proceedings of Machine Learning Research, 58–65. New York, New York, USA: PMLR.

Parisi, G. I.; Kemker, R.; Part, J. L.; Kanan, C.; and Wermter, S. 2019. Continual lifelong learning with neural networks: A review. Neural Networks.

Rebuffi, S.-A.; Kolesnikov, A.; Sperl, G.; and Lampert, C. H. 2017. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2001–2010.

Sodhani, S.; Chandar, S.; and Bengio, Y. 2018. On training recurrent neural networks for lifelong learning. arXiv preprint arXiv:1811.07017.

Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958.

van de Ven, G. M., and Tolias, A. S. 2018. Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.

Xu, J., and Zhu, Z. 2018. Reinforced continual learning. In Advances in Neural Information Processing Systems, 899–908.

Yoon, J.; Yang, E.; Lee, J.; and Hwang, S. J. 2017. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547.

Zoph, B., and Le, Q. V. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
