In recent years, deep learning [18] has gained tremendous success in achieving human-level performance [13] and has even been able to surpass human experts [36] in a variety of domain-specific tasks. However, in the context of continual learning [37, 31, 21, 27], where humans can continuously adapt and learn new tasks throughout their lifetimes without forgetting the old ones, current deep learning methods struggle to perform well since the data distribution changes over the course of learning. This problem arises in Artificial Neural Networks (ANNs) because of the way input data is mapped into the network’s parametric representation during learning. When the input data distribution changes, network parameters are updated through gradient-based methods to minimize the objective function with respect to the data distribution of the current task only, without taking into account the distribution learned over the previous tasks. This leads to a phenomenon called ‘Catastrophic Forgetting’ [15] wherein the network forgets how to solve the older tasks upon being exposed to new ones.
Efforts to overcome catastrophic forgetting can be broadly divided into three prominent groups. The first set of methods involves a form of rehearsal over older tasks while training the current task, either with saved examples [30, 29, 23] or with synthetic data generated by a generator trained over older tasks [34]. While these methods are effective in alleviating catastrophic forgetting, they are not viable in real-world settings, where learning systems operate on a limited memory budget and access to data of the older tasks may be restricted due to either privacy issues or the dynamic nature of the environment that the system operates in. The second line of research focuses on regularization to penalize changes in important parameters in order to enable sharing of parameters over tasks [15, 41, 2, 25]. These methods make more efficient use of the available resources, but suffer from catastrophic forgetting as the number of tasks increases. The third set of techniques works on the concept of parametric isolation [3], where disjoint or overlapping sets of parameters are allocated to separate tasks, such as in [24, 9]. Other methods [40, 32] under this category grow the network with subsequent tasks, but these are impractical in resource-constrained environments, where the amount of memory and the number of compute units available are the bottleneck. The authors of [24] prune the network weights to remove redundancy for sequential task learning. However, their method induces unstructured sparsity in the network, which, unlike structured sparsity [39, 4], cannot be directly leveraged by the hardware to give energy benefits.
To address the problem of catastrophic forgetting in an energy, memory and data constrained environment, we propose a Principal Component Analysis (PCA) [1] based approach to identify redundancy within the filters (or hidden units in a fully-connected layer) for a given task. We transform the original internal space of representation into a reduced redundancy subspace and identify the filters associated with this subspace as important for that task. For the next task, to avoid catastrophic forgetting, these important filters are kept frozen and those pruned at the last task are learned. An advantage of our method is that, unlike most pruning techniques, it is free from heuristic choices of pruning thresholds; the number of filters (or units) to keep in each layer for each task is decided automatically. Moreover, contemporary methods seek to assign task-specific importance to the parameters, and use this parametric importance to decide the degree of plasticity of each parameter for the new tasks. In contrast, our method finds the important subspaces of intermediate representations for the current task and automatically adds new subspaces (if necessary) for the new tasks. In doing so, our method creates structured sparsity in the network which can be leveraged in the hardware. An important consequence of viewing things in terms of subspaces rather than individual elements is that we do not ascribe any importance to individual elements; rather, we transform them so that fewer of the new elements contain most of the information. We test our method on Permuted MNIST, split CIFAR-10 and split CIFAR-100 datasets and compare our results with other relevant methods. In the incremental learning setup, our method outperforms Elastic Weight Consolidation (EWC) [15] and Learning without Forgetting (LwF) [22] in terms of accuracy and memory utilization.
Our method yields comparable results to the state of the art method, PackNet [24], with the added benefit of being energy efficient in multi-task inference due to structured sparsity.
There are three prominent lines of research to enable Deep Neural Networks (DNNs) to learn continually. They were mentioned briefly in Section 1, and in this section, we go into the details of the representative works and highlight their contributions and differences with our work.
The first line of research uses some form of data augmentation in order to rehearse or replay the data from the older tasks while training on the current task. In iCaRL [29] and GEM [23], a subset of examples from the previous tasks are stored in the memory buffer, and in DGR [34], a generator is trained to create synthetic training data for the previous tasks. While training on the current task, the data from the previous tasks is interspersed with the current task data, allowing the network to jointly optimize the training distributions and successfully alleviating the problem of catastrophic forgetting. However, these techniques require extra memory to store either a subset of samples from previous tasks, or to store the parameters of a generator. Our method does not require any extra memory, since we do not rely on data from older tasks and we compress the available space before learning the new tasks, thus making better use of resources.
The second set of methods aims to resolve the problem of catastrophic forgetting by sharing the weights among the tasks. A way to enable this sharing is to use a regularizer that adds a penalty to the loss function to discourage any change in important parameters for the older tasks while learning the new task. The importance attributed to the element determines the degree of plasticity of the element. EWC [15] calculates the task specific parametric importance from the diagonal elements of the Fisher Matrix after training. Online EWC [33] computes this importance in a memory efficient fashion by taking the running sum of that Fisher matrix over all the previous tasks. On the other hand, Synaptic Intelligence (SI) [41] measures the parametric importance during training based on the sensitivity of the loss with respect to the parameters. Another method, LwF [22], instead of calculating parametric importance, computes a proxy distillation loss and adds it to the objective function to retain the activations of the initial network while training on the new data. Even though these methods show varying degrees of success in overcoming catastrophic forgetting, the balance between rigidity and plasticity ultimately breaks down as the complexity and the number of tasks increase, resulting in performance degradation on the older tasks. In contrast to these methods, we do not attribute individual importance to a parameter, instead we compress the information in these parameters into a reduced redundancy subspace. This subspace formulation leads to a compressed network with reduced number of filters (hidden units) at each layer. We sidestep the issue of catastrophic forgetting completely by freezing filters from the old tasks while training on all subsequent tasks.
The third set of techniques isolates task-specific parameters. Among these are techniques that adjust the architecture with pruning [24] or growing [32], or introducing parallel paths for new tasks. For instance, Progressive Neural Networks [32] replicate the network architecture for each new task, with each new layer laterally connected to the corresponding older layers. When training on new tasks, layers learned for older tasks are kept frozen and only the newly added layers are optimized. The major drawback of such methods is that the network size keeps on increasing significantly with increasing number of tasks. Another representative work in this set of techniques, which is also the most relevant to our method, is the state-of-the-art network, PackNet [24]. Leveraging the fact that DNNs are overparameterized [7], PackNet employs weight magnitude based
iterative pruning [12, 11] to free up unimportant parameters, and then uses these parameters for learning future tasks while keeping the old weights frozen. Like PackNet, our method also finds the redundancy in the network for pruning and utilizes the pruned portion of the network for learning future tasks. However, in contrast to keeping important and pruning redundant weights in PackNet, our method finds the compressed, important subspace of internal representation and prunes the remaining redundant space. This translates to a network with a reduced number of filters and hence structured sparsity [39]. Such sparsity can be leveraged better in hardware [4], which translates to better energy efficiency during inference. Moreover, unlike existing weight-level and filter-level pruning methods [20, 26] which ascribe importance to static parameters and select a subset of them, our method dynamically transforms the filters in a manner that compresses information in a small number of filters. Additionally, our method does not involve any heuristic for choosing the number of parameters to be pruned for each layer and each task. Rather, it automatically allocates filters for new tasks based on the necessity, which is a desirable trait for autonomous agents learning sequential tasks.
Figure 1. Illustration of activation matrix generation for PCA from a typical convolutional layer.
In this section, we describe our method for incremental training of deep neural networks in the general continual learning paradigm, wherein an unknown number of tasks with varying data distributions arrive at the model sequentially.
3.1. Core and Residual Representational Spaces
From a highly abstracted point of view, our algorithm proposes to break the space described by the filters in each layer into two partitions, which we call the ‘Core’ space and the ‘Residual’ space. The Core is where we keep adding compressed information pertaining to the tasks previously seen, and the Residual is reserved for the current task only. All the information present in the Core space is leveraged by the current task, but it is not modified, whereas the current task can update the Residual space freely. The Residual space is then compressed (while retaining most of the information) and added to the Core space. For the next task, the Core space has grown and the now smaller Residual space is freed up to learn the new task specific filters.
When a network starts incrementally learning a new task, it has some frozen filters in each layer that represent the Core space, or the essential filters of all the previous tasks. The network learns each new task by following a three-step procedure. First, the network is trained on the data pertaining to the current task. Both the Residual and Core spaces are used to generate activations, but the learning is only done in the Residual space. Once the network converges on the current step, the second step utilizes a PCA based transformation algorithm to identify the redundancy in the non-frozen, freshly learned filters (Residual space) in each layer for the current task. In the final step, the Core and the Residual space are analyzed together to decide the number of filters to be added from the Residual to the Core, and the remaining redundant filters are pruned and fine-tuning (retraining) is performed to mitigate any drop in accuracy. For each task, steps 2 and 3 of compressing the Residual space and adding it to the Core space, followed by fine-tuning are performed for each layer sequentially, except the classifier layer. In the next section, we outline how to use PCA to identify redundancy and compress information into a smaller subspace.
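The first step, in which both spaces produce activations but only the Residual space is updated, can be sketched as a masked gradient step. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `masked_sgd_step` and `frozen_mask` are hypothetical, filters are stored as columns of a 2D weight matrix, and plain SGD without momentum is used for brevity.

```python
import numpy as np

def masked_sgd_step(W, grad, frozen_mask, lr=0.1):
    """One SGD step that updates only the Residual (non-frozen) filters.

    W and grad hold one filter per column; frozen_mask[j] is True when
    filter j belongs to the Core space and must not change.
    """
    # Zero the gradient of every frozen (Core) column before the update.
    trainable = (~frozen_mask)[None, :]
    return W - lr * grad * trainable

rng = np.random.default_rng(0)
W = rng.normal(size=(27, 8))        # 8 filters, each flattened to 27 weights
grad = rng.normal(size=W.shape)     # stand-in gradient for the current task
frozen_mask = np.zeros(8, dtype=bool)
frozen_mask[:3] = True              # assume the first 3 filters form the Core

W_new = masked_sgd_step(W, grad, frozen_mask)
```

The frozen columns still take part in the forward pass because `W_new` keeps them in place; only their update is suppressed.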
3.2. Filter Redundancy Detection and Compression
It has been shown that most common DNN architectures [17, 35] have highly correlated filters within each layer that potentially detect the same features [10], hence making insignificant contributions to accuracy. Garg et al. [10] find such correlations by applying PCA on the activation maps generated by these filters, and use it to obtain an optimal architecture (by removing redundant filters and layers) from a trained network. They trained this structurally compressed architecture from scratch, with negligible loss of accuracy. However, their method only applies to a single task, and cannot be trivially extended to the continual learning domain, since the data from the older tasks is not available and training from scratch overwrites the weights with the information of the latest task, inducing catastrophic forgetting. In this section, we explain the original method and outline our algorithm, which counters the problem of forgetting with a layer-wise transformation and compression step that structurally organizes the knowledge from the sequential tasks.
Figure 2. A toy 2D data representational space and corresponding PCs, when (a) Task A and (b) Task B are learned separately and (c) Task B is incrementally learned on top of Task A. The old (compressed) representational space corresponding to the dominant PC of Task A makes up the Core, but can only partially explain the new data variance in B, necessitating the addition of a new subspace from the Residual.
PCA [1] is a well-known dimensionality reduction technique that can be used to remove redundancy between correlated features in a dataset. In the context of neural networks, since we are interested in detecting redundancies between filters, we use the (pre-ReLU) activations, which are instances of filter activity, as feature values. Figure 1 illustrates the process of data collection for PCA. Let $N_i$ ($N_o$) denote the number of input (output) channels for a convolutional layer and $h_i, w_i$ ($h_o, w_o$) denote the height and width of the input (output) feature maps. The 4D weight tensor for a particular layer is flattened to a 2D weight matrix (W) for illustration in the figure, where each filter, flattened to a vector in $\mathbb{R}^{k \cdot k \cdot N_i}$ with k the kernel size, is represented as a column of W. This filter acts upon a similar sized input patch from the feature map feeding into the layer. With the input patch likewise flattened to a vector in $\mathbb{R}^{k \cdot k \cdot N_i}$, the weight matrix can be represented as $W \in \mathbb{R}^{(k \cdot k \cdot N_i) \times N_o}$. Upon convolution with all the filters ($N_o$ in total), one input patch is represented as a vector in the internal representational space, $\mathbb{R}^{N_o}$. Since there are $h_o \times w_o$ input patches per example, for m input examples there will be $n = m \times h_o \times w_o$ patches. Thus after convolution with m examples we obtain a flattened activity matrix $A \in \mathbb{R}^{n \times N_o}$, where n represents the total number of samples. Since each row of this matrix corresponds to the representation of an input patch in $\mathbb{R}^{N_o}$ and each column represents instances of filter activity for all input patches over all examples, this 2D matrix is ideal for performing PCA and detecting redundancy in filter representations.
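The construction of the activity matrix can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' code: the function name `activation_matrix`, the channels-last input layout, stride 1 and the absence of padding are all assumptions made here for brevity.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def activation_matrix(inputs, weights):
    """Build the 2D activity matrix A for one convolutional layer.

    inputs:  (m, height, width, in_channels), channels-last (assumed layout)
    weights: (k, k, in_channels, out_channels)
    Returns A with one row per input patch and one column per filter.
    Assumes stride 1 and no padding.
    """
    k, _, n_in, n_out = weights.shape
    # windows has shape (m, h_out, w_out, in_channels, k, k)
    windows = sliding_window_view(inputs, (k, k), axis=(1, 2))
    # Reorder to (m, h_out, w_out, k, k, in_channels) so that a flattened
    # patch lines up with a flattened filter.
    patches = windows.transpose(0, 1, 2, 4, 5, 3).reshape(-1, k * k * n_in)
    W = weights.reshape(k * k * n_in, n_out)   # one filter per column
    return patches @ W                          # pre-ReLU activations

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 6, 6, 3))     # m = 4 toy examples
weights = rng.normal(size=(3, 3, 3, 5))    # 5 filters with 3x3 kernels
A = activation_matrix(inputs, weights)     # 4 * 4 * 4 = 64 patches, 5 filters
```

Each row of `A` is the response of all filters to one patch, which is the layout PCA expects: samples in rows, features (filters) in columns.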
Filters are correlated if their convolution with different input patches across different samples produces similar patterns of activations, in which case the rank of A can be reduced. This means that the internal representations (activations) lie in a lower dimensional subspace, $\mathbb{R}^{p}$ inside $\mathbb{R}^{N_o}$ ($N_o$ being the number of filters in the layer), where $p < N_o$. Applying PCA on A, we find the number of principal components (PCs), p, required to cumulatively explain x% of the total variance in the input, where x (variance threshold) is usually chosen in the range of 99 to 99.9 [10, 28]. From the first p PCs, we can construct a transformation matrix $T \in \mathbb{R}^{N_o \times p}$. When this transformation is applied to A, we get a low rank activation matrix, $A T \in \mathbb{R}^{n \times p}$, where each input patch will have a reduced dimensional representation in $\mathbb{R}^{p}$. Since $A = X W$, where $X$ is the input, we can apply the same transformation to W to obtain the reduced filter space that produces the reduced dimensional representation in $\mathbb{R}^{p}$. In doing so, we obtain a new weight matrix, $W T$, with fewer filters, where each of the p new filters is obtained from a linear combination of all the old filters. Hence, our method does not just select a subset of the original filters, but compresses the old filter information into a reduced set of new filters [10]. We call this process of filter space transformation and reduction PCA Compression.
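A minimal NumPy sketch of PCA Compression for a single layer is given below. It is an illustration under assumptions, not the reference implementation: the name `pca_compress` is hypothetical, the activations are mean-centered before the eigendecomposition (standard PCA practice), and the toy weight matrix is deliberately built with only two independent filter directions.

```python
import numpy as np

def pca_compress(A, W, x=99.0):
    """Keep the fewest PCs of A that explain x% of its variance.

    A: (n, filters) activity matrix, W: (weights_per_filter, filters).
    Returns the transformation T, the compressed weights W @ T, and p.
    """
    A_centered = A - A.mean(axis=0)
    cov = A_centered.T @ A_centered / (A.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending order
    order = np.argsort(eigvals)[::-1]                 # rank PCs by variance
    eigvals = np.clip(eigvals[order], 0.0, None)      # guard tiny negatives
    eigvecs = eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum() * 100.0
    p = min(int(np.searchsorted(explained, x)) + 1, len(eigvals))
    T = eigvecs[:, :p]                                # (filters, p)
    return T, W @ T, p

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 27))                        # flattened input patches
# Six filters that span only two directions, so four of them are redundant.
W = rng.normal(size=(27, 2)) @ rng.normal(size=(2, 6))
A = X @ W
T, W_new, p = pca_compress(A, W, x=99.0)
```

Because `A = X @ W`, the compressed weights reproduce the reduced representation directly: `X @ W_new` equals `A @ T`.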
PCA Compression can be applied to a pretrained network to obtain a structurally compressed network, in a layer-wise manner starting from the first layer. Transforming and compressing the first layer affects all the subsequent layers, which cannot be similarly transformed due to the non-linearity. Therefore, after each compression step, we fine-tune the whole network to recover the original accuracy.
3.3. Sharing Representational Space Among Tasks
Even though the PCA Compression method is very effective in finding a compressed architecture for a single task, it needs modifications to be effective in sequential task learning scenarios. One such challenge is to avoid catastrophic forgetting of the older tasks when learning newer tasks, because the PCA compression step generates a transformation matrix based on the data from the current task only, and applying this transformation to all the filters erases the representations of the older tasks.
We propose to solve this problem by encouraging sharing of internal representational space among tasks. Figure 2 gives a conceptual depiction of how sharing is induced in a 2D representational space between two tasks. In this toy formulation, we assume that the two filters available to us form the basis of the space of the internal representations. Adding or removing a filter to a layer increases or decreases the dimension of this space. Assume that Task A results in the data representation as shown in Figure 2(a). It is evident that one PC direction (the first PC of Task A) explains almost all of the variance of the data. Likewise, as shown in Figure 2(b), when Task B is learned separately, just the first PC of Task B is almost sufficient for data representation. In both cases individually, PCA Compression would combine two original filters into one compressed filter which will retain almost all the data variance and remove one filter to produce a low dimensional (1D) internal representation. Figure 2(c) shows the continual learning scenario, where Task A has already been learned and it has a compressed 1D representation, corresponding to its dominant PC. This is the Core space, and will remain frozen when we learn subsequent tasks. From the figure it is evident that this Core space cannot explain all the data variance in Task B. Therefore with our proposed method, we find how much variance is explained by this Core space for Task B, and we compress the information in the Residual space into a subspace that maximally explains the remaining data variance of Task B. In this toy setup, our method finds that Task B needs an extra basis (filter) for preserving internal representation. In higher dimensions, we want to find a reduced-dimensional subspace in the Residual space that contains all relevant information and add it to the Core space before learning the next task. Thus, in the incremental learning setup, our method will continue to share the learned space of representation and will judiciously add new filters if required to explain the variance in the internal representation of the new task.
Figure 3. PCA Transformation-Selection Algorithm for a Layer: A model trained on previous tasks (here, Task A), with the pruned filters randomly initialized and the core filters frozen, is trained on the next task, Task B. The PCA Transformation step constructs a transformation matrix (T) from the activation matrix (A) with columns of A corresponding to the frozen filters zeroed out. A new weight matrix (with old frozen filters and new PCA compressed filters) is constructed by applying this transformation to W. The Selection step decides the number of filters required for Task B and constructs the final weight matrix after pruning unimportant filters. Then, the whole network is fine-tuned.
3.4. PCA Transformation-Selection Algorithm
In Section 3.2, we have already described how the first task in the continual scenario can be learned in a structurally compressed network with PCA Compression. The information is compressed into the Core space, freeing up the Residual space. The subsequent task is learned freely in the Residual space, but still utilizes the filters from the Core space. Here, we formulate the method for finding a compressed (dimensionality reduced) Residual subspace, that along with the Core space serves as an efficient internal representational space for that new task. To do so, we introduce the PCA Transformation-Selection algorithm, where in the PCA Transformation step, filters corresponding to the Residual space get transformed according to the ranked PCs and the Selection step determines how many of these new filters need to be added to the Core space. Figure 3 illustrates the steps for training a network that has previously been trained on older tasks.
Let there be f frozen and r trainable filters in a typical layer for a new task such that the number of original filters is $N_o = f + r$. Frozen and trainable filters correspond to the Core and Residual space respectively. First, the network is trained until convergence on the current task, updating the trainable filters only (the frozen filters are still used, just not updated). Then in the PCA Transformation step, the activation matrix (A) is generated from the data of the current task. Before applying PCA on this matrix, the columns corresponding to the frozen filters are zeroed out so that this matrix captures the data variances in the newly learned Residual space only. Application of PCA on this matrix gives us $N_o$ PCs, f of which will have zero eigenvalues corresponding to the frozen filters. A transformation matrix, $T$, is constructed with the remaining r PCs and multiplied with the current weight matrix ($W$) to create r new filters. There is no compression in this step; we retain r filters, but they are transformed into new filters, ranked according to the amount of variance of the data they can explain. Finally in this step, a new weight matrix is formed by replacing the old r filters by these transformed ones, while the first f frozen filters stay unchanged.
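The PCA Transformation step can be sketched as below. This is a hedged illustration, not the authors' code: the name `pca_transformation` is hypothetical, the frozen filters are assumed to occupy the first f columns, and activations are mean-centered. Zeroing the frozen columns makes the covariance block-diagonal, so the sketch runs PCA directly on the Residual columns, which yields the same non-trivial PCs.

```python
import numpy as np

def pca_transformation(A, W, f):
    """Rotate the Residual filters onto their ranked principal directions.

    A: (n, f + r) activations on the current task, W: (d, f + r) weights.
    The first f columns are the frozen Core filters and are left untouched.
    """
    A_res = A[:, f:] - A[:, f:].mean(axis=0)
    cov = A_res.T @ A_res / (A.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)              # ascending order
    T = eigvecs[:, np.argsort(eigvals)[::-1]]           # (r, r), ranked PCs
    W_new = W.copy()
    W_new[:, f:] = W[:, f:] @ T                         # r transformed filters
    return W_new, T

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 27))       # flattened input patches
W = rng.normal(size=(27, 8))         # 8 filters in total
f = 3                                # assume 3 frozen Core filters
W_new, T = pca_transformation(X @ W, W, f)

# Variance captured by each transformed Residual filter, in ranked order.
res_var = (X @ W_new)[:, f:].var(axis=0, ddof=1)
```

No filter is removed here: the r Residual filters are only rotated so that `res_var` is sorted from most to least variance, which is what the subsequent Selection step relies on.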
Figure 4. Results on CIFAR10 (first layer): (a) Learning Task 1 by PCA Compression and (b) incrementally adding Task 2 by PCA Transformation-Selection Algorithm: (a) is the variance-covariance matrix of the activation matrix A, and the plot shows that the data is spread out in all dimensions. After transformation, there are no correlations between the PCs and the ranking shows that only 7 out of 32 filters are needed. The same transformation is then applied to W and the remaining 25 filters are pruned out. (b) The next task is learned; PCA Transformation only evaluates the variances in the remaining 25 filters (in the Residual) and transforms W accordingly. The Selection step identifies that 5 new filters are needed to explain the variance along with the frozen 7 filters.
However, if we were to find the number of components that explain x% of the total variance from just the r filters in the Residual space, we would not take into account the contribution of the Core space. We fix this in the Selection step, by collecting a new activation matrix A after a forward pass with the transformed weight matrix. We can now compute the data variance captured by the $i$-th filter as the sample variance of the $i$-th column of A:
$$\sigma_i^2 = \frac{1}{n-1} \sum_{j=1}^{n} \left(A_{ji} - \bar{A}_i\right)^2,$$
where $\bar{A}_i$ is the mean of the $i$-th column. Summation of $\sigma_i^2$ over all filters gives the total variance of the space. We cumulatively add these variances and compute how many filters are needed to explain x% of the total variance. Based on the identified number of filters, we construct a new weight matrix by pruning the remaining filters from that layer. As explained before, we then train the whole network to mitigate any loss in accuracy. We sketch out the matrices in Figure 4 to highlight the conceptual difference between the PCA Compression algorithm, which acts on the whole space while learning the first task, and the PCA Transformation-Selection algorithm on the subsequent tasks, where we transform only the Residual space but evaluate redundancy in the whole space. These results are shown for the 32 filters of the first layer of a network trained on the split CIFAR-10 dataset.
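The Selection step can be illustrated with a short sketch. It assumes, as a reading of the description above rather than a confirmed detail, that variances are accumulated in column order with the frozen Core filters first and the ranked Residual filters after them; the name `select_new_filters` and the toy variances are hypothetical.

```python
import numpy as np

def select_new_filters(A, f, x=99.0):
    """Number of Residual filters to add to the Core for the current task.

    A: (n, f + r) activations from [frozen Core | ranked Residual] filters.
    Variances are accumulated in column order, Core filters first (assumed).
    """
    variances = A.var(axis=0, ddof=1)                  # per-filter variance
    explained = np.cumsum(variances) / variances.sum() * 100.0
    needed = min(int(np.searchsorted(explained, x)) + 1, A.shape[1])
    return max(needed - f, 0)                          # Core filters are kept anyway

# Toy activations whose columns have variances 50, 30, 15, 4 and 1.
base = np.array([-1.0, 0.0, 1.0])
A = np.outer(base, np.sqrt([50.0, 30.0, 15.0, 4.0, 1.0]))
f = 2                                                  # two frozen Core filters

added_at_90 = select_new_filters(A, f, x=90.0)   # Core explains 80%, one more filter reaches 95%
added_at_60 = select_new_filters(A, f, x=60.0)   # Core alone already suffices
```

Residual filters beyond the returned count would be pruned and left free for later tasks.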
Datasets: We evaluated our algorithm for continual learning on permuted MNIST (P-MNIST) [19], and the split versions of the CIFAR-10 and CIFAR-100 datasets. The Permuted MNIST dataset is created out of the MNIST dataset by randomly permuting all the pixels of the MNIST images differently for each task. We created 10 tasks with 10 different permutations, where each task has 10 classes. We constructed the Split CIFAR-10 dataset from the CIFAR-10 [16] dataset by splitting the dataset into 5 sequential tasks with 2 classes per task. The Split CIFAR-100 dataset is constructed from the CIFAR-100 [16] dataset by splitting the dataset into 10 sequential tasks with 10 classes per task. All images were pre-processed with zero padding to a fixed size in pixels and then normalized.
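The task construction can be sketched as follows. This is a minimal illustration on toy arrays, not the authors' data pipeline: the helper names are hypothetical, images are assumed to be flattened to one row each, and the split into tasks is assumed to follow label order.

```python
import numpy as np

def make_permutations(num_tasks, num_pixels, seed=0):
    """One fixed random pixel permutation per task."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(num_pixels) for _ in range(num_tasks)]

def permute_images(images, perm):
    """Apply the same permutation to every flattened image of a task."""
    return images[:, perm]

def split_classes(num_classes, classes_per_task):
    """Partition class labels into consecutive tasks (label order assumed)."""
    return [list(range(start, start + classes_per_task))
            for start in range(0, num_classes, classes_per_task)]

# Toy stand-in for flattened images: 5 images with 16 pixels each.
images = np.arange(5 * 16, dtype=float).reshape(5, 16)
perms = make_permutations(num_tasks=10, num_pixels=16)
task_images = [permute_images(images, perm) for perm in perms]

cifar10_tasks = split_classes(10, 2)      # 5 tasks with 2 classes each
cifar100_tasks = split_classes(100, 10)   # 10 tasks with 10 classes each
```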
Network Setting: For P-MNIST tasks, we used a multi-layered perceptron (MLP) with two hidden layers, each having 1000 neurons, with ReLU activations. Though we described our method for a typical convolutional layer in Section 3, it is also applicable to fully-connected layers. We constructed the activation matrix, $A \in \mathbb{R}^{m \times n}$, such that m is the number of input examples and n is the number of neurons. For the split CIFAR-10 and CIFAR-100 experiments, we used convolutional neural network (CNN) architectures with five convolutional (conv-conv-pool-dropout-conv-conv-pool-dropout-conv-pool-classifier) layers followed by a classifier layer. For the CIFAR-10 experiments, we used
filters in the five convolutional layers whereas for CIFAR-100 we used
filters. In both experiments, the filter kernel size for the first four layers was
while the fifth layer had a kernel size of
.
Figure 5. Average accuracy over incrementally learned tasks for (a) Permuted MNIST, (b) Split CIFAR-10 and (c) Split CIFAR-100 datasets.
Figure 6. Comparison of inference energy for sequential learning of split CIFAR-100 tasks.
Training Setting: For all models and algorithms (including baselines), we used Stochastic Gradient Descent (SGD) with momentum (0.9) as the optimizer. The Split CIFAR-10 and CIFAR-100 experiments with CNNs used dropout of 0.15 in the hidden layers whereas P-MNIST experiments with MLP used no dropout. For all experiments, a batch size of 128 was used and the optimizer was reset after each task. Models were trained for 15 epochs for P-MNIST, 40 epochs for split CIFAR-10 and 80 epochs for split CIFAR-100 for each task, with an initial learning rate of . In our method, since we need to fine-tune the network after performing PCA, we used a learning rate of
and utilized early stopping. In the PCA step, we used 1000 random training examples from the current dataset for activation collection as per [10]. We trained and tested all the models and algorithms in the multi-headed setting [14, 8], where a new classifier is added for each new task, and a task hint is provided.
Baselines: To establish a strong baseline, we consider Single Task Learning (STL), where the full network with all resources is trained for each task separately. Since each task is learned on a separate network, there is no issue of catastrophic forgetting. We also compare with Learning without Forgetting (LwF) and Elastic Weight Consolidation (EWC), which use weight-specific regularizers and utilize the entire network. We implemented the efficient version of EWC, online-EWC [33] with and regularization coefficient set to 100. Our final baseline is PackNet, where we pruned 80% of the parameters belonging to that particular task across all datasets. LwF and PackNet were implemented from the official implementation of [24] and EWC was implemented from [38].
We evaluate our algorithms in terms of classification accuracy, utilization of network resources and inference energy. Figure 5 shows the average classification accuracy on the Y axis over incrementally learned tasks. To clarify, classification accuracy at task number 5 denotes the average accuracy over Task 1 to Task 5, learned sequentially. The average classification accuracy after training all the tasks for the corresponding datasets is shown in Table 1. For P-MNIST, as shown in Figure 5(a), our method outperforms EWC and is on par with LwF. While the performance of STL and PackNet is slightly better, STL uses 10 separate models, and therefore 1000% of the parameters, and PackNet utilizes 90% of the original network parameters. In comparison, our method achieves high compression and yields comparable accuracy ( 1% lower than PackNet) using 77% of the network parameters.
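The averaging convention used for the accuracy curves can be made concrete with a small sketch. The numbers below are made-up toy values chosen only to show the computation, not results from the experiments, and the name `average_accuracy_curve` is hypothetical.

```python
import numpy as np

def average_accuracy_curve(acc):
    """acc[t, i] is the accuracy on task i measured after learning task t.

    Returns, for each t, the mean accuracy over all tasks seen so far
    (tasks 0..t), which is the quantity plotted against the task number.
    """
    num_tasks = acc.shape[0]
    return np.array([acc[t, :t + 1].mean() for t in range(num_tasks)])

# Toy accuracies (illustrative only): rows = tasks learned, columns = task evaluated.
acc = np.array([
    [90.0,  0.0,  0.0],
    [80.0, 90.0,  0.0],
    [70.0, 80.0, 90.0],
])
curve = average_accuracy_curve(acc)

# Drop in Task 1 accuracy after the last task is learned (a forgetting measure).
task1_drop = acc[0, 0] - acc[-1, 0]
```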
Figures 5(b) and 5(c) show results for the split CIFAR-10 and CIFAR-100 datasets, respectively, trained on CNN architectures. Our method significantly outperforms both LwF and EWC. Its performance is within 1% of STL, while using 5.5x and 10.6x fewer parameters for CIFAR-10 and CIFAR-100 respectively. We also observe that for longer task sequences such as in CIFAR-100, the accuracy of Task 1 drops by 21% and 3.5% in EWC and LwF respectively after the tenth task is learned incrementally, thus showing clear signs of forgetting, whereas our method preserves the initial performance on each task. For both datasets, PackNet outperforms our method in terms of accuracy. The classification performance gain of PackNet comes from the fact that it applies weight-level pruning, which is of finer granularity than our filter-level pruning. PackNet outperforms even STL, presumably due to the regularization effect of pruning. However, filter-granularity pruning eventually translates to higher savings in hardware, owing to the structured nature of the resulting sparsity. This is discussed next.
Figure 7. Layer-wise filter utilization for each task for (a) Split CIFAR-10 and (b) Split CIFAR-100 datasets.
In Figure 6, we compare the inference energy of different algorithms on the split CIFAR-100 dataset, measured on an NVIDIA GeForce GTX 1060 GPU with a batch size of 64. EWC and LwF both utilize the entire dense network while PackNet has varying levels of unstructured sparsity for different tasks. For PackNet, we noticed that the inclusion of binary mask application in the measurements leads to an increase in energy consumption, indicating an absence of a native binary mask implementation in cuDNN [6] or CUSPARSE [5] libraries. Therefore, we exclude the mask application energy, for a fair comparison. We notice that PackNet consumes comparable energy to EWC and LwF, which are dense models, indicating the inefficiency of GPUs for computation with irregularly sparse matrix kernels [42]. On the other hand, our model structurally adds or removes filters for different tasks, as shown in Figure 7(b), which enables the computing hardware to leverage this structured sparsity and provide considerable energy benefits. Since the network trained for just Task 1 is the smallest, we get
x inference energy reduction compared to other methods. As further tasks are learned, the network grows in size and the energy consumption scales accordingly. Our model’s energy consumption is similar to that of the dense model for the last task as both these cases use almost identical model resources.
Table 1. Comparison of average classification accuracy (%) and network size (relative to original network)
In our method, the variance threshold (x) is a hyperparameter. Higher values of x yield higher classification accuracy, but at the expense of higher filter utilization per task. Continual learning in memory-constrained (fixed-resource) environments demands a trade-off between classification accuracy and parameter utilization per task. Moreover, we observe a trend between the optimal choice of x and task complexity (e.g. type of dataset, number of classes per task). For P-MNIST, 99% variance retention was required for good accuracy. For split CIFAR-10, which has two classes per task, we could achieve high performance with x = 98.5, while the 10-class, higher-complexity tasks of split CIFAR-100 required a higher value: x = 99.5. Figure 7 illustrates the number of filters utilized in each layer to preserve x% of the variance. A single fixed value of x is set for all layers and translates automatically to the requisite number of filters in each layer, thus avoiding the layer-wise heuristic selection iterations that most pruning methodologies need. For CIFAR-10 tasks, as shown in Figure 7(a), the number of filters grows monotonically with each subsequent task. However, for CIFAR-100 tasks (Figure 7(b)), we observe that the growth dynamics vary across layers. For example, in ‘Layer 5’ there are no free filters available for learning from the task onwards (implying that the entire space is the Core space). Accuracy degradation in subsequent tasks would imply that new filters need to be added to ‘Layer 5’. Thus, filter utilization statistics give us insight into where to put extra resources (filters) to gain further improvements in accuracy. Since this would require relaxing the fixed memory constraints, we leave it for future exploration. Detailed implementation of the code is available at https://github.com/sahagobinda/CL PCA.
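The way a single threshold x translates into a per-layer filter count can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `filters_for_variance` is hypothetical, and the sketch assumes that a layer's activations have been flattened into a matrix with one row per sample and one column per filter.

```python
import numpy as np


def filters_for_variance(activations, x_percent):
    """Smallest number of principal components retaining x_percent of variance.

    activations: array of shape (num_samples, num_filters); this layout is an
    assumption of the sketch, not something fixed by the text.
    """
    # Centre each filter's activations before PCA.
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the variance explained by
    # each principal component, sorted from largest to smallest.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    total = variance.sum()
    if total == 0:
        return 0
    cumulative = np.cumsum(variance) / total
    # First index whose cumulative ratio reaches the threshold, as a count.
    k = int(np.searchsorted(cumulative, x_percent / 100.0)) + 1
    return min(k, variance.size)


# Toy activations with three orthogonal, zero-mean directions whose variances
# are in the ratio 10000 : 100 : 1 (values chosen only for illustration).
toy = np.array([
    [10.0, 0.0, 0.0], [-10.0, 0.0, 0.0],
    [0.0, 1.0, 0.0], [0.0, -1.0, 0.0],
    [0.0, 0.0, 0.1], [0.0, 0.0, -0.1],
])
for x in (98.5, 99.5, 99.999):
    print(x, filters_for_variance(toy, x))
```

On the toy matrix, the leading direction alone explains about 99.0% of the variance, so a threshold of 98.5 keeps one component while 99.5 needs two, which mirrors how a higher x leads to higher filter utilization.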
In this paper, we address the problem of catastrophic forgetting in DNNs in a continual learning setting where the data from older tasks is not available and the learning system has a fixed network architecture. To accommodate multiple sequential tasks, we divide the representational space into a fixed Core space, which contains task-specific information over all previously seen tasks, and a Residual space, which the current task learns over. We propose a PCA-driven approach that condenses the Residual and adds only the required information to the Core, freeing up parameters to learn the next task. We consistently outperform EWC and LwF in terms of both accuracy and inference energy efficiency, and overcome the problem of catastrophic forgetting seen in these methods. While PackNet achieves better accuracy than our algorithm, we are able to leverage the structured sparsity much better in hardware, resulting in up to 4.5x improvement in energy efficiency at the earlier tasks as compared to PackNet.
This work was supported in part by the National Science Foundation, in part by Intel Corporation, in part by the Vannevar Bush Faculty Fellowship, and in part by C-BRIC, one of the six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
[1] Hervé Abdi and Lynne J. Williams. Principal Component Analysis. WIREs Comput. Stat., 2(4):433–459, July 2010. 2, 3
[2] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory Aware Synapses: Learning what (not) to forget. In ECCV, 2018. 1
[3] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In NeurIPS, 2019. 1
[4] Aayush Ankit, Abhronil Sengupta, and Kaushik Roy. TraNNsformer: Neural network transformation for memristive crossbar based neuromorphic system design. 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 533–540, 2017. 1, 3
[5] Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on CUDA. 2008. 8
[6] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. ArXiv, abs/1410.0759, 2014. 8
[7] Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In NIPS, 2013. 2
[8] Sebastian Farquhar and Yarin Gal. Towards robust evaluations of continual learning. ArXiv, abs/1805.09733, 2018. 7
[9] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. ArXiv, abs/1701.08734, 2017. 1
[10] Isha Garg, Priyadarshini Panda, and Kaushik Roy. A low effort approach to structured cnn design using PCA. ArXiv, abs/1812.06224, 2018. 3, 4, 7
[11] Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, and William J. Dally. DSD: Dense-Sparse-Dense training for deep neural networks. In ICLR, 2016. 2
[12] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015. 2
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, June 2016. 1
[14] Yen-Chang Hsu, Yen-Cheng Liu, and Zsolt Kira. Reevaluating continual learning scenarios: A categorization and case for strong baselines. ArXiv, abs/1810.12488, 2018. 7
[15] James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 114(13):3521–3526, 2016. 1, 2
[16] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. 6
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60:84–90, 2012. 3
[18] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015. 1
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998. 6
[20] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ArXiv, abs/1608.08710, 2016. 3
[21] Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. ArXiv, abs/1904.00310, 2019. 1
[22] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:2935–2947, 2016. 2
[23] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continuum learning. In NIPS, 2017. 1, 2
[24] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In CVPR, June 2018. 1, 2, 7
[25] Nicolas Y. Masse, Gregory D. Grant, and David J. Freedman. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences of the United States of America, 115(44):E10467–E10475, 2018. 1
[26] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2016. 3
[27] German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural networks : the official journal of the International Neural Network Society, 113:54–71, 2019. 1
[28] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NIPS, 2017. 4
[29] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. 2016. 1, 2
[30] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7:123–146, 1995. 1
[31] Deboleena Roy, Priyadarshini Panda, and Kaushik Roy. Tree-cnn: A deep convolutional neural network for lifelong learning. ArXiv, abs/1802.05800, 2018. 1
[32] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. ArXiv, abs/1606.04671, 2016. 1, 2
[33] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & Compress: A scalable framework for continual learning. In ICML, 2018. 2, 7
[34] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In NIPS, 2017. 1, 2
[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 3
[36] Volodymyr Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. 1
[37] Amal Rannen Triki, Rahaf Aljundi, Matthew B. Blaschko, and Tinne Tuytelaars. Encoder based lifelong learning. 2017 IEEE International Conference on Computer Vision (ICCV), pages 1329–1337, 2017. 1
[38] Michiel van der Ven and Andreas S. Tolias. Generative replay with feedback connections as a general strategy for continual learning. ArXiv, abs/1809.10635, 2018. 7
[39] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016. 1, 3
[40] Jaehong Yoon, Eunho Yang, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. ArXiv, abs/1708.01547, 2017. 1
[41] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In ICML, 2017. 1, 2
[42] Maohua Zhu, Tao Zhang, Zhenyu Gu, and Yuan Xie. Sparse tensor core: Algorithm and hardware co-design for vectorwise sparse neural networks on modern gpus. In MICRO, 2019. 8