Deep networks are a notoriously hard class of models to train effectively [10, 13, 22, 23]. The main sources of difficulty include high-dimensional problems, characterized by a large number of labels and a high volume of samples, a large number of free parameters, and extreme sensitivity to experimental setups. The go-to solution for deep network optimization is Stochastic Gradient Descent with mini-batches [31] (batch learning) or its derivatives. Two alternative lines of work offer strategies to guide deep networks to better solutions than batch learning: Curriculum Learning [3, 12, 15] and Label Smoothing [9, 39].
Curriculum learning helps deep networks learn better by gradually increasing the difficulty of samples used to train networks. This idea is inspired by methods used to teach humans and patterns in human cognition and behaviour [1, 35]. The “difficulty” of samples in the dataset, obtained using either external ranking methods or internal rewards [12, 16], introduces an extra computational overhead while the setup itself restricts the amount of data from which the model begins to learn.
Figure 1: Illustration of the components of LILAC for a four label dataset case. The Incremental Label introduction (IL) phase introduces new labels at regular intervals while using the data corresponding to unknown labels (pseudo-label) as negative samples. Once all the labels have been introduced, the Adaptive Compensation (AC) phase of training begins. Here, a prior copy of the network is used to classify training data. If a sample is misclassified then a smoother distribution is used as its ground-truth vector in the current epoch.
Label smoothing techniques [28, 30, 39] regularize the outcomes of deep networks to prevent over-fitting while improving on existing solutions. They penalize network outputs based on criteria such as noisy labels, overconfident model outcomes, or robustness of a network around a data point in the feature space. Often, such methods penalize the entire dataset throughout the training phase with no regard to the prediction accuracy of each sample.
Inspired by an alternative outlook on Elman’s [9] notion of “starting small”, we propose LILAC, Learning with Incremental Labels and Adaptive Compensation, a novel label-based algorithm that overcomes the issues of the previous methods and effectively combines them. LILAC works in two phases, 1) Incremental Label Introduction (IL), which emphasizes gradually learning labels, instead of samples, and 2) Adaptive Compensation (AC), which regularizes the outcomes of previously misclassified samples by modifying their target vectors to smoother distributions in the objective function (Fig. 1).
In the first phase, we partition data into two mutually exclusive sets: S, a subset of ground-truth (GT) labels and their corresponding data; and U, remaining data associated with a pseudo-label and used as negative samples. Once the network is trained using the current state of the data partition for a fixed interval, we reveal more GT labels and their corresponding data and repeat the training process. By contrasting data in S against the entire remaining dataset in U, we consistently use all the available data throughout training, thereby overcoming one of the key issues of curriculum learning. The setup of the IL phase, inspired by continual learning, allows us to flexibly space out the introduction of new labels and provide the network enough time to develop a strong understanding of each class.
Once all the GT labels are revealed, we initiate the AC phase of training. In this phase, we replace the one-hot target vector of samples that were misclassified by a previous version of the network being trained with a smoother distribution. The smoother distribution provides an easier value for the network to learn while the use of a prior copy of the network helps avoid external computational overhead and limits the alteration to only necessary samples. To summarize, our main contributions in LILAC are as follows:
• we introduce a novel method for curriculum learning that incrementally learns labels as opposed to samples,
• we formulate Adaptive Compensation as a method to regularize misclassified samples while removing external computational overhead,
• finally, we improve average recognition accuracy across all of the evaluated benchmarks compared to batch learning, a property that is not shared by the other tested curriculum learning and label smoothing methods.
Our code is available at https://github.com/MichiganCOG/LILAC_v2.
Curriculum Learning Bengio et al. [3], Florensa et al. [12], and Graves et al. [15] are some important works that have redefined and applied curriculum learning in the context of deep networks. These ideas were expanded upon to show improvements in performance across corrupted [19] and small datasets [11]. More recently, Hacohen and Weinshall [16] explored the impact of varying the pace with which samples were introduced while Weinshall [38] used alternative deep networks to categorize difficult samples. To the best of our knowledge, most previous works have assumed that samples cover a broad spectrum of difficulty and hence need to be categorized and presented in an orderly fashion. The closest relevant work to ours, in terms of learning labels, gradually varies the GT vector from a multimodal distribution to a one-hot vector over the course of the training phase [8].
Label Smoothing Label smoothing techniques regularize deep networks by penalizing the objective function based on a pre-defined criterion. Such criteria include using a mixture of true and noisy labels [39], penalizing highly confident outputs [28], and using an alternate deep network’s outcomes as GT [30]. Bagherinezhad et al. [2] proposed the idea of using logits from trained models instead of just one-hot vectors as GT. Complementary work by Miyato et al. [27] used the local distributional smoothness, based on the robustness of a model’s distribution around a data point, to smooth labels. The work closest to our method was proposed in Szegedy et al. [36], where an alternative target distribution was used across the entire dataset. Instead, we propose to alter the GT vector for only samples that are misclassified. They are identified using a prior copy of the current model, which helps avoid external computational overhead and only uses a small set of operations.
Incremental Learning and Negative Mining Incremental and Continual learning are closely related fields that inspired the structure of our algorithm. Their primary concern is learning over evolving data distributions with the addition of constraints on the storage memory [5, 29], distillation of knowledge across different distributions [33, 34], assumption of a single pass over data [6, 26], etc. In our approach, we depart from the assumption of evolving data distributions. Instead, we adopt the experimental pipeline used in incremental learning to introduce new labels at regular intervals. At the same time, inspired by negative mining [4, 24, 37], we use the remaining training data, associated with a pseudo-label, as negative samples. Overall, our setup effectively uses the entire training dataset, thus maintaining the same data distribution.
Figure 2: Illustration of the steps in the IL phase when (Top) only one GT label is in S and (Bottom) when two GT labels are in S. The steps are 1) partition data, 2) sample a mini-batch of data and 3) balance the number of samples from U to match those from S in the mini-batch before training. Samples from U are assumed to have a uniform prior when being augmented/reduced to match the total number of samples from S. Values inside each pie represent the number of samples. Across both cases, the number of samples from S determines the final balanced mini-batch size.
In LILAC, our main objective is to improve upon batch learning. We do so by first gradually learning labels, in fixed increments, until all GT labels are known to the network (Section 3.1). This behaviour assumes that all samples are of equal difficulty and are available to the network throughout the training phase. Further, we focus on learning strong representations of each class over a dedicated period of time. Once all GT labels are known, we shift to regularizing previously misclassified samples by smoothing the distribution of their target vector while maintaining the peak at the same GT label (Section 3.2). Using a smoother distribution leads to an increase in the entropy of the target vector and helps the network learn better, as we demonstrate in Section 4.2.
3.1 Incremental Label Introduction Phase
In the IL phase, we partition data into two sets: S, a subset of GT labels and their corresponding data; and U, the remaining data marked as negative samples using a pseudo-label. Over the course of multiple intervals of training, we reveal more GT labels to the network according to a predetermined schedule. Within a given interval of training, the data partition is held fixed and we uniformly sample mini-batches from the entire training set based on their GT label. However, for samples from U, we use the pseudo-label as their label. There is no additional change required in the objective function or the outputs of the model when we sample data from U. By the end of this phase, we reveal all GT labels to the network.
For a given dataset, we assume a total of L labels are provided in the ascending order of their value. Based on this ordering, we initialize the first b labels, and their corresponding data, as S, and the data corresponding to the remaining L − b labels as U. Over the course of multiple training intervals, we reveal GT labels in increments of m, a hyper-parameter that controls the schedule of new label introduction. Revealing a GT label involves moving the corresponding data from U to S and using their GT label instead of the pseudo-label.
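The label schedule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the function name `label_schedule` is hypothetical, labels are assumed to be the integers 0 to L − 1 in ascending order, and the final increment is assumed to be shortened when fewer than m labels remain. Whether the last entry (all labels revealed) counts as part of the IL phase is left open here.

```python
def label_schedule(num_labels, b, m):
    """Return the revealed-label set S for each training interval.

    Assumes labels are the integers 0..num_labels-1 in ascending order
    (an assumption made for illustration only).
    """
    if not (0 < b <= num_labels) or m <= 0:
        raise ValueError("need 0 < b <= num_labels and m > 0")
    schedule = []
    revealed = b
    while True:
        schedule.append(list(range(revealed)))
        if revealed == num_labels:
            break
        # Reveal m more labels; the last step may reveal fewer than m.
        revealed = min(revealed + m, num_labels)
    return schedule


# Example: 10 labels, b = 5 (half of the labels), m = 2 new labels per interval.
for interval, S in enumerate(label_schedule(10, 5, 2)):
    U = [label for label in range(10) if label not in S]
    print(f"interval {interval}: S={S} U={U}")
```

Each entry of the returned list corresponds to one fixed data partition; everything not in S at that point belongs to U.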
Within a training interval, we train the network for E epochs using the current state of the data partition. First, we sample a mini-batch of data based on a uniform prior over their GT labels. Then, we modify their target vectors based on the partition to which a sample belongs. To ensure the balanced occurrence of samples from GT labels and the pseudo-label, we augment or reduce the number of samples from U to match those from S and use this curated mini-batch to train the network. After E epochs, we move m new GT labels and their corresponding data from U to S and repeat the entire process (Fig. 2).
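The mini-batch curation step can be sketched as follows. This is one possible reading of "augment or reduce", not the released code: the helper `curate_minibatch` and its parameters are hypothetical, reduction is assumed to be uniform subsampling without replacement, and augmentation is assumed to be uniform duplication of samples from U.

```python
import random


def curate_minibatch(batch, S_labels, pseudo_label, rng=random):
    """Relabel and balance one mini-batch for the IL phase (sketch).

    batch: list of (sample, gt_label) pairs.
    Samples whose GT label is not in S_labels receive pseudo_label as
    their target, and their count is matched to the number of S samples.
    """
    in_S = [(x, y) for x, y in batch if y in S_labels]
    in_U = [(x, pseudo_label) for x, y in batch if y not in S_labels]

    n = len(in_S)
    if in_U and len(in_U) > n:
        # Reduce: uniform subsample without replacement.
        in_U = rng.sample(in_U, n)
    elif in_U and len(in_U) < n:
        # Augment: duplicate uniformly chosen samples from U.
        in_U = in_U + rng.choices(in_U, k=n - len(in_U))
    return in_S + in_U


# Four-label toy batch with two samples per label; label 0 is in S and
# the pseudo-label is assumed to reuse the last label index (3).
rng = random.Random(0)
batch = [(f"img{i}", i % 4) for i in range(8)]
curated = curate_minibatch(batch, S_labels={0}, pseudo_label=3, rng=rng)
print(len(curated), curated)
```

In this toy case the two samples from S determine the final size: six candidate negatives are reduced to two, giving a curated mini-batch of four samples.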
3.2 Adaptive Compensation
Once all the GT labels have been revealed and the network has trained sufficiently, we begin the AC phase. In the AC phase, we use a smoother distribution for the target vector of samples which the network is unable to correctly classify. Compared to one-hot vectors, optimizing over this smoother distribution, with an increased entropy, can bridge the gap between the unequal distances in the embedding space and overlaps in the label space [32]. This overlap can occur due to common image content or close proximity in the embedding space relative to other classes. Thus, increasing the entropy of such target vectors can help modify the embedding space in the next epoch and compensate for the predictions of misclassified samples.
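The entropy claim can be checked with a small worked example. The vectors below are illustrative values chosen for this sketch, not numbers from the experiments: a one-hot target has zero Shannon entropy, while any smoother target with the same peak has strictly positive entropy.

```python
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    return sum(-q * math.log(q) for q in p if q > 0)


one_hot = [0.0, 1.0, 0.0, 0.0]
smoothed = [0.1, 0.7, 0.1, 0.1]  # illustrative values, peak kept at index 1

print(entropy(one_hot))
print(entropy(smoothed))
print(math.log(4))  # upper bound, reached by the uniform distribution
```

The smoothed vector sits between the two extremes: above the one-hot entropy of zero and below the uniform maximum of log 4.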
For a sample $(x_i, y_i)$ in epoch $e \geq T$, we use predictions from the model at $e-1$ to determine the final target vector used in the objective function; specifically, we smooth the target vector for a sample if and only if it was misclassified by the model at epoch $e-1$. Here, $(x_i, y_i)$ denotes a training sample and its corresponding GT label for sample index $i$, and $T$ represents a threshold epoch value until which the network is trained without adaptive compensation. We compute the final target vector for the $i$-th instance at epoch $e$, $t_i^e$, based on the model $\Phi^{e-1}$ using the following equation,
\[
t_i^e =
\begin{cases}
\left(\dfrac{\epsilon L - 1}{L - 1}\right)\delta_{y_i} + \left(\dfrac{1 - \epsilon}{L - 1}\right)\mathbf{1}, & \arg\max \Phi^{e-1}(x_i) \neq y_i, \\[2mm]
\delta_{y_i}, & \text{otherwise.}
\end{cases}
\]
Here, $\delta_{y_i}$ represents the one-hot vector corresponding to GT label $y_i$, $\mathbf{1}$ is a vector of $L$ dimensions with all entries as 1 and $\epsilon$ is a scaling hyper-parameter.
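The target-vector rule can be sketched in plain Python. This is a hedged illustration rather than the released code: the function `ac_target` and its argument names are hypothetical, and the sketch assumes the smoothed vector mixes the one-hot vector and the all-ones vector so that the GT entry equals the scaling hyper-parameter while the remaining mass is spread evenly over the other L − 1 labels.

```python
def ac_target(gt_label, prev_prediction, num_labels, eps):
    """Target vector for one sample in the AC phase (sketch).

    gt_label: ground-truth label index.
    prev_prediction: label predicted by the model from the previous epoch.
    eps: scaling hyper-parameter; in this sketch it is the probability
         kept on the GT label (assumed 1/num_labels < eps <= 1).
    """
    one_hot = [1.0 if k == gt_label else 0.0 for k in range(num_labels)]
    if prev_prediction == gt_label:
        return one_hot  # correctly classified: target stays one-hot
    a = (eps * num_labels - 1.0) / (num_labels - 1.0)  # weight on one-hot
    c = (1.0 - eps) / (num_labels - 1.0)               # weight on all-ones
    return [a * d + c for d in one_hot]


# Misclassified sample (GT label 1, previous prediction 2), 4 labels.
print(ac_target(gt_label=1, prev_prediction=2, num_labels=4, eps=0.7))
# Correctly classified sample keeps its one-hot target.
print(ac_target(gt_label=1, prev_prediction=1, num_labels=4, eps=0.7))
```

Under this assumption the smoothed vector always sums to one and keeps its peak at the GT label, so only the sharpness of the target changes, not the label it points to.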
Datasets and Metrics We use three datasets, CIFAR-10, CIFAR-100 [20], and STL-10 [7], to evaluate our method and validate our claims. CIFAR-10 and CIFAR-100 are 10 and 100 class variants of the popular image benchmark CIFAR while STL-10 is a 10 class subset of ImageNet. We evaluate all algorithms using Average Recognition Accuracy (%) and its Standard Deviation across 5 trials.
Experimental Setup For CIFAR-10/100, we use ResNet18 [17] as the architectural backbone while for STL-10, we use ResNet34. We set the pseudo-label as the last label and b as half the total number of labels of a given dataset. In each interval of LILAC’s IL phase, we train the model for 7, 3, and 10 epochs each, at a learning rate of 0.1, 0.01, and 0.1 for CIFAR-10, CIFAR-100, and STL-10, respectively. In the AC phase, epochs 150, 220, and 370 are used as thresholds (epoch T) for CIFAR-10, CIFAR-100, and STL-10, respectively. Detailed explanations of the experimental setups are provided in the supplementary materials.
• Fixed Curriculum: Following the methodology proposed in Bengio et al. [3], we create a “Simple” subset of the dataset using data that is within a value of 1.1 as predicted by a linear one-vs-all SVR model. The deep network is trained on the “Simple” dataset for a fixed period of time, which mirrors the total length of the IL phase, after which the entire dataset is used to train the network.
• Label Smoothing: We follow the method proposed in Szegedy et al. [36].
3. Custom Baselines
• Dynamic Batch Size (DBS): DBS randomly copies data available within a mini-batch to mimic variable batch sizes, similar to the IL phase. However, all GT labels are available to the model throughout the training process.
• Random Augmentation (RA): This baseline samples from a single randomly chosen class in U, available in the current mini-batch, to balance data between S and U in the current mini-batch. This is in contrast to LILAC, which uses samples from all classes in U that are available in the current mini-batch.
4. Ablative Baselines
• Only IL: This baseline quantifies the contribution of incrementally learning labels when combined with batch learning.
• Only AC: This baseline shows the impact of adaptive compensation, as a label smoothing technique, when combined with batch learning.
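The difference between the RA baseline and LILAC's use of U, as described in the Custom Baselines above, can be sketched as two ways of choosing the pool of negatives inside a mini-batch. The function names are hypothetical and the sketch only covers pool selection; in either case the pool would still be augmented or reduced to match the number of samples from S.

```python
import random


def negatives_lilac(batch, S_labels):
    """All samples in the mini-batch whose GT label is still in U."""
    return [(x, y) for x, y in batch if y not in S_labels]


def negatives_ra(batch, S_labels, rng=random):
    """RA baseline: samples from one randomly chosen U class in the batch."""
    pool = negatives_lilac(batch, S_labels)
    if not pool:
        return []
    # Pick a single class among the U classes present in this mini-batch.
    chosen = rng.choice(sorted({y for _, y in pool}))
    return [(x, y) for x, y in pool if y == chosen]


rng = random.Random(1)
batch = [(f"img{i}", i % 4) for i in range(8)]  # toy batch, 4 labels
print(negatives_lilac(batch, S_labels={0}))
print(negatives_ra(batch, S_labels={0}, rng=rng))
```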
4.1 Comparison Against Standard Baselines
Table 1 illustrates the improvement offered by LILAC over Batch Learning, with comparable setups. Further, we break down the contributions of each phase of LILAC. Both Only IL and Only AC improve over batch learning, albeit to varying degrees, which highlights their individual strengths and importance. However, only when we combine both phases do we observe a consistently high performance across all benchmarks. This indicates that these two phases complement each other.
The Fixed Curriculum approach does not offer consistent improvements over the Batch Learning baseline across CIFAR-100 and STL-10 while the Label Smoothing approach does not outperform batch learning on the STL-10 dataset. While both of these standard baselines fall short, LILAC consistently outperforms Batch Learning across all evaluated benchmarks. Interestingly, Label Smoothing provides the highest performance on CIFAR-100. Since the original formulation of LILAC was based on Batch Learning, we assumed all GT vectors to be one-hot. This assumption is violated in Label Smoothing. When we tailor our GT vectors according to the Label Smoothing baseline, we outperform it with minimal hyper-parameter changes, a testament to LILAC’s applicability on top of conventional label smoothing.
Table 1: Under similar setups, LILAC consistently achieves higher mean accuracy than Batch Learning across all evaluated benchmarks, a property not shared by other baselines.
Table 2: LILAC easily outperforms the Shake-Drop network [40] as well as other top performing algorithms on CIFAR-10 with standard pre-processing (random crop + flip).
The RA baseline highlights the importance of using all of the data in U as negative samples in the IL phase as opposed to using data from individual classes. This is reflected in the boost in performance offered by LILAC. The DBS baseline is used to highlight the importance of fluctuating mini-batch sizes, which occur due to the balancing of data in the IL phase. Even with the availability of all labels and fluctuating batch sizes, the DBS baseline is easily outperformed by LILAC. This indicates the importance of the recursive structure used to introduce data in the IL phase as well as the use of data from U as negative samples. Overall, LILAC consistently outperforms Batch Learning across all benchmarks while existing comparable methods fail to do so. When we extend LILAC to the Shake-Drop [40] network architecture, with only standard pre-processing, we easily outperform other existing approaches with comparable setups, as shown in Table 2.
4.2 Key Properties of LILAC
Smoothness of Target Vector Throughout this work, we have emphasized the importance of using a smoother distribution as the alternate target vector during the AC phase. Table 3 (Top) illustrates the change in performance across varying degrees of smoothness in the alternate target vector. There is a clear increase in performance when the values of the scaling hyper-parameter are between 0.7 and 0.4 (mid-range). On either side of this band of values, the GT vector is either too sharp or too flat, leading to a drop in performance.
Table 3: (Top) The mid-range of scaling hyper-parameter values, 0.7-0.4, shows an increase in performance while the edges, due to either too sharp or too flat a distribution, show decreased performance. (Bottom) Only IL model results illustrate the importance of introducing a small number of new labels in each interval of the IL phase. Values in brackets are for CIFAR-100.
Size of Label Groups (m) LILAC is designed to introduce as many or as few new labels as desired in the IL phase. We hypothesized that developing stronger representations can be facilitated by introducing a small number of new labels while contrasting it against a large variety of negative samples. Table 3 (Bottom) supports our hypothesis by illustrating the decrease in performance with an increase in the number of new labels introduced in each interval of the IL phase. Thus, we introduce two labels each for CIFAR-10 and STL-10 and only one new label per interval for CIFAR-100 throughout the experiments in Table 1.
In this section, we take a closer look at the impact of each phase of LILAC and how they affect the quality of the learned representations. We extract features from the second to last layer of ResNet18/34 from 3 different baselines (Batch Learning, LILAC, and Only IL) and use these features to train a linear SVM model.
Fig. 3 highlights the two important phases in our algorithm. First, the plots on the left-hand side show a steady improvement in the performance of LILAC and the Only IL baseline once the IL phase is complete and all the labels have been introduced to the network. When we compare the plots of CIFAR-10 and STL-10 against CIFAR-100, we see that all baselines follow the learning trend shown by Batch Learning, with CIFAR-100 being slightly delayed. Since there are a large number of epochs required to introduce all the labels of CIFAR-100 to the network, the plots are significantly delayed compared to batch learning. Conversely, since there are very few epochs in the IL phase of CIFAR-10 and STL-10, we observe the performance trend of Only IL and LILAC quickly match that of Batch Learning. Overall, the final performances of both LILAC and the Only IL baseline are higher than Batch Learning, which supports the importance of the IL phase in learning strong representations.
Figure 3: Plots on the (Left) show the common learning trend between all baselines, albeit slightly delayed for CIFAR-100, after the IL phase while those on the (Right) show steady improvement in performance after applying AC when compared to the Only IL baseline. Final supervised classification performances on representations collected from LILAC easily outperform those from Batch Learning and Only IL methods.
Figure 4: Illustration of 8 randomly chosen samples that were incorrectly labelled by the Only IL baseline and correctly labelled by LILAC. This highlights the importance of AC.
The plots on the right-hand side highlight the similarity in behaviour of Only IL and LILAC before AC. However, afterward, we observe that the performance of LILAC overtakes the Only IL baseline. This is a clear indicator of the improvement in representation quality when AC is applied. Additionally, from Fig. 3 we observe that the STL-10 results inherently have a high standard deviation, which is visible in the middle portion of the training phase, between the end of IL and the beginning of AC; this is not a consequence of our approach. We provide examples in Fig. 4 of randomly sampled data from the testing set that were incorrectly classified by the Only IL baseline and were correctly classified by LILAC.
In this work, we proposed LILAC, which rethinks curriculum learning based on incrementally learning labels instead of samples. This approach helps kick-start the learning process from a substantially better starting point while making the learned embedding space amenable to adaptive compensation of target vectors. Both these techniques combine well in LILAC to show the highest performance on CIFAR-10 for simple data augmentations while easily outperforming batch and curriculum learning and label smoothing methods on comparable network architectures. The next step in unlocking the full potential of this setup is to include a confidence measure on the predictions of the network so that it can handle the effects of dropout or partial inputs. In further expanding LILAC’s ability to handle partial inputs, we aim to explore its effect on standard incremental learning (memory-constrained) while also extending its applicability to more complex neural network architectures.
This work was in part supported by NSF NRI IIS 1522904 and NIST 60NANB17D191. The findings and views represent those of the authors alone and not the funding agencies. The authors would also like to thank members of the COG lab for their invaluable input in putting together and refining this work.
[1] Judith Avrahami, Yaakov Kareev, Yonatan Bogot, Ruth Caspi, Salomka Dunaevsky, and Sharon Lerner. Teaching by examples: Implications for the process of category acquisition. The Quarterly Journal of Experimental Psychology Section A, 50(3):586–606, 1997.
[2] Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, and Ali Farhadi. Label refinery: Improving imagenet classification through label progression. arXiv preprint arXiv:1805.02641, 2018.
[3] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
[4] Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. Hard negative mining for metric learning based zero-shot classification. In Gang Hua and Hervé Jégou, editors, Computer Vision – ECCV 2016 Workshops, pages 524–531, Cham, 2016. Springer International Publishing. ISBN 978-3-319-49409-8.
[5] Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–248, 2018.
[6] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-GEM. In International Conference on Learning Representations, 2019.
[7] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011.
[8] Urun Dogan, Aniket Anand Deshmukh, Marcin Machura, and Christian Igel. Label-similarity curriculum learning. arXiv preprint arXiv:1911.06902, 2019.
[9] Jeffrey L Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99, 1993.
[10] Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Artificial Intelligence and Statistics, pages 153–160, 2009.
[11] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In International Conference on Learning Representations, 2018.
[12] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pages 482–495. PMLR, 13–15 Nov 2017.
[13] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
[14] Ben Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
[15] Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1311–1320. JMLR.org, 2017.
[16] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 2535–2544, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[18] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
[19] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 2304–2313. PMLR, 10–15 Jul 2018.
[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[21] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
[22] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on Machine learning, pages 473–480. ACM, 2007.
[23] Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. Journal of machine learning research, 10(Jan):1–40, 2009.
[24] Xirong Li, Cees G M Snoek, Marcel Worring, Dennis Koelma, and Arnold WM Smeulders. Bootstrapping visual categorization with relevant negatives. IEEE Transactions on Multimedia, 15(4):933–945, 2013.
[25] Senwei Liang, Yuehaw Khoo, and Haizhao Yang. Drop-activation: Implicit parameter reduction and harmonic regularization. arXiv preprint arXiv:1811.05850, 2018.
[26] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
[27] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing by virtual adversarial examples. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
[28] Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. Regularizing neural networks by penalizing confident output distributions. CoRR, 2017.
[29] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
[30] Scott E. Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, 2015.
[31] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
[32] Pau Rodríguez, Miguel A Bautista, Jordi Gonzalez, and Sergio Escalera. Beyond one-hot encoding: Lower dimensional target embedding. Image and Vision Computing, 75:21–31, 2018.
[33] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems, pages 348–358, 2019.
[34] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pages 4535–4544, 2018.
[35] Burrhus F Skinner. Reinforcement today. American Psychologist, 13(4):94, 1958.
[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[37] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.
[38] Daphna Weinshall, Gad Cohen, and Dan Amir. Curriculum learning by transfer learning: Theory and experiments with deep networks. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 5238–5246, Stockholmsmässan, Stockholm Sweden, 2018. PMLR.
[39] Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, and Qi Tian. Disturblabel: Regularizing cnn on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.
[40] Yoshihiro Yamada, Masakazu Iwamura, Takuya Akiba, and Koichi Kise. Shakedrop regularization for deep residual learning. arXiv preprint arXiv:1802.02375, 2018.
[41] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
[42] Ke Zhang, Miao Sun, Tony X Han, Xingfang Yuan, Liru Guo, and Tao Liu. Residual networks of residual networks: Multilevel residual networks. IEEE Transactions on Circuits and Systems for Video Technology, 28(6):1303–1314, 2017.
In Table 4 we list the general hyper-parameters used to train the batch learning portion of every baseline. This setup covers the training beyond the IL phase for LILAC, DBS, RA, and Only IL as well as the Only AC baseline. Across all the methods we ensure that the total number of training epochs, when all the labels in the dataset are known, is held constant.
Table 4: List of hyper-parameters used in batch learning. Note: All experiments used the SGD optimizer.
Table 5: (Top) Varying E, the fixed training interval size in the IL phase, shows a dataset specific behaviour, with the dataset with fewer labels preferring a larger number of epochs while the dataset with more labels prefers a smaller number of epochs. (Bottom) Comparing random label ordering and difficulty-based label ordering to the ascending order assumption used throughout our experiments, we observe no preference for any ordering pattern.
Epochs in Training Interval When we vary E, the fixed training interval size in the IL phase, we observe a dataset specific behaviour. For datasets with a smaller number of total labels, a larger number of epochs provides better performance while for datasets with more labels, a smaller number of epochs yields better performance. While the alternate learning rate can have a huge impact on this performance, pacing the introduction of new labels, according to the empirical results, can have a tremendous impact on subsequent hyper-parameters used in LILAC.
Figure 5: Unsupervised classification performance on representations collected from LILAC easily outperforms those collected from Batch Learning and Only IL methods. The plots on the left show the common learning trend between all baselines after IL while plots on the right show steady improvement in performance after applying AC when compared to the baselines.
Label Order In Table 5, we compare three different orders of label introduction during the IL phase, 1) random label order, 2) difficulty-based label order, and 3) ascending label order. Here, difficulty-based label order is obtained from the overall classification scores per label, obtained from the features of a trained model. Although these three orders do not constitute the exhaustive set of possible label orderings, within these three possibilities there is no definitive order that boosts the performance of LILAC consistently. Thus, we employ ascending label order throughout our work.
NOTE: Only IL baseline is used throughout Table 5.
We include unsupervised clustering performance for CIFAR-10 and STL-10 using k-means and the Hungarian job assignment algorithm [21] in Fig. 5. They follow similar patterns to their supervised counterparts.
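The role of the assignment step can be illustrated with a small sketch. It assumes cluster indices have already been produced (for example by k-means on the extracted features) and scores them against GT labels under the best one-to-one cluster-to-label mapping. For brevity the sketch searches all permutations by brute force, which is only practical for a handful of classes; the Hungarian algorithm reaches the same optimum in polynomial time. The function name and the toy data are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(cluster_ids, labels, num_classes):
    """Best accuracy over one-to-one cluster-to-label assignments.

    Brute force over permutations, used here as a stand-in for the
    Hungarian algorithm on small problems.
    """
    # counts[c][l] = number of samples in cluster c with GT label l
    counts = [[0] * num_classes for _ in range(num_classes)]
    for c, l in zip(cluster_ids, labels):
        counts[c][l] += 1
    best = max(
        sum(counts[c][perm[c]] for c in range(num_classes))
        for perm in permutations(range(num_classes))
    )
    return best / len(labels)


clusters = [0, 0, 1, 1, 2, 2, 2]
labels = [1, 1, 2, 2, 0, 0, 1]  # toy labels, not data from the experiments
print(clustering_accuracy(clusters, labels, 3))
```

In the toy case the best mapping sends cluster 0 to label 1, cluster 1 to label 2 and cluster 2 to label 0, which matches six of the seven samples.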