Transfer Learning using Neural Ordinary Differential Equations

2020·Arxiv

Abstract

We introduce the concept of using Neural Ordinary Differential Equations (NODE) for transfer learning. In this paper we use EfficientNets to explore transfer learning on the CIFAR-10 dataset, with NODE used for fine-tuning the model. Using NODE for fine-tuning provides more stability during training and validation. These continuous-depth blocks also allow a trade-off between numerical precision and speed. We conclude that using Neural ODEs for transfer learning results in much more stable convergence of the loss function.

Index Terms: Transfer Learning, Neural Ordinary Differential Equations (NODE), Image Classification, EfficientNet

I. INTRODUCTION

Image classification is one of the fundamental tasks in computer vision, and the accuracy of image classification models has improved significantly since the introduction of convolutional neural networks (CNNs). AlexNet [9] and GoogLeNet [17] showed that deeper and larger neural network models perform better at image classification. CNNs learn by feature extraction, and features extracted by a model trained on one dataset can be reused by another model performing a similar task.

Given the enormous amount of resources needed to train computer vision models, transfer learning is a very popular technique in deep learning. Transfer learning significantly reduces training time and gives better results compared to conventional training techniques. While most machine learning algorithms are designed to address single tasks, the development of algorithms that facilitate transfer learning is a topic of ongoing interest in the machine-learning community.

In this paper we study the use of NODE for transfer learning, with an EfficientNet model as the backbone and ImageNet weights as the pretrained weights. The intuition behind this is that the brain is also considered to be a continuous-time system.

The remainder of the paper is structured as follows. Section II reviews related work. Section III explains the proposed method and the details of the experiments. Section IV presents the results and conclusions, and Section V outlines future work.

II. RELATED WORK

EfficientNets [19] are a family of models that can be systematically scaled up based on the resources available. This family of models achieves better accuracy than traditional ConvNets. The models are scaled up in a principled way: a balance between width, depth and resolution is achieved by simply scaling each of them with a constant ratio. For instance, if $2^N$ times more resources are available, the depth could be increased by $\alpha^N$, the width by $\beta^N$ and the image size by $\gamma^N$, where $\alpha$, $\beta$ and $\gamma$ are constant coefficients. EfficientNets transfer well to datasets such as CIFAR-100 [8] and Flowers, with fewer parameters. Tan et al. [19] examined eight models, EfficientNetB0 to EfficientNetB7, for their efficiency and performance.
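As a small worked example of this compound scaling rule, the sketch below computes the depth, width and image-size multipliers for a given exponent N. The base coefficients 1.2, 1.1 and 1.15 are the values reported for the EfficientNet baseline in [19]; the function names are illustrative and not part of any library.

```python
# Compound scaling sketch. ALPHA, BETA and GAMMA are the base coefficients
# reported in [19]; the helper names below are illustrative only.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, image-size bases


def compound_scale(n):
    """Return (depth, width, image size) multipliers for exponent n."""
    return ALPHA ** n, BETA ** n, GAMMA ** n


def cost_factor(n):
    """Approximate compute growth: depth * width^2 * resolution^2."""
    depth, width, size = compound_scale(n)
    return depth * width ** 2 * size ** 2


for n in range(4):
    depth, width, size = compound_scale(n)
    print(f"N={n}: depth x{depth:.2f}, width x{width:.2f}, "
          f"image size x{size:.2f}, compute x{cost_factor(n):.2f}")
```

With these coefficients, each increment of N roughly doubles the compute cost, since 1.2 · 1.1² · 1.15² ≈ 1.92.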

Neural Ordinary Differential Equations (NODE) [3] are a family of neural networks in which a discrete sequence of hidden layers need not be specified; instead, the derivative of the hidden state is parameterized using a neural network. Networks such as ResNets [6] and Recurrent Neural Networks [13] can be modelled as continuous transforms using NODE. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed.
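The link between residual networks and continuous transforms can be made explicit with a short derivation, following the formulation in [3]. Writing the derivative of the hidden state as $f(h(t), t, \theta)$, one explicit Euler step of size $\Delta t$ gives the first line below; setting $\Delta t = 1$ and allowing per-layer parameters $\theta_t$ recovers the familiar residual update in the second line.

```latex
\begin{align}
  h(t + \Delta t) &= h(t) + \Delta t \, f(h(t), t, \theta) \\
  h_{t+1} &= h_t + f(h_t, \theta_t)
\end{align}
```

In this reading, a residual block is a fixed-step discretization of the continuous dynamics, and a NODE block replaces the fixed step with whatever steps the ODE solver chooses.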

Chen et al. [3] use adaptive step-size solvers to solve ODEs reliably. Gradients are computed with the adjoint method [12], which gives the network a memory cost of O(1). A network with the same architecture, but with gradients backpropagated directly through a Runge-Kutta integrator, is also tested; it is referred to as RK-Net and has O(L) memory cost, where L is the number of layers in the network. The continuous dynamics of the hidden units are parameterized by an ordinary differential equation (ODE) specified by a neural network:

$$\frac{dh(t)}{dt} = f(h(t), t, \theta)$$

where the input layer is h(0) and the output layer h(T) is defined as the solution of this ODE initial value problem at some time T.
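To make the initial value problem concrete, the following minimal sketch integrates dh/dt = f(h, t) from the input h(0) to the output h(T) with a fixed-step fourth-order Runge-Kutta scheme. The linear toy dynamics stand in for the neural network f and are an assumption made so that the result can be checked against a closed-form solution; the function names are hypothetical.

```python
import math


def rk4_solve(f, h0, t0, t1, steps):
    """Integrate dh/dt = f(h, t) from t0 to t1 with fixed-step RK4."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        k1 = f(h, t)
        k2 = f(h + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = f(h + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = f(h + dt * k3, t + dt)
        h += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += dt
    return h


THETA = -1.0  # single toy parameter, standing in for network weights


def dynamics(h, t):
    """Toy dynamics f(h, t) = THETA * h, with solution h0 * exp(THETA * t)."""
    return THETA * h


h_T = rk4_solve(dynamics, h0=1.0, t0=0.0, t1=1.0, steps=20)
print(h_T, math.exp(-1.0))  # numerical output layer vs. exact solution
```

In an actual NODE block the state h would be a feature tensor and f a small trainable network, but the role of the solver is the same.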

Because numerical instability is an issue with ODEs, augmented versions of Neural ODE networks have been developed [5] [20]. Among the many extensions of Neural ODEs, one approach allows the neural network parameters to evolve as well, in a coupled ODE-based formulation [20]. Augmented Neural ODEs have also been proposed which, in addition to being more expressive than traditional Neural ODEs, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs [5].

Stability while training deep neural networks is important, as it consistently offers improved robustness against a broader range of distortion strengths and types unseen during training, a considerably smaller hyperparameter dependence and fewer potentially negative side effects compared to data augmentation [10].

Stability during training is also achieved using different activation functions such as bounded Rectified Linear Unit (ReLU), bounded leaky ReLU, and bounded bi-firing [11].
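As a brief illustration of a bounded activation, the sketch below clips the ReLU output to an upper bound. The bound value used here is a hypothetical choice for illustration and is not taken from [11].

```python
def bounded_relu(x, upper=1.0):
    """ReLU clipped to the interval [0, upper]; `upper` is illustrative."""
    return min(max(0.0, x), upper)


samples = [-2.0, 0.5, 3.0]
print([bounded_relu(x) for x in samples])
```

Because the output cannot grow past the bound, large pre-activations cannot propagate arbitrarily large values to later layers.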

There are many domains where well-annotated data is not easy to obtain because of data acquisition expenses. Data collection is complex and expensive, which makes it extremely difficult to build a large-scale, high-quality annotated dataset. Transfer learning addresses this problem, as it relaxes the hypothesis that the training data must be independent and identically distributed with the test data [18].

III. PROPOSED METHOD

In the proposed method, EfficientNetB0 from the EfficientNet family is used as the base model. A NODE layer is added before the final layer for fine-tuning on the CIFAR-10 dataset [8].

Although it increases the training time per epoch, the NODE block is added to gain stability in this process. Two ODE solvers for NODE, the Runge-Kutta method [16] and the modern adjoint sensitivity method [15], are used and compared. The default implementation tf.contrib.integrate.odeint is used to solve the ordinary differential equation initial value problems; this default method uses a Runge-Kutta solver. The adjoint sensitivity method is also used for better memory efficiency. The adjoint method is computationally efficient and numerically much more stable.
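The memory advantage of the adjoint sensitivity method comes from recomputing the hidden state backwards in time instead of storing it. The sketch below illustrates this on a scalar toy problem, assuming dynamics f(h, θ) = θh and loss L = h(T): the state h, the adjoint a = dL/dh and the accumulated parameter gradient g are integrated together from T back to 0. This is a hand-written illustration with hypothetical names and a fixed-step RK4 integrator, not the implementation of [3].

```python
import math


def rk4_step(f, y, t, dt):
    """One classical RK4 step for a list-valued state y."""
    def shifted(k, scale):
        return [yi + scale * ki for yi, ki in zip(y, k)]
    k1 = f(y, t)
    k2 = f(shifted(k1, 0.5 * dt), t + 0.5 * dt)
    k3 = f(shifted(k2, 0.5 * dt), t + 0.5 * dt)
    k4 = f(shifted(k3, dt), t + dt)
    return [yi + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def integrate(f, y, t0, t1, steps):
    """Fixed-step integration; dt is negative when t1 < t0."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        y = rk4_step(f, y, t, dt)
        t += dt
    return y


theta, h0, T = 0.5, 2.0, 1.0  # toy parameter, input state, end time

# Forward pass: only the final state h(T) is kept in memory.
(h_T,) = integrate(lambda y, t: [theta * y[0]], [h0], 0.0, T, 100)


def augmented(y, t):
    """Backward dynamics of the augmented state [h, a, g]."""
    h, a, g = y
    return [theta * h,    # dh/dt = f(h, theta)
            -a * theta,   # da/dt = -a * df/dh
            -a * h]       # dg/dt = -a * df/dtheta


# Backward pass from t = T to t = 0, starting at a(T) = dL/dh(T) = 1.
h_back, grad_h0, grad_theta = integrate(augmented, [h_T, 1.0, 0.0], T, 0.0, 100)

exact_grad_theta = T * h0 * math.exp(theta * T)  # d/dtheta of h0*exp(theta*T)
print(grad_theta, exact_grad_theta)
```

No intermediate activations are stored between the two passes, which is the source of the constant memory cost noted in Section II.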

When using the Runge-Kutta (RK) method, the relative and absolute tolerance limits are user-defined hyperparameters that can be varied to achieve optimal performance. Setting both of these parameters to the same value gave the best results; the parameters set for the RK solver are reported in Table III. The tolerance limit controls the trade-off between accuracy and computational cost.
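The effect of the tolerance limits can be illustrated with a small adaptive solver. The sketch below uses a simple embedded Heun-Euler pair rather than the higher-order Runge-Kutta scheme of a production solver, and a toy decay equation in place of the network dynamics; the tolerance values and all names are assumptions chosen for illustration, not the settings used in the experiments.

```python
import math


def adaptive_heun(f, h0, t0, t1, rtol, atol):
    """Integrate dh/dt = f(h, t) with an adaptive Heun-Euler pair.

    Returns (h(t1), number of evaluations of f).
    """
    h, t = h0, t0
    dt = (t1 - t0) / 10.0                    # initial step guess
    nfev = 0
    while t1 - t > 1e-12:
        dt = min(dt, t1 - t)                 # do not step past t1
        k1 = f(h, t)
        k2 = f(h + dt * k1, t + dt)
        nfev += 2
        h_low = h + dt * k1                  # Euler, order 1
        h_high = h + 0.5 * dt * (k1 + k2)    # Heun, order 2
        err = abs(h_high - h_low)            # local error estimate
        tol = atol + rtol * max(abs(h), abs(h_high))
        if err <= tol:                       # accept the step
            h, t = h_high, t + dt
        # Grow or shrink the step, with a safety factor and clamping.
        scale = 0.9 * math.sqrt(tol / err) if err > 0.0 else 2.0
        dt *= min(2.0, max(0.2, scale))
    return h, nfev


def decay(h, t):
    """Toy dynamics with exact solution exp(-t) for h(0) = 1."""
    return -h


exact = math.exp(-1.0)
loose, loose_nfev = adaptive_heun(decay, 1.0, 0.0, 1.0, rtol=1e-2, atol=1e-2)
tight, tight_nfev = adaptive_heun(decay, 1.0, 0.0, 1.0, rtol=1e-6, atol=1e-6)
print("loose:", loose_nfev, "evaluations, error", abs(loose - exact))
print("tight:", tight_nfev, "evaluations, error", abs(tight - exact))
```

Tightening the tolerances reduces the error but requires many more function evaluations, which is the accuracy versus cost trade-off described above.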

For the adjoint sensitivity method, the default parameters provided by Chen et al. [3] are used.

The proposed model has two variations: one with the RK solver, which is run for 100 epochs, and one with the modern adjoint sensitivity method, which is run for 160 epochs.

In the proposed method, the EfficientNetB0 model is first run for 200 epochs to obtain the desired accuracy, as reported in the performance table (Table I). The NODE block is then added to this model, which is run until the desired validation accuracy is observed. The same validation accuracy is obtained after 100 epochs with the RK solver and after 160 epochs with the adjoint sensitivity method.

For the initial model without NODE and the proposed model with the RK solver, the Adam optimizer [7] was used. For the proposed model with the adjoint method, stochastic gradient descent [2] performed better than the Adam optimizer.

Using ImageNet [4] weights for training the model enables quicker convergence as the features learnt by the pre-trained model are common to the image classification task.

The idea behind using NODE instead of a fully connected layer is that it potentially reduces the number of parameters, as the hidden blocks are now continuous functions of time. One can also tune the tolerance parameter for a speed/accuracy trade-off.
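The potential parameter saving can be illustrated with simple arithmetic. The sketch below compares a stack of fully connected layers, each with its own weights, against a single dynamics function whose weights are reused at every solver step. The width and depth are hypothetical values chosen for illustration and are not the configuration used in the experiments.

```python
def dense_params(d_in, d_out):
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


width = 1280  # hypothetical hidden width
depth = 4     # hypothetical number of stacked layers

# Discrete stack: every layer has its own weight matrix and bias.
stacked = depth * dense_params(width, width)

# Continuous block: one dynamics function, reused by the ODE solver.
node = dense_params(width, width)

print(f"stacked layers: {stacked:,} parameters")
print(f"single NODE dynamics: {node:,} parameters")
```

Under these assumptions the continuous block uses a quarter of the parameters, at the price of more function evaluations per forward pass.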

Fig. 1. Model Visualization

IV. RESULTS AND CONCLUSION

It is observed that using NODE before the final layer provides better stability during training and validation. Table I shows the results for the model with NODE at the end and for the model without NODE (purely EfficientNetB0 with just a final fully connected layer). Results for both variations of the proposed model are shown.

The model with EfficientNetB0 and a final fully connected layer is trained for 200 epochs.

The proposed model with the RK solver is trained for 100 epochs, while the model with the adjoint solver is trained for 160 epochs. It is observed that in both variants the proposed model converges to the desired accuracy and loss much more quickly than the model without NODE. Performance and stability are enhanced in the proposed model.

Figures 2, 3, 4 and 5 show the training accuracy, training loss, validation accuracy and validation loss respectively for the model trained with an EfficientNetB0 base and a fully connected final layer. It is observed that there is considerable fluctuation and instability during training, and the validation curves are also not stable. By just adding the NODE block at the end, before the final layer, the training process stabilizes.

Figures 6, 7, 8 and 9 depict the accuracy and loss graphs of the proposed model with the RK solver.

Figure 6 depicts the training accuracy, which easily reaches about 98.5% in just 100 epochs. Figure 7 shows the steady decrease in the loss of the proposed model. Figure 8 depicts the validation accuracy, which is observed to be very stable in comparison to the previous model. Figure 9 shows the corresponding validation loss.

Figures 10, 11 and 12 show the training accuracy, training loss and validation accuracy of the proposed model with the adjoint sensitivity method. A training accuracy of 99.2% and a validation accuracy of 85.3% are observed.

The adjoint sensitivity method provides even more stability than the Runge-Kutta Solver. It also provides better validation accuracy, and is quicker in convergence.

Table I shows the performance of the proposed model on the train, validation and test sets, in comparison to a model with just EfficientNetB0.

Table III shows the parameters set for the proposed model with the RK solver.

Table II shows the number of epochs for which both models were trained and the corresponding training time. Although the proposed model takes more time per epoch, the total time taken to converge is much lower.

The EfficientNetB0 model and the proposed model with the RK solver were both developed in Keras with a TensorFlow [1] backend. The proposed model with the adjoint sensitivity method was developed in PyTorch [14].

All the above models were trained on GTX 1080 Ti GPUs with 8GB memory.

Fig. 2. Training accuracy

Fig. 3. Training loss

Fig. 4. Validation accuracy

Fig. 5. Validation loss

Fig. 6. Training accuracy (proposed model with RK solver)

Fig. 7. Training loss (proposed model with RK solver)

Fig. 8. Validation accuracy (proposed model with RK solver)

Fig. 9. Validation loss (proposed model with RK solver)

Fig. 10. Training accuracy (proposed model with adjoint solver)

Fig. 11. Training loss (proposed model with adjoint solver)

Fig. 12. Validation accuracy (proposed model with adjoint solver)

TABLE I PERFORMANCE

TABLE II TIME TAKEN

TABLE III PARAMETERS SET

V. FUTURE WORK

Using continuous-depth models for transfer learning gives more freedom to trade off accuracy against training time by tuning the tolerance values.

Other solvers can be explored and examined. Exploring different methods may identify solvers with better numerical stability and precision. NODE can also be used for transfer learning with different network backbones to see which networks perform better. With further optimization of the model, NODE could become a benchmark for transfer learning on most of the common datasets used in deep learning.

REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[3] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pages 6571–6583, 2018.

[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

[5] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural odes. arXiv preprint arXiv:1904.01681, 2019.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[7] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[8] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[10] Jan Laermann, Wojciech Samek, and Nils Strodthoff. Achieving generalizable robustness of deep neural networks by stability training. arXiv preprint arXiv:1906.00735, 2019.

[11] Shan Sung Liew, Mohamed Khalil-Hani, and Rabia Bakhteri. Bounded activation functions for enhanced training stability of deep neural networks on visual pattern recognition problems. Neurocomputing, 216:718–734, 2016.

[12] Valdemar Melicher, Tom Haber, and Wim Vanroose. Fast derivatives of likelihood functionals for ode based models using adjoint-state method. Computational Statistics, 32(4):1621–1643, 2017.

[13] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association, 2010.

[14] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[15] Lev Semenovich Pontryagin. Mathematical theory of optimal processes. Routledge, 2018.

[16] Michael Schober, David K Duvenaud, and Philipp Hennig. Probabilistic ode solvers with runge-kutta means. In Advances in neural information processing systems, pages 739–747, 2014.

[17] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

[18] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279. Springer, 2018.

[19] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

[20] Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, and Michael Mahoney. Anodev2: A coupled neural ode evolution framework. arXiv preprint arXiv:1906.04596, 2019.