Knowledge distillation for optimization of quantized deep neural networks

2019·Arxiv

ABSTRACT

Knowledge distillation (KD), a technique that uses a pre-trained teacher model to train a student network, is exploited for the optimization of quantized deep neural networks (QDNNs). We consider the choice of the teacher network and investigate the effect of the hyperparameters of KD. We have tried several large floating-point models and quantized ones as the teacher. The experiments show that the softmax distribution produced by the teacher network is more important than its performance for effective KD training. Since the softmax distribution of the teacher network can be controlled by KD's hyperparameters, we analyze the interrelationship of each KD component for quantized DNN training. We show that even a small teacher model can achieve the same distillation performance as a larger teacher model. We also propose the gradual soft loss reducing (GSLR) technique for robust KD-based QDNN optimization, which controls the mixing ratio of hard and soft losses during training.

Index Terms— Deep neural network, quantization, knowledge distillation, fixed-point optimization

1. INTRODUCTION

Deep neural networks (DNNs) usually require a large number of parameters, so the model size must be reduced to run them on embedded systems. Quantization is a widely used compression technique, and even 1- or 2-bit models can show quite good performance. However, the model must be trained very carefully to avoid losing performance when only low-precision arithmetic is allowed. Many QDNN papers have suggested various types of quantizers or complex training algorithms [1, 2, 3, 4, 5].

Knowledge distillation (KD) trains small networks using larger networks for better performance [6, 7]. KD employs the soft label generated by the teacher network to train the student network. Leveraging the knowledge contained in previously trained networks has attracted attention in many applications for model compression [8, 9, 10, 11] and learning algorithms [12, 13, 14, 15]. Recently, the use of KD for the training of QDNN has been studied [16, 17]. However, there are many design choices to explore when applying KD to QDNN training. The work in [16] studied the effects of simultaneous training or pre-training in teacher model design. The result is rather expected: employing a pre-trained teacher model is advantageous when considering the performance of the student network. However, [18] mentioned that a teacher network that is too large does not help improve the performance of the student model.

In this work, we exploit KD with various types of teacher networks, including a full-precision model [16, 17], a quantized one, and a teacher-assistant-based one [18]. The analysis results indicate that, rather than the type of the model, the distribution of the soft label is critical to the performance improvement of the student network. Since the distribution of the soft label can be controlled by the temperature and the size of the teacher network, we show how a well-selected temperature can improve the QDNN performance dramatically even with a small teacher network. Further, we suggest a simple KD training scheme that adjusts the mixing ratio of hard and soft losses during training to obtain stable performance improvements. We call it the gradual soft loss reducing (GSLR) technique. GSLR employs both soft and hard losses equally at the beginning of the training, and gradually reduces the ratio of the soft loss as the training progresses.

This paper is organized as follows. Section 2 describes how the QDNN can be trained with KD and explains why the hyperparameters of KD are important. Section 3 shows the experimental results and we conclude the paper in Section 4.

2. QUANTIZED DEEP NEURAL NETWORK TRAINING USING KNOWLEDGE DISTILLATION

In this section, we first briefly describe the conventional neural network quantization method and also depict how QDNN training can be combined with KD. We also explain the hyperparameters of KD and their role in QDNN training.

2.1. Quantization of deep neural networks and knowledge distillation

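The deep neural network parameter vector, $w$, can be expressed in $2^b$ levels when quantized in $b$ bits. Since we usually use a symmetric quantizer, the quantized weight vector $Q(w)$ can be represented using (1) or (2) for the case of $b = 1$ or $b > 1$ as follows:

$$Q(w) = \Delta \cdot \mathrm{sign}(w) \qquad (1)$$

$$Q(w) = \mathrm{sign}(w) \cdot \Delta \cdot \min\left( \left\lfloor \frac{|w|}{\Delta} + 0.5 \right\rfloor, \frac{M - 1}{2} \right) \qquad (2)$$

where $M$ is the number of quantization levels ($M = 2^b - 1$) and $\Delta$ represents the quantization step size. $\Delta$ can be computed by L2-error minimization between floating and fixed-point weights or by the standard deviation of the weight vector [2, 19, 20].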
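As an illustration of the symmetric quantizer described above, the following is a minimal sketch in plain Python. It assumes $M = 2^b - 1$ levels for $b > 1$ and a step size that is supplied by the caller; the function name `quantize` and its arguments are hypothetical and not taken from the original implementation.

```python
import math


def quantize(weights, bits, step):
    """Symmetric uniform quantizer applied element-wise (illustrative sketch).

    bits == 1: binary weights in {-step, +step}.
    bits  > 1: M = 2**bits - 1 levels (assumed), i.e. integer multiples of
               `step` clipped to the range [-(M - 1) / 2, +(M - 1) / 2].
    """
    def sign(x):
        return 1.0 if x >= 0 else -1.0

    if bits == 1:
        return [step * sign(w) for w in weights]

    max_level = (2 ** bits - 2) // 2  # (M - 1) / 2 with M = 2**bits - 1
    quantized = []
    for w in weights:
        # Round |w| / step to the nearest integer level, then clip.
        level = min(math.floor(abs(w) / step + 0.5), max_level)
        quantized.append(sign(w) * step * level)
    return quantized


# 2-bit example: every weight maps to one of -0.5, 0.0, +0.5.
print(quantize([0.26, -0.9, 0.04, 0.6], bits=2, step=0.5))
```

The step size is passed in explicitly here; choosing it by L2-error minimization or from the standard deviation of the weights is left outside the sketch.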

Severe quantization such as 1- or 2-bit frequently incurs large performance degradation. Retraining is widely used to minimize the performance loss [21]. When retraining the student network, the forward, backward, and gradient computations should be conducted using quantized weights, but the computed gradients must be added to the full-precision weights [1, 2, 3, 4, 5].
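The retraining rule above (compute with quantized weights, accumulate updates in full-precision weights) can be sketched on a toy linear model. This is an illustrative example only: the model $y = Q(w) \cdot x$, the squared-error loss, and the names `quantize` and `retrain_step` are assumptions made for the sketch, not the setup used in the experiments.

```python
def sign(x):
    return 1.0 if x >= 0 else -1.0


def quantize(w, step, max_level=1):
    """Symmetric quantizer with levels {-1, 0, +1} * step (2-bit style)."""
    return [sign(v) * step * min(int(abs(v) / step + 0.5), max_level) for v in w]


def retrain_step(w_full, x, target, step, lr):
    """One SGD step for y = Q(w) . x with loss 0.5 * (y - target) ** 2."""
    w_q = quantize(w_full, step)                 # forward pass uses quantized weights
    y = sum(wq * xi for wq, xi in zip(w_q, x))
    err = y - target
    grad = [err * xi for xi in x]                # gradient evaluated at Q(w)
    # The update is added to the full-precision copy, not to Q(w).
    return [w - lr * g for w, g in zip(w_full, grad)]


w = [0.3, -0.1]
for _ in range(3):
    w = retrain_step(w, x=[1.0, 2.0], target=1.0, step=0.5, lr=0.1)
print(w, quantize(w, 0.5))
```

Keeping the full-precision copy lets small gradient contributions accumulate until a weight crosses a quantization threshold, which would be impossible if the update were applied directly to the quantized values.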

The probability computation in deep neural networks usually employs the softmax layer. The logit, $z$, is fed into the softmax layer, which generates the probability of each class, $p_i$, as

$$p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)} \qquad (3)$$

where $\tau$ is a hyperparameter of KD known as the temperature. A high value of $\tau$ softens the probability distribution. KD employs the probability generated by the teacher network as a soft label to train the student network, and the following loss function is minimized during training:

$$\mathcal{L}(w_S) = (1 - \lambda)\, \mathcal{H}(y, p^S) + \lambda\, \mathcal{H}(p^T, p^S) \qquad (4)$$

where $\mathcal{H}(\cdot)$ denotes a loss function, $y$ is the ground truth hard label, $w_S$ is the weight vector of the student network, $p^T$ and $p^S$ are the probabilities of the teacher and student networks, and $\lambda$ is a loss weighting factor for adjusting the ratio of soft and hard losses.
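The temperature softmax and the mixed hard/soft loss can be sketched as follows. This is a minimal illustration under stated assumptions: cross-entropy is used as the loss function, the soft term is weighted by `lam` and the hard term by `1 - lam`, the hard term uses temperature 1, and no extra scaling of the soft term is applied. The function names are hypothetical.

```python
import math


def softmax(logits, tau=1.0):
    """Softmax with temperature tau; a larger tau gives a flatter distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_i target_i * log(pred_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def kd_loss(student_logits, teacher_logits, hard_label, tau, lam):
    """(1 - lam) * hard loss + lam * soft loss (assumed weighting convention)."""
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, tau), softmax(student_logits, tau))
    return (1.0 - lam) * hard + lam * soft


student = [2.0, 1.0, 0.1]
teacher = [3.0, 1.0, 0.2]
print(softmax(teacher, tau=1.0))
print(softmax(teacher, tau=5.0))
print(kd_loss(student, teacher, hard_label=0, tau=5.0, lam=0.5))
```

Setting `lam` to 0 recovers conventional hard-label training, while a larger `tau` spreads the teacher's probability mass over the non-target classes.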

A recent paper [18] suggests using teacher, teacher-assistant, and student models because the effect of KD gradually decreases when the size difference between the teacher and student networks becomes too large. This performance degradation is due to the capacity limitation of the student model. Since QDNN limits the representation levels of the weight parameters, the capacity of a quantized network is reduced compared with the full-precision model. Therefore, QDNN training with KD is more sensitive to the size of the teacher network. We consider the optimization of the three hyperparameters described above (temperature, loss weighting factor, and size of the teacher network). Algorithm 1 describes how to train QDNN with KD.

2.2. Teacher model selection for KD

In this section, we try to find the best teacher model for QDNN training with KD. We consider three different approaches. The first is training the full-precision teacher and student networks independently and applying KD when fine-tuning the quantized student model, as suggested in [16, 17]. The second is training a medium-sized teacher assistant network with a very large teacher model, and then optimizing the student network using the teacher assistant model, as suggested by [18]. The last approach is using a quantized teacher model, with the possibility that the student learns something about quantization.

Table 1. Train and test accuracies of the quantized ResNet20 trained with various KD methods on the CIFAR-10 dataset. ‘TL’, ‘T’, ‘S’, ‘(F)’, and ‘(Q)’ denote large teacher, teacher, student, full-precision, and quantized, respectively. HD is conventional training using hard loss. $\tau$ represents the temperature. Note that all the student networks are 2-bit QDNNs and the results are the average of five runs.

Fig. 1. Example of the softmax distribution for label 6 from the teacher models in Table 1. The numbers in square brackets are the CIFAR-10 test accuracies of the student networks trained with each teacher model.

Table 1 compares the results of these three approaches. Figure 1 also shows the softmax distributions when the teacher models and temperatures vary. The test accuracy of the quantized ResNet20 trained using hard loss was 91.71%. The results using various KD approaches indicate the following. First, whether the teacher network is quantized or not, the performance of the student network is not much different. Secondly, employing the teacher assistant network [18] does not help increase the performance; the performance is similar to that of general KD. Thirdly, KD training with a full-precision teacher network [16, 17] is significantly better than conventional training when $\tau$ is 5, but no performance increase is observed when $\tau$ is 2. Lastly, the teacher models that achieve student network accuracies of 92.14%, 92.18%, and 92.25% have similar softmax distributions. However, T(F)-S with $\tau = 2$ produces a quite sharp softmax shape, and the resulting performance is similar to that of hard-target training.

These points indicate that the softmax distribution is the key to effective KD training. Although different teacher models generate dissimilar softmax distributions, we can control their shapes by using the temperature. A detailed discussion about hyperparameters is provided in the following subsections.

Table 2. Train and test accuracies (%) of the teacher networks on the CIFAR-10 and the CIFAR-100 datasets. ‘WRN20xN’ denotes WideResNet20 with a wide factor of N.

Table 3. Training results of full-precision and 2-bit quantized ResNet20 on CIFAR-10 and CIFAR-100 datasets in terms of accuracy (%). The models are trained with hard loss only.

2.3. Discussion on hyperparameters of KD

As mentioned in Sections 2.1 and 2.2, the hyperparameters temperature ($\tau$), loss weighting factor ($\lambda$), and size of the teacher network (N) can significantly affect the QDNN performance. Previous works usually fixed these hyperparameters when training QDNN with KD. For example, [16] always fixes $\tau$ to 1, and [17] holds it at 1 or 5 depending on the dataset. However, these three parameters are closely interrelated. For example, [18] points out that when the teacher model is very large compared to the student model, the softmax information produced by the teacher network becomes sharper, making it difficult to transfer the knowledge of the teacher network to the student model. However, even in this case, controlling the temperature may make the transfer possible. Therefore, when the value of one hyperparameter is changed, the others also need to be adjusted carefully. Thus, we empirically analyze the effect of KD’s hyperparameters. In addition, we introduce the gradual soft loss reducing (GSLR) technique, which helps improve the performance of QDNN dramatically. GSLR is a KD training method that gradually increases the reflection ratio of the hard loss.

3. EXPERIMENTAL RESULTS

3.1. Experimental setup

Dataset: We employ CIFAR-10 and CIFAR-100 datasets for experiments. CIFAR-10 and CIFAR-100 consist of 10 and 100 classes, respectively. Both datasets contain 50K training images and 10K testing images. The size of each image is 32x32 with RGB channels.

Model configuration & training hyperparameter: To analyze the impact of hyperparameters of KD on QDNN training, we train WideResNet20xN (WRN20xN) [22] as the teacher networks, where N is set to 1, 1.2, 1.5, 1.7, 2, 3, 4, 5, and 10. When N is 1, the network structure is the same as ResNet20 [23]. All the train and test accuracies of the teacher networks on the CIFAR-10 and CIFAR-100 datasets are reported in Table 2. We employ ResNet20 as the student network for both the CIFAR-10 and CIFAR-100 datasets. If the network is large enough for the size of the dataset, that is, over-parameterized, most quantization methods work well [21]. Therefore, to evaluate a quantization algorithm, we need to employ a small network that is located in the under-parameterized region [21, 24]. Although the full-precision ResNet20 model is over-parameterized, meaning that it reaches near 100% training accuracy, the 2-bit network becomes under-parameterized on the CIFAR-10 dataset. Likewise, on the CIFAR-100 dataset, both the full-precision and the quantized models are under-parameterized. Thus, it is a good network configuration for evaluating the effect of KD on QDNN training. We report the train and test accuracies for ResNet20 on CIFAR-10 and CIFAR-100 in Table 3.

Table 4. Results of QDNN training with KD on ResNet20 for the CIFAR-10 and CIFAR-100 datasets. ‘WRN’, ‘RN’, ‘SM’, and ‘DS’ represent WideResNet, ResNet, student model, and deeper student, respectively.

3.2. Results

We compare our models with the previous works in Table 4. The compared QDNN models trained with KD include QDistill [17], Apprentice [16], and Guided [25]. Our results significantly exceed those of previous studies. We compare our model (0.27M) with the ‘student model 2’ (SM 2) of QDistill, which has 0.3M parameters, and achieve an 18.32% gap in test accuracy. Also, our model is about 1% better than the ResNet20 result reported by Apprentice and even achieves the same performance as their ResNet32 result. When quantized to 1-bit, a test accuracy of 91.3% is obtained, which is almost the same as Apprentice’s ResNet20 2-bit model. In the case of CIFAR-100, the QDistill and Guided student models use a considerably large number of parameters, 17.2M and 22.0M, respectively. Our student model contains only 0.28M parameters but achieves 17.7% and 2.4% higher accuracies than QDistill and Guided, respectively. These huge performance gaps show the importance of selecting the proper hyperparameters.

Fig. 2. Results of 2-bit ResNet20 trained while varying the temperature ($\tau$) and the size of the teacher network on the CIFAR-10 and the CIFAR-100 datasets. The numbers on the x-axis represent the wide factor (N) for WideResNet20xN.

3.3. Model size and temperature

We report the test accuracies of 2-bit ResNet20 on CIFAR-10 in Figure 2 (a). To demonstrate the effect of the temperature on QDNN training, we train 2-bit ResNet20 while varying the size of the teacher network from ‘WRN20x1’ to ‘WRN20x5’. Each experiment was conducted for three values of $\tau$, namely 1, 5, and 10, which correspond to small, medium, and large temperatures, respectively. Note that WRN20xN contains N times as many channel maps. When the value of $\tau$ is small (blue line in the figure), the performance greatly depends on N, the teacher model size. The performance change is much reduced as the value of $\tau$ increases to the medium (orange line) or the large value (blue line). This is related to the accuracy of the teacher model (red line). When the size of the teacher model increases, the shape of the soft label becomes similar to that of the hard label. In this case, the KD training results are not much different from those obtained with the hard label. Therefore, with a small $\tau$, the performance decreases to 91.9% when the teacher network becomes larger than WRN20x2. This result is similar to the performance of a 2-bit ResNet20 trained with the hard loss (91.71%). The soft label needs to have a broad shape, which can be achieved either by increasing the temperature or by limiting the size of the teacher network. A similar problem can occur for full-precision model KD training, but it is more important for QDNN since the model capacity is reduced due to quantization. Therefore, when training QDNN with KD, we need to consider the relationship between the size of the teacher model and the temperature.
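The interplay between teacher confidence and temperature can be illustrated numerically. The sketch below uses made-up logits, and models a "larger" teacher simply as one whose logits are scaled up by a constant factor; this is an assumption chosen for illustration, not a measurement from the trained WRN20xN networks. The entropy of the soft label is used as a rough measure of how broad it is.

```python
import math


def softmax(logits, tau):
    """Softmax with temperature tau."""
    m = max(z / tau for z in logits)
    exps = [math.exp(z / tau - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def entropy(p):
    """Shannon entropy in nats; higher means a broader soft label."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


small_teacher = [4.0, 1.0, 0.5, 0.0]               # hypothetical logits
large_teacher = [2.0 * z for z in small_teacher]   # same ranking, more confident

for tau in (1.0, 5.0, 10.0):
    print(tau,
          round(entropy(softmax(small_teacher, tau)), 3),
          round(entropy(softmax(large_teacher, tau)), 3))
```

Under this simplified model, doubling the temperature for the more confident teacher reproduces exactly the soft label of the less confident one, which is one way to see why the temperature and the teacher size cannot be tuned independently.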

Figure 2 (b) shows the test accuracies of the 2-bit ResNet20 trained with KD on the CIFAR-100 dataset. Since CIFAR-100 includes 100 classes, the soft label distribution is not sharp and the optimum value of $\tau$ is usually lower than that for CIFAR-10. More specifically, when $\tau$ is larger than 5 (purple line), the test accuracies are lower than 65.49% (green dotted line), the accuracy of the 2-bit ResNet20 trained with the hard label. The soft label can easily become too flat even with a small $\tau$, and thus the teacher’s knowledge does not transfer well to the student network. When $\tau$ is not large (e.g., less than 5), the tendency is similar to that of the CIFAR-10 experiment. When $\tau$ is 1 (blue line), the best performance is observed with ResNet20. As $\tau$ increases to 2 (yellow line) and 4 (grey line), the size of the best performing teacher model also changes to WRN20x1.5 and WRN20x1.7, respectively. This demonstrates that a proper value of temperature can improve the performance, but it should not be too high since the knowledge from the teacher network can disappear.

Fig. 3. Results of 2-bit ResNet20 models trained with various sizes of teacher networks and temperatures on CIFAR-100. In (b), the black horizontal line represents the test accuracy when the student network is trained with the hard label only.

3.4. Gradual soft loss reducing

Throughout the paper, we have discussed the effects of the temperature and the size of teacher network on the QDNN training with KD. Since the two hyperparameters are interrelated, careful parameter selection is required and it makes the training challenging. We also have the risk of cherry picking if the outcome cannot be predicted well without using the test result. Thus, we need to have a parameter setting technique that is fail-proof.

We have developed a KD technique that is much less sensitive to the specific parameter setting for KD. At the beginning of the training, where the gradient changes a lot, we use the soft and hard losses equally and then gradually reduce the amount of the soft loss as the training proceeds. We name this simple method the gradual soft loss reducing (GSLR) technique. To evaluate the effectiveness of GSLR, we train 2-bit ResNet20 while varying the size of the teacher and the temperature, as shown in Figure 3. The results clearly show that GSLR greatly improves the performance or at least yields results comparable to those of the hard loss (black horizontal line). Comparing traditional KD, shown in Figure 3 (a), with GSLR KD, in Figure 3 (b), we find that the latter yields much more predictable results, thereby reducing the risk of cherry picking.
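One possible realization of the GSLR mixing schedule is sketched below. The text only states that the soft and hard losses start out equally weighted and that the soft loss is then gradually reduced; the linear decay, the final value of 0, and the names `gslr_lambda` and `mixed_loss` are assumptions made for this illustration.

```python
def gslr_lambda(epoch, total_epochs, lam_start=0.5, lam_end=0.0):
    """Soft-loss weight at `epoch`, decayed linearly from lam_start to lam_end.

    The linear shape and the end value are assumptions of this sketch.
    """
    if total_epochs <= 1:
        return lam_start
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return lam_start + (lam_end - lam_start) * progress


def mixed_loss(hard_loss, soft_loss, lam):
    """Combine hard and soft losses with soft-loss weight lam."""
    return (1.0 - lam) * hard_loss + lam * soft_loss


total = 5
for epoch in range(total):
    lam = gslr_lambda(epoch, total)
    print(epoch, lam, mixed_loss(hard_loss=1.0, soft_loss=2.0, lam=lam))
```

With this schedule, the first epoch weights both losses equally and the last epoch reduces to hard-label training, so a poorly matched temperature or teacher size has less influence on the final model.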

4. CONCLUDING REMARKS

In this work, we investigate the teacher model choice and the impact of the hyperparameters in quantized deep neural network training with knowledge distillation. We found that the teacher need not be a quantized neural network. Instead, the hyperparameters that control the shape of the softmax distribution are more important. The hyperparameters for KD, which are the temperature, loss weighting factor, and size of the teacher network, are closely interrelated. When the size of the teacher network grows, increasing the temperature helps boost the performance to some extent. We introduce a simple training technique, gradual soft loss reducing (GSLR), for fail-safe KD training. At the beginning of the training, GSLR equally employs the hard and soft losses, and then gradually reduces the soft loss as the training proceeds. With careful hyperparameter selection and the GSLR technique, we achieve far better performance than previous studies for designing 2-bit quantized deep neural networks on the CIFAR-10 and CIFAR-100 datasets.

5. REFERENCES

[1] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 187, pp. 1–30, 2017.

[2] Kyuyeon Hwang and Wonyong Sung, “Fixed-point feedforward deep neural network design using weights +1, 0, and -1,” in Signal Processing Systems (SiPS), 2014 IEEE Workshop on. IEEE, 2014, pp. 1–6.

[3] Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha, “Alternating multibit quantization for recurrent neural networks,” International Conference on Learning Representations (ICLR), 2018.

[4] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.

[5] Shu-Chang Zhou, Yu-Zhi Wang, He Wen, Qin-Yao He, and Yu-Heng Zou, “Balanced quantization: An effective and efficient approach to quantized neural networks,” Journal of Computer Science and Technology, vol. 32, no. 4, pp. 667–682, 2017.

[6] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

[7] Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 535–541.

[8] Zhiyuan Tang, Dong Wang, and Zhiyong Zhang, “Recurrent neural network training with dark knowledge transfer,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5900–5904.

[9] Xuemeng Song, Fuli Feng, Xianjing Han, Xin Yang, Wei Liu, and Liqiang Nie, “Neural compatibility modeling with attentive knowledge distillation,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2018, pp. 5–14.

[10] Taichi Asami, Ryo Masumura, Yoshikazu Yamaguchi, Hirokazu Masataki, and Yushi Aono, “Domain adaptation of dnn acoustic models using knowledge distillation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5185–5189.

[11] Junpeng Wang, Liang Gou, Wei Zhang, Hao Yang, and Han-Wei Shen, “Deepvid: Deep visual interpretation and diagnosis for image classifiers via knowledge distillation,” IEEE transactions on visualization and computer graphics, vol. 25, no. 6, pp. 2168–2180, 2019.

[12] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.

[13] Mandar Kulkarni, Kalpesh Patil, and Shirish Karande, “Knowledge distillation using unlabeled mismatched images,” arXiv preprint arXiv:1703.07131, 2017.

[14] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho, “Relational knowledge distillation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3967–3976.

[15] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4133–4141.

[16] Asit Mishra and Debbie Marr, “Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy,” in International Conference on Learning Representations, 2018.

[17] Antonio Polino, Razvan Pascanu, and Dan Alistarh, “Model compression via distillation and quantization,” in International Conference on Learning Representations, 2018.

[18] Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, and Hassan Ghasemzadeh, “Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher,” arXiv preprint arXiv:1902.03393, 2019.

[19] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2016, pp. 525–542.

[20] Sungho Shin, Yoonho Boo, and Wonyong Sung, “Fixed-point optimization of deep neural networks with adaptive step size retraining,” in 2017 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 1203–1207.

[21] Wonyong Sung, Sungho Shin, and Kyuyeon Hwang, “Resiliency of deep neural networks under quantization,” arXiv preprint arXiv:1511.06488, 2015.

[22] Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.

[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 770–778.

[24] Yoonho Boo, Sungho Shin, and Wonyong Sung, “Memorization capacity of deep neural networks under parameter quantization,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 1383–1387.

[25] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid, “Towards effective low-bitwidth convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7920–7928.
