Mixed-Precision Quantized Neural Network with Progressively Decreasing Bitwidth For Image Classification and Object Detection

2019·Arxiv

Abstract

Efficient model inference is an important and practical issue in the deployment of deep neural networks on resource-constrained platforms. Network quantization addresses this problem effectively by leveraging low-bit representations and arithmetic that can be executed on dedicated embedded systems. In previous works, the parameter bitwidth is set homogeneously, and there is a trade-off between superior performance and aggressive compression. In fact, the stacked network layers, which are generally regarded as hierarchical feature extractors, contribute unequally to the overall performance. For a well-trained neural network, the feature distributions of different categories separate gradually as the network propagates forward, so the capability required of the subsequent feature extractors is reduced. This indicates that the neurons in posterior layers can be assigned lower bitwidths in quantized neural networks. Based on this observation, a simple but effective mixed-precision quantized neural network with progressively decreasing bitwidth is proposed to improve the trade-off between accuracy and compression. Extensive experiments on typical network architectures and benchmark datasets demonstrate that the proposed method achieves better or comparable results while reducing the memory space for quantized parameters by more than 30% in comparison with the homogeneous counterparts. In addition, the results demonstrate that the higher-precision bottom layers boost the 1-bit network performance appreciably owing to better preservation of the original image information, while the lower-precision posterior layers contribute to the regularization of k-bit networks.

Index Terms—Model compression, quantized neural networks, mixed-precision, decreasing bitwidth

I. INTRODUCTION

Deep convolutional neural networks (CNNs) have achieved state-of-the-art results on computer vision tasks such as image categorization [1] [2] [3] [4], object detection [5] [6] and semantic segmentation [7] [8]. These achievements depend on extreme model complexity that fits the distribution of large amounts of training data. However, this also leads to large over-parameterized models and dramatic computation cost. A typical CNN often takes hundreds of MB of memory, e.g. 170MB for ResNet-101 [3], 250MB for AlexNet [1] and 550MB for VGG-19 [2], and requires billions of FLOPs per image during inference, which relies on powerful GPUs. This challenges the deployment of CNNs on edge devices such as mobile phones and drones. Thus network compression and acceleration are important issues in deep learning research and application.

This work was jointly supported by the National Natural Science Foundation of China (61603248) and the Science and Technology Project of State Grid Corporation (Research and Demonstration Application of Monitoring and Management Technology of City Energy System Based on Large Data and Artificial Intelligence CEGHJS1800002).

T. Chu, Q. Luo, J. Yang and X. Huang are with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: chutianshu@sjtu.edu.cn; tomqin@sjtu.edu.cn; jieyang@sjtu.edu.cn; xiaolinhuang@sjtu.edu.cn).

Several techniques have been proposed to tackle this issue, including compact neural architecture design [9], model pruning [10] and network quantization [11]. While leaving the network topology unchanged, quantization can reduce the model size to a fraction of the original by using low-precision representations of the parameters [12]. Furthermore, the internal features can also be quantized. Model inference is then accelerated significantly by converting expensive floating-point arithmetic into more efficient fixed-point operations. Hence both the spatial and computational complexities are reduced notably by quantization.

The binary neural network (BNN) is a typical aggressive quantization method [13]. The model weights and activations are expressed as $\pm 1$, which can be stored in only 1 bit. Benefiting from hardware bitwise operations, the dot product between weights and activations is replaced by XNOR and POPCOUNT operations. Hence the deployment of BNNs is no longer constrained by GPUs. However, the naive BNN suffers from non-negligible performance degradation, especially on large-scale and complicated tasks [11]. Although some techniques have alleviated the information loss through improved binarization schemes, network topologies and training algorithms, a nontrivial accuracy gap remains between BNNs and full-precision networks [14] [15] [16]. Meanwhile, an effective way to boost compact model performance is to represent the model variables with fixed-point values, i.e. the quantized neural network (QNN) [11]. As reported in [17] [18], QNNs are able to achieve accuracy comparable to full-precision networks with 4-bit quantization. Nevertheless, a larger bitwidth means a linear increase in model size and a higher requirement on hardware capacity. When computing resources are extremely limited, it is necessary to make a trade-off between model accuracy and compression.

In this paper, we address this trade-off with a mixed-precision approach. In fact, the network layers contribute unequally to the overall performance, and each has a different sensitivity to quantization. As the network propagates forward, the dissimilarity of the hierarchical features is enhanced progressively. In the shallower layers, the internal features are distributed on complex manifolds, and accurate neurons are necessary to obtain the subsequent features. In deeper layers, by contrast, a rough convolutional filter is able to distinguish the preceding local features because the deep semantic features are more separable. Hence the parameter precision can be designed flexibly based on the network structure and the distribution of hierarchical features. In this paper, a simple but effective QNN with progressively decreasing bitwidth is proposed. The original information is preserved well by the high-precision bottom layers, while the model size is compressed further by the low-precision representation of the top layers.

Our main contributions are:

1) Based on the observation on internal feature distributions and network structure, a mixed-precision QNN framework with progressively decreasing bitwidths is proposed.

2) Four typical classification CNNs including VGG, AlexNet and ResNet-18/20 and an object detection framework SSD are quantized based on the proposed mixed-precision method. The layer-wise bitwidth gradually reduces to 1-bit from 4-bit and 8-bit respectively.

3) The redesigned QNNs are validated on several benchmark datasets including CIFAR-10/100, ILSVRC-2012 and PASCAL VOC. The experimental results demonstrate that the mixed-precision networks could achieve preferable or very similar performance while requiring at least 30% less memory space for quantized parameters.

The rest of this paper is organized as follows. Section II provides a summary of related work. Based on an analysis of the feature distributions of different layers, a multi-level quantized structure with gradually decreasing bitwidth is proposed in Section III. In Section IV, we demonstrate the effectiveness of the mixed-precision framework via extensive experiments on several typical CNN architectures and benchmark datasets. Section V concludes the paper.

II. RELATED WORK

Network compression and acceleration are critical to the practical deployment of CNNs on edge devices. One paradigm focuses on the network topology. Some studies focus on the design of compact neural architectures, and many lightweight networks have been proposed, including ResNet [3], DenseNet [4], MobileNet [19] and ShuffleNet [20]. In addition, some methods search for an effective neural architecture via reinforcement learning [9] [21]. Other studies approach model compression from the opposite direction: a tiny network is obtained from a well-trained complex network via pruning and sparsity constraints [22] [10] [23].

Network quantization approaches compression and acceleration from the perspective of data format while preserving the network architecture. The results in [24] showed that half-precision is also able to attain promising accuracy. This indicates that the model parameters can be stored with a lower bitwidth, reducing the model size several times. Moreover, the intermediate variables can also be represented by discrete values. The computationally expensive floating-point arithmetic is then replaced by fixed-point and bitwise operations that can be executed on dedicated hardware. As shown in Fig. 1, both the full-precision and low-bit variables are preserved in the computation graph during the training phase. To make backward propagation feasible, the gradients flow straight through the non-differentiable quantizer, i.e. the straight-through estimator (STE) [25] [11]. Some training characteristics and theoretical analyses are presented in [26] [27] [28] [29]. After training, it is unnecessary to retain the full-precision values.

Fig. 1. The computation graph of a neural cell in a QNN. The black arrows depict the forward data flow and the blue ones show the backward propagation. Both the full-precision and quantized values are retained during training. The non-differentiable quantizer module is bypassed in the computation graph. After training, the full-precision weights are discarded for deployment.

The BNN is an aggressive form of network quantization. The weights and activations are expressed as $\pm 1$ according to their signs. Thus the memory required for each variable is reduced to only 1 bit, and the model size after binarization is nearly 1/32 of the original [13]. In addition, the inference efficiency is improved substantially by leveraging XNOR and POPCOUNT operations [11]. However, the extreme compression also leads to heavy information loss during binarization. A non-trivial accuracy gap exists between the BNN and its full-precision counterpart, especially on complicated tasks. Some techniques have been proposed to alleviate the performance loss via modified binarization schemes [14] [16] and network architectures [30]. These improvements are limited, however, and they introduce extra full-precision arithmetic.
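The XNOR/POPCOUNT replacement of the dot product can be illustrated with a minimal Python sketch. It assumes that each $\pm 1$ vector is packed into an integer, with bit 1 encoding $+1$ and bit 0 encoding $-1$; the helper names `pack` and `xnor_popcount_dot` are hypothetical and are not taken from [11] or [13].

```python
def pack(values):
    """Pack a list of +1/-1 values into an int (bit i is 1 when values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: bit is 1 where the signs agree
    matches = bin(agree).count("1")     # POPCOUNT
    return 2 * matches - n              # agreements minus disagreements


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
reference = sum(x * y for x, y in zip(a, b))            # ordinary dot product
result = xnor_popcount_dot(pack(a), pack(b), len(a))    # bitwise equivalent
```

Each position where the signs agree contributes $+1$ and each disagreement contributes $-1$, which is why the result is twice the popcount minus the vector length.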

Another effective way to improve model capability is to assign a larger bitwidth to the network variables, i.e. conservative quantization [11]. A general and flexible quantization method is proposed in [31] and achieves promising accuracy on ILSVRC-2012. [18] and [32] further improve QNN performance by adjusting the quantization step size during back-propagation. With 4-bit quantization, the QNN can achieve results comparable to its full-precision counterpart. However, increasing the bitwidth enlarges the model size, so there is a trade-off between superior performance and aggressive compression.

In the methods mentioned above, all model weights are treated equally and assigned the same bitwidth. In fact, the parameters in a stacked neural network contribute differently to the overall results, which indicates that the bitwidth of a parameter should be determined by its individual function. Moreover, some advanced chips that support mixed-precision arithmetic have been released, including the Apple A12 Bionic and the Nvidia Turing GPU. Hence some studies tackle the QNN trade-off via mixed-precision methods. In [33] and [34], the bitwidth of each parameter is set according to the quantization residual of a pre-trained network. Wang et al. [35] fine-tune the bitwidth via reinforcement learning. In this paper, we explore the layer-wise bitwidth from another perspective and propose a simple but effective mixed-precision framework. In comparison with previous work, the proposed method is more flexible and compatible with various quantization schemes.

III. METHODOLOGY

A. Quantization Function

As Fig. 1 shows, the discrete data flow through the stacked neural cells, each of which consists of quantization, multiply-accumulation (MAC), batch normalization and activation. While the storage and computation costs are reduced notably, information loss is inevitable owing to the quantization error in this process. An appropriate quantization module that preserves the valuable information in the continuous variables is therefore crucial for network performance.

1) Binarization: An extreme quantization method is to store the discrete value in 1 bit, i.e. binarization. Given a variable $x$, the binary value is determined by its sign. To improve the value range, a scaling factor is introduced. MAC is then conducted by XNOR and POPCOUNT operations. However, the binarization function maps a continuous set onto the discrete set $\{-1, +1\}$. This non-differentiability is an obstacle to backward propagation and makes QNN training challenging. To address this issue, the STE is used to bypass the quantizer [25] [11]: in the forward pass the sign is taken, and in the backward pass the gradient is passed through the quantizer, gated by an indicator function that returns 1 if its condition is satisfied and 0 otherwise.
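The forward and backward rules of binarization can be sketched in a few lines of Python. The sketch assumes the common STE form from the BNN literature [11], in which the forward pass takes the sign and the backward pass lets the gradient through only where $|x| \le 1$. This threshold, the omission of the scaling factor, the choice to map 0 to $+1$, and the function names are all assumptions made for illustration.

```python
def binarize_forward(x):
    """Forward pass: sign of x, with 0 mapped to +1 (an assumed convention)."""
    return 1.0 if x >= 0 else -1.0


def binarize_backward(x, grad_out):
    """STE backward pass: the indicator 1[|x| <= 1] gates the incoming gradient."""
    indicator = 1.0 if abs(x) <= 1.0 else 0.0
    return grad_out * indicator


forward_value = binarize_forward(-0.3)         # negative input maps to -1
passed_grad = binarize_backward(0.5, 2.0)      # |x| <= 1: gradient passes through
blocked_grad = binarize_backward(1.5, 2.0)     # |x| > 1: gradient is zeroed
```

Gating the gradient in this way keeps the full-precision values from drifting far outside the range in which a sign flip is still possible.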

Fig. 2. Histograms of network weight parameters. The first column depicts the distribution of weights from three different layers in a well-trained full-precision network. The second and third columns show the full-precision and quantized weights from the corresponding layers in a well-trained QNN. From top to bottom, the images in the third column show the 8-bit, 4-bit and 2-bit quantization results respectively.

2) Quantization: Here $x$ and $x_q$ denote the full-precision and quantized values, and $\mathrm{round}(\cdot)$ represents the rounding operation. The STE gradient is also used in the backward pass of the quantization function. With this function, the model weights and activations can be discretized after proper preprocessing as follows.
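A minimal Python sketch of such a quantization function is given here, assuming a uniform k-bit quantizer on $[0, 1]$ of the form $x_q = \mathrm{round}((2^k - 1)\,x) / (2^k - 1)$. This form is an assumption chosen to be consistent with the rounding operation and the $[0, 1]$ interval used in this section; the function names are hypothetical.

```python
import math


def quantize_k(x, k):
    """Forward pass: round x in [0, 1] to the nearest of 2**k evenly spaced levels."""
    levels = (1 << k) - 1
    # floor(v + 0.5) rounds half up; Python's built-in round() rounds half to even.
    return math.floor(x * levels + 0.5) / levels


def quantize_k_backward(grad_out):
    """STE backward pass: the rounding step is treated as the identity."""
    return grad_out


q_two_bit = quantize_k(0.4, 2)   # nearest level among {0, 1/3, 2/3, 1}
q_one_bit = quantize_k(0.9, 1)   # nearest level among {0, 1}
```

With $k = 1$ the quantizer degenerates to a two-level function on $[0, 1]$, and each additional bit roughly halves the spacing between levels.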

3) Weight Quantization: For a continuous weight tensor $W$, it is necessary to project the unbounded elements into the specified interval [0, 1]. The most straightforward normalization is to divide by the largest absolute value and then scale and shift. However, the majority of the continuous weight values are distributed around zero, as Fig. 2 shows. Straightforward division would let the outliers dominate the normalization and lead to additional round-off quantization error. Hence a non-linear transformation, the hyperbolic tangent function $\tanh$, is introduced to alleviate the impact of the long-tailed distribution. Meanwhile, the saturation effect of $\tanh$ suppresses the variation of large values and avoids outliers during training. It is also worth noting that the MAC operations are conducted channel-wise, so the MAC results depend on the weight values in the corresponding channels. Hence channel-wise normalization is more suitable. The extra scaling factors can be merged into the batch normalization parameters, so no additional computation cost is introduced during deployment. Thus the overall weight quantization applies $\tanh$, normalizes each channel by its largest absolute value, scales and shifts the result into [0, 1], and then applies the quantization function.
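The weight preprocessing described in this subsection can be sketched in plain Python. The sketch assumes that the weights are grouped as one list per output channel, that the uniform quantizer has the form $\mathrm{round}((2^k - 1)\,x) / (2^k - 1)$, and that the quantized value is mapped back from $[0, 1]$ to $[-1, 1]$ (a DoReFa-style convention that is an assumption here). All function names are hypothetical.

```python
import math


def quantize_k(x, k):
    """Uniform k-bit quantizer on [0, 1] (assumed form, rounding half up)."""
    levels = (1 << k) - 1
    return math.floor(x * levels + 0.5) / levels


def quantize_weights(channels, k):
    """Quantize weights channel-wise; `channels` holds one list of floats per channel."""
    quantized = []
    for weights in channels:
        squashed = [math.tanh(w) for w in weights]          # tame the long tail
        peak = max(abs(t) for t in squashed) or 1.0         # guard an all-zero channel
        unit = [t / (2 * peak) + 0.5 for t in squashed]     # scale and shift into [0, 1]
        quantized.append([2 * quantize_k(u, k) - 1 for u in unit])  # back to [-1, 1]
    return quantized


# One channel with an outlier (-2.0): tanh limits its influence on the normalization.
q = quantize_weights([[0.1, -2.0, 0.5]], k=2)
```

Because the per-channel `peak` is a single positive scalar, it is the kind of extra scaling factor that could be folded into the following batch normalization, as the text notes.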

4) Activation Quantization: For activation quantization, it is theoretically feasible to adopt a strategy similar to that used for the weights. However, the model efficiency would be reduced badly by the additional floating-point operations in preprocessing. Therefore a clamp function is usually applied as the activation function to confine the features to the specified interval [0, 1] before quantization.
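A minimal sketch of this clamp-then-quantize step is shown here, assuming a uniform k-bit quantizer of the form $\mathrm{round}((2^k - 1)\,x) / (2^k - 1)$; the name `quantize_activation` is hypothetical.

```python
import math


def quantize_activation(a, k):
    """Clamp an activation to [0, 1], then round it to one of 2**k levels."""
    clamped = min(max(a, 0.0), 1.0)                  # clamp acts as the activation function
    levels = (1 << k) - 1
    return math.floor(clamped * levels + 0.5) / levels


low = quantize_activation(-3.0, 4)    # below the interval: clamped to 0
high = quantize_activation(7.0, 4)    # above the interval: clamped to 1
mid = quantize_activation(0.26, 2)    # rounded to the nearest of {0, 1/3, 2/3, 1}
```

Unlike the weight path, no normalization statistics are computed at inference time here, which is the efficiency argument made in the text.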

B. Hierarchical Feature Distribution and Mixed-Precision QNN

One of the important advantages behind the remarkable achievements of deep neural networks is that a delicate feature representation can be learned automatically by end-to-end training. Following the network topology, the hierarchical features are refined layer by layer through MAC operations and non-linear transformations. As the network propagates forward, the variation of each category's feature distribution is reduced gradually while the margins between categories increase. Consequently, the feature distributions are mapped from complex manifolds in high dimension to several clusters in low dimension, and a linear classifier is able to achieve high accuracy using the final semantic features.

To illustrate this intuition explicitly, a VGG-7 network, which consists of 6 convolutional layers and 1 latent fully-connected layer, is trained on the CIFAR-10 dataset. After training, 50 samples from each class are selected randomly and fed into the model. The feature representations in the internal layers are extracted and projected into 2 dimensions by t-SNE [36]. As shown in Fig. 3(a), there is severe overlap among the feature distributions of different categories after dimension reduction in the initial layer. This indicates that the feature manifolds in the bottom layer are very complicated, because the elementary characteristics of images are mainly color and texture features. These local representations are quite similar across different samples, so it is difficult to determine the labels directly from the elementary features. Many delicate neurons are required to distinguish the overlapping distributions. As the network propagates forward, the features of the same category gradually become organized. As Fig. 3(d) shows, the advanced semantic features are more robust, and there are clear margins between the distribution clusters in deep layers.

During this process, the feature transformation is conducted by the neurons in each layer. Every neuron works as a simple classifier to extract a target feature. The input complexities of the network layers differ from one another, which means that the precision requirements on the neurons also differ. Based on this observation, we argue that the neurons in the shallower layers are more sensitive to quantization. Because the feature distributions overlap, a finite number of neurons cannot distinguish the samples and extract meaningful intermediate representations without suitable precision. Once the advanced features are obtained explicitly, the following layers become more robust to quantization error. Thus it is feasible to design the QNN structure more flexibly rather than as k-bit homogeneous networks. The bitwidth of each model parameter progressively decreases from the initial k-bit as the QNN propagates forward.

Fig. 3. In the bottom layer of a trained network, the feature distributions of different categories overlap one another severely, as (a) shows. Many delicate neurons are needed to distinguish the overlapping distributions. As the network propagates forward, the feature distribution of each category gathers gradually in (b) and (c). At the end of the hidden layers, there are clear margins between the semantic feature distributions of different classes in (d). With the improved separability among the feature manifolds, a neuron with lower-precision parameters is able to extract robust features.

It is also worth noting that the majority of model parameters are concentrated in the deeper layers, as Table I shows. Raising the bitwidth of the bottom layers therefore has little effect on the model size in comparison with a low-bit network, yet the original information is preserved better. On the other hand, the model size of the mixed-precision QNN is much smaller than that of the k-bit homogeneous ones owing to the lower parameter precision. Hence the mixed-precision QNN is more compact and has the potential to achieve promising performance.

Using the framework of progressively decreasing bitwidth, 4 typical CNNs are quantized. VGG-Net and AlexNet are representatives of plain CNNs. The VGG-7 in this paper is designed for the CIFAR-10/100 datasets. All the weight parameters are quantized except those of the output layer, since the linear classifier is directly related to the final results and requires sufficient precision. The bitwidth of the quantized layers decreases layer-wise from 8-bit to 1-bit by a factor of 1/2, as shown in Table II. Although the initial bitwidth is higher than that of the homogeneous counterpart, the average model bitwidth is reduced to 1.06. AlexNet, which contains 5 convolutional and 2 latent fully-connected layers, was proposed for the high-resolution image recognition task ILSVRC-2012 [1]. The input and output layers are kept in full precision as in [14] [31] for a fair comparison. The bitwidth setting is similar to that of VGG-7 and is shown in Table IV. The final average bitwidth is 1.10.
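The average bitwidth reported for each network can be read as a parameter-weighted mean of the per-layer bitwidths. The following Python sketch shows that arithmetic under stated assumptions: the layer parameter counts are hypothetical placeholders rather than the actual sizes of VGG-7 or AlexNet, and `halving_schedule` is one illustrative reading of a bitwidth that halves layer by layer with a floor of 1 bit.

```python
def halving_schedule(start_bits, num_layers):
    """Per-layer bitwidths that halve at each layer and stay at 1 bit once reached."""
    schedule, bits = [], start_bits
    for _ in range(num_layers):
        schedule.append(bits)
        bits = max(1, bits // 2)
    return schedule


def average_bitwidth(param_counts, bitwidths):
    """Parameter-weighted mean bitwidth over the quantized layers."""
    total_bits = sum(n * b for n, b in zip(param_counts, bitwidths))
    return total_bits / sum(param_counts)


# Hypothetical parameter counts (NOT from the paper): small early layers, large late ones.
param_counts = [1_000, 10_000, 50_000, 200_000, 500_000, 1_000_000]
bitwidths = halving_schedule(8, len(param_counts))
avg = average_bitwidth(param_counts, bitwidths)
```

Because most parameters sit in the 1-bit layers, the average stays close to 1 bit even though the first layer uses 8 bits, which mirrors the argument that raising the bitwidth of the bottom layers costs little memory.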

ResNet is the pioneer of networks with shortcuts. ResNet-20, which consists of 3 residual stages, was initially proposed for the CIFAR-10 task [3]. For a fair comparison with related work [14] [31], the convolutional weights of the residual stages are quantized from 4-bit to 1-bit as shown in Table II. As ResNet-20 has only 64 filters at the final stage, it is uncertain whether the 64-dimensional pooled features obtained by aggressively quantized neurons can satisfy the classification requirement, especially for the CIFAR-100 task. A model with doubled bitwidth and more powerful capacity is therefore also validated in this paper. By contrast, ResNet-18, which contains 4 residual stages, is much wider and has 512 filters at the final residual stage. Its bitwidth decreases from 8-bit to 1-bit as shown in Table IV. The activation bitwidth of the mixed-precision networks is set the same as that of the homogeneous counterparts.

TABLE I NUMBER OF WEIGHT PARAMETERS IN TYPICAL NETWORKS

Object detection is a much more complicated task than image classification. In addition to predicting the categories of multiple objects in an image, the network also needs to regress the coordinates of the bounding boxes. This requires a higher feature extraction capability from the network. To investigate the performance of the mixed-precision QNN on the object detection task, a VGG-16 based single shot detector (SSD) [37] is quantized in this paper. The weight parameters of the VGG-16 backbone are discretized using a bitwidth setting similar to that of VGG-7. To improve the feature extraction capability at the final stage, the bitwidth of the extra layers is set to 4-bit. The output layers remain in full precision. The final average bitwidth is 1.42.

IV. EXPERIMENTS

To validate the performance of QNN with progressively decreasing bitwidth, we conduct extensive experiments on CIFAR-10/100, ILSVRC-2012 and Pascal VOC datasets.

A. CIFAR-10/100

The CIFAR-10 dataset contains 50,000 training images and 10,000 test images in 10 classes. The image size is 32×32 pixels. The CIFAR-100 dataset consists of the same number of images from 100 categories. One tenth of the training samples are held out as a validation set.

We follow the data augmentation in [3] for training. At test time, the original images are used directly. We use an SGD optimizer with a momentum of 0.9 and a learning rate that starts from 0.1 and is scaled by 0.1 at epochs 80, 120 and 160. An L2 regularizer with a decay of 2e-4 is applied to the weight parameters. The mini-batch size is 128. After 200 epochs of training from scratch, the test accuracy associated with the best validation performance is reported as the final result.
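The step schedule for the learning rate can be written as a small function. This sketch assumes that epochs are counted from 0 and that each drop takes effect at the start of the listed epoch; the text does not specify either convention, and `learning_rate` is a hypothetical name.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(80, 120, 160), gamma=0.1):
    """Learning rate at a given epoch: base_lr scaled by gamma once per passed milestone."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


lr_start = learning_rate(0)      # before any milestone
lr_mid = learning_rate(100)      # after the first milestone only
lr_end = learning_rate(199)      # after all three milestones
```

The same function covers the other schedules in this section by changing `base_lr` and `milestones`.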

After 5 runs of each experiment, the average test accuracies on CIFAR-10 are recorded in Table II. Here, FP and 32-bit denote the full-precision network with floating-point parameters. Consistent with the analysis in Sec. III-B, the mixed-precision networks obtain higher accuracies than the homogeneous counterparts while their model size is smaller. For ResNet-20, the redesigned network with less than 3-bit weights and 4-bit activations is able to achieve a final result comparable to that of the full-precision network. However, at the beginning of training, the generalization ability of the mixed-precision QNN fluctuates noticeably, as Fig. 4 shows. This is because the quantized values flip back and forth under a large learning rate. When the learning rate decays, the training process becomes stable. In addition, the mixed-precision VGG-Net obtains a better result than both the 2-bit network and even the full-precision one. We argue that the better information preservation in the initial layers, owing to the higher bitwidth, boosts the performance evidently. Meanwhile, VGG-7 is a very “wide” network, and this redundancy stabilizes the training process, as Fig. 4 shows. But once sufficient and meaningful information is obtained by the bottom layers, the redundant parameters in the subsequent layers may lead to overfitting. Hence a suitable bitwidth setting contributes to model regularization.

TABLE II CIFAR-10 EXPERIMENTAL RESULTS

The results on the CIFAR-100 dataset are recorded in Table III and are generally consistent with those on CIFAR-10. It is noticeable that our ResNet-20 result in the fourth line is 3% lower than that of the homogeneous-bitwidth network. The reason is that ResNet-20 is a very “narrow” network that was originally designed for CIFAR-10. After the average pooling layer, the dimension of the semantic feature, 64, is less than the number of classes. Hence the 1-bit neurons in deep layers induce significant information loss. Once the overall bitwidth is increased, this performance bottleneck is removed. For the wide network, VGG-Net, this is not a concern: the numerous 1-bit neurons in deep layers guarantee meaningful semantic features. In comparison with the 2-bit network, the mixed-precision model is able to compress the memory space for quantized parameters to nearly a half while achieving very competitive accuracy.

Fig. 4. The training curve of ResNet-20 and VGG-7 on CIFAR-10.

TABLE III CIFAR-100 EXPERIMENTAL RESULTS

Fig. 5. The training curve of ResNet-20 and VGG-7 on CIFAR-100.

B. ILSVRC-2012

ILSVRC-2012 is a 1000-category dataset that consists of 1.2 million training images and 50 thousand validation ones. Compared with the CIFAR tasks, ILSVRC is much more challenging owing to its larger and more diverse images. For training, the images are resized and then cropped randomly. For validation, the center crops are used as inputs.

TABLE IV ILSVRC-2012 EXPERIMENTAL RESULTS

In the training process, an Adam optimizer with a learning rate of 2e-4 and no weight decay is applied for AlexNet. For ResNet-18, an SGD optimizer with a learning rate of 0.1 and a weight decay of 1e-4 is used. The learning rate is scaled by 0.1 at epochs 60 and 75 of the 90 total epochs for AlexNet, and at epochs 30, 60, 90 and 100 of the 120 total epochs for ResNet-18. After training, the Top-1 validation accuracies are reported in Table IV. It is clear that the mixed-precision QNNs have advantages over the ordinary ones in terms of both performance and model size. In comparison with the full-precision networks, the results are still acceptable.

Fig. 6. The training curve of AlexNet.

C. Pascal VOC

Pascal VOC is a benchmark dataset for object detection that consists of 20 categories of general objects. To validate the performance of the proposed method on more challenging tasks, we select SSD as the baseline detector and, after quantization, train our model on the VOC2007 trainval and VOC2012 trainval datasets (16,551 images). The resulting model is then evaluated on the VOC2007 test dataset (4,952 images). Our quantized models are trained from scratch without pre-training on the ILSVRC dataset. An SGD optimizer with a weight decay of 1e-4 is applied for 8,000 iterations of training. A learning rate of 1e-3 is used for the first 4,000 iterations, followed by 2,000 iterations each at 1e-4 and 1e-5.

The comparison results are given in Table V. Compared with the full-precision counterpart, the performance of the quantized networks degrades significantly owing to the more challenging task and the quantization error. However, the mixed-precision network still outperforms the homogeneous one. In addition, the 62.21% mAP indicates that the quantized detector has basic object detection capability. From the detailed results and demo samples, we conclude that the mixed-precision detector performs well on objects that are large enough and located at the center of the image.

V. CONCLUSIONS

In this paper, a novel QNN framework with multiple bitwidths is proposed. Based on the observation of layer-wise feature distributions and network structure, we define a gradually decreasing bitwidth setting that preserves the original image information in the bottom layers and addresses the trade-off between accuracy and compression. Extensive experiments on typical network architectures and benchmark datasets demonstrate that the proposed mixed-precision QNN achieves preferable results in comparison with k-bit homogeneous networks while requiring at least 30% less memory space for quantized parameters.

ACKNOWLEDGMENT

This work is jointly supported by the National Natural Science Foundation of China (61603248) and the State Grid Corporation (Research and Demonstration Application of Monitoring and Management Technology of City Energy System Based on Large Data and Artificial Intelligence CEGHJS1800002).

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” neural information processing systems, vol. 141, no. 5, pp. 1097–1105, 2012.

[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” international conference on learning representations, 2015.

Fig. 7. The training curve of ResNet-18.

[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[4] G. Huang, Z. Liu, L. V. Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” computer vision and pattern recognition, pp. 2261–2269, 2017.

[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.

[6] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.

[7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” pp. 3431–3440, 2015.

[8] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” arXiv: Computer Vision and Pattern Recognition, 2017.

[9] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” pp. 8697–8710, 2018.

[10] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” arXiv: Learning, 2018.

[11] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.

[12] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in neural information processing systems, 2015, pp. 3123–3131.

[13] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.

[14] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542.

[15] J. Bethge, M. Bornstein, A. Loy, H. Yang, and C. Meinel, “Training competitive binary neural networks from scratch,” arXiv preprint arXiv:1812.01965, 2018.

[16] L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep networks,” arXiv preprint arXiv:1611.01600, 2016.

[17] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.

[18] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, “Pact: Parameterized clipping activation for quantized neural networks,” arXiv preprint arXiv:1805.06085, 2018.

[19] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv: Computer Vision and Pattern Recognition, 2017.

[20] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.

[21] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.

[22] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” arXiv preprint arXiv:1611.06440, 2016.

[23] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis, “Nisp: Pruning networks using neuron importance score propagation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9194–9203.

[24] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International Conference on Machine Learning, 2015, pp. 1737–1746.

[25] Y. Bengio, N. Leonard, and A. C. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv: Learning, 2013.

[26] W. Tang, G. Hua, and L. Wang, “How to train a compact binary neural network with high accuracy?” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.

TABLE V PASCAL VOC EXPERIMENTAL RESULTS

Fig. 8. The sampled detection results of the mixed-precision SSD.

[27] J. Bethge, M. Bornstein, A. Loy, H. Yang, and C. Meinel, “Training competitive binary neural networks from scratch,” arXiv preprint arXiv:1812.01965, 2018.

[28] M. Alizadeh, J. Fernández-Marqués, N. D. Lane, and Y. Gal, “An empirical study of the binary neural networks’ optimisation,” in Seventh International Conference on Learning Representations, 2019.

[29] P. Yin, J. Lyu, S. Zhang, S. Osher, Y. Qi, and J. Xin, “Understanding straight-through estimator in training activation quantized neural nets,” arXiv preprint arXiv:1903.05662, 2019.

[30] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng, “Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 722–737.

[31] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.

[32] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, “Learned step size quantization,” arXiv preprint arXiv:1902.08153, 2019.

[33] J. Fromm, S. Patel, and M. Philipose, “Heterogeneous bitwidth binarization in convolutional neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 4006–4015.

[34] Y. Zhou, S.-M. Moosavi-Dezfooli, N.-M. Cheung, and P. Frossard, “Adaptive quantization for deep neural network,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[35] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, “Haq: Hardware-aware automated quantization with mixed precision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8612–8620.

[36] L. van der Maaten and G. E. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.

[37] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision, 2016, pp. 21–37.

[38] Z. Cai, X. He, J. Sun, and N. Vasconcelos, “Deep learning with low precision by half-wave gaussian quantization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5406–5414.