Post-Training Piecewise Linear Quantization for Deep Neural Networks

2020·arXiv

Abstract

Quantization plays an important role in the energy-efficient deployment of deep neural networks on resource-limited devices. Post-training quantization is highly desirable since it does not require retraining or access to the full training dataset. The well-established uniform scheme for post-training quantization achieves satisfactory results by converting neural networks from full-precision to 8-bit fixed-point integers. However, it suffers from significant performance degradation when quantizing to lower bit-widths. In this paper, we propose a piecewise linear quantization (PWLQ) scheme to enable accurate approximation for tensor values that have bell-shaped distributions with long tails. Our approach breaks the entire quantization range into non-overlapping regions for each tensor, with each region being assigned an equal number of quantization levels. Optimal breakpoints that divide the entire range are found by minimizing the quantization error. Compared to state-of-the-art post-training quantization methods, experimental results show that our proposed method achieves superior performance on image classification, semantic segmentation, and object detection with minor overhead.

Keywords: deep neural networks, post-training quantization, piecewise linear quantization

1 Introduction

In recent years, deep neural networks (DNNs) have achieved state-of-the-art results in a variety of learning tasks including image classification [23,24,54,19,53,29], segmentation [5,18,49] and detection [36,47,48]. Scaling up DNNs by one or all of the dimensions [55] of network depth [19], width [59] or image resolution [30] attains better accuracy, at a cost of higher computational complexity and increased memory requirements, which makes the deployment of these networks on embedded devices with limited resources impractical.

One feasible way to deploy DNNs on embedded systems is quantization of full-precision (32-bit floating-point, FP32) weights and activations to lower precision (such as 8-bit fixed-point, INT8) integers [25]. By decreasing the bit-width, the number of discrete values is reduced, while the quantization error, which generally correlates with model performance degradation, increases. To minimize the quantization error and maintain the performance of a full-precision model, many recent studies [63,4,40,25,6,60,12,27] rely on training either from scratch (“quantization-aware” training) or by fine-tuning a pre-trained FP32 model.

However, post-training quantization is highly desirable since it does not require retraining or access to the full training dataset. It saves time-consuming fine-tuning effort, protects data privacy, and allows for easy and fast deployment of DNN applications. Among various post-training quantization schemes proposed in the literature [28,7,62], uniform quantization is the most popular approach to quantize weights and activations since it discretizes the domain of values to evenly-spaced low-precision integers which can be efficiently implemented on commodity hardware’s integer-arithmetic units.

Recent work [28,31,42] shows that post-training quantization based on a uniform scheme with INT8 is sufficient to preserve near-original FP32 pre-trained model performance for a wide variety of DNNs. However, ubiquitous usage of DNNs in resource-constrained settings requires even lower bit-widths to achieve higher energy efficiency and smaller models. In lower bit-width scenarios, such as 4-bit, post-training uniform quantization causes a significant accuracy drop [28,62]. This is mainly because the distributions of weights and activations of pre-trained DNNs are bell-shaped, such as Gaussian or Laplacian [17,35]. That is, most of the weights are clustered around zero while few of them are spread in a long tail. As a result, when operating at low bit-widths, uniform quantization assigns too few quantization levels to small magnitudes and too many to large ones, which leads to significant accuracy degradation [28,62].

To mitigate this issue, various quantization schemes [41,4,3,43,26,34] are designed to take advantage of the fact that weights and activations of pre-trained DNNs typically have bell-shaped distributions with long tails. Here, we present a new number representation via a piecewise linear approximation that is suited to these distributions. It breaks the entire quantization range into non-overlapping regions where each region is assigned an equal number of quantization levels. Although our method works with an arbitrary number of regions, we suggest limiting them to two to reduce the complexity of the proposed approach and the hardware overhead. The optimal breakpoints that divide the entire range can be found by minimizing the quantization error. Compared to uniform quantization, our piecewise linear quantization (PWLQ) provides a richer representation that reduces the quantization error. This indicates its potential to reduce the gap between floating-point and low-bit precision models. It is also more hardware-friendly than other non-linear approaches such as logarithm-based and clustering-based approaches [41,56,3], since in our method computation can still be carried out without the need for any transforms or look-up tables.

The main contributions of our work are as follows:

• We propose a piecewise linear quantization (PWLQ) scheme for efficient deployment of pre-trained DNNs without retraining or access to the full training dataset. We also investigate its impact on hardware implementation.

• We present a solution to find the optimal breakpoints and demonstrate that our method achieves a lower quantization error than the uniform scheme.

• We provide a comprehensive evaluation on image classification, semantic segmentation, and object detection benchmarks and show that our proposed method achieves state-of-the-art results.

2 Related Work

There is a wide variety of approaches in the literature that facilitate the efficient deployment of DNNs. The first group of techniques relies on designing network architectures that depend on more efficient building blocks. Notable examples include depth/point-wise layers [22,52] as well as group convolutions [61,38]. These methods require domain knowledge, training from scratch and full access to the task datasets. The second group of approaches optimizes network architectures in a typical task-agnostic fashion and may or may not require (re)training. Weight pruning [17,32,20,37], activation compression [10,9,14], knowledge distillation [21,45] and quantization [8,46,66,64,41,25] fall under this category.

In particular, quantization of activations and weights [15,16,57,35,6,60,62] leads to model compression and acceleration as well as to overall savings in power consumption. Model parameters can be stored in fewer bits while the computation can be executed on integer-arithmetic units rather than on power-hungry floating-point ones [25]. There has been extensive research on quantization with and without (re)training. In the rest of this section, we focus on post-training quantization that directly converts full-precision pre-trained models to their low-precision counterparts.

Recent works [28,31,42] have demonstrated that 8-bit quantized models incur negligible accuracy loss for a variety of networks. To improve accuracy, per-channel (or channel-wise) quantization is introduced in [28,31] to address variations of the range of weight values across channels. Weight equalization/factorization is applied by [39,42] to rescale the difference of weight ranges between different layers. In addition, bias shifts in the mean and variance of quantized values are observed and counteracting methods are suggested by [2,13]. A comprehensive evaluation of clipping techniques is presented by [62] along with an outlier channel splitting method to improve quantization performance. Moreover, adaptive processes that assign a different bit-width to each layer are proposed in [35,65] to optimize the overall bit allocation.

There are also a few attempts to tackle 4-bit post-training quantization by combining multiple techniques. In [2], a combination of analytical clipping, bit allocation, and bias correction is used, while [7] minimizes the mean squared quantization error by representing one tensor with one or multiple 4-bit tensors as well as by optimizing the scaling factors.

Most of the aforementioned works utilize a linear or uniform quantization scheme. However, linear quantization cannot capture the bell-shaped distribution of weights and activations, which results in sub-optimal solutions. To overcome this deficiency, [3] proposes a quantile-based method to improve accuracy, but their method works efficiently only on highly customized hardware; [26] employs two different scale factors on overlapping regions to reduce computation bits over fixed-point implementations. However, its scale factors, which are restricted to powers of two, and its heuristic options limit the accuracy. Instead, we propose a piecewise linear approach with an improved selection of optimal breakpoints, which leads to state-of-the-art quantized model results. Our method can be implemented efficiently with minimal modification to commodity hardware.

3 Quantization Schemes

In this section, we review a uniform quantization scheme and discuss its limitations. We then present PWLQ, our piecewise linear quantization scheme and show that it has a stronger representational power (a smaller quantization error) compared to the uniform scheme.

Fig. 1. Quantization of conv4 layer weights in a pre-trained Inception-v3. Left: uniform quantization. Middle: piecewise linear quantization (PWLQ) with one breakpoint; the dotted line indicates the breakpoint. Right: Mean squared quantization error (MSE) for various bit-widths (b = 4, 6, 8). The MSE of PWLQ is convex w.r.t. the breakpoint p, and the b-bit PWLQ can achieve a smaller quantization error than the b-bit uniform scheme

3.1 Uniform Quantization

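Uniform quantization (the left of Figure 1) linearly maps full-precision real numbers $r$ into low-precision integer representations. From [25,7], the approximated version $\hat{r}$ from the uniform quantization scheme at $b$-bit can be defined as:

$$\hat{r} = \mathrm{uni}(r; b, r_l, r_u, z) = s \times r_q + z, \qquad r_q = \left\lceil \frac{\mathrm{clamp}(r; r_l, r_u) - z}{s} \right\rfloor_{\mathbb{Z}_b}, \qquad s = \frac{r_u - r_l}{N - 1}, \quad N = 2^b, \tag{1}$$

where $[r_l, r_u]$ is the quantization range, $s$ is the scaling factor, $z$ is the offset, $N$ is the number of quantization levels, and $r_q$ is the quantized integer computed by a rounding function $\lceil \cdot \rfloor$ followed by saturation to the integer domain $\mathbb{Z}_b$. We set the offset $z = 0$ for symmetric signed distributions combined with $\mathbb{Z}_b = \{-2^{b-1}, \ldots, 2^{b-1} - 1\}$, and $z = r_l$ for asymmetric unsigned distributions (e.g., ReLU-based activations) with $\mathbb{Z}_b = \{0, \ldots, 2^b - 1\}$. Since the scheme (1) introduces a quantization error defined as $\varepsilon_{uni} = \hat{r} - r$, the expected quantization error squared is given by:

$$\mathbb{E}(\varepsilon_{uni}^2; b, r_l, r_u) = \frac{s^2}{12} = \frac{C(b)}{12}\,(r_u - r_l)^2, \qquad C(b) = \frac{1}{(2^b - 1)^2}. \tag{2}$$

From the above definition, uniform quantization divides the range evenly regardless of the distribution of $r$. Empirically, the distributions of weights and activations of pre-trained DNNs are similar to bell-shaped Gaussian or Laplacian [17,35]. Therefore, uniform quantization is not always able to achieve a small enough approximation error to maintain model accuracy, especially in low-bit cases.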
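As an illustration of scheme (1), the following minimal Python sketch quantizes a scalar and compares the empirical mean squared error on uniformly distributed inputs with the $s^2/12$ value of (2). The function names `uniform_quantize` and `empirical_mse` are hypothetical, and the sketch assumes $N = 2^b$ levels and the two offset conventions described above; it is not taken from any reference implementation.

```python
import random

def uniform_quantize(r, b, r_l, r_u, signed=True):
    """Quantize scalar r to b bits on [r_l, r_u]; return the dequantized value."""
    s = (r_u - r_l) / (2 ** b - 1)            # scaling factor with N = 2^b levels
    if signed:                                # symmetric signed case: z = 0
        z, q_min, q_max = 0.0, -(2 ** (b - 1)), 2 ** (b - 1) - 1
    else:                                     # asymmetric unsigned case: z = r_l
        z, q_min, q_max = r_l, 0, 2 ** b - 1
    clamped = min(max(r, r_l), r_u)           # clamp(r; r_l, r_u)
    r_q = round((clamped - z) / s)            # round to the nearest integer
    r_q = min(max(r_q, q_min), q_max)         # saturate to the integer domain
    return s * r_q + z

def empirical_mse(b, m, n=20000, seed=0):
    """Mean squared error of b-bit quantization for r uniform on [-m, m]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        r = rng.uniform(-m, m)
        total += (uniform_quantize(r, b, -m, m) - r) ** 2
    return total / n

if __name__ == "__main__":
    b, m = 4, 1.0
    s = 2 * m / (2 ** b - 1)
    print("empirical MSE:", empirical_mse(b, m))
    print("s^2 / 12     :", s * s / 12)
```

For inputs that are uniform within each quantization cell the two printed numbers should agree closely; for bell-shaped inputs the $s^2/12$ value is only an approximation.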

3.2 Piecewise Linear Quantization (PWLQ)

To improve model accuracy for quantized models, we need to approximate the original model as accurately as possible by minimizing the quantization error. We follow this natural criterion to investigate the quantization performance, even though no direct relationship can easily be established between the quantization error and the final model accuracy [7].

Inspired by [43,26], which take advantage of bell-shaped distributions, our approach based on piecewise linear quantization is designed to minimize the quantization error. It breaks the quantization range into two non-overlapping regions: the dense, central region and the sparse, high-magnitude region. An equal number of quantization levels is assigned to these two regions. We chose to use two regions with one breakpoint to maintain simplicity in the inference algorithm (Section 5.1) and the hardware implementation (Section 4). Multiple-region cases are discussed in Section 5.1.

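Therefore, we only consider one breakpoint $p$ to divide the quantization range $[-m, m]$ ($m > 0$) into two symmetric regions: the center region $R_1 = [-p, p]$ and the tail region $R_2 = [-m, -p) \cup (p, m]$. Each region consists of a negative piece and a positive piece. Within each of the four pieces, $(b-1)$-bit ($b \geq 2$) uniform quantization (1) is applied such that, including the sign, every value in the quantization range is represented in $b$ bits. We define the $b$-bit piecewise linear quantization (denoted by PWLQ) scheme as:

$$\mathrm{pw}(r; b, m, p) = \begin{cases} \mathrm{sign}(r) \times \mathrm{uni}(|r|; b-1, 0, p, 0), & r \in R_1, \\ \mathrm{sign}(r) \times \mathrm{uni}(|r|; b-1, p, m, p), & r \in R_2, \end{cases} \tag{3}$$

where the sign of the full-precision real number $r$ is denoted by $\mathrm{sign}(r)$, and $\mathrm{uni}(\cdot\,; b, r_l, r_u, z)$ is the $b$-bit uniform quantizer of (1) on the range $[r_l, r_u]$ with offset $z$. The associated quantization error is defined as $\varepsilon_{pw} = \mathrm{pw}(r; b, m, p) - r$.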

Figure 1 shows the comparison between uniform quantization and PWLQ on the empirical distribution of the conv4 layer weights in a pre-trained Inception-v3 model [54]. We emphasize that b-bit PWLQ represents FP32 values into b-bit integers to support b-bit multiply-accumulate operations, even though in total, it has the same number of quantization levels as (b+1)-bit uniform quantization. The implications of this are further discussed in Section 4.
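The one-breakpoint scheme (3) can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: the helper names `uni`, `pwlq`, `uniform_symmetric` and `mse` are hypothetical, the samples are synthetic standard-normal values instead of real conv4 weights, and the breakpoint $p = 1.16$ is an illustrative choice rather than a tuned one.

```python
import random

def uni(a, b, r_l, r_u, z):
    """Unsigned b-bit uniform quantizer of (1) on [r_l, r_u] with offset z."""
    s = (r_u - r_l) / (2 ** b - 1)
    clamped = min(max(a, r_l), r_u)
    q = min(max(round((clamped - z) / s), 0), 2 ** b - 1)
    return s * q + z

def pwlq(r, b, m, p):
    """b-bit PWLQ with one breakpoint p on [-m, m], following (3)."""
    sign = -1.0 if r < 0 else 1.0
    a = abs(r)
    if a <= p:                                   # center region R1
        return sign * uni(a, b - 1, 0.0, p, 0.0)
    return sign * uni(a, b - 1, p, m, p)         # tail region R2, clamped to m

def uniform_symmetric(r, b, m):
    """b-bit symmetric uniform baseline on [-m, m], following (1)."""
    s = 2 * m / (2 ** b - 1)
    q = round(min(max(r, -m), m) / s)
    q = min(max(q, -(2 ** (b - 1))), 2 ** (b - 1) - 1)
    return s * q

def mse(quantizer, samples):
    """Mean squared quantization error over a list of samples."""
    return sum((quantizer(r) - r) ** 2 for r in samples) / len(samples)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # synthetic bell-shaped data
b, m, p = 4, 3.0, 1.16
mse_pw = mse(lambda r: pwlq(r, b, m, p), samples)
mse_uni = mse(lambda r: uniform_symmetric(r, b, m), samples)
print(f"PWLQ MSE {mse_pw:.5f}  uniform MSE {mse_uni:.5f}")
```

Both quantizers clip at $\pm m$, so the difference in error comes from how the levels are distributed inside the range, not from the clipping.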

3.3 Error Analysis

To study the quantization error for PWLQ, we suppose the full-precision real number $r$ has a symmetric probability density function (PDF) $f(r)$ on a bounded domain $[-m, m]$ with the cumulative distribution function (CDF) $F(r)$ satisfying $f(r) = f(-r)$, $F(-m) = 0$ and $F(m) = 1$. We then calculate the expected quantization error squared of PWLQ from (2) based on the error of each piece, with $C(b) = 1/(2^b - 1)^2$ as in (2):

$$\mathbb{E}(\varepsilon_{pw}^2; b, m, p) = \frac{C(b-1)}{12}\Big\{ (m-p)^2 \big[F(-p) - F(-m) + F(m) - F(p)\big] + p^2 \big[F(p) - F(-p)\big] \Big\}. \tag{4}$$

Since $F(r) + F(-r) = 1$ for a symmetric PDF, equation (4) can be simplified as:

$$\mathbb{E}(\varepsilon_{pw}^2; b, m, p) = \frac{C(b-1)}{12}\Big\{ (m-p)^2 + m\,(2p - m)\big(2F(p) - 1\big) \Big\}. \tag{5}$$

The performance of a quantized model with the PWLQ scheme critically depends on the value of the breakpoint $p$. If $p = m/2$, then PWLQ is essentially equivalent to uniform quantization, because the four pieces have equal quantization ranges and bit-widths. If $p < m/2$, the center region has a smaller range and greater precision than the tail region, as shown in the middle of Figure 1. Conversely, if $p > m/2$, the tail region has greater precision than the center region. To reduce the overall quantization error for bell-shaped distributions found in DNNs, we increase the precision in the center region and decrease it in the tail region. Thus, we limit the breakpoint to the range $0 < p < m/2$.

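Accordingly, the optimal breakpoint $p^*$ can be estimated by minimizing the expected squared quantization error:

$$p^* = \arg\min_{0 < p < m/2} \mathbb{E}(\varepsilon_{pw}^2; b, m, p). \tag{6}$$

Since bell-shaped distributions tend to zero as $r$ becomes large, we consider a smooth $f(r)$ that is decreasing when $r$ is positive, i.e., $f'(r) < 0$, $\forall r > 0$. Then we prove that the optimization problem (6) is convex with respect to the breakpoint $p \in (0, m/2)$. Therefore one unique $p^*$ exists to minimize the quantization error (5), as demonstrated by the following Lemma 1.

Lemma 1 If $f(r) = F'(r)$, $f'(r) < 0$ for all $r > 0$, then $\mathbb{E}(\varepsilon_{pw}^2; b, m, p)$ is a convex function of the breakpoint $p \in (0, m/2)$.

Proof. Taking the first and second derivatives of (5) yields:

$$\frac{\partial\, \mathbb{E}(\varepsilon_{pw}^2; b, m, p)}{\partial p} = \frac{C(b-1)}{6}\Big[ p - m + m\big(2F(p) - 1\big) + m\,(2p - m)\, f(p) \Big], \tag{7}$$

$$\frac{\partial^2\, \mathbb{E}(\varepsilon_{pw}^2; b, m, p)}{\partial p^2} = \frac{C(b-1)}{6}\Big[ 1 + 4m f(p) + m\,(2p - m)\, f'(p) \Big]. \tag{8}$$

Since $0 < p < m/2$ and $f(p) \geq 0$, $f'(p) < 0$, then $\partial^2 \mathbb{E}(\varepsilon_{pw}^2; b, m, p) / \partial p^2 > 0$. Therefore, $\mathbb{E}(\varepsilon_{pw}^2; b, m, p)$ is convex w.r.t. $p$, and thus a unique $p^*$ exists.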

In practice, we can find the optimal breakpoint $p^*$ by solving (6) with gradient descent [50], assuming an underlying Gaussian or Laplacian distribution. Once the optimal breakpoint is found, both Lemma 2 and the numerical simulation in the right of Figure 1 show that PWLQ achieves a smaller quantization error than uniform quantization, which indicates its stronger representational power.
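A minimal sketch of this procedure for a Gaussian assumption is given below. It takes the expected error to be $\frac{C(b-1)}{12}\{(m-p)^2 + m(2p-m)(2F(p)-1)\}$ with $C(b) = 1/(2^b-1)^2$, measures $m$ and $p$ in units of the standard deviation, and ignores the small probability mass beyond $\pm m$. The function names, step size and iteration count are assumptions chosen for illustration (the step size is intended for $m$ up to about 6).

```python
import math

def pdf(x):
    """Standard normal density f."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal CDF F."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_error(p, m, b):
    """Expected squared PWLQ error for breakpoint p on [-m, m] at b bits."""
    c = 1.0 / (2 ** (b - 1) - 1) ** 2
    return c / 12.0 * ((m - p) ** 2 + m * (2 * p - m) * (2 * cdf(p) - 1))

def error_gradient(p, m):
    """Derivative of the expected error w.r.t. p, up to the factor C(b-1)/6."""
    return p - m + m * (2 * cdf(p) - 1) + m * (2 * p - m) * pdf(p)

def optimal_breakpoint(m, lr=0.05, steps=2000):
    """Projected gradient descent on the convex problem over (0, m/2]."""
    p = 0.25 * m
    for _ in range(steps):
        p -= lr * error_gradient(p, m)
        p = min(max(p, 1e-6), 0.5 * m)    # keep p inside the feasible interval
    return p

if __name__ == "__main__":
    m, b = 3.0, 4
    p_star = optimal_breakpoint(m)
    print("p* =", round(p_star, 4))
    print("error at p*  :", expected_error(p_star, m, b))
    print("error at m/2 :", expected_error(0.5 * m, m, b))
```

The positive factor $C(b-1)/6$ is dropped from the gradient because it only rescales the step size; the minimizer itself does not depend on the bit-width $b$.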

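Lemma 2 $\mathbb{E}(\varepsilon_{pw}^2; b, m, p^*) < \mathbb{E}(\varepsilon_{uni}^2; b, -m, m)$.

Proof. The $b$-bit uniform quantization error on $[-m, m]$ is calculated from (2):

$$\mathbb{E}(\varepsilon_{uni}^2; b, -m, m) = \frac{C(b)}{12}\,(2m)^2 = \frac{C(b)}{3}\, m^2. \tag{9}$$

For $b$-bit PWLQ, we solve the convex problem (6) by letting the first derivative equal zero in (7), and determine that the optimal breakpoint $p^*$ satisfies:

$$m\big(2F(p^*) - 1\big) = m - p^* - m\,(2p^* - m)\, f(p^*). \tag{10}$$

By substituting (10) in (5) and simplifying, we obtain:

$$\mathbb{E}(\varepsilon_{pw}^2; b, m, p^*) = \frac{C(b-1)}{12}\Big\{ p^*(m - p^*) - m\,(2p^* - m)^2 f(p^*) \Big\} \tag{11}$$

$$\leq \frac{C(b-1)}{12}\, p^*(m - p^*) < \frac{C(b-1)}{48}\, m^2 \tag{12}$$

$$= \frac{C(b-1)}{16\, C(b)}\, \mathbb{E}(\varepsilon_{uni}^2; b, -m, m) \leq \frac{9}{16}\, \mathbb{E}(\varepsilon_{uni}^2; b, -m, m), \tag{13}$$

where the strict inequality in (12) uses $0 < p^* < m/2$, and the last step uses $C(b-1)/C(b) = (2^b - 1)^2/(2^{b-1} - 1)^2 \leq 9$ for $b \geq 2$.

Therefore, $b$-bit PWLQ achieves a smaller quantization error, which is at most $9/16$ of that of the $b$-bit uniform scheme. This improvement in performance requires only an extra bit for storage and no extra multiplication, as we discuss in the next section.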
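The size of this improvement can be checked numerically. The sketch below assumes $C(b) = 1/(2^b - 1)^2$, the uniform error $C(b)\,m^2/3$ on $[-m, m]$, and the PWLQ bound $C(b-1)\,m^2/48$ that follows from $p^*(m - p^*) < m^2/4$ for $0 < p^* < m/2$; under these assumptions the ratio of the two is $C(b-1)/(16\,C(b))$, independent of $m$. The function names are hypothetical.

```python
from fractions import Fraction

def C(b):
    """C(b) = 1 / (2^b - 1)^2."""
    return Fraction(1, (2 ** b - 1) ** 2)

def bound_ratio(b):
    """Upper bound on (PWLQ error) / (uniform error) at the same bit-width b."""
    return C(b - 1) / (16 * C(b))

if __name__ == "__main__":
    for b in range(2, 9):
        print(b, bound_ratio(b), round(float(bound_ratio(b)), 4))
```

The ratio is largest at $b = 2$ and decreases toward $1/4$ as $b$ grows, so the guaranteed reduction is strongest in relative terms at higher bit-widths, while the absolute error matters most at low ones.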

4 Hardware Impact

In this section, we discuss the hardware requirements for efficient deployment of DNNs quantized with PWLQ. In convolutional and fully-connected layers, every output can be computed using an inner product between vector X and vector W, which correspond to the input activation and weight (sub)tensors respectively.

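From scheme (1), the approximated versions of uniform quantization are $\hat{X} = s_x X_q + z_x I$ and $\hat{W} = s_w W_q$ (assuming symmetric quantization for weights), where $X_q$ and $W_q$ are quantized integer vectors from $X$ and $W$, $I$ is an identity (all-ones) vector, and $s_x$, $s_w$ and $z_x$ are the associated constant-valued scaling factors and offset, respectively. The output of this uniform quantization is:

$$\langle \hat{X}, \hat{W} \rangle = \langle s_x X_q + z_x I,\; s_w W_q \rangle = C_1 \langle X_q, W_q \rangle + C_2, \tag{14}$$

where $\langle \cdot, \cdot \rangle$ is defined as the vector inner product, and $C_1 = s_x s_w$ and $C_2 = z_x s_w \langle I, W_q \rangle$ denote floating-point constant terms that can be pre-computed offline.

Equation (14) implies that a uniformly quantized DNN requires two steps: (i) an integer-arithmetic (INT) inner product, followed by (ii) a floating-point (FP) affine map. The expensive $O(|W|)$ (the size of vector $W$) FP operations $\langle \hat{X}, \hat{W} \rangle$ are thus accelerated via the INT operations $\langle X_q, W_q \rangle$, plus $O(1)$ FP rescaling and adding operations using $C_1$ and $C_2$.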
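The two-step computation can be illustrated with a small Python sketch. It assumes asymmetric unsigned quantization for activations (offset equal to the lower range bound) and symmetric signed quantization for weights, so that the output is $s_x s_w \langle X_q, W_q \rangle + z_x s_w \langle I, W_q \rangle$. The data, ranges and function names are hypothetical.

```python
import random

def quantize_unsigned(v, b, r_l, r_u):
    """Asymmetric scheme: integers in [0, 2^b - 1], offset z = r_l."""
    s = (r_u - r_l) / (2 ** b - 1)
    q = []
    for x in v:
        clamped = min(max(x, r_l), r_u)
        q.append(min(max(round((clamped - r_l) / s), 0), 2 ** b - 1))
    return q, s, r_l

def quantize_signed(v, b, m):
    """Symmetric scheme: integers in [-2^(b-1), 2^(b-1) - 1], offset 0."""
    s = 2 * m / (2 ** b - 1)
    q = []
    for x in v:
        clamped = min(max(x, -m), m)
        q.append(min(max(round(clamped / s), -(2 ** (b - 1))), 2 ** (b - 1) - 1))
    return q, s

rng = random.Random(0)
X = [rng.uniform(-0.2, 4.0) for _ in range(64)]   # hypothetical activations
W = [rng.gauss(0.0, 0.5) for _ in range(64)]      # hypothetical weights

Xq, s_x, z_x = quantize_unsigned(X, 8, -0.2, 4.0)
Wq, s_w = quantize_signed(W, 8, 1.5)

# Step (i): integer-only inner product over all |W| elements.
int_dot = sum(xq * wq for xq, wq in zip(Xq, Wq))

# Step (ii): O(1) floating-point affine map; C1 and C2 are known offline.
C1 = s_x * s_w
C2 = z_x * s_w * sum(Wq)
out_fast = C1 * int_dot + C2

# Reference: dequantize every element, then take the FP inner product.
out_ref = sum((s_x * xq + z_x) * (s_w * wq) for xq, wq in zip(Xq, Wq))
print(out_fast, out_ref)
```

The two printed values agree up to floating-point rounding, while the first path performs only two floating-point operations after the integer accumulation.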

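As we showed in Section 3.2, when applying PWLQ on weights with one breakpoint, the algorithm breaks the ranges into non-overlapping regions ($R_1$ and $R_2$), which requires separate computational paths ($P_1$ and $P_2$) as each region has a different scaling factor. We set offsets $z_{w_1} = 0$, $z_{w_2} = p$ and denote scaling factors by $s_{w_1}$, $s_{w_2}$ in $R_1$, $R_2$, respectively. We also define by $\langle \cdot, \cdot \rangle_{R_i}$ the associated partial vector inner product, and $W_{q_i}$ the associated quantized integer vector of $W$ in region $R_i$ for $i = 1, 2$. Then $P_1$ is computed using the following equation:

$$P_1 = \langle s_x X_q + z_x I,\; s_{w_1} W_{q_1} \rangle_{R_1} = C_{1,1} \langle X_q, W_{q_1} \rangle_{R_1} + C_{2,1}. \tag{15}$$

$P_2$ has additional terms as it has a non-zero offset $p$:

$$P_2 = \langle s_x X_q + z_x I,\; s_{w_2} W_{q_2} + p\, I \rangle_{R_2} = C_{1,2} \langle X_q, W_{q_2} \rangle_{R_2} + C_3 \langle X_q, I \rangle_{R_2} + C_{2,2}, \tag{16}$$

where $C_{1,i}$, $C_{2,i}$ and $C_3$ are constant terms, which can be pre-computed similar to $C_1$ and $C_2$ in (14). Here the sign of each tail weight is absorbed into $W_{q_2}$ and into the corresponding entry of $I$ in the offset term.

As indicated by (15) and (16) for PWLQ compared to uniform quantization (14), the extra term $C_3 \langle X_q, I \rangle_{R_2}$ is needed due to the non-zero offset $p$, which sums up the activations corresponding to weights in $R_2$. Since most of the weights are in $R_1$, these extra computations in $R_2$ rarely happen. In addition, FP rescaling and adding are needed in each region, which also increases the overall FP operation overhead.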

In short, an efficient hardware implementation of PWLQ requires:

– One multiplier for products in both $R_1$ and $R_2$.

– Three accumulators: one each for the sum of products in $R_1$ and $R_2$, and another one for activations in $R_2$.

– At most one extra bit of storage per weight value to indicate the region. Note that this extra bit does not increase the multiply-accumulate (MAC) computation and it is only used to determine the appropriate accumulator, which can be done in hardware at negligible cost on the MAC unit.

Based on the above explanation, it is clear that more breakpoints require more accumulators and more storage bits per weight tensor. Also, applying PWLQ on both weights and activations requires accumulators for each combination of activation regions and weight regions, which translates to more hardware overhead. As a result, more than one breakpoint on the weight tensor or applying PWLQ on both weights and activations might not be feasible, from a hardware implementation perspective.
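The requirements listed above can be mimicked in software. The following sketch is a behavioural illustration only, not a hardware model: it assumes one breakpoint on the weights, 8-bit unsigned activations with zero offset (so the constant terms arising from the activation offset vanish), and a sign-magnitude integer code per weight. All names and numeric values are hypothetical.

```python
import random

def pwlq_encode(w, b, m, p):
    """Return (region, signed integer code, sign) for one weight under PWLQ."""
    levels = 2 ** (b - 1) - 1
    sign = -1 if w < 0 else 1
    a = min(abs(w), m)
    if a <= p:
        return 1, sign * round(a / (p / levels)), sign
    return 2, sign * round((a - p) / ((m - p) / levels)), sign

b, m, p = 4, 3.0, 1.16                  # illustrative bit-width, range, breakpoint
levels = 2 ** (b - 1) - 1
s_w1, s_w2 = p / levels, (m - p) / levels
s_x = 4.0 / 255                          # 8-bit activations on [0, 4], offset 0

rng = random.Random(0)
Xq = [rng.randrange(256) for _ in range(128)]      # quantized activations
W = [rng.gauss(0.0, 1.0) for _ in range(128)]      # synthetic weights
codes = [pwlq_encode(w, b, m, p) for w in W]

acc_r1 = acc_r2 = acc_x2 = 0             # the three integer accumulators
for xq, (region, wq, sign) in zip(Xq, codes):
    prod = xq * wq                       # one shared integer multiplier
    if region == 1:                      # the region bit selects the accumulator
        acc_r1 += prod
    else:
        acc_r2 += prod
        acc_x2 += sign * xq              # activations paired with tail weights

# Floating-point rescaling, a constant number of operations per region.
out_fast = s_x * s_w1 * acc_r1 + s_x * s_w2 * acc_r2 + s_x * p * acc_x2

# Reference: dequantize each weight, then accumulate in floating point.
out_ref = 0.0
for xq, (region, wq, sign) in zip(Xq, codes):
    w_hat = s_w1 * wq if region == 1 else s_w2 * wq + sign * p
    out_ref += (s_x * xq) * w_hat

share_r1 = sum(1 for region, _, _ in codes if region == 1) / len(codes)
print(out_fast, out_ref, share_r1)
```

With bell-shaped weights most codes fall in the center region, so the third accumulator is updated for only a minority of the elements.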

5 Experiments

We evaluate the robustness of our proposed PWLQ scheme for post-training quantization on popular networks of several computer vision benchmarks: ImageNet classification [51], semantic segmentation and object detection on the Pascal VOC challenge [11]. In all experiments, we apply batch normalization folding [25] before quantization. For activations, we follow the profiling strategy in [62] to sample from 512 training images, and collect the median of the top-10 smallest and top-10 largest activation values for the minimum and maximum range boundaries at each layer, respectively. During inference, we apply quantization after clipping with these ranges. Unless stated otherwise, we quantize all network weights per-channel into 3-to-8 bits; and uniformly quantize activations as well as pooling layers per-layer into 8-bit. We perform all experiments in PyTorch 1.2.0 [44].
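One plausible reading of this profiling step is sketched below: for a layer's collected activation values, the range boundaries are the median of the 10 smallest and the median of the 10 largest values, and activations are clipped to that range before quantization. How values are aggregated across the 512 images is not specified here, so the single flat list and the function names `profile_range` and `clip` are assumptions.

```python
import statistics

def profile_range(activations, k=10):
    """Median of the k smallest and of the k largest values (needs >= k values)."""
    ordered = sorted(activations)
    lower = statistics.median(ordered[:k])
    upper = statistics.median(ordered[-k:])
    return lower, upper

def clip(x, lower, upper):
    """Clip one activation to the profiled range before quantization."""
    return min(max(x, lower), upper)

if __name__ == "__main__":
    values = [float(v) for v in range(100)]      # stand-in for profiled activations
    lo, hi = profile_range(values)
    print(lo, hi, clip(1000.0, lo, hi))
```

Taking a median over the extreme values, rather than the raw minimum and maximum, makes the range less sensitive to a handful of outliers.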

5.1 Ablation Study on ImageNet

In this section, we conduct experiments on the ImageNet classification challenge [51] and investigate the effectiveness of our proposed PWLQ method. We evaluate the top-1 accuracy performance on the validation dataset for three popular network architectures: Inception-v3 [54], ResNet-50 [19] and MobileNet-v2 [52]. We use torchvision 0.4.0 and its pre-trained models for our experiments.

Optimal Breakpoint Selection. In order to apply PWLQ, we first need to find the optimal breakpoints to divide the quantization ranges into non-overlapping regions. As stated in Section 3.3, we assume weights and activations follow Gaussian or Laplacian distributions, and then find the optimal breakpoints by solving the optimization problem (6).

For the case of one optimal breakpoint $p^*$, we can find it iteratively by gradient descent since (6) is convex, or use a simple and fast approximation, $p^* \approx \ln(0.8614\,m + 0.6079)$, for a normalized (unit-variance) Gaussian. Experimental results show that the approximation obtains almost the same accuracy compared to gradient descent, while also being considerably faster. Therefore, unless stated otherwise we use this approximated version of the optimal breakpoint for the rest of this paper. We report results with other assumptions such as Laplacian distributions in the supplementary material.
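The quality of such a closed-form approximation can be checked against a numerical solution. The sketch below reads the approximation as $p^* \approx \ln(0.8614\,m + 0.6079)$ with $m$ and $p^*$ measured in standard deviations, which is an interpretation of the text rather than a quoted formula. As a stand-in for gradient descent it uses bisection on the stationarity condition $p - m + m(2F(p) - 1) + m(2p - m)f(p) = 0$ for a standard normal, which is valid because the objective is convex on $(0, m/2)$; the function names are hypothetical.

```python
import math

def pdf(x):
    """Standard normal density f."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal CDF F."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def error_gradient(p, m):
    """Derivative of the expected PWLQ error w.r.t. p, up to a positive factor."""
    return p - m + m * (2 * cdf(p) - 1) + m * (2 * p - m) * pdf(p)

def breakpoint_bisection(m, iters=60):
    """Root of the (increasing) gradient on (0, m/2); assumes it changes sign."""
    lo, hi = 0.0, 0.5 * m
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if error_gradient(mid, m) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def breakpoint_approx(m):
    """Closed-form approximation, as interpreted above (m in standard deviations)."""
    return math.log(0.8614 * m + 0.6079)

if __name__ == "__main__":
    for m in (3.0, 4.0, 5.0, 6.0):
        print(m, round(breakpoint_bisection(m), 4), round(breakpoint_approx(m), 4))
```

For ranges of three to six standard deviations the two values stay close, which is consistent with the reported near-identical accuracy; for a tensor with standard deviation $\sigma \neq 1$, the same check would be applied to $m/\sigma$ and the result rescaled by $\sigma$.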

Other works treat the data distributions differently: BiScaled-DNN [26] proposes a ratio heuristic to divide the data into two overlapping regions; and V-Quant [43] introduces a value-aware method to split them into two non-overlapping regions, e.g., 2% (98%) of large (small) values located in the tail (center) region, respectively. Our implementation results in Figure 2 (left) show that PWLQ with non-overlapping regions achieves superior performance on low-bit quantization compared to an improved version of BiScaled-DNN (denoted by BSD+) and V-Quant, especially by a large margin on 4-bit MobileNet-v2. The non-overlapping approach shortens the quantization ranges ($r_u - r_l$ in (2)) for the tail regions by a factor of 1× to 2×, since each tail piece spans $m - p$ rather than $m$ with $0 < p < m/2$. Therefore, both our choices of non-overlapping regions and optimal breakpoints have a significant impact on reducing the quantization error and improving the performance of low-bit quantized models.

Fig. 2. Left: the impact of non-overlapping and breakpoint options on the top-1 accuracy for 4-bit post-training quantization models. Right: the robustness of the optimal breakpoint found by solving (6) with some perturbation levels from 5% to 30% for 4-bit Inception-v3 (full-precision accuracy 77.49%). Each perturbation level is run with 100 random samples, the star and the associated number indicate the median accuracy, the bold bar displays the accuracy range between the 25th and 75th percentiles

In Figure 2 (right), we explore the robustness of the optimal breakpoint found by minimizing the quantization error in (6) for 4-bit Inception-v3. We randomly add perturbation levels from 5% to 30% on each optimal breakpoint per-channel per-layer, e.g., the new breakpoint $\tilde{p} = 0.95\,p^*$ or $1.05\,p^*$ for 5% of perturbation. We run 100 random samples for each perturbation level to generate the results. Overall, model performance decreases as the perturbation level increases, which indicates that our selection of the optimal breakpoint is crucial for accurate post-training quantization. Note that when 5% of perturbation is added to our selection of optimal breakpoints, more than half of the experiments produce a lower accuracy, which can be as low as 74.05%, a 1.67% drop from the zero-perturbation baseline.

Multiple Breakpoints. In this section, we discuss the trade-off of multiple breakpoints on model accuracy and hardware overhead. Theoretically, as the number of breakpoints on weights increases, the associated hardware cost linearly rises. Meanwhile, the number of non-overlapping regions and the associated total number of quantization levels grows, indicating a stronger representational power. Numerically, the extension of finding the optimal multi-breakpoints is straightforward by calculating the same quantization error (4), and solving the same optimization problem (6) with gradient descent in an enlarged search space. Table 1 shows the accuracy performance up to three breakpoints. In general, using more breakpoints consistently improves model accuracy under the growing support of customized hardware. We suggest using one breakpoint to maintain the simplicity of the inference algorithm and its hardware implementation. Thus we only report PWLQ with one breakpoint for the rest of this paper.

Table 1. Top-1 accuracy (%) and requirement of hardware accumulators for PWLQ with multiple breakpoints on weights

PWLQ and Uniform Quantization. In Section 3.3, we analytically and numerically demonstrate that our method, PWLQ, obtains a smaller quantization error than uniform quantization. We compare these two schemes in Table 2. In this table, weights are quantized per-channel with the same computational bit-width b = 4, 6, 8; activations are uniformly quantized per-layer into 8-bit. Generally, PWLQ achieves higher accuracy than uniform quantization except for one minor case of 8-bit Inception-v3. When the bit-width is large enough (b = 8), the quantization error is small and both uniform quantization and PWLQ provide good accuracy. However, when the bit-width is decreased to 4, PWLQ obtains a notably higher accuracy, e.g., PWLQ attains 75.72% but uniform quantization only attains 44.28% for 4-bit Inception-v3. These results show that PWLQ is a more powerful representation scheme in terms of both quantization error and model accuracy, making it a viable alternative to uniform quantization in low bit-width cases. Moreover, PWLQ applies uniform quantization on each piece, hence it features a simple computational scheme and can benefit from any tricks that improve uniform quantization performance such as bias correction.

Table 2. Comparison results of top-1 accuracy (%) for uniform and PWLQ schemes on weights. b+BC: b-bit with bias correction for bit-width b = 4, 6, 8. Each bold value indicates the best result from different methods for specified bit-width and network

Bias Correction. An inherent bias in the mean and variance of the tensor values was observed after the quantization process and the benefits of correcting this bias term have been demonstrated in [2,13,42]. This bias can be compensated by folding certain correction terms into the scale and the offset [2]. We adopt this idea into our PWLQ method and show the results in Table 2 (columns with “+BC”). Applying bias correction further improves the performance of low-bit quantized models. It allows 6-bit post-training quantization with piecewise linear scheme for all three networks to achieve near full-precision accuracy within a drop of 0.30%; 4-bit MobileNet-v2, also without retraining, achieves an accuracy of 69.22%. In general, a combination of low-bit PWLQ and bias correction on weights achieves minimal loss of full-precision model performance.
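To make the idea concrete, the sketch below shows one simple way to remove a mean and variance shift per output channel: the quantized weights are mapped through an affine correction so that their first two moments match those of the full-precision weights. This moment-matching form and the name `bias_correct` are assumptions for illustration; the exact correction terms used in [2] may differ. Because the correction is affine, it can be folded into the scale and offset of the channel.

```python
import math

def mean(v):
    return sum(v) / len(v)

def std(v):
    mu = mean(v)
    return math.sqrt(sum((x - mu) ** 2 for x in v) / len(v))

def bias_correct(w, w_q):
    """Affine-correct dequantized weights w_q to match the mean and std of w."""
    mu_w, mu_q = mean(w), mean(w_q)
    sd_w, sd_q = std(w), std(w_q)
    xi = sd_w / sd_q if sd_q > 0 else 1.0       # variance correction factor
    return [mu_w + xi * (q - mu_q) for q in w_q]

if __name__ == "__main__":
    w = [0.12, -0.40, 0.33, 0.05, -0.27, 0.81, -0.09, 0.48]   # one toy channel
    w_q = [round(x * 2) / 2 for x in w]                        # very coarse quantizer
    w_c = bias_correct(w, w_q)
    print("mean:", mean(w), mean(w_q), mean(w_c))
    print("std :", std(w), std(w_q), std(w_c))
```

In this toy channel the coarse quantizer shifts both statistics, and the corrected weights recover the original mean and standard deviation by construction.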

5.2 Comparison to Existing Approaches

In this section, we compare our PWLQ method with other existing approaches, by quoting the reported performance scores from the original literature.

An inclusive evaluation of clipping techniques along with outlier channel splitting (OCS) was presented in [62]. To fairly compare with these methods, we adopt the same setup of applying per-layer quantization on weights and without quantizing the first layer. In Table 3, we show that our PWLQ (no bias correction) outperforms the best results of clipping method combined with OCS. Besides, OCS needs to change the network architecture, in contrast to PWLQ.

Table 3. Comparison results of per-layer PWLQ and best clipping with OCS [62] on top-1 accuracy (%) loss. W/A indicate the bit-width on weights/activations. The accuracy difference values are measured from the full-precision (32/32) result

In Table 4, we provide a comprehensive comparison of our PWLQ to other existing quantization methods. Here we apply per-layer quantization on activations and per-channel PWLQ on weights with bias correction. Except for the 4/4 case, where we apply 4-bit PWLQ on activations, we always apply 8-bit uniform quantization on activations, i.e., in the 8/8 and 4/8 cases. At the same computational bit-width across all methods, our PWLQ combined with bias correction achieves state-of-the-art results in all cases and outperforms all other methods by a large margin in the 4/8 and 4/4 cases. We emphasize that our PWLQ method is simple and efficient. It achieves the desired accuracy at the small cost of a few more accumulations per MAC unit and a minor storage overhead. More importantly, it is orthogonal and applicable to other methods.

Table 4. Comparison of our PWLQ and other methods on top-1 accuracy (%) loss. PWLQ: weights are piecewise linearly quantized per-channel with bias correction, activations are quantized per-layer

5.3 Other Applications

To show the robustness and applicability of our proposed approach, we extend the PWLQ idea to other computer vision tasks including semantic segmentation on DeepLab-v3+ [5] and object detection on SSD [36].

Semantic Segmentation. In this section, we apply PWLQ on DeepLab-v3+ with a backbone of MobileNet-v2. The performance is evaluated using mean intersection over union (mIoU) on the Pascal VOC segmentation challenge [11].
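For reference, mean intersection over union can be computed from a class confusion matrix as sketched below. The helper name `mean_iou` and the convention of skipping classes that never occur are assumptions of this sketch; evaluation toolkits may handle absent classes differently.

```python
def mean_iou(confusion):
    """Mean IoU; confusion[i][j] counts pixels of true class i predicted as j."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp     # wrongly assigned to c
        union = tp + fp + fn
        if union > 0:                                        # skip absent classes
            ious.append(tp / union)
    return sum(ious) / len(ious)

if __name__ == "__main__":
    print(mean_iou([[3, 1], [1, 5]]))
```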

In our experiments, we utilize the implementation of a public PyTorch repository to evaluate the performance. After folding batch normalization of the pre-trained model into the weights, we found that the weight ranges of several layers become very large (e.g., [-54.4, 64.4]). Considering the fact that the quantization range [27], especially in the early layers [7], has a profound impact on the performance of quantized models, we fix the configuration of some early layers in the backbone. More precisely, we apply 8-bit PWLQ on three depth-wise convolution layers with large ranges in all configurations shown in Table 5. Note that the MAC operations of these three layers are negligible in practice since they only contribute 0.2% of the entire network computation, but this choice is remarkably beneficial to the performance of low-bit quantized models.
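The effect of folding on weight ranges can be seen from the standard folding identity, sketched here for a single output channel: the weights are multiplied by $\gamma / \sqrt{\sigma^2 + \epsilon}$, so a channel with a small running variance ends up with a much larger weight range. The numeric values and the name `fold_bn` are hypothetical and chosen only to make the effect visible.

```python
import math

def fold_bn(w, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters of one output channel into weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    return [wi * scale for wi in w], (bias - mean) * scale + beta

# One output channel with a small running variance (hypothetical values).
w, bias = [0.5, -0.25, 0.1], 0.05
gamma, beta, mean, var = 1.2, 0.3, 0.4, 1e-4
x = [1.0, 2.0, -1.0]

# Original path: convolution (here a dot product) followed by batch norm.
conv = sum(wi * xi for wi, xi in zip(w, x)) + bias
bn_out = gamma * (conv - mean) / math.sqrt(var + 1e-5) + beta

# Folded path: a single dot product with rescaled weights and bias.
w_f, b_f = fold_bn(w, bias, gamma, beta, mean, var)
folded_out = sum(wi * xi for wi, xi in zip(w_f, x)) + b_f

print(bn_out, folded_out)
print("max |w| before:", max(abs(v) for v in w), "after:", max(abs(v) for v in w_f))
```

Both paths give the same output, but the folded weights are roughly two orders of magnitude larger in this example, which is the kind of range growth that motivates keeping such layers at a higher bit-width.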

Table 5. Uniform quantization and PWLQ on DeepLab-v3+. Weights are quantized per-channel with bias correction, activations are uniformly quantized per-layer

As noticed in classification, low-bit uniform quantization causes significant accuracy drop from the full-precision models. In Table 5, applying the piecewise linear method combined with bias correction, the 6-bit PWLQ model on weights even outperforms 8-bit DFQ [42], which attains 0.42% degradation of the pre-trained model. Moreover, the 4-bit PWLQ significantly improves the mIoU by 17.61% from the 4-bit uniform quantized model, indicating the potential of low-bit post-training quantization via piecewise linear approximation for the semantic segmentation task.

Object Detection. We also test the proposed PWLQ for the object detection task. The experiments are performed on a public PyTorch implementation of the SSD-Lite version [36] with a backbone of MobileNet-v2. The performance is evaluated with mean average precision (mAP) on the Pascal VOC object detection challenge [11].

Table 6 compares the mAP scores of quantized models using the uniform and PWLQ schemes. Similar to the image classification and semantic segmentation tasks, even with bias correction and per-channel quantization enhancements, the 4-bit uniform scheme causes a 3.91% performance drop from the full-precision model, while 4-bit PWLQ with these two enhancements reduces this gap to 0.38%.

Table 6. Uniform quantization and PWLQ of SSD-Lite version. Weights are quantized per-channel with bias correction, activations are uniformly quantized per-layer

6 Conclusion

In this work, we present a piecewise linear quantization scheme for accurate post-training quantization of deep neural networks. It breaks the bell-shaped distributed values into non-overlapping regions per tensor where each region is assigned an equal number of quantization levels. We further analyze the resulting quantization error as well as the hardware requirements. We show that our approach achieves state-of-the-art low-bit post-training quantization performance on image classification, semantic segmentation, and object detection tasks under the same computational cost. This indicates its potential for efficient and rapid deployment of computer vision applications on resource-limited devices.

Acknowledgements. We would like to thank Hui Chen and Jong Hoon Shin for valuable discussions.

References

1. Bakunas-Milanowski, D., Rego, V., Sang, J., Chansu, Y.: Efficient algorithms for stream compaction on gpus. International Journal of Networking and Computing pp. 208–226 (2017)

2. Banner, R., Nahshan, Y., Hoffer, E., Soudry, D.: Post training 4-bit quantization of convolution networks for rapid-deployment. CoRR, abs/1810.05723 (2018)

3. Baskin, C., Schwartz, E., Zheltonozhskii, E., Liss, N., Giryes, R., Bronstein, A.M., Mendelson, A.: Uniq: Uniform noise injection for non-uniform quantization of neural networks. arXiv preprint arXiv:1804.10969 (2018)

4. Cai, Z., He, X., Sun, J., Vasconcelos, N.: Deep learning with low precision by half-wave gaussian quantization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5918–5926 (2017)

5. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV). pp. 801–818 (2018)

6. Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.J., Srinivasan, V., Gopalakrishnan, K.: Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018)

7. Choukroun, Y., Kravchik, E., Kisilev, P.: Low-bit quantization of neural networks for efficient inference. arXiv preprint arXiv:1902.06822 (2019)

8. Courbariaux, M., Bengio, Y., David, J.P.: Binaryconnect: Training deep neural networks with binary weights during propagations. In: Advances in neural information processing systems. pp. 3123–3131 (2015)

9. Dhillon, G.S., Azizzadenesheli, K., Lipton, Z.C., Bernstein, J., Kossaifi, J., Khanna, A., Anandkumar, A.: Stochastic activation pruning for robust adversarial defense. arXiv preprint arXiv:1803.01442 (2018)

10. Dong, X., Huang, J., Yang, Y., Yan, S.: More is less: A more complicated network with less inference complexity. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5840–5848 (2017)

11. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision pp. 303–338 (2010)

12. Faraone, J., Fraser, N., Blott, M., Leong, P.H.: Syq: Learning symmetric quantization for efficient deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4300–4309 (2018)

13. Finkelstein, A., Almog, U., Grobman, M.: Fighting quantization bias with bias. arXiv preprint arXiv:1906.03193 (2019)

14. Georgiadis, G.: Accelerating convolutional neural networks via activation map compression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7085–7095 (2019)

15. Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115 (2014)

16. Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision. In: International Conference on Machine Learning. pp. 1737–1746 (2015)

17. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015)

18. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)

19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

20. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1389–1397 (2017)

21. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

22. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)

23. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)

24. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)

25. Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., Kalenichenko, D.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2704–2713 (2018)

26. Jain, S., Venkataramani, S., Srinivasan, V., Choi, J., Gopalakrishnan, K., Chang, L.: Biscaled-dnn: Quantizing long-tailed datastructures with two scale factors for deep neural networks. In: 2019 56th ACM/IEEE Design Automation Conference (DAC). pp. 1–6. IEEE (2019)

27. Jung, S., Son, C., Lee, S., Son, J., Han, J.J., Kwak, Y., Hwang, S.J., Choi, C.: Learning to quantize deep networks by optimizing quantization intervals with task loss. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4350–4359 (2019)

28. Krishnamoorthi, R.: Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342 (2018)

29. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)

30. Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE transactions on pattern analysis and machine intelligence (2018)

31. Lee, J.H., Ha, S., Choi, S., Lee, W.J., Lee, S.: Quantization for rapid deployment of deep neural networks. arXiv preprint arXiv:1810.05488 (2018)

32. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016)

33. Li, R., Wang, Y., Liang, F., Qin, H., Yan, J., Fan, R.: Fully quantized network for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2810–2819 (2019)

34. Li, Y., Dong, X., Wang, W.: Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=BkgXT24tDS

35. Lin, D., Talathi, S., Annapureddy, S.: Fixed point quantization of deep convolutional networks. In: International Conference on Machine Learning. pp. 2849–2858 (2016)

36. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: European conference on computer vision. pp. 21–37. Springer (2016)

37. Luo, J.H., Wu, J., Lin, W.: Thinet: A filter level pruning method for deep neural network compression. In: Proceedings of the IEEE international conference on computer vision. pp. 5058–5066 (2017)

38. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 116–131 (2018)

39. Meller, E., Finkelstein, A., Almog, U., Grobman, M.: Same, same but different - recovering neural network quantization error through weight factorization. arXiv preprint arXiv:1902.01917 (2019)

40. Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al.: Mixed precision training. arXiv preprint arXiv:1710.03740 (2017)

41. Miyashita, D., Lee, E.H., Murmann, B.: Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025 (2016)

42. Nagel, M., van Baalen, M., Blankevoort, T., Welling, M.: Data-free quantization through weight equalization and bias correction. arXiv preprint arXiv:1906.04721 (2019)

43. Park, E., Yoo, S., Vajda, P.: Value-aware quantization for training and inference of neural networks. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 580–595 (2018)

44. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch. 31st Conference on Neural Information Processing Systems (2017)

45. Polino, A., Pascanu, R., Alistarh, D.: Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668 (2018)

46. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classification using binary convolutional neural networks. In: European Conference on Computer Vision. pp. 525–542. Springer (2016)

47. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7263–7271 (2017)

48. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. pp. 91–99 (2015)

49. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)

50. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature pp. 533–536 (1986)

51. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International journal of computer vision (2015)

52. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4510–4520 (2018)

53. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

54. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2818–2826 (2016)

55. Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019)

56. Ullrich, K., Meeds, E., Welling, M.: Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008 (2017)

57. Wu, J., Leng, C., Wang, Y., Hu, Q., Cheng, J.: Quantized convolutional neural networks for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4820–4828 (2016)

58. You, Y.: Audio Coding: Theory and Applications. Springer Science & Business Media (2010)

59. Zagoruyko, S., Komodakis, N.: Wide residual networks. arXiv preprint arXiv:1605.07146 (2016)

60. Zhang, D., Yang, J., Ye, D., Hua, G.: Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 365–382 (2018)

61. Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6848–6856 (2018)

62. Zhao, R., Hu, Y., Dotzel, J., De Sa, C., Zhang, Z.: Improving neural network quantization without retraining using outlier channel splitting. In: International Conference on Machine Learning. pp. 7543–7552 (2019)

63. Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y.: Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044 (2017)

64. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016)

65. Zhou, Y., Moosavi-Dezfooli, S.M., Cheung, N.M., Frossard, P.: Adaptive quantization for deep neural network. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)

66. Zhu, C., Han, S., Mao, H., Dally, W.J.: Trained ternary quantization. arXiv preprint arXiv:1612.01064 (2016)
