Present-day neural networks include millions parameters and their sizes become larger and larger. This is a global problem preventing deployment of neural networks on mobile devices and different chips. For example, even a relatively small pre-trained neural network VGG16 [1] designed to image classification [2] occupies about 553 MB in the device’s memory. There are different methods allowing us to decrease the size of the pre-trained neural
network. In particular, one of these methods is a discretization that is a reduction of bit width of the neural network weights by dividing the weight distribution interval into discrete values. Applying the discretization to pre-trained neural networks, we significantly decrease the sizes of the weights and “lighten” the neural network making it applicable for realization and launch on mobile devices. In the papers [3-6], they examined clipping operations and quantization of matrix weights of neural networks and obtained analytic estimates for an optimal discrete representation for the Hopfield neural network weights as well as formulas for probabilities of errors in the obtained model. For the Hopfield model based on binary neurons, it was shown that the optimal discretization of the connection weights corresponded to maximization of correlations between the input and the discretized weights. In the present paper, we use the same idea, but for feedforward deep neural networks.
The most realizations of discretization of neural networks weights include the neural networks re-training in the course of the discretization; however, this process requires a huge computing power [7-9]. In contrast to the cited works, we examine discretization methods where the re-training is not necessary. In other words, we analyze the discretization of the weights of already trained neural networks. In the framework of such approach, the huge computing power is not necessary. The similar method was discussed in paper [10] where the authors proposed an approach known as a logarithmic discretization. They partitioned the weights into groups where the values of the logarithm to base 2 of the weights were equal to an integer from the interval [-7; 0]. In the present paper, we propose a more general approach in the framework of which we can vary the base of logarithm as well as the initial value. We called it an exponential discretization.
Let us assume initially that the neural network weights we have to discretize are distributed symmetrically over the interval [ , ]is maximum absolute value of the weights. When we keep the signs of the weights in a separate array (providing one bit for the sign) absolute values of the numbers are distributed over the range [0, ]M . For simplicity, we divide all of them by M . Consequently, in what follows that all the numbers are distributed over the interval [0,1]
When discretizing our goal is to decrease the variety of numbers by means of dividing the
interval [0,1] intervals whose ends are at the points
Note, the first interval is 0[0, ]x and the last interval is 2[ ,1]; all the other intervals are defined as 1[ , ], 1,..., 1
. The number of bits B necessary to store the discretized weight define the number of intervals n :
at that we hold one bit for the sign of the number. Although an arbitrary partition into intervals (1) is possible, we examined the two most
popular ways only: the linear and the exponential partition (see Fig. 1).
2.1. Linear partition
In this case, we divide the interval of distribution into the same intervals:
Here 0x is the first point (a parameter of variation), n is the number of intervals, and the length of the intervals q we determine from the equality 1 1
Fig.1. Equidistant linear distribution (left panel) and exponential distribution (right panel).
2.2. Exponential partition
In the case of the exponential partition, the subsequent intervals increase exponentially with respect of first interval:
derive the formula for the length of the intervals
2.3.The value of 0x
When implementing the both linear and exponential discretization algorithms it is necessary to define the initial number 0x . We do this maximizing the correlation between the 32-
bit (the full precision) weights and the discretized weights. Let be the initial values of the weights in a layer and let y be the discretized values. The expression for the correlation has the form
Optimal value 0x corresponds to the maximal correlation. We suppose that the higher correlation between the discretized and initial weights the less the error in the work of the trained network.
2.4.Rounding of numbers
For a given number of bits B the discretization procedure is defined by the position of the end of the initial interval (the value of 0x ) as well as by the rounding method. There are different variants of rounding of numbers. We examined three of them: the floor, the ceiling, and the mean method.
the lower boundary of the interval
boundary of the interval 1
mean value of the numbers inside this interval.
Fig.2. Approximate code of discretization function. Commented lines show different variants of
Gaussian and the Laplacian distribution. In both cases we used 10 000 numbers from the given distribution to generate a vector and then performed the discretization using one of the abovementioned methods. We ranged the parameter 0x from 0 to 1 and changed the number of bits from 2 to 6. The obtained results we collected in Tables 1 and 2 and showed them in Figures 3 and 4.
maximal value of the correlation depends significantly on this parameter as well as on the discretization method. The values of maximums we collected in Table 1.
optimal partition into arbitrary intervals that did not use formulas (3) or (5). In Fig. 3, the horizontal point lines “optimal” correspond to the results obtained with the aid of this algorithm. Unfortunately, they define the optimal solution only if the number of the intervals is small (when B equals to 2 or 3). When the number of the intervals is larger ( 4) the algorithm often stucks and does not allow to obtain the optimal solution. In Fig. 3, we see that for 5
exponential_mean method provides a slightly better solution then this algorithm.
Fig.3. Dependence of correlation on 0x . On the horizontal axis is the value of 0x normalized to
Table 1. Maximal correlation value when discretizing Laplacian random numbers.
Table 2. Optimal values 0 / when discretizing Laplacian random numbers.
we saw almost linear dependence between 0x and the standard deviation of the numbers Fig. 4). The fluctuations of the normalized value 0 /
are much less (see Table 2). This value is exponentially dependent on the number of bits (see Fig. 5) and that is why we can obtain the following approximate formula for the optimal value of 0x :
Fig. 4. Dependence of optimal 0 when discretizing Laplace random numbers with 2
Fig. 5. Dependence of optimal value 0 / on number of bits B for Laplace distribution.
Our analysis allowed us to draw the following conclusions.
First, when the number of bits increases the correlation approaches 1; at that the approach speed depends exponentially on the number of bits (Table 1).
Second, the maximal correlation is achieved at a finite value of 0x . The right choice of 0x allows us to reduce the difference 1
significantly.
zero exponentially when the number of bits increases (see Eq. (8)). Finally, as we see from Table 1 the linear (uniform) discretization is 1 bit behind the
exponential discretization. This means that the maximal correlations obtained in the framework of the linear partition is achieved by exponential partition when the number of bits is one bit less. Until now, we do not find the reason for this. However, this result means that the exponential discretization provides a one-bit gain.
of recognition of photos from the database ImageNet [2]. The test set of photos of this database contains 50 000 pictures corresponding to 1000 classes. However, to reduce the computational complexity we used 1000 random images only and measured the classification accuracy prior and after discretization. Such tests allowed us to draw general conclusions about how the algorithm works.
Keras Framework (tensorflow backend), namely VGG16, MobileNet-v2, Inception-v3, Xception, and ResNet-50 [11]. The work was done in the python language and the discretization procedure used the methods from the standard library numpy only.
found evident regularities in the distributions of the weights. For example, most frequently, the weight distributions are Gaussian or Laplacian and they are symmetrical with respect to zero. Probably this is a result of implementation of the regularizations L1 or L2 in the course of the training.
them, because on the one hand, their number is not so large, and on the other hand, frequently their distributions differ from the Gaussian or Laplacian distributions.
percentage of the neural network will be discretized. For this purpose, we simply calculate the ratio of the weights we have to discretize to the total number of the weights. After excluding all the “unfitted” weights, we obtained the mobilenet-v2 neural network discretized by 95.7%. Other neural networks (Xception, Inceptionv3, ResNet-50) are discretized by 99%.
Fig. 6. Approximate code for discretization of all neural network layers. As input argument
should be paid to the next-to-last row. The point is that the maximization of the correlation is not the only one criterion of accurate discretization. We can multiply the discretized numbers by an arbitrary number without changing the correlation value. This was the reason why in our studies we additionally multiplied the discretized weights so that their dispersion remained the same prior and after the discretization.
discretization provides practically the same accuracy as the full precision weights. For some VGG-16 and ResNet50 neural networks, we obtain the comparable quality even using 4 bit exponential discretization. The same cannot be said for the linear discretization.
cases two bits behind the exponential discretization (the VGG-16 is an example).
Fig. 7. Dependence of accuracy Top1 on number of bits of discretization for neural networks
Fig. 8. Dependence of accuracy Top5 on number of bits of discretization for neural networks
Table 3. Results of discretization of neural networks by means of exponential_ceil method
Table 4. Results of discretization neural networks by means of linear_mean method
neural network retraining. The method does not demand sufficient computational resources but it provides a good recognition accuracy with 6 bit and sometimes, even with 4 bit (for example, in the case of VGG-16 neural network) weights. Consequently, it allows us to “lighten” the neural network approximately 5-8 times and in the same time to hold the classification accuracy.
The work was supported by Russian Foundation for Basic Research (RFBR Project 18-07-00750).
1. Karen Simonyan, Andrew Zisserman, Very Deep Convolutional Networks for LargeScale Image Recognition https://arxiv.org/abs/1409.1556
2. ImageNet – huge image dataset http://www.image-net.org
3. Kryzhanovsky, B.V., Kryzhanovsky, M.V. & Malsagov, M.Y. Discretization of a matrix in quadratic functional binary optimization. Dokl. Math. (2011) 83: 413. https://doi.org/10.1134/S1064562411030197.
4. Kryzhanovsky M.V., Malsagov M.Yu. Modification of Binary Optimization Algorithm and Use Small Digit Capacity Numbers. "Optical Memory and Neural Networks (Information Optics)". Vol.20. №3. pp. 194–200, 2011. https://doi.org/10.3103/S1060992X11030052.
5. B.V. Kryzhanovskii, V.M. Kryzhanovskii, A.L. Mikaelyan. Application of the clipping procedure to the binary minimization of a quadratic functional. Doklady Mathematics, Volume 75, No2 , pp. 310-313, 2007. (ISSN 1064-5624)
6. Kryzhanovsky, B.V. & Kryzhanovsky, V.M. An accelerated procedure for solving binary optimization problems. J. Comput. Syst. Sci. Int. (2009) 48: 732. https://doi.org/10.1134/S1064230709050074
7. Chenzhuo Zhu, Song Han, Huizi Mao, William J. Dally, Trained Ternary Quantization https://arxiv.org/abs/1612.01064
8. Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients https://arxiv.org/pdf/1606.06160.pdf
9. Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2, 2015.
10. Jingyong Cai, Masashi Takemoto, and Hironori Nakajo. 2018. A Deep Look into of Model Parameters in Neural Networks. In The 10th International Conference on Advances in Information Technology (IAIT2018), December 10–13, 2018, Bangkok, Thailand. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3291280.3291800
11. Models for image classification with weights trained on ImageNet: https://keras.io/applications/