Joint Regularization on Activations and Weights for Efficient Neural Network Pruning
2019·arXiv
Abstract

With the rapid scaling up of deep neural networks (DNNs), extensive research on network model compression, such as weight pruning, has been performed to improve deployment efficiency. This work aims to advance compression beyond the weights to neuron activations. We propose the joint regularization technique, which simultaneously regulates the distribution of weights and activations. By distinguishing and leveraging the significance difference among neuron responses and connections during learning, the jointly pruned network, namely JPnet, optimizes the sparsity of activations and weights for improving execution efficiency. The derived deep sparsification of JPnet reveals more optimization space for existing DNN accelerators dedicated to sparse matrix operations. We thoroughly evaluate the effectiveness of joint regularization on various network models with different activation functions and on different datasets. With a 0.4% degradation constraint on inference accuracy, a JPnet can save 72.3% ∼ 98.8% of the computation cost compared to the original dense models, with up to 5.2× and 12.3× reductions in activation and weight numbers, respectively.

Index Terms: DNN compression, joint regularization, weight pruning, activation pruning.

Deep neural networks (DNNs) have demonstrated significant advantages in many real-world applications, such as image classification, object detection and speech recognition [1]–[3]. On the one hand, DNNs are developed to improve performance in these applications, which leads to intensive demands on data storage, communication and processing. On the other hand, ubiquitous intelligence promotes the deployment of DNNs in light-weight embedded systems that are equipped with only limited memory and computation resources. To reduce the model size while preserving performance quality, weight pruning has been widely explored. Weights with small values are taken as redundant parameters and removed with little impact on the model accuracy [4], [5]. Utilizing the zero-skipping technique [6] while computing on sparse weight parameters can further save computation energy. In addition, many specific DNN accelerators [7], [8] leverage the intrinsic sparse activation patterns of the rectified linear unit (ReLU) function. This approach, however, cannot be extended to activation functions that lack intrinsic zeros, e.g., leaky ReLU.

Although prior techniques achieved tremendous successes, merely focusing on the weights cannot lead to the best inference efficiency, the crucial metric in DNN deployment, for the following three reasons. First, existing weight pruning methods reduce the fully-connected (fc) layer size dramatically, while a systematic method to achieve a comparable compression rate for convolution (conv) layers is lacking. The conv layers account for most of the computation cost and dominate the inference time in DNNs, whose performance is usually bounded by computation instead of memory accesses [9], [10]. The most essential challenge of speeding up DNNs is to minimize the computation cost, i.e., the intensive multiply-and-accumulate operations (MACs). Second, the weights and activations together determine the performance of a network. Our experiments show that the zero-activation percentage obtained by ReLU decreases after applying weight pruning [6]. Such a deterioration in activation sparsity could potentially eliminate the advantage of the aforementioned accelerator designs. Third, the activation in DNNs is not strictly limited to ReLU. Non-ReLU activation functions, such as leaky ReLU and sigmoid, do not have intrinsic zero-activation patterns.

In this work, we propose the joint regularization technique to minimize the computation cost of DNNs by pruning both weights and activations. Unlike the naïve solution of pruning weights and activations in sequence, joint regularization is an end-to-end solution that simultaneously learns the sparse connections and neuron responses. Dynamic activation masks and static weight masks are learned at the same time with the joint regularization. By learning the different importance of neuron responses and connections, the jointly pruned network, namely JPnet, can achieve a balance between activations and weights and therefore further improve execution efficiency. Moreover, the JPnet not only stretches the intrinsic activation sparsity of ReLU, but also serves as a generic solution for other activation functions, such as leaky ReLU. Our experiments on various network models with different activation functions and on different datasets show substantial reductions in MACs by the JPnet. Compared to the original dense models, JPnet can obtain up to a 5.2× activation compression rate and a 12.3× weight compression rate, and eliminate 72.3% ∼ 98.8% of MACs. Compared with merely adopting weight pruning [4], JPnet can further reduce the computation cost by 1.3× ∼ 10.5× in our experiments.

Weight pruning has emerged as an effective compression technique for reducing the model size and computation cost of neural networks. A common approach to pruning the redundant weights in DNNs is to include an extra regularization term (e.g., the ℓ1/ℓ2 regularization) in the loss function [5], [11] to constrain the weight distribution. The weights below a heuristic threshold are then removed. Afterwards, a certain number of finetuning epochs are applied to recover the accuracy loss induced by the pruning. In practice, the direct pruning and finetuning stages can be carried out iteratively to gradually achieve the optimal tradeoff between the model compression rate and accuracy. To avoid erroneous removal of important weights in the naïve pruning and finetuning approach, a dynamic compression method was proposed to recover those pruned weights whose expected updates are larger than an empirical threshold in each training iteration [12]. Rather than using ℓ1/ℓ2 regularization to constrain the weight magnitude and distribution, ℓ0 regularization can be adopted as a stochastic binary mask on weights, which was shown to produce a higher sparsification level [13]. These regularization-based weight pruning approaches demonstrated high effectiveness, especially for fc layers [4]. However, these methods are heuristic and lack theoretical guarantees for convergence and compression performance. With theoretical grounding, sparse variational dropout can be utilized on individual weights to realize all possible dropout rates [14], [15]. The objective of weight pruning can also be transformed into a non-convex optimization problem that is mathematically solvable using the alternating direction method of multipliers (ADMM) [16]. Again, finetuning is needed to recover the accuracy drop of the sparsified model obtained by ADMM.

Removing the redundant weights in structured forms, e.g., filters and filter channels, has also been widely investigated. For example, structured pruning [17] applies group lasso regularization on weight groups in a variety of self-defined shapes and sizes. In [18], the rankings of filters are indicated by the first-order Taylor series expansion of the loss function on feature maps. The low-ranked filters are then removed. The filter ranking can also be represented by the root mean square or the sum of absolute values of the filter weights [19], [20]. A theoretical view on the importance of neurons/filters can be derived from the perspective of the variational information bottleneck, which minimizes the mutual information between layers [21]. The structured pruning methods do not require dedicated support for random sparse matrices and thus are hardware-friendly for conventional computation platforms. However, these methods seldom achieve a weight compression rate as high as the element-wise pruning methods.

Activation sparsity has been utilized in DNN accelerator designs. The activation sparsity originating from ReLU accelerates DNN inference with reduced off-chip memory access and computation cost [7], [8], [22]. A simple technique to improve activation sparsity was explored by zeroing out small activations [7]. However, the increase in activation sparsity is very limited because of the concern of accuracy loss. Moreover, these works relied heavily on the zero activations of ReLU, which cannot be extended to other activation functions. Dropout-based methods were proposed to regulate activation sparsity and obtain sparse feature representations [23], [24]. These techniques incur essential model modifications, e.g., adding a binary belief network overlaid on the original model. Some other studies were dedicated to feature map pruning in conv layers by learning to recognize and remove redundant channels [25], [26]. Our proposed joint regularization is an orthogonal technique to feature map pruning, dealing with activation redundancy at a much finer granularity, i.e., element-wise.

Generally, model size compression is the main focus of weight pruning, while the regulation of activation sparsification focuses more on the intrinsic activation sparsity of ReLU or on exploiting sparse activations in DNN training for better model generalization. In contrast, our proposed joint regularization aims to reduce the DNN computation cost and accelerate inference by simultaneously optimizing weight pruning and activation sparsification.

A. Joint Regularization

For an L-layer neural network represented by the weight set S_W = {W_i : i = 1, ..., L}, where W_i denotes the weights of layer i, and given the dataset {X, Y}, S_W is learned to minimize the loss function as follows:

image

where {x_k, y_k} is a sampled input-output pair from {X, Y}, and n is the minibatch size. The nonlinear relationship of the network is modeled as D(·). The cross-entropy is usually adopted as the function L(·) for multi-class problems. For the common weight pruning techniques, the optimization problem extends the loss function in Equation (1) with a regularization term on S_W as

image

R_W(·) can be configured as the ℓ0/ℓ1/ℓ2 regularization on weights with a strength α. R_W(·) focuses on the optimal weight compression, whereas the computation A_{i−1} · W_i in layer i is determined by both the activation A_{i−1} from the previous layer and the weights W_i of layer i. We propose joint regularization on both weights and activations to minimize the computation cost and thereby optimize the execution efficiency. Overall, the loss function is represented as:

image

where S_A = {A_i : i = 1, ..., L−1}. A_L is the model output, which is not included in the activation regularization R_A(·). It is inappropriate to apply ℓ1/ℓ2 regularization to activations, as the regularization may constrain the activation magnitude and hinder the feature learning process. Hence, we propose to adopt ℓ0 regularization, which minimizes the number of effective activations without disturbing their magnitudes. More specifically, for each layer i, a binary mask T_i acting as an information filter is designed for the original activations A_{orig,i} such that

image

where ⊙ is the element-wise multiplication. The ℓ0 optimization problem on activations is therefore transformed into the derivation of the optimal mask set S_T = {T_i : i = 1, ..., L−1}.
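As a compact summary, the definitions in this subsection are consistent with a formulation along the following lines. This is only a sketch under assumptions: the exact form of each objective, the symbol β for the strength of the activation regularization, and the name A_{m,i} for the masked activations of layer i (borrowed from the notation used later for winners) are not confirmed by the text here.

```latex
\begin{align}
\min_{S_W}\; & \frac{1}{n}\sum_{k=1}^{n} \mathcal{L}\bigl(y_k,\, D(x_k, S_W)\bigr) \\
\min_{S_W}\; & \frac{1}{n}\sum_{k=1}^{n} \mathcal{L}\bigl(y_k,\, D(x_k, S_W)\bigr)
              + \alpha\, R_W(S_W) \\
\min_{S_W}\; & \frac{1}{n}\sum_{k=1}^{n} \mathcal{L}\bigl(y_k,\, D(x_k, S_W)\bigr)
              + \alpha\, R_W(S_W) + \beta\, R_A(S_A) \\
A_{m,i} \;=\; & A_{\mathrm{orig},i} \odot T_i, \qquad T_i \in \{0,1\}^{|A_{\mathrm{orig},i}|}
\end{align}
```

Read this way, each objective extends the previous one by a single term, and the last line replaces the non-differentiable count of active neurons by the choice of a binary mask per layer.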

B. Joint Pruning Procedure

When implementing the joint regularization, we choose the ℓ1 regularization on weight distribution for its ease of gradient derivation during training. After combining with S_T for ℓ0 regularization on activations, the loss function in Equation (3) can be rewritten as:

image

To overcome the non-differentiability of the ℓ0 regularization, we adopt a deterministic solution to obtain the proper S_T. Just as small weights in layers are learned to be pruned, activations with small magnitudes are taken as unimportant and masked out to further minimize inter-layer connections. Considering that neurons are activated in various patterns according to different input classes, we propose dynamic masks for the activation pruning. This is different from the static masks in the weight pruning.

The activations selected by mask T_i are denoted as winners. To derive the activation a_{m,j} ∈ A_{m,i} based on a_{orig,j} ∈ A_{orig,i}, we have:

image

Here the winners are dynamically determined at run-time according to the winner rate per layer. The determination of winners through the activation mask is a relaxed partial sorting problem that finds the top-k arguments in an array. The winner rate of layer i is defined as:

image

where |A_{m,i}| and |A_{orig,i}| respectively denote the number of winners selected by T_i and the number of original activations. Usually, each layer features a unique optimal winner rate. To get the appropriate winner rate per layer, the model with configurable activation masks is tested on a validation set sampled from the training set. As verified by our experiments, the size of the validation set can be similar to that of the test set. The accuracy drop is taken as the indicator of the model sensitivity to the winner rate setting. The (winner rate)_i is set empirically according to the tolerable accuracy loss. Examples of activation sensitivity analysis will be presented in Section V. After deriving the winner rates, dynamic activation masks are configured as illustrated in Fig. 1.

Fig. 1. Working principle of joint pruning.
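The run-time winner selection can be illustrated with a short NumPy sketch. It assumes that importance is ranked by absolute value (so that negative leaky-ReLU responses can also be winners) and that k is obtained by simple rounding; `winner_mask` and `apply_mask` are hypothetical helper names, not the authors' implementation.

```python
import numpy as np

def winner_mask(a_orig, winner_rate):
    """Binary mask T keeping the top-k activations by magnitude (assumed criterion)."""
    flat = np.abs(a_orig).ravel()
    k = int(round(winner_rate * flat.size))
    mask = np.zeros(flat.size, dtype=bool)
    if k > 0:
        # argpartition performs a partial sort: the indices of the k largest
        # magnitudes end up in the last k positions, in no particular order.
        winners = np.argpartition(flat, flat.size - k)[flat.size - k:]
        mask[winners] = True
    return mask.reshape(a_orig.shape)

def apply_mask(a_orig, winner_rate):
    """Return the masked activations A_m = A_orig * T together with T."""
    t = winner_mask(a_orig, winner_rate)
    return a_orig * t, t

# One input sample of a layer with six neurons; half of them are kept.
a = np.array([0.9, -0.05, 0.3, -1.2, 0.01, 0.4])
a_m, t = apply_mask(a, winner_rate=0.5)
rate = np.count_nonzero(t) / t.size  # realized winner rate |A_m| / |A_orig|
```

Because the mask is recomputed for every input, different samples can keep different neurons, which is what distinguishes it from a static weight mask.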

To understand the working scheme of the optimization problem defined by Equation (5), we focus on the operation of a single layer i:

image

where f_i(·) represents the function of layer i. In the backpropagation phase, the partial derivative of the loss function with respect to A_{orig,i} is propagated backwards:

image

The term ∂A_{m,i−1}/∂A_{orig,i−1} is equal to T_{i−1}, which means the backpropagation process is masked in the same way as the forward propagation. Thereafter, only the activated neurons will be updated. For the weight updating in a finetuning iteration, a small decay is applied according to the setting of the ℓ1 regularization on weights. Those weights smaller than the empirical threshold are pruned out.
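The masking of both passes and the weight update can be sketched in NumPy as follows. This is a minimal illustration under assumptions: the function names are hypothetical, plain gradient descent stands in for the actual optimizer, and the ℓ1 term is applied through its subgradient α·sign(w).

```python
import numpy as np

def masked_forward(a_orig, t):
    # Forward pass: A_m = A_orig * T (element-wise)
    return a_orig * t

def masked_backward(grad_a_m, t):
    # Backward pass: dLoss/dA_orig = dLoss/dA_m * T, because dA_m/dA_orig = T
    return grad_a_m * t

def l1_step_and_prune(w, grad_w, lr, alpha, threshold):
    # Gradient step including the l1 subgradient, followed by magnitude pruning
    w = w - lr * (grad_w + alpha * np.sign(w))
    w_mask = np.abs(w) >= threshold
    return w * w_mask, w_mask

a_orig = np.array([2.0, -0.1, 0.5])
t = np.array([1.0, 0.0, 1.0])            # the second neuron is not a winner
a_m = masked_forward(a_orig, t)
grad_a_orig = masked_backward(np.array([0.3, 0.7, -0.2]), t)

w = np.array([0.5, 0.02, -0.3])
w_new, w_mask = l1_step_and_prune(w, grad_w=np.zeros(3), lr=0.1,
                                  alpha=0.1, threshold=0.05)
```

In this toy case the gradient reaching the masked neuron is zero, and the small weight decays below the threshold and is removed while the two larger weights shrink only slightly.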

As summarized in Fig. 1, the proposed end-to-end joint pruning approach consists of three steps. First, the significance of activations per layer is analyzed to determine the winner rates and define the pruning strength of each dynamic activation mask. Afterwards, the regularizations on both weights and activations are applied for the following finetuning stage. With the joint regulating force of ||S_W||_1 and the activation masks S_T as defined in Equation (5), weights and activations are co-trained to obtain deep sparsification. Through finetuning, the generated model is jointly optimized by dynamic sparse activation patterns and static compressed weights.

C. Optimizer and Learning Rate

We start the pruning process with several warm-up finetuning epochs to obtain the preliminary sparse patterns in both weights and activations with joint regularization. The same optimizer used for training the original model is adopted. The learning rate is set to 0.1× ∼ 0.01× of the original setting. Our experiments show that Adadelta [27] usually brings the best performance in the pruning process following the warm-up finetuning, especially for deeply sparsified activations. Adadelta adapts the learning rate for each individual weight parameter. Smaller updates are performed on neurons associated with more frequently occurring activations, whereas larger updates are applied to infrequently activated neurons. Hence, Adadelta is beneficial for sparse weight updates, which commonly occur in our joint pruning method. During finetuning, only a small portion of weight parameters are updated because of the combination of sparse patterns in weights and activations. The learning rate for Adadelta is recommended to be reduced to 0.1× ∼ 0.01× of the setting used for training the original model.
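To make the per-parameter adaptation concrete, the following is a minimal NumPy sketch of one Adadelta update in its standard form. The global `lr` multiplier follows the convention of common framework implementations; the values `lr=0.1`, `rho=0.95` and `eps=1e-6` are illustrative assumptions, not settings reported here.

```python
import numpy as np

def adadelta_step(x, g, eg2, edx2, lr=0.1, rho=0.95, eps=1e-6):
    """One Adadelta update. eg2 and edx2 are the running averages of
    squared gradients and squared updates, kept per parameter."""
    eg2 = rho * eg2 + (1.0 - rho) * g ** 2
    dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * g
    edx2 = rho * edx2 + (1.0 - rho) * dx ** 2
    return x + lr * dx, eg2, edx2

# Two parameters: the second receives no gradient, e.g. because the
# activation it connects to was masked out in this iteration.
x = np.array([1.0, 1.0])
g = np.array([0.5, 0.0])
x_new, eg2, edx2 = adadelta_step(x, g, np.zeros(2), np.zeros(2))
```

A parameter whose gradient is zero is left untouched and its running averages simply decay, so the step size of each weight reflects only its own gradient history.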

D. Reconcile Dropout and Activation Pruning

In DNN training, a dropout layer is commonly added after large fc layers to avoid over-fitting. The neuron activations are randomly chosen in the feedforward phase, and weight updates are applied only to the neurons associated with the selected activations in the backpropagation phase. Thus, a random partition of the weight parameters is updated in each training iteration. Similarly, the activation mask selects only a small portion of activated neurons and realizes sparse weight updates. However, over-fitting is still prone to happen because the selected neurons with winner activations are always kept and updated. Thus the random dropout layer is still needed. In fc layers, the number of remaining activated neurons is reduced to |A_{m,i}| from |A_{orig,i}| as defined in Equation (7). Similar to [4], which deals with sparse fc layer training, the dropout layer connected after the activation mask is suggested to be modified with the setting:

image

where the constant C_d is the dropout rate in the training process for the original models. The activation winner rate is introduced to modify the dropout strength to balance over-fitting and under-fitting. The dropout layers are directly removed in the inference stage.

E. Winner Prediction in Activation Pruning

The dynamic activation pruning method increases the activation sparsity and maintains the model accuracy. As mentioned above, the determination of A_{m,i} through the activation mask is a relaxed partial sorting problem. According to the Master Theorem [28], partial sorting can be solved in linear time O(N) on average through recursive algorithms, where N is the number of elements to be partitioned. To speed this up further, A_{m,i} can be predicted based on a down-sampled activation set. A threshold θ is derived by separating the top-εk elements from the down-sampled activation set comprising εN elements, with ε as the down-sampling rate. Then θ is applied to derive a_{m,j} ∈ A_{m,i} from a_{orig,j} ∈ A_{orig,i} as follows:

image
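The threshold prediction can be sketched in NumPy as below. The sketch assumes uniform random sampling without replacement, a comparison on magnitudes, and a non-strict inequality against θ; `predict_threshold` and `prune_with_threshold` are hypothetical names introduced for illustration.

```python
import numpy as np

def predict_threshold(a_orig, winner_rate, eps, rng):
    """Estimate theta from a down-sampled set of eps*N activations."""
    flat = np.abs(a_orig).ravel()
    n_s = max(1, int(round(eps * flat.size)))          # eps * N samples
    sample = rng.choice(flat, size=n_s, replace=False)
    k_s = min(n_s, max(1, int(round(winner_rate * n_s))))  # eps * k winners
    # theta is the k_s-th largest magnitude within the sample
    return np.partition(sample, n_s - k_s)[n_s - k_s]

def prune_with_threshold(a_orig, theta):
    """Keep activations whose magnitude reaches theta, zero out the rest."""
    return np.where(np.abs(a_orig) >= theta, a_orig, 0.0)

rng = np.random.default_rng(0)
a = rng.standard_normal(10000)          # stand-in for one layer's activations
theta = predict_threshold(a, winner_rate=0.3, eps=0.1, rng=rng)
a_m = prune_with_threshold(a, theta)
kept = np.count_nonzero(a_m) / a.size   # close to, but not exactly, 0.3
```

The partial sort now runs on εN elements instead of N, and the full activation set needs only one comparison per element. The price is that the realized winner rate only approximates the target.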

We evaluate the joint regularization on various models, ranging from a multi-layer perceptron (MLP) to deep neural networks (DNNs), on three datasets: MNIST, CIFAR-10 and ImageNet (Table I). In ResNet-50 [1] and wide ResNet-32 [29], conv layers account for more than 99% of the computation cost and are our focus. All the evaluations are implemented in TensorFlow.

Fig. 2. Comparison between WP and JP.

A. Overall Performance

The compression results of JPnets on activations, weights and MACs are summarized in Table I. Our method can learn both sparse activations and sparse weights. Compared to the original dense models, JPnets achieve a 1.4× ∼ 5.2× activation compression rate and a 1.6× ∼ 12.3× weight compression rate. As such, JPnets execute only 1.2% ∼ 27.7% of the MACs required in dense models. The accuracy drop is kept below 0.4%, and in some cases the JPnets achieve even better accuracy (e.g., MLP-3, AlexNet and ResNet-50).

The ReLU function in MLP-3, Lenet-4, ConvNet-5, AlexNet and ResNet-50 brings intrinsic zero activations. However, our experiment results in Fig. 2(a) show that the non-zero activation percentage in the weight-pruned (WP) models tends to increase compared to the original dense models. This increase undermines the benefit of weight pruning. Our proposed JP method can remedy the activation sparsity loss in WP models and remove 7.7% ∼ 22.5% more activations even compared to the original dense models. We observe the largest activation removal in ResNet-32, which uses leaky ReLU as its activation function. As leaky ReLU does not provide intrinsic zero activations, the WP model of ResNet-32 cannot benefit from activation sparsity. In contrast, the JPnet in this work can remove 69.2% of activations and reduce an additional 25% of MAC operations compared to the WP model. As shown in Fig. 2(b), JPnets decrease the MAC operations to 1.2% ∼ 27.7%. This is a 1.3× ∼ 10.5× improvement compared to WP models. More details on model configuration and analysis are presented in the following subsections.

B. MNIST and CIFAR-10

TABLE I SUMMARY OF JPNETS

TABLE II MLP-3 ON MNIST

TABLE III LENET-4 ON MNIST

The MLP-3 on MNIST has two hidden layers with 300 and 100 neurons respectively. The model configuration details are summarized in Table II. The non-zero activation percentage (Acti %) per layer indicates the pruning strength on activations before reaching the next layer. The number of MACs is calculated with a batch size of 1. The same setting is applied to the analysis of other models.
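The MAC accounting for fc layers at batch size 1 can be sketched as follows. The dense counts follow from the layer sizes alone (784 MNIST input pixels, hidden layers of 300 and 100 neurons, 10 classes). The sparsity fractions are purely hypothetical placeholders, not the values of Table II, and the product rule assumes that non-zero activations and non-zero weights are distributed independently.

```python
def dense_fc_macs(layer_sizes):
    """MACs per fc layer for one input sample: n_in * n_out."""
    return [n_in * n_out
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def effective_macs(dense, act_frac, weight_density):
    """Estimated MACs that touch a non-zero input activation AND a
    non-zero weight, assuming the two sparsity patterns are independent."""
    return [m * a * w for m, a, w in zip(dense, act_frac, weight_density)]

sizes = [784, 300, 100, 10]           # MLP-3 on MNIST
dense = dense_fc_macs(sizes)          # [235200, 30000, 1000]

# Hypothetical per-layer fractions, for illustration only.
act_frac = [1.0, 0.2, 0.2]            # non-zero share of each layer's input
weight_density = [0.1, 0.1, 0.2]      # share of weights kept in each layer
eff = effective_macs(dense, act_frac, weight_density)
reduction = sum(dense) / sum(eff)     # overall MAC reduction factor
```

Since the first layer dominates the dense count, its weight density and input sparsity largely determine the overall reduction for such a small model.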

The MLP-3 is successfully compressed 10× and only 17.1% of activations are kept. The total number of MACs is reduced to merely 3.65% (27.4×) without compromising the model accuracy at all. A higher computation reduction rate is achieved by a Lenet-4 model which comprises two conv layers and two fc layers (Table III). The JPnet for Lenet-4 reduces the computation cost 83.3× with only 5.5% of activations and 8.1% of weights retained.

To analyze and understand the effectiveness of dynamic activation masks, we take the example of the activation patterns from layer fc2 in MLP-3. Before starting the joint pruning, the activation distribution for all MNIST digits is visualized in Fig. 3, which clearly shows that digits 0−9 activate different regions in fc2. The observation implies that it is impossible to design a static activation mask that achieves a sparsification effectiveness comparable to its dynamic counterpart. We name the neuron featuring the maximum activation for each input the top neuron. Fig. 4 compares the number of activated top neurons for all digits by observing the training set before and after applying joint pruning. The results show that the JPnet needs fewer top neurons and generates a sparser feature representation.

We also apply joint regularization to ConvNet-5 on the CIFAR-10 dataset. The accuracy of the original dense model is 86.0%. As detailed in Table IV, the JPnet for ConvNet-5 needs only 27.7% of total MACs compared to the dense model by pruning 59.6% of weights and 56.4% of activations. The JPnet incurs only a 0.1% accuracy drop. The conv layers account for more than 80% of total MACs and dominate the computation cost.

Fig. 3. The activation distribution of fc2 in MLP-3 for all digits. The activation is obtained by averaging over all data examples per digit class.

Fig. 4. The number of activated top neurons for all digits.

C. ImageNet

We use the ImageNet ILSVRC-2012 dataset to evaluate the joint pruning method on large datasets. ImageNet consists of about 1.2M training images and 50K validation images. AlexNet and ResNet-50 are adopted.

The AlexNet comprises 5 conv layers and 3 fc layers and achieves 57.22% top-1 accuracy on the validation set. Similar to ConvNet-5, the computation bottleneck of AlexNet emerges in the conv layers, which account for more than 90% of total MACs. As shown in Table V, deeper layers present larger pruning strength on weights and activations due to the high-level feature abstraction of input images. For example, the MACs of conv5 can be reduced 18.2×, while only a 1.2× reduction rate is realized in conv1. In total, applying joint pruning removes 81.1% of weights and 62.1% of activations, inducing a 4× reduction in effective MACs. The weight and computation cost decomposition is shown in Fig. 5. The fc layers contribute the majority of the model size and are generally pruned with larger strength than conv layers to realize a significant model compression rate. In contrast, the computation cost reduction mainly comes from the optimization in conv layers, as depicted in Fig. 5(b).

TABLE IV CONVNET-5 ON CIFAR-10

TABLE V ALEXNET ON IMAGENET

Fig. 5. The decomposition of weight and computation cost of AlexNet. The overall model size and computation cost are normalized here.

To reach higher accuracy, DNNs are getting deeper, with tens to hundreds of conv layers. We deploy the joint regularization on ResNet-50 and summarize the detailed results in Table VI. Consisting of 1 conv layer, 4 residual units and 1 fc layer, the ResNet-50 model achieves a 75.6% accuracy on the ImageNet ILSVRC-2012 dataset. In each residual unit, several residual blocks equipped with bottleneck layers are stacked. The filter numbers in residual units increase rapidly, and so does the weight amount, as shown in the table. An average pooling layer is connected before the last fc layer to reduce the feature dimension. Overall, conv layers contribute the majority of weights and computation. The JPnet for ResNet-50 achieves a 75.7% accuracy, which is 0.1% higher than the original model. Only 19.1% of MACs are retained in the JPnet, with a 5.65× activation reduction and a 1.6× weight compression.

D. Prune Activation without Intrinsic Zeros

For the networks discussed above, joint regularization stretches the sparsity level of the ReLU activation. In the following, we validate the idea on an activation function without intrinsic sparse patterns, i.e., leaky ReLU. Table VII shows our results for ResNet-32. The model consists of 1 conv layer, 3 stacked residual units and 1 fc layer. Each residual unit contains 5 consecutive residual blocks. Compared to the conv layers, the last fc layer is negligible in terms of weight volume and computation cost. The original model has a 95.0% accuracy on the CIFAR-10 dataset with 7.34G MACs per image. As its activation function is leaky ReLU, zero activations rarely occur in the original and WP models. After applying joint pruning, the activation percentage can be dramatically reduced to 30.8%. As shown in Table VII, the JPnet keeps 32.3% of weight parameters, while only 11.5% of MACs are required in execution. The accuracy drop is merely 0.4%.

TABLE VI RESNET-50 ON IMAGENET.

Fig. 6(a) demonstrates the activation distribution of the first residual block in the original model by randomly selecting 500 images from the training set. The distribution gathers near zero with long tails towards both positive and negative directions. For comparison, the activation distribution after joint pruning is shown in Fig. 6(b), in which activations near zero are pruned out. In addition, the kept activations are trained to be stronger with larger magnitude, which is consistent with the phenomenon that the non-zero activation percentage increases in WP models as illustrated in Fig. 2(a).

A. Comparison with Weight Pruning

In Table VIII, we compare the joint pruning method with the state-of-the-art weight pruning methods, including ℓ0 regularization (L0) [13], variational information bottleneck (VIB) [21], variational dropout (VD) [14], dynamic network surgery (DNS) [12] and the non-convex problem optimization method (ADMM) [16]. For the Lenet-4, our method achieves the best reduction rate on computation cost with similar inference error compared to the others. While VIB and VD provide comparable reduction results on computation cost by merely focusing on weight pruning, the computational complexity during training hinders their application in DNNs for large datasets, e.g., ImageNet. Joint pruning can be easily applied to large models as shown in Section IV-C. Compared with DNS and ADMM, we obtain the minimum prediction error with a comparable reduction rate on computation cost.

Fig. 6. Activation distribution of ResNet-32.

TABLE VIII COMPARISON WITH THE STATE-OF-THE-ART WEIGHT PRUNING METHODS.

For AlexNet, we focus on conv layers which are the computation bottleneck for inference. The top-5 prediction error is reported in the table.

B. Comparison with Static Activation Pruning

Static activation pruning has been widely adopted in efficient DNN accelerator designs [7], [8]. By selecting a proper static threshold θ in Equation (11), more activations can be pruned with little impact on model accuracy. For the activation pruning in joint pruning, the threshold is dynamic, following the winner rate and the activation distribution of each layer. The comparison between static and dynamic pruning is conducted on ResNet-32 for the CIFAR-10 dataset. For the static pruning setup, the θ for leaky ReLU is assigned in the range of [0.07, 0.14], which brings different activation sparsity patterns.

As shown by the results for leaky ReLU with a static threshold in Fig. 7, the accuracy starts to drop rapidly when the non-zero activation percentage is less than 58.6% (θ = 0.08). Using dynamic activation masks, a better accuracy can be obtained under the same activation sparsity constraint. Finetuning the model using dynamic activation masks dramatically recovers the accuracy loss. As shown by our experiment in Section IV-D, the JPnet for ResNet-32 can be finetuned to eliminate the 10.4% accuracy drop caused by the static activation pruning.

C. Activation Analysis

In weight pruning, the applicable pruning strength varies by layer [4], [18]. Similarly, a pruning sensitivity analysis is required to determine the proper activation pruning strength layer-wise, i.e., the activation winner rate per layer. Fig. 8(a) shows the relation between the JPnet accuracy drop and the selection of winner rates for AlexNet before pruning. It can be seen that the accuracy drops sharply when the activation winner rate of conv1 is less than 0.3, while setting the winner rate of conv5 to 0.1 does not affect accuracy. This implies that deeper conv layers can support sparser activations. The unit-wise analysis results for ResNet-32 are shown in Fig. 8(b), which shows a trend of activation pruning sensitivity similar to AlexNet: conv1 is most susceptible to the activation pruning. The accuracy of ResNet-32 drops quickly as the winner rate decreases, indicating a high sensitivity. As verified by the experiments in Section IV, the accuracy loss can be well recovered by finetuning with proper activation winner rates.

Fig. 7. Comparison to static activation pruning for ResNet-32.

Fig. 8. Activation pruning sensitivity.

D. Speedup from Dynamic Activation Pruning

The speedup for fc layers with dynamic activation pruning can be easily observed even without specific support for sparse matrix operations. After activation pruning, the weight matrix in fc layers can be condensed by removing all connections related to the pruned activations, and the resulting compact weight matrix speeds up inference. Table IX shows the experiment results for AlexNet's 3 fc layers, implemented in TensorFlow and run on an Intel i7-7700HQ CPU. The activation percentage listed here is the winner rate for the input activations. There is no accuracy loss after finetuning with these winner rate settings. The batch size is set to 1 in the test, which is the typical scenario in real-time applications on edge devices. The experiment obtains a 1.95× ∼ 3.65× speedup. The time spent on activation pruning to get the winner activations accounts for a very small portion of the time spent on the original densely connected layers.
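The condensing step can be illustrated with a small NumPy sketch. It assumes a weight matrix stored as (inputs × outputs) and a single input vector (batch size 1); `condensed_fc` is a hypothetical name and the layer sizes are toy values, not those of AlexNet.

```python
import numpy as np

def condensed_fc(w, a_m):
    """Compute a_m @ w using only the rows of w whose input activation
    survived pruning; the skipped rows would be multiplied by zero."""
    idx = np.flatnonzero(a_m)           # indices of the winner activations
    return a_m[idx] @ w[idx, :]         # compact (k x n_out) product

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4))         # toy fc layer: 8 inputs, 4 outputs
a = rng.standard_normal(8)

# Keep the 3 largest-magnitude inputs as winners, zero out the rest.
theta = np.sort(np.abs(a))[-3]
a_m = np.where(np.abs(a) >= theta, a, 0.0)

y_sparse = condensed_fc(w, a_m)
y_dense = a_m @ w                       # reference result on the full matrix
```

Both products give the same output, but the condensed one touches only k of the input rows, so an ordinary dense matrix routine already benefits from the activation sparsity.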

E. Activation Threshold Prediction

As discussed in Section III-E, the process of selecting activation winners can be accelerated by threshold prediction on a down-sampled activation set. We apply different down-sampling rates on the JPnet for AlexNet. As can be seen in Fig. 9, layer conv1 is most vulnerable to threshold prediction. From the overall results for AlexNet, it is practical to down-sample 10% (ε = 0.1) of activations for activation threshold prediction while keeping the accuracy drop below 0.5%.

TABLE IX SPEEDUP TEST FOR FC LAYERS IN ALEXNET.

Fig. 9. The effects of threshold prediction.

To minimize the computation cost in DNNs, joint regularization integrating weight pruning and activation pruning is proposed in this paper. The experiment results on various models for the MNIST, CIFAR-10 and ImageNet datasets have demonstrated considerable computation cost reduction. In total, a 1.4× ∼ 5.2× activation compression rate and a 1.6× ∼ 12.3× weight compression rate are obtained. Only 1.2% ∼ 27.7% of MACs are left with marginal effects on model accuracy, which outperforms weight pruning by 1.3× ∼ 10.5×. The JPnets are targeted at dedicated DNN accelerators with efficient sparse matrix storage and computation units on chip. The JPnets, featuring compressed model size and reduced computation cost, will meet the constraints of memory space and computing resources in embedded systems.

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.

[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.

[3] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International conference on machine learning, 2016.

[4] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, 2015.

[5] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey, “Faster cnns with direct sparse convolutions and guided pruning,” arXiv preprint arXiv:1608.01409, 2016.

[6] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in Computer Architecture, ACM/IEEE International Symposium on. IEEE, 2016.

[7] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” in ACM SIGARCH Computer Architecture News. IEEE Press, 2016.

[8] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” in ACM SIGARCH Computer Architecture News. IEEE Press, 2016.

[9] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Computer Architecture, ACM/IEEE International Symposium on. IEEE, 2017.

[10] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015.

[11] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[12] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” in Advances In Neural Information Processing Systems, 2016.

[13] C. Louizos, M. Welling, and D. P. Kingma, “Learning sparse neural networks through L0 regularization,” arXiv preprint arXiv:1712.01312, 2017.

[14] D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017.

[15] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov, “Structured bayesian pruning via log-normal multiplicative noise,” in Advances in Neural Information Processing Systems, 2017.

[16] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang, “A systematic dnn weight pruning framework using alternating direction method of multipliers,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[17] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016.

[18] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” arXiv preprint arXiv:1611.06440, 2016.

[19] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally, “Exploring the granularity of sparsity in convolutional neural networks,” IEEE CVPRW, 2017.

[20] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke, “Scalpel: Customizing dnn pruning to the underlying hardware parallelism,” in ACM SIGARCH Computer Architecture News. ACM, 2017.

[21] B. Dai, C. Zhu, and D. Wipf, “Compressing neural networks using the variational information bottleneck,” arXiv preprint arXiv:1802.10399, 2018.

[22] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in ACM SIGARCH Computer Architecture News. IEEE Press, 2016.

[23] J. Ba and B. Frey, “Adaptive dropout for training deep neural networks,” in Advances in Neural Information Processing Systems, 2013.

[24] A. Makhzani and B. J. Frey, “Winner-take-all autoencoders,” in Advances in Neural Information Processing Systems, 2015.

[25] X. Gao, Y. Zhao, Ł. Dudziak, R. Mullins, and C.-Z. Xu, “Dynamic channel pruning: Feature boosting and suppression,” in International Conference on Learning Representations, 2019.

[26] J. Ye, X. Lu, Z. Lin, and J. Z. Wang, “Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers,” arXiv preprint arXiv:1802.00124, 2018.

[27] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[28] J. L. Bentley, D. Haken, and J. B. Saxe, “A general method for solving divide-and-conquer recurrences,” ACM SIGACT News, 1980.

[29] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.

