Filter Sketch for Network Pruning

2020·Arxiv

Abstract

We propose a novel network pruning approach that preserves the information of pre-trained network weights (filters). Network pruning with information preserving is formulated as a matrix sketch problem, which is efficiently solved by the off-the-shelf Frequent Direction method. Our approach, referred to as FilterSketch, encodes the second-order information of pre-trained weights, which enables the representation capacity of pruned networks to be recovered with a simple fine-tuning procedure. FilterSketch requires neither training from scratch nor data-driven iterative optimization, leading to a several-orders-of-magnitude reduction of time cost in the optimization of pruning. Experiments on CIFAR-10 show that FilterSketch reduces 63.3% of FLOPs and prunes 59.9% of network parameters with negligible accuracy cost for ResNet-110. On ILSVRC-2012, it reduces 45.5% of FLOPs and removes 43.0% of parameters with only 0.69% accuracy drop for ResNet-50. Our code and pruned models can be found at https://github.com/lmbxmu/FilterSketch.

Index Terms—network pruning, sketch, filter pruning, structured pruning, network compression & acceleration, information preserving

I. INTRODUCTION

DEEP convolutional neural networks (CNNs) typically result in significant memory requirements and computational cost, hindering their deployment on front-end systems of limited storage and computational power. Consequently, there is a growing need for reduction of model size by parameter quantization [1], [2], low-rank decomposition [3], [4], and network pruning [5], [6]. Early pruning works [7], [8] use unstructured methods to obtain irregular sparsity of filters. Recent works pay more attention to structured pruning [9], [10], [11], which simultaneously reduces model size and improves computational efficiency, facilitating model deployment on general-purpose hardware and/or usage of basic linear algebra subprograms (BLAS) libraries.

M. Lin, S. Li and R. Ji are with the Media Analytics and Computing Laboratory, Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen 361005, China. L. Cao (Corresponding Author) is with the Fujian Key Laboratory of Sensing and Computing for Smart City, Computer Science Department, School of Informatics, Xiamen University, Xiamen 361005, China (e-mail: caoliujuan@xmu.edu.cn). R. Ji is also with Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, China. Q. Ye is with School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China. Y. Tian is with the School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China. J. Liu is with the Noah’s Ark Lab, Huawei Technologies Co. Ltd., Shenzhen 518129, China. Q. Tian is with Cloud BU, Huawei Technologies Co. Ltd., Shenzhen 518129, China. Manuscript received April 19, 2005; revised August 26, 2015.

Existing structured pruning approaches can be classified into three categories: (1) Regularization-based pruning, which introduces a sparse constraint [12], [13], [10] and a mask scheme [11] in the training process. Despite its simplicity, this kind of approach usually requires training from scratch and is therefore computationally expensive. (2) Property-based pruning, which picks a specific property of a pre-trained network, e.g., the $\ell_1$-norm [14] and/or the ratio of activations [15], and simply removes filters with less importance. However, many of these approaches require recursively pruning the filters of each single pre-trained layer and fine-tuning, which is very costly. (3) Reconstruction-based pruning [16], [17], which imposes relaxed constraints on the optimization procedure. Nevertheless, the optimization procedure in each layer is typically data-driven and/or iterative [16], [17], which brings a heavy optimization burden.

In this paper, we propose the FilterSketch approach, which, by encoding the second-order information of pre-trained network weights, provides a new perspective for deep CNN compression. FilterSketch is inspired by the fact that preserving the second-order covariance of a matrix is equal to maximizing the correlation of multi-variate data [18]. The representation of the second-order information has been demonstrated to be effective in many other tasks [19], [20], [21], and for the first time, we apply it to network pruning in this paper.

Instead of simply discarding the unimportant filters, FilterSketch preserves the second-order information of the pre-trained model in the pruned model as shown in Fig. 1. For each layer in the pre-trained model, FilterSketch learns a set of new parameters for the pruned model which maintains the second-order covariance of the pre-trained model. The new group of sketched parameters then serves as a warm-up for fine-tuning the pruned network. The warm-up provides an excellent ability to recover the model performance.

We show that preserving the second-order information can be approximated as a matrix sketch problem, which can then be efficiently solved by the off-the-shelf Frequent Direction method [22], leading to a several-orders-of-magnitude reduction of optimization time in pruning. FilterSketch thus involves neither complex regularization to restart retraining nor data-driven iterative optimization to approximate the covariance information of the pre-trained model.

II. RELATED WORK

Unstructured pruning and structured pruning are two major lines of methods for network model compression. In a broader view, other techniques, e.g., parameter quantization and low-rank decomposition, can be integrated with network pruning to achieve higher compression and speedup. We give a brief discussion over some related works in the following and refer readers to the survey paper [23] for a more detailed overview.

Fig. 1. Framework of FilterSketch. The upper part displays the second-order covariance approximation between the pre-trained model and the pruned model at the i-th layer. The lower part shows that the approximation is achieved effectively and efficiently by the stream data based matrix sketch [22].

Unstructured Pruning. Pruning a neural network to a reasonably smaller size for better generalization has long been investigated. As a pioneering work, the second-order Taylor expansion [7] is utilized to select less important parameters for deletion. Sum et al. [24] introduced an extended Kalman filter (EKF) training to measure the importance of a weight in a network. Given the recursive Bayesian property of EKF, they further considered the sensitivity of a posterior probability as a measure of weight importance, which is then applied to pruning in a nonstationary environment [25] or recurrent neural networks [26].

Han et al. [8] introduced an iterative weight pruning method by fine-tuning with a strong regularization and discarding the small weights with values below a threshold. Group sparsity based regularization of network parameters [27] is leveraged to penalize unimportant parameters. Further, [28] prunes parameters based on the second-order derivatives of a layer-wise error function, while [29] implements CNNs in the frequency domain and applies a 2-D DCT transformation to sparsify the coefficients for spatial redundancy removal. The lottery ticket hypothesis [30] sets the weights below a threshold to zero, rewinds the rest of the weights to their initial configuration, and then retrains the network from this configuration.

Though progress has been made, unstructured pruning requires specialized hardware or software support to speed up inference. In practice, it has limited applications on general-purpose hardware or BLAS libraries, due to the irregular sparsity in weight tensors.

Structured Pruning. Compared to unstructured pruning, structured pruning does not have limitations on specialized hardware or software since the entire filters are removed, and thereby it is more favorable in accelerating CNNs.

Regularization-based pruning techniques require a joint retraining from scratch to derive the values of filters such that they can be made sufficiently small. To that effect, [12], [10] impose a sparse property on the scaling factor of the batch normalization layer with the deterministic $\ell_1$-norm or a dynamical distribution of channel saliency. After re-training, the channels below a threshold are discarded correspondingly. Huang et al. [13] proposed a data-driven sparse structure selection by introducing scaling factors to scale the outputs of the pruned structure and added the sparsity constraint on these scaling factors. Lin et al. [11] proposed to minimize an objective function with $\ell_1$-regularization on a soft mask via generative adversarial learning and adopted knowledge distillation for optimization.

Property-based pruning tries to figure out a discriminative property of pre-trained CNN models and discards filters of less importance. Hu et al. [15] utilized the abundant zero activations in a large network and iteratively pruned filters with a higher percentage of zero outputs in a layer-wise fashion. Li et al. [14] used the sum of absolute values of filters as a metric to measure the importance of filters, and assumed that filters with smaller values are less informative and thus should be pruned first. In [31], the importance scores of the final responses are propagated to every filter in the network and the CNN is pruned by removing the filters with the least importance. Polyak et al. [32] considered the contribution variance of each channel and removed filters that have low-variance outputs, which shows a good ability to accelerate deep face networks. Centripetal SGD [33] constrains filters within the same cluster to move towards a center, and thus removes the identical filters without the necessity of fine-tuning. EigenDamage [34] decorrelates filter weights on top of the Kronecker-factored eigenbasis to enable weights to be approximately independent, and meanwhile allows a global ranking of filter importance in Hessian-based pruning. He et al. [35] proposed to calculate the geometric median in each layer, and the filters closest to it are pruned.
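As a concrete illustration of a property-based criterion, the following minimal sketch ranks the filters of one layer by the sum of absolute values of their weights, in the spirit of [14]. The function name `l1_keep_indices`, the `(c_out, c_in, h, w)` weight layout and the keep ratio are assumptions made for this example and are not taken from [14].

```python
import numpy as np

def l1_keep_indices(weight, keep_ratio):
    """Return the sorted indices of the filters with the largest l1-norm.

    weight: array of shape (c_out, c_in, h, w); keep_ratio: fraction of filters kept.
    """
    c_out = weight.shape[0]
    n_keep = max(1, int(round(keep_ratio * c_out)))
    scores = np.abs(weight).reshape(c_out, -1).sum(axis=1)  # l1-norm of each filter
    keep = np.argsort(scores)[::-1][:n_keep]                # largest scores first
    return np.sort(keep)

rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 4, 3, 3))
weight[[1, 5]] *= 0.01                     # make two filters nearly zero
kept = l1_keep_indices(weight, keep_ratio=0.75)
pruned_weight = weight[kept]               # the two near-zero filters are dropped
```

The selection only looks at one scalar per filter, which is the kind of first-order criterion that the covariance-preserving scheme in Sec. III is contrasted with.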

Optimization-based pruning leverages layer-wise optimization to minimize the reconstruction error between the full model and the pruned model. He et al. [16] presented a LASSO-based filter selection strategy to identify representative filters and a least square reconstruction error to reconstruct the outputs. These two steps are iteratively executed until convergence. In [17], Luo et al. reconstructed the statistics information from the next layer to guide the importance evaluation of filters from the current layer. A set of training samples is used to deduce a closed-form solution.

Fig. 2. Histograms of weights at different layers of ResNet-50, which have approximately zero means. Conv20_1 denotes the 1st filter in the 20th convolutional layer, and likewise for the others. Note that similar statistical results can be observed in other convolutional networks as well.

The proposed FilterSketch can be grouped into optimization-based pruning but differs from [16], [17] in two aspects: First, it preserves the second-order information of pre-trained weights, leading to quick accuracy recovery without the requirement of training from scratch or layer-wise fine-tuning. Second, it can be formulated as the matrix sketch problem and solved by the off-the-shelf Frequent Direction (FD) method, leading to a several-orders-of-magnitude reduction of time consumption without introducing data-driven and/or iterative optimization procedure. We note that [36] conducts a complex tensor sketch for network approximation. Differently, our FilterSketch uses the efficient FD algorithm for the goal of information preserving. The network pruning and network approximation can be combined to further reduce the network size, which will be our future work.

Note that, [7], [37], [28], [38] also exploit the second order information for the network pruning. Nevertheless, our second-order information fundamentally differs from [7], [37], [28], [38]. First, the second-order information of [7], [37], [28], [38] lies in the second-order derivatives while our FilterSketch lies in the second-order covariance. Second, the second-order derivatives in [7], [37], [28], [38] are used as a measure to identify unimportant weights/channels while our FilterSketch aims to preserve the covariance information in the preserved weights. Third, [7], [37], [28], [38] involve heavy computation for constructing Hessian Matrix while our FilterSketch is implemented in a computationally cheap manner via the Frequent Direction.

III. THE PROPOSED APPROACH

A. Notations

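We start with notation definitions. Given a pre-trained CNN model $F$, which contains $L$ convolutional layers, and a set of filters $W = \{W^i\}_{i=1}^{L}$ with $W^i = [W^i_1, W^i_2, \ldots, W^i_{c^i}] \in \mathbb{R}^{d^i \times c^i}$ and $d^i = c^{i-1} \cdot h^i \cdot w^i$, where $c^i$, $h^i$ and $w^i$ respectively denote the channel number, filter height and width of the i-th layer. $W^i$ is the filter set for the i-th layer and $W^i_j$ is the $j$-th filter in the i-th layer.

The goal is to search for the pruned model $\mathcal{F}$, a set of transformed filters $\tilde{W} = \{\tilde{W}^i\}_{i=1}^{L}$ with $\tilde{W}^i = [\tilde{W}^i_1, \tilde{W}^i_2, \ldots, \tilde{W}^i_{\tilde{c}^i}] \in \mathbb{R}^{d^i \times \tilde{c}^i}$ and $\tilde{c}^i = \lceil p^i \cdot c^i \rfloor$, where $p^i$ is the pruning rate for the i-th layer and $\lceil \cdot \rfloor$ rounds the input to its nearest integer. To learn $\tilde{W}^i$ for each layer, predominant approaches are often divided into three streams: (1) Retraining CNNs from scratch by imposing human-designed regularizations into the training loss [13], [11]. (2) Measuring the importance of filters via an intrinsic property of CNNs [14], [31]. (3) Minimizing the reconstruction error [16], [17] for pruning optimization. Nevertheless, these methods solely consider the first-order statistics, while missing the covariance information.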
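To make the notation concrete, the sketch below flattens a convolutional weight tensor into a filter matrix with one filter per column and computes the number of filters kept for a given rate. It assumes a `(c_out, c_in, h, w)` weight layout, a column length of `c_in * h * w`, and that the rate is the fraction of filters kept; the helper names `filters_to_matrix` and `sketched_width` are hypothetical.

```python
import numpy as np

def filters_to_matrix(weight):
    """Flatten a (c_out, c_in, h, w) weight tensor into a d x c matrix, one filter per column."""
    c_out = weight.shape[0]
    return weight.reshape(c_out, -1).T

def sketched_width(c, p):
    """Number of filters kept for rate p, rounded to the nearest integer."""
    return int(np.rint(p * c))

weight = np.zeros((64, 32, 3, 3))   # hypothetical layer: 64 filters, 32 input channels
W = filters_to_matrix(weight)       # 32 * 3 * 3 = 288 rows, 64 columns
c_tilde = sketched_width(64, 0.6)   # number of columns of the pruned filter matrix
```

Note that `np.rint` rounds exact halves to the nearest even integer, which is one possible reading of "nearest integer".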

B. Information Preserving

In this study, we devise a novel second-order covariance preserving scheme, which provides a good warm-up for fine-tuning the pruned network. Different from existing works [19], [20], [21] where the covariance statistics of feature maps are calculated, we aim to preserve the covariance information of filters.

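For each $W^i \in \mathbb{R}^{d^i \times c^i}$, our second-order preserving scheme aims to find a filter matrix $\tilde{W}^i \in \mathbb{R}^{d^i \times \tilde{c}^i}$, which contains only $\tilde{c}^i$ columns but preserves sufficient covariance information of $W^i$, i.e.,

$$\Sigma_{W^i} \approx \Sigma_{\tilde{W}^i}, \tag{1}$$

where $\Sigma_{W^i}$ and $\Sigma_{\tilde{W}^i}$ respectively denote the covariance matrices of $W^i$ and $\tilde{W}^i$, and are defined as:

$$\Sigma_{W^i} = (W^i - \bar{W}^i)(W^i - \bar{W}^i)^T, \tag{2}$$

$$\Sigma_{\tilde{W}^i} = (\tilde{W}^i - \bar{\tilde{W}}^i)(\tilde{W}^i - \bar{\tilde{W}}^i)^T, \tag{3}$$

where $\bar{W}^i = \frac{1}{c^i} \sum_{j=1}^{c^i} W^i_j$ and $\bar{\tilde{W}}^i = \frac{1}{\tilde{c}^i} \sum_{j=1}^{\tilde{c}^i} \tilde{W}^i_j$ are the mean values of the filters in the i-th layer for the full model and pruned model, respectively, and the mean is subtracted from every column.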

The covariance can effectively measure the pairwise interactions between the pre-trained filters. A key ingredient of FilterSketch is that it can well preserve the correlation information of $W^i$ in $\tilde{W}^i$. Through this, it yields a more expressive and informative $\tilde{W}^i$ for fine-tuning, as validated in Sec. IV.

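To preserve the covariance information in Eq. 1, we formulate the following objective function:

$$\arg\min_{\tilde{W}^i} \left\| \Sigma_{W^i} - \Sigma_{\tilde{W}^i} \right\|_F, \tag{4}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. Based on Eq. 2 and Eq. 3, Eq. 4 is expanded as:

$$\arg\min_{\tilde{W}^i} \left\| W^i (W^i)^T - c^i \bar{W}^i (\bar{W}^i)^T - \tilde{W}^i (\tilde{W}^i)^T + \tilde{c}^i \bar{\tilde{W}}^i (\bar{\tilde{W}}^i)^T \right\|_F, \tag{5}$$

where $\bar{W}^i$ and $\bar{\tilde{W}}^i$ are the filter means of Eq. 2 and Eq. 3.

We statistically observe that $\bar{W}^i \approx 0$. As illustrated in Fig. 2, the pre-trained weights tend to follow a zero-mean Gaussian-like weight distribution, which is also discussed in many previous works [39], [40], [41], [42], [43]. A potential reason is that the network is often trained with a zero-mean Gaussian distribution as the initialization. During training, the regularization effect of the $\ell_2$-norm weight penalty confines weights to the bell-shaped histogram, and thus prevents a drastic change of the initial Gaussian distribution. As such, the pre-trained weights still have an approximately zero-mean Gaussian-like histogram. Similarly, it is intuitive that a good pruned weight satisfies $\bar{\tilde{W}}^i \approx 0$. Thus, Eq. 5 can be re-written as:

$$\arg\min_{\tilde{W}^i} \left\| W^i (W^i)^T - \tilde{W}^i (\tilde{W}^i)^T \right\|_F. \tag{6}$$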
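The step from the centered covariance to the uncentered product can be checked numerically. The sketch below uses synthetic zero-mean Gaussian columns as a stand-in for pre-trained filters (an assumption; real weights are only approximately Gaussian), takes the mean over columns, and compares the centered statistic with its expanded form and with the plain product of the matrix and its transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 27, 64
W = rng.normal(scale=0.05, size=(d, c))        # synthetic zero-mean "filters", one per column

mean = W.mean(axis=1, keepdims=True)           # mean filter, shape (d, 1)
cov_centered = (W - mean) @ (W - mean).T       # centered second-order statistic
cov_expanded = W @ W.T - c * (mean @ mean.T)   # same quantity after expanding the product
gram = W @ W.T                                 # uncentered form used once the mean is dropped

identity_gap = np.linalg.norm(cov_centered - cov_expanded)
relative_gap = np.linalg.norm(cov_centered - gram) / np.linalg.norm(gram)
print(f"expansion gap: {identity_gap:.2e}, relative gap of dropping the mean: {relative_gap:.3f}")
```

The first gap is at the level of floating-point error because the expansion is exact; the second gap is small but nonzero, which is the approximation accepted when the mean terms are dropped.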

Similar to [16], [17], one can develop a series of optimization steps to minimize the reconstruction error of Eq. 6. However, the optimization procedure is typically based on data-driven and/or iterative methods [16], [17], which inevitably introduces heavy computation cost.

C. Tractability

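In this section, we show that Eq. 6 can be effectively and efficiently solved by the off-the-shelf matrix sketch method [22], which does not involve data-driven iterative optimization while maintaining the property of interest of $W^i$.

Specifically, a sketch of a matrix $W^i \in \mathbb{R}^{d^i \times c^i}$ is a transformed matrix $\tilde{W}^i \in \mathbb{R}^{d^i \times \tilde{c}^i}$, which is smaller than $W^i$ but tracks an $\epsilon$-approximation to the norm of $W^i$:

$$\tilde{W}^i (\tilde{W}^i)^T \preceq W^i (W^i)^T \quad \text{and} \quad \left\| W^i (W^i)^T - \tilde{W}^i (\tilde{W}^i)^T \right\|_2 \le \epsilon \|W^i\|_F^2, \tag{7}$$

where $\|\cdot\|_2$ denotes the spectral norm, as in [22].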

Several papers have been devoted to solving Eq. 7, including CUR decomposition [44], random projection [45], and column sampling methods [46], which however still rely on iterative optimization.

The streaming-based Frequent Direction (FD) method by [22] provides a promising direction to solve this problem, where each sample is passed forward only once without iterations, which is extremely efficient. We summarize it in Alg. 1. A data matrix $W^i \in \mathbb{R}^{d^i \times c^i}$ and the sketched size $\tilde{c}^i$ are fed into FD. Each column $W^i_j$ represents a sample. Columns from $W^i$ will replace all zero-valued columns in $\tilde{W}^i$, and half of the columns in the sketch will be emptied with two steps once $\tilde{W}^i$ is fully fed with non-zero valued columns: In the first step, the sketch is rotated (from right) with the SVD decomposition of $\tilde{W}^i$ so that its columns are orthogonal and in descending magnitude order. In the SVD decomposition, $\tilde{W}^i = U S V^T$ with $U^T U = V^T V = I$, where $I$ is the identity matrix, $S$ is a non-negative diagonal matrix and $S_{11} \ge S_{22} \ge \cdots \ge 0$. In the second step, $S$ is shrunk so that half of its singular values are zeros. Accordingly, the right half of the columns in the updated sketch (see line 7 of Alg. 1) will be zeros. The details of the method can be referred to [22].
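The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration that follows the description of Frequent Directions in [22] with columns as samples; it is not a line-by-line transcription of Alg. 1. The function name `frequent_directions`, the restriction `2 <= sketch_size <= d`, and the exact index used for the shrinking threshold are choices made for this example.

```python
import numpy as np

def frequent_directions(W, sketch_size):
    """Sketch the d x c matrix W into a d x sketch_size matrix, one column at a time."""
    d, c = W.shape
    if not 2 <= sketch_size <= d:
        raise ValueError("this sketch assumes 2 <= sketch_size <= d")
    half = sketch_size // 2
    B = np.zeros((d, sketch_size))
    next_free = 0  # index of the first zero-valued column of the sketch
    for j in range(c):
        B[:, next_free] = W[:, j]  # feed one sample into an empty column
        next_free += 1
        if next_free == sketch_size:  # sketch is full: rotate, then shrink
            U, s, _ = np.linalg.svd(B, full_matrices=False)
            delta = s[half - 1] ** 2
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = U * s_shrunk  # columns from index half - 1 onward are now zero
            next_free = half - 1
    return B

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 200))
B = frequent_directions(W, sketch_size=16)

err = np.linalg.norm(W @ W.T - B @ B.T, 2)        # spectral norm of the gap
bound = 2.0 / 16 * np.linalg.norm(W, "fro") ** 2  # guarantee stated in [22]
print(err <= bound)
```

Each column of `W` is touched exactly once, and the only non-trivial operation is an SVD of the small `d x sketch_size` buffer, which is what makes the method cheap compared with data-driven layer-wise optimization.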

It can be seen that the optimization of Eq. 6 is similar to the matrix sketch problem of Eq. 7, though the existence of the upper bound term $\epsilon \|W^i\|_F^2$ does not necessarily result in an optimal $\tilde{W}^i$ for Eq. 6.

Nevertheless, in what follows, we show that Alg. 1 can provide a tight convergence bound to solve the sketch problem of Eq. 7 while the learned $\tilde{W}^i$ can serve as a good warm-up for fine-tuning the pruned model as demonstrated in Sec. IV.

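Corollary 1. If $\tilde{W}^i$ is the sketch result of matrix $W^i$ with the sketched size $\tilde{c}^i$ by Alg. 1, then it holds:

$$0 \le \left\| W^i (W^i)^T - \tilde{W}^i (\tilde{W}^i)^T \right\|_2 \le \frac{2}{\tilde{c}^i} \|W^i\|_F^2, \tag{8}$$

i.e., $\epsilon = 2 / \tilde{c}^i$. The proof of Corollary 1 can be referred to [22]. Accordingly, the convergence bound of FD is inversely proportional to $\tilde{c}^i$. A smaller $\tilde{c}^i$ causes more error, which is intuitive since a smaller $\tilde{c}^i$ means more pruned filters. Besides, the sketch time is upper-bounded by $O(d^i c^i \tilde{c}^i)$ [22]. Sec. IV-E shows that the sketch process requires less than 2 seconds with CPU, which verifies the efficiency of FilterSketch.

In our experiments, we find that a small portion of the elements in $\tilde{W}^i$ are abnormally large, especially for a small value of $\tilde{c}^i$, which damages the accuracy performance as demonstrated in Tab. III. This is understandable since the error bound does exist in Corollary 1. We can conduct a second-round sketch for $W^i / \|W^i\|_F$ to solve this problem of unstable numerical values (see below for analysis). However, in what follows, we show that the second-round sketch can be eliminated by simply normalizing $\tilde{W}^i$ with $\|W^i\|_F$.

Theorem 1. If $\tilde{W}^i$ is the sketch result by applying Alg. 1 to matrix $W^i$ with the sketched size $\tilde{c}^i$, then for any constant $\beta$, $\beta \tilde{W}^i$ is the result by applying Alg. 1 to matrix $\beta W^i$.

Proof. We start with Line 5 of Alg. 1, which can be modified for the input $\beta W^i$ as follows: every column fed into the sketch is scaled by $\beta$, so if the sketch of $W^i$ is decomposed as $U S V^T$, the sketch of $\beta W^i$ is decomposed as $U (\beta S) V^T$. The shrunk singular values are scaled by $\beta$ as well, hence every update of the sketch is scaled by $\beta$ and the final result is $\beta \tilde{W}^i$.

Setting $\beta = 1 / \|W^i\|_F$, the sketch of $W^i / \|W^i\|_F$ is equal to $\tilde{W}^i / \|W^i\|_F$. Note that $\|W^i\|_F > 1$ is satisfied in our extensive experiments. Thus, the abnormally large elements in $\tilde{W}^i$ can be re-scaled to smaller ones. Finally, $\tilde{W}^i / \|W^i\|_F$ is fed to the slimmed network for fine-tuning.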
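The scaling property behind the normalization can be checked numerically. The sketch below re-implements a compact column-streaming Frequent Directions routine (an illustrative version following [22], with the hypothetical name `fd_sketch`, not the paper's Alg. 1 verbatim) and compares the sketch of a scaled matrix with the scaled sketch. The comparison is done on Gram matrices because the SVD determines singular vectors only up to sign; a positive scaling constant is assumed.

```python
import numpy as np

def fd_sketch(W, sketch_size):
    """Compact Frequent Directions over the columns of W (assumes 2 <= sketch_size <= d)."""
    half = sketch_size // 2
    B = np.zeros((W.shape[0], sketch_size))
    free = 0
    for column in W.T:
        B[:, free] = column
        free += 1
        if free == sketch_size:
            U, s, _ = np.linalg.svd(B, full_matrices=False)
            B = U * np.sqrt(np.maximum(s ** 2 - s[half - 1] ** 2, 0.0))
            free = half - 1
    return B

rng = np.random.default_rng(1)
W = 3.0 * rng.normal(size=(48, 120))
beta = 1.0 / np.linalg.norm(W, "fro")  # normalization constant, here beta < 1

sketch = fd_sketch(W, 12)
sketch_of_scaled = fd_sketch(beta * W, 12)

# Compare Gram matrices, which are insensitive to the sign ambiguity of the SVD.
same = np.allclose(sketch_of_scaled @ sketch_of_scaled.T,
                   beta ** 2 * (sketch @ sketch.T), rtol=1e-6, atol=1e-10)
normalized = beta * sketch  # normalized sketch, obtained without a second pass
```

Because the two results agree, dividing the first-round sketch by the Frobenius norm of the original matrix gives the same second-order statistics as sketching the normalized matrix again, so the second pass can be skipped.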

We outline FilterSketch in Alg. 2. It can be seen that, compared with existing methods, FilterSketch stands out in that it is deterministic, simple to implement, and also very fast (see Tab. IV later).

IV. EXPERIMENTS

To show the effectiveness and efficiency of FilterSketch, we have conducted extensive experiments on image classification. Representative compact-designed networks, including GoogLeNet [47] and ResNet-50/56/110 [48], are chosen to compress. We report the performance of FilterSketch on CIFAR-10 [49] and ILSVRC-2012 [50], and compare it to state-of-the-arts (SOTAs) including regularization-based pruning [13], [11], property-based pruning [14], [31], [51], and optimization-based pruning [16], [17]. Besides, we also conduct subsampling of the pre-trained weights for fine-tuning (denoted as Random) to show the advantage of considering the second-order information.

A. Implementation Details

Training Strategy. We use the Stochastic Gradient Descent (SGD) for fine-tuning with the Nesterov momentum 0.9 and the batch size is set to 256. For CIFAR-10, the weight decay is set to 5e-3 and we fine-tune the network for 150 epochs with an initial learning rate of 0.01, which is then divided by 10 every 50 epochs. For ILSVRC-2012, the weight decay is set to 5e-4 and 90 epochs are given to fine-tune the network. The learning rate is initially set to 0.1, and divided by 10 every 30 epochs.
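The step schedules described above can be written down explicitly. The helper below is a plain-Python illustration with a hypothetical name `step_lr`; it is not the API of any training framework, and it only reproduces the rule "divide the learning rate by 10 every fixed number of epochs".

```python
def step_lr(epoch, base_lr, step, gamma=0.1):
    """Learning rate at `epoch` when base_lr is multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# CIFAR-10: 150 epochs, initial rate 0.01, divided by 10 every 50 epochs.
cifar_schedule = [step_lr(e, base_lr=0.01, step=50) for e in range(150)]
# ILSVRC-2012: 90 epochs, initial rate 0.1, divided by 10 every 30 epochs.
imagenet_schedule = [step_lr(e, base_lr=0.1, step=30) for e in range(90)]
```

Both schedules therefore pass through three learning-rate levels before fine-tuning ends.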

For all methods, we apply the standard data augmentation provided by the official PyTorch, including random crop and horizontal flip. Note that other techniques for image augmentation, such as lighting and color jitter used in the implementations of [52], [53], [54], or the cosine learning rate [55], can be applied to further improve the accuracy. We do not consider these since we aim to show the performance of the pruning algorithms themselves. We provide our codes in the supplementary material.

Performance Metric. Parameter amount and FLOPs (floating-point operations) are used as the metrics, which respectively denote the storage and computation cost. We also report the pruning rate (PR) of parameters and FLOPs. For CIFAR-10, top-1 accuracy is provided. For ILSVRC-2012, both top-1 and top-5 accuracies are reported.
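As a rough illustration of how these metrics relate to filter pruning, the sketch below counts parameters and FLOPs of a single convolutional layer and computes the resulting pruning rates. It assumes FLOPs are counted as multiply-accumulate operations and ignores biases; counting conventions vary between papers, and the layer sizes and helper names are hypothetical.

```python
def conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and multiply-accumulate count of a k x k convolution (bias ignored)."""
    params = c_in * c_out * k * k
    flops = params * h_out * w_out
    return params, flops

def pruning_rate(original, pruned):
    """Fraction of the original cost that was removed."""
    return 1.0 - pruned / original

# Hypothetical layer: 16 -> 16 channels, 3x3 kernel, 32x32 output map.
params_full, flops_full = conv_cost(16, 16, 3, 32, 32)
# Keeping half of the filters here and in the preceding layer halves c_in and c_out.
params_pruned, flops_pruned = conv_cost(8, 8, 3, 32, 32)
param_pr = pruning_rate(params_full, params_pruned)
flops_pr = pruning_rate(flops_full, flops_pruned)
```

Removing filters in one layer also removes input channels of the next layer, which is why the per-layer saving can exceed the fraction of filters removed.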

B. Results on CIFAR-10

We evaluate the performance of FilterSketch on CIFAR-10 with popular networks, including GoogLeNet, ResNet-56 and ResNet-110. For GoogLeNet, we make the final output class number the same as the number of categories on CIFAR-10.

GoogLeNet. As can be seen from Tab. I, FilterSketch outperforms the SOTA methods in both accuracy retaining and model complexity reductions. Specifically, 61.1% of the FLOPs are reduced and 57.6% of the parameters are removed, achieving a significantly higher compression rate than GAL-0.6 and HRank. Besides, FilterSketch also maintains a comparable top-1 accuracy, even better than L1, which obtains a much smaller complexity reduction.

ResNet-56. Results for ResNet-56 are presented in Tab. I, where FilterSketch removes around 41.5% of the FLOPs and parameters while keeping the top-1 accuracy at 93.19%. Compared to 93.26% by the pre-trained model, the accuracy drop is negligible. Compared with L1, FilterSketch shows an overwhelming superiority. Though NISP obtains 1% more parameter reduction than FilterSketch, it takes more computation in the convolutional layers with a lower top-1 accuracy. Moreover, in comparison with HRank under similar reductions of FLOPs and parameters, our FilterSketch results in a higher top-1 accuracy, well demonstrating the effectiveness of the sketched weights to recover the accuracy performance.

ResNet-110. Tab. I also displays the pruning results for ResNet-110. FilterSketch reduces the FLOPs of ResNet-110 by an impressive factor of 63.3%, and the parameters by 59.9%, while maintaining an accuracy of 93.44%. FilterSketch significantly outperforms these SOTAs, showing that it can greatly facilitate the ResNet model, a popular backbone for object detection and semantic segmentation, to be deployed on mobile devices.

In Tab. I, we also display the performance of randomly subsampling the filter weights (Random) given the same pruning rates as with FilterSketch. As seen, Random suffers great accuracy degradation in comparison with FilterSketch. In contrast, FilterSketch considers all the information in the pre-trained weights, which provides a more informative warm-up for fine-tuning the pruned model.

TABLE I RESULTS OF GOOGLENET AND RESNET-56/110 ON CIFAR-10.

TABLE II RESULTS OF RESNET-50 ON ILSVRC-2012.

In Fig. 3, we further compare the Top-1 accuracies of the compressed models by GAL [11], L1 [14], Random, and our FilterSketch under different compression rates using ResNet-56. As shown in the figure, our method outperforms the compared methods easily. Especially, for large pruning rates (> 60%), both L1 and GAL suffer an extreme accuracy drop while FilterSketch maintains a relatively stable performance, which stresses the importance of information preserving in network pruning again.

C. Results on ILSVRC-2012

In Tab. II, we show the results for ResNet-50 on ILSVRC-2012 and compare FilterSketch to many SOTAs. We display different pruning rates for FilterSketch and compare top-1 and top-5 accuracies. For convenience, we use FilterSketch-$a$ to denote FilterSketch with the sketch rate set to $a$ (e.g., FilterSketch-0.6). A smaller $a$ leads to a higher compression rate.

As shown in Tab. II, with similar or better reductions of FLOPs and parameters, FilterSketch demonstrates its great advantages in retaining the accuracy in comparisons with the SOTAs. For example, FilterSketch-0.6 obtains 74.68% top-1 and 92.17% top-5 accuracies, significantly better than GAL-0.5 and SSS-26. Another observation is that, with similar or more FLOPs reduction, FilterSketch also removes more parameters. Hence, FilterSketch is especially suitable for network compression.

D. Practical Speedup

The practical speedup for pruned CNNs depends on many factors, e.g., FLOPs reduction percentage, the number of CPU/GPU cores available and I/O delay of data swap, etc.

Fig. 3. FLOPs and parameter comparison among GAL [11], L1 [14], Random, and our FilterSketch under different compression rates. ResNet-56 is compressed and Top-1 accuracy is reported.

Fig. 4. Speedups corresponding to CPU (Intel(R) Xeon(R) CPU E5-2620 v4 @2.10GHz) and GPU (GTX-1080TI) over the different CNNs with a batch size 256 on CIFAR-10.

We test the speedup of pruned models by FilterSketch in Tab. I with CPU and GPU, and present the results in Fig. 4. Compared with the theoretical speedups of 1.71×, …, 2.58× for ResNet-56, ResNet-110 and GoogLeNet, respectively, FilterSketch gains 1.11×, … practical speedups with GPU, while 1.25×, 1.89× and 1.28× are obtained with CPU.
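A theoretical speedup can be related to the FLOPs pruning rate with a one-line computation. The sketch below assumes that the theoretical speedup is the ratio of original to remaining FLOPs and uses the rounded CIFAR-10 FLOPs reductions quoted in Sec. IV-B, so its outputs are approximations and need not match the reported values to the last digit. The function name `theoretical_speedup` is hypothetical.

```python
def theoretical_speedup(flops_pruning_rate):
    """Speedup implied by FLOPs alone: original FLOPs divided by remaining FLOPs."""
    return 1.0 / (1.0 - flops_pruning_rate)

# FLOPs pruning rates quoted in the text for CIFAR-10 (rounded percentages).
reported = {"ResNet-56": 0.415, "ResNet-110": 0.633, "GoogLeNet": 0.611}
speedups = {name: theoretical_speedup(pr) for name, pr in reported.items()}

for name, value in speedups.items():
    print(f"{name}: about {value:.2f}x fewer FLOPs")
```

The practical speedups are lower than these ratios because, as noted above, runtime also depends on the number of available cores and on I/O delay.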

E. Normalization Influence and Optimization Efficiency

To measure the effectiveness of the sketch with the Frobenius normalization, we compare our FilterSketch models given in Tab. I and Tab. II (FilterSketch-0.4) with corresponding models but excluding the Frobenius normalization. As shown in Tab. III, the former (the third column) obtains a better accuracy than the latter (the second column), which verifies the analysis in Sec. III-C that the sketch with the Frobenius normalization can effectively solve the problem of unstable numerical values after sketch. As for the sketch efficiency, we again compare these four models in Tab. III with two optimization-based methods [17], [16]. The results in Tab. IV show that the time cost of the sketch process is small. Even with the wider GoogLeNet and the deeper ResNet-110, the sketches consume less than 2 seconds, which is several orders of magnitude faster than the other methods that cost many hours, or even days.

TABLE III PERFORMANCE COMPARISONS BETWEEN SKETCHES WITH AND WITHOUT THE NORMALIZATION.

TABLE IV OPTIMIZATION EFFICIENCY (CPU) AMONG THINET [17], CP [16] AND FILTERSKETCH.

V. CONCLUSION

We have proposed a novel approach, termed FilterSketch, for structured network pruning. Instead of simply discarding unimportant filters, FilterSketch preserves the second-order information of the pre-trained model, through which the accuracy is well maintained. We have further proposed to obtain the information preserving constraint by utilizing the off-the-shelf matrix sketch method, based on which the requirement of training from scratch or iterative optimization can be eliminated, and the pruning complexity is significantly reduced. Extensive experiments have demonstrated the superiorities of FilterSketch over the state-of-the-arts. As the first attempt on weight information preserving, FilterSketch provides a fresh insight for the network pruning problem. Nevertheless, our FilterSketch is built on the fact that the filter weights have approximate zero mean in each layer of the convolutional neural network. Such a requirement might not be satisfied in other networks, e.g., multi-layer perceptron. More effort will be made to solve this issue in our future work.

REFERENCES

[1] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, “Quantized cnn: A unified approach to accelerate and compress convolutional networks,” IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 29, no. 10, pp. 4730–4743, 2017.

[2] P. Wang, X. He, Q. Chen, A. Cheng, Q. Liu, and J. Cheng, “Unsupervised network quantization via fixed-point factorization,” IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2020.

[3] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” arXiv preprint arXiv:1511.06530, 2015.

[4] K. Hayashi, T. Yamaguchi, Y. Sugawara, and S.-i. Maeda, “Exploring unexplored tensor network decompositions for convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 5553–5563.

[5] Z. Chen, T.-B. Xu, C. Du, C.-L. Liu, and H. He, “Dynamical channel pruning by conditional accuracy change for deep neural networks,” IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2020.

[6] S. Lin, R. Ji, Y. Li, C. Deng, and X. Li, “Toward compact convnets via structure-sparsity regularized filter pruning,” IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 31, no. 2, pp. 574–588, 2019.

[7] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 1990, pp. 598–605.

[8] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2015, pp. 1135–1143.

[9] P. Singh, V. K. Verma, P. Rai, and V. P. Namboodiri, “Play and prune: Adaptive filter pruning for deep model compression,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 3460–3466.

[10] C. Zhao, B. Ni, J. Zhang, Q. Zhao, W. Zhang, and Q. Tian, “Variational convolutional neural network pruning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2780–2789.

[11] S. Lin, R. Ji, C. Yan, B. Zhang, L. Cao, Q. Ye, F. Huang, and D. Doermann, “Towards optimal structured cnn pruning via generative adversarial learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2790–2799.

[12] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2736–2744.

[13] Z. Huang and N. Wang, “Data-driven sparse structure selection for deep neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 304–320.

[14] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” in Proceedings of the International Conference on Learning Representations (ICLR), 2017.

[15] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, “Network trimming: A data-driven neuron pruning approach towards efficient deep architectures,” arXiv preprint arXiv:1607.03250, 2016.

[16] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1389–1397.

[17] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5058–5066.

[18] Y. Sun, L. Zheng, W. Deng, and S. Wang, “Svdnet for pedestrian retrieval,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3800–3808.

[19] J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 1564–1574.

[20] J. Zhang, Z. Feng, Y. Su, and M. Xing, “Discriminative saliency-pose-attention covariance for action recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2132–2136.

[21] C.-X. Ren, J. Feng, D.-Q. Dai, and S. Yan, “Heterogeneous domain adaptation via covariance structured feature translators,” IEEE Transactions on Cybernetics (T-Cybernetics), 2019.

[22] E. Liberty, “Simple and deterministic matrix sketching,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 2013, pp. 581–588.

[23] S. Vadera and S. Ameen, “Methods for pruning deep neural networks,” arXiv preprint arXiv:2011.00241, 2020.

[24] J. Sum, C.-s. Leung, G. H. Young, and W.-k. Kan, “On the kalman filtering method in neural network training and pruning,” IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 10, no. 1, pp. 161–166, 1999.

[25] J. Sum, C.-s. Leung, G. H. Young, L.-w. Chan, and W.-k. Kan, “An adaptive bayesian pruning for neural networks in a non-stationary environment,” Neural Computation, vol. 11, no. 4, pp. 965–976, 1999.

[26] J. Sum, L.-w. Chan, C.-s. Leung, and G. H. Young, “Extended kalman filter–based pruning method for recurrent neural networks,” Neural Computation, vol. 10, no. 6, pp. 1481–1505, 1998.

[27] J. M. Alvarez and M. Salzmann, “Learning the number of neurons in deep networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2016, pp. 2270–2278.

[28] X. Dong, S. Chen, and S. Pan, “Learning to prune deep neural networks via layer-wise optimal brain surgeon,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 4857–4867.

[29] Z. Liu, J. Xu, X. Peng, and R. Xiong, “Frequency-domain dynamic pruning for convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 1043–1053.

[30] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[31] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis, “Nisp: Pruning networks using neuron importance score propagation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9194–9203.

[32] A. Polyak and L. Wolf, “Channel-level acceleration of deep face representations,” IEEE Access, vol. 3, pp. 2163–2175, 2015.

[33] X. Ding, G. Ding, Y. Guo, and J. Han, “Centripetal sgd for pruning very deep convolutional networks with complicated structure,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4943–4953.

[34] C. Wang, R. Grosse, S. Fidler, and G. Zhang, “Eigendamage: Structured pruning in the kronecker-factored eigenbasis,” in Proceedings of the International Conference on Machine Learning (ICML), 2019, pp. 6566–6575.

[35] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, “Filter pruning via geometric median for deep convolutional neural networks acceleration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4340–4349.

[36] S. P. Kasiviswanathan, N. Narodytska, and H. Jin, “Network approximation using tensor sketching,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 2319–2325.

[37] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: optimal brain surgeon,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 1992, pp. 164–171.

[38] H. Peng, J. Wu, S. Chen, and J. Huang, “Collaborative channel pruning for deep networks,” in Proceedings of the International Conference on Machine Learning (ICML), 2019, pp. 5113–5122.

[39] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang, “A systematic dnn weight pruning framework using alternating direction method of multipliers,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 184–199.

[40] Z. He and D. Fan, “Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11438–11446.

[41] G. Franchi, A. Bursuc, E. Aldea, S. Dubuisson, and I. Bloch, “Tradi: Tracking deep neural network weight distributions,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 105–121.

[42] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 249–256.

[43] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1026–1034.

[44] P. Drineas and R. Kannan, “Pass efficient algorithms for approximating large matrices,” in Proceedings of the ACM-SIAM Symposium On Discrete Algorithms (SODA), 2003, pp. 223–232.

[45] T. Sarlos, “Improved approximation algorithms for large matrices via random projections,” in Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 2006, pp. 143–152.

[46] A. Frieze, R. Kannan, and S. Vempala, “Fast monte-carlo algorithms for finding low-rank approximations,” Journal of the ACM (JACM), vol. 51, no. 6, pp. 1025–1041, 2004.

[47] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.

[48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[49] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” University of Toronto, Toronto, ON, Canada, Tech. Rep., 01 2009.

[50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[51] M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian, and S. Ling, “Hrank: Filter pruning using high-rank feature map,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[52] J. Yu and T. Huang, “Autoslim: Towards one-shot architecture search for channel numbers,” arXiv preprint arXiv:1903.11728, 2019.

[53] Z. Liu, H. Mu, X. Zhang, Z. Guo, X. Yang, K.-T. Cheng, and J. Sun, “Metapruning: Meta learning for automatic neural network channel pruning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3296–3305.

[54] X. Ding, G. Ding, Y. Guo, J. Han, and C. Yan, “Approximated oracle filter pruning for destructive cnn width optimization,” in Proceedings of the International Conference on Machine Learning (ICML), 2019, pp. 1607–1616.

[55] X. Dong and Y. Yang, “Network pruning via transformable architecture search,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 759–770.

Mingbao Lin is currently pursuing the Ph.D. degree with Xiamen University, China. He has published over ten papers as the first author in international journals and conferences, including IEEE TPAMI, IJCV, IEEE TIP, IEEE TNNLS, IEEE CVPR, NeurIPS, AAAI, IJCAI, and ACM MM. His current research interests include network compression & acceleration, and information retrieval.

Liujuan Cao received the B.S., M.S., and Ph.D. degrees from the School of Computer Science and Technology, Harbin Engineering University. She is currently an associate professor at Xiamen University. Her research interest mainly focuses on computer vision and pattern recognition. She has authored over 40 papers in top-tier and major journals and conferences, including CVPR, TIP, etc. She is the Financial Chair of the IEEE MMSP 2015, the Workshop Chair of the ACM ICIMCS 2016, and the Local Chair of the Visual and Learning Seminar 2017.

Shaojie Li received the B.S. degree from Fuzhou University, China, in 2019. He is currently pursuing the M.S. degree at Xiamen University, China. His research interests include model compression and computer vision.

Qixiang Ye (Senior Member, IEEE) received the B.S. and M.S. degrees in mechanical and electrical engineering from Harbin Institute of Technology, China, in 1999 and 2001, respectively, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences in 2006. He has been a professor with the University of Chinese Academy of Sciences since 2009, and was a visiting assistant professor with the Institute of Advanced Computer Studies (UMIACS), University of Maryland, College Park until 2013. His research interests include image processing, object detection and machine learning. He has published more than 100 papers in refereed conferences and journals including IEEE CVPR, ICCV, ECCV, NeurIPS, TNNLS, and PAMI.

Yonghong Tian (Senior Member, IEEE) is currently a Boya Distinguished Professor with the Department of Computer Science and Technology, Peking University, China, and is also the deputy director of Artificial Intelligence Research Center, PengCheng Laboratory, Shenzhen, China. His research interests include neuromorphic vision, brain-inspired computation and multimedia big data. He is the author or coauthor of over 200 technical articles in refereed journals such as IEEE TPAMI/TNNLS/TIP/TMM/TCSVT/TKDE/TPDS, ACM CSUR/TOIS/TOMM and conferences such as NeurIPS/CVPR/ICCV/AAAI/ACMMM/WWW. Prof. Tian was/is an Associate Editor of IEEE TCSVT (2018.1-), IEEE TMM (2014.8-2018.8), IEEE Multimedia Mag. (2018.1-), and IEEE Access (2017.1-). He co-initiated IEEE Int’l Conf. on Multimedia Big Data (BigMM) and served as the TPC Co-chair of BigMM 2015, and also served as the Technical Program Co-chair of IEEE ICME 2015, IEEE ISM 2015 and IEEE MIPR 2018/2019, and General Co-chair of IEEE MIPR 2020 and ICME 2021. He is a steering member of IEEE ICME (2018-) and IEEE BigMM (2015-), and is a TPC Member of more than ten conferences such as CVPR, ICCV, ACM KDD, AAAI, ACM MM and ECCV. He was the recipient of the Chinese National Science Foundation for Distinguished Young Scholars in 2018, two National Science and Technology Awards and three ministerial-level awards in China, and obtained the 2015 EURASIP Best Paper Award for Journal on Image and Video Processing, and the best paper award of IEEE BigMM 2018. He is a senior member of IEEE, CIE and CCF, and a member of ACM.

Jianzhuang Liu (Senior Member, IEEE) received the Ph.D. degree in computer vision from The Chinese University of Hong Kong, Hong Kong, in 1997. From 1998 to 2000, he was a Research Fellow with Nanyang Technological University, Singapore. From 2000 to 2012, he was a Postdoctoral Fellow, an Assistant Professor, and an Adjunct Associate Professor with The Chinese University of Hong Kong. In 2011, he joined the Shenzhen Institute of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, China, as a Professor. He is currently a Principal Researcher with Huawei Technologies Company Limited, Shenzhen, China. He has authored more than 150 papers. His research interests include computer vision, image processing, deep learning, and graphics.

Qi Tian (Fellow, IEEE) is currently a Chief Scientist in Artificial Intelligence at Cloud BU, Huawei. From 2018 to 2020, he was the Chief Scientist in Computer Vision at Huawei Noah’s Ark Lab. Before that, he was a Full Professor in the Department of Computer Science, the University of Texas at San Antonio (UTSA) from 2002 to 2019. During 2008-2009, he took one-year Faculty Leave at Microsoft Research Asia (MSRA). Dr. Tian received his Ph.D. in ECE from University of Illinois at Urbana-Champaign (UIUC), his B.E. in Electronic Engineering from Tsinghua University, and his M.S. in ECE from Drexel University. Dr. Tian’s research interests include computer vision, multimedia information retrieval and machine learning, and he has published 610+ refereed journal and conference papers. His Google Scholar citations exceed 27,900, with an H-index of 81. He was the co-author of best papers including IEEE ICME 2019, ACM CIKM 2018, ACM ICMR 2015, PCM 2013, MMM 2013, ACM ICIMCS 2012, a Top 10% Paper Award in MMSP 2011, a Student Contest Paper in ICASSP 2006, and co-author of a Best Paper/Student Paper Candidate in ACM Multimedia 2019, ICME 2015 and PCM 2007. Dr. Tian received 2017 UTSA President’s Distinguished Award for Research Achievement, 2016 UTSA Innovation Award, 2014 Research Achievement Awards from College of Science, UTSA, 2010 Google Faculty Award, and 2010 ACM Service Award. He is the associate editor of IEEE TMM, IEEE TCSVT, ACM TOMM, MMSJ, and on the Editorial Board of Journal of Multimedia (JMM) and Journal of MVA. Dr. Tian is the Guest Editor of IEEE TMM, Journal of CVIU, etc. Dr. Tian is a Fellow of IEEE.

Rongrong Ji (Senior Member, IEEE) is a Nanqiang Distinguished Professor at Xiamen University, the Deputy Director of the Office of Science and Technology at Xiamen University, and the Director of Media Analytics and Computing Lab. He was awarded the National Science Foundation for Excellent Young Scholars (2014), the National Ten Thousand Plan for Young Top Talents (2017), and the National Science Foundation for Distinguished Young Scholars (2020). His research falls in the field of computer vision, multimedia analysis, and machine learning. He has published 50+ papers in ACM/IEEE Transactions, including TPAMI and IJCV, and 100+ full papers at top-tier conferences, such as CVPR and NeurIPS. His publications have received over 10K citations on Google Scholar. He was the recipient of the Best Paper Award of ACM Multimedia 2011. He has served as an Area Chair for top-tier conferences such as CVPR and ACM Multimedia. He is also an Advisory Member for Artificial Intelligence Construction in the Electronic Information Education Committee of the National Ministry of Education.