Differentiable Sparsification for Deep Neural Networks

2019·arXiv

Abstract

Introduction

Since the ground-breaking success of deep neural networks (DNNs) in the realm of machine learning and pattern recognition (LeCun et al. 1989; Krizhevsky, Sutskever, and Hinton 2012), the focal point has shifted from feature engineering to architecture engineering (Simonyan and Zisserman 2015; He et al. 2016a,b; Xie et al. 2017). While DNNs have eased the need for extensive feature engineering, crafting an effective architecture now demands comparable human involvement. This entails determining the appropriate number of neurons, layers and connections between nodes. Furthermore, as DNNs continue to grow in size, substantial efforts are being dedicated to reducing the size of these models.

These challenges can be effectively addressed through the sparsification of over-complete models. A network structure can be learned or optimized by removing redundant blocks (Alvarez and Salzmann 2016; Wen et al. 2016) or unnecessary connections between blocks (Ahmed and Torresani 2017; Liu, Simonyan, and Yang 2019; Wortsman, Farhadi, and Rastegari 2019). Pruning has long been a popular approach to reduce the complexity and size of deep neural networks (LeCun, Denker, and Solla 1989; Hassibi, Stork, and Wolff 1993; Song Han and Dally 2015; Liu et al. 2017; Frankle and Carbin 2019; He et al. 2019; Sui et al. 2021; Hou et al. 2022). It typically involves multiple steps applied to a pre-trained model, including selecting unimportant parameters, deleting those parameters, and retraining a pruned model. Another widely used technique is sparsity regularization with a proximal gradient (Parikh and Boyd 2014; Alvarez and Salzmann 2016; Wen et al. 2016; Yoon and Hwang 2017). It sets redundant parameters to zero during training without relying on a pre-trained model. However, the requirement for closed-form solutions for proximal operations limits its applicability.

In this study, we introduce a fully differentiable sparsification method for DNNs. The proposed method can zero out unimportant parameters by directly optimizing a regularized objective function with stochastic gradient descent (SGD). This enables the simultaneous learning of both the sparsified structure and the weights of DNNs in an end-to-end manner. By leveraging off-the-shelf automatic differentiation tools, our implementation becomes simple and straightforward. It eliminates the need for manual coding of a pruning step or a proximal operator. Furthermore, our method is versatile in that it can accommodate various norms as regularizers, regardless of the availability of closed-form solutions for a proximal operator. To the best of our knowledge, these characteristics mark the first fully differentiable sparsification method.

Related Work

Our proposed method is closely related to sparsity regularization with a proximal gradient (Parikh and Boyd 2014). In this approach, a regularized objective function is formulated as

where L represents a prediction loss, R is a regularization term, D is a set of training data, W is a set of model parameters, and controls the trade-off between a prediction loss and a regularization term.

The -norm is the most popular regularizer:

where represents an individual element of W. The regularization term is optimized using the proximal operator:

where denotes an assignment operator, is a learning rate, and represents -regularization acts on an individual parameter, it often leads to unstructured irregular models.

To obtain regular sparse structures, the sparse regularization with -norm can be adopted (Alvarez and Salzmann 2016; Wen et al. 2016), where all parameters in the same group are either retained or zeroed-out together. The regularization can be expressed as:

where represents a group of model parameters. The regularization term is optimized using a proximal operator,

When a group has only one single parameter, it degenerates to -norm regularization.

Another related group regularization method is exclusive lasso with -norm (Zhou, Jin, and Hoi 2010; Yoon and Hwang 2017). This method promotes competition or sparsity within a group, rather than either retaining or removing an entire group altogether. The regularization term is expressed as:

and the updating rule is derived as:

These proximal operators involve weight decaying and thresholding steps and are performed separately after optimizing the prediction loss, typically at every mini-batch or epoch. Therefore, additional effort is required to manually implement and manage the update rules. Furthermore, the application of these methods is limited to cases where closed-form solutions for the proximal operator are known.

Proposed Approach

Base Model

In our proposed approach, we consider a module consisting of n components. These components can be various building blocks used in DNNs, such as channels or layers in a convolutional neural network (CNN) like ResNet (He et al. 2016a,b) or DenseNet layers (Huang, Liu, and van der Maaten 2017). They can also represent nodes in a neural graph (Wortsman, Farhadi, and Rastegari 2019) or a graph CNN (Kipf and Welling 2017). A module represents a composite of these components, forming a group of channels or a layer. For illustration purposes, we assume that a module y can be expressed as a linear combination of components f

where x represents the input to the module, wdenotes the model parameters for component frepresents an architecture parameter. The model parameters wular parameters used in the model, such as filters in a convolutional layer or weights in a fully connected layer. The value of determines the importance of component i or the strength of the connection between nodes. Setting is equivalent to removing component for zeroing out w

Differentiable Sparse Parameterization

We propose a sparse parameterization scheme that enables to be zero. First, we illustrate the integration of non-negative constraints and the softmax function:

where are unconstrained free scalar parameters, denotes the sigmoid function, and represents the

This parameterization allows to take the value of zero and ensures differentiability from a modern deep learning perspective. The free parameters are real-valued and do not restrict the training process using SGD. Hence, we can train . The exponential function in Eq.(9) ensures that the architecture parameters are non-negative. While cannot be zero due to the exponential function, in Eq.(10) can be zero through the thresholding operation, allowing to be zero as well. The term acts as a threshold. The thresholding operation can be interpreted as follows: if the strength of component i within a competition group is lower than the overall strength, it is dropped from the competition. Note that the scalar parameter in Eq. (10), which determines the magnitude of the threshold, is not a hyper-parameter but is automatically determined during training. Mathematically, the thresholding operator is not differentiable, but this is not an issue in modern deep learning frameworks, considering the support of relu as a built-in differentiable function. Additionally, the , denoted as, is simply the sum of the elements of

Similarly, a signed architecture parameter can be formulated as:

where are free scalar parameters as above. The gradient of the sign function is zero almost everywhere, but the equation can be rewritten as:

where . The gradient can be computed separately depending on whether the value is positive or not. In our understanding, modern deep learning tools, such as TensorFlow and PyTorch, already handle the sign function in this manner and we do not need to manually implement the conditional statement.

The proximal operator of Eq.(7) and the proposed thresholding operations of Eq.(10) and Eq.(12) share a similar form, but they have completely different meanings. The proximal operator is a learning rule used in the sparse regularization with -norm. On the other hand, the proposed thresholding operations are the parameterized forms of the architecture parameters, integrated into neural networks, and are not necessarily bound to a specific norm.

Sparsity Regularizer In our proposed approach, the regularized objective function is defined as:

where a represents a vector of architecture parameters. Regularizing the architecture parameter a enables the learning or sparsification of a network structure.

Our proposed method is not limited to a specific norm for the regularization term and can accommodate various norms. For instance, the -norm (Eq.2) can be used to remove individual components, while the -norm (Eq.4) can be employed to zero-out a group of components or an entire module (see Figures 1 and 2). Note that we do not need to manually implement different update rules as in the proximal gradient approach. Instead, we only need to rewrite a regularization term in an objective function. Example codes are provided in the supplementary material for reference.

For another example, the commonly used -norm is unsuitable for the architecture parameters a when they are normalized using the softmax function as in Eq. (11). The norm of the normalized architecture parameters is always 1, i.e.,. Therefore, it is more appropriate to employ p-norm with p < 1:

where the second equality holds as is always non-negative. It is important to highlight that a closed-form solution for the proximal operator of a p-norm with p < 1 remains elusive, rendering the proximal gradient method inapplicable. Nevertheless, since the regularization term is differentiable almost everywhere, our proposed approach can be directly employed.

Rectified Gradient Flow If in Eq.(10) or in Eq.(12) falls below the threshold, the gradient will be zero and the corresponding component will not receive any learning signals. Although a component might recover if the importance scores of other components decrease further, we take steps to ensure that the architecture parameters of dropped components continue to receive learning signals. This is accomplished by approximating the gradient of the thresholding function using the straight-through estimator (Hinton 2012; Bengio, L´eonard,

and Courville 2013; Hubara et al. 2016). Similar to the approach in (Xiao, Wang, and Rajasekaran 2019), where the gradient of a step function was approximated using the gradients of leaky relu or soft plus, we suggest using the elu (Clevert, Unterthiner, and Hochreiter 2016) activation function to estimate the gradients of relu. During the forward pass, we employ the relu activation function, while in the backward pass, we utilize elu. For example, when non-negative, the gradient of Eq.(12) is approximated as:

This approach leads to a learning mechanism similar to Discovering Neural Wiring (Wortsman, Farhadi, and Rastegari 2019). We will further discuss it in the section of the experiment and Appendix.

Application and Experiment

We demonstrate the effectiveness of the proposed approach in learning a sparse structure. It is important to note that our intend is not to compete with previous methods but rather to showcase the feasibility and potential of our proposed approach as the first fully differentiable sparsification method.

Channel Pruning and Sparse Batch Normalization Network-slimming(NS) (Liu et al. 2019) prunes unimportant channels in convolutional layers by leveraging the scaling factors in batch normalization. Let be the input and output, respectively, of batch normalization for channel i. Then, the operation can be written as

where denotes the mean and standard deviation of input activations, respectively, are scale and shift parameters, and is a small constant for numerical stability. The affine transformation can be re-written as

By forcing to be zero, the corresponding channel can be removed. NS trains a network with -regularization on a to identify insignificant channels. After the initial training, channels with small values of a are pruned and subsequently retrained. The scale parameter can be regarded as an importance score or an architecture parameter. In our approach, we parameterize the scale parameter using Eq. (12) and train a network with using SGD method without pruning and fine-tuning.

Preliminary Experiment. We conduct preliminary experiments on CIFAR10/100 (Krizhevsky 2009) to validate our implementation. For our experiments, we utilize ResNet (He et al. 2016a,b) as a base network architecture. We use a standard data augmentation scheme, which includes random shifting and flipping, following the approach used in (He et al. 2016a,b). In NS, in Eq.(13) is fixed to ing ratios are varied. On the other hand, in our approach, we vary the value of to induce different levels of sparsity. In

Table 1: Performance on CIFAR10, ResNet-164.

both approaches, we applied -regularization (weight de- cay) alongside -regularization.

In NS, pre-trained models are obtained by training networks for 160 epochs with an initial learning rate of 0.1. The learning rate is divided by 10 at 50% and 75% of the total number of training epochs. After the training process is complete, channels with small values of a are removed, and a slimmed network is re-trained for an additional 160 epochs with the same settings as in the initial training, but the learned weights are not re-initialized. In our approach, networks are simply trained for 320 epochs without fine-tuning or re-training. In NS, the scale factor is initialized as 0.5. In our approach, we set

so that

Table 1 presents the experimental results on CIFAR10. Experiments on CIFAR100 can be found in Appendix. Each experiment was run 5 times, and the table displays the average and standard deviation from these repetitions. Our proposed method is referred to as Differentiable Sparsification (DS). We controlled the value of to achieve a similar computational cost as NS. In the table, Channel and Layer Sparsity indicate the percentage of removed channels and layers, where larger values are preferable. It shows that the depth of DNNs can also be learned by removing redundant channels. The shortcuts across layers ensures that back-propagation is not impeded (Wen et al. 2016). NS has fewer chances of removing an entire layer. This can be attributed to the practice that NS simply prunes channels globally across all layers, without considering the inter-relationships among channels during the pruning procedure.

Exploring Diverse Sparsity Patterns. By simply switching from one norm to another for the regularization term, our method can achieve different sparsity patterns. To demonstrate this, we trained DenseNet-K12 with 40 layers (Huang, Liu, and van der Maaten 2017) with on a of Eq.(17). In the -norm case, we created a group of 12 channels to learn layer-wise connections. In this experiment, we intentionally omitted weight decay on Eq.(12) to isolate the effects of the

Figures 1 and 2 illustrate the learned sparse structures. Each row corresponds to a set of inputs to a hidden layer. Fig.1 shows the sparsified structures of the hidden layers in a channel-wise manner. The pixel-like thin short strip represents the magnitude of a scale parameter in the sparse

Figure 1: Channel-wise view. DenseNet-40-K12.

Figure 2: Group-wise view. DenseNet-40-K12.

batch normalization. Fig.2 shows the sparsified structures in a group-wise manner. Each square block in Fig. 2 represents a group of 12 channels. One block is added at a time as a layer proceeds from top to bottom since each hidden layer outputs 12 new channels. An input layer generates 24 channels, so the first row has 2 blocks. The number of surviving channels is represented by the color of the block, and the brightness is proportionate to the number of non-zero channels within a group. The figures clearly illustrate that the norm acts on individual channels, and the removed channels are irregularly spread. On the contrary, the -norm tends to either preserve an entire group or remove all the channels in a group. Table 2 shows the sparsity rates and classification performances for the proposed method (DS). The two experiments with -norm exhibit similar channel sparsity ratios, but their group sparsity ratios are notably different.

Table 2: Performance comparison of -norm on CIFAR10, DenseNet-40-K12.

Table 3: Knowledge Distillation. CIFAR10, ResNet-164.

Knowledge Distillation. Knowledge distillation enables a student network to leverage a teacher’s knowledge (Hinton, Vinyals, and Dean 2015). With the knowledge distillation, we can create a compact and high-performing model. A distillation loss is used as the prediction loss in Eq.(13):

where Lce denotes the cross-entropy loss, is the softmax function, and zare the logit outputs of a student and a teacher network, respectively. A student network utilizes -norm and the sparse batch normalization of Eq.(17). On the other hand, a teacher network uses the normal batch normalization of Eq.(16).

Following the practice of Hinge (Li et al. 2020), we set a fixed balancing factor and a temperature T = 4 in Eq.(18) and trained a teacher network on CIFAR10/100 and ImageNet (Deng et al. 2009) for 320 and 90 epochs, respectively. Li et al. fine-tuned a student network for a maximum of 500 epochs. In our case, we trained a student network on CIFAR10/100 and ImageNet2012 for 400 and 500 epochs, respectively. For CIFAR10/100, we reduced the learning rate by a factor of 10 at 240 and 320 epochs. For ImageNet2012, we reduced the learning rate at 430, 460, and 490 epochs and we gradually increased the value of from an initial value of to a final value of span of n = 430 epochs (see Appendix for details).

Tables 3, 4 and 5 present the experimental results on CIFAR10 and ImageNet2012. Experimental results on CIFAR100 are given in Appendix. For CIFAR10/100, our proposed method was tested five times, and the reported values in the tables are averages of these five repetitions. For ImageNet2012, each experiment was conducted once. These tables provide the remaining percentage of FLOPs

Table 4: Knowledge Distillation. CIFAR10, DenseNet-40.

Table 5: Knowledge Distillation. ImageNet, ResNet-50.

and parameters for the sparsified or pruned models, where lower values are preferable. The performance metrics of other methods are referenced from their original works. It is important to note that due to varying compression ratios and different learning settings across methods, direct comparisons might not be straightforward. Hence, the presented results should be interpreted as reference points. Despite its relatively simple concept and implementation, our approach demonstrates comparable performances with other methods.

Discovering Neural Wiring

Discovering Neural Wiring (DNW) (Wortsman, Farhadi, and Rastegari 2019) relaxes the notion of layers and treats channels as nodes in a neural graph. By learning connections between nodes, DNW simultaneously discovers the structure and learns the parameters of a neural graph. The input to node v, denoted as y, is expressed as:

where xdenotes the state of a preceding node, E represents an edge set and is the connection weight of an edge. The neural graph’s structure is determined by selecting a subset of edges. During each iteration of the training process, DNW identifies the top k edges with the highest magnitudes, named as the real edge set, while the remaining edges constitute the hallucinated edge set. During the

Table 6: Discovering Neural Wiring. Performance on CIFAR-10

forward pass, only real edges are utilized. DNW allows the weights in both sets to change throughout training, enabling a hallucinated edge to replace a real edge if it becomes strong enough. While the weights of real edges are updated using standard SGD, those of hallucinated edges follow a specialized learning rule: the gradient flows to hallucinated edges, but it does not pass through them. This distinction ensures that the hallucinated edges contribute to the network’s architecture discovery without directly affecting the other parameter updates.

In our proposed method, the architecture parameters correspond to the edges in DNW. We parameterize these edges using Eq. (12),

and train the network with -norm on the edges to induce sparsity. Along with the sparse parametrization, the rectified gradient flow (RGF) of Eq.(15) yields an update rule which is similar to that of DNW. However, unlike DNW, there is no need to keep track of real and hallucinated edge sets. Instead, we directly optimize an objective function using the approximated gradients. Implementing RGF is straightforward and the reference code is provided in the supplementary material.

We conduct experiments on CIFAR10/100 and Ima-geNet2012. Our base model is MobileNetV1 (Howard et al. 2017), and our implementation closely follows that of DNW. The model is trained for 160 and 250 epochs on CIFAR10/100 and ImageNet2012, respectively. An initial learning rate is set to 0.1 and scheduled using cosine annealing. To determine the sparsity level for DNW and our proposed method, we vary the value of , respectively. For RGF, the hyper-parameter of elu, which controls the extents of saturation for negative inputs, is set to 0.1.

Table 6 presents the experimental results on CIFAR10. Experimental results on CIFAR100 are given in Appendix. Our approach is denoted as DS. We also include results for DS without RGF, which corresponds to DNW-No Update, where the magnitudes of hallucinated edges are not updated. Even without RGF, the performance of our proposed method is close to that of DNW with the update rule. We further conducted experiments using proximal gradients in Eq.(3) and Eq.(7), denoted as PG in Table 6. PG--norm also em-

Table 7: Discovering Neural Wiring. ImageNet.

ploys the -norm as a regularizer, but it did not achieve the same level of performance as our approach. PG-uses the update rule of Eq.(7), which has a similar shape to our proposed sparse parameterization of Eq. (12), but its performance is inferior to ours. Table 7 presents the experimental results on ImageNet2012.

Learning Sparse Affinity Matrix

We employ the proposed approach to learn the sparse structure of an adjacency matrix in a graph convolutional network (GCN) (Kipf and Welling 2017). A GCN block or layer is defined (see Fig. 3) as

where A is an adjacency matrix; are an input feature and a weight matrix for layer l, respectively; and F is a nonlinear activation function. In general, A is non-negative and shared across GCN blocks. It is obtained by normalization. For example, is an unnormalized adjacency matrix; and is a diagonal matrix, where . The adjacency matrix repre- sents the connections or relationships between nodes on a graph and is usually given by prior knowledge. Learning the value of amounts to determining the relationship between nodes i and j. If the value of is zero, the two nodes are thought to be unrelated.

As shown in Fig. 3, we define each row and column as a group. Row grouping creates competition between incoming edges, whereas column grouping creates competi-

Figure 3: Structure of a GCN block. Each block consists of a shared adjacency, an input feature, and a weight matrix. Each row and column of an adjacent matrix are treated as groups to learn the relationships between a node and its neighbors.

tion between out-going edges. Each row and column of unnormalized adjacency matrix can be parameterized similarly as in of in Eq. (10):

where To convert into a doubly stochastic matrix, we employ an iterative balanced normalization, which can be considered as a variant of Sinkhorn normalization (Sinkhorn 1964; Sinkhorn and Knopp 1967; Knight 2008). We iteratively apply the following equations after initializing

where are diagonal matrices with , respectively. Despite be- ing an iterative process, this normalization is differentiable. Through numerical experiments, we have observed that the balanced normalization indeed makes doubly stochastic. However, we currently lack a theoretical justification for this observation, leaving the mathematical proof as an open question for future research.

As competition groups are created in row- and columnwise approaches, a regularized objective function can be written as

where is the size of square matrix A, and th row and column vector of A, respectively. We employ the -norm (Eq. (14)) with p = 0.5 as the regularization function R, which encourages the sparsity and competition within the groups. Note that may not work for this since the sum of a column and a row is equal to one.

To validate our proposed method, we applied the proposed GCN for traffic speed prediction in a road network. The traffic speed data were collected from 170 road segments, resulting in an adjacency matrix of size . We predict a

Table 8: Traffic speed prediction with GCN.

one-step-ahead observation from the previous eight observations. The output layer generates 170 estimates, one for each road segment. We employed the mean relative error (MRE) as the prediction loss for training. Additional details regarding the experimental data and our GCN model can be found in Appendix.

Two baseline models were used. For the first baseline, we set were adjacent to each other, otherwise. For the second baseline, no connectivity information was given. We set for all i, j, regardless of the actual connections. For the proposed method, we parameterized the adjacency matrix with the sparsification technique and employed regularizer. Balanced normalization was applied to all cases. To measure the learned relationship (LR) between nodes, we propose the following scoring function:

where denotes the element- wise product, and if the geodesic distance between node i and j is less than or equal to k, whereas otherwise. The maximum value is 1, and the minimum is 0. For example, the first baseline always takes the maximum value because its adjacency matrix has exactly the same structure as . We calculated the scores for k = 1 and 2 .

The experimental results are presented in Table 8. Each experiment was conducted five times, and the median among the five lowest validation errors was selected for analysis. In Baseline I, road connectivity information is provided, and the number of non-zero elements in the adjacency matrices is 878. For Baseline II, road connectivity information is not given, resulting in non-zero elements. Baseline II learned a mapping between input and target values by solely reducing the prediction loss and failed to learn the relationships between nodes. In contrast, the proposed model learned sparse relationships between nodes.

Conclusion

In this study, we proposed a fully differentiable sparsification method that can simultaneously learn the sparsified structure and weights of deep neural networks. Our proposed method is versatile in that it can seamlessly be integrated into different types of neural networks and used to address a wide range of problems and scenarios.

References

Ahmed, K.; and Torresani, L. 2017. Connectivity learning in multi-branch networks. arXiv:1709.09582.

Alvarez, J. M.; and Salzmann, M. 2016. Learning the Number of Neurons in Deep Networks. In NIPS.

Bengio, Y.; L´eonard, N.; and Courville, A. 2013. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv:1308.3432.

Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2016. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In ICLR.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and FeiFei, L. 2009. ImageNet: A large-scale Hierarchical Image Database. In CVPR.

Frankle, J.; and Carbin, M. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR.

Hassibi, B.; Stork, D. G.; and Wolff, G. J. . 1993. Optimal Brain Surgeon and general network pruning. In ICNN.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep Residual Learning for Image Recognition. In CVPR.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016b. Identity Mappings in Deep Residual Networks. In ECCV.

He, Y.; Liu, P.; Wang, Z.; Hu, Z.; and Yang, Y. 2019. Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration. In CVPR.

Hinton, G. 2012. Neural networks for machine learning. Coursera, video lectures.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network.

Hou, Z.; Qin, M.; Sun, F.; Ma, X.; Yuan, K.; Xu4, Y.; Chen, Y.-K.; Jin, R.; Xie, Y.; and Kung, S.-Y. 2022. CHEX: CHannel EXploration for CNN Model Compression. In CVPR.

Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861.

Huang, G.; Liu, Z.; and van der Maaten, L. 2017. Densely Connected Convolutional Networks. In CVPR.

Huang, Z.; and Wang, N. 2018. Data-driven Sparse Structure Selection for Deep Neural Networks. In ECCV.

Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Binarized Neural Networks. In NeurIPS.

Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.

Knight, P. A. 2008. The sinkhorn-knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1).

Krizhevsky, A. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS.

LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4): 541–551.

LeCun, Y.; Denker, J. S.; and Solla, S. A. 1989. Optimal Brain Damage. In NIPS.

Li, Y.; Gu, S.; Mayer, C.; Gool, L. V.; and Timofte, R. 2020. Group Sparsity: The Hinge Between Filter Pruning and Decomposition for Network Compression. In CVPR.

Liu, H.; Simonyan, K.; and Yang, Y. 2019. DARTS: Differentiable Architecture Search. In ICLR.

Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; and Zhang, C. 2017. Learning Efficient Convolutional Networks through Network Slimming. In ICCV.

Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; and Darrell, T. 2019. Rethinking the Value of Network Pruning. In ICLR.

Ma, N.; Zhang, X.; Zheng, H.-T.; and Sun, J. 2018. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In ECCV.

Parikh, N.; and Boyd, S. 2014. Proximal Algorithms. Found. Trends Optim., 1(3): 127–239.

Simonyan, K.; and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR. Sinkhorn, R. 1964. A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematicalstatistics, 35(2): 876–879.

Sinkhorn, R.; and Knopp, P. 1967. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2).

Song Han, J. T., Jeff Pool; and Dally, W. 2015. Learning both weights and connections for efficient neural network. In NIPS.

Sui, Y.; Yin, M.; Xie, Y.; Phan, H.; Zonouz, S.; and Yuan, B. 2021. CHIP: CHannel Independence-based Pruning for Compact Neural Networks. In NeurIPS.

Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2016. Learning Structured Sparsity in Deep Neural Networks. In NIPS.

Wortsman, M.; Farhadi, A.; and Rastegari, M. 2019. Discovering Neural Wirings. In NeurIPS.

Xiao, X.; Wang, Z.; and Rajasekaran, S. 2019. AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters. In NeurIPS.

Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; and He, K. 2017. Aggregated Residual Transformations for Deep Neural Networks. In CVPR.

Yoon, J.; and Hwang, S. J. 2017. Combined Group and Exclusive Sparsity for Deep Neural Networks. In ICML.

Zhao, C.; Ni, B.; Zhang, J.; Zhao, Q.; Zhang, W.; and Tian, Q. 2019. Variational Convolutional Neural Network Pruning. In CVPR.

Zhou, Y.; Jin, R.; and Hoi, S. C. 2010. Exclusive Lasso for Multi-task Feature Selection. In AISTATS.

Appendix

Gradual Sparsity Regularization

In the early stages of training, it is difficult to determine the importance of parameters because they are randomly initialized. A gradual scheduling of provides a warm-up stage that prevents premature parameter dropping. Inspired by the gradual pruning of (Zhu and Gupta 2017), we smoothly increase the value of from an initial value to a final value over a span of n epochs. Starting at epoch , we increase the value at every epoch using the following formula:

where t denotes an epoch index.

Channel Pruning and Sparse Batch Normalization

Preliminary Experiment. Tables 1 and 2 present additional experimental results on CIFAR10/100 (Krizhevsky 2009). In Network-slimming (NS) (Liu et al. 2019), is fixed to and pruning ratios are varied. Conversely, in our approach (DS), we vary the value of to achieve different degrees of sparsity. In the tables, we ranged the values of for CIFAR10 and from 0.0000425 to 0.0001250 for CIFAR100. In both approaches, we utilized -regularization (weight decay) in conjunction with -regularization on the scaling factors or architecture parameters. The weight of -regularization is set to 0.00005 for the scaling factor a in NS and 0.00001 for the free architecture parameters in our method (DS).

Table 1: Performance on CIFAR-10, ResNet-164.

Table 2: Performance on CIFAR-100, ResNet-164.

Exploring Diverse Sparsity Patterns. As discussed in the main paper, our method can achieve different sparsity patterns by simply switching from one norm to another for the regularization term. An implementation example with TensorFlow is given in Listing 1. Figures 1 and 2 illustrate the learned sparse structures on CIFAR100/DenseNet40-K12. The figures clearly show that the -norm acts on individual channels, and the removed channels are irregularly spread. On the contrary, the tends to either preserve an entire group or remove all the channels in a group. Table 3 shows the sparsity rates and classification performances for the proposed method (DS). The experiments with -norm exhibit similar channel sparsity ratios, but their group sparsity ratios are notably different. In this experiment, we intentionally omitted weight decay on to isolate the effects of the

Figure 1: Sparsified Connections in Channel-wise view. DenseNet-40-K12. CIFAR100.

Figure 2: Sparsified Connections in Group-wise view. DenseNet-40-K12. CIFAR100.

Table 3: Performance comparison of -norm on CIFAR100, DenseNet-40-K12.

Knowledge Distillation. Table 4 is a comparison table for CIFAR100/ResNet-164, omitted in the main paper due to the page limit. Tables 5, 6 and 7 present additional experimental results on CIFAR10/100 and ImageNet2012 (Deng et al. 2009) for our proposed approach. For CIFAR10/100, the reported values in the tables are averages of five repetitions. In the case of ImageNet2012, each experiment was conducted once. To achieve varying degrees of sparsity, we experimented with different values of . In the tables, we ranged the values of for CIFAR10, from 0.000100 to 0.000225 for CIFAR100 and from 0.000175 to 0.000300 for ImageNet2012. Tables 8 and 9 present additional experimental results on

Listing 1: Example code for Sparsity Regularization

CIFAR10/100 with DenseNet-40-K12. In the tables, we ranged the values of for CIFAR10, and from 0.000050 to 0.000400 for CIFAR100.

Table 4: Knowledge Distillation. CIFAR100. ResNet-164.

Table 5: Knowledge Distillation. Performance on CIFAR10. ResNet-164.

Rectified Gradient Flow and Discovering Neural Wiring Code. Listing 2 shows the TensorFlow implementation of the rectified gradient flow. The learning mechanism can be toggled on and off by simply switching from tf.nn.relu to rgf relu, or vice versa.

Table 6: Knowledge Distillation. Performance on CIFAR100. ResNet-164.

Table 7: Knowledge Distillation. ImageNet2012. ResNet-50.

Table 8: Knowledge Distillation. Performance on CIFAR10. DenseNet-40-12K.

Table 9: Knowledge Distillation. Performance on CIFAR100. DenseNet-40-12K.

Connection to DNW. Assume that an output y of a neuron in a hidden layer is written as

n i=1

where x denotes an input from a preceding layer, wrepresents the model parameters for component is an edge or an architecture parameter. A loss function can be denoted

Let be a function of

which is simplified for this analysis. The gradients can be written as

a

Ifin Eq. (2) is zero and is not updated. However, if elu (Clevert, Unterthiner, and Hochreiter 2016) is employed in the backward pass, an approximate gradient for can be generated and can be updated. Since Eq. (3) and (4) when, they do not receive a learning signal through even if the rectified gradient flow is adopted. This results in a similar learning mechanism proposed in DNW (Wortsman, Farhadi, and Rastegari 2019), where the gradient flows to zeroed-out (hallucinated) edges but not through them.

Experiment. Tables 10 and 11 provide additional experimental results on CIFAR10/100 for our proposed method (DS). Following the practice of DNW, we employed MobileNetV(Howard et al. 2017) as a base model. In DNW, the value of k was chosen such that the final learned model has similar Mult-Adds as the base model. We set the value of in the same manner. We ran each experiment 5 times and show the average and standard deviation.

Learning Sparse Affinity Matrix Part of this section will be revised with appropriate citations upon official publication of the paper.

Traffic Speed Data. A summary of the data used in the main paper is shown in Table 12. There were approximately 70, 000 probe taxis operating in the metropolitan area where the data were collected. As the taxis operated in three shifts, more than 20, 000 probes on average ran at any given point in time. The traffic speed data covers 4, 663 links in the area, which includes major arterials. Speed data from probes in the metropolitan area were aggregated every 5 minutes for each road segment; if no probe passed a link during a 5-minute period, the speed was estimated based on speeds in the previous time periods or from the same time periods in the past. After preprocessing, the raw data were once again aggregated in 15-minute periods to maximize forecasting utility. A 15-minute period is the standard on which the highway capacity manual is based. As no information was

Table 10: Performance on CIFAR10, Discovering Neural Wiring.

Figure 3: The gray lines represent the 170 road links where the experimental data were collected.

available regarding probe counts, which contribute to averaging speed data, three speeds for a 5-minute period were averaged without weighting. Among the 4, 663 links for which speed data were available, we selected 170 links in the subarea (see Fig. 3). We expected that the speed data collected in this region would be free from missing observations as it is the busiest region in the metropolitan area, and most taxi drivers congregate in this region to find passengers. Following the generic conventions of machine learning, we used data from the previous eight months to train the proposed model and reserved the data for the latter two months to test the trained model.

Table 11: Performance on CIFAR100, Discovering Neural Wiring.

Table 12: Summary of Experimental Data

GCN Prediction Model. Figure 4 illustrates our GCN implementation. The outputs of hidden GCN blocks are concatenated and fed to an output block. This was motivated by the dense connections of DenseNet (Huang, Liu, and van der Maaten 2017). The structures of the hidden GCN blocks are illustrated in the main paper. We trained the model for 500 epochs using Adam (Kingma and Ba 2015) with the defaulting parameter setting of TensorFlow. The initial learning rate was set to 0.0005 and multiplied by 0.5 at epochs 400 and 450.

A training loss is defined by the mean relative error (MRE)

where N is the number of road segments and denote an estimate and actual future observation of travel speed on road segment i, respectively. We ran each experiment five times and selected the median among the five lowest validation errors. We reported the test error from the epoch with the median validation MRE. The report error was re-written in the table with mean absolute percentage error (MAPE) for accessability purposes; MAPE is defined by multiplying MRE by 100.

Figure 4: Graph convolutional neural network (GCN) structure. The GCN in our experiments consists of 5 GCN blocks. The output feature maps of GCN blocks are concatenated and then fed to an output block. Input and Output blocks consist of 3 fully connected layers.

References

Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2016. Fast and Accurate Deep Network Learning by Exponential Linear

Units (ELUs). In ICLR.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale Hierarchical Image Database.

In CVPR.

Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. MobileNets:

Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861.

Huang, G.; Liu, Z.; and van der Maaten, L. 2017. Densely Connected Convolutional Networks. In CVPR. Huang, Z.; and Wang, N. 2018. Data-driven Sparse Structure Selection for Deep Neural Networks. In ECCV.

Kingma, D. P.; and Ba, J. L. 2015. Adam: A method for stochastic optimization. In ICLR. Krizhevsky, A. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report.

Li, Y.; Gu, S.; Mayer, C.; Gool, L. V.; and Timofte, R. 2020. Group Sparsity: The Hinge Between Filter Pruning and Decom-

position for Network Compression. In CVPR.

Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; and Darrell, T. 2019. Rethinking the Value of Network Pruning. In ICLR.

Wortsman, M.; Farhadi, A.; and Rastegari, M. 2019. Discovering Neural Wirings. In NeurIPS.

Zhu, M.; and Gupta, S. 2017. To Prune, or not to Prune: Exploring the Efficacy of Pruning for Model Compression. arXiv:1710.01878.

Designed for Accessibility and to further Open Science