IGCV2: Interleaved Structured Sparse Convolutional Neural Networks

2018·Arxiv

Abstract

In this paper, we study the problem of designing efficient convolutional neural network architectures with an interest in eliminating the redundancy in convolution kernels. In addition to structured sparse kernels, low-rank kernels and the product of low-rank kernels, the product of structured sparse kernels, which is a framework for interpreting the recently developed interleaved group convolutions (IGC) and their variants (e.g., Xception), has been attracting increasing interest.

Motivated by the observation that the convolutions contained in a group convolution in IGC can be further decomposed in the same manner, we present a modularized building block, IGCV2: interleaved structured sparse convolutions. It generalizes interleaved group convolutions, which are composed of two structured sparse kernels, to the product of more structured sparse kernels, further eliminating the redundancy. We present the complementary condition and the balance condition to guide the design of structured sparse kernels, obtaining a balance among three aspects: model size, computation complexity and classification accuracy. Experimental results demonstrate the advantage in the balance among these three aspects compared to interleaved group convolutions and Xception, and competitive performance compared to other state-of-the-art architecture design methods.

1. Introduction

Deep convolutional neural networks with small model size and low computation cost, but still high accuracy, have become an urgent demand, especially for mobile devices. The efforts include (i) network compression: compress the pretrained model by decomposing the convolutional kernel matrix or removing connections or channels to eliminate redundancy, and (ii) architecture design: design small kernels or sparse kernels, or use the product of less-redundant kernels to approximate a single kernel, and train the networks from scratch.

Our study lies in architecture design using the product of less-redundant kernels for composing a kernel. There are two main lines: multiply low-rank kernels (matrices) to approximate a high-rank kernel, e.g., bottleneck modules [7], and multiply sparse matrices, which has attracted research efforts recently [46, 13, 2] and is the focus of our work.

We point out that the recently-developed algorithms, such as interleaved group convolution [46], deep roots [13], and Xception [2], compose a dense kernel using the product of two structured-sparse kernels. We observe that one of the two kernels can be further approximated. For example, the kernel in Xception and deep roots can be approximated by the product of two block-diagonal sparse matrices. The suggested secondary group convolution in interleaved group convolutions contains two branches and each branch is a convolution, which similarly can be further approximated. This is able to further reduce the redundancy.

Motivated by this, we design a building block, IGCV2: Interleaved Structured Sparse Convolution, as shown in Figure 1, which consists of successive group convolutions. This block is mathematically formulated as multiplying structured-sparse kernels, each of which corresponds to a group convolution. We introduce the complementary condition and the balance condition, so that the resulting convolution kernel is dense and there is a good balance among three aspects: model size, computation complexity and classification performance. Experimental results demonstrate the advantage of the balance among these three aspects compared to interleaved group convolutions and Xception, and competitive performance compared to other state-of-the-art architecture design methods.

2. Related Work

Most existing technologies design efficient and effective convolutional kernels using various forms with redundancy eliminated, by learning from scratch or approximating pretrained models. We roughly divide them into low-precision kernels, sparse kernels, low-rank kernels, product of low-rank kernels, and product of structured sparse kernels.

Figure 1. IGCV2: the Interleaved Structured Sparse Convolution. The matrices $W_l$ (denoted as solid arrows) are sparse block matrices corresponding to group convolutions. The matrices $P_l$ (denoted as dashed arrows) are permutation matrices. The resulting composed kernel is ensured to satisfy the complementary condition, which guarantees that for each output channel, there exists one and only one path connecting the output channel to each input channel. The bold line connecting gray feature maps shows such a path.

Low-precision kernels. There exist redundancies in the weights in convolutional kernels represented by float numbers. The technologies eliminating such redundancies include quantization [48, 6], binarization [4, 32], and trinarization [23, 49, 50]. Weight-shared kernels in which some weights are equal to the same value, are in some sense low-precision kernels.

Sparse kernels. Sparse kernels, or namely sparse connections, mean that some weights are nearly zero. The efforts along this path mainly lie in how to perform optimization, and the technologies include non-structure sparsity regularization [26, 31], and structure sparsity regularization [42, 30]. The scheme of structure sparsity regularization is more friendly for hardware acceleration and storage. Recently, group convolutions, adopted in [44, 47] are essentially structured-sparse kernels. Different from sparsity regularization, the sparsity pattern of group convolution is manually pre-defined.

Low-rank kernels. Small filters, e.g., $3\times3$ kernels replacing $5\times5$ kernels, reduce the ranks in the spatial domain. Channel pruning [27] and filter pruning [43, 24, 28] compute low-rank kernels in the output channel domain and the input channel domain, respectively.

Composition from low-rank kernels. Using a pair of $1\times3$ and $3\times1$ kernels to approximate a $3\times3$ kernel [14, 15, 29] is an example of using the product of two small (low-rank) filters. Tensor decomposition uses the product of low-rank/small tensors (matrices) to approximate the kernel in the tensor form along the spatial domain [5, 15], or the input and output channel domains [5, 17, 15]. The bottleneck structure [7], if the intermediate ReLUs are removed, can be viewed as the low-rank approximation along the output channel domain.

Composition from sparse kernels. Interleaved group convolution [46] consists of two group convolutions, each of which corresponds to a structured-sparse kernel (the sizes are the same as that of the kernel to be approximated for the convolutions). Satisfying the complementary property [46] ensures that the resulting composite kernel is dense. Xception [2] can be viewed as an extreme case of interleaved group convolutions: one group convolution is degraded to a regular convolution and the other one is a channel-wise convolution, an extreme group convolution. Deep roots [13] instead uses the product of a structured-sparse kernel and a dense kernel. Our approach belongs to this category and shows a better balance among model size, computation complexity and classification accuracy.

3. Our Approach

The operation in a convolution layer in convolutional neural networks relies on a matrix-vector multiplication operation at each location:

$$y = W x. \tag{1}$$

Here the input $x$, corresponding to a patch around the location in the input channels, is an $SC_i$-dimensional vector, with $S$ being the kernel size (e.g., $S = 3 \times 3$) and $C_i$ being the number of input channels. The output $y$ is a $C_o$-dimensional vector, with $C_o$ being the number of output channels. $W$ is formed from $C_o$ convolutional kernels and each row corresponds to a convolutional kernel. For presentation clarity, we assume $C_i = C_o = C$, but all the formulations can be generalized to $C_i \neq C_o$.
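As a small illustration of this matrix-vector view, the following NumPy sketch flattens a bank of kernels into a matrix and a single input patch into a vector, then checks the product against a direct sum. The sizes and variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, k = 4, 4, 3            # assumed toy sizes; S = k * k spatial positions
S = k * k

kernels = rng.standard_normal((C_out, C_in, k, k))   # one kernel per output channel
patch = rng.standard_normal((C_in, k, k))            # input patch around one location

# Each row of W is one flattened convolutional kernel; x is the flattened patch.
W = kernels.reshape(C_out, C_in * S)
x = patch.reshape(C_in * S)
y = W @ x                                            # response at this location

# Reference: direct sum over input channels and spatial offsets.
y_ref = np.array([(kernels[o] * patch).sum() for o in range(C_out)])
```

Both `y` and `y_ref` hold the same numbers up to floating-point error, which is the sense in which a convolution reduces to a matrix-vector product at each location.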

3.1. A Review of IGC, Xception and Deep Roots

We show that recent architecture design algorithms, Xception [2], deep roots [13], and interleaved group convolutions (IGC) [46], compose a dense convolution matrix $W$ by multiplying possibly sparse matrices:

$$y = P^2 W^2 P^1 W^1 x, \tag{2}$$

where $W^1$ and $W^2$ are both block-wise sparse, or at least one of them is, $P^i$ is a permutation matrix that is used to reorder the channels, and $W = P^2 W^2 P^1 W^1$ is a dense matrix.

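Interleaved group convolutions. The interleaved group convolution block consists of primary and secondary group convolutions. The corresponding kernel matrices $W^1$ and $W^2$ are block-wise sparse,

$$W^i = \begin{bmatrix} W^i_1 & 0 & \cdots & 0 \\ 0 & W^i_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & W^i_{G_i} \end{bmatrix}, \tag{3}$$

where $W^i_g$ ($i = 1$ or $2$) is the kernel matrix over the corresponding channels in the $g$th branch, and $G_i$ is the number of branches in the $i$th group convolution. In the case suggested in [46], the primary group convolution is a group $3\times3$ convolution, $G_1 = C/2$, and $W^1_g$ is a matrix of size $2 \times (2S)$. The secondary group convolution is a group $1\times1$ convolution, $G_2 = 2$, and $W^2_1$ and $W^2_2$ are both dense matrices of size $\frac{C}{2} \times \frac{C}{2}$.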

Xception. The Xception block consists of a $1\times1$ convolution layer followed by a channel-wise convolution layer. It is pointed out that the order of the two layers has no effect. For convenience, we discuss below the form with the $1\times1$ convolution put as the second operation. $W^2$ is a dense matrix of size $C \times C$. $W^1$ is a sparse block matrix of size $C \times (SC)$, a degraded form of the matrix shown in Equation 3: there are $C$ blocks and each block $W^1_g$ is degraded to a row vector of size $S$.

Deep roots. In deep roots, $W^2$ is a dense matrix of size $C \times C$, i.e., corresponding to a $1\times1$ convolution, while $W^1$ is a sparse block matrix as shown in Equation 3, corresponding to a group convolution.

Complexity. The computation complexity of Equation 2 is $O(|W^1|_0 + |W^2|_0)$ (with the complexity of the permutations ignored), where $|W^i|_0$ is the number of non-zero entries of $W^i$. The sparse block matrices, as given in Equation 3, are storage friendly, and the storage/memory cost is also $O(|W^1|_0 + |W^2|_0)$.
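The composition and its cost can be sketched in NumPy. The example below assumes a toy width of 8 channels, a channel-major layout of the input vector, a primary group convolution with two channels per branch and a secondary one with two branches; the helper `block_diag` and all variable names are illustrative, and positive random weights are used only so that no entry cancels to zero.

```python
import numpy as np

def block_diag(blocks):
    """Place dense blocks along the diagonal of an otherwise zero matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

rng = np.random.default_rng(0)
C, S = 8, 9                                   # assumed toy width, 3x3 kernels

# Primary group convolution: C/2 branches, each mapping 2 channels (2*S inputs) to 2 outputs.
W1 = block_diag([rng.uniform(0.5, 1.5, (2, 2 * S)) for _ in range(C // 2)])
# Secondary group convolution: 2 branches, each a dense (C/2) x (C/2) 1x1 convolution.
W2 = block_diag([rng.uniform(0.5, 1.5, (C // 2, C // 2)) for _ in range(2)])

# Interleave: send the two outputs of every primary branch to different secondary branches.
perm = np.arange(C).reshape(C // 2, 2).T.reshape(-1)   # [0, 2, 4, 6, 1, 3, 5, 7]
P1 = np.eye(C)[perm]
P2 = np.eye(C)[np.argsort(perm)]                       # restore the original channel order

W = P2 @ W2 @ P1 @ W1                                  # composed kernel, C x (S*C)
sparse_cost = np.count_nonzero(W1) + np.count_nonzero(W2)
dense_cost = W.size
```

In this toy setting the composed kernel has no zero entry, while the two sparse factors store 176 weights against 576 for the dense matrix.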

3.2. Interleaved Structured Sparse Convolutions

Our approach is motivated by two observations: (i) the block $W^2_g$ in Equation 3 and the $1\times1$ convolution in Xception are dense and can be composed by multiplying sparse matrices, thus further eliminating the redundancy and saving the storage and time cost; and (ii) such a process can be repeated more times.

The proposed Interleaved Structured Sparse Convolution (IGCV2) is mathematically formulated as follows,

$$y = P_L W_L P_{L-1} W_{L-1} \cdots P_1 W_1 x \tag{4}$$

$$\;\; = \Big( \prod_{l=L}^{1} P_l W_l \Big) x. \tag{5}$$

Here, $P_l W_l$ is a sparse matrix. $P_l$ is a permutation matrix whose role is to reorder the channels, so that $W_l$ is a sparse block matrix, as given in Equation 3, and corresponds to the $l$th group convolution, where the numbers of channels in all the branches are in our work set to be the same, equal to $K_l$, for easy design.

Construct a dense composed kernel matrix. We introduce the following complementary condition, which is generalized from interleaved group convolutions [46], as a rule for constructing the L group convolutions such that the resulting composed convolution kernel matrix is dense.

Condition 1 (Complementary condition) $\forall m$, $\big(\prod_{l=L}^{m} P_l W_l\big)$ corresponds to a group convolution and $\big(\prod_{l=m-1}^{1} P_l W_l\big)$ also corresponds to a group convolution. The two group convolutions are thought complementary if the channels lying in the same branch in one group convolution lie in different branches and come from all the branches in the other group convolution.

Here is a sketch showing that an interleaved structured sparse convolution block satisfying the complementary condition is dense. The proof is based on two points: (i) for a group convolution, any channel output from a branch is connected to the channels input to this branch, and any channel input to a branch is connected to the channels output from this branch; (ii) for two complementary group convolutions, the channels output from any branch of the second group convolution are connected to the channels input to the corresponding branch, which come from all the branches of the first group convolution. As a result, the channels output from an IGCV2 block are connected to all the channels input to the block, i.e., the IGCV2 kernel is dense.
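The argument can be checked numerically on connectivity masks. The sketch below is one possible construction, not necessarily the one used in the experiments: channels are indexed by mixed-radix digits, the $l$th group convolution connects channels that differ only in digit $l$, and the permutations are folded into the masks. The function name `igcv2_masks` and the branch sizes are assumptions for illustration.

```python
import numpy as np
from itertools import product

def igcv2_masks(Ks):
    """0/1 connectivity masks of L group convolutions (permutations folded in).

    Channels are indexed by digits (d_1, ..., d_L) with d_l < K_l. The l-th
    group convolution connects two channels iff they agree on every digit
    except possibly digit l, so each of its branches holds K_l channels.
    """
    digits = np.array(list(product(*[range(K) for K in Ks])))   # shape (C, L)
    masks = []
    for l in range(len(Ks)):
        others = np.delete(digits, l, axis=1)
        same = (others[:, None, :] == others[None, :, :]).all(axis=2)
        masks.append(same.astype(np.int64))
    return masks

Ks = [2, 3, 4]                       # assumed channels per branch, K_1..K_L
masks = igcv2_masks(Ks)
C = int(np.prod(Ks))                 # number of channels

# paths[i, j] counts the distinct paths from input channel j to output channel i.
paths = np.eye(C, dtype=np.int64)
for M in masks:                      # the first group convolution is applied first
    paths = M @ paths

nonzeros = [int(M.sum()) for M in masks]
```

With these branch sizes every entry of `paths` equals one: the composed kernel is dense and each input-output pair is joined by a single path, while the $l$th mask holds only $C K_l$ non-zero entries.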

Let us look at the relation between the number of channels, $C$, and the number of channels in the branches of the $L$ group convolutions, $\{K_1, K_2, \ldots, K_L\}$. We analyze the relation according to Equation 5: (i) An input channel is connected to $K_1$ intermediate channels output by the first group convolution. (ii) Let $C_{l-1}$ be the number of intermediate channels output by the $(l-1)$th group convolution to which an input channel is connected. The complementary condition indicates that through the $l$th group convolution an input channel is connected to exactly $C_{l-1} K_l$ intermediate channels output by the $l$th group convolution. (iii) Finally, an input channel is connected to exactly $\prod_{l=1}^{L} K_l$ channels output from the $L$ group convolutions. Since the composed kernel is dense, we have

$$\prod_{l=1}^{L} K_l = C. \tag{6}$$

Because of the complementary property, there is no wasted connection: there is only one path between each input channel and each output channel. Besides, the complementary condition is a sufficient condition for a dense composed kernel matrix, not a necessary one.

When is the number of parameters the smallest? We further analyze when the number of parameters with $L$ group convolutions, as given in Equation 5, satisfying the complementary condition, is the smallest.

We have that the number of parameters in the $l$th group convolution is $C K_l$ for the $1\times1$ convolutions, and $S C K_l$ for the spatial (e.g., $3\times3$) convolution. It is easily shown that, to consume fewer parameters, there should be only one group spatial convolution and all the others should be $1\times1$. The spatial convolution may lie in any group convolution, and without affecting the analysis, we assume it lies in the first group convolution. Thus, the number of total parameters $Q$, with the smaller number of parameters in the permutation matrices ignored, is:

$$Q = C \sum_{l=2}^{L} K_l + C S K_1. \tag{7}$$

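According to Jensen's inequality, we have

$$Q = C \sum_{l=2}^{L} K_l + C S K_1 \tag{8}$$

$$\;\; \geq C L \Big( S K_1 \prod_{l=2}^{L} K_l \Big)^{1/L} \tag{9}$$

$$\;\; = C L (SC)^{1/L}. \tag{10}$$

Here, the equality from the second line to the third line holds because of Equation 6. The equality in the second line holds, i.e., $Q = CL(SC)^{1/L}$, when the following balance condition is satisfied,

$$S K_1 = K_2 = \cdots = K_L. \tag{11}$$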

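Furthermore, let us see the choice of $L$ yielding the smallest amount of parameters ($Q = CL(SC)^{1/L}$) while guaranteeing a dense composed kernel. We present a rough analysis by considering the derivative of $Q$ with respect to $L$:

$$\frac{dQ}{dL} = C (SC)^{1/L} \Big( 1 - \frac{\log(SC)}{L} \Big).$$

When the derivative is zero, $\frac{dQ}{dL} = 0$, we have that $Q$ is the minimum if

$$L = \log(SC).$$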
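To make the choice of $L$ concrete, the sketch below evaluates the balanced parameter count for a range of depths. It assumes the count takes the form $Q = CL(SC)^{1/L}$ implied by the Jensen bound, uses the natural logarithm, ignores the fact that the branch sizes $K_l$ must be integers, and picks $C = 64$ and $S = 9$ purely for illustration.

```python
import math

def balanced_params(C, S, L):
    """Parameter count C * L * (S*C)**(1/L) reached when the balance condition holds."""
    return C * L * (S * C) ** (1.0 / L)

C, S = 64, 9                                   # assumed width and 3x3 kernel size
L_star = math.log(S * C)                       # continuous minimiser (natural log)
best_L = min(range(1, 21), key=lambda L: balanced_params(C, S, L))
dense_params = S * C * C                       # a single dense 3x3 convolution
```

For these values the continuous minimiser is about 6.36 and the best integer depth is 6, where the count is a small fraction of the 36,864 weights of the dense kernel.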

Examples. We take an example: separate the convolution along the spatial domain and the channel domain to construct the IGCV2 block. The first group convolution is an extreme group convolution, a channel-wise $3\times3$ convolution, followed by several group $1\times1$ convolutions. This can be regarded as decomposing the $1\times1$ convolution in Xception into group $1\times1$ convolutions. In this case, the balance condition becomes $K_2 = K_3 = \cdots = K_L$, for which the amount of parameters is the smallest. As we empirically validate in Section 4.3, under the same number of parameters, an IGCV2 block satisfying such a balance condition leads to the maximum width and consistently superior performance: the best or nearly the best. This consistency differs from the observation in [46] and might stem from the fact that the balance condition is applied only to group convolutions and that there is no coupling with spatial convolutions.
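A rough parameter comparison for this example can be sketched as follows. It assumes $L = 3$ (a channel-wise $3\times3$ convolution followed by two group $1\times1$ convolutions), so that the dense-kernel relation $\prod_l K_l = C$ with $K_1 = 1$ gives $K_2 = K_3 = \sqrt{C}$; the function names and the width of 64 are illustrative choices.

```python
def xception_params(C, S=9):
    # channel-wise spatial convolution (C*S) plus a dense 1x1 convolution (C*C)
    return C * S + C * C

def igcv2_params(C, Ks, S=9):
    # channel-wise spatial convolution plus group 1x1 convolutions,
    # the l-th of which has K channels per branch and hence C*K weights
    return C * S + sum(C * K for K in Ks)

C = 64
K = round(C ** 0.5)                       # balanced branch size for L = 3
same_width = (xception_params(C), igcv2_params(C, [K, K]))

# Widest balanced IGCV2 block (perfect-square widths only) that fits in the
# parameter budget of the width-64 Xception block.
budget = xception_params(C)
widest = max(k * k for k in range(1, 64) if igcv2_params(k * k, [k, k]) <= budget)
```

At equal width the IGCV2 block uses 1,600 weights against 4,672, and within the same budget it can be widened from 64 to 121 channels, which is the width advantage the text refers to.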

We also study the construction from interleaved group convolutions: each submatrix in Equation 3 in the secondary group convolution corresponds to a (dense) convolution over a subset of channels, and thus can be further decomposed into group convolutions. The first group convolution is still a group convolution (rather than a channel-wise one). Consequently, the balance condition given in Equation 11 is deduced from the coupling of convolutions over the spatial and channel domains, which does not lead to consistency between the width increase and the performance gain and complicates the analysis. This is empirically validated in our experiments in Section 4.4. Thus, we suggest separating the convolution along the spatial and channel domains and designing an IGCV2 over the channel domain.

3.3. Discussions

Non-structured sparse kernels. There is a possible extension: remove the structured sparsity requirement, i.e., replace the group convolution by a non-structured sparse kernel, and introduce the dense constraint (the composed kernel is dense) and the sparsity constraint. This potentially results in better performance, but leads to two drawbacks: the optimization is difficult and non-structured sparse matrices are not storage-friendly.

Complementary condition. The complementary condition is a sufficient condition guaranteeing that the resulting composed kernel is dense. It should be noted that it is not a necessary condition. Moreover, it is also not necessary that the composed kernel is dense, and further sparsifying the connections, which remains as future work, might be beneficial. The complementary condition is an effective guide for designing the group convolutions.

Sparse matrix multiplication and low-rank matrix multiplication. Low-rank matrix (tensor) multiplication or decomposition has been widely studied in matrix analysis [21, 22] and applied to network compression and network architecture design. In comparison, sparse matrix (tensor) multiplication or decomposition is rarely studied in matrix analysis. Future work includes applying sparse matrix decomposition to compress convolutional networks, combining low-rank and sparse matrices together (low-rank sparse matrix multiplication or decomposition), and so on.

Figure 2. Illustrating how the complementary condition affects the performance on CIFAR-100 in our approach. K denotes the number of channels in each branch and C denotes the width of the network. With L fixed, the composed kernel is denser with a larger K. The red bar corresponds to the case in which the complementary condition is the most satisfied. The best performances corresponding to the red bar or the bars immediately near to the red bar show that the complementary condition is reasonable for IGCV2 design.

4. Experiment

4.1. Datasets and Training Settings

CIFAR. The CIFAR datasets [18], CIFAR-10 and CIFAR-100, are subsets of the 80 million tiny images [40]. Both datasets contain $32\times32$ color images with 50000 images for training and 10000 images for test. The CIFAR-10 dataset consists of 10 classes, each of which contains 6000 images. There are 5000 training images and 1000 testing images per class. The CIFAR-100 dataset consists of 100 classes, each of which contains 600 images. There are 500 training images and 100 testing images per class. The standard data augmentation scheme we adopt is widely used for these datasets [7, 12, 20, 11, 19, 25, 33, 37, 38]: we first zero-pad the images with 4 pixels on each side, and then randomly crop them to produce $32\times32$ images, followed by horizontally mirroring half of the images. We normalize the images by using the channel means and standard deviations.
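The augmentation just described can be sketched in NumPy as below. This is a minimal illustration rather than the pipeline used in the experiments: the function names, the channel-last layout and the random stand-in data are assumptions, and the statistics would normally be computed over the training set.

```python
import numpy as np

def augment(image, rng, pad=4):
    """Zero-pad each side by `pad` pixels, take a random crop of the original
    size, and mirror the crop horizontally with probability 0.5."""
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    return crop

def normalize(images, mean, std):
    """Subtract per-channel means and divide by per-channel standard deviations."""
    return (images - mean) / std

rng = np.random.default_rng(0)
batch = rng.random((8, 32, 32, 3))               # stand-in for 32x32 colour images
mean = batch.mean(axis=(0, 1, 2))
std = batch.std(axis=(0, 1, 2))
augmented = np.stack([augment(img, rng) for img in batch])
out = normalize(augmented, mean, std)
```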

Tiny ImageNet. The Tiny ImageNet dataset is a subset of ImageNet [34]. The images are resized to $64\times64$. There are 200 classes, sampled from the 1000 classes of ImageNet, and 500 training images, 50 validation images and 50 testing images per class. In our experiment, we adopt the following data augmentation scheme: scale up the training images randomly to a size within [64, 80], randomly crop a $64\times64$ patch for training, randomly mirror it horizontally, and normalize the cropped images by subtracting the channel means and dividing by the channel standard deviations.

Figure 3. Illustrating how the number of layers L affects the performance on CIFAR-100. The number of channels in each branch of group convolution is chosen to satisfy both the balance condition and the (nearly) complementary condition, and to keep the #params the same. The maximum accuracy is achieved at some L, in which the width and the non-sparsity degree reach a balance.

Training settings. For CIFAR, we adopt the same training settings as [46]. We use SGD with Nesterov momentum to update the network, starting from a learning rate of 0.1 and multiplying it by a factor of 0.1 at 200 epochs, 300 epochs and 350 epochs. Weight decay is set to 0.0001 and momentum to 0.9. We train the network with a batch size of 64 for 400 epochs and report the accuracy at the final iteration. The implementation is based on Caffe [16]. For Tiny ImageNet, we use training settings similar to those for CIFAR, except that we train for 200 epochs in total and multiply the learning rate by a factor of 0.1 at 100 epochs, 150 epochs and 175 epochs. To adapt Tiny ImageNet to the networks designed for CIFAR, we set the stride of the first convolution layer to 2, which is adopted in [10] as well.
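The step schedule can be written as a small helper. The sketch assumes 0-indexed epochs and that each drop takes effect from the stated epoch onward, which the text does not specify; the function name `step_lr` is illustrative.

```python
def step_lr(epoch, base_lr=0.1, milestones=(200, 300, 350), gamma=0.1):
    """Learning rate at a given 0-indexed epoch under a step schedule."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops

cifar_schedule = [step_lr(e) for e in range(400)]
tiny_imagenet_schedule = [step_lr(e, milestones=(100, 150, 175)) for e in range(200)]
```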

4.2. Empirical Analysis

Table 1. Illustrating the architectures of networks we used in the experiments. x is the number of channels at the first stage. B is the number of blocks and the skip connection is added every two blocks. […] channel-wise convolution with the channel number being x. L and K are the hyper-parameters of IGCV2. […] denotes the $(L-1)$ group convolutions with each branch containing K channels. For IGCV2 (Cx), L = 3, and for IGCV […]

Figure 4. Illustrating how the performance changes when our network goes deeper or wider. We use the network structure IGCV2* (Cx) in Table 1 and conduct the experiments with various depths (denoted as D) and widths (denoted as C). Both going wider and going deeper increase the performance and the benefit of going wider is greater than going deeper.

Complementary condition. We empirically investigate how the complementary condition affects the performance over the 8-layer network without down-sampling, where there are 6 intermediate layers in addition to the first convolutional layer and the last FC layer. We study an IGCV2 building block, which consists of $L$ group convolutions: a channel-wise $3\times3$ convolution and $(L-1)$ group $1\times1$ convolutions with each branch containing $K$ channels. We conduct the studies over such blocks, where the IGCV2 is only applied to $1\times1$ convolutions, to remove the possible influence of the coupling with spatial convolutions. The results are given in Figure 2 with various values of $K$ and $L$ under almost the same number of parameters.

An IGCV2 block satisfying the complementary condition leads to a dense kernel with a large width under the same number of parameters. The red bars in Figure 2 depict the results for the IGCV2 blocks that nearly satisfy this condition. The blue bars in Figure 2 show the results of the blocks that correspond to dense kernels but do not satisfy the complementary condition, thus with more redundancy. It can be seen that the networks with the IGCV2 blocks (nearly) satisfying the complementary condition (the red bars and the bars next to the red bar in Figure 2 (a) and (c)) achieve the best performance. We also notice that the red bar in Figure 2 (b), satisfying the complementary condition, performs better than the best results in Figure 2 (a) and (c).

In addition, we look at how the sparsity (density) of the composed convolution kernel affects the performance. The results correspond to the green bars in Figure 2. With L being fixed, the kernel is sparser with a smaller K. It can be seen that a denser kernel leads to higher performance. In Figure 2 (a), the performance for the block with K = 8, which nearly satisfies the complementary condition, is slightly better than, or almost the same as, that for the denser kernel with K = 12. The reason might be that, although the kernel for K = 8 is sparser, it corresponds to a larger width.

The effect of L. We empirically show how the number L of group convolutions affects the performance under the same number of parameters, taking the CIFAR-100 dataset as an example. We still use the 8-layer network and study the IGCV2 block, which contains a channel-wise $3\times3$ spatial convolution and $(L-1)$ group $1\times1$ convolutions satisfying the balance condition.

Figure 3 shows the accuracy curve, the width curve, and the non-sparsity curve, where the non-sparsity value is the ratio of the number of non-zero parameters to the size of the resulting dense kernel matrix. It can be seen that the function of accuracy w.r.t. L is concave, the function of width w.r.t. L is concave, and the function of non-sparsity w.r.t. L is convex. The accuracy depends on both the width and the non-sparsity degree. When width becomes larger, accuracy might be higher. On the other hand, accuracy might be lower when non-sparsity degree becomes smaller. The black dashed line denotes an extreme case that the width is the largest and the non-sparsity is the smallest, the performance however is not the best. Instead, the maximum accuracy is achieved at some L, in which the width and the non-sparsity degree reach a balance denoted by the black solid line.
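The non-sparsity value defined above can be computed directly from the block layout. The sketch below assumes a block made of a channel-wise $3\times3$ convolution followed by group $1\times1$ convolutions and, as a simplification, holds the width fixed at 64 instead of holding the parameter count fixed as in the experiments; the function name and branch sizes are illustrative.

```python
def non_sparsity(C, Ks, S=9):
    """Non-zero parameters of the block divided by the size S*C*C of the
    dense kernel that the block composes."""
    nonzero = C * S + sum(C * K for K in Ks)   # channel-wise spatial + group 1x1 convolutions
    return nonzero / (S * C * C)

# Branch sizes of the group 1x1 convolutions for L = 2, 3, 4 at width 64,
# each chosen so that their product equals the width.
layouts = {2: [64], 3: [8, 8], 4: [4, 4, 4]}
ratios = {L: non_sparsity(64, Ks) for L, Ks in layouts.items()}
```

At fixed width the ratio shrinks as more group convolutions are used, which is the non-sparsity side of the trade-off against width discussed above.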

Table 2. Classification accuracy comparison between Xception and IGCV2 over the 8-layer network with various widths. The number of parameters and FLOPs are calculated within each block.

Table 3. Classification accuracy comparison between Xception, IGC, and our network under various widths with depth fixed as 20.

Deeper and wider networks. We also conduct experiments to explore how the performance changes when our network goes deeper and wider. In this study, the IGCV2 block is composed of a channel-wise convolution and group convolutions.

We study the performance over the networks with identity mappings as skip connections, where we replace the regular convolution in the residual network with our IGCV2 building block. We do experiments on IGCV2* (Cx) in Table 1, where Cx means that the network width C in the first stage is x. There are two types of experiments. (1) Go wider: we fix the depth as 20 (B = 6) and vary the width among {112, 136, 160, 192, 256}. (2) Go deeper: we fix the width as 64 and vary the depth among {38, 50, 62, 74, 98}. The results are shown in Figure 4 (a) and (b). One can see that both going wider and going deeper increase the network performance, and the benefit of going wider with a relatively small depth is greater than that of going deeper, which is consistent with the observation for regular convolutions.

4.3. Comparison With Xception

We empirically show the comparison with Xception using various widths and various depths under roughly the same number of parameters.

Table 4. Classification accuracy comparison between Xception and our network under various depths.

Varying the width. We first conduct the experiments over the 8-layer network without down-sampling, where the 6 intermediate convolutional layers (all except the first convolutional layer and the last FC layer) are replaced with Xception blocks and our IGCV2 blocks. Our IGCV2 block is composed of a channel-wise $3\times3$ convolution similar to Xception, and two group $1\times1$ convolutions corresponding to the $1\times1$ convolution in Xception. The comparison on CIFAR-10 and CIFAR-100 is given in Table 2. It can be seen that our networks consistently perform better than Xception on both CIFAR-10 and CIFAR-100 datasets. In particular on CIFAR-100, our networks achieve at least 2% improvement. The reason might be that our network using the IGCV2 block is wider, which improves the performance.

In addition, we report the results over 20-layer networks with various widths. The networks we used are IGCV2* (Cx) and Xception (Cx) illustrated in Table 1, and we fix the channel number at the first convolutional layer in Xception as 35. The results are presented in Table 3. We can see that our network, with fewer parameters, performs better than Xception, which shows the effectiveness of the IGCV2 block.

Varying the depth. We also compare the performances between Xception and our network IGCV2 with various depths: 8, 20, 26. The width C in Xception is fixed as 35, and the width in our network is set to 64 in order to keep the number of parameters smaller than that of Xception. For IGCV2 (C64), the number of channels in each branch is the same within a stage and differs across the three stages. To satisfy the complementary condition, the numbers of channels per branch in the three stages are set to 8, 16 and 32, respectively. The results over CIFAR-100 and Tiny ImageNet are given in Table 4. Our IGCV2 (C64) network performs better than Xception (C35) with smaller numbers of parameters and smaller or similar computation complexity. For example, when the depth is 26, our network, with fewer parameters and less computation complexity, gets 56.32% accuracy on Tiny ImageNet, about 1% better than the 55.39% of Xception.

4.4. Comparison With IGC

Varying the width. Similar to the comparison with Xception, we perform a comparison with IGC [46] (denoted by IGCV1 in experiments) over the simple networks and over 20-layer networks with different widths. The results are presented in Table 5 and Table 3 respectively, in which the observation is consistent with that from the comparison to Xception.

Table 5. Classification accuracy comparison between IGCV1 and our networks with two designs, IGCV2 I and IGCV2 II, over the 8-layer network with various widths. The number of parameters and FLOPs are calculated within each block.

Table 6. Classification accuracy comparison between IGCV1 and our network under various depths.

Let us look at the detailed results over the simple 8-layer network. The IGCV1 block is designed by following [46]: the primary group convolution contains two branches. We study two designs of our IGCV2 block. The first design follows the IGCV1 design manner: the first group convolution contains two branches and the other two group convolutions contain the same number of branches. In this case, […] (and […] for the other two bigger networks), which is far from the balance condition given in Equation 11. In the second design, we use a channel-wise convolution and two group convolutions. In this case, […] (and […] for the other two bigger networks), which satisfies the balance condition. The resulting networks are denoted as IGCV2 I and IGCV2 II respectively. The results given in Table 5 show (i) that the first design performs better than IGCV1 on CIFAR-100 and a little worse than IGCV1 on CIFAR-10, which might stem from different degrees of satisfaction of the balance condition, and (ii) that the second design performs the best, which stems from its better satisfaction of the balance condition.

In addition, the comparison over 20-layer network shown in Table 3 verifies that our network IGCV2 with smaller model size as well as less computation complexity, is able to achieve better performance under various widths.

Varying the depth. We also show the performance comparison between IGCV1 and our network under various depths: 8, 20, 26. The width C is set to 48 in IGCV1 (corresponding to IGC-L24M2 in [46]) and the width in our network is set to 80. The network structure can be seen in Table 1, and here the numbers of channels per branch in the three stages are set to 10, 16 and 20, respectively, to satisfy the complementary condition. The results over CIFAR-100 and Tiny ImageNet are given in Table 6, showing superior performance of our network over IGCV1, which demonstrates the effectiveness of the IGCV2 block.

Table 7. Illustrating the advantages of our networks for the small model cases through the classification error comparison to existing state-of-the-art architecture design algorithms.

4.5. Performance Comparison to Small Models

We show the advantages of our network with small models by comparing to existing state-of-the-art architecture design algorithms. The results are presented in Table 7. The observation is that our network achieves similar classification accuracies with a smaller model. For example, on CIFAR-100, the classification error of our approach with 0.65M parameters is 22.95%, while Swapout [36] reaches 22.72% with 7.4M parameters, FractalNet [19] reaches 23.30% with 38.6M parameters, WRN-40-4 [45] reaches 21.18% with 8.9M parameters and WRN-32-4 [10] reaches 23.55% with 7.4M parameters. On Tiny ImageNet, our network contains the smallest number of parameters, and achieves better performance compared to the reported results. On CIFAR-10, our network achieves competitive performance. In Table 7, DenseNet-BC (k = 12) [11], with more parameters, achieves lower classification error on CIFAR-100 and CIFAR-10 compared with IGCV2* (C416). Dense connection is a structure complementary to IGCV1, and we can also combine densely connected structure with IGCV1 to improve the performance. We believe that our approach potentially gets more improvement if dense connection and bottleneck design are exploited.

5. Comparison to MobileNet on ImageNet

Figure 5. (a) A block in MobileNetV1. (b) A nonlinear IGCV2 block.

We compare our approach to MobileNetV1 [9] and MobileNetV2 [35] on the ImageNet classification task [34]. We train all networks with SGD using the same hyperparameters (weight decay = 0.00004 and momentum = 0.9). The mini-batch size is 96, split across 4 GPUs (24 samples per GPU). We adopt the same data augmentation as in [9, 35]. We train the models for 100 epochs, with an extra 20 epochs for retraining, on MXNet [1]. We start from a learning rate of 0.045 and divide it by 10 every 30 epochs. We evaluate on a single center crop from an image whose shorter side is 256.
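As a rough illustration of the training setup described in this section, the following sketch computes a step-decay learning rate and the per-GPU batch size. The function names `step_lr` and `per_gpu_batch`, their parameters, and the default decay factor of 10 (a conventional choice for step schedules) are assumptions made for illustration only; the actual MXNet training code is not given here.

```python
def step_lr(epoch, base_lr=0.045, step_epochs=30, decay_factor=10.0):
    """Learning rate at a given (0-indexed) epoch under step decay.

    The rate starts at base_lr and is divided by decay_factor once
    every step_epochs epochs. decay_factor=10 is an assumed default.
    """
    return base_lr / (decay_factor ** (epoch // step_epochs))


def per_gpu_batch(total_batch, num_gpus):
    """Split a mini-batch evenly across GPUs."""
    if total_batch % num_gpus:
        raise ValueError("mini-batch size must be divisible by the GPU count")
    return total_batch // num_gpus


if __name__ == "__main__":
    # Learning rate at the start of each 30-epoch stage of a 100-epoch run.
    for epoch in (0, 30, 60, 90):
        print(epoch, step_lr(epoch))
    print(per_gpu_batch(96, 4))
```

Keeping the decay factor as a parameter makes it straightforward to try other schedules without changing the rest of the setup.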

5.1. IGCV2 vs. MobileNetV1

We form our network using the same pattern as MobileNetV1 [9]: the same number of blocks and no skip connections. In particular, we use a nonlinear IGCV2 block, which consists of a channel-wise convolution followed by two group convolutions and is illustrated in Figure 5. The dimension increase, if included, is conducted over the last group convolution. We adopt a loose complementary condition to form an IGCV2 block: each group convolution contains a prescribed number of branches (listed in Table 8), in which some channels in the same branch might still lie in the same branch in another group convolution. The descriptions of the IGCV2-1.0 and MobileNet-1.0 networks, which have the same number of parameters, are shown in Table 8. The results are given in Table 9. They demonstrate that IGCV2 is also effective on large-scale image datasets.
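To make the structural difference between the two blocks concrete, the sketch below counts convolution weights for a MobileNetV1-style block (channel-wise convolution followed by a dense 1×1 convolution) and for an IGCV2-style block (channel-wise convolution followed by two 1×1 group convolutions, with the dimension increase on the last one). The function names, the 3×3 channel-wise kernel, the 1×1 group convolution kernels, and the example channel and branch counts are assumptions for illustration; biases and batch normalization parameters are ignored.

```python
def group_conv_params(c_in, c_out, groups, kernel=1):
    """Weights in a group convolution (bias and batch norm ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return groups * (c_in // groups) * (c_out // groups) * kernel * kernel


def mobilenet_v1_block_params(c_in, c_out, kernel=3):
    """Channel-wise k x k convolution followed by a dense 1x1 convolution."""
    channelwise = group_conv_params(c_in, c_in, groups=c_in, kernel=kernel)
    pointwise = group_conv_params(c_in, c_out, groups=1)
    return channelwise + pointwise


def igcv2_block_params(c_in, c_out, branches_1, branches_2, kernel=3):
    """Channel-wise k x k convolution followed by two 1x1 group convolutions.

    The dimension increase (c_in -> c_out) is placed on the last group
    convolution, as described in the text.
    """
    channelwise = group_conv_params(c_in, c_in, groups=c_in, kernel=kernel)
    first = group_conv_params(c_in, c_in, groups=branches_1)
    last = group_conv_params(c_in, c_out, groups=branches_2)
    return channelwise + first + last


if __name__ == "__main__":
    # Hypothetical widths and branch counts, chosen only for the example.
    print(mobilenet_v1_block_params(64, 128))
    print(igcv2_block_params(64, 128, branches_1=8, branches_2=8))
```

Under these assumptions, replacing the dense 1×1 convolution with two group convolutions reduces the weight count for fixed channel widths.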

5.2. IGCV3 vs. MobileNetV2

We introduce an IGCV3 block, which combines low-rank convolution kernels (the bottleneck design) with IGCV2 and is illustrated in Figure 6. We adopt nonlinear IGCV3 blocks and form them with the loose complementary condition: each group convolution contains a prescribed number of branches (listed in Table 10). In the constructed networks, there is a skip connection for each block except the downsampling blocks, and two IGCV3 blocks correspond to one block in MobileNetV2 [35]. The comparison results are given in Table 11.
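The difference between a strict and a loose complementary condition can be illustrated with a small check on how channels are assigned to branches. The sketch below assumes a simple layout that is not taken from the paper: the first group convolution partitions channels into contiguous blocks and the second assigns them round-robin (an interleaving permutation). It then reports the largest number of channels that share a branch in both group convolutions. The function names and the example sizes are hypothetical, and the check covers only the "no two channels share a branch twice" part of the condition.

```python
from collections import Counter


def max_branch_overlap(num_channels, branches_1, branches_2):
    """Largest number of channels sharing a branch in both group convolutions.

    Assumed layout: contiguous partition for the first group convolution,
    round-robin (interleaved) assignment for the second.
    """
    if num_channels % branches_1 or num_channels % branches_2:
        raise ValueError("channel count must be divisible by both branch counts")
    size_1 = num_channels // branches_1
    pairs = Counter((c // size_1, c % branches_2) for c in range(num_channels))
    return max(pairs.values())


def is_strictly_complementary(num_channels, branches_1, branches_2):
    """True when no two channels lie in the same branch of both convolutions."""
    return max_branch_overlap(num_channels, branches_1, branches_2) <= 1


if __name__ == "__main__":
    print(max_branch_overlap(64, 8, 8))  # each branch pair shares one channel
    print(max_branch_overlap(64, 2, 2))  # many channels share both branches
```

When a branch of the first group convolution holds more channels than the second has branches, some overlap is unavoidable, which is the situation the loose condition permits.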

Figure 6. (a) A block in MobileNetV2. (b) A nonlinear IGCV3 block.

6. Conclusion

In this paper, we aim to eliminate the redundancy in convolution kernels and present an Interleaved Structured Sparse Convolution (IGCV2) block to compose a dense kernel. We present the complementary condition and the balance condition to guide the design and obtain a balance among model size, computation complexity and classification accuracy. Empirical results show the advantage over MobileNetV1 and MobileNetV2, and demonstrate that our network achieves performance similar to other network structures with a smaller model size.

Acknowledgement

We thank Ke Sun, Mingjie Li, and Depu Meng for their help with the experiments comparing our approach to MobileNetV1 and MobileNetV2.

References

[1] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.

[2] F. Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016.

[3] L. Cordeiro. Wide residual network for the tiny imagenet challenge.

Table 8. The IGCV2 network and MobileNetV1. For the blocks in each line, the table lists the number of input and output channels and the number of branches in the two group convolutions; it also marks the blocks that have no channel-wise convolution.

Table 9. A comparison of MobileNetV1 and IGCV2 on ImageNet classification. 1.0, 0.5, 0.25 are width multipliers.

[4] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

[5] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, pages 1269–1277, 2014.

[6] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.

[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[8] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.

[9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[10] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109, 2017.

[11] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.

[12] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, pages 646–661, 2016.

[13] Y. Ioannou, D. P. Robertson, R. Cipolla, and A. Criminisi. Deep roots: Improving CNN efficiency with hierarchical filter groups. CoRR, abs/1605.06489, 2016.

[14] Y. Ioannou, D. P. Robertson, J. Shotton, R. Cipolla, and A. Criminisi. Training cnns with low-rank filters for efficient image classification. CoRR, abs/1511.06744, 2015.

[15] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.

[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[17] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. CoRR, abs/1511.06530, 2015.

[18] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[19] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. CoRR, abs/1605.07648, 2016.

[20] C. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.

[21] J. Lee, S. Kim, G. Lebanon, and Y. Singer. Local low-rank matrix approximation. In ICML (2), pages 82–90, 2013.

[22] J. Lee, S. Kim, G. Lebanon, Y. Singer, and S. Bengio. LLORMA: local low-rank matrix approximation. Journal of Machine Learning Research, 17:15:1–15:24, 2016.

[23] F. Li, B. Zhang, and B. Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

[24] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016.

[25] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.

[26] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In CVPR, pages 806–814. IEEE, 2015.

[27] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. In ICCV, pages 2755–2763, 2017.

Table 10. The IGCV3 network and MobileNetV2. t is the expansion factor. For the blocks in each line, the table lists the number of input and output channels and the number of branches in the two group convolutions; it also marks the blocks that have no channel-wise convolution.

Table 11. A comparison of MobileNetV2 and IGCV3 on ImageNet classification. 0.7, 1.0 are width multipliers. Columns: network, #Params (M), FLOPs (M), accuracy.

[28] J. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, pages 5068–5076, 2017.

[29] F. Mamalet and C. Garcia. Simplifying convnets for fast learning. In ICANN, pages 58–65, 2012.

[30] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.

[31] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey. Faster cnns with direct sparse convolutions and guided pruning. 2016.

[32] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016.

[33] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.

[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.

[35] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. CoRR, abs/1801.04381, 2018.

[36] S. Singh, D. Hoiem, and D. A. Forsyth. Swapout: Learning an ensemble of deep architectures. In NIPS, pages 28–36, 2016.

[37] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.

[38] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, pages 2377–2385, 2015.

[39] S. Targ, D. Almeida, and K. Lyman. Resnet in resnet: Generalizing residual architectures. CoRR, abs/1603.08029, 2016.

[40] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(11):1958–1970, 2008.

[41] J. Wang, Z. Wei, T. Zhang, and W. Zeng. Deeply-fused nets. CoRR, abs/1605.07716, 2016.

[42] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, pages 2074–2082, 2016.

[43] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li. Coordinating filters for faster deep neural networks. arXiv preprint arXiv:1703.09746, 2017.

[44] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016.

[45] S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.

[46] T. Zhang, G. Qi, B. Xiao, and J. Wang. Interleaved group convolutions for deep neural networks. CoRR, abs/1707.02725, 2017.

[47] L. Zhao, J. Wang, X. Li, and Z. Tu. Deep convolutional neural networks with merge-and-run mappings. CoRR, abs/1611.07718, 2016.

[48] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.

[49] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

[50] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
