Convolution with even-sized kernels and symmetric padding

2019·arXiv

Abstract

1 Introduction

Deep convolutional neural networks (CNNs) have achieved significant successes in numerous computer vision tasks such as image classification [37], semantic segmentation [43], image generation [8], and game playing [29]. Other than domain-specific applications, various architectures have been designed to improve the performance of CNNs [3, 12, 15], wherein the feature extraction and representation capabilities are mostly enhanced by deeper and wider models containing ever-growing numbers of parameters and operations. Thus, the memory overhead and computational complexity greatly impede their deployment in embedded AI systems. This motivates the deep learning community to design compact CNNs with reduced resources, while still retaining satisfactory performance.

Compact CNNs mostly derive generalization capabilities from architecture engineering. Shortcut connection [12] and dense concatenation [15] alleviate the degradation problem as the network deepens. Feature maps (FMs) are expanded by pointwise convolution (C1) and bottleneck architecture [35, 40]. Multi-branch topology [38], group convolution [42, 47], and channel shuffle operation [48] recover accuracy at the cost of network fragmentation [26]. More recently, there is a trend towards mobile models with <10M parameters and <1G FLOPs [14, 24, 26], wherein the depthwise convolution (DWConv) [4] plays a crucial role as it decouples cross-channel correlations and spatial correlations. Aside from human priors and handcrafted designs, emerging neural architecture search (NAS) methods optimize structures by reinforcement learning [49], evolution algorithm [32], etc.

Despite this progress, the fundamental spatial representation is dominated by 3×3 kernel convolutions (C3), and the exploration of other kernel sizes is stagnating. Even-sized kernels are deemed inferior and rarely adopted as basic building blocks for deep CNN models [37, 38]. Besides, most compact models concentrate on the inference efforts (parameters and FLOPs), whereas the training efforts (memory and speed) are neglected or even become more intractable due to complex topologies [24], expanded channels [35], and additional transformations [17, 40, 48]. With the growing demands for online and continual learning applications, the training efforts should be jointly addressed and further emphasized. Furthermore, recent advances in data augmentation [6, 46] have shown more powerful and universal benefits. A simpler structure combined with enhanced augmentations easily eclipses the progress made by intricate architecture engineering, inspiring us to rethink basic convolution kernels and the mathematical principles behind them.

In this work, we explore the generalization capabilities of even-sized kernels (2×2, 4×4). Direct implementation of these kernels encounters performance degradation in both classification and generation tasks, especially in deep networks. We quantify the phenomenon by an information erosion hypothesis: even-sized kernels have asymmetric receptive fields (RFs) that produce pixel shifts in the resulting FMs. This location offset accumulates when stacking multiple convolutions, thus severely eroding the spatial information. To address the issue, we propose convolution with even-sized kernels and symmetric padding on each side of the feature maps (C2sp, C4sp).

Symmetric padding not only eliminates the shift problem, but also extends the RFs of even-sized kernels. Various classification results demonstrate that C2sp is an effective decomposition of C3, saving 30%-50% of parameters and FLOPs. Moreover, compared with compact CNN blocks such as DWConv, inverted-bottleneck [35], and ShiftNet [40], C2sp achieves competitive accuracy with >20% speedup and >35% memory saving during training. In generative adversarial networks (GANs) [8], C2sp and C4sp both obtain improved image quality and stabilized convergence. This work opens a new perspective full of optional units for architecture engineering, and provides basic but effective alternatives that balance both the training and inference efforts.

2 Related work

Our method belongs to compact CNNs that design new architectures and then train them from scratch, whereas most network compression methods in the literature attempt to prune weights from a pre-trained reference network [9] or quantize weights and activations [16] to reduce inference efforts. Some recent advances also prune networks at the initialization stage [7] or quantize models during training [41]. The compression methods are orthogonal to compact architecture engineering and can be jointly implemented to further reduce memory consumption and computational complexity.

Even-sized kernel. Even-sized kernels are mostly applied together with stride 2 to resize images. For example, GAN models in [28] apply 4×4 kernels and stride 2 in the discriminators and generators to down-sample and up-sample images, which can avoid the checkerboard artifact [30]. However, the 3×3 kernel is preferred when it comes to deep and large-scale GANs [3, 19, 22]. Except for scaling, few works have implemented even-sized kernels as basic building blocks for their CNN models. In relational reinforcement learning [45], two C2 layers are adopted to achieve reasoning and planning of objects represented by 4 pixels. 2×2 kernels are tested in relatively shallow (about 10 layers) models [10], and the FM sizes are not strictly preserved.

Atrous convolution. Dilated convolution [43] supports exponential expansions of RFs without loss of resolution or coverage, which is specifically suitable for dense prediction tasks such as semantic segmentation. Deformable convolution [5] augments the spatial sampling locations of kernels with additional 2D offsets and learns the offsets directly from target datasets. Therefore, deformable kernels shift at the pixel level and focus on geometric transformations. ShiftNet [40] sidesteps spatial convolutions entirely by shift kernels that contain no parameters or FLOPs. However, it requires large channel expansions to reach satisfactory performance.

3 Symmetric padding

3.1 The shift problem

We start with the spatial correlation in basic convolution kernels. Intuitively, replacing a C3 with two C2s should provide performance gains aside from an 11% reduction of overheads, which is inspired by the factorization of C5 into two C3s [37]. However, experiments in Figure 3 indicate that the classification accuracy of C2 is inferior to C3 and saturates much faster as the network deepens. In addition, replacing each C3 with C4 also hurts accuracy even though a 3×3 kernel can be regarded as a subset of a 4×4 kernel, which contains 77% more parameters and FLOPs. To investigate this issue, the FMs of well-trained ResNet-56 [12] models with C2s are visualized in Figure 1. FMs of C4 and C3 behave similarly to those of C2 and C2sp, respectively, and are omitted for clarity. It is clearly seen that the post-activation (ReLU) values in C2 gradually shift toward the left-top corner of the spatial location. These compressed and distorted features are not suitable for the following classification, let alone pixel-level tasks built on them such as detection and semantic segmentation, where all the annotations will have offsets starting from the left-top corner of the image.

Figure 1: Normalized FMs derived from well-trained ResNet-56 models. Three spatial sizes (32, 16, and 8) before the down-sampling stages are presented. First row: Conv2×2 with asymmetric padding (C2). Second row: Conv2×2 with symmetric padding (C2sp). Left: a sample in the CIFAR10 test dataset. Right: average results from all the samples in the test dataset.
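The two overhead figures quoted in this subsection can be checked by counting weights per input-output channel pair. The worked arithmetic below assumes equal channel widths in every layer, which is a simplification made here for illustration.

```latex
% Two stacked 2x2 kernels versus one 3x3 kernel (same channel width):
\frac{2 \cdot 2^2}{3^2} = \frac{8}{9} \approx 0.89
\quad\Rightarrow\quad \text{about } 11\% \text{ fewer parameters and FLOPs.}

% One 4x4 kernel versus one 3x3 kernel:
\frac{4^2}{3^2} = \frac{16}{9} \approx 1.78
\quad\Rightarrow\quad \text{roughly } 77\% \text{ to } 78\% \text{ more parameters and FLOPs.}
```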

We identify this as the shift problem observed in even-sized kernels. For a conventional convolution between $c_i$ input and $c_o$ output FMs $F$ and square kernels $w$ of size $k \times k$, it can be given as

$$F^{o}(p) = \sum_{i=1}^{c_i} \sum_{\delta \in R} w_i(\delta) \cdot F^{i}(p + \delta), \qquad (1)$$

where $\delta$ and $p$ enumerate locations in RF $R$ and in FMs of size $H \times W$, respectively. When $k$ is an odd number, e.g., 3, we define the central point of $R$ as origin:

$$R = \{(-\kappa, -\kappa), (-\kappa, 1-\kappa), \ldots, (\kappa, \kappa)\}, \qquad \kappa = \lfloor k/2 \rfloor, \qquad (2)$$

where $\kappa$ denotes the maximum pixel number from four sides to the origin, and $\lfloor \cdot \rfloor$ is the floor rounding function. Since $R$ is symmetrical, we have

$$\sum_{\delta \in R} \delta = (0, 0).$$

When $k$ is an even number, e.g., 2 or 4, implementing convolution between $F^{i}$ and kernels $w_i$ becomes inevitably asymmetric since there is no central point to align. In most deep learning frameworks, this draws little attention and is obscured by pre-defined offsets. For example, TensorFlow [1] picks the nearest pixel in the left-top direction as the origin, which gives an asymmetric $R$:

$$R = \{(1-\kappa, 1-\kappa), (1-\kappa, 2-\kappa), \ldots, (\kappa, \kappa)\}. \qquad (3)$$

The shift occurs at all the spatial locations $p$ and is equivalent to padding one more zero on the bottom and right sides of FMs before convolutions. In contrast, Caffe [18] pads one more zero on the left and top sides. PyTorch [31] only supports symmetric padding by default, so users need to manually define the padding policy if desired.
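The asymmetry described above can be made concrete with a small sketch. It enumerates the RF offsets for a given kernel size, using the left-top origin convention for even sizes, and derives the per-axis zero padding that keeps the FM size at stride 1. The helper names `receptive_field`, `offset_sum`, and `same_padding` are hypothetical and only serve this illustration.

```python
def receptive_field(k):
    """Offsets (dy, dx) covered by a k x k kernel.

    Odd k: the origin is the central pixel.
    Even k: the origin is the nearest pixel in the left-top direction.
    """
    kappa = k // 2
    lo = -kappa if k % 2 == 1 else 1 - kappa
    return [(dy, dx) for dy in range(lo, kappa + 1) for dx in range(lo, kappa + 1)]


def offset_sum(rf):
    """Component-wise sum of all offsets; (0, 0) means the RF is symmetric."""
    return (sum(d[0] for d in rf), sum(d[1] for d in rf))


def same_padding(k):
    """(before, after) zeros per axis for a size-preserving stride-1 convolution.

    Any odd remainder goes to the 'after' side, i.e. bottom and right.
    """
    total = k - 1
    before = total // 2
    return before, total - before


for k in (2, 3, 4):
    print(k, offset_sum(receptive_field(k)), same_padding(k))
```

Only the odd kernel yields a zero offset sum and equal padding on both sides; the even kernels need one extra zero on the bottom and right.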

3.2 The information erosion hypothesis

According to the above, even-sized kernels make zero-padding asymmetric by 1 pixel and, on average (between two opposite directions), lead to 0.5-pixel shifts in the resulting FMs. The position offset accumulates when stacking multiple layers of even-sized convolutions, and eventually squeezes and distorts features toward a certain corner of the spatial location. Ideally, if such asymmetric padding is performed $n$ times in the TensorFlow style with convolutions in between, the resulting pixel-to-pixel correspondence of FMs will be

$$F_n\left[p - \left(\tfrac{n}{2}, \tfrac{n}{2}\right)\right] \leftarrow F_0(p). \qquad (4)$$
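The accumulation of the half-pixel shift can be observed in a toy setting. The sketch below is not the experiment from the paper: it assumes a fixed 2×2 box filter instead of trained weights, no nonlinearity, and a single impulse as input, and it tracks the center of mass of the FM after each layer. The function names are hypothetical.

```python
def conv2x2_tf_style(fm):
    """2x2 box filter with one zero row/column padded on the bottom/right only."""
    h, w = len(fm), len(fm[0])
    padded = [row + [0.0] for row in fm] + [[0.0] * (w + 1)]
    return [[0.25 * (padded[y][x] + padded[y][x + 1]
                     + padded[y + 1][x] + padded[y + 1][x + 1])
             for x in range(w)] for y in range(h)]


def center_of_mass(fm):
    """Intensity-weighted mean position (row, column) of a non-negative FM."""
    total = sum(sum(row) for row in fm)
    cy = sum(y * v for y, row in enumerate(fm) for v in row) / total
    cx = sum(x * v for row in fm for x, v in enumerate(row)) / total
    return cy, cx


size, n_layers = 17, 6
fm = [[0.0] * size for _ in range(size)]
fm[8][8] = 1.0  # single impulse at the center

centers = [center_of_mass(fm)]
for _ in range(n_layers):
    fm = conv2x2_tf_style(fm)
    centers.append(center_of_mass(fm))

for n, c in enumerate(centers):
    print(n, c)
```

In this toy case the center of mass moves by half a pixel toward the left-top corner per layer, so after $n$ layers it sits $n/2$ pixels away from where it started.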

Since FMs have finite size and are usually down-sampled to force high-level feature representations, the edge effect [2, 27] cannot be ignored, because zero-padding at the edges distorts the effective values of the FMs, especially in deep networks and small FMs. We hypothesize that the quantity of information $Q$ equals the mean L1-norm of the FM; successive convolutions with zero-padding to preserve the FM size will then gradually erode the information:

$$Q_n = \frac{1}{HW} \sum_{p} \left| F_n(p) \right|, \qquad Q_n < Q_{n-1}, \qquad (5)$$

where $F_n$ denotes the FM of size $H \times W$ after $n$ convolutions.

The information erosion happens recursively and is too complex to formulate analytically, so we directly derive FMs from deep networks that contain various kernel sizes. In Figure 2, 10k images of size 32×32 are fed into untrained ResNet-56 models where identity connections and batch normalizations are removed. Q decreases progressively, and faster for larger kernel sizes and smaller FMs. Besides, asymmetric padding in even-sized kernels (C2, C4) speeds up the erosion dramatically, which is consistent with the well-trained networks in Figure 1. An analogy is that an FM can be seen as a rectangular ice chip melting in water, except that it can only exchange heat on its four edges. The smaller the ice, the faster the melting process happens. Symmetric padding distributes thermal gradients equally so as to slow down the exchange, whereas asymmetric padding produces larger thermal gradients at a certain corner, thus accelerating it.
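The quantity Q and the edge effect can be illustrated with a deliberately simplified sketch. It is not the Figure 2 experiment: it assumes a fixed box filter in place of random network weights, an all-ones input, and no nonlinearity, so it only shows how zero-padding at the edges lowers the mean L1-norm layer by layer. All function names are hypothetical.

```python
def box_conv(fm, k, pad_before, pad_after):
    """k x k box filter with zero padding; pad_before + pad_after + 1 must equal k."""
    h, w = len(fm), len(fm[0])

    def at(y, x):
        # Positions outside the FM read as zero (zero-padding).
        return fm[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    return [[sum(at(y + dy, x + dx)
                 for dy in range(-pad_before, pad_after + 1)
                 for dx in range(-pad_before, pad_after + 1)) / (k * k)
             for x in range(w)] for y in range(h)]


def information(fm):
    """Q: mean L1-norm of the feature map."""
    return sum(abs(v) for row in fm for v in row) / (len(fm) * len(fm[0]))


def erosion_curve(size, k, pad_before, pad_after, n_layers):
    """Q after 0, 1, ..., n_layers box convolutions of an all-ones FM."""
    fm = [[1.0] * size for _ in range(size)]
    curve = [information(fm)]
    for _ in range(n_layers):
        fm = box_conv(fm, k, pad_before, pad_after)
        curve.append(information(fm))
    return curve


q_k3_large = erosion_curve(32, 3, 1, 1, 10)  # 3x3 kernel, 32x32 FM
q_k3_small = erosion_curve(8, 3, 1, 1, 10)   # 3x3 kernel, 8x8 FM
q_k5_large = erosion_curve(32, 5, 2, 2, 10)  # 5x5 kernel, 32x32 FM
print(q_k3_large[-1], q_k3_small[-1], q_k5_large[-1])
```

Even in this linear toy setting, Q falls monotonically and falls faster for the smaller FM and for the larger kernel, in line with the trend described above.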

Our hypothesis also provides explanations for some experimental observations in the literature. (1) The degradation problem happens in very deep networks [12]: although the vanishing/exploding forward activations and backward gradients have been addressed by initialization [11] and intermediate normalization [17], the spatial information is eroded and blurred by the edge effect after multiple convolution layers. (2) It is reported [3] that in GANs, doubling the depth of networks hampers training, and increasing the kernel size to 7 or 5 leads to degradation or minor improvement. These indicate that GANs require information augmentation and are more sensitive to progressive erosion.

Figure 2: Left: layerwise quantity of information Q and colormaps derived from the last convolution layers. FMs are down-sampled after the 18th and 36th layers. Right: implementation of convolution with 2×2 kernels and symmetric padding (C2sp).

3.3 Method

Since $R$ is inevitably asymmetric for even kernels in Equation 3, it is difficult to introduce symmetry within a single FM. Instead, we aim at the final output $F^{o}$ summed over multiple input FMs $F^{i}$ and kernels $w_i$. For clarity, let $R_0$ be the shifted RF in Equation 3 that picks the nearest pixel in the left-top direction as origin; then we explicitly introduce a shifted collection

$$R^{+} = \{R_0, R_1, R_2, R_3\}$$

that includes all four directions: left-top, right-top, left-bottom, right-bottom.

Let $\pi: I \to R^{+}$ be the surjective-only mapping from input channel indexes $i \in I = \{1, 2, \ldots, c_i\}$ to certain shifted RFs. By adjusting the proportion of four shifted RFs, we can ensure that

$$\sum_{i=1}^{c_i} \sum_{\delta \in \pi(i)} \delta = (0, 0).$$

When mixing four shifted RFs within a single convolution, the RFs of even-sized kernels are partially extended, e.g., 2×2 → 3×3, 4×4 → 5×5. If $c_i$ is an integer multiple of 4 (usually satisfied), the symmetry is strictly obeyed within a single convolution layer by distributing RFs in sequence

$$\pi(i) = R_{\lfloor 4(i-1)/c_i \rfloor}.$$

As mentioned above, the shifted RF is equivalent to padding one more zero at a certain corner of FMs. Thus, the symmetry can be neatly realized by a grouped padding strategy; an example of C2sp is illustrated in Figure 2. In summary, the 2D convolution with even-sized kernels and symmetric padding consists of three steps: (1) Dividing the input FMs equally into four groups. (2) Padding FMs according to the direction defined in that group. (3) Calculating the convolution without any padding. We have also done ablation studies on other methods dealing with the shift problem; please see Section 5.
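The three steps can be sketched in plain Python for the 2×2 case. This is a minimal illustration rather than the authors' implementation: the order in which the four padding corners are assigned to the channel groups (`GROUP_PADS`) and the names `pad_fm` and `c2sp` are assumptions, and a real implementation would use a framework's pad and convolution operators.

```python
# (top, left) zeros per group; the opposite side receives (1 - top, 1 - left).
# The assignment order of corners to groups is an assumption of this sketch.
GROUP_PADS = [(1, 1), (1, 0), (0, 1), (0, 0)]


def pad_fm(fm, top, left):
    """Step 2: pad one zero row (top or bottom) and one zero column (left or right)."""
    w = len(fm[0])
    rows = [[0.0] * left + row + [0.0] * (1 - left) for row in fm]
    zero_row = [0.0] * (w + 1)
    return [zero_row] * top + rows + [zero_row] * (1 - top)


def c2sp(fms, kernels):
    """fms: c_in maps of size h x w. kernels[o][i]: 2x2 weights.

    Returns c_out maps of size h x w.
    """
    c_in, h, w = len(fms), len(fms[0]), len(fms[0][0])
    assert c_in % 4 == 0, "this sketch assumes c_in is a multiple of 4"
    # Step 1 and 2: channel i belongs to group floor(4 * i / c_in) (0-based).
    padded = [pad_fm(fm, *GROUP_PADS[4 * i // c_in]) for i, fm in enumerate(fms)]
    # Step 3: convolution without any further padding.
    return [[[sum(k_out[i][dy][dx] * padded[i][y + dy][x + dx]
                  for i in range(c_in) for dy in range(2) for dx in range(2))
              for x in range(w)] for y in range(h)]
            for k_out in kernels]


# Usage: four identical impulse channels and a uniform kernel.
impulse = [[1.0 if (y, x) == (2, 2) else 0.0 for x in range(5)] for y in range(5)]
box = [[0.25, 0.25], [0.25, 0.25]]
out = c2sp([impulse] * 4, [[box] * 4])[0]
for row in out:
    print(row)
```

With identical inputs in all four groups, the response is a symmetric 3×3 pattern centered on the impulse, which illustrates both the removed shift and the partially extended RF (2×2 → 3×3).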

4 Experiments

In this section, the efficacy of symmetric padding is validated in CIFAR10/100 [21] and ImageNet [33] classification tasks, as well as CIFAR10, LSUN bedroom [44], and CelebA-HQ [19] generation tasks. First of all, we intuitively demonstrate that the shift problem has been eliminated by symmetric padding. In the symmetric case of Figure 1, FMs return to the central position, exhibiting healthy magnitudes and reasonable geometries. In Figure 2, C2sp and C4sp have much lower attenuation rates than C2 and C4 regarding information quantity Q. Besides, C2sp has larger Q than C3, which suggests a performance improvement in the following evaluations.

Figure 3: Left: parameter-accuracy curves of ResNets that have multiple depths and various convolution kernels. Middle: parameter-accuracy curves of DenseNets that have multiple depths with C3 and C2sp. Right: training and testing curves on DenseNet-112 with C3 and C2sp.

4.1 Exploration of various kernel sizes

To explore the generalization capabilities of various convolution kernels, ResNet series without bottleneck architectures [12] are chosen as the backbones. We keep all the other components and training hyperparameters the same, and only replace each C3 by a C4, C2 or C2sp. The networks are trained on CIFAR10 with multiple depths. The parameter-accuracy curves are shown in Figure 3. The original even-sized kernels 4×4 and 2×2 perform poorly and encounter faster saturation as the network deepens. Compared with C3, C2sp reaches similar accuracy with only 60%-70% of the parameters, as well as of the FLOPs, which are linearly correlated. We also find that symmetric padding only slightly improves the accuracy in C4sp. At such network depths, the edge effect might dominate the information erosion of 4×4 kernels rather than the shift problem, which is consistent with the attenuation curves in Figure 2.

Based on the results of ResNets on CIFAR10, we further evaluate C2sp and C3 on CIFAR100. This time, DenseNet [15] series with multiple depths are the backbones, and results are shown in Figure 3. At the same depth, C2sp achieves comparable accuracy to C3 as the network gets deeper. The training losses indicate that C2sp has better generalization and less overfitting than C3. Under the criterion of similar accuracy, a C2sp model will save 30%-50% of parameters and FLOPs in the CIFAR evaluations. Therefore, we recommend using C2sp as a better alternative to C3 in classification tasks.

4.2 Compare with compact CNN blocks

To facilitate fair comparisons of C2sp with compact CNN blocks that contain C1, DWConvs, or shift kernels, we use ResNets as backbones and adjust the width and depth to maintain the same number of parameters and FLOPs (overheads). If there are $n$ input channels for a basic residual block, then two C2sp layers will consume about $8n^2$ overheads; the expansion is marked as 1-1-1 since no channel expands. For ShiftNet blocks [40], we choose expansion rate 3 and 3×3 shift kernels as suggested; the overheads are about $6n^2$. Therefore, the value of $n$ is slightly increased. For the inverted-bottleneck [35], the suggested expansion rate 6 results in $12n^2 + O(n)$ overheads, thus the number of blocks is reduced by 1/3. For depthwise-separable convolutions [4], the overheads are about $2n^2 + O(n)$, so the channels are doubled and formed as 2-2-2 expansions.
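The balancing of overheads across block types can be reproduced with a rough weight count. The sketch below assumes specific block layouts that the text does not spell out (two pointwise convolutions around a parameter-free shift, a 3×3 depthwise convolution inside the inverted bottleneck, and two depthwise-plus-pointwise layers per separable block), ignores biases and normalization, and uses hypothetical function names.

```python
def c2sp_block(n):
    """Two 2x2 convolutions, n -> n -> n channels."""
    return 2 * (2 * 2 * n * n)


def shift_block(n, expansion=3):
    """C1 (n -> e*n), parameter-free shift, C1 (e*n -> n)."""
    return 2 * expansion * n * n


def inverted_bottleneck(n, expansion=6, k=3):
    """C1 expand, k x k depthwise on e*n channels, C1 project."""
    return 2 * expansion * n * n + k * k * expansion * n


def separable_block(n, k=3):
    """Two layers, each a k x k depthwise followed by a C1."""
    return 2 * (k * k * n + n * n)


n = 64
print("C2sp      ", c2sp_block(n))            # 8 n^2
print("Shift     ", shift_block(n))           # 6 n^2
print("Invert    ", inverted_bottleneck(n))   # 12 n^2 + O(n)
print("Sep       ", separable_block(n))       # 2 n^2 + O(n)
print("Sep, 2n   ", separable_block(2 * n))   # close to 8 n^2 again
```

Under these assumptions, an inverted-bottleneck block costs roughly 1.5 times a C2sp block, which matches removing one third of the blocks, and doubling the channels of the separable block brings it back near $8n^2$.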

Table 1: Comparison of various compact CNN blocks on CIFAR100. Shift, Invert, and Sep denote ShiftNet block, inverted-bottleneck, and depthwise-separable convolution, respectively. mixup denotes training with mixup augmentation. Exp denotes the expansion rates of channels in that block. SPS refers to the speed during training: samples per second.

The results are summarized in Table 1. Since most models easily overfit the CIFAR100 training set with standard augmentation, we also train the models with mixup [46] augmentation to make the differences more significant. In addition to error rates, the memory consumption and speed during training are reported. C2sp achieves better accuracy than ShiftNets, which indicates that sidestepping spatial convolutions entirely by shift operations may not be an efficient solution. Compared with blocks that contain DWConv, C2sp achieves competitive results in the 56- and 110-layer nets with fewer channels and simpler architectures, which reduce memory consumption (>35%) and speed up (>20%) the training process.

In Table 2, we compare C2sp with NAS models: NASNet [49], PNASNet [24], and AmoebaNet [32]. We apply Wide-DenseNet [15] and adjust the width and depth (K = 48, L = 50) to have approximately 3.3M parameters. C2sp suffers less than 0.2% accuracy loss compared with state-of-the-art auto-generated models, and achieves better accuracy (+0.21%) when the augmentation is enhanced. Although NAS models leverage fragmented operators [26], e.g., pooling, group convolution, and DWConv, to improve accuracy with similar numbers of parameters, the regular-structured Wide-DenseNet has better memory and computational efficiency at runtime. In our reproduction, the training speeds on a Titan Xp for NASNet-A and Wide-DenseNet are about 200 and 400 SPS, respectively.

Table 2: Test error rates (%) on the CIFAR10 dataset. c/o and mixup denote cutout [6] and mixup [46] data augmentation.

4.3 ImageNet classification

We start with the widely-used ResNet-50 and DenseNet-121 architectures. Since both of them contain bottlenecks and C1s to scale down the number of channels, C3 only consumes about 53% and 32% of the total overheads, respectively. Changing C3s to C2sp results in about 25% and 17% reduction of parameters and FLOPs, respectively. The top-1 classification accuracies are shown in Table 3: C2sp has a minor loss (0.2%) in ResNet and a slightly larger degradation (0.5%) in DenseNet. After all, there are only 0.9M parameters for spatial convolution in DenseNet-121 C2sp.

We further scale the channels of ResNet-50 down to 0.5× as a mobile setting. At this stage, a C2 model (asymmetric), as well as reproductions of MobileNet-v2 [35] and ShuffleNet-v2 [26], are evaluated. Symmetric padding greatly reduces the error rate of ResNet-50 0.5× C2 by 2.5%, making ResNet a comparable solution to compact CNNs. Although MobileNet-v2 models achieve the best accuracy, they use inverted-bottlenecks (the same structure as in Table 1) to expand too many FMs, which significantly increases the memory consumption and slows down the training process (about 400 SPS), while other models can easily reach 1000 SPS.

Table 3: Top-1 error rates on ImageNet. Results are obtained by our reproductions using the same training hyperparameters.

4.4 Image generation

The efficacy of symmetric padding is further validated in image generation tasks with GANs. In CIFAR10 32×32 image generation, we follow the same architecture described in [28], which has about 6M parameters in the generator and 1.5M parameters in the discriminator. In LSUN bedroom and CelebA-HQ 128×128 image generation, ResNet19 [22] is adopted with five residual blocks in the generator and six residual blocks in the discriminator, containing about 14M parameters for each of them. Since the training of a GAN is a zero-sum game between two neural networks, we keep all discriminators the same (C3) to mitigate their influence, and replace each C3 in generators with a C4, C2, C4sp, or C2sp. Besides, the number of channels is reduced to 0.75× in C4 and C4sp, or expanded to 1.5× in C2 and C2sp, to approximate the same number of parameters.
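The channel scaling factors follow from the fact that the weights of a convolution layer grow with $k^2 c^2$ when input and output widths are both $c$. The short check below assumes equal input and output widths, which is a simplification made here for illustration.

```latex
% Reference: 3x3 kernel at width c
3^2 \cdot c^2 = 9c^2

% 4x4 kernel at width 0.75c
4^2 \cdot (0.75c)^2 = 16 \cdot 0.5625\,c^2 = 9c^2

% 2x2 kernel at width 1.5c
2^2 \cdot (1.5c)^2 = 4 \cdot 2.25\,c^2 = 9c^2
```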

The inception scores [34] and FIDs [13] are shown in Table 4 for quantitatively evaluating generated images, and examples from the best FID runs are visualized in Figure 4. Symmetric padding is crucial for the convergence of C2 generators, and remarkably improves the quality of C4 generators.

Table 4: Scores for different kernels. Higher inception scores and lower FIDs are better.

In addition, the standard deviations (±) confirm that symmetric padding stabilizes the training of GANs. On CIFAR10, C2sp achieves the best scores, while in LSUN bedroom and CelebA-HQ generation, C4sp is slightly better than C2sp. The diverse results can be explained by the information erosion hypothesis: in CIFAR10 generation, the network is relatively deep in terms of the image size 32×32, so a smaller kernel has a lower attenuation rate and more channels. In contrast, the network is relatively shallow in terms of the image size 128×128, and the edge effect is negligible, so larger RFs are more important than wider channels in high-resolution image generation.

Figure 4: Examples generated by GANs on CIFAR10 (32×32, C2sp, IS=8.27, FID=19.49), LSUN bedroom (128×128, C4sp, FID=16.63) and CelebA-HQ (128×128, C4sp, FID=19.83).

4.5 Implementation details

Results reported as mean±std in tables or error bars in figures are obtained from 5 training runs with different random seeds. The default settings for CIFAR classifications are as follows: We train models for 300 epochs with mini-batch size 64 except for the results in Table 2, which run 600 epochs as in [49]. We use a cosine learning rate decay [25] starting from 0.1 except for DenseNet tests, where the piecewise constant decay performs better. The weight decay factor is 1e-4 except for parameters in depthwise convolutions. The standard augmentation [23] is applied, and α equals 1 in mixup augmentation.
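For readers unfamiliar with the two training ingredients named above, the sketch below shows a cosine learning rate schedule and the mixup interpolation in their commonly used forms. It assumes a final learning rate of zero and no warm restarts, and it operates on plain Python lists; the function names and signatures are hypothetical and not taken from the paper's code.

```python
import math
import random


def cosine_lr(step, total_steps, base_lr=0.1):
    """Cosine decay from base_lr to 0 over total_steps (no restarts assumed)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


def mixup(x1, y1, x2, y2, alpha=1.0, rng=random):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha).

    With alpha = 1 the mixing weight is uniform on [0, 1].
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Usage: learning rate at a few epochs of a 300-epoch run, and one mixed sample.
for epoch in (0, 150, 300):
    print(epoch, cosine_lr(epoch, 300))

rng = random.Random(0)
x, y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], alpha=1.0, rng=rng)
print(lam, x, y)
```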

For ImageNet classifications, all the models are trained for 100 epochs with mini-batch size 256. The learning rate is set to 0.1 initially and annealed according to the cosine decay schedule. We follow the data augmentation in [36]. Weight decay is 1e-4 in ResNet-50 and DenseNet-121 models, and decreases to 4e-5 in the other compact models. Some results are worse than reported in the original papers. It is likely due to the inconsistency of mini-batch size, learning rate decay, or total training epochs, e.g., about 420 epochs in [35].

In generation tasks with GANs, we follow the models and hyperparameters recommended in [22]. The learning rate is 0.0002, β1 is 0.5, and β2 is 0.999 for the Adam optimizer [20]. The mini-batch size is 64, and the ratio of discriminator to generator updates is 5:1. The results in Table 4 and Figure 4 are trained for 200k and 500k discriminator update steps, respectively. We use the non-saturating loss [8] without gradient norm penalty. Spectral normalization [28] is applied in discriminators; no normalization is applied in generators.

5 Discussion

Ablation study. We have tested other methods dealing with the shift problem, and divided them into two categories: (1) Replacing asymmetric padding with an additional non-convolution layer, e.g., interpolation, pooling; (2) Achieving symmetry with multiple convolution layers, e.g., padding 1 pixel at each side before/within two non-padding convolutions. Their implementation is restricted to certain architectures and the accuracy is no better than symmetric padding. Our main consideration is to propose a basic but elegant building element that achieves symmetry within a single layer, so that most of the existing compact models can be neatly transferred to even-sized kernels, providing universal benefits to the compact CNN and GAN communities.

Network fragmentation. From the evaluations above, C2sp achieves comparable accuracy with less training memory and time. Although fragmented operators distributed in many groups [26] have fewer parameters and FLOPs, the operational intensity [39] decreases as the group number increases. This negatively impacts the efficiency of computation, energy, and bandwidth in hardware that has strong parallel computing capabilities. In the situation where memory access dominates the computation, e.g., training, the reduction in FLOPs will be less meaningful. We conclude that when the training efforts are emphasized, it is still controversial to (1) increase network fragmentation by grouping strategies and complex topologies; (2) decompose spatial and channel correlations by DWConvs, shift operations, and C1s.
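The drop in operational intensity with the group number can be estimated with a back-of-the-envelope model. The sketch below assumes that each input, output, and weight value is moved to or from memory exactly once and stored in 4 bytes, which ignores caching and framework overheads; the function name and the layer sizes are chosen here purely for illustration.

```python
def conv_intensity(h, w, c_in, c_out, k, groups=1, bytes_per_value=4):
    """FLOPs per byte of memory traffic for one k x k convolution (stride 1)."""
    in_per_group = c_in // groups
    flops = 2 * h * w * k * k * in_per_group * c_out          # multiply + add
    values = h * w * c_in + h * w * c_out + k * k * in_per_group * c_out
    return flops / (values * bytes_per_value)


dense = conv_intensity(32, 32, 64, 64, 3)                 # regular C3
grouped = conv_intensity(32, 32, 64, 64, 3, groups=4)     # group convolution
depthwise = conv_intensity(32, 32, 64, 64, 3, groups=64)  # DWConv
print(round(dense, 1), round(grouped, 1), round(depthwise, 1))
```

Under these assumptions the FLOPs shrink with the group number while the activation traffic stays the same, so the intensity of the depthwise layer is well over an order of magnitude below that of the dense layer.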

Naive implementation. Meanwhile, most deep learning frameworks and hardware are mainly optimized for C3, which restrains the efficiency of C4sp and C2sp to a large extent. For example, in our high-level Python implementation in TensorFlow for models with C2sp, C2, and C3, although the parameters and FLOPs ratio is 4:4:9, the speed (SPS) and memory consumption ratios during training are about 1:1.14:1.2 and 1:0.7:0.7, respectively. The speed and memory overheads can clearly be further optimized in future computation libraries and software engineering once even-sized kernels are adopted by the deep learning community.

6 Conclusion

In this work, we explore the generalization capabilities of even-sized kernels (2×2, 4×4) and quantify the shift problem by an information erosion hypothesis. Then we introduce symmetric padding to elegantly achieve symmetry within a single convolution layer. In classifications, C2sp achieves 30%-50% saving of parameters and FLOPs compared to C3 on CIFAR10/100, and improves accuracy by 2.5% over C2 on ImageNet. Compared to existing compact convolution blocks, C2sp achieves competitive results with fewer channels and simpler architectures, which reduce memory consumption (>35%) and speed up (>20%) the training process. In generation tasks, C2sp and C4sp both achieve improved image quality and stabilized convergence. Even-sized kernels with symmetric padding provide promising building units for architecture designs that emphasize training efforts in online and continual learning scenarios.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[2] Farzin Aghdasi and Rabab K Ward. Reduction of boundary artifacts in image restoration. IEEE Transactions on Image Processing, 5(4):611–618, 1996.

[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.

[4] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.

[5] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.

[6] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[7] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[9] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.

[10] Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5353–5360, 2015.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[14] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[15] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.

[16] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.

[17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[18] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.

[19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.

[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[21] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[22] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The gan landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.

[23] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.

[24] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.

[25] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.

[26] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.

[27] G McGibney, MR Smith, ST Nichols, and A Crawley. Quantitative evaluation of several partial Fourier reconstruction algorithms used in MRI. Magnetic Resonance in Medicine, 30(1):51–59, 1993.

[28] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

[29] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[30] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10):e3, 2016.

[31] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[32] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.

[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[35] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[36] Nathan Silberman and Sergio Guadarrama. TensorFlow-Slim image classification model library, 2017.

[37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[38] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[39] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.

[40] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero FLOP, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9127–9135, 2018.

[41] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In International Conference on Learning Representations, 2018.

[42] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5995, 2017.

[43] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2016.

[44] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

[45] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter Battaglia. Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, 2019.

[46] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

[47] Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4373–4382, 2017.

[48] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.

[49] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
