Exploiting Local Structures with the Kronecker Layer in Convolutional Networks

2015·Arxiv

Abstract

In this paper, we propose and study a technique to reduce the number of parameters and computation time in convolutional neural networks. We use the Kronecker product to exploit the local structures within convolutional and fully-connected layers, by replacing the large weight matrices with combinations of multiple Kronecker products of smaller matrices. Just as the Kronecker product is a generalization of the outer product from vectors to matrices, our method is a generalization of the low rank approximation method for convolutional neural networks. We also introduce combinations of different shapes of Kronecker products to increase modeling capacity. Experiments on the SVHN, scene text recognition and ImageNet datasets demonstrate that we can achieve speedup or parameter reduction with less than 1% drop in accuracy, showing the effectiveness and efficiency of our method. Moreover, the computational efficiency of the Kronecker layer makes using larger feature maps possible, which in turn enables us to outperform the previous state-of-the-art on both the SVHN (digit recognition) and CASIA-HWDB (handwritten Chinese character recognition) datasets.

1. Introduction

Recently, convolutional neural networks (CNNs) have achieved great success in many computer vision and machine learning tasks. This success facilitates the development of industrial applications using CNNs. However, there are two major challenges for the practical use of these networks, especially on resource-limited devices:

1. Using a CNN for prediction may require a significant amount of computation at run time. For example, AlexNet Krizhevsky et al. (2012) would require billions of floating point operations to process a single image.

2. CNNs achieving state-of-the-art results may require billions of parameters for storage Dean et al. (2012), Le (2013), Jaderberg et al. (2014a).

As a consequence, there has been growing interest in model speedup and compression. It is common to sacrifice a little prediction accuracy in exchange for smaller model size and faster running speed.

In the literature, a major technique is based on the idea of low rank matrix and tensor approximations. In Sainath et al. (2013), low rank matrix factorization was used on the weight matrix of the final softmax layer. Denil et al. (2013) decomposed the weight matrix as a product of two smaller matrices and one of the matrices was carefully constructed as a dictionary. In Xue et al. (2013), Denton et al. (2014), model approximation is followed by fine-tuning on the training data. Zhang et al. (2015) also took the nonlinear activation functions into account when doing approximation.

Low rank techniques can also be applied to the weight tensors of convolutional layers. Rigamonti et al. (2013) used a shared set of separable (rank-1) filters to approximate the original filters. Jaderberg et al. (2014b) exploited the redundancy that exists between different feature channels and filters. Lebedev et al. (2014) applied CP-decomposition, a type of tensor decomposition, to the weight tensors.

In this paper, we explore a framework for approximating the weight matrices and weight tensors in neural networks by a sum of Kronecker products. We note that as the bases for low rank factorizations like SVD or CP-decomposition are outer products of vectors, approximation by these bases can only exploit the redundancy along each dimension. In contrast, as the Kronecker product generalizes the outer product from vectors to matrices of arbitrary shape, we may use the Kronecker product to exploit redundancy between local patches of any shape.

Figure 1 demonstrates a case where approximating the pixel value matrix of an image by Kronecker products produces less reconstruction error than approximating it by outer products with the same number of parameters. Intuitively, a similar situation may also exist for weight matrices and tensors in convolutional networks, and in these cases our method may produce approximate models that run faster and have fewer parameters at the same level of accuracy loss. On the other hand, with a similar number of parameters, our method can advance the previous state-of-the-art.

The rest of this paper is organized as follows. In Section 2, we introduce the Kronecker layer. We discuss some details about implementing the Kronecker layer in Section 3. We extend our technique to convolutional layer in Section 4. Section 5 analyses the result of using Kronecker layers on some benchmark datasets. Section 6 discusses some related work not yet covered. Finally, Section 7 concludes the paper and discusses future work.

2. Kronecker Layer

In this section, we first review the properties of the Kronecker product and then describe its application to the fully-connected layer.

Figure 1: Comparison between approximations by outer product and Kronecker product for an image. Column (a) is the original image, selected from the BSD500 dataset Arbelaez et al. (2011). Column (b) shows the SVD approximations of (a) by outer products and column (c) shows the approximations based on the Kronecker product Van Loan and Pitsianis (1993), with rank 1, 2, 5, 10 respectively from top to bottom. The shape of the right matrix in the Kronecker product is deliberately selected so as to make the number of parameters equal for each rank.

Kronecker Product

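Let $A \in \mathbb{R}^{m_1 \times n_1}$ and $B \in \mathbb{R}^{m_2 \times n_2}$ be two given matrices. Then the Kronecker product $A \otimes B$ is an $m \times n$ matrix, where $m = m_1 m_2$ and $n = n_1 n_2$:

$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1 n_1} B \\ \vdots & \ddots & \vdots \\ a_{m_1 1} B & \cdots & a_{m_1 n_1} B \end{bmatrix}. \tag{1}$$

An important property of the Kronecker product is that it can be expressed by matrix multiplication together with a reshape operation:

$$(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{\top}) \tag{2}$$

for any matrix $X \in \mathbb{R}^{n_2 \times n_1}$. Here $\mathrm{vec}(X)$ denotes the vectorization of the matrix $X$ into a column vector, obtained by stacking its columns. Below we will show how to speed up the calculation of Kronecker products in neural networks using this property.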
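The reshape property can be checked numerically. The following is a minimal NumPy sketch, assuming the standard identity $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{\top})$ with column-major vectorization; the sizes and the helper name `vec` are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, n1, m2, n2 = 3, 4, 2, 5          # illustrative sizes (assumption)
A = rng.standard_normal((m1, n1))
B = rng.standard_normal((m2, n2))
X = rng.standard_normal((n2, n1))    # X has n2 rows and n1 columns


def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")


lhs = np.kron(A, B) @ vec(X)         # uses the explicit (m1*m2) x (n1*n2) matrix
rhs = vec(B @ X @ A.T)               # two small matrix products instead

identity_holds = bool(np.allclose(lhs, rhs))
print(identity_holds)
```

The right-hand side never materializes the $m_1 m_2 \times n_1 n_2$ matrix, which is the source of the savings discussed in the rest of this section.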

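Kronecker products are easy to generalize from matrices to tensors. Let $A \in \mathbb{R}^{p_1 \times \cdots \times p_d}$ and $B \in \mathbb{R}^{q_1 \times \cdots \times q_d}$ and define, with zero-based indices:

$$(A \otimes B)_{i_1, \ldots, i_d} = A_{\lfloor i_1 / q_1 \rfloor, \ldots, \lfloor i_d / q_d \rfloor}\; B_{i_1 \bmod q_1, \ldots, i_d \bmod q_d}, \tag{3}$$

where $A \otimes B \in \mathbb{R}^{p_1 q_1 \times \cdots \times p_d q_d}$.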

Approximating the Fully-Connected Layer

We next show how to use Kronecker products to approximate weight matrices of fully-connected layers, leading to the construction of a new kind of layer which we call a Kronecker fully-connected layer, or KFC layer for short. The idea originates from the observation that for a weight matrix $W \in \mathbb{R}^{m \times n}$ whose dimensions are not prime (in fact, the dimensions are commonly set to a multiple of 8 to make full use of modern CPU/GPU architectures), we have the approximation:

$$W \approx A \otimes B, \tag{4}$$

where $m = m_1 m_2$, $n = n_1 n_2$, $A \in \mathbb{R}^{m_1 \times n_1}$ and $B \in \mathbb{R}^{m_2 \times n_2}$. So the KFC layer is:

$$L_{i+1} = f\big((A \otimes B) L_i + b\big), \tag{5}$$

where $L_i$ is the input of the $i$th layer, $b$ is the bias term and $f$ is the nonlinear activation function. In the case that $m$ or $n$ is prime, it suffices to add some extra dummy features or output classes to make the dimension a composite number.

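Note that we need not calculate the Kronecker product explicitly. When the KFC layer is fed with a batch of $k$ inputs, we can compute the forward pass efficiently according to Eq. (2):

$$(A \otimes B) L_i = \mathrm{Reshape}\big(\mathcal{L}_i \times_1 B \times_2 A\big), \tag{6}$$

where $\mathcal{L}_i \in \mathbb{R}^{n_2 \times n_1 \times k}$ is a tensor stacked from the $k$ columns of $L_i$, each reshaped into an $n_2 \times n_1$ matrix, and $k$ is the batch size. $\times_p$ is the tensor-matrix product over mode $p$ Kolda and Bader (2009), which can be implemented as a matrix product following a linear time unfolding operation of the tensor. The Reshape operator reshapes the tensor from $\mathbb{R}^{m_2 \times m_1 \times k}$ to $\mathbb{R}^{m_1 m_2 \times k}$, which has nearly no overhead. Figure 2 illustrates this procedure. Similarly, the backward pass also reduces to simple matrix products.

Figure 2: A simple visualization of the computation procedure of the KFC layer; panel (b) shows the result of the operation in panel (a). The KFC layer transforms a matrix of size $n_1 n_2 \times k$ into a matrix of size $m_1 m_2 \times k$.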
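The batched forward pass described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the function name `kfc_forward` is hypothetical, the bias and nonlinearity are left out, and the input rows are assumed to be ordered so that row $j$ of the input splits as $j = j_1 n_2 + j_2$ (NumPy's row-major layout, matching `np.kron`).

```python
import numpy as np


def kfc_forward(A, B, L):
    """Compute (A kron B) @ L without forming the Kronecker product.

    A: (m1, n1), B: (m2, n2), L: (n1*n2, k) with one sample per column.
    Returns an (m1*m2, k) array.
    """
    m1, n1 = A.shape
    m2, n2 = B.shape
    k = L.shape[1]
    # Row index j of L splits as j = j1 * n2 + j2, so a C-order reshape
    # yields a tensor indexed by (j1, j2, sample).
    T = L.reshape(n1, n2, k)
    # Contract A with the n1 axis and B with the n2 axis.
    out = np.einsum("ap,bq,pqk->abk", A, B, T, optimize=True)
    # Merging the two output axes is a free reshape.
    return out.reshape(m1 * m2, k)


rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((3, 5))
L = rng.standard_normal((6 * 5, 7))      # batch of k = 7 inputs

reference = np.kron(A, B) @ L            # explicit Kronecker product
fast = kfc_forward(A, B, L)
max_abs_diff = float(np.max(np.abs(reference - fast)))
```

In a real layer the two contractions would be dispatched as dense matrix products on the unfolded tensor; `einsum` is used here only to keep the index bookkeeping visible.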

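Just as the SVD approximation may be extended beyond rank 1 to an arbitrary rank, one can extend the Kronecker product approximation to a sum of Kronecker products. In addition, unlike the outer product, $A$ and $B$ may have different shapes. Hence, we get a more general KFC layer:

$$L_{i+1} = f\Big(\Big(\sum_{j=1}^{r} A_j \otimes B_j\Big) L_i + b\Big), \tag{7}$$

where $A_j \in \mathbb{R}^{m_1^{(j)} \times n_1^{(j)}}$, $B_j \in \mathbb{R}^{m_2^{(j)} \times n_2^{(j)}}$, $m_1^{(j)} m_2^{(j)} = m$ and $n_1^{(j)} n_2^{(j)} = n$.

The number of parameters of a KFC layer is $\sum_{j=1}^{r} \big(m_1^{(j)} n_1^{(j)} + m_2^{(j)} n_2^{(j)}\big)$ (bias terms are omitted), reduced from $mn$. In particular, when all the small matrices have the same shape, the number is $r(m_1 n_1 + m_2 n_2)$.

The computation complexity is $O\big(k \sum_{j=1}^{r} (m_1^{(j)} n + m\, n_2^{(j)})\big)$ when each term is evaluated as two consecutive matrix products (applying $A_j$ first), reduced from $O(mnk)$. When all the small matrices have the same shape, it is $O\big(r k (m_1 n + m\, n_2)\big)$.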
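As a worked instance of the same-shape parameter count $r(m_1 n_1 + m_2 n_2)$, take the configuration (64, 4, 256, 25, 5) used for SVHN in Section 5. Reading this tuple as $(m_1, m_2, n_1, n_2, r)$ is an assumption, though it is consistent with the layer sizes quoted in the experiments (256 hidden units for SVHN, and the AlexNet layer sizes for ImageNet).

```latex
\begin{align*}
m &= m_1 m_2 = 64 \cdot 4 = 256, \qquad n = n_1 n_2 = 256 \cdot 25 = 6400,\\
\text{KFC parameters} &= r\,(m_1 n_1 + m_2 n_2) = 5\,(64 \cdot 256 + 4 \cdot 25) = 5 \cdot 16484 = 82420,\\
\text{FC parameters} &= m n = 256 \cdot 6400 = 1638400 \approx 19.9 \times 82420.
\end{align*}
```

Bias terms are omitted on both sides, as in the text.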

In particular, let $m_2 = n_1 = 1$, so that each $A_j$ is a column vector and each $B_j$ is a row vector. The Kronecker product then degenerates to the outer product, and the approximation degenerates to the SVD method Xue et al. (2013), Denton et al. (2014). Let $m_1 = n_1 = 1$, so that $A$ is a scalar. The KFC layer then degenerates to the classical fully-connected layer. Figure 3 illustrates the differences among the fully-connected layer, the fully-connected layer with SVD approximation, and our KFC layer.

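In the rest of the paper, we use the following notation to describe a configuration of a KFC layer as an approximation of an FC layer with weight matrix $W \in \mathbb{R}^{m \times n}$: the list of tuples $\{(m_1^{(j)}, m_2^{(j)}, n_1^{(j)}, n_2^{(j)})\}_{j=1}^{r}$ denotes a KFC layer of rank $r$, with $A_j \in \mathbb{R}^{m_1^{(j)} \times n_1^{(j)}}$ and $B_j \in \mathbb{R}^{m_2^{(j)} \times n_2^{(j)}}$. In particular, we use the 5-tuple $(m_1, m_2, n_1, n_2, r)$ to denote a KFC layer of rank $r$, where all components have the same shape.

Figure 3: Illustration of the fully-connected layer, the fully-connected layer with SVD approximation, and the KFC layer.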

3. Details of the KFC Layer

We now consider some details about the KFC layer, including how to initialize the layer, how to select the shapes and use more nonlinearity.

KFC layers can be randomly initialized with the same method as FC layers. However, in the case where we want to compress or speed up a pre-trained model (for example, to run on mobile devices), the KFC layer can be initialized by approximating the pre-trained weight matrix $W$, just like the SVD method. The initialization problem can be formulated as the nearest Kronecker product (KP) problem:

$$\min_{\{A_j, B_j\}} \Big\| W - \sum_{j=1}^{r} A_j \otimes B_j \Big\|_F.$$

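Van Loan and Pitsianis (1993) solved this problem with the Kronecker product SVD (KPSVD) when all the $A_j$ share one shape and all the $B_j$ share one shape. KPSVD bears a strong connection with SVD. In fact, the problem can be turned into the following SVD problem using the $\mathcal{R}$ operator:

$$\Big\| W - \sum_{j=1}^{r} A_j \otimes B_j \Big\|_F = \Big\| \mathcal{R}(W) - \sum_{j=1}^{r} \mathrm{vec}(A_j)\, \mathrm{vec}(B_j)^{\top} \Big\|_F,$$

where $\mathcal{R}$ is a reordering operation that maps $W \in \mathbb{R}^{m_1 m_2 \times n_1 n_2}$ to $\mathcal{R}(W) \in \mathbb{R}^{m_1 n_1 \times m_2 n_2}$, whose rows are the vectorized $m_2 \times n_2$ blocks of $W$. Writing the SVD as $\mathcal{R}(W) = \sum_j \sigma_j u_j v_j^{\top}$, we then have:

$$\mathrm{vec}(A_j) = \sqrt{\sigma_j}\, u_j, \qquad \mathrm{vec}(B_j) = \sqrt{\sigma_j}\, v_j, \qquad j = 1, \ldots, r.$$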

For multiple shapes, we can apply KPSVD for one shape and reconstruct the weight matrix $\hat{W}_1 = \sum_{j=1}^{r_1} A_j \otimes B_j$, where $r_1$ denotes the rank under that shape. Then we apply KPSVD on the residual $W - \hat{W}_1$ with the second shape. These two steps are repeated on the remaining residual until all shapes are computed.
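The single-shape KPSVD initialization can be sketched as follows. This is a minimal NumPy illustration of the Van Loan and Pitsianis rearrangement followed by a truncated SVD, not the authors' code; the function names `rearrange` and `kpsvd_init` are hypothetical, and row-major flattening is used throughout so that the result is consistent with `np.kron`.

```python
import numpy as np


def rearrange(W, m1, n1, m2, n2):
    """Rearrangement R: (m1*m2, n1*n2) -> (m1*n1, m2*n2).

    Row (i1, j1) of the result holds block (i1, j1) of W, flattened.
    """
    blocks = W.reshape(m1, m2, n1, n2)           # indices (i1, i2, j1, j2)
    return blocks.transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)


def kpsvd_init(W, m1, n1, m2, n2, r):
    """Return r factors A_j (m1 x n1) and B_j (m2 x n2) whose Kronecker
    products sum to the best Frobenius-norm approximation of W for this shape."""
    U, s, Vt = np.linalg.svd(rearrange(W, m1, n1, m2, n2), full_matrices=False)
    scale = np.sqrt(s[:r])                       # split each singular value evenly
    As = [(scale[j] * U[:, j]).reshape(m1, n1) for j in range(r)]
    Bs = [(scale[j] * Vt[j]).reshape(m2, n2) for j in range(r)]
    return As, Bs


# Sanity check on a matrix that is exactly a sum of r Kronecker products.
rng = np.random.default_rng(2)
m1, n1, m2, n2, r = 4, 3, 2, 5, 2
W = sum(np.kron(rng.standard_normal((m1, n1)), rng.standard_normal((m2, n2)))
        for _ in range(r))
As, Bs = kpsvd_init(W, m1, n1, m2, n2, r)
W_hat = sum(np.kron(A, B) for A, B in zip(As, Bs))
rel_err = float(np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

For several shapes, the same routine could be applied to the residual `W - W_hat` with the next shape, following the recursive procedure described above.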

Any factors of $m$ and $n$ may be selected as $m_1$ and $n_1$ in Eq. (7). However, in CNNs, the input to a fully-connected layer may be a tensor of order 4, namely $\mathcal{T} \in \mathbb{R}^{c \times h \times w \times k}$, where $c$ is the number of channels, $h$ is the height, $w$ is the width and $k$ is the batch size. $\mathcal{T}$ is often reshaped into a matrix $L_i \in \mathbb{R}^{chw \times k}$ before being fed into a fully-connected layer. Though the reshaping from $\mathcal{T}$ to $L_i$ does not incur any loss in the values of the data, we note that the dimension information is lost in the matrix representation. Due to the shape of $W$, we may propose a few kinds of structural constraints by requiring $W$ to be the Kronecker product of matrices of particular shapes.

• Formulation I: In this formulation, we require $n_1 = c$ and $n_2 = hw$. The number of parameters of each Kronecker product is reduced to $m_1 c + m_2 hw$. The underlying assumption for this model is that the channel transformation should be decoupled from the spatial transformation.

• Formulation II: In this formulation, we require $n_1 = ch$ and $n_2 = w$. The number of parameters of each Kronecker product is reduced to $m_1 ch + m_2 w$. The underlying assumption for this model is that the transformation w.r.t. columns may be decoupled.

• Formulation III: In this formulation, we require $n_1 = cw$ and $n_2 = h$, and the second and the third dimensions of the input tensor need to be swapped first. The number of parameters of each Kronecker product is reduced to $m_1 cw + m_2 h$. The underlying assumption for this model is that the transformation w.r.t. rows may be decoupled.

Of course, we can also combine the above three formulations.

Otherwise, when the input is a matrix, we do not have natural choices of $n_1$ and $n_2$. Through experiments, we find it is possible to arbitrarily pick a decomposition of the input matrix dimensions to enforce the Kronecker product structural constraint. It is also sensible to make the sizes of the two factors as close as possible, i.e. $m_1 n_1 \approx m_2 n_2$, with a small $r$ to get the maximum compression ratio. However, making one factor smaller and the other correspondingly larger generally gives less accuracy loss. Nevertheless, we can use multiple components with different shapes to remedy the arbitrariness of the selection.
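The trade-off in choosing shapes can be explored by enumerating the admissible factorizations. The sketch below is illustrative only: the helper names are hypothetical, and the layer size $256 \times 6400$ is an assumption inferred from the SVHN configuration (64, 4, 256, 25, 5) read as $(m_1, m_2, n_1, n_2, r)$.

```python
def divisors(x):
    """All positive divisors of x, in increasing order."""
    return [d for d in range(1, x + 1) if x % d == 0]


def kfc_param_count(m1, m2, n1, n2, r=1):
    """Parameters of a same-shape KFC layer of rank r (bias omitted)."""
    return r * (m1 * n1 + m2 * n2)


def rank_shapes(m, n, r=1):
    """Every (m1, m2, n1, n2) with m1*m2 == m and n1*n2 == n,
    sorted by the number of parameters of a rank-r KFC layer."""
    shapes = [(m1, m // m1, n1, n // n1)
              for m1 in divisors(m) for n1 in divisors(n)]
    return sorted(shapes, key=lambda s: kfc_param_count(*s, r=r))


shapes = rank_shapes(256, 6400)
best = shapes[0]                         # one of the most compact shapes
best_count = kfc_param_count(*best)
svhn_count = kfc_param_count(64, 4, 256, 25, r=5)
print(best, best_count, svhn_count)
```

The most compact shapes balance $m_1 n_1$ against $m_2 n_2$, but as noted above they are not necessarily the most accurate, so such a listing is only a starting point for the search.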

When $r$ is not very large, we can move the summation out of the nonlinear function $f$ in Eq. (7) to introduce more nonlinearity to the KFC layer with little overhead:

$$L_{i+1} = \sum_{j=1}^{r} f\big((A_j \otimes B_j) L_i + b_j\big).$$

The number of parameters increases only slightly (more bias terms), or we can share the bias to avoid the increase. We have found that the additional nonlinearity in the KFC layer is sometimes very helpful. Note that the additional nonlinearity makes KFC layers difficult to initialize by KPSVD, but this is not a serious problem: initializing KFC layers with random numbers works well in our experiments.
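The difference between the two variants can be made concrete with a small sketch. It is an illustration under assumptions: ReLU stands in for the unspecified activation $f$, the function names are hypothetical, and the Kronecker products are formed explicitly for readability rather than speed.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def kfc_sum_inside(As, Bs, L, bias, f=relu):
    """f applied once to the summed Kronecker terms, as in Eq. (7)."""
    W = sum(np.kron(A, B) for A, B in zip(As, Bs))
    return f(W @ L + bias[:, None])


def kfc_sum_outside(As, Bs, L, biases, f=relu):
    """f applied to each Kronecker term separately, then summed."""
    return sum(f(np.kron(A, B) @ L + b[:, None])
               for A, B, b in zip(As, Bs, biases))


rng = np.random.default_rng(4)
m1, n1, m2, n2, r, k = 3, 4, 2, 5, 3, 6
As = [rng.standard_normal((m1, n1)) for _ in range(r)]
Bs = [rng.standard_normal((m2, n2)) for _ in range(r)]
biases = [rng.standard_normal(m1 * m2) for _ in range(r)]
L = rng.standard_normal((n1 * n2, k))

out_inside = kfc_sum_inside(As, Bs, L, biases[0])
out_outside = kfc_sum_outside(As, Bs, L, biases)

# With a single component the two variants coincide.
same_for_rank_one = bool(np.allclose(
    kfc_sum_inside(As[:1], Bs[:1], L, biases[0]),
    kfc_sum_outside(As[:1], Bs[:1], L, biases[:1])))
```

The second variant uses one bias vector per component, which is the small parameter increase mentioned above; passing the same vector $r$ times corresponds to sharing the bias.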

4. Generalization: the KConv Layer

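Since the fully-connected layer is a kind of convolution, we extend our work to the convolutional layer. In this section, we describe how to use Kronecker products to approximate weight tensors of convolutional layers, leading to the construction of a new kind of layer which we call the Kronecker convolutional layer, or KConv layer for short. For simplicity, we assume in this section that the stride is 1 and that there is no zero padding. Weights of the convolutional layer can be described as a 4-dimensional tensor $W \in \mathbb{R}^{o \times c \times h \times w}$, where $o$ is the number of output channels, $c$ is the number of input channels, and $h$ and $w$ are the spatial dimensions of the kernel. The weight tensor can be approximated as:

$$W \approx \sum_{j=1}^{r} A_j \otimes B_j,$$

with $A_j \in \mathbb{R}^{o_1 \times c_1 \times h_1 \times w_1}$ and $B_j \in \mathbb{R}^{o_2 \times c_2 \times h_2 \times w_2}$, where $o = o_1 o_2$, $c = c_1 c_2$, $h = h_1 h_2$ and $w = w_1 w_2$ (the shapes may differ between components). The shapes of the filters are constrained so that one factor is an $h \times 1$ filter and the other a $1 \times w$ filter. These constraints are the same as in the two schemes discussed in Jaderberg et al. (2014b).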

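Similar to the KFC layer, we do not need to calculate the tensor Kronecker product explicitly. For each shape, we can replace the original convolutional layer with two consecutive convolutional layers. Here we use rank $r = 1$ for simplicity. The input is a tensor with $c$ channels, where $x$ and $y$ are its height and width and $k$ is the batch size. The KConv layer proceeds as follows:

1. Reshape the input tensor so that its channel dimension is split into its two factors.

2. Convolve the reshaped input with one Kronecker factor.

3. Reshape the intermediate result.

4. Convolve the reshaped intermediate result with the other Kronecker factor.

5. Reshape the result and accumulate it into the output.

Figure 4 illustrates the KConv framework. The number of parameters reduces from $ochw$ to $o_1 c_1 h_1 w_1 + o_2 c_2 h_2 w_2$ for a single rank-1 component, and the computation complexity is reduced as well.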
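The two-convolution procedure can be checked numerically for the separable case. The sketch below is an illustration under assumptions, not the authors' implementation: it takes rank 1, one factor of spatial size $h \times 1$ and the other of size $1 \times w$, applies the $1 \times w$ factor first (the order is an arbitrary choice), handles a single input image, and uses cross-correlation with stride 1 and no padding, as CNN frameworks do.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(3)
o1, o2, c1, c2, h, w = 2, 3, 2, 4, 3, 5    # illustrative sizes (assumption)
X, Y = 8, 9                                # input height and width

A = rng.standard_normal((o1, c1, h))       # stands for an (o1, c1, h, 1) factor
B = rng.standard_normal((o2, c2, w))       # stands for an (o2, c2, 1, w) factor
inp = rng.standard_normal((c1 * c2, X, Y))

# Reference: build the full (o, c, h, w) weight tensor and convolve directly.
W = np.einsum("ach,pqw->apcqhw", A, B).reshape(o1 * o2, c1 * c2, h, w)
patches = sliding_window_view(inp, (h, w), axis=(1, 2))    # (c, x, y, h, w)
direct = np.einsum("ochw,cxyhw->oxy", W, patches)

# KConv-style: a 1 x w convolution with B, then an h x 1 convolution with A.
inp4 = inp.reshape(c1, c2, X, Y)                           # split the channels
cols = sliding_window_view(inp4, w, axis=3)                # (c1, c2, X, y, w)
mid = np.einsum("pqw,cqxyw->cpxy", B, cols)                # (c1, o2, X, y)
rows = sliding_window_view(mid, h, axis=2)                 # (c1, o2, x, y, h)
two_step = np.einsum("ach,cpxyh->apxy", A, rows)
two_step = two_step.reshape(o1 * o2, X - h + 1, Y - w + 1)

outputs_match = bool(np.allclose(direct, two_step))
full_params = W.size                   # o * c * h * w
kconv_params = A.size + B.size         # o1*c1*h + o2*c2*w
print(outputs_match, full_params, kconv_params)
```

In the first convolution the $c_1$ axis plays the role of a batch dimension, and in the second the $o_2$ axis does, which is why only reshapes are needed between the two steps.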

In particular, for specific choices of the factor shapes, the KConv layer is the same as Scheme 1 or as Scheme 2 in Jaderberg et al. (2014b). It is worth mentioning that with a rank $r > 1$ and suitable shapes, the two convolution frameworks are very similar to the inception module Szegedy et al. (2014b), where the main difference is that inception has an extra max pooling branch. However, KConv allows more choices of factor shapes.

5. Experiments

In this section, we empirically study the properties and efficiency of the Kronecker layer and compare it with some other common low rank model approximation methods. As is well known, a large proportion of parameters in a CNN are contained in the fully-connected layers and most computation time is spent in the convolutional layer. Therefore, in the experiments, we mainly consider model acceleration in the convolutional layer and model compression in the fully-connected layer.

To make a fair comparison, for each dataset, we train a convolutional neural network as a baseline. Then we replace the convolutional layer or fully-connected layer with a KConv or KFC layer and train the new network until the quality metrics stabilize. We compare the Kronecker method with other low rank methods and the baseline model in terms of number of parameters, running time and prediction quality.

Figure 4: The KConv layer can be implemented via a sequence of two convolutional layers with some reshape operations.

We perform the experiments on model compression with an implementation of the Kronecker layers in the Theano framework Bergstra et al. (2010), Bastien et al. (2012), and the experiments on model speedup are based on the Caffe framework Jia et al. (2014).

SVHN digits

The SVHN dataset Netzer et al. (2011) is a real-world digit recognition dataset consisting of photos of house numbers in Google Street View images. Here we consider the digit recognition task where the inputs are 32-by-32 colored images centered around a single character. There are 73257 digits for training, 26032 digits for testing, and 531131 less difficult samples which can be used as extra training data. To build a validation set, we randomly select 400 images per class from training set and 200 images per class from extra training set as Sermanet et al. (2012), Goodfellow et al. (2013b) did.

Our baseline model has 8 layers, of which the first 6 consist of four convolutional layers and two pooling layers. The 7th layer is the fully-connected layer and the 8th is the softmax output. The input of the fully-connected layer has 6400 features in total. The baseline’s fully-connected layer has 256 hidden units.

Test results are listed in Table 1. All results are averaged over 5 models. In the SVD-r methods Xue et al. (2013), Denton et al. (2014), we apply singular value decomposition to the baseline’s weight matrix, reconstruct it to rank r and fine-tune the restructured model. In the KFC-shape method, we replace the fully-connected layer by a KFC layer that combines the three formulations discussed above. In the KFC-rank method, we replace the fully-connected layer by a KFC layer with configuration (64, 4, 256, 25, 5). Both KFC-shape and KFC-rank use additional nonlinearity and do not share bias.

Table 1: Comparison of SVD method and KFC layers on SVHN digit recognition.

From the results we can see that on SVHN digit recognition, the KFC layer can reduce the number of parameters with only 0.25% accuracy loss, while the SVD method incurs 0.79% accuracy loss at the same compression ratio.

SVHN sequences

We also tested our models on SVHN digit sequence recognition. Following the experimental setting in Goodfellow et al. (2013a), Ba et al. (2015), we preprocessed the training data by expanding the bounding box of each digit sequence by 30% and resizing the patch to the network input size. Our model is built on top of the strong baseline CNN used in Jaderberg et al. (2015), which by itself gives a 4.0% whole-sequence error. The baseline model has three large fully-connected layers, each with 3072 hidden units. Based on this model, we replaced the first two fully-connected layers by two KFC layers (the second with configuration (30, 10, 60, 40, 10)), and the third fully-connected layer in the baseline model by a fully-connected layer with 300 hidden units. The five parallel fully-connected layers for classification, with a total of 55 hidden units, are replaced by a fully-connected layer of 220 hidden units followed by a maxout layer of 4 units.

This model achieves a 3.4% sequence error rate, with only about 12% of the parameters of the baseline CNN in Jaderberg et al. (2015). The performance of other competing methods is listed in Table 2.

Table 2: Comparison of different methods on SVHN sequence recognition.

CASIA-HWDB

CASIA-HWDB Liu et al. (2011) is an offline Chinese handwriting dataset, containing more than 3 million character samples of over 7,000 character classes, produced by over 1,000 writers. On this dataset, the best reported error rate we know of is from Zhong et al. (2015). Their model, named HCCR-GoogLeNet, uses an architecture similar to GoogLeNet Szegedy et al. (2014a) and reaches a single-model error rate of 3.74%.

It is worth pointing out that in both GoogLeNet and HCCR-GoogLeNet, a pooling layer with a considerably large size (7 or 5) is used to downsample the large feature maps generated by the inception layers, as a way to reduce the number of parameters in the next fully-connected layer. Our KFC model, however, can directly operate on the large feature maps due to its nature of model compression. Our architecture is based on HCCR-GoogLeNet, with the layers after the four inception groups replaced by a KFC layer, followed by two fully-connected layers with 1024 and 512 hidden units, respectively. This model achieves an error rate of 3.37%, which advances the previous state-of-the-art. Although the original approach with a large pooling layer uses fewer parameters, such a large downsampling operation inevitably loses much information and hence does not perform as well. Other competing methods are listed in Table 3.

Table 3: Recognition Performances of different methods on CASIA-HWDB.

Scene Text Characters

We use the CNN described in Jaderberg et al. (2014c) to test our KConv layers, since Jaderberg et al. (2014b) and Lebedev et al. (2014) both experimented on this dataset. The dataset contains 186k character images cropped from some character recognition datasets. The network has 4 convolutional layers, a fully-connected layer and a 36-class softmax. The network uses maxout Goodfellow et al. (2013b) as the nonlinearity after each convolutional layer and fully-connected layer. Training settings are almost the same as for SVHN. We test the model speed on an Nvidia GPU with the Caffe framework Jia et al. (2014).

We replace the second and third convolutions by our KConv layer since these two layers constitute more than 90% of the running time. The second convolution has 48 input channels and 128 output channels. The third convolution has 64 input channels and 512 output channels.

The results are shown in Table 4. The KConv layer can achieve a speedup on the whole model with less than 1% accuracy loss. The result is similar to Jaderberg et al. (2014b), as the KConv layer includes the Jaderberg-style rank-1 filter as a special case. The reported results are measured by validation errors, as we are only concerned with the relative performance of each method.

Table 4: Speedup of KConv on scene text character recognition dataset. Parameters of the different layers are separated by a semicolon.

We have also experimented with replacing the first convolutional layer with a KConv layer. In this case, KConv with configuration (2, 12, 1, 1, 9) is found to outperform the Jaderberg-style rank-1 filter with configuration (2, 96, 1, 1, 9) by 0.83%.

Scene Text Words

We also experiment on the word recognition model trained on the synthetic word dataset consisting of 9 million images covering about 90k English words from Jaderberg et al. (2014a) and tested on the ICDAR 2013 dataset Karatzas et al. (2013). As the model predicts English words, the number of output classes is about 90k, resulting in a model with more than 400 million parameters, mostly in the fully-connected layers.

We use different shapes and ranks in experiments of the KFC layers to replace the last FC layer. For comparison, we test a method which simply decreases the number of output neurons of the second-to-last FC layer before softmax. We also test the method in Sainath et al. (2013), which also tries to approximate the last weight matrix. Figure 5 lists the test results. Due to lack of space, we have not listed the detailed hyper-parameters of all these experiments. The KFC model with the highest accuracy uses a configuration with shapes (26, 15, 719, 122), (26, 15, 122, 719), (13, 30, 61, 1438), (130, 3, 1438, 61), each shape of rank 10, building a KFC layer of total rank 40. Even so, this layer itself still saves 92% of the parameters compared to its FC counterpart. The scatter diagram indicates that the KFC layer requires fewer parameters at the same accuracy, or has higher accuracy with the same number of parameters. This demonstrates that our technique also works well before the softmax layer.

Figure 5: Accuracy loss and total parameter reduction on ICDAR’13 with different models. The KFC indicates models using the KFC layer and the FC indicates models without the KFC layer.

ImageNet

ImageNet (ILSVRC12) Russakovsky et al. (2015) is a large scale visual recognition dataset containing 1000 categories and 1.2 million images for training. We use AlexNet Krizhevsky et al. (2012) as the baseline network and use the implementation in Ding et al. (2014). AlexNet has three fully-connected layers. The input of the first is a tensor of size $256 \times 6 \times 6$ and its weight matrix is of size $4096 \times 9216$. The second and the third layers have weight matrices of size $4096 \times 4096$ and $1000 \times 4096$, respectively.

Table 5: Comparison of using SVD method and using KFC layers on ImageNet.

Test results are listed in Table 5. SVD-2 and KFC-2 compress the first two fully-connected layers, and SVD-3 and KFC-3 compress all three fully-connected layers. We select the hyper-parameters carefully to ensure that the two compared methods have the same model size. In SVD-2, the ranks are 237 and 1000. In SVD-3, the ranks are 100, 165, 200. In KFC-2, the configurations are (1024, 4, 1536, 6, 2) for the first layer and (2048, 2, 2048, 2, 2) for the second. In KFC-3 the configurations are (512, 8, 512, 18, 5), (512, 8, 512, 8, 5) and (500, 2, 512, 8, 4). We use additional nonlinearity in all KFC layers. The results demonstrate that we can look for the best or most suitable choice from a variety of different shapes and ranks in KFC layers.
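The claim that the compared SVD and KFC models have matching sizes can be checked from the quoted hyper-parameters. The sketch below is a back-of-the-envelope count under assumptions: the 5-tuples are read as $(m_1, m_2, n_1, n_2, r)$, a rank-$k$ SVD layer is counted as $k(m + n)$ parameters, biases are omitted, and the helper names are hypothetical.

```python
def svd_params(m, n, rank):
    """Parameters of a rank-`rank` factorization of an m x n matrix."""
    return rank * (m + n)


def kfc_params(m1, m2, n1, n2, r):
    """Parameters of a same-shape KFC layer (bias terms omitted)."""
    return r * (m1 * n1 + m2 * n2)


# (outputs, inputs) of the three AlexNet fully-connected layers.
fc_shapes = [(4096, 9216), (4096, 4096), (1000, 4096)]

svd3_ranks = [100, 165, 200]
kfc3_configs = [(512, 8, 512, 18, 5), (512, 8, 512, 8, 5), (500, 2, 512, 8, 4)]

# Each configuration must factor the corresponding layer exactly.
shapes_consistent = all(cfg[0] * cfg[1] == m and cfg[2] * cfg[3] == n
                        for cfg, (m, n) in zip(kfc3_configs, fc_shapes))

svd3_total = sum(svd_params(m, n, k) for (m, n), k in zip(fc_shapes, svd3_ranks))
kfc3_total = sum(kfc_params(*cfg) for cfg in kfc3_configs)
fc_total = sum(m * n for m, n in fc_shapes)
print(svd3_total, kfc3_total, fc_total)
```

Under these assumptions the SVD-3 and KFC-3 counts agree to within a few percent, and both are a small fraction of the uncompressed fully-connected layers.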

6. Related Work

In this section we discuss some related works not yet covered.

In addition to the low rank methods mentioned earlier, hashing methods have also been used to reduce the number of parameters Chen et al. (2015), Bakhtiary et al. (2015), and distillation offers another way of compressing neural networks Hinton et al. (2015). Furthermore, Mathieu et al. (2013) used FFT to speed up convolution. Yang et al. (2015) used the adaptive fastfood transform to reparameterize the matrix-vector multiplication of fully-connected layers. Han et al. (2015) iteratively pruned redundant connections to reduce the number of parameters. Gong et al. (2014) used vector quantization to compress the fully-connected layer. Gupta et al. (2015) suggested using low precision arithmetic to compress the neural network.

7. Conclusion

In this paper, we have proposed and studied a framework to reduce the number of parameters and computation time in convolutional neural networks. Our framework uses Kronecker products to exploit the local structure within convolutional layers and fully-connected layers. As the Kronecker product is a generalization of the outer product, our method generalizes the low rank approximation method for matrices and tensors. We also explored combining Kronecker products of different shapes to further balance the drop in accuracy and the reduction in parameters. A method for initializing the Kronecker layer is also given.

Through a series of experiments on different datasets, our method is shown to be effective and efficient on different tasks. It can reduce the computation time and model size with minor loss in accuracy, or improve on previous state-of-the-art performance with similar model size.

A key advantage of our method is that Kronecker layers can be implemented by a combination of tensor reshaping operations and dense matrix products, which can be efficiently performed on CPU. The generality of the Kronecker product also allows a lot of freedom to trade off between model size, running time and prediction accuracy through the selection of the hyper-parameters in Kronecker layers, such as shapes and ranks.

References

Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):898–916, May 2011. ISSN 0162-8828. doi: 10.1109/TPAMI.2010.161. URL http://dx.doi.org/10.1109/TPAMI.2010.161.

J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In Proc. of ICLR, 2015.

Amir H Bakhtiary, Agata Lapedriza, and David Masip. Speeding up neural networks for large scale classification using wta hashing. arXiv preprint arXiv:1504.07488, 2015.

Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.

Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 2285–2294, 2015. URL http://jmlr.org/proceedings/papers/v37/chenc15.html.

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.

Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.

Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.

Weiguang Ding, Ruoyan Wang, Fei Mao, and Graham Taylor. Theano-based large-scale visual recognition with multiple gpus. arXiv preprint arXiv:1412.2302, 2014.

Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In arXiv, 2013a.

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 1319–1327, 2013b. URL http://jmlr.org/proceedings/papers/v28/goodfellow13.html.

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1737–1746, 2015. URL http://jmlr.org/proceedings/papers/v37/gupta15.html.

Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626, 2015.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014a.

M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In arXiv, 2015.

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014, 2014b. URL http://www.bmva.org/bmvc/2014/papers/paper073/index.html.

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Deep features for text spotting. In Computer Vision – ECCV 2014, pages 512–528. Springer, 2014c.

Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.

Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Mikio Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Jordi Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis-Pere de las Heras. Icdar 2013 robust reading competition. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 1484–1493. IEEE, 2013.

Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, August 2009. ISSN 0036-1445. doi: 10.1137/07070111X. URL http://dx.doi.org/10.1137/07070111X.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

Quoc V Le. Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8595–8598. IEEE, 2013.

Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.

Cheng-Lin Liu, Fei Yin, Da-Han Wang, and Qiu-Feng Wang. CASIA online and offline Chinese handwriting databases. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 37–41. IEEE, 2011.

Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5. Granada, Spain, 2011.

Roberto Rigamonti, Amos Sironi, Vincent Lepetit, and Pascal Fua. Learning separable filters. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2754–2761. IEEE, 2013.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pages 1–42, April 2015. doi: 10.1007/s11263-015-0816-y.

Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655–6659. IEEE, 2013.

Pierre Sermanet, Sandhya Chintala, and Yann LeCun. Convolutional neural networks applied to house numbers digit classification. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3288–3291. IEEE, 2012.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014a.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014b.

Charles F Van Loan and Nikos Pitsianis. Approximation with Kronecker products. Springer, 1993.

Chunpeng Wu, Wei Fan, Yuan He, Jun Sun, and Satoshi Naoi. Handwritten character recognition by alternately trained relaxation convolutional neural network. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, pages 291–296. IEEE, 2014.

Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In INTERSPEECH, pages 2365–2369, 2013.

Zichao Yang, Andrew Gordon Wilson, Alexander J. Smola, and Le Song. A la carte - learning fast kernels. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015. URL http://jmlr.org/proceedings/papers/v38/yang15b.html.

Fei Yin, Qiu-Feng Wang, Xu-Yao Zhang, and Cheng-Lin Liu. ICDAR 2013 Chinese handwriting recognition competition. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 1464–1470. IEEE, 2013.

Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. arXiv preprint arXiv:1505.06798, 2015.

Zhuoyao Zhong, Lianwen Jin, and Zecheng Xie. High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps. CoRR, abs/1505.04925, 2015. URL http://arxiv.org/abs/1505.04925.
