Learning Filter Scale and Orientation In CNNs

2018·Arxiv

Abstract

Convolutional neural networks have many hyperparameters such as the filter size, number of filters, and pooling size, which require manual tuning. Though deep stacked structures are able to create multi-scale and hierarchical representations, manually fixed filter sizes limit the scale of representations that can be learned in a single convolutional layer.

This paper introduces a new adaptive filter model that allows variable scale and orientation. The scale and orientation parameters of filters can be learned using back propagation. Therefore, in a single convolution layer, we can create filters of different scale and orientation that can adapt to small or large features and objects. The proposed model uses a relatively large base size (grid) for filters. In the grid, a differentiable function acts as an envelope for the filters. The envelope function guides effective filter scale and shape/orientation by masking the filter weights before the convolution. Therefore, only the weights in the envelope are updated during training.

In this work, we employed a multivariate (2D) Gaussian as the envelope function and showed that it can grow, shrink, or rotate by updating its covariance matrix during back propagation training. We tested the new filter model on MNIST, MNIST-cluttered, and CIFAR-10 and compared the results with networks that used conventional convolution layers. The results demonstrate that the new model can effectively learn and produce filters of different scales and orientations in a single layer. Moreover, the experiments show that the adaptive convolution layers perform equally well or better, especially when the data includes objects of varying scale and noisy backgrounds.

1 Introduction

Naming or describing real life objects is only meaningful with respect to a relevant scale [8]. For example, a view can be described as a leaf, a branch, or a tree depending on the distance of the observer. Natural and casual scenes are generally composed of many different entities/objects at different scales. During image acquisition, the true physical scale is usually ignored. However, the relative scale of the objects is somehow implicitly captured and stored in the image grid and pixels.

An automated method to identify or describe objects in images can be analyzed in two parts: representation and classification. Basic classification algorithms without add-ons cannot successfully handle the variation and complexity of the raw pixel-level representation of objects. Instead, they rely on functions that map image pixels into different constructs, named features, which are sought to represent the image content more briefly and to be invariant to various geometric and intensity changes.

Traditionally, computer vision researchers relied on manually designed feature extractors for representation. Recently, we have witnessed the success of algorithms that can self-learn appropriate feature extractors. In either case, the size of an operator or a probe usually determines and fixes the scale of the entities that can be represented. However, even in the self-learning case, the size of the probes or operators is often manually selected. On the other hand, the last two decades have seen many automated object detection/recognition algorithms that were superior to their counterparts because they incorporated multi-scale processing of images [3], [6]. Multi-scale feature extractors gather and present the inherent scale information of image pixels to a subsequent classifier. In SIFT [9] and wavelets [10], this is done by creating a multi-scale pyramid from the input image and then applying a fixed size probe-kernel to each scale. In an application of Gabor filters for object recognition, Serre et al. [12] used a hierarchy of stacked Gabor filtering layers, where the filters have predetermined scales and orientations. However, Chan et al. [1] showed that the adaptation of handcrafted filters to low-level representations is difficult.

Convolutional neural networks (CNNs), in contrast, rely on stacked and hierarchical convolutions of the image to extract multi-scale information. The convention for filter size selection in CNNs is to use small fixed size weight kernels in the lower levels. However, thanks to the stacked operation of convolutional layers, sandwiched by pooling layers which down-sample intermediate outputs, the deeper levels of a network are able to learn representations of larger scales. Though the optimality of fixed size kernels has not been proven, the convention is to use filters as small as 3x3 in the first layer, which can be larger (5x5 or 7x7) in the later stages [16]. During back propagation training, filters evolve to imitate lower level receptive fields in biological vision, which are sensitive to certain shapes and orientations. Another justification for avoiding large filter sizes is that, while certainly increasing computation time, they may also increase over-fitting.

Though the number of filters and their sizes in convolution layers are usually selected intuitively, researchers are seeking alternatives to improve representation capacity of the network in deeper architectures. For example, Szegedy et al. [14] handcrafted their “inception” architecture to include mixing of parallel and wide convolution layers which use different sized filter kernels. In a deep architecture, this approach allows multi-scale, parallel and sparse representations. [? ] In summary, existing CNN based methods use fixed size convolution kernels and then rely on the fact that shape and orientation of the filters can be inferred from the training data. Additionally, CNNs employ stacked convolution layers to successfully create multi-scale representations.

On the other hand, Hubel and Wiesel [4] discovered three types of cells in the visual cortex: simple, complex, and hyper-complex (i.e. end-stopped) cells. The simple cells are sensitive to the orientation of the excitatory input, whereas the hyper-complex cells are activated by a certain orientation, motion, and length of the stimuli. Therefore, it is biologically plausible to assume that filters of different scales, together with different orientations and directions, may also work better in CNNs.

In this study, we create a new and adaptive model of convolution layers where the filter size (actually scale) and orientation are learned during training. Therefore, a single convolution layer can have distinct filter scales and orientations. Broadly speaking, this corresponds to extracting multi-scale information using a single convolution layer. However, our aim is not to fully replace the stacked architectures and deep networks for multi-scale information.

Figure 1: Illustration of the proposed weight envelope: (a) an arbitrary differentiable envelope function controls the weight spread and shape on a regular, relatively large base grid; (b) an example initial Gaussian kernel with centered $\mu$ and initial $\Sigma$; (c) initial weights of the filter, which are randomly generated; (d) weights masked with the envelope (b) by simple element-wise multiplication.

Instead, our approach improves the information that can be extracted from an input (which may be an image) in a single layer. Additionally, the model removes the necessity of fixing convolution kernel sizes, so that the filter size can be removed from the list of hyper-parameters of deep learning networks.

Our experiments use the MNIST, MNIST-cluttered, and CIFAR-10 datasets to show that the proposed model can learn and produce differently scaled filters in a single convolution layer whilst improving classification performance compared to conventional convolution layers. On the experimental side, our work concentrates on developing and proving an effective methodology for learnable adaptive filter scales and orientations, rather than improving highly optimized state-of-the-art results on these datasets.

2 Method

In deep network architectures, stacked convolution layers perform convolutions with fixed size kernels, while the pooling layers sandwiched between them perform downsampling operations to achieve a multi-scale and hierarchical representation. Fixed size convolution kernels put a limit on the scale of features which can be extracted from a single layer. Though it is possible to mix several kernels of different sizes in a single convolution layer, the convention is to use a fixed size for all the kernels in a layer. Here, we introduce a new filter model which can adapt its scale and orientation. It therefore allows the development of multi-scale and differently oriented filters in a single convolution layer. To realize this, we need a smooth and differentiable function that can grow, shrink, or rotate during training and that acts as an envelope to guide filter scale and orientation. The following subsections explain the role of the envelope, the selection of an appropriate function for the envelope, and its partial derivatives, which are used in backpropagation.

2.1 Envelope Function

The role of the envelope function is to guide the development of the filter scale and orientation. As illustrated in Figure 1, a base grid acts as the envelope and filter domain. Since it is the most common case, we will assume a two-dimensional domain. Generalization to higher dimensions is straightforward. In this domain, the envelope function must be differentiable and smooth. Let us assume a base grid for an n×n odd-sized square filter (1), and let G be a (continuously) differentiable function defined on the grid g (coordinate space) with a parameter vector in $\mathbb{R}^i$ that defines its shape (2).

By updating these parameters, the envelope function must be able to grow or shrink its effective area and change its orientation. The feed forward model of a single neuron with an input x and transfer function f can be written as:

Alternatively, consider a whole convolution layer with input matrix (image) X, weight matrix W, and envelope matrix U. An element-wise multiplication of U with the weight matrix W masks and scales the weights before the convolution.

Since the weights cannot grow out of the envelope U, the filter size and orientation are bounded and determined by U. Assuming that the partial derivatives of G with respect to its continuous parameters are defined, the parameter update can be performed with the chain rule using the standard backpropagation algorithm and the learning rate. Note, however, that the weight update also receives u as a scale factor.
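As a sketch of how the envelope enters backpropagation, assume an error function $E$, a bias $b$, effective weights $\tilde{w}_k = u_k w_k$, a generic envelope parameter $\theta$, and a learning rate $\eta$. These symbols are assumptions introduced for illustration and are not taken from the original notation. Under them, the neuron output and the chain-rule gradients would read:

```latex
\begin{align}
y &= f\Big(\sum_k u_k\, w_k\, x_k + b\Big), \qquad u_k = G(g_k, \theta) \\
\frac{\partial E}{\partial w_k} &= \frac{\partial E}{\partial \tilde{w}_k}\, u_k \\
\frac{\partial E}{\partial \theta} &= \sum_k \frac{\partial E}{\partial \tilde{w}_k}\, w_k\, \frac{\partial u_k}{\partial \theta} \\
w_k &\leftarrow w_k - \eta\, \frac{\partial E}{\partial w_k}, \qquad
\theta \leftarrow \theta - \eta\, \frac{\partial E}{\partial \theta}
\end{align}
```

The second line makes the scaling explicit: a weight lying where the envelope is close to zero receives an update that is also close to zero.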

2.2 Selecting an Envelope Function

It is well known that the continuous Gaussian kernel has unique properties which are important for generating a scale space. Simply put, the Gaussian kernel does not create new local extrema, nor enhance existing extrema, whilst smoothing the image with a variable continuous parameter [8]. Some of these properties are proven to exist in discrete space if the sample size is sufficiently large. Therefore, the Gaussian is an ideal candidate for the envelope function U:

$$U(g) = A \exp\left(-\tfrac{1}{2}\,(g - \mu)^{\top} \Sigma^{-1} (g - \mu)\right)$$

Here, A is an optional normalization parameter; $\mu$ controls the center of the envelope, whereas the covariance $\Sigma$ controls the scale and orientation of the kernel.

During feed forward execution, the envelope function is calculated on the grid coordinates g with the current covariance $\Sigma$ and then multiplied element-wise with the weight matrix W, prior to the convolution. This is illustrated in Figure 1(b)-1(d). Note that this operation not only bounds the weights and adjusts the effective area, it also scales the weights. To implement the convolution operation appropriately, we set $\mu$ as a vector of constants that is initialized with the center point coordinates of the grid g. Therefore, it is not updated during training. However, the covariance $\Sigma$ must be updated to learn the filter scale and orientation.
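The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' Lasagne/Theano implementation: the function names `gaussian_envelope` and `masked_weights` are hypothetical, the grid is assumed to be centered at the origin (so the center vector is zero), and the normalization parameter A defaults to 1.

```python
import numpy as np

def gaussian_envelope(n, cov, amplitude=1.0):
    """Evaluate a 2D Gaussian envelope on an n x n grid centered at the origin."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that the grid has a center point")
    half = n // 2
    # Grid coordinates g: integer offsets from the center, shape (n, n, 2).
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    d = np.stack([xs, ys], axis=-1).astype(float)
    # Quadratic form d^T cov^{-1} d evaluated at every grid point.
    cov_inv = np.linalg.inv(np.asarray(cov, dtype=float))
    q = np.einsum("...i,ij,...j->...", d, cov_inv, d)
    return amplitude * np.exp(-0.5 * q)

def masked_weights(weights, cov):
    """Element-wise product of the envelope U with a square weight matrix W."""
    n = weights.shape[0]
    return gaussian_envelope(n, cov) * weights

# Example: an 11x11 base grid with unit covariance and random initial weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(11, 11))
U = gaussian_envelope(11, np.eye(2))
W_masked = masked_weights(W, np.eye(2))
```

Because the envelope depends only on the covariance, `W_masked` can be computed once after training and passed to an ordinary convolution routine.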

In order to keep $\Sigma$ symmetric, we calculate the gradients for each of its elements and apply the update rule.
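For a Gaussian envelope, the derivative with respect to the covariance has a closed form. The block below is a standard matrix-calculus identity rather than a formula recovered from the paper; the shorthand $d = g - \mu$, the error function $E$, and the learning rate $\eta$ are introduced here as assumptions, and the entries of $\Sigma$ are treated as independent before symmetrizing:

```latex
\begin{align}
U(g) &= A \exp\!\Big(-\tfrac{1}{2}\, d^{\top} \Sigma^{-1} d\Big), \qquad d = g - \mu \\
\frac{\partial U(g)}{\partial \Sigma} &= \tfrac{1}{2}\, U(g)\, \Sigma^{-1} d\, d^{\top} \Sigma^{-1} \\
\nabla_{\Sigma} E &= \sum_{g} \frac{\partial E}{\partial U(g)}\, \frac{\partial U(g)}{\partial \Sigma} \\
\Sigma &\leftarrow \Sigma - \frac{\eta}{2}\Big(\nabla_{\Sigma} E + (\nabla_{\Sigma} E)^{\top}\Big)
\end{align}
```

When $\Sigma$ is symmetric, each term $\Sigma^{-1} d\, d^{\top} \Sigma^{-1}$ is already symmetric, so the averaging in the last line only guards against numerical drift.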

The covariance $\Sigma$ must be kept as a symmetric and positive definite matrix. A symmetric matrix is positive definite if $x^{\top} \Sigma x > 0$ for all non-zero vectors $x$, which imposes the following condition:

$$\lambda_i > 0 \quad \text{for all } i$$

where $\lambda_i$ denotes the eigenvalues of the covariance matrix, which can be checked to ensure positive definiteness. During training, however, the diagonal sigma terms are kept positive (and nonzero) by setting a lower bound, whereas the off-diagonal term is constrained by an upper bound. Experiments show that the covariance behaves well during training when it is initialized properly and updated with a small learning rate, which removes the necessity for these constraints.
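A check and a simple projection along these lines could look as follows. This is a sketch under stated assumptions: the helper names and the bound values `var_min` and `corr_max` are hypothetical and are not taken from the paper, which does not give its exact bounds here.

```python
import numpy as np

def is_positive_definite(cov):
    """True if cov is symmetric and all of its eigenvalues are strictly positive."""
    cov = np.asarray(cov, dtype=float)
    if not np.allclose(cov, cov.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(cov) > 0.0))

def clamp_covariance(cov, var_min=0.1, corr_max=0.95):
    """Project a 2x2 matrix onto symmetric positive definite matrices.

    var_min and corr_max are illustrative bounds (assumptions), with
    0 < var_min and 0 <= corr_max < 1.
    """
    cov = np.asarray(cov, dtype=float)
    # Symmetrize the off-diagonal term.
    off = 0.5 * (cov[0, 1] + cov[1, 0])
    # Lower-bound the diagonal (variance) terms.
    sxx = max(cov[0, 0], var_min)
    syy = max(cov[1, 1], var_min)
    # Bound the off-diagonal magnitude so that the determinant stays positive.
    limit = corr_max * np.sqrt(sxx * syy)
    off = float(np.clip(off, -limit, limit))
    return np.array([[sxx, off], [off, syy]])

bad = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3 and -1
fixed = clamp_covariance(bad)
```

With these bounds the determinant is at least `sxx * syy * (1 - corr_max**2)`, which is positive, so the projected matrix always passes the eigenvalue check.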

2.3 Vanishing Variance

2.4 Implementation

We implemented the adaptive convolution filters using Lasagne and Theano [15] and tested them on an Nvidia Tesla K40 GPU board. In terms of computational complexity, as can be expected, calculating the Gaussian envelope function adds extra overhead in training. However, during feed forward execution, the trained and enveloped final weights can be stored and used immediately without any overhead. Compared to conventional filters, we use relatively large (e.g. 11x11) base filter sizes to observe adaptive growth and rotations. Please note that the grid can be selected as large as the input image. ACNN ran one epoch (500 examples) in 62 seconds whereas cnn-11 ran in 315 seconds on the MNIST dataset without an optimized implementation. (The code will be available in the camera-ready version, not disclosed for anonymity.)

3 Experiments

3.1 Why do we need a filter guide?

One can argue that the filter guide is unnecessary because a relatively large CNN layer can learn any filter. To disprove this idea, we conducted a test with an encoder where the input is an image and the filter is expected to learn a simple filtering operation. Can CNN learn the same

Table 1: The network topology that is used to test our method. All three networks were comprised of 8 layers. In conv-1 and conv-2 layers, the proposed adaptive model (ACNN) used an 11x11 base grid for filters, whereas ’cnn-5’ and ’cnn-11’ used 5x5 and 11x11 filter sizes, respectively. *CIFAR-10 experiments used 16 filters instead of 8.

3.2 Covariance

Show how covariance learns, whether it requires the constraints, in which conditions.

3.3 Runtime wrt grid size

The experiments were aimed at observing whether the adaptive filters can change their scale and orientation during training and whether this adaptation yields improved classification performance. We tested the adaptive filter model with three different datasets and compared the results against two conventional CNN configurations that used different fixed size filters. All three networks had the same structure, which comprised two convolution layers, two pooling layers, a dropout layer, and two fully connected layers (Table 1). The only difference between the adaptive CNN (ACNN) and the conventional CNNs (cnn-5, cnn-11) was the replacement of the convolution layers. We used cross-entropy as the error function. The hyperparameters were as follows: learning rate 0.01, momentum 0.95, batch size 500. In each test, we examined the training loss, the error percentage, and the scale and orientation changes in the filters. Though the learning rate for $\Sigma$ could be adjusted separately, it was not necessary.

3.4 MNIST

MNIST [7] is a database of handwritten digits, widely used in machine learning research to test models. It has 50,000 training and 10,000 testing images from 10 different categories. To observe the change in $\Sigma$, we calculated its eigenvalues and eigenvectors. The maximum eigenvalue represents the scale, whereas the tangent between the eigenvectors shows the orientation, as illustrated in Figure 2.
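One way to read scale and orientation off a covariance matrix is through its eigendecomposition, as sketched below. The function name is hypothetical, and measuring orientation as the angle of the largest eigenvector in degrees, folded into [0, 180), is an assumption about how the plotted quantity was computed.

```python
import numpy as np

def scale_and_orientation(cov):
    """Return (largest eigenvalue, angle of its eigenvector in degrees)."""
    cov = np.asarray(cov, dtype=float)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    v = eigvecs[:, -1]  # eigenvector of the largest eigenvalue
    # An eigenvector's sign is arbitrary, so fold the angle into [0, 180).
    angle = np.degrees(np.arctan2(v[1], v[0])) % 180.0
    return float(eigvals[-1]), float(angle)

scale, angle = scale_and_orientation([[2.0, 1.0], [1.0, 2.0]])
```

For this example matrix the eigenvalues are 1 and 3, and the dominant eigenvector points along the diagonal, so the envelope is elongated at 45 degrees.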

In Figure 3, we can observe the effects of the learned envelope functions' scale and orientation on the filters. The smoothing effect of the envelope function over the input is also observed in some outputs (Figure 3(c)). Figure 4 shows the training loss and classification error plots. The adaptive filters had no performance gain over the conventional cnn-11 and cnn-5.

Figure 2: The plots for covariance matrix change in MNIST dataset. Depicted by the (a) angle of largest eigenvector and (b) largest eigenvalue.

Figure 3: The first layer (conv-1) filters at the end of training with MNIST. (a) Gaussian envelopes, (b) scaled filters, (c) output of a sample that was convolved with each filter.

Figure 4: Training loss and classification error for MNIST.

Figure 5: Eight randomly selected samples from the cluttered MNIST dataset.

Figure 6: Training loss and classification error in cluttered MNIST database.

3.5 MNIST Cluttered

The cluttered MNIST dataset [2] consists of 60,000 samples in 10 classes. We split this dataset into 50,000 training and 10,000 test samples. Eight randomly selected samples are illustrated in Figure 5. The dataset contains 60x60 images generated from the original MNIST database with numerous distractors. Projecting the original MNIST 28x28 pixel space onto 60x60 also caused changes in scale. Thus, the dataset had scale and rotational variances in addition to cluttered background noise, which makes it a suitable test case to demonstrate the use of adaptive filters.

Figure 6 shows the training loss and classification error, where we can observe better performance compared with conventional CNNs.

3.6 CIFAR-10

This dataset [5] is a set of relatively small (32x32x3) images with 60,000 samples from 10 different classes. We divided this dataset into 50,000 training and 10,000 test samples. Unlike the MNIST dataset, color channels are present, and the objects are much more in need of multi-scale features.

The classification results shown in Figure 8(b) demonstrate that ACNN again performed better in classification error. For further investigation, we also show the learned envelope functions and scaled filters in Figure 7. Compared to MNIST, the envelope functions are observably different and include large, small, and rotated filters. The change in scale and orientation is shown in Figure 9. Compared with the change of $\Sigma$ in the MNIST test, scales and orientations have more variation: some of the filters tended to shrink, whereas others enlarged their scales.

Figure 7: The first layer filters at the end of training in CIFAR-10 database.

Figure 8: Training loss and classification error in CIFAR-10 database.

4 Discussion

Figure 9: Plots for covariance matrix change in CIFAR-10 dataset, depicted by the (a) angle of largest eigenvector, (b) largest eigenvalue.

In this paper, we propose an adaptive convolution filter model based on a Gaussian kernel that acts as an envelope function on shared filter weights. The plots of scale and orientation changes during the training epochs show that the adaptive model is capable of generating differently scaled and oriented filters in a single convolution layer. However, besides bounding and scaling the convolution weights, the Gaussian kernel tends to smooth the input: if all weights were set to 1 and not trainable, the kernels would perform only a Gaussian smoothing operation on the input. The initial setting of the variance terms to 1.0 enables an initial filter of 5x5 size. During training, the effective size of the filters is gradually increased. This is because enlarging the filters enables more weights to be included in the convolution, which allows further reduction in the network error. Therefore, the adaptive filter may be more prone to overfitting than a conventional fixed size filter of the same initial size. However, because the envelope rescales the weights (max 1.0), it has a regularizing effect on their magnitudes, which should be an advantage. Overall, training of the adaptive filter model did not require very fine tuning of the parameters. However, we observed that the use of a dropout layer encouraged the development of filters of different scale and orientation. This can be explained by the parallel and sparse network configurations induced by the dropout mechanism, which force filters to avoid co-adaptation and become independent. We will investigate other ways of inducing independent filters, perhaps with an additional cost term for the network which penalizes co-adaptation.
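The relation between the variance and the effective filter size can be illustrated by counting how many grid positions along the central row of an isotropic envelope exceed a cutoff. The function name and the cutoff of 0.05 are assumptions chosen for this sketch; the text does not state how the effective size was measured.

```python
import numpy as np

def effective_width(variance, n=11, threshold=0.05):
    """Count central-row grid points where an isotropic Gaussian envelope
    (peak value 1) is above the threshold."""
    half = n // 2
    offsets = np.arange(-half, half + 1)
    # Envelope values along the row through the grid center.
    profile = np.exp(-0.5 * offsets**2 / variance)
    return int(np.count_nonzero(profile > threshold))

widths = {v: effective_width(v) for v in (0.25, 1.0, 4.0)}
```

Under this cutoff a variance of 1.0 covers five positions on the 11x11 grid, while smaller and larger variances shrink and grow the covered span.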

A clear benefit of our model is that it removes the filter size from the list of hyperparameters of deep learning networks. However, our main purpose is to add an adaptive multi-scale representation capacity to convolution layers. The results show that the advantage of using the new model depends on the complexity and variations in the training and test data. Among the three datasets, MNIST is the simplest, since its digits are size-normalized and centered [7]. The adaptive filters have little or no need for scale adaptation in pixel space, which resulted in no improvement in classification error when compared to conventional CNNs. However, MNIST cluttered and CIFAR-10 include examples of arbitrary scale, orientation, and centering [5] [2], which allowed the filters to adapt their scale and orientation to improve training while not overfitting. Therefore, we can conclude that the adaptive filters' expressive power is revealed in datasets with variations in scale and orientation. It is worthwhile to investigate its applications to other domains.

The new adaptive model of convolution layers allows the filters' scale and orientation to be learned during training. Therefore, a single convolution layer can have filters at various scales and orientations and can adapt to extract multi-scale information from its input. State-of-the-art deep networks have many layers and more complex designs compared to the networks that were tested in this study. An interesting question which we will investigate further is whether using the adaptive filter layers can reduce the depth of state-of-the-art architectures, such as inception [14], highway [13], or thin [11]. Though our aim is not to fully replace stacked and deep architectures, the new model may help reduce redundancy and improve accuracy. Another question is whether placing the adaptive layer in deeper levels of a network can produce additional gains by focusing on the higher level representations.

?? using Gaussian lookup table for

5 Acknowledgment

This research was supported by a grant (undisclosed for anonymity) and NVIDIA Hardware Grant scheme.

References

[1] Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma. Pcanet: A simple deep learning baseline for image classification? CoRR, abs/1404.3606, 2014. URL http://arxiv.org/abs/1404.3606.

[2] christopher5106. Christopher. Cluttered mnist dataset. https://github.com/christopher5106/mnist-cluttered, 2015.

[3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.

[4] David H Hubel and Torsten N Wiesel. Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology, 195(1):215–243, 1968.

[5] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.

[6] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

[7] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

[8] Tony Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, Norwell, MA, USA, 1994. ISBN 0792394186.

[9] David G Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. IEEE, 1999.

[10] Stephane Mallat and Sifen Zhong. Characterization of signals from multiscale edges. IEEE Transactions on pattern analysis and machine intelligence, 14(7):710–732, 1992.

[11] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014. URL http://arxiv.org/abs/1412.6550.

[12] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):411–426, March 2007. ISSN 0162-8828. doi: 10.1109/TPAMI.2007.56.

[13] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387.

[14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.

[15] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

[16] Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks, pages 818–833. Springer International Publishing, Cham, 2014. ISBN 978-3-319-10590-1. doi: 10.1007/978-3-319-10590-1_53. URL http://dx.doi.org/10.1007/978-3-319-10590-1_53.
