Spatial transformer networks (STNs) were designed to enable CNNs to learn invariance to image transformations. STNs were originally proposed to transform CNN feature maps as well as input images. This enables the use of more complex features when predicting transformation parameters. However, since STNs perform a purely spatial transformation, they do not, in the general case, have the ability to align the feature maps of a transformed image and its original. We present a theoretical argument for this and investigate the practical implications, showing that this inability is coupled with decreased classification accuracy. We advocate taking advantage of more complex features in deeper layers by instead sharing parameters between the classification and the localisation network.
Spatial transformer networks (STNs) [1, 2] were introduced as an option for CNNs to learn invariance to image transformations by transforming input images or convolutional feature maps before further processing. A spatial transformer (ST) module is composed of a localisation network that predicts transformation parameters and a transformer that transforms an image or a feature map using these parameters. An STN is a network with one or more ST modules at arbitrary depths.
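The two parts of an ST module described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of [1]: the affine matrix `theta` acts on pixel coordinates rather than normalised ones, and `centroid_localisation` is a hypothetical hand-crafted stand-in for a learned localisation network.

```python
import math


def bilinear_sample(image, y, x):
    """Bilinearly interpolate a 2-D list at real-valued (y, x); outside is zero."""
    height, width = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < height and 0 <= xi < width:
                value += (1 - abs(y - yi)) * (1 - abs(x - xi)) * image[yi][xi]
    return value


def transformer(image, theta):
    """Resample `image` on a grid warped by the 2x3 affine matrix `theta`.

    `theta` maps output pixel coordinates (x, y) to input sampling positions.
    """
    (a, b, tx), (c, d, ty) = theta
    height, width = len(image), len(image[0])
    return [
        [bilinear_sample(image, c * x + d * y + ty, a * x + b * y + tx)
         for x in range(width)]
        for y in range(height)
    ]


def centroid_localisation(image):
    """Hypothetical stand-in for a learned localisation network (assumption).

    Predicts the translation that moves the intensity centroid to the centre.
    """
    height, width = len(image), len(image[0])
    mass = sum(sum(row) for row in image)
    if mass == 0:
        return ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))  # identity transformation
    cy = sum(y * sum(row) for y, row in enumerate(image)) / mass
    cx = sum(x * v for row in image for x, v in enumerate(row)) / mass
    return ((1.0, 0.0, cx - (width - 1) / 2), (0.0, 1.0, cy - (height - 1) / 2))


def st_module(image, localisation_net):
    """ST module: predict parameters from the input, then transform that input."""
    return transformer(image, localisation_net(image))


image = [[0] * 5 for _ in range(5)]
image[1][3] = 1  # a single off-centre "object"
aligned = st_module(image, centroid_localisation)
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

In this toy case the module moves the single bright pixel to the image centre, which is the kind of canonical pose a learned localisation network would be trained to find.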
An ST module can clearly be used for pose alignment of images when applied directly to the input. Assume an input image $f$ and a set of image transformations $\mathcal{T}_g$ indexed by some parameter $g$. Transformed images $\mathcal{T}_g f$ could be transformed into a canonical pose if the ST module correctly learns to apply the inverse transformation: $\mathcal{T}_g^{-1} \mathcal{T}_g f = f$.
However, if the inverse spatial transformation is applied to a convolutional feature map (Γf)(x, c), here with c channels, this will, in the general case, not result in alignment of the feature maps of a transformed image and those of the original image: $\mathcal{T}_g^{-1} \Gamma \mathcal{T}_g f \neq \Gamma f$.
The intuition for this is illustrated in Figure 1, where Γ has two feature channels for recognising the letters "W" and "M". Note how a purely spatial transformation cannot align the feature maps $\Gamma f$ and $\Gamma \mathcal{T}_g f$, since there is also a shift in the channel dimension. Similar reasoning applies to a wide range of spatial image transformations.
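The "W"/"M" intuition can be checked numerically. The sketch below is illustrative only: the 3x3 templates, the toy two-channel extractor `gamma`, and the 5x5 image are assumptions made for the example and are not taken from the paper. It rotates an image by 180 degrees, computes the two feature channels, and then applies the inverse rotation to each channel.

```python
def rot180(grid):
    """Rotate a 2-D list by 180 degrees."""
    return [row[::-1] for row in grid[::-1]]


def correlate(image, kernel):
    """Valid-mode cross-correlation of a 2-D image with a 2-D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [sum(image[i + a][j + b] * kernel[a][b]
             for a in range(kh) for b in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]


# Toy templates (assumption): the "M" template is the "W" template upside down.
W_TEMPLATE = [[1, 0, 1],
              [1, 1, 1],
              [0, 1, 0]]
M_TEMPLATE = rot180(W_TEMPLATE)


def gamma(image):
    """Toy feature extractor with two channels: a 'W' and an 'M' detector."""
    return [correlate(image, W_TEMPLATE), correlate(image, M_TEMPLATE)]


# A 5x5 image containing one "W" in its top-left corner.
f = [[0] * 5 for _ in range(5)]
for a in range(3):
    for b in range(3):
        f[a][b] = W_TEMPLATE[a][b]

features_original = gamma(f)
features_rotated = gamma(rot180(f))

# Purely spatial inverse transformation, applied to each channel separately.
spatially_aligned = [rot180(channel) for channel in features_rotated]

alignment_succeeds = spatially_aligned == features_original
needs_channel_swap = spatially_aligned == features_original[::-1]
```

Here `alignment_succeeds` is false while `needs_channel_swap` is true: the inverse rotation restores the spatial layout, but the responses end up in the opposite channels, which no purely spatial transformation can undo.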
This gives rise to the question of the relative benefits of transforming the input vs. transforming intermediate feature maps in STNs. Is there a point in transforming intermediate feature maps if it cannot support invariant recognition?
Figure 1: Inversely transforming the feature map will, in general, not align the feature maps of a transformed image and those of its original. The network Γ has two feature channels, "W" and "M". The image transformation here corresponds to a 180° rotation.
Figure 2: Visualisation of image/feature map alignment for rotated and translated MNIST images (top rows). STN-C1 fails to compensate for rotations but performs better for translations (middle rows). STN-SL1 finds a canonical pose both for rotated and translated images (bottom rows).
To investigate the practical implications of the inability of ST modules to support invariance when applied to CNN feature maps, we compared four different network configurations on rotated and translated MNIST and on the Street View House Numbers (SVHN) dataset: (i) a standard CNN (CNN); (ii) an STN with the ST module directly following the input (STN-C0); (iii) an STN with the ST module following convolutional layer X (STN-CX); and (iv) an STN that transforms the input but where the localisation network shares its first X layers with the classification network, which enables the use of more complex features to infer the transformation parameters (STN-SLX).
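The difference between the configurations lies in where the transformation parameters are computed and what gets transformed. The following sketch shows only this data flow; all function names are hypothetical, and the layers, localisation network and transformer are symbolic stand-ins that record the order of operations as strings rather than real network components.

```python
def run_layers(x, layers):
    """Apply a sequence of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def cnn(x, layers):
    """(i) Standard CNN: no ST module."""
    return run_layers(x, layers)


def stn_c0(x, layers, localisation, transform):
    """(ii) STN-C0: the ST module acts directly on the input."""
    return run_layers(transform(x, localisation(x)), layers)


def stn_cx(x, layers, localisation, transform, depth):
    """(iii) STN-CX: the ST module acts on the feature map after layer X."""
    features = run_layers(x, layers[:depth])
    features = transform(features, localisation(features))
    return run_layers(features, layers[depth:])


def stn_slx(x, layers, localisation_head, transform, depth):
    """(iv) STN-SLX: parameters from shared features, input is transformed."""
    shared_features = run_layers(x, layers[:depth])  # first X layers are shared
    theta = localisation_head(shared_features)
    return run_layers(transform(x, theta), layers)


# Symbolic stand-ins (assumption) that record the order of operations.
layers = [lambda x: f"conv1({x})", lambda x: f"conv2({x})"]
localisation = lambda x: f"theta[{x}]"
transform = lambda x, theta: f"warp({x}; {theta})"

trace_c0 = stn_c0("image", layers, localisation, transform)
trace_c1 = stn_cx("image", layers, localisation, transform, depth=1)
trace_sl1 = stn_slx("image", layers, localisation, transform, depth=1)
```

In the resulting traces, both STN-C1 and STN-SL1 predict the parameters from `conv1(image)`, but only STN-C1 warps the feature map; STN-SL1 warps the image itself and then runs the full classification network on it.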
Figure 2 and Figure 3 demonstrate that the transformation learned by STN-C1 does not correspond to pose alignment of rotated input images, while the transformation learned by STN-SL1 does. For translations, STN-C1 performs better, since a translation does not imply a shift in the feature map channel dimension. Thus, STN-C1 works better as an attention mechanism than as a means of compensating for image transformations. Table 1 shows that the inability of STN-C1 to align feature maps of rotated images leads to decreased classification performance. Table 2 shows that, while STN-CX suffers from a tradeoff between using deeper layer features and its inability to support invariance, STN-SLX can fully take advantage of deeper features.
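The claim that a translation causes no shift in the channel dimension can be illustrated with a small check. The sketch below assumes wrap-around (circular) boundaries so that the equality is exact; with zero padding it would hold only away from the image border. The two kernels and the test image are arbitrary choices made for the example.

```python
def circular_shift(grid, dy, dx):
    """Translate a 2-D list by (dy, dx) with wrap-around boundaries."""
    h, w = len(grid), len(grid[0])
    return [[grid[(i - dy) % h][(j - dx) % w] for j in range(w)]
            for i in range(h)]


def circular_correlate(image, kernel):
    """Cross-correlation with wrap-around boundaries (output keeps image size)."""
    h, w = len(image), len(image[0])
    return [
        [sum(image[(i + a) % h][(j + b) % w] * kernel[a][b]
             for a in range(len(kernel)) for b in range(len(kernel[0])))
         for j in range(w)]
        for i in range(h)
    ]


# Two arbitrary kernels (assumption) acting as a two-channel feature extractor.
KERNELS = [
    [[1, 0, 1], [1, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 1], [1, 0, 1]],
]


def gamma(image):
    """Toy feature extractor: one circular correlation per channel."""
    return [circular_correlate(image, kernel) for kernel in KERNELS]


# Arbitrary 5x5 test image.
f = [[(3 * i + 5 * j) % 7 for j in range(5)] for i in range(5)]

features_original = gamma(f)
features_translated = gamma(circular_shift(f, 1, 2))

# Inverse translation applied to each channel separately.
spatially_aligned = [circular_shift(channel, -1, -2)
                     for channel in features_translated]

alignment_succeeds = spatially_aligned == features_original
```

Unlike the rotation case, `alignment_succeeds` is true here: each channel of the translated image is simply a translated copy of the same channel of the original, so a spatial transformation of the feature map is sufficient.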
We have investigated the practical implications of the inability of an STN to align CNN feature maps to enable invariant recognition. Our results show that this inability is clearly visible in practice and, indeed, negatively impacts classification performance. When more complex features are needed to correctly estimate an image transformation, we thus advocate using deeper layer features by means of parameter sharing between the localisation and the classification network.
Figure 3: The rotation angle predicted by the ST module for MNIST images as a function of the rotation applied to the input image. STN-C1 has not learned to predict the image orientation (left). The reason for this is that a rotation is, in fact, not enough to align deeper layer feature maps. This is because a rotation of the feature map does not correspond to a rotation of the input. STN-SL1, which transforms the input, correctly predicts the image orientation (right).
Table 1: Classification error on rotated and translated MNIST data for the different network versions.
Table 2: Classification error on the SVHN dataset when transforming intermediate feature maps at different depths vs transforming the input but using parameter sharing between the localisation and the classification network.
[1] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
[2] C.-H. Lin and S. Lucey. Inverse compositional spatial transformer networks. In CVPR, pages 2568–2576, 2017, doi:10.1109/CVPR.2017.242.