Capsule Routing via Variational Bayes

2019·Arxiv

Abstract

Capsule networks are a recently proposed type of neural network shown to outperform alternatives in challenging shape recognition tasks. In capsule networks, scalar neurons are replaced with capsule vectors or matrices, whose entries represent different properties of objects. The relationships between objects and their parts are learned via trainable viewpoint-invariant transformation matrices, and the presence of a given object is decided by the level of agreement among votes from its parts. This interaction occurs between capsule layers and is a process called routing-by-agreement. In this paper, we propose a new capsule routing algorithm derived from Variational Bayes for fitting a mixture of transforming Gaussians, and show it is possible to transform our capsule network into a Capsule-VAE. Our Bayesian approach addresses some of the inherent weaknesses of MLE-based models, such as variance-collapse, by modelling uncertainty over capsule pose parameters. We outperform the state-of-the-art on smallNORB using 50% fewer capsules than previously reported, achieve competitive performances on CIFAR-10, Fashion-MNIST and SVHN, and demonstrate significant improvement in MNIST to affNIST generalisation over previous works.

1 Introduction

Capsule networks are a recently proposed method of learning part-whole relationships between observed entities in data, by using groups of neurons known as capsules. These entities could be anything that possesses a consistent underlying structure across viewpoints. Capsules attempt to encode intrinsic viewpoint-invariant properties, and learn to adjust instantiation parameters as the entity varies across its appearance manifold (Hinton, Krizhevsky, and Wang 2011). CapsNets have been shown to outperform standard Convolutional Neural Networks (CNNs) in specific tasks involving shape recognition and overlapping digit segmentation. These tasks are difficult for standard CNNs, as they struggle to exploit the frame of reference humans impose on objects, and thus often fail to generalise knowledge to novel viewpoints. Although this drawback can often be mitigated by data augmentation during training, it does not address the underlying issue directly. Nonetheless, CNNs perform remarkably well in practice, partly because they make structural assumptions that ring true with natural images. Capsules extend this rationale by assuming objects are composed of object parts, and that if we learn part-whole relationships perfectly then we can better generalise to novel viewpoints and affine transformations. In CNNs, the convolution operator and sparse weight sharing provide the useful property of equivariance under translation, enabling efficient spatial transfer of knowledge. CapsNets retain these benefits and only do away with pooling operations in favour of learning more robust representations for disentangling factors of variation with routing-by-agreement. Although promising, CapsNets remain underexplored, and few works thus far have proposed algorithmic improvements to the original formulations. In this paper, we propose a new capsule routing algorithm for fitting a mixture of transforming Gaussians via Variational Bayes, which offers increased training stability, flexibility and performance.

Figure 1: Depiction of Variational Bayes (VB) routing between adjacent capsule layers: with lower layer capsules (orange) and higher layer capsules (blue).

Capsule Networks CapsNets are composed of at least one layer of capsules in which capsules i from a lower layer (children) are routed to capsules j in a higher layer (parents). Each layer contains multiple capsules, each of which has a pose matrix $M$ of instantiation parameters and an activation probability $a$ (see Figure 2). The pose matrix may learn to encode the relationship of an entity to the viewer, and the activation probability represents its presence. Each lower level capsule uses its pose matrix $M_i$ to posit a vote for what the pose of a higher level capsule should be, by multiplying it with a trainable viewpoint-invariant transformation weight matrix

$$V_{j|i} = M_i W_{ij}, \tag{1}$$

where $V_{j|i}$ denotes the vote coming from capsule i to capsule j, and $W_{ij}$ is the trainable transformation matrix. To compute the pose matrix $M_j$ of any higher level capsule j we can simply take a weighted mean of the votes it received from the capsules i in the layer below, as in EM routing (Hinton, Sabour, and Frosst 2018): $M_j = \frac{1}{R_j} \sum_i R_{ij} V_{j|i}$, where $R_{ij}$ represents the posterior responsibilities of each capsule j for capsules i, and $R_j = \sum_i R_{ij}$. These routing coefficients can be tuned via a variant of the EM algorithm for Gaussian Mixtures, and are updated according to the agreement between $V_{j|i}$ and $M_j$, which in Dynamic routing (Sabour, Frosst, and Hinton 2017) for example, is simply the scalar product between capsule vectors and can be trivially extended to matrices with the Frobenius inner product $\langle V_{j|i}, M_j \rangle_F$. Lastly, a parent capsule j is only activated if there is a measurably high agreement among the votes $V_{j|i}$ from child capsules i for its pose matrix $M_j$, that is, if the votes form a tight cluster in pose space.
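The vote computation and the responsibility-weighted mean of votes described above can be illustrated with a small sketch. This is only an illustration in plain Python: the helper names `matmul` and `parent_pose`, the toy transformation matrices and the responsibility values are all hypothetical, and a single parent capsule is considered.

```python
def matmul(A, B):
    """Multiply two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def parent_pose(child_poses, transforms, resp):
    """Votes and weighted-mean pose for one parent capsule j.

    child_poses[i] : 4x4 pose matrix of child i
    transforms[i]  : 4x4 transformation matrix from child i to the parent
    resp[i]        : responsibility of the parent for child i
    """
    # Each child votes for the parent's pose: vote = child pose x transform.
    votes = [matmul(M, W) for M, W in zip(child_poses, transforms)]
    total = sum(resp)
    n = len(votes[0])
    # Parent pose is the responsibility-weighted mean of the votes.
    pose = [[sum(r * V[a][b] for r, V in zip(resp, votes)) / total
             for b in range(n)] for a in range(n)]
    return votes, pose


identity = [[1.0 if a == b else 0.0 for b in range(4)] for a in range(4)]
scale2 = [[2.0 * x for x in row] for row in identity]

# Two children with identity poses; the second transform doubles the pose.
votes, pose = parent_pose([identity, identity], [identity, scale2], [3.0, 1.0])
```

With responsibilities 3 and 1, the parent pose sits three quarters of the way towards the first child's vote, which is the sense in which well-supported votes dominate the parent's pose.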

Motivation & Contributions In this paper, we propose a new capsule routing algorithm derived from Variational Bayes. We show that our probabilistic approach provides advantages over previous routing algorithms, including more flexible control over capsule complexity by tuning priors to induce sparsity, and a reduction of the well known variance-collapse singularities inherent to MLE-based mixture models such as EM. Contextually, these singularities occur in part due to the single parent assumption, whereby a parent capsule (Gaussian cluster) can claim sole custody of a child capsule (datapoint), yielding infinite likelihood and zero variance. This leads to overfitting and unstable training. By modelling uncertainty over the capsule parameters as well as the routing weights, we can avoid these singularities in a principled way, without adding arbitrary constants of minimum variance to ensure numerical stability, which can affect performance in EM. Furthermore, we provide some insight into capsule network training for practitioners, including weight initialisation and normalisation schemes that improve training performance. Lastly, we show it is possible to transform our capsule network into a Capsule-VAE by sampling latent code from capsule parameter approximate posteriors. We outperform the state-of-the-art on smallNORB using 50% fewer capsules than previously reported, achieve highly competitive performances on CIFAR-10, Fashion-MNIST and SVHN, and demonstrate significant improvement in MNIST to affNIST generalisation over previous works.
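The variance-collapse singularity mentioned above can be seen numerically in a toy setting. The sketch below is an assumption-laden illustration, not part of the proposed method: it uses a hypothetical one-dimensional two-component Gaussian mixture in which one component is centred exactly on a single datapoint, and shows that the maximum-likelihood objective keeps growing as that component's standard deviation shrinks.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def gmm_log_likelihood(data, means, stds, mix):
    """Log likelihood of the data under a univariate Gaussian mixture."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, s) for p, m, s in zip(mix, means, stds)))
        for x in data
    )


data = [-1.0, 0.0, 0.5, 3.0]

# Component 0 stays broad; component 1 sits exactly on the datapoint x = 3.0
# and its standard deviation is shrunk towards zero.
lls = [gmm_log_likelihood(data, [0.0, 3.0], [1.0, s], [0.5, 0.5])
       for s in (1.0, 1e-2, 1e-4)]
```

Each hundredfold reduction of the standard deviation adds roughly ln(100) to the log likelihood, so there is no finite maximiser; this is the behaviour that priors over the capsule parameters are meant to rule out.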

2 Variational Bayes Capsule Routing

Next, we briefly outline some necessary background on Variational Inference (VI), before contextualising some of these ideas with our proposed capsule routing algorithm.

2.1 Variational Inference

The Evidence Lower Bound Let x denote the observed data, z denote latent variables associated with x, and let $\theta$ represent some model parameters. Typically we would like to infer the unknown latent variables by evaluating the conditional $p(z \mid x, \theta)$, which is the posterior on z. However, this distribution cannot be computed for most complex models due to the intractability of a normalising integral. VI provides an elegant solution to posterior inference by posing it as an optimisation problem. We approximate the posterior by choosing a variational distribution $q_\phi(z)$ over the latent variables from a tractable family, with its own variational parameters $\phi$. We can measure the quality of our approximation via the Kullback-Leibler (KL) divergence between the two distributions, which can be minimised via the variational parameters

$$\phi^\star = \arg\min_{\phi} \mathrm{KL}\big(q_\phi(z) \,\|\, p(z \mid x, \theta)\big). \tag{2}$$

However, since $p(z \mid x, \theta)$ is unknown we cannot minimise the KL directly, so instead we maximise the variational lower bound (ELBO) on the log marginal likelihood (dropping the dependence on $\theta$ and $\phi$ for brevity)

$$\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big), \tag{3}$$

where the ELBO can be derived using Jensen's inequality applied to $\log p(x)$, giving

$$\log p(x) = \log \int q(z) \frac{p(x, z)}{q(z)}\,dz \;\geq\; \mathbb{E}_{q}[\log p(x, z)] - \mathbb{E}_{q}[\log q(z)] = \mathrm{ELBO}(q). \tag{4}$$

Here we use the joint $p(x, z)$, which is tractable, rather than the unknown posterior $p(z \mid x)$. Recall that from the product rule of probability we simply have that $p(x, z) = p(z \mid x)\,p(x)$. Since the log marginal likelihood of the data does not depend on $q$ and the KL divergence is non-negative, maximising the ELBO is equivalent to minimising the KL divergence.
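The relationship between the log marginal likelihood, the ELBO and the KL divergence can be checked numerically on a toy model. The following is a minimal sketch, assuming a discrete latent variable with three states and a made-up joint table for one fixed observation; the numbers and the names `joint`, `elbo` and `kl` are purely illustrative.

```python
import math

# Hypothetical joint p(x, z) for z = 0, 1, 2 evaluated at one fixed observation x.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # marginal likelihood p(x)
posterior = [p / evidence for p in joint]   # exact posterior p(z | x)


def elbo(q):
    """E_q[log p(x, z)] - E_q[log q(z)] for a discrete q."""
    return sum(qz * (math.log(pz) - math.log(qz)) for qz, pz in zip(q, joint))


def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))


q = [0.5, 0.3, 0.2]  # an arbitrary variational distribution
gap = math.log(evidence) - (elbo(q) + kl(q, posterior))
```

Here `gap` is zero up to floating-point error for any valid `q`, and the ELBO reaches the log evidence exactly when `q` equals the true posterior.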

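Mean Field A popular way of performing VI is to posit a factorised form of the approximating family of distributions $q(z)$, such that each variable is assumed to be independent

$$q(z) = \prod_{i=1}^{m} q_i(z_i). \tag{5}$$

Recall that the log marginal is given by $\log p(x) = \log \int p(x, z)\,dz$, and therefore the factorised objective to be maximised can be written in the following form

$$\mathrm{ELBO}(q) = \int \prod_{i=1}^{m} q_i(z_i) \log p(x, z)\,dz - \sum_{i=1}^{m} \int q_i(z_i) \log q_i(z_i)\,dz_i. \tag{6}$$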
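As a worked step that the text leaves implicit, the factorised objective can be maximised one factor at a time. The derivation below follows the standard mean-field argument in (Bishop 2006); the symbol $\tilde{p}$ is introduced here only for illustration.

```latex
% Isolate a single factor q_j while holding all other factors fixed.
\mathrm{ELBO}(q)
  = \int q_j(z_j)\, \mathbb{E}_{i \neq j}\big[\log p(x, z)\big]\, dz_j
    - \int q_j(z_j) \log q_j(z_j)\, dz_j + \text{const}
  = -\mathrm{KL}\big(q_j(z_j) \,\|\, \tilde{p}(x, z_j)\big) + \text{const},
\qquad
\log \tilde{p}(x, z_j) = \mathbb{E}_{i \neq j}\big[\log p(x, z)\big] + \text{const}.

% The KL term is minimised (ELBO maximised) when q_j equals \tilde{p}, hence
\log q_j^\star(z_j) = \mathbb{E}_{i \neq j}\big[\log p(x, z)\big] + \text{const}.
```

Cycling through the factors with this update never decreases the ELBO, which is why alternating closed-form updates are possible when the priors are conjugate.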

Figure 2: Architectural depiction of our capsule network with Variational Bayes routing between convolutional capsule layers. Each capsule has an activation probability a and a pose matrix M. Parent capsules j (blue) only receive votes from child capsules i (orange) within their receptive field. c and d denote the number of child and parent capsule types respectively.

2.2 Variational Bayes for a Mixture of Transforming Gaussians

Relation to Clustering Capsule routing naturally resembles clustering logic. This is reflected in the fact that any higher layer parent capsule j (cluster) is composed of, and receives votes from, many lower layer child capsules i (data points) within its receptive field (see Figure 2 for intuition).

However, capsule routing does differ from regular clustering substantially, as every cluster has its own learnable viewpoint-invariant transformation matrix with which it transforms its data points, and predictions are made by measuring similarity among them. Therefore, each cluster sees a different view of the data, and the algorithm converges much faster since it’s easier to break symmetry compared to simply initialising the gaussian clusters with different means (Hinton, Sabour, and Frosst 2018). Next we propose our capsule routing algorithm borrowing some ideas from (Bishop 2006), and begin by picking up from our general description of capsule networks in section 1.

Proposed Method Let $v_{j|i} \in \mathbb{R}^D$ denote a vectorised version of the 4x4 votes matrix $V_{j|i}$, and let $\mu_j \in \mathbb{R}^D$ denote a vectorised version of capsule pose matrix $M_j$, where D = 16. Assuming independence, consider the log likelihood function maximised in a Gaussian Mixture Model (GMM), applied to routing capsules i from a lower layer to capsules j in a higher layer

$$\log p(v \mid \pi, \mu, \Lambda) = \sum_{i} \log \sum_{j} \pi_j\, \mathcal{N}\big(v_{j|i} \mid \mu_j, \Lambda_j^{-1}\big). \tag{7}$$

In EM routing, point estimates of the parameters $\mu_j$ and $\Lambda_j$ are computed in the M-step, and the routing probabilities $R_{ij}$ are evaluated in the E-step. The mixing coefficients $\pi_j$, however, are replaced with activations $a_j$, which represent the probability of cluster j being switched on and are computed by a shifted logistic non-linearity. The $a_j$'s play the role of the mixing proportions but need not sum to one ($\sum_j a_j \neq 1$). Recall from section 1 that the votes play the role of the data points and are computed as $V_{j|i} = M_i W_{ij}$, using different transformation matrices $W_{ij}$ for each capsule j.

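In order to model uncertainty over the capsule parameters in our algorithm, we place conjugate priors over $\pi$, $\mu$ and $\Lambda$. Our model's generative process for any lower layer capsule i's vectorised pose $\mu_i = \mathrm{vec}(M_i)$ can be derived from the following

$$\pi \sim \mathrm{Dir}(\alpha_0), \quad \Lambda_j \sim \mathcal{W}(\Psi_0, \nu_0), \quad \mu_j \sim \mathcal{N}\big(m_0, (\kappa_0 \Lambda_j)^{-1}\big), \quad z_i \sim \mathrm{Cat}(\pi), \quad v_{j|i} \mid z_{ij} = 1 \sim \mathcal{N}\big(\mu_j, \Lambda_j^{-1}\big), \tag{8}$$

and $\mu_i$ can be retrieved by simply inverting the vectorised vote transformation $v_{j|i} = \mathrm{vec}(M_i W_{ij})$. The joint distribution of the model factorises as $p(v, z, \pi, \mu, \Lambda) = p(v \mid z, \mu, \Lambda)\,p(z \mid \pi)\,p(\pi)\,p(\mu \mid \Lambda)\,p(\Lambda)$, where the latent variables z are a collection of one-hot vectors denoting the cluster assignments of each of the lower capsules' votes $v_{j|i}$ to their corresponding higher capsules' Gaussians. Following from the VI discussion in section 2.1, we approximate the posterior with a factorised variational distribution

$$p(z, \pi, \mu, \Lambda \mid v) \approx q(z)\, q(\pi, \mu, \Lambda), \tag{9}$$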

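and we choose conjugate priors that factor in the following standard form as in Bayesian Gaussian Mixtures

$$p(\pi)\,p(\mu, \Lambda) = \mathrm{Dir}(\pi \mid \alpha_0) \prod_{j} \mathcal{N}\big(\mu_j \mid m_0, (\kappa_0 \Lambda_j)^{-1}\big)\, \mathcal{W}(\Lambda_j \mid \Psi_0, \nu_0), \tag{10}$$

where $\alpha_0$ is the Dirichlet concentration, $m_0$ and $\kappa_0$ are the prior mean and precision scaling, and $\Psi_0$ and $\nu_0$ are the Wishart scale matrix and degrees of freedom.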

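To parameterise diagonal precisions in practice, we simply let $\lambda_j \in \mathbb{R}^D$ represent the diagonal entries of $\Lambda_j$, and replace the Gaussian-Wishart prior with Gaussian-Gamma priors over each diagonal entry as follows

$$p(\mu, \lambda) = \prod_{j} \prod_{d=1}^{D} \mathcal{N}\big(\mu_{jd} \mid m_0, (\kappa_0 \lambda_{jd})^{-1}\big)\, \mathrm{Ga}(\lambda_{jd} \mid s_0, \nu_0). \tag{11}$$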

In order to perform routing, we simply iterate between optimising the parent capsule parameter distributions with the responsibilities over child capsules held fixed, and evaluating the new expected responsibilities with the current distributions over parent capsule parameters held fixed. See Algorithm 1 for the standard closed-form update equations, which assume the same functional form as the priors through conjugacy; for further details refer to (Bishop 2006).
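The alternating scheme described above can be sketched in a few lines of plain Python. This is not a reproduction of Algorithm 1: it is a minimal illustration that assumes the textbook variational updates for a Bayesian Gaussian mixture (Bishop 2006), specialised to diagonal precisions with a Gamma prior in a shape/rate parameterisation. The names `vb_routing`, `digamma`, `a0`, `b0` and the toy votes are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def vb_routing(votes, child_act, n_iters=3, alpha0=1.0, kappa0=1.0, m0=0.0, a0=1.0, b0=1.0):
    """votes[i][j] is child i's D-dimensional vote for parent j (assumed layout)."""
    I, J, D = len(votes), len(votes[0]), len(votes[0][0])
    r = [[1.0 / J] * J for _ in range(I)]  # uniform initial responsibilities
    for _ in range(n_iters):
        # Step 1: update each parent's parameter posteriors with r held fixed.
        alpha, kappa, m, a, b = [], [], [], [], []
        for j in range(J):
            w = [r[i][j] * child_act[i] for i in range(I)]  # activation-weighted
            N = max(sum(w), 1e-12)
            alpha.append(alpha0 + N)
            kappa.append(kappa0 + N)
            m_j, a_j, b_j = [], [], []
            for d in range(D):
                xbar = sum(w[i] * votes[i][j][d] for i in range(I)) / N
                S = sum(w[i] * (votes[i][j][d] - xbar) ** 2 for i in range(I))
                m_j.append((kappa0 * m0 + N * xbar) / (kappa0 + N))
                a_j.append(a0 + 0.5 * N)
                b_j.append(b0 + 0.5 * S + 0.5 * kappa0 * N * (xbar - m0) ** 2 / (kappa0 + N))
            m.append(m_j)
            a.append(a_j)
            b.append(b_j)
        # Step 2: update responsibilities with the parameter posteriors held fixed.
        psi_sum = digamma(sum(alpha))
        for i in range(I):
            log_rho = []
            for j in range(J):
                lp = digamma(alpha[j]) - psi_sum  # E[ln pi_j]
                for d in range(D):
                    e_ln_lam = digamma(a[j][d]) - math.log(b[j][d])
                    e_quad = 1.0 / kappa[j] + (a[j][d] / b[j][d]) * (votes[i][j][d] - m[j][d]) ** 2
                    lp += 0.5 * (e_ln_lam - math.log(2.0 * math.pi) - e_quad)
                log_rho.append(lp)
            top = max(log_rho)
            unnorm = [math.exp(v - top) for v in log_rho]
            total = sum(unnorm)
            r[i] = [u / total for u in unnorm]
    return r, m, a, b, alpha


# Toy data: children 0-2 agree on parent 0, children 3-5 agree on parent 1.
tight = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]]
scatter = [[5.0, -4.0], [-6.0, 3.0], [0.0, 7.0]]
neg = lambda p: [-x for x in p]
votes = [[tight[i], neg(scatter[i])] for i in range(3)] + \
        [[scatter[i], neg(tight[i])] for i in range(3)]
r, m, a, b, alpha = vb_routing(votes, child_act=[1.0] * 6)
```

On this toy input the responsibilities move towards the parent whose votes agree, since each parent sees its own transformed view of the children. The activation step that follows routing is deliberately left out of the sketch.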

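Agreement & Activation We propose to measure agreement between the votes from lower capsules i using the differential entropy of a higher capsule j's Gaussian-Wishart variational posterior distribution $q^\star(\mu_j, \Lambda_j)$. Firstly, the differential entropy of a multivariate Gaussian distributed random variable x is by definition given by

$$\mathcal{H}[x] = -\int f(x) \ln f(x)\,dx = \frac{1}{2} \ln |\Sigma| + \frac{D}{2}(1 + \ln 2\pi). \tag{12}$$

Let f(x) be capsule j's variational posterior: $q^\star(\mu_j, \Lambda_j) = \mathcal{N}\big(\mu_j \mid m_j, (\kappa_j \Lambda_j)^{-1}\big)\,\mathcal{W}(\Lambda_j \mid \Psi_j, \nu_j)$, where $m_j$, $\kappa_j$, $\Psi_j$ and $\nu_j$ are the updated prior parameters for a capsule j as detailed in Algorithm 1. We then approximate the entropy

$$\mathcal{H}\big[q^\star(\mu_j, \Lambda_j)\big] \approx \frac{D}{2}(1 + \ln 2\pi) - \frac{1}{2}\,\mathbb{E}\big[\ln |\Lambda_j|\big], \qquad \mathbb{E}\big[\ln |\Lambda_j|\big] = \sum_{i=1}^{D} \psi\!\left(\frac{\nu_j + 1 - i}{2}\right) + D \ln 2 + \ln |\Psi_j|, \tag{13}$$

where $\psi(\cdot)$ is the digamma function, and we use $\mathbb{E}[\ln |\Lambda_j|]$ to indirectly measure the differential entropy of capsule j's variational posterior distribution, up to constant factors. Intuitively, the determinant of the precision matrix measures the concentration of data points across the volume defined by the matrix. The higher the concentration, the higher the agreement among votes for capsule j. To compute any capsule j's activation probability $a_j$, we pass both its mixing proportion and posterior entropy, as a measure of vote agreement, through a logistic non-linearity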

where $\beta_a$ and $\beta_u$ are learnable offset parameters as in (Hinton, Sabour, and Frosst 2018). Unlike EM or Dynamic routing, we only activate the capsules after the routing iterations. We find this to have a stabilising effect during training, and we can add in the expected mixing coefficients as a weight on the differential entropy of each capsule, encouraging a trade-off between activating the capsule with the most votes and our measure of how concentrated they are. This decision is in part motivated by context-dependent weighted information and entropy principles, wherein two separate low probability events incurring equally high surprisal can yield contextually unequal informative value (Guiaşu 1971).

Note that the updated prior parameters $m_j$, $\kappa_j$, $\Psi_j$ and $\nu_j$ depend on the routing weights $R_{ij}$, which represent the amount of data assigned to capsule j, weighted by the previous capsule layer activations. From the perspective of any capsule j's cluster, previous layer activations simply dictate how important each data point is.
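The link between concentration and entropy can be made concrete with the closed-form Gaussian entropy. The sketch below is only an illustration under a simplifying assumption: it evaluates the entropy of a Gaussian with fixed diagonal precision entries, rather than the expectation under the variational posterior, and the function name and precision values are hypothetical.

```python
import math


def diag_gaussian_entropy(precisions):
    """Differential entropy (in nats) of a Gaussian with diagonal precision."""
    d = len(precisions)
    # ln|Sigma| = -ln|Lambda| for a precision matrix Lambda.
    log_det_precision = sum(math.log(lam) for lam in precisions)
    return -0.5 * log_det_precision + 0.5 * d * (1.0 + math.log(2.0 * math.pi))


loose = diag_gaussian_entropy([0.5] * 16)   # spread-out votes: low precision
tight = diag_gaussian_entropy([50.0] * 16)  # agreeing votes: high precision
```

A larger log-determinant of the precision gives a lower entropy, so a capsule whose votes form a tight cluster scores as high agreement.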

Table 1: Test error rate comparisons with CapsNet literature. () denotes ensemble size, and () denotes our EM implementation.

2.3 Capsule-VAE

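It is possible to transform our CapsNet into a Variational Autoencoder (VAE) (Kingma and Welling 2013) by sampling from the approximate variational posterior on the capsule parameters $q^\star(\mu_j, \Lambda_j)$. We can do so by saving the updated prior parameters $m_j$, $\kappa_j$, $\Psi_j$ and $\nu_j$ at the end of the routing procedure of the final layer, and outputting the capsule means and precisions as latent code. Recall that the approximate posterior on the mean and precision of any capsule j is a Gaussian-Wishart $\mathcal{N}\big(\mu_j \mid m_j, (\kappa_j \Lambda_j)^{-1}\big)\,\mathcal{W}(\Lambda_j \mid \Psi_j, \nu_j)$, and we can sample from this distribution in the following way

$$\Lambda_j \sim \mathcal{W}(\Psi_j, \nu_j), \qquad \mu_j \sim \mathcal{N}\big(m_j, (\kappa_j \Lambda_j)^{-1}\big). \tag{15}$$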

It is straightforward to condition the sample on the target class capsule during training based on the label, and make the process differentiable using the reparameterisation trick

where , and . This formulation also reduces computational time since we can avoid explicitly rerunning VB for each sample. Capsule-VAEs are interesting models as the output latent code is composed of capsule instantiation parameters, and we know from (Sabour, Frosst, and Hinton 2017) that each capsule dimension learns to encode different variations of object properties that we can visualise and tweak. We leave further exploration of these ideas and analysis of Capsule-VAEs to future work.
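Drawing a latent code from a capsule's posterior can be sketched with ancestral sampling: first a precision, then a mean given that precision. The snippet below is an assumption-heavy illustration for the diagonal case only, using a Gamma distribution in a shape/rate parameterisation in place of the Wishart. It is plain sampling rather than the reparameterised, differentiable version discussed above, and the function name and parameter values are hypothetical.

```python
import random


def sample_capsule(m, kappa, a, b, rng):
    """Sample (mu, lam) for one capsule with diagonal precision.

    m, a, b are per-dimension lists; a is a Gamma shape and b a Gamma rate.
    """
    # random.gammavariate takes (shape, scale), so the rate b is inverted.
    lam = [rng.gammavariate(a_d, 1.0 / b_d) for a_d, b_d in zip(a, b)]
    # Mean given precision: mu_d ~ N(m_d, 1 / (kappa * lam_d)).
    mu = [rng.gauss(m_d, (kappa * lam_d) ** -0.5) for m_d, lam_d in zip(m, lam)]
    return mu, lam


rng = random.Random(0)
draws = [sample_capsule([2.0, -1.0], 10.0, [50.0, 50.0], [50.0, 50.0], rng)
         for _ in range(2000)]
mean_mu0 = sum(mu[0] for mu, _ in draws) / len(draws)
mean_lam0 = sum(lam[0] for _, lam in draws) / len(draws)
```

Averaged over many draws, the sampled means concentrate around `m` and the sampled precisions around `a / b`, which gives a quick sanity check on the parameterisation.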

3 Related Work

Capsules were first introduced by (Hinton, Krizhevsky, and Wang 2011), wherein the encoding of instantiation parameters was established in a transforming autoencoder. More recently, work by (Sabour, Frosst, and Hinton 2017) achieved state-of-the-art performance on MNIST with a shallow CapsNet, using a Dynamic routing algorithm. Shortly after, EM routing was proposed in (Hinton, Sabour, and Frosst 2018), replacing capsule vectors with matrices to reduce the number of parameters. State-of-the-art performance was achieved on smallNORB, outperforming CNNs. More recently, Group Equivariant CapsNets were proposed in (Lenssen, Fey, and Libuschewski 2018), leveraging ideas from group theory to guarantee equivariance and invariance properties. In (Zhang, Zhou, and Wu 2018) a new routing algorithm based on kernel density estimation was proposed, providing a speed-up compared to EM routing. Capsules have also been extended to action recognition in videos by (Duarte, Rawat, and Shah 2018), where they propose to average the votes before routing them for speed. Work in (Zhang, Edraki, and Qi 2018) proposes learning groups of capsule subspaces and projecting embedded features onto these subspaces. Despite these interesting works among others, CapsNets are still difficult to train and the original state-of-the-art benchmarks are yet to be beaten fairly.

4 Experiments

Capsule Network Architecture Our CapsNet follows the EM routing formulation and comprises 4 capsule layers, starting with a primary capsule (PrimaryCaps) layer followed by 3 convolutional capsule (ConvCaps) layers. The stem of the network consists of a layer using F filters and stride 2, and is followed by two Conv layers with F filters each, all using BatchNorm and ReLU activations. The PrimaryCaps layer transforms the F filters into capsule pose 4x4 matrices and activations using convolutions. This is followed by a ConvCaps layer with capsule types and stride 2, and a layer with capsule types and stride 1. The final ConvCaps layer shares weight matrices across spatial dimensions, yielding one capsule per class, and we perform coordinate addition as in (Hinton, Sabour, and Frosst 2018). In summary, we describe our network architectures using the notation .

Figure 3: Direct comparison between VB and EM routing validation set error using identical networks and hyperparameters.

Objective Function We experiment with both a negative log likelihood loss and the spread loss in (Hinton, Sabour, and Frosst 2018), then add the VAE loss as an optional capsule reconstruction-based regulariser

The total loss is a linear combination of a classification loss and the optional VAE loss i.e. . CapsNet regularisation by reconstruction was first proposed in (Sabour, Frosst, and Hinton 2017) with a fully-connected decoder; in our VAE we use a simple 5-layer deconvnet.

Uninformative Priors We set the Gaussian priors on the mean parameters to be zeros with precision scaling , and the Wishart priors on the precision matrix to be identities with degrees of freedom . For the diagonal case, is a vector of 1's. These priors have a regularising effect since they encourage the parent capsule clusters j to remain close to the origin, and not to be too irregular in shape. The Dirichlet prior on the mixing coefficients is set to 1, and reducing this value favours routing solutions with fewer active parent capsules. In section 4.4, we provide some analysis on sensitivity to prior initialisations.

Weight Initialisation CapsNets are known to be difficult to train; in fact, the EM routing results were yet to be fairly matched before this paper. With that said, we provide some suggestions for practitioners on how to initialise the various parameters of the model that worked well for us experimentally, and helped stabilise training significantly. We offer the following two ways of initialising the viewpoint-invariant transformation weight matrices:

(i) As identities I with added random uniform noise on the off-diagonal entries. In this way, at the start of training the capsule pose transformations don't stray too far from computing the identity function, which we find to have a stabilising effect.

(ii) To help maintain constant variance of activations across capsule layers and help avoid exploding/vanishing gradients, we propose initialising with a modified (Glorot and Bengio 2010) scheme as

where and denote the number of capsule types in layers and is the convolutional kernel size and is the number of neurons per capsule matrix (). Lastly, we also normalise the argument of the logistic function for $a_j$ using BatchNorm without the learnable parameters $\gamma$ and $\beta$. This keeps the input values from becoming too high or too low and helps prevent vanishing gradients.
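Option (i) above can be sketched as follows. This is a minimal illustration, not the exact initialisation used in the experiments: the function name `init_transform` and the noise range of 0.1 are assumptions, since the text does not state the magnitude of the uniform noise.

```python
import random


def init_transform(size=4, noise=0.1, rng=None):
    """Identity matrix with uniform noise on the off-diagonal entries only."""
    rng = rng or random.Random()
    return [[1.0 if r == c else rng.uniform(-noise, noise) for c in range(size)]
            for r in range(size)]


# One 4x4 transformation matrix; in a network there would be one per
# (child type, parent type, kernel position) combination.
W = init_transform(rng=random.Random(0))
```

Because the diagonal is left at exactly one, each vote starts out close to the child's own pose, which matches the stated aim of staying near the identity function early in training.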

4.1 Image Classification Results

The main comparative results are reported in Table 1, using smallNORB (LeCun et al. 2004), Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), SVHN (Netzer et al. 2011) and CIFAR-10 (Krizhevsky, Hinton, and others 2009). In all cases, we use the diagonal parameterisation in Eq. (11), 3 VB routing iterations and a batch size of 32. All hyperparameters were tuned using validation sets, then models were retrained with the full training set until convergence before testing.

smallNORB smallNORB consists of grey-level stereo 96x96 images of 5 objects. Each object is given at 18 different azimuths (0-340), 9 elevations and 6 lighting conditions, and the training and test sets contain 24,300 images each. Following (Hinton, Sabour, and Frosst 2018), we standardise and resize all images to 48x48 and take random 32x32 crops during training. At test time, we simply center crop the images to 32x32. Our best model {64, 16, 16, 16, 5} was trained for 350 epochs using Adam, loss, and a 3e-3 initial learning rate with exponential decay. A 20% validation split of the training set was used to tune hyperparameters. As reported in Table 1, we achieve a best test error rate of 1.55% (over 5 runs) compared to the previous state-of-the-art 1.8% reported in (Hinton, Sabour, and Frosst 2018). Note that by averaging multiple crops at test time they obtain 1.4%, whereas we reach 1.29%. Our result is obtained without adding random brightness/contrast or any other augmentations/deformations during training. We also stress that our capsule network has 50% fewer capsules.

Table 2: Comparing novel viewpoint generalisation. () denotes our implementation of EM with same network as VB.

Figure 4: Test error (%) sensitivity to priors (Right), and runtime/error comparisons using {3,2,1} routing iters (Left).

Fashion-MNIST Fashion-MNIST is a more difficult version of MNIST comprised of 10 clothing item classes. The images are 28x28 and the training/test sets have 60,000 and 10,000 examples respectively. We normalise and pad to 36x36, and randomly crop 32x32 image patches during training. At test time we pad the images to 32x32. Our best model {64, 16, 16, 16, 10} was trained for 200 epochs using loss, with SGDM and a weight decay of 1e-6. The initial learning rate was set to 0.1 with step decay at 80, 120, 160 epochs and a decay rate of 0.1. As reported in Table 1 we achieve a best test error rate of 5.15% (over 3 runs) outperforming other works with fewer parameters.

SVHN SVHN comprises challenging real-world 32x32 images of house numbers (10 digit classes). We trained on the core training set only, consisting of 73,257 examples and tested on the 26,032 in the test set. We normalise and pad to 40x40 and take random 32x32 crops during training. Our best model {64, 16, 32, 32, 10} was trained for 350 epochs using loss with SGDM. The initial learning rate was set to 0.1 with step decay at 150, 250, 300 epochs and a decay rate of 0.1. As reported in Table 1, we achieved a best test error of 3.87% (over 3 runs), outperforming the Dynamic routing capsules (Sabour, Frosst, and Hinton 2017) and others, with significantly fewer parameters.
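The step-decay schedule quoted above (initial rate 0.1, decayed by a factor of 0.1 at epochs 150, 250 and 300) can be written out explicitly. The sketch below assumes the decay takes effect from the milestone epoch onwards, which is a common convention but is not stated in the text; the function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=0.1, milestones=(150, 250, 300), gamma=0.1):
    """Learning rate after applying one decay per milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Learning rate at the start, after each milestone, and in the last epoch.
schedule = [step_decay_lr(e) for e in (0, 150, 250, 300, 349)]
```

Under this assumption the rate is 0.1 for the first 150 epochs and ends three orders of magnitude lower for the final 50 epochs.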

CIFAR-10 CIFAR-10 consists of 60,000 32x32 colour images of 10 classes. There are 50,000 training and 10,000 test images. We normalise and pad to 40x40, and randomly crop 32x32 patches during training. We also apply random horizontal flips with probability . Our best model {64, 16, 32, 32, 10} was trained for 350 epochs using loss with SGDM. Initial learning rate was 0.1 with step decay at 150, 250, 300 epochs and decay rate of 0.1. We achieved a best test error of 11.14% (over 3 runs), which is lower than EM routing (Hinton, Sabour, and Frosst 2018), while using considerably fewer parameters than other capsule works (Table 1). CIFAR-10 is the most challenging of the 4 datasets, and to get better performance, a deeper network is required for learning better representations. To test this hypothesis, we simply replaced the stem of our capsule network with 4 residual blocks (8 layers), and achieved a much lower test error rate of 7.8%, outperforming even deeper Residual Networks (He et al. 2016).

Figure 5: MNIST to affNIST generalisation performance comparisons. () denotes unsupervised learning was used, and the light blue line denotes matched performance on MNIST test set before testing on affNIST for fairness.

4.2 Generalisation to Novel Viewpoints

In order to verify that our proposed capsule routing algorithm preserves generalisation to novel viewpoints, we trained our {64, 16, 16, 16, 5} model on the smallNORB training data containing azimuths of (300, 320, 340, 0, 20, 40), and tested on the test data containing azimuths from 60 to 280. For elevation viewpoints, we trained on the 3 smaller and tested on the 6 larger elevations. During training, we validated using the portion of test data containing the same viewpoints as in training and measured the generalisation to novel viewpoints after matching the performance on familiar ones. As reported in Table 2, we compare VB routing to the original EM routing performance in (Hinton, Sabour, and Frosst 2018) as well as our implementation of EM using the same network for fairness. In our experiments, VB routing does not sacrifice the ability to generalise to novel viewpoints, and outperforms EM routing in all cases.

4.3 Affine Transformation Robustness

To further demonstrate our method's generalisation and invariance to affine transformations, we train our {64, 16, 16, 16, 10} CapsNet on MNIST, and assess generalisation performance on the affNIST test set. AffNIST images are 40x40, so we train by randomly padding MNIST training set images, as done in the works we compare to. We achieve a significantly higher generalisation accuracy of 98.1% than previous works (Figure 5). For fairer comparisons, we also match the 99.2% test set accuracy on MNIST reported in Dynamic/EM routing, before testing on the affNIST test set, achieving 96.9%.

4.4 Sensitivity to Prior Hyperparameters

We took our {64, 16, 16, 16, 5} CapsNet, and performed sensitivity analysis on the hyperparameters of the Wishart and Dirichlet priors, with respect to test error on Fashion-MNIST (Figure 4). We initialise the Wishart prior scale matrices as identities scaled by coefficients {0.01, 0.1, 1, 10}. The same coefficients were used for initialising the Dirichlet prior parameter . In general, we find that our models are quite robust to prior initialisations in terms of final test set performance, whereas convergence speed is mildly affected. It is also possible to learn prior parameters from data via backpropagation (à la empirical Bayes), avoiding manual tuning altogether. We tested this on the Dirichlet and observed no performance degradation (5.19% compared to ).

4.5 VB vs. EM Routing

For direct comparisons with the leading capsule routing algorithm, we took our best performing models for each dataset and replaced VB with our implementation of EM. Table 1 and Figure 3 report VB outperforming EM in terms of convergence rate, stability, and final test error with identical networks. VB routing is also almost 20% faster than EM. This is partly because capsule priors don't require gradient updates, and mainly because we propose to measure agreement and activate capsules only after the routing iterations. As shown in Figure 4, our method compares favourably, and we find that the number of VB routing iterations has a bigger impact on training time than on test error, so we can reduce the number of iterations to train faster and still perform competitively.

5 Conclusion

In this paper, we propose a new capsule routing algorithm for learning a mixture of transforming Gaussians via Variational Bayes. We model uncertainty over the capsule parameters in addition to the routing coefficients, which provides: (i) more flexible control over capsule complexity by tuning priors to induce sparsity, and (ii) a reduction of the well known variance-collapse problem inherent to MLE-based mixture models, such as EM. We outperform the state-of-the-art on smallNORB using 50% fewer capsules than previously reported, achieve highly competitive performances on CIFAR-10, Fashion-MNIST and SVHN, and demonstrate significant improvement in MNIST to affNIST generalisation over previous methods. For future work, we plan to extend our Bayesian framework to obtain calibrated uncertainty estimates over predictions using capsule networks.

References

Bishop, C. M. 2006. Pattern recognition and machine learning. Springer.

Choi, J.; Seo, H.; Im, S.; and Kang, M. 2019. Attention routing between capsules. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 0–0.

Deliège, A.; Cioppa, A.; and Van Droogenbroeck, M. 2019. An effective hit-or-miss layer favoring feature interpretation as learned prototypes deformations. In Thirty-Third AAAI Conference on Artificial Intelligence.

Duarte, K.; Rawat, Y.; and Shah, M. 2018. Videocapsulenet: A simplified network for action detection. In Advances in Neural Information Processing Systems, 7610–7619.

Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249–256.

Guiaşu, S. 1971. Weighted entropy. Reports on Mathematical Physics 2(3):165–179.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.

Hinton, G. E.; Krizhevsky, A.; and Wang, S. D. 2011. Transforming auto-encoders. In International Conference on Artificial Neural Networks, 44–51. Springer.

Hinton, G. E.; Sabour, S.; and Frosst, N. 2018. Matrix capsules with em routing. In International Conference on Learning Representations (ICLR).

Jeong, T.; Lee, Y.; and Kim, H. 2019. Ladder capsule network. In International Conference on Machine Learning, 3071–3079.

Killian, T.; Goodwin, J.; Brown, O.; and Son, S.-H. 2019. Kernelized capsule networks. arXiv preprint arXiv:1906.03164.

Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

Kosiorek, A. R.; Sabour, S.; Teh, Y. W.; and Hinton, G. E. 2019. Stacked capsule autoencoders. arXiv preprint arXiv:1906.06818.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

LeCun, Y.; Huang, F. J.; Bottou, L.; et al. 2004. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR (2), 97–104. Citeseer.

Lenssen, J. E.; Fey, M.; and Libuschewski, P. 2018. Group equivariant capsule networks. In Advances in Neural Information Processing Systems, 8844–8853.

Nair, P.; Doshi, R.; and Keselj, S. 2018. Pushing the limits of capsule networks. Technical note.

Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning.

Phaye, S. S. R.; Sikka, A.; Dhall, A.; and Bathula, D. 2018. Dense and diverse capsule networks: Making the capsules learn better. arXiv preprint arXiv:1805.04001.

Rawlinson, D.; Ahmed, A.; and Kowadlo, G. 2018. Sparse unsupervised capsules generalize better. arXiv preprint arXiv:1804.06094.

Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems (NIPS), 3856–3866.

Xiang, C.; Zhang, L.; Tang, Y.; Zou, W.; and Xu, C. 2018. Ms-capsnet: A novel multi-scale capsule network. IEEE Signal Processing Letters 25(12):1850–1854.

Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

Zhang, L.; Edraki, M.; and Qi, G.-J. 2018. Cappronet: Deep feature learning via orthogonal projections onto capsule subspaces. In Advances in Neural Information Processing Systems, 5814–5823.

Zhang, S.; Zhou, Q.; and Wu, X. 2018. Fast dynamic routing based on weighted kernel density estimation. In International Symposium on Artificial Intelligence and Robotics, 301–309. Springer.

Zhao, Z.; Kleinhans, A.; Sandhu, G.; Patel, I.; and Unnikrishnan, K. 2019. Capsule networks with max-min normalization. arXiv preprint arXiv:1903.09662.

Figure 6: Histograms of the squared distances (X axis) between votes averaged over all airplane images in the smallNORB dataset, and each of the 5 class capsules throughout training (epochs on Y axis). Variational Bayes routing iterations 1-3 are depicted per row, and each column represents a different class capsule. The average votes from the airplane images learn to agree with the airplane class capsule during training, and therefore the discrepancies between the votes and the target capsule parameters increasingly gather around 0 over time, more so than for the other class capsules.

Figure 7: Histograms of the squared distances (X axis) between votes averaged over all car images in the smallNORB dataset, and each of the 5 class capsules throughout training (epochs on Y axis). Variational Bayes routing iterations 1-3 are depicted per row, and each column represents a different class capsule. A very clear difference in the agreement between target (car) and non-target capsules as training progresses can be seen without inspecting the absolute distances on the X axis.
