Fully unsupervised methods are unable to learn disentangled representations without introducing further assumptions in the form of inductive biases on model and data (Locatello et al., 2018). In our challenge submission, we utilize the implicit inductive bias contained in models pretrained on the ImageNet database (Russakovsky et al., 2014), and enhance it by finetuning such models on challenge-relevant auxiliary tasks such as angle estimation, position estimation, or color classification. In particular, our submission for stage 2 builds on our submission from stage 1 (Seitzer, 2020), in which we employed pretrained CNNs to extract convolutional feature maps as a preprocessing step before training a VAE (Kingma and Welling, 2014). Although this approach already yielded good disentanglement scores, we identified two weaknesses of the feature vectors extracted this way. First, the feature extraction network is trained on ImageNet, which is rather dissimilar to the MPI3d dataset (Gondal et al., 2019) used in the challenge. Second, the feature aggregation mechanism was chosen ad hoc and likely does not retain all information needed for disentanglement. We attempt to alleviate these issues by finetuning the feature extraction network and by learning the aggregation of feature maps from data, using the labels of the simulation datasets MPI3d-toy and MPI3d-realistic.
Our method consists of the following three steps: (1) supervised finetuning of the feature extraction CNN (section 2.1), (2) extracting a feature vector from each image in the dataset using the finetuned network (section 2.2), (3) training a VAE to reconstruct the feature vectors and disentangle the latent factors of variation (section 2.3).
2.1. Finetuning the Feature Extraction Network
In this step, we finetune the feature extraction network offline (before submission to the evaluation server). The goal is to adapt the network such that it produces aggregated feature vectors that retain the information needed to disentangle the latent factors of the MPI3d-real dataset. In particular, the network is finetuned by learning to predict the value of each latent factor from the aggregated feature vector of an image. To this end, we use the simulation datasets, with the images as inputs and the labels as supervised classification targets.
For the feature extraction network, we use the VGG19-BN architecture (Simonyan and Zisserman, 2014) of the torchvision package. The input images are standardized using mean and variance across each channel computed from the ImageNet dataset. We use the output feature maps of the last layer before the final average pooling (dimensionality $512 \times 2 \times 2$) as the input to a feature aggregation module which reduces the feature map to a 512-dimensional vector. This aggregation module consists of three convolution layers with 1024, 2048, 512 feature maps and kernel sizes 1, 2, 1 respectively. Each layer is followed by batch normalization and ReLU activation. We also employ layerwise dropout with rate 0.1 before each convolution layer. Finally, the aggregated feature vector is $\ell_2$-normalized, which was empirically found to be important for the resulting disentanglement performance. Then, for each latent factor, we add a linear classification layer computing the logits of each class from the aggregated feature vector. These linear layers are discarded after this step.
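To make the shapes concrete, the following minimal sketch traces the spatial size of the feature map through the three aggregation layers and applies the final ℓ2 normalization. It assumes a 2 × 2 spatial input, stride 1, and no padding, none of which is stated explicitly in the text; the helper names are hypothetical and the code uses only the Python standard library rather than the actual network.

```python
import math

# (output channels, kernel size) of the three aggregation layers described above.
LAYERS = [(1024, 1), (2048, 2), (512, 1)]


def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(channels=512, size=2):
    """Return (channels, height, width) after each layer.

    Assumption: stride 1 and no padding for every convolution.
    """
    shapes = [(channels, size, size)]
    for out_channels, kernel in LAYERS:
        size = conv_out_size(size, kernel)
        shapes.append((out_channels, size, size))
    return shapes


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


if __name__ == "__main__":
    for shape in trace_shapes():
        print(shape)
    print(l2_normalize([3.0, 4.0]))
```

Under these assumptions, the kernel of size 2 in the second layer is what collapses the 2 × 2 map to a single spatial position, so the last layer yields a 512-dimensional vector.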
We use both MPI3d-toy and MPI3d-realistic for training to push the network to learn features that identify the latent factors in a robust way, regardless of details such as reflections or specific textures. In particular, we use a random split of 80% of each dataset as the training set, and the remaining samples as a validation set. VGG19-BN is initialized with a set of weights resulting from ImageNet training, and the aggregation module and linear layers are randomly initialized using uniform He initialization (He et al., 2015). The network is trained for 5 epochs using the RAdam optimizer (Liu et al., 2019) with learning rate 0
9, a batch size of 512 and a weight decay of 0.01. We use a multi-task classification loss consisting of the sum of cross entropies between the prediction and the ground truth of each latent factor. After training, the classification accuracy on the validation set is around 98% for the two degrees of freedom of the robot arm, and around 99.9% for the remaining latent factors.
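The multi-task loss described above can be sketched as follows for a single sample. This is an illustrative standard-library version, not the training code; the function names and the two-factor example are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross entropy of a softmax over `logits` against the class index `target`."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def multi_task_loss(logits_per_factor, targets):
    """Sum of the per-factor cross entropies for one sample."""
    return sum(cross_entropy(l, t) for l, t in zip(logits_per_factor, targets))


if __name__ == "__main__":
    # Hypothetical example: two latent factors with 3 and 2 classes.
    logits = [[2.0, 0.5, -1.0], [0.0, 0.0]]
    targets = [0, 1]
    print(multi_task_loss(logits, targets))
```

Each latent factor contributes one cross-entropy term computed from its own linear head, so factors with many classes and factors with few classes enter the sum with equal weight.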
2.2. Feature Map Extraction and Aggregation
In this step, we use the finetuned feature extraction network to produce a set of aggregated feature vectors. We simply run the network detailed in the previous step on each image of the dataset and store the aggregated 512-dimensional vectors in memory. Again, inputs to the feature extractor are standardized such that mean and variance across each channel correspond to the respective ones from the ImageNet dataset.
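As a small illustration of the input standardization, the sketch below standardizes RGB values per channel. The constants are the per-channel statistics commonly used with torchvision's ImageNet-pretrained models; that exactly these values were used here is an assumption, as are the function names.

```python
# Per-channel RGB statistics commonly used with ImageNet-pretrained torchvision models.
# Assumption: the text does not state the exact values that were used.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def standardize_pixel(rgb):
    """Standardize one RGB pixel whose values lie in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def standardize_image(image):
    """Standardize an image given as rows of RGB pixels (nested lists of tuples)."""
    return [[standardize_pixel(px) for px in row] for row in image]


if __name__ == "__main__":
    image = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]]
    print(standardize_image(image))
```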
2.3. VAE Training
Finally, we train a standard β-VAE (Higgins et al., 2017) on the set of aggregated feature vectors resulting from the previous step. The encoder network consists of a single fully-connected layer with 4096 neurons, followed by two fully-connected layers parametrizing 18 means and log variances of a normal distribution used as the approximate posterior $q(z \mid x)$. The number of latent factors was determined experimentally. The decoder network consists of four fully-connected layers with 4096 neurons each, followed by a fully-connected layer parametrizing the means of a normal distribution used as the conditional likelihood $p(x \mid z)$. The mean is constrained to the range $(0, 1)$ using the sigmoid activation. All fully-connected layers but the final ones use batch normalization and are followed by ReLU activation functions. We use orthogonal initialization (Saxe et al., 2013) for all layers and assume a factorized standard normal distribution $\mathcal{N}(0, I)$ as the prior $p(z)$ on the latent variables.
For optimization, we use the RAdam optimizer (Liu et al., 2019) with a learning rate of 09 and a batch size of
256. The VAE is trained for
epochs by maximizing the evidence lower bound, which is equivalent to minimizing the sum of an MSE reconstruction term and a KLD penalty term weighted by $\beta$, where $\beta$ is a hyperparameter that balances the two terms. As the scale of the KLD term depends on the number of latent factors $C$, we normalize it by $C$ such that $\beta$ can be varied independently of $C$. It can be harmful to start training with too much weight on the KLD term (Bowman et al., 2015). Therefore, we use a cosine schedule to smoothly anneal $\beta$ over the course of training, where $\beta_t$ denotes the value of $\beta$ in training epoch $t$ and annealing runs from epoch 10 to epoch 79. This schedule lets the model initially learn to reconstruct the data well and only then puts pressure on the latent variables to be factorized, which we found to considerably improve performance.
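The objective and the annealing schedule can be sketched as follows. This is one plausible reading under stated assumptions, not the exact formulas of the method: the MSE is averaged over feature dimensions, the KLD is the closed form for a diagonal Gaussian posterior against a standard normal prior, and the cosine schedule holds β constant outside the annealing window. The start and end values of β used in the example are hypothetical placeholders, as are all function names.

```python
import math


def beta_at_epoch(t, beta_start, beta_end, t_start=10, t_end=79):
    """Cosine annealing of beta from beta_start to beta_end between t_start and t_end.

    Assumption: beta is held constant before t_start and after t_end.
    """
    if t <= t_start:
        return beta_start
    if t >= t_end:
        return beta_end
    progress = (t - t_start) / (t_end - t_start)
    return beta_end + 0.5 * (beta_start - beta_end) * (1.0 + math.cos(math.pi * progress))


def kld_standard_normal(mu, logvar):
    """KL divergence between N(mu, diag(exp(logvar))) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def vae_loss(x, x_rec, mu, logvar, beta):
    """MSE reconstruction plus beta times the KLD normalized by the number of latents C."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)
    num_latents = len(mu)  # C in the text
    return mse + beta * kld_standard_normal(mu, logvar) / num_latents


if __name__ == "__main__":
    # Hypothetical start and end values for beta, chosen only for illustration.
    for epoch in (0, 10, 45, 79, 90):
        print(epoch, beta_at_epoch(epoch, beta_start=0.0, beta_end=1.0))
```

Dividing the KLD by the number of latents keeps the penalty on a per-dimension scale, which is why β does not need retuning when the latent size changes.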
Our approach obtained second place in stage 2 of the competition. Notably, compared to our stage 1 approach, our stage 2 approach leads to a large improvement on the FactorVAE (Kim and Mnih, 2018) and DCI (Eastwood and Williams, 2018) metrics. On the public leaderboard, our best submission achieves the first rank on these metrics, with a large gap to the second-placed entry. See appendix A for further discussion of the results.
Unsurprisingly, introducing prior knowledge simplifies the disentanglement task considerably, which is reflected in the improved scores. To do so, our approach makes use of task-specific supervision obtained from simulation, which restricts its applicability. Nevertheless, it constitutes a demonstration that this type of supervision can transfer to better disentanglement on real world data, which was one of the goals of the challenge.
This work was supported by the Bavarian Ministry of Economic Affairs, Regional Development and Energy through the Center for Analytics - Data - Applications (ADA-Center) within the framework of “BAYERN DIGITAL II” (20-3410-2-9-8).
AIcrowd. NeurIPS 2019: Disentanglement Challenge. https://www.aicrowd.com/challenges/neurips-2019-disentanglement-challenge, 2019.
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. In CoNLL, 2015.
Tian Qi Chen, Xuechen Li, Roger Baker Grosse, and David Kristjanson Duvenaud. Isolating Sources of Disentanglement in Variational Autoencoders. In ICLR, 2018.
Cian Eastwood and Christopher K. I. Williams. A Framework for the Quantitative Evaluation of Disentangled Representations. In ICLR, 2018.
Muhammad Waleed Gondal, Manuel Wüthrich, Đorđe Miladinović, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, and Stefan Bauer. On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. In NeurIPS, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015.
Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR, 2017.
Hyunjik Kim and Andriy Mnih. Disentangling by Factorising. In ICML, 2018.
Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. 2014.
Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. ArXiv, abs/1711.00848, 2017.
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the Variance of the Adaptive Learning Rate and Beyond. ArXiv, abs/1908.03265, 2019.
Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In RML@ICLR, 2018.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115:211–252, 2014.
Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ArXiv, abs/1312.6120, 2013.
Maximilian Seitzer. NeurIPS 2019 Disentanglement Challenge: Improved Disentanglement through Aggregated Convolutional Feature Maps. ArXiv, abs/2002.10003, 2020. URL https://arxiv.org/abs/2002.10003.
Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2014.
Raphael Suter, Đorđe Miladinović, Bernhard Schölkopf, and Stefan Bauer. Robustly Disentangled Causal Mechanisms: Validating Deep Representations for Interventional Robustness. In ICML, 2019.
Appendix A. Discussion of Leaderboard Results
We summarize the results of our best submissions on the public and private leaderboards
in table 1. The private leaderboard of this challenge stage used a dataset of real images,
but with objects more difficult to recognize than the objects in the dataset of the public
leaderboard. To the best of our knowledge, an exact description of the types of objects
used for this test dataset had not yet been released at the time of writing. On this more difficult
dataset, our approach achieves the first rank on the FactorVAE (Kim and Mnih, 2018) and
SAP (Kumar et al., 2017) metrics, with a particularly large difference of 0.24 to the second
ranked entry for FactorVAE. Compared to the easier dataset of the public leaderboard, all
metrics drop, sometimes strongly (e.g., 0.22 for DCI (Eastwood and Williams, 2018)). This
could stem from the fact that this more challenging dataset uses different types of objects
than the ones which were included in the supervised pretraining.
On the public leaderboard (i.e., on MPI3D-real), our method achieves the first rank on FactorVAE and DCI. For both metrics, there is a large absolute difference to the second-ranked entry, namely 0.37 for FactorVAE and 0.26 for DCI. For SAP, our method is almost tied with the first-ranked entry, with 0.01 absolute difference. For MIG (Chen et al., 2018) and IRS (Suter et al., 2019), our method falls behind the best method, with an absolute distance of 0.08 and 0.13 respectively.
Compared to our stage 1 submission, which does not use supervised finetuning, metrics for which our approach was already good (FactorVAE, DCI and SAP) became even better, while other metrics for which our approach performed subpar stayed the same (MIG) or even became worse (IRS). It seems that adding supervised finetuning to our pretrained-features approach amplifies its existing strengths and weaknesses. That being said, it is known that the results of VAE-based disentanglement methods are highly sensitive to the hyperparameters and even random seeds used (Locatello et al., 2018). Thus, a more detailed investigation is needed to draw any conclusions, which was out of scope for this report.
Table 1: Summary of scores and ranks of our best submissions on the private and public leaderboards at the end of stage 2. For comparison, we also include the private leaderboard scores of our best stage 1 submission. Note that our best result on the public leaderboard uses slightly different hyperparameters than the ones described before. We list them in appendix B.
Appendix B. Hyperparameters for Best Result on Public Leaderboard
The challenge leaderboard lists only the best submission for each dataset, which is why
the best submission on the public leaderboard has slightly different hyperparameters than
the best one on the private dataset. Here, we list the hyperparameters that differ from
the ones described in the main text:
• Encoder: four layers with 1024 neurons each
• Decoder: four layers with 1024 neurons each
• Latent dimensions:
• Training time: 100 epochs
• Beta annealing: from 4 over epochs