While conventional supervised learning is becoming more stable and is used in a wide range of applications, learning a complex model may require a daunting amount of labeled data. For this reason, transfer learning is often considered as an option to reduce the sample complexity of learning a new task. While there has been a significant amount of progress in domain adaptation [13], this particular form of transfer learning requires a source task highly related to the target task and a large amount of data on the source task. For this reason, we seek to make progress on multitask transfer learning (also known as few-shot learning), which is still far behind human-level transfer capabilities [22]. In the few-shot learning setup, a potentially large number of tasks are available to learn parameters shared across all tasks. Once the shared parameters are learned, the objective is to obtain good generalization performance on a new task with a small number of samples.
Recently, significant progress has been made to scale Bayesian neural networks to large tasks and to provide better approximations of the posterior distribution [5, 23, 21]. This, however, comes with an important question: "What does the posterior distribution actually represent?" For neural networks, the prior is often chosen for convenience and the approximate posterior is often very limited [5]. For sufficiently large datasets, the observations overcome the prior, and the posterior becomes a single mode around the true model, justifying most unimodal posterior approximations.
However, many usages of the posterior distribution require a meaningful prior, that is, a prior expressing our current knowledge of the task and, most importantly, our lack of knowledge of the task. In addition, a good approximation of the posterior under the small sample size regime is required, including the ability to model multiple modes. This is indeed the case for Bayesian optimization [30], Bayesian active learning [12], continual learning [20], safe reinforcement learning [4], and the exploration-exploitation trade-off in reinforcement learning [17]. Gaussian processes [27] have historically been used for these applications, but an RBF kernel is too generic a prior for many tasks. More recent tools such as deep Gaussian processes [7] show great potential, yet their scalability when learning from multiple tasks needs to be improved.
Our aim in this work is to learn a good prior across multiple tasks and transfer it to a new task. To be able to express a rich and flexible prior learned across a large number of tasks, we use neural networks learned with a variational Bayes procedure. By doing so, we are able to (i) isolate a small number of task specific parameters and (ii) obtain a rich posterior distribution over this space. Additionally, the knowledge accumulated from the previous tasks provides a meaningful prior on the target task, yielding a meaningful posterior distribution which can be used in a small data regime.
The rest of the paper is organized as follows. We first describe the proposed approach in Section 2 while reviewing hierarchical Bayes modeling. In Section 3, we extend to three levels of hierarchy to obtain a model more suited for classification. Section 4 outlines key differences between our approach and related methods. In Section 5, we conduct experiments on toy tasks to gain insight into the behavior of the algorithm. Finally, we show that we can obtain a new state of the art on the Mini-Imagenet benchmark [31].
By leveraging the variational Bayes approach, we show how we can learn a prior over models with neural networks. Also, by factorizing the posterior distribution into a task-agnostic and a task-specific component, we show an important simplification resulting in a scalable algorithm, which we refer to as deep prior.
2.1 Hierarchical Bayes
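We consider learning a prior from previous tasks by learning a probability distribution $p(w|\alpha)$ over the weights $w$ of a network, parameterized by $\alpha$. This is done using a hierarchical Bayes approach across $N$ tasks, with hyper-prior $p(\alpha)$. Each task has its own parameters $w_j$, with $\mathcal{W} = \{w_j\}_{j=1}^N$. Using all datasets $\mathcal{D} = \{S_j\}_{j=1}^N$, we have the following posterior:

$$p(\mathcal{W}, \alpha \mid \mathcal{D}) = p(\alpha \mid \mathcal{D}) \prod_j p(w_j \mid \alpha, S_j) \tag{1}$$
$$\propto p(\mathcal{D} \mid \mathcal{W})\, p(\mathcal{W} \mid \alpha)\, p(\alpha) \tag{2}$$
$$= p(\alpha) \prod_j p(w_j \mid \alpha) \prod_i p(y_{ij} \mid x_{ij}, w_j) \tag{3}$$

The term $p(y_{ij} \mid x_{ij}, w_j)$ corresponds to the likelihood of sample i of task j given a model parameterized by $w_j$, e.g. the probability of class $y_{ij}$ from the softmax of a neural network parameterized by $w_j$ with input $x_{ij}$. For the posterior $p(\alpha \mid \mathcal{D})$, we assume that the large amount of data available across multiple tasks will be enough to overcome a generic prior $p(\alpha)$, such as an isotropic Normal distribution. Hence, we consider a point estimate of the posterior $p(\alpha \mid \mathcal{D})$ using maximum a posteriori.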
We can now focus on the remaining term: $p(w_j \mid \alpha)$. Since $w_j$ is potentially high dimensional with intricate correlations among the different dimensions, we cannot use a simple Gaussian distribution. Following inspiration from generative models such as GANs [14] and VAE [18], we use an auxiliary variable $z \sim \mathcal{N}(0, I)$ and a deterministic function projecting the noise z to the space of w, i.e. $w = h_\alpha(z)$. Marginalizing z, we have: $p(w \mid \alpha) = \int_z p(z)\, p(w \mid z, \alpha)\, dz = \int_z p(z)\, \delta_{h_\alpha(z) - w}\, dz$, where $\delta$ is the Dirac delta function. Unfortunately, directly marginalizing z is intractable for general $h_\alpha$. To overcome this issue, we add z to the joint inference and marginalize it at inference time.
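Considering the point estimation of $\alpha$, the full posterior is factorized as follows:

$$p(w_j, z_j \mid \alpha, S_j) \propto p(z_j)\, p(w_j \mid z_j, \alpha) \prod_i p(y_{ij} \mid x_{ij}, w_j) \tag{4}$$

where $p(y_{ij} \mid x_{ij}, w_j)$ is the conventional likelihood function of a neural network with weight matrices generated from the function $h_\alpha$, i.e. $w_j = h_\alpha(z_j)$. A similar architecture has been used in Krueger et al. [21] and Louizos and Welling [23], but we will soon show that it can be reduced to a simpler architecture in the context of multi-task learning. The other terms are defined as follows:

$$p(z_j) = \mathcal{N}(0, I), \qquad p(w_j \mid z_j, \alpha) = \delta_{h_\alpha(z_j) - w_j}$$

The task will consist of jointly learning a function $h_\alpha$ common to all tasks and a posterior distribution $p(z_j \mid \alpha, S_j)$ for each task. At inference time, predictions are performed by marginalizing z, i.e.:

$$p(y \mid x, \mathcal{D}) = E_{z_j \sim p(z_j \mid \alpha, S_j)}\; p(y \mid x, h_\alpha(z_j))$$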
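The prediction rule above can be approximated by simple Monte Carlo averaging. The sketch below is only an illustration under stated assumptions: the generator `h_alpha` (a linear map), the diagonal Gaussian posterior over z and the logistic model are hypothetical stand-ins chosen to keep the example small, not the architecture used in this work.

```python
import math
import random

def h_alpha(z, alpha):
    """Hypothetical generator: maps the latent z to the weights w of a tiny model.
    Here w = A z + c, with alpha = (A, c)."""
    A, c = alpha
    return [sum(a * zi for a, zi in zip(row, z)) + ci for row, ci in zip(A, c)]

def predict_proba(x, w):
    """Toy likelihood p(y=1 | x, w): logistic regression with weight w[0], bias w[1]."""
    return 1.0 / (1.0 + math.exp(-(w[0] * x + w[1])))

def sample_posterior(mu, sigma, rng):
    """Assumed posterior over z: a diagonal Gaussian N(mu, sigma^2)."""
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]

def marginal_prediction(x, alpha, mu, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_z[p(y=1 | x, h_alpha(z))]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = sample_posterior(mu, sigma, rng)
        total += predict_proba(x, h_alpha(z, alpha))
    return total / n_samples

# Identity generator, chosen only for readability.
alpha = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
p = marginal_prediction(x=0.5, alpha=alpha, mu=[2.0, -1.0], sigma=[0.5, 0.5])
```

With a posterior of zero variance, the average collapses to the prediction of a single deterministic network, which is a convenient sanity check.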
2.2 Hierarchical Variational Bayes Neural Network
In the previous section, we described the different components for expressing the posterior distribution of Equation 4. While all those components are tractable, the normalization factor hidden behind the "$\propto$" sign is still intractable. To address this issue, we follow the Variational Bayes approach [5].
Conditioning on $\alpha$, we saw in Equation 1 that the posterior factorizes independently for all tasks. This reduces the joint Evidence Lower BOund (ELBO) to a sum of individual ELBOs, one for each task.
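Given a family of distributions $q_{\theta_j}(z_j \mid S_j, \alpha)$, parameterized by $\theta_j$, the Evidence Lower Bound for task j is:

$$\ln p(S_j) \geq E_{z_j \sim q_{\theta_j}(z_j \mid S_j, \alpha)} \left[ \sum_{i=1}^{n_j} \ln p(y_{ij} \mid x_{ij}, h_\alpha(z_j)) \right] - \mathrm{KL}_j = \mathrm{ELBO}_j \tag{5}$$

where $n_j$ is the number of samples in task j and

$$\mathrm{KL}_j = \mathrm{KL}\left[ q_{\theta_j}(z_j, w_j \mid S_j, \alpha) \,\|\, p(z_j, w_j \mid \alpha) \right] = \mathrm{KL}\left[ q_{\theta_j}(z_j \mid S_j, \alpha) \,\|\, p(z_j) \right] \tag{6}$$

The second equality holds because the Dirac factor $\delta_{h_\alpha(z_j) - w_j}$ is shared by both distributions and cancels in the ratio. Notice that after simplification, $\mathrm{KL}_j$ is no longer over the space of $w_j$ but only over the space of $z_j$. Namely, the posterior distribution is factored into two components, one that is task specific and one that is task agnostic and can be shared with the prior. This amounts to finding a low dimensional manifold in the parameter space where the different tasks can be distinguished. Then, the posterior $p(z_j \mid S_j, \alpha)$ only has to model which of the possible tasks are likely, given observations $S_j$, instead of modeling the high dimensional $p(w_j \mid S_j, \alpha)$.
But, most importantly, any explicit reference to w has now vanished from both Equation 5 and Equation 6. This simplification has an important positive impact on the scalability of the proposed approach. Since we no longer need to explicitly calculate the KL on the space of w, we can simplify the likelihood function to $p(y_{ij} \mid x_{ij}, z_j, \alpha)$, which can be a deep network parameterized by $\alpha$, taking both $x_{ij}$ and $z_j$ as inputs. This contrasts with the previous formulation, where $h_\alpha(z_j)$ produces all the weights of a network, yielding an extremely high dimensional representation and slow training.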
2.3 Posterior Distribution
For modeling $q_{\theta_j}(z_j \mid S_j, \alpha)$, we can use $\mathcal{N}(\mu_j, \sigma_j)$, where $\mu_j$ and $\sigma_j$ can be learned individually for each task. This, however, limits the posterior family to expressing a single mode. For more flexibility, we also explore the usage of more expressive posteriors, such as Inverse Autoregressive Flow (IAF) [19]. This gives a flexible tool for learning a rich variety of multivariate distributions. In principle, we can use a different IAF for each task, but for memory and computational reasons, we use a single IAF for all tasks and we condition on an additional task-specific context $c_j$.
Note that with IAF, we cannot evaluate $q_{\theta_j}(z_j \mid S_j, \alpha)$ for any values of z efficiently, only for those which we just sampled, but this is sufficient for estimating the KL term with a Monte-Carlo approximation, i.e.:

$$\mathrm{KL}_j \approx \frac{1}{n_{mc}} \sum_{i=1}^{n_{mc}} \left[ \ln q_{\theta_j}(z_j^{(i)} \mid S_j, \alpha) - \ln \mathcal{N}(z_j^{(i)} \mid 0, I) \right]$$

where $z_j^{(i)} \sim q_{\theta_j}(z_j \mid S_j, \alpha)$. It is common to approximate $\mathrm{KL}_j$ with a single sample and let the mini-batch average the noise incurred on the gradient. We experimented with $n_{mc} > 1$, but this did not significantly improve the rate of convergence.
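To make the Monte-Carlo estimate of the KL term concrete, the sketch below applies it to a diagonal Gaussian posterior, the one case where an exact value is available for comparison. The function names and the parameter values are illustrative assumptions; with IAF, only the sampled log-density would be available and `kl_closed_form` would have no counterpart.

```python
import math
import random

def log_normal(z, mu, sigma):
    """Log density of a diagonal Gaussian N(mu, sigma^2) evaluated at z."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((zi - m) / s) ** 2
        for zi, m, s in zip(z, mu, sigma)
    )

def kl_monte_carlo(mu, sigma, n_mc=1, seed=0):
    """Estimate KL[q || p] with q = N(mu, sigma^2) and p = N(0, I),
    using only the log-density of points sampled from q."""
    rng = random.Random(seed)
    zeros, ones = [0.0] * len(mu), [1.0] * len(mu)
    total = 0.0
    for _ in range(n_mc):
        z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        total += log_normal(z, mu, sigma) - log_normal(z, zeros, ones)
    return total / n_mc

def kl_closed_form(mu, sigma):
    """Exact KL[N(mu, sigma^2) || N(0, I)], available only in the Gaussian case."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mu, sigma))

mu, sigma = [0.5, -1.0], [0.8, 1.2]
exact = kl_closed_form(mu, sigma)
single_sample = kl_monte_carlo(mu, sigma, n_mc=1)      # noisy but unbiased
averaged = kl_monte_carlo(mu, sigma, n_mc=20000)       # close to the exact value
```

A single-sample estimate can even be negative; it is the averaging over the mini-batch and over training steps that makes it usable.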
2.4 Training Procedure
In order to compute the loss proposed in Equation 5, we would need to evaluate every sample of every task. To accelerate the training, we describe a procedure following the mini-batch principle. First we replace summations with expectations:

$$\sum_{j=1}^{N} \mathrm{ELBO}_j = N\, E_{j \sim U(N)} \left[ n_j\, E_{i \sim U(n_j)}\, E_{z_j \sim q_{\theta_j}(z_j \mid S_j, \alpha)} \ln p(y_{ij} \mid x_{ij}, z_j, \alpha) - \mathrm{KL}_j \right] \tag{7}$$

where $U(n)$ denotes the uniform distribution over $\{1, \dots, n\}$ and $n_j$ is the number of samples in task j.
Now it suffices to approximate the gradient with samples across all tasks. Thus, we simply concatenate all datasets into a meta-dataset and add j as an extra field. Then, we sample a mini-batch uniformly, with replacement, from the meta-dataset. Notice the term $n_j$ appearing in front of the likelihood in Equation 7: it indicates that, individually for each task, the procedure finds the appropriate trade-off between the prior and the observations. Refer to Algorithm 1 for more details on the procedure.
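The mini-batch procedure can be sketched as follows. This is a minimal illustration, not the training code of this work: `log_lik` and `kl` are hypothetical callables standing in for the network likelihood and the KL term, and the toy squared-error likelihood is an assumption. Note that sampling uniformly from the concatenated meta-dataset selects task j with probability proportional to its size; the sketch leaves this uncorrected.

```python
import random

def build_meta_dataset(tasks):
    """Concatenate all task datasets, storing the task index j as an extra field."""
    return [(j, x, y) for j, samples in enumerate(tasks) for (x, y) in samples]

def sample_minibatch(meta_dataset, batch_size, rng):
    """Sample uniformly, with replacement, from the meta-dataset."""
    return [rng.choice(meta_dataset) for _ in range(batch_size)]

def minibatch_objective(batch, task_sizes, log_lik, kl):
    """Stochastic objective: n_j * ln p(y | x, z_j) - KL_j, averaged over the batch.
    `log_lik(j, x, y)` and `kl(j)` are hypothetical callables supplied by the model."""
    terms = [task_sizes[j] * log_lik(j, x, y) - kl(j) for (j, x, y) in batch]
    return sum(terms) / len(terms)

# Two toy tasks of different sizes.
tasks = [[(0.0, 1.0), (1.0, 2.0)], [(0.5, 0.0), (1.5, 1.0), (2.5, 2.0)]]
meta = build_meta_dataset(tasks)
sizes = [len(t) for t in tasks]
rng = random.Random(0)
batch = sample_minibatch(meta, batch_size=4, rng=rng)
value = minibatch_objective(
    batch, sizes,
    log_lik=lambda j, x, y: -0.5 * (y - x) ** 2,   # toy stand-in for ln p(y | x, z_j)
    kl=lambda j: 0.1,                              # toy stand-in for KL_j
)
```

Multiplying a single-sample log-likelihood by the task size gives an unbiased estimate of the sum over that task, which is why larger tasks weigh the observations more heavily against the KL term.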
Deep prior gives rise to a very flexible way to transfer knowledge from multiple tasks. However, there is still an important assumption at the heart of deep prior (and other VAE-based approaches such as Edwards and Storkey [10]): the task information must be encoded in a low dimensional variable z. In Section 5, we show that this is appropriate for regression, but for image classification it is not the most natural assumption. Hence, we propose to extend to a third level of hierarchy by introducing a latent classifier on the obtained representation.
In Equation 5, for a given task j, we decomposed the likelihood $p(S \mid z)$ into $\prod_{i=1}^{n} p(y_i \mid x_i, z)$ by assuming that the neural network directly predicts $p(y_i \mid x_i, z)$. Here, we introduce a latent variable v to make the prediction $p(y_i \mid x_i, v, z)$. This can be, for example, a Gaussian linear regression on the representation $\phi_\alpha(x, z)$ produced by the neural network. The general form now factorizes as follows:

$$p(S \mid z) = E_{v \sim p(v \mid z)} \prod_{i=1}^{n} p(y_i \mid x_i, v, z),$$

which is commonly called the marginal likelihood.
To compute $\mathrm{ELBO}_j$ in Equation 5 and update the parameters $\alpha$, the only requirement is to be able to compute the marginal likelihood $p(S \mid z)$. There are closed form solutions for, e.g., linear regression with a Gaussian prior, but our aim is to compare with algorithms such as Prototypical Networks (Proto Net) [29] on a classification benchmark. Alternatively, we can factor the marginal likelihood as follows: $p(S \mid z) = \prod_{i=1}^{n} p(y_i \mid x_i, S_{1..i-1}, z)$, where $S_{1..i-1}$ denotes the first $i-1$ samples. If a well calibrated task uncertainty is not required, one can also use a leave-one-out procedure, $\prod_{i=1}^{n} p(y_i \mid x_i, S \setminus \{(x_i, y_i)\}, z)$. Both of these factorizations correspond to training the latent classifier n times on a subset of the training set and evaluating on a left-out sample. We refer the reader to Rasmussen [27, Chapter 5] for a discussion on the difference between leave-one-out cross validation and marginal likelihood.
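For a practical algorithm, we propose a closed form solution for leave-one-out in prototypical networks. In its standard form, the prototypical network produces a prototype $c_k$ by averaging all representations $\gamma_i = \phi_\alpha(x_i, z)$ of class k, i.e. $c_k = \frac{1}{|K|} \sum_{i \in K} \gamma_i$, where $K = \{i : y_i = k\}$. Then, predictions are made using $p(y = k \mid x, \alpha, z) \propto \exp\left(-\|c_k - \gamma\|^2\right)$, where $\gamma = \phi_\alpha(x, z)$.

Theorem 1. Let $c_k^{-i}$ be the prototypes computed without example $(x_i, y_i)$ in the training set. Then,

$$\|c_k^{-i} - \gamma_i\|^2 = \begin{cases} \left( \dfrac{|K|}{|K| - 1} \right)^2 \|c_k - \gamma_i\|^2 & \text{if } y_i = k \\ \|c_k - \gamma_i\|^2 & \text{otherwise.} \end{cases}$$

We defer the proof to the supplementary materials. Hence, we only need to compute prototypes once and rescale the Euclidean distance when comparing with a sample that was used for computing the current prototype. This gives an efficient algorithm with the same complexity as the original one and a good proxy for the marginal likelihood.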
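The leave-one-out shortcut can be checked numerically. The sketch below assumes squared Euclidean distances and at least two examples per class; the function names and the toy two-dimensional representations are hypothetical and stand in for the outputs of the network. It compares the rescaled distances against a brute-force recomputation of the prototypes.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def loo_sq_distances(reps, labels, classes):
    """Squared distance from each representation to each prototype, as if the
    prototype had been computed without that representation.
    Prototypes are computed once; same-class distances are rescaled."""
    members = {k: [r for r, y in zip(reps, labels) if y == k] for k in classes}
    protos = {k: mean(members[k]) for k in classes}
    out = []
    for r, y in zip(reps, labels):
        row = {}
        for k in classes:
            d = sq_dist(protos[k], r)
            if y == k:
                n_k = len(members[k])            # requires n_k >= 2
                d *= (n_k / (n_k - 1)) ** 2
            row[k] = d
        out.append(row)
    return out

def brute_force_loo(reps, labels, classes, i):
    """Reference implementation: recompute the prototypes without example i."""
    row = {}
    for k in classes:
        kept = [r for t, (r, y) in enumerate(zip(reps, labels)) if y == k and t != i]
        row[k] = sq_dist(mean(kept), reps[i])
    return row

reps = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [5.0, 5.0], [6.0, 4.0]]
labels = [0, 0, 0, 1, 1]
fast = loo_sq_distances(reps, labels, classes=[0, 1])
```

Class probabilities would then follow from a softmax over the negated distances in each row.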
Hierarchical Bayes algorithms for multitask learning have a long history [8, 32, 2]. However, most of the literature focuses on simple statistical models and does not consider transferring to new tasks.
More recently, Edwards and Storkey [10] and Bouchacourt et al. [6] explored hierarchical Bayesian inference with neural networks and evaluated on new tasks. Both of them use a two-level hierarchical VAE for modeling the observations. While similar, our approach differs in a few ways. We use a discriminative approach and focus on model uncertainty. We show that we can obtain a posterior on z without having to explicitly encode $S_j$. We also explore the usage of more complex posterior families such as IAF. Those differences make our algorithm simpler to implement and easier to scale to larger datasets.
Some recent works on meta-learning also target transfer learning from multiple tasks. Model-Agnostic Meta-Learning (MAML) [11] finds a shared parameter $\theta$ such that, for a given task, one gradient step on $\theta$ using the training set will yield a model with good predictions on the test set. Then, a meta-gradient update is performed from the test error through the one gradient step in the training set, to update $\theta$. This yields a simple and scalable procedure which learns to generalize. Recently, Grant et al. [15] considered a Bayesian version of MAML. Additionally, Ravi and Larochelle [28] also consider a meta-learning approach where an encoding network reads the training set and generates the parameters of a model, which is trained to perform well on the test set.
Finally, recent interest in few-shot learning has given rise to various algorithms capable of transferring from multiple tasks. Many of these approaches [31, 29] find a representation where a simple algorithm can produce a classifier from a small training set. Bauer et al. [3] use a neural network pre-trained on a standard multi-class dataset to obtain a good representation and use class statistics to transfer prior knowledge to new classes.
Through experiments, we want to answer the following questions: i) Can deep prior learn a meaningful prior on tasks? ii) Can it compete against the state of the art on a strong benchmark? iii) In which situations do deep prior and other approaches fail?
5.1 Regression on one dimensional Harmonic signals
To gain a good insight into the behavior of the prior and posterior, we choose a collection of one dimensional regression tasks. We also want to test the ability of the method to learn the task and not just match the observed points. For this, we will use periodic functions and test the ability of the regressor to extrapolate outside of its domain.
Specifically, each dataset consists of (x, y) pairs (noisily) sampled from a sum of two sine waves with different phase and amplitude and a frequency ratio of 2: $y = a_1 \sin(\omega x + b_1) + a_2 \sin(2 \omega x + b_2) + \epsilon$, where $\epsilon$ is the observation noise. We construct a meta-training set of 5000 tasks, sampling the frequency $\omega$, the phases $(b_1, b_2)$ and the amplitudes $(a_1, a_2)$ independently for each task. To evaluate the ability to extrapolate outside of the task's domain, we make sure that each task has a different domain. Specifically, x values are sampled around a task-specific location $\mu_x$, where $\mu_x$ is sampled from the meta-domain. The number of training samples ranges from 4 to 50 for each task, and evaluation is performed on 100 samples from tasks never seen during training.
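A task generator for this kind of benchmark could look like the sketch below. Only the functional form (two sines with a frequency ratio of 2), the task-specific domain and the 4 to 50 training samples come from the description above; every sampling range, the noise level and all names are illustrative assumptions.

```python
import math
import random

def sample_task(rng, freq_range=(5.0, 7.0), noise_std=0.01):
    """Draw the parameters of one task. All ranges here are illustrative assumptions."""
    return {
        "omega": rng.uniform(*freq_range),
        "phases": (rng.uniform(0.0, 2.0 * math.pi), rng.uniform(0.0, 2.0 * math.pi)),
        "amps": (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)),
        "center": rng.uniform(-4.0, 4.0),    # task-specific domain location
        "noise_std": noise_std,
    }

def harmonic(task, x):
    """Noise-free sum of two sine waves with a frequency ratio of 2."""
    (a1, a2), (b1, b2), w = task["amps"], task["phases"], task["omega"]
    return a1 * math.sin(w * x + b1) + a2 * math.sin(2.0 * w * x + b2)

def sample_dataset(task, n, rng):
    """Draw n noisy (x, y) pairs, with x concentrated around the task's own domain."""
    data = []
    for _ in range(n):
        x = rng.gauss(task["center"], 1.0)
        y = harmonic(task, x) + rng.gauss(0.0, task["noise_std"])
        data.append((x, y))
    return data

rng = random.Random(0)
task = sample_task(rng)
train = sample_dataset(task, n=rng.randint(4, 50), rng=rng)
```

Because the second harmonic has exactly twice the base frequency, each task is periodic with period $2\pi/\omega$, which is what makes extrapolation outside the observed domain possible in principle.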
Model. Once z is sampled from IAF, we simply concatenate it with x and use 12 densely connected layers of 128 neurons with residual connections between every other layer. The final layer linearly projects to 2 outputs, $\mu_y$ and $s$, where $s$ is used to produce a heteroskedastic noise, $\sigma_y = \mathrm{sigmoid}(s) \cdot 0.1 + 0.001$. Finally, we use $p(y \mid x, z) = \mathcal{N}(\mu_y(x, z), \sigma_y(x, z)^2)$ to express the likelihood of the training set. To help gradient flow, we use ReLU activation functions and Layer Normalization [1].
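The output head can be illustrated with a few lines of code. The sketch assumes the raw output s is squashed with a sigmoid before applying the scale of 0.1 and the offset of 0.001, and that the likelihood is Gaussian; the function names are hypothetical.

```python
import math

def noise_std(s):
    """Map the raw network output s to a standard deviation in (0.001, 0.101).
    The sigmoid squashing is an assumption of this sketch."""
    return 0.1 / (1.0 + math.exp(-s)) + 0.001

def gaussian_log_lik(y, mu_y, s):
    """ln N(y | mu_y, sigma_y^2), with sigma_y produced from s."""
    sigma = noise_std(s)
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((y - mu_y) / sigma) ** 2

def dataset_log_lik(outputs, targets):
    """Sum of per-sample log-likelihoods; `outputs` holds (mu_y, s) pairs."""
    return sum(gaussian_log_lik(y, mu, s) for (mu, s), y in zip(outputs, targets))

# Two predictions: a confident one (small s) and a less confident one (large s).
total = dataset_log_lik(outputs=[(0.9, -3.0), (0.2, 3.0)], targets=[0.9, 0.25])
```

The offset keeps the predicted standard deviation away from zero, so the log-likelihood stays bounded even when the network is very confident.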
Results. Figure 1a depicts examples of tasks with 1, 2, 8, and 64 samples. The true underlying function is in blue while 10 samples from the posterior distributions are faded in the background. The thickness of each line represents 2 standard deviations. The first plot has only a single data point and mostly represents samples from the prior, passing near this observed point. Interestingly, all samples are close to some parametrization of the harmonic function of Section 5.1. Next, with only 2 points, the posterior starts to predict curves highly correlated with the true function. However, note that the uncertainty is overoptimistic and that the posterior fails to fully represent all possible harmonics fitting those two points. We discuss this issue in more depth in the supplementary materials. Next, with 8 points, it manages to mostly capture the task, with reasonable uncertainty. Finally, with 64 points the model is certain of the task.
To add a strong baseline, we experimented with MAML [11]. After exploring a variety of hyper-parameter values and architecture designs, we could not make it work for our two-harmonics meta-task. We thus reduced the meta-task to a single harmonic and reduced the base frequency range by a factor of two. With those simplifications, we managed to make it converge, but the results are far behind those of deep prior even in this simplified setup. Figure 1b shows some form of adaptation with 16 samples per task, but the result is jittery and the extrapolation capacity is very limited. Those results were obtained with a densely connected network of 8 hidden layers of 64 units, with residual connections every other layer. The training is performed with two gradient steps and the evaluation with 5 steps. To make sure our implementation is valid, we first replicated their regression result with a fixed frequency as reported in [11].
Finally, to provide a stronger baseline, we remove the KL regularizer of deep prior and reduce the posterior to a deterministic distribution centered on $\mu_j$. The mean square error is reported in Figure 2 for an increasing dataset size. This highlights how the uncertainty provided by deep prior yields a systematic improvement.
Figure 1: Preview of a few tasks (blue line) with an increasing number of training samples (red dots). Samples from the posterior distribution are shown in semi-transparent colors. The width of each sample is two standard deviations (provided by the predicted heteroskedastic noise).
Figure 2: left: Mean Square Error for increasing dataset size. The baseline corresponds to the same model without the KL regularizer. Each value is averaged over 100 tasks and 10 different restarts. right: 4 sample tasks from the Synbols dataset. Each row is a class and each column is a sample from the class. In the two left tasks, the symbol has to be predicted, while in the two right tasks, the font has to be predicted.
5.2 Mini-Imagenet Experiment
Vinyals et al. [31] proposed to use a subset of Imagenet to generate a benchmark for few-shot learning. Each task is generated by sampling 5 classes uniformly and 5 training samples per class; the remaining images from the 5 classes are used as query images to compute accuracy. There are 100 unique classes in total, each having 600 example images. To perform meta-validation and meta-test on unseen tasks (and classes), we isolate 16 and 20 classes respectively from the original set of 100, leaving 64 classes for the training tasks. This follows the procedure suggested in Ravi and Larochelle [28].
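The task construction described above can be sketched as follows. The random class assignment is only illustrative (the split followed here is the one of [28]), and images are represented by hypothetical `(class_id, image_index, label)` triples rather than actual pixels.

```python
import random

def split_classes(n_classes=100, n_val=16, n_test=20, seed=0):
    """Partition class ids into disjoint meta-train / meta-validation / meta-test sets.
    The random assignment is an assumption made for illustration."""
    ids = list(range(n_classes))
    random.Random(seed).shuffle(ids)
    return ids[n_val + n_test:], ids[:n_val], ids[n_val:n_val + n_test]

def sample_episode(class_ids, rng, n_way=5, n_shot=5, images_per_class=600):
    """Sample one task: n_way classes, n_shot training images per class,
    and all remaining images of those classes as queries."""
    classes = rng.sample(class_ids, n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        order = list(range(images_per_class))
        rng.shuffle(order)
        support += [(c, i, label) for i in order[:n_shot]]
        query += [(c, i, label) for i in order[n_shot:]]
    return support, query

train_c, val_c, test_c = split_classes()
support, query = sample_episode(test_c, random.Random(1))
```

Each episode therefore has 25 training images and 5 × 595 query images, all drawn from classes that belong to a single split.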
The training procedure proposed in Section 2 requires training on a fixed set of tasks. We found that 1000 tasks yield enough diversity and that, over 9000 tasks, the embeddings are not visited often enough over the course of the training. To increase diversity during training, the training and test sets are re-sampled every time from a fixed train-test split of the given task.
Table 2: Ablation study of our model. Accuracy is shown with 90% confidence intervals over bootstrap of the validation set.
We first experimented with the vanilla version of deep prior (Section 2). In this formulation, we use a ResNet [16] network, where we inserted FiLM layers [26, 9] between each residual block to condition on the task. Then, after flattening the output of the final convolution layer and reducing to 64 hidden units, we apply a 64 × 5 matrix generated from a transformation of z. Finally, predictions are made through a softmax layer. We found this architecture to be slow to train, as the generated last layer is noisy for a long time and prevents the rest of the network from learning. Nevertheless, we obtained 62.6% accuracy on Mini-Imagenet, on par with many strong baselines.
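For readers unfamiliar with FiLM conditioning, the sketch below shows the feature-wise modulation applied to a list of channels, with the scale and shift predicted linearly from the task variable z. The linear predictors, their weights and the tiny feature map are hypothetical; in the network described above the modulated features would be the outputs of residual blocks.

```python
def linear(v, weights, bias):
    """Plain affine map: weights @ v + bias."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def film(features, z, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: scale and shift each channel using
    parameters predicted from the task variable z."""
    gamma = linear(z, w_gamma, b_gamma)   # one scale per channel
    beta = linear(z, w_beta, b_beta)      # one shift per channel
    return [[g * x + b for x in channel] for g, b, channel in zip(gamma, beta, features)]

features = [[1.0, 2.0], [3.0, 4.0]]       # 2 channels with 2 activations each
z = [0.5, -0.5]
out = film(
    features, z,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[1.0, 1.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[0.0, 1.0],
)
```

Initializing the scale predictor near one and the shift predictor near zero makes the layer start close to the identity, which is a common choice but not one stated in the text.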
To enhance the model, we combine task conditioning with prototypical networks as proposed in Section 3. This approach alleviates the need to generate the final layer of the network, thus accelerating training and increasing generalization performance. While we no longer have a well calibrated task uncertainty, the KL term still acts as an effective regularizer and prevents overfitting on small datasets. With this improvement, we obtain a new state of the art with 74.5% (Table 1). In Table 2, we perform an ablation study to highlight the contributions of the different components of the model. In sum, a deeper network with residual connections yields major improvements. Also, task conditioning does not yield improvement if the leave-one-out procedure is not used. Finally, the KL regularizer is the final touch needed to obtain state-of-the-art results.
5.3 Heterogeneous Collection of Tasks
In Section 5.2, we saw that conditioning helps, but only yields a minor improvement. This is due to the fact that Mini-Imagenet is a very homogeneous collection of tasks where a single representation is sufficient to obtain good results. To support this claim, we provide a new benchmark of synthetic symbols which we refer to as Synbols. Images are generated using various font families on different alphabets (Latin, Greek, Cyrillic, Chinese) and background noise (Figure 2, right). For each task we have to predict either a subset of 4 font families or 4 symbols with only 4 examples. Predicting either fonts or symbols with two separate Prototypical Networks yields 84.2% and 92.3% accuracy respectively, with an average of 88.3%. However, blending the two collections of tasks in a single benchmark brings the prototypical network down to 76.8%. Conditioning on the task with deep prior brings the accuracy back to 83.5%. While there is still room for improvement, this supports the claim that a single representation will only work on a homogeneous collection of tasks and that task conditioning helps in learning a family of representations suitable for heterogeneous benchmarks.
Using variational Bayes, we developed a scalable algorithm for hierarchical Bayes learning of neural networks, called deep prior. This algorithm is capable of transferring information from tasks that are potentially remarkably different. Results on the Harmonics dataset show that the learned manifold across tasks exhibits the properties of a meaningful prior. Finally, we found that MAML, while very general, has a hard time adapting when tasks are too different. Also, we found that algorithms based on a single image representation only work well when all tasks can succeed with a very similar set of features. Together, those findings allowed us to develop a new state of the art on Mini-Imagenet.
[1] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] B. Bakker and T. Heskes. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research, 4(May):83–99, 2003.
[3] M. Bauer, M. Rojas-Carulla, J. B. Świątkowski, B. Schölkopf, and R. E. Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
[4] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–919, 2017.
[5] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[6] D. Bouchacourt, R. Tomioka, and S. Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. arXiv preprint arXiv:1705.08841, 2017.
[7] A. Damianou and N. Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
[8] H. Daumé III. Bayesian multitask learning with latent hierarchies. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 135–142. AUAI Press, 2009.
[9] H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, pages 6597–6607, 2017.
[10] H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
[11] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning, pages 1126–1135, 2017.
[12] Y. Gal, R. Islam, and Z. Ghahramani. Deep bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.
[13] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[15] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[17] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
[18] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[19] D. P. Kingma, T. Salimans, and M. Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.
[20] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
[21] D. Krueger, C.-W. Huang, R. Islam, R. Turner, A. Lacoste, and A. Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.
[22] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
[23] C. Louizos and M. Welling. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
[24] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.
[25] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.
[26] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.
[27] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
[28] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
[29] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090, 2017.
[30] J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
[31] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
[32] J. Wan, Z. Zhang, J. Yan, T. Li, B. D. Rao, S. Fang, S. Kim, S. L. Risacher, A. J. Saykin, and L. Shen. Sparse bayesian multi-task learning for predicting cognitive outcomes from neuroimaging measures in alzheimer’s disease. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 940–947. IEEE, 2012.
7.1 Proof of Leave One Out
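Theorem 1. Let $c_k^{-i}$ be the prototypes computed without example $(x_i, y_i)$ in the training set. Then,

$$\|c_k^{-i} - \gamma_i\|^2 = \begin{cases} \left( \dfrac{|K|}{|K| - 1} \right)^2 \|c_k - \gamma_i\|^2 & \text{if } y_i = k \\ \|c_k - \gamma_i\|^2 & \text{otherwise.} \end{cases}$$

Proof. Let $n = |K|$ and assume $y_i = k$. Removing $\gamma_i$ from the average gives $c_k^{-i} = \frac{n c_k - \gamma_i}{n - 1}$, so that

$$c_k^{-i} - \gamma_i = \frac{n c_k - \gamma_i - (n - 1) \gamma_i}{n - 1} = \frac{n}{n - 1} (c_k - \gamma_i),$$

and therefore $\|c_k^{-i} - \gamma_i\|^2 = \left( \frac{n}{n - 1} \right)^2 \|c_k - \gamma_i\|^2$.
When $y_i \neq k$, the result is trivially $c_k^{-i} = c_k$.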
7.2 Limitations of IAF
Figure 3: top: True function in the original space with 2 observed data points. middle: True posterior distribution, where the orange dot corresponds to the location of the true underlying function. bottom: Samples from IAF’s learned posterior.
When experimenting with the Harmonics toy dataset in Section 5.1, we observed issues with repeatability, most likely due to local minima. We decided to investigate further the multimodality of posterior distributions with small sample size and the capacity of IAF to model them. For this purpose we simplified the problem to a single sine function and removed the burden of learning the prior. The likelihood of the observations is a Gaussian centered on this sine function, with a given noise level. Only the frequency $\omega$ and the bias $b$ are unknown, yielding a bi-dimensional problem that is easy to visualize and quick to train. We use a dataset of 2 points at x = 1.5 and x = 3, and the corresponding posterior distribution is depicted in Figure 3-middle, with an orange point at the location of the true underlying function. Some samples from the posterior distribution can be observed in Figure 3-top.
We observe a high amount of multi-modality in the posterior distribution (Figure 3-middle). Some of the modes are just the mirror of another mode and correspond to the same function, e.g. a bias shifted by a full period, or a negated frequency compensated by the bias. But most of the time they correspond to different functions, and modeling them is crucial for some applications. The number of modes varies a lot with the choice of observed dataset, ranging from a few to several dozens. Now, the question is: "How many of those modes can IAF model?" Unfortunately, Figure 3-bottom reveals poor capability for this particular case. After carefully adjusting the hyperparameters of IAF, exploring different initialization schemes and running multiple restarts, we rarely capture more than two modes (sometimes 4). Moreover, IAF is not able to fully separate the modes: there is systematically a thin path of density connecting the modes as a chain. With longer training, the path becomes thinner but never vanishes, and its magnitude stays significant.
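The origin of this multimodality can be made explicit by enumerating sine functions that pass exactly through the two observed points. The sketch below is an illustration under loud assumptions: the model is taken to be $\sin(\omega x + b)$, the true parameters and the noise level are hypothetical, the observations are noise-free, and the prior is ignored. Only the observation locations x = 1.5 and x = 3 come from the text.

```python
import math

X_OBS = (1.5, 3.0)                           # observation locations from the text
TRUE_OMEGA, TRUE_B, SIGMA = 1.0, 0.5, 0.1    # hypothetical values

def f(x, omega, b):
    """Assumed model: a single sine with frequency omega and bias b."""
    return math.sin(omega * x + b)

Y_OBS = tuple(f(x, TRUE_OMEGA, TRUE_B) for x in X_OBS)   # noise-free for clarity

def log_lik(omega, b):
    """Gaussian log-likelihood of the two observations, up to an additive constant."""
    return sum(-0.5 * ((y - f(x, omega, b)) / SIGMA) ** 2 for x, y in zip(X_OBS, Y_OBS))

def matching_parameters(k_range=range(-1, 2)):
    """Enumerate (omega, b) pairs whose sine passes exactly through both points.
    sin(t) = y has two families of solutions: t0 + 2*pi*k and pi - t0 + 2*pi*k."""
    t1 = TRUE_OMEGA * X_OBS[0] + TRUE_B
    t2 = TRUE_OMEGA * X_OBS[1] + TRUE_B
    solutions = []
    for k1 in k_range:
        for k2 in k_range:
            for a in (t1 + 2 * math.pi * k1, math.pi - t1 + 2 * math.pi * k1):
                for c in (t2 + 2 * math.pi * k2, math.pi - t2 + 2 * math.pi * k2):
                    omega = (c - a) / (X_OBS[1] - X_OBS[0])
                    solutions.append((omega, a - omega * X_OBS[0]))
    return solutions

modes = matching_parameters()
```

Every pair in `modes` attains the maximum of the likelihood, while a small perturbation of the true frequency already fits much worse; a prior over $\omega$ and $b$ would then decide the relative height of these modes. Some pairs describe the same function and are mirrors of each other, but most describe genuinely different functions.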