Semi-Supervised Learning with Normalizing Flows

2019·Arxiv

Abstract

Normalizing flows transform a latent distribution through an invertible neural network for a flexible and pleasingly simple approach to generative modelling, while preserving an exact likelihood. We propose FlowGMM, an end-to-end approach to generative semi-supervised learning with normalizing flows, using a latent Gaussian mixture model. FlowGMM is distinct in its simplicity, unified treatment of labelled and unlabelled data with an exact likelihood, interpretability, and broad applicability beyond image data. We show promising results on a wide range of applications, including AG-News and Yahoo Answers text data, tabular data, and semi-supervised image classification. We also show that FlowGMM can discover interpretable structure, provide real-time optimization-free feature visualizations, and specify well calibrated predictive distributions.

1. Introduction

The discriminative approach to classification models the probability of a class label given an input p(y|x) directly. The generative approach, by contrast, models the class conditional density for the data p(x|y), and then uses Bayes rule to find p(y|x). In principle, generative modelling has long been more alluring, for the effort is focused on creating an interpretable object of interest, and “what I cannot create, I do not understand”. In practice, discriminative approaches typically outperform generative methods, and thus are far more widely used.

The challenge in generative modelling is that standard approaches to density estimation are often poor descriptions of high-dimensional natural signals. For example, a Gaussian mixture directly over images, while highly flexible for estimating densities, would specify similarities between images as related to Euclidean distances of pixel intensities, which would be a poor inductive bias for handling translations or representing other salient statistical properties. Recently, generative adversarial networks (Goodfellow et al., 2014), variational autoencoders (Kingma & Welling, 2013), and normalizing flows (Dinh et al., 2014), have led to great advances in unsupervised generative modelling, by leveraging the inductive biases of deep convolutional neural networks.

Normalizing flows are a pleasingly simple approach to generative modelling, which work by transforming a distribution through an invertible neural network. Since the transformation is invertible, it is possible to exactly express the likelihood over the observed data, to train the neural network mapping. The network provides both useful inductive biases, and a flexible approach to density estimation. Normalizing flows also admit controllable latent representations and can be sampled efficiently, unlike auto-regressive models (Papamakarios et al., 2017; Oord et al., 2016). Moreover, recent work (Dinh et al., 2016; Kingma & Dhariwal, 2018; Behrmann et al., 2018; Chen et al., 2019; Song et al., 2019) demonstrated that normalizing flows can produce high-fidelity samples for natural image datasets.

Advances in unsupervised generative modelling, such as normalizing flows, are particularly compelling for semi-supervised learning, where we wish to share structure over labelled and unlabelled data, to make better predictions of class labels on unseen data. In this paper, we introduce an approach to semi-supervised learning with normalizing flows, by modelling the density in the latent space as a Gaussian mixture, with each mixture component corresponding to a class represented in the labelled data. This Flow Gaussian Mixture Model (FlowGMM) is to the best of our knowledge the first approach to semi-supervised learning with normalizing flows that provides an exact joint likelihood over both labelled and unlabelled data, for end-to-end training.

We illustrate FlowGMM with a simple example in Figure 1. We are solving a binary semi-supervised classification problem on the dataset shown in panel (a): the labeled data are shown with triangles colored according to their class, and unlabeled data are shown with blue circles. We introduce a Gaussian mixture with two components corresponding to each of the classes, shown in panel (c) in the latent space Z, and an invertible transformation f. The transformation f is then trained to map the data distribution in the data space X to the latent Gaussian mixture in the Z space, mapping the labeled data to the corresponding mixture component. We visualize the learned transformation in panel (b), showing the positions of the images f(x) for all of the training data points. The inverse of this mapping serves as a class-conditional generative model, which we visualize in panel (d). To classify a data point x in the input space we compute its image f(x) in the latent space, and pick the class corresponding to the Gaussian that is closest to f(x). We visualize the decision boundary of the learned classifier with a dashed line in panel (a).

Figure 1. Illustration of semi-supervised learning with FlowGMM on a binary classification problem. Colors represent the two classes or the corresponding Gaussian mixture components. Labeled data are shown with triangles, colored by the corresponding class label, and blue dots represent unlabeled data. (a): Data distribution and the classifier decision boundary. (b): The learned mapping of the data to the latent space. (c): Samples from the Gaussian mixture in the latent space. (d): Samples from the model in the data space.

FlowGMM naturally encodes the clustering principle: the decision boundary between classes must lie in the low-density region in the data space. Indeed, in the latent space the decision boundary between two classes coincides with the hyperplane perpendicular to the line segment connecting means of the corresponding mixture components and passing through the midpoint of this line segment (assuming the components are normal distributions with identity covariance matrices); in panel (b) of Figure 1 we show the decision boundary in the latent space with a dashed line. The density of the latent distribution near the decision boundary is low. As the flow is trained to represent data as a transformation of this latent distribution, the density near the decision boundary should also be low. In panel (a) of Figure 1 the decision boundary indeed lies in the low-density region.
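The equivalence between picking the closest Gaussian and checking the side of the bisecting hyperplane can be illustrated with a short sketch. It assumes identity covariances and balanced classes, as in the paragraph above; the means and the latent point are hypothetical values chosen for illustration, and the function names are not taken from the FlowGMM code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def classify_nearest_mean(z, means):
    """Index of the mixture mean closest to the latent point z.

    With identity covariances and equal class weights this is the
    component with the highest density at z.
    """
    return min(range(len(means)), key=lambda k: sq_dist(z, means[k]))

def signed_side_of_bisector(z, mu_a, mu_b):
    """Signed offset of z from the hyperplane that perpendicularly bisects
    the segment mu_a -> mu_b (negative: mu_a side, positive: mu_b side)."""
    midpoint = [(a + b) / 2 for a, b in zip(mu_a, mu_b)]
    normal = [b - a for a, b in zip(mu_a, mu_b)]
    return sum(n * (zi - m) for n, zi, m in zip(normal, z, midpoint))

means = [[-2.0, 0.0], [2.0, 0.0]]  # hypothetical latent means
z = [0.5, 1.3]                     # hypothetical latent image f(x)
label = classify_nearest_mean(z, means)
side = signed_side_of_bisector(z, means[0], means[1])
```

Here `label` is 1 and `side` is positive, so both criteria assign the point to the second component.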

The contributions of this work include:

• We propose FlowGMM, a new probabilistic classification model based on normalizing flows that can be naturally applied to semi-supervised learning.

• We show that FlowGMM has good performance on a broad range of semi-supervised tasks, including image, text and tabular data classification.

• We propose a new type of probabilistic consistency regularization that significantly improves FlowGMM on image classification problems.

• To demonstrate the interpretability of FlowGMM, we visualize the learned latent space representations for the proposed semi-supervised model and show that interpolations between data points from different classes pass through low-density regions. We also show how FlowGMM can be used for feature visualization in real-time, without requiring gradients.

• We show that the predictive uncertainties of FlowGMM can be naturally calibrated by scaling the variances of mixture components.

• We provide code for FlowGMM at: https://github.com/izmailovpavel/flowgmm

2. Related Work

Kingma et al. (2014) represents one of the earliest works on semi-supervised deep generative modelling, demonstrating how the likelihood model of a variational autoencoder (Kingma & Welling, 2013) could be used for semi-supervised image classification. Xu et al. (2017) later extended this framework to semi-supervised text classification.

Dai et al. (2017) showed that classification performance and generative performance are in direct conflict: a perfect generator provides no benefit to classification performance.

Some works on normalizing flows, such as RealNVP (Dinh et al., 2016), have used class-conditional sampling, where the transformation is conditioned on the class label. These approaches pass the class label as an input to coupling layers, conditioning the output of the flow on the class.

The Deep Invertible Generalized Linear Model (DIGLM, Nalisnick et al., 2019), most closely related to our work, trains a classifier on the latent representation of a normalizing flow to perform supervised or semi-supervised image classification. Our approach is fundamentally different, as we use a mixture of Gaussians in the latent space Z and perform classification based on class-conditional likelihoods (see (5)), rather than training a separate classifier. Key advantages of our approach are the explicit encoding of the clustering principle in the method and a more natural probabilistic interpretation.

Indeed, many approaches to semi-supervised learning learn from the labelled and unlabelled data using different (and possibly misaligned) objectives, often also involving two-step procedures where the unsupervised model is used as pre-processing for a supervised approach. In general, FlowGMM is distinct in that the generative model is used directly as a Bayes classifier, and in the limit of a perfect generative model the Bayes classifier achieves a provably optimal misclassification rate (see e.g. Mohri et al., 2018). Moreover, approaches to semi-supervised classification, such as consistency regularization (Laine & Aila, 2016; Miyato et al., 2018; Tarvainen & Valpola, 2017; Athiwaratkun et al., 2019; Verma et al., 2019; Berthelot et al., 2019), typically focus on image modelling. We instead focus on showcasing the broad applicability of FlowGMM on text, tabular, and image data, as well as the ability to conveniently discover interpretable structure.

3. Background: Normalizing Flows

The normalizing flow (Dinh et al., 2016) is an unsupervised model for density estimation defined as an invertible mapping $f: \mathcal{X} \to \mathcal{Z}$ from the data space $\mathcal{X}$ to the latent space $\mathcal{Z}$. We can model the data distribution as a transformation $f^{-1}: \mathcal{Z} \to \mathcal{X}$ applied to a random variable from the latent distribution $z \sim p_{\mathcal{Z}}$, which is often chosen to be Gaussian. The density of the transformed random variable $x = f^{-1}(z)$ is given by the change of variables formula

$$p_{\mathcal{X}}(x) = p_{\mathcal{Z}}(f(x)) \cdot \left| \det\left( \frac{\partial f}{\partial x} \right) \right|. \tag{1}$$

The mapping f is implemented as a sequence of invertible functions, parametrized by a neural network with an architecture that is designed to ensure invertibility and efficient computation of log-determinants, and a set of parameters $\theta$ that can be optimized. The model can be trained by maximizing the likelihood in Equation (1) of the training data with respect to the parameters $\theta$.
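As a minimal sketch of the change of variables formula, consider a toy elementwise affine map in place of a neural network. The map, its parameters, and the function names below are assumptions made for illustration only. Because the map is affine, the flow density must coincide with a diagonal Gaussian density, which gives a direct check.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of N(0, I) at the vector z."""
    return sum(-0.5 * zi ** 2 - 0.5 * math.log(2 * math.pi) for zi in z)

def affine_flow(x, scale, shift):
    """Toy invertible map f(x) = (x - shift) / scale, applied elementwise.

    Returns z = f(x) and log|det(df/dx)| = -sum(log|scale_i|).
    """
    z = [(xi - b) / a for xi, a, b in zip(x, scale, shift)]
    log_det = -sum(math.log(abs(a)) for a in scale)
    return z, log_det

def flow_log_likelihood(x, scale, shift):
    """log p_X(x) = log p_Z(f(x)) + log|det(df/dx)|."""
    z, log_det = affine_flow(x, scale, shift)
    return standard_normal_logpdf(z) + log_det

def gaussian_logpdf(x, mean, std):
    """Reference log-density of a diagonal Gaussian, used only as a check."""
    return sum(
        -0.5 * ((xi - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for xi, m, s in zip(x, mean, std)
    )

x = [0.3, -1.2]
scale, shift = [2.0, 0.5], [1.0, -1.0]  # hypothetical flow parameters
ll_flow = flow_log_likelihood(x, scale, shift)
ll_exact = gaussian_logpdf(x, shift, scale)  # agrees with ll_flow
```

Training a flow amounts to maximizing `flow_log_likelihood` over the parameters of the map; in practice the affine map is replaced by a deep invertible network whose log-determinant is cheap to evaluate.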

4. Flow Gaussian Mixture Model

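We introduce the Flow Gaussian Mixture Model (FlowGMM), a probabilistic generative model for semi-supervised learning with normalizing flows. In FlowGMM, we introduce a discrete latent variable y for the class label, $y \in \{1, \dots, \mathcal{C}\}$. Our latent space distribution, conditioned on a label k, is Gaussian with mean $\mu_k$ and covariance $\Sigma_k$:

$$p_{\mathcal{Z}}(z \mid y = k) = \mathcal{N}(z \mid \mu_k, \Sigma_k). \tag{2}$$

The marginal distribution of z is then a Gaussian mixture. When the classes are balanced, this distribution is

$$p_{\mathcal{Z}}(z) = \frac{1}{\mathcal{C}} \sum_{k=1}^{\mathcal{C}} \mathcal{N}(z \mid \mu_k, \Sigma_k).$$

Combining equations (2) and (1), the likelihood for labeled data is

$$p_{\mathcal{X}}(x \mid y = k) = \mathcal{N}(f(x) \mid \mu_k, \Sigma_k) \cdot \left| \det\left( \frac{\partial f}{\partial x} \right) \right|, \tag{3}$$

and the likelihood for data with unknown label is $p_{\mathcal{X}}(x) = \sum_{k} p_{\mathcal{X}}(x \mid y = k)\, p(y = k)$. If we have access to both a labeled dataset $\mathcal{D}_\ell$ and an unlabeled dataset $\mathcal{D}_u$, then we can train our model in a semi-supervised way to maximize the joint likelihood of the labeled and unlabeled data

$$p_{\mathcal{X}}(\mathcal{D}_\ell, \mathcal{D}_u \mid \theta) = \prod_{(x_i, y_i) \in \mathcal{D}_\ell} p_{\mathcal{X}}(x_i, y_i) \prod_{x_j \in \mathcal{D}_u} p_{\mathcal{X}}(x_j) \tag{4}$$

over the parameters $\theta$ of the bijective function f, which learns a density model for a Bayes classifier. In particular, given a test point x, the model predictive distribution is given by

$$p_{\mathcal{X}}(y \mid x) = \frac{p_{\mathcal{X}}(x \mid y)\, p(y)}{p_{\mathcal{X}}(x)} = \frac{\mathcal{N}(f(x) \mid \mu_y, \Sigma_y)}{\sum_{k=1}^{\mathcal{C}} \mathcal{N}(f(x) \mid \mu_k, \Sigma_k)}. \tag{5}$$

We can then make predictions for a test point x with the Bayes decision rule

$$y = \arg\max_{i \in \{1, \dots, \mathcal{C}\}} p_{\mathcal{X}}(y = i \mid x).$$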

As an alternative to direct likelihood maximization, we can adapt the Expectation Maximization algorithm for model training as discussed in Appendix A.
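The quantities in this section can be sketched in a few lines, given the latent image z = f(x) and the log-determinant of the Jacobian for each point. The sketch assumes identity covariances and balanced classes; the means, latent points, log-determinants, and function names are hypothetical and do not come from the released FlowGMM code.

```python
import math

def log_normal_identity(z, mean):
    """log N(z | mean, I) for a latent vector z."""
    sq = sum((zi - mi) ** 2 for zi, mi in zip(z, mean))
    return -0.5 * sq - 0.5 * len(z) * math.log(2 * math.pi)

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def labeled_log_lik(z, log_det, label, means):
    """log p(x, y): class-conditional term (3) plus the balanced prior log(1/C)."""
    return log_normal_identity(z, means[label]) + log_det - math.log(len(means))

def unlabeled_log_lik(z, log_det, means):
    """log p(x): log mixture density at z plus log|det(df/dx)|."""
    comps = [log_normal_identity(z, mu) for mu in means]
    return logsumexp(comps) - math.log(len(means)) + log_det

def joint_log_lik(labeled, unlabeled, means):
    """Log of the joint likelihood (4).

    labeled:   iterable of (z, log_det, label) triples
    unlabeled: iterable of (z, log_det) pairs
    """
    total = sum(labeled_log_lik(z, ld, y, means) for z, ld, y in labeled)
    total += sum(unlabeled_log_lik(z, ld, means) for z, ld in unlabeled)
    return total

def predictive(z, means):
    """p(y = k | x) as in (5); the Jacobian term cancels in the ratio."""
    comps = [log_normal_identity(z, mu) for mu in means]
    norm = logsumexp(comps)
    return [math.exp(c - norm) for c in comps]

means = [[-2.0, 0.0], [2.0, 0.0]]   # hypothetical mixture means
z, log_det = [1.5, 0.2], -0.3       # hypothetical f(x) and log|det|
probs = predictive(z, means)
prediction = max(range(len(probs)), key=probs.__getitem__)  # Bayes decision rule
objective = joint_log_lik([(z, log_det, 1)], [([-1.8, 0.4], 0.1)], means)
```

Note that the prediction depends only on z, while the training objective also needs the log-determinant, which is what ties the classifier to the density model.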

Figure 2. Illustration of FlowGMM performance on synthetic datasets. Labeled data are shown with colored triangles, and unlabeled data are shown with blue circles. Colors represent different classes. We compare the classifier decision boundaries when only using labeled data (panels b, d) and when using both labeled and unlabeled data (panels a, c) on two circles (panels a, b) and pinwheel (panels c, d) datasets. FlowGMM leverages unlabeled data to push the decision boundary to low-density regions of the space.

4.1. Consistency Regularization

Most of the existing state-of-the-art approaches to semi-supervised learning on image data are based on consistency regularization (Laine & Aila, 2016; Miyato et al., 2018; Tarvainen & Valpola, 2017; Athiwaratkun et al., 2019; Verma et al., 2019; Xie et al., 2020; Berthelot et al., 2020). These methods penalize changes in network predictions with respect to input perturbations, such as random translations and horizontal flips, with an additional loss term that can be computed on unlabeled data,

$$\ell_{\text{cons}}(x) = \| g(x') - g(x'') \|^2, \tag{6}$$

where $x'$, $x''$ are random perturbations of x, and g is the vector of probabilities over the classes.

Motivated by these methods, we introduce a new consistency regularization term for FlowGMM. Let $y''$ be the label predicted on image $x''$ by FlowGMM according to (5). We then define our consistency loss as the negative log likelihood of the input $x'$ given the label $y''$:

$$L_{\text{cons}}(x', x'') = -\log p(x' \mid y'') = -\log \mathcal{N}(f(x') \mid \mu_{y''}, \Sigma_{y''}) - \log \left| \det\left( \frac{\partial f}{\partial x'} \right) \right|. \tag{7}$$

This loss term encourages the model to map small perturbations of the same unlabeled inputs to the same components of the Gaussian mixture distribution in the latent space. Unlike the standard consistency loss of (6), the proposed loss in (7) takes values on the same scale as the data log likelihood (4), and indeed we find it to work better in practice. We refer to FlowGMM with the consistency term as FlowGMM-cons. The final loss for FlowGMM-cons is then the weighted sum of the consistency loss (7) and the negative log likelihood of both labeled and unlabeled data (4).
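The consistency term can be sketched as follows: predict a label from one perturbed copy of the input, then score the other copy under that component. The sketch assumes identity covariances and uses a toy elementwise affine map as a stand-in for the flow; `toy_flow`, the means, and the perturbed inputs are hypothetical.

```python
import math

def toy_flow(x, scale=(2.0, 0.5), shift=(0.0, 0.0)):
    """Stand-in for the flow f: returns z = f(x) and log|det(df/dx)|."""
    z = [(xi - b) / a for xi, a, b in zip(x, scale, shift)]
    log_det = -sum(math.log(abs(a)) for a in scale)
    return z, log_det

def log_normal_identity(z, mean):
    """log N(z | mean, I)."""
    sq = sum((zi - mi) ** 2 for zi, mi in zip(z, mean))
    return -0.5 * sq - 0.5 * len(z) * math.log(2 * math.pi)

def predict_label(z, means):
    """Bayes decision rule for balanced classes with identity covariances."""
    return max(range(len(means)), key=lambda k: log_normal_identity(z, means[k]))

def consistency_loss(x1, x2, means, flow=toy_flow):
    """Negative log-likelihood of x1 under the component predicted for x2."""
    z2, _ = flow(x2)
    y2 = predict_label(z2, means)        # label predicted on the second copy
    z1, log_det1 = flow(x1)
    return -(log_normal_identity(z1, means[y2]) + log_det1)

means = [[-2.0, 0.0], [2.0, 0.0]]        # hypothetical mixture means
x_a, x_b = [3.0, 0.1], [3.4, -0.1]       # two perturbations of one input
loss_same = consistency_loss(x_a, x_b, means)
loss_diff = consistency_loss(x_a, [-3.0, 0.0], means)  # copies disagree
```

When the two copies fall into different components, as in `loss_diff`, the loss is much larger, which is the signal that pulls perturbed versions of an input toward the same mixture component.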

5. Experiments

We evaluate FlowGMM on a wide range of datasets across different application domains including low-dimensional synthetic data (Section 5.1), text and tabular data (Section 5.2), and image data (Section 5.3). We show that FlowGMM outperforms the baselines on tabular and text data. FlowGMM is also state-of-the-art as an end-to-end generative approach to semi-supervised image classification, conditioned on architecture. However, FlowGMM is constrained by the RealNVP architecture, and thus does not outperform the most powerful approaches in this setting, which involve discriminative classifiers.

In all experiments, we use the RealNVP normalizing flow architecture. Throughout training, Gaussian mixture parameters are fixed: the means are initialized randomly from the standard normal distribution and the covariances are set to I. See Appendix B for further discussion on GMM initialization and training.

5.1. Synthetic Data

We first apply FlowGMM to a range of two-dimensional synthetic datasets, in order to gain a better visual intuition for the method. We use the RealNVP architecture with 5 coupling layers, defined by fully-connected shift and scale networks, each with 1 hidden layer of size 512. In addition to the semi-supervised setting, we also trained the method only using the labeled data. In Figure 2 we visualize the decision boundaries of the classifier corresponding to FlowGMM for both of these settings on the two circles and pinwheel datasets. On both datasets, FlowGMM is able to benefit from the unlabeled data to push the decision boundary to a low-density region, as expected. On the two circles problem the method is unable to fit the data perfectly as flows are homeomorphisms, and the disk is topologically distinct from an annulus. However, FlowGMM still produces a reasonable decision boundary and improves over the case when only labeled data are available. We provide additional visualizations in Appendix C, Figure 4.
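For readers unfamiliar with coupling layers, the following is a minimal sketch of a stack of five affine coupling layers on two-dimensional inputs, with shift and scale networks that each have one hidden layer. It is not the configuration used in the experiments: the hidden size (8 instead of 512), the tanh nonlinearity, the random untrained weights, and all names are assumptions chosen to keep the example small.

```python
import math
import random

def make_mlp(n_in, n_hidden, n_out, rng):
    """Fully-connected net with one hidden layer and random, untrained weights."""
    w1 = [[rng.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rng.gauss(0, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]

    def net(v):
        hidden = [math.tanh(sum(w * vi for w, vi in zip(row, v))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return net

class AffineCoupling:
    """y_a = x_a,  y_b = x_b * exp(s(x_a)) + t(x_a),  log|det| = sum(s(x_a))."""

    def __init__(self, dim, n_hidden, rng, flip=False):
        self.split = dim // 2
        self.flip = flip  # alternate which half passes through unchanged
        self.scale_net = make_mlp(self.split, n_hidden, dim - self.split, rng)
        self.shift_net = make_mlp(self.split, n_hidden, dim - self.split, rng)

    def _halves(self, v):
        v = v[::-1] if self.flip else v
        return v[:self.split], v[self.split:]

    def _join(self, a, b):
        v = a + b
        return v[::-1] if self.flip else v

    def forward(self, x):
        x_a, x_b = self._halves(x)
        s, t = self.scale_net(x_a), self.shift_net(x_a)
        y_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
        return self._join(x_a, y_b), sum(s)

    def inverse(self, y):
        y_a, y_b = self._halves(y)
        s, t = self.scale_net(y_a), self.shift_net(y_a)
        x_b = [(yb - ti) * math.exp(-si) for yb, si, ti in zip(y_b, s, t)]
        return self._join(y_a, x_b)

def flow_forward(x, layers):
    """Apply all layers in order, accumulating the log-determinant."""
    log_det = 0.0
    for layer in layers:
        x, ld = layer.forward(x)
        log_det += ld
    return x, log_det

def flow_inverse(z, layers):
    """Invert the stack by inverting each layer in reverse order."""
    for layer in reversed(layers):
        z = layer.inverse(z)
    return z

rng = random.Random(0)
layers = [AffineCoupling(2, 8, rng, flip=(i % 2 == 1)) for i in range(5)]
x = [0.7, -0.4]
z, log_det = flow_forward(x, layers)
x_rec = flow_inverse(z, layers)  # recovers x up to floating-point error
```

Each layer leaves half of the coordinates untouched, so its Jacobian is triangular and the log-determinant is just the sum of the scale outputs. Every layer is a continuous bijection regardless of the weights, which is the property behind the topological limitation on the two circles problem noted above.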

5.2. Text and Tabular Data

FlowGMM can be especially useful for semi-supervised learning on tabular data. Consistency-based semi-supervised methods have mostly been developed for image classification, where the predictions of the method are regularized to be invariant to random flips and translations of the image. On tabular data, desirable invariances are less obvious, and finding suitable transformations to apply for consistency-based methods is non-trivial. Similarly, approaches based on GANs have mostly been developed for images. We evaluate FlowGMM on the Hepmass and Miniboone UCI classification datasets (previously used in Papamakarios et al. (2017) for density estimation).

Along with standard tabular UCI datasets, we also consider text classification on AG-News and Yahoo Answers datasets. Using the recent advances in transfer learning for NLP, we construct embeddings for input texts using the BERT transformer model (Devlin et al., 2018) trained on a corpus of Wikipedia articles, and then train FlowGMM and other baselines on the embeddings.

We compare FlowGMM to the graph-based label spreading method from Zhou et al. (2004), a Π-Model (Laine & Aila, 2016) that uses dropout perturbations, as well as supervised logistic regression, k-nearest neighbors, and a neural network trained on the labeled data only. We report the results in Table 1, where FlowGMM outperforms the alternative semi-supervised learning methods on each of the considered datasets. Implementation details for FlowGMM, the baselines, and data preprocessing details are in Appendix D.

5.3. Image Classification

We next evaluate the proposed method on semi-supervised image classification benchmarks on CIFAR-10, MNIST and SVHN datasets. For all the datasets, we use the RealNVP (Dinh et al., 2016) architecture. Exact implementation details are listed in Appendix E. The supervised model is trained using the same loss (4), where all the data points are labeled ($n_u = 0$, that is, no unlabeled data are used).

We present the results for FlowGMM and FlowGMM-cons in Table 2. We also report results from DIGLM (Nalisnick et al., 2019), supervised only performance on MNIST and SVHN, and the M1+M2 VAE model (Kingma et al., 2014). FlowGMM outperforms the M1+M2 model and performs better or on par with DIGLM. Furthermore, FlowGMM-cons improves over FlowGMM on all three datasets, suggesting that our proposed consistency regularization is helpful for performance.

Following Oliver et al. (2018), we evaluate FlowGMM-cons varying the number of labeled data points. Specifically, we follow the setup of Kingma et al. (2014) and train FlowGMM-cons on MNIST with 100, 600, 1000 and 3000 labeled data points. We present the results in Table 3. FlowGMM-cons outperforms the M1+M2 model of Kingma et al. (2014) in all the considered settings.

We note that the results presented in this Section are not directly comparable with the state-of-the-art methods using GANs or consistency regularization (see e.g. Laine & Aila, 2016; Dai et al., 2017; Athiwaratkun et al., 2019; Berthelot et al., 2019), as the architecture we employ is much less powerful for classification than the ConvNet and ResNet architectures that have been designed for classification without the constraint of invertibility. We believe that invertible architectures with better inductive biases for classification may help bridge this gap; invertible residual networks (Behrmann et al., 2018; Chen et al., 2019) and invertible CNNs (Finzi et al., 2019) are some of the early examples of this class of architectures.

In general, it is difficult to directly compare FlowGMM with most existing approaches, because the types of architectures available for fully generative normalizing flows are very different than what is available to (partially) discriminative approaches or even other generative methods like VAEs. This difference is due to the invertibility requirement for normalizing flows.

6. Model Analysis

We empirically analyze different aspects of FlowGMM and highlight some useful features of this model. In Section 6.1 we discuss the calibration of predictive uncertainties produced by the model. In Section 6.2, we study the latent representations learned by FlowGMM. Finally, in Section 6.3, we discuss a feature visualization technique that can be used to interpret the features learned by FlowGMM.

6.1. Uncertainty and Calibration

In many applications, particularly where decision making is involved, it is crucial to have reliable confidences associated with predictions. In classification problems, well-calibrated models are expected to output accurate probabilities of belonging to a particular class. Reliable uncertainty estimation is especially relevant in semi-supervised learning since label information is limited during training. Guo et al. (2017) showed that modern deep learning models are highly over-confident, but could be easily recalibrated with temperature scaling. In this Section, we analyze the predictive uncertainties produced by FlowGMM. In Appendix Section F, we also consider out-of-domain data detection.

Table 1. Accuracy on BERT embedded text classification datasets and UCI datasets with a small number of labeled examples. The kNN baseline, logistic regression, and the 3-Layer NN + Dropout were trained on the labeled data only. Numbers reported for each method are the best of 3 runs (ranked by performance on the validation set). $n_\ell$ / $n_u$ are the numbers of labeled and unlabeled data points.

Table 2. Accuracy of the FlowGMM, VAE model (M1+M2 VAE, Kingma et al., 2014), DIGLM (Nalisnick et al., 2019) in supervised and semi-supervised settings on MNIST, SVHN, and CIFAR-10. FlowGMM Sup (All labels) as well as DIGLM Sup (All labels) were trained on full train datasets with all labels to demonstrate general capacity of these models. FlowGMM Sup ($n_\ell$ labels) was trained on $n_\ell$ labeled examples (and no unlabeled data). For reference, at the bottom we list the performance of the Π-Model (Laine & Aila, 2016) and BadGAN (Dai et al., 2017) as representative consistency-based and GAN-based state-of-the-art methods. Both of these methods use non-invertible architectures with substantially higher base performance and, thus, are not directly comparable.

When using FlowGMM for classification, the class predictive probabilities are

$$p(y \mid x) = \frac{\mathcal{N}(f(x) \mid \mu_y, \Sigma_y)}{\sum_{k=1}^{\mathcal{C}} \mathcal{N}(f(x) \mid \mu_k, \Sigma_k)}.$$

Since we initialize Gaussian mixture means randomly from the standard normal distribution and do not train them along with the flow parameters (see Appendix B), FlowGMM predictions become inherently overconfident due to the curse of dimensionality. For example, consider two Gaussians with means $\mu_1, \mu_2 \sim \mathcal{N}(0, I)$ sampled independently from the standard normal in D-dimensional space. If $s_1 \sim \mathcal{N}(\mu_1, I)$ is a sample from the first Gaussian, then its expected squared distances to both mixture means are $\mathbb{E}\left[\|s_1 - \mu_1\|^2\right] = D$ and $\mathbb{E}\left[\|s_1 - \mu_2\|^2\right] = 3D$ (for a detailed derivation see Appendix Section G). In high dimensional spaces, such logits would lead to hard label assignment in FlowGMM (for exactly one class). In fact, in the experiments we observe that FlowGMM is over-confident and performs hard label assignment: predicted class probabilities are all close to either 1 or 0.
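The two expected squared distances can be checked numerically. The difference between the sample and its own mean is a standard normal vector, giving D; the difference to the other mean is a sum of three independent standard normal vectors, giving 3D. The Monte Carlo sketch below uses an arbitrary dimension and trial count chosen for illustration.

```python
import random

def mean_sq_distances(dim, n_trials, rng):
    """Average squared distance from a sample of the first Gaussian to its
    own mean and to the mean of the second Gaussian."""
    own, other = 0.0, 0.0
    for _ in range(n_trials):
        mu1 = [rng.gauss(0, 1) for _ in range(dim)]   # mean of component 1
        mu2 = [rng.gauss(0, 1) for _ in range(dim)]   # mean of component 2
        s1 = [m + rng.gauss(0, 1) for m in mu1]       # sample from N(mu1, I)
        own += sum((s - m) ** 2 for s, m in zip(s1, mu1))
        other += sum((s - m) ** 2 for s, m in zip(s1, mu2))
    return own / n_trials, other / n_trials

D = 50  # illustrative latent dimension
own, other = mean_sq_distances(D, 2000, random.Random(0))
# own is close to D and other is close to 3 * D
```

With identity covariances the logit of class k is minus half the squared distance, so the expected logit gap between the two classes is (3D - D) / 2 = D. A softmax over logits that differ by an amount growing linearly in D saturates quickly, which matches the hard label assignment observed in the experiments.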

We address this problem by learning a single scalar parameter $\sigma^2$ for all components in the Gaussian mixture (the component k will be $\mathcal{N}(\mu_k, \sigma^2 I)$) by minimizing the negative log likelihood on a validation set. This way we can naturally re-calibrate the variance of the latent GMM. This procedure is also equivalent to applying temperature scaling (Guo et al., 2017) to the logits $-\frac{1}{2}\|f(x) - \mu_k\|^2$, with $\sigma^2$ playing the role of the temperature. We test FlowGMM calibration on MNIST and CIFAR-10 datasets in the supervised setting. On MNIST we restricted the training set size to 1000 objects, since on the full dataset the model makes too few mistakes, which makes evaluating calibration harder. In Table 4, we report negative log likelihood and expected calibration error (ECE, see Guo et al. (2017) for a description of this metric). We can see that re-calibrating variances of the Gaussians in the mixture significantly improves both metrics and mitigates overconfidence. The effectiveness of this simple rescaling procedure suggests that the latent space distances learned by the flow model are correlated with the probabilities of belonging to a particular class: the closer a datapoint is to the mean of a Gaussian in the latent space, the more likely it belongs to the corresponding class.
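A minimal sketch of this re-calibration step is given below. It fits the shared variance by a grid search over the validation negative log-likelihood of the labels; the grid search, the synthetic validation latents (drawn with standard deviation 3 around two hypothetical means, so that unit variance is overconfident), and the function names are assumptions made for illustration rather than the procedure used in the experiments.

```python
import math
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nll(latents, labels, means, sigma2):
    """Mean negative log-likelihood of the labels when class probabilities
    are softmax(-||z - mu_k||^2 / (2 * sigma2))."""
    total = 0.0
    for z, y in zip(latents, labels):
        logits = [-sq_dist(z, mu) / (2.0 * sigma2) for mu in means]
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total -= logits[y] - log_norm
    return total / len(latents)

def fit_sigma2(latents, labels, means, grid):
    """Pick the shared variance that minimizes validation NLL."""
    return min(grid, key=lambda s2: nll(latents, labels, means, s2))

rng = random.Random(0)
means = [[-2.0, 0.0], [2.0, 0.0]]                    # hypothetical fixed means
labels = [rng.randrange(2) for _ in range(2000)]     # synthetic validation labels
latents = [[rng.gauss(m, 3.0) for m in means[y]] for y in labels]
grid = [10 ** (k / 20) for k in range(-20, 41)]      # 0.1 ... 100, log-spaced
sigma2 = fit_sigma2(latents, labels, means, grid)
nll_before = nll(latents, labels, means, 1.0)        # unit variance
nll_after = nll(latents, labels, means, sigma2)      # re-calibrated variance
```

Because the means stay fixed, changing the variance never changes the predicted class; it only softens or sharpens the probabilities, exactly as a temperature does.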

Table 3. Semi-supervised classification accuracy for FlowGMM-cons and VAE M1 + M2 model (Kingma et al., 2014) on MNIST for different number of labeled data points

Table 4. Negative log-likelihood and Expected Calibration Error for supervised FlowGMM trained on MNIST (1k train, 1k validation, 10k test) and CIFAR-10 (50k train, 1k validation, 9k test). FlowGMM-temp stands for tempered FlowGMM where a single scalar parameter was learned on a validation set for variances in all components.

6.2. Learned Latent Representations

We next analyze the latent representation space learned by FlowGMM. We examine latent interpolations between members of the same class in Figure 3(a) and between different classes in Figure 3(b) for our MNIST FlowGMM-cons model trained with labels. As expected, inter-class interpolations pass through low-density regions, leading to low quality samples, but intra-class interpolations do not. These observations suggest that, as expected, the model learns to put the decision boundary in the low-density region of the data space.

In Appendix section H, we present images corresponding to the means of the Gaussian mixture and class-conditional samples from FlowGMM.

Distance to Decision Boundary. To explicitly test this conclusion, we compute the distribution of distances from unlabeled data to the decision boundary for FlowGMM-cons and FlowGMM Sup trained on labeled data only. In order to compute this distance exactly for an image x, we find the two closest means $\mu'$, $\mu''$ to the corresponding latent variable z = f(x), and evaluate the expression

$$d(x) = \frac{\left| \, \|\mu' - f(x)\|^2 - \|\mu'' - f(x)\|^2 \, \right|}{2 \, \|\mu' - \mu''\|}.$$
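A short sketch of this computation follows. It assumes identity covariances and balanced classes, so that the boundary between the two nearest components is the hyperplane that perpendicularly bisects the segment between their means; the distance is then the difference of the squared distances to the two means divided by twice the distance between the means. The means and the latent point are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def distance_to_boundary(z, means):
    """Latent-space distance from z to the bisecting hyperplane between
    the two mixture means closest to z."""
    ranked = sorted((sq_dist(z, mu), k) for k, mu in enumerate(means))
    (d_first, k_first), (d_second, k_second) = ranked[0], ranked[1]
    gap = math.sqrt(sq_dist(means[k_first], means[k_second]))
    return abs(d_first - d_second) / (2.0 * gap)

means = [[-2.0, 0.0], [2.0, 0.0], [0.0, 10.0]]  # hypothetical latent means
z = [0.5, 1.3]                                  # hypothetical f(x)
d = distance_to_boundary(z, means)
```

In this example the two nearest means lie on the horizontal axis, the boundary between them is the vertical axis, and `d` equals 0.5, the horizontal coordinate of z.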

6.3. Feature Visualization

Feature visualization has become an important tool for increasing the interpretability of neural networks in supervised learning. The majority of methods rely on maximizing the activations of a given neuron, channel, or layer over a parametrization of an input image with different kinds of image regularization (Szegedy et al., 2013; Olah et al., 2017; Mahendran & Vedaldi, 2015). These methods, while effective, require iterative optimization too costly for real time interactive exploration. In this Section we discuss a simple and efficient feature visualization technique that leverages the invertibility of FlowGMM. This technique can be used with any invertible model but is especially relevant for FlowGMM, where we can use feature visualization to gain insights into the classification decisions made by the model.

Since our classification model uses a flow which is a sequence of invertible transformations $f = f_L \circ \dots \circ f_1$, intermediate activations can be inverted directly. This means that we can combine the methods of feature inversion and feature maximization directly by feeding in a set of input images, modifying intermediate activations arbitrarily, and inverting the representation. Given a set of activations $a_\ell[c, i, j] = (f_\ell \circ \dots \circ f_1)(x)_{cij}$ in the $\ell$-th layer with channels c and spatial extent i, j, we may perturb a single neuron with

$$x(\alpha) = f_{1:\ell}^{-1}\left( f_{1:\ell}(x) + \alpha \, \sigma_c \, \delta_c \right),$$

where $f_{1:\ell} = f_\ell \circ \dots \circ f_1$, $\delta_c$ is a one-hot vector at channel c, and $\sigma_c$ is the standard deviation of the activations in channel c over the training set and spatial locations. This procedure can be performed at real-time rates to explore the activation parametrized by $\alpha$ and the location (c, i, j) without any optimization or hyper-parameters. We show the feature visualization for intermediate layers on CIFAR-10 test images in Figure 3(d). The channel being visualized appears to activate on the zeroed pixels from random translations as well as the green channel. Analyzing the features learned by FlowGMM we can gain insight into the workings of the model.

Figure 3. Visualizations of the latent space representations learned by supervised FlowGMM on MNIST. (a): Latent space interpolations between test images from the same class and (b): from different classes. Observe that interpolations between objects from different classes pass through low-density regions. (c): Histogram of distances from unlabeled data to the decision boundary for FlowGMM-cons trained on 1k labeled and 59k unlabeled data and FlowGMM Sup trained on 1k labeled data only. FlowGMM-cons is able to push the decision boundary away from the data distribution using unlabeled data. (d): Feature visualization for CIFAR-10: four test reconstructions are shown as an intermediate feature is perturbed. The value of the perturbation is shown in red vs the distribution of the channel activations. Observe that the channel visualized activates on zeroed out pixels to the left of the image mimicking the random translations applied to the training data.
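The perturb-and-invert procedure can be sketched with a toy invertible network. The two layers below (an invertible channel mixing, loosely analogous to a 1x1 convolution, followed by a leaky ReLU) are assumptions standing in for the first layers of a real flow, and the images, weights, and function names are hypothetical. The sketch perturbs one activation at a given channel and location by a multiple of the channel standard deviation and maps the result back to input space.

```python
import math

MIX = [[1.0, 0.5], [-0.5, 1.0]]  # hypothetical invertible channel mixing
DET = MIX[0][0] * MIX[1][1] - MIX[0][1] * MIX[1][0]
MIX_INV = [[MIX[1][1] / DET, -MIX[0][1] / DET],
           [-MIX[1][0] / DET, MIX[0][0] / DET]]
SLOPE = 0.2                      # leaky ReLU slope; the layer stays invertible

def mix_channels(x, m):
    """Apply a 2x2 matrix across the channel axis at every pixel; x is [c][i][j]."""
    h, w = len(x[0]), len(x[0][0])
    return [[[m[c][0] * x[0][i][j] + m[c][1] * x[1][i][j] for j in range(w)]
             for i in range(h)] for c in range(2)]

def leaky(x, slope):
    """Elementwise leaky ReLU; calling it with 1 / slope inverts it."""
    return [[[v if v >= 0 else slope * v for v in row] for row in ch] for ch in x]

def forward_to_layer(x):
    """Activations of the intermediate layer for input x."""
    return leaky(mix_channels(x, MIX), SLOPE)

def inverse_from_layer(a):
    """Map intermediate activations back to input space."""
    return mix_channels(leaky(a, 1.0 / SLOPE), MIX_INV)

def channel_std(images, c):
    """Std of channel c activations over a set of images and all locations."""
    vals = [v for x in images for row in forward_to_layer(x)[c] for v in row]
    mean = sum(vals) / len(vals)
    return math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))

def perturb_and_invert(x, c, i, j, alpha, sigma_c):
    """Add alpha * sigma_c to activation (c, i, j) and invert to input space."""
    a = forward_to_layer(x)
    a[c][i][j] += alpha * sigma_c
    return inverse_from_layer(a)

x = [[[0.5, -1.0], [2.0, 0.3]], [[1.5, 0.2], [-0.7, 0.9]]]     # 2 channels, 2x2
other = [[[-0.4, 1.1], [0.6, -2.0]], [[0.3, -0.8], [1.2, 0.1]]]
sigma_0 = channel_std([x, other], 0)
reconstruction = perturb_and_invert(x, 0, 0, 1, 0.0, sigma_0)  # alpha = 0
visualization = perturb_and_invert(x, 0, 0, 1, 2.0, sigma_0)   # alpha = 2
```

No gradients or iterative optimization are involved: each visualization costs one forward pass to the chosen layer and one inverse pass, which is what makes interactive exploration over the perturbation size and location feasible.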

7. Discussion

We proposed a simple and interpretable approach for end-to-end generative semi-supervised prediction with normalizing flows. While FlowGMM does not yet outperform the most powerful discriminative approaches for semi-supervised image classification (Athiwaratkun et al., 2019; Verma et al., 2019), we believe it is a promising step towards making fully generative approaches more practical for semi-supervised tasks. As we develop improved invertible architectures, the performance of FlowGMM will also continue to improve.

Moreover, FlowGMM does outperform graph-based and consistency-based baselines on tabular data including semi-supervised text classification with BERT embeddings. We believe that the results show promise for generative semi-supervised learning based on normalizing flows, especially for tabular tasks where consistency-based methods struggle.

We view interpretability and broad applicability as a strong advantage of FlowGMM. The access to latent space representations and the feature visualization technique discussed in Section 6 as well as the ability to sample from the model can be used to obtain insights into the performance of the model in practical applications.

References

Atanov, A., Volokhova, A., Ashukha, A., Sosnovik, I., and Vetrov, D. Semi-conditional normalizing flows for semi-supervised learning. arXiv preprint arXiv:1905.00505, 2019.

Athiwaratkun, B., Finzi, M., Izmailov, P., and Wilson, A. G. There are many consistent explanations of unlabeled data: Why you should average. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rkgKBhA5Y7.

Behrmann, J., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.

Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. MixMatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.

Berthelot, D., Carlini, N., Cubuk, E. D., Kurakin, A., Sohn, K., Zhang, H., and Raffel, C. ReMixMatch: Semi-supervised learning with distribution matching and augmentation anchoring. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklkeR4KPB.

Chen, R. T., Behrmann, J., Duvenaud, D., and Jacobsen, J.-H. Residual flows for invertible generative modeling. arXiv preprint arXiv:1906.02735, 2019.

Dai, Z., Yang, Z., Yang, F., Cohen, W. W., and Salakhutdinov, R. R. Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems 30, pp. 6510–6520, 2017.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Finzi, M., Izmailov, P., Maddox, W., Kirichenko, P., and Wilson, A. G. Invertible convolutional networks. In Workshop on Invertible Neural Nets and Normalizing Flows, International Conference on Machine Learning, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. CoRR, abs/1706.04599, 2017. URL http://arxiv.org/abs/1706.04599.

Izmailov, P., Kirichenko, P., Finzi, M., and Wilson, A. G. Semi-supervised learning with normalizing flows. In Workshop on Invertible Neural Nets and Normalizing Flows, International Conference on Machine Learning, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589, 2014.

Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.

Mahendran, A. and Vedaldi, A. Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188–5196, 2015.

Miyato, T., Maeda, S.-i., Ishii, S., and Koyama, M. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 2018.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of machine learning. MIT press, 2018.

Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136, 2018.

Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Hybrid models with deep and invertible features. arXiv preprint arXiv:1902.02767, 2019.

Olah, C., Mordvintsev, A., and Schubert, L. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.

Oliver, A., Odena, A., Raffel, C. A., Cubuk, E. D., and Goodfellow, I. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pp. 3235–3246, 2018.

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in neural information processing systems, pp. 2234–2242, 2016.

Song, Y., Meng, C., and Ermon, S. Mintnet: Building invertible neural networks with masked convolutions. In Advances in Neural Information Processing Systems, pp. 11002–11012, 2019.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204, 2017.

Verma, V., Lamb, A., Kannala, J., Bengio, Y., and Lopez-Paz, D. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825, 2019.

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation for consistency training, 2020. URL https://openreview.net/forum?id=ByeL1R4FvS.

Xu, W., Sun, H., Deng, C., and Tan, Y. Variational autoencoder for semi-supervised text classification. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328, 2004.

A. Expectation Maximization

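As an alternative to direct optimization of the likelihood (4), we consider the Expectation-Maximization (EM) algorithm. EM is a popular approach for finding maximum likelihood estimates in mixture models. Suppose $X = \{x_i\}_{i=1}^{n}$ is the observed dataset, $T = \{t_i\}_{i=1}^{n}$ are the corresponding unobserved latent variables (often denoting the component in a mixture model), and $\theta$ is a vector of model parameters. The EM algorithm alternates between two steps: in the E-step, we compute the posterior probabilities of the latent variables for each data point, $q(t_i \mid x_i) = P(t_i \mid x_i, \theta)$; in the M-step, we fix $q$ and maximize the expected log likelihood of the data and latent variables with respect to $\theta$: $\mathbb{E}_{q} \log P(X, T \mid \theta) \to \max_{\theta}$. The algorithm can be easily adapted to the semi-supervised setting where a subset of the data is labeled with $\{y_i\}_{i=1}^{n_\ell}$: in the E-step we then have a hard assignment to the true mixture component, $q(t_i \mid x_i) = \mathbb{I}[t_i = y_i]$, for labeled data points.

EM is applicable to fitting the transformed mixture of Gaussians. Under equal class weights, we can perform the exact E-step for unlabeled data in the model since

$$q(t \mid x) = \frac{\mathcal{N}(f_\theta(x) \mid \mu_t, \Sigma_t)\,\left|\det \frac{\partial f_\theta}{\partial x}\right|}{\sum_{k} \mathcal{N}(f_\theta(x) \mid \mu_k, \Sigma_k)\,\left|\det \frac{\partial f_\theta}{\partial x}\right|} = \frac{\mathcal{N}(f_\theta(x) \mid \mu_t, \Sigma_t)}{\sum_{k} \mathcal{N}(f_\theta(x) \mid \mu_k, \Sigma_k)},$$

which coincides with the E-step of the EM algorithm on a Gaussian mixture model, because the Jacobian term is shared by all components and cancels. In the M-step, the objective has the following form:

$$\sum_{i=1}^{n} \mathbb{E}_{q(t_i \mid x_i)} \log\left[\mathcal{N}(f_\theta(x_i) \mid \mu_{t_i}, \Sigma_{t_i})\,\left|\det \frac{\partial f_\theta}{\partial x_i}\right|\right] \to \max_{\theta}.$$

Since the exact solution is not tractable due to the complexity of the flow model, we perform a stochastic gradient step to optimize the expected log likelihood with respect to the flow parameters $\theta$.

Note that unlike the regular EM algorithm for mixture models, the Gaussian mixture parameters $\{(\mu_k, \Sigma_k)\}$ are fixed in our experiments, and in the M-step the update of $\theta$ induces a change of the latent space representations $z_i = f_\theta(x_i)$.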

Using the EM algorithm for optimization in the semi-supervised setting on the MNIST dataset with 1000 labeled images, we obtain 98.97% accuracy, which is comparable to the result for FlowGMM with regular SGD training. However, in our experiments we observed that in the E-step the assignment for unlabeled points becomes effectively hard (the posterior is approximately 1 for one of the classes) because of the high dimensionality of the problem (see Section 6.1), which affects the M-step objective and hinders training.
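The E-step described in this appendix can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes identity covariances and equal class weights, and the function name `e_step` and the convention of marking unlabeled points with `-1` are hypothetical. Because the Jacobian term is shared across components, only the latent codes are needed.

```python
import numpy as np

def e_step(z, means, labels=None):
    """Posterior q(t | x) over mixture components for latent codes z = f(x).

    z:      (n, D) latent representations
    means:  (K, D) component means; identity covariances and equal
            class weights are assumed (illustrative choice)
    labels: optional (n,) int array, -1 marks an unlabeled point
    """
    # log N(z | mu_k, I) up to a constant shared by every component
    sq_dist = ((z[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    log_p = -0.5 * sq_dist
    # Subtract the row maximum before exponentiating for numerical stability
    log_p -= log_p.max(axis=1, keepdims=True)
    q = np.exp(log_p)
    q /= q.sum(axis=1, keepdims=True)
    if labels is not None:
        labeled = labels >= 0
        # Hard assignment to the true component for labeled points
        q[labeled] = np.eye(means.shape[0])[labels[labeled]]
    return q
```

In high dimensions the squared distances differ by large amounts, so the rows of `q` for unlabeled points tend toward one-hot vectors, which is the effectively hard assignment noted above.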

B. Latent Distribution Mean and Covariance Choices

Initialization In our experiments, we draw the mean vectors of the Gaussian mixture model randomly from the standard normal distribution $\mathcal{N}(0, I)$, and set the covariance matrices to identity for all classes; we fixed the GMM parameters throughout training. However, one could potentially benefit from data-dependent placing of the means in the latent space. We experimented with different initialization methods, in particular, initializing each mean at the mean point of the latent representations of the labeled data in that class: $\mu_k = \frac{1}{n_\ell^{k}} \sum_{i=1}^{n_\ell^{k}} f(x_i^{(k)})$, where $x_i^{(k)}$ represents the labeled data points from class $k$ and $n_\ell^{k}$ is the total number of labeled points in that class. In addition, we can scale all means by a scalar value to increase or decrease the distances between them. We observed that such initialization leads to much faster convergence of FlowGMM on semi-supervised classification on the MNIST dataset; however, the final performance of the model was worse than with random mean placing. We hypothesize that the flow model warms up faster with data-dependent initialization because the Gaussian means are closer to the initial latent representations, but afterwards the model gets stuck in a suboptimal solution.
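As a rough illustration of the data-dependent initialization discussed here, the sketch below places each class mean at the average latent code of the labeled points of that class. The function name and the `scale` argument are hypothetical, and the latent codes are assumed to have been computed by the (untrained) flow beforehand.

```python
import numpy as np

def init_means_from_labeled(latents, labels, num_classes, scale=1.0):
    """Place each class mean at the average latent code of its labeled points.

    latents: (n, D) array of f(x) for the labeled inputs
    labels:  (n,) integer class labels in [0, num_classes)
    scale:   hypothetical scalar that stretches or shrinks all means
    """
    means = np.stack([
        latents[labels == k].mean(axis=0) for k in range(num_classes)
    ])
    return scale * means
```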

GMM training FlowGMM would become even more flexible and expressive if we could learn the Gaussian mixture parameters in a principled way. In the current setup where means are sampled from the standard normal distribution, the distances between mixture components are about $\sqrt{2D}$, where D is the dimensionality of the data (see Appendix G). Thus, classes are quite far apart from each other in the latent space, which, as observed in Section 6.1, leads to model miscalibration. Training the GMM parameters can further increase the interpretability of the learned latent space representations: we can imagine a scenario in which some of the classes are very similar or even intersecting, and it would be useful to represent this in the latent space. We could train the GMM by directly optimizing the likelihood (4), or by using expectation maximization (see Appendix A), either jointly with the flow parameters or by iteratively switching between training the flow parameters with the GMM fixed and training the GMM with the flow fixed. In our initial experiments on semi-supervised classification on MNIST, training the GMM jointly with the flow parameters did not improve performance or lead to a substantial change of the latent representations. Further improvements require careful hyper-parameter choice, which we leave for future work.

Table 5. Tuned learning rates for 3-Layer NN + Dropout, Π-model, and our method on text and tabular tasks. For kNN we report the number of neighbours. All hyper-parameters were tuned via cross-validation.

C. Synthetic Experiments

In Figure 4 we visualize the classification decision boundaries of FlowGMM as well as the learned mapping to the latent space and generated samples for three different synthetic datasets.

D. Tabular data preparation and hyperparameters

The AG-News and Yahoo Answers datasets were constructed by applying BERT embeddings to the text input, yielding a 768-dimensional vector for each data point. AG-News has 4 classes while Yahoo Answers has 10. The UCI datasets Hepmass and Miniboone were constructed using the data preprocessing from Papamakarios et al. (2017), but with the inclusion of the removed background process class so that the two problems can be used for binary classification. We then subsample the background class examples so that each dataset is balanced. For each of the datasets, a separate validation set of size 5k was used to tune hyperparameters. All neural network models use the Adam optimizer (Kingma & Ba, 2014).

k-Nearest Neighbors: We tested both the L2 distance and the L2 distance with inputs normalized to unit norm, and the latter performed best. The value of k was found by a sweep, and the optimal values for each of the datasets are shown in Table 5.
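A minimal sketch of the kNN baseline with unit-normalized inputs is given below. It is illustrative only: the function name and the majority-vote tie-breaking are assumptions, and it relies on the identity that for unit vectors the squared L2 distance equals 2 minus twice the cosine similarity.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    """Majority vote among the k nearest training points after unit-normalizing."""
    def normalize(a):
        return a / np.linalg.norm(a, axis=1, keepdims=True)

    tr, qu = normalize(train_x), normalize(query_x)
    # For unit vectors, squared L2 distance is 2 - 2 * cosine similarity
    sq_dist = 2.0 - 2.0 * qu @ tr.T
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    votes = train_y[nearest]
    # Ties are broken toward the smallest class index (assumption)
    return np.array([np.bincount(v).argmax() for v in votes])
```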

3-Layer NN + Dropout: The 3-Layer NN + Dropout baseline network has three fully connected hidden layers with inner dimension k = 512, ReLU nonlinearities, and dropout with p = 0.5. We use the same learning rate for training the supervised baseline across all datasets.

Π-Model: The Π-Model uses the same network architecture, with dropout providing the perturbations. The additional consistency loss per unlabeled data point is computed as $\|g(x) - g'(x)\|^2$, where $g(x)$ and $g'(x)$ are the output probabilities after the softmax layer of the neural network under two independent dropout masks, scaled by the consistency weight that worked best across the datasets. The model was trained for 50 epochs, with one setting of the labeled and unlabeled batch sizes for AG-News and Yahoo Answers and another for Hepmass and Miniboone.
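The consistency term of this baseline can be illustrated as follows. The sketch assumes the standard formulation of the loss, a squared difference between the softmax outputs of two stochastic forward passes on the same input; the function names and the `weight` argument are hypothetical, and the logits are taken as given rather than produced by a real network.

```python
import numpy as np

def softmax(logits):
    # Shift by the row maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def pi_model_consistency(logits_a, logits_b, weight=1.0):
    """Weighted mean squared difference between two noisy softmax outputs."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    per_point = ((p_a - p_b) ** 2).sum(axis=-1)
    return weight * per_point.mean()
```

The term is zero when both passes agree and grows as dropout noise changes the predicted class probabilities.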

Label Spreading: We use the local and global consistency method from Zhou et al. (2004), $Y^{*} = (I - \alpha S)^{-1} Y$, where in our case $Y$ is the matrix of labels for the labeled, unlabeled, and test data, filled with zeros for unlabeled and test. $S = D^{-1/2} W D^{-1/2}$ is computed from the affinity matrix $W$, with $D$ the diagonal matrix of its row sums; the distance used inside the affinity is equivalent to the L2 distance on the inputs normalized to unit magnitude. Because the algorithm scales poorly with the number of unlabeled points for dense affinity matrices, we subsampled the unlabeled data to 10k points and the test data to 5k points for this graph method. However, we also evaluate the label spreading algorithm with a sparse kNN affinity matrix, using a larger subset of 20k unlabeled data points. The two hyperparameters for label spreading were tuned by a separate grid search for each of the datasets. In both cases, we use the inductive variant of the algorithm where the test data is not included in the unlabeled data.
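For concreteness, the closed-form solution of Zhou et al. (2004) can be sketched with a dense affinity matrix as below. This is an illustrative sketch, not the code used for the experiments: the function name, the default `alpha`, and the assumption that every node has positive degree are choices made here.

```python
import numpy as np

def label_spreading(W, Y, alpha=0.9):
    """Closed-form local-and-global-consistency labels, F = (I - alpha S)^-1 Y.

    W: (n, n) symmetric affinity matrix with zero diagonal; every row is
       assumed to have a positive sum
    Y: (n, K) one-hot labels, all-zero rows for unlabeled points
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # D^-1/2 W D^-1/2
    F = np.linalg.solve(np.eye(len(W)) - alpha * S, Y)
    return F.argmax(axis=1)
```

Solving the linear system directly costs cubic time in the number of points, which is consistent with the need to subsample for dense affinity matrices.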

FlowGMM: We train our FlowGMM model with a RealNVP normalizing flow, similar to the architectures used in Papamakarios et al. (2017). Specifically, the model uses 7 coupling layers, with 1 hidden layer each and 256 hidden units for the UCI datasets but 512 for text classification. UCI models were trained for 50 epochs of unlabeled data and the text datasets were trained for 30 epochs of unlabeled data. The labeled and unlabeled batch sizes are the same as in the Π-Model baseline.
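To make the building block concrete, here is a minimal sketch of a single RealNVP-style affine coupling layer on vectors. It is not the architecture used in the experiments: a single random linear map stands in for the hidden-layer network, and the class name and the `tanh` bound on the log-scale are illustrative choices.

```python
import numpy as np

class AffineCoupling:
    """One RealNVP-style coupling layer on D-dimensional vectors.

    The first half of the input passes through unchanged and parameterizes
    a scale and shift for the second half. A single random linear map
    stands in for the small MLP used in practice (illustrative only).
    """

    def __init__(self, dim, rng):
        self.d = dim // 2
        self.W_s = 0.1 * rng.standard_normal((self.d, dim - self.d))
        self.W_t = 0.1 * rng.standard_normal((self.d, dim - self.d))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        log_s = np.tanh(x1 @ self.W_s)          # bounded log-scale
        z2 = x2 * np.exp(log_s) + x1 @ self.W_t
        log_det = log_s.sum(axis=1)             # log |det Jacobian|
        return np.concatenate([x1, z2], axis=1), log_det

    def inverse(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        log_s = np.tanh(z1 @ self.W_s)
        x2 = (z2 - z1 @ self.W_t) * np.exp(-log_s)
        return np.concatenate([z1, x2], axis=1)
```

Stacking several such layers while alternating which half is transformed gives an invertible map with a cheap exact log-determinant, which is what the likelihood computation requires.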

The tuned learning rates for each of the models that we used for these experiments are shown in Table 5.

E. Image data preparation and hyperparameters

Figure 4. Illustration of FlowGMM on synthetic datasets: two circles (top row), eight Gaussians (middle row) and pinwheel (bottom row). (a): Data distribution and classification decision boundaries. Unlabeled data are shown with blue circles and labeled data are shown with colored triangles, where color represents the class. Background color visualizes the classification decision boundaries of FlowGMM. (b): Mapping of the data to the latent space. (c): Gaussian mixture in the latent space. (d): Samples from the learned generative model corresponding to different classes, as shown by their color.

Figure 5. Left: Log likelihoods on in- and out-of-domain data for our model trained on MNIST. Center: Log likelihoods on in- and out-of-domain data for our model trained on FashionMNIST. Right: MNIST digits get mapped onto the sandal mode of the FashionMNIST model 75% of the time, often being assigned higher likelihood than elements of the original sandal class. Representative elements are shown above.

We use the CIFAR-10 multi-scale architecture with 2 scales, each containing 3 coupling layers defined by 8 residual blocks with 64 feature maps. We use the Adam optimizer (Kingma & Ba, 2014) with one learning rate for CIFAR-10 and SVHN and another for MNIST. We train the supervised model for 100 epochs, and the semi-supervised models for 1000 passes through the labeled data for CIFAR-10 and SVHN and 3000 passes for MNIST. We use a batch size of 64 and sample 32 labeled and 32 unlabeled data points in each mini-batch. For the consistency loss term (7), we linearly increase the weight from 0 to 1 for the first 100 epochs following Athiwaratkun et al. (2019). For FlowGMM and FlowGMM-cons, we re-weight the loss on labeled data by a constant factor (value tuned on validation (Kingma et al., 2014) on CIFAR-10), as otherwise we observed that the method underfits the labeled data.
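The linear ramp-up of the consistency weight can be written as a small helper. The function name and argument names are hypothetical; only the schedule itself (0 to 1 over the first 100 epochs) comes from the text.

```python
def consistency_weight(epoch, ramp_epochs=100, max_weight=1.0):
    """Linearly increase the weight from 0 to max_weight over ramp_epochs."""
    if epoch >= ramp_epochs:
        return max_weight
    return max_weight * epoch / ramp_epochs
```

The returned value would multiply the consistency term before it is added to the likelihood objective at each epoch.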

F. Out-of-domain data detection

Density models have held promise for being able to detect out-of-domain data, an especially important task for robust machine learning systems (Nalisnick et al., 2019). Recently, it has been shown that existing flow and autoregressive density models are not as apt at this task as previously thought, yielding high likelihood on images coming from other (simpler) distributions. The conclusion put forward is that datasets like SVHN are encompassed by, or have roughly the same mean but lower variance than, more complex datasets like CIFAR-10 (Nalisnick et al., 2018). We examine this hypothesis in the context of our flow model which has a multi-modal latent space distribution unlike methods considered in Nalisnick et al. (2018).

Using a fully supervised model trained on MNIST, we evaluate the log likelihood for data points coming from the NotMNIST dataset, consisting of letters instead of digits, and the FashionMNIST dataset. We then train a supervised model on the more complex dataset FashionMNIST and evaluate on MNIST and NotMNIST. The distribution of the log likelihood on these datasets is shown in Figure 5. For the model trained on MNIST we see that the data from FashionMNIST and NotMNIST is assigned lower likelihood, as expected. However, the model trained on FashionMNIST predicts higher likelihoods for MNIST images. The majority (75%) of the MNIST data points get mapped into the mode of the FashionMNIST model corresponding to sandals, which is the class with the largest fraction of pixels that are zero. Similarly, for the model trained on MNIST the image of all zeros has very high likelihood and gets mapped to the mode corresponding to the digit 1, which has the largest fraction of empty space.

G. Expected Distances between Gaussian Samples

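Consider two Gaussians with means sampled independently from the standard normal, $\mu_1, \mu_2 \sim \mathcal{N}(0, I)$, in D-dimensional space. If $s_1 \sim \mathcal{N}(\mu_1, I)$ is a sample from the first Gaussian, then its expected squared distances to both mixture means are:

$$\mathbb{E}\,\|s_1 - \mu_1\|^2 = D, \qquad \mathbb{E}\,\|s_1 - \mu_2\|^2 = \mathbb{E}\,\|s_1 - \mu_1\|^2 + \mathbb{E}\,\|\mu_1 - \mu_2\|^2 = D + 2D = 3D,$$

where the cross term vanishes because $s_1 - \mu_1$, $\mu_1$, and $\mu_2$ are independent and zero-mean.

For high-dimensional Gaussians the random variables $\|s_1 - \mu_1\|^2$ and $\|s_1 - \mu_2\|^2$ will be concentrated around their expectations. Since the function $\exp(-x)$ decreases rapidly to zero for positive x, the probability of $s_1$ belonging to the first Gaussian

$$\frac{\exp\left(-\tfrac{1}{2}\|s_1 - \mu_1\|^2\right)}{\exp\left(-\tfrac{1}{2}\|s_1 - \mu_1\|^2\right) + \exp\left(-\tfrac{1}{2}\|s_1 - \mu_2\|^2\right)} \approx \frac{1}{1 + \exp(-D)}$$

saturates at 1 with the growth of dimensionality D.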
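The argument in this appendix can be checked numerically with a short Monte Carlo sketch. The function name, trial count, and seed are arbitrary choices; the posterior is computed under equal class weights and identity covariances, so it depends only on half the difference of the two squared distances.

```python
import numpy as np

def expected_sq_distances(dim, n_trials=2000, seed=0):
    """Estimate E||s1 - mu1||^2, E||s1 - mu2||^2 and the mean posterior."""
    rng = np.random.default_rng(seed)
    mu1 = rng.standard_normal((n_trials, dim))
    mu2 = rng.standard_normal((n_trials, dim))
    s1 = mu1 + rng.standard_normal((n_trials, dim))   # s1 ~ N(mu1, I)
    d1 = ((s1 - mu1) ** 2).sum(axis=1)
    d2 = ((s1 - mu2) ** 2).sum(axis=1)
    # Posterior of the first component: equal weights, identity covariance
    p_first = 1.0 / (1.0 + np.exp(-0.5 * (d2 - d1)))
    return d1.mean(), d2.mean(), p_first.mean()
```

Calling `expected_sq_distances(100)` should return averages close to D and 3D with a mean posterior near 1, while a low dimension such as `dim=1` leaves the posterior visibly below 1.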

H. FlowGMM as generative model

Figure 6. Visualizations of the latent space representations learned by supervised FlowGMM on MNIST. (a): Images corresponding to means of the Gaussians corresponding to different classes. (b): Class-conditional samples from the model at a reduced temperature T = 0.25.

In Figure 6a we show the images corresponding to the means of the Gaussians representing each class. We see that the flow correctly learns to map the means to samples from the corresponding classes. Next, in Figure 6b we show class-conditional samples from the model. To produce a sample from class $i$, we first generate a latent vector $z$ from the Gaussian centered at $\mu_i$ with its spread scaled by $T$, where $T$ is a temperature parameter that controls the trade-off between sample quality and diversity; we then compute the samples as $x = f^{-1}(z)$. We set $T = 0.25$ to produce the samples in Figure 6b. As we can see, FlowGMM can produce reasonable class-conditional samples while simultaneously achieving a high classification accuracy (99.63%) on the MNIST dataset.
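The sampling procedure can be sketched as follows. This is an illustration under stated assumptions: the covariance is taken to be identity, the temperature is assumed to multiply the standard deviation (the exact scaling used for the figure may differ), and `inverse_flow` is a hypothetical callable standing in for the inverse of the trained flow.

```python
import numpy as np

def sample_class(inverse_flow, mean, temperature, n_samples, rng):
    """Draw z around the class mean with reduced spread, then map to data space.

    Assumes identity covariance and that the temperature multiplies the
    standard deviation; the exact scaling in the paper may differ.
    """
    z = mean + temperature * rng.standard_normal((n_samples, mean.shape[0]))
    return inverse_flow(z)
```

With `temperature=0` every sample collapses to the image of the class mean, which corresponds to the mean images in Figure 6a; larger values trade sample quality for diversity.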