Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN)

2020·arXiv

ABSTRACT

A central question of representation learning asks under which conditions it is possible to reconstruct the true latent variables of an arbitrarily complex generative process. Recent breakthrough work by Khemakhem et al. (2019) on nonlinear ICA has answered this question for a broad class of conditional generative processes. We extend this important result in a direction relevant for application to real-world data. First, we generalize the theory to the case of unknown intrinsic problem dimension and prove that in some special (but not very restrictive) cases, informative latent variables will be automatically separated from noise by an estimating model. Furthermore, the recovered informative latent variables will be in one-to-one correspondence with the true latent variables of the generating process, up to a trivial component-wise transformation. Second, we introduce a modification of the RealNVP invertible neural network architecture (Dinh et al., 2016) which is particularly suitable for this type of problem: the General Incompressible-flow Network (GIN). Experiments on artificial data and EMNIST demonstrate that theoretical predictions are indeed verified in practice. In particular, we provide a detailed set of exactly 22 informative latent variables extracted from EMNIST.

1 INTRODUCTION

Deep latent-variable models promise to unlock the key factors of variation within a dataset, opening a window to interpretation and granting the power to manipulate data in an intuitive fashion. The theory of identifiability in linear independent component analysis (ICA) (Comon, 1994) tells us when this is possible, if we restrict the model to a linear transformation, but until recently there was no corresponding theory for the highly nonlinear models needed to manipulate complex data. This changed with the recent breakthrough work by Khemakhem et al. (2019), which showed that under relatively mild conditions, it is possible to recover the joint data and latent space distribution, up to a simple transformation in the latent space. The key requirement is that the generating process is conditioned on a variable which is observed along with the data. This condition could be a class label, time index of a time series, or any other piece of information additional to the data. They interpret their theory as a nonlinear version of ICA.

This work extends this theory in a direction relevant for application to real-world data. The existing theory assumes knowledge of the intrinsic problem dimension, but this is unrealistic for anything but artificially generated datasets. Here, we show that in the special case of Gaussian latent space distributions, the intrinsic problem dimension can be discovered. The important latent variables are organically separated from noise variables by the estimating model. Furthermore, the variables discovered correspond to the true generating latent variables, up to a trivial component-wise translation and scaling. Very similar results exist for other members of the exponential family with two parameters, such as the beta and gamma distributions.

We introduce a variant of the RealNVP (Dinh et al., 2016) invertible neural network: the General Incompressible-flow Network (GIN). The flow is called incompressible in reference to fluid dynamics, since it preserves volumes: the Jacobian determinant is simply unity. We emphasise its generality and increased expressive power in comparison to previous volume-preserving flows, such as NICE (Dinh et al., 2014). As already noted in Khemakhem et al. (2019), flow-based generative models are a natural fit for the theory of nonlinear ICA, as are the variational autoencoders (VAEs) (Kingma & Welling, 2013) used in that work. For us, major advantages of invertible architectures over VAEs are the ability to specify volume preservation and directly optimize the likelihood, and freedom from the requirement to specify the dimension of the model’s latent space. An invertible neural network (INN) always has a latent space of the same dimension as the data. In addition, the forward and backward models share parameters, saving the effort of learning separate models for each direction.

In summary, our work makes the following contributions:

• We extend the theory of nonlinear ICA to allow for unknown intrinsic problem dimension. Doing so, we find that this dimension can be discovered and a one-to-one correspondence between generating and estimated latent variables established.

• We propose as an implementation an invertible neural network obtained by modifying the RealNVP architecture. We call our new architecture GIN: the General Incompressible-flow Network.

• We demonstrate the viability of the model on artificial data and the EMNIST dataset. We extract 22 meaningful variables from EMNIST, encoding both global and local features.

2 RELATED WORK

The basic goals of nonlinear ICA stem from the original work on linear ICA. An influential formulation, as well as the first identifiability results, were given in Comon (1994). These stated the conditions which allow the generating latent variables to be discovered, when the mixing function is a linear transformation. However, it was shown in Hyvärinen & Pajunen (1999) that this approach to identifiability does not extend to general nonlinear functions.

The first identifiability results in nonlinear ICA came in Hyvärinen & Morioka (2016) and Hyvärinen & Morioka (2017), applied to time series, and implemented via a discriminative model and semisupervised learning. A more general formulation, valid for other forms of data, was given in Hyvärinen et al. (2018) and the theory was extended to generative models in Khemakhem et al. (2019), where experiments were implemented by a VAE.

Many authors have addressed the general problem of disentanglement, and proposed models to learn disentangled features. Prominent among these are β-VAE (Higgins et al., 2017) and its variations (e.g. Chen et al., 2018), which augment the standard ELBO loss with tunable hyperparameters to encourage disentanglement. There are also attempts to modify the GAN framework (Goodfellow et al., 2014) such as InfoGAN (Chen et al., 2016), which tries to maximize the mutual information between some dimensions of the latent space and the observed data. Many of these approaches are unsupervised. However, as pointed out and empirically demonstrated in Locatello et al. (2018), unsupervised models without conditioning in the latent space are in general unidentifiable.

Several unsupervised VAE models implement conditioning in the latent space by means of Gaussian mixtures (Johnson et al., 2016; Dilokthanakul et al., 2016; Zhao et al., 2019). Our work differs mainly by (i) only considering supervised tasks, therefore being safely covered by the theory of Khemakhem et al. (2019), and (ii) enforcing volume-preservation, not possible in a VAE.

The invertible neural networks in this work build upon the NICE framework (Dinh et al., 2014) and its extension in RealNVP (Dinh et al., 2016). A similar network design to ours is Ardizzone et al. (2019). This is a conditioned INN based on RealNVP; however, the conditioning information is applied as a parameter of the network. The authors find in experiments with MNIST that meaningful variables are present in their latent space, but are rotated such that they are not aligned with the axes of the space. In this work, the conditioning is only present as a parameter of the latent space distributions. As a result, it is covered by the theory of Khemakhem et al. (2019) and its extension here, which results in non-rotated, meaningful latent variables.

3 THEORY

3.1 EXISTING THEORY OF NONLINEAR ICA

This section adapts theoretical results from Khemakhem et al. (2019) to the context of invertible neural networks. Suppose the existence of the following three random variables: a latent generating variable $z \in \mathbb{R}^n$, a condition $u$ and a data point $x \in \mathbb{R}^d$, where $n \le d$. If $n < d$, also suppose the existence of a noise variable $\varepsilon \in \mathbb{R}^{d-n}$. The variables $z$ and $\varepsilon$ make up the generating latent space, and are not necessarily known, whereas $u$ and $x$ are observable and therefore always known.

Figure 1: Relationship between the variables and functions defined in Sec. 3.1, as well as indication of the training scheme.

The distribution of $z$ is a factorial member of the exponential family with $k$ sufficient statistics, conditioned on $u$. In its most general form the distribution can be written as

$$p_z(z \mid u) = \prod_{i=1}^{n} \frac{Q_i(z_i)}{Z_i(u)} \exp\left[\sum_{j=1}^{k} T_{i,j}(z_i)\,\lambda_{i,j}(u)\right] \qquad (1)$$

where the $T_{i,j}$ are the sufficient statistics, the $\lambda_{i,j}$ their coefficients and $Z_i$ the normalizing constant. $Q_i$ is called the base measure, which in many cases is simply 1.

The distribution of the noise variable $\varepsilon$ must not depend on $u$ and its probability density function must be always finite.
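As a worked instance of this exponential-family form, consider a univariate Gaussian, the case treated in Sec. 3.2. The symbols $\mu_i(u)$ and $\sigma_i^2(u)$ for the conditional mean and variance are notation assumed here for illustration. With $k = 2$ sufficient statistics and a constant base measure, the density can be rewritten as follows:

```latex
\begin{align*}
p(z_i \mid u)
  &= \frac{1}{\sqrt{2\pi\sigma_i^2(u)}}
     \exp\left[-\frac{(z_i - \mu_i(u))^2}{2\sigma_i^2(u)}\right] \\
  &= \frac{Q_i(z_i)}{Z_i(u)}
     \exp\left[T_{i,1}(z_i)\,\lambda_{i,1}(u) + T_{i,2}(z_i)\,\lambda_{i,2}(u)\right], \\
\text{with}\quad
T_{i,1}(z_i) &= z_i, \qquad \lambda_{i,1}(u) = \frac{\mu_i(u)}{\sigma_i^2(u)}, \\
T_{i,2}(z_i) &= z_i^2, \qquad \lambda_{i,2}(u) = -\frac{1}{2\sigma_i^2(u)}, \\
Z_i(u) &= \sqrt{2\pi\sigma_i^2(u)}\,\exp\left[\frac{\mu_i^2(u)}{2\sigma_i^2(u)}\right],
\qquad Q_i(z_i) = 1.
\end{align*}
```

Expanding the square in the first line and collecting the terms in $z_i$ and $z_i^2$ gives the second line directly.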

The variable $x$ is the result of an arbitrarily complex, invertible, deterministic transformation from the generating latent space to the data space: $x = f(z, \varepsilon)$, where $\varepsilon$ denotes the noise variable. This can alternatively be formulated as an injective, non-deterministic transformation from the lower-dimensional $z$-space to the higher-dimensional $x$-space: $x = \tilde{f}(z; \varepsilon)$.

In general, an observed dataset $\mathcal{D}$ will consist only of instances of $x$ and $u$. The task of nonlinear ICA is to disentangle the data to recover the generating latent variables $z$, as well as the form of the function $f$ and its inverse. We can try to achieve this with a sufficiently general, invertible function approximator $g$, which maps from a latent variable $w$ to the data space: $x = g(w; \theta)$. Here $\theta$ denotes the parameters of $g$. Note that the dimensions of the latent space and data space are the same due to the invertibility of $g$. We assume that $w$ follows a conditionally independent exponential family distribution, conditioned on the same condition $u$ as $z$:

$$p_w(w \mid u) = \prod_{i=1}^{d} \frac{Q'_i(w_i)}{Z'_i(u)} \exp\left[\sum_{j=1}^{k} T'_{i,j}(w_i)\,\lambda'_{i,j}(u)\right] \qquad (2)$$

The coefficients $\lambda'_{i,j}$ of the sufficient statistics are not restricted, which means that a given variable $i$ of the estimated latent space is allowed to lose its dependence on $u$. If this occurs, we will consider this variable to encode noise, playing a role equivalent to the noise variable $\varepsilon$ in the generating latent space.

In addition to this specification of the generative and estimating models, some conditions are necessary to ensure the latent variables can be recovered. The most important of these concerns the variability of the coefficients $\lambda_{i,j}$ under $u$. However, as long as the $\lambda_{i,j}(u)$ are randomly and independently generated, and there are at least $nk + 1$ distinct conditions $u$, this condition is almost surely fulfilled. See Appendix A.1 for further details.
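To illustrate the variability condition, the following minimal sketch assumes it takes the form used in Khemakhem et al. (2019): the $nk$ difference vectors $\lambda(u_j) - \lambda(u_0)$ must be linearly independent. The Gaussian parameterization, the sampling ranges and all function names are illustrative assumptions, not part of the method.

```python
import numpy as np


def gaussian_natural_params(means, variances):
    """Stack the Gaussian coefficients (mu / sigma^2, -1 / (2 sigma^2)).

    means, variances: arrays of shape (num_conditions, n).
    Returns an array of shape (num_conditions, 2 * n).
    """
    lam1 = means / variances
    lam2 = -0.5 / variances
    return np.concatenate([lam1, lam2], axis=1)


def variability_holds(lam):
    """Check that the differences lambda(u_j) - lambda(u_0) span the full space."""
    diffs = lam[1:] - lam[0]  # shape (num_conditions - 1, n * k)
    return bool(np.linalg.matrix_rank(diffs) == lam.shape[1])


rng = np.random.default_rng(0)
n, k = 2, 2
num_conditions = n * k + 1  # five conditions
# Illustrative sampling ranges for the per-condition means and variances.
means = rng.uniform(-5.0, 5.0, size=(num_conditions, n))
variances = rng.uniform(0.5, 3.0, size=(num_conditions, n))

lam = gaussian_natural_params(means, variances)
enough = variability_holds(lam)       # five conditions
too_few = variability_holds(lam[:3])  # only three conditions
```

With randomly drawn parameters, five conditions suffice for $n = 2$ and $k = 2$, whereas three conditions yield only two difference vectors and cannot span the four-dimensional coefficient space.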

In the limit of infinite data and perfect convergence, the estimating model will give the same conditional likelihood to all data points as the true generating model:

$$p_{T,\lambda,f}(x \mid u) = p_{T',\lambda',g}(x \mid u) \qquad (3)$$

If this is the case, the vector of sufficient statistics $T$ from the generating latent space will be related to that of the estimated latent space, $T'$, by an affine transformation:

$$T(z) = A\,T'(w) + c \qquad (4)$$

where $A$ is some constant, full-rank $nk \times dk$ matrix and $c \in \mathbb{R}^{nk}$ some constant vector. The relationship holds for all values of $z$ and $w$. The proof of this result can be found in Appendix A.

This relationship is a generalization of that derived in Khemakhem et al. (2019). In that version, they assume knowledge of n, the dimension of the generating latent space, and give their estimating model a latent space of the same dimension. In this context the matrix A defines a relationship between latent spaces of the same dimension, making it square and invertible.

3.2 NONLINEAR ICA WITH A GAUSSIAN LATENT SPACE

Given the relationship in equation (4), we can ask whether any stronger results hold for particular special cases. We hope in particular to induce sparsity in the matrix A, so that the estimated latent space is related to the true one by as simple a transformation as possible. We show (proof in Appendix B) that when both the generating and estimated latent spaces follow a Gaussian distribution, the generating latent space variables are recovered up to a trivial translation and scaling. Furthermore, the dimension of the generating latent space is recovered.

More precisely, each generating latent variable $z_i$ is related to exactly one estimated latent variable $w_j$, for some $j$, as $z_i = a_i w_j + b_i$, for some constants $a_i$ and $b_i$. Furthermore, each estimated latent variable $w_j$ is related to at most one $z_i$. If the estimated latent space has higher dimension than the generating latent space, some estimated latent variables are not related to any generating latent variables and so must encode only noise. This is the dimension discovery mechanism, since the estimated latent space organically splits into informative and non-informative parts, with the dimension of the informative part equal to the unknown intrinsic dimension of the generating latent space. Very similar results can be derived for all common continuous two-parameter members of the exponential family, including the gamma and beta distributions. See Appendix C.
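A quick numerical check can make the Gaussian result concrete. Assuming the sufficient statistics of a Gaussian variable are $(v, v^2)$, a component-wise map $z = a w + b$ corresponds to one $2 \times 2$ block of $A$ and two entries of $c$ in the affine relationship between sufficient statistics. The values of `a` and `b` below are hypothetical.

```python
import numpy as np


def gaussian_stats(v):
    """Sufficient statistics (v, v^2) of a one-dimensional Gaussian variable."""
    return np.stack([v, v ** 2], axis=0)


a, b = 1.7, -0.4  # hypothetical scale and shift

# z = a w + b implies z^2 = a^2 w^2 + 2 a b w + b^2, which fixes A and c.
A = np.array([[a, 0.0],
              [2.0 * a * b, a ** 2]])
c = np.array([b, b ** 2])

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
z = a * w + b

lhs = gaussian_stats(z)                   # T(z), shape (2, 1000)
rhs = A @ gaussian_stats(w) + c[:, None]  # A T'(w) + c
max_error = np.abs(lhs - rhs).max()
```

The block is triangular with determinant $a^3$, so it has full rank whenever the scaling $a$ is nonzero.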

3.3 VOLUME PRESERVATION VIA INCOMPRESSIBLE FLOW

In a multivariate Gaussian distribution with diagonal covariance, the standard deviations are directly proportional to the principal axes of the ellipsoid defining a surface of constant probability. In classical PCA, this property is used to assign importance to each principal component based on its standard deviation. In PCA we can think of the rotation of the data into the basis defined by the principal components as the latent space. In a deep latent variable model the transformation between the latent space and the data is significantly more complex, but if this transformation preserves volumes like the rotation in PCA, it will retain the desirable correspondence between the standard deviation of a latent space variable and its importance in explaining the data. Because invertible neural networks (normalizing flows) have a tractable Jacobian determinant, they open the possibility to constrain it to unity. This is equivalent to volume preservation and is the rationale behind the General Incompressible-flow Network.
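The PCA analogy can be sketched numerically. The example below, using made-up data, checks that the PCA change of basis is orthogonal and hence volume-preserving, and that the standard deviations in the new basis recover the scale of each underlying direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3D Gaussian data with standard deviations 3, 1 and 0.1
# along three hidden orthogonal axes.
true_std = np.array([3.0, 1.0, 0.1])
hidden_axes, _ = np.linalg.qr(rng.normal(size=(3, 3)))
data = (rng.normal(size=(20000, 3)) * true_std) @ hidden_axes.T

# PCA: the eigenvectors of the covariance matrix define an orthogonal basis.
cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
latent = (data - data.mean(axis=0)) @ eigvecs

abs_det = abs(np.linalg.det(eigvecs))         # 1 for an orthogonal matrix
spectrum = np.sort(latent.std(axis=0))[::-1]  # importance of each latent variable
```

Because the transformation neither stretches nor shrinks volume, a small standard deviation in the latent space can only mean that the corresponding direction explains little of the data.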

4 EXPERIMENTS

Experiments on artificial datasets confirm the theory for a normally distributed latent space and identify potential causes of failure. Experiments on the EMNIST dataset (Cohen et al., 2017) demonstrate the ability of GIN to estimate independent and interpretable latent variables from real-world data.

4.1 MODEL DESCRIPTION

The GIN model is similar in form to RealNVP (Dinh et al., 2016) and shares its flexibility, but retains the volume-preserving properties of the NICE framework (Dinh et al., 2014).

Figure 2: Successful reconstruction by GIN of the two informative latent variables out of ten in total. The other eight are correctly identified as noise. The observed data is ten-dimensional (projection into two dimensions shown here). The spectrum shows the standard deviation of each variable of the reconstruction (in black, log scale) which quantifies its importance. Ground truth is in gray. There is a clear distinction between the two informative dimensions and the noise dimensions, showing that GIN has correctly detected a two-dimensional manifold in the ten-dimensional data presented to it.

RealNVP coupling layers split a $D$-dimensional input $x$ into two parts, $x_{1:d}$ and $x_{d+1:D}$, where $d < D$. The output of the layer is the concatenation of $y_{1:d}$ and $y_{d+1:D}$ with

$$\begin{aligned} y_{1:d} &= x_{1:d} \\ y_{d+1:D} &= x_{d+1:D} \odot \exp\big(s(x_{1:d})\big) + t(x_{1:d}) \end{aligned}$$

where addition, multiplication and exponentiation are applied component-wise. The logarithm of the Jacobian determinant of a coupling layer is simply the sum of the components of the scaling function $s(x_{1:d})$. Volume preservation is achieved by setting the final component of $s(x_{1:d})$ to the negative sum of the previous components, so the total sum of $s(x_{1:d})$ is zero (in contrast, NICE sets $s(x_{1:d}) \equiv 0$). Hence the Jacobian determinant is unity and volume is preserved. The network is free to allow the volume contribution of some dimensions of the output of any coupling layer to grow, but only by shrinking the other dimensions of the output in direct proportion. As well as enforcing a strong correspondence between the importance of a latent variable and its standard deviation, we believe volume-preservation has a regularizing effect, since it is a very strong constraint, comparable to orthonormality in linear transformations.
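The coupling layer described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, the single-hidden-layer subnetwork and its random weights are all hypothetical. The only essential step is that the last component of the log-scale is set to the negative sum of the others.

```python
import numpy as np


class GINCoupling:
    """Affine coupling layer whose log-scales sum to zero (volume-preserving)."""

    def __init__(self, dim, split, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.split = split
        self.out = dim - split
        # Hypothetical tiny subnetwork: one hidden layer producing
        # (out - 1) free log-scales and out translations.
        self.w1 = rng.normal(scale=0.5, size=(split, hidden))
        self.w2 = rng.normal(scale=0.5, size=(hidden, 2 * self.out - 1))

    def _s_and_t(self, x1):
        h = np.tanh(x1 @ self.w1) @ self.w2
        s_free, t = h[:, : self.out - 1], h[:, self.out - 1:]
        # The final component of s is the negative sum of the others, so sum(s) = 0.
        s = np.concatenate([s_free, -s_free.sum(axis=1, keepdims=True)], axis=1)
        return s, t

    def forward(self, x):
        x1, x2 = x[:, : self.split], x[:, self.split:]
        s, t = self._s_and_t(x1)
        y = np.concatenate([x1, x2 * np.exp(s) + t], axis=1)
        return y, s.sum(axis=1)  # second output is the log Jacobian determinant

    def inverse(self, y):
        y1, y2 = y[:, : self.split], y[:, self.split:]
        s, t = self._s_and_t(y1)
        return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=1)


rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))
layer = GINCoupling(dim=6, split=3)
y, log_det = layer.forward(x)
x_rec = layer.inverse(y)
```

The log-determinant is zero up to floating-point error for every input, even though individual output dimensions are rescaled, and the inverse recovers the input using the same parameters.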

4.2 OPTIMIZATION

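The experiments in this section deal with labeled data, where each data point belongs to one of $M$ different classes. This class label is used as the condition $u$ associated with each data point $x$. In the estimated latent space, all data instances with the same label should belong to the same Gaussian distribution. Hence we are learning a Gaussian mixture in the estimated latent space, with $M$ mixture components. Since the distribution for each class in the estimated latent space is required to be factorial, the covariance of each mixture component is diagonal, and we can write $\sigma_i^2(u)$ for the variance in the $i$-th dimension.

Given a set of data, condition pairs $\mathcal{D} = \{(x^{(1)}, u^{(1)}), \dots, (x^{(N)}, u^{(N)})\}$ and model $g$, parameterized by $\theta$ (where $g$ maps from the latent space to the data space), we can construct a loss from the log-likelihood. Using the change of variables formula, we have $\log p(x \mid u) = \log p(w \mid u)$, with no Jacobian term since the transformation $w = g^{-1}(x; \theta)$ is volume-preserving. To maximize the likelihood of $\mathcal{D}$, we minimize the negative log-likelihood of $w$ in the estimated latent space:

$$\mathcal{L}(\theta) = \mathbb{E}_{(x,u) \in \mathcal{D}}\left[\sum_{i} \left(\frac{\big(g_i^{-1}(x;\theta) - \mu_i(u)\big)^2}{2\sigma_i^2(u)} + \log \sigma_i(u)\right)\right]$$

where $\mu_i(u)$ denotes the mean of the mixture component for condition $u$ in the $i$-th dimension, the sum runs over all dimensions of the estimated latent space, and the constant term of the Gaussian log-density is dropped.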
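The negative log-likelihood described above can be sketched as follows. This is a minimal NumPy illustration: the function name and the array layout of the per-class means and standard deviations are assumptions, and the latent codes are taken as given rather than computed by a network.

```python
import numpy as np


def gin_nll(w, labels, mu, sigma):
    """Mean negative log-likelihood of latent codes under class-conditional Gaussians.

    w:         (N, d) latent codes, i.e. the network output for each data point
    labels:    (N,) integer class label (the condition u) of each data point
    mu, sigma: (M, d) per-class means and standard deviations
    The constant 0.5 * d * log(2 * pi) is dropped; it does not affect optimization.
    No Jacobian term appears because the network is volume-preserving.
    """
    m, s = mu[labels], sigma[labels]
    per_dim = 0.5 * ((w - m) / s) ** 2 + np.log(s)
    return per_dim.sum(axis=1).mean()


# Two classes in a two-dimensional latent space (made-up numbers).
mu = np.array([[0.0, 0.0], [1.0, 1.0]])
sigma = np.array([[1.0, 2.0], [0.5, 1.0]])
labels = np.array([0, 1])

loss_at_means = gin_nll(mu[labels], labels, mu, sigma)
loss_one_std_away = gin_nll(mu[labels] + sigma[labels], labels, mu, sigma)
```

Moving every latent coordinate one standard deviation away from its class mean adds exactly 0.5 per dimension to the loss, which is a convenient sanity check for an implementation.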

4.3 ARTIFICIAL DATA

In Experiment 1, samples are generated in two dimensions, conditioned on five different cluster labels (see Fig. 2). The means of the clusters are chosen independently from a uniform distribution and the variances from a uniform distribution on [0.5, 3]. This data is then concatenated with independent Gaussian noise in eight dimensions to make a ten-dimensional generating latent space, where only the first two variables are informative. The noise is scaled by 0.01 to be small in comparison to the informative dimensions. The latent space samples are then passed through a RealNVP network with 8 fully connected coupling blocks with randomly initialized weights to produce the observed data. This acts as a highly nonlinear mixing which can only be successfully treated with nonlinear methods.
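The construction of the generating latent space can be sketched as below. The function name, the number of samples per class and the interval used for the cluster means are illustrative assumptions. The nonlinear mixing through a randomly initialized RealNVP network is omitted, so the output is the latent space only, not the observed data.

```python
import numpy as np


def sample_generating_latents(num_per_class=2000, num_classes=5, seed=0,
                              mean_range=(-5.0, 5.0)):
    """Draw ten-dimensional generating latents and their class labels.

    mean_range is an illustrative assumption for the cluster means.
    """
    rng = np.random.default_rng(seed)
    means = rng.uniform(*mean_range, size=(num_classes, 2))
    variances = rng.uniform(0.5, 3.0, size=(num_classes, 2))
    labels = np.repeat(np.arange(num_classes), num_per_class)

    # Two informative dimensions: one diagonal Gaussian per class.
    z = means[labels] + np.sqrt(variances[labels]) * rng.normal(size=(labels.size, 2))
    # Eight noise dimensions, independent of the label and scaled by 0.01.
    noise = 0.01 * rng.normal(size=(labels.size, 8))
    return np.concatenate([z, noise], axis=1), labels


latents, labels = sample_generating_latents()
```

Only the first two columns depend on the label; the remaining eight play the role of the noise variable of Sec. 3.1.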

GIN is used as the estimating model, with 8 fully connected coupling blocks (full details in Appendix D). Training converges quickly and stably using the Adam optimizer (Kingma & Ba, 2014), with all optimizer settings other than the initial learning rate set to the usual recommendations. Batch size is 1,000 and the data is augmented with Gaussian noise at each iteration. After convergence of the loss, the learning rate is reduced by a factor of 10 and the model is trained again until convergence.

Over a number of experiments we made the following observations:

• The model converges stably and gives importance (quantified by standard deviation) to only two variables in the estimated latent space, provided there is sufficient overlap between the mixture components in the generating latent space.

• Where there is not enough overlap in the generating latent space, the model cannot recognize common variables across all the different classes, and tends to split one genuine dimension of variation into two or more in its estimated latent space. This appears to be a problem of finite data. We have observed that when this behaviour occurs, and if the gap between mixture components is not too large, it can be prevented by increasing the number of samples so that the space between the mixture components is better filled (see Fig. 6 and 7 in Appendix E). This is consistent with the theory, where equation (4) is true asymptotically. Since the latent space distributions are members of the exponential family, they have support across the entire domain of the latent space, hence gaps can never remain in the limit of infinite samples.

• Choice of learning rate is important. If the initial learning rate is too low, training gets stuck in bad local optima, where single true variables are split into several latent dimensions.
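The importance measure used in these observations, the standard deviation of each estimated latent variable, can be computed as in the sketch below. Computing it over all samples regardless of class, and locating the split at the largest drop in the sorted spectrum, are assumptions made for illustration; the split heuristic in particular is not a criterion taken from this work.

```python
import numpy as np


def latent_spectrum(w):
    """Return per-dimension standard deviations in descending order, with their indices."""
    std = w.std(axis=0)
    order = np.argsort(std)[::-1]
    return std[order], order


def count_informative(sorted_std, ratio=10.0):
    """Count the dimensions before the largest drop in the sorted spectrum.

    Heuristic assumption: the split is placed at the largest ratio between
    consecutive standard deviations, provided that ratio is at least `ratio`.
    Otherwise every dimension is counted as informative.
    """
    drops = sorted_std[:-1] / sorted_std[1:]
    split = int(np.argmax(drops))
    return split + 1 if drops[split] >= ratio else len(sorted_std)


# Stand-in for an estimated latent space: two informative and eight noise dimensions.
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(size=(5000, 2)) * [2.0, 1.2],
                    0.01 * rng.normal(size=(5000, 8))], axis=1)
sorted_std, order = latent_spectrum(w)
num_informative = count_informative(sorted_std)
```

When a true variable is split across several estimated dimensions, as in the failure cases above, more than two entries of the spectrum stay large.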

In Experiment 2, samples are generated as in Experiment 1, but with only three mixture components. Since there are two sufficient statistics per dimension ($k = 2$) and two dimensions of variation ($n = 2$), according to the theory we need at least $nk + 1 = 5$ distinct conditions $u$ for equation (4) to hold (see Sec. 3.1). Therefore, we might not expect the experiment to succeed. Nonetheless, we observe essentially the same results as in the previous experiment (see Fig. 8 in Appendix E), with the same caveats regarding gaps between the mixture components in the generating latent space. This suggests that the conditions derived in Khemakhem et al. (2019), although sufficient for disentanglement, are not necessary.
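The counting argument behind the $nk + 1$ requirement is simple arithmetic. The small helper below, whose name is hypothetical, returns the largest number of informative latent variables covered by the theory for a given number of conditions.

```python
def max_guaranteed_latents(num_conditions, k=2):
    """Largest n satisfying n * k + 1 <= num_conditions (0 if none does)."""
    return max((num_conditions - 1) // k, 0)


five_clusters = max_guaranteed_latents(5)   # covers n = 2
three_clusters = max_guaranteed_latents(3)  # covers only n = 1
ten_classes = max_guaranteed_latents(10)    # covers up to n = 4
```

With three mixture components and Gaussian latents, only a single informative variable is covered, so an experiment with two informative variables falls outside the guarantee.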

4.4 EMNIST

4.4.1 EXPERIMENT

Figure 3: Spectrum of sorted standard deviations derived from training GIN on EMNIST. On the right is the equivalent spectrum from PCA, a linear method, on MNIST. The nonlinear spectrum exhibits a sharp knee not obtained by linear methods. In the nonlinear spectrum, the first 22 latent variables encode information about the shape of a digit, while the rest of the latent variables encode noise. This distinction is marked with a dotted line in the left and center figures (see Sec. 4.4.2 for explanation of the choice of this cut-off). Within the first 22 variables, the first eight encode global information, such as slant and width, whereas the following 14 encode more local information. This distinction is marked in the center figure only.

The data comes from the EMNIST Digits training set of 240,000 images of handwritten digits with labels (Cohen et al., 2017). EMNIST is a larger version of the well-known MNIST dataset, and also includes handwritten letters. Here we use only the digits. The digit label is used as the condition $u$, hence we construct a Gaussian mixture in the estimated latent space with 10 mixture components. According to the theory (see Sec. 3.1), these are only enough conditions to guarantee identifiability if there are at most four informative latent variables in the generating latent space, since $nk + 1 \le 10$ with $k = 2$ requires $n \le 4$. We expect that the true number of informative variables is somewhat higher, so as in Experiment 2 above, we are operating outside of the guarantees of the theory. In addition, the true generative process of human handwriting may not exactly fulfill our method’s assumptions, and we might still lack data on rare or subtle variations, despite the large size of the dataset in comparison to MNIST.

The estimating model is a GIN which uses convolutional coupling blocks and fully connected coupling blocks to transform the data to the latent space (full details in Appendix D). Optimization is with the Adam optimizer, with initial learning rate 3e-4. Batch size is 240 and the data is augmented with Gaussian noise at each iteration. The model is trained for 45 epochs, then for a further 50 epochs with the learning rate reduced by a factor of 10.

4.4.2 RESULTS

Figure 4: Selection of global latent variables found by GIN. Each row is conditioned on a different digit label. The variable runs from -2 to +2 standard deviations across the columns, with all other variables held constant at their mean value. The rightmost column shows a heatmap of the image areas most affected by the variable, computed as the absolute pixel difference between -1 and +1 standard deviations. Variable 1 controls the width of the top half of a digit, whereas variable 8 controls the width of the bottom half. Width in both cases is somewhat entangled with slant. Variable 3 controls the height and variable 4 controls how bent the digit is. Full set of variables for all digits in Appendix F.

Figure 5: Selection of local latent variables found by GIN. Refer to Figure 4 above for explanation of the layout. Variable 11 controls the shape of curved lines in the middle right. Digits without such lines are not affected. Variable 12 controls extension towards the upper right. Variable 13 modifies the top left of 2, 3 and 7 only (7 not shown here) and variable 16 modifies only the lower right stroke of a 2. Full set of variables for all digits in Appendix F.

The model encodes information into 22 broadly interpretable latent space variables. The eight variables with the highest standard deviation encode global information, such as slant, line thickness and width, which affect all digits more or less equally. The remaining 14 meaningful variables encode more local information, which does not affect all digits equally. We interpret all other variables as encoding noise. This is motivated mainly by our observation of an abrupt end to any observable change in the reconstructed digits when any variable after the first 22 is changed. This can be seen clearly in Fig. 15 in Appendix F. Some variables, particularly the global ones, are not entirely disentangled from one another. Nevertheless, the results are compelling and suggest that the assumptions required by the theory are approximately met.
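The inspection procedure used for Figs. 4 and 5 (varying one latent variable from -2 to +2 standard deviations with all others held at the class mean, plus a heatmap of the pixel difference between -1 and +1 standard deviations) can be sketched as follows. The function names are hypothetical and `decode` stands in for the trained network mapping latent codes to images; a random linear map is used here only so that the example runs.

```python
import numpy as np


def traverse(decode, mu, sigma, dim, num_steps=5, extent=2.0):
    """Decode latent codes that vary one variable and hold the rest at the class mean."""
    offsets = np.linspace(-extent, extent, num_steps)
    codes = np.tile(mu, (num_steps, 1))
    codes[:, dim] = mu[dim] + offsets * sigma[dim]
    return decode(codes)


def influence_heatmap(decode, mu, sigma, dim):
    """Absolute pixel difference between the decodings at -1 and +1 standard deviations."""
    low, high = traverse(decode, mu, sigma, dim, num_steps=2, extent=1.0)
    return np.abs(high - low)


# Stand-in decoder: a random linear map from 4 latent variables to 9 "pixels".
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 9))


def decode(codes):
    return codes @ B


mu = np.zeros(4)                        # class mean in the latent space
sigma = np.array([1.0, 2.0, 3.0, 4.0])  # class standard deviations
row = traverse(decode, mu, sigma, dim=1)
heat = influence_heatmap(decode, mu, sigma, dim=1)
```

For a noise variable, the decoded images along the traversal would be visually identical and the heatmap would be close to zero everywhere.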

We observed some global features which are not usually seen in disentanglement experiments on MNIST (e.g. Chen et al., 2016; Dupont, 2018). These experiments usually obtain digit slant, width and line thickness as the major global independent variables. We too observe slant and line thickness as independent variables (variables 2 and 5), but find width to be split into two variables, one governing the width of the upper half of a digit (variable 1) and the other the lower half (variable 8). This makes sense, since these can in fact vary independently. We also observe the height of a digit (variable 3) (see Fig. 4). This does not usually appear in disentanglement experiments, possibly because it is too subtle a variation for those experiments, or possibly because it is not present or not discoverable in the smaller MNIST dataset but is in EMNIST.

Local features are also usually not observed in such experiments, so the variables which control these are particularly interesting. These variables modify only a region of the digit, leaving the rest of the digit untouched. In addition, digits which do not have the feature which is being modified in that region are left alone. Examples include variable 13, which changes the orientation of the top-left stroke in 2, 3 and 7, and variable 16, which modifies only the lower-right stroke of a 2 (see Fig. 5). The full set of local and global variables can be seen in Fig. 10 to Fig. 15 in Appendix F.

The concept of a “true generative process” rests on the assumption that conditioning variables u act as causes for the latent variables z, such that the elements of the latter become independent given the former. In the case of handwritten digits, u represents a person’s decision to write a particular digit, and z determines the hand motions to produce this digit. Modelling u in terms of digit labels is a plausible proxy for the unknown causal variables in the human brain, and the success of the resulting model suggests that the assumptions are approximately fulfilled. Identifying promising conditioning variables u in less obvious situations is an important open problem for future work.

Increasing the number of conditioning variables would bring this work closer in line with the existing theory. At the current value of ten, the theory only applies if there are at most four informative generating latent variables, which is almost certainly too low a number. One option to increase the number of conditions is to relax the labels from hard to soft, i.e. compute posterior probabilities of class membership given a data example. This could be achieved by information distillation as in Hinton et al. (2015). Another possibility is to split existing labels into sub-labels, for example making a distinction between sevens with and without crossbars. This may also aid the generation of more realistic samples, since we observed no sevens with crossbars in our generated samples, even though they make up approximately 5% of the sevens in the dataset.

5 CONCLUSION AND OUTLOOK

We have expanded the theory of nonlinear ICA to cover problems with unknown intrinsic dimension and demonstrated an implementation with GIN, a new volume-preserving modification of the RealNVP invertible network architecture. The variables discovered by GIN in EMNIST are interpretable and detailed enough to suggest that the assumptions made about the generating process are approximately true. Furthermore, our experiments with EMNIST demonstrate the viability of applying models inspired by the new theory of nonlinear ICA to real-world data, even when not all of the conditions of the theory are met.

It is not clear if the methods from this work will scale to larger problems. However, given the recent advances of similar flow-based generative models in density estimation on larger datasets such as CelebA and ImageNet (e.g. Kingma & Dhariwal, 2018; Ho et al., 2019), it is a plausible prospect. In addition, it is not clear whether the method can successfully be extended to the context of semisupervised learning, or ultimately, unsupervised learning.

REFERENCES

Lynton Ardizzone, Carsten Lüth, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Guided image generation with conditional invertible neural networks. arXiv:1907.02392, 2019.

Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620, 2018.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.

Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: an extension of MNIST to handwritten letters. arXiv:1702.05373, 2017.

Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.

Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv:1611.02648, 2016.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv:1605.08803, 2016.

Emilien Dupont. Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, pp. 710–720, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.

Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv:1902.00275, 2019.

Aapo Hyv¨arinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, pp. 3765–3773, 2016.

Aapo Hyv¨arinen and Hiroshi Morioka. Nonlinear ICA of temporally dependent stationary sources. In Proceedings of Machine Learning Research, 2017.

Aapo Hyv¨arinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.

Aapo Hyv¨arinen, Hiroaki Sasaki, and Richard E Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. arXiv:1805.08651, 2018.

J¨orn-Henrik Jacobsen, Arnold Smeulders, and Edouard Oyallon. i-RevNet: Deep invertible net- works. arXiv:1802.07088, 2018.

Matthew J Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946–2954, 2016.

Ilyes Khemakhem, Diederik P Kingma, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. arXiv:1907.04809, 2019.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.

Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv:1811.12359, 2018.

Qingyu Zhao, Nicolas Honnorat, Ehsan Adeli, Adolf Pfefferbaum, Edith V Sullivan, and Kilian M Pohl. Variational autoencoder with truncated mixture of Gaussians for functional connectivity analysis. In International Conference on Information Processing in Medical Imaging, pp. 867–879. Springer, 2019.

SUPPLEMENTARY MATERIAL

A PROOF OF IDENTIFIABILITY

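This section reproduces a proof from Khemakhem et al. (2019) with some modifications to adapt it to the context of invertible neural networks (normalizing flows).

Define the domain of $g^{-1} \circ f$ as $\mathcal{Z} = \mathcal{Z}_1 \times \dots \times \mathcal{Z}_n$. Define the vector of sufficient statistics $T(z) = (T_{1,1}(z_1), \dots, T_{n,k}(z_n))$ and the vector of their coefficients $\lambda(u) = (\lambda_{1,1}(u), \dots, \lambda_{n,k}(u))$.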

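Theorem 1 Assume we observe data distributed according to the generative model defined in Sec. 3.1. Further suppose the following:

(i) The sufficient statistics $T_{i,j}(z)$ are differentiable almost everywhere and their derivatives $\partial T_{i,j}(z) / \partial z$ are nonzero almost surely for all $z \in \mathcal{Z}_i$ and all $1 \le i \le n$ and $1 \le j \le k$.

(ii) There exist $nk + 1$ distinct conditions $u^0, \dots, u^{nk}$ such that the $nk \times nk$ matrix $L = \left(\lambda(u^1) - \lambda(u^0), \dots, \lambda(u^{nk}) - \lambda(u^0)\right)$ is invertible.

Then the sufficient statistics of the generating latent space are related to those of the estimated latent space by the following relationship:

$$T(z) = A\,T'(w) + c$$

where $A$ is a constant, full-rank $nk \times dk$ matrix and $c \in \mathbb{R}^{nk}$ a constant vector.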

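Proof: The conditional probabilities $p_{T,\lambda,f}(x \mid u)$ and $p_{T',\lambda',g}(x \mid u)$ are assumed to be the same in the limit of infinite data. By expanding these expressions via the change of variables formula and taking the logarithm we find

$$\log p_{T,\lambda}(z \mid u) + \log p(\varepsilon) + \log\left|\det J_{f^{-1}}(x)\right| = \log p_{T',\lambda'}(w \mid u) + \log\left|\det J_{g^{-1}}(x)\right|$$

where $J$ is the Jacobian matrix. Let $u^0, \dots, u^{nk}$ be the conditions from (ii) above. We can subtract this expression for $u^0$ from the expression for some condition $u^l$. The Jacobian terms and the term involving $\varepsilon$ will vanish, since they do not depend on $u$:

$$\log p_{T,\lambda}(z \mid u^l) - \log p_{T,\lambda}(z \mid u^0) = \log p_{T',\lambda'}(w \mid u^l) - \log p_{T',\lambda'}(w \mid u^0) \qquad (12)$$

Since the conditional probabilities of $z$ and $w$ are exponential family members, we can write this expression as

$$\sum_{i=1}^{n}\left[\log\frac{Z_i(u^0)}{Z_i(u^l)} + \sum_{j=1}^{k} T_{i,j}(z_i)\left(\lambda_{i,j}(u^l) - \lambda_{i,j}(u^0)\right)\right] = \sum_{i=1}^{d}\left[\log\frac{Z'_i(u^0)}{Z'_i(u^l)} + \sum_{j=1}^{k} T'_{i,j}(w_i)\left(\lambda'_{i,j}(u^l) - \lambda'_{i,j}(u^0)\right)\right]$$

where the base measures $Q_i$ have cancelled out, since they do not depend on $u$. Defining $\bar{\lambda}(u) = \lambda(u) - \lambda(u^0)$ and writing the above in terms of inner products, we find

$$\left\langle T(z), \bar{\lambda}(u^l) \right\rangle + \sum_{i=1}^{n} \log\frac{Z_i(u^0)}{Z_i(u^l)} = \left\langle T'(w), \bar{\lambda}'(u^l) \right\rangle + \sum_{i=1}^{d} \log\frac{Z'_i(u^0)}{Z'_i(u^l)}, \qquad 1 \le l \le nk.$$

Stacking these $nk$ equations gives $L^T T(z) = L'^T T'(w) + b$, where the columns of $L$ and $L'$ are the vectors $\bar{\lambda}(u^l)$ and $\bar{\lambda}'(u^l)$ and $b$ collects the differences of the log-normalizer terms. Since $L$ is invertible by condition (ii), we can multiply by $L^{-T}$ to obtain

$$T(z) = A\,T'(w) + c \qquad (15)$$

where $A = L^{-T} L'^T$ and $c = L^{-T} b$. It now remains to show that $A$ has full rank.

According to a lemma from Khemakhem et al. (2019), there exist $k$ distinct values $z_i^1$ to $z_i^k$ such that $\left(\frac{dT_i}{dz_i}(z_i^1), \dots, \frac{dT_i}{dz_i}(z_i^k)\right)$ are linearly independent in $\mathbb{R}^k$, for all $1 \le i \le n$, where $T_i = (T_{i,1}, \dots, T_{i,k})$. Define $k$ vectors $z^l = (z_1^l, \dots, z_n^l)$ from the points given by this lemma. Take the derivative of equation (15) evaluated at each of these $k$ vectors and concatenate the resulting Jacobians as $Q = \left[J_T(z^1), \dots, J_T(z^k)\right]$. Each Jacobian has size $nk \times n$, hence $Q$ has size $nk \times nk$ and is invertible by the lemma and the fact that each component of $T$ is univariate. We can construct a corresponding matrix $Q'$ made up of the Jacobians of $T'(g^{-1} \circ f(z))$ evaluated at the same points and write

$$Q = A\,Q'$$

Since $Q$ is full rank, both $A$ and $Q'$ must also be full rank.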

A.1 REMARK REGARDING CONDITION (II)

If the coefficients $\lambda$ of the sufficient statistics are generated randomly and independently, then condition (ii) is almost surely fulfilled. In this case, we can ignore the dependence on $u$ to consider the $\lambda(u^l)$ as independent random variables. Then condition (ii) states that $nk + 1$ instances of these random variables are in general position in $\mathbb{R}^{nk}$, which is true almost surely.
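The remark can be checked numerically. The sketch below assumes condition (ii) is the usual requirement that the $nk \times nk$ matrix of coefficient differences $\lambda(u^l) - \lambda(u^0)$ is invertible; the sizes `n` and `k`, the Gaussian sampling of coefficients, and the helper name `condition_matrix` are illustrative choices, not taken from the experiments.

```python
import numpy as np

def condition_matrix(lambdas):
    """Stack the differences lambda(u^l) - lambda(u^0) as columns.

    lambdas: array of shape (nk + 1, nk), one coefficient vector per condition.
    Returns the (nk, nk) matrix whose invertibility is being tested.
    """
    return (lambdas[1:] - lambdas[0]).T

rng = np.random.default_rng(0)
n, k = 3, 2                      # hypothetical: 3 latent variables, 2 statistics each
nk = n * k

# Independent random coefficients: generically in general position.
lambdas = rng.normal(size=(nk + 1, nk))
L = condition_matrix(lambdas)
rank_random = np.linalg.matrix_rank(L)

# Degenerate case: one coefficient is identical across all conditions,
# so its differences vanish and one row of the matrix is zero.
lambdas_bad = lambdas.copy()
lambdas_bad[:, -1] = 1.0
rank_degenerate = np.linalg.matrix_rank(condition_matrix(lambdas_bad))

print(rank_random, rank_degenerate)
```

The degenerate case corresponds to a sufficient statistic whose coefficient does not vary with the condition, which is exactly the situation in which the rank requirement fails.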

A.2 REMARK REGARDING NOISE VARIABLES IN THE ESTIMATED LATENT SPACE

Any variables in the estimated latent space whose distribution does not depend on the conditioning variable $u$ are considered as encoding only noise. Such variables will be cancelled out in equation (12) and the corresponding column of $L'^T$, the matrix of coefficient differences $\lambda'(u^l) - \lambda'(u^0)$ of the estimating model, will contain only zeros. This means the corresponding column of $A$ will also contain only zeros, so any variation in such a noise variable has no effect on $z$. The reverse is also true: if a column of $A$ contains only zeros, the corresponding variable in the estimated latent space does not depend on $u$ and must encode only noise.
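A small numerical sketch of this remark follows. It assumes the mixing matrix takes the form $A = L^{-T} L'^T$, with the columns of $L$ and $L'$ holding the coefficient differences of the generating and estimating models; the sizes, the random entries, and the choice of which estimated variables are noise are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 2, 4, 2                     # hypothetical: 2 true latents, 4 estimated latents
nk, dk = n * k, d * k

# Columns hold lambda(u^l) - lambda(u^0) for l = 1..nk.
L = rng.normal(size=(nk, nk))         # generating model
L_prime = rng.normal(size=(dk, nk))   # estimating model

# Make estimated variables 2 and 3 pure noise: their coefficients do not
# depend on u, so every difference is zero. Entry i*k + j holds statistic j
# of estimated variable i.
noise_vars = [2, 3]
noise_rows = [i * k + j for i in noise_vars for j in range(k)]
L_prime[noise_rows, :] = 0.0

# A = L^{-T} L'^T has shape (nk, dk); solve avoids forming the inverse.
A = np.linalg.solve(L.T, L_prime.T)
zero_columns = np.where(np.all(np.abs(A) < 1e-12, axis=0))[0].tolist()

print(zero_columns)
```

The zero columns of `A` line up with the sufficient statistics of the noise variables, so changing those variables leaves $T(z)$ untouched.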

B SPARSITY IN UNMIXING MATRIX: GAUSSIAN DISTRIBUTION

Suppose samples z from the generating latent space follow a conditional Gaussian distribution. Suppose the estimating model g faithfully reproduces the observed conditional density p(x|u) and samples w from its latent space also follow a conditional Gaussian distribution. Then we can apply equation (4) to relate the generating and estimating latent spaces.

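The sufficient statistics of a normal distribution with free mean and variance are $z$ and $z^2$. Hence the relationship between the latent spaces becomes

$$\begin{pmatrix} z \\ z^2 \end{pmatrix} = A \begin{pmatrix} w \\ w^2 \end{pmatrix} + c$$

where the squaring is applied element-wise. We can write $A$ in block matrix form as

$$A = \begin{pmatrix} A^{(1)} & A^{(2)} \\ A^{(3)} & A^{(4)} \end{pmatrix}$$

and $c$ as

$$c = \begin{pmatrix} c^{(1)} \\ c^{(2)} \end{pmatrix}$$

Then:

$$z = A^{(1)} w + A^{(2)} w^2 + c^{(1)}, \qquad z^2 = A^{(3)} w + A^{(4)} w^2 + c^{(2)}$$

so we can write for each dimension $i$ of $z$

$$z_i = \sum_j A^{(1)}_{ij} w_j + \sum_j A^{(2)}_{ij} w_j^2 + c^{(1)}_i \qquad (22)$$

$$z_i^2 = \sum_j A^{(3)}_{ij} w_j + \sum_j A^{(4)}_{ij} w_j^2 + c^{(2)}_i \qquad (23)$$

In order to compare the equations, we need to square (22). To do so, we will have to square the second term on the right hand side, involving $A^{(2)}$. There is no matching term in (23), so we have to set all entries of $A^{(2)}$ to zero. In more detail:

$$z_i^2 = \sum_{j,j'} A^{(2)}_{ij} A^{(2)}_{ij'} w_j^2 w_{j'}^2 + \dots$$

where the dots stand for terms of lower order in $w$. The first term with $w_j^2 w_{j'}^2$ matches no term in (23), so we have to set $A^{(2)}_{ij} = 0$ for all $i$ and $j$. This simplifies the earlier equation:

$$z_i = \sum_j A^{(1)}_{ij} w_j + c^{(1)}_i$$

The square of the first term on the right hand side involves cross terms:

$$\left(\sum_j A^{(1)}_{ij} w_j\right)^2 = \sum_j \left(A^{(1)}_{ij}\right)^2 w_j^2 + \sum_{j \neq j'} A^{(1)}_{ij} A^{(1)}_{ij'} w_j w_{j'}$$

so we have to set $A^{(1)}_{ij} A^{(1)}_{ij'} = 0$ for all $j \neq j'$. This means that the $i$-th row of $A^{(1)}$ can have at most one nonzero entry. It must also have at least one nonzero entry, since if the row were all zero, a row of $A$ would be all zero (since $A^{(2)} = 0$), but $A$ has full rank. Since there are as many or fewer rows than columns ($2n \le 2d$), the rows of $A$ are linearly independent, so it is not possible for one to be zero. Hence each row of $A^{(1)}$ has exactly one nonzero entry. Moreover, no two rows of $A^{(1)}$ have their nonzero entries in the same column. If they did, the two rows would not be linearly independent, but they must be since $A$ has full rank. Therefore we can write

$$z_i = a_i w_j + c_i$$

where $a_i = A^{(1)}_{ij}$ and $c_i = c^{(1)}_i$. That is, the generating latent variable $z_i$ is linearly related to some latent variable of the estimating model $w_j$. This estimated latent variable is uniquely associated with $z_i$, and any estimated latent variables not associated with a generating latent variable (in the case $d > n$) encode no information about the generating latent space. So the model has decoded the original latent variables $z$ up to an affine transformation and permutation as a subset of variables in its estimated latent space and has encoded no information (only noise) into the remaining latent variables.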
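The structure this argument arrives at can be illustrated numerically. The following sketch does not replay the proof; it builds latents that already satisfy an affine, permuted relationship (the specific slopes, offsets and dimensions are made up for illustration) and then recovers $A$ and $c$ by least squares on the features $(w, w^2, 1)$, to show the resulting sparsity pattern of the blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 5000, 2, 4                      # hypothetical sample count and dimensions
w = rng.normal(size=(N, d))               # estimated latents, d > n

# Hypothetical ground truth: z_0 = 2.0 * w_3 - 1.0 and z_1 = -0.5 * w_1 + 0.3.
z = np.stack([2.0 * w[:, 3] - 1.0, -0.5 * w[:, 1] + 0.3], axis=1)

# Regress the sufficient statistics (z, z^2) on (w, w^2, 1) to estimate A and c.
features = np.hstack([w, w**2, np.ones((N, 1))])    # shape (N, 2d + 1)
targets = np.hstack([z, z**2])                       # shape (N, 2n)
coef, *_ = np.linalg.lstsq(features, targets, rcond=None)
A = coef[:-1].T                                      # shape (2n, 2d)
c = coef[-1]                                         # shape (2n,)

A1, A2 = A[:n, :d], A[:n, d:]                        # upper blocks of A
nonzero_per_row = (np.abs(A1) > 1e-8).sum(axis=1)

print(np.round(A1, 3))
print(np.round(A2, 3))
```

In this construction the block `A2` vanishes and each row of `A1` has a single nonzero entry, while the columns belonging to `w_0` and `w_2` are zero throughout, matching the description of unused estimated variables as noise.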

C SPARSITY IN UNMIXING MATRIX: TWO-PARAMETER EXPONENTIAL FAMILY MEMBERS

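The results of Appendix B can be extended to other members of the exponential family with 2 parameters. In the general case, writing $T_{i,1} = T_1$ and $T_{i,2} = T_2$ for all $i$, equations (22) and (23) become

$$T_1(z_i) = \sum_j A^{(1)}_{ij} T_1(w_j) + \sum_j A^{(2)}_{ij} T_2(w_j) + c^{(1)}_i \qquad (29)$$

$$T_2(z_i) = \sum_j A^{(3)}_{ij} T_1(w_j) + \sum_j A^{(4)}_{ij} T_2(w_j) + c^{(2)}_i$$

which, with the definition $t = T_2 \circ T_1^{-1}$, becomes

$$T_1(z_i) = \sum_j A^{(1)}_{ij} T_1(w_j) + \sum_j A^{(2)}_{ij}\, t(T_1(w_j)) + c^{(1)}_i$$

$$t(T_1(z_i)) = \sum_j A^{(3)}_{ij} T_1(w_j) + \sum_j A^{(4)}_{ij}\, t(T_1(w_j)) + c^{(2)}_i$$

We can combine these equations to get

$$t\left(\sum_j A^{(1)}_{ij} T_1(w_j) + \sum_j A^{(2)}_{ij}\, t(T_1(w_j)) + c^{(1)}_i\right) = \sum_j A^{(3)}_{ij} T_1(w_j) + \sum_j A^{(4)}_{ij}\, t(T_1(w_j)) + c^{(2)}_i \qquad (33)$$

This equation has many summed terms on the right. Suppose that $t$ has a convergent Taylor expansion in some region of its domain. We can take this expansion of the term on the left and compare coefficients. Since $t$ cannot be linear, there will be terms in the expansion of order two or higher. As in the Gaussian case, these polynomial terms create cross terms which are impossible to reconcile with those on the right hand side of the equation. The only consistent solutions can be found by setting all coefficients except those of functions of $w_j$ (for some $j$) to zero:

$$t\left(A^{(1)}_{ij} T_1(w_j) + A^{(2)}_{ij}\, t(T_1(w_j)) + c^{(1)}_i\right) = A^{(3)}_{ij} T_1(w_j) + A^{(4)}_{ij}\, t(T_1(w_j)) + c^{(2)}_i \qquad (34)$$

We can further simplify this equation by examining higher-order terms, but need to know the form of $t$ to do so. In any case, we can now write equation (29) as

$$T_1(z_i) = A^{(1)}_{ij} T_1(w_j) + A^{(2)}_{ij} T_2(w_j) + c^{(1)}_i$$

showing that each generating latent variable $z_i$ is related to exactly one estimated latent variable $w_j$. As in the Gaussian case, we can use the full rank property of $A$ to see that each estimated latent variable is associated with at most one generating latent variable, and any estimated latent variables not associated with any generating latent variables must encode only noise. The task now is to check the form of $t$ for each two-parameter member of the exponential family, to see what further constraints we can derive from equation (34). The results are stated in Table 1.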
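As a worked instance of this last step, consider the Gaussian case of Appendix B, where the two sufficient statistics are $z$ and $z^2$, so the second statistic is the square of the first. The shorthand below is introduced only for this example: $v$ stands for the single estimated latent variable that survives, $a_1, a_2, a_3, a_4$ for the surviving entries of the four blocks of $A$, and $c_1, c_2$ for the two constants.

```latex
\begin{align*}
(a_1 v + a_2 v^2 + c_1)^2 &= a_3 v + a_4 v^2 + c_2 \\
a_2^2 v^4 + 2 a_1 a_2 v^3 + (a_1^2 + 2 a_2 c_1) v^2 + 2 a_1 c_1 v + c_1^2
  &= a_4 v^2 + a_3 v + c_2 \\[4pt]
v^4:&\quad a_2^2 = 0 \;\Rightarrow\; a_2 = 0 \\
v^3:&\quad 2 a_1 a_2 = 0 \quad \text{(automatic once } a_2 = 0\text{)} \\
v^2:&\quad a_4 = a_1^2 \\
v^1:&\quad a_3 = 2 a_1 c_1 \\
v^0:&\quad c_2 = c_1^2
\end{align*}
```

With $a_2 = 0$ the relation between the first statistics reduces to $z_i = a_1 w_j + c_1$, which is the affine relationship obtained in Appendix B. Other family members would require repeating the comparison with their own form of $t$.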

Table 1: Two-parameter exponential family members and selected properties.

D NETWORK ARCHITECTURE

The estimating model g is built in the reverse direction for practical purposes, so the models described here are $g^{-1}$, which maps from the data space to the latent space. The type and number of coupling blocks for the different experiments are shown below. The affine coupling function is the concatenation of the scale function s and the translation function t, computed together for efficiency, as in Kingma & Dhariwal (2018). It is implemented as either a fully connected network (MLP) or convolutional network, with the specified layer widths and a ReLU activation after all but the final layers. For the convolutional coupling blocks, the splits are along the channel dimension. The scale function s is passed through a clamping function 2 tanh(s), which limits the output to the range (-2, 2), as in Ardizzone et al. (2019). Two affine coupling functions are applied per block, as described in Dinh et al. (2016). Downsampling increases the number of channels by a factor of 4 and decreases the image width and height by a factor of 2, done in a checkerboard-like manner, as described in Jacobsen et al. (2018). The dimensions are permuted in a random but fixed way before application of each fully connected coupling block and likewise the channels for the convolutional coupling blocks. The network for the artificial data experiments has 4,480 learnable parameters and the network for the EMNIST experiments has 2,620,192 learnable parameters.
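The fully connected case can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code used for the experiments: it implements a single affine coupling function (a full block would apply two), treats the clamped output as a log-scale applied through an exponential in the usual RealNVP manner, omits any further constraint on the scale outputs, and uses made-up layer sizes and helper names (`mlp`, `init_mlp`, `coupling_forward`, `coupling_inverse`).

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a fully connected network with the given layer sizes."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, weights):
    """Fully connected network with ReLU after all but the final layer."""
    for W, b in weights[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = weights[-1]
    return x @ W + b

def scale_and_translation(x1, weights, out_dim):
    st = mlp(x1, weights)                      # s and t computed together
    s, t = st[:, :out_dim], st[:, out_dim:]
    return 2.0 * np.tanh(s), t                 # clamp s to the range (-2, 2)

def coupling_forward(x, weights, perm):
    x = x[:, perm]                             # random but fixed permutation
    half = x.shape[1] // 2
    x1, x2 = x[:, :half], x[:, half:]
    s, t = scale_and_translation(x1, weights, x2.shape[1])
    y2 = x2 * np.exp(s) + t                    # affine transform of second half
    return np.hstack([x1, y2]), s.sum(axis=1)  # output and log|det J|

def coupling_inverse(y, weights, perm):
    half = y.shape[1] // 2
    y1, y2 = y[:, :half], y[:, half:]
    s, t = scale_and_translation(y1, weights, y2.shape[1])
    x = np.hstack([y1, (y2 - t) * np.exp(-s)])
    return x[:, np.argsort(perm)]              # undo the permutation

rng = np.random.default_rng(0)
dim = 10
weights = init_mlp(rng, [dim // 2, 16, dim])   # hypothetical layer widths
perm = rng.permutation(dim)
x = rng.normal(size=(8, dim))
y, log_det = coupling_forward(x, weights, perm)
x_rec = coupling_inverse(y, weights, perm)
```

Because the first half passes through unchanged, the inverse can recompute the same scale and translation from it, which is what makes the block invertible regardless of how complex the inner network is.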

Table 2: Network architecture for artificial data experiments

Table 3: Network architecture for EMNIST experiments

D.1 NOTE ON OPTIMIZATION METHOD

In the experiments described in this paper, the mean and variance of a mixture component were updated at each iteration as the mean and variance of the transformations to latent space of all data points belonging to that mixture component in a minibatch of data D:

$$\mu_i(u; \theta) = \mathbb{E}_{x \in D_u}\left[g_i^{-1}(x; \theta)\right], \qquad \sigma_i^2(u; \theta) = \mathrm{Var}_{x \in D_u}\left[g_i^{-1}(x; \theta)\right]$$

where $D_u$ denotes the data points in D belonging to mixture component $u$. Hence the parameters of the mixture components would change in each batch, according to the data present. The notation $\mu_i(u; \theta)$ does not indicate that $\mu_i$ is directly parameterized by $\theta$ and learned; instead it indicates that $\mu_i$ is a function of $g^{-1}$, which is parameterized by $\theta$. A change in $\theta$ will also change the output of $\mu_i$, given the same mini-batch of data D. The same holds for $\sigma_i^2$.

We expect that specifying the means and variances as learnable parameters updated in tandem with the model weights would have worked equally well.
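The per-batch update can be sketched as below. The helper names, the use of the biased variance estimate (NumPy's default), the fallback to zero mean and unit variance for components absent from a batch, and the small `eps` added for numerical stability are all assumptions made for this illustration.

```python
import numpy as np

def batch_mixture_stats(latents, labels, num_classes):
    """Per-component mean and variance of the latents in one minibatch.

    latents: (batch, dim) outputs of the network for the minibatch.
    labels:  (batch,) integer mixture component of each data point.
    """
    dim = latents.shape[1]
    means = np.zeros((num_classes, dim))
    variances = np.ones((num_classes, dim))
    for u in range(num_classes):
        members = latents[labels == u]
        if len(members) > 0:              # absent components keep the defaults
            means[u] = members.mean(axis=0)
            variances[u] = members.var(axis=0)
    return means, variances

def gaussian_nll(latents, labels, means, variances, eps=1e-6):
    """Mean negative log-likelihood under each point's own mixture component."""
    mu = means[labels]
    var = variances[labels] + eps
    per_point = 0.5 * np.sum((latents - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
    return per_point.mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=64)
latents = rng.normal(size=(64, 2)) + 5.0 * labels[:, None]
means, variances = batch_mixture_stats(latents, labels, num_classes=4)
loss = gaussian_nll(latents, labels, means, variances)
```

In an automatic differentiation framework the statistics would be computed inside the training step, so gradients flow through them to the network weights, which is the sense in which they depend on the model parameters.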

E FIGURES FROM THE ARTIFICIAL DATA EXPERIMENTS

Figure 6: Experiment 1: Five mixture components and 100,000 data points. GIN successfully reconstructs the generating latent space and gives importance to only two of its ten latent variables, reflecting the two-dimensional nature of the generating latent space.

Figure 7: Same experiment as in Figure 6 with the number of data points reduced to 10,000. GIN fails to successfully estimate the ground truth latent variables, due to limited data: The first variable (x-axis) is well approximated in the reconstruction, but the second variable (y-axis) is split into two in the reconstruction, one capturing mainly information about the lower two clusters (shown here) and another information about the other mixture components (not shown). We also observe a less clear spectrum, where three variables are given more importance than the rest, not faithfully reflecting the two-dimensional nature of the generating latent space.

Figure 8: Experiment 2: Only three mixture components (not sufficient for identifiability according to the theory). Nevertheless, GIN successfully reconstructs the ground truth latent variables. This suggests that the current theory of nonlinear ICA relies on sufficient, but not necessary, conditions for identifiability.

F EMNIST FIGURES

Figure 9: Full and reduced temperature samples from the model trained on EMNIST. Reduced temperature samples are made by sampling from a Gaussian distribution where the standard deviation is reduced by the temperature factor. The 22 most significant variables are sampled, with the others kept to their mean value. This eliminates noise from the images but preserves the full variability of digit shapes. Each row has the same latent code (whitened value) but is conditioned on a different class in each column, hence the style of the digits is consistent across rows.
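The sampling procedure described in this caption can be sketched as follows. The function names, the dimensions, and the indices in `active` are hypothetical; how the most significant variables are selected is not specified here and is simply taken as given.

```python
import numpy as np

def whitened_codes(rng, num_samples, dim, active):
    """Standard normal codes on the active variables, zero (the mean) elsewhere."""
    eps = np.zeros((num_samples, dim))
    eps[:, active] = rng.normal(size=(num_samples, len(active)))
    return eps

def class_latents(eps, class_mean, class_std, temperature=1.0):
    """Map a whitened code to the latent space of one class.

    The temperature multiplies the standard deviation, so values below 1
    pull samples towards the class mean.
    """
    return class_mean + temperature * class_std * eps

rng = np.random.default_rng(0)
dim = 6
active = [0, 3]                                  # hypothetical significant variables
inactive = [1, 2, 4, 5]
class_mean = rng.normal(size=dim)
class_std = rng.uniform(0.5, 2.0, size=dim)

eps = whitened_codes(rng, num_samples=4, dim=dim, active=active)
w_full = class_latents(eps, class_mean, class_std, temperature=1.0)
w_cold = class_latents(eps, class_mean, class_std, temperature=0.5)
```

Reusing the same `eps` with the mean and standard deviation of a different class gives the consistent style across classes that the caption describes; the resulting latents would then be passed through the generative direction of the network to obtain images.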

Figure 10: Most significant latent variables 1 to 4. Each row is conditioned on a different digit label. The variable runs from -2 to +2 standard deviations across the columns, with all other variables held constant at their mean value. The rightmost column shows a heatmap of the image areas most affected by the variable, computed as the absolute pixel difference between -1 and +1 standard deviations.

Figure 11: Most significant latent variables 5 to 8. Each row is conditioned on a different digit label. The variable runs from -2 to +2 standard deviations across the columns, with all other variables held constant at their mean value. The rightmost column shows a heatmap of the image areas most affected by the variable, computed as the absolute pixel difference between -1 and +1 standard deviations.

Figure 12: Most significant latent variables 9 to 12. Each row is conditioned on a different digit label. The variable runs from -2 to +2 standard deviations across the columns, with all other variables held constant at their mean value. The rightmost column shows a heatmap of the image areas most affected by the variable, computed as the absolute pixel difference between -1 and +1 standard deviations.

Figure 13: Most significant latent variables 13 to 16. Each row is conditioned on a different digit label. The variable runs from -2 to +2 standard deviations across the columns, with all other variables held constant at their mean value. The rightmost column shows a heatmap of the image areas most affected by the variable, computed as the absolute pixel difference between -1 and +1 standard deviations.

Figure 14: Most significant latent variables 17 to 20. Each row is conditioned on a different digit label. The variable runs from -2 to +2 standard deviations across the columns, with all other variables held constant at their mean value. The rightmost column shows a heatmap of the image areas most affected by the variable, computed as the absolute pixel difference between -1 and +1 standard deviations.

Figure 15: Most significant latent variables 21 to 24. Each row is conditioned on a different digit label. The variable runs from -2 to +2 standard deviations across the columns, with all other variables held constant at their mean value. The rightmost column shows a heatmap of the image areas most affected by the variable, computed as the absolute pixel difference between -1 and +1 standard deviations.
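The traversals and heatmaps in Figures 10 to 15 follow a simple recipe, sketched below. The `decode` argument stands in for the generative direction of the trained network; here it is replaced by a made-up linear map to 4x4 images so that the example runs on its own, and all names and sizes are illustrative.

```python
import numpy as np

def traverse(decode, class_mean, class_std, index, num_steps=9, span=2.0):
    """Decode images while one latent runs from -span to +span standard deviations.

    All other latent variables are held at the class mean.
    """
    offsets = np.linspace(-span, span, num_steps)
    w = np.tile(class_mean, (num_steps, 1))
    w[:, index] = class_mean[index] + offsets * class_std[index]
    return decode(w)

def heatmap(decode, class_mean, class_std, index):
    """Absolute pixel difference between the images at -1 and +1 standard deviations."""
    low, high = traverse(decode, class_mean, class_std, index, num_steps=2, span=1.0)
    return np.abs(high - low)

rng = np.random.default_rng(0)
dim = 5
W = rng.normal(size=(dim, 16))                       # hypothetical linear decoder
decode = lambda w: (w @ W).reshape(-1, 4, 4)
class_mean = rng.normal(size=dim)
class_std = rng.uniform(0.5, 2.0, size=dim)

images = traverse(decode, class_mean, class_std, index=2)
heat = heatmap(decode, class_mean, class_std, index=2)
```

For the linear stand-in decoder the heatmap is just the scaled magnitude of one row of `W`; with a nonlinear network it highlights the image regions that respond to the chosen variable around the class mean.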
