A central question in unsupervised deep learning is how to learn nonlinear representations that are a faithful reconstruction of the true latent variables behind the data. This allows us to learn representations that are semantically meaningful, interpretable and useful for downstream tasks. Identifiability is fundamental for meaningful and principled disentanglement, and in applications such as causal discovery. However, this is a very difficult task: by definition, we never observe the latent variables; the only information directly available to us is given by the observed variables. Learning the true representations is only possible when the representation is identifiable: if, in the limit of infinite data, only a single representation function can fit the data. Conversely, if, in the limit of infinite data, multiple representation functions can fit the data, then the true representation function is unidentifiable.
Until recently (Hyvärinen and Morioka, 2016, 2017), results relating to identifiability of (explicit and implicit) latent variable models were mainly constrained to linear models (e.g., as in linear ICA), as it was acknowledged that the flexibility of nonlinear mappings could yield arbitrary latent variables which fulfill model assumptions such as independence (Hyvärinen and Pajunen, 1999). However, it is now understood that nonlinear deep latent variable models can be identifiable provided we observe some additional auxiliary variables such that the latent variables are conditionally independent given the auxiliary variable. The approach was introduced using self-supervised learning by Hyvärinen et al. (2019), and Khemakhem et al. (2020) explicited a connection between nonlinear ICA and the framework of variational autoencoders. It was shortly followed by work by Sorrenson et al. (2020), where a similar connection was made to flow-based models (Rezende and Mohamed, 2015). This signals the importance of identifiability in popular deep generative models.
We extend this trend to a broad family of (unnormalized) conditional energy-based models (EBM), using insight from the nonlinear ICA theory. EBMs offer unparalleled flexibility, mainly because they do not require the modeled densities to be normalized nor easy to sample from. In fact, the energy model we suggest will have universal approximation capabilities. The energy function we will consider is defined in two steps: we learn two feature extractors, parameterized by neural networks, one for each of the observed variables (dependent and conditioning); then, we set the energy function to be the dot-product of the learned features. The modeled conditional densities are defined to be the exponential of the negative energy function.
A first important contribution of this paper is to provide a set of sufficient mild conditions to be satisfied by the feature extractors, which would guarantee their identifiability: they learn representations that are unique up to a linear transformation. In addition, by slightly altering the definition of the energy function, we prove the linear transformation is essentially a permutation. These conditions are functional, i.e. they abstract away the architecture of the networks. As a concrete example, we provide a neural network architecture based on fully connected layers, for which the functional conditions hold, and is thus identifiable. Moreover, we do not make any assumptions on the distributions of the learned representations. Effectively, this makes our family of models very flexible and adaptable to practical problems. We call this model Identifiable Conditional Energy-Based deep Models, or ICE-BeeM for short.
Our second contribution is to develop a framework we call Independently Modulated Component Analysis (IMCA): a deep latent variable model where the latents are non-independent (thus generalizing nonlinear ICA), with an arbitrary global dependency structure. Nonlinear ICA research has formalized the trade-off between expressivity of the mapping between latents to observations (from linear to nonlinear) and distributional assumptions over latent variables (from independent to conditionally independent given auxiliary variables). However, the need for (conditional) independence in order to obtain identifiability results may sometimes be seen as a limitation, for example in the context of learning disentangled representations. Therefore, it would be important to relax the assumption of independence while maintaining identifiability. This was achieved before in the linear case (Monti and Hyvärinen, 2018; Hyvärinen and Hurri, 2004), and we show how it may be achieved in the nonlinear setting. We show how our ICE-BeeM can estimate this generative model, thus connecting both the generative and non-generative views.
Finally, we show empirically that ICE-BeeM learns identifiable representations from real-world image datasets. As a further, rather different application of our results, we show how identifiability of ICE-BeeM can be leveraged for transfer learning and semi-supervised learning. In fact, we believe that the identifiability results are generally important for principled application of EBMs, whether for the purposes of disentanglement or otherwise.
In this section, we define ICE-BeeM , and study its properties. All proofs can be found in Appendix C.
2.1 Model definition
We collect a dataset of observations of tuples (x, y), where is the main variable of interest, also called the dependent variable, and
is an auxiliary variable also called the conditioning variable.
Consider two feature extractors and
, which we parameterize by neural networks, and
is the vector of weights and biases. To alleviate notations, we will drop
clear which quantities we refer to. These feature extractors are used to define the conditional energy function
The parameter lives in the space
which is defined such that the normalizing constant
is finite. Our family of conditional energy-based models has the form:
As we will see later, this choice of energy function is not restrictive, as our model has powerful theoretical guarantees: universal approximation capabilities and strong identifiability properties. There exists a multitude of methods we can use to estimate this model (Hyvärinen, 2005; Gutmann and Hyvärinen, 2010; Ceylan and Gutmann, 2018; Uehara et al., 2020). In this work, we will use Flow Contrastive Estimation (Gao et al., 2019) and Denoising Score Matching (Vincent, 2011), which are discussed and extended to the conditional case in Appendix B.
2.2 Identifiability
As stated earlier, we want our model to learn meaningful representations of the dependent and conditioning variables. In particular, when learning two different models of the family (1) from the same dataset, we want the learned features to be very similar.
This similarity between representations is better expressed as equivalence relations on the parameters of the network, which would characterize the form of identifiability we will end up with for our energy model. This notion of identifiability up to equivalence class was introduced by Khemakhem et al. (2020) to address the fact that there typically exist many choices of neural network parameters
that map to the same point in function-space. In our case, it is given by the following definitions: Definition 1 (Weak identifiability). Let
be equivalence relations on
defined as:
where A and B are ()-matrices of rank at least
and
respectively, and c and e are vectors. Definition 2 (Strong identifiability). Let
be the equivalence relations on
defined as:
where are permutations of
are non-zero scalars and
are scalars.
Two parameters are thus considered equivalent if they parameterize two feature extractors that are equal up to a linear transformation (2) or a scaled permutation (3). The subscripts w and s stand for weak and strong, respectively. Special cases are discussed in Appendix C.1.
Identifiability in the context of probability densities modeled by neural networks can be seen as a study of degeneracy of the networks. In applications where the representations are used in a downstream classification task, the weak identifiability (2) may be enough. It guarantees that the hyperplanes defining the boundaries between classes in the feature space are consistent, up to a global rotation, and thus the downstream task may be unaffected. Strong identifiability (3), on the other hand, is crucial in applications where such rotation is undesirable. For example, Monti et al. (2019) propose an algorithm for causal discovery based on independence tests between the observations and latent variables learnt by solving a nonlinear ICA task. The tested independences only hold for the true latent noise variables. Were one to learn the latents only up to a rotation, such causal analysis method would not work at all.
2.2.1 Weak identifiability
This initial form of identifiability requires very few assumptions on the feature extractors f and g. In fact, the conditions we develop here are easy to satisfy in practice, and we will see how in Section 2.3. Most importantly, our result also covers the case where the number of features is larger than the number of observed variables. As far as we know, this is the first identifiability result that extends to overcomplete representations in the nonlinear setting. The following theorem summarizes the main result. Intuition behind the conditions, as well as a proof under milder assumptions, can be found in Appendix C.2.
Theorem 1. Let and
be the equivalence relations in (2). Assume that for any choice of parameter
2. There exist points
such that the matrix
is invertible.
With switched, the same conclusion applies to
Finally, if both assumptions 1 and 2 are satisfied by both feature extractors and
, then the matrices A and B in (2) have full row rank equal to
2.2.2 Strong identifiability
We propose two different alterations to our energy function which will both allow for the stronger form of identifiability defined by . We will focus on f, but the same results hold for g by a simple transposition of assumptions. Importantly, we will suppose that the output dimension
is smaller than the input dimension
The first is based on restricting the feature extractor f to be non-negative. It will induce constraints on the matrix A defining the equivalence relation : loosely speaking, if A induces a rotation in space, then it will violate the non-negativity constraint, since the only rotation that maps the positive orthant of the plan to itself is the identity.
The second alteration is based on augmenting f by its square, effectively resulting in the -dimensional feature extractor
. This augmented feature map is combined with a
-dimensional feature map
for the conditioning variable y, to define an augmented energy function
. The advantage of this approach is that it doesn’t require the feature extractors to be positive. However, it makes the effective size of the feature extractor equal to
Identifiability results derived from these two alterations are summarized by the following theorem. Theorem 2. Assume that and that the assumptions of Theorem 1 hold. Further assume that, for any choice of parameter
, either one of the following conditions hold:
3. The feature extractor is surjective, and its image is
4. The feature extractor is differentiable and surjective, its Jacobian
is full rank; there exist
points
such that the matrix
of size
is invertible; and we use the augmented energy function
in the definition of the model.
Then is defined in (3).
A more general form of the Theorem is provided in Appendix C.5. This theorem is fundamental as it proves very strong identifiability results for a conditional deep energy-based model. As far as we know, our results require the least amount of assumptions in recent theoretical work for functional identifiability of deep learning models (Khemakhem et al., 2020; Sorrenson et al., 2020). Most importantly, we do not make any assumption on the distribution of the latent features.
2.3 An identifiable neural network architecture
In this section, we give a concrete example of a neural network architecture that satisfies the functional assumptions of Theorem 1. We suppose that each of the networks f and g are parameterized as multi-layer perceptrons (MLP). More specifically, consider an MLP with L layers, where each layer consists of a linear mapping with weight matrix and bias
, followed by an activation function
. Consider the following architecture:
(b.) The weight matrices are full rank (its rank is equal to its smaller dimension),
(c.) The row dimension of the weight matrices are either monotonically increasing or decreasing:
This architecture satisfies the assumptions of Theorems 1 and 2, as is stated by the propositions below. Proposition 1. Consider an MLP f whose architecture satisfies assumptions (a.), (b.) and (c.), then f satisfies Assumption 1. If in addition, , then f satisfies Assumption 4. Finally, if on top of that, we apply a ReLU to the output of the network, then f satisfies Assumption 3.
Proposition 2. Consider a nonlinear MLP g whose architecture satisfies assumptions (a.), (b.), and (d.).3 Then, g satisfies Assumptions 2 and 4.
While assumptions (a.)-(d.) might seem a bit restrictive, they serve the important goal of giving sufficient architectural conditions that correspond to the purely functional assumptions of Theorems 1 and 2. Note that the full rank assumptions are necessary to ensure that the learnt representations are not degenerate, since we lose information with low rank matrices. In practice, random initialization of floating point parameters, which are then optimized with stochastic updates (SGD), will result in weight matrices that are almost certainly full rank.
2.4 Universal approximation capability
With such a potentially overcomplete network, we can further achieve universal approximation of the data distribution. It might initially seem that this is an impossible endeavor given the somehow restricted form of the energy function. However, if we also consider the dimension an additional architectural parameter that we can change at will, then we can always find an arbitrarily good approximation of the conditional probability density function:
Theorem 3. Let p(x|y) be a conditional probability density. Assume that X and Y are compact Hausdorff spaces, and that p(x|y) > 0 almost surely . Then for each
0, there exists
, where
is the dimension of the feature extractor, such that
This means that our model is capable of approximating any conditional distribution that is positive on its compact support arbitrarily well. In practice, the optimal dimension of the feature extractors can be estimated using cross-validation for instance. It is possible that to achieve a near perfect approximation, we require a value of
that is larger than the dimension of the input. This is why it is crucial that our identifiability result from Theorem 1 covers the overcomplete case as well, and highlights the importance of our contribution in comparison to previous identifiable deep models.
Next, we show how ICE-BeeM relates to a generative, latent variable model. We develop here a novel framework that generalizes nonlinear ICA to non-independent latent variables, and show how we can use our energy model to estimate them.
Model definition Assume we observe a random variable as a result of a nonlinear transformation h of a latent variable
. We assume the distribution of z is conditioned on an auxiliary variable
, which is also observed:
We will suppose here that . The proofs, as well as an extension to
, can be found in Appendix D. The main modeling assumption we make on the latent variable is that its density has the following form:
where is a base measure and
is the conditional normalizing constant. Crucially, the exponential term factorizes across components: the sufficient statistic T of this exponential family is composed of d functions that are each a function of only one component
of the latent variable z.
Equations (4) and (5) together define a nonparametric model with parameters . For the special case
, the distribution of z factorizes across dimensions, and the components
are independent. Then the generative model gives rise to a nonlinear ICA model, and it was studied to a great depth by Khemakhem et al. (2020).
We propose to generalize such earlier models by allowing for an arbitrary base measure components of the latent variable are no longer independent, as
doesn’t necessarily factorize across dimensions. We call this new framework Independently Modulated Component Analysis (IMCA). We show in Appendix E that the strong identifiability guarantees developed for nonlinear ICA can be extended to IMCA, yielding a more general and more flexible principled framework for representation learning and disentanglement.
Estimation by ICE-BeeM Guided by the strong identifiability results above, we suggest augmenting our feature extractor f by output activation functions, resulting in the modified feature map . In Section 2.2.2 for instance, we used
These output nonlinearities play the role of sufficient statistics to the learnt representation
have a double purpose: to allow for strong identifiability results, and to match the dimensions of the components
of sufficient statistic in (5), as formalized by the following theorem.
Theorem 4. Assume:
(iii) The sufficient statistics are twice differentiable, and the functions
are linearly independent on any subset of X of measure greater than zero. Furthermore, they all satisfy
is non-monotonic
(iv) There exist k + 1 distinct points such that the matrix L =
is invertible, where
(v) We use a consistent estimator to fit the model (1) to the conditional density p(x|y), where we assume the feature extractor f(x) to be a -diffeomorphism and d-dimensional, and the vector-valued pointwise nonlinearities
to be differentiable and k-dimensional, and their dimensions to be chosen from
without replacement.
Then, in the limit of infinite data, where
is a permutation of [[1, d]] such that
and
is an invertible square matrix; that is: we can recover the latent variables up to a block permutation linear transformation and point-wise nonlinearities.
Our results greatly extend existing identifiability results and models. The closest latent variable model identifiability theory to ours is that of nonlinear ICA theory (Hyvärinen and Morioka, 2016; Hyvärinen et al., 2019; Khemakhem et al., 2020). These works formalized a trade-off between distributional assumptions over latent variables (from linear and independent to nonlinear but conditionally independent given auxiliary variables) that would lead to identifiability.
On this front, our first contribution was to identify that conditional independence is not necessary for identifiability, and to propose the more general IMCA framework. Our proofs extend previous ones to the non-independent case, and are the most general to date, even considering linear ICA theory. In fact, as a second contribution, our conditional EBM generalizes previous results by completely dropping any distributional assumptions on the representations—which are ubiquitous in the latent variable case.
Third, most of our theoretical results hold for overcomplete representations, which means that unlike the earlier works cited above, our model can be shown to even have universal approximation capabilities. Fourth, while recent identifiability theory focused on providing functional conditions for identifiability, such work is a bit removed from the reality of neural network training. Our results on network architectures are the first step towards bridging the gap between theory and practice.
Figure 1: Quantifying the identifiability of learnt representations using MCC (higher is better).
Transfer learning onto unseen classes using denoising score matching objective (lower is better).
Simulations on artificial nonlinear ICA/IMCA data (higher is better).
5.1 Identifiability of representations on image datasets
We explore the importance of identifiability and the applicability of ICE-BeeM in a series of experiments on image datasets (MNIST, FashionMNIST, CIFAR10 and CIFAR100). First, we investigate the identifiability of ICE-BeeM by comparing representations obtained from different random initializations, using an unconditional EBM as a baseline. We further present applications to transfer and semi-supervised learning, where we find identifiability leads to significant improvements. The different architectures used throughout these experiments are described in Appendix A.1. Code for reproducibility is available here.
Quantifying identifiability We start by empirically validating Theorems 1 and 2 on image datasets. Briefly, these theorems provided conditions for weak and strong identifiability of latent representations, respectively. We propose to study the weak and strong identifiability properties of both conditional and unconditional EBMs by training such models multiple times using distinct random initializations. We subsequently compute the mean correlation coefficient (MCC, see Appendix A.2) between learned representations obtained via distinct random initializations; consistent high MCCs indicate the model is identifiable. In the context of weak identifiability, we consider the MCC up to a linear transformation, A, as defined in (2). Throughout experiments, we employ CCA to learn the linear mapping A. However, our main interest is studying the strong identifiability of EBM architectures, defined in (3). To this end we consider the MCC directly on inferred representations (i.e., without a linear mapping A). Both an ICE-BeeM model and an unconditional EBM were trained on three distinct image datasets: MNIST, CIFAR 10 and 100. For each dataset, we train models using 20 distinct random initializations and compare inferred representations. Conditional denoising score matching (CDSM, see Appendix B.1) was employed to train all networks. Results presented in
Table 1: (a) Transfer learning. (b) Semi-supervised learning.
Figures [1a-1b] show that for ICE-BeeM, the representations were more consistent, both in the weak and the strong case, thus validating our theory. See Appendix A.3 for further details and experiments.
Application to transfer learning Second, we present an application of ICE-BeeM to transfer learning. We suppose that the auxiliary variable is the index of a dataset or a task. We propose an intuitively appealing approach, where we approximate the unnormalized log-pdf in y-th dataset log p(x|y) by a linear combination of a learned "basis" functions
as
, where the
are scalar parameters that act as coefficients in the basis
. For a new unseen dataset
, reducing the transfer learning to the estimation of the
requires that we have estimated the true
, at least up to a linear transformation. This is exactly what can be achieved by ICE-BeeM based on weak identifiability. To this end, an ICE-BeeM model was trained on classes 0-7 of MNIST and CIFAR10 using the CDSM objective. After training, we fix f and learn
for the unseen classes (we denote this by
; unseen classes are 8 & 9). We allow
to be parameterized by a vector for each class, which leads to a drastic simplification for the new classes. We compare against a baseline where both
and
are trained directly on data from unseen classes only (i.e. there is no transfer learning—denoted
). Results are presented in Figures [1c] and [1d] where we vary the sample size of the unseen classes and report the CDSM objective. Overall, the use of a pretrained f network improves performance, demonstrating effective transfer learning. We also compare against a baseline where we just evaluate the pretrained f on the new classes, while fixing g = 1 (without learning the new coefficients—denoted
); and a baseline where we estimate an unconditional EBM using new classes only (no transfer—denoted
average CDSM scores are reported in Table [1a], where the transfer learning with an identifiable EBM (i.e., using ICE-BeeM ) performs best. See Appendix A.4 for further details and experiments. We note here that based on strong identifiability, we could impose sparsity on the coefficients
which might improve the results even further.
Application to semi-supervised learning Finally, we also highlight the benefits of identifiability in the context of semi-supervised learning. We compared training both an identifiable ICE-BeeM model and an unconditional (non-identifiable) EBM on classes 0-7 and employing the learned features to classify unseen classes 8-9 using a logistic regression. In both cases, training proceeded via CDSM. Table [1b] reports the classification accuracy over unseen classes. We note that ICE-BeeM obtains significantly higher classification accuracy, which we attribute to the identifiable nature of its representations. See Appendix A.5 for further details and experiments.
5.2 IMCA and nonlinear ICA simulations
We run a series of simulations comparing ICE-BeeM to previous nonlinear ICA methods such as iVAE (Khemakhem et al., 2020) and TCL (Hyvärinen and Morioka, 2016). We generate non-stationary 5-dimensional synthetic datasets, where data is divided into segments, and the conditioning variable y is defined to be a segment index. First, we let the data follow a nonlinear ICA model, which is a special case of equation (4) where the base measure, , is factorial. Following Hyvärinen and Morioka (2016), the z are generated according to isotropic Gaussian distributions with distinct precisions
determined by the segment index. Second, we let the data follow an IMCA model where the base measure
is not factorial. We set it to be a Gaussian term with a fixed but nondiagonal covariance matrix. More specifically, we randomly generate an invertible and symmetric matrix
, such that
. The covariance matrix of each segment is now equal to
, meaning the latent variables are no longer conditionally independent. In both cases, a randomly initialized neural network with varying number of layers,
, was employed to generate the nonlinear mixing function h. The data generation process and the employed architectures are detailed in Appendix A.6.
In the case of ICE-BeeM, conditional flow contrastive estimation (CFCE, see Appendix B.2) was employed to estimate network parameters. To evaluate the performance of the method, we compute the mean correlation coefficient (MCC, see Appendix A.2) between the true latent variables and the recovered latents estimated by all three methods. Results for nonlinear ICA are provided in Figure [1e], where we note that ICE-BeeM performs competitively with respect to both iVAE and TCL. We note that as the depth of the mixing network, L, increases the performance of all methods decreases. Results for IMCA are provided in Figure [1f] where ICE-BeeM outperforms alternative nonlinear ICA methods, particularly when L = 4. This is because such other methods implicitly assume latent variables are conditionally independent and are therefore misspecified, whereas in ICE-BeeM , no distributional assumptions on the latent space are made.
We proposed a new identifiable conditional energy-based deep model, or ICE-BeeM for short, for unsupervised representation learning. This is probably the first energy-based model to benefit from rigorous identifiability results. Crucially, the model benefits from the tremendous flexibility and generality of EBMs. We even prove a universal approximation capability for the model.
We further prove a fundamental connection between EBMs and latent variable models, showing that ICE-BeeM is able to estimate nonlinear ICA, as a special case. In fact, it can even estimate a generalized version where the components do not need to be independent: they only need to be independently modulated by another variable such as a time index, history or noisy labels.
Empirically, we showed on real-world image datasets that our model learns identifiable representations in the sense that the representations do not change arbitrarily from one run to another, and that such representations improve performance in a transfer learning and semi-supervised learning applications.
Identifiability is fundamental for meaningful and principled disentanglement; it is necessary to make any interpretation of the features meaningful; it is also crucial in such applications as causal discovery (Monti et al., 2019) and transfer learning. The present results go further than any identifiability results hitherto and extend them to the EBM framework. We believe this paves the way for many new applications of EBMs, by giving them a theoretically sound basis.
This work is mainly theoretical, and aims to provide theoretical guarantees for the identifiability of a large family of deep models. Identifiability is very important, as it is key for reproducible science and interpretable results. For instance, if the networks behind search engines were identifiable, then their results would be consistent for most users. In addition, using perfectly identifiable networks in real life applications eliminates the randomness and arbitrariness of the system, and gives more control to the operator.
In general, identifiability is a desirable property. The system we develop here does not make any decisions, and thus can not exhibit any bias. Our theoretical guarantees abstract away the nature of the data and the practical implementation. Therefore, our work doesn’t encourage the use of biased data or networks with potentially dangerous consequences.
I.K. and R.P.M. were supported by the Gatsby Charitable Foundation. A.H. was supported by a Fellowship from CIFAR, and from the DATAIA convergence institute as part of the "Programme d’Investissement d’Avenir", (ANR-17-CONV-0003) operated by Inria.
Arbel, M., Sutherland, D., Bi´nkowski, M., and Gretton, A. (2018). On gradient regularizers for mmd gans. In Advances in Neural Information Processing Systems, pages 6700–6710.
Brosowski, B. and Deutsch, F. (1981). An elementary proof of the stone-weierstrass theorem. Proceedings of the American Mathematical Society, pages 89–92.
Ceylan, C. and Gutmann, M. U. (2018). Conditional Noise-Contrastive Estimation of Unnormalised Models. arXiv:1806.03664 [cs, stat].
Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. (2019). Neural spline flows. In Advances in Neural Information Processing Systems, pages 7509–7520.
Gao, R., Nijkamp, E., Kingma, D. P., Xu, Z., Dai, A. M., and Wu, Y. N. (2019). Flow contrastive estimation of energy-based models. arXiv preprint arXiv:1912.00589.
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304.
Gutmann, M. U. and Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709.
Hyvärinen, A. (2007). Some extensions of score matching. Computational statistics & data analysis, 51(5):2499–2512.
Hyvärinen, A. and Hurri, J. (2004). Blind separation of sources that have spatiotemporal variance dependencies. Signal Processing, 84(2):247–254.
Hyvärinen, A. and Morioka, H. (2016). Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, pages 3765–3773.
Hyvärinen, A. and Morioka, H. (2017). Nonlinear ICA of temporally dependent stationary sources. In The 20th International Conference on Artificial Intelligence and Statistics.
Hyvärinen, A. and Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13(4):411–430.
Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439.
Hyvärinen, A., Sasaki, H., and Turner, R. (2019). Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859–868.
Khemakhem, I., Kingma, D. P., Monti, R. P., and Hyvärinen, A. (2020). Variational Autoencoders and Nonlinear ICA: A Unifying Framework. In The 23rd International Conference on Artificial Intelligence and Statistics.
Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs].
Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224.
Kingma, D. P. and Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat].
Lin, G., Milan, A., Shen, C., and Reid, I. (2017). Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1925–1934.
Monti, R. P. and Hyvärinen, A. (2018). A Unified Probabilistic Model for Learning Latent Factors and Their Connectivities from High-Dimensional Data. arXiv:1805.09567 [cs, stat].
Monti, R. P., Zhang, K., and Hyvarinen, A. (2019). Causal discovery with general non-linear relationships using non-linear ica. arXiv preprint arXiv:1904.09096.
Rezende, D. J. and Mohamed, S. (2015). Variational Inference with Normalizing Flows. arXiv:1505.05770 [cs, stat].
Saremi, S., Mehrjou, A., Schölkopf, B., and Hyvärinen, A. (2018). Deep Energy Estimator Networks. arXiv preprint arXiv:1805.08306.
Song, Y. and Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895–11907.
Sorrenson, P., Rother, C., and Köthe, U. (2020). Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN). arXiv:2001.04872 [cs, stat].
Uehara, M., Kanamori, T., Takenouchi, T., and Matsuda, T. (2020). A unified statistically effi- cient estimation framework for unnormalized models. In International Conference on Artificial Intelligence and Statistics, pages 809–819.
Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674.
Appendix for
We divide the Appendix into 5 main sections:
• Section A: we give extensive details on the experimental setup, as well as additional experiments;
• Section B: we discuss the estimation algorithms we used with ICE-BeeM and how they can be extended to the conditional setting;
• Section C: we prove the identifiability of ICE-BeeM and its universal approximation capability;
• Section D: we show how ICE-BeeM estimates IMCA;
• Section E: we provide a thorough theoretical analysis of the IMCA framework and draw parallels to the identifiability results in nonlinear ICA.
A.1 Model architecture details
In this section, we describe the neural network architectures used for the experiments of Section 5.1, on the image datasets (MNIST, FashionMNIST, CIFAR10 and CIFAR100). Code to reproduce these experiments can be found in the supplementary material.
We can distinguish three different types of configurations:
1. A series of fully connected layers — denoted MLP. This configuration satisfies the assumptions of Section 2.3.
2. A mix of convolutional and fully connected layers — denoted ConvMLP. We expect this configuration to work better than an MLP for images.
3. A variant of a RefineNet (Lin et al., 2017), following Song and Ermon (2019), which implements skip connections to help low level information reach the top layers — denoted for simplicity Unet (RefineNets are modern variants of U-net architectures). This configuration is very advanced and complicated, and serves to test if identifiable representations can be learnt for modern architectures.
The detailed architectures are in Table [2].
After choosing one of the configurations, we can further chose to reduce the dimensionality of the features (), to use it in conjunction with positive features (condition 3 of Theorem 2) or with augmented features (condition 4 of Theorem 2). This results in the following nomenclature, where we will take as an example a ConvMLP network:
• If we reduce the dimension of the latent space ()—for example
, we denote the configuration by ConvMLP-50.
• If we used positive features, we denote the configuration by ConvMLP-p.
• If we used augmented features, we denote the configuration by ConvMLP-a.
• We can also have a mix of the above, for examples ConvMLP-50p.
• We can also have non of the above, in which case we simply write ConvMLP—implying that
We summarize the configurations used for the different experiments of Section 5.1 in Table [3].
For all the experiments, we used the Adam optimizer (Kingma and Ba, 2014) to update the parameters of the networks. We used a learning rate of 0.001, and was turned off, as well as weight decay. Data was fed to the networks in mini-batches of size 63, and the
Table 2: Architecture detail
ConvMLP Input: for all conv. layers Conv
, BatchNorm, ReLU padding 1, filter size 3 Conv
, BatchNorm, ReLU padding 1, filter size 3 MaxPool
Unet Input: for all conv. layers Conv
, filter size 3 4-cascaded RefinNet see Song and Ermon (2019) | activation: ELU exponential LU | normalization: InstanceNorm+ see Song and Ermon (2019) InstanceNorm+, ELU
Table 3: Architectures used in the experiments
training was done for 5000 iterations (no visible improvements in the results were observed after this many iterations). For CIFAR10 and CIFAR100 experiments, we introduced a random horizontal flip to the data, with probability 0.5.
We used conditional denoising score matching (CDSM, Appendix B.1) to train the energy models. The noise parameter used is
A.2 The MCC metric
To quantify identifiability, we use the mean correlation coefficient (MCC) metric, which computes the maximum linear correlations up to permutation of components. To obtain the value of this metric between two vectors x and y, we first calculate all pairs of correlation coefficients between the components of x, and the components
of y. Since the order of the components in each vector can be arbitrary, we have to account for possible permutations between the indices i and j. This is done by solving a linear sum assignment problem (for instance, using the auction algorithm). We finally average over all correlation coefficients (after finding the right permutation). This makes the MCC metric invariant by permutation and component-wise transformations (as a consequence of the transformation invariance of the correlation coefficient).
To better understand this metric, let’s consider the following example. Let be a bivariate random variable such that
. If we don’t account for any permutations, then the average correlation is equal to
. In reality, though, y and x are perfectly correlated, since the value of x completely determines that of y. Thus, we have to find the optimal permutation of the elements of y in order to maximize the average correlation. The MCC does this by computing all pair-wise correlations, and finding the assignment that maximizes the average correlation.
When the latent ground truth is known (Section 5.2—IMCA and nonlinear ICA simulations, for instance), we can test for identifiability of the components by comparing the recovered latents to this ground truth. A high MCC means that we recovered the true latents.
When the ground truth is unknown (Section 5.1—real image datasets), we compare pairs of learnt representations, each from a different random initialization. A consistently high MCC means that changing the random state of the model doesn’t drastically change the learnt representations.
A.3 Quality of representations
We argued that conditioning enables EBMs to learn identifiable representations. The results in Section 5.1 validate this. The plots presented in Figures [1a] and [1b] were produced using the Unet configuration, described in Table [3]. This architecture is complex and deep, and involves multiple layers for which a thorough theoretical analysis is very difficult, unlike MLPs for instance. In addition, the dimension of the latent space was chosen to be equal to that of the input space. Intuitively, we would expect that the chance of learning arbitrary representations increases as we increase the number of features because this increases the entropy of the system.
This allows us to challenge the capabilities of ICE-BeeM, and test its limits. We concluded from the results that the theory presented here does benefit modern deep learning architectures. This experiment serves to empirically validate our theoretical result, and is the first of its kind in recent identifiability literature, which focused on validating the theory on simulated data with well know ground truth.
The matrix A in equation (2) and the permutation in equation (3) were learnt from the first half of the test partition for each dataset. The evaluation of the MCCs was done on the remaining half of the test dataset.
We present further plots detailing the quality of the learnt representations on MNIST, FashionMNIST, CIFAR10 and CIFAR100 for a variety of different configurations in Figures [2] and [3].
Figure 2: Further experiments on the strong identifiability of learnt representations using the ConvMLP architecture on image datasets — C10/100 stands for CIFAR10 and CIFAR100, respectively.
Figure 3: Further experiments on the identifiability of representations using the Unet architecture on image datasets.
A.4 Transfer learning experiments
A.4.1 Intuition
As a practical application of our framework where identifiability is important, we consider metalearning, in particular multi-task and transfer learning. Assume we have N datasets, which could be, e.g., different subjects in biomedical settings, or different image datasets. This fits well with our framework, where y = 1, . . . , N is now the index of the dataset, or "task". The key question in such a setting is how we can leverage all the observations to better model each single dataset, and especially transfer knowledge of existing models to a new dataset.
To this end, we propose an intuitively appealing approach, where we approximate the unnormalized log-pdf in y-th dataset p(x; y) by a linear combination of a learned "basis" functions
where the are scalar parameters as a function of y, which act as coefficients in the basis
This linear approximation is nothing else than a special case of ICE-BeeM, but here, we interpret such an approximation as a linear approximation in log-pdf space. In fact, what we are doing is a kind of PCA in the set of probability distributions p(x; y). Such "probability space" PCA allows the models for the different datasets to learn from each other, as in the classical idea of denoising by projection onto the PCA subspace.
In transfer learning, we observe a new dataset, with distribution on our decomposition, we approximate
as in (6). This leads to a drastic simplification: we can learn the basis functions
from the first N datasets, then we only need to estimate the k scalar parameters
for the new dataset. The coefficients are likely to be sparse as well, which provides an additional penalty.
Reducing the transfer learning to estimation of the clearly requires that we have estimated the true
up to a linear transformation, which is the weaker form of identifiability in Theorem 1. Moreover, using a sparsity penalty is only meaningful if we have the true
without any linear mixing, which requires the stronger identifiability in Theorem 2.
Training can be done by any method for EBM estimation. In particular, it is very easy by score matching because equation (6) is an exponential family for fixed
A.4.2 Further experiments
The pre-training was done on labels 0-7 from the train partition for MNIST, FashionMNIST and CIFAR10, and on labels 0-84 from the train partition for CIFAR100. The second (transfer) step was done on labels 8-9 from the train partition for MNIST, FashionMNIST and CIFAR10, and on and labels 85-99 the train partition for CIFAR100.
Figure 4: Further results for transfer learning experiments on MNIST. In the case of transfer learning 99 out of a hundred returned digits are class 8 compared to only 58 in the baseline.
We considered a subset of size 6000 to produce the values in Table [1a]. This table should be read in conjunction with Figures [1c]-[1d] for a proper evaluation of performance.
We present further plots and results of transfer learning experiments in Figures [5]-[6] and Table [4] ran on MNIST, FashionMNIST, CIFAR10 and CIFAR100 for a variety of different configurations. for different configurations and datasets. We considered a subset of size 6000 to produce the values in Table 4. We expect the baseline where we don’t perform transfer learning to perform comparatively for such a subset size: transfer learning is mostly important when data is scarce. For the complete picture, this table should be read in conjunction with Figures [5]-[6].
As an additional way to visualize the results, Figure [4a] shows unseen MNIST samples (taken across all possible classes) which are assigned high confidence of belonging to the "new" class 8 after transfer learning, indicating that the ICE-BeeM model has learnt a reasonable distribution over unseen classes. By comparison the case where no transfer learning is employed (Figure [4b]), incorrectly assigns high confidences to other digits.
A.5 Semi-supervised learning
In this experiment, we train both an identifiable ICE-BeeM model and an unconditional (non-identifiable) EBM on classes 0-7. The purpose of this step is to learn a feature extractor that is able of learning meaningful features from the images. To test the quality of the features learnt by both models (the ICE-BeeM, and the unconditional EBM), we use the feature map
to classify unseen samples from classes 8-9. Results show that ICE-BeeM outperforms the unconditional baseline in this classification task. We attribute this to the identifiability of ICE-BeeM: our model seems to be performing a principled form of disentanglement by learning features that are faithful to the unknown factors of variation in the data.
Training was done on labels 0-7, using the train partition for MNIST, FashionMNIST and CIFAR10. Evaluation was done on labels 8-9, using the test partition for all three datasets. This data was in turn partitioned for the classification into a train and test split. The split proportion is 15% for MNIST and FashionMNIST, and 33% for CIFAR10 and CIFAR100.
We present further results for the semi-supervised learning experiments in Table [5], ran on MNIST, FashionMNIST, CIFAR10 for a variety of different configurations.
A.6 IMCA and nonlinear ICA simulations
We give here more detail on the data generation process for the simulations in Section 5.2, as well as the architectures used.
Figure 5: Further transfer learning — the dataset/configuration combo are reported in the captions.
Figure 6: Further transfer learning — the dataset/configuration combo are reported in the captions.
Table 4: Transfer learning — CDSM score (lower is better)
Table 5: Semi-supervised learning — classification accuracy (higher is better)
Data generation We generate 5-dimensional synthetic datasets following the nonlinear ICA model which is a special case of equation (4) where the base measure, , is factorial. In particular, we set it to
. As such, latent variables are conditionally independent given segment labels. The sources are divided into M = 8 segments, and the conditioning variable y is defined to be the segment index, uniformly drawn from the integer set [[1, M]]. Following Hyvärinen and Morioka (2016), the z are generated according to isotropic Gaussian distributions with distinct precisions
determined by the segment index. Second, we perform the same experiment but on data generated from an IMCA model where the base measure
not factorial. More specifically, we randomly generate an invertible and symmetric matrix
, such that
. As before, we define
to be the distinct conditional precisions. The precision matrix of each segment is now equal to
, meaning the latent variables are no longer conditionally independent.
For both nonlinear ICA and IMCA data, a randomly initialized neural network with varying number of layers, , was employed to generate the nonlinear mixing function h. Leaky ReLU with negative slope equal to 0.1 was employed as the activation function in order to ensure the network was invertible. The hidden dimensions of the mixing network are equal to the latent dimension
, and the output dimension is
Baseline methods The first baseline we compare to is TCL (Hyvärinen and Morioka, 2016), which is a self-supervised method for nonlinear ICA based on the nonstationarity of the sources. TCL learns to invert the mixing function h, by performing a surrogate classification task, where the goal is to classify original observations against their segment indices in a multinomial classification task. Its theory is premised on the fact that the feature extractor used for the classification has to extract meaningful latents in order to perform well in the classification task.
The second baseline is iVAE (Khemakhem et al., 2020), a nonlinear ICA method which uses an identifiable VAE to recover the independent sources. Its theory is premised on the consistency of maximum likelihood training, and on the flexibility of VAEs in approximating densities. They show that given enough data, the variational posterior learns to approximate the true posterior distribution, and can thus be used to invert the mixing function. The iVAE, like a regular VAE, is trained by maximizing the ELBO (Kingma and Welling, 2013).
Training of ICE-BeeM via flow contrastive estimation To demonstrate that ICE-BeeM can be trained by any method for training EBMs, we switched from denoising score matching to flow contrastive estimation (FCE, Appendix B.2). As a contrastive flow, we used a normalizing flow model (Rezende and Mohamed, 2015), with an isotropic and tractable base distribution. It is then transformed by a 10-layer flow, where each layer is made of a succession of a neural spline flow (Durkan et al., 2019), an invertible convolution (Kingma and Dhariwal, 2018), and an ActNorm layer (Kingma and Dhariwal, 2018). The flow parameters are updated by and Adam optimizer, with a learning rate of
Used architectures The architectures used to produce Figures [1e] and [1f] are summarized by Table [6].
It is important to note that the identifiability results presented above apply to conditional EBMs in general. As such, we may employ any of the wide variety of methods which have been proposed for the estimation of unnormalized EBMs. In this work we used two different options with good results for both: flow contrastive estimation (Gao et al., 2019) and denoising score matching (Vincent, 2011). Both methods can also be extended to the conditional case in a straightforward fashion.
Flow-contrastive estimation (FCE) can be seen as an extension of noise-contrastive estimation (Gutmann and Hyvärinen, 2012, NCE), which seeks to learn unnormalized EBMs by solving a surrogate classification task. The proposed classification task seeks to discriminate between the true data and some synthetic noise data based on the log-odds ratio of the EBM and the noise distribution. However, a limitation of NCE is the need to specify a noise distribution which can be sampled from and whose log-density can be evaluated pointwise but which also shares some of the empirical
Table 6: Architectures used in the simulations
ICE-BeeM Adam lr
batch norm after each FC layer hidden dim 32, LeakyReLU(0.1) act
learnable matrix
iVAE Adam Encoder p(z|x) Normal lr 3-layer MLP hidden dim
, LeakyReLU(0.1) act Decoder p(x|z, y) Normal 3-layer MLP hidden dim
, LeakyReLU(0.1) act Prior p(z|y) Normal 3-layer MLP hidden dim
, LeakyReLU(0.1) act
properties of the observed data. To address this concern Gao et al. (2019) propose to employ a flow model as the contrast noise distribution. FCE seeks to simultaneously learn both an unnormalized EBM as well as a flow model for the contrast noise in an alternating fashion. We naturally get a conditional version for FCE by learning a conditional EBM (Gao et al., 2019, eq. 12).
Score matching is another well-known method for learning unnormalized models (Hyvärinen, 2005). However, its computational implementation in deep networks is problematic, which is why Vincent (2011) proposed a stochastic approximation which can be interpreted as denoising the data, and which works efficiently in deep networks (Saremi et al., 2018; Song and Ermon, 2019).
B.1 Conditional denoising score matching
We extend the original score matching objective to the conditional setting in a natural way: for a fixed y, we compute the unconditional score matching objective: , and then average over all values of y. The expression of the conditional score matching objective is then:
We build on the recent developments by Vincent (2011), and introduce a conditional denoising score matching objective by replacing the unknown density by a kernel density estimator. Formally, given observations , we first derive nonparamteric kernel density estimates of p(x, y) and p(y), which we then use to derive the estimate for p(x|y) using the product rule. These estimates have the forms:
where and
are bounded kernel functions defined on X and Y and with bandwidths5 a and b, respectively. In the following, we assume that the bandwidth sequences are equal (
We replace p(x, y) and p(x|y) in (7) by their estimates and
, to arrive at the new objective
which is the conditional score matching objective when applied to the nonparametric estimates of the unknown target density. We will show below that it is equivalent to a simpler objective, in which we only need to compute gradients of the conditioning kernel
where the expectation is taken with respect to . We call this objective conditional denoising score matching. Its extrema landscape is the same as
, but it has the advantage of being simpler to evaluate and interpret.
Above, we presented this objective when is the Gaussian kernel, and
is simply the identity kernel.
From CSM to CDSM We will show here that the stochastic approximation used in denoising score matching can also be used for the conditional case to get to the CDSM objective (12) from the CSM objective (11):
where is a constant term that only depends on
Plugging this back into equation (13), we find that
where the expectation is with respect to is another constant that is only a function of
B.2 Conditional flow contrastive estimation
As described above, FCE learns the parameter for the density of an EBM by performing a surrogate classification task: noise is generated from a noise distribution
which is parameterized as a flow model, and a logistic regression is performed to classify observation into real data samples or noise samples. The objective function is simply the log-odds:
This objective is minimized with respect to and maximized with respect to
: the EBM and the flow model are playing a min-max game. This objective can be extended to the conditional case naturally: we replace the model density by the conditional density
. In the conditional case, it follows that noise samples should also be associated with a conditioning variable, y. One way this can be achieved is by also considering a conditional flow. This also has the additional benefit that an improved flow should lead to better estimation of EBM. Alternatively, a standard (non-conditional) flow could be employed. This would require marginalizing over the conditioning variable, y. The objective simply becomes:
We can write the flow density as . This is particularly useful when the conditioning variable y is discrete, like for instance the index of a dataset or a segment, as we can sample draw a index from a uniform distribution, and use the conditional flow to sample an observation.
Recall the form of our conditional energy model
We present in this section the proofs for the different forms of identifiability that is guaranteed for the feature extractors f and g. We will focus on the proofs for the feature extractor f, as the proofs for the feature extractor g are very similar. For the rest of the Appendix, we will denote by ,
C.1 More on the equivalence relations
The relation in equation (2) is an equivalence relation in the strict term only if A is full rank. If A is not full rank (which is only possible if
, given the rest of assumptions), then it is not necessarily symmetric. This is not a real problem, and can be fixed by changing the definition to: there exists
. We present the simpler version in the paper for clarity.
C.2 Proof of Theorem 1
We start by proving the main theoretical result of this paper, which applies to all dimensions of the feature extractor. Alternative and weaker assumptions are discussed after the proof. Theorem 1 (Identifiable conditional EBMs). Assume:
1. The feature extractor f is differentiable, and its Jacobian is full rank.
2. There exist such that the matrix
where is defined as follows:
)-matrix of rank at least
If, instead or in addition, we assume that:
3. The feature extractor g is differentiable, and its Jacobian is full rank.
4. There exist such that the matrix
)-matrix of rank at least
Finally, if and all assumptions 1- 4 hold, then the matrices A and B have full rank (equal to
Proof. We will only prove this theorem for the feature extractor f. The proof for g is very similar. Suppose assumptions 1 and 2 hold.
Consider two parameters
Then, by applying the logarithm to both sides, we get:
Consider the points provided by assumption 2 for
. We plug each of these points in (21) to obtain n + 1 such equations. We subtract the first equation for
from the remaining n equations, and write the resulting equations in matrix form:
where , and b =
. Since R is invertible (by assumption 2), we multiply by its inverse from the left to get:
where and
. Now since
is differentiable and its Jacobian is full rank (assumption 1), by differentiating the last equation we deduce that rank
, which in turn proves that
Finally, suppose that in addition, assumptions 4 holds. Then there exists such that
. Plugging these n + 1 points into equation (23), and subtracting the first equation for
from the remaining n equations, we get
Since Q is an invertible matrix, we conclude that A is also invertible, which concludes the proof.
Intuition behind assumption 2 Assumption 2 requires that the conditioning feature extractor g has an image that is rich enough. Intuitively, this relaxes the amount of flexibility the main feature extractor f would need to have if g were to be very simple. It implies that the search for f will be naturally restricted to a smaller space, for which we can prove identifiability.
Proof under weaker assumptions Assumption 1 of full rank Jacobian can be weakened without changing the conclusion of Theorem 1. In fact, this assumption is only used right after equation (23) to prove that the matrix A has a rank that is at least equal to min(n, d). Suppose instead that
Then by computing the differential of equation (23) at (assuming that
exists), we can make the same conclusion on the rank of A.
In fact, this condition can be scrapped altogether if we relax the definition of the equivalence class in Appendix C.1 to have no conditions on the ranks of matrices . This however comes at the expense of a relatively weak, and potentially meaningless, equivalence class.
Finally, assumption 2 of Theorem 1 can be replaced by requiring the Jacobian of to be differentiable and full rank in at least one point, but this requires the conditioning variable to be continuous.
C.3 Proof of Proposition 1
Proposition 1. Consider an MLP with L layers, where each layer consists of a linear mapping with weight matrix , followed by an activation function. Assume
c. The row dimension of the weight matrices are either monotonically increasing or decreasing:
Then the MLP has a full rank Jacobian almost everywhere. If in addition, , then the MLP is surjective.
Proof. Denote by x the input to the MLP, and by the output of layer l,
with h in equation (27) is applied to each element of its input, and
Denote by the vector whose elements are
which is undefined if . Note that
is a function of its input, and thus of x, but we keep this implicit for simplicity. Using these notations, and the fact that h is piece-wise linear, we can write,
where . This is of course only possible if
. As such, define the set
where , we have that
is full rank, and, using Lemma 2,
is also a full rank matrix.
While it is true that are functions of x, yet they only take a finite number of values. Thus, the set
is included in the union over all the values taken by
and
up to layer l. For each of these values, the set becomes a dot product between a row of
independent of the input x, and is nonzero because
is full rank; such set has measure zero in
Thus, N is included in a finite union of sets of measure zero, which implies that it also has measure zero.
which is full rank. Thus, the MLP has a full rank Jacobian almost everywhere.
The surjectivity is easy to prove since h is surjective and so is as a function of
if
and rank
Lemma 1. Denote by the smallest singular value of a matrix A. Let M be an
matrix, and N be an
matrix, such that
or
. Then
Proof. The proof in the case can be found in (Arbel et al., 2018, Lemma 10), but we provide a proof here for completeness, and for the other case
Let the smallest eigenvalue of A. Recall that for a matrix
with
Thus, if the null space of N is non trivial, then , and the inequality is satisfied. Otherwise, we have
If, instead,
Similarly, if the null space of is non trivial, then
, and the inequality holds. Otherwise, we have
which concludes the proof.
Lemma 2. Consider a finite sequence of matrices , with
. If
is full rank for all
, and either
or
, then the product
is also full rank.
Proof. If two matrices and
with ordered dimensions are full rank, then
and
. According to Lemma 1, this implies that
rank. The proof for
is done by induction on
C.4 Proof of Proposition 2
Linear MLPs The particular case of linear feature extractors is quite interesting. If feature extractor g satisfies the assumptions of Proposition 1, then assumption 2 is trivially satisfied. On the other hand, if
, then assumption 2 can’t hold when the network is linear. This signals that it is important to use deep nonlinear networks to parameterize the feature extractors, at least in the overcomplete case.
Proposition 2. Consider an MLP g with L layers, where each layer consists of a linear mapping with weight matrix , followed by an activation function. Assume
Then there exist points
such that the matrix R =
is invertible.
Proof. Let be an arbitrary point in
. Without loss of generality, suppose that
is because
is still an MLP that satisfies all the assumptions above. If for any choice of points
, the matrix R defined above isn’t invertible, then this means that
is necessarily included in a subspace of
of dimension at most
. In other words, this would imply that the functions
are not linearly independent. However, this is in contradiction with the result of Lemma 8, which stipulates that
are linearly independent, provided all weight matrices satisfy the assumptions of the lemma (which are the same as the assumptions made in this proposition).
Thus, we can conclude that there exist points
such that the matrix R =
is invertible.
Proof under weaker conditions Note that the proof argument used for the overcomplete case can be used for the undercomplete as well. This same argument can be proved for ReLU as the nonlinearity instead of LeakyReLU. We chose to give the proof for, and suggest to use the latter because it is needed for Proposition 1.
Lemma 3. Let A be an invertible matrix. Denote by
the n-th row of A. Then the matrix
is invertible for any choice of
Proof. Denote by
Then in particular, by looking at the first n lines of this vectorial equation, we have that is invertible, its rows are linearly independent, and thus
and
. Plugging this back into equation (36), and looking closely at the last equation, we have that
, and we conclude that
(because
), and that B is invertible.
Lemma 4. Consider n affine functions , such that the matrix
whose rows are the
is full column rank, and all its submatrices of size
are invertible if d < n. Then there exist n non-empty regions
defined by the signs of the functions
(for instance,
such that the matrix
defined as
is invertible.
Proof. We will prove this Lemma by induction on n the number of functions . Denote by
. The sign of
changes if we cross the hyperplan
First, suppose that n = 2. By assumption, we now that , and thus the hyperplans
are not parallel and divide
regions. This implies that the regions
are not empty.
Second, suppose that there exists n regions such that the the matrix
is invertible. Consider the affine function
. The hyperplan
intersects at least one of the regions
. This is because
are linearly independent for any J of size min(d, n + 1) such that
, and thus there exists
such that
. Suppose without loss of generality that this region is
. Denote by
. Now consider the matrix
such that
and
. Because
, we have that
and thus
, which implies that
is also invertible. Now define
. Again, the inclusion implies that
. Finally, consider the regions
, and the matrix
defined on those regions. Then
where and
is the n-th line of
. According to Lemma 3,
is invertible, which achieves the proof.
Lemma 5. Let h denote a LeakyReLU activation function with slope (if
, then h is simply a ReLU). Consider n piece-wise affine functions
, such that the matrix
whose rows are the
is full column rank, and all its submatrices of size
invertible if d < n. Then the functions
are linearly independent, and their generalized slopes (as piece-wise affine functions) are also linearly independent.
The assumptions of Lemma 4 are met for the function , and we conclude that there exists n regions
such that
is invertible. Define the matrix
where we replace all entries of
by
if they are equal to
. Then
is invertible (in fact, to see this, consider the proof of the previous lemma with the slightly unconventional choice of sign function
Now consider
Let , and evaluate this equation at
. After taking the gradient with respect to x, we get
Denote by -th line of the matrix
, and define
. We can write the l-th line of equation (39) as:
Collating these equations for a fixed
which implies that because S is invertible. In particular,
for all
and
[[1, d]]. This implies that
, where
of size
is a submatrix of A and
. Since we know, by assumption, that
is invertible for any choice of set of indices J (relevant when n > d), we conclude that
the functions
are linearly independent.
Each function is a piece-wise affine function, with a "generalized slope" equal to
. As a corollary of the independence of
, we can conclude that the slopes
are also independent.
Lemma 6. Let be a vector-valued function defined on
. We suppose that
are linearly independent piece-wise affine functions, and that their generalized slopes
are also linearly independent. Consider m piece-wise affine functions
, such that the matrix
whose rows are the
is full column rank, and all its submatrices of size
are invertible if n < m. Then there exist m non-empty regions
defined by the signs of the functions
such that the matrix
as
is invertible.
Proof. Denote by the generalized slope of the p.w. affine function
. The key is to show than under the assumptions made here, the slopes
are linearly independent for any choice of subset
If m > n, chose a subset of size n, and let
such that
. By replacing
by its expression, we get:
. Since
are linearly independent, we conclude that
for all
. This, along with the full rank assumption on C prove that
are linearly independent. We can use the same argument if, instead,
, and conclude.
The rest of the proof follows the same argument of the proof of Lemma 4: we proceed by induction on m. For m = 2, we know that , and so the "generalized hyperplans" defined by these two vectors divide
into at least 3 different regions, 2 of which yield a matrix
that is invertible. Then, if the result hold for m, then the hyperplan defined by the generalized slope of the (m + 1)-th p.w. affine function
necessarily intersects one of the regions
since for any subset J of size min(m + 1, n) s.t.
, the generalized slopes
are linearly independent. The rest is identical to Lemma 4.
Lemma 7. Let h denote a LeakyReLU activation function with slope (if
, then h is simply a ReLU), and
be a vector-valued function defined on
. We suppose that
are linearly independent piece-wise affine functions, and that their generalized slopes
are also linearly independent. Consider m piece-wise affine functions
, such that the matrix
whose rows are the
is full column rank, and all its submatrices of size
are invertible if n < m. Then the functions
linearly independent, and their generalized slopes are also linearly independent.
Proof. Let such that
. The assumptions of Lemma 6 are met for the functions
, and we conclude that there exists m regions
such that
is invertible. Let
the invertible matrix equal to
after substituting
Now consider such that
After taking the gradient with respect to x, we get:
Since are independent, we conclude that
for all
[[1, m]]. This in turn implies that
Let , and evaluate the last equation at
This last equation is similar to equation (39), and we can use the same argument used for the proof of Lemma 5 here (using instead of
) and deduce that
We conclude that are linearly independent, and so are their generalized slopes as a consequence.
Lemma 8. Let be the output of an L-layer MLP (we assume that
is at least one nonlinearity) that satisfies:
(a.) All activation functions are LeakyReLUs with slope , then the activation function is simply a ReLU).
(b.) All weight matrices are full rank, and all submatrices of
are invertible if
Then are linearly independent. In addition, all the intermediate features
are also linearly independent.
Proof. We prove the Lemma by induction on the number of layers , then by Lemma 5, we conclude that
are independent. If we suppose the result hold for
, we can use Lemma 7 to prove that it also holds for L + 1. Finally, since all layers satisfy the same conditions, the conclusion also applies to intermediate layers.
C.5 Proof of Theorem 2
We will decompose Theorem 2 into two sub-theorems, which will make the proof easier to understand, but also more adaptable into future work. Each of these sub-theorems corresponds to one of the assumptions.
C.5.1 Positive features
We will prove here a more general version where we assume that each component of the feature extractor f has a global minimum that is reached, instead of being necessarily non-negative.
Theorem 2a. Assume the assumptions of Theorem 1 hold. Further assume that , and that each
has a global minimum that is reached at least in the limit, and the feature extractor
is surjective onto the set that is defined by the lower bounds of the
where is defined as follows:
where is a permutation of
is a non zero scalar and
is a scalar.
Proof. Consider two different parameters such that:
To simplify notations, denote by and
. We start the proof from the conclusion of Theorem 1, since its assumptions hold:
where A is an invertible matrix and c a constant vector. Without loss of generality, we can suppose that
has an infimum equal to zero, simply by subtracting
, and including in c, and similarly for
. We will also suppose that the infima are reached, as the next argument would hold if we change exact minima by limits.
Now since and is surjective, then there exists
such that
. This implies that
. We know that
and is surjective, and so h is also surjective, and its image includes
be the matrix of canonical basis vectors, or positive scalar multiples of the canonical basis vectors
These must be mapped to the non-negative quadrant, so AI must be non-negative, which implies that A must be non-negative.
Denote by is also non-negative for the same reasons described above. Denote the rows of
. We have by definition of inverse:
where if i = j then , else
. Now, assume there is a row
which has at least two non-zero entries. By the property above,
of the vectors
must have zero dot-product with that vector. By non-negativity of B and A, those
vectors must have zeros in the at least two indices corresponding to the non-zeros of
. But that means they can only span a
-dimensional subspace, and all the
together can only span a
-dimensional subspace. This is in contradiction of the invertibility of B. Thus, each
can have only one non-zero entry, which, together with the invertibility of A, proves it is a scaled permutation matrix.
Thus, there exists a permutation , such that
, which concludes the proof.
C.5.2 Augmented features
Theorem 2b. Assume that , and that:
Proof. Similarly to the proof of Theorem 2a, we pass the features through the nonlinear function
which produces the augmented features
introduced in section 2.2.2.
Consider two different parameters such that:
Since we have similar assumptions to Theorem 1, we will skip the first part of the proof and make the same conclusion, where the equivalence up to linear transformation here applies to
where matrix of rank at least
are full rank (A is not necessarily invertible yet, but this will be proven later) and c a constant vector. By replacing H by its expression, we get:
where each is an
matrix, and
. To simplify notations, denote by
. We will also drop reference to
lines in the previous equation are:
and the last n lines are:
Fix an index i in equations (53) and (54). To alleviate notations and reduce the number of subscripts and superscripts, we introduce ,
,
,
,
and
. This proof is done in 5 steps. Note that the surjectivity assumption is key for the rest of the proof, and it requires that we set the dimension of the feature extractor to be lower than the dimension of the observations.
By equating equations (54) and (53) after squaring, we get, using our new notations:
Step 1 First, since h is surjective, there exists a point where it is equal to zero. Evaluating equation (55) at this point shows that
Step 2 Second, the left hand side of equation (55) has terms raised to the power 4. These terms grow to infinity much faster than the rest of the terms of the rhs and the lhs. It is thus equal to zero. More rigorously, consider the vectors where the only non zero entry is y at the l-th position. Each of these vectors has a preimage by h (since it is surjective), which we denote by
. By evaluating equation (55) at each of these points, we get
Divide both sides of this equation by , then take the limit
. The right hand side will converge to 0, while the left hand side will converge to
, which shows that
. By doing this process for all
, we can show that b = 0.
Step 3 So far, we’ve shown that (55) becomes, after expanding the square in the lhs, and writing
Let’s again consider the vectors from earlier, and their preimages
. By evaluating (57) at the points
Divide both sides by y, and take the limit . The lhs converges to
, while the rhs converges to
. Since this is valid for all
, we conclude that
. It also follows that
Step 4 Injecting this back into equation (57), and writing are left with:
By applying the trace operator to both sides of this equation, and rearranging terms, we get
which is of the form . This is a dot product on the space
matrices (both C and B(x) are symmetric!), which is a vector space of dimension
. If we can show that the matrix C is orthogonal to a basis of
, then we can conclude that C = 0.
For this, let be the Euclidean basis of
, where each vector
has one non-zero entry equal to 1 at index
be the Euclidean basis of
, where each matrix
has only one non-zero entry equal to 1 at row i and column j.
Now since h is surjective, there exists such that
, and
. The n different
give us our first n matrices we will use to construct a basis of
. We now need to find
remaining basis matrices. For this, consider the sums
, of which there is exactly
. Each of these sums of vectors have a preimage
by h, and
, which is a matrix in
linearly independent of all
, and all other
because they have non-zero entries at different rows and columns.
We have then found a total of different vectors
such that their images by
form a basis of
. If we now evaluate equation (60) at each of these points, we find that the matrix
is orthogonal to a basis of
, which implies that it is necessarily equal to 0. This in turn implies that
is a diagonal matrix, and that
implies that at most one
is non-zero.
Step 5 So far, we have proven that, among other things, for all i, j. We now go back to equation (53), which we can write as:
Both f and h are differentiable, and according to assumption 2, has rank n (it is full rank and
). Thus, by differentiating the last equation, we conclude that
has rank n, and is thus invertible.
Conclusion We’ve shown that , where
. This is valid for all
. Now since
is invertible, the non-zero entry
has to be in a different column for each row, otherwise some rows will be linearly dependent. Thus, there exists a permutation
of [[1, n]], such that
, and we deduce that
which concludes the proof.
From the second conclusion of step 3, we have that . Combined with the fact that exactly one element of a is nonzero such that
is full rank, this implies that
is also full rank, which in turn means that A is full rank.
C.6 Proof of Theorem 3
Theorem 3. Let p(x|y) be a conditional probability density. Assume that X and Y are compact Hausdorff spaces, and that p(x|y) > 0 almost surely . Then for each
, there exists
, where n is the dimension of the feature extractor, such that
Proof. We consider here two cases.
Continuous auxiliary variable Recall the form of our model:
By parameterizing each of as neural networks, these functions can approximate continuous function on their respective domains arbitrarily well. According to Lemma 9, this implies that any continuous function on
can be approximated arbitrarily well by a term of the form
Thus, any continuous function can be approximated by for some
, where Z(y) captures the difference in scale between the function in question and the normalized density
. We apply this result to
, there exists
such that:
Since p(x|y) > 0 a.s. on is finite and bounded. So is the term
. We can then use the fact that exp is Lipschitz on compacts to conclude for p(x|y), to conclude that:
where K is the Lipschitz constant of exp, which concludes the proof.
Discrete auxiliary variable If y is discrete and Y is compact, then y only takes finitely many values. In this case, we do not need Lemma 9 for the proof. g(y) can simply be a lookup table, and we learn different approximations for each fixed value of y, since f has the universal approximation capability, which concludes the proof.
Denote by C(X) (respectively ) the Banach algebra of continuous functions from X (respectively Y and
) to R. For any subsets of functions
and
, let
be the set of all linear combinations of products of functions from
. The energy function defining our model belongs to this last set. Finally, universal approximation is expressed in terms of density: for instance, the set of functions
have universal approximation of C(X) if it is dense in it, i.e. for any function in C(X), we can always find a limit of a sequence of functions of
that converges to it. We mathematically express density by writing
Let (respectively
) be the set of deep neural networks with input in X (respectively in Y). The universal approximation capability is summarised in the following Lemma.
Lemma 9 (Universal approximation capability). Suppose the following:
then . All completions here are with respect to the infinity norm.
Proof. We prove this theorem in two steps:
1. We first prove that is dense in
using the hypotheses of Theorem 3. 2. we prove that
is dense in
using Theorem 5.
Step 1 Let . Let
. Then there exists
and functions
and
such that
. For each i, since
dense in C(Y), there exists
such that
. From
dense in C(X), there exists
such that
Using this, we conclude that
which proves that is dense in
Step 2 We will use the Stone-Weirstrass theorem for this step. It is enough to show that:
To prove (i), we use the fact that every finite product of compact spaces is a compact space, and every finite product of Hausdorff spaces is a Hausdorff space. Points (ii) and (iii) are easy to verify. To prove (iv), let (x, y) and be distinct points in
. Assume that
(we proceed similarly if
). Define the continuous function
such that
and
. Then for
All the conditions required to use the Stone-Weirestrass Theorem are verified, and we can conclude that is dense in
Conclusion Combining the results of steps 1 and 2, we conclude that is dense in
Definition 3. Let K be a compact Hausdorff space. Consider the Banach algebra C(K) equipped with the supremum norm
Theorem 5 (Stone-Weirstrass). Let K be a compact Hausdorff space, and A a unital sub-algebra of C(K) which separates points of K. Then A is dense in C(K).
Proof. A proof to this theorem can be found in many references, for instance Brosowski and Deutsch (1981).
Recall the generative model of IMCA: we observe a random variable as a result of a nonlinear transformation h of a latent variable
whose distribution is conditioned on an auxiliary variable y that is also observed:
We assume the latent variable in the IMCA model has a density of the form
where is not necessarily factorial.
Further, we will suppose that the density p(z|y) belongs to the following subclass of the exponential families, introduced by Khemakhem et al. (2020): Definition 4 (Strongly exponential). We say that an exponential family distribution is strongly exponential if for any subset X of R the following is true:
where is the Lebesgue measure.
If we suppose that only n out of d components of the latent variable are modulated by the auxiliary variable y (equivalently, if we suppose that the parameters are constant), then we can write its density as
The term is absorbed into
. This last expression will be useful for dimensionality reduction.
To estimate the latent variables of the IMCA model, we fit an augmented version of our energy model
where is a (nonlinear) output activation. An example of such map is
In this section, we present the proofs for the estimation of the Independently Modulated Component Analysis by an identifiable energy model. These proofs are based on similar ideas and techniques to previous proofs, but are different enough that we can’t forgo them.
D.1 Assumptions
We prove dimensionality reduction capability in Theorem 6. We will decompose Theorem 4 into two sub-theorems, which will make the proof easier to understand, but also more adaptable into future work. For the sake of clarity, we will separate its assumptions into smaller assumptions, and refer to them when needed in the proofs.
(i) The observed data follows the exponential IMCA model of equations (68)-(70). (ii) The mixing function ) is invertible. (iii) The sufficient statistics
are differentiable, and the functions
are linearly independent on any subset of X of measure greater than zero. (iv) There exist k + 1 distinct points
such that the matrix
(v) We fit the model (73) to the conditional density p(x|y), where we assume the feature extractor f(x) to be differentiable, d-dimensional, and the pointwise nonlinearitiy H to be differentiable and k-dimensional, and the dimension of its vector-valued components be chosen from
without replacement.
(viii') The feature extractor f in (73) is a -diffeomorphism, and the nonlinearities
have a unique extremum.
(ix) Only components of the latent variable are modulated, and its density has the form (72).
(x) The feature extractor f has the form where
, and the auxiliary feature extractor g has the form
constant.
D.2 Lemmas
We rely on the following Lemmas from Khemakhem et al. (2020), which we state below in the interest of completeness.
Lemma 10. Consider an exponential family distribution with components. Then the components of the sufficient statistic T are linearly independent.
Lemma 11. Consider a strongly exponential family distribution such that its sufficient statistic T is differentiable almost surely. Then almost everywhere on
Lemma 12. Consider a strongly exponential distribution of size with sufficient statistic
. Further assume that T is differentiable almost everywhere. Then there exist k distinct values
are linearly independent in
Lemma 13. Consider a strongly exponential distribution of size with sufficient statistic T. Further assume that T is twice differentiable almost everywhere. Then
almost everywhere on R.
Lemma 14. Consider n strongly exponential distributions of size with respective sufficient statistics
. Further consider that the sufficient statistics are twice differentiable. Define the vectors
, such that
, where the non-zero entries are at indices
. Then the matrix
of size
has rank 2n almost everywhere on
D.3 Proofs
As mentioned above, we decompose Theorem 4 into two smaller results, summarized in what follows by Theorems 4a and 4b.
Theorem 4a. Assume assumptions (i)-(v) hold. Then, after convergence of our model true density p(x|y), we can recover the latent variables up to an invertible linear transformation and point-wise nonlinearities, i.e.
where A is an invertible matrix.
Proof. We fit our density model (73) to the conditional density p(x|y), setting the dimension of the feature extractor f to be equal to d, and the dimensions of the output nonlinearities chosen from
, as per assumption (v):
by doing the change of variable x = h(z), taking the log on both sides, we get:
Let be the points provided by assumption (iv) of the theorem, where
, and
. Define
and
. We plug each of those
such equations. We subtract the first equation for
from the remaining k equations to get for l = 1, . . . , k:
The crucial point here is that the non factorial terms cancel out when we take these differences. This is what allows us to generalize the identifiability results of nonlinear ICA to the context of IMCA.
Let L bet the matrix defined in assumption (iv), and . Expressing (79) for all points
in matrix form, we get:
By assumption (iv), L is invertible, and thus we can write
where
To prove that A is invertible, we first take the gradient of equation (81) with respect to z. The Jacobian of T is a matrix of size
. Its columns are independent because each
is only a function of
, and thus the non-zero entries of each column are in different rows. This means that its rank is d (since
). This is not enough to prove that A is invertible though. For that, we consider the functions
for which
: for each of these functions, using Lemma 12, there exists points
are independent. Collate these point into
vectors
, where for each
, and
is a point such that
if
. We plug these vectors into equation (81) after differentiating it, and collate the
equations in vector form:
where and
. Now the matrix M is of size
, and it has exactly k independent columns by definition of the points
. This means that M is of rank k, which in turn implies that rank
matrix, we conclude that A is invertible.
The theorem above shows a first step in identifiability which holds up to a linear transformation. This is similar to Hyvärinen et al. (2019), but here we allow for dependencies between components. We can further sharpen the result, in line with Khemakhem et al. (2020) even in this non-independent case as follows:
Theorem 4b. Assume assumptions (i)-(v) hold. Further assume that either assumptions (vi)-(viii) or assumptions (vi')-(viii') hold. Then equation (75) can be reduced to the component level, i.e. for each
where is a permutation of [[1, d]] such that
and
a square invertible matrix.
Proof. We prove this theorem separately for both sets of assumptions.
Multi-dimensional sufficient statistics: assumptions (vi)-(viii) We suppose that The assumptions of Theorem 4a hold, and so we have
for an invertible . We will index A by four indices
refer to the rows and
to the columns.
Let . Since both
-diffeomorphisms (assumptions (vii), (viii)), we can invert this relation and write z = v(y). We introduce the notations
and
. Each line of equation (84) can be written as:
for . The first step is to show that
is a function of only one
, for all
. by differentiating (85) with respect to
and by differentiating (86) with respect to
This equation is valid for all pairs (s, t), t > s. Define , such that the non-zero entries are at indices (2a, 2a + 1) and e(y) :=
. Then by grouping equation (87) for all valid pairs (s, t) and pairs (i, l) and writing it in matrix form, we get:
Now by Lemma 14, we know that e(y) has rank 2d almost surely on Z. Since A is invertible, it is full rank, and thus rank(e (y)A) = 2d almost surely on Z. It suffices then to multiply by its pseudo-inverse from the right to get
In particular, . This means that the Jacobian of v at each y has at most one non-zero entry in each row. By invertibility and continuity of
, we deduce that the location of the non-zero entries are fixed and do not change as a function of y. We deduce that there exists a permutation
of [[1, d]] such that each of the
, and the same would apply to
. Without any loss of generality, we assume that
is the identity.
Now let . This function is a pointwise function because
functions. Plugging this back into equation (84) yields:
The last equation is valid for every component:
By differentiating both sides with respect to
By Lemma 10, we get for all
. Since (92) is valid for all l and all
, we deduce that the matrix A has a block diagonal form:
which achieves the proof.
One-dimensional sufficient statistics: assumptions (vi')-(viii') We now suppose that The proof of Khemakhem et al. (2020, Theorem 3) can be used here, where we define
and
, where
. We can then rewrite equation (85) for every component as:
which is the same as equation (45) of Khemakhem et al. (2020). All the assumptions required to prove their theorem are met in our case, and the rest of their proof would simply apply here to prove that A is a permutation matrix.
In practice, it is a natural desire to have the feature extractor reduce the dimension of the data, as it is usually very large. This has been achieved in nonlinear ICA before (Khemakhem et al., 2020; Hyvärinen and Morioka, 2016). It turns out that we can also incorporate dimensionality reduction in IMCA and its estimation by ICE-BeeM, under some assumptions.
Theorem 6. Assume either of the following hold:
• Assumptions (i)- (v), (vi')- (viii'), and (ix)- (x).
Then recovers only the modulated latent components as per Theorem 4b.
Proof. The proof of Theorem 4a in this case is unchanged. Simply, we update the total dimension of matrix L here to . when we evaluate equation (78) on these points
, the constant term
and the non-modulated components cancel out, and we are left with the equation
We then use similar arguments to the proof of Theorem 4a to conclude that
where a square invertible matrix. At this point, we can make the same conclusion as Theorem 4a, while reducing the dimension of the latent space.
We now explain how we can extend Theorem 4b to the lower dimensional latent space case. Note that we still assume that is a diffeomorphism per assumptions (viii) and (viii'). We can then still define
We consider now two cases like in the proof of Theorem 4b.
One-dimensional sufficient statistics Let and
. We can still write equation (96) like equation (94) as
for all . The assumptions required for the proof are still met, despite reducing the dimension from d to n. This interesting fact is also used for the proof of Theorem 2b as well, which achieves this part of the proof.
for all . We proceed similarly to the proof of Theorem 4b, replacing all mentions of d by n and keeping all differentiations to indices
, up to equation (89), after which we can conclude that
. This is not enough to conclude that each of the
is only function of one
For that, we go back to equation (98) and differentiate it with respect to
which is valid for all is invertible, we can conclude that
for all
. Since we only consider strongly exponential distributions (assumption (iii)), and using proposition 11, we conclude that
almost everywhere, and that
for all s > n. This, in addition to the fact that
for all
, and all
allows us to conclude that the first n components of v are each only a function of one different
diffeomorphism and its Jacobian is continuous. Finally, we can use this fact to deduce that A is a block permutation matrix, which achieves the proof.
As mentioned in section 3, linear latent variable models (Hyvärinen and Oja, 2000) and more recently nonlinear latent variable models may be identifiable provided some additional auxiliary variables (Khemakhem et al., 2020; Hyvärinen et al., 2019). The purpose of this auxiliary variable serves to introduce additional constraints over the distribution over latent variables, which are typically required to be conditionally independent given the auxiliary variable. This avenue of research has thus formalized the trade-off between expressivity of the mapping between latents to observations (from linear to nonlinear) and distributional assumptions over latent variables (from independent to conditionally independent given auxiliary variables).
We would like to relax the assumption of independence while maintaining identifiability, resulting in the framework of Independently Modulated Component Analysis (IMCA). In this section of the Appendix, we will give a detailed analysis of the IMCA model independently of any estimation method, drawing parallels to the identifiability results of the nonlinear ICA model presented in Khemakhem et al. (2020).
E.1 Definition of the generative model
Assume we observe a random variable as a result of a nonlinear transformation h of a latent variable
whose distribution is conditioned on an auxiliary variable y that is also observed:
The main modelisation assumption we make is on the latent variable distribution, given by the following definition, where u is a dummy variable.
Definition 5 (Exponentially factorial distributions). We say that a multivariate exponential family distribution is exponentially factorial if its density p(u) has the form
We assume that the latent variable in the IMCA model has a conditional exponentially factorial distribution, where the parameters of the exponential family are a function of the auxiliary variable y:
Equations (100) and (101) together define the nonparametric IMCA model with parameters . Most importantly, we allow for an arbitrary base measure
, i.e. the components of the latent variable must no longer be independent, as
doesn’t necessarily factorize across dimensions. The crucial assumption is that the components of the latent variables are independently modulated given the auxiliary variable y, and that through the term
E.2 Identifiability
The concept of identifiability is core to this work. As such, it is important to understand the different views one can have of this concept.
According to the conventional definition, a probabilistic model is identifiable iif the mapping
is bijective, i.e.
. However, this definition is very restrictive and impractical.
Often, the identifiability form we can prove for a model is equality of the parameters up to some indeterminacies. This can be understood as an equivalence relation between parameters. Identifiability in this context implies that the equivalence class of the ground truth parameter can be uniquely recovered from observations. This is relevant only if the definition of the equivalence class is sufficiently narrow and specific to be able to make meaningful conclusions. One example of such equivalence relations can be found in linear ICA: the mixing matrix is uniquely recovered up to a scaled permutation. The permutation is irrelevant, and the scaling is circumvented by whitening the data. This is a good example of an equivalence class that doesn’t restrict the practical utility of the ICA model.
An example of indeterminacy which is relevant to us here can be found in variational inference of latent variable models: two parameters are equivalent if they map to the same inference distribution (Khemakhem et al., 2020). This is the definition we will be using in this work. We will say that a generative model is identifiable if we can uniquely recover the latent variables, as given by the following definition.
Definition 6. Consider two different sets of parameters and
, defining two densities
. We say that the IMCA model is strongly identifiable if
where is a permutation,
is an invertible matrix, and
We say that it is weakly identifiable if
where A is an invertible matrix, and b a vector.
E.3 Theoretical analysis
In this section, we develop the theory of IMCA. We will give sufficient conditions that guarantee a strong identifiability of the latent components, and discuss a degenerate case where we only obtain a weaker form of identifiability.
E.3.1 Definitions
We will first define some sets of distributions which are subsets of the exponential family distribution. We will use u as a dummy variable, and introduce the definitions for the unconditional case. Note that all these definitions apply to the conditional case, when the parameters of the exponential family are a function of an auxiliary variable y. For completeness, we restate here Definition 4.
Definition 7 (Strongly exponential distributions). We say that a univariate exponential family distribution with density is strongly exponential if for any subset U of R the following is true:
where is the Lebesgue measure.
We say that that a multivariate distribution is strongly exponential if all its univariate marginals are.
In other words, the density of a strongly exponential distribution has almost surely the exponential component in its expression and can only be reduced to the base measure on a set of measure zero. This definition is very general, and is satisfied by all the usual exponential family distributions like the Gaussian, Laplace, Pareto, Chi-squared, Gamma, Beta, etc. We will only prove identifiability results for strongly exponential families. The non-strongly exponential case will be explored in future work.
There is a certain class of exponential families for which we can only prove a weak form of identifia-bility. Loosely speaking, this is because this class doesn’t constrain the latent space enough.
Definition 8 (Quasi-location exponential distributions). We say that a univariate exponential family distribution with density quasi-location family if:
We say that that a multivariate distribution is quasi-location exponential if all its univariate marginals are.
As a simple illustration, the Gaussian family with fixed variance is a quasi-location family, but with fixed mean it is not. This is because in the first case, the sufficient statistic is T(u) = u which is a monotonic scalar function, while in the second case it is , a non-monotonic scalar function.
E.3.2 Identifiability of the general case
As mentioned in section 3, the IMCA model described by equations (100) and (101) generalizes previous nonlinear ICA models by relaxing the independence assumption required for the latent variables. We propose here to extend the identifiability theory of nonlinear ICA developed in Hyvärinen et al. (2019); Khemakhem et al. (2020) to this new framework.
We start by providing a weaker form of identifiability guarantee that applies to the general case, including quasi-location families. Theorem 7. Assume the following:
(III) The conditional latent distribution p(z|y) is strongly exponential (definition 7), and its sufficient statistic is differentiable.
Then the IMCA model is weakly identifiable.
This theorem extends the basic identifiability result of Khemakhem et al. (2020, Theorem 1). It is fundamental as it proves a general identifiability results without the restriction of having independent latent variables. This was previously not considered to be possible and could only be demonstrated in very specific circumstances and under very restrictive additional assumptions (e.g., Monti and Hyvärinen (2018) require both non-negativity and orthonormality of a mixing matrix in the linear case). In the nonlinear case, to prove Theorem 7, we still require that the latent variables are only dependent through the base measure, while still being independently modulated through the auxiliary variable y. This (and the necessity of having an auxiliary variable) is the price to pay for obtaining identifiability in a nonlinear setting.
E.3.3 Identifiability of the non quasi-location family
The identifiability result of Theorem 7 is weak because of the presence of the linear transformation A in equation (103). It turns out that by excluding the quasi-location family (definition 8), we can remove this matrix and achieve a stronger form of identifiability. The main technical result of this paper is the following.
Theorem 8. Assume that the assumptions of Theorem 7 hold. Further assume one of the two following sets of assumptions:
Then the IMCA model is strongly identifiable.
This form of identifiability mirrors the strongest results proven in the nonlinear ICA (Khemakhem et al., 2020, Theorems 2,3), without requiring that the latent components be independent. As far as we know, this is the first proof of the kind for nonlinear representation learning. We further note that this theorem generalizes even existing identifiability theory of the linear case. The mixed case where we have both cases where some sufficient statistics are of dimension greater than 2 and some are univariate and non-monotonic will be studied in future work.
E.4 Estimation of IMCA by self-supervised learning
A recent development in nonlinear ICA is given by Hyvärinen et al. (2019) where the authors assume they observe data x = h(z) following a noiseless conditional nonlinear ICA model p(z|y) = For estimation, they rely on a self-supervised binary discrimination task based on randomization to learn the unmixing function. More specifically, from a dataset of observations and auxiliary variables pairs
, they construct a randomized dataset
where
is randomly drawn from the observed distribution of y. To distinguish between both datasets, a deep logistic regression is used. The last hidden layer of the neural network is a feature extractor whose purpose is to extract the relevant features which will allow to distinguish between the two datasets. Surprisingly, this estimation technique works for IMCA, and is summarized by the following theorem.
Theorem 9. Self-supervised nonlinear ICA estimation algorithms presented in Hyvärinen and Morioka (2016); Hyvärinen et al. (2019) work for the estimation of IMCA.
E.5 Proofs
E.5.1 Proof of Theorem 7
Consider two different sets of parameters and
, defining two conditional latent densities p(z|y) and
. Suppose that the density of the observations arising from these two different models are equal:
Let be the points provided by assumption (IV) of the theorem for T, where
, and
. We plug each of those
in (107) to obtain k + 1 such equations. Then, we subtract the first equation for
from the remaining k equations to get for l = 1, . . . , k:
where we grouped terms that are only a function of
Most importantly, both base measure terms disappear after taking the differences, which is the key enabler of identifiability in the IMCA framework.
The rest of the proof is similar to the proof of Khemakhem et al. (2020, Theorem 1). The only difference is that we don’t restrict the sufficient statistics to have equal dimensions, and so we can’t use the proof technique from Khemakhem et al. (2020, Theorem 1) without any modification. We present an alternative technique in the proof of Theorem 4, which we refer too for more details. We then conclude that
which implies that the model is weakly identifiable.
E.5.2 Proof of Theorem 8
The conclusion of Theorem 7 is the same as the conclusion of Khemakhem et al. (2020, Theorem 1). Since we make the same assumptions as Khemakhem et al. (2020, Theorems 2,3), the proof to Theorem 8 is similar to the proof of these theorems, which we refer too for more details. The IMCA model is strongly identifiable under the assumptions of Theorem 8.
E.5.3 Proof of Theorem 9
We will first quickly summarize the method proposed in Hyvärinen et al. (2019), and then show how it works for IMCA.
We consider that we observe data (x, y) that follows the exponential IMCA model of equations (4)-(5). Following Hyvärinen et al. (2019) we start by constructing new data from the observations x and y to obtain two datasets
where is a random value from the distribution of y and independent of x. We then proceed by defining a multinomial classification task, where we consider the set of all
as data points to be classified, and whether they come from the randomized dataset or not as labels. In particular, we train a deep neural network using multinomial logistic regression to perform this classification task. The last hidden layer of the neural network is a feature extractor denoted s(x). The purpose of the feature extractor is therefore to extract the relevant features which will allow to distinguish between the true dataset
and the randomized dataset
. The final layer of the network is simply linear, and the regression function takes the form
We state now the main result.
Theorem 9 (Hyvärinen et al. (2019), adapted). Assume that the assumptions of Theorem 7, and the assumptions (V)-(VI) of Theorem 8 hold. Further assume that we train a nonlinear logistic regression with universal approximation capability to discriminate between regression function in (112), where the feature extractor has dimension d.
Then in the limit of infinite data, the components of the regression function give the latent components up to pointwise nonlinearities.
Proof. The proof of this theorem is inspired by Hyvärinen et al. (2019). By well known theory, after convergence of logistic regression, the regression function equals the difference of the log-densities of the two classes:
where is the Jacobian matrix of
at point x. Let
be the point provided by assumption (iv). We plug each of those
in (113) to obtain k + 1 such equations. We subtract the first equation for
from the remaining k equations to get for l = 1, . . . , k:
Interestingly, the term cancels out. The rest of the proof is similar to Theorems 4a and 4b. The only minor difference is that the matrix A will not be square, but it is still full rank, and can be used to prove that
is a point-wise nonlinearity.