Phase Transitions for the Information Bottleneck in Representation Learning

2020·Arxiv

ABSTRACT

ABSTRACT

In the Information Bottleneck (IB), when tuning the relative strength between compression and prediction terms, how do the two terms behave, and what’s their relationship with the dataset and the learned representation? In this paper, we set out to answer these questions by studying multiple phase transitions in the IB objective: IBdefined on the encoding distribution p(z|x) for input X, target Y and representation Z, where sudden jumps of and prediction accuracy are observed with increasing . We introduce a definition for IB phase transitions as a qualitative change of the IB loss landscape, and show that the transitions correspond to the onset of learning new classes. Using second-order calculus of variations, we derive a formula that provides a practical condition for IB phase transitions, and draw its connection with the Fisher information matrix for parameterized models. We provide two perspectives to understand the formula, revealing that each IB phase transition is finding a component of maximum (nonlinear) correlation between X and Y orthogonal to the learned representation, in close analogy with canonical-correlation analysis (CCA) in linear settings. Based on the theory, we present an algorithm for discovering phase transition points. Finally, we verify that our theory and algorithm accurately predict phase transitions in categorical datasets, predict the onset of learning new classes and class difficulty in MNIST, and predict prominent phase transitions in CIFAR10.

1 INTRODUCTION

The Information Bottleneck (IB) objective (Tishby et al., 2000):

explicitly trades off model compression (denoting mutual information) with predictive performance (I(Y ; Z)) using the Lagrange multiplier , where X, Y are observed random variables, and Z is a learned representation of X. The IB method has proved effective in a variety of scenarios, including improving the robustness against adversarial attacks (Alemi et al., 2016; Fischer, 2018), learning invariant and disentangled representations (Achille & Soatto, 2018a;b), underlying information-based geometric clustering (Strouse & Schwab, 2017b), improving the training and performance in adversarial learning (Peng et al., 2018), and facilitating skill discovery (Sharma et al., 2019) and learning goal-conditioned policy (Goyal et al., 2019) in reinforcement learning.

From Eq. (1) we see that when it will encourage I(X; Z) = 0 which leads to a trivial representation Z that is independent of X, while when , it reduces to a maximum likelihood objective1 that does not constrain the information flow. Between these two extremes, how will the IB objective behave? Will prediction and compression performance change smoothly, or do there exist interesting transitions in between? In Wu et al. (2019), the authors observe and study the learnability transition, i.e. the value such that the IB objective transitions from a trivial global minimum to learning a nontrivial representation. They also show how this first phase transition relates to the structure of the dataset. However, to answer the full question, we need to consider the full range of

Figure 1: CIFAR10 plots (a) showing the information plane, as well as and (c) accuracy, all on the training set with 20% label noise. The arrows point to empirically-observed phase transitions. The vertical lines correspond to phase transitions found with Alg. 1.

Motivation. To get a sense of how I(Y ; Z) and I(X; Z) vary with , we train Variational Information Bottleneck (VIB) models (Alemi et al., 2016) on the CIFAR10 dataset (Krizhevsky & Hinton, 2009), where each experiment is at a different and random initialization of the model. Fig. 1 shows the I(X; Z), I(Y ; Z) and accuracy vs. , as well as I(Y ; Z) vs. I(X; Z) for CIFAR10 with 20% label noise (see Appendix I for details).

From Fig. 1(b)(c), we see that as we increase , instead of going up smoothly, both I(X; Z) and I(Y ; Z) show multiple phase transitions, where the slopes and are discontinuous and the accuracy has discrete jumps. The observation lets us refine our question: When do the phase transitions occur, and how do they depend on the structure of the dataset? These questions are important, since answering them will help us gain a better understanding of the IB objective and its close interplay with the dataset and the learned representation.

Moreover, the IB objective belongs to a general form of two-term trade-offs in many machine learning objectives: L = Prediction-loss Complexity, where the complexity term generally takes the form of regularization. Usually, learning is set at a specific . Many more insights can be gained if we understand the behavior of the prediction loss and model complexity with varying , and how they depend on the dataset. The techniques developed to address the question in the IB setting may also help us understand the two-term tradeoff in other learning objectives.

Contributions. In this work, we begin to address the above question in IB settings. Specifically:

• We identify a qualitative change of the IB loss landscape w.r.t. p(z|x) for varying as IB phase transitions (Section 3).

• Based on the definition, we introduce a quantity G[p(z|x)] and use it to prove a theorem giving a practical condition for IB phase transitions. We further reveal the connection between G[p(z|x)] and the Fisher information matrix when p(z|x) is parameterized by (Section 3).

• We reveal the close interplay between the IB objective, the dataset and the learned representation, by showing that in IB, each phase transition corresponds to learning a new nonlinear component of maximum correlation between X and Y , orthogonal to the previously-learned Z, and each with decreasing strength (Section 4).

To the best of our knowledge, our work provides the first theoretical formula to address IB phase transitions in the most general setting. In addition, we present an algorithm for iteratively finding the IB phase transition points (Section 5). We show that our theory and algorithm give tight matches with the observed phase transitions in categorical datasets, predict the onset of learning new classes and class difficulty in MNIST, and predict prominent transitions in CIFAR10 experiments (Section 6).

2 RELATED WORK

The Information Bottleneck Method (Tishby et al., 2000) provides a tabular method based on the Blahut-Arimoto (BA) Algorithm (Blahut, 1972) to numerically solve the IB functional for the optimal encoder distribution P(Z|X), given the trade-off parameter and the cardinality of the representation variable Z. This work has been extended in a variety of directions, including to the case where all three variables X, Y, Z are multivariate Gaussians (Chechik et al., 2005), cases of variational bounds on the IB and related functionals for amortized learning (Alemi et al., 2016; Achille & Soatto, 2018a; Fischer, 2018), and a more generalized interpretation of the constraint on model complexity as a Kolmogorov Structure Function (Achille et al., 2018). Previous theoretical analyses of IB include Rey & Roth (2012), which looks at IB through the lens of copula functions, and Shamir et al. (2010), which starts to tackle the question of how to bound generalization with IB. We will make practical use of the original IB algorithm, as well as the amortized bounds of the Variational Informormation Bottleneck (Alemi et al., 2016) and the Conditional Entropy Bottleneck (Fischer, 2018).

Phase transitions, where key quantities change discontinuously with varying relative strength in the two-term trade-off, have been observed in many different learning domains, for multiple learning objectives. In Rezende & Viola (2018), the authors observe phase transitions in the latent representation of -VAE for varying . Strouse & Schwab (2017b) utilize the kink angle of the phase transitions in the Deterministic Information Bottleneck (DIB) (Strouse & Schwab, 2017a) to determine the optimal number of clusters for geometric clustering. Tegmark & Wu (2019) explicitly considers critical points in binary classification tasks using a discrete information bottleneck with a non-convex Pareto-optimal frontier. In Achille & Soatto (2018a), the authors observe a transition on the tradeoff of vs. in InfoDropout. Under IB settings, Chechik et al. (2005) study the Gaussian Information Bottleneck, and analytically solve the critical values eigenvalues of the matrix , and is the covariance matrix. This work provides valuable insights for IB, but is limited to the special case that X, Y and Z are jointly Gaussian. Phase transitions in the general IB setting have also been observed, which Tishby (2018) describes as “information bifurcation”. In Wu et al. (2019), the authors study the first phase transition, i.e. the learnability phase transition, and provide insights on how the learnability depends on the dataset. Our work is the first work that addresses all the IB phase transitions in the most general setting, and provides theoretical insights on the interplay between the IB objective, its phase transitions, the dataset, and the learned representation.

3 FORMULA FOR IB PHASE TRANSITIONS

3.1 DEFINITIONS

Let be random variables denoting the input, target and representation, respectively, having a joint probability distribution p(X, Y, Z), with its support. X, Y and Z satisfy the Markov chain are conditionally independent given X. We assume that the integral (or summing if X, Y or Z are discrete random variables) is on We use x, y and z to denote the instances of the respective random variables. The above settings are used throughout the paper. We can view the IB objective IB(Eq. 1) as a functional of the encoding distribution p(z|x). To prepare for the introduction of IB phase transitions, we first define relative perturbation function and second variation, as follows. Definition 1. Relative perturbation function: For p(z|x), its relative perturbation function r(z|x) is a bounded function that maps to R and satisfies . Formally, define . We have that iff r(z|x) is a relative perturbation function of p(z|x). The perturbed probability (density) is Definition 2. Second variation: Let functional F[f(x)] be defined on some normed linear space R. Let us add a perturbative function , and now the functional expanded as

such that , where denotes the norm, is a linear

functional of , and is called the first variation, denoted as is a quadratic functional of , and is called the second variation, denoted as

We can think of the perturbation function as an infinite-dimensional “vector” (x being the indices), with being its amplitude and h(x) its direction. With the above preparations, we define the IB phase transition as a change in the local curvature on the global minimum of IB

Definition 3. IB phase transitions: Let be a perturbation function of p(z|x), denote the optimal solution of IBat , where the IB functional IBis defined in Eq. (1). The IB phase transitions values satisfying the following two conditions:

(1) ∀

Here denote one-sided limits.

We can understand the as a local “curvature” of the IB objective IB(Eq. 1) w.r.t. p(z|x), along some relative perturbation r(z|x). A phase transition occurs when the convexity of IBw.r.t. p(z|x) changes from a minimum to a saddle point in the neighborhood of its optimal solution as increases from to . This means that there exists a perturbation to go downhill and find a better minimum. We validate this definition empirically below.

3.2 CONDITION FOR IB PHASE TRANSITIONS

The definition for IB phase transition (Definition 3) indicates the important role on the optimal solution in providing the condition for phase transitions. To concretize it and prepare for a more practical condition for IB phase transitions, we expand IBsecond order of

Lemma 0.1. For IB, the condition of is equivalent to . The threshold function G[p(z|x)] is given by:

The proof is given in Appendix B, in which we also give Eq. (20) for empirical estimation. Note that Lemma 0.1 is very general and can be applied to any p(z|x), not only at the optimal solution

The Fisher Information matrix. In practice, the encoder is usually parameterized by some parameter vector , e.g. weights and biases in a neural net, where is the parameter field. An infinitesimal change of induces a relative perturbation , from which we can compute the threshold function

Lemma 0.2. For IBobjective, the condition of is equivalent to

of for are the conditional Fisher information matrix (Zegers, 2015) of for Z conditioned on X and Y , respectively. is the largest eigenvalue of with the corresponding eigenvector, where is the Cholesky decomposition of the matrix , and is the eigenvector for . The infimum is attained at

The proof is in appendix C. We see that for parameterized encoders , each term of G[p(z|x)] in Eq. (2) can be replaced by a bilinear form with the Fisher information matrix of the respective variables. Although this lemma is not required to understand the more general setting of Lemma 0.1, where the model is described in a functional space, Lemma 0.2 helps understand G[p(z|x)] for parameterized models, which permits directly linking the phase transitions to the model’s parameters.

Phase Transitions. Now we introduce Theorem 1 that gives a concrete and practical condition for IB phase transitions, which is the core result of the paper: Theorem 1. The IB phase transition points as defined in Definition 3 are given by the roots of the following equation:

where G[p(z|x)] is given by Eq. (2) and is the optimal solution of IB

We can understand Eq. (4) as the condition when is about to be able to be negative at the optimal solution for a given . The proof for Theorem 1 is given in Appendix D. In Section 4, we will analyze Theorem 1 in detail.

4 UNDERSTANDING THE FORMULA FOR IB PHASE TRANSITIONS

In this section we set out to understand G[p(z|x)] as given by Eq. (2) and the phase transition condition as given by Theorem 1, from the perspectives of Jensen’s inequality and representational maximum correlation.

4.1 JENSEN’S INEQUALITY

The condition for IB phase transitions given by Theorem 1 involves G[p(z|x)] =

The equality between A and B holds when the perturbation r(z|x) is constant w.r.t. x for any z; the equality between B and C holds when is constant w.r.t. y for any z. Therefore, the minimization of encourages the relative perturbation function r(z|x) to be as constant w.r.t. x as possible (minimizing intra-class difference), but as different w.r.t. different y as possible (maximizing inter-class difference), resulting in a clustering of the values of r(z|x) for different examples x according to their class y. Because of this clustering property in classification problems, we conjecture that there are at most phase transitions, where |Y| is the number of classes, with each phase transition differentiating one or more classes.

4.2 REPRESENTATIONAL MAXIMUM CORRELATION

Under certain conditions we can further simplify G[p(z|x)] and gain a deeper understanding of it. Firstly, inspired by maximum correlation (Anantharam et al., 2013), we introduce two new concepts, representational maximum correlation and conditional maximum correlation, as follows.

Definition 4. Given a joint distribution p(X, Y ), and a representation Z satisfying the Markov chain , the representational maximum correlation is defined as

where

The conditional maximum correlation is defined as:

where

We prove the following Theorem 2, which expresses G[p(z|x)] in terms of representational maximum correlation and related quantities, with proof given in Appendix F.

Theorem 2. Define . If and satisfy: , there exists2 s.t. r(z|x) = , then we have:

(iii) When Z is continuous, an optimal relative perturbation function r(z|x) for G[p(z|x)] is given by

where is the second largest singular value of the matrix

Theorem 2 furthers our understanding of G[p(z|x)] and the phase transition condition (Theorem 1), which we elaborate as follows.

Discovering maximum correlation in the orthogonal space of a learned representation: Intuitively, the representational maximum correlation measures the maximum linear correlation between f(X, Z) and g(Y, Z) among all real-valued functions f, g, under the constraint that f(X, Z) is “orthogonal” to p(X|Z) and g(Y, Z) is “orthogonal” to p(Y |Z). Theorem 2 (i) reveals that G[p(z|x)] is the inverse square of this representational maximum correlation. Theorem 2 (ii) further shows that G[p(z|x)] is finding a specific on which maximum (nonlinear) correlation between X and Y conditioned on Z can be found. Combined with Theorem 1, we have that when we continuously increase , for the optimal representation shall monotonically decrease due to that X and Y has to find their maximum correlation on the orthogonal space of an increasingly better representation that captures more information about X. A phase transition occurs when reduces to , after which as continues to increase, will try to find maximum correlation between X and Y orthogonal to the full previously learned representation. This is reminiscent of canonical-correlation analysis (CCA) (Hotelling, 1992) in linear settings, where components with decreasing linear maximum correlation that are orthogonal to previous components are found one by one. In comparison, we show that in IB, each phase transition corresponds to learning a new nonlinear component of maximum correlation between X and Y in Z, orthogonal to the previously-learned Z. In the case of classification where different classes may have different difficulty (e.g. due to label noise or support overlap), we should expect that classes that are less difficult as measured by a larger maximum correlation between X and Y are learned earlier.

Conspicuous subset conditioned on a single z: Furthermore, we show in (iii) that an optimal relative perturbation function r(z|x) can be decomposed into a product of two factors, a

factor that only focus on perturbing a specific point in the representation space, and an factor that is finding the “conspicuous subset” (Wu et al., 2019), i.e. the most confident, large, typical, and imbalanced subset in the X space for the distribution

Singular values In categorical settings, (iv) reveals a connection between G[p(z|x)] and the singular value of the matrix. Due to the property of SVD, we know that the square of the singular values of equals the non-negative eigenvalue of the matrix . Then the phase transition condition in Theorem 1 is equivalent to a (nonlinear) eigenvalue problem. This is resonant with previous analogy with CCA in linear settings, and is also reminiscent of the linear eigenvalue problem in Gaussian IB (Chechik et al., 2005).

5 ALGORITHM FOR PHASE TRANSITIONS DISCOVERY IN CLASSIFICATION

As a consequence of the theoretical analysis above, we are able to derive an algorithm to efficiently estimate the phase transitions for a given model architecture and dataset. This algorithm also permits us to empirically confirm some of our theoretical results in Section 6.

Typically, classification involves high-dimensional inputs X. Without sweeping the full range of where at each it is a full learning problem, it is in general a difficult task to estimate the phase transitions. In Algorithm 1, we present a two-stage approach.

In the first stage, we train a single maximum likelihood neural network with the same encoder architecture as in the (variational) IB to estimate p(y|x), and obtain an N is the number of examples in the dataset and C is the number of classes. In the second stage, we perform an iterative algorithm w.r.t. , alternatively, to converge to a phase transition point.

Specifically, for a given , we use a Blahut-Arimoto type IB algorithm (Tishby et al., 2000) to efficiently reach IB optimal at , then use SVD (with the formula given in Theorem 2 (iv)) to efficiently estimate at (step 8). We then use the value as the new and do it again (step 7 in the next iteration). At convergence, we will reach the phase transition point given by (Theorem 1). After convergence as measured by patience parameter K, we slightly increase (step 13), so that the algorithm can discover the subsequent phase transitions.

6 EMPIRICAL STUDY

We quantitatively and qualitatively test the ability of our theory and Algorithm 1 to provide good predictions for IB phase transitions. We first verify them in fully categorical settings, where X, Y, Z are all discrete, and we show that the phase transitions can correspond to learning new classes as we increase . We then test our algorithm on versions of the MNIST and CIFAR10 datasets with added label noise.

6.1 CATEGORICAL DATASET

For categorical datasets, X and Y are discrete, and p(X) and p(Y |X) are given. To test Theorem 1, we use the Blahut-Arimoto IB algorithm to compute the optimal for each vs. is plotted in Fig. 2 (a). There are two phase transitions at and . For each and the corresponding , we use the SVD formula (Theorem 2) to compute , shown in Fig. 2 (b). We see that at exactly the observed phase transition points and . Moreover, starting at , Alg. 1 converges to each phase transition points within few iterations. Our other experiments with random categorical datasets show similarly tight matches.

Furthermore, in Appendix G we show that the phase transitions correspond to the onset of separation of p(z|x) for subsets of X that correspond to different classes. This supports our conjecture from Section 4.1 that there are at most phase transitions in classification problems.

6.2 MNIST DATASET

For continuous X, how does our algorithm perform, and will it reveal aspects of the dataset? We first test our algorithm in a 4-class MNIST with noisy labels3, whose confusion matrix and experimental settings are given in Appendix H. Fig. 3 (a) shows the path Alg. 1 takes. We see again that in each

Figure 2: (a) for a categorical dataset with by , and the vertical lines are the experimentally discovered phase transition points (b) for the same dataset, and the path for Alg. 1, with in (a) also plotted. The dataset is given in Fig. 5.

Figure 3: (a) Path of Alg. 1 starting with , where the maximum likelihood model is using the same encoder architecture as in the CEB model. This stairstep path shows that Alg. 1 is able to ignore very large regions of , while quickly and precisely finding the phase transition points. Also plotted is an accumulation of by running Alg. 1 with varying starting (blue dots). (b) Per-class accuracy vs. , where the accuracy at each is from training an independent CEB model on the dataset. The per-class accuracy denotes the fraction of correctly predicted labels by the CEB model for the observed label

phase Alg. 1 converges to the phase transition points within a few iterations, and it discovers in total 3 phase transition points. Similar to the categorical case, we expect that each phase transition corresponds to the onset of learning a new class, and that the last class is much harder to learn due to a larger separation of . Therefore, this class should have a much larger label noise so that it is hard to capture this component of maximum correlation between X and Y , as analyzed in representational maximum correlation (Section 4.2). Fig. 3 (b) plots the per-class accuracy with increasing for running the Conditional Entropy Bottleneck (Fischer, 2018) (another variational bound on IB). We see that the first two predicted phase transition points closely match the observed onset of learning class 3 and class 0. Class 1 is observed to learn earlier than expected, possibly due to the gap between the variational IB objective and the true IB objective in continuous settings. By looking at the confusion matrix for the label noise (Fig. 7), we see that the ordering of onset of learning: class 2, 3, 0, 1, corresponds exactly to the decreasing diagonal element (increasing noise) of the classes, and as predicted, class 1 has a much smaller diagonal element than the other three classes, which makes it much more difficult to learn. This ordering of classes by difficulty is what our representational maximum correlation predicts.

Figure 4: (a) Accumulated vs. by running Alg. 1 with varying starting (blue dots). Also plotted are predicted phase transition points. (b) I(X; Z) and I(Y ; Z) vs. . The manually-identified phase transition points are labelled with arrows. The vertical black lines are the phase transitions identified by Alg. 1, denoted as ,... , from left to right. (c) Accuracy vs. with the same sets of points identified. The most interesting region is right before , where accuracy decreases with . Alg. 1 identifies both sides of that region, as well as points at or near all of the early obvious phase transitions. It also seems to miss later transitions, possibly due to the gap between the variational IB objective and the true IB objective in continuous settings.

6.3 CIFAR10 DATASET

Finally, we investigate the CIFAR10 experiment from Section 1. The details of the experimental setup are described in Appendix I. This experiment stretches the current limits of our discrete approximation to the underlying continuous representation being learned by the models. Nevertheless, we can see in Fig. 4 that many of the visible empirical phase transitions are tightly identified by Alg. 1. Particularly, the onset of learning is predicted quite accurately; the large interval between the predicted and corresponds well to the continuous increase of I(X; Z) and I(Y ; Z) at the same interval. And Alg. 1 is able to identify many dense transitions not obviously seen by just looking at curve alone. Alg. 1 predicts 9 phase transitions, exactly equal to for CIFAR10.

7 CONCLUSION

In this work, we observe and study the phase transitions in IB as we vary . We introduce the definition for IB phase transitions, and based on it derive a formula that gives a practical condition for IB phase transitions. We further understand the formula via Jensen’s inequality and representational maximum correlation. We reveal the close interplay between the IB objective, the dataset and the learned representation, as each phase transition is learning a nonlinear maximum correlation component in the orthogonal space of the learned representation. We present an algorithm for finding the phase transitions, and show that it gives tight matches with observed phase transitions in categorical datasets, predicts onset of learning new classes and class difficulty in MNIST, and predicts prominent transitions in CIFAR10 experiments. This work is a first theoretical step towards a deeper understanding of the phenomenon of phase transitions in the Information Bottleneck. We believe our approach will be applicable to other “trade-off” objectives, like -VAE (Higgins et al., 2017) and InfoDropout (Achille & Soatto, 2018a), where the model’s ability to predict is balanced against a measure of complexity.

8 ACKNOWLEDGEMENTS

The authors would like to thank Alex Alemi, Kevin Murphy, Sergey Ioffe, Isaac Chuang and Max Tegmark for helpful discussions.

REFERENCES

Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018a.

Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018b.

Alessandro Achille, Glen Mbeng, and Stefano Soatto. The dynamics of differential learning i: Information-dynamics and task reachability. arXiv preprint arXiv:1810.02440, 2018.

Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

Venkat Anantharam, Amin Gohari, Sudeep Kamath, and Chandra Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by erkip and cover. arXiv preprint arXiv:1304.6133, 2013.

Richard Blahut. Computation of channel capacity and rate-distortion functions. IEEE transactions on Information Theory, 18(4):460–473, 1972.

Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for gaussian variables. Journal of machine learning research, 6(Jan):165–188, 2005.

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

Ian Fischer. The conditional entropy bottleneck, 2018. URL openreview.net/forum?id= rkVOXhAqY7.

Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Matthew Botvinick, Hugo Larochelle, Sergey Levine, and Yoshua Bengio. Infobot: Transfer and exploration via the information bottleneck. arXiv preprint arXiv:1901.10902, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl.

Harold Hotelling. Relations between two sets of variates. In Breakthroughs in statistics, pp. 162–190. Springer, 1992.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1412. 6980.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, CIFAR, 2009.

Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow. arXiv preprint arXiv:1810.00821, 2018.

Mélanie Rey and Volker Roth. Meta-gaussian information bottleneck. In Advances in Neural Information Processing Systems, pp. 1916–1924, 2012.

Danilo Jimenez Rezende and Fabio Viola. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.

Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696–2711, 2010.

Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.

DJ Strouse and David J Schwab. The deterministic information bottleneck. Neural computation, 29 (6):1611–1630, 2017a.

DJ Strouse and David J Schwab. The information bottleneck and geometric clustering. arXiv preprint arXiv:1712.09657, 2017b.

Max Tegmark and Tailin Wu. Pareto-optimal data compression for binary classification tasks. arXiv preprint arXiv:1908.08961, 2019.

Naftali Tishby. Lecture: the information theory of deep neural networks: the statistical physics aspects. https://www.perimeterinstitute.ca/videos/ information-theory-deep-neural-networks-statistical-physics-aspects/, 2018.

Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Tailin Wu, Ian Fischer, Isaac Chuang, and Max Tegmark. Learnability for the information bottleneck. arXiv preprint arXiv:1907.07331, 2019.

S. Zagoruyko and N. Komodakis. Wide Residual Networks. arXiv: 1605.07146, 2016.

Pablo Zegers. Fisher information properties. Entropy, 17(7):4918–4939, 2015.

Appendix

Here we prove the Lemma 2.1, which will be crucial in the lemmas and theorems in this paper that follows.

Lemma 2.1. For a relative perturbation function , we have that the IB objective can be expanded as

where . The expectations in the equations are all w.r.t. all variables. For example

Proof. Suppose that we perform a relative perturbation r(z|x) on p(z|x) such that the perturbed conditional probability is , then we have

Therefore, we can denote the corresponding relative perturbation r(z) on p(z) as

Similarly, we have

p

And we can denote the corresponding relative perturbation r(z|y) on p(z|y) as

Since

We have

The -order term is simply IB. The first order term is

The -order term for

In the last equality we have used

Combining the terms with all orders, we have

As a side note, the KL-divergence between

Therefore, to the second order, we have

Similarly, we have up to the second order. Using similar procedure, we have up to the second-order,

Proof. From Lemma 2.1, we have

The condition of

is equivalent to

Using Jensen’s inequality and the convexity of the square function, we have

The equality holds iff is constant w.r.t. y, for any z.

Using Jensen’s inequality on , we have , where the equality holds iff r(z|x) is constant w.r.t. x for any z.

When , we have that the condition Eq. (16) is equivalent to

If , substituting into Eq. (16), we have

which is always true due to that , and will be a looser condition than Eq. (17) above. Above all, we have Eq. (17).

Empirical estimate of G[p(z|x)] To empirically estimate G[p(z|x)] from a minibatch of and the encoder p(z|x), we can make the following Monte Carlo importance sampling estimation, where we use the samples and also get samples of , and have:

Here denotes the set of x examples that has label of , and is an indicator function that takes value 1 if its argument is true, 0 otherwise.

The requirement of

for any

Combining all terms, we have that the empirical is given by

where . It is also possible to use different distributions for importance sampling, which will results in different formulas for empirical estimation of G[p(z|x)].

Proof. For the parameterized4 with , after , where5 is an infinitesimal perturbation on , we have that the distribution changes from to ,

and thus the relative perturbation on

where is the norm of in the parameter field

Similarly, we have

Substituting the above expressions into the expansion of IBin Eq. (12), and preserving to the second order

In the last equality we have used , and similarly . In other words, the terms in the first-order variation vanish, and the remaining are all in . Also in the last expression, is the

Fisher information matrix of

are the conditional Fisher information matrix (Zegers, 2015) of conditioned on X and Y , respectively.

Let us look at

Firstly, note that is a quadratic function of , and the scale of does not change the sign of , so the condition of is invariant to the scale of , and is describing the “curvature” in the infinitesimal neighborhood of . Therefore, can explore any value in . Secondly, we see that Eq. (21) is a special case of Eq. (14) with . Therefore, The inequalities due to Jensen still hold: . If , then the condition of is equivalent to

i.e.

If , we have that Eq. (21) always holds, which is a looser condition than Eq. (22). Above all, we have that the condition of is equivalent to

Moreover, given by Eq. (22) has the format of a generalized Rayleigh quotient are both Hermitian matrices6, which can be reduced to Rayleigh quotient transformation is the Cholesky decomposition of . Moreover, we have that when attains its minimum value, the Reyleigh quotient attains its maximum value of with , i.e. , where is the largest eigenvalue of the corresponding eigenvector.

Proof. Define

where denotes taking expectation w.r.t. the optimal solution at . Using Lemma 2.1, we have that the IB phase transition as defined in Definition 3 corresponds to satisfying the following two equations:

Now we prove that is continuous at have

From Eq. (23), we have bounded, i.e.

Hence,

To prove that is continuous at , we have , s.t.

Hence is continuous at

Combining the continuity of , and Eq. (24) and (25), we have equivalent to after simple manipulation.

E INVARIANCE OF TO ADDITION OF A GLOBAL REPRESENTATION

Proof. When we r(z|x) is shifted by a global transformation , we have , and similarly

The numerator of G[r(z|x); p(z|x)] is then

Symmetrically, we have

Therefore, is invariant to s(z).

Proof. Using the condition of the theorem, we have that , there exists and s.t. . Note that the only difference between . Using Lemma 2.2, we have

where r(z|x) doesn’t have the constraint of

After dropping the constraint of , again using Lemma 2.2, we can let r(z) = (since we can perform the transformation , so that the new ). Now we get a simpler formula for G[p(z|x)], as follows:

where

From Eq. (26), we can further require that

where7 . Comparing with Eq. (26), it immediately follows that

(i) We only have to prove that is defined in Defini-tion 4.

where . We have used Cauchy-Schwarz inequality, where the equality holds when for some . Since , we have .

Taking the supremum of

Here is defined in Definition 4. By definition both take non-negative values. Therefore,

(ii) Using the definition of

where W

Denote , we have. Then the supremum

supremum:

where We can think of the inner supremum as only w.r.t. x, for some given z.

Now let’s consider another supremum:

where . Using similar technique in (ii), it is easy to prove that it equals as defined in Definition 4.

Comparing Eq. (30) and the supremum:

we see that the only difference is that in the latter instead of 1. Since W[f(x, z)] is a quadratic functional of f(x, z), we have

Therefore,

where in the last equality we have let c(z) have “mass” only on the place where attains supremum w.r.t. z.

(iii) When Z is a continuous variable, let

function, is a parameter, , with

Therefore, such constructed

(which equals by Eq. 28).

Substituting in the special form of f(x, z) into the expression of in Eq. (27), we have

We can identify with because satisfies the requirement for conditional maximum correlation that and , and using the same technique in (i), it is straightforward to prove that equals the conditional maximum correlation as defined in Definition 4.

Since the conditional maximum correlation can be viewed as the maximum correlation between X and Y , where , using the equality of (Eq. 7 in Wu et al. (2019)), we can identify the here, and an optimal maximizes is also an optimal that minimizes

Let column vectors and (note that z is given and fixed). Also let . Denote inner product , and the length of a vector as . We have due to the normalization of probability, due to , and due to . Furthermore, we have

which is exactly the second largest singular value of the matrix . Using the result in (ii), we have that

In this section we study the behavior of p(z|x) on the phase transitions. We use the same categorical dataset (where |X| = |Y | = |Z| = 3 and p(x) is uniform, and p(y|x) is given in Fig. 5). In Fig. 6 we show the p(z|x) on the simplex before and after each phase transition. We see that the first phase transition corresponds to the separation of x = 2 (belonging to (belonging to classes simplex. The second phase transition corresponds to the separation of x = 0 with x = 1. Therefore, each phase transition corresponds to the ability to distinguish subset of examples, and learning of new classes.

We use the MNIST training examples with class 0, 1, 2, 3, with a hidden label-noise matrix as given in Fig. 7, based on which at each minibatch we dynamically sample the observed label. We use conditional entropy bottleneck (CEB) (Fischer, 2018) as the variational IB objective, and run multiple independent instances with different the target . We jump start learning by started training at for 100 epochs, annealing from 100 down to the target over 600 epochs, and continue to train at the target epoch for another 800 epochs. The encoder is a three-layer neural net, where each hidden layer has 512 neurons and leakyReLU activation, and the last layer has linear activation. The classifier p(y|z) is a 2-layer neural net with a 128-neuron ReLU hidden layer. The backward encoder p(z|y) is also a 2-layer neural net with a 128-neuron ReLU hidden layer. We trained with Adam (Kingma & Welling, 2013) at learning rate of , and anneal down with factor For Alg. 1, for the we use the same architecture as the encoder of CEB, and use |Z| = 50 in Alg. 1.

We use the same CIFAR10 class confusion matrix provided in Wu et al. (2019) to generate noisy labels with about 20% label noise on average (reproduced in Table 1). We trained Wide ResNet (He et al., 2016; Zagoruyko & Komodakis, 2016) models using the open source implementation from Cubuk et al. (2018) as encoders for the Variational Information Bottleneck (VIB) (Alemi et al., 2016). The 10 dimensional output of the encoder parameterized a mean-field Gaussian with unit covariance. Samples from the encoder were passed to the classifier, a 2 layer MLP. The marginal distributions were mixtures of 500 fully covariate 10-dimensional Gaussians, all parameters of which are trained.

With this standard model, we trained 251 different models at from 1.0 to 6.0 with step size of 0.02. As in Wu et al. (2019), we jump-start learning by annealing from 100 down to the target this over the first 4000 steps of training. The models continued to train for another 56,000 gradient steps after that, a total of 600 epochs. We trained with Adam (Kingma & Ba, 2015) at a base learning rate of , and reduced the learning rate by a factor of 0.5 at 300, 400, and 500 epochs. The models converged to essentially their final accuracy within 40,000 gradient steps, and then remained stable.

Figure 5: p(y|x) for the categorical dataset in Fig. 2 and Fig. 6. The value in denotes p(y = j|x = i). p(x) is uniform.

Figure 6: (a) I(Y ; Z) vs. for the dataset given in Fig. 5. The phase transitions are marked with vertical dashed line, with and . (b)-(e) Optimal for four values of , i.e. (b) , (c) , (d) (e) (their values are also marked in (a)), where each marker denotes p(z|x = i) for a given

Figure 7: Confusion matrix for MNIST experiment. The value in row and column denotes for the label noise.

The accuracies reported in Figure 4 are averaged across five passes over the training set. We use |Z| = 50 in Alg. 1.

Table 1: Class confusion matrix used in CIFAR10 experiments, reproduced from (Wu et al., 2019). The value in row i, column j means for class i, the probability of labeling it as class j. The mean confusion across the classes is 20%.