The Information Bottleneck (IB) objective (Tishby et al., 2000):
explicitly trades off model compression (denoting mutual information) with predictive performance (I(Y ; Z)) using the Lagrange multiplier
, where X, Y are observed random variables, and Z is a learned representation of X. The IB method has proved effective in a variety of scenarios, including improving the robustness against adversarial attacks (Alemi et al., 2016; Fischer, 2018), learning invariant and disentangled representations (Achille & Soatto, 2018a;b), underlying information-based geometric clustering (Strouse & Schwab, 2017b), improving the training and performance in adversarial learning (Peng et al., 2018), and facilitating skill discovery (Sharma et al., 2019) and learning goal-conditioned policy (Goyal et al., 2019) in reinforcement learning.
From Eq. (1) we see that when it will encourage I(X; Z) = 0 which leads to a trivial representation Z that is independent of X, while when
, it reduces to a maximum likelihood objective1 that does not constrain the information flow. Between these two extremes, how will the IB objective behave? Will prediction and compression performance change smoothly, or do there exist interesting transitions in between? In Wu et al. (2019), the authors observe and study the learnability transition, i.e. the
value such that the IB objective transitions from a trivial global minimum to learning a nontrivial representation. They also show how this first phase transition relates to the structure of the dataset. However, to answer the full question, we need to consider the full range of
Figure 1: CIFAR10 plots (a) showing the information plane, as well as and (c) accuracy, all on the training set with 20% label noise. The arrows point to empirically-observed phase transitions. The vertical lines correspond to phase transitions found with Alg. 1.
Motivation. To get a sense of how I(Y ; Z) and I(X; Z) vary with , we train Variational Information Bottleneck (VIB) models (Alemi et al., 2016) on the CIFAR10 dataset (Krizhevsky & Hinton, 2009), where each experiment is at a different
and random initialization of the model. Fig. 1 shows the I(X; Z), I(Y ; Z) and accuracy vs.
, as well as I(Y ; Z) vs. I(X; Z) for CIFAR10 with 20% label noise (see Appendix I for details).
From Fig. 1(b)(c), we see that as we increase , instead of going up smoothly, both I(X; Z) and I(Y ; Z) show multiple phase transitions, where the slopes
and
are discontinuous and the accuracy has discrete jumps. The observation lets us refine our question: When do the phase transitions occur, and how do they depend on the structure of the dataset? These questions are important, since answering them will help us gain a better understanding of the IB objective and its close interplay with the dataset and the learned representation.
Moreover, the IB objective belongs to a general form of two-term trade-offs in many machine learning objectives: L = Prediction-loss Complexity, where the complexity term generally takes the form of regularization. Usually, learning is set at a specific
. Many more insights can be gained if we understand the behavior of the prediction loss and model complexity with varying
, and how they depend on the dataset. The techniques developed to address the question in the IB setting may also help us understand the two-term tradeoff in other learning objectives.
Contributions. In this work, we begin to address the above question in IB settings. Specifically:
• We identify a qualitative change of the IB loss landscape w.r.t. p(z|x) for varying as IB phase transitions (Section 3).
• Based on the definition, we introduce a quantity G[p(z|x)] and use it to prove a theorem giving a practical condition for IB phase transitions. We further reveal the connection between G[p(z|x)] and the Fisher information matrix when p(z|x) is parameterized by (Section 3).
• We reveal the close interplay between the IB objective, the dataset and the learned representation, by showing that in IB, each phase transition corresponds to learning a new nonlinear component of maximum correlation between X and Y , orthogonal to the previously-learned Z, and each with decreasing strength (Section 4).
To the best of our knowledge, our work provides the first theoretical formula to address IB phase transitions in the most general setting. In addition, we present an algorithm for iteratively finding the IB phase transition points (Section 5). We show that our theory and algorithm give tight matches with the observed phase transitions in categorical datasets, predict the onset of learning new classes and class difficulty in MNIST, and predict prominent transitions in CIFAR10 experiments (Section 6).
The Information Bottleneck Method (Tishby et al., 2000) provides a tabular method based on the Blahut-Arimoto (BA) Algorithm (Blahut, 1972) to numerically solve the IB functional for the optimal encoder distribution P(Z|X), given the trade-off parameter and the cardinality of the representation variable Z. This work has been extended in a variety of directions, including to the case where all three variables X, Y, Z are multivariate Gaussians (Chechik et al., 2005), cases of variational bounds on the IB and related functionals for amortized learning (Alemi et al., 2016; Achille & Soatto, 2018a; Fischer, 2018), and a more generalized interpretation of the constraint on model complexity as a Kolmogorov Structure Function (Achille et al., 2018). Previous theoretical analyses of IB include Rey & Roth (2012), which looks at IB through the lens of copula functions, and Shamir et al. (2010), which starts to tackle the question of how to bound generalization with IB. We will make practical use of the original IB algorithm, as well as the amortized bounds of the Variational Informormation Bottleneck (Alemi et al., 2016) and the Conditional Entropy Bottleneck (Fischer, 2018).
Phase transitions, where key quantities change discontinuously with varying relative strength in the two-term trade-off, have been observed in many different learning domains, for multiple learning objectives. In Rezende & Viola (2018), the authors observe phase transitions in the latent representation of -VAE for varying
. Strouse & Schwab (2017b) utilize the kink angle of the phase transitions in the Deterministic Information Bottleneck (DIB) (Strouse & Schwab, 2017a) to determine the optimal number of clusters for geometric clustering. Tegmark & Wu (2019) explicitly considers critical points in binary classification tasks using a discrete information bottleneck with a non-convex Pareto-optimal frontier. In Achille & Soatto (2018a), the authors observe a transition on the tradeoff of
vs.
in InfoDropout. Under IB settings, Chechik et al. (2005) study the Gaussian Information Bottleneck, and analytically solve the critical values
eigenvalues of the matrix
, and
is the covariance matrix. This work provides valuable insights for IB, but is limited to the special case that X, Y and Z are jointly Gaussian. Phase transitions in the general IB setting have also been observed, which Tishby (2018) describes as “information bifurcation”. In Wu et al. (2019), the authors study the first phase transition, i.e. the learnability phase transition, and provide insights on how the learnability depends on the dataset. Our work is the first work that addresses all the IB phase transitions in the most general setting, and provides theoretical insights on the interplay between the IB objective, its phase transitions, the dataset, and the learned representation.
3.1 DEFINITIONS
Let be random variables denoting the input, target and representation, respectively, having a joint probability distribution p(X, Y, Z), with
its support. X, Y and Z satisfy the Markov chain
are conditionally independent given X. We assume that the integral (or summing if X, Y or Z are discrete random variables) is on
We use x, y and z to denote the instances of the respective random variables. The above settings are used throughout the paper. We can view the IB objective IB
(Eq. 1) as a functional of the encoding distribution p(z|x). To prepare for the introduction of IB phase transitions, we first define relative perturbation function and second variation, as follows. Definition 1. Relative perturbation function: For p(z|x), its relative perturbation function r(z|x) is a bounded function that maps
to R and satisfies
. Formally, define
. We have that
iff r(z|x) is a relative perturbation function of p(z|x). The perturbed probability (density) is
Definition 2. Second variation: Let functional F[f(x)] be defined on some normed linear space R. Let us add a perturbative function
, and now the functional
expanded as
such that , where
denotes the norm,
is a linear
functional of , and is called the first variation, denoted as
is a quadratic functional of
, and is called the second variation, denoted as
We can think of the perturbation function as an infinite-dimensional “vector” (x being the indices), with
being its amplitude and h(x) its direction. With the above preparations, we define the IB phase transition as a change in the local curvature on the global minimum of IB
Definition 3. IB phase transitions: Let be a perturbation function of p(z|x),
denote the optimal solution of IB
at
, where the IB functional IB
is defined in Eq. (1). The IB phase transitions
values satisfying the following two conditions:
(1) ∀
Here denote one-sided limits.
We can understand the as a local “curvature” of the IB objective IB
(Eq. 1) w.r.t. p(z|x), along some relative perturbation r(z|x). A phase transition occurs when the convexity of IB
w.r.t. p(z|x) changes from a minimum to a saddle point in the neighborhood of its optimal solution
as
increases from
to
. This means that there exists a perturbation to go downhill and find a better minimum. We validate this definition empirically below.
3.2 CONDITION FOR IB PHASE TRANSITIONS
The definition for IB phase transition (Definition 3) indicates the important role on the optimal solution in providing the condition for phase transitions. To concretize it and prepare for a more practical condition for IB phase transitions, we expand IB
second order of
Lemma 0.1. For IB, the condition of
is equivalent to
. The threshold function G[p(z|x)] is given by:
The proof is given in Appendix B, in which we also give Eq. (20) for empirical estimation. Note that Lemma 0.1 is very general and can be applied to any p(z|x), not only at the optimal solution
The Fisher Information matrix. In practice, the encoder is usually parameterized by some parameter vector
, e.g. weights and biases in a neural net, where
is the parameter field. An infinitesimal change of
induces a relative perturbation
, from which we can compute the threshold function
Lemma 0.2. For IBobjective, the condition of
is equivalent to
of for
are the conditional Fisher information matrix (Zegers, 2015) of
for Z conditioned on X and Y , respectively.
is the largest eigenvalue of
with
the corresponding eigenvector, where
is the Cholesky decomposition of the matrix
, and
is the eigenvector for
. The infimum is attained at
The proof is in appendix C. We see that for parameterized encoders , each term of G[p(z|x)] in Eq. (2) can be replaced by a bilinear form with the Fisher information matrix of the respective variables. Although this lemma is not required to understand the more general setting of Lemma 0.1, where the model is described in a functional space, Lemma 0.2 helps understand G[p(z|x)] for parameterized models, which permits directly linking the phase transitions to the model’s parameters.
Phase Transitions. Now we introduce Theorem 1 that gives a concrete and practical condition for IB phase transitions, which is the core result of the paper: Theorem 1. The IB phase transition points as defined in Definition 3 are given by the roots of the following equation:
where G[p(z|x)] is given by Eq. (2) and is the optimal solution of IB
We can understand Eq. (4) as the condition when is about to be able to be negative at the optimal solution
for a given
. The proof for Theorem 1 is given in Appendix D. In Section 4, we will analyze Theorem 1 in detail.
In this section we set out to understand G[p(z|x)] as given by Eq. (2) and the phase transition condition as given by Theorem 1, from the perspectives of Jensen’s inequality and representational maximum correlation.
4.1 JENSEN’S INEQUALITY
The condition for IB phase transitions given by Theorem 1 involves G[p(z|x)] =
The equality between A and B holds when the perturbation r(z|x) is constant w.r.t. x for any z; the equality between B and C holds when is constant w.r.t. y for any z. Therefore, the minimization of
encourages the relative perturbation function r(z|x) to be as constant w.r.t. x as possible (minimizing intra-class difference), but as different w.r.t. different y as possible (maximizing inter-class difference), resulting in a clustering of the values of r(z|x) for different examples x according to their class y. Because of this clustering property in classification problems, we conjecture that there are at most
phase transitions, where |Y| is the number of classes, with each phase transition differentiating one or more classes.
4.2 REPRESENTATIONAL MAXIMUM CORRELATION
Under certain conditions we can further simplify G[p(z|x)] and gain a deeper understanding of it. Firstly, inspired by maximum correlation (Anantharam et al., 2013), we introduce two new concepts, representational maximum correlation and conditional maximum correlation, as follows.
Definition 4. Given a joint distribution p(X, Y ), and a representation Z satisfying the Markov chain , the representational maximum correlation
is defined as
where
The conditional maximum correlation is defined as:
where
We prove the following Theorem 2, which expresses G[p(z|x)] in terms of representational maximum correlation and related quantities, with proof given in Appendix F.
Theorem 2. Define . If
and
satisfy:
, there exists2
s.t. r(z|x) =
, then we have:
(iii) When Z is continuous, an optimal relative perturbation function r(z|x) for G[p(z|x)] is given by
where is the second largest singular value of the matrix
Theorem 2 furthers our understanding of G[p(z|x)] and the phase transition condition (Theorem 1), which we elaborate as follows.
Discovering maximum correlation in the orthogonal space of a learned representation: Intuitively, the representational maximum correlation measures the maximum linear correlation between f(X, Z) and g(Y, Z) among all real-valued functions f, g, under the constraint that f(X, Z) is “orthogonal” to p(X|Z) and g(Y, Z) is “orthogonal” to p(Y |Z). Theorem 2 (i) reveals that G[p(z|x)] is the inverse square of this representational maximum correlation. Theorem 2 (ii) further shows that G[p(z|x)] is finding a specific on which maximum (nonlinear) correlation between X and Y conditioned on Z can be found. Combined with Theorem 1, we have that when we continuously increase
, for the optimal representation
shall monotonically decrease due to that X and Y has to find their maximum correlation on the orthogonal space of an increasingly better representation
that captures more information about X. A phase transition occurs when
reduces to
, after which as
continues to increase,
will try to find maximum correlation between X and Y orthogonal to the full previously learned representation. This is reminiscent of canonical-correlation analysis (CCA) (Hotelling, 1992) in linear settings, where components with decreasing linear maximum correlation that are orthogonal to previous components are found one by one. In comparison, we show that in IB, each phase transition corresponds to learning a new nonlinear component of maximum correlation between X and Y in Z, orthogonal to the previously-learned Z. In the case of classification where different classes may have different difficulty (e.g. due to label noise or support overlap), we should expect that classes that are less difficult as measured by a larger maximum correlation between X and Y are learned earlier.
Conspicuous subset conditioned on a single z: Furthermore, we show in (iii) that an optimal relative perturbation function r(z|x) can be decomposed into a product of two factors, a
factor that only focus on perturbing a specific point in the representation space, and an
factor that is finding the “conspicuous subset” (Wu et al., 2019), i.e. the most confident, large, typical, and imbalanced subset in the X space for the distribution
Singular values In categorical settings, (iv) reveals a connection between G[p(z|x)] and the singular value of the matrix. Due to the property of SVD, we know that the square of the singular values of
equals the non-negative eigenvalue of the matrix
. Then the phase transition condition in Theorem 1 is equivalent to a (nonlinear) eigenvalue problem. This is resonant with previous analogy with CCA in linear settings, and is also reminiscent of the linear eigenvalue problem in Gaussian IB (Chechik et al., 2005).
As a consequence of the theoretical analysis above, we are able to derive an algorithm to efficiently estimate the phase transitions for a given model architecture and dataset. This algorithm also permits us to empirically confirm some of our theoretical results in Section 6.
Typically, classification involves high-dimensional inputs X. Without sweeping the full range of where at each
it is a full learning problem, it is in general a difficult task to estimate the phase transitions. In Algorithm 1, we present a two-stage approach.
In the first stage, we train a single maximum likelihood neural network with the same encoder architecture as in the (variational) IB to estimate p(y|x), and obtain an
N is the number of examples in the dataset and C is the number of classes. In the second stage, we perform an iterative algorithm w.r.t.
, alternatively, to converge to a phase transition point.
Specifically, for a given , we use a Blahut-Arimoto type IB algorithm (Tishby et al., 2000) to efficiently reach IB optimal
at
, then use SVD (with the formula given in Theorem 2 (iv)) to efficiently estimate
at
(step 8). We then use the
value as the new
and do it again (step 7 in the next iteration). At convergence, we will reach the phase transition point given by
(Theorem 1). After convergence as measured by patience parameter K, we slightly increase
(step 13), so that the algorithm can discover the subsequent phase transitions.
We quantitatively and qualitatively test the ability of our theory and Algorithm 1 to provide good predictions for IB phase transitions. We first verify them in fully categorical settings, where X, Y, Z are all discrete, and we show that the phase transitions can correspond to learning new classes as we increase . We then test our algorithm on versions of the MNIST and CIFAR10 datasets with added label noise.
6.1 CATEGORICAL DATASET
For categorical datasets, X and Y are discrete, and p(X) and p(Y |X) are given. To test Theorem 1, we use the Blahut-Arimoto IB algorithm to compute the optimal for each
vs.
is plotted in Fig. 2 (a). There are two phase transitions at
and
. For each
and the corresponding
, we use the SVD formula (Theorem 2) to compute
, shown in Fig. 2 (b). We see that
at exactly the observed phase transition points
and
. Moreover, starting at
, Alg. 1 converges to each phase transition points within few iterations. Our other experiments with random categorical datasets show similarly tight matches.
Furthermore, in Appendix G we show that the phase transitions correspond to the onset of separation of p(z|x) for subsets of X that correspond to different classes. This supports our conjecture from Section 4.1 that there are at most phase transitions in classification problems.
6.2 MNIST DATASET
For continuous X, how does our algorithm perform, and will it reveal aspects of the dataset? We first test our algorithm in a 4-class MNIST with noisy labels3, whose confusion matrix and experimental settings are given in Appendix H. Fig. 3 (a) shows the path Alg. 1 takes. We see again that in each
Figure 2: (a) for a categorical dataset with
by
, and the vertical lines are the experimentally discovered phase transition points
(b)
for the same dataset, and the path for Alg. 1, with
in (a) also plotted. The dataset is given in Fig. 5.
Figure 3: (a) Path of Alg. 1 starting with , where the maximum likelihood model
is using the same encoder architecture as in the CEB model. This stairstep path shows that Alg. 1 is able to ignore very large regions of
, while quickly and precisely finding the phase transition points. Also plotted is an accumulation of
by running Alg. 1 with varying starting
(blue dots). (b) Per-class accuracy vs.
, where the accuracy at each
is from training an independent CEB model on the dataset. The per-class accuracy denotes the fraction of correctly predicted labels by the CEB model for the observed label
phase Alg. 1 converges to the phase transition points within a few iterations, and it discovers in total 3 phase transition points. Similar to the categorical case, we expect that each phase transition corresponds to the onset of learning a new class, and that the last class is much harder to learn due to a larger separation of . Therefore, this class should have a much larger label noise so that it is hard to capture this component of maximum correlation between X and Y , as analyzed in representational maximum correlation (Section 4.2). Fig. 3 (b) plots the per-class accuracy with increasing
for running the Conditional Entropy Bottleneck (Fischer, 2018) (another variational bound on IB). We see that the first two predicted phase transition points
closely match the observed onset of learning class 3 and class 0. Class 1 is observed to learn earlier than expected, possibly due to the gap between the variational IB objective and the true IB objective in continuous settings. By looking at the confusion matrix for the label noise (Fig. 7), we see that the ordering of onset of learning: class 2, 3, 0, 1, corresponds exactly to the decreasing diagonal element
(increasing noise) of the classes, and as predicted, class 1 has a much smaller diagonal element
than the other three classes, which makes it much more difficult to learn. This ordering of classes by difficulty is what our representational maximum correlation predicts.
Figure 4: (a) Accumulated vs.
by running Alg. 1 with varying starting
(blue dots). Also plotted are predicted phase transition points. (b) I(X; Z) and I(Y ; Z) vs.
. The manually-identified phase transition points are labelled with arrows. The vertical black lines are the phase transitions identified by Alg. 1, denoted as
,...
, from left to right. (c) Accuracy vs.
with the same sets of points identified. The most interesting region is right before
, where accuracy decreases with
. Alg. 1 identifies both sides of that region, as well as points at or near all of the early obvious phase transitions. It also seems to miss later transitions, possibly due to the gap between the variational IB objective and the true IB objective in continuous settings.
6.3 CIFAR10 DATASET
Finally, we investigate the CIFAR10 experiment from Section 1. The details of the experimental setup are described in Appendix I. This experiment stretches the current limits of our discrete approximation to the underlying continuous representation being learned by the models. Nevertheless, we can see in Fig. 4 that many of the visible empirical phase transitions are tightly identified by Alg. 1. Particularly, the onset of learning is predicted quite accurately; the large interval between the predicted and
corresponds well to the continuous increase of I(X; Z) and I(Y ; Z) at the same interval. And Alg. 1 is able to identify many dense transitions not obviously seen by just looking at
curve alone. Alg. 1 predicts 9 phase transitions, exactly equal to
for CIFAR10.
In this work, we observe and study the phase transitions in IB as we vary . We introduce the definition for IB phase transitions, and based on it derive a formula that gives a practical condition for IB phase transitions. We further understand the formula via Jensen’s inequality and representational maximum correlation. We reveal the close interplay between the IB objective, the dataset and the learned representation, as each phase transition is learning a nonlinear maximum correlation component in the orthogonal space of the learned representation. We present an algorithm for finding the phase transitions, and show that it gives tight matches with observed phase transitions in categorical datasets, predicts onset of learning new classes and class difficulty in MNIST, and predicts prominent transitions in CIFAR10 experiments. This work is a first theoretical step towards a deeper understanding of the phenomenon of phase transitions in the Information Bottleneck. We believe our approach will be applicable to other “trade-off” objectives, like
-VAE (Higgins et al., 2017) and InfoDropout (Achille & Soatto, 2018a), where the model’s ability to predict is balanced against a measure of complexity.
The authors would like to thank Alex Alemi, Kevin Murphy, Sergey Ioffe, Isaac Chuang and Max Tegmark for helpful discussions.
Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018a.
Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018b.
Alessandro Achille, Glen Mbeng, and Stefano Soatto. The dynamics of differential learning i: Information-dynamics and task reachability. arXiv preprint arXiv:1810.02440, 2018.
Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
Venkat Anantharam, Amin Gohari, Sudeep Kamath, and Chandra Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by erkip and cover. arXiv preprint arXiv:1304.6133, 2013.
Richard Blahut. Computation of channel capacity and rate-distortion functions. IEEE transactions on Information Theory, 18(4):460–473, 1972.
Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for gaussian variables. Journal of machine learning research, 6(Jan):165–188, 2005.
Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
Ian Fischer. The conditional entropy bottleneck, 2018. URL openreview.net/forum?id= rkVOXhAqY7.
Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Matthew Botvinick, Hugo Larochelle, Sergey Levine, and Yoshua Bengio. Infobot: Transfer and exploration via the information bottleneck. arXiv preprint arXiv:1901.10902, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl.
Harold Hotelling. Relations between two sets of variates. In Breakthroughs in statistics, pp. 162–190. Springer, 1992.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1412. 6980.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, CIFAR, 2009.
Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow. arXiv preprint arXiv:1810.00821, 2018.
Mélanie Rey and Volker Roth. Meta-gaussian information bottleneck. In Advances in Neural Information Processing Systems, pp. 1916–1924, 2012.
Danilo Jimenez Rezende and Fabio Viola. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.
Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696–2711, 2010.
Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.
DJ Strouse and David J Schwab. The deterministic information bottleneck. Neural computation, 29 (6):1611–1630, 2017a.
DJ Strouse and David J Schwab. The information bottleneck and geometric clustering. arXiv preprint arXiv:1712.09657, 2017b.
Max Tegmark and Tailin Wu. Pareto-optimal data compression for binary classification tasks. arXiv preprint arXiv:1908.08961, 2019.
Naftali Tishby. Lecture: the information theory of deep neural networks: the statistical physics aspects. https://www.perimeterinstitute.ca/videos/ information-theory-deep-neural-networks-statistical-physics-aspects/, 2018.
Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
Tailin Wu, Ian Fischer, Isaac Chuang, and Max Tegmark. Learnability for the information bottleneck. arXiv preprint arXiv:1907.07331, 2019.
S. Zagoruyko and N. Komodakis. Wide Residual Networks. arXiv: 1605.07146, 2016.
Pablo Zegers. Fisher information properties. Entropy, 17(7):4918–4939, 2015.
Here we prove the Lemma 2.1, which will be crucial in the lemmas and theorems in this paper that follows.
Lemma 2.1. For a relative perturbation function , we have that the IB objective can be expanded as
where . The expectations in the equations are all w.r.t. all variables. For example
Proof. Suppose that we perform a relative perturbation r(z|x) on p(z|x) such that the perturbed conditional probability is , then we have
Therefore, we can denote the corresponding relative perturbation r(z) on p(z) as
Similarly, we have
p
And we can denote the corresponding relative perturbation r(z|y) on p(z|y) as
Since
We have
The -order term is simply IB
. The first order term is
The -order term for
In the last equality we have used
Combining the terms with all orders, we have
As a side note, the KL-divergence between
Therefore, to the second order, we have
Similarly, we have up to the second order. Using similar procedure, we have up to the second-order,
Proof. From Lemma 2.1, we have
The condition of
is equivalent to
Using Jensen’s inequality and the convexity of the square function, we have
The equality holds iff is constant w.r.t. y, for any z.
Using Jensen’s inequality on , we have
, where the equality holds iff r(z|x) is constant w.r.t. x for any z.
When , we have that the condition Eq. (16) is equivalent to
If , substituting into Eq. (16), we have
which is always true due to that , and will be a looser condition than Eq. (17) above. Above all, we have Eq. (17).
Empirical estimate of G[p(z|x)] To empirically estimate G[p(z|x)] from a minibatch of and the encoder p(z|x), we can make the following Monte Carlo importance sampling estimation, where we use the samples
and also get samples of
, and have:
Here denotes the set of x examples that has label of
, and
is an indicator function that takes value 1 if its argument is true, 0 otherwise.
The requirement of
for any
Combining all terms, we have that the empirical is given by
where . It is also possible to use different distributions for importance sampling, which will results in different formulas for empirical estimation of G[p(z|x)].
Proof. For the parameterized4 with
, after
, where5
is an infinitesimal perturbation on
, we have that the distribution changes from
to
,
and thus the relative perturbation on
where is the norm of
in the parameter field
Similarly, we have
Substituting the above expressions into the expansion of IBin Eq. (12), and preserving to the second order
In the last equality we have used , and similarly
. In other words, the
terms in the first-order variation
vanish, and the remaining
are all in
. Also in the last expression,
is the
Fisher information matrix of
are the conditional Fisher information matrix (Zegers, 2015) of
conditioned on X and Y , respectively.
Let us look at
Firstly, note that is a quadratic function of
, and the scale of
does not change the sign of
, so the condition of
is invariant to the scale of
, and is describing the “curvature” in the infinitesimal neighborhood of
. Therefore,
can explore any value in
. Secondly, we see that Eq. (21) is a special case of Eq. (14) with
. Therefore, The inequalities due to Jensen still hold:
. If
, then the condition of
is equivalent to
i.e.
If , we have that Eq. (21) always holds, which is a looser condition than Eq. (22). Above all, we have that the condition of
is equivalent to
Moreover, given by Eq. (22) has the format of a generalized Rayleigh quotient
are both Hermitian matrices6, which can be reduced to Rayleigh quotient
transformation
is the Cholesky decomposition of
. Moreover, we have that when
attains its minimum value, the Reyleigh quotient
attains its maximum value of
with
, i.e.
, where
is the largest eigenvalue of
the corresponding eigenvector.
Proof. Define
where denotes taking expectation w.r.t. the optimal solution
at
. Using Lemma 2.1, we have that the IB phase transition as defined in Definition 3 corresponds to satisfying the following two equations:
Now we prove that is continuous at
have
From Eq. (23), we have bounded, i.e.
Hence,
To prove that is continuous at
, we have
, s.t.
Hence is continuous at
Combining the continuity of , and Eq. (24) and (25), we have
equivalent to
after simple manipulation.
E INVARIANCE OF TO ADDITION OF A GLOBAL REPRESENTATION
Proof. When we r(z|x) is shifted by a global transformation , we have
, and similarly
The numerator of G[r(z|x); p(z|x)] is then
Symmetrically, we have
Therefore, is invariant to
s(z).
Proof. Using the condition of the theorem, we have that , there exists
and
s.t.
. Note that the only difference between
. Using Lemma 2.2, we have
where r(z|x) doesn’t have the constraint of
After dropping the constraint of , again using Lemma 2.2, we can let r(z) =
(since we can perform the transformation
, so that the new
). Now we get a simpler formula for G[p(z|x)], as follows:
where
From Eq. (26), we can further require that
where7 . Comparing with Eq. (26), it immediately follows that
(i) We only have to prove that is defined in Defini-tion 4.
where . We have used Cauchy-Schwarz inequality, where the equality holds when
for some
. Since
, we have
.
Taking the supremum of
Here is defined in Definition 4. By definition both
take non-negative values. Therefore,
(ii) Using the definition of
where W
Denote , we have
. Then the supremum
supremum:
where We can think of the inner supremum
as only w.r.t. x, for some given z.
Now let’s consider another supremum:
where . Using similar technique in (ii), it is easy to prove that it equals
as defined in Definition 4.
Comparing Eq. (30) and the supremum:
we see that the only difference is that in the latter instead of 1. Since W[f(x, z)] is a quadratic functional of f(x, z), we have
Therefore,
where in the last equality we have let c(z) have “mass” only on the place where attains supremum w.r.t. z.
(iii) When Z is a continuous variable, let
function, is a parameter,
, with
Therefore, such constructed
(which equals
by Eq. 28).
Substituting in the special form of f(x, z) into the expression of in Eq. (27), we have
We can identify with
because
satisfies the requirement for conditional maximum correlation that
and
, and using the same technique in (i), it is straightforward to prove that
equals the conditional maximum correlation as defined in Definition 4.
Since the conditional maximum correlation can be viewed as the maximum correlation between X and Y , where , using the equality of
(Eq. 7 in Wu et al. (2019)), we can identify the
here, and an optimal
maximizes
is also an optimal
that minimizes
Let column vectors and
(note that z is given and fixed). Also let
. Denote inner product
, and the length of a vector as
. We have
due to the normalization of probability,
due to
, and
due to
. Furthermore, we have
which is exactly the second largest singular value of the matrix
. Using the result in (ii), we have that
In this section we study the behavior of p(z|x) on the phase transitions. We use the same categorical dataset (where |X| = |Y | = |Z| = 3 and p(x) is uniform, and p(y|x) is given in Fig. 5). In Fig. 6 we show the p(z|x) on the simplex before and after each phase transition. We see that the first phase transition corresponds to the separation of x = 2 (belonging to (belonging to classes
simplex. The second phase transition corresponds to the separation of x = 0 with x = 1. Therefore, each phase transition corresponds to the ability to distinguish subset of examples, and learning of new classes.
We use the MNIST training examples with class 0, 1, 2, 3, with a hidden label-noise matrix as given in Fig. 7, based on which at each minibatch we dynamically sample the observed label. We use conditional entropy bottleneck (CEB) (Fischer, 2018) as the variational IB objective, and run multiple independent instances with different the target . We jump start learning by started training at
for 100 epochs, annealing
from 100 down to the target
over 600 epochs, and continue to train at the target epoch for another 800 epochs. The encoder is a three-layer neural net, where each hidden layer has 512 neurons and leakyReLU activation, and the last layer has linear activation. The classifier p(y|z) is a 2-layer neural net with a 128-neuron ReLU hidden layer. The backward encoder p(z|y) is also a 2-layer neural net with a 128-neuron ReLU hidden layer. We trained with Adam (Kingma & Welling, 2013) at learning rate of
, and anneal down with factor
For Alg. 1, for the
we use the same architecture as the encoder of CEB, and use |Z| = 50 in Alg. 1.
We use the same CIFAR10 class confusion matrix provided in Wu et al. (2019) to generate noisy labels with about 20% label noise on average (reproduced in Table 1). We trained Wide ResNet (He et al., 2016; Zagoruyko & Komodakis, 2016) models using the open source implementation from Cubuk et al. (2018) as encoders for the Variational Information Bottleneck (VIB) (Alemi et al., 2016). The 10 dimensional output of the encoder parameterized a mean-field Gaussian with unit covariance. Samples from the encoder were passed to the classifier, a 2 layer MLP. The marginal distributions were mixtures of 500 fully covariate 10-dimensional Gaussians, all parameters of which are trained.
With this standard model, we trained 251 different models at from 1.0 to 6.0 with step size of 0.02. As in Wu et al. (2019), we jump-start learning by annealing
from 100 down to the target
this over the first 4000 steps of training. The models continued to train for another 56,000 gradient steps after that, a total of 600 epochs. We trained with Adam (Kingma & Ba, 2015) at a base learning rate of
, and reduced the learning rate by a factor of 0.5 at 300, 400, and 500 epochs. The models converged to essentially their final accuracy within 40,000 gradient steps, and then remained stable.
Figure 5: p(y|x) for the categorical dataset in Fig. 2 and Fig. 6. The value in denotes p(y = j|x = i). p(x) is uniform.
Figure 6: (a) I(Y ; Z) vs. for the dataset given in Fig. 5. The phase transitions are marked with vertical dashed line, with
and
. (b)-(e) Optimal
for four values of
, i.e. (b)
, (c)
, (d)
(e)
(their
values are also marked in (a)), where each marker denotes p(z|x = i) for a given
Figure 7: Confusion matrix for MNIST experiment. The value in row and
column denotes
for the label noise.
The accuracies reported in Figure 4 are averaged across five passes over the training set. We use |Z| = 50 in Alg. 1.
Table 1: Class confusion matrix used in CIFAR10 experiments, reproduced from (Wu et al., 2019). The value in row i, column j means for class i, the probability of labeling it as class j. The mean confusion across the classes is 20%.