b

DiscoverSearch
About
My stuff
Phase Transitions for the Information Bottleneck in Representation Learning
2020·arXiv
ABSTRACT
ABSTRACT

In the Information Bottleneck (IB), when tuning the relative strength between compression and prediction terms, how do the two terms behave, and what’s their relationship with the dataset and the learned representation? In this paper, we set out to answer these questions by studying multiple phase transitions in the IB objective: IBβ[p(z|x)] = I(X; Z)−βI(Y ; Z)defined on the encoding distribution p(z|x) for input X, target Y and representation Z, where sudden jumps of dI(Y ;Z)dβand prediction accuracy are observed with increasing  β. We introduce a definition for IB phase transitions as a qualitative change of the IB loss landscape, and show that the transitions correspond to the onset of learning new classes. Using second-order calculus of variations, we derive a formula that provides a practical condition for IB phase transitions, and draw its connection with the Fisher information matrix for parameterized models. We provide two perspectives to understand the formula, revealing that each IB phase transition is finding a component of maximum (nonlinear) correlation between X and Y orthogonal to the learned representation, in close analogy with canonical-correlation analysis (CCA) in linear settings. Based on the theory, we present an algorithm for discovering phase transition points. Finally, we verify that our theory and algorithm accurately predict phase transitions in categorical datasets, predict the onset of learning new classes and class difficulty in MNIST, and predict prominent phase transitions in CIFAR10.

The Information Bottleneck (IB) objective (Tishby et al., 2000):

image

explicitly trades off model compression (I(X; Z), I(·; ·)denoting mutual information) with predictive performance (I(Y ; Z)) using the Lagrange multiplier  β, where X, Y are observed random variables, and Z is a learned representation of X. The IB method has proved effective in a variety of scenarios, including improving the robustness against adversarial attacks (Alemi et al., 2016; Fischer, 2018), learning invariant and disentangled representations (Achille & Soatto, 2018a;b), underlying information-based geometric clustering (Strouse & Schwab, 2017b), improving the training and performance in adversarial learning (Peng et al., 2018), and facilitating skill discovery (Sharma et al., 2019) and learning goal-conditioned policy (Goyal et al., 2019) in reinforcement learning.

From Eq. (1) we see that when  β → 0it will encourage I(X; Z) = 0 which leads to a trivial representation Z that is independent of X, while when  β → +∞, it reduces to a maximum likelihood objective1 that does not constrain the information flow. Between these two extremes, how will the IB objective behave? Will prediction and compression performance change smoothly, or do there exist interesting transitions in between? In Wu et al. (2019), the authors observe and study the learnability transition, i.e. the  βvalue such that the IB objective transitions from a trivial global minimum to learning a nontrivial representation. They also show how this first phase transition relates to the structure of the dataset. However, to answer the full question, we need to consider the full range of  β.

image

Figure 1: CIFAR10 plots (a) showing the information plane, as well as  β vs (b) I(X; Z) and I(Y ; Z),and (c) accuracy, all on the training set with 20% label noise. The arrows point to empirically-observed phase transitions. The vertical lines correspond to phase transitions found with Alg. 1.

Motivation. To get a sense of how I(Y ; Z) and I(X; Z) vary with  β, we train Variational Information Bottleneck (VIB) models (Alemi et al., 2016) on the CIFAR10 dataset (Krizhevsky & Hinton, 2009), where each experiment is at a different  βand random initialization of the model. Fig. 1 shows the I(X; Z), I(Y ; Z) and accuracy vs.  β, as well as I(Y ; Z) vs. I(X; Z) for CIFAR10 with 20% label noise (see Appendix I for details).

From Fig. 1(b)(c), we see that as we increase  β, instead of going up smoothly, both I(X; Z) and I(Y ; Z) show multiple phase transitions, where the slopes dI(X;Z)dβand dI(Y ;Z)dβare discontinuous and the accuracy has discrete jumps. The observation lets us refine our question: When do the phase transitions occur, and how do they depend on the structure of the dataset? These questions are important, since answering them will help us gain a better understanding of the IB objective and its close interplay with the dataset and the learned representation.

Moreover, the IB objective belongs to a general form of two-term trade-offs in many machine learning objectives: L = Prediction-loss  + β ·Complexity, where the complexity term generally takes the form of regularization. Usually, learning is set at a specific  β. Many more insights can be gained if we understand the behavior of the prediction loss and model complexity with varying  β, and how they depend on the dataset. The techniques developed to address the question in the IB setting may also help us understand the two-term tradeoff in other learning objectives.

Contributions. In this work, we begin to address the above question in IB settings. Specifically:

We identify a qualitative change of the IB loss landscape w.r.t. p(z|x) for varying  βas IB phase transitions (Section 3).

Based on the definition, we introduce a quantity G[p(z|x)] and use it to prove a theorem giving a practical condition for IB phase transitions. We further reveal the connection between G[p(z|x)] and the Fisher information matrix when p(z|x) is parameterized by  θ(Section 3).

We reveal the close interplay between the IB objective, the dataset and the learned representation, by showing that in IB, each phase transition corresponds to learning a new nonlinear component of maximum correlation between X and Y , orthogonal to the previously-learned Z, and each with decreasing strength (Section 4).

To the best of our knowledge, our work provides the first theoretical formula to address IB phase transitions in the most general setting. In addition, we present an algorithm for iteratively finding the IB phase transition points (Section 5). We show that our theory and algorithm give tight matches with the observed phase transitions in categorical datasets, predict the onset of learning new classes and class difficulty in MNIST, and predict prominent transitions in CIFAR10 experiments (Section 6).

The Information Bottleneck Method (Tishby et al., 2000) provides a tabular method based on the Blahut-Arimoto (BA) Algorithm (Blahut, 1972) to numerically solve the IB functional for the optimal encoder distribution P(Z|X), given the trade-off parameter  βand the cardinality of the representation variable Z. This work has been extended in a variety of directions, including to the case where all three variables X, Y, Z are multivariate Gaussians (Chechik et al., 2005), cases of variational bounds on the IB and related functionals for amortized learning (Alemi et al., 2016; Achille & Soatto, 2018a; Fischer, 2018), and a more generalized interpretation of the constraint on model complexity as a Kolmogorov Structure Function (Achille et al., 2018). Previous theoretical analyses of IB include Rey & Roth (2012), which looks at IB through the lens of copula functions, and Shamir et al. (2010), which starts to tackle the question of how to bound generalization with IB. We will make practical use of the original IB algorithm, as well as the amortized bounds of the Variational Informormation Bottleneck (Alemi et al., 2016) and the Conditional Entropy Bottleneck (Fischer, 2018).

Phase transitions, where key quantities change discontinuously with varying relative strength in the two-term trade-off, have been observed in many different learning domains, for multiple learning objectives. In Rezende & Viola (2018), the authors observe phase transitions in the latent representation of  β-VAE for varying  β. Strouse & Schwab (2017b) utilize the kink angle of the phase transitions in the Deterministic Information Bottleneck (DIB) (Strouse & Schwab, 2017a) to determine the optimal number of clusters for geometric clustering. Tegmark & Wu (2019) explicitly considers critical points in binary classification tasks using a discrete information bottleneck with a non-convex Pareto-optimal frontier. In Achille & Soatto (2018a), the authors observe a transition on the tradeoff of  I(θ; X, Y )vs.  H(Y |X, θ)in InfoDropout. Under IB settings, Chechik et al. (2005) study the Gaussian Information Bottleneck, and analytically solve the critical values  βci = 11−λi , where λi areeigenvalues of the matrix  Σx|yΣ−1x, and  Σxis the covariance matrix. This work provides valuable insights for IB, but is limited to the special case that X, Y and Z are jointly Gaussian. Phase transitions in the general IB setting have also been observed, which Tishby (2018) describes as “information bifurcation”. In Wu et al. (2019), the authors study the first phase transition, i.e. the learnability phase transition, and provide insights on how the learnability depends on the dataset. Our work is the first work that addresses all the IB phase transitions in the most general setting, and provides theoretical insights on the interplay between the IB objective, its phase transitions, the dataset, and the learned representation.

3.1 DEFINITIONS

Let  X ∈ X, Y ∈ Y, Z ∈ Zbe random variables denoting the input, target and representation, respectively, having a joint probability distribution p(X, Y, Z), with  X × Y × Zits support. X, Y and Z satisfy the Markov chain  Z − X − Y , i.e. Y and Zare conditionally independent given X. We assume that the integral (or summing if X, Y or Z are discrete random variables) is on  X × Y × Z.We use x, y and z to denote the instances of the respective random variables. The above settings are used throughout the paper. We can view the IB objective IBβ[p(z|x)](Eq. 1) as a functional of the encoding distribution p(z|x). To prepare for the introduction of IB phase transitions, we first define relative perturbation function and second variation, as follows. Definition 1. Relative perturbation function: For p(z|x), its relative perturbation function r(z|x) is a bounded function that maps  X × Zto R and satisfies  Ez∼p(z|x)[r(z|x)] = 0. Formally, define QZ|X := {r(z|x) : X × Z → R�� Ez∼p(z|x)[r(z|x)] = 0, and ∃M > 0 s.t. ∀X ∈ X, Z ∈Z, |r(z|x)| ≤ M}. We have that  r(z|x) ∈ QZ|Xiff r(z|x) is a relative perturbation function of p(z|x). The perturbed probability (density) is  p′(z|x) = p(z|x) (1 + ϵ · r(z|x)) for some ϵ > 0.Definition 2. Second variation: Let functional F[f(x)] be defined on some normed linear space R. Let us add a perturbative function  ϵ · h(x) to f(x), and now the functional  F[f(x) + ϵ · h(x)] can beexpanded as

image

such that  limϵ→0 ϕr[ϵ · h(x)] = 0, where  ∥·∥denotes the norm,  ϕ1[ϵ · h(x)] = ϵ dF [f(x)]dϵis a linear

functional of  ϵ·h(x), and is called the first variation, denoted as  δF[f(x)]. ϕ2[ϵ·h(x)] = 12ϵ2 d2F [f(x)]dϵ2is a quadratic functional of  ϵ · h(x), and is called the second variation, denoted as  δ2F[f(x)].

We can think of the perturbation function  ϵ · h(x)as an infinite-dimensional “vector” (x being the indices), with  ϵbeing its amplitude and h(x) its direction. With the above preparations, we define the IB phase transition as a change in the local curvature on the global minimum of IBβ[p(z|x)].

Definition 3. IB phase transitions: Let  r(z|x) ∈ QZ|Xbe a perturbation function of p(z|x), p∗β(z|x)denote the optimal solution of IBβ[p(z|x)]at  β, where the IB functional IB[·]is defined in Eq. (1). The IB phase transitions  βci are the βvalues satisfying the following two conditions:

(1) r(z|x) ∈ QZ|X , δ2IBβ[p(z|x)]��p∗β(z|x) ≥ 0;

image

Here  β+ and 0− denote one-sided limits.

We can understand the  δ2IBβ[p(z|x)]as a local “curvature” of the IB objective IBβ(Eq. 1) w.r.t. p(z|x), along some relative perturbation r(z|x). A phase transition occurs when the convexity of IBβ[p(z|x)]w.r.t. p(z|x) changes from a minimum to a saddle point in the neighborhood of its optimal solution  p∗β(z|x)as  βincreases from  βcto  βc + 0+. This means that there exists a perturbation to go downhill and find a better minimum. We validate this definition empirically below.

3.2 CONDITION FOR IB PHASE TRANSITIONS

The definition for IB phase transition (Definition 3) indicates the important role  δ2IBβ[p(z|x)] playson the optimal solution in providing the condition for phase transitions. To concretize it and prepare for a more practical condition for IB phase transitions, we expand IBβ[p(z|x)(1 + ϵ · r(z|x))] to thesecond order of  ϵ, giving:

Lemma 0.1. For IBβ[p(z|x)], the condition of  ∀r(z|x) ∈ QZ|X , δ2IBβ[p(z|x)] ≥ 0is equivalent to  β ≤ G[p(z|x)]. The threshold function G[p(z|x)] is given by:

image

The proof is given in Appendix B, in which we also give Eq. (20) for empirical estimation. Note that Lemma 0.1 is very general and can be applied to any p(z|x), not only at the optimal solution  p∗β(z|x).

The Fisher Information matrix. In practice, the encoder  pθ(z|x)is usually parameterized by some parameter vector  θ = (θ1, θ2, ...θk)T ∈ Θ, e.g. weights and biases in a neural net, where  Θis the parameter field. An infinitesimal change of  θ′ ← θ + ∆θinduces a relative perturbation  ϵ · r(z|x) ≃∆θT ∂logpθ(z|x)∂θ on pθ(z|x), from which we can compute the threshold function  GΘ[pθ(z|x)]:

Lemma 0.2. For IBβ[pθ(z|x)]objective, the condition of  ∀∆θ ∈ Θ, δ2IBβ[pθ(z|x)] ≥ 0is equivalent to  β ≤ GΘ[pθ(z|x)], where

image

of  θfor  Z, IZ|X(θ) := �dxdzp(x)pθ(z|x)�∂logpθ(z|x)∂θ � �∂logpθ(z|x)∂θ �T, IZ|Y (θ) :=�dydzp(y)pθ(z|y)�∂logpθ(z|y)∂θ � �∂logpθ(z|y)∂θ �Tare the conditional Fisher information matrix (Zegers, 2015) of  θfor Z conditioned on X and Y , respectively.  λmaxis the largest eigenvalue of  C−1 �IZ|Y (θ) − IZ(θ)�(CT )−1with  vmaxthe corresponding eigenvector, where  CCTis the Cholesky decomposition of the matrix  IZ|X(θ) − IZ(θ), and  vmaxis the eigenvector for  λmax. The infimum is attained at  ∆θ = (CT )−1vmax.

The proof is in appendix C. We see that for parameterized encoders  pθ(z|x), each term of G[p(z|x)] in Eq. (2) can be replaced by a bilinear form with the Fisher information matrix of the respective variables. Although this lemma is not required to understand the more general setting of Lemma 0.1, where the model is described in a functional space, Lemma 0.2 helps understand G[p(z|x)] for parameterized models, which permits directly linking the phase transitions to the model’s parameters.

Phase Transitions. Now we introduce Theorem 1 that gives a concrete and practical condition for IB phase transitions, which is the core result of the paper: Theorem 1. The IB phase transition points  {βci }as defined in Definition 3 are given by the roots of the following equation:

image

where G[p(z|x)] is given by Eq. (2) and  p∗β(z|x)is the optimal solution of IBβ[p(z|x)] at β.

We can understand Eq. (4) as the condition when  δ2IBβ[p(z|x)]is about to be able to be negative at the optimal solution  p∗β(z|x)for a given  β. The proof for Theorem 1 is given in Appendix D. In Section 4, we will analyze Theorem 1 in detail.

In this section we set out to understand G[p(z|x)] as given by Eq. (2) and the phase transition condition as given by Theorem 1, from the perspectives of Jensen’s inequality and representational maximum correlation.

4.1 JENSENS INEQUALITY

The condition for IB phase transitions given by Theorem 1 involves G[p(z|x)] =

image

The equality between A and B holds when the perturbation r(z|x) is constant w.r.t. x for any z; the equality between B and C holds when  Ex∼p(x|y,z)[r(z|x)]is constant w.r.t. y for any z. Therefore, the minimization of A−CB−Cencourages the relative perturbation function r(z|x) to be as constant w.r.t. x as possible (minimizing intra-class difference), but as different w.r.t. different y as possible (maximizing inter-class difference), resulting in a clustering of the values of r(z|x) for different examples x according to their class y. Because of this clustering property in classification problems, we conjecture that there are at most  |Y| − 1phase transitions, where |Y| is the number of classes, with each phase transition differentiating one or more classes.

4.2 REPRESENTATIONAL MAXIMUM CORRELATION

Under certain conditions we can further simplify G[p(z|x)] and gain a deeper understanding of it. Firstly, inspired by maximum correlation (Anantharam et al., 2013), we introduce two new concepts, representational maximum correlation and conditional maximum correlation, as follows.

Definition 4. Given a joint distribution p(X, Y ), and a representation Z satisfying the Markov chain Z − X − Y, the representational maximum correlation  ρr(X, Y ; Z)is defined as

image

where  S1 = {(f : X × Z → R, g : Y × Z → R)�� f, g bounded, and Ex∼p(x|z)[f(x, z)] =Ey∼p(y|z)[g(y, z)] = 0, Ex,z∼p(x,z)[f 2(x, z)] = Ey,z∼p(y,z)[g2(y, z)] = 1}.

The conditional maximum correlation  ρm(X, Y |Z)is defined as:

image

where  S2 = {(f : X → R, g : Y → R)�� f, g bounded, and ∀z ∈ Z : Ex∼p(x|z)[f(x)] =Ey∼p(y|z)[g(y)] = 0, Ex∼p(x|z)[f 2(x)] = Ey∼p(y|z)[g2(y)] = 1}.

We prove the following Theorem 2, which expresses G[p(z|x)] in terms of representational maximum correlation and related quantities, with proof given in Appendix F.

Theorem 2. Define  Q(0)Z|X := {r(z|x) : X × Z → R�� r bounded}. If  Q(0)Z|Xand  QZ|Xsatisfy: ∀r(z|x) ∈ Q(0)Z|X, there exists2  r1(z|x) ∈ QZ|X , s(z) ∈ {s(z) : Z → R | s bounded}s.t. r(z|x) = r1(z|x) + s(z), then we have:

image

(iii) When Z is continuous, an optimal relative perturbation function r(z|x) for G[p(z|x)] is given by

image

where  σ2(Z)is the second largest singular value of the matrix  QX,Y |Z :=� p(x,y|z)√p(x|z)p(y|z)

Theorem 2 furthers our understanding of G[p(z|x)] and the phase transition condition (Theorem 1), which we elaborate as follows.

Discovering maximum correlation in the orthogonal space of a learned representation: Intuitively, the representational maximum correlation measures the maximum linear correlation between f(X, Z) and g(Y, Z) among all real-valued functions f, g, under the constraint that f(X, Z) is “orthogonal” to p(X|Z) and g(Y, Z) is “orthogonal” to p(Y |Z). Theorem 2 (i) reveals that G[p(z|x)] is the inverse square of this representational maximum correlation. Theorem 2 (ii) further shows that G[p(z|x)] is finding a specific  z∗ on which maximum (nonlinear) correlation between X and Y conditioned on Z can be found. Combined with Theorem 1, we have that when we continuously increase  β, for the optimal representation  Z∗β given by p∗β(z|x) at β, ρr(X, Y ; Z∗β)shall monotonically decrease due to that X and Y has to find their maximum correlation on the orthogonal space of an increasingly better representation  Z∗βthat captures more information about X. A phase transition occurs when  ρr(X, Y ; Z∗β)reduces to 1√β, after which as  βcontinues to increase,  ρr(X, Y ; Z∗β)will try to find maximum correlation between X and Y orthogonal to the full previously learned representation. This is reminiscent of canonical-correlation analysis (CCA) (Hotelling, 1992) in linear settings, where components with decreasing linear maximum correlation that are orthogonal to previous components are found one by one. In comparison, we show that in IB, each phase transition corresponds to learning a new nonlinear component of maximum correlation between X and Y in Z, orthogonal to the previously-learned Z. In the case of classification where different classes may have different difficulty (e.g. due to label noise or support overlap), we should expect that classes that are less difficult as measured by a larger maximum correlation between X and Y are learned earlier.

Conspicuous subset conditioned on a single z: Furthermore, we show in (iii) that an optimal relative perturbation function r(z|x) can be decomposed into a product of two factors, a�

factor that only focus on perturbing a specific point  z∗in the representation space, and an  h∗(x)factor that is finding the “conspicuous subset” (Wu et al., 2019), i.e. the most confident, large, typical, and imbalanced subset in the X space for the distribution  (X, Y ) ∼ p(X, Y |z∗).

Singular values In categorical settings, (iv) reveals a connection between G[p(z|x)] and the singular value of the  QX,Y |Zmatrix. Due to the property of SVD, we know that the square of the singular values of  QX,Y |Zequals the non-negative eigenvalue of the matrix  QTX,Y |ZQX,Y |Z. Then the phase transition condition in Theorem 1 is equivalent to a (nonlinear) eigenvalue problem. This is resonant with previous analogy with CCA in linear settings, and is also reminiscent of the linear eigenvalue problem in Gaussian IB (Chechik et al., 2005).

As a consequence of the theoretical analysis above, we are able to derive an algorithm to efficiently estimate the phase transitions for a given model architecture and dataset. This algorithm also permits us to empirically confirm some of our theoretical results in Section 6.

Typically, classification involves high-dimensional inputs X. Without sweeping the full range of  βwhere at each  βit is a full learning problem, it is in general a difficult task to estimate the phase transitions. In Algorithm 1, we present a two-stage approach.

In the first stage, we train a single maximum likelihood neural network  fθwith the same encoder architecture as in the (variational) IB to estimate p(y|x), and obtain an  N × C matrix p(y|x), whereN is the number of examples in the dataset and C is the number of classes. In the second stage, we perform an iterative algorithm w.r.t.  G and β, alternatively, to converge to a phase transition point.

Specifically, for a given  β, we use a Blahut-Arimoto type IB algorithm (Tishby et al., 2000) to efficiently reach IB optimal  p∗β(z|x)at  β, then use SVD (with the formula given in Theorem 2 (iv)) to efficiently estimate  G[p∗β(z|x)]at  β(step 8). We then use the  G[p∗β(z|x)]value as the new  βand do it again (step 7 in the next iteration). At convergence, we will reach the phase transition point given by  G[p∗β(z|x)] = β(Theorem 1). After convergence as measured by patience parameter K, we slightly increase  β by δ(step 13), so that the algorithm can discover the subsequent phase transitions.

We quantitatively and qualitatively test the ability of our theory and Algorithm 1 to provide good predictions for IB phase transitions. We first verify them in fully categorical settings, where X, Y, Z are all discrete, and we show that the phase transitions can correspond to learning new classes as we increase  β. We then test our algorithm on versions of the MNIST and CIFAR10 datasets with added label noise.

image

6.1 CATEGORICAL DATASET

For categorical datasets, X and Y are discrete, and p(X) and p(Y |X) are given. To test Theorem 1, we use the Blahut-Arimoto IB algorithm to compute the optimal  p∗β(z|x)for each  β. I(Y ; Z∗)vs.  βis plotted in Fig. 2 (a). There are two phase transitions at  βc0and  βc1. For each  βand the corresponding  p∗β(z|x), we use the SVD formula (Theorem 2) to compute  G[p∗β(z|x)], shown in Fig. 2 (b). We see that  G[p∗β(z|x)] = βat exactly the observed phase transition points  βc0and  βc1. Moreover, starting at  β = 1, Alg. 1 converges to each phase transition points within few iterations. Our other experiments with random categorical datasets show similarly tight matches.

Furthermore, in Appendix G we show that the phase transitions correspond to the onset of separation of p(z|x) for subsets of X that correspond to different classes. This supports our conjecture from Section 4.1 that there are at most  |Y| − 1phase transitions in classification problems.

6.2 MNIST DATASET

For continuous X, how does our algorithm perform, and will it reveal aspects of the dataset? We first test our algorithm in a 4-class MNIST with noisy labels3, whose confusion matrix and experimental settings are given in Appendix H. Fig. 3 (a) shows the path Alg. 1 takes. We see again that in each

image

Figure 2: (a)  I(Y ; Z∗) vs. βfor a categorical dataset with  |X| = |Y | = |Z| = 3, where Z∗ is givenby  p∗β(z|x), and the vertical lines are the experimentally discovered phase transition points  βc0 and βc1.(b)  G[p∗β(z|x)] vs. βfor the same dataset, and the path for Alg. 1, with  βc0 and βc1 in (a) also plotted. The dataset is given in Fig. 5.

image

Figure 3: (a) Path of Alg. 1 starting with  β = 1, where the maximum likelihood model  fθis using the same encoder architecture as in the CEB model. This stairstep path shows that Alg. 1 is able to ignore very large regions of  β, while quickly and precisely finding the phase transition points. Also plotted is an accumulation of  G[p∗β(z|x)] vs. βby running Alg. 1 with varying starting  β(blue dots). (b) Per-class accuracy vs.  β, where the accuracy at each  βis from training an independent CEB model on the dataset. The per-class accuracy denotes the fraction of correctly predicted labels by the CEB model for the observed label  ˜y.

phase Alg. 1 converges to the phase transition points within a few iterations, and it discovers in total 3 phase transition points. Similar to the categorical case, we expect that each phase transition corresponds to the onset of learning a new class, and that the last class is much harder to learn due to a larger separation of  β. Therefore, this class should have a much larger label noise so that it is hard to capture this component of maximum correlation between X and Y , as analyzed in representational maximum correlation (Section 4.2). Fig. 3 (b) plots the per-class accuracy with increasing  βfor running the Conditional Entropy Bottleneck (Fischer, 2018) (another variational bound on IB). We see that the first two predicted phase transition points  βc0, βc1closely match the observed onset of learning class 3 and class 0. Class 1 is observed to learn earlier than expected, possibly due to the gap between the variational IB objective and the true IB objective in continuous settings. By looking at the confusion matrix for the label noise (Fig. 7), we see that the ordering of onset of learning: class 2, 3, 0, 1, corresponds exactly to the decreasing diagonal element  p(˜y = 1|y = 1)(increasing noise) of the classes, and as predicted, class 1 has a much smaller diagonal element  p(˜y = 1|y = 1)than the other three classes, which makes it much more difficult to learn. This ordering of classes by difficulty is what our representational maximum correlation predicts.

image

Figure 4: (a) Accumulated  G[p∗β(z|x)]vs.  βby running Alg. 1 with varying starting  β(blue dots). Also plotted are predicted phase transition points. (b) I(X; Z) and I(Y ; Z) vs.  β. The manually-identified phase transition points are labelled with arrows. The vertical black lines are the phase transitions identified by Alg. 1, denoted as  βc0, βc1,...  βc8, from left to right. (c) Accuracy vs.  βwith the same sets of points identified. The most interesting region is right before  β = 2, where accuracy decreases with  β. Alg. 1 identifies both sides of that region, as well as points at or near all of the early obvious phase transitions. It also seems to miss later transitions, possibly due to the gap between the variational IB objective and the true IB objective in continuous settings.

6.3 CIFAR10 DATASET

Finally, we investigate the CIFAR10 experiment from Section 1. The details of the experimental setup are described in Appendix I. This experiment stretches the current limits of our discrete approximation to the underlying continuous representation being learned by the models. Nevertheless, we can see in Fig. 4 that many of the visible empirical phase transitions are tightly identified by Alg. 1. Particularly, the onset of learning is predicted quite accurately; the large interval between the predicted  β3 = 1.21and  β4 = 1.61corresponds well to the continuous increase of I(X; Z) and I(Y ; Z) at the same interval. And Alg. 1 is able to identify many dense transitions not obviously seen by just looking at I(Y ; Z) vs. βcurve alone. Alg. 1 predicts 9 phase transitions, exactly equal to  |Y| − 1for CIFAR10.

In this work, we observe and study the phase transitions in IB as we vary  β. We introduce the definition for IB phase transitions, and based on it derive a formula that gives a practical condition for IB phase transitions. We further understand the formula via Jensen’s inequality and representational maximum correlation. We reveal the close interplay between the IB objective, the dataset and the learned representation, as each phase transition is learning a nonlinear maximum correlation component in the orthogonal space of the learned representation. We present an algorithm for finding the phase transitions, and show that it gives tight matches with observed phase transitions in categorical datasets, predicts onset of learning new classes and class difficulty in MNIST, and predicts prominent transitions in CIFAR10 experiments. This work is a first theoretical step towards a deeper understanding of the phenomenon of phase transitions in the Information Bottleneck. We believe our approach will be applicable to other “trade-off” objectives, like  β-VAE (Higgins et al., 2017) and InfoDropout (Achille & Soatto, 2018a), where the model’s ability to predict is balanced against a measure of complexity.

The authors would like to thank Alex Alemi, Kevin Murphy, Sergey Ioffe, Isaac Chuang and Max Tegmark for helpful discussions.

Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018a.

Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018b.

Alessandro Achille, Glen Mbeng, and Stefano Soatto. The dynamics of differential learning i: Information-dynamics and task reachability. arXiv preprint arXiv:1810.02440, 2018.

Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

Venkat Anantharam, Amin Gohari, Sudeep Kamath, and Chandra Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by erkip and cover. arXiv preprint arXiv:1304.6133, 2013.

Richard Blahut. Computation of channel capacity and rate-distortion functions. IEEE transactions on Information Theory, 18(4):460–473, 1972.

Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for gaussian variables. Journal of machine learning research, 6(Jan):165–188, 2005.

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

Ian Fischer. The conditional entropy bottleneck, 2018. URL openreview.net/forum?id= rkVOXhAqY7.

Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Matthew Botvinick, Hugo Larochelle, Sergey Levine, and Yoshua Bengio. Infobot: Transfer and exploration via the information bottleneck. arXiv preprint arXiv:1901.10902, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl.

Harold Hotelling. Relations between two sets of variates. In Breakthroughs in statistics, pp. 162–190. Springer, 1992.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1412. 6980.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, CIFAR, 2009.

Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow. arXiv preprint arXiv:1810.00821, 2018.

Mélanie Rey and Volker Roth. Meta-gaussian information bottleneck. In Advances in Neural Information Processing Systems, pp. 1916–1924, 2012.

Danilo Jimenez Rezende and Fabio Viola. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.

Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696–2711, 2010.

Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.

DJ Strouse and David J Schwab. The deterministic information bottleneck. Neural computation, 29 (6):1611–1630, 2017a.

DJ Strouse and David J Schwab. The information bottleneck and geometric clustering. arXiv preprint arXiv:1712.09657, 2017b.

Max Tegmark and Tailin Wu. Pareto-optimal data compression for binary classification tasks. arXiv preprint arXiv:1908.08961, 2019.

Naftali Tishby. Lecture: the information theory of deep neural networks: the statistical physics aspects. https://www.perimeterinstitute.ca/videos/ information-theory-deep-neural-networks-statistical-physics-aspects/, 2018.

Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Tailin Wu, Ian Fischer, Isaac Chuang, and Max Tegmark. Learnability for the information bottleneck. arXiv preprint arXiv:1907.07331, 2019.

S. Zagoruyko and N. Komodakis. Wide Residual Networks. arXiv: 1605.07146, 2016.

Pablo Zegers. Fisher information properties. Entropy, 17(7):4918–4939, 2015.

image

Here we prove the Lemma 2.1, which will be crucial in the lemmas and theorems in this paper that follows.

Lemma 2.1. For a relative perturbation function  r(z|x) ∈ QZ|X for a p(z|x), where r(z|x) satisfiesEz∼p(z|x)[r(z|x)] = 0, we have that the IB objective can be expanded as

image

where  r(z|y) = Ex∼p(x|y,z)[r(z|x)] and r(z) = Ex∼p(x|z)[r(z|x)]. The expectations in the equations are all w.r.t. all variables. For example  E[r2(z|x)] = Ex,z∼p(x,z)[r2(z|x)].

Proof. Suppose that we perform a relative perturbation r(z|x) on p(z|x) such that the perturbed conditional probability is  p′(z|x) = p(z|x) (1 + ϵ · r(z|x)), then we have

image

Therefore, we can denote the corresponding relative perturbation r(z) on p(z) as

image

Similarly, we have

p′(z|y) = p′(y, z)p(y) = 1p(y)

And we can denote the corresponding relative perturbation r(z|y) on p(z|y) as

image

Since

image

We have

image

The  0th-order term is simply IBβ[p(z|x)]. The first order term is

image

The  nth-order term for  n ≥ 2 is

image

− β · (−1)nϵnn(n − 1)�Ey,z∼p(y,z)[rn(z|y)] − nEy,z∼p(y,z)[r(z|y)rn−1(z)] − (n − 1)Ez∼p(z)[rn(z)]�= (−1)nϵnn(n − 1) {(E[rn(z|x)] − E[rn(z)]) − β · (E[rn(z|y)] − E[rn(z)])}

In the last equality we have used

image

Combining the terms with all orders, we have

image

As a side note, the KL-divergence between  p′(z|x) = p(z|x)(1 + ϵ · r(z|x)) and p(z|x) is

image

Therefore, to the second order, we have

image

Similarly, we have  Ex∼p(x) [KL (p(z|x)||p′(z|x))] = ϵ22 E[r2(z|x)]up to the second order. Using similar procedure, we have up to the second-order,

image

Proof. From Lemma 2.1, we have

image

The condition of

image

is equivalent to

image

Using Jensen’s inequality and the convexity of the square function, we have

image

The equality holds iff  r(z|y) = Ex∼p(x|y,z)[r(z|x)]is constant w.r.t. y, for any z.

Using Jensen’s inequality on  E[r2(z)], we have  E[r2(z)] = Ez∼p(z)��Ex∼p(x|z)[r(z|x)]�2� ≤Ez∼p(z)�Ex∼p(x|z)[r2(z|x)]�= E[r2(z|x)], where the equality holds iff r(z|x) is constant w.r.t. x for any z.

When  E[r2(z|y)] − E[r2(z)] > 0, we have that the condition Eq. (16) is equivalent to  ∀r(z|x) ∈QZ|X , β ≤ E[r2(z|x)]−E[r2(z)]E[r2(z|y)]−E[r2(z)], i.e.

image

If  E[r2(z|y)] − E[r2(z)] = 0, substituting into Eq. (16), we have

image

which is always true due to that  E[r2(z|x)] ≥ E[r2(z)], and will be a looser condition than Eq. (17) above. Above all, we have Eq. (17).

image

Empirical estimate of G[p(z|x)] To empirically estimate G[p(z|x)] from a minibatch of {(xi, yi)}, i = 1, 2, ...Nand the encoder p(z|x), we can make the following Monte Carlo importance sampling estimation, where we use the samples  {xj} ∼ p(x)and also get samples of {zi} ∼ p(z) = p(x)p(z|x), and have:

image

image

Here  Ωx(yi)denotes the set of x examples that has label of  yi, and  1[·]is an indicator function that takes value 1 if its argument is true, 0 otherwise.

The requirement of  Ez∼p(z|x)[r(z|x)] = 0 yields

image

for any  xj.

Combining all terms, we have that the empirical ˆG[p(z|x)]is given by

image

where  {zi} ∼ p(z) and {xi} ∼ p(x). It is also possible to use different distributions for importance sampling, which will results in different formulas for empirical estimation of G[p(z|x)].

image

Proof. For the parameterized4 pθ(z|x)with  θ ∈ Θ, after  θ′ ← θ + ∆θ, where5 ∆θ ∈ Θis an infinitesimal perturbation on  θ, we have that the distribution changes from  pθ(z|x)to  pθ+∆θ(z|x),

and thus the relative perturbation on  pθ(z|x) is

image

where  ∥∆θ∥is the norm of  ∆θin the parameter field  Θ.

Similarly, we have

image

Substituting the above expressions into the expansion of IBβ[p(z|x)]in Eq. (12), and preserving to the second order  ∥∆θ∥2, we have

image

In the last equality we have used  Ex,z∼pθ(x,z)[ 1pθ(z|x)∂2pθ(z|x)∂θ2 ] =�dxp(x) ∂2∂θ2�dzpθ(z|x) =�dxp(x) ∂2∂θ2 1 = 0, and similarly  Ey,z∼pθ(y,z)[ 1pθ(z|y)∂2pθ(z|y)∂θ2 ] = 0. In other words, the ∥∆θ∥2terms in the first-order variation  δIBβ[pθ(z|x)]vanish, and the remaining  ∥∆θ∥2are all in  δ2IBβ[pθ(z|x)]. Also in the last expression,  IZ(θ) ≡�dzpθ(z)�∂logpθ(z)∂θ � �∂logpθ(z)∂θ �Tis the

Fisher information matrix of  θ for Z, IZ|X(θ) ≡�dxdzp(x)pθ(z|x)�∂logpθ(z|x)∂θ � �∂logpθ(z|x)∂θ �T,

IZ|Y (θ) ≡�dydzp(y)pθ(z|y)�∂logpθ(z|y)∂θ � �∂logpθ(z|y)∂θ �Tare the conditional Fisher information matrix (Zegers, 2015) of  θ for Zconditioned on X and Y , respectively.

Let us look at

image

Firstly, note that  δ2IBβ[pθ(z|x)]is a quadratic function of  ∆θ, and the scale of  ∆θdoes not change the sign of  δ2IBβ[pθ(z|x)], so the condition of  ∀∆θ ∈ Θ, δ2IBβ[pθ(z|x)] ≥ 0is invariant to the scale of  ∆θ, and is describing the “curvature” in the infinitesimal neighborhood of  θ. Therefore,  ∆θcan explore any value in  Θ. Secondly, we see that Eq. (21) is a special case of Eq. (14) with  ϵ · r(z|x) = ∆θT ∂∂θlog pθ(z|x). Therefore, The inequalities due to Jensen still hold: ϵ2 �E[r2(z|x)] − E[r2(z)]� = ∆θT �IZ|X(θ) − IZ(θ)�∆θ ≥ 0, ϵ2 �E[r2(z|y)] − E[r2(z)]� =∆θT �IZ|Y (θ) − IZ(θ)�∆θ ≥ 0. If  ∆θT �IZ|Y (θ) − IZ(θ)�∆θ > 0, then the condition of ∀∆θ ∈ Θ, δ2IBβ[pθ(z|x)] ≥ 0is equivalent to  ∀∆θ ∈ Θ,

image

i.e.

image

If  ∆θT �IZ|Y (θ) − IZ(θ)�∆θ = 0, we have that Eq. (21) always holds, which is a looser condition than Eq. (22). Above all, we have that the condition of  ∀∆θ ∈ Θ, δ2IBβ[pθ(z|x)]is equivalent to β ≤ GΘ[pθ(z|x)].

Moreover,  (GΘ[pθ(z|x)])−1given by Eq. (22) has the format of a generalized Rayleigh quotient R(A, B; x) ≡ ∆θT A∆θ∆θT B∆θ where A = IZ|Y (θ)−IZ(θ) and B = IZ|X(θ)−IZ(θ)are both Hermitian matrices6, which can be reduced to Rayleigh quotient  R(D, CT ∆θ) = (CT ∆θ)T D(CT ∆θ)(CT ∆θ)T (CT ∆θ) , with thetransformation  D = C−1A(CT )−1 where CCT is the Cholesky decomposition of  B = IZ|X(θ) −IZ(θ). Moreover, we have that when  GΘ[pθ(z|x)]attains its minimum value, the Reyleigh quotient R(D, CT ∆θ)attains its maximum value of  λmaxwith  CT ∆θ = vmax, i.e.  ∆θ = (CT )−1vmax, where  λmaxis the largest eigenvalue of  D and vmaxthe corresponding eigenvector.

image

Proof. Define

image

where  Eβ[·]denotes taking expectation w.r.t. the optimal solution  p∗β(x, y, z) = p(x, y)p∗β(z|x)at β. Using Lemma 2.1, we have that the IB phase transition as defined in Definition 3 corresponds to satisfying the following two equations:

image

Now we prove that  Tβ(β′)is continuous at  β′ = β, i.e. ∀ε > 0, ∃δ > 0 s.t. ∀β ∈ (β − δ, β + δ), wehave  |Tβ(β′) − Tβ(β)| < ϵ.

From Eq. (23), we have  Tβ(β′) − Tβ(β) = −(β′ − β) ·�Eβ[r2(z|y)] − Eβ[r2(z)]�. Since r(z|x) isbounded, i.e.  ∃M > 0 s.t. ∀z ∈ Z, x ∈ X, |r(z|x)| ≤ M, we have

image

Hence,  |Tβ(β′) − Tβ(β)| = |β′ − β|��Eβ[r2(z|y)] − Eβ[r2(z)]�� ≤ 2|β′ − β|M 2.

To prove that  Tβ(β′)is continuous at  β′ = β, we have  ∀ε > 0, ∃δ = ε2M 2 > 0, s.t.  ∀β′ ∈

image

Hence  Tβ(β′)is continuous at  β′ = β.

Combining the continuity of  Tβ(β′) at β′ = β, and Eq. (24) and (25), we have  Tβ(β) = 0, which isequivalent to  G[p∗β(z|x)] = βafter simple manipulation.

image

E INVARIANCE OF  G[r(z|x); p(z|x)]TO ADDITION OF A GLOBAL REPRESENTATION

image

Proof. When we r(z|x) is shifted by a global transformation  r′(z|x) ← r(z|x) + s(z), we have r′(z) ← Ex∼p(x|z)[r(z|x) + s(z)] = Ex∼p(x|z)[r(z|x)] + s(z)Ex∼p(x|z)[1] = r(z) + s(z), and similarly  r′(z|y) ← r(z|y) + s(z).

The numerator of G[r(z|x); p(z|x)] is then

image

Symmetrically, we have

image

Therefore,  G[r(z|x); p(z|x)] =Ex,z∼p(x,z)[r2(z|x)]−Ez∼p(z)[r2(z)]Ey,z∼p(y,z)[r2(z|y)]−Ez∼p(z)[r2(z)]is invariant to  r′(z|x) ← r(z|x) +s(z).

image

Proof. Using the condition of the theorem, we have that  ∀r(z|x) ∈ Q0Z|X, there exists  r1(z|x) ∈QZ|Xand  s(z) ∈ {s : Z → R|s bounded}s.t.  r(z|x) = r1(z|x) + s(z). Note that the only difference between  QZ|X and Q(0)Z|X is that QZ|X requires Ep(z|x)[r1(z|x)] = 0. Using Lemma 2.2, we have

image

where r(z|x) doesn’t have the constraint of  Ep(z|x)[·] = 0.

After dropping the constraint of  Ez∼p(z|x)[r(z|x)] = 0, again using Lemma 2.2, we can let r(z) = Ex∼p(x|z)[r(z|x)] = 0(since we can perform the transformation  r′(z|x) ← r(z|x) − r(z), so that the new  r′(z) ≡ 0). Now we get a simpler formula for G[p(z|x)], as follows:

image

where  Q(1)Z|X := {r : X × Z → R��Ex∼p(x|z)[r(z|x)] = 0, r bounded}.

From Eq. (26), we can further require that  Ex,z∼p(x,z)[r2(z|x)] = 1. Define

image

where7 Q(2)Z|X := {r : X × Z → R��Ex∼p(x|z)[r(z|x)] = 0, Ex,z∼p(x,z)[r2(z|x)] = 1, r bounded}. Comparing with Eq. (26), it immediately follows that

image

(i) We only have to prove that  ρs(X, Y ; Z) = ρr(X, Y ; Z), where ρr(X, Y ; Z)is defined in Defini-tion 4.

image

where  F(y, z) :=�dxp(x|y, z)f(x, z). We have used Cauchy-Schwarz inequality, where the equality holds when  g(y, z) = αF(y, z)for some  α. Since  E[g2(y, z)] = 1, we have  α2E[F 2(y, z)] = 1.

Taking the supremum of  (E[f(X, Z)g(Y, Z)])2 w.r.t. f and g, we have

image

Here  S1is defined in Definition 4. By definition both  ρr(X, Y ; Z) and ρs(X, Y ; Z)take non-negative values. Therefore,

image

(ii) Using the definition of  ρr(X, Y ; Z), we have

image

where W[f(x, z)] :=�dyp(y|z)��dxp(x|y, z)f(x, z)�2.

Denote  c(z) := p(z)Ex∼p(x|z)[f 2(x, z)], we have�c(z)dz = Ex,z∼p(x,z)[f 2(x, z)] = 1. Then the supremum  ρ2r(X, Y ; Z) = sup

image

supremum:

image

where  Q(3)Z|X := {X × Z → R�� Ex∼p(x|z)[f 2(x, z)] = c(z)p(z), Ex∼p(x|z)[f(x, z)] = 0, f bounded}We can think of the inner supremum  supf(x,z)∈Q(3)Z|X W[f(x, z)]as only w.r.t. x, for some given z.

Now let’s consider another supremum:

image

where  Q(h)X := {h : X → R�� Ep(x|z)[h(x)] = 0, Ep(x|z)[h2(x)] = 1, h bounded}. Using similar technique in (ii), it is easy to prove that it equals  ρ2m(X, Y |Z)as defined in Definition 4.

Comparing Eq. (30) and the supremum:

image

we see that the only difference is that in the latter  Ex∼p(x|z)[f 2(x, z)] equals c(z)p(z) instead of 1. Since W[f(x, z)] is a quadratic functional of f(x, z), we have

image

Therefore,

image

where in the last equality we have let c(z) have “mass” only on the place where  ρ2m(X, Y |Z = z)attains supremum w.r.t. z.

(iii) When Z is a continuous variable, let  f(x, z) = fX(x)�

function,  z0is a parameter,  fX(x) ∈ Q(f)X|Z, with  Q(f)X|Z := {fX : X → R�� fX bounded; ∀Z ∈ Z :

image

Therefore, such constructed  f(x, z) = fX(x)�

ρs(X, Y ; Z)(which equals  ρr(X, Y ; Z)by Eq. 28).

Substituting in the special form of f(x, z) into the expression of  ρs(X, Y ; Z)in Eq. (27), we have

image

We can identify  supfX(X)∈Q(f)X|Z E[(E[fX(X)|Y, Z = z])2|Z = z]with  ρ2m(X, Y |Z = z)because fX(x)satisfies the requirement for conditional maximum correlation that  Ep(x|z)[fX(x)] = 0and Ep(x|z)[f 2X(x)] = 1, for any z, and using the same technique in (i), it is straightforward to prove that supfX(X)∈Q(f)X|Z E[(E[fX(X)|Y, Z = z])2|Z = z]equals the conditional maximum correlation as defined in Definition 4.

Since the conditional maximum correlation can be viewed as the maximum correlation between X and Y , where  X, Y ∼ p(X, Y |Z), using the equality of  (β0[h(x)])−1 = ρ2m(X; Y )(Eq. 7 in Wu et al. (2019)), we can identify the  h(x) in β0[h(x)] with the fX(X)here, and an optimal  f ∗X(X) thatmaximizes  ρ2m(X, Y |Z)is also an optimal  h∗(x)that minimizes  β0[h(x)].

image

Let column vectors  u1 =�p(x|z)and  v1 =�p(y|z)(note that z is given and fixed). Also let u2 = f(x)�p(x|z) and v2 = g(y)�p(y|z). Denote inner product  ⟨u, v⟩ ≡ �i uivi, and the length of a vector as  ||u|| =�⟨u, u⟩. We have  ||u1|| = ||v1|| = 1due to the normalization of probability, ||u2|| = ||v2|| = 1due to  Ex∼p(x|z)[f 2(x)] = Ey∼p(y|z)[g2(y)] = 1, and  ⟨u1, u2⟩ = ⟨v1, v2⟩ = 0due to  Ex∼p(x|z)[f(x)] = Ey∼p(y|z)[g(y)] = 0. Furthermore, we have

image

which is exactly the second largest singular value  σ2(Z)of the matrix  QX,Y |Z. Using the result in (ii), we have that  ρr(X, Y ; Z) = maxZ∈Z σ2(Z).

image

image

In this section we study the behavior of p(z|x) on the phase transitions. We use the same categorical dataset (where |X| = |Y | = |Z| = 3 and p(x) is uniform, and p(y|x) is given in Fig. 5). In Fig. 6 we show the p(z|x) on the simplex before and after each phase transition. We see that the first phase transition corresponds to the separation of x = 2 (belonging to  y = 2) w.r.t. x ∈ {0, 1}(belonging to classes  y ∈ {0, 1}), on the p(z|x)simplex. The second phase transition corresponds to the separation of x = 0 with x = 1. Therefore, each phase transition corresponds to the ability to distinguish subset of examples, and learning of new classes.

image

We use the MNIST training examples with class 0, 1, 2, 3, with a hidden label-noise matrix as given in Fig. 7, based on which at each minibatch we dynamically sample the observed label. We use conditional entropy bottleneck (CEB) (Fischer, 2018) as the variational IB objective, and run multiple independent instances with different the target  β. We jump start learning by started training at β = 100for 100 epochs, annealing  βfrom 100 down to the target  βover 600 epochs, and continue to train at the target epoch for another 800 epochs. The encoder is a three-layer neural net, where each hidden layer has 512 neurons and leakyReLU activation, and the last layer has linear activation. The classifier p(y|z) is a 2-layer neural net with a 128-neuron ReLU hidden layer. The backward encoder p(z|y) is also a 2-layer neural net with a 128-neuron ReLU hidden layer. We trained with Adam (Kingma & Welling, 2013) at learning rate of  10−3, and anneal down with factor  1/(1+0.01·epoch).For Alg. 1, for the  fθwe use the same architecture as the encoder of CEB, and use |Z| = 50 in Alg. 1.

image

We use the same CIFAR10 class confusion matrix provided in Wu et al. (2019) to generate noisy labels with about 20% label noise on average (reproduced in Table 1). We trained  28 × 1Wide ResNet (He et al., 2016; Zagoruyko & Komodakis, 2016) models using the open source implementation from Cubuk et al. (2018) as encoders for the Variational Information Bottleneck (VIB) (Alemi et al., 2016). The 10 dimensional output of the encoder parameterized a mean-field Gaussian with unit covariance. Samples from the encoder were passed to the classifier, a 2 layer MLP. The marginal distributions were mixtures of 500 fully covariate 10-dimensional Gaussians, all parameters of which are trained.

With this standard model, we trained 251 different models at  βfrom 1.0 to 6.0 with step size of 0.02. As in Wu et al. (2019), we jump-start learning by annealing  βfrom 100 down to the target  β. We dothis over the first 4000 steps of training. The models continued to train for another 56,000 gradient steps after that, a total of 600 epochs. We trained with Adam (Kingma & Ba, 2015) at a base learning rate of  10−3, and reduced the learning rate by a factor of 0.5 at 300, 400, and 500 epochs. The models converged to essentially their final accuracy within 40,000 gradient steps, and then remained stable.

image

Figure 5: p(y|x) for the categorical dataset in Fig. 2 and Fig. 6. The value in  ith row and jth columndenotes p(y = j|x = i). p(x) is uniform.

image

Figure 6: (a) I(Y ; Z) vs.  βfor the dataset given in Fig. 5. The phase transitions are marked with vertical dashed line, with  βc0 = 2.065571and  βc1 = 5.623333. (b)-(e) Optimal  p∗β(z|x)for four values of  β, i.e. (b)  β = 2.060, (c)  β = 2.070, (d)  β = 5.620(e)  β = 5.625(their  βvalues are also marked in (a)), where each marker denotes p(z|x = i) for a given  i ∈ {0, 1, 2}.

image

Figure 7: Confusion matrix for MNIST experiment. The value in  ithrow and  jthcolumn denotes p(˜y = j|y = i)for the label noise.

The accuracies reported in Figure 4 are averaged across five passes over the training set. We use |Z| = 50 in Alg. 1.

Table 1: Class confusion matrix used in CIFAR10 experiments, reproduced from (Wu et al., 2019). The value in row i, column j means for class i, the probability of labeling it as class j. The mean confusion across the classes is 20%.

image


Designed for Accessibility and to further Open Science