The goal of a supervised learning algorithm is to construct a model on the training set that can generalize well on new data. Yet, generalization is an elusive property, involving intractable quantities to be approximated or bound –like the generalization error– or notions with multiple definitions –like that of model complexity. As a result
there are many different theoretical routes to generalization, leading to often apparently ‘contradictory’ conclusions with one another or with empirical evidence [31]. For instance, why are certain learning algorithms that explore overparameterized or non-parametric model families so good at producing models that can generalize well, even without explicit regularization [5, 11, 31]? A unified language for comparing the complexity of models trained on a given dataset can help us identify good model selection and algorithmic practices that guide the learning algorithm towards models that are complex enough to not underfit yet also maximally resistant to overfitting.
In this paper we make a step towards this direction by bridging two –until now disconnected– theoretical paths to generalization in the case of classification, namely information theory [16] inspired by recent advances on the information bottleneck principle [18,23,24] and margin theory [13,25]. From an information-theoretic perspective, we would like our learning algorithm to learn a model that contains all the information from the features necessary for describing the target (we call this property losslessness) and no more information beyond that (we call this property maximal compression). Margin theory suggests constructing a model that can correctly classify as many training examples as possible with as high confidence as possible (i.e. one that maximizes the quantity known as the functional margin over the training set). We prove that in the case of classification of on noiseless (i.e. unambiguously labelled) datasets, functional margin maximization is equivalent to lossless maximal compression in the information-theoretic sense. The existence of margin-based bounds on the generalization error implies that margin maximization is beneficial for achieving good generalization and therefore so is lossless maximal compression.
Our experiments on gradient boosting, a method that maximizes the training margins, show empirically that on noiseless data, margin maximization amounts to lossless maximal compression and that maximally compressed models on average exhibit the highest generalization capability (as estimated by the test error). We identify interesting similarities between the way training progresses in Deep Neural Networks (DNNs) and in gradient boosting and gain useful insights on the training of gradient boosting algorithms. All findings persist across a wide range of datasets & hyperparameter configurations.
To our knowledge, there is no prior work establishing the connection between functional margin maximization and lossless model compression in the information-theoretic sense. Both margin theory and information theory have been individually connected to generalization and have been used to explain resistance to overfitting. The idea that functional margin maximization promotes good generalization can be traced back to [25]. It has been used in the theoretical analysis of Boosting algorithms [13], with recent work using it to explain good generalization in DNNs [7,12,20,27]. The related notion of has been used to justify good generalization in Support Vector Machines (SVMs) [6]. The idea that a learned representation of a dataset that generalizes well is one that extracts from the features all the useful information for predicting the target and no more, is captured in information theoretic terms under the information bottleneck principle [23]. Recent work has offered insights into the good generalization capabilities of DNNs, utilizing these ideas [18,24]. More recently, bounds on the generalization error of a learning algorithm in terms of the mutual information between its input and outputhave been established [2,29].
2.1 Binary Classification
A classification algorithm, receives as input a training dataset S consisted of n pairs of feature vectors
and corresponding
. The training set is drawn i.i.d. from some unknown probability measure
. We shall focus on
. In this setting, we consider w.l.o.g. the output of the learning algorithm (model) as a function
that allows us to predict the label on unseen examples drawn from P with feature vector x as
. Given any probability measure
and any function
denote the probability of making an error with respect to distribution P,
Ideally the learning algorithm will output the model with the lowest possible risk w.r.t. the unknown distribution P, i.e. a function f that minimizes the true classification risk . However, since we do not have direct access to the unknown distribution P we must estimate P with the empirical measure
defined for each set
in terms of the training data
The empirical risk of a function
is given by
In what follows we shall refer to a general motivating example here is the empirical measure
which is supported on the finite set S, the training dataset.
2.2 Information theory
We now present some basic definitions and properties from information theory [16]. Let A & B be random variables (RVs), with alphabets A & B with probability distribution measure P. We shall assume that P is finitely supported and there exist finite subsets
The entropy of a RV A, measures the amount of uncertainty associated with its value when only its distribution is known. It is defined by
The amount of information shared by RVs A & B is their mutual information, defined as
In terms of information theory, the chain rule of probability takes the form
where
which measures the uncertainty of the value of RV A given the value of RV B. From Eq. (5), it is clear that measures the decrease in uncertainty about either the value of RV A or the value of RV B, when the value of the other RV is known.
If B is a deterministic transformation of A then there is no uncertainty remaining about the value of B given the value of A, so we have . Finally, if G(A) is an invertible transformation of RV A, we have
as the value of A grants us perfect knowledge of the value of G(A) and vice-versa.
In what follows we shall be particularly interested in the is the empirical measure of Eq. (2).
2.2.1 The information bottleneck principle
Suppose we wish to learn a compressed representation F = f(X) from the original features X that is useful for predicting a target variable Y . Treating X, Y & F as RVs, the Information Bottleneck principle [23], offers a way to select a representation F, by trading-off the information the learned representation F captures from X regarding the target variable Y , i.e. –the higher, the better for predicting Y – and the total information it captures from X, i.e.
–the lower, the higher the degree compression. We thus look for a representation
where is a Lagrange multiplier that controls the aforementioned tradeoff.
Recently, this principle has been used to explain good generalization in DNNs [18,24] and the use of training set estimates of Eq. (7) or variants of it as objective functions for training DNNs –and other models– have been growing in popularity [1,19,22]. The reasoning is that compression controls for the complexity of the learned representation, thus promoting good generalization [15].
In this work we draw inspiration from the above line of research and our findings further reinforce the role of information compression in promoting generalization. We
set S. We shall call such an intuitively ‘ideal’ model F a lossless maximal compressor of the training set S.
2.3 Margin maximization
The (normalized) of a training example
under a model f is defined as
It is a combined measure of confidence and correctness of the classification of the example under f. Its sign encodes whether the example is correctly classified () or misclassified (
), while the magnitude of the margin (i.e. the magnitude of the score
) measures the confidence of the model in its prediction (the higher, the more confident).
Maximizing the margins over the training set has been connected to good generalization [13,25]. An upper bound to the generalization error of an AdaBoost classifier f, based on its minimum margin over the training set, is proven in [13]. Tighter generalization bounds, dependent not only on the minimum margin but on the entire distribution of the training margins have been derived (e.g. Emargin bound [26], k-th margin bound [8]). Beyond boosting, such bounds hold for voting classifiers in general and recently similar bounds have been derived for DNNs [7,12,20,27].
In this work, we will establish an equivalence between models that maximize the margins on a noiseless training set S and lossless maximal compressors of S. We will verify our observations empirically, using boosting, a method that explicitly minimizes a monotonically decreasing loss function of the margin (i.e. maximizes training examples’ margins). As we will see, boosting drives learning towards a lossless maximal compressor of the noiseless training dataset. It achieves the lowest generalization error (estimated by the average test set error) once lossless maximal compression has been achieved.
3.1 An information-theoretic view of datasets & models
We will now define properties that capture the relationship between the information content of the model’s output (F) and the information present in the features & the target (X & Y , respectively) as measured on the training dataset S. In Figure 3.1 we provide a visual summary of these properties and their possible combinations. In Table 3.1 we summarize the information-theoretic equalities and inequalities that hold under each scenario. Proofs not directly following the statement of a lemma or theorem can be found in the Supplementary Material.
S. Typically, the model constructed by a learning algorithm is a member of some given model family. In this work we impose no restriction on the model space, i.e. is the set of all models. Being a deterministic transformation of X, F = f(X) cannot contain more information than X. So,
Definition 1 (Noiselessness) A probability distribution P is noiseless if and only if
Otherwise, P is noisy and . We shall say that a dataset S is noiseless (respectively, noisy) if the corresponding empirical measure
is noiseless (respectively, noisy).
noiseless dataset S, is one in which the features X, contain all information to perfectly describe the target Y .
Given a distribution denote a minimizer of the risk i.e.
In particular, when P is the underlying distribution P then sign is the Bayes classifier. When P is the empirical measure
then sign
is the empirical risk minimiser where
Lemma 1 A dataset S is noiseless if and only if
In other words, a noiseless training dataset S is one in which no datapoints with the same feature vector x have different labels y. For such a dataset, there exists a model that can achieve zero empirical risk (training error), i.e. that can perfectly classify the training data. In other words, there exists some deterministic mapping from the features X to the target Y .
We shall now introduce properties that make a model f useful for the purposes of capturing relevant and ignoring redundant information from a training set S.
Definition 2 (Losslessness) A model f is lossless on the dataset S if and only if . Otherwise, the model is lossy on
A losslessmodel f on a dataset S is one that captures all the information in features X that is relevant for describing the target Y . We can equivalently state that the r.v. F is a sufficient statistic of the empirical distribution
of the training data.
Lemma 2 Suppose dataset S is noiseless. A model f is lossless on S if and only if there exists an invertible transformation
Lemma 2 means that if a model f is lossless on a training set S, its output can be used to describe the target Y with the only source of training error being the irreducible class overlap in the training set.
Definition 3 (Maximal Compression) A model is a maximal compressor of the dataset S if and only if
. Otherwise, the model is undercompressed on
A model f that is a maximal compressor of a training dataset S is one that only captures from the features X information relevant for describing the target Y . It does not necessarily capture all that information; this special case, merits a definition of its own given below.
Definition 4 (Lossless Maximal Compression - (LMC)) A model f is a lossless maximal compressor (LMC) of a training dataset S if and only if it is lossless on S and a maximal compressor on S.
Proposition 1 A model f is an LMC of a training dataset S, if and only if it satisfies
Proof. Follows straightforwardly from Definition 2 & Definition 3.
A model f that is an LMC of a training dataset S is one that only captures from the features X all the information relevant for describing the target Y . From an information-theoretic perspective, an LMC of S is the optimal classification model that can be constructed from S.
We have defined the notion of a noiseless / noisy training dataset S and those of a lossy / lossless f on S and of an undercompressed / maximally compressed model f on S. In Figure 3.1 we provide a visual summary of these properties and their possible combinations in the form of entropy Venn diagrams. In Table 3.1 we summarize the relationships among the various information-theoretic quantities involved that hold under each scenario.
In the next subsections, we shall use the properties we defined here to obtain a better understanding of what types of models the information-theoretically optimal model, the LMC corresponds to for a noiseless dataset S and for a noisy dataset S.
Table 1 Relationships among various information-theoretic quantities under the possible scenarios (see Figure 3.1) relating the feature joint RV X and the target RV Y in a training dataset S with a model f whose output is distributed as a RV F.
Fig. 1 Venn diagrams capturing the possible relationships between the information content of the feature joint RV X and the target RV Y in a training dataset S with a RV F representing the output of a model f (shaded). [Left] S is a noiseless dataset; X contains all information to perfectly describe Y . [Right] S is a noisy dataset; X does not contain all information to perfectly describe Y . [From Top to Bottom] (i) A lossy, undercompressed model f. (ii) A lossy, maximally compressed model f. (iii) A lossless, undercompressed model f. (iv) A lossless, maximally compressed model f. Table 3.1 shows the equalities & inequalities involving the various underlying information-theoretic terms.
3.2 Noiseless data: Equivalence of lossless maximal compression & margin maximisation
Let us first focus on the special case of a noiseless dataset S, i.e. a dataset that does not contain any datapoints with the same feature vector but different labels. We will then discuss the noisy case where ambiguously labelled datapoints can be present in the dataset.
The noiseless case merits a special discussion for several reasons: (i) It is the typical case studied in the literature and as such it allows us to connect our observations to existing work. (ii) It allows us to establish an equivalence between information theoretic lossless maximal compression and margin maximization. (iii) It is a very common case in practice since in large dimensional datasets, encountering datapoints that have the same feature vector but different class labels are typically expected to be rare
We will now show the equivalence between lossless maximal compression and margin maximisation on a noiseless dataset S.
Theorem 1 Suppose S is noiseless and finitely supported. A model f is an LMC with respect to S if and only if there exists some invertible transformation such that
is a margin maximizer with respect to S.
Under Theorem 1 a classification model that maximizes the training margins on a given noiseless datasetis one that captures all the information present in the features of that dataset relevant for predicting the target label and no more. Conversely, since an LMC is a margin maximizer, it offers the same guarantees on the generalization error as the latter. Note that Theorem 1 captures a relationship between a noiseless dataset S and a model f, regardless of the underlying learning algorithm that produced it (i.e. the model family it explores or the optimization method used to explore it).
From Lemma 1 & Lemma 2, we have that a lossless model f on a noiseless training dataset S is one whose output can be used to classify every training example to the correct class (i.e. S is separable by f). The success of algorithms that generate models that can the data, yet, despite exploring overparameterized model spaces, are resistant to overfitting (e.g. gradient boosting, random forests, SVMs and DNNs) has recently attracted considerable research interest [3,9,28].
Our work connects these findings to information theory and margin theory: we posit that models generated by such methods are typically not simply lossless, but actually LMCs, hence margin maximizers and their good generalization follows via the margin-based generalization bounds. Algorithms such as the aforementioned, have mechanisms for promoting both losslessness (interpolation, in a noiseless dataset), guaranteeing the model produced will not underfit (afforded by overparameterizing the model space) and maximal compression (afforded via explicit or implicit margin maximization) which produces a model from that space that is maximally resistant to overfitting.
3.3 Noisy data: The equivalence collapses
Let us now discuss the case of a noisy dataset S and how it differs from the case of noiseless data.
In the noiseless case, any model that correctly classifies every training datapoint in S (i.e. achieves zero training error) is a lossless model on S. In the noisy case this observation is no longer relevant, as there exist at least 2 datapoints which are noisy, i.e. have the same feature vector x, but different labels y. It is no longer the case that there exists a model f that can perfectly separate the data.
We can also rephrase the observation stated above as follows: Any model f that yields the minimal achievable training error on a noiseless training dataset S is a lossless model on S. As we will see from Lemma 3, this condition is no longer sufficient for f to be lossless on a noisy training dataset S.
A model that minimizes the training error on a noisy dataset S will be one that classifies all points in S with the same feature vector x to the majority class among them. Furthermore, a margin maximizer is a model that minimizes the training error while also assigning maximal absolute score to its predictions (i.e.
It is easy to see that – unlike in the case of a noiseless training dataset S where a margin maximizer was an LMC– in the noisy case, a margin maximizer cannot even be a lossless model. This is a direct consequence of Lemma 3, the proof of which can be found in Section A of the Supplementary Material.
Lemma 3 Suppose that X and Y are discrete random variables taking values in X and , respectively. Suppose that
I(X; Y ) = I(F; Y ) if and only if the map
is constant on all sets of the form
In simpler terms, Lemma 3 tells us that if for two feature vectors f satisfies
, then it also has to be the case that
for f to be lossless (and inversely). Therefore, a margin maximizer, i.e. a model assigning maximal (i.e. the same) score both to noiseless positive examples (unambiguously labelled positive examples, for which P(Y = 1|X = x) = 1) and to noisy popsitive examples (ambiguously labelled examples, i.e. ones with 0.5 < P(Y = 1|X = x) < 1) violates the condition of losslessness.
We therefore see that a margin maximizing model f of a noisy training dataset S cannot be an LMC of S as it is not even lossless on S. Furthermore, as margin maximizers are themselves training error minimizers, this implies that not all training error minimizers of S are LMCs (or even lossless) on S either. These observations are summarized in Table 2.
A lossless model (one satisfying ) is one that captures all the information present in the features X relevant for predicting the target Y . In the case of a noisy training dataset, this information includes the uncertainty introduced by the ambiguous labelling of a feature vector x, i.e. P(Y = 1|X = x). So a lossless model should assign different scores f(x) to feature vectors f(x) which have different values of P(Y = 1|X = x).
Moreover, f is an LMC (has the minimum that allows
) iff it is lossless while using the fewest values f(x) possible to encode the empirical P(Y = 1|X = x), i.e. have as many distinct values for f(x) as there are distinct values of P(Y = 1|X = x).
The above discussion provides an intuition into the limitations of margin maximization approaches in the presence of label noise. The sub-optimality of boosting (a margin maximization approach) in the presence of label noise has been observed in earlier studies [4, 10, 14] and here we provide an information-theoretic justification for this phenomenon. Simply put, the strategy of maximizing the margins on a noisy dataset is not producing a lossless maximally compressing model on that dataset. In fact, the resulting margin maximizing model, is not even going to be lossless on the training dataset as it will fail to capture the uncertainty over the labels. When the training data are noisy, we should instead aim to produce models whose scores f(x) capture the underlying empirical distribution P(Y = 1|X = x) (lossless). Ideally, we should aim for strategies producing models whose scores correspondence to P(Y = 1|X = x) (LMCs).
Table 2 A brief comparison of loslessness and lossless maximal compression (LMC) and how it relates to training error minimization and margin maximization depending on whether the training dataset S is noisy or noiseless.
4.1 Experimental Setup
Boosting, a method that explicitly maximizes the margins of the training examples, can be shown empirically to also converge to LMC models on noiseless datasets. After lossless maximal compression is achieved, so is the minimal generalization error, as estimated by the error on the test set. To demonstrate this, we plot the trajectory of the boosting ensemble on the
vs.
. For each boosting round
denotes the RV of which the ensemble’s outputs are realizations.
The experiments were carried out on binary classification tasks on both real-world UCI datasets and artificial data (dataset descriptions in the Supplementary Material). Qualitatively, the results are similar for all datasets (see Figure 2 as well as Section C of the Supplementary Material). The boosting ensemble consisted of a maximum of T = 100 decision trees (i.e. rounds of boosting) of maximal depth 6. No shrinkage of the updates or subsampling of the examples was performed (both are techniques to counter overfitting), and the exponential loss function was used (i.e. the loss minimized by AdaBoost). We performed no hyperparameter optimization. Plotting trajectories on the information plane follows [18,24]. All information-theoretic quantities were estimated on the training data by first discretizing the features & model outputs in b = 100 equal-sized bins, then using maximum likelihood estimators. The joint RV X was then constructed by the discretized features
average results across 100 runs with different train-test splits (50%–50%) on the same original data. We also visualize the trajectories obtained by some random individual runs to showcase that although they can vary significantly from one another, they all follow the same general pattern. All datasets & code used in the experiments can be downloaded at https://github.com/nnikolaou/margin_maximization_LMC.
4.2 Results & Analysis
Let us first introduce some characteristic points on the information plane: Lossless maximal compression (LMC) point: A red star on the information plane denotes the point of lossless maximal compression – the optimal feasible point a model f can occupy on the plane on a given dataset – on which is the maximal achievable while
is minimal. On this point,
and for a noiseless dataset,
Average margin maximization point: With a hollow green circle on the information plane, we denote the model (round of boosting) under which the average (equiv. total) margin is first minimized. Training error minimization point: With a full black dot on the information plane, we denote the model (round of boosting) under which the training error is first minimized (losslessness is achieved). At this point
has reached its maximum, so for a noiseless dataset,
Test error minimization point: With a magenta square on the information plane, we denote the model (round of boosting) under which the test error (proxy for generalization error) is first minimized.
Let us now summarize our observations from Figure 2 & the figures of Section C of
the Supplementary Material:
Boosting leads to lossless maximal compression: In all datasets, the boosting ensemble traces a trajectory on the information plane that leads to the LMC point and
once it reaches it in never escapes.
Lossless maximal compression coincides with margin maximization: In all datasets the image on the information plane of the models that minimize the margin
coincides with the LMC point. Lossless maximal compression coincides with maximal generalization: The
Fig. 2 Trajectory of the boosting ensemble on the normalized information plane as the rounds of boosting progress. We highlight the point on which the training error is first minimized (full black circle), the point on which the test error is first minimized (magenta square), the point on which the margins are first maximized (hollow green circle) and the lossless maximal compression point (red star). From TOP to BOTTOM (dataset): waveform, krvskp, musk2, credit; LEFT: Average trajectory across 100 runs; RIGHT: Some random individual trajectories. Notice that on all datasets, boosting traces a trajectory that leads to the LMC point, the latter coinciding with the margin maximization point and –on average– with the test error minimization point. The results are qualitatively consistent across individual runs: trajectories can vary significantly yet the aforementioned observations hold.
Fig. 3 Trajectory of the boosting ensemble on the normalized information plane as the rounds of boosting progress. We highlight the point on which the training error is first minimized (full black circle), the point on which the test error is first minimized (magenta square), the point on which the margins are first maximized (hollow green circle) and the lossless maximal compression point (red star). From TOP to BOTTOM (dataset): sonar, splice, semeion, wdbc; LEFT: Average trajectory across 100 runs; RIGHT: Some random individual trajectories.
Fig. 4 Trajectory of the boosting ensemble on the normalized information plane as the rounds of boosting progress. We highlight the point on which the training error is first minimized (full black circle), the point on which the test error is first minimized (magenta square), the point on which the margins are first maximized (hollow green circle) and the lossless maximal compression point (red star). From TOP to BOTTOM (dataset): heart, congress, landsat, mushroom; LEFT: Average trajectory across 100 runs; RIGHT: Some random individual trajectories.
Fig. 5 Trajectory of the boosting ensemble on the normalized information plane as the rounds of boosting progress. We highlight the point on which the training error is first minimized (full black circle), the point on which the test error is first minimized (magenta square), the point on which the margins are first maximized (hollow green circle) and the lossless maximal compression point (red star). From TOP to BOTTOM (dataset): ionosphere, parkinsons; LEFT: Average trajectory across 100 runs; RIGHT: Some random individual trajectories.
point of the ensemble’s trajectory corresponding to the minimal test error coincides –on average– with the LMC point on the information plane (and so does the margin maximization point). In other words, LMCs correspond to the models exhibiting –on average– the best generalization behaviour.
Average trajectory shape: After the training error is minimized, the test error can be further decreased by training for more rounds. This is a known result in boosting, explained via margin theory. Here we give an information-theoretic interpretation. Training until training error minimization, amounts to achieving losslessness. Subsequent rounds result in travelling along the line of maximal on the information plane, towards the LMC point. This compresses the model f (relieves it of remaining information from X irrelevant for predicting Y ), decreasing its effective complexity
Training in boosting consists of 2 (typically distinct) phases: The results suggest the presence of 2 distinct phases during training under gradient boosting. A similar behaviour was observed in [18] for the trajectories of the representations learned by DNNs. Following the terminology of [18], these are the empirical risk minimization (ERM) phase, when
increases (the model better fits the training data) but typically so does
(the model uses more information from X) and the compres-
decreases (the model uses increasingly less information from X, reducing its effective complexity), without decreasing
. We can view the ERM phase as decreasing the bias of the model while not decreasing its variance and the compression phase as reducing variance while not increasing bias. The ERM phase is usually much shorter than the compression phase, as is the case with DNNs [18]. Although typically we do observe these 2 phases as distinct in the average trajectories, they need not be, as was also observed in subsequent studies in DNNs [30]. Trajectories of individual runs, are not as smooth as the average trend; we can even observe steps that increase both bias & variance. However, the 2 phases still appear to be distinct: once losslessness is achieved (ERM phase terminates), it is maintained and pure compression begins.
as losslessness can be achieved, additional boosting rounds do not hurt generalization. Once the model reaches the LMC point on the information plane, it never escapes it. Subsequent iterations neither increase the training nor the test error. This suggests that early stopping with boosting is unnecessary for improving generalization and agrees with recent observations [28]. General margin losses minimized via stochastic gradient descent (SGD) also exhibit similar behaviour [21].
aforementioned observations hold across different datasets and hyperparameter settings. Section C of the Supplementary Material contains more results supporting this claim. They also hold if we change the number of bins used to discretize the features (provided the dataset remains noiseless) or the scores (provided they are
Margin maximization as a built-in regularization mechanism: Additional regularization techniques like subsampling or shrinkage are not the main reason why boosting regularizes. Their contribution is small compared to the algorithms’ built-in regularization mechanism: margin maximization, which as we saw amounts to lossless maximal compression of the training dataset. This is another similarity shared with DNNs trained with SGD that achieve good generalization by tracing a similar trajectory on the information plane [18], and additional regularization control (e.g. dropout or batch normalization) is beneficial, but not the main contributor to their good generalization [11,18,31].
We characterized from an information theoretic perspective, models trained on a given training set w.r.t. the information they capture from it. Under this light, we identified an ideal model trained on a given dataset as its lossless maximal compressor (LMC): one capturing all the information from the features relevant for predicting the target and no more. We then established that an LMC is –in the case of classification– equivalent to a margin maximizer of the dataset (provided it is noiseless, i.e. consistently labelled). The existence of margin-based bounds on the generalization error implies that margin maximization, hence lossless maximal compression, is beneficial to generalization.
Our experiments on gradient boosting, demonstrate that indeed, margin maximization amounts to lossless maximal compression on noiseless data. The evolution of the model constructed by boosting, traces a trajectory on the information plane that leads to the point of lossless maximal compression which also coincides with the point of margin maximization and the point on average exhibiting the best generalization. In agreement with recent studies on boosting, we observe that early stopping is unnecessary for improving generalization [28] and identify interesting similarities between how training progresses in DNNs and in gradient boosting in terms of the trajectory they trace on the information plane [18]. All observations persist across a wide range of datasets and hyperparameter configurations.
This work gives an information-theoretic interpretation of margin maximization and provides us with a principled way to define model complexity for the purposes of generalization, thus shedding more light on the success of methods like gradient boosting. It also opens various directions for future work. For instance, exploring how these concepts can be applied in model selection or to inform learning algorithm design to more efficiently traverse the information plane to reach the LMC point. It would also be of interest to identify the analogue of the LMC in learning tasks other than classification, like ranking or regression.
Acknowledgements This project was partially supported by the EPSRC LAMBDA [EP/N035127/1] & Anyscale Apps [EP/L000725/1] project grants and the EU Horizon 2020 research & innovation programme [grant No 758892, ExoAI]. NN acknowledges the support of the EPSRC Doctoral Prize Fellowship at the University of Manchester and the NVIDIA Corporation’s GPU grant. The authors thank Konstantinos Sechidis, Konstantinos Papangelou & Ingo Waldmann for their useful comments and suggestions.
1. Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. arXiv preprint arXiv:1612.00410 (2016)
2. Asadi, A., Abbe, E., Verdú, S.: Chaining mutual information and tightening generalization bounds. In: Advances in Neural Information Processing Systems, pp. 7234–7243 (2018)
3. Belkin, M., Hsu, D., Ma, S., Mandal, S.: Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118 (2018)
4. Bootkrajang, J., Kabán, A.: Boosting in the presence of label noise. In: Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 82–91. AUAI Press (2013)
5. Bühlmann, P., Hothorn, T.: Boosting algorithms: Regularization, prediction and model fitting. Statistical Science pp. 477–505 (2007)
6. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). DOI 10.1023/A:1022627411411. URL https://doi.org/10.1023/A:1022627411411
7. Dziugaite, G.K., Roy, D.M.: Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008 (2017)
8. Gao, W., Zhou, Z.H.: On the doubt about margin explanation of boosting. Artificial Intelligence 203, 1–18 (2013)
9. Hastie, T., Montanari, A., Rosset, S., Tibshirani, R.J.: Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560 (2019)
10. Kalai, A., Servedio, R.A.: Boosting in the presence of noise. In: Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pp. 195–205. ACM (2003)
11. Kawaguchi, K., Kaelbling, L.P., Bengio, Y.: Generalization in deep learning. arXiv preprint arXiv:1710.05468 (2017)
12. Neyshabur, B., Bhojanapalli, S., Srebro, N.: Exploring generalization in deep learning. In: Advances in Neural Information Processing Systems, pp. 5943–5952 (2017)
13. Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S., et al.: Boosting the margin: A new explanation for the effectiveness of voting methods. The annals of statistics 26(5), 1651–1686 (1998)
14. Servedio, R.A.: Smooth boosting and learning with malicious noise. Journal of Machine Learning Research 4(Sep), 633–648 (2003)
15. Shamir, O., Sabato, S., Tishby, N.: Learning and generalization with the information bottleneck. Proc. of Advances in Learning Theory 8, 92–107 (2008)
16. Shannon, C.E.: A mathematical theory of communication. Bell system technical journal 27(3), 379–423 (1948)
17. Shen, C., Li, H.: On the dual formulation of boosting algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(12), 2216–2231 (2010)
18. Shwartz-Ziv, R., Tishby, N.: Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810 (2017)
19. Simeone, O., et al.: A brief introduction to machine learning for engineers. Foundations and Trends in Signal Processing 12(3-4), 200–431 (2018)
20. Sokolić, J., Giryes, R., Sapiro, G., Rodrigues, M.R.: Generalization error of deep neural networks: Role of classification margin and data structure. In: Sampling Theory and Applications (SampTA), 2017 International Conference on, pp. 147–151. IEEE (2017)
21. Soudry, D., Hoffer, E., Srebro, N.: The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345 (2017)
22. Strouse, D., Schwab, D.J.: The deterministic information bottleneck. Neural computation 29(6), 1611–1630 (2017)
23. Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. arXiv preprint physics/0004057 (2000)
24. Tishby, N., Zaslavsky, N.: Deep learning and the information bottleneck principle. In: Information Theory Workshop (ITW), 2015 IEEE, pp. 1–5. IEEE (2015)
25. Vapnik, V.: Estimation of dependences based on empirical data. Springer Series in Statistics (1982)
26. Wang, L., Sugiyama, M., Jing, Z., Yang, C., Zhou, Z.H., Feng, J.: A refined margin analysis for boosting algorithms via equilibrium margin. Journal of Machine Learning Research 12(Jun), 1835–1863 (2011)
27. Wei, C., Lee, J.D., Liu, Q., Ma, T.: On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369 (2018)
28. Wyner, A.J., Olson, M., Bleich, J., Mease, D.: Explaining the success of adaboost and random forests as interpolating classifiers. arXiv preprint arXiv:1504.07676 (2015)
29. Xu, A., Raginsky, M.: Information-theoretic analysis of generalization capability of learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2524–2533 (2017)
30. Xu, Z.Q.J., Zhang, Y., Xiao, Y.: Training behavior of deep neural network in frequency domain. arXiv preprint arXiv:1807.01251 (2018)
31. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016)
A. Proofs
In this section we shall prove Lemma 1, Lemma 2 and Theorem 1. Rather than proving these results directly we shall instead prove generalisations to an arbitrary finitely supported probability measure P (Lemma 6, Lemma 5 and Theorem 2). The sample based (i.e. dataset based) results used in the main paper correspond to the special case in which the probability measure P is the empirical measure
Definition 5 A finitely supported probability distribution P is noiseless if and only if
Definition 5 generalises Definition 1 which corresponds to the special case in which P is the empirical measure . The proofs of the following results require the following elementary lemma.
Lemma 4 Let the function for
. Then we have
with equality if and only if
We shall now prove Lemma 5 which generalises Lemma 1.
Lemma 5 A finitely supported probability distribution P is noiseless if and only if
Proof. We can write out the conditional entropy in terms of
as follows,
Since is non-negative it follows that
if and only if for each
have
which is the case if and only if
Now suppose that so for each
{0, 1}. Then we can define
so that
, so
. Thus for each
One the other hand, if then we must have P (Y = y|X = x) = 1 if
otherwise. Thus,
for each
Definition 6 generalises Definition 2.
Definition 6 A model f is lossless with respect to P if and only if
Lemma 6 Suppose P is noiseless. A model f is lossless with respect to P if and only if there exists an invertible transformation
is lossless if and only if
is the case if and only if
, where we have used Eq. (5) and the assumption that P is noiseless. Moreover, as in the proof of Lemma 1 we have
Using the fact that on [0, 1] with equality only at {0, 1} we infer that
if and only if for each
{0, 1}.
This implies that for each for y = sign(g(s)), P(Y = y|f(X) = s) = 1 and for
, so in general
is lossless.
Conversely, if f is lossless then for each we can take
and extend arbitrarily to form a bijection. Since f is lossless, for each
. If for some
have P(Y = 1|f(X) = s) = 1 then g(s) > 0 and so
. Similarly, for
and so again
. Hence, in general
Definitions 7 and 6 generalise Definitions 3 and 2, respectively.
Definition 7 A model f is a maximal compressor of a distribution P if and only if
Definition 8 A model f is a lossless maximal compressor (LMC) of a training dataset S if and only if it is lossless on S and a maximal compressor on S.
Proposition 2 generalises Proposition 1.
Proposition 2 A model f is an LMC of a finitely supported probability distribution P, if and only if it satisfies
Proof. Follows straightforwardly from Definition 6 & Definition 7.
Finally we shall prove Theorem 2 which generalises Theorem 1.
Theorem 2 Suppose P is noiseless and finitely supported. A model f is an LMC with respect to P if and only if there exists some invertible transformation such that
is a margin maximizer with respect to P.
Proof. As we saw in the proof of Lemma 1, the fact that P is noiseless implies that for each and
we have
. We form partition partition
so that for
and for
,
Now suppose that for some invertible transformation is a margin maximizer with respect to P. Hence, if g(f(x)) = 1 for
and
for
. This implies that
. Thus, by Lemma 2, f is lossless. Moreover, g is invertible this is equivalent to
for
and
for
. Hence, P(f(X) = s|Y = y) = 1 when
and P(f(X) = s|Y = y) = 0 otherwise, so in general
where the final inequality follows from the fact that P (F = f(x)|X = x) = 1. Hence, we have . It follows that f is a maximal compressor and we have already shown that f is lossless.
we infer from Lemma 2 that there is some transformation such
, which in turn implies that if
and
then
. Moreover, since f is a maximal compressor we must have
which implies
Thus, for each and
we have
, where once again we use the fact that
is non-negative with zero attained at {0, 1}. Hence, there exists some
such that for all
and some
such that for all
. Thus, if we choose
to be any invertible map with
and
we see that
is a margin maximiser. This completes the proof of the theorem.
Finally, we prove Lemma 7 which generalises Lemma 3.
Lemma 7 Suppose that X and Y are discrete random variables taking values in X and , respectively. Suppose that
I(X; Y ) = I(F; Y ) if and only if the map
is constant on all sets of the form
Proof. The proof uses the entropy functional by
. Note that
is strictly concave. Now observe that
where we have used the fact that P(X = x|F = z) = P(X = x)/P(F = z) if z = f(x) and P(X = x|F = z) = 0 otherwise. We also have
with equality if and only if P(Y = 1|X = x) is constant for all with P(X = x|F = z) > 0, so constant for all
. Hence, we have
equality if and only if for each
is constant on
To conclude note that
with equality if and only H(Y |X) = H(Y |F) which holds if and only if for each
is constant on
B. Details of datasets used
B1. Artificial Data
The artificial dataset was generated by function. We generated 2000 examples, each consisting of 20 features, only 2 of which were relevant for predicting the class. The examples belonged to 2 different clusters for each of the 2 classes, each cluster’s points normally distributed (with unit standard deviation) about vertices of a 2-sided hypercube. Some label noise was added by randomly flipping the label of each point with probability 0.01. For more information see the function’s documentation at http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html.
B2. UCI Datasets
Table 3 shows the characteristics of the real-world datasets used in our experiments. The original datasets are all from the UCI repository. Examples with missing values were discarded. The multiclass datasets were converted to balanced binary ones by setting the minority class as the ‘positive’ one and uniformly sampling examples from the remaining classes to form the ‘negative’ class. A link to the final datasets will be provided along with the code used to generate the results.
Table 3 Characteristics of the UCI datasets used in our experiments; number of instances used, number of features and number of classes before binarization.
6.1 C. Additional experimental results
This section contains additional experimental results that further showcase the consistency of the trajectory of the boosting ensemble on the information plane towards the lossless maximally compressing (LMC) model point across different datasets and hyperparameter settings.
Figure 6 shows only the average trajectories of the boosting ensemble trained on the mushroom dataset using base learners of varying capacity, demonstrating that the
Fig. 6 Effect of the capacity of the weak learner (max. tree depth d) on the average trajectory of the boosting ensemble on the normalized information plane as the rounds of boosting progress. All plots are for the mushroom dataset. The same points of interest as in the previous figures are shown. TOP LEFT: d = 1; TOP RIGHT: d = 2; MID LEFT: d = 3; MID RIGHT: d = 4; BOTTOM LEFT: d = 5; BOTTOM RIGHT: d = 6.
trajectories are again qualitatively similar. Naturally, the lower the capacity of an individual learner, the more boosting rounds are required to reach the LMC point.
Figure 7 shows the same results as Figures 2–5 on artificial data generated as described in Section B1 of this Supplementary Material. It also demonstrates that changing the loss function, using subsampling of the examples for the purposes of the updates or shrinkage of the updates does not change the trajectory qualitatively.
Fig. 7 Effect of hyperparameters on the trajectory of the boosting ensemble on the normalized information plane as the rounds of boosting progress. All results are on artificial data generated as described in Section B1 of this Supplementary Material. All points of interest from the previous figures are shown. ROW 1: Exponential loss, no shrinkage, no subsampling; ROW 2: Exponential loss, shrinkage with , no subsampling; ROW 3: Exponential loss, no shrinkage, subsampling set to 0.8; ROW 4: Binomial deviance loss, no shrinkage, no subsampling; LEFT: Average trajectory across 100 runs; RIGHT: Some random individual trajectories.