Implicit biases introduced by optimization algorithms play an crucial role in learning deep neural networks [Neyshabur et al., 2015b,a, Hochreiter and Schmidhuber, 1997, Keskar et al., 2016, Chaudhari et al., 2016, Dinh et al., 2017, Andrychowicz et al., 2016, Neyshabur et al., 2017, Zhang et al., 2017, Wilson et al., 2017, Hoffer et al., 2017, Smith, 2018]. Large scale neural networks used in practice are highly over-parameterized with far more trainable model parameters compared to the number of training examples. Consequently, optimization objectives for learning such high capacity models have many global minima that fit training data perfectly. However, minimizing the training loss using specific optimization algorithms take us to not just any global minima, but some special global minima, e.g., global minima minimizing some regularizer . In over-parameterized models, specially deep neural networks, much, if not most, of the inductive bias of the learned model comes from this implicit regularization from the optimization algorithm. Understanding the implicit bias, e.g., via characterizing
, is thus essential for understanding how and what the model learns.
For example, in linear regression we understand how minimizing an under-determined model (with more parameters than samples) using gradient descent yields the minimum norm solution, and for linear logistic regression trained on linearly separable data, Soudry et al. [2017] recently showed that gradient descent converges in the direction of the hard margin support vector machine solution, even though the norm or margin is not explicitly specified in the optimization problem. Such minimum norm or maximum margin solutions are of course very special among all solutions or separators that fit the training data, and in particular can ensure generalization Bartlett and Mendelson [2003], Kakade et al. [2009].
Changing the optimization algorithm, even without changing the model, changes this implicit bias, and consequently also changes generalization properties of the learned models [Neyshabur et al., 2015a, Keskar et al., 2016, Wilson et al., 2017, Gunasekar et al., 2017, 2018]. For example, for linear logistic regression, using coordinate descent instead of gradient descent return a maximum solution instead of the hard margin support vector solution solution—an entirely different inductive bias Telgarsky [2013], Gunasekar et al. [2018].
Similarly, and as we shall see in this paper, changing to a different parameterization of the same model class can also dramatically change the implicit bias Gunasekar et al. [2017]. In particular, we study the implicit bias of optimizing multi-layer fully connected linear networks, and linear convolutional networks (multiple full width convolutional layers followed by a single fully connected layer) using gradient descent. Both of these types of models ultimately implement linear transformations, and can implement any linear transformation. The model class defined by these networks is thus simply the class of all linear predictors, and these models can be seen as mere (over) parameterizations of the class of linear predictors. Minimizing the training loss on these models is therefore entirely equivalent to minimizing the training loss for linear classification. Nevertheless, as we shall see, optimizing these networks with gradient descent leads to very different solutions.
In particular, we show that for fully connected networks with single output, optimizing the exponential loss over linearly separable data using gradient loss again converges to the homogeneous hard margin support vector machine solution. This holds regardless of the depth of the network, and hence, at least with a single output, gradient descent on fully connected networks has the same implicit bias as direct gradient descent on the parameters of the linear predictor. In contrast, training a linear convolutional network with gradient descent biases us toward linear separators that are sparse in the frequency domain. Furthermore, this bias changes with the depth of the network, and a network of depth L (with convolutional layers), implicitly biases towards minimizing the
bridge penalty with
of the Fourier transform
of the learned linear predictor
subject to margin constraints (the gradient descent predictor reaches a stationary point of the
minimization problem). This is a sparsity inducing regularizer, which induces sparsity more aggressively as the depth increases.
Finally, in this paper we focus on characterizing which global minimum does gradient descent on over-parameterized linear models converge to, while assuming that for appropriate choice of step sizes gradient descent iterates asymptotically minimize the optimization objective. A related challenge in neural networks, not addressed in this paper, is an answer to when does gradient descent minimize the non-convex empirical loss objective to reach a global minimum. This problem while hard in worst case, has been studied for linear networks. Recent work have concluded that with sufficient over-parameterization (as is the case with our settings), loss landscape of linear models are well behaved and all local minima are global minima making the problem tractable Burer and Monteiro [2003], Journée et al. [2010], Kawaguchi [2016], Nguyen and Hein [2017], Lee et al. [2016].
Notation We typeface vectors with bold characters e.g., . Individual entries of a vector
are indexed using 0 based indexing as z[d] for
. Complex numbers are represented in the polar form as
with
denoting the magnitude of z and
denoting the phase.
denotes the complex conjugate of z. The complex inner product between
is given by
root of 1 is denoted by
. For
we use the notation
to denote the representation of z in the discrete Fourier basis given by,
For integers D and a, we denote the modulo operator as
. Finally, for multi-layer linear networks (formally defined in Section 2), we will use
to denote parameters of the model in general domain
to denote the equivalent linear predictor.
We consider feed forward linear networks that map input features to a single real valued output
, where w denote the parameters of the network. Such networks can be thought of as directed acyclic graphs where each edge is associated with a weight, and the value at each node/unit is the weighted sum of values from the parent nodes. The input features form source nodes with no incoming edges and the output is a sink node with no outgoing edge. Every such network realizes a linear function
denotes the effective linear predictor.
In multi-layer networks, the nodes are arranged in layers, so an L–layer network represents a composition of L linear maps. We use the convention that, the input is indexed as the zeroth layer l = 0, while the output forms the final layer with l = L. The outputs of nodes in layer l are denoted by
, where
is the number of nodes in layer l. We also use
to denote the parameters of the linear map between
and
, and
to denote the collective set of all parameters of the linear network.
Linear fully connected network In a fully connected linear network, the nodes between successive layers are densely connected with edge weights
, and all the weights are independent parameters. This model class is parameterized by
the computation for intermediate nodes
and the composite linear map
is given by,
Linear convolutional network We consider one-dimensional convolutional network architectures where each non-output layer has exactly D units (same as the input dimensionality) and the linear transformations from layer are given by the following circular convolutional operation1 parameterized by full width filters with weights
The output layer is fully connected and parameterized by weights . The parameters of the model class therefor consists of L vectors of size D collectively denoted by
and the composite linear map
is given by:
Remark: We use circular convolution with a scaling of to make the analysis cleaner. For convolutions with zero-padding, we expect a similar behavior. Secondly, since our goal here to study implicit bias in sufficiently over-parameterized models, we only study full dimensional convolutional filters. In practice it is common to have filters of width K smaller than the number of input features, which can change the implicit bias.
The fully connected and convolutional linear networks described above can both be represented in terms of a mapping that maps the input parameters
to a linear predictor in
such that the output of the network is given by
. For fully connected networks, the mapping is given by
, and for convolutional networks,
, where
denotes the flipped vector corresponding to w,
given by , . . . , D
Separable linear classification Consider a binary classification dataset 1, 2, . . . N} with
and
. The empirical risk minimization objective for training a linear network parameterized as P(w) is given as follows,
where is some surrogate loss for classification accuracy, e.g., logistic loss
and exponential loss
It is easy to see that both fully connected and convolutional networks of any depth L can realize any linear predictor . The model class expressed by both networks is therefore simply the unconstrained class of linear predictors, and the two architectures are merely different (over) parameterizations of this class
Thus, the empirical risk minimization problem in (4) is equivalent to the following optimization over the linear predictors
Although the optimization problems (4) and (5) are exactly equivalent in terms of the set of global minima, in this paper, we show that optimizing (4) with different parameterizations leads to very different classifiers compared to optimizing (5) directly.
In particular, consider problems (4)/(5) on a linearly separable dataset and using the logistic loss (the two class version of the cross entropy loss typically used in deep learning). The global infimum of
is 0, but this is not attainable by any finite
. Instead, the loss can be minimized by scaling the norm of any linear predictor that separates the data to infinity. Thus, any sequence of predictors
(say, from an optimization algorithm) that asymptotically minimizes the loss in eq. (5) necessarily separates the data and diverges in norm,
. In general there are many linear separators that correctly label the training data, each corresponding to a direction in which we can minimize (5). Which of these separators will we converge to when optimizing (4)/(5)? In other words, what is the direction
will diverge in? If this limit exist we say that converges in direction to the limit direction
Soudry et al. [2017] studied this implicit bias of gradient descent on (5) over the direct parameterization of . They showed that for any linearly separable dataset and any initialization, gradient descent w.r.t.
converges in direction to hard margin support vector machine solution:
In this paper we study the behavior of gradient descent on the problem (4) w.r.t different parameterizations of the model class of linear predictors. For initialization and sequence of step sizes
gradient descent updates for (4) are given by,
where denotes the Jacobian of
with respect to the parameters
is the gradient of the loss function in (5).
For separable datasets, if minimizes (4) for linear fully connected or convolutional networks, then we will again have
, and the question we ask is: what is the limit direction
The result in Soudry et al. [2017] holds for any loss function that is strictly monotone in uy with specific tail behavior, name the tightly exponential tail, which is satisfied by popular classification losses like logistic and exponential loss. In the rest of the paper, for simplicity we exclusively focus on the exponential loss function
, which has the same tail behavior as that of the logistic loss. Along the lines of Soudry et al. [2017], our results should also extend for any strictly monotonic loss function with a tight exponential tail, including logistic loss.
Our main results characterize the implicit bias of gradient descent for multi-layer fully connected and convolutional networks with linear activations. For the gradient descent iterates in eq. (7), we henceforth denote the induced linear predictor as
Assumptions. In the following theorems, we characterize the limiting predictor
under the following assumptions:
These assumptions allow us to focus on the question of which specific linear predictor do gradient descent iterates converge to by separating it from the related optimization questions of when gradient descent iterates minimize the non-convex objective in eq. (5) and nicely converge in direction.
Figure 1: Implicit bias of gradient descent for different linear network architectures.
Theorem 1 (Linear fully connected networks). For any depth L, almost all linearly separable datasets , almost all initializations
, and any bounded sequence of step sizes
consider the sequence gradient descent iterates
in eq. (7) for minimizing
in eq. (4) with exponential loss
–layer fully connected linear networks.
If (a) the iterates minimize the objective, i.e.,
, (b)
, and consequently
, converge in direction to yield a separator with positive margin, and (c) gradients with respect to linear predictors
converge in direction, then the limit direction is given by,
For fully connected networks with single output, Theorem 1 shows that there is no effect of depth on the implicit bias of gradient descent. Regardless of the depth of the network, the asymptotic classifier is always the hard margin support vector machine classifier, which is also the limit direction of gradient descent for linear logistic regression with the direct parameterization of
In contrast, next we show that for convolutional networks we get very different biases. Let us first look at a 2–layer linear convolutional network, i.e., a network with single convolutional layer followed by a fully connected final layer.
Recall that denote the Fourier coefficients of
, i.e.,
and that any non-zero is denoted in polar form as
for
. Linear predictors induced by gradient descent iterates
for convolutional networks are denoted by
. It is evident that if
converges in direction to
, then its Fourier transformation
converges in direction to
. In the following theorems, in addition to the earlier assumptions, we further assume a technical condition that the phase of the Fourier coefficients
converge
coordinate-wise. For coordinates d with
, in which case
. We assume such a
also exists when
Theorem 2 (Linear convolutional networks of depth two). For almost all linearly separable datasets , almost all initializations
, and any sequence of step sizes
with
smaller than the local Lipschitz at
, consider the sequence gradient descent iterates
in eq. (7) for minimizing
with exponential loss over 2–layer linear convolutional networks.
If (a) the iterates minimize the objective, i.e.,
converge in direction to yield a separator
with positive margin, (c) the phase of the Fourier coefficients
of the linear predictors
converge coordinate-wise, i.e.,
, and (d) the gradients
converge in direction, then the limit direction
is given by,
We already see how introducing a single convolutional layer changes the implicit bias of gradient descent—even without any explicit regularization, gradient descent on the parameters of convolutional network architecture returns solutions that are biased to have sparsity in the frequency domain.
Furthermore, unlike fully connected networks, for convolutional networks we also see that the implicit bias changes with the depth of the network as shown by the following theorem. Theorem 2a (Linear Convolutional Networks of any Depth). For any depth L, under the conditions of Theorem 2, the limit direction
point of the following optimization problem,
where the penalty given by
(also called the bridge penalty) is a norm for p = 1 and a quasi-norm for p < 1.
When L > 2, and thus , problem (10) is non-convex and intractable Ge et al. [2011]. Hence, we cannot expect to ensure convergence to a global minimum. Instead we show convergence to a first order stationary point of (10) in the sense of sub-stationary points of Rockafellar [1979] for optimization problems with non-smooth and non-convex objectives. These are solutions where the local directional derivative along the directions in the tangent cone of the constraints are all zero.
The first order stationary points, or sub-stationary points, of (10) are the set of feasible predictors such that
satisfying the following:
where is the Fourier transformation of
, and
denotes the local sub-differential (or Clarke’s sub-differential) operator defined as
For p = 1 and represented in polar form as
is convex and the local sub-differential is indeed the global sub-differential given by,
For p < 1, the local sub-differential of is given by,
Figures 1a–1b summarize the implications of the main results in the paper. The proof of this Theorem, exploits the following representation of in the Fourier domain. Lemma 3. For full-dimensional convolutions,
is equivalent to
where for are the Fourier coefficients of the parameters
From above lemma (proved in Appendix C), we can see a connection of convolutional networks to a special network where the linear transformation between layers is restricted to diagonal entries (see depiction in Figure 1c), we refer to such networks as linear diagonal network.
The proof of Theorem 1 and Theorem 2-2a are provided in Appendix B and C, respectively.
We can decompose the characterization of implicit bias of gradient descent on a parameterization P(w) into two parts: (a) what is the implicit bias of gradient descent in the space of parameters w?, and (b) what does this imply in term of the linear predictor , i.e., how does the bias in parameter space translate to the linear predictor learned from the model class?
We look at the first question for a broad class of linear models, where the linear predictor is given by a homogeneous polynomial mapping of the parameters: are the parameters of the model and
satisfies definition below. This class covers the linear convolutional, fully connected networks, and diagonal networks discussed in Section 3.
Definition (Homogeneous Polynomial). A multivariate polynomial function is said to be homogeneous, if for some finite integer
Theorem 4 (Homogeneous Polynomial Parameterization). For any homogeneous polynomial map from parameters
to linear predictors, almost all datasets
separable by
, almost all initializations
, and any bounded sequence of step sizes
, consider the sequence of gradient descent updates
for minimizing the empirical risk objective
with exponential loss
If (a) the iterates asymptotically minimize the objective, i.e.,
, (b)
, and consequently
, converge in direction to yield a separator with positive margin, and (c) the gradients w.r.t. to the linear predictors,
converge in direction, then the limit direction of the parameters
point of the following optimization problem,
Theorem 4 is proved in Appendix A. The proof of Theorem 4 involves showing that the asymptotic direction of gradient descent iterates satisfies the KKT conditions for first order stationary points of (14). This crucially relies on two properties. First, the sequence of gradients converge in direction to a positive span of support vectors of
[2018]), and this result relies on the loss function being exponential tailed. Secondly, if P is not homogeneous, then the optimization problems
for different values of unnormalized margin
are not equivalent and lead to different separators. Thus, for general non-homogeneous P, the unnormalized margin of one does not have a significance and the necessary conditions for the first order stationarity of (14) are not satisfied.
Finally, we also note that in many cases (including linear convolutional networks) the optimization problem (14) is non-convex and intractable (see e.g., Ge et al. [2011]). So we cannot expect to be always be a global minimizer of eq. (14). We however suspect that it is possible to obtain a stronger result that
reaches a higher order stationary point or even a local minimum of the explicitly regularized estimator in eq. (14).
Implications of the implicit bias in predictor space While eq. (14) characterizes the bias of gradient descent in the parameter space, what we really care about is the effective bias introduced in the space of functions learned by the network. In our case, this class of functions is the set of linear predictors norm penalized solution in eq. (14), is equivalently given by,
The problems in eq. (14) and eq. (15) have the same global minimizers, i.e., is global minimizer of eq. (14) if and only if
minimizes eq. (15). However, such an equivalence does not extend to the stationary points of the two problems. Specifically, it is possible that a stationary point of eq. (14) is merely a feasible point for eq. (15) with no special significance. So instead of using Theorem 4, for the specific networks in Section 3, we directly show (in Appendix) that gradient descent updates converge in direction to a first order stationary point of the problem in eq. (15).
In the previous section, we saw that the implicit bias of gradient descent on a parameterization P(w) can be described in terms of the optimization problem (14) and the implied penalty function . We now turn to studying this implied penalty
and obtaining explicit forms for it, which will reveal the precise form of the implicit bias in terms of the learned linear predictor. The proofs of the lemmas in this section are provided in the Appendix D.
Lemma 5. For fully connected networks of any depth L > 0,
We see that in eq. (15) for fully connected networks is independent of the depth of the network L. In Theorem 1, we indeed show that gradient descent for this class of networks converges in the direction of
Next, we motivate the characterization of for linear convolutional networks by first looking at the special linear diagonal network depicted in Figure 1c. The depth–L diagonal network is parameterized by
and the mapping to a linear predictor is given by
Lemma 6. For a depth–L diagonal network with parameters
Finally, for full width linear convolutional networks parameterized by , recall the following representation of
in Fourier from Lemma 3.
where are Fourier basis representation of
, respectively. Extending the result of diagonal networks for the complex vector spaces, we get the following characterization of
for linear convolutional networks.
Lemma 7. For a depth–L convolutional network with parameters
In this paper, we characterized the implicit bias of gradient descent on linear convolutional networks. We showed that even in the case of linear activations and a full width convolution, wherein the convolutional network defines the exact same model class as fully connected networks, merely changing to a convolutional parameterization introduces radically different, and very interesting, bias when training with gradient descent. Namely, training a convolutional representation with gradient descent implicitly biases towards sparsity in the frequency domain representation of linear predictor.
For convenience and simplicity of presentation, we studied one dimensional circular convolutions. Our results can be directly extended to higher dimensional input signals and convolutions, including the two-dimensional convolutions common in image processing and computer vision. We also expect similar results for convolutions with zero padding instead of circular convolutions, although this requires more care with analysis of the edge effects.
A more significant way in which our setup differs from usual convolutional networks is that we use full width convolutions, while in practice it is common to use convolutions with bounded width, much smaller then the input dimensionality. This setting is within the scope of Theorem 4, as the linear transformation is still homogeneous. However, understanding the implied bias in the predictor space, i.e. understanding requires additional work. It will be very interesting to see if restricting the width of the convolutional network gives rise to further interesting behaviors.
Another important direction for future study is understanding the implicit bias for networks with multiple outputs. For both fully connected and convolutional networks, we looked at networks with a single output. With C > 1 outputs, the network implements a linear transformation is now a matrix. Results for matrix sensing in Gunasekar et al. [2018] imply that for two layer fully connected networks with multiple outputs, the implicit bias is to a maximum margin solution with respect to the nuclear norm
. This is already different from the implicit bias of a one-layer “network” (i.e. optimizing
directly), which would be in terms of the Frobenius norm
(from the result of Soudry et al. [2017]). We suspect that with multiple outputs, as more layers are added, even fully connected networks exhibit a shrinking sparsity penalty on the singular values of the effective linear matrix predictor
. Precisely characterizing these biases requires further study.
When using convolutions as part of a larger network, with multiple parallel filters, max pooling, and non-linear activations, the situation is of course more complex, and we do not expect to get the exact same bias. However, we do expect the bias to be at the very least related to the sparsity-in-frequency-domain bias that we uncover here, and we hope our work can serve as a basis for further such study. There are of course many other implicit and explicit sources of inductive bias—here we show that merely parameterizing transformations via convolutions and using gradient descent for training already induces sparsity in the frequency domain.
On a technical level, we provided a generic characterization for the bias of gradient descent on linear models parameterized as for a homogeneous polynomial P. The
bias (in parameter space) we obtained is not surprising, but also should not be taken for granted – e.g., the result does not hold in general for non-homogeneous P, and even with homogeneous polynomials, the characterization is not as crisp when other loss functions are used, e.g., with a squared loss and matrix factorization (a homogeneous degree two polynomial representation), the implicit bias is much more fragile Gunasekar et al. [2017], Li et al. [2017]. Moreover, Theorem 4 only ensures convergence to first order stationary point in the parameter space, which is not sufficient for convergence to stationary points of the implied bias in the model space (eq. (15)). It is of interest for future work to strengthen this result to show either convergence to higher order stationary points or local minima in parameter space, or to directly show the convergence to stationary points of (15).
It would also be of interest to strengthen other technical aspects of our results: extend the results to loss functions with tight exponential tails (including logistic loss) and handle all datasets including the set of measure zero degenerate datasets—these should be possible following the techniques of Soudry et al. [2017], Telgarsky [2013], Ji and Telgarsky [2018]. We can also calculate exact rates of convergence to the asymptotic separator along the lines of Soudry et al. [2017], Nacson et al. [2018], Ji and Telgarsky [2018] showing how fast the inductive bias from optimization kicks in and why it might be beneficial to continue optimizing even after the loss value itself is negligible. Finally, for logistic regression, Ji and Telgarsky [2018] extend the results of asymptotic convergence of gradient descent classifier to the cases where the data is not strictly linearly separable. This is an important relaxation of our assumption on strict linear separability. More generally, for non-separable data, we would like a more fine grained analysis connecting the iterates
along the optimization path to the estimates along regularization path,
, where an explicit regularization is added to the optimization objective.
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2003.
Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, 2017.
Dongdong Ge, Xiaoye Jiang, and Yinyu Ye. A note on the complexity of lp minimization. Mathematical programming, 2011.
Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In NIPS, 2017.
Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint, 2018.
Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 1997.
Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, 2017.
Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
Michel Journée, Francis Bach, P-A Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.
Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, 2009.
Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 2016.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2016.
Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, 2016.
Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203, 2017.
Marian Muresan. A concrete approach to classical analysis, volume 14. Springer, 2009.
Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.
Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015a.
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015b.
Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint, 2017.
Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.
R Tyrrell Rockafellar. Directionally lipschitzian functions and subdifferential calculus. Proceedings of the London Mathematical Society, 1979.
Le Smith, Kindermans. Don’t Decay the Learning Rate, Increase the Batch Size. In ICLR, 2018.
Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
Matus Telgarsky. Margins, shrinkage and boosting. In Proceedings of the 30th International Conference on International Conference on Machine Learning-Volume 28, pages II–307. JMLR. org, 2013.
Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
The proofs of the theorems in the paper are organized as follows: In Appendix A we first give the proof for Theorem 4, which includes linear fully connected and full width convolutional networks as special cases. This gives us some general results that can be special-cased to prove the stronger results for these networks in Section 3. In Appendix B, we prove Theorem 1 on the implicit bias of fully connected linear networks. In Appendix C, we prove Theorem 2–2a on the implicit bias of linear convolutional networks. Finally, in Appendix D we prove the lemmas in Section 5 on computing the form of implicit bias of linear networks learned using gradient descent.
Unless specified otherwise, denotes the Euclidean norm. We additionally use the notation
to denote equality up to strictly positive scalar multipliers, i.e., when
The following is a paraphrasing of Lemma 8 in Gunasekar et al. [2018] and is used in multiple proofs.
Lemma 8. [Lemma 8 in Gunasekar et al. [2018]] For almost all linearly separable dataset consider any sequence
that minimizes the empirical objective in eq. (5), i.e.,
. If (a)
the indices of the data points with smallest margin to the limit direction
Theorem 4 (Homogeneous Polynomial Parameterization). For any homogeneous polynomial map from parameters
to linear predictors, almost all datasets
separable by
, almost all initializations
, and any bounded sequence of step sizes
, consider the sequence of gradient descent updates
for minimizing the empirical risk objective
with exponential loss
If (a) the iterates asymptotically minimize the objective, i.e.,
, (b)
, and consequently
, converge in direction to yield a separator with positive margin, and (c) the gradients w.r.t. to the linear predictors,
converge in direction, then the limit direction of the parameters
point of the following optimization problem,
Proof. are the sequence gradient descent iterates from eq. (7) for minimizing
with exponential loss over the model class of
, where P is a homogeneous polynomial function. We first introduce some notation.
1. From the assumption in theorem, we have that
2. Let denote the sequence of linear predictors for this network induced by the gradient descent iterates. We can see that
converges in direction too using the following arguments: homogeneity of P implies that
for some
. Hence,
3. . Since we assume that
converges in
4. Let denote the Jacobian of
If is a homogeneous polynomial of degree
, then
is a homogeneous polynomial of degree
. Using eq. (16), we have
Thus, , such that
5. Finally, from the definition of hence from eq. (7),
Using the assumptions in the theorem along with our argument above for convergence of in direction, we satisfy the conditions of Lemma 8, which will be crucially used in our proof.
Primal feasibility. We showed earlier that if converges in direction, then
converges in direction to
we have that , which also implies
is homogeneous of of degree
Showing other KKT conditions for The crux of the proof of Theorem 4 involves showing the existence of
such that the stationarity and complementary slackness conditions in eq. (20) are satisfied. This crucially relies on a key lemma (Lemma 8) showing that the gradient in the space of linear predictors
are dominated by positive linear combinations of support vectors of the asymptotic predictor
Let denote the indices of support vectors for
, which are also the support vectors of
, since by homogeneity of
. Thus, from Lemma 8, we have
We propose a positive scaling of this as our candidate dual certificate, which satisfies both dual feasibility and complementary slackness.
To prove the theorem, the remaining step is to show that and P is homogeneous, this condition is equivalent to showing that
Showing that Substituting for
and
from eqs. (17) and (18), respectively, in the gradient descent updates (eq. (19)), we have the following:
Summing over t, we have
We want to argue that the first term, i.e., , is the dominant term. Towards this we state and prove the following intermediate claim
Claim 1.
Proof. First, it is straight forward to check that for any scalar valued homogeneous polynomial f : of degree
, we have
, where for
(this is also known as the Euler’s homogeneous function theorem). Extending this to our vector valued homogeneous function
, we have that for all w, the Jacobian
Moreover, we have that for the limit direction , the margin of the corresponding classifier is strictly positive, i.e.,
. Now from Lemma 8, using that
for
(and not all zero since
is unit norm), we immediately get the following
To prove the second part, we note the following
Thus, if . On contrary, if
from eq. (22), for large t we get,
which contradicts
From above claim, the sequence is monotonic increasing and diverging. Thus, for
, using Stolz-Cesaro theorem (Theorem 11), we have
Substituting eq. (23) in eq. (22), we have
where in (a) we absorbed the diminishing terms into (b) follows since we proved in the claim above that
and hence dominates
We have shown that
for a positive scalar
, which completes the proof.
Theorem 1 (Linear fully connected networks). For any depth L, almost all linearly separable datasets , almost all initializations
, and any bounded sequence of step sizes
consider the sequence gradient descent iterates
in eq. (7) for minimizing
in eq. (4) with exponential loss
–layer fully connected linear networks.
If (a) the iterates minimize the objective, i.e.,
, (b)
, and consequently
, converge in direction to yield a separator with positive margin, and (c) gradients with respect to linear predictors
converge in direction, then the limit direction is given by,
Proof. Recall that for fully connected networks of any depth L > 0 with parameters , the equivalent linear predictor given by
is a homogeneous polynomial of degree L.
Let denote the iterates of individual matrices
along the gradient descent path, and
denote the corresponding sequence of linear predictors.
We first introduce the following notation.
1. Let denote the limit direction of the parameters, with component matrices in each layer denoted as
. Specializing (16) for fully connected networks, we have:
3. Let . Again repeating eq. (17) for fully connected networks, we have for some
The proof of Theorem 1 is fairly straight forward from using Lemma 8 and the intermediate results in the proof of Theorem 4.
Showing KKT conditions for Using our notation described above, we have
. In the following arguments we show that a positive scaling
satisfies the following KKT conditions for the optimality of
maximum margin problem in eq. (8):
As we saw in proof of Theorem 4, since has strictly positive margin, us- ing homogeneity of
, we can scale
to get
with unit margin, i.e.,
. For dual variables, we again use a positive scaling of
from Lemma 8, such that
. In order to prove the theorem, we need to show that
or equivalently
Recall that in the proof of Theorem 4, we showed a version of stationarity in the parameter space in eq. (26), repeated below.
This case in particular includes which is homogeneous with
. We special case the result fully connected network. In particular, for the parameters of the first layer
, we have P(w) =
, where
and
. This implies, for any
. Using this along with eq. (31), we get the following expression for some positive scalar
Since , we have shown that
, which completes our proof of Theorem 1.
Recall that L–layer linear convolutional networks have parameters . We first recall some complex numbers terminology and properties
1. Complex vectors are represented in polar form as
are the vectors with magnitudes and phases, respectively, of components
, the complex conjugate vector is denoted by
3. The complex inner product for
is given by
4. Let denote the discrete Fourier transform matrix with
recall that is the
complex root of unity. Thus, for any
, the representation in Fourier basis is given by
and its complex conjugate matrix
also satisfy:
Before getting into full proofs of Theorem 2a–2, we also prove the two lemmas (Lemma 3 and Lemma 9) that establish equivalence of dynamics of gradient descent on full dimensional convolutional networks to those on linear diagonal networks (Figure 1c), albeit with complex valued parameters. This makes the analysis of the of convolutional networks simpler and more intuitive.
We begin by proving Lemma 3 which shows the equivalence of representation between convolutional networks and diagonal networks.
Lemma 3. For full-dimensional convolutions, is equivalent to
Proof. First, we state the following properties which follow immediately from definitions:
1. For
where recall that the complex inner product is given by
2. We next show the following property
where denotes the Hadamard product (elementwise product), i.e.,
The above equation follows from simple manipulations of definitions: recall that (Fz)[d] =
where (a) follows as we used the change of variables
our use of modulo operator as
Recall from eq. (3) the output of an L-layer convolutional network is given by
Denote . By iteratively using eq. (34), we have
Thus, on one hand using the above equation we have,
where (a) follows from substituting for and noting that for any
, and (b) uses the definition of complex inner product
Now further using eq. (33) in above equation, we have
Thus, for , we have shown that
diag
For , let
denote the equivalent parameterization of convolutional network in Fourier domain.
The above lemma shows that optimizing in eq. (4) is equivalent to the following minimization problem in terms of representation,
The following lemma further shown that not only the representations of equivalent, but there corresponding gradient descent updates for problems in eq. (4) and eq. (39) are also equivalent up to Fourier transformations.
Lemma 9. Consider the gradient descent iterates from eq. (7) for minimizing
in eq. (4) over full dimensional linear convolutional networks. For all l, the incremental update directions,
satisfy the following,
where are the Fourier transformations of
, respectively.
Using the above equation we have,
where in and the remaining equalities simply follow from manipulation of derivatives. From above equation, we have the following:
Theorem 2 (Linear convolutional networks of depth two). For almost all linearly separable datasets , almost all initializations
, and any sequence of step sizes
with
smaller than the local Lipschitz at
, consider the sequence gradient descent iterates
in eq. (7) for minimizing
with exponential loss over 2–layer linear convolutional networks.
If (a) the iterates minimize the objective, i.e.,
converge in direction to yield a separator
with positive margin, (c) the phase of the Fourier coefficients
of the linear predictors
converge coordinate-wise, i.e.,
, and (d) the gradients
converge in direction, then the limit direction
is given by,
Theorem 2a (Linear Convolutional Networks of any Depth). For any depth L, under the conditions of Theorem 2, the limit direction
point of the following optimization problem,
where the penalty given by
(also called the bridge penalty) is a norm for p = 1 and a quasi-norm for p < 1.
For the gradient descent iterates denote the sequence of corresponding linear predictors as
. Let
and
denote the Fourier transforms of
, respectively, and let
KKT conditions for optimality We want to show that a positive scaling of , denoted by
is a first order stationary point of eq. (10), repeated below,
Recall the KKT conditions discussed in Section 3. The first order stationary points, or sub-stationary points, of (10) are the set of feasible predictors satisfying the following:
where denotes the local sub-differential (or Clarke’s sub-differential) operator defined as
conv
For p = 1 and represented in polar form as
is convex and the local sub-differential is indeed the global sub-differential given by,
For p < 1, the local sub-differential of is given by,
Showing KKT conditions for As we showed proof of Theorem 4, since
has strictly positive margin, using homogeneity of
, we can scale
to get
with unit margin, i.e.,
. For dual variables, we again use a positive scaling of
from Lemma 8, such that
In order to prove the theorem, we need to show that for some positive scalar , i.e., satisfies the conditions in eq. (47) and (48), for L = 2 and L > 2, respectively.
We start from the stationarity condition in the parameter space in eq. (26) of Theorem 4. For some positive scalar
We will now special case the above equation for fully width convolutional networks.
From Lemma 3, we have that for all , we have
where F and
denote discrete Fourier matrix and its inverse in appropriate dimensions. Let
denote the standard basis in
. We first note that for all l = 1, 2, . . . , L and for all d = 1, 2, . . . , D, the following holds
This implies, for
Substituting the above equation in eq. (49), we have,
where denotes the complex conjugate of
Also, by multiplying the LHS of eq. (55) across all l and taking Lth root over positive scalars, we have for
Finally, let be a positive scaling of
has unit margin. Let
C.1.1 Case of
For , since
, eq. (58) is indeed the first order stationarity condition for eq. (10) as described in eq. (11) and (13).
C.1.2 Case of
For the case of p = 1, in addition to eq. (58), we need to show that . From eq. (58), for
We need to further show that
Using Lemma 9 for for the special case of 2–layer linear convolutional network, for
Further, from eq. (55), we have , and hence
From the convergence of complex numbers, we have the following:
In the remainder of the proof, we only consider
Consider defined below,
where (a) follows from using whenever
(from eq. (62)), and (b) follows from eq. (60).
Additionally, since , we can write
are real scalars. Substituting in above equation and rearranging the terms, we have
where in
The following intermediate lemma is proved in Appendix C.1.3.
, there exists
such that
. Substituting these representations in eq. (65), we have the following dynamics for
where in (a) we have accumulated all diminishing terms into
Step 2. Remainder of the proof: We now prove our theorem by looking the following quantity: For any
We will show that whenever . Along with eq. (64), this would imply that
have . Moreover from eq.(57)), we know that
for all
with
Showing |
For any . We note that the since the loss
of the gradient
. Hence, for any finite step size sequence
, there exists
and the following inequalities hold,
where in (a) follows from using for x < 1 since
for all
, and in (c), we absorbed all o(p(t)) terms as
for
and used
Further, from the conditions of the theorem, for almost all initializations, for all d. For step sizes
smaller than the local Lipschitz constant, for all finite
, we also have
. Moreover from Lemma 10, we have that
and hence
such that
, but for any finite
. Thus, for
, using the above observations, we have that
Now, using eq. (71), for all
Moreover, we have for at least one d, and for any finite step sizes and finite
This then implies that for some
. Thus, for any
, we also have
From eq. (72) and above claim, we conclude that for all
This completes the proof of the theorem.
C.1.3 Proof of Lemma 10
Proof. Recalling from eq. (65) and
from eq. (63), we have the following:
For all d if
), and also that
1 (from eq. (62)). This along with eq. (73) gives us
Moreover, since , we have
or
. Further, using
We now only need to show that these results also hold for d such that
assumptions of the theorem that even when
. We now prove the lemma by showing the following steps for
Proof of lemma assuming Step 1 and Step 2 hold The above steps would imply that in eq. (73),
These eqs. along with eq. (73) in turn prove the lemma, i.e.,
Showing Step 1 and Step 2
Step 1. Show
Note that since are finite, we have that
such that for all
1. From the above equation, we have the following for
where (a) follows from iterating over
Since , at least one of
must diverge. Without loss of generality, let
where the convergence in (a) follows since (from eq. (76)) and
Step 2. Show Re
Note that from Step 1 above, we have that , which implies
Also, from eq. (44), there exists , such that
Using the above representations, along with eq. (59), we have the following,
where (a) follows from substituting eqs. (79)-(80), and (b) follows from using and defining
Denote . Additionally, from the assumption in the theorem, we have
, hence there exists
Now, from the above equation, for any , we derive the updates for
(since ); in (b) we defined
is obtained by iterating over t; and (d) follows from using
If possible, let , and for finite step sizes
that for all
, we now have
Finally, for any finite step sizes and finite , we have
and this creates a contradiction since the LHS in the above equation diverges,
. Hence, in order for the updates in eq. (82) to lead to a divergent
, we necessarily require that
This completes the proof of the lemma.
In this appendix we prove the lemmas in Section 5 that compute the form of induced bias of linear networks in the space of predictors. Recall that for linear predictors parameterized as ,
Lemma 5. For fully connected networks of any depth L > 0,
Proof. Recall that for fully connected networks of any depth L > 0 with parameters , the equivalent linear predictor given by
where (a) follows as arithmetic mean is greater than the geometric mean.
This ensures that , and hence
Combining eq. (83) and eq. (84), we get
The proofs of the lemmas for computing for diagonal and convolutional networks are similar to those of fully connected network.
Lemma 6. For a depth–L diagonal network with parameters
Proof. Recall that for an L–layer linear diagonal networks with parameters equivalent linear predictor is given by
Let be the minimizer of
, so that
and
. We then have,
where (a) again follows as arithmetic mean is greater than the geometric mean.
Similar to the case of fully connected networks, we now choose that satisfies
and
. This would ensure that,
We can check that these properties are satisfied by choosing w as follows: for
Combining this argument with eq. 85 concludes the proof.
For convolutional networks, the argument is the exactly the same as that for diagonal network adapted for complex vectors. Lemma 7. For a depth–L convolutional network with parameters
Proof. Denote the Fourier basis coefficients of in polar form as
where and
are the vectors with magnitudes and phases, respec- tively, of
From Lemma 3, the Fourier basis representation of is given by
where we have overloaded the notation to denote the mapping of diagonal networks in complex vector fields, and
. We thus have for
From orthonormality of discrete Fourier transformation, we have for all
We can now adapt the proof of diagonal networks here. Let be the minimizer of
Similar to the diagonal networks, we can choose the parameters in the Fourier domain to ensure that
as follows: for
Combining this with eq. 87 concludes the proof.