Implicit Bias of Gradient Descent on Linear Convolutional Networks

2018·Arxiv

Abstract

Abstract

We show that gradient descent on full width linear convolutional networks of depth L converges to a linear predictor related to the bridge penalty in the frequency domain. This is in contrast to fully connected linear networks, where regardless of depth, gradient descent converges to the maximum margin solution.

1 Introduction

Implicit biases introduced by optimization algorithms play an crucial role in learning deep neural networks [Neyshabur et al., 2015b,a, Hochreiter and Schmidhuber, 1997, Keskar et al., 2016, Chaudhari et al., 2016, Dinh et al., 2017, Andrychowicz et al., 2016, Neyshabur et al., 2017, Zhang et al., 2017, Wilson et al., 2017, Hoffer et al., 2017, Smith, 2018]. Large scale neural networks used in practice are highly over-parameterized with far more trainable model parameters compared to the number of training examples. Consequently, optimization objectives for learning such high capacity models have many global minima that fit training data perfectly. However, minimizing the training loss using specific optimization algorithms take us to not just any global minima, but some special global minima, e.g., global minima minimizing some regularizer . In over-parameterized models, specially deep neural networks, much, if not most, of the inductive bias of the learned model comes from this implicit regularization from the optimization algorithm. Understanding the implicit bias, e.g., via characterizing , is thus essential for understanding how and what the model learns.

For example, in linear regression we understand how minimizing an under-determined model (with more parameters than samples) using gradient descent yields the minimum norm solution, and for linear logistic regression trained on linearly separable data, Soudry et al. [2017] recently showed that gradient descent converges in the direction of the hard margin support vector machine solution, even though the norm or margin is not explicitly specified in the optimization problem. Such minimum norm or maximum margin solutions are of course very special among all solutions or separators that fit the training data, and in particular can ensure generalization Bartlett and Mendelson [2003], Kakade et al. [2009].

Changing the optimization algorithm, even without changing the model, changes this implicit bias, and consequently also changes generalization properties of the learned models [Neyshabur et al., 2015a, Keskar et al., 2016, Wilson et al., 2017, Gunasekar et al., 2017, 2018]. For example, for linear logistic regression, using coordinate descent instead of gradient descent return a maximum solution instead of the hard margin support vector solution solution—an entirely different inductive bias Telgarsky [2013], Gunasekar et al. [2018].

Similarly, and as we shall see in this paper, changing to a different parameterization of the same model class can also dramatically change the implicit bias Gunasekar et al. [2017]. In particular, we study the implicit bias of optimizing multi-layer fully connected linear networks, and linear convolutional networks (multiple full width convolutional layers followed by a single fully connected layer) using gradient descent. Both of these types of models ultimately implement linear transformations, and can implement any linear transformation. The model class defined by these networks is thus simply the class of all linear predictors, and these models can be seen as mere (over) parameterizations of the class of linear predictors. Minimizing the training loss on these models is therefore entirely equivalent to minimizing the training loss for linear classification. Nevertheless, as we shall see, optimizing these networks with gradient descent leads to very different solutions.

In particular, we show that for fully connected networks with single output, optimizing the exponential loss over linearly separable data using gradient loss again converges to the homogeneous hard margin support vector machine solution. This holds regardless of the depth of the network, and hence, at least with a single output, gradient descent on fully connected networks has the same implicit bias as direct gradient descent on the parameters of the linear predictor. In contrast, training a linear convolutional network with gradient descent biases us toward linear separators that are sparse in the frequency domain. Furthermore, this bias changes with the depth of the network, and a network of depth L (with convolutional layers), implicitly biases towards minimizing the bridge penalty with of the Fourier transform of the learned linear predictor subject to margin constraints (the gradient descent predictor reaches a stationary point of the minimization problem). This is a sparsity inducing regularizer, which induces sparsity more aggressively as the depth increases.

Finally, in this paper we focus on characterizing which global minimum does gradient descent on over-parameterized linear models converge to, while assuming that for appropriate choice of step sizes gradient descent iterates asymptotically minimize the optimization objective. A related challenge in neural networks, not addressed in this paper, is an answer to when does gradient descent minimize the non-convex empirical loss objective to reach a global minimum. This problem while hard in worst case, has been studied for linear networks. Recent work have concluded that with sufficient over-parameterization (as is the case with our settings), loss landscape of linear models are well behaved and all local minima are global minima making the problem tractable Burer and Monteiro [2003], Journée et al. [2010], Kawaguchi [2016], Nguyen and Hein [2017], Lee et al. [2016].

Notation We typeface vectors with bold characters e.g., . Individual entries of a vector are indexed using 0 based indexing as z[d] for . Complex numbers are represented in the polar form as with denoting the magnitude of z and denoting the phase. denotes the complex conjugate of z. The complex inner product between is given by root of 1 is denoted by . For we use the notation to denote the representation of z in the discrete Fourier basis given by, For integers D and a, we denote the modulo operator as . Finally, for multi-layer linear networks (formally defined in Section 2), we will use to denote parameters of the model in general domain to denote the equivalent linear predictor.

2 Multi-layer Linear Networks

We consider feed forward linear networks that map input features to a single real valued output , where w denote the parameters of the network. Such networks can be thought of as directed acyclic graphs where each edge is associated with a weight, and the value at each node/unit is the weighted sum of values from the parent nodes. The input features form source nodes with no incoming edges and the output is a sink node with no outgoing edge. Every such network realizes a linear function denotes the effective linear predictor.

In multi-layer networks, the nodes are arranged in layers, so an L–layer network represents a composition of L linear maps. We use the convention that, the input is indexed as the zeroth layer l = 0, while the output forms the final layer with l = L. The outputs of nodes in layer l are denoted by , where is the number of nodes in layer l. We also use to denote the parameters of the linear map between and , and to denote the collective set of all parameters of the linear network.

Linear fully connected network In a fully connected linear network, the nodes between successive layers are densely connected with edge weights , and all the weights are independent parameters. This model class is parameterized by the computation for intermediate nodes and the composite linear map is given by,

Linear convolutional network We consider one-dimensional convolutional network architectures where each non-output layer has exactly D units (same as the input dimensionality) and the linear transformations from layer are given by the following circular convolutional operation1 parameterized by full width filters with weights

The output layer is fully connected and parameterized by weights . The parameters of the model class therefor consists of L vectors of size D collectively denoted by and the composite linear map is given by:

Remark: We use circular convolution with a scaling of to make the analysis cleaner. For convolutions with zero-padding, we expect a similar behavior. Secondly, since our goal here to study implicit bias in sufficiently over-parameterized models, we only study full dimensional convolutional filters. In practice it is common to have filters of width K smaller than the number of input features, which can change the implicit bias.

The fully connected and convolutional linear networks described above can both be represented in terms of a mapping that maps the input parameters to a linear predictor in such that the output of the network is given by . For fully connected networks, the mapping is given by , and for convolutional networks, , where denotes the flipped vector corresponding to w,

given by , . . . , D

Separable linear classification Consider a binary classification dataset 1, 2, . . . N} with and . The empirical risk minimization objective for training a linear network parameterized as P(w) is given as follows,

where is some surrogate loss for classification accuracy, e.g., logistic loss and exponential loss

It is easy to see that both fully connected and convolutional networks of any depth L can realize any linear predictor . The model class expressed by both networks is therefore simply the unconstrained class of linear predictors, and the two architectures are merely different (over) parameterizations of this class

Thus, the empirical risk minimization problem in (4) is equivalent to the following optimization over the linear predictors

Although the optimization problems (4) and (5) are exactly equivalent in terms of the set of global minima, in this paper, we show that optimizing (4) with different parameterizations leads to very different classifiers compared to optimizing (5) directly.

In particular, consider problems (4)/(5) on a linearly separable dataset and using the logistic loss (the two class version of the cross entropy loss typically used in deep learning). The global infimum of is 0, but this is not attainable by any finite . Instead, the loss can be minimized by scaling the norm of any linear predictor that separates the data to infinity. Thus, any sequence of predictors (say, from an optimization algorithm) that asymptotically minimizes the loss in eq. (5) necessarily separates the data and diverges in norm, . In general there are many linear separators that correctly label the training data, each corresponding to a direction in which we can minimize (5). Which of these separators will we converge to when optimizing (4)/(5)? In other words, what is the direction

will diverge in? If this limit exist we say that converges in direction to the limit direction

Soudry et al. [2017] studied this implicit bias of gradient descent on (5) over the direct parameterization of . They showed that for any linearly separable dataset and any initialization, gradient descent w.r.t. converges in direction to hard margin support vector machine solution:

In this paper we study the behavior of gradient descent on the problem (4) w.r.t different parameterizations of the model class of linear predictors. For initialization and sequence of step sizes gradient descent updates for (4) are given by,

where denotes the Jacobian of with respect to the parameters is the gradient of the loss function in (5).

For separable datasets, if minimizes (4) for linear fully connected or convolutional networks, then we will again have , and the question we ask is: what is the limit direction

The result in Soudry et al. [2017] holds for any loss function that is strictly monotone in uy with specific tail behavior, name the tightly exponential tail, which is satisfied by popular classification losses like logistic and exponential loss. In the rest of the paper, for simplicity we exclusively focus on the exponential loss function , which has the same tail behavior as that of the logistic loss. Along the lines of Soudry et al. [2017], our results should also extend for any strictly monotonic loss function with a tight exponential tail, including logistic loss.

3 Main Results

Our main results characterize the implicit bias of gradient descent for multi-layer fully connected and convolutional networks with linear activations. For the gradient descent iterates in eq. (7), we henceforth denote the induced linear predictor as Assumptions. In the following theorems, we characterize the limiting predictor

under the following assumptions:

These assumptions allow us to focus on the question of which specific linear predictor do gradient descent iterates converge to by separating it from the related optimization questions of when gradient descent iterates minimize the non-convex objective in eq. (5) and nicely converge in direction.

Figure 1: Implicit bias of gradient descent for different linear network architectures.

Theorem 1 (Linear fully connected networks). For any depth L, almost all linearly separable datasets , almost all initializations , and any bounded sequence of step sizes consider the sequence gradient descent iterates in eq. (7) for minimizing in eq. (4) with exponential loss –layer fully connected linear networks.

If (a) the iterates minimize the objective, i.e., , (b) , and consequently , converge in direction to yield a separator with positive margin, and (c) gradients with respect to linear predictors converge in direction, then the limit direction is given by,

For fully connected networks with single output, Theorem 1 shows that there is no effect of depth on the implicit bias of gradient descent. Regardless of the depth of the network, the asymptotic classifier is always the hard margin support vector machine classifier, which is also the limit direction of gradient descent for linear logistic regression with the direct parameterization of

In contrast, next we show that for convolutional networks we get very different biases. Let us first look at a 2–layer linear convolutional network, i.e., a network with single convolutional layer followed by a fully connected final layer.

Recall that denote the Fourier coefficients of , i.e.,

and that any non-zero is denoted in polar form as for . Linear predictors induced by gradient descent iterates for convolutional networks are denoted by . It is evident that if converges in direction to , then its Fourier transformation converges in direction to . In the following theorems, in addition to the earlier assumptions, we further assume a technical condition that the phase of the Fourier coefficients converge

coordinate-wise. For coordinates d with

, in which case . We assume such a also exists when

Theorem 2 (Linear convolutional networks of depth two). For almost all linearly separable datasets , almost all initializations , and any sequence of step sizes with smaller than the local Lipschitz at , consider the sequence gradient descent iterates in eq. (7) for minimizing with exponential loss over 2–layer linear convolutional networks.

If (a) the iterates minimize the objective, i.e., converge in direction to yield a separator with positive margin, (c) the phase of the Fourier coefficients of the linear predictors converge coordinate-wise, i.e., , and (d) the gradients converge in direction, then the limit direction is given by,

We already see how introducing a single convolutional layer changes the implicit bias of gradient descent—even without any explicit regularization, gradient descent on the parameters of convolutional network architecture returns solutions that are biased to have sparsity in the frequency domain.

Furthermore, unlike fully connected networks, for convolutional networks we also see that the implicit bias changes with the depth of the network as shown by the following theorem. Theorem 2a (Linear Convolutional Networks of any Depth). For any depth L, under the conditions of Theorem 2, the limit direction

point of the following optimization problem,

where the penalty given by (also called the bridge penalty) is a norm for p = 1 and a quasi-norm for p < 1.

When L > 2, and thus , problem (10) is non-convex and intractable Ge et al. [2011]. Hence, we cannot expect to ensure convergence to a global minimum. Instead we show convergence to a first order stationary point of (10) in the sense of sub-stationary points of Rockafellar [1979] for optimization problems with non-smooth and non-convex objectives. These are solutions where the local directional derivative along the directions in the tangent cone of the constraints are all zero.

The first order stationary points, or sub-stationary points, of (10) are the set of feasible predictors such that satisfying the following:

where is the Fourier transformation of , and denotes the local sub-differential (or Clarke’s sub-differential) operator defined as

For p = 1 and represented in polar form as is convex and the local sub-differential is indeed the global sub-differential given by,

For p < 1, the local sub-differential of is given by,

Figures 1a–1b summarize the implications of the main results in the paper. The proof of this Theorem, exploits the following representation of in the Fourier domain. Lemma 3. For full-dimensional convolutions, is equivalent to

where for are the Fourier coefficients of the parameters

From above lemma (proved in Appendix C), we can see a connection of convolutional networks to a special network where the linear transformation between layers is restricted to diagonal entries (see depiction in Figure 1c), we refer to such networks as linear diagonal network.

The proof of Theorem 1 and Theorem 2-2a are provided in Appendix B and C, respectively.

4 Understanding Gradient Descent in the Parameter Space

We can decompose the characterization of implicit bias of gradient descent on a parameterization P(w) into two parts: (a) what is the implicit bias of gradient descent in the space of parameters w?, and (b) what does this imply in term of the linear predictor , i.e., how does the bias in parameter space translate to the linear predictor learned from the model class?

We look at the first question for a broad class of linear models, where the linear predictor is given by a homogeneous polynomial mapping of the parameters: are the parameters of the model and satisfies definition below. This class covers the linear convolutional, fully connected networks, and diagonal networks discussed in Section 3.

Definition (Homogeneous Polynomial). A multivariate polynomial function is said to be homogeneous, if for some finite integer

Theorem 4 (Homogeneous Polynomial Parameterization). For any homogeneous polynomial map from parameters to linear predictors, almost all datasets separable by , almost all initializations , and any bounded sequence of step sizes , consider the sequence of gradient descent updates for minimizing the empirical risk objective with exponential loss

If (a) the iterates asymptotically minimize the objective, i.e., , (b) , and consequently , converge in direction to yield a separator with positive margin, and (c) the gradients w.r.t. to the linear predictors, converge in direction, then the limit direction of the parameters

point of the following optimization problem,

Theorem 4 is proved in Appendix A. The proof of Theorem 4 involves showing that the asymptotic direction of gradient descent iterates satisfies the KKT conditions for first order stationary points of (14). This crucially relies on two properties. First, the sequence of gradients converge in direction to a positive span of support vectors of

[2018]), and this result relies on the loss function being exponential tailed. Secondly, if P is not homogeneous, then the optimization problems for different values of unnormalized margin are not equivalent and lead to different separators. Thus, for general non-homogeneous P, the unnormalized margin of one does not have a significance and the necessary conditions for the first order stationarity of (14) are not satisfied.

Finally, we also note that in many cases (including linear convolutional networks) the optimization problem (14) is non-convex and intractable (see e.g., Ge et al. [2011]). So we cannot expect to be always be a global minimizer of eq. (14). We however suspect that it is possible to obtain a stronger result that reaches a higher order stationary point or even a local minimum of the explicitly regularized estimator in eq. (14).

Implications of the implicit bias in predictor space While eq. (14) characterizes the bias of gradient descent in the parameter space, what we really care about is the effective bias introduced in the space of functions learned by the network. In our case, this class of functions is the set of linear predictors norm penalized solution in eq. (14), is equivalently given by,

The problems in eq. (14) and eq. (15) have the same global minimizers, i.e., is global minimizer of eq. (14) if and only if minimizes eq. (15). However, such an equivalence does not extend to the stationary points of the two problems. Specifically, it is possible that a stationary point of eq. (14) is merely a feasible point for eq. (15) with no special significance. So instead of using Theorem 4, for the specific networks in Section 3, we directly show (in Appendix) that gradient descent updates converge in direction to a first order stationary point of the problem in eq. (15).

5 Understanding Gradient Descent in Predictor Space

In the previous section, we saw that the implicit bias of gradient descent on a parameterization P(w) can be described in terms of the optimization problem (14) and the implied penalty function . We now turn to studying this implied penalty and obtaining explicit forms for it, which will reveal the precise form of the implicit bias in terms of the learned linear predictor. The proofs of the lemmas in this section are provided in the Appendix D.

Lemma 5. For fully connected networks of any depth L > 0,

We see that in eq. (15) for fully connected networks is independent of the depth of the network L. In Theorem 1, we indeed show that gradient descent for this class of networks converges in the direction of

Next, we motivate the characterization of for linear convolutional networks by first looking at the special linear diagonal network depicted in Figure 1c. The depth–L diagonal network is parameterized by and the mapping to a linear predictor is given by

Lemma 6. For a depth–L diagonal network with parameters

Finally, for full width linear convolutional networks parameterized by , recall the following representation of in Fourier from Lemma 3.

where are Fourier basis representation of , respectively. Extending the result of diagonal networks for the complex vector spaces, we get the following characterization of for linear convolutional networks.

Lemma 7. For a depth–L convolutional network with parameters

6 Discussion

In this paper, we characterized the implicit bias of gradient descent on linear convolutional networks. We showed that even in the case of linear activations and a full width convolution, wherein the convolutional network defines the exact same model class as fully connected networks, merely changing to a convolutional parameterization introduces radically different, and very interesting, bias when training with gradient descent. Namely, training a convolutional representation with gradient descent implicitly biases towards sparsity in the frequency domain representation of linear predictor.

For convenience and simplicity of presentation, we studied one dimensional circular convolutions. Our results can be directly extended to higher dimensional input signals and convolutions, including the two-dimensional convolutions common in image processing and computer vision. We also expect similar results for convolutions with zero padding instead of circular convolutions, although this requires more care with analysis of the edge effects.

A more significant way in which our setup differs from usual convolutional networks is that we use full width convolutions, while in practice it is common to use convolutions with bounded width, much smaller then the input dimensionality. This setting is within the scope of Theorem 4, as the linear transformation is still homogeneous. However, understanding the implied bias in the predictor space, i.e. understanding requires additional work. It will be very interesting to see if restricting the width of the convolutional network gives rise to further interesting behaviors.

Another important direction for future study is understanding the implicit bias for networks with multiple outputs. For both fully connected and convolutional networks, we looked at networks with a single output. With C > 1 outputs, the network implements a linear transformation is now a matrix. Results for matrix sensing in Gunasekar et al. [2018] imply that for two layer fully connected networks with multiple outputs, the implicit bias is to a maximum margin solution with respect to the nuclear norm . This is already different from the implicit bias of a one-layer “network” (i.e. optimizing directly), which would be in terms of the Frobenius norm (from the result of Soudry et al. [2017]). We suspect that with multiple outputs, as more layers are added, even fully connected networks exhibit a shrinking sparsity penalty on the singular values of the effective linear matrix predictor . Precisely characterizing these biases requires further study.

When using convolutions as part of a larger network, with multiple parallel filters, max pooling, and non-linear activations, the situation is of course more complex, and we do not expect to get the exact same bias. However, we do expect the bias to be at the very least related to the sparsity-in-frequency-domain bias that we uncover here, and we hope our work can serve as a basis for further such study. There are of course many other implicit and explicit sources of inductive bias—here we show that merely parameterizing transformations via convolutions and using gradient descent for training already induces sparsity in the frequency domain.

On a technical level, we provided a generic characterization for the bias of gradient descent on linear models parameterized as for a homogeneous polynomial P. The bias (in parameter space) we obtained is not surprising, but also should not be taken for granted – e.g., the result does not hold in general for non-homogeneous P, and even with homogeneous polynomials, the characterization is not as crisp when other loss functions are used, e.g., with a squared loss and matrix factorization (a homogeneous degree two polynomial representation), the implicit bias is much more fragile Gunasekar et al. [2017], Li et al. [2017]. Moreover, Theorem 4 only ensures convergence to first order stationary point in the parameter space, which is not sufficient for convergence to stationary points of the implied bias in the model space (eq. (15)). It is of interest for future work to strengthen this result to show either convergence to higher order stationary points or local minima in parameter space, or to directly show the convergence to stationary points of (15).

It would also be of interest to strengthen other technical aspects of our results: extend the results to loss functions with tight exponential tails (including logistic loss) and handle all datasets including the set of measure zero degenerate datasets—these should be possible following the techniques of Soudry et al. [2017], Telgarsky [2013], Ji and Telgarsky [2018]. We can also calculate exact rates of convergence to the asymptotic separator along the lines of Soudry et al. [2017], Nacson et al. [2018], Ji and Telgarsky [2018] showing how fast the inductive bias from optimization kicks in and why it might be beneficial to continue optimizing even after the loss value itself is negligible. Finally, for logistic regression, Ji and Telgarsky [2018] extend the results of asymptotic convergence of gradient descent classifier to the cases where the data is not strictly linearly separable. This is an important relaxation of our assumption on strict linear separability. More generally, for non-separable data, we would like a more fine grained analysis connecting the iterates along the optimization path to the estimates along regularization path, , where an explicit regularization is added to the optimization objective.

References

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2003.

Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, 2017.

Dongdong Ge, Xiaoye Jiang, and Yinyu Ye. A note on the complexity of lp minimization. Mathematical programming, 2011.

Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In NIPS, 2017.

Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 1997.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, 2017.

Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.

Michel Journée, Francis Bach, P-A Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.

Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, 2009.

Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 2016.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2016.

Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, 2016.

Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203, 2017.

Marian Muresan. A concrete approach to classical analysis, volume 14. Springer, 2009.

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.

Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015a.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015b.

Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint, 2017.

Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.

R Tyrrell Rockafellar. Directionally lipschitzian functions and subdifferential calculus. Proceedings of the London Mathematical Society, 1979.

Le Smith, Kindermans. Don’t Decay the Learning Rate, Increase the Batch Size. In ICLR, 2018.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.

Matus Telgarsky. Margins, shrinkage and boosting. In Proceedings of the 30th International Conference on International Conference on Machine Learning-Volume 28, pages II–307. JMLR. org, 2013.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

Appendix

The proofs of the theorems in the paper are organized as follows: In Appendix A we first give the proof for Theorem 4, which includes linear fully connected and full width convolutional networks as special cases. This gives us some general results that can be special-cased to prove the stronger results for these networks in Section 3. In Appendix B, we prove Theorem 1 on the implicit bias of fully connected linear networks. In Appendix C, we prove Theorem 2–2a on the implicit bias of linear convolutional networks. Finally, in Appendix D we prove the lemmas in Section 5 on computing the form of implicit bias of linear networks learned using gradient descent.

Unless specified otherwise, denotes the Euclidean norm. We additionally use the notation to denote equality up to strictly positive scalar multipliers, i.e., when

The following is a paraphrasing of Lemma 8 in Gunasekar et al. [2018] and is used in multiple proofs.

Lemma 8. [Lemma 8 in Gunasekar et al. [2018]] For almost all linearly separable dataset consider any sequence that minimizes the empirical objective in eq. (5), i.e., . If (a)

the indices of the data points with smallest margin to the limit direction

A Homogeneous Polynomial Parameterization: Proof of Theorem 4

Theorem 4 (Homogeneous Polynomial Parameterization). For any homogeneous polynomial map from parameters to linear predictors, almost all datasets separable by , almost all initializations , and any bounded sequence of step sizes , consider the sequence of gradient descent updates for minimizing the empirical risk objective with exponential loss

If (a) the iterates asymptotically minimize the objective, i.e., , (b) , and consequently , converge in direction to yield a separator with positive margin, and (c) the gradients w.r.t. to the linear predictors, converge in direction, then the limit direction of the parameters

point of the following optimization problem,

Proof. are the sequence gradient descent iterates from eq. (7) for minimizing with exponential loss over the model class of , where P is a homogeneous polynomial function. We first introduce some notation.

1. From the assumption in theorem, we have that

2. Let denote the sequence of linear predictors for this network induced by the gradient descent iterates. We can see that converges in direction too using the following arguments: homogeneity of P implies that for some . Hence,

3. . Since we assume that converges in

4. Let denote the Jacobian of

If is a homogeneous polynomial of degree , then is a homogeneous polynomial of degree . Using eq. (16), we have

Thus, , such that

5. Finally, from the definition of hence from eq. (7),

Using the assumptions in the theorem along with our argument above for convergence of in direction, we satisfy the conditions of Lemma 8, which will be crucially used in our proof.

Primal feasibility. We showed earlier that if converges in direction, then converges in direction to

we have that , which also implies is homogeneous of of degree

Showing other KKT conditions for The crux of the proof of Theorem 4 involves showing the existence of such that the stationarity and complementary slackness conditions in eq. (20) are satisfied. This crucially relies on a key lemma (Lemma 8) showing that the gradient in the space of linear predictors are dominated by positive linear combinations of support vectors of the asymptotic predictor

Let denote the indices of support vectors for , which are also the support vectors of , since by homogeneity of . Thus, from Lemma 8, we have

We propose a positive scaling of this as our candidate dual certificate, which satisfies both dual feasibility and complementary slackness.

To prove the theorem, the remaining step is to show that and P is homogeneous, this condition is equivalent to showing that

Showing that Substituting for and from eqs. (17) and (18), respectively, in the gradient descent updates (eq. (19)), we have the following:

Summing over t, we have

We want to argue that the first term, i.e., , is the dominant term. Towards this we state and prove the following intermediate claim

Claim 1.

Proof. First, it is straight forward to check that for any scalar valued homogeneous polynomial f : of degree , we have , where for

(this is also known as the Euler’s homogeneous function theorem). Extending this to our vector valued homogeneous function , we have that for all w, the Jacobian

Moreover, we have that for the limit direction , the margin of the corresponding classifier is strictly positive, i.e., . Now from Lemma 8, using that for (and not all zero since is unit norm), we immediately get the following

To prove the second part, we note the following

Thus, if . On contrary, if from eq. (22), for large t we get,

which contradicts

From above claim, the sequence is monotonic increasing and diverging. Thus, for , using Stolz-Cesaro theorem (Theorem 11), we have

Substituting eq. (23) in eq. (22), we have

where in (a) we absorbed the diminishing terms into (b) follows since we proved in the claim above that and hence dominates We have shown that for a positive scalar , which completes the proof.

B Linear Fully Connected Networks: Proof of Theorem 1

Theorem 1 (Linear fully connected networks). For any depth L, almost all linearly separable datasets , almost all initializations , and any bounded sequence of step sizes consider the sequence gradient descent iterates in eq. (7) for minimizing in eq. (4) with exponential loss –layer fully connected linear networks.<