b

DiscoverSearch
About
My stuff
Implicit Bias of Gradient Descent on Linear Convolutional Networks
2018·arXiv
Abstract
Abstract

We show that gradient descent on full width linear convolutional networks of depth L converges to a linear predictor related to the  ℓ2/Lbridge penalty in the frequency domain. This is in contrast to fully connected linear networks, where regardless of depth, gradient descent converges to the  ℓ2maximum margin solution.

Implicit biases introduced by optimization algorithms play an crucial role in learning deep neural networks [Neyshabur et al., 2015b,a, Hochreiter and Schmidhuber, 1997, Keskar et al., 2016, Chaudhari et al., 2016, Dinh et al., 2017, Andrychowicz et al., 2016, Neyshabur et al., 2017, Zhang et al., 2017, Wilson et al., 2017, Hoffer et al., 2017, Smith, 2018]. Large scale neural networks used in practice are highly over-parameterized with far more trainable model parameters compared to the number of training examples. Consequently, optimization objectives for learning such high capacity models have many global minima that fit training data perfectly. However, minimizing the training loss using specific optimization algorithms take us to not just any global minima, but some special global minima, e.g., global minima minimizing some regularizer  R(β). In over-parameterized models, specially deep neural networks, much, if not most, of the inductive bias of the learned model comes from this implicit regularization from the optimization algorithm. Understanding the implicit bias, e.g., via characterizing  R(β), is thus essential for understanding how and what the model learns.

For example, in linear regression we understand how minimizing an under-determined model (with more parameters than samples) using gradient descent yields the minimum  ℓ2norm solution, and for linear logistic regression trained on linearly separable data, Soudry et al. [2017] recently showed that gradient descent converges in the direction of the hard margin support vector machine solution, even though the norm or margin is not explicitly specified in the optimization problem. Such minimum norm or maximum margin solutions are of course very special among all solutions or separators that fit the training data, and in particular can ensure generalization Bartlett and Mendelson [2003], Kakade et al. [2009].

Changing the optimization algorithm, even without changing the model, changes this implicit bias, and consequently also changes generalization properties of the learned models [Neyshabur et al., 2015a, Keskar et al., 2016, Wilson et al., 2017, Gunasekar et al., 2017, 2018]. For example, for linear logistic regression, using coordinate descent instead of gradient descent return a maximum  ℓ1 marginsolution instead of the hard margin support vector solution solution—an entirely different inductive bias Telgarsky [2013], Gunasekar et al. [2018].

Similarly, and as we shall see in this paper, changing to a different parameterization of the same model class can also dramatically change the implicit bias Gunasekar et al. [2017]. In particular, we study the implicit bias of optimizing multi-layer fully connected linear networks, and linear convolutional networks (multiple full width convolutional layers followed by a single fully connected layer) using gradient descent. Both of these types of models ultimately implement linear transformations, and can implement any linear transformation. The model class defined by these networks is thus simply the class of all linear predictors, and these models can be seen as mere (over) parameterizations of the class of linear predictors. Minimizing the training loss on these models is therefore entirely equivalent to minimizing the training loss for linear classification. Nevertheless, as we shall see, optimizing these networks with gradient descent leads to very different solutions.

In particular, we show that for fully connected networks with single output, optimizing the exponential loss over linearly separable data using gradient loss again converges to the homogeneous hard margin support vector machine solution. This holds regardless of the depth of the network, and hence, at least with a single output, gradient descent on fully connected networks has the same implicit bias as direct gradient descent on the parameters of the linear predictor. In contrast, training a linear convolutional network with gradient descent biases us toward linear separators that are sparse in the frequency domain. Furthermore, this bias changes with the depth of the network, and a network of depth L (with  L − 1convolutional layers), implicitly biases towards minimizing the  ∥�β∥2/Lbridge penalty with 2/L ≤ 1of the Fourier transform �βof the learned linear predictor  βsubject to margin constraints (the gradient descent predictor reaches a stationary point of the  ∥�β∥2/Lminimization problem). This is a sparsity inducing regularizer, which induces sparsity more aggressively as the depth increases.

Finally, in this paper we focus on characterizing which global minimum does gradient descent on over-parameterized linear models converge to, while assuming that for appropriate choice of step sizes gradient descent iterates asymptotically minimize the optimization objective. A related challenge in neural networks, not addressed in this paper, is an answer to when does gradient descent minimize the non-convex empirical loss objective to reach a global minimum. This problem while hard in worst case, has been studied for linear networks. Recent work have concluded that with sufficient over-parameterization (as is the case with our settings), loss landscape of linear models are well behaved and all local minima are global minima making the problem tractable Burer and Monteiro [2003], Journée et al. [2010], Kawaguchi [2016], Nguyen and Hein [2017], Lee et al. [2016].

Notation We typeface vectors with bold characters e.g.,  w, β, x. Individual entries of a vector z ∈ RDare indexed using 0 based indexing as z[d] for  d = 0, 1, . . . , D − 1. Complex numbers are represented in the polar form as  z = |z|eiφzwith  |z| ∈ R+denoting the magnitude of z and φz ∈ [0, 2π)denoting the phase.  z∗ = |z|e−iφzdenotes the complex conjugate of z. The complex inner product between  z, β ∈ CD is given by  ⟨z, β⟩ = �Dd=1 z[d]β∗[d] = z⊤β∗. The Dth complexroot of 1 is denoted by  ωD = e− 2πiD. For  z ∈ RDwe use the notation  �z ∈ CDto denote the representation of z in the discrete Fourier basis given by,  �z[d] = 1√D�D−1p=0 z[p]ωpdD .For integers D and a, we denote the modulo operator as  a mod D = a − D� aD�. Finally, for multi-layer linear networks (formally defined in Section 2), we will use  w ∈ Wto denote parameters of the model in general domain  W, and βw or simply βto denote the equivalent linear predictor.

We consider feed forward linear networks that map input features  x ∈ RDto a single real valued output  fw(x) ∈ R, where w denote the parameters of the network. Such networks can be thought of as directed acyclic graphs where each edge is associated with a weight, and the value at each node/unit is the weighted sum of values from the parent nodes. The input features form source nodes with no incoming edges and the output is a sink node with no outgoing edge. Every such network realizes a linear function  x → ⟨x, βw⟩, where βw ∈ RD denotes the effective linear predictor.

In multi-layer networks, the nodes are arranged in layers, so an L–layer network represents a composition of L linear maps. We use the convention that, the input  x ∈ RD is indexed as the zeroth layer l = 0, while the output forms the final layer with l = L. The outputs of nodes in layer l are denoted by  hl ∈ RDl, where  Dlis the number of nodes in layer l. We also use  wlto denote the parameters of the linear map between  hl−1and  hl, and  w = [wl]Ll=1to denote the collective set of all parameters of the linear network.

Linear fully connected network In a fully connected linear network, the nodes between successive layers  l − 1 and lare densely connected with edge weights  wl ∈ RDl−1×Dl, and all the weights are independent parameters. This model class is parameterized by  w = [wl]Ll=1 ∈ �Ll=1 RDl−1×Dl andthe computation for intermediate nodes  hland the composite linear map  fw(x)is given by,

image

Linear convolutional network We consider one-dimensional convolutional network architectures where each non-output layer has exactly D units (same as the input dimensionality) and the linear transformations from layer  l − 1 to layer lare given by the following circular convolutional operation1 parameterized by full width filters with weights  [wl ∈ RD]L−1l=1 . For l = 1, 2, . . . , L − 1,

image

The output layer is fully connected and parameterized by weights  wL ∈ RD. The parameters of the model class therefor consists of L vectors of size D collectively denoted by  w = [wl]Ll=1 ∈ �Ll=1 RD,and the composite linear map  fw(x)is given by:

image

Remark: We use circular convolution with a scaling of 1/√Dto make the analysis cleaner. For convolutions with zero-padding, we expect a similar behavior. Secondly, since our goal here to study implicit bias in sufficiently over-parameterized models, we only study full dimensional convolutional filters. In practice it is common to have filters of width K smaller than the number of input features, which can change the implicit bias.

The fully connected and convolutional linear networks described above can both be represented in terms of a mapping  P : W → RD that maps the input parameters  w ∈ Wto a linear predictor in  RD,such that the output of the network is given by  fw(x) = ⟨x, P(w)⟩. For fully connected networks, the mapping is given by  Pfull(w) = w1w2 . . . wL, and for convolutional networks,  Pconv(w) =��(w↓L ⋆ wL−1) ⋆ wL−2�. . . ⋆ w1�↓, where  w↓denotes the flipped vector corresponding to w,

given by  w↓[k] = w[D − k − 1] for k = 0, 1, . . . , D  − 1.

Separable linear classification Consider a binary classification dataset  {(xn, yn) : n =1, 2, . . . N} with  xn ∈ RDand  yn ∈ {−1, 1}. The empirical risk minimization objective for training a linear network parameterized as P(w) is given as follows,

image

where  ℓ : R × {−1, 1} → R+is some surrogate loss for classification accuracy, e.g., logistic loss ℓ(�y, y) = log(1 + exp(−�yy))and exponential loss  ℓ(�y, y) = exp(−�yy).

It is easy to see that both fully connected and convolutional networks of any depth L can realize any linear predictor  β ∈ RD. The model class expressed by both networks is therefore simply the unconstrained class of linear predictors, and the two architectures are merely different (over) parameterizations of this class

image

Thus, the empirical risk minimization problem in (4) is equivalent to the following optimization over the linear predictors  β = P(w):

image

Although the optimization problems (4) and (5) are exactly equivalent in terms of the set of global minima, in this paper, we show that optimizing (4) with different parameterizations leads to very different classifiers compared to optimizing (5) directly.

In particular, consider problems (4)/(5) on a linearly separable dataset  {xn, yn}Nn=1and using the logistic loss (the two class version of the cross entropy loss typically used in deep learning). The global infimum of  L(β)is 0, but this is not attainable by any finite  β. Instead, the loss can be minimized by scaling the norm of any linear predictor that separates the data to infinity. Thus, any sequence of predictors  β(t) (say, from an optimization algorithm) that asymptotically minimizes the loss in eq. (5) necessarily separates the data and diverges in norm,  ∥β(t)∥ → ∞. In general there are many linear separators that correctly label the training data, each corresponding to a direction in which we can minimize (5). Which of these separators will we converge to when optimizing (4)/(5)? In other words, what is the direction  β∞ = lim

will diverge in? If this limit exist we say that  β(t) converges in direction to the limit direction  β∞.

Soudry et al. [2017] studied this implicit bias of gradient descent on (5) over the direct parameterization of  β. They showed that for any linearly separable dataset and any initialization, gradient descent w.r.t.  βconverges in direction to hard margin support vector machine solution:

image

In this paper we study the behavior of gradient descent on the problem (4) w.r.t different parameterizations of the model class of linear predictors. For initialization  w(0) and sequence of step sizes  {ηt},gradient descent updates for (4) are given by,

image

where  ∇wP(.)denotes the Jacobian of  P : W → RD with respect to the parameters  w, and ∇βL(.)is the gradient of the loss function in (5).

For separable datasets, if  w(t)minimizes (4) for linear fully connected or convolutional networks, then we will again have  ∥w(t)∥ → ∞, and the question we ask is: what is the limit direction

image

The result in Soudry et al. [2017] holds for any loss function  ℓ(u, y)that is strictly monotone in uy with specific tail behavior, name the tightly exponential tail, which is satisfied by popular classification losses like logistic and exponential loss. In the rest of the paper, for simplicity we exclusively focus on the exponential loss function  ℓ(u, y) = exp(−uy), which has the same tail behavior as that of the logistic loss. Along the lines of Soudry et al. [2017], our results should also extend for any strictly monotonic loss function with a tight exponential tail, including logistic loss.

Our main results characterize the implicit bias of gradient descent for multi-layer fully connected and convolutional networks with linear activations. For the gradient descent iterates  w(t)in eq. (7), we henceforth denote the induced linear predictor as  β(t) = P(w(t)).Assumptions. In the following theorems, we characterize the limiting predictor  β∞ = lim

under the following assumptions:

image

These assumptions allow us to focus on the question of which specific linear predictor do gradient descent iterates converge to by separating it from the related optimization questions of when gradient descent iterates minimize the non-convex objective in eq. (5) and nicely converge in direction.

image

Figure 1: Implicit bias of gradient descent for different linear network architectures.

Theorem 1 (Linear fully connected networks). For any depth L, almost all linearly separable datasets  {xn, yn}Nn=1, almost all initializations  w(0), and any bounded sequence of step sizes  {ηt}t,consider the sequence gradient descent iterates  w(t)in eq. (7) for minimizing  LPfull(w)in eq. (4) with exponential loss  ℓ(�y, y) = exp(−�yy) over L–layer fully connected linear networks.

If (a) the iterates  w(t)minimize the objective, i.e.,  LPfull(w(t)) → 0, (b)  w(t), and consequently β(t) = Pfull(w(t)), converge in direction to yield a separator with positive margin, and (c) gradients with respect to linear predictors  ∇βL(β(t))converge in direction, then the limit direction is given by,

image

For fully connected networks with single output, Theorem 1 shows that there is no effect of depth on the implicit bias of gradient descent. Regardless of the depth of the network, the asymptotic classifier is always the hard margin support vector machine classifier, which is also the limit direction of gradient descent for linear logistic regression with the direct parameterization of  β = w.

In contrast, next we show that for convolutional networks we get very different biases. Let us first look at a 2–layer linear convolutional network, i.e., a network with single convolutional layer followed by a fully connected final layer.

image

Recall that �β ∈ CDdenote the Fourier coefficients of  β, i.e., �β[d] = 1√D

and that any non-zero  z ∈ Cis denoted in polar form as  z = |z|eiφzfor  φz ∈ [0, 2π). Linear predictors induced by gradient descent iterates  w(t) for convolutional networks are denoted by  β(t) =Pconv(w(t)). It is evident that if  β(t)converges in direction to  β∞, then its Fourier transformation �β(t)converges in direction to  �β∞. In the following theorems, in addition to the earlier assumptions, we further assume a technical condition that the phase of the Fourier coefficients  eiφ�β(t)converge

coordinate-wise. For coordinates d with �β

w(t), in which case  eiφ�β(t)[d] → eiφ�β∞[d]. We assume such a  φ�β∞[d] also exists when  �β∞[d] = 0.

Theorem 2 (Linear convolutional networks of depth two). For almost all linearly separable datasets {xn, yn}Nn=1, almost all initializations  w(0), and any sequence of step sizes  {ηt}twith  ηtsmaller than the local Lipschitz at  w(t), consider the sequence gradient descent iterates  w(t)in eq. (7) for minimizing  LPconv(w) in eq. (4)with exponential loss over 2–layer linear convolutional networks.

If (a) the iterates  w(t) minimize the objective, i.e.,  LPconv(w(t)) → 0, (b) w(t) converge in direction to yield a separator  β∞with positive margin, (c) the phase of the Fourier coefficients  �β(t)of the linear predictors  β(t) converge coordinate-wise, i.e.,  ∀d, eiφ�β(t)[d] → eiφ��β∞[d], and (d) the gradients ∇βL(β(t))converge in direction, then the limit direction  β∞is given by,

image

We already see how introducing a single convolutional layer changes the implicit bias of gradient descent—even without any explicit regularization, gradient descent on the parameters of convolutional network architecture returns solutions that are biased to have sparsity in the frequency domain.

Furthermore, unlike fully connected networks, for convolutional networks we also see that the implicit bias changes with the depth of the network as shown by the following theorem. Theorem 2a (Linear Convolutional Networks of any Depth). For any depth L, under the conditions of Theorem 2, the limit direction  β∞ = lim

point of the following optimization problem,

image

where the  ℓppenalty given by  ∥z∥p =��Di=1 |z[i]|p�1/p(also called the bridge penalty) is a norm for p = 1 and a quasi-norm for p < 1.

When L > 2, and thus  p = 2/L < 1, problem (10) is non-convex and intractable Ge et al. [2011]. Hence, we cannot expect to ensure convergence to a global minimum. Instead we show convergence to a first order stationary point of (10) in the sense of sub-stationary points of Rockafellar [1979] for optimization problems with non-smooth and non-convex objectives. These are solutions where the local directional derivative along the directions in the tangent cone of the constraints are all zero.

The first order stationary points, or sub-stationary points, of (10) are the set of feasible predictors  βsuch that  ∃{αn ≥ 0}Nn=1 satisfying the following:  ∀n, yn⟨xn, β⟩ > 1 =⇒ αn = 0, and

image

where  �xnis the Fourier transformation of  xn, and  ∂◦denotes the local sub-differential (or Clarke’s sub-differential) operator defined as  ∂◦f(β) = conv{v : ∃(zk)k s.t. zk → β and ∇f(zk) → v}.

For p = 1 and �βrepresented in polar form as �β = |�β|eiφ�β ∈ CD, ∥�β∥pis convex and the local sub-differential is indeed the global sub-differential given by,

image

image

For p < 1, the local sub-differential of  ∥�β∥pis given by,

image

Figures 1a–1b summarize the implications of the main results in the paper. The proof of this Theorem, exploits the following representation of  Pconv(β)in the Fourier domain. Lemma 3. For full-dimensional convolutions,  β = Pconv(w)is equivalent to

image

where for  l = 1, 2, . . . , L, �w1 ∈ CD are the Fourier coefficients of the parameters  wl ∈ RD.

From above lemma (proved in Appendix C), we can see a connection of convolutional networks to a special network where the linear transformation between layers is restricted to diagonal entries (see depiction in Figure 1c), we refer to such networks as linear diagonal network.

The proof of Theorem 1 and Theorem 2-2a are provided in Appendix B and C, respectively.

We can decompose the characterization of implicit bias of gradient descent on a parameterization P(w) into two parts: (a) what is the implicit bias of gradient descent in the space of parameters w?, and (b) what does this imply in term of the linear predictor  β = P(w), i.e., how does the bias in parameter space translate to the linear predictor learned from the model class?

We look at the first question for a broad class of linear models, where the linear predictor is given by a homogeneous polynomial mapping of the parameters:  β = P(w), where w ∈ RP are the parameters of the model and  P : RP → RD satisfies definition below. This class covers the linear convolutional, fully connected networks, and diagonal networks discussed in Section 3.

Definition (Homogeneous Polynomial). A multivariate polynomial function  P : RP → RDis said to be homogeneous, if for some finite integer  ν < ∞, ∀α ∈ R, v ∈ RP , P(αv) = ανP(v).

Theorem 4 (Homogeneous Polynomial Parameterization). For any homogeneous polynomial map P : RP → RDfrom parameters  w ∈ RDto linear predictors, almost all datasets  {xn, yn}Nn=1separable by  B := {P(w) : w ∈ RP }, almost all initializations  w(0), and any bounded sequence of step sizes  {ηt}t, consider the sequence of gradient descent updates  w(t) from eq. (7)for minimizing the empirical risk objective  LP(w) in (4)with exponential loss  ℓ(u, y) = exp(−uy).

If (a) the iterates  w(t)asymptotically minimize the objective, i.e.,  LP(w(t)) = L(P(w(t))) → 0, (b)  w(t), and consequently  β(t) = P(w(t)), converge in direction to yield a separator with positive margin, and (c) the gradients w.r.t. to the linear predictors,  ∇βL(β(t))converge in direction, then the limit direction of the parameters  w∞ = lim

point of the following optimization problem,

image

Theorem 4 is proved in Appendix A. The proof of Theorem 4 involves showing that the asymptotic direction of gradient descent iterates satisfies the KKT conditions for first order stationary points of (14). This crucially relies on two properties. First, the sequence of gradients  ∇βL(β(t))converge in direction to a positive span of support vectors of  β∞ = lim

[2018]), and this result relies on the loss function  ℓbeing exponential tailed. Secondly, if P is not homogeneous, then the optimization problems  minw∥w∥22 s.t. ∀n, ⟨xn, yn⟩ ≥ γfor different values of unnormalized margin  γare not equivalent and lead to different separators. Thus, for general non-homogeneous P, the unnormalized margin of one does not have a significance and the necessary conditions for the first order stationarity of (14) are not satisfied.

Finally, we also note that in many cases (including linear convolutional networks) the optimization problem (14) is non-convex and intractable (see e.g., Ge et al. [2011]). So we cannot expect  w∞to be always be a global minimizer of eq. (14). We however suspect that it is possible to obtain a stronger result that  w∞reaches a higher order stationary point or even a local minimum of the explicitly regularized estimator in eq. (14).

Implications of the implicit bias in predictor space While eq. (14) characterizes the bias of gradient descent in the parameter space, what we really care about is the effective bias introduced in the space of functions learned by the network. In our case, this class of functions is the set of linear predictors  {β ∈ RD}. The ℓ2norm penalized solution in eq. (14), is equivalently given by,

image

The problems in eq. (14) and eq. (15) have the same global minimizers, i.e.,  w∗ is global minimizer of eq. (14) if and only if  β∗ = P(w∗)minimizes eq. (15). However, such an equivalence does not extend to the stationary points of the two problems. Specifically, it is possible that a stationary point of eq. (14) is merely a feasible point for eq. (15) with no special significance. So instead of using Theorem 4, for the specific networks in Section 3, we directly show (in Appendix) that gradient descent updates converge in direction to a first order stationary point of the problem in eq. (15).

In the previous section, we saw that the implicit bias of gradient descent on a parameterization P(w) can be described in terms of the optimization problem (14) and the implied penalty function RP(β) = minw:P(w)=β∥w∥22. We now turn to studying this implied penalty  RP(β)and obtaining explicit forms for it, which will reveal the precise form of the implicit bias in terms of the learned linear predictor. The proofs of the lemmas in this section are provided in the Appendix D.

Lemma 5. For fully connected networks of any depth L > 0,

image

We see that  β∗RPfull = argminβ RPfull(β) s.t. ∀n, yn⟨xn, β⟩ ≥ 1in eq. (15) for fully connected networks is independent of the depth of the network L. In Theorem 1, we indeed show that gradient descent for this class of networks converges in the direction of  β∗RPfull .

Next, we motivate the characterization of  RP(β)for linear convolutional networks by first looking at the special linear diagonal network depicted in Figure 1c. The depth–L diagonal network is parameterized by  w = [wl ∈ RD]Ll=1and the mapping to a linear predictor is given by  Pdiag(w) = diag(w1)diag(w2) . . . diag(wL−1)wL.

Lemma 6. For a depth–L diagonal network with parameters  w = [wl ∈ RD]Ll−1, we have

image

Finally, for full width linear convolutional networks parameterized by  w = [wl ∈ RD]Ll=1, recall the following representation of  β = Pconv(w)in Fourier from Lemma 3.

image

where �β, �wl ∈ CDare Fourier basis representation of  β, wl ∈ RD, respectively. Extending the result of diagonal networks for the complex vector spaces, we get the following characterization of RPconv(β)for linear convolutional networks.

Lemma 7. For a depth–L convolutional network with parameters  w = [wl ∈ RD]Ll−1, we have

image

In this paper, we characterized the implicit bias of gradient descent on linear convolutional networks. We showed that even in the case of linear activations and a full width convolution, wherein the convolutional network defines the exact same model class as fully connected networks, merely changing to a convolutional parameterization introduces radically different, and very interesting, bias when training with gradient descent. Namely, training a convolutional representation with gradient descent implicitly biases towards sparsity in the frequency domain representation of linear predictor.

For convenience and simplicity of presentation, we studied one dimensional circular convolutions. Our results can be directly extended to higher dimensional input signals and convolutions, including the two-dimensional convolutions common in image processing and computer vision. We also expect similar results for convolutions with zero padding instead of circular convolutions, although this requires more care with analysis of the edge effects.

A more significant way in which our setup differs from usual convolutional networks is that we use full width convolutions, while in practice it is common to use convolutions with bounded width, much smaller then the input dimensionality. This setting is within the scope of Theorem 4, as the linear transformation is still homogeneous. However, understanding the implied bias in the predictor space, i.e. understanding  RP(β)requires additional work. It will be very interesting to see if restricting the width of the convolutional network gives rise to further interesting behaviors.

Another important direction for future study is understanding the implicit bias for networks with multiple outputs. For both fully connected and convolutional networks, we looked at networks with a single output. With C > 1 outputs, the network implements a linear transformation  x �→ βx whereβ ∈ RC×Dis now a matrix. Results for matrix sensing in Gunasekar et al. [2018] imply that for two layer fully connected networks with multiple outputs, the implicit bias is to a maximum margin solution with respect to the nuclear norm  ∥β∥⋆. This is already different from the implicit bias of a one-layer “network” (i.e. optimizing  βdirectly), which would be in terms of the Frobenius norm ∥β∥F(from the result of Soudry et al. [2017]). We suspect that with multiple outputs, as more layers are added, even fully connected networks exhibit a shrinking sparsity penalty on the singular values of the effective linear matrix predictor  β ∈ RC×D. Precisely characterizing these biases requires further study.

When using convolutions as part of a larger network, with multiple parallel filters, max pooling, and non-linear activations, the situation is of course more complex, and we do not expect to get the exact same bias. However, we do expect the bias to be at the very least related to the sparsity-in-frequency-domain bias that we uncover here, and we hope our work can serve as a basis for further such study. There are of course many other implicit and explicit sources of inductive bias—here we show that merely parameterizing transformations via convolutions and using gradient descent for training already induces sparsity in the frequency domain.

On a technical level, we provided a generic characterization for the bias of gradient descent on linear models parameterized as  β = P(w)for a homogeneous polynomial P. The  ℓ2bias (in parameter space) we obtained is not surprising, but also should not be taken for granted – e.g., the result does not hold in general for non-homogeneous P, and even with homogeneous polynomials, the characterization is not as crisp when other loss functions are used, e.g., with a squared loss and matrix factorization (a homogeneous degree two polynomial representation), the implicit bias is much more fragile Gunasekar et al. [2017], Li et al. [2017]. Moreover, Theorem 4 only ensures convergence to first order stationary point in the parameter space, which is not sufficient for convergence to stationary points of the implied bias in the model space (eq. (15)). It is of interest for future work to strengthen this result to show either convergence to higher order stationary points or local minima in parameter space, or to directly show the convergence to stationary points of (15).

It would also be of interest to strengthen other technical aspects of our results: extend the results to loss functions with tight exponential tails (including logistic loss) and handle all datasets including the set of measure zero degenerate datasets—these should be possible following the techniques of Soudry et al. [2017], Telgarsky [2013], Ji and Telgarsky [2018]. We can also calculate exact rates of convergence to the asymptotic separator along the lines of Soudry et al. [2017], Nacson et al. [2018], Ji and Telgarsky [2018] showing how fast the inductive bias from optimization kicks in and why it might be beneficial to continue optimizing even after the loss value  L(β(t))itself is negligible. Finally, for logistic regression, Ji and Telgarsky [2018] extend the results of asymptotic convergence of gradient descent classifier to the cases where the data is not strictly linearly separable. This is an important relaxation of our assumption on strict linear separability. More generally, for non-separable data, we would like a more fine grained analysis connecting the iterates  β(t) along the optimization path to the estimates along regularization path, �β(c) = argminRP(β)≤c L(β), where an explicit regularization is added to the optimization objective.

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2003.

Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, 2017.

Dongdong Ge, Xiaoye Jiang, and Yinyu Ye. A note on the complexity of lp minimization. Mathematical programming, 2011.

Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In NIPS, 2017.

Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 1997.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, 2017.

Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.

Michel Journée, Francis Bach, P-A Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.

Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, 2009.

Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 2016.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2016.

Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, 2016.

Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203, 2017.

Marian Muresan. A concrete approach to classical analysis, volume 14. Springer, 2009.

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.

Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015a.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015b.

Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint, 2017.

Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.

R Tyrrell Rockafellar. Directionally lipschitzian functions and subdifferential calculus. Proceedings of the London Mathematical Society, 1979.

Le Smith, Kindermans. Don’t Decay the Learning Rate, Increase the Batch Size. In ICLR, 2018.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.

Matus Telgarsky. Margins, shrinkage and boosting. In Proceedings of the 30th International Conference on International Conference on Machine Learning-Volume 28, pages II–307. JMLR. org, 2013.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

The proofs of the theorems in the paper are organized as follows: In Appendix A we first give the proof for Theorem 4, which includes linear fully connected and full width convolutional networks as special cases. This gives us some general results that can be special-cased to prove the stronger results for these networks in Section 3. In Appendix B, we prove Theorem 1 on the implicit bias of fully connected linear networks. In Appendix C, we prove Theorem 2–2a on the implicit bias of linear convolutional networks. Finally, in Appendix D we prove the lemmas in Section 5 on computing the form of implicit bias of linear networks learned using gradient descent.

Unless specified otherwise,  ∥.∥denotes the Euclidean norm. We additionally use the notation  v ∝ v′to denote equality up to strictly positive scalar multipliers, i.e., when  v = γv′ for some γ > 0.

The following is a paraphrasing of Lemma 8 in Gunasekar et al. [2018] and is used in multiple proofs.

Lemma 8. [Lemma 8 in Gunasekar et al. [2018]] For almost all linearly separable dataset  {xn, yn}n,consider any sequence  β(t)that minimizes the empirical objective in eq. (5), i.e.,  L(β(t)) → 0. If (a)  β∞ := lim

∃{αn ≥ 0}n∈S s.t. z∞ = �n∈S αn ynxn, where S = {n : yn⟨β∞, xn⟩ = minn yn⟨β∞, xn⟩} arethe indices of the data points with smallest margin to the limit direction  β∞.

Theorem 4 (Homogeneous Polynomial Parameterization). For any homogeneous polynomial map P : RP → RDfrom parameters  w ∈ RDto linear predictors, almost all datasets  {xn, yn}Nn=1separable by  B := {P(w) : w ∈ RP }, almost all initializations  w(0), and any bounded sequence of step sizes  {ηt}t, consider the sequence of gradient descent updates  w(t) from eq. (7)for minimizing the empirical risk objective  LP(w) in (4)with exponential loss  ℓ(u, y) = exp(−uy).

If (a) the iterates  w(t)asymptotically minimize the objective, i.e.,  LP(w(t)) = L(P(w(t))) → 0, (b)  w(t), and consequently  β(t) = P(w(t)), converge in direction to yield a separator with positive margin, and (c) the gradients w.r.t. to the linear predictors,  ∇βL(β(t))converge in direction, then the limit direction of the parameters  w∞ = lim

point of the following optimization problem,

image

Proof.  w(t) are the sequence gradient descent iterates from eq. (7) for minimizing  LP(w) in eq (4)with exponential loss over the model class of  B = {P(w) : w ∈ RP }, where P is a homogeneous polynomial function. We first introduce some notation.

1. From the assumption in theorem, we have that  w∞ = lim

image

2. Let  β(t) = P(w(t))denote the sequence of linear predictors for this network induced by the gradient descent iterates. We can see that  β(t)converges in direction too using the following arguments: homogeneity of P implies that  P(w(t)/∥w(t)∥) = P(w(t))/∥w(t)∥νfor some  ν. Hence,

image

3.  z(t) = −∇βL(β(t)) = �n exp�−⟨β(t), ynxn⟩�ynxn. Since we assume that  z(t)converges in

image

4. Let  ∇wP�w(t)�∈ RP ×D denote the Jacobian of  P(w), i.e., ∇wP�w(t)�[p, d] =∂(P(w(t))[d])∂w[p] .

If  P : RP → RDis a homogeneous polynomial of degree  ν > 0, then  ∇wP : RP → RP ×Dis a homogeneous polynomial of degree  ν − 1. Using eq. (16), we have

image

Thus,  ∃δ(t)1 → 0, such that

image

5. Finally, from the definition of  ∇wP(w), we have ∇wLP(w(t)) = ∇wP�w(t)�∇βL(β(t)), andhence from eq. (7),

image

Using the assumptions in the theorem along with our argument above for convergence of  β(t)in direction, we satisfy the conditions of Lemma 8, which will be crucially used in our proof.

image

Primal feasibility. We showed earlier that if  w(t)converges in direction, then  β(t) = P(w(t))converges in direction to  β∞ = lim

we have that  β∞ satisfies ∀n, yn⟨xn, β∞⟩ > 0, which also implies  minn yn⟨xn, P(w∞)⟩ > 0 sinceβ∞ ∝ P(w∞). Now, if Pis homogeneous of of degree  ν, then for γ = (minn yn⟨xn, P(w∞)⟩)−1/ν,�w∞ = γw∞ satisfies minn yn⟨xn, P(�w∞)⟩ = 1.

Showing other KKT conditions for  �w∞.The crux of the proof of Theorem 4 involves showing the existence of  {αn ≥ 0}nsuch that the stationarity and complementary slackness conditions in eq. (20) are satisfied. This crucially relies on a key lemma (Lemma 8) showing that the gradient in the space of linear predictors  ∇βL(β(t))are dominated by positive linear combinations of support vectors of the asymptotic predictor  β∞.

Let  S∞ = {n : yn⟨P(�w∞), xn⟩ = 1}denote the indices of support vectors for  P(�w∞), which are also the support vectors of  β∞, since by homogeneity of  P, β∞ ∝ P(w∞) ∝ P(�w∞). Thus, from Lemma 8, we have  z∞ = lim

We propose a positive scaling of this  {αn}Nn=1 as our candidate dual certificate, which satisfies both dual feasibility and complementary slackness.

To prove the theorem, the remaining step is to show that  �w∞ ∝ ∇wP(�w∞)z∞. Since �w∞ = γw∞and P is homogeneous, this condition is equivalent to showing that  w∞ ∝ ∇wP(w∞)z∞.

Showing that  w∞ ∝ ∇wP(w∞)z∞.Substituting for  z(t)and  ∇wP(w(t))from eqs. (17) and (18), respectively, in the gradient descent updates (eq. (19)), we have the following:

image

image

Summing over t, we have

image

We want to argue that the first term, i.e.,  ∇wP (w∞) z∞, is the dominant term. Towards this we state and prove the following intermediate claim

Claim 1.  ∥∇wP (w∞) z∞∥ > 0 and �u<t ηup(u)g(u)ν−1 → ∞.

Proof. First, it is straight forward to check that for any scalar valued homogeneous polynomial f : RP → Rof degree  ν, we have  ⟨w, ∇wf(w)⟩ = νf(w), where for  p = 1, 2 . . . , P, ∇wf(w)[p] =

dw[p] (this is also known as the Euler’s homogeneous function theorem). Extending this to our vector valued homogeneous function  P : RP → RD, we have that for all w, the Jacobian  ∇wP(w) ∈RP ×D satisfies ∇wP(w)⊤w = νP(w).

Moreover, we have that for the limit direction  w∞, the margin of the corresponding classifier is strictly positive, i.e.,  minn yn⟨P(w∞), xn⟩ > 0. Now from Lemma 8, using that  z∞ = �n∈S∞ αnynxnfor  αn ≥ 0(and not all zero since  z∞ is unit norm), we immediately get the following

image

To prove the second part, we note the following

image

Thus, if  lim supt→∞ bt = ∞ then limt→∞ bt = ∞. On contrary, if  lim supt→∞ bt = C < ∞, thenfrom eq. (22), for large t we get,  ∥w(t)∥ ≤ ∥w(0)∥ + ∥∇P(w∞)z∞∥C +�supt∥δ(t)∥�C < ∞

which contradicts  ∥w(t)∥ → ∞.

From above claim, the sequence  bt = �u<t ηup(u)g(u)ν−1is monotonic increasing and diverging. Thus, for  at = �u<t δ(u)ηup(u)g(u)ν−1, using Stolz-Cesaro theorem (Theorem 11), we have

image

Substituting eq. (23) in eq. (22), we have

image

where in (a) we absorbed the diminishing terms into  δ(t)3 = δ(t)2 +w(0)/ �u<t ηup(u)g(u)ν−1 → 0,(b) follows since we proved in the claim above that  ∇wP (w∞) z∞ ̸= 0and hence dominates  δ(t)3 .We have shown that  w∞ = γ∇wP (w∞) z∞ for a positive scalar  γ, which completes the proof.

Theorem 1 (Linear fully connected networks). For any depth L, almost all linearly separable datasets  {xn, yn}Nn=1, almost all initializations  w(0), and any bounded sequence of step sizes  {ηt}t,consider the sequence gradient descent iterates  w(t)in eq. (7) for minimizing  LPfull(w)in eq. (4) with exponential loss  ℓ(�y, y) = exp(−�yy) over L–layer fully connected linear networks.

If (a) the iterates  w(t)minimize the objective, i.e.,  LPfull(w(t)) → 0, (b)  w(t), and consequently β(t) = Pfull(w(t)), converge in direction to yield a separator with positive margin, and (c) gradients with respect to linear predictors  ∇βL(β(t))converge in direction, then the limit direction is given by,

image

Proof. Recall that for fully connected networks of any depth L > 0 with parameters w = [wl ∈ RDl−1×Dl]Ll−1, the equivalent linear predictor given by  Pfull(w) = w1w2 . . . wLis a homogeneous polynomial of degree L.

Let  w(t) = [w(t)l ∈ RDl−1×Dl]Ll=1denote the iterates of individual matrices  wlalong the gradient descent path, and  β(t) = Pfull(w(t))denote the corresponding sequence of linear predictors.

We first introduce the following notation.

1. Let  w∞ = limt→∞w(t)∥w(t)∥denote the limit direction of the parameters, with component matrices in each layer denoted as  w∞ = [w∞l ]. Specializing (16) for fully connected networks, we have:

image

3. Let  z(t) = −∇βL(β(t)). Again repeating eq. (17) for fully connected networks, we have for some  δ(t)z → 0 and p(t) = ∥z(t)∥,

image

The proof of Theorem 1 is fairly straight forward from using Lemma 8 and the intermediate results in the proof of Theorem 4.

Showing KKT conditions for �β∞ ∝ Pfull(w∞).Using our notation described above, we have w∞1:L = Pfull(w∞). In the following arguments we show that a positive scaling  �β∞ = γw∞1:Lsatisfies the following KKT conditions for the optimality of  ℓ2maximum margin problem in eq. (8):

image

As we saw in proof of Theorem 4, since  w∞1:L = Pfull(w∞)has strictly positive margin, us- ing homogeneity of  Pfull, we can scale  w∞1:Lto get  �β∞ = γw∞1:Lwith unit margin, i.e., ∀n, yn⟨xn, �β∞⟩ ≥ 1. For dual variables, we again use a positive scaling of  αnfrom Lemma 8, such that  z∞ = �n∈S∞ αn ynxn. In order to prove the theorem, we need to show that  �β∞ ∝ z∞or equivalently  w∞1:L ∝ z∞.

Recall that in the proof of Theorem 4, we showed a version of stationarity in the parameter space in eq. (26), repeated below.

image

This case in particular includes  Pfullwhich is homogeneous with  ν = L. We special case the result fully connected network. In particular, for the parameters of the first layer  w1, we have P(w) = w1w2:L, where  w1 ∈ Rd×d1and  w2:L ∈ Rd1×1. This implies, for any  z, ∇w1P(w)z = zw⊤2:L. Using this along with eq. (31), we get the following expression for some positive scalar  γ

image

Since  w∞1:L ∝ �β∞, we have shown that  �β∞ ∝ z∞, which completes our proof of Theorem 1.

Recall that L–layer linear convolutional networks have parameters  w = [wl ∈ RD]Ll−1. We first recall some complex numbers terminology and properties

1. Complex vectors  �z ∈ CD are represented in polar form as  �z = |�z|eiφ�z, where |�z| ∈ RD+ andφ�z ∈ [0, 2π)D are the vectors with magnitudes and phases, respectively, of components  �z.2. For �z = |�z|eiφ�z ∈ CD, the complex conjugate vector is denoted by  �z∗ = |�z|e−iφ�z.3. The complex inner product for  �x, �β ∈ CD is given by  ⟨�x, �β⟩ = �d �x[d]�β∗[d] = �x⊤�β∗.

4. Let  F ∈ CD×D denote the discrete Fourier transform matrix with  F[d, p] = 1√DωdpD where

recall that  ωD = e− 2πiDis the  Dthcomplex root of unity. Thus, for any  z ∈ RD, the representation in Fourier basis is given by  �z = Fz. Fand its complex conjugate matrix  F∗also satisfy:  FF∗ = F∗F = I, F = F⊤ and F∗ = F∗⊤.

Before getting into full proofs of Theorem 2a–2, we also prove the two lemmas (Lemma 3 and Lemma 9) that establish equivalence of dynamics of gradient descent on full dimensional convolutional networks to those on linear diagonal networks (Figure 1c), albeit with complex valued parameters. This makes the analysis of the of convolutional networks simpler and more intuitive.

We begin by proving Lemma 3 which shows the equivalence of representation between convolutional networks and diagonal networks.

Lemma 3. For full-dimensional convolutions,  β = Pconv(w)is equivalent to

image

Proof. First, we state the following properties which follow immediately from definitions:

1. For  x, β ∈ RD,

image

where recall that the complex inner product is given by  ⟨�x, �β⟩ = �x⊤�β∗.

2. We next show the following property

image

where  ⊙denotes the Hadamard product (elementwise product), i.e.,  (a ⊙ b)[d] = a[d]b[d].

The above equation follows from simple manipulations of definitions: recall that (Fz)[d] =

image

image

where (a) follows as  ωDD = 1 and in (b)we used the change of variables  p = (k′ − k) mod D (recallour use of modulo operator as  a mod D = a − D� aD�).

Recall from eq. (3) the output of an L-layer convolutional network is given by

image

Denote  hL−1(x) = (((x ⋆ w1) ⋆ w2) . . .) ⋆ wL−1. By iteratively using eq. (34), we have

image

Thus, on one hand using the above equation we have,

image

where (a) follows from substituting for  FhL−1(x) from eq. (36)and noting that for any  {zl ∈ RD},(z1 ⊙ z2 ⊙ . . . zL−1)⊤zL = z⊤1 (z2 ⊙ z3 ⊙ . . . zL), and (b) uses the definition of complex inner product  ⟨�x, �β⟩ = �x⊤�β∗.

Now further using eq. (33) in above equation, we have

image

Thus, for  β = Pconv(w), we have shown that �β = FPconv(w) = �w1 ⊙ �w2 . . . ⊙ �wL =diag(�w1)diag(�w2) . . . diag(�wL−1)�wL.

For  �w = [�wl ∈ CD]Ll=1, let  Pdiag(�w) = diag(�w1)diag(�w2) . . . diag(�wL−1)�wL = �w1 ⊙ �w2 . . . ⊙�wLdenote the equivalent parameterization of convolutional network in Fourier domain.

The above lemma shows that optimizing  LPconv(w)in eq. (4) is equivalent to the following minimization problem in terms of representation,

image

The following lemma further shown that not only the representations of  Pconv(w) and Pdiag(�w) areequivalent, but there corresponding gradient descent updates for problems in eq. (4) and eq. (39) are also equivalent up to Fourier transformations.

Lemma 9. Consider the gradient descent iterates  w(t) = [w(t)l ]Ll=1from eq. (7) for minimizing LPconvin eq. (4) over full dimensional linear convolutional networks. For all l, the incremental update directions,  ∆w(t)l := w(t+1)l − w(t)l = −ηt∇wlLPconv(w(t))satisfy the following,

image

where  �w(t) =��w(t)l �Ll=1 are the Fourier transformations of  w(t) = [w(t)l ]Ll=1, respectively.

image

image

Using the above equation we have,

image

where in  (a) we use ℓ′(�y, y) = ∂ℓ(�y,y)∂�yand the remaining equalities simply follow from manipulation of derivatives. From above equation, we have the following:

image

C.1 Proof of Theorem 2–2a

Theorem 2 (Linear convolutional networks of depth two). For almost all linearly separable datasets {xn, yn}Nn=1, almost all initializations  w(0), and any sequence of step sizes  {ηt}twith  ηtsmaller than the local Lipschitz at  w(t), consider the sequence gradient descent iterates  w(t)in eq. (7) for minimizing  LPconv(w) in eq. (4)with exponential loss over 2–layer linear convolutional networks.

If (a) the iterates  w(t) minimize the objective, i.e.,  LPconv(w(t)) → 0, (b) w(t) converge in direction to yield a separator  β∞with positive margin, (c) the phase of the Fourier coefficients  �β(t)of the linear predictors  β(t) converge coordinate-wise, i.e.,  ∀d, eiφ�β(t)[d] → eiφ��β∞[d], and (d) the gradients ∇βL(β(t))converge in direction, then the limit direction  β∞is given by,

image

Theorem 2a (Linear Convolutional Networks of any Depth). For any depth L, under the conditions of Theorem 2, the limit direction  β∞ = lim

point of the following optimization problem,

image

where the  ℓppenalty given by  ∥z∥p =��Di=1 |z[i]|p�1/p(also called the bridge penalty) is a norm for p = 1 and a quasi-norm for p < 1.

For the gradient descent iterates  w(t) = [w(t)l ]Ll=1 from eq. (7)denote the sequence of corresponding linear predictors as  β(t) = Pconv(w(t)). Let �β(t) = Fβ(t)and  �w(t)l = Fw(t)ldenote the Fourier transforms of  β(t) and w(t)l , respectively, and let  �w(t) =��w(t)l �Ll=1.

image

image

KKT conditions for optimality We want to show that a positive scaling of  β∞ ∝ Pconv(w∞), denoted by �β∞ = γPconv(w∞)is a first order stationary point of eq. (10), repeated below,

image

Recall the KKT conditions discussed in Section 3. The first order stationary points, or sub-stationary points, of (10) are the set of feasible predictors  β such that ∃{αn ≥ 0}Nn=1 satisfying the following:

∀n, yn⟨xn, β⟩ > 1 =⇒ αn = 0, and

image

where  ∂◦ denotes the local sub-differential (or Clarke’s sub-differential) operator defined as  ∂◦f(β) =conv{v : ∃(zk)k s.t. zk → β and ∇f(zk) → v}.

For p = 1 and �βrepresented in polar form as �β = |�β|eiφ�β ∈ CD, ∥�β∥pis convex and the local sub-differential is indeed the global sub-differential given by,

image

For p < 1, the local sub-differential of  ∥�β∥pis given by,

image

Showing KKT conditions for �β∞ ∝ Pconv(w∞).As we showed proof of Theorem 4, since Pconv(w∞)has strictly positive margin, using homogeneity of  Pconv, we can scale  Pconv(w∞)to get �β∞ = γPconv(w∞)with unit margin, i.e.,  ∀n, yn⟨xn, �β∞⟩ ≥ 1. For dual variables, we again use a positive scaling of  αnfrom Lemma 8, such that  z∞ = �n∈S∞ αn ynxn.

In order to prove the theorem, we need to show that for some positive scalar  γ, γ�z∞ ∈ ∂◦∥�β∥2/L, i.e., satisfies the conditions in eq. (47) and (48), for L = 2 and L > 2, respectively.

We start from the stationarity condition in the parameter space in eq. (26) of Theorem 4. For some positive scalar  γ, we have

image

We will now special case the above equation for fully width convolutional networks.

From Lemma 3, we have that for all  w = [wl ∈ RD], we have  Pconv(w) = F∗Pdiag(Fw)where F and  F∗denote discrete Fourier matrix and its inverse in appropriate dimensions. Let  {ed}Dd=1denote the standard basis in  RD. We first note that for all l = 1, 2, . . . , L and for all d = 1, 2, . . . , D, the following holds

image

This implies, for  l = 1, 2, . . . , L and any z ∈ RD, we have

image

Substituting the above equation in eq. (49), we have,

image

where �w∞∗l′denotes the complex conjugate of �w∞l′ .

image

Also, by multiplying the LHS of eq. (55) across all l and taking Lth root over positive scalars, we have for  d = 0, 1, . . . , D − 1,

image

Finally, let  γbe a positive scaling of  β∞ such that �β∞ = γβ∞has unit margin. Let  ��β∞= F �β∞ =

image

C.1.1 Case of  L > 2 or p = 2/L < 1

For  p = 2/L < 1, since �z∞ = �n∈S∞ αnyn�xn, eq. (58) is indeed the first order stationarity condition for eq. (10) as described in eq. (11) and (13).

C.1.2 Case of  L = 2 or p = 2/L = 1

For the case of p = 1, in addition to eq. (58), we need to show that  γ|�z∞| ≤ 1. From eq. (58), for L = 2 we have�����β∞[d]��� ̸= 0 =⇒ γ|�z∞[d]| = 1.

We need to further show that  ∀d s.t.�����β∞[d]��� ∝����β∞[d]��� = 0, γ|�z∞[d]| ≤ 1.

image

Using Lemma 9 for for the special case of 2–layer linear convolutional network, for  ∀d,

image

Further, from eq. (55), we have  ∀d, |�w∞1 [d]|2 = |�w∞2 [d]|2, and hence

image

From the convergence of complex numbers, we have the following:

image

In the remainder of the proof, we only consider  d with |�z∞[d]| ̸= 0.

Consider  u(t)ddefined below,

image

where (a) follows from using  eiφ�z∞[d] = eiφ�β∞[d] = eiφ �w∞1 [d] · eiφ �w∞2 [d]whenever  β∞[d] ̸= 0(from eq. (62)), and (b) follows from eq. (60).

image

Additionally, since  eiφ�z(t)[d] → eiφ�z∞[d], we can write  e±i�φ�z(t)[d]−φ�z∞[d]�= 1 + δ(t)1,d ± iδ(t)2,d whereδ(t)1,d, δ(t)2,d → 0are real scalars. Substituting in above equation and rearranging the terms, we have

image

where in  (a) we define τ (t)d = iδ(t)2,d��w(t)∗2 [d] − �w(t)1 [d] · e−iφ�z∞[d]�.

The following intermediate lemma is proved in Appendix C.1.3.

image

p(t) → |�z∞[d]|, there exists  δ(t)4,d → 0such that  |�z(t)[d]| = |�z∞[d]|p(t) + δ(t)4,dp(t). Substituting these representations in eq. (65), we have the following dynamics for  ud(t),

image

where in (a) we have accumulated all diminishing terms into  δ(t)d = δ(t)4,d�1 + δ(t)1,d + δ(t)3,d�+

image

Step 2. Remainder of the proof: We now prove our theorem by looking the following quantity: For any  d, d′ with �z∞[d],�z∞[d′] ̸= 0, define κ(t)d,d′ =����u(t)du(t)d′

We will show that whenever  |�z∞[d]| > |�z∞[d′]|, we get κ(t)d,d′ → ∞. Along with eq. (64), this would imply that  limt→∞ κ(t)d,d′ =�

have  γ|�z∞[d]| ≤ γ|�z∞[d′]|. Moreover from eq.(57)), we know that  γ|�z∞[d′]| = 1for all  d′with

image

Showing |�z∞[d]| > |�z∞[d′]| =⇒ κ(t)d,d′ → ∞:

For any  2ϵ > 0, let |�z∞[d]| − |�z∞[d′]| = 2ϵ > 0. We note that the since the loss  L(β(t)) → 0, normof the gradient  p(t) = ∥z(t)∥ = ∥�zt∥ → 0. Hence, for any finite step size sequence  {ηt}, there exists t1 such that ∀t ≥ t1 and ∀d, ηtp(t)�|�z∞[d]| + |δ(t)d |�< 0.5and the following inequalities hold,

image

where in (a) follows from using 1/(1+x) ≥ (1 − x)for x < 1 since  ηtp(t)�|�z∞[d]| + |δ(t)d |�< 0.5for all  t ≥ t1, and in (c), we absorbed all o(p(t)) terms as  δ(t)d,d′p(t)for  δ(t)d,d′ → 0and used |�z∞[d]| − |�z∞[d′]| = 2ϵ > 0.

image

Further, from the conditions of the theorem, for almost all initializations,  |�w(0)l [d]| > 0for all d. For step sizes  {ηt}smaller than the local Lipschitz constant, for all finite  t′ < ∞, we also have |w(t′)l [d]| > 0. Moreover from Lemma 10, we have that  |u(t)d |, |u(t)d′ | → ∞and hence  ∃t3such that ∀t ≥ t3, |u(t)d | > 0, but for any finite  t′ < ∞, |u(t′)d′ | < ∞. Thus, for  t0 = max{t1, t2, t3}, using the above observations, we have that  κ(t0)d,d′ =����u(t0)du(t0)d′

Now, using eq. (71), for all  t ≥ t0,

image

Moreover, we have  u(t)d → ∞for at least one d, and for any finite step sizes and finite  t0, |u(t0)d | < ∞.This then implies that for some  µ < ∞, exp��tu=t0 µηup(u)�→ ∞ =⇒ �tu=t0 ηup(u) → ∞. Thus, for any  ϵ > 0, we also have �tu=t0(1 + ϵηup(u)) ≥ ϵ �tu=t0 ηup(u) → ∞.

From eq. (72) and above claim, we conclude that for all  d, d′, if |�z∞[d]| > |�z∞[d′]|, then κ(t)d,d′ → ∞.

This completes the proof of the theorem. □

C.1.3 Proof of Lemma 10

image

Proof. Recalling  τ (t)dfrom eq. (65) and  u(t)dfrom eq. (63), we have the following:

image

For all d if �β

| �w(t)2 [d]|/g(t) → | �w∞1 [d]|| �w∞2 [d]| = 1 (from eq. (60)), and also that  e−iφ�z∞[d]+iφ�β(t)[d] → e−iφ�z∞[d]+iφ�β∞[d] =1 (from eq. (62)). This along with eq. (73) gives us τ (t)du(t)d → 0.

Moreover, since  |�β(t)[d]| → ∞, we have  |�w(t)2 [d]|or  |�w(t)2 [d]| → ∞. Further, using e−iφ�z∞[d]+iφ�β(t)[d] → 1, we have |u(t)d | = |�w(t)2 [d]| + |�w(t)1 [d]|e−iφ�z∞[d]+iφ�β(t)[d] → ∞.

We now only need to show that these results also hold for d such that �β

assumptions of the theorem that even when �β

eiφ�β∞[d]. We now prove the lemma by showing the following steps for  d such that �β∞[d] = 0. :

image

Proof of lemma assuming Step 1 and Step 2 hold The above steps would imply that in eq. (73),

image

These eqs. along with eq. (73) in turn prove the lemma, i.e., τ (t)du(t)d → 0 and |u(t)d | → ∞.

Showing Step 1 and Step 2

Step 1. Show | �w(t)1 [d]|| �w(t)2 [d]| → 1.

image

Note that since  |�z(t)[d]|2 → 0 and ηtare finite, we have that  ∃t1such that for all  t ≥ t1, ηt|�z(t)[d]|2 ≤1. From the above equation, we have the following for  t ≥ t1,

image

where (a) follows from iterating over  t and using |�z(t)[d]|2 ≤ 1 for t ≥ t1.

Since  |�β(t)[d]| = |�w(t)1 [d]| · |�w(t)2 [d]| → ∞, at least one of  |�w(t)1 [d]|, |�w(t)2 [d]|must diverge. Without loss of generality, let  |�w(t)2 [d]| → ∞. Let c(t) := |�w(t)1 [d]|2 − |�w(t)2 [d]|2 with |c(t)| < ∞. We have

image

where the convergence in (a) follows since  |c(t)| < ∞(from eq. (76)) and  |�w(t)2 [d]| → ∞.

Step 2. Show Re(e−iφ�z∞[d]+iφ�β∞[d]) = 2 cos�φ�z∞[d] − φ�β∞[d]

Note that from Step 1 above, we have that | �w(t)1 [d]|2| �w(t)2 [d]|2 → 1, which implies  | �w(t)1 [d]|2+| �w(t)2 [d]|22|�β(t)[d]| =

image

Also, from eq. (44), there exists  δ(t)2,d → 0, such that

image

Using the above representations, along with eq. (59), we have the following,

image

where (a) follows from substituting eqs. (79)-(80), and (b) follows from using |�z(t)[d]| ≤ p(t) → 0and defining  δ(t)3,d = δ(t)2,d�1 + δ(t)1,d + 1/2ηt�z(t)[d]e−iφ�β(t)[d]� +�z∞[d]�δ(t)1,d + 1/2ηt�z(t)[d]e−iφ�β(t)[d]�→ 0.

Denote  ∆d = φ�β∞[d] − φ�z∞[d]. Additionally, from the assumption in the theorem, we have eiφ�β(t)[d] → eiφ�β∞[d], hence there exists  δ(t)4,d → 0 such that eiφ�β(t)[d]−iφ�z∞[d] = ei∆d(1 + δ(t)4,d).

Now, from the above equation, for any  t0 and t ≥ t0, we derive the updates for  |�β(t)[d]|,

image

(since  p(t), δ(t)3,d → 0); in (b) we defined  δ(t)6,d = 1/2δ(t)∗4,d ei∆d + 1/2δ(t)3,de−i∆d + δ(t)5,d → 0; (c)is obtained by iterating over t; and (d) follows from using  (1 + x) ≤ exp(x).

If possible, let  cos(∆d) = −2ϵ < 0. Since |δ(t)6,d| → 0, and for finite step sizes  ηtp(t) → 0, ∃t0 suchthat for all  t ≥ t0, |δ(t)6,d| < ϵ|�z∞[d]| and exp�−4ϵ|�z∞[d]|ηtp(t)�≤ 1. From eq. (82), we now have

image

Finally, for any finite step sizes and finite  t0, we have  |�β(t0)[d]|2 < ∞and this creates a contradiction since the LHS in the above equation diverges,  |�β(t+1)[d]|2 → ∞. Hence, in order for the updates in eq. (82) to lead to a divergent  |�β(t+1)[d]|, we necessarily require that

image

This completes the proof of the lemma.

In this appendix we prove the lemmas in Section 5 that compute the form of induced bias of linear networks in the space of predictors. Recall that for linear predictors parameterized as  β = P(w), RP(β) = minw:P(w)=β∥w∥22.

Lemma 5. For fully connected networks of any depth L > 0,

image

Proof. Recall that for fully connected networks of any depth L > 0 with parameters w = [wl ∈ RDl−1×Dl]Ll−1, the equivalent linear predictor given by  Pfull(w) = w1w2 . . . wL.

image

where (a) follows as arithmetic mean is greater than the geometric mean.

image

This ensures that  Pfull(w) = w1w2 . . . wL = β and ∥w∥22 = L∥β∥2/L2 , and hence

image

Combining eq. (83) and eq. (84), we get  RPfull(β) = L∥β∥2/L2

The proofs of the lemmas for computing  RP(w)for diagonal and convolutional networks are similar to those of fully connected network.

Lemma 6. For a depth–L diagonal network with parameters  w = [wl ∈ RD]Ll−1, we have

image

Proof. Recall that for an L–layer linear diagonal networks with parameters  w = [wl ∈ RD]Ll−1, theequivalent linear predictor is given by  Pdiag(w) = diag(w1)diag(w2) . . . diag(wL−1)wL.

Let  w⋆(β) = [w⋆l (β)]Ll=1be the minimizer of  minw:Pdiag(w)=β∥w∥22, so that  β = Pdiag(w⋆(β))and  RPdiag(β) = ∥w⋆(β)∥22. We then have,

image

where (a) again follows as arithmetic mean is greater than the geometric mean.

Similar to the case of fully connected networks, we now choose  w = [wl]that satisfies  Pdiag(w) = βand  ∥w∥22 = L∥β∥2/L2/L. This would ensure that,

image

We can check that these properties are satisfied by choosing w as follows: for  d = 0, 1, . . . D − 1, letw1[d] = sign(β(d)) |β(d)|1/L and wl[d] = |β(d)|1/L for l = 2, 3, . . . , L.

Combining this argument with eq. 85 concludes the proof.

For convolutional networks, the argument is the exactly the same as that for diagonal network adapted for complex vectors. Lemma 7. For a depth–L convolutional network with parameters  w = [wl ∈ RD]Ll−1, we have

image

Proof. Denote the Fourier basis coefficients of  wl ∈ RD and β = Pconv(w) ∈ RD in polar form as

image

where  |�wl|, |�β| ∈ RD+and  φ �wl, φ�β ∈ [0, 2π)Dare the vectors with magnitudes and phases, respec- tively, of  �wl, �β.

From Lemma 3, the Fourier basis representation of  β = Pconv(w)is given by

image

where we have overloaded the notation  Pdiagto denote the mapping of diagonal networks in complex vector fields, and  �w = [�wl]Ll=1. We thus have for  d = 0, 1, . . . , D − 1,

image

From orthonormality of discrete Fourier transformation, we have for all  w, ∥w∥22 = ∥�w∥22. Thus,

image

We can now adapt the proof of diagonal networks here. Let  �w⋆(β) = [�w⋆l (β) ∈ CD]Ll=1be the minimizer of  min �w:�β=Pdiag( �w)∥�w∥22, so that �β = Pdiag(�w⋆(β)) and RPconv(β) = ∥�w⋆(β)∥22, and

image

Similar to the diagonal networks, we can choose the parameters in the Fourier domain �w = [�wl ∈CD]to ensure that  Pdiag(�w) = �β and ∥�w∥22 = L∥�β∥2/L2/L as follows: for  d = 0, 1, . . . D − 1, let

image

Combining this with eq. 87 concludes the proof.

Theorem 11 (Stolz–Cesaro theorem, proof in Theorem 1.22 of Muresan [2009]). Assume that {ak}∞k=1and  {bk}∞k=1are two sequences of real numbers such that  {bk}∞k=1is strictly monotonic and diverging (i.e., monotone increasing with  bk → ∞, or monotone decreasing with  bk → −∞). Additionally, if  limk→∞ak+1−akbk+1−bk = Lexists, then  limk→∞akbk exists and is equal to L.


Designed for Accessibility and to further Open Science