It is well-known that training a neural network is, in the worst-case, NP-Hard (Blum & Rivest, 1992), and that the optimization problem can have exponentially many distinct local minima (Auer et al., 1996). Yet, in practice neural networks can be successfully trained with many more parameters than training samples. Indeed, gradient descent methods can find a globally optimal solution when using appropriate initialization schemes, such as Xavier’s initialization (Zhang et al., 2017). Understanding this phenomenon from a mathematically rigorous point of view has attracted a lot of recent interest in the research community.
Figure 1. A neural network satisfying our Assumption 2.1.
Jacot et al. (2018) showed that, for the limiting case of infinitely wide neural networks, training can be described precisely by the so-called ‘Neural Tangent Kernel’ (NTK). Motivated by this, a series of recent papers studied the convergence of gradient descent for over-parameterized neural nets. Despite being conceptually interesting, most of the existing results are either restricted to one-hidden-layer networks (Du et al., 2019b; Brutzkus et al., 2018; Li & Liang, 2018; Wu et al., 2019; Oymak & Soltanolkotabi, 2019; Ji & Telgarsky, 2020; Daniely, 2019), or require all the hidden layers to be simultaneously large (Allen-Zhu et al., 2019; Du et al., 2019a; Zou et al., 2018; Zou & Gu, 2019). In contrast, neural networks used in practice are typically only wide at the first layer(s), after which the width starts to decrease toward the output layer, see e.g. Figure 1.
This leads to an interesting open theoretical question: Do similar results also hold for this kind of ‘pyramidal’ network topology? Into this direction, Nguyen (2019) showed that the loss landscape of these networks has connected sublevel sets, and consequently it is geometrically well behaved. However, no algorithmic guarantees are provided, which means that gradient descent can still get trapped in a bad local minimum (Safran & Shamir, 2018; Yun et al., 2019).
In this paper, we close the gap and answer positively the question above. We show that gradient descend converges to a global minimum at a geometric rate, when the network consists of one wide hidden layer followed by a pyramidal topology. Our analysis considers the standard parametrization (as opposed to the NTK parameterization), and it assumes the widely employed Xavier’s initialization (Glorot & Bengio, 2010; He et al., 2015). We refer to (Chizat et al., 2019) for more details about the NTK regime, and to (Park et al., 2019) for a comparison between the empirical per-
Table 1. Recent progress on convergence of gradient descent for training neural networks. is the minimum Euclidean distance between any pair of data points;
is the smallest eigenvalue of a Gram matrix defined for 2-layer neural networks (Du et al., 2019b) which might depend on the number of samples
is the smallest eigenvalue of a Gram matrix defined recursively for L-layer neural networks (Du et al., 2019a);
is the margin of a limiting kernel (Ji & Telgarsky, 2020) which has been shown to scale as 1/ poly(N) for generic data (but can have a more favorable scaling under additional assumptions); R is defined in (Chen et al., 2019) as a constant such that the ball
contains a solution with training loss less than
is the initial point and m is the number of neurons per layer in their notation. Depending on the data assumption, Chen et al. (2019) show that the per-layer width scales as
can have more favorable scaling as
under the similar separability assumption as in (Ji & Telgarsky, 2020). Most of the prior work assume that the data lies on the unit sphere.
is defined in (13). In the last line of this table, we write
max(a, b, c).
formance of NTK and of the standard parameterization. In particular, the latter paper shows that the standard parameterization consistently outperforms the NTK one for several convolutional neural networks, especially those with a ‘bottom-heavy’ architecture, as analyzed in this paper. Our main results can be summarized as follows:
1. In Section 3.1, we show that gradient descent converges to a global optimum at a geometric rate for a class of pyramidal networks. More specifically, we provide a set of sufficient conditions on the initialization and the network topology that delivers such guarantees.
2. In Section 3.2, we specialize our results to the widely used Xavier’s initialization. In particular, we prove that the sufficient conditions for convergence are satisfied under a moderate over-parameterization condition: the width of the first layer needs to scale as
where N is the number of samples, d the input dimension, L the number of layers, X the training data, the smallest eigenvalue of the expected feature output at the first layer, and c a constant. Let us emphasize that only the first layer is required to be wide. The other layers follow a pyramidal structure and can have a constant (i.e., independent of N and d) number of
neurons. We also characterize the convergence rate, and we show that the number of iterations needed to achieve training loss scales as
3. In Section 3.3, we prove that under mild assumptions on the activation function and training data. Furthermore, for training data uniformly distributed on the sphere, we show that
is lower bounded by a numerical constant (independent of N and d). Thus, in this setting, the width of the first layer scales as
and the iteration complexity scales as
To clarify how the settings and results of our paper differ from previous works, we provide a comparison in Table 1: the key difference between this work and the prior work is that we consider an empirically relevant deep network topology with only one wide hidden layer. Our network parameterization and initialization are standard. Similarly to (Du et al., 2019a; Chen et al., 2019), we train all layers, while several other existing works train a subset of them. The recent papers (Chen et al., 2019; Ji & Telgarsky, 2020) require only a polylogarithmic layer width, but they make strong assumptions on the data separability and focus on other parameterizations (similar to the NTK (Jacot et al., 2018) or lazy regime (Chizat et al., 2019)) which are outperformed by the standard one with Xavier’s initialization.
We consider an L-layer neural network with activation function and parameters
, where
is the weight matrix at layer l. Given
and
, their
distance is given by
, where
denotes the Frobenius norm. Let
and
be respectively the training input and output, where N is the number of training samples, d is the input dimension and
is the output dimension (for consistency, we set
). Let
be the output of layer l, which is defined as
where and the activation function
is applied componentwise. Denote by
the pre-activation output, which is defined as
for
and
. Let
and
be obtained by concatenating the columns of
, respectively.
We are interested in minimizing the square loss given by
To do so, we consider the gradient descent (GD) update
where is the step size and
contains all parameters at step k.
In this paper, we consider neural networks with the following topological structure, see e.g. (Nguyen, 2019; Nguyen & Hein, 2017; 2018) and Figure 1.
Assumption 2.1 (Pyramidal network topology) Let
We make the following assumptions on the activation
Assumption 2.2 (Activation function) Fix and
. Let
satisfy that: (i)
, (ii)
, and (iii)
-Lipschitz.
As a concrete example, we consider a family of parameter- ized ReLU functions, smoothed out by a Gaussian kernel:
Figure 2. Plots of and various values of
Then, fulfills the conditions of Assumption 2.2. Furthermore,
uniformly approximates the Leaky-ReLU function over the real line, i.e.,
. These statements are proved in Lemma B.1 in the appendix. Figure 2 shows several examples of
as a uniform approximation of max(0.1x, x).
3.1. General Framework
First, let us introduce some notation for the singular values of the weight matrices at initialization:
where and
are the smallest resp. largest singular value of the matrix A. We define
as the smallest singular value of the output of the first hidden layer at initialization.
We make the following assumptions on the initialization.
Assumption 3.1 (Initial conditions)
Note that (6) and (7) involve the number of neurons at all layers (and hence the full network topology). At this point, we are ready to present our most general result. Its proof is presented in Section 4.1.
Theorem 3.2 Let the network satisfy Assumption 2.1, the activation function satisfy Assumption 2.2 and the initial conditions satisfy Assumption 3.1. Define
with Let the learning rate be
Then, the training loss vanishes at a geometric rate as
Furthermore, define
Then, the network parameters converge to a global mini- mizer at a geometric rate as
3.2. Xavier’s Initialization
Theorem 3.2 proves that gradient descent converges to a global minimum at a geometric rate under the rather general conditions (6)-(7). In this section, we specialize these results to the popular Xavier’s initialization. Let us define the following crucial quantity which appear in our bounds:
G
As the entries of that
. Thus,
can be interpreted as the expected Gram matrix with respect to the feature representation of the first layer.
Theorem 3.3 Let the activation function satisfy Assumption
2.2. Fix
and denote by c a large enough constant depending only on Let the widths of the neural network satisfy the
following conditions:
Assume that and consider Xavier’s initialization:
Let the learning rate satisfy
Then, the training loss vanishes and the network parameters converge to a global minimizer at a geometric rate as
with probability at least
Condition (14) provides a mild requirement on the speed at which the widths of the layers need to decrease. Condition (15) requires a lower bound on the size of the first layer, which is the widest. Let us emphasize that only the first layer needs to be over-parameterized: the widths of all the other layers can be set to constants that do not depend on the number of training samples or the input dimension.
Assume that (i) is lower bounded by a constant, (ii)
is a constant and (iii) the entries of X are of the same order as the entries of Y (this implies that
the same order as
). Then, the width
first layer scales as
where neglects logarithmic factors and the constant c depends only on the parameters
of the activation function. The convergence rate of gradient descent also depends on
and, under the assumptions mentioned above, the number of iterations needed to achieve
training loss scales as
The proof of Theorem 3.3 is provided in Appendix C. The high level idea is as follows. First, we bound the output of the network at initialization using Gaussian concentration arguments. Then, by following an approach similar to that of Theorem 3.2 in (Oymak & Soltanolkotabi, 2019), we show that, with high probability, . Finally, we bound the quantities
using results on the singular values of random Gaussian matrices. By combining these bounds, we show that conditions (6)-(7) are satisfied. Then, we compute the quantities
and
from Theorem 3.2 to get the corresponding statements of Theorem 3.3.
3.3. Lower Bound on λ∗
Theorem 3.3 shows that the amount of over-parameterization for the first layer depends on the quantity as given by (13). Note that, by definition,
only depends on the activation function
, the training data X, and hence N and d. In this section, we study how this quantity scales with respect to these variables. First, let us show that
under some mild conditions on
The proof of this result is contained in Appendix D.1.
Lemma 3.4 Assume there exists an index (i)
Further, assume that
is analytic and
for all
The family of activation functions (3) satisfies both Assumption 2.2 and the hypotheses of Lemma 3.4. Furthermore, the set of data points that do not satisfy the hypotheses of Lemma 3.4 has Lebesgue measure zero. It has been shown that the result of Lemma 3.4 also holds when the data points are not parallel and the activation function is analytic and not polynomial, see (Du et al., 2019a).
Lemma 3.4 shows that , but it does not provide information on how the lower bound scales with N and d. If we further assume that the data points are uniform on the sphere with radius
, then
is lower bounded by a constant independent of N and d.
Theorem 3.5 Let be a matrix whose rows are i.i.d. random vectors uniformly distributed on the sphere of radius
. Assume that (i)
is not a linear function, (ii)
is continuous and piecewise smooth, and (iii) for every
, there exists a polynomial
of finite degree such that
Then, there exists
that, for
for some constants that do not depend on N, d.
The lower bound (21) is tight (up to a constant). In fact,
and the RHS of (22) is equal to 1 when the rows of X have norm. Note also that the requirement that
is not a linear function is necessary. In fact, if
, then
and
as long as d < N. The other two conditions on
are purely technical, and they require smoothness as well as bounds on the growth of the derivatives of
. Note that the conditions of Theorem 3.5 are satisfied for the family of activation functions (3), as we verify in Appendix D.2.
If the rows of are uniform on the sphere of radius
is of order N, see e.g. Theorem 5.39 in (Vershynin, 2010). Thus, an immediate consequence of Theorem 3.5 is that, in the regime
, the requirement of Theorem 3.3 on the over-parameterization of the first layer is moderate:
scales as
. We also obtain an explicit bound on the convergence rate: the number of iterations needed to achieve
training loss scales as
4.1. General Framework (Theorem 3.2)
Let denote the Kronecker product, and let
Below, we frequently encounter situations where we need to evaluate the matrices
at specific iterations of gradient descent. To this end, we define the shorthands
and
We omit the parameter
and write just
when it is clear from the context.
We recall the following results from (Nguyen & Hein, 2018; 2017), which provide a PL-type inequality for the training objective. This is one of the key components in our proofs to show convergence of gradient descent to a global minimum.
Lemma 4.1 Let Assumption 2.1 hold. Then,
4. Let be such that
for every
Then every stationary point
for which
has full rank and all the weight matrices
rank is a global minimizer of
Note that the first two statements of Lemma 4.1 are just gradient computation which we state here for the purpose of the proof. The last two statements suggest that in order to show convergence of gradient descent to a global minimum, it suffices to (i) initialize all the matrices to be full rank and (ii) make sure that the dynamics of gradient descent stays inside the manifold of full-rank matrices. One way to do that is to keep track of the smallest singular values of these matrices and ensure that they are kept away from zero throughout training. This is the key idea that we will use to prove our Theorem 3.2.
The next lemmas serve as key building blocks for the proof (for the proof see Appendix B.2, B.3, B.4 and B.5).
Lemma 4.2 Let Assumption 2.2 hold. Then, for every
Furthermore, let , and
for some scalars
Then, for
We recall the following standard result from optimization, which is proved for completeness in Appendix B.6.
Lemma 4.3 Let be a twice continuously differentiable function. Let
be given, and assume that
for every z = x +
Proof of Theorem 3.2. We show by induction that, for every , the following holds:
� � � � �
where are defined in (4). Clearly, (27) holds for k = 0. Now, suppose that (27) holds for all iterations from 0 to k, and let us show the claim at iteration k + 1. For every
, we have by triangle inequality,
where in the last line we have used our induction hypothesis. Set . Then, the RHS of the previous expression can be written as
which, by using (6) and , is upper bounded by
By Weyl’s inequality, we obtain that
� � � � �
where in the first inequality we use Assumption 2.2 and in the last line we use (7). It follows that
So far, we have shown that the first three statements in (27) hold for k + 1. It remains to show that . To do so, define the shorthand
for the Jacobian of the network:
where
We first derive a Lipschitz constant of the gradient restricted to the line segment for
Then, by triangle inequality,
In the following, we bound each term in (28). We first note that, for
By following a similar chain of inequalities as done in the beginning, we obtain that, for
Note that, for a partitioned matrix , we have that
We now bound the Lipschitz constant of the Jacobian restricted to the segment By using (26) and (29),
with defined in (8). By using Lemma 4.3, we have that
which, by the induction hypothesis (27) and Assumption 2.2, is upper bounded by
So far, we have proven the hypothesis (27). It remains to show that the iterates also converge. From arguments similar to those used at the beginning of this proof, one can prove that
is a Cauchy sequence and that (12) holds (a detailed proof is given in Appendix B.7). Thus,
is a convergent sequence and there exists some
such that
By continuity,
is a global minimizer.
4.2. Lower Bound on λ∗ (Theorem 3.5)
For two matrices and B =
, we define their Khatri-Rao product as
For
, we denote the r-th Khatri-Rao power of A by
. Given a function
with bounded Gaussian measure, let
denote its r-th Hermite coefficient (see Appendix D.3 for some background on Hermite expansions).
The proof of Theorem 3.5 relies on two intermediate lemmas. The first one connects with the Khatri-Rao powers of X. This connection is stated in Lemma H.2 of (Oymak & Soltanolkotabi, 2019) for ReLU and softplus activation functions. As a fully rigorous proof is missing in (Oymak & Soltanolkotabi, 2019), we provide it in Appendix D.4.
Lemma 4.4 Let where
for all
Assume that (i)
is continuous and piecewise smooth and (ii) for every
, there exists a polynomial
of finite degree such that
be defined as in (13). Then,
Here, “=” is understood in the sense of uniform convergence, that is, for every there exists a sufficiently large
The second intermediate lemma lower bounds the smallest singular value of , when the rows of X are uniformly distributed on the sphere.
Lemma 4.5 Let be a matrix whose rows are i.i.d. random vectors uniformly distributed on the sphere of radius
. Fix an integer
. Then, there exists
such that, for
with probability for some constant
Let us emphasize that the constants do not depend on N and d, but they can depend on the integer r. At this point, we are ready to prove Theorem 3.5.
Proof of Theorem 3.5. As is not linear, there exists an integer
such that
. Thus, Lemma 4.5 implies that
Furthermore, by Lemma 4.4, there exists
Note that Thus by Weyl’s inequality, we get
which completes the proof.
In the rest of this section, we provide a few comments on Lemma 4.5, whose proof is contained in Appendix D.5. A result similar to Lemma 4.5 is shown in Corollary 7.5 of (Soltanolkotabi et al., 2018) for the special case r = 2: it is assumed that is a matrix with i.i.d. N(0, 1) entries and that
, with
a sufficiently small constant; then, it is proved that, with high probability,
, for some constant c. Here, we consider a different data distribution (the data points are uniformly distributed on a sphere) and, most importantly, we provide a bound on higher order Khatri-Rao powers. Consequently, we give a lower bound on
for a wider class of activation functions
. In fact, by showing Lemma 4.5 for r = 2, we obtain a non-trivial bound for
activation functions
The proof strategy is similar to that of Corollary 7.5 in (Soltanolkotabi et al., 2018): it builds on results from (Adamczak et al., 2011), in order to bound the smallest singular value of a random matrix whose rows are i.i.d. and have bounded sub-exponential norm. For r = 2, the bound on the sub-exponential norm follows from a general version of the Hanson-Wright inequality proved in (Adamczak, 2015). Here, the main technical challenge is to obtain a tight bound on the sub-exponential norm for r > 2. To solve this issue, we study the tails of homogenous tetrahedral polynomials of degree r whose variables are given by the components of
, and we apply a concentration inequality for non-Lipschitz functions with bounded derivatives of higher order (Adamczak & Wolff, 2015; Bobkov et al., 2019).
As a final remark, let us point out that, if , then
. Lemma 4.5 shows that
, as long as
. Thus, it remains an open problem to characterize the behavior of
in the regime
In this paper, we show that one wide layer followed by a pyramidal topology suffices for gradient descent to converge to a global minimum with a geometric rate. Our analysis is based on a PL-type inequality, which holds for a class of deep pyramidal networks in the manifold of full-rank weight matrices. The crux of the argument is to bound the smallest singular value of the feature output at the first layer. This makes our analysis to some extent simpler than existing approaches, which study the recursively-defined L-th layer Gram matrix (Du et al., 2019a), or the full Jacobian of the network (Oymak & Soltanolkotabi, 2019).
The authors would like to thank Daniel Soudry and Mahdi Soltanolkotabi for the helpful discussions, Jan Maas for pointing out reference (Dolbeault et al., 2014), and Marius Kloft, Matthias Hein and Quoc Dinh Tran for proofreading portions of a prior version of this paper. M. Mondelli was partially supported by the 2019 Lopez-Loreta Prize.
Abramowitz, M. Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover Publications, 1974.
Adamczak, R. A note on the hanson-wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20, 2015.
Adamczak, R. and Wolff, P. Concentration inequalities for non-lipschitz functions with bounded derivatives of higher order. Probability Theory and Related Fields, 162 (3-4):531–586, 2015.
Adamczak, R., Litvak, A. E., Pajor, A., and Tomczak- Jaegermann, N. Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling. Constructive Approximation, 34(1): 61–88, 2011.
Aida, S. and Stroock, D. Moment estimates derived from poincaré and logarithmic sobolev inequalities. Mathematical Research Letters, 1(1):75–86, 1994.
Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In ICML, 2019.
Auer, P., Herbster, M., and Warmuth, M. Exponentially many local minima for single neurons. In NIPS, 1996.
Blum, A. L. and Rivest, R. L. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117– 127, 1992.
Bobkov, S. G., Götze, F., and Sambale, H. Higher order con- centration of measure. Communications in Contemporary Mathematics, 21(03), 2019.
Boucheron, S., Lugosi, G., and Massart, P. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
Brutzkus, A., Globerson, A., Malach, E., and Shalev- Shwartz, S. Sgd learns over-parameterized networks that provably generalize on linearly separable data. In ICLR, 2018.
Chen, Z., Cao, Y., Zou, D., and Gu, Q. How much over- parameterization is sufficient to learn deep ReLU networks?, 2019. arXiv:1911.12360.
Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. In NIPS, 2019.
Daniely, A. Neural networks learning and memorization with (almost) no over-parameterization, 2019. arXiv:1709.06838.
Dolbeault, J., Esteban, M. J., Kowalczyk, M., and Loss, M. Sharp interpolation inequalities on the sphere: new methods and consequences. In Partial Differential Equations: Theory, Control and Approximation, pp. 225–242. Springer, 2014.
Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In ICML, 2019a.
Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In ICLR, 2019b.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. ICML, 2010.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In CVPR, 2015.
Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In NIPS, 2018.
Ji, Z. and Telgarsky, M. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. In ICLR, 2020.
Li, D., Ding, T., and Sun, R. Over-parameterized deep neural networks have no strict local minima for any continuous activations, 2018. arXiv:1812.11039.
Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NIPS, 2018.
Mityagin, B. The zero set of a real analytic function, 2015. arXiv:1512.07276.
Nguyen, Q. On connected sublevel sets in deep learning. In ICML, 2019.
Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. In ICML, 2017.
Nguyen, Q. and Hein, M. Optimization landscape and expressivity of deep CNNs. In ICML, 2018.
Oymak, S. and Soltanolkotabi, M. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks, 2019. arXiv:1902.04674.
Park, D., Sohl-Dickstein, J., Le, Q., and Smith, S. The effect of network width on stochastic gradient descent and generalization: an empirical study. In ICML, 2019.
Safran, I. and Shamir, O. Spurious local minima are com- mon in two-layer ReLU neural networks. In ICML, 2018.
Soltanolkotabi, M., Javanmard, A., and Lee, J. D. Theo- retical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.
Stewart, G. W. Perturbation theory for the singular value decomposition. Technical Report, 1990.
Vershynin, R. Introduction to the non-asymptotic analysis of random matrices, 2010. arXiv:1011.3027.
Wu, X., Du, S. S., and Ward, R. Global convergence of adap- tive gradient methods for an over-parameterized neural network, 2019. arXiv:1902.07111.
Yun, C., Sra, S., and Jadbabaie, A. Small nonlinearities in activation functions create bad local minima in neural networks. In ICLR, 2019.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires re-thinking generalization. In ICLR, 2017.
Zou, D. and Gu, Q. An improved analysis of training over- parameterized deep neural networks. In NIPS, 2019.
Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gra- dient descent optimizes over-parameterized deep ReLU networks, 2018. arXiv:1811.08888.
Proposition A.1 Letfor a nonnegative function
Proof: Let every x, it holds
which implies
Proposition A.2 Let
Proof: Let A , . . . , a
, . . . , b
Lemma A.3 (Zeros of analytic functions, see e.g. (Mityagin, 2015)) If is a real analytic function which is not identically zero then the set
has Lebesgue measure zero.
Proposition A.4 (Weyl’s inequality, see e.g. (Stewart, 1990)) Let with
and
Lemma A.5 (Singular values of random gaussian matrices, see e.g. (Vershynin, 2010)) Let be a random matrix with
Then for every
we have that, with probability at least
Theorem A.6 (Matrix Chernoff) Let be a sequence of independent, random, symmetric matrices. Assume that
In the following, we frequently use a basic inequality, namely, for every and
, see Proposition A.2 for the proof.
Lemma B.1 Let be given as in (3). Then,
1. is real analytic.
2.
3.
4. -Lipschitz.
5. lim
Proof: Let be the CDF of the standard normal distribution. Then, after some manipulations, we have that
1. Since is known as an entire function (i.e. analytic everywhere), it follows from (31) that
is analytic on R.
2. Note that Thus, after some simplifications, we have that
3. It is easy to check that . Moreover,
is 1-Lipschitz and thus
4. We have that
B.2. Proof of (23) in Lemma 4.2
We prove by induction on l. Note that the lemma holds for l = 1 since
where the second step follows from our assumption on Now suppose the lemma holds for
It is easy to verify that it also holds for l. Indeed,
Note that for l = L one can skip the first equality above, as there is no activation at the output layer.
B.3. Proof of (25) in Lemma 4.2
We first prove the following intermediate result.
Lemma B.2 Let be 1-Lipschitz and
for every
Let
Let
Then, for every
Here, we denote
Applying Lemma B.2 to the output layer yields:
B.4. Proof of (24) in Lemma 4.2
B.5. Proof of (26) in Lemma 4.2
We start by showing the following intermediate result.
Lemma B.3 Let be 1-Lipschitz, and let
and
hold for every
Let
Then, for every
Here, we denote
Proof: For every
Note that for t = l the product terms inside brackets are inactive, and thus
follows from Lemma 4.1 that
The following inequality holds
where the second inequality follows from (23) and To prove the lemma, we will prove that, for every
Suppose that it holds for and we want to show it for t. Then,
≤ ∥
where the last line follows by plugging the bound offrom the induction assumption.
Proof of (26) in Lemma 4.2. Let
≤
B.6. Proof of Lemma 4.3
Let g
B.7. Proof that is a Cauchy Sequence
Let us fix any We need to show that there exists r > 0 such that for every
trivial, so we assume w.l.o.g. that i < j. Then, the following chain of inequalities hold
Note that and thus there exists a sufficiently large r such that
This shows that
is a Cauchy sequence, and hence convergent to some
By continuity,
is a global minimizer. The rate of convergence is
Before presenting the proof of the convergence result under Xavier’s initialization in Appendix C.3, let us state two helpful lemmas. The first lemma bounds the output of the network at initialization using standard Gaussian concentration and it is proved in Appendix C.1.
Lemma C.1 Let be 1-Lipschitz, and consider Xavier’s initialization scheme:
Assume that
with probability at least
The second lemma identifies sufficient conditions on so that the quantity
is bounded away from zero. The proof is similar to that of Theorem 3.2 of (Oymak & Soltanolkotabi, 2019) (see Section 6.8 in their appendix), and we provide it in Appendix C.2.
Lemma C.2 Let for all
. Define also
with probability at least
C.1. Proof of Lemma C.1
It is straightforward to show the following inequality.
Lemma C.3 Let for every
Let
every
Proof:
where the first inequality follows from our assumption on , and the last equality follows from the fact that
Proof of Lemma C.1. In the following, we write
to denote a sub-gaussian random variable with mean zero and variance proxy
It is well-known that if
then for every
We prove by induction on , then it holds w.p.
It follows that By Gaussian concentration inequality, we have w.p. at least
Thus the hypothesis holds for l = 1. Now suppose it holds for that is, we have w.p.
and thus ∥
Thus the above events hold w.p. at least in which case we get
Thus, the hypothesis also holds for
C.2. Proof of Lemma C.2
Let be a random matrix defined as
Thus, by using our assumption on
Let Applying Matrix Chernoff bound (Theorem A.6) to the sum of random p.s.d. matrices,
we obtain that for every
Substituting
Thus, as long as is large enough, in particular,
we have w.p. at least
This implies that Plugging this into the above statement yields for every
it holds w.p. at least
Lastly, since
C.3. Proof of Theorem 3.3
From known results on random Gaussian matrices, we have, w.p.
where the last inequality in each line follows from (14) and from . From the definition (4), we get
Similarly, for any , we have, w.p.
Furthermore, by Lemma C.1 and C.2, we have, w.p.
as long as the width of the first layer satisfies the following condition from Lemma C.2:
for a suitable constant c. From (39), we get a lower bound on the LHS of (6); and from (37), (38) and (40) we get an upper bound on the RHS of (6). Thus in order to satisfy the initial condition (6), it suffices to have (41) and
which together leads to condition (15).
To satisfy the initial condition (7), it suffices to have in addition to (6) that which is fulfilled for
One can verify that this is true under the condition (15). Indeed, by (15) we have for some large enough
constant c, which combined with (22) leads to
As a result, the initial conditions (6)-(7) are satisfied and we can apply Theorem 3.2. Let us now bound the quantities and
defined in (8). Note that
and
It is easy to see that the upper bound of dominates that of
Thus to satisfy the learning rate condition (9) from Theorem 3.2, it suffices to set
to be smaller than the inverse of the upper bound on
, which leads to condition (16).
From the lower bound of in (43) and (10), we immediately get the convergence of the loss as stated in (17). Similarly, one can compute the quantity
defined in (11) to get the convergence of the parameters as stated in (18).
D.1. Proof of Lemma 3.4
We first show an intermediate result in Lemma D.1. The idea of using a Vandermonde matrix for proving full-rank features is similar to (Li et al., 2018).
Lemma D.1 Let the training data and the activation function satisfy the conditions of Lemma 3.4. Then the set has Lebesgue measure zero.
Proof: Let be a function defined as
By hypothesis, is real analytic, and thus f is a real analytic function. It follows from Lemma A.3 that either f is a constant zero function, or the zeros of f has Lebesgue measure zero in which case the result of the lemma follows immediately. Thus it remains to check that f is not zero everywhere. In the following, we prove this statement by induction on N. First, the statement is true for N = 1 since one can pick
for some
with
so that
Now, suppose that the statement holds for
, and we want to prove it for N.
Indeed, assume by contradiction that One can easily check that the high order derivatives of
are given by
Since f is constant zero, it holds that
By induction assumption, one can pick
vectors
so that the following matrix
has full rank. One observes that E is a submatrix contained in the last columns of
, and so the last
of
are linearly independent. This combined with the fact that
for all W implies that the set of vectors associated to the 1st column of
for all possible values of
must lie on the same linear subspace spanned by the last
columns of
which has been fixed by the choice of
as above. Pick
and denote the first column of
Then it follows from above that lie on the same linear subspace of dimension
and thus the following matrix
must have rank at most
Note that the middle one is a Vandermonde matrix and thus,
By our assumption, is a contradiction.
Proof of Lemma 3.4. Recall Let
be defined as
for every
Let P denote the Gaussian measure in
with mean zero and covariance matrix
By definition,
Suppose by contradiction that
Then there exist coefficients
not all zero so that
which implies
and thus by Proposition A.1,
Let independent random vectors drawn from
denote the joint distribution over W. By the union bound,
Let Since not all
’s are zero,
This combined with the previous inequality leads to Moreover, by Lemma D.1, the set of W for which det(A) = 0 has Lebesgue measure zero. As
has a density w.r.t. the Lebesgue measure, this implies
which is a contradiction. Thus
D.2. Activation Function (3) Satisfies the Conditions of Theorem 3.5.
Clearly is continuous and smooth according to Lemma B.1. Moreover, we have
Next, we verify the polynomial bounds on the derivatives of Then from Lemma B.1 we have:
D.3. Background on Hermite Expansions
The normalized probabilist’s hermite polynomials are given by
where “=” is understood in the sense of convergence, and
-th Hermite coefficient given by
Lemma D.2, D.3 and D.4 state some useful properties of Hermite polynomials, see (Abramowitz, 1974).
Lemma D.2 We have for every
Lemma D.3 (Cramer’s inequality)
Note that the constant 2 above is not optimal compared to the original version, but it suffices for the purpose of this paper.
Lemma D.4
Lemma D.5 Let be given. For every
, we assume that
for some polynomial
finite degree. Then,
Proof: By Lemma D.4 we haveIntegrating by parts
As does not grow faster than a polynomial, the presence of
makes sure that the first term vanishes. Thus,
Similarly, we can integrate by parts the RHS to higher orders. From our assumption, it is easy to see that for every there exists a polynomial P of finite degree such that
Thus, the presence of
always make the first term of the integration by parts vanish. Finally, we get
which implies the desired result.
Corollary D.6 Letfor some polynomial
of finite degree, for every
Then, there exists a constant C such that, for every
Suppose that P has degree n and is given by Using Lemma D.5 for k = 2, we get
where we bounded the sum by a constant C. Note here that ’s are being treated as constants of
Lemma D.7 Let be such that
Then, for every
Proof: Let be given finite variables. Then,
Thus, it follows that
The generating function of the normalized probabilist’s Hermite polynomials is
Plugging this into the above formula gives
Note that the LHS can be rewritten as
By Lebesgue’s dominated convergence theorem, we get
Equating the coefficients on both sides gives the desired result.
It remains to justify the application of the dominated convergence theorem. Since are obtained from the Taylor series of exponential functions in f, it follows that
By abuse of notation, let
the standard Gaussian measure in
Let
By a similar calculation, we obtain
Thus ’s are dominated by
It suffices to show that g is integrable w.r.t. the measure
Note that
The integral of is given by
D.4. Proof of Lemma 4.4
Let From the definition of
By Fubini’s theorem, we have
Here, we note that
It remains to justify the application of Fubini’s Theorem. It suffices to show that
Indeed, the LHS is upper bounded by
where the equality follows from Lemma D.7. By Corollary D.6, the infinite series on the RHS can be upper bounded, up to some constant, by the series which is known to be convergent.
D.5. Proof of Lemma 4.5
Recall that, given a random variable , its sub-exponential norm is defined as
Furthermore, for a centered random vector , its sub-exponential norm is defined as
We start by stating two intermediate results that will be useful for the proof.
Lemma D.8 Consider an r-indexed matrix such that
whenever
for some
be a random vector in
uniformly distributed on the unit sphere, and define
where C is a numerical constant.
If x is uniformly distributed on the unit sphere, then it satisfies the logarithmic Sobolev inequality with constant 2/d, see Corollary 1.1 in (Dolbeault et al., 2014). Thus, Lemma D.8 follows from Theorem 1.14 in (Bobkov et al., 2019), where (see (1.18) in (Bobkov et al., 2019)) and the function f is a homogeneous tetrahedral polynomial of degree r.
for any , for any smooth bounded function
, and for some constant C independent of g and p. It is known that if x satisfies the logarithmic Sobolev inequality with constant
holds with
Stroock, 1994) or Theorem 3.4 of (Adamczak & Wolff, 2015). However, these results refer to random vectors (and test functions g) defined on
. Here, we are interested in a concentration result for vectors uniformly distributed on the sphere, and this case is analyzed in (Bobkov et al., 2019).
The second intermediate lemma is stated below and it follows from Theorem 5.1 of (Adamczak et al., 2011) (this is also basically a restatement of Lemma F.2 of (Soltanolkotabi et al., 2018)).
Lemma D.9 Let be independent sub-exponential random vectors with
Then,
and is a numerical constant.
At this point, we are ready to provide the proof of Lemma 4.5.
Proof of Lemma 4.5. The first step is to drop columns from . Define
and note that, for
, the i-th row of K is given by the r-th Kronecker power of
and index the columns of K as
, with
for all
, so that the element of K in row i and column
is given by
. Consider the matrix
obtained by keeping only the columns of K where the indices
are all different. Note that
columns as
is obtained by dropping columns from K, then
The second step is to bound the sub-exponential norm of the rows of . Let
be the row of
corresponding to the data point
. Let us emphasize that, from now till the end of the proof, we denote by
the i-th element of the vector x (and not the i-th training sample, which is a vector in
). Let A be the set of r-indexed matrices
such that
and
whenever
for some
. Then, by definition of sub-exponential norm of a vector, we have that
where in (a) we use Cauchy-Schwarz inequality and in (b) we use that . Consequently,
Note that Lemma D.8 considers a vector x uniformly distributed on the unit sphere, while in (57) x is uniformly distributed on the sphere with radius) can be re-written as
By definition (46) of sub-exponential norm, we obtain that
where is a numerical constant.
The third step is to bound the Euclidean norm of the rows of . Recall that
is obtained by keeping the elements of
where the indices are all different. As for the upper bound, we have that
As for the lower bound, we have that
since entries and each of these entries is at most
that
is a polynomial in d of degree r whose leading coefficient is 1. Thus,
for some constant that depends only on r. Consequently,
As x is uniform on the sphere of radius, we can write
where g , . . . , g
Recall that the norm of a vector is a 1-Lipschitz function of the components of the vector. Thus,
Furthermore,
As a result,
for some constant depending on r (but not on d). Consequently, with probability at least
, we have that
An application of Theorem 5.8 of (Boucheron et al., 2013) gives that, for any t > 0,
Furthermore, we have that, for any
where the first passage follows from Jensen’s inequality. By taking , we obtain
which, combined with (71), leads to
where is the constant in (63) and
is a constant that depends only on r (and not on d). Since the Gaussian distribution is symmetric, we also have that
By combining (63), (70) and (75), we obtain that, with probability at least
Hence, by doing a union bound on the rows of , we have that, with probability at least
where denotes the
The last step is to apply the results of Lemma D.9. Let be such that
with
By applying Lemma D.9 and using the bounds (60) and (78), we have that
By combining (80), (77) and (83), we obtain that
which, together with (54), gives the desired result.