Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology

2020·Arxiv

Abstract

Abstract

A recent line of research has provided convergence guarantees for gradient descent algorithms in the excessive over-parameterization regime where the widths of all the hidden layers are re- quired to be polynomially large in the number of training samples. However, the widths of practical deep networks are often only large in the first layer(s) and then start to decrease towards the output layer. This raises an interesting open question whether similar results also hold under this empirically relevant setting. Existing theoretical insights suggest that the loss surface of this class of networks is well-behaved, but these results usually do not provide direct algorithmic guarantees for optimization. In this paper, we close the gap by showing that one wide layer followed by pyramidal deep network topology suffices for gradient descent to find a global minimum with a geometric rate. Our proof is based on a weak form of Polyak-Lojasiewicz inequality which holds for deep pyramidal networks in the manifold of full-rank weight matrices.

1. Introduction

It is well-known that training a neural network is, in the worst-case, NP-Hard (Blum & Rivest, 1992), and that the optimization problem can have exponentially many distinct local minima (Auer et al., 1996). Yet, in practice neural networks can be successfully trained with many more parameters than training samples. Indeed, gradient descent methods can find a globally optimal solution when using appropriate initialization schemes, such as Xavier’s initialization (Zhang et al., 2017). Understanding this phenomenon from a mathematically rigorous point of view has attracted a lot of recent interest in the research community.

Figure 1. A neural network satisfying our Assumption 2.1.

Jacot et al. (2018) showed that, for the limiting case of infinitely wide neural networks, training can be described precisely by the so-called ‘Neural Tangent Kernel’ (NTK). Motivated by this, a series of recent papers studied the convergence of gradient descent for over-parameterized neural nets. Despite being conceptually interesting, most of the existing results are either restricted to one-hidden-layer networks (Du et al., 2019b; Brutzkus et al., 2018; Li & Liang, 2018; Wu et al., 2019; Oymak & Soltanolkotabi, 2019; Ji & Telgarsky, 2020; Daniely, 2019), or require all the hidden layers to be simultaneously large (Allen-Zhu et al., 2019; Du et al., 2019a; Zou et al., 2018; Zou & Gu, 2019). In contrast, neural networks used in practice are typically only wide at the first layer(s), after which the width starts to decrease toward the output layer, see e.g. Figure 1.

This leads to an interesting open theoretical question: Do similar results also hold for this kind of ‘pyramidal’ network topology? Into this direction, Nguyen (2019) showed that the loss landscape of these networks has connected sublevel sets, and consequently it is geometrically well behaved. However, no algorithmic guarantees are provided, which means that gradient descent can still get trapped in a bad local minimum (Safran & Shamir, 2018; Yun et al., 2019).

In this paper, we close the gap and answer positively the question above. We show that gradient descend converges to a global minimum at a geometric rate, when the network consists of one wide hidden layer followed by a pyramidal topology. Our analysis considers the standard parametrization (as opposed to the NTK parameterization), and it assumes the widely employed Xavier’s initialization (Glorot & Bengio, 2010; He et al., 2015). We refer to (Chizat et al., 2019) for more details about the NTK regime, and to (Park et al., 2019) for a comparison between the empirical per-

Table 1. Recent progress on convergence of gradient descent for training neural networks. is the minimum Euclidean distance between any pair of data points; is the smallest eigenvalue of a Gram matrix defined for 2-layer neural networks (Du et al., 2019b) which might depend on the number of samples is the smallest eigenvalue of a Gram matrix defined recursively for L-layer neural networks (Du et al., 2019a); is the margin of a limiting kernel (Ji & Telgarsky, 2020) which has been shown to scale as 1/ poly(N) for generic data (but can have a more favorable scaling under additional assumptions); R is defined in (Chen et al., 2019) as a constant such that the ball contains a solution with training loss less than is the initial point and m is the number of neurons per layer in their notation. Depending on the data assumption, Chen et al. (2019) show that the per-layer width scales as can have more favorable scaling as under the similar separability assumption as in (Ji & Telgarsky, 2020). Most of the prior work assume that the data lies on the unit sphere. is defined in (13). In the last line of this table, we write max(a, b, c).

formance of NTK and of the standard parameterization. In particular, the latter paper shows that the standard parameterization consistently outperforms the NTK one for several convolutional neural networks, especially those with a ‘bottom-heavy’ architecture, as analyzed in this paper. Our main results can be summarized as follows:

1. In Section 3.1, we show that gradient descent converges to a global optimum at a geometric rate for a class of pyramidal networks. More specifically, we provide a set of sufficient conditions on the initialization and the network topology that delivers such guarantees.

2. In Section 3.2, we specialize our results to the widely used Xavier’s initialization. In particular, we prove that the sufficient conditions for convergence are satisfied under a moderate over-parameterization condition: the width of the first layer needs to scale as

where N is the number of samples, d the input dimension, L the number of layers, X the training data, the smallest eigenvalue of the expected feature output at the first layer, and c a constant. Let us emphasize that only the first layer is required to be wide. The other layers follow a pyramidal structure and can have a constant (i.e., independent of N and d) number of

neurons. We also characterize the convergence rate, and we show that the number of iterations needed to achieve training loss scales as

3. In Section 3.3, we prove that under mild assumptions on the activation function and training data. Furthermore, for training data uniformly distributed on the sphere, we show that is lower bounded by a numerical constant (independent of N and d). Thus, in this setting, the width of the first layer scales as and the iteration complexity scales as

To clarify how the settings and results of our paper differ from previous works, we provide a comparison in Table 1: the key difference between this work and the prior work is that we consider an empirically relevant deep network topology with only one wide hidden layer. Our network parameterization and initialization are standard. Similarly to (Du et al., 2019a; Chen et al., 2019), we train all layers, while several other existing works train a subset of them. The recent papers (Chen et al., 2019; Ji & Telgarsky, 2020) require only a polylogarithmic layer width, but they make strong assumptions on the data separability and focus on other parameterizations (similar to the NTK (Jacot et al., 2018) or lazy regime (Chizat et al., 2019)) which are outperformed by the standard one with Xavier’s initialization.

2. Problem Setup

We consider an L-layer neural network with activation function and parameters , where is the weight matrix at layer l. Given and , their distance is given by , where denotes the Frobenius norm. Let and be respectively the training input and output, where N is the number of training samples, d is the input dimension and is the output dimension (for consistency, we set ). Let be the output of layer l, which is defined as

where and the activation function is applied componentwise. Denote by the pre-activation output, which is defined as for and . Let and be obtained by concatenating the columns of , respectively.

We are interested in minimizing the square loss given by

To do so, we consider the gradient descent (GD) update

where is the step size and contains all parameters at step k.

In this paper, we consider neural networks with the following topological structure, see e.g. (Nguyen, 2019; Nguyen & Hein, 2017; 2018) and Figure 1.

Assumption 2.1 (Pyramidal network topology) Let

We make the following assumptions on the activation

Assumption 2.2 (Activation function) Fix and . Let satisfy that: (i) , (ii) , and (iii) -Lipschitz.

As a concrete example, we consider a family of parameter- ized ReLU functions, smoothed out by a Gaussian kernel:

Figure 2. Plots of and various values of

Then, fulfills the conditions of Assumption 2.2. Furthermore, uniformly approximates the Leaky-ReLU function over the real line, i.e., . These statements are proved in Lemma B.1 in the appendix. Figure 2 shows several examples of as a uniform approximation of max(0.1x, x).

3. Main Results

3.1. General Framework

First, let us introduce some notation for the singular values of the weight matrices at initialization:

where and are the smallest resp. largest singular value of the matrix A. We define

as the smallest singular value of the output of the first hidden layer at initialization.

We make the following assumptions on the initialization.

Assumption 3.1 (Initial conditions)

Note that (6) and (7) involve the number of neurons at all layers (and hence the full network topology). At this point, we are ready to present our most general result. Its proof is presented in Section 4.1.

Theorem 3.2 Let the network satisfy Assumption 2.1, the activation function satisfy Assumption 2.2 and the initial conditions satisfy Assumption 3.1. Define

with Let the learning rate be

Then, the training loss vanishes at a geometric rate as

Furthermore, define

Then, the network parameters converge to a global mini- mizer at a geometric rate as

3.2. Xavier’s Initialization

Theorem 3.2 proves that gradient descent converges to a global minimum at a geometric rate under the rather general conditions (6)-(7). In this section, we specialize these results to the popular Xavier’s initialization. Let us define the following crucial quantity which appear in our bounds:

G

As the entries of that . Thus, can be interpreted as the expected Gram matrix with respect to the feature representation of the first layer.

Theorem 3.3 Let the activation function satisfy Assumption

2.2. Fix

and denote by c a large enough constant depending only on Let the widths of the neural network satisfy the

following conditions:

Assume that and consider Xavier’s initialization:

Let the learning rate satisfy

Then, the training loss vanishes and the network parameters converge to a global minimizer at a geometric rate as

with probability at least

Condition (14) provides a mild requirement on the speed at which the widths of the layers need to decrease. Condition (15) requires a lower bound on the size of the first layer, which is the widest. Let us emphasize that only the first layer needs to be over-parameterized: the widths of all the other layers can be set to constants that do not depend on the number of training samples or the input dimension.

Assume that (i) is lower bounded by a constant, (ii) is a constant and (iii) the entries of X are of the same order as the entries of Y (this implies that the same order as ). Then, the width first layer scales as

where neglects logarithmic factors and the constant c depends only on the parameters of the activation function. The convergence rate of gradient descent also depends on and, under the assumptions mentioned above, the number of iterations needed to achieve training loss scales as

The proof of Theorem 3.3 is provided in Appendix C. The high level idea is as follows. First, we bound the output of the network at initialization using Gaussian concentration arguments. Then, by following an approach similar to that of Theorem 3.2 in (Oymak & Soltanolkotabi, 2019), we show that, with high probability, . Finally, we bound the quantities using results on the singular values of random Gaussian matrices. By combining these bounds, we show that conditions (6)-(7) are satisfied. Then, we compute the quantities and from Theorem 3.2 to get the corresponding statements of Theorem 3.3.

3.3. Lower Bound on λ∗

Theorem 3.3 shows that the amount of over-parameterization for the first layer depends on the quantity as given by (13). Note that, by definition, only depends on the activation function , the training data X, and hence N and d. In this section, we study how this quantity scales with respect to these variables. First, let us show that under some mild conditions on The proof of this result is contained in Appendix D.1.

Lemma 3.4 Assume there exists an index (i) Further, assume that is analytic and for all

The family of activation functions (3) satisfies both Assumption 2.2 and the hypotheses of Lemma 3.4. Furthermore, the set of data points that do not satisfy the hypotheses of Lemma 3.4 has Lebesgue measure zero. It has been shown that the result of Lemma 3.4 also holds when the data points are not parallel and the activation function is analytic and not polynomial, see (Du et al., 2019a).

Lemma 3.4 shows that , but it does not provide information on how the lower bound scales with N and d. If we further assume that the data points are uniform on the sphere with radius, then is lower bounded by a constant independent of N and d.

Theorem 3.5 Let be a matrix whose rows are i.i.d. random vectors uniformly distributed on the sphere of radius. Assume that (i) is not a linear function, (ii) is continuous and piecewise smooth, and (iii) for every , there exists a polynomial of finite degree such thatThen, there exists that, for

for some constants that do not depend on N, d.

The lower bound (21) is tight (up to a constant). In fact,

and the RHS of (22) is equal to 1 when the rows of X have norm. Note also that the requirement that is not a linear function is necessary. In fact, if , then and as long as d < N. The other two conditions on are purely technical, and they require smoothness as well as bounds on the growth of the derivatives of . Note that the conditions of Theorem 3.5 are satisfied for the family of activation functions (3), as we verify in Appendix D.2.

If the rows of are uniform on the sphere of radiusis of order N, see e.g. Theorem 5.39 in (Vershynin, 2010). Thus, an immediate consequence of Theorem 3.5 is that, in the regime , the requirement of Theorem 3.3 on the over-parameterization of the first layer is moderate: scales as . We also obtain an explicit bound on the convergence rate: the number of iterations needed to achieve training loss scales as

4. Sketch of the Proofs

4.1. General Framework (Theorem 3.2)

Let denote the Kronecker product, and let Below, we frequently encounter situations where we need to evaluate the matrices at specific iterations of gradient descent. To this end, we define the shorthands and We omit the parameter and write just when it is clear from the context.

We recall the following results from (Nguyen & Hein, 2018; 2017), which provide a PL-type inequality for the training objective. This is one of the key components in our proofs to show convergence of gradient descent to a global minimum.

Lemma 4.1 Let Assumption 2.1 hold. Then,

4. Let be such that for every Then every stationary point for which has full rank and all the weight matrices rank is a global minimizer of

Note that the first two statements of Lemma 4.1 are just gradient computation which we state here for the purpose of the proof. The last two statements suggest that in order to show convergence of gradient descent to a global minimum, it suffices to (i) initialize all the matrices to be full rank and (ii) make sure that the dynamics of gradient descent stays inside the manifold of full-rank matrices. One way to do that is to keep track of the smallest singular values of these matrices and ensure that they are kept away from zero throughout training. This is the key idea that we will use to prove our Theorem 3.2.

The next lemmas serve as key building blocks for the proof (for the proof see Appendix B.2, B.3, B.4 and B.5).

Lemma 4.2 Let Assumption 2.2 hold. Then, for every

Furthermore, let , and for some scalars Then, for

We recall the following standard result from optimization, which is proved for completeness in Appendix B.6.

Lemma 4.3 Let be a twice continuously differentiable function. Let be given, and assume that for every z = x +

Proof of Theorem 3.2. We show by induction that, for every , the following holds:

� � � � �

where are defined in (4). Clearly, (27) holds for k = 0. Now, suppose that (27) holds for all iterations from 0 to k, and let us show the claim at iteration k + 1. For every , we have by triangle inequality,

where in the last line we have used our induction hypothesis. Set . Then, the RHS of the previous expression can be written as

which, by using (6) and , is upper bounded by

By Weyl’s inequality, we obtain that

� � � � �

where in the first inequality we use Assumption 2.2 and in the last line we use (7). It follows that

So far, we have shown that the first three statements in (27) hold for k + 1. It remains to show that . To do so, define the shorthand for the Jacobian of the network: where

We first derive a Lipschitz constant of the gradient restricted to the line segment for Then, by triangle inequality,

In the following, we bound each term in (28). We first note that, for

By following a similar chain of inequalities as done in the beginning, we obtain that, for

Note that, for a partitioned matrix , we have that

We now bound the Lipschitz constant of the Jacobian restricted to the segment By using (26) and (29),

with defined in (8). By using Lemma 4.3, we have that

which, by the induction hypothesis (27) and Assumption 2.2, is upper bounded by

So far, we have proven the hypothesis (27). It remains to show that the iterates also converge. From arguments similar to those used at the beginning of this proof, one can prove that is a Cauchy sequence and that (12) holds (a detailed proof is given in Appendix B.7). Thus, is a convergent sequence and there exists some such that By continuity, is a global minimizer.

4.2. Lower Bound on λ∗ (Theorem 3.5)

For two matrices and B = , we define their Khatri-Rao product as For , we denote the r-th Khatri-Rao power of A by . Given a function with bounded Gaussian measure, let denote its r-th Hermite coefficient (see Appendix D.3 for some background on Hermite expansions).

The proof of Theorem 3.5 relies on two intermediate lemmas. The first one connects with the Khatri-Rao powers of X. This connection is stated in Lemma H.2 of (Oymak & Soltanolkotabi, 2019) for ReLU and softplus activation functions. As a fully rigorous proof is missing in (Oymak & Soltanolkotabi, 2019), we provide it in Appendix D.4.

Lemma 4.4 Let where for all Assume that (i) is continuous and piecewise smooth and (ii) for every , there exists a polynomial of finite degree such that be defined as in (13). Then,

Here, “=” is understood in the sense of uniform convergence, that is, for every there exists a sufficiently large

The second intermediate lemma lower bounds the smallest singular value of , when the rows of X are uniformly distributed on the sphere.

Lemma 4.5 Let be a matrix whose rows are i.i.d. random vectors uniformly distributed on the sphere of radius. Fix an integer . Then, there exists such that, for

with probability for some constant

Let us emphasize that the constants do not depend on N and d, but they can depend on the integer r. At this point, we are ready to prove Theorem 3.5.

Proof of Theorem 3.5. As is not linear, there exists an integer such that . Thus, Lemma 4.5 implies that

Furthermore, by Lemma 4.4, there exists

Note that Thus by Weyl’s inequality, we get

which completes the proof.

In the rest of this section, we provide a few comments on Lemma 4.5, whose proof is contained in Appendix D.5. A result similar to Lemma 4.5 is shown in Corollary 7.5 of (Soltanolkotabi et al., 2018) for the special case r = 2: it is assumed that is a matrix with i.i.d. N(0, 1) entries and that , with a sufficiently small constant; then, it is proved that, with high probability, , for some constant c. Here, we consider a different data distribution (the data points are uniformly distributed on a sphere) and, most importantly, we provide a bound on higher order Khatri-Rao powers. Consequently, we give a lower bound on for a wider class of activation functions . In fact, by showing Lemma 4.5 for r = 2, we obtain a non-trivial bound for activation functions

The proof strategy is similar to that of Corollary 7.5 in (Soltanolkotabi et al., 2018): it builds on results from (Adamczak et al., 2011), in order to bound the smallest singular value of a random matrix whose rows are i.i.d. and have bounded sub-exponential norm. For r = 2, the bound on the sub-exponential norm follows from a general version of the Hanson-Wright inequality proved in (Adamczak, 2015). Here, the main technical challenge is to obtain a tight bound on the sub-exponential norm for r > 2. To solve this issue, we study the tails of homogenous tetrahedral polynomials of degree r whose variables are given by the components of , and we apply a concentration inequality for non-Lipschitz functions with bounded derivatives of higher order (Adamczak & Wolff, 2015; Bobkov et al., 2019).

As a final remark, let us point out that, if , then . Lemma 4.5 shows that , as long as . Thus, it remains an open problem to characterize the behavior of in the regime

5. Concluding Remarks

In this paper, we show that one wide layer followed by a pyramidal topology suffices for gradient descent to converge to a global minimum with a geometric rate. Our analysis is based on a PL-type inequality, which holds for a class of deep pyramidal networks in the manifold of full-rank weight matrices. The crux of the argument is to bound the smallest singular value of the feature output at the first layer. This makes our analysis to some extent simpler than existing approaches, which study the recursively-defined L-th layer Gram matrix (Du et al., 2019a), or the full Jacobian of the network (Oymak & Soltanolkotabi, 2019).

Acknowledgements

The authors would like to thank Daniel Soudry and Mahdi Soltanolkotabi for the helpful discussions, Jan Maas for pointing out reference (Dolbeault et al., 2014), and Marius Kloft, Matthias Hein and Quoc Dinh Tran for proofreading portions of a prior version of this paper. M. Mondelli was partially supported by the 2019 Lopez-Loreta Prize.

References

Abramowitz, M. Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover Publications, 1974.

Adamczak, R. A note on the hanson-wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20, 2015.

Adamczak, R. and Wolff, P. Concentration inequalities for non-lipschitz functions with bounded derivatives of higher order. Probability Theory and Related Fields, 162 (3-4):531–586, 2015.

Adamczak, R., Litvak, A. E., Pajor, A., and Tomczak- Jaegermann, N. Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling. Constructive Approximation, 34(1): 61–88, 2011.

Aida, S. and Stroock, D. Moment estimates derived from poincaré and logarithmic sobolev inequalities. Mathematical Research Letters, 1(1):75–86, 1994.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In ICML, 2019.

Auer, P., Herbster, M., and Warmuth, M. Exponentially many local minima for single neurons. In NIPS, 1996.

Blum, A. L. and Rivest, R. L. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117– 127, 1992.

Bobkov, S. G., Götze, F., and Sambale, H. Higher order con- centration of measure. Communications in Contemporary Mathematics, 21(03), 2019.

Boucheron, S., Lugosi, G., and Massart, P. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.

Brutzkus, A., Globerson, A., Malach, E., and Shalev- Shwartz, S. Sgd learns over-parameterized networks that provably generalize on linearly separable data. In ICLR, 2018.

Chen, Z., Cao, Y., Zou, D., and Gu, Q. How much over- parameterization is sufficient to learn deep ReLU networks?, 2019. arXiv:1911.12360.

Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. In NIPS, 2019.

Daniely, A. Neural networks learning and memorization with (almost) no over-parameterization, 2019. arXiv:1709.06838.

Dolbeault, J., Esteban, M. J., Kowalczyk, M., and Loss, M. Sharp interpolation inequalities on the sphere: new methods and consequences. In Partial Differential Equations: Theory, Control and Approximation, pp. 225–242. Springer, 2014.

Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In ICML, 2019a.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In ICLR, 2019b.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. ICML, 2010.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In CVPR, 2015.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In NIPS, 2018.

Ji, Z. and Telgarsky, M. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. In ICLR, 2020.

Li, D., Ding, T., and Sun, R. Over-parameterized deep neural networks have no strict local minima for any continuous activations, 2018. arXiv:1812.11039.

Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NIPS, 2018.

Mityagin, B. The zero set of a real analytic function, 2015. arXiv:1512.07276.

Nguyen, Q. On connected sublevel sets in deep learning. In ICML, 2019.

Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. In ICML, 2017.

Nguyen, Q. and Hein, M. Optimization landscape and expressivity of deep CNNs. In ICML, 2018.

Oymak, S. and Soltanolkotabi, M. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks, 2019. arXiv:1902.04674.

Park, D., Sohl-Dickstein, J., Le, Q., and Smith, S. The effect of network width on stochastic gradient descent and generalization: an empirical study. In ICML, 2019.

Safran, I. and Shamir, O. Spurious local minima are com- mon in two-layer ReLU neural networks. In ICML, 2018.

Soltanolkotabi, M., Javanmard, A., and Lee, J. D. Theo- retical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.

Stewart, G. W. Perturbation theory for the singular value decomposition. Technical Report, 1990.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices, 2010. arXiv:1011.3027.

Wu, X., Du, S. S., and Ward, R. Global convergence of adap- tive gradient methods for an over-parameterized neural network, 2019. arXiv:1902.07111.

Yun, C., Sra, S., and Jadbabaie, A. Small nonlinearities in activation functions create bad local minima in neural networks. In ICLR, 2019.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires re-thinking generalization. In ICLR, 2017.

Zou, D. and Gu, Q. An improved analysis of training over- parameterized deep neural networks. In NIPS, 2019.

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gra- dient descent optimizes over-parameterized deep ReLU networks, 2018. arXiv:1811.08888.

A. Mathematical Tools

Proposition A.1 Letfor a nonnegative function

Proof: Let every x, it holds

which implies

Proposition A.2 Let

Proof: Let A , . . . , a, . . . , b

Lemma A.3 (Zeros of analytic functions, see e.g. (Mityagin, 2015)) If is a real analytic function which is not identically zero then the set has Lebesgue measure zero.

Proposition A.4 (Weyl’s inequality, see e.g. (Stewart, 1990)) Let with and

Lemma A.5 (Singular values of random gaussian matrices, see e.g. (Vershynin, 2010)) Let be a random matrix with Then for every we have that, with probability at least

Theorem A.6 (Matrix Chernoff) Let be a sequence of independent, random, symmetric matrices. Assume that

B. Proofs for General Framework

In the following, we frequently use a basic inequality, namely, for every and , see Proposition A.2 for the proof.

Lemma B.1 Let be given as in (3). Then,

1. is real analytic.

2.

3.

4. -Lipschitz.

5. lim

Proof: Let be the CDF of the standard normal distribution. Then, after some manipulations, we have that

1. Since is known as an entire function (i.e. analytic everywhere), it follows from (31) that is analytic on R.

2. Note that Thus, after some simplifications, we have that

3. It is easy to check that . Moreover, is 1-Lipschitz and thus

4. We have that

B.2. Proof of (23) in Lemma 4.2

We prove by induction on l. Note that the lemma holds for l = 1 since

where the second step follows from our assumption on Now suppose the lemma holds for

It is easy to verify that it also holds for l. Indeed,

Note that for l = L one can skip the first equality above, as there is no activation at the output layer.

B.3. Proof of (25) in Lemma 4.2

We first prove the following intermediate result.

Lemma B.2 Let be 1-Lipschitz and for every Let Let Then, for every

Here, we denote

Applying Lemma B.2 to the output layer yields:

B.4. Proof of (24) in Lemma 4.2

B.5. Proof of (26) in Lemma 4.2

We start by showing the following intermediate result.

Lemma B.3 Let be 1-Lipschitz, and let and hold for every Let Then, for every

Here, we denote

Proof: For every

Note that for t = l the product terms inside brackets are inactive, and thus

follows from Lemma 4.1 that

The following inequality holds

where the second inequality follows from (23) and To prove the lemma, we will prove that, for every

Suppose that it holds for and we want to show it for t. Then,

≤ ∥

where the last line follows by plugging the bound offrom the induction assumption.

Proof of (26) in Lemma 4.2. Let

≤

B.6. Proof of Lemma 4.3

Let g

B.7. Proof that is a Cauchy Sequence

Let us fix any We need to show that there exists r > 0 such that for every trivial, so we assume w.l.o.g. that i < j. Then, the following chain of inequalities hold

Note that and thus there exists a sufficiently large r such that This shows that is a Cauchy sequence, and hence convergent to some By continuity,

is a global minimizer. The rate of convergence is