An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis

2017·Arxiv

Abstract

Abstract

In this paper, we explore theoretical properties of training a two-layered ReLU network with cen- tered d-dimensional spherical Gaussian input x (descent on w to mimic the output of a teacher network with the same architecture and fixed parameters . We show that its population gradient has an analytical formula, leading to interesting theoretical analysis of critical points and convergence behaviors. First, we prove that critical points outside the hyperplane spanned by the teacher parameters (“out-of-plane“) are not isolated and form manifolds, and characterize in-plane critical-point-free regions for two ReLU case. On the other hand, convergence to for one ReLU node is guaranteed with at least probability, if weights are initialized randomly with standard deviation upper-bounded by , consistent with empirical practice. For network with many ReLU nodes, we prove that an infinitesimal perturbation of weight initialization results in convergence towards (or its permutation), a phenomenon known as spontaneous symmetric-breaking (SSB) in physics. We assume no independence of ReLU activations. Simulation verifies our findings.

1. Introduction

Despite empirical success of deep learning (e.g., Computer Vision (He et al., 2016; Simonyan & Zisserman, 2015; Szegedy et al., 2015; Krizhevsky et al., 2012), Natural Language Processing (Sutskever et al., 2014) and Speech Recognition (Hinton et al., 2012)), it remains elusive how and why simple methods like gradient descent can solve the complicated non-convex optimization in its training procedure. In this paper, we focus on the following two-layered ReLU network:

Here is the ReLU nonlinearity. We consider the setting that a student network is optimized to minimize the distance between its prediction and the supervision provided by a teacher network of the same architecture with fixed parameters . Note that although the network prediction (Eqn. 1) is convex due to the property of ReLU, if coupled with loss functions (e.g., loss Eqn. 2), then the optimization becomes highly non-convex and has exponential number of critical points.

To analyze it, we introduce a simple analytic formula for population gradient in the case of loss, when inputs x are sampled from zero-mean spherical Gaussian. Using this formula, critical point and convergence analysis follow.

For critical points, we show that critical points outside the principal hyperplane (the subspace spanned by ) form manifolds. We also characterize the region in the principal hyperplane that has no critical points, in two ReLU case.

We also analyze the convergence behavior under the population gradient. Using Lyapunov method (LaSalle & Lef- schetz, 1961), for single ReLU case we prove that gradient descent converges to with at least probability, if initialized randomly with standard deviation upper-bounded by , verifying common initialization techniques (Bottou, 1988; Glorot & Bengio, 2010; He et al., 2015; LeCun et al., 2012),. For multiple ReLU case, when the teacher parameters form a set of or- thonormal basis, we prove that (1) a symmetric weight initialization gets stuck at a saddle point and (2) a particular infinitesimal perturbation of (1) leads to convergence towards or its permutation. This behavior is known as spontaneous symmetry breaking in physics, in which the population gradient field enjoys invariance under a certain kind of symmetry, but the solution breaks it. Although such behaviors have been known empirically, to our knowledge, this paper first formally characterizes them in 2-layered ReLU network.

(a)

Figure 1. (a) We consider the student and teacher network as nonlinear neural networks with ReLU nonlinearity. The student network updates its weight w from the output of the teacher with fixed weights The 2-layered ReLU network structure (Eqn. 1) discussed in this paper. The first layer contains fixed weights of value 1, while the second layers has K ReLU nodes. Each node j has a d-dimensional weight to be optimized. Teacher network has the same architecture as the student.

2. Related Works

For multilayer linear network, many works analyze its critical points and convergence behaviors. (Saxe et al., 2013) analyzes its dynamics of gradient descent and (Kawaguchi, 2016) shows every local minimum is global. On the other hand, very few theoretical works have been done for nonlinear networks. (Mei et al., 2016) shows the global convergence for a single nonlinear node whose derivatives of activation are bounded and . Similar to our approach, (Saad & Solla, 1996) also uses the student-teacher setting and analyzes the student dynamics when the teacher’s parameters are orthonormal. However, their activation is Gaussian error function erf(x), and only the local behaviors of the two critical points (the initial saddle point near the origin and ) are analyzed. Recent paper (Zhang et al., 2017) analyzes a similar teacher-student setting on 2-layered network when the involved function is harmonic, but it is unclear how the conclusion is generalized to ReLU case. To our knowledge, our close-form formula for 2-layered ReLU network is novel, as well as the critical point and convergence analysis.

Many previous works analyze nonlinear network based on the assumption of independent activations: the activations of ReLU (or other nonlinear) nodes are independent of the input and/or mutually independent. For example, (Choromanska et al., 2015a;b) relates the nonlinear ReLU network with spin-glass models when several assumptions hold, including the assumption of independent activations (A1p and A5u). (Kawaguchi, 2016) proves that every local minimum in nonlinear network is global based on similar assumptions. (Soudry & Carmon, 2016) shows the global optimality of the local minimum in a two-layered ReLU network, when independent multiplicative Bernoulli noise is applied to the activations. In practice, activations that share the input are highly dependent. Ignoring such dependency misses important behaviors, and may lead to misleading conclusions. In this paper, no assumption of independent activations is made. Instead, we assume input to follow spherical Gaussian distribution, which gives more realistic and interdependent activations during training.

For sigmoid activation, (Fukumizu & Amari, 2000) gives complicated conditions for a local minimum to be global when adding a new node to a 2-layered network. (Janzamin et al., 2015) gives guarantees for parameter recovery of a 2-layered network learnt with tensor decomposition. In comparison, we analyze ReLU networks trained with gradient descent, which is more popular in practice.

3. Problem Deﬁnition

Denote N as the number of samples and d as the input dimension. The N-by-d matrix X is the input data and is the fixed parameter of the teacher network. Given the current estimation w, we have the following loss:

Here we focus on population loss , where the input X is assumed to follow spherical Gaussian distribution N(0, I). Its gradient is the population gradient (abbrev. ). In this paper, we study critical points and vanilla gradient dynamics , where is the learning rate.

4. The Analytical Formula

Properties of ReLU. ReLU nonlinearity has useful properties. We define the gating function 0) as an N-by-N binary diagonal matrix. Its l-th diagonal element is a binary variable showing whether the neuron is activated for sample l. Using this notation, D(w)Xw which means D(w) selects the output of a linear neuron, based on their activations. Note that D(w) only depends on the direction of w but not its magnitude.

D(w) is also “transparent” with respect to derivatives. For example, at differentiable regions, . This gives a very concise rule for gradient descent update in ReLU networks.

One ReLU node. Given the properties of ReLU, the pop- ulation gradient can be written as:

Intuitively, this term vanishes when , and should be around if the data are evenly distributed, since roughly half of the samples are blocked. However, such an estimation fails to capture the nonlinear behavior.

If we define Population Gating (PG) function

, then can be written as:

Interestingly, F(e, w) has an analytic formula if the data X follow spherical Gaussian distribution:

Theorem 1 Denote where e is a unit vector, is the N-by-d data matrix and D(w) = diag(Xw > 0) is a binary diagonal matrix. If (and thus bias-free), then:

where is the angle between e and w.

See the link1 for the proof of all theorems. Note that we do not require X to be independent between samples. Intuitively, the first mass term aligns with w and is proportional to the amount of activated data whose ReLU are on. When , the gating function is fully on and half of the data contribute to the term; when , the gating function is completely switched off. The gate is controlled by the angle between w and the control signal e. The second asymmetric term is aligned with e, and is proportional to the asymmetry of the activated data samples with respect to e (Fig. 2).

Note that the expectation analysis smooths out ReLU and leaves only one singularity at the origin, where is not continuous. That is, if approaching from different directions towards is different.

With the close form of also has a close form:

where . The first term is from linear approximation, while the second term shows the nonlinear behavior.

For linear case, (no gating) and thus . For spherical Gaussian input X, and . Therefore, the dynamics has only one critical point and global convergence follows, which is consistent with its convex nature.

Extension to other distributions. From its definition, is linear to , regardless of the distribution of X. On the other hand, isotropy in spherical Gaussian distribution leads to the fact that E [F(e, w)] only depends on angles between vectors. For other isotropic distributions, we could similarly derive:

Figure 2. Decomposition of Population Gating (PG) function F(e, w) (Eqn. 5) into mass term and asymmetric term. F(e, w) is computed from the portion of data with ReLU gate on. The mass term is proportional to the amount of data, while the asymmetric term is related to the data asymmetry with respect to e.

where A(0) = N/2 (gating fully on), (gating fully off), and (no asymmetry when w and e are aligned). Although we focus on spherical Gaussian case, many following analysis, in particular critical point analysis, can also be applied to Eqn. 7.

Multiple ReLU node. For Eqn. 1 that contains K ReLU node, we could similarly write down the population gradient with respect to (note that ):

5. Critical Point Analysis

By solving Eqn. 8 (the normal equation, ), we could identify all critical points of g(x). However, it is highly nonlinear and cannot be solved easily. In this paper, we provide conditions for critical points using the structure of Eqn. 8. Following the analysis, the case study for K = 2 gives examples for saddle points and regions without critical points.

For convenience, we define as the Principal Hyperplane spanned by K ground truth weight vectors. Note that is at most K dimensional. is said to be in-plane, if all . Otherwise it is out-of-plane.

5.1. Normal Equation

The normal equation contain Kd scalar equations and can be written as the following:

where ¯sin Θ¯) + (Θ)diag ¯w and (Θ. Here

-th row, j-th column of is ) and .

Note that Y and are both K-by-K matrices that only depend on angles and magnitudes, and hence rotational invariant. This leads to the following theorem characterizing the structure of out-of-plane critical points:

Theorem 2 If , then out-of-plane critical points (solutions of Eqn. 9) are non-isolated and lie in a manifold.

The intuition is to construct a rotational matrix that is not identity matrix but keeps invariant. Such matrices form a Lie group L that transforms critical points to critical points. Then for any out-of-plane critical point, there is one matrix in L that changes at least one of its weights, yielding a non-isolated different critical point.

Note that Thm. 2 also works for any general isotropic distribution, in which E [F(e, w)] has the form of Eqn. 7. This is due to the symmetry of the input X, which in turn affects the geometry of critical points. The theorem also explains why we have flat minima (Hochreiter et al., 1995; Dauphin et al., 2014) often occuring in practice.

5.2. In-Plane Normal Equation

To analyze in-plane critical points, it suffices to study gradient projections on . When is full-rank, the projections could be achieved by right-multiplying both sides by , which gives equations:

This again shows decomposition of angles and magnitudes, and linearity with respect to the norms of weight vectors. Here and similarly for and are -by-K matrices that only depend on angles. Entries of M and are:

Here index j is the j-th column of Eqn. 9, is from projection vector and k is the k-th weight magnitude.

Diagnoal constraints. For “diagonal” constraints (j, j) of Eqn. 10, we have and , where . Therefore, we arrive at the following subset of the constraints:

where and are both K-by- K matrices. Note that if is full-rank, then we could solve from Eqn. 13 and plug it back in Eqn. 10 to check whether it is indeed a critical point. This gives necessary conditions for critical points that only depend on angles.

Figure 3. Separable property of critical points using (Eqn. 14). Checking the criticability of decomposed into two subproblems, one related to and the other is related to

Separable Property. Interestingly, the plugging back operation leads to conditions that are separable with respect to ground truth weight (Fig. 3). To see this, we first define the following quantity which is a function between a single (rather than K) ground truth unit weight vector and all current unit weights :

where is the angle between and , , and (like Eqn. 12). Note that is the j-th column of . Fig. 3 illustrates the case when has the following properties:

Proposition 1 when there exists l so that . In addition, always.

Intuitively, characterizes the relative geometric relationship among and . It is like determinant of a matrix whose columns are and . With , we have the following necessary conditions for critical points:

Theorem 3 If , and for a given parameter w, (or < 0) for all , then w cannot be a critical point.

5.3. Case study: K = 2 network

In this case, and are 2-by-2 matrices. Here we discuss the case that both and are in .

Saddle points. When and are collinear), is singular since and are identical. From Eqn. 9, if , i.e., they are both aligned with the bisector angle of and , and , then the current solution is a saddle point. Note that this gives one constraint for two weight magnitudes, and thus there exist infinite solutions.

Region without critical points. We rely on the following conjecture that is verified empirically in an exhaustive manner (Sec. 7.2). It characterizes zero-crossings of a 2D function on a closed region . In comparison, in-plane 2 ReLU network has 6 parameters and is more dif-ficult to handle: 8 for and , minus the rota- tional and scaling symmetries.

Figure 4. Critical point analysis for changes sign when is in/out of the cone spanned by weights (b) Two cases that cannot be critical points.

Conjecture 1 If is in the interior of , then . If is in the exterior, then .

This is also empirically true for . Combined with Thm. 3, we know that (Fig. 4):

Theorem 4 If Conjecture 1 is correct, then for 2 ReLU network, ) is not a critical point, if they both are in , or both out of it.

On the other hand, when exact one ground truth weight is inside , it is not sure whether is a critical point.

6. Convergence Analysis

Application of Eqn. 5 also yields interesting convergence analysis. We focus on infinitesimal analysis, i.e., when learning rate and the gradient update becomes a first-order differential equation:

Then the populated objective does not increase:

The goal of convergence analysis is to determine specific weight initializations that leads to convergence to following the gradient descent dynamics (Eqn. 15).

6.1. Single ReLU case

Using Lyapunov method (LaSalle & Lefschetz, 1961), we show that the gradient dynamics (Eqn. 15) converges to

when ∈ ∥}:

Theorem 5 When , following the dynamics of Eqn. 15, the Lyapunov function has dV/dt < 0 and the system is asymptotically stable and thus when .

The intuition is to represent dV/dt as a 2-by-2 bilinear form of vector , and the bilinear coefficient matrix, as a function of angles, is negative definite (except for ). Note that similar approaches do not apply to

Figure 5. (a) Sampling strategy to maximize the probability of convergence. (b) Relationship between sampling range r and desired probability of success

regions including the origin because at the origin, the population gradient is discontinuous. does not include the origin and for any initialization , we could always find a slightly smaller subset with that covers , and apply Lyapunov method within. Note that the global convergence claim in (Mei et al., 2016) for loss does not apply to ReLU, since it requires .

Random Initialization. How to sample without knowing ? Uniform sampling around origin with radius for any results in exponentially small success rate in high-dimensional space. A better idea is to sample around the origin with very small radius (but not at w = 0), so that looks like a hyperplane near the origin, and thus almost half samples are useful (Fig. 5(a)), as shown in the following theorem:

Theorem 6 The dynamics in Eqn. 6 converges to with probability at least , if the initial value is sampled uniformly from with

The idea is to lower-bound the probability of the shaded area (Fig. 5(b)). Thm. 6 gives an explanation for common initialization techniques (Glorot & Bengio, 2010; He et al., 2015; LeCun et al., 2012; Bottou, 1988) that uses random variables with standard deviation.

6.2. Multiple ReLU case

For multiple ReLUs, Lyapunov method on Eqn. 8 yields no decisive conclusion. Here we focus on the symmetric property of Eqn. 8 and discuss a special case, that the teacher parameters and the initial weights respect the following symmetry: and , where is an orthogonal matrix whose col- lection forms a group. Without loss of gen- erality, we set as the identity. Then from Eqn. 8 the population gradient becomes:

Figure 6. Spontaneous Symmetric-Breaking (SSB): Objective / gradient field is symmetric but the solution is not. (a) Reflection keeps the gradient field invariant, but transforms 1 to 2 and vice versa. (b) The Mexican hat example. Rotation keeps the objective invariant, but transforms any local minimum to a different one.

This means that if all and are symmetric under group actions, so does their population gradients. Therefore, the trajectory also respects the symmetry (i.e., ) and we only need to solve one equation for instead of K (here ):

Eqn. 18 has interesting properties, known as Spontaneous Symmetric-Breaking (SSB) in physics (Brading & Castel- lani, 2003), in which the equations of motion respect a certain symmetry but its solution breaks it (Fig. 6). In our language, despite that the population gradient field and the objective E [J] are invariant to the group transformation P, i.e., for and remain the same, its solution is not (). Furthermore, since P is finite, as we will see, the final solution converges to different permutations of due to infinitesimal perturbations of initialization.

To illustrate such behaviors, consider the following example in which forms an orthonormal basis and un- der this basis, P is a cyclic group in which circularly shifts dimension by (e.g., ). In this case, if we start with under the basis of , then Eqn. 18 is further reduced to a convergent 2D nonlinear dynamics and Thm. 7 holds (Please check Supplementary Materials for the associated close-form of the 2D dynamics):

Theorem 7 For a bias-free two-layered ReLU network that takes spherical Gaussian in- puts, if the teacher’s parameters form orthnomal bases, then (1) when the student parameters is initialized to be under the basis of , where , then Eqn. 8 converges to teacher’s parameters (or (x, y) = (1, 0)); (2) when , then it converges to a saddle point .

Thm. 7 suggests that when , the system converges to , etc. Since can be arbitrarily small, a slightest perturbation around leads to a different fixed point for some j. Unlike single ReLU case, the initialization in Thm. 7 is -dependent, and serves as an example for the branching behavior.

Thm. 7 also suggests that for convergence, and can be arbitrarily small, regardless of the magnitude of , showing a global convergence behavior. In comparison, (Saad & Solla, 1996) uses Gaussian error function (erf) as the activation, and only analyzes local behaviors near the two fixed points (origin and ).

In practice, even with noisy initialization, Eqn. 18 and the original dynamics (Eqn. 8) still converge to (and its transformations). We leave it as a conjecture, whose proof may lead to an initialization technique for 2-layered ReLU that is -independent.

Conjecture 2 If the initialization , where is noise and , then Eqn. 8 also converges to with high probability.

7. Simulations

7.1. The analytical solution to F(e, w)

We verify (Eqn. 5) with simulation. We randomly pick e and w so that their angle is uniformly distributed in . The analytical formula E [F(e, w)] is compared with F(e, w), which is computed via sampling on the input X that follows spherical Gaussian distribution. We use relative RMS error: . Fig. 7(a) shows the error distribution with respect to angles. For small , the gating function D(w) and D(e) mostly overlap and give a reliable estimation. When and D(e)overlap less and the variance grows. Note that our convergence analysis operate on and is not affected. In the following, we sample angles from .

Fig. 7(a) shows that the accuracy of the formula increases with more samples. We also examine other zero-mean distributions of X, e.g., uniform distribution in . Fig. 7(d) shows that the formula still works for large d. Note that the error is computed up to a global scale, due to different normalization constants in probability distributions. To prove the usability of Eqn. 5 for more general distributions remains open.

7.2. Empirical Results in critical point analysis K = 2

Conjecture 1 can be reduced to enumerate a complicated but 2D function via exhaustive sampling. In comparison, a full optimization of 2-ReLU network constrained on principal hyperplane involves 6 parameters (8 parameters

Figure 7. (a) Distribution of relative RMS error with respect to Relative RMS error decreases with sample size, showing the asympototic behavior of the analytical formula (Eqn. 5). Note that the y-axis of the right plot is in log scale. (c) Eqn. 5 also works well when the input data X are generated by other zero-mean distribution X, e.g., uniform distribution in

Figure 8. (a)-(b) Vector field in (x, y) plane following 2D dynamics (Thm. 7, See Supplementary Materials for the close-form formula) for K = 2 and K = 5. Saddle points are visible. The parameters of teacher’s network are at Trajectory in (x, y) plane for K = 2, K = 5, and K = 10. All trajectories start from is aligned with , gradient descent takes detours. (d) Training curve. Interestingly, when K is larger the convergence is faster.

minus 2 degrees of symmetry) and is more difficult to handle. Fig. 10 shows that empirically has no extra zerocrossing other than or . As shown in Fig. 10(c), we have densely enumerated and on a grid without finding any counterexamples.

7.3. Convergence analysis for multiple ReLU nodes

Fig. 8(a) and (b) shows the 2D vector field in Thm 7. Fig. 8(c) shows the 2D trajectory towards convergence to the teacher’s parameters . Interestingly, even when we initialize the weights as , whose direction is aligned with at , the gradient descent still takes detours to reach the destination. This is because at the beginning of optimization, all ReLU nodes explain the training error in the same way (both x and y increases); when the “obvious” component is explained, the error pushes some nodes to explain other components. Hence, specialization follows (x increases but y decreases).

Fig. 9 shows empirical convergence for , when the initialization deviates from initialization [x, y, . . . , y] in Thm. 7. Unless the deviation is large, w converges to . For more general network , when convergence follows. When some is negative, the network fails to converge to , even when the student is initialized with the true values .

8. Extension to multilayer ReLU network

A natural question is whether the proposed method can be extended to multilayer ReLU network. In this case, there is similar subtraction structure for gradient as Eqn. 3:

Proposition 2 Denote [c] as all nodes in layer c. Denote and as the output of node j at layer c of the teacher and student network, then the gradient of the parameters immediate under node is:

where is the data fed into node and are N- by-N diagonal matrices. For any node and similarly for .

Figure 9. Top row: Convergence when weights are initialized with noise: 2-layered network converges to until huge noise. Both teacher and student networks use . Each experiment has 8 runs. Bottom row: Convergence for . Here we fix top weights at different numbers (rather than 1). Large positive corresponds to fast convergence. When contains mixture signs, convergence to is not achieved.

Figure 10. Quantity ReLU network. We fix . In this case, are both dependent vari- ables with respect to , otherwise negative. There are no extra zero-crossings. (a)-(b) Examples: evaluation on with grid size

The 2-layered network in this paper is a special case with . Despite the difficulty that is now de- pends on the weights of upper layers, and the input is not necessarily Gaussian distributed, Proposition 2 gives a mathematical framework to explore the structure of gradient. For example, a similar definition of Population Gradient function is possible.

9. Conclusion and Future Work

In this paper, we study the gradient descent dynamics of a 2-layered bias-free ReLU network. The network is trained using gradient descent to reproduce the output of a teacher network with fixed parameters in the sense of norm. We propose a novel analytic formula for population gradient when the input follows zero-mean spherical Gaussian distribution. This formula leads to interesting critical point and convergence analysis. Specifically, we show that critical points out of the hyperplane spanned by are not isolated and form manifolds. For two ReLU case, we characterize regions that contain no critical points. For convergence analysis, we show guaranteed convergence for a single ReLU case with random initialization whose standard deviation is on the order of . For multiple ReLU case, we show that an infinitesimal change of weight initialization leads to convergence to different optima.

Our work opens many future directions. First, Thm. 2 characterizes the non-isolating nature of critical points in the case of isotropic input distribution, which explains why often practical solutions of NN are degenerated. What if the input distribution has different symmetries? Will such symmetries determine the geometry of critical points? Second, empirically we see convergence cases that are not covered by the theorems, suggesting the conditions imposed by the theorems can be weaker. Finally, how to apply similar analysis to broader distributions and how to generalize the analysis to multiple layers are also open problems.

Acknowledgement We thank L´eon Bottou, Ruoyu Sun, Jason Lee, Yann Dauphin and Nicolas Usunier for discussions and insightful suggestions.

References

Bottou, L´eon. Reconnaissance de la parole par reseaux connexionnistes. In Proceedings of Neuro Nimes 88, pp. 197–218, Nimes, France, 1988. URL http://leon. bottou.org/papers/bottou-88b.

Brading, Katherine and Castellani, Elena. Symmetries in physics: philosophical reflections. Cambridge University Press, 2003.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, G´erard Ben, and LeCun, Yann. The loss surfaces of multilayer networks. In AISTATS, 2015a.

Choromanska, Anna, LeCun, Yann, and Arous, G´erard Ben. Open problem: The landscape of the loss surfaces of multilayer networks. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3, volume 6, pp. 1756–1760, 2015b.

Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pp. 2933– 2941, 2014.

Fukumizu, Kenji and Amari, Shun-ichi. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3):317–327, 2000.

Glorot, Xavier and Bengio, Yoshua. Understanding the dif- ficulty of training deep feedforward neural networks. In Aistats, volume 9, pp. 249–256, 2010.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. Computer Vision anad Pattern Recognition (CVPR), 2016.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four

research groups. IEEE Signal Processing Magazine, 29 (6):82–97, 2012.

Hochreiter, Sepp, Schmidhuber, J¨urgen, et al. Simplify- ing neural nets by discovering flat minima. Advances in Neural Information Processing Systems, pp. 529–536, 1995.

Janzamin, Majid, Sedghi, Hanie, and Anandkumar, An- ima. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. CoRR abs/1506.08473, 2015.

Kawaguchi, Kenji. Deep learning without poor local min- ima. Advances in Neural Information Processing Systems, 2016.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

LaSalle, J. P. and Lefschetz, S. Stability by lyapunov’s second method with applications. New York: Academic Press., 1961.

LeCun, Yann A, Bottou, L´eon, Orr, Genevieve B, and M¨uller, Klaus-Robert. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.

Mei, Song, Bai, Yu, and Montanari, Andrea. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

Saad, David and Solla, Sara A. Dynamics of on-line gradi- ent descent learning for multilayer neural networks. Advances in Neural Information Processing Systems, pp. 302–308, 1996.

Saxe, Andrew M, McClelland, James L, and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Simonyan, Karen and Zisserman, Andrew. Very deep con- volutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.

Soudry, Daniel and Carmon, Yair. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2015.

Zhang, Qiuyi, Panigrahy, Rina, and Sachdeva, Sushant. Electron-proton dynamics in deep learning. arXiv preprint arXiv:1702.00458, 2017.

designed for accessibility and to further open science