b

DiscoverSearch
About
My stuff
An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis
2017·arXiv
Abstract
Abstract

In this paper, we explore theoretical properties of training a two-layered ReLU network  g(x; w) = �Kj=1 σ(w⊺j x)with cen- tered d-dimensional spherical Gaussian input x (σ=ReLU). We train our network with gradientdescent on w to mimic the output of a teacher network with the same architecture and fixed parameters  w∗. We show that its population gradient has an analytical formula, leading to interesting theoretical analysis of critical points and convergence behaviors. First, we prove that critical points outside the hyperplane spanned by the teacher parameters (“out-of-plane“) are not isolated and form manifolds, and characterize in-plane critical-point-free regions for two ReLU case. On the other hand, convergence to  w∗for one ReLU node is guaranteed with at least  (1 −ϵ)/2probability, if weights are initialized randomly with standard deviation upper-bounded by O(ϵ/√d), consistent with empirical practice. For network with many ReLU nodes, we prove that an infinitesimal perturbation of weight initialization results in convergence towards  w∗(or its permutation), a phenomenon known as spontaneous symmetric-breaking (SSB) in physics. We assume no independence of ReLU activations. Simulation verifies our findings.

Despite empirical success of deep learning (e.g., Computer Vision (He et al., 2016; Simonyan & Zisserman, 2015; Szegedy et al., 2015; Krizhevsky et al., 2012), Natural Language Processing (Sutskever et al., 2014) and Speech Recognition (Hinton et al., 2012)), it remains elusive how and why simple methods like gradient descent can solve the complicated non-convex optimization in its training procedure. In this paper, we focus on the following two-layered ReLU network:

image

Here  σ(x) = max(x, 0)is the ReLU nonlinearity. We consider the setting that a student network is optimized to minimize the  l2distance between its prediction and the supervision provided by a teacher network of the same architecture with fixed parameters  w∗. Note that although the network prediction (Eqn. 1) is convex due to the property of ReLU, if coupled with loss functions (e.g.,  l2loss Eqn. 2), then the optimization becomes highly non-convex and has exponential number of critical points.

To analyze it, we introduce a simple analytic formula for population gradient in the case of  l2loss, when inputs x are sampled from zero-mean spherical Gaussian. Using this formula, critical point and convergence analysis follow.

For critical points, we show that critical points outside the principal hyperplane (the subspace spanned by  w∗) form manifolds. We also characterize the region in the principal hyperplane that has no critical points, in two ReLU case.

We also analyze the convergence behavior under the population gradient. Using Lyapunov method (LaSalle & Lef- schetz, 1961), for single ReLU case we prove that gradient descent converges to  w∗with at least  (1 − ϵ)/2probability, if initialized randomly with standard deviation upper-bounded by  O(ϵ/√d), verifying common initialization techniques (Bottou, 1988; Glorot & Bengio, 2010; He et al., 2015; LeCun et al., 2012),. For multiple ReLU case, when the teacher parameters  {wj}Kj=1form a set of or- thonormal basis, we prove that (1) a symmetric weight initialization gets stuck at a saddle point and (2) a particular infinitesimal perturbation of (1) leads to convergence towards  w∗or its permutation. This behavior is known as spontaneous symmetry breaking in physics, in which the population gradient field enjoys invariance under a certain kind of symmetry, but the solution breaks it. Although such behaviors have been known empirically, to our knowledge, this paper first formally characterizes them in 2-layered ReLU network.

image

(a)

image

Figure 1. (a) We consider the student and teacher network as nonlinear neural networks with ReLU nonlinearity. The student network updates its weight w from the output of the teacher with fixed weights  w∗. (b)The 2-layered ReLU network structure (Eqn. 1) discussed in this paper. The first layer contains fixed weights of value 1, while the second layers has K ReLU nodes. Each node j has a d-dimensional weight  wjto be optimized. Teacher network has the same architecture as the student.

For multilayer linear network, many works analyze its critical points and convergence behaviors. (Saxe et al., 2013) analyzes its dynamics of gradient descent and (Kawaguchi, 2016) shows every local minimum is global. On the other hand, very few theoretical works have been done for nonlinear networks. (Mei et al., 2016) shows the global convergence for a single nonlinear node whose derivatives of activation  σ′, σ′′, σ′′′are bounded and  σ′ > 0. Similar to our approach, (Saad & Solla, 1996) also uses the student-teacher setting and analyzes the student dynamics when the teacher’s parameters  w∗are orthonormal. However, their activation is Gaussian error function erf(x), and only the local behaviors of the two critical points (the initial saddle point near the origin and  w∗) are analyzed. Recent paper (Zhang et al., 2017) analyzes a similar teacher-student setting on 2-layered network when the involved function is harmonic, but it is unclear how the conclusion is generalized to ReLU case. To our knowledge, our close-form formula for 2-layered ReLU network is novel, as well as the critical point and convergence analysis.

Many previous works analyze nonlinear network based on the assumption of independent activations: the activations of ReLU (or other nonlinear) nodes are independent of the input and/or mutually independent. For example, (Choromanska et al., 2015a;b) relates the nonlinear ReLU network with spin-glass models when several assumptions hold, including the assumption of independent activations (A1p and A5u). (Kawaguchi, 2016) proves that every local minimum in nonlinear network is global based on similar assumptions. (Soudry & Carmon, 2016) shows the global optimality of the local minimum in a two-layered ReLU network, when independent multiplicative Bernoulli noise is applied to the activations. In practice, activations that share the input are highly dependent. Ignoring such dependency misses important behaviors, and may lead to misleading conclusions. In this paper, no assumption of independent activations is made. Instead, we assume input to follow spherical Gaussian distribution, which gives more realistic and interdependent activations during training.

For sigmoid activation, (Fukumizu & Amari, 2000) gives complicated conditions for a local minimum to be global when adding a new node to a 2-layered network. (Janzamin et al., 2015) gives guarantees for parameter recovery of a 2-layered network learnt with tensor decomposition. In comparison, we analyze ReLU networks trained with gradient descent, which is more popular in practice.

Denote N as the number of samples and d as the input dimension. The N-by-d matrix X is the input data and  w∗is the fixed parameter of the teacher network. Given the current estimation w, we have the following  l2loss:

image

Here we focus on population loss  EX [J], where the input X is assumed to follow spherical Gaussian distribution N(0, I). Its gradient is the population gradient EX [∇Jw(w)](abbrev.  E [∇J]). In this paper, we study critical points  E [∇J] = 0and vanilla gradient dynamics wt+1 = wt − ηE [∇J(wt)], where  ηis the learning rate.

Properties of ReLU. ReLU nonlinearity has useful properties. We define the gating function  D(w) ≡ diag(Xw >0) as an N-by-N binary diagonal matrix. Its l-th diagonal element is a binary variable showing whether the neuron is activated for sample l. Using this notation,  σ(Xw) =D(w)Xw which means D(w) selects the output of a linear neuron, based on their activations. Note that D(w) only depends on the direction of w but not its magnitude.

D(w) is also “transparent” with respect to derivatives. For example, at differentiable regions,  Jacobianw[σ(Xw)] =σ′(Xw)X = D(w)X. This gives a very concise rule for gradient descent update in ReLU networks.

One ReLU node. Given the properties of ReLU, the pop- ulation gradient  E [∇J]can be written as:

image

Intuitively, this term vanishes when  w → w∗, and should be around N2 (w − w∗)if the data are evenly distributed, since roughly half of the samples are blocked. However, such an estimation fails to capture the nonlinear behavior.

If we define Population Gating (PG) function  F(e, w) ≡

X⊺D(e)D(w)Xw, then  E [∇J]can be written as:

image

Interestingly, F(e, w) has an analytic formula if the data X follow spherical Gaussian distribution:

Theorem 1 Denote  F(e, w) = X⊺D(e)D(w)Xwwhere e is a unit vector,  X = [x1, x2, · · · , xN]⊺is the N-by-d data matrix and D(w) = diag(Xw > 0) is a binary diagonal matrix. If  xi ∼ N(0, I)(and thus bias-free), then:

image

where  θ = ∠(e, w) ∈ [0, π]is the angle between e and w.

See the link1 for the proof of all theorems. Note that we do not require X to be independent between samples. Intuitively, the first mass term N2π(π−θ)waligns with w and is proportional to the amount of activated data whose ReLU are on. When  θ = 0, the gating function is fully on and half of the data contribute to the term; when  θ = π, the gating function is completely switched off. The gate is controlled by the angle between w and the control signal e. The second asymmetric term is aligned with e, and is proportional to the asymmetry of the activated data samples with respect to e (Fig. 2).

Note that the expectation analysis smooths out ReLU and leaves only one singularity at the origin, where  E [∇J]is not continuous. That is, if approaching from different directions towards  w = 0, E [∇J]is different.

With the close form of  F, E [∇J]also has a close form:

image

where  θ = ∠(w, w∗) ∈ [0, π]. The first term is from linear approximation, while the second term shows the nonlinear behavior.

For linear case,  D ≡ I(no gating) and thus  ∇J ∝X⊺X(w − w∗). For spherical Gaussian input X, EX [X⊺X] = Iand  E [∇J] ∝ w − w∗. Therefore, the dynamics has only one critical point and global convergence follows, which is consistent with its convex nature.

Extension to other distributions. From its definition, E [F(e, w)] = E [X⊺D(e)D(w)Xw]is linear to  ∥w∥, regardless of the distribution of X. On the other hand, isotropy in spherical Gaussian distribution leads to the fact that E [F(e, w)] only depends on angles between vectors. For other isotropic distributions, we could similarly derive:

image

image

Figure 2. Decomposition of Population Gating (PG) function F(e, w) (Eqn. 5) into mass term and asymmetric term. F(e, w) is computed from the portion of data with ReLU gate on. The mass term is proportional to the amount of data, while the asymmetric term is related to the data asymmetry with respect to e.

where A(0) = N/2 (gating fully on),  A(π) = 0(gating fully off), and  B(0) = B(π) = 0(no asymmetry when w and e are aligned). Although we focus on spherical Gaussian case, many following analysis, in particular critical point analysis, can also be applied to Eqn. 7.

Multiple ReLU node. For Eqn. 1 that contains K ReLU node, we could similarly write down the population gradient with respect to  wj(note that  ej = wj/∥wj∥):

image

By solving Eqn. 8 (the normal equation,  E�∇wjJ�= 0), we could identify all critical points of g(x). However, it is highly nonlinear and cannot be solved easily. In this paper, we provide conditions for critical points using the structure of Eqn. 8. Following the analysis, the case study for K = 2 gives examples for saddle points and regions without critical points.

For convenience, we define  Π∗as the Principal Hyperplane spanned by K ground truth weight vectors. Note that  Π∗is at most K dimensional.  {wj}Kj=1is said to be in-plane, if all  wj ∈ Π∗. Otherwise it is out-of-plane.

5.1. Normal Equation

The normal equation  {E�∇wjJ� = 0}Kj=1contain Kd scalar equations and can be written as the following:

image

where  Y = diag(sin Θ⊺¯w −sin Θ∗⊺¯w∗) + (π11⊺ −Θ⊺)diag ¯w and  B∗ = π11⊺ −∗)⊺. Here  θ∗j′j ≡

∠(wj, w∗j′), θj′j ≡ ∠(wj, wj′), Θ = [θij] (i-th row, j-th column of  Θis  θij) and  Θ∗ = [θ∗ij ].

Note that Y and  B∗are both K-by-K matrices that only depend on angles and magnitudes, and hence rotational invariant. This leads to the following theorem characterizing the structure of out-of-plane critical points:

Theorem 2 If  d ≥ K +2, then out-of-plane critical points (solutions of Eqn. 9) are non-isolated and lie in a manifold.

The intuition is to construct a rotational matrix that is not identity matrix but keeps  Π∗invariant. Such matrices form a Lie group L that transforms critical points to critical points. Then for any out-of-plane critical point, there is one matrix in L that changes at least one of its weights, yielding a non-isolated different critical point.

Note that Thm. 2 also works for any general isotropic distribution, in which E [F(e, w)] has the form of Eqn. 7. This is due to the symmetry of the input X, which in turn affects the geometry of critical points. The theorem also explains why we have flat minima (Hochreiter et al., 1995; Dauphin et al., 2014) often occuring in practice.

5.2. In-Plane Normal Equation

To analyze in-plane critical points, it suffices to study gradient projections on  Π∗. When  {wj}is full-rank, the projections could be achieved by right-multiplying both sides by  {ej′}, which gives  K2equations:

image

This again shows decomposition of angles and magnitudes, and linearity with respect to the norms of weight vectors. Here  ¯w = [∥w1∥, ∥w2∥, . . . , ∥wK∥]⊺and similarly for ¯w∗. Mand  M ∗are  K2-by-K matrices that only depend on angles. Entries of M and  M ∗are:

image

Here index j is the j-th column of Eqn. 9,  j′is from projection vector  ej′and k is the k-th weight magnitude.

Diagnoal constraints. For “diagonal” constraints (j, j) of Eqn. 10, we have  cos θjj = 1and  mjj,k = h(θkj ), m∗jj,k =h(θ∗kj ), where  h(θ) = (π − θ) cos θ + sin θ. Therefore, we arrive at the following subset of the constraints:

image

where  Mr = h(Θ⊺)and  M ∗r = h(Θ∗⊺)are both K-by- K matrices. Note that if  Mris full-rank, then we could solve  ¯wfrom Eqn. 13 and plug it back in Eqn. 10 to check whether it is indeed a critical point. This gives necessary conditions for critical points that only depend on angles.

image

Figure 3. Separable property of critical points using  Ljj′ function(Eqn. 14). Checking the criticability of  {w1, w2, w∗1, w∗2} can bedecomposed into two subproblems, one related to  {w1, w2, w∗1}and the other is related to  {w1, w2, w∗2}.

Separable Property. Interestingly, the plugging back operation leads to conditions that are separable with respect to ground truth weight (Fig. 3). To see this, we first define the following quantity  Ljj′which is a function between a single (rather than K) ground truth unit weight vector  e∗and all current unit weights  {el}Kl=1:

image

where  θ∗l = ∠(e∗, el)is the angle between  e∗and  el, v = v({θ∗l }) = [h(θ∗1), . . . , h(θ∗K)]⊺, and  m∗jj′ =(π − θ∗j ) cos θ∗j′ + sin θ∗j cos θjj′(like Eqn. 12). Note that v({θ∗jl })is the j-th column of  M ∗r. Fig. 3 illustrates the case when  K = 2. Ljj′has the following properties:

Proposition 1  Ljj′({θ∗l }, Θ) = 0when there exists l so that  e∗ = el. In addition,  Ljj({θ∗l }, Θ) = 0always.

Intuitively,  Ljj′characterizes the relative geometric relationship among  e∗and  {el}. It is like determinant of a matrix whose columns are  {el}and  e∗. With  Ljj′, we have the following necessary conditions for critical points:

Theorem 3 If  ¯w∗ ̸= 0, and for a given parameter w, Ljj′({θ∗kl }, Θ) > 0(or < 0) for all  1 ≤ k ≤ K, then w cannot be a critical point.

5.3. Case study: K = 2 network

In this case,  Mrand  M ∗rare 2-by-2 matrices. Here we discuss the case that both  w1and  w2are in  Π∗.

Saddle points. When  θ12 = 0 (w1and  w2are collinear), Mr = π11⊺is singular since  e1and  e2are identical. From Eqn. 9, if  θ∗11 = θ∗21, i.e., they are both aligned with the bisector angle of  w∗1and  w∗2, and  π ¯w⊺1 =h�θ∗1∗2/2�( ¯w∗)⊺1, then the current solution is a saddle point. Note that this gives one constraint for two weight magnitudes, and thus there exist infinite solutions.

Region without critical points. We rely on the following conjecture that is verified empirically in an exhaustive manner (Sec. 7.2). It characterizes zero-crossings of a 2D function on a closed region  [0, 2π] × [0, π]. In comparison, in-plane 2 ReLU network has 6 parameters and is more dif-ficult to handle: 8 for  w1, w2, w∗1and  w∗2, minus the rota- tional and scaling symmetries.

image

Figure 4. Critical point analysis for  K = 2. (a) L12changes sign when  w∗ is in/out of the cone spanned by weights  w1 and w2.(b) Two cases that  (w1, w2)cannot be critical points.

Conjecture 1 If  e∗is in the interior of  Cone(e1, e2), then L12(θ∗1, θ∗2, θ12) > 0. If  e∗is in the exterior, then  L12 < 0.

This is also empirically true for  L21. Combined with Thm. 3, we know that (Fig. 4):

Theorem 4 If Conjecture 1 is correct, then for 2 ReLU network,  (w1, w2) (w1 ̸= w2) is not a critical point, if they both are in  Cone(w∗1, w∗2), or both out of it.

On the other hand, when exact one ground truth weight is inside  Cone(w1, w2), it is not sure whether  (w1, w2)is a critical point.

Application of Eqn. 5 also yields interesting convergence analysis. We focus on infinitesimal analysis, i.e., when learning rate  η → 0and the gradient update becomes a first-order differential equation:

image

Then the populated objective  EX [J]does not increase:

image

The goal of convergence analysis is to determine specific weight initializations  w0that leads to convergence to  w∗following the gradient descent dynamics (Eqn. 15).

6.1. Single ReLU case

Using Lyapunov method (LaSalle & Lefschetz, 1961), we show that the gradient dynamics (Eqn. 15) converges to  w∗

when  w0 Ω = {w : ∥w − w∗∥ < ∥w∗∥}:

Theorem 5 When  w0 ∈ Ω = {w : ∥w − w∗∥ < ∥w∗∥}, following the dynamics of Eqn. 15, the Lyapunov function V (w) = 12∥w − w∗∥2has dV/dt < 0 and the system is asymptotically stable and thus  wt → w∗when  t → +∞.

The intuition is to represent dV/dt as a 2-by-2 bilinear form of vector  [∥w∥, ∥w∗∥], and the bilinear coefficient matrix, as a function of angles, is negative definite (except for  w = w∗). Note that similar approaches do not apply to

image

Figure 5. (a) Sampling strategy to maximize the probability of convergence. (b) Relationship between sampling range r and desired probability of success  (1 − ϵ)/2.

regions including the origin because at the origin, the population gradient is discontinuous.  Ωdoes not include the origin and for any initialization  w0 ∈ Ω, we could always find a slightly smaller subset  Ω′δ = {w : ∥w − w∗∥ ≤∥w∗∥−δ}with  δ > 0that covers  w0, and apply Lyapunov method within. Note that the global convergence claim in (Mei et al., 2016) for  l2loss does not apply to ReLU, since it requires  σ′(x) > 0.

Random Initialization. How to sample  w0 ∈ Ωwithout knowing  w∗? Uniform sampling around origin with radius  r ≥ γ∥w∗∥for any  γ > 1results in exponentially small success rate  (r/∥w∗∥)d ≤ γ−din high-dimensional space. A better idea is to sample around the origin with very small radius (but not at w = 0), so that  Ωlooks like a hyperplane near the origin, and thus almost half samples are useful (Fig. 5(a)), as shown in the following theorem:

Theorem 6 The dynamics in Eqn. 6 converges to  w∗with probability at least  (1 − ϵ)/2, if the initial value  w0is sampled uniformly from  Br = {w : ∥w∥ ≤ r}with

image

The idea is to lower-bound the probability of the shaded area (Fig. 5(b)). Thm. 6 gives an explanation for common initialization techniques (Glorot & Bengio, 2010; He et al., 2015; LeCun et al., 2012; Bottou, 1988) that uses random variables with  O(1/√d)standard deviation.

6.2. Multiple ReLU case

For multiple ReLUs, Lyapunov method on Eqn. 8 yields no decisive conclusion. Here we focus on the symmetric property of Eqn. 8 and discuss a special case, that the teacher parameters  {w∗j}Kj=1and the initial weights {w0j}Kj=1respect the following symmetry:  wj = Pjwand w∗j = Pjw∗, where  Pjis an orthogonal matrix whose col- lection  P ≡ {Pj}Kj=1forms a group. Without loss of gen- erality, we set  P1as the identity. Then from Eqn. 8 the population gradient becomes:

image

image

Figure 6. Spontaneous Symmetric-Breaking (SSB): Objective / gradient field is symmetric but the solution is not. (a) Reflection keeps the gradient field invariant, but transforms 1 to 2 and vice versa. (b) The Mexican hat example. Rotation keeps the objective invariant, but transforms any local minimum to a different one.

This means that if all  wjand  w∗jare symmetric under group actions, so does their population gradients. Therefore, the trajectory  {wt}also respects the symmetry (i.e., Pjwt1 = wtj) and we only need to solve one equation for E [∇wJ]instead of K (here  e = w/∥w∥):

image

Eqn. 18 has interesting properties, known as Spontaneous Symmetric-Breaking (SSB) in physics (Brading & Castel- lani, 2003), in which the equations of motion respect a certain symmetry but its solution breaks it (Fig. 6). In our language, despite that the population gradient field  E [∇wJ]and the objective E [J] are invariant to the group transformation P, i.e., for  w∗ → Pjw∗, E [J]and  E [∇wJ]remain the same, its solution is not (Pjw ̸= w). Furthermore, since P is finite, as we will see, the final solution converges to different permutations of  w∗due to infinitesimal perturbations of initialization.

To illustrate such behaviors, consider the following example in which  {w∗j}Kj=1forms an orthonormal basis and un- der this basis, P is a cyclic group in which  Pjcircularly shifts dimension by  j − 1(e.g.,  P2[1, 2, 3]⊺ = [3, 1, 2]⊺). In this case, if we start with  w0 = x0w∗ + �j̸=1 Pjw∗j =[x0, y0, . . . , y0]under the basis of  w∗, then Eqn. 18 is further reduced to a convergent 2D nonlinear dynamics and Thm. 7 holds (Please check Supplementary Materials for the associated close-form of the 2D dynamics):

Theorem 7 For a bias-free two-layered ReLU network g(x; w) = �j σ(w⊺j x)that takes spherical Gaussian in- puts, if the teacher’s parameters  {w∗j}form orthnomal bases, then (1) when the student parameters is initialized to be  [x0, y0, . . . , y0]under the basis of  w∗, where  (x0, y0) ∈Ω = {x ∈ (0, 1], y ∈ [0, 1], x > y}, then Eqn. 8 converges to teacher’s parameters  {w∗j}(or (x, y) = (1, 0)); (2) when  x0 = y0 ∈ (0, 1], then it converges to a saddle point  x = y = 1πK (√K − 1 − arccos(1/√K) + π).

Thm. 7 suggests that when  w0 = [y0, x0, . . . , y0], the system converges to  P2w∗, etc. Since  |x0 − y0|can be arbitrarily small, a slightest perturbation around  x0 = y0leads to a different fixed point  Pjw∗for some j. Unlike single ReLU case, the initialization in Thm. 7 is  w∗-dependent, and serves as an example for the branching behavior.

Thm. 7 also suggests that for convergence,  x0and  y0can be arbitrarily small, regardless of the magnitude of  w∗, showing a global convergence behavior. In comparison, (Saad & Solla, 1996) uses Gaussian error function (σ =erf) as the activation, and only analyzes local behaviors near the two fixed points (origin and  w∗).

In practice, even with noisy initialization, Eqn. 18 and the original dynamics (Eqn. 8) still converge to  w∗(and its transformations). We leave it as a conjecture, whose proof may lead to an initialization technique for 2-layered ReLU that is  w∗-independent.

Conjecture 2 If the initialization  w0 = x0w∗ +y0 �j̸=1 Pjw∗ + ϵ, where  ϵis noise and  (x0, y0) ∈ Ω, then Eqn. 8 also converges to  w∗with high probability.

7.1. The analytical solution to F(e, w)

We verify  E [F(e, w)] = E [X⊺D(e)D(w)Xw](Eqn. 5) with simulation. We randomly pick e and w so that their angle  ∠(e, w)is uniformly distributed in  [0, π]. The analytical formula E [F(e, w)] is compared with F(e, w), which is computed via sampling on the input X that follows spherical Gaussian distribution. We use relative RMS error:  err = ∥E [F(e, w)] − F(e, w)∥/∥F(e, w)∥. Fig. 7(a) shows the error distribution with respect to angles. For small  θ, the gating function D(w) and D(e) mostly overlap and give a reliable estimation. When  θ → π, D(w)and D(e)overlap less and the variance grows. Note that our convergence analysis operate on  θ ∈ [0, π/2]and is not affected. In the following, we sample angles from  [0, π/2].

Fig. 7(a) shows that the accuracy of the formula increases with more samples. We also examine other zero-mean distributions of X, e.g., uniform distribution in  [−1/2, 1/2]. Fig. 7(d) shows that the formula still works for large d. Note that the error is computed up to a global scale, due to different normalization constants in probability distributions. To prove the usability of Eqn. 5 for more general distributions remains open.

7.2. Empirical Results in critical point analysis K = 2

Conjecture 1 can be reduced to enumerate a complicated but 2D function via exhaustive sampling. In comparison, a full optimization of 2-ReLU network constrained on principal hyperplane  Π∗involves 6 parameters (8 parameters

image

Figure 7. (a) Distribution of relative RMS error with respect to  θ = ∠(w, e). (b)Relative RMS error decreases with sample size, showing the asympototic behavior of the analytical formula (Eqn. 5). Note that the y-axis of the right plot is in log scale. (c) Eqn. 5 also works well when the input data X are generated by other zero-mean distribution X, e.g., uniform distribution in  [−1/2, 1/2].

image

Figure 8. (a)-(b) Vector field in (x, y) plane following 2D dynamics (Thm. 7, See Supplementary Materials for the close-form formula) for K = 2 and K = 5. Saddle points are visible. The parameters of teacher’s network are at  w∗ = (1, 0). (c)Trajectory in (x, y) plane for K = 2, K = 5, and K = 10. All trajectories start from  w0 = (10−3, 0). Even w0 is aligned with  w∗, gradient descent takes detours. (d) Training curve. Interestingly, when K is larger the convergence is faster.

minus 2 degrees of symmetry) and is more difficult to handle. Fig. 10 shows that empirically  L12has no extra zerocrossing other than  e∗ = e1or  e2. As shown in Fig. 10(c), we have densely enumerated  θ12 ∈ [0, π]and  e∗on a 104 × 104grid without finding any counterexamples.

7.3. Convergence analysis for multiple ReLU nodes

Fig. 8(a) and (b) shows the 2D vector field in Thm 7. Fig. 8(c) shows the 2D trajectory towards convergence to the teacher’s parameters  w∗. Interestingly, even when we initialize the weights as  [10−3, 0]⊺, whose direction is aligned with  w∗at  [1, 0]⊺, the gradient descent still takes detours to reach the destination. This is because at the beginning of optimization, all ReLU nodes explain the training error in the same way (both x and y increases); when the “obvious” component is explained, the error pushes some nodes to explain other components. Hence, specialization follows (x increases but y decreases).

Fig. 9 shows empirical convergence for  K ≥ 2, when the initialization deviates from initialization [x, y, . . . , y] in Thm. 7. Unless the deviation is large, w converges to w∗. For more general network  g2(x) = �Kj=1 ajσ(w⊺j x), when  aj > 0convergence follows. When some  ajis negative, the network fails to converge to  w∗, even when the student is initialized with the true values  {a∗j}Kj=1.

A natural question is whether the proposed method can be extended to multilayer ReLU network. In this case, there is similar subtraction structure for gradient as Eqn. 3:

Proposition 2 Denote [c] as all nodes in layer c. Denote u∗jand  ujas the output of node j at layer c of the teacher and student network, then the gradient of the parameters wjimmediate under node  j ∈ [c]is:

image

where  Xcis the data fed into node  j, Qjand  Q∗jare N- by-N diagonal matrices. For any node  k ∈ [c + 1], Qk =�j∈[c] wjkDjQjand similarly for  Q∗k.

image

Figure 9. Top row: Convergence when weights are initialized with noise:  w0 = 10−3w∗ + ϵ, where ϵ ∼ N(0, 10−3 ∗ noise). The2-layered network converges to  w∗ until huge noise. Both teacher and student networks use  g(x) = �Kj=1 σ(w⊺j x). Each experiment has 8 runs. Bottom row: Convergence for  g2(x) = �Kj=1 ajσ(w⊺j x). Here we fix top weights  ajat different numbers (rather than 1). Large positive  ajcorresponds to fast convergence. When  {aj}contains mixture signs, convergence to  w∗ is not achieved.

image

Figure 10. Quantity  L12(θ∗1, θ∗2, θ12) and L21(θ∗1, θ∗2, θ12) in 2ReLU network. We fix  θ12 = ∠(e1, e2) and vary e∗ =[cos φ, sin φ]⊺. In this case,  θ∗1 and θ∗2 are both dependent vari- ables with respect to  φ. When e∗ ∈ Cone(e1, e2), L12 andL21 > 0, otherwise negative. There are no extra zero-crossings. (a)-(b) Examples:  θ12 = 3π/8 and θ12 = 7π/8. (c) Empiricalevaluation on  (θ12, φ) ∈ [0, π] × [0, 2π]with grid size  104 × 104.

The 2-layered network in this paper is a special case with Qj = Q∗j = I. Despite the difficulty that  Qjis now de- pends on the weights of upper layers, and the input  Xcis not necessarily Gaussian distributed, Proposition 2 gives a mathematical framework to explore the structure of gradient. For example, a similar definition of Population Gradient function is possible.

In this paper, we study the gradient descent dynamics of a 2-layered bias-free ReLU network. The network is trained using gradient descent to reproduce the output of a teacher network with fixed parameters  w∗in the sense of  l2norm. We propose a novel analytic formula for population gradient when the input follows zero-mean spherical Gaussian distribution. This formula leads to interesting critical point and convergence analysis. Specifically, we show that critical points out of the hyperplane spanned by  w∗are not isolated and form manifolds. For two ReLU case, we characterize regions that contain no critical points. For convergence analysis, we show guaranteed convergence for a single ReLU case with random initialization whose standard deviation is on the order of  O(1/√d). For multiple ReLU case, we show that an infinitesimal change of weight initialization leads to convergence to different optima.

Our work opens many future directions. First, Thm. 2 characterizes the non-isolating nature of critical points in the case of isotropic input distribution, which explains why often practical solutions of NN are degenerated. What if the input distribution has different symmetries? Will such symmetries determine the geometry of critical points? Second, empirically we see convergence cases that are not covered by the theorems, suggesting the conditions imposed by the theorems can be weaker. Finally, how to apply similar analysis to broader distributions and how to generalize the analysis to multiple layers are also open problems.

Acknowledgement We thank L´eon Bottou, Ruoyu Sun, Jason Lee, Yann Dauphin and Nicolas Usunier for discussions and insightful suggestions.

Bottou, L´eon. Reconnaissance de la parole par reseaux connexionnistes. In Proceedings of Neuro Nimes 88, pp. 197–218, Nimes, France, 1988. URL http://leon. bottou.org/papers/bottou-88b.

Brading, Katherine and Castellani, Elena. Symmetries in physics: philosophical reflections. Cambridge University Press, 2003.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, G´erard Ben, and LeCun, Yann. The loss surfaces of multilayer networks. In AISTATS, 2015a.

Choromanska, Anna, LeCun, Yann, and Arous, G´erard Ben. Open problem: The landscape of the loss surfaces of multilayer networks. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3, volume 6, pp. 1756–1760, 2015b.

Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pp. 2933– 2941, 2014.

Fukumizu, Kenji and Amari, Shun-ichi. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3):317–327, 2000.

Glorot, Xavier and Bengio, Yoshua. Understanding the dif- ficulty of training deep feedforward neural networks. In Aistats, volume 9, pp. 249–256, 2010.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. Computer Vision anad Pattern Recognition (CVPR), 2016.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four

research groups. IEEE Signal Processing Magazine, 29 (6):82–97, 2012.

Hochreiter, Sepp, Schmidhuber, J¨urgen, et al. Simplify- ing neural nets by discovering flat minima. Advances in Neural Information Processing Systems, pp. 529–536, 1995.

Janzamin, Majid, Sedghi, Hanie, and Anandkumar, An- ima. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. CoRR abs/1506.08473, 2015.

Kawaguchi, Kenji. Deep learning without poor local min- ima. Advances in Neural Information Processing Systems, 2016.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

LaSalle, J. P. and Lefschetz, S. Stability by lyapunov’s second method with applications. New York: Academic Press., 1961.

LeCun, Yann A, Bottou, L´eon, Orr, Genevieve B, and M¨uller, Klaus-Robert. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.

Mei, Song, Bai, Yu, and Montanari, Andrea. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

Saad, David and Solla, Sara A. Dynamics of on-line gradi- ent descent learning for multilayer neural networks. Advances in Neural Information Processing Systems, pp. 302–308, 1996.

Saxe, Andrew M, McClelland, James L, and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Simonyan, Karen and Zisserman, Andrew. Very deep con- volutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.

Soudry, Daniel and Carmon, Yair. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2015.

Zhang, Qiuyi, Panigrahy, Rina, and Sachdeva, Sushant. Electron-proton dynamics in deep learning. arXiv preprint arXiv:1702.00458, 2017.


Designed for Accessibility and to further Open Science