Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting

2020·Arxiv

Abstract

Abstract

SHARP RATE OF CONVERGENCE FOR DEEP NEURAL NETWORK CLASSIFIERS UNDER THE TEACHER-STUDENT SETTING

1. Introduction. Deep learning has gained tremendous success in clas-siﬁcation problems such as image classiﬁcations [Deng et al., 2009b]. With the introduction of convolutional neural network [Krizhevsky et al., 2012] and residual neural network [He et al., 2016], various benchmarks in computer vision have been revolutionized and neural network based methods have achieved better-than-human performance [Nguyen et al., 2017]. For instance,

AlexNet [Krizhevsky et al., 2012] and its variants [Zeiler and Fergus, 2014,

Simonyan and Zisserman, 2014] have demonstrated superior performance

in ImageNet data [Deng et al., 2009a, Russakovsky et al., 2015], where the

data dimension is huge, i.e., each image has pixel size 256 × 256 and hence is an 65536-dimensional vector. Traditional statistical thinking sounds an alarm when facing such high-dimension data as the “curse of dimensionality” usually prevents nonparametric classiﬁcation achieving fast convergence rates. This work attempts to provide a theoretical explanation for the empirical success of deep neural networks (DNN) in (especially high dimensional) classiﬁcation, beyond the existing statistical theories.

In the context of nonparametric regression, similar investigations have been recently carried out. Among others [Farrell et al., 2018, Suzuki, 2018,

et al., 2019], Schmidt-Hieber [2019] shows that deep ReLU neural networks can achieve minimax rate of convergence when the underlying regression function possesses a certain compositional smooth structure; Bauer et al. [2019], Kohler and Langer [2019] show a similar result by instead considering hierarchical interaction models; Imaizumi and Fukumizu [2018] demonstrate the superiority of neural networks in estimating a class of non-smooth functions, in which case no linear methods, e.g. kernel smoothing, Gaussian process, can achieve the optimal convergence rate. The aforementioned works all build on the traditional smoothness assumption and the convergence rates derived therein are still subject to curse of dimensionality. Classiﬁcation and regression are fundamentally diﬀerent due to the discrete nature of class labels. Speciﬁcally, in nonparametric regression, we are interested in recovering the whole underlying function while in classiﬁcation, the focus is on the nonparametric estimation of sets corresponding to diﬀer-ent classes. As a result, it is well known that many established results on regression cannot be directly translated to classiﬁcation. The goal of this paper is hence to ﬁll this gap by investigating how well neural network based classiﬁers can perform in theory and further provide a theoretical explanation for the “break-the-curse-of-dimensionality” phenomenon. To this end, we propose to study neural network based classiﬁers in a teacher-student setting where the traditional smoothness assumption is no longer present. The teacher-student framework has originated from statistical mechanics

[Saad and Solla, 1996, Mace and Coolen, 1998, Engel and Broeck, 2002] and

recently gained increasing interest [Hinton et al., 2015, Ba and Caruana, 2014, Goldt et al., 2019, Aubin et al., 2018]. In this setup, one neural network, called student net, is trained on data generated by another neural network, called teacher net. While worst-case analysis for arbitrary data distributions may not be suitable for real structured dataset, adopting this framework can facilitate the understanding of how deep neural networks work as it provides an explicit target function with bounded complexity. Furthermore, assuming the target classiﬁer to be a teacher network of an explicit architecture may provide insights on what speciﬁc architecture of the student classiﬁer is needed to achieve an optimal excess risk. At the same time, by comparing the two networks, both optimization and generalization can be handled more elegantly. Existing works on how well student network can learn from the teacher mostly focus on regression problems and study how the student network evolves during training from computational aspects, e.g., [Tian, 2018,

2019, Goldt et al., 2019, Zhang et al., 2019, Cao and Gu, 2019]. Still, there

is a lack of statistical understanding in this important direction, particularly on classiﬁcation aspects. In this paper, we consider binary classiﬁcation, and focus on the teacher-student framework where the optimal decision region is deﬁned by ReLU neural networks. This setting is closely related to the classical smooth boundary assumption where the neural networks are substituted by smooth functions. Speciﬁcally, a well-adopted assumption called as “boundary fragment”

assumes the smooth function to be linear in one of the dimensions (see Appendix 6.1). Our teacher-student network setting is more general as it does not impose any special structures on the decision boundary. Moreover, by the universal approximation property [Cybenko, 1989, Arora et al., 2016, Lu et al., 2017], the teacher network can suﬃciently approximate any continuous function given large enough size. In the above setting, an un-improvable rate of convergence is derived as �Od(n−2/3) for the excess risk of the empirical 0-1 loss minimizer, given that the student network is deeper and larger than the teacher network (unless the teacher network has a limited capacity in some sense to be speciﬁed later). When data are separable, the rate improves to �Od(n−1). In contrast, under the smooth boundary assumption, Mammen et al. [1999] establish the optimal risk bound to be O(n−β(κ+1)/[β(κ+2)+(d−1)κ]) where β > 0 represents the smoothness of the “boundary fragments” and κ > 0 is the so-called Tsybakov noise exponent. Clearly, this rate suﬀers from the “curse of dimensionality” but interestingly, coincides with our rate when κ = 1 and β → ∞ (up to a logarithmic factor). If we further allow κ → ∞ (corresponding to separable data), the classical rate above recovers �O(n−1) (up to a logarithmic factor). Please see the Appendix 6.1 for detail. Furthermore, we extend our analysis to a speciﬁc surrogate loss, i.e., hinge loss, and show that the convergence rate remains the same (up to higher order logarithmic terms) while allowing deeper student and teacher nets. The obtained sharp risk bounds may explain the empirical success of deep neural networks in high-dimensional classiﬁcation as the data dimension d only appears in the log(n) terms. Our main technical novelty is the nontrivial entropy calculation for nonparametric set estimation based on combinatorial analysis of ReLU neural networks. Existing Works. We review some related works on classiﬁcation using neural networks. The ﬁrst class of literature demonstrate that deep neural network (DNN) classiﬁers can be eﬃciently optimized in diﬀerent senses. For example, Liang et al. [2018] study the loss surface of neural networks in classiﬁcation

and provide conditions that guarantee zero training error at all local minima of appropriately chosen surrogate loss functions. Additionally, Lyu and Li [2019]

show that under exponential loss [Soudry et al., 2018, Gunasekar et al., 2018],

gradient descent on homogeneous neural network has implicit bias towards the maximum L2 margin solution. In all these optimization works, sharp bounds are not derived for either generalization error or convergence rate of the excess risk. From a nonparametric perspective, Kim et al. [2018] derive the excess risk bound for DNN classiﬁers as �O(n−β(κ+1)/[β(κ+2)+(d−1)(κ+1)]), which is suboptimal in the sense of Tsybakov et al. [2004]. Clearly, such a dimension-dependent bound, indicating exponential dependence on the sample size, may not support the empirical success of deep learning in high dimension classiﬁcation. Notations. For any function f : X → R, denote let ∥f∥∞ = supx∈X |f(x)| and ∥f∥p = (�|f|p)1/p for p ∈ N. For two given sequences {an}n∈N and {bn}n∈N of real numbers, we write an ≲ bn if there exists a constant C > 0 such that an ≤ Cbn for all suﬃciently large n, which is also denoted as an = O(bn). Let Ω(·) be the counterpart of O(·) that an = Ω(bn) means an ≳ bn In addition, we write ab ≍ bn if an ≲ bn and an ≳ bn. For a, b ∈ R, denote a ∨ b = max{a, b} and a ∧ b = min{a, b}. Let ⌊a⌋ represent max{n ∈ N : n ≤ a}. I denotes the indicator function. Independently, identically distributed is abbreviated as i.i.d..

2.1. Neural Network Setup. We consider deep neural networks with Rec-tiﬁed Linear Unit (ReLU) activation that σ(x) = max{x, 0}. For an L hidden layer ReLU neural network f(·), let the width of each layer be n0, n1, · · · , nL, where n0 = d is the input dimension, and denote the weight matrices and bias vectors in each layer to be W (l) and b(l), respectively. Let σ(W ,b)(x) = σ(W · x + b) and ◦ represent function composition. Then, the ReLU DNN can be written as

where Θ = {(W (l), b(l))}l=1,...,L+1 denotes the parameter set. For any given Θ, let |Θ| be the number of hidden layers in Θ, and Nmax(Θ) be the maximum width. We deﬁne ∥Θ∥0 as the number of nonzero parameters: ∥Θ∥0 =

�∥vec(W (l))∥0 + ∥b(l)∥0�,

where vec(W (l)) transforms the matrix W (l) into the corresponding vector by concatenating the column vectors. Similarly, we deﬁne ∥Θ∥∞ as the largest absolute value of the parameters in Θ, ∥Θ∥∞ = max

max1≤l≤L+1 ∥vec(W (l))∥∞, max1≤l≤L+1 ∥b(l)∥∞

For any given n, let Fn be

2.2. Binary Classiﬁcation. Consider binary classiﬁcation with a feature vector x ∈ X ⊂ Rd and a label y ∈ {−1, 1}. Assume x|y = 1 ∼ p(x), x|y = −1 ∼ q(x) where p and q are two bounded densities on X w.r.t. base measure Q. If p, q have disjoint support, we say the data distribution or the classiﬁcation problem is separable. For simplicity, assume that Q is Lebesgue measure, and positive and negative labels are equally likely to appear, i.e., balanced labels. The objective of classiﬁcation is to ﬁnd an optimal classiﬁer (called the Bayes classiﬁer) C∗ within some classiﬁer family C, that minimizes the 0-1 loss deﬁned as C∗ = argmin

R(C) := argmin

E [I{C(x) ̸= y}] . We can estimate C∗ based on the training data by minimizing the empirical 0-1 risk as follows �Cn = argmin

Rn(C) := argmin

where Cn is a given class of classiﬁers possibly depending on the sample size n. In practice, the above 0-1 loss is often replaced by its (computationally feasible) surrogate counterparts [Bartlett et al., 2006], such as hinge loss φ(z) = (1 − z)+ = max{1 − z, 0} or logistic loss φ(z) = log(1 + exp(−z)). Given a surrogate loss φ, we ﬁrst obtain �fφ by minimizing the empirical risk Rφ,n(f) =

over F, and then construct a classiﬁer by �Cφ(x) = sign( �fφ(x)). Accordingly,

the population risk. Given that C(x) = sign(f(x)), with a slight abuse of notation, we write R(C) and R(f) interchangeably. A classiﬁer C is evaluated by its excess risk deﬁned as the diﬀerence of the population risk between C and the Bayes optimal classiﬁer C∗ that E(C, C∗) = R(C) − R(C∗). Our goal is to derive sharp convergence rates of E(C, C∗) under diﬀerent losses.

we set up the teacher-student framework for classiﬁcation under which sharp rates for the excess risk are developed. Such a teacher-student bound sets an algorithmic independent benchmark for various deep neural network classiﬁers and also helps understand the role of input data dimension in the classiﬁcation performance. The Bayes classiﬁer C∗ is deﬁned via the optimal decision region G∗ := {x ∈ X, p(x) − q(x) ≥ 0}. The set estimate �G = {x ∈ X, �f(x) ≥ 0} can be constructed through deep neural network classiﬁers �f : Rd → R trained using either 0-1 loss or surrogate loss. Accordingly, a natural teacher network assumption is that p(x) − q(x) can be expressed by some neural

such an assumption is not uncommon in high-dimensional statistics, where population quantities may depend on the sample size n, e.g., Zhao and Yu

3.1. Training with 0-1 Loss. In this section, we focus on, for the theoretical purpose, DNN classiﬁers trained with the empirical 0-1 loss. Denote �fn = argmin

1

given a certain DNN family Fn. It is important to control the complexity of the underlying classiﬁcation problem. Otherwise, the student network would not be able to recover the Bayes classiﬁer [Telgarsky, 2015] with suﬃcient accuracy. To this end, we impose the following teacher network assumptions on (p(x) − q(x)): (A1) p, q have compact supports. (A2) p(x) − q(x) is representable by some teacher ReLU DNN f∗n ∈ F∗n with N∗n = O (log n)m∗ , L∗n = O (1) for some m∗ ≥ 1. (A3) For any n, there exists cn, 1/Tn = O(log n)m∗d2L∗n such that for all

Assumption (A3) characterizes how concentrated the data are around the decision boundary, which can be seen as an extension to the classical Tsybakov noise condition [Mammen et al., 1999]. The diﬀerence is that in our case, the underlying densities are indexed by sample size and thus cn and Tn are allowed to vary with n. Assumption (A3) is not unrealistic as we will show that it holds with high probability if the teacher network is random as stated in the following lemma (see Appendix 6.3 for detail).

Lemma 3.1. Let be the teacher network with structures specified in

assumption (A2). Suppose that all weights of f∗n are i.i.d. with any continuous distribution, e.g. Gaussian, truncated Gaussian, etc.. Then, with probability at least 1 − δ, assumption (A3) holds with cn, 1/Tn ≤ A(δ)(log n)m∗d2L∗n where A(δ) is some constant depending on δ. The following theorem characterizes how well the student network of proper size can learn from the teacher in terms of the excess risk. Theorem 3.2. Under the teacher assumptions (A1) through (A3), denote

Let Fn be a student ReLU DNN family with Nn = O(log n)m and Ln = O(1) for some m ≥ m∗ and assume the student network is larger than the teacher network in the sense that Ln ≥ L∗n, Sn ≥ S∗n, Nn ≥ N∗n, Bn ≥ B∗n. Then the excess risk for �fn ∈ Fn satisﬁes sup

� 1

where �Od hides the log n terms, which depend on d. The dependence on the dimension d is in the order of O(log n)d2. We further argue that under the present setting, the rate n−2/3 in Theorem 3.2 cannot be further improved. Theorem 3.3. Under the same assumptions of p, q as in Theorem 3.2 that

inf

sup

� 1

where �Ωd hides the log n terms, which depend on d. Theorem 3.3 shows that the convergence rate achieved by the empirical 0-1 loss minimizer cannot be further improved (up to a logarithmic term). If p and q have disjoint supports, i.e. separable, which could be true in some

image data, the rate improves to n−1, as stated in the following corollary. This rate improvement is not surprising since the classiﬁcation task becomes much easier for separable data. Corollary 3.4. Under the same setting as in Theorem 3.2, if we further assume p, q have disjoint supports, then the rate of convergence of the empirical 0-1 loss minimizer improves to inf

sup

� 1

Remark 1 (Disjoint Support). Given that data are separable, Srebro et al. [2010] derived the excess risk bound as O(D log n/n) (under a smooth loss) where D is the VC-subgraph-dimension of the estimation family. Additionally, separability implies that the noise exponent κ in Tsybakov noise condition

[Mammen et al., 1999, Tsybakov et al., 2004] can be arbitrarily large, which

also gives O(1/n) rate under the “boundary fragments” assumption. The imposed relation between teacher and student nets in Theorem 3.2 is referred to as “over-realization” in Goldt et al. [2019], Tian [2019], Bai et al. [2019]: at each layer, the number of student nodes is larger than that of teacher nodes given the same depth. In other words, the student network is larger than the teacher in order to obtain zero approximation error. On the other hand, such a requirement is not necessary as long as the corresponding Bayes classiﬁer is not too complicated. A ReLU neural network is a continuous piecewise linear function, i.e. its domain can be divided into connected regions (pieces) within where the function is linear. If the ReLU neural network crosses 0 on one piece, we call that piece as being active (see Figure 1 for an illustration). One way to measure the complexity of the teacher network is the number of active pieces. The following Corollary says that the teacher network can be much larger and deeper as long as the number of active pieces are in a logarithmic order with respect to n. Corollary 3.5. The same result in Theorem 3.2 holds when the teacher network is larger than the student network, i.e. Ln ≤ L∗n, Sn ≤ S∗n, Nn ≤

network is of the following order (3.1) o

where n1, · · · , nLn are the width of each hidden layer of the student network.

Fig 1. Example of a ReLU DNN function in [0, 1]. There are 5 pieces among them, only cross value 0 (horizontal line). There are 3 active pieces in this example and they are colored red.

The number of active pieces is the key quantity in controlling the complexity of the optimal set G∗. The expression in (3.1) comes from the lower bound developed by Montufar et al. [2014] on the maximum number of linear pieces for a ReLU neural network (Lemma 3.8). This lower bound is determined by the structure of the student network. If the number of active pieces of the teacher network is on this order, i.e. within the capacity of the student, then the corresponding optimal set can still be recovered by an even smaller student network, which ensures zero approximation error. Since the student network in consideration satisﬁes Nn = O(log n)m, the required order for the number of active pieces is in the order of o(log n)mdLn. 3.2. Proof of Theorem 3.2. We ﬁrst present some preliminary lemmas. As we mentioned before, classiﬁcation can be thought of as nonparametric estimation of sets. For this, we deﬁne two distances over sets. The ﬁrst one is the usual symmetric diﬀerence of sets: for any G1, G2 ⊂ Rd, d△(G1, G2) = Q(G1△G2) = Q ((G1\G2) ∪ (G2\G1)) , where Q denotes the Lebesgue measure. The second one is induced by densities p, q: for any G1, G2 ⊂ Rd,

|p(x) − q(x)|Q(dx). There are two key factors governing the rate of convergence in classiﬁcation: • How concentrated the data are around the decision boundary; • The complexity of the set G∗ where the optimal G∗ resides. For the ﬁrst factor, the following Tsybakov noise condition [Mammen et al., 1999] quantiﬁes how close p and q are:

(N) There exists constant c > 0 and κ ∈ [0, ∞] such that for any 0 ≤ t ≤ T

The parameter κ > 0 is referred to as the noise exponent. The bigger the κ, the less concentrated the data are around the decision boundary and hence the easier the classiﬁcation. In the extreme case that p, q have diﬀerent supports, κ can be arbitrarily large and the classiﬁcation is easy. To another extreme where Q{x ∈ X : p(x) = q(x)} > 0, there exists a region where diﬀerent classes are indistinguishable. In this case, κ = 0 and the classiﬁcation is hard in that region. For the second factor, we use bracketing entropy to measure the complexity of a collection of subsets G in Rd. For any δ > 0, the bracketing number NB(δ, G, d△) is the minimal number of set pairs (Uj, Vj) such that, (a) for each j, Uj ⊂ Vj and d△(Uj, Vj) ≤ δ; (b) for any G ∈ G, there exists a pair (Uj, Vj) such that Uj ⊂ G ⊂ Vj. Simply denote NB(δ) = NB(δ, G, d△) if no confusion arises. The bracketing entropy is deﬁned as HB(δ) = log NB(δ, G, d△). Lemma 3.6 characterizes the complexity of a special collection of sets. Lemma 3.6. Let X = [0, 1]d and G be a collection of polyhedrons with at most S vertices in Rd. Then the bracketing entropy of ¯G = G ∩ X satisﬁes HB(δ, ¯G, d△) = log NB(δ, ¯G, d△) ≲ d2S log(d3/2S/δ) Proof of Lemma 3.6. Let’s ﬁrst introduce some notations and terminologies. For any δ > 0, let Mδ denote the smallest integer such that Mδ > 1/δ. Consider the set of lattice points Xdδ = {(i1/Mδ, . . . , id/Mδ) : i1, . . . , id = 0, 1, . . . , Mδ} which has cardinality (Mδ + 1)d. Let G(x1, · · · , xs) denote a polyhedron with vertices x1, · · · , xs ∈ [0, 1]d where s ≤ S. (the xi’s are not necessarily distinct). Any convex polyhedron G in Rd is the intersection of multiple (d − 1)-dimensional hyperplanes. If we move all such hyperplanes inwards (to the direction perpendicular to the hyperplanes) by a small distance δ, they produce another polyhedron, denoted G−δ, called as the δ-contraction of G. Note that G−δ can be empty if δ is not small enough. We prove the result for d = 1, in which ¯G is a collection of subintervals in [0, 1]. For any subinterval [a, b] ⊂ [0, 1], there exist xi, xj ∈ X1δ such that

(By convention, [xi, xj] is empty if xi > xj.) Then ([xi, xj+1], [xi+1, xj]) is a 2δ-bracket of [a, b] since obviously

(3.2) [x

Fig 2. Grid in 2D and the outer cover (green) constructed for with grid points for a polygon (blue).

There are�Mδ+12 �diﬀerent choices of [xi, xj], hence,�Mδ+12 �diﬀerent choices of the pairs ([xi, xj+1], [xi+1, xj]). Any [a, b] ⊂ [0, 1] can be 2δ bracketed by one of such pairs in the sense of (3.2). This shows that HB(2δ) ≤ log�Mδ+12 �≤ 2 log(1/δ). When d ≥ 2, any G ∈ ¯G has at most S vertices, so ¯G := G ∩ [0, 1]d has at most dS vertices where the factor d is due to the fact that each edge of G intersects at most d edges of [0, 1]d therefore creates at most dS vertices for ¯G. For any polygon G(x1, · · · , xs) where s ≤ dS, denote G−√dδ(x1, · · · , xs) =

to see that there exist v11, . . . , vd1, v12, . . . , vd2, · · · · · · , v1s, . . . , vds ∈ Xdδ , where

• G(

See Figure 2 for an illustration when d = 2. Similarly for G(x−1 , · · · , x−s ),

there exist

• G(

By the deﬁnition of G−√dδ, we have ∥xi−x−i ∥2 ≥√dδ. Thus ∥uji −x−i ∥2 ≤√dδ implies G(u11, . . . , ud1, · · · · · · , u1s, . . . , uds) ⊂ G(x1, · · · , xs). On the other

hand,

where the term s is due to the fact that G(x1, · · · , xs) has at most O(s) faces. Notice that

and s ≤ dS. Thus, with at most (Mδ + 1)d2S pairs of subsets in ¯G, we can 2d3/2Sδ-bracket any ¯G ∈ ¯G. Therefore, log NB((2d3/2Sδ), ¯G, d△) ≲ log�(Mδ + 1)d2S�, which implies log NB(δ, ¯G, d△) ≲ d2S log(d3/2S/δ). Lemma 3.7 (Theorem 1 in [Serra et al., 2017]). Consider a deep ReLU network with L layers, nl ReLU nodes at each layer l, and an input of dimension n0. The maximal number of linear pieces of this neural network is at most

where : 0

jl−1, nl} ∀l = 1, . . . , L}. This bound is tight when L = 1. When n0 = O(1) and all layers have the same width N, we have the same best known asymptotic bound O(NLn0) ﬁrst presented in [Raghu et al., 2017]. Consider a deep ReLU network with n0 = d inputs and L hidden layers of widths ni ≥ n0 for all i ∈ [L]. The following lemma establishes a lower bound for the maximal number of linear pieces of deep ReLU networks: Lemma 3.8 (Theorem 4 in [Montufar et al., 2014]). The maximal number of linear pieces of a ReLU network with n0 input units, L hidden layers, and ni ≥ n0 rectiﬁers on the i-th layer, is lower bounded by

Fig 3. Demonstration of how a polygon in d = 2 case can be divided into basic triangles. The union of the two brackets form a bracket of the original polygon. The blue shade is the symmetric difference.

Lemma 3.9. Let F be a class of ReLU neural networks, deﬁned on X = [0, 1]d, with at most L layers and N neurons per layer. Let Gf = {x ∈ X : f(x) ≥ 0} and GF = {Gf : f ∈ F}. Then the bracketing number of GF satisﬁes log NB(δ, GF, d△) ≲ NLd2d3 �Ld2 log(N) ∨ log(1/δ)�. Proof. The proof relies on Lemma 3.6 for which we need to control the number of vertexes of Gf based on the number of pieces (linear regions) of the ReLU neural network. Since ReLU neural networks are piecewise linear, Gf is a collection of sets of polyhedrons. Deﬁne the subgraph of a function f : Rd → R to be the set of points in Rd+1: sub(f) = {(x, t) : f(x) ≥ t}. In this sense, sub(f) ∩ {(x, 0) : x ∈ X} = {(x, 0) : x ∈ Gf}, a slice of the subgraph. Denote all the pieces to be p1, p2, · · · , ps. Each piece is a d-dimensional polyhedron on which f(x) is linear. To control the complexity of Gf, consider the most extreme case that the function crosses zero on each piece, i.e. for any i = 1, . . . , s, {(x, f(x)) : x ∈ pi} ∩ {(x, 0) : x ∈ X} ̸= ∅. Each intersection resides in a (d − 1)-dimensional hyperplane, e.g. dot for d = 1, line segment for d = 2 and so on. So the number of such (d − 1)-dimensional hyperplanes in Gf is at most s. A vertex of a polyhedron in [0, 1]d can be thought of as the intersection of at least d hyperplanes of dimension d − 1. Thus, with at most s hyperplanes there are at most�sd�< sd vertices in Gf. In order to apply Lemma 3.6, we break the collection of polyhedrons into the so-called basic polyhedrons each with d + 1 vertices. For instance, the basic polyhedrons are intervals when d = 2, are triangles when d = 3, and so on. A polyhedron G with at most s vertices can be divided into at most s disjoint basic polyhedrons B1, . . . , Bs. For instance, Figure 3 demonstrates

the d = 2 case. Therefore, the bracketing number of the polyhedrons can be derived by bracketing the basic polyhedrons. For a basic polyhedron B, denote its δ-bracketing pair to be (UB,δ, VB,δ), i.e., UB,δ ⊂ B ⊂ VB,δ. Then (UG,δ, VG,δ), deﬁned as below

form a (sδ)-bracket of G. Hence, the bracketing number of all polyhedrons is controlled by the s-th power of the bracketing number of all basic polyhedrons. Applying Lemma 3.7 we know s = O(NLd) and the number of vertices is at most S = O(NLd2). Together with Lemma 3.6, we therefore get that log NB(Sδ, GF, d△) ≲ S(d + 1)d2 log((d + 1)d3/2/δ), which implies log NB(δ, GF, d△) ≲ NLd2d3 log(NLd2d3/δ) ≲ NLd2d3 �Ld2 log(N) ∨ log(1/δ)�. More discussions about Lemma 3.9 can be found in Appendix 6.2. Next, we present some lemmas that can take advantage of the obtained entropy bound and eventually take us to the proof of the excess risk convergence rate. Lemma 3.10 (Theorem 5.11 in Van De Geer [2000]). For some function space H with suph∈H ∥h(x)∥∞ ≤ K and suph∈H ∥h(x)∥L2(P) ≤ R where P is the distribution of x. Take a > 0 satisfying (1) a ≤ C1√nR2/K; (2)

(3) a ≥ C0

;

sup

where Pn is the empirical counterpart of P.

So far, the presented lemmas are only concerned with the general case, i.e. set G∗, p, q, etc. that does not depend on n. However, in our teacher-student

framework, the optimal set is indexed by n as it’s determined by the teacher network . In the remaining part of the proof, we will consider

speciﬁcally for our teacher network case. The next lemma investigates the modulus of continuity of the empirical process. It’s similar to Lemma 5.13 in Van De Geer [2000] but with a key diﬀerence in the entropy assumption (3.3), where the entropy bound contains n. Lemma 3.11. For a probability measure P, let Hn be a class of uniformly bounded (by 1) functions h in L2(P) depending on n. Suppose that the δ-entropy with bracketing HB(δ, Hn, L2(P)) satisﬁes, for some An > 0, the inequality HB(δ, Hn, L2(P)) ≤ An log(1/δ)(3.3) for all δ > 0 small enough. Let hn0 be a ﬁxed element in Hn. Let Hn(δ) = {hn ∈ Hn : ∥hn −hn0∥L2(P) ≤ δ}. Then there exist constants D1 > 0, D2 > 0 such that for a sequence of i.i.d. random variables x1, · · · , xn with probability distribution P, it holds that

sup

for all x ≥ 1. Proof. The main tool for the proof is Lemma 3.10. Replace H with Hn(δ) in Lemma 3.10 and take K = 4, R =√2δ and a = 12C1√Anδ log(1/δ), with C1 = 2√2C0. Then (1) is satisﬁed if (3.4)

log(1δ ) ≤ √n. Under (3.4), condition (2) and (3) are satisﬁed automatically. Choosing C0 suﬃciently large will ensure (4). Thus, for all δ satisfying (3.4), we have

sup

�Anδ log(1/δ)

≤ C exp

Notice that (3.4) holds if δ ≥�An/n. Let B = min{b > 1 : 2−b ≤�An/n} and apply the peeling device. Then,

sup

sup

�An2−b log(2b)

C exp

≤ 2C exp

if C1An is suﬃciently large. We then present a lemma that establishes the connection between d△ and dp,q, which is adapted from Lemma 2 in Mammen et al. [1999] to our teacher network setting. Corresponding to assumption (A3), we deﬁne (Nn) as an extension to the classical Tsybakov noise condition (N). (Nn) There exists cn > 0 depending on n and κ ∈ [0, ∞] such that for any

Note that the (N) is a special case of (Nn) with Tn and cn being absolute constant. Lemma 3.12. Assume (Nn) and pn, qn are bounded by b2 > 0. Then, there exists absolute constants b1(κ) > 0 depending on κ such that for any Lebesgue measurable subsets G1 and G2 of X,

Proof. The second inequality is trivial given that p, q are bounded by b2. For the ﬁrst inequality, since Q(|pn − qn| ≤ t) ≤ cntκ for all 0 ≤ t ≤ Tn, the boundedness of Q(X) implies that

where An =�Q(X)T κn ∨ cn�. Then,

2An

)

Our goal in classiﬁcation is to estimate G∗n by �Gn = argminG∈Gn Rn(G), where Gn is some collection of sets associated with the student network Fn and Rn(G) = 12n

(I{xi ∈ G|yi = 1}(x) + I{xi /∈ G|yi = −1}(x)) . Similar to Theorem 1 in Mammen et al. [1999], we have the following lemma regarding the upper bound on the rate of convergence.

of X ⊂ Rd. Deﬁne

(3.5) where b2 is an absolute constant. Let Gn be another class of subsets satisfying

that for any δ > 0 small enough, (3.6) HB(δ, Gn, d△) ≤ An log(1/δ). Then we have (3.7) lim

sup

�An log2 n

given set G ∈ X, let hG(x) = I{x ∈ G}. In particular, let h∗n = hG∗n. Let ∥h∥2p =�h2(x)p(x)Q(dx). Since both pn and qn are bounded,

Consider the random variable

And△(G∗n, �Gn) log(1/d△(G∗n, �Gn))

(3.9) √nE(Rn( �Gn) − Rn(G∗n))� And△(G∗n, �Gn) log(1/d△(G∗n, �Gn))

Note that

+ 1 2n

Then Vn can be written as

+

Consider the event En = {d△(G∗n, �Gn) >�An/n} and let �Gn = {G ∈ Gn :

d△(G, G∗n) >�An/n}. If En holds, then

V

And△(G∗n, �Gn) log(1/d△(G∗n, �Gn)) ≤ sup

And△(G∗n, �Gn) log(1/d△(G∗n, �Gn)) ≤ sup

+ sup

≤ sup

sup

where Hn = {hn(x) = I{x ∈ Gn} : Gn ∈ Gn}. The last inequality follow from the fact that √x log(1/x) is strictly increasing when x < 1. Notice that hn’s are uniformly bounded by 1 and the L2 norm squared of hG1 − hG2 is d△(G1, G2). Applying Lemma 3.11, we have (3.10) E[VnI(En)] ≤ C for some ﬁnite constant C. Now we use this inequality to prove the main result. From (3.9), we know that

which, together with Lemma 3.12, yields that

which simpliﬁes to be

�An log2 n

where we used the fact that dpn,qn( �Gn, G∗n) ≳ 1/n. Therefore, under En, (3.10) implies that

�An log2 n

On the other hand, under Ecn, we have

By Lemma 3.12 we know dp,q( �Gn, G∗n) is also bounded by�An/n. Since (κ + 1)/(κ + 2) ≤ 1, the rate under En dominates and the proof is complete. Proof of Theorem 3.2. First, we verify that the Tsybakov noise condition holds for κ = 1 in our setting. The proof is based on the fact that a ReLU network is piecewise linear and the number of linear pieces is quantiﬁable. Assumption (A3) implies (Nn) with cn, 1/Tn = O(log n)m∗d2L∗n and κ = 1. In the case where p, q have disjoint support, obviously κ can be arbitrarily large. Next, we consider the bracketing number of Gn deﬁned via Fn that Gn = {x ∈ X : f(x) ≥ 0, f ∈ Fn}. From Lemma 3.9 we have log NB(δ, Gn, d△) ≲ NLd2d2 �Ld2 log(N) ∨ log(1/δ)�. Thus, An = O(Nn)d2Ln as in (3.6) if δ ≪ 1/N. Recall from assumption (A2) and (A3) that Ln = O(1), Nn = O(log n)m and 1/Tn, cn = O(log n)m∗d2L∗n. Applying Lemma 3.13 with κ = 1 we have that the excess risk has upper bound sup

E[E( �fn, C∗n)]

�An log2 n

� 1

(log n)

Proof of Corollary 3.4. Corollary 3.4 easily follows from the fact that p, q having disjoint support implies κ = ∞ in (Nn).

3.3. Proof of Theorem 3.3. We will show that the lower bound holds in special case that (1) assumption (A3) satisﬁes cn, 1/Tn being absolute constants that doesn’t depend on n; (2) instead of general ReLU neural

the dimensions, reminiscent of the “boundary fragment” assumption. In this special case, we are able to show the best possible convergence rate already matches that in Theorem 3.2. For ease of notation, we omit the subscript n and write pn, qn as p, q if no confusion arises. Proof. Without loss of generality, let X = [0, 1]d. Consider the “boundary fragment” setting and let �Gn be a set deﬁned by a ReLU network family �Fn containing functions from Rd−1 to R:

where x−j = (x1, · · · , xj−1, xj, · · · , xd). Notice that if h(x−j) is a ReLU network on Rd−1, then �h(x) = h(x−j) − xj is a ReLU network on Rd. Thus �Gn is a subset of Gn which corresponds to the student network that (3.11) Gn = {x ∈ X : f(x) > 0, f ∈ Fn} Let �Gn denote the empirical 0-1 loss minimizer over �Gn. To show the lower bound, consider the subset of D �Gn (3.5) that contains all pairs like (p, q0), where p ∈ F1, q0 will be speciﬁed later. Then, sup

� 1

where F1 is a ﬁnite set to be speciﬁed later, p, q0 are the underlying densities for the two labels and Dq0 denotes all the data generated from q0. For ease of presentation, we ﬁrst give the proof for the case d = 2 and then extend to general d. Let φ(t) be a piecewise linear function supported on [−1, 1] deﬁned as φ(t) =

t + 1 −1 < t ≤ 0,

Rewrite φ as φ(t) = σ(t + 1) − σ(t) + σ(−t + 1) − σ(−t) − 2, which is a one hidden layer ReLU neural network with 11 non-zero weights that are either 1 or −1. For x = (x1, x2) ∈ [0, 1]2, deﬁne q0(x) =(1 − η0 − b1)I{0 ≤ x2 < 1/2} + I{1/2 ≤ x2 < 1/2 + e−M} + (1 + η0 + b2)I{1/2 + e−M ≤ x2 ≤ 1}, where M ≥ 2 is an integer to be speciﬁed later. Let b1 = c−1/κ2 e−M/κ and b2 > 0 be chosen such that q0 integrates to 1 (so q0 is a valid probability density). For j = 1, 2, · · · , M and t ∈ [0, 1], let

Note that ψj is only supported on [j−1M , jM ]. For any vector ω = (ω1, · · · , ωM) ∈ Ω := {0, 1}M, deﬁne bω(t) =

and pω(x) =1 +�1/2 + e−M − x2c2

I{1/2 ≤ x2 ≤ 1/2 + bω(x1)}

where b3(ω) > 0 is a constant depending on ω chosen such that pω(x) integrates to 1. Let F1 = {pω : ω ∈ Ω} and we will show that (pω, q0) ∈ D �Gn for all ω ∈ Ω. To this end, we need to verify that (a) pω(x) ≤ c1 for x ∈ [0, 1]2;

For (a), since pω integrates to 1,

Thus, pω(x) ≤ c1 for a large enough M and some absolute constant c1. For (b), notice that {x : pω(x) ≥ q0(x)} = {x : 0 ≤ x2 ≤ 1/2 + bω(x1)} = {x ∈ [0, 1]2 : bω(x1) − σ(x2) + 1/2 ≥ 0} ∈ Gn,

where the last inclusion follows from the deﬁnition of Gn (3.11) and the fact that bω(x1) − σ(x2) + 1/2 is a ReLU neural network with one hidden layer, whose width and number of non-zero weights are both O(M). Later we will see that M = O(log n), and thus the constructed neural network satisﬁes all the size constraints in Theorem 3.2. For (c), it follows that

Since the above (a)-(c) hold and by the deﬁnition of D �Gn (3.5), we conclude that (pω, q0) ∈ D �Gn for all ω ∈ Ω . We next establish how fast

use the Assouad’s lemma stated in [Korostelev and Tsybakov, 2012] which is adapted to the estimation of sets. For j = 1, · · · , M and ω = (ω1, · · · , ωM) ∈ Ω, let

For i = 0 and i = 1, let Pji be the probability measure corresponding to the distribution of x1, · · · , xn when the underlying density is fωji. Denote the expectation w.r.t. Pji as Eji. Let Dj = {x ∈ X : 1/2 + bωj0(x1) < x2 ≤ 1/2 + bωj1(x1)}

Then S ≥ 1/2

Q(Dj)

≥ 1/2

≥ 1/2

≥ 14

where H(·, ·) denotes the Hellinger distance. Then it holds that

1 +�1/2 + e−M − x2

+

1 +

We will analyze the last two terms. For the ﬁrst term,

1 +

� 1

For the second term, notice that

which yields b3(ω11) = 1 1/2 − bω11(x1)

=

(1/2 − e−M)(1 + 1/κ)e−M(1+1/κ)�(1 − (1 − φ(Mt))1+1/κ)dt

(1/2 − e−M)(1 + 1/κ)e−M(1+1/κ)

Hence, |b3(ω11) − b3(ω10)| = O�e−M(1+1/κ)�. Unifying the above, we have

� 1

� 1

Now choose M as the smallest integer such that

κ + 2 log n. Then we have H2(P10, P11) ≤ C∗n−1 (1 + o(1)) for some constant C∗ depending only on κ, c2, φ, and

for n large enough and is another absolute constant depending only on

C∗. Thus for n large enough,

in which the constant C∗2 only depends on κ, c2 and φ. Combining all the results so far we get that lim infn→∞ inf�Gn sup(p,q)∈D �Gnn

which holds when d = 2. Using Lemma 3.12, we have lim infn→∞ inf�Gn sup(p,q)∈D �Gnn

Using the same argument as in the proof of Theorem 3.2, we get κ = 1, which will give us the rate 2/3. The proof for general d can be derived similarly. We treat the last dimension xd as x2 in the d = 2 case and treat x−d := (x1, · · · , xd−1) as x1 in the d = 2 case. Deﬁne q0(x) =(1 − η0 − b1)I{0 ≤ xd < 1/2} + I{1/2 ≤ xd < 1/2 + e−M} + (1 + η0 + b2)I{1/2 + e−M ≤ xd ≤ 1}, and pω(x) =1 +�1/2 + e−M − x2c2

I{1/2 ≤ xd ≤ 1/2 + bω(x−d)} − b3(ω)I{1/2 + bω(x−d) < xd ≤ 1}, where bω(x−d) is constructed similarly as a shallow ReLU neural network that bω(x−d) =

where ωj1,··· ,jd−1 are binary 0, 1 variables and

where φ(·) is a shallow ReLU neural network with input dimension d − 1 satisfying the following conditions: • φ = 0 outside [−1, 1]d and φ ≤ 1 on [−1, 1]d; • maxx−d∈[−1,1]d φ(x−d) ≤ 1 and φ(0) = 1. Such a construction is similar to the “spike” function in Yarotsky and Zhevnerchuk [2019] and it requires O(d2) non-zero weights. The rest of the proof follows the d = 2 case.

4. Training with Surrogate Loss. In this section, we consider deep classiﬁers trained under the hinge loss φ(z) = (1 − z)+ = max{1 − z, 0}. This kind of surrogate loss has been widely used for “maximum-margin” classiﬁcation, most notably for support vector machines [Cortes and Vapnik, 1995]. An desirable property of hinge loss is that its optimal classiﬁer coincides with that under 0-1 loss [Lin, 2002], i.e. f∗φ(x) = C∗(x). Hence, a lot of arguments for 0-1 loss can be easily carried over. Additionally, minimizing the sample average of an appropriately behaved loss function has a regularizing eﬀect [Bartlett et al., 2006]. It is thus possible to obtain uniform upper bounds on the risk of a function that minimizes the empirical average of the loss φ, even for rich classes that no such upper bounds are possible for the minimizer of the empirical average of the 01 loss. Under the surrogate loss, our requirement on the size of the teacher network is relaxed from (A2) as follows: (A2φ) p(x) − q(x) is representable by some teacher ReLU DNN f∗n ∈ F∗n with

for some m∗ ≥ 1. The following theorem says that the same un-improvable rate can be obtained for the empirical hinge loss minimizer �fφ,n ∈ Fn. Theorem 4.1. Suppose the underlying densities p and q satisfy assumptions (A1), (A2φ), (A3) and denote all such (p, q) pairs as �F∗n. Let Fn be a student ReLU DNN family with Ln = O(log n), Nn = O(log n)m and Bn, Fn = O(log n) for some m ≥ m∗. Assume the student network is larger than the teacher network, i.e., Ln ≥ L∗n, Sn ≥ S∗n, Nn ≥ N∗n, Bn ≥ B∗n, Fn ≥ F ∗n. Then the excess risk for �fφ,n ∈ Fn satisﬁes sup

� 1

Similarly, results in Corollary 3.4 and 3.5 hold for the empirical hinge loss minimizer. Speciﬁcally, when p, q are disjoint, the convergence rate of excess risk improves to n−1, and all conclusions hold when the teacher network is larger but with bounded active pieces. Remark 2 (Network Depth). Training with surrogate loss such as hinge loss, unlike 0-1 loss, doesn’t involve any hard thresholding, i.e. I{yf(x) < 0}. As a result, to control the complexity of the student network, Lemma 4.4 is used instead of Lemma 3.9, which allows us to use deeper neural networks (Ln = O(log n)) for both the student and teacher network.

4.1. Proof of Theorem 4.1. One important observation to be used in the proof is that the Bayes classiﬁer under hinge loss is the same as that

convergence rate, we utilize the following lemma from Kim et al. [2018]. Let η(x) denote the conditional probability of label 1 that η(x) = P(y = 1|x). Lemma 4.2. [Theorem 6 of [Kim et al., 2018]] Let φ be the hinge loss. Assume (N) with the noise exponent κ ∈ [0, ∞], and that following conditions (C1) through (C4) hold. (C1) For a positive sequence an = O(n−a0) as n → ∞ for some a0 > 0, there exists a sequence of function classes {Fn}n∈N such that Eφ(fn, f∗φ) ≤ an for some fn ∈ Fn. (C2) There exists a real valued sequence {Fn}n∈N with Fn ≳ 1 such that

(C3) There exists a constant ν ∈ (0, 1] such that for any f ∈ Fn and any

for a constant C2 > 0 depending only on φ and η(·). (C4) For a positive constant C3 > 0, there exists a sequence {δn}n∈N such that HB(δn, Fn, ∥ · ∥2) ≤ C3n � δn Fn

for {Fn}n∈N in (C1), {Fn}n∈N in (C2), and ν in (C3).

trarily small constant ι > 0. Then, the empirical φ-risk minimizer �fφ,n over Fn satisﬁes

In Lemma 4.2, condition (C1) guarantees the approximation error of fn

lemma, which is reminiscent of Lemma 3.12 in the sense that it characterizes

between f and f∗φ. Lemma 4.3 (Lemma 6.1 of Steinwart et al. [2007]). Assume (N) with the Tsybakov noise exponent κ ∈ [0, ∞]. Assume ∥f∥∞ ≤ F for any f ∈ F.

Under the hinge loss φ, for any f ∈ F,

where Cη,κ =�∥(2η − 1)−1∥κκ,∞ + 1�I(κ > 0) + 1 and ∥(2η − 1)−1∥κκ,∞ is deﬁned by ∥(2η − 1)−1∥κκ,∞ = supt>0 �tκ Pr�{x : |(2η(x) − 1)−1| > t}��. For condition (C4) in Lemma 4.2, we present the following lemma. Lemma 4.4. [Lemma 3 in Suzuki [2018]] For any δ > 0, the covering number of FDNN(L, N, S, B) (in sup-norm) satisﬁes

≤ 2L(S + 1) log(δ−1(L + 1)(N + 1)(B ∨ 1)). proof of Theorem 4.1. The lower bound directly follows from Theorem 3.3, as the constructed ReLU neural network in the proof also satisfy assumption (A2φ). For the upper bound on the convergence rate, we utilize Lemma 4.2 and check the conditions (C1) through (C4). Since the student network is larger than the teacher, (C1) and (C2) trivially hold with arbitrarily small an and Fn = O(log n) as assumed. To apply Lemma 4.3, notice that Cη,κ = O(cn) = O(log n)m∗d2L∗n by assumption (A3) and F = O(log n), we have (C3) holds for ν = κ/(κ + 1) +ϵn, where ϵn = (2 +m∗d2L∗n) log log n/ log n. The term ϵn is to deal with the fact that Cη,κ can also diverge at an O(log n)m∗d2L∗n rate. For (C4), by Lemma 4.4,

≤ 2Ln(Sn + 1) log�δ−1n (Ln + 1)(Nn + 1)(Bn ∨ 1)� ≲ (log n)2m+2 log�δ−1n ∨ logm(n)�. Therefore, (4.2) implies that (C3) is satisﬁed if we choose δn with

which can be satisﬁed by choosing δn = �(log n)2m+m∗d2L∗n+7

Similar to the proof of Theorem 3.2, the Tsybakov exponent κ = 1. Thus, by

sharp rate of convergence for the excess risk under both empirical 0-1 loss and hinge loss minimizer in the teacher-student setting. Our current results for training under 0-1 loss only hold for student networks with O(1) layers

and the assumption that , i.e. zero approximation, is required. In the

future, we aim to relax these two constraints and provide more comprehensive analysis of the teacher-student network. Additionally, we would like to • explore other type of neural networks such as convolutional neural network and residual neural network, which are both very successful at image classiﬁcation; • consider the implicit bias of training algorithms, e.g. stochastic gradient descent, to regularize the complexity of larger and deeper neural networks in the teacher-student setting; • consider the more general improper learning scenario where the Bayes classiﬁer is not necessarily in the student neural network; • consider other popular surrogate losses such as exponential loss or cross entropy loss. Further investigation under the teacher-student network setting may facilitate a better understanding of how deep neural network works and shed light on its empirical success especially in high-dimensional image classiﬁcation.

References.

Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.

Benjamin Aubin, Antoine Maillard, jean barbier, Florent Krzakala, Nicolas Macris, and Lenka Zdeborov´a. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3223–3234. Curran Associates, Inc., 2018.

Jean-Yves Audibert, Alexandre B Tsybakov, et al. Fast learning rates for plug-in classifiers. The Annals of statistics, 35(2):608–633, 2007.

Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662. Curran Associates, Inc., 2014.

Benedikt Bauer, Michael Kohler, et al. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4):2261–2285, 2019.

Yuan Cao and Quanquan Gu. Tight sample complexity of learning one-hidden-layer convolutional neural networks. In Advances in Neural Information Processing Systems, pages 10611–10621, 2019.

Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Efficient approximation of deep relu networks for functions on low dimensional manifolds. In Advances in Neural Information Processing Systems, pages 8172–8182, 2019.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3): 273–297, 1995.

George Cybenko. Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2:183–192, 1989.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009a.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009b.

A Engel and C.V. Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2002.

Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation

Sebastian Goldt, Madhu Advani, Andrew M Saxe, Florent Krzakala, and Lenka Zdeborov´a. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alch´e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 6979–6989. Curran Associates, Inc., 2019.

Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.

Masaaki Imaizumi and Kenji Fukumizu. Deep neural networks learn non-smooth functions effectively. arXiv preprint arXiv:1802.04474, 2018.

Yongdai Kim, Ilsang Ohn, and Dongha Kim. Fast convergence rates of deep neural networks for classification. arXiv preprint arXiv:1812.03599, 2018.

Michael Kohler and Sophie Langer. On the rate of convergence of fully connected very deep neural network regression estimates. arXiv preprint arXiv:1908.11133, 2019.

Aleksandr Petrovich Korostelev and Alexandre B Tsybakov. Minimax theory of image reconstruction, volume 82. Springer Science & Business Media, 2012.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

Shiyu Liang, Ruoyu Sun, Yixuan Li, and Rayadurgam Srikant. Understanding the loss surface of neural networks for binary classification. arXiv preprint arXiv:1803.00909, 2018.

Yi Lin. Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, 6(3):259–275, 2002.

Ruiqi Liu, Ben Boukai, and Zuofeng Shang. Optimal nonparametric inference via deep neural network. arXiv preprint arXiv:1902.01687, 2019.

Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In Advances in neural information processing systems, pages 6231–6239, 2017.

Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.

C. W. H. Mace and A. C. C. Coolen. Statistical mechanical analysis of the dynamics of

Enno Mammen, Alexandre B Tsybakov, et al. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.

Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and estimation of deep neural network to intrinsic dimensionality. arXiv preprint arXiv:1907.02177, 2019.

Kien Nguyen, Clinton Fookes, Arun Ross, and Sridha Sridharan. Iris recognition with off-the-shelf cnn features: A deep learning perspective. IEEE Access, 6:18848–18855, 2017.

Kenta Oono and Taiji Suzuki. Approximation and non-parametric estimation of resnet-type convolutional neural networks. arXiv preprint arXiv:1903.10047, 2019.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2847–2854. JMLR. org, 2017.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. .

David Saad and Sara A. Solla. Dynamics of on-line gradient descent learning for multilayer neural networks, 1996.

Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics, 2019. Forthcoming.

Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and counting linear regions of deep neural networks. arXiv preprint arXiv:1711.02114, 2017.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Optimistic rates for learning with a smooth loss. arXiv preprint arXiv:1009.3896, 2010.

Ingo Steinwart, Clint Scovel, et al. Fast rates for support vector machines using gaussian kernels. The Annals of Statistics, 35(2):575–607, 2007.

Taiji Suzuki. Adaptivity of deep relu network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality. arXiv preprint arXiv:1810.08033, 2018.

Yuandong Tian. A theoretical framework for deep locally connected relu network. arXiv

Yuandong Tian. Over-parameterization as a catalyst for better generalization of deep relu network. arXiv preprint arXiv:1909.13458, 2019.

Alexander B Tsybakov et al. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

Alexandre B Tsybakov, Sara A van de Geer, et al. Square root penalty: adaptation to the margin in classification and in edge estimation. The Annals of Statistics, 33(3): 1203–1224, 2005.

Sara Van De Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.

Dmitry Yarotsky and Anton Zhevnerchuk. The phase diagram of approximation rates for deep neural networks. arXiv preprint arXiv:1906.09477, 2019.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.

Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer

6.1. Smooth Boundary Condition. In this section we review some existing work under smooth boundary condition in details and point out its connection to our proposed teacher-student neural network. Smooth Functions. A function has H¨older smoothness index β if all partial derivatives up to order ⌊β⌋ exist and are bounded, and the partial derivatives of order ⌊β⌋ are β −⌊β⌋ Lipschitz. The ball of β-H¨older functions with radius R is then deﬁned as Hβr (R) =�f : Rr → R :

sup

where ∂, . . . , α

Boundary Assumption. It is known that estimating the classiﬁer directly instead of the conditional class probability helps achieve fast convergence

rates [Mammen et al., 1999, Tsybakov et al., 2004, 2005, Audibert et al., 2007].

Classiﬁcation in this case can be thought of as nonparametric estimation of sets where we directly estimate the decision regions for diﬀerent labels, e.g., G for label 1. Then, the classiﬁer is determined by attributing x to label 1 if x ∈ G and to label −1 otherwise, i.e., C(x) = 2 · IG(x) − 1. In this case, the Bayes risk can be written as R(G) = 1/2

q(x)Q(dx)

Denote G∗ = {x : p(x) ≥ q(x)} to be the Bayes risk minimizer, and the classiﬁcation problem is equivalent to estimation of the optimal set G∗. Given p, q, let

The optimal decision rule is to assign label 1 to x ∈ X+ and −1 to x ∈ X−. The decision boundary in this case is {x ∈ X : p(x) = q(x)}. To characterize the smoothness of the boundary, it is usually assumed that X+ consists of union and intersection of smooth hyper-surfaces [Kim et al., 2018, Tsybakov et al., 2004]. Speciﬁcally, the following assumption is widely

adopted [Mammen et al., 1999, Tsybakov et al., 2004, Kim et al., 2018,

Imaizumi and Fukumizu, 2018] and referred to as ”boundary fragment”. Let H be some smooth function space from Rd−1 → R. Deﬁne sets GH as

(6.1) G

where x−j = (x1, · · · , xj−1, xj+1, · · · , xd). It is assumed that X+ is composed of ﬁnite union and intersection of sets in GH. The seemingly odd form (6.1) enforces special structures on the indicator function and reduces the complexity of the corresponding sets. A more general assumption on the decision boundary is that the set, denoted as G, containing all possible X+ cannot be too large. This is measured by the bracketing entropy HB of the metric space (G, d△). The more general assumption of the decision boundary is stated as (B) There exists A > 0 and ρ ∈ [0, ∞] such that

Our proposed teacher-student network setting follows this more general assumption and makes more sense in high-dimensional classiﬁcation. Remark 3. If H are β-smooth functions in (6.1), then (B) holds with ρ = (d−1)/β. On the other hand, Lemma 3.9 gives HB ≲ An ∨(Bn log(1/δ)) where in terms of δ, the order is log(1/δ), which corresponds to ρ → 0 and β → ∞. However, this doesn’t necessarily imply the boundary fragment set (6.1) is larger, since An, Bn depend on n and are not absolute constants as in (B). 6.2. Comments on Lemma 3.9. Lemma 3.9 is the main result for controlling the bracketing entropy of the estimation sets. Below we point out some key property of this result and compare to other entropy bounds of neural networks. Exponential Dependence on Depth. The bracketing entropy of GF developed in Lemma 3.9 is much larger than that of F itself with respect to ∥ · ∥∞, as described in Lemma 4.4. The main diﬀerence is the dependence on the number of layers L: the dependence is linear in Lemma 4.4 while exponential in Lemma 3.9. Thus, even though GF is a slice of the subgraph of F, GF is much more complicated than F in term of entropy. We argue that this gap cannot be closed even in the special case d = 1. Montufar et al. [2014] establish a lower bound on the maximum number of linear pieces for a ReLU neural network (Lemma 3.8). Consider a 1-dimensional ReLU DNN function with L layers and 2 nodes on each layer.

Fig 4. Example of a ReLU function in 1D. The induced set where f > 0 is colored red and it’s a union of two intervals . All pieces cross 0 so there are all active.

Corollary 5 of Montufar et al. [2014] show that there exists some f with s = Ω(2L−1) pieces on [0, 1]. With scaling and shifting, assume that on each piece the linear function crosses 0. Then, Gf will be at least ⌊s/2⌋ = Ω(2L−2) intervals. Denote these disjoint intervals to be {(ai, bi)}⌊s/2⌋i=1 . Since they are disjoint, to construct a δ-bracket of all the intervals, we need to δ-cover all the ai’s and bi’s. Similar to the grid argument from the proof of Lemma 3.6, we need at least

Independent of Weights Magnitude. We also want to point out that the entropy of GF is not concerned with the magnitude of the neural network weights, in contrast to the bound in Lemma 4.4. This is because any scaling of the function doesn’t change how it intercepts with zero. Hence, unlike F, the entropy of GF doesn’t depend on the weight maximum B. The Use of ReLU Activation. The reason why we can even bound the entropy of GF critically relies on the fact that we are considering the ReLU activation function. If we consider smooth nonlinear activation functions, e.g. hyperbolic tangent, sigmoid, instead of the order log(1/δ), we can only get the entropy of a much larger order

for some constant A > 0 and α > 0. To see this, consider the case d = 2. Instead of polygons, which can be controlled by the vertices, the regions have smooth boundary and will require O(1/δ) many grid points to cover. Thus the covering number is of order

Thus, the entropy is in a polynomial order of 1/δ.

(A3) will be examined in the setting that the teacher network f∗n has random weights. We will argue that with probability at least 1 − δ, f∗n will satisfy assumption (A3) with Tn = A(δ)/(log n)m∗d2L∗n and cn = B(δ)(log n)m∗d2L∗n, where A(δ), B(δ) are constants depending only on δ and the distribution of the random weights, e.g. normal, truncated normal, etc. Hence, the results which assume Assumption (A3) will hold with high probability. A Toy Case. To illustrate the intuition, consider the case where d = 1 and

(6.2) f∗n(x) =

Since all the weights are almost surely nonzero, we omit the zero weight cases for the analysis. Let pi = (ui, vi), i = 1, 2, . . . , s, denote the active pieces of (6.2). By Lemma 3.7, we know that s = O(log n). For each pi, deﬁne the following quantities: 1. ki = the slope of f∗n(x) on x ∈ pi;

See Figure 5 for an illustration. Then, assumption (A3) is satisﬁed if

Next we will rigorously examine (6.3). From (6.2), each ki can be expressed as w1jw2j for some j ∈ {1, 2, · · · , N∗n}. Therefore, min1≤i≤N∗n{|ki|} = min1≤j≤N∗n{|w1jw2j|}. Since w1j, w2j are i.i.d.

Fig 5. Example of a ReLU function in [0, 1]. There are two active pieces active piece, are illustrated in color red.

standard Gaussian, we have P( min

By choosing k = � δ2N∗n�2, we have min1≤i≤N∗n{|ki|} = Ω(1/ log n) with probability at least 1 − δ. On the other hand, for any i = 1, . . . , s, ti = |f∗n(xhi)| for some hi ∈ {1, · · · , N∗n}, where xhi = −bhi/w1hi. Hence

min1≤i≤s{ti} ≥ min1≤j≤N∗n{|f∗n(xj)|}.

Let . Then, where has an expression of + 1. Hence, for any t > 0,

P(min

1 − δ, mini{ti} ≥ t and t = Ω(1/ log n). Therefore, (6.3) holds with high probability, so that assumption (A3) holds by setting 1/cn = mini{|ki|} and Tn = mini{ti}, which are both in the order of Ω(1/ log n).

General Case. Now we consider the general case d > 1 and L∗n > 1. The teacher network has an expression f∗n(x) = W (L∗n+1)σ(W (L∗n),b(L∗n)) ◦ · · · ◦ σ(W (1),b(1))(x) + b(L∗n+1), x ∈ [0, 1]d.

s = O(log n)m∗L∗nd. Let {xi, x2, . . . , xvs} be the collection of vertices of {p1, . . . , ps}. We call such xi ∈ Rd a piece vertex and it’s not the same as the vertex of {x ∈ X : fn(x) ≥ 0}, which is closely examined in the proof of Lemma 3.9. The following lemma states that vs = O(log n)m∗L∗nd2 in our setting. Lemma 6.1. Let f be a ReLU neural network with d-dimensional input, L hidden layers and width N for every layer. Then, vs = O(N)Ld2. Proof. Recall that w(l)i and b(l)i for i = 1, . . . , N, 1 ≤ l ≤ L are the weight vectors and biases on the l-th hidden layer. For i = 1, . . . , N, deﬁne

which maps Rd → R. We can rewrite f as f(x) =

In other words, f(L−1)i (x) represents the inputs to the i-th ReLU unit in the last hidden layer of f and itself is an (L − 1)-hidden-layer ReLU neural network. The key idea of the proof is by induction. Notice that the piece vertices of f can only come from the following two ways: Type I: The piece vertices of f(L−1)1 , f(L−1)2 , . . . , f(L−1)N , in whose local neighbourhoods, the ReLU units in the last layer doesn’t change sign; Type II: By activations of the ReLU unit in the last layer. i.e. f(L−1)i (x) = 0 for some i = 1, . . . , N. Let V (l) be the maximum number of piece vertices of an l-hidden-layer ReLU neural network with width N and let U(l) be the maximum number of Type II piece vertices created at layer l. Then for 1 < l ≤ L we have (6.5) V (l) ≤ NV (l − 1) + U(l). For U(l), the key is to connect the Type II piece vertices of f to the vertices of {x ∈ X : f(L−1)i (x) ≥ 0}, which has been extensively studied in

Lemma 3.9. To this end, we deﬁne another quantity. On the i-th ReLU unit in the l-th hidden layer, let R(l)i := {x ∈ X : f(l)i (x) = 0}, which consists of (d − 1)-dimensional hyperplane segments. To be speciﬁc, denote all the active pieces of f(l)i (x) to be {p(l)ij : j = 1, . . . , s(l)i }, where s(l)i = O(N)(l−1)d according to Lemma 3.7 for any 1 ≤ i ≤ N. On each active piece p(l)ij , denote

which is part of a (d−1)-dimensional hyperplane. Then we have R(l)i = {h(l)ij :

, a collection of (1)-dimensional hyperplane segments. Let , which corresponds to the piece boundaries of

By deﬁnition, all Type II pieces vertices must reside in at least one of the the activation sets (z = 0 in σ(z)) of the ReLU units in the last layer. R(L) contains all such activation sets for the last hidden layer, i.e. for any h ∈ R, there exists 1 ≤ i ≤ N such that fi(x) = 0, ∀x ∈ h. The Type II pieces vertices are jointly determined by such activation sets and the piece boundary of fi’s (dimension d − 1), i.e. R(L−2)i . Therefore, the total number of such piece vertices can be bounded by

where��R(l)�� denotes the number of elements in R(l), which is bounded by

For V (L), we ﬁrst conclude that V (1) = O(Nd). For a 1-hidden layer ReLU network, the decision boundary of every ReLU unit is a (d − 1)-dimension hyperplane (w1x + b1 = 0). The maximum number of piece vertices is bounded by�Nd�= O(Nd). Then, (6.5) can be repeatedly broken down to V (L) ≤ NV (L − 1) + U(L) ≤ N2V (L − 2) + NU(L − 1) + U(L) ≤ · · · ≤ NL−1V (1) +

As an extension to the toy case, for any 1 ≤ i ≤ N∗n, deﬁne

2. t0 = min1≤i≤vs {|f∗n(xi)|} . That is, ki is the minimal absolute values of the directional derivatives of f∗n on piece pi. Assumption (A3) is satisﬁed if the following holds: (6.6) min1≤i≤s{ki}, t0 = Ω(log n)m∗d2L∗n. We will check (6.6). Since the partial derivative of f∗n(x) for x ∈ pi can be expressed as the product of the random weights, we have min1≤i≤s{ki} ≥ min1≤jl≤N∗n

min

we get that P( min1≤i≤s{ki} < k) ≤ P

min

By taking

k0 =

(N∗n)2(L∗n + 1)

we have that with probability at least 1 − δ, min1≤i≤s{ki} ≥ k0 and k0 = Ω(1/ log n)2m∗(L∗n+1). On the other hand, for any ti, there exist j = 1, . . . , vs such that ti = f∗n(xj). Hence

Let := . Then we have where

which is reminiscent of (6.4) and NL∗n is the width of the last layer and σj(·)’s are outputs (post-activations) from the last layer given W−L∗n. Therefore, for any t > 0, we have P( min1≤j≤vs{|f∗n(xj)|} < t | W−L∗n) ≤

Thus by taking t = δ/(N∗n)d2L∗n, we have that with probability at least 1 − δ, mini{ti} ≥ t and t = Ω(1/ log n)m∗d2L∗n. Therefore, (6.6) holds. That is to say, when d ≥ 2, with high probability, Assumption (A3) holds in which cn, 1/Tn = O(log n)m∗d2L∗n. Notice that the probability arguments used in this section don’t rely on Gaussian distribution. As long as all weights are i.i.d. with distribution that doesn’t have a point mass at 0, our claim holds.