28:["$","$L31",null,{"isWhiteLabelled":false,"children":["$","$Lc",null,{"pt":{"compact":0,"expanded":3},"children":[["$","$L32",null,{"noStar":true,"publisher":true,"task":true,"params":true,"size":"xl","product":{"id":"eyJwYXBlcklEIjoiMjAwMS4wNjg5MiIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","updated":"2020-02-01T04:58:57.000Z","paperID":"2001.06892","published":"2020-01-19T19:58:43.000Z","authors":"[\"Tianyang Hu\",\"Zuofeng Shang\",\"Guang Cheng\"]","title":"Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting","scoreTrending":null,"summary":"$33","lastCheckedForCode":"2025-10-18T11:58:38.350Z","links":[],"reposConnection":{"edges":[]},"models":[],"tags":[],"summaries":[{"model":"gpt-4o-mini","header":"paper.summary.expertise.beginner","summary":"This research paper explores how deep neural networks (DNNs) can be very effective for classifying data, especially when the data has a lot of dimensions, like images. It looks at a method called the \"teacher-student setting,\" where a simpler model (student) learns from a more complex one (teacher). The main finding is that DNNs can learn quickly and accurately in high-dimensional classifying tasks, breaking the usual limitations of traditional methods. 
This helps us understand why DNNs perform so well in real-world problems, like image recognition."}],"emailsConnection":{"edges":[]},"__typename":"paper","authorArray":["Tianyang Hu","Zuofeng Shang","Guang Cheng"]}}],["$","$L25",null,{"container":true,"columns":100,"spacing":{"compact":0,"expanded":2,"large":3},"children":[["$","$L25",null,{"size":{"compact":100,"expanded":100,"large":68},"children":[["$","$8",null,{"children":["$","$L34",null,{"publisher":"arxiv","paperID":"2001.06892","product":{"paper":"$28:props:children:props:children:0:props:product","models":"$28:props:children:props:children:0:props:product:models"},"isWhiteLabelled":false}]}],["$","$8",null,{"children":["$","$L35",null,{"article":"$L36","model":"$undefined"}]}]]}],["$","$L25",null,{"size":"grow","children":["$","$L37",null,{}]}]]}],["$","$8",null,{"children":null}],[["$","audio",null,{"id":"tts"}],["$","$L38",null,{"paperID":"2001.06892","publisher":"arxiv","paperJSON":{"title":"Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting","paperID":"2001.06892","avgLineHeight":13.57,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"SHARP RATE OF CONVERGENCE FOR DEEP NEURAL NETWORK CLASSIFIERS UNDER THE TEACHER-STUDENT SETTING","element":"span"}],[{"style":{"width":"84%"},"width":1215,"height":811,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/0-0.png","element":"img"}]]},{"heading":"1. Introduction. Deep learning has gained tremendous success in clas-siﬁcation problems such as image classiﬁcations [Deng et al., 2009b]. With the introduction of convolutional neural network [Krizhevsky et al., 2012] and residual neural network [He et al., 2016], various benchmarks in computer vision have been revolutionized and neural network based methods have achieved better-than-human performance [Nguyen et al., 2017]. 
For instance,","paragraphs":[[{"text":"AlexNet [","element":"span"},{"href":"#id-0","text":"Krizhevsky et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","text":"2012","element":"a"},{"text":"] and its variants [","element":"span"},{"href":"#id-1","text":"Zeiler and Fergus","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","text":"2014","element":"a"},{"text":",","element":"span"}]]},{"heading":"Simonyan and Zisserman, 2014] have demonstrated superior performance","paragraphs":[[{"text":"in ImageNet data [","element":"span"},{"href":"#id-2","text":"Deng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","text":"2009a","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","text":"Russakovsky et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","text":"2015","element":"a"},{"text":"], where the","element":"span"}]]},{"heading":"data dimension is huge, i.e., each image has pixel size 256 × 256 and hence is an 65536-dimensional vector. Traditional statistical thinking sounds an alarm when facing such high-dimension data as the “curse of dimensionality” usually prevents nonparametric classiﬁcation achieving fast convergence rates. This work attempts to provide a theoretical explanation for the empirical success of deep neural networks (DNN) in (especially high dimensional) classiﬁcation, beyond the existing statistical theories.","paragraphs":[]},{"heading":"In the context of nonparametric regression, similar investigations have been recently carried out. 
Among others [Farrell et al., 2018, Suzuki, 2018,","paragraphs":[[{"href":"#id-4","text":"Nakada and Imaizumi","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","text":"2019","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","text":"Oono and Suzuki","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","text":"2019","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","text":"Chen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","text":"2019","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","text":"Liu","element":"a"}]]},{"heading":"$39","paragraphs":[[{"text":"[","element":"span"},{"href":"#id-8","text":"Saad and Solla","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","text":"1996","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","text":"Mace and Coolen","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","text":"1998","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","text":"Engel and Broeck","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","text":"2002","element":"a"},{"text":"] and","element":"span"}]]},{"heading":"$3a","paragraphs":[[{"href":"#id-11","text":"2019","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","text":"Goldt et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","text":"2019","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","text":"Zhang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","text":"2019","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","text":"Cao and Gu","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","text":"2019","element":"a"},{"text":"]. Still, there","element":"span"}]]},{"heading":"is a lack of statistical understanding in this important direction, particularly on classiﬁcation aspects. 
In this paper, we consider binary classification, and focus on the teacher-student framework where the optimal decision region is defined by ReLU neural networks. This setting is closely related to the classical smooth boundary assumption where the neural networks are substituted by smooth functions. Specifically, a well-adopted assumption called “boundary fragment”","paragraphs":[[{"text":"[","element":"span"},{"href":"#id-15","text":"Mammen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","text":"1999","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","text":"Tsybakov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","text":"2004","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","text":"Imaizumi and Fukumizu","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","text":"2018","element":"a"},{"text":"]","element":"span"}]]},{"heading":"$3b","paragraphs":[]},{"heading":"and provide conditions that guarantee zero training error at all local minima of appropriately chosen surrogate loss functions. Additionally, Lyu and Li [2019]","paragraphs":[[{"text":"show that under exponential loss [","element":"span"},{"href":"#id-18","text":"Soudry et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","text":"2018","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","text":"Gunasekar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","text":"2018","element":"a"},{"text":"],","element":"span"}]]},{"heading":"$3c","paragraphs":[[{"style":{"width":"22%"},"width":330,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/3-0.png","element":"img"}]]},{"heading":"2.1. Neural Network Setup. We consider deep neural networks with Rectified Linear Unit (ReLU) activation σ(x) = max{x, 0}. 
For an L hidden layer ReLU neural network f(·), let the width of each layer be n0, n1, · · · , nL, where n0 = d is the input dimension, and denote the weight matrices and bias vectors in each layer to be W (l) and b(l), respectively. Let σ(W ,b)(x) = σ(W · x + b) and ◦ represent function composition. Then, the ReLU DNN can be written as","paragraphs":[[{"style":{"width":"90%"},"width":1306,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/3-1.png","element":"img"}]]},{"heading":"where Θ = {(W (l), b(l))}l=1,...,L+1 denotes the parameter set. For any given Θ, let |Θ| be the number of hidden layers in Θ, and Nmax(Θ) be the maximum width. We deﬁne ∥Θ∥0 as the number of nonzero parameters: ∥Θ∥0 =","paragraphs":[]},{"heading":"�∥vec(W (l))∥0 + ∥b(l)∥0�,","paragraphs":[[{"style":{"width":"3%"},"width":51,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/3-2.png","element":"img"}]]},{"heading":"where vec(W (l)) transforms the matrix W (l) into the corresponding vector by concatenating the column vectors. Similarly, we deﬁne ∥Θ∥∞ as the largest absolute value of the parameters in Θ, ∥Θ∥∞ = max","paragraphs":[]},{"heading":"max1≤l≤L+1 ∥vec(W (l))∥∞, max1≤l≤L+1 ∥b(l)∥∞","paragraphs":[]},{"heading":"For any given n, let Fn be","paragraphs":[[{"style":{"width":"70%"},"width":1013,"height":188,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/4-0.png","element":"img"}]]},{"heading":"2.2. Binary Classiﬁcation. Consider binary classiﬁcation with a feature vector x ∈ X ⊂ Rd and a label y ∈ {−1, 1}. Assume x|y = 1 ∼ p(x), x|y = −1 ∼ q(x) where p and q are two bounded densities on X w.r.t. base measure Q. If p, q have disjoint support, we say the data distribution or the classiﬁcation problem is separable. For simplicity, assume that Q is Lebesgue measure, and positive and negative labels are equally likely to appear, i.e., balanced labels. 
The objective of classiﬁcation is to ﬁnd an optimal classiﬁer (called the Bayes classiﬁer) C∗ within some classiﬁer family C, that minimizes the 0-1 loss deﬁned as C∗ = argmin","paragraphs":[]},{"heading":"R(C) := argmin","paragraphs":[]},{"heading":"E [I{C(x) ̸= y}] . We can estimate C∗ based on the training data by minimizing the empirical 0-1 risk as follows �Cn = argmin","paragraphs":[]},{"heading":"Rn(C) := argmin","paragraphs":[[{"style":{"width":"35%"},"width":508,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/4-1.png","element":"img"}]]},{"heading":"where Cn is a given class of classiﬁers possibly depending on the sample size n. In practice, the above 0-1 loss is often replaced by its (computationally feasible) surrogate counterparts [Bartlett et al., 2006], such as hinge loss φ(z) = (1 − z)+ = max{1 − z, 0} or logistic loss φ(z) = log(1 + exp(−z)). Given a surrogate loss φ, we ﬁrst obtain �fφ by minimizing the empirical risk Rφ,n(f) =","paragraphs":[[{"style":{"width":"3%"},"width":52,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/4-2.png","element":"img"}]]},{"heading":"over F, and then construct a classiﬁer by �Cφ(x) = sign( �fφ(x)). Accordingly,","paragraphs":[[{"style":{"width":"99%"},"width":1433,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/5-0.png","element":"img"}]]},{"heading":"the population risk. Given that C(x) = sign(f(x)), with a slight abuse of notation, we write R(C) and R(f) interchangeably. A classiﬁer C is evaluated by its excess risk deﬁned as the diﬀerence of the population risk between C and the Bayes optimal classiﬁer C∗ that E(C, C∗) = R(C) − R(C∗). 
Our goal is to derive sharp convergence rates of E(C, C∗) under diﬀerent losses.","paragraphs":[[{"style":{"width":"96%"},"width":1388,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/5-1.png","element":"img"}]]},{"heading":"we set up the teacher-student framework for classiﬁcation under which sharp rates for the excess risk are developed. Such a teacher-student bound sets an algorithmic independent benchmark for various deep neural network classiﬁers and also helps understand the role of input data dimension in the classiﬁcation performance. The Bayes classiﬁer C∗ is deﬁned via the optimal decision region G∗ := {x ∈ X, p(x) − q(x) ≥ 0}. The set estimate �G = {x ∈ X, �f(x) ≥ 0} can be constructed through deep neural network classiﬁers �f : Rd → R trained using either 0-1 loss or surrogate loss. Accordingly, a natural teacher network assumption is that p(x) − q(x) can be expressed by some neural","paragraphs":[[{"style":{"width":"99%"},"width":1433,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/5-2.png","element":"img"}]]},{"heading":"such an assumption is not uncommon in high-dimensional statistics, where population quantities may depend on the sample size n, e.g., Zhao and Yu","paragraphs":[[{"style":{"width":"8%"},"width":115,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/5-3.png","element":"img"}]]},{"heading":"3.1. Training with 0-1 Loss. In this section, we focus on, for the theoretical purpose, DNN classiﬁers trained with the empirical 0-1 loss. Denote �fn = argmin","paragraphs":[]},{"heading":"1","paragraphs":[[{"style":{"width":"44%"},"width":642,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/5-4.png","element":"img"}]]},{"heading":"given a certain DNN family Fn. It is important to control the complexity of the underlying classiﬁcation problem. 
Otherwise, the student network would not be able to recover the Bayes classiﬁer [Telgarsky, 2015] with suﬃcient accuracy. To this end, we impose the following teacher network assumptions on (p(x) − q(x)): (A1) p, q have compact supports. (A2) p(x) − q(x) is representable by some teacher ReLU DNN f∗n ∈ F∗n with N∗n = O (log n)m∗ , L∗n = O (1) for some m∗ ≥ 1. (A3) For any n, there exists cn, 1/Tn = O(log n)m∗d2L∗n such that for all","paragraphs":[[{"style":{"width":"65%"},"width":937,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/5-5.png","element":"img"}]]},{"heading":"Assumption (A3) characterizes how concentrated the data are around the decision boundary, which can be seen as an extension to the classical Tsybakov noise condition [Mammen et al., 1999]. The diﬀerence is that in our case, the underlying densities are indexed by sample size and thus cn and Tn are allowed to vary with n. Assumption (A3) is not unrealistic as we will show that it holds with high probability if the teacher network is random as stated in the following lemma (see Appendix 6.3 for detail).","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Lemma 3.1. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16.72},"width":43.06,"height":41.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/6-0.png","element":"img","alt":" f∗n","inline":true,"padRight":true},{"text":"be the teacher network with structures specified in","element":"span"}]]},{"heading":"assumption (A2). Suppose that all weights of f∗n are i.i.d. with any continuous distribution, e.g. Gaussian, truncated Gaussian, etc.. Then, with probability at least 1 − δ, assumption (A3) holds with cn, 1/Tn ≤ A(δ)(log n)m∗d2L∗n where A(δ) is some constant depending on δ. The following theorem characterizes how well the student network of proper size can learn from the teacher in terms of the excess risk. Theorem 3.2. 
Under the teacher assumptions (A1) through (A3), denote","paragraphs":[[{"style":{"width":"100%"},"width":1439,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/6-1.png","element":"img"}]]},{"heading":"Let Fn be a student ReLU DNN family with Nn = O(log n)m and Ln = O(1) for some m ≥ m∗ and assume the student network is larger than the teacher network in the sense that Ln ≥ L∗n, Sn ≥ S∗n, Nn ≥ N∗n, Bn ≥ B∗n. Then the excess risk for �fn ∈ Fn satisﬁes sup","paragraphs":[]},{"heading":"� 1","paragraphs":[[{"style":{"width":"62%"},"width":898,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/6-2.png","element":"img"}]]},{"heading":"where �Od hides the log n terms, which depend on d. The dependence on the dimension d is in the order of O(log n)d2. We further argue that under the present setting, the rate n−2/3 in Theorem 3.2 cannot be further improved. Theorem 3.3. Under the same assumptions of p, q as in Theorem 3.2 that","paragraphs":[[{"style":{"width":"74%"},"width":1064,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/6-3.png","element":"img"}]]},{"heading":"inf","paragraphs":[]},{"heading":"sup","paragraphs":[]},{"heading":"� 1","paragraphs":[[{"style":{"width":"66%"},"width":953,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/6-4.png","element":"img"}]]},{"heading":"where �Ωd hides the log n terms, which depend on d. Theorem 3.3 shows that the convergence rate achieved by the empirical 0-1 loss minimizer cannot be further improved (up to a logarithmic term). If p and q have disjoint supports, i.e. separable, which could be true in some","paragraphs":[]},{"heading":"image data, the rate improves to n−1, as stated in the following corollary. This rate improvement is not surprising since the classiﬁcation task becomes much easier for separable data. Corollary 3.4. 
Under the same setting as in Theorem 3.2, if we further assume p, q have disjoint supports, then the rate of convergence of the empirical 0-1 loss minimizer improves to inf","paragraphs":[]},{"heading":"sup","paragraphs":[]},{"heading":"� 1","paragraphs":[[{"style":{"width":"48%"},"width":691,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/7-0.png","element":"img"}]]},{"heading":"Remark 1 (Disjoint Support). Given that data are separable, Srebro et al. [2010] derived the excess risk bound as O(D log n/n) (under a smooth loss) where D is the VC-subgraph-dimension of the estimation family. Additionally, separability implies that the noise exponent κ in Tsybakov noise condition","paragraphs":[[{"text":"[","element":"span"},{"href":"#id-15","text":"Mammen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","text":"1999","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","text":"Tsybakov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","text":"2004","element":"a"},{"text":"] can be arbitrarily large, which","element":"span"}]]},{"heading":"$3d","paragraphs":[[{"style":{"width":"99%"},"width":1434,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/7-1.png","element":"img"}]]},{"heading":"network is of the following order (3.1) o","paragraphs":[[{"style":{"width":"37%"},"width":538,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/7-2.png","element":"img"}]]},{"heading":"where n1, · · · , nLn are the width of each hidden layer of the student network.","paragraphs":[[{"style":{"width":"62%"},"width":902,"height":370,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/8-0.png","element":"img"}],[{"text":"Fig 1","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":". 
Example of a ReLU DNN function in ","element":"figcaption","subtype":"caption"},{"text":"[0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":", ","element":"figcaption","subtype":"caption"},{"text":"1]","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":". There are 5 pieces ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":278.2,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/8-1.png","element":"img","alt":" p1, p2, . . . , p5 and","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"among them, only ","element":"figcaption","subtype":"caption"},{"style":{"height":9.2},"width":136.49,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/8-2.png","element":"img","alt":" p1, p4, p5","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"cross value 0 (horizontal line). There are 3 active pieces in this example and they are colored red.","element":"figcaption","subtype":"caption"}]]},{"heading":"$3e","paragraphs":[[{"style":{"width":"1%"},"width":15,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/8-3.png","element":"img"}]]},{"heading":"|p(x) − q(x)|Q(dx). There are two key factors governing the rate of convergence in classiﬁcation: • How concentrated the data are around the decision boundary; • The complexity of the set G∗ where the optimal G∗ resides. 
For the first factor, the following Tsybakov noise condition [Mammen et al., 1999] quantifies how close p and q are:","paragraphs":[]},{"heading":"(N) There exist constants c > 0 and κ ∈ [0, ∞] such that for any 0 ≤ t ≤ T","paragraphs":[[{"style":{"width":"43%"},"width":619,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/9-0.png","element":"img"}]]},{"heading":"$3f","paragraphs":[[{"style":{"width":"41%"},"width":594,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/9-1.png","element":"img"}]]},{"heading":"(By convention, [xi, xj] is empty if xi > xj.) Then ([xi, xj+1], [xi+1, xj]) is a 2δ-bracket of [a, b] since obviously","paragraphs":[[{"text":"(3.2) ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"style":{"height":18.22},"width":1103.36,"height":45.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/9-2.png","element":"img","alt":"i+1, xj] ⊂ [a, b] ⊂ [xi, xj+1], d△([xi, xj+1], [xi+1, xj]) ≤ 2δ.","inline":true}],[{"style":{"width":"43%"},"width":627,"height":629,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/10-0.png","element":"img"}],[{"text":"Fig 2","element":"figcaption","subtype":"caption"},{"id":"id-20","style":{"fontStyle":"italic"},"text":". Grid in 2D and the outer cover (green) constructed with grid points for a polygon ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(blue).","element":"figcaption","subtype":"caption"}]]},{"heading":"There are (Mδ+1 choose 2) different choices of [xi, xj], hence (Mδ+1 choose 2) different choices of the pairs ([xi, xj+1], [xi+1, xj]). Any [a, b] ⊂ [0, 1] can be 2δ-bracketed by one of these pairs in the sense of (3.2). This shows that HB(2δ) ≤ log (Mδ+1 choose 2) ≤ 2 log(1/δ). 
When d ≥ 2, any G ∈ ¯G has at most S vertices, so ¯G := G ∩ [0, 1]d has at most dS vertices where the factor d is due to the fact that each edge of G intersects at most d edges of [0, 1]d therefore creates at most dS vertices for ¯G. For any polygon G(x1, · · · , xs) where s ≤ dS, denote G−√dδ(x1, · · · , xs) =","paragraphs":[[{"style":{"width":"99%"},"width":1434,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/10-1.png","element":"img"}]]},{"heading":"to see that there exist v11, . . . , vd1, v12, . . . , vd2, · · · · · · , v1s, . . . , vds ∈ Xdδ , where","paragraphs":[[{"style":{"width":"54%"},"width":784,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/10-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":"(","element":"span"},{"style":{"height":19.81},"width":1099.58,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/10-3.png","element":"img","alt":"x1, · · · , xs) ⊂ G(v11, . . . , vd1, v12, . . . , vd2, · · · · · · , v1s, . . . , vds);","inline":true},{"style":{"height":22.1},"width":1094.17,"height":55.26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/10-4.png","element":"img","alt":"• ∥vji − xi∥2 ≤√dδ for i = 1, 2 · · · , s and j = 1, 2, · · · , d.","inline":true}]]},{"heading":"See Figure 2 for an illustration when d = 2. Similarly for G(x−1 , · · · , x−s ),","paragraphs":[[{"text":"there exist ","element":"span"},{"href":"#id-20","style":{"height":20.45},"width":1076.6,"height":51.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/10-5.png","element":"img","alt":" u11, . . . , ud1, u12, . . . , ud2, · · · · · · , u1s, . . . 
, uds ∈ Xdδ such that","inline":true}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":"(","element":"span"},{"style":{"height":20.13},"width":1139.92,"height":50.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/10-6.png","element":"img","alt":"x−1 , · · · , x−s ) ⊂ G(u11, . . . , ud1, u12, . . . , ud2, · · · · · · , u1s, . . . , uds);","inline":true},{"style":{"height":22.1},"width":1112.36,"height":55.26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/10-7.png","element":"img","alt":"• ∥uji − x−i ∥2 ≤√dδ for i = 1, 2 · · · , s and j = 1, 2, · · · , d.","inline":true}]]},{"heading":"By the deﬁnition of G−√dδ, we have ∥xi−x−i ∥2 ≥√dδ. Thus ∥uji −x−i ∥2 ≤√dδ implies G(u11, . . . , ud1, · · · · · · , u1s, . . . , uds) ⊂ G(x1, · · · , xs). On the other","paragraphs":[]},{"heading":"hand,","paragraphs":[[{"style":{"width":"84%"},"width":1214,"height":192,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/11-0.png","element":"img"}]]},{"heading":"where the term s is due to the fact that G(x1, · · · , xs) has at most O(s) faces. Notice that","paragraphs":[[{"style":{"width":"84%"},"width":1220,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/11-1.png","element":"img"}]]},{"heading":"and s ≤ dS. Thus, with at most (Mδ + 1)d2S pairs of subsets in ¯G, we can 2d3/2Sδ-bracket any ¯G ∈ ¯G. Therefore, log NB((2d3/2Sδ), ¯G, d△) ≲ log�(Mδ + 1)d2S�, which implies log NB(δ, ¯G, d△) ≲ d2S log(d3/2S/δ). Lemma 3.7 (Theorem 1 in [Serra et al., 2017]). Consider a deep ReLU network with L layers, nl ReLU nodes at each layer l, and an input of dimension n0. 
The maximal number of linear pieces of this neural network is at most","paragraphs":[[{"style":{"width":"25%"},"width":363,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/11-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":19.54},"width":479.25,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/11-3.png","element":"img","alt":" J = {(j1, . . . , jL) ∈ ZL","inline":true,"padRight":true},{"text":": 0 ","element":"span"},{"style":{"height":17.6},"width":707.66,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/11-4.png","element":"img","alt":" ≤ jl ≤ min{n0, n1 − j1, . . . , nl−1 −","inline":true}]]},{"heading":"jl−1, nl} ∀l = 1, . . . , L}. This bound is tight when L = 1. When n0 = O(1) and all layers have the same width N, we have the same best known asymptotic bound O(NLn0) ﬁrst presented in [Raghu et al., 2017]. Consider a deep ReLU network with n0 = d inputs and L hidden layers of widths ni ≥ n0 for all i ∈ [L]. The following lemma establishes a lower bound for the maximal number of linear pieces of deep ReLU networks: Lemma 3.8 (Theorem 4 in [Montufar et al., 2014]). The maximal number of linear pieces of a ReLU network with n0 input units, L hidden layers, and ni ≥ n0 rectiﬁers on the i-th layer, is lower bounded by","paragraphs":[[{"style":{"width":"33%"},"width":479,"height":136,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/11-5.png","element":"img"}],[{"style":{"width":"98%"},"width":1417,"height":311,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/12-0.png","element":"img"}],[{"text":"Fig 3","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":". 
Demonstration of how a polygon in ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"d ","element":"figcaption","subtype":"caption"},{"text":"= 2 ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"case can be divided into basic triangles. The union of the two brackets forms a bracket of the original polygon. The blue shade is the symmetric difference.","element":"figcaption","subtype":"caption"}]]},{"heading":"$40","paragraphs":[]},{"heading":"the d = 2 case. Therefore, the bracketing number of the polyhedrons can be derived by bracketing the basic polyhedrons. For a basic polyhedron B, denote its δ-bracketing pair by (UB,δ, VB,δ), i.e., UB,δ ⊂ B ⊂ VB,δ. Then (UG,δ, VG,δ), defined below,","paragraphs":[[{"style":{"width":"43%"},"width":619,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/13-0.png","element":"img"}]]},{"heading":"forms an (sδ)-bracket of G. Hence, the bracketing number of all polyhedrons is controlled by the s-th power of the bracketing number of all basic polyhedrons. Applying Lemma 3.7, we know s = O(N^{Ld}) and the number of vertices is at most S = O(N^{Ld^2}). Together with Lemma 3.6, we therefore get that log NB(Sδ, GF, d△) ≲ S(d + 1)d^2 log((d + 1)d^{3/2}/δ), which implies log NB(δ, GF, d△) ≲ N^{Ld^2}d^3 log(N^{Ld^2}d^3/δ) ≲ N^{Ld^2}d^3 (Ld^2 log(N) ∨ log(1/δ)). More discussions about Lemma 3.9 can be found in Appendix 6.2. Next, we present some lemmas that can take advantage of the obtained entropy bound and eventually take us to the proof of the excess risk convergence rate. Lemma 3.10 (Theorem 5.11 in Van De Geer [2000]). Consider a function space H with sup_{h∈H} ∥h(x)∥∞ ≤ K and sup_{h∈H} ∥h(x)∥L2(P) ≤ R, where P is the distribution of x. 
Take a > 0 satisfying (1) a ≤ C1√nR2/K; (2)","paragraphs":[[{"style":{"width":"80%"},"width":1159,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/13-1.png","element":"img"}]]},{"heading":"(3) a ≥ C0","paragraphs":[]},{"heading":";","paragraphs":[[{"style":{"width":"89%"},"width":1280,"height":206,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/13-2.png","element":"img"}]]},{"heading":"sup","paragraphs":[[{"style":{"width":"77%"},"width":1113,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/13-3.png","element":"img"}]]},{"heading":"where Pn is the empirical counterpart of P.","paragraphs":[]},{"heading":"So far, the presented lemmas are only concerned with the general case, i.e. set G∗, p, q, etc. that does not depend on n. However, in our teacher-student","paragraphs":[[{"text":"framework, the optimal set ","element":"span"},{"style":{"height":16.72},"width":55.31,"height":41.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/14-0.png","element":"img","alt":" G∗n","inline":true,"padRight":true},{"text":"is indexed by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"as it’s determined by the ","element":"span"},{"text":"teacher network ","element":"span"},{"style":{"height":16.71},"width":52.7,"height":41.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/14-1.png","element":"img","alt":" F∗n","inline":true},{"text":". In the remaining part of the proof, we will consider","element":"span"}]]},{"heading":"speciﬁcally for our teacher network case. The next lemma investigates the modulus of continuity of the empirical process. It’s similar to Lemma 5.13 in Van De Geer [2000] but with a key diﬀerence in the entropy assumption (3.3), where the entropy bound contains n. Lemma 3.11. For a probability measure P, let Hn be a class of uniformly bounded (by 1) functions h in L2(P) depending on n. 
Suppose that the δ-entropy with bracketing HB(δ, Hn, L2(P)) satisﬁes, for some An > 0, the inequality HB(δ, Hn, L2(P)) ≤ An log(1/δ)(3.3) for all δ > 0 small enough. Let hn0 be a ﬁxed element in Hn. Let Hn(δ) = {hn ∈ Hn : ∥hn −hn0∥L2(P) ≤ δ}. Then there exist constants D1 > 0, D2 > 0 such that for a sequence of i.i.d. random variables x1, · · · , xn with probability distribution P, it holds that","paragraphs":[[{"style":{"width":"85%"},"width":1235,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/14-2.png","element":"img"}]]},{"heading":"sup","paragraphs":[[{"style":{"width":"88%"},"width":1266,"height":155,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/14-3.png","element":"img"}]]},{"heading":"for all x ≥ 1. Proof. The main tool for the proof is Lemma 3.10. Replace H with Hn(δ) in Lemma 3.10 and take K = 4, R =√2δ and a = 12C1√Anδ log(1/δ), with C1 = 2√2C0. Then (1) is satisﬁed if (3.4)","paragraphs":[]},{"heading":"log(1δ ) ≤ √n. Under (3.4), condition (2) and (3) are satisﬁed automatically. Choosing C0 suﬃciently large will ensure (4). Thus, for all δ satisfying (3.4), we have","paragraphs":[[{"style":{"width":"79%"},"width":1142,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/14-4.png","element":"img"}]]},{"heading":"sup","paragraphs":[]},{"heading":"�Anδ log(1/δ)","paragraphs":[[{"style":{"width":"79%"},"width":1141,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/14-5.png","element":"img"}]]},{"heading":"≤ C exp","paragraphs":[[{"style":{"width":"24%"},"width":355,"height":10,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/14-6.png","element":"img"}]]},{"heading":"Notice that (3.4) holds if δ ≥�An/n. Let B = min{b > 1 : 2−b ≤�An/n} and apply the peeling device. 
Then,","paragraphs":[[{"style":{"width":"79%"},"width":1138,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/15-0.png","element":"img"}]]},{"heading":"sup","paragraphs":[[{"style":{"width":"92%"},"width":1327,"height":136,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/15-1.png","element":"img"}]]},{"heading":"sup","paragraphs":[]},{"heading":"�An2−b log(2b)","paragraphs":[[{"style":{"width":"91%"},"width":1313,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/15-2.png","element":"img"}]]},{"heading":"C exp","paragraphs":[]},{"heading":"≤ 2C exp","paragraphs":[[{"style":{"width":"66%"},"width":959,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/15-3.png","element":"img"}]]},{"heading":"if C1An is sufficiently large. We then present a lemma that establishes the connection between d△ and dp,q, which is adapted from Lemma 2 in Mammen et al. [1999] to our teacher network setting. Corresponding to assumption (A3), we define (Nn) as an extension of the classical Tsybakov noise condition (N). (Nn) There exists cn > 0 depending on n and κ ∈ [0, ∞] such that for any","paragraphs":[[{"style":{"width":"69%"},"width":1002,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/15-4.png","element":"img"}]]},{"heading":"Note that (N) is a special case of (Nn) with Tn and cn being absolute constants. Lemma 3.12. Assume (Nn) and pn, qn are bounded by b2 > 0. Then, there exists a constant b1(κ) > 0 depending on κ such that for any Lebesgue measurable subsets G1 and G2 of X,","paragraphs":[[{"style":{"width":"92%"},"width":1335,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/15-5.png","element":"img"}]]},{"heading":"Proof. The second inequality is trivial given that p, q are bounded by b2. 
For the ﬁrst inequality, since Q(|pn − qn| ≤ t) ≤ cntκ for all 0 ≤ t ≤ Tn, the boundedness of Q(X) implies that","paragraphs":[[{"style":{"width":"45%"},"width":649,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/15-6.png","element":"img"}]]},{"heading":"where An =�Q(X)T κn ∨ cn�. Then,","paragraphs":[[{"style":{"width":"93%"},"width":1348,"height":268,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/16-0.png","element":"img"}]]},{"heading":"2An","paragraphs":[]},{"heading":")","paragraphs":[[{"style":{"width":"96%"},"width":1380,"height":424,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/16-1.png","element":"img"}]]},{"heading":"Our goal in classiﬁcation is to estimate G∗n by �Gn = argminG∈Gn Rn(G), where Gn is some collection of sets associated with the student network Fn and Rn(G) = 12n","paragraphs":[]},{"heading":"(I{xi ∈ G|yi = 1}(x) + I{xi /∈ G|yi = −1}(x)) . Similar to Theorem 1 in Mammen et al. [1999], we have the following lemma regarding the upper bound on the rate of convergence.","paragraphs":[[{"style":{"width":"99%"},"width":1433,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/16-2.png","element":"img"}]]},{"heading":"of X ⊂ Rd. Deﬁne","paragraphs":[]},{"heading":"(3.5) where b2 is an absolute constant. Let Gn be another class of subsets satisfying","paragraphs":[[{"style":{"width":"99%"},"width":1433,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/16-3.png","element":"img"}]]},{"heading":"that for any δ > 0 small enough, (3.6) HB(δ, Gn, d△) ≤ An log(1/δ). 
Then we have (3.7) lim","paragraphs":[]},{"heading":"sup","paragraphs":[]},{"heading":"�An log2 n","paragraphs":[[{"style":{"width":"97%"},"width":1407,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/16-4.png","element":"img"}],[{"style":{"width":"96%"},"width":1385,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/17-0.png","element":"img"}]]},{"heading":"given set G ∈ X, let hG(x) = I{x ∈ G}. In particular, let h∗n = hG∗n. Let ∥h∥2p =�h2(x)p(x)Q(dx). Since both pn and qn are bounded,","paragraphs":[[{"style":{"height":42.77},"width":1009.24,"height":106.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/17-1.png","element":"img","alt":"∥hGn − h∗n∥2p =�Gn△G∗npn(x)Q(dx) ≤ b2d△(Gn, G∗n),","inline":true},{"style":{"height":54.8},"width":1219.4,"height":137,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/17-2.png","element":"img","alt":"∥hGn − h∗n∥2q =�Gn△G∗nqn(x)Q(dx) ≤ b2d△(Gn, G∗n).(3.8)","inline":true}]]},{"heading":"Consider the random variable","paragraphs":[[{"style":{"width":"72%"},"width":1043,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/17-3.png","element":"img"}]]},{"heading":"And△(G∗n, �Gn) log(1/d△(G∗n, �Gn))","paragraphs":[[{"style":{"width":"64%"},"width":932,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/17-4.png","element":"img"}]]},{"heading":"(3.9) √nE(Rn( �Gn) − Rn(G∗n))� And△(G∗n, �Gn) log(1/d△(G∗n, �Gn))","paragraphs":[]},{"heading":"Note that","paragraphs":[[{"style":{"width":"64%"},"width":933,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/17-5.png","element":"img"}]]},{"heading":"+ 1 2n","paragraphs":[[{"style":{"width":"3%"},"width":51,"height":5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/17-6.png","element":"img"}]]},{"heading":"Then Vn can be written 
as","paragraphs":[[{"style":{"height":58.74},"width":1268.98,"height":146.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/17-7.png","element":"img","alt":"Vn =(1/2n) �ni=1 I{yi=1}(h �Gn − h∗n)(xi) − E(I{y=1}(h �Gn − h∗n)(x))�","inline":true},{"style":{"height":17.6},"width":688.24,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/17-8.png","element":"img","alt":"And△(G∗n, �Gn)/n log(1/d△(G∗n, �Gn))","inline":true}]]},{"heading":"+","paragraphs":[[{"style":{"width":"85%"},"width":1222,"height":220,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/17-9.png","element":"img"}]]},{"heading":"Consider the event En = {d△(G∗n, �Gn) >�An/n} and let �Gn = {G ∈ Gn :","paragraphs":[]},{"heading":"d△(G, G∗n) >�An/n}. If En holds, then","paragraphs":[[{"style":{"fontStyle":"italic"},"text":"V","element":"span"},{"style":{"height":55.28},"width":1012.78,"height":138.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/18-0.png","element":"img","alt":"n = − √n Rn( �Gn) − Rn(G∗n) − E(Rn( �Gn) − Rn(G∗n))�","inline":true}]]},{"heading":"And△(G∗n, �Gn) log(1/d△(G∗n, �Gn)) ≤ sup","paragraphs":[]},{"heading":"And△(G∗n, �Gn) log(1/d△(G∗n, �Gn)) ≤ sup","paragraphs":[]},{"heading":"+ sup","paragraphs":[[{"style":{"width":"87%"},"width":1259,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/18-1.png","element":"img"}]]},{"heading":"≤ sup","paragraphs":[[{"style":{"width":"87%"},"width":1259,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/18-2.png","element":"img"}]]},{"heading":"sup","paragraphs":[[{"style":{"width":"79%"},"width":1137,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/18-3.png","element":"img"}]]},{"heading":"where Hn = {hn(x) = I{x ∈ Gn} : Gn ∈ Gn}. The last inequality follows from the fact that √x log(1/x) is strictly increasing when x < 1. 
Notice that hn’s are uniformly bounded by 1 and the L2 norm squared of hG1 − hG2 is d△(G1, G2). Applying Lemma 3.11, we have (3.10) E[VnI(En)] ≤ C for some ﬁnite constant C. Now we use this inequality to prove the main result. From (3.9), we know that","paragraphs":[[{"style":{"width":"84%"},"width":1216,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/18-4.png","element":"img"}]]},{"heading":"which, together with Lemma 3.12, yields that","paragraphs":[[{"style":{"width":"92%"},"width":1322,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/18-5.png","element":"img"}]]},{"heading":"which simpliﬁes to be","paragraphs":[]},{"heading":"�An log2 n","paragraphs":[[{"style":{"width":"78%"},"width":1129,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/18-6.png","element":"img"}],[{"style":{"width":"0%"},"width":13,"height":4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/19-0.png","element":"img"}]]},{"heading":"where we used the fact that dpn,qn( �Gn, G∗n) ≳ 1/n. Therefore, under En, (3.10) implies that","paragraphs":[]},{"heading":"�An log2 n","paragraphs":[[{"style":{"width":"73%"},"width":1062,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/19-1.png","element":"img"}]]},{"heading":"On the other hand, under Ecn, we have","paragraphs":[[{"style":{"width":"30%"},"width":435,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/19-2.png","element":"img"}]]},{"heading":"By Lemma 3.12 we know dp,q( �Gn, G∗n) is also bounded by�An/n. Since (κ + 1)/(κ + 2) ≤ 1, the rate under En dominates and the proof is complete. Proof of Theorem 3.2. First, we verify that the Tsybakov noise condition holds for κ = 1 in our setting. The proof is based on the fact that a ReLU network is piecewise linear and the number of linear pieces is quantiﬁable. 
Assumption (A3) implies (Nn) with cn, 1/Tn = O(log n)m∗d2L∗n and κ = 1. In the case where p, q have disjoint support, obviously κ can be arbitrarily large. Next, we consider the bracketing number of Gn defined via Fn as Gn = {x ∈ X : f(x) ≥ 0, f ∈ Fn}. From Lemma 3.9 we have log NB(δ, Gn, d△) ≲ NLd2d2 (Ld2 log(N) ∨ log(1/δ)). Thus, An = O(Nn)d2Ln as in (3.6) if δ ≪ 1/N. Recall from assumptions (A2) and (A3) that Ln = O(1), Nn = O(log n)m and 1/Tn, cn = O(log n)m∗d2L∗n. Applying Lemma 3.13 with κ = 1 we have that the excess risk has upper bound sup","paragraphs":[]},{"heading":"E[E( �fn, C∗n)]","paragraphs":[[{"style":{"width":"9%"},"width":130,"height":14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/19-3.png","element":"img"}]]},{"heading":"�An log2 n","paragraphs":[[{"style":{"width":"36%"},"width":523,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/19-4.png","element":"img"}]]},{"heading":"� 1","paragraphs":[]},{"heading":"(log n)","paragraphs":[[{"style":{"width":"69%"},"width":1001,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/19-5.png","element":"img"}]]},{"heading":"Proof of Corollary 3.4. Corollary 3.4 easily follows from the fact that p, q having disjoint support implies κ = ∞ in (Nn).","paragraphs":[]},{"heading":"3.3. Proof of Theorem 3.3. We will show that the lower bound holds in the special case where (1) assumption (A3) holds with cn, 1/Tn being absolute constants that do not depend on n; (2) instead of general ReLU neural","paragraphs":[[{"style":{"width":"100%"},"width":1437,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/20-0.png","element":"img"}]]},{"heading":"the dimensions, reminiscent of the “boundary fragment” assumption. In this special case, we are able to show that the best possible convergence rate already matches that in Theorem 3.2. 
For ease of notation, we omit the subscript n and write pn, qn as p, q if no confusion arises. Proof. Without loss of generality, let X = [0, 1]d. Consider the “boundary fragment” setting and let �Gn be a set defined by a ReLU network family �Fn containing functions from Rd−1 to R:","paragraphs":[[{"style":{"width":"89%"},"width":1292,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/20-1.png","element":"img"}]]},{"heading":"where x−j = (x1, · · · , xj−1, xj+1, · · · , xd). Notice that if h(x−j) is a ReLU network on Rd−1, then �h(x) = h(x−j) − xj is a ReLU network on Rd. Thus �Gn is a subset of Gn, which corresponds to the student network via Gn = {x ∈ X : f(x) > 0, f ∈ Fn} (3.11). Let �Gn denote the empirical 0-1 loss minimizer over �Gn. To show the lower bound, consider the subset of D �Gn (3.5) that contains all pairs like (p, q0), where p ∈ F1 and q0 will be specified later. Then, sup","paragraphs":[[{"style":{"width":"78%"},"width":1133,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/20-2.png","element":"img"}]]},{"heading":"� 1","paragraphs":[[{"style":{"width":"40%"},"width":580,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/20-3.png","element":"img"}]]},{"heading":"where F1 is a finite set to be specified later, p, q0 are the underlying densities for the two labels and Dq0 denotes all the data generated from q0. For ease of presentation, we first give the proof for the case d = 2 and then extend to general d. Let φ(t) be a piecewise linear function supported on [−1, 1] defined as φ(t) =","paragraphs":[]},{"heading":"t + 1 −1 < t ≤ 0,","paragraphs":[[{"style":{"width":"36%"},"width":525,"height":113,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/20-4.png","element":"img"}]]},{"heading":"Rewrite φ as φ(t) = σ(t + 1) − σ(t) + σ(−t + 1) − σ(−t) − 1, which is a one hidden layer ReLU neural network with 11 non-zero weights that are either 1 or −1. 
For x = (x1, x2) ∈ [0, 1]2, deﬁne q0(x) =(1 − η0 − b1)I{0 ≤ x2 < 1/2} + I{1/2 ≤ x2 < 1/2 + e−M} + (1 + η0 + b2)I{1/2 + e−M ≤ x2 ≤ 1}, where M ≥ 2 is an integer to be speciﬁed later. Let b1 = c−1/κ2 e−M/κ and b2 > 0 be chosen such that q0 integrates to 1 (so q0 is a valid probability density). For j = 1, 2, · · · , M and t ∈ [0, 1], let","paragraphs":[[{"style":{"width":"43%"},"width":626,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/21-0.png","element":"img"}]]},{"heading":"Note that ψj is only supported on [j−1M , jM ]. For any vector ω = (ω1, · · · , ωM) ∈ Ω := {0, 1}M, deﬁne bω(t) =","paragraphs":[[{"style":{"width":"3%"},"width":57,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/21-1.png","element":"img"}]]},{"heading":"and pω(x) =1 +�1/2 + e−M − x2c2","paragraphs":[]},{"heading":"I{1/2 ≤ x2 ≤ 1/2 + bω(x1)}","paragraphs":[[{"style":{"width":"43%"},"width":632,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/21-2.png","element":"img"}]]},{"heading":"where b3(ω) > 0 is a constant depending on ω chosen such that pω(x) integrates to 1. Let F1 = {pω : ω ∈ Ω} and we will show that (pω, q0) ∈ D �Gn for all ω ∈ Ω. To this end, we need to verify that (a) pω(x) ≤ c1 for x ∈ [0, 1]2;","paragraphs":[[{"style":{"width":"57%"},"width":828,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/21-3.png","element":"img"}]]},{"heading":"For (a), since pω integrates to 1,","paragraphs":[[{"style":{"width":"82%"},"width":1178,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/21-4.png","element":"img"}]]},{"heading":"Thus, pω(x) ≤ c1 for a large enough M and some absolute constant c1. 
For (b), notice that {x : pω(x) ≥ q0(x)} = {x : 0 ≤ x2 ≤ 1/2 + bω(x1)} = {x ∈ [0, 1]2 : bω(x1) − σ(x2) + 1/2 ≥ 0} ∈ Gn,","paragraphs":[]},{"heading":"where the last inclusion follows from the deﬁnition of Gn (3.11) and the fact that bω(x1) − σ(x2) + 1/2 is a ReLU neural network with one hidden layer, whose width and number of non-zero weights are both O(M). Later we will see that M = O(log n), and thus the constructed neural network satisﬁes all the size constraints in Theorem 3.2. For (c), it follows that","paragraphs":[[{"style":{"width":"83%"},"width":1204,"height":317,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/22-0.png","element":"img"}]]},{"heading":"Since the above (a)-(c) hold and by the deﬁnition of D �Gn (3.5), we conclude that (pω, q0) ∈ D �Gn for all ω ∈ Ω . We next establish how fast","paragraphs":[[{"style":{"width":"99%"},"width":1432,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/22-1.png","element":"img"}]]},{"heading":"use the Assouad’s lemma stated in [Korostelev and Tsybakov, 2012] which is adapted to the estimation of sets. For j = 1, · · · , M and ω = (ω1, · · · , ωM) ∈ Ω, let","paragraphs":[[{"style":{"width":"49%"},"width":706,"height":113,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/22-2.png","element":"img"}]]},{"heading":"For i = 0 and i = 1, let Pji be the probability measure corresponding to the distribution of x1, · · · , xn when the underlying density is fωji. Denote the expectation w.r.t. Pji as Eji. 
Let Dj = {x ∈ X : 1/2 + bωj0(x1) < x2 ≤ 1/2 + bωj1(x1)}","paragraphs":[[{"style":{"width":"70%"},"width":1009,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/22-3.png","element":"img"}]]},{"heading":"Then S ≥ 1/2","paragraphs":[]},{"heading":"Q(Dj)","paragraphs":[[{"style":{"width":"26%"},"width":378,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/22-4.png","element":"img"}]]},{"heading":"≥ 1/2","paragraphs":[[{"style":{"width":"29%"},"width":422,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/22-5.png","element":"img"}]]},{"heading":"≥ 1/2","paragraphs":[[{"style":{"width":"28%"},"width":410,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/22-6.png","element":"img"}]]},{"heading":"≥ 14","paragraphs":[[{"style":{"width":"3%"},"width":57,"height":10,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/22-7.png","element":"img"}]]},{"heading":"where H(·, ·) denotes the Hellinger distance. Then it holds that","paragraphs":[[{"style":{"width":"92%"},"width":1333,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/23-0.png","element":"img"}]]},{"heading":"1 +�1/2 + e−M − x2","paragraphs":[[{"style":{"width":"70%"},"width":1009,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/23-1.png","element":"img"}]]},{"heading":"+","paragraphs":[[{"style":{"width":"57%"},"width":819,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/23-2.png","element":"img"}]]},{"heading":"1 +","paragraphs":[[{"style":{"width":"49%"},"width":705,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/23-3.png","element":"img"}]]},{"heading":"We will analyze the last two terms. 
For the ﬁrst term,","paragraphs":[[{"style":{"width":"49%"},"width":710,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/23-4.png","element":"img"}]]},{"heading":"1 +","paragraphs":[[{"style":{"width":"70%"},"width":1010,"height":427,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/23-5.png","element":"img"}]]},{"heading":"� 1","paragraphs":[[{"style":{"width":"30%"},"width":431,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/23-6.png","element":"img"}]]},{"heading":"For the second term, notice that","paragraphs":[[{"style":{"width":"88%"},"width":1272,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/23-7.png","element":"img"}]]},{"heading":"which yields b3(ω11) = 1 1/2 − bω11(x1)","paragraphs":[[{"style":{"width":"78%"},"width":1125,"height":176,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/24-0.png","element":"img"}]]},{"heading":"=","paragraphs":[]},{"heading":"(1/2 − e−M)(1 + 1/κ)e−M(1+1/κ)�(1 − (1 − φ(Mt))1+1/κ)dt","paragraphs":[[{"style":{"width":"5%"},"width":76,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/24-1.png","element":"img"}]]},{"heading":"(1/2 − e−M)(1 + 1/κ)e−M(1+1/κ)","paragraphs":[[{"style":{"width":"25%"},"width":362,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/24-2.png","element":"img"}]]},{"heading":"Hence, |b3(ω11) − b3(ω10)| = O�e−M(1+1/κ)�. 
Unifying the above, we have","paragraphs":[]},{"heading":"� 1","paragraphs":[[{"style":{"width":"63%"},"width":914,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/24-3.png","element":"img"}]]},{"heading":"� 1","paragraphs":[[{"style":{"width":"30%"},"width":431,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/24-4.png","element":"img"}]]},{"heading":"Now choose M as the smallest integer such that","paragraphs":[[{"style":{"width":"1%"},"width":19,"height":9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/24-5.png","element":"img"}]]},{"heading":"κ + 2 log n. Then we have H2(P10, P11) ≤ C∗n−1 (1 + o(1)) for some constant C∗ depending only on κ, c2, φ, and","paragraphs":[[{"style":{"width":"73%"},"width":1058,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/24-6.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"large enough and ","element":"span"},{"style":{"height":17.08},"width":51.31,"height":42.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/24-7.png","element":"img","alt":" C∗1","inline":true,"padRight":true},{"text":"is another absolute constant depending only on","element":"span"}]]},{"heading":"C∗. Thus for n large enough,","paragraphs":[[{"style":{"width":"46%"},"width":666,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/24-8.png","element":"img"}]]},{"heading":"in which the constant C∗2 only depends on κ, c2 and φ. Combining all the results so far we get that lim infn→∞ inf�Gn sup(p,q)∈D �Gnn","paragraphs":[]},{"heading":"which holds when d = 2. Using Lemma 3.12, we have lim infn→∞ inf�Gn sup(p,q)∈D �Gnn","paragraphs":[]},{"heading":"Using the same argument as in the proof of Theorem 3.2, we get κ = 1, which will give us the rate 2/3. The proof for general d can be derived similarly. 
We treat the last dimension xd as x2 in the d = 2 case and treat x−d := (x1, · · · , xd−1) as x1 in the d = 2 case. Deﬁne q0(x) =(1 − η0 − b1)I{0 ≤ xd < 1/2} + I{1/2 ≤ xd < 1/2 + e−M} + (1 + η0 + b2)I{1/2 + e−M ≤ xd ≤ 1}, and pω(x) =1 +�1/2 + e−M − x2c2","paragraphs":[]},{"heading":"I{1/2 ≤ xd ≤ 1/2 + bω(x−d)} − b3(ω)I{1/2 + bω(x−d) < xd ≤ 1}, where bω(x−d) is constructed similarly as a shallow ReLU neural network that bω(x−d) =","paragraphs":[[{"style":{"width":"13%"},"width":193,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/25-0.png","element":"img"}]]},{"heading":"where ωj1,··· ,jd−1 are binary 0, 1 variables and","paragraphs":[[{"style":{"width":"85%"},"width":1235,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/25-1.png","element":"img"}]]},{"heading":"where φ(·) is a shallow ReLU neural network with input dimension d − 1 satisfying the following conditions: • φ = 0 outside [−1, 1]d and φ ≤ 1 on [−1, 1]d; • maxx−d∈[−1,1]d φ(x−d) ≤ 1 and φ(0) = 1. Such a construction is similar to the “spike” function in Yarotsky and Zhevnerchuk [2019] and it requires O(d2) non-zero weights. The rest of the proof follows the d = 2 case.","paragraphs":[]},{"heading":"$41","paragraphs":[[{"style":{"width":"73%"},"width":1060,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/26-0.png","element":"img"}]]},{"heading":"for some m∗ ≥ 1. The following theorem says that the same un-improvable rate can be obtained for the empirical hinge loss minimizer �fφ,n ∈ Fn. Theorem 4.1. Suppose the underlying densities p and q satisfy assumptions (A1), (A2φ), (A3) and denote all such (p, q) pairs as �F∗n. Let Fn be a student ReLU DNN family with Ln = O(log n), Nn = O(log n)m and Bn, Fn = O(log n) for some m ≥ m∗. Assume the student network is larger than the teacher network, i.e., Ln ≥ L∗n, Sn ≥ S∗n, Nn ≥ N∗n, Bn ≥ B∗n, Fn ≥ F ∗n. 
Then the excess risk for �fφ,n ∈ Fn satisfies sup","paragraphs":[]},{"heading":"� 1","paragraphs":[[{"style":{"width":"42%"},"width":611,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/26-1.png","element":"img"}]]},{"heading":"Similarly, the results in Corollaries 3.4 and 3.5 hold for the empirical hinge loss minimizer. Specifically, when p, q have disjoint support, the convergence rate of the excess risk improves to n−1, and all conclusions hold when the teacher network is larger but with bounded active pieces. Remark 2 (Network Depth). Training with a surrogate loss such as hinge loss, unlike 0-1 loss, doesn’t involve any hard thresholding, i.e. I{yf(x) < 0}. As a result, to control the complexity of the student network, Lemma 4.4 is used instead of Lemma 3.9, which allows us to use deeper neural networks (Ln = O(log n)) for both the student and teacher networks.","paragraphs":[]},{"heading":"4.1. Proof of Theorem 4.1. One important observation to be used in the proof is that the Bayes classifier under hinge loss is the same as that","paragraphs":[[{"style":{"width":"99%"},"width":1435,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/27-0.png","element":"img"}]]},{"heading":"convergence rate, we utilize the following lemma from Kim et al. [2018]. Let η(x) denote the conditional probability of label 1, i.e., η(x) = P(y = 1|x). Lemma 4.2. [Theorem 6 of [Kim et al., 2018]] Let φ be the hinge loss. Assume (N) with the noise exponent κ ∈ [0, ∞], and that the following conditions (C1) through (C4) hold. (C1) For a positive sequence an = O(n−a0) as n → ∞ for some a0 > 0, there exists a sequence of function classes {Fn}n∈N such that Eφ(fn, f∗φ) ≤ an for some fn ∈ Fn. 
(C2) There exists a real valued sequence {Fn}n∈N with Fn ≳ 1 such that","paragraphs":[[{"style":{"width":"26%"},"width":384,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/27-1.png","element":"img"}]]},{"heading":"(C3) There exists a constant ν ∈ (0, 1] such that for any f ∈ Fn and any","paragraphs":[[{"style":{"width":"82%"},"width":1188,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/27-2.png","element":"img"}]]},{"heading":"for a constant C2 > 0 depending only on φ and η(·). (C4) For a positive constant C3 > 0, there exists a sequence {δn}n∈N such that HB(δn, Fn, ∥ · ∥2) ≤ C3n � δn Fn","paragraphs":[]},{"heading":"for {Fn}n∈N in (C1), {Fn}n∈N in (C2), and ν in (C3).","paragraphs":[[{"style":{"width":"100%"},"width":1439,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/27-3.png","element":"img"}]]},{"heading":"trarily small constant ι > 0. Then, the empirical φ-risk minimizer �fφ,n over Fn satisﬁes","paragraphs":[[{"style":{"width":"27%"},"width":392,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/27-4.png","element":"img"}]]},{"heading":"In Lemma 4.2, condition (C1) guarantees the approximation error of fn","paragraphs":[[{"style":{"width":"99%"},"width":1435,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/27-5.png","element":"img"}]]},{"heading":"lemma, which is reminiscent of Lemma 3.12 in the sense that it characterizes","paragraphs":[[{"style":{"width":"99%"},"width":1434,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/27-6.png","element":"img"}]]},{"heading":"between f and f∗φ. Lemma 4.3 (Lemma 6.1 of Steinwart et al. [2007]). Assume (N) with the Tsybakov noise exponent κ ∈ [0, ∞]. 
Assume ∥f∥∞ ≤ F for any f ∈ F.","paragraphs":[]},{"heading":"Under the hinge loss φ, for any f ∈ F,","paragraphs":[[{"style":{"width":"80%"},"width":1156,"height":161,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/28-0.png","element":"img"}]]},{"heading":"where Cη,κ = (∥(2η − 1)−1∥κκ,∞ + 1) I(κ > 0) + 1 and ∥(2η − 1)−1∥κκ,∞ is defined by ∥(2η − 1)−1∥κκ,∞ = supt>0 (tκ Pr({x : |(2η(x) − 1)−1| > t})). For condition (C4) in Lemma 4.2, we present the following lemma. Lemma 4.4. [Lemma 3 in Suzuki [2018]] For any δ > 0, the covering number of FDNN(L, N, S, B) (in sup-norm) satisfies","paragraphs":[[{"style":{"width":"45%"},"width":657,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/28-1.png","element":"img"}]]},{"heading":"≤ 2L(S + 1) log(δ−1(L + 1)(N + 1)(B ∨ 1)). Proof of Theorem 4.1. The lower bound directly follows from Theorem 3.3, as the constructed ReLU neural network in the proof also satisfies assumption (A2φ). For the upper bound on the convergence rate, we utilize Lemma 4.2 and check the conditions (C1) through (C4). Since the student network is larger than the teacher, (C1) and (C2) trivially hold with arbitrarily small an and Fn = O(log n) as assumed. To apply Lemma 4.3, notice that Cη,κ = O(cn) = O(log n)m∗d2L∗n by assumption (A3) and F = O(log n). Hence (C3) holds for ν = κ/(κ + 1) + ϵn, where ϵn = (2 + m∗d2L∗n) log log n/ log n. The term ϵn accounts for the fact that Cη,κ can also diverge at an O(log n)m∗d2L∗n rate. For (C4), by Lemma 4.4,","paragraphs":[[{"style":{"width":"56%"},"width":818,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/28-2.png","element":"img"}]]},{"heading":"≤ 2Ln(Sn + 1) log(δ−1n (Ln + 1)(Nn + 1)(Bn ∨ 1)) ≲ (log n)2m+2 log(δ−1n ∨ logm(n)). 
Therefore, (4.2) implies that (C4) is satisfied if we choose δn with","paragraphs":[[{"style":{"width":"58%"},"width":838,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/28-3.png","element":"img"}]]},{"heading":"which can be satisfied by choosing δn = �(log n)2m+m∗d2L∗n+7","paragraphs":[[{"style":{"width":"29%"},"width":417,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/28-4.png","element":"img"}]]},{"heading":"As in the proof of Theorem 3.2, the Tsybakov exponent is κ = 1. Thus, by","paragraphs":[[{"style":{"width":"99%"},"width":1434,"height":146,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/29-0.png","element":"img"}]]},{"heading":"sharp rate of convergence for the excess risk of both the empirical 0-1 loss and hinge loss minimizers in the teacher-student setting. Our current results for training under 0-1 loss only hold for student networks with O(1) layers","paragraphs":[[{"text":"and the assumption that ","element":"span"},{"style":{"height":16.72},"width":150.69,"height":41.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/29-1.png","element":"img","alt":" f∗n ∈ Fn","inline":true},{"text":", i.e. zero approximation error, is required. In the","element":"span"}]]},{"heading":"future, we aim to relax these two constraints and provide a more comprehensive analysis of the teacher-student network. Additionally, we would like to • explore other types of neural networks such as convolutional neural networks and residual neural networks, which are both very successful at image classification; • consider the implicit bias of training algorithms, e.g. 
stochastic gradient descent, to regularize the complexity of larger and deeper neural networks in the teacher-student setting; • consider the more general improper learning scenario where the Bayes classiﬁer is not necessarily in the student neural network; • consider other popular surrogate losses such as exponential loss or cross entropy loss. Further investigation under the teacher-student network setting may facilitate a better understanding of how deep neural network works and shed light on its empirical success especially in high-dimensional image classiﬁcation.","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"References.","element":"span"}],[{"text":"Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding ","element":"span"},{"text":"deep neural networks with rectified linear units. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1611.01491","element":"span"},{"text":", 2016.","element":"span"}],[{"text":"Benjamin Aubin, Antoine Maillard, jean barbier, Florent Krzakala, Nicolas Macris, and ","element":"span"},{"text":"Lenka Zdeborov´a. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 31","element":"span"},{"text":", pages 3223–3234. Curran Associates, Inc., 2018.","element":"span"}],[{"id":"id-22","text":"Jean-Yves Audibert, Alexandre B Tsybakov, et al. Fast learning rates for plug-in classifiers. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of statistics","element":"span"},{"text":", 35(2):608–633, 2007.","element":"span"}],[{"text":"Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Z. Ghahramani, ","element":"span"},{"text":"M. Welling, C. Cortes, N. D. Lawrence, and K. Q. 
Weinberger, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 27","element":"span"},{"text":", pages 2654–2662. Curran Associates, Inc., 2014.","element":"span"}],[{"text":"Benedikt Bauer, Michael Kohler, et al. On deep learning as a remedy for the curse of ","element":"span"},{"text":"dimensionality in nonparametric regression. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Statistics","element":"span"},{"text":", 47(4):2261–2285, 2019.","element":"span"}],[{"id":"id-14","text":"Yuan Cao and Quanquan Gu. ","element":"span"},{"text":"Tight sample complexity of learning one-hidden-layer convolutional neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 10611–10621, 2019.","element":"span"}],[{"id":"id-6","text":"Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Efficient approximation of ","element":"span"},{"text":"deep relu networks for functions on low dimensional manifolds. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 8172–8182, 2019.","element":"span"}],[{"text":"Corinna Cortes and Vladimir Vapnik. Support-vector networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine learning","element":"span"},{"text":", 20(3): 273–297, 1995.","element":"span"}],[{"text":"George Cybenko. Approximations by superpositions of a sigmoidal function. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematics of Control, Signals and Systems","element":"span"},{"text":", 2:183–192, 1989.","element":"span"}],[{"id":"id-2","text":"J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale ","element":"span"},{"text":"Hierarchical Image Database. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR09","element":"span"},{"text":", 2009a.","element":"span"}],[{"text":"Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A ","element":"span"},{"text":"large-scale hierarchical image database. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2009 IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pages 248–255. Ieee, 2009b.","element":"span"}],[{"id":"id-10","text":"A Engel and C.V. Broeck. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Statistical Mechanics of Learning","element":"span"},{"text":". Cambridge University Press, 2002.","element":"span"}],[{"text":"Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation","element":"span"}],[{"style":{"width":"96%"},"width":1392,"height":77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/30-0.png","element":"img"}],[{"id":"id-12","text":"Sebastian Goldt, Madhu Advani, Andrew M Saxe, Florent Krzakala, and Lenka Zdeborov´a. ","element":"span"},{"text":"Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alch´e-Buc, E. Fox, and R. Garnett, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 32","element":"span"},{"text":", pages 6979–6989. Curran Associates, Inc., 2019.","element":"span"}],[{"id":"id-19","text":"Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient ","element":"span"},{"text":"descent on linear convolutional networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 9461–9471, 2018.","element":"span"}],[{"text":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 
Deep residual learning for ","element":"span"},{"text":"image recognition. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pages 770–778, 2016.","element":"span"}],[{"text":"Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.","element":"span"}],[{"style":{"width":"43%"},"width":627,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/30-1.png","element":"img"}],[{"id":"id-17","text":"Masaaki Imaizumi and Kenji Fukumizu. Deep neural networks learn non-smooth functions ","element":"span"},{"text":"effectively. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1802.04474","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-23","text":"Yongdai Kim, Ilsang Ohn, and Dongha Kim. Fast convergence rates of deep neural networks ","element":"span"},{"text":"for classification. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1812.03599","element":"span"},{"text":", 2018.","element":"span"}],[{"text":"Michael Kohler and Sophie Langer. On the rate of convergence of fully connected very ","element":"span"},{"text":"deep neural network regression estimates. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1908.11133","element":"span"},{"text":", 2019.","element":"span"}],[{"text":"Aleksandr Petrovich Korostelev and Alexandre B Tsybakov. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Minimax theory of image reconstruction","element":"span"},{"text":", volume 82. Springer Science & Business Media, 2012.","element":"span"}],[{"id":"id-0","text":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep ","element":"span"},{"text":"convolutional neural networks. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", pages 1097–1105, 2012.","element":"span"}],[{"text":"Shiyu Liang, Ruoyu Sun, Yixuan Li, and Rayadurgam Srikant. Understanding the loss ","element":"span"},{"text":"surface of neural networks for binary classification. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1803.00909","element":"span"},{"text":", 2018.","element":"span"}],[{"text":"Yi Lin. Support vector machines and the bayes rule in classification. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Data Mining and Knowledge Discovery","element":"span"},{"text":", 6(3):259–275, 2002.","element":"span"}],[{"id":"id-7","text":"Ruiqi Liu, Ben Boukai, and Zuofeng Shang. Optimal nonparametric inference via deep ","element":"span"},{"text":"neural network. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1902.01687","element":"span"},{"text":", 2019.","element":"span"}],[{"text":"Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive ","element":"span"},{"text":"power of neural networks: A view from the width. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", pages 6231–6239, 2017.","element":"span"}],[{"text":"Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural ","element":"span"},{"text":"networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1906.05890","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-9","text":"C. W. H. Mace and A. C. C. Coolen. 
Statistical mechanical analysis of the dynamics of","element":"span"}],[{"style":{"width":"97%"},"width":1394,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/31-0.png","element":"img"}],[{"id":"id-15","text":"Enno Mammen, Alexandre B Tsybakov, et al. Smooth discrimination analysis. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Statistics","element":"span"},{"text":", 27(6):1808–1829, 1999.","element":"span"}],[{"text":"Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number ","element":"span"},{"text":"of linear regions of deep neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", pages 2924–2932, 2014.","element":"span"}],[{"id":"id-4","text":"Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and estimation of deep ","element":"span"},{"text":"neural network to intrinsic dimensionality. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1907.02177","element":"span"},{"text":", 2019.","element":"span"}],[{"text":"Kien Nguyen, Clinton Fookes, Arun Ross, and Sridha Sridharan. Iris recognition with ","element":"span"},{"text":"off-the-shelf cnn features: A deep learning perspective. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Access","element":"span"},{"text":", 6:18848–18855, 2017.","element":"span"}],[{"id":"id-5","text":"Kenta Oono and Taiji Suzuki. Approximation and non-parametric estimation of resnet-type ","element":"span"},{"text":"convolutional neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1903.10047","element":"span"},{"text":", 2019.","element":"span"}],[{"text":"Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. On ","element":"span"},{"text":"the expressive power of deep neural networks. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Machine Learning-Volume 70","element":"span"},{"text":", pages 2847–2854. JMLR. org, 2017.","element":"span"}],[{"id":"id-3","text":"Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, ","element":"span"},{"text":"Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Journal of Computer Vision (IJCV)","element":"span"},{"text":", 115(3):211–252, 2015. .","element":"span"}],[{"id":"id-8","text":"David Saad and Sara A. Solla. Dynamics of on-line gradient descent learning for multilayer ","element":"span"},{"text":"neural networks, 1996.","element":"span"}],[{"text":"Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu ","element":"span"},{"text":"activation function. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Statistics","element":"span"},{"text":", 2019. Forthcoming.","element":"span"}],[{"text":"Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and ","element":"span"},{"text":"counting linear regions of deep neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1711.02114","element":"span"},{"text":", 2017.","element":"span"}],[{"text":"Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale ","element":"span"},{"text":"image recognition. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1409.1556","element":"span"},{"text":", 2014.","element":"span"}],[{"id":"id-18","text":"Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. ","element":"span"},{"text":"The implicit bias of gradient descent on separable data. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Journal of Machine Learning Research","element":"span"},{"text":", 19(1):2822–2878, 2018.","element":"span"}],[{"text":"Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Optimistic rates for learning with ","element":"span"},{"text":"a smooth loss. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1009.3896","element":"span"},{"text":", 2010.","element":"span"}],[{"text":"Ingo Steinwart, Clint Scovel, et al. Fast rates for support vector machines using gaussian ","element":"span"},{"text":"kernels. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Statistics","element":"span"},{"text":", 35(2):575–607, 2007.","element":"span"}],[{"text":"Taiji Suzuki. Adaptivity of deep relu network for learning in besov and mixed smooth ","element":"span"},{"text":"besov spaces: optimal rate and curse of dimensionality. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1810.08033","element":"span"},{"text":", 2018.","element":"span"}],[{"style":{"width":"26%"},"width":382,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/32-0.png","element":"img"}],[{"text":"Yuandong Tian. A theoretical framework for deep locally connected relu network. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv","element":"span"}],[{"style":{"width":"36%"},"width":526,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/32-1.png","element":"img"}],[{"id":"id-11","text":"Yuandong Tian. Over-parameterization as a catalyst for better generalization of deep relu ","element":"span"},{"text":"network. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1909.13458","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-16","text":"Alexander B Tsybakov et al. Optimal aggregation of classifiers in statistical learning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Statistics","element":"span"},{"text":", 32(1):135–166, 2004.","element":"span"}],[{"id":"id-21","text":"Alexandre B Tsybakov, Sara A van de Geer, et al. Square root penalty: adaptation to ","element":"span"},{"text":"the margin in classification and in edge estimation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Statistics","element":"span"},{"text":", 33(3): 1203–1224, 2005.","element":"span"}],[{"text":"Sara Van De Geer. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Empirical Processes in M-estimation","element":"span"},{"text":". Cambridge University Press, 2000.","element":"span"}],[{"text":"Dmitry Yarotsky and Anton Zhevnerchuk. The phase diagram of approximation rates for ","element":"span"},{"text":"deep neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1906.09477","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-1","text":"Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"European conference on computer vision","element":"span"},{"text":", pages 818–833. Springer, 2014.","element":"span"}],[{"id":"id-13","text":"Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer","element":"span"}],[{"style":{"width":"19%"},"width":287,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/33-0.png","element":"img"}]]},{"heading":"6.1. Smooth Boundary Condition. In this section we review some existing work under smooth boundary condition in details and point out its connection to our proposed teacher-student neural network. Smooth Functions. A function has H¨older smoothness index β if all partial derivatives up to order ⌊β⌋ exist and are bounded, and the partial derivatives of order ⌊β⌋ are β −⌊β⌋ Lipschitz. 
The ball of β-Hölder functions with radius R is then deﬁned as Hβr (R) = {f : Rr → R :","paragraphs":[[{"style":{"width":"73%"},"width":1062,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/33-1.png","element":"img"}]]},{"heading":"sup","paragraphs":[[{"style":{"width":"76%"},"width":1094,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/33-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"∂","element":"span"},{"style":{"height":17.6},"width":550.63,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/33-3.png","element":"img","alt":"α = ∂α1 . . . ∂αr with α = (α1","inline":true},{"style":{"fontStyle":"italic"},"text":", . . . , α","element":"span"},{"style":{"height":17.6},"width":453.75,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/33-4.png","element":"img","alt":"r) ∈ Nr and |α| := |α|1.","inline":true}]]},{"heading":"Boundary Assumption. It is known that estimating the classiﬁer directly instead of the conditional class probability helps achieve fast convergence","paragraphs":[[{"text":"rates [","element":"span"},{"href":"#id-15","text":"Mammen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","text":"1999","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","text":"Tsybakov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","text":"2004","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","text":"2005","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","text":"Audibert et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","text":"2007","element":"a"},{"text":"].","element":"span"}]]},{"heading":"Classiﬁcation in this case can be thought of as nonparametric estimation of sets where we directly estimate the decision regions for diﬀerent labels, e.g., G for label 1. 
Then, the classiﬁer is determined by attributing x to label 1 if x ∈ G and to label −1 otherwise, i.e., C(x) = 2 · IG(x) − 1. In this case, the Bayes risk can be written as R(G) = 1/2","paragraphs":[]},{"heading":"q(x)Q(dx)","paragraphs":[[{"style":{"width":"43%"},"width":632,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/33-5.png","element":"img"}]]},{"heading":"Denote by G∗ = {x : p(x) ≥ q(x)} the Bayes risk minimizer; the classiﬁcation problem is then equivalent to estimating the optimal set G∗. Given p, q, let","paragraphs":[[{"style":{"width":"79%"},"width":1146,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/33-6.png","element":"img"}]]},{"heading":"The optimal decision rule is to assign label 1 to x ∈ X+ and −1 to x ∈ X−. The decision boundary in this case is {x ∈ X : p(x) = q(x)}. To characterize the smoothness of the boundary, it is usually assumed that X+ consists of unions and intersections of smooth hyper-surfaces [Kim et al., 2018, Tsybakov et al., 2004]. Speciﬁcally, the following assumption is widely","paragraphs":[[{"text":"adopted [","element":"span"},{"href":"#id-15","text":"Mammen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","text":"1999","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","text":"Tsybakov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","text":"2004","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","text":"Kim et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","text":"2018","element":"a"},{"text":",","element":"span"}]]},{"heading":"Imaizumi and Fukumizu, 2018] and is referred to as “boundary fragment”. Let H be a space of smooth functions from Rd−1 to R. 
Deﬁne sets GH as","paragraphs":[[{"text":"(6.1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"style":{"height":21.03},"width":983.14,"height":52.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/34-0.png","element":"img","alt":"H = {x ∈ X : xj > h(x−j), h ∈ H, j ∈ {1, 2, · · · , d}}","inline":true}]]},{"heading":"where x−j = (x1, · · · , xj−1, xj+1, · · · , xd). It is assumed that X+ is composed of ﬁnite unions and intersections of sets in GH. The seemingly odd form (6.1) enforces special structure on the indicator function and reduces the complexity of the corresponding sets. A more general assumption on the decision boundary is that the set G containing all possible X+ cannot be too large. Its size is measured by the bracketing entropy HB of the metric space (G, d△). This more general assumption on the decision boundary is stated as follows. (B) There exist A > 0 and ρ ∈ [0, ∞] such that","paragraphs":[[{"style":{"width":"28%"},"width":406,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/34-1.png","element":"img"}]]},{"heading":"$42","paragraphs":[[{"style":{"width":"65%"},"width":939,"height":428,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/35-0.png","element":"img"}],[{"text":"Fig 4","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":". Example of a ReLU function in 1D. 
The induced set where ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"f > ","element":"figcaption","subtype":"caption"},{"text":"0 ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"is colored red and is a union of two intervals ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":243.22,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/35-1.png","element":"img","alt":" (a1, b1), (a2, b2)","inline":true},{"style":{"fontStyle":"italic"},"text":". All pieces cross 0, so they are all active.","element":"figcaption","subtype":"caption"}]]},{"heading":"Corollary 5 of Montufar et al. [2014] shows that there exists some f with s = Ω(2L−1) pieces on [0, 1]. With scaling and shifting, assume that on each piece the linear function crosses 0. Then, Gf will consist of at least ⌊s/2⌋ = Ω(2L−2) intervals. Denote these disjoint intervals by {(ai, bi)}⌊s/2⌋i=1 . Since they are disjoint, to construct a δ-bracket of all the intervals, we need to δ-cover all the ai’s and bi’s. Similar to the grid argument from the proof of Lemma 3.6, we need at least","paragraphs":[[{"style":{"width":"99%"},"width":1436,"height":467,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/35-2.png","element":"img"}]]},{"heading":"Independent of Weights Magnitude. We also want to point out that the entropy of GF does not depend on the magnitude of the neural network weights, in contrast to the bound in Lemma 4.4. This is because rescaling the function does not change where it crosses zero. Hence, unlike F, the entropy of GF doesn’t depend on the weight maximum B. The Use of ReLU Activation. Our ability to bound the entropy of GF relies critically on the fact that we are considering the ReLU activation function. If we consider smooth nonlinear activation functions, e.g. 
hyperbolic tangent or sigmoid, then instead of the order log(1/δ), we can only obtain an entropy of a much larger order","paragraphs":[[{"style":{"width":"29%"},"width":427,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/35-3.png","element":"img"}]]},{"heading":"for some constant A > 0 and α > 0. To see this, consider the case d = 2. Instead of polygons, which can be controlled by the vertices, the regions have smooth boundaries and require O(1/δ) grid points to cover. Thus the covering number is of order","paragraphs":[[{"style":{"width":"32%"},"width":472,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/36-0.png","element":"img"}]]},{"heading":"Thus, the entropy is of polynomial order in 1/δ.","paragraphs":[[{"style":{"width":"96%"},"width":1385,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/36-1.png","element":"img"}]]},{"heading":"(A3) will be examined in the setting where the teacher network f∗n has random weights. We will argue that with probability at least 1 − δ, f∗n will satisfy assumption (A3) with Tn = A(δ)/(log n)m∗d2L∗n and cn = B(δ)(log n)m∗d2L∗n, where A(δ), B(δ) are constants depending only on δ and the distribution of the random weights, e.g. normal, truncated normal, etc. Hence, the results that rely on assumption (A3) hold with high probability. A Toy Case. To illustrate the intuition, consider the case where d = 1 and","paragraphs":[[{"style":{"width":"76%"},"width":1101,"height":135,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/36-2.png","element":"img"}]]},{"heading":"(6.2) f∗n(x) =","paragraphs":[[{"style":{"width":"100%"},"width":1441,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/36-3.png","element":"img"}]]},{"heading":"Since all the weights are almost surely nonzero, we omit the zero-weight cases from the analysis. Let pi = (ui, vi), i = 1, 2, . . . 
, s, denote the active pieces of (6.2). By Lemma 3.7, we know that s = O(log n). For each pi, deﬁne the following quantities: 1. ki = the slope of f∗n(x) on x ∈ pi;","paragraphs":[[{"style":{"width":"53%"},"width":772,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/36-4.png","element":"img"}]]},{"heading":"See Figure 5 for an illustration. Then, assumption (A3) is satisﬁed if","paragraphs":[[{"style":{"width":"85%"},"width":1221,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/36-5.png","element":"img"}]]},{"heading":"Next we will rigorously examine (6.3). From (6.2), each ki can be expressed as w1jw2j for some j ∈ {1, 2, · · · , N∗n}. Therefore, min1≤i≤N∗n{|ki|} = min1≤j≤N∗n{|w1jw2j|}. Since w1j, w2j are i.i.d.","paragraphs":[[{"style":{"width":"62%"},"width":904,"height":382,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/37-0.png","element":"img"}],[{"text":"Fig 5","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":". Example of a ReLU function in ","element":"figcaption","subtype":"caption"},{"text":"[0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":", ","element":"figcaption","subtype":"caption"},{"text":"1]","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":". There are two active pieces ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":244.71,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/37-1.png","element":"img","alt":" p1, p2. 
On each","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"active piece, ","element":"figcaption","subtype":"caption"},{"style":{"height":11.6},"width":66.39,"height":28.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/37-2.png","element":"img","alt":" ti, ki","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are illustrated in red.","element":"figcaption","subtype":"caption"}]]},{"heading":"standard Gaussian, we have P( min","paragraphs":[[{"style":{"width":"71%"},"width":1029,"height":180,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/37-3.png","element":"img"}]]},{"heading":"By choosing k = � δ2N∗n�2, we have min1≤i≤N∗n{|ki|} = Ω(1/ log n) with probability at least 1 − δ. On the other hand, for any i = 1, . . . , s, ti = |f∗n(xhi)| for some hi ∈ {1, · · · , N∗n}, where xhi = −bhi/w1hi. Hence","paragraphs":[]},{"heading":"min1≤i≤s{ti} ≥ min1≤j≤N∗n{|f∗n(xj)|}.","paragraphs":[[{"text":"Let ","element":"span"},{"style":{"height":25.71},"width":358.24,"height":64.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/37-4.png","element":"img","alt":" W1 = {w1j, bj}N∗nj=1","inline":true},{"text":". Then, ","element":"span"},{"style":{"height":20.99},"width":465.16,"height":52.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/37-5.png","element":"img","alt":" f∗n(xi) | W1 ∼ N(0, σ2xi),","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":20.99},"width":55,"height":52.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/37-6.png","element":"img","alt":" σ2xi","inline":true,"padRight":true},{"text":"has the ","element":"span"},{"text":"expression ","element":"span"},{"style":{"height":25.71},"width":385.34,"height":64.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/37-7.png","element":"img","alt":"∑N∗nj=1 σ(w1jxi + bj)2","inline":true,"padRight":true},{"text":"+ 1. 
Hence, for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > ","element":"span"},{"text":"0,","element":"span"}],[{"style":{"width":"3%"},"width":53,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/37-8.png","element":"img"}]]},{"heading":"P(min","paragraphs":[[{"style":{"width":"99%"},"width":1432,"height":249,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/37-9.png","element":"img"}]]},{"heading":"1 − δ, mini{ti} ≥ t and t = Ω(1/ log n). Therefore, (6.3) holds with high probability, so that assumption (A3) holds by setting 1/cn = mini{|ki|} and Tn = mini{ti}, both of which are of order Ω(1/ log n).","paragraphs":[]},{"heading":"General Case. Now we consider the general case d > 1 and L∗n > 1. The teacher network has the expression f∗n(x) = W (L∗n+1)σ(W (L∗n),b(L∗n)) ◦ · · · ◦ σ(W (1),b(1))(x) + b(L∗n+1), x ∈ [0, 1]d.","paragraphs":[[{"style":{"width":"99%"},"width":1434,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/38-0.png","element":"img"}]]},{"heading":"s = O(log n)m∗L∗nd. Let {x1, x2, . . . , xvs} be the collection of vertices of {p1, . . . , ps}. We call each such xi ∈ Rd a piece vertex; it is not the same as a vertex of {x ∈ X : fn(x) ≥ 0}, which is closely examined in the proof of Lemma 3.9. The following lemma states that vs = O(log n)m∗L∗nd2 in our setting. Lemma 6.1. Let f be a ReLU neural network with d-dimensional input, L hidden layers and width N for every layer. Then, vs = O(N)Ld2. Proof. Recall that w(l)i and b(l)i for i = 1, . . . , N, 1 ≤ l ≤ L are the weight vectors and biases on the l-th hidden layer. For i = 1, . . . , N, deﬁne","paragraphs":[[{"style":{"width":"75%"},"width":1088,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/38-1.png","element":"img"}]]},{"heading":"which maps Rd → R. 
We can rewrite f as f(x) =","paragraphs":[[{"style":{"width":"3%"},"width":55,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/38-2.png","element":"img"}]]},{"heading":"In other words, f(L−1)i (x) represents the input to the i-th ReLU unit in the last hidden layer of f and is itself an (L − 1)-hidden-layer ReLU neural network. The proof proceeds by induction. Notice that the piece vertices of f can arise in only two ways: Type I: the piece vertices of f(L−1)1 , f(L−1)2 , . . . , f(L−1)N , in whose local neighbourhoods the ReLU units in the last layer do not change sign; Type II: vertices created by activations of the ReLU units in the last layer, i.e. f(L−1)i (x) = 0 for some i = 1, . . . , N. Let V (l) be the maximum number of piece vertices of an l-hidden-layer ReLU neural network with width N and let U(l) be the maximum number of Type II piece vertices created at layer l. Then for 1 < l ≤ L we have (6.5) V (l) ≤ NV (l − 1) + U(l). For U(l), the key is to connect the Type II piece vertices of f to the vertices of {x ∈ X : f(L−1)i (x) ≥ 0}, which have been extensively studied in","paragraphs":[]},{"heading":"Lemma 3.9. To this end, we define another quantity. On the i-th ReLU unit in the l-th hidden layer, let R(l)i := {x ∈ X : f(l)i (x) = 0}, which consists of (d − 1)-dimensional hyperplane segments. To be specific, denote all the active pieces of f(l)i (x) by {p(l)ij : j = 1, . . . , s(l)i }, where s(l)i = O(N)(l−1)d according to Lemma 3.7 for any 1 ≤ i ≤ N. On each active piece p(l)ij , denote","paragraphs":[[{"style":{"width":"65%"},"width":938,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/39-0.png","element":"img"}]]},{"heading":"which is part of a (d−1)-dimensional hyperplane.
Then we have R(l)i = {h(l)ij :","paragraphs":[[{"style":{"height":24.01},"width":277.61,"height":60.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/39-1.png","element":"img","alt":"j = 1, . . . , s(l)i }","inline":true},{"text":", a collection of (","element":"span"},{"style":{"height":12.8},"width":66.25,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/39-2.png","element":"img","alt":"d −","inline":true,"padRight":true},{"text":"1)-dimensional hyperplane segments. Let ","element":"span"},{"style":{"height":24.01},"width":286.51,"height":60.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/39-3.png","element":"img","alt":"R(l) = ∪Ni=1R(l)i ","inline":true,"padRight":true},{"text":", which corresponds to the piece boundaries of ","element":"span"},{"style":{"height":19.13},"width":93.82,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/39-4.png","element":"img","alt":" fl+1.","inline":true}]]},{"heading":"By definition, all Type II piece vertices must reside in at least one of the activation sets (z = 0 in σ(z)) of the ReLU units in the last layer. R(L) contains all such activation sets for the last hidden layer, i.e. for any h ∈ R, there exists 1 ≤ i ≤ N such that fi(x) = 0, ∀x ∈ h. The Type II piece vertices are jointly determined by such activation sets and the piece boundaries of the fi’s (dimension d − 1), i.e. R(L−2)i .
Therefore, the total number of such piece vertices can be bounded by","paragraphs":[[{"style":{"width":"61%"},"width":881,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/39-5.png","element":"img"}]]},{"heading":"where |R(l)| denotes the number of elements in R(l), which is bounded by","paragraphs":[[{"style":{"width":"17%"},"width":258,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/39-6.png","element":"img"}]]},{"heading":"For V (L), we first conclude that V (1) = O(N^d). For a 1-hidden-layer ReLU network, the decision boundary of every ReLU unit is a (d − 1)-dimensional hyperplane (w1x + b1 = 0). The maximum number of piece vertices is bounded by the binomial coefficient C(N, d) = O(N^d). Then, (6.5) can be repeatedly broken down to V (L) ≤ NV (L − 1) + U(L) ≤ N^2 V (L − 2) + NU(L − 1) + U(L) ≤ · · · ≤ N^(L−1) V (1) +","paragraphs":[[{"style":{"width":"74%"},"width":1066,"height":409,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/39-7.png","element":"img"}]]},{"heading":"As an extension to the toy case, for any 1 ≤ i ≤ N∗n, define","paragraphs":[[{"style":{"width":"49%"},"width":717,"height":77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-0.png","element":"img"}]]},{"heading":"2. t0 = min1≤i≤vs {|f∗n(xi)|} . That is, ki is the minimal absolute value of the directional derivatives of f∗n on piece pi. Assumption (A3) is satisfied if the following holds: (6.6) min1≤i≤s{ki}, t0 = Ω(log n)m∗d2L∗n. We will check (6.6).
Since the partial derivative of f∗n(x) for x ∈ pi can be expressed as the product of the random weights, we have min1≤i≤s{ki} ≥ min1≤jl≤N∗n","paragraphs":[[{"style":{"width":"76%"},"width":1093,"height":182,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-1.png","element":"img"}]]},{"heading":"min","paragraphs":[[{"style":{"width":"53%"},"width":764,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-2.png","element":"img"}]]},{"heading":"we get that P( min1≤i≤s{ki} < k) ≤ P","paragraphs":[]},{"heading":"min","paragraphs":[[{"style":{"width":"68%"},"width":986,"height":167,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-3.png","element":"img"}]]},{"heading":"By taking","paragraphs":[[{"style":{"width":"29%"},"width":418,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-4.png","element":"img"}]]},{"heading":"k0 =","paragraphs":[]},{"heading":"(N∗n)2(L∗n + 1)","paragraphs":[]},{"heading":"we have that with probability at least 1 − δ, min1≤i≤s{ki} ≥ k0 and k0 = Ω(1/ log n)2m∗(L∗n+1). On the other hand, for any ti, there exist j = 1, . . . , vs such that ti = f∗n(xj). Hence","paragraphs":[[{"style":{"width":"42%"},"width":606,"height":69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-5.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":17.89},"width":115.08,"height":44.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-6.png","element":"img","alt":" W−L∗n","inline":true,"padRight":true},{"text":":= ","element":"span"},{"style":{"height":23.74},"width":271.9,"height":59.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-7.png","element":"img","alt":" {W (l), b(l)}L∗nl=1","inline":true},{"text":". 
Then we have ","element":"span"},{"style":{"height":22.99},"width":539.38,"height":57.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-8.png","element":"img","alt":" f∗n(xj) | W−L∗n ∼ N(0, σ2xj),","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":23},"width":766.96,"height":57.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-9.png","element":"img","alt":" σ2xj depends on W−L∗n and σ2xj ≥ 1 that","inline":true}],[{"style":{"width":"40%"},"width":579,"height":142,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/40-10.png","element":"img"}]]},{"heading":"which is reminiscent of (6.4), where NL∗n is the width of the last layer and the σj(·)’s are the outputs (post-activations) of the last layer given W−L∗n. Therefore, for any t > 0, we have P( min1≤j≤vs{|f∗n(xj)|} < t | W−L∗n) ≤","paragraphs":[[{"style":{"width":"35%"},"width":511,"height":201,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.06892/images/41-0.png","element":"img"}]]},{"heading":"Thus by taking t = δ/(N∗n)d2L∗n, we have that with probability at least 1 − δ, mini{ti} ≥ t and t = Ω(1/ log n)m∗d2L∗n. Therefore, (6.6) holds. That is to say, when d ≥ 2, with high probability, Assumption (A3) holds with cn, 1/Tn = O(log n)m∗d2L∗n. Notice that the probability arguments used in this section do not rely on the Gaussian distribution. As long as all weights are i.i.d. with a distribution that has no point mass at 0, our claim holds.","paragraphs":[]}],"_version":"3.3.2"},"paperNode":"$28:props:children:props:children:0:props:product"}]]]}]}]