AlexNet [Krizhevsky et al., 2012] and its variants [Zeiler and Fergus, 2014,
in ImageNet data [Deng et al., 2009a, Russakovsky et al., 2015], where the
[Saad and Solla, 1996, Mace and Coolen, 1998, Engel and Broeck, 2002] and
2019, Goldt et al., 2019, Zhang et al., 2019, Cao and Gu, 2019]. Still, there
show that under exponential loss [Soudry et al., 2018, Gunasekar et al., 2018],
Lemma 3.1. Let be the teacher network with structures specified in
[Mammen et al., 1999, Tsybakov et al., 2004] can be arbitrarily large, which
Fig 1. Example of a ReLU DNN function in [0, 1]. There are 5 pieces among them, only
cross value 0 (horizontal line). There are 3 active pieces in this example and they are colored red.
(3.2) [x
Fig 2. Grid in 2D and the outer cover (green) constructed for with grid points for a polygon (blue).
• G(
there exist
• G(
where : 0
Fig 3. Demonstration of how a polygon in d = 2 case can be divided into basic triangles. The union of the two brackets form a bracket of the original polygon. The blue shade is the symmetric difference.
framework, the optimal set is indexed by n as it’s determined by the teacher network
. In the remaining part of the proof, we will consider
V
for n large enough and is another absolute constant depending only on
and the assumption that , i.e. zero approximation, is required. In the
References.
Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.
Benjamin Aubin, Antoine Maillard, jean barbier, Florent Krzakala, Nicolas Macris, and Lenka Zdeborov´a. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3223–3234. Curran Associates, Inc., 2018.
Jean-Yves Audibert, Alexandre B Tsybakov, et al. Fast learning rates for plug-in classifiers. The Annals of statistics, 35(2):608–633, 2007.
Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662. Curran Associates, Inc., 2014.
Benedikt Bauer, Michael Kohler, et al. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4):2261–2285, 2019.
Yuan Cao and Quanquan Gu. Tight sample complexity of learning one-hidden-layer convolutional neural networks. In Advances in Neural Information Processing Systems, pages 10611–10621, 2019.
Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Efficient approximation of deep relu networks for functions on low dimensional manifolds. In Advances in Neural Information Processing Systems, pages 8172–8182, 2019.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3): 273–297, 1995.
George Cybenko. Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2:183–192, 1989.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009a.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009b.
A Engel and C.V. Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2002.
Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation
Sebastian Goldt, Madhu Advani, Andrew M Saxe, Florent Krzakala, and Lenka Zdeborov´a. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alch´e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 6979–6989. Curran Associates, Inc., 2019.
Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
Masaaki Imaizumi and Kenji Fukumizu. Deep neural networks learn non-smooth functions effectively. arXiv preprint arXiv:1802.04474, 2018.
Yongdai Kim, Ilsang Ohn, and Dongha Kim. Fast convergence rates of deep neural networks for classification. arXiv preprint arXiv:1812.03599, 2018.
Michael Kohler and Sophie Langer. On the rate of convergence of fully connected very deep neural network regression estimates. arXiv preprint arXiv:1908.11133, 2019.
Aleksandr Petrovich Korostelev and Alexandre B Tsybakov. Minimax theory of image reconstruction, volume 82. Springer Science & Business Media, 2012.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
Shiyu Liang, Ruoyu Sun, Yixuan Li, and Rayadurgam Srikant. Understanding the loss surface of neural networks for binary classification. arXiv preprint arXiv:1803.00909, 2018.
Yi Lin. Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, 6(3):259–275, 2002.
Ruiqi Liu, Ben Boukai, and Zuofeng Shang. Optimal nonparametric inference via deep neural network. arXiv preprint arXiv:1902.01687, 2019.
Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In Advances in neural information processing systems, pages 6231–6239, 2017.
Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
C. W. H. Mace and A. C. C. Coolen. Statistical mechanical analysis of the dynamics of
Enno Mammen, Alexandre B Tsybakov, et al. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and estimation of deep neural network to intrinsic dimensionality. arXiv preprint arXiv:1907.02177, 2019.
Kien Nguyen, Clinton Fookes, Arun Ross, and Sridha Sridharan. Iris recognition with off-the-shelf cnn features: A deep learning perspective. IEEE Access, 6:18848–18855, 2017.
Kenta Oono and Taiji Suzuki. Approximation and non-parametric estimation of resnet-type convolutional neural networks. arXiv preprint arXiv:1903.10047, 2019.
Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2847–2854. JMLR. org, 2017.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. .
David Saad and Sara A. Solla. Dynamics of on-line gradient descent learning for multilayer neural networks, 1996.
Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics, 2019. Forthcoming.
Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and counting linear regions of deep neural networks. arXiv preprint arXiv:1711.02114, 2017.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Optimistic rates for learning with a smooth loss. arXiv preprint arXiv:1009.3896, 2010.
Ingo Steinwart, Clint Scovel, et al. Fast rates for support vector machines using gaussian kernels. The Annals of Statistics, 35(2):575–607, 2007.
Taiji Suzuki. Adaptivity of deep relu network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality. arXiv preprint arXiv:1810.08033, 2018.
Yuandong Tian. A theoretical framework for deep locally connected relu network. arXiv
Yuandong Tian. Over-parameterization as a catalyst for better generalization of deep relu network. arXiv preprint arXiv:1909.13458, 2019.
Alexander B Tsybakov et al. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
Alexandre B Tsybakov, Sara A van de Geer, et al. Square root penalty: adaptation to the margin in classification and in edge estimation. The Annals of Statistics, 33(3): 1203–1224, 2005.
Sara Van De Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.
Dmitry Yarotsky and Anton Zhevnerchuk. The phase diagram of approximation rates for deep neural networks. arXiv preprint arXiv:1906.09477, 2019.
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer
where ∂, . . . , α
rates [Mammen et al., 1999, Tsybakov et al., 2004, 2005, Audibert et al., 2007].
adopted [Mammen et al., 1999, Tsybakov et al., 2004, Kim et al., 2018,
(6.1) G
Fig 4. Example of a ReLU function in 1D. The induced set where f > 0 is colored red and it’s a union of two intervals . All pieces cross 0 so there are all active.
Fig 5. Example of a ReLU function in [0, 1]. There are two active pieces active piece,
are illustrated in color red.
Let . Then,
where
has an expression of
+ 1. Hence, for any t > 0,
, a collection of (
1)-dimensional hyperplane segments. Let
, which corresponds to the piece boundaries of
Let :=
. Then we have
where