We study the approximation of measurable functions on the hypercube by functions arising from affine neural networks. Our main achievement is an approximation of any measurable function 1] up to a prescribed precision 0 by a bounded number of neurons, depending only on and not on the function
The study of functions defined by neural networks has a long history dating back to the work of McCulloch and Pitts [5]. Recent advances in applications to deep learning raised numerous questions on why neural networks are able to solve so many different problems. It is well known that neural networks can approximate any given function up to arbitrary precision, see for example [1–3], however, the dependence of the architecture of the neural network on the quality of the approximation and the function is much harder to understand on a theoretical level. This is in contrast to the observation that relatively easy neural networks are able to provide desirable approximations in many cases. Our take on this is a new viewpoint that tries to approach this phenomenon as an effect of efficient separation of structure and randomness.
We consider the standard hypercube and an arbitrary parameter Every vector naturally defines a function
where
and denotes the standard inner product on . We call such functions rec-tified affine and suppress the parameter q throughout the entire article. In this note, we consider the basic question how easily an arbitrary measurable function 1] can be approximated by compositions and insertions of functions of the form as above. Our main achievement is an approximation of any measurable function 1] up to a prescribed precision 0 by a bounded number of neurons, where the bound depends only on independence of is reminiscent of Szemer´edi’s Regularity Lemma, which provides a decomposition of a graph of arbitrary size into quasi-random and deterministic constituents. In a similar way, we decompose an arbitrary measurable function 1] into constituents that can be reproduced by a neural network of bounded size and such which are quasi-random and thus invisible to a neural network of bounded size in a very precise way.
Let us introduce a hierarchy of functions built from the basic building blocks. A function 1] is said to be represented by an affine neural network of type (, if there exists a sequence
where we set = 1 for convenience and each form
with We say that representable if it is representable by an affine neural network of type (with . A function is (d|0)-representable function if and only if it is an rectified affine function. It is also easy to see that (d|r)-representable functions are both (d+1|r)-representable and (d|r+1)-representable. In particular, any rectified affine function is (d|r)-representable for any
-representable for 1 . Then, the function
The space of functions is naturally endowed with a normalized
where denotes the normalized Lebesgue measure on
For fixed n, d, r, the set of (d|r)-representable functions might be small, but nevertheless, we can use it to define alternative notions of distance on the set of all real-valued, measurable and essentially bounded functions on as follows:
where the supremum runs over all functions 1] which are (d|r)-representable. It is clear that takes non-negative values, that 2satisfies the triangle inequality. The first non-trivial observation is that is actually a metric for any
The metric measures how well (d|r)-representable functions are able to tell the difference between f and g. Note that it is very natural to include the complexity of the observer in any attempt of approximation of functions by neural networks. We call a function )-invisible if
i.e., if it does not significantly correlate with any (d|r)-representable function. We set
The following result is already interesting for d = 1, r = 0.
In particular, every measurable function -representable and -invisible.
The proof is inspired by various analytic approaches to Szemer´edi’s Regularity Lemma, see for example [4]. Note that our bounds are independent of f and n, which should make the results particularly useful.
Proof. Consider the Hilbert space ) with the usual inner product and consider a function 1] as a vector in we set
and define We clearly have that 1 Hence, there exists some conclude that there exists with the property that We consider now the vector and set
Note that g is just the composition of which immediately implies
Note that is assumed to be (21)-representable for each 1 so that we conclude by Lemma 1 that )-representable. Let now h be any (d|r)-representable function. Note that that we have
and hence
If , then we set ] and can conclude that In the other case, when , we get more easily just using 1. This finishes the proof, since now
where the supremum runs over all (d|r)-representable functions.
Acknowledgments
This research was supported by ERC Consolidator Grant No. 681207. I thank Nihat Ay for interesting comments on a previous version of this preprint.
References
[1] G. Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signal 2 (1989), no. 4, 303–314.
[2] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (1989), no. 5, 359–366.
[3] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, Multilayer feedforward networks with a non-polynomial activation function can approximate any function, Neural Netw. 6 (1993), no. 6, 861–867.
[4] L. Lov´asz and B. Szegedy, , Geom. Funct. Anal. 17 (2007), no. 1, 252–270.
[5] W. McCulloch and W. Pitts, A logical calculus of ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943), 115–133.
Designed for Accessibility and to further Open Science