b

DiscoverSearch
About
My stuff
Approximation of functions by neural networks
2019·arXiv
Abstract
Abstract

We study the approximation of measurable functions on the hypercube by functions arising from affine neural networks. Our main achievement is an approximation of any measurable function  f : Wn → [−1,1] up to a prescribed precision  ε >0 by a bounded number of neurons, depending only on  εand not on the function  f or n ∈ N.

The study of functions defined by neural networks has a long history dating back to the work of McCulloch and Pitts [5]. Recent advances in applications to deep learning raised numerous questions on why neural networks are able to solve so many different problems. It is well known that neural networks can approximate any given function up to arbitrary precision, see for example [1–3], however, the dependence of the architecture of the neural network on the quality of the approximation and the function is much harder to understand on a theoretical level. This is in contrast to the observation that relatively easy neural networks are able to provide desirable approximations in many cases. Our take on this is a new viewpoint that tries to approach this phenomenon as an effect of efficient separation of structure and randomness.

We consider the standard hypercube  Wn := [−1, 1]n ⊂ Rn and an arbitrary parameter  q ≥ 1.Every vector  ξ = (ξ0, c) ∈ [−q, q]n × Rnaturally defines a function

image

where

image

and  ⟨., .⟩denotes the standard inner product on  Rn. We call such functions rec-tified affine and suppress the parameter q throughout the entire article. In this note, we consider the basic question how easily an arbitrary measurable function f : Wn → [−1,1] can be approximated by compositions and insertions of functions of the form  ϕξas above. Our main achievement is an approximation of any measurable function  f : Wn → [−1,1] up to a prescribed precision  ε >0 by a bounded number of neurons, where the bound depends only on  ε and not on n ∈ N. Thisindependence of  n ∈ Nis reminiscent of Szemer´edi’s Regularity Lemma, which provides a decomposition of a graph of arbitrary size into quasi-random and deterministic constituents. In a similar way, we decompose an arbitrary measurable function  f : Wn → [−1,1] into constituents that can be reproduced by a neural network of bounded size and such which are quasi-random and thus invisible to a neural network of bounded size in a very precise way.

Let us introduce a hierarchy of functions built from the basic building blocks. A function  f : Wn → [−1,1] is said to be represented by an affine neural network of type (d1, d2, ..., dr), where r ∈ N, if there exists a sequence

image

where we set  d0 = n and dr+1= 1 for convenience and each  α0, . . . , αn is of theform

image

with  ξ(i, j) ∈ [−q, q]di × R, for 1 ≤ j ≤ di+1We say that  f : Wn → [−1, 1] is (d|r)-representable if it is representable by an affine neural network of type (d1, . . . , dr)with  di ≤ d for all 1 ≤ i ≤ r. A function is (d|0)-representable function if and only if it is an rectified affine function. It is also easy to see that (d|r)-representable functions are both (d+1|r)-representable and (d|r+1)-representable. In particular, any rectified affine function is (d|r)-representable for any  d ≥ 1, r ≥ 0.

Lemma 1. Let fi : Wn → [−1, 1] be (di|ri)-representable for 1  ≤ i ≤ m andλ1, . . . , λm ∈ [ − q, q]. Then, the function

image

The space of functions is naturally endowed with a normalized  L1-distance

image

where  µdenotes the normalized Lebesgue measure on  Wn

For fixed n, d, r, the set of (d|r)-representable functions might be small, but nevertheless, we can use it to define alternative notions of distance on the set of all real-valued, measurable and essentially bounded functions on  Wnas follows:

image

where the supremum runs over all functions  h: Wn → [−1,1] which are (d|r)-representable. It is clear that  σ(d|r)takes non-negative values, that  σ(d|r)(f, g) ≤2σ(f, g) and that σ(d|r)satisfies the triangle inequality. The first non-trivial observation is that  σ(d|r)is actually a metric for any  d ≥ 1, r ≥ 0.

image

The metric  σ(d|r)measures how well (d|r)-representable functions are able to tell the difference between f and g. Note that it is very natural to include the complexity of the observer in any attempt of approximation of functions by neural networks. We call a function  f : Wn → [−1, 1] (ε, d|r)-invisible if

image

i.e., if it does not significantly correlate with any (d|r)-representable function. We set

image

The following result is already interesting for d = 1, r = 0.

image

In particular, every measurable function  f : Wn → [−1, 1] is a sum f = g+h, whereg : Wn → [−1, 1] is (2md, r + m)-representable and  h = f − g is (ε, d|r)-invisible.

The proof is inspired by various analytic approaches to Szemer´edi’s Regularity Lemma, see for example [4]. Note that our bounds are independent of f and n, which should make the results particularly useful.

Proof. Consider the Hilbert space  L2(Wn, µ) with the usual inner product and consider a function  f : Wn → [−1,1] as a vector in  L2(Wn, µ). For 1 ≤ k ≤ m + 1,we set

image

and define  tk := inf{∥f − ξ∥2 | ξ ∈ Ξk}.We clearly have that 1  ≥ ∥f∥2 ≥ t1 ≥· · · ≥ 0.Hence, there exists some  m′ ≤ m = ⌈1/ε2⌉ such that tm′ ≤ tm′+1 + ε2. Weconclude that there exists  g′ = λ1f1 + · · · + λm′fm′ ∈ Ξm′with the property that ∥f − g′∥2 ≤ tm′+1 + ε2.We consider now the vector  ξ := (λ1, . . . , λm′) ∈ [ − q, q]m′and set

image

Note that g is just the composition of  g′ with β and hence |f(w) − g′(w)| ≥|f(w) − g(w)| for all w ∈ Wnwhich immediately implies

image

Note that  fjis assumed to be (2j−1d|r+j−1)-representable for each 1  ≤ j ≤ m′,so that we conclude by Lemma 1 that  g is (2m′d, r + m′)-representable. Let now h be any (d|r)-representable function. Note that  g + th ∈ Ξm′+1, for t ∈ [−q, q], sothat we have

image

and hence

image

If  ∥h∥ ≥ ε ≥ ε/q, then we set  t = ±ε/∥h∥ ∈ [−q, q] and can conclude that |⟨h, f −g⟩| ≤ ε.In the other case, when  ∥h∥ ≤ ε, we get more easily  |⟨h, f −g⟩| ≤ εjust using  ∥f − g∥ ≤1. This finishes the proof, since now

image

where the supremum runs over all (d|r)-representable functions. □

This research was supported by ERC Consolidator Grant No. 681207. I thank Nihat Ay for interesting comments on a previous version of this preprint.

[1] G. Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signal 2 (1989), no. 4, 303–314.

[2] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (1989), no. 5, 359–366.

[3] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, Multilayer feedforward networks with a non-polynomial activation function can approximate any function, Neural Netw. 6 (1993), no. 6, 861–867.

[4] L. Lov´asz and B. Szegedy,  Szemer´edi’s lemma for the analyst, Geom. Funct. Anal. 17 (2007), no. 1, 252–270.

[5] W. McCulloch and W. Pitts, A logical calculus of ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943), 115–133.

image


Designed for Accessibility and to further Open Science