Deep Neural Networks (NNs) with many hidden layers are now at the core of modern machine learning applications and can achieve remarkable performance that was previously unattainable using shallow networks. But why are deeper networks better than shallow? Perhaps intuitively, one can understand that the nature of computation done by deep and shallow networks is different; simple one hidden layer NNs extract independent features of the input and return their weighted sum, while deeper NNs can compute features of features, making the features computed by deeper layers no longer independent. Another line of intuition (Poole et al. (2016)), is that highly complicated manifolds in input space can actually turn into flattened manifolds in hidden space, thus helping with downstream tasks (e.g., classification).
To make the above intuitions formal and understand the benefits of depth, researchers try to understand the expressivity of NNs and prove depth separation results. Early results in this area sometimes referred to as universality theorems (Cybenko, 1989; Hornik et al., 1989), state that NNs of just one hidden layer, equipped with standard activation units (e.g., sigmoids, ReLUs etc.) are “dense” in the space of continuous functions, meaning that any continuous function can be represented by an appropriate combination of these activation units. There is a computational caveat however, since the width of this one hidden layer network can be unbounded and grow arbitrarily with the input function. In practice, resources are bounded, hence the more meaningful questions have to do with depth separations.
This is a foundational question not only in deep learning theory but also in other computational models (e.g., boolean circuit complexity (Hastad, 1986; Kane and Williams, 2016)) with a rich history of prior work, bringing together ideas and techniques from boolean functions, Fourier and harmonic analysis, special functions, fractal geometry, differential geometry and more recently dynamical systems and chaos. At a high level, all these works define an appropriate notion of “complexity” and later demonstrate how deeper models are significantly more powerful than shallower models. A partial list of the different notions of complexity that have been considered include global curvature (Poole et al., 2016) and trajectory length (Raghu et al., 2017), number of activation patterns (Hanin and Rolnick, 2019) and linear regions (Montufar et al., 2014; Arora et al., 2016), fractals (Malach and Shalev-Shwartz, 2019), the dimension of algebraic varieties (Kileel et al., 2019), Fourier spectrum of radial functions (Eldan and Shamir, 2016), number of oscillations (Schmitt, 2000; Telgarsky, 2016, 2015) and periods of continuous maps (Chatziafratis et al., 2020).
In this work, we build upon the works by Telgarsky (2016) that relied on the number of oscillations of continuous functions and by Chatziafratis et al. (2020) that relied on periodic orbits present in a continuous function and connections to dynamical systems to derive depth separations (see Section 2 for definitions). We pose the following question:
Can we exploit further connections to dynamical systems to derive improved depth-width trade-offs?
We are indeed able to do this and improve the known depth separations along several aspects:
• We show that there exist real-valued functions f, expressible by deep NNs, for which shallower networks, even with exponentially larger width, incur large error instead of the weaker
notion of classification error that was previously shown (Chatziafratis et al., 2020; Telgarsky, 2015).
• We obtain width lower bounds that are sharper across all regimes for the periodic orbits in f and surprisingly we show that there is a universal constant characterizing the depth-width trade-offs, as long as f contains points of odd period. This was not known before as the trade-offs were becoming increasingly less pronounced (approaching the trivial value 1) when f’s period was growing.
• Finally, the obtained period-specific depth-width trade-offs are shown to hold against shallow networks equipped with semi-algebraic units as defined in Telgarsky (2016) and can be extended to the case of high-dimensional input functions by an appropriate projection.
Technically, our improved results are based on a tighter eigenvalue analysis of the dynamical systems arising from the periodic orbits in f and on some new connections between the Lipschitz constant of f, its (prime) period, and the growth rate of the oscillatory behaviour of repeatedly composing f with itself. This latter connection allows us to lower bound the error of shallow (but wide) networks, yielding period-specific depth-width lower bounds.
At a broader perspective, we completely answer a question raised by Telgarsky (2016), regarding the construction of large families of hard-to-represent functions. Our results are tight, as one can explicitly construct examples of functions that achieve equality in our bounds (see Lemma 3.6). En route to our results, we unify and extend previous methods for depth separations Telgarsky (2016); Chatziafratis et al. (2020); Schmitt (2000).
Last but not least, we complement our theoretical findings with experiments on a synthetic data set to validate our obtained bounds and also contextualize the fact that depth can indeed be beneficial for some simple learning tasks involving functions of certain periods.
1.1 Background on dynamical systems
Here we give the necessary background from dynamical systems in order to later state our results more formally. From now on, ] is assumed to be continuous.
Periods: The notion of a periodic point (a generalization of a fixed point) will be important:
Definition 1.1. We say f contains period n or has a point of period 1, if there exists a point
] such that
In particular, C := has distinct elements (each of which is a point of period n) and is called a cycle (or orbit) with period n.
Observe that since f : [] is continuous, it must have a point of period 1, i.e., a fixed point.
Sharkovsky’s Theorem: Recently, Chatziafratis et al. (2020) used the period of f to derive period-specific depth-width trade-offs via Sharkovsky’s theorem (Sharkovsky, 1964, 1965) from dynamical systems that provides restrictions on the allowed periods f can have:
Definition 1.2. Define the following (decreasing) ordering Sharkovsky’s ordering:
We write or
whenever l is to the left of r and this gives a total ordering on the natural numbers.
Observe that the number 3 is the largest according to this ordering. Sharkovsky showed a surprising and elegant result about his ordering: it describes which numbers can be periods for a continuous map on an interval; allowed periods must be a suffix of his ordering:
Theorem 1.1 (Sharkovsky’s Theorem). If f contains period n and , then f also contains period
According to Sharkovsky’s Theorem, 3 is the maximum period, so one important and easy-to-remember corollary is that period 3 implies all periods.
We finally need the definition of a prime period for f:
Definition 1.3 (Prime period). A function f has prime period n as long as it contains period n, but has no periods greater than n, according to the Sharkovsky
Notice that for , its prime period is 2, since
, which implies that it also has fixed point (
1.2 Classification, L1 and L∞ errors
Telgarsky (2016) proved that f can be the output of a deep NN, for which any function g belonging to a family of shallow, yet extremely wide NNs, will incur high approximation error. He used the most satisfying measure for lower bounding the approximation error between f and g, which was the error. We say
is satisfying, because if the
distance between two functions is large, then certainly there are sets of positive measure in the domain where they differ. Just to make the point clear, if
was used, it wouldn’t imply good depth separations, since
is extremely sensitive even to single point differences. Of course, the situation gets reversed if instead the goal is to obtain distance upper bounds, for which
is the most desirable. On the other hand, the classification error used in (Chatziafratis et al., 2020; Telgarsky, 2015) (for exact definition, see Section 2) is a much weaker notion of approximation, that does not seem appropriate for comparing continuous functions, since f and g can have large classification distance, yet still be the same, almost everywhere (i.e., their
is arbitrarily close to zero). An explanation for this is depicted in Figure 1.
Figure 1: A comparison between two functions that agree almost everywhere. The error is small, however the classification error can be artificially inflated, hence leading to separations, based on unsatisfactory notions of approximation. Similarly, one should not rely on using
error to get separation results, which is also large in this example.
To get his bound, Telgarsky presented a simple and highly symmetric construction based on the triangle (or tent) map, which can be thought of as a combination of just two ReLUs. Later he used it to argue that repeated compositions of this map with itself (equivalently, concatenating layers one after the other) yield highly oscillatory outputs. Since functions g generated by shallow networks cannot possibly have so rapidly changing oscillations, he relied on symmetries due to the triangle map and he estimated areas where the two functions differ in order to get a lower bound between g and the triangle compositions.
However, here we can no longer use the specific tent map, since we generalize the constructions based only on the periods of the functions; hence all the symmetries and regularities used to derive the bound are gone. For us, the challenge will be to bound the
error based on the periods. For example, in a special case of our result, when the function f has period 3, only 3 values of the function are known on 3 points in the domain. Can one use such limited information to bound the
error against shallow NNs? The natural question that arises is the following:
Is it possible to obtain period-specific depth-width trade-offs based on error instead of classification error using only information about the periods?
Surprisingly, the answer is yes and at a high level, we show that the oscillations arising by function compositions are not pathologically concentrated only on “tiny” sets in the domain. Specifically, we carry this out by exploiting some new connections between the prime period of f, its Lipschitz constant and the growth rate of the number of oscillations when taking compositions with itself.
1.3 Periods, Lipschitz constant and Oscillations
A byproduct of our analysis will be that given two “design” parameters for a function f, its prime period and Lipschitz constant, we will be able to construct hard-to-represent functions with these parameters, or say it is impossible. This gives a better understanding between those apparently unrelated quantities. To do this we rely on the oscillations that appear after composing f with itself multiple times, as the underlying thread connecting the above notions.
We can show that the Lipschitz constant always dominates the growth rate of the number of oscillations as we take compositions of f with itself. Moreover, whenever its Lipschitz constant matches this growth rate, we can prove that its repeated compositions cannot be approximated by shallow NNs (where the depth of the NN is sublinear in the number of compositions). Finally, we can characterize the number of oscillations in terms of the prime period of the function of interest. These findings provide bounds between the three quantities,: prime period, Lipschitz constant and oscillations.
1.4 Our contributions
We now have the vocabulary to state and interpret our results. For simplicity, we will give informal statements that hold against ReLU NNs, but everything goes through for semi-algebraic gates and for higher dimensions as well. Our first result connects the periods with the number of oscillations and improves upon the bounds obtained in Chatziafratis et al. (2020).
Theorem 1.2. Let f : [] have odd prime period p > 1. Then, there exist points
], so that the number of oscillations between x and y is
where
is the root greater than one of the polynomial:
Our second result ties the Lipschitz constant with the depth-width trade-offs under the mation.
Theorem 1.3. Let -Lipschitz, and g be any ReLU NN with u units per layer and l layers. Suppose there exist numbers
], such that the oscillations of
between x, y are Θ(
for some constant
. As long as
, then for any NN g that has width-depth such
that (2, we get the desired
-separation:
where c(x, y) depends on x, y but not on t.
The above theorem implies depth separations, since if the depth of the “shallow” network g is l = o(t), then even exponential width u will not suffice for a good approximation.
Given the above understanding regarding the Lipschitz constant, the periods and the number of oscillations, it is now easy to construct infinite families of functions that are tight in the sense that they achieve the depth-width trade-offs bounds promised by our theorem for any period p (see Lemma 3.6).
Observe that the largest root of the polynomial in the statement of Theorem 1.2 is always larger than. This implies a sharp transition for the depth-width trade-offs, since the oscillations growth rate will be at least
contains an odd period. Previous results, only acquired a base in the exponent that would approach 1, as the (odd) period p increased, and it is known that if f does not contain odd factors in its prime period, then the oscillations can grow only polynomially quickly (Chatziafratis et al., 2020).
Finally, in our experimental section we give a simple regression task based on periodic functions, that validates our obtained bound and we also demonstrate how the error drops as we increase the depth.
In this section we provide some important definitions and facts that will be used for the proofs of our main results. First we define the notion of crossings/oscillations.
Definition 2.1 (Crossings/Oscillations). A continuous function ] crosses the interval [x, y] with
] if there exist
], such that f(c) = x and f(d) = y. Moreover we denote C
) the number of times f crosses [x, y]. It holds C
) = t if there exist numbers
We next mention the definition of covering relation between two intervals . This notion is crucial because as we shall see later, it enables us to define a graph and analyze the spectral properties of its adjacency matrix. Bounding the spectral norm of the adjacency matrix from below will enable us to give lower bounds on the number of crossings/oscillations.
Definition 2.2 (Covering relation). Let f be a function and be two closed intervals. We say that
, denoted by
We conclude this section with the definition of and classification error.
Definition 2.3 (For two functions
distance is:
Definition 2.4 (Classification error). If we specify a collection of one can define the classification error of a function g to be:
where ˜g(z) = 1[] is the thresholded value of g based on some chosen threshold v (e.g., v could be
In this section, we provide characterizations of continuous L-Lipschitz functions f : [], the compositions of which cannot be approximated (in terms of
error) by shallow NNs. For the rest of this section, we assume that there exist
] with x < y, such that the number of oscillations is: C
, where
is a constant greater than one (we shall call
the growth rate of the oscillations) and C is some positive constant.
The lemma below formalizes the idea that a highly oscillatory function needs to have large Lipschitz constant, by showing that
Lemma 3.1 (Lower bound on L). Let f : [] be as above. It holds that
is at least
is another positive constant.
Proof. Without loss of generality let n be even, and let it denote the number of oscillations between x, y of the function , i.e.,
. Let
be the points such that
) = x and
) = y for 0
. Since
has Lipschitz constant
, it holds that
for all 0
1. By adding these inequalities, we get a telescoping sum, and we conclude:
An immediate corollary of Lemma 3.1 is as desired.
3.1 Lipschitz matches oscillations rate for L1 error
In this section, we give sufficient conditions for a class of functions f, so that it cannot be approximated (in sense) by shallow ReLU NNs, and we will later extend it to semi-algebraic gates. The key statement is that the Lipschitz constant of such a function should match the growth rate of the number of oscillations.
Assume that ] is a neural network with l layers and u nodes (activations) per layer. It is known that a ReLU NN with u ReLU’s per layer and with l layers is piecewise affine with at most (2
From now on, let h := for ease of presentation. We define as ˜h(z) = 1[
] and ˜
] for some chosen values of
] to be defined later (as we shall see, x, y
are just points for which h oscillates between them). Let be the partition of [a, b], where ˜
are piecewise constant respectively. We also define ˜
the collection of intervals with the extra assumption that there exists w in each of them such that h(w) = y or h(w) = x. Finally define a maximal (in cardinality) sub-collection of intervals
in such a way that if
are consecutive intervals in ˜
, the image
and the image of
contains y (or vice-versa), that is there is an alternation between x, y. It follows (Telgarsky, 2015) that
Moreover, one can show the following claim for any interval , and we will use this later:
Claim 1. Let
Proof. Firstly, observe that h is Lipschitz with constant by definition and without loss of generality let’s assume x < y. In what follows, we make use of the intermediate value theorem for continuous functions.
First we consider the case where there exists a
Let c < d, with so that h(c) = h(d) =
+
and
+
for
] and
. It is clear that
Finally, by the fact that h is Lipschitz with constant , it follows that (
. The claim for the case there exists
follows by substitution. See also Figure 2.
Similarly, we consider the case in which there exists a such that h(w) = x. Let c < d, with
so that h(c) = h(d) =
+
and
+
for
] and
] with f(w) = x. It is clear that
Again using the fact that h is Lipschitz with constant , it follows that (
. The claim for the case in which there exists a
by substitution.
As mentioned in the beginning, a sufficient condition for a function f to be hard-to-represent, is that its Lipschitz constant should be exactly equal to the base in the growth rate of the number of oscillations. In Telgrasky’s paper, the function used was the tent map that has Lipschitz constant equal to 2, and the oscillations growth rate under repeated compositions also had growth rate 2. This is not a coincidence and here we generalize this observation.
Figure 2: Graphs demonstrating the proof described in Claim 1 with respect to the Lipschitz constant. The lower bound on the area computed by the integral would hold even if the repeated compositions created piecewise concave functions (left); however, for convex functions no guarantee on the area of the triangle (hence no separation) can be derived (right).
Theorem 3.2 (Lipschitz matches oscillations). Let f : [] be L-Lipschitz, and g be any ReLU NN with u units per layer and l layers. Suppose there exist x, y such that the oscillations of
between x, y are Θ(
) for some constant
1. As long as
(by Lemma 3.1 we already know that
, then we get the desired
-separation:
where c(x, y) depends on x, y but not on t.
Proof. We will prove a lower bound for the distance between h :=
and an arbitrary g from the aforementioned family of NNs with (2
It is clear that is at least the number of crossings C(
), hence we conclude that
Θ(
). It follows that
Larger Lipschitz: Observe that if we didn’t require that and instead
, no meaningful
guarantee could be derived since the term (
would shrink for large t (see also Figure 2).
Semi-algebraic activation units: Our results can be easily generalized for the general class of semi-algebraic units (see Telgarsky (2016) for definitions). The idea works as follows: Any neural network that has activation units that are semi-algebraic, it is piecewise polynomial, therefore piecewise monotone (the pieces depend on the degree of the polynomial, which in turn depends on the specifications of the activation units). Therefore, the function ˜g (as defined above) is piecewise constant and defines a partition of the domain [a, b]. The crucial observation is that the size of this partition is bounded by a number that depends exponentially on the number of layers (i.e., layers appear in the exponent) and polynomially on the number of units per layer (i.e., width is in the base). Finally, our results can be applied for the multivariate case. As in Telgarsky (2016), we handle this case by first choosing an affine map ) and considering functions
3.2 Periodicity and Lipschitz constant
In this subsection, we improve the result of Chatziafratis et al. (2020), by showing that functions f of odd period p > 1 have points x < y so that the number of oscillations between x and y is , where
is the root greater than one of the polynomial equation
This consists an improvement from the previous result in Chatziafratis et al. (2020) that states that the growth rate of the oscillations of compositions of a function with p-periodic point is the root greater than one of the polynomial 1 = 0 (observe that the two aforementioned polynomials coincide for p = 3). This is true because if
are the roots of
and
= 3 for which we have
. This gives better depth-width trade-offs for any value of the (odd) period.
Moreover, if is the Lipschitz constant of f, then Lemma 3.2 applies and any shallow neural network g (with (2
) has
distance bounded away from zero for any number of compositions of f.
We first need the following structural lemma Alsedà et al. (2000).
Lemma 3.3 (Monotonicity Alsedà et al. (2000)). Let p > 1 be an odd number and consider f : [] with prime period p. Then there exists a cycle of period p with points
such that
Figure 3: The covering relations of intervals from Corollary 3.4 are shown here. Note the existence of more directed edges when the period is odd and greater than 3, compared to the graph used to quantify the growth in Chatziafratis et al. (2020). This allows us to obtain improved bounds for p > 3.
This lemma will help us define an appropriate covering relation, to be used later in order to bound the number of oscillations in . Towards this goal, we set
= [
= [
] for 1
and
= [
] for 1
. From Lemma 3.3, we trivially have the following covering relations.
Corollary 3.4. It holds that
• I
•
•
• J
Let be the adjacency matrix of the covering relation graph above (the intervals denote the nodes of the graph):
We define which is in
to keep track of how many times
crosses specific intervals. The coordinate
captures the number of times
crosses interval
for
and the coordinate
captures the number of times
crosses interval
. We get that
where 1) (all ones vector).
Claim 2. The characteristic polynomial of has the same roots as
Proof. For p = 3, the desired equation holds, since the matrix becomes just
with characteristic polynomial (1)
1 =
1. Let I denote the identity matrix of size (
1). Assume
5. We consider the matrix:
where
Observe that = 0 is not an eigenvalue of the matrix
. Suppose that
are the four block submatrices of the matrix above. Using Schur’s complement, we get that det(
), where det(
,. . . , the i-th row by
(and so on) and add them to the last row. Let B be the resulting matrix:
where It is clear that the equation det(B) = 0 has the same roots as det(
) = 0. Since B is an upper triangular matrix, it follows that
We conclude that the eigenvalues of (and hence of A) must be roots of
=
and the claim follows.
The following corollary establishes a connection between the growth rate of the oscillations of compositions of function f with its prime period. Also, we establish a universal sharp threshold phenomenon demonstrating that the width needs to grow at a rate at least as large asas the function contains an odd period (this is in contrast with previous depth separation results where the growth rate converges to one as the period p goes to infinity).
Corollary 3.5. Let be a continuous function with prime odd period p > 1. There exist x, y such that the number of oscillations between x, y of
is Θ(
) where
is the positive root greater than one of
. Moreover,
is decreasing in
for all p.
Proof. We first need to relate the spectral radius with the number of oscillations. We follow the idea from Chatziafratis et al. (2020) which concludes that (where spec(A) denotes the spectral radius), that is the growth rate of the number of oscillations of compositions of f is at least
Assume 1 < p be an odd number. It suffices to show that (and then use induction). Observe that
1. Therefore
hence since 1 we conclude that
0. Since lim
) = +
, by Bolzano’s theorem it follows that
has a root in the interval (
. One can also see that
1 =
0 and 2
0, thus from Bolzano’s again, it follows that
Remark We note that (the growth rate is at least
) whereas the growth rate in
Chatziafratis et al. (2020) was converging to one as
We now provide tight constructions for a family of functions f that have points of period p (thus the number of oscillations of t compositions of , i.e., the growth rate is
) and moreover the Lipschitz constant is
Theorem 3.2, this family cannot be approximated by shallow neural networks in
Lemma 3.6. Let p > 1 be an odd number and be the largest positive root greater than one of the polynomial
1 = 0. The function f : [
1]
1], defined to be f(x) :=
1 has Lipschitz constant
and has period p.
Proof. It suffices to show that f has period p (the Lipschitz constant is trivially ). We start from
= 0 and we get
. Observe that
Set
) =
. First, we shall show that for
, we have
0 and that
), whereas for t even, we have
1 in the interval above.
For t = 3 we get that 1 =
0 because we showed
is decreasing in p and moreover holds
) = 0. Since
0 we get that
1 =
1. Let us show that
0. Observe that
We will use induction. Assume now, that we have the result for some t even, we need to show that 1 and moreover
By induction, we have that 1)
is decreasing in
) = 0, we conclude that
0. Since
0, we get that
1 =
1. To finish the claim, it suffices to show that
0. Observe that
The term 1) and moreover
is decreasing in p and
0 and the induction is complete.
From the above, we conclude that ) = 0, thus
form a cycle. If we show that
are distinct, the proof of the lemma follows.
First observe that is strictly increasing in t as long as
(by computing the derivative). Therefore it holds that
= 0 (for all the odd indices) and also
Furthermore,
1 is decreasing in
, therefore we conclude
also
We will show that and finally
and the lemma will follow. Recall
and
1. Equivalently, we need to show that
0 which holds because
3.3 Sensitivity to Lipschitzness and separation examples based on periods
We end this section with some simple examples that illustrate the behavior of the aforementioned family of functions f(x) := 1 for different parameters and the corresponding depth-width trade-offs that can be derived. As a consequence, we will observe how similar-looking functions can actually have vastly different behaviors with regards to oscillations, periods and hence depth separations (see Figure 4).
We consider three regimes. The first regime corresponds to the functions that appear in Lemma 3.2, where and
], where
=
618 is the golden ratio. The second regime corresponds to the case when
and the third regime corresponds to the case when
. We can see in Figure 5 that the function
1 has period 3 and a Lipschitz constant of L = 2, while in Figure 6, we can see that the function f(x) := 1
1, does not have any odd period and L = 1.2.
Figure 5 and Figure 6 correspond to cases where the Lipschitz constant of the function does not match
• When, we see from Figure 4, how small differences in the values of the slope can lead to the existence of different (prime) periods, which consequently lead to different depth-width trade-offs.
• When , we can see from Figure 5 that L = 2 and also the growth rate of oscillations is 2. This means that
separation is achievable. Note that period 3 is present in the tent map, so
for this case.
• When , we can see from Figure 6 that the oscillations do not grow exponentially with compositions and that the existing ones are of small magnitude, which means that the
error can be made arbitrarily small. Observe here that no odd period is present in the function (as this would imply that
Our goal here is to experimentally validate our theoretical results by exploring the interplay between optimization/representation error bounds obtained in theory. For instance, in order to understand how training with random initialization works on compositions of periodic functions, we combine the example from Lemma 3.6 together with Theorem 3.2; in particular, theory suggests that for a fixed width u and depth l, as long as the condition stated in the theorem is satisfied, i.e., (2then we have an error bound that is independent of the width and depth. We indeed validate this and the
error we get from theory almost matches the experimental error (see Figure 7).
Rewriting the condition of the theorem, for a fixed width and depth, there is a large
, that always produces constant error. To test that, we create a “hard task” that satisfies this above equation for depths l = 1, 2, 3, 4, 5, for constant width u = 20. On the other end of the spectrum, we create a relatively “easy task” (with fewer compositions) and study how the error varies with depth. We define a regression task and we fix the neurons for each layer to be 20.
Figure 4: The family of functions parameterized by 465 correspond to period 3, 5 and 7 respectively. Observe how slight perturbations of the function can lead to different trade-offs.
Figure 5: Here L = 2, and this function has period 3. However, the growth rate of oscillations is exactly 2 and since we have equality separations even though the largest root
Figure 6: Here L = 1.2 that corresponds to the regime where . It follows that this function cannot have any odd period (because then
). Observe that the oscillations do not grow exponentially fast and they shrink in area, hence no
separation is achievable.
We vary the depth of the NN (excluding the input and the output layer) from l = 1 to l = 5, adding one extra hidden layer at a time. We are using the same parameters to train all networks and we require the training error or the mean squared error to tend to 0 during the training procedure, i.e, we try to overfit the data (here we try to demonstrate a representation result, rather than a statistical/generalization result). Thus, during training we use the same parameters to train all the different models using Adam with epochs being 1500 to enable overfitting. To record the training error, we verify that the training saturates based on the error of the last epoch. We will now describe the “easy” and the “hard” task:
The Regression Task: We create a regression task that is based on the theory that ties periods with Lipschitz constants. The function we want to fit is the composition of f(x) := 1 with itself. For
= 1.618 (golden ratio), the Lipschitz is also
and exhibits period 3. The hardness of the task is characterized by the number of compositions of the function f, as the oscillations increase exponentially fast with the number of compositions. Thus for the hard task we use
and for the easy task we use
Figure 7: Left: Experimental error closely matches the theoretical bound for the hard task (and we get the
separation (as long as u, l are related as required by Theorem 3.2). Right: Generally, the performance of the NN for the regression task improves with depth when the task is easy (
In conclusion, by combining some ideas from dynamical systems related to periodic orbits and growth rate of oscillations, we presented several results for functions that are expressible with NNs of certain depth, yet are hard-to-represent with shallow, wide NNs. These results generalize and unify previous results on depth separations.
One potential direction for future work, that could further unify the different notions of “complexity” considered in previous works, is to explore connections between the notion of topological or metric entropy (see Alsedà et al. (2000)) and oscillations, periods and VC dimension. The goal here would be to derive general results stating that whenever a function f has large topological entropy, it is harder to approximate its repeated compositions using shallow NNs, as opposed to a function with smaller topological entropy.
Vaggos Chatziafratis is partially supported by an Onassis Foundation Scholarship. Sai Ganesh Nagarajan would like to acknowledge SUTD President’s Graduate Fellowship (SUTD-PGF). Ioannis Panageas would like to acknowledge SRG ISTD 2018 136, NRF for AI Fellowship and NRF2019-NRF-ANR095. Part of this project happened while the authors were visiting the MIT MIFODS program “Complex Structures” and we would like to thank the organizers for their hospitality.
Lluís Alsedà, Jaume Llibre, and Michał Misiurewicz. Combinatorial dynamics and entropy in dimension one. 2000.
Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.
Vaggos Chatziafratis, Sai Ganesh Nagarajan, Ioannis Panageas, and Xiao Wang. Depth-width trade-offs for relu networks via sharkovsky’s theorem. International Conference on Learning Representations, Addis Ababa, Africa, 2020.
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907–940, 2016.
Boris Hanin and David Rolnick. Deep relu networks have surprisingly few activation patterns. In Advances in Neural Information Processing Systems, pages 359–368, 2019.
John Hastad. Almost optimal lower bounds for small depth circuits. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 6–20. Citeseer, 1986.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
Daniel M Kane and Ryan Williams. Super-linear gate and super-quadratic wire lower bounds for depth-two and depth-three threshold circuits. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 633–643. ACM, 2016.
Joe Kileel, Matthew Trager, and Joan Bruna. On the expressive power of deep polynomial neural networks. arXiv preprint arXiv:1905.12207, 2019.
Tien-Yien Li and James A Yorke. Period three implies chaos. The American Mathematical Monthly, 82(10):985–992, 1975.
Eran Malach and Shai Shalev-Shwartz. Is deeper better only when shallow is good? arXiv preprint arXiv:1903.03488, 2019.
Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems, pages 3360–3368, 2016.
Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2847–2854. JMLR. org, 2017.
Michael Schmitt. Lower bounds on the complexity of approximating continuous functions by sigmoidal neural networks. In Advances in neural information processing systems, pages 328–334, 2000.
OM Sharkovsky. Coexistence of the cycles of a continuous mapping of the line into itself. Ukrainskij matematicheskij zhurnal, 16(01):61–71, 1964.
OM Sharkovsky. On cycles and structure of continuous mapping. Ukrainskij matematicheskij zhurnal, 17(03):104–111, 1965.
Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.
Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pages 1517–1539, 2016.