Memory capacity of neural networks with threshold and ReLU activations

2020·Arxiv

Abstract

Abstract

Overwhelming theoretical and empirical evidence shows that mildly overparametrized neural networks – those with more connections than the size of the training data – are often able to memorize the training data with 100% accuracy. This was rigorously proved for networks with sigmoid activation functions [23, 13] and, very recently, for ReLU activations [24]. Addressing a 1988 open question of Baum [6], we prove that this phenomenon holds for general multilayered perceptrons, i.e. neural networks with threshold activation functions, or with any mix of threshold and ReLU activations. Our construction is probabilistic and exploits sparsity.

1. Introduction

This paper continues the long study of the memory capacity of neural architectures. How much information can a human brain learn? What are fundamental memory limitations, and how should the “optimal brain” be organized to achieve maximal capacity? These questions are complicated by the fact that we do not sufficiently understand the architecture of the human brain. But suppose that a neural architecture is known to us. Consider, for example, a given artificial neural network. Is there a general formula that expresses the memory capacity in terms of the network’s architecture?

1.1. Neural architectures. In this paper we study a general layered, feedforward, fully connected neural architecture with arbitrarily many layers, arbitrarily many nodes in each layer, with either threshold or ReLU activation functions between all layers, and with the threshold activation function at the output node.

Readers unfamiliar with this terminology may think of a neural architecture as a computational device that can compute certain compositions of linear and nonlinear maps. Let us describe precisely the functions computable by a neural architecture. Some of the best studied and most popular nonlinear functions , or “activation functions”, include the threshold function and the rectified linear unit (ReLU), defined by

respectively.We call a map pseudolinear if it is a composition of an affine transformation and a nonlinear transformation applied coordinate-wise. Thus, Φ : is pseudolinear map if it can be expressed as

where matrix of “weights”, is a vector of “biases”, and is either the threshold or ReLU function (1.1), which we apply to each coordinate of the vector

of the type

where Φ, . . . , Φpseudolinear maps. Each of maps Φmay be defined using either the threshold or ReLU function, and mix and match is allowed. However, for the purpose of this paper, we require

parameters of the given neural architecture. Varying these free parameters one can make a given neural architecture compute different functions . Let us denote the class of such functions computable by a given architecture by

Figure 1. A neural architecture with an input layer, two hidden layers, and an output node. The class of functions this architecture can compute is denoted F(3, 4, 2, 1).

A neural architecture can be visualized as a directed graph, which consists of L layers each having ), and one output node. Successive layers are connected by bipartite graphs, each of which represents a pseudolinear map Φ. Each neuron is a little computational device. It sums all inputs from the neurons in the previous layer with certain weights, applies the activation function to the sum, and passes the output to neurons in the next layer. More specifically, the neuron determines if the sum of incoming signals from the previous layers exceeds a certain firing threshold b. If so, the neuron fires with either a signal of strength 1 (if is the threshold activation function) or with strength proportional to the incoming signal (if is the ReLU activation).

1.2. Memory capacity. When can a given neural architecture remember a given data? Suppose, for example, that we have K digital pictures of cats and dogs, encoded as as vectors , and labels where 0 stands for a cat and 1 for a dog. Can we train a given neural architecture to memorize which images are cats and which are dogs? Equivalently, does there exist a function 1) such that

A common belief is that this should happen for any sufficiently overparametrized network – an architecture that that has significantly more free parameters than the size of the training data. The free parameters of our neural architecture are the weight matrices and the bias vectors . The number of biases is negligible compared to the number of weights, and the number of free parameters is approximately the same as the number of

Thus, one can wonder whether a general neural architecture is able to memorize the data as long as the number of connections is bigger than the size of the data, i.e. as long as

Motivated by this question, one can define memory capacity of a given architecture as the largest size of general data the architecture is able to memorize. In other words, the memory capacity is the largest K such that for a generalset of points and for any labels there exists a function ) that satisfies (1.2).The memory capacity is clearly bounded above by the vc-dimension,for neural architectures with threshold activation functions [7] and O(LW log W) for neural architectures with ReLU activation functions [5]. Thus, our question is whether these bounds are tight – is memory capacity (approximately) proportional to W?

1.3. The history of the problem. A version of this question was raised by Baum [6] in 1988. Building on the earlier work of Cover [8], Baum studied the memory capacity of multilayer perceptrons, i.e. feedforward neural architectures with threshold activation functions. He first looked at the network architecture [n, m, 1] with one hidden layer consisting of m nodes (and, as notation suggests, n nodes in the hidden layer and one output node). Baum noticed that for data points in general position in , the memory capacity of the architecture [n, m, 1] is about nm, i.e. it is proportional to the number of connections. This is not difficult: general position guarantees that the hyperplane spanned by any subset of n data points misses any other data points; this allows one to train each of the m neurons in the hidden layer on its own batch of n data points.

Baum then asked if the same phenomenon persists for deeper neural networks. He asked whether for large K there exists a deep neural architecture with a total of ) neurons in the hidden layers and with memory capacity at least K. Such result would demonstrate the benefit of depth. Indeed, we just saw that shallow architecture [1] has capacity just , which would be smaller than the hypothetical capacity K of deeper architectures for

There was no significant progress on Baum’s problem. As Mitchison and Durbin noted in 1989, “one significant difference between a single threshold unit and a multilayer network is that, in the latter case, the capacity can vary between sets of input vectors, even when the vectors are in general position” [17]. Attempting to count different functions F that a deep network can realize on a given data set (), Kowalczyk writes in 1997: “One of the complications arising here is that in contrast to the single neuron case even for perceptrons with two hidden units, the number of implementable dichotomies may be different for various n-tuples in general position... Extension of this result to the multilayer case is still an open problem” [15].

The memory capacity problem is more tractable for neural architectures in which the threshold activation is replaced by one of its continuous proxies such as ReLU, sigmoid, tanh, or polynomial activation functions. Such activations allow neurons to fire with variable, controllable amplitudes. Heuristically, this ability makes it possible to encode the training data very compactly into the firing amplitudes.

Yamasaki claimed without proof in 1993 that for the sigmoid activation and for data in general position, the memory capacity of a general deep neural architecture is lower bounded W, the number of connections [23]. A version of Yamasaki’s claim was proved in 2003 by Huang for arbitrary data and neural architectures with two hidden layers [13].

In 2016, Zhang et al. [25] gave a construction of an arbitrarily large (but not fully connected) neural architecture with ReLU activations and whose memory capacity is proportional to both the number of connections and the number of nodes. Hardt and Ma [12] gave a different construction of a residual network with similar properties.

Very recently, Yun et at. [24] removed the requirement that there be more nodes than data, showing that the memory capacity of networks with ReLU and tahn activation functions is proportional to the number of connections. Ge et al. [11] proved a similar result for polynomial activations.

Significant efforts was made in the last two years to justify why for overparametrized networks, the gradient descent and its variants could achieve 100% capacity on the training data [10, 9, 16, 26, 1, 14, 27, 18, 20, 2, ?]; see [21] for a survey of related developments.

1.4. Main result. Meanwhile, the original problem of Baum [6] – determine memory capacity of networks with threshold activations – has remained open. In contrast to the neurons with continuous activation functions, neurons with threshold activations either not fire at all of fire with the same unit amplitude. The strength of the incoming signal is lost when transmitted through such neurons, and it is not clear how the data can be encoded.

This is what makes Baum’s question hard. In this paper, we (almost) give a positive answer to this question.

Why “almost”? First, the size of the input layer should not affect the capacity bound and should be excluded from the count of the free parameters W. To see this, consider, for example, the data points all laying on one line; with respect to such data, the network is equivalent to one with = 1. Next, ultra-narrow bottlenecks should be excluded at least for the threshold nonlinearity: for example, any layer with just = 1 node make the number of connections that occur in the further layers irrelevant as free parameters.

In our actual result, we make somewhat stronger assumptions: in counting connections, we exclude not only the first layer but also the second; we rule out all exponentially narrow bottlenecks (not just of size one); we assume that the data points are unit and separated; finally, we allow logarithmic factors.

be positive integers, and set max(. Consider unit vectors that satisfy

Consider any labels . Assume that the number of deep connections W :=

as well as . Then the network can memorize the label assignment exactly, i.e. there exists a map

Here C and c denote certain positive absolute constants.

In short, Theorem 1.1 states that the memory capacity of a general neural architecture with threshold or ReLU activations (or a mix thereof) is lower bounded by the number of the deep connections. This bound is independent of the depth, bottlenecks (up to exponentially narrow), or any other architectural details.

1.5. Should the data be separated? One can wonder about the necessity of the separation assumption (1.4). Can we just assume that are distinct? While this is true for ReLU and tanh activations [24], it is false for threshold activations. A moment’s thought reveals that any pseudolinear map from transforms any line into a finite set such of cardinality O(m). Thus, by pigeonhole principle, any map from layer 1 to layer 2 is non-injective on the set of K data points – which makes it impossible to memorize some label assignments – unless ). In other words, if we just assume that the data points are distinct, the network must have at least as many nodes in the second layer as the number of data points. Still, the separation assumption (5.2) does not look tight and might be weakened.

1.6. Related notions of capacity. Instead of requiring the network to memorize the training data with 100% accuracy as in Theorem 1.1, one can ask to memorize just 1fraction, or just a half of the training data correctly. This corresponds to a relaxed, or fractional memory capacity of neural architectures that was introduced by Cover in 1965 [8] and studied extensively afterwards.

this architecture can realize on a given finite set points . When this set is the Boolean cube , this amounts to counting all Boolean functions the architecture can realize. The binary logarithm of the number of all such Boolean functions was called (expressive) capacity by Baldi and the author [3, 4]. For a neural architecture with all threshold activations and L layers, the expressive capacity is equivalent to the cubic polynomial in the sizes of layers

up to an absolute constant factor [4]. The factor min() quantifies the effect of any bottlenecks that occur before layer i.

Similar results can be proved for the restricted expressive capacity where we count the functions F the architecture can realize on a given finite set of , Section 10.5]. Ideally, one might hope to find that all 2functions can be realized on a general K-element set, which would imply that the memory capacity is at least K. However, the current results on restricted expressive capacity are not tight enough to reach such conclusions.

2. The method

Our construction of the function F in Theorem 1.1 is probabilistic. Let us first illustrate our approach for the architecture [n, n, n, 1] with two hidden layers, and with threshold activations throughout. We would like to find a composition of pseudolinear functions

that fits the given data (

The first two maps Φwhose only purpose is spread the data in the space, transforming it into an almost orthogonal set. Specifically, Φwill transform the separated points (1)-orthogonal points will transform the o(1)-orthogonal points )-orthogonal points ), and, finally, the ) will fit the data: Ψ(

2.1. Enrichment. Our construction of the enrichment maps ΦBoth maps will have the form

making almost orthogonal for small p.

and are exactly orthogonal. Nevertheless, our heuristic that sparsity induces orthogonality still works in this setting. To see this, let us check that the correlation of the coefficients Φ(x) and Φ() is small even if are far from being orthogonal. A standard asymptotic analysis of the tails of the normal distribution implies that

if are unit and -separated (Proposition 3.1), and

if are unit and -orthogonal (Proposition 3.2).

thus sufficiently small to make the factor exp(2constant, i.e.

Finally, we choose the separation threshold sufficiently large to make the factor 2 exp(in (2.1) less than

confirming our claim that Φ() tend to be -orthogonal provided that -separated. Similarly, (2.2) gives

confirming our claim that Φ() tend to be ()-orthogonal provided that -orthogonal.

2.2. Perception. As we just saw, the enrichment process transforms our data points )-orthogonal vectors . Let us now find a perception map Ψ that can fit the labels to the data: Ψ(

Consider the random vector

where the signs are independently chosen with probability 1/2 each. Then, separating the k-th term from the sum defining w and assuming for simplicity that are unit vectors, we get

Taking the expectation over independent signs, we see that

if . This yields

Since , the “mirror perceptron”

2.3. Deeper networks. The same argument can be repeated for networks with variable sizes of layers, i.e. [n, m, d, 1]. Interestingly, the enrichment works fine even if making the lower-dimensional data almost orthogonal even in very high dimensions. This explains why (moderate) bottlenecks – narrow layers – do not restrict memory capacity.

The argument we outlined allows the network [n, m, d, 1] to fit around d data points, which is not very surprising, since we expect the memory capacity be proportional to the number of connections and not the number of nodes. However, the power of enrichment allows us to boost the capacity using the standard method of batch learning (or distributed learning).

Let us show, for example, how the network [n, m, d, r, 1] with three hidden layers can fit data points (). Partition the data into r batches each having O(d) data points. Use our previous result to train each of the r neurons in the fourth layer on its own batch of O(d) points, while zeroing out all labels outside that batch. Then simply sum up the results. (The details are found in Theorem 7.1.)

This result can be extended to deeper networks using stacking, or unrolling a shallow architecture into a deep architecture, thereby trading width for depth. Figure 2 gives an illustration of stacking, and Theorem 7.2 provides the details. A similar stacking construction was employed in [4].

The success of stacking indicates that depth has no benefit formemorization purposes: a shallow architecture [n, m, d, r, 1] can memorize roughly as much data as any deep architecture with the same number of connections. It should be noted, however, that training algorithms commonly used by practitioners, i.e. variants of stochastic gradient descent, do not seem to lead to anything similar to stacking; this leaves the question of benefit of depth open.

2.4. Neural networks as preconditioners. As we explained in Section 2.1, the first two hidden layers of the network act as preconditioners: they transform the input vectors vectors that are almost orthogonal. Almost orthogonality facilitates memorization process in the deeper layers, as we saw in Section 2.2.

The idea to keep the data well-conditioned as it passes through the network is not new. The learning rate of the stochastic gradient descent (which we are not dealing with here) is related to how well conditioned is the so-called gradient Gram matrix H. In the simplest scenario where the activation is ReLU and the network has one hidden layer of infinite size, matrix with entries

Much effort was made recently to prove that H is well-conditioned, i.e. its smallest singular value of H is bounded away from zero, since this can be used to establishes a good convergence rate for the stochastic gradient descent, see the papers cited in Section 1.3 and especially [1, 10, 9, ?]. However, existing results that prove that H is well conditioned only hold for very overparametrized networks, requiring at least nodes in the hidden layer [?]. This is a much stronger overparametrization requirement than in our Theorem 1.1.

On the other hand, as opposed to many of the results quoted above, Theorem 1.1 does not shed any light on the behavior of stochastic gradient descent, the most popular method for training deep networks. Instead of training the weights, we explicitly compute them from the data. This allows us to avoid dealing with the gradient Gram matrix: our enrichment method provides an explicit way to make the data well conditioned. This is achieved by setting the biases high enough to enforce sparsity. It would be interesting to see if similar preconditioning guarantees can be achieved with small (or even zero) biases and thus without exploiting sparsity.

A different form of enrichment was developed recently in the paper [4] which showed that a neural network can compute a lot of different Boolean functions. Toward this goal, an enrichment map was implemented in the first hidden layer. The objective of this map is to transform the input set (the Boolean cube ) into a set on which there are lots of different threshold functions – so that the next layers can automatically compute lots of different Boolean functions. While the general goal of this enrichment map in [4] is the same as in the present paper – achieve a more robust data representation that is passed to deeper layers – the constructions of these two enrichment maps are quite different.

2.5. Further observations and questions. As we saw in the previous section, we utilized the first two hidden layers of the network to preprocess, or enrich, the data vectors made us skip the sizes of the first two layers when we counted the number of connections W. If these vectors are already nice, no enrichment may be necessary, and we have a higher memory capacity.

Suppose, for example, that the data vectors in Theorem 1.1 are )-orthogonal. Then, since no enrichment is needed in this case, the conclusion of the theorem holds with

which is the sum of all connections in the network. If, on the other hand, the data vectors in Theorem 1.1 are only )-orthogonal, just the second enrichment is needed, and so the conclusion of the theorem holds with

which is the sum of all connections between the non-input layers.

This make us wonder: can enrichment be always achieved in one step instead of two? Can one find a pseudolinear map Φ : that transforms a given set of -separated vectors 01) into a set of )-orthogonal vectors? If this is possible, we would not need to exclude the second layer from the parameter count, and Theorem 1.1 would hold for

A related question for further study is to find an optimal separation threshold assumption in Theorem 1.1, and to remove the assumption that vectors. Both the logarithmic separation level of and the normalization requirement could be artifacts of the enrichment scheme we used.

There are several ways Theorem 1.1 could be extended. It should not be too difficult, for example, to allow the output layer have more than one node; such multi-output networks are used in classification problems with multiple classes.

Finally, it should be possible to extend the analysis for completely general activation functions. Threshold activations we treated are conceptually the hardest case, since they act as extreme quantizers that restrict the flow of information through the network in the most dramatic way.

2.6. The rest of the paper. In Section 3 we prove bounds (2.1) and (2.2) which control , the correlation of the coefficients of the coordinates of Φ(immediately controls the expected inner product Section 4 we develop a deviation inequality to make sure that the inner product is close to its expectation with high probability. In Section 5, we take a union bound over all data points and thus control all inner products simultaneously. This demonstrates how enrichment maps Φ make the data almost orthogonal – the property we outlined in Section 2.1. In Section 6, we construct a random perception map Ψ as we outlined in Section 2.2. We combine enrichment and perception in Section 7 as we outlined in Section 2.3. We first prove a version of our main result for networks with three hidden layers (Theorem 7.1); then we stack shallow networks into an arbitrarily deep architecture proving a full version of our main result in Theorem 7.2.

In the rest of the paper, positive absolute constants will be denoted notation ) means that ) for some absolute constant C and for all values of parameter x. Similarly, ) means that another positive absolute constant.

We call a map ) for some nonnegative constant and some pseudolinear map Φ. For the ReLU nonlinearity, almost pseudolinear maps are automatically pseudolinear, but for the threshold nonlinearity this is not necessarily the case.

3. Correlation decay

Let ) and consider the random process

which is indexed by points x on the unit Euclidean sphere in can be either the threshold of ReLU nonlinearity as in (1.1), and is a fixed value. Due to rotation invariance, the correlation of only depends on the distance between Although it seems to be difficult to compute this dependence exactly, we will demonstrate that the correlation of decays rapidly in b. We will prove this in two extreme regimes – where are just a little separated, and where are almost orthogonal.

3.1. Correlation for separated vectors. Cauchy-Schwarz inequality gives a trivial bound

with equality when . Our first result shows that if the vectors -separated, this bound can be dramatically improved, and we have

Proposition 3.1 (Correlation for separated vectors). Consider a pair of unit vectors be a number that is larger than a certain absolute constant. Then

Proof. Step 1. Orthogonal decomposition. Consider the vectors

To check this claim, note that if both are greater than Expressing this implication as

This yields (3.2) for the ReLU nonlinearity

Step 2. Taking expectation. Substitute ) into the bound (3.2) and take expectation on both sides. We get

Denote

Since x = u + v is a unit vector and u, v are orthogonal, we have 1 = . Therefore, the random variable in the right side of (3.4) is distributed identically with 1), and we obtain

Step 3. Stability. Now use the stability property of the normal distribution, which we state in Lemma A.2. For a = b larger than a suitable absolute constant, , and for either the threshold or ReLU nonlinearity , we see that

Combine this with (3.5) to complete the proof.

3.2. Correlation for almost orthogonal vectors. We continue to study the covariance

structure of the random process are orthogonal, are independent and we have

In this subsection, we show the stability of this equality. The result below implies that if x and are almost orthogonal, namely

Proposition 3.2 (Correlation for -orthogonal vectors). Consider a pair of vectors

for some be a number that is larger than a certain absolute constant. Then

In order to prove this proposition, we first establish a more general stability property:

be as in Proposition 3.2. Then, for any measurable function

Proof. Consider the 2 whose rows are , and define the function

Since the vector Ag has coordinates , we have Ψ(Thus

where

The assumptions on

Indeed, the first bound is straightforward. To verify the second bound, note that each entry of the matrix Σ is bounded in absolute value by . Thus, the operator norm of Σ bounded by 2, which we can write as Σ in the positive-semidefinite order. This implies that Σ, and multiplying both sides of this relation by get the second bound in (3.7).

Substitute (3.7) into (3.6) to obtain

by independence. Lemma 3.3 is proved.

Proof of Proposition 3.2. By assumption,

where the last inequality follows by monotonicity; see Lemma A.3 for justification. Now we use the stability property of the normal distribution that we state in Lemma A.2. For a = b larger than a suitable absolute constant and , it gives for both threshold of ReLU nonlinearities the following:

where the last bound follows since 8. Combining this with (3.8), we obtain the first conclusion of the lemma. Next, Lemma 3.3 gives

Now we again use the stability property of the normal distribution, Lemma A.2, this time for . It gives for both threshold of ReLU nonlinearities the following:

Combining this with (3.9) gives

where the last step follows since 8. This completes the proof of Proposition 3.2.

4. Deviation

In the previous section, we studied the covariance of the random process

where is either the threshold of ReLU nonlinearity as in (1.1), ) is a standard normal random variable, and is a fixed value. Consider a multivariate version of this process, a random pseudolinear map Φ : components are independent copies of . In other words, define

where ) are independent standard normal random vectors. We are interested in how the map Φ transforms the distances between different points. Since

the bounds on we proved in the previous section describe the behavior of Φ in expectation. In this section, we use standard concentration inequalities to ensure a similar behavior with high probability.

Lemma 4.1 (Deviation). Consider a pair of vectors and let be a number that is larger than a certain absolute constant. Define

Proof. Step 1. Decomposition and truncation. By construction, deviation from the mean is

where These two normal random variables are possibly correlated, and each has zero mean and variance bounded by 4.

We will control the sum of i.i.d. random variables in (4.2) using Bernstein’s concentration inequality. In order to apply it, we first perform a standard truncation of the terms of the sum. The level of truncation will be

where is a sufficiently large absolute constant. Consider the random variables

Then we can decompose the sum in (4.2) as follows:

Step 2. The residual is small. Let us first control the residual, i.e. the second sum on the right side of (4.4). For a fixed i, the probability that is nonzero can be bounded by

where 1). In the second inequality, we used that are normal with mean zero and variance at most 4. The third inequality follows from the asymptotic (A.2) on the tail of the normal distribution and our choice (4.3) of the truncation level M with sufficiently large

Taking the union bound we see that all vanish simultaneously with probability at least 1 . Furthermore, by monotonicity,

where in the last step we used generalized H¨older’s inequality. Now, for the threshold nonlinearity , the terms obviously equal 1/2, and for the ReLU nonlinearity these terms are bounded by the fourth moment of the standard normal distribution, which equals 3. Combining this with (4.5) gives

The last bound holds because otherwise we have 1 0 and the statement of the proposition holds trivially.

Step 3. The main sum is concentrated. To bound the first sum in (4.4), we can use Bernstein’s inequality [22], which we can state as follows. If are independent random variables and 0, then with probability at least 1

where . In our case, it is easy to check that for both threshold and ReLU nonlinearities

and

by definition of p in (4.1). Apply Bernstein’s inequality (4.8) for

where is a suitably large absolute constant. We obtain that with probability at least 1

Here we used the choice of M we made in (4.3) and s in (4.9). Combining the bounds on the residual (4.7) and on the main sum (4.10) and putting them into the decomposition (4.4), we conclude that with probability at least 1

The proof is complete.

5. Enrichment

In the previous section, we defined a random pseudolinear map

where is either the threshold of ReLU nonlinearity as in (1.1), ) are independent standard normal random vectors, and b is a fixed value.

We will now demonstrate the ability of Φ to “enrich” the data, to move different points away from each other. To see why this could be the case, choose the value of b to be moderately large, say . Then with high probability, most of the random variables will fall below b, making most of the coordinates Φ(x) equal zero, thus making Φ(x) a random sparse vector. Sparsity will tend to make Φ() almost orthogonal even when x and are just a little separated from each other.

To make this rigorous, we can use the results of Section 3 to check that for such b, the coordinates of Φ() are almost uncorrelated. This immediately implies that Φ(x) and Φ() are almost orthogonal in expectation, and the deviation inequality from Section 4 then implies that the same holds with high probability. This allows us to take a union bound over all data points and conclude that Φ() are almost orthogonal for all distinct data points.

As in Section 3, we will prove this in two regimes, first for the data points that are just a little separated, and then for the data points that are almost orthogonal.

5.1. In this part we show that the random pseudolinear map Φ transforms separated data points into almost orthogonal points.

Lemma 5.1 (Enrichment I: from separated to -orthogonal). Consider a pair of unit vectors

for some be numbers such that

Consider the random pseudolinear map Φ : . Then with probability at least 1 , the vectors

Proof. Step 1. Bounding the bias b. We begin with some easy observations. Note that is bounded above by 2 and below by . Thus, by setting the value of sufficiently large, we can assume that m is arbitrarily large, i.e. larger than any given absolute constant. Furthermore, the restrictions on

so p is arbitrarily small, smaller than any given absolute constant. The function is continuous, takes an absolute constant value at t = 0, and tends to zero as Thus the equation has a solution, so b is well defined and

To get a better bound on b, one can use (A.2) for the threshold nonlinearity and Lemma A.1

Since , this and (5.3) yield

Step 2. Controlling the norm. Applying Lemma 4.1 for , we obtain with probability at least 1

Divide both sides by mp to get

where the second inequality follows from our choice of p with large . We proved the first conclusion of the proposition.

Step 3. Controlling the inner product. Proposition 3.1 gives

where in the last step we used the bounds (5.4) on b and the separation assumption (5.2) with a sufficiently large constant . Now, applying Lemma 4.1, we obtain with probability at least 1

Divide both sides by mp to obtain

where the last step follows from the bound (5.5) on q and our choice of p with a sufficiently large . This is an even stronger bound than we claimed.

Theorem 5.2 (Enrichment I: from separated to -orthogonal)

for all distinct . Assume that . Then there exists an almostpseudolinear map such that the vectors

for all distinct indices i, j = 1, . . . , K.

Proof. Apply Lemma 5.1 followed by a union bound over all pairs of distinct vectors we chose N = 2mK, then the probability of success is at least 1proof is complete. 5.2. In this part we show that a random pseudolin- ear map Φ : makes almost orthogonal data points even closer to being orthogonal: Φ reduces the inner products from a small constant

The pseudolinear map Φ considered in this part will have the same form as in (5.1), but for different dimensions:

where is either the threshold of ReLU nonlinearity as in (1.1), ) are independent standard normal random vectors, and b is a fixed value.

Lemma 5.3 (Enrichment II: from -orthogonal to -orthogonal). Consider a pair of vectors

for some 0 be numbers such that

Then with probability at least 1 , the vectors

satisfy

Proof. Step 1. Bounding the bias b. Following the beginning of the proof of Lemma 5.1, we can check that b exists and

Step 2. Controlling the norm. Applying Proposition 3.2, we see that

where in the last step we used the bound (5.8) on b and the assumption on with a sufficiently small constant . Then, applying Lemma 4.1 for , we obtain with probability at least 1

where we used our choice of p and the restriction on N with sufficiently small constant Divide both sides by dp to get

which is the first conclusion of the proposition.

Step 3. Controlling the inner product. Proposition 3.2 gives

where the last inequality follows as before from bound (5.8) on b and the assumption on with sufficiently small

Next, we will use the following inequality that holds for all sufficiently large a > 0:

For the threshold nonlinearity , this bound is trivial even without the factor a in the right side. For the ReLU nonlinearity, it follows from Lemma A.1 in the Appendix. Therefore, we have

Now, applying Lemma 4.1, we obtain with probability at least 1

Divide both sides by dp to obtain

where in the last step we used (5.10) and our choice of

Theorem 5.4 (Enrichment II: from -orthogonal to -orthogonal)

for all distinct . Assume that . Then there exists an almost pseudolinear map such that the vectors

Proof. Apply Lemma 5.3 followed by a union bound over all pairs of distinct vectors we chose N = 2dK, then the probability of success is at least 1 proof is complete.

6. Perception

The previous sections were concerned with preprocessing, or “enrichment”, of the data. We demonstrated how a pseudolinear map can transform the original data points can be just a little separated, into -orthogonal points (1), and further into -orthogonal points ). In this section we train a pseudolinear map that can memorize any label assignment -orthogonal points

We will first try to train a single neuron to perform this task assuming that the number K of the data points is smaller than the dimension d, up to a logarithmic factor. Specifically, we construct a vector so that the valuesare small whenever large whenever = 1. Our construction is probabilistic: we choose random independent signs, and show that w succeeds with high probability.

Lemma 6.1 (Perception)and consider vectors

for all distinct i, j. Consider any labels of which equal 1. Assume that . Then there exists a vector that satisfies the following holds for every k = 1, . . . , K:

be independent Rademacher random variables and define

We are going to show that the random vector w satisfies the conclusion of the proposition with positive probability.

Let us first check the conclusion (6.2) for k = 1. To this end, we decompose follows:

To bound the noise, we shall use Hoeffding’s inequality (see e.g. [22, Theorem 2.2.2]), which can be stated as follows. If are any fixed numbers and 0, then with probability at least 1

Using this for , we conclude that with probability at least 1

where we used (6.1) and the assumption on with a sufficiently small constant

as claimed.

Repeating this argument for general k, we can obtain the same bounds fortake the union bound over the K choices of k. The random vector satisfies the conclusion with probability at least 1 0. The proof is complete.

Lemma 6.1 essentially says that one neuron can memorize the labels of O(d) data points in neurons should be able to memorize the labels of O(dr) data points in make this happen, we can partition the data into r batches of size O(d) each, and train each neuron on a different batch. The following lemma makes this formal; to see the connection, apply it for

Theorem 6.2 (One layer). Consider a number as in Lemma 6.1. Assume that (2a positive integer. Then there exists a pseudolinear map such that for all

Proof. Without loss of generality, assume that the first of the labels equal 1 and the rest equal zero, i.e. . Partition the indices of the nonzero labels (“batches”), each of cardinality at most 2+ 1. For each batch i, define a new set of labels

In other words, the labels are obtained from the original labels by zeroing out the labels outside batch i.

2+ 1, so we can use this number instead of , noting that the condition (21) log required in Lemma 6.1 does hold by our assumption. We obtain a vector that satisfies the following holds for every k = 1, . . . , K:

Define the pseudolinear map Φ(as follows:

If does not belong to any batch ) implies that . Then both 8 are negative, and since ) = 0 for negative t, all coordinates of ) are zero, i.e.

Conversely, if ) = 0 then, by construction, for each must be nonpositive, which yields16. Thus, by (6.3), k may not belong to any batch , which means that , and this implies

7. Assembly

In this section we prove a general version of our main result. Let us first show how to train a network with four layers. To this end, choose an enrichment map from layer 1 to layer 2 to transform the data from merely separated to -orthogonal, choose a map from layer 2 to layer 3 to further enrich the data by making it )-orthogonal, and finally make a map from layer 3 to layer 4 memorize the labels. This yields the following result:

Theorem 7.1 (Shallow networks). Consider unit vectors that satisfy

Consider any labels of which equal 1. Assume that

as well as . Then there exists a map such that for all k = 1, . . . , K we have:

Apply Theorem 5.2 with that the required constraints in that result hold by our assumptions with small obtain an almost pseudolinear map such that the vectors

for all distinct i, j.

Apply Theorem 5.4. We obtain an almost pseudolinear map such that the vectors

for all distinct indices i, j.

Step 3. Perception. Apply Theorem 6.2. (Note that our assumptions with small enough ensure that the required constraint (2does hold.) We obtain a pseudolinear map such that the vectors

Since E and R are almost pseudolinear and P is pseudolinear, F can be represented as a composition of three pseudolinear maps (by absorbing the linear factors), i.e. Also, by construction, so the proof is complete.

Finally, we can extend Theorem 7.1 for arbitrarily deep networks by distributing the memorization tasks among all layers evenly. Indeed, consider a network with L layers and with nodes in layer i. As in Theorem 7.1, first we enrich the data, thereby making the input to layer 3 almost orthogonal. Train the map from layer 3 to layer 4 to memorize the labels of the first ) data points using Theorem 6.2 (for ). Similarly, train the map from layer 4 to layer 5 to memorize the labels of the next ) data points, and so on. This allows us to train the network on the total of data points, where W is the number of “deep connections” in the network, i.e. connections that occur from layer 3 and up. This leads us to the main result of this paper.

Theorem 7.2 (Deep networks)be positive integers, and set and . Consider unit vectors that satisfy

Consider any labels of which equal 1. Assume that the number of deep connections

as well as . Then there exists a map such that for all k = 1, . . . , K we have:

We stated a simplified version of this result in Theorem 1.1. To see the connection, just take the ‘OR’ of the outputs of all nodes of the last layer.

Proof. Step 1. Initial reductions. For networks with L = 4 layers, we already proved the result in Theorem 7.1, so we an assume that 5. Moreover, for K = 1 the result is trivial, so we can assume that 2. In this case, if we make the constant c in our assumption 2 ) sufficiently small, we can assume that (and thus also all arbitrarily large, i.e. larger than any given absolute constant.

the labels equal 1 and the rest equal zero, i.e. . Partition the indices of the nonzero labels into subsets (“batches”) so that

(This is possible since the numbers sum to one.) For each batch i, define a new set of labels

In other words, are obtained from the original labels by zeroing out the labels outside batch i.

for the number of nonzero labels instead of 3. Thus, if

as well as

Moreover, when we factorize into three almost pseudolinear maps, then , the enrichment map from , is trivially independent of i, so

hold. In order to check (7.2), we will first note a somewhat stronger bound than (7.1) holds, namely we have

Indeed, if then using that

when W is sufficiently large. If then using that

where the last step follows from (7.1). Hence, we verified (7.6) for the entire range of W. Now, to check (7.2), note that

where we used that is arbitrarily large and the assumption on K with a sufficiently small constant c. Combining this bound with (7.6), we obtain

if C is sufficiently large. We have checked(7.2).

Step 4. Stacking. To complete the proof, it suffices to construct map with the following property:

Indeed, this would imply that ) = 0 happens iff ) = 0 for all i, which, according to (7.4) is equivalent to = 0 for all i. By definition of , this is further equivalent to for any i, which by construction of is equivalent to , which is finally equivalent to = 0, as claimed.

Figure 2. Trading width for depth: stacking shallow networks into a deep network.

in Figure 2. To help us stack these maps, we drop some nodes from the original network and first construct

with some ; we can then extend F trivially to ). As Figure 2 suggests, we choose these layers if 3. Note that by definition of indeed have We made this choice so that the network can realize the maps . As Figure 2 illustrates, the map 3) is realized by setting the factor E : to map the first layer to the second, the factor second layer to the last 3 nodes of the third layer, and the factor map further to the last 3 nodes of the fourth layer. Moreover, the output of the second layer is transferred to the first 3 nodes of the third layer by the identity mapcan realize the next map , and so on.

the components of the output of , i.e. the last 3 nodes of the fourth layer, are summed together and added to any node (say, the last node) of the fifth layer; the components of the output of , i.e. the last 3 nodes of the fourth layer, are summed together and added to the last node of the sixth layer, and so on. For ReLU nonlinearity, the + refers to addition of real numbers; for threshold nonlinearity, we replace adding by taking the maximum (i.e. the ‘OR’ operation), which is clearly realizable.

Step 5. Conclusion. Due to our construction, the sum of all components of the function F(x) computed by the network equals the sum (or maximum, for threshold nonlinearity) of all components of all functions ). Since the components are always nonnegative, F(x) is zero iff all components of all functions ) are zero. In other words, our claim (7.7) holds.

The asymptotical expansion of Mills ratio for the normal distribution is well known, see [19]. For our purposes, the first three terms of the expansion will be sufficient:

In particular, the tail probability of the standard normal random variable 1) satisfies

The following two lemmas give asymptotical expressions for the expected value of the first two moments of the random variable (0) where, as before, is standard normal.

Lemma A.1 (ReLU of the normal distribution)

Proof. Expressing expectation as the integral of the tail (see e.g. [22, Lemma 1.2.1]), we have

Using substitution 2, we see that the value of the first integral on the right hand side is . Using the Mills ratio asymptotics (A.1) for the second integral, we get

Integrating by parts, we find that the first integral on the right side equals

the second integral equals as before, and the third integral equals Ψ(bining these and using the asymptotical expansion (A.1) for Ψ(a), we conclude that

This completes the proof of the second part of the lemma.

Proof. Use the asymptotics in (A.2) and Lemma A.1 for and simplify.

We complete this paper by proving an elementary monotonicity property for Gaussian integrals, which we used in the proof of of Proposition 3.2.

Lemma A.3 (Monotonicity)be a nondecreasing function satisfying is a nondecreasing function on [0

Proof. Denoting by f(x) the probability density function of N(0, 1), we have

The last step follows since, by assumption, ) = 0 for all x < 0. To complete the proof, it remains to note that for every fixed 0, the function ) is nondecreasing.

References

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. arXiv preprint arXiv:1811.03962, 2018.

[2] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimiza- tion and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.

[3] Pierre Baldi and Roman Vershynin. On neuronal capacity. In Advances in Neural Information Processing Systems, pages 7729–7738, 2018.

[4] Pierre Baldi and Roman Vershynin. The capacity of feedforward neural networks. Neural Networks, 116:288–311, 2019.

[5] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.

[6] Eric B Baum. On the capabilities of multilayer perceptrons. Journal of complexity, 4(3):193–215, 1988.

[7] Eric B Baum and David Haussler. What size net gives valid generalization? In Advances in neural information processing systems, pages 81–90, 1989.

[8] Thomas M Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers, (3):326–334, 1965.

[9] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

[10] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over- parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

[11] Rong Ge, Runzhe Wang, and Haoyu Zhao. Mildly overparametrized neural nets can memorize training data efficiently. arXiv preprint arXiv:1909.11837, 2019.

[12] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

[13] Guang-Bin Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274–281, 2003.