b

DiscoverSearch
About
My stuff
Quasi-Equivalence of Width and Depth of Neural Networks
2020·arXiv
Abstract
Abstract

While classic studies proved that wide networks allow universal approximation, recent research and successes of deep learning demonstrate the power of deep networks. Based on a symmetric consideration, we investigate if the design of artificial neural networks should have a directional preference, and what the mechanism of interaction is between the width and depth of a network. Inspired by the De Morgan law, we address this fundamental question by establishing a quasi-equivalence between the width and depth of ReLU networks in two aspects. First, we formulate two transforms for mapping an arbitrary ReLU network to a wide network and a deep network respectively for either regression or classification so that the essentially same capability of the original network can be implemented. Then, we replace the mainstream artificial neuron type with a quadratic counterpart, and utilize the factorization and continued fraction representations of the same polynomial function to construct a wide network and a deep network, respectively. Based on our findings, a deep network has a wide equivalent, and vice versa, subject to an arbitrarily small error.

Keywords: Deep networks, wide networks, ReLU networks, quasi-equivalence, network transformation

Over the past years, deep learning (Goodfellow et al., 2016; Fan et al., 2018a) has become the mainstream approach of machine learning and achieved the state-of-the-art performance in many important tasks (Dahl et al., 2011; Kumar et al., 2016; Chen et al., 2017; Wang, 2016). One of the key reasons that accounts for the success of deep learning is the increased depth, which allows a hierarchical representation of features. There are a number of papers dedicated to explaining why deep networks are better than shallow ones. Encouraging progress has been made along this direction. The idea to show the superiority of deep networks is basically to find a special family of functions that are very hard to be approximated by a shallow network but easy to be approximated by a deep network, or that a deep network can express complicated functions that a wide network could not (Szymanski and McCane, 2014; Cohen et al., 2016; Mhaskar and Poggio, 2016; Eldan and Shamir, 2016; Montufar et al., 2014; Bianchini and Scarselli, 2014). For example, in Eldan and Shamir (2016), a special class of radial functions was constructed so that a one-hidden-layer network needs to use an exponential number of neurons to obtain a good approximation, but a two-hidden-layer network only requires a polynomial number of neurons for the same purpose. With the number of linear regions as the complexity measure, Montufar et al. (2014) showed that the number of linear regions grows exponentially with the depth of a network but only polynomially with the width of a network. In Bianchini and Scarselli (2014), a topological measure was utilized to characterize the complexity of functions. Then, it was shown that deep networks can represent more complex functions than what the shallow counterparts express. Besides, width-bounded but depth-unbounded universal approximators were also developed (Lu et al., 2017; Lin and Jegelka, 2018; Fan et al., 2018c) in analogy to the depth-bounded but width-unbounded universal approximators (Funahashi, 1989; Hornik et al., 1989).

Recently, the effects of width are discussed by more and more studies (Cheng et al., 2016; Chen and Liu, 2017; Zagoruyko and Komodakis, 2016). Since width and depth are the most basic topology measures of a neural network, exploring the roles of width and depth in neural networks is a problem of strong interest and importance. Currently, there exist both width-bounded and depth-bounded universal approximators. Since both width-bounded and depth-bounded networks can represent any function, they can represent each other as well, which suggests the width-depth equivalence of neural networks. Nevertheless, how a neural network learns a mapping is quite different from the way used in proving the universal approximation. Moreover, the core of the width-depth conversion is to employ a network to learn another network instead of any function. Therefore, the width-depth conversion based on universal approximation falls short to capture the relationship between width and depth.

Specifically, we argue that the width-depth conversion via universal approximation is simplistic, inefficient, and lack of insight: 1) (Simplistic) As mentioned earlier, the way used in enabling universal approximation is to divide a target function into many functions over tiny hypercubes. In practice, a network usually does not do so, e.g., a ReLU network divides the space into polytopes. 2) (Inefficient) The published universal approximation analyses do not consider the character of a target function, and just inefficiently divide the function into many pieces over tiny hypercubes. However, in the width-depth transformation, the problem is how to express a wide (or deep) network using a deep (or wide) network. Thus, the properties of networks should be used to make a much more efficient transformation. 3) (Lack of insight) Universal approximation theoretically guarantees the representation capability of neural networks but it does not give any new insight in terms of difference in consumed computational resources and a practical procedure for the transformation from a wide (or deep) network to a deep (or wide) counterpart. As a small summary, utilizing universal approximation theorems to do width-depth transformation do not accurately characterize the width-depth conversion. We still do not fully resolve essential questions on width and depth, e.g., how to non-trivially do transformation between a wide and a deep network, and what the mechanism of interaction is between the width and depth of a network.

To fill this gap, inspired by the De Morgan law, here we demonstrate from two perspectives that the width and depth of neural networks are quasi-equivalent. The first perspective leverages that a ReLU network is a piecewise linear function over polytopes, while second perspective utilizes the nested structure of deep networks and the parallel structure of wide networks. Specifically, in the first perspective, we revisit the De Morgan law:

image

where  Aiis a propositional rule (e.g., IF input ∈ [ai, bi]m, THEN inputbelongs to some class), and such rules are disjoint. A neural network can be linked to a rule-based system such as a collection of propositional rules. Straightforwardly, we can construct either a deep network to realize a union of propositional rules (left side) or a wide network that realizes the complement of the intersection of those rules after complement (right side). As a result, the constructed deep and wide networks are equivalent to each other. Furthermore, we elaborate the quasi-equivalence of general regression and classification networks by constructing two transforms mapping an arbitrary ReLU network to a wide network and a deep network, respectively, thereby verifying a general quasi-equivalence of the width and depth of ReLU networks. Our constructive scheme is largely based on the fact that a ReLU network partitions the space into polytopes (Chu et al., 2018). This enables us to have a simplicial complex in the space and then to establish a quasi-equivalence of networks using the essential building blocks, fan-shaped (more generally, hyper-cone-shape) functions, in the form of modularized ReLU networks.

Table 1: Network structures and complexities through transformation of regression and classification networks. D is the input dimension, and M is the complexity measure of a function class represented by ReLU networks.

image

In the second perspective, we replace the mainstream artificial neuron type with a quadratic counterpart and extend our first perspective by utilizing the factorization and continued fraction representations of the same univariate polynomial to construct wide and deep networks, respectively. Specifically, a univariate polynomial function can be expressed as follows:

image

where  ai ̸= 0, and rj, sj, tj and bl, clare related. Previously, inspired by neuronal diversity, our group designed the quadratic neuron (Fan et al., 2018a) that replaces the inner product in a conventional neuron with a quadratic function. Due to the merits of the idea, the network based on quadratic neurons have been increasingly studied and applied (Bu and Karpatne, 2021; Ji et al., 2021; Mantini and Shah, 2021; Xu et al., 2022). Here, we can construct a wide quadratic network and a deep quadratic network to implement the left side and right side of Eq. (2), respectively. This establishes the equivalence between wide and deep quadratic networks. Finally, we generalize such an equivalence into a multivariate setting based on Kolmogorov-Arnold theorem (Kolmogorov, 1956).

Our main contribution is the establishment of the width-depth quasi-equivalence of neural networks. We summarize our main results on ReLU networks (the first perspective) and quadratic networks (the second perspective) in Tables 1 and 2, respectively. Specifically, Table 1 lists the width and depth of the wide and deep networks constructed in our first perspective, where the complexity measure M is the minimum number of simplices needed to cover the polytopes formed by a ReLU network. Table 2 shows the width and depth of the wide and deep quadratic networks constructed in our second perspective, where the complexity measures  K1 and K2are the degrees of polynomials to represent a function of interest. Clearly, given a complexity measure, the width of the constructed wide network is greater than the depth, while the depth of the constructed deep network is greater than the width.

Table 2: Network structures and complexities through the extension of the De Morgan law. D is the input dimension, and  K1 and K2are the complexity measure of a function class.

image

To put our contributions in perspective, we would like to mention relevant studies. Jacot et al. (2018) proposed the theory of the neural tangent kernel (NTK), which provides a useful lens to understand a network when the width of a network goes to infinity. Kawaguchi et al. (2019) analyzed the effect of width and depth on the quality of local minima. They showed that the quality of local minima improves toward the global minima as depth and width become larger. Levine et al. (2020) revealed the width-depth interplay in a self-attention network. We discuss the width of neural networks as related to NTK in the Supplementary Information I. To the best of our knowledge, our study is the first work to reveal the width-depth quasi-equivalence of neural networks.

2.1 Preliminaries

For convenience, we use  σ(x) = max{0, x}to denote the ReLU function. We mainly discuss ReLU networks in this work, thus all networks in the rest of this paper are referred as ReLU networks unless otherwise specified. At the same time, we focus on the fully-connected ReLU networks.

Definition 1 A regression network is a network with continuous outputs, while a classifi-cation network is a network that produces categorical outputs (for example,  {0, 1, · · · , 9} fordigit recognition). The classification network is obtained by thresholding a ReLU network in the last layer. In this study, we investigate a classification network with binary labels without loss of generality.

Definition 2 (Width and depth of a feedforward network (Arora et al., 2016)) For any number of hidden layers  k ∈ N, input and output dimensions  w0, wk+1 ∈ N, a Rw0 →Rwk+1 feedforward network is given by specifying a sequence of k natural numbers  w1, w2, . . . , wkrepresenting widths of the hidden layers, a set of k affine transformations  Ti : Rwi−1 → Rwifor i = 1, . . . , k and a linear transformation  Tk+1 : Rwk → Rwk+1 corresponding to weights of the hidden layers. The function  f : Rn1 → Rn2 computed or represented by this network is

image

where  ◦denotes function composition, and  σis an activation function. The depth of a ReLU DNN is defined as k + 1. The width of a ReLU DNN is max  {w1, . . . , wk}.

Definition 3 (Width and depth of a shortcut network) Given a shortcut network  Π,we delete a minimum number of links such that the resultant network  Π′ is a feedforward network without any isolated neuron. Then, we define the width and depth of  Πas the width and depth of  Π′, respectively.

Over the past several years, increasingly diversified network architectures, such as randomly wired networks (Xie et al., 2019), networks with stochastic structures (Deng et al., 2020), etc. are used as backbones for deep learning. Our definitions for width and depth are applicable to many unusual network configurations. Moreover, they are also natural extensions of the conventional width and depth definitions and can make sense for common networks.

Definition 4 (Simplicial complex) A D-simplex S is a D-dimensional convex hull provided by convex combinations of D + 1 affinely independent vectors  {vi}Di=0 ⊂ RD. In other

words, S =

image

called a face of S. A simplicial complex  S =�

satisfying: 1) every face of a simplex from S is also in S; 2) the non-empty intersection of any two simplices  S1, S2 ∈ Sis a face of both  S1 and S2.

Proposition 1 Suppose that f(x) is a function represented by a ReLU network, then f is a piecewise linear function that splits the space into polytopes, where each polytope is convex and associated with a linear function.

Proof Inspired by the idea in (Chu et al., 2018), the proof here is for all ReLU networks, including networks using shortcuts. Let a vector  C = {c1, ..., cN}denote the firing states of all neurons in the network, where N is the total number of neurons, and  ci ∈ {0, 1}.ci = 0 means that the i-th neuron is not fired and vice versa. The firing state of every neuron is determined by the input x, then we denote the set of instances that share the same collective neuron firing state  Chas the polytope  Ph: Ph = {x | C(x) = Ch}. Forx ∈ Ph, the output of the neuron i is a linear function, denoted as  n(i)(x). Then, the firing state of the neuron i is controlled by a linear inequality (n(i)(x) > 0 or n(i)(x) ≤ 0). Intotal,  C(x) = Chis equivalent to a set of N linear inequality constraints, indicating that Phis a convex polytope.

Definition 5 We define the complexity of the function represented by a ReLU network as the minimum number of simplices: M that are needed to cover each and every polytope to support the function of the ReLU network.

image

Figure 1: (a) A one-hidden-layer network with three neurons to classify concentric rings. (b) A one-hidden-layer network with six neurons to classify concentric rings.

Here, we elaborate why M is a good measure. Previously, because a deep network with piecewise linear activation is a piecewise linear function, the number of linear regions (polytopes) was intensively studied to measure the complexity of a neural network. For example, Montufar et al. (2014) and Serra et al. (2018) estimated the upper and lower bounds of the number of linear regions with respect to the number of neurons at each layer. Xiong et al. (2020) computed the bounds for convolutional neural networks. Park et al. (2021) proposed neural activation coding to maximize the number of linear regions to improve the model performance. Despite these results, we find that there exists a problem with the number of linear regions as a complexity measure. It may happen that simple and complex networks realize the same number of regions for a given task. As shown in Figure 1, two networks divide the space into two linear regions to separate concentric rings. But one network uses three neurons to define a triangle domain, while the other has six neurons to form a hexagon. In this example, according to the number of linear regions, the two networks have the same complexity but the six-neuron network is apparently more complex than the other. To address this problem, the complexity of a linear region should be taken into account as well. We argue that how many simplices a linear region comprises indicates how complex a linear region is. Therefore, we propose to use the number of simplices as a legitimate complexity measure. In Figure 1, counting the number of simplices, the complexity of two networks are 2 and 4, respectively, which is a better characterization.

Let us estimate the lower bound of M. To this end, we need to take advantage of the lower bound of the number of polytopes. Empirical bounds of the number of polytopes (Np)in a feedforward ReLU network were estimated in (Montufar et al., 2014; Serra et al., 2018; Serra and Ramalingam, 2020), where one result in (Montufar et al., 2014) states that let ni, i = 1, · · · , L, be the number of neurons in the i-th layer, and D be the dimension of the input space,  Npis lower bounded by� �L−1i=1 [ niD ]D�· �Dj=0�nLj�. The polytopes constructed therein are hypercubes. Because the minimum number of simplices that fill a D-dimensional hypercube is 2D·D!(D+1)(D+1)/2 ,

image

Definition 6 We define a wide network and a deep network as follows. Let us assume a function that can be sufficiently complex and yet can be represented by a network. When such a function becomes increasingly complex, the structure of this network must be also increasingly complex, depending on the complexity of the function. We call a network wide if its width is larger than its depth by at least an order of magnitude in  M, e.g., O(Mα+1)vs  O(Mα), where Mis the complexity measure and  α > 0. Similarly, we call a network deep if its depth is larger than its width by at least an order of magnitude in M.

It is underscored that we use two different concepts: the complexity of the function class represented by networks and the structural complexity of a network. The former measures the complexity of the function, while the latter measures the topological structure of a network. In our transformation scheme, the structures of constructs/networks are determined by the complexity of the function of interest.

Definition 7 We call a wide network  N1 : Ω → Ris equivalent to a deep network  N2 : → R, if N1(x) = N2(x), ∀x ∈ Ω. We call a wide network  N1is quasi-equivalent to a deep network  N2, if there is  δ > 0, m({x ∈ Ω | N1(x) ̸= N2(x)} < δ, where mis a Lebesgue measurement defined on .

2.2 Motivating Example

An important school of neural network interpretability research (Fan and Wang, 2020; Adadi and Berrada, 2018) is to extract interpretable rules from a network (Thrun, 1995; Setiono and Liu, 1995; Saad and II, 2007) using decompositional or pedagogical methods (Thrun, 1995). Pedagogical methods decode a set of rules that imitate the input-output relationship of a network, whereas these rules do not necessarily correspond to the parameters of the network. One common type of rules are propositional in the IF-THEN format, where the preconditions are provided as a set of hypercubes with respect to the input: IF  input ∈ [ai, bi]m, THEN inputbelongs to some class. Since there is a connection between the rule-based inference and the network-based

inference, we consider a neural network in terms of propositional rules. Furthermore, we

image

Figure 2: The width and depth equivalence in light of the De Morgan law duality. In this construction, a deep network implements  A1 ∨ A2 · · · ∨ Anusing a trapezoid function, and a wide network implements  ¬�(¬A1) ∧ (¬A2) · · · ∧ (¬An)�usingthe trap-like function. (·)+ denotes the ReLU activation.

know that the De Morgan law holds true for disjoint propositional rules. Mathematically, the De Morgan law is formulated as

image

where  Aiis a rule, and  ¬Aiis its negation. The De Morgan law gives a duality in the sense of binary logic that the operations  ∨ and ∧are dual, which means that for any propositional rule system described by  A1 ∨A2 · · ·∨An, there exists an equivalent dual propositional rule system  ¬�(¬A1) ∧ (¬A2) · · · ∧ (¬An)�.

Regarding each rule as an indicator function over a hypercube:

image

in Figure 2, we construct a deep network that realizes a logic union of propositional rules (the left hand side of Eq. (5)) and a wide network that realizes the negation of the logic intersection of those rules after negation (the right hand side of Eq. (5)). As a result, the constructed deep and wide networks are equivalent by the De Morgan law.

The above motivating example inspires us to consider the width-depth equivalence in a broader domain. First, a ReLU network is a piecewise linear function over polytopes. To generate rules, such a piecewise linear function should be divided into simplices instead of hypercubes. As shown in Figure 3, we only need two rules if we build rules over simplices, which is much more efficient than building rules over hypercubes. Second, the network can be a regression network rather than a classification network. Thus, representing a linear function rather than an indicator function is demanded. Based on these two considerations, we generalize an indicator function over a hypercube to a linear function over a simplex  Siin a bounded domain:

image

Figure 3: Building rules over simplices is more efficient than over hypercubes for a network.

2.3 Quasi-Equivalence of Width and Depth of Networks

This section describes the first contribution of our paper. We formulate the transformation from an arbitrary ReLU network to a wide network and a deep network, respectively. We use a network-based building block to represent a linear function over a simplex. Integrating such building blocks can represent any piecewise linear function over polytopes, thereby elaborating a general equivalence of the width and depth of networks. Particularly, a regression ReLU network is converted into both a wide and a deep ReLU network (Theorem 9), while a classification ReLU network is a special case of a regression ReLU network (Theorem 10). In the regression networks, the transformation of a univariate network is rather different from that of a multivariate network. As a result, the equivalence for the wide and deep networks in the univariate case is precise, whereas the multivariate wide and deep networks are made approximately equivalent up to an arbitrarily small error. What’s more, in the multivariate case, the width of the wide network is not the same as the depth of the deep network. This is why we term such an equivalence as a quasi-equivalence.

2.3.1 Regression Networks

The sketch of transforming a regression ReLU network is that we first construct either a wide modular network or a deep modular network to represent the corresponding function over each and every simplex, then we aggregate the results into deep or wide networks in series or parallel, respectively, to represent the original network well.

Theorem 8 (Equivalence of Univariate Regression Networks) Given any ReLU network  f : [−B, B] → R, there is a wide ReLU network  H1 : [−B, B] → Rand a deep ReLU network  H2 : [−B, B] → R, such that  f(x) = H1(x) = H2(x), ∀x ∈ [−B, B].

Our main result is formally summarized as the following quasi-equivalence theorem for the multivariate case.

Theorem 9 (Quasi-Equivalence of Multivariate Regression Networks) Suppose that the representation of an arbitrary ReLU network is  h : [−B, B]D → R, and Mis the minimum number of simplices to cover the polytopes to support  h, for any δ > 0, there exist a wide ReLU network  H1 of width O�D(D + 1)(2D − 1)M�and depth D, and also a deep ReLU network  H2 of width (D + 1)D2 and depth O [(D + 1)M], satisfying that

image

where  m(·)is the standard measure in [−B, B]D.

We defer the proof of Theorem 8 to Supplementary Information II, and split the proof of Theorem 9 into the two-dimensional case (more intuitive) in Appendix and the general case in Supplementary Information II for better readability.

image

Figure 4: Due to the unboundedness and continuity, the representations in (Wang and Sun, 2005; He et al., 2018) is handicapped in representing a function over polytopes that make a non-convex region.

The key idea to represent a linear function over a simplex is to construct high-dimensional fan-shaped functions that are supported in fan-shaped domains, and to use these constructs to eliminate non-zero functional values outside the simplex of interest. This is a new and local way to represent a piecewise linear function over polytopes. In contrast, there are two global ways to represent piecewise linear functions (Wang and Sun, 2005; He et al., 2018). In (Wang and Sun, 2005), for every piecewise linear function  f : Rn → R, thereexists a finite set of linear functions  g1, · · · , gmand subsets  T1, · · · , TP ⊆ [m] such that f = �Pp=1 spmaxi∈Tp {gi}, where sp ∈ {−1, +1}, p = 1, · · · , P. In (He et al., 2018), the rep- resentation is  f = max1≤p≤Pmini∈Tp{gi}. Nevertheless, due to the unboundedness and continuity, the global representation of a piecewise linear function is handicapped over polytopes that make a non-convex region. Let us use Figure 4 to illustrate our point, where the relations of  g1, g2, g3are summarized in Table 3, and the function value over the purple area is zero.

Representation in Wang and Sun (2005); He et al. (2018):  f = max{g1, g2, g3}

Table 3: Regions and relations of functions.

image

Ours:  f = (g1){x∈Ω1} + (g2){x∈Ω2,1} + (g2){x∈Ω2,2} + (g2){x∈Ω2,3} + (g3){x∈Ω3}.

It can be seen that due to the unboundedness and continuity,  f = max{g1, g2, g3}is inaccurate over the purple area. But our representation is accurate over the purple area because it is local. We highlight the construction of fan-shaped functions, which opens a new door for high-dimensional piecewise function representation. Particularly, the employment of fan-shaped functions will enable a neural network to express a manifold more effectively.

image

Figure 5: Typical fan-shaped functions constructed by a modularized network to eliminate non-zero functional values outside the simplex of interest.

Since such a fan-shaped function is a basic building block in our construction of wide and deep equivalent networks, let us explain it in a two-dimensional case for easy visualization. An essential building block expressed by a network in Figure 5(a) is based on the following function:

image

where  h1(x) = p(1)1 x1 + p(1)2 x2 + r(1), and h2(x) = p(2)1 x1 + p(2)2 x2 + r(2)are provided by two linearly independent vectors  {(p(1)1 , p(1)2 ), (p(2)1 , p(2)2 )}, and µis a positive controlling factor. Eq. (7) is a ReLU network of depth=2 and width=2 according to our width-depth definition. As illustrated in Figure 5(b), the support of F(x) contains three boundaries and four polytopes (two of which only allow zero value of F). For convenience, given a linear function  ℓ(x) = c1x1 + c2x2 + c3, we define ℓ− = {x ∈ R2 | ℓ(x) < 0} andℓ+ = {x ∈ R2 | ℓ(x) ≥ 0}. Thus, we can write Ω1 = h+1 ∩ h−2 and Ω2 = (h1 − µh2)− ∩ h+2 .There are three properties of F(x). First, the common line shared by Ω1 and Ω2 is h2(x) = 0.Second, the size of Ω2is adjustable by controlling  µ. Note that  h1(x) − µh2(x) = 0 canmove very close to  h2(x) = 0 as µ → ∞, which makes Ω2negligible. In the limiting case, the support of F(x) converges to the fan-shaped domain Ω1. Because h1(x) − µh2(x) = 0is almost parallel to  h2(x) = 0 when µis big enough, we approximate the area of Ω2 asthe product of the length of  h2(x) = 0 within [−B, B]2 and the distance between two lines, which yields  |Ω2| ≤ 2√2B/µ. Third, the function F over the fan-shaped area Ω1 is h1.

Remark 1. As a ReLU network of interest partitions the space into more and more polytopes, the number of needed simplices will go increasingly larger. Because the lower bound of M in Eq. (4) is far larger than  D or (D +1)D2, the width of  H1(x) and the depth of  H2(x) will dominate. Furthermore, the width of  H1(x) is higher than its depth by an order of magnitude in terms of M, and the depth of  H2(x) is higher than its width in a similar way. Therefore,  H1(x) is a wide network, and  H2(x) is a deep network.

Remark 2. Compared to universal approximation, our construction, with the use of fan-shaped functions, is valuable in the following aspects: 1) Our construction utilizes the character of ReLU networks. It divides the space into finitely many simplices instead of infinitely many tiny hypercubes; 2) Given a target network, the complexity of our construction does not change with the prescribed error rate. In contrast, the complexity of the construction schemes used in the universal approximation analyses would increase as the preset error decreases. Therefore, our construction is much more efficient; 3) Our construction offers a new and local way to represent a piecewise linear function over polytopes, which is more flexible in representing discontinuous piecewise linear function than the global ways (Wang and Sun, 2005; He et al., 2018). Furthermore, inspired by the network structure used to construct the proposed fan-shaped function, we find that intra-layer links can enhance the representation capability of a shallow network, closely relevant to the published results on “depth separation”. Because intra-layer links greatly increase the number of pieces represented by a network, a shallow network with intra-layer links can express a complicated piecewise linear function as well as a deep network! For more details on this new finding, please see Supplementary Information IX for details.

2.3.2 Classification Networks

A regression network gives a continuous output, while a classification network produces categorical outputs (for example,  {0, 1, · · · , 9}for digit recognition). The classification network is derived by thresholding the output of the last layer of a ReLU network. In this study, we investigate a classification network with binary labels without loss of generality. We can directly build the equivalence for classification networks in the same way as the regression networks.

Theorem 10 (Quasi-Equivalence of Classification Networks) Without loss of generality for multi-class classification, we assume a binary output. Suppose that the representation of an arbitrary ReLU network is  h : [−B, B]D → {0, 1}, and Mis the minimum number of simplices to cover the polytopes to support  h, for any δ > 0, there exist a wide ReLU network  H1 of width O�D(D + 1)(2D − 1)M�and depth D, and also a deep ReLU network  H2 of width (D + 1)D2 and depth O [(D + 1)M], satisfying that

image

where  m(·)is the standard measure in [−B, B]D.

Proof The key is to regard the classification network as a special case of regression network. Then, applying the construction techniques used in the proof of Theorem 9 will lead to that for any  δ > 0,

image

which verifies the correctness of Theorem 10.

A classification neural network can be interpreted as a disjoint rule-based system  A1 ∨A2 · · ·∨Anby splitting the representation of a neural network into many decision polytopes: IF (input ∈certain polytope), THEN (input belongs to some class). Furthermore, each rule is a local function supported over a decision region.

Remark 3. The De Morgan equivalence in Figure 2 can be summarized as

image

when the rules are based on hypercubes. Theorem 10 corresponds to

image

when rules are based on simplices.

3.1 Preliminaries

Definition 11 Quadratic neurons (Fan et al., 2018a,b, 2020) integrate the n-variable input x with a quadratic function as follows before being nonlinearly processed:

image

where  wr, wg, wbare vectors of the same dimensionality as that of  x, br, bg, care biases, and  ⊙is the Hadamard product.

The network using quadratic neurons is interesting in several ways: 1) Over the past years the design of neural networks has been focusing on architectures, such as shortcut connections, transformer structure, etc. Almost exclusively, the mainstream deep learning models are constructed with neurons of the same type, which are characterized by the inner product and nonlinear activation. Despite that an artifical neural network was invented via biomimicry, the current artificial networks and a biological neural system are fundamentally different in terms of neuronal diversity and complexity. As we know, a biological neural system coordinates numerous types of neurons to support intellectual behaviors. To fill in this gap, we believe that neuronal diversity should be taken into account in machine learning; 2) Due to the enhanced expressive power at the neuronal level, quadratic neurons have been used in real-world problems and achieved superior performance, such as medical imaging (Fan et al., 2019), civil engineering (Ji et al., 2021), applied math (Bu and Karpatne, 2021), and so on. Given the utility of quadratic neurons, research on networks with quadratic neurons is attractive.

Hereafter, networks made of quadratic neurons are referred to as quadratic networks. We refer to the neurons using inner-product as the conventional neurons, and corresponding networks are called conventional networks. Please note that quadratic neurons are distinct from the conventional neurons that use quadratic activation. The decision boundary of the latter is still linear. The key characteristic of quadratic neurons is the employment of the quadratic function replacing the inner-product, and the choice of activation functions is flexible. Hereafter, we refer to networks made of quadratic neurons as quadratic networks.

Proposition 2 Any univariate polynomial of degree N can be perfectly computed by a quadratic ReLU network with the depth of log2(N)and the width of N (Fan et al., 2020).

Proof Please refer to (Fan et al., 2020) for a detailed proof. Any univariate polynomial of degree N can be factorized as �N/2j=1(rjx2 + sjx + tj), without the incorporation of complex numbers. Due to the identity:  f(x) = σ(f(x)) − σ(−f(x)), every two quadratic neurons σ(rjx2 + sjx + tj) and σ(−(rjx2 + sjx + tj)) can be ensembled together to perfectly express rjx2 + sjx + tjin the first layer of a network, followed by the half number of quadratic neurons in the second layer to combine the yields of the first layer, and so on and so forth. Consequently, a quadratic ReLU network with a depth of log2(N) and a width of N can express any univariate polynomial of degree N.

Proposition 3 (Stone-Weierstrass Theorem (De Branges, 1959)) Let f(x) be a continuous real-valued function over [a, b], then for any  σ > 0, there exists a polynomial P(x), satisfying

image

In the above propositions, Proposition 1 shows a quadratic ReLU network can express a univariate polynomial through factorization, which corresponds to the left side of Eq. (111). At the same time, this quadratic network is wide. Proposition 2 suggests that a polynomial can represent any continuous function. Proposition 3 informs us that a multivariate function can be expressed as a combination of univariate functions, which serves as a bridge to generalize our result from approximation to univariate functions to the case of multivariate functions.

3.2 Motivating Example

In contrast to the factorization representation, there is a continued fraction representation for a general univariate polynomial:

image

where  ai ̸= 0, b0 = a0, bk = a2k/a2k−2, k ≥ 1 and c0 = a1, ck = a2k+1/a2k−1, k ≥ 1. In theright side of Eq. (111), the first part contains the terms with even powers of x, while the second part with the odd powers of x. In Supplementary Information V, we prove the correctness of Eq. (111) in detail.

image

Figure 6: The width and depth equivalence for networks of quadratic neurons. In this con- struction, a deep network is to implement the continued fraction of a polynomial, and a wide network reflects the factorization of the polynomial.

Since both the factorization representation and the continued fraction representation express the same polynomial, we have the following identity:

image

where  rj, sj, tj and bi, ciare intrinsically connected. Let  Qj = rjx2 + sjx + tj, Bl(x) =

image

which could be somehow analogized to the De Morgan law by considering that multiplication and composition operations replace  ∨ and ∧operations, respectively, for either side of Eq. (17). But the objects of those operations are global functions instead of local rules. Eq. (17) is inspiring in the sense that the left side is of a parallel computational structure, and the right side is with two nested structures. Clearly, the left and right sides of Eq. (17) suggest a wide network and a deep network, respectively.

In Figure 6, a deep quadratic network is constructed to represent the right side of Eq. (17), while a wide quadratic network is constructed to represent the left side. In light of Eq. (17), these two quadratic networks are functionally equivalent to each other. In the case of the deep network, if the polynomial is of degree 2N, then the depth of the deep network is N. The inter-layer relationship in the two branches are respectively y(1)l+1(x) = bN−lx21+bN−lx2−y(1)l (x), y(1)N+1 = b01−y1N , and y(2)l+1(x) = cN−lx21+cN−lx2−y(2)l (x), y2N = c0x1−y2N , wherey(1)l and y(2)lare the outputs of the  lth layer in either branch. Specially,  y(1)0 = y(2)0 = 0.In this deep network, the activation function is in a form of reciprocal relation between two inputs:  z(x, y) = ax1+ax−y, which can be well approximated by a ReLU function after a proper normalization operation. In Supplementary Information V, we demonstrate that the reciprocal activation works effectively in the deep network after some twists. As for constructing the wide network, we employ Proposition 2 directly. Suppose that the polynomial is of degree 2N, the resultant quadratic network has a width of 2N and a depth of only log2(2N), where the width dominates.

Despite the novelty, the above equivalence only fits univariate polynomials. To make a broader impact, we further demonstrate that in terms of representing a multivariate polynomial, the width and the depth of a neural network is also equivalent to each other.

3.3 Width-Depth Quasi-Equivalence

This section delineates the second contribution of our paper. Particularly, we first show that a combination of univariate polynomials can approximate any continuous multivariate function. Then, we leverage the constructed wide and deep quadratic networks for univariate polynomials to represent a general continuous multivariate function. As a result, the resultant wide and deep networks offer the same functionality.

Lemma 12 For any continuous function  f : [0, 1]D → R, given δ > 0, there exists a function formulated as �2Dt=0 Pt(�Ds=1 Pt,s(xt)), where Pt,s and Ptare univariate polynomials, such that

image

Proof First, we apply the famous Kolmogorov-Arnold theorem to express f:

image

where  φt,s(xs) and Φtare continuous. According to Proposition 3, for every function φt,s(xs), given an arbitrarily small quantity  ϵt,s >0 , there exists a polynomial  Pt,s(xs)such that

image

Combining  φt,1(x1), φt,2(x2), ..., φt,D(xD) and applying the triangle inequality, we have

image

Because Φtis a continuous function, we can use the property of continuity. We choose a sufficiently small  ϵt,s, s = 1, ..., Dsuch that for every Φt, the following inequality holds:

image

With respect to the continuous function Φt, we can find a polynomial  Pt such that

image

Next, applying the triangle inequality again, we have

image

Finally, integrating Φt, t = 0, ..., 2D, we immediately obtain that

image

which concludes the proof.

Theorem 13 (Quasi-Equivalence by the Extension of the De Morgan Law) Given a continuous function  h : [0, 1]D → R, for any δ > 0, there exists a function expressed as

image

where  Pt and Pt,sare polynomials of degrees deg(Pt) and deg(Ps,t). Correspondingly, let K1 = maxt[deg(Pt)] and K2 = max{s,t}[deg(Ps,t)], there exist a wide quadratic network  Q1of a width max{K1, K2}and a depth log2(K1K2)and a deep quadratic network  Q2 of awidth 2(2D + 1) and a depth  K1 + K2, satisfying

image

Proof To prototype the wide network, we can use the wide quadratic sub-network scheme in Figure 6 to express  Pt and Pt,swhose [width, depth] are [deg(Pt), log2(deg(Pt))] and

[deg(Pt,s), log2(deg(Pt,s))], respectively. A straightforward combination of these wide sub-networks can express �2Dt=0 Pt(�Ds=1 Pt,s(xt)). Thus, the width of the construction is the summation of the widths of all the sub-networks: max{�2Dt=0 deg(Pt), �2Dt=0�Ds=1 deg(Pt,s)} ≥max{K1, K2}, while the depth is maxt[log2(deg(Pt)] + maxt,s[log2(deg(Pt,s))] = log2(K1) +log2(K2) = log2(K1K2).

For the deep network, we use the deep quadratic sub-network shown in Figure 6 to express  Pt and Pt,swhose [width, depth] are [2, deg(Pt)] and [2, deg(Pt,s)], respectively. Similarly, by integrating these deep sub-networks, we can have a deep network that expresses �2Dt=0 Pt(�Ds=1 Pt,s(xt)). As a result, the width of the derived network is 2(2D + 1), and the depth is maxt[deg(Pt)] + maxt,s[deg(Pt,s)] = K1 + K2.

In Supplementary Information IV, we use the quasi-equivalence relationship to construct wide and deep quadratic network variants for the same task, verify their validity on the MNIST dataset, and show empirical hints that a wide network might have an enhanced robustness, since adversarial samples have much less room to perturb latent features and play tricks.

Gains of Equivalence Notion. Due to the importance of width and depth, a width-depth equivalence theory should have a major impact on deep learning. Some commonly-used deep learning theories, such as neural tangent kernel (NTK) and neural network Gaussian process (NNGP), were developed assuming an infinite width. Naturally, in the era of deep learning we are more concerned with how depth affects the behavior of a neural network. Therefore, depth-oriented theories would be highly desirable. Our work suggests that depth-oriented NTK and NNGP might be feasible since width and depth are equivalent. 2) Our work suggests a significant potential of wide networks in practical applications. Currently, wide networks are less popular than deep networks due to their sub-optimal performance. However, the reason why wide networks do not work as well might be related to the training strategies instead of the model capacities, since width and depth are equivalent in representation power. If wide networks can be trained to match the performance of deep networks, they suggest an alternative path to develop neural networks.

Equivalent Networks. In a broader sense, our quasi-equivalence studies demonstrate the existence of mutually equivalent networks. We argue that the network equivalence is useful in network design. Although deep networks manifest superb power, their applications can be constrained, for example, when the application is time-critical. In that case, we can convert the deep network to a wide counterpart that can be executed at a high speed. A direction for network design optimization is to derive a compact network that maintains a high performance of an original large network through quantization (Wu et al., 2016), pruning (Li et al., 2016), distillation (Polino et al., 2018), low-rank approximation (Zhang et al., 2015), etc. We envision that the equivalence of a deep network and a wide network suggests a new means of network design. Ideally, a wide network can replace the well-trained deep network without compromising the performance. Due to the parallel nature, the wide network can be trained on a computing cluster with many machines for the fast training. At the same time, the inference time of the equivalent wide network is shorter than its deep counterpart.

Width-Depth Correlation. Every continuous n-variable function  f on [0, 1]n canbe in the  L1sense represented by partially separable multivariate functions (Light and Cheney, 2006):

image

where  ϵis an arbitrarily small positive number,  φliis a continuous function, and L is the number of products. In the Supplementary Information V and VI, we justify the suitability of this partially separable representation by showing its boundedness, and comparing it with other representations.

Further, we correlate the width and depth of a network to the structure of a function to be approximated. In a nutshell, each continuous function  φlican be approximated by a polynomial of some degree, which can be appropriately represented by quadratic neurons. As a consequence, via a quadratic representation scheme, the width and depth of a network structure must reflect the complexity of �Ll=1�ni=1 φli(xi). In other words, they are controlled by the nature of a specific task. As the task becomes complicated, the width and depth must increase accordingly, and the combination of the width and depth is not unique. For more details, please see the Supplementary Information VII.

Effects of Width on Optimization and Generalization. In the Supplementary Information VIII, we illustrate the importance of width on optimization in the context of over-paramterization, kernel ridge regression, and NTK, and then report our findings that the existing generalization bounds also shed light on the relationship of the width and depth given a fixed complexity.

Inspired by the De Morgan law and through a systematic analysis, we have established the quasi-equivalence between the depth and width of ReLU neural networks from two perspectives. In the first perspective, we have formulated two transforms for mapping an arbitrary regression/classification ReLU network to a wide ReLU network and a deep ReLU network, respectively. In the second perspective, we have extended our quasi-equivalence results from ReLU networks of popular artificial neurons to those of quadratic neurons. This quasi-equivalence represents a step forward in developing a unified deep learning theory. More efforts are needed in the future to refine this quasi-equivalence relationship and find real-world applications.

F. L. Fan is supported by the Rensselaer-IBM AI Research Collaboration Program (http: //airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons), R. Lai is partially supported by an NSF Career Award DMS–1752934 and NSF DMS-2134168. G. Wang is partially supported by an NIH R01 CA237267 grant.

Here, we show the correctness of Theorem 9 in the 2D case. Regarding the transformation of an arbitrary multivariate network, the situation is more complicated than in the case of univariate networks. Nevertheless, we are able to establish the  δ-equivalence, which is a slightly relaxed result.

The sketch of proof: A ReLU network is a piecewise linear function over polytopes, which can be decomposed into a summation of linear functions over a simplex. Lemma 14 shows that a wider network module  N1(x) and a deeper network module  N2(x) canrepresent an arbitrary linear function over a simplex. Next, in Theorem 9, to transform an arbitrary ReLU network into a wide and a deep network, we horizontally aggregate network modules  N1(x) to have a wide network, and we use shortcuts to sequentially establish a deep network with  N2(x).

A D-simplex S is a D-dimensional convex hull provided by convex combinations of D+1

image

affinely independent vectors  {vi}Di=0 ⊂ RD. In other words, S =

In 2D case, if we write  V = (v1−v0, v2−v0), then Vis invertible and  S = {v0 + V x | x ∈ ∆},where ∆ =�x ∈ R2 | x ≥ 0, 1⊤x ≤ 1�is a template simplex in  R2. It is clear that the following one-to-one affine mapping between S and ∆ exists, which is

image

Therefore, we only need to make the construction in the special case where  S = ∆ tosimplify our analysis. The coordinate transform in Eq. (29) can conveniently map the construction from ∆ to S.

Given a linear function  ℓ(x) = c1x1 + c2x2 + c3, we write ℓ− = {x ∈ R2 | ℓ(x) < 0} andℓ+ = {x ∈ R2 | ℓ(x) ≥ 0}. ∆ is enclosed by three lines provided by  ℓ1(x) = x1, ℓ2(x) = x2,and  ℓ3(x) = −x1−x2+1. We write three vertices of ∆ as  v0 = (0, 0), v1 = (1, 0), v2 = (0, 1).Then,  f : [−B, B]2 → Rsupported on ∆ is expressed as follows:

image

where  a = (f(v1) − f(v0), f(v2) − f(v0)), b = f(v0).

Lemma 14 Suppose that the representation of an arbitrary ReLU network is  f : [−B, B]D →R expressed as Eq. (30), for any  δ > 0, there exist a ReLU network  N1 of width D(D +1)(2D − 1) + 2 and depth D, and also a ReLU network  N2 of width (D + 1)D2 and depth

image

Proof (D=2) Our goal is to approximate the given piecewise linear function  f over ∆;therefore, we need to cancel f outside its domain. We first index the polytopes separated by three lines  ℓ1(x) = 0, ℓ2(x) = 0, and ℓ3(x) = 0 as A(χ1,χ2,χ3) = ℓχ11 ∩ ℓχ22 ∩ ℓχ33 , χi ∈{+, −}, i = 1, 2, 3. It is clear that ∆ = A(+,+,+).In addition, we use  ∨to exclude a component. For instance,  A(χ1,∨,χ3) = ℓχ11 ∩ ℓχ33 . It can be easily verified that  A(χ1,∨,χ3) =A(χ1,+,χ3) ∪ A(χ1,−,χ3).

Constructing  N1(x): The discontinuity of f in Eq. (30) is a major challenge of representing the function using a ReLU network. To tackle this issue, we start from a linear function ˜f(x) = a⊤x+b, ∀x ∈ R2, which can be represented by two neurons  σ◦ ˜f −σ◦(− ˜f).The key idea is to eliminate f over all polytopes outside ∆. In other words, ˜f over three fan-shaped polytopes  A(∨,−,+), A(−,+,∨), and A(+,∨,−) should be cancelled.

Let us take the polytope  A(+,∨,−) as an example. Note that  A(+,∨,−) has two boundaries ℓ1(x) = 0 and ℓ3(x) = 0 as illustrated in Figure 7(b). We choose a sufficiently large positivenumber  µto construct the three fan-shaped functions:

image

where the positive number  ηis chosen to be small enough such that the lines  x1 − ηx2 = 0and  x1 − η = 0 are very close to x1 = 0, then m((x1)+ ∩ (x1 − ηx2)−) < 2√2Bη andm((x1)+ ∩ (x1 − η)−) < 2√2Bη.

image

Figure 7: Quasi-equivalence analysis in 2D case. Left: The structure of the wide network to represent f over ∆, where two neurons denote  f over [−B, B]2 and nine fan- shaped functions handle the polytopes outside ∆. Right: The polytopes outside ∆ comprise of three fan-shaped domains, on which f can be cancelled by three linearly independent fan-shaped functions.

According to the aforementioned properties of fan-shaped functions, we approximately have

image

Based on the property of fan-shaped functions,  F (+,∨,−)1 (x), F (+,∨,−)2 (x), F (+,∨,−)3 (x) areapproximately constrained into the region  A(+,∨,−) such that ˜f(x) is approximately counteracted over  A(+,∨,−). Mathematically, the new function  F (+,∨,−)(x) = ω∗1F (+,∨,−)1 (x) +

image

Similarly, we can construct  F (∨,−,+) and F (−,+,∨) to eliminate ˜f on A(∨,−,+) and A(−,+,∨)respectively. Finally, these fan-shaped functions are aggregated to form the following ReLU network  N1(illustrated in Figure 7(a)):

image

where the width and depth of the network are 2 + 3  × 3 × 2 = 20 and 2 respectively. Inaddition, due to the 9 fan-shaped functions being utilized and the effect of the  η, the total area of the regions suffering from errors is no more than

image

Therefore, for any  δ >0, as long as we choose  η and µsatisfying

image

the constructed network N will have

image

Constructing  N2(x): Allowing more layers in a network provides an alternate way to represent  f. Let F(x1, x2) = σ◦(x1−µσ◦(−x2)) and F′(x1, x2) = σ◦(x1−νx2−µσ◦(−x2)),both of which are approximately enclosed by boundaries  x1 = 0 and x2 = 0. Therefore, the

image

Figure 8: Quasi-equivalence analysis in 2D case. Left: The structure of the deep network to represent f over ∆. Right: The polytopes outside ∆ comprise of three fan-shaped domains, on which f can be cancelled by three linearly independent functions over ∆.

fan-shaped regions of  F(x1, x2) and F′(x1, x2) almost overlap as  νis small. The negative sign for  x2is to make sure that the fan-shaped region is ∆. To obtain the third boundary ℓ3(x) = 0 for building the simplex ∆, we stack one more layer with only one neuron toseparate the fan-shaped region of  F(x1, x2) with the boundary  −x1 − x2 + 1 = 0 as follows:

image

Thus,  E1(x1, x2) will represent the function  −x1 − x2+ 1 over ∆ and zero in the rest area. The depth and width of  E1(x1, x2) are 3 and 4 respectively. Similarly, due to the employment of the two fan-shaped functions and the effect of  ν, the area of the region with errors is estimated as

image

To acquire f over ∆, similarly we need three linear independent functions as linear independent bases. We modify  ℓ3slightly to get  ℓ′3 = ℓ3 − τ ′x1 and ℓ′′3 = ℓ3 − τ ′′x2.Repeating the procedure described in (1), for  ℓ′3 we construct the network  E2(x1, x2) thatis  ℓ3 − τ ′x1 over ℓ+1 ∩ ℓ+2 ∩ (ℓ′3)+, while for  ℓ′′3 we construct the network  E3(x1, x2) that isℓ3 − τ ′x1 over ℓ+1 ∩ ℓ+2 ∩ (ℓ′′3)+. We set positive numbers  τ ′ and τ ′′small enough to have two

triangular domains  ℓ+1 ∩ℓ+2 ∩(ℓ′3)+ and ℓ+1 ∩ℓ+2 ∩(ℓ′′3)+almost identical with ∆. In addition, let  τ ′ and τ ′′ satisfy

image

where  ρ∗1, ρ∗2, ρ∗3 are solutions. As a consequence, the deep network (illustrated in Figure 8 (c)):

image

produces f on ∆. The depth and width of the network are 3 and 12. Similarly, the area of the region with errors is bounded above by

image

Therefore, for any  δ >0, if we choose

image

then the constructed network  N2will satisfy

image

Proof (Theorem 9, D = 2) The network h is piecewise linear and splits the space into polytopes. It is feasible to employ a number of simplices to represent the polytopes defined by h Ehrenborg (2007). Given that M is the minimum number of required simplices, we have

image

where

image

and  S(m) is the m-th simplex. For construction of a wide network, we use network modules to represent  f(m)(x) and then horizontally aggregate them into a wide network. In contrast, for construction of a deep network, we sequentially express  f(m)(x) in terms of a network module without linking them to the input.

Representing f with a wide ReLU network: Lemma 14 suggests that a wide network module  N1can generically represent a function over a template simplex. To represent  f(m)over  S(m), we need to use Eq. (29) to transform the function from the barycentric coordinate system to the Euclidean coordinate system. Let three vertices of  S(m) be {v(m)0 , v(m)1 , v(m)2 }and  V (m) = (v(m)1 − v(m)0 , v(m)2 − v(m)0 ), we have

image

image

Figure 9: Illustration of deep networks in 1D and 2D cases.

satisfying

image

By aggregating the network  N(m)1 (x) horizontally, we have the following wide network:

image

Therefore, the constructed wide network  H1(x) is of width O(20M) and depth 2. It is clear that the width O(20M) of the wide network  H1(x) dominates, as the number of needed simplices goes larger and larger.

Representing f with a deep ReLU network: For a deep construction, the fundamental difficulty is how to sequentially express each  f(m), i.e., the input of each block can only come from the earlier block instead of the input, like what we did in one-dimensional case (Figure 9(a)). Let us derive via induction how to sequentially represent each  f(m). We stilladopt the idea of modularized networks, but now each network module has two outputs. Lemma 14 suggests that a deep network module  N2can generically represent a function over a template simplex.

Step 1. Assume that the two outputs of the first block are  N(1)2 and Ω(1). To represent f(1) over S(1), similarly, we need to transform the function from the barycentric coordinate system to the Euclidean coordinate system. Let three vertices of  S(1) be {v(1)0 , v(1)1 , v(1)2 }and  V (1) = (v(1)1 − v(1)0 , v(1)2 − v(1)0 ), we have

image

which is one output of the first block.

Next, we derive the other output Ω(1)(x). Note that we are not allowed to use the input directly. Encouraged by the inversion idea in the univariate case, we invert the function domain into the input domain to get a function that is approximately only supported over S(1):

image

To do so, recall that we have  E1, E2, E3 in constructing  N2, we will have (ξ∗1, ξ∗2, ξ∗3) suchthat

image

As shown in Figure 9(b), use the residual connection, we compute Ω(1)(x) = x − ˜xS(1),which is zero over  S1 and xfor other regions. Ω(1)(x) will be used to feed the next block to construct  f(2).

Step 2. Suppose that the two output functions of the m-th block are  Nm2 (x) and(m)(x) = x − �mi=1 ˜xS(i). Ωm(x) is fed into the (m + 1)-th block as the input. Because S(m+1) is outside the domain of  S(1) ∪ · · · ∪ S(m), with the same technique in Step 1, we construct

image

where  {v(m+1)0 , v(m+1)1 , v(m+1)2 }are three vertices of  S(m+1), and V (m+1) = (v(m+1)1 −v(m+1)0 , v(m+1)2 − v(m+1)0) to express  f(m+1) over S(m+1) well. Ω(m)(x) is functionally equivalent to x for two reasons. First, all simplices do not overlap. Ω(m+1)(x) is xoutside the domain of  S(1) ∪ · · · ∪ S(m); therefore zero value over  S(1) ∪ · · · ∪ S(m) has no effect. Second, w.l.o.g., we assume that (0, 0) is outside of all simplices or lies on the boundary of a certain simplex. As such, we have  N(m+1)2 ((0, 0)) = 0, ∀m, therefore,  N(m+1)2 (Ω(m)(x)) will not erroneously produce a non-zero constant over  S(1) ∪· · ·∪S(m). Furthermore, we also obtain a function ˜xS(m+1)inside the (m + 1)-th block. Apply the residual operation, we have

image

Lastly, simlilar to the one-dimensional deep network, we use the shortcut connection to aggregate  N(m)2 (Ω(m−1)(x)) to obtain the following deep network:

image

Therefore, the constructed deep network  H2(x) is of depth O(3M) and width 12. Please note that in the above equation, the summation is made by shortcuts. It is clear that the depth of  H2(x) dominates over the width.

image

I. Width as Related to Neural Tangent Kernel (NTK)

NTK sheds light on the power of a network when the width of this network goes to infin-ity. Previous results have suggested a link between the neural network and the Gaussian distribution when the network width increases infinitely. In Neal (1996); Lee et al. (2018); Matthews et al. (2018); Novak et al. (2018); Arora et al. (2019), the Gaussian process is shown in the two-layer networks, the deep networks, and the convolutional networks. Weakly-trained networks (Arora et al., 2019) refer to the networks where only the top layer is trained after other layers are randomly initialized. Let  f(θ, x) ∈ Rdenote the output of the network on input  x where θdenotes the parameters of the network, and  θis an initialization over  Θ. In the above context, training the top layer with the  l2penalty reduces to the kernel regression:

image

where (F) denotes the functional space. On the other hand, an infinite wide network is considered as an over-parameterized network, with the motivation to explain the fact that why over-paramterized networks scale. There were extensive studies on this topic (Du et al., 2018; Allen-Zhu et al., 2019; Cao and Gu, 2019). Notably, these studies indicate that the training renders relatively small changes when the network is sufficiently wide. Such a concentration behavior is justified by NTK in terms of the Gaussian process. The NTK is defined as the following:

image

where (∂) denotes the gradient space. Along this line, Huang et al. (2020) compared the NTK of the ResNet and the deep chain-like network, and found that the NTK of the chain-like network converges to a non-informative kernel as the depth goes to infinity, while the counterpart of ResNet does not. Du and Hu (2019) showed that the width provably matters in networks using the piecewise linear activation.

II. Transformation of an Arbitrary Network

Here, we provide the proof for Theorem 15 and the experiment that shows the feasibility of the procedures in the proof of Theorem 16. The similar derivation to the proof for Theorem 15 is also seen in Fan et al. (2018c).

Theorem 15 (Equivalence of Univariate Regression Networks) Given any ReLU network  f : [−B, B] → Rwith one dimensional input and output variables. There is a wide ReLU network  H1 : [−B, B] → Rand a deep ReLU network  H2 : [−B, B] → R, such that f(x) = H1(x) = H2(x), ∀x ∈ [−B, B].

Proof (Theorem 15) Since the function represented by a network  f : [−B, B] → R ispiecewise linear, we express f(x) as

image

where  −B = x0 < x1 < x2 < · · · < xn < xn+1 = B, w(i) = f(xi+1) − f(xi)xi+1 − xito guarantee continuity, and the slopes of neighboring pieces are different; otherwise they can be fused together.

We first construct a wide ReLU network  H1(x) to represent f. This can be straightforward as follows:

image

where specially  w(−1) = 0. This network is of width n+1. Now, we verify this claim. Given x ∈ [xj, xj+1), for some  j ≥0, we have the following:

image

image

Figure 10: Equivalent wide (a) and deep (b) univariate ReLU networks.

Next, we construct an equivalent deep network dual to the above wide network. Note that (62) is the same as

image

where sgn(x < 0) = −1 and sgn(x > 0) = 1. If we write Ri(x) = σ(|w(i)−w(i−1)|(x−xi)), i =0, · · · , n −1, then it is clear that the following recursive relation holds:

image

Thanks to the recurrent relation, each  Rican accurately represent a small monotonically increasing piece over [xi, xi+1]. We aggregate the outputs of those n+1 pieces in the output neuron to get the desired deep ReLU network  H2(x):

image

which is the same as  H1(x). To construct  Rn(x), n+ 1 consecutive neurons are connected in series. Therefore, such a one-neuron-wide network has n + 2 layers. We plot two types of neural networks  H1(x) and H2(x) in Figure 10.

Theorem 16 (Quasi-Equivalence of Multivariate Regression Networks) Suppose that the representation of an arbitrary ReLU network is  h : [−B, B]D → R, and Mis the minimum number of simplices to cover the polytopes to support  h, for any δ > 0, there exist a wide ReLU network  H1 of width O�D(D + 1)(2D − 1)M�and depth D, and also a deep

ReLU network  H2 of width (D + 1)D2 and depth O [(D + 1)M], satisfying that

image

where  m(·)is the standard measure in [−B, B]D.

The sketch of proof: A ReLU network is a piecewise linear function over polytopes, which can be decomposed into a summation of linear functions over a simplex. Lemma 17 shows that a wider network module  N1(x) and a deeper network module  N2(x) canrepresent an arbitrary linear function over a simplex. Next, in Theorem 16, to transform an arbitrary ReLU network into a wide and a deep network, we horizontally aggregate network modules  N1(x) to have a wide network, and we use shortcuts to sequentially establish a deep network.

A D-simplex S is a D-dimensional convex hull provided by convex combinations of D+1

image

affinely independent vectors  {vi}Di=0 ⊂ RD. In other words, S =

If we write  V = (v1 − v0, v2 − v0, · · · , vD − v0), then Vis invertible, and S = {v0 + V x | x ∈ ∆} where ∆ =�x ∈ RD | x ≥ 0, 1⊤x ≤ 1�is a template simplex in  RD.It is clear that the following one-to-one affine mapping between S and ∆ exists, which is

image

where  a = (f(v1) − f(v0), f(v2) − f(v0), · · · , f(vD) − f(v0)), b = f(v0). We denote the domain of a network as [−B, B]D. Given a linear function  ℓ(x) = c1x1 +c2x2 +· · ·+cnxn +cn+1, we write ℓ− = {x ∈ RD | ℓ(x) < 0} and ℓ+ = {x ∈ RD | ℓ(x) ≥ 0}. Sis enclosed by D + 1 hyperplanes provided by  ℓi(x) = xi, i = 1, · · · , D, and ℓD+1(x) = −x1 − · · · − xD + 1.

Lemma 17 Suppose that the representation of an arbitrary ReLU network is  f : [−B, B]D →R expressed as Eq. (69), for any  δ > 0, there exist a ReLU network  N1 of width D(D +1)(2D − 1) + 2 and depth D, and also a ReLU network  N2 of width (D + 1)D2 and depthD + 1, satisfying that

image

Proof (Lemma 17, D ≥ 2) Our goal is to approximate the given piecewise linear function f using ReLU networks. We first index the polytopes separated by D + 1 hyperplanes ℓi(x) = 0, i = 1, · · · , D+1 as A(χ1,··· ,χi,··· ,χD+1) = ℓχ11 ∩· · ·∩ℓχii ∩· · ·∩ℓχD+1D+1 , χi ∈ {+, −}, i =1, · · · , D+1. It is clear to see that  S = A(+,+,··· ,+). In addition, we use  ∨to denote exclusion of certain component. For instance,  A(χ1,∨,χ3,··· ,χD+1) = ℓχ11 ∩ ℓχ33 ∩ · · · ∩ ℓχD+1D+1 . It can be easily verified that

image

Please note that  A(−,−,··· ,−) = ∅. Thus, D+1 hyperplanes create in total 2D+1−1 polytopes in the [−B, B]D.

Now we recursively define an essential building block, a D-dimensional fan-shaped ReLU network  FD(x):

image

where the set of linear functions  {hk(x) = p⊤k x + rk}Dk=1 are provided by D linearly inde- pendent vectors  {pk}Dk=1, and µis a large positive number (µj denotes µwith the power to j). Note that the network  FDis of width D and depth D. This network enjoys the following key characteristics: 1) As  µ → ∞, the hyperplane  h1 − µh2 − · · · − µjhj+1 = 0 isapproximate to the hyperplane  hj+1 = 0 as the term µjhj+1dominates. Thus, the support of  FD(x) converges to  h+1 ∩ h−2 ∩ · · · ∩ h−D which is a D-dimensional fan-shaped function. 2) Let C be the maximum area of hyperplanes in [−B, B]D. Because the real boundary h1−µh2−· · ·−µjhj+1 = 0 is almost parallel to the ideal boundary hj+1 = 0, the measure ofthe imprecise domain caused by  µj is at most C/µj, where 1/µj is the approximate distance between the real and ideal boundaries. In total, the measure of the inaccurate region in building  FD(x) is at most  C �D−1j=1 1/µj ≤ C/(µ −1). 3) The function over D-dimensional fan-shaped domain is  h1, since (hj)+ = 0, j ≥2 over the D-dimensional fan-shaped domain.

Constructing  N1: Discontinuity of f in Eq.(69) is one of the major challenges of representing it using a ReLU network. To tackle this issue, we start from a linear function ˜f(x) = a⊤x + b, ∀x ∈ RD, which can be represented by two neurons  σ ◦ ˜f − σ ◦ (− ˜f). Thekey idea is to eliminate f over all 2D+1 −2 polytopes outside S using the D-dimensional fan-shaped functions.

Let us use  A(+,+,+,−,··· ,−) and A(+,+,−,−,··· ,−) to show how to cancel the function ˜f over the polytopes outside S. According to (71),  A(+,+,+,−,··· ,−) and A(+,+,−,−,··· ,−) satisfy

image

where  A(+,+,∨,−,··· ,−) is a D-dimensional fan-shaped domain. Without loss of generality, a number D + 1 of D-dimensional fan-shaped functions over  A(+,+,∨,−,··· ,−) are needed as the group of linear independent bases to cancel ˜f , where the  kth fan-shaped function is

image

where we let  xD+1 = 1 for consistency, the negative sign for x2is to make sure that the fan-shaped region  ℓ+1 ∩(−ℓ2)− ∩ℓ−4 ∩· · ·∩ℓ−D+1 of F (k)D is A(+,+,∨,−,··· ,−), η1 = 0, and ηk = η, k =2, ..., D + 1 represents a small shift for  x1 = 0 such that m((x1)+ ∩ (x1 − ηkxk)−) < Cηk.The constructed function over  A(+,+,∨,−,··· ,−) is

image

which is approximately over

image

Let us find  ω∗1, · · · , ω∗D+1 by solving

image

and then the new function  F (+,+,∨,−,··· ,−)(x) = �D+1k=1 ω∗kF (k)D (x) satisfies that

image

Similarly, we can construct other functions

cancel ˜f over other polytopes. Finally, these D-dimenional fan-shaped functions are aggregated to form the following wide ReLU network  N1(x):

image

where the width and depth of the network are  D(D + 1)(2D −1) + 2 and D respectively. In addition, because there are 2D −1 polytopes being cancelled, the total area of the regions suffering from errors is no more than

image

Therefore, for any  δ >0, as long as we choose appropriate  µ, ηthat fulfill

image

the constructed network  N1(x) will have

image

Constructing  N2: Allowing more layers in a network provides an alternate way to represent f. The fan-shaped functions remain to be used. The whole pipeline can be divided into two steps: (1) build a function over S; and (2) represent f over S by slightly moving one boundary of S to create linear independent bases.

(1) We construct a number D of D-dimensional fan-shaped functions. Without loss of generality, the  kth fan-shaped function is constructed as

image

whose fan-shaped region is approximately (ℓ1 − νkxk)+ ∩ (−ℓ2)− ∩ · · · ∩ (−ℓD)− = (ℓ1 −νkxk)+ ∩ ℓ+2 ∩ · · · ∩ ℓ+D, which almost overlaps with  A(+,··· ,+,∨) = ℓ+1 ∩ ℓ+2 ∩ · · · ∩ ℓ+D as νkbecomes sufficiently small. The output of  F (k)D is x1 − νkxk, k = 1, · · · , D. To obtain the last boundary  ℓD+1(x) = −x1 − · · · − xD + 1 = 0 so as to construct the simplex S, we stackone more layer with only one neuron as follows:

image

Thus,  E1(x) will approximately represent the linear function  −x1 − · · · − xD + 1 over Sand zero elsewhere. The depth and width of the network are  D + 1 and D2 respectively. Similarly, due to the employment of a number D of D-dimensional fan-shaped functions and the effect of  νk, the area of the region with errors is estimated as

image

(2) To acquire an arbitrary linear function, similarly we need D + 1 linear independent functions as linear independent bases. Other than the one obtained in step (1), we further modify  ℓD+1a little bit D times to get  ℓlD+1 = ℓD+1 − τlxl, l = 1, · · · , D. Repeating the same procedure described in step (1), for  ℓlD+1, we can construct the network  El(x) that isℓlD+1 − τlxlapproximately over  ℓ+1 ∩ ℓ+2 ∩ · · · ∩ (ℓlD+1)+, where τl, l = 1, · · · , Dis small to render these domains almost identical to  S, and τl, l = 1, · · · , Dsatisfies that

image

producing f on S. The depth and width of  N2(x) are D + 1 and D2(D+ 1), respectively. Similarly, the area of the region with errors is bounded above by

image

Therefore, for any  δ >0, if we choose  µ, νk, τlappropriately such that

image

then the constructed network  N2(x) will satisfy

image

Proof (Theorem 16, D ≥ 2) The network h is piecewise linear and splits the space into polytopes. It is feasible to employ a number of simplices to fill the polytopes defined by h (Ehrenborg, 2007). Given that M is the minimum number of required simplices, we have

image

where

image

and  S(m) is the m-th simplex. The core of the wide construction is to horizontally aggregate network modules representing  f(m)(x) into a wide network. In contrast, the core of the deep construction is to sequentially express  f(m)(x) in a network module without linking these modules to the input.

Representing h with a wide ReLU network: Lemma 17 suggests that a wide network module  N1can generically represent a function over a template simplex. To represent f(m) over S(m), we need to use Eq. (68) to transform the function from the barycentric coordinate system to the Euclidean coordinate system. Let D + 1 vertices of  S(m) be

image

satisfying

image

By aggregating the network  N(m)1 (x) horizontally, we have the following wide network:

image

Therefore, the constructed wide network  H1(x) is of width  O(D(D + 1)(2D − 1)M) anddepth D. It is clear that the width  O(D(D + 1)(2D − 1)M) of the wide network  H1(x)dominates, as the number of needed simplices goes larger and larger.

Representing h with a deep ReLU network: For a deep construction, the fundamental difficulty is how to sequentially express each  f(m), i.e., the input of each block can only come from the earlier block instead of the input, like what we did in one-dimensional case (Figure 9(a)). Let us use mathematical deduction to derive how to sequentially represent each  f(m). We still adopt the idea of modularized networks, but now each network module has two outputs. Lemma 17 suggests that a deep network module  N2can generically represent a function over a template simplex.

Step 1. Assume that the two outputs of the first block are  N(1)2 and Ω(1). To represent f(1) over S(1), similarly, we need to transform the function from the barycentric coordinate system to the Euclidean coordinate system. Let D+1 vertices of  S(1) be {v(1)0 , v(1)1 , · · · , v(1)D }and  V (1) = (v(1)1 − v(1)0 , v(1)2 − v(1)0 , · · · , v(1)D − v(1)0 ), we have

image

which is one output of the first block. Next, we derive the other output Ω(1)(x). Note that we are not allowed to use the input directly. Encouraged by the inversion idea in the univariate case, we invert the function

domain into the input domain to get a function that is approximately only supported over S(1):

image

To do so, recall that we have  E1, E2, · · · , ED+1 in constructing  N2, we will have a coefficient vector (ξ∗1, ξ∗2, · · · , ξ∗D+1) such that

image

As shown in Figure 9(b), use the residual connection, we compute Ω(1)(x) = x − ˜xS(1),which is zero over  S1 and xfor other regions. Ω(1)(x) will be used to feed the next block to construct  f(2).

Step 2. Suppose that the two output functions of the m-th block are  Nm2 (x) and(m)(x) = x − �mi=1 ˜xS(i). Ωm(x) is fed into the (m + 1)-th block as the input. Because S(m+1) is outside the domain of  S(1) ∪ · · · ∪ S(m), with the same technique in Step 1, we construct

image

where  {v(m+1)0 , v(m+1)1 , · · · , v(m+1)D }are three vertices of  S(m+1), and V (m+1) = (v(m+1)1 −v(m+1)0 , v(m+1)2 − v(m+1)0 , · · · , v(m+1)D − v(m+1)0) to express  f(m+1) over S(m+1) well. Ω(m)(x)is functionally equivalent to x for two reasons. First, all simplices do not overlap. Ω(m+1)(x)is x outside the domain of  S(1) ∪· · ·∪S(m); therefore zero value over  S(1) ∪· · ·∪S(m) has noeffect. Second, w.l.o.g., we assume that 0 is outside of all simplices or lies on the boundary of a certain simplex. As such, we have  N(m+1)2 (0) = 0, ∀m, therefore,  N(m+1)2 (Ω(m)(x)) willnot erroneously produce a non-zero constant over  S(1) ∪ · · · ∪ S(m). Furthermore, we also obtain a function ˜xS(m+1)inside the (m + 1)-th block. Apply the residual operation, we have

image

Lastly, simlilar to the one-dimensional deep network, we use the shortcut connection to aggregate  N(m)2 (Ω(m−1)(x)) to obtain the following deep network:

image

Therefore, the constructed deep network  H2(x) is of depth O((D + 1)M) and width (D + 1)D2. Please note that in the above equation, the summation is made by shortcuts. It is clear that the depth of  H2(x) dominates over the width.

image

Experiment: A 2-6-2-1 network was trained on the moon dataset. When the training process was finished, the accuracy of the network was 0.94. Then, we transformed the original network into a wide network (µ = 80) and a deep network (µ = 5) using the proceduresdescribed in the proof of Theorem 16. The patterns of outputs of the constructed wide

image

Figure 11: With the procedure in the proof for Theorem 16, an exemplary 2-6-2-1 network is transformed into a wide network (µ = 80) and a deep network (µ = 5)respectively.

network and deep network are respectively shown in Figure 11. It is seen that the constructed networks can approximate the original network well, and some expected artifacts can be squeezed by further increasing  µor using a post-processing network that can identify streaks and remove them.

III. Width-Depth equivalence by the De Morgan Law

We have the following formal statement for the quasi-equivalence in light of the De Morgan law. The proof of this theorem is constructive.

Proposition 5 (Proposition C.1 in (Lin and Jegelka, 2018)) Given a piecewise constant function  h : Rd → R, for any small enough  ϵ > 0, there exists a ResNet  H2(x) suchthat

image

The sketch of proof. The detailed proof can be referred to Lin and Jegelka (2018). We only show the one-dimensional case here for a brief introduction of the idea. The operations that are realizable by a residual network block are

1. shifting:  G+ = G + c, for any c ∈ R. This operation is to shift a function with a constant.

2. min or max:  G+ = min{G, c} = G − σ(G − c) or G+ = max{G, c} = G + σ(c − G),for any  c ∈ R. This operation allows us to threshold a function.

3. min or max with a linear transformation:  G+ = min{G, αG+β} = G−σ(G−(αG+β))or  G+ = max{G, αG + β} = G + σ((αG + β) − G) for any α, β ∈ R. This operation can adjust the slope of trapezoid functions.

image

Figure 12: To approximate any univariate function, (a) we construct an increasing trape- zoidal function; (b) we adjust the height of each trapezoid function to its corre- sponding height.

As Figure 12 shows, we first construct an increasing trapezoidal function with a very high plateau over each trapezoid. Then, we adjust the height of each trapezoid function to its corresponding height. The procedures of constructing an increasing function are shown in Figure 13, which are based on mathematical induction. Suppose  Gm satisfies

1.  Gm = 0, when x ∈ [−∞, a1].

2.  Gmis a trapezoid function over each [ai, ai+1].

3.  Gm = k∥G∥∞over an interval [ak + δ, ak+1 − δ] for any k = 1, 2, · · · , m.

4. 0  ≤ Gm ≤ m∥G∥∞ over [−∞, am+1].

5.  Gm = − m∥G∥∞δ (x − am+1) over [am+1, ∞].

image

Figure 13: An illustrative plot to show how to derive  Gm+1 from Gm.

image

The procedure of adjusting the height of each trapezoid function is shown in Figure 12(b). An important consideration is that we need to keep the function on previous subdivisions unchanged while twisting the current trapezoid function. This is realized by the fact that  ||G||∞is large. Suppose that  φkis the target value over the interval [ak, ak+1], weadjust the values of intervals sequentially by

image

Theorem 18 (Quasi-equivalence in light of the De Morgan law) Given a disjoint rule system  {Ai}ni=1, where each rule  Aiis characterized by an indicator function  gi(x)over a hypercube Γi = [a1i, b1i] × · · · × [aDi, bDi] ∈ [0, 1]D:

image

fulfilling the De Morgan law

image

we can construct a wide ReLU network  H1(x)to represent the right hand side of the De Morgan law and a deep ReLU network  H2(x)to represent the left hand side of the De Morgan law, such that for any  ϵ > 0,

image

where m is a measurement in [0, 1]D.

image

Figure 14: A wide network to realize the negation of the logic intersection of the negation of propositional rules in a high-dimensional space. The negation of each propositional rule is associated with a trap-like indicator function.

Proof The defined rule system is a piecewise constant function over a hypercube. The wide and deep networks are constructed as follows:

Wide network  H1(x): As illustrated in Figure 14, to represent the rule  Aithat is equivalent to  gi(x), a trap-like function is constructed to represent  ¬Aithat is equivalent to 1  − gi(x). In our construction, for each hypercube the measures of  {x ∈ Γi|H1(x) ̸= gi(x)}is no more than  vol(Γi)(1 − 2δ)n, where vol(Γi) is the volume of a hypercube Γi. There-fore, the total measure of errors for all the hypercubes is less than �ni vol(Γi)(1 − 2δ)n =

image

Deep network  H2(x): A1∨A2 · · ·∨Anis a piecewise constant function over a hypercube, based on Proposition 5, we have

image

Combining Eqs. (108) and (109) leads to  m�{x |H1(x)} ̸= H2(x)}�< ϵ, which concludes our proof.

image

IV. Quasi-equivalence of Quadratic Networks.

In this section, we first prove the correctness of the continued fraction representation of a polynomial. Then, we report the experimental results on the quasi-equivalency of quadratic networks.

image

For a general univariate polynomial, there is a continued fraction representation as follows:

image

where  ai ̸= 0, b0 = a0, bk = a2k/a2k−2, k ≥ 1 and c0 = a1, ck = a2k+1/a2k−1, k ≥ 1. In theright side of (111), the first part contains the terms with even powers of x, while the second part is for the odd powers of x. Because both even and odd parts are in the essentially same format, we only show the correctness of the even part:

image

Specially, we can derive the continued fraction as follows:

image

The following example helps illustrate the correctness of (111):

image

If  ai = 0 for certain i, we can derive an approximate continued fraction representation by complementing �2Ni=0 aixiwith a term  δixi, where δiis a small constant. For a given approximation requirement, we can always choose a sufficient small, nonzero  δi toaccommodate the accuracy requirement since x is bounded.

image

This subsection reports the experimental results on quasi-equivalence of quadratic networks in two steps. In the first step, we show the feasibility of the reciprocal activation function in the network according to the continued fraction representation. In the second step, we compare the accuracy and robustness between wide and deep networks.

Experimental Design: We first preprocessed the MNIST dataset using image deskewing and dimension deduction techniques. Image deskewing (https://fsix.github.io/ mnist/Deskewing.html) straightens the digits that are written in a crooked manner. Mathematically, skewing is modeled as an affine transformation:  Image′ = A(Image) + b, inwhich the center of mass of the image is computed to estimate how much offset is needed, and the covariance matrix is estimated to approximate by how much an image is skewed. Furthermore, the center and covariance matrix are employed for the inverse affine transformation, which is referred to as deskewing. Then, we used t-SNE (Van der Maaten and Hin- ton, 2008) to reduce the dimension of the MNIST from 28×28 to 2, as the two-dimensional embedding space. Figure 15 presents the effect of deskewing on t-SNE.

image

Figure 15: The effect of the deskewing operation on the t-SNE result of the MNIST data. Deskewing significantly improves the embedding quality.

The embedding data can be naturally divided into the training and testing sets based on the partition in the original space. To illustrate the aforementioned quasi-equivalence between quadratic networks, We built three representative quadratic deep learning models, using two wide sub-networks (by the factorization representation), four deep sub-networks with the reciprocal activation (by the continued fraction), and four deep sub-networks using the ReLU activation (representing the first-order approximation to the continued fraction representation). Each wide sub-network takes one element of the input, and two paired deep sub-networks take the same element of the input. Each wide sub-network has four layers with 8, 4, 2, 1 neurons sequentially, while each deep sub-network also has four layers with one neuron in each layer. These sub-networks are followed by a fully connected ReLU layer of the same structure. The fully connected network has three layers with 300, 200, and 10 neurons respectively.

We performed the batch training with 1, 000 instances per batch in each iteration. All the parameters were initialized with truncated Gaussian distribution of a mean 0 and a variance 0.1. The learning rate was set to 0.002, and the whole network was optimized with Adam (Kingma and Ba, 2015). The number of epochs was 200.

Feasibility of the Reciprocal Activation: At the first glance, the training of a deep model using the reciprocal activation function is subject to instability. For example, when the value of the input is large, bNx21+bNx2gets closer to 1, which will lead to zero in the denominator in the next fraction, and undermine the training convergence. However, we argue that when data are appropriately normalized, the training process can be made stable. We did the following experiment to study how the normalization can help. Specifically, the input was normalized by the formula:  xnew = xmaxζ � x−xminxmax−xmin�, where ζis a scaling factor. We evaluated four scaling factors {1, 4, 8, 16} by repeating 20 times the training process for each factor. The criterion is whether the training process converges or not. In this experiment, as long as ’nan’ does not appear, the training will converge because we used a sufficiently large number of epochs. Table 4 shows the success rate of training for different  ζscales. It can be seen that without scaling the training process always fails, while the success rate reaches 80% when  ζ = 16. The point is that the training process can be stablized withan appropriate scaling operation.

Table 4: The success rates of the training process with reciprocal activation after normal- ization.

image

Accuracy and Robustness of the Three Models: We repeated the training process 10 times and computed the accuracy of the three models on the test dataset. As shown in Table 5, all the three models achieved the state-of-the-art results. The performance of the deep model using ReLU was only slightly lower than that of the other two models.

Table 5: Performance of the three models on the test dataset after 10 repetitions.

image

Furthermore, we used the following four popular adversarial attack methods to evaluate the robustness of the deep learning models: (1) fast gradient method (FGM); (2) fast sign gradient method (FSGM (Goodfellow et al., 2014b)); (3) iterative fast sign gradient method (I-FSGM (Kurakin et al., 2016)); and (4) DeepFool (Moosavi-Dezfooli et al., 2016). Let  θdenote the model parameters, x be the input, y be the target pertaining to  x and L(θ, x, y)the loss, FGM generates an adversary as

image

where  ϵis an amplitude factor. FGM is plausible to find the attack along the direction of the gradient. FSGM computes an adversary based on

image

where  ϵis also a factor, and  sign(·) outputs 1 for a positive argument and  −1 otherwise. I-FSGM iteratively derives an adversary with the FSGM formula:

image

where  Clipx,α(·) is a thresholding function such that the maximum value of x is no more

than a preset limit  α. In the experiment, we set this limit  αto the maximum value of the

image

Table 6 shows the performance of the three models on the adversarial samples generated based on the test dataset using the aforementioned attacking methods. We bold-faced the best scores under attack. It can be seen that the wide model shows the highest accuracy under most attacks. The only exceptions are the cases of FSGM(ϵ = 3) and FSGM(ϵ = 5),where the wide model is the second best.

Table 6: Accuracy of the three models under attacks, where  ϵis the amplitude factor and N is the number of iterations.

image

Next, we probed the robustness of each model by examining the needed strength of the adversarial attack such that the network performance will drop by a pre-specified percentage. Using the absolute performance drop as a reference is better than using the performance compromised by adversarial attacks, as it eliminates the bias induced by the original network performance. Suppose that the original performance (without attack) is O, with each attacking method, we reduced  O to O′ = O −5% by gradually increasing the strength of attack. The reason to select a 5% drop is that if the drop is too high the comparison is not sensitive, while if the drop is too low the attack is too weak. Specifically, we gradually increased the strength by a fixed step for each attacking method until the performance drop is over 5%. For FGM, FSGM, I-FSGM and DeepFool, the steps are 50, 1, 1 and 1 respectively. Table 7 shows the needed strength values. The higher the needed strength, the more robust the model is. We bold-faced the best scores. Overall, the wide model shows the highest robustness. The only exception is the case of I-FSGM, where the wide model is the second best. These data suggest that the wide model seems more robust than the deep model. Further studies are needed to understand the underlying mechanisms and derive practical guidelines.

Table 7: Needed strength values for each of the four attacking methods to reduce the unattacked network performance by 5%

image

V. Bounds for the Partially Separable Representation

The purpose of this section is to show that the partially separable function can be realized a quadratic network whose structure is bounded. Let us first introduce necessary preliminaries.

Partially Separable Representation: Approximating a multivariate function  f(x1, x2, · · · , xn)by a set of functions of fewer variables is a basic problem in approximation theory (Light and Cheney, 2006). Despite that some function f is directly separable in the form of

image

a more general formulation to express a multivariate function f is

image

If L is permitted to be sufficiently large, then such a model is rather universal. For bivariate functions, an inspiring theorem has been proved (Light and Cheney, 2006): Let  {un}n∈Nand  {vn}n∈Nare orthornomal bases of  L2(X) and L2(Y) respectively, then  {umvn}(m,n)∈N 2is an orthornomal basis of  L2(X × Y).

To approximate a general multivariate function, we relax the restrictive equality to the approximation in the  L1sense and assume that, for every continuous n-variable function f(x1, · · · , xn) on [0, 1]n, given any positive number  ϵ, there exists a group of  φli, satisfying:

image

The key to demonstrate boundedness of the partially separable representation with a quadratic network is to use the Taylor’s expansion, and estimate the remainder, leading to a realization of the partially separable representation. Based on such a realization, we obtain an upper bound of the needed width and depth for the partially separable representation. Let  ∂αf(x) = ∂α1∂x1∂α2∂x2 · · · ∂αn∂xn f(x), α! = �ni=1 αi!, and xα = xα11 · · · xαnn , for the function class  Fn,k = {f ∈ Ck+1([0, 1]n) | ∥∂αf∥∞ ≤ M, ∀|α| ≤ k}, the following theorem holds.

Theorem 19 For any f ∈ Fn,k, if x ∈ [0, 1]n, there exists a quadratic network Q(x) with the width no more than  k�n+kn �and the depth no more than log2 (kn)to represent f in the partially separable representation, satisfying that

image

Proof Let f ∈ Fn,k, if x ∈ [0, 1]n, then based on classical multivariate calculus (Widder, 1989),

image

where  Rx,k(x) is the remainder term. For any  x ∈ [0, 1]n, we have

image

It is observed from Eqs. 122 and 123 that, given  f ∈ Fn,k, the Taylor expansion forms a partially separable representation �|α|≤k∂αα! f(0)xα with�n+kn �products, and the error is no more than M(k+1)!nk+1. Along this line, we apply the quadratic network to express

image

network using the ReLU function with the depth of log2(N) and the width of N. To approximate a general term ∂αα! f(0)xα = ∂αα! f(0)xα11 · · · xαnn , the depth of the quadratic network is no more than log2(max{α1, · · · , αn}) ≤ log2 k, and the width is no more than �ni=1 αi = k.In addition, the depth of a network that is used to express the product of n terms is log2 n.Therefore, the depth is no more than log2 k + log2 n = log2 (kn). In contrast, the width is no more than  k�n+kn �, considering that there are�n+kn �terms in �|α|≤k∂αα! f(x)xα.

VI. Properties of the Partially Separable Representation

Let us compare the properties of the partially separable representation with those of the piecewise polynomial representation and the Kolmogorov–Arnold representation in terms of universality, finiteness, decomposability and smoothness. As a result, the partially separable representation is well justified.

1. Universality: The representation should have a sufficient ability to express the functions. A piecewise polynomial representation is capable of representing any continuous function in a local way. The Kolmogorov–Arnold representation theorem states that every multivariate continuous function can be represented as a combination of continuous univariate functions, which solves a generalized Hilbert thirteen problem. Therefore, the Kolmogorov–Arnold representation also possesses universality. As far as the partially separable representation is concerned, its universality in the  L1distance is analyzed in the above Taylor expansion analysis. In summary, these three representation classes are all powerful enough to express continuous functions.

2. Finiteness: The multivariate Taylor expansion can approximate a general continuous function  f ∈ Fn,k with a finite number of terms. Although the proof in the manuscript devises a specific strategy of obtaining a partially separable representation, the partially separable representation actually allows a variety of constructions in addition to the Taylor expansion. There is no fundamental reason that the partially separable representation will necessarily have exponentially many terms in a majority of practical tasks after reasonable assumptions are incorporated.

3. Decomposability: Decomposability is to decode the functionality of the model in terms of its building units, i.e., cells, layers, blocks, and so on. A plethora of engineering examples, such as software development and optical system design, have shown that modularized analysis is effective. Particularly, modularizing a neural network is advantageous for the model interpretability. For example, Chen et al. (2016) developed InfoGAN to enhance decomposability of features learned by GAN (Goodfellow et al., 2014a), where InfoGAN maximizes the mutual information between the latent code and the observations, encouraging the use of noise to encode the semantic concept. Since the partially separable representation is more decomposable than its counterparts, the network constructed in reference to the partially separable representation is easily interpretable. For instance, such a network has simpler partial derivatives that can simplify interpretability analysis (Lipton, 2018).

4. Smoothness: A useful representation should be as smooth as possible such that the approximation can suppress high-frequency oscillations and be robust to noise. In the last eighties, Girosi and Poggio claimed that the use of the Kolmogorov-Arnold representation is doomed because the inner functions and the outer functions are highly non-smooth (Girosi and Poggio, 1989). In the piecewise polynomial representation, the situation gets better since at least over each interval the representation is smooth. As far as the partially separable representation is concerned, because the expression is greatly relaxed (from the exact computation to approximate representation), it is feasible to make each element smooth as L goes large, and extract meaningful structures underlying big data.

Table 8: Comparison of the three functional representations, where ”✓” means ”yes”, ”X”means ”no”, and ”–” means ”mediocre”.

image

In summary, we compare the three representation schemes in terms of universality, finiteness, decomposability, and smoothness in Table 8. Clearly, the partially separable representation is suitable for machine learning tasks.

VII. Width-Depth Correlation Through Partially Separable Presentation

Now, let us analyze the complexity of the quadratic network in light of the aforementioned partially separate representation scheme as shown in Figure 16. Suppose that the polynomial  Pliis of degree  Nli, then the representation of each  Plican be done with a network of width �Ll=1�ni=1 Nliand depth maxl,i{log2(Nli)}. Next, the multiplication demands an additional network of width Ln and depth log2(n). Therefore, the overall quadratic network architecture will be of width max{�Ll=1�ni=1 Nli, Ln}and depth maxl,i{log2(Nli)} +log2(n).Because the depth scales with a log function, which changes slowly when the dimensionality of the input is large. For simplicity, we take an approximation for depth maxl,i{log2(Nli)} + log2(n) = log2(maxl,i{Nli}) + log2(n) ≈ α log2�Ll=1�ni=1 Nli + log2(n),where  αis a positive constant. Let �Ll=1�ni=1 Nli = NΣ, which describes the overall com- plexity of the function to be expressed, then the formulas to compute the width and depth are simplified as follows:

image

Figure 16: Use of a quadratic network to represent a partially separable representation. Suppose that the polynomial  Pliis of degree  Nli, the width and depth of the quadratic network to approximate  Pli are Nli and log2(Nli) respectively.

One interesting point from (124) is that the lower bounds for depth and width to realize a partially separable representation are also suggested. As shown in Figure 17, we plot the width and depth as  NΣ changes. There are two highlights in Figure 17. The first is that the width is generally larger than the depth, which is different from the superficial impression on deep learning. The second is that, as the  NΣ goes up, the width/depth ratio is accordingly increased.

image

Figure 17: Width and depth versus  NΣ changes (L = 4, n = 5, and α = 1 without loss ofgenerality) assuming the partially separable representation.

Table 9: Descriptions of different building blocks. Modules Degree Operation Width Depth

image

Remark: Through the complexity analysis, we realize that the width and depth of a network depend on the structure or complexity of the function to be approximated. In other words, they are controlled by the nature of a specific task. As the task becomes complicated, the width and depth must increase accordingly, and the combination of the width and depth is not unique.

VIII. Effects of Width on Optimization and Generalization

We first illustrate the importance of width on optimization in the context of over-paramterization, kernel ridge regression, and NTK, and then report our findings that the existing generalization bounds and VC dimension results somehow suggest the width and depth equivalence for a given complexity of networks.

1. Optimization: The optimization mechanism is the key to understand the training process of neural networks (Ge et al., 2015). From the view of optimization, the fact that randomly-initialized first-order methods can find well-generalizable minima on the training data is quite intriguing. Given the theme of this paper, we put our endeavor into the importance of width on the optimization of neural networks in hope to provide insights for practitioners. We divide the relevant literature into the three categories:

(1) Increase width for over-parameterization: Brutzkus et al. showed that a wide two-layer network using the hinge loss can generalize well on linearly separable data with stochastic gradient descent (SGD) (Brutzkus et al., 2017). Li and Liang showed that when data is normalized and fulfills a separability condition, a two-layer over-parameterized network can learn these data in a polynomial time (Li and Liang, 2018). Allen-Zhu et al. (2019) showed that if the width of hidden layers is  O�n30D30 log30(1/ϵ)�, where nis the number of samples, D is the network depth, and  ϵis an expected error, then the gradient descent search converges with the error  ϵ. This bound was further reduced to  n42D for the network using non-linear smooth activation functions such as soft-plus (Du et al., 2018). These paper established that a pre-specified small training error can be achieved by gradient descent when the network is very wide. The secret therein is that the weight matrix is very close to the initialization due to NTK (Jacot et al., 2018).

(2) Kernel ridge regression: A neural network tends to give a Gaussian process when the width goes infinitely large. In this situation, only training the top layer of a network (weak training) will reduce to Kernel ridge regression. In a classical way, the weak training is to minimize the quadratic loss:

image

where  X = [x1; ...; xn] is a collection of data  xi ∈ R1×d, i = 1, ..., n. Taking derivatives with respect to w and equating them to zero gives

image

Given a new example  x∗, the prediction is

image

We replace the data samples with the high-dimensional representations of the neural network:  x → gθ(x) ∈ R1×k and let Φg(X) = [gθ(x1); ...; gθ(xn)]. By a similar derivation, we have

image

(3) Neural Tangent Kernel: In the above argument, we have justified the legitimacy of training the top layer by the large network width. How about training the entire network? Given the dataset  {(xi, yi)}ni=1, we also consider training the network with the quadratic loss for regression:  l(θ) = 12�ni=1(fθ(xi)) − yi)2.Consider the gradient descent in an infinitesimally small learning rate, then

image

Next, we can describe the dynamics of the model output  y(θ) given the input  xjas follows:

image

Let us consider  u(t) = (fθ(t)(xi))i=1,...,nfor all samples at time  t and y = (yi)i=1,...,n is theoutput, (130) can be written in a compact way:

image

where the pq-entry of H(t) is

image

In the width limit, the Gaussian process shows that H(t) becomes a constant  H∗. Therefore,

image

which characterizes the trajectory of the training in the functional space instead of the parameter space.

2. Generalization: Analysis of generalization bounds is a powerful tool to explain the excellent performance of neural networks. Traditional wisdom suggests that the increased model complexity will cause over-fitting to training data, which contradicts the fact that deep networks can easily fit random labels to the data and yet practically generalize well (Zhang et al., 2016). Recently, the generalization theory has gained increasingly more traction. In reference to Li et al. (2018), we have summarized the the state-of-the-art generalization bounds in Table 10 and provided their complexities. In Table 10, we boldface the bounds of interest which the width and depth dominate. As far as Neyshabur et al. (2015) is concerned, due to the exponential dependence on the depth, we will not focus on it. Instead, we argue that these bold-faced bounds somehow suggest the width and depth equivalence under a given complexity. Now, we use  B1(L, p) = log(p)√L3,B2(L, p) = log(Lp)�L3p, and B3(L, p) = √Lpto denote the bounds of interest. Figure 18 shows the contours plots of  B1, B2 and B3. Along a contour, the width and depth changes for the fixed bound complexity. In other words, the depth and width of a neural

Table 10: Representative bounds for chain-like neural networks, where  Bl,2, Bl,F and Bl,2→1to denote the upper bounds of the spectral norm  ||Wd||2, Frobenius norm ||Wd||F , and ||Wd||2,1of the rank-r matrix in lth layer. Generally,  ||Wd||F and||Wd||2,1 is √r-times larger than  ||Wd||2, γmeans the ramp loss function is 1/γ-Lipschitz function, Γ is the lower bound for the product of the spectal norm of all the layers and the m is the size of data.

image

Figure 18: Contours of  B1, B2 and B3, where numbers on the curves of each sub-figure represent log(p)√L3, log(p)√L3 and √Lprespectively. Along a contour, the width and depth changes to give the same bound.

network are mutually transformable without impacting the overall generalization ability theoretically.

Therefore, increasing either width or depth can boost the hypothesis space of a neural network. In other words, when it comes to promoting the expressive power of a network, increasing the width is essentially equivalent to increasing the depth in the sense of VC dimension, which also implies the equivalence of the width and depth of neural networks.

IX. Rethinking the Depth Separation with Intra-layer Links

Depth separation highlights the representation ability of a deep network, which has been intensively investigated over the past years. The idea is to show that a deep network can more efficiently approximate a complicated function than a shallow network. One notable depth separation result is from Arora et al. (2016), showing that  ”for every pair of naturalnumbers k ≥ 1, w ≥ 2, there exists a family of hard functions representable by a R → R(k + 1)-layer feedforward ReLU DNN of width w such that if it is also representable by a (k′ + 1)-layer feedforward ReLU DNN for any k′ ≤ k, then this (k′ + 1)-layer feedforwardReLU DNN has size at least 12k′wkk′ −1”.Suppose that the number of neurons, i.e., the size of a ReLU network is s, and the piecewise linear function of this ReLU network has p pieces, the core of the depth separation theorem established in Arora et al. (2016) is summarized as the size-piece relationship:  s ≥ 12k(2p)1/k − 1.

In the proposed network structure for the width-depth conversion, intra-layer links are employed for a network to flexibly represent arbitrary piecewise linear functions over polytopes using fan-shaped functions. Here, we find that adding intra-layer links can greatly increase the maximum number of pieces represented by a shallow network such that it can express as a complicated function as a deep network could. As shown in Table 11, let n be the number of intra-linked neurons, the size-piece inequality  s ≥ n2(2n−1)k(2p)1/k−1 trivially holds true. As such, it cannot be used to demonstrate depth separation any more. Instead, we need to rethink the depth separation and re-characterize the power of a network when intra-layer links are used. Our detailed proof is as follows.

Table 11: Network structures and their associated size-piece relationships. The number of neurons, i.e., the size of a ReLU network is s. k is the number of hidden layers. The piecewise linear function of this ReLU network has p pieces. n is the number of intra-linked neurons.

image

Note: In the first row, we re-calculate the inequality and obtain a slightly different inequality from the original. The previously-established inequality in (Arora et al., 2016) is  s ≥ 12kp1/k.

The result listed in the first row of Table 11 is obtained from the following two lemmas and proofs in (Arora et al., 2016).

Lemma 20 (Lemma D.5 in (Arora et al., 2016)) Let f : R → Rbe a function represented by a  R → RReLU DNN, as shown in Figure 19(a), with depth k + 1 and widths w1, . . . , wk of khidden layers. Then f is a PWL function with at most 2k−1 · (w1 + 1) · w2 ·. . . · wk pieces.

Lemma 21 (Lemma D.6 in (Arora et al., 2016)) Let f : R → Rbe a piecewise linear function with p pieces. If f is represented by a ReLU DNN, as shown in Figure 19(a), with

image

Figure 19: (a) the feedforward architecture; (b)-(c) the structures with intra-layer links.

depth k + 1, then it must have size at least 12kp1/k − 1. Conversely, any piecewise linear function f represented by a ReLU DNN of depth k + 1 and size at most s, can have at most

image

Now, we prove the result listed in the second row of Table 11 for ReLU DNN with intra-links.

Lemma 22 (Our 1st New Result) Let f : R → Rbe a function represented by a  R → RReLU DNN, as shown in Figure 19(b), with depth k + 1 and widths  w1, . . . , wk of k hiddenlayers. Then f is a PWL function with at most 3k−1 ·� 3w12 + 1�· w2 · . . . · wk pieces.

Proof As shown in Figure 19(b), let us connect every two neurons with an intra-layer link. Suppose that one neuron is  σ(wx + b) which can create 1 breakpoint at  −b/w, while the other neuron is  σ(w′x + b′ + σ(wx + b)) creating at most 3 breakpoints  −b/w, −b′/w′, −(b +b′)/(w +w′). Thus, we can get at most w12 ·3 = 3w12distinct breakpoints, i.e., 3w12 +1 pieces.

Let us estimate the number of pieces via induction. Assume that for some  k ≥ 1, anyR → RReLU DNN, as shown in Figure 19(b), with depth k+1 and widths  w1, . . . , wk of thek hidden layers produces at most  Ak = 3k−1·� 3w12 + 1�·w2·. . .·wkpieces. We add one more layer of  wk+1neurons to this network such that its pre-activation is actually the output of a R → RReLU DNN with depth k+1 and widths  w1, . . . , wk. Suppose that the preactivation of the i-th neuron is  fi, i ∈ [wk+1]. By the induction hypothesis,  fiis a piecewise linear function with at most  Akpieces. Without loss of generality, let the i-th and (i + 1)-th neurons be connected by an intra-layer link. Their outputs are  σ(fi) and σ(fi+1 + σ(fi))whose total number of pieces equals to that of three functions  σ(fi), σ(fi+1), σ(fi + fi+1),which is at most 6Ak. Because we have  wk+1neurons in the last hidden layer, we can get at most 6Ak ·(wk+1/2) = 3k ·� 3w12 + 1�·w2 ·. . . wk ·wk+1pieces, which concludes the induction step.

image

Lemma 23 (Our 2nd New Result) Let f : R → Rbe a piecewise linear function with p pieces. If f is represented by a ReLU DNN, as shown in Figure 19(b), with depth k + 1, then it must have size at least 13k(2p)1/k − 23. Conversely, any piecewise linear function f represented by a ReLU DNN of depth k + 1 and size at most s, can have at most 12� 3sk�k

pieces.

Proof Let widths of the k hidden layers be  w1, . . . , wk. By Lemma 22, we must have

image

By the AM-GM inequality, we minimize the size subject to (134),

image

where the inequality is achieved at 3w12 + 1 = 3w22 = . . . = 3wk2 ≥ 12p1/k. This leads to the first statement. The second statement follows by reversing the above equation.

image

Lemma 24 (The 1st New Result Extended to n Intra-Linked Neurons) Let f : R →R be a function represented by a  R → RReLU DNN, as shown in Figure 19(c), with depth k + 1, widths w1, . . . , wk of khidden layers and n intra-linked neurons in each layer. Then f is a PWL function with at most 2k−1 ·�(2n−1)w1n + 1�· (2n−1)w2n · . . . · (2n−1)wkn pieces.

Proof For the first layer, the n intra-linked neuron can create 2n −1 breaking points. Let us derive this fact via induction. Suppose that  Gn−1is the piecewise linear function of  n −1 intra-linked neurons, then  Gn = σ(wx + b + Gn−1). When Gn−1 ≥ 0, Gn =σ(wx + b + Gn−1), generating at most 2n−1 −1 breaking points (same as  Gn−1). WhenGn−1 ≤ 0, Gn = σ(wx + b) has one breaking point. Without repetition, the number of breaking points of  Gnis the summation of these newly-generated breaking points and breaking points of  Gn−1, which is 2(2n−1 − 1) + 1 = 2n −1. Because the first layer has  w1neurons, at most  w1/n groups of nintra-linked neurons are available. Thus, we can get at most (2n−1)w1n+ 1 pieces.

Computing the number of pieces for the hidden layers shares the same spirit with the first layer, combining the proof of Lemma 22, we conclude that a  R → R ReLU DNN,as shown in Figure 19(c), with depth k + 1, widths  w1, . . . , wk of khidden layers, and n intra-linked neurons in each layer, has at most 2k−1 ·�(2n−1)w1n + 1�· (2n−1)w2n · . . . · (2n−1)wknpieces.

Lemma 25 (The 2nd New Result Extended to n Intra-Linked Neurons) Let f : R → Rbe a piecewise linear function with p pieces. If f is represented by a ReLU DNN, as shown in Figure 19(c), with depth k + 1 and n neurons intra-linked in each hidden layer, then it must have size at least n2(2n−1)k(2p)1/k − n2n−1. Conversely, any piecewise linear function f represented by a ReLU DNN of depth k + 1 and size at most s, can have at most

image

Proof Please refer to our proof for Lemma 23.

A. Adadi and M. Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (xai). IEEE Access, 6:52138–60, 2018.

Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overpa- rameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pages 6155–6166, 2019.

Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8139–8148, 2019.

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.

Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE transactions on neural networks and learning systems, 25(8):1553–1565, 2014.

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over- parameterized networks that provably generalize on linearly separable data, 2017.

Jie Bu and Anuj Karpatne. Quadratic residual networks: A new class of neural networks for solving forward and inverse problems in physics involving pdes. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pages 675–683. SIAM, 2021.

Yuan Cao and Quanquan Gu. A generalization theory of gradient descent for learning over-parameterized deep relu networks, 2019.

CL Philip Chen and Zhulin Liu. Broad learning system: An effective and efficient incre- mental learning system without the need for deep architecture. IEEE transactions on neural networks and learning systems, 29(1):10–24, 2017.

Hu Chen, Yi Zhang, Mannudeep K Kalra, Feng Lin, Yang Chen, Peixi Liao, Jiliu Zhou, and Ge Wang. Low-dose ct with a residual encoder-decoder convolutional neural network. IEEE transactions on medical imaging, 36(12):2524–2535, 2017.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.

Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pages 7–10, 2016.

Lingyang Chu, Xia Hu, Juhua Hu, Lanjun Wang, and Jian Pei. Exact and consistent

image

Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on learning theory, pages 698–728. PMLR, 2016.

George E Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on audio, speech, and language processing, 20(1):30–42, 2011.

Louis De Branges. The stone-weierstrass theorem. Proceedings of the American Mathematical Society, 10(5):822–824, 1959.

Zhijie Deng, Yinpeng Dong, Shifeng Zhang, and Jun Zhu. Understanding and exploring the network with stochastic architectures. Advances in Neural Information Processing Systems, 33, 2020.

Simon Du and Wei Hu. Width provably matters in optimization for deep linear neural networks. In International Conference on Machine Learning, pages 1655–1664. PMLR, 2019.

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably

image

Richard Ehrenborg. The perles-shephard identity for non-convex polytopes. Technical report, Technical Report, University of Kentucky, 2007.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907–940. PMLR, 2016.

F. Fan and G. Wang. Fuzzy logic interpretation of quadratic networks. Neurocomputing, 374:10–21, 2020.

Fenglei Fan, Wenxiang Cong, and Ge Wang. A new type of neurons for machine learn- ing. International journal for numerical methods in biomedical engineering, 34(2):e2920, 2018a.

Fenglei Fan, Wenxiang Cong, and Ge Wang. Generalized backpropagation algorithm for training second-order neural networks. International journal for numerical methods in biomedical engineering, 34(5):e2956, 2018b.

Fenglei Fan, Dayang Wang, Hengtao Guo, Qikui Zhu, Pingkun Yan, Ge Wang, and Hengy- ong Yu. On a sparse shortcut topology of artificial neural networks, 2018c.

Fenglei Fan, Hongming Shan, Mannudeep K Kalra, Ramandeep Singh, Guhan Qian, Matthew Getzin, Yueyang Teng, Juergen Hahn, and Ge Wang. Quadratic autoencoder (q-ae) for low-dose ct denoising. IEEE transactions on medical imaging, 39(6):2035–2050, 2019.

Fenglei Fan, Jinjun Xiong, and Ge Wang. Universal approximation with quadratic deep networks. Neural Networks, 124:383–392, 2020.

Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

Federico Girosi and Tomaso Poggio. Representation properties of networks: Kolmogorov’s theorem is irrelevant. Neural Computation, 1(4):465–469, 1989.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks, 2017.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014a.

Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2014b.

Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. Relu deep neural networks and linear finite elements, 2018.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.

Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao. Why do deep residual networks generalize better than deep feedforward networks?–a neural tangent kernel perspective, 2020.

Arthur Jacot, Franck Gabriel, and Cl´ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.

Duofa Ji, Jin Liu, Weiping Wen, Changhai Zhai, Wei Wang, and Evangelos I Katsanos. Prediction of cumulative absolute velocity based on refined second-order deep neural network. Journal of Earthquake Engineering, pages 1–20, 2021.

Kenji Kawaguchi, Jiaoyang Huang, and Leslie Pack Kaelbling. Effect of depth and width on local minima in deep learning. Neural computation, 31(7):1462–1498, 2019.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR

image

A. N. Kolmogorov. On the representation of continuous functions of several variables by superpositions of continuous functions of a smaller number of variables. Proceedings of the USSR Academy of Sciences, 108:179–182, 1956.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In International conference on machine learning, pages 1378–1387. PMLR, 2016.

Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. Adversarial examples in the physical world, 2016.

Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington,

image

Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, and Amnon Shashua. Limits to depth efficiencies of self-attention, 2020.

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets, 2016.

Xingguo Li, Junwei Lu, Zhaoran Wang, Jarvis Haupt, and Tuo Zhao. On tighter general- ization bound for deep neural networks: Cnns, resnets, and beyond, 2018.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochas- tic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.

William A Light and Elliot W Cheney. Approximation theory in tensor product spaces, volume 1169. Springer, 2006.

Hongzhou Lin and Stefanie Jegelka. Resnet with one-neuron hidden layers is a universal approximator. Advances in Neural Information Processing Systems, 31:6169–6178, 2018.

Zachary C Lipton. The mythos of model interpretability. Queue, 16(3):31–57, 2018.

Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: a view from the width. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6232–6240, 2017.

Pranav Mantini and Shishr K Shah. Cqnn: Convolutional quadratic neural networks. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 9819–9826. IEEE, 2021.

Alexander G de G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin

image

Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.

Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2574–2582, 2016.

Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A pac-bayesian approach

image

Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A

image

Yookoon Park, Sangho Lee, Gunhee Kim, and David Blei. Unsupervised representation learning via neural activation coding. In International Conference on Machine Learning, pages 8391–8400. PMLR, 2021.

Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. In International Conference on Learning Representations, 2018.

E. W. Saad and D. C. Wunsch II. Neural network explanation using inversion. Neural networks, 20(1):78–93, 2007.

Thiago Serra and Srikumar Ramalingam. Empirical bounds on linear regions of deep rectifier networks. In AAAI, pages 5628–5635, 2020.

Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and count- ing linear regions of deep neural networks. In International Conference on Machine Learning, pages 4558–4566. PMLR, 2018.

Rudy Setiono and Huan Liu. Understanding neural networks via rule extraction. In IJCAI, volume 1, pages 480–485. Citeseer, 1995.

Lech Szymanski and Brendan McCane. Deep networks are effective encoders of periodicity. IEEE transactions on neural networks and learning systems, 25(10):1816–1827, 2014.

Sebastian Thrun. Extracting rules from artificial neural networks with distributed repre- sentations. pages 505–512. MORGAN KAUFMANN PUBLISHERS, 1995.

Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.

Ge Wang. A perspective on deep imaging. Ieee Access, 4:8914–8924, 2016.

Shuning Wang and Xusheng Sun. Generalization of hinging hyperplanes. IEEE Transactions on Information Theory, 51(12):4425–4431, 2005.

David Vernon Widder. Advanced calculus. Courier Corporation, 1989.

Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convo- lutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.

Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1284–1293, 2019.

Huan Xiong, Lei Huang, Mengyang Yu, Li Liu, Fan Zhu, and Ling Shao. On the number of linear regions of convolutional neural networks. In International Conference on Machine Learning, pages 10514–10523. PMLR, 2020.

Zirui Xu, Fuxun Yu, Jinjun Xiong, and Xiang Chen. Quadralib: A performant quadratic

image

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Under- standing deep learning requires rethinking generalization, 2016.

Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient and ac- curate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and pattern Recognition, pages 1984–1992, 2015.


Designed for Accessibility and to further Open Science