Over the past years, deep learning (Goodfellow et al., 2016; Fan et al., 2018a) has become the mainstream approach of machine learning and achieved the state-of-the-art performance in many important tasks (Dahl et al., 2011; Kumar et al., 2016; Chen et al., 2017; Wang, 2016). One of the key reasons that accounts for the success of deep learning is the increased depth, which allows a hierarchical representation of features. There are a number of papers dedicated to explaining why deep networks are better than shallow ones. Encouraging progress has been made along this direction. The idea to show the superiority of deep networks is basically to find a special family of functions that are very hard to be approximated by a shallow network but easy to be approximated by a deep network, or that a deep network can express complicated functions that a wide network could not (Szymanski and McCane, 2014; Cohen et al., 2016; Mhaskar and Poggio, 2016; Eldan and Shamir, 2016; Montufar et al., 2014; Bianchini and Scarselli, 2014). For example, in Eldan and Shamir (2016), a special class of radial functions was constructed so that a one-hidden-layer network needs to use an exponential number of neurons to obtain a good approximation, but a two-hidden-layer network only requires a polynomial number of neurons for the same purpose. With the number of linear regions as the complexity measure, Montufar et al. (2014) showed that the number of linear regions grows exponentially with the depth of a network but only polynomially with the width of a network. In Bianchini and Scarselli (2014), a topological measure was utilized to characterize the complexity of functions. Then, it was shown that deep networks can represent more complex functions than what the shallow counterparts express. Besides, width-bounded but depth-unbounded universal approximators were also developed (Lu et al., 2017; Lin and Jegelka, 2018; Fan et al., 2018c) in analogy to the depth-bounded but width-unbounded universal approximators (Funahashi, 1989; Hornik et al., 1989).
Recently, the effects of width are discussed by more and more studies (Cheng et al., 2016; Chen and Liu, 2017; Zagoruyko and Komodakis, 2016). Since width and depth are the most basic topology measures of a neural network, exploring the roles of width and depth in neural networks is a problem of strong interest and importance. Currently, there exist both width-bounded and depth-bounded universal approximators. Since both width-bounded and depth-bounded networks can represent any function, they can represent each other as well, which suggests the width-depth equivalence of neural networks. Nevertheless, how a neural network learns a mapping is quite different from the way used in proving the universal approximation. Moreover, the core of the width-depth conversion is to employ a network to learn another network instead of any function. Therefore, the width-depth conversion based on universal approximation falls short to capture the relationship between width and depth.
Specifically, we argue that the width-depth conversion via universal approximation is simplistic, inefficient, and lack of insight: 1) (Simplistic) As mentioned earlier, the way used in enabling universal approximation is to divide a target function into many functions over tiny hypercubes. In practice, a network usually does not do so, e.g., a ReLU network divides the space into polytopes. 2) (Inefficient) The published universal approximation analyses do not consider the character of a target function, and just inefficiently divide the function into many pieces over tiny hypercubes. However, in the width-depth transformation, the problem is how to express a wide (or deep) network using a deep (or wide) network. Thus, the properties of networks should be used to make a much more efficient transformation. 3) (Lack of insight) Universal approximation theoretically guarantees the representation capability of neural networks but it does not give any new insight in terms of difference in consumed computational resources and a practical procedure for the transformation from a wide (or deep) network to a deep (or wide) counterpart. As a small summary, utilizing universal approximation theorems to do width-depth transformation do not accurately characterize the width-depth conversion. We still do not fully resolve essential questions on width and depth, e.g., how to non-trivially do transformation between a wide and a deep network, and what the mechanism of interaction is between the width and depth of a network.
To fill this gap, inspired by the De Morgan law, here we demonstrate from two perspectives that the width and depth of neural networks are quasi-equivalent. The first perspective leverages that a ReLU network is a piecewise linear function over polytopes, while second perspective utilizes the nested structure of deep networks and the parallel structure of wide networks. Specifically, in the first perspective, we revisit the De Morgan law:
where is a propositional rule (
belongs to some class), and such rules are disjoint. A neural network can be linked to a rule-based system such as a collection of propositional rules. Straightforwardly, we can construct either a deep network to realize a union of propositional rules (left side) or a wide network that realizes the complement of the intersection of those rules after complement (right side). As a result, the constructed deep and wide networks are equivalent to each other. Furthermore, we elaborate the quasi-equivalence of general regression and classification networks by constructing two transforms mapping an arbitrary ReLU network to a wide network and a deep network, respectively, thereby verifying a general quasi-equivalence of the width and depth of ReLU networks. Our constructive scheme is largely based on the fact that a ReLU network partitions the space into polytopes (Chu et al., 2018). This enables us to have a simplicial complex in the space and then to establish a quasi-equivalence of networks using the essential building blocks, fan-shaped (more generally, hyper-cone-shape) functions, in the form of modularized ReLU networks.
Table 1: Network structures and complexities through transformation of regression and classification networks. D is the input dimension, and M is the complexity measure of a function class represented by ReLU networks.
In the second perspective, we replace the mainstream artificial neuron type with a quadratic counterpart and extend our first perspective by utilizing the factorization and continued fraction representations of the same univariate polynomial to construct wide and deep networks, respectively. Specifically, a univariate polynomial function can be expressed as follows:
where are related. Previously, inspired by neuronal diversity, our group designed the quadratic neuron (Fan et al., 2018a) that replaces the inner product in a conventional neuron with a quadratic function. Due to the merits of the idea, the network based on quadratic neurons have been increasingly studied and applied (Bu and Karpatne, 2021; Ji et al., 2021; Mantini and Shah, 2021; Xu et al., 2022). Here, we can construct a wide quadratic network and a deep quadratic network to implement the left side and right side of Eq. (2), respectively. This establishes the equivalence between wide and deep quadratic networks. Finally, we generalize such an equivalence into a multivariate setting based on Kolmogorov-Arnold theorem (Kolmogorov, 1956).
Our main contribution is the establishment of the width-depth quasi-equivalence of neural networks. We summarize our main results on ReLU networks (the first perspective) and quadratic networks (the second perspective) in Tables 1 and 2, respectively. Specifically, Table 1 lists the width and depth of the wide and deep networks constructed in our first perspective, where the complexity measure M is the minimum number of simplices needed to cover the polytopes formed by a ReLU network. Table 2 shows the width and depth of the wide and deep quadratic networks constructed in our second perspective, where the complexity measures are the degrees of polynomials to represent a function of interest. Clearly, given a complexity measure, the width of the constructed wide network is greater than the depth, while the depth of the constructed deep network is greater than the width.
Table 2: Network structures and complexities through the extension of the De Morgan law. D is the input dimension, and are the complexity measure of a function class.
To put our contributions in perspective, we would like to mention relevant studies. Jacot et al. (2018) proposed the theory of the neural tangent kernel (NTK), which provides a useful lens to understand a network when the width of a network goes to infinity. Kawaguchi et al. (2019) analyzed the effect of width and depth on the quality of local minima. They showed that the quality of local minima improves toward the global minima as depth and width become larger. Levine et al. (2020) revealed the width-depth interplay in a self-attention network. We discuss the width of neural networks as related to NTK in the Supplementary Information I. To the best of our knowledge, our study is the first work to reveal the width-depth quasi-equivalence of neural networks.
2.1 Preliminaries
For convenience, we use to denote the ReLU function. We mainly discuss ReLU networks in this work, thus all networks in the rest of this paper are referred as ReLU networks unless otherwise specified. At the same time, we focus on the fully-connected ReLU networks.
Definition 1 A regression network is a network with continuous outputs, while a classifi-cation network is a network that produces categorical outputs (for example, digit recognition). The classification network is obtained by thresholding a ReLU network in the last layer. In this study, we investigate a classification network with binary labels without loss of generality.
Definition 2 (Width and depth of a feedforward network (Arora et al., 2016)) For any number of hidden layers , input and output dimensions
feedforward network is given by specifying a sequence of k natural numbers
representing widths of the hidden layers, a set of k affine transformations
for i = 1, . . . , k and a linear transformation
corresponding to weights of the hidden layers. The function
computed or represented by this network is
where denotes function composition, and
is an activation function. The depth of a ReLU DNN is defined as k + 1. The width of a ReLU DNN is max
Definition 3 (Width and depth of a shortcut network) Given a shortcut network we delete a minimum number of links such that the resultant network
is a feedforward network without any isolated neuron. Then, we define the width and depth of
as the width and depth of
, respectively.
Over the past several years, increasingly diversified network architectures, such as randomly wired networks (Xie et al., 2019), networks with stochastic structures (Deng et al., 2020), etc. are used as backbones for deep learning. Our definitions for width and depth are applicable to many unusual network configurations. Moreover, they are also natural extensions of the conventional width and depth definitions and can make sense for common networks.
Definition 4 (Simplicial complex) A D-simplex S is a D-dimensional convex hull provided by convex combinations of D + 1 affinely independent vectors
words, S =
called a face of S. A simplicial complex
satisfying: 1) every face of a simplex from S is also in S; 2) the non-empty intersection of any two simplices is a face of both
Proposition 1 Suppose that f(x) is a function represented by a ReLU network, then f is a piecewise linear function that splits the space into polytopes, where each polytope is convex and associated with a linear function.
Proof Inspired by the idea in (Chu et al., 2018), the proof here is for all ReLU networks, including networks using shortcuts. Let a vector denote the firing states of all neurons in the network, where N is the total number of neurons, and
-th neuron is not fired and vice versa. The firing state of every neuron is determined by the input x, then we denote the set of instances that share the same collective neuron firing state
as the polytope
, the output of the neuron i is a linear function, denoted as
). Then, the firing state of the neuron i is controlled by a linear inequality (
total,
is equivalent to a set of N linear inequality constraints, indicating that
is a convex polytope.
Definition 5 We define the complexity of the function represented by a ReLU network as the minimum number of simplices: M that are needed to cover each and every polytope to support the function of the ReLU network.
Figure 1: (a) A one-hidden-layer network with three neurons to classify concentric rings. (b) A one-hidden-layer network with six neurons to classify concentric rings.
Here, we elaborate why M is a good measure. Previously, because a deep network with piecewise linear activation is a piecewise linear function, the number of linear regions (polytopes) was intensively studied to measure the complexity of a neural network. For example, Montufar et al. (2014) and Serra et al. (2018) estimated the upper and lower bounds of the number of linear regions with respect to the number of neurons at each layer. Xiong et al. (2020) computed the bounds for convolutional neural networks. Park et al. (2021) proposed neural activation coding to maximize the number of linear regions to improve the model performance. Despite these results, we find that there exists a problem with the number of linear regions as a complexity measure. It may happen that simple and complex networks realize the same number of regions for a given task. As shown in Figure 1, two networks divide the space into two linear regions to separate concentric rings. But one network uses three neurons to define a triangle domain, while the other has six neurons to form a hexagon. In this example, according to the number of linear regions, the two networks have the same complexity but the six-neuron network is apparently more complex than the other. To address this problem, the complexity of a linear region should be taken into account as well. We argue that how many simplices a linear region comprises indicates how complex a linear region is. Therefore, we propose to use the number of simplices as a legitimate complexity measure. In Figure 1, counting the number of simplices, the complexity of two networks are 2 and 4, respectively, which is a better characterization.
Let us estimate the lower bound of M. To this end, we need to take advantage of the lower bound of the number of polytopes. Empirical bounds of the number of polytopes (in a feedforward ReLU network were estimated in (Montufar et al., 2014; Serra et al., 2018; Serra and Ramalingam, 2020), where one result in (Montufar et al., 2014) states that let
, be the number of neurons in the i-th layer, and D be the dimension of the input space,
is lower bounded by
. The polytopes constructed therein are hypercubes. Because the minimum number of simplices that fill a D-dimensional hypercube is
Definition 6 We define a wide network and a deep network as follows. Let us assume a function that can be sufficiently complex and yet can be represented by a network. When such a function becomes increasingly complex, the structure of this network must be also increasingly complex, depending on the complexity of the function. We call a network wide if its width is larger than its depth by at least an order of magnitude in vs
is the complexity measure and
. Similarly, we call a network deep if its depth is larger than its width by at least an order of magnitude in M.
It is underscored that we use two different concepts: the complexity of the function class represented by networks and the structural complexity of a network. The former measures the complexity of the function, while the latter measures the topological structure of a network. In our transformation scheme, the structures of constructs/networks are determined by the complexity of the function of interest.
Definition 7 We call a wide network is equivalent to a deep network
Ω
. We call a wide network
is quasi-equivalent to a deep network
, if there is
is a Lebesgue measurement defined on Ω.
2.2 Motivating Example
An important school of neural network interpretability research (Fan and Wang, 2020; Adadi and Berrada, 2018) is to extract interpretable rules from a network (Thrun, 1995; Setiono and Liu, 1995; Saad and II, 2007) using decompositional or pedagogical methods (Thrun, 1995). Pedagogical methods decode a set of rules that imitate the input-output relationship of a network, whereas these rules do not necessarily correspond to the parameters of the network. One common type of rules are propositional in the IF-THEN format, where the preconditions are provided as a set of hypercubes with respect to the input: IF belongs to some class. Since there is a connection between the rule-based inference and the network-based
inference, we consider a neural network in terms of propositional rules. Furthermore, we
Figure 2: The width and depth equivalence in light of the De Morgan law duality. In this construction, a deep network implements using a trapezoid function, and a wide network implements
the trap-like function. (
denotes the ReLU activation.
know that the De Morgan law holds true for disjoint propositional rules. Mathematically, the De Morgan law is formulated as
where is a rule, and
is its negation. The De Morgan law gives a duality in the sense of binary logic that the operations
are dual, which means that for any propositional rule system described by
, there exists an equivalent dual propositional rule system
Regarding each rule as an indicator function over a hypercube:
in Figure 2, we construct a deep network that realizes a logic union of propositional rules (the left hand side of Eq. (5)) and a wide network that realizes the negation of the logic intersection of those rules after negation (the right hand side of Eq. (5)). As a result, the constructed deep and wide networks are equivalent by the De Morgan law.
The above motivating example inspires us to consider the width-depth equivalence in a broader domain. First, a ReLU network is a piecewise linear function over polytopes. To generate rules, such a piecewise linear function should be divided into simplices instead of hypercubes. As shown in Figure 3, we only need two rules if we build rules over simplices, which is much more efficient than building rules over hypercubes. Second, the network can be a regression network rather than a classification network. Thus, representing a linear function rather than an indicator function is demanded. Based on these two considerations, we generalize an indicator function over a hypercube to a linear function over a simplex in a bounded domain:
Figure 3: Building rules over simplices is more efficient than over hypercubes for a network.
2.3 Quasi-Equivalence of Width and Depth of Networks
This section describes the first contribution of our paper. We formulate the transformation from an arbitrary ReLU network to a wide network and a deep network, respectively. We use a network-based building block to represent a linear function over a simplex. Integrating such building blocks can represent any piecewise linear function over polytopes, thereby elaborating a general equivalence of the width and depth of networks. Particularly, a regression ReLU network is converted into both a wide and a deep ReLU network (Theorem 9), while a classification ReLU network is a special case of a regression ReLU network (Theorem 10). In the regression networks, the transformation of a univariate network is rather different from that of a multivariate network. As a result, the equivalence for the wide and deep networks in the univariate case is precise, whereas the multivariate wide and deep networks are made approximately equivalent up to an arbitrarily small error. What’s more, in the multivariate case, the width of the wide network is not the same as the depth of the deep network. This is why we term such an equivalence as a quasi-equivalence.
2.3.1 Regression Networks
The sketch of transforming a regression ReLU network is that we first construct either a wide modular network or a deep modular network to represent the corresponding function over each and every simplex, then we aggregate the results into deep or wide networks in series or parallel, respectively, to represent the original network well.
Theorem 8 (Equivalence of Univariate Regression Networks) Given any ReLU network , there is a wide ReLU network
and a deep ReLU network
, such that
Our main result is formally summarized as the following quasi-equivalence theorem for the multivariate case.
Theorem 9 (Quasi-Equivalence of Multivariate Regression Networks) Suppose that the representation of an arbitrary ReLU network is is the minimum number of simplices to cover the polytopes to support
, there exist a wide ReLU network
, and also a deep ReLU network
, satisfying that
where is the standard measure in [
We defer the proof of Theorem 8 to Supplementary Information II, and split the proof of Theorem 9 into the two-dimensional case (more intuitive) in Appendix and the general case in Supplementary Information II for better readability.
Figure 4: Due to the unboundedness and continuity, the representations in (Wang and Sun, 2005; He et al., 2018) is handicapped in representing a function over polytopes that make a non-convex region.
The key idea to represent a linear function over a simplex is to construct high-dimensional fan-shaped functions that are supported in fan-shaped domains, and to use these constructs to eliminate non-zero functional values outside the simplex of interest. This is a new and local way to represent a piecewise linear function over polytopes. In contrast, there are two global ways to represent piecewise linear functions (Wang and Sun, 2005; He et al., 2018). In (Wang and Sun, 2005), for every piecewise linear function exists a finite set of linear functions
and subsets
] such that
. In (He et al., 2018), the rep- resentation is
. Nevertheless, due to the unboundedness and continuity, the global representation of a piecewise linear function is handicapped over polytopes that make a non-convex region. Let us use Figure 4 to illustrate our point, where the relations of
are summarized in Table 3, and the function value over the purple area is zero.
• Representation in Wang and Sun (2005); He et al. (2018):
Table 3: Regions and relations of functions.
• Ours:
It can be seen that due to the unboundedness and continuity, is inaccurate over the purple area. But our representation is accurate over the purple area because it is local. We highlight the construction of fan-shaped functions, which opens a new door for high-dimensional piecewise function representation. Particularly, the employment of fan-shaped functions will enable a neural network to express a manifold more effectively.
Figure 5: Typical fan-shaped functions constructed by a modularized network to eliminate non-zero functional values outside the simplex of interest.
Since such a fan-shaped function is a basic building block in our construction of wide and deep equivalent networks, let us explain it in a two-dimensional case for easy visualization. An essential building block expressed by a network in Figure 5(a) is based on the following function:
where are provided by two linearly independent vectors
is a positive controlling factor. Eq. (7) is a ReLU network of depth=2 and width=2 according to our width-depth definition. As illustrated in Figure 5(b), the support of F(x) contains three boundaries and four polytopes (two of which only allow zero value of F). For convenience, given a linear function
. Thus, we can write Ω
There are three properties of F(x). First, the common line shared by Ω
Second, the size of Ω
is adjustable by controlling
. Note that
move very close to
, which makes Ω
negligible. In the limiting case, the support of F(x) converges to the fan-shaped domain Ω
is almost parallel to
is big enough, we approximate the area of Ω
the product of the length of
and the distance between two lines, which yields
. Third, the function F over the fan-shaped area Ω
Remark 1. As a ReLU network of interest partitions the space into more and more polytopes, the number of needed simplices will go increasingly larger. Because the lower bound of M in Eq. (4) is far larger than , the width of
) and the depth of
) will dominate. Furthermore, the width of
) is higher than its depth by an order of magnitude in terms of M, and the depth of
) is higher than its width in a similar way. Therefore,
) is a wide network, and
) is a deep network.
Remark 2. Compared to universal approximation, our construction, with the use of fan-shaped functions, is valuable in the following aspects: 1) Our construction utilizes the character of ReLU networks. It divides the space into finitely many simplices instead of infinitely many tiny hypercubes; 2) Given a target network, the complexity of our construction does not change with the prescribed error rate. In contrast, the complexity of the construction schemes used in the universal approximation analyses would increase as the preset error decreases. Therefore, our construction is much more efficient; 3) Our construction offers a new and local way to represent a piecewise linear function over polytopes, which is more flexible in representing discontinuous piecewise linear function than the global ways (Wang and Sun, 2005; He et al., 2018). Furthermore, inspired by the network structure used to construct the proposed fan-shaped function, we find that intra-layer links can enhance the representation capability of a shallow network, closely relevant to the published results on “depth separation”. Because intra-layer links greatly increase the number of pieces represented by a network, a shallow network with intra-layer links can express a complicated piecewise linear function as well as a deep network! For more details on this new finding, please see Supplementary Information IX for details.
2.3.2 Classification Networks
A regression network gives a continuous output, while a classification network produces categorical outputs (for example, for digit recognition). The classification network is derived by thresholding the output of the last layer of a ReLU network. In this study, we investigate a classification network with binary labels without loss of generality. We can directly build the equivalence for classification networks in the same way as the regression networks.
Theorem 10 (Quasi-Equivalence of Classification Networks) Without loss of generality for multi-class classification, we assume a binary output. Suppose that the representation of an arbitrary ReLU network is is the minimum number of simplices to cover the polytopes to support
, there exist a wide ReLU network
, and also a deep ReLU network
, satisfying that
where is the standard measure in [
Proof The key is to regard the classification network as a special case of regression network. Then, applying the construction techniques used in the proof of Theorem 9 will lead to that for any
which verifies the correctness of Theorem 10.
A classification neural network can be interpreted as a disjoint rule-based system by splitting the representation of a neural network into many decision polytopes: IF (
certain polytope), THEN (input belongs to some class). Furthermore, each rule is a local function supported over a decision region.
Remark 3. The De Morgan equivalence in Figure 2 can be summarized as
when the rules are based on hypercubes. Theorem 10 corresponds to
when rules are based on simplices.
3.1 Preliminaries
Definition 11 Quadratic neurons (Fan et al., 2018a,b, 2020) integrate the n-variable input x with a quadratic function as follows before being nonlinearly processed:
where are vectors of the same dimensionality as that of
are biases, and
is the Hadamard product.
The network using quadratic neurons is interesting in several ways: 1) Over the past years the design of neural networks has been focusing on architectures, such as shortcut connections, transformer structure, etc. Almost exclusively, the mainstream deep learning models are constructed with neurons of the same type, which are characterized by the inner product and nonlinear activation. Despite that an artifical neural network was invented via biomimicry, the current artificial networks and a biological neural system are fundamentally different in terms of neuronal diversity and complexity. As we know, a biological neural system coordinates numerous types of neurons to support intellectual behaviors. To fill in this gap, we believe that neuronal diversity should be taken into account in machine learning; 2) Due to the enhanced expressive power at the neuronal level, quadratic neurons have been used in real-world problems and achieved superior performance, such as medical imaging (Fan et al., 2019), civil engineering (Ji et al., 2021), applied math (Bu and Karpatne, 2021), and so on. Given the utility of quadratic neurons, research on networks with quadratic neurons is attractive.
Hereafter, networks made of quadratic neurons are referred to as quadratic networks. We refer to the neurons using inner-product as the conventional neurons, and corresponding networks are called conventional networks. Please note that quadratic neurons are distinct from the conventional neurons that use quadratic activation. The decision boundary of the latter is still linear. The key characteristic of quadratic neurons is the employment of the quadratic function replacing the inner-product, and the choice of activation functions is flexible. Hereafter, we refer to networks made of quadratic neurons as quadratic networks.
Proposition 2 Any univariate polynomial of degree N can be perfectly computed by a quadratic ReLU network with the depth of logand the width of N (Fan et al., 2020).
Proof Please refer to (Fan et al., 2020) for a detailed proof. Any univariate polynomial of degree N can be factorized as ), without the incorporation of complex numbers. Due to the identity:
)), every two quadratic neurons
)) can be ensembled together to perfectly express
in the first layer of a network, followed by the half number of quadratic neurons in the second layer to combine the yields of the first layer, and so on and so forth. Consequently, a quadratic ReLU network with a depth of log
) and a width of N can express any univariate polynomial of degree N.
Proposition 3 (Stone-Weierstrass Theorem (De Branges, 1959)) Let f(x) be a continuous real-valued function over [a, b], then for any , there exists a polynomial P(x), satisfying
In the above propositions, Proposition 1 shows a quadratic ReLU network can express a univariate polynomial through factorization, which corresponds to the left side of Eq. (111). At the same time, this quadratic network is wide. Proposition 2 suggests that a polynomial can represent any continuous function. Proposition 3 informs us that a multivariate function can be expressed as a combination of univariate functions, which serves as a bridge to generalize our result from approximation to univariate functions to the case of multivariate functions.
3.2 Motivating Example
In contrast to the factorization representation, there is a continued fraction representation for a general univariate polynomial:
where right side of Eq. (111), the first part contains the terms with even powers of x, while the second part with the odd powers of x. In Supplementary Information V, we prove the correctness of Eq. (111) in detail.
Figure 6: The width and depth equivalence for networks of quadratic neurons. In this con- struction, a deep network is to implement the continued fraction of a polynomial, and a wide network reflects the factorization of the polynomial.
Since both the factorization representation and the continued fraction representation express the same polynomial, we have the following identity:
where are intrinsically connected. Let
which could be somehow analogized to the De Morgan law by considering that multiplication and composition operations replace operations, respectively, for either side of Eq. (17). But the objects of those operations are global functions instead of local rules. Eq. (17) is inspiring in the sense that the left side is of a parallel computational structure, and the right side is with two nested structures. Clearly, the left and right sides of Eq. (17) suggest a wide network and a deep network, respectively.
In Figure 6, a deep quadratic network is constructed to represent the right side of Eq. (17), while a wide quadratic network is constructed to represent the left side. In light of Eq. (17), these two quadratic networks are functionally equivalent to each other. In the case of the deep network, if the polynomial is of degree 2N, then the depth of the deep network is N. The inter-layer relationship in the two branches are respectively are the outputs of the
layer in either branch. Specially,
In this deep network, the activation function is in a form of reciprocal relation between two inputs:
, which can be well approximated by a ReLU function after a proper normalization operation. In Supplementary Information V, we demonstrate that the reciprocal activation works effectively in the deep network after some twists. As for constructing the wide network, we employ Proposition 2 directly. Suppose that the polynomial is of degree 2N, the resultant quadratic network has a width of 2N and a depth of only log
), where the width dominates.
Despite the novelty, the above equivalence only fits univariate polynomials. To make a broader impact, we further demonstrate that in terms of representing a multivariate polynomial, the width and the depth of a neural network is also equivalent to each other.
3.3 Width-Depth Quasi-Equivalence
This section delineates the second contribution of our paper. Particularly, we first show that a combination of univariate polynomials can approximate any continuous multivariate function. Then, we leverage the constructed wide and deep quadratic networks for univariate polynomials to represent a general continuous multivariate function. As a result, the resultant wide and deep networks offer the same functionality.
Lemma 12 For any continuous function , there exists a function formulated as
are univariate polynomials, such that
Proof First, we apply the famous Kolmogorov-Arnold theorem to express f:
where are continuous. According to Proposition 3, for every function
), given an arbitrarily small quantity
0 , there exists a polynomial
such that
Combining ) and applying the triangle inequality, we have
Because Φis a continuous function, we can use the property of continuity. We choose a sufficiently small
such that for every Φ
, the following inequality holds:
With respect to the continuous function Φ, we can find a polynomial
Next, applying the triangle inequality again, we have
Finally, integrating Φ, we immediately obtain that
which concludes the proof.
Theorem 13 (Quasi-Equivalence by the Extension of the De Morgan Law) Given a continuous function , there exists a function expressed as
where are polynomials of degrees deg(
. Correspondingly, let
, there exist a wide quadratic network
of a width max
and a depth log
and a deep quadratic network
width 2(2D + 1) and a depth
, satisfying
Proof To prototype the wide network, we can use the wide quadratic sub-network scheme in Figure 6 to express whose [width, depth] are [deg(
[deg())], respectively. A straightforward combination of these wide sub-networks can express
)). Thus, the width of the construction is the summation of the widths of all the sub-networks: max
max
, while the depth is max
log
For the deep network, we use the deep quadratic sub-network shown in Figure 6 to express whose [width, depth] are [2
)], respectively. Similarly, by integrating these deep sub-networks, we can have a deep network that expresses
)). As a result, the width of the derived network is 2(2D + 1), and the depth is max
In Supplementary Information IV, we use the quasi-equivalence relationship to construct wide and deep quadratic network variants for the same task, verify their validity on the MNIST dataset, and show empirical hints that a wide network might have an enhanced robustness, since adversarial samples have much less room to perturb latent features and play tricks.
Gains of Equivalence Notion. Due to the importance of width and depth, a width-depth equivalence theory should have a major impact on deep learning. Some commonly-used deep learning theories, such as neural tangent kernel (NTK) and neural network Gaussian process (NNGP), were developed assuming an infinite width. Naturally, in the era of deep learning we are more concerned with how depth affects the behavior of a neural network. Therefore, depth-oriented theories would be highly desirable. Our work suggests that depth-oriented NTK and NNGP might be feasible since width and depth are equivalent. 2) Our work suggests a significant potential of wide networks in practical applications. Currently, wide networks are less popular than deep networks due to their sub-optimal performance. However, the reason why wide networks do not work as well might be related to the training strategies instead of the model capacities, since width and depth are equivalent in representation power. If wide networks can be trained to match the performance of deep networks, they suggest an alternative path to develop neural networks.
Equivalent Networks. In a broader sense, our quasi-equivalence studies demonstrate the existence of mutually equivalent networks. We argue that the network equivalence is useful in network design. Although deep networks manifest superb power, their applications can be constrained, for example, when the application is time-critical. In that case, we can convert the deep network to a wide counterpart that can be executed at a high speed. A direction for network design optimization is to derive a compact network that maintains a high performance of an original large network through quantization (Wu et al., 2016), pruning (Li et al., 2016), distillation (Polino et al., 2018), low-rank approximation (Zhang et al., 2015), etc. We envision that the equivalence of a deep network and a wide network suggests a new means of network design. Ideally, a wide network can replace the well-trained deep network without compromising the performance. Due to the parallel nature, the wide network can be trained on a computing cluster with many machines for the fast training. At the same time, the inference time of the equivalent wide network is shorter than its deep counterpart.
Width-Depth Correlation. Every continuous n-variable function be in the
sense represented by partially separable multivariate functions (Light and Cheney, 2006):
where is an arbitrarily small positive number,
is a continuous function, and L is the number of products. In the Supplementary Information V and VI, we justify the suitability of this partially separable representation by showing its boundedness, and comparing it with other representations.
Further, we correlate the width and depth of a network to the structure of a function to be approximated. In a nutshell, each continuous function can be approximated by a polynomial of some degree, which can be appropriately represented by quadratic neurons. As a consequence, via a quadratic representation scheme, the width and depth of a network structure must reflect the complexity of
). In other words, they are controlled by the nature of a specific task. As the task becomes complicated, the width and depth must increase accordingly, and the combination of the width and depth is not unique. For more details, please see the Supplementary Information VII.
Effects of Width on Optimization and Generalization. In the Supplementary Information VIII, we illustrate the importance of width on optimization in the context of over-paramterization, kernel ridge regression, and NTK, and then report our findings that the existing generalization bounds also shed light on the relationship of the width and depth given a fixed complexity.
Inspired by the De Morgan law and through a systematic analysis, we have established the quasi-equivalence between the depth and width of ReLU neural networks from two perspectives. In the first perspective, we have formulated two transforms for mapping an arbitrary regression/classification ReLU network to a wide ReLU network and a deep ReLU network, respectively. In the second perspective, we have extended our quasi-equivalence results from ReLU networks of popular artificial neurons to those of quadratic neurons. This quasi-equivalence represents a step forward in developing a unified deep learning theory. More efforts are needed in the future to refine this quasi-equivalence relationship and find real-world applications.
F. L. Fan is supported by the Rensselaer-IBM AI Research Collaboration Program (http: //airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons), R. Lai is partially supported by an NSF Career Award DMS–1752934 and NSF DMS-2134168. G. Wang is partially supported by an NIH R01 CA237267 grant.
Here, we show the correctness of Theorem 9 in the 2D case. Regarding the transformation of an arbitrary multivariate network, the situation is more complicated than in the case of univariate networks. Nevertheless, we are able to establish the -equivalence, which is a slightly relaxed result.
The sketch of proof: A ReLU network is a piecewise linear function over polytopes, which can be decomposed into a summation of linear functions over a simplex. Lemma 14 shows that a wider network module ) and a deeper network module
represent an arbitrary linear function over a simplex. Next, in Theorem 9, to transform an arbitrary ReLU network into a wide and a deep network, we horizontally aggregate network modules
) to have a wide network, and we use shortcuts to sequentially establish a deep network with
A D-simplex S is a D-dimensional convex hull provided by convex combinations of D+1
affinely independent vectors . In other words, S =
In 2D case, if we write is invertible and
is a template simplex in
. It is clear that the following one-to-one affine mapping between S and ∆ exists, which is
Therefore, we only need to make the construction in the special case where simplify our analysis. The coordinate transform in Eq. (29) can conveniently map the construction from ∆ to S.
Given a linear function . ∆ is enclosed by three lines provided by
and
+1. We write three vertices of ∆ as
Then,
supported on ∆ is expressed as follows:
where
Lemma 14 Suppose that the representation of an arbitrary ReLU network is R expressed as Eq. (30), for any
, there exist a ReLU network
1)(2
, and also a ReLU network
Proof (D=2) Our goal is to approximate the given piecewise linear function therefore, we need to cancel f outside its domain. We first index the polytopes separated by three lines
In addition, we use
to exclude a component. For instance,
. It can be easily verified that
Constructing ): The discontinuity of f in Eq. (30) is a major challenge of representing the function using a ReLU network. To tackle this issue, we start from a linear function ˜
, which can be represented by two neurons
The key idea is to eliminate f over all polytopes outside ∆. In other words, ˜f over three fan-shaped polytopes
should be cancelled.
Let us take the polytope as an example. Note that
has two boundaries
number
to construct the three fan-shaped functions:
where the positive number is chosen to be small enough such that the lines
and
Figure 7: Quasi-equivalence analysis in 2D case. Left: The structure of the wide network to represent f over ∆, where two neurons denote and nine fan- shaped functions handle the polytopes outside ∆. Right: The polytopes outside ∆ comprise of three fan-shaped domains, on which f can be cancelled by three linearly independent fan-shaped functions.
According to the aforementioned properties of fan-shaped functions, we approximately have
Based on the property of fan-shaped functions, approximately constrained into the region
) is approximately counteracted over
. Mathematically, the new function
Similarly, we can construct to eliminate ˜
respectively. Finally, these fan-shaped functions are aggregated to form the following ReLU network
(illustrated in Figure 7(a)):
where the width and depth of the network are 2 + 3 addition, due to the 9 fan-shaped functions being utilized and the effect of the
, the total area of the regions suffering from errors is no more than
Therefore, for any 0, as long as we choose
satisfying
the constructed network N will have
Constructing ): Allowing more layers in a network provides an alternate way to represent
both of which are approximately enclosed by boundaries
Figure 8: Quasi-equivalence analysis in 2D case. Left: The structure of the deep network to represent f over ∆. Right: The polytopes outside ∆ comprise of three fan-shaped domains, on which f can be cancelled by three linearly independent functions over ∆.
fan-shaped regions of ) almost overlap as
is small. The negative sign for
is to make sure that the fan-shaped region is ∆. To obtain the third boundary
separate the fan-shaped region of
) with the boundary
Thus, ) will represent the function
+ 1 over ∆ and zero in the rest area. The depth and width of
) are 3 and 4 respectively. Similarly, due to the employment of the two fan-shaped functions and the effect of
, the area of the region with errors is estimated as
To acquire f over ∆, similarly we need three linear independent functions as linear independent bases. We modify slightly to get
Repeating the procedure described in (1), for
we construct the network
is
, while for
we construct the network
. We set positive numbers
small enough to have two
triangular domains almost identical with ∆. In addition, let
where are solutions. As a consequence, the deep network (illustrated in Figure 8 (c)):
produces f on ∆. The depth and width of the network are 3 and 12. Similarly, the area of the region with errors is bounded above by
Therefore, for any 0, if we choose
then the constructed network will satisfy
Proof (Theorem 9, D = 2) The network h is piecewise linear and splits the space into polytopes. It is feasible to employ a number of simplices to represent the polytopes defined by h Ehrenborg (2007). Given that M is the minimum number of required simplices, we have
where
and -th simplex. For construction of a wide network, we use network modules to represent
) and then horizontally aggregate them into a wide network. In contrast, for construction of a deep network, we sequentially express
) in terms of a network module without linking them to the input.
Representing f with a wide ReLU network: Lemma 14 suggests that a wide network module can generically represent a function over a template simplex. To represent
over
, we need to use Eq. (29) to transform the function from the barycentric coordinate system to the Euclidean coordinate system. Let three vertices of
and
Figure 9: Illustration of deep networks in 1D and 2D cases.
satisfying
By aggregating the network ) horizontally, we have the following wide network:
Therefore, the constructed wide network ) is of width O(20M) and depth 2. It is clear that the width O(20M) of the wide network
) dominates, as the number of needed simplices goes larger and larger.
Representing f with a deep ReLU network: For a deep construction, the fundamental difficulty is how to sequentially express each , the input of each block can only come from the earlier block instead of the input, like what we did in one-dimensional case (Figure 9(a)). Let us derive via induction how to sequentially represent each
adopt the idea of modularized networks, but now each network module has two outputs. Lemma 14 suggests that a deep network module
can generically represent a function over a template simplex.
Step 1. Assume that the two outputs of the first block are . To represent
, similarly, we need to transform the function from the barycentric coordinate system to the Euclidean coordinate system. Let three vertices of
and
which is one output of the first block.
Next, we derive the other output Ω). Note that we are not allowed to use the input directly. Encouraged by the inversion idea in the univariate case, we invert the function domain into the input domain to get a function that is approximately only supported over
To do so, recall that we have in constructing
, we will have (
that
As shown in Figure 9(b), use the residual connection, we compute Ωwhich is zero over
for other regions. Ω
) will be used to feed the next block to construct
Step 2. Suppose that the two output functions of the m-th block are Ω
) is fed into the (m + 1)-th block as the input. Because
is outside the domain of
, with the same technique in Step 1, we construct
where are three vertices of
) to express
) is functionally equivalent to x for two reasons. First, all simplices do not overlap. Ω
outside the domain of
; therefore zero value over
has no effect. Second, w.l.o.g., we assume that (0, 0) is outside of all simplices or lies on the boundary of a certain simplex. As such, we have
, therefore,
)) will not erroneously produce a non-zero constant over
. Furthermore, we also obtain a function ˜
inside the (m + 1)-th block. Apply the residual operation, we have
Lastly, simlilar to the one-dimensional deep network, we use the shortcut connection to aggregate )) to obtain the following deep network:
Therefore, the constructed deep network ) is of depth O(3M) and width 12. Please note that in the above equation, the summation is made by shortcuts. It is clear that the depth of
) dominates over the width.
I. Width as Related to Neural Tangent Kernel (NTK)
NTK sheds light on the power of a network when the width of this network goes to infin-ity. Previous results have suggested a link between the neural network and the Gaussian distribution when the network width increases infinitely. In Neal (1996); Lee et al. (2018); Matthews et al. (2018); Novak et al. (2018); Arora et al. (2019), the Gaussian process is shown in the two-layer networks, the deep networks, and the convolutional networks. Weakly-trained networks (Arora et al., 2019) refer to the networks where only the top layer is trained after other layers are randomly initialized. Let denote the output of the network on input
denotes the parameters of the network, and
is an initialization over
. In the above context, training the top layer with the
penalty reduces to the kernel regression:
where (F) denotes the functional space. On the other hand, an infinite wide network is considered as an over-parameterized network, with the motivation to explain the fact that why over-paramterized networks scale. There were extensive studies on this topic (Du et al., 2018; Allen-Zhu et al., 2019; Cao and Gu, 2019). Notably, these studies indicate that the training renders relatively small changes when the network is sufficiently wide. Such a concentration behavior is justified by NTK in terms of the Gaussian process. The NTK is defined as the following:
where () denotes the gradient space. Along this line, Huang et al. (2020) compared the NTK of the ResNet and the deep chain-like network, and found that the NTK of the chain-like network converges to a non-informative kernel as the depth goes to infinity, while the counterpart of ResNet does not. Du and Hu (2019) showed that the width provably matters in networks using the piecewise linear activation.
II. Transformation of an Arbitrary Network
Here, we provide the proof for Theorem 15 and the experiment that shows the feasibility of the procedures in the proof of Theorem 16. The similar derivation to the proof for Theorem 15 is also seen in Fan et al. (2018c).
Theorem 15 (Equivalence of Univariate Regression Networks) Given any ReLU network with one dimensional input and output variables. There is a wide ReLU network
and a deep ReLU network
, such that
Proof (Theorem 15) Since the function represented by a network piecewise linear, we express f(x) as
where to guarantee continuity, and the slopes of neighboring pieces are different; otherwise they can be fused together.
We first construct a wide ReLU network ) to represent f. This can be straightforward as follows:
where specially +1. Now, we verify this claim. Given
), for some
0, we have the following:
Figure 10: Equivalent wide (a) and deep (b) univariate ReLU networks.
Next, we construct an equivalent deep network dual to the above wide network. Note that (62) is the same as
where sgn(0
1, then it is clear that the following recursive relation holds:
Thanks to the recurrent relation, each can accurately represent a small monotonically increasing piece over [
]. We aggregate the outputs of those n+1 pieces in the output neuron to get the desired deep ReLU network
which is the same as ). To construct
+ 1 consecutive neurons are connected in series. Therefore, such a one-neuron-wide network has n + 2 layers. We plot two types of neural networks
Theorem 16 (Quasi-Equivalence of Multivariate Regression Networks) Suppose that the representation of an arbitrary ReLU network is is the minimum number of simplices to cover the polytopes to support
, there exist a wide ReLU network
, and also a deep
ReLU network , satisfying that
where is the standard measure in [
The sketch of proof: A ReLU network is a piecewise linear function over polytopes, which can be decomposed into a summation of linear functions over a simplex. Lemma 17 shows that a wider network module ) and a deeper network module
represent an arbitrary linear function over a simplex. Next, in Theorem 16, to transform an arbitrary ReLU network into a wide and a deep network, we horizontally aggregate network modules
) to have a wide network, and we use shortcuts to sequentially establish a deep network.
A D-simplex S is a D-dimensional convex hull provided by convex combinations of D+1
affinely independent vectors . In other words, S =
If we write is invertible, and S =
is a template simplex in
It is clear that the following one-to-one affine mapping between S and ∆ exists, which is
where ). We denote the domain of a network as [
. Given a linear function
is enclosed by D + 1 hyperplanes provided by
Lemma 17 Suppose that the representation of an arbitrary ReLU network is R expressed as Eq. (69), for any
, there exist a ReLU network
1)(2
, and also a ReLU network
D + 1, satisfying that
) Our goal is to approximate the given piecewise linear function f using ReLU networks. We first index the polytopes separated by D + 1 hyperplanes
1
+1. It is clear to see that
. In addition, we use
to denote exclusion of certain component. For instance,
. It can be easily verified that
Please note that +1 hyperplanes create in total 2
1 polytopes in the [
Now we recursively define an essential building block, a D-dimensional fan-shaped ReLU network
where the set of linear functions are provided by D linearly inde- pendent vectors
is a large positive number (
with the power to j). Note that the network
is of width D and depth D. This network enjoys the following key characteristics: 1) As
, the hyperplane
approximate to the hyperplane
dominates. Thus, the support of
) converges to
-dimensional fan-shaped function. 2) Let C be the maximum area of hyperplanes in [
. Because the real boundary
the imprecise domain caused by
is the approximate distance between the real and ideal boundaries. In total, the measure of the inaccurate region in building
) is at most
1). 3) The function over D-dimensional fan-shaped domain is
2 over the D-dimensional fan-shaped domain.
Constructing : Discontinuity of f in Eq.(69) is one of the major challenges of representing it using a ReLU network. To tackle this issue, we start from a linear function ˜
, which can be represented by two neurons
key idea is to eliminate f over all 2
2 polytopes outside S using the D-dimensional fan-shaped functions.
Let us use to show how to cancel the function ˜f over the polytopes outside S. According to (71),
where -dimensional fan-shaped domain. Without loss of generality, a number D + 1 of D-dimensional fan-shaped functions over
are needed as the group of linear independent bases to cancel ˜f , where the
fan-shaped function is
where we let is to make sure that the fan-shaped region
2, ..., D + 1 represents a small shift for
The constructed function over
which is approximately over
Let us find
and then the new function ) satisfies that
Similarly, we can construct other functions
cancel ˜f over other polytopes. Finally, these D-dimenional fan-shaped functions are aggregated to form the following wide ReLU network
where the width and depth of the network are 1) + 2 and D respectively. In addition, because there are 2
1 polytopes being cancelled, the total area of the regions suffering from errors is no more than
Therefore, for any 0, as long as we choose appropriate
that fulfill
the constructed network ) will have
Constructing : Allowing more layers in a network provides an alternate way to represent f. The fan-shaped functions remain to be used. The whole pipeline can be divided into two steps: (1) build a function over S; and (2) represent f over S by slightly moving one boundary of S to create linear independent bases.
(1) We construct a number D of D-dimensional fan-shaped functions. Without loss of generality, the fan-shaped function is constructed as
whose fan-shaped region is approximately (, which almost overlaps with
becomes sufficiently small. The output of
. To obtain the last boundary
one more layer with only one neuron as follows:
Thus, ) will approximately represent the linear function
and zero elsewhere. The depth and width of the network are
respectively. Similarly, due to the employment of a number D of D-dimensional fan-shaped functions and the effect of
, the area of the region with errors is estimated as
(2) To acquire an arbitrary linear function, similarly we need D + 1 linear independent functions as linear independent bases. Other than the one obtained in step (1), we further modify a little bit D times to get
. Repeating the same procedure described in step (1), for
, we can construct the network
approximately over
is small to render these domains almost identical to
satisfies that
producing f on S. The depth and width of + 1), respectively. Similarly, the area of the region with errors is bounded above by
Therefore, for any 0, if we choose
appropriately such that
then the constructed network ) will satisfy
) The network h is piecewise linear and splits the space into polytopes. It is feasible to employ a number of simplices to fill the polytopes defined by h (Ehrenborg, 2007). Given that M is the minimum number of required simplices, we have
where
and -th simplex. The core of the wide construction is to horizontally aggregate network modules representing
) into a wide network. In contrast, the core of the deep construction is to sequentially express
) in a network module without linking these modules to the input.
Representing h with a wide ReLU network: Lemma 17 suggests that a wide network module can generically represent a function over a template simplex. To represent
, we need to use Eq. (68) to transform the function from the barycentric coordinate system to the Euclidean coordinate system. Let D + 1 vertices of
satisfying
By aggregating the network ) horizontally, we have the following wide network:
Therefore, the constructed wide network ) is of width
depth D. It is clear that the width
) of the wide network
dominates, as the number of needed simplices goes larger and larger.
Representing h with a deep ReLU network: For a deep construction, the fundamental difficulty is how to sequentially express each , the input of each block can only come from the earlier block instead of the input, like what we did in one-dimensional case (Figure 9(a)). Let us use mathematical deduction to derive how to sequentially represent each
. We still adopt the idea of modularized networks, but now each network module has two outputs. Lemma 17 suggests that a deep network module
can generically represent a function over a template simplex.
Step 1. Assume that the two outputs of the first block are . To represent
, similarly, we need to transform the function from the barycentric coordinate system to the Euclidean coordinate system. Let D+1 vertices of
and
which is one output of the first block. Next, we derive the other output Ω). Note that we are not allowed to use the input directly. Encouraged by the inversion idea in the univariate case, we invert the function
domain into the input domain to get a function that is approximately only supported over
To do so, recall that we have in constructing
, we will have a coefficient vector (
) such that
As shown in Figure 9(b), use the residual connection, we compute Ωwhich is zero over
for other regions. Ω
) will be used to feed the next block to construct
Step 2. Suppose that the two output functions of the m-th block are Ω
) is fed into the (m + 1)-th block as the input. Because
is outside the domain of
, with the same technique in Step 1, we construct
where are three vertices of
) to express
is functionally equivalent to x for two reasons. First, all simplices do not overlap. Ω
is x outside the domain of
; therefore zero value over
effect. Second, w.l.o.g., we assume that 0 is outside of all simplices or lies on the boundary of a certain simplex. As such, we have
, therefore,
not erroneously produce a non-zero constant over
. Furthermore, we also obtain a function ˜
inside the (m + 1)-th block. Apply the residual operation, we have
Lastly, simlilar to the one-dimensional deep network, we use the shortcut connection to aggregate )) to obtain the following deep network:
Therefore, the constructed deep network ) is of depth O((D + 1)M) and width (D + 1)
. Please note that in the above equation, the summation is made by shortcuts. It is clear that the depth of
) dominates over the width.
Experiment: A 2-6-2-1 network was trained on the moon dataset. When the training process was finished, the accuracy of the network was 0.94. Then, we transformed the original network into a wide network (described in the proof of Theorem 16. The patterns of outputs of the constructed wide
Figure 11: With the procedure in the proof for Theorem 16, an exemplary 2-6-2-1 network is transformed into a wide network (respectively.
network and deep network are respectively shown in Figure 11. It is seen that the constructed networks can approximate the original network well, and some expected artifacts can be squeezed by further increasing or using a post-processing network that can identify streaks and remove them.
III. Width-Depth equivalence by the De Morgan Law
We have the following formal statement for the quasi-equivalence in light of the De Morgan law. The proof of this theorem is constructive.
Proposition 5 (Proposition C.1 in (Lin and Jegelka, 2018)) Given a piecewise constant function , for any small enough
, there exists a ResNet
that
The sketch of proof. The detailed proof can be referred to Lin and Jegelka (2018). We only show the one-dimensional case here for a brief introduction of the idea. The operations that are realizable by a residual network block are
1. shifting: . This operation is to shift a function with a constant.
2. min or max: for any
. This operation allows us to threshold a function.
3. min or max with a linear transformation: or
. This operation can adjust the slope of trapezoid functions.
Figure 12: To approximate any univariate function, (a) we construct an increasing trape- zoidal function; (b) we adjust the height of each trapezoid function to its corre- sponding height.
As Figure 12 shows, we first construct an increasing trapezoidal function with a very high plateau over each trapezoid. Then, we adjust the height of each trapezoid function to its corresponding height. The procedures of constructing an increasing function are shown in Figure 13, which are based on mathematical induction. Suppose
1.
2. is a trapezoid function over each [
3. over an interval [
4. 0
5.
Figure 13: An illustrative plot to show how to derive
The procedure of adjusting the height of each trapezoid function is shown in Figure 12(b). An important consideration is that we need to keep the function on previous subdivisions unchanged while twisting the current trapezoid function. This is realized by the fact that is large. Suppose that
is the target value over the interval [
adjust the values of intervals sequentially by
Theorem 18 (Quasi-equivalence in light of the De Morgan law) Given a disjoint rule system , where each rule
is characterized by an indicator function
over a hypercube Γ
fulfilling the De Morgan law
we can construct a wide ReLU network to represent the right hand side of the De Morgan law and a deep ReLU network
to represent the left hand side of the De Morgan law, such that for any
where m is a measurement in [0
Figure 14: A wide network to realize the negation of the logic intersection of the negation of propositional rules in a high-dimensional space. The negation of each propositional rule is associated with a trap-like indicator function.
Proof The defined rule system is a piecewise constant function over a hypercube. The wide and deep networks are constructed as follows:
Wide network ): As illustrated in Figure 14, to represent the rule
that is equivalent to
), a trap-like function is constructed to represent
that is equivalent to 1
). In our construction, for each hypercube the measures of
is no more than
) is the volume of a hypercube Γ
fore, the total measure of errors for all the hypercubes is less than
Deep network is a piecewise constant function over a hypercube, based on Proposition 5, we have
Combining Eqs. (108) and (109) leads to , which concludes our proof.
IV. Quasi-equivalence of Quadratic Networks.
In this section, we first prove the correctness of the continued fraction representation of a polynomial. Then, we report the experimental results on the quasi-equivalency of quadratic networks.
For a general univariate polynomial, there is a continued fraction representation as follows:
where right side of (111), the first part contains the terms with even powers of x, while the second part is for the odd powers of x. Because both even and odd parts are in the essentially same format, we only show the correctness of the even part:
Specially, we can derive the continued fraction as follows:
The following example helps illustrate the correctness of (111):
If , we can derive an approximate continued fraction representation by complementing
with a term
is a small constant. For a given approximation requirement, we can always choose a sufficient small, nonzero
accommodate the accuracy requirement since x is bounded.
This subsection reports the experimental results on quasi-equivalence of quadratic networks in two steps. In the first step, we show the feasibility of the reciprocal activation function in the network according to the continued fraction representation. In the second step, we compare the accuracy and robustness between wide and deep networks.
Experimental Design: We first preprocessed the MNIST dataset using image deskewing and dimension deduction techniques. Image deskewing (https://fsix.github.io/ mnist/Deskewing.html) straightens the digits that are written in a crooked manner. Mathematically, skewing is modeled as an affine transformation: which the center of mass of the image is computed to estimate how much offset is needed, and the covariance matrix is estimated to approximate by how much an image is skewed. Furthermore, the center and covariance matrix are employed for the inverse affine transformation, which is referred to as deskewing. Then, we used t-SNE (Van der Maaten and Hin- ton, 2008) to reduce the dimension of the MNIST from 28
28 to 2, as the two-dimensional embedding space. Figure 15 presents the effect of deskewing on t-SNE.
Figure 15: The effect of the deskewing operation on the t-SNE result of the MNIST data. Deskewing significantly improves the embedding quality.
The embedding data can be naturally divided into the training and testing sets based on the partition in the original space. To illustrate the aforementioned quasi-equivalence between quadratic networks, We built three representative quadratic deep learning models, using two wide sub-networks (by the factorization representation), four deep sub-networks with the reciprocal activation (by the continued fraction), and four deep sub-networks using the ReLU activation (representing the first-order approximation to the continued fraction representation). Each wide sub-network takes one element of the input, and two paired deep sub-networks take the same element of the input. Each wide sub-network has four layers with 8, 4, 2, 1 neurons sequentially, while each deep sub-network also has four layers with one neuron in each layer. These sub-networks are followed by a fully connected ReLU layer of the same structure. The fully connected network has three layers with 300, 200, and 10 neurons respectively.
We performed the batch training with 1, 000 instances per batch in each iteration. All the parameters were initialized with truncated Gaussian distribution of a mean 0 and a variance 0.1. The learning rate was set to 0.002, and the whole network was optimized with Adam (Kingma and Ba, 2015). The number of epochs was 200.
Feasibility of the Reciprocal Activation: At the first glance, the training of a deep model using the reciprocal activation function is subject to instability. For example, when the value of the input is large, gets closer to 1, which will lead to zero in the denominator in the next fraction, and undermine the training convergence. However, we argue that when data are appropriately normalized, the training process can be made stable. We did the following experiment to study how the normalization can help. Specifically, the input was normalized by the formula:
is a scaling factor. We evaluated four scaling factors {1, 4, 8, 16} by repeating 20 times the training process for each factor. The criterion is whether the training process converges or not. In this experiment, as long as ’nan’ does not appear, the training will converge because we used a sufficiently large number of epochs. Table 4 shows the success rate of training for different
scales. It can be seen that without scaling the training process always fails, while the success rate reaches 80% when
an appropriate scaling operation.
Table 4: The success rates of the training process with reciprocal activation after normal- ization.
Accuracy and Robustness of the Three Models: We repeated the training process 10 times and computed the accuracy of the three models on the test dataset. As shown in Table 5, all the three models achieved the state-of-the-art results. The performance of the deep model using ReLU was only slightly lower than that of the other two models.
Table 5: Performance of the three models on the test dataset after 10 repetitions.
Furthermore, we used the following four popular adversarial attack methods to evaluate the robustness of the deep learning models: (1) fast gradient method (FGM); (2) fast sign gradient method (FSGM (Goodfellow et al., 2014b)); (3) iterative fast sign gradient method (I-FSGM (Kurakin et al., 2016)); and (4) DeepFool (Moosavi-Dezfooli et al., 2016). Let denote the model parameters, x be the input, y be the target pertaining to
the loss, FGM generates an adversary as
where is an amplitude factor. FGM is plausible to find the attack along the direction of the gradient. FSGM computes an adversary based on
where is also a factor, and
) outputs 1 for a positive argument and
1 otherwise. I-FSGM iteratively derives an adversary with the FSGM formula:
where ) is a thresholding function such that the maximum value of x is no more
than a preset limit . In the experiment, we set this limit
to the maximum value of the
Table 6 shows the performance of the three models on the adversarial samples generated based on the test dataset using the aforementioned attacking methods. We bold-faced the best scores under attack. It can be seen that the wide model shows the highest accuracy under most attacks. The only exceptions are the cases of FSGM(where the wide model is the second best.
Table 6: Accuracy of the three models under attacks, where is the amplitude factor and N is the number of iterations.
Next, we probed the robustness of each model by examining the needed strength of the adversarial attack such that the network performance will drop by a pre-specified percentage. Using the absolute performance drop as a reference is better than using the performance compromised by adversarial attacks, as it eliminates the bias induced by the original network performance. Suppose that the original performance (without attack) is O, with each attacking method, we reduced 5% by gradually increasing the strength of attack. The reason to select a 5% drop is that if the drop is too high the comparison is not sensitive, while if the drop is too low the attack is too weak. Specifically, we gradually increased the strength by a fixed step for each attacking method until the performance drop is over 5%. For FGM, FSGM, I-FSGM and DeepFool, the steps are 50, 1, 1 and 1 respectively. Table 7 shows the needed strength values. The higher the needed strength, the more robust the model is. We bold-faced the best scores. Overall, the wide model shows the highest robustness. The only exception is the case of I-FSGM, where the wide model is the second best. These data suggest that the wide model seems more robust than the deep model. Further studies are needed to understand the underlying mechanisms and derive practical guidelines.
Table 7: Needed strength values for each of the four attacking methods to reduce the unattacked network performance by 5%
V. Bounds for the Partially Separable Representation
The purpose of this section is to show that the partially separable function can be realized a quadratic network whose structure is bounded. Let us first introduce necessary preliminaries.
Partially Separable Representation: Approximating a multivariate function by a set of functions of fewer variables is a basic problem in approximation theory (Light and Cheney, 2006). Despite that some function f is directly separable in the form of
a more general formulation to express a multivariate function f is
If L is permitted to be sufficiently large, then such a model is rather universal. For bivariate functions, an inspiring theorem has been proved (Light and Cheney, 2006): Let and
are orthornomal bases of
) respectively, then
is an orthornomal basis of
To approximate a general multivariate function, we relax the restrictive equality to the approximation in the sense and assume that, for every continuous n-variable function
, given any positive number
, there exists a group of
, satisfying:
The key to demonstrate boundedness of the partially separable representation with a quadratic network is to use the Taylor’s expansion, and estimate the remainder, leading to a realization of the partially separable representation. Based on such a realization, we obtain an upper bound of the needed width and depth for the partially separable representation. Let , for the function class
, the following theorem holds.
, there exists a quadratic network Q(x) with the width no more than
and the depth no more than log
to represent f in the partially separable representation, satisfying that
, then based on classical multivariate calculus (Widder, 1989),
where ) is the remainder term. For any
It is observed from Eqs. 122 and 123 that, given , the Taylor expansion forms a partially separable representation
products, and the error is no more than
. Along this line, we apply the quadratic network to express
network using the ReLU function with the depth of log) and the width of N. To approximate a general term
, the depth of the quadratic network is no more than log
, and the width is no more than
In addition, the depth of a network that is used to express the product of n terms is log
Therefore, the depth is no more than log
). In contrast, the width is no more than
, considering that there are
VI. Properties of the Partially Separable Representation
Let us compare the properties of the partially separable representation with those of the piecewise polynomial representation and the Kolmogorov–Arnold representation in terms of universality, finiteness, decomposability and smoothness. As a result, the partially separable representation is well justified.
1. Universality: The representation should have a sufficient ability to express the functions. A piecewise polynomial representation is capable of representing any continuous function in a local way. The Kolmogorov–Arnold representation theorem states that every multivariate continuous function can be represented as a combination of continuous univariate functions, which solves a generalized Hilbert thirteen problem. Therefore, the Kolmogorov–Arnold representation also possesses universality. As far as the partially separable representation is concerned, its universality in the distance is analyzed in the above Taylor expansion analysis. In summary, these three representation classes are all powerful enough to express continuous functions.
2. Finiteness: The multivariate Taylor expansion can approximate a general continuous function with a finite number of terms. Although the proof in the manuscript devises a specific strategy of obtaining a partially separable representation, the partially separable representation actually allows a variety of constructions in addition to the Taylor expansion. There is no fundamental reason that the partially separable representation will necessarily have exponentially many terms in a majority of practical tasks after reasonable assumptions are incorporated.
3. Decomposability: Decomposability is to decode the functionality of the model in terms of its building units, i.e., cells, layers, blocks, and so on. A plethora of engineering examples, such as software development and optical system design, have shown that modularized analysis is effective. Particularly, modularizing a neural network is advantageous for the model interpretability. For example, Chen et al. (2016) developed InfoGAN to enhance decomposability of features learned by GAN (Goodfellow et al., 2014a), where InfoGAN maximizes the mutual information between the latent code and the observations, encouraging the use of noise to encode the semantic concept. Since the partially separable representation is more decomposable than its counterparts, the network constructed in reference to the partially separable representation is easily interpretable. For instance, such a network has simpler partial derivatives that can simplify interpretability analysis (Lipton, 2018).
4. Smoothness: A useful representation should be as smooth as possible such that the approximation can suppress high-frequency oscillations and be robust to noise. In the last eighties, Girosi and Poggio claimed that the use of the Kolmogorov-Arnold representation is doomed because the inner functions and the outer functions are highly non-smooth (Girosi and Poggio, 1989). In the piecewise polynomial representation, the situation gets better since at least over each interval the representation is smooth. As far as the partially separable representation is concerned, because the expression is greatly relaxed (from the exact computation to approximate representation), it is feasible to make each element smooth as L goes large, and extract meaningful structures underlying big data.
Table 8: Comparison of the three functional representations, where ”” means ”yes”, ”
means ”no”, and ”–” means ”mediocre”.
In summary, we compare the three representation schemes in terms of universality, finiteness, decomposability, and smoothness in Table 8. Clearly, the partially separable representation is suitable for machine learning tasks.
VII. Width-Depth Correlation Through Partially Separable Presentation
Now, let us analyze the complexity of the quadratic network in light of the aforementioned partially separate representation scheme as shown in Figure 16. Suppose that the polynomial is of degree
, then the representation of each
can be done with a network of width
and depth max
. Next, the multiplication demands an additional network of width Ln and depth log
). Therefore, the overall quadratic network architecture will be of width max
and depth max
log
Because the depth scales with a log function, which changes slowly when the dimensionality of the input is large. For simplicity, we take an approximation for depth max
where
is a positive constant. Let
, which describes the overall com- plexity of the function to be expressed, then the formulas to compute the width and depth are simplified as follows:
Figure 16: Use of a quadratic network to represent a partially separable representation. Suppose that the polynomial is of degree
, the width and depth of the quadratic network to approximate
) respectively.
One interesting point from (124) is that the lower bounds for depth and width to realize a partially separable representation are also suggested. As shown in Figure 17, we plot the width and depth as changes. There are two highlights in Figure 17. The first is that the width is generally larger than the depth, which is different from the superficial impression on deep learning. The second is that, as the
goes up, the width/depth ratio is accordingly increased.
Figure 17: Width and depth versus generality) assuming the partially separable representation.
Table 9: Descriptions of different building blocks. Modules Degree Operation Width Depth
Remark: Through the complexity analysis, we realize that the width and depth of a network depend on the structure or complexity of the function to be approximated. In other words, they are controlled by the nature of a specific task. As the task becomes complicated, the width and depth must increase accordingly, and the combination of the width and depth is not unique.
VIII. Effects of Width on Optimization and Generalization
We first illustrate the importance of width on optimization in the context of over-paramterization, kernel ridge regression, and NTK, and then report our findings that the existing generalization bounds and VC dimension results somehow suggest the width and depth equivalence for a given complexity of networks.
1. Optimization: The optimization mechanism is the key to understand the training process of neural networks (Ge et al., 2015). From the view of optimization, the fact that randomly-initialized first-order methods can find well-generalizable minima on the training data is quite intriguing. Given the theme of this paper, we put our endeavor into the importance of width on the optimization of neural networks in hope to provide insights for practitioners. We divide the relevant literature into the three categories:
(1) Increase width for over-parameterization: Brutzkus et al. showed that a wide two-layer network using the hinge loss can generalize well on linearly separable data with stochastic gradient descent (SGD) (Brutzkus et al., 2017). Li and Liang showed that when data is normalized and fulfills a separability condition, a two-layer over-parameterized network can learn these data in a polynomial time (Li and Liang, 2018). Allen-Zhu et al. (2019) showed that if the width of hidden layers is is the number of samples, D is the network depth, and
is an expected error, then the gradient descent search converges with the error
. This bound was further reduced to
for the network using non-linear smooth activation functions such as soft-plus (Du et al., 2018). These paper established that a pre-specified small training error can be achieved by gradient descent when the network is very wide. The secret therein is that the weight matrix is very close to the initialization due to NTK (Jacot et al., 2018).
(2) Kernel ridge regression: A neural network tends to give a Gaussian process when the width goes infinitely large. In this situation, only training the top layer of a network (weak training) will reduce to Kernel ridge regression. In a classical way, the weak training is to minimize the quadratic loss:
where ] is a collection of data
. Taking derivatives with respect to w and equating them to zero gives
Given a new example , the prediction is
We replace the data samples with the high-dimensional representations of the neural network: )]. By a similar derivation, we have
(3) Neural Tangent Kernel: In the above argument, we have justified the legitimacy of training the top layer by the large network width. How about training the entire network? Given the dataset , we also consider training the network with the quadratic loss for regression:
Consider the gradient descent in an infinitesimally small learning rate, then
Next, we can describe the dynamics of the model output ) given the input
as follows:
Let us consider for all samples at time
output, (130) can be written in a compact way:
where the pq-entry of H(t) is
In the width limit, the Gaussian process shows that H(t) becomes a constant . Therefore,
which characterizes the trajectory of the training in the functional space instead of the parameter space.
2. Generalization: Analysis of generalization bounds is a powerful tool to explain the excellent performance of neural networks. Traditional wisdom suggests that the increased model complexity will cause over-fitting to training data, which contradicts the fact that deep networks can easily fit random labels to the data and yet practically generalize well (Zhang et al., 2016). Recently, the generalization theory has gained increasingly more traction. In reference to Li et al. (2018), we have summarized the the state-of-the-art generalization bounds in Table 10 and provided their complexities. In Table 10, we boldface the bounds of interest which the width and depth dominate. As far as Neyshabur et al. (2015) is concerned, due to the exponential dependence on the depth, we will not focus on it. Instead, we argue that these bold-faced bounds somehow suggest the width and depth equivalence under a given complexity. Now, we use to denote the bounds of interest. Figure 18 shows the contours plots of
. Along a contour, the width and depth changes for the fixed bound complexity. In other words, the depth and width of a neural
Table 10: Representative bounds for chain-like neural networks, where to denote the upper bounds of the spectral norm
, Frobenius norm
of the rank-
layer. Generally,
-times larger than
means the ramp loss function is 1
Lipschitz function, Γ is the lower bound for the product of the spectal norm of all the layers and the m is the size of data.
Figure 18: Contours of , where numbers on the curves of each sub-figure represent log(
respectively. Along a contour, the width and depth changes to give the same bound.
network are mutually transformable without impacting the overall generalization ability theoretically.
Therefore, increasing either width or depth can boost the hypothesis space of a neural network. In other words, when it comes to promoting the expressive power of a network, increasing the width is essentially equivalent to increasing the depth in the sense of VC dimension, which also implies the equivalence of the width and depth of neural networks.
IX. Rethinking the Depth Separation with Intra-layer Links
Depth separation highlights the representation ability of a deep network, which has been intensively investigated over the past years. The idea is to show that a deep network can more efficiently approximate a complicated function than a shallow network. One notable depth separation result is from Arora et al. (2016), showing that (k + 1)-layer feedforward ReLU DNN of width w such that if it is also representable by a (
Suppose that the number of neurons, i.e., the size of a ReLU network is s, and the piecewise linear function of this ReLU network has p pieces, the core of the depth separation theorem established in Arora et al. (2016) is summarized as the size-piece relationship:
In the proposed network structure for the width-depth conversion, intra-layer links are employed for a network to flexibly represent arbitrary piecewise linear functions over polytopes using fan-shaped functions. Here, we find that adding intra-layer links can greatly increase the maximum number of pieces represented by a shallow network such that it can express as a complicated function as a deep network could. As shown in Table 11, let n be the number of intra-linked neurons, the size-piece inequality 1 trivially holds true. As such, it cannot be used to demonstrate depth separation any more. Instead, we need to rethink the depth separation and re-characterize the power of a network when intra-layer links are used. Our detailed proof is as follows.
Table 11: Network structures and their associated size-piece relationships. The number of neurons, i.e., the size of a ReLU network is s. k is the number of hidden layers. The piecewise linear function of this ReLU network has p pieces. n is the number of intra-linked neurons.
Note: In the first row, we re-calculate the inequality and obtain a slightly different inequality from the original. The previously-established inequality in (Arora et al., 2016) is
The result listed in the first row of Table 11 is obtained from the following two lemmas and proofs in (Arora et al., 2016).
be a function represented by a
ReLU DNN, as shown in Figure 19(a), with depth k + 1 and widths
hidden layers. Then f is a PWL function with at most 2
be a piecewise linear function with p pieces. If f is represented by a ReLU DNN, as shown in Figure 19(a), with
Figure 19: (a) the feedforward architecture; (b)-(c) the structures with intra-layer links.
depth k + 1, then it must have size at least . Conversely, any piecewise linear function f represented by a ReLU DNN of depth k + 1 and size at most s, can have at most
Now, we prove the result listed in the second row of Table 11 for ReLU DNN with intra-links.
be a function represented by a
ReLU DNN, as shown in Figure 19(b), with depth k + 1 and widths
layers. Then f is a PWL function with at most 3
Proof As shown in Figure 19(b), let us connect every two neurons with an intra-layer link. Suppose that one neuron is ) which can create 1 breakpoint at
, while the other neuron is
)) creating at most 3 breakpoints
). Thus, we can get at most
distinct breakpoints, i.e.,
Let us estimate the number of pieces via induction. Assume that for some ReLU DNN, as shown in Figure 19(b), with depth k+1 and widths
k hidden layers produces at most
pieces. We add one more layer of
neurons to this network such that its pre-activation is actually the output of a
ReLU DNN with depth k+1 and widths
. Suppose that the preactivation of the i-th neuron is
]. By the induction hypothesis,
is a piecewise linear function with at most
pieces. Without loss of generality, let the i-th and (i + 1)-th neurons be connected by an intra-layer link. Their outputs are
whose total number of pieces equals to that of three functions
which is at most 6
. Because we have
neurons in the last hidden layer, we can get at most 6
pieces, which concludes the induction step.
be a piecewise linear function with p pieces. If f is represented by a ReLU DNN, as shown in Figure 19(b), with depth k + 1, then it must have size at least
. Conversely, any piecewise linear function f represented by a ReLU DNN of depth k + 1 and size at most s, can have at most
pieces.
Proof Let widths of the k hidden layers be , we must have
By the AM-GM inequality, we minimize the size subject to (134),
where the inequality is achieved at . This leads to the first statement. The second statement follows by reversing the above equation.
R be a function represented by a
ReLU DNN, as shown in Figure 19(c), with depth
hidden layers and n intra-linked neurons in each layer. Then f is a PWL function with at most 2
Proof For the first layer, the n intra-linked neuron can create 21 breaking points. Let us derive this fact via induction. Suppose that
is the piecewise linear function of
1 intra-linked neurons, then
), generating at most 2
1 breaking points (same as
) has one breaking point. Without repetition, the number of breaking points of
is the summation of these newly-generated breaking points and breaking points of
, which is 2(2
1. Because the first layer has
neurons, at most
intra-linked neurons are available. Thus, we can get at most
+ 1 pieces.
Computing the number of pieces for the hidden layers shares the same spirit with the first layer, combining the proof of Lemma 22, we conclude that a as shown in Figure 19(c), with depth k + 1, widths
hidden layers, and n intra-linked neurons in each layer, has at most 2
pieces.
Lemma 25 (The 2nd New Result Extended to n Intra-Linked Neurons) Let f : be a piecewise linear function with p pieces. If f is represented by a ReLU DNN, as shown in Figure 19(c), with depth k + 1 and n neurons intra-linked in each hidden layer, then it must have size at least
. Conversely, any piecewise linear function f represented by a ReLU DNN of depth k + 1 and size at most s, can have at most
Proof Please refer to our proof for Lemma 23.
A. Adadi and M. Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (xai). IEEE Access, 6:52138–60, 2018.
Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overpa- rameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pages 6155–6166, 2019.
Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.
Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8139–8148, 2019.
Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE transactions on neural networks and learning systems, 25(8):1553–1565, 2014.
Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over- parameterized networks that provably generalize on linearly separable data, 2017.
Jie Bu and Anuj Karpatne. Quadratic residual networks: A new class of neural networks for solving forward and inverse problems in physics involving pdes. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pages 675–683. SIAM, 2021.
Yuan Cao and Quanquan Gu. A generalization theory of gradient descent for learning over-parameterized deep relu networks, 2019.
CL Philip Chen and Zhulin Liu. Broad learning system: An effective and efficient incre- mental learning system without the need for deep architecture. IEEE transactions on neural networks and learning systems, 29(1):10–24, 2017.
Hu Chen, Yi Zhang, Mannudeep K Kalra, Feng Lin, Yang Chen, Peixi Liao, Jiliu Zhou, and Ge Wang. Low-dose ct with a residual encoder-decoder convolutional neural network. IEEE transactions on medical imaging, 36(12):2524–2535, 2017.
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pages 7–10, 2016.
Lingyang Chu, Xia Hu, Juhua Hu, Lanjun Wang, and Jian Pei. Exact and consistent
Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on learning theory, pages 698–728. PMLR, 2016.
George E Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on audio, speech, and language processing, 20(1):30–42, 2011.
Louis De Branges. The stone-weierstrass theorem. Proceedings of the American Mathematical Society, 10(5):822–824, 1959.
Zhijie Deng, Yinpeng Dong, Shifeng Zhang, and Jun Zhu. Understanding and exploring the network with stochastic architectures. Advances in Neural Information Processing Systems, 33, 2020.
Simon Du and Wei Hu. Width provably matters in optimization for deep linear neural networks. In International Conference on Machine Learning, pages 1655–1664. PMLR, 2019.
Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably
Richard Ehrenborg. The perles-shephard identity for non-convex polytopes. Technical report, Technical Report, University of Kentucky, 2007.
Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907–940. PMLR, 2016.
F. Fan and G. Wang. Fuzzy logic interpretation of quadratic networks. Neurocomputing, 374:10–21, 2020.
Fenglei Fan, Wenxiang Cong, and Ge Wang. A new type of neurons for machine learn- ing. International journal for numerical methods in biomedical engineering, 34(2):e2920, 2018a.
Fenglei Fan, Wenxiang Cong, and Ge Wang. Generalized backpropagation algorithm for training second-order neural networks. International journal for numerical methods in biomedical engineering, 34(5):e2956, 2018b.
Fenglei Fan, Dayang Wang, Hengtao Guo, Qikui Zhu, Pingkun Yan, Ge Wang, and Hengy- ong Yu. On a sparse shortcut topology of artificial neural networks, 2018c.
Fenglei Fan, Hongming Shan, Mannudeep K Kalra, Ramandeep Singh, Guhan Qian, Matthew Getzin, Yueyang Teng, Juergen Hahn, and Ge Wang. Quadratic autoencoder (q-ae) for low-dose ct denoising. IEEE transactions on medical imaging, 39(6):2035–2050, 2019.
Fenglei Fan, Jinjun Xiong, and Ge Wang. Universal approximation with quadratic deep networks. Neural Networks, 124:383–392, 2020.
Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
Federico Girosi and Tomaso Poggio. Representation properties of networks: Kolmogorov’s theorem is irrelevant. Neural Computation, 1(4):465–469, 1989.
Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks, 2017.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014a.
Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2014b.
Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. Relu deep neural networks and linear finite elements, 2018.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao. Why do deep residual networks generalize better than deep feedforward networks?–a neural tangent kernel perspective, 2020.
Arthur Jacot, Franck Gabriel, and Cl´ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
Duofa Ji, Jin Liu, Weiping Wen, Changhai Zhai, Wei Wang, and Evangelos I Katsanos. Prediction of cumulative absolute velocity based on refined second-order deep neural network. Journal of Earthquake Engineering, pages 1–20, 2021.
Kenji Kawaguchi, Jiaoyang Huang, and Leslie Pack Kaelbling. Effect of depth and width on local minima in deep learning. Neural computation, 31(7):1462–1498, 2019.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR
A. N. Kolmogorov. On the representation of continuous functions of several variables by superpositions of continuous functions of a smaller number of variables. Proceedings of the USSR Academy of Sciences, 108:179–182, 1956.
Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In International conference on machine learning, pages 1378–1387. PMLR, 2016.
Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. Adversarial examples in the physical world, 2016.
Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington,
Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, and Amnon Shashua. Limits to depth efficiencies of self-attention, 2020.
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets, 2016.
Xingguo Li, Junwei Lu, Zhaoran Wang, Jarvis Haupt, and Tuo Zhao. On tighter general- ization bound for deep neural networks: Cnns, resnets, and beyond, 2018.
Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochas- tic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.
William A Light and Elliot W Cheney. Approximation theory in tensor product spaces, volume 1169. Springer, 2006.
Hongzhou Lin and Stefanie Jegelka. Resnet with one-neuron hidden layers is a universal approximator. Advances in Neural Information Processing Systems, 31:6169–6178, 2018.
Zachary C Lipton. The mythos of model interpretability. Queue, 16(3):31–57, 2018.
Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: a view from the width. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6232–6240, 2017.
Pranav Mantini and Shishr K Shah. Cqnn: Convolutional quadratic neural networks. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 9819–9826. IEEE, 2021.
Alexander G de G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin
Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.
Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2574–2582, 2016.
Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A pac-bayesian approach
Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A
Yookoon Park, Sangho Lee, Gunhee Kim, and David Blei. Unsupervised representation learning via neural activation coding. In International Conference on Machine Learning, pages 8391–8400. PMLR, 2021.
Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. In International Conference on Learning Representations, 2018.
E. W. Saad and D. C. Wunsch II. Neural network explanation using inversion. Neural networks, 20(1):78–93, 2007.
Thiago Serra and Srikumar Ramalingam. Empirical bounds on linear regions of deep rectifier networks. In AAAI, pages 5628–5635, 2020.
Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and count- ing linear regions of deep neural networks. In International Conference on Machine Learning, pages 4558–4566. PMLR, 2018.
Rudy Setiono and Huan Liu. Understanding neural networks via rule extraction. In IJCAI, volume 1, pages 480–485. Citeseer, 1995.
Lech Szymanski and Brendan McCane. Deep networks are effective encoders of periodicity. IEEE transactions on neural networks and learning systems, 25(10):1816–1827, 2014.
Sebastian Thrun. Extracting rules from artificial neural networks with distributed repre- sentations. pages 505–512. MORGAN KAUFMANN PUBLISHERS, 1995.
Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
Ge Wang. A perspective on deep imaging. Ieee Access, 4:8914–8924, 2016.
Shuning Wang and Xusheng Sun. Generalization of hinging hyperplanes. IEEE Transactions on Information Theory, 51(12):4425–4431, 2005.
David Vernon Widder. Advanced calculus. Courier Corporation, 1989.
Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convo- lutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1284–1293, 2019.
Huan Xiong, Lei Huang, Mengyang Yu, Li Liu, Fan Zhu, and Ling Shao. On the number of linear regions of convolutional neural networks. In International Conference on Machine Learning, pages 10514–10523. PMLR, 2020.
Zirui Xu, Fuxun Yu, Jinjun Xiong, and Xiang Chen. Quadralib: A performant quadratic
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Under- standing deep learning requires rethinking generalization, 2016.
Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient and ac- curate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and pattern Recognition, pages 1984–1992, 2015.