Stable behaviour of infinitely wide deep neural networks

2020·Arxiv

Abstract

Abstract

We consider fully connected feed-forward deep neural networks (NNs) where weights and biases are independent and identically distributed as symmetric centered stable distributions. Then, we show that the infi-nite wide limit of the NN, under suitable scaling on the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions. The limiting process is referred to as the stable process, and it generalizes the class of Gaussian processes recently obtained as infinite wide limits of NNs (Matthews et al., 2018b). Parameters of the stable process can be computed via an explicit recursion over the layers of the network. Our result contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to expand recent lines of research that rely on Gaussian infinite wide limits.

1 Introduction

The connection between infinitely wide deep feed-forward neural networks (NNs), whose parameters at initialization are independent and identically distributed (iid) as scaled and centered Gaussian distributions, and Gaussian processes (GPs) is well known (Neal, 1995; Der and Lee, 2006; Lee et al., 2018; Matthews et al., 2018a,b; Yang, 2019). Recently, this intriguing connection has been exploited in many exciting research directions, including: i) Bayesian inference for GPs arising from infinitely wide networks

Proceedings of the 23International Conference on Artifi-cial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

(Lee et al., 2018; Garriga-Alonso et al., 2019); ii) kernel regression for infinitely wide networks which are trained with continuous-time gradient descent via the neural tangent kernel (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019); iii) analysis of the properties of infinitely wide networks as function of depth via the information propagation framework (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019). It has been shown a substantial gap between finite NNs and their corresponding infinite (wide) GPs counterparts in terms of empirical performance, at least on some of the standard benchmarks applications. Moreover, it has been shown to be a difficult task to avoid undesirable empirical properties arising in the context of very deep networks. Given that, there exists an increasing interest in expanding GPs arising in the limit of infinitely wide NNs as a way forward to close, or even reverse, this empirical performance gap and to avoid, or slow down, pathological behaviors in very deep NN.

Let ) denote the Gaussian distribution with mean and variance . Following the celebrated work of Neal (1995), we consider the shallow NN

where is a nonlinearity, i = 1) and is the input. It follows that

If is another input we obtain bivariate Gaussian distributions

where

Let denote the almost sure convergence. By the strong law of large numbers we know that, as , one has

from which one can conjecture that in the limit of infinite width the stochastic processes ) are distributed as iid (over i) centered GP with kernel ) = + (())]. Provided that the nonlinear function is chosen so that (x)) has finite second moment, Matthews et al. (2018b) made rigorous this argument and extended it to deep NNs.

A key assumption underlying the interplay between infinite wide NNs and GPs is the finiteness of the variance of the parameters’ distribution at initialization. In this paper we remove the assumption of finite variance by considering iid initializations based on stable distributions, which includes Gaussian initializations as a special case. We study the infinite wide limit of fully connected feed-forward NN in the following general setting: i) the NN is deep, namely the NN is composed of multiple layers; ii) biases and scaled weights are iid according to centered symmetric stable distributions; iii) the width of network’s layers goes to infinity jointly on the layers, and not sequentially on each layer; iv) the convergence in distribution is established jointly for multiple inputs, namely the convergence concerns the class of finite dimensional distributions of the NN viewed as a stochastic process in function space. See Neal (1995) and Der and Lee (2006) for early works on NNs under stable initialization.

Within our setting, we show that the infinite wide limit of the NN, under suitable scaling on the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions (Samorad- nitsky, 2017). This process is referred to as the stable process. Our result may be viewed as a generalization of the main result of Matthews et al. (2018b) to the context of stable distributions, as well as a generalization of results of Neal (1995) and Der and Lee (2006) to the context of deep NN. Our result contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to extend the research directions i) ii) and iii) that rely on Gaussian infinite wide limits. The class of stable distributions is known to be especially relevant. Indeed while the contribution of each Gaussian weight vanishes as the width grows unbounded, some of the stable weights retains significant size, thus allowing them to represent ”hidden features” (Neal, 1995).

The paper is structured as follows. Section 2 contains some preliminaries on stable distributions, whereas in Section 3 we define the class of feedforward NNs considered in this work. Section 4 contains our main result: as the width tends to infinity jointly on network’s layers, the finite dimensional distributions of the NN converges to a multivariate stable distribution whose parameters are compute via a recursion over the layers. The convergence of the NN to the stable process then follows by finite-dimensional projections. In Section 5 we detail how our result extends previously established large width convergence results and comment on related work, whereas in Section 6 we discuss how our result applys to the research lines highlighted in i) ii) and iii) which relies on GP limits. In Section 7 we comment on future research directions. The Supplementary Material (SM) contains all the proofs (SM A,B,C), a preliminary numerical experiment on the recursion evaluation (SM D), an empirical investigation of the distribution of trained NN models’ parameters (SM E). Code is available at https://github.com/stepelu/deep-stable.

2 Stable distributions

Let St() denote the symmetric centered stable distribution with stability parameter (0, 2] and scale parameter 0, and let be a random variable distributed as St(). That is, the characteristic function of St() is ) = ] = e. For any 0, a random variable with 0 2 has finite absolute moments ] for any 0, while ] = +. Note that when = 2, we have that ). The random variable has finite absolute moments of any order. For any we have the scaling identity St().

We recall the definition of symmetric and centered multivariate stable distribution and its marginal distributions. First, let be the unit sphere in . Let StΓ) denote the symmetric and centered k-dimensional stable distribution with stability (0, 2] and scale (finite) spectral measure Γ on , and let be a random vector of dimension 1 distributed as StΓ). The characteristic function of StΓ) is

where

The distribution StΓ) with characteristic function (1) allows for marginals which are not centered nor symmetric. However in the present work all the marginals will be centered and symmetric, and the spectral measure will often be a discrete measure, i.e., Γ() = )for and . In particular, under these specific assumptions, we have

See Samoradnitsky (2017) for a detailed account on St(Γ).

3 Deep stable networks

We consider fully connected feed-forward NNs composed of 1 layers where each layer is of width 1. Let be the weights of the l-th layer, and assume that they are independent and identically distributed as St(), a stable distribution with stability parameter (0, 2] and scale parameter 0.

for any 1, 1 and 1. Let denote the biases of the l-th hidden layer, and assume that they are independent and identically distributed as St(). That is, the characteristic function of the random variable St() is

for any 1 and 1. The random weights are independent of the biases , for any 1, 1 and 1. That is,

Let be a nonlinearity with a finite number of discontinuities and such that it satisfies the envelope condition

for every , and for any parameter a, b > 0, and . If is the input argument of the NN, then the NN is explicitly defined by means of

and

for l = 2, . . . , D and i = 1, . . . , n in (7) and (8). The scaling of the weights in (8) will be shown to be the correct one to obtain non-degenerate limits as +.

4 Inﬁnitely wide limits

We show that, as the width of the NN tends to in-finity jointly on network’s layers, the finite dimensional distributions of the network converge to a multivariate stable distribution whose parameters are compute via a suitable recursion over the network layers. Then, by combining this limiting result with standard arguments on finite-dimensional projections we obtain the large n limit of the stochastic process (((where are the inputs to the NN. In particular, let denote the weak convergence. Then, we show that as ,

(f (x, n), . . . , f (x, n))

where is the product measure. From now on k is the number of inputs, which is equal to the dimensionality of the finite dimensional distributions of interest for the stochastic processes .Thorough the rest of the paper we assume that the assumptions introduced in Section 3 hold true. Hereafter we present a sketch of the proofs of our main result for a fixed index i and input x, and we defer to the SM for the complete proofs of our main results.

We start with a technical remark: in (7)-(8) the stochastic processes (x, n) are only defined for i = 1, . . . , n, while the limiting measure in (9) is the product measure on 1. This fact does not determine problems, as for each there is a n large enough such that for each the processes (x, n) are defined. In any case, the simplest solution consists in extending (x, n) from i = 1, . . . , n to 1 in (7)-(8), and we will make this assumption in all the proofs.

4.1 Large width asymptotics: k = 1

We characterize the limiting distribution of (x, n) as for a fixed i and input x. We show that, as ,

where the parameter ) is computed through the recursion:

and q(l) = St()) for each 1. The generalization of this result to 1 inputs is given in Section 4.2.

Proof of (10). The proof exploits exchangeability of the sequence ((, an induction argument on the layer’s index l for the directing (random measure) of ((, and some technical lemmas that are proved in SM. Recall that the input is a real-valued vector of dimension I. By means of (4) and (5), for 1:

and for l = 2, . . . , D

i.e.,

It comes from (8) that, for every fixed l and for every fixed n the sequence ((is exchange- able. In particular, let denote the directing (random) probability measure of the exchangeable sequence ((. That is, by de Finetti repre- sentation theorem, conditionally to the (n, x)’s are iid as . Now, consider the induction hypoth- esis that as , with be St(1)), and the parameter 1) will be specified. Therefore,

where the first equality comes from plugging in the definition of (x, n), rewriting E[exp()] as exp()] = [exp()] due to indepen- dence, computing the characteristic function for each term, and re-arranging. therein, since (is exchangeable there exists (de Finetti theorem) a random probability measure such that conditionally to the ) are iid as which explains (11).

Now, let denote the convergence in probability. The following technical lemmas (Appendix A):

L1) for each 2 Pr[] = 1, with ;

L2)

), as ;

L3)

0, as .

together with Lagrange theorem, are the main ingredients for proving (10) by studying the large n asymptotic behavior of (11). By combining (11) with lemma L1,

By means of Lagrange theorem, there exists [0, 1] such that

Now, since

+ n

Finally, recall the fundamental limit e= lim(1 + . This, combined with L2 and L3 leads to

as . That is, we proved that the large n limiting distribution of (x, n) is St()), where we set

4.2 Large width asymptotics: k ≥ 1

We establish the convergence in distribution of ((()) as for a fixed i and k inputs . This result, combined with standard arguments on finite-dimensional projections, then establishes the convergence of the NN to the stable process. Precisely, we show that, as , one has

where the spectral measure Γ(l) is computed through the recursion:

and q(l) = StΓ(l)) for each 1, where = [] . Here (and in all the expressions involving the function ) we make use of the notational assumption that if = 0 in , then = 0. This assumption allows us to avoid making the notation more cumbersome than necessary to explicitly exclude the case of ) = 0, for which is undefined. We omit the sketch of the proof of (12), as it is a step-by-step parallel of the proof of (10) with the added complexities due to the multivariate stable distributions. The reader can refer to the SM for the full proof.

4.3 Finite-dimensional projections

In Section 4.2 we obtained the convergence in law of (() for k inputs and a generic i to a multivariate Stable distribution. Let refer to this random vector as ). Now, we derive the limiting behavior in law of ) jointly over all i = 1, . . . (again for a given k-dimensional input). It is enough to study the convergence of ) for a generic 1. That is, it is enough to establish the convergence of the finite dimensional distributions (over i: we consider here ) as random sequence over i). See Billingsley (1999) for details.

To establish the convergence of the finite dimensional distributions (over i) it then suffices to establish the convergence of linear combinations. More precisely, let X = [. We show that, as ,

by proving the large n asymptotic behavior of any fi-nite linear combination of the (X, n)’s, for N. Following the notation of Matthews et al. (2018b), let

Then, we write

where

Then,

That is,

where

Then, along lines similar to the proof of the large n asymptotics for the i-th coordinate, we have the following

as . This complete the proof of the limiting behaviour (9).

5 Related work

For the classical case of Gaussian weights and biases, and more in general for finite-variance iid distributions, the seminal work is that of Neal (1995). Here the author establishes, among other notable contributions, the connection between infinitely wide shallow (1 hidden layer) NNs and centered GPs. We reviewed the essence of it in Section 1.

This result is extended in Lee et al. (2018) to deep NNs where the width n(l) of each layer l goes to in-finity sequentially, starting from lowest layer, i.e. n(1) to n(D). The sequential nature of the limits reduces the task to a sequential application of the approach of Neal (1995). The computation of the GP kernel for each layer l involves a recursion, and the authors propose a numerical method to approximate the integral involved in each step of the recursion. The case where each n(l) goes to infinity jointly, i.e. n(l) = n, is considered in Matthews et al. (2018a) under more restrictive hypothesis, which are relaxed in Matthews et al. (2018b). While this setting is most representative of a sequence of increasingly wide networks, the theoretical analysis is considerably more complicated as it does not reduce to a sequential application of the classical multivariate central limit theorem.

Going beyond finite-variance weight and bias distributions, Neal (1995) also introduced preliminary results for infinitely wide shallow NNs when weights and biases follow centered symmetric stable distributions. These results are refined in Der and Lee (2006) which establishes the convergence to a stable process, again in the setting of shallow NNs.

The present paper can be considered a generalization of the work of Matthews et al. (2018b) to the context of weights and biases distributed according to centered and symmetric stable distributions. Our proof follows different arguments from the proof of Matthews et al. (2018b), and in particular it does not rely on the central limit theorem for exchangeable sequences (Blum et al., 1958). Hence, since the Gaussian distribution is a special case of the stable distribution, our proof provides an alternative and self-contained proof to the result of Matthews et al. (2018b). It should be noted that our envelope condition (6) is more restrictive than the linear envelope condition of Matthews et al. (2018b). For the classical Gaussian setting the conditions on the activation function have been weakened in the work of Yang (2019).

Finally, there has been recent interest in using heavy-tailed distributions for gradient noise (Simsekli et al., 2019) and for trained parameter distributions (Mar- tin and Mahoney, 2019). In particular, Martin and Mahoney (2019) includes an empirical analysis of the parameters of pre-trained convolutional architectures (which we also investigate in SM E) supportive of heavy-tailed distributions. Results of this kind are compatible with the conjecture that stochastic processes arising from NNs whose parameters are heavy-tailed might be closer representations of their finite, high-performing, counterparts.

6 Future applications

6.1 Bayesian inference

Infinitely wide NNs with centered iid Gaussian initializations, and more in general finite variance centered iid initializations, gives rise to iid centered GPs at every layer l. Let us assume that weights and biases are distributed as in Section 1, and let us assume L layers (1 hidden layers). Each centered GPs is characterized by its covariance kernel function. Let us denote by such GPs for the layer 2 . Over two inputs x and the distribution of ()) is characterized by the variances )], )] and by the covari- ance )]. These quantities sat- isfy

where z and are independent standard Gaussian distributions N(0, 1),

with initial conditions + and = + .

To perform prediction via ], it is necessary to compute these recursions for all ordered pairs of data points in the training dataset D, and for all pairs with . Lee et al. (2018) proposes an efficient quadrature solution that keeps the computational requirements manageable for an arbitrary activation .

In our setting, the corresponding recursion is defined by (13)-(14), which is a more computationally challenging problem with respect to the Gaussian setting. A sketch of a potential approach is as follows. Over the training data points and test points, StΓ(1)) where k is equal to the size of training and test datasets combined. As Γ(1) is a discrete measure exact simulations algorithms are available with a computational cost of O(I) per sample (Nolan, 2008). We can thus generate M samples , j = 1, . . . , M, in O(IM), and use these to approximate StΓ(2)) with StΓ(2)) with