b

DiscoverSearch
About
My stuff
Stable behaviour of infinitely wide deep neural networks
2020·arXiv
Abstract
Abstract

We consider fully connected feed-forward deep neural networks (NNs) where weights and biases are independent and identically distributed as symmetric centered stable distributions. Then, we show that the infi-nite wide limit of the NN, under suitable scaling on the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions. The limiting process is referred to as the stable process, and it generalizes the class of Gaussian processes recently obtained as infinite wide limits of NNs (Matthews et al., 2018b). Parameters of the stable process can be computed via an explicit recursion over the layers of the network. Our result contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to expand recent lines of research that rely on Gaussian infinite wide limits.

The connection between infinitely wide deep feed-forward neural networks (NNs), whose parameters at initialization are independent and identically distributed (iid) as scaled and centered Gaussian distributions, and Gaussian processes (GPs) is well known (Neal, 1995; Der and Lee, 2006; Lee et al., 2018; Matthews et al., 2018a,b; Yang, 2019). Recently, this intriguing connection has been exploited in many exciting research directions, including: i) Bayesian inference for GPs arising from infinitely wide networks

Proceedings of the 23rdInternational Conference on Artifi-cial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

(Lee et al., 2018; Garriga-Alonso et al., 2019); ii) kernel regression for infinitely wide networks which are trained with continuous-time gradient descent via the neural tangent kernel (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019); iii) analysis of the properties of infinitely wide networks as function of depth via the information propagation framework (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019). It has been shown a substantial gap between finite NNs and their corresponding infinite (wide) GPs counterparts in terms of empirical performance, at least on some of the standard benchmarks applications. Moreover, it has been shown to be a difficult task to avoid undesirable empirical properties arising in the context of very deep networks. Given that, there exists an increasing interest in expanding GPs arising in the limit of infinitely wide NNs as a way forward to close, or even reverse, this empirical performance gap and to avoid, or slow down, pathological behaviors in very deep NN.

Let  N(µ, σ2) denote the Gaussian distribution with mean  µ ∈ Rand variance  σ2 ∈ R+. Following the celebrated work of Neal (1995), we consider the shallow NN

image

where  φis a nonlinearity, i = 1, . . . , n, w(1)i,j , w(2)i,j iid∼N(0, σ2w), b(1)i , b(2)i iid∼ N(0, σ2b) and  x ∈ RIis the input. It follows that

image

image

If  x′is another input we obtain bivariate Gaussian distributions

image

where

image

Let a.s.−→denote the almost sure convergence. By the strong law of large numbers we know that, as  n → +∞, one has

image

from which one can conjecture that in the limit of infinite width the stochastic processes  f (2)i (x) are distributed as iid (over i) centered GP with kernel K(x, x′) =  σ2b+  σ2wE[φ(f (1)1(x))φ(f (1)1(x′))]. Provided that the nonlinear function  φis chosen so that φ(f (1)1(x)) has finite second moment, Matthews et al. (2018b) made rigorous this argument and extended it to deep NNs.

A key assumption underlying the interplay between infinite wide NNs and GPs is the finiteness of the variance of the parameters’ distribution at initialization. In this paper we remove the assumption of finite variance by considering iid initializations based on stable distributions, which includes Gaussian initializations as a special case. We study the infinite wide limit of fully connected feed-forward NN in the following general setting: i) the NN is deep, namely the NN is composed of multiple layers; ii) biases and scaled weights are iid according to centered symmetric stable distributions; iii) the width of network’s layers goes to infinity jointly on the layers, and not sequentially on each layer; iv) the convergence in distribution is established jointly for multiple inputs, namely the convergence concerns the class of finite dimensional distributions of the NN viewed as a stochastic process in function space. See Neal (1995) and Der and Lee (2006) for early works on NNs under stable initialization.

Within our setting, we show that the infinite wide limit of the NN, under suitable scaling on the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions (Samorad- nitsky, 2017). This process is referred to as the stable process. Our result may be viewed as a generalization of the main result of Matthews et al. (2018b) to the context of stable distributions, as well as a generalization of results of Neal (1995) and Der and Lee (2006) to the context of deep NN. Our result contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to extend the research directions i) ii) and iii) that rely on Gaussian infinite wide limits. The class of stable distributions is known to be especially relevant. Indeed while the contribution of each Gaussian weight vanishes as the width grows unbounded, some of the stable weights retains significant size, thus allowing them to represent ”hidden features” (Neal, 1995).

The paper is structured as follows. Section 2 contains some preliminaries on stable distributions, whereas in Section 3 we define the class of feedforward NNs considered in this work. Section 4 contains our main result: as the width tends to infinity jointly on network’s layers, the finite dimensional distributions of the NN converges to a multivariate stable distribution whose parameters are compute via a recursion over the layers. The convergence of the NN to the stable process then follows by finite-dimensional projections. In Section 5 we detail how our result extends previously established large width convergence results and comment on related work, whereas in Section 6 we discuss how our result applys to the research lines highlighted in i) ii) and iii) which relies on GP limits. In Section 7 we comment on future research directions. The Supplementary Material (SM) contains all the proofs (SM A,B,C), a preliminary numerical experiment on the recursion evaluation (SM D), an empirical investigation of the distribution of trained NN models’ parameters (SM E). Code is available at https://github.com/stepelu/deep-stable.

Let St(α, σ) denote the symmetric centered stable distribution with stability parameter  α ∈(0, 2] and scale parameter  σ >0, and let  Sα,σbe a random variable distributed as St(α, σ). That is, the characteristic function of  Sα,σ ∼St(α, σ) is  ϕSα,σ(t) =  E[eitSα,σ] = e−σα|t|α. For any  σ >0, a  Sα,σrandom variable with 0  < α <2 has finite absolute moments  E[|Sα,σ|α−ε] for any  ε >0, while  E[|Sα,σ|α] = +∞. Note that when  α= 2, we have that  S2,σ ∼ N(0, 2σ2). The random variable  S2,σhas finite absolute moments of any order. For any  a ∈ Rwe have the scaling identity aSα,1 ∼St(α, |a|).

We recall the definition of symmetric and centered multivariate stable distribution and its marginal distributions. First, let  Sk−1be the unit sphere in  Rk. Let Stk(α,Γ) denote the symmetric and centered k-dimensional stable distribution with stability  α ∈(0, 2] and scale (finite) spectral measure Γ on  Sk−1, and let  Sα,Γbe a random vector of dimension  k ×1 distributed as Stk(α,Γ). The characteristic function of Sα,Γ ∼Stk(α,Γ) is

image

where

image

The distribution Stk(α,Γ) with characteristic function (1) allows for marginals which are not centered nor symmetric. However in the present work all the marginals will be centered and symmetric, and the spectral measure will often be a discrete measure, i.e., Γ(·) = �nj=1 γjδsj(·)for  n ∈ N, sj ∈ Sk−1and  γj ∈ R. In particular, under these specific assumptions, we have

image

See Samoradnitsky (2017) for a detailed account on Sα,Γ ∼St(α,Γ).

We consider fully connected feed-forward NNs composed of  D ≥1 layers where each layer is of width n ≥1. Let  w(l)i,jbe the weights of the l-th layer, and assume that they are independent and identically distributed as St(α, σw), a stable distribution with stability parameter  α ∈(0, 2] and scale parameter  σw >0.

image

for any  i ≥1,  j ≥1 and  l ≥1. Let  b(l)idenote the biases of the l-th hidden layer, and assume that they are independent and identically distributed as St(α, σb). That is, the characteristic function of the random variable  b(l)i ∼St(α, σb) is

image

for any  i ≥1 and  l ≥1. The random weights  w(l)i,jare independent of the biases  b(l)i, for any  i ≥1,  j ≥1 and l ≥1. That is,

image

Let  φ : R → Rbe a nonlinearity with a finite number of discontinuities and such that it satisfies the envelope condition

image

for every  s ∈ R, and for any parameter a, b > 0,  γ <α−1and  β < γ−1. If  x ∈ RIis the input argument of the NN, then the NN is explicitly defined by means of

image

and

image

for l = 2, . . . , D and i = 1, . . . , n in (7) and (8). The scaling of the weights in (8) will be shown to be the correct one to obtain non-degenerate limits as  n →+∞.

We show that, as the width of the NN tends to in-finity jointly on network’s layers, the finite dimensional distributions of the network converge to a multivariate stable distribution whose parameters are compute via a suitable recursion over the network layers. Then, by combining this limiting result with standard arguments on finite-dimensional projections we obtain the large n limit of the stochastic process (f (l)i(x(1), n), . . . , f (l)i(x(k), n))i≥1where  x(1), . . . , x(k)are the inputs to the NN. In particular, let w−→denote the weak convergence. Then, we show that as n → +∞,

(f (l)i(x(1), n), . . . , f  (l)i(x(k), n))i≥1 w−→�

image

where �is the product measure. From now on k is the number of inputs, which is equal to the dimensionality of the finite dimensional distributions of interest for the stochastic processes  f (l)i.Thorough the rest of the paper we assume that the assumptions introduced in Section 3 hold true. Hereafter we present a sketch of the proofs of our main result for a fixed index i and input x, and we defer to the SM for the complete proofs of our main results.

We start with a technical remark: in (7)-(8) the stochastic processes  f (l)i(x, n) are only defined for i = 1, . . . , n, while the limiting measure in (9) is the product measure on  i ≥1. This fact does not determine problems, as for each  L ⊂ Nthere is a n large enough such that for each  i ∈ Lthe processes  f (l)i(x, n) are defined. In any case, the simplest solution consists in extending  f (l)i(x, n) from i = 1, . . . , n to  i ≥1 in (7)-(8), and we will make this assumption in all the proofs.

4.1 Large width asymptotics: k = 1

We characterize the limiting distribution of  f (l)i(x, n) as  n → ∞for a fixed i and input x. We show that, as n → +∞,

image

where the parameter  σ(l) is computed through the recursion:

image

and q(l) = St(α, σ(l)) for each  l ≥1. The generalization of this result to  k ≥1 inputs is given in Section 4.2.

Proof of (10). The proof exploits exchangeability of the sequence (f (l)i(n, x))i≥1, an induction argument on the layer’s index l for the directing (random measure) of (f (l)i(n, x))i≥1, and some technical lemmas that are proved in SM. Recall that the input is a real-valued vector of dimension I. By means of (4) and (5), for i ≥1:

image

image

and for l = 2, . . . , D

image

i.e.,

image

It comes from (8) that, for every fixed l and for every fixed n the sequence (f (l)i(n, x))i≥1is exchange- able. In particular, let  p(l)ndenote the directing (random) probability measure of the exchangeable sequence (f (l)i(n, x))i≥1. That is, by de Finetti repre- sentation theorem, conditionally to  p(l)nthe  f (l)i(n, x)’s are iid as  p(l)n. Now, consider the induction hypoth- esis that  p(l−1)n w−→ q(l−1)as  n → +∞, with  q(l−1)be St(α, σ(l −1)), and the parameter  σ(l −1) will be specified. Therefore,

image

where the first equality comes from plugging in the definition of  f (l)i(x, n), rewriting E[exp(�nj=1 · · ·)] as E[�nj=1exp(· · ·)] =  �nj=1 E[exp(· · ·)] due to indepen- dence, computing the characteristic function for each term, and re-arranging. therein, since (f (l−1)I (n, x))i≥1is exchangeable there exists (de Finetti theorem) a random probability measure  p(l−1)nsuch that conditionally to  p(l−1)nthe  f (l−1)I (n, x) are iid as  p(l−1)nwhich explains (11).

Now, let p−→denote the convergence in probability. The following technical lemmas (Appendix A):

L1) for each  l ≥2 Pr[p(l−1)n ∈ I] = 1, with  I = {p :�|φ(f)|αp(df) < +∞};

image

L2)

�|φ(f)|αp(l−1)n (df) p−→ �|φ(f)|αq(l−1)(df), as n → +∞;

image

L3)

�|φ(f)|α[1 − e−|t|α σαwn |φ(f)|α]p(l−1)n (df) p−→0, as n → +∞.

together with Lagrange theorem, are the main ingredients for proving (10) by studying the large n asymptotic behavior of (11). By combining (11) with lemma L1,

image

By means of Lagrange theorem, there exists  θn ∈[0, 1] such that

image

Now, since

image

+  |t|α σαwn

image

Finally, recall the fundamental limit ex= limn→+∞(1 +  x/n)n. This, combined with L2 and L3 leads to

image

as  n → +∞. That is, we proved that the large n limiting distribution of  f (l)i(x, n) is St(α, σ(l)), where we set

image

4.2 Large width asymptotics: k 1

We establish the convergence in distribution of (f (l)i(x(1), n), . . . , f (l)i(x(k), n)) as  n → +∞for a fixed i and k inputs  x(1), . . . , x(k). This result, combined with standard arguments on finite-dimensional projections, then establishes the convergence of the NN to the stable process. Precisely, we show that, as  n → +∞, one has

image

where the spectral measure Γ(l) is computed through the recursion:

image

and q(l) = Stk(α,Γ(l)) for each  l ≥1, where  xj= [x(1)j , . . . , x(k)j]  ∈ Rk. Here (and in all the expressions involving the function  δ) we make use of the notational assumption that if  λ= 0 in  λδ•, then  λδ•= 0. This assumption allows us to avoid making the notation more cumbersome than necessary to explicitly exclude the case of  φ(f) = 0, for which  φ(f)/||φ(f)||is undefined. We omit the sketch of the proof of (12), as it is a step-by-step parallel of the proof of (10) with the added complexities due to the multivariate stable distributions. The reader can refer to the SM for the full proof.

4.3 Finite-dimensional projections

In Section 4.2 we obtained the convergence in law of f (l)i(x(1), n), . . . , f (l)i(x(k), n) for k inputs and a generic i to a multivariate Stable distribution. Let refer to this random vector as  fi(x). Now, we derive the limiting behavior in law of  fi(x) jointly over all i = 1, . . . (again for a given k-dimensional input). It is enough to study the convergence of  f1(x), . . . , fn(x) for a generic  n ≥1. That is, it is enough to establish the convergence of the finite dimensional distributions (over i: we consider here  fi(x) as random sequence over i). See Billingsley (1999) for details.

To establish the convergence of the finite dimensional distributions (over i) it then suffices to establish the convergence of linear combinations. More precisely, let X = [x(1), . . . , x(k)] ∈ RI×k. We show that, as n → +∞,

image

by proving the large n asymptotic behavior of any fi-nite linear combination of the  f (l)i(X, n)’s, for  i ∈ L ⊂N. Following the notation of Matthews et al. (2018b), let

image

Then, we write

image

where

image

Then,

image

That is,

image

where

image

Then, along lines similar to the proof of the large n asymptotics for the i-th coordinate, we have the following

image

as  n → +∞. This complete the proof of the limiting behaviour (9).

For the classical case of Gaussian weights and biases, and more in general for finite-variance iid distributions, the seminal work is that of Neal (1995). Here the author establishes, among other notable contributions, the connection between infinitely wide shallow (1 hidden layer) NNs and centered GPs. We reviewed the essence of it in Section 1.

This result is extended in Lee et al. (2018) to deep NNs where the width n(l) of each layer l goes to in-finity sequentially, starting from lowest layer, i.e. n(1) to n(D). The sequential nature of the limits reduces the task to a sequential application of the approach of Neal (1995). The computation of the GP kernel for each layer l involves a recursion, and the authors propose a numerical method to approximate the integral involved in each step of the recursion. The case where each n(l) goes to infinity jointly, i.e. n(l) = n, is considered in Matthews et al. (2018a) under more restrictive hypothesis, which are relaxed in Matthews et al. (2018b). While this setting is most representative of a sequence of increasingly wide networks, the theoretical analysis is considerably more complicated as it does not reduce to a sequential application of the classical multivariate central limit theorem.

Going beyond finite-variance weight and bias distributions, Neal (1995) also introduced preliminary results for infinitely wide shallow NNs when weights and biases follow centered symmetric stable distributions. These results are refined in Der and Lee (2006) which establishes the convergence to a stable process, again in the setting of shallow NNs.

The present paper can be considered a generalization of the work of Matthews et al. (2018b) to the context of weights and biases distributed according to centered and symmetric stable distributions. Our proof follows different arguments from the proof of Matthews et al. (2018b), and in particular it does not rely on the central limit theorem for exchangeable sequences (Blum et al., 1958). Hence, since the Gaussian distribution is a special case of the stable distribution, our proof provides an alternative and self-contained proof to the result of Matthews et al. (2018b). It should be noted that our envelope condition (6) is more restrictive than the linear envelope condition of Matthews et al. (2018b). For the classical Gaussian setting the conditions on the activation function have been weakened in the work of Yang (2019).

Finally, there has been recent interest in using heavy-tailed distributions for gradient noise (Simsekli et al., 2019) and for trained parameter distributions (Mar- tin and Mahoney, 2019). In particular, Martin and Mahoney (2019) includes an empirical analysis of the parameters of pre-trained convolutional architectures (which we also investigate in SM E) supportive of heavy-tailed distributions. Results of this kind are compatible with the conjecture that stochastic processes arising from NNs whose parameters are heavy-tailed might be closer representations of their finite, high-performing, counterparts.

6.1 Bayesian inference

Infinitely wide NNs with centered iid Gaussian initializations, and more in general finite variance centered iid initializations, gives rise to iid centered GPs at every layer l. Let us assume that weights and biases are distributed as in Section 1, and let us assume L layers (L −1 hidden layers). Each centered GPs is characterized by its covariance kernel function. Let us denote by  f (l)such GPs for the layer 2  ≤ l ≤ L. Over two inputs x and  x′the distribution of (f (l)(x), f (l)(x′)) is characterized by the variances q(l)x = V[f (l)(x)],  q(l)x′ = V[f (l)(x′)] and by the covari- ance  c(l)x,x′ = C[f (l)(x), f (l)(x′)]. These quantities sat- isfy

image

where z and  z′are independent standard Gaussian distributions N(0, 1),

image

with initial conditions  q(1)x = σ2b+  σ2w∥x∥2and  c(1)x,x′= σ2b+  σ2w⟨x, x′⟩.

To perform prediction via  E[f (L)(x∗)|x∗, D], it is necessary to compute these recursions for all ordered pairs of data points  x, x′in the training dataset D, and for all pairs  x∗, xwith  x ∈ D. Lee et al. (2018) proposes an efficient quadrature solution that keeps the computational requirements manageable for an arbitrary activation  φ.

In our setting, the corresponding recursion is defined by (13)-(14), which is a more computationally challenging problem with respect to the Gaussian setting. A sketch of a potential approach is as follows. Over the training data points and test points, f (1) ∼Stk(α,Γ(1)) where k is equal to the size of training and test datasets combined. As Γ(1) is a discrete measure exact simulations algorithms are available with a computational cost of O(I) per sample (Nolan, 2008). We can thus generate M samples �f (1)j, j = 1, . . . , M, in O(IM), and use these to approximate  f (2) ∼Stk(α,Γ(2)) with Stk(α, �Γ(2)) with �Γ(2) being

image

We can repeat this procedure by generating (approximate) random samples �f (2)j, with a cost of  O(M 2), that in turn are used to approximate Γ(3) and so on. In this procedure the errors can accumulate across the layers, as in Lee et al. (2018). This may be ameliorated by using quasi random number generators of Joe and Kuo (2008), as the sampling algorithms for multivariate stable distributions (Weron, 1996; Weron et al., 2010; Nolan, 2008) are all implemented as transformations of uniform distributions. The use of QRNG effectively defines a quadrature scheme for the integration problem. We report in the SM preliminary results regarding the numerical approximation of the recursion defined by (13)-(14).

This leaves us with the problem of computing a statistic of  f (L)(x∗)|(x∗, D) or sampling from it, to perform prediction. Again, it could be beneficial to leverage on the discreteness of �Γ(L). For example, these multivariate stable random variables can be expressed as suitable linear transformations of independent stable random variables (Samoradnitsky, 2017), and results expressing stable variables as mixtures of Gaussian variables are available in Samoradnitsky (2017).

6.2 Neural tangent kernel

In Section 6.1 we reviewed how the connection with GPs makes it possible to perform Bayesian inference directly on the limiting process. This corresponds to a ”weakly-trained” regime of NNs, in the sense that the point (mean) predictions are equivalent to assuming an  l2loss function, and fitting only a terminal linear layer to the training data, i.e. performing a kernel regression (Arora et al., 2019). The works of Jacot et al. (2018), Lee et al. (2019) and Arora et al. (2019) consider ”fully-trained” NNs with  l2loss and continuous-time gradient descent. Under Gaussian initialization assumptions it is shown that as the width of the NN goes to infinity, the point predictions corresponding by such fully trained networks are given again by a kernel regression but with respect to a different kernel, the neural tangent kernel.

In the derivation of the neural tangent kernel, one important point is that the gradients are not computed with respect to the standard model parameters, i.e. the the weights and biases entering the affine transforms. Instead they are ”reparametrized gradients” which are computed with respect to parameters initialized as N(0, 1), with any scaling (standard deviation) defined by parameter multiplication. It would thus be interesting to study whether a corresponding neural tangent kernel can be defined for the case of stable distributions with 0  < α <2, and whether the parametrization of (7)-(8) is the appropriate one to do so.

6.3 Information propagation

The recursions (15)-(16) define the evolution over depth of the distribution of  f (l)for two points  x, x′when weights and biases are distributed as in Section 1. The information propagation framework studies the behavior of  q(l)xand  ρ(l)x,x′as  l → +∞. It is shown in Poole et al. (2016) and Schoenholz et al. (2017) that the (σw, σb) positive quadrant is divided in two regions: a stable phase where  ρ(l)x,x′ →1 and a chaotic phase where  ρ(l)x,x′converges to a random variable (in the  φ= tanh case, in other cases the limiting processes may fail to exist). Thus in the stable phase f (l)is eventually perfectly correlated over inputs (and in most cases perfectly constant), while in the chaotic phase it is almost everywhere discontinuous. The work of Hayou et al. (2019) formalizes these results and investigates the case where (σw, σb) is on the curve separating the stable from the chaotic phase, i.e. the edge of chaos. Here it is shown that the behavior is qualitatively similar to that of the stable case, but with a lower rate of convergence with respect to depth. Thus in all cases the distribution of  f (l)eventually collapse to degenerate and inexpressive distributions as depth increases.

In this context it would be interesting to study what is the impact of the use of stable distributions. All results mentioned above holds for the Gaussian case, which corresponds to  α= 2. Thus this further analysis would study the case 0  < α <2, resulting in a triplet (σw, σb, α). Even though it seems hard to escape the course of depth under iid initializations, it might be that the use of stable distributions, with their not-uniformly-vanishing relevance at unit level (Neal, 1995), might slow down the rate of convergence to the limiting regime.

Within the setting of fully connected feed-forward deep NNs with weights and biases iid as centered and symmetric stable distributions, we proved that the infinite wide limit of the NN, under suitable scaling on the weights, is a stable process. This result contributes to the theory of fully connected feed-forward deep NNs, generalizing the work of Matthews et al. (2018b). We presented an extensive discussion on how our result can be used to extend recent lines of research which relies on GP limits.

On the theoretical side further developments of our work are possible. Firstly, Matthews et al. (2018b) performs an empirical analysis of the rates of convergence to the limiting process as function of depth with the respect to the MMD discrepancy (Gretton et al., 2012). Having proved the convergence of the finite dimensional distributions to multivariate stable distributions, the next step would be to establish the rate of convergence with respect to a metric of choice as function of the stability index  αand depth l. Secondly, all the established convergence results (this paper included) concern the convergence of the finite dimensional distributions of the NN layers. For the countable case, which is the case of the components  i ≥1 in each layer, this is equivalent to the convergence in distribution of the whole process (over all the i) with respect to the product topology. However, the input space being  RIit is not countable. Hence, for a given i, the convergence of the finite dimensional distributions (i.e. over a finite collection of inputs) is not enough to establish the convergence in distribution of the stochastic process seen as a random function on the input (with respect to an appropriate metric). This is also the case for results concerning the convergence to GPs. It would thus be worthwhile to complete this theoretical line of research by establishing such result for any 0  < α ≤2. As a side result, doing so is likely to provide estimates on the smoothness proprieties of the limiting stochastic processes.

We wish to thank the three anonymous reviewers and the meta reviewer for their valuable feedback. Stefano Favaro received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No 817257. Stefano Favaro gratefully acknowledge the financial support from the Italian Ministry of Education, University and Research (MIUR), “Dipartimenti di Eccellenza” grant 2018-2022.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. (2019). On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32.

Billingsley, P. (1999). Convergence of Probability Measures. Wiley-Interscience, 2nd edition.

Der, R. and Lee, D. D. (2006). Beyond gaussian pro- cesses: On the distributions of infinite networks. In Advances in Neural Information Processing Systems, pages 275–282.

Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. (2019). Deep convolutional networks as shallow gaussian processes. In International Conference on Learning Representations.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch¨olkopf, B., and Smola, A. (2012). A kernel twosample test. Journal of Machine Learning Research, 13(Mar):723–773.

Hayou, S., Doucet, A., and Rousseau, J. (2019). On the impact of the activation function on deep neural networks training. In Proceedings of the 36th International Conference on Machine Learning, pages 2672–2680.

Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31, pages 8571–8580.

Joe, S. and Kuo, F. Y. (2008). Notes on generating sobol sequences. ACM Transactions on Mathematical Software (TOMS), 29(1):49–57.

Lee, J., Sohl-dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. (2018). Deep neural networks as gaussian processes. In International Conference on Learning Representations.

Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl- Dickstein, J., and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems 32.

Martin, C. H. and Mahoney, M. W. (2019). Traditional and heavy-tailed self regularization in neural network models. arXiv preprint arXiv:1901.08276.

Matthews, A. G. d. G., Hron, J., Rowland, M., Turner, R. E., and Ghahramani, Z. (2018a). Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations.

Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. (2018b). Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271.

Neal, R. M. (1995). Bayesian Learning for Neural Networks. PhD thesis, University of Toronto.

Nolan, J. P. (2008). An overview of multivariate sta- ble distributions. Online: http://academic2. american.edu/ jpnolan/stable/overview.pdf.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, pages 3360–3368.

Samoradnitsky, G. (2017). Stable non-Gaussian random processes: stochastic models with infinite variance. Routledge.

Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl- Dickstein, J. (2017). Deep information propagation. In International Conference on Learning Representations.

Simsekli, U., Sagun, L., and Gurbuzbalaban, M. (2019). A tail-index analysis of stochastic gradient noise in deep neural networks. arXiv preprint arXiv:1901.06053.

Weron, R. (1996). On the chambers-mallows-stuck method for simulating skewed stable random variables. Statistics & probability letters, 28(2):165–171.

Weron, R. et al. (2010). Correction to: ”on the chambers–mallows–stuck method for simulating skewed stable random variables”. Technical report, University Library of Munich, Germany.

Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.

A Large width asymptotics: k = 1

We first consider the case with of a single input being a real-valued vector of dimension I. By means of (4) and (5) we can write for  i ≥1 and l = 2, . . . , D:

image

= E[exp{itb(1)i }]

image

= eσαb |t|α I�j=1eσαw|txj|α

image

i.e., f (1)i (x)d= Sα,(σαw�Ij=1 |xj|α+σαb)1/α;

image

n �j=1 w(l)i,jφ(f (l−1)j (x, n)) +  b(l)i

n �j=1 E[exp{itw(l)i,jφ(f (l−1)j (x, n))n1/α | {f (l−1)j (x, n)}j=1,...,n}]

image

n�j=1|φ(f (l−1)j(x, n))|α+ σαb)|t|α

image

and we determine the expression of  σ(l).

A.1 Asymptotics for the i-th coordinate

It comes from (8) that, for every fixed l and for every fixed n the sequence (f (l)i(n, x))i≥1is exchangeable. In particular, let  p(l)ndenote the directing (random) probability measure of the exchangeable sequence (f (l)i(n, x))i≥1.

That is, by de Finetti representation theorem, conditionally to  p(l)nthe  f (l)i(n, x)’s are iid as  p(l)n. Now, consider the induction hypothesis that, as  n → +∞p(l−1)n w−→ q(l−1),

with  q(l−1)being St(α, σ(l −1)), and the parameter  σ(l −1) will be specified. Therefore, we can write the following expression

n�j=1|φ(f (l−1)j(x, n))|α+ σαb

image

= exp  {−|t|ασαb } E���exp�−|t|α σαwn |φ(f)|α�p(l−1)n (df)�n�.

Hereafter we show the limiting behaviour (19). In order to prove this limiting behaviour, we will prove:

L1) for each  l ≥2 Pr[p(l−1)n ∈ I] = 1, with  I = {p :�|φ(f)|αp(df) < +∞};

image

L2)

�|φ(f)|αp(l−1)n (df) p−→�|φ(f)|αq(l−1)(df), as  n → +∞;

image

L3)

�|φ(f)|α[1 − e−|t|α σαwn |φ(f)|α]p(l−1)n (df) p−→0, as  n → +∞.

A.1.1 Proof of L1)

The proof of L1) follows by induction. In particular, L1) is true for the envelope condition (6), i.e.,

E[|φ(f (1)i (x))|α] ≤ E[(a + b|f (1)i (x)|β)γα]

≤ E[aγα+  bγα|f (1)i (x)|βγα]

=  aγα+  bγαE[|f (1)i (x)|βγα]

image

since  αβγ < α. Now, assuming that L1) is true for (l −2), we prove that it is true for (l −1). Again from (6), one has

E[|φ(f (l−1)i (x, n))|α | {f (l−2)j (x, n)}j=1,...,n]

≤ E[(a + b|f (l−1)i (x, n)|β)γα | {f (l−2)j (x, n)}j=1,...,n]

≤ E[aγα+  bγα |f (l−1)i (x, n)|βγα | {f (l−2)j (x, n)}j=1,...,n]

=  aγα+  bγαE[|f (l−1)i (x, n)|βγα | {f (l−2)j (x, n)}j=1,...,n]

image

Thus, since  β < γ−1,

n�j=1|φ(f (l−2)j(x, n))|α+ σαb

n�j=1|φ(f (l−2)j(x, n))|α+ σαb | p(l−2)n

�|φ(f)|αp(l−2)n (df) �βγ

image

A.1.2 Proof of L1.1)

The proof of L1.1) follows by induction, and along lines similar to the proof of L1). In particular, let  ϵbe such that  βγ(α + ϵ)/α <1 and  γ(α + ϵ) <1. It exists since  βγ <1 and  γα <1. For l = 2,

E(|φ(f (1)i (x))|α+ϵ) ≤ E[(a + b|f (1)i (x)|β)γ(α+ϵ)]

≤ E[aγ(α+ϵ)+  bγ(α+ϵ)|f (1)i (x)|(α+ϵ)γβ]

=  aγ(α+ϵ)+  bγ(α+ϵ)E[|f (1)i (x)|(α+ϵ)γβ]

image

since (α + ϵ)βγ < α. This follows along lines similar to those applied in the previous subsection. Moreover the bound is uniform with respect to n since the law is invariant with respect to n. Now, assume that L1.1) is true for (l −2). Then, we can write the following inequality

image

n�j=1|φ(f (l−2)j(x, n))|α+ σαb | p(l−2)n

image

Thus,

≤ aγ(α+ϵ)+  bγ(α+ϵ)E[|Sα,1|βγ(α+ϵ)]�σαb+  σαwsupn

image

A.1.3 Proof of L2)

By the induction hypothesis,  p(l−1)nconverges to  p(l−1)in distribution with respect to the weak topology. Since the limit law is degenerate (on  p(l−1)), then for every subsequence (n′) there exists a subsequence (n′′) such that p(l−1)n′′converges a.s. By the induction hypothesis,  p(l−1)is absolutely continuous with respect to the Lebesgue measure. Since  |φ|αis a.s. continuous and, by L1.1), uniformly integrable with respect to (p(l−1)n), then we can

write the following

Thus,  n → +∞ �|φ(f)|αp(l−1)n (df) p−→� |φ(f)|αq(l−1)(df).

A.1.4 Proof of L3)

Let  ϵbe as in L1.1), and let p = (α + ϵ)/αand q = (α + ϵ)/ϵ. Then 1/p + 1/q = 1. Thus, by Holder inequality � |φ(f)|α[1 − e−|t|α σαwn |φ(f)|α]p(l−1)n (df)

image

≤��|φ(f)|αpp(l−1)n (df)�1/p ��[1 − e−|t|α σαwn |φ(f)|α]qp(l−1)n (df)�1/q.

Since we defined p = (α + ϵ)/αand q = (α + ϵ)/ϵ, i.e. we set q > 1, then we can write the following

��|φ(f)|αpp(l−1)n (df)�1/p ��[1 − e−|t|α σαwn |φ(f)|α]qp(l−1)n (df)�1/q

���|φ(f)|α+ϵp(l−1)n (df) �1/p� ��[1 − e−|t|α σαwn |φ(f)|α]p(l−1)n (df)�1/q

image

���|φ(f)|α+ϵp(l−1)n (df) �1/p� �|t|α σαwn

as  n → ∞, by L1.1).

A.1.5 Combine L1), L2) and L3)

= exp {−|t|ασαb}  E���exp�−|t|α σαwn|φ(f)|α�p(l−1)n (df)�n�= exp {−|t|ασαb} E �1{(p(l−1)n ∈I)}

��exp �−|t|α σαwn |φ(f)|α�p(l−1)n(df)�n�.

Then, by Lagrange theorem, there exists a value  θn ∈[0, 1] such that the following equality holds true

1 exp �−|t|α σαwn|φ(f)|α�= |t|α σαwn|φ(f)|αexp�−θn|t|α σαwn|φ(f)|α�,

thus

exp �−|t|α σαwn|φ(f)|α�= 1 − |t|α σαwn|φ(f)|αexp�−θn|t|α σαwn|φ(f)|α�= 1 − |t|α σαwn|φ(f)|α+ |t|α σαwn|φ(f)|α�1 exp�−θn|t|α σαwn|φ(f)|α��.

Now, since

0  ≤� |φ(f)|α[1 − e−θn|t|α σαwn |φ(f)|α]p(l−1)n (df)

image

then

�1 − |t|α σαwn

image

Thus, by using the definition of the exponential function, i.e. ex= limn→+∞(1 +  x/n)n, and L1)-L3) we have

E[eitf (l)i (x,n)] → e−|t|α[σαb +σαw�|φ(f)|αq(l−1)(df)],

as  n → +∞. That is, we proved that the large n limiting distribution of  f (l)i(x, n) is St(α, σ(l)), where

�|φ(f)|αq(l−1)(df) �1/α.

B Large width asymptotics:  k ≥ 1

We now consider the case with of a k inputs, each one being a real-valued vector of dimension I. We represent this generic case with a  I × kinput matrix X. Let  1rdenote a vector of dimension  k ×1 with 1 in the r-the entry and 0 elsewhere, and 1 denote a vector of dimension  k ×1 of 1’s. If  xjdenotes the j-th row of the input matrix, then we can write

image

and

n �j=1 w(l)i,j(φ ◦ f (l−1)j (X, n)) +  b(l)i 1

for l = 2, . . . , D, i = 1, . . . , n where we denote with  ◦element-wise application. Note that  f (l)i(X, n) is a random vector of dimension  k ×1, and we denote the r-th component of this vector by  f (l)i,r(X, n), namely f (l)i,r(X, n) =  1Tr f (l)i(X, n). Then, by means of (4) and (5), we can write for i = 1  ≥1 and l = 2, . . . , D:

image

I�j=1 E[exp{itT w(1)i,j xj}]

image

�||σb1||αδ 1||1||+

image

Γ(1)=  ||σb1||αδ 1||1||+

image

observe that we can also determine the marginal distributions of  f (1)i (X). From (2), (3),

f (1)i,r(X) ∼St(α, σ(1)(r)),

image

σ(1)(r) =��Sk−1 |1Tr s|αΓ(1)(ds)�1/α

image

image

n �j=1 w(l)i,j(φ ◦ f (l−1)j (X, n)) +  b(l)i 1

n �j=1 E[exp{itT w(l)i,jφ ◦ f (l−1)j (X, n)n1/α | {f (l−1)j (X, n)}j=1,...,n}]

image

n�j=1||φ ◦f (l−1)j (X, n)||α�����tT φ ◦f (l−1)j (X, n)||φ ◦f (l−1)j (X, n)||

image

f (l)i(X, n) | {f (l−1)j (X, n)}j=1,...,nd= Sα,Γ(l)

image

n�j=1||σw(φ ◦ f (l−1)j (X, n))||αδ φ◦f(l−1)j (X,n)||φ◦f(l−1)j (X,n)||;

observe that we can also determine the marginal distributions of  f (l)i(X). From (2), (3),

f (l)i,r(X, n)  | {f (l−1)j (X, n)}j=1,...,n ∼St(α, σ(l)(r)),

image

and we determine the expression of Γ(l).

B.1 Asymptotics for the i-th coordinate

Let  p(l)ndenote the directing (random) measure of the exchangeable sequence (f (l)i(n, X))i≥1. Now, consider the induction hypothesis that, as  n → +∞, p(l−1)n w−→ q(l−1),

with  q(l−1)being St(α, σ(l −1)), and the finite measure Γ(l −1) will be specified. Therefore, we can write the following expression

image

image

× E ���exp ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

Hereafter we show the limiting behaviour (22). In order to prove this limiting behaviour, we will prove:

L1) for each  l ≥2 Pr[p(l−1)n ∈ I] = 1, with  I = {p :�||φ ◦ f||αp(df) < +∞};

image

L2)

�||φ ◦ f||αp(l−1)n (df) p−→�||φ ◦ f||αq(l−1)(df), as n +;

L3)

�||φ ◦ f||α �1 exp�−�Sk−1 |tT s|α �1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

B.1.1 Proof of L1)

The proof of L1) follows by induction. In particular, L1) is true for the envelope condition (6), i.e.,

� k�r=1 |φ(f (1)i,r(X))|α�

k�r=1E[(a + b|f (1)i,r(X)|β)γα]

k �r=1 E[(aγα+  bγα|f (1)i,r(X)|βγα)]

≤ kaγα+  bγαk �r=1 E[|f (1)i,r(X)|βγα]

image

since  αβγ < α. Now, assuming that L1) is true for (l −2), we prove that it is true for (l −1). Again from (6),

image

� k�r=1 |φ(f (l−1)i,r (X, n))|α | {f (l−2)j (X, n)}j=1,...,n

k �r=1 E[(a + b |f (l−1)i,r (X, n)|β)γα | {f (l−2)j (X, n)}j=1,...,n]

k �r=1 E[(aγα+  bγα |f (l−1)i,r (X, n)|βγα) | {f (l−2)j (X, n)}j=1,...,n]

=  kaγα+  bγαk �r=1 E[|f (l−1)i,r (X, n)|βγα | {f (l−2)j (X, n)}j=1,...,n]

image

image

Since  β < γ−1,

=  E[E[||φ ◦ f (l−1)i (X, n)||α | {f (l−2)j (X, n)}j=1,...,n] | p(l−2)n] ≤ kaγα+  bγαE[|Sα,1|αβγ]

image

+ �Sk−1 |1Tr s|α�� ||σw(φ ◦ f)||αδ φ◦f||φ◦f|| p(l−2)n (df)�(ds)�βγ

image

B.1.2 Proof of L1.1)

The proof of L1.1) follows by induction, and along lines similar to the proof of L1). In particular, let  ϵbe such that  βγ(α + ϵ)/α <1 and  γ(α + ϵ) <1. It exists since  βγ <1 and  γα <1. For l = 2, we can find C(k) > 0 finite such that:

k �r=1 |φ(f (1)i,r(X))|α+ϵ�

k �r=1 E[(a + b|f (1)i,r(X)|β)γ(α+ϵ)]

k �r=1 E[(aγ(α+ϵ)+  bγ(α+ϵ)|f (1)i,r(X)|βγ(α+ϵ))]

�kaγ(α+ϵ)+  bγ(α+ϵ)k �r=1 E[|f (1)i,r(X)|(α+ϵ)γβ] �

image

since (α + ϵ)βγ < α. This follows along lines similar to those applied in the previous subsection. Moreover the bound is uniform with respect to n since the law is invariant with respect to n. Let us assume that L1.1) is true

for (l −2). Then we can write the following

+ �Sk−1 |1Tr s|α�� ||σw(φ ◦ f)||αδ φ◦f||φ◦f|| p(l−2)n (df)�(ds)�βγ(α+ϵ)/α �

Thus,

�kaγ(α+ϵ)+  bγ(α+ϵ)E[|Sα,1|βγ(α+ϵ)]

�(ds)

image

+ �Sk−1 |1Tr s|α�� ||σw(φ ◦ f)||αδ φ◦f||φ◦f|| p(l−2)n (df)�(ds)�βγ(α+ϵ)/α �

image

B.1.3 Proof of L2)

By the induction hypothesis,  p(l−1)nconverges to  p(l−1)in distribution with respect to the weak topology. Since the limit law is degenerate (on  p(l−1)), then for every subsequence (n′) there exists a subsequence (n′′) such that p(l−1)n′′converges a.s. By the induction hypothesis,  p(l−1)is absolutely continuous with respect to the Lebesgue measure. Since  |φ|αis a.s. continuous and, by L1.1), uniformly integrable with respect to (p(l−1)n), then we can

write the following

Thus, n +�||φ ◦ f||αp(l−1)n (df) p−→� ||φ ◦ f||αq(l−1)(df).

image

B.1.4 Proof of L3)

Let  ϵbe as in L1.1), and let p = (α + ϵ)/αand q = (α + ϵ)/ϵ. Then 1/p + 1/q = 1. Thus, by Holder inequality

image

image

× �� �1 exp ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

Since we defined p = (α + ϵ)/αand q = (α + ϵ)/ϵ, i.e. we set q > 1, then we can write the following

image

× �� �1 exp ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

image

× �� �1 exp ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

image

× ���Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

as  n → ∞, by L1.1).

B.1.5 Combine L1), L2) and L3)

× E ���exp ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

image

��exp ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

Then, by Lagrange theorem, there exists a value  θn ∈[0, 1] such that the following equality holds true

1 exp ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

= ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

�Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

thus

exp ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

= 1 ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

�Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

= 1 ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

+ ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

�Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

Now, since

�Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

image

× �1 exp ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

then

�1 � �Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

image

× �1 exp ��Sk−1 |tT s|α� 1n||σw(φ ◦ f)||αδ φ◦f||φ◦f||

Thus, by using the definition of the exponential function, i.e. ex= limn→+∞(1 +  x/n)n, and L1)-L3) we have

image

exp ��Sk−1 |tT s|α �||σb1||αδ 1||1||

× exp �� �Sk−1 |tT s|α �||σw(φ ◦ f)||αδ φ◦f||φ◦f||

as  n → +∞. That is, we proved that the large n limiting distribution of  f (l)i(x, n) is Stk(α,Γ(l)), where

image

C Finite-dimensional projections

We show that, as  n → +∞,

(f (l)i(X, n))i≥1 w−→�

image

by proving the large n asymptotic behavior of any finite linear combination of the  f (l)i(X, n)’s, for  i ∈ L ⊂ N. See, e.g. Billingsley (1999) and reference therein. Following the notation of Matthews et al. (2018b), consider a finite linear combination of the function values without the bias. In other terms, let us consider

T (l)(L, p, X, n) =�i∈Lpi[f (l)i(X, n)  − b(l)i 1].

Then, we write

T (l)(L, p, X, n) =�i∈Lpi[f (l)i(X, n)  − b(l)i 1]

� � 1 n1/α

image

�i∈Lpiw(l)i,j(φ ◦ f (l−1)j (X, n))

n�j=1γ(l)j(L, p, X, n),

where γ(l)j(L, p, X, n) =�i∈Lpiw(l)i,j(φ  ◦ f (l−1)j (X, n)).

Then,

=  E[eitT T (l)(L,p,X,n) | {f (l−1)j (X, n)}j=1,...,n]

image

�i∈Lαiw(l)i,j(φ ◦ f (l−1)j (X, n))

�i∈LE�exp �itT 1n1/α piw(l)i,j(φ ◦ f (l−1)j (X, n)) | {f (l−1)j (X, n)}j=1,...,n

image

That is,

T (l)(L, p, X, n) | {f (l−1)j (X, n)}j=1,...,nd= Sα,Γ(l).

where

image

Then, along lines similar to the proof of the large n asymptotics for the i-th coordinate, we have

image

�� �Sk−1 |tT s|α��i∈L||piσw(φ ◦ f)||αδ φ◦f||φ◦f||

as  n → +∞. This complete the proof.

image

Figure 1: 2D-KDE estimates for  y ∼ f (l)1(x, x′, n) (red) and  y ∼Stk(α, �Γ(x, x′, l, M)) (blue).

D Numerical evaluation of the recursion

In this section we perform a preliminary numerical investigation of the approach proposed in Section 6.1 for the evaluation of recursion (13)-(14). We consider only the case of two inputs  x = −0.5, x′= 1.0 (i.e. a bivariate stable distribution) and we use pseudo random numbers, i.e. standard Monte Carlo (MC), instead of quasi random numbers as suggested in the main text. We consider  σb = σw= 1, the tanh activation function, different values of the stability index  α, and both shallow (l = 2, i.e. 1 hidden layer) and deep (l = 10) NNs. In all cases the networks are wide: n = 300. In Figure 1 we compare the bivariate distributions of: i) the first dimension (i = 1) of the NN distribution  y ∼ f (l)1(x, x′, n) ii) its asymptotic distribution  y ∼Stk(α, �Γ(x, x′, l, M)) as n → +∞. In ii) we use M = 1000 MC samples to evaluate the discrete spectral measure �Γ at each layer. In both i) and ii) we generate 100.000 samples for  y ∈ R2that are used to obtain the 2D-KDE plots of Figure 1. We can observe close agreement in all cases considered (the ”squarish” level curves near the central regions for small  αare an artifact due to the specific KDE estimation algorithm employed and its non-robustness to ”outliers”).

The code at https://github.com/stepelu/deep-stable contains a numpy-based Python implementation of the algorithms used for the simulation of scalar and multivariate stable distributions. Scalar stable variables are generated according to Weron (1996); Weron et al. (2010). In the case of multivariate stable variables the algorithm implemented is the one reported in Nolan (2008), note that the discrete spectral measure needs to be symmetrized. The code also contains the routines used to sample from  f (l)(x, n) and from Stk(α, �Γ(x, l, M)). The implementation does not rely on advanced features so it is easily portable to deep learning frameworks such as tensorflow or pytorch. By modifying the calls to uniform random generators, it is also possible to use quasi random number generators. In all cases, the usual precaution to exclude the extremes of the supports of the uniform distributions involved (i.e. to sample from U(0, 1), not from U[0, 1]) applies.

image

Figure 2: Histograms for the stability indexes  αof the fitted Stable distributions for all layers, in blue for the weights and in yellow for the biases. The models, from left-to-right, are: VGG-16, ResNet-50 and ResNet-101.

image

Figure 3: Histograms of the cdf evaluations for the Stable distribution (blue) and Gaussian distribution (orange) fitted to the weights of the first three layers (left-to-right) of the VGG16 model.

E Empirical analysis of trained CNNs

In this section we investigate whether trained models exhibit parameters distributions close to that of Stable distributions with stability index 0  < α <2, i.e. non-Gaussian. We consider 3 models from the PyTorch’s TorchVision repository, i.e. CNNs trained on ImageNet. While fully connected networks are ideal starting points for a theoretical analysis, it seems possible to expand our results to CNNs as done in Garriga-Alonso et al. (2019) for Gaussian Processes (GP). This allows us to investigate the parameter distributions of trained model in the ”realistic” setting of overparametrized models applied to big datasets with the use of batch normalization and adaptive optimizers.

We restrict our analysis to marginal distributions and for each layer we collect all weights (CNN filters) and biases and fit a Stable distribution via maximum likelihood estimation (MLE). In Figure 2 we plot histograms for the stability indexes  αof the fitted Stable distributions for all layers. We see that distributions are often non-Gaussian, and  αseems to be decreasing with the depth of the model. However, it is not possible to draw definitive conclusions from this short experiment.

To obtain an indication of the goodness of fit of Stable distributions to the parameters, for the first three layers of VGG-16 (α ∼ 1.7) we: fit a Stable distribution to the weights; compute the cumulative distribution function (cdf) of this Stable distribution for each weight; fit a Gaussian distribution to the weights; compute the cdf of this Gaussian distribution for each weight; plot in Figure 3 a histogram of the cdf evaluations for the Stable and Gaussian distribution. In case of perfect fit the histogram should be flat, as the cdf evaluations should be iid uniformly distributed. We see that the fit of the Stable distributions is as expected better than the fit of Gaussian distributions, especially in the tails. The peculiar behavior at extremes of the histograms (tails) could be due to the use of truncated initializations in PyTorch. We validated MLE (limited here to  α > 0.5) and cdf computation on synthetic data generated via sample stable() from the code accompanying this paper.


Designed for Accessibility and to further Open Science