b

DiscoverSearch
About
My stuff
Analysis of Bayesian Inference Algorithms by the Dynamical Functional Approach
2020·arXiv
Abstract
Abstract

We analyze the dynamics of an algorithm for approximate inference with large Gaussian latent variable models in a student-teacher scenario. To model nontrivial dependencies between the latent variables, we assume random covariance matrices drawn from rotation invariant ensembles. For the case of perfect data-model matching, the knowledge of static order parameters derived from the replica method allows us to obtain efficient algorithmic updates in terms of matrix-vector multiplications with a fixed matrix. Using the dynamical functional approach, we obtain an exact effective stochastic process in the thermodynamic limit for a single node. From this, we obtain closed-form expressions for the rate of the convergence. Analytical results are excellent agreement with simulations of single instances of large models.

Keywords: Bayesian Inference, Iterative Algorithms, TAP Equations, Random Matrices, Dynamical Functional Theory

Tools of statistical mechanics have been extensively used in the late 1980’s and the early 1990’s to analyze the learning properties of large neural networks and related learning models [1, 2, 3, 4]. The powerful combination of statistical ensembles of learning machines with the application of the replica approach has allowed for the exact computation of average case learning performance. Remarkably, this could be achieved without having to deal with the details of concrete learning algorithms. Unfortunately, since much of the research was somewhat disconnected from practical existing machine learning approaches, this seemingly advantage has also led to a decline of the research activities in the field in the later 1990s.

More recently, the situation has again changed considerably. The establishment of relations between advanced mean field approximations (related to the so-called cavity method) and message passing algorithms for graphical statistical models has led to novel interest in the interdisciplinary area of statistical mechanics of learning and inference. Message passing algorithms were not only found (under certain conditions) to achieve optimal performance for large systems, but could also be analyzed dynamically by density evolution techniques [5, 6]. While most of this research concentrated originally on sparse networks [5], there was a growing interest on studying inference with networks of densely coupled probabilistic units [7, 8, 9, 10]. By taking formally the large density limit in the theory of message passing one obtains so-called AMP (approximate message passing) algorithms [11, 12, 13] which have been applied to a great variety of models. Fixed points of the AMP algorithm were found to coincide with the solutions of corresponding TAP (Thouless-Anderson-Palmer) mean field equations for statistical averages of the stochastic nodes. This heuristic approach might be criticized because certain independence assumptions made in the cavity approach may not be valid for dense networks. Nevertheless, exactness of the final results could be established for certain simple distributions of random network couplings. To go beyond such simple distributions, more complex ensembles allowing for dependencies between couplings in dense systems could be treated within an adaptive TAP approach [14] motivated by earlier work on spin glasses [15]. This research had (in parts) led to the development of the expectation propagation (EP) algorithms [16] in the field of machine learning. Assuming that matrices of network couplings are realizations of rotation invariant ensembles, prediction properties of the EPstyle VAMP (vector-AMP) algorithms could be analyzed rigorously by a density evolution method using methods of random matrix theory [17, 18]. The density evolution approach so far deals mainly with the computation of the temporal development of predictions errors which are derived from equal time marginals of the distribution of trajectories of an algorithm’s dynamics. It would be important to extend these results to multivariate statistical properties of trajectories which allow for more detailed computations of the algorithm’s performance including the convergence speed towards the fixed point.

In this paper, we will present such an approach. Based on our previous studies on the dynamics of solving TAP equations for Ising spin systems [19, 20], we introduce a VAMP-style algorithm for inference in Gaussian latent probabilistic models. These are important statistical data models with applications to classification and regression. We use the method of dynamical functional theory (DFT) to study the average case properties of algorithms in the large system limit. DFT is a statistical mechanics tool for analyzing dynamical systems [21] based on partition functions over trajectories. From these, effective distributions of entire trajectories for a single node can be obtained. In the past, the method was mainly applied to the dynamics of spin-glasses [22] and neural networks [23] with random independent couplings. We also refer to the study [24] where the DFT was used to analyze AMP-style algorithms (in the context of communication theory) with again random independent couplings assumption.

The novel contributions presented in this paper are twofold: On a technical level we extend the DFT approach of [20] to a combination of a teacher-student scenario for the generation of data together with the assumption of arbitrary rotation invariant random matrix couplings. From a more practical machine learning point of view, our new algorithm is designed for the case, where the class of data generating models is assumed to be known up to its parameters. This means we neglect model mismatch. In this case, we can use the knowledge of equilibrium order parameters given by the replica method to a obtain a simplified algorithm with a significant reduction of computational complexity.

The paper is organized as follows: In Section 2 we introduce the probabilistic model considered for the Bayesian inference. Section 3 provides a brief presentation on the TAP equations. In Section 4 we present our new algorithm for solving the TAP equations and in Section 5 we study its thermodynamic properties using the method of DFT. Comparisons of the theory with simulations are given in Section 6. Section 7 presents a summary and outlook. The derivations of our results are located at the Appendix.

We consider the problem of approximate Bayesian inference for posterior distributions over the Gaussian latent vector  θ ∈ ℜN×1 of the type

image

where Z is a normalization constant. This model assumes that the components of the vector y of N real data values are generated independently from a likelihood  p(y|θ) basedon a vector of unknown parameters  θ. Prior statistical knowledge about  θis introduced by the correlated Gaussian with covariance  K‡A well known example for (1) is the problem of Bayesian learning of a noisy perceptron—also known as probit regression [25]—for which we assume binary class labels  yi = ±1. For more details on the standard interpretation of the noisy perceptron, see section 6.

Our concern is to design and analyze an iterative algorithm that (approximately) computes the vector of posterior means

image

in the thermodynamic limit of large N under some statistical assumptions. For simplicity, we assume data-model matching, ie. the probabilistic model (1) describes the generation of the dataset {y, K} correctly. Moreover, in order to allow for some nontrivial dependencies between matrix elements  Kij, we assume that K is drawn from a rotation invariant matrix ensemble, i.e.  K and V KV ⊤ have the same probability distributions for any orthogonal matrix V independent of K. Equivalently, we have the spectral decomposition

image

where O is a Haar (orthogonal) matrix that is independent of a diagonal matrix D [26]. For the perceptron model, the “classic” assumption of independent components for inputs leads to a Wishart distribution for K which is rotational invariant. For more complex ensembles, see [20].

The TAP approach [15],[14]—related to expectation consistent inference approximations in machine learning [27]—typically provides highly accurate approximations for probabilistic inference. In our context, the TAP equations are the fixed-point equations for approximate posterior means m that are given by

image

Here, we denote the empirical average of an  N ×1 vector x by ⟨x⟩ .= 1N�i≤N xi. Moreover, given the auxiliary single-site partition function

image

we introduce the nonlinear functions

image

The only dependency the TAP equations of the random matrix ensemble is via the function R(·) which is the so-called R–transform. Its definition will be given later in equation (8). The TAP equations generalize the naive mean field approximation, which neglects statistical dependencies between the components of  θ ∼ p(θ|y, K). Since the diagonal entries of rotation invariant random matrices are asymptotically self averaging (e.g. [28, Theorem 2.1],[19]), a short computation shows that the naive mean field theory corresponds to the approximation R(−χ) ≈ R(0) = 1N tr(K−1). The (Onsager-) correction to the mean naive field approximation in terms of the R-transform was originally derived in [29] for Ising spin-glasses using a free energy approach, see also [30] for a comprehensive exposition in this approach. An alternative derivation was given in [14] using the cavity method. We also refer the reader to [31] for a rigorous approach on the self averaging assumptions made for cavity field variances in the latter approach.

The function R(·) stands for the R–transform of the spectral distribution of  K−1

which is defined as [32]

image

Here, we have defined

image

and G−1 is the functional inverse (which is well-defined on (−q,0)) of the Green-function

image

The TAP method is known to be consistent with the replica-symmetric (RS) ansatz when an average over data y and couplings K is considered. The validity of RS is commonly assumed to be asymptotically exact in the case of  data-model matching §. For the case of data-model matching, one can show that (see, e.g. the study of [34]), the RS ansatz leads to the asymptotic (N → ∞) solution for  χin (4c) as

image

Here, the expectation is taken with respect to the probability distribution

image

with  N(·|µ, σ2) denoting the Gaussian density function with mean  µand variance  σ2. Thiscorresponds to the asymptotic marginal distribution for the components of (θ, y, ρ)

image

valid for a large system. Our general idea is to design an iterative algorithm for solving  ρ in(4) in terms of an iteration of a vector of auxiliary variables  ρ(t) (for t = 1, . . .denoting the discrete time index of the iteration) such that the “effective” distributions of the marginals (θi, yi, ρi(t)) mimic the static distribution  prs(θ, y, ρ), specifically

image

for some dynamical order parameters  κ(t) and σ2(t) converging both to  κ as t → ∞.

In recent years, there has been considerable interest in analyzing EP-style [16] iterative algorithms [17, 18, 35]—commonly referred as the method of VAMP. From a computational complexity point of view, such algorithms require the computation of products of  N × Nmatrices at every iteration steps (for details, see Appendix A.1) which can be problematic for large N and large times. Here, we propose a simplified algorithm which utilities the result of the RS ansatz (11) and therefore assumes that the assumed data generating process is correct.

image

Here, for short, we have introduced the scalar function

image

This resembles the update of a single layer recurrent neural network: A vector of nodes ρ(t −1) is passed point wise through a nonlinear function  fη(t). The resulting vector is multiplied by a symmetric “coupling” matrix A before the process is repeated. Note that the iterations (15) are solely based on matrix vector multiplications and evaluations of scalar nonlinear functions.

Before iteration starts the iterative algorithm requires the elements  A and ν that areobtained as follows: We first compute the spectral decomposition

image

where the vector d contains the eigenvalues of K. Then, we obtain the parameters  ν, λand  χby solving the fixed–point equations

image

Finally, the matrix A is computed as

image

It is easy to see that the fixed points of  ρ(t) coincide with the solution of the TAP equations (4) for  ρgiven that the empirical average (4c) is substituted by the RS result (11). In fact, we will re-derive the RS result from the DFT analysis of the iterative algorithm.

In this section, we will analyze the dynamical properties of the iterative algorithm (15) using the method of the DFT. To this end, we introduce the moment generating functional for the trajectory of  {ρi(t)} .= {ρi(t)}t≤T as

image

where for the sake of compactness of notation we assume that the dynamical order parameter  η(t) is self-averaging. Note, that we have added a single external field l(t) to node i. To obtain statistical averages we compute the averaged-generating functional E[Z({l(t)})] where the expectation is taken over the random elements  y, θ, O, i.e.

image

From this, for example, we could compute

image

where the identity (22) follows from the fact that probability distribution of A is invariant under permutation. Using (23) we can quantify the averaged-normalized-square Euclidean distance between iterates of the algorithm at different times as

image

which will allow us to compute the convergence properties of the algorithm. We will defer the explicit and lengthy computation of DFT to Appendix B. There we show how to integrate out the trajectories for all other model  j ̸= iin the limit  N → ∞. The result is

image

where  N (·|µ, Σ) stands for a Gaussian distribution function with mean  µand covariance  Σ.Hence, we have obtained an effective stochastic process for the dynamics of single, arbitrary component  ρ(t) of the vector  ρ(t) as

image

The order parameter  κ(t) and the two-time covariance matrix  Cφ(t, s) (which is denoting the (t, s)th indexed entries of  Cφ) of the Gaussian process are computed by the recursions

image

Here,  σ2A stands for the variance of the spectral distribution of A and for short we have introduced the random dynamics

image

The relative simplicity of these equations may come as a surprise to readers familiar with the results of DFT obtained for spin-glass dynamics and neural networks. In contrast to models with non symmetric random matrices (see e.g [23]) the symmetric matrix case is usually plagued with memory terms in the effective single node stochastic processes. These usually make explicit analytical computations of single time marginals a hard task. As shown in Appendix A.2 the absence of such memory terms can be explained by the vanishing of the two-time response functions which in turn can be understood by wellknown concepts of random matrix theory.

5.1. Convergence rate of the algorithm

To study the large time behavior of the single node dynamics we define the deviation between the dynamical variables at different times (t ̸= s) as

image

where  Cρ(t, s) stands for the two-time covariance of the zero-mean Gaussian process  {ρ(t)}.We have defined the rate of convergence as

image

Assuming convergence, that is ∆(t, s) → 0 as t, s → ∞, we get the explicit rate as

image

as (see [20, Eq.(38)])

image

Then, we re-write (33) in the form

image

Thus, the necessary condition for convergence  µρ <1 holds if and only if

image

Following the arguments for [36, Eq.(18)] one can conclude that equation (37) coincides with the stability condition of the RS ansatz - known as de Almeida Thouless (AT) criterion. Since for the case of data-model matching, RS solution is usually locally stable, we can expect that local convergence broadly fulfilled.

5.2. Asymptotic consistency with the RS ansatz

Let  γ .= fχ(ρ, y) where (θ, y, ρ) ∼ prs(θ, y, ρ). Then, we have

image

Hence, if  κ(t) and Cφ(t, t) converge, they both converge to  κ.Thus, the stationary distribution of the stochastic process (26) yields the replica-symmetric solution, i.e.

image

Perceptrons are single layer neural networks that are parameterized by a vector of weights, say  w ∈ RK×1. We consider the binary classification problems from a training set that is given by  {(xi, yi)}i≤N. Here xi ∈ RK×1 stands for a vector of inputs and a binary label yi = ∓1 for classification. Specifically, we have the observation model for y as

image

The Gaussian latent vector  w ∼ N(0, I) and the noise vector  ǫ ∼ N(0, σ20I) are generated independently. So that, we set  θ = Xω, K = XX⊤ and p(y|θ) = Θ(yθ/σ0) with Θ(·)denoting the cumulative distribution function of standard normal distribution.

We illustrate our theoretical results through the following random matrix models for the data matrix:

image

(ii)  X = ˜HP where P is the N × Kprojection matrix with  N ≥ K and P ij = δij, and˜H is an N × N a randomly-signedHadamard matrix [37] as

image

Here Z is a uniformly distributed random  N × Nsigned permutation matrix, specifically  Zij = ǫiδi,σ(j) for ǫi = ∓1 being independent binary random variables with equal probabilities and  σis a random permutation. Furthermore,  HN is theN × NHadamard matrix which is deterministically constructed from the recursion

image

Note, that in the latter model, X is a column-orthogonal matrix (i.e.  X⊤X = I) withthe  binary entries Xij = ∓ 1√N . Though, K = XX⊤is not rotation invariant in this case, motivated with the study [37] we expect that our theoretical results might still yield quite accurate approximations.

Figure 1 refers to random matrix model (i) where in Figure1-(a) and (b) we illustrate the discrepancy between theory and simulations for the two-time covariance  Cρ(t, s) withrespect to the two-time relative-square-error

image

and the analytical convergence rate of the algorithm, respectively. Similarly, Figure 2 refers to the random matrix model (ii). Simulation results are based on single instances of large random matrices and data.

image

Figure 1: Random matrix model (i): The model parameters are chosen as  σ20 = 10−2,N = 2K and K = 104. (a) Discrepancy between theory and simulations for the two-time covariance with the (t, s) indexed segment representing the relative-squared-error in dB, i.e.  −10 log10 rse(t, s). (b) Asymptotic of the algorithm (where the flat line around 10−30are the consequence of the machine precision of the computer which was used).

We have presented the analysis of a Bayesian inference algorithm for latent Gaussian models in the large systems limit. We have based our approach on a statistical mechanics path integral technique, dynamical functional theory (DFT), which has been used before to treat spin-glass models and neural networks with random interactions. We have generalized the method to allow for more complex settings of the quenched randomness, allowing for a combination of a teacher-student scenario together with rotation invariant ensembles of coupling matrices. While related inference algorithms had been treated before using rigorous random matrix techniques, we were able, for the first time, to obtain the complete effective marginal stochastic process of single nodes. Although the computations turned out to be somewhat involved, the final results turned out to be surprisingly simple: The

image

Figure 2: Random matrix model (ii): The model parameters are chosen as  σ20 = 10−2,N = 2K and K = 213. (a) Discrepancy between theory and simulations for the two-time field covariance with (t, s) indexed segment representing the relative-squared-error in dB, i.e.  −10 log10 rse(t, s). (b) Asymptotic of the algorithm.

statistics of the effective field entering the nonlinear function of the algorithm was found to be a Gaussian process, for which the covariance function could be obtained iteratively. This result lacked the complications of the non-Gaussian fields caused by memory terms in the effective dynamics which were typically present in spin-glass dynamics. We could trace the origins of this simplification by utilizing concepts of random matrix theory. Based on our general result, we were able to compute exact convergence rates of the algorithm. The statistics of the algorithm’s fixed points coincides with that predicted by the replica theory. Within our scenario of data-model matching, the algorithm is found (at least) to be locally convergent. Simulations on single, large systems showed excellent agreement with the theory, From a more practical point of view, the restriction to the matching case allowed us to define an inference algorithm in terms of more efficient iterations (compared to previous VAMP algorithms) by using only a single fixed matrix.

We expect that our approach can be extended in various directions. The DFT method is general enough to treat the non-matching case as well, including the settings of VAMP algorithms. It will also be interesting to extend the method to other more complex probabilistic models of neural network type such as the restricted Boltzmann machines [38]. This however, would require different types of integrals over random matrices, which have to be incorporated into the DFT formalism. A further interesting generalization of DFT would be to random sequential updates in algorithms, where at each iteration step only a single, randomly selected node (or a small mini batch of nodes) are used in the update. It will be interesting to see, if a properly adapted DFT method would lead to tractable computations for this more realistic scenario.

The statistical mechanics techniques utilized in our DFT approach might present of course certain limitations of applicability to real world machine learning problems. The study of general rotation invariant ensembles of matrices is a nontrivial improvement over older models that were based on simpler independence assumptions. It allows us to deal with matrices of (almost) arbitrary eigenvalue distributions, but restricts the corresponding eigenvectors as irrelevant, simply pointing in random directions. It will important to find out for which types of real data such technical assumptions might be reasonable enough to draw relevant conclusions from the theory. The excellent agreement between theory and simulations for the case of a randomly–signed Hadamard matrix (which contains much less randomness compared to a the rotation invariant case, see section 6) gives us some hope that our method could be applicable to a broader range of problems.

This work was supported by the German Research Foundation, Deutsche Forschungsgemeinschaft (DFG), under Grant “RAMABIM” with No. OP 45/9-1.

image

In our context, the VAMP algorithm [17, 18] coincides with the iterative process

image

From the computational complexity point of view, the essential difference between our proposed iterations (15) from the VAMP iterations (A.1) is that at every iteration step the former requires a matrix-vector multiplication while the latter requires matrix-matrix multiplications (see (A.1e)).

image

The striking property of the algorithm (15) is that it has the memory-free property

image

This property makes the analysis of the algorithm relatively simple. To have a better insight on designing memory-free algorithms, we will next introduce a generic dynamical construction and present an intuitive derivation of its memory-free property by means of the concept of asymptotic freeness of random matrices [39, 32].

Specifically, consider the following generic iterative process

image

where  ftis a sequence of scalar function. Here, we suppose that A(t) is rotation invariant (for all t) and also for the diagonal matrices

image

we suppose that

image

Here, for an  N × N matrix Xwe denote its limiting normalized-trace by

image

Remark that both our proposed algorithm (15) and the VAMP algorithm (A.1) are in the family of this generic dynamical construction. We are interested the analyzing the diagonal elements of the dynamical susceptibility

image

In the large-system limit, this product of matrices can be simplified by the concept of asymptotic freeness of random matrices. Specifically, for the two families of matrices, say A .= {A1, A2, . . . , Aa} and E .= {E1, E2, . . . , Ee}, let P i(A) and Qi(E) stand for (noncommutative) polynomials of the matrices in A and the matrices in E, respectively. Then, we say the families A and E are asymptotically free if for all  i ∈ [1, K] and for all polynomials  P i(A) and Qi(E) we have [39]

image

In other words, the limiting normalized-trace of any adjacent product of powers of matrices – which belong to different free families and are centered around their limiting normalized-traces – vanishes asymptotically.

The matrices in (A.7) belong to two families: rotation invariant and diagonal. Under certain technical conditions, these two matrix families can be treated as asymptotically free (in the almost sure sense) [39], i.e. any adjacent two matrices in the susceptibility matrix (A.7) are asymptotically free. Moreover, by design the matrices are centered around their limiting normalized-traces (A.5). Hence, it immediately follows from the definition of asymptotic freeness that

image

In fact, following the analysis of [31] we can obtain a stronger result

image

Our first goal is to perform the average

image

where dP(O) stands for the Haar invariant measure of the orthogonal group O(N). To this end, we first invoke the representations of the Dirac  δfunctions in terms of its characteristic function and write the generating functional (20) of the form

image

Here, and throughout the sequel, c stands for a constant term—which will irrelevant for the analysis— to ensure the normalization property E[Z({l(t) = 0})] = 1. Moreover, we note, from the representation of the Gaussian density function in terms of the characteristic function, that

image

Then, by invoking (B.3) and (B.5) we write the averaged-generating functional as

image

image

In this subsection we will perform the disorder average, i.e. the last term of (B.6). To this end, we introduce the  N × 1 vector z .= u√N and the N × T matrices X and ˆX with the

image

So that we write

image

Thus, we have

image

Using the asymptotic Itzykson-Zuber integral formulation [40, 41] the expectation (B.9) can be expressed in terms of the so-called free cumulants of the spectral distribution of A as

image

where  cA,nstands for the nth order free cumulant of the spectral distribution of A and the constant term  ǫN → 0 as N → ∞. In particular,  cA,1 and cA,2are the mean and variance of the distribution, respectively, i.e.

image

In fact, the R-transform can be defined as a generating function of the free cumulants [32]

image

We will evaluate tr((Q − zz⊤)n) in terms of the order parameters

G .= X⊤ ˆX (B.13)C .= X⊤X (B.14)˜C .= ˆX⊤ ˆX (B.15)B .= z⊤X (B.16)˜B .= z⊤ ˆX (B.17)ζ .= z⊤z. (B.18)

Note that  Q = [ ˆX X][X ˆX]⊤. Hence, we have

image

By the cyclic invariance property of the trace operator, the traces tr((Q−zz⊤)n) do solely depend on the terms  {ζk} and {z⊤Qmz = [ ˜B B]Qm−1[B ˜B]⊤}. Hence, we can define

image

In particular, we have (see Appendix D)

image

fn(G, C, ˜C, B, ˜B, ζ) =2tr(Gn) + (−ζ)n + n

image

where for  X ∈ { ˜C, ˜B}we have for the last term

image

As usual [22, 19], this means that at the saddle-point values ˜C = 0 and ˜B = 0 the termSP(G, C, ˜C, B, ˜B, ζ) does not contribute to saddle–point equations.

image

We introduce the single-site (effective, more specifically) generating-functional

image

Here,  χ(t) .= E[m′ν(ρ(t − 1), y)]Zef where E[(·)]Zef represents the expectation of the argument with respect to the effective generating-functional  Zef. Then, we write

image

In the large N limit, we can perform the integration over  G, ˆG, C, ˆC, ˜C, ˆ˜C, B, ˆB, ˜B, ˆ˜B, ζ, ˆζwith the saddle point method. Doing so yields:

image

Furthermore, the solutions ˜C = 0 and ˜B = 0yield from (B.21) that ˆC = 0 and ˆB = 0,respectively. Moreover, we have

image

where RAstands for the R-transform of the spectral distribution of A, see (B.12). In these equations, we drop the contributions ∂ǫN∂X for X = {G, ˜C, ˜B, ζ}at the saddle point analysis, given that  ǫN ≃ 0.

In summary, we have  E[Z({l(t)})] ≃ Zef({l(t)}) where we define

image

Here, for convenience, we have introduced

image

We next integrate out the variable u in (B.35). To this, we define

image

By using the Gaussian integration formula we get

image

We also re-represent the temporal couplings of  {ˆρ(t)}via the averages of appropriate Gaussian fields. Doing so, we finally obtain

image

Here, we have defined  Cφ .= 2 ˆ˜C and

image

We next bypass the need for u in representing the order parameters  ζ and B. This will be possible by propagating  u2 and uthrough the derivative of the integral in (B.38) with respect to ˜RA and θ, respectively. Then, it is easy to show that

image

In particular, from (B.42) we write the dynamical order parameters of the effective generating functional (B.39) as

image

Finally, notice that the response function  G(t, s) = E[−iγ(t)ˆρ(s)]Zefcan be written as

image

image

We next show the memory-freeness property G = 0 which also implies ˆG = 0. Note that this leads (B.39) to

image

where  f ′χ(ρ, y) stands for the derivative  ∂fχ(ρ,y)∂ρ. Specifically, note from (B.45), that

image

Hence, we have (see (B.39))

image

The trivial solution  G = ˆG = 0fulfills this equation. This solution is unique given the fact that from the definitions of ˆG and G, one can show that these quantities are uniquely defined recursively in time as expectations over the stochastic process  {φ(t)}.

image

We next show that the solution

image

solves both (B.41) and (B.40). To show this, we will repeatedly use the R-transform

formulation for the inverse of matrices [42]:

image

First, we invoke the solution  q = RK(0) into (B.40) as 1 RK(0) = RK−1(−RK(0)) (B.52)

image

Then, we write everything in terms of RK(0) as

image

which implies that

image

Indeed, from (B.40) this is equivalent to (B.41).

image

We invoke the result G = 0 in (B.43) and get

image

Here, we get (B.62) by using the relations (see (B.40) and (B.59), respectively)

image

Second, as regards the expression (B.44), given the fact that G = 0 we note that

image

Hence, we get from (B.44)

image

image

It follows from the recursive expression (28) that the variances  τφ(t + 1) .= C(t + 1, s + 1)do not depend on  Cφ(t, s) for t ̸= sbut solely on the previous variances  τφ(t). We also note that

image

Using the equations (28) and (C.1) in the expression (31) we write

∆(t, s) = −2Cφ(t, s) + τφ(t) + τφ(s) + q(κ(t) − κ(s))2 (C.2)

image

for an appropriately defined function  h∥. We next define the function  gt,s(x) explicitly by using the representation of the Gaussian density in terms of the characteristic function as

image

So that we have (for  t ̸= s)

image

Hence, we get

image

Thus, for sufficiently large t and s we can expand ∆(t + 1, s + 1) around 0 as

image

Then, we get

image

This completes the derivation of (33).

image

image

We have

image

where  Cφ(t, ∞) .= lims→∞ Cφ(t, s). Similar to (C.3), we then write the recursion for ∆(t, ∞)as

image

Here we have defined

image

In particular, we have the derivative

image

where the random variables  φ(t) and φare jointly Gaussian with zero mean and covariance Cφ(t, ∞), and independent of  θ.We are interested in the rate of the asymptotic decay

image

The rate is computed as

image

Then, using the result (C.14) we obtain the rate as

image

This completes the derivation of (34).

image

For convenience, let  Z .= zz⊤. Note that

image

Furthermore, since Z is rank-one, we have for  ζ .= z⊤z that

image

Moreover, by the cyclic invariance property of the trace we have

image

For example,

[ ˜B B]Q[B ˜B]⊤ = 2 ˜BGB⊤ + ˜BC ˜B⊤ + B ˜CB⊤ (D.5)= 2 ˜BGB⊤ + B ˜CB⊤ + SP(G, C, ˜C, B, ˜B, ζ) (D.6)

image

([ ˜B B]Q[B ˜B]⊤)2 = SP(G, C, ˜C, B, ˜B, ζ) (D.8)

Thereby, we get

image

Again by cyclic invariance we have

image

Moreover, one can show that

image

Putting everything together leads to (B.21).

[1] Gardner E 1988 Journal of physics A: Mathematical and general 21 257

[2] Watkin T L, Rau A and Biehl M 1993 Reviews of Modern Physics 65 499

[3] Opper M and Kinzel W 1996 Statistical mechanics of generalization Models of neural networks III (Springer) pp 151–209

[4] Nishimori H 2001 Statistical physics of spin glasses and information processing: an introduction 111 (Clarendon Press)

[5] Richardson T J, Shokrollahi M A and Urbanke R L 2001 IEEE Transactions on Information Theory 47 619–637

[6] Bolthausen E 2014 Communications in Mathematical Physics 325 333–366.

[7] Tanaka T and Okada M 2005 IEEE Transactions on Information Theory 51 700–706

[8] Bayati M and Montanari A 2011 IEEE Transactions on Information Theory 57 764– 785

[9] Krzakala F, M´ezard M, Sausset F, Sun Y and Zdeborov´a L 2012 Physical Review X 2 021005

[10] Bayati M, Lelarge M, Montanari A et al. 2015 The Annals of Applied Probability 25 753–822

[11] Kabashima Y 2003 Journal of Physics A: Mathematical and General 36 11111

[12] Donoho D L, Maleki A and Montanari A 2009 Proceedings of the National Academy of Sciences 106 18914–18919

[13] Rangan S 2011 Generalized approximate message passing for estimation with random linear mixing Proc. IEEE International Symposium on Information Theory (ISIT) (Saint-Petersburg, Russia)

[14] Opper M and Winther O 2001 Physical Review E 64 056131–(1–14)

[15] M´ezard M, Parisi G and Virasoro M 1987 Spin Glass Theory and Beyond vol 9 Lecture Notes in Physics (World Scientific)

[16] Minka T P 2001 Expectation propagation for approximate Bayesian inference Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence UAI ’01(San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.) pp 362–369

[17] Rangan S, Schniter P and Fletcher A K 2017 Vector approximate message passing 2017 IEEE International Symposium on Information Theory (ISIT) (Piscataway, NJ, USA: IEEE) pp 1588–1592

[18] Takeuchi K 2017 Rigorous dynamics of expectation-propagation-based signal recovery from unitarily invariant measurements 2017 IEEE International Symposium on Information Theory (ISIT) (Piscataway, NJ, USA: IEEE) pp 501–505

[19] Opper M, C¸akmak B and Winther O 2016 Journal of Physics A: Mathematical and Theoretical 49 114002

[20] C¸akmak B and Opper M 2019 Phys. Rev. E 99(6) 062140

[21] Martin P C, Siggia E D and Rose H A 1973 Physical Review A 8 423

[22] Eisfeller H and Opper M 1992 Physical Review Letters 68 2094

[23] Sompolinsky H, Crisanti A and Sommers H J 1988 Physical review letters 61 259

[24] Mimura K and Okada M 2014 IEEE Transactions on Information Theory 60 3645– 3670

[25] Neal R M 1997 arXiv preprint physics/9701026

[26] Collins B, Matsumoto S and Saad N 2014 Journal of Multivariate Analysis 126 1–13

[27] Opper M and Winther O 6 (2005): 2177-2204 Journal of Machine Learning Research

[28] C¸akmak B Random matrices for information processing–a democratic vision Ph.D. thesis Aalborg University

[29] Parisi G and Potters M 1995 Journal of Physics A: Mathematical and General 28 5267

[30] Maillard A, Foini L, Castellanos A L, Krzakala F, M´ezard M and Zdeborov´a L 2019 High-temperature expansions and message passing algorithms (Preprint 1906.08479)

[31] C¸akmak B and Opper M 2018 Expectation propagation for approximate inference: Free probability framework 2018 IEEE International Symposium on Information Theory (ISIT) (Piscataway, NJ, USA: IEEE) pp 1276–1280 ISSN 2157-8117

[32] Mingo J A and Speicher R 2017 Free probability and random matrices (Fields Institute Monographs vol 35) (Springer)

[33] Barbier J, Krzakala F, Macris N, Miolane L and Zdeborov´a L 2019 Proceedings of the National Academy of Sciences 116 5451–5460

[34] Kabashima Y 2008 Journal of Physics: Conference Series 95

[35] Fletcher A K, Rangan S and Schniter P 2018 Inference in deep networks in high dimensions 2018 IEEE International Symposium on Information Theory (ISIT) (Piscataway, NJ, USA: IEEE) pp 1884–1888

[36] Takeda K, Hatabu A and Kabashima Y 2007 Journal of Physics A: Mathematical and Theoretical 40 14085

[37] Anderson G W and Farrell B 2014 Advances in Mathematics 255 381 – 413

[38] Tramel E W, Gabri´e M, Manoel A, Caltagirone F and Krzakala F 2018 Phys. Rev. X 8(4) 041006

[39] Hiai F and Petz D 2006 The Semicirle Law, Free Random Variables and Entropy (American Mathematical Society)

[40] Collins B and ´Sniady P 2007 Annales de l’Institut Henri Poincare (B) Probability and Statistics 43 139 – 146

[41] Guionnet A and Ma¨ıda M 2005 Journal of Functional Analysis 222 435 – 490

[42] Muller R R, Guo D and Moustakas A L 2008 IEEE Journal on Selected Areas in Communications 26 530–540


Designed for Accessibility and to further Open Science