Analysis of Bayesian Inference Algorithms by the Dynamical Functional Approach

2020·Arxiv

Abstract

Abstract

We analyze the dynamics of an algorithm for approximate inference with large Gaussian latent variable models in a student-teacher scenario. To model nontrivial dependencies between the latent variables, we assume random covariance matrices drawn from rotation invariant ensembles. For the case of perfect data-model matching, the knowledge of static order parameters derived from the replica method allows us to obtain efficient algorithmic updates in terms of matrix-vector multiplications with a fixed matrix. Using the dynamical functional approach, we obtain an exact effective stochastic process in the thermodynamic limit for a single node. From this, we obtain closed-form expressions for the rate of the convergence. Analytical results are excellent agreement with simulations of single instances of large models.

Keywords: Bayesian Inference, Iterative Algorithms, TAP Equations, Random Matrices, Dynamical Functional Theory

1. Introduction

Tools of statistical mechanics have been extensively used in the late 1980’s and the early 1990’s to analyze the learning properties of large neural networks and related learning models [1, 2, 3, 4]. The powerful combination of statistical ensembles of learning machines with the application of the replica approach has allowed for the exact computation of average case learning performance. Remarkably, this could be achieved without having to deal with the details of concrete learning algorithms. Unfortunately, since much of the research was somewhat disconnected from practical existing machine learning approaches, this seemingly advantage has also led to a decline of the research activities in the field in the later 1990s.

More recently, the situation has again changed considerably. The establishment of relations between advanced mean field approximations (related to the so-called cavity method) and message passing algorithms for graphical statistical models has led to novel interest in the interdisciplinary area of statistical mechanics of learning and inference. Message passing algorithms were not only found (under certain conditions) to achieve optimal performance for large systems, but could also be analyzed dynamically by density evolution techniques [5, 6]. While most of this research concentrated originally on sparse networks [5], there was a growing interest on studying inference with networks of densely coupled probabilistic units [7, 8, 9, 10]. By taking formally the large density limit in the theory of message passing one obtains so-called AMP (approximate message passing) algorithms [11, 12, 13] which have been applied to a great variety of models. Fixed points of the AMP algorithm were found to coincide with the solutions of corresponding TAP (Thouless-Anderson-Palmer) mean field equations for statistical averages of the stochastic nodes. This heuristic approach might be criticized because certain independence assumptions made in the cavity approach may not be valid for dense networks. Nevertheless, exactness of the final results could be established for certain simple distributions of random network couplings. To go beyond such simple distributions, more complex ensembles allowing for dependencies between couplings in dense systems could be treated within an adaptive TAP approach [14] motivated by earlier work on spin glasses [15]. This research had (in parts) led to the development of the expectation propagation (EP) algorithms [16] in the field of machine learning. Assuming that matrices of network couplings are realizations of rotation invariant ensembles, prediction properties of the EPstyle VAMP (vector-AMP) algorithms could be analyzed rigorously by a density evolution method using methods of random matrix theory [17, 18]. The density evolution approach so far deals mainly with the computation of the temporal development of predictions errors which are derived from equal time marginals of the distribution of trajectories of an algorithm’s dynamics. It would be important to extend these results to multivariate statistical properties of trajectories which allow for more detailed computations of the algorithm’s performance including the convergence speed towards the fixed point.

In this paper, we will present such an approach. Based on our previous studies on the dynamics of solving TAP equations for Ising spin systems [19, 20], we introduce a VAMP-style algorithm for inference in Gaussian latent probabilistic models. These are important statistical data models with applications to classification and regression. We use the method of dynamical functional theory (DFT) to study the average case properties of algorithms in the large system limit. DFT is a statistical mechanics tool for analyzing dynamical systems [21] based on partition functions over trajectories. From these, effective distributions of entire trajectories for a single node can be obtained. In the past, the method was mainly applied to the dynamics of spin-glasses [22] and neural networks [23] with random independent couplings. We also refer to the study [24] where the DFT was used to analyze AMP-style algorithms (in the context of communication theory) with again random independent couplings assumption.

The novel contributions presented in this paper are twofold: On a technical level we extend the DFT approach of [20] to a combination of a teacher-student scenario for the generation of data together with the assumption of arbitrary rotation invariant random matrix couplings. From a more practical machine learning point of view, our new algorithm is designed for the case, where the class of data generating models is assumed to be known up to its parameters. This means we neglect model mismatch. In this case, we can use the knowledge of equilibrium order parameters given by the replica method to a obtain a simplified algorithm with a significant reduction of computational complexity.

The paper is organized as follows: In Section 2 we introduce the probabilistic model considered for the Bayesian inference. Section 3 provides a brief presentation on the TAP equations. In Section 4 we present our new algorithm for solving the TAP equations and in Section 5 we study its thermodynamic properties using the method of DFT. Comparisons of the theory with simulations are given in Section 6. Section 7 presents a summary and outlook. The derivations of our results are located at the Appendix.

2. Models with Gaussian latent variables

We consider the problem of approximate Bayesian inference for posterior distributions over the Gaussian latent vector of the type

where Z is a normalization constant. This model assumes that the components of the vector y of N real data values are generated independently from a likelihood on a vector of unknown parameters . Prior statistical knowledge about is introduced by the correlated Gaussian with covariance A well known example for (1) is the problem of Bayesian learning of a noisy perceptron—also known as probit regression [25]—for which we assume binary class labels 1. For more details on the standard interpretation of the noisy perceptron, see section 6.

Our concern is to design and analyze an iterative algorithm that (approximately) computes the vector of posterior means

in the thermodynamic limit of large N under some statistical assumptions. For simplicity, we assume data-model matching, ie. the probabilistic model (1) describes the generation of the dataset {y, K} correctly. Moreover, in order to allow for some nontrivial dependencies between matrix elements , we assume that K is drawn from a rotation invariant matrix ensemble, i.e. have the same probability distributions for any orthogonal matrix V independent of K. Equivalently, we have the spectral decomposition

where O is a Haar (orthogonal) matrix that is independent of a diagonal matrix D [26]. For the perceptron model, the “classic” assumption of independent components for inputs leads to a Wishart distribution for K which is rotational invariant. For more complex ensembles, see [20].

3. The TAP Equations

The TAP approach [15],[14]—related to expectation consistent inference approximations in machine learning [27]—typically provides highly accurate approximations for probabilistic inference. In our context, the TAP equations are the fixed-point equations for approximate posterior means m that are given by

Here, we denote the empirical average of an . Moreover, given the auxiliary single-site partition function

we introduce the nonlinear functions

The only dependency the TAP equations of the random matrix ensemble is via the function R() which is the so-called R–transform. Its definition will be given later in equation (8). The TAP equations generalize the naive mean field approximation, which neglects statistical dependencies between the components of ). Since the diagonal entries of rotation invariant random matrices are asymptotically self averaging (e.g. [28, Theorem 2.1],[19]), a short computation shows that the naive mean field theory corresponds to the approximation R(). The (Onsager-) correction to the mean naive field approximation in terms of the R-transform was originally derived in [29] for Ising spin-glasses using a free energy approach, see also [30] for a comprehensive exposition in this approach. An alternative derivation was given in [14] using the cavity method. We also refer the reader to [31] for a rigorous approach on the self averaging assumptions made for cavity field variances in the latter approach.

The function R() stands for the R–transform of the spectral distribution of

which is defined as [32]

Here, we have defined

and Gis the functional inverse (which is well-defined on (0)) of the Green-function

The TAP method is known to be consistent with the replica-symmetric (RS) ansatz when an average over data y and couplings K is considered. The validity of RS is commonly assumed to be asymptotically exact in the case of . For the case of data-model matching, one can show that (see, e.g. the study of [34]), the RS ansatz leads to the asymptotic () solution for in (4c) as

Here, the expectation is taken with respect to the probability distribution

with ) denoting the Gaussian density function with mean and variance corresponds to the asymptotic marginal distribution for the components of (

valid for a large system. Our general idea is to design an iterative algorithm for solving (4) in terms of an iteration of a vector of auxiliary variables denoting the discrete time index of the iteration) such that the “effective” distributions of the marginals ()) mimic the static distribution ), specifically

for some dynamical order parameters ) converging both to

4. The iterative algorithm

In recent years, there has been considerable interest in analyzing EP-style [16] iterative algorithms [17, 18, 35]—commonly referred as the method of VAMP. From a computational complexity point of view, such algorithms require the computation of products of matrices at every iteration steps (for details, see Appendix A.1) which can be problematic for large N and large times. Here, we propose a simplified algorithm which utilities the result of the RS ansatz (11) and therefore assumes that the assumed data generating process is correct.

Here, for short, we have introduced the scalar function

This resembles the update of a single layer recurrent neural network: A vector of nodes 1) is passed point wise through a nonlinear function . The resulting vector is multiplied by a symmetric “coupling” matrix A before the process is repeated. Note that the iterations (15) are solely based on matrix vector multiplications and evaluations of scalar nonlinear functions.

Before iteration starts the iterative algorithm requires the elements obtained as follows: We first compute the spectral decomposition

where the vector d contains the eigenvalues of K. Then, we obtain the parameters and by solving the fixed–point equations

Finally, the matrix A is computed as

It is easy to see that the fixed points of ) coincide with the solution of the TAP equations (4) for given that the empirical average (4c) is substituted by the RS result (11). In fact, we will re-derive the RS result from the DFT analysis of the iterative algorithm.

5. The results of DFT

In this section, we will analyze the dynamical properties of the iterative algorithm (15) using the method of the DFT. To this end, we introduce the moment generating functional for the trajectory of

where for the sake of compactness of notation we assume that the dynamical order parameter ) is self-averaging. Note, that we have added a single external field l(t) to node i. To obtain statistical averages we compute the averaged-generating functional E[Z({l(t)})] where the expectation is taken over the random elements

From this, for example, we could compute

where the identity (22) follows from the fact that probability distribution of A is invariant under permutation. Using (23) we can quantify the averaged-normalized-square Euclidean distance between iterates of the algorithm at different times as

which will allow us to compute the convergence properties of the algorithm. We will defer the explicit and lengthy computation of DFT to Appendix B. There we show how to integrate out the trajectories for all other model in the limit . The result is

where ) stands for a Gaussian distribution function with mean and covariance Hence, we have obtained an effective stochastic process for the dynamics of single, arbitrary component ) of the vector

The order parameter ) and the two-time covariance matrix ) (which is denoting the (t, s)th indexed entries of ) of the Gaussian process are computed by the recursions

Here, stands for the variance of the spectral distribution of A and for short we have introduced the random dynamics

The relative simplicity of these equations may come as a surprise to readers familiar with the results of DFT obtained for spin-glass dynamics and neural networks. In contrast to models with non symmetric random matrices (see e.g [23]) the symmetric matrix case is usually plagued with memory terms in the effective single node stochastic processes. These usually make explicit analytical computations of single time marginals a hard task. As shown in Appendix A.2 the absence of such memory terms can be explained by the vanishing of the two-time response functions which in turn can be understood by wellknown concepts of random matrix theory.

5.1. Convergence rate of the algorithm

To study the large time behavior of the single node dynamics we define the deviation between the dynamical variables at different times (

where ) stands for the two-time covariance of the zero-mean Gaussian process We have defined the rate of convergence as

Assuming convergence, that is ∆(, we get the explicit rate as

as (see [20, Eq.(38)])

Then, we re-write (33) in the form

Thus, the necessary condition for convergence 1 holds if and only if

Following the arguments for [36, Eq.(18)] one can conclude that equation (37) coincides with the stability condition of the RS ansatz - known as de Almeida Thouless (AT) criterion. Since for the case of data-model matching, RS solution is usually locally stable, we can expect that local convergence broadly fulfilled.

5.2. Asymptotic consistency with the RS ansatz

Let ). Then, we have

Hence, if ) converge, they both converge to Thus, the stationary distribution of the stochastic process (26) yields the replica-symmetric solution, i.e.

6. Simulation results

Perceptrons are single layer neural networks that are parameterized by a vector of weights, say . We consider the binary classification problems from a training set that is given by stands for a vector of inputs and a binary label 1 for classification. Specifically, we have the observation model for y as

The Gaussian latent vector ) and the noise vector ) are generated independently. So that, we set denoting the cumulative distribution function of standard normal distribution.

We illustrate our theoretical results through the following random matrix models for the data matrix:

(ii) projection matrix with ˜Hadamard matrix [37] as

Here Z is a uniformly distributed random signed permutation matrix, specifically 1 being independent binary random variables with equal probabilities and is a random permutation. Furthermore, Hadamard matrix which is deterministically constructed from the recursion

Note, that in the latter model, X is a column-orthogonal matrix (i.e. the is not rotation invariant in this case, motivated with the study [37] we expect that our theoretical results might still yield quite accurate approximations.

Figure 1 refers to random matrix model (i) where in Figure1-(a) and (b) we illustrate the discrepancy between theory and simulations for the two-time covariance respect to the two-time relative-square-error

and the analytical convergence rate of the algorithm, respectively. Similarly, Figure 2 refers to the random matrix model (ii). Simulation results are based on single instances of large random matrices and data.

Figure 1: Random matrix model (i): The model parameters are chosen as . (a) Discrepancy between theory and simulations for the two-time covariance with the (t, s) indexed segment representing the relative-squared-error in dB, i.e. ). (b) Asymptotic of the algorithm (where the flat line around 10are the consequence of the machine precision of the computer which was used).

7. Summary and Outlook

We have presented the analysis of a Bayesian inference algorithm for latent Gaussian models in the large systems limit. We have based our approach on a statistical mechanics path integral technique, dynamical functional theory (DFT), which has been used before to treat spin-glass models and neural networks with random interactions. We have generalized the method to allow for more complex settings of the quenched randomness, allowing for a combination of a teacher-student scenario together with rotation invariant ensembles of coupling matrices. While related inference algorithms had been treated before using rigorous random matrix techniques, we were able, for the first time, to obtain the complete effective marginal stochastic process of single nodes. Although the computations turned out to be somewhat involved, the final results turned out to be surprisingly simple: The

Figure 2: Random matrix model (ii): The model parameters are chosen as . (a) Discrepancy between theory and simulations for the two-time field covariance with (t, s) indexed segment representing the relative-squared-error in dB, i.e. ). (b) Asymptotic of the algorithm.

statistics of the effective field entering the nonlinear function of the algorithm was found to be a Gaussian process, for which the covariance function could be obtained iteratively. This result lacked the complications of the non-Gaussian fields caused by memory terms in the effective dynamics which were typically present in spin-glass dynamics. We could trace the origins of this simplification by utilizing concepts of random matrix theory. Based on our general result, we were able to compute exact convergence rates of the algorithm. The statistics of the algorithm’s fixed points coincides with that predicted by the replica theory. Within our scenario of data-model matching, the algorithm is found (at least) to be locally convergent. Simulations on single, large systems showed excellent agreement with the theory, From a more practical point of view, the restriction to the matching case allowed us to define an inference algorithm in terms of more efficient iterations (compared to previous VAMP algorithms) by using only a single fixed matrix.

We expect that our approach can be extended in various directions. The DFT method is general enough to treat the non-matching case as well, including the settings of VAMP algorithms. It will also be interesting to extend the method to other more complex probabilistic models of neural network type such as the restricted Boltzmann machines [38]. This however, would require different types of integrals over random matrices, which have to be incorporated into the DFT formalism. A further interesting generalization of DFT would be to random sequential updates in algorithms, where at each iteration step only a single, randomly selected node (or a small mini batch of nodes) are used in the update. It will be interesting to see, if a properly adapted DFT method would lead to tractable computations for this more realistic scenario.

The statistical mechanics techniques utilized in our DFT approach might present of course certain limitations of applicability to real world machine learning problems. The study of general rotation invariant ensembles of matrices is a nontrivial improvement over older models that were based on simpler independence assumptions. It allows us to deal with matrices of (almost) arbitrary eigenvalue distributions, but restricts the corresponding eigenvectors as irrelevant, simply pointing in random directions. It will important to find out for which types of real data such technical assumptions might be reasonable enough to draw relevant conclusions from the theory. The excellent agreement between theory and simulations for the case of a randomly–signed Hadamard matrix (which contains much less randomness compared to a the rotation invariant case, see section 6) gives us some hope that our method could be applicable to a broader range of problems.

Acknowledgment

This work was supported by the German Research Foundation, Deutsche Forschungsgemeinschaft (DFG), under Grant “RAMABIM” with No. OP 45/9-1.

In our context, the VAMP algorithm [17, 18] coincides with the iterative process

From the computational complexity point of view, the essential difference between our proposed iterations (15) from the VAMP iterations (A.1) is that at every iteration step the former requires a matrix-vector multiplication while the latter requires matrix-matrix multiplications (see (A.1e)).

The striking property of the algorithm (15) is that it has the memory-free property

This property makes the analysis of the algorithm relatively simple. To have a better insight on designing memory-free algorithms, we will next introduce a generic dynamical construction and present an intuitive derivation of its memory-free property by means of the concept of asymptotic freeness of random matrices [39, 32].

Specifically, consider the following generic iterative process

where is a sequence of scalar function. Here, we suppose that A(t) is rotation invariant (for all t) and also for the diagonal matrices

we suppose that

Here, for an we denote its limiting normalized-trace by

Remark that both our proposed algorithm (15) and the VAMP algorithm (A.1) are in the family of this generic dynamical construction. We are interested the analyzing the diagonal elements of the dynamical susceptibility

In the large-system limit, this product of matrices can be simplified by the concept of asymptotic freeness of random matrices. Specifically, for the two families of matrices, say ) stand for (noncommutative) polynomials of the matrices in A and the matrices in E, respectively. Then, we say the families A and E are asymptotically free if for all ] and for all polynomials ) we have [39]

In other words, the limiting normalized-trace of any adjacent product of powers of matrices – which belong to different free families and are centered around their limiting normalized-traces – vanishes asymptotically.

The matrices in (A.7) belong to two families: rotation invariant and diagonal. Under certain technical conditions, these two matrix families can be treated as asymptotically free (in the almost sure sense) [39], i.e. any adjacent two matrices in the susceptibility matrix (A.7) are asymptotically free. Moreover, by design the matrices are centered around their limiting normalized-traces (A.5). Hence, it immediately follows from the definition of asymptotic freeness that

In fact, following the analysis of [31] we can obtain a stronger result

Our first goal is to perform the average

where dP(O) stands for the Haar invariant measure of the orthogonal group O(N). To this end, we first invoke the representations of the Dirac functions in terms of its characteristic function and write the generating functional (20) of the form

Here, and throughout the sequel, c stands for a constant term—which will irrelevant for the analysis— to ensure the normalization property E[Z({l(t) = 0})] = 1. Moreover, we note, from the representation of the Gaussian density function in terms of the characteristic function, that

Then, by invoking (B.3) and (B.5) we write the averaged-generating functional as

In this subsection we will perform the disorder average, i.e. the last term of (B.6). To this end, we introduce the

So that we write

Thus, we have

Using the asymptotic Itzykson-Zuber integral formulation [40, 41] the expectation (B.9) can be expressed in terms of the so-called free cumulants of the spectral distribution of A as

where stands for the nth order free cumulant of the spectral distribution of A and the constant term . In particular, are the mean and variance of the distribution, respectively, i.e.

In fact, the R-transform can be defined as a generating function of the free cumulants [32]

We will evaluate tr(() in terms of the order parameters

˜˜

Note that . Hence, we have

By the cyclic invariance property of the trace operator, the traces tr(() do solely depend on the terms . Hence, we can define

In particular, we have (see Appendix D)

where for we have for the last term

As usual [22, 19], this means that at the saddle-point values ˜SP() does not contribute to saddle–point equations.

We introduce the single-site (effective, more specifically) generating-functional

Here, represents the expectation of the argument with respect to the effective generating-functional . Then, we write

In the large N limit, we can perform the integration over with the saddle point method. Doing so yields:

Furthermore, the solutions ˜yield from (B.21) that ˆrespectively. Moreover, we have

where Rstands for the R-transform of the spectral distribution of A, see (B.12). In these equations, we drop the contributions at the saddle point analysis, given that

In summary, we have ) where we define

Here, for convenience, we have introduced

We next integrate out the variable u in (B.35). To this, we define

By using the Gaussian integration formula we get

We also re-represent the temporal couplings of via the averages of appropriate Gaussian fields. Doing so, we finally obtain

Here, we have defined

We next bypass the need for u in representing the order parameters . This will be possible by propagating through the derivative of the integral in (B.38) with respect to ˜R, respectively. Then, it is easy to show that

In particular, from (B.42) we write the dynamical order parameters of the effective generating functional (B.39) as

Finally, notice that the response function can be written as

We next show the memory-freeness property G = 0 which also implies ˆG = 0. Note that this leads (B.39) to

where ) stands for the derivative . Specifically, note from (B.45), that

Hence, we have (see (B.39))

The trivial solution fulfills this equation. This solution is unique given the fact that from the definitions of ˆG and G, one can show that these quantities are uniquely defined recursively in time as expectations over the stochastic process

We next show that the solution

solves both (B.41) and (B.40). To show this, we will repeatedly use the R-transform

formulation for the inverse of matrices [42]:

First, we invoke the solution (0) into (B.40) as 1 R

Then, we write everything in terms of R

which implies that

Indeed, from (B.40) this is equivalent to (B.41).

We invoke the result G = 0 in (B.43) and get

Here, we get (B.62) by using the relations (see (B.40) and (B.59), respectively)

Second, as regards the expression (B.44), given the fact that G = 0 we note that

Hence, we get from (B.44)

It follows from the recursive expression (28) that the variances do not depend on but solely on the previous variances ). We also note that

Using the equations (28) and (C.1) in the expression (31) we write

∆(

for an appropriately defined function . We next define the function ) explicitly by using the representation of the Gaussian density in terms of the characteristic function as

So that we have (for

Hence, we get

Thus, for sufficiently large t and s we can expand ∆(t + 1, s + 1) around 0 as

Then, we get

This completes the derivation of (33).

We have

where ). Similar to (C.3), we then write the recursion for ∆(as

Here we have defined

In particular, we have the derivative

where the random variables are jointly Gaussian with zero mean and covariance ), and independent of We are interested in the rate of the asymptotic decay

The rate is computed as

Then, using the result (C.14) we obtain the rate as

This completes the derivation of (34).

For convenience, let . Note that

Furthermore, since Z is rank-one, we have for

Moreover, by the cyclic invariance property of the trace we have

For example,

[ ˜= 2 ˜

([ ˜

Thereby, we get

Again by cyclic invariance we have

Moreover, one can show that

Putting everything together leads to (B.21).

References

[1] Gardner E 1988 Journal of physics A: Mathematical and general 21 257

[2] Watkin T L, Rau A and Biehl M 1993 Reviews of Modern Physics 65 499

[3] Opper M and Kinzel W 1996 Statistical mechanics of generalization Models of neural networks III (Springer) pp 151–209

[4] Nishimori H 2001 Statistical physics of spin glasses and information processing: an introduction 111 (Clarendon Press)

[5] Richardson T J, Shokrollahi M A and Urbanke R L 2001 IEEE Transactions on Information Theory 47 619–637

[6] Bolthausen E 2014 Communications in Mathematical Physics 325 333–366.

[7] Tanaka T and Okada M 2005 IEEE Transactions on Information Theory 51 700–706

[8] Bayati M and Montanari A 2011 IEEE Transactions on Information Theory 57 764– 785

[9] Krzakala F, M´ezard M, Sausset F, Sun Y and Zdeborov´a L 2012 Physical Review X 2 021005

[10] Bayati M, Lelarge M, Montanari A et al. 2015 The Annals of Applied Probability 25 753–822

[11] Kabashima Y 2003 Journal of Physics A: Mathematical and General 36 11111

[12] Donoho D L, Maleki A and Montanari A 2009 Proceedings of the National Academy of Sciences 106 18914–18919

[13] Rangan S 2011 Generalized approximate message passing for estimation with random linear mixing Proc. IEEE International Symposium on Information Theory (ISIT) (Saint-Petersburg, Russia)

[14] Opper M and Winther O 2001 Physical Review E 64 056131–(1–14)

[15] M´ezard M, Parisi G and Virasoro M 1987 Spin Glass Theory and Beyond vol 9 Lecture Notes in Physics (World Scientific)

[16] Minka T P 2001 Expectation propagation for approximate Bayesian inference (San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.) pp 362–369

[17] Rangan S, Schniter P and Fletcher A K 2017 Vector approximate message passing 2017 IEEE International Symposium on Information Theory (ISIT) (Piscataway, NJ, USA: IEEE) pp 1588–1592

[18] Takeuchi K 2017 Rigorous dynamics of expectation-propagation-based signal recovery from unitarily invariant measurements 2017 IEEE International Symposium on Information Theory (ISIT) (Piscataway, NJ, USA: IEEE) pp 501–505

[19] Opper M, C¸akmak B and Winther O 2016 Journal of Physics A: Mathematical and Theoretical 49 114002

[20] C¸akmak B and Opper M 2019 Phys. Rev. E 99(6) 062140

[21] Martin P C, Siggia E D and Rose H A 1973 Physical Review A 8 423

[22] Eisfeller H and Opper M 1992 Physical Review Letters 68 2094

[23] Sompolinsky H, Crisanti A and Sommers H J 1988 Physical review letters 61 259

[24] Mimura K and Okada M 2014 IEEE Transactions on Information Theory 60 3645– 3670

[25] Neal R M 1997 arXiv preprint physics/9701026

[26] Collins B, Matsumoto S and Saad N 2014 Journal of Multivariate Analysis 126 1–13

[27] Opper M and Winther O 6 (2005): 2177-2204 Journal of Machine Learning Research

[28] C¸akmak B Random matrices for information processing–a democratic vision Ph.D. thesis Aalborg University

[29] Parisi G and Potters M 1995 Journal of Physics A: Mathematical and General 28 5267

[30] Maillard A, Foini L, Castellanos A L, Krzakala F, M´ezard M and Zdeborov´a L 2019 High-temperature expansions and message passing algorithms (Preprint 1906.08479)

[31] C¸akmak B and Opper M 2018 Expectation propagation for approximate inference: Free probability framework 2018 IEEE International Symposium on Information Theory (ISIT) (Piscataway, NJ, USA: IEEE) pp 1276–1280 ISSN 2157-8117

[32] Mingo J A and Speicher R 2017 Free probability and random matrices (Fields Institute Monographs vol 35) (Springer)

[33] Barbier J, Krzakala F, Macris N, Miolane L and Zdeborov´a L 2019 Proceedings of the National Academy of Sciences 116 5451–5460

[34] Kabashima Y 2008 Journal of Physics: Conference Series 95

[35] Fletcher A K, Rangan S and Schniter P 2018 Inference in deep networks in high dimensions 2018 IEEE International Symposium on Information Theory (ISIT) (Piscataway, NJ, USA: IEEE) pp 1884–1888

[36] Takeda K, Hatabu A and Kabashima Y 2007 Journal of Physics A: Mathematical and Theoretical 40 14085

[37] Anderson G W and Farrell B 2014 Advances in Mathematics 255 381 – 413

[38] Tramel E W, Gabri´e M, Manoel A, Caltagirone F and Krzakala F 2018 Phys. Rev. X 8(4) 041006

[39] Hiai F and Petz D 2006 The Semicirle Law, Free Random Variables and Entropy (American Mathematical Society)

[40] Collins B and ´Sniady P 2007 Annales de l’Institut Henri Poincare (B) Probability and Statistics 43 139 – 146

[41] Guionnet A and Ma¨ıda M 2005 Journal of Functional Analysis 222 435 – 490

[42] Muller R R, Guo D and Moustakas A L 2008 IEEE Journal on Selected Areas in Communications 26 530–540

designed for accessibility and to further open science