b

DiscoverSearch
About
My stuff
Optimal short-term memory before the edge of chaos in driven random recurrent networks
2019·arXiv
Abstract
Abstract

The ability of discrete-time nonlinear recurrent neural networks to store time-varying small input signals is investigated by mean-field theory. The combination of a small input strength and mean-field assumptions makes it possible to derive an approximate expression for the conditional probability density of the state of a neuron given a past input signal. From this conditional probability density, we can analytically calculate short-term memory measures, such as memory capacity, mutual information, and Fisher information, and determine the relationships among these measures, which have not been clarified to date to the best of our knowledge. We show that the network contribution of these short-term memory measures peaks before the edge of chaos, where the dynamics of input-driven networks is stable but corresponding systems without input signals are unstable.

Natural and artificial high-dimensional nonlinear dynamical systems can be used as resources for real-time computing. By nonlinearly mapping time-varying input signals into a high-dimensional space, the signals can be learned in a supervised manner if the dynamical systems have enough ability to store the signals in their present state and separate different signals [1, 2]. A high computational performance can be achieved by tuning only the weights of linear connections to the output layer while keeping the parameters of the dynamical systems fixed [3–6]. Such dynamical systems called reservoirs can be artificial recurrent neural networks (RNNs) or physical systems, such as optical media [7, 8], nanoscale magnetization dynamics [9, 10], soft materials [11], and quantum systems [12].

As mentioned above, a requirement for real-time computing is the ability to memorize past input signals. Such short-term memory of dynamical systems has been studied extensively by assessing a quantity called memory capacity [13, 14]. For input-driven RNNs, it has been suggested that the part of memory capacity representing indirect memory through network takes a maximum value near the edge of chaos, namely, near the critical boundary between the stable and unstable dynamical regimes [15, 16]. Near criticality, different inputs are expected to lead to different states while suppressing the influence of the initial conditions. Hence, it seems reasonable for a dynamical system to be near the critical point for optimal memory capacity. However, it has also been pointed out that the dependence on network parameters is not straightforward based on a systematic numerical simulation [17].

For linear RNNs, detailed analytic studies of memory capacity can be performed for both discrete-time [18, 19] and continuous-time systems [20]. The ability to predict future inputs, which is complementary to memory capacity, has also been studied in linear systems with correlated input signals [21]. However, the memory capacity of nonlinear RNNs is difficult to study by analytical methods [22]. Recently, Schuecker et al. [23] successfully derived an analytical expression for memory capacity for continuous-time nonlinear RNNs [24] in which each neuron is driven by independent input signals following a white-noise Gaussian process. Toyoizumi and Abbott [25] analytically calculated the signal-to-noise ratio, which is equivalent to the inverse of memory capacity at the limit of zero input strength, for discrete-time nonlinear RNNs driven by a common time-varying input signal.

In this paper, we analytically investigate the memory capacity of discrete-time nonlinear RNNs called echo state networks (ESNs) [1] by a mean-field theory when the strength of input signals is small but non-zero. The main idea of our approach is that the conditional probability density of the present state of a neuron given a past input signal can be approximately calculated from a functional derivative with respect to past input signals under the assumption of a small input strength. Once we obtain this conditional probability density, it is straightforward to derive the memory capacity and other alternative memory measures, such as mutual information and Fisher information [22]. We show that all three measures of short-term memory through network behave similarly and take a maximum value before the edge of chaos, where the dynamics is stable in the presence of input signals but unstable in the absence of input signals. We also discuss the breakdown of the mean-field theory for calculating memory measures in the ordered regime and show that the linear approximation provides good predictions.

A. Echo State Networks

We consider ESNs consisting of N artificial neurons. The state of neuron i at discrete time step t is denoted xi(t). We assume that the time evolution of state  xi(t) is governed by

image

where f is an activation function.  ai(t) is the activation potential of neuron i at time step t given by

image

where s(t) is a time-dependent input signal,  wijis a time-independent weight of the connection from neuron j to neuron i, and  uiis a time-independent weight representing the strength of the coupling from the input signal to neuron i. We use the matrix and vector notations  W := (wij)1≤i,j≤N, u := (u1, u2, . . . , uN)T, and x(t) := (x1(t), x2(t), . . . , xN(t))T.

In the following analytical calculations and numerical simulations, the activation function is assumed to be a sigmoid function satisfying lima→±∞ f(a) = ±1, f(−a) = −aand  f ′(0) = 1.In particular, we adopt

image

are chosen independently at random from an identical Gaussian distribution with mean zero and variance  g2/N, where  g2 >0 is a control parameter. For simplicity,  uiare assumed to be independent variables taking  ±1 with probability 12. Since our primary concern is the memory capacity of ESNs, we consider an independent and identically distributed Gaussian input signal s(t) with mean zero and variance  s2. All numerical results in this paper were obtained in the following way unless otherwise stated. We simulated ESNs with N = 1000 artificial neurons over 40 trials. For a single trial, each quantity (for example, stationary variance of  xi(t)) was calculated from its values over 105time steps after discarding the initial 104time steps. Then, averages were obtained over all artificial neurons and all trials. All infinite sums appearing in the following sections were evaluated by truncation at the 500-th term.

The mean-field theory of ESNs [25–28] makes it possible to calculate the stationary variance of  xi(t) and the largest Lyapunov exponent in the limit  N → ∞. It assumes that  xi(t) are independent and identically distributed random variables. They are also assumed to be independent of  wijand  ui. This assumption can be

image

FIG. 1. Numerical results (marks) and mean-field predic- tions (solid lines) for the stationary variance of  xi(t) are shown as functions of  g2 for s2 = 0.01 (red), s2 = 0.02 (green), and s2 = 0.04 (blue). Vertical broken lines indicate the mean-field predictions of the critical value of  g2 obtained from Eq. (4) for s2 = 0 (g2 = 1, black), s2 = 0.01 (g2 ≈ 1.39, red), s2 = 0.02(g2 ≈ 1.50, green), and  s2 = 0.04 (g2 ≈ 1.64, blue).

justified in the limit  N → ∞when there is no input signal. Since f is odd, we can self-consistently assume that the mean of  xi(t) is equal to zero. Let  σ2(t) = ⟨xi(t)2⟩be the variance of  xi(t), where  ⟨· · · ⟩indicates the average over trials with the same  wijand  uibut possibly different realizations of the input signal s(t) and initial conditions. By the central limit theorem,  ai(t) follows a Gaussian distribution with mean zero and variance g2σ2(t)+s2+O(N − 12), where we use  u2i = 1. Neglectingthe  O(N − 12) term, the variance of  ai(t) does not depend on specific realizations of  wij[25]. In the following, we omit quantities that approach 0 as  N → ∞unless otherwise stated. Consequently,  σ2(t) follows the following recurrence equation [27]:

image

where Σ2(t) = g2σ2(t) +  s2. By numerically solving Eq. (3) with  σ2(t + 1) = σ2(t) = σ2, we can obtain the mean-field prediction of the stationary variance of  xi(t). We write Σ2 = g2σ2+  s2. These values of  σ2and Σ2have been used for plotting theoretical results later in the paper.

The largest Lyapunov exponent derived from the mean-field theory is [25, 27, 28]

image

Here,  λ >0 indicates that a small perturbation to a state of the system leads to exponential growth, while λ <0 implies that the perturbation eventually becomes undetectable. The dynamics is called chaotic or unstable in the former case and called ordered or stable in the latter. When an input signal is absent, the boundary between chaos and stability  λ = 0 corresponds to g2 = 1.The presence of input signals shifts the boundary towards the chaotic side [27, 28].

In Fig. 1, we confirm that the mean-field prediction of the stationary variance of  xi(t) agrees well with the result obtained by numerical simulations. We note that the difference is negligible, even for the input-driven regime g2 ≪1 where the mean-field assumption is expected to be violated. This will be explained when we discuss the breakdown of the mean-field theory in the calculation of memory capacity.

B. Memory Capacity

The memory capacity of an ESN is defined as the quality of the optimal linear estimator of the past input s(t − n) using the present state of neurons  xi(t). Following previous work [23, 25], we assume a sparse readout, namely, there are K = O(1) readout neurons 1  ≤ i ≤ Kand consider linear readout ˆs(t) = �Ki=1 vixi(t). Given time-delay n (n = 1, 2, . . .), the weights  viare determined by minimizing the mean squared error between s(t − n) and ˆs(t) over a sufficiently long time period T . The optimal mean squared error as a function of time-delay n is called the memory function and is given by [13, 14]

image

Mn =

image

where  dijis the (i, j)-th element of the matrix  C−1, which is the inverse of the matrix  C = (⟨xi(t)xj(t)⟩T )1≤i,j≤K, and  ⟨· · · ⟩Tindicates the time average over the period of length T . The memory capacity is defined as the sum of

image

Mnover all time-delays n:

image

It is known that  M ≤ Kholds [13, 14].

To compute  Mnby the mean-field theory, we replace the time average  ⟨· · · ⟩Tfor  T → ∞by the average over trials  ⟨· · · ⟩in stationary states. Since  ⟨xi(t)xj(t)⟩for i ̸= jvanishes as  N → ∞, the contribution of the off-diagonal terms of  C−1to Eq. (5) can be neglected for N → ∞by the sparse readout assumption K = O(1). Thus, in the mean-field calculation, Eq. (5) is just K times  Mnfor K = 1. Since

image

when K = 1, the task to obtain M reduces to calculating ⟨xi(t)s(t − n)⟩for n = 1, 2, . . .. In the following, we perform the calculation by assuming that the strength of the input signal is small, namely,  s2 ≪1.

The main idea to calculate  ⟨xi(t)s(t − n)⟩is that conditioning of  ai(t) on  s(t − n) (n = 0, 1, 2, . . .) can be regarded as a small perturbation to dependence of  ai(t) on  s(t − n), when  s2 ≪1. We assume that  xi(t) are independent and identically distributed and are also independent of  wijand  uieven after conditioning. By the central limit theorem,  ai(t) conditioned on  s(t − n) = cfollows a Gaussian distribution in the limit of large N. Hence, it is sufficient to calculate its mean  µn,c,iand variance Σ2n,ito determine the conditional probability density  Pai(t)|s(t−n)(a|c).

First, we calculate the mean of  ai(t) given  s(t − n) =c. We regard  ai(t) as a functional of stochastic variables  s(t), s(t −1), . . . , s(t − n), x(t − n). We write ai(t) = Fi [s0:n(t), x(t − n)], where  s0:n(t) = (s(t), s(t −1), . . . , s(t − n)). We consider a norm defined by the average over trials  ⟨· · · ⟩ (∥X∥ :=�⟨X2⟩for a stochastic variable X). Conditioning of  ai(t) on  s(t − n) = ccorresponds to replacing argument  s(t − n) with the constant stochastic variable c. If  c2, s2 ≪1, then  ∥c−s(t−n)∥2 =⟨c−s(t−n)⟩2 = c2+s2 ≪1. Thus,  ai(t) given  s(t−n) = ccan be approximated by the following:

Fi [s0:n−1(t), c, x(t  −n)]  ≃ Fi [s0:n(t), x(t  −n)] + δFi [s0:n(t), x(t  −n)]δs(t  −n) (c  −s(t  −n))

image

By taking the average over trials, we have

image

image

FIG. 2. Conditional probability density of the activation potential given a past input. Theoretical predictions and numerical results are compared for (a) the mean of  a1(t) given s(t − n) = 0.1 and (b) the conditional probability density  Pa1(t)|s(t−9)(a|c).We set  s2 = 0.01 and use single specific realizations of W and u.

where

image

Since f is an odd function, we have (Appendix B)

image

From Eqs. (10) and (12), the mean-field theory predicts

image

when  c2, s2 ≪1.

Second, the variance of  ai(t) given  s(t − n) = ccan be obtained as follows. The variance of  ai(t) can be expressed as

image

Thus, we have

image

Note that the population variance of (V nu)2itakes a nonzero finite value even in the limit of large N, as we will see in the linear case (Appendix E) when we discuss the breakdown of the mean-field theory in the ordered regime. This implies that the value of Σ2n,idepends on i or, equivalently, realizations of W and u, even after discarding the  O(N − 12) term. Another related remark is that Eq. (15) holds only in the limit of small  s2. Otherwise, the right-hand side may become negative even in the limit of large N, since (V nu)ifollows a Gaussian distribution with a variance of O(1) and thus can take an arbitrarily large value.

In Fig. 2, the mean of  a1(t) given  s(t − n) = 0.1 and the conditional probability density  Pa1(t)|s(t−9)(a|c) are shown for single specific realizations of W and u. Here, we set  s2 = 0.01. The numerical results are obtained by first generating a single orbit of length 106time steps after discarding the initial 104time steps and then sampling the value of  a1(t) with 0.1 ≤ s(t − n) < 0.11 for each n. The theoretical values for  µn,0.1,1and Σ2n,1are calculated from Eqs. (11), (13), and (15), where W and u are the same as those used in the numerical simulation. We can see that the numerical results and the theoretical predictions agree well.

Using Eqs. (13) and (15), we obtain (Appendix C)

image

for n = 1, 2, . . . . From Eqs. (7) and (16), we have

image

for n = 1, 2, . . . . The population average of  Mn, which is equivalent to the average over realizations of W and u, is (Appendix D)

image

where

image

The population average of M is

image

image

FIG. 3. Memory capacity and network memory capacity of ESNs. Theoretical predictions and numerical results are compared for the population averages of (a) memory function  Mn, (b) memory capacity M, and (c) network memory capacity  Mnet. (d)Mean-field lines for the effective measurement of the nonlinear response r (Eq. (19)). In (a), we set  s2 = 0.01. The vertical broken lines indicate the transition point between the ordered and chaotic regimes for the value of  s2 with the same color.

image

FIG. 4. Linear approximation of  M1. (a) Numerical values of  E [Mnet] for different system sizes N. (b) Comparison among numerical results, linear approximation, and mean-field theory for  E [M1]. We set s2 = 0.01 in both panels.

M can be decomposed into two parts [17, 23]: the direct memory  M1and the indirect memory through network Mnet := M −M1. We call the latter the network memory

capacity. The population average of the latter is

image

Figure 3 (a) shows the population average of  Mnfor

different values of  g2with  s2 = 0.01. The population averages of M and  Mnetare shown in Fig. 3 (b) and (c), respectively.  E [Mnet] peaks in the range 1  < g2 < g2∗, where  g2∗is the value of  g2such that  λ = 0. The exactlocation of the maximum point depends on the value of s2and shifts to a larger value as  s2increases. In the mean-field theory,  E [Mnet] is given as the product between E [M] and r. Hence, its qualitative behavior can be understood from those of E [M] and r (Fig. 3 (b) and (d), respectively). Since E [M] is a measure of the linear short-term memory, it is expected to decrease as the nonlinearity of the system increases. On the other hand, r can be interpreted as an effective measure of the nonlinear response of the system, which reaches saturation for sufficiently large  g2, since the activation function f is a sigmoid function. Indeed,  r → 2πas  g2 → ∞, since σ2 →1 as  g2 → ∞in Eq. (19) (However, this cannot be seen from Fig. 3 (d) because the range of  g2shown is restricted upto 2).

The mean-field predictions and the numerical results agree well over the whole range of  g2for E [M]. However, there is a clear discrepancy for  E [Mnet] in the ordered regime (Fig. 3(c)). This is due to the breakdown of the mean-field theory. That is, the assumption that  xi(t) are independent and identically distributed and are also independent of  wijand  uiis violated when the ESN dynamics is driven by input signals. Indeed, in a certain range of  g2in the ordered regime (0.2 < g2 < 0.7 in Fig. 4 (a) where  s2 = 0.01), the numerically obtained values of  E [Mnet] do not approach the mean-field value as the system size N increases. To understand the quantitative influence of the violation of the mean-field assumption on  E [Mnet], we consider the regime  g2 ≪1, where the activation function f can be approximated by the identity function f(x) = x. When  g2 ≪1, we can approximately calculate  E [Mn] without the mean-field theory. Since both the mean-field theory and the linear approximation lead to  E [M] = 1 for g2 ≪1, the differ-ence in  E [Mnet] is reduced to that in  E [M1]. The linear approximation predicts (Appendix E)

image

where  σ2iis the stationary variance of  xi(t). Note that we have  E�σ2i�= s21−g2in both the mean-field theory and the linear approximation. Indeed, in the mean-field theory, Eq. (3) reduces to  σ2 = Σ2 = g2σ2+  s2when f(x) = x. The equation for the linear approximation is derived in Appendix E (Eq. (E3)). However, the former predicts  E�1σ2i

E [M1] for  g2 ≪1, the mean-field theory fails to capture the variance of  σ2i.

We compare the values of  E [M1] obtained from the numerical simulation, the linear approximation, and the mean-field theory in Fig. 4 (b). Although the mean-field line does not fit the numerical result, the linear approximation can explain it well.

C. Mutual Information and Fisher Memory

Once we obtain the conditional probability density Pai(t)|s(t−n)(a|c) for n = 0, 1, 2, . . ., we can immediately calculate the mutual information between  xi(t+ 1) and s(t − n) as

I(xi(t+ 1);  s(t − n)) = I(ai(t); s(t − n)) ≃12 log Σ2Σ2n,i

image

when the mean-field assumption is valid. We would like to take the population average of Eq. (23). Recall that we assumed  s2 ≪1. Let us suppose Σ2 = O(1) in the limit s2 →0. In particular, this holds when  g2 >1. Then, we can approximate the population average of  I(xi(t+ 1);  s(t − n)) as

image

Let us consider the summation of Eq. (23) over n = 1, 2, . . . defined by

image

where the subscript ot indicates the mutual information between the future state and a one-time past input. Note that the n = 0 term is not included in the summation. Thus,  Iotis a measure of network short-term memory analogous to  Mnetbased on the mutual information. The population average  E [Iot] calculated based on the mean-field theory is shown in Fig. 5(a) and is compared with the numerical results. Since the direct numerical estimate of the mutual information between  xi(t+ 1) and s(t − n) for all n = 1, 2, . . . , 500 is computationally hard, we estimated the mutual information from the correlation coefficient between  ai(t) and  s(t − n) assuming that Pai(t)|s(t−n)(a|c) is Gaussian, which is valid both in the linear regime and the mean-field regime. As in the case of  E [Mnet], E [Iot] also takes a maximum value in the range 1  < g2 < g2∗.

As we have seen in the calculation of memory capacity, the mean-field theory is not applicable to the linear regime  g2 ≪1. Indeed, although the discrepancy between the numerical results and the mean-field predictions appears to be small on the scale of Fig. 5(a), the calculation of  Iotbased on the linear approximation (Appendix E) provides much better fits to the numerical results than the mean-field theory for  g2 ≪1, as shown in Fig. 5(b).

Another familiar information-theoretic memory measure is the Fisher information [22, 29]. Here, we regard the past input  s(t − n) as a parameter and consider the Fisher information with respect to the conditional probability density  Pxi(t+1)|s(t−n)(x|c), namely, information

image

FIG. 5. Network memory measure based on the mutual information between the future state and a one-time past input. (a) The theoretical predictions and the numerical results are compared for the population averages of  Iot(Eq. (25)). (b) Comparison among numerical results, linear approximation, and mean-field theory for  E [Iot] with s2 = 0.01.

image

FIG. 6. Population average of  Jot(Eq. (28)) calculated based on the mean-field theory under the same assumption as for Eq. (24). To adjust the scale of the vertical axis,  Jot ismultiplied by  s2.

about  s(t − n) contained in  xi(t+ 1). Since the activation function f is invertible and the Fisher information is invariant under an invertible transformation of stochastic variables, we can use  Pai(t)|s(t−n)(a|c) to calculate the Fisher information. When  s2 ≪1 and the mean-field assumption is valid, the Fisher information for  s(t − n) contained in  xi(t+ 1) is calculated as

image

The population average of Eq. (26) can be approximately obtained under the same assumption as for Eq. (24) as

follows:

image

We define the network Fisher memory with respect to a one-time past input by

image

As in the case of  Iot, we exclude the n = 0 term representing direct memory from the sum in the right-hand side of Eq. (28). The population average  E [Jot] calculated based on the mean-field theory under the same assumption as for Eq. (24) is shown in Fig. 6. E [Jot] behaves qualitatively similarly to  E [Mnet] and E [Iot] at least in the range where the mean-field theory is valid. Note that there is a close relationship between the mean-field predictions of memory function, mutual information, and Fisher information through Eqs. (18), (24) and (27): E [I(xi(t+ 1);  s(t − n))] =

2log�1 −�1 − s2Σ2�E [Mn]�= 12log�1 +  E [J(n)] s2�.

The derivation of the conditional probability density Pai(t)|s(t−n)(a|c) under the mean-field assumption can be extended in a straightforward manner to the conditioning on a set of past inputs at multiple time steps. In particular, the conditional probability density  Pai(t)|s1:n(t)(a|c) can be approximated as a Gaussian distribution with mean  µ1:n,c,iand variance Σ21:n,iwhere  s1:n(t) = (s(t −

1), s(t 2), . . . , s(t n)), c = (c1, c2, . . . , cn),

image

and

image

image

FIG. 7. Population averages of (a) I (Eq. (32)) and (b)  J ·s2 (Eq. (35) multiplied by  s2) calculated from the mean-field theory under the same assumption as for Eq. (24) are shown.

image

FIG. 8. The mean-field values of  g2 (g2χ, g2ι , g2φ, and g2ω)that attain the maxima of the five network memory measures (E [Mnet], E [Iot], E [Jot], and E [I] (E [J]), respectively) are shown as functions of  s2. Note that E [I] and E [J] peak at the same value of  g2. The critical line  g2 = g2∗ is also shown.

An alternative network memory measure to  Iotis the limit  n → ∞of the mutual information between  xi(t+1) and  s1:n(t). When the mean-field assumption is valid, it is given by

image

where Σ21:∞,i = Σ2 − �∞k=1�V ku�2i s2. Under the same assumption as for Eq. (24), its population average is approximately given by

image

Similarly, we can consider an alternative network memory measure based on the Fisher information matrix with

image

FIG. 9.  I = limn→∞ I(xi(t + 1); s1:n(t)) can be represented as the difference between  I1 = I(xi(t + 1); x(t)) and I2 =limn→∞ I(xi(t + 1); x(t)|s1:n(t)) (Eq. (36)). The population-averaged values of these quantities calculated from the mean-field theory under the same assumption as for Eq. (24) are shown for  s2 = 0.01.

respect to  Pai(t)|s1:n(t)(a|c). When the mean-field assumption is valid, the (k, l)-th element of the Fisher information matrix is given by

image

The Fisher memory with respect to the whole past input history is defined by

image

and its approximate population average is found to be

image

under the same assumption as for Eq. (24). Note that E [I] and E [J] are related by  E [I] ≃ 12log�1 +  E [J] s2�. Figure 7 shows Eqs. (32) and (35) for  s2 = 0.01, 0.02, and 0.04. Both E [I] and E [J] take maximum values at points close to those for  E [Mnet], E [Iot], and  E [Jot] as long as the mean-field theory is valid. Figure 8 summarizes the maximum points of these network memory measures in the range 0  < s2 ≤ 0.04 obtained from the mean-field theory.

Finally, we remark on the behavior of I. Because  xi(t+ 1) and x(t) are conditionally independent given  s1:n(t), we obtain

image

The first term on the right-hand side of Eq. (36) becomes I by taking the limit  n → ∞. The second term I(xi(t+ 1);  x(t)|s1:n(t)) will be negligible when the system is driven by input signals (g2 ≪1). On the other hand, the chaotic dynamics dominates as  g2 → ∞and I(xi(t+ 1);  x(t)|s1:n(t)) will approach  I(xi(t+ 1); x(t)). Thus, as  g2varies from 0 to  ∞, Iwill increase together with  I(xi(t+ 1); x(t)) in the ordered regime but will decrease in the sufficiently chaotic regime. Hence, I is expected to take a maximum value between the two extremes. In Fig. 9, the population-averaged values of the three terms in Eq. (36) in the limit  n → ∞calculated from the mean-field theory under the same assumption as for Eq. (24) are shown, where  I(xi(t+1);  x(t)) = 12log  Σ2s2due to the mean-field assumption.

The three network memory measures studied in this paper take maximum values in the ordered regime for ESNs with small input signals. The value of  g2that attains the maximum is always greater than 1, which is the boundary between the ordered and chaotic regimes in the corresponding autonomous system. However, it is far from the critical  g2∗(Fig. 8). In previous work, it was argued that the maximal Fisher information can be used to detect the edge of chaos [30, 31]. Our results suggest that such an approach is not necessarily effective for driven dynamical systems.

In the context of physical reservoir computing [7–12], it is generally difficult to tune the parameters of a given physical system for optimal computational performance. An alternative method is to choose an optimal input strength. Although the analysis presented in this paper assumes that  s2is small, our results theoretically suggest that tuning the input strength is meaningful. For example, the value of  E [Mnet] for  s2 = 0.02 is greater than those for  s2 = 0.01 and  s2 = 0.04 around  g2 ≈ 1.3 in Fig. 3 (c).

Toyoizumi and Abbott [25] analytically showed that the signal-to-noise ratio of ESNs decreases rapidly on the left side of the criticality  g2 = 1 when inputs are ab-sent, but decreases much slowly on the right side. They suggested that high computational performance can be achieved without fine tuning in the latter. Our results confirm this expectation because all of the short-term memory measures in the presence of inputs peak near g2 = 1 in the region 1 < g2 < g2∗.

In general, dynamical regimes of autonomous systems beyond stable fixed points are candidates for computational resources. For example, RNNs with sinusoidal activation functions achieve a high computational performance in the non-chaotic window regions after transition to chaos occurs in their autonomous dynamics [32]. An online supervised learning algorithm for RNNs proposed by Sussillo and Abbott [33] exhibits its best performance when their autonomous dynamics is adjusted to the chaotic region not far from the critical point where chaotic dynamics can be suppressed by input signals. Schuecker et al. [23] showed that the network memory capacity for continuous-time nonlinear RNNs peaks in the ordered regime with  g2 >1, which is consistent to our result. They argued that the dynamic suppression of chaos (DSC), which results from the fact that the onset of local instability precedes that of asymptotic instability, contributes to optimal information processing. However, DSC cannot occur in discrete-time ESNs where the two onsets coincide. In ESNs, the shift of the critical  g2∗to- ward the chaotic regime induced by input signals is solely due to a mechanism called the static suppression of chaos (SSC), which increases the frequency with which an orbit visits the contracting region of the phase space. Unlike SSC, DSC is conjectured to occur based on fast switching among different unstable directions caused by input signals [23]. However, ESNs with leaky neurons [34] are expected to exhibit DSC [23]. Future analyses of network memory measures for leaky ESNs based on the presented theory could deepen the understanding of the relationship between DSC and the information processing ability of dynamical systems.

It has been suggested that there exists a trade-off between nonlinearity of dynamical systems and their memory capacity [14]. Inubushi and Yoshimura [35] theoretically investigated the trade-off in terms of how nonlinearity degrades small initial differences of input signals. The mean-field theory presented in this paper makes it possible to study the trade-off when input strength is small by directly calculating the nonlinear memory capacity proposed by Damble et al. [14]. Performing the detailed calculation is also left as future work.

image

Since  ajk(t − k) (k = 1, 2, . . ., n) are independent and ajk(t − k) ∼ N(0, Σ2) by the mean-field assumption, we have

image

where  f ′(a) = e− π4 a2is used for  f(a) = erf(√π2 a). Equa- tion (10) follows from Eqs. (11), (A1) and (A2).

image

We set

image

and

gn−k(y, zn−k, . . . , zn) = �jn−k+1wjn−kjn−k+1f(gn−k+1(y, zn−k+1, . . . , zn)) + ujn−kzn−k(B4)

for  k = 1, . . . , n −1. Let us introduce

image

where  Dy = dy√2πg2σ2 e−

Eq. (B2) can be written as

image

image

where  Dzk = dzk√2πs2 e−

odd function with respect to (z1, . . . , zn), namely,

image

holds. This yields the desired result. First, note that gn−k(y, zn−k, . . . , zn) is odd with respect to (y, zn−k, . . . , zn) for  k = 0, 1, . . . , n −1. Namely, we have

image

Indeed, Eq. (B8) can be proved by mathematical induction. First,  gn(−y, −zn) = −y − ujnzn = −gn(y, zn) for k = 0. Assume that  gn−k(−y, −zn−k, . . . , −zn) =−gn−k(y, zn−k, . . . , zn) for  k = 0, 1, . . ., n −2. Then, we obtain

gn−(k+1)(−y, −zn−(k+1), . . . ,  −zn) =�jn−kwjn−(k+1)jn−kf(gn−k(−y, −zn−k, . . . ,  −zn)) − ujn−(k+1)zn−(k+1)

=�jn−kwjn−(k+1)jn−kf(−gn−k(y, zn−k, . . . , zn)) − ujn−(k+1)zn−(k+1)

= −�jn−kwjn−(k+1)jn−kf(gn−k(y, zn−k, . . . , zn)) − ujn−(k+1)zn−(k+1)

= −gn−(k+1)(y, zn−(k+1), . . . , zn), (B9)

image

where we applied the induction hypothesis for the third equality and we used the fact that f is an odd function for the fourth equality. Now, Eq. (B7) is obtained by

image

where we used Eq. (B8) for the third equality and the fact that  f ′is an even function for the fourth equality.

image

We have

image

=

image

By the change of variable  y = aΣ, the integral on the right-hand side of Eq. (C1) can be calculated as

image

image

From Eqs. (C1) and (C2), we obtain Eq. (16).

image

Since (V nu)2i =� 11+ π2 Σ2�n(W nu)2i, it is sufficient to show  E�(W nu)2i�= g2n+ O( 1N) for n = 0, 1, . . . .

Let  w(n)ijbe the (i, j)-th element of  W n. We have

image

where we used  E [ujuk] = δjkand  δjkis the Kronecker delta. The population average of�w(n)ij �2is given by

image

because  wij ∼ N(0, g2N) are independent. Thus, we ob- tain

image

When  g2 ≪1, we can approximate f(a) = a and obtain

image

Thus,

image

holds. Ignoring the O( 1N) terms, the mean and vari- ance of (W nu)2ifor  n ≥1 are given by  g2nand 2g4n,

respectively, because  wjkand  ujare independent and wjk ∼ N(0, g2N) and  uj = ±1. The population average of σ2iis given by

image

Its variance is

image

Var�σ2i�= s4

image

E�(W mu)2i�E�(W nu)2i�+ O( 1N) for  m ̸= n. From Eqs. (E3) and (E4), we obtain

image

Equation (22) follows from Eq. (E5).

Similarly, we can compute the mutual information between  xi(t+ 1) and  s(t − n) in the linear regime g2 ≪1 as follows. Let  X := �∞m=0(W mu)2iand

image

image

Since E [log X] can be approximated as

image

and a similar approximation can be obtained for E [log  Xn], we obtain

image

1−g4and Var [Xn] = 2g41−g4 − 2g4n. E [Iot] can be com- puted by summing Eq. (E8) over  n ≥1.

The authors are grateful to the anonymous reviewers for their comments and suggestions that improved the manuscript. TH was supported by JSPS KAKENHI Grant Number JP18K03423. KN was supported by JSPS KAKENHI Grant Number JP18H05472 and by MEXT Quantum Leap Flagship Program (MEXT QLEAP) Grant Number JPMXS0118067394. This work is partially based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).

[1] H. Jaeger, “The “echo state” approach to analysing and training recurrent neural networks,” (2001), GMDReport 148, GMD-German National Research Institute for Computer Science.

[2] W. Maass, T. Natschl¨ager, and H. Markram, Neural Comput. 14, 2531 (2002).

[3] H. Jaeger and H. Haas, Science 304, 78 (2004).

[4] D. Verstraeten, M. Schrauwen, B. D’Haene, and D. Stroobandt, Neural Netw. 20, 391 (2007).

[5] M. Lukoˇseviˇcius and H. Jaeger, Comput. Sci. Rev. 3, 127 (2009).

[6] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, Phys. Rev. Lett. 120, 024102 (2018).

[7] L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutierrez, L. Pesquera, C. R. Mirasso, and I. Fischer, Optics Express 20, 3241 (2012).

[8] L. Larger, A. Bayl´on-Fuentes, R. Martinenghi, V. S. Udaltsov, Y. K. Chembo, and M. Jacquot, Phys. Rev. X 7, 011015 (2017).

[9] J. Torrejon, M. Riou, F. A. Araujo, S. Tsunegi, G. Khalsa, D. Querlioz, P. Bortolotti, V. Cros, K. Yakushiji, A. Fukushima, H. Kubota, S. Yuasa, M. D.

image

Stiles, and J. Grollier, Nature 547, 428 (2017).

[10] S. Tsunegi, T. Taniguchi, K. Nakajima, S. Miwa, K. Yakushiji, A. Fukushima, S. Yuasa, and H. Kubota, Appl. Phys. Lett. 114, 164101 (2019).

[11] K. Nakajima, H. Hauser, T. Li, and R. Pfeifer, Soft Robotics 5, 339 (2018).

[12] K. Fujii and K. Nakajima, Phys. Rev. Appl. 8, 024030 (2017).

[13] H. Jaeger, “Short term memory in echo state networks,” (2002), GMD-Report 152, GMD-German National Research Institute for Computer Science.

[14] J. Dambre, D. Verstraeten, B. Schrauwen, and S. Mas- sar, Sci. Rep. 2, 514 (2012).

[15] N. Bertschinger and T. Natschl¨ager, Neural Comput. 16, 1413 (2004).

[16] J. Boedecker, O. Obst, J. T. Lizier, N. M. Mayer, and M. Asada, Theory Biosci. 131, 205 (2012).

[17] I. Farkaˇs, R. Bos´ak, and P. Gergeˇl, Neural Netw. 83, 109 (2016).

[18] O. L. White, D. D. Lee, and H. Sompolinsky, Phys. Rev. Lett. 92, 148102 (2004).

[19] A. Rodan and P. Tino, IEEE Trans Neural Netw. 22, 131

(2011).

[20] M. Hermans and B. Schrauwen, Neural Netw. 23, 341 (2010).

[21] S. Marzen, Phys. Rev. E 96, 032308 (2017).

[22] S. Ganguli, D. Huh, and H. Sompolinsky, Proc. Natl. Acad. Sci. USA 105, 18970 (2008).

[23] J. Schuecker, S. Goedeke, and M. Helias, Phys. Rev. X 8, 041029 (2018).

[24] H. Sompolinsky, A. Crisanti, and H. J. Sommers, Phys. Rev. Lett. 61, 259 (1988).

[25] T. Toyoizumi and L. F. Abbott, Phys. Rev. E 84, 051908 (2011).

[26] B. Cessac and M. Samuelides, Eur. Phys. J. Spec. Top. 142, 7 (2007).

[27] M. Massar and S. Massar, Phys. Rev. E 87, 042809 (2013).

[28] L. Molgedey, J. Schuchhardt, and H. G. Schuster, Phys. Rev. Lett. 69, 3717 (1992).

[29] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (John Wiley & Sons, Hoboken, NJ, 2006).

[30] M. Prokopenko, J. T. Lizier, O. Obst, and X. R. Wang, Phys. Rev. E 84, 041116 (2011).

[31] L. Livi, F. M. Bianchi, and C. Alippi, IEEE Trans. Neu- ral Netw. Learn. Syst. 29, 706 (2017).

[32] B. A. Marquez, L. Larger, M. Jacquot, Y. K. Chembo, and D. Brunner, Sci. Rep. 8, 3319 (2018).

[33] D. Sussillo and L. F. Abbott, Neuron 63, 544 (2009).

[34] H. Jaeger, M. Lukoˇseviˇcius, D. Popovici, and U. Siewert, Neural Netw. 20, 335 (2007).

[35] M. Inubushi and K. Yoshimura, Sci. Rep. 7, 10199 (2017).


Designed for Accessibility and to further Open Science