Optimal short-term memory before the edge of chaos in driven random recurrent networks

2019·Arxiv

Abstract

Abstract

The ability of discrete-time nonlinear recurrent neural networks to store time-varying small input signals is investigated by mean-field theory. The combination of a small input strength and mean-field assumptions makes it possible to derive an approximate expression for the conditional probability density of the state of a neuron given a past input signal. From this conditional probability density, we can analytically calculate short-term memory measures, such as memory capacity, mutual information, and Fisher information, and determine the relationships among these measures, which have not been clarified to date to the best of our knowledge. We show that the network contribution of these short-term memory measures peaks before the edge of chaos, where the dynamics of input-driven networks is stable but corresponding systems without input signals are unstable.

I. INTRODUCTION

Natural and artificial high-dimensional nonlinear dynamical systems can be used as resources for real-time computing. By nonlinearly mapping time-varying input signals into a high-dimensional space, the signals can be learned in a supervised manner if the dynamical systems have enough ability to store the signals in their present state and separate different signals [1, 2]. A high computational performance can be achieved by tuning only the weights of linear connections to the output layer while keeping the parameters of the dynamical systems fixed [3–6]. Such dynamical systems called reservoirs can be artificial recurrent neural networks (RNNs) or physical systems, such as optical media [7, 8], nanoscale magnetization dynamics [9, 10], soft materials [11], and quantum systems [12].

As mentioned above, a requirement for real-time computing is the ability to memorize past input signals. Such short-term memory of dynamical systems has been studied extensively by assessing a quantity called memory capacity [13, 14]. For input-driven RNNs, it has been suggested that the part of memory capacity representing indirect memory through network takes a maximum value near the edge of chaos, namely, near the critical boundary between the stable and unstable dynamical regimes [15, 16]. Near criticality, different inputs are expected to lead to different states while suppressing the influence of the initial conditions. Hence, it seems reasonable for a dynamical system to be near the critical point for optimal memory capacity. However, it has also been pointed out that the dependence on network parameters is not straightforward based on a systematic numerical simulation [17].

For linear RNNs, detailed analytic studies of memory capacity can be performed for both discrete-time [18, 19] and continuous-time systems [20]. The ability to predict future inputs, which is complementary to memory capacity, has also been studied in linear systems with correlated input signals [21]. However, the memory capacity of nonlinear RNNs is difficult to study by analytical methods [22]. Recently, Schuecker et al. [23] successfully derived an analytical expression for memory capacity for continuous-time nonlinear RNNs [24] in which each neuron is driven by independent input signals following a white-noise Gaussian process. Toyoizumi and Abbott [25] analytically calculated the signal-to-noise ratio, which is equivalent to the inverse of memory capacity at the limit of zero input strength, for discrete-time nonlinear RNNs driven by a common time-varying input signal.

In this paper, we analytically investigate the memory capacity of discrete-time nonlinear RNNs called echo state networks (ESNs) [1] by a mean-field theory when the strength of input signals is small but non-zero. The main idea of our approach is that the conditional probability density of the present state of a neuron given a past input signal can be approximately calculated from a functional derivative with respect to past input signals under the assumption of a small input strength. Once we obtain this conditional probability density, it is straightforward to derive the memory capacity and other alternative memory measures, such as mutual information and Fisher information [22]. We show that all three measures of short-term memory through network behave similarly and take a maximum value before the edge of chaos, where the dynamics is stable in the presence of input signals but unstable in the absence of input signals. We also discuss the breakdown of the mean-field theory for calculating memory measures in the ordered regime and show that the linear approximation provides good predictions.

II. RESULTS

A. Echo State Networks

We consider ESNs consisting of N artificial neurons. The state of neuron i at discrete time step t is denoted ). We assume that the time evolution of state ) is governed by

where f is an activation function. ) is the activation potential of neuron i at time step t given by

where s(t) is a time-dependent input signal, is a time-independent weight of the connection from neuron j to neuron i, and is a time-independent weight representing the strength of the coupling from the input signal to neuron i. We use the matrix and vector notations , and .

In the following analytical calculations and numerical simulations, the activation function is assumed to be a sigmoid function satisfying lim1, and In particular, we adopt

are chosen independently at random from an identical Gaussian distribution with mean zero and variance , where 0 is a control parameter. For simplicity, are assumed to be independent variables taking 1 with probability . Since our primary concern is the memory capacity of ESNs, we consider an independent and identically distributed Gaussian input signal s(t) with mean zero and variance . All numerical results in this paper were obtained in the following way unless otherwise stated. We simulated ESNs with N = 1000 artificial neurons over 40 trials. For a single trial, each quantity (for example, stationary variance of )) was calculated from its values over 10time steps after discarding the initial 10time steps. Then, averages were obtained over all artificial neurons and all trials. All infinite sums appearing in the following sections were evaluated by truncation at the 500-th term.

The mean-field theory of ESNs [25–28] makes it possible to calculate the stationary variance of ) and the largest Lyapunov exponent in the limit . It assumes that ) are independent and identically distributed random variables. They are also assumed to be independent of and . This assumption can be

FIG. 1. Numerical results (marks) and mean-field predic- tions (solid lines) for the stationary variance of ) are shown as functions of 02 (green), and 04 (blue). Vertical broken lines indicate the mean-field predictions of the critical value of obtained from Eq. (4) for (50, green), and

justified in the limit when there is no input signal. Since f is odd, we can self-consistently assume that the mean of ) is equal to zero. Let be the variance of ), where indicates the average over trials with the same and but possibly different realizations of the input signal s(t) and initial conditions. By the central limit theorem, ) follows a Gaussian distribution with mean zero and variance +), where we use the ) term, the variance of ) does not depend on specific realizations of [25]. In the following, we omit quantities that approach 0 as unless otherwise stated. Consequently, ) follows the following recurrence equation [27]:

where Σ) + . By numerically solving Eq. (3) with , we can obtain the mean-field prediction of the stationary variance of ). We write Σ+ . These values of and Σhave been used for plotting theoretical results later in the paper.

The largest Lyapunov exponent derived from the mean-field theory is [25, 27, 28]

Here, 0 indicates that a small perturbation to a state of the system leads to exponential growth, while 0 implies that the perturbation eventually becomes undetectable. The dynamics is called chaotic or unstable in the former case and called ordered or stable in the latter. When an input signal is absent, the boundary between chaos and stability The presence of input signals shifts the boundary towards the chaotic side [27, 28].

In Fig. 1, we confirm that the mean-field prediction of the stationary variance of ) agrees well with the result obtained by numerical simulations. We note that the difference is negligible, even for the input-driven regime 1 where the mean-field assumption is expected to be violated. This will be explained when we discuss the breakdown of the mean-field theory in the calculation of memory capacity.

B. Memory Capacity

The memory capacity of an ESN is defined as the quality of the optimal linear estimator of the past input ) using the present state of neurons ). Following previous work [23, 25], we assume a sparse readout, namely, there are K = O(1) readout neurons 1 and consider linear readout ˆ). Given time-delay n (n = 1, 2, . . .), the weights are determined by minimizing the mean squared error between ) and ˆs(t) over a sufficiently long time period T . The optimal mean squared error as a function of time-delay n is called the memory function and is given by [13, 14]

where is the (i, j)-th element of the matrix , which is the inverse of the matrix , and indicates the time average over the period of length T . The memory capacity is defined as the sum of

over all time-delays n:

It is known that holds [13, 14].

To compute by the mean-field theory, we replace the time average for by the average over trials in stationary states. Since for vanishes as , the contribution of the off-diagonal terms of to Eq. (5) can be neglected for by the sparse readout assumption K = O(1). Thus, in the mean-field calculation, Eq. (5) is just K times for K = 1. Since

when K = 1, the task to obtain M reduces to calculating for n = 1, 2, . . .. In the following, we perform the calculation by assuming that the strength of the input signal is small, namely, 1.

The main idea to calculate is that conditioning of ) on ) (n = 0, 1, 2, . . .) can be regarded as a small perturbation to dependence of ) on ), when 1. We assume that ) are independent and identically distributed and are also independent of and even after conditioning. By the central limit theorem, ) conditioned on follows a Gaussian distribution in the limit of large N. Hence, it is sufficient to calculate its mean and variance Σto determine the conditional probability density ).

First, we calculate the mean of ) given c. We regard ) as a functional of stochastic variables 1)). We write )], where 1))). We consider a norm defined by the average over trials for a stochastic variable X). Conditioning of ) on corresponds to replacing argument ) with the constant stochastic variable c. If 1, then 1. Thus, ) given can be approximated by the following:

(t), c, x(t n)] (t), x(t n)] + δ(t), x(t n)]δs(t n) (c s(t n))

By taking the average over trials, we have

FIG. 2. Conditional probability density of the activation potential given a past input. Theoretical predictions and numerical results are compared for (a) the mean of 1 and (b) the conditional probability density We set 01 and use single specific realizations of W and u.

where

Since f is an odd function, we have (Appendix B)

From Eqs. (10) and (12), the mean-field theory predicts

when 1.

Second, the variance of ) given can be obtained as follows. The variance of ) can be expressed as

Thus, we have

Note that the population variance of (takes a nonzero finite value even in the limit of large N, as we will see in the linear case (Appendix E) when we discuss the breakdown of the mean-field theory in the ordered regime. This implies that the value of Σdepends on i or, equivalently, realizations of W and u, even after discarding the ) term. Another related remark is that Eq. (15) holds only in the limit of small . Otherwise, the right-hand side may become negative even in the limit of large N, since (follows a Gaussian distribution with a variance of O(1) and thus can take an arbitrarily large value.

In Fig. 2, the mean of ) given 1 and the conditional probability density ) are shown for single specific realizations of W and u. Here, we set 01. The numerical results are obtained by first generating a single orbit of length 10time steps after discarding the initial 10time steps and then sampling the value of ) with 011 for each n. The theoretical values for and Σare calculated from Eqs. (11), (13), and (15), where W and u are the same as those used in the numerical simulation. We can see that the numerical results and the theoretical predictions agree well.

Using Eqs. (13) and (15), we obtain (Appendix C)

for n = 1, 2, . . . . From Eqs. (7) and (16), we have

for n = 1, 2, . . . . The population average of , which is equivalent to the average over realizations of W and u, is (Appendix D)

where

The population average of M is

FIG. 3. Memory capacity and network memory capacity of ESNs. Theoretical predictions and numerical results are compared for the population averages of (a) memory function , (b) memory capacity M, and (c) network memory capacity Mean-field lines for the effective measurement of the nonlinear response r (Eq. (19)). In (a), we set 01. The vertical broken lines indicate the transition point between the ordered and chaotic regimes for the value of with the same color.

FIG. 4. Linear approximation of . (a) Numerical values of ] for different system sizes N. (b) Comparison among numerical results, linear approximation, and mean-field theory for 01 in both panels.

M can be decomposed into two parts [17, 23]: the direct memory and the indirect memory through network . We call the latter the network memory

capacity. The population average of the latter is

Figure 3 (a) shows the population average of for

different values of with 01. The population averages of M and are shown in Fig. 3 (b) and (c), respectively. ] peaks in the range 1 , where is the value of such that location of the maximum point depends on the value of and shifts to a larger value as increases. In the mean-field theory, ] is given as the product between E [M] and r. Hence, its qualitative behavior can be understood from those of E [M] and r (Fig. 3 (b) and (d), respectively). Since E [M] is a measure of the linear short-term memory, it is expected to decrease as the nonlinearity of the system increases. On the other hand, r can be interpreted as an effective measure of the nonlinear response of the system, which reaches saturation for sufficiently large , since the activation function f is a sigmoid function. Indeed, as , since 1 as in Eq. (19) (However, this cannot be seen from Fig. 3 (d) because the range of shown is restricted upto 2).

The mean-field predictions and the numerical results agree well over the whole range of for E [M]. However, there is a clear discrepancy for ] in the ordered regime (Fig. 3(c)). This is due to the breakdown of the mean-field theory. That is, the assumption that ) are independent and identically distributed and are also independent of and is violated when the ESN dynamics is driven by input signals. Indeed, in a certain range of in the ordered regime (07 in Fig. 4 (a) where 01), the numerically obtained values of ] do not approach the mean-field value as the system size N increases. To understand the quantitative influence of the violation of the mean-field assumption on ], we consider the regime 1, where the activation function f can be approximated by the identity function f(x) = x. When 1, we can approximately calculate ] without the mean-field theory. Since both the mean-field theory and the linear approximation lead to 1, the differ-ence in ] is reduced to that in ]. The linear approximation predicts (Appendix E)

where is the stationary variance of ). Note that we have in both the mean-field theory and the linear approximation. Indeed, in the mean-field theory, Eq. (3) reduces to + when f(x) = x. The equation for the linear approximation is derived in Appendix E (Eq. (E3)). However, the former predicts

] for 1, the mean-field theory fails to capture the variance of .

We compare the values of ] obtained from the numerical simulation, the linear approximation, and the mean-field theory in Fig. 4 (b). Although the mean-field line does not fit the numerical result, the linear approximation can explain it well.

C. Mutual Information and Fisher Memory

Once we obtain the conditional probability density ) for n = 0, 1, 2, . . ., we can immediately calculate the mutual information between + 1) and ) as

+ 1); 12 log Σ

when the mean-field assumption is valid. We would like to take the population average of Eq. (23). Recall that we assumed 1. Let us suppose Σ(1) in the limit 0. In particular, this holds when 1. Then, we can approximate the population average of + 1); )) as

Let us consider the summation of Eq. (23) over n = 1, 2, . . . defined by

where the subscript ot indicates the mutual information between the future state and a one-time past input. Note that the n = 0 term is not included in the summation. Thus, is a measure of network short-term memory analogous to based on the mutual information. The population average ] calculated based on the mean-field theory is shown in Fig. 5(a) and is compared with the numerical results. Since the direct numerical estimate of the mutual information between + 1) and ) for all n = 1, 2, . . . , 500 is computationally hard, we estimated the mutual information from the correlation coefficient between ) and ) assuming that ) is Gaussian, which is valid both in the linear regime and the mean-field regime. As in the case of ] also takes a maximum value in the range 1 .

As we have seen in the calculation of memory capacity, the mean-field theory is not applicable to the linear regime 1. Indeed, although the discrepancy between the numerical results and the mean-field predictions appears to be small on the scale of Fig. 5(a), the calculation of based on the linear approximation (Appendix E) provides much better fits to the numerical results than the mean-field theory for 1, as shown in Fig. 5(b).

Another familiar information-theoretic memory measure is the Fisher information [22, 29]. Here, we regard the past input ) as a parameter and consider the Fisher information with respect to the conditional probability density ), namely, information

FIG. 5. Network memory measure based on the mutual information between the future state and a one-time past input. (a) The theoretical predictions and the numerical results are compared for the population averages of (Eq. (25)). (b) Comparison among numerical results, linear approximation, and mean-field theory for

FIG. 6. Population average of (Eq. (28)) calculated based on the mean-field theory under the same assumption as for Eq. (24). To adjust the scale of the vertical axis, multiplied by

about ) contained in + 1). Since the activation function f is invertible and the Fisher information is invariant under an invertible transformation of stochastic variables, we can use ) to calculate the Fisher information. When 1 and the mean-field assumption is valid, the Fisher information for ) contained in + 1) is calculated as

The population average of Eq. (26) can be approximately obtained under the same assumption as for Eq. (24) as

follows:

We define the network Fisher memory with respect to a one-time past input by

As in the case of , we exclude the n = 0 term representing direct memory from the sum in the right-hand side of Eq. (28). The population average ] calculated based on the mean-field theory under the same assumption as for Eq. (24) is shown in Fig. 6. ] behaves qualitatively similarly to ] and ] at least in the range where the mean-field theory is valid. Note that there is a close relationship between the mean-field predictions of memory function, mutual information, and Fisher information through Eqs. (18), (24) and (27): + 1);

loglog1 + .

The derivation of the conditional probability density ) under the mean-field assumption can be extended in a straightforward manner to the conditioning on a set of past inputs at multiple time steps. In particular, the conditional probability density ) can be approximated as a Gaussian distribution with mean and variance Σwhere

1), s(t − 2), . . . , s(t − n)), c = (c1, c2, . . . , cn),

and

FIG. 7. Population averages of (a) I (Eq. (32)) and (b) (Eq. (35) multiplied by ) calculated from the mean-field theory under the same assumption as for Eq. (24) are shown.

FIG. 8. The mean-field values of that attain the maxima of the five network memory measures (]), respectively) are shown as functions of . Note that E [I] and E [J] peak at the same value of . The critical line is also shown.

An alternative network memory measure to is the limit of the mutual information between +1) and ). When the mean-field assumption is valid, it is given by

where Σ. Under the same assumption as for Eq. (24), its population average is approximately given by

Similarly, we can consider an alternative network memory measure based on the Fisher information matrix with

FIG. 9. )) can be represented as the difference between lim)) (Eq. (36)). The population-averaged values of these quantities calculated from the mean-field theory under the same assumption as for Eq. (24) are shown for

respect to ). When the mean-field assumption is valid, the (k, l)-th element of the Fisher information matrix is given by

The Fisher memory with respect to the whole past input history is defined by

and its approximate population average is found to be

under the same assumption as for Eq. (24). Note that E [I] and E [J] are related by log1 + . Figure 7 shows Eqs. (32) and (35) for 02, and 0.04. Both E [I] and E [J] take maximum values at points close to those for ], and ] as long as the mean-field theory is valid. Figure 8 summarizes the maximum points of these network memory measures in the range 0 04 obtained from the mean-field theory.

Finally, we remark on the behavior of I. Because + 1) and x(t) are conditionally independent given ), we obtain

The first term on the right-hand side of Eq. (36) becomes I by taking the limit . The second term + 1); )) will be negligible when the system is driven by input signals (1). On the other hand, the chaotic dynamics dominates as and + 1); )) will approach + 1); x(t)). Thus, as varies from 0 to will increase together with + 1); x(t)) in the ordered regime but will decrease in the sufficiently chaotic regime. Hence, I is expected to take a maximum value between the two extremes. In Fig. 9, the population-averaged values of the three terms in Eq. (36) in the limit calculated from the mean-field theory under the same assumption as for Eq. (24) are shown, where +1); log due to the mean-field assumption.

III. DISCUSSION

The three network memory measures studied in this paper take maximum values in the ordered regime for ESNs with small input signals. The value of that attains the maximum is always greater than 1, which is the boundary between the ordered and chaotic regimes in the corresponding autonomous system. However, it is far from the critical (Fig. 8). In previous work, it was argued that the maximal Fisher information can be used to detect the edge of chaos [30, 31]. Our results suggest that such an approach is not necessarily effective for driven dynamical systems.

In the context of physical reservoir computing [7–12], it is generally difficult to tune the parameters of a given physical system for optimal computational performance. An alternative method is to choose an optimal input strength. Although the analysis presented in this paper assumes that is small, our results theoretically suggest that tuning the input strength is meaningful. For example, the value of ] for 02 is greater than those for 01 and 04 around 3 in Fig. 3 (c).

Toyoizumi and Abbott [25] analytically showed that the signal-to-noise ratio of ESNs decreases rapidly on the left side of the criticality sent, but decreases much slowly on the right side. They suggested that high computational performance can be achieved without fine tuning in the latter. Our results confirm this expectation because all of the short-term memory measures in the presence of inputs peak near .

In general, dynamical regimes of autonomous systems beyond stable fixed points are candidates for computational resources. For example, RNNs with sinusoidal activation functions achieve a high computational performance in the non-chaotic window regions after transition to chaos occurs in their autonomous dynamics [32]. An online supervised learning algorithm for RNNs proposed by Sussillo and Abbott [33] exhibits its best performance when their autonomous dynamics is adjusted to the chaotic region not far from the critical point where chaotic dynamics can be suppressed by input signals. Schuecker et al. [23] showed that the network memory capacity for continuous-time nonlinear RNNs peaks in the ordered regime with 1, which is consistent to our result. They argued that the dynamic suppression of chaos (DSC), which results from the fact that the onset of local instability precedes that of asymptotic instability, contributes to optimal information processing. However, DSC cannot occur in discrete-time ESNs where the two onsets coincide. In ESNs, the shift of the critical to- ward the chaotic regime induced by input signals is solely due to a mechanism called the static suppression of chaos (SSC), which increases the frequency with which an orbit visits the contracting region of the phase space. Unlike SSC, DSC is conjectured to occur based on fast switching among different unstable directions caused by input signals [23]. However, ESNs with leaky neurons [34] are expected to exhibit DSC [23]. Future analyses of network memory measures for leaky ESNs based on the presented theory could deepen the understanding of the relationship between DSC and the information processing ability of dynamical systems.

It has been suggested that there exists a trade-off between nonlinearity of dynamical systems and their memory capacity [14]. Inubushi and Yoshimura [35] theoretically investigated the trade-off in terms of how nonlinearity degrades small initial differences of input signals. The mean-field theory presented in this paper makes it possible to study the trade-off when input strength is small by directly calculating the nonlinear memory capacity proposed by Damble et al. [14]. Performing the detailed calculation is also left as future work.

Since ) (k = 1, 2, . . ., n) are independent and ) by the mean-field assumption, we have

where is used for ). Equa- tion (10) follows from Eqs. (11), (A1) and (A2).

We set

and

g(y, z, . . . , zwf(g(y, z, . . . , z)) + u(B4)

for 1. Let us introduce

where

Eq. (B2) can be written as

where

odd function with respect to (), namely,

holds. This yields the desired result. First, note that ) is odd with respect to () for 1. Namely, we have

Indeed, Eq. (B8) can be proved by mathematical induction. First, ) for k = 0. Assume that ) for 2. Then, we obtain

g, . . . , wf(g, . . . ,

wf((y, z, . . . , z

wf(g(y, z, . . . , z

(y, z, . . . , z), (B9)

where we applied the induction hypothesis for the third equality and we used the fact that f is an odd function for the fourth equality. Now, Eq. (B7) is obtained by

where we used Eq. (B8) for the third equality and the fact that is an even function for the fourth equality.

We have

=

By the change of variable , the integral on the right-hand side of Eq. (C1) can be calculated as

From Eqs. (C1) and (C2), we obtain Eq. (16).

Since (, it is sufficient to show + O( ) for n = 0, 1, . . . .

Let be the (i, j)-th element of . We have

where we used and is the Kronecker delta. The population average ofis given by

because ) are independent. Thus, we ob- tain

When 1, we can approximate f(a) = a and obtain

Thus,

holds. Ignoring the O( ) terms, the mean and vari- ance of (for 1 are given by and 2,

respectively, because and are independent and ) and 1. The population average of is given by

Its variance is

Var

( ) for . From Eqs. (E3) and (E4), we obtain

Equation (22) follows from Eq. (E5).

Similarly, we can compute the mutual information between + 1) and ) in the linear regime 1 as follows. Let (and

Since E [log X] can be approximated as

and a similar approximation can be obtained for E [log ], we obtain

and Var [] can be com- puted by summing Eq. (E8) over 1.

ACKNOWLEDGMENTS

The authors are grateful to the anonymous reviewers for their comments and suggestions that improved the manuscript. TH was supported by JSPS KAKENHI Grant Number JP18K03423. KN was supported by JSPS KAKENHI Grant Number JP18H05472 and by MEXT Quantum Leap Flagship Program (MEXT QLEAP) Grant Number JPMXS0118067394. This work is partially based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).

[1] H. Jaeger, “The “echo state” approach to analysing and training recurrent neural networks,” (2001), GMDReport 148, GMD-German National Research Institute for Computer Science.

[2] W. Maass, T. Natschl¨ager, and H. Markram, Neural Comput. 14, 2531 (2002).

[3] H. Jaeger and H. Haas, Science 304, 78 (2004).

[4] D. Verstraeten, M. Schrauwen, B. D’Haene, and D. Stroobandt, Neural Netw. 20, 391 (2007).

[5] M. Lukoˇseviˇcius and H. Jaeger, Comput. Sci. Rev. 3, 127 (2009).

[6] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, Phys. Rev. Lett. 120, 024102 (2018).

[7] L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutierrez, L. Pesquera, C. R. Mirasso, and I. Fischer, Optics Express 20, 3241 (2012).

[8] L. Larger, A. Bayl´on-Fuentes, R. Martinenghi, V. S. Udaltsov, Y. K. Chembo, and M. Jacquot, Phys. Rev. X 7, 011015 (2017).

[9] J. Torrejon, M. Riou, F. A. Araujo, S. Tsunegi, G. Khalsa, D. Querlioz, P. Bortolotti, V. Cros, K. Yakushiji, A. Fukushima, H. Kubota, S. Yuasa, M. D.

Stiles, and J. Grollier, Nature 547, 428 (2017).

[10] S. Tsunegi, T. Taniguchi, K. Nakajima, S. Miwa, K. Yakushiji, A. Fukushima, S. Yuasa, and H. Kubota, Appl. Phys. Lett. 114, 164101 (2019).

[11] K. Nakajima, H. Hauser, T. Li, and R. Pfeifer, Soft Robotics 5, 339 (2018).

[12] K. Fujii and K. Nakajima, Phys. Rev. Appl. 8, 024030 (2017).

[13] H. Jaeger, “Short term memory in echo state networks,” (2002), GMD-Report 152, GMD-German National Research Institute for Computer Science.

[14] J. Dambre, D. Verstraeten, B. Schrauwen, and S. Mas- sar, Sci. Rep. 2, 514 (2012).

[15] N. Bertschinger and T. Natschl¨ager, Neural Comput. 16, 1413 (2004).

[16] J. Boedecker, O. Obst, J. T. Lizier, N. M. Mayer, and M. Asada, Theory Biosci. 131, 205 (2012).

[17] I. Farkaˇs, R. Bos´ak, and P. Gergeˇl, Neural Netw. 83, 109 (2016).

[18] O. L. White, D. D. Lee, and H. Sompolinsky, Phys. Rev. Lett. 92, 148102 (2004).

[19] A. Rodan and P. Tino, IEEE Trans Neural Netw. 22, 131

(2011).

[20] M. Hermans and B. Schrauwen, Neural Netw. 23, 341 (2010).

[21] S. Marzen, Phys. Rev. E 96, 032308 (2017).

[22] S. Ganguli, D. Huh, and H. Sompolinsky, Proc. Natl. Acad. Sci. USA 105, 18970 (2008).

[23] J. Schuecker, S. Goedeke, and M. Helias, Phys. Rev. X 8, 041029 (2018).

[24] H. Sompolinsky, A. Crisanti, and H. J. Sommers, Phys. Rev. Lett. 61, 259 (1988).

[25] T. Toyoizumi and L. F. Abbott, Phys. Rev. E 84, 051908 (2011).

[26] B. Cessac and M. Samuelides, Eur. Phys. J. Spec. Top. 142, 7 (2007).

[27] M. Massar and S. Massar, Phys. Rev. E 87, 042809 (2013).

[28] L. Molgedey, J. Schuchhardt, and H. G. Schuster, Phys. Rev. Lett. 69, 3717 (1992).

[29] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (John Wiley & Sons, Hoboken, NJ, 2006).

[30] M. Prokopenko, J. T. Lizier, O. Obst, and X. R. Wang, Phys. Rev. E 84, 041116 (2011).

[31] L. Livi, F. M. Bianchi, and C. Alippi, IEEE Trans. Neu- ral Netw. Learn. Syst. 29, 706 (2017).

[32] B. A. Marquez, L. Larger, M. Jacquot, Y. K. Chembo, and D. Brunner, Sci. Rep. 8, 3319 (2018).

[33] D. Sussillo and L. F. Abbott, Neuron 63, 544 (2009).

[34] H. Jaeger, M. Lukoˇseviˇcius, D. Popovici, and U. Siewert, Neural Netw. 20, 335 (2007).

[35] M. Inubushi and K. Yoshimura, Sci. Rep. 7, 10199 (2017).

designed for accessibility and to further open science