Memory-Loss is Fundamental for Stability and Distinguishes the Echo State Property Threshold in Reservoir Computing & Beyond

2020·arXiv

Abstract

1 Introduction

The ability of nearly all living systems to process and act on external stimuli indicates life’s capabilities of computing (e.g., [1]). In the human-made world, we are in the era of having millions of artificial neurons on silicon chips, and building biological computational devices using engineered biological units like yeast cells (e.g., [2]). The intriguing question is what exactly is required by these computing paradigms, be it from the simplest to the most complex to respond distinctly and yet have input-related stability, i.e., slightly different versions of the input should lead only to slightly different responses. The criterion of criticality employed in computation (e.g., [3, 4, 5]), often deemed a governing principle of certain brain dynamics seems to be one of the successful paradigms in forgetting internal states while responding to the external stimuli or input. However, the question remains as to whether such computation always exhibits input-related stability? Moreover, next, is there any other computing paradigm that provides input-related stability? This article aims to provide conclusive answers to these questions.

In analogy with how a substance like water that raises its temperature by excitation due to external heat and also can only dissipate heat if its temperature is above a threshold (see [6] for a thought experiment), a driven dynamical system can exhibit two competing dynamical responses, one that allows the temporal input to excite the system in the sense the system states are made to scatter apart at certain times and the other to quench the input by bringing states together at some other times. Mathematically speaking, such contrasting dynamics requires nonlinearity, and the quenching would result in the dynamics be bounded. Systems that are nonlinear and that have bounded dynamics are abundantly found in many models. They include ecosystems responding to external signals ranging from predator-prey models to tumor growth (e.g., [7]), many classes of artificial systems ranging from open-loop control systems (e.g., [8]) to neural network models used in machine learning and artificial intelligence encompassing the field of reservoir computing (e.g., [9]), in all of which the external input could also be by design.

Memory-loss in such driven systems appears when the net effect of the bringing states together and scattering the states of the system results in the diameter of the state asymptotically decreasing to zero so that all trajectories tend to coalesce into a single trajectory (see [6] for examples). In that spirit, we call a system to have a memory-loss if a single trajectory emerges as a proxy of the input, or in other words, echoes the temporal input in the state space. Memory-loss is a property of both the temporal input and the system, a fact unfairly neglected mostly in the literature. The idea of relating memory-loss to the presence of a single proxy corresponding to an input is illustrated for both continuous-time (see [6, Section 1]) and discrete-time dynamical systems as the echo-state property with respect to an input (representative work is in [10]). Much of the literature considers the notion of memory loss to all possible input sequences. Although these ideas date back to the works in [11, 12], they were formalised as different notions such as the fading memory property in [13], the input forgetting property in [14], the echo state property, the state contraction, the state forgetting property in [15]. A rigorous treatment of several of these notions is available in [16, 17]. The idea of exploiting memory-loss for applications dates back to work in [18], however, realizing it for computation and information processing was conceived independently by two research groups – Jaeger et al. in [15, 19] as the Echo State Property (ESP) and Maass et al. in [20] as the Liquid State Machine. Collectively the methodologies are referred to as reservoir computing (RC), and some class of these systems can approximate any input-output function to arbitrary precision [16]. Today, RC has touched upon the era of hardware implementations (e.g., [21, 22, 23]) with photonic chips, memristors, spintronic nanodevices with even applications such as in prosthetic control and epileptic seizure detection besides classical application of prediction, filtering and information processing of temporal data (e.g., [9] and references therein).

A fundamental question in computing or modeling by driven dynamical systems is to find if temporal inputs are mapped continuously onto the state space of the dynamical system – input-related stability – failing which, certain temporal patterns in the input could drive the system into an unstable regime where it would behave wildly in the sense that small changes in the input render a drastically different system response. Input-related stability being a function of both the system and the input, in practice, a problem is either to examine if an input given a system, or if a design parameter of the system given an input, renders input-related stability. Another problem would be to understand the change in the design parameter – parameter-related stability for a given input – failing which, change in the parameter could give a drastically different response. For a representation in the state space to capture the information content of the input, better separability of inputs in the state space is needed (see example in [6, Section 1]). In examples such as in RC, one would be interested in maximizing the freedom of the driven system, i.e., maximize the separability of the inputs to distinguish close-by-variants of an input in the state space of the system, but yet not tipping the system into what authors following [4] have called it the “chaotic” behavior. While RC is used in a task of classifying inputs, such tipping reflects on a trade-off between input separability and input-related stability. This is also the hallmark of criticality, and such close to criticality operation has been found to be self-organized, like in biological neural systems (e.g., [24]), or can be tuned into, in artificial systems like RC. The criticality hypothesis drives the need to operate these systems under high-efficiency and thus to choose the design parameters more optimally and yet avoid instabilities. The challenge of doing so is overwhelming since there is no clear understanding of how the entire temporal input is mapped onto a different space, which is usually higher-dimensional.

A design parameter (not necessarily a scalar and includes the case of a vector or a matrix) , an input value u, a state-variable x in a compact state space X and a continuous function ) that takes values in X would constitute a parametric driven system (we denote this system by g throughout, and other entities silently understood). The dynamics of a parametric driven system g is rendered through a temporal input ¯) via the equation an ¯, a sequence ¯) contained in X that satisfies the equation is an image of ¯u in the space X (and also called a solution of g for a given ¯u). Having fixed , suppose we collect the nth component of all images ¯, and denote it by ), then we have a collection of sets which we call it the representation of ¯u. Following the intuitive idea of memory-loss as in [10], we say that the driven system g has echo-state property (ESP) with respect to ( w.r.t. ) an input ¯if it has exactly one solution or one image ¯x in the space X or (equivalently) if for each is single-valued, i.e., it is a singleton subset of X, i.e., there is no possibility of two or more reachable states at any given point in time n if the entire past values of the input have been fed into the driven system.

We observe that ) can be determined only by the left-infinite segment of the input ). Hence, we define ) as the encoding of () turns out (see Eq. (1)) to be the countable intersection of the sets )) and so on (see Fig. 1 for a schematic description). Further, ) happens to be also the limit of the sequence of sets (see Eq.(2)). Given the input segment (and a subset of , consider the set recursively obtained by ), and so on. From the viewpoint of computer simulations, if the cardinality of is sufficiently large, the set is an approximation of ), and due to the aforementioned limit, is an approximation of

Figure 1: Schematic figure to explain the sets is a square (boundary in black), ) is the square-like object (boundary in red), and when ) is the triangle (boundary in blue) then ) is the oval (boundary in green).

Our approach is to understand the influence ofseparately on the encoding ). To do this, we vary

), which we call it the . Next, we vary keepingu a constant, i.e., study ), which we call it the parameter-encoding map (givenu). While we study the entire representations in these two cases, i.e., when we study ¯for a given for a given ¯u, we call these functions the input-representation and parameter-representation maps respectively.

Both these encoding maps are set-valued functions, and in general, a set-valued map S(x) is continuous at x when it is both lower semicontinuous (l.s.c) and upper semicontinuous (u.s.c) that respectively precludes explosions and implosions at x (see [6, Fig. S3] for a schematic picture and definition). We remark that the continuity of a set-valued map S(x) can be equivalently defined by employing the Hausdorff metric (see Section 3).

We make two natural assumptions on driven systems that ensure the existence of a subdomain where states do not “collapse” to a single point but can contract the state space volume. We say a parametric driven system g to be an open-parametric driven system if X is connected and the mapping ) does not map any set with uncountably many points contained in X to a single element in X. This further ensures that any non-zero state space volume does not “suddenly collapse”. If ) is invertible for all parameter-values and input values u under consideration, then g is obviously an open-parametric driven system. Also, one can easily show that an open-parametric driven system g has the ESP w.r.t. ¯u (or to u) if ) is a singleton subset of X (see Section 3 for details). Further, if a driven dynamical system has a “contracting subdomain” in its state space for each , we characterize it by the terminology of contractibility. Precisely, given an open-parametric driven system , if there exists an input ¯v such that g has the ESP w.r.t. ¯, then we say that . Contractibility does not mean a contraction mapping but is a condition that ensures the existence of an input ¯v so that for every parameter can push the dynamics into a subset of X so that we have a net contraction in the sense g has the ESP w.r.t. ¯v. The example of a recurrent neural network (see Section 2) is an example of a driven system that satisfies these two hypotheses for our numerical simulations.

The paper is organized as follows. In Section 2, we state three results concerning stability and criticality of driven systems with less formalism and illustrate them using numerical results obtained from recurrent neural networks. In Section 3, we restate our results formally and prove them – the mathematically inclined reader may choose to peruse Section 3 prior to reading Section 2. In Section 4, we summarise the results and mention their broader implications. The supplementary material [6] available online contains a discussion on the intuitive meaning of the memory loss property, a slightly more elaborate version of Section 3 and additional numerical results.

2 Three Stability Results

Mathematically, the continuity of the input-encoding map atu means input-related stability atu and the continuity of the parameter-encoding map at means parameter-related stability at . The mathematical results are in Section 3, and here we state the results with less formalism and illustrate them with numerical results. Additional details used in obtaining the numerical results presented here are available in [6, Section 6].

To illustrate our results, we consider the example of a recurrent neural network (RNN). RNNs are employed for processing temporal data and are inspired by how the brain maps a stimulus onto its enormously higher-dimensional space by affecting the states of a vast number of neurons. An advantage observed in projecting data onto a higher-dimensional space is that it exaggerates certain data features to render better separability of inputs. In the RC framework employing RNNs, the temporal input signal is mapped on to a higher dimensional space (reservoir) without training. And then, a simple regression technique is used to obtain linear observables of the higher-dimensional entity that aims at an approximation of the desired function of the temporal signal (e.g., [15]). We consider RNNs that are driven systems of the form ), where tanh() is (the nonlinear activation) tanh performed component-wise on , a real-valued parameter would correspond to the scaling of the reservoir (invertible) matrix B of dimension N and A is a matrix with input connections and (the cartesian product of N copies of [1]). To maintain uniformity across numerical experiments, we set B to have a unit spectral radius. The driven dynamics through the equation ) is interpreted to be representing the evolution of an ensemble of neurons (coordinates of X). The function tanh is an example of a saturation or an activation function that is nonlinear enough to“squash” the system state to a smaller subspace.

Before, we state our results, we observe that whenever is invertible, ) is always invertible since tanh is invertible on (, and hence the RNN is an open-parametric driven system when = 0 is not in the parameter space Λ. Also, the RNNs that we consider are always contractible except in the case of the input weight matrix A having a row with all zeroes. It is a result in [10, (ii) of Theorem 4.1] that for a RNN with an input-weight matrix A having non-zero rows and any norm of the reservoir matrix B, there always exists an inputv so that it has the ESP w.r.t. v, and hence such systems are contractible. The idea behind contractibility is that if the inputv is such that if the entity () is thrown into the saturation region of the activation function sufficiently often, one can always witness ESP w.r.t.v. Given any

v is readily obtained by scalingu to have a sufficiently large amplitude.

For an open-parametric dynamical system g we present the following three results (R1), (R2) and (R3) that are proved as Theorem 1, Theorem 2 and Theorem 3 respectively in Section 3: (R1). If g is contractible at , then the input-encoding map. i.e., ) and the input-representation map are both continuous atu if and only if g has the ESP w.r.t.

u. (R2). If g has the ESP w.r.t.u then both the parameter-encoding map. i.e., ) and the parameter-representation map are both continuous atu. (R3). The parameter-encoding map continuous at if and only if is equicontinuous at

The result (R1) states that ESP is fundamental to avoid instabilities (discontinuous changes) due to small changes in the input in a wide-class of driven systems. The contractibility condition produces dissipative dynamics that resets the dynamics in every coordinate of the state space (a neuron in the case of a RNN), thereby inducing memory-loss. This is illustrated intuitively in a thought experiment in [6]. Further to (R1), we observe in Corollary 3.1 that the ESP always implies input-related stability in systems that are not necessarily contractible. However, without the contractibility hypothesis, (R1) would not hold, for example, the driven system

The scenarios when the system has ESP, and when it does not have the ESP w.r.t. anu is shown schematically in (a) and (b) of Fig. 2. At a point of discontinuityu of the input-encoding map, the map is u.s.c but not l.s.c at ¯u which means that there is an explosion in the set (see Lemma 3.2). Thus the input-encoding map becomes very sensitive to a small change in input and behaves wildly as schematically indicated in (b) of Fig. 2 (see [6, Fig. S4] for another illustration). It is not feasible to simulate the input-representation map (or the input-encoding map) by fixing varying the inputs since each input lies in an infinite-dimensional space. However, we demonstrate in (c) of Fig. 2 that a sinusoidal input (in black) solely guides the dynamics of a driven system by plotting (in red) a coordinate of points when there is the ESP w.r.t. the input, and that the system response (a coordinate in blue) shows rapid fluctuations incommensurate with the input when it does not have the ESP w.r.t. the same sinusoidal input. During the training of RNNs in the RC framework, a linear observable of the solution that is designed is called a read-out. The read-out map is a linear transformation of the state , and hence a continuous map of the state . Whenever the input-encoding map is discontinuous, the collection of all possible read-outs as a function of the parameter or the input ¯u, i.e., any such resultant read-out encoding map would then be a composition of a discontinuous map (the input-encoding or the parameter-encoding map) and a continuous map (coming from the linear read-out) map. We know that the composition of a discontinuous map and a continuous map is not continuous unless in very specific cases (like the read-out is a constant value). The read-out is plotted against time in (d) of Fig. 2. We observe that when the ESP fails, the read-out also shows rapid fluctuations that are incommensurate with the input. Hence the chance of mitigating the effect of discontinuity of the input-encoding map or counter-balancing the distorted system response to input with a read-out is remote.

Figure 2: Illustration of (R1): (a) & (b). Schematic of the graph of the input-encoding map when a system has the ESP in (a), and when it does not have the ESP in (b). The lighter grey values indicate the interior, while the darker values indicate the boundaries of the set-valued function’s graph, and the comb-like outward projections refer to the explosions due to the absence of l.s.c. (c). A coordinate of a simulated solution () plotted in red and blue against time n (for an RNN with a sinusoidal input shown in black) respectively when 02 (while it has ESP) and when 05 (while it does not have ESP); (d). A linear read-out of the states in (c) plotted against time n in red and blue when when 05 respectively.

We now illustrate (R1) from the viewpoint of the sensitivity to input noise to indicate the importance of the ESP, and also to show that some networks may strangely exhibit some robustness to noise even without the ESP. We consider the matrix in [25, Equation 6] and scale it to have a unit spectral radius to use it as the matrix B in a RNN (B is as in [6, Section 6]). It is known that there exist values of that the RNN with such a B does not have the ESP w.r.t. an input that is identically zero owing to a Neimark-Sacker bifurcation [25]. For the illustration considered, we remark that the system is found to have the ESP w.r.t. a sinusoidal input for and not while 8, such an inference is drawn using a parameter-stability plot (discussed later). We show the response of the system for the sinusoidal input and an input is a realization of uncorrelated noise uniformly distributed in the interval []. The responses (solutions) obtained from the same initial condition are shown in the top panel of Fig. 3 along with the differences

Figure 3: Figure to illustrate nuances with input-related stability: (Top Panel) A coordinate of a simulated solution of a RNN with a sinusoidal input and a solution with the noisy input is a realization of zeromean uniformly distributed noise with maximum amplitude 109. (Bottom Panel) Difference between a linear read-out out of the solutions obtained from the inputs

in the linear read-outs of the solutions in the bottom panel. It is remarkable that even with the noise of a magnitude not greater than 10, the differences in the solutions and hence the read-out is disproportionately large when the ESP fails at 9, the solutions do not have the sinusoidal shape; however, they are close to each other, and so are the read-outs. This quasi input-related stability depends on the network architecture and is observed to be exhibited only when the initial conditions (chosen to simulate the solutions) are from a proper subset of X. We believe that RNNs trained without the reservoir computing framework may exhibit such stability. Such stability is not generic and dependent on the matrix B and investigating into it is part of the author’s ongoing research.

It may be noted that the continuity of the input-encoding map (or the input-representation map) is local because it is defined for every single input, and further, it can be defined regardless of whether the ESP w.r.t. the input is defined or not. The notion of the fading memory property [13] is more of a global stability notion in the sense that it is defined on a subspace of input sequences for which the ESP w.r.t. all such inputs is a pre-requisite (see [16, 17]). Fading memory property and the input-related stability introduced here cannot be directly compared unless one extends the input-related stability to a global version, i.e., g is said to have the global input-related stability at parameter if the input-encoding map ) is continuous for allu. In that case, it is possible to show [26] that the fading memory property and the global input-related stability are equivalent for an open-parametric driven system that is contractible. The same equivalence can also be extended to the input forgetting property.

Result (R2) concerns parameter-related stability. It is clear from (R2) that when parameter-related stability fails, i.e., when the parameter encoding map is discontinuous, then the driven system does not have the ESP w.r.t. the concerned input. Hence input-related stability fails as well by (R1), and the discussion presented in Fig. 2 and Fig. 3 holds when parameter-related stability fails. A difference between parameter-related stability and input-related stability is that parameter-related stability is not necessarily equivalent to ESP as observed from (R2). In fact the converse statement of (R2) is not true (see examples in [6, Section 4]). It can be observed that we have not used the contractibility hypothesis to conclude the result in (R2).

It is possible to simulate the parameter-encoding map when is a scalar at discrete steps to observe the scattering of system states while the ESP is not satisfied. When the ESP is satisfied w.r.t.u, i.e., when ) is a singleton subset of X for some , a numerical estimate of ), and it would contain a single highly clustered set of points, and not so when ) is multi-valued i.e, when it is a subset of X with multiple elements. The result of identifying such a distinction can be made even with a small number of initial conditions. Fig. 4 shows two coordinates of the parameter-encoding map ) simulated at discrete set of points different RNNs. We consider ) with 2 neurons and with 250 neurons for the two RNNs, a randomly generated input of length 500, a input matrix A and a reservoir matrix B (chosen randomly, but with unit spectral radius). For the plot on the left in Fig. 4, we use 1000 initial conditions chosen on the boundary of (to exploit the fact that open mapping [27] and hence preserves the boundary). For the plot on the right, there were only 50 chosen samples in . Clearly, in both plots of Fig. 4, the clustering towards a single-point for smaller values of is obvious. One does not need to be worried about simulating the parameter-encoding map accurately if one’s task is only to identify if ) has a single point-cluster. Conversely, it is not possible to ascertain the parameter value at which the parameter-encoding map turns discontinuous or the parameter-related stability fails by using (R2). However, as we find in (R3), the continuity or discontinuity of the parameter-encoding map can be ascertained by the equicontinuous property of its finite time-approximations

In the particular case, when is a scalar parameter in an interval [a, b], given an open-parametric driven system g and an input] to be the ESP threshold if g has the ESP w.r.t.) and does not have the ESP w.r.t.). Further, suppose is the ESP threshold in an interval [a, b], and the parameter-encoding map is also continuous at would call to be a soft-ESP threshold, and call a hard-ESP threshold when parameter-encoding map is discontinuous at (examples are in [6, Section 5]). If

Figure 4: A simulation of the parameter-encoding map: Given a RNN with multiple nodes and an input, we employ several initial conditions (samples) in X to simulate elements in the set ). Two coordinates the ) are plotted against an increasing sequence . For the figure on the left, a reservoir with 2 neurons with 1000 samples in X was used, and on the right a reservoir with 250 neurons and 50 samples in X was used.

is a soft-ESP threshold, there would be a seamless transition of the parameter-encoding map from being single-valued to being multi-valued (as in (a) of [6, Fig. S3]) while crosses the soft-ESP threshold and hence there would be no instabilities (discontinuous changes) in the system response due to fluctuations in

The collective states of many interacting particles (elements) of a system show discontinuous behaviors when they undergo a “phase transition” due to the variation of a parameter (e.g. [28]). The parameters at which this happens are called critical points, and they mark a phase transition. For the lack of a general mathematical description of a phase transition, a critical transition is often identified or defined by the accompanying changes during a physical transition. These changes could be such as the emergence of a power-law distribution of some entities involved or observing a long-range correlation between the particles or by observing a transition from “order” to “disorder”. The issue with defining a critical transition based on the accompanying changes is that some of these changes can also be observed even while there is no criticality observed (e.g., [29]). Here, within a mathematical framework, we are in a position to describe the collective states of a driven system at a parameter , especially if it also represents a network with many nodes, as the set we have an open-parametric driven system, then it is enough to understand a critical transition at to be a sudden or discontinuous change in the set as

Based on experiments focused on living systems, Kauffman [3, 28] and on physical systems, Packard [30], Langton et al. [4] and Crutchfield et al. [31] conjectured that a dynamical regime between order and disorder optimally balances adaptiveness and robustness to input or stimuli. Based on these works, we here make the edge-of-criticality to refect on a transition that shows a change in the robustness to the input noise. By (R1), such a transition takes place exactly when the ESP is lost. Hence, we define the input-specific edge-of-criticality as the hard-ESP threshold, when both the parameter-encoding map (reflecting a change in the collective states) and the input-encoding map suddenly turn discontinuous. Note that in addition to having made the notion of edge-of-criticality precise, we have made this notion input-specific as it should be since the parameter-encoding map is input-specific. Thus applying (R3) to the case when is a real-parameter in the interval [as the ESP threshold, then is a soft-ESP threshold when is equicontinuous at a hard- ESP threshold or the edge-of-criticality otherwise.

There are heuristic methods in the literature to approximate the edge-of-criticality owing to the lack of a clear definition of the edge-of-criticality. Result (R3) or Theorem 3 distinguishes the two ESP thresholds defined above and, more generally, the continuous or discontinuous points of the parameter-encoding map through the notion of equicontinuity. Intuitively, equicontinuity is meant to deal with continuity of the entire set of functions at once. Formally, a family of functions between two metric spaces X and Y is equicontinuous at if for every there exists a 0 such that would imply all is said to be it is equicontinuous at every point A very simple example, is family of functions defined by 1] is not equicontinuous at 1 since for instance when 2, for every 0, there is an n large enough so that 1). The family of functions is equicontinuous for 1) though. If the parametric-encoding family has a similar behavior, then we witness a discontinuity in the parameter-encoding map as Theorem 3 asserts. Owing to the property of non-equicontinuity associated with the hard-ESP threshold or the edge-of-criticality we can identify the threshold in a numerical simulation.

To illustrate this simulation, we simulate the parameter-encoding map at a discrete set of parameter values (is a scalar in an interval [a, b]. We employ the definition of the Hausdorff metric (see Section 3). Given an inputu, an integer n, we consider a plot of which we call the parameter-stability plot and the procedure to obtain it is described next.

Procedure to obtain the parameter-stability plot. A step-wise procedure to numerically identify the discontinuous points of the parameter-encoding map and the edge-of-criticality or hard-ESP threshold for a givenu (when the parameter is a scalar) is presented next (we employ this procedure to determine in Fig. 5):

1. Fix equally distant points in an interval [a, b];

2. With M initial conditions of the reservoir and considering n successive values of the input, simulate the set

3. Find

)); the Hausdorff distance between two finite sets with identical cardinality M is computed practically by max(max(min()), where: (i) min(C) denotes the minimum of each row in a matrix C (ii) the (i, j) element of D is the norm between the ith element of th element of -norm could be employed in finite dimensions) (iii). is the transpose of D.

4. Obtain the parameter-stability plot, i.e., and identify the smallest where the plot of ) turns conspicuously positive as the edge-of-criticality or the hard-ESP threshold in the interval [a, b], and other where the plot is conspicuously positive as the points of discontinuities of the parameter-encoding map. If the plot has only constant behavior close to 0, then the parameter-encoding map is continuous everywhere.

While the parameter-encoding map is continuous at an ) can be made arbitrarily close to 0 by setting appropriately, and while it is discontinuous at )) would turn positive despite being small since the parameter-encoding family is not equicontinuous (by (R3)) and due to the fact that ). Hence, we can identify the smallest value of at which the parameter-stability plot turns distinctly positive as an estimate of the input-specific edge-of-criticality or the hard-ESP threshold.

The parameter-stability plot of an RNN is presented in Fig. 5 (a). In Fig. 5 (b), we plot the data of nine parameter-stability plots that are obtained by n successive values of the input, and different shifts of the input j, where the input and the RNN employed is the same as in (a) of Fig. 5. By the definition of an open-parametric driven system, we know that is an ESP threshold for a driven system with input u if and only if it is also the ESP threshold with the input ), where the function ) shifts each left-infinite sequence in its argument to the right by j units. We observe that determined with the plot in Fig 5 (b) is robust to both the different shifts and even shorter input lengths n and corroborates with the estimate of determined in Fig 5. Further simulation results that show the robustness of the results due to variation in and the effect of input scaling are presented in [6, Section 7].

Figure 5: Parameter-Stability plot(s) of an RNN:

plotted against 005. (a) With a randomly generatedu and n = 500 (b) data sets of nine parameter-stability plots (each set with a different colour) obtained by the input is varied from 0 to 80 in steps of 10, and with respective input lengths being varied as . The smallest value of where the plot turns distinctly positive is the estimate of the edge-of-criticality or the hard-ESP threshold in the interval [0.7, 1.5].

3 Mathematical Details and Proofs

We present the rigorous versions of the results (R1), (R2) and (R3) and prove them. The content here is also reproduced as three separate sections, along with additional examples in [6, Sections 2,3,4]. In our analysis, wherever a cartesian product of spaces is considered, the cartesian products are endowed with the most commonly used topology called product topology (e.g., [27]). We recall that a function taking values in a cartesian product equipped with product topology is continuous if and only if its coordinate functions are all continuous [27].

While we refer to a parametric driven system here, we bear in mind that it would comprise a parameter space Λ, an input space U, both topological spaces and a compact metric space X along with a continuous map X is a metric space with metric d, we denote by the collection of all nonempty closed subsets of X. On the space we equip the Hausdorff metric defined by

is the open -neighborhood of the set A. It is well known that is compact when X is compact.

The set ) defined previously through a solution of g can be defined equivalently in forms that are more convenient to be employed in a proof. Towards that end, we would adopt a composition-operator (called a process in [10, 32]). Given a parametric driven system g, we denote Suppose a parametric driven system g has been fed input values starting at time m. Then the map g transports a state-value give a state-value . Formally, for every choice of ¯u and g, we define for all , the function that ‘transports’ a system state at x at time m through the inputs to the state at time n given by an composition-operator and

Thus, if the inputs are fed in that order to every system would have evolved at time n to one of the states in ). A simple observation is that the set inclusion ) holds for all ). Based on this observation, it follows that if the entire left-infinite input would have been fed to every ) is a decreasing sequence of sets, i.e., for any . Hence, if the entire left-infinite input had influenced the dynamics of the parametric driven system g, the system would have evolved at time n to an intersection of a decreasing sequence of sets:

When X is compact, each set ) is a closed subset of is a nonempty closed subset of X (a proof is available in [10, Lemma 2.1]). Note that ) is an element of the space is a compact metric space, every sequence of sets has a convergent subsequence. Whenever a sequence in , it means that 0 and we denote it by may be verified that (see for e.g., [33]) when is a sequence of decreasing sets, i.e., if and only if , i.e., the limit in the compact space always exists for a decreasing sequence of sets and is equal to the nested intersection of the sequence of sets. Hence an alternate definition of (1) would be

One can easily show that “a sequence is a solution of g obtained through an input ¯u if and only if belongs to ” (see [10, Lemma 2.1] for a proof). Hence the definitions of ) that we have considered are all equivalent.

The continuity of the function () can be defined by considering the topology induced on by the Hausdorff metric . Such continuity defined using the Hausdorff metric is equivalent [33] to the continuity of treated as a set-valued function of the variable (). We make use of this equivalence without further remarks. To define the continuity of a set-valued map, let X be a topological space and be a subspace of the power set of X, and let Z be another topological space. A set-valued function is continuous [33] at a point x if it is both upper-semicontinuous and lower-semi-continuous at x. A function S is upper-semicontinuous (u.s.c) at x if for every open set V containing S(x) there is a neighborhood U of x so that S(U) is contained in V ; S is lower-semicontinuous at x (l.s.c) if for every open set V that intersects S(x) there is a neighborhood U of x so that S(x) also has non-empty intersection with . (see [6, Fig. S3] for a schematic picture). If S is u.s.c but not l.s.c at x then it has an “explosion” at x, and if S is l.s.c but not u.s.c at x then it has an “implosion” at x. The function S is called continuous or u.s.c or l.s.c if it is respectively continuous, u.s.c and l.s.c at all points x in its domain.

We use the fact that if S is both u.s.c at x and is single-valued, it is continuous at x [33]. An example of a map that is u.s.c but not l.s.c is the input-encoding map of a driven system when it does not have the ESP (see Theorem 1 and (ii) of Lemma 3.2). We point out that the “graphs” of such set-valued maps could behave wildly. It is not possible to explicitly describe such maps where the arguments are infinite-dimensional. An accessible example with a scalar argument is the set-valued function f defined by

The function f is u.s.c for all x but l.s.c only when x is irrational and at x = 0.

Given a parametric dynamical system g, for every given Λ, we call the set-valued mapping ¯The following theorem concerns input-related stability (version of (R1)).

encoding map) An open-parametric driven system g that is contractible at ESP w.r.t. ¯the input-representation map ¯is continuous at ¯the input-encoding map ¯is continuous at ¯u.

The proof of Theorem 1 would be presented after establishing the relationship of the ESP w.r.t. ¯u and the input-encoding map as stated in Lemma 3.1. The pedantic reader may note that Theorem 1 would not be true without the contractiblility condition. For instance, we recall an example from Section 2 that if X = [0, 1] and regardless of the input ¯u and hence the input-representation map ¯is continuous (i.e., both u.s.c and l.s.c), but there is no ESP to all inputs since ) is not a singleton subset of X for every n.

Lemma 3.1. (Continuity of the input-encoding map) An open-parametric-driven system g that is contractible at has the ESP w.r.t.if and only if the input-encoding map is continuous atu.

) The open-parametric driven system g has the ESP w.r.t.equivalent to saying that ) is a singleton subset of X. By (ii) of Lemma 3.2, ) is u.s.c atu. We use the fact that a map that is u.s.c and is a singleton subset of the space X is continuous to conclude ) is a continuous atu.

() be continuous at

. This means ) contains a subset of X that has at least two elements of X. Hence the encoding map evaluated at () is bounded away from all singleton subsets of

Since g is contractible at

v be an input for which ) is a singleton subset of X. We define the sequence of left-infinite sequences (). Since, the first n elements of are identical to that ofu, it follows that the sequence

in the product topology on where is the infinite cartesian product space

Since

v) is a singleton subset of X, we find that ) is a singleton subset of X since

where

(notations to avoid obscurity in the subscripts). Since ) is a singleton subset of X, it follows that

all n. This implies ) does not converge to

This contradicts our assumption that ) is continuous at

Corollary 3.1. If an open-parametric-driven system g that is not necessarily contractible at has the ESP w.r.t.then the input-encoding map tinuous atu.

Proof of Corollary 3.1. The first paragraph of the proof of Lemma 3.1.

Proof of Theorem 1. From Lemma 3.1, we have: an open-parametric driven system g that is contractible at has the ESP w.r.t. ¯if and only if the input-encoding map ¯) is continuous at ¯u. It remains to be proven that the input-encoding map ¯) is continuous at ¯u if and only if the input-representation map ¯is continuous at ¯u. Equip the representation space with the product topology. Let ). Clearly, ) and the mapping ¯is identical to The following equivalent statements (i)–(vi) lead to the proof: (i) ¯is continuous if and only if (iff) the coordinate mappings ¯) is continuous for all (by definition of convergence in product topology) (ii). ¯) is continuous for all continuous for all (by definition of

) is continuous for all has the ESP w.r.t.(by Lemma 3.1). (iv). g has the ESP w.r.t.has the ESP w.r.t.is open-parametric driven system) (v). g has the ESP w.r.t.has the ESP w.r.t.u (by definition ofhas the ESP w.r.t.

) is continuous (by Lemma 1).

We remark that the topology on representation space R can be obtained through a metric [27]. An example is . An analogously defined metric can be used as a metric for the input sequence space We make use of Lemma that follows from the assumption that g is continuous and X is compact. Lemma is proved in [6, Lemma 2].

Lemma 3.2. Let g be a driven dynamical system. Then

(iii). Every function in the input-encoding family or in the parameter-encoding family is continuous.

The following theorem establishes that the ESP ensures the continuity of the parameter-encoding map and is a rigorous version of (R2).

. If an open-parametric driven system g has the ESP w.r.t.then the parameter-encoding map is continuous at . More generally, if g has the ESP w.r.t. ¯then the parameter-representation map is continuous at

and hence the corresponding u. We note that from (i) of Lemma 3.2, a parameter-encoding map is u.s.c. Further, whenever it is single-valued it is continuous. Hence ) is continuous at

Note that and (the representa- tion space) R is equipped with the product topology. We know that is continuous if and only if the coordinate mappings ) is continuous for all ). Clearly, mapping ) is identical to the parameter-encoding map From Lemma 3.2, a parameter-encoding map is always u.s.c, and when it is single-valued it is continuous. Since, g is an open-parametric dynamical system, is a singleton subset of X for all n if and only if ) is a singleton subset of X. When g has the ESP w.r.t.it follows that is continuous.

We recall from Section 2 that when we restrict the parameter space to be an interval [a, b] the input-specific soft-ESP threshold and hard-ESP threshold as a parameter value ] for which the parameter-encoding map is continuous and discontinuous respectively in the scenario where g has the ESP w.r.t.u for all parameter values either to the left or right of (examples are in [6, Section 4]). The following theorem establishes conditions on and is a rigorous version of the result (R3) .

Theorem 3. Consider an open-parametric-driven system g and an input u. The parameter-encoding map is continuous at if and only if is equicontinuous at . In particular, when is a real-parameter in the interval [the ESP threshold, then is a soft-ESP threshold when is equicontinuous at and a hard-threshold otherwise.

Proof. We first show that

u)} is equicontinuous at continuous at . To prove this it is sufficient to show that given a sequence, we have ). Fix a sequence . We know by definition of the sets

) for every . Hence for a given 0, we can define ) to be an integer such that for every , we deduce

whenever is equicontinuous at exists an integer K such that for

Next, to show

u) is continuous at is equicontinuous at prove its contrapositive. Let be not equicontinuous at some fixed there exists an 0, a sequence and a sequence of integers that

Since

0, we can find an integer M such that the inequality

holds. Note ) is continuous for any finite M by (iii) of Lemma 3.2. Assume )) is continuous at . Hence, by continuity the set contains a neighborhood of the space Λ. Since ) is monotonically decreasing,

monotonically. Hence exists an integer so that for all the term )) in (4) can be made less than Thus, for we can rewrite (4) as

Now, by the assumption of continuity of , there exists an integer that

), there exists an integer 3which is absurd. Hence our assumption )) is continuous at is incorrect. When Λ = [is the ESP threshold, setting above proof, we obtain to be the soft-ESP threshold and the hard-ESP threshold when is equicontinuous and not-equicontinuous respectively at

4 Applications, Signiﬁcance and Discussion

With computation, dynamics, and information being intermingled concepts, we find some interconnections between them. The manifold hypothesis (e.g., [34]) is based on the view that phenomena in the physical world more often lie in or around a lower-dimensional manifold that can be embedded in a higher-dimensional manifold. This has been observed in several fields, and in particular, well-explored in machine learning [34]. When it comes to driven dynamics as considered in this article, the input ¯u renders the system compute a representation that is usually a low-dimensional manifold, attracting (see pullback attractors in [32]) and lies in the higher-dimensional manifold X. In a broad class of driven systems, we establish in (R1) that a continuous deformation of such a low-dimensional manifold (such manifolds exist, e.g., [35]) is feasible by feeding close-by variants of the input if and only if the system has the ESP w.r.t. to the input that aided the computation of the low-dimensional manifold. Such continuous deformations are necessary for stable computations in all systems that compute through driven dynamics. From the perspective of systems that are controlled or driven, result (R1) explains why memory-loss is needed to avoid unexpected behavior due to small changes in the temporal input.

Further, if parameters are a part of the design of driven systems, the representations are continuously deformed by a change in the parameter when the ESP is satisfied (result (R2)). Thus in a neurocomputing scheme, if the interconnections of the network are set as a design-parameter, then the ESP ensures stability due to small changes in interconnection strengths in the network. This result explains the empirically observed robustness of echo state networks with regard to the random choice of reservoir or input matrices in applications. We remark that such stability was explained in the context of a class of standard feedforward neural networks [36], but no such result is currently available for RNNs – an anonymous referee has pointed out interesting error estimates recently obtained for RNNs in [37]. ESP ensuring parameter-stability helps also design or explain stability in fully trained RNNs [38], and newer adaptations like deep recurrent neural networks [39], conceptors [40], databased evolving network models [41] and in nonlinear open-loop control with a wide range of applications [8].

In the context, where parameters need to be pushed to the limit to have better separability of inputs in the state space, it is important to identify this limit as the edge-of-criticality or the hard-ESP threshold. Unlike all earlier literature that describes the edge-of-criticality with phrases such as the boundary between order and chaos, or the midpoint between death and epilepsy, we define the input-specific edge-of-criticality mathematically owing to which we employ a procedure for numerically determining it using (R3). In the RC literature on RNNs, upper bounds on the attributes of the reservoir matrix (like spectral radius) to ensure the ESP for all possible inputs, i.e., the ESP w.r.t. to the entire space are available (e.g., [25]). More commonly, the criterion of the unit spectral radius of the matrix is currently being used to have the ESP w.r.t. all possible inputs. The disadvantage of these bounds or criteria is that they are not dependent on the input and particularly on the input’s temporal properties. In a practical application, often, only a class of inputs with specific templates are employed for a given task for the network. Hence the bounds are suboptimal on two accounts: (i) temporal relation in the input is ignored (ii) the gap between say the actual edge-of-criticality and the bound (like the unity of spectral radius of the reservoir) is unknown and we overcome this by using the parameter-stability plot. We anticipate that our procedure to determine the edge-of-criticality can be extended to a vector parameter in which case we could get a curve, surface, or a hypersurface as the hard-ESP threshold boundary. From the viewpoint of stochastic inputs, we remark that when inputs are drawn from a stationary ergodic process, the edge-of-criticality inferred from a single typical input is valid for all other generic inputs since the ESP holds with only either probability 0 or 1 [10]. This would mean when the hard-ESP threshold is obtained from a typical realization of an ergodic process as the input, the same threshold holds for other realizations with probability 1.

References

[1] Bray D (2009) Wetware: a computer in every living cell. (Yale University Press).

[2] Regot S, et al. (2011) Distributed biological computation with multicellular engineered networks. Nature 469(7329):207–211.

[3] Kauffman SA, , et al. (1993) The origins of order: Self-organization and selection in evolution. (Oxford University Press, USA).

[4] Langton CG (1990) Computation at the edge of chaos: phase transitions and emergent computation. Physica D: Nonlinear Phenomena 42(1-3):12–37.

[5] Kitzbichler MG, Smith ML, Christensen SR, Bullmore E (2009) Broadband criticality of human brain network synchronization. PLoS Comput Biol 5(3):e1000314.

[6] Manjunath G (2020). Supplementary Material that is available online along with this publication.

[7] Murray JD (2007) Mathematical biology: I. An introduction. (Springer Science & Business Media) Vol. 17.

[8] Terrell WJ (2009) Stability and stabilization: an introduction. (Princeton University Press).

[9] Lukoˇseviˇcius M, Jaeger H (2009) Reservoir computing approaches to recurrent neural network training. Computer Science Review 3(3):127–149.

[10] Manjunath G, Jaeger H (2013) Echo state property linked to an input: Exploring a fundamental characteristic of recurrent neural networks. Neural computation 25(3):671–696.

[11] Wiener N (1966) Nonlinear problems in random theory.

[12] Volterra V (1959) Theory of functionals and of integral and integro-differential equations.

[13] Boyd S, Chua L (1985) Fading memory and the problem of approximating non- linear operators with volterra series. IEEE Transactions on circuits and systems 32(11):1150–1161.

[14] Chua L, Green D (1976) A qualitative analysis of the behavior of dynamic nonlinear networks: Steady-state solutions of nonautonomous networks. IEEE Transactions on Circuits and Systems 23(9):530–550.

[15] Jaeger H (2001) The echo state approach to analysing and training recurrent neural networks-with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report 148(34):13.

[16] Grigoryeva L, Ortega JP (2018) Echo state networks are universal. Neural Networks 108:495–508.

[17] Grigoryeva L, Ortega JP (2019) Differentiable reservoir computing. Journal of Machine Learning Research 20(179):1–62.

[18] Kocarev L, Parlitz U (1996) Generalized synchronization, predictability, and equivalence of unidirectionally coupled dynamical systems. Physical review letters 76(11):1816.

[19] Jaeger H, Haas H (2004) Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. science 304(5667):78–80.

[20] Maass W, Natschl¨ager T, Markram H (2002) Real-time computing without sta- ble states: A new framework for neural computation based on perturbations. Neural computation 14(11):2531–2560.

[21] Appeltant L, et al. (2011) Information processing using a single dynamical node as complex system. Nature communications 2:468.

[22] Kudithipudi D, Saleh Q, Merkel C, Thesing J, Wysocki B (2016) Design and analysis of a neuromemristive reservoir computing architecture for biosignal processing. Frontiers in neuroscience 9:502.

[23] Vandoorne K, et al. (2014) Experimental demonstration of reservoir computing on a silicon photonics chip. Nature communications 5:3541.

[24] Beggs JM (2008) The criticality hypothesis: how local cortical networks might optimize information processing. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 366(1864):329–343.

[25] Yildiz IB, Jaeger H, Kiebel SJ (2012) Re-visiting the echo state property. Neural networks 35:1–9.

[26] Manjunath G (2020) Embedding information onto a dynamical system. Under Preparation.

[27] Munkres J (2014) Topology. (Pearson Education).

[28] Kauffman S (1996) At home in the universe: The search for the laws of self-organization and complexity. (Oxford university press).

[29] Beggs JM, Timme N (2012) Being critical of criticality in the brain. Frontiers in physiology 3:163.

[30] Packard NH (1988) Adaptation toward the edge of chaos. Dynamic patterns in complex systems 212:293.

[31] Crutchfield JP, Young K (1988) Computation at the onset of chaos in The Santa Fe Institute, Westview. (Citeseer).

[32] Kloeden PE, Rasmussen M (2011) Nonautonomous dynamical systems. (American Mathematical Soc.) No. 176.

[33] Aubin JP, Frankowska H (2009) Set-valued analysis. (Springer Science & Business Media).

[34] G´eron A (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. (O’Reilly Media).

[35] Hart A, Hook J, Dawes J (2020) Embedding and approximation theorems for echo state networks. Neural Networks.

[36] Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1-3):489–501.

[37] Gonon L, Grigoryeva L, Ortega JP (2020) Approximation bounds for random neural networks and reservoir systems. arXiv preprint arXiv:2002.05933.

[38] Giles CL, Omlin CW (1994) Pruning recurrent neural networks for improved generalization performance. IEEE transactions on neural networks 5(5):848– 851.

[39] Hermans M, Schrauwen B (2013) Training and analysing deep recurrent neural networks in Advances in neural information processing systems. pp. 190–198.

[40] Jaeger H (2017) Using conceptors to manage neural long-term memories for temporal patterns. The Journal of Machine Learning Research 18(1):387–429.

[41] Manjunath G (2017) Evolving network model that almost regenerates epileptic data. Neural computation 29(4):937–967.

Designed for Accessibility and to further Open Science