System identification is a fundamental problem in control theory, reinforcement learning, econometrics, and time-series modeling. Given observations of the input-output behavior of a dynamical system, system identification seeks to estimate the parameters of the system. When the governing dynamics cannot be derived from first principles, this is an important tool for modeling the behavior of a system, allowing for downstream analysis and engineering. In this work we focus on the simplest possible dynamical system model—discrete-time, linear dynamical systems. Several recent works Simchowitz et al. (2018); Sarkar and Rakhlin (2018) have shown sharp rates for estimating the parameters of such systems in the passive case—where the system is driven by random noise. Here we seek to understand active system identification—given complete control over the inputs, how can we best excite the system to accelerate estimation? Dating back to the 1970s, significant attention has been given to the problem of how to best excite systems for estimation Mehra (1976); Goodwin and Payne (1977); Bombois et al. (2011) yet these works typically lack theoretical guarantees. To the best of our knowledge, we present the first provably correct method for active system identification. We show finite time and asymptotic sample complexity guarantees and characterize settings in which active input design yields performance improvements.
Formally, we consider linear dynamical systems (LDS) of the form:
where is unknown,
is unobserved process noise. We choose the input
sequentially, observe the state
, and wish to estimate
from this data. For simplicity
and ease of exposition, we assume is known, though all our results can be extended to the case where
is unknown. From an engineering perspective, assuming
is known is a reasonable assumption as one may have knowledge of
from the design of the system actuation. Throughout, we assume that
is the spectral radius of
. We are interested in estimating
in the spectral norm, in the case where our input is constrained to have bounded energy, that is:
for some constant
As we will show, the fundamental quantity that determines the sample complexity of estimation is the minimum eigenvalue of the covariates: . Optimally exciting the system is then equivalent to maximizing this quantity subject to the input power constraints. This quantity, however, depends on
, the parameter we wish to estimate, so cannot be optimized in practice.
Our main contribution is an algorithm which balances this tradeoff—progressively updating the inputs as the estimates of improve—and finite time bounds quantifying the estimation rate it achieves, as well as the number of samples necessary to guarantee the optimally exciting inputs are being played. In addition, we present a lower bound and asymptotic upper bound guaranteeing the asymptotic optimality of our algorithm. We show that playing Gaussian noise, even with an optimally tuned covariance, is insufficient to achieve this optimal rate. Our algorithm can be seen as an instance of adaptive E-optimal design Pronzato and P´azman (2013).
An important piece in our analysis is a new finite-time bound on the estimation error that holds when arbitrary periodic inputs are being played. Previous works Simchowitz et al. (2018); Sarkar and Rakhlin (2018); Dean et al. (2018) only consider inputs that are Gaussian or state feedback. These works emphasize obtaining bounds that scale properly with the spectral radius of the system. Following this, we develop bounds that avoid a poor scaling with the spectral radius. To the best of our knowledge, this is a novel result and may be of independent interest.
1.1. Related Works
A significant body of work exists on how to optimally excite dynamical systems for identification Mehra (1976); Goodwin and Payne (1977); Jansson and Hjalmarsson (2005); Gevers et al. (2009); Manchester (2010); H¨agg et al. (2013). An excellent survey of classical results can be found in Mehra (1974) and a more recent survey in Bombois et al. (2011). Broadly speaking, earlier works tended to focus on designing inputs so as to be optimal with respect to traditional experimental design objectives. More recent works Hjalmarsson et al. (1996); Hildebrand and Gevers (2002); Katselis et al. (2012) have focused on designing inputs to meet certain task-specific objectives—for instance, identifying a system for the purpose of control.
A primary difficulty in designing inputs for identification is that the design criteria, often some function of the Fisher Information Matrix, depend on the unknown parameters of the system. Several different approaches have been proposed to overcome this challenge. One line of work Rojas et al. (2007, 2011); Larsson et al. (2012); H¨agg et al. (2013) performs robust experimental design and optimizes a minimax objective. More comparable to our approach are works which perform adaptive experimental design Lindqvist and Hjalmarsson (2001); Gerencs´er and Hjalmarsson (2005); Barenthin et al. (2005); Gerencs´er et al. (2007, 2009)—alternating between estimating the unknown parameters and designing inputs based on the current estimates.
Existing works in active system identification lack sound theoretical guarantees and too often specialize results to single-input single-output systems. While several results guarantee asymptotic consistency Gerencs´er et al. (2007, 2009), most proposed approaches are heuristic and are validated only through examples. To our knowledge, no finite-time performance bounds exist. In addition, many works seek to optimize quantities that only describe the asymptotic behavior of the system— for instance minimizing the asymptotic variance—and it is unclear and unjustified if these are the correct quantities to optimize for over a finite time interval. Finally, existing works do not give precise, explicit algorithms.
Recently, considerable interest has been shown in the machine learning community towards obtaining finite-time performance guarantees for system identification and control problems. The latter category has primarily centered around developing finite time regret bounds for the LQR problem with unknown dynamics Abbasi-Yadkori and Szepesv´ari (2011); Dean et al. (2017, 2018); Mania et al. (2019); Dean et al. (2019); Cohen et al. (2019). Recent results in system identification have focused on obtaining finite time high probability bounds on the estimation error of the system’s parameters when observing the evolution over time Tu et al. (2017); Faradonbeh et al. (2018); Hazan et al. (2018); Hardt et al. (2018); Simchowitz et al. (2018); Sarkar and Rakhlin (2018); Oymak and Ozay (2019); Simchowitz et al. (2019); Sarkar et al. (2019); Tsiamis and Pappas (2019). Existing results rely on excitation from random noise to guarantee learning and do not consider the problem of learning with arbitrary sequences of inputs or optimally choosing inputs for excitation.
In the context of the existing literature, this work can be seen as the first rigorous treatment of active system identification and the first work to provide finite-time performance guarantees for the problem—bridging the gap between classical approaches and modern machine learning techniques. Indeed, our algorithm is similar to the adaptive input design approach in Lindqvist and Hjalmarsson (2001); our work can be seen as making their algorithm more precise and providing finite-time performance and asymptotic optimality guarantees. Our analysis framework is general enough it could be extended to different experimental design criteria proposed in the existing literature.
1.2. Notation
We will let denote the spectral radius of
denotes the spectral norm of a matrix.
hides log factors. We assume throughout that
though all results can be extended to more general noise distributions. Let:
and is the expected value of
is the expected value of
. In the case when the input is a deterministic, periodic signal of period
, then setting
and applying this input on the system with parameters A and B for all t, we denote the steady state covariates as:
where denotes the Discrete Fourier Transform of
holds by Parseval’s Theorem. Let
will denote an upper bound on the covariates:
will specify its precise form as needed. To aid in analyzing the transient behavior of a system, let:
is then the smallest value such that
always finite. We give a more thorough discussion of this parameter in Appendix A. To determine the optimal inputs, we will solve the following optimization problem. As we make clear in Section 2.1, the fundamental quantity that controls the sample complexity of estimation is the minimum eigenvalue of the covariates, the quantity OptInput maximizes:
Here is the set of frequencies we are optimizing over,
is the time horizon we will play the inputs for,
is the DFT of
is the set of mean-zero signals of length k with average power bounded by
. The constraint that the signal be mean zero is for technical reasons and does not affect the results. We let
denote the same set without the constraint that the signal be mean 0. In some cases we will overload notation, letting
but with the
term in the optimization replaced by M. In addition, we will sometimes use OptInput to refer to the maximum value of the optimization, and sometimes to refer to the inputs attaining that maximum—it will be clear from context which we are referring to.
Algorithm 1 proceeds in epochs, successively improving its input design as its estimate of proves. At each epoch, the input computed in the previous epoch is played (line 11), and
mated from the data collected (line 12). Using this estimate, a set of inputs are designed to excite the estimated system (line 15), and these inputs are played on the real system in the subsequent epoch, yielding a new estimate of
. This procedure continues with exponentially growing epoch length.
UpdateInputs pseudocode (full definition in Appendix A)
1: function 2: Check if
small enough to plan with all frequencies, if so set I = [k] 3: Otherwise set I to include frequencies we can guarantee will sufficiently excite the system 4: if
The FT flag in UpdateInputs controls how the inputs are designed. With FT = True (the finite time case), the algorithm does not take into account the expected future contribution due to noise when designing the inputs. Results for this case are outlined in Section 2.3. With FT = False (the asymptotic case), the algorithm does take into account the estimated future contribution due to noise when designing the inputs. Results for this case are outlined in Section 2.1.
2.1. Asymptotic Optimality of Algorithm 1
We show that our algorithm is asymptotically optimal—up to constants, no algorithm can estimate more quickly as
. We first present a lower bound for estimating linear dynamical systems actively. We call an algorithm
-locally-stable in A if there exists a finite time
such that for all
is the measure induced when the true matrix is
is the estimate obtained by the algorithm after t observations. The sample complexity
is the infimum of all times
satisfying the above definition. This condition was introduced in Jedra and Proutiere (2019) and allows us to avoid trivial algorithms that simply return
for all time. Also define:
Note that, by Lemma H.2 and Lemma H.3, this limit exists and is equal to the limit obtained by replacing with any other sequence
Theorem 2.1 Assume there exists finite k such that the input for any
small enough, any
-locally-stable in
algorithm will have:
Theorem 2.2 Assume we are running Algorithm 1 with FT = False. Then for any there exists a deterministic
such that, for any
is at an epoch boundary, we have:
, and, for small enough
and some universal constant C:
The proof of Theorem 2.1 is given in Section G and the proof of Theorem 2.2 is given in Section B.3. It follows that up to constant factors, Algorithm 1 is asymptotically optimal. The fundamental value present in both the upper and lower bound controlling the sample complexity of estimation is , the minimum eigenvalue of the expected covariates when the input u is being played. Optimally exciting the system for identification is then equivalent to choosing u so as to maximize
2.2. Suboptimality of Colored Noise
While Theorem 2.1 and Theorem 2.2 together show that the optimal performance can be attained in the limit by periodic inputs, it may seem reasonable that one could attain a similar rate by playing the optimal noise—setting for the optimal choice of
that satisfies the expected power constraint. We show this is false. Consider the following example. Let
be PSD with eigenvalues
, and assume that
. We show in the proof of Corollary 3.1 that
. In contrast, when playing
we show in Appendix I, we will have that
Note here that
upper bounds the minimum eigenvalue of the expected covariates when
. Depending on the values of
, there is clear gap between these quantities. For example, if
, the upper bound on the sample complexity of our algorithm is
while the lower bound on the sample complexity when playing optimal noise is
. Note that existing works on system identification Simchowitz et al. (2018); Sarkar and Rakhlin (2018) only apply to the case when the input is zero-mean noise and are thus insufficient to guarantee optimal rates.
2.3. Finite Time Performance of Algorithm 1
We next present our main result quantifying the finite time performance of Algorithm 1. Throughout, we let , the total time elapsed after i epochs, and k(T) the value of
steps. If T is at an epoch boundary,
Theorem 2.3 (Informal) Assume that is chosen sufficiently large relative to
large enough, with FT = True, Algorithm 1 will achieve the following rate:
and will produce inputs satisfying is a universal constant,
the solution to
Note that our finite time rate critically depends on the minimum eigenvalue of the expected covariates. At a high level, Theorem 2.3 provides a finite sample bound on the error in the estimates produced by Algorithm 1 and states that once T is large enough, despite lacking knowledge of the true system parameters, Algorithm 1 will play inputs that maximize . As was shown in Section 2.1, the fundamental quantity that controls the estimation rate is
which, in finite time, can be thought of as
, maximizing
is essentially equivalent to maximizing
then guar- antees in this case that Algorithm 1 plays the inputs that best excite the system for estimation.
The proof of this theorem is sketched in Section 4 and formally proved in Section B.1. A full version of this result is presented as Theorem B.1 in Appendix B, where we quantify formally how large T must be for the rate given in Theorem 2.3 to apply. Corollary 3.1 works this out explicitly in a simplified setting. Intuitively, T must be large enough for the transient effects of the last input to have dissipated, and for to be small enough to guarantee we are playing inputs that achieve nearly optimal performance. The former quantity scales as
. The latter depends on the system parameters in a complicated fashion. In the case where
is diagonalizable with largest and smallest magnitude eigenvalues
, respectively, and
allows for sufficient excitation of all modes, then when
, it will behave like
If
it will behave like
Remark 2.4 If is also unknown, it is still possible to run a procedure similar to Algorithm 1, choosing the inputs to improve estimation of both
simultaneously. In this case, we minimize the same least squares objective but now over both A and B. Theorem 2.6 can be modified to bound the error
, but the error scales instead with:
In this setting, the optimal design is one that maximizes this minimum eigenvalue. To obtain a result similar to Theorem 2.3, a version of Theorem 4.1 is needed to quantify how suboptimal our choice of input may be given only estimates of . A fairly straightforward extension of the argument used to obtain Theorem 4.1 can be used to argue such a bound, allowing a version of Theorem 2.3 to be proved.
Remark 2.5 The update of in Algorithm 1 requires knowledge of the true system parameters to compute
. In practice, bootstrapped estimates of these quantities could be used. Fur- ther, these terms only appear logarithmically and will not be the dominant terms in the expression. Experimentally, we found that greedily designing our inputs with respect to
, equivalent to solving
, yielded better performance and did not require any estimate of
2.4. Estimating Dynamical Systems With Periodic Inputs
As was shown in Section 2.2, exciting a system with random noise is insufficient to obtain optimal estimation rates. Relying on carefully designed periodic inputs, Algorithm 1 is able to attain this optimal rate. Showing this critically requires bounding the estimation error when arbitrary periodic inputs are being played. The following result quantifies this and can be thought of as a novel extension of Simchowitz et al. (2018); Sarkar and Rakhlin (2018) to non-noise inputs. This result may be of independent interest and is proved in Section E.
Theorem 2.6 Assume that we start from initial state and play input
deterministic with period k and average power
be some value satisfying
. Then as long as:
we have:
where are universal constants,
and is the (deterministic) response of the system to
Note, critically, the term in the denominator. This term quantifies how the estimation error scales in terms of the interaction between the input and the system.
We next present several corollaries to Theorem 2.3. Let for orthogonal V , real, diagonal
. Denote the eigenvalues of
. To aid in interpretability, assume that
is small enough to be thought of as a small constant factor,
. We then have the following.
Corollary 3.1 (Symmetric be some value satisfying:
then after steps, running Algorithm 1 with FT = True will produce an estimate satisfying, with high probability:
while instead playing for all time, our estimate will satisfy, with high probability:
In the high SNR regime of , the leading constant for the rate attained by Algorithm 1 behaves as
compared to a leading constant of
when playing
that in both cases the expected average power is
Now let be block diagonal matrices where
their jth blocks. Assume that it is known that
has this structure. For simplicity, assume
so that
denote the expected noise and input covariates of the jth subsystem.
Corollary 3.2 (Block Diagonal large enough, a version of Algorithm 1 slightly modified to account for the block structure, will have, with high probability, when FT = True:
In contrast, simply playing will, with high probability, achieve the following rate:
Intuitively, the rate obtained by Algorithm 1 scales as the average error in estimating each block, while the rate obtained by playing scales as the error of the worst case block. Note that while in both Corollary 3.1 and Corollary 3.2 we are comparing upper bounds, the leading constants in these bounds are identical to those obtained in the asymptotic lower bound, Theorem 2.1, and are thus unimprovable—the improvement in upper bounds we see in performing active estimation compared to playing noise are matched by the lower bound. Both corollaries are proved in Section C.
It is difficult to work out analytically what the performance will be when is a Jordan block. However, at an intuitive level, our algorithm should yield a large improvement over isotropic noise as the proper excitation of a Jordan block focuses nearly all the energy on the last coordinate in the block. This conjecture is supported by our experiments in Section 5.
To prove Theorem 2.3, our primary upper bound on the error in the estimates of produced by Algorithm 1, we first bound the error in the estimate of
obtained at the
th epoch, then bound the suboptimality of the inputs computed from this estimate, and finally bound the estimation error at the ith epoch in terms of these inputs.
Controlling the estimation error We rely on excitation due to noise to guarantee learning and bound
. This proof is similar to those given in Simchowitz et al. (2018); Sarkar and Rakhlin (2018) and is outlined in the appendix.
Bounding the suboptimality of the inputs. Given the estimate and past data
letting
denote the optimal inputs on the estimated system and
the optimal inputs on the true system, we wish to bound:
in terms of , as this will quantify how suboptimal our input’s response on the true system is. Theorem 4.1 provides such a bound in terms of
where is a measure of the smoothness of
with respect to
The full version of Theorem 4.1 is stated and proved in Appendix F. At a high level, the proof follows by upper bounding (2) in terms of the difference between This difference can be quantified in terms of the sensitivity of
to changes in
and, critically, does not require bounding the difference between
. The primary challenge in proving Theorem 4.1 is in avoiding standard matrix perturbation bounds of the form:
Depending on the structure of could be very ill-conditioned and (3) could be very loose. We instead show that it is sufficient to bound:
for a set M guaranteed to include the eigenvectors corresponding to the minimum eigenvalues of . Applying (4) instead of (3) with this M can save a factor of as much as
in the final perturbation bound.
Given this perturbation bound, we can quantify how suboptimal the inputs computed by solving OptInput on our estimated system are. As we make precise in Appendix F, the suboptimality depends on the frequencies our input signal contains. UpdateInputs carefully takes this into account, only playing inputs for which it can guarantee the system will be sufficiently excited. Ultimately, we are interested in exciting the system optimally, which requires that we have learned the system well enough to guarantee the performance at every frequency. We quantify this in Lemma D.1 and show that for sufficiently large T, we will be playing inputs that attain the optimal response. Controlling the estimation error in terms of the inputs. The final piece in the proof involves showing that, for the inputs being played,
, the estimation error will scale in accordance with how these inputs excite the true system. We can decompose the error in our estimate of
scales like
and can be handled using a self-normalized bound Abbasi-Yadkori et al. (2011); Sarkar and Rakhlin (2018). The primary difficulty is obtaining a lower bound on
in terms of the inputs being played. We in fact want to show something even stronger, that
, as this allows us to quantify precisely how an input affects the covariates, and how we can adjust the input to increase
. The following proposition is the key piece in proving such a lower bound. Proposition 4.2 (Informal) Consider
be a deterministic signal with period k. Assuming that
is large enough that the transient effects of the input have dissipated, we have:
The proof of this proposition is given in Section E. The main technical challenge comes in handling the interactions between the inputs and the noise. To avoid directly bounding these cross terms, we prove that the covariates over one period of the input are, with constant probability, lower bounded by the covariates obtained if running the system with no process noise. After enough periods, we show that with high probability the bound (5) holds. Given this pointwise lower bound, we can apply a similar argument to that in Simchowitz et al. (2018); Sarkar and Rakhlin (2018) to show the estimation error bound given in Theorem 2.6.
To complete the proof of Theorem 2.3, we effectively apply Theorem 2.6 to bound the estimation error in the ith epoch in terms of , and using the fact that
excites the system nearly optimally, conclude that we attain the optimal estimation rate.
Figure 1: diagonalizable by unitary matrix,
randomly generated
Figure 2: randomly generated, d = 5, p = 3
Figure 3: Jordan block with
Figure 4: Jordan block with d = 4,
We next validate our algorithm on several examples. Additional trials are included in Section J. We compare Algorithm 1 against three baselines: playing and playing the oracle set of inputs as computed by solving OptInput on the true system parameters.
is the covariance yielding the optimal noise excitation and can be computed via an SDP. We do not compare against existing works in active system identification as these works typically either require knowledge of
to implement, and so are not directly comparable, or propose approaches similar enough to ours (Lindqvist and Hjalmarsson (2001)) a comparison is not relevant.
We set . Rather than running the UpdateInputs function as stated, we plan greedily with respect to
—we do not restrict the set of allowable frequencies and set
. In every experiment we solve OptInput from a single random initialization and do not restart multiple times to obtain a globally optimal solution. We plot the error
against the iteration number. The solid lines show the averages over 50 trials (100 for Figure 2) and the shaded regions indicate the 10% and 90% percentiles.
Figures 1 and 2 illustrate the effectiveness of our approach as compared to exciting the system with noise—Algorithm 1 dramatically outperforms noise-based approaches and performs nearly as well as the optimal. Figure 3 investigates the performance of our algorithm when is unknown. Here we simultaneously solve for
and use our estimate of
when optimizing our inputs. As can be seen, this barely affects the algorithm’s performance.
At each epoch, Algorithm 1 devotes some amount of input energy to playing random noise. Let denote the variance of this noise. By default in Algorithm 1 we set
illustrates the performance of Algorithm 1 when
is varied. For a given
, all additional energy is devoted to the sinusoidal component of the input. As this plot illustrates, noise is not needed in practice to effectively learn and, when all energy is devoted to the sinusoidal inputs, the performance of Algorithm 1 almost immediately matches that of the optimal.
In this work we have presented an algorithm for active identification of linear dynamical systems. We show that our algorithm achieves optimal asymptotic rates and present finite time performance bounds quantifying how the interactions between the input and the system affect the estimation. This work opens up several possible directions for future work.
• OptInput is nonconvex so a globally optimal solution cannot be efficiently found. In practice, an alternating minimization approach can be used to compute a local optimum. While solving OptInput may be difficult, as our bounds show, the quantity being optimized is intrinsic to the problem. Developing algorithms to efficiently solve OptInput is an interesting future direction.
• Recent works in system identification Simchowitz et al. (2018); Sarkar and Rakhlin (2018) have emphasized obtaining bounds that do not scale with the mixing time of the system. Our error bounds do not scale with this quantity yet they require the transient effects of the inputs to have decayed. This condition seems necessary to cleanly quantify the performance and design inputs, yet may be possible to remove with a careful analysis of the transient behavior.
• This work only considers exciting the system with sinusoidal inputs. While we show this is sufficient to achieve optimal rates, one could also imagine choosing inputs that were a function of the current state. Dean et al. (2018) provides rates when a linear state feedback controller is used, but does not discuss how the choice of feedback could improve estimation. It is unclear a priori how effective it could be. At minimum, a carefully designed state feedback controller could be used to mitigate transient effects. We leave this direction for future work.
• A recent work Gonzlez and Rojas (2019) develops finite time bounds for estimating SISO AR(n) systems with n > 1. Extending this to MIMO AR(n) systems and allowing for active input design is an open problem and exciting future direction.
The authors would like to thank Yue Sun and Max Simchowitz for helpful comments. The work of AW was supported by an NSF GFRP Fellowship DGE-1762114. The work of KJ was supported in part by grant NSF RI 1907907.
Yasin Abbasi-Yadkori and Csaba Szepesv´ari. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 2011.
M¨arta Barenthin, Henrik Jansson, and H˚akan Hjalmarsson. Applications of mixed h2 and hinfin; input design in identification. IFAC Proceedings Volumes, 38(1):458–463, 2005.
Xavier Bombois, Michel Gevers, Roland Hildebrand, and Gabriel Solari. Optimal experiment de- sign for open and closed-loop system identification. Communications in Information and Systems, 11(3):197–224, 2011.
Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with onlyarXiv preprint arXiv:1902.06223, 2019.
Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample com- plexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.
Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
Sarah Dean, Stephen Tu, Nikolai Matni, and Benjamin Recht. Safely learning to control the con- strained linear quadratic regulator. In 2019 American Control Conference (ACC), pages 5582– 5588. IEEE, 2019.
Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite time identi- fication in unstable linear systems. Automatica, 96:342–353, 2018.
L´aszl´o Gerencs´er and H˚akan Hjalmarsson. Adaptive input design in system identification. In Proceedings of the 44th IEEE Conference on Decision and Control, pages 4988–4993. IEEE, 2005.
L´aszl´o Gerencs´er, Jonas M˚artensson, and H˚akan Hjalmarsson. Adaptive input design for arx sys- tems. In 2007 European Control Conference (ECC), pages 5707–5714. IEEE, 2007.
L´aszl´o Gerencs´er, H˚akan Hjalmarsson, and Jonas M˚artensson. Identification of arx systems with non-stationary inputsasymptotic analysis with application to adaptive input design. Automatica, 45(3):623–633, 2009.
Michel Gevers, Alexandre S Bazanella, Xavier Bombois, and Ljubisa Miskovic. Identification and the information matrix: how to get just sufficiently rich? IEEE Transactions on Automatic Control, 54(ARTICLE):2828–2840, 2009.
Rodrigo Gonzlez and Cristian Rojas. A finite-sample deviation bound for stable autoregressive processes. arXiv preprint arXiv:1912.08103, 2019.
Graham Clifford Goodwin and Robert L Payne. Dynamic system identification: experiment design and data analysis. Academic press, 1977.
Per H¨agg, Christian A Larsson, and H˚akan Hjalmarsson. Robust and adaptive excitation signal generation for input and output constrained systems. In 2013 European Control Conference (ECC), pages 1416–1421. IEEE, 2013.
Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. The Journal of Machine Learning Research, 19(1):1025–1068, 2018.
Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. In Advances in Neural Information Processing Systems, pages 4634– 4643, 2018.
Roland Hildebrand and Michel Gevers. Identification for control: optimal input design with respect to a worst-case -gap cost function. SIAM Journal on Control and optimization, 41(5):1586– 1608, 2002.
H˚akan Hjalmarsson, Michel Gevers, and Franky De Bruyne. For model-based control design, closed-loop identification gives better performance. Automatica, 32(12):1659–1673, 1996.
Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.
Henrik Jansson and H˚akan Hjalmarsson. Input design via lmis admitting frequency-wise model specifications in confidence regions. IEEE transactions on Automatic Control, 50(10):1534– 1549, 2005.
Yassir Jedra and Alexandre Proutiere. Sample complexity lower bounds for linear system identifi- cation. arXiv preprint arXiv:1903.10343, 2019.
Dimitrios Katselis, Cristian R Rojas, H˚akan Hjalmarsson, and Mats Bengtsson. Applicationoriented finite sample experiment design: A semidefinite relaxation approach. IFAC Proceedings Volumes, 45(16):1635–1640, 2012.
Christian Larsson, Egon Geerardyn, and Johan Schoukens. Robust input design for resonant systems under limited a priori information. IFAC Proceedings Volumes, 45(16):1611–1616, 2012.
Kristian Lindqvist and H˚akan Hjalmarsson. Identification for control: Adaptive input design using convex optimization. In Proceedings of the 40th IEEE Conference on Decision and Control (Cat. No. 01CH37228), volume 5, pages 4326–4331. IEEE, 2001.
Ian R Manchester. Input design for system identification via convex relaxation. In 49th IEEE Conference on Decision and Control (CDC), pages 2041–2046. IEEE, 2010.
Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalent control of lqr is efficient. arXiv preprint arXiv:1902.07826, 2019.
Raman Mehra. Optimal input signals for parameter estimation in dynamic systems–survey and new results. IEEE Transactions on Automatic Control, 19(6):753–768, 1974.
Raman K Mehra. Synthesis of optimal inputs for multiinput-multioutput (mimo) systems with pro- cess noise part i: Frequenc y-domain synthesis part ii: Time-domain synthesis. In Mathematics in Science and Engineering, volume 126, pages 211–249. Elsevier, 1976.
Samet Oymak and Necmiye Ozay. Non-asymptotic identification of lti systems from a single tra- jectory. In 2019 American Control Conference (ACC), pages 5655–5661. IEEE, 2019.
Luc Pronzato and Andrej P´azman. Design of experiments in nonlinear models. Lecture notes in statistics, 212, 2013.
Cristian R Rojas, James S Welsh, Graham C Goodwin, and Arie Feuer. Robust optimal experiment design for system identification. Automatica, 43(6):993–1008, 2007.
Cristian R Rojas, Juan-Carlos Aguero, James S Welsh, Graham C Goodwin, and Arie Feuer. Ro- bustness in experiment design. IEEE Transactions on Automatic Control, 57(4):860–874, 2011.
Tuhin Sarkar and Alexander Rakhlin. How fast can linear dynamical systems be learned? arXiv preprint arXiv:1812.01251, 2018.
Tuhin Sarkar, Alexander Rakhlin, and Munther A Dahleh. Finite-time system identification for partially observed lti systems of unknown order. arXiv preprint arXiv:1902.01848, 2019.
Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.
Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi- parametric least squares. arXiv preprint arXiv:1902.00768, 2019.
Anastasios Tsiamis and George J Pappas. Finite sample analysis of stochastic system identification. arXiv preprint arXiv:1903.09122, 2019.
Stephen Tu, Ross Boczar, Andrew Packard, and Benjamin Recht. Non-asymptotic analysis of robust control from coarse-grained identification. arXiv preprint arXiv:1707.04791, 2017.
Full Definition of UpdateInputs
10: // Otherwise, set I to include frequencies we can plan effectively with
13: // Check if we can plan optimally with frequency
17: end for
18: end if
19: // Update inputs
24: end if
25: return U
26: end function
Several comments on notation are in order. First, note that is the smallest value such that
is finite as long as
. More generally, we can upper bound
As r is increased, will decrease, but the decay rate will be slower. Note that if we set
so the cumulative behavior of the transient, which corresponds to , and the upper bound on
, will each be within a factor of 2 of their optimal possible values. Throughout the appendix, we will upper bound
. In nearly all cases, however, the expressions obtained that contain
can be replaced with a
by adding a factor of 2.
To simplify notation throughout the proofs, we will let Throughout the appendix, we will let
refer to the variance of the exploration noise, which is set by default to
We first present the full version of Theorem 2.3.
Theorem B.1 (Full version of Theorem 2.3) Assume that
T
Then for any:
Algorithm 1 with FT = True will achieve the following rate:
and will produce inputs satisfying are universal constants,
is the solution to
Several additional remarks are in order.
Remark B.2 For Theorem B.1 to hold, must be set to satisfy (6). This condition is necessary to guarantee that the burn-in time required by Theorem 2.6 is met at each epoch. Satisfying this condition requires knowledge of the unknown system so, in practice, we cannot guarantee that it will be met for some
. However, since Algorithm 1 increases
faster than
, regardless of how
are set, it will eventually satisfy the burn-in condition of Theorem 2.6, and so the conclusion of Theorem B.1 will eventually hold.
Remark B.3 Every line in UpdateInputs, with the exceptions of solving OptInput, is at worst a convex program and can be solved efficiently. Computing involves a lin- ear search over
and the computation of a minimum eigenvalue for each
will be an ellipsoid. Line 7 and line 12 also involve iterating over all
and for each
imizing a quadratic over an ellipsoid. Since the maximization of a quadratic over an ellipsoid can be solved via a single SVD, this step can be efficiently completed. While k is growing exponentially with the epoch, we only call UpdateInputs once per epoch. Since the epoch length is also increasing exponentially, the number of epochs is only logarithmic in T. Thus, the total number of flops is only linear in T. In practice, one should simply stop increasing k when a sufficiently fine discretization of the space is reached to obtain close to optimal performance. Experimentally, we found this worked quite well.
Remark B.4 The only constraint we place on the inputs is that their average power is bounded by some value. This constraint allows for signals with large amplitudes, a situation which is often highly undesirable in practice. To avoid this possibility, further constraints could be added OptInput to guarantee that the input computed has bounded amplitude as well as power. Unfortunately, amplitude constraints are non-trivial to enforce when optimizing in the frequency domain. Further, adding this constraint would cause us to lose the guarantee of global optimality of inputs. In practice, we have observed that the optimal inputs typically do not exhibit large spikes are other such undesirable behavior.
Remark B.5 The restriction that is necessary to guarantee that the system will reach steady-state when a new input is played. As such, all our finite time results fundamentally depend on this assumption. A first step towards relaxing it would be proving a version of Proposition E.2 that does not require the system has reached steady state. We leave this for future work.
We also note that, in some sense, the interesting regime for active system identification is when . As was shown in Sarkar and Rakhlin (2018), when all modes in
are unstable, the system can be estimated at an exponential rate. Thus, in this case, active identification is likely unnecessary. A more interesting regime may be when some eigenvalues of
have magnitude greater than 1, and some have magnitude less than 1. In this case active identification could be used to excite the modes corresponding to the smaller eigenvalues. We leave this direction for future work.
We next present our master theorem quantifying the performance of Algorithm 1. Algorithm 1 operates in three regimes. In the first regime, when is not large enough for the system to reach steady state, we are only able to guarantee learning due to the contribution of the noise. In the second regime,
is large enough for the system to reach steady state but
is not small enough for all frequencies to be playable. Finally, in the third regime,
is large enough to reach steady state and all frequencies are playable, allowing us to attain the optimal performance. All three regimes are quantified in Theorem B.6.
Theorem B.6 Assume that
T
Let . Then Algorithm 1 with FT = True will have:
In all cases, the inputs produced will satisfy:
Here are universal constants.
B.1. Proof of Theorem 2.3 and Theorem B.1
The proof of Theorem B.1 follows an event-based analysis. We define several events, show that they all hold with high probability, and that together they imply the rate given in Theorem B.1 holds. We outline the steps at a high level here.
We first must show that the estimate attained at the th epoch is sufficiently accurate to guarantee that we are playing inputs that achieve a response close to optimal. Defining the event
to be the event that
is this small, Theorem B.6 shows that this holds with high probability. To show that the value of
is sufficiently small to guarantee that our inputs are nearly optimal, we must show that
. This requires controlling the covariates in a specific direction,
we define below. Event
is the event on which this is controlled and Lemma E.7 shows that it holds with high probability. On this event, Theorem F.1 and Lemmas D.2 and D.1 guarantee that, given
this small, we will have that our inputs achieve a nearly optimal response.
The remaining events are needed to guarantee our estimation rate at epoch i holds. Event guarantees an upper bound on the covariates.
are both lower bounds on the covariates.
lower bounds the covariates from all epochs prior to epoch i in terms of the noise and
bounds the covariates from the ith epochs in terms of the input. The former is necessary for more technical reasons while the latter allows us to lower bound the covariates in terms of the inputs, which ultimately yields the rate that depends on the input response.
is shown to hold with high probability by Lemma E.7 and, conditioned on the covariates being upper bounded Lemma E.3 shows that
holds with high probability.
A slightly more subtle issue arises in showing that holds with high probability. For
hold, we must have that T is large enough to guarantee that the system has reached steady state in the ith epoch. Guaranteeing the steady state condition is reached requires the initial state at the start of the epoch,
, to be bounded. Given that such a bound holds, we can guarantee, in terms of this bound, that T will be sufficiently large for the system to reach steady state. Event
gives this upper bound on
shows that it holds with high probability. Given this and the burn-in condition required by Theorem B.1, it follows that the system will have reached steady state at epoch i. This, combined with
holding, allows us to apply Corollary E.5 to show that
holds with high probability.
Event next shows that the self-normalized term in the error is bounded. On the event that the covariates are upper and lower bounded as in
holds with high probability by Lemma E.6.
Finally, we show that if all of these events hold simultaneously, the “good” event, A, which guarantees that the rate in Theorem B.1 holds, is always true. Since all of these events hold with high probability, it then follows that A holds with high probability.
Let be the solution to
and define the following events:
Let denote the columns of
is any unit norm vector such that
that do not correspond to the minimum eigenvalue of
. We wish to bound
. The following set of inequalities obviously holds:
By part (a) of Theorem B.6 it follows that if:
which allows us to deterministically upper bound:
By Lemma E.7 we then have that . Note that on the event
by Lemmas D.5 and D.4 the burn-in time required by Lemma E.3 will be met at the end of epoch i assuming that
are chosen to satisfy (6). Since
is random we cannot apply Lemma E.3 to bound this directly, however:
have that so long as the steady state condition required by Proposition E.2 is met for every
. That is, we need:
for all w meeting this condition. On the event , by Corollary D.8, this burn in time will be reached as long as:
Note that on the event increases as its first argument decreases, we have:
By Lemmas D.5 and D.4 and on the event , assuming that
are chosen to satisfy (6), we will have that:
so if
Noting that on , we will have that
, we see then that the burn-in time required by Corollary E.5 will be met if
. Repeating the same calculation we
used to bound to handle the fact that
is random, we conclude, by Corollary E.5, that
as:
On the event , we will have that:
Furthermore, on this event, we will have that and all the conditions of Lemma D.2 will be met so
. This implies that
so by Lemma D.1,
where is the solution to
the solution to
. Furthermore, on this event we will have that:
where holds given (9), (c) holds since the inputs
maximize the quantity
under the power constraint, and
which implies that both
are greater than
By the error decomposition above, on the event it then follows that:
and
To eliminate dependance on i, note that which implies that
. We then have that
B.2. Proof of Theorem B.6
Throughout we will let , the total time that has elapsed after i epochs. We first note that the bound on expected power of the inputs follows directly from Lemma D.6
and by the power constraint imposed in OptInput.
Let:
be the event that our desired error bound holds, and define the following events:
We wish to bound . The following inequalities obviously hold:
By Lemma E.7, and since, following the proof of Lemma D.4:
we have that . Note that on the event
, by Lemmas D.5 and D.4 the burn-in time required by Lemma E.3 will be met at the end of epoch i assuming that
are chosen to satisfy
On the event , we have that:
and:
Thus:
so . Combining everything, it follows that:
The proof of this mirrors closely the proof above but now with inputs included. Let:
be the event that our desired error bound holds, and define the following events:
We wish to bound . The following set of inequalities hold:
By Lemma D.7 we have that and since:
we have that . Note that on the event
, by Lemmas D.5 and D.4 the burn-in time required by Lemma E.3 will be met at the end of epoch i assuming that
are chosen to satisfy (8). Since
is random we cannot apply Lemma E.3 to bound this directly, however:
where the last inequality follows by applying Lemma E.3 since is deterministic on
noting that
A similar argument can be applied to bound . Note that on the event
Lemmas D.5 and D.4 and assuming that
are chosen to satisfy (8), we will have that:
so if
On the event , by Corollary D.8,
will then be sufficiently large for the
system to reach steady state so the burn-in time required by Lemma E.4 will be met. Then repeating
On the event , we have that:
and:
Thus:
so . Combining everything, it follows that:
To complete the result, we must show that the inputs , which are computed based on our estimate of the system
, are close to the optimal inputs computed on the true system, for a specific set of frequencies
where is the solution to
the solution to
then:
this then implies that so, again by Lemma F.9:
Assuming this condition is satisfied for a particular
Note that . So linking these together, if
Since , we can invoke Lemma F.4 to get that:
which is the desired conclusion.
Let:
Let denote the columns of
is any unit norm vector such that
that do not correspond to the minimum eigenvalue of
. We can follow the proof outlined in Section B.2.2 up to the final step, adding in the events
By Lemma E.7, , we will have that
. We would like to guarantee that
. On the event
, a sufficient condition to achieve this is:
On the event , we will have that
, so by Lemma D.1, then we will have that
where is the solution to
the solution to
. So it follows that on the event
will also hold, so
. We can then apply part (a) to get that, so long as
meets the condition above and
B.3. Proof of Theorem 2.2
Proof Throughout we will let , the total time that has elapsed after i epochs. By Lemma H.3, we know that:
exists and is finite, where here is the set of inputs in
that maximizes
It follows then that there exists
such that, for all
, we will have:
By Corollary F.3 and Lemma F.6, for small enough , we will have that:
which implies:
Modifying the burn-in time of Theorem B.1 to:
Assuming this burn in time is met and:
T
then by Theorem B.1, we will have that:
so long as (where here we use the fact that
or equivalently:
where is defined as above and here we use (10). Note that by modifying the burn-in time of Theorem B.1, replacing
, by the definition of
, we will have the inputs being played are optimal with the flag
noted above,
is upper bounded by a constant independent of
as
, the condition (11) will force
. This implies that for small enough
, we will have
. In this case, then, we will have:
Defining to be a solution to:
for small enough , it then follows by Theorem B.1 that for any T at an epoch boundary, so long as
and the burn-in condition is met, we will have that:
The above definition of implies that necessarily
, we will have that
. By definition:
so:
where the inequality will hold for small enough , it follows that for small enough
, we will have that:
Thus, for small enough , we will have that:
So if:
we will have that for any , so long as the burn-in condition is met:
We can set:
and then:
It remains to show that the modified burn-in time required by Theorem B.1 is met as is, we need to ensure that as
where here we have replaced by noting that
is at an epoch boundary, since
. By what we have shown and by definition of
, so long as
and for small enough
, we automatically have that:
Note that , and that the dependance in
is logarithmic in
, and scales as
Thus, using the same argument as what we used above in (12), since
increases as
term linear in
will eventually exceed a term logarithmic in
for small enough
, so we will eventually have that the burn-in condition is met. Finally, we see that the condition:
T
will be met eventually regardless of how are set since, as noted,
that the number of epochs will go to infinity as
increases faster than
, eventually the left hand side of the above inequality will be greater than the right hand side.
Corollary C.1 (Full version of Corollary 3.1) Assume the assumptions outlined in Section 3 for the case where is diagonalizable by a unitary matrix are met. Then after:
steps, Algorithm 1 will attain the following rate:
while simply playing for all time will yield the following rate:
C.1. Proof of Corollary 3.1 and Corollary C.1
Proof The above rate can be attained by the input:
To see this, note that with this input we will have that:
Note that:
where the last equality will hold as long as:
Assume that k satisfies this, then:
and:
Thus, we will have that
Since we have constructed a feasible input and Algorithm 1 constructs the optimal input on the true system (assuming T is large enough), it follows that Algorithm 1 will perform at least this well.
It remains then to quantify how large T must be to achieve this rate. From Theorem B.1, we know that we must have:
and from above we need
lets us lower bound
sufficiently large.
where the first inequality holds since close to 1 and the second holds by our lower bound on k.
To bound , we must first bound
. We see in our case that:
Note that this implies that, for any , we will have:
Then we will have that:
Based on our choice of inputs:
So combining these, we can lower bound
We can then write the burn in time from Theorem B.1 as:
The rate in the case where we simply play for all time follows from Theorem
C.2. Proof of Corollary 3.2
Proof Since has the same block diagonal structure), to minimize the error in the estimate we want to minimize the maximum error in the estimate of each subsystem. By Theorem B.1, once the burn-in time is reached, the estimation error for each subsystem will behave as:
where we let denote the covariates for the jth subsystem. For simplicity assume that:
where here we let denote the optimal response of the system to inputs with power 1, and
the true amount of power inputed to the jth block. Ignoring log factors, the optimal thing to do is to then set:
for all , as this will make the estimation error equal for each subsystem, minimizing the overall error. Meeting this constraint and the power constraint, the following condition will then be met for any j:
Thus, with high probability, we will have that:
In contrast, if we simply input random noise into the system—that is, set the ith block we will achieve the rate:
so, with high probability, noting that by construction
To achieve the adaptive rate, Algorithm 1 can be run separately for each subsystem. After the optimal solution for each subsystem is found, the power input to each subsystem can then be adjusted so that the empirical version of (14) is satisfied. Once the burn-in time from Theorem B.1 is met for each subsystem, our estimates of
will be sufficiently accurate to guarantee that (14) will be met on the true system, and we will then achieve the optimal adaptive rate.
D.1. Quantifying When Small Enough for
Lemma D.1 If:
where is the solution to
the solution to
Proof By Lemma F.9, if
this then implies that so, again by Lemma F.9:
Thus, if , we can upper bound:
∥
Applying Lemma F.9 again, a sufficient condition for
of Theorem F.1, it follows that:
where the inequality (a) follows from Lemma F.4 (with a slight readjustment of constants) and (b) follows from Lemma D.3. Thus, if we can guarantee that:
then it will follow that:
Assume is small enough to satisfy this. Then, with (15), it follows that if:
then:
so . Note that this condition will also imply that (16) holds. Combining all of this, it follows that if:
then . Finally, we see that the perturbation bound holds by applying Theorem F.1 and our condition on
for some to be specified, we will have:
where:
Proof From the definition of OptInput, it is clear that:
on the event . Further, conditioned on all three events assumed to hold, by Lemma F.5, we have that:
Finally, recall that . Combining all of this we have:
ϵ
Lemma D.3 When calling UpdateInputs, we will always have that:
Proof Recall that:
and:
for any satisfying
, we will have that:
Since UpdateInputs only includes frequencies lows that:
from which it follows that
D.2. Meeting the Burn-In Time of Theorem E.1
Proof We have that that:
which implies that:
so:
We also have that:
This gives that:
Thus:
Lemma D.5 Assume that
T
then:
so:
where the second to last inequality follows assuming that
A direct corollary of Lemma D.5 and Lemma D.4 is that, assuming , then, as long as:
T
the used by Algorithm 1 will satisfy:
for any
D.3. Additional Lemmas
Lemma D.6 For any , the inputs generated by Algorithm 1 will satisfy:
Proof Denote is the solution to
and
. Assume that
where (a) follows since are independent, (b) follows by our choice of
rithm 1. The final equality follows since, by construction, the inputs that are the solution to
will satisfy:
for any
Lemma D.7 After i epochs of running Algorithm 1, we will have, with probability
Proof Let is the response of the system due to the sinusoidal component of the input,
is the response due to the process noise, and
is the response due to the input noise. Note that this decomposition holds by linearity. Given this, we have
∥
By construction, we will have that so long as
is large enough that
is in epoch i. However, since
is doubled at each epoch, this sum will contain an integer multiple of the period of the input regardless what the value of
is, so we see that this inequality will hold for all values of
. This implies that for all
where the last inequality holds since if t is divisible by 1, and if t is not divisible by
, and since
is an integer, it follows that
By definition:
where:
Noting that , we can then apply the Hanson-Wright inequality to get:
where the inequality holds since , we see this is true since:
A similar calculation reveals that with probability at least
Corollary D.8 After i epochs of running Algorithm 1, on the event that:
we will have:
T
where T is the amount of time elapsed after i epochs.
Proof From Lemma E.10, we have:
where is the state at the start of the i + 1th epoch, and
is the initial state of the steady state response of the system to the inputs played at the i + 1th epoch. From Lemma D.7, since the noise term will be 0, we can deterministically upper bound:
and also:
so:
it follows then that:
T
Finally, we must upper bound . Upper bounding this over all
is equivalent to bounding:
Lemma D.9 After i epochs, we will have that:
Proof After the ith epoch, we will have that:
Solving this for i gives:
Thus:
Noting that , we can lower bound this as:
Theorem E.1 (Full version of Theorem 2.6) Assume that we start from some initial state we are playing some input
is deterministic with period
Then as long as:
we will have that:
and if:
T
(19) then:
where are universal constants.
Note that ) can be replaced with
may be helpful if our system is not controllable, in which case it’s possible example of this argument can be found in the proof of Theorem 2.3.
E.1. Proof of Theorem 2.6 and Theorem E.1
Proof Define the following events:
(18) follows directly from bounding . The following clearly holds:
By Lemma E.7, we will have that ) holds the burn in time required by Lemma E.3 will be met, so by Lemma E.3,
. Similarly, by Lemma E.6,
bound
, note that we can decompose the error of the least squares estimate as:
On the event , we will have that:
Combining these it follows that on this event:
so . It follows then that
which proves (18). To show (20), define the following events:
As before, we have that and, assuming (19) holds,
) holds, by Corollary E.11 the burn in condition required by Lemma E.4 will be met so we will also have that
and the error decomposition of
used above, we have that
from which (20) follows directly.
E.2. Lower Bounds on Covariates and Self-Normalized Bounds
The following proposition is crucial to proving a high probability bound on the error in the presence of non-random inputs.
Proposition E.2 (Full version of Proposition 4.2) Consider any according to the dynamical system (1). Let
be a deterministic periodic signal and k be an integer multiple of its period. Let
denote the steady state response of the system to this input and let
. Assume that
is chosen large enough so that, for any
where:
Then we will have that:
Proof We first note that, since our system is linear, the output of the system due to the input, will contain only the frequencies present in the input,
, with possibly some phase shift. Thus, the period of the periodic part of our output will be identical to that of the input once the system is in steady state.
Let:
for some to be specified, where I is the indicator function. Then
Let
be some constant to be specified. Then:
where the last inequality is simply Chernoff’s bound. To compute the expectation, we will use the tower property. To do so, it will be convenient to first calculate the conditional expectation of Letting
denote the
-field generated by
, we have that:
where the last equality follows since:
Note that, conditioned on the are deterministic. Further, since
is mean 0,
will simply be a linear combination of mean 0 Gaussians and so will itself be a mean 0 Gaussian. This implies that
Since we have constructed in such a way as to be mean zero over a block of length k, for any fixed a:
In particular then:
which implies:
where the last inequality follows by a reverse Markov inequality which states that, for any random variable Z supported in [0, 1] almost surely and with Simchowitz et al. (2018). Noting that, since the noise is 0 mean Gaussian, we have:
From this (a) follows by simple manipulations. Since we can choose as we wish, we set it equal to
and conclude that:
Returning to (23), we can now use this result to bound the expectation. Note that:
Then by what we just proved and applying Hoeffding’s Lemma, since
Repeating this procedure condition on each
and so:
where the final inequality follows from choosing the optimal (and assuming
chosen such that
is positive) and the final equality uses
. By our assumption on the power (21), we will have that
It remains then to write this in form of (22). Plugging in our definitions of , we have that the above is equivalent to:
where (a) holds by our assumption on the power (21) and (b) follows by Parseval’s Theorem. Choosing to balance the constants, we get that:
which completes the proof.
Lemma E.3 Assume that our system is driven by some input is deterministic and
. Then on the event that:
for some , choosing k so that:
we will have with probability less than
Proof Take some
where is the state obtained by driving the system with the input in the absence of noise, which is deterministic conditioned on
. Given this, we have that
satisfies the
-BMSB condition, as defined in Simchowitz et al. (2018). The proof of this closely mirrors the proof of Proposition 3.1 of Simchowitz et al. (2018). The primary difference is that the mean of
differs from that of the signal considered in Simchowitz et al. (2018), but this does not affect the argument and, as such, we omit it here. We can then apply Proposition 2.5 of Simchowitz et al. (2018) to get that:
where here p = 3/20. Following the proof of Theorem 2.4 of Simchowitz et al. (2018), let T be a 1/4-net in the norm . By Lemma 4.1 of Simchowitz et al. (2018),
. Then by Lemma 4.1 of Simchowitz et al. (2018) we have:
where , which is true by (25), and (b) holds by (25). Lower bounding
Lemma E.4 Let be a deterministic input with period
be the time such that condition (21) in Proposition E.2 is met for all
. On the event that:
for some , then as long as:
Proof The proof of this follows Simchowitz et al. (2018) closely but replacing Proposition 2.5 of Simchowitz et al. (2018) with our Proposition E.2. By Proposition E.2 we will have that:
Following the proof of Theorem 2.4 of Simchowitz et al. (2018), let T be a 1/4-net in the norm . By Lemma D.1 of Simchowitz et al. (2018), we have
that . Then by Lemma 4.1 of Simchowitz et al. (2018) we have:
where (a) holds so long as , which will be true by (26), and (b) holds by (26). The following holds by
which completes the result.
Corollary E.5 Let:
where be a deterministic input with period
be the time such that condition (21) in Proposition E.2 is met for all
. On the event that:
for some , then as long as:
with probability less than
Proof The proof of this result is very similar to that of Lemma E.4. For any
For any , by Proposition E.2, given the definition of
, we will have that:
Following the proof of Theorem 2.4 of Simchowitz et al. (2018), let T be a 1/4-net in the norm . By Lemma D.1 of Simchowitz et al. (2018),
. Then by Lemma 4.1of Simchowitz et al. (2018) we have:
where (a) holds by (28) and (b) and the final inequalities hold so long as holds, since in that case we will have that
Lemma E.6 Assume that is generated from some input
surable and
. On the event that
, we will have that, with probability less than
Proof Note that Proposition 8.2 of Sarkar and Rakhlin (2018) applies even when is driven by an input
which is changing over time, since we choose
measurable, so
measurable. Therefore, for any deterministic
with probability less than
E.3. Upper Bounds on Covariates
Lemma E.7 Assume for some deterministic
, and for any initial state, then with probability at least
and for any w, with probability at least
Proof We note that:
Where here we let denote the response of the system to the deterministic part of the input and
the response due to the random part of the input. The term
is then deterministic. Following Proposition 8.4 of Sarkar and Rakhlin (2018), we can bound the second and third terms each with probability
Combining these bounds gives the result. For the second inequality, following the same argument as in the proof of Proposition 8.4 of
combining this with the above gives the result.
Lemma E.8 Assume that the input satisfies, for some
then:
where (a) uses Lemma E.12. Since , we will have, by Parseval’s Theorem and our assumption on
so:
Thus:
Lemma E.9 Assume that we are running Algorithm 1 and that we started from initial condition be the Jordan decomposition of
and consider some
such that
except for
denote the start and stop indices of the jth Jordan block (so in particular, if
th Jordan block, we have that
). Assume that T is chosen to be within epoch i. Then, after T steps:
Proof Adopting the notation used in Algorithm 1, let denote the length of the ith epoch. Denote
be the start time of the ith epoch.
Following the analysis used in Section E.4, we can break up the response into its steady state and transient components and write:
for denotes the steady state response of the system at time t to the inputs used at epoch i. We then have:
Note that:
where, relying on the periodicity of for negative s. So:
Repeating this calculation:
where the last inequality follows since . Therefore:
and:
Finally, by Parseval’s Theorem:
Combining this, we have:
E.4. Transients
Consider the response of a system to a deterministic, periodic, zero-mean input starting from some initial state
(here the mean is taken over a full period). We can break up the response into the steady state response,
, and the transient response,
Precisely,
is the response of the system if the input
has been on for all time in the past and, to attain the desired response, we can set:
With these definitions, we will have:
Assume that , we will have that
Take k to be an integer multiple of the period of the input and note that, by linearity, k will also be an integer multiple of the period of
Lemma E.10 Using the definitions above, let:
Then if , we will have that:
≤ ∥
where (a) holds by our assumption on
Corollary E.11 Under the same assumptions as Lemma E.10, we will have that:
where:
Since, by assumption is zero-mean, it follows that
is zero-mean. Thus, the only non-zero mean component of
is that due to the transient so:
from which it follows that is a zero-mean signal. Denoting
DFT of
, by Parseval’s Theorem, we will have that:
where, crucially, since is zero-mean, we only sum over frequencies starting at
(that is, we do not sum over the DC component). Thus:
Thus:
where the last inequality follows since we have assumed Lemma E.10 holds.
Lemma E.12 Assume that the input satisfies, for some
then:
where denotes the response of the noiseless system running for T steps when the input
is applied.
Proof Note that:
and:
Thus:
where the last inequality follows from the proof of Lemma D.7.
Throughout this section we assume we are running Algorithm 1 and that T is the elapsed time after i epochs. We will let to simplify expressions. We will also often simplify notation by writing
Let:
where is simply some value constraining the power of our input signal and
the DFT of
, the time domain signal. Note that the normalization
due to the fact that, by Parseval’s Theorem:
assuming that has period k and that we are in steady state. Further, by the update rule of Algorithm 1,
, which is the expected amount of time we will play these inputs for.
It is worth noting that the constraint is equivalent, by Parseval’s Theorem, to the constraint:
We will denote the optimal set of inputs on the true system as and the optimal set of inputs on the estimated system as
is the solution to
Our main perturbation result is as follows.
Theorem F.1 (Full version of Theorem 4.1) Assuming that , then we will have that:
where is generated from a system with parameter
is the solution to
is the solution to
L
between the largest and smallest eigenvalues of will not include vectors corresponding to the subspace spanned by the eigenvectors corresponding to the largest eigenvalues, as these will be sufficiently excited by noise to make
large. In that case one can show that for all
F.1. Proof of Theorem 4.1 and Theorem F.1
where are the eigenvectors corresponding to the minimum eigenvalues of the matrices
, respectively. We wish to show that:
Denote the solution of the above minimization. Denote also
the eigenvector corre- sponding to the minimum eigenvalue of
. Then if for all
(29) and:
(30) the above will follow. To see this, assume that:
then:
where (a) follows by optimality of follows by our assumption (29), and (c) follows since
corresponds to the minimum eigenvalue of
. This is clearly a contradiction, which implies that:
We can repeat this argument identically in the opposite direction:
(31) We now return to bounding the difference assuming (29) and (30) hold:
First, assume that:
where the final inequality follows by (29) and (31). Assume instead that:
where the final equality follows by (31). If we assume that:
where the first inequality holds since are the optimal inputs and the final equality follows by (31). Combining these, we conclude that:
We want to guarantee that such a condition holds for . In practice we cannot determine what these are exactly since this requires knowledge of
. Thus, instead, we will find a set
which is guaranteed to contain them. Setting:
this will be satisfied. To see why, note that
upper bounds
and
for all
then w cannot possibly correspond to the minimum eigenvalue of either
or
F.2. Perturbation Lemmas
Corollary F.3 Assuming that and that the largest Jordan block of
has dimension q, we will have that, for small enough
where here is the solution to
is the solution to
Proof The proof of this result follows identically the proof of Theorem F.1 except now instead of showing:
Given this, the rest of the proof of Theorem F.1 follows identically now.
It is not clear in general how large is and how it scales with
. The following lemma provides an interpretable upper bound on
is small enough.
Lemma F.4 Assume that U has period k. Then as long as:
for some a > 1, then:
Proof
where the final inequality holds since, by Parseval’s Theorem:
By Lemma F.8 and our condition on we have that:
Thus:
To get deterministic bounds on the algorithm performance, it is helpful to deterministically upper bound . The following lemma provides such a bound.
Lemma F.5 Assume that is the Jordan decomposition of
denote the
Jordan block, and assume
Jordan blocks. On the event that:
and:
for some to be specified, and if:
then:
and:
where:
and here k is the frequency discretization at the epoch with end-time T.
Proof By definition:
By Lemma F.8 and our condition on , we have that:
By assumption:
Lemma E.9 implies that, assuming we choose T chosen such that it is within epoch i:
Following the computation from Lemma F.11 and noting that:
and that the inverse of a block diagonal matrix is equal to the matrix formed from each of the blocks inverted individually, we then have that:
In addition, Lemma F.11, gives that:
Assume that be some vector such that
. Note that (a) will also upper bound:
and the result follows.
Finally, for Theorem 2.2, it is necessary to quantify how close quantified below.
Lemma F.6 Let q be the dimension of the largest Jordan block of small enough
, where at least
Proof We first compute the directional derivate of with respect to
in direction
We can upper bound
Writing
By Lemma F.10 we will have that:
Combining these we have that:
F.3. Additional Lemmas
where:
L(A, B, U, ϵ, I, w)
Proof To bound, we calculate the directional derivative of
with respect to A and use this to bound the Lipschitz constant of the function
. The directional derivative is given by:
Lemma F.8 gives that, for small enough
so:
and thus:
Lemma F.8 For
Proof To see that this is true, we can simply multiply the right hand side above by and observe that the result is I. We wish to show that:
Since , we can make
arbitrarily small by making n large. Thus, for any
, we can find an N such that for all
This implies that:
Lemma F.9 If:
Proof Denote
∥
For the second inequality, we can simply multiply the first term in the expression by that the result holds.
Lemma F.10 Assume that for some small enough
. Denote by
the spectral radius of
be the Jordan decomposition of A. Then if J is diagonal, we will have that:
where is not diagonal then, letting n be the dimension of its largest Jordan block:
Proof Let be the Jordan decomposition of A. Assume that
be the eigenvalue of
with largest magnitude and assume that
not an eigenvalue of A (otherwise we are trivially done). Since
is an eigenvalue of
, following a standard proof of the Bauer-Fike Theorem we have:
Since by assumption is not an eigenvalue of
, which implies that
eigenvalue of
. Since the spectral norm upper bounds all eigenvalues:
so:
If J is diagonal, then then have:
where the implication follows by the reverse triangle inequality. If J is not diagonal, then will be a Jordan form with eigenvalues
. In particular then we have:
J
where th Jordan block of
is the dimension of the ith Jordan block, and:
Since the inverse of a block diagonal matrix is simply formed by inverting each block, we can calculate by calculating the inverse of each block
individually. Note that each block is invertible since we have assumed that
is not an eigenvalue of A. By Taylor expanding, and the fact that
is nilpotent, we have:
so:
Since eigenvalues are continuous functions of the entries of a matrix Horn and Johnson (2012), for small enough , we will have that
. If this holds then:
Then:
Combining this with (32) and denoting the indices at which the above maximum is achieved, we get that:
Lemma F.11 Let be the Jordan decomposition of A. Assume that A has r Jordan blocks and denote by n(i) and n(i) the start and stop indices of the ith Jordan block (so in particular, if
th Jordan block, we have that
the matrix with columns equal to the ith to jth columns of P. Then:
Proof We have:
Since, for nonnegative (by virtue of the fact that
), it then follows that:
, . . . , w
, . . . , w
We base our analysis off the lower bound presented in Jedra and Proutiere (2019). A slight modifi-cation of their analysis to our situation yields the following result.
Theorem G.1 For any matrix , the sample complexity
-locally-stable algorithm in
Proof The proof of this result is essentially identical to the proof of Theorem 1 in Jedra and Proutiere (2019) and we omit it here.
Denoting the response of the system due to the input and
the response due to the noise, we can write:
Thus:
so, Theorem G.1 gives that:
G.1. Proof of Theorem 2.1
Proof Since (33) holds for all input sequences , and since we wish to minimize the lower bound, we will have in particular:
Since is deterministic conditioned on
, maximizing
equivalent to maximizing
. For any input u satisfying the power constraint given in the statement of Theorem 2.1, by Lemma E.8:
Note that the term is scaling as
. Thus, for large enough
, since the left hand side is only scaling as
so, for large enough
For small enough will be sufficiently large for this to hold. We have then that:
By Lemma H.2, we know that:
exists and, further, that:
for all . Thus, for small enough
, we will have that:
Lemma H.1 Assume that . Then for any
, we will have that:
so it follows that is Lipschitz continuous in
Proof Noting that, since we assume , using the identity that
Thus:
For any matrix
(=
which implies:
So the Lipschitz constant of is bounded by:
from which the result follows directly.
Lemma H.2 For any sequences of integers
assuming the limit of each exists. Further, for any finite j, we will have:
Proof Assume the opposite, that there exists some sequence of integers satisfying the above condition such that
. By the definition of a limit, this implies that there exists some finite
such that for any
, we will have that
, note that we can make:
arbitrarily small for large enough and by proper choice of
). By Lemma H.1, this implies that we can make:
arbitrarily small. Thus, for large enough j, we can simply set the inputs at positions
to those at positions arbitrarily close to
while still meeting the feasibility constraint on the input. This contradicts the fact that
, which implies that
Lemma H.3 For any integer and finite input power budget
exists and is finite.
exists and is finite.
First, note that satisfying the power constraint in this setting is equivalent to this constraint, the optimal noise covariance can be obtained by solving:
In our setting, with , solving this is approximately equivalent to solving:
where be the optimal diagonal solution, and note that, in this case, we will have:
To see this, note that for any diagonal
The optimal solution will clearly be the solution that balances the energy in every diagonal element, that is:
for all , so combining this constraint with the trace constraint yields:
and thus the jth diagonal element will be:
Consider now some other matrix that is not necessarily diagonal. Note then that:
For to be in the constraint set, we must have that
. To have that:
we must have that is positive definite. However, this is not possible since the diagonal elements of
are the sum of non-negative scalings of the diagonal elements of
and since
must have at least one non-positive element on the diagonal to meet the constraint
, it follows that
has at least one non-positive diagonal element. Since the diagonal elements of every positive definite matrix are positive,
cannot be positive definite, so we cannot increase the value of
. By convexity of the constraint set, it follows that the directional derivative in the direction of any other point in our constraint set is negative. Since this is a concave function, it follows that
is optimal.
sufficiently large, we have that:
Figure 5: Jordan block with d = 4,
randomly generated with specified value of p
Figure 6: diagonalizable by a unitary matrix and has given spectral radius,
randomly generated. Dotted lines illustrate the performance of
for each value of
Figure 5 illustrates how the shape of B can influence the effectiveness of active system identifica-tion. With p = 1, it is not possible to control the direction of the input, which can greatly reduce the effectiveness of input design. Interestingly, for all p > 1, the performance is roughly the same— increasing p beyond 2 does not provide a large gain in the effectiveness of input design.
Figure 6 plots how the estimation rate depends on the spectral radius. Here the performance of our algorithm is plotted as the solid line and the performance of of isotropic noise as the dotted line. As our theory predicts, systems with a larger spectral radius are easier to estimate. Further, as Corollary 3.1 states, the gap between our algorithm and isotropic noise increases as increases—for
there is almost no gain in designing inputs actively but as
increases the gains of active input design also increase.