Bayesian System ID: Optimal management of parameter, model, and measurement uncertainty

2020·Arxiv

Abstract

Abstract

BAYESIAN SYSTEM ID: OPTIMAL MANAGEMENT OF PARAMETER, MODEL, AND MEASUREMENT UNCERTAINTY

NICHOLAS GALIOTO ∗ AND ALEX A. GORODETSKY ∗ Abstract. We evaluate the robustness of a probabilistic formulation of system identiﬁcation (ID) to sparse, noisy, and indirect data. Speciﬁcally, we compare estimators of future system behavior derived from the Bayesian posterior of a learning problem to several commonly used least squares-based optimization objectives used in system ID. Our comparisons indicate that the log posterior has improved geometric properties compared with the objective function surfaces of traditional methods that include diﬀerentially constrained least squares and least squares reconstructions of discrete time steppers like dynamic mode decomposition (DMD). These properties allow it to be both more sensitive to new data and less aﬀected by multiple minima – overall yielding a more robust approach. Our theoretical results indicate that least squares and regularized least squares methods like dynamic mode decomposition and sparse identiﬁcation of nonlinear dynamics (SINDy) can be derived from the probabilistic formulation by assuming noiseless measurements. We also analyze the computational complexity of a Gaussian ﬁlter-based approximate marginal Markov Chain Monte Carlo scheme that we use to obtain the Bayesian posterior for both linear and nonlinear problems. We then empirically demonstrate that obtaining the marginal posterior of the parameter dynamics and making predictions by extracting optimal estimators (e.g., mean, median, mode) yields orders of magnitude improvement over the aforementioned approaches. We attribute this performance to the fact that the Bayesian approach captures parameter, model, and measurement uncertainties, whereas the other methods typically neglect at least one type of uncertainty. Key words. System ID, Approximate marginal MCMC, UKF-MCMC, Bayesian inference, DMD, SINDy, Dynamical systems

1. Introduction. Recovering nonlinear models of dynamical systems from data is quickly becoming a primary enabling technology for analysis and decision making in fields spanning science and engineering where first principles models are often incomplete or simply unavailable. Examples range from forecasting the weather and climate [26, 27, 44], predicting fluid flows [43, 57, 4], and enabling adaptive control [51]. All of these fields have a long history of developing estimation and system identification techniques such as advanced Kalman filtering in forecasting [49, 35], dynamic mode decomposition for computational fluid dynamics [56], and a wide ranging set of schemes in adaptive control [41, 58, 16]. In this paper we compare the implicit and explicit optimization formulations posed by several representative approaches, and we demonstrate that algorithms that appropriately manage parameter, model, and measurement uncertainty in a cohesive manner are often more robust than more standard least squares-based approaches.

For any system identification approach, there are two primary challenges: (1) parameterizing a model space over which to search and (2) posing an optimization problem whose minimum yields an optimal model. A great majority of recent work has focused on addressing the first challenge, primarily due to the rapid availability of machine learning software. These recent works seek to learn neural network representations of problems because of their representation capacity [7, 47, 18]. These works are partly motivated by the belief that modern systems are complicated and existing linear or linear-subspace models are no longer capable of representing the systems we seek to model.

In this paper, we explore the second challenge – that of posing an optimization problem, or, more generally, specifying a goal whose minimum will yield a system with

∗N. Galioto and A.A. Gorodetsky are with the Department of Aerospace Engineering, University of Michigan, Ann Arbor, MI 48109, USA, email: {ngalioto, goroda}@umich.edu 1

predictive power. We argue that this problem is equally, if not more, important than appropriately parameterizing a model space. We support this assertion by showing that many currently used optimization objective specifications fail to recover models even when the correct model class is known. Specifically, these specifications cause system identification techniques to break down in the presence of sparse measurements and/or noisy data.

We advocate a probabilistic approach to system dynamics that explicitly provides for the representation and incorporation of three uncertainties: parameter uncertainty, model uncertainty, and measurement uncertainty. This probabilistic setting, given in Section 3, poses the problem as a hidden Markov model and is well known in the estimation and filtering literature across disciplines [60, 40, 6]. Despite being well known, this setting has not been thoroughly compared to predominant system identification approaches in the context of model learning rather than filtering/smoothing.

The solution to this formulation is a posterior distribution of the model parameters given the observed data. As a result, predictions and forecasting become probabilistic – weighting future outcomes by their relative probabilities. This posterior distribution must be computed using computational inference approaches such as Markov Chain Monte Carlo or variational inference. Given a posterior distribution, goal-oriented estimators can be extracted based on a specified loss and risk metric [8]. For instance, it is well known that the posterior mean is the optimal estimator for Bayes risk with squared loss, and the posterior median is optimal for loss.

To this end, our contributions involve proving that several existing and popular approaches for system identification, sparse identification of nonlinear dynamics (SINDy) [10] and dynamic mode decomposition (DMD) [56], are realizations of the probabilistic framework under some limiting assumptions (they assume no measurement uncertainty). We choose these two approaches because they are representative of many (nonlinear) least squares type approaches that are used. We then empirically demonstrate that we yield improved predictions compared to these approaches on wide varying problems. Concretely, our contributions are the following:

1. A complexity analysis of the unscented Kalman filter MCMC (UKF-MCMC) algorithm developed in [21], which enables an approximate marginal Markov Chain Monte Carlo algorithm to sample from the marginal posterior of the model parameters;

2. Theorems 4.1 and 4.2 proving that DMD and SINDy can be viewed as specific cases of the presented probabilistic approach with additional assumptions of zero measurement noise; and

3. A wide ranging set of numerical simulation results demonstrating the robustness and improved prediction quality of our approach in all cases, including sparse and noisy data. The UKF-MCMC approach mentioned in the first contribution refers to a computational algorithm that targets the marginal posterior distribution of the model parameters to avoid performing inference over the joint parameter-state space. It can be viewed as an approximation of the marginal likelihood that is traditionally very difficult to compute for dynamical systems [30]. The UKF-MCMC algorithm is one of a number of algorithms that have been recently developed that draw on Gaussianbased filtering to approximate the marginal likelihood [48, 20, 37]. These algorithms trade off the approximation quality for some additional computational efficiency compared with the seminal particle-marginal approach of Andrieu [2], which is able to reconstruct the exact posterior. Nevertheless, our results indicate that the posterior approximated by the UKF-MCMC algorithm is still able to reconstruct systems with

good accuracy.

We apply the UKF-MCMC algorithm to the hierarchical Bayesian setting where we explicitly learn the process and measurement covariance of the dynamical system. Furthermore, we use standardized uninformative priors for the model parameters and standard half-normal priors for the unknown covariances [25]. As a result, our algorithm requires no additional parameters, besides number of MCMC samples, compared to competing single-point estimators (DMD and SINDy). Furthermore, we provide a computational complexity analysis showing that the expense of our approach compared to these existing approaches grows linearly with the number of data points. However, our accuracy gains are shown to sufficiently offset this expense.

The second contribution aims to uncover, or at least interpret, some of the underlying assumptions that have led to observed poor performance of the methods to which we compare. Many data-driven methods claim a certain degree of objectivity (as compared to, for instance, the Bayesian approach we propose here) because they avoid placing strong assumptions (priors) on the system model that may influ-ence the method’s estimate. In reality, however, “analyses that have the appearance of objectivity virtually always contain hidden, and often quite extreme, subjective assumptions” [8]. It will be shown that the estimators DMD and SINDy hold the hidden assumption that uncertainty enters only through process noise and that the measurements are noiseless. Conversely, techniques such as parameter optimization of deterministic ODEs account only for noise in the measurements and not in the process. The Bayesian estimator presented here will be shown to outperform these common approaches as soon as their underlying assumptions are violated, even in the modifications of these algorithms that incorporate denoising [32, 11].

The rest of this paper is structured as follows. In Section 2, we explain the central problem this paper hopes to address: how common least squares-based system ID approaches create objective functions with certain undesirable features. We illustrate this problem by providing the contours of two common objective functions for a simple two-dimensional problem and show how the Bayesian approach incorporates the advantageous features of both without including the problematic features. In Section 3, we detail the probabilistic framework of a system ID problem, including the problem setup and primary goals. Then in Section 4, we provide an analysis of two existing approaches to system ID: DMD and SINDy. In this section, we give a brief explanation on the implementation of both algorithms and then provide theory that unveils the underlying assumptions used by these approaches. Section 5 outlines the algorithms used to implement the Bayesian approach and provides a comparison of their computational complexity to that of DMD and sparse regression. Finally, Section 6 applies the Bayesian algorithm to five different dynamical systems including linear, nonlinear, chaotic, and PDE systems. Comparisons of the Bayesian algorithm to DMD and SINDy are given, and it is shown that not only is the Bayesian approach just as effective for systems for which these common approaches display exemplary performance, but the Bayesian algorithm also remains robust in certain regimes where DMD and SINDy fail.

system ID. In this section, we highlight the geometry of the objective functions of several representative optimization formulations for system identification that we explore in this paper. Specifically, we consider three objectives: one that considers measurement uncertainty but no process/model uncertainty; one that assumes process uncertainty but no measurement uncertainty; and finally our proposed approach that considers both process and measurement uncertainty. The first two approaches are the most commonly used, but we show that they suffer from multiple minima and poor data sensitivity, respectively. Furthermore, while variations of these approaches are used on complex systems, we highlight their limitations in an extremely simple setting of recovering a linear pendulum.

To motivate the results, consider a simple setting where the true model is a linear oscillator with a frequency of 2.00rad/s. Suppose that the learning objective is to identify the frequency. One might intuitively believe that the following least squares objective (in the time-domain) would appropriately penalize incorrect frequencies

where ) measures the “error” of estimating some parameter An optimization scheme would then try to find the parameter to minimize J. This objective is not derived from any arguments, rather it is intuitively specified and here we attempt to see whether this specification makes sense.

Prior to considering the full system identification, we consider a property of the least squares objective. We compare the cost of two parameters = 2.01rad/s and = 4.00rad/s at two different times T = 10 and T = 1000. In the case where we obtain noise-free data for ten seconds, we obtain J(2.01, 10) = 0.02 and J(4.00, 10) = 9.63 – as we desire, the cost of estimating = 2.01rad/s is more than 100 times lower than estimating = 4.00rad/s. However, suppose that we obtain data for 100 times longer. Then we obtain J(2.01, 1000) = 1053.96 and J(4.00, 1000) = 999.58 – the relative difference between the two objectives has shrunk tremendously.

In this example, even small perturbations from the true parameters of a system yield large errors given enough time, and, in this case, greatly reduce the relative benefit of = 2.01 over = 4.00. In simpler terms, this example demonstrates that as the number of data points increases, the relative difference between = 2.01 and = 4.00 decays! The practical implication is that optimization formulations may have significantly more difficulty in distinguishing between correct and incorrect parameters. The issue here is that the least squares objective does not seem to behave as intuition would expect, nor does it match the behavior we are aiming to achieve. Specifically, we seek an objective function that exaggerates the difference between parameters with small errors and those with large errors as more data are obtained.

In this paper, we show that an approach that introduces (and then seeks to reduce) the uncertainty in parameters, models, and measurements leads to objective functions that are far better behaved. For example, to account for imprecision in our parameter estimates, we must include process noise in our model, allowing us to reposition our reconstructed trajectory at the time of every measurement. This way, the model is not given enough time for errors to accumulate and eventually become indistinguishable from a far worse model. While the process noise is not the model error, it does encapsulate the fact that the predicted motion is incorrect. In fact, we empirically show that it should be included even when the model class spanned by the parameters encapsulates the truth.

1For this linear problem, it is more appropriate to consider frequency-domain system ID, which would not encounter the problems described here. However, these types of time-domain system ID procedures using least squares-based regression/machine learning approaches are increasingly being used for complex nonlinear systems [13, 53, 22, 54], and we seek to show that they can be limited in an extremely simple setting. 4

To this end, we consider three objective functions

where are model parameters, f are continuous dynamics representing the time derivatives of a problem, and Ψ are discrete propagators. The first objective, Equation (2.2), assumes deterministic dynamics and performs least squares regression to match the trajectory of a differential equation to the data. The least squares objective here implicitly accounts for measurement noise, and is widely used in the literature [5, 50, 24]. The second objective, Equation (2.3), assumes there is no measurement noise, only process noise/model uncertainty, and instead builds a propagator between observations. This objective is representative of DMD [56] and similar least squares approaches [64, 15, 17]. The final objective, Equation (2.4), is the log marginal likelihood arising from Theorem 3.1 that we advocate for in this paper. Note that standard and sparsity-enhancing regularizations/priors can also be included to each of these objectives, but they do not change the qualitative conclusions. Figure 1 shows these objective functions for the case of learning a continuous time linear pendulum

where the true parameters are = 1 and Here, g is the acceleration due to gravity and L is the pendulum length. Data are obtained in 0.1 second increments with noise standard deviation of 0.1. Each column corresponds to a different objective. The first column corresponds to the objective of Equation (2.2), the second corresponds to Equation (2.3), and the third to Equation (2.4).

The rows of this figure correspond to 20, 40, and 80 data points collected in 0.1 second increments, respectively. In the left panel, we see that the assumption of a deterministic system (no process noise) results in many local minima, each of which represents a system that matches the data closely at some points and at other points may be completely opposite. In the middle column we see that excluding the measurement noise has smoothed over certain features of the deterministic system and the objective becomes insensitive to the number of data points. This panel corresponds to the shape of the objective used by DMD. In this case, finding the minimum of the objective is fast, but the reconstructed system may lack some of the key features of the true trajectory and is not tremendously affected by increasing data. Lastly, the third column represents the objective arising from our probabilistic approach that considers all types of uncertainty. Only in this approach do we see that increasing data has a beneficial effect on the objective function. Multiple local minima do not exist, the characteristic shape seen in the left column remains, and the objective becomes steeper in the region of the minimum.

3. General problem setting. In this section, we describe the probabilistic framework for system identification, the problem statement, and a description of our high-level solution approach.

Fig. 1: Comparison of three optimization objectives for the identification of a linear pendulum. The left column uses a least squares objective that neglects process noise, the middle column uses a least squares objective that neglects measurement noise, and the right column is the log marginal likelihood that accounts for both. The rows correspond to the objective functions obtained after 20, 40, and 80 data points are taken at 0.1 second intervals from top to bottom. White crosses indicate true parameters. Neglecting process noise results in many local minima. Neglecting measurement noise results in an objective insensitive to the number of data. The Bayesian approach results in the ideal scenario where the objective becomes steeper in the direction of the minimum as the amount of data increases.

3.1. Probabilistic formulation. In this section, we describe the probabilistic inference problem.

Let R denote the set of reals and denote the set of positive integers. Let us de-fine the norm of a vector as = . Let (Ω, F, P) denote a probability space, denote the size of a state space, denote the size of an observation space, and denote a time index corresponding to a time 0. Sequential time indices will typically occur with a constant interval ∆ so that + ∆.

We model the dynamical systems as evolving hidden/uncertain states

Fig. 2: Bayesian network representation of the system identification problem. The data are realizations of the random variables .

through a discrete-time dynamical system. We obtain information about these states through a noisy measurement operator providing us data These data can be viewed as realizations of another stochastic process that is dependent on the hidden states.

The dynamics and measurement operators are uncertain and the parameters for define a search space over which we will seek to learn the system. We partition the parameters = () into different aspects of the problem including the dynamics model parameters , observation model parameters , process noise parameters , and observation noise parameters Together these states, observations, and parameters are related through a hidden Markov model describing a discrete-time stochastic process [60]

where Ψ : is the dynamics operator, is the process noise with uncertain covariance Σ(is the observation/measurement operator, is the predictive stochastic process for the observable, and is the observation noise with uncertain covariance Γ(Finally, we have an additional source of uncertainty corresponding to the initial condition of the states A visual representation, in the form of a Bayesian network of this model is provided by Figure 2.

Examples of Ψ could include physics-inspired PDE operators [64], empirical linear models (a matrix), or nonlinear models such as neural networks [12, 62, 39]. The observation operator h is typically some known sensor model that may or may not have uncertain calibration parameters . We include the parameters for the observation model to maintain generality.

System (3.1) implicitly defines several probability distributions that completely describe our state of knowledge. The first distribution reflects the Markovian propagation dynamics

e(X, X, θ, θ) = Ψ(X, θ,(3.3)

where the error function e represents the misfit, or model error, of the dynamics under a fixed set of parameters. The next distribution reflects the noisy measurement models

r(X, θ, θ) = h(X, θ,(3.5)

where the residual r represents the misfit between the states and the observed measurements. Together, along with a prior, these distributions will enable us to concretely form the learning problem, which we establish in the next section.

3.2. Goals. In this section, we describe our two objectives: system identification (learning) and prediction/forecasting.

Our learning objective is to determine a dynamical model Ψ. Specifically, this objective requires representing our knowledge about the parameters (or ) after data are obtained. This knowledge is represented via a conditional distribution over given the observed data. This distribution is given by Bayes’ rule

(3.6) ) = ) , where = (

where the prior is denoted by ) and the marginal likelihood is a function of the unknown parameter

(3.7) L(θ; p(Y= y, . . . , Y= yθ).

This conditional/posterior distribution captures all the relevant information about our parameters contained in the data. It will be useful to leverage the sequential/Markovian nature of the process to factorize this likelihood as

(3.8) ) = )

where we have set ) and ) for k = 2, . . . , n.

Our second goal is to predict, or forecast, the system state at some future time . This prediction could either be the full posterior predictive distribution ) or some “best estimate” that can be derived from the posterior to satisfy some optimality conditions [8]. Furthermore, these two goals (system identification and prediction) are related in that the prediction is obtained by averaging over all possible system parameters, weighted according to the posterior distribution,

3.3. Marginal likelihood computation. In this section, we review the formulas for computing the marginal likelihood used in the target distribution (3.6). We first present the general case [60]. Then, we specialize the general case into special cases with (1) zero process noise (model error is ignored) and (2) noiseless, invertible measurements (measurement error is ignored). The formulation for evaluating the marginal likelihood is provided by the result of Theorem 3.1.

Theorem 3.1 (Marginal likelihood (Th. 12.1 [60])). Let denote the set of all observations up to time k. Let the initial condition be uncertain with distribution Then the marginal likelihood (3.8) is defined recursively in three stages: prediction

(3.10) ) =

update,

and marginalization,

for k = 1, 2, . . . .

This theorem provides a recursive algorithm for evaluating the marginal likelihood. This recursion requires not only maintaining a standard Bayesian filter for the prediction and update steps, but also keeping track of the marginalized distribution after every observation. Extensions to situations where data are not obtained at every time step is trivial – for times when no data are obtained, the update step is skipped.

3.3.a. Zero process noise. In the standard parameter estimation problem, e.g. Equation (2.2), we model the dynamics as deterministic where uncertainty enters only through measurement noise. In this case, the Markovian property that leads to distribution (3.2) reduces to a Dirac delta function

(3.13) p(X, θ, θ) = δ(Ψ(X, θ)).

The assumption of zero process noise leads to the marginal likelihood given in Theorem 3.2.

Theorem 3.2 (Marginal likelihood – zero process noise). Let the dynamics model be deterministic. Then, the marginal likelihood (3.8) is defined recursively as

where Ψdenotes k applications of the dynamics model. Moreover the log marginal likelihood becomes (3.15)

Proof. The proof follows from the fact that a deterministic system must follow a fixed trajectory defined entirely by the parameters. In other words, we have ) = ) = ). As a result, the measurement model can be written as a function of the parameters

This distribution is Gaussian with mean ) and covariance Γ(Applying these same facts at each time step completes the proof.

3.3.b. Noiseless and invertible measurements. In this section, we consider the ramifications on the posterior of assuming no measurement noise. In the next section we will show that several least squares optimization approaches correspond to this case.

Consider an invertible observation operator so that the states are uniquely determined ). Using this assumption in System (3.1) leads to a Markovian system for the system observables

(3.16) Y= h(Y), θ+ ξ, θfor k = 1, . . . , n 1,

Theorem 3.3 (Marginal likelihood – noiseless, invertible observations). Let h be an invertible operator and the measurements be noiseless. Then, the marginal likelihood (3.8) is defined recursively as

for k = 2, . . . , n and

Proof. The proof is trivial by noticing that the Markovian system for the observables (3.16) defines a sampling distribution using the change of variables formula

The given result is obtained by using this sampling distribution as the likelihood.

As we intuitively expected, there is no marginalization over states under this assumption because the learning problem effectively “resets” after every data point. After the reset, the states are at their true value, and optimization progresses to ensure that the residual of propagation between true values is small. This is exactly the same methodology that inspires the least squares regression based approaches such as DMD and SINDy. In fact, we will show that special assumptions on h and Ψ recover these least squares approaches.

Remark 3.4 (Data on initial condition). If the initial condition is treated as beginning when the data are obtained, then the log likelihood for the first data point becomes independent of the parameters and we can set it to an arbitrary constant.

3.4. Decision making. Whereas traditional system ID and ML approaches de-fine the problem through an optimization objective, the Bayesian approach separates learning and decision making. In effect, it provides a way of generating new optimization objectives and interpreting existing ones. Here, we briefly comment on the fact that this separation comes in the form of a two step procedure: (1) computing the posterior and (2) extracting a goal-oriented estimator through the specification of a loss function. For detailed discussion of these topics we refer the reader to [8].

First note that we have considered to contain all uncertain parameters in the problem. For prediction, however, it is standard to make predictions into the future using deterministic models based on Ψ. As a result, we can partition the parameters = () into those that correspond to the dynamics, observations, process noise, and measurement noise, respectively. Next we define the posterior predictive distribution of the states as an average over all possible values of the dynamics parameters conditioned on the observations

where we will use a deterministic prediction that discards the process noise

(3.21) ) =

This restriction is not explicitly necessary, but it is representative of how learned models are used in practice.

Finally, we can extract several estimators to use as “point estimates” from the posterior. For example, the mean estimator, which corresponds to the optimal estimator for the squared loss [8],

(3.22)

or the MAP estimator,

Note that these estimators do not, in general, have a 1-1 correspondence with the approaches that use the mean, MAP, and median of the parameters rather than the states, i.e., the mean estimator

(3.24) =

or the MAP estimator

In other words, we do not require a fixed estimator for the model to have a point estimate of the prediction. Indeed, the most likely dynamics may not actually correspond to the most likely states for nonlinear models.

4. Analysis of existing approaches. In this section, we analyze two common state-space system ID approaches that have recently garnered some success. These approaches are representative of those that ignore measurement noise (or account for it by heuristic “denoising”). We seek to demonstrate that many least squares approaches can be interpreted within the framework of Equation (3.1). Our choice of dynamic mode decomposition (DMD) and sparse identification of nonlinear dynamics (SINDy) are representative of algorithms that use least squares and/or regularization. Our main results are that DMD can be interpreted as a maximum likelihood estimate and SINDy as a MAP estimate under zero-noise and invertible observation operator assumptions.

4.1. Dynamic mode decomposition (DMD). Dynamic mode decomposition (DMD) is a data-driven method for system identification that is used to identify the ‘dynamic modes’ of a dynamical system [56]. These modes reveal characteristics such as unstable growth modes, resonance, and spectral properties [52]. DMD is favorable when the system at hand is high dimensional but has some hidden low-dimensional structure, as is the case in many fluids problems. DMD first organizes a series of measurements at regular time intervals into two matrices

and then seeks a linear operator A which maps the observables from one time step to the Next i.e., = AY . To find A, one simply minimizes the Frobenius norm of by solving the least squares problem

The solution is given by , where denotes the pseudo-inverse.

The method given above may at first appear only applicable to linear systems, but [55] showed that in the nonlinear case, the approximated operator A and its corresponding modes are approximations to the linear but infinite-dimensional Koopman operator and Koopman modes respectively, thus revealing its applicability to nonlinear systems.

Next we show the least squares procedure for DMD can also be derived directly from the general probabilistic system (3.1) under certain assumptions.

Theorem 4.1 (DMD as a maximum likelihood of system (3.1)). Assume a linear model Ψ() = ; identity observation operator h = I; noiseless measurements Γ() = 0; and identity process noise Σ() = I. Then, the maximum marginal likelihood estimator corresponding to System (3.1) is equivalent to the least squares objective of the DMD problem (4.2).

Proof. This result uses a straightforward application of Theorem 3.3. Without loss of generality, we use the fact that the first measurement is of the initial condition, and therefore we can ignore Here, we have an identity observation operator, and therefore the inverse and Jacobian are also the identity. The dynamics are linear and unknown so we can write . Together, these facts require that the log marginal likelihood (3.19) becomes

(4.3) log ) =

which is our stated result. Clearly, the maximizer of this function is equivalent to the minimizer of (4.2).

While the invertible measurement operator is not a restrictive assumption because all DMD cares about is mapping observables and not underlying states, Theorem 4.1 shows why DMD may not be appropriate for cases where the observations are noisy. This fact has been recognized in the literature and several procedures for rectifying this issue have been proposed. For instance, [32] showed that total least squares is a more appropriate algorithm to identify A when measurement noise is present, a method known as total DMD (TDMD). For a full analysis of the total least squares problem, see [28, 34]. We will empirically compare TDMD to our approach in Section 6, where we see that it also performs worse than the posterior predictive mean. Future work will attempt to determine the assumptions that TDMD makes in the context of System (3.1).

In [61], another connection between the Bayesian approach to DMD was developed that infers the Koopman modes and eigenfunctions of the Koopman operator directly, rather than learning the dynamical operator itself. That work showed that when the measurements are noiseless, the MLE of their Bayesian model, TDMD, and DMD all provide the same estimate. In contrast, here we have provided our result in terms of the underlying hidden state dynamics rather than explicitly assuming observation dynamics.

One benefit of the analysis in our context is that our use of an underlying state-space model makes the framework valid even when the observations cannot be written using a Markovian (zero-lag) model as in Equation (3.16), which was required for the approach developed in [61]. In fact, this result can be interpreted to indicate that zero-lag DMD is most effective if the observation operator is invertible.

4.2. Regularized regression for nonlinear models. Least squares optimization can also be used for identifying nonlinear systems by searching in a linear subspace. In these cases, it is often advantageous to add regularization to seek parsimonious solutions. One such approach that uses a sparsity enhancing regularization is the method of sparse regression or sparse identification of nonlinear dynamics (SINDy) [10].

These approaches organize a library of candidate functions (linear and nonlinear) into a matrix. They then aim to approximate the time derivative in the span of this library. For instance,

(4.4)

This example uses monomial candidate functions, but any basis (wavelets, orthogonal polynomials, empirical bases) can be used.

Suppose that the general dictionary of terms is given by Ξ : so that the deterministic portion of some continuous-time autonomous dynamics can be written as a linear system with respect to the parameters/coefficients of the functions in the dictionary ˙x = Ξ(If direct data were available on the states and derivatives, one might then try to solve a (regularized) linear least squares problem for the parameters

where is a regularization weight and the norm can be chosen by the user. If the norm is chosen, this becomes a sparse regression problem.

Practical applications, however, do not have data on the derivative of each state ˙. As a result, various numerical approximations can be made, and this is the approach taken by the SINDy algorithm. Here, we will consider one type of numerical approximation to the derivative, but our analysis can be extended to others. If a forward-difference approximation to the time derivative is taken, then the SINDy objective function is

Notice that this approach requires direct observation of the states. Next we show that it is also equivalent to the maximum a posteriori of our target conditional distribution under more strict assumptions.

Theorem 4.2 (SINDy as a maximum a posteriori estimate of system (3.1)).

Let Ξ(x) : denote a library of candidate functions for continuous time drift dynamics. Let Ψ() denote the resulting discrete-time operator that uses a forward-Euler integration scheme

(4.7) Ψ() = X + ∆

Furthermore, assume an identity observation operator h = I; noiseless measurements Γ() = 0; identity process noise Σ() = I; and a Laplace prior expThen, the MAP estimate of the conditional distribution given in Equa-

tion (3.6) is equivalent to the SINDy estimator obtained by minimizing (4.6).

Proof. This proof is again a straightforward application of Theorem 3.3. Recall that the data are taken on the initial condition, and note that we have The log-marginal likelihood (3.19) is then

2 log 2(4.9)

Then we can drop the parameter-independent term and add the log prior to obtain a posterior that is proportional to

˜(4.10)

.(4.11)

Maximizing the posterior is equivalent to minimizing the term in the parentheses. By setting , we see that this is the exact form of the SINDy objective (4.6).

5. Algorithm and computational complexity. In this section, we describe an approximate marginal MCMC approach that has recently been introduced and analyzed in parallel by several different fields [21, 48, 20, 37]. This approach is fundamentally based on approximately evaluating the marginal likelihood described in Theorem 3.1.

5.1. Algorithm. Theorem 3.1 provides a recursive approach to evaluate the marginal likelihood that avoids computation of a high-dimensional integral, but this theorem still requires the evaluation of lower-dimensional integrals. In the linear case, the solution to these recursive integrals can be found using the Kalman filter, however no solution is available for general nonlinear systems.

When no closed-form solution exists for these integrals, nonlinear filtering techniques can be introduced. These can include ensemble Kalman filtering [23], Gaussian filtering (including cubature Kalman filter [3] and unscented Kalman filter [36]), and particle filtering [29]. Of these filters, only the particle filter has been proven to enable an exact pseudomarginal MCMC scheme [1]. The other schemes approximate the prediction, update, and marginalization equations – yielding a (generally) biased estimate of the posterior. Nevertheless, they are often more computationally tractable and have empirically shown good performance.

These algorithms embed these filters within the accept-reject step of MetropolisHastings MCMC scheme, as shown in Algorithm 5.1. We slightly modify the UKFMCMC scheme of [21] by using delayed-rejection adaptive Metropolis MCMC [31] instead of the standard Metropolis-Hastings MCMC. Specifically, the log posterior enters these schemes during the computation of the likelihood portion of the posterior

,(5.1)

where ) is the proposal distribution and ˆ) is the likelihood estimator. As we mentioned above, in the linear case we use a Kalman filter to exactly evaluate the marginal likelihood ( ˆ)) . This algorithm is shown in Algorithm A.1. In the nonlinear case, we approximate each distribution to be Gaussian and approximate the marginal posterior using an unscented Kalman filter (UKF) as shown in Algorithm A.2. In the UKF algorithm, and are parameters that determine the spread of the sigma points around the mean, is a parameter used for incorporating prior information on the distribution of x, and the notation [denotes the i-th row of the matrix [60].

5.2. Computational complexity. We will show in Section 6 that this approach yields more robust estimators than competing system ID approaches by accounting for measurement noise; however, this robustness will be at the cost of slightly increased computational complexity. In this section, we assess the cost of the algorithm both in the linear case where the Kalman filter is used and the nonlinear case where the UKF is used by counting the number of floating-point operations (flops) required by each algorithm.

For this analysis, addition, subtraction, multiplication, and division of two floating point numbers and the logarithm of one floating point number all count as one flop. The multiplication of an matrix by an matrix then counts as 1) flops because each of the mp entries of the product matrix requires n multiplications and 1 additions.Similarly, the multiplication of an matrix by an 1 vector requires 1) flops. Additionally, we approximate the cost of a Cholesky decomposition, matrix inversion, and determinant performed on an matrix all to be 3 flops. Furthermore, the complexity of these algorithms strongly depends on the complexity of the dynamical and measurement models used, which will vary

2We only consider the naive matrix-multiplication scheme, not the asymptotically more optimal approaches such as Strassens algorithm. 15

from problem to problem. For the sake of generality, we define the computational complexity of the dynamical model Ψ and measurement model h to be denoted as F and H respectively. Clearly in the linear case, these variables will not be needed as the dynamical and measurement models are matrices, and the number of flops can be calculated without loss of generality. The number of flops for each algorithm will be given in terms of the problem dimensions, so recall the following notation: d the dimension of the state, m the dimension of the measurements, p the number of parameters, and n the total number of measurements available.

Our analysis focuses entirely on the computation of the marginal likelihood, which is the dominant cost of the MCMC algorithm. The complexity of the rest of the algorithm will depend on the complexity of the MCMC algorithm and prior selected by the user, but is typically orders of magnitude lower than the likelihood computation. In the following analysis, we provide results for the Kalman filtering algorithm, the unscented Kalman filtering algorithm, their prediction and update subcomponents, DMD, and sparse regression. Table 1 shows the number of different types of operations required by each algorithm. Table 2 shows the number of flops for each algorithm where the computation of the regularization term in sparse regression is excluded. Note that although the mean and covariance of the marginal likelihood are computed in the update step of the Bayesian algorithms, the computation of the log of this distribution is excluded from this step, and is instead included only in the total. Also, the 18 flops outside the parentheses in the UKF total count comes from the formation of the weights, which is required only once at the beginning of the algorithm.

In determining the number of flops used in DMD, we counted the number of flops needed to solve the normal equation ()where . Similarly, sparse regression was considered to be the computation Θ = (ΞΞ), where Ξ and ˙. In practice, this computation is performed multiple times with an increasingly small Ξ matrix, but for this analysis, only one iteration of the optimization procedure is considered. To execute TDMD, a singular value decomposition (SVD) of the concatenated matrixis first

Table 1: Tally of matrix and vector operations of algorithms A.1, A.2, DMD, and SINDy. VEW and MEW are element-wise vector and matrix operations respectively such as addition, subtraction, and element-wise multiplication and division. MV is a matrix-vector or vector-vector multiplication, and MM is matrix-matrix multiplication. Inv is a matrix inversion, Det a determinant, and Chol a Cholesky decomposition.

Table 2: Flop count of algorithms A.1, A.2, DMD, and SINDy

performed, which has computational complexity on the order of ). The solution of the total least squares problem is then given by (). Let r be the rank of matrix. Then, is a matrix composed of the first m rows of the last 2right singular vectors, and is a matrix composed of the last m rows of the last 2right singular vectors. The computational complexity of this least squares problem is then . Since m = d in the case of DMD the total computational complexity is on the order + ). Thus the added cost of including measurement noise is on the order ).

The computational costs of the Bayesian algorithms are on the order + )). Typically the dimension m of the observations is small, so this algorithm is primarily limited by the dimension d of the state vector. Furthermore, the dimension p of the parameter vector only affects the evaluation of the prior, which is usually chosen so as to be easy to compute. Therefore, this algorithm is most efficient for problems where the state dimension is low and the parameter dimension is high, such as in nonlinear regression problems.

6. Numerical experiments. In this section, we provide a set of empirical re- sults that demonstrate a lack of robustness amongst methods that do not account for all three sources of uncertainty. We then show that our proposed approach is able to perform well under a greater variety of experimental conditions. The conditions of each experiment are designed to highlight and exaggerate the specified limitation of some specific methods. We will show that in many cases only small changes to the setting, for instance a slightly larger noise or slower sampling frequency, can yield significant difference in learning with these existing methods – demonstrating their lack of robustness.

Our evaluations of the methodology examine two quantities: reconstruction errors and prediction/forecasting errors. Reconstruction error compares how well the learned parameter is able to match the trajectory from which the data were generated. This is essentially training error, and used more to verify that the algorithms are working properly. Prediction/forecasting compares our estimate to some trajectory that is not contained in the data. These trajectories could be a continuation of the system into the future from the last point at which data were taken, or it could be starting the estimated dynamics at a different initial condition. This comparison is of greater interest because it tests the extrapolatory power of the learned dynamics.

6.1. Algorithmic settings. To perform the following experiments, MATLAB 2019b was used. For our MCMC algorithm, we selected the delayed rejection adaptive Metropolis (DRAM) algorithm [31]. The tuning parameters of this algorithm are the number of samples to draw before beginning the AM algorithm, and the scaling factor used by DR to scale the second-tier proposal covariance. In this paper, we used = 200 and = 0.01 for each experiment. Also throughout the algorithm, whenever a covariance matrix was calculated, a nugget was added where = 10

to help ensure positive definiteness. Furthermore, the algorithm requires selection of a starting sample and initial proposal covariance; we used the MAP point as our initial sample , and the inverse Hessian of the negative log posterior evaluated at to be the initial covariance of our proposal distribution:

Both of these values were found using MATLAB’s fminunc function. For nonlinear systems, we must additionally select parameters , and for the UKF. In this paper, we followed a common choice of parameter selection where = 10= 0, and = 1.

Unless otherwise specified, an improper uniform prior is used for all dynamics model parameters

) = (6.2)

and half normal priors are specified on the variance parameters and as suggested in [25]

) = half-N(0, 1).(6.3)

The code used to implement the Bayesian algorithms can be found on the author’s GitHub https://github.com/ngalioto. To execute DMD, MATLAB’s right matrix division operator ‘/’ was used, which returns the least squares solution. TDMD was performed using a script taken from MATLAB file exchange [33] that solves the total least squares problem. Lastly, SINDy was run using code from [10], which utilizes code from [11] to compute the total variation regularized derivatives.

6.2. Linear pendulum, linear model. In this section, we consider learning a linear model under an identity observation operator h = I when the truth model is also linear. We show that that the proposed probabilistic approach is more robust to sparse observations and measurement noise than the least squares-based DMD and TDMD.

Consider the linear model (2.5) for which the exact propagator is

where g = 9.81 is the acceleration due to gravity and L = 1 is the length of the pendulum.

We are learning an unknown linear model ) and assume that the process noise and measurement noise is also uncertain. Under this setting, System (3.1) becomes

where

Because this setup is precisely the one corresponding to DMD, we seek to compare the performance of our approach to DMD and TDMD. Our comparison takes the form of average performance over 500 different realizations of the data sets for different combinations of training data sizes n and true measurement noise standard deviation . The data points are spread out over a simulation period of four seconds, so increasing n indicates increasing density of data per time.

The results, shown in Figure 3, provide (log base 10) ratios of the expected error of the posterior predictive mean (computed with 1000 posterior samples) to the (T)DMD estimators. The squared errors were calculated only at the times of observations, and the largest MSE from each data set for each algorithm was discarded to prevent biasing from outliers. We see that the biggest gains in using the probabilistic Bayesian approach come in the low noise regime. At first this seems surprising, but in the low noise regime, this is likely the result of the scale of the errors being so small. As the noise increases, we see the ratio increasing even though we’d expect DMD to break down much more quickly than the Bayesian approach. The reason this occurs is because DMD predictions decay to zero after a certain level of noise (shown in Figure 3a), effectively placing an upper bound on the MSE of the algorithm. Regardless, the contour plots show that the Bayesian algorithm outperforms both DMD and TDMD at every measurement frequency and noise pair considered.

Next we provide a detailed look at two specific points on these contour plots to demonstrate the mechanism by which DMD/TDMD decline. The first case is a low-noise/sparse-data case of = 10and n = 8, and the second case is for a higher noise case = 10with more data n = 40.

Fig. 3: Log base 10 ratio of the MSE obtained by the proposed Bayesian approach to that obtained by (T)DMD for the linear pendulum model. In all cases, this value is less than zero signifying that our proposed approach outperforms (T)DMD in all cases considered. Also observe in the high noise regime, TDMD can begin to lose stability.

The reconstruction results for each state are compared in Figure 4. The prediction (forecasting) results for just the second state are shown in Figure 5. The shaded area represents the region between the 97.5th and 2.5th quantiles of the Bayesian posterior. In the low noise case, we see that all three algorithms perform essentially equally – though the DMD-based approaches slightly underestimate the amplitude. In other words, even in the case for which DMD was designed to perform well, the Bayesian approach performs slightly better. In the high noise case, we see that the TDMD predictions become completely out of phase with increasingly small amplitude, and the DMD estimator smooths out the data too much and rapidly converges to zero. Not only does the Bayesian approach provide the most accurate estimate, but it also gives a quantification of the certainty of its estimate in the form of its posterior, which (T)DMD is unable to provide.

Figure 6 shows the estimated eigenvalues of the system by the Bayesian and (T)DMD algorithms. In the low noise case, Figure 6a shows that the Bayesian approach is slightly more accurate than the (T)DMD approaches, though they all perform well. For the high noise case, Figure 6b shows that DMD is unable to provide a reasonable estimate of the eigenvalues. TDMD gives a close estimate, but the estimated eigenvalues are too far in the left-hand plane, causing the gradual decay seen in Figure 5. The Bayesian estimate lies almost exactly on top of the truth.

Finally, Figure 7 shows the marginal and joint distributions of the process and measurement noise variances for these two cases. The process noise is very close to zero because we are using a linear model for a linear system, and thus the system learns that the dynamics can be captured exactly. These plots also indicate that we have learned the measurement noise, as the mode aligns closely with the true value shown in red. Note also that the joint distribution in this figure shows that the two noise variances are negatively correlated, conveying the fact that the estimator does not yet have enough data to determine if the model is off and the measurements are accurate, or if the model is accurate and the measurements are noisy. As more data come in, however, one of these scenarios can usually be ruled out and the distribution becomes unimodal.

Fig. 4: Comparison of reconstruction error amongst the Bayesian and (T)DMD algorithms for different levels of noise and measurements for the linear pendulum truth model. Top row corresponds to a low-noise/sparse-data case and the bottom row corresponds to a high-noise/dense-data case. Left column corresponds to the first state (angular position) and right column corresponds to the second state (angular velocity). For low-noise, the algorithms perform similarly; however, the (T)DMD approaches underestimate the amplitude. For the high-noise case, DMD fails and TDMD misfits the amplitude. The Bayesian approach is able to recognize greater uncertainty for the high-noise case.

the model class within which we are learning does not encompass the true underlying dynamical system. This is the most realistic situation that would be encountered in practice, and avoids the so-called “inverse crime” [14, 63].

(6.7)

to be the truth model. We have changed the initial condition to ensure that we are operating in the nonlinear regime.

The learning setup is identical to that provided in Section 6.2; we learn a linear model, and the same validation experiments are performed. These experimental

Fig. 5: Comparison of prediction error amongst the Bayesian and (T)DMD algorithms for different levels of noise and measurements for the linear pendulum truth model. Left panel corresponds to a low-noise/sparse-data case and the right panel corresponds to a high-noise/dense-data case. Both panels show the angular velocity of the pendulum. For low-noise, the algorithms perform similarly. For the high-noise case, DMD fails and TDMD can be seen to be out of phase and have a smaller amplitude. The Bayesian approach is able to recognize greater uncertainty for the high-noise case.

Fig. 6: Eigenvalue distributions for the estimators of the linear pendulum. All three algorithms come very close to learning the true eigenvalues in the low noise case, but Bayes is able to outperform the other two in both the high and low noise cases. DMD achieves significant error when the data are noisy. The mean value here represents the mean of the eigenvalues.

results are shown in Figure 8. We are able to clearly see here that, although the three algorithms are comparable in the low noise regime, the strength of the Bayesian approach increases with the measurement noise. A discussion on why (T)DMD may outperform the mean estimator from the Bayesian approach in the low noise regime is provided later in Section 6.3.a.

We again present more detailed results for two representative cases. Both cases have n = 24 data points, but the first case is a low noise case of = 10and the

Fig. 7: Marginal and joint posterior distributions of the process and measurement noise variance parameters during the recovery of the linear pendulum. In the left panel, 8 measurements are not enough for the Bayesian estimator to unambiguously determine the measurement noise, but its best guess (the mode) aligns with the truth. On the right, we see that 40 measurements are enough to define a distinct mode within the joint distribution, which also aligns with the truth.

second case is a higher noise case of = 1.

The resulting reconstructions are shown in Figure 9, and the predictions are given in Figure 10. Note that the variances of the posterior distributions in both cases grow much more quickly than in either of the linear pendulum examples as a consequence of increased model uncertainty (process noise). The posterior distribution can therefore be used to qualitatively assess not only how informative the data are, but also how appropriate the chosen model is for the system at hand. In the low noise case, the performances of the three estimates are virtually indistinguishable, once again demonstrating that even in systems that are ideal for (T)DMD, there is no loss of performance when using the Bayesian estimator. In the high noise case, DMD struggles with noisy measurements and settles on quickly decaying to zero, similar to what we observed in the linear case. TDMD, on the other hand, comes closer but is noticeably out of phase with the truth. The Bayesian approach is able to reconstruct the signal very closely, at least within the constraints imposed by using a linear model.

Next we investigate what the Bayesian approach learns for the process and measurement noise in the case where there is a model error. The marginal and joint posterior distributions for both measurement noise cases are shown in Figure 11. We observe that in the low noise case 11a, the joint distribution is bimodal. The smaller mode corresponds to a model with low process noise and high measurement noise, and the larger mode corresponds to a model with high process noise and low measurement noise. Bayes has effectively uncovered that the data can be explained in one of two ways: either the model fits the true system well, but the data are very noisy, or the measurement noise is low and the model is not capable of properly capturing the dynamics. In this case, the latter is true and is also the option that Bayes found to

Fig. 8: The experiment conducted here is the same as described in Figure 3, but this time with the nonlinear pendulum. Note that as the measurement noise increases, the log ratios decrease, reflecting the robustness of the Bayesian approach in the face of noisy measurements. A detailed explanation for the low noise regime where it appears (T)DMD outperforms Bayes is given in Section 6.3.a.

be much more likely. For the high noise case 11b, the joint distribution is unimodal, conveying the possibility of only one process-measurement noise pairing. Once again, the modes of both the measurement noise marginal distributions align closely with the truth shown in red. Finally, we see that the process noise magnitudes in both cases are much larger than those seen in the linear pendulum examples (Figure 7) as a consequence of trying to capture nonlinear dynamics with a linear model.

6.3.a. Discussion on diagnostics. One of the strengths of the Bayesian ap- proach is that it separates the learning stage from the decision making stage, so if the initial decision rule yields an unsatisfactory estimate, one can go back and analyze the posterior distribution to devise an improved decision rule. It was noted earlier in Figure 8 that the average MSE of (T)DMD is lower than that of the average MSE of the Bayesian estimator over 500 data sets when the measurement noise is low. This observation likely implies that there is a better decision rule that can be used to achieve performance at least as strong as DMD. To understand how to best select a point from the posterior to be our estimate, we first look at the posterior over the states. Figure 12 shows samples from the posterior predictive distribution for a single data set containing n = 26 measurements with noise standard deviation of = 0.1. The mean deviates from the truth near the peaks and valleys of the trajectory between about 2.5 and 4 seconds. This is the same location in which the posterior appears to be significantly spread in possible predictions. This presence of significant outliers is a result of the bimodal noise distribution previously discussed. Furthermore, it is clear that the mean is not a good estimator in the case of bimodal distributions; however, we see that there exists a mode in alignment with the truth. Upon this realization, we can then craft a decision rule that selects this mode rather than the mean for improved performance. In this case, the mode-based rule would result in the Bayesian approach being 1.3 times better than the TDMD estimator. Moreover, this entire analysis can be done a posteriori, and therefore uses no additional assumptions or requirements on our approach.

Fig. 9: Reconstruction performance for low-noise (top row) and high-noise (bottom row) data sets for the nonlinear pendulum using a linear model. All three estimates capture the truth closely in the low noise case, but only the Bayesian algorithm performs well (it is in phase and approximately the correct amplitude) for the high noise case. Furthermore, it reflects the additional uncertainty resulting from the simplistic linear model through its posterior distribution as compared to the results in Figure 4. In the high noise case, DMD fails as it did for the linear pendulum, and TDMD underestimates both the period and amplitude of the pendulum’s oscillations.

We also note that the effect this has on the MSE ratio appears more strongly in this nonlinear case for two reasons. The first reason is that the higher process noise due to the model error and low measurement noise can create a bimodal distribution because of the alternate possibility of a good model with noisy data as shown earlier. The second reason is that the ratio of process noise to measurement noise is higher than that in the linear case. As we have shown in Theorem 4.1, the (T)DMD approaches effectively assume the existence of process noise but no measurement noise. In cases where the linear and nonlinear models are mismatched this becomes a better assumption.

In summary, for cases where the model error can be significant, a non-mean estimator should be extracted from the Bayesian posterior. This estimator should be chosen by considering the bimodality of the learned process/measurement noise

Fig. 10: Comparison shown here is the same as in Figure 5, but this time for a nonlinear pendulum truth model. In the low noise case, the estimates are all visually aligned with the truth. In the high noise case, DMD fails and TDMD falls out of phase, but the Bayesian algorithm remains robust and produces an accurate estimate.

Fig. 11: Marginal and joint posterior distributions of the process and measurement noise variance parameters during the recovery of the nonlinear pendulum. In the left panel, the joint distribution is bimodal, offering two possible models with the true case being strongly preferred. In the right panel, all of the distributions are unimodal and in alignment with the truth.

estimator, and can often be the peak of one of the modes. If this is done (it is an a posteriori procedure), we have seen that it yields improved performance compared to (T)DMD.

6.4. Optimal estimators and the Van der Pol oscillator. Next we consider learning a sparse representation of a nonlinear system so that we can compare the Bayesian algorithm directly to SINDy. Here, it will once again be shown that factoring

Fig. 12: Posterior samples from a data set with n = 26 data points with noise standard deviation of = 0.1. This data set produced the worst mean estimate out of the 500 with respect to MSE. This figure illustrates that the mean deviates from the truth at the extrema of the curve where samples are skewed toward larger magnitudes. Using a decision rule that selects the mode here would give a much better estimate.

the process and measurement noise into our estimator will allow it to be robust even for noisy measurements.

where = 3. In this case, we use the SINDy algorithm rather than DMD to account for the nonlinear dynamics. For both the Bayesian and SINDy algorithms we therefore consider a subspace of right hand sides that is spanned by a set of candidate functions. We choose monomial candidates up to third degree and their interacting terms. As a result, each algorithm seeks to learn 20 dynamics parameters (10 for each state). The Bayesian algorithm is additionally tasked with learning the covariance matrices parameterized as follows:

The priors on the dynamics parameters are Laplace distributions with zero mean and on the variance parameters are once again half-normal distributions.

We consider two cases: one where SINDy shows strong performance, and one in which SINDy struggles, and we show that the Bayesian algorithm yields an accurate estimate in both cases. The case in which SINDy excels is frequent and low noise data. Here, n = 2, 000 measurements were taken over the course of 20 seconds with measurement noise standard deviation of = 10. In the opposite case, we collect only n = 200 measurements over 20 seconds with measurement noise standard deviation of = 210.

The reconstructions from these experiments are shown in Figure 13, predictions are given in Figure 14, and the phase plots over 200 seconds are given in Figure 15. Here, the mode represents the mode of the posterior predictive distribution. In the low noise case, we see that the Bayesian algorithm and SINDy both capture the dynamics

Fig. 13: Comparison of reconstruction error amongst the Bayesian and SINDy algorithms for different levels of noise and measurements for the Van der Pol system. Top row corresponds to a low-noise/dense-data case, and the bottom row corresponds to a high-noise/sparse-data case. Left column corresponds to the first state (position), and right column corresponds to the second state (velocity). The Bayesian estimator is able to accurately reconstruct the dynamics, even in the presence of high noise.

very closely. We see that SINDy agrees slightly more closely with the trajectory as a result of its hard threshold regularization. Note that the posterior in this case is very small because the high number of data points and low measurement noise gives us high certainty in our estimate. In the high noise case, we see that SINDy gives a similar result to what DMD gave when the measurements were noisy: the trajectory immediately flatlines. When the data are noisy like this, the procedure for SINDy is to denoise the data using total variation (TV) regularization [11] before executing the algorithm. However, the increased timestep between data makes it difficult to accurately denoise the data, and when the TV regularization is performed, SINDy ends up giving an unstable estimate. The Bayesian approach, however, is still able to identify the dynamics of the Van der Pol system. The posterior in this high noise case is wider, signifying that the estimate holds more uncertainty than the low noise and frequent measurements case.

Fig. 14: Comparison of prediction error amongst the Bayesian and SINDy algorithms for different levels of noise and measurements for the Van der Pol system. The meaning of the figures is the same as described in Figure 13. The model learned by the Bayesian estimator is still accurate at a different initial condition.

6.5. Known model form. Finally, we consider the case where the model form is known, for instance from physical laws, but the parameters are uncertain. This is the classical inference setting and has seen a lot of development [46, 45, 9, 19, 38], including in the computational physics community. However, much of this literature either only considers deterministic dynamics according to some variation of Equation (2.2) or only static problems. In this section, we consider both a chaotic system and a reaction-diffusion PDE in which we impose process noise to aid in the parameter estimation. For the reaction-diffusion PDE, this implies that the process noise is added to the discretized dynamics. Our results suggest that these methods are applicable to spatial problems and are able to effectively learn chaotic dynamics with a much smaller amount of data than observed in the literature.

(6.10)

Fig. 15: Phase-diagram reconstruction for the Van der Pol oscillator under the two indicated data conditions. In the low-noise and frequent data domain, both the Bayesian and SINDy estimates lie directly on the truth. In the high noise-case, the Bayesian posterior is wider, but is still visually aligned with the truth. The SINDy estimate is unable to recover the limit cycle, and the large “x” marks the equilibrium point to which SINDy converges, as shown in Figure 14.

The initial condition of this system was chosen so as to sit on the attractor. We attempt only to learn the parameters = (). The difficulty with learning in chaotic systems is that the computation of the likelihood can be challenging. Since the likelihood involves running a filter, and filtering chaotic systems is well known to be challenging, it may seem that our approach would breakdown. Here we show that our Gaussian filtering approach is still able to learn an approximate dynamical system without resorting to more complicated likelihood building processes, e.g., using correlation integrals [30, 59].

The priors on the dynamics parameters are once again improper and uniform. In addition to learning the model parameters in this example, we also learn the process noise variance for each state and the measurement noise variance for a total of seven parameters. The parameterizations of the covariance matrices are shown:

) = (6.11)

with half-normal priors as before.

One hundred data points uniformly spaced over ten seconds are collected with a true measurement noise standard deviation of 2.0. The predicted state trajectories after 10 seconds of simulation using the parameter posterior mode are shown in Figure 16. Similar to the Van der Pol oscillator, the dynamics exist on a low-dimensional attractor in phase space, and the wide, but constant, posterior distribution once again reflects this fact. Figure 17 shows the reconstructed and predicted attractors from the Bayesian algorithm. These figures show that while we cannot accurately capture the state, indeed all methods would eventually break down due to the chaotic nature of the system, we do predict a qualitatively similar attractor. As such, one would expect that most post-processing of these attractors, e.g., for control, would yield similar

Fig. 16: Lorenz ’63 prediction posteriors. Measurements were taken every 0.1 seconds for 10 seconds with noise standard deviation of 2.0. Although the trajectories become misaligned rather quickly due to the chaotic nature of the system, the posterior phase diagram 17 reveals that the algorithm has discovered that the dynamics exist on a low-dimensional attractor.

Fig. 17: Reconstruction and prediction of the Lorenz ’63 attractor. The right panel compares the predicted and true trajectories up to 200 seconds using the mode of the parameter posterior distribution. The proposed approach is able to successfully discover the Lorenz attractor from sparse, noisy data.

results.

6.5.b. Reaction diffusion. In the final example we consider both a PDE and a case where the measurement operator h is not the identity. The reaction diffusion PDE is given by

(6.12)

where specify the concentrations. A one dimensional spatial grid was selected to have regular intervals of 0.4 units between boundaries of -40 and 40 for a total of 201 grid points for each of the two states. The boundary conditions at 40 are

and the initial condition of the system was drawn from a uniform distribution as shown

(for t = 0; = 1, 2; = 1, ..., 201.(6.14)

Similar to the Lorenz example, for this system we attempt to learn only the model parameters, , and rather than the complete model. The measurement covariance matrix is assumed to be known, and the process noise covariance is fixed to be 1e-8 such that the total number of parameters that we are learning remains only three. The observation operator indirectly measures the concentration through only the first two moments of the concentration of the first species at certain time intervals

(6.15)

We collect measurements every 0.5 seconds for 15 seconds with noise standard deviation of 10. The reconstructions and predictions of the moments from these data using the mode of the parameter posterior distribution are shown in Figure 18. Additionally, the true and reconstructed contours of and are shown in Figure 19. The Bayesian estimate shows close agreement with the truth.

7. Conclusion. In this paper, we have shown how data-driven system ID methods that consider only the measurement noise or only the process noise are impractical for many problems. When only the measurement noise is considered, increasingly many local minima arise as data collection is continued, making identification of the optimal solution difficult. When only process noise is considered, noisy and/or sparse measurements can cause the estimator to break down, even after incorporation of a denoising algorithm. By deriving a probabilistic model of our dynamical system from first principles, we were able to account for how parameter, model, and measurement uncertainty can each affect the learning problem in different ways. From this probabilistic formulation, we were then able to prove that DMD and SINDy assume noiseless measurements, and are thus poorly suited for problems where the measurement noise to process noise ratio is nonnegligible.

Next we outlined a Kalman filter and unscented Kalman filter (UKF-MCMC) MCMC algorithm for linear models and nonlinear models respectively that facilitates drawing samples from the marginal posterior without having to compute a high-dimensional integral to marginalize out the states from the joint posterior. In the linear case, the Kalman filter algorithm targets exactly the marginal posterior, but in the nonlinear case, the UKF-MCMC targets an approximate marginal posterior. A comparison of the computational complexity of these algorithms to that of DMD and sparse regression was then performed. It was found that the cost of the Bayesian algorithms is roughly n times more expensive than DMD and sparse regression, but

Fig. 18: Reconstruction and prediction of the observables of the reaction diffusion system. The top row shows the reconstruction, and the bottom row shows the prediction for an alternate initial condition. The left column is the first measurement state (first moment), and the right column is the second measurement state (second moment). The estimates are very close to the truth, demonstrating the generality of the learned model

for many problems, this is an acceptable cost for the enhanced performance of the Bayesian algorithms.

Lastly, the Bayesian algorithms were compared to DMD and SINDy on a number of systems for varying values of measurement noise and frequency. It was shown that when substantial noise is introduced into the measurements, DMD and SINDy will fail due to their underlying assumption that the measurements are noiseless. The Bayesian algorithm makes no such assumption and in addition to yielding strong performance for low-noise measurements, remains robust to noisy and infrequent data as well. Thus it has been empirically shown that consideration of parameter, model, and measurement uncertainty leads to enhanced performance on a wider class of systems than that to which most least squares-based approaches can be reliably applied.

Acknowledgments. This research was primarily supported by the DARPA Physics of AI Program under the grant “Physics Inspired Learning and Learning the Order and Structure of Physics.” It was also supported in part by the DARPA Arti-ficial Intelligence Research Associate under the grant “Artificial Intelligence Guided

Fig. 19: The experiment is the same as in Figure 18. The top row shows the true contours of and . The bottom row shows the contours of and reconstructed using the mode of the parameter posterior distribution. Visually, the two rows appear very similar, reflecting the strong performance of the Bayesian algorithm.

Multi-scale Multi-physics Framework for Discovering Complex Emergent Materials Phenomena.”

Appendix A. Pseudocode. In this appendix we provide the pseudocode for both the linear Kalman filter and nonlinear unscented Kalman filter algorithms.

REFERENCES

[1] C. Andrieu, A. Doucet, and R. Holenstein, Particle markov chain monte carlo methods, Journal of the Royal Statistical Society. Series B (Statistical Methodology), 72 (2010), pp. 269–342.

[2] C. Andrieu and G. O. Roberts, The pseudo-marginal approach for efficient monte carlo computations, The Annals of Statistics, 37 (2009), pp. 697–725.

[3] I. Arasaratnam and S. Haykin, Cubature kalman filters, IEEE Transactions on Automatic Control, 54 (2009), pp. 1254–1269.

[4] N. Aubry, R. Guyonnet, and R. Lima, Spatiotemporal analysis of complex signals: theory and applications, Journal of Statistical Physics, 64 (1991), pp. 683–739.

[5] I. Ayed, E. de Bzenac, A. Pajot, J. Brajard, and P. Gallinari, Learning dynamical

systems from partial observations, 2019.

[6] T. D. Barfoot, State estimation for robotics, Cambridge University Press, 2017.

[7] V. M. Becerra, F. R. Garces, S. J. Nasuto, and W. Holderbaum, An efficient parameterization of dynamic neural networks for nonlinear system identification, IEEE Transactions on Neural Networks, 16 (2005), pp. 983–988.

[8] J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer New York, 1985.

[9] O. Brastein, D. Perera, C. Pfeifer, and N.-O. Skeie, Parameter estimation for grey-box models of building thermal behaviour, Energy and Buildings, 169 (2018), pp. 58 – 68.

[10] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences, 113 (2016), pp. 3932–3937.

[11] R. Chartrand, Numerical differentiation of noisy, nonsmooth data, ISRN Appl. Math., 2011 (2011).

[12] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, Neural ordinary differential equations, in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., Curran Associates, Inc., 2018, pp. 6571–6583.

[13] Z. Chen and D. Xiu, On generalized residue network for deep learning of unknown dynamical systems, 2020.

[14] D. Colton and R. Kress, Inverse acoustic and electromagnetic scattering theory, 1992.

[15] P. G. Constantine and Q. Wang, Residual minimizing model interpolation for parameterized nonlinear dynamical systems, SIAM Journal on Scientific Computing, 34 (2012), pp. A2118–A2144.

[16] J. J. Craig, P. Hsu, and S. S. Sastry, Adaptive control of mechanical manipulators, The International Journal of Robotics Research, 6 (1987), pp. 16–28.

[17] H. Dai and N. Sinha, Robust recursive least-squares method with modified weights for bilinear system identification, in IEE Proceedings D (Control Theory and Applications), vol. 136, IET, 1989, pp. 122–126.

[18] A. Delgado, C. Kambhampati, and K. Warwick, Dynamic recurrent neural network for system identification and control, IEE Proceedings - Control Theory and Applications, 142 (1995), pp. 307–314.

[19] S. Dokos and N. H. Lovell, Parameter estimation in cardiac ionic models, Progress in Biophysics and Molecular Biology, 85 (2004), pp. 407 – 431. Modelling Cellular and Tissue

Function.

[20] C. Drovandi, R. G. Everitt, A. Golightly, and D. Prangle, Ensemble mcmc: Accelerating pseudo-marginal mcmc for state space models using the ensemble kalman filter, (2019).

[21] K. Erazo and S. Nagarajaiah, An offline approach for output-only bayesian identification of stochastic nonlinear systems using unscented kalman filtering, Journal of Sound and Vibration, (2018).

[22] N. B. Erichson, M. Muehlebach, and M. W. Mahoney, Physics-informed autoencoders for lyapunov-stable fluid flow prediction, 2019.

[23] G. Evensen, Sequential data assimilation with a nonlinear quasi-geostrophic model using monte carlo methods to forecast error statistics, Journal of Geophysical Research: Oceans, 99 (1994), pp. 10143–10162.

[24] Ocean Modeling and Parameterization, Springer, 1998, pp. 373–398.

[25] A. Gelman, Prior distributions for variance parameters in hierarchical models, Bayesian Analysis, 1 (2006).

[26] H. R. Glahn and D. A. Lowry, The use of model output statistics (mos) in objective weather forecasting, Journal of applied meteorology, 11 (1972), pp. 1203–1211.

[27] T. Gneiting and A. E. Raftery, Weather forecasting with ensemble methods, Science, 310 (2005), pp. 248–249.

[28] G. H. Golub and C. F. V. Loan, An analysis of the total least squares problem, SIAM Journal on Numerical Analysis, 17 (1980), pp. 883–893.

[29] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, Novel approach to nonlinear/nongaussian bayesian state estimation, IEE Proceedings F - Radar and Signal Processing, 140 (1993), pp. 107–113.

[30] H. Haario, L. Kalachev, and J. Hakkarainen, Generalized correlation integral vectors: A distance concept for chaotic dynamical systems, Chaos: An Interdisciplinary Journal of Nonlinear Science, 25 (2015), p. 063102.

[31] H. Haario, M. Laine, A. Mira, and E. Saksman, Dram: efficient adaptive mcmc, Statistics and Computing, 16 (2006), pp. 339–354.

[32] M. Hemati, C. Rowley, and L. Cattafesta, De-biasing the dynamic mode decomposition for applied koopman spectral analysis, Theoretical and Computational Fluid Dynamics, (2017).

[33] I. Houtzager, Total least squares with mixed and/or weighted disturbances. https://www.mathworks.com/matlabcentral/fileexchange/ 50332-total-least-squares-with-mixed-and-or-weighted-disturbances, 2019. Retrieved December 5, 2019.

[34] S. V. Huffel and J. Vandewalle, Analysis and properties of the generalized total least squares problem when some or all columns in a are subject to error, SIAM Journal on Matrix Analysis and Applications, 10 (1989), pp. 294–22.

[35] B. R. HUNT, E. KALNAY, E. J. KOSTELICH, E. OTT, D. J. PATIL, T. SAUER, I. SZUNYOGH, J. A. YORKE, and A. V. ZIMIN, Four-dimensional ensemble kalman filtering, Tellus A, 56 (2004), pp. 273–277.

[36] S. J. Julier and J. K. Uhlmann, New extension of the Kalman filter to nonlinear systems, in Signal Processing, Sensor Fusion, and Target Recognition VI, I. Kadar, ed., vol. 3068, International Society for Optics and Photonics, SPIE, 1997, pp. 182 – 193.

[37] M. Khalil, A. Sarkar, S. Adhikari, and D. Poirel, The estimation of time-invariant parameters of noisy nonlinear oscillatory systems, Journal of Sound and Vibration, 344 (2015), pp. 81 – 100.

[38] G. A. Kivman, Sequential parameter estimation for stochastic systems, Nonlinear Processes in Geophysics, 10 (2003), pp. 253–259.

[39] I. E. Lagaris, A. Likas, and D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Transactions on Neural Networks, 9 (1998), pp. 987– 1000.

[40] K. Law, A. Stuart, and K. Zygalakis, Data assimilation, Cham, Switzerland: Springer, (2015).

[41] E. Leffens, F. Markley, and M. Shuster, Kalman filtering for spacecraft attitude estimation, Journal of Guidance, Control, and Dynamics, 5 (1982), pp. 417–429.

[42] E. N. Lorenz, Deterministic nonperiodic flow, Journal of the Atmospheric Sciences, 20 (1963), pp. 130–141.

[43] J. L. Lumley, Stochastic tools in turbulence, Courier Corporation, 2007.

[44] I. Maqsood, M. Khan, and A. Abraham, An ensemble of neural networks for weather forecasting, Neural Computing and Applications, 13 (2004), pp. 112–122.

[45] Y. M. Marzouk and H. N. Najm, Dimensionality reduction and polynomial chaos acceleration of bayesian inference in inverse problems, Journal of Computational Physics, 228 (2009), pp. 1862–1902.

[46] Y. M. Marzouk, H. N. Najm, and L. A. Rahn, Stochastic spectral methods for efficient bayesian solution of inverse problems, Journal of Computational Physics, 224 (2007), pp. 560–586.

[47] P. A. Mastorocostas and J. B. Theocharis, A recurrent fuzzy-neural model for dynamic system identification, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 32 (2002), pp. 176–190.

[48] S. Noh, Posterior inference on parameters in a nonlinear dsge model via gaussian-based filters, Computational Economics, (2019).

[49] E. Ott, B. R. Hunt, I. Szunyogh, A. V. Zimin, E. J. Kostelich, M. Corazza, E. Kalnay, D. Patil, and J. A. Yorke, A local ensemble kalman filter for atmospheric data assimilation, Tellus A: Dynamic Meteorology and Oceanography, 56 (2004), pp. 415–428.

[50] H. Peng, L. Li, Y. Yang, and F. Liu, Parameter estimation of dynamical systems via a chaotic ant swarm, Physical Review E, 81 (2010), p. 016207.

[51] M. M. Polycarpou and P. A. Ioannou, A robust adaptive nonlinear control design, in 1993 American Control Conference, June 1993, pp. 1365–1369.

[52] J. L. Proctor, S. L. Brunton, and J. N. Kutz, Dynamic mode decomposition with control, SIAM J. Applied Dynamical Systems, 15 (2014), pp. 142–161.

[53] C. Rackauckas, Y. Ma, J. Martensen, C. Warner, K. Zubov, R. Supekar, D. Skinner, and A. Ramadhan, Universal differential equations for scientific machine learning, 2020.

[54] M. Raissi, P. Perdikaris, and G. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378 (2019), pp. 686 – 707.

[55] C. W. Rowley, I. Mezi, S. Bagheri, P. Schlatter, and D. S. Henningson, Spectral analysis of nonlinear flows, Journal of Fluid Mechanics, 641 (2009), p. 115127.

[56] P. J. Schmid, Dynamic mode decomposition of numerical and experimental data, Journal of Fluid Mechanics, 656 (2010), p. 528.

[57] L. Sirovich, Turbulence and the dynamics of coherent structures. i. coherent structures, Quarterly of applied mathematics, 45 (1987), pp. 561–571.

[58] J.-J. E. Slotine and W. Li, On the adaptive control of robot manipulators, The International Journal of Robotics Research, 6 (1987), pp. 49–59.

[59] S. Springer, H. Haario, V. Shemyakin, L. Kalachev, and D. Shchepakin, Robust parameter estimation of chaotic systems, Inverse Problems & Imaging, 13 (2019), p. 1189.

[60] S. Srkk, Bayesian Filtering and Smoothing, Institute of Mathematical Statistics Textbooks, Cambridge University Press, 2013.

[61] N. Takeishi, Y. Kawahara, Y. Tabei, and T. Yairi, Bayesian dynamic mode decomposition, in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017, pp. 2814–2821.

[62] I. G. Tsoulos, D. Gavrilis, and E. Glavas, Solving differential equations with constructed neural networks, Neurocomputing, 72 (2009), pp. 2385 – 2391. Lattice Computing and Natural Computing (JCIS 2007) / Neural Networks in Intelligent Systems Designn (ISDA 2007).

[63] A. Wirgin, The inverse crime, 2004.

[64] K. Wu and D. Xiu, Data-driven deep learning of partial differential equations in modal space, Journal of Computational Physics, (2020), p. 109307.

designed for accessibility and to further open science