EM-like Learning Chaotic Dynamics from Noisy and Partial Observations

2019·Arxiv

Abstract

Abstract

The identification of the governing equations of chaotic dynamical systems from data has recently emerged as a hot topic. While the seminal work by Brunton et al. reported proof-of-concepts for idealized observation setting for fully-observed systems, i.e. large signal-to-noise ratios and high-frequency sampling of all system variables, we here address the learning of data-driven representations of chaotic dynamics for partially-observed systems, including significant noise patterns and possibly lower and irregular sampling setting. Instead of considering training losses based on short-term prediction error like state-of-the-art learning-based schemes, we adopt a Bayesian formulation and state this issue as a data assimilation problem with unknown model parameters. To solve for the joint inference of the hidden dynamics and of model parameters, we combine neural-network representations and state-of-the-art assimilation schemes. Using iterative Expectation-Maximization (EM)-like procedures, the key feature of the proposed inference schemes is the derivation of the posterior of the hidden dynamics. Using a neural-network-based Ordinary Differential Equation (ODE) representation of these dynamics, we investigate two strategies: their combination to Ensemble Kalman Smoothers and Long Short-Term Memory (LSTM)-based variational approximations of the posterior. Through numerical experiments on the Lorenz-63 system with different noise and time sampling settings, we demonstrate the ability of the proposed schemes to recover and reproduce the hidden chaotic dynamics, including their Lyapunov characteristic exponents, when classic machine learning approaches fail.

1 Introduction

Dynamical systems are at the core of numerous scientific fields, among which we may cite geosciences, aerodynamics, fluid dynamics, etc. Classically, the determination of the governing laws of a given system, usually stated as Ordinary Dynamical Equations (ODE) or Partial Differential Equations (PDE) combines mathematical derivation based on physical principles and some experimental validations [1, 2, 3, 4], this approach forms the discipline of data assimilation. Recently, advances in machine learning [5], together with the availability of more and more data, open an appealing means for the data-driven identification of dynamical systems. However, learning dynamical systems is an “extremely difficult task" [6]. Although machine learning in general, and deep learning in particular has achieved illustrious results in many domains [7], their performance in learning dynamical systems may remain limited as many dynamical systems are non-linear, chaotic and associated with noisy and partial observations in practice.

In this paper, we propose a novel methodology, which combine the advances of state-of-the-art machine learning—neural networks—and classical data assimilation schemes for the problem of learning dynamical systems. The advantages of the proposed methodology are demonstrated by experiments on two chaotic Lorenz–63 system [1].

The paper is organized as follows. In Section 2, we formulate the problem of learning non-linear dynamical systems. The detail of the proposed methodology and its two schmemes are presented in Section 3. Section 4 present the experiments and results. Finally, we reviews the related work then close with a discussion in Section ??.

2 Data-driven identiﬁcation of non-linear dynamical systems

Following the pioneering work of Lorenz [1, 8], the identification of data-driven representations of dynamical systems amounts to determine a data-driven computational representation to approximate a dynamical model typically stated as the solution of an ODE (Ordinary Differential Equation):

where F is the unknown dynamical model of a multi-dimensional state and is zero-mean additive noise due to neglected physics and/or numerical approximations. State-of-the-art data-driven schemes typically exploits observation data

where H is the observation operator, usually known, is a zero-mean noise process and to the time sampling, typically a regular high-frequency sampling such that with respect to a time resolution and reference starting time . We introduce a masking operator to account for observation may not be available at each time step if the component of is missing). Noise processes and/or are generally assumed to be zero-mean Gaussian noise processes.

While data assimilation methods [9, 10] address linear or linearizable systems only, recent advances in machine learning has proved that complex and highly non-linear systems can also be learned from ideal observations [6, 11, 12], that is to say noise-free observations available at all high-frequency time steps. However, when significant noise patterns are presented and/or the observation is partial, these methods are likely to fail, as the minimization of the short-term prediction error based on the propagation of noisy inputs through operator F cannot be guaranteed to lead to the true dynamical operator.

Following a classic state-space formulation, the identification of operator F from a series of observations amounts to a joint identification of the hidden states and of operator F. The combination of assimilation methods for the identification issue and machine learning frameworks (especially, neural network architectures) for the parameterization of operator F naturally appears as an appealing solution. As detailed in the next Section, inspired by Bayesian formulation and associated Expectation-Maximization algorithm, we propose and evaluate two alternative schemes.

3 Proposed Expectation-Maximization-like Approach

In this Section, we introduce the proposed schemes for the joint of the hidden states and operator F, using neural-network-based representation for the latter. Formally, we consider a discrete state-space formulation. It amounts to reformulating Eqs. 1 and 2 as follows:

Where results from an integration of operator F from state . and represent the uncertainty of the model and the observation, respectively. For the sake of simplicity, is arbitrarily set to 1 without any loss of generality. Within this Bayesian setting, Eqs. 3 and 4 relate respectively to the dynamical prior and the observation likelihood Under Gaussian assumption for noise processes, and and are Gaussian distributions.

The Expectation-Maximization (EM) algorithm [13] is an iterative procedure to estimate model parameters, here the considered parameterization for operator F as well as noise process parameters, that maximize the likelihood of observed data y. Let us denote by the set of all model parameters. Especially, we assume that operator F lies within a finite-dimensional space of operators X. We describe in Section 3.1 and Section 3.2 the considered parameterization using neural network architectures. Formally, the EM procedure comes to maximize likelihood

For any arbitrary distribution q, this function can be decomposed into:

where:

The EM algorithm alternately maximizes the Evidence Lower Bound (ELBO) with respect to q (the E-step) and (the M-step) in each iteration. The E-step comes to determine the posterior knowing and the M-step amounts to maximizing the log-likelihood of the observation conditionally to posterior q. Compared to the direct joint minimization of ELBO w.r.t. q and [14, 15, 16], EM procedures usually lead to stable solutions [17] and faster convergence [18]. As stated in [17], they are also particularly appealing to deal with noise and missing data/irregularly sampling.

For the simplest case, analytic expressions can be derived both for the Expectation and Maximization steps, as for instance for Gaussian mixture models [13]. When dealing with non-linear dynamical systems as addressed here, neither the E-step nor the M-step can be solved analytically. Then, approximations or numerical solutions have to be considered. Restricting the problem to a family of distributions that can be factored over t, we can rewrite Eq. 7 as:

where:

Regarding specifically the E-step, i.e. the approximation of true posterior given model parameters , we consider two alternative solutions. On the one hand, stochastic filtering approaches naturally apply, especially ensemble Kalman and particle filtering schemes [19, 20]. We here focus on Ensemble Kalman smoothers (EnKS), which are among the state-of-the-art approaches in data assimilation [21]. On the other hand, variational Bayesian approximation [13] exploits an explicit parameterization of posterior q. This variational setting has been recently investigated using neural-network-based parameterization and proved computational efficient for inference of latent state and dynamics [14, 15]. We further detail these two alternative approaches, referred to respectively as EnKS-EM and VODEN (Variational ODE Networks), in Section 3.1 and Section 3.2.

Regarding the M-step, it relates to the learning of a parametric representation of dynamical operator F as considered in previous works [6, 22] using inferred hidden states series rather than observation data, and may implement gradient-based schemes using neural network frameworks. Many “pure" machine learning methods [14, 15, 16, 23, 24, 25] use here a Monte Carlo approximation of the gradient of L for training. However, we empirically observe that when the dynamical system is chaotic, this approach turns out to be unstable. Recall that under Gaussian hypotheses, mean and variance respectively. We can decompose Maximizing means simultaneously maximizing overand minimizing . Since we are more interested in F, we can ignore the first term of Eq. 10. Because q is fixed and is Gaussian, suppose known, we now have to maximize:

Where the posterior q is the current approximation of the true posterior denotes the Euclidean norm. , parameterized by , is the forecast of the hidden state . Again we still have the expectation with respect to q (computed in the E-step). One simple solution is to use Maximum A Posteriori (MAP) instead of Maximum Likelihood (ML) in the E-step. In other words, we restrict the family of distribution q to Dirac distributions. 15 then becomes:

With

We refind here the short-term prediction error as the objective that widely used in machine learning [11, 12, 26, 27].

Although the hypothesis is not true for most of systems, and the lower-bound com- puted using MAP is looser than the bound computed using ML, we empirically observe that these approximations stabilize and significantly accelerate the training process.

3.1 Ensemble Kalman Smoother–Expectation Maximization (EnKS-EM)

For non-linear dynamical systems, the Ensemble Kalman Smoother (EnKS) is one of the most popular and efficient data assimilation schemes. The EnKS was first introduced by Evensen [19] and has rapidly become popular thanks to its simplicity, both in terms of conceptual formulation and implementation. It is an improvement of the Ensemble Kalman Filter [28]. Similar to EnKF, the backbone of the EnKS is a so-called Markov Chain Monte Carlo (MCMC) method to solve Fokker–Planck equation (also know as the the Kolmogorov forward equation) [29] which represents the evolution of . Briefly speaking, the EnKS uses an ensemble of model states (“cloud points") to represent . Any moment of this distribution can be calculated easily by integrating the ensemble of states forward in time. For a series of observed data points , the posterior can be obtained by applying the Kalman update formulas, the covariance of the hidden states being estimated by the sample covariance computed from the ensemble members. For the details of the formulation, the implementation as well as the application of the EnKS, we let the reader refer to [19] and [30].

The Ensemble Kalman Smoother–Expectation Maximization (EnKS-EM) exploits the EnKS approximation for the true posterior within the proposed EM framework. Overall, the considered algorithm is presented in Alg. 1.

Figure 1: VODEN architecture.

3.2 Variational Ordinary Differential Equations Networks (VODEN)

The second EM scheme for learning non-linear dynamical models is the Variational Ordinary Differential Equations Networks (VODEN), where the approximate posterior is parametrized by neural networks. This approach is inspired by the applausive successes of the neural network-based implementation of the variational methods [16, 24, 25]. The inference network, whose backbone is a LSTM, takes as input and provides at the output. Usually, an encoder and an decoder are also added to improve the capacity of q, as shown in Fig. 1.

The parameters of the whole inference network are denoted as . Using the same simplifications in the M-step, in the E-step, we calibrate to minimize the following loss

with is the output of the inference network.

The first term of is analogous to the innovation (the measurement of pre-fit residual) in data assimilation schemes, while the second term ensures that if f is the true state-transition function, the minimization of amounts to retrieving that best approximates the posterior is a multiplication factor. The details of the training procedure is presented in Alg. 2. It may be noted that we set a maximum number of gradient steps for both the E and M step and do not let the gradient descent converge within each step. This was proven numerically more efficient in our experiments.

4 Experiments and results

We tested the proposed methodology on two classic systems for learning non-linear dynamics: the Lorenz–63 system [1]. We examined the learning under significant noise level with partial and

irregular sampling of the observations. The performance of the proposed methodology is compared with state-of-the-art methods, namely Analog forecasting (AF) [31], a Sparse regression model (SR) [6] and a bilinear residual Neural Network architecture (BiNN) [32].

4.1 Lorenz–63 chaotic system

The Lorenz–63 dynamical system is a 3-dimensional model governed by the following ODE:

When and , this system has a chaotic behavior, with the Lorenz attractor shown in Fig. 2.

Figure 2: Lorenz attractor of the Lorenz–63 system When

We simulate Lorenz-63 state sequences using the LOSDA (Livermore Solver for Ordinary Differential Equations) ODE solver [33] with an integration step of 0.01. We then add Gaussian noise with several variance levels and evaluate the learned models given the noisy training data. This means, H is now an identity operator, is zero-mean Gaussian white noise.

For this task, we used the following settings: for

AF, SR and BiNN, we used the setting mentioned

in the original paper of each method; for EnKS-EM

and VODEN: the integration scheme is the neural

network implementation of the Runge-Kutta 4 integration scheme as in [22]; F is parametrized by a bi-linear residual network with the same setting as in [22]. We may note that this parametrization embeds the true parametrization of the Lorenz-63 model. For EnKS-EM, we chose an ensemble of 50 members. For VODEN, the approximate posterior was modeled by a 2-layer bi-directional LSTM with a size of 9. Both the encoder and the decoder were modelled by a fully connected network, with one hidden layer of size 7. We used RMSprop as the optimizer, with a learning rate of was set to 0.1.

4.2 Identification with noisy observations

We first assess the identification performance with noisy observations only, which means that masking operator is the identity at all time steps. We evaluated both short-term prediction performance and the capacity of maintaining the very-long-term topology through the first Lyapunov exponent

Figure 3: LSTM acts as a denoiser.

calculated in a forecasting sequence of length 10 lorenz-time (10000 time steps). The first Lyapunov exponent of the considered Lorenz-63 system is 0.91 [3].

As shown in Table 1 both the EnKS-EM and the VODEN outperform existing methods. The EnKSEM gives the best forecasting when the noise level is small, however, when the noise level increases, the forecasting gradually becomes worse. This is because the increase of noise level leads to the increase of uncertainty (error covariance), we may need a bigger ensemble to maintain the same performance level. On the other hand, the VODEN performs extremely well on very noisy data. We believe that this relates to the ability of LSTM-based models to capture longer-term time patterns in the data.

It should be noted that the BiNN is the EnKS-EM/VODEN without the inference schemes. When the training data are clean, many data-driven methods can successfully identify the underlying dynamics. For example, the BiNN and the VODEN have similar result when . However, when the data is very noisy, all the three model without inference schemes (AF, SR, BiNN) fail. We also did an additional experiment by adding a preprocessing step, in which a Hanning window of size 20 was applied to reduce the effect of noise, before applying SR. This setting is referred as SR_Hann in Table 1. The Hanning denoiser does significantly improve the performance of SR. Fig. 3 shows the sequences smoothed by the Hanning window and the LSTM. However, overall, our proposed methods are more robust to noise. For example, at a noise level of 64, the EnKS-EM gives a better prediction error than the SR_Hann with a factor of 2, the VODEN performs even better, with a factor of 5. Topology-wise, we can see that the attractors generated by the EnKS-EM and the VODEN are visually better.

Table 1: Short-term forecasting error and very-long-term forecasting topology of data-driven models learned on noisy Lorenz-63 data.

Figure 4: An example of sequences forecast by a VODEN learned on Lorenz-63 data,

We plot in Fig. 4 the sequences forecast by the a VODEN learned on Lorenz-63 data, . The Lorenz-63 system has a positive Lyapunov exponent, any small difference between the true and learned models grows exponentially. However, the "patterns" of the sequences, or the topology, remains intact. As discussed in [6] [12], for dynamical models identification, the most important criterion is the ability to maintain this topology in very-long forecasting. Fig. 5 show the attractor of the sequences generated by the models in Table 1.

1.0

1.5

2.0

Figure 5: Attactors generated by models trained on noisy data.

4.3 Partial observation

As mentioned above, one importance advantage of EM-like procedures for learning dynamical model is the ability to deal with partial observations. In this and the next sections, we show that the models trained by the proposed methodology on partial observations can obtain comparable performance with models in Section 4.2.

The first case study is when the observation frequency is small. We downsampled the data used in Section 4.2 8 times (partial and regular observation). That means

In the second case study we also reduced the number of observations 8 times, moreover, the observation is also irregular, both in time and in space.

While EnKS-EM can naturally deal with partial observations, VODEN needs an input at each time step. We use linear interpolation to interpolate where the observations is missing, as shown in Fig. 6.

We kept the same settings that used in Section 4.2. The results are shown in Table 2, Fig. 7 and Fig. 8. For all cases, the performance of both models was decreased. This again highlights the importance of the sampling frequency for learning dynamical systems. As expected, the EnKS-EM works better on irregular data, because EnKS naturally provides a straightforward tool to deal with irregularity.

5 Related work and Discussion

Learning dynamical systems is a topic that has been studied for decades. Before the era of deep learning, almost all methods used EM-like iterative learning procedures [9, 17, 34]. These models use one method in the family of Kalman-based data assimilation schemes in the E-step to estimate . The maximization in the M step is also performed analytically. However, to do so, these methods can use only simple distributions and processes whose analytic form is known. Therefore, the systems that can be learned are very restricted.

Recently, the influence of deep learning [7] has spread to every domain, including model identification. [11] used DenseNet, [35] used LSTM, [27] used ResNet to identify the nonlinear dynamical systems

Figure 6: Noisy and partial observation, noise level = 8.0. For VODEN, we use linear interpolation to interpolate the observation.

Figure 7: Sequences forecast by a VODEN learned on partial and irregular Lorenz-63 data (scenario 2),

Table 2: Short-term forecasting error and very-long-term forecasting topology of data-driven models learned on noisy and partial Lorenz-63 data.

Figure 8: Attractors generated by models trained on partially observed data scenario 1 and 2.

by minimizing the short-term prediction error. They try to exploit the power of neural networks to overcome the difficulties of modelling the nonlinearities. The problems of these methods appear when the observations are partial, irregular or when a high level of noise is present. Neural networks, in general, do not have an efficient way to deal with irregularity. When the observations are highly damaged by noise, using the short-term prediction error as the objective function would very likely to make the network overfit the data. These methods can be considered as a specific case of our methodology, when collapse to

Two special methods that do not fall into the two classes above is the Analog Forecasting (AF) [31] and the Sparse Regression (SR) [6]. The analog forecasting is a non-parametric model that "learns by heart" the dynamics in the catalog. For each new observation , AF looks up its catalog and find the most similar points. It then predicts the next observation by averaging the evolution of these points in the catalog. Since AF is a k-NN based method, it does not work well in high-dimensional spaces. The sparse regression finds the analytic form of the dynamics by performing a regression on a basis formed by many possible functions of each component of the state. This method works extremely well when the observations are complete and clean. When the observation is noisy, partial or irregular, SF fails.

Our proposed EM-like methodology unifies the classical data assimilation schemes and the recent advances in neural networks for the problem of learning non-linear dynamical systems. Different communities may value our contributions for different aspects. For the data assimilation community, we introduce neural networks as a means to go beyond the limit of using analytic functions/processes, such as the nonlinearity. For deep learning practitioners, we has shown that the classical EM procedure might be a way to handle noise and irregularity.

This framework may be applied to improve the modeling capacity in numerous fields. For example, satellite data are noisy (interfered by unknown factors in the atmosphere) and irregular (constrained by the revisit time of the satellite). We demonstrated here two models that follow this methodology—the EnKS-EM and the VODEN—on the Lorenz system. In general, the methodology is expected to improve the performance of any existing dynamical learning model, by using an appropriate inference scheme.

A number of open problems remain for future work. The two simplifications could be relaxed for a better objective function. We used simple linear interpolation for the partial cases, but a more sophisticated interpolation clearly might be applied. How to learn a dynamical system whose latent states have some components that have never been observed is still a challenge.

6 Acknowledgments

This work was supported by GERONIMO project (ANR-13-JS03-0002), Labex Cominlabs (grant SEACS), Region Bretagne, CNES (grant OSTST-MANATEE), Microsoft (AI EU Ocean awards) and by MESR, FEDER, Région Bretagne, Conseil Général du Finistère, Brest Métropole and Institut Mines Télécom in the framework of the VIGISAT program managed by "Groupement Bretagne Télédétection" (BreTel).

References

[1] Edward N. Lorenz, “Deterministic Nonperiodic Flow,” Journal of the Atmospheric Sciences, vol. 20, no. 2, pp. 130–141, Mar. 1963.

[2] Robert C Hilborn, Chaos and nonlinear dynamics: an introduction for scientists and engineers, Oxford University Press on Demand, 2000.

[3] Julien Clinton Sprott and Julien C Sprott, Chaos and time-series analysis, vol. 69, Citeseer, 2003.

[4] Morris W Hirsch, Stephen Smale, and Robert L Devaney, Differential equations, dynamical systems, and an introduction to chaos, Academic press, 2012.

[5] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and prospects,” Science, vol. 349, no. 6245, pp. 255–260, July 2015.

[6] Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz, “Discovering governing equations from data: Sparse identification of nonlinear dynamical systems,” Proceedings of the National Academy of Sciences, vol. 113, no. 15, pp. 3932–3937, Apr. 2016, arXiv: 1509.03580.

[7] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.

[8] Edward N Lorenz, “Predictability: A problem partly solved,” 1996, vol. 1.

[9] Zoubin Ghahramani and Geoffrey E. Hinton, “Parameter estimation for linear dynamical systems,” Tech. Rep., Technical Report CRG-TR-96-2, University of Totronto, Dept. of Computer Science, 1996.

[10] Zoubin Ghahramani, “Learning dynamic Bayesian networks,” in Adaptive Processing of Sequences and Data Structures: International Summer School on Neural Networks “E.R. Caianiello” Vietri sul Mare, Salerno, Italy September 6–13, 1997 Tutorial Lectures, C. Lee Giles and Marco Gori, Eds., Lecture Notes in Computer Science, pp. 168–197. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.

[11] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis, “Multistep Neural Networks for Data-driven Discovery of Nonlinear Dynamical Systems,” arXiv:1801.01236 [nlin, physics:physics, stat], Jan. 2018, arXiv: 1801.01236.

[12] Jaideep Pathak, Zhixin Lu, Brian R. Hunt, Michelle Girvan, and Edward Ott, “Using Machine Learning to Replicate Chaotic Attractors and Calculate Lyapunov Exponents from Data,” Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 27, no. 12, pp. 121102, Dec. 2017, arXiv: 1710.07313.

[13] Christopher M Bishop, Pattern recognition and machine learning, Springer, 2006.

[14] Alex Graves, “Practical Variational Inference for Neural Networks,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds., pp. 2348–2356. Curran Associates, Inc., 2011.

[15] Andriy Mnih and Karol Gregor, “Neural Variational Inference and Learning in Belief Networks,” in International Conference on Machine Learning, Jan. 2014, pp. 1791–1799.

[16] Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh, “Filtering Variational Objectives,” in Advances in Neural Information Processing Systems, May 2017, pp. 6576–6586.

[17] Zoubin Ghahramani and Sam T. Roweis, “Learning Nonlinear Dynamical Systems Using an EM Algorithm,” in Advances in Neural Information Processing Systems 11, M. J. Kearns, S. A. Solla, and D. A. Cohn, Eds., pp. 431–437. MIT Press, 1999.

[18] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, “The "wake-sleep" algorithm for unsupervised neural networks,” Science, vol. 268, no. 5214, pp. 1158–1161, May 1995.

[19] Geir Evensen and Peter Jan van Leeuwen, “An Ensemble Kalman Smoother for Nonlinear Dynamics,” Monthly Weather Review, vol. 128, no. 6, pp. 1852–1867, June 2000.

[20] Arnaud Doucet and Adam M Johansen, “A tutorial on particle filtering and smoothing: Fifteen years later,” Handbook of nonlinear filtering, vol. 12, no. 656-704, pp. 3, 2009.

[21] Geir Evensen, Data Assimilation: The Ensemble Kalman Filter, Springer Science & Business Media, Aug. 2009, Google-Books-ID: 2_zaTb_O1AkC.

[22] Ronan Fablet, Said Ouala, and Cedric Herzet, “Bilinear residual Neural Network for the identification and forecasting of dynamical systems,” arXiv:1712.07003 [physics], Dec. 2017, arXiv: 1712.07003.

[23] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin, “Variational Autoencoder for Deep Learning of Images, Labels and Captions,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. 2016, pp. 2352–2360, Curran Associates, Inc.

[24] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio, “A Recurrent Latent Variable Model for Sequential Data,” in Advances in neural information processing systems, June 2015, pp. 2980–2988.

[25] Marco Fraccaro, Sø ren Kaae Sø nderby, Ulrich Paquet, and Ole Winther, “Sequential Neural Models with Stochastic Layers,” in Advances in Neural Information Processing Systems. 2016, pp. 2199–2207, Curran Associates, Inc.

[26] Maziar Raissi, “Forward-Backward Stochastic Neural Networks: Deep Learning of Highdimensional Partial Differential Equations,” arXiv:1804.07010 [cs, math, stat], Apr. 2018, arXiv: 1804.07010.

[27] Tong Qin, Kailiang Wu, and Dongbin Xiu, “Data Driven Governing Equations Approximation Using Deep Neural Networks,” arXiv:1811.05537 [cs, math, stat], Nov. 2018, arXiv: 1811.05537.

[28] Geir Evensen, “Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics,” Journal of Geophysical Research: Oceans, vol. 99, no. C5, pp. 10143–10162, 1994.

[29] A Kolmogoroff, “On analytical methods in the theory of probability,” Mathematische Annalen, vol. 104, pp. 415–458, 1931.

[30] Geir Evensen, “The Ensemble Kalman Filter: theoretical formulation and practical implementation,” Ocean Dynamics, vol. 53, no. 4, pp. 343–367, Nov. 2003.

[31] Redouane Lguensat, Pierre Tandeo, Pierre Ailliot, Manuel Pulido, and Ronan Fablet, “The Analog Data Assimilation,” Monthly Weather Review, vol. 145, no. 10, pp. 4093–4107, Aug. 2017.

[32] R. Fablet, M. Lopez-Radcenco, J. Verron, B. Mourre, B. Chapron, and A. Pascual, “Learning multi-tracer convolutional models for the reconstruction of high-resolution SSH fields,” in 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), July 2017, pp. 406–409.

[33] Alan C. Hindmarsh, “ODEPACK, a systematized collection of ODE solvers,” IMACS Transactions on Scientific Computation, vol. 1, pp. 55–64, 1983.

[34] Henning U. Voss, Jens Timmer, and Jürgen Kurths, “Nonlinear dynamical system identification from uncertain and indirect measurements,” International Journal of Bifurcation and Chaos, vol. 14, no. 06, pp. 1905–1933, June 2004.

[35] Kyongmin Yeo and Igor Melnyk, “Deep learning algorithm for data-driven simulation of noisy dynamical system,” Journal of Computational Physics, vol. 376, pp. 1212–1231, Jan. 2019.

designed for accessibility and to further open science