On generalized residue network for deep learning of unknown dynamical systems

2020·Arxiv

Abstract


ZHEN CHEN AND DONGBIN XIU∗

Abstract. We present a general numerical approach for learning unknown dynamical systems using deep neural networks (DNNs). Our method is built upon recent studies that identified residue network (ResNet) as an effective neural network structure. In this paper, we present a generalized ResNet framework and broadly define “residue” as the discrepancy between observation data and prediction made by another model, which can be an existing coarse model or reduced order model. In this case, the generalized ResNet serves as a model correction to the existing model and recovers the unresolved dynamics. When an existing coarse model is not available, we present numerical strategies for fast creation of coarse models, to be used in conjunction with the generalized ResNet. These coarse models are constructed using the same data set and thus do not require additional resources. The generalized ResNet is capable of learning the underlying unknown equations and producing predictions with accuracy higher than the standard ResNet structure. This is demonstrated via several numerical examples, including long-term prediction of a chaotic system.

Key words. Deep neural network, residual network, governing equation discovery, model correction

1. Introduction. There has been a surge of interest in developing algorithms to recover unknown governing equations via observation data. Most efforts focus on recovering unknown systems of ordinary differential equations, i.e., dynamical systems. Some (relatively) earlier efforts seek to construct certain optimal linear approximations for the unknown system. These include dynamic mode decomposition (DMD) ([39]) and its variants for nonlinear systems using Koopman theory, cf., [3, 16, 15]. More recent efforts cast the problem into an approximation problem, where the unknown governing equation is treated as a target function relating the data of the state variables to their temporal derivatives. Methods along this line of approach usually seek exact recovery of the equations by using certain sparse approximation techniques (e.g., [41]) from a large set of dictionaries; see, for example, [4]. Studies have been conducted to deal with noise in data [4, 37, 11], corruptions in data [42], limited data [38], partial differential equations [33, 36], etc. Variations of the approaches have been developed in conjunction with other methods such as model selection approach [19], Koopman theory [3], Gaussian process regression [28, 27], and expectation-maximization approach [20], to name a few. Methods using standard basis functions and without requiring exact recovery were also developed for dynamical systems [48] and Hamiltonian systems [47].

The use of modern machine learning techniques, particularly deep neural networks (DNNs), offers a new line of approaches for the task. DNN structures have been developed to recover ordinary differential equations (ODEs) [31, 25, 34] and partial differential equations (PDEs) [18, 29, 30, 26, 17, 40]. It was shown that residual network (ResNet) is particularly suitable for equation recovery, in the sense that it can be an exact integrator [25]. Neural networks have also been explored for other aspects of scientific computing, including reduced order modeling [8, 22], solution of conservation laws [32, 44], multiphase flow simulation [46], high-dimensional PDEs [6, 14], uncertainty quantification [5, 43, 49, 12], etc.

∗Department of Mathematics, The Ohio State University, Columbus, OH 43210, USA. chen.7168@osu.edu, xiu.16@osu.edu. Funding: This work was partially supported by AFOSR FA9550-18-1-0102.

The focus and contribution of this paper is on generalization of ResNet (gResNet) structure for recovering unknown dynamical systems. In the standard ResNet method for equation recovery, the residue is defined as the difference between the data inputs and data outputs, and a deep neural network is used to model the residue. (For more detail, see [25].) In this paper, we broaden the concept of residue and define it as the difference between the data outputs and the predictive outputs made by another model, which shall be referred to as “prior prediction” hereafter. We then use a standard feedforward neural network to model this generalized residue and to construct the final prediction, hereafter referred to as “posteriori prediction”, by taking the prior prediction into account. The gResNet structure can then be viewed as a model correction method for the prior predictive model. The prior model can be an existing coarse model, reduced order model, empirical model, etc. The deep neural network in the gResNet is used to construct a governing equation for the discrepancy between the prior model and the true model. We remark that model correction/calibration is an ongoing research topic, where several methods exist. See, for example, [2, 7, 9, 10, 13, 35, 45, 24] and the references therein. Our method here represents a new and drastically different approach, via the use of deep neural networks, to this line of study. It is also straightforward to see that the generalized ResNet (gResNet) includes the standard ResNet as a special case, in the sense that in the standard ResNet the prior model takes the trivial form of an identity operator and the prior prediction is the same as the data inputs.

In many practical situations, one may not possess a prior model and only has access to data. In this case, we propose a number of numerical strategies to create a prior model using the same data set. In particular, we discuss two approaches. The first one seeks to construct an affine approximation for the underlying system as the prior model. This is an extension of the well-known DMD (dynamic mode decomposition) method, which utilizes linear approximation ([39]). The use of affine approximation, which has a constant vector term in addition to the linear transformation, allows one to model possible non-homogeneous terms in the underlying dynamical system. This proves to be more flexible in practice. The other method seeks to construct the prior model as a nonlinear approximation using a single hidden layer NN. This is an extension of the affine approximation modeling, which is a linear procedure. Both approaches can produce prior models in an efficient manner and do not increase the overall computational cost of the entire modeling process. Note that the emphasis of the paper is on the gResNet framework, and the discussion of the two approaches for prior model construction is to provide some viable choices. In practice, one is free to choose any other method that suits the given problem to construct the prior model.

2. Setup and Preliminaries. Let us consider an autonomous dynamical system

$$\frac{dx}{dt} = f(x), \qquad x(t_0) = x_0, \tag{2.1}$$

where $x \in \mathbb{R}^n$ are the state variables. Note that the dimension n can be large, especially when the system is obtained via a spatial discretization of a PDE system. In this paper, we assume the form of the governing equations is unknown. What is available is measurement data of the solution states. Let $t_0 < t_1 < \cdots$ be a sequence of time instances. We use

$$x_k^{(i)} = x(t_k; x_0^{(i)}, t_0), \qquad k = 1, \dots, K^{(i)}, \quad i = 1, \dots, N_T, \tag{2.2}$$

to stand for the solution state at time instance $t_k$, originated from the i-th initial state $x_0^{(i)}$, for a total number of $N_T$ trajectories. Note that the data can also contain measurement noise, which is usually modeled as random variables. Our goal is then to create an accurate approximation model for the unknown governing equations by using the solution state data.

2.1. Flow Map and Data Pairing. While much of the existing work seeks to construct a system $d\tilde{x}/dt = \tilde{f}(\tilde{x})$ as an approximation to the unknown system (2.1), we here adopt a different approach developed in [25]. This approach does not directly approximate the right-hand-side of the system (2.1). Instead, it seeks to approximate the flow map of the underlying system.

Let $\Phi : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n$ be the flow map of system (2.1), whose solution follows

$$x(t; x_0, t_0) = \Phi_{t - t_0}(x_0). \tag{2.3}$$

Note that for autonomous systems the time variable t can be arbitrarily shifted and only the time difference, or time lag, $t - t_0$ is relevant. Also, $\Phi_0(x_0) = x_0$. The flow map completely determines the evolution of the solution from one state to another state at a different time. Recovery of the flow map allows one to conduct prediction of the system via recursive use of the flow map. Let $\Delta_k = t_{k+1} - t_k$, $k = 1, \dots$, be the time differences in the time instances. We then organize the solution state data in (2.2) into pairs, separated by time lag $\Delta_k$,

$$\{x_k^{(i)}, x_{k+1}^{(i)}\}, \qquad k = 1, \dots, K^{(i)} - 1, \quad i = 1, \dots, N_T. \tag{2.4}$$

For notational convenience, hereafter we assume $\Delta_k = \Delta$ is a constant for all k. We then denote the entire data set as

$$S = \{(x_j^{(1)}, x_j^{(2)})\}, \qquad j = 1, \dots, J, \tag{2.5}$$

where J is the total number of data pairs. For each j-th pair, $x_j^{(1)}$ is the “initial state”, $x_j^{(2)}$ is the “end state”, and the two states are separated by the time lag ∆. In the noiseless case, the two states are governed by the (unknown) flow map such that

$$x_j^{(2)} = \Phi_\Delta(x_j^{(1)}), \qquad j = 1, \dots, J. \tag{2.6}$$
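The pairing of trajectory data into (initial state, end state) pairs can be sketched as follows. This is a minimal illustration under the constant-lag assumption; the helper name `make_pairs` and the toy scalar system dx/dt = -x are our own choices for illustration and are not taken from [25].

```python
import numpy as np

def make_pairs(trajectories):
    """Turn trajectories sampled at a constant lag into (initial, end) state pairs.

    trajectories: list of arrays, each of shape (K_i + 1, n), rows ordered in time.
    Returns X1, X2 of shape (J, n); row j of X2 is row j of X1 advanced by one lag.
    """
    starts = [traj[:-1] for traj in trajectories]  # states at t_k
    ends = [traj[1:] for traj in trajectories]     # states at t_{k+1}
    return np.concatenate(starts, axis=0), np.concatenate(ends, axis=0)

# Two toy trajectories of the scalar system dx/dt = -x, sampled with lag 0.1.
delta = 0.1
t = delta * np.arange(5)
trajs = [x0 * np.exp(-t)[:, None] for x0 in (1.0, 2.0)]

X1, X2 = make_pairs(trajs)
print(X1.shape, X2.shape)  # J = 8 pairs in total, each state of dimension n = 1
```

Because the system is autonomous, pairs taken from different trajectories and different time instances can be pooled into one data set; only the lag between the two states of a pair matters.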

2.2. ResNet Modeling of Flow Map. In [25], a method was proposed to discover the unknown dynamical system (2.1) via numerically approximating its underlying flow map. This is accomplished by using the data pairs (2.5) to approximate the unknown flow map over discrete time step ∆ in (2.6). Once this ∆-flow map is constructed, it can be recursively applied to conduct solution prediction over a much longer time horizon.

Moreover, the work of [25] also proposed to utilize residue network (ResNet) to conduct deep learning of the ∆-flow map. While the standard feedforward deep neural networks (DNNs) utilize multiple hidden layers to approximate input-output maps, ResNet applies the identity operator on the input data and superimposes it on the neural network outputs. This effectively creates a DNN modeling for the “residue” of the input-output data.

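Let $N(\cdot\,; \Theta) : \mathbb{R}^n \to \mathbb{R}^n$ be the operator from a standard fully connected feedforward neural network, where $\Theta$ denotes its parameter set. Its corresponding ResNet then creates a mapping

$$y^{out} = [I + N](y^{in}), \tag{2.7}$$

where I is the identity operator. By applying this structure to the data set (2.5) and defining a loss function in the form of

$$\mathrm{Loss}(\Theta) = \frac{1}{J} \sum_{j=1}^{J} \left\| x_j^{(2)} - [I + N(\cdot\,; \Theta)](x_j^{(1)}) \right\|^2, \tag{2.8}$$

one can train a ResNet model

$$x^{out} = [I + N(\cdot\,; \Theta^*)](x^{in}), \tag{2.9}$$

where $\Theta^*$ is the network parameter set after successful training. Upon obtaining the trained network model, one can recursively apply the model to conduct system prediction in the following form

$$\tilde{x}_{k+1} = \tilde{x}_k + N(\tilde{x}_k; \Theta^*), \qquad k = 0, 1, \dots, \tag{2.10}$$

for a given initial condition $\tilde{x}_0$. Even though the form of (2.10) resembles the forward Euler time stepper, it was shown in [25] that this model is an exact time integrator, in the sense that there is no error associated with the time step ∆.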
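To illustrate the recursive prediction and the "exact integrator" remark, the following sketch uses a stand-in for a perfectly trained network. It assumes the scalar linear system dx/dt = a x, for which the exact residue over one lag is known in closed form; the function name `resnet_predict` and the parameter values are hypothetical, and no actual network training is performed here.

```python
import numpy as np

def resnet_predict(residue, x0, num_steps):
    """Apply x_{k+1} = x_k + N(x_k) recursively; returns shape (num_steps + 1, n)."""
    states = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        x = states[-1]
        states.append(x + residue(x))
    return np.stack(states)

# Assumed toy system dx/dt = a * x with lag delta.
a, delta = -0.5, 0.1

# Stand-in for an ideally trained N: the exact residue Phi_delta(x) - x.
exact_residue = lambda x: (np.exp(a * delta) - 1.0) * x
# Forward Euler residue, shown only for contrast.
euler_residue = lambda x: delta * a * x

pred = resnet_predict(exact_residue, x0=[2.0], num_steps=20)
euler = resnet_predict(euler_residue, x0=[2.0], num_steps=20)
truth = 2.0 * np.exp(a * delta * np.arange(21))[:, None]

print(np.abs(pred - truth).max())   # at round-off level
print(np.abs(euler - truth).max())  # visible time-stepping error
```

The update has the same algebraic form in both cases; the difference is that the ResNet residue is fitted to the flow map over the lag ∆ rather than to ∆ times the right-hand side, so no time discretization error is introduced by the step size itself.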

3. Generalized ResNet (gResNet) Modeling. In this section we present a generalization of the ResNet (gResNet) method developed in [25] for approximating unknown equations. We first present the general approach of the gResNet method and then discuss a few practical options.

3.1. General Approach of gResNet. Let $L : \mathbb{R}^n \to \mathbb{R}^n$ be an operator such that

$$L(x) \approx \Phi_\Delta(x). \tag{3.1}$$

The proposed gResNet mapping takes the following form

$$y^{out} = [L + N](y^{in}), \tag{3.2}$$

where $N = N(\cdot\,; \Theta)$ is the operator associated with a standard fully connected feedforward neural network. The training of the gResNet is similar to that of ResNet. With the data set (2.5) and loss function

$$\mathrm{Loss}(\Theta) = \frac{1}{J} \sum_{j=1}^{J} \left\| x_j^{(2)} - [L + N(\cdot\,; \Theta)](x_j^{(1)}) \right\|^2, \tag{3.3}$$

one can train for the optimized network parameter set $\Theta^*$ and obtain the corresponding trained network $N(\cdot\,; \Theta^*)$. Subsequently, we obtain the corresponding gResNet prediction model

$$\tilde{x}_{k+1} = L(\tilde{x}_k) + N(\tilde{x}_k; \Theta^*), \qquad k = 0, 1, \dots.$$

It is straightforward to see, upon comparing (3.2) with the standard ResNet mapping (2.7), that the original ResNet model is a special case of the gResNet when L = I, the identity operator. In fact, when the time lag ∆ is sufficiently small, the exact flow map satisfies

$$\Phi_\Delta(x) = x + O(\Delta).$$

This implies that the identity operator is a reasonable choice for L in terms of (3.1). This is the reason why the original ResNet provides a computational advantage, as discussed in [25]. When ∆ is not sufficiently small, the standard ResNet may not be advantageous.
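The role of the prior operator L can be sketched on a system whose flow map is known exactly. The example below assumes a planar rotation dx/dt = A x and compares two priors: the identity (standard ResNet) and a hypothetical coarse model given by one forward Euler step. The function `gresnet_predict` and the choice of priors are illustrative assumptions; the corrections are exact residues standing in for ideally trained networks.

```python
import numpy as np

def gresnet_predict(prior, correction, x0, num_steps):
    """Apply x_{k+1} = L(x_k) + N(x_k) recursively; returns shape (num_steps + 1, n)."""
    states = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        x = states[-1]
        states.append(prior(x) + correction(x))
    return np.stack(states)

delta = 0.5
A = np.array([[0.0, 1.0], [-1.0, 0.0]])             # assumed true system dx/dt = A x
flow = np.array([[np.cos(delta), np.sin(delta)],
                 [-np.sin(delta), np.cos(delta)]])   # exact flow map over one lag

identity_prior = lambda x: x                         # standard ResNet: L = I
euler_prior = lambda x: x + delta * (A @ x)          # hypothetical coarse prior model

# Exact residues that an ideally trained network N would need to represent.
residue_identity = flow - np.eye(2)
residue_euler = flow - (np.eye(2) + delta * A)

x0 = np.array([1.0, 0.0])
pred_resnet = gresnet_predict(identity_prior, lambda x: residue_identity @ x, x0, 10)
pred_gresnet = gresnet_predict(euler_prior, lambda x: residue_euler @ x, x0, 10)

size_identity = np.linalg.norm(residue_identity, 2)
size_euler = np.linalg.norm(residue_euler, 2)
print(size_identity, size_euler)  # the better prior leaves a smaller residue
```

Both models reproduce the same trajectory when the residue is represented exactly. The difference lies in the size of what the network must learn: with this lag the identity prior leaves a residue of order ∆, while the Euler prior leaves one of order ∆².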

3.2. Model Correction for Known L. The operator L (3.1) in the gResNet model (3.2) is critical. It should be available prior to the gResNet model construction. Hereafter we will loosely refer to the operator L as the “prior model”. In general, L stands for any available model for the underlying dynamical system. This can be a linear approximation, a reduced order model, a coarse grained model, etc. By construction (3.2), the fully connected hidden layers inside gResNet correspond to the operator N and are used to approximate a generalized residue in the following sense

$$N(x) \approx \Phi_\Delta(x) - L(x),$$

where $\Phi_\Delta$ is the flow map of the true model. If one assumes that the operator L is an approximation of the underlying dynamics via (3.1) such that

$$\left\| \Phi_\Delta(x) - L(x) \right\| \ll 1,$$

then it is natural to see that the neural network operator satisfies $\|N\| \ll 1$. Consequently, this provides a computational advantage for the gResNet.

When L is an existing model of the underlying dynamics, one can then view the network operator N in the gResNet as a “model correction” to L, as shown in (3.2). We will refer to the trained gResNet model (3.2) as the “posteriori model” hereafter. Examples of coarse, or reduced order, modeling are abundant in scientific computing. Here we give a very specific example, which is adopted from [21] and will be used in this paper as a numerical test.

The true system (3.5) depends on a real parameter $\varepsilon > 0$ and is chaotic. A reduced order model for it is the system (3.6), in which the fast variable y is averaged out. The reduced system (3.6) serves as a good approximation of the true system (3.5) when $\varepsilon \ll 1$. This is our prior model in this case. The operator L of this prior model does not have an explicit expression and needs to be computed via solving (3.6) numerically.

3.3. Affine Approximation for Unknown L. In many practical situations, one does not have an existing prior model and subsequently the operator L is not available. In this case, it is possible to construct a prior model and its associated operator L using the same data set. It is also desirable that such a construction be considerably faster than the neural network training of the gResNet model, in order not to increase the overall computational cost. Any efficient method to create an approximation model using the data set (2.5) can be adopted. Here we present a construction using affine approximation as a possible choice. This affine approximation is a modification of the dynamic mode decomposition (DMD) method ([39]), which has been used as a linear approximation model for a variety of problems.

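The idea of DMD is to construct a best-fit linear dynamical model to approximate the underlying system based on data. For the given data set (2.5), DMD seeks a linear flow map $A \in \mathbb{R}^{n \times n}$ such that

$$x_j^{(2)} \approx A x_j^{(1)}, \qquad j = 1, \dots, J. \tag{3.7}$$

With a sufficient number of data pairs, also known as snapshots, the matrix A can be solved in a least squares sense. For more detailed discussion of DMD, see [15].

The form of DMD (3.7) makes it effective for homogeneous systems. In order to cope with potential non-homogeneity of the underlying dynamics, we employ the following modification

$$x_j^{(2)} \approx A x_j^{(1)} + b, \qquad j = 1, \dots, J, \tag{3.8}$$

where $b \in \mathbb{R}^n$. Hereafter, we will refer to this as the modified DMD (mDMD) method. This effectively creates an affine mapping as an approximation of the underlying flow map, i.e.,

$$\Phi_\Delta(x) \approx A x + b, \tag{3.9}$$

where the matrix A and vector b are solved via the following optimization problem

$$(A, b) = \operatorname*{argmin}_{\hat{A} \in \mathbb{R}^{n \times n},\ \hat{b} \in \mathbb{R}^n} \frac{1}{J} \sum_{j=1}^{J} \left\| x_j^{(2)} - \hat{A} x_j^{(1)} - \hat{b} \right\|^2. \tag{3.10}$$

To solve the optimization problem, we take the data set (2.5) and write

$$X = [x_1^{(1)}, \dots, x_J^{(1)}], \qquad Y = [x_1^{(2)}, \dots, x_J^{(2)}].$$

Let $\mathbf{1} := [1, \dots, 1]$ be a vector of size $1 \times J$ and

$$\tilde{X} = \begin{bmatrix} X \\ \mathbf{1} \end{bmatrix}, \qquad [A \;\; b] = Y \tilde{X}^{\dagger},$$

where $\dagger$ stands for matrix pseudo inverse. We now define the L operator in the gResNet (3.2) as the mDMD model, i.e.,

$$L(x) = A x + b. \tag{3.14}$$

After training the network operator N using the loss function (3.3) to obtain the trained parameter set $\Theta^*$, we obtain the mDMD based gResNet prediction model

$$\tilde{x}_{k+1} = A \tilde{x}_k + b + N(\tilde{x}_k; \Theta^*), \qquad k = 0, 1, \dots.$$

Let

$$e_j = x_j^{(2)} - A x_j^{(1)} - b, \qquad j = 1, \dots, J,$$

be the residue of the mDMD model construction from the minimization problem (3.10). It is then straightforward to see that during the training of the neural network operator N, its loss function (3.3) can be rewritten as

$$\mathrm{Loss}(\Theta) = \frac{1}{J} \sum_{j=1}^{J} \left\| e_j - N(x_j^{(1)}; \Theta) \right\|^2.$$

This further indicates that the neural network operator N is trained to learn the “residue” of the prior model, in this case the mDMD model.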
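The least-squares construction of the affine prior can be sketched with NumPy as follows. The function name `fit_mdmd`, the synthetic data, and the small nonlinear perturbation are assumptions made for illustration only; data pairs are stored column-wise, one pair per column.

```python
import numpy as np

def fit_mdmd(X1, X2):
    """Least-squares affine fit X2 ~ A X1 + b via the pseudo-inverse.

    X1, X2: arrays of shape (n, J), one data pair per column.
    Returns A of shape (n, n) and b of shape (n,).
    """
    n, J = X1.shape
    X_aug = np.vstack([X1, np.ones((1, J))])  # append a row of ones for the constant term
    Ab = X2 @ np.linalg.pinv(X_aug)           # shape (n, n + 1), equals [A b]
    return Ab[:, :n], Ab[:, n]

# Synthetic data: an assumed affine map plus a small nonlinear part.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [-0.2, 0.8]])
b_true = np.array([0.5, -0.3])
X1 = rng.uniform(0.0, 2.0, size=(2, 200))
nonlinear_part = 0.01 * np.sin(X1)
X2 = A_true @ X1 + b_true[:, None] + nonlinear_part

A, b = fit_mdmd(X1, X2)

# Residue of the prior model: these are the targets the network N is trained on.
residue = X2 - (A @ X1 + b[:, None])
print(np.linalg.norm(residue), np.linalg.norm(nonlinear_part))
```

Dropping the row of ones in `X_aug` gives the standard DMD fit without the constant vector b. Since the fit is a single pseudo-inverse of an (n + 1) × J matrix, its cost is typically small compared with network training.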

3.4. Adaptive Nonlinear Approximation for Unknown L. The mDMD (resp. DMD) in the previous section employs affine (resp. linear) approximation as the prior model L, and the model L remains fixed for all solution data (2.5). Upon examining the affine approximation (3.14), it is evident that its form L(x) = Ax + b resembles that of a feedforward neural network with a single hidden layer, where A resembles the weight matrix and b resembles the bias vector. Motivated by this observation, we propose to use a single-layer neural network as the prior model L. Let $M(\cdot\,; \Theta)$ be the operator associated with a feedforward neural network with a single hidden layer. We first train this network using the same data set (2.5) with the loss function

$$\frac{1}{J} \sum_{j=1}^{J} \left\| x_j^{(2)} - M(x_j^{(1)}; \Theta) \right\|^2.$$

This in turn gives us the trained prior model

$$L(x) = M(x; \hat{\Theta}),$$

where $\hat{\Theta}$ denotes the trained parameter set of the single hidden layer network. We then construct the gResNet network by using this operator L and solving (3.3). Again, it is straightforward to see that the loss function is equivalent to

$$\mathrm{Loss}(\Theta) = \frac{1}{J} \sum_{j=1}^{J} \left\| e_j - N(x_j^{(1)}; \Theta) \right\|^2,$$

where

$$e_j = x_j^{(2)} - L(x_j^{(1)}), \qquad j = 1, \dots, J,$$

is the residue of the training error of the prior model L.

Note that one is free to construct the prior model using a neural network with multiple hidden layers. However, since the gResNet already contains a network N with multiple layers, there is no compelling reason to introduce multiple layers in the prior model, especially since the construction of the prior model should be fast enough not to increase the overall cost of the model construction.

4. Numerical Examples. In this section we present numerical examples to demonstrate the efficiency of the proposed methods. For benchmarking purpose, in all examples the true dynamical models are known. We use the true models to generate synthetic data and then construct the corresponding gResNet approximation models. The gResNet models are then used for system predictions and compared against the solutions of the true systems.

To generate a data set in the form of (2.5), we conduct random sampling for the “initial condition” $x_j^{(1)}$ and use the true models to advance by the time lag ∆ to obtain the corresponding $x_j^{(2)}$. It has been established in [48] that random sampling is more effective for equation recovery work. Depending on the network structure, the total number of data pairs in (2.5), J, is usually kept at 5 to 10 times the number of parameters in the network structure. This is to ensure that the network training does not suffer from overfitting. Note there is no comprehensive theory regarding the sufficient number of data entries to ensure accurate network training. Therefore, we purposefully keep the data set sufficiently large so that we can focus on the fundamental properties of the network. All models are trained via the loss function (3.3) and by using the open-source TensorFlow library [1]. The training data sets are usually divided into mini-batches of size 10. All models are trained for 300 epochs with reshuffling after each epoch. All the weights are initialized randomly from Gaussian distributions and all the biases are initialized to be zeros. We use $\sigma(x) = \tanh(x)$ as the activation function in all the examples.
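The data generation procedure can be sketched as follows. This is a minimal illustration and not the code used for the experiments: the linear test system, the classical RK4 integrator with substeps, and the helper names `count_parameters`, `rk4_advance` and `sample_pairs` are all assumptions made for the example.

```python
import numpy as np

def count_parameters(n, hidden_layers, width):
    """Number of weights and biases in a fully connected network mapping R^n to R^n."""
    sizes = [n] + [width] * hidden_layers + [n]
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

def rk4_advance(f, x, delta, substeps=10):
    """Advance states x (shape (J, n)) by time delta using classical RK4 substeps."""
    h = delta / substeps
    for _ in range(substeps):
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        x = x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return x

def sample_pairs(f, low, high, num_pairs, delta, seed=0):
    """Sample initial states uniformly in the box [low, high] and advance them by delta."""
    rng = np.random.default_rng(seed)
    x_init = rng.uniform(low, high, size=(num_pairs, len(low)))
    return x_init, rk4_advance(f, x_init, delta)

# Hypothetical "true" model for illustration: linear system dx/dt = M x.
M = np.array([[-1.0, 0.5], [0.0, -2.0]])
f = lambda x: x @ M.T  # row-wise evaluation of M x

num_params = count_parameters(n=2, hidden_layers=3, width=30)
num_pairs = 5 * num_params  # lower end of the 5 to 10 times rule of thumb
X_init, X_end = sample_pairs(f, low=[0.0, 0.0], high=[2.0, 2.0],
                             num_pairs=num_pairs, delta=0.1)
print(num_params, X_init.shape, X_end.shape)
```

For a network with 3 hidden layers of 30 neurons and n = 2, the parameter count is 2012, so the rule of thumb gives roughly ten to twenty thousand data pairs.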

4.1. Linear ODEs. We first study two linear ODE systems, whose exact solutions are known. In both examples, our gResNet networks have 3 hidden layers, each with 30 neurons.

For Example 1, the computational domain is taken to be $D = [0, 2]^2$ and the time lag is ∆ = 0.1. No pre-existing prior model is involved. Instead, we adopt the approach discussed in Section 3.3 and construct a standard DMD as the prior model in the gResNet model. Note that the true dynamical system is homogeneous. Consequently, the standard DMD can be highly accurate.

After satisfactory training, we conduct system prediction for up to t = 2 for some arbitrarily chosen initial conditions. The phase plot and trajectories of the DMD based gResNet (denoted as “DMD-ResNet”) are shown in Fig. 4.1. We observe that the numerical predictions match the reference solutions extremely well. This is mostly due to the high accuracy of the DMD prior model. The neural network, which serves as a correction to the DMD prior model, has almost negligible impact in this case. This is evident from Fig. 4.2 (left), where we observe that the training loss for DMD-ResNet reaches an extremely small magnitude. In the right of Fig. 4.2, we plot the numerical errors in the trajectory prediction. We observe that DMD based gResNet incurs much smaller errors than the standard ResNet.

Fig. 4.1: Example 1. Left: phase plot; Right: Trajectory.

Fig. 4.2: Example 1. Left: Loss history during training; Right: Errors in trajectory prediction.

For Example 2, the computational domain is taken to be $D = [0, 2]^2$ and the time lag is ∆ = 0.1. We employ the standard ResNet, gResNet using the standard DMD as prior model (DMD-ResNet), and gResNet using the modified DMD as prior model (mDMD-ResNet). System predictions by the learned models are conducted for time up to t = 2. In Fig. 4.3, we show the trajectory plots and phase portrait produced by the mDMD-ResNet model. We observe very good agreement with the exact solution. The mDMD-ResNet is in fact the most accurate model of the three approaches. This can be seen from Fig. 4.4, where the numerical errors in the prediction by mDMD-ResNet are two orders of magnitude smaller than those by ResNet and DMD-ResNet. This demonstrates that the gResNet method can be advantageous when a proper prior model is available or can be constructed (in this case via mDMD). The standard DMD is not a very accurate prior model, as it cannot model the non-homogeneous term in the system. In this case, its performance is similar to the standard ResNet, which corresponds to using the identity operator as the prior model.

We then consider the case of noisy data by adding randomly generated small noise to the synthetic data. The results by mDMD-ResNet are shown in Fig. 4.5, with noise at 2% and 5% relative levels. Again, mDMD-ResNet produces accurate system predictions, whose discrepancy with the exact solution is higher at noise level 5% than at noise level 2%. This is expected.
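One possible way to perturb synthetic data at a given relative noise level is sketched below. The text does not specify the noise model, so the multiplicative uniform noise used here, as well as the function name `add_relative_noise`, is an assumption made purely for illustration.

```python
import numpy as np

def add_relative_noise(data, level, seed=0):
    """Perturb each entry by a relative amount drawn uniformly from [-level, level].

    This multiplicative uniform model is an assumption; other noise models
    (for example Gaussian) could be used in the same way.
    """
    rng = np.random.default_rng(seed)
    eta = rng.uniform(-level, level, size=data.shape)
    return data * (1.0 + eta)

clean = np.linspace(0.5, 2.0, 8).reshape(4, 2)  # placeholder for noiseless states
noisy_2 = add_relative_noise(clean, 0.02)       # 2% relative noise
noisy_5 = add_relative_noise(clean, 0.05)       # 5% relative noise
print(np.abs(noisy_2 / clean - 1.0).max(), np.abs(noisy_5 / clean - 1.0).max())
```

In this sketch the perturbation would be applied to both states of every data pair before the prior model and the network are fitted.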

Fig. 4.3: Example 2. Results by mDMD-ResNet. Left: Phase plot; Right: Trajectory.

Fig. 4.4: Example 2. Left: Loss history during training; Right: errors in trajectory prediction.

Fig. 4.5: Example 2 with noisy data. Phase plots of mDMD-ResNet with $x_0$ = (1.5, 0). Left: 2% noise; Right: 5% noise.

4.2. Nonlinear ODEs. We now consider four nonlinear examples: (1) a modification of Example 2 by adding a nonlinear term; (2) the well-known damped pendulum problem; (3) a nonlinear differential-algebraic equation (DAE) for an electric network model; and (4) the chaotic multiscale system (3.5) from Section 3.2. In the first and fourth examples, the neural networks have 3 hidden layers, each with 30 neurons. In the second and third examples, the networks have 2 hidden layers, each with 40 neurons.

Example 3 is a modification of Example 2 obtained by adding a nonlinear term. The computational domain is taken to be $D = [0, 3]^2$ and the time lag is ∆ = 0.1. Upon learning the system, predictions are conducted up to t = 2. The phase plot and trajectories produced by the mDMD-ResNet are plotted in Fig. 4.6. The training error history and numerical errors in the trajectory predictions are plotted in Fig. 4.7, along with those produced by ResNet and DMD-ResNet. Again, we observe that mDMD-ResNet produces far more accurate predictions than ResNet and DMD-ResNet. This is further demonstrated in Table 4.1. Note that the network norm from mDMD-ResNet is much smaller than those from ResNet and DMD-ResNet. This is the primary reason for the better performance of mDMD-ResNet.

Fig. 4.6: Example 3 with mDMD-ResNet. Left: phase plot with $x_0$ = (1.5, 0); Right: Trajectory prediction.

Fig. 4.7: Example 3. Left: loss history during training; Right: errors in trajectory prediction.

Table 4.1: Example 3. Key network properties for ResNet, DMD-ResNet and mDMD-ResNet.

4.2.2. Example 4: Damped pendulum. We now consider the damped pendulum problem

$$\frac{dx_1}{dt} = x_2, \qquad \frac{dx_2}{dt} = -\alpha x_2 - \beta \sin x_1,$$

where $\alpha = 0.2$ and $\beta = 8.91$. The computational domain is D = [], and the time lag ∆ = 0.1. Here we employ the adaptive nonlinear approximation method from Section 3.4 as the prior model. In Fig. 4.8, we present the trajectory prediction results of the adaptive gResNet model, starting from an arbitrarily chosen initial condition $x_0$ = (193876) and for time up to t = 20. We observe excellent agreement between the network model and the reference solution. In Fig. 4.9, we show the training loss history and trajectory errors in the prediction, along with those obtained by ResNet. It can be seen that, with errors in prediction one order of magnitude smaller, the adaptive gResNet produces much more accurate results than the standard ResNet.

Fig. 4.8: Example 4 with adaptive gResNet modeling. Left: phase portrait; Right: trajectory prediction.

Fig. 4.9: Example 4. Left: loss history during training; Right: errors in trajectory prediction.

Example 5 is a system of nonlinear differential-algebraic equations (DAE), a model for an electric network from [23],

where denotes the node voltage, and are branch currents. The physical parameters are specified as C = 10= 10= 11 and = 0.25. In our test, we define the computational domain of () as D = [2]2] and fix the time lag ∆t = 2 10. For system prediction, we choose an (arbitrary) initial condition = (0, 0.1) and produce result for up to t = 1 10(1,000 times of the size of ∆).

The solution trajectories produced by mDMD-ResNet are shown in Fig. 4.10 and Fig. 4.11. We observe high accuracy in the prediction, when compared with the reference solution. The training loss history and numerical errors in the system prediction are plotted in Fig. 4.12, along with those produced by the standard ResNet. It can be seen that the performance of ResNet and mDMD-ResNet is similar. This is because in this particular example the time lag ∆ is very small, which is dictated by the scaling of the physical problem. Consequently, the identity operator I can be considered a very good prior model and the standard ResNet performs well. A more detailed comparison of the ResNet and mDMD-ResNet models is presented in Table 4.2. It can be seen that the mDMD-ResNet still offers a slight advantage.

Fig. 4.10: Example 5. Solution prediction of () via mDMD-ResNet.

Fig. 4.11: Example 5. Solution prediction of () via mDMD-ResNet.

Fig. 4.12: Example 5. Left: loss history during training; Right: errors in trajectory predictions.

Table 4.2: Example 5: Key network properties for ResNet and mDMD-ResNet.

Example 6. We now consider the chaotic multiscale system (3.5) from Section 3.2. The prior model is the averaged system (3.6). This represents a case discussed in Section 3.2, where the prior model is available as an existing coarse model. The operator L associated with the prior model does not have an explicit form and needs to be computed via numerically solving the reduced system (3.6).

The prior model (3.6) is a good approximation of the true model (3.5) when the parameter $\varepsilon \ll 1$. Here we set $\varepsilon$ = 0.1, which is not exceedingly small. In this case, the approximation offered by the prior model (3.6) is relatively coarse. We fix the computational domain as D = [15] 10] 25] 140] and set the time lag as ∆ = 0.05. After satisfactory training, we utilize the trained gResNet model to conduct long-term prediction for time up to t = 100. The results from an arbitrarily chosen initial condition are shown in Fig. 4.13. For comparison, we also plot the prediction results obtained by the reduced system (3.6) (labeled as “Reduced”) and the standard ResNet model, along with the reference exact solution obtained by solving the true system (3.5) numerically. We first observe that the gResNet method offers significantly better results than the standard ResNet. This again confirms that it is highly advantageous to have a good prior model. In this case, the reduced system (3.6) in gResNet is obviously much better than the crude model of the identity operator in the standard ResNet. More careful examination of the results also reveals that the gResNet has better predictive accuracy than the reduced model, especially in terms of capturing the correct phase over longer time. To examine this more closely, we compute the power spectral density of the trajectories in Fig. 4.14, in order to examine the dominant frequencies in the solutions. It can be clearly seen that the gResNet offers significant improvement in accuracy over the reduced system.
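The frequency comparison can be sketched with a simple periodogram estimate of the power spectral density. The estimator actually used for Fig. 4.14 is not specified in the text, so the plain FFT-based periodogram below, the function name `periodogram`, and the synthetic two-tone test signal are assumptions for illustration. The sampling step matches the time lag ∆ = 0.05 and the horizon t = 100.

```python
import numpy as np

def periodogram(signal, dt):
    """Periodogram of a real, uniformly sampled signal at non-negative frequencies.

    The mean is removed first so that the zero-frequency bin does not dominate.
    """
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    spectrum = np.fft.rfft(signal - signal.mean())
    freqs = np.fft.rfftfreq(n, d=dt)
    psd = (dt / n) * np.abs(spectrum) ** 2
    return freqs, psd

# Synthetic stand-in for one trajectory component sampled at the time lag.
dt = 0.05
t = dt * np.arange(2000)  # covers t in [0, 100)
signal = np.sin(2.0 * np.pi * 1.5 * t) + 0.3 * np.sin(2.0 * np.pi * 4.0 * t)

freqs, psd = periodogram(signal, dt)
dominant = freqs[np.argmax(psd)]
print(dominant)  # location of the strongest spectral peak
```

Applying the same function to each predicted trajectory and to the reference solution allows the dominant frequencies of the different models to be compared on a common frequency grid, which is less sensitive to phase drift than a pointwise comparison of chaotic trajectories.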

Fig. 4.13: Example 6. Trajectory predictions by different models using initial condition (2.4350451, 3.416925, -2.16129375, 3.4650658). Note that the Reduced system does not contain variable y.

Fig. 4.14: Example 6. Power spectral density (PSD) of each trajectory obtained by different models.

5. Conclusion. We presented a generalized residue network (gResNet) framework for effective learning of unknown governing equations from observational data. The gResNet incorporates the standard residue network (ResNet) as a special case. In gResNet, “residue” is more broadly defined as the difference between the data and the prediction of a prior model, and a deep neural network is used to model the residue. Consequently, gResNet can be considered as a model correction to the prior model, which is usually a reduced/coarse model. In situations where prior models are not available, we propose a few choices for fast construction of prior models using the same data set and without incurring much computational cost. Various numerical examples were presented and demonstrated that gResNet is a viable tool for equation learning and offers better accuracy than the standard ResNet. It is especially useful as a model correction tool, to improve the predictive accuracy of an existing coarse model.

REFERENCES

sition: data-driven modeling of complex systems, SIAM, 2016.

[16] J. N. Kutz, S. L. Brunton, D. M. Luchtenburg, C. W. Rowley, and J. H. Tu, On dynamic mode decomposition: Theory and applications, Journal of Computational Dynamics, 1 (2014), pp. 391–421, https://doi.org/10.3934/jcd.2014.1.391.

[17] Z. Long, Y. Lu, and B. Dong, PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network, arXiv preprint arXiv:1812.04426, (2018).

[18] Z. Long, Y. Lu, X. Ma, and B. Dong, PDE-net: Learning PDEs from data, in Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause, eds., vol. 80 of Proceedings of Machine Learning Research, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018, PMLR, pp. 3208–3216.

[19] N. M. Mangan, J. N. Kutz, S. L. Brunton, and J. L. Proctor, Model selection for dynamical systems via sparse regression and information criteria, Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 473 (2017).

[20] D. Nguyen, S. Ouala, L. Drumetz, and R. Fablet, EM-like learning chaotic dynamics from noisy and partial observations, arXiv preprint arXiv:1903.10335, (2019).

[21] G. Pavliotis and A. Stuart, Multiscale methods: averaging and homogenization, Springer, 2008.

[22] S. Pawar, S. M. Rahman, H. Vaddireddy, O. San, A. Rasheed, and P. Vedula, A deep learning enabler for nonintrusive reduced order modeling of fluid flows, Physics of Fluids, 31 (2019), p. 085101, https://doi.org/10.1063/1.5113494.

[23] R. Pulch, Polynomial chaos for semiexplicit differential algebraic equations of index 1, Int. J. Uncertain. Quantif., 3 (2013).

[24] Z. Qian and C. Wu, Bayesian hierarchical modeling for integrating low-accuracy and high-accuracy experiments, Technometrics, 50 (2008), pp. 192–204.

[25] T. Qin, K. Wu, and D. Xiu, Data driven governing equations approximation using deep neural networks, J. Comput. Phys., 395 (2019), pp. 620–635.

[26] M. Raissi, Deep hidden physics models: Deep learning of nonlinear partial differential equations, Journal of Machine Learning Research, 19 (2018), pp. 1–24.

[27] M. Raissi and G. E. Karniadakis, Hidden physics models: Machine learning of nonlinear partial differential equations, Journal of Computational Physics, 357 (2018), pp. 125–141.

[28] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Machine learning of linear differential equations using Gaussian processes, J. Comput. Phys., 348 (2017), pp. 683–693.

[29] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics informed deep learning (Part I): Data-driven solutions of nonlinear partial differential equations, arXiv preprint arXiv:1711.10561, (2017).

[30] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics informed deep learning (Part II): Data-driven discovery of nonlinear partial differential equations, arXiv preprint arXiv:1711.10566, (2017).

[31] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Multistep neural networks for data-driven discovery of nonlinear dynamical systems, arXiv preprint arXiv:1801.01236, (2018).

[32] D. Ray and J. S. Hesthaven, An artificial neural network as a troubled-cell indicator, Journal of Computational Physics, 367 (2018), pp. 166–191, https://doi.org/10.1016/j.jcp.2018.04.029, http://www.sciencedirect.com/science/article/pii/S0021999118302547.

[33] S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz, Data-driven discovery of partial differential equations, Science Advances, 3 (2017), p. e1602614.

[34] S. H. Rudy, J. N. Kutz, and S. L. Brunton, Deep learning of dynamics and signal-noise decomposition with time-stepping constraints, J. Comput. Phys., 396 (2019), pp. 483–506.

[35] K. Sargsyan, H. Najm, and R. Ghanem, On the statistical calibration of physical models, Int. J. Chem. Kinetics, DOI 10.1002/kin.20906 (2015).

[36] H. Schaeffer, Learning partial differential equations via data discovery and sparse optimization, Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 473 (2017).

[37] H. Schaeffer and S. G. McCalla, Sparse model selection via integral terms, Phys. Rev. E, 96 (2017), p. 023302.

[38] H. Schaeffer, G. Tran, and R. Ward, Extracting sparse high-dimensional dynamics from limited data, SIAM Journal on Applied Mathematics, 78 (2018), pp. 3279–3295.

[39] P. Schmid, Dynamic mode decomposition of numerical and experimental data, J. Fluid Mech., 656 (2010), pp. 5–28.

[40] Y. Sun, L. Zhang, and H. Schaeffer, NeuPDE: Neural network based ordinary and partial differential equations for modeling time-dependent data, arXiv preprint arXiv:1908.03190, (2019).

[41] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological), (1996), pp. 267–288.

[42] G. Tran and R. Ward, Exact recovery of chaotic systems from highly corrupted data, Multiscale Model. Simul., 15 (2017), pp. 1108–1129.

[43] R. K. Tripathy and I. Bilionis, Deep UQ: Learning deep neural network surrogate models for high dimensional uncertainty quantification, Journal of Computational Physics, 375 (2018), pp. 565–588, https://doi.org/10.1016/j.jcp.2018.08.036.

[44] Q. Wang, J. S. Hesthaven, and D. Ray, Non-intrusive reduced order modeling of unsteady flows using artificial neural networks with application to a combustion problem, Journal of Computational Physics, 384 (2019), pp. 289–307, https://doi.org/10.1016/j.jcp.2019.01.031, http://www.sciencedirect.com/science/article/pii/S0021999119300828.

[45] S. Wang, W. Chen, and K. Tsui, Bayesian validation of computer models, Technometrics, 51 (2009), pp. 439–451.

[46] Y. Wang and G. Lin, Efficient deep learning techniques for multiphase flow simulation in heterogeneous porous media, 2019, https://arxiv.org/abs/1907.09571.

[47] K. Wu, T. Qin, and D. Xiu, Structure-preserving method for reconstructing unknown Hamiltonian systems from trajectory data, arXiv preprint arXiv:1905.10396, (2019).

[48] K. Wu and D. Xiu, Numerical aspects for approximating governing equations using data, J. Comput. Phys., 384 (2019), pp. 200–221.

[49] Y. Zhu and N. Zabaras, Bayesian deep convolutional encoder-decoder networks for surrogate modeling and uncertainty quantification, Journal of Computational Physics, 366 (2018), pp. 415–447, https://doi.org/10.1016/j.jcp.2018.04.018.
