Learning Nonlinear State Space Models with Hamiltonian Sequential Monte Carlo Sampler

2019·arXiv

Abstract

1 Introduction

System identification (Ljung, 1998, 2010) is a fundamental ingredient of many problems, such as model-predictive control (Camacho and Alba, 2013) and model-based reinforcement learning (Deisenroth and Rasmussen, 2011; Berkenkamp et al., 2017), which is to learn system dynamics from practical data. State-space model (SSM) (Billings, 2013) is the most popular method to represent the system with input and output as functions of a latent Markovian state . Specifically, linear and non-linear Gaussian state space models (GSSM) are most widely used in practical applications from robotic planning to neural signal processing. However, despite significant effort in research community over past decades, efficient learning method for non-linear GSSM is still lacking.

The sequential Monte Carlo (SMC) (Gordon et al., 1993) is a classical method to infer latent state in SSM, which uses weighted particles to approximate the intractable posterior of the latent states. The proposal distribution, from which particles are sampled, has significant influence on the approximation performance. In order to perform model learning and proposal adaptation at the same time, recent work (Maddison et al., 2017; Le et al., 2017; Naesseth et al., 2017) combines variational auto-encoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) with importance weighted auto-encoder (IWAE) (Burda et al., 2015), and uses SMC as the estimator for marginal observation likelihood. However, since in nonlinear GSSM the emission and transition frameworks are realized by deep neural networks, the real posterior of latent states is intractable to sample, and the proposal distributions in previous work are always assumed to be Gaussian, different from the true posterior. Although the approximated log-likelihood is unbiased asymptotically (Maddison et al., 2017), it is not close to the real log-likelihood without large number of particles.

In order to improve the inference performance of SMC with low sampling and model complexity, here we propose an SMC sampler augmented by Hamiltonian dynamics (HSMC). Different from previous SMC methods, we don’t need proposals here to generate particles of latent states. Using transition function as initial distribution, we use HMC to sample particles to approximate the posterior. And since latent space is time-dependent in nonlinear SSM, we modified the Riemannian Manifold HMC (RMHMC) (Girolami and Calderhead, 2011), whose mass matrix is generated by an MLP with latent states as inputs. The state dynamics in SSM is usually realized by Gaussian Processes (GP) and neural network. GP is data-efficient and has much less parameters than neural network, but its expressive capability is not as good as neural network, especially in high-dimensional problems. Here we show that the proposed method can improve the learning performance of GP to be comparable as neural network.

2 Preliminary

Denote and as input, latent and output variables at time t respectively.

2.1 Gaussian State Space Model

Gaussian state space model (GSSM) is the most popular method to model the dynamics of complex sequential data in the latent space (Raiko and Tornio, 2009). The inference and learning of GSSM are both considered in this work. The model is defined as

where is the output distribution parametrized by function f, and latent variables distribute as multivariate Gaussian conditioned on previous latent variables and input variables. The GSSM defined above includes both linear and nonlinear GSSM. When functions f and are linear, the model can be learned by extended Kalman filter (Wan and Nelson, 1997) and expectation maximization (Ghahramani and Roweis, 1999). However, in most practical problems, the dynamic and emission functions are nonlinear. In this paper we propose an efficient method to deal with nonlinearties.

2.2 Variational Sequential Monte Carlo

SMC performs particle-based approximate inference on a sequence of target distributions, which is the extension of importance sampling to sequential data. In the context of SSM, the target distributions are the posterior of latent variables, i.e., . The generative model includes initial distribution of latent variables , transition distribution , and emission distribution . The proposal distribution, defined as , is the approximate inference on target distributions. Then the generative and inference models are factorized as

Recently, a new ELBO objective has been introduced (Le et al., 2017; Maddison et al., 2017; Naesseth et al., 2017), which is asymptotically unbiased estimator on log-likelihood

where K is the number of particles and is the weight of particle k at time t. Each particle is defined by weight and value . At time t = 0 each particle value is sampled from initial latent distribution . In this paper, the transition (emission) function is denoted as ). By resampling from the previous particle set , the weight for every particle at each time is defined as below

showing that product of particle weights is an unbiased estimator to the marginal likelihood of observations.

2.3 Hamiltonian Monte Carlo

In learning transition and emission of SSM, it is key to generate samples from posterior distribution of latent states given observation data. However, the real posterior is intractable in nonlinear SSM. In this work, different previous methods (Le et al., 2017; Maddison et al., 2017; Naesseth et al., 2017), in each time step we don’t learn proposals but use HMC (Neal et al., 2011) to directly sample latent states from the posterior, which can reduce the model complexity significantly. At each time step t, we denote the joint log likelihood of the observations and latent states as

summing log probability of emission and transition function together. Then we add to it a term involving "momenta" variables , to obtain the Hamiltonian energy function,

This quantity can be interpreted as in physical terminology as the sum of the potential energy and the kinetic energy , where M acts as the canonical mass matrix. The joint distribu- tion of and is then defined as .

Denote the time derivatives with the dot notation, i.e., , where is the refined time in [t, t + 1). The Hamiltonian equations of motion governing the dynamics of this system can be written as

Obviously these equations are time-reversible, and the dynamics conserve the total energy. These continuous-time equations can be discretized to give "leapfrog" algorithms which are used for Monte Carlo simulations along with Metropolis-Hastings correction steps (Neal et al., 2011).

3 Model Learning Method

In this section, the proposed Hamiltonian Sequential Monte Carlo (HSMC) is thoroughly described. We first extend Riemann manifold HMC to recurrent setting, and then formulate the learning objective function based on variational inference. Finally, based on HSMC, we will introduce a new meta-learning method for nonstationary state space model.

3.1 Riemann Manifold HMC

The sampling quality of HMC is heavily influenced by the choice of mass matrix (Girolami and Calderhead, 2011). In order to automatically select mass matrix across different time steps, we adopt Riemann manifold HMC (Girolami and Calderhead, 2011) to incorporate local geometric properties of latent states. Here, at time step t, we parametrize the mass matrix as a function of latent states, i.e., , and it is a positive definite metric tensor defining the Riemann manifold on which we are sampling. Defining the kinetic energy in terms of the mass matrix, we can get the Hamiltonian energy function as below

where is the dimension of latent state. The desired marginal density of can be obtained by integrating out the momenta . The equations of motion at every time step t are as below

where the last equation is denoted as a function of latent state. To discretize this system of equations, we use the generalized leapfrog algorithm, where a first order symplectic integrator is composed with its adjoint; the resultant second order integrator can be shown to be both time-reversible and symplectic (Leimkuhler and Reich, 2004). The Riemann manifold HMC operation is formulated as a function shown as below. In implementation, the mass matrix is realized by a low-rank

matrix, i.e., , where are MLPs with current latent states as inputs, and their parameters are denoted as . Then we denote mass matrix as .

3.2 Hamiltonian SMC

Here we describe the proposed method, Hamiltonian Sequential Monte Carlo (HSMC). At each time step, the particles of latent states are directly sampled from the transition function conditioned on

previous particles , current input variables and previous observations , which plays the role of proposal in previous work (Le et al., 2017; Maddison et al., 2017; Naesseth et al., 2017). The momenta variables are introduced necessarily, sampled from Gaussian distribution . And each particle consists of a tuple of latent state and momenta variable. Then we use the Riemann Manifold Hamiltonian Monte Carlo (RMHMC) to transform the sampled particles to follow the posterior of latent states given current observations. For each particle k, denote and as initial particle and transformed particle with S-step RMHMC. Due to the measure preserving property of HMC (Neal et al., 2011), the initial and transformed particles have the same density value, i.e., , even though they follow different distributions. Then the weight for k-th particle can be defined as

where the second equation comes from the measure preserving property of HMC. Therefore, the proposed algorithm is summarized as below.

3.3 Objective Function

In this work, we choose the objective function to be the ELBO defined as SMC marginal likelihood estimator in (2). However, we can show that incorporation of HMC can make the ELBO objective arbitrarily tight. Defining , the ELBO objective can be formulated as

where is formulated as

where are obtained as (11) and is the distribution of particle and momenta variables after Hamiltonian dynamics.

We adopt stochastic gradient descent (Hoffman et al., 2013) to learn optimal model and mass matrix parameters . The gradient of objective function can be derived as below, where observed and control data are omitted here,

In order to reduce the gradient variance, we ignore the first term in the squared bracket above.

3.4 Theoretical Analysis

Based on the property of importance sampling (Murphy, 2012), we can easily show that at each time step t the weight expression (10) is an unbiased estimator of marginal likelihood likelihood of observations conditioned on and . In this section, we show that as step number S increasing with particle number K fixed, our learning objective (12) can be arbitrarily close or converge to the marginal log likelihood of observations. As (Le et al., 2017; Maddison et al., 2017) we can assume the state space model has independent structure for t = 2, . . . , T . It is a reasonable assumption since in online learning and many practical applications, the future observations can not be known in advance. Define conditional marginal likelihood of observations for each time t and particle k as

Then, if we can show the convergence of ELBO at each time t and particle k, the convergence of ELBO objective across all time in (12) can be shown.

According to the Hamilton energy defined in (6), the HMC adopted in algorithm 2 is ergodic with invariant distribution as

which is the posterior of joint probabilities of emission, transition and momenta distributions, or alternatively,

Based on properties of HMC (Neal et al., 2011), the particle distribution will tend to invariant distribution in total variation, with the increase of step number,i.e., for all t and k,

Then based on the weight definition (10), we have, for each t and k,

where the second equality is due to (14) and (15). Due to the concavity of log function, the ELBO objective (12) at each time t can be lower bounded as

Combining (16) and (17) yields that at each time t the ELBO objective converges to conditional log likelihood of observations. Due to the independent structure of latent state space, we can show that the overall objective (12) converges to the log likelihood of observations with the increase of step number S. The empirical study tells us that the performance is good enough when S is just around 10.

4 Improving Gaussian Process State Space Model

Gaussian Process State Space Models (GP-SSM) are a popular class of stochastic SSMs (Frigola et al., 2013, 2014; Eleftheriadis et al., 2017; Doerr et al., 2018; Ialongo et al., 2018). In SSM, at each time step the system is taken to evolve as a Markov chain by the transition function, mapping a latent state to the next. By placing a Gaussian Process (GP) prior on the transition function, we can obtain the Gaussian process state-space model (GP-SSM), which is a fully Bayesian non-parametric treatment on modeling problem. It has many advantages: 1) better uncertainty estimates based on the posterior of the transition function; 2) avoiding overfitting with little amount of data; 3) handling large amount of data without model saturation.

4.1 Preliminary

Same as standard SSM, we model the sequence of observations by a corresponding sequence of latent states , and . Here the state transition is assumed to be governed by a nonparametric stochastic function following a GP prior. Specifically, we have

where Q is the variance matrix and every d-th latent dimension has its own GP function . The emission function is still parametric same as (1). In order to reduce the computation complexity in learning GP, we adopt the induced-inputs method (Snelson and Ghahramani, 2006) and variational sparse GP (Titsias, 2009) in model learning, which achieves success in many practical problems. For every GP function , we first introduce P inducing GP targets at inducing GP inputs , which are jointly Gaussian with the transition function . Then for every latent dimension, the true GP predictive distribution can be approximated by using the set inducing inputs and outputs as below, with notation d omitted,

where the covariance matrix with entries . Following (1) we show the joint distribution of GP-SSM for completeness,

where and . Here diagand are assumed to be Gaussian.

4.2 Motivation

As far as we know, all previous work on GP-SSM (Frigola et al., 2013, 2014; Frigola-Alcade, 2015; Eleftheriadis et al., 2017; Doerr et al., 2018; Ialongo et al., 2018) assume the emission function to be a linear mapping between latent state and the mean of observation . However, in many practical applications the emission involves nonlinear function of (Gultekin and Paisley, 2017). In this case the posterior of latent states is non-Gaussian and possibly multi-modal. So, the Gaussian distribution of transition function cannot approximate the real posterior well enough.

In previous work (Frigola et al., 2013, 2014; Ialongo et al., 2018) the variational distributions need to optimize parameter matrices and , which increases model and learning complexity. Although authors in (Doerr et al., 2018) directly use transition prior as variational approximation for latent posterior without extra parameters, it cannot exploit information contained in the current observations other than by adapting q(u), which cannot handle cases with high observation noise and long sequence length (Ialongo et al., 2018).

4.3 Variational Inference for GP-SSM with HSMC

Here we use HSMC to solve problems mentioned above. We first design the variational distribution. Following the structure of real latent posterior, the latent states and transition function are not factorized, i.e., q(X, f) = q(X|f)q(f). By introducing induced outputs z, the variational distribution can be defined as

where , and and are assumed to be Gaussian, i.e., where and . We can further simplify the variational inference by integrating out induced-outputs z (Hensman et al., 2013). Then following (19) the variational transition function can be derived as

In this work, we use K-particle HSMC to approximate the intractable posterior. Denote where is the re-sampling index. At each time step t, we first sample mul- tiple particles of latent states and momenta from the transition function and Gaussian respectively. Then by using S-step RMHMC operation we transform the sampled particles to approximate the posterior as

Due to the measure preserving and reversibility property of HMC, we define the particle weights as

According to the joint and variational distribution in (19)(20), the ELBO can be computed as

where is defined in (18). In experimental study, we show that without increasing model complexity, HSMC can improve learning performance on GP-SSM with nonlinear emission.

5 Experiments

In this section, we conduct experiments to show the benefits of proposed algorithms. First, we use synthetic data to verify the advantages of HSMC over conventional variational SMC, where both transition and emission are realized by neural networks. Then based on real-world dataset, we show that HSMC can better learn GP-SSM when emission function is nonlinear.

5.1 Synthetic Data

The synthetic data is generated by a nonlinear state-space model

where and is realized by two-layer neural network with 20 hidden neurons. The non-linearities in are ReLU and Sigmoid. And the noise follow independent Gaussian with variance of 0.2. Every sequence has length of 100. Both training and testing datasets only contain observations .

In this experiment, the base models are variational recurrent neural network (VRNN) (Chung et al., 2015) and stochastic recurrent neural network (SRNN) (Fraccaro et al., 2016), where the emission is realized by deep neural network and transition is implemented by gated recurrent unit (GRU) (Chung et al., 2014). The comparison is between the base models (VRNN or SRNN) augmented by FIVO (Maddison et al., 2017) and HSMC. And generation network in every compared model is the same. However, in base model with HSMC, the proposal neural network is omitted. The performance metric is the log-likelihood per step. Here K is the number of particles and S is the number of step in RMHMC. Performance of FIVO is not related with S.

Table 1: Performance Comparison in Synthetic Data

5.2 Bike-sharing Demand Data

In this experiment, we use bike-sharing record data in New York city (?), from July 2014 to July 2017. In every transaction record, there are trip duration, bike check out/in time, names of start and end stations, and customer information such as age and gender. We first aggregate transaction records to bike demands at each station, ignoring customer information. Then we further remove stations existing for less than two years. And stations with less than one bike used per hour are also deleted. Finally, there are only 269 stations left in the dataset.

Here we compare the orginal GP-SSM (Doerr et al., 2018) with multiple particles (GP-SSM-SMC) and that augmented by HSMC (GP-SSM-HSMC). Both models have same latent dimension and number of inducing points. The performance metric is log likelihood. The results are shown below. We also the results of VRNN-FIVO for comparison.

Table 2: Performance Comparison in Bike-sharing Demand Data

We find that HSMC can help diminish the performance gap between GP model and neural network model, even though GP has much less parameters than neural networks.

References

Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. (2017). Safe model-based reinforce- ment learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918.

Billings, S. A. (2013). Nonlinear system identification: NARMAX methods in the time, frequency, and spatio-temporal domains. John Wiley & Sons.

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.

Camacho, E. F. and Alba, C. B. (2013). Model predictive control. Springer Science & Business Media.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. (2015). A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988.

Deisenroth, M. and Rasmussen, C. E. (2011). Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472.

Doerr, A., Daniel, C., Schiegg, M., Nguyen-Tuong, D., Schaal, S., Toussaint, M., and Trimpe, S. (2018). Probabilistic recurrent state-space models. arXiv preprint arXiv:1801.10395.

Eleftheriadis, S., Nicholson, T., Deisenroth, M., and Hensman, J. (2017). Identification of gaussian process state space models. In Advances in Neural Information Processing Systems, pages 5309– 5319.

Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. (2016). Sequential neural models with stochastic layers. In Advances in neural information processing systems, pages 2199–2207.

Frigola, R., Chen, Y., and Rasmussen, C. E. (2014). Variational gaussian process state-space models. In Advances in Neural Information Processing Systems, pages 3680–3688.

Frigola, R., Lindsten, F., Schön, T. B., and Rasmussen, C. E. (2013). Bayesian inference and learning in gaussian process state-space models with particle mcmc. In Advances in Neural Information Processing Systems, pages 3156–3164.

Frigola-Alcade, R. (2015). Bayesian time series learning with gaussian processes. Uni-versity of Cambridge.

Ghahramani, Z. and Roweis, S. T. (1999). Learning nonlinear dynamical systems using an em algorithm. In Advances in neural information processing systems, pages 431–437.

Girolami, M. and Calderhead, B. (2011). Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123– 214.

Gordon, N. J., Salmond, D. J., and Smith, A. F. (1993). Novel approach to nonlinear/non-gaussian bayesian state estimation. In IEE Proceedings F-radar and signal processing, volume 140, pages 107–113. IET.

Gultekin, S. and Paisley, J. (2017). Nonlinear kalman filtering with divergence minimization. IEEE Transactions on Signal Processing, 65(23):6319–6331.

Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. arXiv preprint arXiv:1309.6835.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347.

Ialongo, A. D., van der Wilk, M., Hensman, J., and Rasmussen, C. E. (2018). Non-factorised variational inference in dynamical systems. arXiv preprint arXiv:1812.06067.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

Le, T. A., Igl, M., Rainforth, T., Jin, T., and Wood, F. (2017). Auto-encoding sequential monte carlo. arXiv preprint arXiv:1705.10306.

Leimkuhler, B. and Reich, S. (2004). Simulating hamiltonian dynamics, volume 14. Cambridge university press.

Ljung, L. (1998). System identification. In Signal analysis and prediction, pages 163–173. Springer.

Ljung, L. (2010). Perspectives on system identification. Annual Reviews in Control, 34(1):1–12.

Maddison, C. J., Lawson, J., Tucker, G., Heess, N., Norouzi, M., Mnih, A., Doucet, A., and Teh, Y. (2017). Filtering variational objectives. In Advances in Neural Information Processing Systems, pages 6573–6583.

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. adaptive computation and machine learning.

Naesseth, C. A., Linderman, S. W., Ranganath, R., and Blei, D. M. (2017). Variational sequential monte carlo. arXiv preprint arXiv:1705.11140.

Neal, R. M. et al. (2011). Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2.

Raiko, T. and Tornio, M. (2009). Variational bayesian learning of nonlinear hidden state-space models for model predictive control. Neurocomputing, 72(16-18):3704–3712.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286.

Snelson, E. and Ghahramani, Z. (2006). Sparse gaussian processes using pseudo-inputs. In Advances in neural information processing systems, pages 1257–1264.

Titsias, M. (2009). Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pages 567–574.

Wan, E. A. and Nelson, A. T. (1997). Dual kalman filtering methods for nonlinear prediction, smoothing and estimation. In Advances in neural information processing systems, pages 793– 799.

Designed for Accessibility and to further Open Science