28:["$","$L31",null,{"isWhiteLabelled":false,"children":["$","$Lc",null,{"pt":{"compact":0,"expanded":3},"children":[["$","$L32",null,{"noStar":true,"publisher":true,"task":true,"params":true,"size":"xl","product":{"id":"eyJwYXBlcklEIjoiMTYwNS4wNTYyMiIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","updated":"2017-04-13T03:29:26.000Z","paperID":"1605.05622","published":"2016-05-18T15:38:16.000Z","authors":"[\"Linda S. L. Tan\",\"David J. Nott\"]","title":"Gaussian variational approximation with sparse precision matrices","scoreTrending":null,"summary":"We consider the problem of learning a Gaussian variational approximation to\nthe posterior distribution for a high-dimensional parameter, where we impose\nsparsity in the precision matrix to reflect appropriate conditional\nindependence structure in the model. Incorporating sparsity in the precision\nmatrix allows the Gaussian variational distribution to be both flexible and\nparsimonious, and the sparsity is achieved through parameterization in terms of\nthe Cholesky factor. Efficient stochastic gradient methods which make\nappropriate use of gradient information for the target distribution are\ndeveloped for the optimization. We consider alternative estimators of the\nstochastic gradients which have lower variation and are more stable. Our\napproach is illustrated using generalized linear mixed models and state space\nmodels for time series.","lastCheckedForCode":"2022-09-03T05:47:52.187Z","links":[{"id":"eyJ1cmwiOiJodHRwczovL3BhcGVyc3dpdGhjb2RlLmNvbS9wYXBlci9nYXVzc2lhbi12YXJpYXRpb25hbC1hcHByb3hpbWF0aW9uLXdpdGgifQ==","type":"pwc","url":"https://paperswithcode.com/paper/gaussian-variational-approximation-with","data":null}],"reposConnection":{"edges":[]},"models":[],"tags":[{"id":"eyJuYW1lIjoidGltZSBzZXJpZXMiLCJ0eXBlIjoidGFzayJ9","name":"time series","description":"In time series forecasting, the input is a sequence of data points collected over time, and the output is a prediction of future data points. 
This method is commonly used in finance for stock price prediction, weather forecasting, and sales forecasting.","scoreTrending":null,"count":{"stars":15230,"papers":7906,"models":6563},"__typename":"Tag"}],"summaries":[],"emailsConnection":{"edges":[{"author":"linda s l tan","node":{"id":"eyJhZGRyZXNzIjoic3RhdHNsbEBudXMuZWR1LnNnIn0=","address":"statsll@nus.edu.sg","name":null,"avatar":null,"linkedin":null,"bio":null,"site":null,"override":null,"membership":[],"paper":[{"modelsAggregate":{"count":0}}],"github":[],"scholar":[{"thirdPartyID":"OcYrvh4AAAAJ"}],"twitter":[],"location":[],"owner":[{"id":"eyJ1aWQiOiI0NTMwZDFlMS05NTQxLTQ4ZDItYTA1Ni1kNjViODViNDhmNDYifQ==","name":"linda s l tan","github":[],"email":[],"authored":[{"id":"eyJwYXBlcklEIjoiMTYwNS4wNTYyMiIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","paperID":"1605.05622"}]}]}},{"author":null,"node":{"id":"eyJhZGRyZXNzIjoic3RhbmRqQG51cy5lZHUuc2cifQ==","address":"standj@nus.edu.sg","name":null,"avatar":null,"linkedin":null,"bio":null,"site":null,"override":null,"membership":[],"paper":[{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}}],"github":[],"scholar":[],"twitter":[],"location":[],"owner":[]}}]},"__typename":"paper","authorArray":["Linda S. L. Tan","David J. 
Nott"]}}],["$","$L25",null,{"container":true,"columns":100,"spacing":{"compact":0,"expanded":2,"large":3},"children":[["$","$L25",null,{"size":{"compact":100,"expanded":100,"large":68},"children":[["$","$8",null,{"children":["$","$L33",null,{"publisher":"arxiv","paperID":"1605.05622","product":{"paper":"$28:props:children:props:children:0:props:product","models":"$28:props:children:props:children:0:props:product:models"},"isWhiteLabelled":false}]}],["$","$8",null,{"children":["$","$L34",null,{"article":"$L35","model":"$undefined"}]}]]}],["$","$L25",null,{"size":"grow","children":["$","$L36",null,{}]}]]}],["$","$8",null,{"children":null}],[["$","audio",null,{"id":"tts"}],["$","$L37",null,{"paperID":"1605.05622","publisher":"arxiv","paperJSON":{"title":"Gaussian variational approximation with sparse precision matrices","paperID":"1605.05622","avgLineHeight":12.45,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"We consider the problem of learning a Gaussian variational approximation to the posterior distribution for a high-dimensional parameter, where we impose sparsity in the precision matrix to reflect appropriate conditional independence structure in the model. Incorporating sparsity in the precision matrix allows the Gaussian variational distribution to be both flexible and parsimonious, and the sparsity is achieved through parameterization in terms of the Cholesky factor. Efficient stochastic gradient methods which make appropriate use of gradient information for the target distribution are developed for the optimization. We consider alternative estimators of the stochastic gradients which have lower variation and are more stable. 
Our approach is illustrated using generalized linear mixed models and state space models for time series.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Keywords ","element":"span"},{"text":"Gaussian variational approximation ","element":"span"},{"style":{"height":4.8},"width":11,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/0-0.png","element":"img","alt":" ·","inline":true,"padRight":true},{"text":"stochastic gradient algorithms ","element":"span"},{"style":{"height":4.8},"width":11,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/0-1.png","element":"img","alt":" ·","inline":true,"padRight":true},{"text":"sparse precision matrix ","element":"span"},{"style":{"height":4.8},"width":11,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/0-2.png","element":"img","alt":" ·","inline":true,"padRight":true},{"text":"variational Bayes","element":"span"}],[{"text":"Linda S. L. Tan Department of Statistics and Applied Probability National University of Singapore 6 Science Drive 2, Singapore 117546 Tel.: +65-6516-4416 Fax: +65-6872-3919 E-mail: statsll@nus.edu.sg","element":"span"}],[{"text":"David J. Nott Department of Statistics and Applied Probability National University of Singapore 6 Science Drive 2, Singapore 117546 Operations Research and Analytics Cluster National University of Singapore 21 Lower Kent Ridge Road, Singapore 119077 Tel.: +65-6516-2744 Fax: +65-6872-3919 E-mail: standj@nus.edu.sg","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Bayesian inference provides a principled way of combining data with prior beliefs through the application of Bayes’ rule. The posterior distribution is, however, often intractable and simulation-based Markov Chain Monte Carlo (MCMC) methods have become a central tool in Bayesian computation. 
In recent years, variational methods ","element":"span"},{"href":"#id-0","referenceIndex":17,"text":"(Jordan et al., ","element":"a"},{"href":"#id-0","referenceIndex":17,"text":"1999) ","element":"a"},{"text":"have also emerged as an important alternative to MCMC, providing fast approximate inference for complex, high-dimensional models. Unlike MCMC, which can be made arbitrarily accurate, variational methods make certain simplifying assumptions about the posterior density (e.g. a tractable form ","element":"span"},{"style":{"height":16},"width":53.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/0-3.png","element":"img","alt":" q(θ","inline":true},{"text":") where ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/0-4.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"denotes the vector of variables) and seek to optimize the Kullback-Leibler divergence ","element":"span"},{"style":{"height":16},"width":164.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/0-5.png","element":"img","alt":" DKL(q||p","inline":true},{"text":") between ","element":"span"},{"style":{"height":16},"width":53.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/0-6.png","element":"img","alt":" q(θ","inline":true},{"text":") and the true posterior ","element":"span"},{"style":{"height":16},"width":86.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/0-7.png","element":"img","alt":"p(θ|y","inline":true},{"text":") subject to these assumed restrictions. 
While earlier research on variational methods concentrated on conjugate models with analytically tractable expectations under which the variational Bayes approach ","element":"span"},{"href":"#id-1","referenceIndex":2,"text":"(Attias, ","element":"a"},{"href":"#id-1","referenceIndex":2,"text":"1999) ","element":"a"},{"text":"yields efficient closed-form updates ","element":"span"},{"href":"#id-2","referenceIndex":47,"text":"(Winn ","element":"a"},{"href":"#id-2","referenceIndex":47,"text":"and Bishop, ","element":"a"},{"href":"#id-2","referenceIndex":47,"text":"2005)","element":"a"},{"text":", more recent work focuses on stochastic gradient approximation methods ","element":"span"},{"href":"#id-3","referenceIndex":35,"text":"(Robbins and Monro, ","element":"a"},{"href":"#id-3","referenceIndex":35,"text":"1951) ","element":"a"},{"text":"for non-conjugate models (e.g. ","element":"span"},{"href":"#id-4","referenceIndex":27,"text":"Paisley et al., ","element":"a"},{"href":"#id-4","referenceIndex":27,"text":"2012; ","element":"a"},{"href":"#id-5","referenceIndex":39,"text":"Salimans and Knowles, ","element":"a"},{"href":"#id-5","referenceIndex":39,"text":"2013)","element":"a"},{"text":". Further discussion of the literature is deferred to Section ","element":"span"},{"text":"2. 
","element":"span"},{"href":"#id-6","referenceIndex":36,"text":"Rohde ","element":"a"},{"href":"#id-6","referenceIndex":36,"text":"and Wand ","element":"a"},{"href":"#id-6","referenceIndex":36,"text":"(2015) ","element":"a"},{"text":"give a nice recent summary of alternatives to stochastic gradient approaches for handling non-conjugacy in the variational Bayes framework.","element":"span"}],[{"href":"#id-7","referenceIndex":44,"text":"Titsias and Lázaro-Gredilla ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"(2014) ","element":"a"},{"text":"propose a simple yet effective variational method known as “doubly stochastic variational inference”, where the approximating density is parameterized in terms of its mean ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/0-8.png","element":"img","alt":"µ","inline":true,"padRight":true},{"text":"and a lower triangular scale matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":". 
An efficient stochastic gradient algorithm is then developed for optimizing ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-0.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"by (1) parameterizing the vector of variables ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-1.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":14},"width":128.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-2.png","element":"img","alt":" Lz + µ","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"is a random variable that can be sampled easily from a base distribution that does not depend on the variational parameters (see also ","element":"span"},{"href":"#id-8","referenceIndex":19,"text":"Kingma and Welling, ","element":"a"},{"href":"#id-8","referenceIndex":19,"text":"2014; ","element":"a"},{"href":"#id-9","referenceIndex":34,"text":"Rezende et al., ","element":"a"},{"href":"#id-9","referenceIndex":34,"text":"2014) ","element":"a"},{"text":"and (2) sub-sampling from the data. The stochastic gradients constructed in this manner are “doubly stochastic” as they are built upon two sources of stochasticity that come from sampling from the variational distribution and from the full data set. This approach is very general in that it can be applied to any model where the joint density is differentiable. 
Unlike variational Bayes, it does not assume independence relationships among blocks of an appropriate partition of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-3.png","element":"img","alt":" θ","inline":true},{"text":". Such independence assumptions have been shown to result in underestimation of the posterior variance ","element":"span"},{"href":"#id-10","referenceIndex":46,"text":"(Wang and Titterington, ","element":"a"},{"href":"#id-10","referenceIndex":46,"text":"2005; ","element":"a"},{"href":"#id-11","referenceIndex":3,"text":"Bishop, ","element":"a"},{"href":"#id-11","referenceIndex":3,"text":"2006)","element":"a"},{"text":". The quality of the resulting approximation is thus limited only by how well the form of ","element":"span"},{"style":{"height":16},"width":53.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-4.png","element":"img","alt":" q(θ","inline":true},{"text":") matches the true posterior. Using this approach, ","element":"span"},{"href":"#id-12","referenceIndex":20,"text":"Kucukelbir et al. ","element":"a"},{"href":"#id-12","referenceIndex":20,"text":"(2016) ","element":"a"},{"text":"develop an automatic differentiation variational inference (ADVI) algorithm in Stan, where ","element":"span"},{"style":{"height":16},"width":53.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-5.png","element":"img","alt":" q(θ","inline":true},{"text":") is assumed to be either a diagonal (mean-field) or unrestricted Gaussian variational approximation. Constrained variables are transformed to the real line via Stan’s library of transformations and the gradients are computed using Monte Carlo integration. 
They note that while unrestricted ADVI is able to capture posterior correlations and hence produces more accurate marginal variance estimates than mean field ADVI, it can be prohibitively slow for large data since the number of variational parameters scales as the square of the length of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-6.png","element":"img","alt":" θ","inline":true},{"text":".","element":"span"}],[{"text":"In this article, we consider variational approximations which take the form of a multivariate Gaussian distribution ","element":"span"},{"style":{"height":16},"width":126.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-7.png","element":"img","alt":" N(µ, Σ","inline":true},{"text":") for models with high-dimensional parameters (","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-8.png","element":"img","alt":"µ","inline":true,"padRight":true},{"text":"denotes the mean and ","element":"span"},{"style":{"height":10.8},"width":33,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-9.png","element":"img","alt":" Σ","inline":true,"padRight":true},{"text":"the covariance matrix). 
However, instead of expressing ","element":"span"},{"style":{"height":10.8},"width":33,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-10.png","element":"img","alt":" Σ","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":13.38},"width":77.24,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-11.png","element":"img","alt":" LLT","inline":true,"padRight":true},{"text":"and optimizing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"(the Cholesky factor of ","element":"span"},{"style":{"height":10.8},"width":33,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-12.png","element":"img","alt":" Σ","inline":true},{"text":") as in ","element":"span"},{"href":"#id-7","referenceIndex":44,"text":"Titsias and Lázaro-Gredilla ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"(2014) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-12","referenceIndex":20,"text":"Kucukelbir et al. ","element":"a"},{"href":"#id-12","referenceIndex":20,"text":"(2016)","element":"a"},{"text":", we parameterize the optimization problem in terms of the Cholesky factor of the precision matrix. This parameterization is important as it provides an avenue to impose a sparsity structure in the precision matrix that reflects conditional independence relationships in the posterior. This sparsity structure reduces computational complexity greatly and enables fast inference for models with a large number of variables without having to assume independence relationships in the posterior. We demonstrate how our approach can be applied to generalized linear mixed models (GLMMs) and state space models (SSMs) for time series data. 
Assuming that the number of global variables is small compared to the number of local variables, our approach reduces the number of variational parameters to be updated in each iteration from ","element":"span"},{"style":{"height":17.39},"width":88.25,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/1-13.png","element":"img","alt":" O(n2","inline":true},{"text":") to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"), where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"denotes the number of subjects in GLMMs and the length of time series in SSMs. In this way, the accuracy of using an unrestricted lower triangular matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"can be achieved at the computational cost (same order of magnitude) of using a diagonal matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":".","element":"span"}],[{"text":"Recently, several classes of richer variational approximations which go beyond factorized (mean-field) approximating densities and which are able to reflect the posterior dependence structure to varying degrees have been proposed (e.g. ","element":"span"},{"href":"#id-13","referenceIndex":10,"text":"Gershman et al., ","element":"a"},{"href":"#id-13","referenceIndex":10,"text":"2012; ","element":"a"},{"href":"#id-5","referenceIndex":39,"text":"Salimans ","element":"a"},{"href":"#id-5","referenceIndex":39,"text":"and Knowles, ","element":"a"},{"href":"#id-5","referenceIndex":39,"text":"2013)","element":"a"},{"text":". 
","element":"span"},{"href":"#id-14","referenceIndex":33,"text":"Rezende and Mohamed ","element":"a"},{"href":"#id-14","referenceIndex":33,"text":"(2015) ","element":"a"},{"text":"propose the specification of the approximate posterior using normalizing flows. Starting with, say, simple factorized distributions, highly flexible and complex approximate posteriors are constructed by transforming the initial density through a sequence of invertible mappings which perform expansions or contractions of the probability mass in targeted regions. The resulting chain of transformed densities is known as a normalizing flow. The authors show that the true posterior can be recovered asymptotically under the Langevin flow, thus overcoming an important limitation of variational inference. ","element":"span"},{"href":"#id-15","referenceIndex":31,"text":"Ranganath et al. ","element":"a"},{"href":"#id-15","referenceIndex":31,"text":"(2016) ","element":"a"},{"text":"propose hierarchical variational models which are built by placing a prior distribution on the parameters of a mean-field variational approximation and then proceeding to integrate out the mean-field parameters. They specify the prior using normalizing flows and demonstrate that the hierarchical variational model achieves better performance in terms of perplexity and held-out likelihood for deep exponential families ","element":"span"},{"href":"#id-16","referenceIndex":30,"text":"(Ranganath et al., ","element":"a"},{"href":"#id-16","referenceIndex":30,"text":"2015)","element":"a"},{"text":". Structured stochastic variational inference ","element":"span"},{"href":"#id-17","referenceIndex":13,"text":"(Hoffman and Blei, ","element":"a"},{"href":"#id-17","referenceIndex":13,"text":"2015) ","element":"a"},{"text":"is a generalization of stochastic variational inference to allow dependencies between global and local variables. 
In the approximating density, independence is assumed only among elements in the global variables and among the local variables conditional on the global variables. Dependence between a local variable and the global variables is captured via a local parameter defined implicitly as the point at which the local evidence lower bound is maximized.","element":"span"}],[{"href":"#id-18","referenceIndex":1,"text":"Archer et al. ","element":"a"},{"href":"#id-18","referenceIndex":1,"text":"(2016) ","element":"a"},{"text":"develop “black-box” variational inference ","element":"span"},{"href":"#id-19","referenceIndex":29,"text":"(Ranganath et al., ","element":"a"},{"href":"#id-19","referenceIndex":29,"text":"2014) ","element":"a"},{"text":"for SSMs, where a Gaussian variational approximation is considered for the latent states. To capture the temporal correlation structure, the precision matrix is assumed to be a block tri-diagonal matrix. While related, our approach differs from ","element":"span"},{"href":"#id-18","referenceIndex":1,"text":"Archer et al. ","element":"a"},{"href":"#id-18","referenceIndex":1,"text":"(2016) ","element":"a"},{"text":"in several aspects. First, for the SSMs application, we consider a joint Gaussian variational approximation for the model parameters and latent states while ","element":"span"},{"href":"#id-18","referenceIndex":1,"text":"Archer et al. ","element":"a"},{"href":"#id-18","referenceIndex":1,"text":"(2016) ","element":"a"},{"text":"assume that the model parameters are known and consider a Gaussian approximate posterior for the latent states only. Second, we optimize the Cholesky factor of the precision matrix directly while ","element":"span"},{"href":"#id-18","referenceIndex":1,"text":"Archer et al. 
","element":"a"},{"href":"#id-18","referenceIndex":1,"text":"(2016) ","element":"a"},{"text":"consider other parameterizations such as defining the approximate posterior through a product of Gaussian factors and parameterizing the mean and blocks in the tri-diagonal inverse covariance using neural networks. Third, we consider a more general sparsity structure in the precision matrix, which reflects the conditional independence structures in the posterior distribution and is not limited to band matrices. We also consider an alternative estimator of the stochastic gradient which differs from the “black-box” approach of ","element":"span"},{"href":"#id-18","referenceIndex":1,"text":"Archer et al. ","element":"a"},{"href":"#id-18","referenceIndex":1,"text":"(2016) ","element":"a"},{"text":"as well as that used by ","element":"span"},{"href":"#id-7","referenceIndex":44,"text":"Titsias and ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"Lázaro-Gredilla ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"(2014) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-12","referenceIndex":20,"text":"Kucukelbir et al. ","element":"a"},{"href":"#id-12","referenceIndex":20,"text":"(2016)","element":"a"},{"text":". We demonstrate empirically that this estimator has lower variance at the mode and is helpful in improving the stability and precision of the proposed algorithm. This estimator is inspired by ","element":"span"},{"href":"#id-20","referenceIndex":11,"text":"Han et al. ","element":"a"},{"href":"#id-20","referenceIndex":11,"text":"(2016)","element":"a"},{"text":", who propose using Gaussian copulas to accommodate models whose posteriors are, for instance, skewed, heavy-tailed or multi-modal and hence unsuited to a Gaussian variational approximation. Our idea of introducing sparsity via the Cholesky factor of the precision matrix may prove useful in this context as well. 
The relationship between the Laplace and the Gaussian variational approximation is discussed in ","element":"span"},{"href":"#id-21","referenceIndex":24,"text":"Opper and Archambeau ","element":"a"},{"href":"#id-21","referenceIndex":24,"text":"(2009) ","element":"a"},{"text":"while ","element":"span"},{"href":"#id-22","referenceIndex":6,"text":"Challis and Barber ","element":"a"},{"href":"#id-22","referenceIndex":6,"text":"(2013) ","element":"a"},{"text":"consider some different parameterizations in terms of the Cholesky factor. We do not consider Laplace approximations ","element":"span"},{"href":"#id-23","referenceIndex":38,"text":"(Rue et al., ","element":"a"},{"href":"#id-23","referenceIndex":38,"text":"2009) ","element":"a"},{"text":"in this paper since an important advantage of stochastic gradient methods is that they are generally amenable to sub-sampling, although this is not always straightforward in complex latent variable models where the local parameters are dependent.","element":"span"}],[{"text":"In Section ","element":"span"},{"text":"2, ","element":"span"},{"text":"we review doubly stochastic variational inference, the approach of ","element":"span"},{"href":"#id-7","referenceIndex":44,"text":"Titsias and Lázaro-Gredilla ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"(2014)","element":"a"},{"text":". Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"describes how the optimization problem can be framed in terms of the precision matrix, develops the algorithm using alternative gradient estimators and discusses the importance of imposing sparsity structure in the precision matrix. 
The setting of the learning rate in the stochastic gradient algorithm is discussed in Section ","element":"span"},{"text":"4. ","element":"span"},{"text":"In Section ","element":"span"},{"text":"5, ","element":"span"},{"text":"we illustrate how our approach can be applied to GLMMs and state space models. 
The performance of our algorithm is investigated using several real data sets. We conclude with a discussion of our major results and findings in Section ","element":"span"},{"text":"6.","element":"span"}]]},{"heading":"2 Review on doubly stochastic variational inference","paragraphs":[[{"text":"In this section, we provide some general background on variational methods and give a brief review of doubly stochastic variational inference ","element":"span"},{"href":"#id-7","referenceIndex":44,"text":"(Titsias and Lázaro-Gredilla, ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"2014) ","element":"a"},{"text":"as we will be considering a modification of their approach.","element":"span"}],[{"text":"For a Bayesian inference problem, let ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-0.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"denote the vector of variables, ","element":"span"},{"style":{"height":16},"width":54.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-1.png","element":"img","alt":" p(θ","inline":true},{"text":") be the prior and ","element":"span"},{"style":{"height":16},"width":86.58,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-2.png","element":"img","alt":" p(y|θ","inline":true},{"text":") the likelihood for observed data ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":". In variational approximation (see, e.g. 
","element":"span"},{"href":"#id-11","referenceIndex":3,"text":"Bishop, ","element":"a"},{"href":"#id-11","referenceIndex":3,"text":"2006; ","element":"a"},{"href":"#id-24","referenceIndex":25,"text":"Ormerod and Wand, ","element":"a"},{"href":"#id-24","referenceIndex":25,"text":"2010)","element":"a"},{"text":", an attempt is made to approximate an intractable posterior distribution ","element":"span"},{"style":{"height":16},"width":315.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-3.png","element":"img","alt":" p(θ|y) ∝ p(θ)p(y|θ","inline":true},{"text":") using a member of some approximating family. Here we will consider a parametric family with typical element ","element":"span"},{"style":{"height":16},"width":73.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-4.png","element":"img","alt":" qλ(θ","inline":true},{"text":") where ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-5.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"denotes variational parameters to be chosen. 
Minimization of the Kullback-Leibler divergence between ","element":"span"},{"style":{"height":16},"width":73.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-6.png","element":"img","alt":" qλ(θ","inline":true},{"text":") and ","element":"span"},{"style":{"height":16},"width":86.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-7.png","element":"img","alt":" p(θ|y","inline":true},{"text":") with respect to ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-8.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"can be shown to be equivalent to maximizing a lower bound on the log marginal likelihood log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":") (where ","element":"span"},{"style":{"height":18},"width":380.9,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-9.png","element":"img","alt":" p(y) = ∫ p(θ)p(y|θ)dθ","inline":true},{"text":"), and taking the form","element":"span"}],[{"style":{"width":"58%"},"width":560,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-10.png","element":"img"}],[{"text":"In non-conjugate models, ","element":"span"},{"style":{"height":16},"width":65.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-11.png","element":"img","alt":" L(λ","inline":true},{"text":") will generally not have a closed form. 
There has been much recent research concerned with stochastic gradient methods ","element":"span"},{"href":"#id-3","referenceIndex":35,"text":"(Robbins and ","element":"a"},{"href":"#id-3","referenceIndex":35,"text":"Monro, ","element":"a"},{"href":"#id-3","referenceIndex":35,"text":"1951) ","element":"a"},{"text":"able to optimize ","element":"span"},{"style":{"height":16},"width":65.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-12.png","element":"img","alt":" L(λ","inline":true},{"text":") efficiently in this situation ","element":"span"},{"href":"#id-25","referenceIndex":16,"text":"(Ji et al., ","element":"a"},{"href":"#id-25","referenceIndex":16,"text":"2010; ","element":"a"},{"href":"#id-26","referenceIndex":23,"text":"Nott et al., ","element":"a"},{"href":"#id-26","referenceIndex":23,"text":"2012; ","element":"a"},{"href":"#id-4","referenceIndex":27,"text":"Paisley et al., ","element":"a"},{"href":"#id-4","referenceIndex":27,"text":"2012; ","element":"a"},{"href":"#id-8","referenceIndex":19,"text":"Kingma and Welling, ","element":"a"},{"href":"#id-8","referenceIndex":19,"text":"2014; ","element":"a"},{"href":"#id-27","referenceIndex":14,"text":"Hoffman et al., ","element":"a"},{"href":"#id-27","referenceIndex":14,"text":"2013; ","element":"a"},{"href":"#id-19","referenceIndex":29,"text":"Ranganath et al., ","element":"a"},{"href":"#id-19","referenceIndex":29,"text":"2014; ","element":"a"},{"href":"#id-9","referenceIndex":34,"text":"Rezende et al., ","element":"a"},{"href":"#id-9","referenceIndex":34,"text":"2014; ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"Titsias ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"and Lázaro-Gredilla, ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"2014, ","element":"a"},{"href":"#id-28","referenceIndex":45,"text":"2015)","element":"a"},{"text":".","element":"span"}],[{"text":"The method of 
","element":"span"},{"href":"#id-7","referenceIndex":44,"text":"Titsias and Lázaro-Gredilla ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"(2014) ","element":"a"},{"text":"(hereafter TL) is one state-of-the-art method which optimizes ","element":"span"},{"style":{"height":16},"width":65.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-13.png","element":"img","alt":" L(λ","inline":true},{"text":") using gradient information from the target distribution. Write ","element":"span"},{"style":{"height":16},"width":302.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-14.png","element":"img","alt":" h(θ) = p(θ)p(y|θ","inline":true},{"text":"). In the TL method, an approximating distribution of the form","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"θ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"µ, L","element":"span"},{"style":{"height":18.18},"width":158.8,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-15.png","element":"img","alt":") = |L|−1","inline":true},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"height":7.6},"width":40.91,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-16.png","element":"img","alt":"−1","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"θ ","element":"span"},{"style":{"height":4.4},"width":31,"height":11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-17.png","element":"img","alt":" −","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"µ","element":"span"},{"text":"))","element":"span"}],[{"text":"is assumed (so that 
","element":"span"},{"style":{"height":16},"width":160.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-18.png","element":"img","alt":" λ = (µ, L","inline":true},{"text":")) where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is a fixed density. Here ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-19.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"is a vector of parameters of dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"is the dimension of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-20.png","element":"img","alt":" θ","inline":true},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is a ","element":"span"},{"style":{"height":10.8},"width":81.53,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-21.png","element":"img","alt":" d×d","inline":true,"padRight":true},{"text":"lower triangular matrix with positive diagonal elements. 
If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is the density of a vector of independent standard normal random variables then ","element":"span"},{"style":{"height":16},"width":134.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-22.png","element":"img","alt":" q(θ|µ, L","inline":true},{"text":") is normal, ","element":"span"},{"style":{"height":17.38},"width":170.82,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-23.png","element":"img","alt":" N(µ, LLT","inline":true,"padRight":true},{"text":"), and the covariance matrix is being parameterized with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"as the Cholesky factor. We will only be considering the case of a multivariate normal approximation in this paper.","element":"span"}],[{"text":"The lower bound ","element":"span"},{"style":{"height":16},"width":250.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-24.png","element":"img","alt":" L(λ) = L(µ, L","inline":true},{"text":") is an expectation with respect to ","element":"span"},{"style":{"height":16},"width":134.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/2-25.png","element":"img","alt":" q(θ|µ, L","inline":true},{"text":"), but can be written as an expectation with respect to the density ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":". Writing the integral in this way (for the purpose of the stochastic gradient optimization) results in an approach which is able to effectively use gradient information from the target log posterior. 
More precisely, writing ","element":"span"},{"style":{"height":16.79},"width":73.51,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-0.png","element":"img","alt":" Eq(·","inline":true},{"text":") for the expectation with respect to ","element":"span"},{"style":{"height":16},"width":134.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-1.png","element":"img","alt":" q(θ|µ, L","inline":true},{"text":") and ","element":"span"},{"style":{"height":16.79},"width":76.58,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-2.png","element":"img","alt":" Ef(·","inline":true},{"text":") for the expectation with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", we have","element":"span"}],[{"style":{"height":16.79},"width":224.88,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-3.png","element":"img","alt":"L(µ, L) = Eq","inline":true,"padRight":true},{"text":"(log ","element":"span"},{"style":{"height":16},"width":113.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-4.png","element":"img","alt":" h(θ) −","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":134.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-5.png","element":"img","alt":" q(θ|µ, L","inline":true},{"text":"))","element":"span"}],[{"style":{"width":"84%"},"width":802,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.38},"width":263.27,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-7.png","element":"img","alt":" s = L−1(θ − µ","inline":true},{"text":") is distributed according to the density ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f 
","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"denotes a term not depending on ","element":"span"},{"style":{"height":14},"width":68.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-8.png","element":"img","alt":" µ, L","inline":true},{"text":". This approach of applying a transformation ","element":"span"},{"style":{"height":14},"width":185.87,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-9.png","element":"img","alt":" θ = µ+Ls","inline":true,"padRight":true},{"text":"so that the lower bound can be rewritten as an expectation with respect to a fixed density ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"that does not depend on the variational parameters is sometimes referred to as the “reparameterization trick” ","element":"span"},{"href":"#id-8","referenceIndex":19,"text":"(Kingma ","element":"a"},{"href":"#id-8","referenceIndex":19,"text":"and Welling, ","element":"a"},{"href":"#id-8","referenceIndex":19,"text":"2014; ","element":"a"},{"href":"#id-9","referenceIndex":34,"text":"Rezende et al., ","element":"a"},{"href":"#id-9","referenceIndex":34,"text":"2014; ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"Titsias and ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"L´azaro-Gredilla, ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"2014)","element":"a"},{"text":". 
The advantage of this approach is that efficient gradient estimators of ","element":"span"},{"style":{"height":16},"width":111.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-10.png","element":"img","alt":" L(µ, L","inline":true},{"text":") can now be constructed by sampling ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"instead of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-11.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"from ","element":"span"},{"style":{"height":16},"width":134.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-12.png","element":"img","alt":"q(θ|µ, L","inline":true},{"text":"), which has been found to result in estimators with very high variance (see, e.g. ","element":"span"},{"href":"#id-4","referenceIndex":27,"text":"Paisley et al., ","element":"a"},{"href":"#id-4","referenceIndex":27,"text":"2012)","element":"a"},{"text":".","element":"span"}],[{"text":"Next, we give expressions for the gradients of ","element":"span"},{"style":{"height":16},"width":111.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-13.png","element":"img","alt":"L(µ, L","inline":true},{"text":") with respect to ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-14.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":". To explain their derivation we need some notation first. 
For a scalar valued function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") of a vector valued argument ","element":"span"},{"style":{"height":16},"width":163.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-15.png","element":"img","alt":" x, ∇xg(x","inline":true},{"text":") denotes the gradient vector for the function written as a column vector. Also, for a scalar valued function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":") of a matrix ","element":"span"},{"style":{"height":16},"width":176.59,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-16.png","element":"img","alt":" A, ∇Ag(A","inline":true},{"text":") means vec","element":"span"},{"style":{"height":19.06},"width":252.52,"height":47.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-17.png","element":"img","alt":"−1(∇vec(A)g(A","inline":true},{"text":")) where, for a ","element":"span"},{"style":{"height":10.8},"width":102.58,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-18.png","element":"img","alt":" d × d","inline":true,"padRight":true},{"text":"square matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", vec(","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":") is the vector of length ","element":"span"},{"style":{"height":13.38},"width":36.74,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-19.png","element":"img","alt":" d2","inline":true,"padRight":true},{"text":"obtained by stacking the columns of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A 
","element":"span"},{"text":"underneath each other, and vec","element":"span"},{"style":{"height":7.6},"width":40.9,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-20.png","element":"img","alt":"−1","inline":true,"padRight":true},{"text":"is the inverse operation that takes a vector of length ","element":"span"},{"style":{"height":13.38},"width":36.74,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-21.png","element":"img","alt":" d2","inline":true,"padRight":true},{"text":"and creates a ","element":"span"},{"style":{"height":10.8},"width":96.16,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-22.png","element":"img","alt":" d × d","inline":true,"padRight":true},{"text":"square matrix by filling up the columns from left to right from the elements of the vector. In addition, we use the following well known result. For conformably dimensioned matrices ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C","element":"span"},{"text":", vec(","element":"span"},{"style":{"height":17.38},"width":340.95,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-23.png","element":"img","alt":"ABC) = (CT ⊗ A","inline":true},{"text":")vec(","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"text":"). This implies that we can write ","element":"span"},{"style":{"height":17.38},"width":432.84,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-24.png","element":"img","alt":" Ls = vec(ILs) = (sT ⊗ I","inline":true},{"text":")vec(","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"). 
Then","element":"span"}],[{"style":{"height":15.59},"width":101,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-25.png","element":"img","alt":"∇µEf","inline":true},{"text":"(log ","element":"span"},{"style":{"height":16.79},"width":354.89,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-26.png","element":"img","alt":" h(µ + Ls)) = Ef(∇θ","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":157.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-27.png","element":"img","alt":" h(µ + Ls","inline":true},{"text":"))","element":"span"}],[{"id":"id-29","text":"and","element":"span"}],[{"style":{"height":16.48},"width":172.61,"height":41.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-28.png","element":"img","alt":"∇vec(L)Ef","inline":true},{"text":"(log ","element":"span"},{"style":{"height":16},"width":157.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-29.png","element":"img","alt":" h(µ + Ls","inline":true},{"text":"))","element":"span"}],[{"style":{"width":"84%"},"width":803,"height":176,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-30.png","element":"img"}],[{"text":"The last line of ","element":"span"},{"href":"#id-29","text":"(2) ","element":"a"},{"text":"just says that","element":"span"}],[{"style":{"width":"99%"},"width":947,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-31.png","element":"img"}],[{"text":"Note ","element":"span"},{"text":"that ","element":"span"},{"text":"entries ","element":"span"},{"text":"above ","element":"span"},{"text":"the ","element":"span"},{"text":"diagonal ","element":"span"},{"text":"should ","element":"span"},{"text":"be set to zero for the right-hand-side of ","element":"span"},{"href":"#id-30","text":"(3) ","element":"a"},{"text":"because 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is ","element":"span"},{"text":"lower ","element":"span"},{"text":"triangular. ","element":"span"},{"text":"For ","element":"span"},{"text":"the ","element":"span"},{"text":"term ","element":"span"},{"text":"log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"| ","element":"span"},{"text":"in ","element":"span"},{"style":{"height":16},"width":111.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-32.png","element":"img","alt":"L(µ, L","inline":true},{"text":"), we have ","element":"span"},{"style":{"height":15.59},"width":52.21,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-33.png","element":"img","alt":" ∇µ","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":324.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-34.png","element":"img","alt":" |L| = 0 and ∇L","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"| ","element":"span"},{"text":"= diag(1","element":"span"},{"style":{"height":16},"width":269.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-35.png","element":"img","alt":"/L11, . . . 
, 1/Ldd","inline":true},{"text":").","element":"span"}],[{"text":"Once we have expressions for the derivatives of the ","element":"span"},{"id":"id-47","text":"lower bound as expectations with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", we can estimate these gradients unbiasedly using simulations from this distribution (typically based on just a single draw). When the log-likelihood is a sum of a large number of terms, such as in the case of a very large dataset, we can subsample the terms and still construct appropriate unbiased gradient estimates if we desire (hence the name “doubly stochastic variational inference”). Algorithm ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"shows the basic stochastic gradient method of ","element":"span"},{"href":"#id-7","referenceIndex":44,"text":"Titsias and Lázaro-Gredilla ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"(2014)","element":"a"},{"text":". 
The sequence ","element":"span"},{"style":{"height":10},"width":32.6,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-36.png","element":"img","alt":" ρt","inline":true},{"text":", ","element":"span"},{"style":{"height":12.8},"width":59.51,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-37.png","element":"img","alt":"t ≥","inline":true,"padRight":true},{"text":"1, in the algorithm is a sequence of learning rates satisfying the Robbins-Monro conditions ","element":"span"},{"style":{"height":16.74},"width":204.54,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-38.png","element":"img","alt":"∑t ρt = ∞","inline":true},{"text":", ","element":"span"},{"style":{"height":18.17},"width":194.35,"height":45.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-39.png","element":"img","alt":"∑t ρt² < ∞","inline":true},{"text":".","element":"span"}],[{"id":"id-31","style":{"width":"78%"},"width":750,"height":384,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-40.png","element":"img"}],[{"text":"Algorithm 1: Doubly stochastic variational inference algorithm of ","element":"span"},{"href":"#id-7","referenceIndex":44,"text":"Titsias and Lázaro-Gredilla ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"(2014)","element":"a"},{"text":".","element":"span"}]]},{"heading":"3 Extension to parametrization of the precision matrix in terms of the Cholesky factor","paragraphs":[[{"text":"When the vector of variables ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-41.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"is high-dimensional, allowing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"to be a dense matrix is computationally impractical. 
An alternative is to assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is diagonal, but that loses any ability to capture dependence structure of the posterior. Here we consider an alternative approach where we follow a similar strategy to that of ","element":"span"},{"href":"#id-7","referenceIndex":44,"text":"Titsias and Lázaro-Gredilla ","element":"a"},{"href":"#id-7","referenceIndex":44,"text":"(2014)","element":"a"},{"text":", but instead parameterize the inverse covariance (precision) matrix in terms of the Cholesky factor and then impose sparsity on it that reflects conditional independence structure in the model.","element":"span"}],[{"id":"id-30","text":"3.1 Model Specification","element":"span"}],[{"text":"Consider a model with observations ","element":"span"},{"style":{"height":16},"width":268.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-42.png","element":"img","alt":" y = (y1, . . . , yn","inline":true},{"text":"), latent variables ","element":"span"},{"style":{"height":14},"width":160.64,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-43.png","element":"img","alt":" b1, . . . , bn","inline":true,"padRight":true},{"text":"and model parameters ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/3-44.png","element":"img","alt":" η","inline":true},{"text":". Let","element":"span"}],[{"id":"id-34","style":{"width":"99%"},"width":947,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-0.png","element":"img"}],[{"text":"denote the vector of all variables. 
We assume that the joint distribution can be written in the form","element":"span"}],[{"style":{"width":"27%"},"width":259,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y, θ","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"η","element":"span"},{"text":")","element":"span"}],[{"style":{"width":"85%"},"width":818,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-2.png","element":"img"}],[{"text":"for some 1 ","element":"span"},{"style":{"height":13.2},"width":151.26,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-3.png","element":"img","alt":" ≤ k ≤ n","inline":true},{"text":". In this model, ","element":"span"},{"style":{"height":13.19},"width":28.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-4.png","element":"img","alt":" bi","inline":true,"padRight":true},{"text":"is conditionally independent of the other latent variables in the posterior distribution ","element":"span"},{"style":{"height":16},"width":86.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-5.png","element":"img","alt":" p(θ|y","inline":true},{"text":") given ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-6.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"and the neighboring ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"latent variables.","element":"span"}],[{"text":"3.2 Gaussian variational approximation with sparse precision matrix","element":"span"}],[{"text":"We consider the 
variational approximation (","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":") of the posterior to be a multivariate Gaussian distribution ","element":"span"},{"style":{"height":17.38},"width":242.15,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-7.png","element":"img","alt":"N(µ, T −T T −1","inline":true},{"text":"), where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is a lower triangular matrix with positive diagonal entries. With ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"being the joint density of a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"-vector of independent standard Gaussian variables as before, we can write ","element":"span"},{"style":{"height":16},"width":212.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-8.png","element":"img","alt":" q(θ|µ, T) =","inline":true},{"style":{"height":17.38},"width":142.08,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-9.png","element":"img","alt":"|T|f(T T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":14},"width":92.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-10.png","element":"img","alt":"θ − µ","inline":true},{"text":")).","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":10.8},"width":32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-11.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"denote the precision matrix of the Gaussian distribution. 
Then ","element":"span"},{"style":{"height":13.39},"width":176.38,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-12.png","element":"img","alt":" Ω = TT T","inline":true,"padRight":true},{"text":"and hence ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is just the Cholesky factor of ","element":"span"},{"style":{"height":10.8},"width":32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-13.png","element":"img","alt":" Ω","inline":true},{"text":". The statistical motivation for imposing sparsity on the Cholesky factor of the precision matrix is as follows. It is well known that for a Gaussian distribution, ","element":"span"},{"style":{"height":15.59},"width":608.1,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-14.png","element":"img","alt":" Ωij = 0 corresponds to variables i","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"being conditionally independent given the rest. Also, if ","element":"span"},{"style":{"height":13.39},"width":166.56,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-15.png","element":"img","alt":"Ω = TT T","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is lower triangular, proposition 1 of ","element":"span"},{"href":"#id-32","referenceIndex":37,"text":"Rothman et al. 
","element":"a"},{"href":"#id-32","referenceIndex":37,"text":"(2010) ","element":"a"},{"text":"states that if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is row banded then ","element":"span"},{"style":{"height":10.8},"width":32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-16.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"possesses the same row banded structure. This means that imposing sparsity in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"can be useful for reflecting conditional independence relationships in ","element":"span"},{"style":{"height":10.8},"width":32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-17.png","element":"img","alt":" Ω","inline":true},{"text":".","element":"span"}],[{"text":"For our model in ","element":"span"},{"href":"#id-33","text":"(5)","element":"a"},{"text":", let us partition ","element":"span"},{"style":{"height":10.8},"width":32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-18.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"into blocks ","element":"span"},{"style":{"height":15.59},"width":55.05,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-19.png","element":"img","alt":"Ωij","inline":true},{"text":", 1 ","element":"span"},{"style":{"height":13.6},"width":169.33,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-20.png","element":"img","alt":" ≤ i, j ≤ n","inline":true,"padRight":true},{"text":"+ 1 according to ","element":"span"},{"href":"#id-34","text":"(4)","element":"a"},{"text":". 
For the Gaussian variational approximation to reflect the conditional independence structure in the posterior, we would like to have ","element":"span"},{"style":{"height":16.79},"width":605.64,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-21.png","element":"img","alt":" Ωij = 0 for {1 ≤ i, j ≤ n|j < i − k","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j > i ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":", with no constraints on the remaining blocks. Write ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"for the Cholesky factor partitioned in the same way as ","element":"span"},{"style":{"height":10.8},"width":32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-22.png","element":"img","alt":"Ω","inline":true,"padRight":true},{"text":"with blocks ","element":"span"},{"style":{"height":15.59},"width":47.56,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-23.png","element":"img","alt":" Tij","inline":true},{"text":", 1 ","element":"span"},{"style":{"height":13.6},"width":184.49,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-24.png","element":"img","alt":" ≤ i, j ≤ n","inline":true,"padRight":true},{"text":"+ 1. 
Since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is lower triangular, ","element":"span"},{"style":{"height":16.39},"width":295.34,"height":40.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-25.png","element":"img","alt":" Tij = 0 if i < j","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":45.56,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-26.png","element":"img","alt":" Tii","inline":true},{"text":", 1 ","element":"span"},{"style":{"height":12.8},"width":154.4,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-27.png","element":"img","alt":" ≤ i ≤ n","inline":true,"padRight":true},{"text":"+ 1, are lower triangular matrices. If ","element":"span"},{"style":{"height":16.79},"width":374.22,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-28.png","element":"img","alt":" Tij = 0 for {1 ≤ j <","inline":true},{"style":{"height":16},"width":286.58,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-29.png","element":"img","alt":"i ≤ n|j < i − k}","inline":true},{"text":", then ","element":"span"},{"style":{"height":13.38},"width":80.64,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-30.png","element":"img","alt":" TT T","inline":true,"padRight":true},{"text":"has the sparsity we desire for ","element":"span"},{"style":{"height":10.8},"width":32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-31.png","element":"img","alt":" Ω","inline":true},{"text":". The sparsity level of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"increases as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"decreases. 
We elaborate later on how the sparse lower triangular structure can be exploited in the generation of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-32.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"from the variational posterior and in gradient computations. This approach is illustrated using generalized linear mixed models and state space models in Section ","element":"span"},{"text":"5.","element":"span"}],[{"id":"id-33","text":"3.3 Stochastic gradients","element":"span"}],[{"text":"Similar to the previous case we obtain for the lower ","element":"span"},{"id":"id-37","text":"bound","element":"span"}],[{"style":{"height":16.79},"width":228.58,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-33.png","element":"img","alt":"L(µ, T) = Ef","inline":true},{"text":"(log ","element":"span"},{"style":{"height":18.18},"width":264.04,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-34.png","element":"img","alt":" h(µ + T −T s) −","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":18.18},"width":286.78,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-35.png","element":"img","alt":" q(µ + T −T s|µ, T","inline":true},{"text":")) ","element":"span"},{"style":{"height":15.59},"width":88.48,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-36.png","element":"img","alt":"= Ef","inline":true},{"text":"(log ","element":"span"},{"style":{"height":18.18},"width":279.54,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-37.png","element":"img","alt":" h(µ + T −T s)) −","inline":true,"padRight":true},{"text":"log 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"fontStyle":"italic"},"text":"| ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K,","element":"span"}],[{"style":{"width":"4%"},"width":45,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-38.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.38},"width":123.92,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-39.png","element":"img","alt":" s = T T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":14},"width":92.63,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-40.png","element":"img","alt":"θ − µ","inline":true},{"text":") is distributed according to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"denotes a constant not depending on ","element":"span"},{"style":{"height":14},"width":70.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-41.png","element":"img","alt":" µ, T","inline":true},{"text":". 
To obtain the gradient of ","element":"span"},{"style":{"height":16},"width":113.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-42.png","element":"img","alt":" L(µ, T","inline":true},{"text":") with respect to ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-43.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", in addition to the results mentioned in Section ","element":"span"},{"text":"2, ","element":"span"},{"text":"we need the following result. Denote by ","element":"span"},{"style":{"height":26.44},"width":146.37,"height":66.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-44.png","element":"img","alt":"dvec(A−1)dvec(A)","inline":true,"padRight":true},{"text":"the matrix where the (","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, j","element":"span"},{"text":")th entry is the partial derivative of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th element of vec(","element":"span"},{"style":{"height":13.38},"width":70.8,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-45.png","element":"img","alt":"A−1","inline":true},{"text":") with respect to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"th element of vec(","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":"). 
Then","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"vec(","element":"span"},{"style":{"height":13.38},"width":70.79,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-46.png","element":"img","alt":"A−1","inline":true},{"text":")","element":"span"}],[{"style":{"width":"53%"},"width":510,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-47.png","element":"img"}],[{"text":"Similar to before, we have","element":"span"}],[{"style":{"height":15.59},"width":236.19,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-48.png","element":"img","alt":"∇µL = ∇µEf","inline":true},{"text":"(log ","element":"span"},{"style":{"height":18.18},"width":209.01,"height":45.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-49.png","element":"img","alt":" h(µ + T −T s","inline":true},{"text":"))","element":"span"}],[{"id":"id-36","style":{"width":"88%"},"width":847,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-50.png","element":"img"}],[{"text":"Looking at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":",","element":"span"}],[{"style":{"height":16.48},"width":173.86,"height":41.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-51.png","element":"img","alt":"∇vec(T )Ef","inline":true},{"text":"(log ","element":"span"},{"style":{"height":18.18},"width":209,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-52.png","element":"img","alt":" h(µ + T −T s","inline":true},{"text":"))","element":"span"}],[{"style":{"width":"95%"},"width":906,"height":317,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-53.png","element":"img"}],[{"id":"id-35","text":"so 
that","element":"span"}],[{"style":{"width":"98%"},"width":941,"height":174,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/4-54.png","element":"img"}],[{"text":"Note that for the first term on the right hand side of ","element":"span"},{"href":"#id-35","text":"(8)","element":"a"},{"text":", entries above the diagonal should be set to zero because ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is a lower triangular matrix.","element":"span"}],[{"id":"id-48","text":"3.4 Alternative estimators of the stochastic gradients","element":"span"}],[{"text":"From ","element":"span"},{"href":"#id-36","text":"(7) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-35","text":"(8)","element":"a"},{"text":", unbiased estimators of ","element":"span"},{"style":{"height":15.59},"width":82.58,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-0.png","element":"img","alt":" ∇µL","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":86.31,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-1.png","element":"img","alt":" ∇T L","inline":true,"padRight":true},{"text":"are given by","element":"span"}],[{"style":{"width":"77%"},"width":737,"height":174,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-2.png","element":"img"}],[{"text":"respectively, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is generated from ","element":"span"},{"style":{"height":16},"width":124.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-3.png","element":"img","alt":" N(0, Id","inline":true},{"text":"). 
In deriving these estimators, we have evaluated the term ","element":"span"},{"style":{"height":15.59},"width":46.42,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-4.png","element":"img","alt":"Ef","inline":true},{"text":"(log ","element":"span"},{"style":{"height":17.39},"width":293.67,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-5.png","element":"img","alt":" q(µ + T −T s|µ, T","inline":true},{"text":")) in the lower bound analytically. However, alternative estimators can be derived by approximating this term instead of using its analytical form. As","element":"span"}],[{"style":{"height":13.19},"width":48.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-6.png","element":"img","alt":"∇θ","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":18.18},"width":234.8,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-7.png","element":"img","alt":" q(θ) = −TT T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16},"width":250.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-8.png","element":"img","alt":"θ − µ) = −Ts,","inline":true}],[{"text":"we have from ","element":"span"},{"href":"#id-37","text":"(6)","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"98%"},"width":937,"height":181,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-9.png","element":"img"}],[{"text":"Similarly,","element":"span"}],[{"style":{"width":"97%"},"width":928,"height":248,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-10.png","element":"img"}],[{"text":"Thus alternative unbiased estimators of ","element":"span"},{"style":{"height":15.59},"width":82.58,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-11.png","element":"img","alt":" 
∇µL","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":86.32,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-12.png","element":"img","alt":" ∇T L","inline":true,"padRight":true},{"text":"are given by","element":"span"}],[{"style":{"width":"87%"},"width":829,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-13.png","element":"img"}],[{"text":"In our experiments, we observe that the estimators ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-14.png","element":"img","alt":"gµ,2","inline":true,"padRight":true},{"text":"and ˆ","element":"span"},{"style":{"height":11.59},"width":65.87,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-15.png","element":"img","alt":"gT,2","inline":true,"padRight":true},{"text":"seem to provide approximations with lower variance ","element":"span"},{"href":"#id-5","referenceIndex":39,"text":"(Salimans and Knowles ","element":"a"},{"href":"#id-5","referenceIndex":39,"text":"(2013) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-20","referenceIndex":11,"text":"Han ","element":"a"},{"href":"#id-20","referenceIndex":11,"text":"et al. ","element":"a"},{"href":"#id-20","referenceIndex":11,"text":"(2016) ","element":"a"},{"text":"note related phenomena). 
As an example, for the toenail dataset in Section ","element":"span"},{"href":"#id-38","text":"5.1.2, ","element":"a"},{"text":"we compare in Figure ","element":"span"},{"href":"#id-39","text":"1 ","element":"a"},{"text":"estimates of ","element":"span"},{"style":{"height":15.59},"width":82.58,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-16.png","element":"img","alt":" ∇µL","inline":true,"padRight":true},{"text":"given by ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-17.png","element":"img","alt":"gµ,1","inline":true,"padRight":true},{"text":"(black) and ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-18.png","element":"img","alt":"gµ,2","inline":true,"padRight":true},{"text":"(red) for a subset of the variables. This is done by fixing ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-19.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"at the mode and computing the gradient estimates of ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-20.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"at 1000 random variates ","element":"span"},{"style":{"height":16},"width":207.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-21.png","element":"img","alt":" s ∼ N(0, Id","inline":true},{"text":"). 
Figure ","element":"span"},{"href":"#id-39","text":"1 ","element":"a"},{"text":"shows clearly that there is much greater variation in the estimates computed using ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-22.png","element":"img","alt":"gµ,1","inline":true,"padRight":true},{"text":"as compared to ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-23.png","element":"img","alt":"gµ,2","inline":true},{"text":". This suggests that using the alternative estimators ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-24.png","element":"img","alt":"gµ,2","inline":true,"padRight":true},{"text":"and ˆ","element":"span"},{"style":{"height":11.59},"width":65.87,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-25.png","element":"img","alt":"gT,2","inline":true,"padRight":true},{"text":"will result in a more stable algorithm with better convergence and greater precision.","element":"span"}],[{"text":"Below, we provide some intuition for this observation. 
Suppose the density that we are approximating is close to a Gaussian distribution with mean ","element":"span"},{"style":{"height":14.18},"width":40.01,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-26.png","element":"img","alt":" µ∗","inline":true,"padRight":true},{"text":"and precision ","element":"span"},{"style":{"height":14.64},"width":117.28,"height":36.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-27.png","element":"img","alt":" T ∗T ∗T","inline":true,"padRight":true},{"text":", that is, ","element":"span"},{"style":{"height":18.64},"width":425.46,"height":46.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-28.png","element":"img","alt":" h(θ) ≈ N(µ∗, T ∗−T T ∗−1","inline":true},{"text":"). Then ","element":"span"},{"style":{"height":13.19},"width":48.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-29.png","element":"img","alt":"∇θ","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":18.64},"width":280.46,"height":46.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-30.png","element":"img","alt":" h(θ) ≈ −T ∗T ∗T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.58},"width":263.17,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-31.png","element":"img","alt":"T −T s + µ − µ∗","inline":true},{"text":"). 
When we are close to the mode, ","element":"span"},{"style":{"height":14.19},"width":270.61,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-32.png","element":"img","alt":" T ≈ T ∗, µ ≈ µ∗","inline":true,"padRight":true},{"text":"and","element":"span"}],[{"style":{"width":"76%"},"width":725,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-33.png","element":"img"}],[{"text":"while","element":"span"}],[{"text":"ˆ","element":"span"},{"style":{"height":15.59},"width":348.15,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-34.png","element":"img","alt":"gµ,2 ≈ 0, ˆgT,2 ≈ 0.","inline":true}],[{"text":"Thus, for ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-35.png","element":"img","alt":"gµ,2","inline":true},{"text":", the contributions to the gradients from ","element":"span"},{"style":{"height":13.19},"width":48.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-36.png","element":"img","alt":"∇θ","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":57.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-37.png","element":"img","alt":" h(θ","inline":true},{"text":") and ","element":"span"},{"style":{"height":13.19},"width":48.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-38.png","element":"img","alt":" ∇θ","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":53.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-39.png","element":"img","alt":" q(θ","inline":true},{"text":") cancel out. 
As ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-40.png","element":"img","alt":"gµ,2","inline":true,"padRight":true},{"text":"is a factor of ˆ","element":"span"},{"style":{"height":11.59},"width":65.87,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-41.png","element":"img","alt":"gT,2","inline":true},{"text":", ˆ","element":"span"},{"style":{"height":12.79},"width":111.44,"height":31.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-42.png","element":"img","alt":"gT,2 ≈","inline":true,"padRight":true},{"text":"0 when ˆ","element":"span"},{"style":{"height":12.79},"width":109.42,"height":31.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-43.png","element":"img","alt":"gµ,2 ≈","inline":true,"padRight":true},{"text":"0. However, ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-44.png","element":"img","alt":"gµ,1","inline":true,"padRight":true},{"text":"and ˆ","element":"span"},{"style":{"height":11.59},"width":65.87,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-45.png","element":"img","alt":"gT,1","inline":true,"padRight":true},{"text":"are still subjected to the randomness in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"around the mode. 
Thus we prefer to use the estimators ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-46.png","element":"img","alt":"gµ,2","inline":true,"padRight":true},{"text":"and ˆ","element":"span"},{"style":{"height":11.59},"width":65.87,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-47.png","element":"img","alt":"gT,2","inline":true},{"text":", which do not incur any additional computation except for the term ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ts","element":"span"},{"text":".","element":"span"}],[{"text":"3.5 Uniqueness of the Cholesky factor","element":"span"}],[{"text":"We note that in Algorithm 1, the updates of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"do not ensure that the diagonal entries are positive. While this does not result in any computational issues, we prefer to add the following step to ensure that the diagonal entries of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"are positive. This helps to ensure uniqueness of the Cholesky factor and reduces the possibility of multiple local modes, which is an important issue, especially in the high-dimensional problems considered here. To achieve this aim, 
we introduce the lower triangular matrix ","element":"span"},{"style":{"height":10.8},"width":42.82,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-48.png","element":"img","alt":" T ′","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":17.35},"width":147.83,"height":43.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-49.png","element":"img","alt":" T ′ij = Tij","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":15.2},"width":83.86,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-50.png","element":"img","alt":" i ̸= j","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.15},"width":213.5,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-51.png","element":"img","alt":" T ′ii = log(Tii","inline":true},{"text":"). We ","element":"span"},{"text":"compute the stochastic gradient updates for ","element":"span"},{"style":{"height":10.8},"width":42.82,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-52.png","element":"img","alt":" T ′","inline":true,"padRight":true},{"text":"whose entries are unconstrained. The gradient ˆ","element":"span"},{"style":{"height":15.59},"width":199.24,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-53.png","element":"img","alt":"gT ′,2 = ˆgT,2","inline":true,"padRight":true},{"text":"for all non-diagonal entries. 
Diagonal entries of ˆ","element":"span"},{"style":{"height":11.59},"width":78.36,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-54.png","element":"img","alt":"gT ′,2","inline":true,"padRight":true},{"text":"can be computed by multiplying the diagonal entries of ˆ","element":"span"},{"style":{"height":11.59},"width":65.87,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-55.png","element":"img","alt":"gT,2","inline":true,"padRight":true},{"text":"by the diagonal entries of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":".","element":"span"}],[{"text":"The modification of the doubly stochastic variational inference algorithm, in terms of the Cholesky factor of the precision matrix, is summarized in Algorithm ","element":"span"},{"href":"#id-40","text":"2.","element":"a"}],[{"text":"Now let us consider sparsity in the Cholesky factor ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". Suppose some elements of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"are fixed at zero. Then Algorithm 2 remains the same, except that only the subset of elements of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"which are not fixed at zero are stored and updated. 
Note that in step 2, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is a sparse matrix, we can compute ","element":"span"},{"style":{"height":13.39},"width":97.83,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-56.png","element":"img","alt":" T −T s","inline":true,"padRight":true},{"text":"by solving the linear system ","element":"span"},{"style":{"height":13.38},"width":148.84,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-57.png","element":"img","alt":" T T x = s","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". This can be done very efficiently because ","element":"span"},{"style":{"height":13.38},"width":51.82,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-58.png","element":"img","alt":" T T","inline":true,"padRight":true},{"text":"is upper triangular and sparse. Similarly, in computing the update at step 5, we need to compute the vector ","element":"span"},{"style":{"height":18.17},"width":109.61,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-59.png","element":"img","alt":" T −1gµ","inline":true},{"text":". This can also be computed by solving the linear system ","element":"span"},{"style":{"height":15.59},"width":142.74,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/5-60.png","element":"img","alt":" Tx = gµ","inline":true},{"text":", which is again easy because ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is a sparse lower triangular matrix. 
So even in very high dimensions, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is appropriately sparse, Algorithm ","element":"span"},{"href":"#id-40","text":"2 ","element":"a"},{"text":"can be implemented in a way that is efficient in terms of both memory storage requirements and CPU time.","element":"span"}],[{"id":"id-39","style":{"width":"98%"},"width":1942,"height":441,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-0.png","element":"img"}],[{"text":"Fig. 1: Toenail data: estimates of ","element":"figcaption","subtype":"caption"},{"style":{"height":15.59},"width":82.58,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-1.png","element":"img","alt":" ∇µL","inline":true,"padRight":true},{"text":"given by ˆ","element":"figcaption","subtype":"caption"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-2.png","element":"img","alt":"gµ,1","inline":true,"padRight":true},{"text":"(black) and ˆ","element":"figcaption","subtype":"caption"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-3.png","element":"img","alt":"gµ,2","inline":true,"padRight":true},{"text":"(red) at the mode for a subset of the variables.","element":"figcaption","subtype":"caption"}],[{"id":"id-40","style":{"width":"78%"},"width":751,"height":538,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-4.png","element":"img"}],[{"text":"Algorithm 2: Modified doubly stochastic variational inference algorithm parameterized in terms of the Cholesky factor of the precision matrix.","element":"span"}]]},{"heading":"4 Setting the learning rate and stopping criterion in the stochastic optimization","paragraphs":[[{"text":"4.1 Learning rate","element":"span"}],[{"text":"The setting of appropriate learning rates in stochastic gradient 
algorithms is a highly challenging problem. The choice of learning rate determines not only the rate of convergence but also the quality of the optimum attained. Learning rates that are too high cause the algorithm to diverge, while rates that are too low result in slow learning and can lead to “apparent convergence”, a situation where parameters appear to have converged due to diminishing step-size (see, e.g. ","element":"span"},{"href":"#id-41","referenceIndex":28,"text":"Powell, ","element":"a"},{"href":"#id-41","referenceIndex":28,"text":"2011)","element":"a"},{"text":". ","element":"span"},{"href":"#id-42","referenceIndex":40,"text":"Spall ","element":"a"},{"href":"#id-42","referenceIndex":40,"text":"(2003) ","element":"a"},{"text":"suggests a step-size sequence which satisfies the theoretical conditions for convergence ","element":"span"},{"href":"#id-3","referenceIndex":35,"text":"(Robbins ","element":"a"},{"href":"#id-3","referenceIndex":35,"text":"and Monro, ","element":"a"},{"href":"#id-3","referenceIndex":35,"text":"1951)","element":"a"},{"text":". 
This takes the form ","element":"span"},{"style":{"height":17.78},"width":249.08,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-5.png","element":"img","alt":" A1/(t + A2)A3","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"denotes the iteration, and ","element":"span"},{"style":{"height":14},"width":90.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-6.png","element":"img","alt":" A1 ≥","inline":true,"padRight":true},{"text":"1, ","element":"span"},{"style":{"height":14},"width":90.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-7.png","element":"img","alt":" A2 ≥","inline":true,"padRight":true},{"text":"0 and 0","element":"span"},{"style":{"height":14},"width":183.9,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-8.png","element":"img","alt":".5 < A3 ≤","inline":true,"padRight":true},{"text":"1 are constants to be tuned. However, we find that it is difficult to hand-tune this learning rate for use in Algorithm 2, as the problems considered are high-dimensional in nature and the parameters ","element":"span"},{"style":{"height":16},"width":110.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-9.png","element":"img","alt":" {µ, T}","inline":true,"padRight":true},{"text":"converge at different rates. 
For instance, ","element":"span"},{"text":"Titsias and Lázaro-Gredilla ","element":"span"},{"href":"#id-7","referenceIndex":44,"text":"(2014) ","element":"a"},{"text":"scaled down the learning rate of ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-10.png","element":"img","alt":"µ","inline":true,"padRight":true},{"text":"by a factor of 0.1 when using it for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"in Algorithm 1. It is also likely that the parameters have different “scales”, especially when some of the constrained parameters have to be transformed to the real line. These considerations increase the need for learning rates that are adaptive and parameter-specific. Several adaptive step-size sequences (e.g. ","element":"span"},{"href":"#id-43","referenceIndex":8,"text":"Duchi et al. ","element":"a"},{"href":"#id-43","referenceIndex":8,"text":"(2011)","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":32,"text":"Ranganath et al. ","element":"a"},{"href":"#id-44","referenceIndex":32,"text":"(2013)","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":20,"text":"Kucukelbir et al., ","element":"a"},{"href":"#id-12","referenceIndex":20,"text":"2016) ","element":"a"},{"text":"have been proposed. We find that the ADADELTA method ","element":"span"},{"href":"#id-45","referenceIndex":48,"text":"(Zeiler, ","element":"a"},{"href":"#id-45","referenceIndex":48,"text":"2012)","element":"a"},{"text":", in particular, works very well with Algorithm 2 and we use it for all the examples. For consistency, we also use ADADELTA to compute the step-size for Algorithm 1. While ADADELTA has worked well in our experiments, we have only worked on a limited number of datasets and it is likely that other learning rates may yield better performance. 
From our observations, the performance of a learning rate tends to be problem-dependent.","element":"span"}],[{"text":"The ADADELTA method takes into consideration the scale of the parameters by incorporating second-order information through a Hessian approximation. Suppose at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", a parameter ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is updated as ","element":"span"},{"style":{"height":14.19},"width":222.06,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-11.png","element":"img","alt":"x(t) = x(t−1)","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":14.19},"width":92.48,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-12.png","element":"img","alt":" ∆x(t)","inline":true},{"text":", where ","element":"span"},{"style":{"height":19.88},"width":238.36,"height":49.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-13.png","element":"img","alt":" ∆x(t) = ωg(t)x","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.88},"width":56.93,"height":49.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-14.png","element":"img","alt":" g(t)x","inline":true,"padRight":true},{"text":"is the gradient. 
The step-size ","element":"span"},{"style":{"height":6.8},"width":25,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-15.png","element":"img","alt":" ω","inline":true,"padRight":true},{"text":"is computed as","element":"span"}],[{"style":{"width":"29%"},"width":285,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-16.png","element":"img"}],[{"style":{"height":6.8},"width":68.3,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-17.png","element":"img","alt":"ω =","inline":true}],[{"id":"id-46","style":{"width":"24%"},"width":232,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-18.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":18.18},"width":184.42,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-19.png","element":"img","alt":" E[∆2x](t−1)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.18},"width":129.42,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-20.png","element":"img","alt":" E[g2x](t)","inline":true,"padRight":true},{"text":"are exponentially ","element":"span"},{"text":"decaying averages of ","element":"span"},{"style":{"height":17.81},"width":110.92,"height":44.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-21.png","element":"img","alt":" ∆x(t)2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":23.51},"width":75.38,"height":58.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-22.png","element":"img","alt":" g(t)x 2","inline":true},{"text":", and ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-23.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"is a small positive constant added to ensure the 
denominator is positive and the initial step-size is nonzero. The terms ","element":"span"},{"style":{"height":18.19},"width":143.62,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-24.png","element":"img","alt":" E[∆2x](t)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.19},"width":129.42,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-25.png","element":"img","alt":" E[g2x](t)","inline":true,"padRight":true},{"text":"are updated as","element":"span"}],[{"style":{"width":"78%"},"width":749,"height":134,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-26.png","element":"img"}],[{"text":"at each iteration, where ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/6-27.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"is a decay constant. The motivation for this approach comes from Newton-Raphson algorithms, where it is well known that the inverse of the Hessian matrix provides an optimal or near-optimal step-size sequence (see, e.g. ","element":"span"},{"href":"#id-42","referenceIndex":40,"text":"Spall, ","element":"a"},{"href":"#id-42","referenceIndex":40,"text":"2003)","element":"a"},{"text":". ADADELTA approximates the Hessian by taking ","element":"span"},{"style":{"height":22.57},"width":215.99,"height":56.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-0.png","element":"img","alt":"1/f ′′(x) ≈ ∆x/f ′(x)","inline":true},{"text":", hence the form of ","element":"span"},{"style":{"height":6.8},"width":25,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-1.png","element":"img","alt":" ω","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-46","text":"(9)","element":"a"},{"text":". 
To apply ","element":"span"},{"text":"ADADELTA, we modify Algorithm 2 as outlined below. Note that the step-sizes for ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-2.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"are different. As ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"}],[{"style":{"width":"100%"},"width":953,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-3.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , N","element":"span"},{"text":",","element":"span"}],[{"text":"1. Generate ","element":"span"},{"style":{"height":14},"width":203.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-4.png","element":"img","alt":" s ∼ N(0, Id).","inline":true,"padRight":true},{"text":"2. ","element":"span"},{"style":{"height":16.81},"width":442.32,"height":42.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-5.png","element":"img","alt":" θ(t) = µ(t−1) + T (t−1)−T s.","inline":true}],[{"text":"3. ","element":"span"},{"style":{"height":17.95},"width":513.74,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-6.png","element":"img","alt":" g(t)µ = ∇θ log h(θ(t)) + T (t−1)s .","inline":true}],[{"text":"4. Accumulate gradient ","element":"span"},{"style":{"height":21.25},"width":575.39,"height":53.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-7.png","element":"img","alt":" E[g2µ](t) = ρE[g2µ](t−1)+(1−ρ)g(t)µ 2.","inline":true}],[{"text":"5. 
Compute change ","element":"span"},{"style":{"height":17.94},"width":109.03,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-8.png","element":"img","alt":" ∆(t)µ =","inline":true}],[{"style":{"width":"100%"},"width":953,"height":866,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-9.png","element":"img"}],[{"text":"is a sparse matrix, we find that it is more efficient to perform steps 8–12 in vector-form and to store only the non-zero elements of ","element":"span"},{"style":{"height":21.36},"width":56.93,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-10.png","element":"img","alt":" g(t)T","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":21.36},"width":69.7,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-11.png","element":"img","alt":" ∆(t)T","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":18.57},"width":329.82,"height":46.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-12.png","element":"img","alt":" E[g2T ′](t), E[∆2T ′](t)","inline":true},{"text":". We ","element":"span"},{"text":"let ","element":"span"},{"style":{"height":13.38},"width":150.06,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-13.png","element":"img","alt":" ϵ = 10−6","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":104.66,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-14.png","element":"img","alt":" ρ = 0.","inline":true},{"text":"95, the setting recommended by ","element":"span"},{"href":"#id-45","referenceIndex":48,"text":"Zeiler ","element":"a"},{"href":"#id-45","referenceIndex":48,"text":"(2012)","element":"a"},{"text":". 
We note that Algorithm 2 is more sensitive to ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-15.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"when the estimators ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-16.png","element":"img","alt":"gµ,1","inline":true,"padRight":true},{"text":"and ˆ","element":"span"},{"style":{"height":11.59},"width":65.87,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-17.png","element":"img","alt":"gT,1","inline":true,"padRight":true},{"text":"are used as compared to the alternative estimators ˆ","element":"span"},{"style":{"height":11.59},"width":63.85,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-18.png","element":"img","alt":"gµ,2","inline":true,"padRight":true},{"text":"and ˆ","element":"span"},{"style":{"height":11.59},"width":65.87,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-19.png","element":"img","alt":"gT,2","inline":true},{"text":".","element":"span"}],[{"text":"4.2 Stopping Criterion","element":"span"}],[{"text":"In variational algorithms, the lower bound is commonly used as an objective function to check for convergence. When the updates are deterministic, the lower bound is guaranteed to increase after each cycle and the algorithm can be terminated when the increase in the lower bound is negligible. In Algorithms 1 and 2, the updates are stochastic and so the lower bound is not guaranteed to increase at each iteration. 
Computing the lower bounds in ","element":"span"},{"href":"#id-47","text":"(1) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-37","text":"(6) ","element":"a"},{"text":"also requires evaluating the expectations with respect to the variational approximation ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":". It is straightforward, however, to obtain an unbiased estimate of the lower bound at each iteration. Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"be a random variate generated from ","element":"span"},{"style":{"height":16},"width":124.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-20.png","element":"img","alt":" N(0, Id","inline":true},{"text":"). From ","element":"span"},{"href":"#id-47","text":"(1)","element":"a"},{"text":", an unbiased estimate of the lower bound for Algorithm 1 is given by","element":"span"}],[{"text":"ˆ","element":"span"},{"style":{"height":16},"width":351.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-21.png","element":"img","alt":"L = log h(µ + Ls) −","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":233.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-22.png","element":"img","alt":" q(µ + Ls|µ, L","inline":true},{"text":")","element":"span"}],[{"style":{"width":"81%"},"width":778,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-23.png","element":"img"}],[{"text":"Similarly, an unbiased estimate of the lower bound for Algorithm 2 is given by","element":"span"}],[{"text":"ˆ","element":"span"},{"style":{"height":18.18},"width":347.74,"height":45.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-24.png","element":"img","alt":"L = log h(µ + T −T s","inline":true},{"text":") + 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"2 ","element":"span"},{"text":"log(2","element":"span"},{"style":{"height":16},"width":79.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-25.png","element":"img","alt":"π) −","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"fontStyle":"italic"},"text":"| ","element":"span"},{"text":"+ 1","element":"span"},{"text":"2","element":"span"},{"style":{"height":14.18},"width":73.47,"height":35.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-26.png","element":"img","alt":"sT s.","inline":true}],[{"text":"Here we do not evaluate the expectation of the last term ","element":"span"},{"style":{"height":19.37},"width":83.46,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-27.png","element":"img","alt":"12sT s","inline":true,"padRight":true},{"text":"analytically so that the randomness in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"will ","element":"span"},{"text":"cancel out between the first and the last term when we are close to the mode (a similar argument is given in Section ","element":"span"},{"href":"#id-48","text":"3.4)","element":"a"},{"text":".","element":"span"}],[{"text":"As the estimate ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is stochastic, we consider instead the average of ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"over the past ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"iterations, say 
","element":"span"},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":", to minimize variability. We compute ","element":"span"},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"after every ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"iterations and keep a record of the maximum value of ¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"attained thus far, say ¯","element":"span"},{"style":{"height":13.19},"width":86.58,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-28.png","element":"img","alt":"Lmax","inline":true},{"text":". The algorithm is terminated when ","element":"span"},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"falls below ","element":"span"},{"text":"¯","element":"span"},{"style":{"height":13.19},"width":86.58,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-29.png","element":"img","alt":"Lmax","inline":true,"padRight":true},{"text":"more than ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"consecutive times. This may imply either that the algorithm has converged and hence the lower bound estimates are just bouncing around or the algorithm is diverging and the estimates of the lower bound are deteriorating. We say that the algorithm is “diverging” if ","element":"span"},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is tending towards ","element":"span"},{"style":{"height":7.2},"width":71,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/7-30.png","element":"img","alt":"−∞","inline":true},{"text":". 
In Section ","element":"span"},{"text":"5, ","element":"span"},{"text":"we adopt rather conservative values of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"= 2500 and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= 3 to avoid the dangers of stopping prematurely (see, e.g. ","element":"span"},{"href":"#id-49","referenceIndex":4,"text":"Booth and Hobert, ","element":"a"},{"href":"#id-49","referenceIndex":4,"text":"1999)","element":"a"},{"text":". Alternative stopping criteria can also be constructed by examining the relative change in the parameter updates from successive iterations or the magnitude of the gradients of the parameters (see, e.g. ","element":"span"},{"href":"#id-42","referenceIndex":40,"text":"Spall, ","element":"a"},{"href":"#id-42","referenceIndex":40,"text":"2003)","element":"a"},{"text":".","element":"span"}]]},{"heading":"5 Applications","paragraphs":[[{"text":"In this section, we illustrate how we can impose sparsity in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"via Algorithm 2 using appropriate posterior conditional independence relationships for generalized linear mixed models (GLMMs) and state space models (SSMs). We code Algorithms 1 and 2 in Julia Version 0.5.0 (","element":"span"},{"href":"http://julialang.org/","style":{"fontFamily":"monospace"},"text":"http://julialang.org/","element":"a"},{"text":") and make use of its functions for sparse matrix representations to store and perform operations on high-dimensional sparse matrices efficiently.","element":"span"}],[{"id":"id-50","style":{"width":"83%"},"width":1656,"height":357,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-0.png","element":"img"}],[{"text":"Table 1: Runtime (in seconds) of ADVI (Stan) and Algorithms 1 and 2 (Julia) for datasets of different sizes. 
The number of iterations (in hundreds) used in Algorithms 1 and 2 is given in brackets.","element":"figcaption","subtype":"caption"}],[{"text":"We compare the variational approximations with posteriors obtained through long runs of MCMC (regarded as ground truth). In all examples, fitting via MCMC was performed in RStan (","element":"span"},{"href":"http://mc-stan.org/interfaces/rstan","style":{"fontFamily":"monospace"},"text":"http://mc-stan.","element":"a"},{"href":"http://mc-stan.org/interfaces/rstan","style":{"fontFamily":"monospace"},"text":"org/interfaces/rstan","element":"a"},{"text":") and the same priors are used in MCMC and variational approximations. For MCMC, we use 50,000 iterations in each example and the first half is discarded as burn-in. A thinning factor of 5 is applied and the remaining 5000 samples are used to estimate the posterior density.","element":"span"}],[{"text":"We note that Algorithm 1 can also be readily implemented in Stan using automatic differentiation variational inference (ADVI, ","element":"span"},{"href":"#id-12","referenceIndex":20,"text":"Kucukelbir et al., ","element":"a"},{"href":"#id-12","referenceIndex":20,"text":"2016)","element":"a"},{"text":". Hence we have also included the results from ADVI for comparison. However, there are some differences between our implementation of Algorithm 1 in Julia and that in Stan, namely, the learning rate and stopping criterion are different and we impose the additional restriction that the diagonal elements in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"must be positive.","element":"span"}],[{"text":"Table ","element":"span"},{"href":"#id-50","text":"1 ","element":"a"},{"text":"shows the runtimes for ADVI and Algorithms 1 and 2 for the datasets considered in this section. 
We use the terms “mean-field” to refer to the case where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is a diagonal matrix and “unrestricted” when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is a full lower triangular matrix. All experiments are run on an Intel Core i5 CPU @ 3.20GHz with 8.0GB RAM.","element":"span"}],[{"id":"id-74","text":"5.1 Generalized linear mixed models","element":"span"}],[{"text":"Here we consider GLMMs where ","element":"span"},{"style":{"height":17.39},"width":352.8,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-1.png","element":"img","alt":" yi = (yi1, . . . , yini)T","inline":true}],[{"text":"is the set of responses for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th subject, ","element":"span"},{"style":{"height":15.59},"width":57.28,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-2.png","element":"img","alt":" Xij","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.59},"width":51.48,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-3.png","element":"img","alt":" Zij","inline":true,"padRight":true},{"text":"are vectors of predictors for ","element":"span"},{"style":{"height":16.79},"width":270.39,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-4.png","element":"img","alt":" yij, µij = E(yij","inline":true},{"text":"), and ","element":"span"},{"style":{"height":16},"width":46.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-5.png","element":"img","alt":" g(·","inline":true},{"text":") is a smooth invertible link function. 
Let","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"µ","element":"span"},{"style":{"height":20.52},"width":155.84,"height":51.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-6.png","element":"img","alt":"ij) = XTij","inline":true},{"style":{"fontStyle":"italic"},"text":"β ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z","element":"span"},{"style":{"height":20.52},"width":56.17,"height":51.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-7.png","element":"img","alt":"Tijbi","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , n, j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , n","element":"span"},{"style":{"height":7.2},"width":11,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-8.png","element":"img","alt":"i","inline":true},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"width":"54%"},"width":519,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-9.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-10.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"is a vector of fixed effects parameters and ","element":"span"},{"style":{"height":13.19},"width":28.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-11.png","element":"img","alt":" bi","inline":true,"padRight":true},{"text":"is a random effect for the 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th subject. Here we consider binary responses, where ","element":"span"},{"style":{"height":11.59},"width":89.67,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-12.png","element":"img","alt":" yij ∼","inline":true,"padRight":true},{"text":"Bernoulli(","element":"span"},{"style":{"height":11.59},"width":48.28,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-13.png","element":"img","alt":"µij","inline":true},{"text":") with the logit link function ","element":"span"},{"style":{"height":21.58},"width":320.23,"height":53.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-14.png","element":"img","alt":" g(µij) = log µij1−µij","inline":true,"padRight":true},{"text":", and count ","element":"span"},{"text":"responses, where ","element":"span"},{"style":{"height":11.59},"width":95.51,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-15.png","element":"img","alt":" yij ∼","inline":true,"padRight":true},{"text":"Poisson(","element":"span"},{"style":{"height":11.59},"width":48.28,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-16.png","element":"img","alt":"µij","inline":true},{"text":") with the log link function ","element":"span"},{"style":{"height":16.79},"width":287.96,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-17.png","element":"img","alt":" g(µij) = log(µij","inline":true},{"text":"). 
Variational methods have been considered for efficient computation in GLMMs by ","element":"span"},{"href":"#id-51","referenceIndex":26,"text":"Ormerod and Wand ","element":"a"},{"href":"#id-51","referenceIndex":26,"text":"(2012)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-52","referenceIndex":41,"text":"Tan and Nott ","element":"a"},{"href":"#id-52","referenceIndex":41,"text":"(2013, ","element":"a"},{"href":"#id-53","referenceIndex":42,"text":"2014)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-54","referenceIndex":21,"text":"Lee and Wand ","element":"a"},{"href":"#id-54","referenceIndex":21,"text":"(2016a,","element":"a"},{"href":"#id-55","referenceIndex":22,"text":"b)","element":"a"},{"text":", among others.","element":"span"}],[{"text":"We parameterize the elements of the random effects covariance matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"so that they are unconstrained and so that a normal variational posterior approximation is reasonable. Let ","element":"span"},{"style":{"height":13.39},"width":210.09,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-18.png","element":"img","alt":" G = WW T","inline":true,"padRight":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":", the Cholesky factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":", is a ","element":"span"},{"style":{"height":11.2},"width":99.05,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-19.png","element":"img","alt":" p × p","inline":true,"padRight":true},{"text":"lower triangular matrix with positive diagonal entries. 
Let ","element":"span"},{"style":{"height":10.98},"width":59.17,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-20.png","element":"img","alt":" W ∗","inline":true,"padRight":true},{"text":"denote the matrix for which ","element":"span"},{"style":{"height":16.15},"width":262.1,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-21.png","element":"img","alt":" W ∗ii = log(Wii","inline":true},{"text":") and ","element":"span"},{"style":{"height":17.53},"width":200.66,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-22.png","element":"img","alt":" W ∗ij = Wij","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":15.2},"width":103.78,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-23.png","element":"img","alt":" i ̸= j","inline":true},{"text":". ","element":"span"},{"text":"Write ","element":"span"},{"style":{"height":16},"width":227.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-24.png","element":"img","alt":" ζ = vech(W ∗","inline":true},{"text":"), where vech is the operation that transforms the lower triangular part of a square matrix into a vector by stacking the elements on and below the diagonal column by column. 
We assume a normal prior, ","element":"span"},{"style":{"height":20.3},"width":353.04,"height":50.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-25.png","element":"img","alt":"ζ ∼ N(0, σ2ζIp(p+1)/2","inline":true},{"text":").","element":"span"}],[{"text":"The vector of variables in the model is given by ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-26.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"as defined in ","element":"span"},{"href":"#id-34","text":"(4)","element":"a"},{"text":", where ","element":"span"},{"style":{"height":17.39},"width":205.92,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-27.png","element":"img","alt":" η = (βT , ζT","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":7.6},"width":23,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-28.png","element":"img","alt":"T","inline":true,"padRight":true},{"text":"and the length of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-29.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"pn ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"+ 1)","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2. 
The joint distribution can be written as","element":"span"}],[{"style":{"width":"48%"},"width":461,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-30.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y, θ","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"θ","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"β","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"ζ","element":"span"},{"text":")","element":"span"}],[{"id":"id-56","style":{"width":"50%"},"width":477,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-31.png","element":"img"}],[{"text":"For this model, note that ","element":"span"},{"style":{"height":13.19},"width":28.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-32.png","element":"img","alt":" bi","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.59},"width":30.1,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-33.png","element":"img","alt":" bj","inline":true,"padRight":true},{"text":"are conditionally independent in the posterior distribution for ","element":"span"},{"style":{"height":15.2},"width":83.86,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-34.png","element":"img","alt":" i ̸= j","inline":true,"padRight":true},{"text":"given 
","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-35.png","element":"img","alt":"η","inline":true},{"text":". For the GLMM, the sparsity structure imposed on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"and hence ","element":"span"},{"style":{"height":10.8},"width":32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-36.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"is illustrated in ","element":"span"},{"href":"#id-56","text":"(10)","element":"a"},{"text":". Our Algorithm 2 can efficiently learn a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"with such a structure.","element":"span"}],[{"style":{"width":"64%"},"width":613,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-37.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"=","element":"span"}],[{"style":{"width":"88%"},"width":843,"height":235,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-38.png","element":"img"}],[{"style":{"height":10.8},"width":74.85,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-39.png","element":"img","alt":"Ω =","inline":true}],[{"style":{"width":"67%"},"width":643,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/8-40.png","element":"img"}],[{"text":"For the GLMM, using a full rank lower triangular matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"in Algorithm 1 requires updates of ","element":"span"},{"style":{"height":17.38},"width":126.18,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-0.png","element":"img","alt":" O(n2p2","inline":true},{"text":") elements at each iteration while Algorithm 2 only 
requires ","element":"span"},{"style":{"height":17.38},"width":108.3,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-1.png","element":"img","alt":"O(np2","inline":true},{"text":") updates (assuming ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"are small). Hence the efficiency of Algorithm 2 as compared to Algorithm 1 (unrestricted) increases rapidly with the number of subjects in the dataset as can be seen from Table ","element":"span"},{"href":"#id-50","text":"1. ","element":"a"},{"text":"There is only a slight computational overhead in using Algorithm 2 as compared to a diagonal matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"in Algorithm 1, which requires ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"np","element":"span"},{"text":") updates. However, Algorithm 2 reflects the posterior dependency structure and hence has the potential to provide better estimates. Next, we investigate the performance of Algorithm 2 on several data sets. 
We set ","element":"span"},{"style":{"height":20.3},"width":495.32,"height":50.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-2.png","element":"img","alt":" σ2β = σ2ζ = 100 throughout.","inline":true,"padRight":true},{"text":"The gradient of log ","element":"span"},{"style":{"height":16},"width":57.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-3.png","element":"img","alt":" h(θ","inline":true},{"text":") is derived in ","element":"span"},{"text":"Appendix A.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"5.1.1 Epilepsy data","element":"span"}],[{"text":"The epilepsy data of ","element":"span"},{"href":"#id-57","referenceIndex":43,"text":"Thall and Vail ","element":"a"},{"href":"#id-57","referenceIndex":43,"text":"(1990) ","element":"a"},{"text":"includes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 59 epileptics who were randomized to a new drug, progabide (Trt=1) or a placebo (Trt=0) in a clinical trial. The response is given by the number of seizures patients have during four follow-up periods. 
Other covariates include the logarithm of age (Age), the logarithm of ","element":"span"},{"style":{"height":19.37},"width":16,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-4.png","element":"img","alt":"1/4","inline":true,"padRight":true},{"text":"the number of baseline seizures (Base), Visit ","element":"span"},{"text":"(coded as Visit","element":"span"},{"style":{"height":13.19},"width":144.27,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-5.png","element":"img","alt":"1 = −0.","inline":true},{"text":"3, Visit","element":"span"},{"style":{"height":13.19},"width":144.27,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-6.png","element":"img","alt":"2 = −0.","inline":true},{"text":"1, Visit","element":"span"},{"style":{"height":13.19},"width":113.28,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-7.png","element":"img","alt":"3 = 0.","inline":true},{"text":"1 and Visit","element":"span"},{"style":{"height":13.19},"width":113.82,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-8.png","element":"img","alt":"4 = 0.","inline":true},{"text":"3), and a binary variable V4, which is 1 for the fourth visit and 0 otherwise. We center the covariate Age and replace Age","element":"span"},{"style":{"height":7.2},"width":11,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-9.png","element":"img","alt":"i","inline":true,"padRight":true},{"text":"by Age","element":"span"},{"style":{"height":8.3},"width":53.94,"height":20.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-10.png","element":"img","alt":"i −","inline":true,"padRight":true},{"text":"mean(Age). 
Consider the following two models from ","element":"span"},{"href":"#id-58","referenceIndex":5,"text":"Breslow and ","element":"a"},{"href":"#id-58","referenceIndex":5,"text":"Clayton ","element":"a"},{"href":"#id-58","referenceIndex":5,"text":"(1993)","element":"a"},{"text":". Model I is a Poisson random intercept model where","element":"span"}],[{"text":"log ","element":"span"},{"style":{"height":15.99},"width":281.7,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-11.png","element":"img","alt":" µij = β0 + βBase","inline":true},{"text":"Base","element":"span"},{"style":{"height":14.4},"width":129.16,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-12.png","element":"img","alt":"i + βTrt","inline":true},{"text":"Trt","element":"span"},{"style":{"height":15.99},"width":137.93,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-13.png","element":"img","alt":"i + βAge","inline":true},{"text":"Age","element":"span"},{"style":{"height":7.2},"width":11,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-14.png","element":"img","alt":"i","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":14.4},"width":157.04,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-15.png","element":"img","alt":" βBase×Trt","inline":true},{"text":"Base","element":"span"},{"style":{"height":11.2},"width":55.26,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-16.png","element":"img","alt":"i×,","inline":true,"padRight":true},{"text":"Trt","element":"span"},{"style":{"height":15.99},"width":280.6,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-17.png","element":"img","alt":"i + βV4V4ij + bi","inline":true}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 
1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., n","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., ","element":"span"},{"text":"4. Model II is a Poisson random intercept and slope model where","element":"span"}],[{"text":"log ","element":"span"},{"style":{"height":15.99},"width":281.7,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-18.png","element":"img","alt":" µij = β0 + βBase","inline":true},{"text":"Base","element":"span"},{"style":{"height":14.4},"width":129.16,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-19.png","element":"img","alt":"i + βTrt","inline":true},{"text":"Trt","element":"span"},{"style":{"height":15.99},"width":137.93,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-20.png","element":"img","alt":"i + βAge","inline":true},{"text":"Age","element":"span"},{"style":{"height":7.2},"width":11,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-21.png","element":"img","alt":"i","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":14.4},"width":157.04,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-22.png","element":"img","alt":" βBase×Trt","inline":true},{"text":"Base","element":"span"},{"style":{"height":10.39},"width":53.12,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-23.png","element":"img","alt":"i ×","inline":true,"padRight":true},{"text":"Trt","element":"span"},{"style":{"height":14.4},"width":150.69,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-24.png","element":"img","alt":"i + 
βVisit","inline":true},{"text":"Visit","element":"span"},{"style":{"height":9.6},"width":24.27,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-25.png","element":"img","alt":"ij","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":13.19},"width":139.34,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-26.png","element":"img","alt":" bi1 + bi2","inline":true},{"text":"Visit","element":"span"},{"style":{"height":9.6},"width":24.27,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-27.png","element":"img","alt":"ij","inline":true}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., n","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., ","element":"span"},{"text":"4.","element":"span"}],[{"text":"We apply ADVI and Algorithms 1 and 2 on these two models. Runtimes are given in Table ","element":"span"},{"href":"#id-50","text":"1 ","element":"a"},{"text":"and the estimated marginal posteriors of ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-28.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-29.png","element":"img","alt":" ζ","inline":true,"padRight":true},{"text":"are shown in Figure ","element":"span"},{"href":"#id-59","text":"2. ","element":"a"},{"text":"Algorithm 1 (mean-field) converged quickly for both models while the runtime of Algorithm 1 (unrestricted) doubled with the inclusion of a second random effect. 
For this dataset, Algorithm 2 performed better than the mean-field and unrestricted approximations. It produces very good approximations of the marginal posteriors of ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-30.png","element":"img","alt":" β","inline":true},{"text":", but is overconfident in estimating the marginal posteriors of ","element":"span"},{"style":{"height":14},"width":19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-31.png","element":"img","alt":" ζ","inline":true,"padRight":true},{"text":"in Model II. The variational posteriors from Algorithm 1 (mean-field) are accurate in the mean but the variance is underestimated, quite severely in some cases.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-60","text":"3 ","element":"a"},{"text":"shows the iterates of the mean parameter ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-32.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"corresponding to ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-33.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-34.png","element":"img","alt":" ζ","inline":true,"padRight":true},{"text":"and the averaged lower bound ( ","element":"span"},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":") for Model II. 
For Algorithm 1 (unrestricted), it appears that some of the parameters have yet to stabilize even though the lower bound has reached stationarity.","element":"span"}],[{"id":"id-38","style":{"fontStyle":"italic"},"text":"5.1.2 Toenail data","element":"span"}],[{"text":"This dataset compares two oral anti-fungal treatments for toenail infection ","element":"span"},{"href":"#id-61","referenceIndex":7,"text":"(De Backer et al., ","element":"a"},{"href":"#id-61","referenceIndex":7,"text":"1998) ","element":"a"},{"text":"and contains information for 294 patients who were evaluated at seven visits. Some patients did not attend all planned visits and there were a total of 1908 measurements. The patients were randomized into two treatment groups, one receiving 250 mg of terbinafine per day (Trt=1) and the other 200 mg of itraconazole per day (Trt=0). The time in months (","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":") at which they arrived for visits was recorded and the binary response variable (onycholysis) indicates the degree of separation of the nail plate from the nail-bed (0 if none or mild, 1 if moderate or severe). 
We consider the logistic random intercept model,","element":"span"}],[{"text":"logit(","element":"span"},{"style":{"height":16.79},"width":271.05,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-35.png","element":"img","alt":"µij) = β0 +βTrt","inline":true},{"text":"Trt","element":"span"},{"style":{"height":15.99},"width":282.3,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-36.png","element":"img","alt":"i +βttij +βTrt×t","inline":true},{"text":"Trt","element":"span"},{"style":{"height":14.79},"width":188.28,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-37.png","element":"img","alt":"i ×tij +ui,","inline":true}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., ","element":"span"},{"text":"294, 1 ","element":"span"},{"style":{"height":13.6},"width":102.83,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-38.png","element":"img","alt":" ≤ j ≤","inline":true,"padRight":true},{"text":"7.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-62","text":"4 ","element":"a"},{"text":"shows the variational posteriors estimated by ADVI and Algorithms 1 and 2. The estimates from Algorithm 2 are closer to those of MCMC than the unrestricted and mean-field approximations of ADVI and Algorithm 1. 
Table ","element":"span"},{"href":"#id-50","text":"1 ","element":"a"},{"text":"indicates that the runtime of Algorithm 1 (unrestricted) is about 1.5 times that of Algorithm 2 even though the number of iterations required is halved.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"5.1.3 Polypharmacy data","element":"span"}],[{"text":"The polypharmacy data set ","element":"span"},{"href":"#id-63","referenceIndex":15,"text":"(Hosmer et al., ","element":"a"},{"href":"#id-63","referenceIndex":15,"text":"2013) ","element":"a"},{"text":"is ","element":"span"},{"text":"available ","element":"span"},{"text":"at ","element":"span"},{"href":"http://www.umass.edu/statdata/statdata/stat-logistic.html","style":{"fontFamily":"monospace"},"text":"http://www.umass.edu/statdata/ ","element":"a"},{"href":"http://www.umass.edu/statdata/statdata/stat-logistic.html","style":{"fontFamily":"monospace"},"text":"statdata/stat-logistic.html ","element":"a"},{"text":"and it contains data on 500 subjects studied over seven years. The response is whether the subject is taking drugs from 3 or more different groups. 
We consider the covariates, Gender = 1 if male and 0 if female, Race = 0 if the subject is white and 1 otherwise, Age, and the following binary indicators for the number of outpatient mental ","element":"span"},{"style":{"height":14.8},"width":547.25,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-39.png","element":"img","alt":"health visits, MHV 1=1 if 1 ≤","inline":true,"padRight":true},{"text":"MHV ","element":"span"},{"style":{"height":14.4},"width":276.26,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-40.png","element":"img","alt":" ≤ 5, MHV 2=1","inline":true,"padRight":true},{"text":"if 6 ","element":"span"},{"style":{"height":12.8},"width":31,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-41.png","element":"img","alt":" ≤","inline":true,"padRight":true},{"text":"MHV ","element":"span"},{"style":{"height":14},"width":552.31,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/9-42.png","element":"img","alt":" ≤ 14 and MHV 3=1 if MHV ≥","inline":true,"padRight":true},{"text":"15. Let INPTMHV = 0 if there were no inpatient mental health","element":"span"}],[{"id":"id-59","style":{"width":"97%"},"width":1925,"height":1935,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/10-0.png","element":"img"}],[{"text":"Fig. 
2: Epilepsy data: posterior distributions of ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/10-1.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/10-2.png","element":"img","alt":" ζ","inline":true,"padRight":true},{"text":"estimated using ADVI (dotted, blue for mean-field and green for unrestricted), Algorithm 1 (blue for mean-field and green for unrestricted), Algorithm 2 (red) and MCMC (black) for Model I (first two rows) and Model II (last two rows).","element":"figcaption","subtype":"caption"}],[{"text":"visits and 1 otherwise. We consider a logistic random intercept model (see ","element":"span"},{"href":"#id-63","referenceIndex":15,"text":"Hosmer et al., ","element":"a"},{"href":"#id-63","referenceIndex":15,"text":"2013) ","element":"a"},{"text":"of the form","element":"span"}],[{"style":{"width":"94%"},"width":899,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/10-3.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , ","element":"span"},{"text":"500, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , ","element":"span"},{"text":"7.","element":"span"}],[{"text":"We apply ADVI and Algorithms 1 and 2 to this model. From Table ","element":"span"},{"href":"#id-50","text":"1, ","element":"a"},{"text":"the increase in runtime of Algorithm 1 (unrestricted) due to the larger number of subjects as compared to the toenail dataset is evident. 
The runtime of Algorithm 1 (unrestricted) is about 4.7 times that of Algorithm 2 while the runtime of Algorithm 2 is only slightly longer than that of Algorithm 1 (mean-field). From Figure ","element":"span"},{"href":"#id-64","text":"5, ","element":"a"},{"text":"Algorithm 2 provided a","element":"span"}],[{"id":"id-60","style":{"width":"98%"},"width":1937,"height":814,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-0.png","element":"img"}],[{"text":"Fig. 3: Epilepsy data: Mean (","element":"figcaption","subtype":"caption"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-1.png","element":"img","alt":"µ","inline":true},{"text":") iterates corresponding to ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-2.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-3.png","element":"img","alt":" ζ","inline":true,"padRight":true},{"text":"and the averaged lower bound ( ","element":"figcaption","subtype":"caption"},{"text":"¯","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"L","element":"figcaption","subtype":"caption"},{"text":") from Algorithm 1 with unrestricted lower triangular matrix ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"L ","element":"figcaption","subtype":"caption"},{"text":"(green) and Algorithm 2 (red) for Model II.","element":"figcaption","subtype":"caption"}],[{"id":"id-62","style":{"width":"98%"},"width":1941,"height":401,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-4.png","element":"img"}],[{"text":"Fig. 
4: Toenail data: posterior distributions of ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-5.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-6.png","element":"img","alt":" ζ","inline":true,"padRight":true},{"text":"estimated using ADVI (dotted, blue for mean-field and green for unrestricted), Algorithm 1 (blue for mean-field and green for unrestricted), Algorithm 2 (red) and MCMC (black).","element":"figcaption","subtype":"caption"}],[{"text":"very good approximation of the marginal posteriors of ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-7.png","element":"img","alt":"β","inline":true,"padRight":true},{"text":"but there is some underestimation of the mean and standard deviation of ","element":"span"},{"style":{"height":14},"width":33.44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-8.png","element":"img","alt":" ζ1","inline":true},{"text":".","element":"span"}],[{"text":"5.2 State space models","element":"span"}],[{"text":"Here we consider the stochastic volatility model widely used in modeling financial time series (see, e.g. ","element":"span"},{"href":"#id-65","referenceIndex":12,"text":"Harvey ","element":"a"},{"href":"#id-65","referenceIndex":12,"text":"et al., ","element":"a"},{"href":"#id-65","referenceIndex":12,"text":"1994; ","element":"a"},{"href":"#id-66","referenceIndex":18,"text":"Kim et al., ","element":"a"},{"href":"#id-66","referenceIndex":18,"text":"1998)","element":"a"},{"text":", which is an example of a non-linear state space model. 
The observations ","element":"span"},{"style":{"height":10},"width":31.54,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-9.png","element":"img","alt":" yt","inline":true,"padRight":true},{"text":"are generated from a zero-mean Gaussian distribution with a variance stochastically evolving over time. The unobserved log-volatility ","element":"span"},{"style":{"height":13.19},"width":29.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-10.png","element":"img","alt":" bt","inline":true,"padRight":true},{"text":"is modeled as an AR(1) process with Gaussian disturbances. Let","element":"span"}],[{"style":{"width":"94%"},"width":904,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"style":{"height":9.99},"width":96.45,"height":24.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-12.png","element":"img","alt":"t+1 ∼","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"φb","element":"span"},{"style":{"height":7.2},"width":12,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-13.png","element":"img","alt":"t","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, n,","element":"span"}],[{"text":"where ","element":"span"},{"style":{"height":14},"width":222.56,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-14.png","element":"img","alt":" λ ∈ R, σ >","inline":true,"padRight":true},{"text":"0 and 0 ","element":"span"},{"style":{"height":14},"width":125.24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-15.png","element":"img","alt":" < φ <","inline":true,"padRight":true},{"text":"1. In ","element":"span"},{"href":"#id-67","text":"(11)","element":"a"},{"text":", ","element":"span"},{"style":{"height":10},"width":31.53,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-16.png","element":"img","alt":" yt","inline":true,"padRight":true},{"text":"is the mean-corrected return at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":29.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-17.png","element":"img","alt":" bt","inline":true,"padRight":true},{"text":"is assumed to follow a stationary process with ","element":"span"},{"style":{"height":13.19},"width":33.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-18.png","element":"img","alt":" b1","inline":true,"padRight":true},{"text":"drawn from the stationary distribution. 
We transform the constrained parameters to the real space by letting ","element":"span"},{"style":{"height":16},"width":179.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-19.png","element":"img","alt":" σ = exp(α","inline":true},{"text":") and ","element":"span"},{"style":{"height":24.43},"width":224.44,"height":61.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-20.png","element":"img","alt":"φ = exp(ψ)exp(ψ)+1","inline":true},{"text":", where ","element":"span"},{"style":{"height":14},"width":155.26,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-21.png","element":"img","alt":" α, ψ ∈ R","inline":true},{"text":". Assume normal priors, ","element":"span"},{"style":{"height":17.9},"width":492.75,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-22.png","element":"img","alt":"α ∼ N(0, σ2α), λ ∼ N(0, σ2λ","inline":true},{"text":") and ","element":"span"},{"style":{"height":20.3},"width":227.49,"height":50.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-23.png","element":"img","alt":" ψ ∼ N(0, σ2ψ","inline":true},{"text":"). The ","element":"span"},{"text":"set of variables is given by ","element":"span"},{"style":{"height":17.39},"width":312.7,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-24.png","element":"img","alt":" θ = (b1, . . . 
, bn, ηT","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":7.6},"width":23,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-25.png","element":"img","alt":"T","inline":true,"padRight":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":200.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-26.png","element":"img","alt":"η = (α, λ, ψ","inline":true},{"text":"), and the joint distribution is given by","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y, θ","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"θ","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"α","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"λ","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"ψ","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"style":{"height":16},"width":28.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-27.png","element":"img","alt":"1|","inline":true},{"style":{"fontStyle":"italic"},"text":"ψ","element":"span"},{"text":")","element":"span"}],[{"id":"id-68","style":{"width":"81%"},"width":778,"height":177,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.
05622/images/11-28.png","element":"img"}],[{"id":"id-67","text":"From ","element":"span"},{"href":"#id-68","text":"(12)","element":"a"},{"text":", ","element":"span"},{"style":{"height":13.19},"width":29.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-29.png","element":"img","alt":" bt","inline":true,"padRight":true},{"text":"is conditionally independent of all other states in the posterior distribution given ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/11-30.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"and the neighboring states. Hence, we can take advantage of","element":"span"}],[{"id":"id-64","style":{"width":"98%"},"width":1935,"height":778,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-0.png","element":"img"}],[{"text":"Fig. 5: Polypharmacy data: posterior distributions of ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-1.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-2.png","element":"img","alt":" ζ","inline":true,"padRight":true},{"text":"estimated using ADVI (dotted, blue for mean-field and green for unrestricted), Algorithm 1 (blue for mean-field and green for unrestricted), Algorithm 2 (red) and MCMC (black).","element":"figcaption","subtype":"caption"}],[{"text":"this conditional independence in the variational approximation ","element":"span"},{"style":{"height":16},"width":283.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-3.png","element":"img","alt":" q(θ) = N(µ, Ω","inline":true},{"text":"). 
By setting ","element":"span"},{"style":{"height":15.59},"width":169.88,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-4.png","element":"img","alt":" Tij = 0,","inline":true,"padRight":true},{"text":"1 ","element":"span"},{"style":{"height":16.58},"width":427.77,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-5.png","element":"img","alt":" ≤ j < i − 1 < n, TT T","inline":true,"padRight":true},{"text":"has the sparsity we desire for ","element":"span"},{"style":{"height":10.8},"width":32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-6.png","element":"img","alt":" Ω","inline":true},{"text":". This sparse structure is illustrated in ","element":"span"},{"href":"#id-69","text":"(13)","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"86%"},"width":824,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"=","element":"span"}],[{"style":{"width":"90%"},"width":858,"height":184,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-8.png","element":"img"}],[{"text":"For the SSM, the number of parameters to update in each iteration of Algorithm 1 (unrestricted) is ","element":"span"},{"style":{"height":17.39},"width":88.25,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-9.png","element":"img","alt":" O(n2","inline":true},{"text":") while Algorithm 2 only requires ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") updates (similar to Algorithm 1 mean-field). 
This is an important factor to consider in SSMs, as the number of observations in a time series over a long period may be large.","element":"span"}],[{"text":"Next, we illustrate Algorithm 2 using two sets of exchange rate data which are available from the dataset “Garch” in the R package ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"Ecdat","element":"span"},{"text":". We compute the mean-corrected response ","element":"span"},{"style":{"height":16},"width":73.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-10.png","element":"img","alt":" {yt}","inline":true,"padRight":true},{"text":"from the exchange rates ","element":"span"},{"style":{"height":16},"width":71.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-11.png","element":"img","alt":"{rt}","inline":true,"padRight":true},{"text":"as","element":"span"}],[{"style":{"width":"63%"},"width":603,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-12.png","element":"img"}],[{"style":{"height":14},"width":186.33,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-13.png","element":"img","alt":"yt = 100 ×","inline":true}],[{"style":{"width":"63%"},"width":603,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-14.png","element":"img"}],[{"text":"The gradient of log ","element":"span"},{"style":{"height":16},"width":57.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-15.png","element":"img","alt":" h(θ","inline":true},{"text":") is derived in ","element":"span"},{"text":"Appendix B.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"5.2.1 GBP/USD exchange rate data","element":"span"}],[{"text":"Here we consider daily observations of the weekday exchange rates of the U.S. Dollar against the British Pound from 1st October 1981 to 28th June 1985. 
This dataset has been considered by ","element":"span"},{"href":"#id-65","referenceIndex":12,"text":"Harvey et al. ","element":"a"},{"href":"#id-65","referenceIndex":12,"text":"(1994)","element":"a"},{"text":", ","element":"span"},{"href":"#id-66","referenceIndex":18,"text":"Kim et al. ","element":"a"},{"href":"#id-66","referenceIndex":18,"text":"(1998) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-70","referenceIndex":9,"text":"Durbin and Koopman ","element":"a"},{"href":"#id-70","referenceIndex":9,"text":"(2012)","element":"a"},{"text":". ","element":"span"},{"id":"id-69","text":"The number of responses is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 945. Applying ADVI and Algorithms 1 and 2 to this dataset, the resulting variational posteriors are shown in Figure ","element":"span"},{"href":"#id-71","text":"6. ","element":"a"},{"text":"We note that Algorithm 1 (unrestricted) diverges as the averaged lower bound ","element":"span"},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is deteriorating and tending towards ","element":"span"},{"style":{"height":7.2},"width":71,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-16.png","element":"img","alt":" −∞","inline":true},{"text":". ADVI (unrestricted) also fails to converge. 
The mean-field approximations of ADVI and Algorithm 1 have difficulty in capturing the means of ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-17.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":26,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-18.png","element":"img","alt":" ψ","inline":true,"padRight":true},{"text":"and only manage to capture the mean of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-19.png","element":"img","alt":" λ","inline":true},{"text":". Algorithm 2 was able to capture the means with reasonable accuracy but there is underestimation in the variance of ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-20.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":26,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-21.png","element":"img","alt":" ψ","inline":true},{"text":". Figure ","element":"span"},{"href":"#id-72","text":"7 ","element":"a"},{"text":"shows the mean (solid lines) and 1 standard deviation intervals (dotted lines) of the log volatility ","element":"span"},{"style":{"height":13.19},"width":29.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/12-22.png","element":"img","alt":"bt","inline":true,"padRight":true},{"text":"at each time point estimated using Algorithm 2 and MCMC. 
Algorithm 2 was able to capture the means very accurately but there is some underestimation of the standard deviation.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"5.2.2 DEM/USD exchange rate data","element":"span"}],[{"text":"Next, we consider the entire series of weekday exchange rates of the U.S. Dollar against the German Deutschemark from 2nd January 1980 to 21st June 1987 available in “Garch”. This is a much larger dataset with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 1866 responses. We apply ADVI and Algorithms 1 and 2 to this dataset. The unrestricted approximations of ADVI and Algorithm 1 again fail to converge. The approximations of Algorithm 2 improved relative to the previous dataset and the underestimation of the standard","element":"span"}],[{"id":"id-71","style":{"width":"67%"},"width":1338,"height":410,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/13-0.png","element":"img"}],[{"text":"Fig. 6: GBP/USD exchange rate data: posterior distributions of ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":151.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/13-1.png","element":"img","alt":" {α, λ, ψ}","inline":true,"padRight":true},{"text":"estimated using ADVI (dotted, blue for ","element":"figcaption","subtype":"caption"},{"id":"id-72","text":"mean-field), Algorithm 1 (blue for mean-field), Algorithm 2 (red) and MCMC (black).","element":"figcaption","subtype":"caption"}],[{"style":{"width":"72%"},"width":1432,"height":605,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/13-2.png","element":"img"}],[{"text":"Fig. 
7: GBP/USD exchange rate data: Mean (solid line) and 1 standard deviation intervals (dotted lines) of log volatility ","element":"figcaption","subtype":"caption"},{"style":{"height":13.19},"width":29.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/13-3.png","element":"img","alt":" bt","inline":true,"padRight":true},{"text":"estimated using Algorithm 2 (red) and MCMC (black).","element":"figcaption","subtype":"caption"}],[{"text":"deviation was less severe. As in the previous case, the mean field approximations of ADVI and Algorithm 1 had difficulty in capturing the means of ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/13-4.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":26,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/13-5.png","element":"img","alt":" ψ","inline":true},{"text":". Figure ","element":"span"},{"href":"#id-73","text":"9 ","element":"a"},{"text":"shows the mean and 1 standard deviation intervals of the log volatility ","element":"span"},{"style":{"height":13.19},"width":29.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/13-6.png","element":"img","alt":" bt","inline":true},{"text":". For this dataset, Algorithm 2 captured both the mean and standard deviation of the log volatility accurately.","element":"span"}]]},{"heading":"6 Conclusion","paragraphs":[[{"text":"$38","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Acknowledgements ","element":"span"},{"text":"Linda Tan was supported by the National University of Singapore Overseas Postdoctoral Fellowship. David Nott’s research was supported by a Singapore Ministry of Education Academic Research Fund Tier 2 grant (R-155-000-143-112). 
We thank the reviewers and the editors for their time and helpful comments which have improved the manuscript.","element":"span"}],[{"style":{"width":"67%"},"width":1338,"height":410,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/14-0.png","element":"img"}],[{"text":"Fig. 8: DEM/USD exchange rate data: posterior distributions of ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":151.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/14-1.png","element":"img","alt":" {α, λ, ψ}","inline":true,"padRight":true},{"text":"estimated using ADVI (dotted, blue for mean-field), Algorithm 1 (blue for mean-field), Algorithm 2 (red) and MCMC (black).","element":"figcaption","subtype":"caption"}],[{"id":"id-73","style":{"width":"92%"},"width":1831,"height":622,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/14-2.png","element":"img"}],[{"text":"Fig. 9: DEM/USD exchange rate data: Mean (solid line) and 1 standard deviation intervals (dotted lines) of log volatility ","element":"figcaption","subtype":"caption"},{"style":{"height":13.19},"width":29.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/14-3.png","element":"img","alt":" bt","inline":true,"padRight":true},{"text":"estimated using Algorithm 2 (red) and MCMC (black).","element":"figcaption","subtype":"caption"}]]},{"heading":"References","paragraphs":[[{"id":"id-18","text":"Archer, E., I. M. Park, L. Buesing, J. Cunningham, and ","element":"span"},{"text":"L. Paninski (2016). Black box variational inference for state space models. arXiv:1511.07367.","element":"span"}],[{"id":"id-1","text":"Attias, H. (1999). Inferring parameters and structure ","element":"span"},{"text":"of latent variable models by variational Bayes. ","element":"span"},{"text":"In K. Laskey and H. 
Prade (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence","element":"span"},{"text":", San Francisco, CA, pp. 21–30. Morgan Kaufmann.","element":"span"}],[{"id":"id-11","text":"Bishop, C. M. (2006). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Pattern recognition and machine learning","element":"span"},{"text":". New York: Springer.","element":"span"}],[{"id":"id-49","text":"Booth, J. G. and J. P. Hobert (1999). ","element":"span"},{"text":"Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61","element":"span"},{"text":", 265–285.","element":"span"}],[{"id":"id-58","text":"Breslow, N. E. and D. G. Clayton (1993). Approximate ","element":"span"},{"text":"inference in generalized linear mixed models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of the American Statistical Association 88","element":"span"},{"text":", 9–25.","element":"span"}],[{"id":"id-22","text":"Challis, E. and D. Barber (2013). Gaussian Kullback-","element":"span"},{"text":"Leibler approximate inference. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Learning Research 14","element":"span"},{"text":", 2239–2286.","element":"span"}],[{"id":"id-61","text":"De Backer, M., C. De Vroey, E. Lesaffre, I. Scheys, ","element":"span"},{"text":"and P. D. Keyser (1998). Twelve weeks of continuous oral therapy for toenail onychomycosis caused by dermatophytes: A double-blind comparative trial of terbinafine 250 mg/day versus itraconazole 200 mg/day. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of the American Academy of Dermatology 38","element":"span"},{"text":", 57–63.","element":"span"}],[{"id":"id-43","text":"Duchi, J., E. Hazan, and Y. Singer (2011). Adaptive ","element":"span"},{"text":"subgradient methods for online learning and stochastic optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research 12","element":"span"},{"text":", 2121–2159.","element":"span"}],[{"id":"id-70","text":"Durbin, J. and S. J. Koopman (2012). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Time series analysis by state space methods ","element":"span"},{"text":"(2 ed.). United Kingdom: Oxford University Press.","element":"span"}],[{"id":"id-13","text":"Gershman, S., M. Hoffman, and D. Blei (2012). ","element":"span"},{"text":"Nonparametric variational inference. In J. Langford and J. Pineau (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 29th International Conference on Machine Learning","element":"span"},{"text":", pp. 663–670.","element":"span"}],[{"id":"id-20","text":"Han, S., X. Liao, D. B. Dunson, and L. C. Carin (2016). ","element":"span"},{"text":"Variational Gaussian copula inference. In A. Gretton and C. C. Robert (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 19th International Conference on Artificial Intelligence and","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Statistics","element":"span"},{"text":", Volume 51, pp. 829–838. JMLR Workshop and Conference Proceedings.","element":"span"}],[{"id":"id-65","text":"Harvey, A., E. Ruiz, and N. Shephard (1994). ","element":"span"},{"text":"Multivariate stochastic variance models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Review of Economic Studies 61","element":"span"},{"text":", 247–264.","element":"span"}],[{"id":"id-17","text":"Hoffman, M. and D. Blei (2015). 
Stochastic structured ","element":"span"},{"text":"variational inference. ","element":"span"},{"text":"In G. Lebanon and S. Vishwanathan (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 18th International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", Volume 38, pp. 361–369. JMLR Workshop and Conference Proceedings.","element":"span"}],[{"id":"id-27","text":"Hoffman, M. D., D. M. Blei, C. Wang, and J. Paisley ","element":"span"},{"text":"(2013). Stochastic variational inference. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research 14","element":"span"},{"text":", 1303–1347.","element":"span"}],[{"id":"id-63","text":"Hosmer, D. W., S. Lemeshow, and R. X. Sturdivant ","element":"span"},{"text":"(2013). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Applied Logistic Regression ","element":"span"},{"text":"(3 ed.). Hoboken, NJ: John Wiley & Sons, Inc.","element":"span"}],[{"id":"id-25","text":"Ji, C., H. Shen, and M. West (2010). Bounded ","element":"span"},{"text":"approximations for marginal likelihoods. Technical Report 10-05, Institute of Decision Sciences, Duke University.","element":"span"}],[{"id":"id-0","text":"Jordan, M. I., Z. Ghahramani, T. S. Jaakkola, and L. K. ","element":"span"},{"text":"Saul (1999). An introduction to variational methods for graphical models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning 37","element":"span"},{"text":", 183–233.","element":"span"}],[{"id":"id-66","text":"Kim, S., N. Shephard, and S. Chib (1998). Stochastic ","element":"span"},{"text":"volatility: likelihood inference and comparison with ARCH models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Review of Economic Studies 65","element":"span"},{"text":", 361–393.","element":"span"}],[{"id":"id-8","text":"Kingma, D. P. and M. Welling (2014). 
Auto-encoding ","element":"span"},{"text":"variational Bayes. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2nd International Conference on Learning Representations (ICLR)","element":"span"},{"text":".","element":"span"}],[{"id":"id-12","text":"Kucukelbir, A., D. Tran, R. Ranganath, A. Gelman, ","element":"span"},{"text":"and D. M. Blei (2016). ","element":"span"},{"text":"Automatic differentiation variational inference. arXiv:1603.00788.","element":"span"}],[{"id":"id-54","text":"Lee, C. Y. Y. and M. P. Wand (2016a). Streamlined ","element":"span"},{"text":"mean field variational Bayes for longitudinal and multilevel data analysis. ","element":"span"},{"href":"https://works.bepress.com/matt_wand/13/","style":{"fontFamily":"monospace"},"text":"https://works.bepress.com/","element":"a"},{"href":"https://works.bepress.com/matt_wand/13/","style":{"fontFamily":"monospace"},"text":"matt_wand/13/","element":"a"},{"text":".","element":"span"}],[{"id":"id-55","text":"Lee, C. Y. Y. and M. P. Wand (2016b). Variational ","element":"span"},{"text":"methods for fitting complex Bayesian mixed effects models to health data. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Statistics in Medicine 35","element":"span"},{"text":", 165–188.","element":"span"}],[{"id":"id-26","text":"Nott, D. J., S. L. Tan, M. Villani, and R. Kohn (2012). ","element":"span"},{"text":"Regression density estimation with variational methods and stochastic approximation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Computational and Graphical Statistics 21","element":"span"},{"text":", 797–820.","element":"span"}],[{"id":"id-21","text":"Opper, M. and C. Archambeau (2009). The variational ","element":"span"},{"text":"Gaussian approximation revisited. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Computation 21","element":"span"},{"text":", 786–792.","element":"span"}],[{"id":"id-24","text":"Ormerod, J. T. and M. P. Wand (2010). Explaining ","element":"span"},{"text":"variational approximations. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The American Statistician 64","element":"span"},{"text":", 140–153.","element":"span"}],[{"id":"id-51","text":"Ormerod, J. T. and M. P. Wand (2012). Gaussian vari- ","element":"span"},{"text":"ational approximate inference for generalized linear mixed models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Computational and Graphical Statistics 21","element":"span"},{"text":", 2–17.","element":"span"}],[{"id":"id-4","text":"Paisley, J. W., D. M. Blei, and M. I. Jordan (2012). ","element":"span"},{"text":"Variational Bayesian inference with stochastic search. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 29th International Conference on Machine Learning (ICML-12)","element":"span"},{"text":".","element":"span"}],[{"id":"id-41","text":"Powell, W. B. (2011). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Approximate dynamic programming: solving the curses of dimensionality","element":"span"},{"text":". Hoboken, NJ: John Wiley & Sons, Inc.","element":"span"}],[{"id":"id-19","text":"Ranganath, R., S. Gerrish, and D. M. Blei (2014). Black ","element":"span"},{"text":"box variational inference. In S. Kaski and J. Corander (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 17th International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", Volume 33, pp. 814–822. JMLR Workshop and Conference Proceedings.","element":"span"}],[{"id":"id-16","text":"Ranganath, R., L. Tang, L. Charlin, and D. Blei (2015). ","element":"span"},{"text":"Deep exponential families. ","element":"span"},{"text":"In G. Lebanon and S. 
Vishwanathan (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 18th International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", Volume 38, pp. 762–771. JMLR Workshop and Conference Proceedings.","element":"span"}],[{"id":"id-15","text":"Ranganath, R., D. Tran, and D. M. Blei (2016). Hierar- ","element":"span"},{"text":"chical variational models. In M. F. Balcan and K. Q. Weinberger (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of The 33rd International Conference on Machine Learning","element":"span"},{"text":", Volume 37, pp. 324–333. JMLR Workshop and Conference Proceedings.","element":"span"}],[{"id":"id-44","text":"Ranganath, R., C. Wang, D. M. Blei, and E. P. Xing ","element":"span"},{"text":"(2013). An adaptive learning rate for stochastic variational inference. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 30th International Conference on Machine Learning","element":"span"},{"text":", pp. 298–306.","element":"span"}],[{"id":"id-14","text":"Rezende, D. J. and S. Mohamed (2015). ","element":"span"},{"text":"Variational inference with normalizing flows. ","element":"span"},{"text":"In F. Bach and D. Blei (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of The 32nd International Conference on Machine Learning","element":"span"},{"text":", pp. 1530– 1538. JMLR Workshop and Conference Proceedings.","element":"span"}],[{"id":"id-9","text":"Rezende, D. J., S. Mohamed, and D. Wierstra (2014). ","element":"span"},{"text":"Stochastic backpropagation and approximate inference in deep generative models. In E. P. Xing and T. Jebara (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of The 31st International Conference on Machine Learning","element":"span"},{"text":", pp. 1278– 1286. 
JMLR Workshop and Conference Proceedings.","element":"span"}],[{"id":"id-3","text":"Robbins, H. and S. Monro (1951). ","element":"span"},{"text":"A stochastic approximation method. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Mathematical Statistics 22","element":"span"},{"text":", 400–407.","element":"span"}],[{"id":"id-6","text":"Rohde, D. and M. P. Wand (2015). Mean field varia- ","element":"span"},{"text":"tional Bayes: general principles and numerical issues. ","element":"span"},{"href":"https://works.bepress.com/matt_wand/15/","style":{"fontFamily":"monospace"},"text":"https://works.bepress.com/matt_wand/15/","element":"a"},{"text":".","element":"span"}],[{"id":"id-32","text":"Rothman, A. J., E. Levina, and J. Zhu (2010). A new ","element":"span"},{"text":"approach to Cholesky-based covariance regularization in high dimensions. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Biometrika 97","element":"span"},{"text":"(3), 539–550.","element":"span"}],[{"id":"id-23","text":"Rue, H., S. Martino, and N. Chopin (2009). Approx- ","element":"span"},{"text":"imate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71","element":"span"},{"text":", 319–392.","element":"span"}],[{"id":"id-5","text":"Salimans, T. and D. A. Knowles (2013). ","element":"span"},{"text":"Fixed-form variational posterior approximation through stochastic linear regression. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bayesian Analysis 8","element":"span"},{"text":", 837–882.","element":"span"}],[{"id":"id-42","text":"Spall, J. C. (2003). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Introduction to stochastic search and optimization: estimation, simulation and control","element":"span"},{"text":". 
New Jersey: Wiley.","element":"span"}],[{"id":"id-52","text":"Tan, L. S. L. and D. J. Nott (2013). Variational ","element":"span"},{"text":"inference for generalized linear mixed models using partially non-centered parametrizations. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Statistical Science 28","element":"span"},{"text":", 168–188.","element":"span"}],[{"id":"id-53","text":"Tan, L. S. L. and D. J. Nott (2014). A stochastic ","element":"span"},{"text":"variational framework for fitting and diagnosing generalized linear mixed models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bayesian Analysis 9","element":"span"},{"text":", 963–1004.","element":"span"}],[{"id":"id-57","text":"Thall, P. F. and S. C. Vail (1990). ","element":"span"},{"text":"Some covariance models for longitudinal count data with overdispersion. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Biometrics 46","element":"span"},{"text":", 657–671.","element":"span"}],[{"id":"id-7","text":"Titsias, M. and M. Lázaro-Gredilla (2014). ","element":"span"},{"text":"Doubly stochastic variational Bayes for non-conjugate inference. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 31st International Conference on Machine Learning (ICML-14)","element":"span"},{"text":", pp. 1971–1979.","element":"span"}],[{"id":"id-28","text":"Titsias, M. and M. Lázaro-Gredilla (2015). Local ","element":"span"},{"text":"expectation gradients for black box variational inference. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 28 (NIPS 2015)","element":"span"},{"text":".","element":"span"}],[{"id":"id-10","text":"Wang, B. and D. M. Titterington (2005). Inadequacy ","element":"span"},{"text":"of interval estimates corresponding to variational Bayesian approximations. In R. G. Cowell and G. 
Z (Eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics","element":"span"},{"text":", pp. 373– 380. Society for Artificial Intelligence and Statistics.","element":"span"}],[{"id":"id-2","text":"Winn, J. and C. M. Bishop (2005). Variational message ","element":"span"},{"text":"passing. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research 6","element":"span"},{"text":", 661–694.","element":"span"}],[{"id":"id-45","text":"Zeiler, M. D. (2012). Adadelta: An adaptive learning ","element":"span"},{"text":"rate method. arXiv: 1212.5701.","element":"span"}]]},{"heading":"Appendix A Gradients for generalized linear mixed models","paragraphs":[[{"text":"For the GLMM described in Section ","element":"span"},{"href":"#id-74","text":"5.1, ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"45%"},"width":436,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-0.png","element":"img"}],[{"text":"log ","element":"span"},{"style":{"height":16},"width":115.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-1.png","element":"img","alt":" h(θ) =","inline":true}],[{"style":{"width":"81%"},"width":775,"height":466,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"is a constant not dependent on ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-3.png","element":"img","alt":" θ","inline":true},{"text":". 
For the ","element":"span"},{"text":"logistic ","element":"span"},{"text":"GLMM, ","element":"span"},{"style":{"height":16},"width":174.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-4.png","element":"img","alt":"h1(x) =","inline":true,"padRight":true},{"text":"log","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1 + exp(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"while ","element":"span"},{"text":"for ","element":"span"},{"text":"the ","element":"span"},{"text":"Poisson ","element":"span"},{"text":"GLMM, ","element":"span"},{"style":{"height":16},"width":207.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-5.png","element":"img","alt":"h1(x) =","inline":true,"padRight":true},{"text":"exp(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"). ","element":"span"},{"text":"The ","element":"span"},{"text":"gradient ","element":"span"},{"text":"of ","element":"span"},{"text":"log ","element":"span"},{"style":{"height":16},"width":57.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-6.png","element":"img","alt":" h(θ","inline":true},{"text":") ","element":"span"},{"text":"is ","element":"span"},{"text":"given ","element":"span"},{"text":"by [","element":"span"},{"style":{"height":14.79},"width":61.22,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-7.png","element":"img","alt":"∇b1","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":227.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-8.png","element":"img","alt":" h(θ), . . . 
, ∇bn","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16.79},"width":142.69,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-9.png","element":"img","alt":" h(θ), ∇β","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16.79},"width":139.69,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-10.png","element":"img","alt":" h(θ), ∇ζ","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":57.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-11.png","element":"img","alt":" h(θ","inline":true},{"text":")], where","element":"span"}],[{"id":"id-75","style":{"width":"5%"},"width":49,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-12.png","element":"img"}],[{"style":{"height":14.78},"width":58.22,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-13.png","element":"img","alt":"∇bi","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":115.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-14.png","element":"img","alt":" h(θ) =","inline":true}],[{"style":{"width":"98%"},"width":934,"height":515,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-15.png","element":"img"}],[{"text":"In ","element":"span"},{"href":"#id-75","text":"(14)","element":"a"},{"text":", ","element":"span"},{"style":{"height":18.17},"width":529.45,"height":45.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-16.png","element":"img","alt":" A = Σni=1 W −T W −1bibTi W −T","inline":true,"padRight":true},{"text":"with all entries ","element":"span"},{"text":"above the diagonal set to zero and 
1","element":"span"},{"style":{"height":11.2},"width":117.11,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-17.png","element":"img","alt":"diag(W )","inline":true,"padRight":true},{"text":"and 1","element":"span"},{"style":{"height":10},"width":15,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-18.png","element":"img","alt":"ζ","inline":true,"padRight":true},{"text":"are vectors of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"+ 1)","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2. We define the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th element of 1","element":"span"},{"style":{"height":11.2},"width":117.11,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-19.png","element":"img","alt":"diag(W )","inline":true,"padRight":true},{"text":"as 1 if the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th element of ","element":"span"},{"style":{"height":14},"width":19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-20.png","element":"img","alt":" ζ","inline":true,"padRight":true},{"text":"corresponds to a diagonal element of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"and 0 otherwise. 
On the other hand, the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th element of 1","element":"span"},{"style":{"height":10},"width":15,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-21.png","element":"img","alt":"ζ","inline":true,"padRight":true},{"text":"is exp(","element":"span"},{"style":{"height":14},"width":28.44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-22.png","element":"img","alt":"ζi","inline":true},{"text":") if ","element":"span"},{"style":{"height":14},"width":28.44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-23.png","element":"img","alt":" ζi","inline":true,"padRight":true},{"text":"corresponds to a diagonal element of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"and 1 otherwise. For the logistic GLMM, ","element":"span"},{"style":{"height":16},"width":171.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-24.png","element":"img","alt":" h′1(x) = {","inline":true},{"text":"1 + exp(","element":"span"},{"style":{"height":17.38},"width":130.1,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-25.png","element":"img","alt":"−x)}−1","inline":true,"padRight":true},{"text":"and for the ","element":"span"},{"text":"Poisson GLMM, ","element":"span"},{"style":{"height":16},"width":257.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/16-26.png","element":"img","alt":" h′1(x) = exp(x","inline":true},{"text":"). 
More details on the ","element":"span"},{"text":"derivation of the gradients are given below.","element":"span"}],[{"text":"As log ","element":"span"},{"style":{"height":17.6},"width":360.64,"height":44.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-0.png","element":"img","alt":" |W| = ∑_{i=1}^p W*_{ii}, ∇ζ","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":17.68},"width":259.42,"height":44.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-1.png","element":"img","alt":" |W| = 1diag(W )","inline":true},{"text":". For ","element":"span"},{"text":"the term ","element":"span"},{"style":{"height":19.37},"width":414.86,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-2.png","element":"img","alt":" −(1/2) ∑_{i=1}^n b_i^T W^{-T} W^{-1} b_i","inline":true},{"text":",","element":"span"}],[{"style":{"width":"83%"},"width":796,"height":777,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-3.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.59},"width":50.84,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-4.png","element":"img","alt":" Kp","inline":true,"padRight":true},{"text":"denotes the ","element":"span"},{"style":{"height":16.59},"width":115.29,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-5.png","element":"img","alt":" p^2 × p^2","inline":true,"padRight":true},{"text":"commutation matrix. Let ","element":"span"},{"style":{"height":18.17},"width":544.22,"height":45.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-6.png","element":"img","alt":"A = ∑_{i=1}^n W^{-T} W^{-1} b_i b_i^T W^{-T}","inline":true,"padRight":true},{"text":"with all entries above ","element":"span"},{"text":"the diagonal set to zero. 
As ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"is a lower triangular matrix,","element":"span"}],[{"style":{"width":"76%"},"width":729,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-7.png","element":"img"}],[{"text":"Moreover,","element":"span"}],[{"text":"dvec(","element":"span"},{"style":{"height":16.79},"width":161.8,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-8.png","element":"img","alt":"W) = Dp","inline":true},{"text":"dvech(","element":"span"},{"style":{"height":16.79},"width":161.8,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-9.png","element":"img","alt":"W) = Dp","inline":true},{"text":"diag(1","element":"span"},{"style":{"height":16.79},"width":87.34,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-10.png","element":"img","alt":"ζ)dζ,","inline":true}],[{"text":"where ","element":"span"},{"style":{"height":15.59},"width":50,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-11.png","element":"img","alt":" Dp","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"+1)","element":"span"},{"style":{"height":16},"width":71.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-12.png","element":"img","alt":"/2×","inline":true},{"text":"1 duplication matrix. 
Therefore, using the chain rule","element":"span"}],[{"style":{"height":16.79},"width":98.01,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-13.png","element":"img","alt":"∇ζ(−","inline":true},{"text":"1","element":"span"},{"text":"2","element":"span"}],[{"style":{"width":"84%"},"width":806,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-14.png","element":"img"}],[{"text":"where dg","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is a diagonal matrix with diagonal equal to the diagonal of A. The last line follows because ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is a lower triangular matrix.","element":"span"}]]},{"heading":"Appendix B Gradients for state space model","paragraphs":[[{"text":"For the stochastic volatility model in ","element":"span"},{"href":"#id-68","text":"(12)","element":"a"},{"text":",","element":"span"}],[{"text":"log ","element":"span"},{"style":{"height":25.58},"width":209.59,"height":63.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-15.png","element":"img","alt":" h(θ) = −nλ","inline":true},{"text":"2 ","element":"span"},{"style":{"height":21.37},"width":83.34,"height":53.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-16.png","element":"img","alt":"− eα","inline":true},{"text":"2","element":"span"}],[{"style":{"width":"73%"},"width":700,"height":240,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-17.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"is a constant independent of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-18.png","element":"img","alt":" θ","inline":true},{"text":". 
The gradient ","element":"span"},{"style":{"height":13.19},"width":48.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-19.png","element":"img","alt":"∇θ","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":57.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-20.png","element":"img","alt":" h(θ","inline":true},{"text":") can be computed from the following components.","element":"span"}],[{"style":{"height":14.78},"width":61.22,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-21.png","element":"img","alt":"∇b1","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":18.18},"width":625.7,"height":45.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-22.png","element":"img","alt":" h(θ) = −(1 − φ2)b1 + φ(h2 − φb1) −","inline":true,"padRight":true},{"text":"e","element":"span"},{"style":{"height":4.8},"width":21,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-23.png","element":"img","alt":"α","inline":true},{"text":"2","element":"span"}],[{"style":{"width":"87%"},"width":832,"height":176,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-24.png","element":"img"}],[{"text":"+ ","element":"span"},{"text":"e","element":"span"},{"style":{"height":4.8},"width":21,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-25.png","element":"img","alt":"α","inline":true},{"text":"2 ","element":"span"},{"style":{"height":18.12},"width":36.97,"height":45.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-26.png","element":"img","alt":"y2t","inline":true,"padRight":true},{"text":"exp(","element":"span"},{"style":{"height":13.77},"width":172.47,"height":34.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-27.png","element":"img","alt":"−λ − 
eαbt","inline":true},{"text":") for 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< t < n, ","element":"span"},{"style":{"height":14.79},"width":65.22,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-28.png","element":"img","alt":"∇bn","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":425.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-29.png","element":"img","alt":" h(θ) = −(bn − φhn−1) −","inline":true,"padRight":true},{"text":"e","element":"span"},{"style":{"height":4.8},"width":21,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-30.png","element":"img","alt":"α","inline":true},{"text":"2","element":"span"}],[{"style":{"width":"95%"},"width":907,"height":737,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1605.05622/images/17-31.png","element":"img"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]]}]}]