Gaussian variational approximation with sparse precision matrices

2016·Arxiv

Abstract

Abstract

We consider the problem of learning a Gaussian variational approximation to the posterior distribution for a high-dimensional parameter, where we impose sparsity in the precision matrix to reflect appropriate conditional independence structure in the model. Incorporating sparsity in the precision matrix allows the Gaussian variational distribution to be both flexible and parsimonious, and the sparsity is achieved through parameterization in terms of the Cholesky factor. Ef-ficient stochastic gradient methods which make appropriate use of gradient information for the target distribution are developed for the optimization. We consider alternative estimators of the stochastic gradients which have lower variation and are more stable. Our approach is illustrated using generalized linear mixed models and state space models for time series.

Keywords Gaussian variational approximation stochastic gradient algorithms sparse precision matrix variational Bayes

Linda S. L. Tan Department of Statistics and Applied Probability National University of Singapore 6 Science Drive 2, Singapore 117546 Tel.: +65-6516-4416 Fax: +65-6872-3919 E-mail: statsll@nus.edu.sg

David J. Nott Department of Statistics and Applied Probability National University of Singapore 6 Science Drive 2, Singapore 117546 Operations Research and Analytics Cluster National University of Singapore 21 Lower Kent Ridge Road, Singapore 119077 Tel.: +65-6516-2744 Fax: +65-6872-3919 E-mail: standj@nus.edu.sg

1 Introduction

Bayesian inference provides a principled way of combining data with prior beliefs through the application of Bayes’ rule. The posterior distribution is, however, often intractable and simulation-based Markov Chain Monte Carlo (MCMC) methods have become a central tool in Bayesian computation. In recent years, variational methods (Jordan et al., 1999) have also emerged as an important alternative to MCMC, providing fast approximate inference for complex, high-dimensional models. Unlike MCMC, which can be made arbitrarily accurate, variational methods make certain simplifying assumptions about the posterior density (e.g. a tractable form ) where denotes the vector of variables) and seek to optimize the Kullback-Leibler divergence ) between ) and the true posterior ) subject to these assumed restrictions. While earlier research on variational methods concentrated on conjugate models with analytically tractable expectations under which the variational Bayes approach (At- tias, 1999) yields efficient closed-form updates (Winn and Bishop, 2005), recent focus considers stochastic gradient approximation methods (Robbins and Monro, 1951) for non-conjugate models (e.g. Paisley et al., 2012; Salimans and Knowles, 2013). Further discussion of the literature is deferred to Section 2. Rohde and Wand (2015) give a nice recent summary of alternatives to stochastic gradient approaches for handling non-conjugacy in the variational Bayes framework.

Titsias and L´azaro-Gredilla (2014) propose a simple yet effective variational method known as “doubly stochastic variational inference”, where the approximating density is parameterized in terms of its mean and a lower triangular scale matrix L. An efficient stochastic gradient algorithm is then developed for optimizing and L by (1) parameterizing the vector of variables as where z is a random variable that can be sampled easily from a base distribution that does not depend on the variational parameters (see also Kingma and Welling, 2014; Rezende et al., 2014) and (2) sub-sampling from the data. The stochastic gradients constructed in this manner are “doubly stochastic” as they are built upon two sources of stochasticity that comes from sampling from the variational distribution and the full data set. This approach is very general in that it can be applied to any model where the joint density is differentiable. Unlike variational Bayes, it does not assume independence relationships among blocks of an appropriate partition of . Such independence assumptions have been shown to result in underestimation of the posterior variance (Wang and Titterington, 2005; Bishop, 2006). The quality of the resulting approximation is thus limited only by how well the form of ) matches the true posterior. Using this approach, Kucukelbir et al. (2016) develop an automatic differen-tiation variational inference (ADVI) algorithm in Stan, where ) is assumed to be either a diagonal (mean-field) or unrestricted Gaussian variational approximation. Constrained variables are transformed to the real line via Stan’s library of transformations and the gradients are computed using Monte Carlo integration. They note that while unrestricted ADVI is able to capture posterior correlations and hence produces more accurate marginal variance estimates than mean field ADVI, it can be prohibitively slow for large data since the number of variational parameters scales as the square of the length of .

In this article, we consider variational approximations which take the form of a multivariate Gaussian distribution ) for models with high-dimensional parameters (denotes the mean and the covariance matrix). However, instead of expressing as and optimizing L (the Cholesky factor of ) as in Tit- sias and L´azaro-Gredilla (2014) and Kucukelbir et al. (2016), we parameterize the optimization problem in terms of the Cholesky factor of the precision matrix. This parameterization is important as it provides an avenue to impose a sparsity structure in the precision matrix that reflects conditional independence relationships in the posterior. This sparsity structure reduces computational complexity greatly and enables fast inference for models with a large number of variables without having to assume independence relationships in the posterior. We demonstrate how our approach can be applied to generalized linear mixed models (GLMMs) and state space models (SSMs) for time series data. Assuming that the number of global variables is small compared to the number of local variables, our approach reduces the number of variational parameters to be updated in each iteration from ) to O(n), where n denotes the number of subjects in GLMMs and the length of time series in SSMs. In this way, the accuracy of using a unrestricted lower triangular matrix L can be achieved at the computational cost (same order of magnitude) of using a diagonal matrix L.

Recently, several classes of richer variational approximations which go beyond factorized (mean-field) approximating densities and which are able to reflect the posterior dependence structure to varying degrees have been proposed (e.g. Gershman et al., 2012; Salimans and Knowles, 2013). Rezende and Mohamed (2015) propose the specification of the approximate posterior using normalizing flows. Starting with say simple factorized distributions, highly flexible and complex approximate posteriors are constructed by transforming the initial density through a sequence of invertible mappings which perform expansions or contractions of the probability mass in targeted regions. The resulting chain of transformed densities is known as a normalizing flow. The authors show that true posterior can be recovered asymptotically under the Langevin flow, thus overcoming an important limitation of variational inference. Ranganath et al. (2016) propose hierarchical variational models which are built by placing a prior distribution on the parameters of a mean-field variational approximation and then proceeding to integrate out the mean-field parameters. They specify the prior using normalizing flows and demonstrate that the hierarchical variational model achieves better performance in terms of perplexity and held-out likelihood for deep exponential families (Ranganath et al., 2015). Structured stochastic variational inference (Hoffman and Blei, 2015) is a generalization of stochastic variational inference to allow dependencies between global and local variables. In the approximating density, independence is assumed only among elements in the global variables and among the local variables conditional on the global variables. Dependence between a local variable and the global variables is captured via a local parameter defined implicitly as the point at which the local evidence lower bound is maximized.

Archer et al. (2016) develop “black-box” variational inference (Ranganath et al., 2014) for SSMs, where a Gaussian variational approximation is considered for the latent states. To capture the temporal correlation structure, the precision matrix is assumed to be a block tri-diagonal matrix. While related, our approach differs from Archer et al. (2016) in several aspects. For the SSMs application, we consider a joint Gaussian variational approximation for the model parameters and latent states while Archer et al. (2016) assume that the model parameters are known and consider a Gaussian approximate posterior for the latent states only. Secondly, we optimize the Cholesky factor of the precision matrix directly while Archer et al. (2016) consider other parameterizations such as defining the approximate posterior through a product of Gaussian factors and parameterizing the mean and blocks in the tri-diagonal inverse covariance using neural networks. Third, we consider a more general sparsity structure in the precision matrix, which reflects the conditional independence structures in the posterior distribution and is not limited to band matrices. We also consider an alternative estimator of the stochastic gradient which differs from the “black-box” approach of Archer et al. (2016) as well as that used by Titsias and L´azaro-Gredilla (2014) and Kucukelbir et al. (2016). We demonstrate empirically that this estimator has lower variance at the mode and is helpful in improving the stability and precision of the proposed algorithm. This estimator is inspired by Han et al. (2016), who propose using Gaussian copulas to accommodate models whose posteriors are for instance, skewed, heavy-tailed or multi-modal and hence unsuited to a Gaussian variational approximation. Our idea of introducing sparsity via the Cholesky factor of the precision matrix may prove useful in this context as well. The relationship between the Laplace and the Gaussian variational approximation is discussed in Opper and Archambeau (2009) while Challis and Barber (2013) consider some differ-ent parameterizations in terms of the Cholesky. We do not consider Laplace approximations (Rue et al., 2009) in this paper since an important advantage of stochastic gradient methods is they are generally amenable to sub-sampling, although this is not always straightforward in complex latent variable models where the local parameters are dependent.

In Section 2, we review doubly stochastic variational inference, the approach of Titsias and L´azaro-Gredilla (2014). Section 3 describes how the optimization problem can be framed in terms of the precision matrix, develops the algorithm using alternative gradient estimators and discusses the importance of imposing sparsity structure in the precision matrix. The setting of the learning rate in the stochastic gradient algorithm is discussed in Section 4. In Section 5, we illustrate how our approach can be applied to GLMMs and state space models. The performance of our algorithm is investigated using several real data sets. We conclude with a discussion of our major results and findings in Section 6.

2 Review on doubly stochastic variational inference

In this section, we provide some general background on variational methods and give a brief review of doubly stochastic variational inference (Titsias and L´azaro- Gredilla, 2014) as we will be considering a modification of their approach.

For a Bayesian inference problem, let denote the vector of variables, ) be the prior and ) the likelihood for observed data y. In variational approximation (see, e.g. Bishop, 2006; Ormerod and Wand, 2010), an attempt is made to approximate an intractable posterior distribution ) using a member of some approximating family. Here we will consider a parametric family with typical element ) where denotes variational parameters to be chosen. Minimization of the Kullback-Leibler divergence between ) and ) with respect to can be shown to be equivalent to maximizing a lower bound on the log marginal likelihood log p(y) (where ), and taking the form

In non-conjugate models, ) will generally not have a closed form. There has been much recent research concerned with stochastic gradient methods (Robbins and Monro, 1951) able to optimize ) efficiently in this situation (Ji et al., 2010; Nott et al., 2012; Paisley et al., 2012; Kingma and Welling, 2014; Hoffman et al., 2013; Ranganath et al., 2014; Rezende et al., 2014; Titsias and L´azaro-Gredilla, 2014, 2015).

The method of Titsias and L´azaro-Gredilla (2014) (hereafter TL) is one state of the art method which optimizes ) using gradient information from the target distribution. Write ). In the TL method, an approximating distribution of the form

q(θ|µ, Lf(L(θ µ))

is assumed (so that )) where f is a fixed density. Here is a vector of parameters of dimension d, where d is the dimension of , and L is a lower triangular matrix with positive diagonal elements. If f is the density of a vector of independent standard normal random variables then ) is normal, ), and the covariance matrix is being parameterized with L as the Cholesky factor. We will only be considering the case of a multivariate normal approximation in this paper.

The lower bound ) is an expectation with respect to ), but can be written as an expectation with respect to the density f. Writing the integral in this way (for the purpose of the stochastic gradient optimization) results in an approach which is able to effectively use gradient information from the target log posterior. More precisely, writing ) for the expectation with respect to ) and ) for the expectation with respect to f, we have

(log log ))

where ) is distributed according to the density f and K denotes a term not depending on . This approach of applying a transformation so that the lower bound can be rewritten as an expectation with respect to a fixed density f that does not depend on the variational parameters is sometimes referred to as the “reparameterization trick” (Kingma and Welling, 2014; Rezende et al., 2014; Titsias and L´azaro-Gredilla, 2014). The advantage of this approach is that efficient gradient estimators of ) can now be constructed by sampling s from f instead of from ), which has been found to result in estimators with very high variance (see, e.g. Paisley et al., 2012).

Next, we give expressions for the gradients of ) with respect to and L. To explain their derivation we need some notation first. For a scalar valued function g(x) of a vector valued argument ) denotes the gradient vector for the function written as a column vector. Also, for a scalar valued function g(A) of a matrix ) means vec)) where, for a square matrix A, vec(A) is the vector of length obtained by stacking the columns of A underneath each other, and vecis the inverse operation that takes a vector of length and creates a square matrix by filling up the columns from left to right from the elements of the vector. In addition, we use the following well known result. For conformably dimensioned matrices A, B and C, vec()vec(B). This implies that we can write )vec(L). Then

(log log ))

and

(log ))

The last line of (2) just says that

Note that entries above the diagonal should be set to zero for the right-hand-side of (3) because L is lower triangular. For the term log |L| in ), we have log log |L| = diag(1).

Once we have expressions for the derivatives of the lower bound as expectations with respect to f, we can estimate these gradients unbiasedly using simulations from this distribution (typically based on just a single draw). When the log-likelihood is a sum of a large number of terms, such as in the case of a very large dataset, we can subsample the terms and still construct appropriate unbiased gradient estimates if we desire (hence the name “doubly stochastic variational inference”). Algorithm 1 shows the basic stochastic gradient method of Titsias and L´azaro-Gredilla (2014). The sequence , 1, in the algorithm is a sequence of learning rates satisfying the Robbins-Monro conditions , .

Algorithm 1: Doubly stochastic variational inference algorithm of Titsias and L´azaro-Gredilla (2014).

3 Extension to parametrization of the precision matrix in terms of the Cholesky factor

When the vector of variables is high-dimensional, allowing L to be a dense matrix is computationally impractical. An alternative is to assume that L is diagonal, but that loses any ability to capture dependence structure of the posterior. Here we consider an alternative approach where we follow a similar strategy to that of Titsias and L´azaro-Gredilla (2014), but instead parameterize the inverse covariance (precision) matrix in terms of the Cholesky factor and then impose sparsity on it that reflects conditional independence structure in the model.

3.1 Model Specification

Consider a model with observations ), latent variables and model parameters . Let

denote the vector of all variables. We assume that the joint distribution can be written in the form

p(y, θ) = p(η)

for some 1 . In this model, is conditionally independent of the other latent variables in the posterior distribution ) given and the neighboring k latent variables.

3.2 Gaussian variational approximation with sparse precision matrix

We consider the variational approximation (q) of the posterior to be a multivariate Gaussian distribution ), where T is a lower triangular matrix with positive diagonal entries. With f being the joint density of a d-vector of independent standard Gaussian variables as before, we can write ()).

Let denote the precision matrix of the Gaussian distribution. Then and hence T is just the Cholesky factor of . The statistical motivation for imposing sparsity on the Cholesky factor of the precision matrix is as follows. It is well known that for a Gaussian distribution, and j being conditionally independent given the rest. Also, if where T is lower triangular, proposition 1 of Rothman et al. (2010) states that if T is row banded then possesses the same row banded structure. This means that imposing sparsity in T can be useful for reflecting conditional independence relationships in .

For our model in (5), let us partition into blocks , 1 + 1 according to (4). For the Gaussian variational approximation to reflect the conditional independence structure in the posterior, we would like to have or j > i + k}, with no constraints on the remaining blocks. Write T for the Cholesky factor partitioned in the same way as with blocks , 1 + 1. Since T is lower triangular, and , 1 + 1, are lower triangular matrices. If , then has the sparsity we desire for . The sparsity level of T increases as k decreases. We elaborate later on how the sparse lower triangular structure can be exploited in the generation of from the variational posterior and in gradient computations. This approach is illustrated using generalized linear mixed models and state space models in Section 5.

3.3 Stochastic gradients

Similar to the previous case we obtain for the lower bound

(log log )) (log log |T| + K,

where () is distributed according to f and K denotes a constant not depending on . To obtain the gradient of ) with respect to and T, in addition to the results mentioned in Section 2, we need the following result. Denote by the matrix where the (i, j)th entry is the partial derivative of the ith element of vec() with respect to the jth element of vec(A). Then

dvec()

Similar to before, we have

(log ))

Looking at T,

(log ))

so that

Note that for the first term on the right hand side of (8), entries above the diagonal should be set to zero because T is a lower triangular matrix.

3.4 Alternative estimators of the stochastic gradients

From (7) and (8), unbiased estimators of and are given by

respectively, where s is generated from ). In deriving these estimators, we have evaluated the term (log )) in the lower bound analytically. However, alternative estimators can be derived by approximating this term instead of using its analytical form. As

log (

we have from (6),

Similarly,

Thus alternative unbiased estimators of and are given by

In our experiments, we observe that the estimators ˆand ˆseem to provide approximations with lower variance (Salimans and Knowles (2013) and Han et al. (2016) note related phenomena). As an example, for the toenail dataset in Section 5.1.2, we compare in Figure 1 estimates of given by ˆ(black) and ˆ(red) for a subset of the variables. This is done by fix-ing and T at the mode and computing the gradient estimates of at 1000 random variates ). Figure 1 shows clearly that there is much greater variation in the estimates computed using ˆas compared to ˆ. This suggests that using the alternative estimators ˆand ˆwill result in a more stable algorithm with better convergence and greater precision.

Below, we provide some intuition for this observation. Suppose the density that we are approximating is close to a Gaussian distribution with mean and precision , that is, ). Then log (). When we are close to the mode, and

while

ˆ

Thus, for ˆ, the contributions to the gradients from log ) and log ) cancel out. As ˆis a factor of ˆ, ˆ0 when ˆ0. However, ˆand ˆare still subjected to the randomness in s around the mode. Thus we prefer to use the estimators ˆand ˆ, which do not incur any additional computation except for the term Ts.

3.5 Uniqueness of the Cholesky factor

We note that in Algorithm 1, the updates of L do not ensure that the diagonal entries are positive. While this does not result in any computational issues, we prefer to add in the following step to ensure that the diagonal entries of T are positive. This helps to ensure uniqueness of the Cholesky factor and reduces the possibility of multiple local modes which is an important issue especially in the high-dimensional problems considered here. To achieve this aim. We introduce the lower triangular matrix where if and ). We compute the stochastic gradient updates for whose entries are unconstrained. The gradient ˆfor all non-diagonal entries. Diagonal entries of ˆcan be computed by multiplying the diagonal entries of ˆby the diagonal entries of T.

The modification of the doubly stochastic variational inference algorithm, in terms of the Cholesky factor of the precision matrix, is summarized in Algorithm 2.

Now let us consider sparsity in the Cholesky factor T. Suppose some elements of T are fixed at zero. Then Algorithm 2 remains the same, except that only the subset of elements of T which are not fixed at zero are stored and updated. Note that in step 2, if T is a sparse matrix, we can compute by solving the linear system for x. This can be done very efficiently because is upper triangular and sparse. Similarly, in computing the update at step 5, we need to compute the vector . This can also be computed by solving the linear system , which is again easy because T is a sparse lower triangular matrix. So even in very high dimensions, if T is appropriately sparse, Algorithm 2 can be implemented in a way that is efficient in terms of both memory storage requirements and CPU time.

Fig. 1: Toenail data: estimates of given by ˆ(black) and ˆ(red) at the mode for a subset of the variables.

Algorithm 2: Modified doubly stochastic variational inference algorithm parameterized in terms of the Cholesky factor of the precision matrix.

4 Setting the learning rate and stopping criterion in the stochastic optimization

4.1 Learning rate

The setting of appropriate learning rates in stochastic gradient algorithms is a highly challenging problem. The choice of learning rate determines not only the rate of convergence but also the quality of the optimum attained. Learning rates that are too high causes the algorithm to diverge while rates that are too low results in slow learning and can lead to “apparent convergence”, a situation where parameters appear to have converged due to diminishing step-size (see, e.g. Powell, 2011). Spall (2003) suggests a step-size sequence which satis-fies the theoretical conditions for convergence (Robbins and Monro, 1951). This takes the form where t denotes the iteration, and 1, 0 and 01 are constants to be tuned. However, we find that it is difficult to hand-tune this learning rate for use in Algorithm 2, as the problems considered are high-dimensional in nature and the parameters converge at different rates. For instance, Titsias and L´azaro-Gredilla (2014) scaled down the learning rate of by 0.1 when using for L in Algorithm 1. It is also likely that the parameters have different “scale”, especially when some of the constrained parameters have to be transformed to the real line. These considerations increase the need for learning rates that are adaptive and parameter-specific. Several adaptive step-size sequences (e.g. Duchi et al. (2011), Ranganath et al. (2013), Kucukelbir et al., 2016) have been proposed. We find that the ADADELTA method (Zeiler, 2012), in particular, worked very well with Algorithm 2 and we use it for all the examples. For consistency, we also used ADADELTA to compute the step-size for Algorithm 1. While ADADELTA has worked well in our experiments, we have only worked on a limited number of datasets and it is likely that other learning rates may yield better performance. From our observations, the performance of learning rates tend to be problem-dependent.

The ADADELTA method takes into consideration the scale of the parameters by incorporating second order information through a Hessian approximation. Suppose at iteration t, a parameter x is updated as + , where and is the gradient. The step-size is computed as

where and are exponentially decay- ing averages of and , and is a small positive constant added to ensure the denominator is positive and the initial step-size is nonzero. The terms and are updated as

at each iteration where is a decaying constant. The motivation of this approach comes from NewtonRaphson algorithms where it is well known that the inverse of the Hessian matrix provides an optimal or near-optimal step-size sequence (see, e.g. Spall, 2003). ADADELTA approximates the Hessian by taking , hence the form of in (9). To apply ADADELTA, we modify Algorithm 2 as outlined below. Note that the step-size for and T are different. As T

For t = 1, . . . , N,

1. Generate 2.

3.

4. Accumulate gradient

5. Compute change

is a sparse matrix, we find that it is more efficient to perform steps 8–12 in vector-form and to store only the non-zero elements of , , . We let and 95, the setting recommended by Zeiler (2012). We note that Algorithm 2 is more sensitive to when the estimators ˆand ˆare used as compared to the alternative estimators ˆand ˆ.

4.2 Stopping Criterion

In variational algorithms, the lower bound is commonly used as an objective function to check for convergence. When the updates are deterministic, the lower bound is guaranteed to increase after each cycle and the algorithm can be terminated when the increase in the lower bound is negligible. In Algorithms 1 and 2, the updates are stochastic and so the lower bound is not guaranteed to increase at each iteration. Computing the lower bounds in (1) and (6) also requires evaluating the expectations with respect to the variational approximation q. It is straightforward, however, to obtain an unbiased estimate of the lower bound at each iteration. Let s be a random variate generated from ). From (1), an unbiased estimate of the lower bound for Algorithm 1 is given by

ˆlog )

Similarly, an unbiased estimate of the lower bound for Algorithm 2 is given by

ˆ) + d2 log(2log |T| + 12

Here we do not evaluate the expectation of the last term analytically so that the randomness in s will cancel out between the first and the last term when we are close to the mode (see similar argument is given in Section 3.4).

As the estimate ˆL is stochastic, we consider instead the average of ˆL over the past F iterations, say ¯L, to minimize variability. We compute ¯L after every F iterations and keep a record of the maximum value of ¯L attained thus far, say ¯. The algorithm is terminated when ¯L falls below ¯more than M consecutive times. This may imply either that the algorithm has converged and hence the lower bound estimates are just bouncing around or the algorithm is diverging and the estimates of the lower bound are deteriorating. We say that the algorithm is “diverging” if ¯L is tending towards . In Section 5, we adopt rather conservative values of F = 2500 and M = 3 to avoid the dangers of stopping prematurely (see, e.g. Booth and Hobert, 1999). Alternative stopping criteria can also be constructed by examining the relative change in the parameter updates from successive iterations or the magnitude of the gradients of the parameters (see, e.g. Spall, 2003).

5 Applications

In this section, we illustrate how we can impose sparsity in T via Algorithm 2 using appropriate posterior conditional independence relationships for generalized linear mixed models (GLMMs) and state space models (SSMs). We code Algorithms 1 and 2 in Julia Version 0.5.0 (http://julialang.org/) and make use of its functions for sparse matrix representations to store and perform operations on high-dimensional sparse matrices efficiently.

Table 1: Runtime (in seconds) of ADVI (Stan) and Algorithms 1 and 2 (Julia) for datasets of different sizes. The number of iterations (in hundreds) used in Algorithms 1 and 2 are given in brackets.

We compare the variational approximations with posteriors obtained through long runs of MCMC (regarded as ground-truth). In all examples, fitting via MCMC was performed in RStan (http://mc-stan. org/interfaces/rstan) and the same priors are used in MCMC and variational approximations. For MCMC, we use 50,000 iterations in each example and the first half is discarded as burn-in. A thinning factor of 5 was applied and the remaining 5000 samples are used to estimate the posterior density.

We note that Algorithm 1 can also be readily implemented in Stan using automatic differentiation variational inference (ADVI, Kucukelbir et al., 2016). Hence we have also included the results from ADVI for comparison. However, there are some differences between our implementation of Algorithm 1 in Julia and that in Stan, namely, the learning rate and stopping criterion are different and we impose the additional restriction that the diagonal elements in L must be positive.

Table 1 shows the runtimes for ADVI and Algorithms 1 and 2 for the datasets considered in this section. We use the terms “mean-field” to refer to the case where L is a diagonal matrix and “unrestricted” when L is a full lower triangular matrix. All experiments are run on a Intel Core i5 CPU@ 3.20GHz 8.0GB Ram.

5.1 Generalized linear mixed models

Here we consider GLMMs where

is the set of responses for the ith subject, and are vectors of predictors for ), and ) is a smooth invertible link function. Let

g(µβ + Zfor i = 1, . . . , n, j = 1, . . . , n,

where is a vector of fixed effects parameters and is a random effect for the ith subject. Here we consider binary responses, where Bernoulli() with the logit link function , and count responses, where Poisson() with the log link function ). Variational methods have been considered for efficient computation in GLMMs by Ormerod and Wand (2012); Tan and Nott (2013, 2014); Lee and Wand (2016a,b), among others.

We parameterize the elements of the random effects covariance matrix G so that they are unconstrained and so that a normal variational posterior approximation is reasonable. Let , where W, the Cholesky factor of G, is a lower triangular matrix with positive diagonal entries. Let denote the matrix for which ) and if . Write ), where vech is the operation that transforms the lower triangular part of a square matrix into a vector by stacking elements below the diagonal column by column. We assume a normal prior, ).

The vector of variables in the model is given by as defined in (4), where )and the length of is d = pn + k + p(p + 1)/2. The joint distribution can be written as

p(y, θ) = h(θ) = p(β)p(ζ)

For this model, note that and are conditionally independent in the posterior distribution for given . For the GLMM, the sparsity structure imposed on T and hence is illustrated in (10). Our Algorithm 2 can efficiently learn a T with such a structure.

T =

For the GLMM, using a full rank lower triangular matrix L in Algorithm 1 requires updates of ) elements at each iteration while Algorithm 2 only requires ) updates (assuming k and p are small). Hence the efficiency of Algorithm 2 as compared to Algorithm 1 (unrestricted) increases rapidly with the number of subjects in the dataset as can be seen from Table 1. There is only a slight computational overhead in using Algorithm 2 as compared to a diagonal matrix L in Algorithm 1, which requires O(np) updates. However, Algorithm 2 reflects the posterior dependency structure and hence has the potential to provide better estimates. Next, we investigate the performance of Algorithm 2 on several data sets. We set The gradient of log ) is derived in Appendix A.

5.1.1 Epilepsy data

The epilepsy data of Thall and Vail (1990) includes n = 59 epileptics who were randomized to a new drug, progabide (Trt=1) or a placebo (Trt=0) in a clinical trial. The response is given by the number of seizures patients have during four follow-up periods. Other covariates include the logarithm of age (Age), the logarithm of the number of baseline seizures (Base), Visit (coded as Visit3, Visit1, Visit1 and Visit3), and a binary variable V4 which is 1 for the fourth visit and 0 otherwise. We center the covariate Age and replace Ageby Agemean(Age). Consider the following two models from Breslow and Clayton (1993). Model I is a Poisson random intercept model where

log BaseTrtAge+ BaseTrt

for i = 1, ..., n, j = 1, ..., 4. Model II is a Poisson random intercept and slope model where

log BaseTrtAge+ BaseTrtVisit+ Visit

for i = 1, ..., n, j = 1, ..., 4.

We apply ADVI and Algorithms 1 and 2 on these two models. Runtimes are given in Table 1 and the estimated marginal posteriors of and are shown in Figure 2. Algorithm 1 (mean-field) converged quickly for both models while the runtime of Algorithm 1 (unrestricted) doubled with the inclusion of a second random effect. For this dataset, Algorithm 2 performed better than the mean-field and unrestricted approximations. It produces very good approximations of the marginal posteriors of , but is overconfident in estimating the marginal posteriors of in Model II. The variational posteriors from Algorithm 1 (mean-field) are accurate in the mean but the variance is underestimated, quite severely in some cases.

Figure 3 shows the iterates of the mean parameter corresponding to and and the averaged lower bound ( ¯L) for Model II. For Algorithm 1 (unrestricted), it appears that some of the parameters have yet to stabilize even though the lower bound has reached stationarity.

5.1.2 Toenail data

This dataset compares two oral anti-fungal treatments for toenail infection (De Backer et al., 1998) and contains information for 294 patients who are evaluated at seven visits. Some patients did not attend all planned visits and there were a total of 1908 measurements. The patients were randomized into two treatment groups, one receiving 250 mg of terbinafine per day (Trt=1) and the other 200 mg of itraconazole per day (Trt=0). The time in months (t) that they arrived for visits was recorded and the binary response variable (onycholysis) indicates the degree of separation of the nail plate from the nail-bed (0 if none or mild, 1 if moderate or severe). We consider the logistic random intercept model,

logit(TrtTrt

for i = 1, ..., 294, 1 7.

Figure 4 shows the variational posteriors estimated by ADVI and Algorithms 1 and 2. The estimates from Algorithm 2 are closer to that of MCMC than the unrestricted and mean-field approximations of ADVI and Algorithm 1. Table 1 indicates that the runtime of Algorithm 1 (unrestricted) is about 1.5 times that of Algorithm 2 even though the number of iterations required is halved.

5.1.3 Polypharmacy data

The polypharmacy data set (Hosmer et al., 2013) is available at http://www.umass.edu/statdata/ statdata/stat-logistic.html and it contains data on 500 subjects studied over seven years. The response is whether the subject is taking drugs from 3 or more different groups. We consider the covariates, Gender = 1 if male and 0 if female, Race = 0 if subject is white and 1 otherwise, Age, and the following binary indicators for the number of outpatient mental MHV if if 6 MHV 15. Let INPTMHV = 0 if there were no inpatient mental health

Fig. 2: Epilepsy data: posterior distributions of and estimated using ADVI (dotted, blue for mean-field and green for unrestricted), Algorithm 1 (blue for mean-field and green for unrestricted), Algorithm 2 (red) and MCMC (black) for Model I (first two rows) and Model II (last two rows).

visits and 1 otherwise. We consider a logistic random intercept model (see Hosmer et al., 2013) of the form

for i = 1, . . . , 500, j = 1, . . . , 7.

We apply ADVI and Algorithms 1 and 2 to this model. From Table 1, the increase in runtime of Algorithm 1 (unrestricted) due to the larger number of subjects as compared to the toenail dataset is evident. The runtime of Algorithm 1 (unrestricted) is about 4.7 times that of Algorithm 2 while the runtime of Algorithm 2 is only slightly longer than that of Algorithm 1 (mean-field). From Figure 5, Algorithm 2 provided a

Fig. 3: Epilepsy data: Mean () iterates corresponding to and and the averaged lower bound ( ¯L) from Algorithm 1 with unrestricted lower triangular matrix L (green) and Algorithm 2 (red) for Model II.

Fig. 4: Toenail data: posterior distributions of and estimated using ADVI (dotted, blue for mean-field and green for unrestricted), Algorithm 1 (blue for mean-field and green for unrestricted), Algorithm 2 (red) and MCMC (black).

very good approximation of the marginal posteriors of but there is some underestimation of the mean and standard deviation of .

5.2 State space models

Here we consider the stochastic volatility model widely used in modeling financial time series (see, e.g. Harvey et al., 1994; Kim et al., 1998), which is an example of a non-linear state space model. The observations are generated from a zero-mean Gaussian distribution with a variance stochastically evolving over time. The unobserved log-volatility is modeled as an AR(1) process with Gaussian disturbances. Let

bN(φb, 1) for t = 2, . . . , n,

where 0 and 0 1. In (11), is the mean-corrected return at time t and is assumed to follow a stationary process with drawn from the stationary distribution. We transform the constrained parameters to the real space by letting ) and , where . Assume normal priors, ) and ). The set of variables is given by ), where ), and the joint distribution is given by

p(y, θ) = h(θ) = p(α)p(λ)p(ψ)p(bψ)

From (12), is conditionally independent of all other states in the posterior distribution given and the neighboring states. Hence, we can take advantage of

Fig. 5: Polypharmacy data: posterior distributions of and estimated using ADVI (dotted, blue for mean-field and green for unrestricted), Algorithm 1 (blue for mean-field and green for unrestricted), Algorithm 2 (red) and MCMC (black) .

this conditional independence in the variational approximation ). By setting 1 has the sparsity we desire for . This sparse structure is illustrated in (13).

T =

For the SSM, the number of parameters to update in each iteration of Algorithm 1 (unrestricted) is ) while Algorithm 2 only requires O(n) updates (similar to Algorithm 1 mean-field). This is an important factor to consider in SSMs as the number of observations in a time series over a long period may be large.

Next, we illustrate Algorithm 2 using two sets of exchange rate data which is available from the dataset “Garch” in the R package Ecdat. We compute the mean-corrected response from the exchange rates as

The gradient of log ) is derived in Appendix B.

5.2.1 GBP/USD exchange rate data

Here we consider daily observations of the weekday exchange rates of the U.S. Dollar against the British Pound from 1st October 1981 to 28th June 1985. This dataset has been considered by Harvey et al. (1994), Kim et al. (1998) and Durbin and Koopman (2012). The number of responses is n = 945. Applying ADVI and Algorithms 1 and 2 to this dataset, the resulting variational posteriors are shown in Figure 6. We note that Algorithm 1 (unrestricted) diverges as the averaged lower bound ¯L is deteriorating and tending towards . ADVI (unrestricted) also fails to converge. The mean-field approximations of ADVI and Algorithm 1 have difficulty in capturing the means of and and only manage to capture the mean of . Algorithm 2 was able to capture the means with reasonable accuracy but there is underestimation in the variance of and . Figure 7 shows the mean (solid lines) and 1 standard deviation intervals (dotted lines) of the log volatility at each time point estimated using Algorithm 2 and MCMC. Algorithm 2 was able to capture the means very accurately but there is some underestimation of the standard deviation.

5.2.2 DEM /USD exchange rate data

Next, we consider the entire series of weekday exchange rates of the U.S. Dollar against the German Deutschemark from 2nd January 1980 to 21st June 1987 available in “Garch”. This is a much larger dataset with n = 1866 responses. We apply ADVI and Algorithms 1 and 2 to this dataset. The unrestricted approximations of ADVI and Algorithm 1 fail to converge again. The approximations of Algorithm 2 improved from the previous dataset and the underestimation of the standard

Fig. 6: GBP/USD exchange rate data: posterior distributions of estimated using ADVI (dotted, blue for mean-field), Algorithm 1 (blue for mean-field), Algorithm 2 (red) and MCMC (black).

Fig. 7: GBP/USD exchange rate data: Mean (solid line) and 1 standard deviation intervals (dotted lines) of log volatility estimated using Algorithm 2 (red) and MCMC (black).

deviation was less severe. As in the previous case, the mean field approximations of ADVI and Algorithm 1 had difficulty in capturing the means of and . Figure 9 shows the mean and 1 standard deviation intervals of the log volatility . For this dataset, Algorithm 2 captured both the mean and standard deviation of the log volatility accurately.

6 Conclusion

In this article, we propose parameterizing the precision matrix in a Gaussian variational approximation in terms of the Cholesky factor and performing optimization using stochastic gradient methods. This approach enables us to impose sparsity in the precision matrix so as to reflect conditional independence structure in the posterior distribution appropriately. We also propose improved estimators of the stochastic gradient, which have lower variance and which are helpful in increasing the stability and precision of the algorithm. We illustrate how our approach can be applied to generalized linear mixed models and state space models. Our experimental results indicate that imposing sparsity in the precision matrix reduces the computational complexity of the problem greatly without having to assume independence among the model parameters. This also helps to improve the accuracy of the posterior approximations. We note that Algorithm 2 performs better than the unrestricted approximations of ADVI and Algorithm 1 on several occasions even though there are lesser constraints in the unrestricted approximations. This may be due to the difficulties associated with such high-dimensional optimization. In using a Gaussian variational approximation, all constrained parameters have to be transformed to the real line and we observed sensitivity towards the transformations being used as well as the way the model is being parameterized. These are important areas for future work.

Acknowledgements Linda Tan was supported by the National University of Singapore Overseas Postdoctoral Fellowship. David Nott’s research was supported by a Singapore Ministry of Education Academic Research Fund Tier 2 grant (R-155-000-143-112). We thank the reviewers and the editors for their time and helpful comments which have improved the manuscript.

Fig. 8: DEM/USD exchange rate data: posterior distributions of estimated using ADVI (dotted, blue for mean-field), Algorithm 1 (blue for mean-field), Algorithm 2 (red) and MCMC (black).

Fig. 9: DEM/USD exchange rate data: Mean (solid line) and 1 standard deviation intervals (dotted lines) of log volatility estimated using Algorithm 2 (red) and MCMC (black).

References

Archer, E., I. M. Park, L. Buesing, J. Cunningham, and L. Paninski (2016). Black box variational inference for state space models. arXiv:1511.07367.

Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In K. Laskey and H. Prade (Eds.), Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, pp. 21–30. Morgan Kaufmann.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Booth, J. G. and J. P. Hobert (1999). Maximizing generalized linear mixed model likelihoods with an automated monte carlo em algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61, 265–285.

Breslow, N. E. and D. G. Clayton (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.

Challis, E. and D. Barber (2013). Gaussian Kullback- Leibler approximate inference. Journal of Machine

Learning Research 14, 2239–2286.

De Backer, M., C. De Vroey, E. Lesaffre, I. Scheys, and P. D. Keyser (1998). Twelve weeks of continuous oral therapy for toenail onychomycosis caused by dermatophytes: A double-blind comparative trial of terbinafine 250 mg/day versus itraconazole 200 mg/day. Journal of the American Academy of Dermatology 38, 57–63.

Duchi, J., E. Hazan, and Y. Singer (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 2121–2159.

Durbin, J. and S. J. Koopman (2012). Time series analysis by state space methods (2 ed.). United Kingdom: Oxford University Press.

Gershman, S., M. Hoffman, and D. Blei (2012). Non- parametric variational inference. In J. Langford and J. Pineau (Eds.), Proceedings of the 29th International Conference on Machine Learning, pp. 663–670.

Han, S., X. Liao, D. B. Dunson, and L. C. Carin (2016). Variational Gaussian copula inference. In A. Gretton and C. C. Robert (Eds.), Proceedings of the 19th International Conference on Artificial Intelligence and

Statistics, Volume 51, pp. 829–838. JMLR Workshop and Conference Proceedings.

Harvey, A., R. Esther, and N. Shephard (1994). Mul- tivariate stochastic variance models. The Review of Economic Studies 61, 247–264.

Hoffman, M. and D. Blei (2015). Stochastic structured variational inference. In G. Lebanon and S. Vishwanathan (Eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, Volume 38, pp. 361–369. JMLR Workshop and Conference Proceedings.

Hoffman, M. D., D. M. Blei, C. Wang, and J. Paisley (2013). Stochastic variational inference. Journal of Machine Learning Research 14, 1303–1347.

Hosmer, D. W., S. Lemeshow, and R. X. Sturdivant (2013). Applied Logistic Regression (3 ed.). Hoboken, NJ: John Wiley & Sons, Inc.

Ji, C., H. Shen, and M. West (2010). Bounded approx- imations for marginal likelihoods. Technical Report 10-05, Institute of Decision Sciences, Duke University.

Jordan, M. I., Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999). An introduction to variational methods for graphical models. Machine Learning 37, 183–233.

Kim, S., N. Shephard, and S. Chib (1998). Stochastic volatility: likelihood inference and comparison with arch models. Review of Economic studies 65, 361– 393.

Kingma, D. P. and M. Welling (2014). Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).

Kucukelbir, A., D. Tran, R. Ranganath, A. Gelman, and D. M. Blei (2016). Automatic differentiation variational inference. arXiv: 1603.00788.

Lee, C. Y. Y. and M. P. Wand (2016a). Streamlined mean field variational Bayes for longitudinal and multilevel data analysis. https://works.bepress.com/ matt_wand/13/.

Lee, C. Y. Y. and M. P. Wand (2016b). Variational methods for fitting complex Bayesian mixed effects models to health data statistics in medicine. Statistics in Medicine 35, 165–188.

Nott, D. J., S. L. Tan, M. Villani, and R. Kohn (2012). Regression density estimation with variational methods and stochastic approximation. Journal of Computational and Graphical Statistics 21, 797–820.

Opper, M. and C. Archambeau (2009). The variational Gaussian approximation revisited. Neural Computation 21, 786–792.

Ormerod, J. T. and M. P. Wand (2010). Explaining variational approximations. The American Statistician 64, 140–153.

Ormerod, J. T. and M. P. Wand (2012). Gaussian vari- ational approximate inference for generalized linear mixed models. Journal of Computational and Graphical Statistics 21, 2–17.

Paisley, J. W., D. M. Blei, and M. I. Jordan (2012). Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning (ICML-12).

Powell, W. B. (2011). Approximate dynamic programming: solving the curses of dimensionality. Hoboken, NJ: John Wiley & Sons, Inc.

Ranganath, R., S. Gerrish, and D. M. Blei (2014). Black box variational inference. In S. Kaski and J. Corander (Eds.), Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Volume 33, pp. 814–822. JMLR Workshop and Conference Proceedings.

Ranganath, R., L. Tang, L. Charlin, and D. Blei (2015). Deep exponential families. In G. Lebanon and S. Vishwanathan (Eds.), Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, Volume 38, pp. 762–771. JMLR Workshop and Conference Proceedings.

Ranganath, R., D. Tran, and D. M. Blei (2016). Hierar- chical variational models. In M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of The 33rd International Conference on Machine Learning, Volume 37, pp. 324–333. JMLR Workshop and Conference Proceedings.

Ranganath, R., C. Wang, D. M. Blei, and E. P. Xing (2013). An adaptive learning rate for stochastic variational inference. In Proceedings of the 30th International Conference on Machine Learning, pp. 298–306.

Rezende, D. J. and S. Mohamed (2015). Variational inference with normalizing flows. In F. Bach and D. Blei (Eds.), Proceedings of The 32nd International Conference on Machine Learning, pp. 1530– 1538. JMLR Workshop and Conference Proceedings.

Rezende, D. J., S. Mohamed, and D. Wierstra (2014). Stochastic backpropagation and approximate inference in deep generative models. In E. P. Xing and T. Jebara (Eds.), Proceedings of The 31st International Conference on Machine Learning, pp. 1278– 1286. JMLR Workshop and Conference Proceedings.

Robbins, H. and S. Monro (1951). A stochastic approximation method. The Annals of Mathematical Statistics 22, 400–407.

Rohde, D. and M. P. Wand (2015). Mean field varia- tional Bayes: general principles and numerical issues. https://works.bepress.com/matt_wand/15/.

Rothman, A. J., E. Levina, and J. Zhu (2010). A new approach to Cholesky-based covariance regularization in high dimensions. Biometrika 97(3), 539–550.

Rue, H., S. Martino, and N. Chopin (2009). Approx- imate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 319–392.

Salimans, T. and D. A. Knowles (2013). Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis 8, 837–882.

Spall, J. C. (2003). Introduction to stochastic search and optimization: estimation, simulation and control. New Jersey: Wiley.

Tan, L. S. L. and D. J. Nott (2013). Variational infer- ence for generalized linear mixed models using partially non-centered parametrizations. Statistical Science 28, 168–188.

Tan, L. S. L. and D. J. Nott (2014). A stochastic vari- ational framework for fitting and diagnosing generalized linear mixed models. Bayesian Analysis 9, 963– 1004.

Thall, P. F. and S. C. Vail (1990). Some covariance models for longitudinal count data with overdispersion. Biometrics 46, 657–671.

Titsias, M. and M. L´azaro-Gredilla (2014). Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1971– 1979.

Titsias, M. and M. L´azaro-Gredilla (2015). Local expec- tation gradients for black box variational inference. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (NIPS 2015).

Wang, B. and D. M. Titterington (2005). Inadequacy of interval estimates corresponding to variational Bayesian approximations. In R. G. Cowell and G. Z (Eds.), Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp. 373– 380. Society for Artificial Intelligence and Statistics.

Winn, J. and C. M. Bishop (2005). Variational message passing. Journal of Machine Learning Research 6, 661–694.

Zeiler, M. D. (2012). Adadelta: An adaptive learning rate method. arXiv: 1212.5701.

Appendix A Gradients for generalized linear mixed models

For the GLMM described in Section 5.1, we have

log

where C is a constant not dependent on . For the logistic GLMM, log{1 + exp(x)} while for the Poisson GLMM, exp(x). The gradient of log ) is given by [log log log log )], where

log

In (14), with all entries above the diagonal set to zero and 1and 1are vectors of length p(p + 1)/2. We define the ith element of 1as 1 if the ith element of correspond to a diagonal element of W and 0 otherwise. On the other hand, the ith element of 1is exp() if corresponds to a diagonal element of W and 1 otherwise. For the logistic GLMM, 1 + exp(and for the Poisson GLMM, ). More details on the derivation of the gradients are given below.

As log log . For the term ,

where denotes the commutation matrix. Let with all entries above the diagonal set to zero. As W is a lower triangular matrix,

Moreover,

dvec(dvech(diag(1

where is the p(p+1)1 duplication matrix. Therefore, using chain rule

12

where dgA is a diagonal matrix with diagonal equal to the diagonal of A. The last line follows because A is a lower triangular matrix.

Appendix B Gradients for state space model

For the stochastic volatility model in (12),

log 2 2

where C is a constant independent of . The gradient log ) can be computed from the following components.

log e2

+ e2 exp() for 1 < t < n, log e2

designed for accessibility and to further open science