Multilevel Monte Carlo estimation of log marginal likelihood

2019 · arXiv

Abstract

In this short note we provide an unbiased multilevel Monte Carlo estimator of the log marginal likelihood and discuss its application to variational Bayes.

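For a dataset $X = \{x^{(1)}, \dots, x^{(N)}\}$ of $N$ i.i.d. samples generated from a conditional distribution $p_\theta(x \mid z)$ with a random variable $z \sim p_\theta(z)$, the log marginal likelihood of $X$ is given by

$$\log p_\theta(X) = \sum_{n=1}^{N} \log p_\theta(x^{(n)}), \qquad p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right],$$

where, for some importance distribution $q_\phi(z \mid x)$ and any positive integer $K$, the law of large numbers leads to

$$\log p_\theta(x) = \lim_{K \to \infty} \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z \mid x)}\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)}\right].$$

In what follows, we write $w_{\theta,\phi}(x, z) := p_\theta(x, z) / q_\phi(z \mid x)$. Let us introduce a sequence of random variables indexed by $\ell = 0, 1, \dots$:

$$P_{\theta,\phi,\ell}(x) = \log \frac{1}{2^{\ell}} \sum_{k=1}^{2^{\ell}} w_{\theta,\phi}(x, z_k)$$

for $z_1, \dots, z_{2^{\ell}} \sim q_\phi(z \mid x)$. For the same samples $z_1, \dots, z_{2^{\ell}}$, denote

$$Z_{\theta,\phi,\ell}(x) = P_{\theta,\phi,\ell}(x) - \frac{1}{2}\left[P^{(a)}_{\theta,\phi,\ell-1}(x) + P^{(b)}_{\theta,\phi,\ell-1}(x)\right]$$

with

$$P^{(a)}_{\theta,\phi,\ell-1}(x) = \log \frac{1}{2^{\ell-1}} \sum_{k=1}^{2^{\ell-1}} w_{\theta,\phi}(x, z_k)$$

and

$$P^{(b)}_{\theta,\phi,\ell-1}(x) = \log \frac{1}{2^{\ell-1}} \sum_{k=2^{\ell-1}+1}^{2^{\ell}} w_{\theta,\phi}(x, z_k)$$

for $\ell > 0$ and let $Z_{\theta,\phi,0}(x) := P_{\theta,\phi,0}(x)$. Following the idea from multilevel Monte Carlo methods [3, 4, 9], we represent the log marginal likelihood for each data point $x$ by a telescoping sum

$$\log p_\theta(x) = \lim_{\ell \to \infty} \mathbb{E}\left[P_{\theta,\phi,\ell}(x)\right] = \sum_{\ell=0}^{\infty} \mathbb{E}\left[Z_{\theta,\phi,\ell}(x)\right] = \sum_{\ell=0}^{\infty} \omega_\ell \, \frac{\mathbb{E}\left[Z_{\theta,\phi,\ell}(x)\right]}{\omega_\ell}$$

for any $\omega = (\omega_0, \omega_1, \dots)$ such that $\omega_\ell > 0$ for all $\ell$ and $\sum_{\ell=0}^{\infty} \omega_\ell = 1$, resulting in

$$\log p_\theta(X) = N \, \mathbb{E}_{x \in X} \, \mathbb{E}_{\ell \sim \omega}\left[\frac{\mathbb{E}\left[Z_{\theta,\phi,\ell}(x)\right]}{\omega_\ell}\right], \tag{1}$$

where $x$ is drawn uniformly from $X$.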

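This representation of the log marginal likelihood naturally leads to an unbiased Monte Carlo estimator

$$\frac{N}{M} \sum_{m=1}^{M} \frac{Z_{\theta,\phi,\ell^{(m)}}(x^{(m)})}{\omega_{\ell^{(m)}}}$$

for any batch size $M > 0$, where $x^{(1)}, \dots, x^{(M)}$ are independently and randomly chosen from $X$, whereas $\ell^{(1)}, \dots, \ell^{(M)} \geq 0$ are independently generated from the discrete probability distribution $\omega = (\omega_0, \omega_1, \dots)$. It follows from [5, Theorem 2 and Remark 3] that the variance of $Z_{\theta,\phi,\ell}(x)$, denoted by $V_\ell$, is of $O(2^{-2\ell})$ if there exist $s > 4$ and $t > 2$ with $(s-4)(t-2) \geq 8$ such that

$$\mathbb{E}\left[\left|\frac{w_{\theta,\phi}(x, z)}{p_\theta(x)}\right|^{s}\right] < \infty \quad \text{and} \quad \mathbb{E}\left[\left|\log \frac{w_{\theta,\phi}(x, z)}{p_\theta(x)}\right|^{t}\right] < \infty,$$

where we note that the antithetic property

$$\frac{1}{2}\left[\exp\left(P^{(a)}_{\theta,\phi,\ell-1}(x)\right) + \exp\left(P^{(b)}_{\theta,\phi,\ell-1}(x)\right)\right] = \exp\left(P_{\theta,\phi,\ell}(x)\right)$$

holds by construction. Since the expected cost of computing $Z_{\theta,\phi,\ell}(x)$ is of $O(2^{\ell})$, an adequate choice is $\omega_\ell \propto 2^{-3\ell/2}$, for which both the variance and the expected cost of the estimator remain finite.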
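The randomized-level estimator described above can be sketched on a toy problem where the answer is known. The following is a minimal illustration, not the authors' implementation: the Gaussian model ($z \sim N(0,1)$, $x \mid z \sim N(z + \theta, 1)$, so that $x \sim N(\theta, 2)$), the proposal $q(z \mid x) = N(0.4x, 1)$, and all function names are assumptions made for this example. Levels are drawn from a geometric distribution so that the level probabilities are proportional to $2^{-3\ell/2}$, which balances an $O(2^{-2\ell})$ variance against an $O(2^{\ell})$ cost.

```python
import numpy as np

A = 0.4          # proposal mean coefficient (assumed for this toy example)
R = 2.0 ** -1.5  # level probabilities: omega_l = (1 - R) * R**l


def log_w(x, z, theta):
    """log w = log p_theta(x, z) - log q(z | x) for the toy Gaussian model."""
    log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * z**2
    log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (x - z - theta) ** 2
    log_q = -0.5 * np.log(2 * np.pi) - 0.5 * (z - A * x) ** 2
    return log_prior + log_lik - log_q


def log_mean_exp(a):
    """Numerically stable log of the mean of exp(a)."""
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))


def sample_Z(x, level, theta, rng):
    """One draw of the coupled level difference Z_level(x)."""
    n = 2**level
    z = A * x + rng.standard_normal(n)  # z_1, ..., z_n ~ q(z | x)
    lw = log_w(x, z, theta)
    fine = log_mean_exp(lw)             # uses all 2^level samples
    if level == 0:
        return fine
    half = n // 2
    # Antithetic coarse term: average of the two half-sample estimates.
    coarse = 0.5 * (log_mean_exp(lw[:half]) + log_mean_exp(lw[half:]))
    return fine - coarse


def mlmc_log_marginal(X, theta, M, rng):
    """Unbiased estimate of sum_n log p_theta(x_n) from M random (x, level) pairs."""
    total = 0.0
    for _ in range(M):
        x = X[rng.integers(len(X))]
        level = int(rng.geometric(1.0 - R)) - 1  # P(level = l) = (1 - R) * R**l
        omega = (1.0 - R) * R**level
        total += sample_Z(x, level, theta, rng) / omega
    return len(X) * total / M


rng = np.random.default_rng(0)
X = np.array([-1.0, 0.2, 0.5, 1.3, 2.0])
theta = 0.3
exact = np.sum(-0.5 * np.log(4 * np.pi) - (X - theta) ** 2 / 4)
print("MLMC estimate:", mlmc_log_marginal(X, theta, 20000, rng))
print("exact value:  ", exact)
```

Because the level distribution has unbounded support, no truncation bias is introduced; the price is that an occasional draw uses many inner samples. By Jensen's inequality the coupled difference is non-negative for every level above zero, which is a convenient sanity check on an implementation.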

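Rather than maximizing the evidence lower bound with respect to both $\theta$ and $\phi$ as in [7], our proposal here is to maximize the log marginal likelihood with respect to $\theta$ and, at the same time, to maximize the evidence lower bound (or equivalently, to minimize the Kullback-Leibler divergence) with respect to $\phi$. Although related works such as [2, 8] have used intermediate quantities between the evidence lower bound and the log marginal likelihood, as far as the authors know, none of them has succeeded in directly targeting the log marginal likelihood when it cannot be evaluated analytically. Using the telescoping sum representation in (1), the gradient of the log marginal likelihood with respect to $\theta$ for fixed $\phi$ is given by

$$\nabla_\theta \log p_\theta(X) = N \, \mathbb{E}_{x \in X} \, \mathbb{E}_{\ell \sim \omega}\left[\frac{\mathbb{E}\left[\nabla_\theta Z_{\theta,\phi,\ell}(x)\right]}{\omega_\ell}\right],$$

where we have

$$\nabla_\theta Z_{\theta,\phi,\ell}(x) = \nabla_\theta P_{\theta,\phi,\ell}(x) - \frac{1}{2}\left[\nabla_\theta P^{(a)}_{\theta,\phi,\ell-1}(x) + \nabla_\theta P^{(b)}_{\theta,\phi,\ell-1}(x)\right], \qquad \nabla_\theta P_{\theta,\phi,\ell}(x) = \frac{\sum_{k=1}^{2^{\ell}} \nabla_\theta w_{\theta,\phi}(x, z_k)}{\sum_{k=1}^{2^{\ell}} w_{\theta,\phi}(x, z_k)},$$

$$\nabla_\theta P^{(a)}_{\theta,\phi,\ell-1}(x) = \frac{\sum_{k=1}^{2^{\ell-1}} \nabla_\theta w_{\theta,\phi}(x, z_k)}{\sum_{k=1}^{2^{\ell-1}} w_{\theta,\phi}(x, z_k)},$$

$$\nabla_\theta P^{(b)}_{\theta,\phi,\ell-1}(x) = \frac{\sum_{k=2^{\ell-1}+1}^{2^{\ell}} \nabla_\theta w_{\theta,\phi}(x, z_k)}{\sum_{k=2^{\ell-1}+1}^{2^{\ell}} w_{\theta,\phi}(x, z_k)}$$

for $\ell > 0$, and $\nabla_\theta Z_{\theta,\phi,0}(x) = \nabla_\theta P_{\theta,\phi,0}(x)$, whereas the gradient of the evidence lower bound $\mathcal{L}_{\theta,\phi}(X) = \sum_{n=1}^{N} \mathbb{E}_{z \sim q_\phi(z \mid x^{(n)})}\left[\log w_{\theta,\phi}(x^{(n)}, z)\right]$ with respect to $\phi$ is given by

$$\nabla_\phi \mathcal{L}_{\theta,\phi}(X) = N \, \mathbb{E}_{x \in X}\left[\nabla_\phi \, \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log w_{\theta,\phi}(x, z)\right]\right].$$

This way we can construct unbiased Monte Carlo estimators for both of the gradients $\nabla_\theta \log p_\theta(X)$ and $\nabla_\phi \mathcal{L}_{\theta,\phi}(X)$, which are

$$\frac{N}{M} \sum_{m=1}^{M} \frac{\nabla_\theta Z_{\theta,\phi,\ell^{(m)}}(x^{(m)})}{\omega_{\ell^{(m)}}} \quad \text{and} \quad \frac{N}{M} \sum_{m=1}^{M} \nabla_\phi \log w_{\theta,\phi}(x^{(m)}, z^{(m)}),$$

respectively, for any batch size $M > 0$, in which the common stochastic samples $x^{(m)}$ and those on $z$ for each $x^{(m)}$ can be used. In the second estimator the gradient is understood to act also on the sample $z^{(m)} \sim q_\phi(z \mid x^{(m)})$ through its dependence on $\phi$, for example via the reparameterization used in [7]. Finally, it must be pointed out that, as inferred from [6, Lemma 2], studied in a quite different context, by exploiting the properties

$$\frac{1}{2}\left[\exp\left(P^{(a)}_{\theta,\phi,\ell-1}(x)\right) + \exp\left(P^{(b)}_{\theta,\phi,\ell-1}(x)\right)\right] = \exp\left(P_{\theta,\phi,\ell}(x)\right)$$

and

$$\frac{1}{2}\left[\exp\left(P^{(a)}_{\theta,\phi,\ell-1}(x)\right) \nabla_\theta P^{(a)}_{\theta,\phi,\ell-1}(x) + \exp\left(P^{(b)}_{\theta,\phi,\ell-1}(x)\right) \nabla_\theta P^{(b)}_{\theta,\phi,\ell-1}(x)\right] = \exp\left(P_{\theta,\phi,\ell}(x)\right) \nabla_\theta P_{\theta,\phi,\ell}(x),$$

the variance of every component of $\nabla_\theta Z_{\theta,\phi,\ell}(x)$ is shown to be of $O(2^{-2\ell})$ under suitable moment conditions on $w_{\theta,\phi}$ and $\nabla_\theta w_{\theta,\phi}$, so that an adequate choice for $\omega = (\omega_0, \omega_1, \dots)$ will remain the same, i.e., $\omega_\ell \propto 2^{-3\ell/2}$, even for gradient estimations.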
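The gradient estimator with respect to the model parameter can be sketched in the same way. The following is a minimal illustration under stated assumptions rather than the authors' code: the toy Gaussian model ($z \sim N(0,1)$, $x \mid z \sim N(z + \theta, 1)$), the fixed proposal $q(z \mid x) = N(0.4x, 1)$, and the function names are all hypothetical. For this model the gradient of $\log w$ with respect to $\theta$ is $x - z - \theta$, and the gradient of each log-average is a self-normalized weighted mean of these terms, so the exact answer $\sum_n (x^{(n)} - \theta)/2$ is available for comparison.

```python
import numpy as np

A = 0.4          # proposal mean coefficient (assumed for this toy example)
R = 2.0 ** -1.5  # level probabilities: omega_l = (1 - R) * R**l


def log_w(x, z, theta):
    """log w = log p_theta(x, z) - log q(z | x) for the toy Gaussian model."""
    log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * z**2
    log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (x - z - theta) ** 2
    log_q = -0.5 * np.log(2 * np.pi) - 0.5 * (z - A * x) ** 2
    return log_prior + log_lik - log_q


def weighted_mean(lw, g):
    """sum_k w_k g_k / sum_k w_k, i.e. the gradient of log(mean(w))."""
    p = np.exp(lw - np.max(lw))  # stable, unnormalized weights
    return np.sum(p * g) / np.sum(p)


def sample_grad_Z(x, level, theta, rng):
    """One draw of the theta-gradient of the coupled level difference."""
    n = 2**level
    z = A * x + rng.standard_normal(n)  # z_1, ..., z_n ~ q(z | x)
    lw = log_w(x, z, theta)
    g = x - z - theta                   # d/dtheta of log w for this model
    fine = weighted_mean(lw, g)
    if level == 0:
        return fine
    half = n // 2
    coarse = 0.5 * (weighted_mean(lw[:half], g[:half])
                    + weighted_mean(lw[half:], g[half:]))
    return fine - coarse


def mlmc_grad_log_marginal(X, theta, M, rng):
    """Unbiased estimate of d/dtheta sum_n log p_theta(x_n)."""
    total = 0.0
    for _ in range(M):
        x = X[rng.integers(len(X))]
        level = int(rng.geometric(1.0 - R)) - 1  # P(level = l) = (1 - R) * R**l
        omega = (1.0 - R) * R**level
        total += sample_grad_Z(x, level, theta, rng) / omega
    return len(X) * total / M


rng = np.random.default_rng(0)
X = np.array([-1.0, 0.2, 0.5, 1.3, 2.0])
theta = 0.3
print("MLMC gradient estimate:", mlmc_grad_log_marginal(X, theta, 20000, rng))
print("exact gradient:        ", np.sum((X - theta) / 2))
```

In a stochastic gradient loop, such an estimate would drive the update of $\theta$, while the same samples of $x$ and $z$ could be reused for the update of the proposal parameter. The proposal is held fixed here to keep the sketch short.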

The authors plan to report, in the near future, more applications of multilevel Monte Carlo approaches to various Bayesian computations, such as variational inference for global latent variables using a locally marginalized evidence lower bound, and computations of various metrics and their gradients, including mutual information, the reversed KL divergence, the variational Rényi bound and the $\chi$-upper bound.

References

[1] Blei, D. M., Kucukelbir, A., McAuliffe, J. D. (2017) Variational inference: a review for statisticians, Journal of the American Statistical Association, 112 (518), 859–877.

[2] Burda, Y., Grosse, R., Salakhutdinov, R. (2016) Importance weighted autoencoders, arXiv:1509.00519.

[3] Giles, M. B. (2008) Multilevel Monte Carlo path simulation, Operations Research, 56, 607–617.

[4] Giles, M. B. (2015) Multilevel Monte Carlo methods, Acta Numerica, 24, 259–328.

[5] Goda, T., Hironaka, T., Iwamoto, T. (2019) Multilevel Monte Carlo estimation of expected information gains, arXiv:1811.07546 (accepted for publication in Stochastic Analysis and Applications).

[6] Hironaka, T., Giles, M. B., Goda, T., Thom, H. (2019) Multilevel Monte Carlo estimation of the expected value of sample information, arXiv:1909.00549.

[7] Kingma, D. P., Welling, M. (2014) Auto-encoding variational Bayes, arXiv:1312.6114.

[8] Nowozin, S. (2018) Debiasing evidence approximations: on importance-weighted autoencoders and Jackknife variational inference, ICLR 2018 conference paper.

[9] Rhee, C. H., Glynn, P. (2015) Unbiased estimation with square root convergence for SDE models, Operations Research, 63, 1026–1043.
