CATVI: Conditional and Adaptively Truncated Variational Inference for Hierarchical Bayesian Nonparametric Models

2020·arXiv

Abstract

Abstract

Current variational inference methods for hierarchical Bayesian nonparametric models can neither characterize the correlation structure among latent variables due to the mean-field setting, nor infer the true posterior dimension because of the universal truncation. To overcome these limitations, we propose the conditional and adaptively truncated variational inference method (CATVI) by maximizing the nonparametric evidence lower bound and integrating Monte Carlo into the variational inference framework. CATVI enjoys several advantages over traditional methods, including a smaller divergence between variational and true posteriors, reduced risk of underfitting or overfit-ting, and improved prediction accuracy. Empirical studies on three large datasets reveal that CATVI applied in Bayesian nonparametric topic models substantially outperforms competing models, providing lower perplexity and clearer topic-words clustering.

1 INTRODUCTION

Hierarchical Bayesian nonparametric (HBNP) models are widely used in bioinfomatics, language processing, computer vision and network analysis (Sudderth and Jordan, 2009; Caron and Fox, 2017; Williamson, 2016; Yurochkin et al., 2019). A major benefit of HBNP models is their ability to relax the fixed dimension assumption in parametric models. For example, in natural language processing, hierarchical Dirichlet process (HDP) model (Teh et al., 2006) replaces the finite-dimensional Dirichlet distribution in

Proceedings of the 25International Conference on Artifi-cial Intelligence and Statistics (AISTATS) 2022, Valencia, Spain. PMLR: Volume 151. Copyright 2022 by the author(s).

latent Dirichlet allocation (LDA) with a countable-dimensional Dirichlet process (DP). This is done by regarding the number of topics as a random variable that can be inferred from the data, rather than as a parametric value (Blei et al., 2003).

However, it is much harder to implement HBNP models than their parametric counterparts. In particular, due to a HBNP model’s infinite-dimensional nature, a finite-dimensional truncation is needed to approximate the posterior. Yet, the selection of the optimal truncation level poses several challenges. On one hand, the traditional Markov chain Monte Carlo (MCMC) methods (Teh et al., 2006) can produce an adaptive selection of the truncated dimension, but they are not computationally scalable especially for big data. On the other hand, standard variational inference methods (Teh et al., 2008; Wang et al., 2011; Hoffman et al., 2013; Roychowdhury and Kulis, 2015; Xu et al., 2019) can accelerate the computation, but they suffer from a universal selection of the truncation level by truncating the dimension of all latent variables to a prespec-ified value. Using a prespecified value is problematic, because a subjective selection of the fixed truncation level can make the model prone to overfitting or un-derfitting, leading to low predictive accuracy. These existing challenges in universal truncation contradict the motivation and advantages of using HBNP models.

In this paper, we propose a general framework, called conditional and adaptively truncated variational inference (CATVI), to infer HBNP models in the following steps. First, we convert the inference problem to an optimization task of maximizing our proposed nonparametric evidence lower bound based on finite partitions. Second, we introduce a conditional setting when factorizing variational distributions by conditioning variables in the middle layers on two adjacent layers. Third, to handle big data, we develop a stochastic variational inference framework (Blei et al., 2017) under our conditional setting. Finally, we obtain empirical distributions from Monte Carlo sampling of local latent variables, which are then used to update the variational parameters for the global latent vari-

Figure 1: The HBNP models. The blue and red boxes correspond to J and replicates, respectively.

ables. This enables us to truncate the dimension of the latent variational distributions to that of the empirical distribution.

Our proposed method benefits from both the inferential accuracy of Monte Carlo sampling and the computational efficiency of variational inference. First, our method rebuilds the correlation structure and hence attains a smaller Kullback–Leibler (KL) divergence between the variational distribution and the true posterior. Such procedure removes the unrealistic mean-field assumption, and searches for an optimal variational distribution over a wider family. Second, it adjusts the dimension of variation distributions, which converges to a stable level balancing the goodness-of-fit and model complexity. With these advantages, CATVI provides an adaptive selection of the truncated dimension, reducing the risk of over-fitting or underfitting, while also enabling more accurate predictions without sacrificing the computational efficiency. Specific to the inference for the HDP model, CATVI enjoys several advantages over existing methods (Teh et al., 2006; Hoffman et al., 2013; Wang and Blei, 2012; Bryant and Sudderth, 2012), see Section 6 for our detailed discussion.

2 BACKGROUND: HBNP MODELS

As a subclass of Bayesian nonparametric models, HBNP models extend the simplicity of using random measures (see Appendix A) as priors to the following hierarchical structure,

for j = 1, . . . , J, i = 1as illustrated in Figure 1. In the top layer, are generated from a random measure R with common base measure , while in the bottom layer, itself is a realization of random measure P with base measure H. To ensure exchangeability, are assumed to be identical and independent given . Each local latent variable is sampled from independently. Finally, the global parameter is assigned a prior ), and the observation is generated from a likelihood function f, parameterized by both global latent variable and local latent variables .

In topic modelling, the HDP model (Teh et al., 2006) uses a DP for both P and R in (1) as,

where are concentration parameters, and are normalized based measures (see Appendix A). Suppose a corpus has J documents, each document j has words, and each word is chosen from a vocabulary with W terms. Specifically, = is generated from the distribution DP(), and for each document j, a topic proportion, defined as = , is independently sampled from the dis- tribution DP(). For each topic k, the distribution of words over vocabulary is sampled from a W-dimensional Dirichlet distribution parameterized by , = (Dir(). For each word i in document j, a topic assignment is chosen from Multinomial(), where represents topic k. Finally, the observation is generated from the assigned topic and the corresponding within-topic word distribution, Multinomial().

The necessity to let be atomic can be shown in the HDP model. If are sampled from a Dirichlet process with a diffuse base measure instead of an atomic will not share any support almost surely, and thus none of the topics being shared across the documents. However, for a general HBNP model, as long as is atomic, it is not necessary to restrict the prior for to be a Dirichlet process or a probability random measure. For example, the ΓDP model, which has the following structure,

allows for more flexibility by removing the constraint on the concentration parameter in the top layer. Other choices of the prior for include beta process, stable process and inverse Gaussian process (Ghosal and Van der Vaart, 2017).

To infer the HBNP models, we set up the theoretical foundations for nonparametric KL divergence and evidence lower bound, and then propose a novel methodology in the following Sections 3 and 4.

3 NONPARAMETRIC EVIDENCE LOWER BOUND

3.1 KL Divergence between Random Measures

The object of variational inference is to minimize the KL divergence between the variational distribution and the true posterior. For two infinite-dimensional random measures, their KL divergence is well defined even though an infinite-dimensional density function does not exist in a conventional sense. Given two random measures P and Q from (Θ, M) into (Ω, F), the Radon–Nikodym derivative dQ/dP exits if Q is absolutely continuous with respect to P. Their KL divergence is defined as

which is intractable due to the infinite-dimensional integral (Matthews et al., 2016). We have developed a new approach to calculate it using the limit superior of the divergence between corresponding finite-dimensional induced measures, that is,

where and are respectively induced measures from P and Q on a finite partition = (such that ) = ) and ) = ) for each . With an induced random variable : Θ , we can also denote the induced measures by ) and ). The result in (4) is justified in Appendix C.1. We use the following two examples to illustrate (4).

Example 1 For Poisson processes P = PP(Λ + ) and Q = PP(Λ+), where Λ is the intensity function defined on Ω, , and is a Dirac function at point Ω. Under partition = (the limit superior in (4) is achieved, that is,

where Pois(a) is the Poisson distribution with intensity a.

Example 2 For Dirichlet processes P = DP(+ ) and Q = DP(), where H is the base measure, is the concentration parameter, and Ω for i = 1, . . . n. Similarly, under the partition

where Dir() is the Dirichlet distribution with parameters .

With the KL divergence between random measures represented under a finite partition, we can then de-fine the nonparametric counterpart of evidence lower bound below.

3.2 Nonparametric Evidence Lower Bound

The parametric variational inference algorithm uses a finite-dimensional variational distribution to approximate the posterior by maximizing the evidence lower bound (Blei et al., 2017). In contrast, HBNP models uses a random measure for the variational distribution, due to the infinite dimensionality of latent variables. We propose a general inference framework for HBNP models by maximizing the nonparametric evidence lower bound (NPELBO), defined as the limit inferior of the parametric evidence lower bound under a finite partition, lim inf (ELBOthat is

where ) and ) correspond to the induced measures from the joint distribution and the variational distribution on , and where Z and X are the observations and latent variables, respectively. Moreover, given the KL divergence between random measures in (4), in Appendix C.2 we show that

This demonstrates the equivalence between maximizing the NPELBO in (5) and minimizing the KL divergence between the variational distribution Q(Z) and the true posterior P(Z|X). The task of maximizing the NPELBO is general and can be applied broadly within Bayesian nonparametrics. To simplify notation, we will use ) and ) to denote the true and variational distributions, respectively, where the context is clear. To infer HBNP models, we aim to maximize the defined NPELBO, while truncating the dimension of variational distribution adaptively as follows.

4 METHODOLOGY

CATVI adopts the stochastic variational inference framework (Hoffman et al., 2013), where the computation is accelerated by selecting a small batch of data and updating variational parameters with an unbiased random gradient. We first build the foundation of conditional variational inference as follows.

4.1 Conditional Variational Inference

Conditional setting HBNP models in (1) contain global latent variable , local latent variables z, global prior , local priors and observations x, where , and = . We aim to find the variational distribution to maximize the NPELBO. In contrast to traditional approaches under the mean-field setting, we factorize the variational distribution

as

(7) in the sense of the probability law. Such conditional design facilitates the recovery of the dependence structure among and z.

Combing (5) and (7), we seek to maximize the following NPELBO:

where the entropy log ) and is a partition of the sample space Ω for and

Conditional variational distribution To maximize the NPELBO in (8), we first compute the optimal variational distribution of given and for each j. As ) =

), the non-constant term in (8) with respect to ) is

lim inf

Note that the above expression can be viewed as the negative of a KL divergence whose maximum is zero. Therefore, to enable the NPELBO to reach the maximum, the optimal conditional variational distribution for should be Consequently, NPELBO in (8) does not contain any term related to ). In Appendix C.3, we derive NPELBO with respect to ) as

up to a constant. It is important to note that Eis with respect to the prior ) instead of the variational distribution ), and hence this expectation can often be calculated analytically in HBNP models due to the conjugacy.

4.2 Empirical Distribution and Evidence Lower Bound

Within the conditional variational freamework, for the task of adaptive truncation, CATVI integrates Monte Carlo sampling to variational inference by iterating the following steps till convergence, (i) using Monte Carlo sampling to get an empirical optimal variational distribution for local variables z and (ii) updating the variational distributions for global variables and .

Empirical distribution From the entire data x, we randomly sample a subset , where S is the batch size with . Given a partition in the current training iteration, we aim to update the parameters for ) conditional on ) and . While standard stochastic variational in- ference updates parameters analytically, we use Monte Carlo sampling to draw samples for each from thus constructing an empirical distribution,

Empirical evidence lower bound Using the empirical distribution ˆ), we obtain an empirical evidence lower bound with respect to ELBOby replacing ) in (9) with ˆ), that is,

up to a constant. It is obvious that E( ELBO) = ELBOthus satisfying the key condition for stochastic variational inference (Hoffman et al., 2013), that is, the random gradient is unbiased. Therefore, according to (10), we can use the random gradient generated from ˆto update the parameters for ).

Resampling We next present the procedure to get the empirical distribution ˆ). As is integrated out, the local latent variables can not be sam- pled independently when we use Monte Carlo sampling to draw ˆgiven ) and Therefore, we propose the following Gibbs sampling approach to get samples under optimal variational distributions. Conditional on ) and samples ˆ= 1, it follows from (8) that the optimal variational distribution of log ) is proportional to

Then we sample ˆ) for each i iteratively, which constructs a Markov chain. Noting that ) is often multinomial, sampling from its logarithm is commonly used. After the convergence, we can resample ˆfrom the stable Markov chain to update the parameters of ) according to (10). Similarly, we derive the empirical evidence lower bound with respect to ) in Appendix B.1 and can update the parameters for ) using ˆcorrespondingly.

4.3 Adaptive Truncation

Finally, we seek to obtain the finite partition that could reach the limit inferior in NPELBO. Rather than having fixed on a universal truncation level, we enable the dimension of to gradually adjust to a stable level. This partition or truncation is dependent on data-fitting and embedded within the optimization process, providing another key advantage of using a Monte Carlo sampling scheme in the stochastic variational inference framework.

Partition refinement According to the structure of HBNP models, ˆare sampled from the atomic support of . Without loss of generality, we assume that the current partition consists of atomic elements and their complement = Ω. Under this partition, ) is positive given (11), and hence ˆcan be sampled within , that is, ˆis a new sample, distinct from . If this happens, we draw a new and refine the partition as, where is updated as Ω.

Remark: The partition refinement procedure reaches the limit inferior of empirical evidence lower bound as follows. Since there is no sampling within after each update, to minimize the KL divergence, the posterior should be proportional to the prior on Moreover, if we further partition into , the KL divergence stays the same. Thus, E( ELBO) = ELBO= NPELBO. See Appendix C.5 for a justification.

We summarize the CATVI algorithm in Algorithm 1.

5 APPLICATIONS IN TOPIC MODELS

5.1 CATVI for the HDP Model

We apply the proposed CATVI method to the HDP model. Specifically, we factorize the variational distributions in the conditional setting and specify the

variational family as follows. First, the variational distribution of for each s is given by ) = DP+ , where = = ) and ) is the indicator function. Second, ) for each topic k is set as a W-dimensional Dirichlet distribution, ) = Dirichlet(), where = (is the parameter of vocabulary distribution for topic k. The variational distribution for topics without any observation remains the same as the prior, hence ) = Dirichlet(). Third, we specify the variational family for using spike and slab distributions (Andersen et al., 2017) as ) = + DP(such that = 1. Finally, following (11) we use Monte Carlo sampling to obtain samples , avoiding the need to parametrize their variational distributions.

As different samples in are used to represent different topic clusters in topic modelling, their exact values in the sample space do not contain any statistical information. We can then simply index the topics from 1 to K and denote the different clusters by distinct points in Ω, and cluster 0 is the topic without any observation. Given samples , we define the number of topics with obser- vations by K = ˆ0), where ˆ= In Appendix C.4, we rely on (10) to derive the empirical evidence lower bound

with respect to

up to a constant. According to Algorithm 1, we repeatedly select documents of a batch size S, sample , and update parameters for and iter- atively until the empirical evidence lower bound converges to its maximum. During Gibbs sampling, once a document is sampled in cluster 0, we add a new cluster K + 1, thus partitioning to be (K + 1)-dimensional, with K single points and one complement set = ΩDuring the training, this procedure is repeated until is optimized. See Appendix B.2 for detailed steps on updating the variational parameters and refining the partition.

5.2 CATVI for Generic HBNP Models

The CATVI algorithm can also be applied to a general class of HBNP models, where the global prior is generated from a completely random measure. In these models, the concentration parameter for any is not fixed, and is not restricted to be a probability measure. The corresponding inference algorithm is similar to that of the HDP model, but requires a new parameter to approximate (Ω). We choose the variational family for the global prior as ) = + where is the normalization of the corresponding completely random measure and = 1. We provide the corresponding empirical evidence lower bounds and algorithms to infer more general HBNP models, including ΓDP model, in Appendix B.3.

6 RELATIONSHIP TO RELATED WORKS

In this section, we discuss several advantages of CATVI compared with traditional methods (Hoffman et al., 2013; Wang et al., 2011; Wang and Blei, 2012), although these are spe-cific to the inference for the HDP model. First, CATVI replaces the unrealistic mean-field assumption with the conditional setting to capture the correlation structure among latent variables. Second, CATVI approximates the posterior groupwisely instead of updating the stick-breaking parameters sequentially, and hence avoids the gradient vanishing problem. By contrast, Hoffman et al. (2013) and Wang et al. (2011) perform inference separately over each atomic location and weight of using the stick-breaking representation (1 ), where s are the representation parameters. However, this may cause the gradient vanishing problem of if k is large, because (1 ) is close to zero. Third, these traditional methods universally truncate the dimension of to a fixed level, contradicting the motivation and advantages of using HBNP models. Finally, CATVI is guaranteed to maximize the NPELBO. By comparison, Wang and Blei (2012) update parameters using the locally collapsed Gibbs sampling, but their work leads to an approximation that fails to maximize the ELBO, especially when the variance of distributions is large.

From a computational perspective, CATVI inherits the fast speed of stochastic variational inference, while other methods that truncate the dimension in a truly nonparametric way are very slow, such as the split-merge variational inference (Bryant and Sudderth, 2012) and the pure Gibbs sampling (Teh et al., 2006). To check the split-merge criterion, the split-merge variational inference requires calculating the likelihood before and after a split or merge, which is computationally infeasible in practice. Moreover, the pure Gibbs sampling is not scalable as well. As pure Gibbs sampling does not have batch selection, the Markov chains would converge very slowly when the sample size is large. As a result, these methods cannot be used to handle big data.

7 EXPERIMENTS

7.1 Datasets and Architectures

We apply the CATVI algorithm to three large datasets, arXiv, NYT and Wiki, and compare the performance of CATVI with the online variational inference (OVI) (Wang et al., 2011), the memorized online variational inference (MOVI) (Hughes and Sudderth, 2013), the split-merge variational inference (SMVI) (Bryant and Sudderth, 2012) and Gibbs sampling (GS) (Teh et al., 2004).

arXiv The corpus contains descriptive metadata of articles on arXiv up to September 1, 2019, resulting in 1.03M documents and 44M words from a vocabulary of 7,500 terms.

NYT The corpus contains all articles published by New York Times from January 1987 to June 2007 (Sandhaus, 2008), resulting in 1.56M documents and 176M words from a vocabulary of 7,600 terms.

Wiki The corpus contains entries from all English Wikipedia websites on January 1, 2019, resulting in 4.03M documents and 423M words from a vocabulary of 8,000 terms.

For the preprocessing, stemming and lemmatization are used to clean the raw text, and then words with too high or too low frequency, as well as common stop words, are filtered out.

Figure 2: Left column: plots for the perplexity vs the running time up to 5 hours, Right column: plots for the number of topics vs the running time.

To evaluate the performance of CATVI, we set aside a test set of 10,000 documents for each dataset and calculate the predictive perplexity as

perplexity = exp

where Dand Drepresent the training and test data, respectively, and are the training and test words in test document j, respectively, and is the number of words in (Ranganath and Blei, 2018). The perplexity measures the uncertainty of fit-ted models, where a lower perplexity will result in a better language model with higher predictive likelihood. Since the perplexity can not be computed exactly, the standard routine uses Dto compute the variational distribution for and , then obtains the variational distribution for based on and , and then approximates the likelihood by ) =

, whereandare the variational expectations of and , respectively (Blei et al., 2003). Experiments are run with the three datasets above using both the HDP and ΓDP models. For the HDP model, we set the hyperparameters as = 5, where and are the concentration parameters for and respectively, and is the hyperparameter for the prior on the distribution of words. The initial number of topics is set to be 100. The parameters are then optimized using stochastic gradient descent, with a batch size of 256 and a linear decaying learning rate adopted in Hoffman et al. (2010). For the ΓDP model, we use the same settings but discard . In the experiments, we remove clusters with fewer than 1 document during the training.

7.2 Empirical Results

Predictive perplexity The top row of Figure 2 plots the predictive perplexity as a function of running time for the three comparison methods using the three datasets. As MOVI, SMVI and GS provide much higher perplexities, we do not plot their results in Figure 2. Table 1 reports numerical summaries for all comparison methods. In particular, as GS can not scale to large datasets, we use a subset with 500 documents to run the experiments. Several conclusions can be drawn here. First, on all three datasets, CATVI uniformly outperforms competing methods. The improvement is highly consequential, especially for arXiv and Wiki. For NYT, there is moderate improvement, likely due to the long length of documents in this corpus. Second, for each dataset, the ΓDP model attains a lower perplexity than the HDP model, consistent with the fact that the ΓDP model removes a restriction of the HDP model and hence is more flexible. Third, CATVI is empirically shown to be computationally ef-ficient, reaching the lowest perplexity within the same training time. Although it involves Monte Carlo sampling, the perplexity converges fast. This is because the convergence of local Markov chains to assign words to topics is accelerated by a clear topic-words clustering as the global variational distributions approach to the optimal.

Number of topics The bottom row of Figure 2 plots the number of topics during the training process. For OVI, the number of topics remains constant at the prespecified value, while for CATVI, this value first increases steeply and then converges to a stable level. For example, the number of topics in Wiki sharply

Table 1: A summary of predictive perplexity results.

increase from 100 to around 190 for the HDP model and around 200 for ΓDP model. The sharp increase is driven by the data complexity, while the stable level is achieved due to the dimension penalty effect from the priors in HBNP models. Although the estimation of the number of topics is not consistent, CATVI can provide some useful information about topics in data. For instance, the data from the arXiv corpus in these experiments are limited to abstracts of scientific articles, and thus it has the smallest number of topics. By contrast, NYT is a compilation of all new articles covering a wider range of areas, and hence consists of more topics. Similarly, Wiki has the largest number of topics as it contains almost every aspect of an encyclopedia. It is important to note that we do not need to set a fixed number of topics before the inference. Instead, CATVI starts from an initial value, for example 100 in our experiments, then automatically converges to a stable optimal number of topics.

Topic-words clustering CATVI is shown to reveal much better linguistic results. To compare CATVI with OVI for the HDP model, we report the top 12 words in the top 10 topics with biggest weights for both methods on arXiv and Wiki in Tables 2a and 2b, respectively. We observe a few apparent patterns. First, the topic-word clusters from CATVI hardly contains replicated topics, whereas those from OVI results have similar word components, such as those shown in blue in columns 1-6 in the bottom part of Table 2a. An ideal topic-word clustering should allocate these words into just one topic. However, the prespecified number of topics is fixed at 150 in OVI, which is larger than the ground truth, resulting in generating replicated topics. By contrast, the topic-word clustering by CATVI does not have such redundancy. It is apparent that our top 10 topics are mostly distinct. Second, CATVI leads to much clearer topic-word clustering. For both datasets, our results indicate that all of our detected words within any column are highly relevant and should intuitively be grouped into one cluster with clear linguistic meaning. For example, column 7 of Table 2b for CATVI presents several words all related to military, but words in the same column for OVI seem to be a mixture of several loosely connected topics including ‘human, character, reveal, episode, comic, voice’, ‘human, earth’ and ‘human, kill, attack, fight, battle, doctor’. This mixture of topics makes the topic-word clustering in this column ambiguous. Furthermore, CATVI identifies a topic about popular English given names in column 5 of Table 2b. Although these given names are not shown in a single document, CATVI can successfully discover that they belong to one topic, while OVI fails. This is because CATVI does not force the topics to merge together if the prespecified number of topics is not large enough, thus reducing the noise in the clusters.

We also perform sensitivity analysis of CATVI using arXiv under the HDP model as an example. The left and right panels of Figure 3 in Appendix E respectively plot the results as the batch size varies from 128 to 1024 and the initial number of topics varies from 60 to 140. We observe that the performance is not sensitive to the change of these hyperparameters. Moreover, the best results are obtained for the case with a smaller batch size and a larger initial number of topics.

8 DISCUSSION

CATVI can also be applied to other HBNP models including, for example, hierarchical Pitman–Yor process model (Teh and Jordan, 2010) and hierarchical beta process model (Thibaux and Jordan, 2007). CATVI will provide more advantages in these applications, because the hierarchical Pitman–Yor process, with heavy tail behavior, and the hierarchical beta process, with sparse structure, may suffer more from the universal truncation.

Acknowledgments

We thank the anonymous reviewers for useful comments during the review process.

Opinions expressed in this paper are those of the authors, and do not necessarily reflect the view of J.P. Morgan. Opinions and estimates constitute our judgement as of the date of this Material, are for informational purposes only and are subject to change without notice. This Material is not the product of J.P. Morgan’s Research Department and therefore, has not been prepared in accordance with legal requirements to promote the independence of research, including but not limited to, the prohibition on the dealing ahead of the dissemination of investment research. This Material is not intended as research, a recommendation, advice, offer or solicitation for the purchase or sale of any financial product or service, or to be used in any way for evaluating the merits of participating in any transaction. It is not a research report and is not in-

Table 2: Top 12 words in top 10 topics.

tended as such. Past performance is not indicative of future results. Please consult your own advisors regarding legal, tax, accounting or any other aspects including suitability implications for your particular circumstances. J.P. Morgan disclaims any responsibility or liability whatsoever for the quality, accuracy or completeness of the information herein, and for any reliance on, or use of this material in any way. Important disclosures at: www.jpmorgan.com/disclosures.

References

Andersen, M. R., Vehtari, A., Winther, O., and Hansen, L. K. (2017). Bayesian inference for spatiotemporal spike-and-slab priors. Journal of Machine Learning Research, 18(1):5076–5133.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association, 112(518):859–877.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Bryant, M. and Sudderth, E. B. (2012). Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 25, pages 2699–2707.

Caron, F. and Fox, E. B. (2017). Sparse graphs using exchangeable random measures. Journal of the Royal Statistical Society: Series B, 79(5):1295–1366.

Ghosal, S. and Van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press, Cambridge.

Hoffman, M., Bach, F. R., and Blei, D. M. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 23, pages 856–864.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.

Hughes, M. C. and Sudderth, E. (2013). Memoized on- line variational inference for Dirichlet process mixture models. In Advances in Neural Information Processing Systems 26, pages 1133–1141.

Kingman, J. F. C. (1993). Poisson Processes. Clarendon Press, Oxford.

Matthews, A. G. d. G., Hensman, J., Turner, R., and Ghahramani, Z. (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics.

Ranganath, R. and Blei, D. M. (2018). Correlated ran- dom measures. Journal of the American statistical Association, 113(521):417–430.

Roychowdhury, A. and Kulis, B. (2015). Gamma pro- cesses, stick-breaking, and variational inference. In Proceedings of the 8th International Conference on Artificial Intelligence and Statistics.

Sandhaus, E. (2008). The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia.

Sudderth, E. B. and Jordan, M. I. (2009). Shared segmentation of natural scenes using dependent Pitman–Yor processes. In Advances in Neural Information Processing Systems 21, pages 1585–1592.

Teh, Y., Jordan, M., Beal, M., and Blei, D. (2004). Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes. In Advances in Neural Information Processing Systems 17.

Teh, Y. W. and Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. In Bayesian Nonparametrics, pages 158–207. Cambridge University Press.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Teh, Y. W., Kurihara, K., and Welling, M. (2008). Collapsed variational inference for HDP. In Advances in Neural Information Processing Systems 20.

Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, volume 2, pages 564–571.

Wang, C. and Blei, D. M. (2012). Truncation-free on- line variational inference for Bayesian nonparametric models. In Advances in Neural Information Processing Systems 25.

Wang, C., Paisley, J., and Blei, D. M. (2011). On- line variational inference for the hierarchical Dirichlet process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics.

Williamson, S. A. (2016). Nonparametric network models for link prediction. Journal of Machine Learning Research, 17(202):1–21.

Xu, K., Srivastava, A., and Sutton, C. (2019). Varia- tional Russian Roulette for deep Bayesian nonparametrics. In Proceedings of the 36th International Conference on Machine Learning, pages 6963–6972.

Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., and Khazaeni, Y. (2019). Bayesian

nonparametric federated learning of neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 7252–7261.

Supplementary Material: CATVI: Conditional and Adaptively Truncated Variational Inference for Hierarchical Bayesian Nonparametric Models

This supplementary material contains a short review of completely random measures in Appendix A, CATVI algorithm and its applications to the HDP model and the ΓDP model in Appendix B, technical proofs and derivations in Appendix C, computational complexity analysis and code in Appendix D and sensitivity analysis results in Appendix E.

A A Short Review of Completely Random Measures

Suppose that (Ω, F) is a Polish sample space, Θ is the set of all bounded measures on (Ω, F) and M is a -algebra on Θ. A random measure G on (Ω, F) is a transition kernel from (Θ, M) into (Ω, F) such that (i) ) is M-measurable for any and (ii) ) is a measure for any realization of G (Ghosal and Van der Vaart, 2017). For example, a Dirichlet process P with base measure satisfies

P(A), . . . , P(ADirichlet(A), . . . , P(A

for any partition = () of Ω, that is, a finite number of measurable, nonempty and disjoint sets such that = Ω. The Dirichlet process is denoted by DP() or DP() with concentration parameter (Ω) and center measure . Moreover, a random measure is called a completely random measure (Kingman, 1993) if it also satisfies the condition that (iii) ) is independent of ) for any disjoint subsets and in Ω. Completely random measures and their normalizations (Ghosal and Van der Vaart, 2017), for example, the Gamma process and Dirichlet process, respectively, are commonly used as priors for infinite-dimensional latent variables in HBNP models, because their realizations are atomic measures with countable-dimensional supports.

A completely random measure (Kingman, 1993) is characterized by its Laplace transform,

(1 )v(dx, ds),

where A is any measurable subset of Ω and ) is called the L´evy measure. If ) = ), where ) and ) are measures on Ω and (0], respectively, the completely random measure is homogeneous (Ghosal and Van der Vaart, 2017). In such a case, we call ) the weight intensity measure. We can view completely random measure as a Poisson process on the product space Ω (0] using its L´evy measure as the mean measure.

B CATVI Algorithm

To maximize the NPELBO, we iterate the following three steps: (i) randomly select a small batch from the entire data, (ii) sample by Monte Carlo method, and (iii) update ) and ) in the stochastic variational inference framework.

In an analogy to (9), the NPELBO with respect to ) is

Elog(A.1)

and the empirical evidence lower bound with respect to ), ELBO, is,

J log (A.2)

up to a constant and then we can update its parameter with the corresponding random gradient in a similar way. Moreover, the NPELBO with respect to ) is

log E) + Elog log (A.3)

Factorizing Eas Eleads to (11). We summarize the details of CATVI algorithm in Algorithm 1.

We repeatedly select documents of a batch size and update parameters iteratively according to the following three steps, until the NPELBO attains its maximum.

There is no closed-form expression for the parameters to attain the maximum in (12). Moreover, the standard gradient descent algorithm fails in this case, because may easily exceed the simplex during the updating procedure. Instead, given the parameters in the -th iteration, we first define

Φ(+ ˆΦ()= 1, . . . , K, = 0, (B.1)

where Φ() denotes the log-gamma function, such that = 1, and then we update the parameters by = (1 + where is the step size defined in Algorithm 1. This updating algorithm is consistent to the gradient descent after the inverse logit transformation. See Appendix C.6 for a justification. In the process of updating, the condition = 1 always holds, and hence we eliminate the risk of exceeding the simplex.

By (A.2), we update the parameters for ) using samples . We define for topic k and word w as,

J STI(ˆz= φ, x= w), (B.2)

and update the parameter by = (1 for each k, where = ().

Sampling for z. According to (11) we sample ˆconditional on ) and ˆby

+ ˆ) expΦ(= 1, . . . , K, expΦ(= 0, (B.3)

to construct the Markov chain, where ˆ). Whenever the sampled ˆis in , meaning ˆforms a new point not belonging to , we need to update the partition and add a new topic indicated by . Otherwise the partition dimension remains the same. Iterating the sampling scheme till convergence, we obtain the samples and corresponding for the selected chunk.

ΓDP releases the constraint of fixed concentration parameter in HDP. Therefore, the CATVI algorithm for ΓDP inherits the steps in (B.1) and (B.3), except that a parameter replaces the concentration parameter in both formulas.

We derive the empirical evidence lower bound in Appendix C.7 with respect to ) as,

up to a constant, where ) is the weight intensity measure (see Appendix A) for the completely random measure, and ) is the density function for (Ω) that can be derived using its Laplace transform. Therefore, we can update in the same way as the HDP model.

Similar to (B.1), we update according to

Φ(+ ˆΦ()= 1, . . . , K, = 0, (B.5)

and = (1 + Moreover, in an analogy to (B.3), the probability to sample ˆis defined as

+ ˆ) expΦ(= 1, . . . , K, expΦ(= 0. (B.6)

Finally, we apply the gradient ascent to update . In Appendix C.7, we derive the gradient of empirical evidence lower bound with respect to as

and then update by = +

C Technical Proofs and Derivations

By definition of induced measure, Θ) = Q(dΘ) for any M-measurable dΘ , we have log log

It follows from lim supand the monotone convergence theorem that

log dqΩdpΩ dQ =log dQdP dQ.

Combining the above equations yields (4). Furthermore, suppose there exists a sequence of partition such that lim sup , we have

log dqΩdpΩdQ =log dqΩdpΩ dQ =log dqΩdpΩ dqΩ.

Hence lim supKL() = KL(which will be used in Appendix C.4.

By p(X, Z) = p(Z|X)p(X), we have log ) ) = log p(X) +log )

Taking the limit inferior on both sides, we have

log ) ) = log lim sup

Combing the above equation with the definition of NPELBO in (5) and the KL divergence in (4) yields (6).

By ) =and the hierarchical generative structure, the evidence lower bound under partition with respect to ) equals,

= Elog log ) + constant

log ) + constant

Elog E) + Elog log ) + constant.

Furthermore, based on the equation above, (8) can be expressed as NPELBO = lim inf ELBO.

By the formula of moments for Dirichlet-distributed random variables, we obtain

E) = Γ()Γ()

Based on the points defined in Section 5.1, we propose a sequence of partition to approach , where = (] for k = 1, . . . , K and is the corresponding complement. Under ) = d(and ) = d), where d) denotes the density function for (K +1)-dimensional Dirichlet distribution, M = and is the corresponding induced random variable. By (10), the empirical evidence lower bound under is

(1) log ) + (1) log

J log Γ(+ ˆ) Γ()

where = ). Since (Beta() under ), the term 1) log is constant with respect to parameters . Taking lim sup on both sides of the above equation with lim sup(log ) = log lim sup= 0 for k > 0 and lim sup= 1, we obtain equation (12).

In this section, we show that the empirical evidence lower bound achieves the limit inferior in NPELBO. With the partition = () defined in Section 4.3, there is no sampling within , and hence we have

q(φ), G(φ), , G(φ(φ), G(φ), , G(φ(φ), , G(φ.

As the likelihood part does not contain ), by integrating both sides with respect to ), we can get Moreover, the KL divergence between the variational distribution and true posterior is

= log q(φ), G(φ), , G(φp(φ), G(φ), , G(φ(φ), , G(φ(φ), G(φ), , G(φ

because

q(φ), G(φ), , G(φ= p(φ), G(φ), , G(φ(φ), , G(φ/N,

where N is the normalization constant,

N = p(φ), G(φ), , G(φ(φ), , G(φ(φ)dG(φdG(φ)

= p(φ), , G(φ(φ), G(φ), , G(φ

= p(φ), , G(φ(φ), , G(φ.

It is obvious that N is independent of . Therefore, if we partition into , the normalization constant N will not change, that is, the KL divergence under and = () are the same. Consequently, the partition enables the limit superior of KL divergence to be reached. By (6), the limit inferior of NPELBO is also attained.