b

DiscoverSearch
About
My stuff
Bayesian Allocation Model: Inference by Sequential Monte Carlo for Nonnegative Tensor Factorizations and Topic Models using Polya Urns
2019·arXiv
Abstract
Abstract

: We introduce a dynamic generative model, Bayesian allocation model (BAM), which establishes explicit connections between nonnegative tensor factorization (NTF), graphical models of discrete probability distributions and their Bayesian extensions, and the topic models such as the latent Dirichlet allocation. BAM is based on a Poisson process, whose events are marked by using a Bayesian network, where the conditional probability tables of this network are then integrated out analytically. We show that the resulting marginal process turns out to be a P´olya urn, an integer valued self-reinforcing process. This urn processes, which we name a P´olya-Bayes process, obey certain conditional independence properties that provide further insight about the nature of NTF. These insights also let us develop space efficient simulation algorithms that respect the potential sparsity of data: we propose a class of sequential importance sampling algorithms for computing NTF and approximating their marginal likelihood, which would be useful for model selection. The resulting methods can also be viewed as a model scoring method for topic models and discrete Bayesian networks with hidden variables. The new algorithms have favourable properties in the sparse data regime when contrasted with variational algorithms that become more accurate when the total sum of the elements of the observed tensor goes to infinity. We illustrate the performance on several examples and numerically study the behaviour of the algorithms for various data regimes.

Keywords and phrases: Probabilistic Graphical Models, Bayesian Network, Latent Dirichlet Allocation, Nonnegative Matrix Factorization, Nonnegative Tensor Factorization, Information divergence, Kullback-Leibler divergence, P´olya urn, Sequential Monte Carlo, Variational Bayes, Bayesian Model Comparison, Marginal Likelihood.

Matrices and tensors are natural candidates for the representation of relations between two or more entities (Cichocki et al., 2009; Kolda and Bader, 2009). In matrix factorization, the goal is computing an exact or approximate decomposition of a matrix X of form  X ≈ WH, possibly subject to various constraints on factor matrices W, H or both. The decomposition, also called a factorization due to its algebraic structure, provides a latent representation of data that can be used for prediction, completion, anomaly detection, denoising or separation tasks to name a few.

The difficulty of the matrix decomposition problem may depend on the nature of the constraints on factor matrices W, H or the nature of the error function used for measuring the approximation quality. For many matrix decomposition problems, exact algorithms are not known. An important exception is the ubiquitous singular value decomposition where efficient and numerically stable algorithms are known for the computation of a decomposition of form  X = U ⊤ΣVwith orthonormal U and V and nonnegative diagonal Σ for any matrix (Golub and Van Loan, 2013). Instead, when both factors W and H are required to be nonnegative, we obtain the nonnegative matrix factorization (NMF) (Cohen and Rothblum, 1993; Paatero and Tapper, 1994; Lee and Seung, 2001) and the problem becomes intractable in general (Vavasis, 1999; Gillis, 2017). However, under further conditions on the matrices W and H, such as separability (Donoho and Stodden, 2004) or orthogonality (Asteris et al., 2015a), it becomes possible to obtain provable algorithms with certain optimality guarantees (Gillis, 2012; Arora et al., 2016; Gillis, 2014; Asteris et al., 2015b). Despite the lack of practical general exact algorithms, NMF and its various variants have been extensively investigated in the last decades (Gillis, 2017).

In domains where it is natural to represent data as a multi-way array (e.g., a three-way array is denoted as  X ∈ RI×J×K, where  I, J, K ∈ N+) involving more then two entities, tensor decomposition methods are more natural (Acar and Yener, 2009; Kolda and Bader, 2009; Cichocki et al., 2009), along with several structured extensions such as coupled/collective factorizations (Paatero, 1999; Yılmaz et al., 2011), tensor trains and tensor networks (Cichocki et al., 2016, 2017). Unfortunately, in contrast to matrix decompositions, computing exact or approximate tensor factorizations is a much harder task that is known to be an NP-hard problem in general (H˚astad, 1990; Hillar and Lim, 2013). Algorithms used in practice mostly rely on non-convex optimization, and depend on iterative minimization of a discrepancy measure between the target data and the desired decomposition. Despite their lack of optimality guarantees, tensor decomposition algorithms have found many modern applications as surveyed in Sidiropoulos et al. (2017); Papalexakis et al. (2016), psychometrics (Tucker, 1966; Harshman, 1970), knowledge bases (Nickel et al., 2016), graph analysis, social networks, (Papalexakis et al., 2016), music, audio, source separation (Virtanen et al., 2015; Simsekli et al., 2015), communications, channel estimation (Kofidis and Regalia, 2001; Sidiropoulos et al., 2017), bioinformatics (Mørup et al., 2008; Acar et al., 2011), link prediction (Ermi¸s et al., 2015), and computational social science (Schein et al., 2016).

Another closely related set of methods for relational data are the probabilistic topic models (Blei, 2012). Topic models have emerged even back in the early 1990’s primarily from the need of understanding vector space models for the analysis and organizing large text document corpora (Deerwester et al., 1990; Papadimitriou et al., 2000; Arora et al., 2015), but quickly applied to other data modalities such as images, video, audio (Signoretto et al., 2011; Liu et al., 2013; Abdallah et al., 2007; Virtanen et al., 2008). Research in this direction is mostly influenced by the seminal works Blei et al. (2003); Minka and Lafferty (2002) and shortly thereafter many related models have been proposed (Canny, 2004; Li and McCallum, 2006; Airoldi et al., 2008; Schein et al., 2015). The empirical success of topic models has also triggered further research in several alternative inference strategies (Griffiths and Steyvers, 2004; Teh et al., 2007).

It has been a folkloric knowledge that factorizing nonnegative matrices and fitting topic models to document corpora are closely related problems where both problems can be viewed as inference and learning in graphical models (Gopalan et al., 2013). In this paper, we formalize this observation and illustrate that nonnegative tensor factorization (NTF) models and probabilistic topic models have the same algebraic form as an undirected graphical model, which ‘factorizes’ a joint distribution as a product of local compatibility functions. Based on this observation, we develop a general modeling framework that allows us computing tensor factorizations as well as approximating the marginal likelihood of a tensor factorization model for Bayesian model selection.

In our construction, the central object is a dynamic model that we coin as the Bayesian allocation model (BAM). The model first defines a homogeneous Poisson process on the real line (Kingman, 1993; Daley and Vere-Jones, 2007) and we imagine the increments of this Poisson process as ‘tokens’, drawing from the topic modeling literature. Each token is then marked in an independently random fashion using mark probabilities Θ. The mark probabilities Θ are assumed to obey a certain factorization that respects a Bayesian network of discrete random variables, with the directed graph1 G(Lauritzen, 1996; Barber, 2012) where the probability tables corresponding to G are also random with a Dirichlet prior. Thanks to the conjugacy properties of the Poisson process and the Dirichlet-Gamma distributions, we can integrate out the random probability tables to arrive at a marginal model that turns out to be a P´olya urn process. The Markov properties of the directed graph G allows us to understand the nature of this P´olya urn model.

For general graphs, exact inference, i.e., the computation of certain marginal probabilities conditioned on the observations, can be accomplished using the junction tree algorithm (Lauritzen and Spiegelhalter, 1988; Lauritzen, 1992; Spiegelhalter et al., 1993). For graphical models with discrete probability tables, under suitable assumptions, it is also possible to integrate out the probability tables to calculate the marginal likelihood. In this paper, we will show that computing the marginal likelihood of BAM is equivalent to computing the probability that the aforementioned P´olya urn process hits a certain set in its state space. This dynamic view leads to a class of novel sequential algorithms

for the calculation of the marginal likelihood.

Note that hierarchical probability models, including topic models and tensor factorizations can be always expressed as Bayesian networks that include both the involved random variables as well as the probability tables. Indeed in the topic modeling literature, it is a common practice to express a joint distribution induced by a topic model using a plate notation. We argue that this detailed description may sometime blur the direct connection of topic models to Bayesian networks. We will illustrate that many known topic models, including Latent Dirichlet allocation, mixed membership stochastic blockmodels, or various NTF models such as nonnegative Tucker or canonical Polyadic decompositions can be viewed in a much simpler way, by just focusing on the discrete random variables and analytically integrating out the probability tables.

The key contribution of the present paper is a surprisingly simple generic sequential Monte Carlo (SMC) algorithm that exploits and respects the sparsity of observed tensors. The computational complexity of the algorithm scales with the total sum of the elements in the observed tensor to be decomposed and the treewidth of the graph G that defines the mark distribution. Unlike other approaches that are based on standard techniques such as non-convex optimization, variational Bayes, or Markov chain Monte Carlo (Acar and Yener, 2009; S¸im¸sekli and Cemgil, 2012; Ermis et al., 2014; Nguyen et al., 2018), the proposed algorithm does not depend on the size of the observed tensor. This is a direct consequence of the junction tree factorization that implies also a factorized representation of the P´olya urn. Moreover, since the tokens are allocated one by one to the urn (due to the Poisson process formulation), the representation is typically sparse and this allows the use of efficient data structures for implementation.

The paper is organized as follows. In Section 2, we give a review of NTF models and setup our notation to highlight the connections to topic models and Bayesian networks. Then, Section 3 introduces BAM as a generative model for tensors and discusses the properties of the resulting model. In the subsequent Section 4, we show how the generic model can be viewed as a P´olya urn model and derive an SMC algorithm for unbiased estimation of the marginal likelihood. Finally, Section 5 contains a simulation study to illustrate the performance and practical utility of the derived algorithms when compared with variational method. We illustrate with examples where our algorithm is practical in the sparse data regime. To make the paper self contained, we also provide a generic derivation of VB (Beal et al., 2006) and related optimization techniques in the same section. We conclude with the Section 6.

In this section, we will give a short review of matrix and tensor factorizations, topic models and probabilistic graphical models. All of these subjects are quite mature where excellent and extensive reviews are available, so our purpose will be setting up a common notation and building motivation for the rest of the paper.

2.1. Nonnegative Matrix and Nonnegative Tensor Factorizations

First, we provide a review of nonnegative matrix and tensor factorizations, the latter being a generalization of the former.

2.1.1. Nonnegative matrix factorization

Nonnegative matrix factorization (NMF) is a statistical model that has been a widely adopted method for data analysis (Paatero and Tapper, 1994; Cohen and Rothblum, 1993; Lee and Seung, 2001; Gillis, 2017). In its original formulation, the NMF problem is typically cast as a minimization problem:

image

where X is an  I × Jobserved matrix and W and H are factor matrices of sizes I×Kand  K×Jrespectively with nonnegative entries such that a divergence D is minimized. One geometric interpretation is that the model seeks K nonnegative basis vectors  W:kfor  k ∈ [K] ≡ {1, 2, . . . , K}represent each data vector  X:jby a conic (nonnegative) combination of �k W:kHkj. Here, replacing an index with a : denotes a vector as in  W:k= [W1k, W2k, . . . , WIk]⊤. As nonnegativity prevents cancellation, typical solutions of the matrices W and H are empirically known to have qualitatively different behaviour than other matrix decomposition models such as principal component analysis, that can be obtained by singular value decomposition.

One possible divergence choice for D is the information divergence,

image

where

image

We will refer to the corresponding model as KL-NMF, following the established terminology in the literature. We note that the abbreviation KL refers to the Kullback-Leibler divergence that is actually the information divergence between suitably normalized measures p and q where �i pi = �i qi. In the sequel we will assume that the matrix X has nonnegative integer entries, i.e., it is a count matrix. For practical considerations, this is not a major restriction as one can always scale finite precision rational numbers to integers by appropriate scaling.

When X is a count matrix, the model is sometimes referred as the Poisson factorization model: the negative log-likelihood of the Poisson intensity is equal, up to constant terms, to the information divergence (Cemgil, 2009; Gopalan et al., 2013). More precisely, if

image

the log-likelihood is

image

Here, the statistical interpretation of NMF is estimating a low rank intensity matrix that maximizes the likelihood of observed data.

2.1.2. Nonnegative tensor factorization

Nonnegative tensor factorization (NTF) models can be viewed as structured extensions of NMF, in a sense we will now describe. Note that the matrix product WH is a particular matrix valued bilinear function T = T [W, H], the element-wise products W and H followed by a contraction on index k as T (i, j) = �k WikHkj. To express general models, we will introduce a multi-index notation where for a set of integers U = {m, n, r, . . . } we let  aU= (am, an, ar, . . .) and for any m < n we let m : n denote {m, m + 1, . . . , n}. In the sequel, we will use NMF as a running example.

In the NMF model, we have L = 2 factor matrices and N = 3 indices. We redefine the factor matrices  W1 = W, W2 = Hand rename the indices as i1 = i, i2 = j, i3 = kand define the sets  U1 = {1, 3}, U2 = {2, 3}, V = {1, 2}and ¯V = {3}. We let  W1(iU1) denote  W1(i1, i3) (that is  Wik), and  W2(iU2) denote  W2(i2, i3) (which is  Hkj). The contraction index set is ¯V so the bilinear function can be expressed as  T [W1, W2](iV) = �i ¯V W1(iU1)W2(iU2) (which is just T [W, H](i, j) = �k WikHkj).

In tensor factorization, the goal is finding an approximate decomposition of a target tensor X as a product of L factor tensors  W1, W2, . . . , WL ≡ W1:L, on a total of N indices. The target tensor has indices  iVwhere  V ⊂ [N], where each element is denoted as  X(iV) and the order of this target tensor is the cardinality of V , denoted by |V |. Each of the factor tensors  Wlhave indices iUlwhere  Ul ⊂ [N] and each element is denoted as  Wl(iUl). As such, a tensor factorization model is

image

where

image

and we use the bar notation to denote the complement of an index set as ¯V = [N] \ V . Note that this is simply a particular marginal sum over the indices i ¯Vthat are not members of the target tensor. For any set  U ⊂ [N] we will refer to the corresponding marginal using the same notation  T [W1:L](iU) = �i ¯U�Ll=1 Wl(iUl).

Using the notation above, we can define some known models easily. For example, a canonical Polyadic decomposition (also known as a PARAFAC) model is defined by

T [W1, W2, W3](i1, i2, i3) =�i4W1(i1, i4)W2(i2, i4)W3(i3, i4) (2.5)

and can be encoded as  U1 = {1, 4}, U2 = {2, 4}, U3 = {3, 4}, V = {1, 2, 3}, ¯V = {4}, where we decompose an order |V | = 3 tensor as a product of L = 3 factor matrices. Another common model is the Tucker model

T [W1, W2, W3, W4](i1, i2, i3) = �

image

that corresponds to  U1 = {1, 4}, U2 = {2, 5}, U3 = {3, 6}, U4 = {4, 5, 6}, V = {1, 2, 3}, ¯V = {4, 5, 6}, where we decompose an order |V | = 3 tensor as a product of L = 4 tensors, three factor matrices and a so-called core-tensor. Many other tensor models, such as tensor trains (Oseledets, 2011), tensor networks (Orus, 2013; Cichocki et al., 2016, 2017) can also be expressed in this framework.

There exist a plethora of divergences that can be used for measuring the quality of an approximate tensor factorization (Fevotte and Idier, 2011; Finesso and Spreij, 2006; Cichocki et al., 2006). In this paper, we focus on Poisson tensor factorization, that has a particularly convenient form due to the properties of the information divergence and is a natural choice in fitting probability models. The derivative of the scalar information divergence with respect to a scalar parameter w is  ∂dKL(p||q(w))/∂w= (1  − p/q(w)) ∂q(w)/∂w. Thanks to the multilinear structure of a tensor model, the partial derivatives of the divergence D can be written explicitly as the difference of two positive terms

image

After writing the Lagrangian of the objective function (2.3) for all factors  Wlfor l = 1 . . . L, necessary optimality criteria can be derived for each  Wlfrom the Karusch-Kuhn-Tucker (KKT) conditions (see, e.g., Kazemipour et al. (2017)) as

image

for all  iUl, l= 1, . . . , L. For solving the KKT conditions, one can derive a multiplicative algorithm for each factor with the ratio of the negative and positive parts of the gradient

image

It can be verified that this algorithm is a local search algorithm that chooses a descent direction as  Wl(iUl) increases when (∇lD)(iUl) <0 and decreases when (∇lD)(iUl) >0. In the KL case, the same algorithm can also be derived as an expectation-maximization (EM) algorithm by data augmentation (Cemgil, 2009; Yılmaz et al., 2011). For the matrix (NMF) case, the convergence properties of the multiplicative updates are established in Finesso and Spreij (2006); Lin (2007a). An alternative interpretation of the KKT optimality conditions is a balance condition where the marginals on  iUlmust match on both sides for all  l ∈ [L]

image

where we let

image

and use the letter P to highlight the fact that this quantity can be interpreted as a conditional probability measure.

For a general tensor model, calculating these marginals, hence the gradient required for learning can be easily intractable. The theory of probabilistic graphical models (Lauritzen and Spiegelhalter, 1988; Maathuis et al., 2018) provides a precise characterization of the difficulty of this computation. In fact, the NTF model in (2.4) is equivalent to an undirected graphical model with hidden random variables. Here, the undirected graph is defined with a node for each index, and for each pair of nodes there is an undirected edge in the edge set if two indices appear together in an index set  Ulfor l = 1 . . . L. Exact inference turns out to be exponential in the treewidth of this undirected graph. The relations between tensor networks and graphical models have also been noted (Robeva and Seigal, 2018). Due to this close connection, we will review graphical models and closely related topic models, that have a well understood probabilistic interpretation and highlight their connections to NTF models. It will turn out, that the concept of conditional independence translates directly to a ‘low rank’ structure often employed in construction of tensor models.

2.2. Probabilistic Graphical Models

Probabilistic graphical models are a graph based formalism for specifying joint distributions of several random variables  ξ1, . . . , ξNand for developing computation algorithms (Maathuis et al., 2018). For discrete random variables, when each  ξnfor  n ∈ [N] takes values in a finite set with  Inelements, one natural parametrization of the distribution is via an N-way table  θ, where each element  θ(i1, . . . , iN) for  in ∈ [In] specifies the probability of the event ξ1 = i1 ∧ ξ2 = i2 ∧ · · · ∧ ξN = iN, expressed as Pr{ξ1:N = i1:N}. More formally, we have

image

where  E {·}denotes the expectation with respect to the joint distribution Pr and I {cond} is the indicator of the condition, that is 1 if the condition inside the bracket is true and 0 otherwise.

For large N, explicitly storing the table  θis not feasible due to the fact that its size is growing exponentially with N, so alternative representations are of great practical interest. One such representation is the undirected graphical models briefly mentioned in the previous section. Another alternative representation is a Bayesian network (Pearl, 1988). In this representation, the probability model is specified with a reference to a graph G = (VG, EG) with the set of vertices VG= [N] and directed edges  EG. This graph is a directed acyclic graph (DAG) meaning that the set of directed edges  EG ⊂ VG ×VGare chosen such that there are no directed cycles in G. Each vertex n of the graph corresponds to a random variable and missing edges represent conditional independence relations.

Given a directed acyclic graph G, for each vertex n, we define the set of parents pa(n) as the set of vertices v from which there is a directed edge incident to node n, defined by pa(n) ≡ {v: (v, n) ∈ EG}. We will also define the family of the node n as fa(n) ≡ {n}∪pa(n). With these definitions, the Bayesian network encodes nothing but a factored representation of the original probability table as:

image

The factors are typically specified during model construction or need to be estimated from data once a structure is fixed. In theory, to obtain such a factorized representation, we can associate each factor  θn|pa(n)with a conditional distribution of form Pr(ξn|ξpa(n)) defined by the ratio of two marginals

image

where the notation  ξUrefers to a collection of random variables, indexed by U as  ξU = {ξu : u ∈ U}for any  U ⊂ [N]. Formally, we define the (moment) parameters of a marginal distribution Pr(ξU) as

image

θU(iU) = E

image

as such a marginal is, in the tensor terminology introduced in the previous section, a contraction on  i ¯Uwhere we have  θU(iU) = �i ¯U θ(i1:N). We can

define a conditional probability table as:

image

We use the notation n|pa(n) just to highlight the fact that �in θn|pa(n)(in, ipa(n)) = 1, for all  ipa(n). One key problem in probabilistic graphical models is the inference problem, that is the computation of marginal probabilities conditioned on a subset of random variables  ξVobserved to be in a specific state, where  V ⊂ [N] is the set of visible indices. Exact inference is feasible, but this depends critically on the structure of the graph G as well as the particular set of visible indices V . To highlight the relation to tensor factorization, we will describe the conditioning using tensor operations and define an observation as a tensor with a single nonzero entry

image

The inferential goal is, given  ξV = i∗V(so that  sV (i∗V) = 1) and  U ⊂ [N], computing posterior marginals of form

image

A general method for exact inference in graphical models is the junction tree algorithm (see, e.g., Lauritzen and Spiegelhalter (1988)) that proceeds by combining groups of nodes into cliques by moralization and triangulation steps to arrive at a representation as

image

where  CGand  SGare collection of sets named cliques and separators satisfying the running intersection property. This factorization allows a propagation algorithm on a tree for efficiently computing desired marginals. In a sense, the junction tree can be viewed as a compact representation of the joint distribution from which desired posterior marginals can still be computed efficiently. In general, each clique  C ∈ CGwill contain at least one of the families of fa(n) so the junction tree algorithm provides a practical method that facilitates the computation of Pr{ξfa(n) = ifa(n)|ξV = iV }for all n, including cases where some variables are observed, i.e., when fa(n) ∩ Vis not empty.

We note that these posterior marginals are in fact closely related to the required gradients when solving the KKT conditions in (2.7) for the NTF model. In other words, the gradients required when fitting a tensor model to data can be computed in principle by the junction tree algorithm. To see this, we assume that the observed tensor X is in fact a contingency table of T observations, ξ1V , . . . , ξTV(with corresponding tensors  s1V , . . . , sTV), where each cell of X gives the total count of observing the index  iV

image

In this case, given  ξτV=  iτV(so that  sτ(iτV) = 1) for  τ ∈ [T], the total gradient, also known as the expected sufficient statistics, is given by

image

We will not further delve into the technical details of the junction tree algorithm here but refer the reader to the literature (Lauritzen, 1996; Cowell et al., 2003; Barber, 2012).

2.3. Topic Models

Topic models are a class of hierarchical probabilistic generative models for relational data with the latent Dirichlet allocation (LDA) as the prototypical example (Blei et al., 2003). The LDA is introduced as a generative model for document collections and can be viewed as a full Bayesian treatment of a closely related model, probabilistic latent semantic indexing (PLSI) (Hofmann, 1999). Suppose we are given a corpus of J documents with a total of  S+words from a dictionary of size I. To avoid confusion, we will refer to the individual instances of words as tokens. Formally, for each token  τwhere  τ ∈ [S+], we are given a data set of document labels  j ∈ [J] and word labels  i ∈ [I] from a fixed dictionary of size I. This information is encoded as a product of two indicators  dτand wτ: dτj wτi= 1 if and only if token  τcomes from document j and it is assigned to word i.

LDA is typically presented as a mixture model where one assumes that, conditioned on the document indicator  dτ, the token  τfirst chooses a topic k ∈ [K] among K different topics, with document specific topic probability  ϑkj, then chooses the word  wτconditioned on the topic with probability  βik. This model can be summarized by the following hierarchical model for all  τ

image

The inferential goal of LDA is, given d and w estimating the posterior of z as well as the tables  β, ϑ. The corresponding simplified graphical model is illustrated in Figure 1 Following the seminal work of Blei et al. (2003), many more variations have been proposed (Li and McCallum, 2006; Airoldi et al., 2008). One can view LDA as a hierarchical Bayesian model for the conditional distribution p(w|d) = �z p(w|z)p(z|d). As such, the approach can be viewed naturally as a Bayesian treatment of the graphical model  d → z → wwith observed d and w and latent z; see Figure 1. In this paper, we will view topic models and structured tensor decompositions from the lens of discrete graphical models.

In topic modeling, it is common to select independent priors on factor parameters, for example in (2.12) the hyper-parameters could be specified freely. While this choice seems to be intuitive, it turns out having a quite dramatic effect on posterior inference and may even lead to possibly misleading conclusions. The choice of a Dirichlet may also seem to be arbitrary and merely due to convenience, but Geiger and Heckerman (1997) prove that under certain plausible assumptions the Dirichlet choice is inevitable. If the parameters of a Bayesian network are assumed to be globally and locally independent, and there are no extra assumptions about the conditional independence structure of the model other than the ones directly encoded by the graph, i.e., any Markov equivalent graph structures could not be discriminated, then the only possible choice of priors happens to be a collection of Dirichlet distributions satisfying equivalent sample size principle. In plain terms, this requires that all Dirichlet hyper parameters should be chosen consistently as marginal pseudo-counts from a fixed, common imaginary data set (Heckerman et al., 1995). These results, initially stated only for Bayesian structure learning problems in discrete graphical models with observable nodes directly apply to the much more general setting of topic models and tensor factorizations. Hence, in the following section, we will establish the link of topic models with tensor decomposition models and Bayesian networks. Our strategy will be introducing a dynamic model, that we coin as BAM and showing that this model can be used for constructing all the related models.

We first define an allocation process as a collection of homogeneous Poisson processes  St(i1:N), t ∈[0, 1) where each index  infor  n ∈ [N] has the cardinality of  In, i.e.,  in ∈ [In], so we have �n Inprocesses. All processes are obtained by marking an homogeneous Poisson process  {St, t ∈[0, 1)} with constant intensity λthat we name as the base process. The marking is done by randomly assigning each event (token) generated by the base process independently to the processes St(i1:N) with probability  θ(i1:N). Let the event times of the base process be 0  < t1 < · · · < tT <1 and define  Sτ(i1:N) =  Stτ(i1:N) for  τ ∈ [T]. Note that T is random also. We will also define the increments

image

image

Fig 1: (Left) A Bayesian network representation of the probability model p(d)p(z|d)p(w|z), also known as the PLSI model. d, z, w correspond to documents, topics and words. (Middle) Bayesian network representation of LDA for a single token (see text), LDA can be viewed as a full Bayesian treatment for the discrete graphical model on the left; (Right) The equivalent model for KL-NMF and LDA derived from BAM. We extend this connection via graphical models to other NTF models. Consistency requires that all factors share a common base measure  α.

with the convention  S0(i1:N) = 0. When  τ = T, we let  S = STand refer to this object as the allocation tensor. It is useful to view the allocation tensor as a collection of cells where cell  i1:Ncontains  S(i1:N) tokens at time t = 1 and the increments  sτ(i1:N) as indicators of the allocation of the  τ’th token to cell i1:N. Later, it will be also useful to think of  i1:Nas a color, where each index  inencodes a color component.

As the assignments of tokens to cells are random, we can define the sequence of random variables  cτ, τ= 1, . . . , T where  cτis the index of the Poisson process that is incremented at time  tτ, i.e.,

image

In the general case, we can imagine the joint distribution of  cτas an N-way array (an order N tensor)  θwith �n Inentries, and  θ(i1:N) specifies the probability that a token is placed into the cell with index  i1:N,

image

For modeling purposes, we will further assume that  θrespects a factorization implied by a Bayesian network G = (VG, EG) (see Section 2.2). As such, we have a factorization of  θthat has the form

image

Remark 3.1. We will generally use  S, Sτ, sτ, and  cτfor realizations of the random variables  S, Sτ, sτ, cτ, and preserve the relation between the random variables for their realizations as well. (For example, for some  Sτ−1and  Sτwe will use  sτdirectly, with reference to the relation between the random variables they correspond to). Finally, we will resort to the bold-faced notation when the distinction between random variables and realizations is essential in the notation; otherwise we will stick to the lighter notation  S, Sτ, sτ, and  cτ, etc. in the flow of our discussion.

3.1. Bayesian Allocation Model

To complete the probabilistic description, we let  λ ∼ GA(a, b). For  θ, we start with an order-N tensor  αwith nonnegative entries  α(i1:N) ≥0 for all  i1:N. In practice, we do not need to explicitly store or construct  α; we will use it to consistently define the contractions for all  n ∈ [N]

image

We model each factor  θn|pa(n)in (3.2) as an independent2random conditional probability table, and  αfa(n)is used to assign the prior distribution on those tables.

Specifically, given the hyperparameters (α, a, b), the hierarchical generative model for variables  λ, Θ  ≡ {θn|pa(n) : n ∈ [N]}and S is constructed as:

image

Here,  αfa(n)(:, ipa(n)) is an  In ×1 vector obtained by fixing  ipa(n)and varying inonly, and D(v) is the Dirichlet distirbution with parameter vector v. Given the hierarchical model, the joint distribution for the Bayesian allocation model (BAM) is expressed as

image

where the distributions (densities)  pa,b(λ), pα(Θ) and  p(S|Θ, λ) are induced by (3.4). To make our reference to the assumed probability measure clear, we will use the notation  πdenote distributions regarding the model in (3.4).

Remark 3.2. Our construction of defining the prior parameters from a common αseems restrictive, however this is required for consistency: the factorization in (3.2) is not unique; there are alternative factorizations of  θthat are all Markov equivalent (Murphy, 2012). If we require that the distribution of  θ(i1:N) should be identical for all equivalent models then defining each measure as in (3.3) is necessary and sufficient to ensure consistency by the equivalent sample size principle (Geiger and Heckerman, 1997). We discuss this further in Section 3.6, where we provide a numerical demonstration in Example 3.2.

3.1.1. Observed data as contractions

In most applications, the elements of the allocation tensor S are not directly observed. In this paper, we will focus on specific types of constraints where we assume that we observe particular contractions of S, of form

image

Here  V ⊂ [N] is the set of ‘visible’ indices, and ¯V = [N] \ V are the latent indices; see Figure 2 for an illustration. Many hidden variable models such as topic models and tensor factorization models have observations of this form.

image

Fig 2: Visualization of the allocation tensor: each cube is a cell that is indexed by a tuple  i1:N (here ikj) and S(i1:N) denotes the numbers of tokens placed (here  Sikj).In a topic model, precise allocations are unknown, but counts over fibers are observed SV (iV ) (here, Si+j = Xij).

Given the observed tensor X = X, one main inferential goal in this paper is to estimate the marginal likelihood of X, defined by

image

When we view the marginal likelihood as a function of a certain parameter, we will use the notation  LX(·) where the value of the parameter appears in the brackets. Furthermore, we are interested in Bayesian decomposition of X by targeting the posterior distribution

image

image

Fig 3: A schematic description of BAM for KL-NMF and LDA. Increments  sτ ofthe homogeneous Poisson process are marked using a Bayesian network  G j → k → ito obtain the ’marked’ token  sτikj and placed into the allocation tensor. The number of tokens  Sikjat the box corresponding to the color (i, k, j) are Poisson distributed with  Sikj ∼ PO�λθ1θ2|1θ3|2�. For KL-NMF, the counts at each box are not directly observed, instead we observe a contraction of the allocation tensor S over the latent index  k as Xij = �k Sikj. For LDA, we have the same model, however, the total number of tokens is assumed to be known.

Many other inferential goals, such as model order estimation and tensor factorization, rely on the quantities stated above. For example, for model order estimation, one needs to calculate the marginal likelihood given the model orders to be compared. For tensor factorization, S provides the information for the underlying graphical model and therefore its posterior distribution given X needs to be found.

Example 3.1 (KL-NMF as a BAM). We first consider the following specific undirected graphical model i k j that corresponds to the KL-NMF model. This simple graph has three equivalent Bayesian network parametrizations as  j →k → i, j ← k → iand  j ← k ← i, e.g., see (Murphy, 2012). To have a concrete example, we will focus on an allocation process that assigns tokens generated from the base process to a document j, then conditioned on the document on a topic k and finally conditioned on the topic to a word i, as in doc  →topic  →word, the other graph structures would be equivalent. To complete the model specification, we let (i1, i2, i3) ≡ (j, k, i). This implies that we have pa(1) = ∅,fa(1) = {1}, pa(2) = {1}, fa(2) = {1, 2} and pa(3) = {2}, fa(3) = {2, 3}. In a topic model, the topic assignments are hidden, and we observe only the word-document pairs i, j so the visible set is V = {1, 3}. The specialized generative

process for this model is

image

θ1(:)  ∼ D(α1(:)), θ2|1(:, j) ∼ D(α2,1(:, j)), θ3|2(:, k) ∼ D(α3,2(:, k))

image

This model is illustrated in Figure 3. Now, if we define the matrices  Wik ≡θ3|2(i, k) and  Hkj ≡ λθ2|1(k, j)θ1(j) and sum over the allocation tensor S, we can arrive at the following model

image

Apart from the specific restricted choice of prior parameters for factors, the generative model is closely related to the Bayesian KL-NMF or Poisson factorization (Cemgil, 2009; Paisley et al., 2014). The formulation of KL-NMF in Cemgil (2009) proposes Gamma priors on both W and H, thereby introducing an extra scaling redundancy which prohibits integrating out W and H exactly, hence leads to a subtle problem in Bayesian model selection.

3.2. The Marginal Allocation Probability of BAM

In the sequel, our aim will be deriving the marginal distribution of the allocation tensor  S = ST, the number of tokens accumulated at step  τ = T, or equivalently t = 1. We will refer to  π(S) as the marginal allocation probability.

We can obtain an explicit analytical expression for  π(S) with the help of conjugacy. We can integrate out analytically the thinning probabilities Θ and intensity  λ. The closed form conditional distribution is available as

π(λ, Θ|S) =  GA(λ; a + S+, b+ 1)

image

In (3.10), we have used the definition

image

in analogy with (3.3), and  S+is defined in (3.20). (Note that, surprisingly the posterior remains factorized as  π(λ, Θ|S) =  π(λ|S)π(Θ|S).) The expression for

the marginal allocation probability is obtained using (3.5) and (3.10) as

image

(3.11) where, to simplify our notation, we define a multivariate beta function of a tensor argument as follows

image

Here  Zfa(n)is an object of the same shape signature as  Sfa(n), hence  Bnis a function that computes first a nonlinear contraction of a tensor with indices ifa(n)over index  into result in a reduced tensor over indices  ipa(n)and multiplies all the entries of this tensor. We will denote the usual multivariate beta function as Beta(z) = �iΓ(zi)/Γ(�i zi).

3.2.1. The marginal allocation probability conditioned on its sum

In many instances, the sum  S+is observable, such as via a partially observed X as in (3.6), while S itself remains unknown. For such cases, it is informative to look at the probability of the marginal allocation conditioned on the sum  S+. For that, consider the allocation model described at the beginning of Section 3 and let  πτdenote the marginal distribution of  Sτ,

image

Recalling  S = STand  T = S+, one can decompose (3.11) as

image

The first factor in (3.14) is the marginal probability of T, obtained after integrating over the latent intensity parameter  λas

image

This is a negative Binomial distribution and can be interpreted as observing  S+‘successes’, each with probability 1/(b + 1) before we have observed a ‘failures’, each with probability b/(b + 1). The second factor in (3.14) follows from (3.11) and (3.15) as

image

In (3.16), the first factor is the score of a Bayesian network when complete allocations S are observed (Cooper and Herskovits, 1992; Murphy, 2012) and the second factor is the multinomial coefficient

image

that counts the multiplicity of how many different ways  S+tokens could have been distributed to  |I1| × |I2| . . . |IN|cells such that the cell  i1:Nhas  S(i1:N) tokens allocated to it.

Note that for large  z+= �i zi, we have  −H(z/z+) ≈(log Beta(z))/z+where H(p) is the entropy of a discrete distribution p, see, e.g., Csisz´ar and Shields (2004). We have similarly  Hn(Zfa(n)/Z+) ≈(log  Bn(Zfa(n)))/Z+. Hence,

image

which suggests that the marginal allocation probability is high when the tokens are distributed as uniformly as possible over S corresponding to high entropy H(S/S+) while the entropy of the marking distribution encoded by the Bayesian network G has a low entropy.

3.3. The Marginal Allocation Probability for the KL-NMF

The specific form of (3.11) for the KL-NMF provides further insight about the marginal allocation probability. Assuming that all entries of X are observed, some of the terms in (3.11) become merely constants. We group the terms that are constant when X is fixed as  Cf(S). These terms only depend on fixed hyperparameters  α, a, bor can be directly calculated from observed data X. The remaining S dependent terms  Cd(S) are given as

image

To understand  Cd(S), we consider Stirling’s formula log Γ(s + 1)  ∼ slog  s −s + O(log s) so terms such as  − �Ni=1log Γ(si) ≈ − �Ni=1 silog  si + xwhere �i si = xcan be interpreted as an entropy of a mass function defined on N cells where  siis the mass allocated to cell i. The entropy increases if the total mass is distributed evenly across cells and decreases if it concentrates only to a few cells. We thus imagine a physical system where a total of  S+balls are placed into  I × K × Jbins, where there must be exactly  Xijballs in the cells (i1j), (i2j), . . . , (iKj), that we will refer as a fiber (i : j) (as  Si+j = Xij). We can think of each tensor S as a mass function, where  Sikjcounts the number of balls being placed into bin (i k j) and  π(S) is a distribution over these mass functions. The individual bins (i k j) and the fibers (i : k) with fixed marginal sum are depicted as in the ‘cubic’ plots in Figure.2.

We can see that the expression has four terms that are competing with each other: terms on line (3.17) that try to make the marginal sums  Sik+and  S+kjas concentrated as possible to a few cells while the terms in (3.18) force  S+k+and  Sikjto be as even as possible. Following the physical systems analogy, π(S) assigns an higher probability to configurations, where balls are aligned as much as possible across (ik :) and (: k j), but are still evenly distributed among slices of form (: k :) and to individual bins (i k j); see Figure 4. This observation also explains the clustering behavior of KL-NMF from a different perspective. It is often reported in the literature that NMF has, empirically and theoretically, a representation by parts property, that is, the matrices  W ∗or  H∗obtained as the solution to the minimization problem in (2.1) tend to be usually sparse with occasional large entries. Remembering that expectations under  p(W, H|S∗) are  E {W:k} ∝ S:+kand  E {H:j} ∝ S+j:, we see that this is what the terms in (3.17) enforce. Yet, this property is not directly visible from the original generative model (2.2) but is more transparent from the alternative factorization. In Section 4, we will also give a dynamic interpretation of this clustering behaviour.

3.4. Decomposing the Marginal Allocation Probability

An important observation in the previous section is that the marginal allocation probability  π(S) inherits the same factorization properties of the underlying graphical model G, as it can also be written as a product of a function of the clique marginals (Sik+and  S+kj) divided by a function of the separator marginal

image

This property, that the parameter posterior has the analogous factorization is named as a hyper Markov law (Dawid and Lauritzen, 1993). A direct consequence of this factorization is that log  π(S) can be written as a sum of convex and concave terms, corresponding to clique and separator potentials respectively. The last term corresponding to the base measure also appears as a concave term but it would be cancelled out when comparing different models. The presence of convex terms renders the maximization problem difficult in general. The terms are pictorially illustrated in Figure 4.

For a general graph G, the form of  π(S) may not be easy to interpret. However, the junction tree factorization for decomposable graphs provides a nice

image

Fig 4: Visualization of the clique and separator domains of the marginal allocation probability  π(S) for KL-NMF/LDA.

interpretation

image

where, similarly, the proportionality constant can be solely determined from X = SV. Consequently, as in the special KL-NMF case, the general log  π(S) can also be expressed as a sum of convex terms corresponding to the clique potentials and concave terms corresponding to the separator potentials. Not surprisingly, the presence of convex terms renders this problem also a hard optimization problem in general. Here is an interesting trade-off: the factorization structure renders maximization difficult but the model can be expressed with far less parameters. It is possible getting rid of separators by clumping more cliques together but this implies that that the computational requirement increases, in the extreme case of a complete graph, all convex terms disappear but the model becomes intractable to compute (Wainwright and Jordan, 2007).

In the following section, we will give a specific example of our rather general and abstract construction to illustrate that KL-NMF and LDA, upto the choice of priors, are equivalent to a particular allocation model.

3.5. KL-NMF and LDA Equivalence

In this section, we will establish the equivalence of KL-NMF and LDA, we will first state the LDA model and construct an equivalent allocation model. In a sense, this equivalence is analogous to the close connection of the joint distribution of Bernoulli random variables and their sum, a Binomial random variable. LDA is a generative model for the token allocations, conditioned on the total number  S+tokens while KL-NMF is a joint distribution of both.

To follow the derivation of LDA more familiar in the literature, we first define the following indicators:  djτ, zkτand  wiτthat encode the events that token τ ∈ [S+] selects j’th document, k’th topic and i’th word respectively. Then we

define a hierarchical generative model

image

The inference goal is, given d and w estimating  z, βand  ϑ. The variables  χand d are omitted in the original formulation of LDA as these can be directly estimated from data, when all the tokens are observed, i.e., when there are no missing values. We consider the joint indicator  sτikj=  djτzkτwiτand define Sikj= �τ sτikj. As  sτis multinomial with  M(θ,1), each cell having probability θikj ≡ βikϑkjχj, the allocation tensor S, being the sum of multinomial random variables with the same probability is also multinomial with  M(θ, S+). To see the connection with the allocation model, remember that the multinomial distribution can be characterized as the posterior distribution of independent Poisson random variables, conditioned on their sum (Kingman, 1993). This property is succinctly summarized below using a Kronecker  δas

image

The original LDA formulation also omits  S+, the total number of tokens; as there are no missing values. This suggests that if we would condition the following allocation model also on  S+, we will exactly get the LDA

image

We have thus shown that both LDA and Bayesian KL-NMF are equivalent to Bayesian learning of a graphical model i k j where k is latent. This observation suggests that there are one-to-one correspondences between other structured topic models, NTF and graphical models. Note that the link between KL-NMF and LDA has long been acknowledged and the literature is summarized in Buntine and Jakulin (2006). We conclude this section with an illustrative example of parameter consistency problem when estimating the rank of a decomposition, i.e., the cardinality of a latent index.

3.6. On Hyperparameters of BAM

The Poisson process interpretation guides us here in defining natural choices for model parameters: the (prior) expectation of  λ, denoted as  E {λ} = a/b, is also the expected number of tokens observed until time t = 1. Let

image

(with realizations  S+). Given an observed tensor  X, S+is fully observable and we could choose the scale parameter as  b ≈ a/S+, following an empirical Bayesian approach.

image

unknown. In this case, b should be integrated out, but we do not investigate this possibility here. In practice, it is also possible to get a rough estimate from data as an average.

image

our a-priori belief, how the tokens will be distributed on S. In practice, we expect S to be sparse, i.e., tokens will accumulate in relatively few cells, however without a-priori preference to a particular group. Here, a natural choice is a flat prior

image

with a > 0 being a sparsity parameter having typical values in the range of 0.05 or 0.5, also known as a BDeu prior. Here, BDeu is an abbreviation for Bayesian Dirichlet (likelihood) Equivalent Uniform prior choice (Buntine, 2013; Hecker- man et al., 1995). Fixing  α, we can also see that a is also a shape parameter with a = �i1:N α(i1:N). The choice of a is critical (Steck and Jaakkola, 2002).

Example 3.2 (BDeu Priors). In this example, our goal is to illustrate with a simple example how choosing the prior parameter  αeffects model scoring and how an inconsistent choice may lead to misleading conclusions even in very simple cases, a point that seems to have been neglected in the topic modeling literature.

Suppose we observe the following contingency table where rows and columns are indexed by i and j respectively

image

and our goal is to decide if S is a draw from an independent or a dependent model, that is to decide if the tokens are allocated according to the allocation schema i j versus  i ← j(or  i → j).

As only  S+= 4 tokens are observed, intuition suggests that the independent model maybe preferable over the one with dependence, but this behaviour is closely related to the choice of the hyperparameters. In a Bayesian model selection framework, one may choose taking a flat prior for all i, j as  α(i, j) = a/4. For the case a = 1, b = 1, we obtain for the independent model

image

and for the dependent model

image

Here, the independent model is slightly preferred. Also, the numeric results are identical, as they should be, for all Markov equivalent dependent models. However, when consistency is not respected, and we choose all Dirichlet priors to be flat, say all cells as a/4, we would obtain for the independent model

image

and for the dependent model

image

In the inconsistent case, not only the two alternative factorizations give different results; the independent model achieves a lower marginal allocation probability than the dependent model.

image

a). In this example, we will illustrate the behaviour of the marginal allocation probability  π(S) for different equivalent sample size parameters a. We calculate π(S) exactly for the NMF model in the range of a from 101to 10−10for a toy matrix  X(1), given below. This range has been chosen to demonstrate the gradual change in the probability values of  π(S) due to the transition from large a to small a.

image

Fig 5: Histogram of the marginal allocation probability for different tensors with marginal count matrix  X(1). The marginal likelihood of  X(1) is shown with a red vertical line. Histogram of  dEPis supplied for comparison.

In this example, we enumerate all tensors with non-negative integer entries, and with the hidden index dimension K = 3, that have the marginal count matrix  X(1), i.e., we consider all S such that  Si+j = X(1)ij. For each such tensor S, we calculate  π(S) and show the histogram of those values in Figure 5. The effect of a is quite dramatic, especially when  a →0 the values that the marginal allocation probability can take is confined on a grid (see Figure 5). This gives a clue about the nature of the distribution of the marginal allocation S, where we have discrete levels of probability values and each admissible tensor S is assigned to one of these values.

The behaviour of the marginal in the limit  a →0 is perhaps surprising. In fact, Steck and Jaakkola (2002) have shown that the log-probability of the tensor S becomes independent of the counts in particular entries, and dependent only on the number of different configurations in the data, defined as the effective number of parameters:

image

dEP (S) =

image

In the regime where a is sufficiently small, the  dEP (S) becomes the sole determiner of the probability of a data tensor. As seen in the equation above, this value is only dependent on the structure of the graphical model, and as Steck and Jaakkola (2002) shows, it determines the probability as specified below:

image

This is corroborated by the our findings in Figure 5 where, for example, the clear modes observed in the parameter setting a = 10−10are at distance from their neighbors by log a.

image

Fig 6: MAP estimations of the missing entries  X(3)12 and X(4)12 are 3 and 1, since the posterior modes of  S+are at 12 and 10 respectively.

Example 3.4 (Posterior Distribution of the number of tokens S+).In this example, we give an illustration of matrix completion problem with BAM. For simplicity, assume we have the following 2  ×2 matrices  X(3)and  X(4)with

image

Fig 7: Some Nonnegative matrix/tensor Factorization Models and Topic Models expressed as instances of BAM.

a missing entry:

image

As one entry is missing, the total number of tokens is not known here.

For each of the matrices, we have calculated the unnormalized posterior distribution Pr(S+ = T, X | K) of  S+under KL-NMF model (Figure 7a) which is also the posterior distribution of the missing entry in this case. Our calculations with exact enumeration for model shows that the distribution of the missing entry  X(3)12reaches its peak at 3 whereas  X(4)12reaches its peak at 1 for the all the cardinalities K = 1, . . . , 4 of the latent node. These results clearly indicate that the predictive distribution of the missing data is determined by the structure of the observed tensor X rather than the observable moments of X such as the mean.

3.7. NTF and Topic Models as Instances of BAM

In this subsection we illustrate examples where structured topic models or NTF models can be expressed as instances of BAM by defining an appropriate directed G.

3.7.1. Pachinko allocation

Pachinko allocation (Li and McCallum, 2006) is proposed as an extension to LDA to model the fact that topics often have a hierarchical structure rather than the flat structure assumed by LDA. When expressed as a BAM, the generative model is essentially a chain as  j → k1 → k2 · · · → kL → iwhere L is the depth of the hierarchy. The observed indices are V = {i, j}, so we observe the counts for word-document pairs. As a concrete example with depth L = 3, a token chooses first a document j, then conditioned on the document, chooses a super-topic  k1(such as sports, politics, economics) , than a sub category  k2(such as football, baseball, volleyball) and than conditioned on the sub category a topic  k3(world-cup, premier league, fifa) and finally the word i. Like LDA, the observations are a matrix  Si+++jand the original formulation of Pachinko allocation also conditions on the total number of tokens  S+.

3.7.2. Mixed membership stochastic blockmodel

A mixed membership stochastic blockmodel (MMB) (Airoldi et al., 2008) is a generative model for relational data, essentially for the adjacency matrix of a directed graph. In this model, we have I entities, indexed with  i1and  i2, and for each of the possible  I2pairs, where we observe a binary value that indicates if there is a directed edge or not. MMB can be viewed as an instance of BAM with the set of observed indices  V = {i1, i2, s}, if we assume a one hot encoding where the observations are encoded as a tensor  Si1i2++swhere  i1, i2 ∈ [N] and s ∈[2].

The generative model corresponding to MMB is,  i1 → k1 → s ← k2 ← i2. The token chooses a source  i1and independently a destination  i2, followed by the ‘blocks’ for the source and destination. Finally, the category s for the edge, here only a binary one, is chosen. As in LDA, we condition on the number of tokens  S+ = I2.

3.7.3. Sum conditioned Poisson factorization

Sum conditioned Poisson factorization is a recently proposed model for binary, categorical or ordinal data (C¸apan et al., 2018) of form  X(i, j) ∈ [R] with possibly missing entries, where the aim is introducing a factorization structure on the moment parameters rather than the canonical parameters as in logistic or ordinal matrix and tensor factorizations (Paquet et al., 2012; Nickel et al., 2016). BAM is of form p(j)p(k|j)p(r|k, j)p(i|k, r), where a token chooses first a source j, then jointly a topic k and a category r and finally a destination i. For observed X(i, j), the observations are of form  Si+rjand for missing X(i, j), they are of form  Si++j = M(hence the name sum conditioned). The model makes only sense if there are both missing and observed entries in X. If all entries are observed, the model degenerates to Poisson factorization.

3.7.4. NTF with KL cost, Poisson tensor factorization

The canonical Polyadic decomposition (Harshman, 1970), defined in (2.5) is one of the widely used tensor models (Kolda and Bader, 2009). A closely related model is used for modelling multiway relations (Schein et al., 2015, 2016). One parametrization of this model is

image

with visible indices  i1, i2and  i3. Another Markov equivalent model is

image

which is also a canonical Polyadic model. A nonnegative 3-way Tucker model (2.6) is p(i6)p(i5|i6)p(i4|i5, i6)p(i3|i6)p(i2|i5)p(i1|i4)

with visible indices V = {1, 2, 3}. For both models, the corresponding graphical models are shown in Figure 7. Many classical graphical model structures can also be viewed as tensor models. For example, a hidden Markov model can be viewed as a T-way Tucker model with 2T indices where the core tensor is factorized as a Markov chain. In the tensor literature, related models are also known as tensor trains or tensor networks (Oseledets, 2011; Cichocki et al., 2016).

3.8. Tensor Factorizations as Directed Models

In its full generality, a NTF model with KL cost (Yılmaz et al., 2011) (see in (2.3)) can be expressed as an undirected graphical model, as a product of individual component tensors, followed by a contraction over hidden indices. However, unlike the examples in the previous subsection, it is not always possible to find an equivalent directed graphical model for a tensor model, as not all undirected models have exact directed representations. A well known example is

image

that corresponds to a four-cycle. This undirected graphical model can not be expressed exactly, in the sense that it implies certain Markov properties that can not exactly be expressed by a directed graphical model. Hence, this tensor model can not be expressed exactly as a BAM. The technical condition for exact representation is the decomposability (Lauritzen, 1996) of the underlying undirected graph: every decomposable tensor model can always exactly formulated as a BAM.

Given a tensor model, it is always possible to construct a decomposable model by adding redundant edges, equivalently extending the factors. For the above example one such construction is

image

but this introduces redundant parameters hence ignores some structural constraints. Nevertheless, for many popular tensor models such as a nonnegative canonical Polyadic (PARAFAC) or nonnegative Tucker models, the models are decomposable hence it is easy to express a corresponding directed graphical model. Given a general tensor model, we can systematically construct a directed representation, hence a BAM that can represent data equivalently well as follows: extend each factor appropriately such that the corresponding undirected graph is decomposable, (corresponding to the called moralization and triangulation steps), and form an order  σof nodes respecting a perfect numbering. We construct a directed graph G sequentially by adding each node v in the order given by  σand add directed edges from a node w previously added to G if v and w are in the same clique, i.e., for all nodes w such that  σ(w) < σ(v) and  v, w ∈ Cfor a clique  C ∈ CG.

In this section, we will first provide an analysis of the allocation model as a dynamic P´olya urn process where tokens are placed according to an underlying Bayesian network, in a sense that we will later describe. It will turn out that the conditional independence relations implied by the Bayesian network also carry forward to the urn process that significantly simplifies inference and makes it possible to run sequential algorithms with minor space requirement. Based on this interpretation, we will describe a general sequential Monte Carlo (SMC) algorithm for BAM, with the primary motivation of estimating the marginal likelihood  LXin (3.7). The BAM perspective is not only useful for devising novel inference algorithms that are favorable when data consists of small counts, it also provides further insight about the clustering nature of nonnegative decompositions in general.

4.1. P´olya Urn Interpretation of the Allocation Model

A P´olya urn (Mahmoud, 2008; Pemantle, 2007) is a self reinforcing counting process, that is typically described using the metaphor of colored balls drawn with replacement from an urn. In the basic urn model, each ball has one of the I distinct colors; one repeatedly draws a ball uniformly at random and places it back with an additional ball of the same color so the process exhibits self reinforcement.

To make the analogy to the allocation model, we will associate each color with a cell and we let  Sτidenote the number of tokens allocated to cell i such that i ∈ [I] at time  τ. Initially there are no tokens, with  S0i= 0 where  i ∈ [I]. The first token is placed into cell i with probability  αi/α+and in each subsequent step  τ, a token is placed into cell i with probability:

image

At step  τ, set  Sτi=  Sτ−1i + sτi, where  sτiis the indicator of the event that the token  τis placed into cell i. Clearly, the placement of tokens has the same law as a P´olya urn. This is a Markov process with the special property that the future of the process heavily depends on the first few outcomes and its limit probability vector limτ→∞ ζτ:has a Dirichlet distribution  D(α).

4.1.1. P´olya-Bayes processes

In the following, we show the close connection between P´olya urns and BAM developed in Section 3. In particular, we show that BAM in Section 3 admits a P´olya urn interpretation if  {θn|pa(n) : n ∈ [N]}have the prior distribution in (3.4).

Consider the allocation model in Section 3 and assume Θ =  {θn|pa(n) : n ∈[N]} have the prior distribution in (3.4), so that we have BAM in Section 3.1. For the sequence  {Sτ}τ≥1of tensors generated by the allocation model, define the transition probabilities

image

where S is a viable tensor at time  τ1. We first look at the forward transition probability  fτ|τ−1(Sτ|Sτ−1) for a viable pair (Sτ−1, Sτ) with  sτ(i1:N) = 1 for some index  i1:N. Clearly, we can write

image

Since  cτis independent from  Sτ−1given Θ, we can decompose the probability of the increment as

Pr(cτ=  i1:N|Sτ−1=  Sτ−1) =�Pr(cτ=  i1:N|Θ)π(Θ|Sτ−1) dΘ

image

where we have used (3.1), (3.2), and (3.10) in the second, third, and the last lines, respectively.

The expression for  fτ|τ−1(Sτ|Sτ−1) in (4.1) exploits the structure in graph G and suggests a way to generate or evaluate a one-step transition for the sequence {Sτ}τ≥0. Specifically, (4.1) implies a collection of dependent P´olya urn processes with a sequential dependence structure. Specifically, the process  {Sτ}τ≥1can be viewed as a generalized form of a “P´olya tree” of depth N as defined in Mauldin et al. (1992, Section 3). A P´olya tree and our process is equivalent if the graph G is complete. Our process is more general due to capturing the conditional independence structures in the Bayesian network with respect to the graph G. To highlight this, and avoid confusion with the term “tree” of a graph, we will call the process  {Sτ}τ≥1a P´olya urn process on a Bayesian network, or shortly a  P´olya-Bayes process.

In compact form, we can write the transition probability in (4.1) shortly as

image

which reveals a several interesting facts about the process  {Sτ}τ≥0. For example, notice from (4.2) that

image

which is the non-combinatorial factor in (3.16) and depends only on  Sτ. We deduce that the incremental process  {sτ}τ≥1is exchangeable. Moreover, noting that there are

second factor in (3.16), we verify that

image

Note that, as the factorization of Θ is not unique, all the Markov equivalent graphs of G can be used to generate the same P´olya-Bayes process, albeit in a different topological order. In fact, for a general thinning mechanism according to a directed graph G, if the corresponding moral graph is decomposable (Cowell et al., 2003), a P´olya urn representation can be constructed from the Junction tree factorization (clique potentials divided by separator potentials and a normalization term) of (C.3) as:

Pr(sτ(i1:N) = 1  | Sτ−1=  Sτ−1) = 1(α+ + Sτ−1+)

�|SG|q=1(αDq(iDq) +  Sτ−1Dq(iDq)) (4.3) The individual terms here contain the sufficient statistics that need to be stored for exactly drawing samples from the urn process. This factorization is useful for sampling any desired conditional with a reduced space complexity: first propagate messages to the root clique and than sample starting from the root. In hidden Markov models, the special case of these algorithms are known as forward filtering backward sampling (Capp´e et al., 2005).

Example 4.1. As a concrete example, the P´olya-Bayes process corresponding to the NMF model has the transition probability

image

This P´olya-Bayes process has  I×K×Jcolors where each color is specified by the tuple (i k j) and balls are drawn according to the underlying graph  j → k → i, with probabilities given by the three ratios. In a sense, each node of the Bayesian network encodes partial information about the color, reminiscent to individual R, G, and B components in the RGB color representation. In the corresponding urn, at step  τ, the token is drawn proportional to  α + Sτ−1and after the draw the counts are updated as  Sτikj ← Sτ−1ikj+sτikjand then the next token at  τ+1 is drawn proportional to  α+Sτ. In the typical parameter regime where the pseudo-counts  αare chosen small, the urn parameter enforces the tokens to get clustered and leads to the observed clustering behaviour of NMF.

The other Markov equivalent graphs  j ← k → iand  j ← k ← icorrespond to following factorizations:

Pr(sτikj= 1  | Sτ−1 = Sτ−1) = (α+kj + Sτ−1+kj)(α+k+ + Sτ−1+k+)

image

The last line corresponds to the basic urn schema that corresponds to the junction tree factorization specifically for the NMF model.

Reverse process: We can also calculate the transition probabilities for the reverse P´olya-Bayes process, that is how to remove tokens from their allocations. The procedure is simple to describe, select any token uniformly at random and remove it. To see this, consider a viable pair (Sτ−1, Sτ) with  sτ(i1:N) = 1 for some index  i1:N. The Bayes rule for the reverse transition probability can be written as

image

since the only difference between  Sτ−1and  Sτis at  i1:N. Note that (4.4) is Pr(sτ+1(i1:N) = 1|Sτ=  Sτ). This leads to a multinomial sampling without replacement procedure, where each cell  i′1:Nis sampled proportional to its oc- cupancy and the selected token is removed.

4.2. Sequential Monte Carlo for Estimating the Marginal Likelihood

In this section, we propose a Monte Carlo based estimator of the marginal likelihood  LXin (3.7) for a given X = X generated as in (3.6).

image

Fig 8: Computing the probability of a P´olya-Bayes urn process (a realization is shown by  •’s) hitting the target set  ■constitutes a rare event problem, prohibiting the use of a standard importance sampling. In a matrix factorization setting, the target set would consist of tensors whose marginal are equal to a given marginal tensor X.

We will assume that all entries of X are observed and therefore the total number of tokens T is available. As a result, we will treat T as a fixed number in this section. Moreover, note that  LXis factorized as

image

where Pr(S+ = T) can easily be calculated from (3.11) and does not depend on any structural properties of the graph G. Therefore, we focus on estimation of the probability Pr(STV= X). This probability can be written as

image

This formulation suggests that calculating the marginal likelihood is equivalent to computing the probability that the P´olya-Bayes process  {Sτ}τ≥1hits the target set

image

at step T, so Pr(STV= X) = Pr{ST ∈ Ω}.

4.2.1. Sequential importance sampling

A naive way of performing importance sampling to estimate the sum in (4.5) is simply to generate M independent configurations,  ST (1), . . . , ST (M), by simulating the P´olya-Bayes process until time T and calculate the estimate

image

However, this estimator would be extremely poor in the sense of having a high variance unless  S+ = Tis small. Instead, we can use the more efficient sequential importance sampling (SIS) technique to design a sequentially built proposal that takes X into account to avoid zero weights.

As the urn process is a non decreasing counting process with increments of one at each step, the indicator of the set Ω can be written for all viable  S1:Tequivalently as:

image

where the factorization is enabled with the indicator function

image

with the relation “≥” meant to apply element-wise. The indicator encodes the condition that the total number of balls allocated to the fiber  S(iV , i ¯V) for all values of  i ¯Vshould never exceed the observations  X(iV) as otherwise the urn process will never hit the target set Ω. Conversely, the process will inevitably hit the target set if this condition is satisfied through the entire trajectory.

The factorization of the indicator over  τin (4.7) enables us to formulate the calculation of the marginal likelihood as a sequential rare event problem. We will formulate our proposed importance sampling within this context.

We remind the reader the one-to-one relation among  {Sτ}τ≥0, {sτ}τ≥1and {cτ}τ≥1, and that we will be using those variables together with reference to their relationship. That in mind, let

image

be the conditional probability of  cτgiven its past, and note that the probability of the entire trajectory can be written as

image

(We have chosen to use the increment indices  cτfor notational convenience in the derivations to follow.) Then, one can write

image

Equation (4.8) suggests for designing a sequential proposal mechanism that can respect the observed data X at each time step, thanks to the factor  g(Sτ).

Our proposal mechanism for SIS is based on the observation that we can obtain exact samples from the conditional distribution of  S1:TV, conditioned on X = STV. More formally, we partition the index of increment as

image

and construct our proposal in two steps

image

where the densities on the right hand side are further decomposed into one-step conditional densities

image

which suggests the sequential nature of the proposal mechanism. We describe those two steps below:

1. Our proposal for  c1:TVcorresponds to simulating backward the sequence of marginal tensors  S1:T −1Vconditioned on  STV= X. More formally, the one-step conditional proposal distribution is

image

This is multinomial sampling of cells with probabilities proportional to the (remaining) number of tokens in those cells. In practice, this mechanism can be implemented by uniform sampling of the tokens in X without replacement.

2. The remaining part  c1:T¯Vis proposed sequentially by sampling  cτ¯Vfrom the conditional distribution of  cτ¯Vgiven  c1:τ−1and  cτV,

image

The resulting importance weight function is

image

Notice that the first part of (4.10) corresponds to Step 1 of the proposal mechanism and it does not depend on the choice of  cτ’s.

The SIS proposal mechanism allows us to calculate the weight function sequentially. Observing (4.9), we can write the weight function as

image

where the incremental weight is

image

Note that since  Sτ−1can recursively be computed from  c1:τ−1, computational load of (4.11) does not increase with  τ.

An important observation is that the incremental weight in (4.11) does not depend on the sampled value  cτ¯V. In fact, this is a good sign when the variabil- ity of the importance weights is concerned: If confined to sequential proposal mechanisms, our choice for q overall is optimal in the sense that it minimizes the variance of incremental weights (Liu and Chen, 1998; Doucet and Johansen, 2009).

image

Algorithm 1 presents the BAM-SIS algorithm, which can be used to generate a weighted sample from the posterior of BAM  π(S|X) defined in (3.8). In order to demonstrate the recursive nature of BAM-SIS and emphasize on the fixed memory requirement, we have dropped the time index  τ. The implicit requirement is that it should be feasible to do exact sampling from the conditional distribution  pτ(sτ¯V |sτV , Sτ−1) of the directed graphical model G. The complexity of inference is closely related to the junction tree factorization and for many models of interest in practice, we are not required to explicitly store S but only need to have access to counts of form  Sfa(n), hence it is sufficient to maintain marginal counts corresponding to the clique marginals of form (4.3). We illustrate this for the KL-NMF in the following example.

Example 4.2. We provide in Algorithm 2 a particular instance of BAM-SIS for the KL-NMF model that corresponds to the graph  i ← k → j. For this model, it is sufficient to maintain two matrices  Sτik+and  Sτ+kjand the reconstruction Sτi+j.

image

Sequential importance sampling - resampling Although better than (4.6), the SIS estimator in Algorithm 1 is still impoverished by increasing T, as the variance of the importance weights increases exponentially in T, resulting in what is called weight degeneracy. To overcome weight degeneracy, SIS is accompanied with a resampling procedure, where (in its default implementation) the particles are resampled according to their weights, which then become 1/M after resampling. The resampling idea is first introduced in Gordon et al. (1993), leading to the famous SIS - resampling (SIS-R) method, aka the particle filer. Standard resampling schemes include multinomial resampling (Gordon et al., 1993), residual resampling (Whitley, 1994; Liu and Chen, 1998), stratified resampling (Kitagawa, 1996), and systematic resampling (Whitley, 1994; Carpenter et al., 1999).

Algorithm 3 presents the SIS-R algorithm for BAM, or shortly BAM-SIS-R (with steps 15 and 16 always implemented). BAM-SIS-R produces weighted samples for the posterior  π(ST |X) and, more importantly for our work, an unbiased estimator of the marginal likelihood  LX. The unbiasedness is established in Del Moral (2004).

One way to improve the performance of SIS-R is adaptive resampling, i.e., resampling only at iterations where the effective sample size drops below a certain proportion of M. For a practical implementation, the effective sample size should be estimated from particle weights, such as in Liu (2001, pp. 35-36). Algorithm 3 covers adaptive resampling as well, where the decision to resample or not is taken at step 14. Although adaptive resampling does reduce the variance of the marginal likelihood estimator, unbiasedness is no more guaranteed.

image

4.3. Variational Algorithms

In this subsection, for the sake of completeness, we will develop variational inference methods for the allocation model. The variational techniques are well known in the context of NMF and related topic models (Cemgil, 2009; Paisley et al., 2014; Blei et al., 2003) and the derivations are technical but straightforward, so we omit most of the technical details and mainly state the results.

In the sequel, we will focus on the special case when the observations have the form of a contraction as in (3.6), i.e.  X = SV. This case corresponds to a hidden variable model, where entities corresponding to the hidden indices ¯V are never observed. We will denote the target distribution of BAM, i.e., the full posterior distribution  p(S, λ, Θ | X), as P.

Variational Bayes (VB) (Beal et al., 2006) is a technique where a target posterior distribution P is approximated by a variational distribution Q via minimizing Kullback-Leibler divergence KL(Q||P). In the context of Bayesian model selection, minimization of the KL(Q||P) corresponds to establishing a tight lower bound for the marginal log-likelihood log  LX, which we refer as evidence lower bound (ELBO) and denote by  BP[Q]. This correspondence is

due to the following decomposition of marginal log-likelihood

image

where the ELBO is explicitly defined as

image

In a typical scenario of VB, variational distribution Q is assumed to be a member of a restricted family of distributions. In its most common form, also known as mean-field approximation, Q is assumed to factorize over some partition of the latent variables, in a way that is reminiscent to a rank-one approximation in the space of distributions. For BAM, a natural choice of mean-field approximation is the family of factorized distributions of the form

image

To minimize the KL(Q||P), i.e., to maximize ELBO, one can utilize a fixed point iteration algorithm (with local optimum guarantees) on q(S) and  q(λ,Θ) where the updates are:

image

and explicit evaluation of the equations above implies the following set of marginal variational distributions

image

where

image

Hence, ˆa, ˆαand ΦVare the variational parameters. Thereby the updates in (4.13) and (4.14) on the variational parameters are found as

image

Here the expectations are to be taken with respect to Q with the most recently updated parameters. When there are no missing values in  X, S+is known as S+= �iV X(iV). For  n ∈ [N], the expected sufficient statistics of the model

image

need to be estimated. These sufficient statistics are various marginal statistics of the possibly intractable object

image

Yet, as ΦVrespects a factorization implied by the DAG G, and depending on the structure of the graph G and the set of visible indices V , it is possible to calculate the required sufficient statistics  EQ�Sfa(n)�exactly by the junction tree algorithm. This is very attractive because we do not need to explicitly store or construct the tensor  EQ {S}but only typically much lower dimensional clique potentials.

Once the expected sufficient statistics are estimated, the evidence lower bound (ELBO) in equation (4.12) yields

image

When we contrast the ELBO expression to the marginal allocation probability in (3.11), we see that the expressions are almost identical, with the S replaced by expectations of form  EQ{S}, i.e., the ‘mean field’ approximation. The last term can be written as

image

where  HΦ(iV) =  − �i ¯VΦV (i1:N) log ΦV (i1:N) is the posterior entropy. The lower bound is large when the entropy of the posterior is large, discounted by the number of different ways the  X(iV) tokens may have arrived.

4.3.1. Alternative optimization methods for BAM

As BAM is a conjugate hierarchical Bayesian model, some other standard optimization based methods are suitable for it. These method include EM, dual-EM, and iterative conditional modes (ICM), which is essentially a coordinate method. These algorithms solve different but closely related problems, where the key algorithmic difference becomes if posterior modes (such as  S∗) or expectations (such as E {S}) are computed during iterations. Table 1 below succinctly lists the available algorithms for BAM, where we partition the parameters into two groups as S, and (λ, θ).

Table 1 Other optimization methods for BAM

image

In this section, our goal is to evaluate the algorithms developed in Section 4 on two tasks: i) computing the marginal likelihood, and ii) computing approximate decompositions of count tensors. We will compare variational approximations with sequential Monte Carlo to show the relative merits and weaknesses of each approach. Besides key indicators such as memory requirement and computational efficiency in terms of run time, we will investigate the SMC approach on model scoring and parameter estimation. Throughout the section, the SMC method refers to BAM-SIS-R in Algorithm 3, and VB refers to the mean-field approximation as described in Section 4.3 and derived in detail in the Appendix.

Rank estimation or more generally model order selection is a prevalent problem in matrix and tensor factorization models. From a Bayesian perspective, Bayes factors (Kass and Raftery, 1995), that is the ratio of the marginal likelihoods of two alternative models, provide a systematic and principled approach for model comparison. For the comparison of multiple models, one can use the closely related quantity “posterior log odds”, which corresponds to the normalized values of the marginal log-likelihoods for each model, assuming a uniform prior on the models. A central question for this section is: Given several alternative models and parameter regimes, which algorithms should be preferred for calculating the marginal likelihood for Bayesian model comparison? Moreover, how do complexities of those algorithms change with respect to variable dimensions and number of tokens? In Section 5.1, we use synthetic data experiments to provide answers to these questions. In Section 5.2, we present an application of model selection using BAM for causality inference, as well as providing an examination of the nature of the latent variables inferred by the SMC algorithm. The experiments in Section 5.3 demonstrate that BAM framework can be extended to conduct model selection between models with vs. without parameter tying. Lastly, in Section 5.4, we investigate the nature of decompositions computed by SMC and VB of BAM and contrast them to one obtained via an optimization approach.

5.1. Synthetic Data Experiments

In this section, we carry out a comparison for the calculation of the marginal likelihood and posterior log odds. It is known that SMC (without adaptive resampling) is unbiased, in contrast to variational methods, in estimating the marginal likelihood. Hence, our aim is to numerically explore the parameter regimes where an SMC-based marginal likelihood estimation may be preferable to a variational approximation to the marginal likelihood. We start our exploration with experiments on small to data to feasibly compute ground truth solutions for comparison.

5.1.1. Comparison of SMC, VB and exact enumeration on toy data

To compare the accuracy of marginal likelihood estimations by VB and SMC , we consider KL-NMF/LDA model (Figure 7a) and select two small toy matrices X(1)and  X(2)for which it is feasible to calculate the exact marginal likelihood by exhaustive enumeration. Moreover, we wish to see how the approximation affects the relative ordering of models of different ranks according to the posterior log odds. As the model selection is highly dependent on the prior parameters, we will investigate the behaviour of the marginal likelihood and log odds according to the equivalent sample size parameter a. The toy matrices are

image

are chosen to have certain intuitive properties: The first matrix  X(1)could be a draw from a uniform Θ, i.e. each  θn|pa(n)(in, ipa(n)) ≈ 1/In, where the model will be of rank K = 1, corresponding to independence of the indices i and j. Alternatively, we can have a model of order K = 2 with two topics, one for the first two columns where only word 1 is active and one for the last column where words 2, 3 are active, with the third column being a superposition of the two. The second toy matrix  X(2)does not appear to be a draw from an uniform Θ and is more clustered and could be conveniently described by a model of rank K = 2.

In Figure 9, we illustrate the results of the exact algorithm and SMC and algorithms, for different model orders K = 1, . . . , 4 and over the range  a ∈[10−5,105]. This range for a captures both the sparse and the dense prior regimes, corresponding to small and large a values respectively.

The P´olya urn interpretation of the allocation process indicates that the allocations drawn from relatively bigger initial counts are affected negligibly by the previous draws and mostly determined by the initial configuration. Hence the flat priors on Θ with large a parameter result in allocation tensors where the tokens are allocated uniformly. This effect of the parameter a is depicted on the Figures 9a and 9b.

image

(a) Marginal likelihood estimations (or lower bounds) of both SMC and VB algorithms seem accurate for the matrix  X(1). However, VB fails to identify the model order K as opposed to SMC .

image

(b) Marginal likelihood estimations of SMC is highly accurate for the matrix  X(2). BothSMC and VB algorithms are able to identify the model order K as in exact calculation.

Fig 9: Marginal log-likelihood estimations of SMC and VB compared to exact marginal log-likelihood.

Since the entries of  X(1)could be interpreted as being drawn from a uniform categorical distribution, its marginal likelihood  LX(1)(K) increases monotonically with the increasing a; see Figure 9a. However, in Figure 9b we observe a different behaviour for  X(2): Increasing a decreases the marginal likelihood after some point, hence indicates  X(2)is less likely to be drawn from a balanced P´olya-Bayes process.

Another observation about the Figures 9a and 9b is that the model selection may highly be affected by the chosen value of a. For instance, our simulation results show that bigger values of a favor a larger model order K. This behaviour of a is also not surprising in the context of matrix factorization. Model parameters  θn|pa(n)’s are normally defined on unit simplices, but as we increase a, we effectively assign zero prior probability to the vectors that are distant from the center. Therefore, increasing the value of parameter a is practically equivalent to restricting the domain of the basis  θn|pa(n)’s. If we view the model order K as the rank of the matrix decomposition, then it is natural to expect bigger rank when the domain of the basis is restricted.

In the second part of the experiment, we compared the marginal likelihood estimations of SMC and VB algorithms to exact marginal likelihood values. For each rank K and prior parameter a, we used 1000 particles for SMC and ran both of the SMC and VB algorithms 100 times to obtain the final estimation. In the case of SMC, we report the average of the estimations whereas in VB we report the maximum of the evidence lower bounds as the final estimation of marginal likelihood.

It is clear from the Figure 9, the marginal likelihood estimations of SMC are highly accurate for both of the matrices  X(1)and  X(2)(Figures 9a and 9b). Regardless of the magnitude of a, SMC is able to correctly identify the model order K as in exact calculation. However, estimations of VB in the sparse data regime are less accurate and it is not able to identify the correct model order for the matrix  X(1)(Figure 9a). Likewise, posterior log odds estimations in the sparse regime highly differ from the ground truth for both matrices (Figures 9a and 9b).

5.1.2. Model order scoring for PARAFAC

As we discussed in previous sections, model selection may highly depend on the hyperparameter a. However, in this section we will analyze the model selection capabilities of our algorithms, and we will not struggle with tuning the parameter a, rather its value will be assumed fixed and known. Hence the only unknown parameter we will be dealing with is the generative model itself.

In the first part of the experiment, we generated a 20  ×25  ×30 tensor X from the PARAFAC model with rank R = 5, number of tokens T = 500 and prior parameter a = 30. Here rank refers to the cardinality of the latent index r in Figure 7e. Our goal is to identify the rank R of the decomposition, i.e., our hypothesized models are the Naive Bayes Models as in Figure 7e each having different cardinality for the latent variable r. Figure 10a shows the marginal likelihood and ELBO estimations of SMC and VB for various ranks R. Although we are not able to calculate the true marginal likelihood for even small R and compare it with our estimations due to intractability, both SMC and VB estimate maximum  LX(R) at true R as one might hope, however it should be noted that VB seems to deviate below at higher R due to the gap between true marginal likelihood and ELBO, i.e., KL divergence between true posterior distribution and variational distributions.

image

Fig 10: Comparison of VB and SMC in terms of average runtime and model selection on simulated data.

To support our claim about the time complexity of the SMC algorithm being independent of the size of X, we also compared the runtimes of the VB and SMC algorithms for the different sizes of the observed tensor X ranging from 4  × 4 ×4 to 64  ×64  ×64. For each of sizes, we first generated X by sampling T = 1000 index configurations from the Naive Bayes Model (PARAFAC) in Figure 7e and then ran both SMC and VB algorithms on the same tensors X until the convergence. We repeated this process 10 times and reported average runtime results in Figure 10b. Our findings support our algorithm analyses that the time complexity of VB scales with the size of X whereas the time complexity of SMC (without parallelization) is independent from it.

Our conclusion in this section is it is possible to do reliable model selection using both popular variational algorithms and our proposed SMC algorithm. However, each algorithm has favorable properties in different data regimes, which make them complementary alternatives. For instance, the time complexity of SMC is independent of the size of X which makes it a promising choice in the sparse data regime, while the time complexity of the variational algorithms are independent of the total number of tokens making them preferable in the dense data regime.

5.2. Model Selection for Causal Inference

Deciding whether the relationship between two or more variables is truly causal or spurious, i.e. explainable through the existence of a common hidden cause variable, can be cast as a model selection problem (Heckerman et al., 1995; Cai et al., 2018; Kocaoglu et al., 2018); in this section we will apply our methodology to make a decision of this nature on the popular Abalone data set. The Abalone data set is a collection of observations including physical measurements, sex, and age from the marine animal abalone (Dua and Karra Taniskidou, 2017). All measurements except sex are real valued, thus we will be categorizing them while conducting our analysis with BAM.

In the Abalone data set one can surmise that all seven variables that pertain to physical magnitude (i.e. size, weight) should be better explained by a model involving a common hidden cause variable, rather than a model defined by a complete graph where the different physical variables are assumed to cause each other. From a Bayesian model selection perspective, deciding between the two hypotheses corresponds to deciding between the two graphical models, that are shown in Figure 11. The first model corresponds to a complete graph (CG) model with no latent variables and the alternative is the CP/PARAFAC, i.e. the naive Bayes model. Without going into a detailed discussion of causal inference (Pearl, 2009; Mooij et al., 2016), we will focus only on the Bayesian model selection aspects.

image

Fig 11: Graphical models that correspond to truly causal relationship, the model with the complete graph (left) vs. spurious relationship, CP/PARAFAC model (right) for abalone physical measurement variables.

We use all seven physical magnitude variables from the Abalone data set to apply our methodology: length, diameter, height, whole weight, shucked weight, viscera weight, shell weight. Given the fact that our method works with categorical data and the original data includes non-categorical data, we categorized each non-categorical variable into 5 clusters, using k-means clustering separately on each variable. We conduct and present our simulations with two hyperparameter settings: a = 1 and 0.001. In each experiment, to compare the two models we compare the marginal likelihood estimate produced by SMC (as described in Algorithm 3) and ELBO produced by VB (as described in in Section 4.3) for the CP model with the marginal likelihood for the CG model. The marginal likelihood estimates for the CP model is presented for all model orders between 1 and 30. Since there are no latent variables in the CG model, its marginal likelihood can be analytically calculated and does not vary according to the model order. Based on the graphical model presented at Figure 11 (left), the marginal

likelihood for the CG can be computed as below:

image

The results of the experiments can be seen in Figure 12. The results presented show that for both hyperparameter settings, the CP model is a better explanation for the data. This is because for all model orders larger than 2 for the case of a = 1 and a = 0.001, the marginal likelihood estimate by SMC and VB for CP model is larger than the marginal likelihood for the CG model. Our results therefore demonstrate that all physical variables are more likely to be caused by a latent variable than to function as causes for each other.

image

Fig 12: Marginal likelihood for both hypotheses for abalone physical magnitude variables for a = 1 (left), and a = 0.001 (right). CP SMC and CP ELBO correspond to the estimates by SMC and VB for CP respectively. CG log  LXcorresponds to the marginal likelihood value calculated for the CG.

An immediate examination of interest is the nature of the latent variable produced by the SMC algorithm. A reasonable candidate for this latent cause variable is the age of the abalone. If this is indeed the case, we expect the abalones allocated to different levels of the latent variable to differ from each other according to age. To see whether this is the case, we examine the age distributions of the abalones that are allocated to different levels of the latent variable - the dataset includes a variable, number of rings, that represents the age of the abalone. We present such an examination for a = 1 and model order R = 3 in Figure 13. The rows of the figure illustrate the distributions p(age|r). As can be seen, the age distributions for each level of the hidden variable is different, potentially corresponding to young, medium, and older aged abalones. Therefore we can conclude that the physical measurements are better explained by a common latent cause variable, which is likely to be the age of the abalone.

5.3. Letter Transition Data

The BAM framework also allows inference in models where parameters of different probability tables are tied. An example to parameter tying would be

image

Fig 13: The distributions p(age|r) of tokens belonging to different latent variable affiliations (r), with model order R = 3 and a = 1. The figure shows that abalones belonging to different latent variable levels demonstrate different age distributions. The examination was conducted on a single particle obtained from SMC . The particle with the most likely decomposition at the end of the procedure was used.

modeling the data using NMF with the graphical model  i ← k → j, where both conditional probability tables are forced to be the same instead of being modeled as independent. We will provisionally call such a model symmetric NMF (sNMF) for obvious reasons. The graphical models that depict the ordinary non-symmetric model, NMF, and the symmetric model, sNMF, can be seen in Figure 14. From a matrix decomposition perspective, these models would correspond to the factorizations  WH ≈ Xvs.  WW T ≈ Xrespectively. The SMC derivation for the sNMF is provided in the appendix.

image

Fig 14: Non-symmetric (left) and symmetric (right) NMF models’ graphical representation.

An example data for comparing the two models would be the letter transition data obtained from Norvig (2013). Here, non-symmetric NMF would correspond to representing a letter differently according to whether it is a preceding letter of a pair vs. the following. The symmetric model, sNMF, is oblivious to the order of the letters, but models the co-occurrence of the letters instead. The actual data set includes an approximate total of 2.8 ×1012transitions. Examining this transition matrix, it is obvious that the data cannot be explained better by a symmetric model, given that the original matrix is not approximately symmetric. This is unsurprising given that sounds or letters in a language follow each other according to certain regularities which are not normally expected to be order independent. However, making this inference would be much more challenging if only a tiny sample from this data was available. We use our framework to compare the two models in this much noisier setting. We obtain a sample from the data by normalizing the counts of the original matrix, use it as the parameter of a multinomial distribution and draw 2000 tokens from this distribution to conduct our experiments.

Here we want to answer three questions: 1- How well does SMC fare in choosing the correct model for the data; 2- How does VB compare to SMC ; 3- How the selection of hyperparameters affects these results. Instead of choosing a specific model order to make this comparison, we will be conducting the experiments for all cardinalities between 1 and 50. In this way, we will be able to observe whether model order affect the decision between the two models, and if it does, how it does so. We will be conducting the experiments at the hyperparameter settings of a = 1, and 0.001.

image

Fig 15: Comparing symmetric vs. non-symmetric NMF marginal likelihood using SMC and VB for a = 1 (left) and a = 0.001 (right).

The results for a = 1 are depicted in Figure 15. The first observation is that at this level, the scores obtained from SMC for NMF dominate those obtained for sNMF at every model order. Therefore, for this parameter setting, SMC favors the non-symmetrical model unequivocally. This is paralleled by the ELBO scores obtained from VB which also favor NMF over sNMF at every model order. This shows that at least for the given problem, VB would also be a sufficient inferential tool for model selection. However, model orders with the highest marginal likelihood for NMF and sNMF differ when SMC and VB are compared, and VB is seen to favor models of lower orders.

The results for a = 0.001 demonstrate a different pattern. Here, the symmetric model is preferred by SMC above the non-symmetric model for almost all model orders. Analysis by Steck and Jaakkola (2002) show that this behavior is expected regardless of the specific inference procedure used: When the priors are very weak, model selection procedures favor models with fewer parameters. Thus, at very weak prior regime the symmetric model is favored over the non-symmetric model since it has fewer number of parameters (due to parameter tying). The distinction is not as clear with VB, however, where the ELBOs for the sNMF is not consistently above the ones for NMF. Similar behavior by VB at weak prior settings was also observed in the experiments with synthetic data at Subsection 5.1.

image

Fig 16: Sparsity (left), squared loss (center), and KL-divergence (right) values for hyperparameters  a = 10−3 to a = 103 with R = 3.

5.4. Examining Decompositions Obtained by SMC

In our final experiment, we compare the nature of decompositions obtained by SMC for BAM with those obtained by an optimization algorithm, that are popular among practitioners. Such an algorithm is alternating least squares using projected gradient descent (ALS-PGD) (Lin, 2007b). We use the bigram letter transition data presented above for this task, with the number of tokens being set to 2000 as above. We evaluate the decompositions produced through this experiment according to two criteria: sparsity and accuracy. For sparsity we use Hoyer’s (2004) definition:

image

where n is the number of elements in the matrix. Given that NMF decomposition results in two different matrices, the mean of the two sparsity results were used as the ultimate score for sparsity for a given decomposition. For accuracy we use two divergence measures between the original matrix X and its approximation ˆX: the squared loss, that is the Frobenius norm of the difference between the two matrices, and KL-divergence, defined as below:

image

We let i and j be the indices of the observed matrix, and k be the index of the latent variable. To obtain the decomposition of the data using BAM, we set  W = θi|kand  H = Tθj|kθkwith T being the total number of tokens, and WH ≈ X.

We have compared the representations produced by the two algorithms fixing a model order (R = 3) and varying a to see how composition quality of the two methods compare at different hyperparameter strength levels. The results for this can be seen in Figure 16. The results show that for a considerable range of hyperparameter values, decompositions produced by SMC for BAM have larger sparsity than those produced by ALS-PGD, while the results for VB are less consistent. Not surprisingly, while the ALS-PGD performs better for squared loss metric, methods of BAM mostly dominate for KL-divergence. This is unsurprising given the fact that these metrics are what the two different frameworks optimize for respectively.

The central message of this paper is that the problem of calculating the marginal likelihood in nonnegative tensor models and topic models are closely related to scoring Bayesian networks and this problem can be treated as a sequential rare event estimation problem. We introduced BAM to explicitly exploit this connection and derive an SMC algorithm.

The BAM also provides us to see the equivalence of general nonnegative tensor factorizations, topic models, and graphical models. Similar equivalences in specific cases such as the LDA and Bayesian KL-NMF has been pointed out several times (Buntine, 2002; Girolami and Kab´an, 2003; Gaussier and Goutte, 2005; Ding et al., 2008; Faleiros and Lopes, 2016), but the emphasis is given on algorithmic equivalence rather than the equivalence of the generative models, with the exception of Buntine and Jakulin (2006). Our approach shows that the equivalence of generative models holds for a large class of decomposable tensor models and discrete graphical models (Heckerman et al., 1995; Geiger and Heckerman, 2002, 1997).

Perhaps surprisingly, the literature on calculating the marginal likelihood for graphical models with hidden nodes is not extensive (Friedman and Koller, 2003; Beal et al., 2006; Riggelsen, 2006; Adel and de Campos, 2017). The graphical model literature focuses typically on structure learning on fully observed models (Tarantola, 2004) rather than dealing with latent variables. In this paper, we have investigated the problem using a P´olya urn interpretation of the marginal allocation probability of BAM, a model that can be viewed as a P´olya tree (Mauldin et al., 1992) albeit with tied parameters. This observation provided us a novel yet intuitive sequential inference framework, particularly practical in the data sparse regime where the calculation of the marginal likelihood is simply the probability that the P´olya urn hits a small set, that is the set of all tensors with the observed marginal.

It is important to note that for parameter estimation in topic models and discrete graphical models with latent variables (such as Hidden Markov models or mixture models), alternative tensor methods are also popular, known as the method of moments or spectral methods. Here, the key idea is representing certain higher order moments of the observed data as a tensor and calculating a decomposition to identify directly the hidden parameters. Under certain conditions the exact moment tensors can be shown to be orthogonally decomposable, and iterative algorithms with recovery guarantees are known (Anand- kumar et al., 2014; Robeva, 2016). However, such spectral algorithms may not be very suitable for the sparse data regime where the estimates of higher order moments can have a high variance. In this regime, one can obtain invalid (such as negative) estimates as spectral methods minimize implicitly the Frobenius norm, which is not the natural divergence measure for a Poisson likelihood. To our knowledge, a spectral estimate that can be used as a proxy for the marginal likelihood is also not known.

Our approach can be used for developing algorithms for model selection in other structured topic models and NTF, tensor trains or tensor networks as all these related problems can be viewed as instances of scoring Bayesian networks with latent variables.

The proposed dynamical process view of tensor factorization turns out to be also useful for understanding the nature of the nonnegative decomposition problem and explain the surprising success of variational approximation methods or Monte Carlo sampling schemata on some data sets and failure on others. From the algorithmic perspective, many asymptotic consistency results for variational inference require that the data size goes to infinity but for tensors models this statement becomes ambiguous for a tensor with a fixed dimension. Our model gives an alternative perspective as the number of tokens: we can view any non-negative tensor as a limit of a count process where each entry is normalized by the total count. This perspective gives additional justification for variational algorithms in the large data regime when combined with the results presented in Wang and Blei (2018). From the modeling perspective, it is long known as an empirical fact, that the factor tensors obtained by decomposition of nonnegative matrices and tensors is sparse, informally known as a clustering behaviour or by parts representation. Indeed, this was one of the initial motivations of the seminal paper Lee and Seung (2001). While some theoretical justifications have been provided (Donoho and Stodden, 2004) we believe that the observed clustering properties can also be understood as a rich-get-richer phenomenon as a consequence of the self reinforcement property of the underlying P´olya urn process.

In our development, we also point out a subtle issue that seems to have been neglected in the topic modeling and probabilistic NTF literature when choosing priors. We illustrate that the seemingly intuitive and convenient choice of independent Dirichlet priors on factor parameters turns out having a quite dramatic effect on posterior inference and may even lead to possibly misleading conclusions about the model structure, such as the cardinality of a hidden node. This concept of model structure is also closely related to a central problem in tensor computations such as when inferring the rank of a decomposition. The equivalence of NTF and graphical models suggests us that the Dirichlet choice is inevitable under certain assumptions (Geiger and Heckerman, 1997) while for structure learning the Dirichlet hyperparameters should be chosen consistently as marginal pseudo-counts from a fixed, common imaginary data set in contrast of being merely independent a priori (Heckerman et al., 1995).

A possible future work is developing efficient algorithms where the observations topic modeling algorithms are designed for handling large document corpora and missing data is typically not a concern – absence of certain words from a document collection is an indicator about a topic, and a zero count is an informative observation. Missing data here corresponds to censoring, where certain words are deliberately erased from certain documents. This is not a typical scenario in document processing and some inference algorithms, such as collapsed Gibbs sampling (Teh et al., 2007), do not handle this type of missing data naturally. This is in contrast to matrix and tensor factorization models, where one can naturally handle such constraints by simply removing the corresponding terms from the marginal likelihood.

In the topic modeling literature, there are several inference algorithms that have been developed for fast inference specifically for LDA (Li et al., 2014; Yu et al., 2015; Magnusson et al., 2015; Terenin et al., 2017). It is not always clear in the literature how to extend and apply these powerful inference methods to more general structured topic models and tensor factorizations. In a sense, our paper outlines a generic method for designing algorithms for tensors.

The work was partly supported as a bilateral Turkish-French research programme by the French National Research Agency under Grant ANR-16-CE23-0014 (FBIMATRIX) and by TUBITAK Turkish Science, Technology and Research Foundation under Grant 116E580 (FBIMATRIX).

E. E. Abdallah, A. B. Hamza, and P. Bhattacharya. Mpeg video watermarking using tensor singular value decomposition. In M. Kamel and A. Campilho, editors, Image Analysis and Recognition, pages 772–783, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-74260-9.

E. Acar and B. Yener. Unsupervised Multiway Data Analysis: A Literature Survey. IEEE Transactions on Knowledge and Data Engineering, 21(1):6–20, jan 2009. ISSN 1041-4347. .

E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Mørup. Scalable tensor factoriza- tions for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1):41–56, mar 2011. ISSN 01697439. .

T. Adel and C. P. de Campos. Learning Bayesian Networks with Incomplete Data by Augmentation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 1684–1690, aug 2017.

E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, may 2008.

A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res., 15 (1):2773–2832, Jan. 2014. ISSN 1532-4435.

S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski. RAND-WALK: A Latent Variable Model Approach to Word Embeddings. CoRR, feb 2015.

S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a Nonnegative Matrix Factorization—Provably. SIAM Journal on Computing, 45(4):1582–1611, jan 2016. ISSN 0097-5397. .

M. Asteris, D. Papailiopoulos, and A. G. Dimakis. Orthogonal nmf through subspace exploration. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 343–351. Curran Associates, Inc., 2015a.

M. Asteris, D. Papailiopoulos, and A. G. Dimakis. Orthogonal NMF through Subspace Exploration. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 343–351. Curran Associates, Inc., 2015b.

D. Barber. Bayesian reasoning and machine learning. Cambridge University Press, 2012.

M. J. Beal, Z. Ghahramani, et al. Variational bayesian learning of directed graphical models with hidden variables. Bayesian Analysis, 1(4):793–831, 2006.

D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77, apr 2012. ISSN 00010782. .

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. J. Mach. Learn. Res., 3:993–1022, mar 2003. ISSN 1532-4435.

W. Buntine. Variational extensions to em and multinomial pca. In T. Elomaa, H. Mannila, and H. Toivonen, editors, Machine Learning: ECML 2002, pages 23–34, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg. ISBN 978-3-540-36755-0. .

W. Buntine and A. Jakulin. Discrete component analysis. In C. Saunders, M. Grobelnik, S. Gunn, and J. Shawe-Taylor, editors, Subspace, Latent Structure and Feature Selection, pages 1–33. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. ISBN 978-3-540-34138-3.

W. L. Buntine. Theory refinement on bayesian networks. CoRR, abs/1303.5709, 2013.

R. Cai, J. Qiao, K. Zhang, Z. Zhang, and Z. Hao. Causal discovery from discrete data using hidden compact representation. In Advances in Neural Information Processing Systems, pages 2671–2679, 2018.

J. Canny. GaP: A Factor Model for Discrete Data. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’04, pages 122–129, New York, NY, USA, 2004. ACM. ISBN 1-58113-881-4. .

G. C¸apan, S. Akbayrak, T. Y. Ceritli, and A. T. Cemgil. Sum conditioned poisson factorization. In Y. Deville, S. Gannot, R. Mason, M. D. Plumbley, and D. Ward, editors, Latent Variable Analysis and Signal Separation, pages 24–35, Cham, 2018. Springer International Publishing. ISBN 978-3-319-93764-9.

O. Capp´e, E. Moulines, and T. Ryden. Inference in Hidden Markov Models. Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg, 2005. ISBN 0387402640.

J. Carpenter, P. Clifford, and P. Fearnhead. An improved particle filter for non-linear problems. Radar Sonar & Navigation, IEE Proceedings, 146:2–7, 1999.

A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models.

Computational Intelligence and Neuroscience, 2009, 2009.

A. Cichocki, R. Zdunek, and S.-i. Amari. Csisz´ar’s divergences for non-negative matrix factorization: Family of new algorithms. In J. Rosca, D. Erdogmus, J. C. Pr´ıncipe, and S. Haykin, editors, Independent Component Analysis and Blind Signal Separation, pages 32–39, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-32631-1.

A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorization. Wiley, 2009.

A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao, and D. P. Mandic. Tensor Networks for Dimensionality Reduction and Large-scale Optimization: Part 1 Low-Rank Tensor Decompositions.  Foundations and Trends R⃝ in MachineLearning, 9(4-5):249–429, 2016. ISSN 1935-8237. .

A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao, M. Sugiyama, and D. P. Mandic. Tensor Networks for Dimensionality Reduction and Large-scale Optimization: Part 2 Applications and Future Perspectives. Foundations and Trends R⃝ in Machine Learning, 9(6):249–429, 2017. ISSN 1935-8237. .

J. E. Cohen and U. G. Rothblum. Nonnegative ranks, decompositions, and factorizations of nonnegative matrices. Linear Algebra and its Applications, 190(C):149–168, sep 1993. ISSN 00243795. .

G. F. Cooper and E. Herskovits. A Bayesian method for the induction of proba- bilistic networks from data. Machine Learning, 9(4):309–347, oct 1992. ISSN 0885-6125. .

R. G. Cowell, P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks. Springer, 2003. ISBN 0387987673. .

I. Csisz´ar and P. C. Shields. Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory, 1(4): 417–528, 2004. ISSN 1567-2190. .

D. J. Daley and D. Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure. Springer Science & Business Media, 2007.

A. P. Dawid and S. L. Lauritzen. Hyper Markov Laws in the Statistical Analysis of Decomposable Graphical Models. The Annals of Statistics, 21(3):1272– 1317, sep 1993. ISSN 0090-5364. .

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, sep 1990. ISSN 0002-8231. .

P. Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer-Verlag, New York, 2004.

C. Ding, T. Li, and W. Peng. On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing. Computational Statistics & Data Analysis, 52(8):3913–3927, apr 2008. ISSN 01679473. .

D. Donoho and V. Stodden. When Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts? In S. Thrun, L. K. Saul, and B. Sch¨olkopf, editors, Advances in Neural Information Processing Systems 16, pages 1141–1148. MIT Press, 2004.

A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of nonlinear filtering, 12(656-704):3, 2009.

D. Dua and E. Karra Taniskidou. UCI machine learning repository, 2017.

B. Ermis, Y. K. Yılmaz, A. T. Cemgil, and E. Acar. Variational inference for probabilistic latent tensor factorization with kl divergence. arXiv preprint arXiv:1409.8083, 2014.

B. Ermi¸s, E. Acar, and A. T. Cemgil. Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Mining and Knowledge Discovery, 29(1):203–236, Jan 2015. ISSN 1573-756X. .

T. P. Faleiros and A. A. Lopes. On the equivalence between algorithms for Non- negative Matrix Factorization and Latent Dirichlet Allocation. In ESANN 2016 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 171–176, Bruges (Belgium), April 2016.

C. Fevotte and J. Idier. Algorithms for nonnegative matrix factorization with the  β-divergence. Neural Computation, 23(9):2421–2456, Sep. 2011. ISSN 0899-7667. .

L. Finesso and P. Spreij. Nonnegative matrix factorization and I-divergence alternating minimization. Linear Algebra and its Applications, 416:270–287, 2006. .

N. Friedman and D. Koller. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 2003. ISSN 08856125. .

E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications. In  SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR con-ference on Research and development in information retrieval, pages 601–602, New York, NY, USA, 2005. ACM. ISBN 1595930345. .

D. Geiger and D. Heckerman. A characterization of the dirichlet distribution through global and local parameter independence. The Annals of Statistics, 25(3):1344–1369, 1997. ISSN 00905364.

D. Geiger and D. Heckerman. Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. The Annals of Statistics, 30(5):1412–1440, oct 2002. .

N. Gillis. Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing. Journal of Machine Learning Research, 13:3349–3386, apr 2012. ISSN 15324435. .

N. Gillis. The Why and How of Nonnegative Matrix Factorization. In J. A. Suykens, S. Signoretto, and A. Argyriou, editors, Regularization, Optimization, Kernels, and Support Vector Machines, Machine Learning and Pattern Recognition, chapter 12, pages 257–291. Chapman & Hall/CRC, Boca Raton, Florida, USA, 2014.

N. Gillis. Introduction to Nonnegative Matrix Factorization. CoRR, 2017.

M. Girolami and A. Kab´an. On an equivalence between PLSI and LDA. In Proceedings of the 26th annual international ACM SIGIR conference on Re- search and development in informaion retrieval - SIGIR ’03, page 433, New York, New York, USA, 2003. ACM Press. ISBN 1581136463. .

G. H. Golub and C. F. Van Loan. Matrix Computations (4th Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 2013. ISBN 9781421407944.

P. Gopalan, J. M. Hofman, and D. M. Blei. Scalable recommendation with poisson factorization. arXiv preprint arXiv:1311.1704, 2013.

N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, 140 (6):107–113, 1993.

T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1):5228–5235, 2004.

R. A. Harshman. Foundations of the PARAFAC procedure: Models and con- ditions for an ”explanatory” multi-modal factor analysis. UCLA Working Papers in Phonetics, 16(1):1–84, 1970.

J. H˚astad. Tensor rank is NP-complete. Journal of Algorithms, 11(4):644–654, dec 1990. ISSN 01966774. .

D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20 (3):197–243, 1995. ISSN 15730565. .

C. J. Hillar and L.-H. Lim. Most Tensor Problems Are NP-Hard. Journal of the ACM, 60(6):1–39, nov 2013. ISSN 00045411. .

T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22Nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, pages 50–57, New York, NY, USA, 1999. ACM. ISBN 1-58113-096-1. .

P. O. Hoyer. Non-negative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.

R. E. Kass and A. E. Raftery. Bayes Factors. Journal of the American Statistical Association, 90(430):773–795, 1995.

A. Kazemipour, B. Babadi, M. Wu, K. Podgorski, and S. Druckmann. Multi- plicative updates for optimization problems with dynamics. bioRxiv, 2017. .

J. F. C. Kingman. Poisson processes. Wiley Online Library, 1993.

G. Kitagawa. Monte carlo filter and smoother for non-gaussian nonlinear state space models. Journal of computational and graphical statistics, 5(1):1–25, 1996.

M. Kocaoglu, S. Shakkottai, A. G. Dimakis, C. Caramanis, and S. Vishwanath. Entropic Latent Variable Discovery. Technical report, ArXiV, 2018.

E. Kofidis and P. A. Regalia. Tensor approximation and signal processing ap- plications. Contemporary Mathematics, 280:103–134, 2001.

T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, 2009. ISSN 0036-1445. .

S. L. Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992. ISSN 01621459.

S. L. Lauritzen. Graphical Models. OUP Oxford, 1996. ISBN 978-0198522195.

S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of

the Royal Statistical Society. Series B (Methodological), 50(2):157–224, 1988. ISSN 00359246.

D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pages 556–562, 2001.

A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14, pages 891– 900, New York, New York, USA, 2014. ACM Press. ISBN 9781450329569. .

W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning, pages 577–584, 2006.

C.-J. Lin. On the Convergence of Multiplicative Update Algorithms for Non- negative Matrix Factorization. IEEE Transactions on Neural Networks, 18 (6):1589–1596, 2007a. ISSN 1045-9227. .

C.-J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, Oct. 2007b. ISSN 0899-7667. .

J. Liu. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer Verlag, New York, NY, USA, 2001.

J. Liu and R. Chen. Sequential Monte-Carlo methods for dynamic systems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 93:1032–1044, 1998.

J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, Jan 2013. ISSN 0162-8828. .

M. Maathuis, M. Drton, S. Lauritzen, and M. Wainwright, editors. Handbook of Graphical Models. CRC Press, Boca Raton, Florida : CRC Press, c2019., nov 2018. ISBN 9780429463976. .

M. Magnusson, L. Jonsson, M. Villani, and D. Broman. Sparse Partially Col- lapsed MCMC for Parallel Inference in Topic Models. Journal of Computational and Graphical Statistics, jun 2015. ISSN 15372715. .

H. Mahmoud. Polya Urn Models. Chapman & Hall/CRC, 1 edition, 2008. ISBN 1420059831, 9781420059830.

R. D. Mauldin, W. D. Sudderth, and S. C. Williams. Polya trees and random distributions. The Annals of Statistics, 20(3):1203–1221, 1992. ISSN 00905364.

T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pages 352–359, 2002. ISSN 0022-0302. .

J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Sch¨olkopf. Distin- guishing cause from effect using observational data: methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204, 2016.

M. Mørup, L. K. Hansen, S. M. Arnfred, L.-H. Lim, and K. H. Madsen. Shift- invariant multilinear decomposition of neuroimaging data. NeuroImage, 42 (4):1439–1450, oct 2008. ISSN 10538119. .

K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

T. H. Nguyen, U. S¸im¸sekli, G. Richard, and A. T. Cemgil. Efficient bayesian model selection in parafac via stochastic thermodynamic integration. IEEE Signal Processing Letters, 25(5):725–729, 2018.

M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A Review of Relational Machine Learning for Knowledge Graphs. Proceedings of the IEEE, 104(1): 11–33, 2016. .

P. Norvig. English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU, 2013.

R. Orus. A Practical Introduction to Tensor Networks: Matrix Product States and Projected Entangled Pair States. Annals of Physics, jun 2013. ISSN 1096035X. .

I. V. Oseledets. Tensor-train decomposition. SIAM J. Sci. Comput., 33(5): 2295–2317, Sept. 2011. ISSN 1064-8275. .

P. Paatero. The Multilinear EngineA Table-Driven, Least Squares Program for Solving Multilinear Problems, Including the n-Way Parallel Factor Analysis Model. Journal of Computational and Graphical Statistics, 1999. ISSN 15372715. .

P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, jun 1994. ISSN 11804009. .

J. Paisley, D. M. Blei, and M. I. Jordan. Bayesian Nonnegative Matrix Fac- torization with Stochastic Variational Inference. In E. M. Airoldi, D. Blei, E. A. Erosheva, and S. E. Fienberg, editors, Handbook of Mixed Membership Models and Their Applications. Chapman and Hall/CRC, 2014.

C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent Semantic Indexing: A Probabilistic Analysis. Journal of Computer and System Sciences, 61(2):217–235, oct 2000. ISSN 00220000. .

E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos. Tensors for Data Mining and Data Fusion. ACM Transactions on Intelligent Systems and Technology, 8(2):1–44, oct 2016. ISSN 21576904. .

U. Paquet, B. Thomson, and O. Winther. A hierarchical model for ordinal matrix factorization. Statistics and Computing, 22(4):945–957, 2012.

J. Pearl. Probabilistic reasoning in intelligent systems : networks of plausible inference. Morgan Kaufmann Publishers, 1988. ISBN 9781558604797.

J. Pearl. Causality. Cambridge university press, 2009.

R. Pemantle. A survey of random processes with reinforcement. Probability Surveys, 2007. ISSN 1549-5787. .

C. Riggelsen. Learning parameters of Bayesian networks from incomplete data via importance sampling. International Journal of Approximate Reasoning, 42(1-2):69–83, may 2006. ISSN 0888613X. .

E. Robeva and A. Seigal. Duality of graphical models and tensor networks. Information and Inference: A Journal of the IMA, 06 2018. .

E. M. Robeva. Decomposing Matrices, Tensors, and Images. PhD thesis, University of California, Berkeley, 2016.

A. Schein, J. W. Paisley, D. M. Blei, and H. M. Wallach. Bayesian poisson ten- sor factorization for inferring multilateral relations from sparse dyadic event

counts. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 1045–1054, 2015.

A. Schein, M. Zhou, D. Blei, and H. Wallach. Bayesian poisson tucker decom- position for learning the structure of international relations. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2810–2819, New York, New York, USA, 20–22 Jun 2016. PMLR.

N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. Tensor Decomposition for Signal Processing and Machine Learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, jul 2017. ISSN 1053-587X. .

M. Signoretto, R. Van de Plas, B. De Moor, and J. A. K. Suykens. Tensor versus matrix completion: A comparison with application to spectral data. IEEE Signal Processing Letters, 18(7):403–406, July 2011. ISSN 1070-9908. .

U. S¸im¸sekli and A. T. Cemgil. Markov chain monte carlo inference for proba- bilistic latent tensor factorization. In 2012 IEEE International Workshop on Machine Learning for Signal Processing, pages 1–6. IEEE, 2012.

U. Simsekli, T. Virtanen, and A. Cemgil. Non-negative tensor factorization models for Bayesian audio processing. Digital Signal Processing: A Review Journal, 47, 2015. ISSN 10512004. .

D. J. Spiegelhalter, A. P. Dawid, S. L. Lauritzen, and R. G. Cowell. Bayesian Analysis in Expert Systems. Statistical Science, 8(3):219–247, aug 1993. ISSN 0883-4237. .

H. Steck and T. S. Jaakkola. On the Dirichlet Prior and Bayesian Regularization. In In Advances in Neural Information Processing Systems 15, pages 697–704. MIT Press, 2002.

C. Tarantola. MCMC model determination for discrete graphical models. Statistical Modelling: An International Journal, 4(1):39–61, apr 2004. ISSN 1471-082X. .

Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in neural information processing systems, pages 1353–1360, 2007.

A. Terenin, M. Magnusson, L. Jonsson, and D. Draper. Polya Urn Latent Dirich- let Allocation: a doubly sparse massively parallel sampler. IEEE Transactions on Pattern Analysis and Machine Intelligence, apr 2017. ISSN 01628828. .

L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.

S. Vavasis. On the complexity of nonnegative matrix factorization. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 61(3):611– 622, 1999.

T. Virtanen, A. Taylan Cemgil, and S. Godsill. Bayesian extensions to non- negative matrix factorisation for audio signal modelling. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1825–

1828, March 2008. .

T. Virtanen, J. F. Gemmeke, B. Raj, and P. Smaragdis. Compositional Models for Audio Processing: Uncovering the structure of sound mixtures. IEEE Signal Processing Magazine, 32(2):125–144, mar 2015. ISSN 1053-5888. .

M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference.  Foundations and Trends R⃝ in Machine Learning, 1(12):1–305, 2007. ISSN 1935-8237. .

Y. Wang and D. M. Blei. Frequentist Consistency of Variational Bayes. Journal of the American Statistical Association, pages 1–15, jun 2018. ISSN 0162-1459. .

D. Whitley. A genetic algorithm tutorial. Statistics and Computing, 4:65–85, 1994.

Y. K. Yılmaz, A. T. Cemgil, and U. S¸im¸sekli. Generalised Coupled Tensor Factorisation. In Advances in Neural Information Processing Systems,, 2011.

H.-F. Yu, C.-J. Hsieh, H. Yun, S. Vishwanathan, and I. S. Dhillon. A Scal- able Asynchronous Distributed Algorithm for Topic Modeling. In Proceedings of the 24th International Conference on World Wide Web - WWW ’15, pages 1340–1350, New York, New York, USA, 2015. ACM Press. ISBN 9781450334693. .

Appendix A Source Code

image

Appendix B Basic Distributions

B.1 Poisson Distribution

Gamma function: Γ(z) = (z −1)! for nonnegative integer z.

image

The entropy is not exactly known but can be approximated.

image

B.2 Gamma Distribution

image

a is the shape and b is the rate parameter. Sufficient statistics

image

The entropy:

image

Here,  ψ(x) is the psi function defined as  ψ(x) = log Γ(x)/dx.

B.3 Dirichlet Distribution

Multivariate Beta function:

image

Dirichlet density:

image

B.4 Categorical and Multinomial Distributions

Multinomial distribution:

image

subject to  s1 + · · · + sK = N.

image

Categorical distribution:

image

subject to  s1 + · · · + sK= 1.

Appendix C Allocation Model

Let Θ  ≡ {θn|pa(n)(:, ipa(n)) :  ∀(n, ipa(n))}be the set of conditional probability tables implied by a Bayesian Network G where each  θn|pa(n)is Dirichlet distributed:

image

An allocation tensor S is constructed via independently allocating the events, that are generated from a Poisson process with the intensity  λ, by using the thinning probabilities Θ:

image

Full joint distribution of S, Θ and  λis

π(λ, Θ, S) = exp ((a −1) log  λ − bλ −log Γ(a) + a log b)

image

By completing the terms, we can now integrate out  λand  θ

image

here,  Sfa(n)(ifa(n)) ≡ �i′¬fa(n) Si′1:Nand  S+ ≡ �i1:N S(i1:N) as above. Note that, surprisingly the posterior remains still factorized as

image

which leads to a closed form expression for the marginal likelihood of S:

image

where, for a tensor  Z, Bn(Zfa(n)) is defined in (3.12).

Appendix D Variational Bayes

A mean-field approximation to the allocation model is in the form of

image

where the factors are

image

and explicit evaluation of the equations above implies the following set of marginal variational distributions

image

where

image

Therefore, Q is equal to

image

Q = exp

image

The evidence lower bound is a functional of Q which can be derived from the KL(Q∥P):

image

Minimization of KL(Q∥P) w.r.t. Q ends up with the following variational marginal distributions:

 q(λ) =  GA(λ; ˆa, b + 1) with the variational parameter

image

Hence,  EQ {log  λ} = ψ(ˆa) −log(b + 1) =  ψ(a + EQ {S+})) −log(b + 1) and EQ {λ}= ˆa/(b + 1).

 q(θn|pa(n)) =  D(θfa(n); ˆαn|pa(n)) with the variational parameter

image

subject to  X = SV. Conditioned on X, the required expectations  EQ {S}can be computed in closed form, as q(S) is a product of multinomial probabilities:

image

D.1 Derivation of ELBO

The individual terms F[Q] amd H[Q] in (D.1) are

F[Q] ≡ EQ {log  π(S, λ,Θ)}

= (a −1 +  EQ {S+})E {log  λ} − (b+ 1)EQ {λ} −log Γ(a) + a log b +�

image

H[Q] ≡ EQ {log  Q(S, λ,Θ)}

image

Since evidence lower bound is the difference of the terms F[Q] and H[Q], sim-plification of the common terms yields

BP[Q] = a log  b − (a + EQ {S+}) log(b + 1) + log Γ(a + EQ {S+}) −log Γ(a) +�

image

D.2 Calculation of sufficient statistics

The variational parameter ΦVof multinomial distribution q(S) also factorizes according to the directed graph G and have the form

image

This structure can be exploited to calculate the expectations by an algorithm closely related to belief propagation.

image

Belief propagation is typically framed as carrying out inference in a Bayesian network by conditioning some of the random variables to their observed values. In our allocation model formalism, this is equivalent to observing only a single token  τconditioned on the event that its visible and invisible indices are  iτVand iτ¯Vrespectively, so the observations can be naturally factorized as

image

A general contraction of the allocation tensor can also be written as

image

As such,

image

We could run belief propogation for each token  τseparately to estimate the expected sufficient statistics as these are additive

image

However, as the sufficient statistics are only calculated for updating the parameters in bound optimization, this maybe wasteful and it is desirable calculating the statistics in an online or recursive manner.

Appendix E Parameter tying

In some applications, it is desirable to constrain the thinning probabilities Θ further. One common choice is having shared parameters, in order to encode additional structure such as symmetric decompositions  X ≈ WW ⊤. Below is the derivation of the SMC algorithm for symmetric CP/PARAFAC model is described. Notice that symmmetric NMF (sNMF) corresponds to the special case of symmetric CP/PARAFAC where N = 2.

Let  sτ(r, i1:N) be the indicator of the event that the token  τis allocated to cell (r, i1:N) and let’s define the following indicators in terms of  sτ:

image

where the sum �i1:N:in=iis over all the valuations of indices  i1:Nsatisfying the condition  in = i. Then the conditional probability of the event  sτis

image

Since the individual events  s1, s2, . . . , sTare conditionally independent given the parameters Θ, the joint distribution of the events  s1:Tfactorizes as follows

image

If we assume the following Dirichlet priors on Θ

image

posterior distribution of Θ turns out to be also Dirichlet due to conjugacy:

image

Hence, the marginal likelihood of the events  s1:Tcan be found by Bayes theorem:

image

and similarly the P´olya-Bayes process probabilities are

image

Then it is straightforward to adapt Algorithm 3 to symmetric CP/PARAFAC case by changing only the distributions  pτ(c ¯V | cV , Sτ−1) and  pτ,V (cV | Sτ−1) as follows

image


Designed for Accessibility and to further Open Science