Gaussian Mixture Reduction with Composite Transportation Divergence

2020·Arxiv

Abstract

Abstract

Gaussian mixtures are widely used for approximating density functions in various applications such as density estimation, belief propagation, and Bayesian filtering. These applications often utilize Gaussian mixtures as initial approximations that are updated recursively. A key challenge in these recursive processes stems from the exponential increase in the mixture’s order, resulting in intractable inference. To overcome the difficulty, the Gaussian mixture reduction (GMR), which approximates a high order Gaussian mixture by one with a lower order, can be used. Although existing clustering-based methods are known for their satisfactory performance and computational efficiency, their convergence properties and optimal targets remain unknown. In this paper, we propose a novel optimization-based GMR method based on composite transportation divergence (CTD). We develop a majorization-minimization algorithm for computing the reduced mixture and establish its theoretical convergence under general conditions. Furthermore, we demonstrate that many existing clustering-based methods are special cases of ours, effectively bridging the gap between optimization-based and clustering-based techniques. Our unified framework empowers users to select the most appropriate cost function in CTD to achieve superior performance in their specific applications. Through extensive empirical experiments, we demonstrate the efficiency and effectiveness of our proposed method, showcasing its potential in various domains.

I. INTRODUCTION

FINITE mixture models are widely employed to approximate nearly all smooth density functions, a concept referredto as the universal approximation property [1, 2]. Mathematically, a finite mixture model consist of a collection of probability distributions, where the distribution function is a convex combination of a finite number of distinct distributions from a parametric distribution family. Among various types of finite mixture models, the finite Gaussian mixture model (GMM) stands as the most widely utilized mixture in numerous applications, primarily due to the advantageous properties offered by the Gaussian distribution. The probability density function (PDF) of a finite mixture is defined as follows. Let detbe the PDF of a d-dimensional Gaussian. We exchangeably write with where denotes the space of symmetric positive definite matrices. Let be a Dirac measure at and be a probability measure that assigns probability to for where for . We denote the PDF of an N-component Gaussian mixture by

We call G the mixing distribution, the mixing weight, and the component parameter. The number of components N is also called the order of a mixture. We prefer parameterizing GMM with a mixing distribution G rather than a vector such as because the mixing distribution G does not suffer from the well-known label-switching dilemma [3, Section 1.14].

Due to the parametric nature of mixture models, they offer a convenient and efficient means of approximating distributions with unknown shapes, thanks to their universal approximation property. This property simplifies downstream inference tasks and enhances computational efficiency. Consequently, mixture models have found numerous applications in approximate inference methods, including belief propagation [4, 5] and Bayesian filtering [6]. In these applications, a finite Gaussian mixture is used to provide an initial approximation to density functions that are updated recursively. A challenge in these recursions is that the order of the Gaussian mixture increases exponentially and the inference quickly becomes computationally intractable. To overcome the difficulty, the technique called Gaussian mixture reduction (GMR), which approximates a high-order Gaussian mixture by one with lower-order, can be used. As seen in Fig. 1, the density function of an 8-component mixture in (a) is well approximated by a 3-component mixture in (b). One could hence use GMR to replace a high-order GMM with a

Q. Zhang is with the Institute of Statistics and Big Data, Renmin University of China, Beijing, China. A. Zhang is with the Department of Statistical Sciences, University of Toronto, Toronto, Canada. J. Chen is with the Department of Statistics, University of British Columbia, Vancouver, Canada.

© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Fig. 1: Two Gaussian mixtures of different orders have similar shaped density functions (solid line). The dashed lines are component density functions.

lower-order GMM after each update, thereby reducing the computational cost in downstream operations. More specifically, GMR approximates a given mixture by a lower order mixture ()

We refer to two mixtures and as the original mixture and the reduced mixture respectively. We assume the target order M is pre-specified.

There exist at least three general types of GMR approaches in the literature: greedy algorithm-based [7, 8], optimization-based [9, 10], and clustering-based approaches [6, 11–16]. While greedy algorithms are straightforward to implement, they often employ ad-hoc similarity metrics [17] and tend to yield suboptimal outcomes. Also, when there is a large discrepancy between the orders of the original mixture and the reduced mixture, the greedy algorithms are computationally expensive. On the other hand, optimization-based approaches formulate GMR as an optimization problem, aiming to minimize the divergence between the reduced mixture and the original mixture. The advantage of this approach lies in having a well-defined optimality target. However, selecting an appropriate divergence measure can be challenging. Ideally, the divergence should be easy to evaluate, and the corresponding optimization should be computationally efficient. To this end, previous works such as [18] and [10] have proposed using the squared distance as the objective function in optimization-based GMR methods. Although the distance has a closed-form, minimizing this objective can still be computationally expensive and notably slow, especially when the GMM has a large dimension d or reduction order M, or both. To develop more numerically efficient algorithms for GMR, clustering-based approaches have been introduced. Inspired by the popular k-means algorithm [19], these approaches offer computational advantages over optimization-based methods. They treat GMR as a clustering problem in the space of Gaussian distributions. The process involves iteratively assigning components of the original mixture to different clusters using similarity metrics and updating cluster centers through moment matching. However, despite their computational efficiency, the theoretical properties of clustering-based approaches remain largely unknown. Important questions regarding convergence guarantees and rates, as well as the reliability of moment matching for updating cluster centers, are yet to be answered. In this paper, we aim to address these questions and shed light on the theoretical aspects of clustering-based GMR methods.

In this paper, we introduce a novel optimization-based GMR approach along with its corresponding numerical algorithm. Despite being an optimization-based approach, we also demonstrate that the existing clustering-based approaches are special cases of our proposed method. Therefore, we effectively bridge the gap between optimization-based and clustering-based approaches for GMR. Our approach enables the interpretation of clustering-based methods from a new perspective, while also offering a fresh method that potentially enhances performance. We emphasize the following key contributions of our work:

• Novel optimization-based GMR approach. First, we propose a novel optimization-based approach for GMR that addresses the limitations of existing methods. We target at minimizing a composite transportation divergence (CTD) [20, 21] between the original and the reduced mixtures. The CTD leverages the computational efficiency of the divergence between two Gaussian components, making it easier to minimize than many other divergences defined between two mixtures. Additionally, by formulating GMR as an optimization problem, we provide a more principled framework for GMR.

• Unified perspective. Second, we develop a novel and computationally efficient majorization-minimization (MM) algorithm for our optimization problem. Remarkably, we demonstrate that many existing clustering-based approaches can be viewed as special cases of our MM algorithm. We reveal that when the clustering-based approaches are correctly performed, then

their hidden optimality targets are the CTD to the original mixture. Moreover, our convergence results for the general MM algorithm can be applied to the existing clustering-based algorithms, thereby providing theoretical guarantees for their convergence. This analysis offers valuable insights into the convergence properties of clustering-based approaches, strengthening their validity and applicability. Additionally, our investigation highlights that moment matching may not always be the optimal choice for updating the cluster centers in the clustering-based algorithm. Instead, it is crucial to align the cluster center updates with the chosen divergence in the assignment step. Relying solely on moment matching without considering the divergence in the assignment step can lead to non-convergence of the clustering-based approach.

• Improved performance. Third, though many existing clustering-based methods are our special cases, we show that we can improve their performane by choosing cost functions. The CTD is associated with a cost function on the space of Gaussian distributions. The selection of different cost functions gives rise to a family of CTDs, offering users the flexibility to choose the most suitable cost function for their specific applications. This versatility empowers users to optimize the performance of their Gaussian mixture reduction process by selecting a cost function that aligns closely with their modeling objectives and desired outcomes.

In conclusion, our proposed GMR approach combines the advantages of both optimization-based and clustering-based methods. It provides a well-motivated optimization target while also being numerically efficient, and bridges the gap between these two approaches. The remainder of the paper is organized as follows. In Section II, we overview the existing approaches for GMR. In Section III, we formally define the composite transportation divergence, the proposed GMR method, and the corresponding MM algorithm. We point out that many existing clustering-based methods are special cases of our MM algorithm. The unified framework permits the users to choose the most suitable cost functions to achieve superior performance in their specific applications. We compare different approaches by numerical experiments in Section IV. Section V concludes the paper.

II. EXISTING METHODS FOR MIXTURE REDUCTION

There are three general categories of existing methods for GMR: greedy algorithm-based, optimization-based, and clustering- based approaches. In this section, we provide a brief review of these methods and suggest interested readers refer to [17] for a more comprehensive review.

A. Greedy algorithm-based approaches

The greedy algorithm-based methods can be categorized into top-down and bottom-up methods. Top-down greedy algorithm-based methods [7, 8, 22] typically involve merging two components or pruning one component at a time from the original mixture until the desired order is achieved. On the other hand, bottom-up greedy algorithm-based methods [10] start with a single Gaussian component and successively add one component at a time until the desired order is reached. When merging two components, a common approach is to select the most similar pair of components and merge them using moment matching. In the case of pruning, the method typically discards either the component with the smallest weight or the component with the lowest “cost” to remove, followed by the adjustment of the weights of the remaining components. Greedy methods often rely on ad-hoc similarity measures [17]. While these methods strive for optimality at each step, the final result may fall short of being truly globally optimal [15].

In contrast, our method has a well-defined overall optimality goal, distinguishing it from the greedy algorithm-based approaches. We aim to find a global optimum by formulating the GMR as an optimization problem with a clear objective.

B. Optimization-based approaches

The GMR is naturally formulated as an optimization problem in [10, 18]. Let f(x) and be two PDFs. The integrated squared error (ISE) or the squared distance between f and measures the integrated squared difference between their PDFs and is given by

Under the special case of Gaussian mixture, the ISE between and has the following convenient expression:

where , and , and are matrices with their (i, j)-th elements respectively being , and .The optimization-based GMR approach in [18] is to search for

where

is the space of mixing distributions with M components.

Evaluating the ISE, , is straightforward numerically, but optimizing it can be computationally expensive. The method employed by [18] utilizes a Quasi-Newton algorithm that optimizes over variables. However, the per-iteration computational cost scales at . Given that this cost is quartic in dimension d and quadratic in M, it becomes prohibitively expensive as d and M increase. Moreover, due to the non-convex nature of the objective function , the Quasi-Newton algorithm can become trapped at local minima. Consequently, finding an algorithm that is less susceptible to local minima becomes crucial, as highlighted by [10].

In contrast, our method is also optimization-based and possesses a well-defined optimality target. However, our approach offers improved computational efficiency compared to the minimum ISE approach in [18]. Moreover, we show that when the cost function is chosen to be the ISE between two Gaussians in the CTD, our objective function is a surrogate of ISE between two mixtures.

C. Clustering-based approaches

Both the greedy algorithm-based and optimization-based GMR approaches face scalability challenges as the dimensions d and the desired reduction order M increase. To address this issue, clustering-based approaches have emerged as competitive alternatives, offering computational efficiency advantages. Clustering-based approaches, exemplified by methods like the one described in [15], draw inspiration from the popular k-means algorithm [19] in Euclidean space and adapt it to the space of Gaussian distributions. They start with a user-proposed M Gaussian distributions as initial cluster centers and iterate between the following two steps:

• Assignment step: Partition N components of the original mixture into M groups based on their closeness to the current cluster centers according to some divergence .

• Update step: Relocate the cluster centers based on the components assigned to each cluster.

The approach iterates until there are no meaningful changes in these centers. These cluster centers are then exported as M components of the reduced mixture. The mixing weight of each reduced component is the sum of the weights of the original components assigned to the corresponding cluster.

Similar to the k-means algorithm in the vector space, there exist various assignment schemes in clustering-based GMR approaches. These schemes can be broadly categorized into “hard” clustering-based and “soft” clustering-based methods, depending on how the components of the original mixture are assigned to clusters. In the case of hard clustering-based approaches, each component in the original mixture is assigned exclusively to a single cluster. On the other hand, soft clustering-based approaches assign components of the original mixture to multiple clusters, usually in proportions that reflect their membership strength. Some existing hard and soft clustering-based approaches for GMR are as follows. 1) Hard clustering: Let and respectively be the nth and mth component of the original and reduced mixture. In theassignment step, [12], [13], and [15] choose the Kullback–Leibler (KL) divergence as their closeness metric:

Under the special case of Gaussian distributions, the KL divergence becomes

The derivation of the KL divergence is given in Appendix A-A. The squared Wasserstein distance [23]

is chosen as the closeness metric in [14] where is the Euclidean norm.

They assign the nth component in the original mixture to the mth cluster when . One may randomly pick one cluster if a component is equally close to more than one cluster. Denote C(m) the index set of components assigned to the mth cluster in one iteration. Let . The mth cluster center is updated by moment matching [15]

Alternatively, [14] propose to update the cluster centers by the Wasserstein barycenter of but uses an approximate solution instead for computational efficiency. The iteration continues until the change in the is below some user-specified threshold.

2) Soft clustering: Instead of assigning the whole component of the original mixture to a single cluster, the soft clustering-based approach assigns fraction of the nth component to the mth cluster for some and . Various forms of have been considered in the literature. For instance, [11] let and [6] uses with some hyper-parameter I > 0 and . They also use moment matching to update the cluster centers.

The soft clustering-based approach reduces to the hard clustering-based approach as the hyper-parameter . This is seen by noticing

The computational cost of both hard and soft clustering-based algorithms is at each iteration, which is lower than the per iteration cost of the optimization-based approach in [9].

Although clustering-based approaches offer computational advantages in GMR, their theoretical properties have remained largely unknown. In this paper, we make significant progress in understanding these methods by demonstrating that many existing clustering-based approaches are special cases of our proposed method. This realization allows us to unveil the hidden optimality targets of these existing methods, which turn out to be the CTD to the original mixture. We hence bridge the gap between clustering-based and optimization-based approaches for GMR. Furthermore, our formulation introduces a more general class of clustering-based algorithms that offer flexibility to users. Within this framework, users can select the appropriate divergence measure in the assignment step and subsequently update the cluster centers based on their specific application requirements. This enhanced flexibility allows for improved customization and performance optimization in a wide range of practical scenarios.

III. PROPOSED GAUSSIAN MIXTURE REDUCTION METHOD

The optimization-based GMR formulation in [18] suffers from a lack of effective and efficient numerical solutions, despite its conceptual simplicity. Instead of continuously searching for elusive algorithms to minimize the ISE in (2), it is more practical to replace ISE with other divergences that offer favorable theoretical properties and enable efficient optimization algorithms. It is worth noting that most divergences are computationally expensive to evaluate between two mixtures. However, the evaluation cost is considerably lower when comparing two Gaussian distributions. Exploiting this property, we consider a Gaussian mixture as a discrete distribution defined on the space of Gaussian distributions. By introducing a divergence on the space of Gaussian distributions, we can induce a transportation divergence between two finite Gaussian mixtures. This leads to the development of a composite transportation divergence (CTD), which not only possesses strong theoretical motivations but also facilitates the design of effective and efficient algorithms.

In this section, we begin by illustrating several optimization-based GMR approaches. Through these illustrations, we highlight the numerical challenges encountered, which serve as the driving force behind our novel GMR framework based on CTD. Subsequently, we introduce our proposed GMR approach, accompanied by the innovative MM algorithm. After that, we present the theoretical results on the convergence properties of the MM algorithm. Finally, we conclude the section by showing the connection of our proposed method to existing optimization-based and clustering-based approaches.

A. Optimization-based GMR with KL divergence

The Kullback–Leibler (KL) divergence is widely recognized as a popular measure of similarity between two distributions. At first glance, it may appear to be a natural choice of divergence for the optimization-based GMR approach. However, employing the KL divergence does not yield an effective optimization-based GMR solution as we show below.

If we adopt the KL divergence between two mixtures, the resulting optimization problem can be expressed as follows:

where

Since the minimization is over and G is known, the above optimization problem is equivalent to

The solution is therefore linked to the widely known population maximum likelihood estimate (MLE) of a finite mixture [24]. However, the task of solving for presents even greater challenges compared to the ISE optimization problem (3). This is primarily due to the additional computational cost involved in evaluating the KL divergence between two mixtures, compounded by the difficulty of minimizing a non-convex function.

To tackle the optimization problem (6), one possible approach is to employ the population EM algorithm, as discussed in [24]. The algorithm begins by proposing a hypothetical mixture distribution, denoted as , for a random variable X, where X is part of the complete data (X, Z) and Z represents a latent variable. The conditional distribution of X given Z = m is represented as , with . On the other hand, the true distribution of X is denoted as , with . The joint hypothetical density of X, Z is given by

and the posterior of Z given X = x,

With the complete data (X, z), the complete population log-likelihood of is given by

• Once is defined and an initial is given, the population M-step proceeds as

Although the M-step is conceptually straightforward, computational challenges persist, as explained below. Given that the Qfunction in (7) is additive in and , it enables an updating scheme that allows for separate updates of these parameters.

and

Although we have introduced a straightforward updating scheme, it is important to note that both (8) and (9) involve integrals of the ratio of two Gaussian mixture densities . These integrals are computationally intractable, further complicating the optimization task associated with the KL divergence-based GMR.

Another possible approach is via the finite sample EM algorithm. In the finite sample EM algorithm, the expectations in (8) and (9) are replaced by their corresponding sample means. Consequently, one way to alleviate the computational burden is to obtain approximate solutions [12]. The idea is to generate Monte Carlo samples Xfrom the original mixture . The log-likelihood of the reduced mixture is then given by

which converges to as . Consequently, the MLE based on converges to , the desired reduction result. However, Monte Carlo methods often suffer from the curse of dimensionality, where the number of samples needed to achieve the same level of precision tends to increase exponentially with the dimensionality, denoted as d. Therefore, this approach becomes impractical for large d when given a fixed computational cost budget. In summary, the GMR approach based on KL divergence presents an even more demanding optimization challenge compared to the GMR based on the ISE divergence.

Furthermore, it is widely recognized that both the finite sample EM algorithm and the population EM algorithm are prone to getting trapped in local maxima. This issue persists when applied to GMR, posing a significant challenge. However, there is potential to address this concern by gaining a deeper understanding of the structural characteristics of local maxima, as discussed in [25]. Acquiring detailed knowledge of the original mixture can aid in effectively identifying local maxima and locating the global maximum.

In order to advance the adoption of optimization-based GMR methods, it is crucial to develop improved computational techniques. The proposed optimization-based GMR approach utilizing the CTD represents a significant breakthrough in this regard, offering promising avenues to overcome the aforementioned challenges.

B. Optimization-based GMR with CTD

Our objective is to propose an optimization-based method that combines the advantages of having a well-defined optimal target and the computational efficiency found in clustering-based methods. To achieve this, we begin by examining the k-means algorithm in the vector space from an optimization perspective, as discussed in [26].

Consider a set of d-dimensional vectors associated with clusters, where M is a predetermined number. The k-means problem aims to find M elements that minimize the following objective function:

where is some divergence between two sets O and R. It can be seen that from an optimization perspective, the goal of the k-means algorithm is to find the optimal set that minimizes the divergence to Os.

In the context of the clustering-based approach for GMR, we are not dealing with vectors anymore; instead, we seek to cluster N Gaussian distributions into M groups. Therefore, by extending the concept in (10) to the space of distributions, we can replace the divergence D(O, R) between two sets in vector space with a measure that quantifies the similarity between two distributions. Additionally, since each Gaussian component is associated with a weight, we must also consider the corresponding mixing weights. This analogy motivates us to explore the composite transportation divergence (CTD) between two finite Gaussian mixtures.

We begin by formally defining the CTD and explaining its connection with the objective function of the k-means algorithm. Let and represent the PDF of the original and reduced Gaussian mixtures, respectively, with appropriate orders. The mixing weights are denoted as w and , and the components’ PDFs are denoted as and . The CTD [21, 27] measures the transportation cost from the original mixture to the reduced mixture and is defined as follows.

Definition 1 (Composite transportation divergence). Let be a divergence on the space of Gaussian distributions F = . The composite transportation divergence (CTD) between two finite Gaussian mixtures with cost function is

where

is the set of coupling matrices with marginals w and . Let be a regularization parameter. An entropic regularized CTD is

with entropy .

The CTD between two mixtures is a type of optimal transport divergence, as described in [28], which measures the divergence between the mixing distributions G and . The optimal transport divergence arises from the concept of the optimal transportation problem, and an illustrative example can help convey its intuition.

Consider a scenario where there are N warehouses and M factories operating in the space of Gaussian distributions F. The nth warehouse, located at , contains units of raw material, while the mth factory, located at , requires units of raw material. Let denote the per unit cost to transport materials from warehouse n to factory m, and let represent the amount of material being transported. Assuming that the transportation cost is proportional to the amount of material transported, the total transportation cost under a given transportation plan is given by . The coupling set represents all possible transportation plans, subject to two marginal constraints: (a) the correct amount of material is taken from the warehouses, and (b) the correct amount of material is received by the factories. The optimal transportation problem aims to find the transportation plan that minimizes the total cost among all feasible plans in . The resulting minimum total cost, obtained under the optimal transportation plan, corresponds to the CTD between the two mixtures. In other words, the CTD quantifies the transportation cost associated with moving the materials optimally from one mixture to another.

The computation of the CTD involves two numerical tasks. Firstly, we need to evaluate the cost function between two components, which is computationally cheap compared to evaluating the direct divergence between two mixtures. The second numerical task is to find the optimal transportation plan, typically achieved through numerical algorithms such as linear programming. The computational cost of this task is typically on the order of when the number of components N is equal to the number of clusters M. To expedite the computation of the optimal transport, an entropic regularization technique was proposed by [29]. This approach provides an approximate solution to the optimal transport problem, significantly reducing the computational time. It is worth noting that in our paper, the entropic regularization serves a different purpose, as our proposed GMR does not require solving the optimal transportation problem as we show below. We will elaborate on the purpose of introducing the entropic regularization in Section III-E1. For simplicity, we also refer to the entropic regularized CTD as CTD, highlighting the differences only when necessary.

As the original mixture is known, for the simplicity of notation, we write Given a cost function on the space of Gaussian distributions and a regularization parameter , we propose to reduce of order N to of order M with

We illustrate how our proposed objective (12) generalizes the objective (10) of the k-means algorithm in the vector space. In our context, we consider the original mixture and the reduced mixture as sets of weighted observations in the space of Gaussian distributions F. For instance, the original mixture consists of observations with associated weight . In this analogy, the composite transportation divergence plays a similar role as the divergence D(O, R) in k-means, quantifying the dissimilarity between the two sets. Consequently, our objective is to identify the optimal set that minimizes this divergence.

With this objective, our approach clearly falls within the framework of optimization-based GMR methods. We will now outline a straightforward numerical algorithm to solve this optimization problem, which not only enables our proposed method to have a clear optimal target but also inherits the computational advantages of clustering-based approaches.

C. The tailor-made MM algorithm

Upon initial examination, it may appear that solving for in (12) requires addressing two sub-problems, each involving a constrained optimization issue: (a) evaluating for each ; (b) minimizing with respect to . The first sub-problem typically involves a numerical search for transportation plans within the coupling set . However, we demonstrate that such sequential optimization is not necessary. Instead, the overall optimization problem can be seamlessly resolved using a single Majorization-Minimization (MM) algorithm.

The intuition behind this approach is as follows: recall that the CTD represents the lowest cost of transporting materials from warehouses to factories. The optimal transportation plan, , transports components from warehouses to components in factories at the lowest cost. When transporting all materials from warehouses to factories, must have its first marginal matching the warehouse mixing distribution and its second marginal matching the factory mixing distribution. However, in the context of our GMR approach, the factory mixing distribution is being optimized. Therefore, we can allow the second marginal of to be flexible, rather than imposing a specific form on it. In other words, instead of constraining to have as its second marginal, we let be the second marginal of . This renders the marginal distribution constraint on redundant. By removing this redundant constraint, the optimal transportation plan with only one marginal constraint can be expressed in closed form, facilitating the use of an easy-to-implement MM algorithm. Now, let us formally describe the algorithm based on this insight. Let and be a vector of the component parameters of thetarget mixture . Define a function with its dependency on the original G hidden:

The optimization of the transportation plans over involve only one linear marginal constraint in terms of w. Once the target component parameters are given, the optimal transportation plan has its (n, m)th entry:

When , the solution is given by .

Theorem 1 (Equivalent optimization problem with single marginal constraint). Let , and the other notation be the same as given earlier. Let

The mixing distribution of the reduced mixture is then

The proof is deferred to Appendix B-A. The advantage of the new objective is that it has a closed-form form given by:

This objective depends solely on the component parameters and does not involve the mixing weights. The mixing weights are uniquely determined by the component parameters through the definition of . This separation allows us to optimize over the component parameters and mixing weights separately, reducing the overall optimization problem to minimizing with respect to the component parameters . Although the closed form of allows for easy computation, it is important to note that both the optimal transportation

plan and the cost function are dependent on the component parameters. To address this dependency, we propose

an iterative procedure inspired by the well-known Majorization-Minimization (MM) algorithm [30], which enables us to separate these dependencies and optimize them iteratively. For completeness, we provide a brief overview of the MM algorithm following [30]. Suppose we wish to minimize a function g(x) over some space X. The MM algorithm iteratively updates a solution from an initial point . After t iterations, with the current solution , MM algorithm first constructs a function that majorizes g(x) at , that is with equality holds at . It then updates with . The success of MM algorithm relies on finding a majorization function , preferably convex in x, that is easy to minimize. Such a procedure ensures a decreasing sequence of . We now present the majorization and minimization steps in minimizing (14). • Majorization Let be the mixing distribution updated after t iterations. We first propose a majorization function forthe optimization goal (15):

with dependent on the entropy regularization strength as in (13). Note this majorization function is the regularized total transportation cost under transportation plan . • Minimization We then minimize the majorization function (16) and update for each m:

The minimization step in (17) during each iteration of the algorithm is simplified compared to directly solving (14). This simplification arises because the design of the majorization function in equation (16) separates the components , allowing for separate and parallel optimization with respect to each . Additionally, the proposed algorithm benefits from the sequential update of the mixing proportions and the component parameters . Specifically, we first update using the expression

and then obtain . This sequential updating scheme allows for a more efficient optimization process. The algorithm iterates between the majorization step in equation (16) and the minimization step in equation (17) until the change in falls below a certain threshold. The complete algorithm is summarized in Algorithm 1.

One primary advantage of our proposed approach for GMR, compared to other optimization-based approaches, is its easy-to-implement algorithm and computational efficiency. In the assignment step, the optimal transportation can be computed using a closed-form solution, resulting in a cost of O(NM), given the values. The per-evaluation cost of the commonly used is . Therefore, the total cost for the assignment step is . The updating step (17) involves solving for the barycenter [31] for M clusters, with costs depending on the cost function employed in the CTD in (11). When the cost function is KL divergence, the corresponding Gaussian barycenter can be computed at a cost of . The per-iteration cost in this case is , which is lower than the per-iteration cost of directly minimizing the integrated squared error in Section II, which is quartic in d.

Theorem 2 (Convergence of the algorithm). Suppose the cost function is continuous in both arguments. Assume that for any constant and , the following set is compact according to some distance on F:

Let be the sequence of mixing distributions generated by with some initial mixing

The proof of this theorem can be found in Appendix B-C. It is worth noting that, similar to the famous EM algorithm [32], these properties alone do not guarantee the convergence of . However, these two properties do imply that converges monotonically to a constant . Furthermore, all limiting points are stationary points of , meaning that iterations from do not further reduce the value of . In certain special cases or under specific conditions, we can demonstrate the convergence of .

Theorem 3. Assume the same conditions of Theorem 2. Suppose we carry out the MM iteration forever. When , we have 1) There exists a large enough T and such that for all , and

(18) 2) Limiting point of starting from any is a local minimum of in the space of . 3) There exist an MM related exhaustive algorithm with exponential time to solve (12) for fixed d.

The proof can be found in Appendix B-D. The first two properties ensure the convergence of both and forany initial value when . Based on our experience, we have observed that the value of T is typically small, as shown in Table II. In many real-world applications, it is anticipated that N is approximately 30 or less, although previous studies such as [15] have experimented with N on the order of 1000 to demonstrate the scalability of clustering-based methods. Consequently, an exhaustive search can be a viable tool in certain scenarios. Additionally, it provides a way to examine how frequently a globally optimal solution is missed by an iterative procedure with its specific initial value scheme.

When , we provide a novel convergence analysis of our MM algorithm from a mirror descent perspective. Our analysis draws inspiration from the convergence analysis of the EM algorithm in KL divergence [33]. In particular, we establish that the MM updates can be interpreted as mirror descent updates, where each iteration minimizes the linearization of the objective function while incorporating a weighted Bregman divergence penalization term. By establishing this connection, we can analyze the convergence rate of our proposed MM algorithm from a mirror descent perspective.

The formal description of our main result requires some preliminary results.

Definition 2 (Bregman divergence). Let be a function that is: a) strictly convex, b) continuously differentiable, c) defined on a closed convex set . Then the Bregman divergence induced by A is defined as

That is, the difference between the value of A at and the first order Taylor expansion of A around evaluated at point . The function is also called a reference function.

The Bregman divergence encompasses various well-known divergences as special cases. For example, under Gaussian distributions, we have the following result:

where is the log-partition function of the Gaussian distribution when written in the standard form of the natural exponential family. The Bregman divergence is connected to the mirror descent algorithm as follows.

Mirror descent is a generalization of the gradient descent algorithm [34]. Suppose we aim to minimize a function f over a set , starting from some initial value . The mirror descent iteratively updates the parameters as follows:

where L > 0 is a constant, and represents a Bregman divergence. The objective in the mirror descent update at each iteration is a linear function penalized by the Bregman divergence. It has been demonstrated in [34] that when f is L-smooth relative to A (defined in Definition 3), the convergence of is linear, and the rate is quantified as:

where is the lower bound of f.

We discovered that our MM updates can be expressed as mirror descent-like updates. Therefore, the convergence analysis of the mirror descent algorithm can be extended to the convergence analysis in our case. Specifically, let be a cost function. If in addition, the cost function is a Bregman divergence induced by , namely for any and . Let be defined in (13) and . We show in Lemma 1 in Appendix B-E that our MM updates can be written as

where

, and .We say this update is mirror descent alike since the RHS of (19) is not penalized by a Bregman divergence but its weighted summation. With this mirror descent-like updates, we are able to show the following linear convergence result.

Theorem 4 (Rate of convergence of MM algorithm). Let be the sequence of the component parameters produced by the mirror descent update in (19) and . Then

E. Connection with existing methods

1) Connection with existing clustering-based approaches: Our proposed GMR approach is an optimization-based method, as evident from its formulation. In this section, we demonstrate that our MM algorithm can be used to derive both hard and soft clustering-based Gaussian mixture reduction methods, depending on the value of . Specifically, when , our algorithm corresponds to a class of hard clustering-based methods. On the other hand, for , it represents a class of soft clustering-based approaches. It is important to note that the inclusion of entropy regularization in our method is aimed at achieving soft clustering-based results, rather than solely for computational efficiency.

By selecting specific cost functions and values of , our algorithm encompasses various clustering-based algorithms found in the existing literature. We summarize these algorithms, along with the corresponding choices of cost functions and , in Table I.

Previous work [15] has demonstrated that hard clustering-based approaches generally offer computational advantages over the minimum ISE approach proposed by [18]. Additionally, these hard clustering-based methods outperform certain greedy algorithms, such as those presented in [22] and [8]. However, the convergence and optimality of hard clustering-based approaches have not been extensively studied before. By establishing the connection between our method and clustering-based approaches, our results provide crucial support to these methods by addressing these aspects that were previously missing in the literature.

2) Convergence: Due to the connection between our method and existing clustering-based algorithms, the convergence of most clustering-based methods can be inferred when their corresponding entropic regularized CTD satisfies the conditions outlined in Theorem 4.

3) Consistency of assignment and update steps: Our proposed MM algorithm employs the same cost function in both the assignment and update steps. In the assignment step, this cost function measures the similarity between components in the original mixture and the components in the proposed mixture . In the update step, we seek the barycenter of the components in the original mixture that are assigned to the same cluster, using the same cost function. Our theory demonstrates that the MM algorithm generates a sequence with non-increasing entropic regularized CTD, ensuring convergence in this scenario. However, if different cost functions are used in the assignment and update steps, this guarantee may not hold. An example of this occurs when components are assigned to clusters based on a divergence measure such as the Wasserstein distance, but the cluster centers are updated through moment matching. As moment matching leads to the barycenter under the KL divergence, the convergence of the algorithm is not implied by our theory in such cases.

TABLE I: The relationship between the proposed GMR approach and existing clustering based GMR approaches according to the cost function and regularization strength . Empty entries indicate new approaches not previously explored.

We will begin by examining the case when and the cost function is the KL divergence in (5). Let be the clustercenter after t iterations. For simplicity, let us assume that for every n, there exists a unique . In this case, the transportation plan in the MM algorithm is given by:

which corresponds to the KL barycenter. As demonstrated in Appendix C, the barycenter solution coincides with the moment matching solution for Gaussian mixtures. Hence, the proposed GMR approach with and the KL divergence as the cost function corresponds to the hard-clustering GMR algorithm presented in [15]. Similarly, when and represents the 2-Wasserstein distance between two Gaussians, the proposed GMR approach leads to the hard-clustering algorithm presented in [14]. Consider the case when and . In this scenario, our proposed algorithm reducesto the soft clustering-based algorithm presented in [6]. However, it is important to note that this cost function depends on the mixing weights of the reduced mixture, and thus Theorem 2 does not directly apply, leaving the convergence of the algorithm uncertain. Nevertheless, we provide a valid motivation for this algorithm.

The soft-clustering algorithm in [6] draws inspiration from variational inference [35]. Consider a scenario where we have I random observations from . Let the data be denoted as , and the log-likelihood function is given by . Taking the expectation of this log-likelihood yields:

Let be a random sample from for each n. It appears that [6] suggests

Conceptually, the claim in their suggested variational lower bound implies that X has a probability of being a random sample from the component . However, this claim is not true and invalidates the proposed variational lower bound.

In Section IV, we conducted experiments to demonstrate the efficacy of our proposed GMR approach by introducing a novel cost function, the ISE between two Gaussians. It is noteworthy that previous clustering-based GMR methods have not explored the use of this cost function. Our findings clearly illustrate that leveraging ISE can lead to significant performance improvements in existing clustering-based algorithms. This novel contribution highlights the effectiveness of a unified minimum-CTD-based GMR approach.

2) Connection with existing optimization-based approaches: We also establish a connection of our proposed method with existing optimization-based approaches. To begin with, we consider the special case where we aim to reduce a Gaussian mixture to a single Gaussian and use the ISE in (1) or KL divergence in (4) as the cost function in CTD. In this case, we have the equivalence

The left-hand side of equation (20) represents the CTD between the original mixture and the reduced mixture, while the right-hand side represents the divergence between two mixtures. This equality demonstrates that when reducing a mixture to a single Gaussian using certain cost functions, these two approaches are equivalent. A proof of this equivalence is provided in Appendix C.

In a more general setting, when the cost function satisfies the “convexity” property, we can establish that our proposed objective serves as an upper bound on the divergence between two mixtures.

Theorem 5. Let be a non-negative bi-variate function on space of Gaussian distributions with “convexity” property: for any , and component PDFs , we have

The proof of this theorem can be found in Appendix C. Importantly, the KL divergence and the ISE possess the convexity property, as demonstrated in Appendix C. Consequently, our proposed method minimizes an upper bound of the existing ISE approach and the computationally challenging minimum KL divergence approach.

IV. EXPERIMENTS

A. General experimental setting

We demonstrate the effectiveness of the proposed GMR approach through experiments. We consider the following four GMR methods:

1) Greedy: We include the most recently developed greedy algorithm-based approach from [14] as a baseline for comparison.

2) ISE: We include the optimization-based reduction method from [18] as another baseline for comparison.

3) CTD–KL: This is our CTD-based method with the cost function being the KL divergence in (5). When the cost function uses KL divergence with , the proposed GMR approach reduces to the existing clustering-based approach, as shown in Table I. The case of has not been considered in the existing literature.

4) CTD–ISE: This is our CTD-based method with the cost function being the ISE defined as follows:

The use of ISE as the cost function is a novel contribution of our proposed approach, as it has not been studied in existing literature. This approach is used to illustrate the generality of our proposed CTD framework, which allows the choice of other valid divergences.

The regularization parameter in the proposed method plays a role in the quality of reduction. To see the difference, we conduct experiments with different levels of regularization to cover both hard () and soft clustering-based () algorithms. To determine the optimal value for a given cost function of the CTD in the proposed approach, we select the value that achieves the lowest integrated square error between the original and reduced mixtures from a grid of values and this output is called the soft clustering-based method. The grid of values is chosen as follows:

The regularization parameter in our proposed method influences the quality of reduction. To explore this, we conduct experiments with different levels of regularization, covering both hard clustering-based () and soft clustering-based () algorithms. To determine the optimal value for a given cost function in the soft clustering-based methods, we select the value that achieves the lowest integrated square error between the original and reduced mixtures from a grid of values. The grid of values is chosen as follows:

The reason for selecting the value of over the specified grid is as follows. We observe that as increases, more and more components of the reduced mixtures become identical. In the most extreme case when , all components of the reduced mixture are identical, resulting in the original mixture being reduced to a single Gaussian. This can be observed from the expression in (13), where all entries in converge to as , leading to identical cluster centers. Therefore, we conclude that an appropriate range for should be determined based on the component-wise divergence of the original mixture.

For non-greedy algorithm-based approaches, we employ multiple initial values to avoid local minima, and we select the output with the smallest objective function value is considered as the reduced mixture. We use the same initial values for all methods except for the soft clustering-based method. Specifically, we initialize all algorithms with five multiple initial values, where four of them correspond to the outputs of four greedy algorithm-based reduction methods: Salmond [7], Runnalls [8], Williams [18], and Wasserstein [14]. The last initialization approach involves generating 1000 samples randomly from the original mixture and using the output of the EM algorithm with order M as the corresponding initial value. In the hand gesture recognition experiment, we randomly select five images from each gesture class as initial values. For the soft clustering-based methods, we do not employ multiple initial values but utilize the output of the corresponding hard clustering-based reduction result as the initial value.

The stopping criterion for the algorithm is when the relative change in the objective function falls below . In other words, let represent the value of the objective function at the t-th iteration. The algorithm terminates when the following condition is met:

We compare the total runtime of the algorithms across multiple initializations. The experiments are implemented in Python 3.7.7 on the Cedar cluster at Compute Canada. The source code is publicly available at the following GitHub repository: https://github.com/SarahQiong/CTDGMR.

B. Simulated mixtures

We begin by focusing on reducing a bivariate Gaussian mixture of order N = 25. Instead of arbitrarily selecting original mixtures, we generate a total of R = 100 mixtures with a structured random pattern. In each repetition, we generate the parameter values for the original mixture using the following procedure:

1. We assign equal mixing weights to all components, with for j = 1, . . . , 25. 2. We generate a multinomial random vector with equal event probabilities. From there, we uniformly and randomly choose from the interval , where L = 10. 3. Next, we generate uniformly within a circle with a radius of 2.5. The component means are then defined as . This process results in components that are roughly clustered around five random centers. Refer to Fig. 2 (a) for a typical representation of the outcome.

Fig. 2: A typical order N = 25 original Gaussian mixture. (a) Centers marked as diamonds and locations of component means. (b) Heat map of the density function of the original mixture.

We next generate N = 25 component covariance matrices. We first generate independently from Gamma distribution with shape parameter 8 and scale parameter 4 followed by a rotation angle uniformly in . We then let

be the n-th component covariance matrix. Fig. 2 (b) shows the heat map of a typical density function. This design ensures the reduction is a meaningful exercise and there is enough uncertainty to set apart various reduction methods.

Given the knowledge of how the original mixtures are generated, it is natural to reduce the original mixture to order M = 5. In real-world applications, we most likely do not have such knowledge. Therefore, we experiment with M ranging from 3 to 22. We use the integrated squared error in (2) between the original and reduced mixtures as a performance measure. The lower

Fig. 3: Heat maps of the density functions of the reduced mixture.

the value, the better the performance. Fig. 4 shows their performances in terms of approximation precision and computational time. We also experimented on higher dimension Gaussian mixtures. Due to space limits, we do not present these results and these methods have similar relatively performances.

All reduction methods exhibit improved performance as the order M increases, but they also require more computation time. As expected, the minimum ISE approach of [18] achieves the smallest integrated squared error, which is evident from our experimental results. On the other hand, the greedy algorithm-based method performs the poorest across all methods in terms of ISE. Due to its inferior performance, we only include this method in the simulated mixture example section and do not consider it for the remaining experiments.

Our proposed CTD-based methods, CTD–ISE and CTD–KL, achieve comparable or slightly worse precision compared to the minimum ISE approach, while still producing meaningful reduction results. Notably, CTD–ISE significantly outperforms CTD– KL, highlighting the effectiveness of our proposed framework. When M = 22, we observe that CTD–ISE can achieve similar performance to ISE with only 1/10th of the computational time. In contrast, the minimum ISE approach in [18] requires 1000 times more computation time than the proposed KL-based approach. Soft clustering methods exhibit slightly better precision than their hard clustering counterparts but require more iterations to converge, leading to increased computational time, as shown in Table II.

In summary, the proposed approach with CTD–ISE yields mixture reduction methods that achieve comparable performance to the minimum ISE approach while significantly reducing computational costs. Table II shows that our proposed hard CTDbased method converges within only 3 steps, even though the worst case is . Additionally, considering the negligible improvement of the soft clustering-based method compared to the hard clustering-based methods, we do not recommend using the soft clustering approach. Instead, it is advisable to utilize a different cost function, such as ISE, instead of KL divergence.

Fig. 4: (a) between the reduced and original mixtures. (b) The computational time. The plot includes the reduction approaches MISE (solid line with dot), hard CTD–KL (solid line with triangle), soft CTD–KL (dashed line with triangle), hard CTD–ISE (solid line with cross), and soft CTD–ISE (dashed line with cross).

TABLE II: The mean (std) of number of iterations of CTD based reduction approaches over 100 repetitions.

C. Approximate inference in belief propagation

This experiment illustrates the effectiveness of Gaussian mixture reduction in belief propagation. The belief propagation is an iterative algorithm used to compute the marginal distributions in the graphical model. Specifically, let there be a graph with a node set V and an undirected edge set E. A probabilistic graphical model associates each node with a random variable, say , and postulates that the density function of the random vector can be factorized into

for some non-negative valued functions and . We call local potential and local evidence potential. Denote the neighborhood of a node i as . A message is a function associated with edge (i, j) and it is updated in the tth step according to

The belief function associated with the density function of is updated in the tth step according to

The messages and beliefs are iteratively updated until convergence. We refer to the above procedure as belief propagation.

In belief propagation, the closed-form outcome of the messages generally does not exist. To ensure efficient inference, density functions of Gaussian mixtures are often used to approximate the messages for two reasons. First, they are flexible to approximate any density function to arbitrary precision. Second, they lead to closed-form outputs in the message and belief updates, which are also mixtures whose orders increase exponentially as iterations. However, this naive iterative procedure quickly becomes intractable. One remedy is to perform a Gaussian mixture reduction step after each iteration to stop the order from increasing exponentially.

We apply the proposed GMR methods to the belief propagation for the model represented by Fig. 5 (a) following [6]. We let , with marked alongside the graph edges in the figure and

We create R = 100 graphs with parameter values generated independently and identically distributed according to: