FINITE mixture models are widely employed to approximate nearly all smooth density functions, a concept referredto as the universal approximation property [1, 2]. Mathematically, a finite mixture model consist of a collection of probability distributions, where the distribution function is a convex combination of a finite number of distinct distributions from a parametric distribution family. Among various types of finite mixture models, the finite Gaussian mixture model (GMM) stands as the most widely utilized mixture in numerous applications, primarily due to the advantageous properties offered by the Gaussian distribution. The probability density function (PDF) of a finite mixture is defined as follows. Let det
be the PDF of a d-dimensional Gaussian. We exchangeably write
with
where
denotes the space of
symmetric positive definite matrices. Let
be a Dirac measure at
and
be a probability measure that assigns probability
to
for
where
for
. We denote the PDF of an N-component Gaussian mixture by
We call G the mixing distribution, the mixing weight, and
the component parameter. The number of components N is also called the order of a mixture. We prefer parameterizing GMM with a mixing distribution G rather than a vector such as
because the mixing distribution G does not suffer from the well-known label-switching dilemma [3, Section 1.14].
Due to the parametric nature of mixture models, they offer a convenient and efficient means of approximating distributions with unknown shapes, thanks to their universal approximation property. This property simplifies downstream inference tasks and enhances computational efficiency. Consequently, mixture models have found numerous applications in approximate inference methods, including belief propagation [4, 5] and Bayesian filtering [6]. In these applications, a finite Gaussian mixture is used to provide an initial approximation to density functions that are updated recursively. A challenge in these recursions is that the order of the Gaussian mixture increases exponentially and the inference quickly becomes computationally intractable. To overcome the difficulty, the technique called Gaussian mixture reduction (GMR), which approximates a high-order Gaussian mixture by one with lower-order, can be used. As seen in Fig. 1, the density function of an 8-component mixture in (a) is well approximated by a 3-component mixture in (b). One could hence use GMR to replace a high-order GMM with a
Q. Zhang is with the Institute of Statistics and Big Data, Renmin University of China, Beijing, China. A. Zhang is with the Department of Statistical Sciences, University of Toronto, Toronto, Canada. J. Chen is with the Department of Statistics, University of British Columbia, Vancouver, Canada.
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Fig. 1: Two Gaussian mixtures of different orders have similar shaped density functions (solid line). The dashed lines are component density functions.
lower-order GMM after each update, thereby reducing the computational cost in downstream operations. More specifically, GMR approximates a given mixture by a lower order mixture (
)
We refer to two mixtures and
as the original mixture and the reduced mixture respectively. We assume the target order M is pre-specified.
There exist at least three general types of GMR approaches in the literature: greedy algorithm-based [7, 8], optimization-based [9, 10], and clustering-based approaches [6, 11–16]. While greedy algorithms are straightforward to implement, they often employ ad-hoc similarity metrics [17] and tend to yield suboptimal outcomes. Also, when there is a large discrepancy between the orders of the original mixture and the reduced mixture, the greedy algorithms are computationally expensive. On the other hand, optimization-based approaches formulate GMR as an optimization problem, aiming to minimize the divergence between the reduced mixture and the original mixture. The advantage of this approach lies in having a well-defined optimality target. However, selecting an appropriate divergence measure can be challenging. Ideally, the divergence should be easy to evaluate, and the corresponding optimization should be computationally efficient. To this end, previous works such as [18] and [10] have proposed using the squared distance as the objective function in optimization-based GMR methods. Although the
distance has a closed-form, minimizing this objective can still be computationally expensive and notably slow, especially when the GMM has a large dimension d or reduction order M, or both. To develop more numerically efficient algorithms for GMR, clustering-based approaches have been introduced. Inspired by the popular k-means algorithm [19], these approaches offer computational advantages over optimization-based methods. They treat GMR as a clustering problem in the space of Gaussian distributions. The process involves iteratively assigning components of the original mixture to different clusters using similarity metrics and updating cluster centers through moment matching. However, despite their computational efficiency, the theoretical properties of clustering-based approaches remain largely unknown. Important questions regarding convergence guarantees and rates, as well as the reliability of moment matching for updating cluster centers, are yet to be answered. In this paper, we aim to address these questions and shed light on the theoretical aspects of clustering-based GMR methods.
In this paper, we introduce a novel optimization-based GMR approach along with its corresponding numerical algorithm. Despite being an optimization-based approach, we also demonstrate that the existing clustering-based approaches are special cases of our proposed method. Therefore, we effectively bridge the gap between optimization-based and clustering-based approaches for GMR. Our approach enables the interpretation of clustering-based methods from a new perspective, while also offering a fresh method that potentially enhances performance. We emphasize the following key contributions of our work:
• Novel optimization-based GMR approach. First, we propose a novel optimization-based approach for GMR that addresses the limitations of existing methods. We target at minimizing a composite transportation divergence (CTD) [20, 21] between the original and the reduced mixtures. The CTD leverages the computational efficiency of the divergence between two Gaussian components, making it easier to minimize than many other divergences defined between two mixtures. Additionally, by formulating GMR as an optimization problem, we provide a more principled framework for GMR.
• Unified perspective. Second, we develop a novel and computationally efficient majorization-minimization (MM) algorithm for our optimization problem. Remarkably, we demonstrate that many existing clustering-based approaches can be viewed as special cases of our MM algorithm. We reveal that when the clustering-based approaches are correctly performed, then
their hidden optimality targets are the CTD to the original mixture. Moreover, our convergence results for the general MM algorithm can be applied to the existing clustering-based algorithms, thereby providing theoretical guarantees for their convergence. This analysis offers valuable insights into the convergence properties of clustering-based approaches, strengthening their validity and applicability. Additionally, our investigation highlights that moment matching may not always be the optimal choice for updating the cluster centers in the clustering-based algorithm. Instead, it is crucial to align the cluster center updates with the chosen divergence in the assignment step. Relying solely on moment matching without considering the divergence in the assignment step can lead to non-convergence of the clustering-based approach.
• Improved performance. Third, though many existing clustering-based methods are our special cases, we show that we can improve their performane by choosing cost functions. The CTD is associated with a cost function on the space of Gaussian distributions. The selection of different cost functions gives rise to a family of CTDs, offering users the flexibility to choose the most suitable cost function for their specific applications. This versatility empowers users to optimize the performance of their Gaussian mixture reduction process by selecting a cost function that aligns closely with their modeling objectives and desired outcomes.
In conclusion, our proposed GMR approach combines the advantages of both optimization-based and clustering-based methods. It provides a well-motivated optimization target while also being numerically efficient, and bridges the gap between these two approaches. The remainder of the paper is organized as follows. In Section II, we overview the existing approaches for GMR. In Section III, we formally define the composite transportation divergence, the proposed GMR method, and the corresponding MM algorithm. We point out that many existing clustering-based methods are special cases of our MM algorithm. The unified framework permits the users to choose the most suitable cost functions to achieve superior performance in their specific applications. We compare different approaches by numerical experiments in Section IV. Section V concludes the paper.
There are three general categories of existing methods for GMR: greedy algorithm-based, optimization-based, and clustering- based approaches. In this section, we provide a brief review of these methods and suggest interested readers refer to [17] for a more comprehensive review.
A. Greedy algorithm-based approaches
The greedy algorithm-based methods can be categorized into top-down and bottom-up methods. Top-down greedy algorithm-based methods [7, 8, 22] typically involve merging two components or pruning one component at a time from the original mixture until the desired order is achieved. On the other hand, bottom-up greedy algorithm-based methods [10] start with a single Gaussian component and successively add one component at a time until the desired order is reached. When merging two components, a common approach is to select the most similar pair of components and merge them using moment matching. In the case of pruning, the method typically discards either the component with the smallest weight or the component with the lowest “cost” to remove, followed by the adjustment of the weights of the remaining components. Greedy methods often rely on ad-hoc similarity measures [17]. While these methods strive for optimality at each step, the final result may fall short of being truly globally optimal [15].
In contrast, our method has a well-defined overall optimality goal, distinguishing it from the greedy algorithm-based approaches. We aim to find a global optimum by formulating the GMR as an optimization problem with a clear objective.
B. Optimization-based approaches
The GMR is naturally formulated as an optimization problem in [10, 18]. Let f(x) and be two PDFs. The integrated squared error (ISE) or the squared
distance between f and
measures the integrated squared difference between their PDFs and is given by
Under the special case of Gaussian mixture, the ISE between and
has the following convenient expression:
where , and
, and
are matrices with their (i, j)-th elements respectively being
, and
.The optimization-based GMR approach in [18] is to search for
where
is the space of mixing distributions with M components.
Evaluating the ISE, , is straightforward numerically, but optimizing it can be computationally expensive. The method employed by [18] utilizes a Quasi-Newton algorithm that optimizes over
variables. However, the per-iteration computational cost scales at
. Given that this cost is quartic in dimension d and quadratic in M, it becomes prohibitively expensive as d and M increase. Moreover, due to the non-convex nature of the objective function
, the Quasi-Newton algorithm can become trapped at local minima. Consequently, finding an algorithm that is less susceptible to local minima becomes crucial, as highlighted by [10].
In contrast, our method is also optimization-based and possesses a well-defined optimality target. However, our approach offers improved computational efficiency compared to the minimum ISE approach in [18]. Moreover, we show that when the cost function is chosen to be the ISE between two Gaussians in the CTD, our objective function is a surrogate of ISE between two mixtures.
C. Clustering-based approaches
Both the greedy algorithm-based and optimization-based GMR approaches face scalability challenges as the dimensions d and the desired reduction order M increase. To address this issue, clustering-based approaches have emerged as competitive alternatives, offering computational efficiency advantages. Clustering-based approaches, exemplified by methods like the one described in [15], draw inspiration from the popular k-means algorithm [19] in Euclidean space and adapt it to the space of Gaussian distributions. They start with a user-proposed M Gaussian distributions as initial cluster centers and iterate between the following two steps:
• Assignment step: Partition N components of the original mixture into M groups based on their closeness to the current cluster centers according to some divergence .
• Update step: Relocate the cluster centers based on the components assigned to each cluster.
The approach iterates until there are no meaningful changes in these centers. These cluster centers are then exported as M components of the reduced mixture. The mixing weight of each reduced component is the sum of the weights of the original components assigned to the corresponding cluster.
Similar to the k-means algorithm in the vector space, there exist various assignment schemes in clustering-based GMR approaches. These schemes can be broadly categorized into “hard” clustering-based and “soft” clustering-based methods, depending on how the components of the original mixture are assigned to clusters. In the case of hard clustering-based approaches, each component in the original mixture is assigned exclusively to a single cluster. On the other hand, soft clustering-based approaches assign components of the original mixture to multiple clusters, usually in proportions that reflect their membership strength. Some existing hard and soft clustering-based approaches for GMR are as follows. 1) Hard clustering: Let and
respectively be the nth and mth component of the original and reduced mixture. In theassignment step, [12], [13], and [15] choose the Kullback–Leibler (KL) divergence as their closeness metric:
Under the special case of Gaussian distributions, the KL divergence becomes
The derivation of the KL divergence is given in Appendix A-A. The squared Wasserstein distance [23]
is chosen as the closeness metric in [14] where is the Euclidean norm.
They assign the nth component in the original mixture to the mth cluster when . One may randomly pick one cluster if a component is equally close to more than one cluster. Denote C(m) the index set of components assigned to the mth cluster in one iteration. Let
. The mth cluster center is updated by moment matching [15]
Alternatively, [14] propose to update the cluster centers by the Wasserstein barycenter of but uses an approximate solution instead for computational efficiency. The iteration continues until the change in the
is below some user-specified threshold.
2) Soft clustering: Instead of assigning the whole component of the original mixture to a single cluster, the soft clustering-based approach assigns fraction of the nth component to the mth cluster for some
and
. Various forms of
have been considered in the literature. For instance, [11] let
and [6] uses
with some hyper-parameter I > 0 and
. They also use moment matching to update the cluster centers.
The soft clustering-based approach reduces to the hard clustering-based approach as the hyper-parameter . This is seen by noticing
The computational cost of both hard and soft clustering-based algorithms is at each iteration, which is lower than the per iteration cost of the optimization-based approach in [9].
Although clustering-based approaches offer computational advantages in GMR, their theoretical properties have remained largely unknown. In this paper, we make significant progress in understanding these methods by demonstrating that many existing clustering-based approaches are special cases of our proposed method. This realization allows us to unveil the hidden optimality targets of these existing methods, which turn out to be the CTD to the original mixture. We hence bridge the gap between clustering-based and optimization-based approaches for GMR. Furthermore, our formulation introduces a more general class of clustering-based algorithms that offer flexibility to users. Within this framework, users can select the appropriate divergence measure in the assignment step and subsequently update the cluster centers based on their specific application requirements. This enhanced flexibility allows for improved customization and performance optimization in a wide range of practical scenarios.
The optimization-based GMR formulation in [18] suffers from a lack of effective and efficient numerical solutions, despite its conceptual simplicity. Instead of continuously searching for elusive algorithms to minimize the ISE in (2), it is more practical to replace ISE with other divergences that offer favorable theoretical properties and enable efficient optimization algorithms. It is worth noting that most divergences are computationally expensive to evaluate between two mixtures. However, the evaluation cost is considerably lower when comparing two Gaussian distributions. Exploiting this property, we consider a Gaussian mixture as a discrete distribution defined on the space of Gaussian distributions. By introducing a divergence on the space of Gaussian distributions, we can induce a transportation divergence between two finite Gaussian mixtures. This leads to the development of a composite transportation divergence (CTD), which not only possesses strong theoretical motivations but also facilitates the design of effective and efficient algorithms.
In this section, we begin by illustrating several optimization-based GMR approaches. Through these illustrations, we highlight the numerical challenges encountered, which serve as the driving force behind our novel GMR framework based on CTD. Subsequently, we introduce our proposed GMR approach, accompanied by the innovative MM algorithm. After that, we present the theoretical results on the convergence properties of the MM algorithm. Finally, we conclude the section by showing the connection of our proposed method to existing optimization-based and clustering-based approaches.
A. Optimization-based GMR with KL divergence
The Kullback–Leibler (KL) divergence is widely recognized as a popular measure of similarity between two distributions. At first glance, it may appear to be a natural choice of divergence for the optimization-based GMR approach. However, employing the KL divergence does not yield an effective optimization-based GMR solution as we show below.
If we adopt the KL divergence between two mixtures, the resulting optimization problem can be expressed as follows:
where
Since the minimization is over and G is known, the above optimization problem is equivalent to
The solution is therefore linked to the widely known population maximum likelihood estimate (MLE) of a finite mixture [24]. However, the task of solving for presents even greater challenges compared to the ISE optimization problem (3). This is primarily due to the additional computational cost involved in evaluating the KL divergence between two mixtures, compounded by the difficulty of minimizing a non-convex function.
To tackle the optimization problem (6), one possible approach is to employ the population EM algorithm, as discussed in [24]. The algorithm begins by proposing a hypothetical mixture distribution, denoted as , for a random variable X, where X is part of the complete data (X, Z) and Z represents a latent variable. The conditional distribution of X given Z = m is represented as
, with
. On the other hand, the true distribution of X is denoted as
, with
. The joint hypothetical density of X, Z is given by
and the posterior of Z given X = x,
With the complete data (X, z), the complete population log-likelihood of is given by
• Once is defined and an initial
is given, the population M-step proceeds as
Although the M-step is conceptually straightforward, computational challenges persist, as explained below. Given that the Qfunction in (7) is additive in and
, it enables an updating scheme that allows for separate updates of these parameters.
and
Although we have introduced a straightforward updating scheme, it is important to note that both (8) and (9) involve integrals of the ratio of two Gaussian mixture densities . These integrals are computationally intractable, further complicating the optimization task associated with the KL divergence-based GMR.
Another possible approach is via the finite sample EM algorithm. In the finite sample EM algorithm, the expectations in (8) and (9) are replaced by their corresponding sample means. Consequently, one way to alleviate the computational burden is to obtain approximate solutions [12]. The idea is to generate Monte Carlo samples Xfrom the original mixture
. The log-likelihood of the reduced mixture is then given by
which converges to as
. Consequently, the MLE based on
converges to
, the desired reduction result. However, Monte Carlo methods often suffer from the curse of dimensionality, where the number of samples needed to achieve the same level of precision tends to increase exponentially with the dimensionality, denoted as d. Therefore, this approach becomes impractical for large d when given a fixed computational cost budget. In summary, the GMR approach based on KL divergence presents an even more demanding optimization challenge compared to the GMR based on the ISE divergence.
Furthermore, it is widely recognized that both the finite sample EM algorithm and the population EM algorithm are prone to getting trapped in local maxima. This issue persists when applied to GMR, posing a significant challenge. However, there is potential to address this concern by gaining a deeper understanding of the structural characteristics of local maxima, as discussed in [25]. Acquiring detailed knowledge of the original mixture can aid in effectively identifying local maxima and locating the global maximum.
In order to advance the adoption of optimization-based GMR methods, it is crucial to develop improved computational techniques. The proposed optimization-based GMR approach utilizing the CTD represents a significant breakthrough in this regard, offering promising avenues to overcome the aforementioned challenges.
B. Optimization-based GMR with CTD
Our objective is to propose an optimization-based method that combines the advantages of having a well-defined optimal target and the computational efficiency found in clustering-based methods. To achieve this, we begin by examining the k-means algorithm in the vector space from an optimization perspective, as discussed in [26].
Consider a set of d-dimensional vectors associated with
clusters, where M is a predetermined number. The k-means problem aims to find M elements
that minimize the following objective function:
where is some divergence between two sets O and R. It can be seen that from an optimization perspective, the goal of the k-means algorithm is to find the optimal set that minimizes the divergence to Os.
In the context of the clustering-based approach for GMR, we are not dealing with vectors anymore; instead, we seek to cluster N Gaussian distributions into M groups. Therefore, by extending the concept in (10) to the space of distributions, we can replace the divergence D(O, R) between two sets in vector space with a measure that quantifies the similarity between two distributions. Additionally, since each Gaussian component is associated with a weight, we must also consider the corresponding mixing weights. This analogy motivates us to explore the composite transportation divergence (CTD) between two finite Gaussian mixtures.
We begin by formally defining the CTD and explaining its connection with the objective function of the k-means algorithm. Let and
represent the PDF of the original and reduced Gaussian mixtures, respectively, with appropriate orders. The mixing weights are denoted as w and
, and the components’ PDFs are denoted as
and
. The CTD [21, 27] measures the transportation cost from the original mixture to the reduced mixture and is defined as follows.
Definition 1 (Composite transportation divergence). Let be a divergence on the space of Gaussian distributions F =
. The composite transportation divergence (CTD) between two finite Gaussian mixtures with cost function
is
where
is the set of coupling matrices with marginals w and . Let
be a regularization parameter. An entropic regularized CTD is
with entropy .
The CTD between two mixtures is a type of optimal transport divergence, as described in [28], which measures the divergence between the mixing distributions G and . The optimal transport divergence arises from the concept of the optimal transportation problem, and an illustrative example can help convey its intuition.
Consider a scenario where there are N warehouses and M factories operating in the space of Gaussian distributions F. The nth warehouse, located at , contains
units of raw material, while the mth factory, located at
, requires
units of raw material. Let
denote the per unit cost to transport materials from warehouse n to factory m, and let
represent the amount of material being transported. Assuming that the transportation cost is proportional to the amount of material transported, the total transportation cost under a given transportation plan
is given by
. The coupling set
represents all possible transportation plans, subject to two marginal constraints: (a) the correct amount of material is taken from the warehouses, and (b) the correct amount of material is received by the factories. The optimal transportation problem aims to find the transportation plan
that minimizes the total cost among all feasible plans in
. The resulting minimum total cost, obtained under the optimal transportation plan, corresponds to the CTD between the two mixtures. In other words, the CTD quantifies the transportation cost associated with moving the materials optimally from one mixture to another.
The computation of the CTD involves two numerical tasks. Firstly, we need to evaluate the cost function between two components, which is computationally cheap compared to evaluating the direct divergence between two mixtures. The second numerical task is to find the optimal transportation plan, typically achieved through numerical algorithms such as linear programming. The computational cost of this task is typically on the order of when the number of components N is equal to the number of clusters M. To expedite the computation of the optimal transport, an entropic regularization technique was proposed by [29]. This approach provides an approximate solution to the optimal transport problem, significantly reducing the computational time. It is worth noting that in our paper, the entropic regularization serves a different purpose, as our proposed GMR does not require solving the optimal transportation problem as we show below. We will elaborate on the purpose of introducing the entropic regularization in Section III-E1. For simplicity, we also refer to the entropic regularized CTD as CTD, highlighting the differences only when necessary.
As the original mixture is known, for the simplicity of notation, we write
Given a cost function
on the space of Gaussian distributions and a regularization parameter
, we propose to reduce
of order N to
of order M with
We illustrate how our proposed objective (12) generalizes the objective (10) of the k-means algorithm in the vector space. In our context, we consider the original mixture and the reduced mixture as sets of weighted observations in the space of Gaussian distributions F. For instance, the original mixture consists of observations with associated weight
. In this analogy, the composite transportation divergence
plays a similar role as the divergence D(O, R) in k-means, quantifying the dissimilarity between the two sets. Consequently, our objective is to identify the optimal set that minimizes this divergence.
With this objective, our approach clearly falls within the framework of optimization-based GMR methods. We will now outline a straightforward numerical algorithm to solve this optimization problem, which not only enables our proposed method to have a clear optimal target but also inherits the computational advantages of clustering-based approaches.
C. The tailor-made MM algorithm
Upon initial examination, it may appear that solving for in (12) requires addressing two sub-problems, each involving a constrained optimization issue: (a) evaluating
for each
; (b) minimizing
with respect to
. The first sub-problem typically involves a numerical search for transportation plans
within the coupling set
. However, we demonstrate that such sequential optimization is not necessary. Instead, the overall optimization problem can be seamlessly resolved using a single Majorization-Minimization (MM) algorithm.
The intuition behind this approach is as follows: recall that the CTD represents the lowest cost of transporting materials from warehouses to factories. The optimal transportation plan, , transports components from warehouses to components in factories at the lowest cost. When transporting all materials from warehouses to factories,
must have its first marginal matching the warehouse mixing distribution and its second marginal matching the factory mixing distribution. However, in the context of our GMR approach, the factory mixing distribution is being optimized. Therefore, we can allow the second marginal of
to be flexible, rather than imposing a specific form on it. In other words, instead of constraining
to have
as its second marginal, we let
be the second marginal of
. This renders the marginal distribution constraint
on
redundant. By removing this redundant constraint, the optimal transportation plan with only one marginal constraint can be expressed in closed form, facilitating the use of an easy-to-implement MM algorithm. Now, let us formally describe the algorithm based on this insight. Let
and
be a vector of the component parameters of thetarget mixture
. Define a function with its dependency on the original G hidden:
The optimization of the transportation plans over involve only one linear marginal constraint in terms of w. Once the target component parameters
are given, the optimal transportation plan
has its (n, m)th entry:
When , the solution is given by
.
Theorem 1 (Equivalent optimization problem with single marginal constraint). Let , and the other notation be the same as given earlier. Let
The mixing distribution of the reduced mixture is then
The proof is deferred to Appendix B-A. The advantage of the new objective is that it has a closed-form form given by:
This objective depends solely on the component parameters and does not involve the mixing weights. The mixing weights are uniquely determined by the component parameters through the definition of
. This separation allows us to optimize over the component parameters and mixing weights separately, reducing the overall optimization problem to minimizing
with respect to the component parameters
. Although the closed form of
allows for easy computation, it is important to note that both the optimal transportation
plan and the cost function are dependent on the component parameters. To address this dependency, we propose
an iterative procedure inspired by the well-known Majorization-Minimization (MM) algorithm [30], which enables us to separate these dependencies and optimize them iteratively. For completeness, we provide a brief overview of the MM algorithm following [30]. Suppose we wish to minimize a function g(x) over some space X. The MM algorithm iteratively updates a solution from an initial point . After t iterations, with the current solution
, MM algorithm first constructs a function
that majorizes g(x) at
, that is
with equality holds at
. It then updates
with
. The success of MM algorithm relies on finding a majorization function
, preferably convex in x, that is easy to minimize. Such a procedure ensures a decreasing sequence of
. We now present the majorization and minimization steps in minimizing (14). • Majorization Let
be the mixing distribution updated after t iterations. We first propose a majorization function forthe optimization goal (15):
with dependent on the entropy regularization strength
as in (13). Note this majorization function is the regularized total transportation cost under transportation plan
. • Minimization We then minimize the majorization function (16) and update
for each m:
The minimization step in (17) during each iteration of the algorithm is simplified compared to directly solving (14). This simplification arises because the design of the majorization function in equation (16) separates the components , allowing for separate and parallel optimization with respect to each
. Additionally, the proposed algorithm benefits from the sequential update of the mixing proportions
and the component parameters
. Specifically, we first update
using the expression
and then obtain . This sequential updating scheme allows for a more efficient optimization process. The algorithm iterates between the majorization step in equation (16) and the minimization step in equation (17) until the change in
falls below a certain threshold. The complete algorithm is summarized in Algorithm 1.
One primary advantage of our proposed approach for GMR, compared to other optimization-based approaches, is its easy-to-implement algorithm and computational efficiency. In the assignment step, the optimal transportation can be computed using a closed-form solution, resulting in a cost of O(NM), given the values. The per-evaluation cost of the commonly used
is
. Therefore, the total cost for the assignment step is
. The updating step (17) involves solving for the barycenter [31] for M clusters, with costs depending on the cost function employed in the CTD in (11). When the cost function is KL divergence, the corresponding Gaussian barycenter can be computed at a cost of
. The per-iteration cost in this case is
, which is lower than the per-iteration cost of directly minimizing the integrated squared error in Section II, which is quartic in d.
Theorem 2 (Convergence of the algorithm). Suppose the cost function is continuous in both arguments. Assume that for any constant
and
, the following set is compact according to some distance on F:
Let be the sequence of mixing distributions generated by
with some initial mixing
The proof of this theorem can be found in Appendix B-C. It is worth noting that, similar to the famous EM algorithm [32], these properties alone do not guarantee the convergence of . However, these two properties do imply that
converges monotonically to a constant
. Furthermore, all limiting points
are stationary points of
, meaning that iterations from
do not further reduce the value of
. In certain special cases or under specific conditions, we can demonstrate the convergence of
.
Theorem 3. Assume the same conditions of Theorem 2. Suppose we carry out the MM iteration forever. When , we have 1) There exists a large enough T and
such that for all
,
and
(18) 2) Limiting point
of
starting from any
is a local minimum of
in the space of
. 3) There exist an MM related exhaustive algorithm with exponential time
to solve (12) for fixed d.
The proof can be found in Appendix B-D. The first two properties ensure the convergence of both and
forany initial value when
. Based on our experience, we have observed that the value of T is typically small, as shown in Table II. In many real-world applications, it is anticipated that N is approximately 30 or less, although previous studies such as [15] have experimented with N on the order of 1000 to demonstrate the scalability of clustering-based methods. Consequently, an exhaustive search can be a viable tool in certain scenarios. Additionally, it provides a way to examine how frequently a globally optimal solution is missed by an iterative procedure with its specific initial value scheme.
When , we provide a novel convergence analysis of our MM algorithm from a mirror descent perspective. Our analysis draws inspiration from the convergence analysis of the EM algorithm in KL divergence [33]. In particular, we establish that the MM updates can be interpreted as mirror descent updates, where each iteration minimizes the linearization of the objective function while incorporating a weighted Bregman divergence penalization term. By establishing this connection, we can analyze the convergence rate of our proposed MM algorithm from a mirror descent perspective.
The formal description of our main result requires some preliminary results.
Definition 2 (Bregman divergence). Let be a function that is: a) strictly convex, b) continuously differentiable, c) defined on a closed convex set
. Then the Bregman divergence induced by A is defined as
That is, the difference between the value of A at and the first order Taylor expansion of A around
evaluated at point
. The function
is also called a reference function.
The Bregman divergence encompasses various well-known divergences as special cases. For example, under Gaussian distributions, we have the following result:
where is the log-partition function of the Gaussian distribution when written in the standard form of the natural exponential family. The Bregman divergence is connected to the mirror descent algorithm as follows.
Mirror descent is a generalization of the gradient descent algorithm [34]. Suppose we aim to minimize a function f over a set , starting from some initial value
. The mirror descent iteratively updates the parameters as follows:
where L > 0 is a constant, and represents a Bregman divergence. The objective in the mirror descent update at each iteration is a linear function penalized by the Bregman divergence. It has been demonstrated in [34] that when f is L-smooth relative to A (defined in Definition 3), the convergence of
is linear, and the rate is quantified as:
where is the lower bound of f.
We discovered that our MM updates can be expressed as mirror descent-like updates. Therefore, the convergence analysis of the mirror descent algorithm can be extended to the convergence analysis in our case. Specifically, let be a cost function. If in addition, the cost function
is a Bregman divergence induced by
, namely
for any
and
. Let
be defined in (13) and
. We show in Lemma 1 in Appendix B-E that our MM updates can be written as
where
, and
.We say this update is mirror descent alike since the RHS of (19) is not penalized by a Bregman divergence but its weighted summation. With this mirror descent-like updates, we are able to show the following linear convergence result.
Theorem 4 (Rate of convergence of MM algorithm). Let be the sequence of the component parameters produced by the mirror descent update in (19) and
. Then
E. Connection with existing methods
1) Connection with existing clustering-based approaches: Our proposed GMR approach is an optimization-based method, as evident from its formulation. In this section, we demonstrate that our MM algorithm can be used to derive both hard and soft clustering-based Gaussian mixture reduction methods, depending on the value of . Specifically, when
, our algorithm corresponds to a class of hard clustering-based methods. On the other hand, for
, it represents a class of soft clustering-based approaches. It is important to note that the inclusion of entropy regularization in our method is aimed at achieving soft clustering-based results, rather than solely for computational efficiency.
By selecting specific cost functions and values of
, our algorithm encompasses various clustering-based algorithms found in the existing literature. We summarize these algorithms, along with the corresponding choices of cost functions and
, in Table I.
Previous work [15] has demonstrated that hard clustering-based approaches generally offer computational advantages over the minimum ISE approach proposed by [18]. Additionally, these hard clustering-based methods outperform certain greedy algorithms, such as those presented in [22] and [8]. However, the convergence and optimality of hard clustering-based approaches have not been extensively studied before. By establishing the connection between our method and clustering-based approaches, our results provide crucial support to these methods by addressing these aspects that were previously missing in the literature.
2) Convergence: Due to the connection between our method and existing clustering-based algorithms, the convergence of most clustering-based methods can be inferred when their corresponding entropic regularized CTD satisfies the conditions outlined in Theorem 4.
3) Consistency of assignment and update steps: Our proposed MM algorithm employs the same cost function in both the assignment and update steps. In the assignment step, this cost function measures the similarity between components in the original mixture and the components in the proposed mixture
. In the update step, we seek the barycenter of the components in the original mixture that are assigned to the same cluster, using the same cost function. Our theory demonstrates that the MM algorithm generates a sequence with non-increasing entropic regularized CTD, ensuring convergence in this scenario. However, if different cost functions are used in the assignment and update steps, this guarantee may not hold. An example of this occurs when components are assigned to clusters based on a divergence measure such as the Wasserstein distance, but the cluster centers are updated through moment matching. As moment matching leads to the barycenter under the KL divergence, the convergence of the algorithm is not implied by our theory in such cases.
TABLE I: The relationship between the proposed GMR approach and existing clustering based GMR approaches according to the cost function and regularization strength
. Empty entries indicate new approaches not previously explored.
We will begin by examining the case when and the cost function is the KL divergence in (5). Let
be the clustercenter after t iterations. For simplicity, let us assume that for every n, there exists a unique
. In this case, the transportation plan in the MM algorithm is given by:
which corresponds to the KL barycenter. As demonstrated in Appendix C, the barycenter solution coincides with the moment matching solution for Gaussian mixtures. Hence, the proposed GMR approach with and the KL divergence as the cost function corresponds to the hard-clustering GMR algorithm presented in [15]. Similarly, when
and
represents the 2-Wasserstein distance
between two Gaussians, the proposed GMR approach leads to the hard-clustering algorithm presented in [14]. Consider the case when
and
. In this scenario, our proposed algorithm reducesto the soft clustering-based algorithm presented in [6]. However, it is important to note that this cost function depends on the mixing weights of the reduced mixture, and thus Theorem 2 does not directly apply, leaving the convergence of the algorithm uncertain. Nevertheless, we provide a valid motivation for this algorithm.
The soft-clustering algorithm in [6] draws inspiration from variational inference [35]. Consider a scenario where we have I random observations from . Let the data be denoted as
, and the log-likelihood function is given by
. Taking the expectation of this log-likelihood yields:
Let be a random sample from
for each n. It appears that [6] suggests
Conceptually, the claim in their suggested variational lower bound implies that X has a probability of being a random sample from the component
. However, this claim is not true and invalidates the proposed variational lower bound.
In Section IV, we conducted experiments to demonstrate the efficacy of our proposed GMR approach by introducing a novel cost function, the ISE between two Gaussians. It is noteworthy that previous clustering-based GMR methods have not explored the use of this cost function. Our findings clearly illustrate that leveraging ISE can lead to significant performance improvements in existing clustering-based algorithms. This novel contribution highlights the effectiveness of a unified minimum-CTD-based GMR approach.
2) Connection with existing optimization-based approaches: We also establish a connection of our proposed method with existing optimization-based approaches. To begin with, we consider the special case where we aim to reduce a Gaussian mixture to a single Gaussian and use the ISE in (1) or KL divergence in (4) as the cost function in CTD. In this case, we have the equivalence
The left-hand side of equation (20) represents the CTD between the original mixture and the reduced mixture, while the right-hand side represents the divergence between two mixtures. This equality demonstrates that when reducing a mixture to a single Gaussian using certain cost functions, these two approaches are equivalent. A proof of this equivalence is provided in Appendix C.
In a more general setting, when the cost function satisfies the “convexity” property, we can establish that our proposed objective serves as an upper bound on the divergence between two mixtures.
Theorem 5. Let be a non-negative bi-variate function on space of Gaussian distributions with “convexity” property: for any
, and component PDFs
, we have
The proof of this theorem can be found in Appendix C. Importantly, the KL divergence and the ISE possess the convexity property, as demonstrated in Appendix C. Consequently, our proposed method minimizes an upper bound of the existing ISE approach and the computationally challenging minimum KL divergence approach.
A. General experimental setting
We demonstrate the effectiveness of the proposed GMR approach through experiments. We consider the following four GMR methods:
1) Greedy: We include the most recently developed greedy algorithm-based approach from [14] as a baseline for comparison.
2) ISE: We include the optimization-based reduction method from [18] as another baseline for comparison.
3) CTD–KL: This is our CTD-based method with the cost function being the KL divergence in (5). When the cost function uses KL divergence with , the proposed GMR approach reduces to the existing clustering-based approach, as shown in Table I. The case of
has not been considered in the existing literature.
4) CTD–ISE: This is our CTD-based method with the cost function being the ISE defined as follows:
The use of ISE as the cost function is a novel contribution of our proposed approach, as it has not been studied in existing literature. This approach is used to illustrate the generality of our proposed CTD framework, which allows the choice of other valid divergences.
The regularization parameter in the proposed method plays a role in the quality of reduction. To see the difference, we conduct experiments with different levels of regularization to cover both hard (
) and soft clustering-based (
) algorithms. To determine the optimal
value for a given cost function of the CTD in the proposed approach, we select the
value that achieves the lowest integrated square error between the original and reduced mixtures from a grid of
values and this output is called the soft clustering-based method. The grid of
values is chosen as follows:
The regularization parameter in our proposed method influences the quality of reduction. To explore this, we conduct experiments with different levels of regularization, covering both hard clustering-based (
) and soft clustering-based (
) algorithms. To determine the optimal
value for a given cost function in the soft clustering-based methods, we select the value that achieves the lowest integrated square error between the original and reduced mixtures from a grid of
values. The grid of
values is chosen as follows:
The reason for selecting the value of over the specified grid is as follows. We observe that as
increases, more and more components of the reduced mixtures become identical. In the most extreme case when
, all components of the reduced mixture are identical, resulting in the original mixture being reduced to a single Gaussian. This can be observed from the expression in (13), where all entries in
converge to
as
, leading to identical cluster centers. Therefore, we conclude that an appropriate range for
should be determined based on the component-wise divergence of the original mixture.
For non-greedy algorithm-based approaches, we employ multiple initial values to avoid local minima, and we select the output with the smallest objective function value is considered as the reduced mixture. We use the same initial values for all methods except for the soft clustering-based method. Specifically, we initialize all algorithms with five multiple initial values, where four of them correspond to the outputs of four greedy algorithm-based reduction methods: Salmond [7], Runnalls [8], Williams [18], and Wasserstein [14]. The last initialization approach involves generating 1000 samples randomly from the original mixture and using the output of the EM algorithm with order M as the corresponding initial value. In the hand gesture recognition experiment, we randomly select five images from each gesture class as initial values. For the soft clustering-based methods, we do not employ multiple initial values but utilize the output of the corresponding hard clustering-based reduction result as the initial value.
The stopping criterion for the algorithm is when the relative change in the objective function falls below . In other words, let
represent the value of the objective function at the t-th iteration. The algorithm terminates when the following condition is met:
We compare the total runtime of the algorithms across multiple initializations. The experiments are implemented in Python 3.7.7 on the Cedar cluster at Compute Canada. The source code is publicly available at the following GitHub repository: https://github.com/SarahQiong/CTDGMR.
B. Simulated mixtures
We begin by focusing on reducing a bivariate Gaussian mixture of order N = 25. Instead of arbitrarily selecting original mixtures, we generate a total of R = 100 mixtures with a structured random pattern. In each repetition, we generate the parameter values for the original mixture using the following procedure:
1. We assign equal mixing weights to all components, with for j = 1, . . . , 25. 2. We generate a multinomial random vector
with equal event probabilities. From there, we uniformly and randomly choose
from the interval
, where L = 10. 3. Next, we generate
uniformly within a circle with a radius of 2.5. The component means are then defined as
. This process results in components that are roughly clustered around five random centers. Refer to Fig. 2 (a) for a typical representation of the outcome.
Fig. 2: A typical order N = 25 original Gaussian mixture. (a) Centers marked as diamonds and locations of component means. (b) Heat map of the density function of the original mixture.
We next generate N = 25 component covariance matrices. We first generate independently from Gamma distribution with shape parameter 8 and scale parameter 4 followed by a rotation angle
uniformly in
. We then let
be the n-th component covariance matrix. Fig. 2 (b) shows the heat map of a typical density function. This design ensures the reduction is a meaningful exercise and there is enough uncertainty to set apart various reduction methods.
Given the knowledge of how the original mixtures are generated, it is natural to reduce the original mixture to order M = 5. In real-world applications, we most likely do not have such knowledge. Therefore, we experiment with M ranging from 3 to 22. We use the integrated squared error in (2) between the original and reduced mixtures as a performance measure. The lower
Fig. 3: Heat maps of the density functions of the reduced mixture.
the value, the better the performance. Fig. 4 shows their performances in terms of approximation precision and computational time. We also experimented on higher dimension Gaussian mixtures. Due to space limits, we do not present these results and these methods have similar relatively performances.
All reduction methods exhibit improved performance as the order M increases, but they also require more computation time. As expected, the minimum ISE approach of [18] achieves the smallest integrated squared error, which is evident from our experimental results. On the other hand, the greedy algorithm-based method performs the poorest across all methods in terms of ISE. Due to its inferior performance, we only include this method in the simulated mixture example section and do not consider it for the remaining experiments.
Our proposed CTD-based methods, CTD–ISE and CTD–KL, achieve comparable or slightly worse precision compared to the minimum ISE approach, while still producing meaningful reduction results. Notably, CTD–ISE significantly outperforms CTD– KL, highlighting the effectiveness of our proposed framework. When M = 22, we observe that CTD–ISE can achieve similar performance to ISE with only 1/10th of the computational time. In contrast, the minimum ISE approach in [18] requires 1000 times more computation time than the proposed KL-based approach. Soft clustering methods exhibit slightly better precision than their hard clustering counterparts but require more iterations to converge, leading to increased computational time, as shown in Table II.
In summary, the proposed approach with CTD–ISE yields mixture reduction methods that achieve comparable performance to the minimum ISE approach while significantly reducing computational costs. Table II shows that our proposed hard CTDbased method converges within only 3 steps, even though the worst case is . Additionally, considering the negligible improvement of the soft clustering-based method compared to the hard clustering-based methods, we do not recommend using the soft clustering approach. Instead, it is advisable to utilize a different cost function, such as ISE, instead of KL divergence.
Fig. 4: (a) between the reduced and original mixtures. (b) The computational time. The plot includes the reduction approaches MISE (solid line with dot), hard CTD–KL (solid line with triangle), soft CTD–KL (dashed line with triangle), hard CTD–ISE (solid line with cross), and soft CTD–ISE (dashed line with cross).
TABLE II: The mean (std) of number of iterations of CTD based reduction approaches over 100 repetitions.
C. Approximate inference in belief propagation
This experiment illustrates the effectiveness of Gaussian mixture reduction in belief propagation. The belief propagation is an iterative algorithm used to compute the marginal distributions in the graphical model. Specifically, let there be a graph with a node set V and an undirected edge set E. A probabilistic graphical model associates each node with a random variable, say , and postulates that the density function of the random vector
can be factorized into
for some non-negative valued functions and
. We call
local potential and
local evidence potential. Denote the neighborhood of a node i as
. A message
is a function associated with edge (i, j) and it is updated in the tth step according to
The belief function associated with the density function of
is updated in the tth step according to
The messages and beliefs are iteratively updated until convergence. We refer to the above procedure as belief propagation.
In belief propagation, the closed-form outcome of the messages generally does not exist. To ensure efficient inference, density functions of Gaussian mixtures are often used to approximate the messages for two reasons. First, they are flexible to approximate any density function to arbitrary precision. Second, they lead to closed-form outputs in the message and belief updates, which are also mixtures whose orders increase exponentially as iterations. However, this naive iterative procedure quickly becomes intractable. One remedy is to perform a Gaussian mixture reduction step after each iteration to stop the order from increasing exponentially.
We apply the proposed GMR methods to the belief propagation for the model represented by Fig. 5 (a) following [6]. We let , with
marked alongside the graph edges in the figure and
We create R = 100 graphs with parameter values generated independently and identically distributed according to: , and
. We obtain exact inference following (23) and (24) for the first 4 iterations and it becomes infeasible for more iterations. We reduce the order of the message mixture to M = 4 using three reduction methods after any iteration when its order exceeds 4 following [6]. We then use the reduced message mixture to update the beliefs according to (24) leading to approximated beliefs.
Fig. 5: (a) Graphical model. (b) Integrated squared error between the exact and approximate beliefs. (c) Computational times for belief update. The proposed methods include soft clustering-based (boxes without hatching pattern) and hard clustering-based (diagonal hatching).
We use the ISE to evaluate the performance of the reduction methods averaged across 4 nodes and the first 3 iterations. For the soft clustering-based methods, we use values over the same grids as Section IV-A. The integrated squared error is computed for each fixed
value applied to all nodes and in all iterations. That is, the performance is tallied for each
value, not cherry-picked over repetitions. The reported integrated squared error is the lowest one achieved by a single
value.
Fig. 5 (b) contains box-plots of 100 outcomes. By definition, the ISE approach has the lowest integrated squared error. As anticipated, both CTD–ISE and CTD–KL do not attain as low integrated squared error as the ISE, but are much more computationally efficient. Similar to the experiments on the simulated mixtures, the proposed methods corresponding to soft clustering have lower integrated squared error values and use slightly more computational time. They have lower variation with cost function (22) and higher precision with the KL divergence cost function than their hard clustering counterpart.
D. Real data application for hand gesture recognition
Static hand gesture recognition involves training a classifier to identify hand gestures in future images based on a set of labeled images. In this study, we utilized the Jochen Triesch static hand posture database, which is publicly available online [36]. This database consists of grayscale images of size depicting 10 hand postures representing the alphabetic letters: A, B, C, D, G, H, I, L, V, and Y. The images were captured from 24 individuals against three different backgrounds.
To remove the backgrounds and standardize the dataset, we followed the same procedure as described in [37]. The hands in these images are centered through cropping and subsequent resizing. After preprocessing, we obtained a total of 168 images, with each hand posture having 16 to 20 images. [37] approached the problem by considering the intensity of each pixel as a function of its location. They approximated this function using a Gaussian mixture density function, up to a normalization constant. For each image, they fitted an order of 10 Gaussian mixtures. Fig. 6 illustrates an original image from the dataset alongside the heat map representing the density function of the corresponding fitted mixture.
[37] proposed a classification method where a new image is classified based on its Cauchy-Schwarz divergence [38] to all training images. For instance, a test image of a hand posture is classified as posture “A” if there exists a training image with posture “A” that is closest to the test image. This appoach achieved a cross-validation classification accuracy of 95.2%. However, this test procedure can be time-consuming when dealing with a large number of training images. To address this issue, we investigate an alternative approach by combining the proposed GMR method with a slightly different strategy from [37].
Fig. 6: (a) A typical pre-processed image of hand posture “C”. (b) Heat map of a fitted order 10 Gaussian mixture for the image in (a).
In our approach, we create a representative image called a class prototype for each hand posture. Instead of comparing the test image with all training images, we compare it with these class prototypes. This significantly reduces the number of comparisons when dealing with a large number of training images, albeit potentially sacrificing some classification accuracy. To create the class prototypes, we first replace each original training image with a Gaussian mixture. Next, we combine the training images of the same hand posture into a single mixture by directly summing their components. This combined mixture has a higher order. Subsequently, we reduce the order of the summed-up mixture for each hand posture to a order 10 Gaussian mixture, which serves as the class prototype. Finally, we employ a nearest neighbor classifier for the classification task. Fig. 7 illustrates the prototypes of the hand postures obtained using this approach.
Fig. 7: The class prototypes of hand gestures obtained by different reduction approaches.
We conducted a comparative analysis of classification accuracy and computational time for various reduction methods. Given the relatively small training set, we employed a 5-fold cross-validation approach repeated 100 times to gauge the classification accuracy. Our evaluation considered two schemes:
1) In the first scheme, we utilize the same divergence for both classification and reduction. For instance, we minimize the ISE to obtain the class prototype and measured the similarity between the test images and the class prototypes using the same divergence. Fig. 8 (a) depicts the classification accuracy of this strategy using different reduction methods. For the proposed methods based on the soft clustering algorithm, we conduct a search over the same grid as before and employ the same value of
for each posture and each cross-validation fold.
2) In the second scheme, we explore the use of different divergences for the test and reduction phases. For instance, we minimize the ISE to obtain the class prototype, but employ the CTD with KL divergence to measure the similarity between the test images and the class prototypes. Fig. 8 shows the classification accuracy obtained using this strategy, showcasing the impact of various combinations of divergences. To avoid an excessive number of combinations, we only included the proposed approach with . By conducting these experiments, we aimed to identify the most effective reduction methods and optimal combinations of divergences in order to maximize classification accuracy while considering computational efficiency.
Fig. 8: (a) Classification accuracy when the reduction and test use the same divergence. (b) Computational time of different reduction approaches. (c) Classification accuracy when the reduction and test use the different divergences. The proposed methods include soft clustering-based (boxes without hatching pattern) and hard clustering-based (diagonal hatching).
When the same divergence is utilized for both reduction and test, the ISE achieves the highest classification accuracy. However, it requires a greater amount of computational time. In contrast, our proposed methods exhibit slightly lower classification accuracy but significantly reduced computational time. Notably, our proposed minimum CTD-based method with KL divergene yields higher accuracy compared to other reduction methods.
In the case where different divergences are employed for reduction and test, the classification accuracy is maximized when using the ISE for the test phase. Furthermore, all reduction methods demonstrate satisfactory classification accuracy.
Overall, the combination of CTD–KL for reduction and ISE for testing emerges as the most favorable choice in terms of both computational time and classification accuracy. Since CTD-KL is the most computationally efficient approach for reduction, and ISE is the most efficient for the test phase, our experimental results demonstrate that our best approach effectively combines the advantages of both divergences.
In this paper, we introduce a novel optimization-based Gaussian mixture reduction (GMR) approach, aiming to minimize the composite transportation divergence between the original mixture and a mixture of the desired order. To facilitate practical implementation, we also present an efficient iterative majorization-minimization algorithm with proven convergence properties.
Our proposed method encompasses various existing clustering-based methods as special cases. Notably, we empirically demonstrate that by selecting the integrated squared error between two Gaussians as the cost function, we can enhance the performance of these existing clustering-based methods. Nevertheless, users have the flexibility to choose the most suitable cost functions based on their specific applications.
In summary, our proposed GMR approach effectively combines the merits of both optimization-based and clustering-based methods. It offers a well-motivated optimization target while maintaining numerical efficiency, thereby bridging the gap between these two approaches. By introducing this innovative approach, we provide a valuable tool for researchers and practitioners in the field.
[1] T. T. Nguyen, H. D. Nguyen, F. Chamroukhi, and G. J. McLachlan, “Approximation by finite mixtures of continuous density functions that vanish at infinity,” Cogent Mathematics & Statistics, vol. 7, no. 1, p. 1750861, 2020.
[2] D. M. Titterington, S. Afm, A. F. Smith, and U. Makov, Statistical Analysis of Finite Mixture distributions. John Wiley & Sons Incorporated, 1985, vol. 198.
[3] G. McLachlan and D. Peel, Finite Mixture Models. John Wiley & Sons, 2004.
[4] E. B. Sudderth, A. T. Ihler, M. Isard, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” Communications of the ACM, vol. 53, no. 10, pp. 95–103, 2010.
[5] M. A. Brubaker, A. Geiger, and R. Urtasun, “Map-based probabilistic visual self-localization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 4, pp. 652–665, 2015.
[6] L. Yu, T. Yang, and A. B. Chan, “Density-preserving hierarchical EM algorithm: simplifying Gaussian mixture models for approximate inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 6, pp. 1323–1337, 2018.
[7] D. J. Salmond, “Mixture reduction algorithms for target tracking in clutter,” in Signal and Data Processing of Small Targets 1990, vol. 1305. International Society for Optics and Photonics, 1990, p. 434.
[8] A. R. Runnalls, “Kullback-Leibler approach to Gaussian mixture reduction,” IEEE Transactions on Aerospace and Electronic Systems, vol. 43, no. 3, pp. 989–999, 2007.
[9] J. L. Williams and P. S. Maybeck, “Cost-function-based hypothesis control techniques for multiple hypothesis tracking,” Mathematical and Computer Modelling, vol. 43, no. 9–10, pp. 976–989, 2006.
[10] M. F. Huber and U. D. Hanebeck, “Progressive Gaussian mixture reduction,” in 2008 11th International Conference on Information Fusion. IEEE, 2008, pp. 1–8.
[11] N. Vasconcelos and A. Lippman, “Learning mixture hierarchies,” in Advances in Neural Information Processing Systems 11, 1999, pp. 606–612.
[12] J. Goldberger and S. T. Roweis, “Hierarchical clustering of a mixture model,” in Advances in Neural Information Processing Systems 17, 2005, pp. 505–512.
[13] J. V. Davis and I. S. Dhillon, “Differential entropic clustering of multivariate Gaussians,” in Advances in Neural Information Processing Systems 19, 2007, pp. 337–344.
[14] A. Assa and K. N. Plataniotis, “Wasserstein-distance-based Gaussian mixture reduction,” IEEE Signal Processing Letters, vol. 25, no. 10, pp. 1465–1469, 2018.
[15] D. Schieferdecker and M. F. Huber, “Gaussian mixture reduction via clustering,” in 2009 12th International Conference on Information Fusion. IEEE, 2009, pp. 1536–1543.
[16] K. Zhang and J. T. Kwok, “Simplifying mixture models through function approximation,” IEEE Transactions on Neural Networks, vol. 21, no. 4, pp. 644–658, 2010.
[17] D. F. Crouse, P. Willett, K. Pattipati, and L. Svensson, “A look at Gaussian mixture reduction algorithms,” in 14th International Conference on Information Fusion. IEEE, 2011, pp. 1–8.
[18] J. L. Williams, “Gaussian mixture reduction of tracking multiple maneuvering targets in clutter,” Master’s thesis, Air Force Institute of Technology, 2003.
[19] S. Lloyd, “Least squares quantization in pcm,” IEEE transactions on information theory, vol. 28, no. 2, pp. 129–137, 1982.
[20] X. Nguyen, “Convergence of latent mixing measures in finite and infinite mixture models,” The Annals of Statistics, vol. 41, no. 1, pp. 370–400, 2013.
[21] Y. Chen, T. T. Georgiou, and A. Tannenbaum, “Optimal transport for Gaussian mixture models,” IEEE Access, vol. 7, pp. 6269–6278, 2018.
[22] M. West, “Approximating posterior distributions by mixtures,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 55, no. 2, pp. 409–422, 1993.
[23] C. Villani, Topics in Optimal Transportation. American Mathematical Society, 2003, vol. 58.
[24] S. Balakrishnan, M. J. Wainwright, and B. Yu, “Statistical guarantees for the EM algorithm: From population to sample- based analysis,” The Annals of Statistics, vol. 45, no. 1, pp. 77–120, 2017.
[25] W. Qian, Y. Zhang, and Y. Chen, “Structures of spurious local minima in k-means,” IEEE Transactions on Information Theory, vol. 68, no. 1, pp. 395–422, 2021.
[26] N. Ho, X. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Phung, “Multilevel clustering via wasserstein means,” in International conference on machine learning. PMLR, 2017, pp. 1501–1509.
[27] J. Delon and A. Desolneux, “A Wasserstein-type distance in the space of Gaussian mixture models,” SIAM Journal on Imaging Sciences, vol. 13, no. 2, pp. 936–970, 2020.
[28] G. Peyr´e and M. Cuturi, “Computational optimal transport: with applications to data science,” Foundations and Trends® in Machine Learning, vol. 11, no. 5–6, pp. 355–607, 2019.
[29] M. Cuturi, “Sinkhorn distances: lightspeed computation of optimal transport,” in Advances in Neural Information Processing Systems 26, 2013, pp. 2292–2300.
[30] D. R. Hunter and K. Lange, “A tutorial on MM algorithms,” The American Statistician, vol. 58, no. 1, pp. 30–37, 2004.
[31] M. Agueh and G. Carlier, “Barycenters in the Wasserstein space,” SIAM Journal on Mathematical Analysis, vol. 43, no. 2, pp. 904–924, 2011.
[32] C. J. Wu, “On the convergence properties of the EM algorithm,” The Annals of Statistics, vol. 11, no. 1, pp. 95–103, 1983.
[33] F. Kunstner, R. Kumar, and M. Schmidt, “Homeomorphic-invariance of em: Non-asymptotic convergence in kl divergence for exponential families via mirror descent,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 3295–3303.
[34] H. Lu, R. M. Freund, and Y. Nesterov, “Relatively smooth convex optimization by first-order methods, and applications,” SIAM Journal on Optimization, vol. 28, no. 1, pp. 333–354, 2018.
[35] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American statistical Association, vol. 112, no. 518, pp. 859–877, 2017.
[36] J. Triesch and C. Von Der Malsburg, “Robust classification of hand postures against complex backgrounds,” in Proceedings
of the Second International Conference on Automatic Face and Gesture Recognition. IEEE, 1996, pp. 170–175.
[37] K. Kampa, E. Hasanbelliu, and J. C. Principe, “Closed-form Cauchy-Schwarz pdf divergence for mixture of Gaussians,” in The 2011 International Joint Conference on Neural Networks. IEEE, 2011, pp. 2578–2585.
[38] R. Jenssen, J. C. Principe, D. Erdogmus, and T. Eltoft, “The Cauchy–Schwarz divergence and Parzen windowing: connections to graph theory and Mercer kernels,” Journal of the Franklin Institute, vol. 343, no. 6, pp. 614–629, 2006.
[39] A. W. Van der Vaart, Asymptotic Statistics. Cambridge University Press, 2000, vol. 3.
Qiong Zhang received her Ph.D. degree in statistics from the University of British Columbia in 2022. Prior to that, she got her B.Sc. degree from the University of Science and Technology of China in 2015 and the M.Sc. degree in Statistics from the University of British Columbia in 2017. She is currently an assistant professor in the institute of statistics and big data at Renmin University of China. Her research interests include inference under Gaussian mixture models and distributed learning.
Archer Gong Zhang received his Ph.D. degree in statistics from the University of British Columbia in 2022. Prior to that, he got his honours B.Sc. degree in statistics from the University of Toronto in 2016. He is currently a postdoctoral fellow in statistical sciences at the University of Toronto. His research focuses on developing efficient statistical methods for data from multiple populations that share some latent structures, and he is interested in semiparametric and nonparametric inferences.
Jiahua Chen earned his Ph.D. in statistics from the University of Wisconsin-Madison in 1990. After a year of post-doctoral research, he joined the University of Waterloo, Canada, as a Faculty Member. In 2007, he became part of The University of British Columbia, where he held the Canada Research Chair-Tier I until 2020. Chen is a fellow of IMS and ASA, and in 2022, he was honored as a fellow of the Royal Society of Canada. He received the CRM-SSC Prize in 2005 and the Gold Medal of the Statistical Society of Canada in 2014.
A. Derivation of KL Divergence between Gaussians
In this section, we provide the details for deriving the explicit expression of the KL divergence between two Gaussian distributions. Let det
be the probability density function of a Gaussian distribution with mean
and covariance matrix
, then
,
and
tr�
�tr��
tr��
tr��
��
By applying (25), we then have
B. Gaussian Barycenter under KL Divergence
The key to the update step of our composite transportation divergence based approach for Gaussian mixture reduction is to find out the barycenter of Gaussian distributions under various divergences. The barycenter of the Gaussian distributions with respect to some divergence has either an explicit solution or permits simple numerical solution. We show the Gaussian barycenter under the KL divergence.
Let and
be a vector of positive values of length N. The (weighted) KL barycenter of Gaussian distributions
is defined to be
where F is the space of all Gaussian distributions. The KL barycenter has mean
and covariance matrix
We prove the conclusion below.
Proof. With KL divergence, the barycenter confined in the space of Gaussians has its mean and covariance minimizing the function
We now use the following linear algebra formulas
to work out partial derivatives of L with respect to and
. They are given by
Setting both partial derivatives to 0, we obtain
and the covariance
This completes the proof.
A. Proof of Theorem 1
Proof. The key conclusion of this theorem is
We prove this result by showing both
and
We first prove (26). Let . Denote the mth component and the corresponding mixing weight by
and
respectively. Let
Then
where (a) holds since and
for any function f when
.
Next, we prove (27). Let . Denote the mth component and the corresponding mixing weight by
and
respectively. Let
Then by definition, we have . Clearly,
, hence
where (b) holds since and by the definition of
, which completes the proof.
B. Optimal Transportation Plan with One Marginal Constraint
Let be the entropy of the transportation plan
and
. Denote
for some that is known and does not depend on
. We show in this section that
Proof. Let
We prove the results in the following two cases.
• Case I () The Lagrangian associated with (28) is
• Case II () The objective function in this case becomes
which is linear in under the constraints that
for n = 1, 2, . . . , N. Therefore, by the linearity and the fact that
, it is clear that the objective function is smallest when
C. Proof of Theorem 2
Proof. We first prove property 1). We have for all
with equality holds at
. Hence,
This is the property that a majorization-minimization algorithm must have. We now prove property 2). Suppose has a convergent subsequence leading to a limit
. Let this subsequence be
. By Helly’s selection theorem [39], there is a subsequence
of
such that
has a limit, say
. These limits, however, could be subprobability distributions. That is, we cannot rule out the possibility that the total probability in the limit is below 1 by Helly’s theorem.
This is not the case under the theorem conditions. Let be large enough such that
is not empty. With this , we define
Suppose has a component
such that
for all n. Replacing this component in
by any
to form
, we can see that for any t,
This result shows that none of the components of are members of
. Otherwise,
does not minimize
at the tth iteration.
Note that the complement of is compact by condition
Consequently, the components of are confined to a compact subset. Hence, all limit points of
, including both
and
, are proper distributions. By the monotonicity of the iteration:
Let , we get
By the definition of the majorization-minimization iteration, we have
Let and by the continuity of
, we get
Hence, is a solution to
when
. Namely, we have
. With the help of (30), it further implies
when . This shows that iteration from
does not make
smaller than
. Hence,
is a stationary point of the majorization-minimization iteration. This is the conclusion (ii) and we have completed the proof.
Finally, when , the proposed algorithm is a member of hard clustering-based. The eventual solution is to divide the N components of the original mixture into M clusters, followed by finding the total weight of each cluster and its barycenter. Since there are
possible different clustering outcomes, we have at most
different
values. By the monotonicity of
, the MM-algorithm must stall after a finite number of iterations. In other words, it converges after at most
iterations.
D. Proof of Theorem 3
We first prove Property 1). Note that the minimization step (17) of the MM algorithm updates the cluster center by the barycenter of the components in this cluster. Since there are at most distinct partitions of N components into M clusters, there are hence at most
different
outputs. Since
monotonically decrease with t, we must have
when t is sufficiently large. This implies (18). Suppose all distinct partitions produce distinct
values, (18) then further implies
because distinct
and
cannot have the same
value. Suppose some distinct partitions produce the same
values. If so, we can place a tiny disturbance on the original ixture
so that none of the transportation plan will be altered. At the same time, it leads to distinct
values as there are only finite number of partitions. Hence,
when t is sufficiently large with the disturbed G. By continuity, the conclusion remains true for the original mixture. This proves property 1). We next prove Property 2). If
is not a local minimum, then there must be infinite many
that are arbitrarily near
such that
This is not possible as there are at most distinct
values. This completes the proof. We finally prove Property 3). One may work out a mixing distribution
for each partition. Since there are no more than
distinct partitions of N components into M clusters, and the global optimal reduction is one of the them, an exhaustive search over all of them is guaranteed to solve the optimality problem, which completes the proof.
to locate the minimum point of . A mirror descent algorithm follows the same routine after replacing
by some specific divergence function on
.
Definition 3 (Relative smoothness). Let be a differentiable convex function on convex set
and
be a Bregman divergence induced by A. We say
is L-smooth relative to
on
if for any
,
When f is L-smooth relative to and convex with minimum point
, [33] prove that the iterative algorithm
satisfies
In our proposed MM-algorithm, the mixing proportions of is completely determined by the support points
, we ignore mixing proportion and explore the relative smoothness in terms of only support points. We therefore interpret mixing distributions as follows subsequently:
We use G for the mixing distribution of the original mixture, and
for two general mixing distributions. We will also use
and
the mixing distribution sequence as output of the MM-algorithm. We use the following interpretation when taking gradient or derivatives with respect to G:
Lemma 1 (Relative smoothness of ). Suppose the cost function
is a Bregman divergence induced by some
such that
for two Gaussian densities and
(with component parameters
). Then
Proof. Under lemma condition on , we have
Hence,
Hence, taking gradient with respect to on two sides of (32) followed by letting
, we get
Applying this identity back to (32), we get
Because majorizes
, we have
or
This leads to conclusion
Note that the upper bound is a sum of functions of , allowing separate separate programming:
where we have used notation
This result shows that the MM updates at each iteration is very similar to a mirror descent algorithm. In spite of the similarity, the updating scheme of in (33) is not mirror descent because unlike L in (31), our
depends on t. However, it still leads to the convergence property in Theorem 4.
Proof of Theorem 4. Let be the sequence of the component parameters produced by the mirror descent update in (19) and
. We show that
The MM-algorithm iteration at the component parameter level is
which further implies
Let and
in (32), use (34) in the second step in the following, we get
This implies
Hence,
This completes the proof.
In this section, we provide details regarding connection bewteen existing and proposed optimization based GMR approachs
presented in Section III-E.
When M = 1, two GMR approaches based on ISE and KL divergences are the same. Our task is to show (20). In the
which proves (20). When M > 1, Theorem 5 states that the composite transportation divergence upper bounds the plain divergence between two mixtures when the cost function has the convexity property in (21). We prove this theorem assuming the convexity property. Let be a transportation plan between
and
. It is seen that
with inequality implied by the convexit property (21). Taking the infimum over , we get
for non-negative numbers . Hence, KL divergence has the convexity property.