Statistical issues with over-specification. While density estimation in fi-
Computational concerns with mixture models. While the papers discussed
Fig 1. Plots of the error in the EM solution versus the sample size n, focusing on the effect of signal strength on EM solution accuracy. The true data distribution is given by
1) and we use EM to fit the model
1), generating the EM estimate
samples. (a) When the signal is strong, the estimation rate decays at the parametric rate
, as revealed by the
2 slope in a leastsquare fit of the log error based on the log sample size log n. (b) When there is no signal (
= 0), then depending on the choice of weight
in the fitted model, we observe two distinct scalings for the error:
and,
5, again as revealed by least-squares fits of the log error using the log sample size log n.
Fig 2. Behavior of the (numerically computed) population EM updates (8) when the underlying data distribution is N(0, 1). (a) Unbalanced mixture fits (2) with weights (): We observe geometric convergence towards
= 0 for all
5 although the rate of convergence gets slower as
5. (b) Balanced mixture fits (2) with weights (0.5, 0.5): We observe two phases of convergence. First, EM quickly converges to ball of constant radius and then it exhibits slow convergence towards
= 0. Indeed, we see that during the slow convergence, the population EM updates track the curve given by
) very closely, as predicted by our theory.
Theorem 1. Suppose that we fit an unbalanced instance (i.e., of the mixture model (3) to
data. Then:
(a) The population EM operator (8) is globally strictly contractive, meaning that
(b) There are universal constants such that given any
and a sample size
, the sample EM sequence
generated by the update (9) satisfies the upper bound
Fig 3. Scaling of the Euclidean error for EM estimates
bound of the order (d/n) on the EM solution is sharp in terms of n and d.
Theorem 2. Suppose that we fit a balanced instance (mixture model (3) to
data. Then the population EM operator (8)
has the following properties:
scalars , any sample size
+ log(log(1
and any iterate number
with probability at least 1
that ) can be chosen arbitrarily close to zero, so at the expense of
Theorem 4. There are universal positive constants such that for any non-zero solution
to the sample EM fixed-point equation
for the balanced mixture fit, we have
for any positive radius r, any threshold , and any sample size
where denotes the imbalance in the mixture fit (3).
Fig 4. Scaling of the Euclidean error for EM estimates
computed using the balanced (
) mixture-fit (2). Here the true data distribution is
denotes the EM iterate upon convergence when we fit a balanced mixture with n samples in d dimensions. (a) Scaling with respect to
. (b) Scaling with respect to
. We ran experiments for several other pairs of (n, d) and the conclusions were the same. Clearly, the empirical results suggest a scaling of order (
for the final iterate of sample-based EM.
Fig 5. Illustration of the annulus-based-localization argument part (I): Defining the epochs or equivalently the annuli. (a) Outer radius for the -th epoch is given by
(tracking dependency only on n). (b) For any given epoch
, we analyze the behavior of the EM sequence
when
lies in the annulus around
with inner and outer radii given by
, respectively. We prove that EM iterates move from one epoch to the next epoch (e.g. epoch
+ 1) after at most
erations. Given the definition of
, we see that the inner and outer radii of the aforementioned annulus converges linearly to
. Consequently, after at most log(1
) epochs (or
) iterations), the EM iterate lies in a ball of radius
. We illustrate the one-step dynamics in any given annulus in Figure 6.
Fig 6. Illustration of the annulus-based-localization argument part (II): Dynamics of EM in the -th epoch or equivalently the annulus
For a given epoch
, we analyze the behavior of the EM sequence
lies in the annulus with inner and outer radii given by
, respectively. In this epoch, the population EM operator
) contracts with a contraction coefficient that depends on
, which is the inner radius of the disc, while the perturbation error
between the sample and population EM operators depends on
, which is the outer radius of the disc. Overall, we prove that
is non-expansive and after at most
steps, the sample EM updates move from epoch
). We demonstrate that the slow convergence of EM is very likely
General over-specified mixture fits on Gaussian data. First, we remark that
Over-specified fits for mixtures of Gaussian data. Using similar reasoning
Fig 7. Plots of the Wasserstein error associated with EM fixed points versus the sample size for fitting various kinds of location mixture models to standard normal N(0, 1) data. We fit mixture models with either two or three components, with all location parameters estimated in an unconstrained manner. The lines are obtained by a linear regression of the log error on the sample size n. (a) Fitting a two-mixture model
1) with three different fixed values of weights
and two (unconstrained) location parameters, along with least-squares fits to the log errors. (b) Data plotted as red triangles is obtained by fitting a two-component model with unknown mixture weights and two location parameters
1), whereas green circles correspond to results fitting a three-component mixture model
1). In all cases, the EM solutions exhibit the slow
tical rate for the error in parameter estimation. Also see Figure 9.
Fig 8. Behavior of EM for an over-specified Gaussian mixture. True model:
= 10. We fit a model
1), where we initialize
close to
. (a) Population EM updates: We observe that while
slowly to
= 0, the iterates
converge exponentially fast to
(b) We plot the statistical error for the two parameters. While the strong signal component has a parametric
rate, for the no signal component EM has the slower
rate, which is in good agreement with the theoretical results derived in the paper. (We remark that the error floor for
panel (a) arises from the finite precision inherent to numerical integration.)
Fig 9. Plots of Wasserstein error when both weights and location parameters are unknown and estimated using EM and the fitted multivariate mixture model is over-specified. (a) True model: ), and fitted model
) and (b) True model:
fitted model:
). In both cases, once again we see the scaling of order
for the final error (similar to results in Figure 7 and 8).
-operator norm of the matrix
defined in equation (36), is upper-bounded as
Lemma 3. For all vectors , the matrix
defined in equation (39), satisfies the bounds
Lemma 4. Consider the sample-based EM iteration a sample size
. Suppose that there exists an index
and an iteration number t such that
. Then, conditional on the event
from equation (44), we have
where step (i) follows since ] = 1, and from the fact that the random
that the map is increasing in
, and for any fixed value of
the function
is a decreasing function
Lemma 5. For both the unbalanced or balanced model fits (3), when the true model is standard Gaussian, we have
where denotes a universal constant.
Fig 10. Behavior of EM for the two-mixture over-specified fit (2) with unknown weights where the true model is ). We consider two different initializations. Case 1 (unbalanced): When the initialization condition (81) is met (in particular we set
much smaller than
). In Case 2 (balanced), we initialize the weight parameter very close to
. Panel (a) characterizes the population EM updates and panel (b) depicts the statistical error with sample size n for the two cases. We see that when the condition (81) is met, EM converges in few steps within error
error and on the other hand when the initial weight is near
we observe a slow convergence of EM with a larger statistical error of order
Corollary 1. Consider the sample EM sequences with an initialization that satisfies the condition (81) for some
. Then for any fixed
with probability at least 1are universal constants.
Fig 11. Plots of the log-likelihood for the unbalanced and balanced fit for data generated from N(0, 1). (a) Behavior of population log-likelihood ) (computed using numerical integration) as a function of
for different weights
. (b) Behavior of sample log-likelihood
with n = 1000 samples for
. The plots in these panels portray a stark contrast in the shapes of the log-likelihood functions in the balanced and unbalanced case, it gets flatter around
concretely, in unbalanced case we see a quadratic type behavior (strongly concave); whereas in balanced case, the log-likelihood function is flatter and depicts a fourth degree polynomial type (weakly concave) behavior.
Fig 12. Plots of the population log-likelihood for the mixture of regression model for . We see that while the log-likelihood is clearly locally strongly concave around
7, and it is rather flat (and weakly concave) for the case of no signal
= 0. This flatness in log-likelihood results in a slower rate of algorithmic and statistical convergence of EM in this setting thereby providing further evidence of the usefulness of our analysis of EM.
for any iterate , with probability at least 1
Here,
denote universal constants.
[1] Balakrishnan, S., Wainwright, M. J. and Yu, B. (2017). Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics 45 77-120. (Cited on pages 3, 5, 8, 13, 16, 18, 19, and 25.)
[2] Bartlett, P. L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics 33 1497–1537. (Cited on pages 5 and 21.)
[3] Cai, T. T., Ma, J., Zhang, L. et al. (2019). CHIME: Clustering of high-dimensional Gaussian mixtures with EM algorithm and its optimality. The Annals of Statistics 47 1234–1267. (Cited on page 4.)
[4] Chen, J. (1995). Optimal rate of convergence for finite mixture models. The Annals of Statistics 23 221-233. (Cited on pages 3, 5, 15, 17, and 57.)
[5] Chen, Y. C. (2018). Statistical inference with local optima. arXiv preprint arXiv:1807.04431. (Cited on page 4.)
[6] Chen, J. and Li, P. (2009). Hypothesis test for normal mixture models: the EM approach. The Annals of Statistics 37 2523-2542. (Cited on page 29.)
[7] Daskalakis, C., Tzamos, C. and Zampetakis, M. (2017). Ten steps of EM suffice for mixtures of two Gaussians. In Proceedings of the 2017 Conference on Learning Theory. (Cited on pages 4, 5, 8, 16, and 25.)
[8] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1997). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 39 1-38. (Cited on page 3.)
[9] Dwivedi, R., Ho, N., Khamaru, K., Wainwright, M. J. and Jordan, M. I. (2018). Theoretical guarantees for EM under misspecified Gaussian mixture models. In NeurIPS 31. (Cited on page 4.)
[10] Dwivedi, R., Ho, N., Khamaru, K., Wainwright, M. J., Jordan, M. I. and Yu, B. (2019). Challenges with EM in application to weakly identifiable mixture models. arXiv preprint arXiv:1902.00194. (Cited on page 28.)
[11] Ghosal, S. and van der Vaart, A. (2001). Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics 29 1233-1263. (Cited on page 2.)
[12] Hao, B., Sun, W., Liu, Y. and Cheng, G. (2018). Simultaneous clustering and estimation of heterogeneous graphical models. Journal of Machine Learning Research 18 1 - 58. (Cited on page 4.)
[13] Heinrich, P. and Kahn, J. (2018). Strong identifiability and optimal minimax rates for finite mixture estimation. The Annals of Statistics 46 2844-2870. (Cited on page 3.)
[14] Inglot, T. (2010). Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics 30 339–351. (Cited on page 42.)
[15] Ishwaran, H., James, L. F. and Sun, J. (2001). Bayesian model selection in finite mixtures by marginal density decompositions. Journal of the American Statistical Association 96 1316-1332. (Cited on page 17.)
[16] Janson, S. (1997). Gaussian Hilbert Spaces 129. Cambridge University Press. (Cited on page 42.)
[17] Klusowski, J. M., Yang, D. and Brinda, W. (2019). Estimating the coefficients of a mixture of two linear regressions by expectation maximization. IEEE Transactions on Information Theory 65 3515–3524. (Cited on page 4.)
[18] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics 34 2593–2656. (Cited on pages 5 and 21.)
[19] Kumar, R. and Schmidt, M. (2017). Convergence rate of expectation-maximization.
10th NIPS Workshop on Optimization for Machine Learning. (Cited on pages 4 and 25.)
[20] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, New York, NY. (Cited on page 45.)
[21] Li, P., Chen, J. and Marriott, P. (2009). Non-finite Fisher information and homogeneity: an EM approach. Biometrika 96 411-426. (Cited on page 29.)
[22] Ma, J., Xu, L. and Jordan, M. I. (2000). Asymptotic convergence rate of the EM algorithm for Gaussian mixtures. Neural Computation 12 2881-2907. (Cited on page 3.)
[23] Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics 4 370-400. (Cited on pages 3, 17, and 57.)
[24] Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM review 26 195–239. (Cited on page 3.)
[25] Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 731-792. (Cited on page 2.)
[26] Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 689-710. (Cited on page 3.)
[27] Stephens, M. (2002). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 795-809. (Cited on page 2.)
[28] van de Geer, S. (2000). Empirical Processes in M-estimation. Cambridge University Press. (Cited on pages 2, 5, and 21.)
[29] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. (Cited on page 57.)
[30] van der Vaart, A. W. and Wellner, J. A. (2000). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York, NY. (Cited on pages 44 and 45.)
[31] Vershynin, R. (2011). Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027v7. (Cited on page 47.)
[32] Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, Cambridge, UK. (Cited on pages 21, 44, 45, and 46.)
[33] Wang, Z., Gu, Q., Ning, Y. and Liu, H. (2015). High-dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. In Advances in Neural Information Processing Systems 28. (Cited on page 4.)
[34] Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics 11 95-103. (Cited on page 3.)
[35] Xu, J., Hsu, D. and Maleki, A. (2016). Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems 29. (Cited on page 4.)
[36] Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM Algorithm for Gaussian mixtures. Neural Computation 8 129-151. (Cited on page 3.)
[37] Yan, B., Yin, M. and Sarkar, P. (2017). Convergence of gradient EM on multicomponent mixture of Gaussians. In Advances in Neural Information Processing Systems 30. (Cited on pages 4 and 5.)
[38] Yi, X. and Caramanis, C. (2015). Regularized EM algorithms: a unified framework and statistical guarantees. In Advances in Neural Information Processing Systems 28. (Cited on page 4.)