Singularity, Misspecification, and the Convergence Rate of EM

2018·Arxiv

Abstract

Abstract

SINGULARITY, MISSPECIFICATION, AND THE CONVERGENCE RATE OF EM

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Motivating simulations and problem set-up . . . . . . . . . . . . . 6 3 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4 New techniques for sharp analysis of sample EM . . . . . . . . . . 18 5 Generality of results and future work . . . . . . . . . . . . . . . . . 24 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 A Proofs of main results . . . . . . . . . . . . . . . . . . . . . . . . . 30 B Proofs of auxiliary lemmas . . . . . . . . . . . . . . . . . . . . . . 43 C Additional results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 D Unbalanced vs balanced ﬁts: Closer look at log-likelihood . . . . . 55 E Mixture of regression . . . . . . . . . . . . . . . . . . . . . . . . . . 57 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Author’s addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 1. Introduction. The growth in the size and scope of modern data sets has presented the ﬁeld of statistics with a number of challenges, one of them being how to deal with various forms of heterogeneity. Mixture models provide a principled approach to modeling heterogeneous collections of data (that are usually assumed i.i.d.). In practice, it is frequently the case that the number of mixture components in the ﬁtted model does not match the number of mixture components in the data-generating mechanism. It is known that such mismatch can lead to substantially slower convergence rates for the maximum likelihood estimate (MLE) for the underlying parameters. In contrast, relatively less attention has been paid to the computational implications of this mismatch. In particular, the algorithm of choice for ﬁtting ﬁnite mixture models is the Expectation-Maximization (EM) algorithm, a general framework that encompasses various types of divide-and-conquer computational strategies. The goal of this paper is to gain a fundamental understanding of the behavior of EM when used to ﬁt over-speciﬁed mixture models.

Statistical issues with over-specification. While density estimation in fi-

nite mixture models is relatively well understood [11, 28], characterizing the behavior of maximum likelhood for parameter estimation has remained challenging. The main diﬃculty for analyzing the MLE in such settings arises from label switching between the mixtures [25, 27], and lack of strong concavity in the likelihood. Such issues do not interfere with density estimation, since the standard divergence measures like the Kullback-Leibler and Hellinger distances remain invariant under permutations of labels, and

strong concavity is not required. An important contribution to the understanding of parameter estimation in ﬁnite mixture models was made by Chen [4]. He considered a class of over-speciﬁed ﬁnite mixture models; here the term “over-speciﬁed” means that the model to be ﬁt has more mixture components than the distribution generating the data. In an interesting contrast to the usual n− 12 convergence rate for the MLE based on n samples, Chen showed that for estimating scalar location parameters in a certain class of over-speciﬁed ﬁnite mixture models, the corresponding rate slows down to n− 14 . This theoretical result has practical signiﬁcance, since methods that over-specify the number of mixtures are often more feasible than methods that ﬁrst attempt to estimate the number of components, and then estimate the parameters using the estimated number of components [26]. In subsequent work, Nguyen [23] and Heinrich et al. [13] have characterized the (minimax) convergence rates of parameter estimation rates for mixture models in both exactly-ﬁtted or over-speciﬁed settings in terms of the Wasserstein distance.

Computational concerns with mixture models. While the papers discussed

above address the statistical behavior of a global maximum of the log-likelihood, they do not consider the associated computational issues of obtaining such a maximum. In general settings, non-convexity of the log-likelihood makes it impossible to guarantee that the iterative algorithms used in practice converge to the global optimum, or equivalently the MLE. Perhaps the most widely used algorithm for computing the MLE is the expectation-maximization (EM) algorithm [8]. Early work on the EM algorithm [34] showed that its iterates converge asymptotically to a local maximum of the log-likelihood function for a broad class of incomplete data models; this general class includes the ﬁtting of mixture models as a special case. The EM algorithm has also been studied in the speciﬁc setting of Gaussian mixture models; here we ﬁnd results both for the population EM algorithm, which is the idealized version of EM based on an inﬁnite sample size, as well as the usual sample-based EM algorithm that is used in practice. For Gaussian mixture models, the population EM algorithm is known to exhibit various convergence rates, ranging from linear to superlinear (quasi-Newton like) convergence if the overlap between the mixture components tends to zero [22, 36]. It has also been noted in several papers [22, 24] that the convergence of EM can be prohibitively slow when the mixtures are not well separated. Prior work on EM. Balakrishnan et al. [1] laid out a general theoretical framework for analysis of the EM algorithm, and in particular how to prove

non-asymptotic bounds on the Euclidean distance between sample-based EM iterates and the true parameter. When applied to the special case of two-component Gaussian location mixtures, assumed to be well-speciﬁed and suitably separated, their theory guarantees that (1) population EM updates enjoy a geometric rate of convergence to the true parameter when initialized in a suﬃciently small neighborhood around the truth, and (2) sample-based EM updates converge to an estimate at Euclidean distance of order (d/n)12 , based on n i.i.d. draws from a ﬁnite mixture model in Rd. Further work in this vein has characterized the behavior of EM in a variety of settings for two Gaussian mixtures, including convergence analysis with additional sparsity constraints [12, 33, 38], global convergence of population EM [35], guarantees of geometric convergence under less restrictive conditions on the two mixture components [7, 17], analysis of EM with unknown mixture weights, means and covariances for two mixtures [3], and the analysis of EM to more than two Gaussian components [12, 37]. Other related work has provided optimization-theoretic guarantees for EM by viewing it in a generalized surrogate function framework [19], and analyzed the statistical properties of conﬁdence intervals based on an EM estimator [5]. An assumption common to all of this previous work is that there is no misspeciﬁcation in the ﬁtting of the Gaussian mixtures; in particular, it is assumed that the data is generated from a mixture model with the same number of components as the ﬁtted model. A portion of our recent work [9] has shown that EM retains its fast convergence behavior—albeit to a biased estimate—in under-speciﬁed settings where the number of components in the ﬁtted model are less than that in the true model. However, as noted above, in practice, it is most common to use over-speciﬁed mixture models. For these reasons, it is desirable to understand how the EM algorithm behaves in the over-speciﬁed settings. Our contributions. The goal of this paper is to shed some light on the non-asymptotic performance of the EM algorithm for over-speciﬁed mixtures. We provide a comprehensive study of over-speciﬁed mixture models when ﬁt to a particularly simple (non-mixture) data-generating mechanism; a multivariate normal distribution N(0, σ2Id) in d dimensions with known scale parameter σ > 0. This setting, despite its simplicity, suﬃces to reveal some rather interesting properties of EM in the over-speciﬁed context. In particular, we obtain the following results. • Two-mixture unbalanced ﬁt: For our ﬁrst model class, we study a mixture of two location-Gaussian distributions with unknown location, known variance and known unequal weights for the two components.

For this case, we establish that the population EM updates converge at a geometric rate to the true parameter; as an immediate consequence, the sample-based EM algorithm converges in O (log(n/d)) steps to a ball of radius (d/n)12 . The fast convergence rate of EM under the unbalanced setting provides an antidote to the pessimistic belief that statistical estimators generically exhibit slow convergence for over-speciﬁed mixtures. • Two-mixture balanced ﬁt: In the balanced version of the problem in which the mixture weights are equal to 12 for both components, we ﬁnd that the EM algorithm behaves very diﬀerently. Beginning with the population version of the EM algorithm, we show that it converges to the true parameter from an arbitrary initialization. However, the rate of convergence varies as a function of the distance of the current iterate from the true parameter value, becoming exponentially slower as the iterates approach the true parameter. This behavior is in sharp contrast to well-speciﬁed settings [1, 7, 37], where the population updates converge at a geometric rate. We also show that our rates for population EM are tight. By combining the slow convergence of population EM with a novel localization argument, one involving the empirical process restricted to an annulus, we show that the sample-based EM iterates converge to a ball of radius (d/n)14 around the true parameter after O((n/d)12 ) steps. The n− 14 component of the Euclidean error matches known guarantees for the global maximum of the MLE [4]. The localization argument in our analysis is of independent interest, because such techniques are not required in analyzing the EM algorithm in well-speciﬁed settings when the population updates are globally contractive. We note that ball-based localization methods are known to be essential in deriving sharp statistical rates for M-estimators (e.g., [2, 18, 28]); to the best of our knowledge, the use of an annulus-based localization argument in analyzing an algorithm is novel. Moreover, we show via extensive numerical experiments that the fast convergence of EM for the unbalanced ﬁt is a special case; and that the slow behavior of EM proven for the balanced ﬁt (in particular the rate of order n− 14 ) arises in several general (including more than two components) over-speciﬁed Gaussian mixtures with known variance, known or unknown weights, and unknown location parameters. Organization. The remainder of the paper is organized as follows. In Section 2 we provide illustrative simulations of EM in diﬀerent settings in or-

der to motivate the settings analyzed later in the paper. We then provide a thorough analysis of the convergence rates of EM when over-ﬁtting Gaussian data with two components in Section 3 and the key ideas of the novel proof techniques in Section 4. We provide a thorough discussion of our results in Section 5, exploring their general applicability, and presenting further simulations that substantiate the value of our theoretical framework. Detailed proofs of our results and discussion of certain additional technical aspects of our results are provided in the appendix. Notation. For any two sequences an and bn, the notation an ≾ bn or an = O (bn) means that an ≤ Cbn for some universal constant C. Similarly, the notation an ≍ bn or an = Θ(bn) denotes that both the conditions, an ≾ bn and bn ≾ an, hold. Throughout this paper, π denotes a variable and π denotes the mathematical constant “pi”. Experimental settings. We summarize a few common aspects of the numerical experiments presented in the paper. Population-level computations were done using numerical integration on a suﬃciently ﬁne grid. With ﬁnite samples, the stopping criteria for the convergence of EM were: (1) the change in the iterates was small enough, or (2) the number of iterations was too large (greater than 100, 000). Experiments were averaged over several repetitions (ranging from 25 to 400). In majority of the runs, for each case, criteria (1) led to convergence. In our plots for sample EM, we report �me + 2�se on the y-axis, where �me, �se respectively denote the mean and standard deviation across the experiments for the metric under consideration, e.g., the parameter estimation error. Furthermore, whenever a slope is provided, it is the slope for the least-squares ﬁt on the log-log scale for the quantity on y-axis when ﬁtted with the quantity reported on the x-axis. For instance, in Figure 1(b), we plot |�θn − θ∗| on the y-axis value versus the sample size n on the x-axis, averaged over 400 experiments, accounting for the deviation across these experiments. Furthermore, the green dotted line with legend π = 0.3 and the corresponding slope −0.48 denote the least-squares ﬁt and the respective slope for log |�θn − θ∗| (green solid dots) with log n for the experiments corresponding to the setting π = 0.3.

explore a wide range of behavior demonstrated by EM for certain settings of over-speciﬁed location Gaussian mixtures. We begin with several simulations that illustrate fast and slow convergence of EM for various settings, and serve as a motivation for the theoretical results derived later in the paper. We provide basic background on EM in Section 2.3, and describe the problems to be tackled.

2.1. Problem set-up. Let φ(·; µ, Σ) denote the density of a Gaussian random vector with mean µ and covariance Σ. Consider the two component Gaussian mixture model with density

Given n samples from the distribution (1), suppose that we use the EM algorithm to ﬁt a two-component location Gaussian mixture with ﬁxed weights and variance1 and special structure on the location parameters—more precisely, we ﬁt the model with density

using the EM algorithm, and take the solution2 as an estimate of θ∗. An important aspect of the problem at hand is the signal strength, which is measured as the separation between the means of mixture components relative to the spread in the components. For the model (1), the signal strength is given by the ratio ∥θ∗∥2 /σ. When this ratio is large, we refer to it as the strong signal case; otherwise, it corresponds to the weak signal case. Of particular interest to us is the behavior of EM in the limit of weak signal when there is no separation; i.e., ∥θ∗∥2 = 0. For such cases, we call the ﬁt (2)

the setting of θ∗ = 0 corresponds to the simplest case of over-speciﬁed ﬁt, since the true model has just one component (standard normal distribution irrespective of the parameter π) but the ﬁtted model has two (one extra) component (unless the EM estimate is also 0). We now present the empirical behavior of EM for these models and defer the derivation of EM updates to Section 2.3.

with a numerical study of the eﬀect of separation among the mixtures on the statistical behavior of the estimates returned by EM. Our main observation is that weak or no separation leads to relatively low accuracy estimates. Additional simulations for more general mixtures, including more than two components, are provided in Section 5.3. Next, via numerical integration on a grid with suﬃciently small discretization width, we simulate the behavior of the population EM algorithm width—an idealized version of EM in the

limit of inﬁnite samples—in order to understand the eﬀect of signal strength on EM’s algorithmic rate of convergence, i.e., the number of steps needed for population EM to converge to a desired accuracy. We observe a slow down of EM on the algorithmic front when the signal strength approaches zero.

simulation results for data generated from the model (1) in dimension d = 1 and noise variance σ2 = 1, and for three diﬀerent values of the weight π ∈ {0.1, 0.3, 0.5}. In all cases, we ﬁt a two-location Gaussian mixture with ﬁxed weights and variance as speciﬁed by equation (2). The two panels show the estimation error of the EM solution as a function of n for two distinct cases of the data-generating mechanism: (a) in the strong signal case, we set θ∗ = 5 so that the data has two well-separated mixture components, and (b) to obtain the limiting case of no signal, we set θ∗ = 0, so that the two mixture components in the data-generating distribution collapse to one, and we are simply ﬁtting the data from a standard normal distribution. In the strong signal case, it is well known [1, 7] that EM solutions have an estimation error (measured by the Euclidean distance between the EM estimate and the true parameter θ∗) that achieves the classical (parametric) rate n− 12 ; the empirical results in Figure 1(a) are simply a conﬁrmation of these theoretical predictions. More interesting is the case of no signal (which is the limiting case with weak signal), where the simulation results shown in panel (b) of Figure 1 reveal a diﬀerent story. In this case, whereas the EM solution (with random standard normal initialization) has an error that decays as n− 12 when π ̸= 1/2, its error decays at the considerably slower rate n− 14 when π = 1/2. We return to these cases in further detail in Section 3.

the sample EM algorithm in the “no signal” case motivated us to examine the behavior of population EM for this case. To be clear, while sample EM is the practical algorithm that can actually be applied, it can be insightful for theoretical purposes to ﬁrst analyze the convergence of the population EM updates, and then leverage these ﬁndings to understand the behavior of sample EM [1]. Our analysis follows a similar road-map. Interestingly, for the case with θ∗ = 0, the population EM algorithm behaves signiﬁcantly diﬀerently for the unbalanced ﬁt (π ̸= 12) as compared to the balanced ﬁt

EM iterate θt to the true parameter value, θ∗ = 0, on the vertical axis, versus the iteration number t on the horizontal axis. With the vertical axis on a log scale, a geometric convergence rate of the algorithm shows up as a negatively

Fig 1. Plots of the error in the EM solution versus the sample size n, focusing on the effect of signal strength on EM solution accuracy. The true data distribution is given by 1) and we use EM to fit the model 1), generating the EM estimate samples. (a) When the signal is strong, the estimation rate decays at the parametric rate , as revealed by the 2 slope in a leastsquare fit of the log error based on the log sample size log n. (b) When there is no signal (= 0), then depending on the choice of weight in the fitted model, we observe two distinct scalings for the error: and, 5, again as revealed by least-squares fits of the log error using the log sample size log n.

sloped line (disregarding transient eﬀects in the ﬁrst few iterations). For the unbalanced mixtures in panel (a), we see that EM converges geometrically quickly, although the rate of convergence (corresponding to the slope of the line) tends towards zero as the mixture weight π tends towards 1/2 from below. For π = 1/2, we obtain a balanced mixture, and, as shown in the plot in panel (b), the convergence rate is now sub-geometric. In fact, the behavior of the iterates is extremely well characterized by the recursion θ �→ θ1+θ2 . The theory to follow provides a precise characterization of the behavior seen in Figures 1(b) and 2. Furthermore, in Section 5, we provide further support for relevannce of our theoretical results in explaining the behavior of EM for other classes of over-speciﬁed models, including Gaussian mixture models with unknown weights as well as mixtures of linear regressions. 2.3. EM updates for the model ﬁt (2). In this section, we provide a quick introduction to the EM updates. Readers familiar with the literature can skip directly to the main results in Section 3. Recall that the two-component

Fig 2. Behavior of the (numerically computed) population EM updates (8) when the underlying data distribution is N(0, 1). (a) Unbalanced mixture fits (2) with weights (): We observe geometric convergence towards = 0 for all 5 although the rate of convergence gets slower as 5. (b) Balanced mixture fits (2) with weights (0.5, 0.5): We observe two phases of convergence. First, EM quickly converges to ball of constant radius and then it exhibits slow convergence towards = 0. Indeed, we see that during the slow convergence, the population EM updates track the curve given by ) very closely, as predicted by our theory.

model ﬁt is based on the density

From now on we assume that the data is drawn from the zero-mean Gaussian distribution N(0, σ2Id). Note that the model ﬁt described above contains the true model with θ∗ = 0 and it is referred to as an over-speciﬁed ﬁt since for any non-zero θ, the ﬁtted model has two components. The maximum likelihood estimate is obtained by solving the following optimization problem �θMLEn ∈ arg maxθ∈Θ 1n

�log(πφ(xi; θ, σ2Id) + (1 − π)φ(xi; −θ, σ2Id))�.(4) In general, there is no closed-form expression for �θMLEn . The EM algorithm circumvents this problem via a minorization-maximization scheme. Indeed, population EM is a surrogate method to compute the maximizer of the population log-likelihood L(θ) := EX�log(πφ(X; θ, σ2Id) + (1 − π)φ(X; −θ, σ2Id)�,(5)

where the expectation is taken over the true distribution. On the other hand, sample EM attempts to estimate �θMLEn . We now describe the expressions for both the sample and population EM updates for the model-ﬁt (3). Given any point θ, the EM algorithm proceeds in two steps: (1) compute a surrogate function Q(·; θ) such that Q(θ′; θ) ≤ L(θ′) and Q(θ; θ) = L(θ); and (2) compute the maximizer of Q(θ′; θ) with respect to θ′. These steps are referred to as the E-step and the M-step, respectively. In the case of two-component location Gaussian mixtures, it is useful to describe a hidden variable representation of the mixture model. Consider a binary indicator variable Z ∈ {0, 1} with the marginal distribution P(Z = 1) = π and P(Z = 0) = 1 − π, and deﬁne the conditional distributions (X | Z = 0) ∼ N(−θ, σ2Id), and (X | Z = 1) ∼ N(θ, σ2Id). These marginal and conditional distributions deﬁne a joint distribution over the pair (X, Z), and by construction, the induced marginal distribution over X is a Gaussian mixture of the form (3). For EM, we ﬁrst compute the conditional probability of Z = 1 given X = x: wθ(x) = wπθ (x) := π exp�− ∥θ−x∥222σ2 � π exp�− ∥θ−x∥222σ2 �+ (1 − π) exp�− ∥θ+x∥222σ2 �.(6) Then, given a vector θ, the E-step in the population EM algorithm involves computing the minorization function θ′ �→ Q(θ′, θ). Doing so is equivalent to computing the expectation Q(θ′; θ) = −12E�wθ(X)��X − θ′��22 + (1 − wθ(X))��X + θ′��22�,(7) where the expectation is taken over the true distribution (here N(0, σ2Id). In the M-step, we maximize the function θ′ �→ Q(θ′; θ). Doing so deﬁnes a mapping M : Rd → Rd, known as the population EM operator, given by M(θ) = arg maxθ′∈Rd Q(θ′, θ) = E�(2wθ(X) − 1)X�.(8) In this deﬁnition, the second equality follows by computing the gradient ∇θ′Q, and setting it to zero. In summary, for the two-component location mixtures considered in this paper, the population EM algorithm is deﬁned by the sequence θt+1 = M(θt), where the operator M is deﬁned in equation (8). We obtain the sample EM update by simply replacing the expectation E in equations (7) and (8) by the empirical average based on an observed set

of samples. In particular, given a set of i.i.d. samples {Xi}ni=1, the sample EM operator Mn : Rd �→ Rd takes the form Mn(θ) := 1n

(2wθ(Xi) − 1)Xi.(9) Overall, the sample EM algorithm generates the sequence of iterates given by θt+1 = Mn(θt). In the sequel, we study the convergence of EM both for the population EM algorithm in which the updates are given by θt+1 = M(θt), and the sample-based EM sequence given by θt+1 = Mn(θt). With this notation in place, we now turn to the main results of this paper. 3. Main results. In this section, we state our main results for the convergence rates of the EM updates under the unbalanced and balanced mixture ﬁt. We start with the easier case of unbalanced mixture ﬁt in Section 3.1 followed by the more delicate (and interesting) case of the balanced ﬁt in Section 3.2.

acterization of both the population and sample-based EM updates in the setting of unbalanced mixtures. In particular, we assume that the ﬁtted two-components mixture model (3) has known weights π and 1 − π, where π ∈ (0, 1/2). The following result characterizes the behavior of the EM updates for this set-up.

Theorem 1. Suppose that we fit an unbalanced instance (i.e., of the mixture model (3) to data. Then:

(a) The population EM operator (8) is globally strictly contractive, meaning that

where ρ := |1 − 2π| ∈ (0, 1).

(b) There are universal constants such that given any and a sample size , the sample EM sequence generated by the update (9) satisfies the upper bound

�1 − ρ22

d + log(1/δ)

,(10b)

See Appendix A.1 for the proof of this theorem. Fast convergence of population EM. The bulk of the eﬀort in proving Theorem 1 lies in establishing the guarantee (10a) for the population EM iterates. Such a contraction bound immediately yields the exponential fast convergence of the population EM updates θt+1 = M(θt) to θ∗ = 0: ��θT ��2 ≤ ϵ for T ≥ 1log 1(1−ρ2/2)· log

.(11) Since the mixture weights (π, 1 − π) are bounded away from 1/2, we have that ρ = |1 − 2π| is bounded away from zero, and thus population EM iterates converge in O (log(1/ϵ)) steps to an ϵ-ball around θ∗ = 0. This result is equivalent to showing that in the unbalanced instance (π ̸= 1/2), the log-likelihood is strongly concave around the true parameter. Statistical rate of sample EM. Once the bounds (10a) and (11)) have been established, the proof of the statistical rate (10b) for sample EM utilizes the scheme laid out by Balakrishnan et al. [1]. In particular, we prove a non-asymptotic uniform law of large numbers (Lemma 1 stated in Section 4.1) that allows for the translation from population to sample EM iterates. Roughly speaking, Lemma 1 guarantees that for any radius r > 0, tolerance δ ∈ (0, 1), and suﬃciently large n, we have

sup

d + log(1/δ)

≥ 1 − δ.(12) This bound, when combined with the contractive behavior (10a) or equivalently the exponentially fast convergence (11) of the population EM iterates allows us to establish the stated bound (10b). (See, e.g., Theorem 2 in the paper [1].) Putting the pieces together, we conclude that the sample EM updates converge to an estimate of θ∗—that has Euclidean error of the order (d/n)12 — after a relatively small number of steps that are of the order log(n/d). Note that this theoretical prediction is veriﬁed by the simulation study in Figure 1(b) for the univariate setting (d = 1) of the unbalanced mixture-ﬁt. In Figure 3, we present the scaling of the radius of the ﬁnal EM iterate3 with respect to the sample size n and the dimension d, averaged over 400 runs of sample EM for various settings of (n, d). Linear ﬁts on the log-log scale in these simulations suggest a rate close to (d/n)12 as claimed in Theorem 1.

Fig 3. Scaling of the Euclidean error for EM estimates

bound of the order (d/n) on the EM solution is sharp in terms of n and d.

Remark. We make two comments in passing. First, the value of��θ0��2 in the convergence rate of sample EM updates in Theorem 1 can be assumed to be of constant order; this assumption stems from the fact the population EM operator maps any θ0 to a vector with norm smaller than�2/π (cf. Lemma 5 in Appendix C.1). Second, when the weight parameter π is assumed to be unknown in the model ﬁt (3), the EM algorithm exhibits fast convergence when π is initialized suﬃciently away from 12; see Section 5.1 for more details. From unbalanced to balanced ﬁt. The bound (11) shows that the extent of unbalancedness in the mixture weights plays a crucial role in the geometric rate of convergence for the population EM. When the mixtures become more balanced, that is, weight π approaches 1/2 or equivalently ρ approaches zero, the number of steps T required to achieve ϵ-accuracy scales as O�log(��θ0��2 /ϵ)/ρ2�and in the limit ρ → 0, this bound degenerates to ∞ for any ﬁnite ϵ. Indeed, the bound (10a) from Theorem 1 simply states that the population EM operator is non-expansive for balanced mixtures (ρ = 0), and does not provide any particular rate of convergence for this case. It turns out that the EM algorithm is worse in the balanced case, both in terms of the optimization speed and in terms of the statistical rate. This slower sta-

tistical rate is in accord with existing results for the MLE in over-speciﬁed mixture models [4]; the novel contribution here is the rigorous analysis of the analogous behavior for the EM algorithm.

vide a sharp characterization of the algorithmic rate of convergence of the population EM update for the balanced ﬁt (see Section 3.2.1). We then provide sharp bound for the statistical rate for the sample EM updates (cf. Section 3.2.2).

of the population EM operator for the balanced ﬁt. We show that it is globally convergent, albeit with a contraction parameter that depends on θ, and degrades towards 1 as ∥θ∥2 → 0. Our statement involves the constant

normal variate. (Note that p < 1.)

Theorem 2. Suppose that we fit a balanced instance (mixture model (3) to data. Then the population EM operator (8) has the following properties:

∥θ∥2 ≤ γup(θ) := 1 − p +

1 + ∥θ∥222σ2 < 1.(13a)

∥θ∥2 ≥ γlow(θ) := 1 1 + 2∥θ∥22σ2 .(13b) See Appendix A.2 for the proof of Theorem 2. The salient feature of Theorem 2 is that the contraction coeﬃcient γup(θ) is not globally bounded away from 1 and in fact satisﬁes limθ→0 γup(θ) = 1. In conjunction with the lower bound (13b), we see that

for small ∥θ∥2.(14) This precise contraction behavior of the population EM operator is in accord with that of the simulation study in Figure 2(b).

The preceding results show that the population EM updates should exhibit two phases of behavior. In the ﬁrst phase, up to a relatively coarse accuracy of the order σ, the iterates exhibit geometric convergence. Concretely, we are guaranteed to have��θT0��2 ≤√2σ after running the algorithm for T0 :=log(∥θ0∥22/(2σ2))log(2/(2−p)) steps. In the second phase, as the error decreases from √2σ to a given ϵ ∈�0,√2σ�, the convergence rate becomes sub-geometric; concretely, we have ��θT0+t��2 ≤ ϵ for t ≥ cσ2ϵ2 log(σ/ϵ).(15) Note that the conclusion (15) shows that for small enough ϵ, the population EM takes Θ(log(1/ϵ)/ϵ2) steps to ﬁnd ϵ-accurate estimate of θ∗ = 0. This rate is extremely slow compared to the geometric rate O(log(1/ϵ)) derived for the unbalanced mixtures in Theorem 1. Hence, the slow rate establishes a qualitative diﬀerence in the behavior of the EM algorithm between the balanced and unbalanced setting. Moreover, the sub-geometric rate of EM in the balanced case is also in stark contrast with the favorable behavior of EM for the exact-ﬁtted settings analyzed in past work. Balakrishnan et al. [1] showed that when the EM algorithm is used to ﬁt a two-component Gaussian mixture with suﬃ-ciently large value of ∥θ⋆∥2σ (known as the high signal-to-noise ratio, or high SNR for short), the population EM operator is contractive, and hence geometrically convergent, within a neighborhood of the true parameter θ∗. In a later work on the two-component balanced mixture ﬁt model, Daskalakis et al. [7] showed that the convergence is in fact geometric for any non-zero value of the SNR. The model considered in Theorem 2 can be seen as the limiting case of weak signal for a two mixture model—which degenerates to the Gaussian distribution when the SNR becomes exactly zero. For such a limit, we observe that the fast convergence of population EM sequence no longer holds.

ments of upper and lower bounds on the rate of the sample EM iterates for the balanced ﬁt on Gaussian data. We begin with an upper bound, which involves the previously deﬁned function γup(θ) := 1 − p + p/�1 + ∥θ∥222σ2�.

scalars , any sample size + log(log(1and any iterate number

,(16)

with probability at least 1

See Section 4 for a discussion of the techniques employed to prove this theorem. The detailed proof is provided in Appendix A.3, where we also provide some more details on the deﬁnitions of these constants. As we show in our proofs, once the iteration number t satisﬁes the lower bound stated in the theorem, the second term on the right-hand side of the bound (16) dominates the ﬁrst term; therefore, from this point onwards, the the sample EM iterates have Euclidean norm of the order (d/n)14 −α. Note

that ) can be chosen arbitrarily close to zero, so at the expense of

increasing the lower bound on the number of iterations t by a logarithmic factor log(1/α), we can obtain rates arbitrarily close to (d/n)14 . We note that earlier studies of parameter estimation for over-speciﬁed mixtures, in both the frequentist [4] and Bayesian settings [15, 23], have derived a rate of n− 14 for the global maximum of the log likelihood. To the best of our knowledge, Theorem 3 is the ﬁrst non-asymptotic algorithmic result that shows that such rates apply to the ﬁxed points and dynamics of the EM algorithm, which need not converge to the global optima. The preceding discussion was devoted to an upper bound on sample EM for the balanced ﬁt. Let us now match this upper bound, at least in the univariate case d = 1, by showing that any non-zero ﬁxed point of the sample EM updates has Euclidean norm of the order n− 14 . In particular, we prove the following lower bound.

Theorem 4. There are universal positive constants such that for any non-zero solution to the sample EM fixed-point equation for the balanced mixture fit, we have

See Appendix A.4 for the proof of this theorem. Since the iterative EM scheme converges only to one of its ﬁxed points, the theorem shows that one cannot obtain a high-probability bound for any radius smaller than n− 14 . As a consequence, with constant probability, the

radius of convergence n− 14 for sample EM convergence in Theorem 3 for the univariate setting is tight.

tion, we highlight the new proof techniques introduced in this work that are required to obtain the sharp characterization of the sample EM updates in the balanced case (Theorem 3). We begin in Section 4.1 by elaborating that a direct application of the previous frameworks leads to sub-optimal statistical rates for sample EM in the balanced ﬁt. This sub-optimality motivates the development of new methods for analyzing the behavior of the sample EM iterates, based on an annulus-based localization argument over a sequence of epochs, which we sketch out in Sections 4.2 and 4.3. We remark that our novel techniques, introduced here for analyzing EM with the balanced ﬁt, are likely to be of independent interest. We believe that they can potentially be extended to derive sharp statistical rates in other settings when the algorithm under consideration does not exhibit an geometrically fast convergence. 4.1. A sub-optimal guarantee. Let us recall the set-up for the procedure suggested by Balakrishnan et al. [1], specializing to the case where the true parameter θ∗ = 0, as in our speciﬁc set-up. Using the triangle inequality, the norm of the sample EM iterates θt+1 = Mn(θt) can be upper bounded by a sum of two terms as follows:

for all t ≥ 0. The ﬁrst term on the right-hand side corresponds to the deviations between the sample and population EM operators, and can be controlled via empirical process theory. The second term corresponds to the behavior of the (deterministic) population EM operator, as applied to the sample EM iterate θt, and needs to be controlled via a result on population EM. Theorem 2 from Balakrishnan et al. [1] is based on imposing generic conditions on each of these two terms, and then using them to derive a generic bound on the sample EM iterates. In the current context, their theorem can be summarized as follows. For given tolerances δ ∈ (0, 1), ϵ > 0 and starting radius r > 0, suppose that there exists a function ε(n, δ) > 0, decreasing in terms of the sample size n, and a contraction coeﬃcient κ ∈ (0, 1) such that sup

∥θ∥2 ≤κ and P

sup

≥1−δ.(19a)

Then for a sample size n suﬃciently large and ϵ suﬃciently small to ensure that

the sample EM iterates are guaranteed to converge to a ball of radius ε(n, δ)/(1 − κ) around the true parameter θ∗ = 0. In order to apply this theorem to the current setting, we need to specify a choice of ε(n, δ) for which the bound on the empirical process holds. The following auxiliary result provides such control for us:

for any positive radius r, any threshold , and any sample size

sup

d + log(1/δ)

≥ 1 − δ,(20)

where denotes the imbalance in the mixture fit (3).

The proof of this lemma is based on Rademacher complexity arguments; see Appendix B.1 for the details.

ity in line (19a) holds with ε(n, δ) ≲ σ2 ��θ0��2�d/n. On the other hand, Theorem 2 implies that for any θ such that ∥θ∥2 ≥ ϵ, we have that population EM is contractive with parameter bounded above by κ(ϵ) ≍ 1 − ϵ2. In order to satisfy inequality (i) in equation (19b), we solve the equation ε(n, δ)/(1 − κ(ϵ)) = ϵ. Tracking only the dependency on d and n, we ob-tain4

ϵ2 = ϵ =⇒ ϵ = O�(d/n)16�,(21) which shows that the Euclidean norm of the sample EM iterate is bounded by a term of order (d/n)16 . While this rate is much slower than the classical (d/n)12 rate that we established in the unbalanced case, it does not coincide with the n− 14 rate that we obtained in Figure 1(b) for balanced setting with d = 1. Thus, the proof technique based on the framework of Balakrishnan et al. [1] appears

to be non-optimal. The sub-optimality of this approach necessitates the development of a more reﬁned technique. Before sketching this technique, we now quantify empirically the convergence rate of sample EM in terms of both dimension d and sample size n for the balanced mixture ﬁt. In Figure 4, we summarize the results of these experiments. The two panels in the ﬁgure exhibit that the error in the sample EM estimate scales as (d/n)14 , thereby providing further numerical evidence that the preceding approach indeed led to a sub-optimal result.

Fig 4. Scaling of the Euclidean error for EM estimates computed using the balanced () mixture-fit (2). Here the true data distribution is denotes the EM iterate upon convergence when we fit a balanced mixture with n samples in d dimensions. (a) Scaling with respect to . (b) Scaling with respect to . We ran experiments for several other pairs of (n, d) and the conclusions were the same. Clearly, the empirical results suggest a scaling of order (for the final iterate of sample-based EM.

why the preceding argument led to a sub-optimal bound. In brief, its “oneshot” nature contains two major deﬁciencies. First, the tolerance parameter ϵ is used both (a) for measuring the contractivity of the updates, as in the ﬁrst inequality in equation (19a), and (b) for determining the ﬁnal accuracy that we achieve. At earlier phases of the iteration, the algorithm will converge more quickly than suggested by the worst-case analysis based on the ﬁnal accuracy. A second deﬁciency is that the argument uses the radius r only once, setting it to a constant to reﬂect the initialization θ0 at the start of the algorithm. This means that we failed to “localize” our bound on the empirical process in Lemma 1. At later iterations of the algorithm, the norm

Fig 5. Illustration of the annulus-based-localization argument part (I): Defining the epochs or equivalently the annuli. (a) Outer radius for the -th epoch is given by (tracking dependency only on n). (b) For any given epoch , we analyze the behavior of the EM sequence when lies in the annulus around with inner and outer radii given by , respectively. We prove that EM iterates move from one epoch to the next epoch (e.g. epoch + 1) after at most erations. Given the definition of , we see that the inner and outer radii of the aforementioned annulus converges linearly to . Consequently, after at most log(1) epochs (or ) iterations), the EM iterate lies in a ball of radius . We illustrate the one-step dynamics in any given annulus in Figure 6.

controlled. We note that ideas of localizing the radius r for an empirical process plays a crucial role in obtaining sharp bounds on the error of Mestimation procedures [2, 18, 28, 32]. A novel aspect of the localization argument in our setting is the use of an annulus instead of a ball. In particular, we analyze the iterates from the EM algorithm assuming that they lie within a pre-specicied annulus, deﬁned by an inner and an outer radius. On one hand, the outer radius of the annulus helps to provide a sharp control on the perturbation bounds between the population and sample operators. On the other hand, the inner radius of the annulus is used to tightly control the algorithmic rate of convergence. We now summarize our key arguments. The entire sequence of sample EM iterations is broken up into a sequence of diﬀerent epochs. During each epoch, we localize the EM iterates to an annulus. In more detail: • We index epochs by the integer ℓ = 0, 1, 2, . . ., and associate them with a sequence {αℓ}ℓ≥0 of scalars in the interval [0, 14]. The input to epoch ℓ is the scalar αℓ, and the output from epoch ℓ is the scalar αℓ+1.

Fig 6. Illustration of the annulus-based-localization argument part (II): Dynamics of EM in the -th epoch or equivalently the annulus For a given epoch , we analyze the behavior of the EM sequence lies in the annulus with inner and outer radii given by , respectively. In this epoch, the population EM operator ) contracts with a contraction coefficient that depends on , which is the inner radius of the disc, while the perturbation error between the sample and population EM operators depends on , which is the outer radius of the disc. Overall, we prove that is non-expansive and after at most steps, the sample EM updates move from epoch

• The ℓ-th epoch is deﬁned to be the set of all iterations t of the sample EM algorithm such that the sample EM iterate θt lies in the following annulus:

We establish that the sample-EM operator is non-expansive so that each epoch is well-deﬁned (and that subsequent iterations can only correspond to subsequent epochs). • Upon completion of epoch ℓ at iteration Tℓ, the EM algorithm returns an estimate θTℓ such that ∥θTℓ∥2 ≾ (d/n)αℓ+1, where αℓ+1 = 13αℓ + 16.(23) Note that the new scalar αℓ+1 serves as the input to epoch ℓ + 1. The recursion (23) is crucial in our analysis: it tracks the evolution of the exponent acting upon the ratio d/n, and the rate (d/n)αℓ+1 is the bound

on the Euclidean norm of the sample EM iterates achieved at the end of epoch ℓ. A few properties of the recursion (23) are worth noting. First, given our initialization α0 = 0, we see that α1 = 16, which agrees with the outcome of our one-step analysis from above. Second, as the recursion is iterated, it converges from below to the ﬁxed point α∗ = 14. Thus, our argument will allow us to prove a bound arbitrarily close to (d/n)14 , as stated formally in Theorem 3 to follow. Refer to Figures 5 and 6 for an illustration of the deﬁnition of these annuli, epochs and the associated conclusions.

the key recursion (23) arises. Consider epoch ℓ speciﬁed by input αℓ < 14, and consider an iterate θt in the following annulus:��θt��2 ∈ [(d/n)αℓ+1, (d/n)αℓ]. We begin by proving that this initial condition ensures that��θt��2 is less than level (d/n)αℓ for all future iterations; for details, see Lemma 4 stated in the Appendix. Given this guarantee, our second step is to make use of the inner radius of the considered annulus to apply Theorem 2 for the population EM operator, for all iterations t such that��θt��2 ≥ (d/n)αℓ+1. Consequently, for these iterations, we have ��M(θt)��2 ≤�1 − p + p1 + ∥θ∥222σ2

≾ (1 − (d/n)2αℓ+1)(d/n)αℓ ≤ �γ �dn�αℓ,(24a) where �γ := e−(d/n)2αℓ+1. On the other hand, using the outer radii of the annulus and applying Lemma 1 for this epoch, we obtain that

n =

,(24b) for all t in the epoch. Unfolding the basic triangle inequality (18) for T steps, we ﬁnd that

The second term decays exponentially in T, and our analysis shows that it is dominated by the ﬁrst term in the relevant regime of analysis. Examining

the ﬁrst term, we ﬁnd that θt+T has Euclidean norm of the order

.(25) The epoch is said to be complete once��θt+T ��2 ≾ � dn�αℓ+1. Disregarding constants, this condition is satisﬁed when r =� dn�αℓ+1, or equivalently when

=

Viewing this equation as a function of the pair (αℓ+1, αℓ) and solving for αℓ+1 in terms of αℓ yields the recursion (23). Refer to Figure 6 for a visual illustration of the localization argument summarized above for a given epoch. Of course, the preceding discussion is informal, and there remain many details to be addressed in order to obtain a formal proof. We refer the reader to Appendix A.3 for the complete argument.

acterized the behavior of the EM algorithm for diﬀerent settings of over-speciﬁed location Gaussian mixtures. We established rigorous statistical guarantees of EM under two particular but representative settings of over-speciﬁed location Gaussian mixtures: the balanced and unbalanced mixture-ﬁt. The log-likelihood for the unbalanced ﬁt remains strongly log-concave5 (due to the ﬁxed weights and location parameters being sign ﬂips) and hence the Euclidean error of the ﬁnal iterate of EM decays at the usual rate (d/n)12 with n samples in d dimensions. However, in the balanced case, the log-likelihood is no longer strongly log-concave and the error decays at the slower rate (d/n)14 . We view our results as the ﬁrst step in understanding and possibly improving the EM algorithm in non-regular settings. We now provide a detailed discussion that sheds light on the general applicability of our results. In particular, we discuss the behavior of EM under the following settings: (i) over-speciﬁed mixture models with unknown weight parameters (Section 5.1), (ii) over-speciﬁed mixture of linear regression (Section 5.2), and (iii) more general settings with over-speciﬁed mixture models (Section 5.3). We conclude the paper with a discussion of several future directions that arise from the previous settings in Section 5.4.

5.1. When the weights are unknown. Our theoretical analysis so far assumed that the weights were ﬁxed, an assumption common to a number of previous papers in the area [1, 7, 19]. In Appendix C.2, we consider the case of unknown weights for the model ﬁt (3). In this context, our main contribution is to show that if the weights are initialized far away from 12—meaning that the initial mixture is highly unbalanced—then the EM algorithm converges quickly, and the results from Theorem 1 are valid. (See Lemma 6 in Appendix C.2 for the details.) On the other hand, if the initial mixture is not heavily imbalanced, we observe the slow convergence of EM consistent with Theorems 2 and 3.

the behavior of the EM algorithm in application to parameter estimation in mixture models. Our ﬁndings turn out to hold somewhat more generally, with Theorems 2 and 3 having analogues when the EM algorithm is used to ﬁt a mixture of linear regressions in over-speciﬁed settings. Concretely, suppose that (Y1, X1), . . . , (Yn, Xn) ∈ R × Rd are i.i.d. samples generated from the model

tors Xi ∈ Rd are also i.i.d. samples from the standard multivariate Gaussian N(0, Id). Of interest is to estimate the parameter θ∗ using these samples and EM is a popular method for doing so. When θ∗ has suﬃciently large Euclidean norm, a setting referred to as the strong signal case, Balakrishnan et al. [1] showed that the estimate returned by EM is at a distance (d/n)12 from the true parameter θ∗ with high probability. On the other hand, our analysis shows that when ∥θ∗∥2 decays to zero—leading to an over-speciﬁed setting— the convergence of EM becomes slow. In particular, the EM algorithm takes signiﬁcantly more steps and returns an estimate that is statistically worse, lying at Euclidean distance of the order (d/n)14 from the true parameter. While the EM operators in this case are slightly diﬀerent when compared to the over-speciﬁed Gaussian mixture analyzed before, the proof techniques remain similar. More concretely, we ﬁrst show that the convergence of population EM is slow (similar to Theorem 2) and then use the annulus-based localization argument (similar to the proof of Theorem 3 from Section 4) to derive a sharp rate. For completeness, we present these results formally in Lemma 7 and Corollary 2 in Appendix E. 5.3. Slow rates for general mixtures. We now present several experiments that provide numerical backing to the claim that the slow rate of

order n− 14 is not merely an artifact of the special balanced ﬁt ((3) with

). We demonstrate that the slow convergence of EM is very likely

to arise while ﬁtting general over-speciﬁed location Gaussian mixtures with unknown weights (and known covariance). We consider three settings: (A) ﬁtting several general over-speciﬁed location Gaussian mixture ﬁts to Gaussian data (Figure 7), (B) ﬁtting a special three-component mixture ﬁt to a two mixture of Gaussians (Figure 8), and (C) ﬁtting mixtures with unknown weights and location parameters when the number of components in the ﬁtted model is over-speciﬁed by two (Figure 9). We now turn to the details of these settings.

General over-specified mixture fits on Gaussian data. First, we remark that

the fast convergence in the unbalanced ﬁt (Theorem 1) was a joint result of the facts that (a) the weights were ﬁxed and unequal, and (b) the parameters were constrained to be a sign ﬂip. If either of these conditions is violated, the EM algorithm exhibits slow convergence on both algorithmic and statistical fronts. Theorems 2, 3 and 4 provide rigorous details for the case of equal and ﬁxed weights (balanced ﬁt). When the weights are unknown, EM can exhibit slow rate (see Section 5.1 and Appendix C.2 for further details). When the weights are ﬁxed and unequal, but the location parameters are estimated freely—that is, with the model πφ(x; θ1, 1) + (1 − π)φ(x; θ2, 1), as illustrated in Figure 7(a)—then the EM estimates have error6 of order n− 14 . In such cases, the parameter estimates approximately satisfy the relation

a two-components mixture model, the location estimates become weighted sign ﬂips of each other. The features are the intuitive reason underlying the similarity of behavior of EM between this ﬁt and the balanced ﬁt. Finally, when we ﬁt a two mixture model with unknown weight parameter and free location parameters, the ﬁnal error also has a scaling of order n− 14 . Refer to Figure 7 for a numerical validation of these results.

Over-specified fits for mixtures of Gaussian data. Using similar reasoning

as above, let us sketch out how our theoretical results also yield usable predictions for more general over-speciﬁed models. Roughly speaking, whenever there are extra number of components to be estimated, parameters of some of them are likely to end up satisfying certain form of local constraint. More concretely, suppose that we are given data generated from a k-component

Fig 7. Plots of the Wasserstein error associated with EM fixed points versus the sample size for fitting various kinds of location mixture models to standard normal N(0, 1) data. We fit mixture models with either two or three components, with all location parameters estimated in an unconstrained manner. The lines are obtained by a linear regression of the log error on the sample size n. (a) Fitting a two-mixture model 1) with three different fixed values of weights and two (unconstrained) location parameters, along with least-squares fits to the log errors. (b) Data plotted as red triangles is obtained by fitting a two-component model with unknown mixture weights and two location parameters 1), whereas green circles correspond to results fitting a three-component mixture model 1). In all cases, the EM solutions exhibit the slow tical rate for the error in parameter estimation. Also see Figure 9.

mixture, and we use the EM algorithm to ﬁt the location parameters of a mixture model with k + 1 components. Loosely speaking, the EM estimates corresponding to a set of k − 1 components are likely to converge quickly, leaving the two remaining components to ﬁt a single component in the true model. If the other components are far away, the EM updates for the parameters of these two components are unaﬀected by them and start to behave like the balanced case. See Figure 8 for a numerical illustration of this intuition in an idealized setting where we use k + 1 = 3 components to ﬁt data generated from a k = 2 component model. In this idealized setting, the error for one of the parameter scales at the fast rate of order n− 12 , and that of the parameter that is locally over-ﬁtted exhibits a slow rate of order n− 14 . Finally, we see that the statistical error of order n− 14 also arises when we over-specify the number of components by more than one. In particular, we observe in Figure 7(b) (green dashed dotted line with solid circles) and Figure 9 (both curves) that a similar scaling of order n− 14 arises when we

over-specify the number of components by 2 and estimate the weight and location parameters. Besides formally analyzing EM in these general cases, several other future directions arise from our work which we now discuss.

Fig 8. Behavior of EM for an over-specified Gaussian mixture. True model:

= 10. We fit a model

1), where we initialize close to . (a) Population EM updates: We observe that while slowly to = 0, the iterates converge exponentially fast to (b) We plot the statistical error for the two parameters. While the strong signal component has a parametric rate, for the no signal component EM has the slower rate, which is in good agreement with the theoretical results derived in the paper. (We remark that the error floor for panel (a) arises from the finite precision inherent to numerical integration.)

5.4. Future directions. In our current work, we assumed that only the location parameters were unknown and that the scale parameters of the underlying model are known. Nevertheless in practice, this assumption is rather restrictive and it is natural to ask what happens if the scale parameters were also unknown. We note that the MLE is known to have even slower statistical rates for the estimation error with such higher-order mixtures; therefore, it would be interesting to determine if the EM algorithm also suﬀers from a similar slow down when the scale parameters are unknown. We refer the readers to a recent preprint [10], where we establish that the EM algorithm can suﬀer from a further slow-down on the statistical and computational ends when over-speciﬁed mixtures are ﬁtted with an unknown scale parameter. Another important direction is to analyze the behavior of EM under diﬀer-ent models for generating the data. While our analysis is focused on Gaussian

Fig 9. Plots of Wasserstein error when both weights and location parameters are unknown and estimated using EM and the fitted multivariate mixture model is over-specified. (a) True model: ), and fitted model ) and (b) True model: fitted model: ). In both cases, once again we see the scaling of order for the final error (similar to results in Figure 7 and 8).

mixtures, the non-standard statistical rate n− 14 also arises in other types of over-speciﬁed mixture models, such as those involving mixtures with other exponential family members, or Student-t distributions, suitable for heavytailed data. We believe that the analysis of our paper can be generalized to a broader class of ﬁnite mixture models that includes the aforementioned models. A ﬁnal direction of interest is whether the behavior of EM—slow versus fast convergence—can be used as a statistic in a classical testing problem: testing the simple null of a standard multivariate Gaussian versus the compound alternative of a two-component Gaussian mixture. This problem is known to be challenging due to the break-down of the (generalized) likelihood ratio test, due the singularity of the Fisher information matrix; see the papers [6, 21] for some past work on the problem. The results of our paper suggest an alternative approach, which is based on monitoring the convergence rate of EM. If the EM algorithm converges slowly for a balanced ﬁt, then we may accept the null, whereas the opposite behavior can be used as an evidence for rejecting the null. We leave for future work the analysis of such a testing procedure based on the convergence rates of EM. Acknowledgments. This work was partially supported by Oﬃce of Naval Research grant DOD ONR-N00014-18-1-2640 and National Science Foundation grant NSF-DMS-1612948 to MJW, and National Science Foundation grant NSF-DMS-1613002 to BY, and by Army Research Oﬃce grant

W911NF-17-1-0304 to MIJ.

We now collect several proofs and results that were deferred from the main paper. Appendices A and B contain, respectively, the proofs of our main theorems, and the proofs of all auxiliary technical lemmas. In Appendix C, we provide some additional results that discuss (a) the behavior of EM when the weights are unknown, and (b) the initial radius requirement for EM. Appendix D contains details on the nature of log-likelihood and Fisher information matrix, in order to provide an intuition about the diﬀerent behavior of EM in the balanced and unbalanced case. Finally, in Appendix E, we show that the techniques presented in this paper are not speciﬁc to mixture models, and use them to establish slow convergence of EM for over-speciﬁed mixture of linear regressions. APPENDIX A: PROOFS OF MAIN RESULTS In this appendix, we present the proofs of our main results, namely Theorems 1, 2, 3, and 4. A.1. Proof of Theorem 1. As alluded to earlier in Section 3.1—given Lemma 1—it suﬃces to prove the contraction property (10a) for the population operator M. Recall that θ∗ = 0 is a ﬁxed point of the population EM operator (i.e., M(0) = 0). This fact, combined with the deﬁnition (8) of the M-update, yields ∥M(θ)∥2 = ∥M(θ) − M(θ∗)∥2 = ∥E [2(wθ(X) − w0(X))X]∥2 , where, in the unbalanced setting (3), the weight function wθ (6) and the gradient ∇θwθ take the form For a scalar u ∈ [0, 1], deﬁne the function h(u) = wuθ(X), and note that h′(u) = ∇wuθ(X)⊤θ. Thus, using a Taylor series expansion along the line

θu = uθ, u ∈ [0, 1], we ﬁnd that

= 4π(1 − π)

σ2�π exp�− θ⊤u Xσ2 �+ (1 − π) exp�θ⊤u Xσ2 ��2

≤ 4π(1 − π)∥θ∥2 maxu∈[0,1] |||E [Γθu(X)] |||op,(27) where in the last equation we have deﬁned the matrix Γθu(X) :=

σ2�π exp�− θ⊤u Xσ2 �+ (1 − π) exp�θ⊤u Xσ2 ��2 .(28) Writing the mixture weight as π = 12 (1 − ρ), we claim that it suﬃces to show that max

Indeed, taking the last bound as given and substituting it into inequality (27), we ﬁnd that ∥M(θ)∥2 ≤ 4π (1 − π) 1 − ρ2/21 − ρ2 ∥θ∥2 = (1 − ρ2/2) ∥θ∥2 , which yields the claim (10a) of Theorem 1. Proof of claim (29). We begin by making a convenient change of coordinates. Let R ∈ Rd×d be an orthonormal matrix such that Rθu = ∥θu∥2 e1, where e1 denotes the ﬁrst canonical basis vector in dimension d. Deﬁne the random vector V := RX/σ. Since the vector X ∼ N(0, σ2Id) and the matrix R is orthonormal, the random vector V follows a N(0, Id) distribution. Substituting X = σR⊤V and Rθu = ∥θu∥2 e1 in the expression (28) for Γθu and using the fact that |||R⊤BR|||op = |||B|||op for any matrix B and any orthogonal matrix R, we ﬁnd that |||E [Γθu(X)] |||op = |||Bθu|||op, where Bθu := EV

(π exp (− ∥θu∥2 V1/σ) + (1 − π) exp (∥θu∥2 V1/σ))2

Here V1 := V e1 denotes the ﬁrst coordinate of the random vector V . Note that the matrix Bθu is a diagonal matrix, with non-negative entries. Thus, in order to prove the bound (29), it suﬃces to show that max

When θu = 0, the matrix Bθu = E[V V ⊤] = Id and the claim holds trivially. Turning to the case θu ̸= 0, we split our analysis into two cases, depending on whether j = 1 or j ̸= 1.

(πe−y + (1 − π)ey) ∈ [�(1 − ρ2), 1], if ey ∈�1, 1 + ρ1 − ρ

, and (πe−y + (1 − π)ey) > 1, otherwise.(31) Let Ec and I(E) respectively denote the complement and the indicator of any event E. Deﬁne the event Eθu :=

Using the observation (31) above and the fact that V1 ∼ N(0, 1), we obtain [Bθu]11 = E

(π exp (− ∥θu∥2 V1/σ) + (1 − π) exp (∥θu∥2 V1/σ))2

1 (1 − ρ2)E�V 21 I(Eθu)�+ E�V 21 I(Ecθu)�

(1 − ρ2) .(32) Note that whenever θu ̸= 0, we have that Eθu ⊆�V1 ≥ 0} and consequently, we obtain that

Putting the inequalities (32) and (33) together, we conclude that [Bθu]11 ≤ (1 − ρ2/2)/(1 − ρ2).

Bounding [Bθu]jj, j ̸= 1. Using arguments similar to the previous case, and the fact that the random variables Vi, i ∈ [d], are independent standard normal random variables, we ﬁnd that [Bθu]jj = E

(π exp (− ∥θu∥2 V1/σ) + (1 − π) exp (∥θu∥2 V1/σ))2

(π exp (− ∥θu∥2 V1/σ) + (1 − π) exp (∥θu∥2 V1/σ))2

Invoking the deﬁnition of the event Eθu, we have

1

Finally, noting that E [I(Eθu)] ≤ E [I(V1 ≥ 0)] = 1/2 whenever θu ̸= 0, yields the claim. A.2. Proof of Theorem 2. We split our proof into two parts, which correspond to the upper bound (13a) and the lower bound (13b) respectively. A.2.1. Proof of the upper bound (13a). For the balanced ﬁt, we have wθ(X) = 1 1 + e−2θ⊤X/σ2 and ∇θ(wθ(X)) =

Using a Taylor expansion and repeating the preliminary computations as those in the proof of Theorem 1 from the unbalanced setting, we obtain that

upper bound (13a), it suﬃces to show that

1 + ∥θ∥22 /2σ2

4 (35)

where p := (1 + PZ∼N(0,1)(|Z| ≤ 1))/2 < 1. We now establish the claim (35). Like in proof of Theorem 1, we perform a change of coordinates using an orthogonal matrix R such that Rθu = ∥θu∥2 e1, where e1 is the ﬁrst canonical basis in dimension d. Deﬁne the random vector V := RX/σ. Since the vector X ∼ N(0, σ2Id) and the matrix R is orthogonal, we have that the vector V ∼ N(0, Id). Substituting the matrix X = σR⊤V and Rθu = ∥θu∥2 e1 in the expression for Γθu and using the equality |||R⊤BR|||op = |||B|||op, valid for any matrix B and any orthogonal matrix R, we obtain that |||E [Γθu(X)] |||op = |||Bθu|||op, where Bθu := EV

(exp (− ∥θu∥2 V1/σ) + exp (∥θu∥2 V1/σ))2

.(36) Clearly, the matrix Bθu is a diagonal matrix with non-negative entries (note the abuse of notation: the deﬁnitions of the matrices Γθu and Bθu is diﬀerent from the unbalanced case). Consequently, to obtain a bound for the operator norm of the matrix Bθu, it is suﬃcient to provide an upper bound on the diagonal entries of the matrix Bθu. In order to do so, we introduce an auxiliary claim:

-operator norm of the matrix defined in equation (36), is upper-bounded as

|||Bθu|||op = maxj∈[d][Bθu]jj ≤ p24 + (1 − p2)4 1(1 + ∥θu∥22 /(2σ2))2 ,(37) where p2 = P (|V1| ≤ 1) < 1. See Appendix B.2 for the proof of Lemma 2. Using Lemma 2, we now complete the proof. Integrating both sides of the inequality (37) with respect to u ∈ [0, 1], we ﬁnd that

(1 − p2) 4 1 (1 + ∥θu∥22 /(2σ2))2 du

1

Direct computation of the second integral yields

1

1 + ∥θ∥22σ2

≤ 12

1 + ∥θ∥22σ2 + 1

where the last inequality above follows since tan−1(y) ≤ y, for all y ≥ 0. Putting together the pieces yields

≥ 4λmin (Γθ) ∥θ∥2 ,(38) where λmin(Γθ) denotes the smallest eigenvalue of the square matrix Γθ. Following the change of variable V := RX/σ used in the proof of upper bound (13a), we obtain that

(exp (−∥θu∥2V1/σ) + exp (∥θu∥2V1/σ))2 du

(39) Clearly, the matrix Fθ is a diagonal matrix with non-negative diagonal entries and consequently, we have λmin(Fθ) = minj∈[d][Fθ]jj.(40)

<