b

DiscoverSearch
About
My stuff
Algebraic and Analytic Approaches for Parameter Learning in Mixture Models
2020·arXiv
Abstract
Abstract

We present two different approaches for parameter learning in several mixture models in one dimension. Our first approach uses complex-analytic methods and applies to Gaussian mixtures with shared variance, binomial mixtures with shared success probability, and Poisson mixtures, among others. An example result is that  exp(O(N 1/3))samples suffice to exactly learn a mixture of k < N Poisson distributions, each with integral rate parameters bounded by N. Our second approach uses algebraic and combinatorial tools and applies to binomial mixtures with shared trial parameter N and differing success parameters, as well as to mixtures of geometric distributions. Again, as an example, for binomial mixtures with k components and success parameters discretized to resolution  ǫ, O(k2(N/ǫ)8/√ǫ)samples suffice to exactly recover the parameters. For some of these distributions, our results represent the first guarantees for parameter estimation.

image

Mixture modeling is a powerful method in the statistical toolkit, with widespread use across the sciences (Titterington et al., 1985). Starting with the seminal work of Dasgupta (1999), computational and statistical aspects of learning mixture models have been the subject of intense investigation in the theoretical computer science and statistics communities (Achlioptas and McSherry, 2005; Kalai et al., 2010; Belkin and Sinha, 2010; Arora and Kannan, 2001; Moitra and Valiant, 2010; Feldman et al., 2008; Chan et al., 2014; Acharya et al., 2017; Hopkins and Li, 2018; Diakonikolas et al., 2018; Kothari et al., 2018; Hardt and Price, 2015).

In this literature, there are two flavors of result: (1) parameter estimation, where the goal is to identify the mixing weights and the parameters of each component from samples, and (2) density estimation or PAC-learning, where the goal is simply to find a distribution that is close in some distance (e.g., TV distance) to the data-generating mechanism. Density estimation can be further subdivided into proper and improper learning approaches depending on whether the algorithm outputs a distribution from the given mixture family or not. These three guarantees are quite different. Apart from Gaussian mixtures, where all types of results exist, prior work for other mixture families

image

largely focuses on density estimation, and very little is known for parameter estimation outside of Gaussian mixture models. In this paper, we focus on parameter estimation and provide two new approaches, both of which apply to several mixture families.

Our first approach is analytic in nature and yields new sample complexity guarantees for univariate mixture models including Gaussian, Binomial, and Poisson. Our key technical insight is that we can relate the total variation between two candidate mixtures to a certain Littlewood polynomial, and then use complex analytic techniques to establish separation in TV-distance. With this separation result, we can use density estimation techniques (specifically proper learning techniques) to find a candidate mixture that is close in TV-distance to the data generating mechanism. The results we obtain via this approach are labeled as “analytic” in Table 1. This approach has recently led to important advances in the trace reconstruction and population recovery problems; see work by De et al. (2017a), Nazarov and Peres (2017), and De et al. (2017b).

Our second approach is based on the method of moments, a popular approach for learning Gaussian mixtures, and is more algebraic. Roughly, these algorithms are based on expressing moments of the mixture model as polynomials of the component parameters, and then solving a polynomial system using estimated moments. This approach has been studied in some generality by Belkin and Sinha (2010) who show that it can succeed for a large class of mixture models. However, as their method uses non-constructive arguments from algebraic geometry it cannot be used to bound how many moments are required, which is essential in determining the sample complexity; see a discussion in (Moitra, 2018, Section 7.6). In contrast, our approach does yield bounds on how many moments suffice and can be seen as a quantified version of the results in Belkin and Sinha (2010). The results we obtain via this approach are labeled as “algebraic” in Table 1.

The literature on mixture models is quite large, and we have just referred to a sample of most relevant papers here. A bigger overview on learning distributions can be found in the recent monographs such as Moitra (2018); Diakonikolas (2016).

1.1. Overview of results

As mentioned, an overview of our sample complexity results are displayed in Table 1, where in all cases we consider a uniform mixture of k distributions. Our guarantees are for exact parameter estimation, under the assumption that the mixture parameters are discretized to a particular resolution, given in the third column of the table. Theorem statements are given in the sequel.

At first glance the guarantees seem weak, since they all involve exponential dependence in problem parameters. However, except for the Gaussian case, these results are the first guarantees for parameter estimation for these distributions. All prior results we are aware of consider density estimation (Chan et al., 2013; Feldman et al., 2008).

For the mixtures of discrete distributions, such as binomial and negative binomial with shared trial parameter, or Poisson/geometric/chi-squared mixtures with certain discretizations, it seems like the dependence of sample complexity on the number of components k is polynomial (see Table 1). Note that for these examples  k ≤ N, the upper bounds on parameter values. Therefore the actual dependence on k can still be interpreted as exponential. The results are especially interesting when k is large and possibly growing with N.

For Gaussian mixtures, the most interesting aspect of our bound is the polynomial dependence on the number of components k (first row of Table 1). In our setting and taking  σ = 1, the result of Moitra and Valiant (2010) is applicable, and it yields  ǫ−O(k) sample complexity, which is incom-

image

Table 1: Overview of our results. Results are given for uniform mixtures of k different components but some can be extended to non-uniform mixtures. Note that for rows 2, 4, 7, and 8, k does not appear. This is because  k ≤ Nand other terms dominate.

parable to our  k3 exp(O(ǫ−2/3))bound. Note that our result avoids an exponential dependence in k, trading this off for an exponential dependence on the discretization/accuracy parameter  ǫ.1 Otherresults for Gaussian mixtures either 1) consider density estimation (Daskalakis and Kamath, 2014; Feldman et al., 2008), which is qualitatively quite different from parameter estimation, 2) treat k as constant (Hardt and Price, 2015; Kalai et al., 2010), or 3) focus on the high dimensional setting and require separation assumptions (see for example Diakonikolas et al. (2017) and Moitra (2018)).

As such, our results reflect a new sample complexity tradeoff for parameter estimation in Gaussian mixtures.

As another note, using ideas from (Nazarov and Peres, 2017; De et al., 2017a), one can show that the analytic result for Binomial mixtures is optimal. This raises the question of whether the other results are also optimal or is learning a Binomial mixture intrinsically harder than learning, e.g., a Poisson or Gaussian mixture?

As a final remark, our assumption that parameters are discretized is related to separation conditions that appear in the literature on learning Gaussian mixtures. However, our approach does not seem to yield guarantees when the parameters do not exactly fall into the discretization. We hope to resolve this shortcoming in future work.

1.2. Our techniques

To establish these results, we take two loosely related approaches. In our analytic approach, the key structural result is to lower bound the total variation distance between two mixtures  M, M′ by acertain Littlewood polynomial. For each distribution type, if the parameter is  θ, we find a function

image

(For Gaussians,  Gtis essentially the characteristic function). Such functions can be used to obtain Littlewood polynomials from the difference in expectation for two different mixtures, for example if the parameters  θare integral and the mixture weights are uniform. Applying complex analytic results on Littlewood polynomials, this characterization yields a lower bound on the total variation distance between mixtures, at which point we may use density estimation techniques for parameter learning. Specifically we use the minimum distance estimator (see, Devroye and Lugosi (2012, Sec. 6.8)), which is based on the idea of Scheffe sets. Scheffe sets are building blocks of the Scheffe estimator, commonly used in density estimation, e.g. Suresh et al. (2014).

Our algebraic approach is based on the more classical method of moments. Our key innovation here is a combinatorial argument to bound the number of moments that we need to estimate in order to exactly identify the correct mixture parameters. In more detail, when the parameters belong to a discrete set, we show that the moments reveal various statistics about the multi-set of parameters in the mixture. Then, we adapt and extend classical combinatorics results on sequence reconstruction to argue that two distinct multi-sets must disagree on a low-order moment. These combinatorial results are related to the Prouhet-Tarry-Escott problem (see, e.g., Borwein (2002)) which also has connections to Littlewood polynomials. To wrap up we use standard concentration arguments to estimate all the necessary moments, which yields the sample complexity guarantees.

We note that the complex analytic technique provides non-trivial result only for those mixtures for which an appropriate function  Gtexists. On the other hand, the algebraic approach works for all mixtures whose  ℓth moment can be described as a polynomial of degree exactly  ℓ in itsunknown parameters. In Belkin and Sinha (2010), it was shown that most distributions have this later property. In general, where both methods can be applied, the complex analytic techniques typically provide tighter sample complexity bounds than the algebraic ones.

In this section, we show how analysis of the characteristic function can yield sample complexity guarantees for learning mixtures. At a high level, the recipe we adopt is the following.

1. First, we show that, in a general sense, the total variation distance between two separated mixtures is lower bounded by the  L∞norm of their characteristic functions.

2. Next, we use complex analytic methods and specialized arguments for each particular distribution to lower bound the latter norm.

3. Finally, we use the minimum distance estimator (Devroye and Lugosi, 2012) to find a mixture that is close in total variation to the data generating distribution. Using uniform convergence arguments this yields exact parameter learning.

The two main results we prove in this section are listed below.

Theorem 1 (Learning Gaussian mixtures) Let  M = 1k�ki=1 N(µi, σ2)be a uniform mixture of k univariate Gaussians, with known shared covariance  σ2 and with distinct means  µi ∈ ǫZ. Thenthere exists an algorithm that requires  k3 exp(O((σ/ǫ)2/3))samples from M and exactly identifies the parameters  {µi}ki=1 with high probability.

Theorem 2 (Learning Poisson mixtures) Let  M = 1k�ki=1 Poi(λi) where λi ∈ {0, 1, . . . , N}for each i are distinct. Then there exists an algorithm that that requires  exp(O(N 1/3))) samplesfrom M to exactly identify the parameters  {λi}ki=1 with high probability.

There are some technical differences in deriving the results for Gaussian vs Poisson mixtures. Namely, because of finite choice of parameters we can take a union bound over the all possible incorrect mixtures for the latter case, which is not possible for Gaussian. For Gaussian mixtures we instead use an approach based on VC dimension. The results for negative binomial mixtures and chi-squared mixtures (shown in Table 1) follow the same route as the Poisson mixture. As reported in Table 1, this approach also yields results for mixtures of binomial distributions that we obtained in a different context in our prior work (Krishnamurthy et al., 2019).

2.1. Total Variation and Characteristic Functions

Let  {fθ}θ∈Θdenote a parameterized family of distributions over a sample space  Ω ⊂ R, where fθdenotes either a pdf or pmf, depending on the context. We call  M a (finite) Θ-mixture if M haspdf/pmf �θ∈A αθfθ and A ⊂ Θ, |A| = k. For a distribution with density f (we use distribution and density interchangeably in the sequel), define the characteristic function  Cf(t) ≡ EX∼f[eitX].For any two distribution  f, f ′ defined over a sample space  Ω ⊆ Rthe variational distance (or the TV-distance) is defined to be  ∥f − f ′∥TV ≡ 12�Ω

���df′df − 1��� df. For a function  G : Ω → C define theL∞norm to be  ∥G∥∞ = supω∈Ω |G(ω)| where | · |denotes the modulus.

As a first step, our aim is to show that the total variation distance between  M = �θ∈A αθfθand any other mixture  M′ given by �θ∈B βθfθ, B ⊂ Θ, |B| = kis lower bounded. The following elementary lemma completes the first step of the outlined approach.

Lemma 3 For any two distributions  f, f ′ defined over the same sample space  Ω ⊆ R, we have

image

More generally, for any  G : Ω → C and Ω′ ⊂ Ω we have

image

Proof We prove the latter statement, which implies the former since for the function  G(x) = eitx

we have  supx |G(x)| = 1. We have

image

Equipped with the lower bound in Lemma 3, for each type of distribution, we set out to find a good function G to witness separation in total variation distance. As we will see shortly, for a parametric family  fθ, it will be convenient to find a family of functions  Gt such that

image

Of course, to apply Lemma 3, it will also be important to understand  ∥Gt∥∞. While such functions are specific to the parametric model in question, the remaining analysis will be unified. We derive such functions and collect the relevant properties in the following lemma. At a high level, the calculations are based on reverse engineering from the characteristic function, e.g., finding a choice t′(t) such that Cf(t′) = exp(itθ).

Lemma 4 Let  z = exp(it) where t ∈ [−π/L, π/L].

Gaussian. If  X ∼ N(µ, σ) and Gt(x) = eitx then

image

Poisson. If  X ∼ Poi(λ) and Gt(x) = (1 + it)x then

image

Chi-Squared. If  X ∼ χ2(ℓ) and Gt(x) = exp(x/2 − xe−2it/2) then

image

Negative Binomial. If  X ∼ NB(r, p) and Gt(x) =�1/p − (1/p − 1)e−it�x then

image

Proof Here we give the argument for Poisson distributions only. The remaining calculations are deferred to the appendix. For Poisson random variables, if  Gt(x) = (1 + it)x then since |1 + it|2 =1 + t2 the second claim follows. For the first:

image

2.2. Variational Distance Between Mixtures

We crucially use the following lemma.

Lemma 5 (Borwein and Erd´elyi (1997)) Let  a0, a1, a2, · · · ∈ {0, 1, −1}be such that not all of them are zero. For any complex number  z, let A(z) ≡ �k akzk.Then, for some absolute constant

image

We will also need the following ‘tail bound’ lemma.

Lemma 6 Suppose a > 1 is any real number and  r ∈ R+. For any discrete random variable X with support Z and pmf f,

image

Proof Note that,  Pr(X ≥ x) = Pr(a2X−2x ≥ 1) ≤ E[a2X−2x]. We have,

image

Theorem 7 (TV Lower Bounds) The following bounds hold on distance between two different mixtures assuming all k parameters are distinct for each mixture.

Gaussian:  M = 1k�ki=1 N(µi, σ) and M′ = 1k�ki=1 N(µ′i, σ) where µi, µ′i ∈ ǫZ. Then��M′ − M��TV ≥ k−1 exp(−Ω((σ/ǫ)2/3)) .

Poisson:  M = 1k�ki=1 Poi(λi) and M′ = 1k�ki=1 Poi(λ′i) where λi, λ′i ∈ {0, 1, . . . , N}. Then��M′ − M��TV ≥ k−1 exp(−Ω(N 1/3)) .

Chi-Squared:  M = 1k�ki=1 χ2(ℓi) and M′ = 1k�ki=1 χ2(ℓ′i) where ℓi, ℓ′i ∈ {1, 2, . . . , N}.Then

image

Negative Binomial:  M = 1k�ki=1 NB(ri, p) and M′ = 1k�ki=1 NB(r′i, p) where ri, r′i ∈ {1, 2, . . . , N}.Then

image

Proof As above we give the argument for Poisson random variables, deferring the others to the appendix. Let  X ∼ M and X′ ∼ M′. Then, for w = 1 + it, from Lemma 4,

image

Now we use Lemma 3 with  G(x) = wx, Ω′ = {0, 1, . . . , 2N} and t ≤ 1, to have,

image

Now using Lemma 6,

image

2.3. Parameter Learning

Union Bound Approach for Discrete Distributions We begin with the following proposition which follows from Theorem 7.1 of Devroye and Lugosi (2012).

image

For the mixture of Poissons,  M = 1k�ki=1 Poi(λi) where λi ∈ {0, 1, . . . , N},the number of choices for parameters in the mixture is  (N + 1)k. Now using Lemmas 7 and 8,  exp(O(N 1/3))samples are sufficient to learn the parameters of the mixture.

Exactly the same argument applies to mixtures of Chi-Squared and Negative-Binomial distributions, yielding  exp(O(N 1/3)) and exp(O((N/p)1/3))samples suffice, respectively. However, for Gaussians we need a more intricate approach.

VC Approach for Gaussians To learn the parameters of a Gaussian mixture

image

we use the minimum distance estimator precisely defined in (Devroye and Lugosi, 2012, Section 6.8). Let  A ≡ {{x : M(x) ≥ M′(x)} :for any two mixtures  M ̸= M′}be a collection of subsets. Let Pmdenote the empirical probability measure induced by the m samples. Then, choose a mixture ˆMfor which the quantity  supA∈A | Pr∼ ˆM(A) − Pm(A)|is minimum (or within 1/m of the infi- mum). This is the minimum distance estimator, whose performance is guaranteed by the following proposition (Devroye and Lugosi, 2012, Thm. 6.4).

Proposition 9 Given m samples from  M and with ∆ = supA∈A | Pr∼M(A) − Pm(A)|, we have��� ˆM − M���TV ≤ 4∆ + 3m.

We now upper bound the right-hand side of the above inequality. Via McDiarmid’s inequality and a standard symmetrization argument,  ∆is concentrated around its mean which is a function of V C(A), the VC dimension of the class A, see (Devroye and Lugosi, 2012, Section 4.3):

image

with high probability, for an absolute constant c. This latter term is bounded by the following.

Lemma 10 For the class A defined above, the VC dimension is given by V C(A) = O(k).

Proof First of all we show that any element of the set A can be written as union of at most  4k − 1intervals in R. For this we use the fact that a linear combination of k Gaussian pdfs f(x) = �ki=1 αifi(x) where fis normal pdf  N(µi, σ2i ) and αi ∈ R, 1 ≤ i ≤ khas at most  2k − 2 zero-crossings (Kalai et al., 2012). Therefore, for any two mixtures of interest  M(x) − M′(x) has atmost  4k − 2zero-crossings. Therefore any  A ∈ Amust be a union of at most  4k − 1contiguous regions in R. It is now an easy exercise to see that the VC dimension of such a class is  Θ(k).

image

from Theorem 7, notice that for any other mixture  M′ we must have,

image

As long as��� ˆM − M���TV ≤ 12 ∥M − M′∥TV we will exactly identify the parameters. Therefore m = k3 exp(O((σ/ǫ)2/3))samples suffice to exactly learn the parameters with high probability.

2.4. Extension to Non-Uniform Mixtures

The above results extend to non-uniform mixtures, where the main change is that we require a generalization of Lemma 5. The result, also proved by Borwein and Erd´elyi (1997), states that if a0, a1, a2, . . . ∈ [−1, 1] with poly(n)precision then  max−π/L≤θ≤π/L |A(eiθ)| ≥ e−cL log n, for anabsolute constant c. This weaker bound yields an extra poly(n) factor in the sample complexity.

There are some mixtures where the problem of learning parameters is not amenable to the approach in the previous section. A simple motivating example is learning the parameters  pi ∈{0, ǫ, 2ǫ, 3ǫ, . . . , 1} values3 in the mixture  M = 1k�ki=1 Bin(n, pi). In this section, we present an alternative procedure for learning such mixtures. The basic idea is as follows:

We compute moments  EXℓ exactly for  ℓ = 0, 1, . . . , Tby taking sufficiently many samples. The number of samples will depend on T and the precision of the parameters of the mixture.

We argue that if T is sufficiently large, then these moments uniquely define the parameters of the mixture. To do this we use a combinatorial result due to Krasikov and Roditty (1997).

In this section, it will be convenient to define a function  mℓon multi-sets where

image

Our main result is as follows:

Theorem 11 (Learning Binomial mixtures) Let  M = 1k�ki=1 Bin(n, pi)be a uniform mixture of k binomials, with known shared number of trials n and unknown probabilities  p1, . . . , pk ∈{0, ǫ, 2ǫ, . . . , 1}. Then, provided  n ≥ 4/√ǫ, the first 4/√ǫmoments suffice to learn the parameters  piand there exists an algorithm that, when given  O(k2(n/ǫ)8/√ǫ)samples from M, exactly identifies the parameters  {pi}ki=1 with high probability.

Computing the Moments We compute the  ℓth moment as  Sℓ,t = � Y ℓi /t where Y1, . . . , Yt ∼ X.

Lemma 12  Pr[|Sℓ,t − EXℓ| ≥ γ] ≤ EX2ℓtγ2 ≤ (2ℓ)!γ2t infα�EeαXα2ℓ �where the last inequality assumes the all the moments of X are non-negative.

Proof By the Chebyshev bound,

image

We then use the moment generating function: for all  α > 0, EX2ℓ ≤ (2ℓ)!EeαX/α2ℓ.

The following corollary, tailors the above lemma for a mixture of binomial distributions.

image

where f is a polynomial of degree at most  ℓwith integer coefficients (Belkin and Sinha, 2010). If piis an integer multiple of  ǫthen this implies  k(EXℓ)/ǫℓ is integral and therefore any mixture with a different  ℓth moment differs by at least  ǫℓ/k. Hence, learning the  ℓth moment up to  γℓ < ǫℓ/(2k)implies learning the moment exactly.

Lemma 14 For  X ∼ Bin(n, p), EXℓ is a polynomial in p of degree exactly  ℓ if n ≥ ℓ.

The proof of the lemma is relegated to the appendix.

Theorem 15  O(k2(n/ǫ)8/√ǫ)samples are sufficient to exactly learn the first  4/√ǫmoments of a uniform mixture of k binomial distributions �ki=1 Bin(n, pi)/kwith probability at least 7/8 where each  pi ∈ {0, ǫ, 2ǫ, . . . , 1}.

Proof Let  T = 4/√ǫ. From Corollary 20 and the preceding discussion, learning the  ℓth momentexactly with failure probability  1/91+T−ℓ requires

image

samples. And hence, we can compute all  ℓth moments exactly for  1 ≤ ℓ ≤ 4/√ǫ using

image

samples with failure probability �Tℓ=1 1/91+T−ℓ < �∞i=1 1/9i = 1/8.

How many moments determine the parameters It remains to show the first  4/√ǫ momentssuffice to determine the  pivalues in the mixture  X ∼ �ki=1 Bin(n, pi)/k provided n ≥ 4ǫ . To dothis suppose there exists another mixture  Y ∼ �ki=1 Bin(n, qi)/kand we will argue that

image

implies  {pi}i∈[k] = {qi}i∈[k]. To argue this, define integers  αi, βi ∈ {0, 1, . . . , 1/ǫ}such at that

image

Hence, if the first T moments match  mℓ(A) = mℓ(B) for all ℓ = 0, 1, . . . , T. But the following theorem establishes that if  T = 4�1/ǫthen this implies A = B.

Theorem 16 (Krasikov and Roditty (1997)) For any two subsets  S, T of {0, 1, . . . , n − 1}, then

image

We note that the above theorem is essentially tight. Specifically, there exists  S ̸= T withmk(S) = mk(T) for k = 0, 1, . . . , cn/ log n for some c. As a consequence of this, we note that even the exact values of the  c√n/ log nmoments are insufficient to learn the parameters of the distribution. For an example in terms of Gaussian mixtures, even given the promise  µi ∈{0, 1, . . . , n − 1}are distinct, then the first  c√n/ log nmoments of �i N(µi, 1)are insufficient to uniquely determine  µiwhereas the first  4√nmoments are sufficient.

3.1. Extension to Non-Uniform Distributions

We now consider extending the framework to non-uniform distributions. In this case, the method of computing the moments is identical to the uniform case. However, when arguing that a small number of moments suffices we can no longer appeal to the Theorem 16.

To handle non-uniform distribution we introduce a precision variable q and assume that the weights of the component distributions  ω1, ω2, . . . , ωkare of the form:

image

where  wi ∈ {0, 1, . . . , q − 1}. Then, in the above framework if we are trying to learn parameters α1, . . . , αkthen the moments are going to define a multi-set consisting of  wi copies of αi for eachi ∈ [k]. To quantify how many moments suffice in this case, we need to prove a variant of Theorem 16. The proof is a relatively straight-forward generalization of proof by Scott (1997) and can be found in the appendix.

Theorem 17 For any two multi-sets S, T where each element is in  {0, 1, . . . , n − 1}and the multiplicity of each element is at most  q − 1, then S = Tif and only if  mk(S) = mk(T) for all k =0, 1, . . . , 2√qn log qn.

Acknowledgements The work was partially supported by NSF grants CCF-1909046, CCF-1934846, CCF-1908849, and CCF-1637536.

Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. Sample-optimal density esti- mation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1278–1289. SIAM, 2017.

Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In Conference on Learning Theory, 2005.

Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary gaussians. In Symposium on Theory of Computing, 2001.

Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In Foundations of Computer Science, 2010.

P. Borwein and T. Erd´elyi. Littlewood-type problems on subarcs of the unit circle. Indiana University Mathematics Journal, 1997.

Peter Borwein. The Prouhet—Tarry—Escott Problem, pages 85–95. Springer New York, New York, NY, 2002. ISBN 978-0-387-21652-2. doi: 10.1007/978-0-387-21652-211. URL https://doi.org/10.1007/978-0-387-21652-2_11.

Siu-On Chan, Ilias Diakonikolas, Rocco A Servedio, and Xiaorui Sun. Learning mixtures of struc- tured distributions over discrete domains. In Symposium on Discrete Algorithms, 2013.

Siu-On Chan, Ilias Diakonikolas, Rocco A Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 604–613. ACM, 2014.

Sanjoy Dasgupta. Learning mixtures of gaussians. In Foundations of Computer Science, pages 634–644, 1999.

Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. In Conference on Learning Theory, 2014.

Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Optimal mean-based algorithms for trace reconstruction. In Symposium on Theory of Computing, 2017a.

Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Sharp bounds for population recovery. CoRR, abs/1703.01474, 2017b. URL http://arxiv.org/abs/1703.01474.

Luc Devroye and G´abor Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2012.

Ilias Diakonikolas. Learning structured distributions. Handbook of Big Data, 267, 2016.

Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84. IEEE, 2017.

Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. List-decodable robust mean estimation and learning mixtures of spherical gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1047–1060. ACM, 2018.

Jon Feldman, Ryan O’Donnell, and Rocco A Servedio. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 2008.

Moritz Hardt and Eric Price. Tight bounds for learning a mixture of two gaussians. In Symposium on Theory of Computing, 2015.

Godfrey Harold Hardy, Edward Maitland Wright, et al. An introduction to the theory of numbers. Oxford university press, 1979.

Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Symposium on Theory of Computing, 2018.

Adam Kalai, Ankur Moitra, and Gregory Valiant. Disentangling Gaussians. Communications of the ACM, 55(2):113–120, 2012.

Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two gaussians. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 553–562. ACM, 2010.

Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035–1046. ACM, 2018.

I. Krasikov and Y. Roditty. On a reconstruction problem for sequences. Journal of Combinatorial Theory, Series A, 1997.

Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace recon- struction: Generalized and parameterized. In 27th Annual European Symposium on Algorithms, ESA 2019, September 9-11, 2019, Munich/Garching, Germany., pages 68:1–68:25, 2019.

Ankur Moitra. Algorithmic aspects of machine learning. Cambridge University Press, 2018.

Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In Foundations of Computer Science, 2010.

Fedor Nazarov and Yuval Peres. Trace reconstruction with  exp(O(n1/3)samples. In Symposium on Theory of Computing, 2017.

Alex D. Scott. Reconstructing sequences. Discrete Mathematics, 1997.

Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal- sample estimators for spherical gaussian mixtures. In Advances in Neural Information Processing Systems, pages 1395–1403, 2014.

D Michael Titterington, Adrian FM Smith, and Udi E Makov. Statistical analysis of finite mixture distributions. Wiley, 1985.

Eric W. Weisstein. Geometric distribution. From MathWorld–A Wolfram Web Resource, 2019. URL http://mathworld.wolfram.com/GeometricDistribution.html.

Additional calculations for Lemma 4. We consider each distribution in turn:

Gaussian: Observe that  E[Gt(X)]is precisely the characteristic function. Clearly we have ∥Gt∥∞ = 1and further

image

Poisson: If  Gt(x) = (1 + it)x then since |1 + it|2 = 1 + t2 the second claim follows. For the first:

image

Negative Binomial: Let  wt = 1/p − (1/p − 1)e−it then |wt|2 = 1+(1−p)2−2(1−p) cos tp2 =

image

4.1. Additional calculations for Theorem 7.

Gaussian: The characteristic function of a Gaussian  X ∼ N(µ, σ2) is

image

Chi-Squared: Let  X ∼ M and X′ ∼ M′. Then, for  w = exp(1/2 − e−2it/2), fromLemma 4,

image

where we have used the pdf of chi-squared distribution and the tail bounds for chi-squared. Now using Lemma 5, and taking  |t| ≤ πL,

image

Negative-Binomial: Let  X ∼ M and X′ ∼ M′. Then, for  w = 1/p − (1/p − 1)e−it, fromLemma 4, taking  G(x) = wx,

image

where  u(x) =�x+N−1x �(1 − p)Npx. We have |w| ≤ ec(1−p)t2/p2 ≤ ec(1−p)/p2 for t < 1.Using Lemma 6, with  X ∼ NB(N, p), we have,

image

4.2. Proof of Theorem 17

Let a be the characteristic vector of a subset  S ⊂ U. Let sℓ = mℓ(S)on this set and let s = (s0, s1, . . . , sk−1). We need to prove a is uniquely determined by s. Let us define

image

Claim 1 For a prime number  p and i ̸≡p 0, we have ni,p(a) ≡p s0 − �j�p−1j �sj(−i)p−1−j

Proof

image

Recall that Fermat’s theorem (Hardy et al. (1979)) says that for any prime p and any number  α ̸≡p 0,we must have that  αp−1 ≡p 1. Hence, for a prime number p and some number  i ̸≡p 0, we have

image

Since the value of  ni,pis at most  ⌈qn/p⌉, we can obtain the value of  ni,pexactly if p is chosen to be greater than √qn. Now, let us denote the vector  vi,p ∈ Fnq where the ℓth entry is

image

Therefore, consider two different subsets  S, S′ ⊂ Uand assume that their characteristic vectors are a and b respectively. Therefore, if a and b both give rise to the same value of  s, then a.vi,p = b.vi,p.Hence, if the set of vectors

image

spans  Fnq , then it must imply that a = b and our proof will be complete. Consider a subset  T ⊂ Sdefined by

image

Now, there are two possible cases. First, let us assume that the vectors in T are not all linearly independent in  Fq. In that case, we must have a set of tuples  (i1, p1), (i2, p2), . . . , (im, pm) suchthat

image

where  0 ̸= αj ∈ Fq for all j. Now, by the Chinese Remainder Theorem, we can find an integer r such that  r ≡p1 i1 and r ≡pj 0 for all pj ̸= p1. Define an infinite dimensional vector  ˜v where theℓth entry is

image

Since,  ij ̸≡pj 0, it is evident that  ˜v[r] ̸≡q 0 Now, let sbe the smallest number such that  ˜v[s] ̸= 0and s > n because of our assumption in Eq. 1. Now consider the vector  vt where

image

Now,  vit = 0 for all i < t and vtt ̸= 0. Hence, the set  {vt}nt=1 are in the span of S and also span  Fnq .For the second case, let us assume that the vectors in T are linearly independent. We require the size of T > n so that the vectors in  T span Fnq . From the prime number theorem we know that

image

and hence we simply need that

image

Therefore,  k > (1 + o(1))√qn log qnis sufficient.

4.3. Algebraic method for Geometric distribution

We will denote the Geometric distribution with success parameter 0 < p < 1 as Geo(p) and it has the following form: for a random variable X distributed according to Geo(p), Pr(X = x) = (1 − p)xp where x ∈ {0, 1, 2, . . . }.

Theorem 18 (Learning mixtures of Geometric Distribution) Let  M = 1k�ki=1 Geo(pi) be auniform mixture of k Geometric distributions, with unknown probabilities

image

Then, the first  4√nmoments suffice to learn the parameters  piand there exists an algorithm that, when given  O�k2� √nǫ �8√n�samples from M, exactly identifies the parameters  {pi}ki=1 with highprobability.

Computing the moments. We compute the  ℓth moment in the natural way again. Let  Y1, . . . , Yt ∼

image

Lemma 19 (Restating Lemma 12)  Pr[|Sℓ − EXℓ| ≥ γ] ≤ EX2ℓtγ2 ≤ (2ℓ)!γ2t infα�EeαXα2ℓ �where thelast inequality assumes the all the moments of X are non-negative.

The following corollary, tailors the above lemma for a mixture of geometric distributions.

image

Corollary 20 If  X ∼ �ki=1 Geo(pi)/k then Pr[|Sℓ − EXℓ| ≥ γ] ≤ 2tγ2� 4ℓmini pi

Proof Given a random variable  Z ∼ Geo(p), we will show that  EZk ≤ 2�2kp�k+1for all integer valued  k ≥ 0. It is known that (Weisstein, 2019)

image

where Li−k(z)is the polylogarithmic function of order  −kand argument z, defined explicitly as

image

with�kj�being the Eulerian numbers (see below). Hence, it can be observed that  EZk is a polyno-

mial in 1p of degree k. Denoting Ck = max0≤j≤k−1�kj�and substituting it, we get that

image

Putting everything together and by appealing to Lemma 12, we get the statement of the corollary.

For the geometric distribution,

image

where f is a degree  ℓpolynomial with integer coefficients. If  1/pi − 1is an integer multiple of  ǫthen this implies  k(EXℓ)/ǫℓ is integral and therefore any mixture with a different  ℓth moment must differ by at least  ǫℓ/k. Hence, learning the  ℓth moment up to  γℓ < ǫℓ/(2k)implies learning the moment exactly.

Lemma 21  O�k2�√nǫ �8√n�samples are sufficient to exactly learn the first  4√nmoments of a

uniform mixture of k Geometric distributions �ki=1 Geo(pi)/kwith probability at least 7/8 where each 1pi ∈ {1, 1 + ǫ, 1 + 2ǫ, . . . , 1 + nǫ}.

Proof Let  T = 4√n. From Corollary 20 and the preceding discussion, learning the  ℓth momentexactly with failure probability  1/91+T−ℓ requires

image

samples. And hence, we can compute all  ℓth moments exactly for  1 ≤ ℓ ≤ 4√n using

image

samples with failure probability �Tℓ=1 1/91+T−ℓ < �∞i=1 1/9i = 1/8.

How many moments needed to determine the parameters? It remains to show the first  4√nmoments suffice to determine the  pivalues in the mixture  X ∼ �ki=1 Geo(pi)/k. To do this suppose there exists another mixture  Y ∼ �ki=1 Geo(qi)/kand we will argue that

image

implies  {pi}i∈[k] = {qi}i∈[k]. To argue this, define integers  αi, βi ∈ {0, 1, . . . , n} such that pi =1

image

Hence, if the first T moments match,  mℓ(A) = mℓ(B) for all ℓ = 0, 1, . . . , T. But, again Theorem 16 establishes that if  T = 4√nthen this implies A = B.

Alternative Technique. In the previous analysis the parameters of the geometric distribution (pi’s) had to belong to the set  {1, 11+ǫ, 11+2ǫ, . . . , 11+nǫ}. The reason we had to choose this set is because the moments were polynomials in inverse of the parameters ( 1pi ’s). However it is also pos- sible to obtain a sample complexity bound when the parameters belong to the set  {0, ǫ, 2ǫ, . . . , 1}.This can be done by estimating the probability mass function of the mixture at the discrete points {0, 1, 2, . . . }. We have the following theorem in this case.

image

Recall that for a random variable  X ∼ Mdistributed according to the mixture of geometric distributions, we have

image

and more generally,

image

which is a polynomial in degree k + 1. Now, for the mixture  X ∼ 1/k �ki=1 Geo(pi), we need to argue that estimating the probabilities  Pr(X = ℓ) for ℓ = 0, 1, . . . , 4�1/ǫis sufficient to recover the parameters  pi. Again, suppose there exists another mixture  Y ∼ 1/k �ki=1 Geo(qi) such that

image

and we will argue that this implies  {pi}i∈[k] = {qi}i∈[k]. As before, define integers  αi, βi ∈

image

and it can be shown after some algebraic manipulations that

image

Notice that  m0(A) = m0(B)trivially because both of them contain k components. Again, Theorem 16 establishes that if  mℓ(A) = mℓ(B) for ℓ ∈ {0, 1, . . . 4�1/ǫ}then this implies A = B.

Computing the probabilities. Suppose  Y1, Y2, . . . , Ytare i.i.d. with  X ∼ 1/k �ki=1 Geo(pi).Let us denote  Sℓas the empirical probability that we calculate as,

image

It is obvious that  ESℓ = Pr(X = ℓ). Now, using Chernoff bound, we have

image

Again, recall that

image

where  f(·)is a polynomial of degree  ℓ + 1with integer coefficients. If  piis an integer multiple of ǫthen this implies  kSℓ/ǫℓ+1 is integral and therefore any mixture with a different  ℓth moment has a  ℓmoment that differs by at least  ǫℓ+1/k. Hence, learning the  ℓth moment up to  γℓ < ǫℓ+1/(2k)implies learning the moment exactly. We will use  t = 12k2ǫ8/√ǫ+2 log 64√ǫ number of samples and we will show it will be sufficient to succeed with a probability of at least 78. We will estimate the probabilities as mentioned above and therefore the failure probability can be calculated by using the Chernoff Bound and a union bound over 4√ǫ probabilities to be estimated. Therefore the probability of failure is bounded above by,

image

and hence the proof is complete.

4.4. Proof of Lemma 14

We will prove that for  X ∼ Bin(n, p), the leading term of  EXℓ is �ℓ−1i=0(n − i)pℓ. Since for  n ≥ ℓ,�ℓ−1i=0(n − i) ̸= 0, this implies that  EXℓis a polynomial of degree exactly  ℓ. We will prove this by induction. Since  X ∼ Bin(n, p), we know that EX = np. This verifies the base case. Now, in the induction step, let us assume that the leading term of  EXk is �k−1i=0 (n − i)pk. It is known that (see Belkin and Sinha (2010))

image

Therefore it follows that the leading term of  EXk+1 is

image

This proves the induction step and the lemma.


Designed for Accessibility and to further Open Science