Algebraic and Analytic Approaches for Parameter Learning in Mixture Models

2020·Arxiv

Abstract

Abstract

We present two different approaches for parameter learning in several mixture models in one dimension. Our first approach uses complex-analytic methods and applies to Gaussian mixtures with shared variance, binomial mixtures with shared success probability, and Poisson mixtures, among others. An example result is that samples suffice to exactly learn a mixture of k < N Poisson distributions, each with integral rate parameters bounded by N. Our second approach uses algebraic and combinatorial tools and applies to binomial mixtures with shared trial parameter N and differing success parameters, as well as to mixtures of geometric distributions. Again, as an example, for binomial mixtures with k components and success parameters discretized to resolution samples suffice to exactly recover the parameters. For some of these distributions, our results represent the first guarantees for parameter estimation.

1. Introduction

Mixture modeling is a powerful method in the statistical toolkit, with widespread use across the sciences (Titterington et al., 1985). Starting with the seminal work of Dasgupta (1999), computational and statistical aspects of learning mixture models have been the subject of intense investigation in the theoretical computer science and statistics communities (Achlioptas and McSherry, 2005; Kalai et al., 2010; Belkin and Sinha, 2010; Arora and Kannan, 2001; Moitra and Valiant, 2010; Feldman et al., 2008; Chan et al., 2014; Acharya et al., 2017; Hopkins and Li, 2018; Diakonikolas et al., 2018; Kothari et al., 2018; Hardt and Price, 2015).

In this literature, there are two flavors of result: (1) parameter estimation, where the goal is to identify the mixing weights and the parameters of each component from samples, and (2) density estimation or PAC-learning, where the goal is simply to find a distribution that is close in some distance (e.g., TV distance) to the data-generating mechanism. Density estimation can be further subdivided into proper and improper learning approaches depending on whether the algorithm outputs a distribution from the given mixture family or not. These three guarantees are quite different. Apart from Gaussian mixtures, where all types of results exist, prior work for other mixture families

largely focuses on density estimation, and very little is known for parameter estimation outside of Gaussian mixture models. In this paper, we focus on parameter estimation and provide two new approaches, both of which apply to several mixture families.

Our first approach is analytic in nature and yields new sample complexity guarantees for univariate mixture models including Gaussian, Binomial, and Poisson. Our key technical insight is that we can relate the total variation between two candidate mixtures to a certain Littlewood polynomial, and then use complex analytic techniques to establish separation in TV-distance. With this separation result, we can use density estimation techniques (specifically proper learning techniques) to find a candidate mixture that is close in TV-distance to the data generating mechanism. The results we obtain via this approach are labeled as “analytic” in Table 1. This approach has recently led to important advances in the trace reconstruction and population recovery problems; see work by De et al. (2017a), Nazarov and Peres (2017), and De et al. (2017b).

Our second approach is based on the method of moments, a popular approach for learning Gaussian mixtures, and is more algebraic. Roughly, these algorithms are based on expressing moments of the mixture model as polynomials of the component parameters, and then solving a polynomial system using estimated moments. This approach has been studied in some generality by Belkin and Sinha (2010) who show that it can succeed for a large class of mixture models. However, as their method uses non-constructive arguments from algebraic geometry it cannot be used to bound how many moments are required, which is essential in determining the sample complexity; see a discussion in (Moitra, 2018, Section 7.6). In contrast, our approach does yield bounds on how many moments suffice and can be seen as a quantified version of the results in Belkin and Sinha (2010). The results we obtain via this approach are labeled as “algebraic” in Table 1.

The literature on mixture models is quite large, and we have just referred to a sample of most relevant papers here. A bigger overview on learning distributions can be found in the recent monographs such as Moitra (2018); Diakonikolas (2016).

1.1. Overview of results

As mentioned, an overview of our sample complexity results are displayed in Table 1, where in all cases we consider a uniform mixture of k distributions. Our guarantees are for exact parameter estimation, under the assumption that the mixture parameters are discretized to a particular resolution, given in the third column of the table. Theorem statements are given in the sequel.

At first glance the guarantees seem weak, since they all involve exponential dependence in problem parameters. However, except for the Gaussian case, these results are the first guarantees for parameter estimation for these distributions. All prior results we are aware of consider density estimation (Chan et al., 2013; Feldman et al., 2008).

For the mixtures of discrete distributions, such as binomial and negative binomial with shared trial parameter, or Poisson/geometric/chi-squared mixtures with certain discretizations, it seems like the dependence of sample complexity on the number of components k is polynomial (see Table 1). Note that for these examples , the upper bounds on parameter values. Therefore the actual dependence on k can still be interpreted as exponential. The results are especially interesting when k is large and possibly growing with N.

For Gaussian mixtures, the most interesting aspect of our bound is the polynomial dependence on the number of components k (first row of Table 1). In our setting and taking , the result of Moitra and Valiant (2010) is applicable, and it yields sample complexity, which is incom-

Table 1: Overview of our results. Results are given for uniform mixtures of k different components but some can be extended to non-uniform mixtures. Note that for rows 2, 4, 7, and 8, k does not appear. This is because and other terms dominate.

parable to our bound. Note that our result avoids an exponential dependence in k, trading this off for an exponential dependence on the discretization/accuracy parameter results for Gaussian mixtures either 1) consider density estimation (Daskalakis and Kamath, 2014; Feldman et al., 2008), which is qualitatively quite different from parameter estimation, 2) treat k as constant (Hardt and Price, 2015; Kalai et al., 2010), or 3) focus on the high dimensional setting and require separation assumptions (see for example Diakonikolas et al. (2017) and Moitra (2018)).

As such, our results reflect a new sample complexity tradeoff for parameter estimation in Gaussian mixtures.

As another note, using ideas from (Nazarov and Peres, 2017; De et al., 2017a), one can show that the analytic result for Binomial mixtures is optimal. This raises the question of whether the other results are also optimal or is learning a Binomial mixture intrinsically harder than learning, e.g., a Poisson or Gaussian mixture?

As a final remark, our assumption that parameters are discretized is related to separation conditions that appear in the literature on learning Gaussian mixtures. However, our approach does not seem to yield guarantees when the parameters do not exactly fall into the discretization. We hope to resolve this shortcoming in future work.

1.2. Our techniques

To establish these results, we take two loosely related approaches. In our analytic approach, the key structural result is to lower bound the total variation distance between two mixtures certain Littlewood polynomial. For each distribution type, if the parameter is , we find a function

(For Gaussians, is essentially the characteristic function). Such functions can be used to obtain Littlewood polynomials from the difference in expectation for two different mixtures, for example if the parameters are integral and the mixture weights are uniform. Applying complex analytic results on Littlewood polynomials, this characterization yields a lower bound on the total variation distance between mixtures, at which point we may use density estimation techniques for parameter learning. Specifically we use the minimum distance estimator (see, Devroye and Lugosi (2012, Sec. 6.8)), which is based on the idea of Scheffe sets. Scheffe sets are building blocks of the Scheffe estimator, commonly used in density estimation, e.g. Suresh et al. (2014).

Our algebraic approach is based on the more classical method of moments. Our key innovation here is a combinatorial argument to bound the number of moments that we need to estimate in order to exactly identify the correct mixture parameters. In more detail, when the parameters belong to a discrete set, we show that the moments reveal various statistics about the multi-set of parameters in the mixture. Then, we adapt and extend classical combinatorics results on sequence reconstruction to argue that two distinct multi-sets must disagree on a low-order moment. These combinatorial results are related to the Prouhet-Tarry-Escott problem (see, e.g., Borwein (2002)) which also has connections to Littlewood polynomials. To wrap up we use standard concentration arguments to estimate all the necessary moments, which yields the sample complexity guarantees.

We note that the complex analytic technique provides non-trivial result only for those mixtures for which an appropriate function exists. On the other hand, the algebraic approach works for all mixtures whose moment can be described as a polynomial of degree exactly unknown parameters. In Belkin and Sinha (2010), it was shown that most distributions have this later property. In general, where both methods can be applied, the complex analytic techniques typically provide tighter sample complexity bounds than the algebraic ones.

2. Learning Mixtures via Characteristic Functions

In this section, we show how analysis of the characteristic function can yield sample complexity guarantees for learning mixtures. At a high level, the recipe we adopt is the following.

1. First, we show that, in a general sense, the total variation distance between two separated mixtures is lower bounded by the norm of their characteristic functions.

2. Next, we use complex analytic methods and specialized arguments for each particular distribution to lower bound the latter norm.

3. Finally, we use the minimum distance estimator (Devroye and Lugosi, 2012) to find a mixture that is close in total variation to the data generating distribution. Using uniform convergence arguments this yields exact parameter learning.

The two main results we prove in this section are listed below.

Theorem 1 (Learning Gaussian mixtures) Let be a uniform mixture of k univariate Gaussians, with known shared covariance and with distinct means there exists an algorithm that requires samples from M and exactly identifies the parameters with high probability.

Theorem 2 (Learning Poisson mixtures) Let for each i are distinct. Then there exists an algorithm that that requires from M to exactly identify the parameters with high probability.

There are some technical differences in deriving the results for Gaussian vs Poisson mixtures. Namely, because of finite choice of parameters we can take a union bound over the all possible incorrect mixtures for the latter case, which is not possible for Gaussian. For Gaussian mixtures we instead use an approach based on VC dimension. The results for negative binomial mixtures and chi-squared mixtures (shown in Table 1) follow the same route as the Poisson mixture. As reported in Table 1, this approach also yields results for mixtures of binomial distributions that we obtained in a different context in our prior work (Krishnamurthy et al., 2019).

2.1. Total Variation and Characteristic Functions

Let denote a parameterized family of distributions over a sample space denotes either a pdf or pmf, depending on the context. We call pdf/pmf . For a distribution with density f (we use distribution and density interchangeably in the sequel), define the characteristic function For any two distribution defined over a sample space the variational distance (or the TV-distance) is defined to be

. For a function norm to be denotes the modulus.

As a first step, our aim is to show that the total variation distance between and any other mixture is lower bounded. The following elementary lemma completes the first step of the outlined approach.

Lemma 3 For any two distributions defined over the same sample space

More generally, for any

Proof We prove the latter statement, which implies the former since for the function

we have

Equipped with the lower bound in Lemma 3, for each type of distribution, we set out to find a good function G to witness separation in total variation distance. As we will see shortly, for a parametric family , it will be convenient to find a family of functions

Of course, to apply Lemma 3, it will also be important to understand . While such functions are specific to the parametric model in question, the remaining analysis will be unified. We derive such functions and collect the relevant properties in the following lemma. At a high level, the calculations are based on reverse engineering from the characteristic function, e.g., finding a choice

Lemma 4 Let

• Gaussian. If

• Poisson. If

• Chi-Squared. If

• Negative Binomial. If

Proof Here we give the argument for Poisson distributions only. The remaining calculations are deferred to the appendix. For Poisson random variables, if the second claim follows. For the first:

2.2. Variational Distance Between Mixtures

We crucially use the following lemma.

Lemma 5 (Borwein and Erd´elyi (1997)) Let be such that not all of them are zero. For any complex number Then, for some absolute constant

We will also need the following ‘tail bound’ lemma.

Lemma 6 Suppose a > 1 is any real number and . For any discrete random variable X with support Z and pmf f,

Proof Note that,

Theorem 7 (TV Lower Bounds) The following bounds hold on distance between two different mixtures assuming all k parameters are distinct for each mixture.

• Gaussian:

• Poisson:

• Chi-Squared:

• Negative Binomial:

Proof As above we give the argument for Poisson random variables, deferring the others to the appendix. Let . Then, for w = 1 + it, from Lemma 4,

Now we use Lemma 3 with

Now using Lemma 6,

2.3. Parameter Learning

Union Bound Approach for Discrete Distributions We begin with the following proposition which follows from Theorem 7.1 of Devroye and Lugosi (2012).

For the mixture of Poissons, the number of choices for parameters in the mixture is . Now using Lemmas 7 and 8, samples are sufficient to learn the parameters of the mixture.

Exactly the same argument applies to mixtures of Chi-Squared and Negative-Binomial distributions, yielding samples suffice, respectively. However, for Gaussians we need a more intricate approach.

VC Approach for Gaussians To learn the parameters of a Gaussian mixture

we use the minimum distance estimator precisely defined in (Devroye and Lugosi, 2012, Section 6.8). Let for any two mixtures be a collection of subsets. Let denote the empirical probability measure induced by the m samples. Then, choose a mixture for which the quantity is minimum (or within 1/m of the infi- mum). This is the minimum distance estimator, whose performance is guaranteed by the following proposition (Devroye and Lugosi, 2012, Thm. 6.4).

Proposition 9 Given m samples from

We now upper bound the right-hand side of the above inequality. Via McDiarmid’s inequality and a standard symmetrization argument, is concentrated around its mean which is a function of V C(A), the VC dimension of the class A, see (Devroye and Lugosi, 2012, Section 4.3):

with high probability, for an absolute constant c. This latter term is bounded by the following.

Lemma 10 For the class A defined above, the VC dimension is given by V C(A) = O(k).

Proof First of all we show that any element of the set A can be written as union of at most intervals in R. For this we use the fact that a linear combination of k Gaussian pdfs f(x) = s normal pdf has at most crossings (Kalai et al., 2012). Therefore, for any two mixtures of interest most zero-crossings. Therefore any must be a union of at most contiguous regions in R. It is now an easy exercise to see that the VC dimension of such a class is

from Theorem 7, notice that for any other mixture we must have,

As long aswe will exactly identify the parameters. Therefore samples suffice to exactly learn the parameters with high probability.

2.4. Extension to Non-Uniform Mixtures

The above results extend to non-uniform mixtures, where the main change is that we require a generalization of Lemma 5. The result, also proved by Borwein and Erd´elyi (1997), states that if precision then absolute constant c. This weaker bound yields an extra poly(n) factor in the sample complexity.

3. Learning Mixtures via Moments

There are some mixtures where the problem of learning parameters is not amenable to the approach in the previous section. A simple motivating example is learning the parameters in the mixture . In this section, we present an alternative procedure for learning such mixtures. The basic idea is as follows:

• We compute moments exactly for by taking sufficiently many samples. The number of samples will depend on T and the precision of the parameters of the mixture.

• We argue that if T is sufficiently large, then these moments uniquely define the parameters of the mixture. To do this we use a combinatorial result due to Krasikov and Roditty (1997).

In this section, it will be convenient to define a function on multi-sets where

Our main result is as follows:

Theorem 11 (Learning Binomial mixtures) Let be a uniform mixture of k binomials, with known shared number of trials n and unknown probabilities . Then, provided moments suffice to learn the parameters and there exists an algorithm that, when given samples from M, exactly identifies the parameters with high probability.

Computing the Moments We compute the th moment as

Lemma 12 where the last inequality assumes the all the moments of X are non-negative.

Proof By the Chebyshev bound,

We then use the moment generating function: for all

The following corollary, tailors the above lemma for a mixture of binomial distributions.

where f is a polynomial of degree at most with integer coefficients (Belkin and Sinha, 2010). If is an integer multiple of then this implies is integral and therefore any mixture with a different th moment differs by at least . Hence, learning the th moment up to implies learning the moment exactly.

Lemma 14 For is a polynomial in p of degree exactly

The proof of the lemma is relegated to the appendix.

Theorem 15 samples are sufficient to exactly learn the first moments of a uniform mixture of k binomial distributions with probability at least 7/8 where each

Proof Let . From Corollary 20 and the preceding discussion, learning the exactly with failure probability

samples. And hence, we can compute all th moments exactly for

samples with failure probability

How many moments determine the parameters It remains to show the first suffice to determine the values in the mixture this suppose there exists another mixture and we will argue that

implies . To argue this, define integers such at that

Hence, if the first T moments match . But the following theorem establishes that if then this implies A = B.

Theorem 16 (Krasikov and Roditty (1997)) For any two subsets

We note that the above theorem is essentially tight. Specifically, there exists . As a consequence of this, we note that even the exact values of the moments are insufficient to learn the parameters of the distribution. For an example in terms of Gaussian mixtures, even given the promise are distinct, then the first moments of are insufficient to uniquely determine whereas the first moments are sufficient.

3.1. Extension to Non-Uniform Distributions

We now consider extending the framework to non-uniform distributions. In this case, the method of computing the moments is identical to the uniform case. However, when arguing that a small number of moments suffices we can no longer appeal to the Theorem 16.

To handle non-uniform distribution we introduce a precision variable q and assume that the weights of the component distributions are of the form:

where . Then, in the above framework if we are trying to learn parameters then the moments are going to define a multi-set consisting of . To quantify how many moments suffice in this case, we need to prove a variant of Theorem 16. The proof is a relatively straight-forward generalization of proof by Scott (1997) and can be found in the appendix.

Theorem 17 For any two multi-sets S, T where each element is in and the multiplicity of each element is at most if and only if

Acknowledgements The work was partially supported by NSF grants CCF-1909046, CCF-1934846, CCF-1908849, and CCF-1637536.

References

Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. Sample-optimal density esti- mation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1278–1289. SIAM, 2017.

Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In Conference on Learning Theory, 2005.

Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary gaussians. In Symposium on Theory of Computing, 2001.

Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In Foundations of Computer Science, 2010.

P. Borwein and T. Erd´elyi. Littlewood-type problems on subarcs of the unit circle. Indiana University Mathematics Journal, 1997.

Peter Borwein. The Prouhet—Tarry—Escott Problem, pages 85–95. Springer New York, New York, NY, 2002. ISBN 978-0-387-21652-2. doi: 10.1007/978-0-387-21652-211. URL https://doi.org/10.1007/978-0-387-21652-2_11.

Siu-On Chan, Ilias Diakonikolas, Rocco A Servedio, and Xiaorui Sun. Learning mixtures of struc- tured distributions over discrete domains. In Symposium on Discrete Algorithms, 2013.

Siu-On Chan, Ilias Diakonikolas, Rocco A Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 604–613. ACM, 2014.

Sanjoy Dasgupta. Learning mixtures of gaussians. In Foundations of Computer Science, pages 634–644, 1999.

Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. In Conference on Learning Theory, 2014.

Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Optimal mean-based algorithms for trace reconstruction. In Symposium on Theory of Computing, 2017a.

Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Sharp bounds for population recovery. CoRR, abs/1703.01474, 2017b. URL http://arxiv.org/abs/1703.01474.

Luc Devroye and G´abor Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2012.

Ilias Diakonikolas. Learning structured distributions. Handbook of Big Data, 267, 2016.

Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84. IEEE, 2017.

Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. List-decodable robust mean estimation and learning mixtures of spherical gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1047–1060. ACM, 2018.

Jon Feldman, Ryan O’Donnell, and Rocco A Servedio. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 2008.

Moritz Hardt and Eric Price. Tight bounds for learning a mixture of two gaussians. In Symposium on Theory of Computing, 2015.

Godfrey Harold Hardy, Edward Maitland Wright, et al. An introduction to the theory of numbers. Oxford university press, 1979.

Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Symposium on Theory of Computing, 2018.

Adam Kalai, Ankur Moitra, and Gregory Valiant. Disentangling Gaussians. Communications of the ACM, 55(2):113–120, 2012.

Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two gaussians. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 553–562. ACM, 2010.

Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035–1046. ACM, 2018.

I. Krasikov and Y. Roditty. On a reconstruction problem for sequences. Journal of Combinatorial Theory, Series A, 1997.

Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace recon- struction: Generalized and parameterized. In 27th Annual European Symposium on Algorithms, ESA 2019, September 9-11, 2019, Munich/Garching, Germany., pages 68:1–68:25, 2019.

Ankur Moitra. Algorithmic aspects of machine learning. Cambridge University Press, 2018.

Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In Foundations of Computer Science, 2010.

Fedor Nazarov and Yuval Peres. Trace reconstruction with samples. In Symposium on Theory of Computing, 2017.

Alex D. Scott. Reconstructing sequences. Discrete Mathematics, 1997.

Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal- sample estimators for spherical gaussian mixtures. In Advances in Neural Information Processing Systems, pages 1395–1403, 2014.

D Michael Titterington, Adrian FM Smith, and Udi E Makov. Statistical analysis of finite mixture distributions. Wiley, 1985.

Eric W. Weisstein. Geometric distribution. From MathWorld–A Wolfram Web Resource, 2019. URL http://mathworld.wolfram.com/GeometricDistribution.html.

4. Omitted Proofs

Additional calculations for Lemma 4. We consider each distribution in turn:

• Gaussian: Observe that is precisely the characteristic function. Clearly we have and further

• Poisson: If the second claim follows. For the first:

• Negative Binomial: Let

4.1. Additional calculations for Theorem 7.

• Gaussian: The characteristic function of a Gaussian

• Chi-Squared: Let . Then, for Lemma 4,

where we have used the pdf of chi-squared distribution and the tail bounds for chi-squared. Now using Lemma 5, and taking

• Negative-Binomial: Let . Then, for Lemma 4, taking

where Using Lemma 6, with

4.2. Proof of Theorem 17

Let a be the characteristic vector of a subset on this set and let s = . We need to prove a is uniquely determined by s. Let us define

Claim 1 For a prime number

Proof

Recall that Fermat’s theorem (Hardy et al. (1979)) says that for any prime p and any number we must have that . Hence, for a prime number p and some number

Since the value of is at most , we can obtain the value of exactly if p is chosen to be greater than . Now, let us denote the vector th entry is

Therefore, consider two different subsets and assume that their characteristic vectors are a and b respectively. Therefore, if a and b both give rise to the same value of Hence, if the set of vectors

spans , then it must imply that a = b and our proof will be complete. Consider a subset defined by

Now, there are two possible cases. First, let us assume that the vectors in T are not all linearly independent in . In that case, we must have a set of tuples that

where . Now, by the Chinese Remainder Theorem, we can find an integer r such that . Define an infinite dimensional vector th entry is

Since, , it is evident that be the smallest number such that and s > n because of our assumption in Eq. 1. Now consider the vector

Now, . Hence, the set are in the span of S and also span For the second case, let us assume that the vectors in T are linearly independent. We require the size of T > n so that the vectors in . From the prime number theorem we know that

and hence we simply need that

Therefore, is sufficient.

4.3. Algebraic method for Geometric distribution

We will denote the Geometric distribution with success parameter 0 < p < 1 as Geo(p) and it has the following form: for a random variable X distributed according to Geo(p), Pr(X = x) =

Theorem 18 (Learning mixtures of Geometric Distribution) Let uniform mixture of k Geometric distributions, with unknown probabilities

Then, the first moments suffice to learn the parameters and there exists an algorithm that, when given samples from M, exactly identifies the parameters probability.

Computing the moments. We compute the th moment in the natural way again. Let

Lemma 19 (Restating Lemma 12) last inequality assumes the all the moments of X are non-negative.

The following corollary, tailors the above lemma for a mixture of geometric distributions.

Corollary 20 If

Proof Given a random variable , we will show that for all integer valued . It is known that (Weisstein, 2019)

where Liis the polylogarithmic function of order and argument z, defined explicitly as

withbeing the Eulerian numbers (see below). Hence, it can be observed that is a polyno-

mial in and substituting it, we get that

Putting everything together and by appealing to Lemma 12, we get the statement of the corollary.

For the geometric distribution,

where f is a degree polynomial with integer coefficients. If is an integer multiple of then this implies is integral and therefore any mixture with a different th moment must differ by at least . Hence, learning the th moment up to implies learning the moment exactly.

Lemma 21 samples are sufficient to exactly learn the first moments of a

uniform mixture of k Geometric distributions with probability at least 7/8 where each

Proof Let . From Corollary 20 and the preceding discussion, learning the exactly with failure probability

samples. And hence, we can compute all th moments exactly for

samples with failure probability

How many moments needed to determine the parameters? It remains to show the first moments suffice to determine the values in the mixture . To do this suppose there exists another mixture and we will argue that

implies . To argue this, define integers

Hence, if the first T moments match, . But, again Theorem 16 establishes that if then this implies A = B.

Alternative Technique. In the previous analysis the parameters of the geometric distribution (’s) had to belong to the set . The reason we had to choose this set is because the moments were polynomials in inverse of the parameters ( ’s). However it is also pos- sible to obtain a sample complexity bound when the parameters belong to the set This can be done by estimating the probability mass function of the mixture at the discrete points {0, 1, 2, . . . }. We have the following theorem in this case.

Recall that for a random variable distributed according to the mixture of geometric distributions, we have

and more generally,

which is a polynomial in degree k + 1. Now, for the mixture , we need to argue that estimating the probabilities is sufficient to recover the parameters . Again, suppose there exists another mixture

and we will argue that this implies . As before, define integers

and it can be shown after some algebraic manipulations that

Notice that trivially because both of them contain k components. Again, Theorem 16 establishes that if then this implies A = B.

Computing the probabilities. Suppose are i.i.d. with Let us denote as the empirical probability that we calculate as,

It is obvious that . Now, using Chernoff bound, we have

Again, recall that

where is a polynomial of degree with integer coefficients. If is an integer multiple of then this implies is integral and therefore any mixture with a different th moment has a moment that differs by at least . Hence, learning the th moment up to implies learning the moment exactly. We will use number of samples and we will show it will be sufficient to succeed with a probability of at least . We will estimate the probabilities as mentioned above and therefore the failure probability can be calculated by using the Chernoff Bound and a union bound over probabilities to be estimated. Therefore the probability of failure is bounded above by,

and hence the proof is complete.

4.4. Proof of Lemma 14

We will prove that for , the leading term of . Since for , this implies that is a polynomial of degree exactly . We will prove this by induction. Since , we know that EX = np. This verifies the base case. Now, in the induction step, let us assume that the leading term of . It is known that (see Belkin and Sinha (2010))

Therefore it follows that the leading term of

This proves the induction step and the lemma.