Mixture modeling is a powerful method in the statistical toolkit, with widespread use across the sciences (Titterington et al., 1985). Starting with the seminal work of Dasgupta (1999), computational and statistical aspects of learning mixture models have been the subject of intense investigation in the theoretical computer science and statistics communities (Achlioptas and McSherry, 2005; Kalai et al., 2010; Belkin and Sinha, 2010; Arora and Kannan, 2001; Moitra and Valiant, 2010; Feldman et al., 2008; Chan et al., 2014; Acharya et al., 2017; Hopkins and Li, 2018; Diakonikolas et al., 2018; Kothari et al., 2018; Hardt and Price, 2015).
In this literature, there are two flavors of result: (1) parameter estimation, where the goal is to identify the mixing weights and the parameters of each component from samples, and (2) density estimation or PAC-learning, where the goal is simply to find a distribution that is close in some distance (e.g., TV distance) to the data-generating mechanism. Density estimation can be further subdivided into proper and improper learning approaches depending on whether the algorithm outputs a distribution from the given mixture family or not. These three guarantees are quite different. Apart from Gaussian mixtures, where all types of results exist, prior work for other mixture families
largely focuses on density estimation, and very little is known for parameter estimation outside of Gaussian mixture models. In this paper, we focus on parameter estimation and provide two new approaches, both of which apply to several mixture families.
Our first approach is analytic in nature and yields new sample complexity guarantees for univariate mixture models including Gaussian, Binomial, and Poisson. Our key technical insight is that we can relate the total variation between two candidate mixtures to a certain Littlewood polynomial, and then use complex analytic techniques to establish separation in TV-distance. With this separation result, we can use density estimation techniques (specifically proper learning techniques) to find a candidate mixture that is close in TV-distance to the data generating mechanism. The results we obtain via this approach are labeled as “analytic” in Table 1. This approach has recently led to important advances in the trace reconstruction and population recovery problems; see work by De et al. (2017a), Nazarov and Peres (2017), and De et al. (2017b).
Our second approach is based on the method of moments, a popular approach for learning Gaussian mixtures, and is more algebraic. Roughly, these algorithms are based on expressing moments of the mixture model as polynomials of the component parameters, and then solving a polynomial system using estimated moments. This approach has been studied in some generality by Belkin and Sinha (2010) who show that it can succeed for a large class of mixture models. However, as their method uses non-constructive arguments from algebraic geometry it cannot be used to bound how many moments are required, which is essential in determining the sample complexity; see a discussion in (Moitra, 2018, Section 7.6). In contrast, our approach does yield bounds on how many moments suffice and can be seen as a quantified version of the results in Belkin and Sinha (2010). The results we obtain via this approach are labeled as “algebraic” in Table 1.
The literature on mixture models is quite large, and we have just referred to a sample of most relevant papers here. A bigger overview on learning distributions can be found in the recent monographs such as Moitra (2018); Diakonikolas (2016).
1.1. Overview of results
As mentioned, an overview of our sample complexity results are displayed in Table 1, where in all cases we consider a uniform mixture of k distributions. Our guarantees are for exact parameter estimation, under the assumption that the mixture parameters are discretized to a particular resolution, given in the third column of the table. Theorem statements are given in the sequel.
At first glance the guarantees seem weak, since they all involve exponential dependence in problem parameters. However, except for the Gaussian case, these results are the first guarantees for parameter estimation for these distributions. All prior results we are aware of consider density estimation (Chan et al., 2013; Feldman et al., 2008).
For the mixtures of discrete distributions, such as binomial and negative binomial with shared trial parameter, or Poisson/geometric/chi-squared mixtures with certain discretizations, it seems like the dependence of sample complexity on the number of components k is polynomial (see Table 1). Note that for these examples , the upper bounds on parameter values. Therefore the actual dependence on k can still be interpreted as exponential. The results are especially interesting when k is large and possibly growing with N.
For Gaussian mixtures, the most interesting aspect of our bound is the polynomial dependence on the number of components k (first row of Table 1). In our setting and taking , the result of Moitra and Valiant (2010) is applicable, and it yields
sample complexity, which is incom-
Table 1: Overview of our results. Results are given for uniform mixtures of k different components but some can be extended to non-uniform mixtures. Note that for rows 2, 4, 7, and 8, k does not appear. This is because and other terms dominate.
parable to our bound. Note that our result avoids an exponential dependence in k, trading this off for an exponential dependence on the discretization/accuracy parameter
results for Gaussian mixtures either 1) consider density estimation (Daskalakis and Kamath, 2014; Feldman et al., 2008), which is qualitatively quite different from parameter estimation, 2) treat k as constant (Hardt and Price, 2015; Kalai et al., 2010), or 3) focus on the high dimensional setting and require separation assumptions (see for example Diakonikolas et al. (2017) and Moitra (2018)).
As such, our results reflect a new sample complexity tradeoff for parameter estimation in Gaussian mixtures.
As another note, using ideas from (Nazarov and Peres, 2017; De et al., 2017a), one can show that the analytic result for Binomial mixtures is optimal. This raises the question of whether the other results are also optimal or is learning a Binomial mixture intrinsically harder than learning, e.g., a Poisson or Gaussian mixture?
As a final remark, our assumption that parameters are discretized is related to separation conditions that appear in the literature on learning Gaussian mixtures. However, our approach does not seem to yield guarantees when the parameters do not exactly fall into the discretization. We hope to resolve this shortcoming in future work.
1.2. Our techniques
To establish these results, we take two loosely related approaches. In our analytic approach, the key structural result is to lower bound the total variation distance between two mixtures certain Littlewood polynomial. For each distribution type, if the parameter is
, we find a function
(For Gaussians, is essentially the characteristic function). Such functions can be used to obtain Littlewood polynomials from the difference in expectation for two different mixtures, for example if the parameters
are integral and the mixture weights are uniform. Applying complex analytic results on Littlewood polynomials, this characterization yields a lower bound on the total variation distance between mixtures, at which point we may use density estimation techniques for parameter learning. Specifically we use the minimum distance estimator (see, Devroye and Lugosi (2012, Sec. 6.8)), which is based on the idea of Scheffe sets. Scheffe sets are building blocks of the Scheffe estimator, commonly used in density estimation, e.g. Suresh et al. (2014).
Our algebraic approach is based on the more classical method of moments. Our key innovation here is a combinatorial argument to bound the number of moments that we need to estimate in order to exactly identify the correct mixture parameters. In more detail, when the parameters belong to a discrete set, we show that the moments reveal various statistics about the multi-set of parameters in the mixture. Then, we adapt and extend classical combinatorics results on sequence reconstruction to argue that two distinct multi-sets must disagree on a low-order moment. These combinatorial results are related to the Prouhet-Tarry-Escott problem (see, e.g., Borwein (2002)) which also has connections to Littlewood polynomials. To wrap up we use standard concentration arguments to estimate all the necessary moments, which yields the sample complexity guarantees.
We note that the complex analytic technique provides non-trivial result only for those mixtures for which an appropriate function exists. On the other hand, the algebraic approach works for all mixtures whose
moment can be described as a polynomial of degree exactly
unknown parameters. In Belkin and Sinha (2010), it was shown that most distributions have this later property. In general, where both methods can be applied, the complex analytic techniques typically provide tighter sample complexity bounds than the algebraic ones.
In this section, we show how analysis of the characteristic function can yield sample complexity guarantees for learning mixtures. At a high level, the recipe we adopt is the following.
1. First, we show that, in a general sense, the total variation distance between two separated mixtures is lower bounded by the norm of their characteristic functions.
2. Next, we use complex analytic methods and specialized arguments for each particular distribution to lower bound the latter norm.
3. Finally, we use the minimum distance estimator (Devroye and Lugosi, 2012) to find a mixture that is close in total variation to the data generating distribution. Using uniform convergence arguments this yields exact parameter learning.
The two main results we prove in this section are listed below.
Theorem 1 (Learning Gaussian mixtures) Let be a uniform mixture of k univariate Gaussians, with known shared covariance
and with distinct means
there exists an algorithm that requires
samples from M and exactly identifies the parameters
with high probability.
Theorem 2 (Learning Poisson mixtures) Let for each i are distinct. Then there exists an algorithm that that requires
from M to exactly identify the parameters
with high probability.
There are some technical differences in deriving the results for Gaussian vs Poisson mixtures. Namely, because of finite choice of parameters we can take a union bound over the all possible incorrect mixtures for the latter case, which is not possible for Gaussian. For Gaussian mixtures we instead use an approach based on VC dimension. The results for negative binomial mixtures and chi-squared mixtures (shown in Table 1) follow the same route as the Poisson mixture. As reported in Table 1, this approach also yields results for mixtures of binomial distributions that we obtained in a different context in our prior work (Krishnamurthy et al., 2019).
2.1. Total Variation and Characteristic Functions
Let denote a parameterized family of distributions over a sample space
denotes either a pdf or pmf, depending on the context. We call
pdf/pmf
. For a distribution with density f (we use distribution and density interchangeably in the sequel), define the characteristic function
For any two distribution
defined over a sample space
the variational distance (or the TV-distance) is defined to be
. For a function
norm to be
denotes the modulus.
As a first step, our aim is to show that the total variation distance between and any other mixture
is lower bounded. The following elementary lemma completes the first step of the outlined approach.
Lemma 3 For any two distributions defined over the same sample space
More generally, for any
Proof We prove the latter statement, which implies the former since for the function
we have
Equipped with the lower bound in Lemma 3, for each type of distribution, we set out to find a good function G to witness separation in total variation distance. As we will see shortly, for a parametric family , it will be convenient to find a family of functions
Of course, to apply Lemma 3, it will also be important to understand . While such functions are specific to the parametric model in question, the remaining analysis will be unified. We derive such functions and collect the relevant properties in the following lemma. At a high level, the calculations are based on reverse engineering from the characteristic function, e.g., finding a choice
Lemma 4 Let
• Gaussian. If
• Poisson. If
• Chi-Squared. If
• Negative Binomial. If
Proof Here we give the argument for Poisson distributions only. The remaining calculations are deferred to the appendix. For Poisson random variables, if the second claim follows. For the first:
2.2. Variational Distance Between Mixtures
We crucially use the following lemma.
Lemma 5 (Borwein and Erd´elyi (1997)) Let be such that not all of them are zero. For any complex number
Then, for some absolute constant
We will also need the following ‘tail bound’ lemma.
Lemma 6 Suppose a > 1 is any real number and . For any discrete random variable X with support Z and pmf f,
Proof Note that,
Theorem 7 (TV Lower Bounds) The following bounds hold on distance between two different mixtures assuming all k parameters are distinct for each mixture.
• Gaussian:
• Poisson:
• Chi-Squared:
• Negative Binomial:
Proof As above we give the argument for Poisson random variables, deferring the others to the appendix. Let . Then, for w = 1 + it, from Lemma 4,
Now we use Lemma 3 with
Now using Lemma 6,
2.3. Parameter Learning
Union Bound Approach for Discrete Distributions We begin with the following proposition which follows from Theorem 7.1 of Devroye and Lugosi (2012).
For the mixture of Poissons, the number of choices for parameters in the mixture is
. Now using Lemmas 7 and 8,
samples are sufficient to learn the parameters of the mixture.
Exactly the same argument applies to mixtures of Chi-Squared and Negative-Binomial distributions, yielding samples suffice, respectively. However, for Gaussians we need a more intricate approach.
VC Approach for Gaussians To learn the parameters of a Gaussian mixture
we use the minimum distance estimator precisely defined in (Devroye and Lugosi, 2012, Section 6.8). Let for any two mixtures
be a collection of subsets. Let
denote the empirical probability measure induced by the m samples. Then, choose a mixture
for which the quantity
is minimum (or within 1/m of the infi- mum). This is the minimum distance estimator, whose performance is guaranteed by the following proposition (Devroye and Lugosi, 2012, Thm. 6.4).
Proposition 9 Given m samples from
We now upper bound the right-hand side of the above inequality. Via McDiarmid’s inequality and a standard symmetrization argument, is concentrated around its mean which is a function of V C(A), the VC dimension of the class A, see (Devroye and Lugosi, 2012, Section 4.3):
with high probability, for an absolute constant c. This latter term is bounded by the following.
Lemma 10 For the class A defined above, the VC dimension is given by V C(A) = O(k).
Proof First of all we show that any element of the set A can be written as union of at most intervals in R. For this we use the fact that a linear combination of k Gaussian pdfs f(x) =
s normal pdf
has at most
crossings (Kalai et al., 2012). Therefore, for any two mixtures of interest
most
zero-crossings. Therefore any
must be a union of at most
contiguous regions in R. It is now an easy exercise to see that the VC dimension of such a class is
from Theorem 7, notice that for any other mixture we must have,
As long aswe will exactly identify the parameters. Therefore
samples suffice to exactly learn the parameters with high probability.
2.4. Extension to Non-Uniform Mixtures
The above results extend to non-uniform mixtures, where the main change is that we require a generalization of Lemma 5. The result, also proved by Borwein and Erd´elyi (1997), states that if precision then
absolute constant c. This weaker bound yields an extra poly(n) factor in the sample complexity.
There are some mixtures where the problem of learning parameters is not amenable to the approach in the previous section. A simple motivating example is learning the parameters in the mixture
. In this section, we present an alternative procedure for learning such mixtures. The basic idea is as follows:
• We compute moments exactly for
by taking sufficiently many samples. The number of samples will depend on T and the precision of the parameters of the mixture.
• We argue that if T is sufficiently large, then these moments uniquely define the parameters of the mixture. To do this we use a combinatorial result due to Krasikov and Roditty (1997).
In this section, it will be convenient to define a function on multi-sets where
Our main result is as follows:
Theorem 11 (Learning Binomial mixtures) Let be a uniform mixture of k binomials, with known shared number of trials n and unknown probabilities
. Then, provided
moments suffice to learn the parameters
and there exists an algorithm that, when given
samples from M, exactly identifies the parameters
with high probability.
Computing the Moments We compute the th moment as
Lemma 12 where the last inequality assumes the all the moments of X are non-negative.
Proof By the Chebyshev bound,
We then use the moment generating function: for all
The following corollary, tailors the above lemma for a mixture of binomial distributions.
where f is a polynomial of degree at most with integer coefficients (Belkin and Sinha, 2010). If
is an integer multiple of
then this implies
is integral and therefore any mixture with a different
th moment differs by at least
. Hence, learning the
th moment up to
implies learning the moment exactly.
Lemma 14 For is a polynomial in p of degree exactly
The proof of the lemma is relegated to the appendix.
Theorem 15 samples are sufficient to exactly learn the first
moments of a uniform mixture of k binomial distributions
with probability at least 7/8 where each
Proof Let . From Corollary 20 and the preceding discussion, learning the
exactly with failure probability
samples. And hence, we can compute all th moments exactly for
samples with failure probability
How many moments determine the parameters It remains to show the first suffice to determine the
values in the mixture
this suppose there exists another mixture
and we will argue that
implies . To argue this, define integers
such at that
Hence, if the first T moments match . But the following theorem establishes that if
then this implies A = B.
Theorem 16 (Krasikov and Roditty (1997)) For any two subsets
We note that the above theorem is essentially tight. Specifically, there exists . As a consequence of this, we note that even the exact values of the
moments are insufficient to learn the parameters of the distribution. For an example in terms of Gaussian mixtures, even given the promise
are distinct, then the first
moments of
are insufficient to uniquely determine
whereas the first
moments are sufficient.
3.1. Extension to Non-Uniform Distributions
We now consider extending the framework to non-uniform distributions. In this case, the method of computing the moments is identical to the uniform case. However, when arguing that a small number of moments suffices we can no longer appeal to the Theorem 16.
To handle non-uniform distribution we introduce a precision variable q and assume that the weights of the component distributions are of the form:
where . Then, in the above framework if we are trying to learn parameters
then the moments are going to define a multi-set consisting of
. To quantify how many moments suffice in this case, we need to prove a variant of Theorem 16. The proof is a relatively straight-forward generalization of proof by Scott (1997) and can be found in the appendix.
Theorem 17 For any two multi-sets S, T where each element is in and the multiplicity of each element is at most
if and only if
Acknowledgements The work was partially supported by NSF grants CCF-1909046, CCF-1934846, CCF-1908849, and CCF-1637536.
Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. Sample-optimal density esti- mation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1278–1289. SIAM, 2017.
Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In Conference on Learning Theory, 2005.
Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary gaussians. In Symposium on Theory of Computing, 2001.
Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In Foundations of Computer Science, 2010.
P. Borwein and T. Erd´elyi. Littlewood-type problems on subarcs of the unit circle. Indiana University Mathematics Journal, 1997.
Peter Borwein. The Prouhet—Tarry—Escott Problem, pages 85–95. Springer New York, New York, NY, 2002. ISBN 978-0-387-21652-2. doi: 10.1007/978-0-387-21652-211. URL https://doi.org/10.1007/978-0-387-21652-2_11.
Siu-On Chan, Ilias Diakonikolas, Rocco A Servedio, and Xiaorui Sun. Learning mixtures of struc- tured distributions over discrete domains. In Symposium on Discrete Algorithms, 2013.
Siu-On Chan, Ilias Diakonikolas, Rocco A Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 604–613. ACM, 2014.
Sanjoy Dasgupta. Learning mixtures of gaussians. In Foundations of Computer Science, pages 634–644, 1999.
Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. In Conference on Learning Theory, 2014.
Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Optimal mean-based algorithms for trace reconstruction. In Symposium on Theory of Computing, 2017a.
Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Sharp bounds for population recovery. CoRR, abs/1703.01474, 2017b. URL http://arxiv.org/abs/1703.01474.
Luc Devroye and G´abor Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2012.
Ilias Diakonikolas. Learning structured distributions. Handbook of Big Data, 267, 2016.
Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84. IEEE, 2017.
Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. List-decodable robust mean estimation and learning mixtures of spherical gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1047–1060. ACM, 2018.
Jon Feldman, Ryan O’Donnell, and Rocco A Servedio. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 2008.
Moritz Hardt and Eric Price. Tight bounds for learning a mixture of two gaussians. In Symposium on Theory of Computing, 2015.
Godfrey Harold Hardy, Edward Maitland Wright, et al. An introduction to the theory of numbers. Oxford university press, 1979.
Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Symposium on Theory of Computing, 2018.
Adam Kalai, Ankur Moitra, and Gregory Valiant. Disentangling Gaussians. Communications of the ACM, 55(2):113–120, 2012.
Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two gaussians. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 553–562. ACM, 2010.
Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035–1046. ACM, 2018.
I. Krasikov and Y. Roditty. On a reconstruction problem for sequences. Journal of Combinatorial Theory, Series A, 1997.
Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace recon- struction: Generalized and parameterized. In 27th Annual European Symposium on Algorithms, ESA 2019, September 9-11, 2019, Munich/Garching, Germany., pages 68:1–68:25, 2019.
Ankur Moitra. Algorithmic aspects of machine learning. Cambridge University Press, 2018.
Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In Foundations of Computer Science, 2010.
Fedor Nazarov and Yuval Peres. Trace reconstruction with samples. In Symposium on Theory of Computing, 2017.
Alex D. Scott. Reconstructing sequences. Discrete Mathematics, 1997.
Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal- sample estimators for spherical gaussian mixtures. In Advances in Neural Information Processing Systems, pages 1395–1403, 2014.
D Michael Titterington, Adrian FM Smith, and Udi E Makov. Statistical analysis of finite mixture distributions. Wiley, 1985.
Eric W. Weisstein. Geometric distribution. From MathWorld–A Wolfram Web Resource, 2019. URL http://mathworld.wolfram.com/GeometricDistribution.html.
Additional calculations for Lemma 4. We consider each distribution in turn:
• Gaussian: Observe that is precisely the characteristic function. Clearly we have
and further
• Poisson: If the second claim follows. For the first:
• Negative Binomial: Let
4.1. Additional calculations for Theorem 7.
• Gaussian: The characteristic function of a Gaussian
• Chi-Squared: Let . Then, for
Lemma 4,
where we have used the pdf of chi-squared distribution and the tail bounds for chi-squared. Now using Lemma 5, and taking
• Negative-Binomial: Let . Then, for
Lemma 4, taking
where Using Lemma 6, with
4.2. Proof of Theorem 17
Let a be the characteristic vector of a subset on this set and let s =
. We need to prove a is uniquely determined by s. Let us define
Claim 1 For a prime number
Proof
Recall that Fermat’s theorem (Hardy et al. (1979)) says that for any prime p and any number we must have that
. Hence, for a prime number p and some number
Since the value of is at most
, we can obtain the value of
exactly if p is chosen to be greater than
. Now, let us denote the vector
th entry is
Therefore, consider two different subsets and assume that their characteristic vectors are a and b respectively. Therefore, if a and b both give rise to the same value of
Hence, if the set of vectors
spans , then it must imply that a = b and our proof will be complete. Consider a subset
defined by
Now, there are two possible cases. First, let us assume that the vectors in T are not all linearly independent in . In that case, we must have a set of tuples
that
where . Now, by the Chinese Remainder Theorem, we can find an integer r such that
. Define an infinite dimensional vector
th entry is
Since, , it is evident that
be the smallest number such that
and s > n because of our assumption in Eq. 1. Now consider the vector
Now, . Hence, the set
are in the span of S and also span
For the second case, let us assume that the vectors in T are linearly independent. We require the size of T > n so that the vectors in
. From the prime number theorem we know that
and hence we simply need that
Therefore, is sufficient.
4.3. Algebraic method for Geometric distribution
We will denote the Geometric distribution with success parameter 0 < p < 1 as Geo(p) and it has the following form: for a random variable X distributed according to Geo(p), Pr(X = x) =
Theorem 18 (Learning mixtures of Geometric Distribution) Let uniform mixture of k Geometric distributions, with unknown probabilities
Then, the first moments suffice to learn the parameters
and there exists an algorithm that, when given
samples from M, exactly identifies the parameters
probability.
Computing the moments. We compute the th moment in the natural way again. Let
Lemma 19 (Restating Lemma 12) last inequality assumes the all the moments of X are non-negative.
The following corollary, tailors the above lemma for a mixture of geometric distributions.
Corollary 20 If
Proof Given a random variable , we will show that
for all integer valued
. It is known that (Weisstein, 2019)
where Liis the polylogarithmic function of order
and argument z, defined explicitly as
withbeing the Eulerian numbers (see below). Hence, it can be observed that
is a polyno-
mial in and substituting it, we get that
Putting everything together and by appealing to Lemma 12, we get the statement of the corollary.
For the geometric distribution,
where f is a degree polynomial with integer coefficients. If
is an integer multiple of
then this implies
is integral and therefore any mixture with a different
th moment must differ by at least
. Hence, learning the
th moment up to
implies learning the moment exactly.
Lemma 21 samples are sufficient to exactly learn the first
moments of a
uniform mixture of k Geometric distributions with probability at least 7/8 where each
Proof Let . From Corollary 20 and the preceding discussion, learning the
exactly with failure probability
samples. And hence, we can compute all th moments exactly for
samples with failure probability
How many moments needed to determine the parameters? It remains to show the first moments suffice to determine the
values in the mixture
. To do this suppose there exists another mixture
and we will argue that
implies . To argue this, define integers
Hence, if the first T moments match, . But, again Theorem 16 establishes that if
then this implies A = B.
Alternative Technique. In the previous analysis the parameters of the geometric distribution (’s) had to belong to the set
. The reason we had to choose this set is because the moments were polynomials in inverse of the parameters (
’s). However it is also pos- sible to obtain a sample complexity bound when the parameters belong to the set
This can be done by estimating the probability mass function of the mixture at the discrete points {0, 1, 2, . . . }. We have the following theorem in this case.
Recall that for a random variable distributed according to the mixture of geometric distributions, we have
and more generally,
which is a polynomial in degree k + 1. Now, for the mixture , we need to argue that estimating the probabilities
is sufficient to recover the parameters
. Again, suppose there exists another mixture
and we will argue that this implies . As before, define integers
and it can be shown after some algebraic manipulations that
Notice that trivially because both of them contain k components. Again, Theorem 16 establishes that if
then this implies A = B.
Computing the probabilities. Suppose are i.i.d. with
Let us denote
as the empirical probability that we calculate as,
It is obvious that . Now, using Chernoff bound, we have
Again, recall that
where is a polynomial of degree
with integer coefficients. If
is an integer multiple of
then this implies
is integral and therefore any mixture with a different
th moment has a
moment that differs by at least
. Hence, learning the
th moment up to
implies learning the moment exactly. We will use
number of samples and we will show it will be sufficient to succeed with a probability of at least
. We will estimate the probabilities as mentioned above and therefore the failure probability can be calculated by using the Chernoff Bound and a union bound over
probabilities to be estimated. Therefore the probability of failure is bounded above by,
and hence the proof is complete.
4.4. Proof of Lemma 14
We will prove that for , the leading term of
. Since for
, this implies that
is a polynomial of degree exactly
. We will prove this by induction. Since
, we know that EX = np. This verifies the base case. Now, in the induction step, let us assume that the leading term of
. It is known that (see Belkin and Sinha (2010))
Therefore it follows that the leading term of
This proves the induction step and the lemma.