Minimax Optimal Estimation of KL Divergence for Continuous Distributions

2020·Arxiv

Abstract

Abstract

Estimating Kullback-Leibler divergence from identical and independently distributed samples is an important problem in various domains. One simple and effective estimator is based on the k nearest neighbor distances between these samples. In this paper, we analyze the convergence rates of the bias and variance of this estimator. Furthermore, we derive a lower bound of the minimax mean square error and show that kNN method is asymptotically rate optimal.

I. INTRODUCTION

Kullback-Leibler (KL) divergence has a broad range of applications in information theory, statistics and machine learning. For example, KL divergence can be used in hypothesis testing [1], text classi-fication [2], outlying sequence detection [3], multimedia classification [4], speech recognition [5], etc. In many applications, we hope to know the value of KL divergence, but the distributions are unknown. Therefore, it is important to estimate KL divergence based only on some identical and independently distributed (i.i.d) samples. Such problem has been widely studied [6–13]. The estimation method is different depending on whether the underlying distribution is discrete or continuous. For discrete distributions, an intuitive method is called plug-in estimator, which first estimates the probability mass function (PMF) by simply counting the number of occurrences at each possible value and then calculates the KL divergence based on the estimated PMF. However, since it is always possible that the number of occurrences at some locations is zero, this method has infinite bias and variance for arbitrarily large sample size. As a result, it is necessary to design some new estimators, such that both the bias and variance converge to zero. Several methods have been proposed in [11–13]. These methods perform well for distributions with fixed alphabet size. Recently, there is a growing interest in designing estimators that are suitable for distributions with growing alphabet size. [6] provided an ‘augmented plug-in estimator’, which is a modification of the simple plug-in method. The basic idea of this method is to add a term to both the numerator and the denominator when calculating the ratio of the probability mass. Although this modification will introduce some additional bias, the overall bias is reduced. Moreover, a minimax lower bound has also been derived in [6], which shows that the augmented plug-in estimator proposed in [6] is rate optimal. For continuous distributions, there are also many interesting methods. A simple one is to divide the support into many bins, so that continuous values can be quantized, and then the distribution can be

Puning Zhao and Lifeng Lai are with Department of Electrical and Computer Engineering, University of California, Davis, CA, 95616. Email: {pnzhao,lflai}@ucdavis.edu. This work was supported by the National Science Foundation under grants CCF-17-17943, ECCS-17-11468, CNS-18-24553 and CCF-19-08258.

converted to a discrete one. As a result, the KL divergence can be estimated based on these two discrete distributions. However, compared with other methods, this method is usually inefficient, especially when the distributions have heavy tails, as the probability mass of a bin at the tail of distributions is hard to estimate. An improvement was proposed in [7], which is based on data dependent partitions on the densities with an appropriate bias correction technique. Comparing with the direct partition method mentioned above, this adaptive one constructs more bins at the regions with higher density, and vice versa, to ensure that the probability mass in each bins are approximately equal. It is shown in [7] that this method is strongly consistent. Another estimator was designed in [14], which uses a kernel based approach to estimate the density ratio. There are also some previous works that focus on a more general problem of estimating f-divergence, with KL divergence being a special case. For example, [15] constructed an estimator based on a weighted ensemble of plug in estimators, and the parameters need to be tuned properly to get a good bias and variance tradeoff. Another method of estimating f-divergence in general was proposed in [10], under certain structural assumptions.

Among all the methods for the estimation of KL divergence between two continuous distributions, a simple and effective one is k nearest neighbor (kNN) method based estimator. kNN method, which was first proposed in [16], is a powerful tool for nonparametric statistics. Kozachenko and Leonenko [17] designed a kNN based method for the estimation of differential entropy, which is convenient to use and does not require too much parameter tuning. Both theoretical analysis and numerical experiments show that this method has desirable accuracy [18–24]. In particular, [23] shows that this estimator is nearly minimax rate optimal under some assumptions. The estimation of KL divergence shares some similarity with that of entropy estimation, since KL divergence between f and g, which denotes the probability density functions (pdf) of two distributions, is actually the difference of the entropy of f and the cross entropy between f and g. As a result, the idea of Kozachenko-Leonenko entropy estimator can be used to construct a kNN based estimator for KL divergence, which was first proposed in [8]. The basic idea of this estimator [8] is to obtain an approximate value of the ratio between f and g based on the ratio of kNN distances. It has been discussed in [8] that, compared with other KL divergence estimators, the kNN based estimator has a much lower sample complexity, and is easier to generalize and implement for high dimensional data. Moreover, it was proved in [8] that the kNN based estimator is consistent, which means that both the bias and the variance converge to zero as sample sizes increase. However, the convergence rate remains unknown.

In this paper, we make the following two contributions. Our first main contribution is the analysis of the convergence rates of bias and variance of the kNN based KL divergence estimator proposed in [8]. For the bias, we discuss two significantly different types of distributions separately. In the first type of distributions analyzed, both f and g have bounded support, and are bounded away from zero. One such example is when both distributions are uniform distributions. This implies that the distribution has boundaries, where the pdf suddenly changes. There are two main sources of estimation bias of kNN method for this case. The first source is the boundary effect, as the kNN method tends to underestimate the pdf values at the region near the boundary. The second source is the local non-uniformity of the pdf. It can be shown that the bias caused by the second source converges fast enough and thus can be negligible. As a result, the boundary bias is the main cause of bias of the kNN based KL divergence estimator for the first type of distributions considered. In the second type of distributions analyzed, we assume that both f and g are continuous everywhere. For example, a pair of two Gaussian distributions with different mean or variance belong to this case. For this type of distributions, the boundary effect does not exist. However, as the density values can be arbitrarily close to zero, we need to consider the bias caused by the tail region, in which f or g is too low and thus kNN distances are too large for us to obtain an accurate estimation of the density ratio f/g. For the variance of this estimator, we bound the convergence rate under a unified assumption, which holds for both two cases discussed above. The convergence rate of the mean square error can then be obtained based on that of the bias and variance. In this paper, we assume that k is fixed. We will show that with fixed k, the convergence rate of the mean square error over the sample sizes is already minimax optimal.

Our second main contribution is to derive a minimax lower bound of the mean square error of KL divergence estimation, which characterizes the theoretical limit of the convergence rates of any methods. For discrete distributions, the minimax lower bound has already been derived in [25] and [6]. However, for continuous distributions, the minimax lower bound has not been established. In fact, there exists no estimators that are uniformly consistent for all continuous distributions. For example, let is the indicator function, and g is uniform in [0, 1]. Then the estimation error of KL divergence between f and g equals the estimation error of the entropy of can be arbitrarily large, according to the lower bound derived in [26], there exists no uniformly consistent estimator. As a result, to find a minimax lower bound, it is necessary to impose some restrictions on the distributions. In this paper, we analyze the minimax lower bound for two cases that match our assumptions for deriving the upper bound, i.e. distributions with bounded support and densities bounded away from zero, and distributions that are smooth everywhere and densities can be arbitrarily close to zero. For each case, we show that the minimax lower bound nearly matches our upper bound using kNN method. This result indicates that the kNN based KL divergence estimator is nearly minimax optimal. To the best of our knowledge, our work is the first attempt to analyze the convergence rate of KL divergence estimator based on kNN method, and prove its minimax optimality.

The remainder of this paper is organized as follows. In Section II, we provide the problem statements. In Sections III and IV, we characterize the convergence rates of the bias and variance of the kNN based KL divergence estimator respectively. In Section V, we show the minimax lower bound. We then provide numerical examples in Section VI, and concluding remarks in Section VII.

II. PROBLEM STATEMENT

Consider two pdfs . The KL divergence between f and g is defined as

f and g are unknown. However, we are given a set of samples drawn i.i.d from pdf f, and another set of samples drawn i.i.d from pdf g. The goal is to estimate D(f||g) based on these samples.

[8] proposed a kNN based estimator:

in which is the distance between -th nearest neighbor in while is the distance between -th nearest neighbor in is the dimension. The distance between any two points u, v is defined as can be an arbitrary norm. The basic idea of this estimator is using kNN method to estimate the density ratio. An estimation of

in which V (S) is the volume of set S. (3) can be understood as follows. Apart from , there are another samples from , among which k points fall in . Therefore, is an estimate of is the probability mass with respect to the distribution with pdf f. As the distribution is continuous, we have We can then use (3) to estimate . Similarly, as there are generated from g, we can obtain an estimate

As

by replacing with (3) and (4) respectively, we can get the expression of the KL divergence estimator in (2).

[8] has proved that this estimator is consistent, but the convergence rate remains unknown. In this paper, we analyze the convergence rates of the bias and variance of this estimator, and derive the minimax lower bound.

III. BIAS ANALYSIS

In this section, we derive convergence rate of the bias of the estimator (2). We will consider two different cases depending on whether the support is bounded or not, as they have different sources of biases.

A. The Case with Bounded Support

We first discuss the case in which the distributions have bounded support and the densities are bounded away from zero. The main source of bias of this case is boundary effects. The analysis is based on the following assumptions:

Assumption 1. Assume the following conditions:

aV (B(x, r)), and for all denotes the volume of a set;

Assumption (a) is necessary to ensure that the definition of KL divergence in (1) is valid. (b) bounds both the lower and upper bound of the pdf value. (c) restricts the surface area of the supports of f and g. Since the kNN divergence estimator tends to cause significant bias at the region near to the boundary, the estimation bias for distributions with irregular supports with large surface area are usually large. (d) requires the boundedness of the support. The case with unbounded support will be considered in Section III-B. (e) ensures that the angles at the corners of the support sets have a lower bound, so that there will not be significant bias at the corner region. (f) ensures the smoothness of distribution in the support set. Note that (3) and (4) actually estimate the average density f and g over the ball and are smooth, then the average values will not deviate too much from the pdf value at the center of the balls, i.e.

Based on the above assumptions, we have the following theorem regarding the bias of estimator (2).

Theorem 1. Under Assumption 1, the convergence rate of the bias of kNN based KL divergence estimator is bounded by:

Proof. (Outline) Considering that

in which h denotes the differential entropy, we decompose the KL divergence estimator to an estimator of the differential entropy of f, as well as an estimator of the cross entropy between f and g. We then bound the bias of these two estimators. In particular, we can write

with

in which is the digamma function, being the Gamma function. Due to the property of Gamma distribution, we know that decays sufficiently fast and can be negligible for large sample sizes N and M.

has the same form as the bias of Kozachenko-Leonenko entropy estimator [17], which has been analyzed in many previous literatures [19, 21–23, 27]. With some modifications, the proofs related to the entropy estimator can also be used to bound , which is actually the bias of a cross entropy estimator. However, as the assumptions are different from the assumptions made in previous literatures, we need to derive (6) in a different way.

In our proof, for both the entropy estimator and the cross entropy estimator, we divide the support into two parts, the central region and the boundary region. In the central region, will be within will be within with high probability. Since f and g are smooth, the expected estimate are very close to the truth, and thus will not cause significant bias. The main bias comes from the boundary region, in which the density estimator are no longer accurate, as

exceeds the supports . We bound the boundary bias by letting the boundary region to shrink with a proper speed. The detailed proof is shown in Appendix A.

B. The Case with Smooth Distributions

We now consider the second case where the density is smooth everywhere and the density can be arbitrarily close to zero. For this case, the main source of bias is tail effects. We make the following assumptions:

Assumption (a) ensures that the definition of KL divergence in (1) is valid. (b) is the tail assumption. A lower indicates a stronger tail, and thus the convergence of bias of the KL divergence estimator will be slower. For example, for Gaussian distribution and for Cauchy distribution. (c) is the smoothness assumption. (d) is an additional tail assumption, which is actually very weak and holds for almost all of the common distributions, since s can be arbitrarily small. However, this assumption is important since it prevents very large . Based on the above assumptions, we have the following theorem regarding the bias of estimator (2).

Theorem 2. Under Assumption 2, the convergence rate of the bias of kNN based KL divergence estimator is bounded by:

Proof. (Outline) Similar to the proof of Theorem 1, we still decompose the KL divergence estimator to two estimators that estimate the entropy of f and the cross entropy between f and g, separately. In particular, we can still decompose the bias using (8). For simplicity, we only provide the convergence bound of , which is the error of the cross entropy estimator. The bound of the entropy estimator holds similarly.

For the cross entropy estimator, we divide the support into two parts, including a central region in which f or g is relatively high, and a tail region is relatively low. According to the results of order statistics [27, 28], probability mass of S with respect to the distribution with pdf g. Therefore, can be bounded by

We bound two terms in (11) separately. To derive the bound of bias in , we find a high probability upper bound of , denoted as . The bound of bias can be obtained by bounding the local non-uniformity of . On the contrary, if , we use assumption (d) to ensure that will not be too large, and thus will not cause significant estimation error. We let to decay with M at a proper speed, to maximize the overall convergence rate of the bias.

To bound the bias in , we let the threshold between to decay with sample size M, so that the probability mass of also decreases with M. We then combine the bound of adjust the rate of the decay of the threshold between

The detailed proof can be found in Appendix B.

IV. VARIANCE ANALYSIS

We now discuss the variance of this divergence estimator, based on the following unifying assump- tions.

Assumption 3. Assume that the following conditions hold:

in which

Assumption 3 (a)-(c) are satisfied if either Assumption 1 or Assumption 2 is satisfied. (a) only requires that the pdf is continuous almost everywhere, and thus holds not only for distributions that are smooth everywhere, but also for distributions that have boundaries. (b) is obviously satisfied under Assumption 1, since it requires that the densities are both upper and lower bounded. From Assumption 2, it is also straightforward to show that property combining with the smoothness condition (Assumption 2 (c)) imply that (15) holds for sufficiently small . (c) is the same as Assumption 2 (d) and weaker than Assumption 1 (d). Therefore, (a)-(c) are weaker than both previous assumptions on the analysis of bias. (d) is a new assumption which restricts the density ratio. This is important since if the density ratio can be too large, which means that there exists a region on which there are too many samples from fewer samples from will be large and unstable for too many Therefore we use assumption (d) to bound the density ratio.

Under these assumptions, the variance of the divergence estimator can be bounded using the following theorem.

Theorem 3. Under Assumption 3, if , then the convergence rate of the variance of estimator (2) can be bounded by:

Proof. (Outline) From (2), we have

Our proof uses some techniques from [27], which proved the O(1/N) convergence of variance of Kozachenko-Leonenko entropy estimator with k = 1 for one dimensional distributions, and [23], which generalizes the result to arbitrary fixed dimension and k, without restrictions on the boundedness of the support. The basic idea is that if one sample is replaced by another i.i.d sample, then it can be shown that the k-NN distance will change only for a tiny fraction of the samples.

The first term in (18) is just the variance of Kozachenko-Leonenko entropy estimator. Therefore we can use similar proof procedure as was already used in the proof of Theorem 2 in [23]. [23] analyzed a truncated Kozachenko-Leonenko entropy estimator, which means that is truncated by an upper bound . We prove the same convergence bound for the estimator without truncation.

For the second term in (18), the analysis becomes much harder, since the k-NN distance may change for much more samples from , instead of only a tiny fraction of samples. For this term, we design a new method to obtain the high probability bound of the deviation of from its mean. The basic idea of our new methods can be briefly stated as following: Define two sets is a subset of such that for any is among the k nearest neighbors of . Similarly, define to be a set such that for all among the k nearest neighbors of x. If we replace , the kNN distance of will only change if . With this observation, we give a high probability bound of the number of samples from that are in respectively, and then bound the maximum difference of the estimated result caused by replacing . Based on this bound, we can then bound the second term in (18) using Efron-Stein inequality.

The detailed proof can be found in Appendix C.

In the analysis above, we have derived the convergence rate of bias and variance. With these results, we can then bound the mean square error of kNN based KL divergence estimator. For distributions that satisfy Assumptions 1 and 3, the mean square error can be bounded by

For distributions that satisfy Assumptions 2 and 3, the corresponding bound is

V. MINIMAX ANALYSIS

In this section, we derive the minimax lower bound of the mean square error of KL divergence estimation, which holds for all methods (not necessarily kNN based) that do not have the knowledge of the distributions f and g. The minimax analysis also considers two cases, i.e. the distributions whose densities are bounded away from zero, and those who has approaching zero densities. For the first case, the following theorem holds.

Theorem 4. Define as set of pairs (f, g) that satisfies Assumptions 1 and 3, and

in which is the estimation of KL divergence using N samples drawn from distribution with pdf f and M samples from g. Then for sufficiently large and sufficiently small

Proof. (Outline) The minimax lower bound of functional estimation can be bounded using Le Cam’s method [29]. For the proof of Theorem 4, we use some techniques from [26], which derived the minimax bound of entropy estimation for discrete distributions. The main idea is to construct a subset of distributions that satisfy Assumptions 1 and 3, and then conduct Poisson sampling. These operations can help us calculate the distance between two distributions in a more convenient way, which is important for using Le Cam’s method. Details of the proof can be found in Appendix D.

for arbitrarily small

We remark that in Theorem 4, the support set are unknown. If we assume that are known, then with some boundary correction methods, such as the mirror reflection method proposed in [30], the convergence rate can be faster than that in (22). However, in Theorem 4, instead of using fixed support sets, contains distributions with a broad range of different support sets. These support sets are only restricted by Assumption 1 (c) and (d), which require that the surface area of all the elements in are bounded by , and the diameters are bounded by R. As a result, the minimax convergence rate becomes slower. This result indicates the inherent difficulty caused by the boundary effect for distributions with densities bounded away from zero.

For the second case, the corresponding result is shown in Theorem 5.

Theorem 5. Define as set of pairs (f, g) that satisfies Assumptions 2 and 3, and

then for sufficiently large

Proof. (Outline) The minimax convergence rate of differential entropy estimation under similar assumptions was derived in [23]. We can extend the analysis to the minimax convergence rate of cross entropy estimation between f and g. Combine the bound for entropy and cross entropy, we can then obtain the minimax lower bound of the mean square error of KL divergence estimation. The detailed proof is shown in Appendix E.

Comparing (23) with (19), as well as (25) with (20), we observe that the convergence rate of the upper bound of mean square error of kNN based KL divergence estimator nearly matches the minimax lower bound for both cases. These results indicate that the kNN method with fixed k is nearly minimax rate optimal.

VI. NUMERICAL EXAMPLES

In this section, we provide numerical experiments to illustrate the theoretical results in this paper. In the simulation, we plot the curve of the estimated bias and variance over sample sizes. For illustration simplicity, we assume that the sample sizes for two distributions are equal, i.e. M = N. For each sample size, the bias and variance are estimated by repeating the simulation T times, and then calculate the sample mean and the sample variance of all these trials. For low dimensional distributions, the bias is relatively small, therefore it is necessary to conduct more trials comparing with high dimensional distributions. In the following experiments, we repeat T = 100, 000 times if d = 1, and 10, 000 times if d > 1. In all of the figures, we use log-log plots with base 10. In all of the trials, we fix k = 3. Figure 1 shows the convergence rate of kNN based KL divergence estimator for two uniform distributions with different support. This case is an example that satisfies Assumption 1. In Figure 2, f and g are two Gaussian distributions with different mean but equal variance. In Figure 3, f and g are two Gaussian distributions with the same mean but different variance. These two cases are examples that satisfy Assumption 2. For all of these distributions above, we compare the empirical convergence rates of the bias and variance with the theoretical prediction. The empirical convergence rates are calculated by finding the negative slope of the curves in these figures by linear regression, while the theoretical ones come from Theorems 1, 2 and 3 respectively. The results are shown in Table I. For the convenience of expression, we say that the theoretical convergence rate of bias or variance is , if it decays with either or for arbitrarily small , given the condition M = N.

TABLE I: Theoretical and empirical convergence rate comparison

In Table I, we observe that for the distribution used in Figure 1, the empirical convergence rates of both bias and variance agree well with the theoretical prediction, in which the theoretical bound of bias comes from Theorem 1, while the variance comes from Theorem 3.

Fig. 1: Convergence of bias and variance of kNN based KL divergence estimator for two uniform distributions with different support sets.

Fig. 2: Convergence of bias and variance of kNN based KL divergence estimator for two Gaussian distributions with different means. f is the pdf of is the pdf of denotes d dimensional identity matrix, and 1 = (1, . . . , 1).

For the distribution in Figure 2, the empirical convergence of bias matches the theoretical prediction from Theorem 2. For Gaussian distributions with different mean, it can be shown that for any there exists a constant such that Assumption 2 (b) holds. Therefore, according to Theorem 2, the convergence rate of bias is for arbitrarily small . Therefore, in the second line of Table I, the theoretical rate of bias is 0.67, 0.50 and 0.40, respectively. Now we discuss the convergence rate of variance. Note that the theoretical result about the variance is unknown, since f/g can reach infinity, thus Assumption 3 (d) is not satisfied, and Theorem 3 does not hold here. We observe that the empirical convergence rate is slower than that in other cases. Such a result may indicate that it is

Fig. 3: Convergence of bias and variance of kNN based KL divergence estimator for two Gaussian distributions with different variances. f is the pdf of is the pdf of

harder to estimate the KL divergence if the density ratio is unbounded.

For the distribution in Figure 3, the empirical and theoretical convergence rate of the variance matches well, while the empirical rate of bias is faster than the theoretical prediction. Note that the bound we have derived holds universally for all distributions that satisfy the assumptions. For certain specific distribution, the convergence rate can probably be faster. In particular, there is an uniform bound on the Hessian of f and g in Assumption 2 (c). However, for Gaussian distributions, the Hessian is lower where the pdf value is small. Therefore, the local non-uniformity is not as serious as the worst case that satisfies the assumptions.

VII. CONCLUSION

In this paper, we have analyzed the convergence rates of the bias and variance of the kNN based KL divergence estimator proposed in [8]. For the bias, we have discussed two types of distributions depending on the main causes of the bias. In the first case, the distribution has bounded support, and the pdf is bounded away from zero. In the second case, the distribution is smooth everywhere and the pdf can approach zero arbitrarily close. For the variance, we have derived the convergence rate under a more general assumption. Furthermore, we have derived the minimax lower bound of KL divergence estimation. The bound holds for all possible estimators. We have shown that for both types of distributions, the kNN based KL divergence estimator is nearly minimax rate optimal. We have also used numerical experiments to illustrate that the practical performances of kNN based KL divergence estimator are consistent with our theoretical analysis.

in which

and is the volume of unit ball. Here, we omit are the same for all i.

In the following, we provide details on how to bound can then be bounded using similar method.

To begin with, we denote as the probability mass of We have the following lemma.

Lemma 1. There exists a constant , such that, if

Proof.

in which the first inequality uses Assumption 1 (f).

From order statistics [28], , therefore

Define

in which . From (31), we observe that the bias is determined by the difference between the average pdf in and the pdf at its center the region that is relatively far from the boundary. For all , with high probability, In this case, the bias is caused by the non-uniformity of density. With the increase of sample size, the effect of such non-uniformity will converge to zero. is the region near to the boundary, in which the probability that is not negligible, hence can deviate significantly comparing with . Therefore, the bias in this region will not converge to zero. However, we let the size of converge to zero, so that the overall bound of the bias converges.

In step (a), we use Lemma 1, Assumption 1 (b) and Assumption 1 (e). In step (b), the first term uses the fact that for sufficiently large will be sufficiently small, hence . The second term of step (b) comes from the Chernoff bound, which indicates that for all and sufficiently large M,

In this equation, is the volume of , and we use the fact that according to the definition of and Assumption 1 (c). Based on (34) and (36),

Similarly, we have , and according to the definition of digamma function . Therefore

In this section, we derive the bound of the bias for distributions that satisfy Assumption 2. These

distributions are smooth everywhere and the densities can approach zero. We begin with the following lemmas, whose proofs can be found in Appendix B-A, B-B, and B-C, respectively.

Lemma 2. There exist constants

Lemma 3. There exists a constant , such that

for sufficiently small t, in which X follows a distribution with pdf f.

Similar to the proof of Theorem 1, we decompose the bias as Then

Divide into two parts.

in which will be determined later. is the constant in Lemma 1.

in which (a) comes from Lemma 1. For (b), note that according to (41), . (c) uses Lemma 4. For , note that according to Lemma 1,

Based on this fact, if , we show the following two lemmas:

Lemma 5. There exists a constant , such that

Proof. Please see Appendix B-D for detailed proof.

Lemma 6. There exists a constant , such that

Proof. Please see Appendix B-E for detailed proof.

Note that

Therefore

From order statistics [28], . According to Assumption 2 (b), the first three terms in (51) can be bounded by:

and

The last term in (51) can be bounded using the following lemma, whose proof can be found in Appendix B-F.

Lemma 7. There exist two constants , such that for sufficiently large M,

Using this lemma, we have

Combining (43), (50) and (57), we get

Since the above bound holds for arbitrary , we just let

Similarly, we have , and according to the definition of digamma function,

A. Proof of Lemma 2 We only show that there exists a constant holds for all x. The proof of the upper bound of density f will be exactly the same. From Lemma 1,

Since

for all as the right hand side of (62) given

then

B. Proof of Lemma 3 From H¨older inequality, For any p, q such that p > 1, q > 1, and 1/p + 1/q = 1,

From Assumption 2 (b),

Using Stirling’s formula

which holds for all p > 1. For sufficiently small t, let p = ln(1/t), then the right hand side of (67) becomes

C. Proof of Lemma 4

in which we used (41) and Lemma 1. Hence, according to (44) and Chernoff inequality,

Moreover, define

The proof is complete.

E. Proof of Lemma 6

as the kNN distance of . Then for sufficiently large

Denote as the number of samples from that are in S. Then for any given x, and

Let

It can be checked that , therefore

In (a), we use (70) and the definition of , which implies that . (b) uses the fact that

It remains to bound

In Lemma 5, we have shown that . For the second term,

Let T = (1/s) ln M, then

Hence

F. Proof of Lemma 7

In (a), we use Lemma 2. (b) uses Chernoff bound. Moreover, let then

Combining (81) and (82), the proof is complete.

We bound separately.

Bound of is the variance of Kozachenko-Leonenko entropy estimator [17], which estimates . Here we use similar proof procedure as was already used in the proof of Theorem 2 in our recent work [23]. [23] has analyzed a truncated KL entropy estimator, which means that is truncated by an upper bound . The variance of this estimator is actually equal to . It was shown in [23] that if . In this section, we prove the same convergence bound for the estimator without truncation, i.e. be a sample that is i.i.d with . Recall that -th nearest neighbor distance of . If we replace , then the kNN distances will change. Denote -th nearest neighbor distance based on . Then use Efron- Stein inequality [29],

Define . Moreover, define neighbor distances based on . Follow the steps in Appendix C of [23], we have

in which is a constant that depends on dimension d and the norm we use. For example, if we use norm, then is the minimum number of cones with angle that cover

Now we bound Note that we truncate the estimator for the convenience of analysis, although we are now analyzing an estimator without truncation. The deviation caused by such truncation will be bounded later. In the following proof, we omit the index for convenience. can be bounded by

in which is the probability mass of S under a distribution with pdf According to Assumption 3 (b), . Moreover, Lemma 6 and Lemma 7 in [23] have shown that

and

It remains to show that

For sufficiently large . From Assumption 3 (b), for sufficiently small t,

in which we use small o notation, since for any variable

In (a), we use the definition of in (16) for the first term, and use Chernoff inequality for the second term. (b) holds because . Then we can get (91) using (90).

Moreover, we can show the following Lemma:

Lemma 8.

Proof. Please see Appendix C-A.

Based on (89), (91) and Lemma 8, . Therefore (86) becomes

Similar results hold for . Hence (85) becomes

Bound of be a sample that is i.i.d with -th nearest neighbor distance of be a sample that is i.i.d with , and define -th nearest neighbor distance of Then from Efron-Stein inequality,

To bound the right hand side of (95), we first make the following definitions:

Definition 1. Define two sets

Definition 2. Define three events:

in which C is the constant in Assumption 3 (d). Here, for any set number of points from that are in is the same constant used in (85).

We also denote The following lemma shows that all of these three events happen with low probability.

Lemma 9. The probabilities of are bounded by:

Proof. Please see Appendix C-B.

These three bounds show that P. Moreover, we show the following lemma:

Lemma 10. There exists a constant such that for sufficiently large M we have

Proof. Please see Appendix C-C.

Based on Lemma 9 and Lemma 10,

If E does not happen, then are all upper bounded by and are all upper bounded by . Besides, from (99), they are both lower bounded by . There are at most points such that

Combining (105) and (106), we have

Then can be bounded by:

Similar to the analysis from (86) to (93), we can show that the limit of can also be bounded by the right hand side of (93). Therefore

and

Then

Define

It can be shown that P

dominated convergence theorem,

Similar bound holds for , and using the union bound, we get (101).

Proof of (102). Since g is bounded by , then for sufficiently large M,

Hence from Chernoff inequality,

Then (102) can be obtained by calculating the union bound. Proof of (103). We first prove (103) under the condition that we are using norm first. We will then generalize the result to the case with arbitrary norm. Define

then S

Recall that is defined as the minimum number of cones with angle that can cover we pick any , and divide cones with angle , such that y is the vertex of all the cones. These cones are named as

Define as the number of points from that are in S. Moreover, define

Then from Chernoff inequality,

and

This result indicates that with probability at least , there are at least k points in

, there are already at least among as the angle between vector

which indicates that are all closer to x than y, therefore y can not be one of the k nearest neighbors of . Recall that x is arbitrarily picked outside . Therefore with probability at least

Using Assumption 3 (d), hold, then

Using Chernoff inequality again,

Therefore

The proof is complete.

and

then

In (a), we use

according to (130),

Thus

in which the last step uses (66).

In this section, we show the minimax convergence rate of KL divergence estimator for distributions

with bounded support and densities bounded away from zero. The proof can be divided into proving the following three bounds separately:

and g(x) = 1. Then

Therefore, for sufficiently small . Moreover,

By Taylor expansion, it can be shown that thus

Therefore, from Le Cam’s lemma [29],

Let

Similarly, let

for . Then it can be shown that

The proof has similar idea with [26] and [23]. To begin with, define

in which is the unit ball volume, thusare two constants. decrease with N, while m increases with selected such that . It can be checked that both f and g integrate to 1. The condition is designed such that the density in the support is bounded away from zero, i.e. if . Moreover, the surface area of the support is is the surface area of unit ball, and the condition , the surface area of the supports of f and g are both upper bounded by . Therefore, for sufficiently large and sufficiently small Define

Recall that is defined as the minimax mean square error over

Define

in which , Poi is the Poisson distribution. Then we have the following lemma:

Lemma 11.

Proof. Please refer to Appendix D-A for details.

Furthermore, define

Comparing with the definition of in (148), the only difference is that we now allow to deviate slightly from . As a result, f is not necessarily a pdf, since it is not normalized. However, we extend the definition of KL divergence here. Define

in which . Then the number of samples falling on any two disjoint intervals are mutually independent. can be lower bounded by with the following lemma:

Lemma 12. If

Proof. Please refer to Appendix D-B for details.

With Lemma 11 and Lemma 12, the problem of bounding can be converted to bounding . We then show the following lemma, which is slightly modified from Lemma 11 in [23].

Let

then

in which is the differential entropy of

Proof. The proof is exactly the same as the proof of Lemma 11 in [23]. Condition (1) is different from the corresponding condition in [23], but such difference does not affect the proof.

let

in which denotes the distribution that puts all the mass on u = 0. Now we assume Let are supported in , and condition (1) in Lemma 13 is satisfied. Then from Lemma 4 in [26],

and . In particular, are properly selected, according to eq.(34) in [26],

Let , then the above equation can be transformed to the following one:

i.e. there exist two constants that depend on , such that

Hence

To bound the total variation term in (159), we use the following lemma.

Lemma 14. ([26], Lemma 3) Let be random variables on

Substitute

in which the last step holds because

and from (148),

and

Then

Note that the second, third and fourth term in the bracket at the right hand side of (159) converge to zero. In particular, for the second term,

For the third term,

and it is straightforward to show that the fourth term also converges to zero. Therefore, from Lemma 13,

Pick . According to condition 1) in the statement of Lemma 13, this is possible if c is sufficiently small. Then

From Lemma 12, and note that from (158),

which converges sufficiently fast, thus can also be lower bounded with the right hand side of (179). From (150) and (152),

Define

Then for any

Define

then for sufficiently large and sufficiently low We use Poisson sampling again. Define

in which . Then we have the following lemma.

Lemma 15.

Proof. Please refer to Appendix D-C.

Define

and

in which . Then the following lemma lower bounds

Lemma 16. If

Proof. Please refer to Appendix D-D.

Now we bound with the following lemma.

Lemma 17. Let be two random variables that satisfy the following conditions: (1) (2)

Proof. Please refer to Appendix D-E.

Now we use eq.(34) in [26] again, which shows that there exist that have matching moments up to L-th order, such that

The remaining proof follows the proof of (137). take the same value as the equations from (170) to (174), and then we can get similar bound as (137), replacing N with M.

in which the inequality in the second step comes from Jensen’s inequality. Note that nonincreasing function of N, because if , one can always pick samples for the estimation, thus always holds. Therefore

Moreover, since , use Chernoff inequality, we get

Now it remains to bound . Note that we can always let the estimator be

hence

From the definition of in (148), for all

which is the same for all . In addition,

Hence,

From (193), (194), (195) and (201),

and

Then from (154),

Now we bound the second term.

According to (200),

in which (a) is obtained by maximizing