Kullback-Leibler (KL) divergence has a broad range of applications in information theory, statistics and machine learning. For example, KL divergence can be used in hypothesis testing [1], text classi-fication [2], outlying sequence detection [3], multimedia classification [4], speech recognition [5], etc. In many applications, we hope to know the value of KL divergence, but the distributions are unknown. Therefore, it is important to estimate KL divergence based only on some identical and independently distributed (i.i.d) samples. Such problem has been widely studied [6–13]. The estimation method is different depending on whether the underlying distribution is discrete or continuous. For discrete distributions, an intuitive method is called plug-in estimator, which first estimates the probability mass function (PMF) by simply counting the number of occurrences at each possible value and then calculates the KL divergence based on the estimated PMF. However, since it is always possible that the number of occurrences at some locations is zero, this method has infinite bias and variance for arbitrarily large sample size. As a result, it is necessary to design some new estimators, such that both the bias and variance converge to zero. Several methods have been proposed in [11–13]. These methods perform well for distributions with fixed alphabet size. Recently, there is a growing interest in designing estimators that are suitable for distributions with growing alphabet size. [6] provided an ‘augmented plug-in estimator’, which is a modification of the simple plug-in method. The basic idea of this method is to add a term to both the numerator and the denominator when calculating the ratio of the probability mass. Although this modification will introduce some additional bias, the overall bias is reduced. Moreover, a minimax lower bound has also been derived in [6], which shows that the augmented plug-in estimator proposed in [6] is rate optimal. For continuous distributions, there are also many interesting methods. A simple one is to divide the support into many bins, so that continuous values can be quantized, and then the distribution can be
Puning Zhao and Lifeng Lai are with Department of Electrical and Computer Engineering, University of California, Davis, CA, 95616. Email: {pnzhao,lflai}@ucdavis.edu. This work was supported by the National Science Foundation under grants CCF-17-17943, ECCS-17-11468, CNS-18-24553 and CCF-19-08258.
converted to a discrete one. As a result, the KL divergence can be estimated based on these two discrete distributions. However, compared with other methods, this method is usually inefficient, especially when the distributions have heavy tails, as the probability mass of a bin at the tail of distributions is hard to estimate. An improvement was proposed in [7], which is based on data dependent partitions on the densities with an appropriate bias correction technique. Comparing with the direct partition method mentioned above, this adaptive one constructs more bins at the regions with higher density, and vice versa, to ensure that the probability mass in each bins are approximately equal. It is shown in [7] that this method is strongly consistent. Another estimator was designed in [14], which uses a kernel based approach to estimate the density ratio. There are also some previous works that focus on a more general problem of estimating f-divergence, with KL divergence being a special case. For example, [15] constructed an estimator based on a weighted ensemble of plug in estimators, and the parameters need to be tuned properly to get a good bias and variance tradeoff. Another method of estimating f-divergence in general was proposed in [10], under certain structural assumptions.
Among all the methods for the estimation of KL divergence between two continuous distributions, a simple and effective one is k nearest neighbor (kNN) method based estimator. kNN method, which was first proposed in [16], is a powerful tool for nonparametric statistics. Kozachenko and Leonenko [17] designed a kNN based method for the estimation of differential entropy, which is convenient to use and does not require too much parameter tuning. Both theoretical analysis and numerical experiments show that this method has desirable accuracy [18–24]. In particular, [23] shows that this estimator is nearly minimax rate optimal under some assumptions. The estimation of KL divergence shares some similarity with that of entropy estimation, since KL divergence between f and g, which denotes the probability density functions (pdf) of two distributions, is actually the difference of the entropy of f and the cross entropy between f and g. As a result, the idea of Kozachenko-Leonenko entropy estimator can be used to construct a kNN based estimator for KL divergence, which was first proposed in [8]. The basic idea of this estimator [8] is to obtain an approximate value of the ratio between f and g based on the ratio of kNN distances. It has been discussed in [8] that, compared with other KL divergence estimators, the kNN based estimator has a much lower sample complexity, and is easier to generalize and implement for high dimensional data. Moreover, it was proved in [8] that the kNN based estimator is consistent, which means that both the bias and the variance converge to zero as sample sizes increase. However, the convergence rate remains unknown.
In this paper, we make the following two contributions. Our first main contribution is the analysis of the convergence rates of bias and variance of the kNN based KL divergence estimator proposed in [8]. For the bias, we discuss two significantly different types of distributions separately. In the first type of distributions analyzed, both f and g have bounded support, and are bounded away from zero. One such example is when both distributions are uniform distributions. This implies that the distribution has boundaries, where the pdf suddenly changes. There are two main sources of estimation bias of kNN method for this case. The first source is the boundary effect, as the kNN method tends to underestimate the pdf values at the region near the boundary. The second source is the local non-uniformity of the pdf. It can be shown that the bias caused by the second source converges fast enough and thus can be negligible. As a result, the boundary bias is the main cause of bias of the kNN based KL divergence estimator for the first type of distributions considered. In the second type of distributions analyzed, we assume that both f and g are continuous everywhere. For example, a pair of two Gaussian distributions with different mean or variance belong to this case. For this type of distributions, the boundary effect does not exist. However, as the density values can be arbitrarily close to zero, we need to consider the bias caused by the tail region, in which f or g is too low and thus kNN distances are too large for us to obtain an accurate estimation of the density ratio f/g. For the variance of this estimator, we bound the convergence rate under a unified assumption, which holds for both two cases discussed above. The convergence rate of the mean square error can then be obtained based on that of the bias and variance. In this paper, we assume that k is fixed. We will show that with fixed k, the convergence rate of the mean square error over the sample sizes is already minimax optimal.
Our second main contribution is to derive a minimax lower bound of the mean square error of KL divergence estimation, which characterizes the theoretical limit of the convergence rates of any methods. For discrete distributions, the minimax lower bound has already been derived in [25] and [6]. However, for continuous distributions, the minimax lower bound has not been established. In fact, there exists no estimators that are uniformly consistent for all continuous distributions. For example, let is the indicator function, and g is uniform in [0, 1]. Then the estimation error of KL divergence between f and g equals the estimation error of the entropy of
can be arbitrarily large, according to the lower bound derived in [26], there exists no uniformly consistent estimator. As a result, to find a minimax lower bound, it is necessary to impose some restrictions on the distributions. In this paper, we analyze the minimax lower bound for two cases that match our assumptions for deriving the upper bound, i.e. distributions with bounded support and densities bounded away from zero, and distributions that are smooth everywhere and densities can be arbitrarily close to zero. For each case, we show that the minimax lower bound nearly matches our upper bound using kNN method. This result indicates that the kNN based KL divergence estimator is nearly minimax optimal. To the best of our knowledge, our work is the first attempt to analyze the convergence rate of KL divergence estimator based on kNN method, and prove its minimax optimality.
The remainder of this paper is organized as follows. In Section II, we provide the problem statements. In Sections III and IV, we characterize the convergence rates of the bias and variance of the kNN based KL divergence estimator respectively. In Section V, we show the minimax lower bound. We then provide numerical examples in Section VI, and concluding remarks in Section VII.
Consider two pdfs . The KL divergence between f and g is defined as
f and g are unknown. However, we are given a set of samples drawn i.i.d from pdf f, and another set of samples
drawn i.i.d from pdf g. The goal is to estimate D(f||g) based on these samples.
[8] proposed a kNN based estimator:
in which is the distance between
-th nearest neighbor in
while
is the distance between
-th nearest neighbor in
is the dimension. The distance between any two points u, v is defined as
can be an arbitrary norm. The basic idea of this estimator is using kNN method to estimate the density ratio. An estimation of
in which V (S) is the volume of set S. (3) can be understood as follows. Apart from , there are another
samples from
, among which k points fall in
. Therefore,
is an estimate of
is the probability mass with respect to the distribution with pdf f. As the distribution is continuous, we have
We can then use (3) to estimate
. Similarly, as there are
generated from g, we can obtain an estimate
As
by replacing with (3) and (4) respectively, we can get the expression of the KL divergence estimator in (2).
[8] has proved that this estimator is consistent, but the convergence rate remains unknown. In this paper, we analyze the convergence rates of the bias and variance of this estimator, and derive the minimax lower bound.
In this section, we derive convergence rate of the bias of the estimator (2). We will consider two different cases depending on whether the support is bounded or not, as they have different sources of biases.
A. The Case with Bounded Support
We first discuss the case in which the distributions have bounded support and the densities are bounded away from zero. The main source of bias of this case is boundary effects. The analysis is based on the following assumptions:
Assumption 1. Assume the following conditions:
aV (B(x, r)), and for all denotes the volume of a set;
Assumption (a) is necessary to ensure that the definition of KL divergence in (1) is valid. (b) bounds both the lower and upper bound of the pdf value. (c) restricts the surface area of the supports of f and g. Since the kNN divergence estimator tends to cause significant bias at the region near to the boundary, the estimation bias for distributions with irregular supports with large surface area are usually large. (d) requires the boundedness of the support. The case with unbounded support will be considered in Section III-B. (e) ensures that the angles at the corners of the support sets have a lower bound, so that there will not be significant bias at the corner region. (f) ensures the smoothness of distribution in the support set. Note that (3) and (4) actually estimate the average density f and g over the ball and
are smooth, then the average values will not deviate too much from the pdf value at the center of the balls, i.e.
Based on the above assumptions, we have the following theorem regarding the bias of estimator (2).
Theorem 1. Under Assumption 1, the convergence rate of the bias of kNN based KL divergence estimator is bounded by:
Proof. (Outline) Considering that
in which h denotes the differential entropy, we decompose the KL divergence estimator to an estimator of the differential entropy of f, as well as an estimator of the cross entropy between f and g. We then bound the bias of these two estimators. In particular, we can write
with
in which is the digamma function,
being the Gamma function. Due to the property of Gamma distribution, we know that
decays sufficiently fast and can be negligible for large sample sizes N and M.
has the same form as the bias of Kozachenko-Leonenko entropy estimator [17], which has been analyzed in many previous literatures [19, 21–23, 27]. With some modifications, the proofs related to the entropy estimator can also be used to bound
, which is actually the bias of a cross entropy estimator. However, as the assumptions are different from the assumptions made in previous literatures, we need to derive (6) in a different way.
In our proof, for both the entropy estimator and the cross entropy estimator, we divide the support into two parts, the central region and the boundary region. In the central region, will be within
will be within
with high probability. Since f and g are smooth, the expected estimate
are very close to the truth, and thus will not cause significant bias. The main bias comes from the boundary region, in which the density estimator
are no longer accurate, as
exceeds the supports
. We bound the boundary bias by letting the boundary region to shrink with a proper speed. The detailed proof is shown in Appendix A.
B. The Case with Smooth Distributions
We now consider the second case where the density is smooth everywhere and the density can be arbitrarily close to zero. For this case, the main source of bias is tail effects. We make the following assumptions:
Assumption (a) ensures that the definition of KL divergence in (1) is valid. (b) is the tail assumption. A lower indicates a stronger tail, and thus the convergence of bias of the KL divergence estimator will be slower. For example,
for Gaussian distribution and
for Cauchy distribution. (c) is the smoothness assumption. (d) is an additional tail assumption, which is actually very weak and holds for almost all of the common distributions, since s can be arbitrarily small. However, this assumption is important since it prevents very large
. Based on the above assumptions, we have the following theorem regarding the bias of estimator (2).
Theorem 2. Under Assumption 2, the convergence rate of the bias of kNN based KL divergence estimator is bounded by:
Proof. (Outline) Similar to the proof of Theorem 1, we still decompose the KL divergence estimator to two estimators that estimate the entropy of f and the cross entropy between f and g, separately. In particular, we can still decompose the bias using (8). For simplicity, we only provide the convergence bound of , which is the error of the cross entropy estimator. The bound of the entropy estimator holds similarly.
For the cross entropy estimator, we divide the support into two parts, including a central region in which f or g is relatively high, and a tail region
is relatively low. According to the results of order statistics [27, 28],
probability mass of S with respect to the distribution with pdf g. Therefore,
can be bounded by
We bound two terms in (11) separately. To derive the bound of bias in , we find a high probability upper bound of
, denoted as
. The bound of bias can be obtained by bounding the local non-uniformity of
. On the contrary, if
, we use assumption (d) to ensure that
will not be too large, and thus will not cause significant estimation error. We let
to decay with M at a proper speed, to maximize the overall convergence rate of the bias.
To bound the bias in , we let the threshold between
to decay with sample size M, so that the probability mass of
also decreases with M. We then combine the bound of
adjust the rate of the decay of the threshold between
The detailed proof can be found in Appendix B.
We now discuss the variance of this divergence estimator, based on the following unifying assump- tions.
Assumption 3. Assume that the following conditions hold:
in which
Assumption 3 (a)-(c) are satisfied if either Assumption 1 or Assumption 2 is satisfied. (a) only requires that the pdf is continuous almost everywhere, and thus holds not only for distributions that are smooth everywhere, but also for distributions that have boundaries. (b) is obviously satisfied under Assumption 1, since it requires that the densities are both upper and lower bounded. From Assumption 2, it is also straightforward to show that property combining with the smoothness condition (Assumption 2 (c)) imply that (15) holds for sufficiently small
. (c) is the same as Assumption 2 (d) and weaker than Assumption 1 (d). Therefore, (a)-(c) are weaker than both previous assumptions on the analysis of bias. (d) is a new assumption which restricts the density ratio. This is important since if the density ratio can be too large, which means that there exists a region on which there are too many samples from
fewer samples from
will be large and unstable for too many
Therefore we use assumption (d) to bound the density ratio.
Under these assumptions, the variance of the divergence estimator can be bounded using the following theorem.
Theorem 3. Under Assumption 3, if , then the convergence rate of the variance of estimator (2) can be bounded by:
Proof. (Outline) From (2), we have
Our proof uses some techniques from [27], which proved the O(1/N) convergence of variance of Kozachenko-Leonenko entropy estimator with k = 1 for one dimensional distributions, and [23], which generalizes the result to arbitrary fixed dimension and k, without restrictions on the boundedness of the support. The basic idea is that if one sample is replaced by another i.i.d sample, then it can be shown that the k-NN distance will change only for a tiny fraction of the samples.
The first term in (18) is just the variance of Kozachenko-Leonenko entropy estimator. Therefore we can use similar proof procedure as was already used in the proof of Theorem 2 in [23]. [23] analyzed a truncated Kozachenko-Leonenko entropy estimator, which means that is truncated by an upper bound
. We prove the same convergence bound for the estimator without truncation.
For the second term in (18), the analysis becomes much harder, since the k-NN distance may change for much more samples from , instead of only a tiny fraction of samples. For this term, we design a new method to obtain the high probability bound of the deviation of
from its mean. The basic idea of our new methods can be briefly stated as following: Define two sets
is a subset of
such that for any
is among the k nearest neighbors of
. Similarly, define
to be a set such that for all
among the k nearest neighbors of x. If we replace
, the kNN distance of
will only change if
. With this observation, we give a high probability bound of the number of samples from
that are in
respectively, and then bound the maximum difference of the estimated result caused by replacing
. Based on this bound, we can then bound the second term in (18) using Efron-Stein inequality.
The detailed proof can be found in Appendix C.
In the analysis above, we have derived the convergence rate of bias and variance. With these results, we can then bound the mean square error of kNN based KL divergence estimator. For distributions that satisfy Assumptions 1 and 3, the mean square error can be bounded by
For distributions that satisfy Assumptions 2 and 3, the corresponding bound is
In this section, we derive the minimax lower bound of the mean square error of KL divergence estimation, which holds for all methods (not necessarily kNN based) that do not have the knowledge of the distributions f and g. The minimax analysis also considers two cases, i.e. the distributions whose densities are bounded away from zero, and those who has approaching zero densities. For the first case, the following theorem holds.
Theorem 4. Define as set of pairs (f, g) that satisfies Assumptions 1 and 3, and
in which is the estimation of KL divergence using N samples drawn from distribution with pdf f and M samples from g. Then for sufficiently large
and sufficiently small
Proof. (Outline) The minimax lower bound of functional estimation can be bounded using Le Cam’s method [29]. For the proof of Theorem 4, we use some techniques from [26], which derived the minimax bound of entropy estimation for discrete distributions. The main idea is to construct a subset of distributions that satisfy Assumptions 1 and 3, and then conduct Poisson sampling. These operations can help us calculate the distance between two distributions in a more convenient way, which is important for using Le Cam’s method. Details of the proof can be found in Appendix D.
for arbitrarily small
We remark that in Theorem 4, the support set are unknown. If we assume that
are known, then with some boundary correction methods, such as the mirror reflection method proposed in [30], the convergence rate can be faster than that in (22). However, in Theorem 4, instead of using fixed support sets,
contains distributions with a broad range of different support sets. These support sets are only restricted by Assumption 1 (c) and (d), which require that the surface area of all the elements in
are bounded by
, and the diameters are bounded by R. As a result, the minimax convergence rate becomes slower. This result indicates the inherent difficulty caused by the boundary effect for distributions with densities bounded away from zero.
For the second case, the corresponding result is shown in Theorem 5.
Theorem 5. Define as set of pairs (f, g) that satisfies Assumptions 2 and 3, and
then for sufficiently large
Proof. (Outline) The minimax convergence rate of differential entropy estimation under similar assumptions was derived in [23]. We can extend the analysis to the minimax convergence rate of cross entropy estimation between f and g. Combine the bound for entropy and cross entropy, we can then obtain the minimax lower bound of the mean square error of KL divergence estimation. The detailed proof is shown in Appendix E.
Comparing (23) with (19), as well as (25) with (20), we observe that the convergence rate of the upper bound of mean square error of kNN based KL divergence estimator nearly matches the minimax lower bound for both cases. These results indicate that the kNN method with fixed k is nearly minimax rate optimal.
In this section, we provide numerical experiments to illustrate the theoretical results in this paper. In the simulation, we plot the curve of the estimated bias and variance over sample sizes. For illustration simplicity, we assume that the sample sizes for two distributions are equal, i.e. M = N. For each sample size, the bias and variance are estimated by repeating the simulation T times, and then calculate the sample mean and the sample variance of all these trials. For low dimensional distributions, the bias is relatively small, therefore it is necessary to conduct more trials comparing with high dimensional distributions. In the following experiments, we repeat T = 100, 000 times if d = 1, and 10, 000 times if d > 1. In all of the figures, we use log-log plots with base 10. In all of the trials, we fix k = 3. Figure 1 shows the convergence rate of kNN based KL divergence estimator for two uniform distributions with different support. This case is an example that satisfies Assumption 1. In Figure 2, f and g are two Gaussian distributions with different mean but equal variance. In Figure 3, f and g are two Gaussian distributions with the same mean but different variance. These two cases are examples that satisfy Assumption 2. For all of these distributions above, we compare the empirical convergence rates of the bias and variance with the theoretical prediction. The empirical convergence rates are calculated by finding the negative slope of the curves in these figures by linear regression, while the theoretical ones come from Theorems 1, 2 and 3 respectively. The results are shown in Table I. For the convenience of expression, we say that the theoretical convergence rate of bias or variance is , if it decays with either
or
for arbitrarily small
, given the condition M = N.
TABLE I: Theoretical and empirical convergence rate comparison
In Table I, we observe that for the distribution used in Figure 1, the empirical convergence rates of both bias and variance agree well with the theoretical prediction, in which the theoretical bound of bias comes from Theorem 1, while the variance comes from Theorem 3.
Fig. 1: Convergence of bias and variance of kNN based KL divergence estimator for two uniform distributions with different support sets.
Fig. 2: Convergence of bias and variance of kNN based KL divergence estimator for two Gaussian distributions with different means. f is the pdf of is the pdf of
denotes d dimensional identity matrix, and 1 = (1, . . . , 1).
For the distribution in Figure 2, the empirical convergence of bias matches the theoretical prediction from Theorem 2. For Gaussian distributions with different mean, it can be shown that for any there exists a constant
such that Assumption 2 (b) holds. Therefore, according to Theorem 2, the convergence rate of bias is
for arbitrarily small
. Therefore, in the second line of Table I, the theoretical rate of bias is 0.67, 0.50 and 0.40, respectively. Now we discuss the convergence rate of variance. Note that the theoretical result about the variance is unknown, since f/g can reach infinity, thus Assumption 3 (d) is not satisfied, and Theorem 3 does not hold here. We observe that the empirical convergence rate is slower than that in other cases. Such a result may indicate that it is
Fig. 3: Convergence of bias and variance of kNN based KL divergence estimator for two Gaussian distributions with different variances. f is the pdf of is the pdf of
harder to estimate the KL divergence if the density ratio is unbounded.
For the distribution in Figure 3, the empirical and theoretical convergence rate of the variance matches well, while the empirical rate of bias is faster than the theoretical prediction. Note that the bound we have derived holds universally for all distributions that satisfy the assumptions. For certain specific distribution, the convergence rate can probably be faster. In particular, there is an uniform bound on the Hessian of f and g in Assumption 2 (c). However, for Gaussian distributions, the Hessian is lower where the pdf value is small. Therefore, the local non-uniformity is not as serious as the worst case that satisfies the assumptions.
In this paper, we have analyzed the convergence rates of the bias and variance of the kNN based KL divergence estimator proposed in [8]. For the bias, we have discussed two types of distributions depending on the main causes of the bias. In the first case, the distribution has bounded support, and the pdf is bounded away from zero. In the second case, the distribution is smooth everywhere and the pdf can approach zero arbitrarily close. For the variance, we have derived the convergence rate under a more general assumption. Furthermore, we have derived the minimax lower bound of KL divergence estimation. The bound holds for all possible estimators. We have shown that for both types of distributions, the kNN based KL divergence estimator is nearly minimax rate optimal. We have also used numerical experiments to illustrate that the practical performances of kNN based KL divergence estimator are consistent with our theoretical analysis.
in which
and is the volume of unit ball. Here, we omit
are the same for all i.
In the following, we provide details on how to bound can then be bounded using similar method.
To begin with, we denote as the probability mass of
We have the following lemma.
Lemma 1. There exists a constant , such that, if
Proof.
in which the first inequality uses Assumption 1 (f).
From order statistics [28], , therefore
Define
in which . From (31), we observe that the bias is determined by the difference between the average pdf in
and the pdf at its center
the region that is relatively far from the boundary. For all
, with high probability,
In this case, the bias is caused by the non-uniformity of density. With the increase of sample size, the effect of such non-uniformity will converge to zero.
is the region near to the boundary, in which the probability that
is not negligible, hence
can deviate significantly comparing with
. Therefore, the bias in this region will not converge to zero. However, we let the size of
converge to zero, so that the overall bound of the bias converges.
In step (a), we use Lemma 1, Assumption 1 (b) and Assumption 1 (e). In step (b), the first term uses the fact that for sufficiently large will be sufficiently small, hence
. The second term of step (b) comes from the Chernoff bound, which indicates that for all
and sufficiently large M,
In this equation, is the volume of
, and we use the fact that
according to the definition of
and Assumption 1 (c). Based on (34) and (36),
Similarly, we have , and according to the definition of digamma function
. Therefore
In this section, we derive the bound of the bias for distributions that satisfy Assumption 2. These
distributions are smooth everywhere and the densities can approach zero. We begin with the following lemmas, whose proofs can be found in Appendix B-A, B-B, and B-C, respectively.
Lemma 2. There exist constants
Lemma 3. There exists a constant , such that
for sufficiently small t, in which X follows a distribution with pdf f.
Similar to the proof of Theorem 1, we decompose the bias as Then
Divide into two parts.
in which will be determined later.
is the constant in Lemma 1.
in which (a) comes from Lemma 1. For (b), note that according to (41), . (c) uses Lemma 4. For
, note that according to Lemma 1,
Based on this fact, if , we show the following two lemmas:
Lemma 5. There exists a constant , such that
Proof. Please see Appendix B-D for detailed proof.
Lemma 6. There exists a constant , such that
Proof. Please see Appendix B-E for detailed proof.
Note that
Therefore
From order statistics [28], . According to Assumption 2 (b), the first three terms in (51) can be bounded by:
and
The last term in (51) can be bounded using the following lemma, whose proof can be found in Appendix B-F.
Lemma 7. There exist two constants , such that for sufficiently large M,
Using this lemma, we have
Combining (43), (50) and (57), we get
Since the above bound holds for arbitrary , we just let
Similarly, we have , and according to the definition of digamma function,
A. Proof of Lemma 2 We only show that there exists a constant holds for all x. The proof of the upper bound
of density f will be exactly the same. From Lemma 1,
Since
for all as the right hand side of (62) given
then
B. Proof of Lemma 3 From H¨older inequality, For any p, q such that p > 1, q > 1, and 1/p + 1/q = 1,
From Assumption 2 (b),
Using Stirling’s formula
which holds for all p > 1. For sufficiently small t, let p = ln(1/t), then the right hand side of (67) becomes
C. Proof of Lemma 4
in which we used (41) and Lemma 1. Hence, according to (44) and Chernoff inequality,
Moreover, define
The proof is complete.
E. Proof of Lemma 6
as the kNN distance of
. Then for sufficiently large
Denote as the number of samples from
that are in S. Then for any given x, and
Let
It can be checked that , therefore
In (a), we use (70) and the definition of , which implies that
. (b) uses the fact that
It remains to bound
In Lemma 5, we have shown that . For the second term,
Let T = (1/s) ln M, then
Hence
F. Proof of Lemma 7
In (a), we use Lemma 2. (b) uses Chernoff bound. Moreover, let then
Combining (81) and (82), the proof is complete.
We bound separately.
Bound of is the variance of Kozachenko-Leonenko entropy estimator [17], which estimates
. Here we use similar proof procedure as was already used in the proof of Theorem 2 in our recent work [23]. [23] has analyzed a truncated KL entropy estimator, which means that
is truncated by an upper bound
. The variance of this estimator is actually equal to
. It was shown in [23] that if
. In this section, we prove the same convergence bound for the estimator without truncation, i.e.
be a sample that is i.i.d with
. Recall that
-th nearest neighbor distance of
. If we replace
, then the kNN distances will change. Denote
-th nearest neighbor distance based on
. Then use Efron- Stein inequality [29],
Define . Moreover, define
neighbor distances based on
. Follow the steps in Appendix C of [23], we have
in which is a constant that depends on dimension d and the norm we use. For example, if we use
norm, then
is the minimum number of cones with angle
that cover
Now we bound Note that we truncate the estimator for the convenience of analysis, although we are now analyzing an estimator without truncation. The deviation caused by such truncation will be bounded later. In the following proof, we omit the index for convenience.
can be bounded by
in which is the probability mass of S under a distribution with pdf
According to Assumption 3 (b),
. Moreover, Lemma 6 and Lemma 7 in [23] have shown that
and
It remains to show that
For sufficiently large . From Assumption 3 (b), for sufficiently small t,
in which we use small o notation, since for any variable
In (a), we use the definition of in (16) for the first term, and use Chernoff inequality for the second term. (b) holds because
. Then we can get (91) using (90).
Moreover, we can show the following Lemma:
Lemma 8.
Proof. Please see Appendix C-A.
Based on (89), (91) and Lemma 8, . Therefore (86) becomes
Similar results hold for . Hence (85) becomes
Bound of be a sample that is i.i.d with
-th nearest neighbor distance of
be a sample that is i.i.d with
, and define
-th nearest neighbor distance of
Then from Efron-Stein inequality,
To bound the right hand side of (95), we first make the following definitions:
Definition 1. Define two sets
Definition 2. Define three events:
in which C is the constant in Assumption 3 (d). Here, for any set number of points from
that are in
is the same constant used in (85).
We also denote The following lemma shows that all of these three events happen with low probability.
Lemma 9. The probabilities of are bounded by:
Proof. Please see Appendix C-B.
These three bounds show that P. Moreover, we show the following lemma:
Lemma 10. There exists a constant such that for sufficiently large M we have
Proof. Please see Appendix C-C.
Based on Lemma 9 and Lemma 10,
If E does not happen, then are all upper bounded by
and
are all upper bounded by
. Besides, from (99), they are both lower bounded by
. There are at most
points such that
Combining (105) and (106), we have
Then can be bounded by:
Similar to the analysis from (86) to (93), we can show that the limit of can also be bounded by the right hand side of (93). Therefore
and
Then
Define
It can be shown that P
dominated convergence theorem,
Similar bound holds for , and using the union bound, we get (101).
Proof of (102). Since g is bounded by , then for sufficiently large M,
Hence from Chernoff inequality,
Then (102) can be obtained by calculating the union bound. Proof of (103). We first prove (103) under the condition that we are using norm first. We will then generalize the result to the case with arbitrary norm. Define
then S
Recall that is defined as the minimum number of cones with angle
that can cover
we pick any
, and divide
cones with angle
, such that y is the vertex of all the cones. These cones are named as
Define as the number of points from
that are in S. Moreover, define
Then from Chernoff inequality,
and
This result indicates that with probability at least , there are at least k points in
, there are already at least
among
as the angle between vector
which indicates that are all closer to x than y, therefore y can not be one of the k nearest neighbors of
. Recall that x is arbitrarily picked outside
. Therefore with probability at least
Using Assumption 3 (d), hold, then
Using Chernoff inequality again,
Therefore
The proof is complete.
and
then
In (a), we use
according to (130),
Thus
in which the last step uses (66).
In this section, we show the minimax convergence rate of KL divergence estimator for distributions
with bounded support and densities bounded away from zero. The proof can be divided into proving the following three bounds separately:
and g(x) = 1. Then
Therefore, for sufficiently small . Moreover,
By Taylor expansion, it can be shown that thus
Therefore, from Le Cam’s lemma [29],
Let
Similarly, let
for . Then it can be shown that
The proof has similar idea with [26] and [23]. To begin with, define
in which is the unit ball volume, thus
are two constants.
decrease with N, while m increases with
selected such that
. It can be checked that both f and g integrate to 1. The condition
is designed such that the density in the support is bounded away from zero, i.e. if
. Moreover, the surface area of the support is
is the surface area of unit ball, and
the condition
, the surface area of the supports of f and g are both upper bounded by
. Therefore, for sufficiently large
and sufficiently small
Define
Recall that is defined as the minimax mean square error over
Define
in which , Poi is the Poisson distribution. Then we have the following lemma:
Lemma 11.
Proof. Please refer to Appendix D-A for details.
Furthermore, define
Comparing with the definition of in (148), the only difference is that we now allow
to deviate slightly from
. As a result, f is not necessarily a pdf, since it is not normalized. However, we extend the definition of KL divergence
here. Define
in which . Then the number of samples falling on any two disjoint intervals are mutually independent.
can be lower bounded by
with the following lemma:
Lemma 12. If
Proof. Please refer to Appendix D-B for details.
With Lemma 11 and Lemma 12, the problem of bounding can be converted to bounding
. We then show the following lemma, which is slightly modified from Lemma 11 in [23].
Let
then
in which is the differential entropy of
Proof. The proof is exactly the same as the proof of Lemma 11 in [23]. Condition (1) is different from the corresponding condition in [23], but such difference does not affect the proof.
let
in which denotes the distribution that puts all the mass on u = 0. Now we assume
Let
are supported in
, and condition (1) in Lemma 13 is satisfied. Then from Lemma 4 in [26],
and . In particular,
are properly selected, according to eq.(34) in [26],
Let , then the above equation can be transformed to the following one:
i.e. there exist two constants that depend on
, such that
Hence
To bound the total variation term in (159), we use the following lemma.
Lemma 14. ([26], Lemma 3) Let be random variables on
Substitute
in which the last step holds because
and from (148),
and
Then
Note that the second, third and fourth term in the bracket at the right hand side of (159) converge to zero. In particular, for the second term,
For the third term,
and it is straightforward to show that the fourth term also converges to zero. Therefore, from Lemma 13,
Pick . According to condition 1) in the statement of Lemma 13, this is possible if c is sufficiently small. Then
From Lemma 12, and note that from (158),
which converges sufficiently fast, thus can also be lower bounded with the right hand side of (179). From (150) and (152),
Define
Then for any
Define
then for sufficiently large and sufficiently low
We use Poisson sampling again. Define
in which . Then we have the following lemma.
Lemma 15.
Proof. Please refer to Appendix D-C.
Define
and
in which . Then the following lemma lower bounds
Lemma 16. If
Proof. Please refer to Appendix D-D.
Now we bound with the following lemma.
Lemma 17. Let be two random variables that satisfy the following conditions: (1)
(2)
Proof. Please refer to Appendix D-E.
Now we use eq.(34) in [26] again, which shows that there exist that have matching moments up to L-th order, such that
The remaining proof follows the proof of (137). take the same value as the equations from (170) to (174), and then we can get similar bound as (137), replacing N with M.
in which the inequality in the second step comes from Jensen’s inequality. Note that nonincreasing function of N, because if
, one can always pick
samples for the estimation, thus
always holds. Therefore
Moreover, since , use Chernoff inequality, we get
Now it remains to bound . Note that we can always let the estimator be
hence
From the definition of in (148), for all
which is the same for all . In addition,
Hence,
From (193), (194), (195) and (201),
and
Then from (154),
Now we bound the second term.
According to (200),
in which (a) is obtained by maximizing under the restriction
, (b) comes from
, and (c) uses
. Moreover,
Hence
Therefore
and
The proof is complete.
define
Similar to (206),
and
in which the last step holds since . The proof is complete.
Define two events:
then
Consider that
hence for i = 1, 2,
Therefore
According to (183),
From the definition of in (220) and (221), if
happen, then
Denote as the distribution of samples according to
conditional on
as the distribution according to
conditional on
. Then under
and
Then according to Le Cam’s lemma,
The proof is complete.
APPENDIX E PROOF OF THEOREM 5 Similar to Theorem 4, the proof can be divided into proving the following three bounds:
in which is the value of the first coordinate of x, and
and
From Le Cam’s lemma,
Let , for sufficiently large
. Similarly, let
and
in which , then we can get
Define
In (245), there are two conditions that are different from the definition of . The first one is designed so that the distribution satisfies the tail assumption (Assumption 2 (b)). For
Let
then
Hence
Proof of (233). Define
in which are two constants. Comparing with the definition of
in (245), we add a new condition
, to ensure that f/g is always bounded by
. Similar to Theorem 4, Let
. Moreover, we still define
as was already defined in (190). Then from Lemma 17,
and from (169),
From Lemma 5 in [26], there exists two constants
Let
then
With the restriction
hence
[1] N. H. Anderson, P. Hall, and D. M. Titterington, “Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates,” Journal of Multivariate Analysis, July 1994.
[2] I. S. Dhillon, S. Mallela, and R. Kumar, “A divisive information-theoretic feature clustering algorithm for text classification,” Journal of Machine Learning Research, vol. 3, pp. 1265–1287, Mar. 2003.
[3] Y. Bu, S. Zou, Y. Liang, and V. V. Veeravalli, “Universal outlying sequence detection for continuous observations,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, Mar. 2016.
[4] P. J. Moreno, P. P. Ho, and N. Vasconcelos, “A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2004.
[5] J. Ram´ırez, J. C. Segura, C. Ben´ıtez, A. De La Torre, and A. J. Rubio, “A new Kullback-Leibler VAD for speech recognition in noise,” IEEE Signal Processing Letters, Jan. 2004.
[6] Y. Bu, S. Zou, Y. Liang, and V. V. Veeravalli, “Estimation of KL divergence: Optimal minimax rate,” IEEE Trans. Inform. Theory, Apr. 2018.
[7] Q. Wang, S. R. Kulkarni, and S. Verd´u, “Divergence estimation of continuous distributions based on data-dependent partitions,” IEEE Trans. Inform. Theory, Sept. 2005.
[8] ——, “Divergence estimation for multidimensional densities via k-nearest-neighbor distances,” IEEE Trans. Inform. Theory, vol. 55, no. 5, pp. 2392–2405, May 2009.
[9] A. Antos and I. Kontoyiannis, “Convergence properties of functional estimates for discrete distributions,” Random Structures & Algorithms, vol. 19, no. 3-4, pp. 163–193, Nov. 2001.
[10] P. K. Rubenstein, O. Bousquet, J. Djolonga, C. Riquelme, and I. Tolstikhin, “Practical and consistent estimation of f-divergences,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2019.
[11] H. Cai, S. R. Kulkarni, and S. Verdu, “Universal estimation of entropy and divergence via block sorting,” in Proc. IEEE Intl. Symposium on Inform. Theory, Lausanne, Switzerland, July 2002.
[12] H. Cai, S. R. Kulkarni, and S. Verd´u, “Universal divergence estimation for finite-alphabet sources,” IEEE Trans. Inform. Theory, vol. 52, no. 8, pp. 3456–3475, July 2006.
[13] Z. Zhang and M. Grabchak, “Nonparametric estimation of K¨ullback-Leibler divergence,” Neural Computation, Oct. 2014.
[14] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” IEEE Trans. Inform. Theory, vol. 56, no. 11, pp. 5847–5861, Nov. 2010.
[15] K. R. Moon and A. O. Hero, “Ensemble estimation of multivariate f-divergence,” in Proc. IEEE Intl. Symposium on Inform. Theory, Honolulu, HI, July 2014.
[16] E. Fix, Discriminatory analysis: nonparametric discrimination, consistency properties. USAF school of Aviation Medicine, Feb. 1951.
[17] L. Kozachenko and N. N. Leonenko, “Sample estimate of the entropy of a random vector,” Problemy Peredachi Informatsii, vol. 23, no. 2, pp. 9–16, Oct. 1987.
[18] A. B. Tsybakov and E. Van der Meulen, “Root-n consistent estimators of entropy for densities with unbounded support,” Scandinavian Journal of Statistics, pp. 75–83, Mar. 1996.
[19] S. Singh and B. P´oczos, “Analysis of k-nearest neighbor distances with application to entropy estimation,” arXiv preprint arXiv:1603.08578, Mar. 2016.
[20] ——, “Finite-sample analysis of fixed-k nearest neighbor density functional estimators,” in Proc. Advances in Neural Information Processing Systems, Barcelona, Spain, Dec. 2016, pp. 1217–1225.
[21] W. Gao, S. Oh, and P. Viswanath, “Demystifying fixed k-nearest neighbor information estimators,” IEEE Trans. Inform. Theory, Feb. 2018.
[22] T. B. Berrett, R. J. Samworth, M. Yuan et al., “Efficient multivariate entropy estimation via k-nearest neighbour distances,” The Annals of Statistics, vol. 47, no. 1, pp. 288–318, Jan. 2019.
[23] P. Zhao and L. Lai, “Analysis of kNN information estimators for smooth distributions,” IEEE Trans. Inform. Theory, 2020, to appear.
[24] S. Khan, S. Bandyopadhyay, A. R. Ganguly, S. Saigal, D. J. Erickson III, V. Protopopescu, and G. Ostrouchov, “Relative performance of mutual information estimation methods for quantifying the dependence among short and
noisy data,” Physical Review E, vol. 76, no. 2, p. 026209, Aug 2007.
[25] Y. Han, J. Jiao, and T. Weissman, “Minimax rate-optimal estimation of divergences between discrete distributions,” arXiv:1605.09124, May 2016.
[26] Y. Wu and P. Yang, “Minimax rates of entropy estimation on large alphabets via best polynomial approximation,” IEEE Trans. Inform. Theory, vol. 62, no. 6, pp. 3702–3720, June 2016.
[27] G. Biau and L. Devroye, Lectures on the nearest neighbor method. Springer, 2015.
[28] H. A. David and H. N. Nagaraja, Order statistics. Wiley Online Library, 1970.
[29] A. B. Tsybakov, Introduction to nonparametric estimation. Springer, 2009.
[30] H. Liu, L. Wasserman, and J. D. Lafferty, “Exponential concentration for mutual information estimation with application to forests,” in Proc. Advances in Neural Information Processing Systems, Lake Tahoe, Nevada, Dec. 2012.
[31] A. Timan, M. Stark, I. Sneddon, and S. Ulam, Theory of approximation of functions of a real variable. Pergamon Press, 1963.