Nearest Neighbor Dirichlet Mixtures

2020·Arxiv

Abstract

Abstract

There is a rich literature on Bayesian methods for density estimation, which characterize the unknown density as a mixture of kernels. Such methods have advantages in terms of providing uncertainty quantification in estimation, while being adaptive to a rich variety of densities. However, relative to frequentist locally adaptive kernel methods, Bayesian approaches can be slow and unstable to implement in relying on Markov chain Monte Carlo algorithms. To maintain most of the strengths of Bayesian approaches without the computational disadvantages, we propose a class of nearest neighbor-Dirichlet mixtures. The approach starts by grouping the data into neighborhoods based on standard algorithms. Within each neighborhood, the density is characterized via a Bayesian parametric model, such as a Gaussian with unknown parameters. Assigning a Dirichlet prior to the weights on these local kernels, we obtain a pseudo-posterior for the weights and kernel parameters. A simple and embarrassingly parallel Monte Carlo algorithm is proposed to sample from the resulting pseudo-posterior for the unknown density. Desirable asymptotic properties are shown, and the methods are evaluated in simulation studies and applied to a motivating data set in the context of classification.

1 Introduction

Bayesian nonparametric methods provide a useful alternative to black box machine learning algorithms, having potential advantages in terms of characterizing uncertainty in inferences and predictions. However, computation can be slow and unwieldy to implement. Hence, it is important to develop simpler and faster Bayesian nonparametric approaches, and hybrid methods that borrow the best of both worlds. For example, if one could use the Bayesian machinery for uncertainty quantification and reduction of mean square errors through shrinkage, while incorporating algorithmic aspects of machine learning approaches, one may be able to engineer a highly effective hybrid. The focus of this article is on proposing such an approach for density estimation, motivated by the successes and limitations of nearest neighbor algorithms and Bayesian mixture models.

Nearest neighbor algorithms are popular due to a combination of simplicity and performance. Given a set of n observations , the density at x is estimated as ˆis the number of neighbors of the distance of x from its kth nearest neighbor in is the volume of the p-dimensional unit ball (Loftsgaarden and Quesenberry, 1965; Mack and Rosenblatt, 1979). Refer to Biau and Devroye (2015) for an overview of related estimators and corresponding theory.

Nearest neighbor density estimators are a type of locally adaptive kernel density estimators. The literature on such methods identifies two broad classes: balloon estimators and sample smoothing estimators; see Scott (2015); Terrell and Scott (1992) for an overview. Balloon estimators characterize the density at a query point x using a bandwidth function h(x); classical examples include the naive k-nearest neighbor density estimator (Loftsgaarden and Quesenberry, 1965) and its modification in Mack and Rosenblatt (1979). More elaborate balloon estimators face challenges in terms of choice of h(x) and obtaining estimators that do not integrate to 1. Sample smoothing estimators make use of n different bandwidths ), one for each sample point , to estimate the density at a query point x globally. By construction, sample smoothing estimators are bona fide density functions integrating to 1. To fit either the balloon or the sample smoothing estimator, one may compute an initial pilot density estimator employing a constant bandwidth and then use this pilot to estimate the bandwidth function (Breiman et al., 1977; Abramson, 1982). Another example of a locally adaptive density estimator is the local likelihood density estimator (Loader, 1996, 2006; Hjort and Jones, 1996), which fits a polynomial model in the neighborhood of a query point x to estimate the density at x, estimating the parameters of the local polynomial by maximizing a penalized local log-likelihood function. The above methods produce a point estimate of the density without uncertainty quantification (UQ).

Alternatively, there is a Bayesian literature on locally adaptive kernel methods, which

express the unknown density as:

which is a mixture of m components, with the hth having probability weight and kernel parameters ; by allowing the location and bandwidth to vary across components, local adaptivity is obtained. A Bayesian specification is completed with prior for the kernel parameters and for the weights. In practice, it is common to rely on an over-fitted mixture model (Rousseau and Mengersen, 2011), which chooses m as a pre-specified finite upper bound on the number of components, and lets

By choosing close to zero, this prior favors effective deletion of redundant components. Also, augmenting with component index , a simple Gibbs sampler can be used for posterior computation, alternating between sampling (i) multinomial conditional posterior, for and (iii) Dirichlet(

Relative to frequentist locally adaptive methods, Bayesian approaches are appealing in automatically providing a characterization of uncertainty in estimation, while having excellent practical performance for a broad variety of density shapes and dimensions. However, implementation typically relies on Markov chain Monte Carlo (MCMC), with the Gibbs sampler sketched above providing an example of a common algorithm used in practice. Unfortunately, current MCMC algorithms for posterior sampling in mixture models tend to face issues with slow mixing, meaning the sampler can take a very large number of iterations to adequately explore different posterior modes and obtain sufficiently accurate posterior summaries.

MCMC inefficiency has motivated a literature on faster approaches, including sequential approximations (Wang and Dunson, 2011; Zhang et al., 2014) and variational Bayes (Blei and Jordan, 2006). These methods are order dependent, tend to converge to local modes, and/or lack theory support. Newton and Zhang (1999); Newton (2002) instead rely on predictive recursion. Such estimators are fast to compute and have theory support, but are also order dependent and do not provide a characterization of uncertainty. Alternatively, one can use a Polya tree as a conjugate prior (Lavine, 1992, 1994), and there is a rich literature on related multiscale and recursive partitioning approaches, such as the optional Polya tree (Wong and Ma, 2010). However, Polya trees have disadvantages in terms of sensitivity to a base partition and a tendency to favor spiky/erratic densities. These disadvantages are

inherited by most of the computationally fast modifications.

This article develops an alternative to current locally adaptive density estimators, obtaining the practical advantages of Bayesian approaches in terms of uncertainty quantification and a tendency to have relatively good performance for a wide variety of true densities, but without the computational disadvantage due to the use of MCMC. This is accomplished with a Nearest Neighbor-Dirichlet Mixture (NN-DM) model. The basic idea is to rely on fast nearest neighbor search algorithms to group the data into local neighborhoods, and then condition on these neighborhoods in defining a Bayesian mixture model-based approach. Section 2 outlines the NN-DM approach and describes implementation details for Gaussian kernels. Section 3 provides some theory support for NN-DM. Section 4 contains simulation experiments comparing NN-DM with a rich variety of competitors in univariate and multivariate examples, including an assessment of UQ performance. Section 5 contains a real data application, and Section 6 a discussion.

2 Methodology

2.1 Nearest Neighbor Dirichlet Mixture Framework

Let ) denote a distance metric between data points Euclidean distance is typically chosen. For each denote the jth nearest neighbor to in the data ), such that ), with ties broken by increasing order of indices. The indices on the k nearest neighbors to are denoted as , where by convention we define . Denote the set of data points in the i-th neighborhood by . In implementing the proposed method, we typically let the number of neighbors k vary as a function of n. When necessary we use the notation to express this dependence. However, we routinely drop the n subscript for notational simplicity.

Fixing , we model the density of the data within the i-th neighborhood using

where are parameters specific to neighborhood i that are given a global prior distribution . To combine the )s into a single global f(x), similarly to equations (1)-(2), we let

The key difference relative to standard Bayesian mixture model (1) is that in (4) we include one component for each data sample and assume that only the data in the k-nearest neighborhood of sample i will inform about . In contrast, (1) lacks any sample dependence, and we infer allocation of samples to mixture components in a posterior inference phase.

Given the restriction that only data in the i-th neighborhood inform about pseudo-posterior distribution of given data

This pseudo-posterior is in a simple analytic form if is conjugate to ). The prior can involve unknown parameters and borrows information across neighborhoods; this reduces the large variance problem common to nearest neighbor estimators.

Since the neighborhoods are overlapping, the conditional posterior for not exactly Dirichlet. However, one can define the number of unique points in the i-th neighborhood similar in spirit to the number of points in the h-th cluster in mixture models of the form (1). By convention, we define the point that generated its neighborhood to be a member of that neighborhood. For any other data point to be a unique member of the neighborhood generated by , we require lies in the neighborhood generated by does not lie in the neighborhood of any other . In Section 3.2, we show that the number of unique points defined as above approaches 1 as . This motivates the following Dirichlet pseudo-posterior update for the neighborhood weights

This distribution is equivalent to the conditional posterior on the kernel weights in the Dirichlet mixture of equations (1)-(2), but we use n components and fix the effective number of samples allocated to each component at one. The Dirichlet distribution in (6) is centered on (1/n, . . . , 1/n) with concentration parameter + 1. In practice, is typically chosen to be close to zero, as motivated in Section 1, and we let increase slowly with the sample size.

Based on equations (3)-(6), our nearest neighbor-Dirichlet mixture produces a pseudo-posterior distribution for the unknown density f(x) through simple distributions for the parameters characterizing the density within each neighborhood and for the weights. To generate independent Monte Carlo samples from the pseudo-posterior for f, one can simply draw samples from (5) and (6) independently and plug these samples into the expression for f(x) in (4). Although this is not exactly a coherent fully Bayesian posterior distribution, we claim that it can be used as a practical alternative to such a posterior in practice. This claim is backed up by theoretical arguments, simulation studies and a real data application in the sequel.

2.2 Illustration with Gaussian Kernels

Suppose we have independent and identically distributed (iid) observations density is an unknown density function with respect to the Lebesgue measure on denote the set of all real-valued positive definite matrices. Fix . We will illustrate the method for a general note key changes for the special case p = 1. We proceed by setting ) to be the multivariate Gaussian density . We first compute the neighborhoods corresponding to as in Section 2.1 and place a normal-inverse Wishart prior on ), given by () independently for and Σ; for details about parametrization see Section J of the Appendix.

For p = 1, we have a univariate Gaussian density ) in neighborhood i with normal-inverse gamma priors (2) independently for i = 1IG(= 1, the IW) density simplifies to an IG(density with

Monte Carlo samples from the pseudo-posterior of f(x) given the data obtained using Algorithm 1. The corresponding steps for the univariate case are provided in Section I of the Appendix. Although the pseudo-posterior distribution of f(x) lacks an analytic form, we can obtain a simple form for its pseudo-posterior mean by integrating over the pseudo-posterior distribution of (Recall the definitions of Ψfrom Step 2 of Algorithm 1 and define Λpseudo-posterior mean of f(x) is given by

Here for -dimensional Student’s t-density with degrees of freedom 0, location and scale matrix Λ For the univariate case, we have

Algorithm 1. The pseudo-posterior mean of f(x) is given by

where ) represents the univariate Student’s t-density with degrees of freedom.

2.3 Hyperparameter Choice

The hyperparameters in the prior for the neighborhood-specific parameters need to be chosen carefully – we found results to be sensitive to . If non-informative values are chosen for these key hyperparameters, we tend to inherit typical problems of nearest neighbor estimators including lack of smoothness and high variance. Suppose Σ for denote the i, jth entry of Σ and Ψ, respectively. Then Σ+ 1. Thus borrowing from the univariate case, we set Ψ= 0 for all , which implies that Ψuse leave-one-out cross-validation to select the optimum dimensional data, we recommend fixing which implies a multivariate Cauchy prior predictive density. We choose the leave-one-out log-likelihood as the criterion function for cross-validation, which is closely related to minimizing the Kullback-Leibler divergence between the true and estimated density (Hall, 1987; Bowman, 1984). The explicit expression for the pseudo-posterior mean in equations (7) and (8) makes cross-validation computationally efficient. The description of a fast implementation is provided in Section H of the Appendix.

The proposed method has substantially faster runtime if one bypasses cross-validation and uses a default choice of hyperparameters. In particular, we found the default values to work well across a number of simulation cases, especially when the true density is smooth. Although using cross-validation to estimate can lead to improved performance when the underlying density is spiky, cross-validation provides little to no gains for smooth true densities. Furthermore, with low sample size and increasing number of dimensions, we found this gap to diminish rapidly.

The other key tuning parameter for NN-DM is the number of nearest neighbors The pseudo-posterior mean in (7) reduces to a single . In contrast, = 1 provides a sample smoothing kernel density estimate with a specific bandwidth function (Terrell and Scott, 1992). Therefore, the choice of k can impact the smoothness of the density estimate. To assess the sensitivity of the NN-DM estimate to the choice of k, we investigate how the out-of-sample log-likelihood of a test set changes with respect to k in Section 4.7. These simulations suggest that the proposed method is quite robust to the exact choice of k. In practice with finite samples and small dimensions, we recommend a default choice of = 10 for univariate and multivariate cases, respectively. These values led to good performance across a wide variety of simulation cases as described in Section 4.

3 Theory

3.1 Asymptotic Properties

There is a rich literature on asymptotic properties of the posterior measure for an unknown density under Bayesian models, providing a frequentist justification for Bayesian density estimation; refer, for example to Ghosal et al. (1999), Ghosal et al. (2007). Unfortunately, the tools developed in this literature rely critically on the mathematical properties of fully Bayes posteriors, providing theoretical guarantees for a computationally intractable exact posterior distribution under a Bayesian model. Our focus is instead on providing frequentist asymptotic guarantees for our computationally efficient NN-DM approach, with this task made much more complex by the dependence across neighborhoods induced by the use of a nearest neighbor procedure.

We first focus on proving pointwise consistency of the pseudo-posterior of f(x) induced by (3)-(6) for each , using Gaussian kernels as in Section 2.2. We separately study the mean and variance of the NN-DM pseudo-posterior distribution, first showing that the pseudo-posterior mean in (7) is pointwise consistent and then that the pseudo-posterior variance vanishes asymptotically. The key idea behind our proof is to show that the pseudo-posterior mean is asymptotically close to a kernel density estimator with suitably chosen bandwidth for fixed at a desired rate. The proof then follows from standard arguments leading to consistency of kernel density estimators. The NN-DM pseudo-posterior mean mimics a kernel density estimator only in the asymptotic regime; in finite sample simulation studies (refer to Section 4), NN-DM has much better performance. The detailed proofs of all results in this section are in the Appendix.

Consider independent and identically distributed data from a fixed unknown density with respect to the Lebesgue measure on equipped with the Euclidean metric, inducing the measure to denote the mean of f(x), variance of f(x) and probability of the event ), respectively, under the pseudo-posterior distribution of f(x) implied by equations (3)-(6). We make the following regularity assumptions on

Assumption 3.2 (Bounded gradient)is continuous on [0all and some finite L > 0.

Assumption 3.3 (Bounded sup-norm)

Our asymptotic analysis relies on analyzing the behaviour of the pseudo-posterior updates within each nearest neighborhood. We leverage on key results from Biau and Devroye (2015); Evans et al. (2002) which are based on the assumption that the true density has compact support as in Assumption 3.1. Assumption 3.2 ensures that the kernel density estimator has finite expectation. Versions of this assumption are common in the kernel density literature; for example, refer to Tsybakov (2004). Assumptions 3.1 and 3.3 imply the existence of 0 such that 0 referred to as a positive density condition by Evans (2008); Evans et al. (2002). This is used to establish consistency of the proposed method, justify the choice of the pseudo-posterior distribution of the weights, and obtain our tuning algorithm for the Dirichlet prior concentration parameter . These assumptions are standard in the literature studying frequentist asymptotic properties of nearest neighbor and Bayesian density estimators.

For i = 1, . . . , n, recall the definitions of

where are as in Algorithm 1. Define the bandwidth

matrix

We have suppressed the dependence of for notational convenience. It is immediate that . To prove consistency of the pseudo-posterior mean, we first show that ˆare asymptotically close, that is we show that obtain this result, we approximate using successive applications of the mean value theorem. Finally, we exploit the convergence of ) to the true value to obtain the consistency of ˆ). The proof of convergence of ) is provided in Section G of the Appendix. The precise statement regarding the consistency of the pseudo-posterior mean is given in the following theorem. Let denote the minimum of a and b.

that -probability as

We now look at the pseudo-posterior variance of f(x). We let

For

As 2. Analogous steps to the ones used in the proof of Theorem 3.4 can be used to imply that -probability. Also, as (4using Stirling’s approximation. We now provide an upper bound on the pseudo-posterior variance of f(x) which shows convergence of the pseudo-posterior variance to 0.

be the bandwidth matrix defined in (9). Let . Under Assumptions 3.1-3.3 with as in Theorem 3.4, we have

Refer to Sections B and C in the Appendix for proofs of Theorems 4 and 5, respectively. Pointwise pseudo-posterior consistency follows directly from Theorems 3.4 and 3.5 as shown below.

satisfy Assumptions 3.1-3.3 with as in Theorem 3.4. Fix and define the -ball around denote the probability of the set under the pseudo-posterior distribution of f(x) as induced by equations (3)-(6). Then pr-probability as

0 and consider the . Then by Chebychev’s inequality, we have pr-probability as , using Theorems 3.4 and 3.5.

Our next result focuses on the limiting distribution of g(x), corresponding to a simplified form of the NN-DM which sets the weights to their pseudo-posterior mean,

where we focus on the univariate case with Gaussian kernels. From Section I of the Appendix, the pseudo-posterior distribution of (is given by NIG(are as before and

We show that the limiting distribution of g(x) is a Gaussian distribution with appropriate centering and scaling. This allows interpretation of 100(1 )% pseudo-credible intervals as 100(1)% frequentist confidence intervals on average for large n. Use of simplified versions of the original estimator to derive asymptotic distribution results is standard in the Bayesian nonparametric literature; refer to Chapter 12 of Ghosal and Van der Vaart (2017). Let denote the l-th derivative of

Theorem 3.7. For a fixed denote the simplified NN-DM estimator in (13). Suppose satisfies Assumptions 3.1-3.3 and also satisfies for some finite (9) satisfying

For a proof of Theorem 3.7, we refer the reader to Section D of the Appendix.

3.2 Pseudo-Posterior Distribution of Weights

We investigate the rationale behind the pseudo-posterior update of the weight vector which has a symmetric prior distribution Dirichlet() as motivated in Section 1. As discussed in Section 1, the conditional update for the weights in a finite Bayesian mixture model with m components given the cluster allocation indices is obtained by Dirichlet(is the prior concentration parameter and ) is the number of data points allocated to the h-th cluster. This is not true in our case as the -nearest neighborhoods have considerable overlap between them. Instead, we consider the number of unique data points in each of these neighborhoods.

Define the -nearest neighborhood of to be the set where -th nearest neighbor of in the data , following the notation in Section 2.1. We assume ) is the Euclidean metric from here on, and let denote the distance of -th nearest neighbor in

Let denote the number of unique members in as defined in Section 2.1. Then, we can express