b

DiscoverSearch
About
My stuff
Geodesic Distance Estimation with Spherelets
2019·arXiv
Abstract
Abstract

Many statistical and machine learning approaches rely on pairwise distances between data points. The choice of distance metric has a fundamental impact on performance of these procedures, raising questions about how to appropriately calculate distances. When data points are real-valued vectors, by far the most common choice is the Euclidean distance. This article is focused on the problem of how to better calculate distances taking into account the intrinsic geometry of the data, assuming data are concentrated near an unknown subspace or manifold. The appropriate geometric distance corresponds to the length of the shortest path along the manifold, which is the geodesic distance. When the manifold is unknown, it is challenging to accurately approximate the geodesic distance. Current algorithms are either highly complex, and hence often impractical to implement, or based on simple local linear approximations and shortest path algorithms that may have inadequate accuracy. We propose a simple and general alternative, which uses pieces of spheres, or spherelets, to locally approximate the unknown subspace and thereby estimate the geodesic distance through paths over spheres. Theory is developed showing lower error for many manifolds, with applications in clustering, conditional density estimation and mean regression. The conclusion is supported through multiple simulation examples and real data sets.

Key Words: Clustering; Conditional density estimation; Curvature; Geodesic distance; Kernel regression; Manifold learning; Pairwise distances; Spherelets.

Distance metrics provide a key building block of a vast array of statistical procedures, ranging from clustering to dimensionality reduction and data visualization. Indeed, one of the most common representations of a data set  {xi}ni=1, for xi ∈ X ⊂ RD, is via a matrix of pairwise distances between each of the data points. The key question that this article focuses on is how to represent distances between data points x and y in a manner that takes into account the intrinsic geometric structure of the data. Although the standard choice in practice is the Euclidean distance, this choice implicitly assumes that the data do not have any interesting nonlinear geometric structure in their support. In the presence of such structure, Euclidean distances can provide a highly misleading representation of how far away different data points are.

This issue is represented in Figure 1, which shows toy data sampled from a density concentrated close to an Euler spiral. It is clear that many pairs of points that are close in Euclidean distance are actually far away from each other if one needs to travel between the points along a path that does not cross empty regions across which there is no data but instead follows the ‘flow’ of the data. As a convenient, if sometimes overly-simplistic, mathematical representation to provide a framework to address this problem, it is common to suppose that the support X = M, with M corresponding to a d-dimensional Riemannian manifold. For the data in Figure 1, the manifold M corresponds to the d = 1 dimensional curve shown with a solid line; although the data do not fall exactly on M, we will treat such deviations as measurement errors that can be adjusted for statistically in calculating distances.

image

Figure 1: Noisy Euler Spiral.

The shortest path between two points x and y that both lie on a manifold M is known as the geodesic, with the length of this path corresponding to the geodesic distance. If x and y are very close to each other, then the Euclidean distance provides an accurate approximation to the geodesic distance but otherwise, unless the manifold has very low curvature and is close to flat globally, Euclidean and geodesic distances can be dramatically different. The accuracy of Euclidean distance in small regions has been exploited to develop algorithms for approximating geodesic distances via graph distances. Such approaches define a weighted graph in which edges connect neighbors and weights correspond to the Euclidean distance. The estimated geodesic distance is the length of the shortest path on this graph; for details, see Tenenbaum et al. (2000) and Silva and Tenenbaum (2003). There is a rich literature considering different constructions and algorithms for calculating the graph distance including Meng et al. (2008), Meng et al. (2007) and Yang (2004). In using the Euclidean distance within local neighborhoods, one needs to keep neighborhoods small to control the global approximation error. This creates problems when the sample size n is not sufficiently large and when the density  ρof the data points is not uniform over M but instead is larger in certain regions than others.

A good strategy for more accurate geodesic distance estimation is to improve the local

Euclidean approximation while continuing to rely on graph distance algorithms. A better local approximation leads to better global approximation error. This was the focus of a recent local geodesic distance estimator proposed in Wu et al. (2018) and Malik et al. (2019). Their covariance-corrected estimator adds an adjustment term to the Euclidean distance, which depends on the projection to the normal space. This provides a type of local adjustment for curvature, and they provide theory on approximation accuracy. However, their approach often has poor empirical performance in our experience, potentially due to statistical inaccuracy in calculating the adjustment and to lack of robustness to measurement errors.

We propose an alternative local distance estimator, which has the advantage of providing a simple and transparent modification of Euclidean distance to incorporate curvature. This is accomplished by approximating the manifold in a local neighborhood using a sphere, an idea proposed in Li et al. (2018) but for manifold learning and not geodesic distance estimation. Geodesic distance estimation involves a substantially different goal, and distinct algorithms and theory need to be developed. The sphere has the almost unique features of both accounting for non-zero curvature and having the geodesic distance between any two points in a simple closed form; even for simple manifolds the geodesic is typically intractable. We provide a transparent and computationally efficient algorithm, provide theory justifying accuracy and show excellent performance in a variety of applications including clustering, conditional density estimation and mean regression on multiple real data sets.

Throughout this paper, we assume M is a smooth compact Riemannian manifold with Riemannian metric  g. Letting γ(s) be a geodesic in arc length parameter s, the geodesic distance  dMis defined by

image

where  L(γ) :=� S0 g{γ′(s), γ′(s)}1/2dt is the length of curve  γ. Given points  X = {x1, · · · , xn}on the manifold, the goal is to estimate the pairwise distance matrix  GD ∈ Rn×n whereGDij = dM(xi, xj). First we propose a local estimator, that is, to estimate  GDij where xiand  xjare close to each other, and then follow the local-to-global philosophy to obtain a global estimator, for arbitrary  xi and xj.

2.1 Local Estimation

In this subsection, we focus on geodesic distance estimation between neighbors. The simplest estimator of  dM(xi, yi) is ∥xi − xj∥, denoted by  dE(xi, xj). However, the estimation error of the Euclidean distance depends on the curvature linearly. As a result, a nonlinear estimator incorporating curvature needs to be developed to achieve a smaller estimation error for curved manifolds. We propose a nonlinear estimator using spherical distance, which is motivated by the fact that osculating circles/spheres approximate the manifold better than tangent lines/spaces. On the osculating sphere, the geodesic distance admits an analytic form, which we use to calculate local geodesic distances.

Let  Sxi(V, c, r) be a ddimensional sphere centered at c with radius r in d+1 dimensional affine space  xi + V, approximating M in a local neighborhood of  xi. Letting π be theorthogonal projection from the manifold to the sphere, the spherical distance is defined as

image

the geodesic distance between  π(xi) and π(xj) on the sphere  Sxi(V, c, r).The spherical distance depends on the choice of sphere  Sxi(V, c, r), which will be discussed in section 2.3.

2.2 Global Estimation

We now consider global estimation of the geodesic distance  dM(xi, xj) for any xi, xj. Thepopular Isomap algorithm was proposed in Tenenbaum et al. (2000) for dimension reduction for manifolds isometrically embedded in higher dimensional Euclidean space. Isomap relies on estimating the geodesic distance using the graph distance based on a local Euclidean estimator. Let G be the graph with vertices  xi. For any two points  xi and xj that areclose to each other, Isomap estimates  dM(xi, xj) using ∥xi −xj∥. This leads to the following global estimator of  dM(xi, xj), for any two points  xi, xj ∈ X,

image

where P varies over all paths along  G having xi0 = xi and xip = xj. In particular, the global distance is defined by the length of the shortest path on the graph, where the length of each edge is given by the Euclidean distance. In practice, local neighbors are determined by a k-nearest neighbors algorithm, with the implementation algorithm given in Section 2.4.

The estimator in expression (2) has been successfully implemented in many different contexts. However, the use of a local Euclidean estimator  ∥xil − xil+1∥is a limitation, and one can potentially improve the accuracy of the estimator by using a local approximation that can capture curvature, such as  dS(xi, xj) in (1). This leads to the following alternative estimator:

image

where P is as defined for (2) and an identical graph paths algorithm can be implemented as for Isomap, but with spherical distance used in place of Euclidean distance in the local component.

2.3 Osculating Sphere

In order to calculate the local spherical distances necessary for computing (3), we first need to estimate ‘optimal’ approximating spheres within each local neighborhood, characterized by the k nearest neighbors of  xi, denoted by  X[k]i . The local sample covariance matrix is defined as Σk(xi) = k−1 �xj∈X[k]i (xj − xi)(xj − xi)T. The eigen-space spanned by the first d + 1 eigenvectors of Σk(xi), denoted by  V ∗ = span [evec1{Σk(xi)}, · · · , evecd+1{Σk(xi)}] isthe best estimator of the d+1 dimensional subspace V . Here we are ordering the eigenvectors by the corresponding eigenvalues in decreasing order.

Observe that the target sphere  Sxi(V ∗, c∗, r∗) passes through  xiso we have  r∗ = ∥c∗−xi∥.Hence, the only parameter to be determined is  c∗ and then r∗ = ∥c∗ − xi∥. To estimate  c∗,we propose a centered k-osculating sphere algorithm. Suppose  xj ∈ Sxi(V ∗, c∗, r∗), then the projection of  xj to xi + V ∗, denoted by  yj = xi + V ∗V ∗⊤(xj − xi), is among the zeros of the function  ∥y − c∗∥2 − r∗2 where r = ∥c∗ − xi∥ = ∥c∗ − yi∥. We use this to define a loss function for estimating c in Definition 1; related ‘algebraic’ loss functions were considered in Coope (1993) and Li et al. (2018).

Definition 1. Under the above assumptions and notations, let  c∗ be the minimizer of the following optimization problem:

image

Letting  r∗ = ∥xi − c∗∥, the sphere  Sxi(V ∗, c∗, r∗)is called the centered k-osculating sphere of X at xi.

We can tell from the definition that the centered sphere is a nonlinear analogue of centered principal component analysis to estimate the tangent space. There is one additional constraint for the centered k-osculating sphere: the sphere passes through  xi. This constraint is motivated by the proof of Theorem 4, see the supplementary materials.

Observe that the optimization problem is convex with respect to c and we can derive a simple analytic solution, presented in the following theorem.

Theorem 1. The minimizer of the optimization problem (4) is given by:

image

where H = �xj∈X[k]i (yj − yi)(yj − yi)⊤ and f = �xj∈X[k]i (∥yj∥2 − ∥yi∥2)(yj − yi).

2.4 Algorithms

In this subsection, we present algorithms to calculate the spherical distance. Before considering algorithms for distance estimation, we present the algorithm for the centered k-osculating sphere, shown in Algorithm 1.

In real applications where the data are noisy, we recommend replacing the centered k-osculating sphere by an uncentered version because in this case the base point x may not be on the manifold so shifting toward x can negatively impact the performance. In addition, the constraint  r = ∥xi − c∥restricts the degrees of freedom when choosing the optimal r. The only difference is that instead of centering at the base point x and forcing  r = ∥xi − c∥,we instead shift  xito the mean ¯x = 1n�ni=1 xiand average  ∥xj − c∥, as shown in Algorithm 2.

image

From Algorithm 3 we obtain the local pairwise distance matrix  SD, where SDij denotes

the distance between  xi and xj. However, if  xi and xjare not neighbors of each other, the distance will be infinity, or equivalently speaking there is no edge connecting  xi and xjin graph G. Then we need to convert the local distance to global distances by the graph distance proposed in Section 2.2. There are multiple algorithms for shortest path search on graphs including the Floyd-Warshall algorithm (Floyd (1962) and Warshall (1962)) and Dijkstra’s algorithm (Dijkstra (1959)); here we adopt the Dijkstra’s algorithm, which is easier to implement. Algorithm 4 shows how to obtain the graph spherical distance from local spherical distance.

image

We note that in the local estimation, the computational complexity for  dM(xi, xj) isO(min{k, D}3), where kis assumed to be much smaller than n. To compare with, the computational complexity of  dE(xi, xj) is O(D). Hence, in general, we are not introducing more computation cost by replacing the local Euclidean distance by the local spherical distance unless d is not very small relative to D. Once the graph is determined, the computational complexity of Dijkstra’s algorithm is  O(n2), where nis the sample size, and this complexity does not depend on which local distance is applied to obtain the weights on the graph G. Hence, the total computational complexity for the graph Euclidean distance estimator is  O�nkD + n2�while the complexity for the graph spherical distance estimator is  O�n min{k, D}3 + n2�.

In this section, we analyze why the spherical distance is a better estimator than the Euclidean distance from a theoretical perspective following a local-to-global philosophy.

3.1 Local Error

First we study the local error, that is,  |dS(x, y) − dM(x, y)| for y ∈ B¯r(x), where B¯r(x) isthe geodesic ball on M centered at x with radius ¯r. It is well known that the error of the Euclidean estimator is third order, as formalized in the following proposition.

Proposition 1.  Assume dM(x, y) = s, then

image

with 1r0 = sup {∥γ′′(s)∥}, where γvaries among all geodesics on M in arc length parameter. In terms of the error rate,  dE(x, y) = s + O(s3).

These bounds are tight and the proof can be found in Smolyanov et al. (2007). The Euclidean distance is a simple estimator of the geodesic distance, and the error is s324r20 .While this may seem to be a good result, if the manifold has high curvature, so that  r0 isvery small, performance is not satisfactory. This is implied by the  r−20multiple on the error rate, and is also clearly apparent in experiments shown later in the paper.

Now we consider the error of the spherical distance proposed in section 2.1. For simplicity, we first consider the case in which  M = γis a curve in  R2 with domain [0, S]. Without loss of generality, fix  x = γ(0) but vary  y = γ(s). Let nbe the unit normal vector of  γ at x, thatis  γ′′(0) = κn. Let r = 1|κ| and c = x − 1κn, which determine a circle  Cx(c, r) centered at c with radius r. This circle  Cx(c, r) is called the osculating circle of the curve  γ, which is the “best” circle approximation to the curve. Letting  π : γ → Cx(c, r) be the projection to the osculating circle, the error in  dS(x, y) as an estimator of  dM(x, y) is shown in the following theorem.

Theorem 2. Let x = γ(0) and y = γ(s), so dM(x, y) = s, then

image

Comparing to the error of Euclidean estimation in Proposition 1, the spherical estimate improves the error rate from  O(s3) to O(s4).

The above result is for curves, and as a second special case we suppose that  Md ⊂ Rd+1is a d dimensional hyper-surface. Similar to the curve case, the spherical distance can be defined on any sphere  Sx(c, r) passing through x with center c and radius  r where c = x− 1κnand n is the normal vector of the tangent space  TxM, r = 1|κ|.However, for geodesics along different directions, denoted by  γv := expx(sv) where v ∈ UTxM, the curvature κv(x) defined by  γ′′v(0) = κv(x)nmight be different. Let  κ2(x) = supv∈UTxM κv(x) andκ1(x) = infv∈UTxM κv(x), where the maximum and minimum can be achieved due to the compactness of  UTxM. Fix any κ0(x) ∈ [κ1(x), κ2(x)], let Sx(c, r) be the corresponding sphere, and  π : M → Sx(c, r) be the projection. The estimation error is given by the following theorem.

image

estimation error of spherical distance is given by

image

In the worst case, the error has the same order as that for the Euclidean distance. However, there are multiple cases where the error is much smaller than the Euclidean one, shown in the following corollary.

Corollary 1. Under the same conditions in Theorem 3,

(1) If κy = κ0 or κy = 2κ0, then dS(x, y) = s + O(s4).

(2) If |κ2(x) − κ1(x)| < ¯r, then dS(x, y) = s + O(¯rs3).

Assume  κ0 = κv0, then for all  y ∈ {expx(sv) | |κv − κ0| ≤ ¯r}, which is a neighborhood of the geodesic expx(tv0), spherical estimation outperforms the Euclidean estimation. The closer to the central geodesic, the better the estimation performance. For a point x where κv(x) is not changing rapidly along different directions, the spherical estimation works well in the geodesic ball  B¯r(x).

Finally we consider the most general case: M is a d dimensional manifold embedded in RD for any D > d. Let Sx(c, r) be a ddimensional sphere whose tangent space is also  TxM.Letting  πbe the projection to the sphere, the estimation error is given by the following theorem.

Theorem 4. Fix x ∈ M, for y = expx(sv) such that dM(x, y) = s, then the estimation error of spherical distance is given by

image

Combining Theorem 2-4, we conclude that spherical estimation is at least the same as Euclidean estimation in terms of the error rate, and in many cases, the spherical estimation outperforms the Euclidean one.

3.2 Global Error

In this section we analyze the estimation error:  |dSG(x, y)−dM(x, y)| for any x, y ∈ M. Theidea is to pass the local error bound to the global error bound. We use the same notation introduced in Section 2.2.

Theorem 5. Assume M is a compact, geodesically convex submanifold embedded in  RD and{xi}ni=1 ⊂ Mis a set of points, which are vertices of graph G. Introduce constants  ϵmin > 0,ϵmax > 0, 0 < δ < ϵmin /4 and let Cbe the constant such that  |dS(x, y) − dM(x, y)| ≤dM(x, y){1 + Cd2M(x, y)}according to Theorem 4. Suppose

1. G contains all edges  xy with ∥x − y∥ ≤ ϵmin.

2. All edges xy in G have length  ∥x − y∥ ≤ ϵmax.

3.  {xi}ni=1 is a δ-net of M, that is, for any  x ∈ M, there exists  xi such that dM(x, xi) ≤ δ.

Then for any  x, y ∈ M

image

As the sample size grows to infinity,  δ, ϵmin, ϵmax →0 and we can carefully choose the size of the neighborhood so that  δ/ϵmin →0. As a result,  λ1, λ2 → 0 so dS(x, y) → dM(x, y)uniformly.

4.1 Euler Spiral

We test the theoretical results on generated data from manifolds in which the geodesic distance is known so that we can calculate the error. The first example we consider is the Euler spiral, a curve in  R2. The Cartesian coordinates are given by Fresnel integrals:

image

The main feature of the Euler spiral is that the curvature grows linearly, that is,  κ(s) = s.We generate 500 points uniformly on [0, 2]. Then we fix  x = γ(1.6) and choose ¯r = 0.04, so there are 20 points falling inside the geodesic ball  B¯r(x), denoted by  y1, · · · , y20. Then wecan calculate the Euclidean  ∥yi − x∥and the spherical distance  dS(x, yi).

image

Figure 2: Local error for Euler spiral

The covariance-corrected geodesic distance estimator (Malik et al. (2019)) can be viewed as the state-of-the-art. We compare the spherical distance with both Euclidean distance and the covariance-corrected distance. Figure 2a is the spiral and Figure 2b contains the error plot for the three algorithms. To visualize the rate, we also present the log  −log plot in Figure 2c. The results match our theory and the spherical estimator has the smallest error among these three algorithms.

Then we consider the global error. By the definition of arc length parameter, the pairwise geodesic distance matrix is given by  GDij = |si−sj|. Denote the Euclidean pairwise distance matrix by D, the graph distance based on Euclidean distance, covariance-corrected distance and spherical distance by EG, CG and SG, respectively. As the most natural measurement of the global error, we calculate and compare the following norms:

image

Table 1 shows the global error when the total sample size is 500 and k is chosen to be 3. Furthermore, we vary the curvature from [0, 1] to [3, 4] to assess the influence of curvature on these estimators.

Table 1: Global error for Euler spiral

image

The global Euclidean distance is by far the worst and the graph spherical distance is the best in all cases. Furthermore, as the curvature increases, the spherical error increases the most slowly. This matches the theoretical analysis, since the spherical estimator takes the curvature into consideration.

In real applications, almost all data contain measurement error, so the data may not exactly lie on some manifold, but instead may just concentrate around the manifold with certain noise. The robustness of the algorithm with respect to the noise is a crucial feature. To assess this, we generate samples from the Euler spiral and add Gaussian noise  ϵi ∼N(0, σ2IdD) where σis the noise level. In this setting the local error is not very meaningful since  xiis no longer on the manifold. However, the global error is still informative since the pairwise distance matrix contains much information about the intrinsic geometry of the manifold. Since the ground truth  dM(xi, xj) is not well defined, we firstly apply the Graph Euclidean distance to a large data set, and treat these results as ground truth GD. The reason is that when the sample size is large enough, all the above global estimators converge to the true distance except for D. Then we subsample a smaller dataset and apply these global estimators to obtain EG, CG and SG and compute the error. We test on different subsample sizes to assess the stability of the algorithms and the performance on small data sets.Figure 3 shows that spherical estimation works well on very small data sets, because it efficiently captures the geometry hidden in the data.

image

Figure 3: Global error for noisy Euler spiral

4.2 Torus

We also consider the torus, a two dimensional surface with curvature ranging from negative to positive depending on the location. We set the major radius to be R = 5 and the minor radius to be r = 1 so the equation for the torus is

image

Since the geodesic distance on the torus does not admit an analytic form, we apply the same strategy as in the noisy Euler Spiral case. First we generate a large dataset and apply the Graph Euclidean method to obtain the “truth”, then estimate the distance through a subset and finally compute the error. Similarly, we also consider the noisy case by adding Gaussian noise to the torus data. The results are shown in Figure 4, which demonstrates that the performance of the spherical estimation is the best for both clean and noisy data.

image

Figure 4: Global error for (noisy) torus

In this section we consider three applications of geodesic distance estimation: clustering, conditional density estimation and regression.

5.1 k-Smedoids clustering

Among the most popular algorithms for clustering, k-medoids (introduced in Kaufman et al. (1987)) takes the pairwise distance matrix as the input; refer to Algorithm 5 (Park and Jun (2009)). Similar to k-means, k-medoids also aims to minimize the distance between the points in each group and the group centers. Differently from k-means, the centers are chosen from the data points instead of arbitrary points in the ambient space.

image

In most packages, the default pairwise distance matrix is the global Euclidean distance D, which is inaccurate if the support of the data has essential curvature. As a result, we replace D by SG and call the new algorithm k-Smedoids. By estimating GD better, it is reasonable to expect that k-Smedoids has better performance.

We present two types of examples: unlabeled (example 1) and labeled data (example 2 and 3). For the unlabeled data, we visualize the clusters to show the performance of different algorithms, for the labeled datasets, we can make use of the labels and do quantitative comparisons. Among clustering performance evaluation metrics, we choose the following: Adjusted Rand Index (ARI, Hubert and Arabie (1985)), Mutual Information Based Scores (MIBS, Strehl and Ghosh (2002), Vinh et al. (2009)), HOMogeneity (HOM), COMpleteness (COM), V-Measure (VM, Rosenberg and Hirschberg (2007)) and Fowlkes-Mallows Scores (FMS, Fowlkes and Mallows (1983)). We compare these scores for standard k-medoids, k-Emedoids and our k-Smedoids. These algorithms are based on different pairwise distance matrices while other steps are exactly the same, so the performance will illustrate the gain from the estimation of the geodesic distance. We note that for all above metrics, larger values reflect better clustering performance.

image

tuned accordingly. In example 1, the data are visualizable so d = 1 is known and k can be tuned by the clustering performance: whether the two circles are separated. For example 2-3, cross validation can be applied to tune the parameters based on the six scores. In any case with quantitative scores, cross validation can be used to tune the parameters mentioned above. Our recommended default choices of k are uniformly distributed integers between d + 2 and n2 , proportion to  √n. Estimating the manifold dimension d has been proven to be a very hard problem, both practically and theoretically. There are some existing methods to estimate d, see Granata and Carnevale (2016), Levina and Bickel (2005), K´egl (2003), Camastra and Vinciarelli (2002), Carter et al. (2009), Hein and Audibert (2005), Camastra and Vinciarelli (2001) and Fan et al. (2009), and we can apply these algorithms directly. Example 1: Two ellipses.

image

Figure 5: Clustering performance for a two ellipses example

We randomly generate 100 samples from two concentric ellipses with eccentricity√3/2added by zero mean Gaussian noise. We compare with k-means, standard k-medoids and k-Emedoids. Figure 5 shows the clustering results for the two ellipses data. In this example, we set K = 2, k = 3 and d = 1 since the support is a curve with dimension 1.

Since the two groups are disconnected and curved, the Euclidean-based algorithms fail

while the spherical algorithm works better than using other geodesic distance estimators. We also consider two real datasets with labels. Example 2: Banknote.

The Banknote data set is introduced in Lohweg and Doerksen (2012). There are D = 4 features, characterizing the images from genuine and forged banknote-like specimens and the sample size is 1372. The binary label indicates whether the banknote specimen is genuine or forged.

Table 2 shows the clustering performance of three algorithms for the Banknote data. We can see that k-Smedoids has the highest score for all 6 metrics. In this example, K = 2 is known, and we set k = 4 and d = 1. The choice of d and k are determined by cross validation.

Table 2: Clustering performance for Banknote

Table 3: Clustering performance for Galaxy Zoo

image

Example 3: Galaxy zoo data.

The last example is from the Galaxy Zoo project available at http://zoo1.galaxyzoo.org. The features are the fraction of the vote from experts in each of the six categories, and the labels represent whether the galaxy is spiral or elliptical. We randomly choose 1000 samples from the huge data set.

Table 3 shows the clustering performance of three algorithms for the Galaxy zoo data. We can see that k-Smedoids has the highest score for all 6 metrics. In this example K = 2, and the parameters k = 6, d = 1 are determined by cross validation.

5.2 Geodesic conditional density estimation

Conditional density estimation aims to estimate f(y|x) based on observations  {(xi, yi)}ni=1where  xi ∈ RDare predictors and  yi ∈ Rare responses. The most popular conditional density estimator involving pairwise distance is the kernel density estimator (KDE) with Gaussian kernel (Davis et al., 2011):

image

where  h1 and h2are bandwidths. This method is motivated by the formula  f(y|x) = f(x,y)f(x) ,using kernel density estimation in both the numerator and the denominator.

As discussed before, if the data have essential curvature, Euclidean distance can’t capture the intrinsic structure in the data. Instead, we can improve the performance by replacing the Euclidean distance by geodesic distance. That is, the natural estimator is

image

where d is the (estimated) geodesic distance. The kernel  e−d(xi,x)2/h corresponds to the Riemannian Gaussian distribution (Said et al., 2017).

In terms of the distance, we have four pairwise distances between training data: global Euclidean distance D, graph Euclidean distance ID, graph covariance corrected distance CD and our proposed graph spherical distance SD. For any given  x, d(x, xi) is obtained by interpolation. First we add x to the graph consists of all training data and connect x with its neighbors. Then we calculate the graph distance between  x and xi. This is more efficient than calculating pairwise distances between all samples  Xtrain ∪ Xtest. The algorithm is formulated in Algorithm 6:

image

We can replace the distance estimator in the first step by any other algorithm to obtain the corresponding version of conditional density estimation. To compare the performance, we estimate the conditional density through training data  Xtrain, Ytrainand calculate the sum of log likelihood �ntesti=1 log( ˆf(yi|xi)). We randomly permute the data to obtain different training and test sets and provide boxplots for the sum of log likelihoods.

Regarding the tuning parameters, k and d have been discussed in Section 5.1. Regarding bandwidths  h1 and h2, there is a very rich literature on choosing optimal bandwidths in other contexts. For simplicity we use cross validation to estimate  h1 and h2. We consider the following two real data sets. Example 4: Combined Cycle Power Plant. This data set is introduced in T¨ufekci (2014); Kaya et al. (2012), containing 9568 samples collected from a Combined Cycle Power Plant from 2006 to 2011. There are 4 predictors: hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), to

predict the net hourly electrical energy output (EP) of the plant (response). We randomly sample 1000 data points and repeat 100 times to obtain the following boxplots for the sum of log likelihood scores for different distance estimation methods, as shown in Figure 6. There is a clear improvement for our graph spherical approach.

image

Figure 6: Sum of log likelihood for Combined Cycle Power Plant data set

Example 5: Concrete Compressive Strength. This data set is introduced in Yeh (1998), containing 1030 samples with 8 predictors: cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate and age, to predict the concrete compressive strength. We randomly split the data 50  −50 as training and test data and repeat for 100 times to obtain the boxplots for the sum of log likelihood scores for different distance estimation methods, as shown in Figure 7.

image

Figure 7: Sum of log likelihood for Concrete Compressive Strength data set

From the above two examples we can tell that Euclidean distance is the worst choice because the predictors have non-linear support. Graph Euclidean and covariance corrected distances improve the performance a lot, while graph spherical distance outperforms all competitors.

5.3 Geodesic kernel mean regression

As a related case, we also consider kernel mean regression using a simple modification of the Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964):

image

Algorithm 7 provides details:

image

Again we have four options for the distance in the first step, D, ID, CD and our proposed SD. We measure the performance by calculating the Root Mean Square Error (RMSE)��ntesti=1 ( �m(xi) − yi)2/ntest, and again use cross validation for bandwidth choice. We consider the same two datasets as in Section 5.2 and show the results in Figure 8 and 9: Again it is clear that spherical distance outperforms the competitors.

The choice of distance between data points plays a critical role in many statistical and machine learning procedures. For continuous data, the Euclidean distance provides by far the most common choice, but we have shown that it can produce badly sub-optimal results in a number of real data examples, as well as for simulated data known to follow a manifold structure up to measurement error. Our proposed approach seems to provide a clear improvement upon the state-of-the-art for geodesic distance estimation between data points lying close to an unknown manifold, and hence may be useful in many different contexts.

There are multiple future directions of immediate interest. The first is the question of how the proposed approach performs if the data do not actually have an approximate

image

Figure 8: Regression RMSE for Combined Cycle Power Plant data set

manifold structure, and whether extensions can be defined that are adaptive to a variety of true intrinsic structures in the data. We have found that our approach is much more robust to measurement errors than competing approaches; in practice, it is almost never reasonable to suppose that data points fall exactly on an unknown Riemannian manifold. With this in mind, we allow measurement errors in our approach, so that the data points can deviate from the manifold. This leads to a great deal of flexibility in practice, and is likely a reason for the good performance we have seen in a variety of real data examples. However, there is a need for careful work on how to deal with measurement errors and define a single class of distance metrics that can default to Euclidean distance as appropriate or include other structure as appropriate, in an entirely data-dependent manner.

Acknowledgement

The authors acknowledge support for this research from an Office of Naval Research grant N00014-14-1-0245/N00014-16-1-2147 and a National Institute of Health grant 5R01ES027498-02.

Bernstein, M., De Silva, V., Langford, J. C., and Tenenbaum, J. B. (2000). Graph ap- proximations to geodesics on embedded manifolds. Technical report, Technical report, Department of Psychology, Stanford University.

Camastra, F. and Vinciarelli, A. (2001). Intrinsic dimension estimation of data: An approach based on Grassberger–Procaccia’s algorithm. Neural Processing Letters, 14(1):27–34.

image

Figure 9: Regression RMSE for Concrete Compressive Strength data set

Camastra, F. and Vinciarelli, A. (2002). Estimating the intrinsic dimension of data with a fractal-based method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(10):1404–1407.

Carter, K. M., Raich, R., and Hero III, A. O. (2009). On local intrinsic dimension estimation and its applications. IEEE Transactions on Signal Processing, 58(2):650–663.

Coope, I. D. (1993). Circle fitting by linear and nonlinear least squares. Journal of Optimization Theory and Applications, 76(2):381–388.

Davis, R. A., Lii, K.-S., and Politis, D. N. (2011). Remarks on some nonparametric estimates of a density function. In Selected Works of Murray Rosenblatt, pages 95–100. Springer.

Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271.

Fan, M., Qiao, H., and Zhang, B. (2009). Intrinsic dimension estimation of manifolds by incising balls. Pattern Recognition, 42(5):780–787.

Floyd, R. W. (1962). Algorithm 97: shortest path. Communications of the ACM, 5(6):345.

Fowlkes, E. B. and Mallows, C. L. (1983). A method for comparing two hierarchical clus- terings. Journal of the American Statistical Association, 78(383):553–569.

Granata, D. and Carnevale, V. (2016). Accurate estimation of the intrinsic dimension using graph distances: Unraveling the geometric complexity of datasets. Scientific Reports, 6:31377.

Hein, M. and Audibert, J.-Y. (2005). Intrinsic dimensionality estimation of submanifolds in r d. In Proceedings of the 22nd International Conference on Machine learning, pages 289–296. ACM.

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193– 218.

Kaufman, L., Rousseeuw, P., and Dodge, Y. (1987). Clustering by means of medoids in statistical data analysis based on the l1 norm and related method. North-Holland, pages 405–416.

Kaya, H., T¨ufekci, P., and G¨urgen, F. S. (2012). Local and global learning methods for predicting power of a combined gas & steam turbine. In Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering (ICETCEE), pages 13–18.

K´egl, B. (2003). Intrinsic dimension estimation using packing numbers. In Advances in Neural Information Processing Systems, pages 697–704.

Levina, E. and Bickel, P. J. (2005). Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, pages 777–784.

Li, D., Mukhopadhyay, M., and Dunson, D. B. (2018). Efficient manifold and subspace approximations with spherelets. arXiv preprint arXiv:1706.08263.

Lohweg, V. and Doerksen, H. (2012). UCI machine learning repository.

Malik, J., Shen, C., Wu, H.-T., and Wu, N. (2019). Connecting dots: from local covariance to empirical intrinsic geometry and locally linear embedding. Pure and Applied Analysis, 1(4):515–542.

Meng, D., Leung, Y., Xu, Z., Fung, T., and Zhang, Q. (2008). Improving geodesic distance estimation based on locally linear assumption. Pattern Recognition Letters, 29(7):862–870.

Meng, D., Xu, Z., Gu, N., and Dai, M. (2007). Estimating geodesic distances on locally linear patches. In Signal Processing and Information Technology, 2007 IEEE International Symposium on, pages 851–854. IEEE.

Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1):141–142.

Park, H.-S. and Jun, C.-H. (2009). A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications, 36(2):3336–3341.

Rosenberg, A. and Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Said, S., Bombrun, L., Berthoumieu, Y., and Manton, J. H. (2017). Riemannian gaussian distributions on the space of symmetric positive definite matrices. IEEE Transactions on Information Theory, 63(4):2153–2170.

Silva, V. D. and Tenenbaum, J. B. (2003). Global versus local methods in nonlinear di- mensionality reduction. In Advances in Neural Information Processing Systems, pages 721–728.

Smolyanov, O. G., Weizs¨acker, H. v., and Wittich, O. (2007). Chernoff’s theorem and dis- crete time approximations of brownian motion on manifolds. Potential Analysis, 26(1):1– 29.

Strehl, A. and Ghosh, J. (2002). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(Dec):583–617.

Tenenbaum, J. B., De Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323.

T¨ufekci, P. (2014). Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126–140.

Vinh, N. X., Epps, J., and Bailey, J. (2009). Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1073–1080. ACM.

Warshall, S. (1962). A theorem on boolean matrices. In Journal of the ACM. Citeseer.

Watson, G. S. (1964). Smooth regression analysis.  Sankhy¯a: The Indian Journal of Statistics,Series A, pages 359–372.

Wu, H.-T., Wu, N., et al. (2018). Think globally, fit locally under the manifold setup: Asymptotic analysis of locally linear embedding. The Annals of Statistics, 46(6B):3805– 3837.

Yang, L. (2004). K-edge connected neighborhood graph for geodesic distance estimation and nonlinear data projection. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 1, pages 196–199. IEEE.

Yeh, I.-C. (1998). Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808.

7.1 Centered k-osculating sphere solution

We solve the optimization problem described in Definition 1.

Proof of Theorem 1. Our goal is to minimize the function

image

First we simply f(c) and omit the terms that do not depend on c in the third equation:

image

where  H = �xj∈X[k]i

�(yj − yi)(yj − yi)⊤�and f = �xj∈X[k]i (∥yj∥2 − ∥yi∥2)(yj − yi).This is a quadratic function with respect to c and the minimizer is given by

image

Then  r∗ = ∥xi − c∗∥is the radius of the centered k-osculating sphere.

7.2 Local estimation error

In this section we prove Theorem 2-4.

7.2.1 Proof for curves

First we prove Theorem 2, the error bound for curves.

Proof of Theorem 2. At the fixed points  x = γ(0), let t = γ′(0) and n = γ′′(0)∥γ′′(0)∥, then (−n, t)is an orthonormal basis at x. Then the Taylor expansion  γ(s) = γ(0) + γ′(0)s + s22 γ′′(0) +

image

where  v1 and v2are unknown constants subject to the constraint  ∥γ′(s)∥= 1. Observe that

image

κ >0; the proof is the same when  κ ≤ 0.

Let  θbe the angle between  x − c and y − c. Then the spherical distance between  π(y)and  x is rθ.In section 2.1, we characterize  θ by cos(θ) = r arccos�x−cr · π(y)−cr �, herewe characterize  θ by tan(θ) for computational simplicity. Let  ytbe the intersection of the tangent space spanned by  γ′(0) and the straight line connecting  y and c, and let yl be theprojection of y onto the tangent space. Then we can focus on the triangle  △cxyl withx − c//yl − y. Observe that

image

so yt − x∥∥y − yl∥ = r∥yt − x∥ − r∥yl − x∥, hence (r − ∥y − yl∥)∥yt − x∥ = r∥yl − x∥. As aresult,

image

By the definition of  yl, we can write  yl =

image

image

Finally, by the Taylor expansion of arctan, the spherical distance is

image

The last step results from the fact that  κ = 1r.

7.2.2 Proof for hyper-surfaces

Based on the proof of Theorem 2, we next prove Theorem 3, the error bound for hyper-surfaces.

Proof of Theorem 3. Define γ(s) := expx(sv) and denote the normal vector of  TxM by n.Let  v = γ′(0) and expand v to an orthonormal basis of  TxM, denoted by  {v, v2, · · · , vd}.Since  n ⊥ TxM, {−n, v, v2, · · · , vd}forms a basis for  Rd+1. Before doing Taylor expansion, we first prove that  γ′′(0)//n, where //representing parallel relation. Denote the projection to  n by ⊥ and the covariant derivative by  ∇. Since γis a geodesic, we have

image

which implies  γ′′(0)//n. By Taylor expansion, we rewrite  γin terms of the new coordinates:

image

Again, by the constraint  ∥γ′∥= 1, we have  α2 = −κ2y6 . As before, denote the intersection of y − c and TxM by ytand the angle between  x − c and y − c by θ, so

image

By direct calculation, we derive that the coordinates for  yt as

image

This is exactly the same as Equation 8. Then similar to the proof of Equation 9, we conclude that

image

Now we can prove Corollary 1.

Proof of Corollary 1. (1) From Equation 10, the third order term vanishes if and only if

image

(2) Again by Equation 10, we have

image

7.2.3 Proof for general cases

Finally we prove Theorem 4.

Proof of Theorem 4. For y = expx(sv) = γ(s), let v1 = v, and expand  v1to an orthonormal basis of  TxMdenoted by  {v1, · · · , vd}. Similarly, fix any orthonormal basis of the normal space  TxM⊥, denote by  {n1, · · · , nD−d}. The same as the proof of Theorem 3,  γ′′(0) ∈TxM⊥, so the Taylor expansion of  γcan be expressed in the new coordinates as follows:

image

where  γ′′(0) = [α1, · · · , αD−d, 0d]⊤.

Assume  c = x + rn where n = �D−di=1 βiniis a unit vector in the normal space  TxM⊥so �D−di=1 β2i = 1. The idea of the proof is to connect  s and dS(x, y) by the tangent space. To be more specific, denote the projection onto  TxM by Px, and define  yl = Px(y) andys = Px {π(y)}. Then it suffices to show the following three statements:

image

iii  ∥ys − x∥ = s + O(s3).

Observe that the first statement implies that the Euclidean distance between base point and the projection to the tangent space is an estimator of the geodesic distance with error  O(s3).Since  TxMis the common tangent space of  M and Sx(c, r), Statement i implies Statement ii. So it suffices to show Statement i and Statement iii. Before we prove the statements, we need to calculate the coordinates of  yl and ys. By the definition of  Px, we have

image

Similarly, by the definition of  πand linearity of  Px,

image

Now we can calculate the distances involved in the statements. Firstly,  ∥yl−x∥ = s+O(s3) =dM(x, y) + O(s3) is a direct consequence of Equation 11.

Accordingly to Equation 12, the only missing part to calculate  ∥ys−x∥, is ∥y−c∥. Recallthat  c = x + r �D−di=1 βini, so

image

As a result,

image

7.3 Global estimation error

In this section we prove the global error bound stated in Theorem 5.

Proof of Theorem 5. Before proving the inequalities, we define the graph geodesic distance

image

where P varies over all paths along  G with x0 = x and xp = y. Clearly we have  dM(x, y) ≤dG(x, y) for any x, y ∈ Mand any graph G. The idea is to show  dSG(x, y) ≈ dG(x, y) ≈dM(x, y).

First we prove the first inequality: (1  − λ1)dM(x, y) ≤ dSG(x, y). Assume P = {xi}pi=1minimizes  dSG. Then we have

image

By theorem 2 in Bernstein et al. (2000),  dG(x, y) ≤ (1+ 4δϵmin )dM(x, y). Combining the above two equalities we conclude that  dSG(x, y) ≤ (1 + λ2)dM(x, y), where λ2 = 4δϵmin + Cϵ2max +

image


Designed for Accessibility and to further Open Science