Learning functions varying along a central subspace

2020·Arxiv

Abstract

Abstract

Many functions of interest are in a high-dimensional space but exhibit low-dimensional structures. This paper studies regression of a s-H¨older function which varies along a central subspace of dimension . A direct approximation of an accuracy requires the number of samples n in the order of . In this paper, we analyze the Generalized Contour Regression (GCR) algorithm for the estimation of the central subspace and use piecewise polynomials for function approximation. GCR is among the best estimators for the central subspace, but its sample complexity is an open question. We prove that GCR leads to a mean squared estimation error of ) for the central subspace, if a variance quantity is exactly known. The estimation error of this variance quantity is also given in this paper. The mean squared regression error of f is proved to be in the order of (where the exponent depends on the dimension of the central subspace d instead of the ambient space D. This result demonstrates that GCR is effective in learning the low-dimensional central subspace. We also propose a modified GCR with improved efficiency. The convergence rate is validated through several numerical experiments.

1 Introduction

A vast majority of statistical inference and machine learning problems can be modeled as regression, where the goal is to estimate an unknown function from a finite number of training samples. Nowadays, new challenges are introduced to regression and prediction due to the rise of high-dimensional data in many fields of contemporary science. The well-known curse of dimensionality implies that, in order to achieve a fixed accuracy in prediction, the number of training data must grow exponentially with respect to the data dimension, which is beyond practical applications.

Fortunately, functions of interest in applications often exhibit low-dimensional structures. In many situations, the response may depend on few variables, or a low-dimensional subspace. For example, in Bioinformatics, the cDNA microarray data have thousands of dimension but an effective classification of tumor types only depends on a subspace of dimension one or two [4]. In engineering, the coefficients of certain elliptic partial differential equations are parameterized by many variables but has a small effective dimension [8]. In photovoltaic industry, there are five input parameters in the single-diode solar cell model while the maximum power output only depends on a linear combination of these five parameters [9]. Similar low-dimensional models appear in optimization [22], optimal control [55], uncertainty quantification [2], text classification [30] and biomedicine [43, 44].

These applications motivate us to consider the regression model

where . The columns of Φ consist of an orthonormal basis of a d-dimensional subspace, i.e., Φ. The random variable models noise, which is independent of the ’s. In this model, the function f is defined on but only depends on the projection of x to the subspace spanned by Φ, denoted by be such a space of minimum dimension: ) is not a constant function for any the central subspace [10], or the Effective Dimension-Reduction subspace [37, 14, 54]. In other words, the function f only varies along the central subspace, and remains the same along any orthogonal direction of the central subspace. This model is also called the single-index model for d = 1 and the multi-index model for

where the ’s are independently drawn from a probability measure and the a prior knowledge that f varies along a d-dimensional central subspace, we aim at constructing an empirical estimator and studying the Mean Squared Error (MSE)

The central interest of this problem is to estimate the central subspace Φ. A class of methods is based on the gradient is proportional to the Φ direction, which allows one to estimate Φ from average of the empirical gradients [46, 24, 25]. When 2, the covariance matrix of is given by Φgives an estimate of Φ as the eigenspace associated with the top d eigenvalues of this covariance matrix [28, 8]. If the gradient can be accurately estimated, these gradient-based methods lead to a -consistent estimation of Φ [25, 8]. In general, the gradient estimation in an exponentially large number of samples in dimension D.

The single-index model with d = 1 has been extensively studied in literature. In this case, is called the link function. The minimax mean squared regression error while the link function belongs to the s-H¨older class was proved to be ) [20, 33, 6]. These results demonstrate that the optimal algorithm can automatically adapt to the central subspace of dimension 1. For estimation, a lot of methods based on non-convex optimization have been proposed [33, 23, 29, 53] while achieving the global minimum is not guaranteed. It was proved in [23, Chapter 22] that minimizing the empirical risk over the central subspace of dimension 1 and approximating g with piecewise polynomials simultaneously give rise to the MSE of ). The main challenge of this approach is to obtain the global minimum due to non-convexity in the optimization. In the context of regression with point queries, an adaptive query algorithm was proposed for the single-index model in [7] with performance guarantees. This algorithm was generalized to the multi-index model in [18]. In the standard regression setting, adaptive queries are not allowed, and the samples are usually given before learning starts.

Estimating the central subspace is related with sufficient dimension reduction in statistics. A class of methods related with inverse regression has been developed to estimate the central subspace Φ (see [41, 34] for a comprehensive review). Sliced Inverse Regression (SIR) [38] is the first and most well known method. The term “inverse regression” refers to the conditional expectation E(x|y). In SIR, the central subspace is estimated as the eigenspace associated with the top d eigenvalues of Cov()). Similar techniques include kernel inverse regression [56], parametric inverse regression [3] and canonical correlation estimator [19]. These methods are referred as first-order methods since the first-order statistical information is utilized, such as the conditional mean. Their performance crucially depends on the function g and the distribution of data. If g is symmetric about 0 along some directions, then those directions can not be recovered by the first-order methods. This issue about symmetry is better addressed by second-order methods which utilize the variance and covariance information of data. Popular second-order methods include sliced inverse variance estimation [11], sliced average variance estimation (SAVE) [12, 16], average variance estimation (MAVE) [53], contour regression [36], directional regression [35], principal Hessian directions [39], a hybrid SIR and SAVE [57], etc. New methods are also developed in a multiscale framework [32, 31]. For the single-index model, Smallest Vector Regression (SVR) is proposed in [32] in a multiscale framework, which combines the idea of SIR and SAVE. A performance analysis is provided for the index estimation and regression, while the assumptions are favorable to monotonic g.

This paper focuses on a second-order method called Generalized Contour Regression (GCR) introduced by Li, Zha and Chiaromonte [36]. In comparison with SIR and SAVE mentioned above, GCR has advantages in the case where g is not monotonic. Empirical experiments have demonstrated the success of GCR for the estimation of the central subspace, but its sample complexity is still not well understood yet. In this paper, we analyze the error behavior of the regression scheme which consists of GCR for the central subspace estimation and piecewise polynomial approximation of g. We also propose a modified GCR with improved efficiency. Our contributions are:

(i) We prove that GCR estimates the central subspace with a mean squared estimation error of ), if a variance quantity is exactly known. The estimation error of this variance quantity is also given in this paper.

(ii) Our regression scheme gives rise to the MSE of f in the order of (demonstrates that the MSE decays exponentially with an exponent depending on the dimension of the central subspace d, instead of the ambient dimension D.

(iii) Our modified GCR improves over the original GCR in efficiency. Numerical experiments demonstrate that the modified GCR has the same convergence rate as that of the original GCR in our statistical theory.

This paper is organized as follows: We first introduce SCR and GCR in Section 2. Our regression scheme and main results are stated in Section 3. Numerical experiments are provided in Section 4 to validate our theory. We present proofs in Section 5 and conclude in Section 6.

We use lowercase bold letters and capital letters to denote vectors and matrices respectively. is the Euclidean norm of the vector is the spectral norm of the matrix A. For two square matrices A and B of the same size, is positive semi-definite. We use to denote nonnegative integers. For a function g, supp(g) denotes the support of g. For denotes the largest integer that is less than or equal to denotes the smallest integer that is greater than or equal to x. Throughout the paper, we use to denote the central subspace and its orthogonal complement, respectively.

2 Central subspace and contour regression

The model in (1) has an ambiguity since the solutions are not unique. The columns of Φ form an orthonormal basis of the central subspace, while the choice of orthonormal basis is not unique. This gives rise to an ambiguity between the basis Φ and the function g, which can be characterized by the following lemma:

Lemma 1. Consider the model in (1) where the columns of Φ form an orthonormal basis of the central subspace. For any orthogonal matrix any . Then the columns of form another orthonormal basis of the central subspace and

Lemma 1 shows that the representation of f is not unique. If another set of orthonormal basis is picked, the function g changes accordingly. In this paper, we aim to recover one set of orthonormal basis Φ and the corresponding function . The subspace estimation error and regression error are represented by

respectively, which are invariant to the choice of orthonormal basis. After a change of basis, we can assume that the columns of Φ are carefully chosen such that

according to [5, Theorem 2.2.1].

2.1 Simple contour regression

We start with Simple Contour Regression (SCR) [36] which utilizes the fact that the contour directions of f are orthogonal to the central subspace. Let 0 and define the following conditional covariance matrix:

where (˜) are two independent samples.

When d = 1 and g is a monotonic function, we expect many of the (˜x, x) pairs satisfying small while can be arbitrarily large for any . In this case, most of the ˜directions satisfying are aligned with

For example, in Figure 1(a), 2]. This function can be expressed as Model (1) with Φ = [1. Consider two inputs ) satisfying , the condition the other hand, 4], which is independent of . In Figure 1(b), 200 samples are uniformly drawn from [) pair is connected if 01. We observe that, almost all connections by SCR in Figure 1(b) are aligned with the w direction.

SCR estimates the central subspace from the smallest d eigenvectors of ). The success of SCR is guaranteed with the following condition:

Condition 1. There exists such that for any and unit vectors , the following holds

We define an elliptical distribution [34] as

Definition 1 (Elliptical distribution)be a probability measure in with density function has an elliptical distribution if there exists a positive-definite matrix such that

for some function has a spherical distribution.

It is known that if x has a spherical distribution, then E(x) = 0 and Cov(x) = aI for some a > 0, see [17, Theorem 2.5 and 2.7]. Condition 1 together with an spherical distribution of x guarantee that the eigenvectors associated with the smallest d eigenvalues of Theorem 2.1].

2.2 Generalized contour regression

Condition 1 does not necessarily hold when d = 1 and the function g is not monotonic, as well as when d > 1. For example in Figure 2 (a), This function can be expressed in Model (1) with Φ = [1) be two inputs satisfying

Figure 1: (a) Function . When 200 samples are uniformly drawn in [[5], (b) and (c) display the connected () pairs by SCR (b) and GCR (c), respectively. An () pair is connected by SCR in (b) if 01 and by GCR in (c) if ) is an empirical estimator of defined in (24).

Figure 2: (a) Function . When 200 samples are uniformly drawn in [[5], (b) and (c) display the connected () pairs by SCR (b) and GCR (c), respectively. An () pair is connected by SCR in (b) if 01 and by GCR in (c) if

for some small . Due to symmetry, can be very large, which violates Condition 1. When 200 samples are uniformly drawn in [connected by SCR in Figure 2 (b) if 01. The ideal connections are along the direction, but SCR gives many misleading connections since g is not monotonic. When d > 1, Condition 1 can be easily violated as well. For any fixed , the contour is a curve or a surface, so is not necessarily small.

Fortunately, most misleading connections can be identified from a large variance of f along the segment between the two inputs. In Figure 2, is small but the variance of f along the segment between is large. This criterion helps to rule out the misleading connection between

Replacing the condition of in SCR by a variance condition gives rise to the Generalized Contour Regression (GCR) [36]. Let the segment between . The variance of f along this line is: