A vast majority of statistical inference and machine learning problems can be modeled as regression, where the goal is to estimate an unknown function from a finite number of training samples. Nowadays, new challenges are introduced to regression and prediction due to the rise of high-dimensional data in many fields of contemporary science. The well-known curse of dimensionality implies that, in order to achieve a fixed accuracy in prediction, the number of training data must grow exponentially with respect to the data dimension, which is beyond practical applications.
Fortunately, functions of interest in applications often exhibit low-dimensional structures. In many situations, the response may depend on few variables, or a low-dimensional subspace. For example, in Bioinformatics, the cDNA microarray data have thousands of dimension but an effective classification of tumor types only depends on a subspace of dimension one or two [4]. In engineering, the coefficients of certain elliptic partial differential equations are parameterized by many variables but has a small effective dimension [8]. In photovoltaic industry, there are five input parameters in the single-diode solar cell model while the maximum power output only depends on a linear combination of these five parameters [9]. Similar low-dimensional models appear in optimization [22], optimal control [55], uncertainty quantification [2], text classification [30] and biomedicine [43, 44].
These applications motivate us to consider the regression model
where . The columns of Φ consist of an orthonormal basis of a d-dimensional subspace, i.e., Φ
. The random variable
models noise, which is independent of the
’s. In this model, the function f is defined on
but only depends on the projection of x to the subspace spanned by Φ, denoted by
be such a space of minimum dimension:
) is not a constant function for any
the central subspace [10], or the Effective Dimension-Reduction subspace [37, 14, 54]. In other words, the function f only varies along the central subspace, and remains the same along any orthogonal direction of the central subspace. This model is also called the single-index model for d = 1 and the multi-index model for
where the ’s are independently drawn from a probability measure
and the a prior knowledge that f varies along a d-dimensional central subspace, we aim at constructing an empirical estimator
and studying the Mean Squared Error (MSE)
The central interest of this problem is to estimate the central subspace Φ. A class of methods is based on the gradient is proportional to the Φ direction, which allows one to estimate Φ from average of the empirical gradients [46, 24, 25]. When
2, the covariance matrix of
is given by Φ
gives an estimate of Φ as the eigenspace associated with the top d eigenvalues of this covariance matrix [28, 8]. If the gradient can be accurately estimated, these gradient-based methods lead to a
-consistent estimation of Φ [25, 8]. In general, the gradient estimation in
an exponentially large number of samples in dimension D.
The single-index model with d = 1 has been extensively studied in literature. In this case, is called the link function. The minimax mean squared regression error while the link function belongs to the s-H¨older class was proved to be
) [20, 33, 6]. These results demonstrate that the optimal algorithm can automatically adapt to the central subspace of dimension 1. For estimation, a lot of methods based on non-convex optimization have been proposed [33, 23, 29, 53] while achieving the global minimum is not guaranteed. It was proved in [23, Chapter 22] that minimizing the empirical risk over the central subspace of dimension 1 and approximating g with piecewise polynomials simultaneously give rise to the MSE of
). The main challenge of this approach is to obtain the global minimum due to non-convexity in the optimization. In the context of regression with point queries, an adaptive query algorithm was proposed for the single-index model in [7] with performance guarantees. This algorithm was generalized to the multi-index model in [18]. In the standard regression setting, adaptive queries are not allowed, and the samples are usually given before learning starts.
Estimating the central subspace is related with sufficient dimension reduction in statistics. A class of methods related with inverse regression has been developed to estimate the central subspace Φ (see [41, 34] for a comprehensive review). Sliced Inverse Regression (SIR) [38] is the first and most well known method. The term “inverse regression” refers to the conditional expectation E(x|y). In SIR, the central subspace is estimated as the eigenspace associated with the top d eigenvalues of Cov()). Similar techniques include kernel inverse regression [56], parametric inverse regression [3] and canonical correlation estimator [19]. These methods are referred as first-order methods since the first-order statistical information is utilized, such as the conditional mean. Their performance crucially depends on the function g and the distribution of data. If g is symmetric about 0 along some directions, then those directions can not be recovered by the first-order methods. This issue about symmetry is better addressed by second-order methods which utilize the variance and covariance information of data. Popular second-order methods include sliced inverse variance estimation [11], sliced average variance estimation (SAVE) [12, 16], average variance estimation (MAVE) [53], contour regression [36], directional regression [35], principal Hessian directions [39], a hybrid SIR and SAVE [57], etc. New methods are also developed in a multiscale framework [32, 31]. For the single-index model, Smallest Vector Regression (SVR) is proposed in [32] in a multiscale framework, which combines the idea of SIR and SAVE. A performance analysis is provided for the index estimation and regression, while the assumptions are favorable to monotonic g.
This paper focuses on a second-order method called Generalized Contour Regression (GCR) introduced by Li, Zha and Chiaromonte [36]. In comparison with SIR and SAVE mentioned above, GCR has advantages in the case where g is not monotonic. Empirical experiments have demonstrated the success of GCR for the estimation of the central subspace, but its sample complexity is still not well understood yet. In this paper, we analyze the error behavior of the regression scheme which consists of GCR for the central subspace estimation and piecewise polynomial approximation of g. We also propose a modified GCR with improved efficiency. Our contributions are:
(i) We prove that GCR estimates the central subspace with a mean squared estimation error of ), if a variance quantity is exactly known. The estimation error of this variance quantity is also given in this paper.
(ii) Our regression scheme gives rise to the MSE of f in the order of (demonstrates that the MSE decays exponentially with an exponent depending on the dimension of the central subspace d, instead of the ambient dimension D.
(iii) Our modified GCR improves over the original GCR in efficiency. Numerical experiments demonstrate that the modified GCR has the same convergence rate as that of the original GCR in our statistical theory.
This paper is organized as follows: We first introduce SCR and GCR in Section 2. Our regression scheme and main results are stated in Section 3. Numerical experiments are provided in Section 4 to validate our theory. We present proofs in Section 5 and conclude in Section 6.
We use lowercase bold letters and capital letters to denote vectors and matrices respectively. is the Euclidean norm of the vector
is the spectral norm of the matrix A. For two square matrices A and B of the same size,
is positive semi-definite. We use
to denote nonnegative integers. For a function g, supp(g) denotes the support of g. For
denotes the largest integer that is less than or equal to
denotes the smallest integer that is greater than or equal to x. Throughout the paper, we use
to denote the central subspace and its orthogonal complement, respectively.
The model in (1) has an ambiguity since the solutions are not unique. The columns of Φ form an orthonormal basis of the central subspace, while the choice of orthonormal basis is not unique. This gives rise to an ambiguity between the basis Φ and the function g, which can be characterized by the following lemma:
Lemma 1. Consider the model in (1) where the columns of Φ form an orthonormal basis of the central subspace. For any orthogonal matrix any
. Then the columns of
form another orthonormal basis of the central subspace and
Lemma 1 shows that the representation of f is not unique. If another set of orthonormal basis is picked, the function g changes accordingly. In this paper, we aim to recover one set of orthonormal basis Φ and the corresponding function
. The subspace estimation error and regression error are represented by
respectively, which are invariant to the choice of orthonormal basis. After a change of basis, we can assume that the columns of Φ are carefully chosen such that
according to [5, Theorem 2.2.1].
2.1 Simple contour regression
We start with Simple Contour Regression (SCR) [36] which utilizes the fact that the contour directions of f are orthogonal to the central subspace. Let 0 and define the following conditional covariance matrix:
where (˜) are two independent samples.
When d = 1 and g is a monotonic function, we expect many of the (˜x, x) pairs satisfying small while
can be arbitrarily large for any
. In this case, most of the ˜
directions satisfying
are aligned with
For example, in Figure 1(a), 2]. This function can be expressed as Model (1) with Φ = [1
. Consider two inputs
) satisfying
, the condition
the other hand,
4], which is independent of
. In Figure 1(b), 200 samples are uniformly drawn from [
) pair is connected if
01. We observe that, almost all connections by SCR in Figure 1(b) are aligned with the w direction.
SCR estimates the central subspace from the smallest d eigenvectors of ). The success of SCR is guaranteed with the following condition:
Condition 1. There exists such that for any
and unit vectors
, the following holds
We define an elliptical distribution [34] as
Definition 1 (Elliptical distribution)be a probability measure in
with density function
has an elliptical distribution if there exists a positive-definite matrix
such that
for some function has a spherical distribution.
It is known that if x has a spherical distribution, then E(x) = 0 and Cov(x) = aI for some a > 0, see [17, Theorem 2.5 and 2.7]. Condition 1 together with an spherical distribution of x guarantee that the eigenvectors associated with the smallest d eigenvalues of Theorem 2.1].
2.2 Generalized contour regression
Condition 1 does not necessarily hold when d = 1 and the function g is not monotonic, as well as when d > 1. For example in Figure 2 (a), This function can be expressed in Model (1) with Φ = [1
) be two inputs satisfying
Figure 1: (a) Function . When 200 samples are uniformly drawn in [
[
5], (b) and (c) display the connected (
) pairs by SCR (b) and GCR (c), respectively. An (
) pair is connected by SCR in (b) if
01 and by GCR in (c) if
) is an empirical estimator of
defined in (24).
Figure 2: (a) Function . When 200 samples are uniformly drawn in [
[
5], (b) and (c) display the connected (
) pairs by SCR (b) and GCR (c), respectively. An (
) pair is connected by SCR in (b) if
01 and by GCR in (c) if
for some small . Due to symmetry,
can be very large, which violates Condition 1. When 200 samples are uniformly drawn in [
connected by SCR in Figure 2 (b) if
01. The ideal connections are along the
direction, but SCR gives many misleading connections since g is not monotonic. When d > 1, Condition 1 can be easily violated as well. For any fixed
, the contour
is a curve or a surface, so
is not necessarily small.
Fortunately, most misleading connections can be identified from a large variance of f along the segment between the two inputs. In Figure 2, is small but the variance of f along the segment between
is large. This criterion helps to rule out the misleading connection between
Replacing the condition of in SCR by a variance condition gives rise to the Generalized Contour Regression (GCR) [36]. Let
the segment between
. The variance of f along this line is:
In GCR, the covariance matrix is taken to be
where 0 is a parameter. In [36], the authors assume that for any unit vectors
, there is some constant
It was proved that, when has an elliptical distribution, (5) always holds for sufficiently small
. In this work, we refine the assumption (5) to the following one which requires a gap between the two sides of (5):
Assumption 1. There exist such that for any 0
vectors
, the following hold
This assumption with a gap is used to prove an eigengap of the matrix ), which is necessary to control the central subspace estimation error.
An () pair is said to be connected if
. In Figure 2, even though
is small, the variance
) is large. When
is small, the condition of
is violated so the (
) pair is not connected. We expect all connected pairs by GCR to be aligned with the
direction, such as (
). For such connected pairs, it is very likely that
implies
independent of
. When 200 samples are uniformly drawn in [
pair is connected by GCR in Figure 2(c) if an empirical estimator of
) (see (24)) is no more than 0.001. We observe that, most connections by GCR are along the
direction, and many misleading connections by SCR are ruled out.
Assumption 1 together with a spherical distribution of x imply that, when the eigenspace of
) associated with the smallest d eigenvalues is
and the rest of the eigenvectors span
has a spherical distribution. Let
eigenvalues of
. Under Assumption 1, the followings hold for any
• For any integer 1
• The eigenvectors of associated with
and the eigenvectors associated with
Proposition 1 is proved in Supplementary materials A. The second part of Proposition 1 was first proved in [36] based on the assumption in (5). We further show that, under Assumption 1, there is a eigengap between the first d eigenvalues and the rest of the eigenvalues for
3.1 Assumptions
In order to guarantee the success of GCR, we introduce the following assumptions on , requiring that i)
is supported on a bounded domain; ii)
has a spherical distribution.
Assumption 2. Suppose the probability measure satisfies the following assumptions:
i) is bounded by B > 0: for every x sampled from
almost surely.
Assumption 2(ii) is common in inverse regression, but GCR is still more robust against nonsphericity (or nonellipticity) than SIR and SCR (see Experiment 1 in Section 4). We next define Lipschitz and H¨older functions and make some regularity assumption on g.
Definition 2 (Lipschitz functions). A function g is Lipschitz with Lipschitz constant
Definition 3 (H¨older functions)A function
-smooth if for every
the partial derivative
exists and satisfies
where
In this paper, the function g is assumed to be (
Assumption 3. The function -smooth with
Assumption 2 and 3 imply the followings:
i) g is Lipschitz as is an upper bound of the Lipschitz constant of g.
ii) g is bounded by Assumption 2 (i) and the Lipschitz property above: denotes an upper bound of g which satisfies
has zero mean and is bounded:
surely.
3.2 Central subspace error
Our central interest is to estimate the central subspace. In this paper, we prove the central subspace estimation error by GCR based on the exact variance quantity ). The empirical estimation of
) is discussed in Section 3.4. If
) is exactly known, we define
such that
where is the indicator function which is equal to 1 if
and 0 otherwise. The matrices
) have the same set of eigenvectors. According to Proposition 1, for any 0
, the eigenvectors associated with the smallest d eigenvalues of
form an orthonormal basis of the central subspace.
The empirical counterpart of ) based on the i.i.d. samples
and The central subspace estimator based on
is denoted by
columns are the eigenvectors associated with the smallest d eigenvalues of
. This estimation error is quantified by
. To show that Proj
is close to Proj
, we first derive a concentration inequality for
be i.i.d. samples from a probability measure
satisfying Assumption 2. Assume
is known for any (
pair. For any
where B are given in Assumption 2.
Theorem 1 is proved in Section 5.1 with a new concentration inequality for matrix-valued U-statistics. To further bound the distance between Proj, we will use the relation between
) in (7), and the eigengap implied from Assumption 1. Denote
Since f is Lipschitz and supp() is compact, we expect that, for a fixed
is bounded away from 0. The following example shows that when
has a spherical distribution and under mild conditions,
is bounded away from 0.
Example 1. Suppose Assumption 2 and 3 hold, and the density function of bounded below by
where
Example 1 is proved in Supplementary materials B. The following theorem shows that, under the same conditions as in Theorem 1, Projis close to Proj
with high probability (see a proof in Section 5.2).
be i.i.d. samples from a probability measure
. Suppose Assumption 1 and 2 hold and
is exactly known for any (
pair. For any 0
where are given in Assumption 1 and 2.
Integrating the probability in (11) gives rise to the following mean squared estimation error for the central subspace:
Corollary 1. Under the conditions in Theorem 2, we have
where
Corollary 1 is proved in Section 5.3. It implies that, if ) is exactly known, the meansquared estimation error of GCR for the central subspace converges in the rate of
3.3 Regression error
Given 2n samples denoted by , we evenly split the data into two subsets
is used to compute
) as described in Section 3.2, and
is used to estimate the function g. After the central subspace is estimated as
, the next step is to estimate the function g from the data
Our regression scheme is summarized in Algorithm 1. A rich class of nonparametric regression techniques [23, 50, 51] can be used to estimate g, such as kNN, kernel regression, polynomial partitioning estimates, etc. In this paper, we will present the results of polynomial partitioning estimates.
with noise
Due to the mismatch between Φ and Φ, the noise
has a bias and its conditional mean at
where the expectation is taken over , conditioning on
. This function is bounded such that
to denote an upper bound of the squared bias in noise.
The support of the measure is bounded by Assumption 2(i), which implies
For a fixed positive integer
be the space of piecewise polynomials of degree no more than k on the partition of [
cubes with side length 2
the polynomial degree k should be chosen as
1. Consider the piecewise polynomial estimator of order k:
Since g is bounded by M, we truncate the final estimator to
The parameter K determines the size of the partition, which we set as
with defined in (13). The goal of this paper is to give an error analysis of
, which has the Mean Squared Error (MSE)
where the expectation is taken over the joint distribution of
The first term captures the estimation error of the central subspace by GCR, and the second terms captures the regression error of g. The following theorem provides an upper bound for the MSE of (see its proof in Section 5.4).
be i.i.d. samples of a probability measure
according to the model in (1). Suppose Assumption 1-4 hold and
is exactly known for any (
according to (20). The estimator
in (19) satisfies
where is defined in (13) and C > 0 is a constant depending on
Theorem 3 demonstrates that if GCR is used to estimate the central subspace, the mean squared regression error decays exponentially in n with an exponent depending on d, instead of D. GCR effectively exploits the low-dimensional structure of the function and gives a fast rate of convergence in comparison with a direct regression in
3.4 A practical algorithm to estimate the central subspace
Our estimation theory in Theorem 2 and 3 utilizes U-statistics with the exact knowledge of ) for any (
) pair. In practice,
) is not given and we need to estimate it from the samples. In this paper, we use the empirical estimation of
proposed in [36] and prove an estimation error for
. We also propose an efficient algorithm to compute the empirical covariance matrix for the estimation of the central subspace.
3.4.1 Empirical estimation of Vf
The variance quantity ) is the variance of f along the segment
has 0 measure, it is unlikely to have data exactly lying on
Following the idea in [36], we approximate
) by the variance of y in a narrow tube enclosing
radius r (see Figure 3). Let
)) be the distance from
inf
) as the tube enclosing
) with radius r as follows
The variance of ) is denoted as
The empirical counterpart of ) based on the data
Figure 3: ) : the tube enclosing
) with radius r.
We further assume that the difference between ) is linear in r:
Assumption 6. There exists such that, for any
Assumption 6 implies that when ) is close to
supp(
). The following example shows that when
is the volume measure on its support and
), then Assumption 6 holds.
Example 2. Let f be defined as (1). Suppose Assumption 2(i) and 3 hold, and is the volume measure on a compact set in
. For every
any
Example 2 is proved in Supplementary materials C.
3.5 An efficient algorithm to form the empirical covariance matrix
Forming the empirical covariance matrix as in (8) requires ) for any (
With n samples, the computational complexity is
, which is expensive. When the parameter
is small, we expect to have a small number of (
) pairs connected such that
We next propose a new algorithm to efficiently identify the connected pairs.
In our new algorithm, we form the empirical version of i.i.d. samples
be a matrix with entries in {0, 1} such that all (i, j) indices with
) = 1 satisfy
. Our estimation of
where . In other words, the connected pairs are indicated by the nonzero entries in the matrix
. In this paper, we form the matrix
via Algorithm 2. In Algorithm 2, each data point is used at most once: when a point is connected in one pair, we remove it from the rest of the computation. Algorithm 2 is more efficient than the original GCR in the sense that it outputs at most n/2 connected pairs while the original GCR uses
pairs (most of which are not connected) to estimate
). Therefore the computational cost is greatly reduced. Moreover, our numerical experiments show that the mean squared error of the central subspace estimation by Algorithm 2 converges in the order of
, which is the same as that in Corollary 1.
In are two important parameters, which need to be properly chosen. In this paper, we choose these parameters as follows: Let
0 be a fixed constant. We set
with
eters are defined in Assumption 2-6.
In the following theorem, we prove that, if the parameters are chosen according to (30) and (31), Algorithm 2 guarantees at least n/4 connected pairs of (
) such that
, and all the connected pairs satisfy
with high probability.
samples from the probability measure
sampled according to the model in (1), under Assumption 2(i), 3-6. Let
according to (30) and (31). Index the output pairs
by Algorithm 2 as
is sufficiently large such that 2(
Algorithm 2 gives rise to
Theorem 4 is proved in Supplementary materials D. Theorem 4 shows that if properly chosen, with high probability, Algorithm 2 gives rise to
) with at least n/4 pairs of (
). Moreover, all such pairs satisfy
with high probability. If n is large enough and
is small enough such that
, we expect the estimated subspace
Φ by Algorithm 2 to be a good approximation of the central subspace Φ.
described in Section 3.3 using the samples is estimated as
3.6 Data normalization
Our theoretical analysis assumes that follows a spherical distribution. In practice these conditions may not be satisfied for the given data, and we always preprocess the data by normalization [35]. Given the data set
, we first compute the empirical mean ¯
and the empirical covariance matrix
then we normalize the data as
This normalization does not alter the low-dimensional property of the function since
where
In this section, we provide numerical experiments to demonstrate the performance of the modi-fied GCR in Algorithm 2 and the regression scheme. The data are sampled accord- ing to the model in (1) with
). The noise is
experiments, 90% of the given data is used for training and 10% is used for testing. Training data are used for central subspace estimation by GCR, SCR or SIR respectively and regression of g is estimated by Gaussian kernel regression through the MATLAB built-in function fitrker-nel. The test data are used to compute the central subspace estimation error
and the regression error
where is the number of the test data. In GCR, the parameter r is chosen in the order of
according to (30). We set
without specification.
We expect scales linearly with respect to log
, with a slope of
5, independently of d and D.
4.1 Experiment 1: Robustness to non-elliptical distributions
In the first experiment, we investigate the sensitivity of GCR, SCR and SIR to the condition of elliptical distributions in Assumption 2(ii),. Let function can be expressed as the model in (1) with Φ = [1
We sample x uniformly in the domain [
5] excluding the forth quarter, which violates the condition of elliptical distributions. We set
001 in GCR,
in SCR and 10 slices in SIR. Figure 4 shows 1500 samples of x (black dots), the direction of Φ (red arrow) and the direction of
Φ (blue arrow). We observe that GCR is more robust than SCR and SIR when x is not elliptically distributed.
Figure 4: (Experiment 1) central subspace estimation by GCR, SCR and SIR when x is not elliptically distributed. Samples are displayed in black dots and the direction of Φ is represented by a red arrow. The direction of Φ is shown in a blue arrow in (a) by GCR, (b) by SCR and (c) by SIR. We observe that GCR is more robust than SCR and SIR when x is not elliptically distributed.
Figure 5: (Experiment 2 – comparison of GCR, SCR and SIR with 5% noise.) Log-log plot of the central subspace estimation error versus (b). SIR has the best performance when the function g is monotonic.
4.2 Experiment 2: Monotonic functions
In the second experiment, we test and compare the performance of GCR, SCR and SIR on two monotonic single index models:
Both functions can be expressed in model (1) with D = 10 and d = 1. They are
has 1 in the ith entry and 0 everywhere else. In this and the following experiments, the
’s are uniformly sampled from [
, and the sample size varies such that
We compare GCR,SCR and SIR with 5% noise. Figure 5 shows the log-log plot of the central subspace estimation error versus n for each function. In GCR, we use . For both functions,
2 is used in SCR and each slice in SIR is set to contain about 200 samples. The subspace error in SCR and GCR converges almost in the order of
as expected. When g is monotonic, SIR yields the best performance. We will show in the next following experiments that SIR can easily fail when g is not monotonic, while GCR can handle many more cases.
4.3 Experiment 3: A non-monotonic function
The third experiment is
This function can be expressed as the model in (1) with
Figure 6: (Experiment 3 – performance of GCR without noise) Log-log plot of the central subspace estimation error (a) and the regression error (b) versus with C = 1, 1/10, 1/120, 1/1200, 1/12000 respectively.
We first present the performance of GCR on noiseless data, i.e. = 0, with different choices of the parameter
. In GCR, the parameter
should be chosen as
to (31). We set C = 1, 1/10, 1/120, 1/1200, 1/12000 and show the central subspace estimation error versus n in Figure 6(a) and the regression error versus n in Figure 6(b). Each error is averaged over 10 experiments. In log-log scale, the errors decay linearly as n increases with the same rate as long as C is not too large, as we expect. Our theory predicts the slopes in Figure 6(a) as
5, which are almost matched by the slopes in Figure 6(a). The success of GCR requires the condition
, so GCR fails when C is too large, i.e. C = 1. The regression error is observed to converge in the order of
is not too large.
We next compare the performance of GCR, SCR and SIR with noiseless data. We set 120 in GCR, and
in SCR since it provides the best results among many choices. In SIR, each slice contains about 200 samples. We display the central subspace estimation error versus n in Figure 7(a) and the regression error versus n in Figure 7(b). Each error is the averaged error of 10 experiments. GCR and SCR perform better than SIR, and GCR is the best. Estimation of the central subspace greatly improves the rate of convergence in comparison with a direct regression in
Figure 7: (Experiment 3 – Comparison of GCR, SCR and SIR without noise) Log-log plot of the central subspace estimation error (a) and the regression error (b) versus n by GCR, SCR and SIR. The black curve in (b) represents the regression error of without an estimation of the central subspace. GCR and SCR perform better than SIR, and GCR is the best. Estimating the central subspace greatly improves the rate of convergence in comparison with a direction regression in
Figure 8: (Experiment 3 – Comparison of GCR, SCR and SIR with 5% noise) Log-log plot of the central subspace estimation error (a) and the regression error (b) versus n by GCR, SCR and SIR. The black curve in (b) represents the regression error of estimation of the central subspace. GCR and SCR perform better than SIR, and GCR is the best. Estimating the central subspace greatly improves the rate of convergence in comparison with a direction regression in
Results with 5% noise are shown in Figure 8. We set and SCR. In SIR, each slice contains about 200 samples. We observe that GCR perform better than SCR and SIR.
We then compare GCR, SCR and SIR with heavy noise – 50%, 100% and 120% noise. Visualization of data along the directions is shown in the first and second row of Figure 9. The third row shows the log-log plot of the central subspace estimation error by GCR,
Figure 9: (Experiment 3 – Comparison of GCR, SCR and SIR with heavy noise: 50% noise in the left column, 100% in the middle column and 120% in the right column) The first row shows the visualization of the noiseless f(x) in black and the noisy y in gray along the while
1. The second row shows the visualization of the noiseless f(x) in black and the noisy y in gray along the
direction while
9. The third row shows the log-log plot of the central subspace estimation error by GCR, SCR and SIR versus n.
SCR and SIR versus n. Each error is averaged over 10 experiments. We observe that GCR is very robust against heavy noise – the central subspace estimation error converges in the order of with 50% noise, and the rate slightly degrades in the presence of 100% and 120% noise. In comparison, SCR and SIR tend to fail when noise is heavy.
4.4 Experiment 4: A non-monotonic function
The fourth experiment is
This function can be expressed as the model in (1) with This experiment is more challenging since f is not monotonic along both
directions.
Figure 10: (Experiment 4 – performance of GCR without noise) Log-log plot of the central subspace estimation error (a) and the regression error (b) versus with C = 1/0.05, 1/0.5, 1/5, 1/50, 1/500 respectively.
Performance of GCR with 500 on noiseless data is presented in Figure 10. The central subspace estimation error and the regression error are shown in Figure 10 (a) and (b), respectively. Each error is averaged over 10 experiments. Our observation is similar to Experiment 2 that, in log-log scale, both errors decay linearly as n increases with the same slope, as long as C is not too large.
In Figure 11, we compare GCR, SCR and SIR with 5% noise. We set GCR and
in SCR. In SIR, each slice contains about 200 samples. We show the central subspace estimation error versus n in Figure 11 (a) and the regression error versus n in Figure 11 (b). Each error is averaged over 10 experiments. Among the three methods, GCR yields the smallest error and the fastest rate of convergence.
Results with 50% noise are shown in Figure 12. We set in SCR. In SIR, each slice contains about 200 samples. The noisy y (gray) and the noiseless f(x) (black) are shown in Figure 12 (a) along the
direction with
and in Figure 12 (b) along the
direction with
The central subspace estimation error by the three methods are displayed in Figure 12 (c). Each error is averaged over 10 experiments. SCR and SIR perform poorly in this test, while GCR is very robust to heavy noise. The central subspace estimation error decays as
), which is close to our prediction of
4.5 Experiment 5: A non-monotonic function
In Experiment 3 and 4, the function f can be written as a sum of two single index models. We next compare GCR, SCR and SIR on functions without such structures. Consider
which can be expressed in the model (1) with We test GCR, SCR and SIR with 5%
Figure 11: (Experiment 4 – Comparison of GCR, SCR and SIR with 5% noise) Log-log plot of the central subspace estimation error (a) and the regression error (b) versus n by GCR and SCR. The black curve in (b) represents the regression error of without an estimation of the central subspace. GCR performs better than SCR and SIR. Estimation of the central subspace greatly improves the rate of convergence in comparison with direction regression in
Figure 12: (Experiment 4 – Comparison of GCR, SCR and SIR with 50% noise) Visualization of noiseless f(x) in black and the noisy y in gray along the direction while
(a) and along the
direction while
1 (b). (c) is the log-log plot of the central subspace estimation error versus n by GCR, SCR and SIR. GCR performs better than SCR and SIR. GCR is very robust against heavy noise.
and 50% noise. The log-log plot of the central subspace estimation error versus n is displayed in Figure 13. Each error is averaged over 10 experiments. GCR has a convergence rate of ) with 5% noise and
) with 50% noise, which is robust to heavy noise and performs better than SCR and SIR.
4.6 Experiment 6: A non-monotonic function
Our last experiment is
Figure 13: (Experiment 5 – Comparison of GCR, SCR and SIR with 5% and 50% noise) Log-log plot of the central subspace estimation error versus n by GCR, SCR and SIR with 5% noise (a) and 50% noise (b).
Figure 14: (Experiment 6 – Comparison of GCR, SCR and SIR with 5% and 50% noise.) Loglog plot of the central subspace estimation error versus n by GCR, SCR and SIR with 5% noise (a) and 50% noise (b). GCR performs better than SCR and SIR.
which can be expressed in the model (1) with [
We test GCR, SCR and SIR with 5% and 50% noise. The log-log plot of the central subspace estimation error versus n is displayed in Figure 14. Each error is averaged over 10 experiments. GCR has a convergence rate of ) with 5% noise and
) with 50% noise, which is robust to heavy noise and performs significantly better than SCR and SIR.
This section contains the proof of our main results in Section 3. Lemmas are proved in the Supplementary materials.
5.1 Concentration inequality for matrix-valued U-statistics and proof of Theorem 1
U-statistics was introduced by Hoeffding [26] and the concentration inequalities for scalar-valued U-statistics have been well studied [26],[47, Chapter 5]. To our knowledge, there are limited results on matrix-valued U-statistics. The concentration inequality for a modified matrix-valued U-statistics can be found in [42]. In this paper, we prove a Bernstein inequality for matrix-valued U-statistics using the tools in [49, 1]. We denote as the set of
real-valued Hermitian matrices.
Definition 4 (Matrix U-statistics). Let x be a random variable with probability distribution in
be a matrix-valued kernel with m inputs. Suppose we can access
i.i.d. copies of x, denoted by
. The U-statistic of order m and kernel
is defined as
is the set of all m-tuple of integers between 1 and n.
We say W is permutation symmetric if for any () and any permutation
is permutation symmetric,
can be written as
U-statistics are unbiased estimators of ) where the expectation is taken over the joint distribution of
. We use tools from [49, 1] to prove the following lemma about the concentration of the matrix-valued U-statistics:
Lemma 2 (Bernstein inequality for matrix-valued U-statistics). Let x be a random variable with probability distribution be a permutation symmetric matrix-valued kernel with m inputs. Assume
Suppose we can access i.i.d. copies of x, denoted by
be the U-statistics of W defined in (37). Then for any t > 0,
Proof of Lemma 2. Let
V (
where is the average of k independent empirical realizations of
be a permutation of (1
The U-statistics in (36) can be written as
where denotes the set of all permutations of (1
permutation
For any 0, we have
Next we bound . By the assumption in (38), for each
in (40), we have
The Bernstein inequality for U-statistics is derived through several important results in [49, 1]. For 0
where tr() denotes the trace of its argument. Combining this bound with (43) gives rise to
Setting
Theorem 1 is a direct result of Lemma 2 by taking
5.2 Proof of Theorem 2
To prove Theorem 2, we need the following two lemmas on matrix perturbation theory.
Lemma 3 (Davis-Kahan [15, 48])be two Hermitian matrices with eigenvalues
in non-increasing order. Let the columns of Φ and
consist of the eigenvectors associated with
, respectively. Suppose
Lemma 4 (Weyl [52])be two Hermitian matrices with eigenvalues
1, . . . , D in non-increasing order. Then
Proof of Theorem 2. By Proposition 1, when , the eigenvectors associated with the smallest d eigenvalues of
) span the central subspace. Recall that the column space of
the eigenspace associated with the smallest d eigenvalues of
). To derive the relation between Proj
using Lemma 3 and Lemma 4, we need an eigengap of
). Denote the eigenvalues of
in non-increasing order, respectively. We have
, with Assumption 1 and 2(iii), Proposition 1 implies
Combining Lemma 3 and Lemma 4 and setting ) give rise to
Notice that 2, a sufficient condition to guarantee
Since 0 < t < 2, we can express this sufficient condition as Therefore, for any
Combining (9) and (48) gives rise to (11).
5.3 Proof of Corollary 1
is a random variable such that
for some a, b, c > 0, then
Lemma 5 is proved in Supplementary materials E. Corollary 1 is proved based on Lemma 5.
Proof of Corollary 1. Combining Theorem 2 and Lemma 5 gives rise to
where are given in (13).
5.4 Proof of Theorem 3
In the regression model (14), if we condition on the recovered central subspace Φ, the samples
are independently sampled from the marginal distribution of
on the subspace spanned by
Φ. Denote this marginal distribution by
. For any set Ω
). According to (21), the regression error of f crucially depends on the estimation error of g, which is established in the following lemma.
be i.i.d. samples from a probability measure
[
follows the model in (14). Let
be the estimator in (19) with polynomial order
. Under Assumption 2(i), 3, 4, we have
where c is a universal constant and is defined in (17).
Lemma 6 is proved in Supplementary materials F with classical results in nonparametric regression [23, 45, 21, 13]. The proof of Theorem 3 is given below.
Proof of Theorem 3. From (3), (21) and the fact that
We use ) to denote the expectation with respect to the joint distribution of
The regression error of f can be expressed as
Corollary 1 gives an estimate of the first term:
The second term in (50) can be estimated through Lemma 6 as follows:
where is defined in Theorem 3 and
independent of n. The last inequality in (52) results from
by (3), Corollary 1 and K chosen according to (20). Finally, combining (51) and (52) gives rise to (22).
In this paper, we combine GCR and piecewise polynomial approximations to estimate a high-dimensional function which varies along a low-dimensional central subspace. We prove that, the mean squared estimation error for the central subspace is ) and the mean squared regression error is
. A modified GCR algorithm with improved efficiency is also proposed. Numerical experiments verify our theories and demonstrate that the modified GCR has the same accuracy as that in our theoretical results.
Wenjing Liao’s research is supported by NSF DMS 1818751 and DMS 2012652. Both authors are grateful to Alessandro Lanteri, Mauro Maggioni and Stefano Vigogna for pointing out the noise bias in the regression model (14), which was not taken care of in the first version of this manuscript. This problem has been fixed in the current version.
[1] R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002.
[2] I. Babuˇska, R. Temponet, and G. E. Zouraris. Galerkin finite element approximations of stochastic elliptic partial differential equations. SIAM Journal on Numerical Analysis, 2004.
[3] E. Bura and R. D. Cook. Estimating the structural dimension of regressions via para- metric inverse regression. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 2001.
[4] E. Bura and R. M. Pfeiffer. Graphical methods for class prediction using dimension reduc- tion techniques on DNA microarray data. Bioinformatics, 2003.
[5] Y. Chen, Y. Chi, J. Fan, and C. Ma. Spectral methods for data science: A statistical perspective. arXiv preprint arXiv:2012.08496, 2020.
[6] C. Chesneau. Regression with random design: a minimax study. Statistics & probability letters, 77(1):40–53, 2007.
[7] A. Cohen, I. Daubechies, R. DeVore, G. Kerkyacharian, and D. Picard. Capturing Ridge Functions in High Dimensions from Point Queries. Constructive Approximation, 2012.
[8] P. G. Constantine, E. Dow, and Q. Wang. Active subspace methods in theory and practice: applications to kriging surfaces. SIAM Journal on Scientific Computing, 36(4):A1500– A1524, 2014.
[9] P. G. Constantine, B. Zaharatos, and M. Campanelli. Discovering an active subspace in a single-diode solar cell model. Statistical Analysis and Data Mining: The ASA Data Science Journal, 8(5-6):264–273, 2015.
[10] R. D. Cook and B. Li. Dimension reduction for conditional mean in regression. The Annals of Statistics, 30(2):455–474, 2002.
[11] R. D. Cook and S. Weisberg. Discussion of sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):328–332, 1991.
[12] R. D. Cook and S. Weisberg. Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414):328–332, 1991.
[13] D. D. Cox et al. Approximation of least squares regression on nested subspaces. The Annals of Statistics, 16(2):713–732, 1988.
[14] A. S. Dalalyan, A. Juditsky, and V. Spokoiny. A new algorithm for estimating the effective dimension-reduction subspace. The Journal of Machine Learning Research, 9:1647–1678, 2008.
[15] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[16] R. Dennis Cook. Save: a method for dimension reduction and graphics in regression. Communications in statistics-Theory and methods, 29(9-10):2109–2121, 2000.
[17] K.-T. Fang, S. Kotz, and K. W. Ng. Symmetric multivariate and related distributions. Chapman and Hall/CRC, 2018.
[18] M. Fornasier, K. Schnass, and J. Vybiral. Learning Functions of Few Arbitrary Linear Parameters in High Dimensions. Foundations of Computational Mathematics, 2012.
[19] W. K. Fung, X. He, L. Liu, and P. Shi. Dimension reduction based on canonical correlation. Statistica Sinica, pages 1093–1113, 2002.
[20] S. Ga¨ıffas, G. Lecu´e, et al. Optimal rates and adaptation in the single-index model using aggregation. Electronic journal of statistics, 1:538–573, 2007.
[21] S. Geer, S. van de Geer, R. Gill, B. Ripley, S. Ross, B. Silverman, and M. Stein. Empirical Processes in M-Estimation. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2000.
[22] P. E. Gill, W. Murray, and M. H. Wright. Practical optimization. London: Academic Press, 1981, 1981.
[23] L. Gy¨orfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free theory of Nonparametric Regression. Springer Science & Business Media, 2006.
[24] W. H¨ardle and T. M. Stoker. Investigating smooth multiple regression by the method of average derivatives. Journal of the American Statistical Association, 1989.
[25] W. H¨ardle and A. B. Tsybakov. How sensitive are average derivatives? Journal of Econometrics, 58(1-2):31–48, 1993.
[26] W. Hoeffding. A class of statistics with asymptotically normal distribution. In Breakthroughs in statistics, pages 308–334. Springer, 1992.
[27] W. Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
[28] M. Hristache, A. Juditsky, J. Polzehl, and V. Spokoiny. Structure adaptive approach for dimension reduction. Annals of Statistics, 2001.
[29] H. Ichimura. Semiparametric least squares (sls) and weighted sls estimation of single-index models. Journal of Econometrics, 58(1-2):71–120, 1993.
[30] H. Kim, P. Howland, and H. Park. Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research, 6(Jan):37–53, 2005.
[31] T. Klock, A. Lanteri, and S. Vigogna. Estimating multi-index models with responseconditional least squares. Electronic Journal of Statistics, 15(1):589–629, 2021.
[32] A. Lanteri, M. Maggioni, and S. Vigogna. Conditional regression for single-index models. arXiv preprint arXiv:2002.10008, 2020.
[33] O. Lepski, N. Serdyukova, et al. Adaptive estimation under single-index constraint in a regression model. The Annals of Statistics, 42(1):1–28, 2014.
[34] B. Li. Sufficient dimension reduction: Methods and applications with R. Chapman and Hall/CRC, 2018.
[35] B. Li and S. Wang. On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479):997–1008, 2007.
[36] B. Li, H. Zha, F. Chiaromonte, et al. Contour regression: a general approach to dimension reduction. The Annals of Statistics, 33(4):1580–1616, 2005.
[37] K.-C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
[38] K. C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 1991.
[39] K.-C. Li. On principal hessian directions for data visualization and dimension reduction: Another application of stein’s lemma. Journal of the American Statistical Association, 87(420):1025–1039, 1992.
[40] W. Liao and M. Maggioni. Adaptive geometric multiscale approximations for intrinsically low-dimensional data. Journal of Machine Learning Research, 20(98):1–63, 2019.
[41] Y. Ma and L. Zhu. A review on dimension reduction. International Statistical Review, 2013.
[42] S. Minsker, X. Wei, et al. Robust modifications of u-statistics and applications to covariance estimation problems. Bernoulli, 26(1):694–727, 2020.
[43] R. M. Pfeiffer and E. Bura. A model free approach to combining biomarkers. Biometrical Journal, 2008.
[44] R. M. Pfeiffer, L. Forzani, and E. Bura. Sufficient dimension reduction for longitudinally measured predictors. Statistics in Medicine, 2012.
[45] D. Pollard. Convergence of Stochastic Processes. Clinical Perspectives in Obstetrics and Gynecology. Springer New York, 1984.
[46] J. L. Powell, J. H. Stock, and T. M. Stoker. Semiparametric Estimation of Index Coeffi- cients. Econometrica, 1989.
[47] R. J. Serfling. Approximation Theorems of Mathematical Statistics, volume 162. John Wiley & Sons, 2009.
[48] W. G. Stewart and J.-G. Sun. Matrix perturbation theory. Academic Press, 1990.
[49] J. A. Tropp. User-friendly tools for random matrices: An introduction. Technical report, California Institute of Technology Division of Engineering and Applied Science, 2012.
[50] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag New York, 2009.
[51] L. Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006.
[52] H. Weyl. Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differential- gleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912.
[53] Y. Xia, H. Tong, W. K. Li, and L.-X. Zhu. An adaptive estimation of dimension reduc- tion space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363–410, 2002.
[54] Y. Xia, H. Tong, W. K. Li, and L.-X. Zhu. An adaptive estimation of dimension reduction space. In butions to Statistics, pages 299–346. World Scientific, 2009.
[55] K. Zhou, J. C. Doyle, K. Glover, et al. Robust and optimal control, volume 40. Prentice hall New Jersey, 1996.
[56] L. X. Zhu and K. T. Fang. Asymptotics for kernel estimate of sliced Inverse regression. Annals of Statistics, 1996.
[57] L. X. Zhu, M. Ohtaki, and Y. Li. On hybrid methods of inverse regression-based algorithms. Computational statistics & data analysis, 51(5):2621–2635, 2007.
Supplemental Materials for Learning functions varying along a central subspace
Proof of Proposition 1. By expressing ), we can write the covariance matrix as
which reflects x with respect to the central subspace Φ. The operator is invertible and
. We also have
For any set Ω ), we define the reflected set
spherical distribution, we have, for any set Ω
Therefore have the same distribution, which implies that Ψ
Φ have the same distribution.
We next show )). It is straightforward that
have the same distribution, (x, f(x)) and (
)) have the same distribution. Thus
Overall, the distributions of ())) are the same. Therefore, we have
Similarly, we can show that
Let
The covariance matrix ) can be written as
Let ˆbe the eigenvalues of
be the associated orthonormal eigenvectors, and let ˜
be the eigenvalues of
be the associated orthonormal eigenvectors. Then the eigenvalues of
associated eigenvectors are Φ
The eigenvalues satisfy
Proof of Example 1. Denote Ω = supp(-dimensional ball with radius r centered at
has a spherical distribution, we have Ω =
). Denote Ω
largest D-dimensional hypercube inside Ω. There are infinitely many such hypercubes all of which have side length 2
. Any such hypercube can be written as
for a set of orthornormal basis . In this proof, we choose the one satisfying
denotes the j-th column of Φ. We then define
Here -dimensional hypercube with side length 2
0 such that, for any
. Such a choice of
guarantees that for any
with Φ) is entirely inside Ω
(see Figure 15 for a demonstration). Denote the volume of a subset
with
with
Figure 15: For any Φ
) entirely locates inside Ω
be the mean of
be the mean of f(x) on
). We first estimate the difference between ¯
. For simplicity, we denote
by
. We parameterize
) to denote a point on
) be the disk centered at r(t) with radius r on the hyperplane of dimension
1 which is perpendicular to
be a measure on [0, 1] such that, for any
) is the tube enclosing
)) with radius
is uniform and
), we can express
Therefore
We next derive the error bound between
On the other hand, for any
Similarly, for any
when 4In summary,
The proof of Theorem 4 relies on several lemmas, which are presented and proved in Section D.1. We prove Theorem 4 in Section D.2.
D.1 Some key lemmas
The following lemma gives an estimate on the difference between the population variance of a bounded random variable and its empirical counterpart.
Lemma 7. Let s be a random variable in [are independent copies of s. Denote the empirical variance by
Lemma 7 is a consequence of U-statistics [27] applied on Based on Lemma 7, we prove a high probability bound between
be i.i.d. samples from the probability measure
according to the model in (1), under Assumption 2(i), 3, 4 and 5. Let
according to (30). Then
), and consider the tube
and
as separate cases.
, we expect
) to be small since
where the last inequality holds as long as ). By (30),
small when n is sufficiently large. These conditions are guaranteed as
and
is sufficiently large.
, there are sufficient points in
) is concentrated on
)) be the measure of the tube
), which satisfies
by (25). Let
) be the number of points in
is the empirical measure of
). By [40, Lemma 29], we have the following concentration of measure:
On the condition of the event )), we have
)), which implies
. By Lemma 7,
) is concentrated on
) with high probability:
A key step in the proof of Theorem 4 is to estimate the difference between ), which is given in the next lemma.
be i.i.d. samples from the probability measure
according to the model in (1), under Assumption 2(i), 3-6. Let
(30). Then
Proof of Lemma 9. We decompose the difference by
The term I in (D.4) represents the difference between the variance of f over the segment between , and the variance of f in the tube with radius r enclosing this segment. According to Assumption 6, I
The term II satisfies II
since it captures the variance of the bounded noise in [
The term III in (D.4) captures the difference between the population variance of y in the tube
) and its empirical counterpart. A high probability bound is given by Lemma 8. Putting the above ingredients together, we have
which implies (D.3).
The following lemma gives an upper bound of
Lemma 10. Suppose f is defined as (1), with the function g satisfying Assumption 3. For any two points
1], we have
Therefore,
D.2 Proof of Theorem 4
Proof of Theorem 4. We prove (32) and (33) in sequel.
• Proof of (32). Lemma 9 gives an estimate on We prove (32) by deriving a union bound to show that if n is large enough, then with high probability, every
) is small for all the (i, j) pairs satisfying A(i, j) = 1. Lemma 9 implies that, for any
Then we have
• Proof of (33). In Algorithm 2, we say two points are connected if Algorithm 2 outputs
) = 1. Algorithm 2 guarantees that if
are connected, they can not be connected with any other points, so that the connected pairs are independent from each other. As a consequence,
is upper bounded by n/2.
along each direction with grid spacing . Denote the d-dimensional cube with side length
For any Ω, we next show
with high probability. If
Ω
and by Lemma 10,
According to Lemma 9,
. Therefore, for any
Denote #ΩApplying a union bound gives
The equation above shows that, in all sets Ω, all pairs of points
satisfy
with probability no less than 1
In Algorithm 2, two points are likely to be connected if
the condition that in all the sets Ω
, all pairs of points
, there is at most one point in each Ω
that is not connected with other points in the output of Algorithm 2. Therefore, the number of connected pairs satisfies
if n is sufficiently large such that 2(
) can be computed by the following intergral
where are chosen such that
The second equality is due to the fact that is a probability which is no larger than 1.
Plugging to the equation above gives rise to
Lemma 6 is based on the following Lemma 11 (a generalization of [23, Lemma 11.1] in one dimension to the multidimensional case) and Lemma 12 ([23, Theorem 11.3]) below, which are standard results in non-parametric statistics.
Lemma 11. [23, Lemma 11.1] Suppose -smooth with
some
be the Taylor polynomial of
Then
for all z near a.
Proof of Lemma 11. We denote the multi-index ). Denote the partial derivative
. The Taylor
expansion of g at a gives
From this, and the assumption that ) smooth, one obtains
We next introduce a standard result ([23, Theorem 11.3]) in nonparametric statistics.
Lemma 12. [23, Theorem 11.3] Let be i.i.d. sampled from a probability measure
such that supp(
be sampled from the regression model
where be a linear space of functions from
be the estimator given by
Then
for some universal constant c.
In (F.1), the mean squared error is decomposed into two terms: the first term captures the variance and the second term estimates the bias. We consider piecewise polynomial approximation with order no more than
Proof of Lemma 6. In order to apply Lemma 12, we express the regression model in (14) as
with defined in (16). The function h is bounded:
The noise satisfies
for some universal c. Here dim() is the dimension of
and hence dim(
Let
be the piecewise Taylor polynomial of g with degree no more than
partition of [
cubes, where the Taylor expansion is at the center of each cube. The second term can be estimated from Lemma 11 as
Therefore
We have