Balancing the sensible level of model complexity against model fitness is a fundamental challenge faced by any
Proceedings of the 23International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).
learning algorithm. A model that is too simple can fail to capture the essential pattern in the data, and a model that is too complex is oversensitive to the idiosyncrasies of the particular data, resulting in highly variable patterns that are mere mirages in the noise. The learning algorithm’s ability to perform well on new, previously unseen data is typically used to set the model complexity. This performance is known as the out-of-sample error.
To be concrete, let dataset where
denote the features and response, respectively. The goal is to obtain an estimate of the response for a newly observed feature vector. We assume observations are independent and identically distributed draws from some joint unknown distribution
). We model this distribution as
) =
), and estimate
using the optimization problem
where is called the loss function, and
) is called the regularizer. Both the regularizer
) and the regularization parameter
have significant effects on the performance of the estimate by controlling the complexity of the model. Hence, for picking a good regularizer, r, or tuning the parameter
one would like to estimate the out-of-sample prediction error, defined as
where (sample from the unknown distribution q(y, x) independent of D, and
is a function that measures the closeness of
to
. A standard choice for
The problem of risk estimation has been extensively studied in the past fifty years and popular estimates, such as k-fold cross validation [Stone, 1974] are used extensively in practical systems. However, the emergence of high-dimensional estimation problems in which the number of features p is comparable or even larger than
0 100 200 300 400 500
Figure 1: Comparison of K-fold cross validation (for K = 3, 5, 7) and leave-one-out cross validation with the true (oracle-based) out-of-sample error for the elastic-net problem where ) =
and
) =
2 +
4. The upward bias of K-fold CV clearly decreases as number of folds increase.
N(
). The number of nonzero elements of the true
and their values is set to
. Dimensions are (
= 2. Extra-sample test data is
N(
) where
N(0, I). The true (oracle-based) out-of-sample prediction error is Err
+
. All depicted quantities are averages based on 100 random independent samples, and error bars depict one standard error.
the number of observations n, deemed many standard techniques in-accurate. For instance, Figure 1 compares the estimates obtained from k-fold cross validation for different values of k. As is clear in this figure, given the importance of each observation in high-dimensional settings, standard techniques, such as 5-fold suffer from a large bias.
One of the existing estimates of Errthat seems to be accurate in high-dimensional settings is the leave-one-out cross validation (LO), which is defined through the following formula:
where
is the leave-i-out estimate. The simulation results reported in Figure 1 and elsewhere [Rahnama Rad and Maleki, 2019,Wang et al., 2018,Stephenson and Broderick, 2019, Beirami et al., 2017, Takahashi and Kabashima, 2018] have demonstrated the good performance of LO in a wide range of high-dimensional problems. Despite the existence of extensive simulation results, the theoretical properties of LO have not been studied in the high-dimensional settings.
In this paper, we study the expected squared error of LO in estimating the out-of-sample error, in the high-dimensional setting, where both n and p are large, and n/p can be less than one. We focus on regularized regression in the generalized linear family, and we make no sparsity assumption on the vector of regression coefficients. In short, we obtain an almost sharp upper bound on the error . These bounds not only show that
also capture the rate of this convergence. This finally establishes what has been observed in empirical studies; LO obtains accurate out-of-sample risk estimates even in high-dimensional problems.
An important advantage of our theoretical results is that they can be used to clarify and connect some results from the recent literature on computationally efficient approximation to LO. For instance, [Rah- nama Rad and Maleki, 2019] showed that in the same high dimensional regime, where ALO stands for a computationally efficient approximation of LO we formally refer to in Section 1.2. A major consequence of our theory is that it shows that ALO is a consistent estimator of Err
these statements more concrete in the next sections.
1.1 Notation
We first review the notations that will be used in the rest of the paper. Let stand for the ith row of
and
stand for y and X, excluding the ith entry
and the ith row
, respectively, and let
be defined likewise. Additionally, let
stand for the regularized estimate in (1) when (
) are excluded. Moreover, define
Likewise, define ). The notation poly log n denotes polynomial of log n with a finite degree. Let
) stand for the largest and smallest eigenvalues of A, respectively. We state
) when the set of values
is stochastically bounded.
1.2 Computational complexity of LO and its approximation
The high computational cost of repeatedly refitting models is a major hurdle in using LO in high dimensional settings. A typical approach to alleviate this problem analytically approximates the leave-i-out model based on the full-data model. A large body of work has addressed computationally efficient approximations to the leave-one-out cross validation error for ridge regularized estimation problems (and its variants) [Allen, 1974,Craven and Wahba, 1979,Golub et al., 1979,O’Sullivan et al., 1986,Burman, 1990,Cessie and Houwelingen, 1992,Opper and Winther, 2000,Caw- ley and Talbot, 2008,Meijer and Goeman, 2013,Vehtari et al., 2016, Mousavi et al., 2018]. Extensions to a wide array of regularizers, such as LASSO [Obuchi and Kabashima, 2018, Rahnama Rad and Maleki, 2019, Stephenson and Broderick, 2019] and nuclear norm [Wang et al., 2018] were recently studied and the validity of these approximations in estimating LO (and its variants) were theoretically studied in [Obuchi and Kabashima, 2016,Beirami et al., 2017,Rahnama Rad and Maleki, 2019, Giordano et al., 2019, Stephenson and Broderick, 2019,Xu et al., 2019].
For example, a single Newton step around in [Wang et al., 2018,Rahnama Rad and Maleki, 2019]
to approximate
and using the Woodburry lemma, the following scalable approximate LO (ALO) formula was obtained:
where
This result was extended to nonsmooth regularizers. For example, [Wang et al., 2018, Rahnama Rad and Maleki, 2019,Stephenson and Broderick, 2019] showed that for formula is a valid approximation of LO if the following H matrix is used:
where S is the active set of is the matrix X restricted to columns indexed by S. With minor assumptions about the data generating process and without any sparsity assumption on the vector of regression co-efficients, [Rahnama Rad and Maleki, 2019] (Theorem 3 and Corollary 1) proved that for various regularizers and regression methods
the high dimensional setting where
is constant while
Our finite sample bounds in the next section show that with similar (easy to check) regularity conditions, for various regularizers and regression methods, Err
0 estimate go to zero as
but
is a fixed number. As a byproduct of this result, we show that in this high dimensional regime
. We will more formally state these claims in Section 3.
2.1 Our assumptions
Our goal is to evaluate the accuracy of LO in estimating the out-of-sample prediction error in the high-dimensional regime. Our results are valid for finite values of n and p. Later, in order to make asymptotic conclusions, we suppose that is constant while
We now state our assumptions for theorem 1. For simplicity of exposition, we start by stating a strong version of our assumptions, which often requires uniform bounds. Weaker analogues are discussed in 2.3. As the assumptions may appear somewhat opaque and technical, we will discuss them in the context of usual assumptions and concrete examples of standard generalized linear models.
Assumption 1. The vectors are independent zero mean vectors with covariance
such that
for a nonnegative constant
Assumption 1 characterizes the different distributions obtained for each are scaled in a way that ensures
= O(1) and Var(
) =
(1), assuming that
(for i = 1
) is
). For instance, under the linear model
, this scaling ensures that the signal-to-noise ratio in each observation remains fixed as n, p grow (when the noise variance is a non-zero constant). Unless we make explicit assumptions about the sparsity of
, without the 1/p scaling, the Hessian of the optimization problem (1) is dominated by the data, making the regularizer, and in turn
, irrelevant. In this paper, we make no sparsity assumption on the vector of regression coefficients. For similar finite signal-to-noise ratio scalings in the high-dimensional asymptotic analysis see [El Karoui, 2017, El Karoui et al., 2013,Bean et al., 2013,Donoho and Montanari, 2016,Donoho et al., 2011,Bayati and Montanari, 2012, Nevo and Ritov, 2016, Su et al., 2017, Dobriban and Wager, 2018,Rahnama Rad and Maleki, 2019,Xu et al., 2019]. Under this scaling, the optimal value of
will be
Mousavi et al., 2018].
Assumption 2. We assume the functions ) and
) are twice differentiable in z. We also assume that
) and
) are convex in z and
, respectively. Let (
) be a sample from the unknown distribution q(y, x) independent of D =
. We assume there exists constants
, such that, for all i, j, uniformly:
where
Assumption 2 characterizes the smoothness of the GLM problem (and its associated leave-one-out versions). As we will show below there are many examples, such as logistic and robust regression, in which we can find and
. However, in some other popular examples, such as linear or Poisson regression,
is a random quantity and we cannot find an absolute constant to dominate it everywhere. As will be discussed later in Section 2.3, we can weaken Assumption 2 at the expense of a slightly stronger moment condition on the feature vector
Example 1. In the generalized linear model family, for the negative logistic regression log-likelihood
) =
) it is easy to show that ˙
2 for any y and z, leading to
Example 2. Our next example is about a smooth approximation of the Huber loss used in robust estimation, known as the pseudo-Huber loss:
where 0 is a fixed number. If we use this loss for the linear regression problem, and set
) =
) =
). It is easy to show that ˙
, leading to
Our next example is concerned with another popular loss function in linear regression, namely the absolute deviation. However, since we would like our loss functions to be differentiable, we use the following smooth approximation of the absolute deviation loss, , introduced in [Schmidt et al., 2007]:
ℓ
where 0 is fixed.
Example 3. For ) =
) =
), we have
= 1. In fact, it is straightforward to show that ˙
define the two matrices
We assume that there exists a fixed number , such that
Assumption 3 characterizes the curvature of the GLM problem (and its associated leave-one-out versions). In some examples, such as the ones that have ridge or smoothed elastic-net as the regularizer, it is straightforward to confirm this assumption. For instance, for the ridge regularization, ) =
2, we have
. In Section 2.3, we explain how this assumption can be relaxed (at the expense of requiring more stringent moment conditions on
) to cover more examples.
Having stated our assumptions, we now move on to stating our main result before proposing a number of examples to demonstrate how this result can be applied in common GLM cases.
2.2 Main theorem
Based on these assumptions we can now evaluate the accuracy of LO in estimating Err. The following theorem proves that the expected square error of LO in estimating Err
is small even in high-dimensional asymptotic settings.
. If Assumptions 1, 2 and 3 hold, then
where the outer expectation is taken with respect to the data D and:
The proof can be found in Appendix C.
The only term that is not explicitly computed in terms of the constants in our assumptions is ]. Hence, to obtain an explicit quantitative bound for a specific GLM problem requires computing this quantity. We present two examples below.
Corollary 1. (Ridge regularized logistic regression) Consider the negative logistic regression log-likelihood , and the regularizer
) =
2, where
. Furtherassume that
is iid
), where
. If
) =
, then there exists a constant
where
The proof of this corollary can be found in Section D of the supplementary material.
Corollary 2. (Pseudo-Huber loss with strongly convex regularizer) We consider again the pseudo-Huber loss defined in (6) with parameter . As this loss is typically used in regression settings, we consider a linear regression model
denotes i.i.d. zero- mean noise, and
additionally assume that the regularizer is strongly convex with parameter
Under these conditions, there exists a fixed number
(depending on
) such that
The proof of this corollary can be found in Section E of the supplementary material.
To summarize, the examples presented in Corollary 1 and 2 satisfy the assumption needed for Theorem 1.
2.3 Extensions
As we discussed in Section 2.1, we can weaken the assumptions without a major change in our proofs or the main conclusions of our result. In this section, we aim to present one such weaker set of assumptions that enables our analyses to cover several other popular examples, such as the Poisson and linear regression.
We assume that
are i.i.d. zero mean vectors with covariance
such that
for a non-negative constant
. Furthermore, there exists a fixed number
, such that
Note that this assumption is more stringent than Assumption 1. However, in essence the only extra requirement of this assumption is a bound on the fourth moments. Hence, it holds for a wide range of random features including sub-Gaussian and sub-exponential features. Thanks to this slightly stronger moment assumption we can weaken the other
assumptions.
We assume the functions
are twice differentiable in z. Moreover, assume
) and
) are convex in z and
, respectively. Let (
be a sample from the unknown distribution q(y, x) independent of
. We assume that there exist constants ˜
and ˜
, such that for all i, j, uniformly
where
Compared to Assumption 2 that requires to be bounded everywhere, this assumption requires the 8
moment of
to be bounded. This simple modification enables our theoretical results to be applied to a much broader set of regression techniques, including Poisson, linear, and negative binomial regression. These three examples will be studied later in this section.
Let
and
be as defined in Assumption 3. We assume that there exists a fixed number ˜
, such that
Again, compared to Assumption 3, this assumption only bounds the moments of the minimum eigenvalue of the matrix. The following example shows an example in which it is impossible to find a positive lower bound for the minimum eigenvalue, but still the moments of the inverse of the minimum eigenvalue are bounded.
Example 1. Suppose that 1 and that the loss function is strongly convex with parameter c, and the regularizer is convex. Finally, suppose that
. Then, there exists a fixed number ˜
that satisfies Assumption 3
for large enough values of n and p.
The proof can be found in Section F of the supplementary material.
As we discussed before one can prove the accuracy of LO under Assumptions 1, 2
, and 3
. The following theorem formalizes this claim.
Theorem 2. Let . If Assumptions 1
, 2
and 3
where the outer expectation is taken with respect to the data D and:
The proof can be found in Section G of the supplementary material. As we described before, this theorem can cover several generalized linear models, that could not be covered by Theorem 1. We mention three important examples below.
Corollary 3. (Square loss with elastic-net penalty) Consider the data generating mechanism where
),
), and
. Suppose that we use the smoothed elastic-net optimization
where for 0,
) =
+ (1
), and
is a smooth approximation of the
-norm. Then, there exists a fixed number, ˜
, such that
Since the proof of this claim is long, we defer it to Section H of the supplementary material.
Corollary 4. [Poisson regression with soft-rectifying link] Consider the data-generating mechanism Poisson(
)), where f(z) = log(1 +
) denotes the soft-rectifying link,
b. Finally, assume that r denotes the smoothed elastic-net regularizer introduced in Corollary 3. Under these assumptions, there exists a fixed number, ˜
, such that:
The proof can be found in Section I of the supplementary file.
Remark 1. We have assumed here that is multivariate Gaussian. As might be clear to the reader from the proof, this normality assumption on x may be relaxed to an 8
moment assumption at the cost of a slightly more complicated proof.
Corollary 5 (Negative-Binomial Regression). We consider the problem of negative binomial regression with fixed shape parameter and exponential link. Here, the negative log-likelihood is given by:
where ) denotes a constant which only depends on
and y. Assume the data generating process is such that
similar to Corollary, 3 we use the smoothed elastic-net as the regularizer. Under these assumptions, there exists a fixed number, ˜
, such that
The proof can be found in Section J of the supplementary material.
We mentioned in Section 1.2 that different approximations of LO have been proposed in the literature to reduce the computational complexity of LO. Among such approximations, the ALO formula introduced in (5), is analyzed in [Rahnama Rad and Maleki, 2019] under a similar asymptotic framework as the one discussed in our paper:
Theorem 3. [Rahnama Rad and Maleki, 2019] Suppose that is constant while
. Under the assumption
), for the regression problems discussed in Corollaries 1, 2, 3, and 4 we have
Note that the ultimate goal of ALO is to use it as an estimate of Err. Hence, while Theorem 3 confirms the accuracy of ALO in approximating LO it does not explain whether the estimates obtained by ALO or LO can be trusted in high-dimensional settings. However, we can combine this result with Theorems 1 and 2 to prove the accuracy of ALO in estimating Err
. Toward this goal we first prove the following claim.
Theorem 4. Suppose that is constant while
. Under the assumption
), for the regression problems discussed in Corollaries 1, 2, 3, and 4 we have
Proof. For a fixed number M
The first inequality in the above equations is due to Markov inequality, and the second inequality is a result of Theorems 1 and 2. As we discussed in Corollaries 1, 2, 3, and 4 either are finite numbers. Hence, as M increases, the final probability can be reduced to the desired level.
Before we proceed to establish the accuracy of ALO we have to clarify Theorem 4. Note that even under the idealized (but incorrect) assumption that the individual estimates ) are independent and
s are the same as
, the central limit theorem indicates that
Hence, we should not expect the error of LO to be
). Therefore, the above theorem seems to offer the sharpest result that is possible for LO. Note that the sharpness is with regard to the rate of convergence and not the constants.
Combining the results of Theorem 2 and Theorem 4 we can finally quantify the accuracy of ALO in estimating Err
Corollary 6. Suppose that is constant while
. Under the assumption
), for the regression problems discussed in Corollaries 1, 2, 3, and 4 we have
The proof of this corollary is straightforward, and is hence skipped. Note that this corollary finally establishes the fact that ALO obtains accurate estimates of Err. While we have established this result for only four popular examples in this paper, Theorems 1, 2 and Theorem 3 of [Rahnama Rad and Maleki, 2019] can be applied to a much broader class of regression problems. Hence, a similar result is expected for such scenarios as well. Finally, we should emphasize that by comparing Theorems 4 and 6 one may notice that the accuracy of ALO might be worse than LO by a logarithmic factor. At this stage, it is not clear whether this difference is an artifact of the proof of [Rahnama Rad and Maleki,
Table 1: Square loss with elastic-net penalty: MSE(and standard errors).
2019] or it is a real extra error that has been introduced by the approximation of LO.
In this section, we present two numerical experiments to show that the ) bound given in Theorem 1 and 2 is sharp but not tight. Specifically, we generate synthetic data, and compare Err
for elastic-net linear regression and ridge logistic regression. In all the examples in this section, the rows of X are N(
). Here we let
and
) =
). The codes for the Figure 1 and Table 1,2 are available at https://github.com/RahnamaRad/LO.
Square loss with elastic-net penalty. We set
and = 0.5. The true unknown parameter vector
is sparse with k = 0.1n non-zero elements independently drawn from a zero mean unit variance Laplace distribution, leading to Var(
) = 0.1 (regardless of the values of n and p). To generate data, we sample
N(
). Here the out-of-sample error is:
As we increase n and p, we keep the ratio constant. We numerically calculate MSE
LO)
as a function of n (and p = 10n) based on 100 synthetic data samples, for each n, p and
= 5. We fitted a line to model log(MSE)
) and obtained a slope of -1.03 (SE= 0.04) and intercept of -0.46 (SE= 0.54) with an Adjusted R-squared of 0.95. The slope of -1.03 (SE = 0.04) shows that the bound is sharp because it confirms the 1/n scaling of our theory. Table 1 shows the numerical MSE as a function of n and p.
Logistic regression with ridge penalty. We set ) =
+ log(1 +
) (the negative logistic log-likelihood) and
. To generate data, we sample
Table 2: Logistic regression with ridge penalty: MSEE(Err
(and standard errors) and the upper bound based on 8 in Corollary 1 of Theorem 1.
out-of-sample error
As we increase n and p, we keep the ratio n/p = 1 constant. We numerically calculate MSEas a function of n (and p = n) based on 100 synthetic data samples, for each n, p and
= 0.1. We fitted a line to model log(MSE)
) and obtained a slope of -1.00 (SE= 0.04) and intercept of 0.34 (SE= 0.27) with an Adjusted R-squared of 0.99. The slope of -1.00 shows that the bound is sharp because it confirms the 1/n scaling of our theory. Table 2 compares the numerical MSE and the theoretical bound from Theorem 1 and Corollary 1. The theoretical upper bound was computed using 8 in Corollary 1 where in this example, we have
= 0
= 1, and
= 1, leading to
= 6311.52. The significant difference between the bound and the MSE shows that the bound is not tight.
Leave-one-out estimators (and their approximate versions) have seen renewed interest recently in the context of big data and high-dimensional problems. We show that, in general, leave-one-out risk estimators have desirable statistical behaviours in the high-dimensional setting. Although the leave-out-risk estimator itself is generally computationally intractable, this result also implies consistency for a (growing) number of approximate leave-one-out estimators, and demonstrate that such estimators offer a potentially good direction for building risk estimators for high-dimensional problems.
K.R. was supported by the NSF DMS grant 1810888, and Eugene M. Lang Junior Faculty Research Fellow- ship. A.M. was supported by the NSF DMS grant 1810888.
[Allen, 1974] Allen, D. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125–127.
[Bayati and Montanari, 2012] Bayati, M. and Montanari, A. (2012). The lasso risk for gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997–2017.
[Bean et al., 2013] Bean, D., Bickel, P. J., El Karoui, N., and Yu, B. (2013). Optimal m-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36):14563–14568.
[Beirami et al., 2017] Beirami, A., Razaviyayn, M., Shahrampour, S., and Tarokh, V. (2017). On optimal generalizability in parametric learning. NIPS, pages 3455–3465.
[Burman, 1990] Burman, P. (1990). Estimation of generalized additive models. Journal of Multivariate Analysis, 32:230–255.
[Cawley and Talbot, 2008] Cawley, G. and Talbot, N. (2008). Efficient approximate leave-one-out cross-validation for kernel logistic regression. Machine Learning, 71:243–264.
[Cessie and Houwelingen, 1992] Cessie, S. and Houwelingen, J. (1992). Ridge estimators in logistic regression. Applied Statistics, 41(1):191–201.
[Craven and Wahba, 1979] Craven, P. and Wahba, G. (1979). Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–403.
[Dobriban and Wager, 2018] Dobriban, E. and Wager, S. (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. Ann. Stat., 46(1):247–279.
[Donoho et al., 2011] Donoho, D., Maleki, A., and Montanari, A. (2011). Noise sensitivity phase transition. IEEE Trans. Inform. Theory, 57(10).
[Donoho and Montanari, 2016] Donoho, D. and Montanari, A. (2016). High dimensional robust m-estimation: Asymptotic variance via approximate message passing. Probab Theory Relat Fields, 166(3-4):935–969.
[El Karoui, 2017] El Karoui, N. (2017). On the impact of predictor geometry on the performance on
high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory Related Fields, 170(1-2):95–175.
[El Karoui et al., 2013] El Karoui, N., Bean, D., Bickel, P., Lim, C., and Yu, B. (2013). On robust regression with high-dimensional predictors. PNAS, 110(36):14557–14562.
[Giordano et al., 2019] Giordano, R., Stephenson, W., Liu, R., Jordan, M., and Broderick, T. (2019). A swiss army infinitesimal jackknife. JMLR, 89(1139-1147).
[Golub et al., 1979] Golub, G., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223.
[Meijer and Goeman, 2013] Meijer, R. and Goeman, J. (2013). Efficient approximate k-fold and leave-one-out cross-validation for ridge regression. Biometrical Journal, 55(2):141–155.
[Mousavi et al., 2018] Mousavi, A., Maleki, A., and Baraniuk, R. (2018). Consistent parameter estimation for lasso and approximate message passing. The Annals of Statistics, 46(1):119–148.
[Nevo and Ritov, 2016] Nevo, D. and Ritov, Y. (2016). On Bayesian robust regression with diverging number of predictors. Electron. J. Statist., 10(2):3045–3062.
[Obuchi and Kabashima, 2016] Obuchi, T. and Kabashima, Y. (2016). Cross validation in lasso and its acceleration. J. Stat. Mech. Theor. Exp., 53(304):1–36.
[Obuchi and Kabashima, 2018] Obuchi, T. and Kabashima, Y. (2018). Accelerating CrossValidation in Multinomial Logistic Regression with ell1-Regularization. Journal of Machine Learning Research, 19:1–30.
[Opper and Winther, 2000] Opper, M. and Winther, O. (2000). Gaussian processes and SVM: Mean field results and leave-one-out. In Smola, A., Bartlett, P., Scholkopf, B., and Schuurmans, D., editors, Advances Large Margin Classifiers, pages 43–56. MIT Press, Cambridge, MA.
[O’Sullivan et al., 1986] O’Sullivan, F., Yandell, B., and Raynor, W. (1986). Automatic smoothing of regression functions in generalized linear models. JASA, 81(393):96–103.
[Rahnama Rad and Maleki, 2019] Rahnama Rad, K. and Maleki, A. (2019). A scalable estimate of the extra-sample prediction error via approximate leave-one-out. arXiv:1801.10243v3.
[Schmidt et al., 2007] Schmidt, M., Fung, G., and Rosales, R. (2007). Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, pages 286–297. Springer.
[Stephenson and Broderick, 2019] Stephenson, W. and Broderick, T. (2019). Sparse Approximate Cross-Validation for High-Dimensional GLMs. arXiv preprint arXiv:1905.13657.
[Stone, 1974] Stone, M. (1974). Cross-validatory choice and assesment of statistical predictions. J R Stat Soc Series B, 36(2):111–147.
[Su et al., 2017] Su, W., Bogdan, M., and Candes, E. (2017). False discoveries occur early on the Lasso path. Ann. Stat., 45(5):2133–2150.
[Takahashi and Kabashima, 2018] Takahashi, T. and Kabashima, Y. (2018). A statistical mechanics approach to de-biasing and uncertainty estimation in lasso for random measurements. Journal of Statistical Mechanics: Theory and Experiment, 7(073405).
[Vehtari et al., 2016] Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T., and Winther, O. (2016). Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. Journal of Machine Learning Research, 17(1):3581–3618.
[Wang et al., 2018] Wang, S., Zhou, W., Maleki, A., Lu, H., and Mirrokni, V. (2018). Approximate Leave-One-Out for High-Dimensional Non-Differentiable Learning Problems. International Conference on Machine Learning.
[Xu et al., 2019] Xu, J., Maleki, A., Rahnama Rad, K., and Hsu, D. (2019). Consistent risk estimation in high-dimensional linear regression. arXiv:1902.01753.
A Notation
B Background material on Gaussian random variables, vectors and matrices
In this section, we review a few important results regarding the functions of Gaussian matrices and Gaussian vectors that are used in our examples. The first result is about the moments of the inverse of the minimum eigenvalue of a Wishart matrix.
Lemma 1. (Lemma 19 of [Xu et al., 2019]) Let , and suppose that
. Then, for a fixed
Our next two lemmas are concerned with the moments of a Gaussian and random variables:
. Then, we have
where the notation p!! denotes the double factorial. Furthermore, when p is even the above inequality is in fact an equality.
The proof of this claim is straightforward and can be found in many standard statistics text books.
C Proof of Theorem 1
Define
Then,
The proof concludes upon noting that Lemma 4 and 5 yield
Lemma 4. Under the assumptions of Theorem 1 we have that:
Proof of Lemma 4.
Note that
Next we study
where () is independent of D, leading to
Define the quantities
Likewise,
Likewise,
To conclude, note that:
where the last inequality is due to Lemma 6. Using the fact that get
Lemma 5. Under the assumptions of Theorem 1, we have:
Proof of Lemma 5. Note that we have for all i:
Therefore, using the mean-value Theorem, for some 1], we get
where the last inequality is due to Lemma 6. Using the fact that get
Lemma 6. If both the loss function and the regularizer are twice differentiable, then for all i = 1, . . . , n:
Proof of Lemma 6. The leave-one-out estimate, =
, satisfies
) = 0. The multivariate mean-value Theorem yields
where the Jacobean is
Moreover,
We get
leading to
and
Likewise,
D Proof of Corollary 1
We would like to use Theorem 1 to prove this corollary. Toward this goal, we first have to prove that Assumptions 1, 2, and 3 hold, and that Var[] is bounded. Given the fact that
), Assumption 1 holds. As we discussed in Example 1, Assumption 2 holds as well with
. Finally, given that the regularizer is ridge Assumption 3 holds too. Hence, the only remaining step is to check the boundedness of Var[
In the rest of the proof we aim to prove that
Note that
where to obtain the last inequality we have used log(1 + . Furthermore, since (1 +
have
Comparing with the zero estimator yields,
(1 +
) +
2. Since log(1 +
0 for any z, we get
2. Therefore, we can say that for ridge regularized logistic regression
To summarize, using the bound above and Theorem 1 (with = (
) , for ridge regularized logistic regression, we conclude that
where the last equation is due to = 2 (shown in Example 1) and
E Proof of Corollary 2
We would like to use Theorem 1 to prove this corollary. Toward this goal we have to confirm Assumptions 1, 2, and 3 and prove the boundedness of . Assumption 1 is already assumed in the corollary. Assumption 2 is also confirmed in Example 2 so that
. Since the regularizer is assumed to be strongly convex, Assumption 3 is also automatically satisfied with
. Hence, the only remaining step is to obtain an upper bound for
. In the rest of the proof we prove that
We have
Furthermore, we have that + Var[
. Additionally, note that using the strong convexity of the regularizer, and by comparing the value of
) +
) at
and 0, we have that
Therefore,
We may bound this quantity explicitly in terms of the covariance of x:
Hence,
To summarize, using the bound above and Theorem 1 (with ), we conclude that
where the last equation is due to (shown in Example 2) and
F Proof of Example 1
It is straightforward to check that, for any i, we have:
This implies that:
Define the vectors . Hence,
). Furthermore, define the matrix Z as the matrix that has
as its rows. It is straightforward to check that:
The fact that the quantity
is proved in [Xu et al., 2019]. See Lemma 1 in the supplementary material.
G Proof of Theorem 2
The proof of this result is very similar to the proof of Theorem 1. Hence, instead of rewriting the proof, we only emphasize on the differences between the proofs of Theorems 1 and 2. The strategy of the proof is exactly the same. We break the error between LO and Errand try to bound the second moments of these quantities. The following lemma obtains an upper bound for the second moment of
Lemma 7. Under the assumptions of Theorem 2 we have
Proof. The proof of this lemma is similar to the proof of Lemma 4. All the steps are exactly the same up to the
point that is proved:
However, the way we would like to bound here is slightly different from the approach used in the proof of Lemma 4. According to Lemma 6 we have:
Hence, by using the Cauchy-Schwarz inequality twice we obtain:
Using the fact that
The second Lemma aims to obtain an upper bound for the second moment of . This corresponds to Lemma 5 in the proof of Theorem 1. Lemma 8. Under the assumptions of Theorem 2, we have
Proof. Again the proof follows very similar to the steps as the proof of Lemma 5. In fact, we follow exactly the same steps until it is proved that
from which we deduce:
H Proof of Corollary 3
As is clear, we would like to use Theorem 2 to prove our claim. Toward this goal, we have to prove that Assumptions 1, 2
, and 3
hold. Furthermore, we have to obtain an upper bound for the constant ˜
, which in turn requires us to bound
]. Given that
is Gaussian, Assumption 1
is automatically satisfied. Furthermore, since the regularizer is elastic-net, it is straightforward to prove Assumption 3
. To see this, first note that, for all i, j, we have almost surely:
where ). Hence, it is straightforward to see that
Hence, the only remaining steps are to prove Assumption 2and bound the term
that
Hence,
Hence, if we prove Assumption 2, we have also proved that
In the rest of this section, we focus on the proof of Assumption 2. Note that ˙
. Under these assumptions, we prove that there exists a fixed number ˜
Consider the following definitions:
It is then straightforward to prove that
Our goal is to show that all the “finite” moments of the elements of , including the 8
moment required in our example, are O(1). From (18) we have
Hence, we bound each of the above three terms separately in the following lemmas:
Lemma 9. Under the assumptions of Example 3 we have
Proof. First note that
Note that conditioned on the distribution of
is a zero mean Gaussian random variable with variance
. Hence, (21) and the moments of a Gaussian random variable (see Lemma 2) lead to
Hence, by the law of iterated expectation, we obtain
Lemma 10. Under the assumptions of Example 3, if
Proof. Note that conditioned on X, the distribution of is multivariate Gaussian with mean zero and covariance matrix
where the first inequality is due to Lemma 2. Hence, again by the law of iterated expectation, we have
Lemma 11. Under the assumptions of Example 3 we have
The first order optimality condition yields
Since the minimum eigenvalue of the Hessian of , therefore the minimum eigenvalue of
(for all
) is greater than 2
, leading to
This together with ¨
Furthermore, we have
Note that for two random variables a and b we have
Hence,
First note that, since the maximum eigenvalue of
Hence,
Furthermore, we have
2. Note that ). Furthermore,
. Hence, using the the moments of Gaussian (see Lemma 2), we have
3. Given , the distribution of
is
). Furthermore,
, where the last inequality is due to (26). Hence, we have
Finally, we compute an upper bound on . Since
is independent of
and
, we conclude that given
) is a Gaussian random variable with mean zero and variance
where , and the second inequality is due to (27). Hence,
I Proof of Corollary 4
The goal of this section is to use Theorem 2 to prove corollary 4. Hence, we have to confirm that Assumptions 1, 2
, and 3
hold, and that
] is bounded. Similar to what we did at the beginning of Section H, it is straightforward to check the validity of Assumptions 1
. Hence, we only focus on proving Assumption 2
and finding an upper bound for
Regarding Assumption 2, we first prove that under the assumptions of this corollary, there exists a fixed number ˜
, such that
In particular, we have that:
To obtain equality (a) we have used the moment generating function of the Poisson distribution with )). To obtain inequality (b) we have used the moment generating function of a Gaussian distribution and the fact that
. Given that the upper bound we derived in (36) for the derivative of the loss function does not depend on the second input argument of the loss, that is
, the proof that Poisson loss satisfies the other conditions of Assumption 2
) will be exactly similar and hence is skipped. In particular, we have verified the conditions of Assumption 2
for any convex regularizer.
Now we turn our attention to bounding ]. First note that
Furthermore, from the mean value theorem we have:
Hence, we have:
To complete the proof we have to show that both ) and E(1 +
)(
are bounded. First note that, using
) and, for any
Hence,
The following facts will help us bound these terms:
On the other hand, it is straightforward to check that for any
By combining these equations we obtain:
Note that is a Gaussian random variable with mean zero and variance (
is bounded by a constant.
For the second term in (38) we have
are bounded for 2 and
4. As previously, we note that
is a Gaussian random variable with variance
, and hence (
is bounded. Hence, the only remaining step is to prove the boundedness of
is at most 2. Note that conditioned on
the random variable
is Gaussian with the variance that is bounded by
. Hence, using Lemma 2 we have
The -strong convexity of the smoothed elastic-net regularizer r, and the fact that
0, leads to
Hence, we have to prove that are bounded. First note we proved in (43) that:
Note that is a Gaussian random variable with variance
, and hence
is bounded. On the other hand,
It is straightforward to prove that ˙
where to obtain the last inequality we used the mean value theorem
where ˜), and the facts that ˙
) we obtain:
J Proof of Corollary 5
Similar to the proofs of Corollaries 4, 3, we would like to use Theorem 2 to prove our claim. Toward this goal, We have to prove that Assumptions 1hold. Furthermore, we have to obtain an upper bound for the constant ˜
, which in turn requires us to bound
]. Again, the proofs of Assumptions 1
and 3
are exactly the same as we presented in the last two sections. Hence, we only focus on Assumption 2
and
]. We would like to prove that the conditions of Assumption 2
are satisfied with ˜
It we compute the derivative of the log-likelihood, we will obtain
As the bound (50) is free of z, the same argument above applies to the other requirements in Assumption 2
Now we turn our attention to the calculation of ]. Note that
Note that by removing the constant from the log-likelihood we obtain
The rest of the proof is very similar to the proof that we presented for Corollary 4. Hence, we skip it.