Error bounds in estimating the out-of-sample prediction error using leave-one-out cross validation in high-dimensions

2020·Arxiv

Abstract

Abstract

We study the problem of out-of-sample risk estimation in the high dimensional regime where both the sample size n and number of features p are large, and n/p can be less than one. Extensive empirical evidence confirms the accuracy of leave-one-out cross validation (LO) for out-of-sample risk estimation. Yet, a unifying theoretical evaluation of the accuracy of LO in high-dimensional problems has remained an open problem. This paper aims to fill this gap for penalized regression in the generalized linear family. With minor assumptions about the data generating process, and without any sparsity assumptions on the regression coefficients, our theoretical analysis obtains finite sample upper bounds on the expected squared error of LO in estimating the out-of-sample error. Our bounds show that the error goes to zero as , even when the dimension p of the feature vectors is comparable with or greater than the sample size n. One technical advantage of the theory is that it can be used to clarify and connect some results from the recent literature on scalable approximate LO.

Keywords: High-dimensional statistics, Regularized estimation, Out-of-sample risk estimation, Cross validation, Generalized linear models, Model selection.

1 Introduction

Balancing the sensible level of model complexity against model fitness is a fundamental challenge faced by any

Proceedings of the 23International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

learning algorithm. A model that is too simple can fail to capture the essential pattern in the data, and a model that is too complex is oversensitive to the idiosyncrasies of the particular data, resulting in highly variable patterns that are mere mirages in the noise. The learning algorithm’s ability to perform well on new, previously unseen data is typically used to set the model complexity. This performance is known as the out-of-sample error.

To be concrete, let dataset where denote the features and response, respectively. The goal is to obtain an estimate of the response for a newly observed feature vector. We assume observations are independent and identically distributed draws from some joint unknown distribution ). We model this distribution as ) = ), and estimate using the optimization problem

where is called the loss function, and ) is called the regularizer. Both the regularizer ) and the regularization parameter have significant effects on the performance of the estimate by controlling the complexity of the model. Hence, for picking a good regularizer, r, or tuning the parameter one would like to estimate the out-of-sample prediction error, defined as

where (sample from the unknown distribution q(y, x) independent of D, and is a function that measures the closeness of to . A standard choice for

The problem of risk estimation has been extensively studied in the past fifty years and popular estimates, such as k-fold cross validation [Stone, 1974] are used extensively in practical systems. However, the emergence of high-dimensional estimation problems in which the number of features p is comparable or even larger than

0 100 200 300 400 500

Figure 1: Comparison of K-fold cross validation (for K = 3, 5, 7) and leave-one-out cross validation with the true (oracle-based) out-of-sample error for the elastic-net problem where ) = and ) = 2 + 4. The upward bias of K-fold CV clearly decreases as number of folds increase. N(). The number of nonzero elements of the true and their values is set to . Dimensions are (= 2. Extra-sample test data is N() where N(0, I). The true (oracle-based) out-of-sample prediction error is Err+ . All depicted quantities are averages based on 100 random independent samples, and error bars depict one standard error.

the number of observations n, deemed many standard techniques in-accurate. For instance, Figure 1 compares the estimates obtained from k-fold cross validation for different values of k. As is clear in this figure, given the importance of each observation in high-dimensional settings, standard techniques, such as 5-fold suffer from a large bias.

One of the existing estimates of Errthat seems to be accurate in high-dimensional settings is the leave-one-out cross validation (LO), which is defined through the following formula:

where

is the leave-i-out estimate. The simulation results reported in Figure 1 and elsewhere [Rahnama Rad and Maleki, 2019,Wang et al., 2018,Stephenson and Broderick, 2019, Beirami et al., 2017, Takahashi and Kabashima, 2018] have demonstrated the good performance of LO in a wide range of high-dimensional problems. Despite the existence of extensive simulation results, the theoretical properties of LO have not been studied in the high-dimensional settings.

In this paper, we study the expected squared error of LO in estimating the out-of-sample error, in the high-dimensional setting, where both n and p are large, and n/p can be less than one. We focus on regularized regression in the generalized linear family, and we make no sparsity assumption on the vector of regression coefficients. In short, we obtain an almost sharp upper bound on the error . These bounds not only show that also capture the rate of this convergence. This finally establishes what has been observed in empirical studies; LO obtains accurate out-of-sample risk estimates even in high-dimensional problems.

An important advantage of our theoretical results is that they can be used to clarify and connect some results from the recent literature on computationally efficient approximation to LO. For instance, [Rah- nama Rad and Maleki, 2019] showed that in the same high dimensional regime, where ALO stands for a computationally efficient approximation of LO we formally refer to in Section 1.2. A major consequence of our theory is that it shows that ALO is a consistent estimator of Errthese statements more concrete in the next sections.

1.1 Notation

We first review the notations that will be used in the rest of the paper. Let stand for the ith row of and stand for y and X, excluding the ith entry and the ith row , respectively, and let be defined likewise. Additionally, let stand for the regularized estimate in (1) when () are excluded. Moreover, define

Likewise, define ). The notation poly log n denotes polynomial of log n with a finite degree. Let ) stand for the largest and smallest eigenvalues of A, respectively. We state ) when the set of values is stochastically bounded.

1.2 Computational complexity of LO and its approximation

The high computational cost of repeatedly refitting models is a major hurdle in using LO in high dimensional settings. A typical approach to alleviate this problem analytically approximates the leave-i-out model based on the full-data model. A large body of work has addressed computationally efficient approximations to the leave-one-out cross validation error for ridge regularized estimation problems (and its variants) [Allen, 1974,Craven and Wahba, 1979,Golub et al., 1979,O’Sullivan et al., 1986,Burman, 1990,Cessie and Houwelingen, 1992,Opper and Winther, 2000,Caw- ley and Talbot, 2008,Meijer and Goeman, 2013,Vehtari et al., 2016, Mousavi et al., 2018]. Extensions to a wide array of regularizers, such as LASSO [Obuchi and Kabashima, 2018, Rahnama Rad and Maleki, 2019, Stephenson and Broderick, 2019] and nuclear norm [Wang et al., 2018] were recently studied and the validity of these approximations in estimating LO (and its variants) were theoretically studied in [Obuchi and Kabashima, 2016,Beirami et al., 2017,Rahnama Rad and Maleki, 2019, Giordano et al., 2019, Stephenson and Broderick, 2019,Xu et al., 2019].

For example, a single Newton step around in [Wang et al., 2018,Rahnama Rad and Maleki, 2019]

to approximate

and using the Woodburry lemma, the following scalable approximate LO (ALO) formula was obtained:

where

This result was extended to nonsmooth regularizers. For example, [Wang et al., 2018, Rahnama Rad and Maleki, 2019,Stephenson and Broderick, 2019] showed that for formula is a valid approximation of LO if the following H matrix is used:

where S is the active set of is the matrix X restricted to columns indexed by S. With minor assumptions about the data generating process and without any sparsity assumption on the vector of regression co-efficients, [Rahnama Rad and Maleki, 2019] (Theorem 3 and Corollary 1) proved that for various regularizers and regression methods the high dimensional setting where is constant while

Our finite sample bounds in the next section show that with similar (easy to check) regularity conditions, for various regularizers and regression methods, Err0 estimate go to zero as but is a fixed number. As a byproduct of this result, we show that in this high dimensional regime . We will more formally state these claims in Section 3.

2 Main results

2.1 Our assumptions

Our goal is to evaluate the accuracy of LO in estimating the out-of-sample prediction error in the high-dimensional regime. Our results are valid for finite values of n and p. Later, in order to make asymptotic conclusions, we suppose that is constant while

We now state our assumptions for theorem 1. For simplicity of exposition, we start by stating a strong version of our assumptions, which often requires uniform bounds. Weaker analogues are discussed in 2.3. As the assumptions may appear somewhat opaque and technical, we will discuss them in the context of usual assumptions and concrete examples of standard generalized linear models.

Assumption 1. The vectors are independent zero mean vectors with covariance such that for a nonnegative constant

Assumption 1 characterizes the different distributions obtained for each are scaled in a way that ensures = O(1) and Var() = (1), assuming that (for i = 1) is ). For instance, under the linear model , this scaling ensures that the signal-to-noise ratio in each observation remains fixed as n, p grow (when the noise variance is a non-zero constant). Unless we make explicit assumptions about the sparsity of , without the 1/p scaling, the Hessian of the optimization problem (1) is dominated by the data, making the regularizer, and in turn , irrelevant. In this paper, we make no sparsity assumption on the vector of regression coefficients. For similar finite signal-to-noise ratio scalings in the high-dimensional asymptotic analysis see [El Karoui, 2017, El Karoui et al., 2013,Bean et al., 2013,Donoho and Montanari, 2016,Donoho et al., 2011,Bayati and Montanari, 2012, Nevo and Ritov, 2016, Su et al., 2017, Dobriban and Wager, 2018,Rahnama Rad and Maleki, 2019,Xu et al., 2019]. Under this scaling, the optimal value of will be Mousavi et al., 2018].

Assumption 2. We assume the functions ) and ) are twice differentiable in z. We also assume that ) and ) are convex in z and , respectively. Let () be a sample from the unknown distribution q(y, x) independent of D = . We assume there exists constants , such that, for all i, j, uniformly:

where

Assumption 2 characterizes the smoothness of the GLM problem (and its associated leave-one-out versions). As we will show below there are many examples, such as logistic and robust regression, in which we can find and . However, in some other popular examples, such as linear or Poisson regression, is a random quantity and we cannot find an absolute constant to dominate it everywhere. As will be discussed later in Section 2.3, we can weaken Assumption 2 at the expense of a slightly stronger moment condition on the feature vector

Example 1. In the generalized linear model family, for the negative logistic regression log-likelihood

) = ) it is easy to show that ˙2 for any y and z, leading to

Example 2. Our next example is about a smooth approximation of the Huber loss used in robust estimation, known as the pseudo-Huber loss:

where 0 is a fixed number. If we use this loss for the linear regression problem, and set ) = ) = ). It is easy to show that ˙, leading to

Our next example is concerned with another popular loss function in linear regression, namely the absolute deviation. However, since we would like our loss functions to be differentiable, we use the following smooth approximation of the absolute deviation loss, , introduced in [Schmidt et al., 2007]:

ℓ

where 0 is fixed.

Example 3. For ) = ) = ), we have = 1. In fact, it is straightforward to show that ˙

define the two matrices

We assume that there exists a fixed number , such that

Assumption 3 characterizes the curvature of the GLM problem (and its associated leave-one-out versions). In some examples, such as the ones that have ridge or smoothed elastic-net as the regularizer, it is straightforward to confirm this assumption. For instance, for the ridge regularization, ) = 2, we have . In Section 2.3, we explain how this assumption can be relaxed (at the expense of requiring more stringent moment conditions on ) to cover more examples.

Having stated our assumptions, we now move on to stating our main result before proposing a number of examples to demonstrate how this result can be applied in common GLM cases.

2.2 Main theorem

Based on these assumptions we can now evaluate the accuracy of LO in estimating Err. The following theorem proves that the expected square error of LO in estimating Erris small even in high-dimensional asymptotic settings.

. If Assumptions 1, 2 and 3 hold, then

where the outer expectation is taken with respect to the data D and:

The proof can be found in Appendix C.

The only term that is not explicitly computed in terms of the constants in our assumptions is ]. Hence, to obtain an explicit quantitative bound for a specific GLM problem requires computing this quantity. We present two examples below.

Corollary 1. (Ridge regularized logistic regression) Consider the negative logistic regression log-likelihood , and the regularizer ) = 2, where . Furtherassume that is iid ), where . If ) = , then there exists a constant

where

The proof of this corollary can be found in Section D of the supplementary material.

Corollary 2. (Pseudo-Huber loss with strongly convex regularizer) We consider again the pseudo-Huber loss defined in (6) with parameter . As this loss is typically used in regression settings, we consider a linear regression model denotes i.i.d. zero- mean noise, and additionally assume that the regularizer is strongly convex with parameter Under these conditions, there exists a fixed number (depending on ) such that

The proof of this corollary can be found in Section E of the supplementary material.

To summarize, the examples presented in Corollary 1 and 2 satisfy the assumption needed for Theorem 1.

2.3 Extensions

As we discussed in Section 2.1, we can weaken the assumptions without a major change in our proofs or the main conclusions of our result. In this section, we aim to present one such weaker set of assumptions that enables our analyses to cover several other popular examples, such as the Poisson and linear regression.

We assume that are i.i.d. zero mean vectors with covariance such that for a non-negative constant . Furthermore, there exists a fixed number , such that

Note that this assumption is more stringent than Assumption 1. However, in essence the only extra requirement of this assumption is a bound on the fourth moments. Hence, it holds for a wide range of random features including sub-Gaussian and sub-exponential features. Thanks to this slightly stronger moment assumption we can weaken the other

assumptions.

We assume the functions are twice differentiable in z. Moreover, assume ) and ) are convex in z and , respectively. Let (be a sample from the unknown distribution q(y, x) independent of . We assume that there exist constants ˜and ˜, such that for all i, j, uniformly

where

Compared to Assumption 2 that requires to be bounded everywhere, this assumption requires the 8moment of to be bounded. This simple modification enables our theoretical results to be applied to a much broader set of regression techniques, including Poisson, linear, and negative binomial regression. These three examples will be studied later in this section.

Let and be as defined in Assumption 3. We assume that there exists a fixed number ˜, such that

Again, compared to Assumption 3, this assumption only bounds the moments of the minimum eigenvalue of the matrix. The following example shows an example in which it is impossible to find a positive lower bound for the minimum eigenvalue, but still the moments of the inverse of the minimum eigenvalue are bounded.

Example 1. Suppose that 1 and that the loss function is strongly convex with parameter c, and the regularizer is convex. Finally, suppose that . Then, there exists a fixed number ˜that satisfies Assumption 3for large enough values of n and p.

The proof can be found in Section F of the supplementary material.

As we discussed before one can prove the accuracy of LO under Assumptions 1, 2, and 3. The following theorem formalizes this claim.

Theorem 2. Let . If Assumptions 1, 2and 3

where the outer expectation is taken with respect to the data D and:

The proof can be found in Section G of the supplementary material. As we described before, this theorem can cover several generalized linear models, that could not be covered by Theorem 1. We mention three important examples below.

Corollary 3. (Square loss with elastic-net penalty) Consider the data generating mechanism where ), ), and . Suppose that we use the smoothed elastic-net optimization

where for 0, ) = + (1 ), and

is a smooth approximation of the -norm. Then, there exists a fixed number, ˜, such that

Since the proof of this claim is long, we defer it to Section H of the supplementary material.

Corollary 4. [Poisson regression with soft-rectifying link] Consider the data-generating mechanism Poisson()), where f(z) = log(1 + ) denotes the soft-rectifying link, b. Finally, assume that r denotes the smoothed elastic-net regularizer introduced in Corollary 3. Under these assumptions, there exists a fixed number, ˜, such that:

The proof can be found in Section I of the supplementary file.

Remark 1. We have assumed here that is multivariate Gaussian. As might be clear to the reader from the proof, this normality assumption on x may be relaxed to an 8moment assumption at the cost of a slightly more complicated proof.

Corollary 5 (Negative-Binomial Regression). We consider the problem of negative binomial regression with fixed shape parameter and exponential link. Here, the negative log-likelihood is given by:

where ) denotes a constant which only depends on and y. Assume the data generating process is such that similar to Corollary, 3 we use the smoothed elastic-net as the regularizer. Under these assumptions, there exists a fixed number, ˜, such that

The proof can be found in Section J of the supplementary material.

3 Connection of ALO and Errout

We mentioned in Section 1.2 that different approximations of LO have been proposed in the literature to reduce the computational complexity of LO. Among such approximations, the ALO formula introduced in (5), is analyzed in [Rahnama Rad and Maleki, 2019] under a similar asymptotic framework as the one discussed in our paper:

Theorem 3. [Rahnama Rad and Maleki, 2019] Suppose that is constant while . Under the assumption ), for the regression problems discussed in Corollaries 1, 2, 3, and 4 we have

Note that the ultimate goal of ALO is to use it as an estimate of Err. Hence, while Theorem 3 confirms the accuracy of ALO in approximating LO it does not explain whether the estimates obtained by ALO or LO can be trusted in high-dimensional settings. However, we can combine this result with Theorems 1 and 2 to prove the accuracy of ALO in estimating Err. Toward this goal we first prove the following claim.

Theorem 4. Suppose that is constant while . Under the assumption ), for the regression problems discussed in Corollaries 1, 2, 3, and 4 we have

Proof. For a fixed number M

The first inequality in the above equations is due to Markov inequality, and the second inequality is a result of Theorems 1 and 2. As we discussed in Corollaries 1, 2, 3, and 4 either are finite numbers. Hence, as M increases, the final probability can be reduced to the desired level.

Before we proceed to establish the accuracy of ALO we have to clarify Theorem 4. Note that even under the idealized (but incorrect) assumption that the individual estimates ) are independent and s are the same as , the central limit theorem indicates that Hence, we should not expect the error of LO to be ). Therefore, the above theorem seems to offer the sharpest result that is possible for LO. Note that the sharpness is with regard to the rate of convergence and not the constants.

Combining the results of Theorem 2 and Theorem 4 we can finally quantify the accuracy of ALO in estimating Err

Corollary 6. Suppose that is constant while . Under the assumption ), for the regression problems discussed in Corollaries 1, 2, 3, and 4 we have

The proof of this corollary is straightforward, and is hence skipped. Note that this corollary finally establishes the fact that ALO obtains accurate estimates of Err. While we have established this result for only four popular examples in this paper, Theorems 1, 2 and Theorem 3 of [Rahnama Rad and Maleki, 2019] can be applied to a much broader class of regression problems. Hence, a similar result is expected for such scenarios as well. Finally, we should emphasize that by comparing Theorems 4 and 6 one may notice that the accuracy of ALO might be worse than LO by a logarithmic factor. At this stage, it is not clear whether this difference is an artifact of the proof of [Rahnama Rad and Maleki,

Table 1: Square loss with elastic-net penalty: MSE(and standard errors).

2019] or it is a real extra error that has been introduced by the approximation of LO.

4 Numerical Experiments

In this section, we present two numerical experiments to show that the ) bound given in Theorem 1 and 2 is sharp but not tight. Specifically, we generate synthetic data, and compare Errfor elastic-net linear regression and ridge logistic regression. In all the examples in this section, the rows of X are N(). Here we let and ) = ). The codes for the Figure 1 and Table 1,2 are available at https://github.com/RahnamaRad/LO.

Square loss with elastic-net penalty. We set

and = 0.5. The true unknown parameter vector is sparse with k = 0.1n non-zero elements independently drawn from a zero mean unit variance Laplace distribution, leading to Var() = 0.1 (regardless of the values of n and p). To generate data, we sample N(). Here the out-of-sample error is:

As we increase n and p, we keep the ratio constant. We numerically calculate MSELO)as a function of n (and p = 10n) based on 100 synthetic data samples, for each n, p and = 5. We fitted a line to model log(MSE) ) and obtained a slope of -1.03 (SE= 0.04) and intercept of -0.46 (SE= 0.54) with an Adjusted R-squared of 0.95. The slope of -1.03 (SE = 0.04) shows that the bound is sharp because it confirms the 1/n scaling of our theory. Table 1 shows the numerical MSE as a function of n and p.

Logistic regression with ridge penalty. We set ) = + log(1 + ) (the negative logistic log-likelihood) and . To generate data, we sample

Table 2: Logistic regression with ridge penalty: MSEE(Err(and standard errors) and the upper bound based on 8 in Corollary 1 of Theorem 1.

out-of-sample error

As we increase n and p, we keep the ratio n/p = 1 constant. We numerically calculate MSEas a function of n (and p = n) based on 100 synthetic data samples, for each n, p and = 0.1. We fitted a line to model log(MSE) ) and obtained a slope of -1.00 (SE= 0.04) and intercept of 0.34 (SE= 0.27) with an Adjusted R-squared of 0.99. The slope of -1.00 shows that the bound is sharp because it confirms the 1/n scaling of our theory. Table 2 compares the numerical MSE and the theoretical bound from Theorem 1 and Corollary 1. The theoretical upper bound was computed using 8 in Corollary 1 where in this example, we have = 0= 1, and = 1, leading to = 6311.52. The significant difference between the bound and the MSE shows that the bound is not tight.

5 Conclusion

Leave-one-out estimators (and their approximate versions) have seen renewed interest recently in the context of big data and high-dimensional problems. We show that, in general, leave-one-out risk estimators have desirable statistical behaviours in the high-dimensional setting. Although the leave-out-risk estimator itself is generally computationally intractable, this result also implies consistency for a (growing) number of approximate leave-one-out estimators, and demonstrate that such estimators offer a potentially good direction for building risk estimators for high-dimensional problems.

Acknowledgement

K.R. was supported by the NSF DMS grant 1810888, and Eugene M. Lang Junior Faculty Research Fellow- ship. A.M. was supported by the NSF DMS grant 1810888.

References

[Allen, 1974] Allen, D. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125–127.

[Bayati and Montanari, 2012] Bayati, M. and Montanari, A. (2012). The lasso risk for gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997–2017.

[Bean et al., 2013] Bean, D., Bickel, P. J., El Karoui, N., and Yu, B. (2013). Optimal m-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36):14563–14568.

[Beirami et al., 2017] Beirami, A., Razaviyayn, M., Shahrampour, S., and Tarokh, V. (2017). On optimal generalizability in parametric learning. NIPS, pages 3455–3465.

[Burman, 1990] Burman, P. (1990). Estimation of generalized additive models. Journal of Multivariate Analysis, 32:230–255.

[Cawley and Talbot, 2008] Cawley, G. and Talbot, N. (2008). Efficient approximate leave-one-out cross-validation for kernel logistic regression. Machine Learning, 71:243–264.

[Cessie and Houwelingen, 1992] Cessie, S. and Houwelingen, J. (1992). Ridge estimators in logistic regression. Applied Statistics, 41(1):191–201.

[Craven and Wahba, 1979] Craven, P. and Wahba, G. (1979). Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–403.

[Dobriban and Wager, 2018] Dobriban, E. and Wager, S. (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. Ann. Stat., 46(1):247–279.

[Donoho et al., 2011] Donoho, D., Maleki, A., and Montanari, A. (2011). Noise sensitivity phase transition. IEEE Trans. Inform. Theory, 57(10).

[Donoho and Montanari, 2016] Donoho, D. and Montanari, A. (2016). High dimensional robust m-estimation: Asymptotic variance via approximate message passing. Probab Theory Relat Fields, 166(3-4):935–969.

[El Karoui, 2017] El Karoui, N. (2017). On the impact of predictor geometry on the performance on

high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory Related Fields, 170(1-2):95–175.

[El Karoui et al., 2013] El Karoui, N., Bean, D., Bickel, P., Lim, C., and Yu, B. (2013). On robust regression with high-dimensional predictors. PNAS, 110(36):14557–14562.

[Giordano et al., 2019] Giordano, R., Stephenson, W., Liu, R., Jordan, M., and Broderick, T. (2019). A swiss army infinitesimal jackknife. JMLR, 89(1139-1147).

[Golub et al., 1979] Golub, G., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223.

[Meijer and Goeman, 2013] Meijer, R. and Goeman, J. (2013). Efficient approximate k-fold and leave-one-out cross-validation for ridge regression. Biometrical Journal, 55(2):141–155.

[Mousavi et al., 2018] Mousavi, A., Maleki, A., and Baraniuk, R. (2018). Consistent parameter estimation for lasso and approximate message passing. The Annals of Statistics, 46(1):119–148.

[Nevo and Ritov, 2016] Nevo, D. and Ritov, Y. (2016). On Bayesian robust regression with diverging number of predictors. Electron. J. Statist., 10(2):3045–3062.

[Obuchi and Kabashima, 2016] Obuchi, T. and Kabashima, Y. (2016). Cross validation in lasso and its acceleration. J. Stat. Mech. Theor. Exp., 53(304):1–36.

[Obuchi and Kabashima, 2018] Obuchi, T. and Kabashima, Y. (2018). Accelerating CrossValidation in Multinomial Logistic Regression with ell1-Regularization. Journal of Machine Learning Research, 19:1–30.

[Opper and Winther, 2000] Opper, M. and Winther, O. (2000). Gaussian processes and SVM: Mean field results and leave-one-out. In Smola, A., Bartlett, P., Scholkopf, B., and Schuurmans, D., editors, Advances Large Margin Classifiers, pages 43–56. MIT Press, Cambridge, MA.

[O’Sullivan et al., 1986] O’Sullivan, F., Yandell, B., and Raynor, W. (1986). Automatic smoothing of regression functions in generalized linear models. JASA, 81(393):96–103.

[Rahnama Rad and Maleki, 2019] Rahnama Rad, K. and Maleki, A. (2019). A scalable estimate of the extra-sample prediction error via approximate leave-one-out. arXiv:1801.10243v3.

[Schmidt et al., 2007] Schmidt, M., Fung, G., and Rosales, R. (2007). Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, pages 286–297. Springer.

[Stephenson and Broderick, 2019] Stephenson, W. and Broderick, T. (2019). Sparse Approximate Cross-Validation for High-Dimensional GLMs. arXiv preprint arXiv:1905.13657.

[Stone, 1974] Stone, M. (1974). Cross-validatory choice and assesment of statistical predictions. J R Stat Soc Series B, 36(2):111–147.

[Su et al., 2017] Su, W., Bogdan, M., and Candes, E. (2017). False discoveries occur early on the Lasso path. Ann. Stat., 45(5):2133–2150.

[Takahashi and Kabashima, 2018] Takahashi, T. and Kabashima, Y. (2018). A statistical mechanics approach to de-biasing and uncertainty estimation in lasso for random measurements. Journal of Statistical Mechanics: Theory and Experiment, 7(073405).

[Vehtari et al., 2016] Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T., and Winther, O. (2016). Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. Journal of Machine Learning Research, 17(1):3581–3618.

[Wang et al., 2018] Wang, S., Zhou, W., Maleki, A., Lu, H., and Mirrokni, V. (2018). Approximate Leave-One-Out for High-Dimensional Non-Differentiable Learning Problems. International Conference on Machine Learning.

[Xu et al., 2019] Xu, J., Maleki, A., Rahnama Rad, K., and Hsu, D. (2019). Consistent risk estimation in high-dimensional linear regression. arXiv:1902.01753.

A Notation

B Background material on Gaussian random variables, vectors and matrices

In this section, we review a few important results regarding the functions of Gaussian matrices and Gaussian vectors that are used in our examples. The first result is about the moments of the inverse of the minimum eigenvalue of a Wishart matrix.

Lemma 1. (Lemma 19 of [Xu et al., 2019]) Let , and suppose that . Then, for a fixed

Our next two lemmas are concerned with the moments of a Gaussian and random variables:

. Then, we have

where the notation p!! denotes the double factorial. Furthermore, when p is even the above inequality is in fact an equality.

The proof of this claim is straightforward and can be found in many standard statistics text books.

C Proof of Theorem 1

Define

Then,

The proof concludes upon noting that Lemma 4 and 5 yield

Lemma 4. Under the assumptions of Theorem 1 we have that:

Proof of Lemma 4.

Note that

Next we study

where () is independent of D, leading to

Define the quantities

Likewise,

Likewise,

To conclude, note that:

where the last inequality is due to Lemma 6. Using the fact that get

Lemma 5. Under the assumptions of Theorem 1, we have:

Proof of Lemma 5. Note that we have for all i:

Therefore, using the mean-value Theorem, for some 1], we get

where the last inequality is due to Lemma 6. Using the fact that get

Lemma 6. If both the loss function and the regularizer are twice differentiable, then for all i = 1, . . . , n:

Proof of Lemma 6. The leave-one-out estimate, = , satisfies ) = 0. The multivariate mean-value Theorem yields

where the Jacobean is

Moreover,

We get

leading to

and

Likewise,

D Proof of Corollary 1

We would like to use Theorem 1 to prove this corollary. Toward this goal, we first have to prove that Assumptions 1, 2, and 3 hold, and that Var[] is bounded. Given the fact that ), Assumption 1 holds. As we discussed in Example 1, Assumption 2 holds as well with . Finally, given that the regularizer is ridge Assumption 3 holds too. Hence, the only remaining step is to check the boundedness of Var[In the rest of the proof we aim to prove that

Note that

where to obtain the last inequality we have used log(1 + . Furthermore, since (1 + have

Comparing with the zero estimator yields, (1 + ) + 2. Since log(1 + 0 for any z, we get 2. Therefore, we can say that for ridge regularized logistic regression

To summarize, using the bound above and Theorem 1 (with = ( ) , for ridge regularized logistic regression, we conclude that

where the last equation is due to = 2 (shown in Example 1) and

E Proof of Corollary 2

We would like to use Theorem 1 to prove this corollary. Toward this goal we have to confirm Assumptions 1, 2, and 3 and prove the boundedness of . Assumption 1 is already assumed in the corollary. Assumption 2 is also confirmed in Example 2 so that . Since the regularizer is assumed to be strongly convex, Assumption 3 is also automatically satisfied with . Hence, the only remaining step is to obtain an upper bound for . In the rest of the proof we prove that

We have

Furthermore, we have that + Var[. Additionally, note that using the strong convexity of the regularizer, and by comparing the value of ) + ) at and 0, we have that Therefore,

We may bound this quantity explicitly in terms of the covariance of x:

Hence,

To summarize, using the bound above and Theorem 1 (with ), we conclude that

where the last equation is due to (shown in Example 2) and

F Proof of Example 1

It is straightforward to check that, for any i, we have:

This implies that:

Define the vectors . Hence, ). Furthermore, define the matrix Z as the matrix that has as its rows. It is straightforward to check that:

The fact that the quantity

is proved in [Xu et al., 2019]. See Lemma 1 in the supplementary material.

G Proof of Theorem 2

The proof of this result is very similar to the proof of Theorem 1. Hence, instead of rewriting the proof, we only emphasize on the differences between the proofs of Theorems 1 and 2. The strategy of the proof is exactly the same. We break the error between LO and Errand try to bound the second moments of these quantities. The following lemma obtains an upper bound for the second moment of

Lemma 7. Under the assumptions of Theorem 2 we have

Proof. The proof of this lemma is similar to the proof of Lemma 4. All the steps are exactly the same up to the

point that is proved:

However, the way we would like to bound here is slightly different from the approach used in the proof of Lemma 4. According to Lemma 6 we have:

Hence, by using the Cauchy-Schwarz inequality twice we obtain:

Using the fact that

The second Lemma aims to obtain an upper bound for the second moment of . This corresponds to Lemma 5 in the proof of Theorem 1. Lemma 8. Under the assumptions of Theorem 2, we have

Proof. Again the proof follows very similar to the steps as the proof of Lemma 5. In fact, we follow exactly the same steps until it is proved that

from which we deduce:

H Proof of Corollary 3

As is clear, we would like to use Theorem 2 to prove our claim. Toward this goal, we have to prove that Assumptions 1, 2, and 3hold. Furthermore, we have to obtain an upper bound for the constant ˜, which in turn requires us to bound ]. Given that is Gaussian, Assumption 1is automatically satisfied. Furthermore, since the regularizer is elastic-net, it is straightforward to prove Assumption 3. To see this, first note that, for all i, j, we have almost surely:

where ). Hence, it is straightforward to see that

Hence, the only remaining steps are to prove Assumption 2and bound the term that

Hence,

Hence, if we prove Assumption 2, we have also proved that

In the rest of this section, we focus on the proof of Assumption 2. Note that ˙. Under these assumptions, we prove that there exists a fixed number ˜

Consider the following definitions:

It is then straightforward to prove that

Our goal is to show that all the “finite” moments of the elements of , including the 8moment required in our example, are O(1). From (18) we have

Hence, we bound each of the above three terms separately in the following lemmas:

Lemma 9. Under the assumptions of Example 3 we have

Proof. First note that

Note that conditioned on the distribution of is a zero mean Gaussian random variable with variance . Hence, (21) and the moments of a Gaussian random variable (see Lemma 2) lead to

Hence, by the law of iterated expectation, we obtain

Lemma 10. Under the assumptions of Example 3, if

Proof. Note that conditioned on X, the distribution of is multivariate Gaussian with mean zero and covariance matrix

where the first inequality is due to Lemma 2. Hence, again by the law of iterated expectation, we have

Lemma 11. Under the assumptions of Example 3 we have

The first order optimality condition yields

Since the minimum eigenvalue of the Hessian of , therefore the minimum eigenvalue of (for all ) is greater than 2, leading to

This together with ¨

Furthermore, we have

Note that for two random variables a and b we have

Hence,

First note that, since the maximum eigenvalue of

Hence,

Furthermore, we have

2. Note that ). Furthermore, . Hence, using the the moments of Gaussian (see Lemma 2), we have

3. Given , the distribution of is ). Furthermore, , where the last inequality is due to (26). Hence, we have

Finally, we compute an upper bound on . Since is independent of and , we conclude that given ) is a Gaussian random variable with mean zero and variance

where , and the second inequality is due to (27). Hence,

I Proof of Corollary 4

The goal of this section is to use Theorem 2 to prove corollary 4. Hence, we have to confirm that Assumptions 1, 2, and 3hold, and that ] is bounded. Similar to what we did at the beginning of Section H, it is straightforward to check the validity of Assumptions 1. Hence, we only focus on proving Assumption 2and finding an upper bound for

Regarding Assumption 2, we first prove that under the assumptions of this corollary, there exists a fixed number ˜, such that

In particular, we have that:

To obtain equality (a) we have used the moment generating function of the Poisson distribution with )). To obtain inequality (b) we have used the moment generating function of a Gaussian distribution and the fact that . Given that the upper bound we derived in (36) for the derivative of the loss function does not depend on the second input argument of the loss, that is , the proof that Poisson loss satisfies the other conditions of Assumption 2) will be exactly similar and hence is skipped. In particular, we have verified the conditions of Assumption 2for any convex regularizer.

Now we turn our attention to bounding ]. First note that

Furthermore, from the mean value theorem we have:

Hence, we have:

To complete the proof we have to show that both ) and E(1 + )(are bounded. First note that, using ) and, for any

Hence,

The following facts will help us bound these terms:

On the other hand, it is straightforward to check that for any

By combining these equations we obtain:

Note that is a Gaussian random variable with mean zero and variance (is bounded by a constant.

For the second term in (38) we have

are bounded for 2 and 4. As previously, we note that is a Gaussian random variable with variance , and hence (is bounded. Hence, the only remaining step is to prove the boundedness of is at most 2. Note that conditioned on the random variable is Gaussian with the variance that is bounded by . Hence, using Lemma 2 we have

The -strong convexity of the smoothed elastic-net regularizer r, and the fact that 0, leads to

Hence, we have to prove that are bounded. First note we proved in (43) that:

Note that is a Gaussian random variable with variance , and hence is bounded. On the other hand,

It is straightforward to prove that ˙

where to obtain the last inequality we used the mean value theorem

where ˜), and the facts that ˙) we obtain:

J Proof of Corollary 5

Similar to the proofs of Corollaries 4, 3, we would like to use Theorem 2 to prove our claim. Toward this goal, We have to prove that Assumptions 1hold. Furthermore, we have to obtain an upper bound for the constant ˜, which in turn requires us to bound ]. Again, the proofs of Assumptions 1and 3are exactly the same as we presented in the last two sections. Hence, we only focus on Assumption 2and ]. We would like to prove that the conditions of Assumption 2are satisfied with ˜

It we compute the derivative of the log-likelihood, we will obtain

As the bound (50) is free of z, the same argument above applies to the other requirements in Assumption 2

Now we turn our attention to the calculation of ]. Note that

Note that by removing the constant from the log-likelihood we obtain

The rest of the proof is very similar to the proof that we presented for Corollary 4. Hence, we skip it.