Generalization error minimization: a new approach to model evaluation and selection with an application to penalized regression

2016·arXiv

Abstract

1 Introduction

Traditionally in econometrics, a statistical inference about a population is formed using a model imposed and estimated on a sample. Put another way, the goal of inference is to see whether a model estimated on a sample may be generalized to the population. Deciding whether the estimated model is valid for generalization is referred to as the process of model evaluation. The performance of a model is usually evaluated using the sample data at hand, referred to as the ‘internal validity’ of the model. However, internal validity may not be a useful indicator of model performance in many different scenarios. A perspective of increasing interest in applied econometrics considers the performance of a model on out-of-sample data. In this paper, we focus on the ability of a model estimated from a given sample to fit new samples, referred to as the generalization ability (GA) of the model.

Generalization ability is one aspect of external validity, the extent to which the results from a

study generalize to other settings. Researchers evaluating pilot programs and randomized trials in a variety of settings from labour economics to development economics are increasingly focused on the external validity of their findings.1 For instance, Heckman and Vytlacil (2007), Ludwig et al. (2011) and Allcott and Mullainathan (2012) discuss the importance of externally valid results in policy and program evaluation. In a similar vein, Guala and Mittone (2005) and List (2011) emphasize the importance of external validity in determining whether experiments can offer a robust explanation of causes and effects in the field. While external validity encompasses many issues from experimental design, sample selection, and economic theory, it clearly also raises econometric issues around model estimation and evaluation. In order to explore the properties of estimators from the perspective of GA, three questions need to be addressed. How do we measure GA? Given a useful measure, what are its properties? Given its properties, how can GA be exploited for estimation? These questions have received only tangential interest in the literature. Roe and Just (2009) discuss the trade-off between internal and external validity in empirical data, pointing out that while internal validity (or in-sample fit) is well studied, there is much less research on how to control external validity (out-of-sample fit). Some work on external validity has focussed on the properties of specific estimation methods. Angrist and Fernandez-Val (2010), for example, studies the external validity of instrument variable estimation in a labor market setting. In this paper, we answer these questions, starting by proposing a measure of GA that is straightforward to implement empirically.

With a new sample at hand, GA is easily measured by using validation or cross-validation to quantify the goodness-of-fit of the estimated model on new (out-of-sample) data. Without a

new sample, however, it can be difficult to measure GA ex ante. In this paper, when only a single sample is available, we quantify the GA of the in-sample estimates by deriving upper bounds on the empirical out-of-sample errors, which we call the empirical generalization errors (eGE). The upper bounds reveal the properties of the eGE as well as insight into other important statistical properties, such as the trade-off between in-sample and out-of-sample fit in estimation. Furthermore, the upper bounds can be extended to analyze the performance of validation, K-fold cross-validation and specific estimators such as penalized regression. Thus, the GA approach yields insight into the finite-sample and asymptotic properties of model estimation and evaluation from several different perspectives.

Essentially, GA measures the performance of a model from an external point-of-view: the greater the GA of an estimator, the better its predictions on out-of-sample data. Furthermore, we show that GA also serves as a natural criterion for model selection, throwing new light on the model selection process. We propose the criterion of minimizing eGE (maximizing GA), or

generalization error minimization (GEM), as a framework for model selection. Using penalized regression as an example of a specific model selection scenario, we show how the traditional bias-variance trade-off is connected to GEM and to the trade-off between in-sample and out-of-sample fit. Moreover, the GEM framework allows us to establish additional properties for penalized regression implicit in the bias-variance trade-off.

1.1 Approaches to model selection

Given the increasing prevalence of high-dimensional data in economics, model selection is coming to the forefront in empirical work. Researchers often desire a smaller set of predictors in order to gain insight into the most relevant relationships between outcomes and covariates. Without explicitly introducing the concept of GA, the classical approach to model selection focusses on the bias-variance trade-off, yielding methods such as the information criteria (IC), cross-validation,

and penalized regression. Consider the standard linear regression model

where Y is a vector of outcome variables, X is a matrix of covariates and u vector of i.i.d. random errors. The parameter vector may be sparse in the sense that many of its elements are zero. Model selection typically involves using a score or penalty function that depends on the data (Heckerman et al., 1995), such as the Akaike information criterion (Akaike, 1973), Bayesian information criterion (Schwarz, 1978), cross-validation errors (Stone, 1974, 1977) or the mutual information score among variables (Friedman et al., 1997, 2004). An alternative approach to model selection is penalized regression, implemented through the

objective function:

where is the -norm and is a penalty parameter. Note in eq. (1) that if , the OLS estimator is obtained. The IC can be viewed as special cases with and . The lasso (Tibshirani, 1996) corresponds to the case with (an penalty). When (an penalty), we have the ridge estimator (Hoerl and Kennard, 1970). For any , we have the bridge estimator (Frank and Friedman, 1993), proposed as a generalization of the ridge.

One way to derive the penalized regression estimates bis using the cross-validation approach, summarized in Algorithm 1. As shown in Algorithm 1, cross-validation solves the constrained

minimization problem in eq. (1) for each value of the penalty parameter to derive a bfeasible range of is exhausted, the estimate that produces the smallest out-of-sample error among all the estimated is chosen to be the penalized regression estimate, b

1. Set the penalty parameter 2. Partition the sample into a training set T and a test set S. Standardize all variables (to

ensure the penalized regression residual e satisfies E(e) = 0 in T and S). 3. Compute the penalized regression estimate bon T. Use bto calculate the prediction

error on S. 4. Increase the penalty parameter by a preset step size. Repeat 2 and 3 until b5. Select bthat minimizes the prediction error on S.

A range of consistency properties have been established for the IC and penalized regression.

Shao (1997) proves that various IC and cross-validation are consistent in model selection. Breiman (1995); Chickering et al. (2004) show that the IC have drawbacks: they tend to select more variables than necessary and are sensitive to small changes in the data. Zhang and Huang (2008); Knight and Fu (2000); Meinshausen and Bühlmann (2006); Zhao and Yu (2006) show that -penalized regression is consistent in different settings. Huang et al. (2008); Hoerl and Kennard (1970) show the consistency of penalized regression with . Zou (2006); Caner (2009); Friedman et al. (2010) propose variants of penalized regression in different scenarios and Fu (1998) compares different penalized regressions using a simulation study. Alternative approaches to model selection, such as combinatorial search algorithms may be computationally challenging to implement, especially with

high-dimensional data.2

1.2 Major results and contribution

A central idea in this paper is that model evaluation and model selection may be re-framed from the perspective of GA. If the objective is to improve the GA of a model, model selection is necessary. Conversely, GA provides a new and elegant angle to understand model selection. By introducing generalization errors as the measure of GA, we connect GEM to the traditional bias-variance trade-off, the trade-off between in-sample and out-of-sample fit and the properties of validation and cross-validation. By the same token, the concept of GA may be used to derive additional theoretical properties of model selection. Specifically for the case of regression analysis, we use GEM to unify

the properties for the class of penalized regressions with , and show that the finite-sample and asymptotic properties of penalized regression are closely related to GA. The first contribution of this paper is to quantify the GA of a model, ex ante, by deriving an

upper bound for the GE. The upper bounds depend on the sample size, an index of the complexity

of models, a loss function, and the distribution of the underlying population. The upper bound also characterizes the trade-off between in-sample fit and out-of-sample fit. As shown in Vapnik and Chervonenkis (1971a,b); McDonald et al. (2011); Smale and Zhou (2009); Hu and Zhou (2009), the inequalities underlying conventional analysis of GA focus on the relation between the population error and the empirical in-sample error. Conventional methods to improve GA involve computing discrete measures of model complexity, such as the VC dimension, Radamacher dimension or Gaussian complexity, which typically are hard to compute. In contrast, for any out-of-sample data, we quantify bounds for the prediction error of the extremum estimate learned from in-sample data. Furthermore, we show that empirical GA analysis is straightforward to implement via validation or cross-validation and possesses desirable finite-sample and asymptotic properties for model selection. A second contribution of the paper is to propose GEM as a general framework for model selection and estimation. By re-framing the bias-variance trade-off from the perspective of in-sample and out-of-sample fit, GA is connected to the traditional bias-variance trade-off and model selection. To be specific: model selection is necessary to improve the GA of an estimated model while GA naturally serves as an elegant way to understand the properties of model selection. Thus, a model estimated by GEM will not only achieve the highest GA—and thus some degree of external validity—but also possess a number of nice theoretical properties. GEM is also naturally connected to the properties of cross-validation. As argued by Varian (2014), cross-validation could be adopted more often in empirical analysis in economics, especially as big data sets are becoming increasingly available in many fields.

A third contribution of the paper is to use GA analysis to unify a class of penalized regression estimators and derive their finite-sample and asymptotic properties (including -consistency) under the same set of assumptions. Various properties of penalized regression estimators have

previously been established, such as probabilistic consistency or the oracle property (Knight and Fu, 2000; Zhao and Yu, 2006; Candes and Tao, 2007; Meinshausen and Yu, 2009; Bickel et al., 2009). GA analysis reveals that similar properties can be established more generally and for a wider class of penalized regression estimators. We also show that the -difference between the OLS estimate and any penalized regression estimate depends on their respective GAs.

Lastly, a fourth contribution of the paper is to show that GA analysis may be used to tune the hyper-parameter for validation (i.e., the ratio of training sample size to test sample size) or cross-validation (i.e., the number of folds K). Existing research has studied cross-validation for the estimation of specific parametric and nonparametric models (Hall and Marron, 1991; Hall et al., 2004; Stone, 1974, 1977). In contrast, by adapting the classical error bound inequalities that follow from our analysis of GA, we derive the optimal tuning parameters for validation and

cross-validation in a model-free setting. We also show how K affects the bias-variance trade-off for cross-validation: a higher K increases the variance and lowers the bias.

The paper is organized as follows. In Section 2, we propose empirical generalization errors as the measure of GA and the criterion to evaluate the external performance of the estimated model. Also, by deriving the upper bounds for the empirical generalization errors in different scenarios, we reveal important features of the extremum estimator such as the trade-off between in-sample and out-of-sample fit. Our approach also offers a framework to study the properties of validation

and cross-validation. In Section 3, we apply the GEM framework as a model selection criterion for penalized regression. We also demonstrate a number of new properties for penalized regression that flow from the GEM framework. We prove the -consistency of penalized regression estimators for both p and p > n cases. Further, we establish the finite-sample upper bound for the -difference between penalized and unpenalized estimators based on their respective GAs. In Section 4, we use simulations to demonstrate the ability of penalized regression to control for overfitting. We also propose a measure of overfitting based on the empirical generalization errors, called the generalized R2 or GR2. Section 5 concludes with a brief discussion of our results. Proofs (showing detailed steps) are contained in Appendix A and summary plots of the simulations are in Appendix B.

2 Generalization ability and the upper bound for the ﬁnite-sample generalization error

In this section, we propose the eGE, the measurement of GA for a model, as a new angle to evaluate the ‘goodness’ of a model. By deriving the upper bounds of eGE of the extremum estimator, we show that the upper bounds of eGE directly quantifies how much the model overfits or underfits the finite samples, also the upper bounds of eGE may be used to study the properties of eGE for a model. Furthermore, these upper bounds can be used to study the properties of cross validation and validation on finite samples. All those result implies that eGE is a convenient and useful criterion for model evaluation.

2.1 Generalization ability, generalization error and overfitting

In econometrics, choosing the best approximation to data often involves measuring a loss function, Q, defined as a functional that depends on some estimate b and the sample points .

The population error (or risk) functional is defined as

where F(y,x) is the joint distribution of y and x. Without knowing the distribution F(y,x) a priori,

given a random sample (Y,X), we define the empirical error functional as follows

For example, in the regression case, b is the estimated parameter vector and

n When estimation involves minimizing the in-sample empirical error, we have the extremum

estimator (Amemiya, 1985). In many settings, however, minimizing the in-sample empirical error does not guarantee a reliable model. In regression, for example, often the R2 is used to measure goodness-of-fit for the in-sample data. However, an estimate with a high in-sample R2 may fit out-of-sample data poorly, a feature commonly referred to as overfitting: the in-sample estimate

is too tailored for the sample data, compromising its out-of-sample performance.3 As a result, in-sample fit (internal validity) of the model may not be a reliable indicator of the general applicability (external validity) of the model.

Thus, Vapnik and Chervonenkis (1971a) refer to the generalization ability (GA) of a model: a measure of how an extremum estimator performs on out-of-sample data. GA can be measured several different ways. In the case where Y and X are directly observed, GA is a function of the

difference between the actual and predicted Y for out-of-sample data. In this paper, GA is measured by the out-of-sample empirical error functional.

Definition 1 (Training set, test set, empirical training error and empirical generalization error).

1. Let denote a sample point from F(y,x), the joint distribution of (y,x). Let denote the space of all models. The loss function for b is Q. The population error functional for b is . The empirical error functional is

2. Given a sample (Y,X), the training set refers to data used to estimate b (i.e., the in-sample data) and the test set refers to data not used to estimate b (i.e.,

the out-of-sample data). Let effective sample size for the training set, test set and the total sample, respectively, is nt

3. Let btrain denote an extremum estimator, where btrain minimizes . Under

4. For K-fold cross-validation, denote the training set and test set in the qth round, respectively,

as . In each round, the sample size for the training set is nt and the sample size for the test set is ns = n/K. Thus, is the eGE and is the eTE, respectively, in the qth round of cross-validation.

Two methods are typically used to compute the eGE of an estimate: validation and cross-

validation. Under the validation approach, the sample is randomly divided into a training set and a test set. Following estimation on the training set, the fitted model is applied to the test set to compute the validated eGE. K-fold cross-validation may be thought of as ‘multiple-round validation’. Under the cross-validation approach, the sample is randomly divided into K subsamples or folds.4 One fold is chosen to be the test set and the remaining K folds comprise the training set. Following estimation on the training set, the fitted model is applied to the test set to compute the eGE. The process is repeated K times, with each of the K folds in turn taking the role of the test set while the remaining K folds are used as the training set. In this way, K different estimates of the eGE for the fitted model are obtained. The average of the K eGEs yields the cross-validated eGE.

Cross-validation uses each data point in both the training and test sets. The method also reduces

resampling error by running the validation K times over different training and test sets. Intuitively this suggests that cross-validation is more robust to resampling error and on average will perform at least as well as validation. In Section 3, we study the generalization ability of penalized extremum estimators in both the validation and cross-validation cases.

To study the properties of eGE for a model, three assumptions are required for the analysis in

this section of the paper, stated as follows.

Assumptions

A1. In the probability space , we assume F-measurability of the loss function Q(b|y,x), the population error R(b|Y,X) and the empirical error , for any b and any

A2. The sample is randomly drawn from the population. In cases with multiple random samples, both the training set and the test set are randomly sampled from the population. In cases with a single random sample, both the training set and the test set are

A3. For any sample, the extremum estimator btrain exists. The in-sample error for btrain converges in probability to the minimal population error as n

A few comments are in order for assumptions A1–A3. The loss distribution assumption A1 is

made to simplify the analysis. The existence and convergence assumption A3 is standard (see, for example, Newey and McFadden (1994)). The independence assumption A2 is not essential; GA analysis is valid for both i.i.d. and

non-i.i.d. data. While the original research on GA in Vapnik and Chervonenkis (1974a,b) imposes the i.i.d. restriction, subsequent work has generalized their results to cases where the data are dependent or not identically distributed.5 Others have shown that if heterogeneity is due to an

observed random variable,6 the variable may be added to the model to control for the heterogeneity, while if the heterogeneity is related to a latent variable, various approaches—such as the hidden Markov model, mixture modelling or factor modelling—are available for heterogeneity control. In this paper, due to the different measure-theory setting for dependent data, we focus on the independent case as a first step.

Given A1–A3, both the eTE and eGE converge to the population error:

2.2 The upper bound of the empirical generalization error

The problem of GA and overfitting is discovered many years ago. Originally, to improve the generalization ability of a model, Vapnik and Chervonenkis (1971a,b) propose minimizing the upper bound of the population error of the estimator R(b|Y,X) as opposed to minimizing the eTE. The balance between in-sample fit and out-of-sample fit is formulated by Vapnik and Chervonenkis (1974b) using the Glivenko-Cantelli theorem and Donsker’s theorem for empirical processes. Specifically, the so-called VC inequality (Vapnik and Chervonenkis, 1974b) summarizes the relationship between

Lemma 1 (VC inequality: the upper bound for the population error). Under A1–A3, the following inequality holds with probability at least 1

where is the population error, is the empirical training error, , and h is the VC dimension.

The VC dimension h is a more general measure of the geometric complexity of a model than

the number of parameters, p, which does not readily extend as a measure of complexity in nonlinear or non-nested models. While h reduces to p directly for generalized linear models, h can also be used to order the complexity of nonlinear or non-nested models.7 Thus, eq. (2) can be used as a tool for linear, nonlinear and non-nested model selection. While Lemma 1 is established under A2, eq. (2) can be generalized to non-i.i.d. cases. Mc- Donald et al. (2011) generalizes the VC inequality for - and -mixing stationery time series while Smale and Zhou (2009) generalizes the VC inequality for panel data. Moreover, a number of papers (Michalski and Yashin, 1986; Skrondal and Rabe-Hesketh, 2004; Wang and Feng, 2005; Yu and Joachims, 2009; Pearl, 2015) show that heterogeneity can be controlled in the context of the VC inequality by implementing the latent variable model or by adding the variable causing heterogeneity into the model. The VC inequality provides an upper bound for the population error based on the eTE and the VC dimension of the model. As shown in Figure 1, when the VC dimension is low (i.e., the model is simple), the effective sample size (nt/h) of the training set is large, is small, the second term on the RHS of (2) is small, and the eTE is close to the population error. In this case the

Figure 1: The VC inequality and the upper bound of the population error

extremum estimator has a good GA. However, when the VC dimension is high (i.e., the model is very complicated), the effective sample size nt/h is small, the second term on the RHS of (2) becomes larger. In such situations, a small eTE does not guarantee a good GA, and overfitting is more likely.

Vapnik and Chervonenkis (1971a) show that minimizing the RHS of eq. (2) reduces overfitting

and improves the GA of the extremum estimator. However, this can be hard to implement because the VC dimension is hard to calculate for anything other than linear models.8 Also, VC inequality is for population. But in practice population is often not accessible and the data we collect is always finite, moreover we are often interesting in how a estimated model would perform on finite out-of-sample data, such as "how well the effect of a policy, estimated from a sample from suburb A, would explain the variation in suburb B, given relevant heterogeneity controlled. " Hence, as a handy and intuitive measurement for GA or overfitting in empirical study, eGE has been proposed and widely used in applications.

As a reasonable criterion for model evaluation, we would expect that eGE should demonstrate

a number of nice properties that characterize its pattern or regularity as an empirical process. For example, intuitively if we get an estimated model from the training set and apply it to many test sets with same size, we would expect that value of different eGEs could be as stable as possible despite of the sampling error; also, if we calculate the eGE of a model on test sets with different sizes, we would expect that value of eGEs should be influenced by the size of training set, the size of test set and the tail behavior of the error distribution. However, due to the procedure of sampling from population and subsampling by validation/cross validation, the randomness of eGE as a empirical process increase the difficulty to derive those properties. To reduce and absorb the influence of sampling and subsampling, one approach is to derive an upper bound for eGE,9 which is somehow similar to what eq. (2) does to population error, Thus, we now describe the finite-sample relation between the eGE and the eTE in validation, by adapting eq. (2).

Theorem 1 (The upper bound for the eGE of the extremum estimator under validation). Under A1–A3 and given an extremum estimator btrain , the following upper bound for the eGE holds with probability at least

where is the eGE and the eTE, is defined in Lemma 1, B lnand B is bounded, and otherwise

where

Eq. (3) provides an upper bound for the eGE of the test set as it depends on the eTE of the

model with a given h estimated on the training set. Thus, eq. (3) shows how the eTE from extremum estimation on the training set may be used to compute the eGE of the model on the test set. In other words, eq. (3) measures the GA of a model of a given complexity under the validation approach.

Theorem 1 has several important implications.

1. Upper bound of the eGE. Eq. (3) establishes the upper bound of the eGE for any out-of-sample

data of size ns based on the eTE from any in-sample data of size nt. Thus, eq. (3) quantifies the upper bound of the eGE, as opposed to Lemma 1, which quantifies the upper bound of

2. The trade-off between accuracy and looseness of the upper bound. Theorem (1) also shows that influences the trade-off between the accuracy and efficiency of the upper bound. The higher , the more likely the upper bound holds and the larger . Thus, while the probability the bound holds increases, it comes at the cost of a looser upper bound. In

3. The eGE-eTE trade-off in model selection. Eq. (3) also characterizes the trade-off between eGE and eTE for model selection in both the finite-sample and asymptotic cases. The

population GE and the population TE converge to the population error. Hence, minimizing eTE can lead directly to the true DGP in the population. By contrast, for the finite-sample

model (a high nt/h) will be unlikely to recover the true DGP leading to a large upper bound for the eGE. Thus, an oversimplified model will tend to underfit, fitting both the in-sample and out-of-sample data poorly. Thus, the complexity of a model is related to a trade-off

4. GA and tails of the loss function distribution. Eq. (3) also shows how the tail of the loss function distribution affects the upper bound of the eGE through , the second term on the

measure the heaviness of the loss distribution tail, a smaller implying a heavier tail. In the case of a heavy tail, is mathematically complicated and its convergence rate decreases to

Our next step is to establish a similar bound to eq. (3) for K-fold cross-validation. Given that

K-fold cross-validation is simply the multiple-round validation, a similar bound to eq. (3) can be

established for cross-validation by convolution. For convenience, define the empirical process

as the eGE gap in the qth round of cross-validation.

Theorem 2 (The upper bound for the eGE of the extremum estimator for cross-validation). Under A1–A3, given an extremum estimator btrain and given Tq, then

(i) if , the following upper bound for the eGE gap holds with probability at least

Figure 2: Representation of the trade-off between eGE and eTE

(ii) if Tq is heavy-tailed and sub-exponential, the following upper bound for the eGE holds with probability at least

is the eGE and is the eTE, respectively, in the qth round of cross-validation.

Theorem 2 provides an upper bound for the average eGE from K rounds of cross-validation. Generally speaking, the errors generated from cross-validation are affected both by sampling randomness (from the population) and by sub-sampling randomness that arises from partitioning

the sample into folds. Thus, the errors from cross-validation are potentially more volatile than the usual errors from estimation. Eqs. (4) and (5) offers a way to characterize the property of eGE despite of the effect of sub-sampling randomness.

The implications of eqs. (4) and (5) are as follows.

1. Upper bound of the eGE. Similar to eq. (3), eq. (4) and (5) serve as the upper bound of

the cross-validated eGE. Both equations reveal the trade-off between eTE and eGE and the influence of the tails of the loss distribution. In convolution, the tails dramatically affect the upper bound of the eGE. If Tq is light-tailed, converges to 1 exponentially when

n . When Tq is sub-exponential, the convolution is more complicated and it is hard to approximate the probability. However, with sufficiently large n, Theorem 2 shows that we

2. The trade-off between accuracy and looseness of the upper bound. Similar to eq. (3), in both eq. (4) and (5) we need to consider the trade-off between between and (or as in Theorem 1), i.e., the trade-off between efficiency and accuracy. Similar to Theorem 1,

3. Cross-validation hyperparameter K. Eqs. (4) and (5) characterize how K affects the average eGE from cross-validation (also called the cross-validation error in the literature). With a given sample and fixed K, sub-sampling randomness will produce a different average eGE each time cross-validation is performed. From Definition 1, nt and

trade-off between eTE and eGE under cross-validation. Thus, to characterize the influence of sub-sampling randomness, we establish the trade-off for cross-validation by a bound, after

• Small K. For low values of K, nt is low in each round of in-sample estimation and the eTE in each round q, , is more biased away from the

finite-sample bias for low values of K. However, since a small K implies ns is relatively large, more data is used for eGE calculation in each round, in each round the eGE on

• Large K. For high values of K, ns is low. Given a small test set size, the eGE in each round may be hard to bound from above, the averaged eGE from K rounds will be

Figure 3: The bias-variance trade-off for cross-validation eGE

more volatile and will increase. However, with a high K, the first term on the RHS of eqs. (4) and (5) tends to be closer to the true population error and the averaged eGE

In summary, Figure 3b shows that as the value of K increases, the averaged eGE from cross-validation follows a typical bias-variance trade-off. For low values of K, the average eGE is less volatile but more biased away from the population error. As K increases, the

Theorem 2 confirms the exhaustive simulation study results from Kohavi (1995). As a side

benefit, Theorem 2 also suggests that an optimal number of folds K may exist for each sample when the cross-validation approach is used with extremum estimation, as in the Kshown in Figure 3b. More specifically, the K that minimizes the upper bound (4) and (5) also maximizes the GA from cross-validation. We leave to the next section discussion on the optimal K for regression.

Theorems 1 and 2 establish upper bounds for the eGE of the extremum estimator given any size random sample, which reveals a method to analyze the prorperty of eGE for the extremum estimators. Potentially, if we take eGE as the criterion for model selection, Theorems 1 and 2

may be used to evaluate the performance of model selection under validation and cross-validation. Hence it would be natural to select the model with minimal eGE in the space of alternative models

2.3 Generalization error minimization

By establishing upper bounds for eGE under validation or cross-validation in section 2.1, we show that, as a criterion for model evaluation, a number of properties for eGE could be shown in finite samples and the asymptotic case, which suggest that it may serve as a good angle to understand model selection. Hence, by considering eGE as a criterion for model selection, we propose selecting the model based on minimizing the eGE, which we refer to as generalization error minimization or GEM.

Generally speaking, GEM can be implemented alongside with many conventional techniques of

model selection, such as penalized regressions, the information criteria and maximum a posteriori (MAP). However, in next section, we show that GEM works especially well for penalized regression. As shown in Algorithm 1, penalized regression estimation returns a bfor each . Each value of generates a different model and a different eGE. As a result, Theorems 1 and 2 guarantee that the model with the minimum eGE among has the best empirical generalization ability. By applying GEM in conjunction with validation/cross-validation and various penalty methods, the theoretical properties of the penalized regressions, such as its robustness, mode of consistency and convergence rate could be analyzed by directly applying the upper bounds we derive previously.

3 Finite-sample and asymptotic properties for penalized regression under GEM

In Section 2, to analyze eGE from the perceptive of model-free, we establish a class of upper bounds for eGE of the extremum estimators, which is potentially connected to the eTE-eGE trade-off, bias-variance trade-off and cross validation; we also propose the GEM as the general framework of model selection. In this section, we implement the GEM onto regression analysis. As shown in the following section, we apply eGE to regression analysis and show that GEM serves as a new and clear angle to understand the procedure of penalized regression. By applying the eGE as the criterion of model selection, model selection directly serves as an effective method to improve GA; the other way around, the properties of model selection can directly be explained and re-framed by the eGE of the model. Moreover, other than the properties in section 2, additional properties can be established for penalized regression under GEM framework. Specifically, we establish: (1) specific error bounds for any penalized regression, (2) -consistency for all penalized regression estimators, (3) that the upper bound for the -difference between the penalized regression estimator and the OLS estimator is a function of the eGE, the tail behavior of distribution for the loss function and the exogeneity of the sample.

3.1 Penalized regression

Firstly, we formally define penalized regression and its two most popular variants: lasso (-penalized regression) and ridge (-penalized regression). It is important to stress that each variable in (Y,X) must be standardized before implementing penalized regression. As shown by Tibshirani (1996), without standardization the penalized regression estimates may be influenced by the magnitude (units) of the variables. After standardization, of course, X and Y are unit- and scale-free.

Definition 2 (Penalized regression, ridge regression, lasso regression,

1. The general form of the objective function for penalized regression is

2. Let bdenote the solution to the constrained minimization eq. (6) for a given value of the penalty parameter . Let bdenote the estimator with the minimum eGE among all the

3. The objective function for the lasso (-norm penalty) is

4. The -norm eTE and eGE for any b are defined, respectively,

Figure 4: Comparison of estimates from OLS and various penalized regressions

The idea behind penalized regression is illustrated in Figure 4. As shown in Figure 4a, different

penalties correspond to different boundaries for the estimation feasible set. For the regression (lasso), the feasible set is a diamond. The feasible set expands to a circle under an penalty. As illustrated in Figures 4b and 4c, for given , the smaller , the more likely bis a corner solution. It follows that under the penalty, variables are more likely to be dropped compared

Figure 5

Figure 5: Outline of proof strategy

with the penalty.12 In the special case when and ), the penalized regression is identical to the Akaike (Bayesian) information criterion. Penalized regression primarily focuses on overfitting. By contrast, OLS minimizes the eTE without any penalty, often resulting in a large eGE (e.g., the ‘overfitting’ in Figure 1). It is also possible that OLS fits the training data poorly, causing both the eTE and eGE to be large (e.g., the ‘underfitting’ in Figure 1). Generally speaking, it is easier to deal with overfitting than underfitting.13 Overfitting in OLS is usually caused by including too many variables, which can be resolved by reducing p. Underfitting, however, is likely due to a lack of data (variables) and the only remedy is to collect more data.

3.2 GEM for penalized regression

The conventional route to establish finite-sample and asymptotic properties for regression is by analyzing the properties of the estimator in the space of the eTE. By contrast, to study how penalized regression improves GA, we reformulate the analysis in the space of the eGE. Figure 5 outlines our proof strategy. We show that a number of finite-sample properties for penalized regression can be established under the GEM framework.

In asymptotic analysis, consistency is typically considered to be one of the most fundamental

properties. To demonstrate that GEM is a viable estimation framework, we prove that the penalized regression model selected by eGE minimization converges to the true DGP as n . Essentially, we show that penalized regression bijectively maps bto the minimal eGE among on the test set. To link the finite-sample and asymptotic results we need to show that, if the true DGP is

bijectively assigned to the global minimum eGE in the population, and if

then bis consistent in probability or

To establish consistency for penalized regression, we make the following three additional

assumptions.

Further assumptions

Assumptions A4–A6 restrict the true DGP indexed by to be identifiable. Otherwise, there may exist an alternative that is not statistically different from the true DGP. The assumptions are standard for linear regression.

Under assumptions A1–A6, we first show that the true DGP has the lowest generalization error.

Proposition 1 (Identification of in the space of eGE). Under A1–A6, the true DGP, Y is the one and only one offering the minimal eGE as

Proposition 1 states that there is a bijective mapping between and the global minimum eGE

in the population. If A5 or A6 were violated, variables may exist in the sample that render the true DGP not to have the minimum eGE in the population.

As shown in Algorithm 1, penalized regression chooses bto be the model with the minimum

eGE in . Thus, we need to establish that when the sample size is large enough, the true DGP is included among , the list of models selected by validation or cross-validation.

Proposition 2 (Existence of -consistency). Under A1–A6 and Proposition 1, there exists at least one

Using lasso as the example of penalized regression, Figure 6 illustrates Propositions 1 and 2.

In Figure 6, refers to the true DGP, brefers to the solution of eq. (7), and the diamond-shaped feasible sets are due to the penalty. Different values of imply different areas for the feasible sets, which get smaller as the value of increases. There are three possible cases: (i) under-shrinkage—for a low value of (Figure 6a), lies within the feasible set and has the minimum eTE in the population; (ii) perfect-shrinkage—for the oracle (Figure 6b), is located precisely on the boundary of the feasible set and still has the minimum eTE in the population; (iii) over-shrinkage—for a high value of (Figure 6c), lies outside the feasible set. In cases (i) and (ii), the constraints become inactive as , so lim. However, in case (iii), lim. An important implication is that tuning the penalty parameter is critical for the theoretical properties of the penalized regression estimators.

Figure 6: Shrinking for various values of

3.3 Main results for penalized regression under GEM

Intuitively, the penalized regression estimator will be consistent in some norm or measure as long as, for a specific lies in the feasible set and offers the minimum eTE. In practice, however, we may not know a priori whether causes over-shrinking or not, especially when the number of variables, p, is not fixed. As a result, we need to use an approach like validation or cross-validation to tune the value of . Applying the results in Section 2, we show that GEM guarantees that the model selected by penalized regression with the appropriate , asymptotically converges in to the true DGP.

In this section we analyze the finite-sample and asymptotic properties of the GEM estimator in two settings: n and n < p. In the case where n , OLS is feasible, and is the unpenalized

regression estimator. In the case where n < p, OLS is not feasible, and forward stagewise regression (FSR) is the unpenalized regression estimator.

3.3.1 GEM for penalized regression with n ⩾ p

Firstly, by adapting eq. (3) and (4) and (5) for regression, we establish the upper bound of the eGE.

Lemma 2 (Upper bound for the eGE of the OLS estimator). Under A1–A6, if we assume u N

1. Validation case. The following bound for the eGE for bOLS holds with probability at least

where is the eGE and is the eTE of the OLS estimator, and is defined in Lemma 1.

2. K-fold cross-validation case. The following bound for the eGE for bOLS holds with probability at least

where is the eGE and is the eTE of the OLS estimator in the qth round of cross-validation, and is defined in Lemma 1.

In a similar fashion to eqs. (3) and (4) and (5), eqs. (9) and (10) measure the upper bound of the

eGE for the OLS estimator under validation and cross-validation, respectively. In standard practice, of course, neither validation nor cross-validation are implemented as part of OLS estimation and the eGE of the OLS estimator is not computed. Nevertheless, eqs. (9) and (10) show that it is possible to compute the eGE of the OLS estimator without having to carry out validation or cross-validation. Eqs. (9) and (10) also show that the higher the variance of u in the true DGP, the higher the upper bound of the eGE under validation and cross-validation. As a bonus of the GEM approach, eq. (10) also shows that we can find the K that maximizes the GA from cross-validation by tuning K to the lowest upper bound of the cross-validated eGE, determined by minimizing the expectation of the RHS of eq. (10).

Corollary 1 (The optimal K for penalized regression). Under A1–A6, and based on eq. (10) from Lemma 2, if we also assume the error term u , the optimal K for cross-validation in penalized regression (the minimum expected upper bound of eGE) is defined:

The penalty parameter can be tuned by validation or K-fold cross-validation. For K have K different test sets for tuning and K different training sets for estimation. Using eqs. (9) and (10), we now establish, in two steps, an upper bound for the -norm difference between

the unpenalized estimator bOLS and the corresponding penalized estimator bunder validation and cross-validation.

Proposition 3 (-difference between the penalized and unpenalized predicted values). Under A1–A6 and based on Lemma 2, Propositions 1, and 2,

1. Validation case. The following bound for the difference between the predicted values from bOLS and the validated bholds with probability at least

where bqOLS is the OLS estimator and bis the penalized estimator in the qth round of cross-validation, and is defined in Theorem 2.

Following Markov’s classical proof of consistency for OLS, Proposition 3 establishes the

-norm convergence of the fitted values to the true values. Based on A1-A6, the identification condition is satisfied, and the convergence of the fitted values implies the -norm consistency of the penalized regression estimator.

We now establish the upper bound of the -norm difference between bOLS and b

b, under validation and cross-validation.

Theorem 3 (-difference between the penalized and unpenalized regression estimators). Under A1–A6 and based on Propositions 1, 2, and 3,

1. Validation case. The following bound for the -difference between bOLS and the validated bholds with probability at least

2. K-fold cross-validation case. The following bound for the - difference between the K-fold cross-validated bOLS and bholds with probability at least

Some important remarks apply to Theorem 3. The LHS of eq. (13) measures the -norm

difference between the penalized regression estimator and the OLS estimator under validation. The RHS of eq. (13) essentially captures the maximum -norm difference between bOLS and b. As shown in eq. (13), the maximum difference depends on the GE of the true DGP and the GE of the OLS model.

• The first term on the RHS of eq. (13) (ignoring 1) is the difference between the eGE from

OLS and the upper bound of the population error, or, equivalently, the difference between the GA of the OLS estimator and its maximum. The better the GA of bOLS, the less overfitting

• The second term on the RHS of eq. (13) (ignoring 4) measures the empirical endogeneity of the OLS estimator on the test set. On the training set eTt Xs = 0, but on the test set, in general, eTs Xs