High Dimensional Forecasting via Interpretable Vector Autoregression

2014·arXiv

Abstract

1 Introduction

Vector autoregression (VAR) has emerged as the standard-bearer for macroeconomic forecasting since the seminal work of Sims (1980). VAR is also widely applied in numerous fields, including finance (e.g., Han et al. 2015), neuroscience (e.g, Hyv¨arinen et al. 2010), and signal processing (e.g., Basu et al. 2019). The number of VAR parameters grows quadratically with the the number of component series, and, in the words of Sims, this “profligate parameterization” becomes intractable for large systems. Without further assumptions, VAR modeling is infeasible except in limited situations with small number of components and lag order.

Many approaches have been proposed for reducing the dimensionality of vector time series models, including canonical correlation analysis (Box & Tiao 1977), factor models (e.g., Forni et al. 2000, Stock & Watson 2002, Bernanke et al. 2005), Bayesian models (e.g., Banbura et al. 2010; Koop 2013), scalar component models (Tiao & Tsay 1989), independent component analysis (Hyv¨arinen et al. 2010), and dynamic orthogonal component models (Matteson & Tsay 2011). Recent approaches have focused on imposing sparsity in the estimated coefficient matrices through the use of convex regularizers such as the lasso (Tibshirani 1996). Most of these methods are, however, adapted from the standard regression setting and do not specifically leverage the ordered structure inherent to the lag coefficients in a VAR.

This paper contributes to the lasso-based regularization literature on VAR estimation by proposing a new class of regularized hierarchical lag structures (HLag), that embed lag order selection into a convex regularizer to simultaneously address the dimensionality and lag selection issues. HLag thus shifts the focus from obtaining estimates that are generally sparse (as measured by the number of nonzero autoregressive coefficients) to attaining estimates with low maximal lag order. As such, it combines several important advantages: It produces interpretable models, provides a flexible, computationally efficient method for lag order selection, and offers practitioners the ability to fit VARs in situations where various components may have highly varying maximal lag orders.

Like other lasso-based methods, HLag methods have an interpretability advantage over factor and Bayesian models. They provide direct insight into the series contributing to the forecasting of each individual component. HLag has further exploratory uses relevant for the study of different economic applications, as we find our estimated models on the considered macroeconomic data sets to have an underlying economic interpretation. Comparable Bayesian methods, in contrast, primarily perform shrinkage making the estimated models more difficult to interpret, although they can be extended to include variable selection (e.g., stochastic search). Furthermore, factor models that are combinations of all the component series can greatly reduce dimensionality but forecast contributions from the original series are only implicit. By contrast, the sparse structure imposed by the HLag penalty explicitly identifies which components are contributing to model forecasts.

While our motivating goal is to produce interpretable models with improved point forecast performance, a convenient byproduct of the HLag framework is a flexible and computationally efficient method for lag order selection. Depending on the proposed HLag structure choice, each equation row in the VAR will either entirely truncate at a given lag (“componentwise HLag”), or allow the series’s own lags to truncate at a different order than those of other series (“own/other HLag”), or allow every (cross) component series to have its own lag order (“elementwise HLag”). Such lag structures are conveniently depicted in a “Maxlag matrix” which we introduce and use throughout the paper.

Furthermore, HLag penalties are unique in providing a computationally tractable way to fit high order VARs, i.e., those with a large maximal lag order (pmax). They allow the possibility of certain components requiring large max-lag orders without having to enumerate over all combinations of choices. Practitioners, however, typically choose a relatively small pmax. We believe that this practice is in part due to the limitations of current methods: information criteria make it impossible to estimate VARs with large pmax by least squares as the number of candidate lag orders scales exponentially with the number of components k. Not only is it computationally demanding to estimate so many models, overfitting also becomes a concern. Likewise, traditional lasso VAR forecasting performance degrades when pmax is too large, and many Bayesian approaches, while statistically viable, are computationally infeasible or prohibitive, as we will illustrate through simulations and applications.

In Section 2 we review the literature on dimension reduction methods to address the VAR’s overparametrization problem. In Section 3 we introduce the HLag framework. The three aforementioned hierarchical lag structures are proposed in Section 3.1. As detailed above, these structures vary in the degree to which lag order selection is common across different components. For each lag structure, a corresponding HLag model is detailed in Section 3.2 for attaining that sparsity structure. Theoretical properties of high-dimensional VARs estimated by HLag are analyzed in Section 3.3. The proposed methodology allows for flexible estimation in high dimensional settings with a single tuning parameter. We develop algorithms in Section 4 that are computationally ef-ficient and parallelizable across components. Simulations in Section 5 and applications in Section 6 highlight HLag’s advantages in forecasting and lag order selection.

2 Review of Mitigating VAR Overparametrization

We summarize the most popular approaches to address the VAR’s overparametrization problem and discuss their link to the HLag framework.

2.1 Information Criteria

Traditional approaches address overparametrization by selecting a low lag order. Early attempts utilize least squares estimation with an information criterion or hypothesis testing (L¨utkepohl 1985). The asymptotic theory of these approaches is well developed in the fixed-dimensional setting, in which the time series length T grows while the number of components k and maximal lag order pmax are held fixed (White 2001). However, for small T, it has been observed that no criterion works well (Nickelsburg 1985). Gonzalo & Pitarakis (2002) find that for fixed k and pmax, when T is relatively small, Akaike’s Information Criterion (AIC) tends to overfit whereas Schwarz’s Information Criterion (BIC) tends to severely underfit. Despite their shortcomings, AIC, BIC, and corrected AIC (Hurvich & Tsai 1989) are still the preferred lag order selection tools by most practitioners (L¨utkepohl 2007, Tsay 2013).

A drawback with such approaches is, however, that they typically require the strong assumption of a single, universal lag order that applies across all components. While this reduces the computational complexity of model selection, it has little statistical or economic justification, unnecessarily constrains the dynamic relationship between the components, and impedes forecast performance. An important motivating goal of the HLag framework is to relax this strong assumption. Gredenhoff & Karlsson (1999) show that violation of the universal lag order assumption can lead to overparameterized models or the imposition of false zero restrictions. They instead suggest considering componentwise specifications that allow each marginal regression to have a different lag order (sometimes referred to as an asymmetric VAR). One such procedure (Hsiao 1981) starts from univariate autoregressions and sequentially adds lagged components according to Akaike’s “Final Prediction Error” (Akaike 1969). However, this requires an a priori ranking of components based on their perceived predictive power, which is inherently subjective. Keating (2000) offers a more general method which estimates all potential componentwise VARs and utilizes AIC/BIC for lag order selection. Such an approach is computationally intractable and standard asymptotic justifications are inapplicable if the number of components k is large. Ding & Karlsson (2014) present several specifications which allow for varying lag order within a Bayesian framework. Markov chain Monte Carlo estimation methods with spike and slab priors are proposed, but these are computationally intensive, and estimation becomes intractable in high dimensions though recent advances have been made by Giannone et al. (2017).

Given the difficulties with lag order selection in VARs, many authors have turned instead to shrinkage-based approaches, which impose sparsity, or other economically-motivated restrictions, on the parameter space to make reliable estimation tractable, and are discussed below.

2.2 Bayesian Shrinkage

Early shrinkage methods, such as Litterman (1979), take a pragmatic Bayesian perspective. Many of them (e.g., Banbura et al. 2010; Koop 2013) apply the Minnesota prior, which uses natural conjugate priors to shrink the VAR toward either an intercept-only model or a vector random walk, depending on the context. The prior covariance is specified so as to incorporate the belief that a series’ own lags are more informative than other lags and that lower lags are more informative than higher lags. With this prior structure, coefficients at high lags will have a prior mean of zero and a prior variance that decays with the lag. Hence, coefficients with higher lags are shrunk more toward zero. However, unlike the HLag methods but similar to ridge regression, coefficients will not be estimated as exactly zero.

The own/other HLag penalty proposed below is inspired by this Minnesota prior. It also has the propensity to prioritize own lags over other lags and to assign a greater penalty to distant lags, but it formalizes these relationships by embedding two layers of hierarchy into a convex regularization framework. One layer (within each lag vector) prioritizes own lags before other lags. Another layer (across lag vectors) penalizes distant lags more than recent lags since the former can only be included in the model if the latter are selected.

The Bayesian literature on dealing with overparametrization of VARs is rapidly growing, with many recent advances on, amongst others, improved prior choices (e.g., Carriero et al. 2012, Giannone et al. 2015), stochastic volatility (e.g., Carriero et al. 2019), time-varying parameter estimation (e.g., Koop & Korobilis 2013), and dimension reduction via compressing (Koop et al. 2019).

2.3 Factor Models

Factor models form another widely used class to overcome the VAR’s overparameterization and have been used extensively for macroeconomic forecasting (e.g., Stock & Watson 2002). Here, the factors serve the purpose of dimension reduction since the information contained in the original high dimensional data set is summarized—often using principal component analysis—in a small number of factors. While Factor Augmented VARs (FAVAR) (e.g., Bernanke et al. 2005) include one or more factors in addition to the observables, all observables are expressed as a weighted average of factors in Dynamic Factor Models (e.g., Forni et al. 2000).

2.4 Lasso-based Regularization

Other shrinkage approaches have incorporated the lasso (Tibshirani 1996). Hsu et al. (2008) consider the lasso with common information criterion methods for model selection. The use of the lasso mitigates the need to conduct an exhaustive search over the space of all 2possible models but does not explicitly encourage lags to be small. HLag, in contrast, forces low lag coefficients to be selected before corresponding high lag coefficients, thereby specifically shrinking toward low lag order solutions. As will be illustrated through simulations and empirical applications, this often improves forecast performance.

To account for the VAR’s inherent ordered structure, Lozano et al. (2009) use a group lasso (Yuan & Lin 2006) penalty to group together coefficients within a common component. Song & Bickel (2011) treat each variable’s own lags differently from other variables’ lags (similar to the own/other Hlag penalty we propose), consider a group lasso structure and additionally downweight higher lags via scaling the penalty parameter by an increasing function of the coefficients’ lag. The authors note that the functional form of these weights is arbitrary, but the estimates are sensitive to the choice of weights. A similar truncating lasso penalty is proposed by Shojaie & Michailidis (2010) and refined by Shojaie et al. (2012) in the context of graphical Granger causality. However, unlike HLag, this framework requires a functional form assumption on the decay of the weights as well as a two-dimensional penalty parameter search which generally squares the computational burden.

3 Methodology

autoregression VAR) may be expressed as a multivariate regression

conditional on initial values denotes an intercept vector, coefficient matrices, and is a mean zero white noise vector time series with unspecified nonsingular contemporaneous covariance matrix

In the classical low-dimensional setting in which T > kp, one may perform least squares to fit

the VAR) model, minimizing

over denotes the Euclidean norm of a vector a. We will find it convenient to express the VAR using compact matrix notation:

Equation (3.1) is then simply

and the least squares procedure (3.2) can be expressed as minimizing

over denotes the Frobenius norm of the matrix A, that is the Euclidean norm of vec(A) (not to be mistaken for the operator norm, which does not appear in this paper).

Estimating the parameters of this model is challenging unless T is sufficiently large. Indeed, when 1, estimation by least squares becomes imprecise. We therefore seek to incorporate reasonable structural assumptions on the parameter space to make estimation tractable for moderate to small T. Multiple authors have considered using the lasso penalty, building in the assumption that the lagged coefficient matrices are sparse (e.g., Song & Bickel 2011, Davis et al. 2016, Hsu et al. 2008); theoretical work has elucidated how such structural assumptions can lead to better estimation performance even when the number of parameters is large (e.g., Basu & Michailidis 2015, Melnyk & Banerjee 2016, Lin & Michailidis 2017). In what follows, we define a class of sparsity patterns, which we call hierarchical lag or HLag structures, that arises in the context of multivariate time series.

3.1 HLag: Hierarchical Lag Structures

In Equation (3.1), the parameter controls the dynamic dependence of the ith component of on the jth component of . In describing HLag structures, we will use the following notational

convention: for 1

Consider the coefficient lags L defined by

in which we define . Therefore, each denotes the maximal coefficient lag (maxlag) for component j in the regression model for component i. In particular, is the smallest . Note that the maxlag matrix L is not symmetric, in general. There are numerous HLag structures that one can consider within the context of the VAR) model. The simplest such structure is that , meaning that there is a universal (U) maxlag that is shared by every pair of components. Expressed in terms of Equation (3.1), this would say that the methodology we introduce can be easily extended to this and many other potential HLag structures, in this paper we focus on the following three fundamental structures.

Figure 1: A componentwise (C) HLag active set structure (shaded): HLag

Figure 2: An own-other (O) HLag active set structure (shaded): HLag

Figure 3: An elementwise (E) HLag active set structure (shaded): HLag

In the next section, we introduce the proposed class of HLag estimators aimed at estimating VAR) models while shrinking the elements of L towards zero by incorporating the three HLag structures described above.

3.2 HLag: Hierarchical Group Lasso for Lag Structured VARs

In this section, we introduce convex penalties specifically tailored for attaining the three lag structures presented in the previous section. Our primary modeling tool is the hierarchical group lasso (Zhao et al. 2009, Yan & Bien 2017), which is a group lasso (Yuan & Lin 2006) with a nested group structure. The group lasso is a sum of (unsquared) Euclidean norms and is used in statistical modeling as a penalty to encourage groups of parameters to be set to zero simultaneously. Using nested groups leads to hierarchical sparsity constraints in which one set of parameters being zero implies that another set is also zero. This penalty has been applied to multiple statistical problems including regression models with interactions (Zhao et al. 2009, Jenatton et al. 2010, Radchenko & James 2010, Bach et al. 2012, Bien et al. 2013, Lim & Hastie 2015, Haris et al. 2016, She et al. 2018), covariance estimation (Bien et al. 2016), additive modeling (Lou et al. 2016), and time series (Tibshirani & Suo 2016). This last work focuses on transfer function estimation, in this case scalar regression with multiple time-lagged covariates whose coefficients decay with lag.

For each hierarchical lag structure presented above, we propose an estimator based on a convex optimization problem:

in which denotes a hierarchical lag group (HLag) penalty function. We propose three such penalty functions: componentwise; own-other; and elementwise; and discuss their relative merits.

Since all three penalty functions are based on hierarchical group lasso penalties, a unified computational approach to solve each is detailed in Section 4. First, we discuss theoretical properties of HLag.

3.3 Theoretical Properties

We build on Basu & Michailidis (2015) to analyze theoretical properties of high-dimensional VARs estimated by HLag. Consider a fixed realization of generated from the VAR model (3.1) with fixed autoregressive order ). Denote the corresponding true maxlag matrix by L. We make the following assumptions.

Assumption 1. The VAR model is stable, such that det

where

and the error covariance matrix is positive definite such that its minimum eigenvalue Λ0 and its maximum eigenvalue Λ

These assumptions are standard in the time series literature. Define the following two measures of stability of the VAR process, which will be useful for our theoretical analysis (see Basu &

Michailidis 2015 for more detail)

where ) denotes the conjugate transpose of a complex matrix.

We derive a bound on the in-sample prediction error. Define the in-sample, one-step-ahead

mean squared forecast error to be

with as defined in equation (3.3). While tr() is the irreducible error, an unavoidable part of the forecast error, a good estimator of the autoregressive parameters should allow us to control the size of the second term. In Theorem 1, we provide such a bound on the in-sample

ity at least 1

where ˆis the elementwise HLag estimator with pmax = p.

The proof of Theorem 1 is included in Section A of the appendix. Theorem 1 establishes in-sample prediction consistency in the high-dimensional regime log(same rate is obtained as for i.i.d. data, modulo a “price” paid for dependence. The temporal and cross-sectional dependence affects the rate through the internal parameters Λ

While Theorem 1 is derived under the assumption that p is the true order of the VAR, the results hold even if p is replaced by any upper bound pmax on the true order since the VARcan be viewed as a VAR, see Basu & Michailidis (2015). The convergence rate then becomesinstead of

The bound includes terms of the form 2 exponent can be removed if one adopts a more complicated weighting scheme (see e.g., Jenatton, Audibert & Bach 2011, Bien et al. 2016), which would avoid high order lag coefficients from being aggressively shrunken. However, in the context of VAR estimation, we find through simulation experiments that this aggressive shrinkage is in fact beneficial (see Section C.3 of the appendix).

4 Optimization Algorithm

We begin by noting that since the intercept does not appear in the penalty terms, it can be removed if we replace ). All three optimization

problems are of the form

and (3.5), (3.6), and (3.7) only differ by the form of the norm Ω. A key simplification is possible by observing that the objective above decouples across the rows of

in which . Hence, Equation (4.1) can be solved in parallel by

solving the “one-row” subproblem

Jenatton, Mairal, Obozinski & Bach (2011) show that hierarchical group lasso problems can be efficiently solved via the proximal gradient method. This procedure can be viewed as an extension of traditional gradient descent methods to nonsmooth objective functions. Given a convex objective function of the form is differentiable with a Lipschitz continuous gradient, the proximal gradient method produces a sequence ˆ

with the guarantee that

is O(1/m) (cf. Beck & Teboulle 2009). For m = 1, 2, . . ., its update is given by

where is an appropriately chosen step size and Proxis the proximal operator of the function ), which is evaluated at the gradient step we would take if we were minimizing alone. The proximal operator is defined as the unique solution of a convex optimization problem involving Ω

The proximal gradient method is particularly effective when the proximal operator can be evaluated efficiently. In our case, Ω) is a sum of hierarchically nested Euclidean norms. Jenatton, Mairal, Obozinski & Bach (2011) show that for such penalties, the proximal operator has essentially a closed form solution, making it extremely efficient. It remains to note

which is essentially the Fast Iterative Soft-Thresholding Algorithm developed by Beck & Teboulle (2009). We verified that our findings in the simulation study are unaffected by this choice.

Our full procedure is detailed in Algorithm 1 and is applicable to all three HLag estimators. Note that the algorithm requires an initial value ˆ[0]. As is standard in the regularization literature (e.g., Friedman et al. 2017), we use “warm starts”. We solve Algorithm 1 for a grid of penalty values starting at , the smallest value of the regularization parameter in which all coefficients will be zero. For each smaller value of along this grid, we use the previous solution as a “warm start” ( ˆ[0]) to run Algorithm 1 with the new -value. A key advantage of our HLag estimates being solutions to a convex optimization problem is that the algorithms are stable and not sensitive to the choice of initialization (Beck & Teboulle 2009). As stopping criterion, we use , while one could also use . We opt for the former since we have numerically observed in our simulation experiments that considerably less iterations are needed without affecting accuracy.

The algorithms for these methods differ only in the evaluation of their proximal operators

(since each method has a different penalty Ω). However, all three choices of Ωcorrespond to hierarchical group lasso penalties, allowing us to use the result of Jenatton, Mairal, Obozinski & Bach (2011), which shows that the proximal operator has a remarkably simple form. We write

these three problems generically as

where . The key observation in Jenatton, Mairal, Obozinski & Bach (2011) is that the dual of the proximal problem (4.2) can be solved exactly in a single pass of blockwise coordinate descent. By strong duality, this solution to the dual provides us with a solution to problem (4.2). The updates of each block are extremely simple, corresponding to a groupwise-soft-thresholding operation. Algorithm 2 shows the solution to (4.3), which includes all three of our penalties as special cases.

Selection of the penalty parameters. While some theoretical results on the choice of penalty parameters are available in the literature (Basu & Michailidis 2015), such theoretical results can not be used in practice since the penalty parameter’s value depends on properties of the underlying model that are not observable. For this reason, we use cross validation, one of the standard approaches to penalty parameter selection.

Following Friedman et al. (2010), the grid of penalty values is constructed by starting with , an estimate of the smallest value in which all coefficients are zero, then decrementing in log linear increments. The grid bounds are detailed in the appendix of Nicholson et al. (2017). The HLag methods rely on a single tuning parameter in equation (4.1). Our penalty parameter search over a one-dimensional grid is much less expensive than the search over a multi-dimensional grid as needed for the lag-weighted lasso (Song & Bickel 2011). To accommodate the time series nature of our data, we select the penalty parameters using the cross-validation approach utilized by Song & Bickel (2011) and Banbura et al. (2010). Given an evaluation period [one-step-ahead mean-squared forecast error (MSFE) as a cross-validation score:

with ˆrepresenting the forecast for time t + 1 and component i based on observing the series up to time t. If multi-step ahead forecast horizons are desired, we can simply substitute (4.4) with our desired forecast horizon h. Since this penalty search requires looping over many time points, we have coded most of the HLag methods in C++ to increase computational efficiency.

5 Simulation Study

We compare the proposed HLag methods with 13 competing approaches: (i) AIC-VAR: least squares estimation of the VAR and selection of a universal lag order using AIC, (ii) BIC-VAR: same as in (i) but lag order selection using BIC, (iii) Lasso-VAR: estimation of the VAR using an -penalty, (iv) Lag-weighted (LW) Lasso-VAR: estimation of the VAR using a weighted penalty, which applies greater regularization to higher order lags, (v) BGR-BVAR: Bayesian VAR of Banbura et al. (2010), (vi) GLP-BVAR: Bayesian VAR of Giannone et al. (2015), (vii) CCM-

Figure 4: Sparsity patterns (and magnitudes) of the HLag based simulation scenarios. Darker shading indicates coefficients that are larger in magnitude.

BVAR: Bayesian VAR of Carriero et al. (2019) (viii) DFM: Dynamic Factor Model (see e.g., Forni et al. 2000), (ix) FAVAR: Factor Augmented VAR (Bernanke et al. 2005) (x) VAR(1): least squares estimation of a VAR(1) (xi) AR: univariate autoregressive model, (xii) Sample mean: intercept-only model, (xiii) Random walk: vector random walk model. The comparison methods are detailed in Section B of the appendix.

5.1 Forecast Comparisons

To demonstrate the efficacy of the HLag methods in applications with various lag structures, we evaluate the proposed methods under four simulation scenarios.

In Scenarios 1-3, we take k = 45 components, a series length of T = 100 and simulate from a VAR with the respective HLag structures: componentwise, own-other, and elementwise. In this section, we focus on simulation scenarios where the sample size T is small to moderate compared to the number of parameters to be estimated (We investigate the impact of increasing the time series length in Section C.4 of the appendix. The coefficient matrices used in these scenarios are depicted in Figure 4, panel (1)-(3) respectively.

In Scenario 4, we consider a data generating process (DGP) with k = 40 and T = 195 that does not a priori favor the HLag approaches vis-a-vis the competing approaches but follows the “data-based Monte Carlo method” (Ho & Sorensen 1996) to make the simulation setting robust to arbitrary DGPs. This DGP does not have any special lag structure; all variables in all equations

have p = 4 non-zero lags, as can be seen from Figure 4, panel (4).

All simulations are generated from stationary coefficient matrices. Full details on each simulation design together with the steps taken to ensure the stationarity of the simulation structures are given in Sections C.1 and C.2 of the appendix. In each scenario, the error covariance is taken to be . We investigate the sensitivity of our results to various choices of error covariance in Section C.5 of the appendix. To reduce the influence of initial conditions on the DGPs, the first 500 observations were discarded as burn-in for each simulation run. We run M = 500 simulations in each scenario.

Forecast performance measure. We focus on the problem of obtaining reliable point forecasts. To evaluate how well our methods and their competitors do in the context of providing such point forecasts, we measure their performance in terms of out-of-sample point forecast accuracy and choose mean squared forecast error as our main measure of performance. We generate time series of length T, fit the models to the first 1 observations and use the last observation to compute

the one-step-ahead mean squared forecast error

with the value of component time series i at the time point simulation run, and is its predicted value.

Figure 5 gives the forecast performance of the methods in Scenarios 1-4. Concerning the VARbased methods, we report the results for known (p = 5 in Scenario 1, p = 2 in Scenario 2 and p = 4 in Scenario 3 and 4) maximal lag order. We first discuss these results and then summarize the differences in results when the maximal lag order is unknown, for which we take pmax = 12.

Scenario 1: Componentwise HLag. Componentwise and own-other HLag perform best, which is to be expected since both are geared explicitly toward Scenario 1’s lag structure. Elementwise HLag outperforms the lag-weighted lasso, and both do better than the lasso. Among the Bayesian methods, the BGR and CCM approaches are competitive to elementwise HLag, whereas the GLP approach is not. All Bayesian methods perform significantly worse (as confirmed with paired t-tests) than componentwise and own-other HLag. The factor models are not geared towards the

Figure 5: Out-of-sample mean squared forecast error for VARs in Scenario 1 to 4. Error bars of length two standard errors are in blue; the best performing method is in black.

DGP of Scenario 1: They select around five factors, on average, in their attempt to capture the time series dynamics and are not competitive to HLag. Regarding lag order selection with AIC/BIC, we can not estimate the VAR model for 1 with least squares, thus for a simple benchmark we instead estimate a (1) by least squares. Despite the explicit orientation toward modeling recent behavior in the VAR(1) model, it suffers both because it misses important longer range lag coefficients and because it is an unregularized estimator of and therefore has high variance. The univariate AR benchmark also suffers because it misses the dynamics among the time series: its MSFE is more than twice as large as the MSFEs of the HLag methods.

Scenario 2: Own-other HLag. All three HLag methods perform significantly better than the competing methods. As one would expect, own-other HLag achieves the best forecasting performance, with componentwise and elementwise HLag performing only slightly worse. As with the previous scenario, the least-squares approaches are not competitive.

Scenario 3: Elementwise HLag. As expected, elementwise HLag outperforms all others. The lag-weighted lasso outperforms componentwise and own-other HLag, which is not surprising as it is designed to accommodate this type of structure in a more crude manner than elementwise HLag. The relatively poor performance of componentwise and own-other HLag is likely due to the coefficient matrix explicitly violating the structures in all 45 rows. However, both still significantly outperform the Bayesian methods, factor-based methods and univariate benchmarks.

Scenario 4: Data-based. Though all true parameters are non-zero, the HLag approaches perform considerably better than the lasso, lag-weighted lasso, Bayesian, factor-based and univariate approaches. HLag achieves variance reduction by enforcing sparsity and low max-lag orders. This, in turn, helps to improve forecast accuracy even for non-sparse DGPs where many of the coefficients are small in magnitude, as in Figure 4, panel (4).

Unknown maximal lag order. In Figure 6, we compare the performance of the VAR-based methods for known and unknown maximal lag order. For all methods in all considered scenarios, the MSFEs are, overall larger when the true maximal lag order is unknown since now the true lag order of each time series in each equation of the VAR can be overestimated. With a total of autoregressive parameters to estimate, the methods that assume an ordering, like HLag, are greatly advantaged over a method like the lasso that does not exploit

outperform the lasso.

clear advantage over the Bayesian methods of Giannone et al. (2015), Carriero et al. (2019) and the lag-weighted lasso. The latter minimally requires specifying a weight function, and a two-dimensional penalty parameter search in our implementation, which is much more time intensive than a one-dimensional search, as required for HLag. The Bayesian method of Banbura et al. (2010) is fast to compute since there is a closed-form expression for the mean of the posterior distribution of the autoregressive parameters conditional on the error variance-covariance matrix. While the Bayesian method of Banbura et al. (2010) and lasso require, in general, less computation time, HLag has clear advantages over the former two in terms of forecast accuracy, especially when the maximal lag length pmax is large, but also in terms of lag order selection, as discussed in the following sections.

Figure 6: Out-of-sample mean squared forecast error for VARs in Scenario 1 to 4 for known (black) and unknown (gray) order. Error bars of length two standard errors are in blue.

Figure 7: Componentwise structure in the Robustness simulation Scenario 5.

5.2 Robustness of HLag as pmax Increases

We examine the impact of the maximal lag order pmax on HLag’s performance. Ideally, provided that pmax is large enough to capture the system dynamics, its choice should have little impact on forecast performance. However, we expect regularizers that treat each coefficient democratically, like the lasso, to experience degraded forecast performance as pmax increases.

As an experiment, we simulate from an HLag(5) while increasing pmax to substantially exceed the true L. Figure 7 depicts the coefficient matrices and its magnitudes in what we will call Scenario 5. All series in the first 4 rows have L = 2, the next 3 rows have L = 5, and the final 3 rows have and show the MSFEs of all VAR-based methods requiring a maximal lag order in Figure 8. As pmax increases, we expect the performance of HLag to remain relatively constant whereas the lasso and information-criterion based methods should return worse forecasts.

At pmax = 1 all models are misspecified. Since no method is capable of capturing the true dynamics of series 1-7 in Figure 7, all perform poorly. As expected, after ignoring pmax = 1, componentwise HLag achieves the best performance across all other choices for pmax, but is very closely followed by the own-other and elementwise HLag methods. Among the information-criterion based methods, AIC performs substantially worse than BIC as pmax increases. This is likely the result of BIC assigning a larger penalty on the number of coefficients than AIC. The lasso’s performance degrades substantially as the lag order increases, while the lag-weighted lasso and Bayesian methods are somewhat more robust to the lag order, but still achieve worse forecasts than every HLag procedure under all choices for pmax.

Figure 8: Robustness simulation scenario: Out-of-sample mean squared forecast errors, for differ-ent values of the maximal lag order pmax.

5.3 Lag Order Selection

While our primary intent in introducing the HLag framework is better point forecast performance and improved interpretability, one can also view HLag as an approach for selecting lag order. Below, we examine the performance of the proposed methods in estimating the maxlag matrix L defined in Section 3.1. Based on an estimate ˆof the autoregressive coefficients, we can likewise define a matrix of estimated lag orders:

where we define ˆIt is well known in the regularized regression literature (cf., Leng et al. 2006) that the optimal tuning parameter for prediction is different from that for support recovery. Nonetheless, in this section we will proceed with the cross-validation procedure used previously with only two minor modifications intended to ameliorate the tendency of cross-validation to select a value of that is smaller than optimal for support recovery. First, we cross-validate a relaxed version of the regularized methods in which the estimated nonzero coefficients are refit using ridge regression, as detailed in Section C.6 of the appendix. This modification makes the MSFE more sensitive to ˆbeing larger than necessary. Second, we use

Figure 9: -lag selection performance for Scenario 1 to 5. Error bars of length two standard errors are in blue; the best performing method is in black.

the “one-standard-error rule” discussed in Hastie et al. (2009), in which we select the largest value of whose MSFE is no more than one standard error above that of the best performing model (since we favor the most parsimonious model that does approximately as well as any other).

We consider Scenario 1 to 5 and estimate a A procedure’s lag order selection accuracy is measured based on the sum of absolute differences between and the maximum absolute differences between

The former can be seen as an overall measure of lag order error, the latter as a “worst-case” measure. We present the values on both measures relative to that of the sample mean (which chooses ˆ-based measure. We focus our discussion on the VAR-methods performing actual lag order selection. We first discuss these results then summarize the differences in results for the -based measure.

In Scenarios 1-3, the HLag methods geared towards the design-specific lag structure perform best, as expected. Least squares AIC/BIC always estimates a (1) and performs considerably worse than the best performing HLag method in Scenarios 1-2. In Scenario 3, they attain the best performance since around 82% of the elements in the true maxlag matrix are equal to one, and hence correctly recovered. However, the higher order dynamics of the remaining 18% of the elements are ignored, while elementwise HLag—which performs second best—better captures these dynamics. This explains why in terms of MSFE, elementwise HLag outperforms the (1) by a factor of 10.

In Scenario 4, least squares AIC consistently recovers the true universal order p = 4. Nevertheless, it has, in general, a tendency to select the highest feasible order, which happens to coincide here with the true order. Its overfitting tendency generally has more negative repercussions, as can be seen from Scenario 5, and even more importantly from its poor forecast performance. Componentwise HLag and least squares BIC perform similarly and are second best. Own-other, elementwise HLag, lasso and lag-weighted lasso perform similarly but underestimate the lag order of the component series with small non-zero values at higher order lags. While this negatively affects their lag order selection performance, it helps for forecast performance as discussed in Section 5.1.

In Scenario 5, componentwise and own-other HLag achieve the best performance. Their performance is five times better than the least squares AIC, and roughly 1.5 times better than the lasso, lag-weighted lasso and least squares BIC. Elementwise HLag substantially outperforms the lasso and least squares AIC, which consistently severely overestimates the true lag order. The least squares BIC, on the other hand, performs similarly to elementwise HLag on the lag selection criterion but selects the universal lag order at either 1 or 2 and thus does not capture the true dynamics of series 5-7 in Figure 7.

In Figure 10, we examine the impact of the maximal lag order pmax on a method’s lag order error. At the true order (pmax = 5), all methods achieve their best performance. As pmax increases, we find the methods’ performance to decrease, in line with the findings by Percival (2012). Yet, the HLag methods and lag-weighted lasso remain much more robust than the AIC and lasso, whose performance degrade considerably.

Results on the “worst-case” -measure are presented in Figure 11. Differences compared to the -measure are: (i) Least squares AIC/BIC are the best performing. This occurs since the true maximal lag orders are small, as well as the estimated lag orders by AIC/BIC due to the maximum number of parameters that least squares can take. Hence, the maximal difference between both is, overall, small. Their negative repercussions are better reflected through the overall -measure, or in case of the AIC as pmax increases (see Figure 10). (ii) Componentwise and own-other HLag are more robust with respect to the than elementwise HLag. The former two either add an additional lag for all time series or for

Figure 10: Robustness simulation scenario: Lag order error measures, for different values of the maximal lag order pmax.

Figure 11: -lag selection performance for Scenario 1 to 5. Error bars of length two standard errors are in blue; the best performing method is in black.

none, thereby encouraging low lag order solutions—and thus controlling the maximum difference with the small true orders—even more than elementwise HLag. The latter (and the lag-weighted lasso) can flexibly add an additional lag for each time series separately. Their price to pay for this flexibility becomes apparent through the -measure. (iii) A noticeable difference occurs between the methods that assume an ordering, like HLag and the lag-weighted lasso, and methods, like the lasso, that do not encourage low maximal lag orders. The lasso often picks up at least one lag close to the maximally specified order, thereby explaining its bad performance in terms of the -measure. As pmax increases, its performance deteriorates even more, see Figure 10.

Stability across time. We verified the stability in lag order selection across time with a rolling window approach. We estimate the different models for the last 40 time points (20%), each time using the most recent 160 observations. For each of these time points, the lag matrices are obtained and the lag selection accuracy measures in equation (5.1) are computed. For all methods, we find the lag order selection to be very stable across time with no changes in their relative performance.

6 Data Analysis

We demonstrate the usefulness of the proposed HLag methods for various applications. Our first and main application is macroeconomic forecasting (Section 6.1). We investigate the performance of the HLag methods on several VAR models where the number of time series is varied relative to the fixed sample size. Secondly, we use the HLag methods for forecast applications with high sampling rates (Section 6.2).

For all applications, we compare the forecast performance of the HLag methods to their competitors. We use the cross-validation approach from Section 4 for penalty parameter selection on time points : At each time point the forecast horizon), we first standardize each series to have sample mean zero and variance one using the most recent observations. We do this to account for possible time variation in the first and second moment of the data. Then, we estimate the VAR with pmax and compute the weighted Mean

Squared Forecast Error

where is the standard deviation of the to be forecast series, computed over the forecast evaluation period [] for each penalty parameter. We use a weighted MSFE to account for the different volatilities and predictabilities of the different series when computing an overall forecast error measure (Carriero et al. 2011). The selected penalty parameter is the one giving the lowest

wMSFE.

After penalty parameter selection, time points are used for out-of-sample rolling window forecast comparisons. Again, we standardize each series separately in each rolling window, estimate a VAR on the most recent observations and evaluate the overall forecast accuracy with the wMSFE of equation (6.1), averaged over all k time series and time points of the forecast evaluation period. Similar results are obtained with an expanding window forecast exercise and available from the authors upon request.

Finally, to assess the statistical significance of the results, we use the Model Confidence Set

(MCS) procedure of Hansen et al. (2011). It separates the best forecast methods with equal predictive ability from the others, who perform significantly worse. We use the MCSprocedure function in R to obtain a MCS that contains the best model with 75% confidence as done in Hansen et al. (2011).

6.1 Macroeconomic Forecasting

We apply the proposed HLag methods to a collection of US macroeconomic time series compiled by Stock & Watson (2005) and augmented by Koop (2013). The full data set, publicly available at The Journal of Applied Econometrics Data Archive, contains 168 quarterly macroeconomic indicators over 45 years: Quarter 2, 1959 to Quarter 4, 2007, hence T = 195. Following Stock & Watson (2012), we classify the series into 13 categories, listed in Table 6 of the appendix. Further details can be found in Section D of the appendix.

Following Koop (2013), we estimate four VAR models on this data set: The Small-Medium VAR (k = 10) which consists of GDP growth rate, the Federal Funds Rate, and CPI plus 7 additional variables, including monetary variables. The Medium VAR (k = 20) which contains the Small-Medium group plus 10 additional variables containing aggregated information on several aspects of the economy. The Medium-Large VAR (k = 40) which contains the Medium group plus 20 additional variables, including most of the remaining aggregate variables in the data set. The Large VAR (k = 168) which contains the Medium-Large group plus 128 additional variables, consisting primarily of the components that make up the aggregated variables. Note that the number of parameters quickly increases from 4 over 4 4

6.1.1 Forecast Comparisons

We compare the forecast performance of the HLag methods to their competitors on the four VAR models with Quarter 3, 1992 () is used for penalty parameter selection; Quarter 4, 1992 () to Quarter 4, 2007 () are used for out-of-sample rolling window forecast comparisons. We start with a

Figure 12: Rolling out-of-sample one-step-ahead wMSFE for the four VAR sizes. For each VAR size, forecast methods in the 75% Model Confidence Set (MCS) are in black.

discussion on the forecast accuracy for all series combined, then break down the results across different VAR sizes for specific variables.

Forecast performance across all series. We report the out-of-sample one-step-ahead weighted mean squared forecast errors for the four VAR groups with forecast horizon h = 1 in Figure 12. We discuss the results for each VAR group separately since the wMSFE are not directly comparable across the panels of Figure 12, as an average is taken over different component series which might be more or less difficult to predict.

With only a limited number of component series k included in the Small VAR, the univariate AR attains the lowest wMSFE, but own-other HLag, the lasso and FAVAR have equal predictive ability since they are included in the MCS. As more component series are added in the Medium and Medium-Large VAR, own-other and elementwise HLag outperform all other methods. The more flexible own-other and elementwise structures perform similarly, and better than the componentwise structure. While the MCS includes own-other HLag, elementwise HLag and the lasso for the Medium VAR, only own-other HLag survives for the Medium-Large VAR. This supports the widely held belief that in economic applications, a components’ own lags are likely more informative than other lags and that maxlag varies across components. Furthermore, the Bayesian and factor models are never included in the MCS, nor are the least squares methods, or univariate methods. For the Medium VAR, the information criteria AIC and BIC always select three lags. Since a relatively large number of parameters need to be estimated, their estimation error becomes large, and this, in turn, severely impacts their forecast accuracy.

Next, consider the Large VAR, noting that the VAR by AIC, BIC and VAR(1) are over-

parametrized and not included. As the number of component series k further increases, the componentwise HLag structure becomes less realistic. This is especially true in high-dimensional economic applications, in which a core subset of the included series is typically most important in forecasting. In Figure 12 we indeed see that the more flexible own-other and elementwise HLag perform considerably better than the componentwise HLag. The MCS confirms the strong performance of elementwise HLag.

HLag’s good performance across all series is confirmed by forecast accuracy results broken down by macroeconomic category. The flexible elementwise HLag is the best performing method; for almost all categories, it is included in the MCS, which is not the case for any other forecasting method. Detailed results can be found in Figure 18 of the appendix.

Furthermore, our findings remain stable when we increase the maximal lag order pmax. In line with Banbura et al. (2010), we re-estimated all models with pmax = 13. Detailed results are reported in Figure 19 of the appendix. For the Small-Medium VAR, own-other HLag performs comparable to the AR benchmark, while it outperforms all other methods for larger VARs. The lasso (and to a lesser extent the lag-weighted lasso) loses its competitiveness vis-a-vis the HLag approaches as soon as the maximal lag order pmax increases, in line with the results of Section 5.2.

Finally, we re-did our forecast exercise for longer forecast horizons h = 4 and h = 8. Detailed results are reported in Figure 20 of the appendix. All forecast errors increase with distant forecast horizons. Nonetheless, own-other HLag remains among the best forecast methods: it is the only method that is always included in the MCS. Its performance gets closer to the sample mean as the forecast horizon increases.

improve forecast accuracy over smaller VARs, we turn to the MSFEs of the individual component

series obtained with the multivariate forecast methods. We focus on Real Gross Domestic Product (GDP251), Consumer Price Index (CPIAUSL), and the Federal Funds Rate (FYFF) which are generally of primary interest to forecasters and policymakers. Figure 13 gives the MSFEs of these three component series in the four VAR models.

Despite the fact that the Small-Medium VAR forecasts well for some component series, like

Figure 13: Rolling out-of-sample one-step ahead mean squared forecast error of GDP251, CPIAUSL and FYFF for the different VAR sizes (bars from left to right: Small-Medium, Medium, Medium-Large, Large). For each method, the lowest MSFE is indicated in black.

CPIAUSL, we often find, similar to Koop (2013), that moving away from small VARs leads to improved forecast performance. Consider, for instance, GDP251 and FYFF where half of the forecast methods give the best MSFE in the Large VAR. Across the k = 10 component series included in all four VARs, HLag, the lasso and factor methods produce the best MSFEs mainly for the Medium-Large or Large VARs; the Bayesian methods mainly for the Small-Medium or

Medium VARs.

Furthermore, the loss in forecast accuracy when adding variables to the VAR, if it occurs, remains relatively limited for HLag methods (on average, only 5%) but is severe for Bayesian methods (on average, 46%). Although Bayesian methods perform shrinkage, all component series remain included in the larger VARs, which can severely impede forecast accuracy. HLag methods, in contrast, do not use all component series but offer the possibility to exclude possibly irrelevant or redundant variables from the forecast model.

While factor-based models produce good forecasts for larger VARs, as the factors can be estimated more precisely as the number of component series increases, the factors themselves do not carry, in many cases, economic interpretation. The HLag methods, in contrast, facilitate interpretation by providing direct insight into the component series that contribute to the good forecast performance, as discussed next.

Figure 14: The first three rows of ˆ, the estimated elementwise maxlag matrix in the MediumLarge VAR for the method. Components with zero maxlag are left empty.

6.1.2 Lag Order Selection

The HLag methods provide direct insight into the series contributing to the forecasting of each individual component. As an example, consider the estimated lag orders of the three main component series (GDP251, CPIAUSL and FYFF) from a fitted HLagmodel of the Medium-Large group in Figure 14. Elementwise HLag finds, for instance, that the Federal Funds Rate FYFF is an important predictor of Gross Domestic Product since two of its lagged components are included in the equation for forecasting GDP251.

Generally speaking, the lag selection results are considerably stable across time. Figure 21 in Section D of the appendix gives, for each end point of the rolling window, the fraction of non-zero coefficients in each of the 13 macroeconomic categories when forecasting GDP251, CPIAUSL, and FYFF. To forecast GDP growth, for instance, GDP components, employment, interest rates and stock prices have a stable and important contribution throughout the entire forecast evaluation period.

6.2 Applications with High Sampling Rates

The HLag methods can also be used for applications with high sampling rates. To illustrate this, we consider a financial and energy data set.

Figure 15: Financial application. Panel (a): Rolling out-of-sample one-step-ahead wMSFE with the forecast methods in the 75% MCS in black. Panel (b): Estimated maxlag matrix for elementwise HLag and Panel (c): for the lasso.

6.2.1 Financial Application

We apply the HLag methods to a financial data set containing realized variances for k = 16 stock market indices, listed in Table 7 of the appendix. Daily realized variances based on five minute returns are taken from Oxford-Man Institute of Quantitative Finance (publicly available on

6.2.2 Energy Application

We apply the HLag methods to an energy data set (Candanedo et al. 2017) containing information on k = 26 variables related to in-house energy usage, temperature and humidity conditions. The energy data was logged every 10 minutes for about 4.5 months, giving T = 19,735 observations in total. A list of all variables and a short description is provided in Table 8 of the appendix. Data are taken from the publicly available UCI Machine Learning Repository

hour), thus containing 6 082 parameters. May 16, 18:10 to May 17, 2016 18:00

(144 observations) are used for penalty parameter selection; May 17, 18:10 to May 27, 2016 18:00 (1440 observations) for forecast comparisons.

size is large, the least squares VAR-based methods do not suffer as much from the curse of dimensionality. Still, HLag has an advantage by not imposing a universal maximal lag order. On the whole data set (panel a), componentwise and elementwise HLag outperform all other methods apart from the lasso and lag-weighted lasso. Yet, a subsample analysis reveals the dominance of elementwise HLag. We split the data set into ten consecutive subperiods of equal length and repeated the same forecast exercise. Results are displayed in panels (b)-(k). Elementwise HLag maintains its good performance across the subperiods and performs best. It is included in the MCS for all subperiods except for the second, making it a valuable addition to a forecaster’s toolbox.

Figure 16: Energy application. Rolling out-of-sample one-step-ahead wMSFE on the data set and on ten subperiods. Forecast methods in the 75% MCS are indicated in black in each panel.

7 Discussion

By incorporating the property that more recent lags convey more information than distant lags, the HLag framework offers substantial forecast improvements as well as greater insight into lag order selection than existing methods. In addition, throughout our simulation scenarios, we see that each method is fairly robust to deviations from its particular hierarchical structure. The substantial improvements in forecasting accuracy in data applications provide justification for the widely held belief that as the number of component series included in a model increases, the maximal lag order is not symmetric across series.

To enforce the hierarchical lag structures, we use the nested group structure of Zhao et al.

(2009). Alternatively, one could leverage the latent overlapping group lasso (LOG) proposed by Jacob et al. (2009). While Yan & Bien (2017) indicate that the nested group structures might suffer from a more aggressive shrinkage of parameters deep in the hierarchy (i.e. higher-order autoregressive coefficients), in the VAR model, large amounts of shrinkage on the more distant lags versus small amounts of shrinkage on the more recent lags may be desirable (Song & Bickel 2011). In our simulation studies, the nested group lasso structures significantly outperformed the LOG structures in the large majority of cases. Especially as the maximal lag order increases, the nested group lasso turned out to be more robust. Detailed results are available in Section C.3 of the appendix.

Implementations of our methods are available in the R package BigVAR, which is hosted on the Comprehensive R Archive Network (cran). Despite the more challenging computational nature of overlapping group lasso problems compared to conventional sparsity or non-overlapping group sparsity problems (e.g., Chen et al. 2014, Yuan et al. 2011, Mairal et al. 2010), our methods scale well and are computationally feasible in high dimensions. For instance, for the Large VAR (k = 168, T = 195, and 113,064 parameters) estimated on the Stock and Watson data, the HLag methods only require (on an Intel Xean Gold 6126 CPU @ 2.60GHz machine) around 1.5 (Own-Other), 2 (Componentwise) and 3.5 minutes (Elementwise), including penalty parameter selection. This requires estimating the VAR 610 times (61 time points 10 penalty parameters). For fixed penalty parameter, the HLag methods can be computed in less than a second. The computational bottleneck of our implementation thus concerns the penalty parameter selection. Alternatives (information criteria or a time series cross-validation search where the models are not re-estimated every single time point but at a lower sampling frequency) can be considered to reduce the penalty parameter search for applications with high sampling rates. To be widely adopted by practitioners, we do think that our methods have a considerable advantage compared to more computationally intensive methods such as the lag-weighted lasso, the Bayesian CCM and GLP approaches requiring around 33 minutes (Lag-weighted lasso) or even more than 2 hours (Bayesian methods) for one model fit of the Large Stock and Watson VAR. At the very least, one of the proposed HLag approaches can be quickly run to provide numerous insights before a more computationally demanding method is adopted.

The HLag framework is quite flexible and can be extended in various ways. For example, more complicated weighting schemes (see e.g., Jenatton, Audibert & Bach 2011, Bien et al. 2016) could be adopted to address the more aggressive shrinkage of parameters deep in the hierarchy, but these make computation more involved (Yan & Bien 2017) and our simulations in Section C.3 of the appendix indicate that this may not be beneficial in the VAR setting. Furthermore, if the practitioner prefers to summarize the information content in large data sets by constructing few factors, HLag penalties can, for instance, be applied with minimal adaption to the factors augmenting the VAR in a FAVAR. The HLag framework would allow one to flexibly vary the number of factors in each marginal equation of the FAVAR and to automatically determine the lag order of the factors, in addition to the lag structure of the autoregressive components. Finally, building on Basu & Michailidis (2015), we derive preliminary theoretical results on prediction consistency for HLag in a high-dimensional regime. Given the complicated nested group structure of the HLag penalty, work is needed to further explore its theoretical properties. To this end, recent advances in the theory of the hierarchical group lasso (e.g., Yan & Bien 2017, Yu & Bien 2017) could be leveraged.

A Theoretical properties: Proofs

We start by proving two auxiliary results, then we combine these in the proof of Theorem 1. For ease of notation and without loss of generality, we omit the intercept vector from the VAR model (3.1).

Lemma 1.

is a minimizer of (3.4), we have that

Substituting the data generating process into the above, we

obtain

After re-arranging, we get

Now,

Under the assumption on

The result follows from observing that

and we choose

with probability at least 1

Proof. of Lemma 2. In the middle of page 20 of Basu and Michailidis (2013)(a preliminary

version of Basu & Michailidis 2015), it is shown that,

where

for some constant A > 0. For simplicity, we take A = 1. Note that the exponent can be written

as

Since 4, it follows that and the exponent is

where the last approximation follows from the assumption that

Thus, for the choice of given above, we have that

Proof. of Theorem 1. Combining the results from Lemma 1 and 2, we have for

or alternatively

with probability at least 1

Assuming that all coefficients are bounded by M, we have that

so

and finally,

B Comparison Methods

B.1 Least Squares VAR

A standard method in lower dimensional settings is to fit a ) with least squares for 0 and then to select a universal lag order using AIC or BIC. Per L¨utkepohl (2007), the

AIC and BIC of a ) are defined as

in which ˆis the residual sample covariance matrix having used least squares to fit the The lag order that minimizes AIC() is selected. This method of lag order selection is only possible when since otherwise least squares is not well-defined. In simulation Scenarios 1-3 (1, thus for a simple benchmark we instead estimate a (1) by least squares:

where

B.2 Lasso VAR

We also include two well-known lasso-based VAR regularization approaches. The lasso estimates the VAR using an

where . The lasso does not intrinsically consider lag order, hence Song & Bickel (2011) propose a lag-weighted lasso penalty in which a weighted -penalty is used with weights that increase geometrically with lag order:

The tuning parameter 1] determines how fast the penalty weight increases with lag. While this form of penalty applies greater regularization to higher order lags, it is less structured than our HLag penalties in that it does not necessarily produce sparsity patterns in which all coefficients beyond a certain lag order are zero. The regularization parameters are jointly selected using a two-dimensional penalty parameter search. We have implemented these methods in R, the code is available as Supplementary Material.

B.3 Bayesian VAR

We consider three Bayesian benchmarks: the method of Banbura et al. (2010), Giannone et al. (2015) and Carriero et al. (2019). These approaches are also applicable to a situation like ours where many parameters need to be estimated but the observation period is limited. However, in contrast to the HLag methods, these methods are not sparse (parameter estimates are only shrunken towards zero) and do not perform lag order selection.

Banbura et al. (2010) use a modified Minnesota prior which leads to a posterior for the autoregressive parameters, conditional on the error variance-covariance matrix, that is normal. As we transformed all variables for stationarity, we set all prior means in the BGR implementation to zeros. Following Banbura et al. (2010), we select the hyperparameter that controls the degree of regularization as that which minimizes the h-step ahead MSFE across the k component series. We have implemented this method in R, the code is available as Supplementary Material.

Giannone et al. (2015) choose the informativeness of the priors in an “optimal” way by treating the priors as additional parameters, as in hierarchical modeling. We use the authors’ replication files (Matlab-code) publicly available at https://www.newyorkfed.org/research/economists/giannone/pub.

Carriero et al. (2019) use a general Minnesota-based independent prior to allow for a more flexible lag choice. Note that the authors also allow for stochastic volatility, but we compare the HLag methods to their “homoscedastic” BVAR that does not allow for stochastic volatility, in line with the other methods considered in this paper. We adapt the authors’ code (publicly available at http://didattica.unibocconi.eu/mypage/index.php?IdUte=49257&idr=27515&lingua=eng) to this homoscedastic setting by combining it with Matlab code for BVAR using Gibbs sampling available at https://sites.google.com/site/dimitriskorobilis/matlab/code-for-vars. For full technical details on the Bayesian methods, we refer the reader to Banbura et al. (2010), Giannone et al. (2015) and Carriero et al. (2019) respectively.

B.4 Factor Models

We consider two factor-based benchmarks: a Dynamic Factor Model (DFM, see e.g. Forni et al. 2000; Stock & Watson (2002)) and a Factor Augmented VAR Model (FAVAR, Bernanke et al. 2005). In contrast to the HLag methods, these methods do not achieve dimension reduction by sparsity. Instead, the information contained in the large predictor set is summarized by few factors. We estimate the factors by Principal Component Analysis and follow McCracken & Ng (2016) in

using the criterion, developed in Bai & Ng (2002), to select the number of factors

Regarding the DFM, the time series are regressed on lagged values of the factors. The factors are obtained from the whole data set and their lag order is determined via AIC. Similar results are obtained with BIC and available from the authors upon request. Regarding the FAVAR model, we regress each time series on its own lagged values and lagged values of the factors. The factors are obtained from the data set of all other variables. Lag selection is done via AIC, while similar results are obtained with BIC. We have implemented both methods in R, the code is available as Supplementary Material.

B.5 Other Methods

Finally, we compare against three simple baselines. The unconditional sample mean corresponds

to the intercept-only model,

which makes one-step-ahead forecasts of the form ˆ. The vector random walk

model, which corresponds to

and makes one-step-ahead forecasts of the form ˆ. Finally, we consider a separate autoregressive model for each time series. To simultaneously obtain parameter estimates and select the

lag order, we use the univariate analogue of equation (3.4)

for each component series . As such, the univariate AR is a special univariate case of the multivariate elementwise HLag introduced in Section 3.2. For each individual autoregression, we take the maximal autoregressive order equal to the true VAR order p in the simulations. In the empirical application we take four as maximal autoregressive order.

C Simulation Study

C.1 Simulation Scenarios

Simulation Scenario 1: Componentwise Lag Structure. In this scenario, we simulate according to

This 45 45 maxlag matrix is row-wise constant, meaning that all components within a row have the same maxlag; we partition the rows into 5 groups of size 9, each group taking on a distinct maxlag in {1, 2, 3, 4, 5}. A coefficient matrix with maxlag matrix L is used in Scenario 1’s simulations and its magnitudes are depicted in Figure 4, panel (1) of the manuscript.

in such a manner that it differentiates between own and other coefficients. The coefficients of a

series’ “own lags” (i.e., ) are larger in magnitude than those of “other lags” (i.e., ). The magnitude of coefficients decreases as the lag order increases. The HLagwe simulate is depicted in Figure 4, panel (2) of the manuscript. The first 15 rows can be viewed as univariate autoregressive models in which only the own term is nonzero; in the next 15 rows, for the first k coefficients, the coefficient on a series’ own lags is larger than “other lags,” and, for the next k coefficients, only own coefficients are nonzero; the final 15 rows have nonzeros throughout the first 2k coefficients, with own coefficients dominating other coefficients in magnitude.

HLag(4) model, meaning that the maxlag is allowed to vary not just across rows but also within

rows. Each marginal series in each row is randomly assigned a maxlag of either 1 (with 90 percent probability) or 4 (with 10 percent probability). The coefficient matrices are depicted in Figure 4, panel (3).

out a simulation by bootstrapping the actual Medium-Large macroeconomic data set with k = 40

and T = 195 as discussed in Section 6 of the manuscript. We start from the estimates obtained by applying the Bayesian approach of Giannone et al. (2015) to this data set with pmax = 4. The obtained estimates of the autoregressive matrices are visualized in Figure 4, panel (4) and the autoregressive matrices verify the VAR stability conditions. We then construct our simulated data using a non-parametric residual bootstrap procedure (e.g., Kreiss & Lahiri 2012) with bootstrap errors an i.i.d. sequence of discrete random variables uniformly distributed on {1, . . . , T}.

C.2 Generation of Simulation Scenarios

All of our simulation structures were generated to ensure a stationary coefficient matrix, order to construct a coefficient matrix for these scenarios, we started by converting the VAR

For A to be stationary, its maximum eigenvalue must be less than 1 in modulus. In general, it is very difficult to generate stationary coefficient matrices. Boshnakov & Iqelan (2009) offer a potentially viable procedure that utilizes the unique structure of equation (C.1), but it does not allow for structured sparsity. We instead follow the approach put forth by Gilbert (2005) in which structured random coefficient matrices are generated until a stationary matrix is recovered.

Simulation Scenarios Order HLag 1. Componentwise 2. Own-Other 3. Elementwise 4. Data 5. Robustness

Table 2: Out-of-sample mean squared forecast errors of the LOG relative to that of the nested group lasso in Scenario 1 to 5. Outperformance (as confirmed with paired t-tests) by the nested group lasso is indicated in bold.

Simulation Scenarios Measure HLag 1. Componentwise 2. Own-Other 3. Elementwise 4. Data 5. Robustness

Table 3: Lag order selection of the LOG relative to that of the nested group lasso in Scenario 1 to 5. Outperformance (as confirmed with paired t-tests) by the nested group lasso is indicated in bold.

C.3 Sensitivity Analysis: Choice of Group Lasso Formulation

The hierarchical lag structures of the HLag methods can either be enforced via the nested group structure of Zhao et al. (2009) or via the latent overlapping group lasso (LOG) proposed by Jacob et al. (2009). We compare the LOG to the nested group structures in our simulation studies.

In Table 2, we present the MSFEs of the LOG structures relative to those of the nested group lasso (for each HLag method). In Table 3 their lag order selection performance is compared. Values above one indicate better performance of the nested group lasso compared to the LOG. In both Tables, the nested group lasso significantly outperforms the LOG in the vast majority of cases. Especially when the maximal lag order pmax increases, the nested group lasso structures perform better than the LOG structures.

The finding that the nested group lasso structures are more robust than the LOG structures as pmax increases, is confirmed through the Robustness simulation scenario. In Table 4, we report the MSFEs and lag order measures as pmax increases from its true order (five) to pmax = 50. On all performance measures, the nested group lasso structures perform, overall, better than the

Performance Maximal lag order measure HLag pmax = 5 pmax = 12 pmax = 25 pmax = 50

Table 4: Robustness simulation scenario: Forecast performance and lag order selection of the LOG relative to that of the nested group lasso for different values of the maximal lag order pmax. Outperformance (as confirmed with paired t-tests) by the nested group lasso is indicated in bold.

LOG structures and the margin by which the former outperforms the latter increases with pmax.

C.4 Sensitivity Analysis: Impact of Increasing the Time Series Length

We investigate the impact of increasing the time series length on our forecast accuracy results.

We use the autoregressive parameter structure of Scenario 5 and increase the time series length from from T = 200 over T = 500 to T = 1000 while keeping the maximal lag order pmax = 5. Figure 17 presents the MSFEs. The forecast errors of all methods decrease as T increases, in line with our expectations. While the difference between the methods decreases as the sample size increases, all HLag methods sill significantly outperform the lasso.

C.5 Sensitivity Analysis: Choice of Error Covariance matrix

We investigate the sensitivity of our forecast accuracy results to the choice of error covariance matrix. We start from the autoregressive parameter structure of Scenario 5 (pmax = 5) and consider, in turn, robustness to (i) varying the signal-to-noise ratio, (ii) unequal error variances and (iii) time variation in the error covariance matrix (i.e. stochastic volatility).

Figure 17: Robustness simulation scenario: Out-of-sample mean squared forecast errors, for different values of the sample size T. Note that we have not included the BVAR methods GLP and CCL as they are too time consuming for large-scale VARs.

Signal-to-noise ratio. In the paper, we consider , corresponding to a signal-to-noise ratioof around 100. To investigate the sensitivity of the results to a lower signal-to-noise ratio, we re-ran the simulation study with , corresponding to a signal-to-noise ratio around 10 and , corresponding to a signal-to-noise ratio around one.

Unequal error variances. We investigate whether the HLag methods behave comparably if one group of time series has a large residual variance and another group has a small residual variance. To this end, we consider one group (series 1 to 5) with residual variance one, and the other group (series 6 to 10) with residual variance equal to 0.5.

Stochastic volatility. As stochastic volatility is an important feature for macroeconomic forecasting (Clark & Ravazzolo 2015), we investigate the performance of all methods in the presence of parametric variation in the error covariance matrix. Note that none of the methods considered in this paper account for stochastic volatility and, hence, their forecast accuracy is expected to suffer. Nevertheless, it remains interesting to investigate their sensitivity to the presence of parametric variation in the VAR errors.

We consider the VAR-SV model of Clark & Ravazzolo (2015) which includes the conventional

Class Method SNR(in paper) Variances Volatility

Table 5: Robustness to various choices of error covariance matrix: Out-of-sample mean squared forecast error (standard errors are in parentheses).

macroeconomic formulation of a random walk process for log volatility. In particular, we take

with

with v, . . . , v

Table 5 gives the forecast performance of the methods under the various choices of error covariance matrix. When decreasing the signal-to-noise ratio, the forecast accuracy of all methods decreases accordingly, as expected. Similarly, under unequal error variances and in the presence of stochastic volatility, the forecast accuracy of all methods suffers compared to their performance in the original design (column 1). Importantly, the relative performance of the HLag methods to the other methods is, mainly, unaffected. One exception concerns the presence of stochastic volatility where even the homoscedastic BVAR of Carriero et al. (2019), which does not account for stochastic volatility, outperforms the HLag methods. Their heteroskedastic BVAR, which accounts for stochastic volatility, is expected to perform even better in such settings.

C.6 Relaxed VAR Estimation

Since the lasso and its structured counterparts are known to shrink non-zero regression coefficients, in practice, they are often used for model selection, followed by refitting the reduced model using least squares (Meinshausen 2007). In this section, we detail our approach to refit based on the support selected by our procedures while taking into consideration both numerical stability as well as computational efficiency.

Let denote the coefficient matrix recovered from one of our sparsity-imposing algorithms (e.g. HLag, Lasso-VAR) and suppose that it contains r nonzero coefficients. In order to take the support recovered into account we introduce that denotes the location of nonzero elements in ˆas the vec of the nonzero entries of , we obtain

the relationship

We can then express the Relaxed Least Squares estimator as:

in which denotes the Kronecker operator. In general, it is ill-advised to directly form equation (C.2). First, performing matrix operations with , which has dimension , can be very computationally demanding, especially if k is large. Second, in the event that , the resulting estimator can be very poorly conditioned. To obviate these two concerns, we propose a slight adaptation of the techniques detailed in Neumaier & Schneider (2001) that computes a variant of equation (C.2) using a QR decomposition to avoid explicit matrix inversion. Additionally, if the resulting matrix is found to be ill-conditioned, a small ridge penalty should be utilized to ensure numerically-stable solutions.

C.7 Refinements

As opposed to performing a Kronecker expansion we instead consider imposing the restrictions by row in restriction matrices of rank , denoting the

number of nonzero elements in each row of . We can then calculate each row of

Now, following Neumaier & Schneider (2001), construct the matrix

compute a QR factorization of

in which Q is an orthogonal matrix and R is upper triangular of the form:

As expanded upon in Neumaier & Schneider (2001), we can compute

which can be evaluated with a triangular solver, hence does not require explicit matrix inversion. In the event that K is poorly conditioned, to improve numerical stability, we add a small ridge penalty. It is suggested by Neumaier & Schneider (2001) to add a penalty corresponding to scaling a diagonal matrix D consisting of the Euclidean norms of the columns of in which denotes machine precision. The full refitting algorithm is detailed in Algorithm

3.

D Stock and Watson Application

To make the k = 168 variables of the Stock and Watson data approximately stationary, we apply the transformation codes provided by Stock & Watson (2005). A brief description of each variable, along with the transformation code to make them approximately stationary can be found in the Data Appendix of Koop (2013).

All 168 variables are classified into one of 13 macroeconomic categories, detailed in Table 6. The good performance of the HLag methods across all variables is confirmed by a sub-analysis on the 13 macroeconomic categories. Figure 18 breaks down the results of the Large VAR by the 13 macroeconomic categories. Generally speaking, the flexible elementwise HLag is the best performing forecasting method; for 10 out of 13 categories, it is included in the MCS. The second best performing methods are own-other HLag and the lag-weighted lasso (both for 6 out of 13 categories in the MCS).

Upon examination of the different categories, three groups can be distinguished. The first group consists of categories with a single preferred forecast method, always an HLag method. Elementwise HLag is preferred for interest rates and money; own-other HLag for employment series. The second group consists of categories with several, but a limited number (between 2 and 4) of preferred methods. Series in the second group are major measures of real economic activity (GDP components, industrial production, unemployment rate, prices), housing, and wages. The strong performance of elementwise and own-other HLag is re-confirmed in the majority of cases (3 out of 5 categories), but the MCS is extended by the lag-weighted lasso and DFM (2 out of

Figure 18: Rolling out-of-sample one-step ahead wMSFE of different categories of macroeconomic indicators in the Large VAR. For each category, forecast methods in the 75% MCS are in black.

Group Brief description Examples of series Number of series

Table 6: Macroeconomic categories of series in the 168-variable data set, following the classification of Stock & Watson (2012) their Table 1.

Figure 19: Rolling out-of-sample one-step-ahead wMSFE for the four VAR sizes with pmax = 13. For each VAR size, forecast methods in the 75% Model Confidence Set are in black.

5 categories), or componentwise HLag, FAVAR and random walk (1 out of 5 categories). The third group consists of categories which have a larger number of preferred forecast methods, like inventories and hard-to-predict series such as exchange rates, stock prices and consumer expectations. For the latter categories, in line with Stock & Watson (2012), we find multivariate forecast methods to provide no meaningful reductions over simple univariate methods (AR or sample mean).

Results on additional sensitivity analyses concerning the choice of the maximal lag order pmax and forecasts horizon are provided in Figures 19 and 20 respectively. Results on the stability of the lag selection results are displayed in Figure 21.

We focus on the Stock and Watson macroeconomic data set since it is readily available and

Figure 20: Rolling out-of-sample one-step-ahead wMSFE for the four VAR sizes at forecast horizon h = 4 (top) and h = 8 (bottom). For each VAR size, forecast methods in the 75% MCS are in black.

Figure 21: Fraction of non-zero coefficients in each of the 13 macro-economic categories to the total number of non-zero coefficients in the Medium-Large VAR estimated by elementwise HLag when forecasting GDP251 (GDP growth, left), CPIAUSL (inflation, middle) and FYFF (Federal Funds Rate, right). The horizontal axis represents the ending date of a rolling window.

popular in the literature on macroeconomic forecasting. A more recent variant is available as the FRED-QD data set, a quarterly version of the Federal Reserve Economic Data database introduced in McCracken & Ng (2016). We have performed the same empirical analysis on the FRED-QD containing k = 210 variables from Quarter 3, 1959 to Quarter 4, 2018 (T = 238). Similar findings are obtained: (i) Own-other and elementwise HLag perform comparable to the lasso methods and AR for small VAR sizes, but outperform all others for the Large VAR and a short forecast horizon. (ii) Own-other HLag is the preferred forecast method for several major macroeconomic indicators such as national income and product accounts and industrial production. For difficult to predict indicators, such as exchange rates, gains over the AR model are difficult to attain.

E Financial Application

The financial data set contains information on the realized variances of k = 16 stock market indices listed in Table 7. All time series are log-transformed to make them stationary.

Table 7: Variables used in the financial application.

F Energy Application

The energy data set contains information on k = 26 variables. A brief description of each variable, taken from https://archive.ics.uci.edu/ml/data sets/Appliances+energy+prediction, is provided in Table 8, along with the transformation code to make it approximately stationary. The transformation codes are: 1 = first difference of logged variables, 2 = first difference.

Appliances energy use in Wh 1 Lights energy use of light fixtures in the house in Wh 2 T1 Temperature in kitchen area, in Celsius 1 RH1 Humidity in kitchen area, in % 1 T2 Temperature in living room area, in Celsius 1 RH2 Humidity in living room area, in % 1 T3 Temperature in laundry room area 1 RH3 Humidity in laundry room area, in % 1 T4 Temperature in office room, in Celsius 1 RH4 Humidity in office room, in % 1 T5 Temperature in bathroom, in Celsius 1 RH5 Humidity in bathroom, in % 1 T6 Temperature outside the building (north side), in Celsius 2 RH6 Humidity outside the building (north side), in % 1 T7 Temperature in ironing room , in Celsius 1 RH7 Humidity in ironing room, in % 1 T8 Temperature in teenager room 2, in Celsius 1 RH8 Humidity in teenager room 2, in % 1 T9 Temperature in parents room, in Celsius 1 RH9 Humidity in parents room, in% 1 To Temperature outside (from Chievres weather station), in Celsius 2 Pressure From Chievres weather station, in mm Hg 1 RHout Humidity outside (from Chievres weather station), in % 1 Wind speed From Chievres weather station in m/s 2 Visibility From Chievres weather station, in km 1 Tdewpoint From Chievres weather station, C 2

Table 8: Variables used in the energy application.

References

Akaike, H. (1969), ‘Fitting autoregressive models for prediction’, Annals of the institute of Statistical Mathematics 21(2), 243–247.

Bach, F., Jenatton, R., Mairal, J. & Obozinski, G. (2012), ‘Structured sparsity through convex optimization’, Statistical Science 27(4), 450–468.

Bai, J. S. & Ng, S. (2002), ‘Determining the number of factors in approximate factor models’, Econometrica 70(1), 191–221.

Banbura, M., Giannone, D. & Reichlin, L. (2010), ‘Large Bayesian vector auto regressions’, Journal of Applied Econometrics 25(1), 71–92.

Basu, S., Li, X. Q. & Michailidis, G. (2019), ‘Low rank and structured modeling of high-dimensional vector autoregressions’, IEEE Transactions on Signal Processing 67(5), 1207–1222.

Basu, S. & Michailidis, G. (2015), ‘Regularized estimation in sparse high-dimensional time series models’, The Annals of Statistics 43(4), 1535–1567.

Beck, A. & Teboulle, M. (2009), ‘A fast iterative shrinkage-thresholding algorithm for linear inverse problems’, SIAM Journal on Imaging Sciences 2(1), 183–202.

Bernanke, B. S., Boivin, J. & Eliasz, P. (2005), ‘Measuring the effects of monetary policy: A factoraugmented vector autoregressive (FAVAR) approach’, Quarterly Journal of Economics 120(1), 387– 422.

Bien, J., Bunea, F. & Xiao, L. (2016), ‘Convex banding of the covariance matrix’, Journal of the American Statistical Association 111(514), 834–845.

Bien, J., Taylor, J. & Tibshirani, R. (2013), ‘A lasso for hierarchical interactions’, The Annals of Statistics 41(3), 1111–1141.

Boshnakov, G. N. & Iqelan, B. M. (2009), ‘Generation of time series models with given spectral properties’, Journal of Time Series Analysis 30(3), 349–368.

Box, G. E. P. & Tiao, G. C. (1977), ‘A Canonical Analysis of Multiple Time Series’, Biometrika 64(2), 355–365.

Candanedo, L., Feldheim, V. & Deramaix, D. (2017), ‘Data driven prediction models of energy use of appliances in a low-energy house’, Energy and buildings 140, 81–97.

Carriero, A., Clark, T. E. & Marcellino, M. (2019), ‘Large Bayesian vector autoregressions with stochastic volatility and non-conjugate priors’, Journal of Econometrics 212(1), 137–154.

Carriero, A., Kapetanios, G. & Marcellino, M. (2011), ‘Forecasting large datasets with Bayesian reduced rank multivariate models’, Journal of Applied Econometrics 26(5), 735–761.

Carriero, A., Kapetanios, G. & Marcellino, M. (2012), ‘Forecasting government bond yields with large Bayesian vector autoregressions’, Journal of Banking & Finance 36(7), 2026–2047.

Chen, C., Peng, Z. X. & Huang, J. Z. (2014), O(1) algorithms for overlapping group sparsity, 22nd International Conference on Pattern Recognition’, IEEE, pp. 1645–1650.

Clark, T. E. & Ravazzolo, F. (2015), ‘Macroeconomic forecasting performance uder alternative specifica-tions of time-varying volatility’, Journal of Applied Econometrics 30(4), 551–575.

Davis, R. A., Zang, P. F. & Zheng, T. (2016), ‘Sparse vector autoregressive modeling’, Journal of Computational and Graphical Statistics 25(4), 1077–1096.

Ding, S. & Karlsson, S. (2014), Bayesian VAR models with asymmetric lags, Technical report, Orebro University, Orebro University School of Business, Orebro University, Sweden.

Forni, M., Hallin, M., Lippi, M. & Reichlin, L. (2000), ‘The generalized dynamic-factor model: Identifi-cation and estimation’, Review of Economics and Statistics 82(4), 540–554.

Friedman, J., Hastie, T. & Tibshirani, R. (2010), ‘Regularization paths for generalized linear models via coordinate descent’, Journal of Statistical Software 33(1), 1–22.

Friedman, J., Hastie, T., Tibshirani, R., Simon, N., Narasimhan, B. & Qian, J. (2017), glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 2.0-13, https://CRAN.Rproject.org/package=glmnet.

Giannone, D., Lenza, M. & Primiceri, G. E. (2015), ‘Prior selection for vector autoregressions’, Review of Economics and Statistics 97(2), 436–451.

Giannone, D., Lenza, M. & Primiceri, G. E. (2017), ‘Economic predictions with big data: The illustion of sparsity’, Working paper .

Gilbert, P. (2005), ‘Brief user’s guide: Dynamic systems estimation (dse)’.

Gonzalo, J. & Pitarakis, J.-Y. (2002), ‘Lag length estimation in large dimensional systems’, Journal of Time Series Analysis 23(4), 401–423.

Gredenhoff, M. & Karlsson, S. (1999), ‘Lag-length selection in VAR-models using equal and unequal lag-length procedures’, Computational Statistics 14(2), 171–187.

Han, F., Lu, H. R. & Liu, H. (2015), ‘A direct estimation of high dimensional stationary vector autoregressions’, Journal of Machine Learning Research 16, 3115–3150.

Hansen, P. R., Lunde, A. & Nason, J. M. (2011), ‘The model confidence set’, Econometrica 79(2), 453– 497.

Haris, A., Witten, D. & Simon, N. (2016), ‘Convex modeling of interactions with strong heredity’, Journal of Computational and Graphical Statistics 25(4), 981–1004.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The elements of statistical learning, Springer.

Ho, M. S. & Sorensen, B. E. (1996), ‘Finding cointegration rank in high dimensional systems using the Johansen test: An illustration using data based Monte Carlo simulations’, Review of Economics and Statistics 78(4), 726–732.

Hsiao, C. (1981), ‘Autoregressive modelling and money-income causality detection’, Journal of Monetary Economics 7(1), 85–106.

Hsu, N. J., Hung, H. L. & Chang, Y. M. (2008), ‘Subset selection for vector autoregressive processes using lasso’, Computational Statistics & Data Analysis 52(7), 3645–3657.

Hurvich, C. M. & Tsai, C. L. (1989), ‘Regression and time series model selection in small samples’, Biometrika 76(2), 297–307.

Hyv¨arinen, A., Zhang, K., Shimizu, S. & Hoyer, P. O. (2010), ‘Estimation of structural vector autoregression model using non-guassianity’, Journal of Machine Learning Research 11, 1709–1731.

Jacob, L., Obozinski, G. & Vert, J. P. (2009), ‘Group lasso with overlap and graph lasso’, Proceedings of pp. 433–440.

Jenatton, R., Audibert, J. Y. & Bach, F. (2011), ‘Structured variable selection with sparsity-inducing norms’, Journal of Machine Learning Research 12, 2777–2824.

Jenatton, R., Mairal, J., Obozinski, G. & Bach, F. (2011), ‘Proximal methods for hierarchical sparse coding’, Journal of Machine Learning Research 12, 2297–2334.

Jenatton, R., Mairal, J., Obozinski, G. & Bach, F. R. (2010), Proximal methods for sparse hierarchical dictionary learning, in ‘Proceedings of the 27th international conference on machine learning (ICML-10)’, pp. 487–494.

Keating, J. W. (2000), ‘Macroeconomic modeling with asymmetric vector autoregressions’, Journal of Macroeconomics 22(1), 1–28.

Koop, G. (2013), ‘Forecasting with medium and large Bayesian VARs’, Journal of Applied Econometrics 28(2), 177–203.

Koop, G. & Korobilis, D. (2013), ‘Large time-varying parameter VARs’, Journal of Econometrics 177(2), 185–198.

Koop, G., Korobilis, D. & Pettenuzzo, D. (2019), ‘Bayesian compressed vector autoregressions’, Journal of Econometrics 210(1), 135–154.

Kreiss, J. P. & Lahiri, S. (2012), Bootstrap methods for time series, In: Rao, T., Rao, S. and Rao, C. (Eds.) Handbook of Statistics 30. Time Series Analysis: Methods and Applications. North Holland.

Leng, C. L., Lin, Y. & Wahba, G. (2006), ‘A note on the lasso and related procedures in model selection’, Statistica Sinica 16(4), 1273–1284.

Lim, M. & Hastie, T. (2015), ‘Learning interactions via hierarchical group-lasso regularization’, Journal of Computational and Graphical Statistics 24(3), 627–654.

Lin, J. H. & Michailidis, G. (2017), ‘Regularized estimation and testing for high-dimensional multi-block vector-autoregressive models’, Journal of Machine Learning Research 18, 117.

Litterman, R. B. (1979), Techniques of forecasting using vector autoregressions, Working papers, Federal Reserve Bank of Minneapolis.

Lou, Y., Bien, J., Caruana, R. & Gehrke, J. (2016), ‘Sparse partially linear additive models’, Journal of Computational and Graphical Statistics 25(4), 1026–1040.

Lozano, A. C., Abe, N., Liu, Y. & Rosset, S. (2009), ‘Grouped graphical Granger modeling for gene expression regulatory networks discovery’, Bioinformatics 25(12), i110–i118.

L¨utkepohl, H. (1985), ‘Comparison of criteria for estimating the order of a vector autoregressive process’, Journal of Time Series Analysis 6(1), 35–52.

L¨utkepohl, H. (2007), New introduction to multiple time series analysis, Springer.

Mairal, J., Jenatton, R., Bach, F. R. & Obozinski, G. (2010), Network flow algorithms for structured sparsity, in ‘Advances in Neural Information Processing Systems’, pp. 1558–1566.

Matteson, D. S. & Tsay, R. S. (2011), ‘Dynamic orthogonal components for multivariate time series’, Journal of the American Statistical Association 106(496), 1450–1463.

McCracken, M. W. & Ng, S. (2016), ‘FRED-MD: A monthly database for macroeconomic research’, Journal of Business & Economic Statistics 34(4), 574–589.

Meinshausen, N. (2007), ‘Relaxed lasso’, Computational Statistics & Data Analysis 52(1), 374–393.

Melnyk, I. & Banerjee, A. (2016), ‘Estimating strucured vector autoregressive models’, Proceedings of the 33rd International Conference on Machine Learning .

Neumaier, A. & Schneider, T. (2001), ‘Estimation of parameters and eigenmodes of multivariate autoregressive models’, ACM Transactions on Mathematical Software (TOMS) 27(1), 27–57.

Nicholson, W. B., Matteson, D. S. & Bien, J. (2017), ‘VARX-L: Structured regularization for large vector autoregressions with exogenous variables’, International Journal of Forecasting 33(3), 627–651.

Nickelsburg, G. (1985), ‘Small-sample properties of dimensionality statistics for fitting VAR models to aggregate economic data: A Monte Carlo study’, Journal of Econometrics 28(2), 183–192.

Percival, D. (2012), ‘Theoretical properties of the overlapping groups lasso’, Electronic Journal of Statistics 6, 269–288.

Radchenko, P. & James, G. M. (2010), ‘Variable selection using adaptive nonlinear interaction structures in high dimensions’, Journal of the American Statistical Association 105(492), 1541–1553.

She, Y. Y., Wang, Z. F. & Jiang, H. (2018), ‘Group regularized estimation under structural hierarchy’, Journal of the American Statistical Association 113(521), 445–454.

Shojaie, A., Basu, S. & Michailidis, G. (2012), ‘Adaptive thresholding for reconstructing regulatory networks from time-course gene expression data’, Statistics in Biosciences 4(1), 66–83.

Shojaie, A. & Michailidis, G. (2010), ‘Discovering graphical Granger causality using the truncating lasso penalty’, Bioinformatics 26(18), i517–i523.

Sims, C. A. (1980), ‘Macroeconomics and reality’, Econometrica 48(1), 1–48.

Song, S. & Bickel, P. (2011), ‘Large vector auto regressions’, arXiv preprint arXiv:1106.3915 .

Stock, J. H. & Watson, M. W. (2002), ‘Forecasting using principal components from a large number of predictors’, Journal of the American Statistical Association 97(460), 1167–1179.

Stock, J. H. & Watson, M. W. (2005), ‘An empirical comparison of methods for forecasting using many predictors’, Manuscript, Princeton University .

Stock, J. H. & Watson, M. W. (2012), ‘Generalized shrinkage methods for forecasting using many predictors’, Journal of Business & Economic Statistics 30(4), 481–493.

Tiao, G. C. & Tsay, R. S. (1989), ‘Model specification in multivariate time series’, Journal of the Royal Statistical Society. Series B (Methodological) 51(2), 157–213.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288.

Tibshirani, R. & Suo, X. T. (2016), ‘An ordered lasso and sparse time-lagged regression’, Technometrics 58(4), 415–423.

Tsay, R. S. (2013), Multivariate Time Series Analysis: With R and Financial Applications, John Wiley & Sons.

Tseng, P. (2008), ‘On accelerated proximal gradient methods for convex-concave optimization’, submitted

to SIAM Journal on Optimization .

White, H. (2001), Asymptotic theory for econometricians, Academic press New York.

Yan, X. H. & Bien, J. (2017), ‘Hierarchical sparse modeling: A choice of two group lasso formulations’, Statistical Science 32(4), 531–560.

Yu, G. & Bien, J. (2017), ‘Learning local dependence in ordered data’, Journal of Machine Learning Research 18, 1–60.

Yuan, L., Liu, J. & Ye, J. (2011), Efficient methods for overlapping group lasso, in ‘Advances in neural information processing systems’, pp. 352–360.

Yuan, M. & Lin, Y. (2006), ‘Model selection and estimation in regression with grouped variables’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67.

Zhao, P., Rocha, G. & Yu, B. (2009), ‘The composite absolute penalties family for grouped and hierarchical variable selection’, The Annals of Statistics 37(6A), 3468–3497.

Designed for Accessibility and to further Open Science