Oracle Efficient Estimation of Structural Breaks in Cointegrating Regressions

In this paper, we propose an adaptive group lasso procedure to efficiently estimate structural breaks in cointegrating regressions. It is well-known that the group lasso estimator is not simultaneously estimation consistent and model selection consistent in structural break settings. Hence, we use a first step group lasso estimation of a diverging number of breakpoint candidates to produce weights for a second adaptive group lasso estimation. We prove that parameter changes are estimated consistently by group lasso and show that the number of estimated breaks is greater than the true number but still sufficiently close to it. Then, we use these results and prove that the adaptive group lasso has oracle properties if weights are obtained from our first step estimation. Simulation results show that the proposed estimator delivers the expected results. An economic application to the long-run US money demand function demonstrates the practical importance of this methodology.

Keywords: Adaptive Group Lasso; Change-points; Cointegration; Model Selection; US Money Demand
JEL Classification: C22, C52
MSC Classification: 62E20, 62M10, 91B84

In this paper, we consider modelling cointegration relationships where the long-run equilibrium may differ for subsamples, thereby allowing for (multiple) structural breaks in the cointegrating regression. We assume that cointegration holds over some (fairly long) period of time, but then shifts to a new ‘long-run’ relationship. The number of breaks and their location are unknown to the researcher. Although coefficients of long-run equilibrium equations are relatively persistent by definition, accounting for the possibility of structural breaks is crucial in cointegration analysis, which usually involves long sample periods. On the one hand, long time series are needed to study the long-run behaviour of economic systems; on the other hand, employing long time series increases the likelihood of encountering structural change during the sample period. It is widely known that structural breaks, when present, can mask cointegrating relationships and render cointegration tests uninformative (Campos et al., 1996; Gregory et al., 1996; Qu, 2007). Hence, we propose a two-step approach to detect (multiple) structural breaks in cointegrating regressions using penalized regression techniques.

Since time series used for economic analyses have become very long in some instances, detecting (multiple) structural breaks has emerged as an important problem in the econometrics literature. For comprehensive surveys on structural breaks in time series models (‘change-point’ detection in the statistics literature or ‘pattern recognition’ in the context of signal processing), see, for example, Perron (2006), Aue and Horváth (2013) and Niu et al. (2015). While classical structural break models for linear regressions attempt to detect one unknown break via a grid search procedure (Andrews, 1993), it is not feasible to use grid searches for the detection of multiple breaks because the computational cost increases exponentially with the presumed number of breaks (needing least squares operations of order $O(T^m)$ for $m$ breaks). Addressing this issue, Bai and Perron (1998, 2003) use dynamic programming techniques (henceforth Bai-Perron algorithm), requiring at most least-squares operations of order $O(T^2)$ for any number of breaks, to add breaks sequentially to the model. Recently, several approaches have been proposed that reframe the task of detecting and estimating structural breaks as a model selection problem employing penalized regressions and related model selection techniques (Davis et al., 2006; Harchaoui and Lévy-Leduc, 2010; Jin et al., 2013; Chan et al., 2014; Ciuperca, 2014; Jin et al., 2016; Qian and Jia, 2016; Qian and Su, 2016; Behrendt and Schweikert, 2020). In contrast to grid search procedures, which augment linear regression models with parameter changes, model selection procedures take a top-down approach and try to shrink the set of all possible breakpoint candidates to contain only the true breakpoints. These approaches benefit from high computational efficiency and detect structural breaks with high accuracy.

The theory for (multiple) structural breaks in cointegrating regressions is not nearly as developed as the theory for change-points in the statistics and signal processing literature. Most studies are concerned with cointegration testing in the presence of structural instability. One of the most popular cointegration tests with an unknown breakpoint is the one proposed by Gregory and Hansen (1996a,b) in which the location of the break can be estimated via grid search at the minimum of the individual cointegration test statistics. Hatemi-J (2008) extends the test to account for two breaks and Schweikert (2020) allows for the possibility of nonlinear adjustment to the long-run equilibrium. Maki (2012) employs a hybrid procedure, detecting $m-1$ breaks by minimizing the sum of squared residuals among all possible sample splits and finally determining the last break by minimizing a cointegration test statistic. Unfortunately, these tests are non-informative about the location of breaks. They optimize the model specification to provide evidence against the null hypothesis of no cointegration, thereby not necessarily finding those breakpoints which optimize the model fit. Maki (2012) determines the first $m-1$ breaks based on improving the model fit but does not do so for the last breakpoint. Hence, the set of estimated breakpoints is not completely informative. Other studies with a strong focus on cointegration testing which conduct breakpoint estimation as a by-product are Carrion-i-Silvestre and Sanso (2006) and Arai and Kurozumi (2007). They propose a CUSUM-based approach to test the null hypothesis of cointegration with a structural break against the alternative hypothesis of no cointegration. Qu (2007) considers a cointegrated system allowing the cointegrating rank to change during different subsamples so that it is possible to detect cointegrating relationships that exist only in some subsamples. Westerlund and Edgerton (2006) design LM-based test statistics invariant to structural breaks to test the null of no cointegration and Davidson and Monticini (2010) use subsample procedures to account for structural breaks in their cointegration tests.

In contrast, few studies are primarily focussed on modelling structural change in cointegrated systems. Kejriwal and Perron (2008, 2010) propose to estimate the number and location of structural breaks in cointegrating equations by applying the Bai-Perron algorithm. Inference on breakpoints is studied in, among others, Bai et al. (1998), Qu and Perron (2007), Kejriwal and Perron (2008, 2010), Li and Perron (2017) and Oka and Perron (2018). Using penalized regression approaches to account for structural breaks in cointegrating regression has not been explored yet in great detail. A similar idea has been proposed by Schmidt and Schweikert (2019) but their procedure is limited to bivariate cointegrating regressions using a modified adaptive lasso estimator. Here, we extend their methodology to cointegrating regressions with multiple regressors and provide a rigorous proof that the adaptive group lasso estimator is oracle efficient in settings with an unknown number of breaks and a diverging number of breakpoint candidates.

The proposed estimation method in this paper consists of two main steps: in the first step, we apply the group lasso estimator to a cointegration model with a diverging number of breakpoint candidates. We allow that breaks can occur at any point in time except for some lateral trimming which is mostly needed to identify the baseline coefficients in the first regime. We prove that the group lasso estimator consistently estimates parameter changes. However, it is well-known that lasso estimators are not simultaneously parameter estimation consistent and model selection consistent in situations where the restricted eigenvalue condition or related conditions such as the strong irrepresentable condition do not hold (Chan et al., 2014). Under these conditions, we show that the number of selected breaks is greater than the true number of breaks almost surely, but their estimated location is sufficiently close to their true location. In the second step, we use the first step group lasso estimates as weights for the adaptive group lasso. We provide a rigorous proof that the adaptive group lasso has the oracle properties if the first step algorithm assumes a maximum number of breaks and the distance between breaks depends on the sample size ensuring that the breakpoint candidates for the second step estimation are sufficiently distinct. The number of breaks is then estimated as the number of non-zero groups obtained after adaptive group lasso optimization.

The paper is organized as follows. Section 2 describes the proposed adaptive group lasso procedure to estimate structural breaks in cointegrating regressions. Section 3 is devoted to the Monte Carlo simulation study. Section 4 reports the results of an empirical application of our methodology to the US money demand function, and Section 5 concludes. Proofs of all theorems in the paper are provided in the Mathematical Appendix.

In the following, we specify a cointegrated system with multiple structural breaks at which it attains new equilibrium states. The cointegrated system does not deviate persistently from each equilibrium until the next break occurs and a new equilibrium is maintained.

2.1 Framework


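$$y_t = \mu + \beta_j' X_t + u_t, \qquad t_{j-1} \le t < t_j, \quad j = 1, \ldots, m+1, \tag{1}$$

where $t_j$, $j \in \{0, 1, \ldots, m+1\}$, denote the breakpoints with $1 = t_0 < t_1 < \cdots < t_{m+1} = T+1$, $\mu$ is the intercept, $\beta_j' = (\beta_{j1}, \beta_{j2}, \ldots, \beta_{jN})$ are regime-dependent coefficients and $\{X_t\}_{t=1}^{\infty}$, where $X_t = (X_{1t}, X_{2t}, \ldots, X_{Nt})'$, follows an $N$-vector integrated process$^1$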


$$X_t = X_{t-1} + v_t, \tag{2}$$

where $X_0 = 0$. $\{u_t\}_{t=1}^{\infty}$ and $\{v_t\}_{t=1}^{\infty}$ are mean-zero weakly stationary error processes. For expositional simplicity, we restrict our analysis to cointegrating regressions with a constant intercept across regimes.$^2$ We make the following assumptions about the vector process $w_t = (u_t, v_t')'$:

Assumption 1. The vector process $\{w_t\}_{t=1}^{\infty}$ satisfies the following conditions:


(iii) $\{w_t\}_{t=1}^{\infty}$ is strong mixing with mixing coefficients of size $-p\beta/(p-\beta)$ and $E|w_t|^p < \infty$ for some $p > \beta > 5/2$.

Further, we assume that the long-run covariance matrix $\Omega_v = \sum_{j=-\infty}^{\infty} E v_t v_{t-j}'$ is positive definite. In addition, we require that


While the first three conditions of Assumption 1 are standard in cointegration analysis, assuming that $\Omega_v$ is positive definite implies that $X_t$ is non-cointegrated. We denote the number of structural breaks by $m$. While the number of true structural breaks $m_0$ is unknown, we assume that the maximum number of structural breaks $m^*$ is known to the researcher. The estimated number of breakpoints is denoted by $\hat{m}$. The locations of breakpoints relative to sample size, so-called break fractions, are denoted by $\tau_j = t_j/T$, $j \in \{0, 1, \ldots, m+1\}$.
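To make the framework concrete, the following minimal sketch simulates a system of this kind: random-walk regressors and a response whose slope vector switches at the breakpoints. It assumes that regime $j$ covers the dates $t_{j-1} \le t < t_j$, that breakpoints are obtained by rounding $\tau_j T$, and that the errors are i.i.d. Gaussian, which is a simplification of Assumption 1. The function name `simulate_cointegrated_system` and all default values are hypothetical.

```python
import random


def simulate_cointegrated_system(T, break_fractions, betas, mu=1.0,
                                 sigma_u=1.0, sigma_v=1.0, seed=0):
    """Simulate y_t = mu + beta_j' X_t + u_t with random-walk regressors.

    break_fractions: tau_1 < ... < tau_m in (0, 1).
    betas: m + 1 regime coefficient vectors, each of length N.
    Assumption (illustrative): i.i.d. Gaussian u_t and v_t, X_0 = 0.
    """
    rng = random.Random(seed)
    N = len(betas[0])
    # Break dates on the 1-based time axis, t_j = round(tau_j * T).
    breakpoints = [int(round(tau * T)) for tau in break_fractions]
    x_prev = [0.0] * N
    X, y = [], []
    regime = 0
    for t in range(1, T + 1):
        # A new regime starts at the break date itself (t >= t_j).
        while regime < len(breakpoints) and t >= breakpoints[regime]:
            regime += 1
        x_t = [x_prev[n] + rng.gauss(0.0, sigma_v) for n in range(N)]
        u_t = rng.gauss(0.0, sigma_u)
        y.append(mu + sum(b * x for b, x in zip(betas[regime], x_t)) + u_t)
        X.append(x_t)
        x_prev = x_t
    return y, X, breakpoints


y, X, breakpoints = simulate_cointegrated_system(
    T=200, break_fractions=[0.5], betas=[[2.0, 2.0], [4.0, 4.0]]
)
```

Within each regime the residual $y_t - \mu - \beta_j' X_t$ is stationary by construction, so the system is cointegrated regime by regime.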

Throughout this paper, we use the following notation to present our main results: let $y_T = (y_1, y_2, \ldots, y_T)'$ denote the vector containing $T$ observations of our response variable and let $u_T = (u_1, u_2, \ldots, u_T)'$ denote the error term vector. The vector of $T$ observations for the $N$-dimensional variable $X_t$ is denoted by $X = (X_1, \ldots, X_T)'$. Our design matrix $Z_T$ is a $T \times TN$ matrix defined by


and we define the Gram matrix $\Sigma = Z_T' Z_T / T^2$. Adjacent columns of $Z_T$ differ only by one entry, which means that the columns are almost identical for $T \to \infty$. Consequently, $\Sigma$ does not converge to a positive definite asymptotic counterpart. It follows that the restricted eigenvalue condition (Bickel et al., 2009) does not hold and we cannot establish our consistency proofs based on this assumption. See Chan et al. (2014) for a thorough discussion of this issue.
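The near-collinearity of adjacent columns can be seen in a small example. The sketch below assumes that $Z_T$ has a lower block-triangular layout in which row $t$ repeats $X_t'$ in the first $t$ column blocks and is zero afterwards. This layout is an assumption chosen to match the description above; the helper `build_design_matrix` is hypothetical.

```python
def build_design_matrix(X):
    """Return Z as a list of T rows with T * N entries each.

    Assumed layout: row t carries X_t in column blocks 1..t and zeros in
    the remaining blocks, so block i is switched on from date i onwards.
    """
    T = len(X)
    N = len(X[0])
    Z = []
    for t in range(T):
        row = []
        for i in range(T):
            row.extend(list(X[t]) if i <= t else [0.0] * N)
        Z.append(row)
    return Z


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Z = build_design_matrix(X)

# Columns of the first regressor in blocks 1 and 2 differ in a single row.
n_differences = sum(1 for t in range(len(X)) if Z[t][0] != Z[t][2])
```

With this layout, the columns belonging to the same regressor in neighbouring blocks differ only in the row where the later block is switched on, which is why the Gram matrix becomes ill-conditioned as $T$ grows.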

We set $\theta_1 = \beta_1$ and


for $i = 2, \ldots, T$. For the remainder of this article, $\theta_i = 0$ means that $\theta_i$ has all entries equaling zero and $\theta_i \neq 0$ means that $\theta_i$ has at least one non-zero entry. The coefficient vector $\theta(T) = (\theta_1, \theta_2, \ldots, \theta_T)'$ is of length $TN$ and contains all time-specific parameter changes. Because we treat structural breaks as rare events and assume that parameter changes persist for some time, the number of non-zero elements in $\theta(T)$ is assumed to be small, i.e. smaller than $m^* + 1$ groups of size $N$.
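As an illustration of this sparse parameterization, the sketch below builds the groups of $\theta(T)$ from regime coefficients, assuming that the group at a break date holds the difference between the new and the old regime coefficient and that all other groups beyond the first are zero. The helper names `parameter_changes` and `active_coefficient` are hypothetical.

```python
def parameter_changes(T, breakpoints, betas):
    """Build the T groups of theta from m + 1 regime coefficient vectors.

    breakpoints: sorted 1-based break dates t_1 < ... < t_m.
    Assumption (illustrative): theta_1 = beta_1, theta_{t_j} holds the
    change beta_{j+1} - beta_j, and every other group is zero.
    """
    N = len(betas[0])
    theta = [[0.0] * N for _ in range(T)]
    theta[0] = list(betas[0])
    for j, t_j in enumerate(breakpoints):
        theta[t_j - 1] = [new - old for new, old in zip(betas[j + 1], betas[j])]
    return theta


def active_coefficient(theta, t):
    """Cumulate the changes up to date t (1-based) to recover the regime coefficient."""
    N = len(theta[0])
    return [sum(theta[i][n] for i in range(t)) for n in range(N)]


theta = parameter_changes(T=6, breakpoints=[4], betas=[[2.0, 2.0], [4.0, 4.0]])
nonzero_groups = [i + 1 for i, g in enumerate(theta) if any(v != 0.0 for v in g)]
```

In this example only two of the six groups are non-zero (the baseline and one change), which is the sparsity that the penalized regression exploits.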

We denote the true value of a parameter with a 0 superscript. $\{\tau_j^0, j = 1, \ldots, m_0\}$ denotes the set of true break fractions and $\beta_j^0$, $j = 1, \ldots, m_0 + 1$, defines the true coefficient of the $j$-th regime. For technical reasons, we additionally set $\beta_0^0 = 0$. We define the index sets $\bar{A} = \{1 \le i \le T : \theta_i^0 \neq 0\}$, denoting the indices of truly non-zero coefficients (including the baseline coefficient), and $A = \{i \ge 2 : \theta_i^0 \neq 0\}$, denoting the non-zero parameter changes. The index set obtained from our first step estimation belonging to all estimated non-zero parameter changes is denoted by $A_T = \{i \ge 2 : \tilde{\theta}_i \neq 0\}$. We note that the first regime’s coefficient (before the first breakpoint) is not allowed to be zero.$^3$ Since we indicate breakpoints with non-zero coefficients in our penalized regression approach, the set $A = \{t_1^0, t_2^0, \ldots, t_{m_0}^0\}$ is also used to denote true breakpoints. Similarly, the set $A_T = \{\hat{t}_1, \hat{t}_2, \ldots, \hat{t}_{\hat{m}}\}$ denotes estimated breakpoints, i.e., indices of those coefficients which are estimated to be non-zero. $|A|$ denotes the cardinality of the set $A$ and $A^c$ denotes the complementary set. We use those sets to index rows and columns of vectors and matrices. For example, let $Z_{T,A}$, $Z_{T,A^c}$ contain the columns of $Z_T$ and $\theta_A(T)$, $\theta_{A^c}(T)$ contain the rows of $\theta(T)$ associated with active and inactive breakpoints, respectively.

For notational convenience, we use ‘$\Rightarrow$’ to signify weak convergence of the associated probability measures and $\xrightarrow{p}$ to denote convergence in probability. Continuous stochastic processes such as a Brownian motion $B(s)$ on $[0,1]$ are simply written as $B$ if no confusion is caused. We also write integrals with respect to the Lebesgue measure such as $\int_0^1 B(s)\,ds$ simply as $\int_0^1 B$. Throughout the paper, several (distinct) large constants are all denoted with $C$, while small constants are denoted by $\epsilon$.

Using these definitions, our cointegration model described in Equation (1) can be expressed as a high-dimensional regression model in matrix form


Since only $m_0 + 1$ groups within $\theta(T)$ are truly non-zero, we need to obtain a sparse solution to the high-dimensional regression problem in Equation (5). This means we frame the detection of structural breaks as a model selection problem and use available methods from this strand of the literature. To reduce the dimensionality of the estimation problem, we assume that breaks occur for all coefficients simultaneously. This allows us to treat all regressors at each point in time as one group. We can therefore apply the group lasso estimator proposed by Yuan and Lin (2006) to achieve a sparse solution. As our first step, we minimize the objective function,


to obtain the group lasso estimator for $\theta(T)$, which is henceforth denoted by $\tilde{\theta}(T) = \arg\min_{\theta(T)} Q^*$. $\lambda_T$ is the tuning parameter and $\|\cdot\|$ denotes the $L_2$-norm. Unfortunately, the group lasso estimator inherits the same problems, namely estimation inefficiency and model selection inconsistency, as the plain lasso estimator. Similar to the idea first presented in Zou (2006), we re-estimate the model using an objective function with individual coefficient weights to alleviate this problem and to reduce the number of falsely detected breaks. The statistical properties of adaptive group lasso estimators for a fixed number of groups are investigated in Wang and Leng (2008). Since we have a diverging set of breakpoint candidates, least squares estimation of the full model is not feasible. However, we show that group lasso is a consistent estimator for non-zero parameter changes, giving us appropriate weights for a second step adaptive group lasso estimation. This approach is similar to the ideas put forth in Wei and Huang (2010), Horowitz and Huang (2013), Schmidt and Schweikert (2019), and Behrendt and Schweikert (2020).
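A group lasso criterion of this type can be sketched in a few lines. The code assumes the objective has the form $\frac{1}{T}\|y_T - Z_T\theta(T)\|^2 + \lambda_T \sum_i \|\theta_i\|$; the $1/T$ scaling of the loss is an assumption and may differ from the normalization used in Equation (6). It also shows block soft-thresholding, the proximal map of the group penalty on which many group lasso solvers rely; this is not necessarily the algorithm used for the results reported here. All function names are hypothetical.

```python
import math


def group_norm(group):
    """Euclidean norm of one coefficient group."""
    return math.sqrt(sum(x * x for x in group))


def group_lasso_objective(y, Z, theta, lam):
    """Scaled squared loss plus lam times the sum of group norms.

    theta is a list of groups; Z has one row per observation and its
    columns are ordered group by group. The 1/T scaling is an assumption.
    """
    T = len(y)
    flat = [x for group in theta for x in group]
    ssr = 0.0
    for t in range(T):
        fitted = sum(z * b for z, b in zip(Z[t], flat))
        ssr += (y[t] - fitted) ** 2
    penalty = lam * sum(group_norm(g) for g in theta)
    return ssr / T + penalty


def block_soft_threshold(group, tau):
    """Shrink a whole group towards zero; set it to zero if its norm is at most tau."""
    norm = group_norm(group)
    if norm <= tau:
        return [0.0] * len(group)
    scale = 1.0 - tau / norm
    return [scale * x for x in group]


shrunk = block_soft_threshold([3.0, 4.0], 1.0)   # norm 5 is reduced to norm 4
removed = block_soft_threshold([0.3, 0.4], 1.0)  # norm 0.5 falls below the threshold
```

Because the penalty acts on the norm of each group, a candidate date is either kept with all $N$ coefficient changes or removed entirely, which matches the assumption that breaks occur in all coefficients simultaneously.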

As will be demonstrated later, the group lasso estimator only slightly overselects breaks under the right tuning. The algorithm employed to estimate $\tilde{\theta}(T)$ allows us to prespecify the maximum number of breakpoint candidates $M$, i.e. the maximum number of non-zero groups in $\tilde{\theta}(T)$, and the minimum distance between breaks. Since the group lasso overselects breaks in the first step, $M$ should be set large enough to encompass all true breakpoints and some additional falsely selected non-zero groups. This condition guarantees that $\tilde{\theta}(T)$ always contains $MN$ elements. In turn, $TN - MN$ columns of $Z_T$ corresponding to zero coefficients are eliminated during the first step to result in the $T \times MN$ design matrix $Z_S$. Hence, for given $M \ll T$, the column size of the new design matrix is substantially smaller than the original size $TN$ and no longer depends on the sample size. This allows us to further assume that all eigenvalues of $\Sigma_S = Z_S' Z_S / T^2$ are contained in the interval $[c_*, c^*]$, where $c_*$ and $c^*$ are two positive constants. This means that we can relate to a restricted eigenvalue condition similar to Bickel et al. (2009) for the second step estimation. While the restricted eigenvalue condition in general does not hold for change-point settings, the dimension reduction of the first step allows us to postulate this assumption for our reduced design matrix. It should be noted that our assumption for the second step estimation is not restrictive for empirical applications because the notion of a long-run equilibrium relationship implies a maximum number of breaks and a minimum regime length. A minimum regime length is further justified by the minimum subsample size needed to precisely estimate parameter changes. Consequently, $M$ should be chosen so that the average regime length in case of equidistantly-spaced breaks still guarantees enough observations per regime to estimate all coefficient changes.

We follow Wang and Leng (2008) and define the adaptive group lasso objective function


where $\gamma > 0$ and $w_i$ are the group-specific weights assigned as follows


and set $0 \times \infty = 0$. $\tilde{\theta}_{S,i}$, $i = 1, \ldots, |A_T| + 1$, denotes the non-zero group lasso coefficient estimates obtained from optimizing the objective function in Equation (6). The remaining $M - |A_T| - 1$ group elements of $\tilde{\theta}_S$ can be filled with zero groups as long as their selected indices lead to $\Sigma_S$ being a positive definite matrix for all $T$.
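The weighting scheme can be sketched as follows, assuming the usual adaptive group lasso form in which a group with non-zero first step estimate receives the weight $\|\tilde{\theta}_{S,i}\|^{-\gamma}$ and a group estimated as zero receives an infinite weight. This form is an assumption in the spirit of Wang and Leng (2008), and the function names are hypothetical.

```python
import math


def adaptive_weights(theta_tilde, gamma=1.0):
    """Weights from first step estimates: norm ** (-gamma), or infinity for zero groups."""
    weights = []
    for group in theta_tilde:
        norm = math.sqrt(sum(x * x for x in group))
        weights.append(norm ** (-gamma) if norm > 0.0 else math.inf)
    return weights


def weighted_penalty(theta, weights, lam):
    """lam * sum_i w_i * ||theta_i||, using the convention 0 * infinity = 0."""
    total = 0.0
    for group, w in zip(theta, weights):
        norm = math.sqrt(sum(x * x for x in group))
        if norm == 0.0:
            continue  # a zero group contributes nothing, even if its weight is infinite
        total += w * norm
    return lam * total


first_step = [[3.0, 4.0], [0.0, 0.0]]
weights = adaptive_weights(first_step, gamma=1.0)
penalty = weighted_penalty(first_step, weights, lam=2.0)
```

A group with an infinite weight can only enter the second step solution as a zero group, so candidates discarded in the first step stay discarded.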

We denote the estimator minimizing $Q(\theta_S)$ with $\hat{\theta}_S = \arg\min_{\theta_S} Q$. The weight of the first coefficient is usually set to zero to ensure that the system is cointegrated with a cointegrating vector different from $(1, 0')'$ if no structural break occurs. Eliminating columns from the initial design matrix requires a mapping of our second step indices to recover the original indices. For notational convenience, we use the mapping $g : \mathbb{N} \to \mathbb{N}$, $i \mapsto g(i) = t_i$, where $t_i$ is the breakpoint corresponding to the index $i$, for this purpose and define the index set $\bar{A}^*$ ($A^*$) to pick out the elements that correspond to truly non-zero coefficients (parameter changes).

We note that the major computing cost comes from the first step group lasso estimation considering a large number of observations as potential breakpoints. The second step represents a marginal addition to the total computing time if the first step estimation was sufficiently successful in eliminating inactive breakpoint candidates. The interested reader may consult Chan et al. (2014) for a detailed discussion of computational complexity in this context.$^4$

2.2 Asymptotic properties

In the following, we study the asymptotic properties of our adaptive group lasso estimator. To discuss asymptotic properties, we need to impose some further assumptions about the location and magnitude of active breakpoints.

Assumption 2. (i) $I_{\min} = \min_{1 \le j \le m_0+1} |t_j^0 - t_{j-1}^0| > \zeta T$ for some $\zeta > 0$, where $I_{\min}$ is the minimum break interval.

(ii) The break magnitudes are bounded to satisfy $m_\beta = \min_{1 \le j \le m_0+1} \|\beta_j^0 - \beta_{j-1}^0\| > \nu$ for


Assumption 2(i) requires that the length of the regimes between breaks increases with the sample size and in the same proportions to each other. This allows us to consistently detect and estimate the true break fractions as it makes the break dates asymptotically distinct (Perron, 2006). The first inequality of Assumption 2(ii) is a necessary condition to ensure that a structural break occurs at $t_j^0$. We do not consider small breaks with local-to-zero behaviour in this setting (see Bai et al. (1998) for assumptions used in this context). This assumption is not believed to be restrictive for the intended empirical applications where applied researchers aim to estimate the long-run equilibrium to obtain the error correction term, i.e., the cointegration residuals, for their follow-up analysis. Essentially, they need optimal in-sample forecasts in terms of mean squared error of the cointegrating regression under structural instability to consistently estimate these residuals. Boot and Pick (2019) show that in-sample forecasts are largely unaffected by local-to-zero breaks. The second part excludes the possibility of infinitely large parameter changes. Assumption 2(iii) implies that the number of active breaks is less than the number of observations and the smallest eigenvalue of $\Sigma_{\bar{A}}$ is greater than or equal to $C$ by letting $m_j = 0$ for $j \in \bar{A}^c$. Consequently, Assumption 2(iii) ensures that $\Sigma_{\bar{A}}$ is positive definite for all $T$. This is the case only if $Z_{T,\bar{A}}$ contains columns which are sufficiently distinct. This in turn means that the intervals between breaks need to be sufficiently large for all $T$. It is important to note that Assumption 2(i) can be deduced as an implication of Assumption 2(iii) and we need Assumption 2(iii) exclusively for the first step estimation. Our second step estimation requires only Assumption 2(i) and (ii) as long as consistent weights are available.

First, we need to show that the initial estimator provides consistent weights for the second step adaptive lasso procedure (Huang et al., 2008). The following theorem provides a consistency result for the group lasso estimator in cointegrating regressions with (possibly) multiple structural breaks.

Theorem 1. Under Assumption 1 and Assumption 2, if $\lambda_T = 2 N c_0 T^{\delta}$ for some $c_0 > 0$ and $3/4 < \delta < 1$, then there exists some $C > 0$ such that with probability greater than $1 - C/(c_0^2 T^{2\delta - 1})$,


Remark 1. The specification of $\lambda_T$ implies that $\lambda_T \to \infty$ for $T \to \infty$. This means we have to apply a stricter penalty for increasing sample sizes to discard a larger set of inactive candidate breaks while searching for a fixed number of $m_0$ active breaks. On the other hand, $\lambda_T$ fulfils the condition $\lambda_T / T \to 0$, so that the tuning parameter cannot grow so fast that active breaks are ignored. Since the convergence rate of the group lasso coefficients depends inversely on $\delta$, it is useful to employ a selection rule for $\lambda_T$ where $\delta$ is small.
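The two rate conditions in Remark 1 are easy to check numerically for the rule $\lambda_T = 2 N c_0 T^{\delta}$ from Theorem 1. In the sketch below the values $N = 2$, $c_0 = 1$ and $\delta = 0.8$ are arbitrary illustrative choices, and the helper `lambda_T` is hypothetical.

```python
def lambda_T(T, N, c0, delta):
    """Tuning parameter lambda_T = 2 * N * c0 * T ** delta with 3/4 < delta < 1."""
    if not 0.75 < delta < 1.0:
        raise ValueError("delta must lie strictly between 3/4 and 1")
    if c0 <= 0:
        raise ValueError("c0 must be positive")
    return 2.0 * N * c0 * T ** delta


sample_sizes = [100, 1000, 10000]
lambdas = [lambda_T(T, N=2, c0=1.0, delta=0.8) for T in sample_sizes]
ratios = [lam / T for lam, T in zip(lambdas, sample_sizes)]
# lambdas increases with T, while ratios (lambda_T / T) decreases towards zero.
```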

Remark 2. Given that $\lambda_T$ is set optimally such that $\delta$ is only slightly above 3/4, the convergence rate of our first step group lasso estimator is slightly slower than $T^{1/8}$. This means that we lose a substantial portion of the convergence rate, which is $T$ for fixed breaks under complete information on their location. The reduced convergence rate can be considered the cost for an estimator which is robust against (multiple) structural breaks with unknown location. For comparison, the convergence rate of in-sample predictions for white noise processes with mean shifts reported in Harchaoui and Lévy-Leduc (2010) is $(T/\log T)^{1/4}$. In contrast, Chan et al. (2014) find that in-sample predictions for piecewise stationary autoregressive processes have a faster convergence rate which amounts to $\sqrt{T/\log T}$, but this result is based on white-noise assumptions for the error term process.

Theorem 1 shows that it is crucial to let the tuning parameter $\lambda_T$ grow at the right rate. However, this rate provides only limited practical guidance towards the choice of $\lambda_T$. We follow Kock (2016), Qian and Su (2016) and Schmidt and Schweikert (2019) and propose to select $\lambda_T$ by minimizing an information criterion in the form of


where SSR is the sum of squared residuals resulting from the group lasso estimation of Equation (6) and $|A_T|$ gives the number of non-zero breakpoint candidates. The penalty function $\rho_T$ allows for different choices. While Kock (2016) suggests using the BIC for potentially nonstationary autoregressive models, which corresponds to $\rho_T = \log(T)/T$, Qian and Su (2016) propose to use $\rho_T = 1/\sqrt{T}$ for the estimation of structural breaks in stationary time series regressions. In this paper, we follow Schmidt and Schweikert (2019) and employ a modified BIC according to Wang et al. (2009) which incorporates the additional factor $\log \log d_T^*$, where $d_T^*$ denotes the total number of coefficients in the full model. This modification of the BIC accounts for the fact that the true model must be found in situations where the number of coefficients diverges.
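The three penalty choices can be compared in a small sketch. It assumes a criterion of the form $\log(\mathrm{SSR}/T) + \rho_T \cdot k$, where $k$ is the number of selected breakpoint candidates; the exact form of the criterion and the way $k$ is counted are assumptions. The candidate values in the example are made up purely for illustration and are not results from the paper.

```python
import math


def rho_bic(T):
    """BIC-type penalty, rho_T = log(T) / T."""
    return math.log(T) / T


def rho_qian_su(T):
    """Penalty rho_T = 1 / sqrt(T)."""
    return 1.0 / math.sqrt(T)


def rho_modified_bic(T, d_full):
    """BIC penalty multiplied by log log d, d = number of coefficients in the full model."""
    return math.log(T) / T * math.log(math.log(d_full))


def information_criterion(ssr, T, n_selected, rho):
    """Assumed form: log(SSR / T) + rho * number of selected candidates."""
    return math.log(ssr / T) + rho * n_selected


def select_lambda(candidates, T, rho):
    """candidates: iterable of (lambda, ssr, n_selected); return the minimizing lambda."""
    best = min(candidates, key=lambda c: information_criterion(c[1], T, c[2], rho))
    return best[0]


T, N = 200, 2
rho = rho_modified_bic(T, d_full=T * N)
# Hypothetical path of first step fits: (lambda, SSR, number of selected breaks).
path = [(5.0, 900.0, 0), (2.0, 400.0, 1), (1.0, 395.0, 4)]
chosen = select_lambda(path, T, rho)
```

In this made-up path, the large drop in SSR from zero to one break outweighs the penalty, whereas the small further drop from one to four breaks does not, so the middle value of $\lambda_T$ is selected.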

For the next theorem, we temporarily assume that the exact number of breaks is known. This assumption will help us to provide an important consistency result for the estimated location of breakpoints. We note that this temporary assumption will be relaxed for our main results.

Theorem 2. Under Assumption 1 and Assumption 2, if $m_0$ is fixed and $|A_T| = m_0$, then for all $\epsilon > 0$


Remark 3. Dividing by $T$ on both sides of the inequality in Theorem 2 shows that each break fraction can be detected within an $\epsilon$-neighbourhood of its true location. Hence, the convergence rate is similar to the one found in Davis et al. (2006) who use identical assumptions on the minimum break interval. Harchaoui and Lévy-Leduc (2010), allowing for a maximum number of location shifts in white noise processes, report a slightly faster convergence rate. Similarly, Chan et al. (2014) apply group lasso to piecewise stationary autoregressive processes with a potentially diverging number of true breakpoints and report the nearly optimal convergence rate $\log T / T$ if errors are Gaussian.

The previous result is an important building block for our main results. Next, we prove that the group lasso estimator yields a set of estimated breakpoints for which the number of selected breaks is greater than the true number of breaks almost surely when the exact number of breakpoints is unknown. Further, we evaluate the consistency of estimated breakpoints using the Hausdorff distance between the set of estimated breakpoints and the set of true breakpoints. We follow Boysen et al. (2009) and define $d_H(A, B) = \max_{b \in B} \min_{a \in A} |b - a|$ with $d_H(A, \emptyset) = d_H(\emptyset, B) = 1$, where $\emptyset$ is the empty set. The following theorem shows that the set of estimated breakpoints converges to the set of true breakpoints under the Hausdorff distance.
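The distance defined above translates directly into code. The following sketch implements $d_H(A, B) = \max_{b \in B} \min_{a \in A} |b - a|$ together with the stated convention for empty sets; the function name `hausdorff` is hypothetical.

```python
def hausdorff(A, B):
    """One-sided distance max over b in B of min over a in A of |b - a|.

    Returns 1 if either set is empty, following the convention in the text.
    """
    if not A or not B:
        return 1
    return max(min(abs(b - a) for a in A) for b in B)


estimated = [48, 103, 150]
true_breaks = [50, 100]

d_true_to_est = hausdorff(estimated, true_breaks)  # every true break has a nearby estimate
d_est_to_true = hausdorff(true_breaks, estimated)  # the spurious break at 150 is far from any true break
```

As defined, the distance is not symmetric in its arguments. With overselection, each true break can have a close estimate while a spurious estimate remains far from every true break, so the order of the arguments matters.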

Theorem 3. If Assumption 1 and Assumption 2 hold, then as $T \to \infty$


Remark 4. The first part of Theorem 3 yields the familiar result that the group lasso estimator is not model selection consistent in settings where the restricted eigenvalue condition (Bickel et al., 2009) or the irrepresentable condition (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006) do not hold for the full design matrix. The estimator tends to overselect breakpoints. We note that Assumption 2(iii) is slightly different from the restricted eigenvalue condition used in Bickel et al. (2009) and restricts only the design submatrix generated from columns containing active breakpoints. This result shows that we do not systematically select too few breaks which is crucial for the intended second step estimation using weights obtained by group lasso estimation. Ignored breaks would directly result in infinite weights for the second step which would mean that these breaks could not be recovered.

Remark 5. The second part of Theorem 3 implies that the Hausdorff distance from the set of estimated breakpoints to the true breakpoints diverges slower than the sample size. Consequently, the Hausdorff distance as a percentage of the sample size is bounded by a constant. This provides us with a consistency result for the estimated break fractions and gives us justification to consider multiple structural breaks at once, since the Hausdorff distance evaluates the joint location of all breakpoints.

Finally, we consider the asymptotic properties of the adaptive group lasso estimator with weights obtained from our first step estimation. We note that Theorem 3 allows us to bound the number of breakpoint candidates by a constant. Hence, the dimensionality of the model selection problem no longer depends on the sample size.

Theorem 4. If Assumption 1 and Assumption 2 hold, $\lambda_S \to 0$, $\lambda_S^2 T^{(1-\delta)\gamma} \to \infty$ for $3/4 < \delta < 1$ and $\gamma > 0$, then

(a) Consistency: $\|\hat{\theta}_S - \theta_S^0\| = O_p(T^{-1})$


Remark 6. Although the second step tuning parameter $\lambda_S$ can be chosen by a selection rule independent of the first step tuning parameter $\lambda_T$, its value depends on $\delta$, i.e. on how effectively additional coefficients are penalized in the first step and consequently on how many truly inactive breakpoint candidates remain in our second step design matrix. Since the number of parameters in the full model can now be limited by a pre-specified maximum number of breaks, we suggest using an information criterion like the BIC, which has performed quite well in our simulation experiments.

Remark 7. Combining parts (a) to (c) of Theorem 4 shows that the adaptive group lasso estimator has oracle properties. This means that the adaptive group lasso performs correct model selection and has the same asymptotic distribution as the least squares estimator would have if the location of the breaks had been known beforehand. Since our regression involves nonstationary components, the asymptotic distribution of the least squares estimator is naturally given as a functional of Brownian motions. Schmidt and Schweikert (2019) use the term ‘nonstandard oracle property’ to distinguish it from the term used in Fan and Li (2001). The asymptotic bias term $\Lambda$ originating from the dependency between increments of the regressors and the error term of the cointegrating regression can be eliminated using dynamic augmentation according to Saikkonen (1991) and Stock and Watson (1993).

Remark 8. It is notable that our estimator has nonstandard oracle properties although the convergence rate of the group lasso estimator is slower than $T^{1/8}$. Zou (2006) argues that the convergence rate of the initial estimator is allowed to be substantially slower than the desired convergence rate of the adaptive lasso estimator if the tuning parameter is specified accordingly.

In this section, we conduct simulation experiments to assess the adequacy of our technical results in Section 2. We investigate the finite sample performance of our adaptive group lasso procedure with respect to the accuracy in finding the exact number of breaks, their location and the magnitude of parameter changes. We consider model specifications with one, two and four breakpoints, respectively. The following DGP is employed to model a multivariate cointegrated system with multiple structural breaks,


where $X_t = (X_{1t}, X_{2t}, \ldots, X_{Nt})'$ and $\Sigma = \mathrm{diag}(\sigma_\omega^2)$, i.e. the innovations of our generated random walk processes have identical normal distributions. $\mu$ is a non-zero intercept and $\beta_t = (\beta_{1t}, \beta_{2t}, \ldots, \beta_{Nt})$ is a time-varying slope coefficient vector with non-zero baseline value and a finite number of breaks. We note that $\mathrm{cov}(\vartheta_t, \omega_t) = 0$, i.e. our regressors are strictly exogenous and the asymptotic bias reported in Theorem 4 is non-existent.

Naturally, the ability of all structural break estimators to detect breaks depends on the overall signal strength. Niu et al. (2015) define signal strength in change-point models by $S = m_\beta^2 I_{\min}$, where $I_{\min} = \min_{1 \le j \le m_0+1} |t_j - t_{j-1}|$ is the minimum distance between breaks and $m_\beta = \min_{1 \le j \le m_0+1} \|\beta_j - \beta_{j-1}\|$ is the minimum jump size. For our main simulations concerned with consistency of the adaptive group lasso estimator, we use equal jump sizes for multiple breaks and locate the breaks with equidistant spacing between them. Hence, overall signal strength is a linear function of the sample size in our simulations. We use a baseline value of two and a jump size of two which is equal to the standard deviation of the regression error term. Simulations with a better signal-to-noise ratio yield more precise estimates for all sample sizes.
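The linear growth of the signal strength with the sample size can be verified with a short sketch of $S = m_\beta^2 I_{\min}$. It assumes the boundary dates $t_0 = 1$ and $t_{m+1} = T + 1$ and the convention $\beta_0 = 0$, so that the baseline coefficient enters the minimum jump size. It also assumes that the baseline value and the jump size of two apply to each of the $N = 2$ coefficients, and the break dates 101 and 201 are chosen only so that both regimes have equal length. The function name `signal_strength` is hypothetical.

```python
import math


def signal_strength(T, breakpoints, betas):
    """Signal strength S = m_beta ** 2 * I_min.

    breakpoints: sorted 1-based break dates; betas: m + 1 regime vectors.
    Assumptions: t_0 = 1, t_{m+1} = T + 1, and beta_0 = 0.
    """
    dates = [1] + list(breakpoints) + [T + 1]
    i_min = min(later - earlier for earlier, later in zip(dates, dates[1:]))
    N = len(betas[0])
    regimes = [[0.0] * N] + [list(b) for b in betas]
    jumps = [
        math.sqrt(sum((new - old) ** 2 for new, old in zip(b_new, b_old)))
        for b_old, b_new in zip(regimes, regimes[1:])
    ]
    m_beta = min(jumps)
    return m_beta ** 2 * i_min


betas = [[2.0, 2.0], [4.0, 4.0]]  # baseline of two, jump of two in each coefficient
s_200 = signal_strength(200, [101], betas)
s_400 = signal_strength(400, [201], betas)
```

Doubling the sample size while keeping the break fraction fixed doubles $I_{\min}$ and therefore doubles $S$.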

In Table 1, we report our results for N = 2 regressors. We specify our model for one break located at τ = 0.5, two breaks at τ = (0.33, 0.67) and four breaks at τ = (0.2, 0.4, 0.6, 0.8) to have an equidistant spacing on the unit interval. We first compute the percentages of correct estimation (pce) of the number of breaks m and measure the accuracy of the break date estimation conditional on the correct estimation of m. For this purpose, we compute the average Hausdorff distance and divide it by T (hd/T) to compare the values across different sample sizes. The corresponding figures in our tables are reported in percentages. As T grows larger, the number of breaks is detected with increasing precision and the distance between estimated breakpoints and true breakpoints declines to nearly zero. Parameter estimates are already very accurate at small sample sizes. As expected, the parameter changes of models with fewer breakpoints can be estimated more precisely than those of models with a larger number of breakpoints, as indicated by larger standard deviations obtained for the latter at all sample sizes.5 Comparing these results with those obtained for the Bai-Perron algorithm6, where the number of breaks is determined via the BIC, we find that both approaches perform similarly well. The results are reported in Table 2. While the Bai-Perron algorithm estimates the true break fractions slightly more accurately, its parameter change estimates have, on average, larger standard deviations at all sample sizes. The number of structural breaks is estimated with identical accuracy.7
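The two accuracy measures can be sketched as follows. This is a minimal illustration that assumes each replication returns a non-empty list of estimated break dates; the function names are hypothetical and the code is not the simulation code used for the tables.

```python
def hausdorff_distance(est, true):
    """Hausdorff distance between two non-empty sets of break dates."""
    d_est = max(min(abs(e - t) for t in true) for e in est)
    d_true = max(min(abs(e - t) for e in est) for t in true)
    return max(d_est, d_true)


def summarize(replications, true_breaks, T):
    """Return (pce, hd/T), both in percent.

    pce is the share of replications with the correct number of breaks;
    hd/T averages the Hausdorff distance over those replications only.
    """
    correct = [est for est in replications if len(est) == len(true_breaks)]
    pce = 100.0 * len(correct) / len(replications)
    if not correct:
        return pce, float("nan")
    hd = sum(hausdorff_distance(est, true_breaks) for est in correct)
    return pce, 100.0 * hd / len(correct) / T
```

Conditioning on the correct number of breaks keeps hd/T from being dominated by replications in which a spurious or missing break inflates the distance.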

Next, we investigate if dynamic augmentation according to Saikkonen (1991) and Stock and Watson (1993) yields consistent coefficient estimates if the strict exogeneity condition of our main results is violated. To do so, we follow Kejriwal and Perron (2008) and draw the vector $(\vartheta_t, \omega_{1t}, \omega_{2t})'$ jointly from a multivariate normal distribution with zero mean and covariance matrix


Using this configuration, the strict exogeneity condition is violated for both regressors but the regressors are still generated by independent processes. If we attempt to detect and estimate structural breaks without dynamic augmentation, we still detect breakpoints precisely but obtain strongly biased coefficient estimates. In Table 5, we find the corresponding results after the inclusion of l = 1 and l = 2 leads and lags. Now, we can recover the number, location and magnitude of all breakpoints with similar accuracy compared to our simulations under strict exogeneity.
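Dynamic augmentation adds l leads and l lags of the first-differenced regressors to the cointegrating regression. The sketch below only builds the augmented regressor rows for a regression without breaks; the function name and the row layout are assumptions, and in the break setting the level regressors would additionally enter through the regime-specific design matrix.

```python
def dols_rows(y, X, l):
    """Build rows [1, X_t, dX_{t-l}, ..., dX_{t+l}] for dynamic OLS.

    y is a list of length T, X a list of T regressor vectors. The first
    l + 1 and the last l observations are lost to differencing, leads and lags.
    """
    T = len(y)
    N = len(X[0])
    # dX[t] = X[t] - X[t-1], defined for t = 1, ..., T-1 (0-based indexing).
    dX = [None] + [[X[t][j] - X[t - 1][j] for j in range(N)]
                   for t in range(1, T)]
    rows, targets = [], []
    for t in range(l + 1, T - l):
        row = [1.0] + list(X[t])
        for k in range(-l, l + 1):  # lags (k < 0), current (k = 0), leads (k > 0)
            row.extend(dX[t + k])
        rows.append(row)
        targets.append(y[t])
    return rows, targets
```

Each row has 1 + N + (2l + 1)N entries, so the number of nuisance parameters grows linearly in l, which is one reason to keep the number of leads and lags small in moderate samples.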

In Table 3, we consider partial breaks in the cointegrating vector. We use a model specification according to the DGP in Equation (10) with N = 2 regressors and induce partial structural breaks through $\beta_{1t}$ only. Our estimator is applied to a full structural change model without prior knowledge that $\beta_{2t}$ is constant over the sampling period. Again, we observe that the number of breaks, their timing and their magnitude are consistently estimated.8 The distance between the set of estimated breakpoints and true breakpoints is larger than in the full break setting in Table 1. This result is not surprising considering that the break magnitudes for partial breaks are smaller, making it more difficult for the adaptive group lasso procedure to detect the true location of the breaks. Consequently, these results also help us to assess how the break magnitude influences the detection rates. Reducing the Euclidean distance from 2 to √2 roughly doubles the average Hausdorff distance. The convergence rate for zero parameter changes in $\beta_{2t}$ is almost identical to the convergence rate observed for the non-zero parameter changes in $\beta_{1t}$. This is naturally driven by the joint evaluation of all regressors in each group. Unlike bi-level estimators proposed in Huang et al. (2009) and Breheny and Huang (2009), the adaptive lasso procedure is not able to shrink coefficients within active groups to zero. Hence, the usual convergence rate for non-zero coefficients applies. In these cases, the convergence rate for $\beta_{2t}$ could in principle be increased if our procedure were extended to feature bi-level shrinkage. However, this is beyond the scope of this paper and is not investigated further at this point.

Finally, we investigate how sensitive our procedure is to break fractions located near the boundary of the unit interval. While the properties of tests for structural changes in the literature depend strongly on the trimming parameter (Bai and Perron, 2006), our method to recover breaks should be more robust in this regard. We only need some lateral trimming to ensure that the first and last regimes identified by our adaptive lasso procedure comprise a sufficiently large number of observations to estimate regime-dependent coefficients.9 The results for breaks near the boundary are summarized in Table 4. The first and second panels consider one break located at τ = 0.1 and τ = 0.9, respectively. The pce and average Hausdorff distances over all sample sizes clearly show that a break located close to the beginning of the sample is more difficult to detect than a break located at the end of the sample. Gregory and Hansen (1996a) and Schweikert (2020) report similar findings for their grid search algorithms. To investigate this further, we consider two breaks located at τ = (0.1, 0.9) in panel three of Table 4. Here, we find that the pce is quite low compared to our main results with equidistant spacing of breakpoints. The first break is estimated less accurately than the second break, which can be explained by the fact that parameter changes are measured from one regime to the next and that only a relatively small number of observations is available to estimate the break at τ = 0.1.

The results of our first series of boundary experiments imply that it might be possible to relax our trimming restrictions and assume an asymmetric lateral trimming where the first regime must contain sufficiently many observations, say 5% of the sample, while the end of the sample does not necessarily have to be excluded. We apply a 0.05/0 trimming and estimate breaks located at τ = (0.1, 0.95). The results for this trimming strategy are presented in panel four of Table 4. The break at τ = 0.95 can still be accurately detected; however, the standard errors of the parameter changes increase due to the smaller number of observations in the last regime. We conclude that trimming is not necessary to detect breaks located at the end of the sample. Still, we suggest setting a minimum number of observations per regime to ensure that parameter changes are estimated precisely.
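A minimum regime length of this kind can be checked with a few lines of code. The sketch below is only an illustration under assumed names and conventions (break dates counted in observations, the sample ending at T); it is not part of the estimation procedure itself.

```python
def regime_lengths(dates, T):
    """Number of observations in each regime implied by the break dates."""
    bounds = [0] + sorted(dates) + [T]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def admissible(dates, T, min_first, min_last=1, min_inner=1):
    """Check asymmetric lateral trimming plus a minimum inner regime length."""
    lengths = regime_lengths(dates, T)
    if lengths[0] < min_first or lengths[-1] < min_last:
        return False
    return all(n >= min_inner for n in lengths[1:-1])
```

For T = 400 and breaks at τ = (0.1, 0.95), the regimes contain 40, 340 and 20 observations, so a 0.05/0 trimming (`min_first=20`) admits this configuration, whereas requiring 40 observations in the last regime would exclude it.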

In this section, we apply our proposed methodology to the US money demand function. Particularly, we estimate a long-run money demand specification and investigate the presence of long-run instabilities in a cointegrating framework. Juselius (2006) considers the condition M/P = L(Y, R) for equilibrium in the money market, which relates M/P, the ratio of nominal money balances to price levels, to real income Y and the short term nominal interest rate R. Two competing empirical specifications are considered in the literature, namely, the semi-log and the log-log specification. The latter is given by $L(Y, R) = \alpha Y^{\beta_1} R^{\beta_2}$, where $\alpha$ is a constant, $\beta_1$ is the income-elasticity assumed to be unity and $\beta_2 < 0$ is the interest-elasticity.10 For our empirical application, we choose a log-log specification which has been found to fit quite well to US data (Lucas, 2000; Bae and Jong, 2007; Ireland, 2009; Mogliani and Urga, 2018). We extend the dataset used by Maki (2012) to span the period from January 1959 to December 2018. Monthly data are obtained from the Federal Reserve Bank of St. Louis. We consider the empirical US money demand function,

$$m_t^* = \mu + \beta_1 y_t + \beta_2 r_t + u_t, \tag{12}$$

where $m_t^*$ and $y_t$ denote the natural logarithm of the ratio of nominal money balances to price levels, and the natural logarithm of real income, respectively. According to the log-log specification, we employ the natural logarithm of the short term nominal interest rate, denoted by $r_t$. $u_t$ denotes the equilibrium error of the money demand function if the system is cointegrated. We use M2 as nominal money, the consumer price index as prices, and the index of industrial production as real income. For the interest rate, we use the 6-month Treasury bill rate. All time series are tested for a unit root using the Dickey-Fuller test. The results, which are not reported, support the assumption that all variables are integrated of order one and we can continue our cointegration analysis.

First, we assume constancy of the parameters and ignore potential structural breaks. Estimation of the long-run equilibrium equation yields coefficients $\hat{\mu} = -0.05$, $\hat{\beta}_1 = 0.80$ and $\hat{\beta}_2 = -0.08$. Dynamic augmentation of the cointegrating regression with two leads and lags each does not change the coefficient values. The Engle-Granger test based on an ADF regression yields the t-ratio −0.063, which does not lead to a rejection of the null hypothesis at the 10% level. Similar results can be obtained for the Phillips-Ouliaris test and the Johansen test. Although it is implausible from a theoretical standpoint that the system is not cointegrated, at least our estimated coefficients have the expected sign and magnitude for post-war data. The estimated income-elasticity measured by $\beta_1$ is slightly below the theoretically expected value. The interest-elasticity of money demand, measured by $\beta_2$, is expected to be negative. Lucas (2000) considers −0.3, −0.5 and −0.7 as values of $\beta_2$ and finds that $\beta_2 = -0.5$ gives the best fit for US data. Meltzer (1963), Lucas (1988), Hoffman and Rasche (1991) and Stock and Watson (1993) find empirical evidence consistent with the theoretical expectation that income-elasticity of money demand is unity and interest-elasticity is relatively high. Ball (2001) studies subperiods from 1903 to 1994 and argues against a stable long-run money demand. Further empirical studies have pointed out the presence of structural instability in US money demand for sample periods including data from the 1990s and 2000s (Teles and Zhou, 2005; Wang, 2011; Lucas and Nicolini, 2015). Potential nonlinearities in the functional form are investigated, for example, by Chen and Wu (2005) and Jawadi and Sousa (2013). However, we take the perspective that the linear cointegrating regression in Equation (12) approximates the data well if we simultaneously account for (multiple) parameter changes during the sample period.

A three-dimensional scatterplot of the data in Figure 1 reveals that the relationship between $r_t$, $y_t$ and $m_t^*$ has changed during the sampling period. We observe at least three two-dimensional surfaces which correspond to distinct long-run levels from which $m_t^*$ does not persistently deviate. However, if we consider linear cointegration without the possibility of structural breaks, we infer from Figure 2 that the residual series exhibits a clear trend during the latter half of the sample. We note that the presence of structural breaks might mask the cointegrating relationship. Next, we compare several previously mentioned structural break models with our model selection approach. The Gregory and Hansen (1996a) test indicates a breakpoint at 2008 m06 but does not reject the null hypothesis at the 10% level. Because the GH-test does not model structural breaks under the null hypothesis, this means that the timing of the indicated breakpoint is not informative. The Hatemi-J (2008) test indicates two breakpoints at 1992 m01 and 2008 m06. The null hypothesis of no cointegration can be rejected at the 5% level if these breakpoints are taken into account. The maximum number of breaks chosen for the Maki (2012) test is five. It selects the breakpoints at 1986 m05, 1992 m04, 2004 m05, 2008 m11, 2014 m03 and rejects the null hypothesis of no cointegration at the 1% level. We initially also start with a maximum of five breakpoints for our adaptive group lasso procedure. However, imposing a minimum regime length of one year to precisely estimate the parameter changes and dynamically augmenting the cointegrating regression results in a model specification with three breakpoints. The final estimates yield break dates 1992 m07, 2005 m12, and 2015 m11.11


Figure 1: Three-dimensional scatterplot of $r_t$ (x-axis), $y_t$ (y-axis) and $m_t^*$ (z-axis).

The income-elasticity from 1959 m01 to 1992 m07 is estimated to be 0.95 and the interest-elasticity amounts to −0.10 for the same period. These estimates correspond to the theoretical predictions formulated in Juselius (2006) and to the results reported in empirical papers considering this sample period (Lucas, 1988; Stock and Watson, 1993; Lucas, 2000). The first breakpoint leads to an income-elasticity reduction from 0.95 to 0.89 while the interest-elasticity remains largely unchanged. A partial decoupling of money demand from income might be explained by the beginning of the costly Gulf War and a sharp increase in US debt. In turn, the second breakpoint at 2005 m12 has a negligible effect on the income-elasticity (0.89 to 0.90) but results in a larger reduction of the interest-elasticity in absolute terms, from −0.10 to −0.07. This breakpoint can be related to the onset of the Global Financial Crisis of 2007-2008. It must be emphasized at this point that estimated break dates might be affected by the usual lead and lag effects, since parameter changes are representative for the following regime. In the aftermath of the Global Financial Crisis, the Federal Reserve implemented a zero interest rate policy. Consequently, the variation in the interest rate for this period approached zero, which naturally reduced the interest-elasticity of money demand. After 2015 m11, the theoretically expected interest-elasticity no longer achieves a good fit to the data, and the estimate increases to 0.01. In contrast, the income-elasticity is very close to unity (0.97).

Accounting for structural breaks, as indicated by the adaptive group lasso procedure, yields a residual series that resembles a stationary process much more closely than the original OLS residual series. Figure 3 illustrates that the residual series does not exhibit a visible trend. The speed of adjustment after equilibrium errors is now −0.097, which means that roughly 10% of long-run deviations are corrected each period.
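To put the adjustment speed into perspective, it can be converted into a half-life. The sketch below assumes that deviations decay geometrically at the rate 1 + α per period and ignores any short-run dynamics, which is a simplification; the function name is hypothetical.

```python
import math


def half_life(alpha):
    """Periods until half of a deviation is corrected.

    Assumes geometric decay with factor (1 + alpha), for alpha in (-1, 0).
    """
    if not -1.0 < alpha < 0.0:
        raise ValueError("alpha must lie strictly between -1 and 0")
    return math.log(0.5) / math.log(1.0 + alpha)


# Adjustment coefficient reported for the post-lasso residual series.
months = half_life(-0.097)
```

Under this simplification, an adjustment coefficient of −0.097 on monthly data corresponds to a half-life of about 6.8 months.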

In this paper, we propose a penalized regression approach to the problem of detecting an unknown number of structural breaks and their location in cointegrating regressions. Our estimator eliminates irrelevant breakpoints from a set of candidate breakpoints and, hence, follows a top-down approach regarding the estimation of structural breaks. Practitioners should apply this new methodology in complement to the Bai-Perron algorithm which follows a bottom-up approach, i.e. sequentially increasing the number of breaks. Due to the importance of finding the right model specification with respect to the number and location of structural breaks, either approach can serve as a valuable robustness check of the model specification chosen by the other approach. Ideally both approaches should indicate the same breakpoints which would mean that the chosen model specification is sufficiently sparse (bottom-up) and does not ignore important breaks (top-down).


Figure 2: Residual series obtained from least squares estimation.


Figure 3: Post-lasso residual series. Estimated regimes are marked by grey and white areas.

We can show the important theoretical result that the adaptive group lasso estimator has nonstandard oracle properties in settings with a diverging number of breakpoint candidates. This means that the estimator determines the true number of non-zero parameter changes with probability tending to one and consistently estimates their location. The corresponding parameter changes are estimated with the same convergence rate that least squares estimators would have under full information of the number and location of breaks.

The present paper does not consider cointegration testing. It is unclear how optimal cointegration tests can be constructed from the proposed penalized regression approach. An attempt to design such cointegration tests has been made by Schmidt and Schweikert (2019) for a single regressor. Our results depend critically on the stationarity assumption about the error term. Hence, it is required to establish the existence of a cointegration relationship before the penalized regression is estimated. Practitioners should employ cointegration tests which are robust to the presumed number of breaks during the sample period.

Further extensions include the use of bi-level selection via the group fused lasso (Huang et al., 2009; Breheny and Huang, 2009) to estimate partial breaks more efficiently, and the possibility to detect structural breaks in system-based approaches with multiple equilibria (Bai et al., 1998; Qu, 2007).

I thank Florian Stark, Alexander Schmidt, Markus Mößler, Timo Dimitriadis and the participants of the Doctoral Seminar in Econometrics in Tübingen, German Statistical Week in Trier, ZU Methodenkolloquium in Friedrichshafen, THE Christmas Workshop in Stuttgart, Seminar at Maastricht University, and the 2nd CSL Symposium in Stuttgart for valuable comments and suggestions. Further, I thank Maike Becker and Manuel Huth for excellent research assistance.

Lemma 1. Under Assumption 1 and Assumption 2, for any $c_0 > 0$ and $\delta > 1/2$, there exists some constant $C > 0$ such that


Proof of Lemma 1. According to Theorem 4.1 of Hansen (1992a), it holds for all j = 1, . . . , N and 0 ≤ r ≤ 1 that T −1


variance $\sigma^2$. Since the second moment of $\int_0^r B_j\,dU + r\Lambda_j$ is finite and $T^{-2}\big(\sum_{i=1}^{[Tr]} X_{ji}u_i\big)^2$ is uniformly integrable, we have E


to Theorem 3.5 of Billingsley (1999). It follows from Markov inequality that


for some C > 0. Thus, it holds that


Since N is finite, Equation (A.1) follows.


Lemma 2. Let $\tilde{\theta}(T)$ be the estimator of $\theta(T)$ as defined in Equation (6), then it holds under the same conditions as in Theorem 1 that




Proof of Lemma 2. This lemma is a direct consequence of the Karush-Kuhn-Tucker (KKT) conditions for group lasso estimators.


Lemma 3. A necessary and sufficient condition for the estimator $\hat{\theta}_S$ to be a solution to the adaptive group lasso objective function $Q(\theta)$ is


Proof of Lemma 3. This lemma is a direct consequence of the Karush-Kuhn-Tucker (KKT) conditions for adaptive group lasso estimators.
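For orientation, the KKT conditions of a generic group lasso problem take the following textbook form. The objective, its scaling and the symbols $Z_g$, $\lambda$ and $w_g$ are assumptions made for illustration; the normalisation in Equations (6) and (7) may differ, so this is not a restatement of Lemma 2 or Lemma 3.

```latex
% Generic objective (assumed scaling):
%   Q(\theta) = \tfrac{1}{2}\|y - Z\theta\|^2 + \lambda \sum_g w_g \|\theta_g\|
% with w_g = 1 for the group lasso and
% w_g = \|\tilde{\theta}_g\|^{-\gamma} for the adaptive group lasso.
\begin{align*}
  Z_g'(y - Z\hat{\theta}) &= \lambda w_g \frac{\hat{\theta}_g}{\|\hat{\theta}_g\|}
    && \text{if } \hat{\theta}_g \neq 0, \\
  \|Z_g'(y - Z\hat{\theta})\| &\le \lambda w_g
    && \text{if } \hat{\theta}_g = 0.
\end{align*}
```

The second line explains why groups with small first-step estimates are removed in the second step: a small $\|\tilde{\theta}_g\|$ produces a large weight $w_g$, which makes the inequality easy to satisfy at $\hat{\theta}_g = 0$.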


Proof of Theorem 1. We prove that the group lasso estimator consistently estimates all coefficients. The first part of our proof is related to the results given in Chan et al. (2014), while the second part uses ideas presented in He and Huang (2016). By definition of ˜θ(T), it holds that


Note that $\bar{A}$ contains the indices of all truly non-zero coefficients. Inserting $y_T = Z_T\theta^0(T) + u_T$ into Equation (A.4) yields


Noting that


and using Lemma 1, we have with probability greater than $1 - C/(c_0^2 T^{2\delta-1})$ that


Now, we consider two cases: $\kappa_2 > 2\kappa_1$ and $\kappa_2 \le 2\kappa_1$. First, we show that $P(\kappa_2 > 2\kappa_1) \to 0$ and then derive the upper bound of $\|\tilde{\theta}_{\bar{A}}(T) - \theta^0_{\bar{A}}(T)\|$. We note that the triangle inequality yields


We can show, using similar arguments as in Lemma 1, that max 1≤i≤T

Op(T 5/2) and it follows that


Then, using the Cauchy-Schwarz inequality and considering that $|\bar{A}| = m_0 + 1$ is finite, we have


In contrast, $|\bar{A}^c|$ is diverging with $T \to \infty$. We observe that


for all $j = 1, \dots, N$, which implies that $\|Z_T'u_T\| = O_p(T^{7/4})$. Consequently, we have


Hence, using the assumption  κ2 > 2κ1, we obtain the contradiction


for $T \to \infty$ and it follows that $P(\kappa_2 > 2\kappa_1) \to 0$. Turning to the event $\kappa_2 \le 2\kappa_1$, Assumption 2(iii) together with $X'X/T^2 = O_p(1)$ implies that


Thus, we have


if $3/4 < \delta < 1$. Combining the results for both cases completes the proof.


Proof of Theorem 2. Define $A_{Ti} = \{|\hat{t}_i - t_i^0| > T\epsilon\}$, $i = 1, 2, \dots, m_0$ such that


The proof follows along the lines of the proof of Proposition 3 in Harchaoui and Lévy-Leduc (2010) and Theorem 2.2 in Chan et al. (2014). In the following, we focus on $\sum_{i=1}^{m_0} P(A_{Ti}C_T) \to 0$ because the complementary part can be shown using similar arguments.


Next, we split $A_{Ti}$ into two parts (i) $\hat{t}_i < t_i^0$ and (ii) $\hat{t}_i > t_i^0$ to show that $P(A_{Ti}C_T) \to 0$.

In case of (i), applying Lemma 2 yields


Note that because of $\hat{t}_i < t_i^0$, the true coefficient has not changed at $\hat{t}_i$. Hence, plugging in for $y_s = \beta_i^{0\prime} X_s + u_s$ yields


It follows for ˆti < t0i that,


P (ATiCT) ≤ P


For the first term, we observe that under Assumption 2 and on the set $\{|\hat{t}_i - t_i^0| > T\epsilon\}$ it holds that


for sufficiently small $\epsilon$ with probability going to one. Taking into account that $T\lambda_T = O(T^{1+\delta})$ for $3/4 < \delta < 1$, we conclude that $P(A_{Ti1}) \to 0$ for $T \to \infty$. For the second term, we have


but since  ∥

that the right hand side of the inequality asymptotically dominates the left hand side and  P (ATi2) → 0 for T → ∞. Turning to the third term, we note the definition of ˜βi+1 = t�

according to Theorem 1 and the Continuous Mapping Theorem (see Billingsley (1999), Theorem 2.7), $\tilde{\beta}_{i+1}$ is a consistent estimator for $\beta^0_{i+1}$ with convergence rate $T^{(1-\delta)/2}$, $3/4 < \delta < 1$. This means that we have $(\beta^{0\prime}_{i+1} - \tilde{\beta}'_{i+1}) \to 0$ and $P(A_{Ti3}) \to 0$ for $T \to \infty$. It follows that $P(A_{Ti}C_T \cap \{\hat{t}_i < t_i^0\}) \to 0$.

In case of (ii), we have


Since $\hat{t}_i > t_i^0$, the true coefficient has changed at $t_i^0$ and we plug in for $y_s = \beta^{0\prime}_{i+1} X_s + u_s$, which yields


It follows that,


P (ATiCT) ≤ P


The same arguments as for case (i) can be used to show that $P(A_{Ti}C_T \cap \{\hat{t}_i > t_i^0\}) \to 0$. Combining (i) and (ii) completes the proof of $P(A_{Ti}C_T) \to 0$.


Proof of Theorem 3. We begin by proving the first part. Suppose that $|A_T| < m_0$, then there exists some $t^0_{i_0}$, $i_0 = 1, 2, \dots$ and $\hat{t}_{s_0} \in A_T \cup \{0, \infty\}$, $s_0 = 0, 1, \dots, |A_T| + 1$ with $t^0_{i_0+1} - t^0_{i_0} \vee \hat{t}_{s_0} \ge T\epsilon/3$ and $t^0_{i_0+2} \wedge \hat{t}_{s_0+1} - t^0_{i_0+1} \ge T\epsilon/3$ where $\hat{t}_0 = 0$ and $\hat{t}_{|A_T|+1} = \infty$.

Applying Lemma 2 to the intervals $[t^0_{i_0} \vee \hat{t}_{s_0},\, t^0_{i_0+1} - 1]$ and $[t^0_{i_0+1},\, t^0_{i_0+2} \wedge \hat{t}_{s_0+1} - 1]$ yields




Similar arguments to those used in the proof of Theorem 2 show that either




has to hold to contradict $|A_T| < m_0$. Since $\tilde{\beta}_{s_0+1}$ is a consistent estimator according to Theorem 1 and the Continuous Mapping Theorem, we either have $\tilde{\beta}_{s_0+1} \xrightarrow{p} \beta^0_{i_0}$ or $\tilde{\beta}_{s_0+1} \xrightarrow{p} \beta^0_{i_0+1}$. In the former case, the left hand side of Inequality (A.30) converges to zero. In the latter case, the left hand side of Inequality (A.31) converges to zero. Hence, in every case at least one of the two probabilities converges to zero. Consequently, we have a contradiction to $|A_T| < m_0$.

For the second part, we define $\hat{T}_k = \{\hat{t}_1, \hat{t}_2, \dots, \hat{t}_k\}$. Then, it is enough to show that


as $T \to \infty$. By Theorem 2, we have already shown that $P(d_H(\hat{T}_{m_0}, A) > T\epsilon) \to 0$ so that it suffices to show




Using similar arguments as in the proof of Theorem 2, we can show that $\max_{k > m_0} P\big(\bigcup_{i=1}^{m_0} B_{T,k,i,j}\big) \to 0$ for $1 \le j \le 3$. This completes the proof of Theorem 3.


Proof of Theorem 4. To prove Theorem 4, we follow ideas similar to those put forth in Wang and Leng (2008) and Zhang and Xiang (2016). As we will note at different points of the proof, the statistical properties of the adaptive group lasso estimator hinge crucially on our first step weights. It is particularly important that our second step design matrix  ZSfulfils the restricted eigenvalue condition which can be ensured by the first step group lasso algorithm.

We note that the adaptive group lasso objective function $Q(\theta_S)$ is a strictly convex function and show that there is a local minimizer which is superconsistent. Then by global convexity of $Q(\theta_S)$, it follows that such a local minimizer must be $\hat{\theta}_S$. Similar to Fan and Li (2001), the existence of an above-described local minimizer is implied by the fact that for any $\epsilon > 0$, there is a sufficiently large constant $C > 0$, such that


It holds that


Since the restricted eigenvalue condition holds for $\Sigma_S = Z_S'Z_S/T^2$, i.e., its eigenvalues are positive for all $T$, and $\Sigma_S$ thus converges to a positive definite random matrix, we have $I_1 = O_p(T^{-1})\|v\|^2$. Further, it follows from the Cauchy-Schwarz inequality and Lemma 1 that


and consequently $I_2 = O_p(T^{-1})\|v\|$. Finally, using the Cauchy-Schwarz inequality, we have


We note that $\min_{g(i) \in A_T \cap A} \|\tilde{\theta}_{S,i}\|^{-\gamma} = O_p(1)$ since $\tilde{\theta}_{S,i}$ is a consistent estimator according to Theorem 1 and our first step estimation does not ignore relevant breakpoints asymptotically according to Theorem 3. Using the condition $\lambda_S \to 0$, we know that $I_3$ is bounded by $O_p(T^{-1})\|v\|$. Hence, we can specify a large enough constant $C$ such that $I_1$ dominates $I_2$ and $I_3$. This completes the proof of part (a).

Next, we turn to the proof of part (b). Lemma 3 gives the necessary and sufficient condition for an estimator to be a solution to the adaptive group lasso objective function as defined by Equation (7). Now, to prove that all truly zero parameters are set to zero almost surely, it suffices to show that


or equivalently


Further, we have


and the first part of Theorem 4 implies that $\|\hat{\theta}_{S,A^*} - \theta^0_{S,A^*}\| = O_p(T^{-1})$ such that $\|\frac{1}{T} Z'_{g(i)} Z_{S,A^*}(\hat{\theta}_{S,A^*} - \theta^0_{S,A^*})\| = O_p(1)$. Hence, we need to prove


Considering that Theorem 1 implies $\max_{g(i) \in A^c} \|\tilde{\theta}_{S,i}\| = O_p(T^{-(1-\delta)/2})$ for $3/4 < \delta < 1$, we



for some $C > 0$. Since Assumption 1 implies $E\|\frac{1}{T} Z'_{g(i)} u_T\|^2 = O_p(1)$ and $\lambda_S^2 T^{(1-\delta)\gamma} \to$


for all $g(i) \in A_T \cap A^c$. Note that $|A_T| < M$ for all $T$ and that all remaining indices $i$ not included in $A_T$ correspond to coefficients which have already been set to zero in the first step.

For the proof of model selection consistency, we still need to show that no truly non-zero parameter changes are set to zero. It holds that


Since $\|\hat{\theta}_{S,i} - \theta^0_{S,i}\| \xrightarrow{p} 0$ by part (a) and by considering Assumption 2, we have


This completes the proof of part (b). Finally, we turn to the proof of part (c). It follows from Lemma 3 that




Then, it holds that


where $\varphi \in \mathbb{R}^{m_0+1}$ with $\|\varphi\| = 1$. Since $Z'_{S,\bar{A}^*} Z_{S,\bar{A}^*}/T^2$ is a positive definite random matrix for all $T$, we have


As in part (a), it holds that $\min_{i \in \bar{A}} \|\tilde{\theta}_i\|^{-\gamma} = O_p(1)$ and by the conditions of Theorem 4, we have $\lambda_S \to 0$ as $T \to \infty$ such that


We use $\omega^2_{v_j}$ to denote the long-run variance of the stationary process $\{v_{jt}\}_{t=1}^{\infty}$, $j = 1, \dots, N$. Note that $\omega^2_{v_j}$ is the $j$-th diagonal element of $\Omega_v$. Under the conditions of Assumption 1, it holds that


for $s \in [0, 1]$, $j \in \{1, \dots, N\}$ and $T \to \infty$, where $B(s)$ is a scalar Brownian motion with variance $\omega^2_{v_j}$. Further, it holds that




and $B(s)$ is an $N$-vector Brownian motion process with covariance matrix $\Omega_v$. Using (A.4) in Gregory and Hansen (1996a) and the Continuous Mapping Theorem, we observe that


where the weak convergence is uniform over the vector (τ1, . . . , τm0) ∈ T. Further, using (A.3) in Gregory and Hansen (1996a) and Theorem 3.1 in Hansen (1992b), we have the weak convergence to a stochastic integral


where $\Lambda = \sum_{t=0}^{\infty} E(v_0 u_t)$. Finally, the Cramér-Wold device implies the weak convergence result in (c) which completes the proof of Theorem 4.






Table 5: Endogeneity correction via dynamic augmentation


Note: We use 1,000 replications of the data-generating process given in Equation (10) with an endogenous error term specification. The covariance matrix of the error terms is specified according to Equation (11) with $\sigma_\omega^2 = 1$ and $\sigma_\vartheta^2 = 4$, respectively. We denote the number of leads and lags with $l$. The first panel reports the results for one active breakpoint at τ = 0.5, the second panel considers two active breakpoints at τ1 = 0.33 and τ2 = 0.67 and the third panel has four active breakpoints at τ1 = 0.2, τ2 = 0.4, τ3 = 0.6, and τ4 = 0.8. The baseline coefficients and parameter changes at all breakpoints take the value 2. Standard deviations are given in parentheses.

