b

DiscoverSearch
About
My stuff
Modelling High-Dimensional Categorical Data Using Nonconvex Fusion Penalties
2020·arXiv
Abstract
Abstract

We propose a method for estimation in high-dimensional linear models with nominal categorical data. Our estimator, called SCOPE, fuses levels together by making their corresponding coefficients exactly equal. This is achieved using the minimax concave penalty on differences between the order statistics of the coefficients for a categorical variable, thereby clustering the coefficients. We provide an algorithm for exact and efficient computation of the global minimum of the resulting nonconvex objective in the case with a single variable with potentially many levels, and use this within a block coordinate descent procedure in the multivariate case. We show that an oracle least squares solution that exploits the unknown level fusions is a limit point of the coordinate descent with high probability, provided the true levels have a certain minimum separation; these conditions are known to be minimal in the univariate case. We demonstrate the favourable performance of SCOPE across a range of real and simulated datasets. An R package CatReg implementing SCOPE for linear models and also a version for logistic regression is available on CRAN.

Categorical data arise in a number of application areas. For example, electronic health data typically contain records of diagnoses received by patients coded within controlled vocabularies and also prescriptions, both of which give rise to categorical variables with large numbers of levels [Jensen et al., 2012]. Vehicle insurance claim data also contain a large number of categorical variables detailing properties of the vehicles and parties involved [Hu et al., 2018]. When performing regression with such data as covariates, it is often helpful, both for improved predictive performance and interpretation of the fit, to fuse the levels of several categories together in the sense that the estimated coefficients corresponding to these levels have exactly the same value.

To fix ideas, consider the following ANOVA model relating response vector  Y = (Y1, . . . , Yn)T ∈Rn to categorical predictors  Xij ∈ {1, . . . Kj}, j = 1, . . . , p:

image

Here the  εiare independent zero mean random errors,  µ0 is a global intercept and  θ0jk is thecontribution to the response of the kth level of the jth predictor; we will later place restrictions on the parameters to ensure they are identifiable. We are interested in the setting where the coefficients corresponding to any given predictor are clustered, so defining

image

we have  sj ≪ Kj, at least when  Kjis large. Note that our setup can include high-dimensional settings where p is large and many of the predictors do not contribute at all to the response: when  sj= 1, the contribution of the jth predictor is effectively null as it may be absorbed by the intercept term.

1.1 Background and motivation

Early work on collapsing levels together in low-dimensional models of the form (1) focused on performing a variety of significance tests for whether certain sets of parameters were equal [Tukey, 1949, Scott and Knott, 1974, Calinski and Corsten, 1985]. A more modern and algorithmic method based on these ideas is Delete or merge regressors [Maj-Ka´nska et al., 2015], which involves agglomerative clustering based on t-statistics for differences between levels.

The CART algorithm [Breiman et al., 1984] for building decision trees effectively starts with all levels of the variables fused together and greedily selects which levels to split. One potential drawback of these greedy approaches is that in high-dimensional settings where the search space is very large, they may fail to find good groupings of the levels. The popular random forest procedure [Breiman, 2001] uses randomisation to alleviate the issues with the greedy nature of the splits, but sacrifices interpretability of the fitted model.

An alternative to greedy approaches in high-dimensional settings is using penalty-based methods such as the Lasso [Tibshirani, 1996]. This can be applied to continuous or binary data and involves optimising an objective for which global minimisation is computationally tractable, thereby avoiding some of the pitfalls of greedy optimisation. In contrast to random forest, the fitted models are sparse and interpretable. Inspired by the success of the Lasso and related methods for high-dimensional regression, a variety of approaches have proposed estimating  θ0 = (θ0jk)j=1,...,p, k=1,...,Kj and µ0via optimising over (µ, θ) a sum of a least squares criterion

image

This is the CAS-ANOVA penalty of Bondell and Reich [2009]. The weights  wj,klcan be chosen to balance the effects of having certain levels of categories more prevalent than others in the data. The penalty is an ‘all-pairs’ version of the fused Lasso and closely related to so-called convex clustering [Hocking et al., 2011, Chiquet et al., 2017]. We note that there are several other approaches besides using penalty functions. For instance, Pauger and Wagner [2019] proposes a Bayesian modelling procedure using sparsity-inducing prior distributions to encourage fusion of levels. See also Tutz and Gertheiss [2016] and references therein for a review of other methods including those based on mixture models and kernels.

The fact that the optimisation problem resulting from (4) is convex makes the procedure attractive. However, a drawback is that it may not give a desirable form of shrinkage. Indeed, consider the case where p = 1, and dropping subscripts for simplicity, all  wkl= 1. This would typically be the case if all levels were equally prevalent. Further suppose for simplicity that the number of levels K is even. Then if the coefficients are clustered into two groups where one contains only a single isolated coefficient, the number of non-zero summands in (4) is only  K−1.This almost doubles to 2(K −2) when one of the two groups is of size 2. The extreme case where the two groups are of equal size yields (K/2)2 non-zero summands. This particular property of all-pairs penalties, which results in them favouring groups of unequal sizes, is illustrated schematically in Figure 1. We can see the impact of this in the following concrete example.

image

Figure 1: Illustration of the number of non-zero summands in (4) when p = 1, K = 16 and coefficients are clustered into two groups of equal size (right), and where one contains a single coefficient (left) and two coefficients (middle).

Suppose K = 20 levels are clustered into four groups with

image

If the coefficient estimates satisfy ˆθ1 = · · · = ˆθ4 < ˆθ5 = · · · = ˆθ10 ≤ ˆθk for all k ≥11, so the first two groups have distinct coefficients, then moving any coefficient from the first group towards the second, and so increasing the number of estimated groups, actually decreases the penalty contribution in (4). Specifically, if the kth coefficient for some  k ∈ {1, . . . , 4} moves to ˆθk +t fort ∈ [0, ˆθ5 − ˆθ4] with all other coefficients kept fixed, the penalty contribution decreases by 13t. In this case then, CAS-ANOVA will struggle to keep the groups intact, especially smaller ones. We see this in Figure 2, which shows the result of applying CAS-ANOVA to data generated according to (1) with  p = 1, θ0 as above, n= 20 (so we have a single observation corresponding to each level), and  εii.i.d.∼ N(0,1). There is no value of the tuning parameter  λwhere the true groups are recovered.

As in the standard regression setting, the bias introduced by all-pairs  ℓ1-type penalties may be reduced by choosing data-adaptive weights analogously to the adaptive Lasso [Zou, 2006], or replacing the absolute value  |θjk − θjl| by ρ(|θjk − θjl|) where ρis a concave and non-decreasing penalty function [Oelker et al., 2015, Ma and Huang, 2017]. However, this does not

image

Figure 2: Solution paths as the tuning parameter varies in a univariate example where there are four true groups. From left to right: CAS-ANOVA, the range penalty and SCOPE with γ= 8. The setup is as described in the main text of Section 1.1, with the different colours corresponding to the different true groups. The tuning parameter varies along the y axis. In this example, only SCOPE identifies the 4 correct groups at any point along its solution path.

address the basic issue of a preference for groups of unequal sizes. Additionally, optimising an objective involving a penalty with  O��pj=1 K2j�summands can be computationally challenging, particularly in the case where  ρis not convex, both in terms of runtime and memory. To help motivate the new approach we are proposing in this paper, let us consider the setting where the predictors are ordinal rather than nominal, so there is an obvious ordering among the levels. In these settings, it is natural to consider a fused Lasso [Tibshirani et al., 2005] penalty of the form

image

where  πjis a permutation of  {1, . . . , Kj}specifying the given order; this is done in Gertheiss and Tutz [2010] who advocate using it conjunction with the all-pairs-type CAS-ANOVA penalty for nominal categories.

If however we treat the nominal variable setting as analogous to having ordinal variables with unknown orderings  πj, one might initially think of choosing  πjcorresponding to the order of the estimates  θj := (θjk)Kjk=1, such that  θjπj(k) = θj(k), where θj(k) is the kth smallest entry in  θj. This however leads to what we refer to as the ‘range’ penalty:

image

Whilst this shrinks the largest and smallest of the estimated coefficients together, the remaining coefficients lying in the open interval between these are unpenalised and so no grouping of the estimates is encouraged, as we observe in Figure 2; see also Oelker et al. [2015] for a discussion of this issue in the context of ordinal variables.

1.2 Our contributions and organisation of the paper

Given how all-pairs penalties have an intrinsic and undesirable preference for unequal group sizes, and how the fused Lasso applied to ordered coefficients (6) does not result in grouping of

the coefficients, we propose the following solution. Our approach is to use the penalty

image

for concave (and nonconvex) non-decreasing penalty functions  ρj, which, for computational reasons we discuss in Section 3, we base on the minimax concave penalty (MCP) [Zhang, 2010]. In Section 2 we formally introduce our method, which we call SCOPE, standing for Sparse Concave Ordering & Penalisation Estimator.

Note that whereas in conventional high-dimensional regression, the use of nonconvex penalties has been primarily motivated by a need to reduce bias in the estimation of large coefficients [Fan and Li, 2001], here the purpose is very different: in our setting a nonconvex penalty is in fact even necessary for shrinkage to sparse solutions to occur (see Proposition 1). Because of these fundamental differences, the rich algorithmic and statistical theory concerning high-dimensional regression with nonconvex penalties (see for example Loh and Wainwright [2012, 2015], Wang et al. [2014], Fan et al. [2018], Zhao et al. [2018] and references therein) is not directly applicable to our setting.

In Section 3, we therefore introduce a new dynamic programming approach that recovers the global minimum of the resulting objective function exactly in the univariate case, i.e. when p = 1. We then build this into a blockwise coordinate descent approach to tackle the multivariate setting.

In Section 4 we study the theoretical properties of SCOPE and give sufficient conditions for the estimator to coincide with the least squares solution with oracular knowledge of the level fusions in the univariate case. These conditions involve a minimal separation between unequal coefficients that is, up to constant factors, minimax optimal. Our result contrasts sharply with Theorem 2 of Ma and Huang [2017] for an all-pairs nonconvex penalty. The latter instead shows the existence of a local optimum that coincides with the oracle least squares solution. Whilst in conventional high-dimensional regression settings, it is known that under certain conditions, all local optima have favourable properties [Loh and Wainwright, 2015], we note that the separation requirements in Ma and Huang [2017] are substantially weaker than those indicated by the minimax lower bound, and so cannot be extended to a particular local optimum determined by the data; see the discussion following Theorem 5.

We use our univariate result to show that the oracle least squares solution is a fixed point of our blockwise coordinate descent algorithm in the multivariate case. In Section 5 we outline some extensions of our methodology including a scheme for handling settings when there is a hierarchy among the categorical variables. Section 6 contains numerical experiments that demonstrate the favourable performance of our method compared to a range of competitors on both simulated and real data. We conclude with a discussion in Section 7. Further details of our algorithm can be found in the Appendix. The supplementary material contains additional information on the runtime of our algorithm, and an approximate version suitable for very large-scale settings, all the proofs, and additional information on the experiments in Section 6.

Recall that our goal is to estimate parameters (µ0, θ0) in model (1). Let us first consolidate some notation. For any  θ ∈ RK1 ×· · ·×RKp, we define θj := (θjk)Kjk=1 ∈ RKj. We will study the univariate setting where p = 1 separately, and so it will be helpful to introduce some simplified notation for this case, dropping any extraneous subscripts. We thus write  K ≡ K1, Xi ≡ Xi1

and  ρ ≡ ρ1. Additionally, we let ¯Ykdenote the average of the  Yi with Xi = k:

image

where  nk = �ni=1 1{Xi=k}.In order to avoid an arbitrary choice of corner point constraint, we instead impose the following to ensure that  θ0 is identifiable: for all j = 1, . . . , p we have

image

Let Θj = {θj ∈ RKj : gj(θj) = 0}, and let Θ = Θ1 × · · · × Θp. We will construct estimators by minimising over  µ ∈ R and θ ∈Θ an objective function of the form

image

where  ℓis the least squares loss function (3) and  θj(1) ≤ · · · ≤ θj(Kj)are the order statistics of θj. We allow for different penalty functions  ρjfor each predictor in order to help balance the effects of varying numbers of levels  Kj. The identifiability constraint that  θ ∈Θ ensures that the estimated intercept ˆµ:= arg minµ ˜Q(µ, θ) satisfies ˆµ = �ni=1 Yi/n.

We note that whilst the form of the identifiability constraint would not have a bearing on the fitted values of unregularised least squares regression, this is not necessarily the case when regularisation is imposed. For example, consider the simple univariate setting with p = 1 and the corner point constraint  θ1= 0. Then the fitted value for an observation with level 1 would simply be the average ¯Y1, coinciding with that of unpenalised least squares. However the fitted values with observations with other level  k ≥2 would be subject to regularisation and in general be different to ¯Yk. This inequitable treatment of the levels is clearly undesirable as they may have been labelled in an arbitrary way. Our identifiability constraint treats the levels more symmetrically, but also takes into account the prevalence of levels, so the fitted values corresponding to more prevalent levels effectively undergo less regularisation.

As the estimated intercept ˆµdoes not depend on the tuning parameters, we define

image

We will take the regularisers  ρj : [0, ∞) → [0, ∞) in (9) to be concave (and nonconvex); as discussed in the introduction and formalised in Proposition 1 below, a nonconvex penalty is necessary for fusion to occur.

Proposition 1. Consider the univariate case with p = 1. Suppose the subaverages ( ¯Yk)Kk=1 (7)are all distinct, and that  ρ1 ≡ ρis convex. Then any minimiser ˆθ of Q has ˆθk ̸= ˆθl for all k ̸= lsuch that ˆθ(1) < ¯Yk − ˆµ < ˆθ(K) or ˆθ(1) < ¯Yl − ˆµ < ˆθ(K).

image

2010]:

image

where (u)+ = u1{u≥0}. This is a piecewise quadratic function with gradient  λat 0 and flat beyond  γλ. For computational reasons which we discuss in Section 3, the simple piecewise quadratic form of this is particularly helpful. In the multivariate case we take  ρj = ργ,λjwith  λj = λ�Kj. This choice of scaling is motivated by requiring that when  θ0 = 0 we also have ˆθ= 0 with high probability; see Lemma 10 in the Supplementary material. We discuss the choice of the tuning parameters  λ and γin Section 3.3, but first turn to the problem of optimising (9).

In this section we include details of how SCOPE is computed. Section 3.1 motivates and describes the dynamic programming algorithm we use to compute global minimiser of the SCOPE objective, which is highly non-convex. Section 3.2 contains details of how this is used to solve the multivariate objective by embedding it within a blockwise coordinate descent routine. Discussion of practical considerations is contained in Section 3.3.

3.1 Univariate model

3.1.1 Preliminaries

We now consider the univariate case (p = 1) and explain how the solutions are computed. In this case, we may rewrite the least squares loss contribution to the objective function in the following way.

image

where  wk = nk/n. Thus the optimisation problem (9) can be written equivalently as

image

suppressing the dependence of the MCP  ρon tuning parameters  γ and λ. In fact, it is straightforward to see that the constraint that the solution lies in Θ will be automatically satisfied, so we may replace Θ with  RK. Two challenging aspects of the optimisation problem above are the presence of the nonconvex  ρand the order statistics. The latter however are easily dealt with using the result below, which holds more generally whenever  ρis a concave function.

Proposition 2. Consider the univariate optimisation (11) with  ρany concave function such that a minimiser ˆθexists. If for  k, l we have ¯Yk > ¯Yl, then ˆθk ≥ ˆθl.

This observation substantially simplifies the optimisation: after re-indexing such that ¯Y1 ≤¯Y2 ≤ · · · ≤ ¯YK, we may re-express (11) as,

image

We use the following intermediate functions to structure the algorithm:

image

for k = 2, . . . , K; here sarg min refers to the smallest minimiser in the case that it is not unique. Invariably however this will be unique, as the following result indicates.

Proposition 3. The set of ( ¯Yk)Kk=1 that yields distinct solutions to (11) has Lebesgue measure zero as a subset of  RK.

We will thus tacitly assume uniqueness in some of the discussion that follows, though this is not required for our algorithm to return a global minimiser. Observe now that ˆθK is theminimiser of the univariate objective function  fK: indeed for  k ≥ 2,

image

Furthermore, we have ˆθK−1 = bK(ˆθK), and more generally ˆθk = bk+1(ˆθk+1) for k = K −1, . . . , 1.Thus provided  fKcan be minimised efficiently (which we shall see is indeed the case), given this and the functions  b2, . . . , bKwe can iteratively compute ˆθK, ˆθK−1, . . . , ˆθ1. In order to make use of these properties, we must be able to compute  fK and the bkefficiently; we explain how to do this in the following subsection.

3.1.2 Computation of fK and b2, . . . , bK

The simple piecewise quadratic form of the MCP-based penalty is crucial to our approach for computing the  fK and the bk. Some important consequences of this piecewise quadratic property are summarised in the following lemma.

Lemma 4. For each k,

image

(iii) for each  θk+1 ∈ R, if a minimiser ˜θk = ˜θk(θk+1) of θk �→ fk(θk) + ρ(θk+1 − θk) over(−∞, θk+1] satisfies ˜θk < θk+1, then fkmust be differentiable at ˜θk.

Properties (i) and (ii) above permit exact representation of  fk and bkwith finitely many quantities. The key task then is to form the collection of intervals and corresponding coefficients of quadratic functions for

image

given a similar piecewise quadratic representation of  fk; and also the same for the linear functions composing  bk. A piecewise quadratic representation of  fk+1would then be straightforward to compute, and we can iterate this process. To take advantage of property (iii) above, in computing  gk(θk+1) we can separately search for minimisers at stationary points in (−∞, θk+1) andcompare the corresponding function values with  fk(θk+1); the fact that we need only consider potential minimisers at points of differentiability will simplify things as we shall see below.

Suppose  Ik,1, . . . , Ik,m(k)are intervals that partition R (closed on the left) and  qk,1, . . . , qk,m(k)are corresponding quadratic functions such that  fk(θk) = qk,r(θk) for θk ∈ Ik,r. Let us write

image

We may then express  fk as fk(θk) = minr ˜qk,r(θk). We can also express the penalty  ρ = ργ,λ ina similar fashion. Let

image

Then  ρ(x) = mint ˜ρt(x) for x ≥ 0. Let Dkbe the set of points at which  fkis differentiable. We then have, using Lemma 4 (iii) that

image

where ˜min denotes the minimum if it exists and  ∞otherwise. The fact that in the inner minimisation we are permitted to consider only points in  Dksimplifies the form of

image

We show in Section A.1 of the Appendix that this is finite only on an interval and there takes the value of a quadratic function; coefficients for this function and the interval endpoints have closed form expressions that are elementary functions of the coefficients and intervals corresponding to ˜qk,r. With this, we have an explicit representation of  gkas the minimum of a collection of functions that are quadratic on intervals and  ∞everywhere else. Let us refer to these intervals (closed on the left) and corresponding quadratic functions as  Jk,1, . . . , Jk,n(k) and pk,1, . . . , pk,n(k)respectively.

In order to produce a representation of  fk+1for use in future iterations, we must express  gkas a collection of quadratics defined on disjoint intervals. To this end, define for each  x ∈ R theactive set at x, A(x) = {r : x ∈ Jk,r}. Note that the endpoints of the intervals  Jk,rare the points where the active set changes and it is thus straightforward to determine A(x) at each x. Let r(x) be the index such that  gk(x) = pk,r(x)(x). For large negative values of x, A(x) will contain a single index and for such x this must be r(x). Consider also for each  r ∈ A(x) \ {r(x)}, thehorizontal coordinate  x′ of the first intersection beyond x (if it exists) between  pk,r and pk,r(x).We refer to the collection of all such tuples (x′, r) as the intersection set at xand denote it by N(x). Given r(x), N(x) can be computed easily. The intersection set N(x) then in turn helps to determine the smallest  x′ > x where r(x′) ̸= r(x) changes, that is the next knot of  gkbeyond x, as we now explain. Suppose at a point  xold, we have computed  rold = r(xold). Weset  xcur = xoldand perform the following.

1. Given  r(xcur), compute N(xcur) and set (xint, rint) = arg min(x,r)∈N(xcur) x.

2. If there are no changes in the active set between  xcur and xint, we have found the next knot point at  xint and rint = r(xint).

3. If instead the active set changes, move  xcurto the leftmost change point. We have that r(x) = rold for x ∈ [xold, xcur). To determine if r(x) changes at  xcur, we check if

image

(ii) some  rnewenters the active set at  xcurand ‘beats’  rold, so rnew ∈ A(xcur) \ A(xold)and  pk,rnew(xcur + ϵ) < pk,rold(xcur + ϵ) for ϵ >0 sufficiently small.

If either hold  xcuris a knot and  r(xcur) may be computed via  r(xcur) = arg minr∈A(xcur) pk,r(xcur).If neither hold, we conclude that  r(xcur) = roldand go to step 1 once more.

Hence we can proceed from one knot of  gkto the next by comparing the values and intersections of a small collection of quadratic functions, and thereby form a piecewise quadratic representation of  gkin a finite number of steps. Figure 3 illustrates the steps outlined above. The pieces of  bkmay be computed in a similar fashion.

We note there are several modifications that can speed up the algorithm: for example, for each  r, uk,r,2(17) is a constant function where it is finite (see  pk,2in the figure), and these can be dealt with more efficiently. For further details including pseudocode see Section A.2 of the Appendix.

image

Figure 3: Illustration of the optimisation problem and our algorithm, to be interpreted with reference to steps 1, 2, 3 in the main text. Shading indicates regions where the active set, displayed at the bottom of the plot, is invariant, and vertical dotted lines signify changes. Dotted curves correspond to parts of quadratic functions  pk,llying outside their associated intervals  Jk,l. At xold, we have r(xold) = 1, A(xold) = {1, 2} and N(xold) = {(x(1)int, 2)}. Sincethe active set changes between  xold and x(1)int, we move xcurto the first change point P and see neither (i) nor (ii) occur. We therefore return to step 1 and compute  N(xcur) which additionally contains (x(2)int,2). As the active set is unchanged between  xcur and x(2)int, we have determined the next knot point  x(2)int and minimising quadratic  pk,3.

In summary, our algorithm produces a piecewise quadratic representation of  fK, which we

can minimise efficiently to obtain ˆθK. We also have piecewise linear representations of functions b2, . . . , bKthrough which we may iteratively obtain ˆθk = bk+1(ˆθk+1) for k = K − 1, . . . , 1.

It seems challenging to obtain meaningful bounds on the number of computations that must be performed at each stage of this process in terms of parameters of the data. However, to give an indication of the scalability of this algorithm, we ran a simple example with 3 true levels and found that with 50 categories the runtime was under 10−3 seconds; with 2000 categories it was still well under half a second. More details on computation time can be found in Sections S1.3 and S3.2 of the Supplementary material. In Section S1.4 of the Supplementary material, we describe an approximate version of the algorithm that can be used for fast computation in very large-scale settings.

3.2 Multivariate model

Using our dynamic programming algorithm for the univariate problem, we can attempt to minimise the objective (9) for the multivariate problem using block coordinate descent. This has been shown empirically to be a successful strategy for minimising objectives for high-dimensional regression with nonconvex penalties such as the MCP [Breheny and Huang, 2011, Mazumder et al., 2011, Breheny and Huang, 2015], and we take this approach here. Considering the multivariate case, we iteratively minimise the objective  Q over θj := (θjk)Kjk=1 ∈ Θjkeeping all other parameters fixed. Then for a given (γ, λ) and initial estimate ˆθ(0) ∈Θ, we repeat the following until a suitable convergence criterion is met:

1. Initialise m = 1, and set for i = 1, . . . , n

image

2. For j = 1, . . . , p, compute

image

3. Increment  m → m + 1.

We define a blockwise optimum of Q to be any ˆθ ∈Θ, such that for each j = 1, . . . , p,

image

This is equivalent to ˆθbeing a fixed point of the block coordinate descent algorithm above. Provided  γ > 0, Qis continuous in  θ. As a consequence of Tseng [2001], Theorem 4.1 (c), provided the minimisers ˆθ(m)jin (19) are unique for all j and m (which will invariably be the case when the responses are realisations of continuous random variables; see Proposition 3), then all limit points of the sequence (ˆθ(m))∞m=0 are blockwise optima.

3.3 Practicalities

In practice the block coordinate descent procedure described above must be performed over a grid of (γ, λ) values to facilitate tuning parameter selection by cross-validation. In line with analogous recommendations for other penalised regression optimisation procedures [Breheny and Huang, 2011, Friedman et al., 2010], we propose, for each fixed  γ, to iteratively obtain solutions for an exponentially decreasing sequence of  λvalues, warm starting each application of block coordinate descent at the solution for the previous  λ. It is our experience that this scheme speeds up convergence and helps to guide the resulting estimates to statistically favourable local optima, as has been shown theoretically for certain nonconvex settings [Wang et al., 2014].

The grid of  γvalues can be chosen to be fairly coarse as the solutions appear to be less sensitive to this tuning parameter; in fact fixing  γ ∈ {8, 32}yields competitive performance across a range of settings (see Section 6). The choice  γ ↓0, which mimics the  ℓ0 penalty,has good statistical properties (see Theorem 5 and following discussion). However the global optimum typically has a smaller basin of attraction and can be prohibitively hard to locate, particularly in low signal to noise ratio settings where larger  γtends to dominate.

In this section, we study the theoretical properties of SCOPE. Recall our model

image

for  i = 1, . . . , n, where θ0 ∈Θ. We will assume the errors (εi)ni=1 have mean zero, are indepen- dent and sub-Gaussian with parameter  σ. Let

image

and define the oracle least squares estimate

image

This is the least squares estimate of  θ0 with oracular knowledge of which categorical levels are fused in  θ0.

Note that in the case where the errors have equal variance  v2, the expected mean squared prediction error of ˆθ0 satisfies

image

with equality when ˆθ0is unique.

Our results below establish conditions under which ˆθ0is a blockwise optimum (20) of the SCOPE objective function Q (9), or in the univariate case when this in fact coincides with SCOPE. The minimum differences between the signals defined for each j by

image

image

these latter two quantities are the minimum and maximum number of observations corresponding to a set of fused levels in the jth predictor respectively.

4.1 Univariate model

We first consider the univariate case, where as usual we will drop the subscript j for simplicity. The following result establishes conditions for recovery of the oracle least squares estimate (22).

Theorem 5. Consider the model (21) in the univariate case with p = 1. Suppose there exists

image

max{γ, ηs}. Suppose further that

image

Then with probability at least

image

the oracle least squares estimate ˆθ0 (22)is the global optimum of (9), so ˆθ = ˆθ0.

For a choice of the tuning parameters (γ, λ) with γ ≤ ηs and λsuch that equality holds in (24), we have, writing ∆  ≡ ∆(θ0), that ˆθ = ˆθ0with probability at least

image

where c is an absolute constant. The quantity  ηreflects how equal the number of observations in the true fused levels are: in settings where the prevalences of the underlying true levels are roughly equal, we would expect this to be closer to 1.

Consider now an asymptotic regime where K, s and 1/∆ are allowed to diverge with n, nmin ≍ n/K, so all levels have roughly the same prevalence, and  ηis bounded away from zero, so all true underlying levels also have roughly the same prevalence. Then in order for ˆθ = ˆθ0with high probability, we require ∆  ≳ σ�K log(K)/n. This requirement cannot be weakened for any estimator; this fact comes as a consequence of minimax lower bounds on mis-clustering errors in Gaussian mixture models [Lu and Zhou, 2016, Theorem 3.3].

We remark that our result here concerning properties of the global minimiser of our objective is very different from existing results on local minimisers of objectives involving all-pairs-type penalties. For example, in the setting above where K = n, Theorem 2 of Ma and Huang [2017] gives that provided  s = o(n1/3(log n)−1/3) and ∆ ≫ σs3/2n−1/2�log(n), there exists a sequence of local minimisers converging to the oracle least-squares estimate with high probability. This is significantly weaker than the condition ∆  ≳ σ�log(n) required for any estimator to recover oracle least-squares in this setting, illustrating the substantial difference between results on local and global optima here.

4.2 Multivariate model

When the number of variables is p > 1, models can become high-dimensional, with ordinary least squares estimation failing to provide a unique solution. We will however assume that the solution for  θ ∈ Θ0 to

image

is unique, which occurs if and only if the oracle least squares estimate (22) is unique. In this case, we note that ˆθ0 = AYfor a fixed matrix A. A necessary condition for this is that �j(sj − 1) < n.

Our result below provides a bound on the probability that the oracle least squares estimate is a blockwise optimum of the SCOPE objective (9) with  ρj = ργj,λj.This is much more meaningful than an equivalent bound for ˆθ0to be a local optimum as the number of local optima will be enormous. In general though there may be several blockwise optima, and it seems challenging to obtain a result giving conditions under which our blockwise coordinate descent procedure is guaranteed to converge to ˆθ0. Our empirical results (Section 6) however show that the fixed points computed in practice tend to give good performance.

Theorem 6. Consider the model (21) and assume ˆθ0 = AY. Suppose that there exists  η ∈ (0, 1]such that  η/sj ≤ n0j,min/n ≤ n0j,max/n ≤ 1/ηsj for all j = 1, . . . , p. Let γ∗j = min{γj, ηsj} andγ∗j = max{γj, ηsj}. Further suppose that

image

Then letting  cmin := (maxl(AAT )ll)−1, with probability at least

image

the oracle least squares estimate ˆθ0is a blockwise optimum of (9).

Now suppose  γj ≤ ηsj and λjare such that equality holds in (26) for all j. Then writing Kmax = maxj Kj, nmin = minj nj,min and ∆min = minj ∆(θ0j), we have that ˆθ0is a blockwise optimum of (9) with probability at least

image

where c is an absolute constant. Consider now an analogous asymptotic regime to that described in the previous section for the univariate case. Specifically assume  nmin ≍ n/Kmax and cmin ≳nminfor simplicity. We then see that in order for ˆθ0to be a blockwise optimum with high probability, it is sufficient that ∆min ≳ σ�Kmax log(Kmaxp)/n.

In this section, we describe some extensions of our SCOPE methodology.

Continuous covariates. If some of the covariates are continuous rather than categorical, we can apply any penalty function of choice to these, and perform a regression by optimising the sum of a least squares objective, our SCOPE penalty and these additional penalty functions, using (block) coordinate descent.

image

Z ∈ Rn×d be the centred design matrix for these covariates with  ith row Zi ∈ Rd. One can fit a model with SCOPE penalising the categorical covariates, and the Lasso with tuning parameter α >0 penalising the continuous covariates, resulting in the following objective over  β ∈ Rd and

image

This sort of integration of continuous covariates is less straightforward when attempting to use tree-based methods to handle categorical covariates, for example.

Generalised linear models. Sometimes a generalised linear model may be appropriate. Although a quadratic loss function is critical for our exact optimisation algorithm described in Section 3.1, we can iterate local quadratic approximations to the loss term in the objective and minimise this. This results in a proximal Newton algorithm and is analogous to the standard approach for solving  ℓ1-penalised generalised linear models [Friedman et al., 2010, Section 3]. An implementation of this scheme in the case of logistic regression for binary responses is available in the accompanying R package CatReg. We remark that when computing logistic regression models with a SCOPE penalty it is advisable to use a larger value of  γthan with a continuous response to aid convergence of the proximal Newton step; we recommend a default setting of γ= 100. In Section 6.2 we use the approach described above to perform a logistic regression using SCOPE on US census data.

Hierarchical categories. Often certain predictors may have levels that are effectively subdivisions of the levels of other predictors. Examples include category of item in e-commerce or geographical data with predictors for continent, countries and district. For simplicity, we will illustrate how such settings may be dealt with by considering a case with two predictors, but this may easily be generalised to more complex hierarchical structures. Suppose there is a partition  G1 ∪ · · · ∪ GK1 of {1, . . . , K2}such that for all  k = 1, . . . , K1,

image

so the levels of the second predictor in  Gkrepresent subdivisions of kth level of the first predictor. Let  K2k := |Gk| and let θ2krefer to the subvector (θ2l)l∈Gk for each k = 1, . . . , K1, socomponents of  θ2kare the coefficients corresponding to the levels in  Gk. Also let θ2k(r) denotethe rth order statistic within  θ2k. It is natural to encourage fusion among levels within  Gkmore strongly than for levels in different elements of the partition. To do this we can modify our objective function so the penalty takes the form

image

We furthermore enforce the identifiability constraints that

image

As well as yielding the desired shrinkage properties, an additional advantage of this approach is that the least squares criterion is separable in  θ21, . . . , θ2K1so the blockwise update of  θ2 canbe performed in parallel. This can lead to a substantial reduction in computation time if  K2 islarge.

In this section we explore the empirical properties of SCOPE. We first present results on the performance on simulated data, and then in Sections 6.2 to 6.5 present analyses and experiments on US census data, insurance data and COVID-19 modelling data.

We denote SCOPE with a specific choice of  γ as SCOPE-γ, and write SCOPE-CV to denote SCOPE with a cross-validated choice of  γ. SCOPE solutions are computed using our R [R Core Team, 2020] package CatReg [Stokell, 2021], using 5-fold cross-validation to select  λ for allexamples except those in Section 6.5. We compare SCOPE to linear or logistic regression where appropriate and a range of existing methods, including CAS-ANOVA [Bondell and Reich, 2009] (4), and an adaptive version where the weights  wj,klare multiplied by a factor proportional to the  |ˆθinitjk − ˆθinitjl |−1, where ˆθinitis an initial CAS-ANOVA estimate. For these methods the tuning parameter  λwas also selected by 5-fold cross-validation. As well as this, we include Delete or merge regressors (DMR) [Maj-Ka´nska et al., 2015] and Bayesian effect fusion (BEF) [Pauger and Wagner, 2019] in some experiments. With the former, models were fitted using DMRnet [Prochenka-So�ltys and Pokarowski, 2018] and selected by 5-fold cross-validation where possible; otherwise an information criterion was used. With BEF, coefficients were modelled with a Gaussian mixture model with posterior mean estimated using 1000 samples using effectFusion [Pauger et al., 2019]. We also include comparison to the tree-based approaches CART [Breiman et al., 1984] and random forests (RF) [Breiman, 2001]. Lastly, in some experiments, models were also fitted using the Lasso [Tibshirani, 1996]. CART was implemented using rpart [Therneau and Atkinson, 2019] with pruning according to the one standard error rule. Random forests and Lasso were implemented using the default settings in randomForest [Liaw and Wiener, 2002] and glmnet [Friedman et al., 2010] packages respectively. For full details of the specific versions of these methods and software used in the numerical experiments, see Section S3.1 of the Supplementary material.

6.1 Simulations

image

where g is the true regression function, ˆg an estimate, and the expectation is taken over the covariate vector x.

6.1.1 Low-dimensional experiments

Results are presented for three settings with n = 500, p = 10 given below.

image

3. As Setting 1, but with  ρ = 0.8.

image

RF 9.621(0.5)10.944(0.5)13.217(0.7) 16.344(0.9) 8.947(0.3) 9.747(0.4)11.249(0.6) 13.646(0.8)Setting 3 σ2: 1 6.25 25 100SNR: 7.3 2.9 1.5 0.73

image

Table 1: Mean squared prediction errors (and standard deviations thereof) of various methods on the settings described.

Each of these experiments were performed with noise variance  σ2 = 1, 6.25, 25 and 100. Note that the variance of the signal varies across each setting, and signal-to-noise ratio (SNR) for each experiment is displayed in Table 1. Methods included for comparison were SCOPE-8, SCOPE-32, SCOPE-CV, linear regression, vanilla and adaptive CAS-ANOVA, DMR, Bayesian effect fusion, CART and random forests. Also included are the results from the oracle least squares estimator (22).

Results are shown in Table 1 and further details are given in Section S3.2.1 of the Supplementary material. Across all experiments, SCOPE with a cross-validated choice of  γ exhibitsprediction performance at least as good as the optimal approaches, and in all but the lowest noise settings performs better than the other methods that were included. In these exceptions, we see that fixing  γto be a small value (corresponding to high-concavity) provides leading performance.

In these low noise settings, we see that the methods based on first estimating the clusterings of the levels and then estimating the coefficients without introducing further shrinkage, such as DMR or Bayesian effect Fusion, perform well. However they tend to struggle when the noise is larger. In contrast the tree-based methods perform poorly in low noise settings but exhibit competitive performance in high noise settings.

6.1.2 High-dimensional experiments

We considered 8 settings as detailed below, each with n = 500, p = 100 and simulated 500 times.

image

2. As Setting 1, but with  ρ = 0.5.

image

6. As Setting 5, but with  ρ = 0.5.

image

Models were fitted using SCOPE-8, SCOPE-32, SCOPE-CV, DMR, CART, Random forests and the Lasso. Table 2 gives the mean squared prediction errors across each of the settings.

As well as prediction performance, it is interesting to see how the methods perform in terms of variable selection performance. With categorical covariates, there are two potential ways of evaluating this. The first is to consider the number of false positives and false negatives across the p = 100 categorical variables, defining a variable j to have been selected if ˆθj ̸= 0.These results are shown in Table 3. This definition of a false positive can be considered quite conservative; typically one can find that often the false signal variables have only two levels, each with quite small coefficients. This means that the false positive rate can increase substantially with only a small increase in the dimension of the estimated linear model.

The second is to see within the signal variables (i.e., the  j for which θ0j ̸= 0), how closely the estimated clustering resembles the true structure. To quantify this, we use the adjusted Rand index [Hubert and Arabie, 1985]. This is the proportion of all pairs of observations that are either (i) in different true clusters and different estimated clusters, or (ii) in the same true cluster and estimated cluster; this is then corrected to ensure that its value is zero when exactly one of the clusterings is ‘all-in-one’. In Table 4 we report the average adjusted Rand index over the true signal variables in each setting.

image

Table 2: Mean squared prediction errors (and standard deviations thereof) of each of the meth- ods in the 8 high-dimensional settings considered.

image

Table 3: (False positive rate)/(False negative rate) of linear modelling methods considered in the high-dimensional settings.

image

Table 4: Average adjusted Rand index among true signal variables for the high-dimensional settings.

Further details can be found in Section S3.2.2 of the Supplementary material. In particular we include a table with the distribution of cross-validated choices of  γ(from a grid {4, 8, 16, 32, 64}) for each experimental setting. Note that a choice of  γ= 4 is close to the setting of  γ= 3 recommended in Zhang [2010], though the problem of categorical covariates is very different in nature than the vanilla variable selection problem considered there. Our results there suggest that for SCOPE, a larger value of  γis preferable across a range of settings, which is also visible in the comparison between  γ = 8 and γ= 32 in Table 2.

Across all the settings in this study, SCOPE performs better than any of the other methods included. This is regardless of which of the three  γregimes is chosen, although cross-validating γgives the strongest performance overall. Comparing the results for  γ = 8 and γ = 32 sug-gests that a larger (low-concavity) choice of  γis preferable for higher-dimensional settings. In setting 6, we see from Tables 3 and 4 that SCOPE obtains the true underlying groupings of the coefficients and obtains the oracle least-squares estimate in every case, giving these striking results. This is also achieved for some of the experiments in setting 5. In contrast, DMR, which initially applies a group Lasso [Yuan and Lin, 2006] to screen the categorical variables and give a low-dimensional model, necessarily misses some signal variables in this first stage and hence struggles here.

6.2 Adult dataset analysis

The Adult dataset, available from the UCI Machine Learning Repository [Dua and Graff, 2019], contains a sample of 45 222 observations based on information from the 1994 US census. The binary response variable is 0 if the individual earns at most $50 000 a year, and 1 otherwise. There are 2 continuous and 8 categorical variables; some such as ‘native country’ have large numbers of levels, bringing the total dimension to 93. An advantage of using SCOPE here over black-box predictive tools such as Random forests is the interpretability of the fitted model.

In Table 5, we show the 25-dimensional fitted model. Within the Education category, we see that six distinct levels have been identified. These agree almost exactly with the stratification one would expect, with all school dropouts before 12th grade being grouped together at the lowest level.

image

Figure 4: Prediction performance and fitted model dimension (respectively) of various methods on the Adult dataset: (A) SCOPE-100; (B) SCOPE-250; (C) Logistic regression; (D) CASANOVA; (E) Adaptive CAS-ANOVA; (F) DMR; (G) BEF; (H) CART; (I) RF.

Here we assess performance in the challenging setting when the training set is quite small by randomly selecting 1% (452) of the total observations for training, and using the remainder as a test set. Any observations containing levels not in the training set were removed. Models were fitted with SCOPE-100, SCOPE-250, logistic regression, vanilla and adaptive CAS-ANOVA, DMR, Bayesian effect fusion, CART and random forests.

We see that both SCOPE-100 and SCOPE-250 are competitive, with CART and Random forests also performing well, though the latter two include interactions in their fits. CASANOVA also performs fairly well, the misclassification error is larger that for both versions of SCOPE, and the average fitted model size is larger.

image

Table 5: Coefficients of SCOPE model trained on the full dataset. Here,  γ = 100 and λ wasselected by 5-fold cross-validation (with cross-validation error of 16.82%). Countries, aside from those in the UK, are referred to by their (possibly historical) internet top-level domains. *Relation with which the subject lives.

image

Table 6: Results of experiments on the Adult dataset.

image

Figure 5: Misclassification error and dimensions of models fitted on a sample of the Adult dataset when levels have been artificially split m times.

6.3 Adult dataset with artificially split levels

To create a more challenging example, we artificially created additional levels in the Adult dataset as follows. For each categorical variable we recursively selected a level with probability proportional to its prevalence in the data and then split it into two by appending “-0” or “-1” to the level for each observation independently and with equal probabilities. We repeated this until the total number of levels reached m times the original number of levels for that variable for m = 2, 3, 4. This process simulates for example responses to a survey, where different respondents might answer ‘US’, ‘U.S.’, ‘USA’, ‘U.S.A.’, ‘United States’ or ‘United States of America’ to a question, which would naively all be treated as different answers.

We used 2.5% (1130) of the observations for training and the remainder for testing and applied SCOPE with  γ= 100 and logistic regression. Results were averaged over 250 training and test splits. Figure 5 shows that as the number of levels increases, the misclassification error of SCOPE increases only slightly and the fitted model dimension remains almost unchanged, whereas both increase with m for logistic regression.

6.4 Insurance data example

The Prudential Life Insurance Assessment challenge was a prediction competition run on Kaggle [2015]. By more accurately predicting risk, the burden of extensive tests and check-ups for life insurance policyholders could potentially be reduced. For this experiment, we use the training set that was provided for entrants of the competition.

We removed a small number of variables due to excessive missingness, leaving 5 continuous variables and 108 categorical variables, most with 2 or 3 levels but with some in the hundreds (and the largest with 579 levels). Rather than using the response from the original dataset, which is ordinal, to better suit the regression setting we are primarily concerned with in this work, we artificially generated a continuous response. To construct this signal, firstly 10 of the categorical variables were selected at random, with probability proportional to the number of levels. For the jth of these, writing  Kjfor the number of levels, we set  sj := ⌊2 + 12 log Kj⌋ andassigned each level a coefficient in 1, . . . , sjuniformly at random, thus yielding  sjtrue levels. The coefficients for the 5 continuous covariates were generated as draws from  N5(0, I). Theresponse was then scaled to have unit variance, after which standard normal noise was added.

image

Figure 6: Mean squared prediction error on the example based on the Prudential Life Insurance Assessment dataset. Methods used are: (A) SCOPE-8; (B) SCOPE-32; (C) SCOPE-CV; (D) CART; (E) RF; (F) Lasso.

We used 10% (n = 5938) of the 59 381 total number of observations for training, and the remainder to compute an estimated MSPE (28) by taking an average over these observations. We repeated this 1000 times, sampling 10% of the observations and generating the coefficients as above anew in each repetition. The average mean squared prediction errors achieved by the various methods under comparison are given in Figure 6. We see that SCOPE with a cross-validated choice of  γperforms best, followed by the Lasso and SCOPE-32.

6.5 COVID-19 Forecast Hub example

As well as the prediction performance experiments in the rest of this section, we include an exploratory data analysis example based on data relating to the ongoing (at time of writing) global COVID-19 pandemic. The COVID-19 Forecast Hub [2020]  ‘. . . serves as a central repos-itory of forecasts and predictions from over 50 international research groups.’A collection of different research groups publish forecasts every week of case incidence in each US state for some number of weeks into the future.

In order to understand some of the difficulties of this challenging forecasting problem, we fitted an error decomposition model of the form

image

where w is the week that the forecast is for, l is the state, m indexes the forecasting model, t is the ‘target’ number of weeks in the future the forecast is for,  ηm,t,w,lis an error term, and casesw,land est.casesm,t,w,lare the observed and estimated cases respectively. This decomposition allows an interaction term between time and location, which is important given that the pandemic was known to be more severe at different times for different areas. An interaction between model and forecasting distance was also included in order to capture the effect of some models potentially being more ‘short-sighted’ than others. The inclusion of the +1 on the left-hand side is to avoid numerators or denominators of zero.

We used data from 6 April 2020 to 19 October 2020, giving a total of 100 264 (m, t, w, l)-tuples. We applied a SCOPE penalty with  γ = 8 to βw,l, which had 1428 levels. The  αm,tcoefficients, which amounted to 170 levels, were left unpenalised. The additional tuning parameter  λwas selected using the Extended Bayesian Information Criterion [Chen and Chen, 2008] rather than cross-validation, as it was more suited to this sort of exploratory analysis on data with a chronological structure.

The resulting estimates ˆβw,lhad 8 levels. We measured the ‘similarity’ of two US states la and lbover a period of time by computing the proportion of weeks at which their estimates

image

Figure 7: Similarity matrix for US states computed based on data relating to the second ‘wave’ of the COVID-19 pandemic in the US, taken to be from 26 June 2020 to 29 August 2020.

ˆβw,la = ˆβw,lbcoincided. The similarity matrix presented in Figure 7 was constructed based on the second ‘wave’ of the epidemic which occurred in Summer 2020, with clusters identified by applying spectral clustering on the similarity matrix and plotted in order of decreasing within-cluster median pairwise similarity.

The resulting clusters are at once interpretable and interesting. Roughly speaking, the top 3 clusters (‘top’ when ordered according to median pairwise within-cluster agreement) correspond to states that experienced notable pandemic activity in the second, first, and third ‘waves’ of the U.S. coronavirus pandemic, respectively. The first cluster features several southern States (e.g., Georgia, Florida, Texas) which experienced a surge of COVID cases in June–July. The second cluster features east coast states (e.g., New Jersey and New York) which experienced an enormous pandemic toll in April–May. And the third features midwestern states (e.g., Kentucky, Indiana, Nebraska) which had upticks most recently in September-October.

In this work we have introduced a new penalty-based method for performing regression on categorical data. An attractive feature of a penalty-based approach is that it can be integrated easily with existing methods for regression with continuous data, such as the Lasso. Our penalty function is nonconvex, but in contrast to the use of nonconvex penalties in standard high-dimensional regression problems, the nonconvexity here is necessary in order to obtain sparse solutions, that is fusions of levels. Whilst computing the global optimum of nonconvex problems is typically very challenging, for the case with a single categorical variable with several hundred levels, our dynamic programming algorithm can typically solve the resulting optimisation problem in less than a second on a standard laptop computer. The algorithm is thus fast enough to be embedded within a block coordinate descent procedure for handling multiple categorical variables.

We give sufficient conditions for SCOPE to recover the oracle least squares solution when p = 1 involving a minimal separation between unequal coefficients that is optimal up to constant factors. For the multivariate case where p > 1, we show that oracle least squares is a fixed point of our block coordinate descent algorithm, with high probability.

Our work offers several avenues for further work. On the theoretical front, it would be interesting to obtain guarantees for block coordinate descent to converge to a local optimum with good statistical properties, a phenomenon that we observe empirically. On the methodology side, it would be useful to generalise the penalty to allow for clustering multivariate coefficient vectors; such clustering could be helpful in the context of mixtures of regressions models, for example.

H. D. Bondell and B. J. Reich. Simultaneous factor selection and collapsing levels in ANOVA. Biometrics, 65(1):169–177, 2009.

P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232, 2011.

P. Breheny and J. Huang. Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25(2):173–187, 2015.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984.

T. Calinski and L. Corsten. Clustering means in anova by simultaneous testing. Biometrics, pages 39–48, 1985.

J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–71, 2008.

J. Chiquet, P. Gutierrez, and G. Rigaill. Fast tree inference with weighted fusion penalties. Journal of Computational and Graphical Statistics, 26(1):205–216, 2017.

COVID-19 Forecast Hub. URL: https://covid19forecasthub.org, 2020.

D. Dua and C. Graff. UCI machine learning repository, 2019.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

J. Fan, H. Liu, Q. Sun, and T. Zhang. I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error. The Annals of Statistics, 46(2):814–841, 2018.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

J. Gertheiss and G. Tutz. Sparse modeling of categorial explanatory variables. The Annals of Applied Statistics, 4(4):2150–2180, 2010.

T. D. Hocking, A. Joulin, F. Bach, and J.-P. Vert. Clusterpath an algorithm for clustering using convex fusion penalties. In 28th International Conference on Machine Learning, page 1, 2011.

S. Hu, A. O’Hagan, and T. B. Murphy. Motor insurance claim modelling with factor collapsing and bayesian model averaging. Stat, 7(1):e180, 2018.

L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

P. B. Jensen, L. J. Jensen, and S. Brunak. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395, 2012.

Kaggle. Prudential Life Insurance Assessment. URL: https://www.kaggle.com/c/prudential-life-insurance-assessment/data, 2015.

A. Liaw and M. Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002.

P.-L. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664, 2012.

P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16(19):559–616, 2015.

Y. Lu and H. H. Zhou. Statistical and computational guarantees of lloyd’s algorithm and its variants. arXiv preprint arXiv:1612.02099, 2016.

S. Ma and J. Huang. A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410–423, 2017.

A. Maj-Ka´nska, P. Pokarowski, A. Prochenka, et al. Delete or merge regressors for linear model selection. Electronic Journal of Statistics, 9(2):1749–1778, 2015.

R. Mazumder, J. H. Friedman, and T. Hastie. Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.

M.-R. Oelker, W. P¨oßnecker, and G. Tutz. Selection and fusion of categorical predictors with l 0-type penalties. Statistical Modelling, 15(5):389–410, 2015.

D. Pauger and H. Wagner. Bayesian effect fusion for categorical predictors. Bayesian Analysis, 14(2):341–369, 2019.

D. Pauger, M. Leitner, H. Wagner, and G. Malsiner-Walli. effectFusion: Bayesian Effect Fusion for Categorical Predictors, 2019. R package version 1.1.1.

A. Prochenka-So�ltys and P. Pokarowski. DMRnet: Delete or Merge Regressors Algorithms for Linear and Logistic Model Selection and High-Dimensional Data, 2018. R package version 0.2.0.

R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2020.

A. J. Scott and M. Knott. A cluster analysis method for grouping means in the analysis of variance. Biometrics, pages 507–512, 1974.

B. Stokell. CatReg: Solution Paths for Linear and Logistic Regression Models with Categorical Predictors, with SCOPE Penalty https://CRAN.R-project.org/package=CatReg, 2021.

T. Therneau and B. Atkinson. rpart: Recursive Partitioning and Regression Trees, 2019. R package version 4.1-15.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (1):91–108, 2005.

P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.

J. W. Tukey. Comparing individual means in the analysis of variance. Biometrics, pages 99–114, 1949.

G. Tutz and J. Gertheiss. Regularized regression for categorical data. Statistical Modelling, 16 (3):161–200, 2016.

Z. Wang, H. Liu, and T. Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6):2164, 2014.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

T. Zhao, H. Liu, and T. Zhang. Pathwise coordinate optimization for sparse learning: Algorithm and theory. The Annals of Statistics, 46(1):180–218, 2018.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

A.1 Candidate minimiser functions

In this section we give explicit forms of the functions  pk,ras defined in Section 3.1. We write qk,r(x) = arx2 + brx + crfor simplicity, suppressing the subscript  k. For S ⊆ R and a, b ∈ R,we write aS + b for the set  {ax + b : x ∈ S}.

Recall from Section 3.1 that

image

For a function  f : R → R ∪ {∞}, we denote the effective domain of f by

image

For each r = 1, . . . , m(k), there are cases corresponding to t = 1 and t = 2. The formulas are as follows:

image

with dom  uk,r,1 =

image

If  gk(θk+1) = uk,r,1(θk+1), then

image

The second case is

image

Here, if  gk(θk+1) = uk,r,2(θk+1), then

image

Considering (16), we see that we can also have the case where  gk(θk+1) = fk(θk+1). Thus we can form the set of quadratics  pk,rand associated intervals as the set of  uk,r,tas above for t = 1, 2 and the  qk,rthemselves. Note that when  gk(θk+1) = qk,r(θk+1), we have bk(θk+1) = θk+1.

A.2 Algorithm details

Algorithm 1 describes in detail how the optimisation routine works. In the algorithm we make use of the following objects:

for  x ∈ R, A(x) is the active set at x;

E is the set of points at which the active set changes;

N(x) is the intersection set at x;

image

U is a set of tuples (I, r) where I ⊆ Ris an interval and r is an integer, which is dynamically updated as the algorithm progresses.

See Section 3.1.2 for definitions of the sets above. We also use the convention that if  x = −∞then [x, y) = (−∞, y).

All of the  pk,1, . . . , pk,m(k) and Jk,mare computed at the start of each iterate k. We then initialise

image

the set of all of the end-points of the intervals  Jk−1,1, . . . , Jk−1,n(k).

Here x can be thought of as the ‘current position’ of the algorithm; ˜x is used to store when the minimising function  pk−1,r(x)last changed. We initialise ˜x = −∞ and x = −1 + max{y ∈Ik−1,1 : f′k−1(y−) ≤ 0}. This choice of x ensures that the active set A(x) contains only one element (as mentioned in Section 3.1); this will always be the index corresponding to the function ˜qk−1,1.

image

the functions ˜qk,1, . . . , ˜qk,m(k)and their corresponding intervals  Ik,1, . . . , Ik,m(k)that partition R. Finally, we initialise the set N(x) which will contain the intersections between  pk−1,r(x)and other functions in the active set. As the active set begins with only one function, we set N(x) = ∅.

As mentioned in Section 3.1, there are several modifications that can speed up the algorithm. One such modification follows from the fact that for each  r, uk,r,2is a constant function over its effective domain, and their effective domain is a semi-infinite interval (see Section A.1 of the Appendix for their expressions). Therefore, for a given point  x ∈ R, we can remove all such functions from A(x) except for the one taking the minimal value.

We also note that in Algorithm 1, the set N(x) is not recomputed in its entirety at every point x at which A(x) is updated, as is described in Section 3.1. Line 13 shows how sometimes N(x) can instead be updated by adding or removing elements from it. Often, points 3 (i) and 3 (ii) from the description in the Section 3.1 will coincide, and in such instances some calls to ChooseFunction (Algorithm 2) can be skipped.

This supplementary material is organised as follows. In Section S1 we include further details of our algorithm and the proofs of results in Sections 2 & 3. The proofs of Theorems 5 and 6 along with a number of lemmas they require can be found in Section S2. Section S3 contains information regarding simulation settings and additional results for the experiments in Section 6 of the main paper.

S1.1 Remarks on constrained and unconstrained formulations of the univariate objective

It is clear why the identifiability constraint (8) is important when we consider the multivariate problem in Section 3.2. However, for the univariate problem, both constrained and unconstrained formulations of the objective can be clearly defined:

image

As discussed in Section 3.1.1, we can enlarge the feasible set in (30) to be all of  RK: simi-larly to the observation that �k wkˆθuk = ˆµ = �k wk ¯Yk, the minimiser of (30) over all of  RKwill always be in Θ. This can be shown by following the argument at the beginning of the proof of Lemma 10. Therefore the algorithm defined in Section 3.1 can also be applied to the unconstrained formulation of the objective.

It is clear that these problems are essentially identical, as ˆθuis a minimiser of the unconstrained objective if and only if ˆθu − ˆµ1is a minimiser of the constrained objective. Observe that while ˆθu ∈ RK, the solution to the constrained objective is in fact (ˆµ, ˆθc) ∈ R × Θ, whichis the same K-dimensional space only with a different parametersation. In particular, ˆθc isnon-unique if and only if ˆθuis non-unique.

Since one can obtain the solution to the constrained objective by solving the unconstrained one and then reparameterising (and vice versa), we are free to assume without loss of generality that  wT ¯Y = 0, so ˆµ= 0, when solving the univariate problem, and will remark where we do this.

S1.2 Proofs of results in Sections 2 & 3

Proof of Proposition 1. Assume, without loss of generality, that ˆµ= 0. Suppose that there exists  l ̸= ksuch that ˆθk = ˆθl. Without loss of generality we have that ¯Yk ̸= ˆθk (if ¯Yk = ˆθk then¯Yl ̸= ˆθland it can be seen that ˆθ(1) < ¯Yl < ˆθK, in which case swap labels).

Now we construct ˜θby setting ˜θr = ˆθr ∧ ¯Yk for r = 1, . . . , k, and ˜θr = ˆθrotherwise. We haveℓ(ˆµ, ˜θ) < ℓ(ˆµ, ˆθ) and, by convexity of  ρ, it follows that

image

This gives the conclusion  Q(˜θ) < Q(ˆθ), contradicting the optimality of ˆθ.

Proof of Proposition 2. Suppose, for a contradiction, that ˆθk < ˆθl. Then at least one of the following must be true:

image

Let ˜θbe defined as follows. Set ˜θr = ˆθr for all r ̸= k, l. If (32) holds set ˜θk = ˆθland if (33) holds set ˜θl = ˆθk. Observe that

image

and that the squared loss of ˜θis strictly smaller than the squared loss of ˆθ, thus contradicting optimality of ˆθ.

Proof of Proposition 3. In this proof we consider the unconstrained formulation of the objective (31) discussed in Section S1.1. Suppose that (¯Yk)Kk=1 is such that there are two distinct solutions to (12), ˆθ(1) ̸= ˆθ(2). Let us assume that the levels are indexed such that ¯Y1 ≤ · · · ≤ ¯YK. Definek∗ = max{k : ˆθ(1)k ̸= ˆθ(2)k }to be the largest index at which the two solutions take different values and note that we must have ˆθ(r)1 ≤ · · · ≤ ˆθ(r)K .

First consider the case where  k∗ < K. Then

image

for r = 1, 2. We now argue that we must have ˆθ(1)k∗+1 = ˆθ(2)k∗+1 =: t∗ ≥ (ˆθ(1)k∗ ∨ ˆθ(2)k∗ ) + γλ. Indeed,suppose not, and suppose that without loss of generality ˆθ(2)k∗ > ˆθ(1)k∗ . Fix r ∈ {1, 2}. Thedirectional derivative of the objective in the direction of the binary vector with ones at the indices given by  Srand zeroes elsewhere evaluated at ˆθ(r)must be 0. But comparing these for r = 1, 2, we see they are identical except for the term  ρ′(θk∗+1 − ˆθ(r)k∗), which will be strictly larger for r = 2, giving a contradiction. This then implies that both ˆθ(1)k∗ and ˆθ(2)k∗must minimise fk∗ over θ ≤ t∗ − γλsince the full objective value is

image

for r = 1, 2. We also have that when  k∗ = K, both ˆθ(1)k∗ and ˆθ(2)k∗must minimise  fk∗.

image

fk(θk) − 12wk( ¯Yk − θk)2. In particular, properties (i) and (iii) of Lemma 4 hold with  fk replacedby  gk−1. These can be characterised as  gk−1(θk) = ˇqk,r(θk) for θk ∈ Ik,r, where Ik,r are theintervals associated with  fk and ˇqk,r(θk) := qk,r(θk) − 12wk( ¯Yk − θk)2. Note that for each  r, ˇqk,rdepends on the values of ¯Y1, . . . , ¯Yk−1but not that of ¯Yk(observe that  qk,r(θk) includes a term

image

Now as ˆθ(1)k∗ ≤ ˆθ(1)k∗+1 − γλ and ˆθ(2)k∗ ≤ ˆθ(2)k∗+1 − γλ (if k∗ < K), by Lemma 4 (iii) both must be local minima of  fk∗, and we have that there must exist distinct  r1 ̸= r2such that ˆθ(1)k∗ ∈ Ik∗,r1and ˆθ(2)k∗ ∈ Ik∗,r2. Let

image

Since ˆθ(1)k∗must be the minimum of ˇqk∗,r1(θk∗) + 12wk∗( ¯Yk∗ − θk∗)2(and similarly for ˆθ(2)k∗ ), wemust have that

image

This is a quadratic equation in ¯Yk∗, so there are at most two values for which (34) holds. Considering all pairs  r1, r2, we see that in order for there to exist two solutions ˆθ(1) ̸= ˆθ(2), ¯Yk∗must take values in a set of size at most c(K), for some function  c : N → N.

Now let

image

What we have shown, is that associated with each element ( ¯Yk)Kk=1 ∈ S, there is at least one  k∗such that

image

is bounded above by c(K). Now for each  j = 1, . . . , K, let Sjbe the set of ( ¯Yk)Kk=1 ∈ S for whichthe there exists a  k∗ with the property above and  k∗ = j. Note that  ∪jSj = S. Now Sj ⊂ RKhas Lebesgue measure zero as a finite union of graphs of measurable functions  f : RK−1 → R.Thus S has Lebesgue measure zero.

Proof of Lemma 4. Assume, without loss of generality, that ˆµ= 0. We proceed inductively, assuming that the properties (i) and (iii) hold for  fk, and (ii) holds for  bk+1. Additionally we include in our inductive hypothesis that for all  x, f′k(x−) ≥ f′k(x+), where we define  f′k(x−) andf′k(x+) to be the left-derivative and right-derivative of  fk at x, respectively. We note that these trivially hold for the base case  f1, and the case  b2can be checked by direct calculation.

We first prove (i), that  fk+1is continuous, coercive, and piecewise quadratic and with finitely many pieces. We then show that  f′k+1(x−) ≥ f′k+1(x+) for all x, which allows us to show that (iii) holds for  fk+1. Finally, we use these results to show that (ii) holds for  bk+2.

We now show that  fk+1is coercive and continuous. Clearly  gk(x) ≥ miny≤x fk(y), so itfollows that  gk(x) → ∞ as x → −∞ as fkis coercive. Furthermore  gkis bounded from below as  fkis coercive and continuous. Thus since  fk+1(x) = gk(x) + 12wk+1( ¯Yk+1 − x)2, it follows that  fk+1is coercive. Next as  gk(x) = miny≤x fk(y) + ρ(y − x), and fk and ρare continuous, it follows that  gkis continuous and therefore that  fk+1is continuous.

To see why  fk+1is piecewise quadratic with finitely many pieces, we observe that it can be written  fk+1(x) = fk(bk+1(x)) + ρ(x − bk+1(x)) + 12wk+1( ¯Yk+1 − x)2. We have by our inductive hypothesis that  fkis piecewise quadratic and  bk+1(x) is piecewise linear, both with finitely many pieces. Since the composition of a piecewise linear function inside a piecewise quadratic function is piecewise quadratic, the remainder of (i) is shown.

We now turn our attention to (iii), and define for  x ∈ R:

image

We will first show that  f′k+1(x+) ≤ f′k+1(x−) for all x ∈ R. Suppose that we are increasing x and we have reached a point where  gk(x) is not differentiable (that is, the left-derivative and the right-derivative do not match). By assumption (ii) for  bk+1, we can assume that there is some window  δ >0 such that  y∗(t) is linear for  t ∈ (x − δ, x), say y∗(t) = α + βt.

In order to proceed with the following argument, we must show that for sufficiently small ϵ >0, we have  α + β(x + ϵ) ≤ x + ϵ. If α + βx < x, this is immediate. Therefore it remains to consider the case  α + βx = x, for which we show that we must have  α = 0 and β = 1, i.ey∗(t) = t for t ∈ (x−δ, x). This follows from the observation that if  y∗(t) < t, then for all  t1 > twe have  y∗(t1) /∈ (y∗(t), t]. Indeed, suppose not, then

image

Note that  fkhas both left-derivatives and right-derivatives at every point in R. Suppose first that  β ≥0, and we observe that

image

Then by the basic definition of the right-derivative,

image

where the last inequality follows from our inductive hypothesis that  f′k(y+) ≤ f′k(y−) for ally ∈ R. An analogous argument shows that the same conclusion holds when  β < 0.

Now we use this to prove the claim. Because there are no points of  fk+1at which the left-derivative is less than the right-derivative, without loss of generality we claim that  fk+1is differentiable at  y∗(x) for all x, unless y∗(x) = x. Indeed, suppose not, then we have that f′k+1(y∗(x)−) > f′k+1(y∗(x)+) and necessarily that defining  h(y) := fk+1(y) + ρ(x − y), we have0  ∈ ∂h(y∗(x)). But since  h(y∗(x)+) < h(y∗(x)−), we contradict the optimality of  y∗(x) as thispoint is in fact a local maximum.

We finally consider claim (ii). By (iii), we have that for every point  x, y∗(x) is either x or at the minimum of one of the quadratic pieces of  fk+1(·) + ρ(x − ·). In either case, we have that y∗(x) is linear in  x and thus fk+1(y∗(x))+ρ(x−y∗(x)) is quadratic in x. We can define  gk+1(x)pointwise as the minimum of this finite set of quadratic functions of x, whose expressions are given in Appendix A.1. Importantly, the coefficients in the linear expression  y∗(x) of x dependonly on which of these functions is the minimum at x. As the number of intersections between elements in this set of quadratic functions is bounded above by twice the square of the size of the set, we can conclude that  bk+2(x) is piecewise linear and with a finite number of pieces, thus concluding the proof.

S1.3 Computation time experiments

A small experiment was performed to demonstrate the runtimes one can expect in practice for the univariate problem. Note that this clustering is applied iteratively in the block coordinate descent procedure we propose to use in multivariate settings. We considered 3 settings: one with no signal, one with 2 true clusters and one with 5 true clusters. Independent and identically distributed Gaussian noise was added to each of the subaverages. As in Section 6.3 the number of categories was increased by random splitting of the levels. Each of these tests were repeated 25 times, on a computer with a 3.2GHz processor. The results are shown in Figure 8.

image

Figure 8: Computation times for solving the univariate problem.

S1.4 Discretised algorithm

For very large-scale problems, speed can be improved if we only allow coefficients to take values in some fixed finite grid, rather than any real value. Below we describe how such an algorithm would approximately solve the univariate objective (12). We will use the unconstrained objective as discussed in Section S1.1. We would first fix L grid points  ϑ1 < · · · < ϑL, and then proceed as described in Algorithm 3.

This algorithm has the same basic structure to the approach we use in Section 3.1 for computing the exact global optimum. The difference is that now, instead of as in (14), we define  fkin the following way:

image

The objects F and B play analogous roles to  fk and bkin Section 3.1. Since we restrict θk ∈ {ϑ1, . . . , ϑL}, we only need to store the values that  fktakes at these L values; this is the purpose of the vector F in Algorithm 3. Similarly, the rows  B(k, ·) serve the same purpose as the functions  bkwhere, again, we only need to store L values corresponding to the different options for  θk.

image

This algorithm returns the optimal solution ˆθto the objective where each of the coefficients are restricted to take values only in  {ϑ1, . . . , ϑL}. We must ensure that the grid of values has fine enough resolution that interesting answers can be obtained, which requires L being sufficiently large. The number of clusters obtained by this approximate algorithm is bounded above by L, so this must not be chosen too small.

One can see that the computational complexity of this algorithm is linear in K, with a total of  O(KL2) operations required. This is of course in addition to the O(n) operations needed to compute  w1, . . . , wK and ¯Y1, . . . , ¯YKbeforehand. In particular, choosing  L ≲√Kguarantees that the complexity of this algorithm is at worst quadratic in K.

S2.1 Proof of Theorem 5

The proof of Theorem 5 requires a number of auxiliary lemmas, which can be found in Section S2.1.1.

Let us define  Ri = Yi − ˆµ for i = 1, . . . , n, and ¯Rk = 1nk�ni=1 1{Xi=k}Ri for k = 1, . . . , K. Notethat

image

where we define  v(k) ∈ Rn by v(k)i = 1nk 1{Xi=k}. Since Pis an orthogonal projection matrix, we have that  ∥Pv(k)∥2 ≤ ∥v(k)∥2 = 1√nk. It follows that  v(k)T Pεis sub-Gaussian with parameter σ/√nk. Applying the standard sub-Gaussian tail bound, we obtain

image

where recall that  wk = nk/n. Therefore, we have that

image

In the following we work on the intersection Λ :=  ∩Kk=1Λk. This entails that for each  k, | ¯Rk −θ0k| < √ηγ∗sλ/2. We now relabel indices such that ¯R1 ≤ · · · ≤ ¯RK, and so from Proposition 2 that ˆθ1 ≤ · · · ≤ ˆθK. Since our assumption (24) implies ∆(θ0) ≥ √ηγ∗sλ, it follows that on Λ the observed ordering is consistent with the ordering of the true coefficients, i.e. there exist 0 =  k0 < k1 < · · · < ks = K such that

image

Indeed, observe that for  j = 1, . . . , s −1, we have by the triangle inequality and (24), the stronger property that

image

Our optimisation objective is therefore

image

Since ¯Rkj − ¯Rkj−1+1 < √ηγ∗sλ for j = 1, . . . , s, it follows from Lemma 8 that ˆθkj+1 − ˆθkj ≥ γλfor  j = 1, . . . , s − 1, so

image

Observe that we can have  kj−1 + 1 > kj −1 for some j, in which case we take the sum over that range to be zero. Note that (40) can be optimised over (θkj−1+1, . . . , θkj) separately for each j = 1, . . . , s. If s = 1, i.e. the true signal is zero, then the result follows from Lemma 10. Now we see what happens when s > 1.

Without loss of generality, consider j = 1 and note that if  k1= 1 it is immediate that ˆθ1 = ˆθ01. Hence, we can assume that  k1 >1. We note that ˆθ01 = �k1k=1 wk ¯Rk/w01, where wedefine  w0k = n0k/n. We see that our goal is to compute

image

where  1 ∈ Rk1 is a vector of ones and ˜Rk := ¯Rk − ˆθ01 for k = 1, . . . , k1. Note that we subtract ˆθ01 to ensure that

image

as required for application of Lemma 10. We have by assumption that for  k ∈ 1, . . . , k1,| ˜Rk| ≤ √ηγ∗sλ/2 ≤ (2 ∧�w01γ)λ/w01. Thus, Lemma 10 can be applied with ˇw = w01 and itfollows that ˆθk = ˆθ01 for k = 1, . . . , k1.

S2.1.1 Auxiliary lemmas

Here we prove a number of results required to obtain conditions for recovering the oracle least squares estimate in the univariate case. Lemma 10 gives conditions for recovery of the true solution, in the case where there is zero signal. Lemmas 8 and 9 ensure that the true levels are far enough apart that they can be separated. Once we have this separation, we apply Lemma 10 on each of the levels to obtain the solution.

Lemma 7. Consider the optimisation problem

image

where  τ > 0 and κ ∈ (0, 1]. Suppose further that  τ < (1∧√κγ)λ/2κ. Then x∗ = 0is the unique optimum.

Proof. We first observe that

image

For convenience, we define  F(x) := (2τ − x)2/2 + ρκγ,λ/κ(x). It now suffices to show that F is uniquely minimised at 0 provided  τ < (1 ∧ √κγ)λ/2κ. We can clearly see that  x∗ ∈ [0, 2τ].Equation (2.3) of Breheny and Huang [2011] gives the result when  κγ ≥ 1.

When  κγ <1, we see that any stationary point of  F in [0, γλ ∧ 2τ] must be a maximum, since on this interval F(x) is a quadratic function with a negative coefficient of  x2. Therefore its minimum over [0, γλ] is attained at either  x = 0 or x = γλ ∧ 2τ. If 2τ ≤ γλ, then it suffices to check that  F(0) < F(2τ). This holds if and only if  τ < γλ/(γκ+ 1), but since we are assuming τ ≤ γλ/2 and κγ <1, this is always satisfied.

If  γλ < 2τ, then we can see that the minimum of  F over [γλ, 2τ] will be attained at exactly 2τ. Thus, here it also suffices to check  F(0) < F(2τ), which holds if and only if  τ <�γ/κλ/2.The final bound  τ < (1 ∧ √κγ)λ/2κfollows from combining the results for these cases.

The following is a deterministic result to establish separation between groups of coefficients.

Lemma 8. Consider the setup of Theorem 5, and assume that ˆµ = 0. Suppose that ¯Y1 ≤ · · · ≤¯YK, and that for j = 1, . . . , s we have

image

where  kj and kj−1are as defined in (36). Suppose further that for  j = 1, . . . , s − 1,

image

Then for  j = 1, . . . , s, we have ¯Ykj−1+1 ≤ ˆθkj−1+1 ≤ ˆθkj ≤ ¯Ykj.

Proof. For convenience, within this lemma we define  ζ := √ηγ∗sλ. Recall that the objective function which ˆθoptimises takes the form

image

We first claim that ˆθk ∈ [ ¯Y1, ¯YK] for k = 1, . . . , K. To see this, suppose that this is not the case and define ˇθby projecting ˆθ onto [ ¯Y1, ¯YK]K (i.e. ˇθk = ¯YK ∧ ( ¯Y1 ∨ ˆθk) for k = 1, . . . , K). Thepenalty contribution from ˇθis no larger than that of ˆθ, and the loss contribution is strictly smaller, so we obtain the contradiction  Q(ˇθ) < Q(ˆθ).

We now proceed to show that for  j = 1, . . . , s −1, we have ˆθkj ≤ ¯Ykj and ˆθkj+1 ≥ ¯Ykj+1. Weprove the first of these sets of inequalities, since the second follows similarly by considering the problem with  −ˆθ, − ¯Yand reversing the indices. Suppose, for contradiction, that there exists some  j in {1, . . . , s − 1} with ˆθkj > ¯Ykj. Let this jbe minimal, such that for all l < j we have ˆθkl ≤ ¯Ykl.

Next define  l1to be the maximal element of  {kj−1 + 1, . . . , kj − 1}such that ˆθl1 ≤ ¯Ykj.Similarly, we define  l2 ∈ {kj + 1, . . . , kj+1}to be minimal such that ˆθl2 ≥ ¯Ykj+1. The existence of  l1 and l2is guaranteed by Lemma 9.

We note that for  l = l1 + 1, . . . , kj, ˆθl = ˆθkjand hence ( ¯Yl − ˆθl)2 ≥ ( ¯Ykj − ˆθl)2 = ( ¯Ykj − ˆθkj)2.This can be shown by contradiction, as in (55). For such l, we have from optimality of ˆθ that¯Yl − ˆθl1 ≥ ˆθkj − ¯Yl(otherwise one could improve the objective by setting ˆθl1 = ˆθl) which implies that ˆθl1 < ¯Yl. From this it follows that ( ¯Yl − ˆθl1)2 ≤ ( ¯Ykj − ˆθl1)2, since ˆθl1 < ¯Yl ≤ ¯Ykj.

Similarly, if  l2 > kj+ 1, then for  l = kj + 1, . . . , l2 −1 we have ˆθl = ˆθkj+1 and hence( ¯Yl − ˆθl)2 ≥ ( ¯Ykj+1 − ˆθl)2 = ( ¯Ykj+1 − ˆθkj+1)2. For such l, it follows that ˆθl2 > ¯Yland therefore that ( ¯Yl − ˆθl2)2 ≤ ( ¯Ykj+1 − ˆθl2)2.

Now, we define

image

We also define ˜θ ∈ RK according to

image

image

Thus,

image

We specify the infimum in (47) because ( ¯Ykj, ˆθl2] is not closed, and let (am) be a convergent sequence in ( ¯Ykj, ˆθl2] whose limit attains this infimum. We define  a∗ = limm→∞ am.

By assumption (43), at least one of (a∗ − ˆθl1) and (ˆθl2 − a∗) is greater than or equal to  γλ.Here, we use that the separation (43)  ≥ 2γλ. If ˆθl2 − a∗ ≥ γλthen we denote this case (A1) and (45) becomes

image

We define ˜a∗ to be the minimiser over ˜a of (47). We can observe that since ¯Ykj − ˆθl1 < ζ andζ < (1 ∧ �γ ˜wkj)λ/ ˜wkj, we have ¯Ykj − ˆθl1 < (1 ∧ �γ ˜wkj)λ/ ˜wkj. Thus, we have by Lemma 7 that the uniquely optimal ˜a∗ = ˆθl1. This gives that the value of (47) is zero.

It is straightforward to see from (46) that  a∗ = ¯Ykjmust be the unique limit of (am). As wehave assumed that ˆθkj > ¯Ykjand the infimum is not attained in (¯Ykj, ¯Ykj+1), the inequality in line (46) can be made strict. It follows that  Q(ˆθ) > Q(˜θ).

Thus, it remains for us to consider the case where ˆθl2 −a∗ < γλ, which implies that  a∗− ˆθl1 ≥γλ. We denote this case (A2). Now, from (45) we can obtain

image

The objective is piecewise quadratic (and continuously differentiable), with two pieces: [ˆθl1, ˆθl2 − γλ] and (ˆθl2 − γλ, ˆθl2]. On the first region, the objective is a convex quadratic with minimum at ¯Ykj ∈ [ˆθl1, ˆθl2 − γλ].

By the assumption that  a∗ > ˆθl2 − γλ, we know that the objective must be concave on (ˆθl2 − γλ, ˆθl2]. It is clear that the derivative of the objective at ˆθl2 − γλis positive. Hence, if ˜a∗ = ˆθl2−γλ, then the objective will take a strictly lower value at some ˜a∗ ∈ (ˆθl2−γλ−ϵ, ˆθl2−γλ)(for some small  ϵ >0), contradicting optimality of ˜a∗. It therefore follows that ˜a∗ = ˆθl2.

With this knowledge, we can further simplify (48) to obtain

image

The second inequality follows from ¯Ykj − ˆθl1 ≤ ζ and ˆθl2 − ¯Ykj > ζ. Hence, we obtain that Q(ˆθ) > Q(˜θ).

We now we direct our attention towards case (B), where similarly to before we observe that the penalty contributions between  l1 and l2 in Q(ˆθ) are

image

Similarly to (44) in case (A), we obtain

image

We specify the infimum in (50) because ( ¯Ykj, ¯Ykj+1) is not closed and therefore a minimum may not exist. Let (am, bm) be a convergent sequence in ( ¯Ykj, ¯Ykj+1) whose limit achieves this infimum. We now define (a∗, b∗) = limm→∞(am, bm). By assumption (43), we know that ¯Ykj+1−¯Ykj ≥ 3γλ, which implies that ˆθl2 − ˆθl1 ≥ 3γλ. Thus, one of  {(ˆθl2 − b∗), (b∗ − a∗), (a∗ − ˆθl1)}must be at least  γλ.

We first consider if  b∗ − a∗ ≥ γλ, and denote this case (B1). Here, (50) becomes

image

Q(ˆθ) − Q(˜θ) ≥ inf¯Ykj <a≤b< ¯Ykj+1

image

We can observe that (52) is the sum of two copies of (46) in case (A1). Hence, by following the same arguments as before, we see that  Q(ˆθ) > Q(˜θ).

It therefore remains for us to obtain the result in the case that  b∗ − a∗ < γλ, and we denote this case (B2). Using that the separation (43)  ≥ 3γλ + 2ζ, it is straightforward to see that one of ( ¯Ykj+1 − b∗) and (a∗ − ¯Ykj) must be at least  γλ + ζ. By the symmetry of the problem, it is sufficient for us to consider the case where ¯Ykj+1 − b∗ ≥ γλ + ζ. In this case, we can obtain from (50) that

image

where  B =�(˜a,˜b): ˆθl1 ≤ ˜a ≤ ˜b ≤ ¯Ykj+1 − γλ − ζ, ˜b − ˜a < γλ�. From this, we can extract the terms dependent on ˜b to obtain

image

This objective is piecewise quadratic (and continuously differentiable), with two pieces; [˜a∗, ˜a∗+γλ) and [˜a∗+γλ, ˆθl2]. Over the second region, the objective is a convex quadratic with minimum at ¯Ykj+1 ∈ [˜a∗ + γλ, ˆθl2]. By following the same argument as for (48) in case (A2), we see that ˜b∗ = ˜a∗.

With this knowledge, we can further simplify (53) to obtain

image

Since ¯Ykj+1 − ˜a∗ > ζ, we can see that ( ¯Ykj+1 − ˜a∗)2 − ( ¯Ykj+1 − ˆθl2)2 >0. Thus, it suffices for us to show that

image

This objective is exactly as in (47) in case (A1), minimised over a smaller feasible set. Hence, it follows immediately that this holds and we can conclude that  Q(ˆθ) > Q(˜θ).

We now have for all cases that  Q(ˆθ) > Q(˜θ), which contradicts the optimality of ˆθ. Thus,we can conclude that for  j = 1, . . . , s, ˆθkj ≤ ¯Ykj and ˆθkj−1+1 ≥ ¯Ykj−1+1.

Lemma 9. Consider the setup of Lemma 8. For each j = 1, . . . , s, there exists  k∗j in {kj−1 +1, . . . , kj} such that ˆθk∗j ∈ [ ¯Ykj−1+1, ¯Ykj].

Proof. We first show that if ˆθkj > ¯Ykj, then for any  k with kj−1 + 1 ≤ k ≤ kj, if ˆθk > ¯Ykj thenˆθk = ˆθkj.

We prove the first case since the proof for the second is identical. Suppose that this does not hold, i.e. ˆθkj > ¯Ykjand there exists some (minimal)  k in {kj−1 + 1, . . . , kj − 1} with

¯Ykj < ˆθk < ˆθkj. Then we construct ˇθ by

image

We observe that the penalty contribution from ˇθis no more than that of ˆθand that the quadratic loss for ˇθwill be strictly less than that of ˆθ. This gives us that  Q(ˇθ) < Q(ˆθ), contradicting the optimality of ˆθ.

Similarly, if ˆθkj−1+1 < ¯Ykj−1+1then the corresponding statement that for any  k with kj−1 +1  ≤ kj, if ˆθk < ¯Ykj−1+1 then ˆθk = ˆθkj−1+1.

We now establish a simple preliminary result. Suppose that for some j in {1, . . . , s} there exists  k in {kj−1 + 1, . . . , kj} with ˆθk /∈ [ ¯Ykj−1+1, ¯Ykj], such that �{l: ˆθl=ˆθk} wl ≥ η/2s. Weclaim that if ˆθk > ¯Ykj then ˆθk ≤ ¯Ykj + (�2s/η√γλ ∨ γλ). Similarly, if ˆθk < ¯Ykj−1+1 thenˆθk ≥ ¯Ykj−1+1 − (�2s/η√γλ ∨ γλ).

To prove the claim, we consider the case ˆθk > ¯Ykj(the other is identical). By the first observation, if ˆθl > ¯Ykj for l in {kj−1 + 1, . . . kj} then ˆθl = ˆθk. Now, for contradiction, suppose ˆθk > ¯Ykj + (�2s/η√γλ ∨ γλ) and let this k be minimal. Then we can construct ˇθ by

image

By appealing to the optimality of ˆθ, we can easily observe that ˆθk−1 ≤ ¯Yk−1and therefore that the ordering of the entries of ˇθmatches that of ˆθ. Here, we use that (�2s/η√γλ ∨ γλ) ≥ γλ.

image

We will without loss of generality take the second statement to be true (the proof for the first case follows identically). Let  k′ denote the minimal element in  {kj−1 + 1, . . . , kj} such thatˆθk′ = ˆθkj. From the preliminary result established earlier, ˆθkj ≤ ¯Ykj + (�2s/η√γλ ∨ γλ). Byappealing to the optimality of ˆθ, we see that ˆθkj+1 < ˆθkj + γλ(otherwise, we could take ˆθkj tobe ¯Ykjand strictly reduce the value of the objective).

Now, we will use that the separation is at least 2(�2s/η√γλ ∨ γλ) + γλ. By our earlier observation (55), it is clear that any  l ∈ {kj + 1, . . . , kj+1} with ˆθl < ¯Ykj+1 has ˆθl = ˆθkj+1. Notethat since ˆθkj+1 − ¯Ykj < (�2s/η√γλ ∨ γλ) + γλ, it follows that ¯Ykj+1 − ˆθkj+1 > (�2s/η√γλ ∨γλ)+ζand therefore that �{k: ˆθk=ˆθkj+1} wk < η/2sby the preliminary result. Since  w0min ≥ η/s

and separation (43)  ≥ 2(�2s/η√γλ∨γλ)+γλ+ζ, we can define  l′ ∈ {kj +1, . . . , kj+1} minimalsuch that ˆθl′ ≥ ¯Ykj+1.

Now, in order to contradict the optimality of ˆθwe construct a new feasible point ˜θby setting

image

It follows that for  l = kj + 1, . . . , l′ − 1 we have

image

It is also straightforward to see that  |ˆθkj − ¯Yl| ≥ | ¯Ykj − ¯Yl| for l = k′, . . . , kj. If follows that the loss contribution in  Q(˜θ) is strictly less than that in  Q(ˆθ). Hence, using ˆθl′ − ˆθkj > γλ, weobtain

image

contradicting the optimality of ˆθ. We conclude that for j = 1, . . . , s, there exists  k∗j in {kj−1 +1, . . . , kj}such that ˆθk∗j ∈ [ ¯Ykj−1+1, ¯Ykj].

Lemma 10. Consider the univariate objective (11), relaxing the normalisation constraint to ˇw := �k wk ≤ 1. Suppose that  wT ¯Y = 0, and that ∥ ¯Y ∥∞ < (2 ∧ √γ ˇw) λ/ ˇw. Then ˆθ = 0.

Proof. Let Pw = I − 1wT / ˇw and Dw ∈ RK×K be the diagonal matrix with entries  Dkk√wk.First note that

image

Thus for all  θ ∈ RK, we have

image

Consider minimising  F over RK × [−τ, τ]K × S, where S ⊆ RK is the unit simplex scaled by ˇw. We aim to show this minimum is 0. As with the first claim in the proof of Lemma 8, it is straightforward to see that for any feasible (θ, ξ, w), there exists  θ′ with ∥θ′∥∞ ≤ ∥ξ∥∞ and

image

As on the RHS we are minimising a continuous function over a compact set, we know a minimiser must exist. Let (˜θ, ˜ξ, ˜w) be a minimiser (to be specified later). Observe that

image

is linear as a function of  ξ. Hence it is minimised over the set

image

at some point in  {−τ, τ}K. Here conv(·) denotes the convex hull operation. We thus have

image

Since the penalty contribution from ˇθis not greater than that of ˜θ, it follows that  Q(ˇθ) ≤ Q(˜θ).Thus, we can assume that entries of ˜θcan take one of only two distinct values.

Next we write ˜α = �k:˜ξk=−τ ˜wkand observe that ˜wT ˜ξ = ( ˇw − 2˜α)τ. Let us set  s = mink ˜θkand  x = maxk ˜θk − mink ˜θk. Then we have

image

In the second line above, we have solved for s to find that

image

In the third line above, we have solved for ˜αto obtain ˜α = ˇw/2 and hence ˜α( ˇw − ˜α)/ ˇw = ˇw/4.These follow from optimality of ˜θ and ˜wrespectively. The result follows from applying Lemma 7, setting  κ = ˇw/4.

S2.2 Proof of Theorem 6

We begin by defining  P 0 to be the orthogonal projection onto the linear space

image

The residuals from the oracle least-squares fit are (I − P 0)Y = (I − P 0)ε. The partial residuals R(j) as defined in (18) for the jth variable are therefore

image

For j = 1, . . . , p, we define ¯R(j)k = �ni=1 1{Xij=k}R(j)i /njk for k = 1, . . . , Kj, reordering the labels such that ¯R(j)1 ≤ · · · ≤ ¯R(j)Kj. We then aim to apply the arguments of Theorem 5 to ˆθjdefined by

image

In order to do this, we define the events (for some  τjto be determined later):

image

On the intersection of events  ∩Kjk=1Λ(2)jk , we have that  | ¯R(j)k −ˆθ0jk| < √ηγ∗jsjλj/2. By followingan identical approach to that involved in computing (35), we have that

image

where we recall that  wjk = njk/n.

We now turn our attention to the event Λ(1)j . Note that if  sj= 1, then this is immediately satisfied since ˆθ0j = θ0j = 0. If sj >1, we use that the oracle least squares estimate ˆθ0 = AY isa linear transformation A of the responses (Yi)ni=1. For each i = 1, . . . , n, Yihas an independent (non-central) sub-Gaussian distribution with parameter  σ. Therefore for each  k = 1, . . . , Kj,ˆθ0jk − θ0jk also has a sub-Gaussian distribution, with parameter at most  σc−1/2min(recalling that cmin = (maxl(AAT )ll)−1). This enables us to show that

image

We can now set  τj = √ηγ∗jsjλj/2. From (26) and the triangle inequality, on the event Λ(1)jwe have that

image

Thus, on the intersection of events Λ(1)j ∩�∩Kjk=1Λ(2)jk�, we can proceed as in the proof of Theo-

rem 5 from (38), to conclude that ˆθj = ˆθ0j.

It immediately follows that on the intersection of events  ∩pj=1�Λ(1)j ∩�∩Kjk=1Λ(2)jk��, we have

ˆθ = ˆθ0. By a union bound, this occurs with probability at least

image

where in the final line we use  sj ≤ Kj.

S3.1 Details of methods

Tree-based methods

We used the implementation of the random forest procedure [Breiman, 2001] in the R package randomForest [Liaw and Wiener, 2002] with default settings. CART [Breiman et al., 1984] was implemented in the R package rpart [Therneau and Atkinson, 2019], with pruning according to the 1-SE rule (as described in the package documentation).

CAS-ANOVA

The CAS-ANOVA estimator ˆθcasoptimises over (µ, θ) a sum of a squared loss term (3) and an all-pairs penalty term (4). In particular, Bondell and Reich [2009] consider two regimes of weight vectors w. The first is not data-dependent and sets  wj,k1k2 = (Kj + 1)−1√njk1 + njk2.The second, ‘adaptive CAS-ANOVA’, uses the ordinary least squares estimate for  θto scale the weights. Here,  wj,k1k2 = (Kj + 1)−1√njk1 + njk2|ˆθOLSjk1 − ˆθOLSjk2 |−1.Here we introduce a new variant of adaptive CAS-ANOVA, following ideas in B¨uhlmann and Van De Geer [2011] for a 2-stage adaptive Lasso procedure. Instead of using the ordinary least squares estimate ˆθOLSin the above expression, an initial (standard) CAS-ANOVA estimate is used to scale the weights, with  λselected for the initial estimate by 5-fold cross-validation. In simulations, this outperformed the adaptive CAS-ANOVA estimate using ordinary least squares initial estimates so in the interests of time and computational resources this was omitted from the simulation study. Henceforth adaptive CAS-ANOVA will refer to this 2-stage procedure.

The authors describe the optimisation of ˆθcasas a quadratic programming problem, which was solved using the R package rosqp [Anderson, 2018]. Here we used our own implementation of the quadratic programming approach described by the authors. We found it considerably faster than the code available from the authors’ website, and uses ADMM-based optimisation [Boyd et al., 2011] tools not available at the time of its publication. We also found, as discussed in Section 5.1 of Maj-Ka´nska et al. [2015], that we could not achieve the best results using the publicly available code. Lastly, using our own implementation allowed us to explore a modification of CAS-ANOVA using the more modern approach of adaptive weights via a 2-stage procedure [B¨uhlmann and Van De Geer, 2011] to compare SCOPE to a wider class of all-pairs penalty procedures.

For large categorical variables, solutions are slow to compute and consume large amounts of memory. In the case of binary response, CAS-ANOVA models were fitted iterating a locally quadratic approximation to the loss function.

DMR

The DMR algorithm [Maj-Ka´nska et al., 2015] is implemented in the R package DMRnet [Prochenka- So�ltys and Pokarowski, 2018]. The degrees of freedom in the model is decided by 5-fold cross-validation. It is based on pruning variables using the Group Lasso [Yuan and Lin, 2006] to obtain at a low-dimensional model, then performing backwards selection based on ranking t-statistics for hypotheses corresponding to each fusion between levels in categorical variables.

The cross-validation routine appeared to error when all levels of all categorical variables were not present in one of the folds. In Section 6.2, cross-validation was therefore not possible so model selection was performed based on Generalized Information Criterion (GIC) [Zheng and Loh, 1995]. In all other examples, models were selected via 5-fold cross-validation.

Bayesian effect fusion

In Section 6.1.1 we include Bayesian effect fusion [Pauger and Wagner, 2019], implemented in the R package effectFusion [Pauger et al., 2019]. Coefficients within each categorical variable were modelled with a sparse Gaussian mixture model. The posterior mean was estimated with 1000 samples.

Lasso

In Section 6.1.2 we also include Lasso [Tibshirani, 1996] fits, to serve as a reference point. Of course, this is unsuitable for models where levels in categorical variables should be clustered together, but the advanced development of the well-known R package glmnet [Friedman et al., 2010] nevertheless sees its use in practice.

In order to make the fit symmetric across the categories within each variable, models were fitted with an unpenalised intercept and featuring dummy variables for all of the categories within each variable. This is instead of the corner-point dummy variable encoding of factor variables that is commonly used when fitting linear models. Models are fitted and cross-validated with cv.glmnet using the default settings.

SCOPE

For SCOPE, we have provided the R package CatReg [Stokell, 2021]. The univariate update step (see Section 3.1) is implemented in C++ using Rcpp [Eddelbuettel and Fran¸cois, 2011], with models fitted using a wrapper in R. For the binary response case, the outer loop to iterate the local quadratic approximations in the proximal Newton algorithm are done within R. In the future, performance could be improved by iterating the univariate update step (and the local quadratic approximations, as in Sections 6.2 and 6.3) within some lower-level language. In higher-dimensional experiments, SCOPE was slowed by cycling through all the variables; an active-set approach to this could make it faster still.

S3.2 Further details of numerical experiments

For the experiments in Section 6.1, we define the signal-to-noise ratio (SNR) as  σS/σ, where σSis the standard deviation of the signal  Y − ε, and σis the standard deviation of the noise  ε.

image

Figure 9: Prediction performance of various methods: (A) SCOPE-8; (B) SCOPE-32; (C) SCOPE-CV; (D) Linear regression; (E) Oracle least squares; (F) CAS-ANOVA; (G) Adaptive CAS-ANOVA; (H) DMR; (I) BEF; (J) CART; (K) RF. Note that some ‘boxes’ are not visible in some of the plots; this is due to the MSPE in the tests being beyond the range of the plot.

S3.2.1 Low-dimensional simulations

In Table 7 we include details of computation time and dimension of the fitted models. Figure 9 visualises the results also summarised in Table 1 in the main paper.

image

Table 7: Mean fitted model dimension and computation time for the various methods.

image

Figure 10: Prediction performance of various methods: (A) SCOPE-8; (B) SCOPE-32; (C) SCOPE-CV; (D) Oracle least squares; (E) DMR; (F) CART; (G) RF; (H) Lasso. Note that some ‘boxes’ are not visible in some of the plots; this is due to the MSPE in the tests being beyond the range of the plot.

S3.2.2 High-dimensional simulations

Here we include additional results relating to the high-dimensional experiments. Figure 10 visualises the results in Table 2 of the main paper.

image

Table 8: Mean computation time (s)

image

Table 9: Mean fitted model dimension

image

Table 10: Proposition of times each  γwas selected by cross-validation.

E. Anderson. rosqp: Quadratic Programming Solver using the ’OSQP’ Library, 2018. R package version 0.1.0.

H. D. Bondell and B. J. Reich. Simultaneous factor selection and collapsing levels in ANOVA. Biometrics, 65(1):169–177, 2009.

S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.

P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232, 2011.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984.

P. B¨uhlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.

D. Eddelbuettel and R. Fran¸cois. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1–18, 2011.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

A. Liaw and M. Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002.

A. Maj-Ka´nska, P. Pokarowski, A. Prochenka, et al. Delete or merge regressors for linear model selection. Electronic Journal of Statistics, 9(2):1749–1778, 2015.

D. Pauger and H. Wagner. Bayesian effect fusion for categorical predictors. Bayesian Analysis, 14(2):341–369, 2019.

D. Pauger, M. Leitner, H. Wagner, and G. Malsiner-Walli. effectFusion: Bayesian Effect Fusion for Categorical Predictors, 2019. R package version 1.1.1.

A. Prochenka-So�ltys and P. Pokarowski. DMRnet: Delete or Merge Regressors Algorithms for Linear and Logistic Model Selection and High-Dimensional Data, 2018. R package version 0.2.0.

B. Stokell. CatReg: Solution Paths for Linear and Logistic Regression Models with Categorical Predictors, with SCOPE Penalty https://CRAN.R-project.org/package=CatReg, 2021.

T. Therneau and B. Atkinson. rpart: Recursive Partitioning and Regression Trees, 2019. R package version 4.1-15.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

X. Zheng and W.-Y. Loh. Consistent variable selection in linear models. Journal of the American Statistical Association, 90(429):151–156, 1995.


Designed for Accessibility and to further Open Science