Categorical data arise in a number of application areas. For example, electronic health data typically contain records of diagnoses received by patients coded within controlled vocabularies and also prescriptions, both of which give rise to categorical variables with large numbers of levels [Jensen et al., 2012]. Vehicle insurance claim data also contain a large number of categorical variables detailing properties of the vehicles and parties involved [Hu et al., 2018]. When performing regression with such data as covariates, it is often helpful, both for improved predictive performance and interpretation of the fit, to fuse the levels of several categories together in the sense that the estimated coefficients corresponding to these levels have exactly the same value.
To fix ideas, consider the following ANOVA model relating response vector to categorical predictors
Here the are independent zero mean random errors,
is a global intercept and
contribution to the response of the kth level of the jth predictor; we will later place restrictions on the parameters to ensure they are identifiable. We are interested in the setting where the coefficients corresponding to any given predictor are clustered, so defining
we have , at least when
is large. Note that our setup can include high-dimensional settings where p is large and many of the predictors do not contribute at all to the response: when
= 1, the contribution of the jth predictor is effectively null as it may be absorbed by the intercept term.
1.1 Background and motivation
Early work on collapsing levels together in low-dimensional models of the form (1) focused on performing a variety of significance tests for whether certain sets of parameters were equal [Tukey, 1949, Scott and Knott, 1974, Calinski and Corsten, 1985]. A more modern and algorithmic method based on these ideas is Delete or merge regressors [Maj-Ka´nska et al., 2015], which involves agglomerative clustering based on t-statistics for differences between levels.
The CART algorithm [Breiman et al., 1984] for building decision trees effectively starts with all levels of the variables fused together and greedily selects which levels to split. One potential drawback of these greedy approaches is that in high-dimensional settings where the search space is very large, they may fail to find good groupings of the levels. The popular random forest procedure [Breiman, 2001] uses randomisation to alleviate the issues with the greedy nature of the splits, but sacrifices interpretability of the fitted model.
An alternative to greedy approaches in high-dimensional settings is using penalty-based methods such as the Lasso [Tibshirani, 1996]. This can be applied to continuous or binary data and involves optimising an objective for which global minimisation is computationally tractable, thereby avoiding some of the pitfalls of greedy optimisation. In contrast to random forest, the fitted models are sparse and interpretable. Inspired by the success of the Lasso and related methods for high-dimensional regression, a variety of approaches have proposed estimating via optimising over (
) a sum of a least squares criterion
This is the CAS-ANOVA penalty of Bondell and Reich [2009]. The weights can be chosen to balance the effects of having certain levels of categories more prevalent than others in the data. The penalty is an ‘all-pairs’ version of the fused Lasso and closely related to so-called convex clustering [Hocking et al., 2011, Chiquet et al., 2017]. We note that there are several other approaches besides using penalty functions. For instance, Pauger and Wagner [2019] proposes a Bayesian modelling procedure using sparsity-inducing prior distributions to encourage fusion of levels. See also Tutz and Gertheiss [2016] and references therein for a review of other methods including those based on mixture models and kernels.
The fact that the optimisation problem resulting from (4) is convex makes the procedure attractive. However, a drawback is that it may not give a desirable form of shrinkage. Indeed, consider the case where p = 1, and dropping subscripts for simplicity, all = 1. This would typically be the case if all levels were equally prevalent. Further suppose for simplicity that the number of levels K is even. Then if the coefficients are clustered into two groups where one contains only a single isolated coefficient, the number of non-zero summands in (4) is only
This almost doubles to 2(
2) when one of the two groups is of size 2. The extreme case where the two groups are of equal size yields (
non-zero summands. This particular property of all-pairs penalties, which results in them favouring groups of unequal sizes, is illustrated schematically in Figure 1. We can see the impact of this in the following concrete example.
Figure 1: Illustration of the number of non-zero summands in (4) when p = 1, K = 16 and coefficients are clustered into two groups of equal size (right), and where one contains a single coefficient (left) and two coefficients (middle).
Suppose K = 20 levels are clustered into four groups with
If the coefficient estimates satisfy ˆ11, so the first two groups have distinct coefficients, then moving any coefficient from the first group towards the second, and so increasing the number of estimated groups, actually decreases the penalty contribution in (4). Specifically, if the kth coefficient for some
] with all other coefficients kept fixed, the penalty contribution decreases by 13t. In this case then, CAS-ANOVA will struggle to keep the groups intact, especially smaller ones. We see this in Figure 2, which shows the result of applying CAS-ANOVA to data generated according to (1) with
= 20 (so we have a single observation corresponding to each level), and
1). There is no value of the tuning parameter
where the true groups are recovered.
As in the standard regression setting, the bias introduced by all-pairs -type penalties may be reduced by choosing data-adaptive weights analogously to the adaptive Lasso [Zou, 2006], or replacing the absolute value
is a concave and non-decreasing penalty function [Oelker et al., 2015, Ma and Huang, 2017]. However, this does not
Figure 2: Solution paths as the tuning parameter varies in a univariate example where there are four true groups. From left to right: CAS-ANOVA, the range penalty and SCOPE with = 8. The setup is as described in the main text of Section 1.1, with the different colours corresponding to the different true groups. The tuning parameter varies along the y axis. In this example, only SCOPE identifies the 4 correct groups at any point along its solution path.
address the basic issue of a preference for groups of unequal sizes. Additionally, optimising an objective involving a penalty with summands can be computationally challenging, particularly in the case where
is not convex, both in terms of runtime and memory. To help motivate the new approach we are proposing in this paper, let us consider the setting where the predictors are ordinal rather than nominal, so there is an obvious ordering among the levels. In these settings, it is natural to consider a fused Lasso [Tibshirani et al., 2005] penalty of the form
where is a permutation of
specifying the given order; this is done in Gertheiss and Tutz [2010] who advocate using it conjunction with the all-pairs-type CAS-ANOVA penalty for nominal categories.
If however we treat the nominal variable setting as analogous to having ordinal variables with unknown orderings , one might initially think of choosing
corresponding to the order of the estimates
, such that
th smallest entry in
. This however leads to what we refer to as the ‘range’ penalty:
Whilst this shrinks the largest and smallest of the estimated coefficients together, the remaining coefficients lying in the open interval between these are unpenalised and so no grouping of the estimates is encouraged, as we observe in Figure 2; see also Oelker et al. [2015] for a discussion of this issue in the context of ordinal variables.
1.2 Our contributions and organisation of the paper
Given how all-pairs penalties have an intrinsic and undesirable preference for unequal group sizes, and how the fused Lasso applied to ordered coefficients (6) does not result in grouping of
the coefficients, we propose the following solution. Our approach is to use the penalty
for concave (and nonconvex) non-decreasing penalty functions , which, for computational reasons we discuss in Section 3, we base on the minimax concave penalty (MCP) [Zhang, 2010]. In Section 2 we formally introduce our method, which we call SCOPE, standing for Sparse Concave Ordering & Penalisation Estimator.
Note that whereas in conventional high-dimensional regression, the use of nonconvex penalties has been primarily motivated by a need to reduce bias in the estimation of large coefficients [Fan and Li, 2001], here the purpose is very different: in our setting a nonconvex penalty is in fact even necessary for shrinkage to sparse solutions to occur (see Proposition 1). Because of these fundamental differences, the rich algorithmic and statistical theory concerning high-dimensional regression with nonconvex penalties (see for example Loh and Wainwright [2012, 2015], Wang et al. [2014], Fan et al. [2018], Zhao et al. [2018] and references therein) is not directly applicable to our setting.
In Section 3, we therefore introduce a new dynamic programming approach that recovers the global minimum of the resulting objective function exactly in the univariate case, i.e. when p = 1. We then build this into a blockwise coordinate descent approach to tackle the multivariate setting.
In Section 4 we study the theoretical properties of SCOPE and give sufficient conditions for the estimator to coincide with the least squares solution with oracular knowledge of the level fusions in the univariate case. These conditions involve a minimal separation between unequal coefficients that is, up to constant factors, minimax optimal. Our result contrasts sharply with Theorem 2 of Ma and Huang [2017] for an all-pairs nonconvex penalty. The latter instead shows the existence of a local optimum that coincides with the oracle least squares solution. Whilst in conventional high-dimensional regression settings, it is known that under certain conditions, all local optima have favourable properties [Loh and Wainwright, 2015], we note that the separation requirements in Ma and Huang [2017] are substantially weaker than those indicated by the minimax lower bound, and so cannot be extended to a particular local optimum determined by the data; see the discussion following Theorem 5.
We use our univariate result to show that the oracle least squares solution is a fixed point of our blockwise coordinate descent algorithm in the multivariate case. In Section 5 we outline some extensions of our methodology including a scheme for handling settings when there is a hierarchy among the categorical variables. Section 6 contains numerical experiments that demonstrate the favourable performance of our method compared to a range of competitors on both simulated and real data. We conclude with a discussion in Section 7. Further details of our algorithm can be found in the Appendix. The supplementary material contains additional information on the runtime of our algorithm, and an approximate version suitable for very large-scale settings, all the proofs, and additional information on the experiments in Section 6.
Recall that our goal is to estimate parameters () in model (1). Let us first consolidate some notation. For any
. We will study the univariate setting where p = 1 separately, and so it will be helpful to introduce some simplified notation for this case, dropping any extraneous subscripts. We thus write
and . Additionally, we let ¯
denote the average of the
where In order to avoid an arbitrary choice of corner point constraint, we instead impose the following to ensure that
is identifiable: for all j = 1, . . . , p we have
Let Θ, and let Θ = Θ
. We will construct estimators by minimising over
Θ an objective function of the form
where is the least squares loss function (3) and
are the order statistics of
. We allow for different penalty functions
for each predictor in order to help balance the effects of varying numbers of levels
. The identifiability constraint that
Θ ensures that the estimated intercept ˆ
:= arg min
) satisfies ˆ
We note that whilst the form of the identifiability constraint would not have a bearing on the fitted values of unregularised least squares regression, this is not necessarily the case when regularisation is imposed. For example, consider the simple univariate setting with p = 1 and the corner point constraint = 0. Then the fitted value for an observation with level 1 would simply be the average ¯
, coinciding with that of unpenalised least squares. However the fitted values with observations with other level
2 would be subject to regularisation and in general be different to ¯
. This inequitable treatment of the levels is clearly undesirable as they may have been labelled in an arbitrary way. Our identifiability constraint treats the levels more symmetrically, but also takes into account the prevalence of levels, so the fitted values corresponding to more prevalent levels effectively undergo less regularisation.
As the estimated intercept ˆdoes not depend on the tuning parameters, we define
We will take the regularisers ) in (9) to be concave (and nonconvex); as discussed in the introduction and formalised in Proposition 1 below, a nonconvex penalty is necessary for fusion to occur.
Proposition 1. Consider the univariate case with p = 1. Suppose the subaverages ( ¯are all distinct, and that
is convex. Then any minimiser ˆ
such that ˆ
where (. This is a piecewise quadratic function with gradient
at 0 and flat beyond
. For computational reasons which we discuss in Section 3, the simple piecewise quadratic form of this is particularly helpful. In the multivariate case we take
with
. This choice of scaling is motivated by requiring that when
= 0 we also have ˆ
= 0 with high probability; see Lemma 10 in the Supplementary material. We discuss the choice of the tuning parameters
in Section 3.3, but first turn to the problem of optimising (9).
In this section we include details of how SCOPE is computed. Section 3.1 motivates and describes the dynamic programming algorithm we use to compute global minimiser of the SCOPE objective, which is highly non-convex. Section 3.2 contains details of how this is used to solve the multivariate objective by embedding it within a blockwise coordinate descent routine. Discussion of practical considerations is contained in Section 3.3.
3.1 Univariate model
3.1.1 Preliminaries
We now consider the univariate case (p = 1) and explain how the solutions are computed. In this case, we may rewrite the least squares loss contribution to the objective function in the following way.
where . Thus the optimisation problem (9) can be written equivalently as
suppressing the dependence of the MCP on tuning parameters
. In fact, it is straightforward to see that the constraint that the solution lies in Θ will be automatically satisfied, so we may replace Θ with
. Two challenging aspects of the optimisation problem above are the presence of the nonconvex
and the order statistics. The latter however are easily dealt with using the result below, which holds more generally whenever
is a concave function.
Proposition 2. Consider the univariate optimisation (11) with any concave function such that a minimiser ˆ
exists. If for
This observation substantially simplifies the optimisation: after re-indexing such that ¯¯
, we may re-express (11) as,
We use the following intermediate functions to structure the algorithm:
for k = 2, . . . , K; here sarg min refers to the smallest minimiser in the case that it is not unique. Invariably however this will be unique, as the following result indicates.
Proposition 3. The set of ( ¯that yields distinct solutions to (11) has Lebesgue measure zero as a subset of
We will thus tacitly assume uniqueness in some of the discussion that follows, though this is not required for our algorithm to return a global minimiser. Observe now that ˆminimiser of the univariate objective function
: indeed for
Furthermore, we have ˆ), and more generally ˆ
Thus provided
can be minimised efficiently (which we shall see is indeed the case), given this and the functions
we can iteratively compute ˆ
. In order to make use of these properties, we must be able to compute
efficiently; we explain how to do this in the following subsection.
3.1.2 Computation of fK and b2, . . . , bK
The simple piecewise quadratic form of the MCP-based penalty is crucial to our approach for computing the . Some important consequences of this piecewise quadratic property are summarised in the following lemma.
Lemma 4. For each k,
(iii) for each , if a minimiser ˜
(
must be differentiable at ˜
Properties (i) and (ii) above permit exact representation of with finitely many quantities. The key task then is to form the collection of intervals and corresponding coefficients of quadratic functions for
given a similar piecewise quadratic representation of ; and also the same for the linear functions composing
. A piecewise quadratic representation of
would then be straightforward to compute, and we can iterate this process. To take advantage of property (iii) above, in computing
) we can separately search for minimisers at stationary points in (
compare the corresponding function values with
); the fact that we need only consider potential minimisers at points of differentiability will simplify things as we shall see below.
Suppose are intervals that partition R (closed on the left) and
are corresponding quadratic functions such that
. Let us write
We may then express ). We can also express the penalty
a similar fashion. Let
Then be the set of points at which
is differentiable. We then have, using Lemma 4 (iii) that
where ˜min denotes the minimum if it exists and otherwise. The fact that in the inner minimisation we are permitted to consider only points in
simplifies the form of
We show in Section A.1 of the Appendix that this is finite only on an interval and there takes the value of a quadratic function; coefficients for this function and the interval endpoints have closed form expressions that are elementary functions of the coefficients and intervals corresponding to ˜. With this, we have an explicit representation of
as the minimum of a collection of functions that are quadratic on intervals and
everywhere else. Let us refer to these intervals (closed on the left) and corresponding quadratic functions as
respectively.
In order to produce a representation of for use in future iterations, we must express
as a collection of quadratics defined on disjoint intervals. To this end, define for each
. Note that the endpoints of the intervals
are the points where the active set changes and it is thus straightforward to determine A(x) at each x. Let r(x) be the index such that
). For large negative values of x, A(x) will contain a single index and for such x this must be r(x). Consider also for each
horizontal coordinate
of the first intersection beyond x (if it exists) between
We refer to the collection of all such tuples (
and denote it by N(x). Given r(x), N(x) can be computed easily. The intersection set N(x) then in turn helps to determine the smallest
) changes, that is the next knot of
beyond x, as we now explain. Suppose at a point
, we have computed
set
and perform the following.
1. Given ) and set (
) = arg min
2. If there are no changes in the active set between , we have found the next knot point at
3. If instead the active set changes, move to the leftmost change point. We have that
). To determine if r(x) changes at
, we check if
(ii) some enters the active set at
and ‘beats’
and
0 sufficiently small.
If either hold is a knot and
) may be computed via
) = arg min
If neither hold, we conclude that
and go to step 1 once more.
Hence we can proceed from one knot of to the next by comparing the values and intersections of a small collection of quadratic functions, and thereby form a piecewise quadratic representation of
in a finite number of steps. Figure 3 illustrates the steps outlined above. The pieces of
may be computed in a similar fashion.
We note there are several modifications that can speed up the algorithm: for example, for each (17) is a constant function where it is finite (see
in the figure), and these can be dealt with more efficiently. For further details including pseudocode see Section A.2 of the Appendix.
Figure 3: Illustration of the optimisation problem and our algorithm, to be interpreted with reference to steps 1, 2, 3 in the main text. Shading indicates regions where the active set, displayed at the bottom of the plot, is invariant, and vertical dotted lines signify changes. Dotted curves correspond to parts of quadratic functions lying outside their associated intervals
the active set changes between
to the first change point P and see neither (i) nor (ii) occur. We therefore return to step 1 and compute
) which additionally contains (
2). As the active set is unchanged between
, we have determined the next knot point
and minimising quadratic
In summary, our algorithm produces a piecewise quadratic representation of
can minimise efficiently to obtain ˆ. We also have piecewise linear representations of functions
through which we may iteratively obtain ˆ
It seems challenging to obtain meaningful bounds on the number of computations that must be performed at each stage of this process in terms of parameters of the data. However, to give an indication of the scalability of this algorithm, we ran a simple example with 3 true levels and found that with 50 categories the runtime was under 10seconds; with 2000 categories it was still well under half a second. More details on computation time can be found in Sections S1.3 and S3.2 of the Supplementary material. In Section S1.4 of the Supplementary material, we describe an approximate version of the algorithm that can be used for fast computation in very large-scale settings.
3.2 Multivariate model
Using our dynamic programming algorithm for the univariate problem, we can attempt to minimise the objective (9) for the multivariate problem using block coordinate descent. This has been shown empirically to be a successful strategy for minimising objectives for high-dimensional regression with nonconvex penalties such as the MCP [Breheny and Huang, 2011, Mazumder et al., 2011, Breheny and Huang, 2015], and we take this approach here. Considering the multivariate case, we iteratively minimise the objective keeping all other parameters fixed. Then for a given (
) and initial estimate ˆ
Θ, we repeat the following until a suitable convergence criterion is met:
1. Initialise m = 1, and set for i = 1, . . . , n
2. For j = 1, . . . , p, compute
3. Increment
We define a blockwise optimum of Q to be any ˆΘ, such that for each j = 1, . . . , p,
This is equivalent to ˆbeing a fixed point of the block coordinate descent algorithm above. Provided
is continuous in
. As a consequence of Tseng [2001], Theorem 4.1 (c), provided the minimisers ˆ
in (19) are unique for all j and m (which will invariably be the case when the responses are realisations of continuous random variables; see Proposition 3), then all limit points of the sequence (ˆ
are blockwise optima.
3.3 Practicalities
In practice the block coordinate descent procedure described above must be performed over a grid of () values to facilitate tuning parameter selection by cross-validation. In line with analogous recommendations for other penalised regression optimisation procedures [Breheny and Huang, 2011, Friedman et al., 2010], we propose, for each fixed
, to iteratively obtain solutions for an exponentially decreasing sequence of
values, warm starting each application of block coordinate descent at the solution for the previous
. It is our experience that this scheme speeds up convergence and helps to guide the resulting estimates to statistically favourable local optima, as has been shown theoretically for certain nonconvex settings [Wang et al., 2014].
The grid of values can be chosen to be fairly coarse as the solutions appear to be less sensitive to this tuning parameter; in fact fixing
yields competitive performance across a range of settings (see Section 6). The choice
0, which mimics the
has good statistical properties (see Theorem 5 and following discussion). However the global optimum typically has a smaller basin of attraction and can be prohibitively hard to locate, particularly in low signal to noise ratio settings where larger
tends to dominate.
In this section, we study the theoretical properties of SCOPE. Recall our model
for Θ. We will assume the errors (
have mean zero, are indepen- dent and sub-Gaussian with parameter
and define the oracle least squares estimate
This is the least squares estimate of with oracular knowledge of which categorical levels are fused in
Note that in the case where the errors have equal variance , the expected mean squared prediction error of ˆ
with equality when ˆis unique.
Our results below establish conditions under which ˆis a blockwise optimum (20) of the SCOPE objective function Q (9), or in the univariate case when this in fact coincides with SCOPE. The minimum differences between the signals defined for each j by
these latter two quantities are the minimum and maximum number of observations corresponding to a set of fused levels in the jth predictor respectively.
4.1 Univariate model
We first consider the univariate case, where as usual we will drop the subscript j for simplicity. The following result establishes conditions for recovery of the oracle least squares estimate (22).
Theorem 5. Consider the model (21) in the univariate case with p = 1. Suppose there exists
max. Suppose further that
Then with probability at least
the oracle least squares estimate ˆis the global optimum of (9), so ˆ
For a choice of the tuning parameters (such that equality holds in (24), we have, writing ∆
with probability at least
where c is an absolute constant. The quantity reflects how equal the number of observations in the true fused levels are: in settings where the prevalences of the underlying true levels are roughly equal, we would expect this to be closer to 1.
Consider now an asymptotic regime where K, s and 1/∆ are allowed to diverge with n, , so all levels have roughly the same prevalence, and
is bounded away from zero, so all true underlying levels also have roughly the same prevalence. Then in order for ˆ
with high probability, we require ∆
. This requirement cannot be weakened for any estimator; this fact comes as a consequence of minimax lower bounds on mis-clustering errors in Gaussian mixture models [Lu and Zhou, 2016, Theorem 3.3].
We remark that our result here concerning properties of the global minimiser of our objective is very different from existing results on local minimisers of objectives involving all-pairs-type penalties. For example, in the setting above where K = n, Theorem 2 of Ma and Huang [2017] gives that provided ), there exists a sequence of local minimisers converging to the oracle least-squares estimate with high probability. This is significantly weaker than the condition ∆
) required for any estimator to recover oracle least-squares in this setting, illustrating the substantial difference between results on local and global optima here.
4.2 Multivariate model
When the number of variables is p > 1, models can become high-dimensional, with ordinary least squares estimation failing to provide a unique solution. We will however assume that the solution for
is unique, which occurs if and only if the oracle least squares estimate (22) is unique. In this case, we note that ˆfor a fixed matrix A. A necessary condition for this is that
Our result below provides a bound on the probability that the oracle least squares estimate is a blockwise optimum of the SCOPE objective (9) with This is much more meaningful than an equivalent bound for ˆ
to be a local optimum as the number of local optima will be enormous. In general though there may be several blockwise optima, and it seems challenging to obtain a result giving conditions under which our blockwise coordinate descent procedure is guaranteed to converge to ˆ
. Our empirical results (Section 6) however show that the fixed points computed in practice tend to give good performance.
Theorem 6. Consider the model (21) and assume ˆ. Suppose that there exists
such that
. Further suppose that
Then letting , with probability at least
the oracle least squares estimate ˆis a blockwise optimum of (9).
Now suppose are such that equality holds in (26) for all j. Then writing
), we have that ˆ
is a blockwise optimum of (9) with probability at least
where c is an absolute constant. Consider now an analogous asymptotic regime to that described in the previous section for the univariate case. Specifically assume for simplicity. We then see that in order for ˆ
to be a blockwise optimum with high probability, it is sufficient that ∆
In this section, we describe some extensions of our SCOPE methodology.
Continuous covariates. If some of the covariates are continuous rather than categorical, we can apply any penalty function of choice to these, and perform a regression by optimising the sum of a least squares objective, our SCOPE penalty and these additional penalty functions, using (block) coordinate descent.
be the centred design matrix for these covariates with
. One can fit a model with SCOPE penalising the categorical covariates, and the Lasso with tuning parameter
0 penalising the continuous covariates, resulting in the following objective over
This sort of integration of continuous covariates is less straightforward when attempting to use tree-based methods to handle categorical covariates, for example.
Generalised linear models. Sometimes a generalised linear model may be appropriate. Although a quadratic loss function is critical for our exact optimisation algorithm described in Section 3.1, we can iterate local quadratic approximations to the loss term in the objective and minimise this. This results in a proximal Newton algorithm and is analogous to the standard approach for solving -penalised generalised linear models [Friedman et al., 2010, Section 3]. An implementation of this scheme in the case of logistic regression for binary responses is available in the accompanying R package CatReg. We remark that when computing logistic regression models with a SCOPE penalty it is advisable to use a larger value of
than with a continuous response to aid convergence of the proximal Newton step; we recommend a default setting of
= 100. In Section 6.2 we use the approach described above to perform a logistic regression using SCOPE on US census data.
Hierarchical categories. Often certain predictors may have levels that are effectively subdivisions of the levels of other predictors. Examples include category of item in e-commerce or geographical data with predictors for continent, countries and district. For simplicity, we will illustrate how such settings may be dealt with by considering a case with two predictors, but this may easily be generalised to more complex hierarchical structures. Suppose there is a partition such that for all
so the levels of the second predictor in represent subdivisions of kth level of the first predictor. Let
refer to the subvector (
components of
are the coefficients corresponding to the levels in
the rth order statistic within
. It is natural to encourage fusion among levels within
more strongly than for levels in different elements of the partition. To do this we can modify our objective function so the penalty takes the form
We furthermore enforce the identifiability constraints that
As well as yielding the desired shrinkage properties, an additional advantage of this approach is that the least squares criterion is separable in so the blockwise update of
be performed in parallel. This can lead to a substantial reduction in computation time if
large.
In this section we explore the empirical properties of SCOPE. We first present results on the performance on simulated data, and then in Sections 6.2 to 6.5 present analyses and experiments on US census data, insurance data and COVID-19 modelling data.
We denote SCOPE with a specific choice of , and write SCOPE-CV to denote SCOPE with a cross-validated choice of
. SCOPE solutions are computed using our R [R Core Team, 2020] package CatReg [Stokell, 2021], using 5-fold cross-validation to select
examples except those in Section 6.5. We compare SCOPE to linear or logistic regression where appropriate and a range of existing methods, including CAS-ANOVA [Bondell and Reich, 2009] (4), and an adaptive version where the weights
are multiplied by a factor proportional to the
is an initial CAS-ANOVA estimate. For these methods the tuning parameter
was also selected by 5-fold cross-validation. As well as this, we include Delete or merge regressors (DMR) [Maj-Ka´nska et al., 2015] and Bayesian effect fusion (BEF) [Pauger and Wagner, 2019] in some experiments. With the former, models were fitted using DMRnet [Prochenka-So�ltys and Pokarowski, 2018] and selected by 5-fold cross-validation where possible; otherwise an information criterion was used. With BEF, coefficients were modelled with a Gaussian mixture model with posterior mean estimated using 1000 samples using effectFusion [Pauger et al., 2019]. We also include comparison to the tree-based approaches CART [Breiman et al., 1984] and random forests (RF) [Breiman, 2001]. Lastly, in some experiments, models were also fitted using the Lasso [Tibshirani, 1996]. CART was implemented using rpart [Therneau and Atkinson, 2019] with pruning according to the one standard error rule. Random forests and Lasso were implemented using the default settings in randomForest [Liaw and Wiener, 2002] and glmnet [Friedman et al., 2010] packages respectively. For full details of the specific versions of these methods and software used in the numerical experiments, see Section S3.1 of the Supplementary material.
6.1 Simulations
where g is the true regression function, ˆg an estimate, and the expectation is taken over the covariate vector x.
6.1.1 Low-dimensional experiments
Results are presented for three settings with n = 500, p = 10 given below.
3. As Setting 1, but with
RF 9Setting 3
SNR: 7.3 2.9 1.5 0.73
Table 1: Mean squared prediction errors (and standard deviations thereof) of various methods on the settings described.
Each of these experiments were performed with noise variance = 1, 6.25, 25 and 100. Note that the variance of the signal varies across each setting, and signal-to-noise ratio (SNR) for each experiment is displayed in Table 1. Methods included for comparison were SCOPE-8, SCOPE-32, SCOPE-CV, linear regression, vanilla and adaptive CAS-ANOVA, DMR, Bayesian effect fusion, CART and random forests. Also included are the results from the oracle least squares estimator (22).
Results are shown in Table 1 and further details are given in Section S3.2.1 of the Supplementary material. Across all experiments, SCOPE with a cross-validated choice of prediction performance at least as good as the optimal approaches, and in all but the lowest noise settings performs better than the other methods that were included. In these exceptions, we see that fixing
to be a small value (corresponding to high-concavity) provides leading performance.
In these low noise settings, we see that the methods based on first estimating the clusterings of the levels and then estimating the coefficients without introducing further shrinkage, such as DMR or Bayesian effect Fusion, perform well. However they tend to struggle when the noise is larger. In contrast the tree-based methods perform poorly in low noise settings but exhibit competitive performance in high noise settings.
6.1.2 High-dimensional experiments
We considered 8 settings as detailed below, each with n = 500, p = 100 and simulated 500 times.
2. As Setting 1, but with
6. As Setting 5, but with
Models were fitted using SCOPE-8, SCOPE-32, SCOPE-CV, DMR, CART, Random forests and the Lasso. Table 2 gives the mean squared prediction errors across each of the settings.
As well as prediction performance, it is interesting to see how the methods perform in terms of variable selection performance. With categorical covariates, there are two potential ways of evaluating this. The first is to consider the number of false positives and false negatives across the p = 100 categorical variables, defining a variable j to have been selected if ˆThese results are shown in Table 3. This definition of a false positive can be considered quite conservative; typically one can find that often the false signal variables have only two levels, each with quite small coefficients. This means that the false positive rate can increase substantially with only a small increase in the dimension of the estimated linear model.
The second is to see within the signal variables (i.e., the = 0), how closely the estimated clustering resembles the true structure. To quantify this, we use the adjusted Rand index [Hubert and Arabie, 1985]. This is the proportion of all pairs of observations that are either (i) in different true clusters and different estimated clusters, or (ii) in the same true cluster and estimated cluster; this is then corrected to ensure that its value is zero when exactly one of the clusterings is ‘all-in-one’. In Table 4 we report the average adjusted Rand index over the true signal variables in each setting.
Table 2: Mean squared prediction errors (and standard deviations thereof) of each of the meth- ods in the 8 high-dimensional settings considered.
Table 3: (False positive rate)/(False negative rate) of linear modelling methods considered in the high-dimensional settings.
Table 4: Average adjusted Rand index among true signal variables for the high-dimensional settings.
Further details can be found in Section S3.2.2 of the Supplementary material. In particular we include a table with the distribution of cross-validated choices of (from a grid {4, 8, 16, 32, 64}) for each experimental setting. Note that a choice of
= 4 is close to the setting of
= 3 recommended in Zhang [2010], though the problem of categorical covariates is very different in nature than the vanilla variable selection problem considered there. Our results there suggest that for SCOPE, a larger value of
is preferable across a range of settings, which is also visible in the comparison between
= 32 in Table 2.
Across all the settings in this study, SCOPE performs better than any of the other methods included. This is regardless of which of the three regimes is chosen, although cross-validating
gives the strongest performance overall. Comparing the results for
gests that a larger (low-concavity) choice of
is preferable for higher-dimensional settings. In setting 6, we see from Tables 3 and 4 that SCOPE obtains the true underlying groupings of the coefficients and obtains the oracle least-squares estimate in every case, giving these striking results. This is also achieved for some of the experiments in setting 5. In contrast, DMR, which initially applies a group Lasso [Yuan and Lin, 2006] to screen the categorical variables and give a low-dimensional model, necessarily misses some signal variables in this first stage and hence struggles here.
6.2 Adult dataset analysis
The Adult dataset, available from the UCI Machine Learning Repository [Dua and Graff, 2019], contains a sample of 45 222 observations based on information from the 1994 US census. The binary response variable is 0 if the individual earns at most $50 000 a year, and 1 otherwise. There are 2 continuous and 8 categorical variables; some such as ‘native country’ have large numbers of levels, bringing the total dimension to 93. An advantage of using SCOPE here over black-box predictive tools such as Random forests is the interpretability of the fitted model.
In Table 5, we show the 25-dimensional fitted model. Within the Education category, we see that six distinct levels have been identified. These agree almost exactly with the stratification one would expect, with all school dropouts before 12th grade being grouped together at the lowest level.
Figure 4: Prediction performance and fitted model dimension (respectively) of various methods on the Adult dataset: (A) SCOPE-100; (B) SCOPE-250; (C) Logistic regression; (D) CASANOVA; (E) Adaptive CAS-ANOVA; (F) DMR; (G) BEF; (H) CART; (I) RF.
Here we assess performance in the challenging setting when the training set is quite small by randomly selecting 1% (452) of the total observations for training, and using the remainder as a test set. Any observations containing levels not in the training set were removed. Models were fitted with SCOPE-100, SCOPE-250, logistic regression, vanilla and adaptive CAS-ANOVA, DMR, Bayesian effect fusion, CART and random forests.
We see that both SCOPE-100 and SCOPE-250 are competitive, with CART and Random forests also performing well, though the latter two include interactions in their fits. CASANOVA also performs fairly well, the misclassification error is larger that for both versions of SCOPE, and the average fitted model size is larger.
Table 5: Coefficients of SCOPE model trained on the full dataset. Here, selected by 5-fold cross-validation (with cross-validation error of 16.82%). Countries, aside from those in the UK, are referred to by their (possibly historical) internet top-level domains. *Relation with which the subject lives.
Table 6: Results of experiments on the Adult dataset.
Figure 5: Misclassification error and dimensions of models fitted on a sample of the Adult dataset when levels have been artificially split m times.
6.3 Adult dataset with artificially split levels
To create a more challenging example, we artificially created additional levels in the Adult dataset as follows. For each categorical variable we recursively selected a level with probability proportional to its prevalence in the data and then split it into two by appending “-0” or “-1” to the level for each observation independently and with equal probabilities. We repeated this until the total number of levels reached m times the original number of levels for that variable for m = 2, 3, 4. This process simulates for example responses to a survey, where different respondents might answer ‘US’, ‘U.S.’, ‘USA’, ‘U.S.A.’, ‘United States’ or ‘United States of America’ to a question, which would naively all be treated as different answers.
We used 2.5% (1130) of the observations for training and the remainder for testing and applied SCOPE with = 100 and logistic regression. Results were averaged over 250 training and test splits. Figure 5 shows that as the number of levels increases, the misclassification error of SCOPE increases only slightly and the fitted model dimension remains almost unchanged, whereas both increase with m for logistic regression.
6.4 Insurance data example
The Prudential Life Insurance Assessment challenge was a prediction competition run on Kaggle [2015]. By more accurately predicting risk, the burden of extensive tests and check-ups for life insurance policyholders could potentially be reduced. For this experiment, we use the training set that was provided for entrants of the competition.
We removed a small number of variables due to excessive missingness, leaving 5 continuous variables and 108 categorical variables, most with 2 or 3 levels but with some in the hundreds (and the largest with 579 levels). Rather than using the response from the original dataset, which is ordinal, to better suit the regression setting we are primarily concerned with in this work, we artificially generated a continuous response. To construct this signal, firstly 10 of the categorical variables were selected at random, with probability proportional to the number of levels. For the jth of these, writing for the number of levels, we set
assigned each level a coefficient in 1
uniformly at random, thus yielding
true levels. The coefficients for the 5 continuous covariates were generated as draws from
response was then scaled to have unit variance, after which standard normal noise was added.
Figure 6: Mean squared prediction error on the example based on the Prudential Life Insurance Assessment dataset. Methods used are: (A) SCOPE-8; (B) SCOPE-32; (C) SCOPE-CV; (D) CART; (E) RF; (F) Lasso.
We used 10% (n = 5938) of the 59 381 total number of observations for training, and the remainder to compute an estimated MSPE (28) by taking an average over these observations. We repeated this 1000 times, sampling 10% of the observations and generating the coefficients as above anew in each repetition. The average mean squared prediction errors achieved by the various methods under comparison are given in Figure 6. We see that SCOPE with a cross-validated choice of performs best, followed by the Lasso and SCOPE-32.
6.5 COVID-19 Forecast Hub example
As well as the prediction performance experiments in the rest of this section, we include an exploratory data analysis example based on data relating to the ongoing (at time of writing) global COVID-19 pandemic. The COVID-19 Forecast Hub [2020] A collection of different research groups publish forecasts every week of case incidence in each US state for some number of weeks into the future.
In order to understand some of the difficulties of this challenging forecasting problem, we fitted an error decomposition model of the form
where w is the week that the forecast is for, l is the state, m indexes the forecasting model, t is the ‘target’ number of weeks in the future the forecast is for, is an error term, and cases
and est.cases
are the observed and estimated cases respectively. This decomposition allows an interaction term between time and location, which is important given that the pandemic was known to be more severe at different times for different areas. An interaction between model and forecasting distance was also included in order to capture the effect of some models potentially being more ‘short-sighted’ than others. The inclusion of the +1 on the left-hand side is to avoid numerators or denominators of zero.
We used data from 6 April 2020 to 19 October 2020, giving a total of 100 264 (m, t, w, l)-tuples. We applied a SCOPE penalty with , which had 1428 levels. The
coefficients, which amounted to 170 levels, were left unpenalised. The additional tuning parameter
was selected using the Extended Bayesian Information Criterion [Chen and Chen, 2008] rather than cross-validation, as it was more suited to this sort of exploratory analysis on data with a chronological structure.
The resulting estimates ˆhad 8 levels. We measured the ‘similarity’ of two US states
over a period of time by computing the proportion of weeks at which their estimates
Figure 7: Similarity matrix for US states computed based on data relating to the second ‘wave’ of the COVID-19 pandemic in the US, taken to be from 26 June 2020 to 29 August 2020.
ˆcoincided. The similarity matrix presented in Figure 7 was constructed based on the second ‘wave’ of the epidemic which occurred in Summer 2020, with clusters identified by applying spectral clustering on the similarity matrix and plotted in order of decreasing within-cluster median pairwise similarity.
The resulting clusters are at once interpretable and interesting. Roughly speaking, the top 3 clusters (‘top’ when ordered according to median pairwise within-cluster agreement) correspond to states that experienced notable pandemic activity in the second, first, and third ‘waves’ of the U.S. coronavirus pandemic, respectively. The first cluster features several southern States (e.g., Georgia, Florida, Texas) which experienced a surge of COVID cases in June–July. The second cluster features east coast states (e.g., New Jersey and New York) which experienced an enormous pandemic toll in April–May. And the third features midwestern states (e.g., Kentucky, Indiana, Nebraska) which had upticks most recently in September-October.
In this work we have introduced a new penalty-based method for performing regression on categorical data. An attractive feature of a penalty-based approach is that it can be integrated easily with existing methods for regression with continuous data, such as the Lasso. Our penalty function is nonconvex, but in contrast to the use of nonconvex penalties in standard high-dimensional regression problems, the nonconvexity here is necessary in order to obtain sparse solutions, that is fusions of levels. Whilst computing the global optimum of nonconvex problems is typically very challenging, for the case with a single categorical variable with several hundred levels, our dynamic programming algorithm can typically solve the resulting optimisation problem in less than a second on a standard laptop computer. The algorithm is thus fast enough to be embedded within a block coordinate descent procedure for handling multiple categorical variables.
We give sufficient conditions for SCOPE to recover the oracle least squares solution when p = 1 involving a minimal separation between unequal coefficients that is optimal up to constant factors. For the multivariate case where p > 1, we show that oracle least squares is a fixed point of our block coordinate descent algorithm, with high probability.
Our work offers several avenues for further work. On the theoretical front, it would be interesting to obtain guarantees for block coordinate descent to converge to a local optimum with good statistical properties, a phenomenon that we observe empirically. On the methodology side, it would be useful to generalise the penalty to allow for clustering multivariate coefficient vectors; such clustering could be helpful in the context of mixtures of regressions models, for example.
H. D. Bondell and B. J. Reich. Simultaneous factor selection and collapsing levels in ANOVA. Biometrics, 65(1):169–177, 2009.
P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232, 2011.
P. Breheny and J. Huang. Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25(2):173–187, 2015.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984.
T. Calinski and L. Corsten. Clustering means in anova by simultaneous testing. Biometrics, pages 39–48, 1985.
J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–71, 2008.
J. Chiquet, P. Gutierrez, and G. Rigaill. Fast tree inference with weighted fusion penalties. Journal of Computational and Graphical Statistics, 26(1):205–216, 2017.
COVID-19 Forecast Hub. URL: https://covid19forecasthub.org, 2020.
D. Dua and C. Graff. UCI machine learning repository, 2019.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
J. Fan, H. Liu, Q. Sun, and T. Zhang. I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error. The Annals of Statistics, 46(2):814–841, 2018.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
J. Gertheiss and G. Tutz. Sparse modeling of categorial explanatory variables. The Annals of Applied Statistics, 4(4):2150–2180, 2010.
T. D. Hocking, A. Joulin, F. Bach, and J.-P. Vert. Clusterpath an algorithm for clustering using convex fusion penalties. In 28th International Conference on Machine Learning, page 1, 2011.
S. Hu, A. O’Hagan, and T. B. Murphy. Motor insurance claim modelling with factor collapsing and bayesian model averaging. Stat, 7(1):e180, 2018.
L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
P. B. Jensen, L. J. Jensen, and S. Brunak. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395, 2012.
Kaggle. Prudential Life Insurance Assessment. URL: https://www.kaggle.com/c/prudential-life-insurance-assessment/data, 2015.
A. Liaw and M. Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002.
P.-L. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664, 2012.
P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16(19):559–616, 2015.
Y. Lu and H. H. Zhou. Statistical and computational guarantees of lloyd’s algorithm and its variants. arXiv preprint arXiv:1612.02099, 2016.
S. Ma and J. Huang. A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410–423, 2017.
A. Maj-Ka´nska, P. Pokarowski, A. Prochenka, et al. Delete or merge regressors for linear model selection. Electronic Journal of Statistics, 9(2):1749–1778, 2015.
R. Mazumder, J. H. Friedman, and T. Hastie. Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.
M.-R. Oelker, W. P¨oßnecker, and G. Tutz. Selection and fusion of categorical predictors with l 0-type penalties. Statistical Modelling, 15(5):389–410, 2015.
D. Pauger and H. Wagner. Bayesian effect fusion for categorical predictors. Bayesian Analysis, 14(2):341–369, 2019.
D. Pauger, M. Leitner, H. Wagner, and G. Malsiner-Walli. effectFusion: Bayesian Effect Fusion for Categorical Predictors, 2019. R package version 1.1.1.
A. Prochenka-So�ltys and P. Pokarowski. DMRnet: Delete or Merge Regressors Algorithms for Linear and Logistic Model Selection and High-Dimensional Data, 2018. R package version 0.2.0.
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2020.
A. J. Scott and M. Knott. A cluster analysis method for grouping means in the analysis of variance. Biometrics, pages 507–512, 1974.
B. Stokell. CatReg: Solution Paths for Linear and Logistic Regression Models with Categorical Predictors, with SCOPE Penalty https://CRAN.R-project.org/package=CatReg, 2021.
T. Therneau and B. Atkinson. rpart: Recursive Partitioning and Regression Trees, 2019. R package version 4.1-15.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (1):91–108, 2005.
P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
J. W. Tukey. Comparing individual means in the analysis of variance. Biometrics, pages 99–114, 1949.
G. Tutz and J. Gertheiss. Regularized regression for categorical data. Statistical Modelling, 16 (3):161–200, 2016.
Z. Wang, H. Liu, and T. Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6):2164, 2014.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
T. Zhao, H. Liu, and T. Zhang. Pathwise coordinate optimization for sparse learning: Algorithm and theory. The Annals of Statistics, 46(1):180–218, 2018.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
A.1 Candidate minimiser functions
In this section we give explicit forms of the functions as defined in Section 3.1. We write
for simplicity, suppressing the subscript
we write aS + b for the set
Recall from Section 3.1 that
For a function , we denote the effective domain of f by
For each r = 1, . . . , m(k), there are cases corresponding to t = 1 and t = 2. The formulas are as follows:
with dom
If
The second case is
Here, if
Considering (16), we see that we can also have the case where ). Thus we can form the set of quadratics
and associated intervals as the set of
as above for t = 1, 2 and the
themselves. Note that when
A.2 Algorithm details
Algorithm 1 describes in detail how the optimisation routine works. In the algorithm we make use of the following objects:
• for ) is the active set at x;
• E is the set of points at which the active set changes;
• N(x) is the intersection set at x;
• U is a set of tuples (is an interval and r is an integer, which is dynamically updated as the algorithm progresses.
See Section 3.1.2 for definitions of the sets above. We also use the convention that if then [
All of the are computed at the start of each iterate k. We then initialise
the set of all of the end-points of the intervals
Here x can be thought of as the ‘current position’ of the algorithm; ˜x is used to store when the minimising function last changed. We initialise ˜
. This choice of x ensures that the active set A(x) contains only one element (as mentioned in Section 3.1); this will always be the index corresponding to the function ˜
the functions ˜and their corresponding intervals
that partition R. Finally, we initialise the set N(x) which will contain the intersections between
and other functions in the active set. As the active set begins with only one function, we set
As mentioned in Section 3.1, there are several modifications that can speed up the algorithm. One such modification follows from the fact that for each is a constant function over its effective domain, and their effective domain is a semi-infinite interval (see Section A.1 of the Appendix for their expressions). Therefore, for a given point
, we can remove all such functions from A(x) except for the one taking the minimal value.
We also note that in Algorithm 1, the set N(x) is not recomputed in its entirety at every point x at which A(x) is updated, as is described in Section 3.1. Line 13 shows how sometimes N(x) can instead be updated by adding or removing elements from it. Often, points 3 (i) and 3 (ii) from the description in the Section 3.1 will coincide, and in such instances some calls to ChooseFunction (Algorithm 2) can be skipped.
This supplementary material is organised as follows. In Section S1 we include further details of our algorithm and the proofs of results in Sections 2 & 3. The proofs of Theorems 5 and 6 along with a number of lemmas they require can be found in Section S2. Section S3 contains information regarding simulation settings and additional results for the experiments in Section 6 of the main paper.
S1.1 Remarks on constrained and unconstrained formulations of the univariate objective
It is clear why the identifiability constraint (8) is important when we consider the multivariate problem in Section 3.2. However, for the univariate problem, both constrained and unconstrained formulations of the objective can be clearly defined:
As discussed in Section 3.1.1, we can enlarge the feasible set in (30) to be all of larly to the observation that
, the minimiser of (30) over all of
will always be in Θ. This can be shown by following the argument at the beginning of the proof of Lemma 10. Therefore the algorithm defined in Section 3.1 can also be applied to the unconstrained formulation of the objective.
It is clear that these problems are essentially identical, as ˆis a minimiser of the unconstrained objective if and only if ˆ
is a minimiser of the constrained objective. Observe that while ˆ
, the solution to the constrained objective is in fact (ˆ
is the same K-dimensional space only with a different parametersation. In particular, ˆ
non-unique if and only if ˆ
is non-unique.
Since one can obtain the solution to the constrained objective by solving the unconstrained one and then reparameterising (and vice versa), we are free to assume without loss of generality that = 0, when solving the univariate problem, and will remark where we do this.
S1.2 Proofs of results in Sections 2 & 3
Proof of Proposition 1. Assume, without loss of generality, that ˆ= 0. Suppose that there exists
such that ˆ
. Without loss of generality we have that ¯
¯
and it can be seen that ˆ
, in which case swap labels).
Now we construct ˜by setting ˜
otherwise. We have
) and, by convexity of
, it follows that
This gives the conclusion ), contradicting the optimality of ˆ
Proof of Proposition 2. Suppose, for a contradiction, that ˆ. Then at least one of the following must be true:
Let ˜be defined as follows. Set ˜
. If (32) holds set ˜
and if (33) holds set ˜
. Observe that
and that the squared loss of ˜is strictly smaller than the squared loss of ˆ
, thus contradicting optimality of ˆ
Proof of Proposition 3. In this proof we consider the unconstrained formulation of the objective (31) discussed in Section S1.1. Suppose that (¯is such that there are two distinct solutions to (12), ˆ
. Let us assume that the levels are indexed such that ¯
to be the largest index at which the two solutions take different values and note that we must have ˆ
First consider the case where
for r = 1, 2. We now argue that we must have ˆsuppose not, and suppose that without loss of generality ˆ
directional derivative of the objective in the direction of the binary vector with ones at the indices given by
and zeroes elsewhere evaluated at ˆ
must be 0. But comparing these for r = 1, 2, we see they are identical except for the term
), which will be strictly larger for r = 2, giving a contradiction. This then implies that both ˆ
must minimise
since the full objective value is
for r = 1, 2. We also have that when must minimise
. In particular, properties (i) and (iii) of Lemma 4 hold with
by
. These can be characterised as
intervals associated with
. Note that for each
depends on the values of ¯
but not that of ¯
(observe that
) includes a term
Now as ˆ), by Lemma 4 (iii) both must be local minima of
, and we have that there must exist distinct
such that ˆ
and ˆ
Since ˆmust be the minimum of ˇ
(and similarly for ˆ
must have that
This is a quadratic equation in ¯, so there are at most two values for which (34) holds. Considering all pairs
, we see that in order for there to exist two solutions ˆ
must take values in a set of size at most c(K), for some function
Now let
What we have shown, is that associated with each element ( ¯, there is at least one
such that
is bounded above by c(K). Now for each be the set of ( ¯
the there exists a
with the property above and
. Note that
has Lebesgue measure zero as a finite union of graphs of measurable functions
Thus S has Lebesgue measure zero.
Proof of Lemma 4. Assume, without loss of generality, that ˆ= 0. We proceed inductively, assuming that the properties (i) and (iii) hold for
, and (ii) holds for
. Additionally we include in our inductive hypothesis that for all
), where we define
) to be the left-derivative and right-derivative of
, respectively. We note that these trivially hold for the base case
, and the case
can be checked by direct calculation.
We first prove (i), that is continuous, coercive, and piecewise quadratic and with finitely many pieces. We then show that
, which allows us to show that (iii) holds for
. Finally, we use these results to show that (ii) holds for
We now show that is coercive and continuous. Clearly
follows that
is coercive. Furthermore
is bounded from below as
is coercive and continuous. Thus since
, it follows that
is coercive. Next as
are continuous, it follows that
is continuous and therefore that
is continuous.
To see why is piecewise quadratic with finitely many pieces, we observe that it can be written
. We have by our inductive hypothesis that
is piecewise quadratic and
) is piecewise linear, both with finitely many pieces. Since the composition of a piecewise linear function inside a piecewise quadratic function is piecewise quadratic, the remainder of (i) is shown.
We now turn our attention to (iii), and define for
We will first show that . Suppose that we are increasing x and we have reached a point where
) is not differentiable (that is, the left-derivative and the right-derivative do not match). By assumption (ii) for
, we can assume that there is some window
0 such that
) is linear for
In order to proceed with the following argument, we must show that for sufficiently small 0, we have
, this is immediate. Therefore it remains to consider the case
, for which we show that we must have
). This follows from the observation that if
, then for all
we have
]. Indeed, suppose not, then
Note that has both left-derivatives and right-derivatives at every point in R. Suppose first that
0, and we observe that
Then by the basic definition of the right-derivative,
where the last inequality follows from our inductive hypothesis that . An analogous argument shows that the same conclusion holds when
Now we use this to prove the claim. Because there are no points of at which the left-derivative is less than the right-derivative, without loss of generality we claim that
is differentiable at
. Indeed, suppose not, then we have that
) and necessarily that defining
0
)). But since
), we contradict the optimality of
point is in fact a local maximum.
We finally consider claim (ii). By (iii), we have that for every point ) is either x or at the minimum of one of the quadratic pieces of
). In either case, we have that
) is linear in
)) is quadratic in x. We can define
pointwise as the minimum of this finite set of quadratic functions of x, whose expressions are given in Appendix A.1. Importantly, the coefficients in the linear expression
only on which of these functions is the minimum at x. As the number of intersections between elements in this set of quadratic functions is bounded above by twice the square of the size of the set, we can conclude that
) is piecewise linear and with a finite number of pieces, thus concluding the proof.
S1.3 Computation time experiments
A small experiment was performed to demonstrate the runtimes one can expect in practice for the univariate problem. Note that this clustering is applied iteratively in the block coordinate descent procedure we propose to use in multivariate settings. We considered 3 settings: one with no signal, one with 2 true clusters and one with 5 true clusters. Independent and identically distributed Gaussian noise was added to each of the subaverages. As in Section 6.3 the number of categories was increased by random splitting of the levels. Each of these tests were repeated 25 times, on a computer with a 3.2GHz processor. The results are shown in Figure 8.
Figure 8: Computation times for solving the univariate problem.
S1.4 Discretised algorithm
For very large-scale problems, speed can be improved if we only allow coefficients to take values in some fixed finite grid, rather than any real value. Below we describe how such an algorithm would approximately solve the univariate objective (12). We will use the unconstrained objective as discussed in Section S1.1. We would first fix L grid points , and then proceed as described in Algorithm 3.
This algorithm has the same basic structure to the approach we use in Section 3.1 for computing the exact global optimum. The difference is that now, instead of as in (14), we define in the following way:
The objects F and B play analogous roles to in Section 3.1. Since we restrict
, we only need to store the values that
takes at these L values; this is the purpose of the vector F in Algorithm 3. Similarly, the rows
) serve the same purpose as the functions
where, again, we only need to store L values corresponding to the different options for
This algorithm returns the optimal solution ˆto the objective where each of the coefficients are restricted to take values only in
. We must ensure that the grid of values has fine enough resolution that interesting answers can be obtained, which requires L being sufficiently large. The number of clusters obtained by this approximate algorithm is bounded above by L, so this must not be chosen too small.
One can see that the computational complexity of this algorithm is linear in K, with a total of ) operations required. This is of course in addition to the O(n) operations needed to compute
beforehand. In particular, choosing
guarantees that the complexity of this algorithm is at worst quadratic in K.
S2.1 Proof of Theorem 5
The proof of Theorem 5 requires a number of auxiliary lemmas, which can be found in Section S2.1.1.
Let us define that
where we define is an orthogonal projection matrix, we have that
. It follows that
is sub-Gaussian with parameter
. Applying the standard sub-Gaussian tail bound, we obtain
where recall that . Therefore, we have that
In the following we work on the intersection Λ := . This entails that for each
2. We now relabel indices such that ¯
, and so from Proposition 2 that ˆ
. Since our assumption (24) implies ∆(
, it follows that on Λ the observed ordering is consistent with the ordering of the true coefficients, i.e. there exist 0 =
Indeed, observe that for 1, we have by the triangle inequality and (24), the stronger property that
Our optimisation objective is therefore
Since ¯, it follows from Lemma 8 that ˆ
for
Observe that we can have , in which case we take the sum over that range to be zero. Note that (40) can be optimised over (
) separately for each j = 1, . . . , s. If s = 1, i.e. the true signal is zero, then the result follows from Lemma 10. Now we see what happens when s > 1.
Without loss of generality, consider j = 1 and note that if = 1 it is immediate that ˆ
. Hence, we can assume that
1. We note that ˆ
define
. We see that our goal is to compute
where is a vector of ones and ˜
. Note that we subtract ˆ
to ensure that
as required for application of Lemma 10. We have by assumption that for . Thus, Lemma 10 can be applied with ˇ
follows that ˆ
S2.1.1 Auxiliary lemmas
Here we prove a number of results required to obtain conditions for recovering the oracle least squares estimate in the univariate case. Lemma 10 gives conditions for recovery of the true solution, in the case where there is zero signal. Lemmas 8 and 9 ensure that the true levels are far enough apart that they can be separated. Once we have this separation, we apply Lemma 10 on each of the levels to obtain the solution.
Lemma 7. Consider the optimisation problem
where . Suppose further that
is the unique optimum.
Proof. We first observe that
For convenience, we define ). It now suffices to show that F is uniquely minimised at 0 provided
. We can clearly see that
Equation (2.3) of Breheny and Huang [2011] gives the result when
When 1, we see that any stationary point of
] must be a maximum, since on this interval F(x) is a quadratic function with a negative coefficient of
. Therefore its minimum over [0
] is attained at either
, then it suffices to check that
). This holds if and only if
+ 1), but since we are assuming
1, this is always satisfied.
If , then we can see that the minimum of
] will be attained at exactly 2
. Thus, here it also suffices to check
), which holds if and only if
The final bound
follows from combining the results for these cases.
The following is a deterministic result to establish separation between groups of coefficients.
Lemma 8. Consider the setup of Theorem 5, and assume that ˆ. Suppose that ¯
¯
, and that for j = 1, . . . , s we have
where are as defined in (36). Suppose further that for
Then for
Proof. For convenience, within this lemma we define . Recall that the objective function which ˆ
optimises takes the form
We first claim that ˆ. To see this, suppose that this is not the case and define ˇ
by projecting ˆ
penalty contribution from ˇ
is no larger than that of ˆ
, and the loss contribution is strictly smaller, so we obtain the contradiction
We now proceed to show that for 1, we have ˆ
prove the first of these sets of inequalities, since the second follows similarly by considering the problem with
and reversing the indices. Suppose, for contradiction, that there exists some
be minimal, such that for all l < j we have ˆ
Next define to be the maximal element of
such that ˆ
Similarly, we define
to be minimal such that ˆ
. The existence of
is guaranteed by Lemma 9.
We note that for and hence ( ¯
This can be shown by contradiction, as in (55). For such l, we have from optimality of ˆ
¯
(otherwise one could improve the objective by setting ˆ
) which implies that ˆ
. From this it follows that ( ¯
Similarly, if + 1, then for
1 we have ˆ
( ¯
, it follows that ˆ
and therefore that ( ¯
Now, we define
We also define ˜according to
Thus,
We specify the infimum in (47) because ( ¯] is not closed, and let (
) be a convergent sequence in ( ¯
] whose limit attains this infimum. We define
By assumption (43), at least one of () is greater than or equal to
Here, we use that the separation (43)
then we denote this case (A1) and (45) becomes
We define ˜to be the minimiser over ˜a of (47). We can observe that since ¯
. Thus, we have by Lemma 7 that the uniquely optimal ˜
. This gives that the value of (47) is zero.
It is straightforward to see from (46) that must be the unique limit of (
have assumed that ˆ
and the infimum is not attained in (¯
), the inequality in line (46) can be made strict. It follows that
Thus, it remains for us to consider the case where ˆ, which implies that
. We denote this case (A2). Now, from (45) we can obtain
The objective is piecewise quadratic (and continuously differentiable), with two pieces: [ˆ]. On the first region, the objective is a convex quadratic with minimum at ¯
By the assumption that , we know that the objective must be concave on (ˆ
]. It is clear that the derivative of the objective at ˆ
is positive. Hence, if ˜
, then the objective will take a strictly lower value at some ˜
(for some small
0), contradicting optimality of ˜
. It therefore follows that ˜
With this knowledge, we can further simplify (48) to obtain
The second inequality follows from ¯. Hence, we obtain that
We now we direct our attention towards case (B), where similarly to before we observe that the penalty contributions between
Similarly to (44) in case (A), we obtain
We specify the infimum in (50) because ( ¯) is not closed and therefore a minimum may not exist. Let (
) be a convergent sequence in ( ¯
) whose limit achieves this infimum. We now define (
). By assumption (43), we know that ¯
¯
, which implies that ˆ
. Thus, one of
must be at least
We first consider if , and denote this case (B1). Here, (50) becomes
We can observe that (52) is the sum of two copies of (46) in case (A1). Hence, by following the same arguments as before, we see that
It therefore remains for us to obtain the result in the case that , and we denote this case (B2). Using that the separation (43)
, it is straightforward to see that one of ( ¯
) must be at least
. By the symmetry of the problem, it is sufficient for us to consider the case where ¯
. In this case, we can obtain from (50) that
where . From this, we can extract the terms dependent on ˜b to obtain
This objective is piecewise quadratic (and continuously differentiable), with two pieces; [˜]. Over the second region, the objective is a convex quadratic with minimum at ¯
]. By following the same argument as for (48) in case (A2), we see that ˜
With this knowledge, we can further simplify (53) to obtain
Since ¯, we can see that ( ¯
0. Thus, it suffices for us to show that
This objective is exactly as in (47) in case (A1), minimised over a smaller feasible set. Hence, it follows immediately that this holds and we can conclude that
We now have for all cases that ), which contradicts the optimality of ˆ
we can conclude that for
Lemma 9. Consider the setup of Lemma 8. For each j = 1, . . . , s, there exists 1
Proof. We first show that if ˆ, then for any
ˆ
We prove the first case since the proof for the second is identical. Suppose that this does not hold, i.e. ˆand there exists some (minimal)
¯. Then we construct ˇ
We observe that the penalty contribution from ˇis no more than that of ˆ
and that the quadratic loss for ˇ
will be strictly less than that of ˆ
. This gives us that
), contradicting the optimality of ˆ
Similarly, if ˆthen the corresponding statement that for any
1
We now establish a simple preliminary result. Suppose that for some j in {1, . . . , s} there exists ], such that
claim that if ˆ
). Similarly, if ˆ
ˆ
To prove the claim, we consider the case ˆ(the other is identical). By the first observation, if ˆ
. Now, for contradiction, suppose ˆ
) and let this k be minimal. Then we can construct ˇ
By appealing to the optimality of ˆ, we can easily observe that ˆ
and therefore that the ordering of the entries of ˇ
matches that of ˆ
. Here, we use that (
We will without loss of generality take the second statement to be true (the proof for the first case follows identically). Let denote the minimal element in
ˆ
. From the preliminary result established earlier, ˆ
appealing to the optimality of ˆ
, we see that ˆ
(otherwise, we could take ˆ
be ¯
and strictly reduce the value of the objective).
Now, we will use that the separation is at least 2(. By our earlier observation (55), it is clear that any
that since ˆ
, it follows that ¯
and therefore that
by the preliminary result. Since
and separation (43) , we can define
such that ˆ
Now, in order to contradict the optimality of ˆwe construct a new feasible point ˜
by setting
It follows that for
It is also straightforward to see that . If follows that the loss contribution in
) is strictly less than that in
). Hence, using ˆ
obtain
contradicting the optimality of ˆ. We conclude that for j = 1, . . . , s, there exists
1
such that ˆ
Lemma 10. Consider the univariate objective (11), relaxing the normalisation constraint to ˇ. Suppose that
be the diagonal matrix with entries
First note that
Thus for all
Consider minimising is the unit simplex scaled by ˇw. We aim to show this minimum is 0. As with the first claim in the proof of Lemma 8, it is straightforward to see that for any feasible (
), there exists
As on the RHS we are minimising a continuous function over a compact set, we know a minimiser must exist. Let (˜) be a minimiser (to be specified later). Observe that
is linear as a function of . Hence it is minimised over the set
at some point in . Here conv(
) denotes the convex hull operation. We thus have
Since the penalty contribution from ˇis not greater than that of ˜
, it follows that
Thus, we can assume that entries of ˜
can take one of only two distinct values.
Next we write ˜and observe that ˜
. Let us set
and
. Then we have
In the second line above, we have solved for s to find that
In the third line above, we have solved for ˜to obtain ˜
2 and hence ˜
These follow from optimality of ˜
respectively. The result follows from applying Lemma 7, setting
S2.2 Proof of Theorem 6
We begin by defining to be the orthogonal projection onto the linear space
The residuals from the oracle least-squares fit are (. The partial residuals
as defined in (18) for the jth variable are therefore
For j = 1, . . . , p, we define ¯, reordering the labels such that ¯
. We then aim to apply the arguments of Theorem 5 to ˆ
defined by
In order to do this, we define the events (for some to be determined later):
On the intersection of events , we have that
2. By followingan identical approach to that involved in computing (35), we have that
where we recall that
We now turn our attention to the event Λ. Note that if
= 1, then this is immediately satisfied since ˆ
1, we use that the oracle least squares estimate ˆ
a linear transformation A of the responses (
has an independent (non-central) sub-Gaussian distribution with parameter
. Therefore for each
ˆ
also has a sub-Gaussian distribution, with parameter at most
(recalling that
). This enables us to show that
We can now set 2. From (26) and the triangle inequality, on the event Λ
we have that
Thus, on the intersection of events Λ, we can proceed as in the proof of Theo-
rem 5 from (38), to conclude that ˆ
It immediately follows that on the intersection of events
ˆ. By a union bound, this occurs with probability at least
where in the final line we use
S3.1 Details of methods
Tree-based methods
We used the implementation of the random forest procedure [Breiman, 2001] in the R package randomForest [Liaw and Wiener, 2002] with default settings. CART [Breiman et al., 1984] was implemented in the R package rpart [Therneau and Atkinson, 2019], with pruning according to the 1-SE rule (as described in the package documentation).
CAS-ANOVA
The CAS-ANOVA estimator ˆoptimises over (
) a sum of a squared loss term (3) and an all-pairs penalty term (4). In particular, Bondell and Reich [2009] consider two regimes of weight vectors w. The first is not data-dependent and sets
The second, ‘adaptive CAS-ANOVA’, uses the ordinary least squares estimate for
to scale the weights. Here,
Here we introduce a new variant of adaptive CAS-ANOVA, following ideas in B¨uhlmann and Van De Geer [2011] for a 2-stage adaptive Lasso procedure. Instead of using the ordinary least squares estimate ˆ
in the above expression, an initial (standard) CAS-ANOVA estimate is used to scale the weights, with
selected for the initial estimate by 5-fold cross-validation. In simulations, this outperformed the adaptive CAS-ANOVA estimate using ordinary least squares initial estimates so in the interests of time and computational resources this was omitted from the simulation study. Henceforth adaptive CAS-ANOVA will refer to this 2-stage procedure.
The authors describe the optimisation of ˆas a quadratic programming problem, which was solved using the R package rosqp [Anderson, 2018]. Here we used our own implementation of the quadratic programming approach described by the authors. We found it considerably faster than the code available from the authors’ website, and uses ADMM-based optimisation [Boyd et al., 2011] tools not available at the time of its publication. We also found, as discussed in Section 5.1 of Maj-Ka´nska et al. [2015], that we could not achieve the best results using the publicly available code. Lastly, using our own implementation allowed us to explore a modification of CAS-ANOVA using the more modern approach of adaptive weights via a 2-stage procedure [B¨uhlmann and Van De Geer, 2011] to compare SCOPE to a wider class of all-pairs penalty procedures.
For large categorical variables, solutions are slow to compute and consume large amounts of memory. In the case of binary response, CAS-ANOVA models were fitted iterating a locally quadratic approximation to the loss function.
DMR
The DMR algorithm [Maj-Ka´nska et al., 2015] is implemented in the R package DMRnet [Prochenka- So�ltys and Pokarowski, 2018]. The degrees of freedom in the model is decided by 5-fold cross-validation. It is based on pruning variables using the Group Lasso [Yuan and Lin, 2006] to obtain at a low-dimensional model, then performing backwards selection based on ranking t-statistics for hypotheses corresponding to each fusion between levels in categorical variables.
The cross-validation routine appeared to error when all levels of all categorical variables were not present in one of the folds. In Section 6.2, cross-validation was therefore not possible so model selection was performed based on Generalized Information Criterion (GIC) [Zheng and Loh, 1995]. In all other examples, models were selected via 5-fold cross-validation.
Bayesian effect fusion
In Section 6.1.1 we include Bayesian effect fusion [Pauger and Wagner, 2019], implemented in the R package effectFusion [Pauger et al., 2019]. Coefficients within each categorical variable were modelled with a sparse Gaussian mixture model. The posterior mean was estimated with 1000 samples.
Lasso
In Section 6.1.2 we also include Lasso [Tibshirani, 1996] fits, to serve as a reference point. Of course, this is unsuitable for models where levels in categorical variables should be clustered together, but the advanced development of the well-known R package glmnet [Friedman et al., 2010] nevertheless sees its use in practice.
In order to make the fit symmetric across the categories within each variable, models were fitted with an unpenalised intercept and featuring dummy variables for all of the categories within each variable. This is instead of the corner-point dummy variable encoding of factor variables that is commonly used when fitting linear models. Models are fitted and cross-validated with cv.glmnet using the default settings.
SCOPE
For SCOPE, we have provided the R package CatReg [Stokell, 2021]. The univariate update step (see Section 3.1) is implemented in C++ using Rcpp [Eddelbuettel and Fran¸cois, 2011], with models fitted using a wrapper in R. For the binary response case, the outer loop to iterate the local quadratic approximations in the proximal Newton algorithm are done within R. In the future, performance could be improved by iterating the univariate update step (and the local quadratic approximations, as in Sections 6.2 and 6.3) within some lower-level language. In higher-dimensional experiments, SCOPE was slowed by cycling through all the variables; an active-set approach to this could make it faster still.
S3.2 Further details of numerical experiments
For the experiments in Section 6.1, we define the signal-to-noise ratio (SNR) as is the standard deviation of the signal
is the standard deviation of the noise
Figure 9: Prediction performance of various methods: (A) SCOPE-8; (B) SCOPE-32; (C) SCOPE-CV; (D) Linear regression; (E) Oracle least squares; (F) CAS-ANOVA; (G) Adaptive CAS-ANOVA; (H) DMR; (I) BEF; (J) CART; (K) RF. Note that some ‘boxes’ are not visible in some of the plots; this is due to the MSPE in the tests being beyond the range of the plot.
S3.2.1 Low-dimensional simulations
In Table 7 we include details of computation time and dimension of the fitted models. Figure 9 visualises the results also summarised in Table 1 in the main paper.
Table 7: Mean fitted model dimension and computation time for the various methods.
Figure 10: Prediction performance of various methods: (A) SCOPE-8; (B) SCOPE-32; (C) SCOPE-CV; (D) Oracle least squares; (E) DMR; (F) CART; (G) RF; (H) Lasso. Note that some ‘boxes’ are not visible in some of the plots; this is due to the MSPE in the tests being beyond the range of the plot.
S3.2.2 High-dimensional simulations
Here we include additional results relating to the high-dimensional experiments. Figure 10 visualises the results in Table 2 of the main paper.
Table 8: Mean computation time (s)
Table 9: Mean fitted model dimension
Table 10: Proposition of times each was selected by cross-validation.
E. Anderson. rosqp: Quadratic Programming Solver using the ’OSQP’ Library, 2018. R package version 0.1.0.
H. D. Bondell and B. J. Reich. Simultaneous factor selection and collapsing levels in ANOVA. Biometrics, 65(1):169–177, 2009.
S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and , 3(1):1–122, 2011.
P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232, 2011.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984.
P. B¨uhlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
D. Eddelbuettel and R. Fran¸cois. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1–18, 2011.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
A. Liaw and M. Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002.
A. Maj-Ka´nska, P. Pokarowski, A. Prochenka, et al. Delete or merge regressors for linear model selection. Electronic Journal of Statistics, 9(2):1749–1778, 2015.
D. Pauger and H. Wagner. Bayesian effect fusion for categorical predictors. Bayesian Analysis, 14(2):341–369, 2019.
D. Pauger, M. Leitner, H. Wagner, and G. Malsiner-Walli. effectFusion: Bayesian Effect Fusion for Categorical Predictors, 2019. R package version 1.1.1.
A. Prochenka-So�ltys and P. Pokarowski. DMRnet: Delete or Merge Regressors Algorithms for Linear and Logistic Model Selection and High-Dimensional Data, 2018. R package version 0.2.0.
B. Stokell. CatReg: Solution Paths for Linear and Logistic Regression Models with Categorical Predictors, with SCOPE Penalty https://CRAN.R-project.org/package=CatReg, 2021.
T. Therneau and B. Atkinson. rpart: Recursive Partitioning and Regression Trees, 2019. R package version 4.1-15.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
X. Zheng and W.-Y. Loh. Consistent variable selection in linear models. Journal of the American Statistical Association, 90(429):151–156, 1995.