Safe Screening for the Generalized Conditional Gradient Method

2020·Arxiv

Abstract

Abstract

The conditional gradient method (CGM) has been widely used for fast sparse approximation, having a low per iteration computational cost for structured sparse regularizers. We explore the sparsity acquiring properties of a generalized CGM (gCGM), where the constraint is replaced by a penalty function based on a gauge penalty; this can be done without significantly increasing the per-iteration computation, and applies to general notions of sparsity. Without assuming bounded iterates, we show O(1/t) convergence of the function values and gap of gCGM. We couple this with a safe screening rule, and show that at a rate )), the screened support matches the support at the solution, where 0 measures how close the problem is to being degenerate. In our experiments, we show that the gCGM for these modified penalties have similar feature selection properties as common penalties, but with potentially more stability over the choice of hyperparameter.

1 Introduction

The conditional gradient method (CGM) is an iterative method with a particularly cheap per-iteration cost, and is thus favored in large-scale machine learning applications. A generalized CGM (gCGM) minimizes over the regularized convex problem

where f is a smooth convex function and h promotes structural properties. At each iteration, the method updates the primal variable as

where [0, 1] is a pre-determined decaying sequence. If the indicator function for a compact convex set P, then this iteration scheme reduces to the vanilla CGM (vCGM) for constrained optimization; the main extension in gCGM is to solve unconstrained (but penalized) problems, where the iterates are not forced to stay within a specified bounded set. Specifically, we consider )), where is a monotonically nondecreasing function and is the gauge penalty function induced by “nice” sets P; overall, this penalty encourages sparsity in the minimizer with respect to the extremal vertices of P.

1.1 Related work

Applications. A main use case of CGMs is in finding generalized sparse solutions to convex losses Jaggi (2013); Chandrasekaran et al. (2012), where the -norm penalty in promoting element-wise sparsity Tibshirani (1996); Donoho (2006); Cand`es and Tao (2005); Cand`es and Romberg (2006) is generalized to gauge functions that promote sparsity with respect to “atoms”, which are the lowest dimensional facets of a convex set P. This generalizes sparse optimization to applications such as low-rank matrix optimization Yu et al. (2017); Freund et al. (2017) and grouped feature extraction Vinyes and Obozinski (2017); Zeng and Figueiredo (2014); Bondell and Reich (2008). Additionally, these atoms may be feasible solutions to combinatorial problems, and (1) may be a convex relaxation, such as in submodular optimization Bach (2010) and object tracking Chari et al. (2015). Other machine learning applications involving the CGM include graphical models Krishnan et al. (2015), multitask learning Sener and Koltun (2018), SVMs Lacoste-Julien et al. (2012), particle filtering Lacoste-Julien et al. (2015), and deep learning Ping et al. (2016); Berrada et al. (2018).

Safe screening. Safe screening rules for LASSO were first proposed by Ghaoui et al. (2012), and have since been extended to a number of smooth losses and generalized penalties Fercoq et al. (2015); Xiang and Ramadge (2012); Wang et al. (2014); Liu et al. (2013); Malti and Herzet (2016); Raj et al. (2016); Ndiaye et al. (2015); Wang et al. (2013); Bonnefoy et al. (2015); Zhou and Zhao (2015). Rules for “group” testing Herzet and Dr´emeau (2018) and sample screening Shibagaki et al. (2016); Ogawa et al. (2013) have also been considered. An interesting related work is the “stingy coordinate descent” method Johnson and Guestrin (2017) for LASSO, which optimizes the sparse regularized problem in a CGM-like manner, but uses screening to dynamically skip steps; this kind of methods can be extended to gCGM as well for generalized atoms.

A key challenge in penalized sparsity problems is that when the dual is constrained, the corresponding dual variable may not be feasible, and thus the computed gap is +. In this context, gap-safe screening methods offer a number of solutions, such as scaling or projecting to acquire a dual feasible candidate. We do not attempt to remedy this problem; in fact, in gCGM, the typical LASSO penalty presents a fundamental implementation issue, in that if , then the problem (2) can easily be unbounded. By requiring curvature of ) for large enough , we ensure that the dual problem is unbounded, and the natural dual candidate ) does not need to be adjusted to ensure bounded subproblems (2). This ensures that gCGM is well-defined and converging to the solution; additionally, it allows easier gap calculations. These curvature conditions will be elaborated in later sections.

Conditional gradient methods. The vCGM, also called the Frank-Wolfe method Frank and Wolfe (1956); Dunn and Harshbarger (1978), considers minimizing (1) as a constrained optimization problem (where ) for some scaling C > 0). The method is particularly useful when computing the supporting hyperplane in (2) is computationally simple (e.g., when P is the unit ball of the -norm or the nuclear norm). Thus, CGM is widely considered in the context of generalized sparse optimization Hazan (2008); Clarkson (2010); Jaggi (2013); Tewari et al. (2011), with many variations such as backward steps Lacoste-Julien and Jaggi (2015); Rao et al. (2015) and fully-corrective steps Von Hohenbalken (1977), and connections to other methods like mirror descent Bach (2015), cutting plane method Zhou et al. (2018), and greedy coordinate-wise methods Clarkson (2010).

In comparison, gCGM (where h(x) may be unconstrained) has been much less studied, and has appeared under different names, like regularized coordinate minimization Dudik et al. (2012). An O(1/t) convergence rate has been shown for specific smooth functions Mu et al. (2016), with bounded assumptions on iterates Bach (2015), or with improvement steps to ensure boundedness of sublevel sets Yu et al. (2017); Harchaoui et al. (2015). When f is quadratic and for a special form of , the gCGM can be shown to be equivalent to a form of the iterative shrinkage method, and under proper problem conditioning, has linear convergence Bredies et al. (2009); Bredies and Lorenz (2008). We also give an O(1/t) convergence rate on objective function values and minimum gap convergence, but relinquish any assumption on boundedness of iterates.

1.2 Contributions

We analyze the convergence and support recovery properties of the gCGM for (1), where )) involves only modifications of a gauge function ). We assume that the loss function f is L-smooth, the function ) grows at least quadratically when is large, and the set P is convex and compact. Our contribution is threefold.

• Without boundedness assumptions on iterates, the function value error and minimum duality gap of gCGM converge as O(1/t).

• We provide a safe dual screening rule for any intermediate variable x. This rule is algorithmically agnostic, and generalizes SAFE screening rules for LASSO to any gauge function and any case where is monotonicaly nondecreasing, in particular to cases where the dual is unconstrained and thus always feasible.

• Finally, by bounding the gradient error with the gap, we give a mechanism for deriving manifold identification rates for any version of gCGM where minimum gap rates are known.

Additionally, our proof technique is from a convex analysis viewpoint, in that we measure all distances and errors in terms of gauges and support functions of P (sometimes symmetrized). This is done for two reasons: first, to ensure that all analysis is linear invariant (in similar spirit as Lacoste-Julien and Jaggi (2015)); and second, for increased interpretability, as connections can be drawn to the much more intuitive (but restrictive) case of in sparse optimization (and more commonly considered in screening literature). All proofs are given in the appendix.

2 Preliminaries

2.1 Generalized sparse optimization

Define a finite set of points , and its convex hull as is a convex and compact set. We consider problems of the form

where is a monotonically nondecreasing function. The function

is the gauge function of P; in particular, it measures the “size” of x by giving how much the set P must be expanded (or can be contracted) to include x, and generalizes norms to any positive homogenous and subadditive function Freund (1987); Chandrasekaran et al. (2012). We define the support of x with respect to (denoted )) as the set of ’s in (5) for which 0. Such a set may not be uniquely defined, but we consider support recovery achieved if one such set is revealed.

Gauge functions can be seen as generalized versions of the -norm, which is a convex promoter of nonzero vector sparsity. In particular, if is the signed standard basis, then we exactly recover . More generally, if vectors spanning , then defining the matrix , and promotes vectors x = Pc whose pre-image c is sparse. But gauges also encompass more general scenarios, such as seminorms (e.g., total variation norm), non-polyhedral norms (e.g., nuclear norm), and conic constraints; they can also be manipulated to include ordering, such as with the OWL norm (Zeng and Figueiredo, 2014), and discover groupings with the OSCAR norm (Bondell and Reich, 2008).

A “dual gauge” can be constructed as the support function

In particular, if is a norm, then is the usual dual norm. Finding an optimal variable s in (6) is key in computing (2), and properties of z can be used to reveal the support of s with respect to P.

Property 1 (Support optimality condition). If is in the support of a minimizer of (4), then

Consider the problem

In this case, is the dual norm of . Then, by setting the optimality condition 0

In words, the gradient of f along a coordinate for which the optimal variable is nonsmooth with respect to is allowed “wiggle room”; in contrast, if g(x) is smooth in the direction of then the gradient is fixed. In terms of support recovery, maxand additionally, if then it must be that More generally, visually, the condition ) says that at the optimum, the gradient in the direction of is as steep as allowable; wants to keep going in this direction, but is blocked because of a constraint or nonsmooth penalty. For gauges, this non-smoothness only happens when the contribution of in x is 0, thus translating to a support recovery property.

The proof follows from convex analysis principles describing the dual behaviors of The property itself serves as the main principle behind dual screening methods; by identifying ’s that are sufficiently far from the maximum value, we can guess that such ’s do not appear in the support of .

Noncompact P. In practice, recession directions in P may be desirable to allow for unpenalized directions. For example, in the total variation norm, which promotes smoothness, . In this case, a finite ) constrains z to be in the nullspace of all such recession directions. In gCGM, such gauges are problematic because the solution to the generalized subproblem (2) is unbounded if a recession direction. Therefore we assume P to be compact.

0 on the boundary of P. It may be desirable to have partially enforce conic constraints as well, such as in semidefinite optimization where ) + ) promotes low-rank positive semidefinite matrices. In this case, since no negative definite elements are in P, 0 must be on the boundary of P. In the dual, this corresponds to a recession direction, as any negative definite matrix Z necessarily has This scenario does not affect the effectiveness nor analysis of gCGM; in particular, if (2) ever returns s = 0, then optimality is achieved.

Infinite atomic sets. We assume that is a finite set. In low-rank matrix completion, for which CGMs are frequently used, the nuclear norm acts as the gauge function over the set of rank-1 norm-1 matrices, which is a compact but uncountably infinite set. In fact, the gCGM is still well-defined in this case, and all of the results in this paper are consistent. However, since as there are no isolated points in , it is impossible to guarantee finite-time exact support recovery (and in fact defined below is always 0). Thus, although safe screening rules do apply in this case, without modification they may not provide practical advantages.

Gauges and support functions for convex sets are fundamental objects in convex analysis, and are discussed more by Rockafellar (1970); Borwein and Lewis (2010); Freund (1987); Friedlander et al. (2014).

2.2 Duality

The Fenchel dual of (4) can be computed as

is nonnegative and 0 only at optimality. Since at optimality ), a reasonable measure of suboptimality for a nonoptimal x is )). In particular,

can be used as a computable residual measure for both convergence tracking and screening rules; here,

2.3 Generalized CGM (gCGM)

There are many ways of solving problems of the form (4), and our dual screening results and manifold identification results are in fact method-agnostic. Here, we investigate the gCGM, which has almost as cheap of a per-iteration cost as the vCGM. In particular, if we decompose s in terms of its gauge value and normalized direction ˆs, then their minimizations can be done independently. Explicitly, step (2) can be summarized in two steps, with , and

where argmax

LMO returns a finite ˆs; however, the minimization for is more complicated. As a simple example, consider gCGM applied to the one-dimensional problem

At the very first step, , and if |c| > 1 then is unbounded. Therefore, further conditions on must be imposed.

2.4 Generalized penalty

The function facilitates the transition of (4) from penalized to constrained optimization. When , then (4) is a typical sparse regularized problem; at the other extreme, an indicator function can constrain , reducing everything to the vCGM case (vanilla CGM).

Assumption 1 (Well-defined gCGM). The function is monotonically nondecreasing over all 0. Moreover, the set of subdifferentials of is not upper bounded:

Assumption 2 (Convergence of gCGM). The function is lower bounded by a quadratic function

for some 0 and .

Property 2 (Well-defined and converging gCGM). Assumption 1 ensures that the conjugate function

is finite-valued and attained for all 0. Moreover, there always exists a finite maximizer .

In particular, in the case that , and the function

As shown earlier, when 1 then ; we exclude this case as gCGM will not converge in this case. When is strongly convex and we can show O(1/t) convergence of gCGM. When 1 2, ) is finite and the iterates are well-defined, but the method may converge or diverge.

Example: Barrier functions. Consider

which is a log-barrier penalization function for ; as ) approaches the indicator function for this constraint. Its conjugate is

achieved at + 1). For all 0, and , both and exist and are finite. Note also the implicit constraint, as )) is finite only if .

2.5 Generalized smoothness

Definition 1. A function is L-smooth with respect to P if for all :

The purpose of this generalized notion is that sometimes, given the data, tighter bounds can be computed (see, e.g., Nutini et al., 2015).

Example: Quadratic function. Suppose that

Then

While norm bounds would give , the actual values in A might lead to tighter inequalities.

Example: Linear model. Suppose that

for some convex, smooth twice-differentiable function g (e.g., logistic or exponential regression). Then

Equivalence to usual smoothness. Suppose that f is -smooth in the usual sense (with respect to ). Then since , it follows that . In this way, we refine the analysis of gCGM by absorbing the usual “set size” term into L, which in certain cases may be much smaller than .

2.6 Invariance

One appealing feature of the vCGM is that the iteration scheme and analysis can be done in a way that is invariant to both linear scaling and translation. Specifically, if Q = AP + b, and f(x) = g(Ax + b), then the two problems

are equivalent. However when the gauge function is not used as an indicator, this translation invariance vanishes; in general, ). Therefore the generalized problem formulation (4) is only linear (not translation) invariant; thus our analysis only maintains this invariance as well.

Property 3 (Invariance). Consider two equivalent problems where f(x) = g(Ax) and Q = AP:

For any x, w = Ax,

• x optimizes (P1) optimizes (P2),

• κ(x(w),

•

•

• f is L-smooth with respect to P if and only if g is L-smooth with respect to Q,

•

3 Main results

In this section we give the main theoretical contributions: convergence rate, dual screening rule, and support identification complexity. These results all derive from some simple observations:

• The minimum duality gap at converges to 0 as an optimal primal variable.

• The gradient error can be upper bounded by the gap, and support recovery is guaranteed when it is smaller than a problem-dependent constant, which is difficult to compute in practice.

• Without knowing this constant, one can still give partial support guarantees, which is used to construct screening rules.

We now state these points formally; all proofs are given in the appendix.

Theorem 1 (Convergence). Suppose that are the iterates of gCGM for which f is L-smooth with respect to is monotonically nondecreasing, and satisfies Assumptions 1 and 2 for some 0. Take + 1). Then

A key difference between this result and previous works is that we do not assume or enforce bounded

iterates.

The scaled gradient error will serve as our primary “residual quantity” in measuring distance to support recovery:

and the symmetrization ensures that ), bounding errors in both directions.

Lemma 1 (Gap bounds residual). For any primal feasible variable x,

Figure 1 gives a cartoon intuition as to what a small residual buys us. In particular, if is larger than 2res(x), then a maximal element of must also be a maximal element of . Since we can observe a bound on res(x), it is now possible to exclude which atoms are definitively not in ).

Theorem 2 (Dual screening). Assume that f is L-smooth with respect to . Then for any

implies that ), where is the optimal variable in (4).

Remark (Practical considerations). Some things to note about this screening method:

• Computing L may be challenging, depending on ; as shown previously, at the very least it may require a full pass over the data. However, this is a one-time calculation per dataset, and can be estimated if data are assumed to be drawn from specific distributions (as in sensing applications).

Figure 1: Support recovery. The constant differentiates maximal in-support values from the largest non-support value, as in (15). ) for some current (non-optimal) iterate and (illustrated as ), its largest possible value. Then it is possible that some ) exists where ; thus, a safe screening rule can at largest be a threshold at . This rule eliminates all false negatives. To ensure no false positives, the largest possible non-optimal non-support value () must be smaller than the screened point. This can only happen if .

• If is large (such as in submodular optimization) then checking condition (14) for each atom at each iteration is also cumbersome. However, if the screening is aggressive, then after a few iterations, the list of potential atoms to check will decrease quickly as well.

• Computing the gap, in comparison, is almost automatic in gCGM, given that is the (always feasible) dual candidate and already computed. In comparison, when dealing with a different dual candidate, then the term ) is not easily upper bounded, and depending on the choice of f may be difficult to compute in practice.

The “safeness” of the screening rule (Theorem 2) ensures that ), for all t. For support identification, we would like to find a ¯t where for all t > ¯). Note that with a deterministically decaying sequence for , finite-time support recovery without screening is impossible, since any erroneously selected atoms early on can never fully diminish. Even with screening, it is still not automatically guaranteed that such a finite ¯t exists, since the problem itself may be degenerate Lewis and Wright (2011); Hare (2011); Burke and Mor´e (1988). This occurs when

is a problem-dependent (algorithm-independent) quantity.

Theorem 3 (Support identification of screened gCGM)

which, under the assumptions of Theorem 1, happens at a rate )).

The proof follows from the gap bound (Lemma 1), gap rate (Theorem 1), and scrutiny of Figure 1; specifically, when 4, then any rule that screens away elements that are more than 2screen away all the non-support elements.

Remark (Generality). Note that Theorems 2 and 3 impose no conditions on the sequence , or choice of f, etc., except L-smoothness of f. In other words, for any method where minknown, then a corresponding screening rule and support identification rate automatically follow.

Figure 2: 01. For p < 1.5, the method diverged. corresponds to vCGM.

4 Experiments

We consider sparse logistic regression

where 0 controls the weighting of the penalty term and C > 0 the magnification of data vectors and are binary labels. In all cases we run gCGM with + 1).

4.1 Synthetic experiments

First, we generate i.i.d. standard Gaussian normal vectors, and fix 100, and analyze the numerical behavior of gCGM when choices of p are plotted in Figure 2, with 01. For low values of p we observe some numerical instability in the early iterates, as for “flatter” penalty functions the new steps can be very large. Figure 3 compares different problem residuals with 0. In particular we are able to verify our O(1/t) bound on all residuals, though it is clear that for this example, the gap is converging much more slowly than the gradient error , which is almost twice as fast, which is why our screening rule, though safe, can be pessimistic in practice. Finally, Figure 4 shows the evolution of the support size for p = 2 and varying values of . In general, a larger value of causes aggressive screening early on, while for larger values of screening may be much slower, despite arriving at about the same final sparsity level.

4.2 MNIST classification of 4’s vs 9’s

Figure 5 shows the screening behavior of (17) on the binary classification problem of disambiguating 4’s and 9’s in the MNIST handwriting dataset. We experiment with three schemes: one-norm squared regularization (), one-norm ball constraint ()), and log barrier (12). All experiments are halted at 10,000 iterations for fair comparison.

Figure 3: 0. The objective error and gap decay at the computed rate of O(1/t). The gradient error decays as ). In fact when the gradient error ) dips below 4, support error is 0; unfortunately, the gap takes longer to reach this point.

Figure 4: the number of unscreened variables at each iteration. We observe more aggressive screening for larger .

There are two major observations. First, the yellow curve (observed sparsity) is often much lower than the red curve (guarantee-able sparsity). This is because when is small or C is big, the gap converges slowly, and the condition 000 (our stopping condition). However, that is the tradeoff required for “safety”.

Second, the red curve (guarantee-able sparsity) is only small when the blue curve (misclassification rate) is higher, suggesting an inherent performance/sparsity tradeoff. This tradeoff is in fact observed for all three choices of and suggests that in general, the MNIST classification task performs best without extreme sparsity.

5 Conclusion

We have given a gap-based safe screening rule for a family of sparse optimization problems, for various types of sparse penalties and atoms. We analyze this in the context of the gCGM, and give rates for convergence and support identification for nondegenerate problems. In particular, the generalization over atom type and choice of allows for a much richer collection of sparse models, interpolating between the piece-wise linear unconstrained LASSO penalty and the hard norm ball constraint. These penalties differ in their sensitivity toward hyperparameters, and may be more suited to a wider range of applications.

A key promise in these rules is that, in the spirit of Ghaoui et al. (2012), screening is safe, e.g., no true nonzero will be wrongly called a zero. However, in practice this rule may be pessimistic, first because the gap may serve as an overly pessimistic upper bound of the gradient error, and second because sparsity in the true solution of the optimization problem may be overkill for sparsity of a solution that generalizes well for the machine learning task.

Still, there are practical advantages. A sparsity guarantee gives storage benefits; a model trained on a large server can be moved to a mobile device, for example, with no need for heuristic thresholding or rounding. And, if is very large, then screening can greatly improve the runtime of the linear minimization oracle (LMO), used in each step; since the rules are safe, this can be done without disrupting any convergence guarantees.

Appendix A Helpful facts

Lemma 2 (Relationship of to

Proof. Using another classical definition for gauge functions,

where is the smallest Euclidean ball of radius r that includes P; that is, ).

We denote the subdifferential of a convex function f at x as ), and the normal cone of P at x as ); See Rockafellar (1970).

Lemma 3 (Conjugate of nested function)

Figure 5: MNIST experiment. Solid/dashed blue lines are train/test misclassification rates. Solid/dashed/dotted red lines are number of unscreened features at 10000 / 5000 / 1000 iterations; it is possible that more features would be screened away after more iterations, as the gap converges very slowly for small . Green square line plots the number of nonzeros of , which is observed to be stable. (Top) . (). (Bottom) Log barrier function (12) where C = 10.

Proof. From the definitions, we have:

Lemma 4 (Chain rule for subdifferential (Bauschke and Combettes (2011), Corollary 16.72.))

Lemma 5 (Gap in primal form). For f everywhere differentiable,

Proof. By construction of )). And, in general, for convex lower semicontinuous f, f(x) + ). The rest follows from substitution.

Appendix B Proofs from Section 2

Property 1 (Support optimality condition). If is in the support of a minimizer of

where f is everywhere differentiable and satisfies Assumptions 1 and 2, then )).

Proof. Without loss of generality, we assume 0 , since . Denote ). Now, applying Lemma 4, the optimality condition for (4) is

for some is monotonically nondecreasing over the property is trivially true. Now consider 0. Noting that where is the polar set of P,

Now take the conic decomposition where 0, and

which is with equality if and only if )) whenever 0.

Property 2 (Well-defined and converging gCGM).

Proof.

• Assumption 1. Since has nonempty domain, for all . It can be shown that whenever there exists a finite 0 where ), since then

Now define (0)). By the assumptions on , for any , there exists some finite 0 where ).

Therefore there always exists a finite maximizer of (19); since also ) is not , then (19) is always attained.

• Assumption 2. Assume that is as large as possible; e.g., there exists some finite where + . Then for all , for all ),

Property 3 (Invariance). Consider two equivalent problems where f(x) = g(Ax) and Q = AP:

For any x, w = Ax,

• ))

• f(x) = g(Ax + b) is L-smooth and -strongly convex with respect to P iff g is L-smooth and convex with respect to Q

•

Appendix C Generalized smoothness

The following bound holds for any closed convex P, which may or not be compact or symmetric.

Lemma 6 (Smoothness equivalences). Suppose that f is L-smooth with respect to :

Then the following also holds:

Proof. The proof largely follows from Nesterov (2013), mildly adapted.

• First prove (20) (21). Construct ), which is convex, also L-smooth, and has minimum at x = y. Then, for any w,

• Now prove (20) (22). Using the same g as before, consider

Corollary 1 (Uniqueness of gradient). If (20) holds and 0 , then ) is unique at the optimum.

Proof. Assume that ) for some feasible. Then by optimality conditions, (0, and thus

which implies that , this can only happen if

Lemma 7 (Hessian sufficient condition). For some closed convex set P, and some convex twice differentiable function , suppose that

Proof. By definition and positive homogeneity of ), more generally

Then

Lemma 8 (Gradient suboptimality bound). Suppose that argmin

with respect to and h is convex. Then

Proof. For any ) and ),

we derive (a) from expansiveness, and (b) from convexity of f + h.

Appendix D Proofs from Section 3

The following Lemma will be used in computing the objective value bound.

Lemma 9 (One step value bound). For f L-smooth with respect to and -convex,

we take the sequence + 1), and ¯∆represents an averaged suboptimality:

Proof. For one step, at , define

. Since, by smoothness of f,

then

and therefore

() + ) +

Term B. By convexity and homogeneity of ,

Taking , then

so

By optimality conditions on the update for ((8) in main text),

where (a) follows from Assumption 2. Then

where (b) follows from Lemma 8. Overall this gives

where (c) comes from (.

Lemma 10 (Objective value bound). Given f is L-smooth with respect to is monotonically increasing and -strongly convex, then the objective error decreases as

Proof. Take ¯t > 12B large enough so that for all ¯t, 3. Then define

Then, using Lemma (9), we have

We now pick G large enough such that for all ¯is always a bounded quantity, this is always possible. Then, for all t < ¯t,

Now we make an inductive step. Suppose that for some t, ∆for all . Pick + 1).

Then

which satisfies the inductive step.

The following is a generalized and modified version of a proof segment from Jaggi (2013), which will be used for proving O(1/t) gap convergence.

Lemma 11. Pick some 0 and pick

Proof. Using integral rule, we see that

This yields

Lemma 12 (Generalized non-monotonic gap bound). Given

• ∆for some ,

• for some and D, and

• ∆(1 + ) + (

then for

Picking , and invoking Lemma 11, this yields that ∆which is impossible. Therefore, the assumption must not be true.

Theorem 1 (Convergence). Suppose that are the iterates of gCGM for which f is L-smooth with respect to is monotonically increasing and -strongly convex. Take + 2). Then

Proof. The proof follows from Lemma 10 and 12.

Lemma 1 (Gap bounds residual). For any primal feasible variable x,

Proof. Taking )), we have

Additionally, by Fenchel-Young, f(x) + Therefore

for some )).

Take ). Then

Theorem 2 (Dual screening). Assume that f is L-smooth with respect to . Then for any

implies that ), where is the optimal variable in (4).

Proof. From Lemma 1, we have that when condition (25) holds,

Then, by the triangle inequality,

Thus by Property 1, ).

References

Bach, F. (2015). Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 25(1):115–129.

Bach, F. R. (2010). Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems, pages 118–126.

Bauschke, H. H. and Combettes, P. L. (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer, 2 edition.

Berrada, L., Zisserman, A., and Kumar, M. P. (2018). Deep Frank-Wolfe for neural network optimization. arXiv preprint arXiv:1811.07591.

Bondell, H. D. and Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123.

Bonnefoy, A., Valentin, E., Liva, R., and Gribonval, R. (2015). Dynamic screening: Accelerating first-order algorithms for the LASSO and group-LASSO. IEEE Transactions on Signal Processing, 63(19):5121–5132.

Borwein, J. and Lewis, A. S. (2010). Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer Science and Business Media.

Bredies, K. and Lorenz, D. A. (2008). Iterated hard shrinkage for minimization problems with sparsity constraints. SIAM Journal on Scientific Computing, 30(2):657–683.

Bredies, K., Lorenz, D. A., and Maass, P. (2009). A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42(2):173–193.

Burke, J. V. and Mor´e, J. J. (1988). On the identification of active constraints. SIAM Journal on Numerical Analysis, 25(5):1197–1211.

Cand`es, E. and Romberg, J. (2006). Robust signal recovery from incomplete observations. In 2006 International Conference on Image Processing, pages 1281–1284. IEEE.

Cand`es, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215.

Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. (2012). The convex geometry of linear inverse problems. Foundations of Computational mathematics, 12(6):805–849.

Chari, V., Lacoste-Julien, S., Laptev, I., and Sivic, J. (2015). On pairwise costs for network flow multi-object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5537–5545.

Clarkson, K. L. (2010). Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63.

Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306.

Dudik, M., Harchaoui, Z., and Malick, J. (2012). Lifted coordinate descent for learning with trace-norm regularization. In Artificial Intelligence and Statistics, pages 327–336.

Dunn, J. C. and Harshbarger, S. (1978). Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432–444.

Fercoq, O., Gramfort, A., and Salmon, J. (2015). Mind the duality gap: safer rules for the Lasso. arXiv preprint arXiv:1505.03410.

Frank, M. and Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110.

Freund, R. M. (1987). Dual gauge programs, with applications to quadratic programming and the minimum- norm problem. Mathematical Programming, 38(1):47–67.

Freund, R. M., Grigas, P., and Mazumder, R. (2017). An extended Frank–Wolfe method with in-face directions, and its application to low-rank matrix completion. SIAM Journal on Optimization, 27(1):319–346.

Friedlander, M. P., Macedo, I., and Pong, T. K. (2014). Gauge optimization and duality. SIAM Journal on Optimization, 24(4):1999–2022.

Ghaoui, L. E., Viallon, V., and Rabbani, T. (2012). Safe feature elimination for the Lasso and sparse supervised learning problems. Pacific Journal of Optimization.

Harchaoui, Z., Juditsky, A., and Nemirovski, A. (2015). Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112.

Hare, W. (2011). Identifying active manifolds in regularization problems. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 261–271. Springer.

Hazan, E. (2008). Sparse approximate solutions to semidefinite programs. In Latin American Symposium on Theoretical Informatics, pages 306–316. Springer.

Herzet, C. and Dr´emeau, A. (2018). Joint screening tests for Lasso. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4084–4088. IEEE.

Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, pages 427–435.

Johnson, T. B. and Guestrin, C. (2017). Stingy CD: safely avoiding wasteful updates in coordinate descent. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1752–1760.

Krishnan, R. G., Lacoste-Julien, S., and Sontag, D. (2015). Barrier Frank-Wolfe for marginal inference. In Advances in Neural Information Processing Systems, pages 532–540.

Lacoste-Julien, S. and Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. Advances in Neural Information Processing Systems, pages 496–504.

Lacoste-Julien, S., Jaggi, M., Schmidt, M., and Pletscher, P. (2012). Block-coordinate Frank-Wolfe optimiza- tion for structural svms. arXiv preprint arXiv:1207.4747.

Lacoste-Julien, S., Lindsten, F., and Bach, F. (2015). Sequential kernel herding: Frank-Wolfe optimization for particle filtering. arXiv preprint arXiv:1501.02056.

Lewis, A. S. and Wright, S. J. (2011). Identifying activity. SIAM Journal on Optimization, 21(2):597–614.

Liu, J., Zhao, Z., Wang, J., and Ye, J. (2013). Safe screening with variational inequalities and its application to Lasso. arXiv preprint arXiv:1307.7577.

Malti, A. and Herzet, C. (2016). Safe screening tests for Lasso based on firmly non-expansiveness. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4732–4736. IEEE.

Mirrokni, V., Leme, R. P., Vladu, A., and Wong, S. C.-w. (2017). Tight bounds for approximate Carath´eodory and beyond. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2440–2448.

Mu, C., Zhang, Y., Wright, J., and Goldfarb, D. (2016). Scalable robust matrix recovery: Frank–Wolfe meets proximal methods. SIAM Journal on Scientific Computing, 38(5):A3291–A3317.

Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. (2015). Gap safe screening rules for sparse multi-task and multi-class models. In Advances in Neural Information Processing Systems, pages 811–819.

Nesterov, Y. (2013). Introductory Lectures on Convex Optimization: A Basic Course. Springer Science and Business Media.

Nutini, J., Schmidt, M., Laradji, I., Friedlander, M., and Koepke, H. (2015). Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In International Conference on Machine Learning, pages 1632–1641.

Ogawa, K., Suzuki, Y., and Takeuchi, I. (2013). Safe screening of non-support vectors in pathwise SVM computation. In International conference on machine learning, pages 1382–1390.

Ping, W., Liu, Q., and Ihler, A. T. (2016). Learning infinite RBMs with Frank-Wolfe. In Advances in Neural Information Processing Systems, pages 3063–3071.

Raj, A., Olbrich, J., G¨artner, B., Sch¨olkopf, B., and Jaggi, M. (2016). Screening rules for convex problems. arXiv preprint arXiv:1609.07478.

Rao, N., Shah, P., and Wright, S. (2015). Forwardbackward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing, 63(21):5798–5811.

Rockafellar, R. T. (1970). Convex Analysis, volume 28. Princeton University Press.

Sener, O. and Koltun, V. (2018). Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pages 527–538.

Shibagaki, A., Karasuyama, M., Hatano, K., and Takeuchi, I. (2016). Simultaneous safe screening of features and samples in doubly sparse modeling. In International Conference on Machine Learning, pages 1577–1586.

Tewari, A., Ravikumar, P. K., and Dhillon, I. S. (2011). Greedy algorithms for structurally constrained high dimensional problems. In Advances in Neural Information Processing Systems, pages 882–890.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.

Vinyes, M. and Obozinski, G. (2017). Fast column generation for atomic norm regularization. In Proceedings of International Conference on Artificial Intelligence and Statistics.

Von Hohenbalken, B. (1977). Simplicial decomposition in nonlinear programming algorithms. Mathematical Programming, 13(1):49–68.

Wang, J., Zhou, J., Liu, J., Wonka, P., and Ye, J. (2014). A safe screening rule for sparse logistic regression. In Advances in neural information processing systems, pages 1053–1061.

Wang, J., Zhou, J., Wonka, P., and Ye, J. (2013). LASSO screening rules via dual polytope projection. Advances in neural information processing systems, pages 1070–1078.

Xiang, Z. J. and Ramadge, P. J. (2012). Fast Lasso screening tests based on correlations. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2137–2140. IEEE.

Yu, Y., Zhang, X., and Schuurmans, D. (2017). Generalized conditional gradient for sparse estimation. The Journal of Machine Learning Research, 18(1):5279–5324.

Zeng, X. and Figueiredo, M. A. (2014). The ordered weighted norm: Atomic formulation, projections, and algorithms. arXiv preprint arXiv:1409.4271.

Zhou, Q. and Zhao, Q. (2015). Safe subspace screening for nuclear norm regularized least squares problems. In International Conference on Machine Learning, pages 1103–1112.

Zhou, S., Gupta, S., and Udell, M. (2018). Limited memory Kelley’s method converges for composite convex and submodular objectives. Advances in Neural Information Processing Systems, pages 4414–4424.