35:[["$","audio",null,{"id":"tts"}],["$","$L3a",null,{"paperID":"2002.09718","publisher":"arxiv","paperJSON":{"title":"Safe Screening for the Generalized Conditional Gradient Method","paperID":"2002.09718","avgLineHeight":11.94,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"The conditional gradient method (CGM) has been widely used for fast sparse approximation, having a low per iteration computational cost for structured sparse regularizers. We explore the sparsity acquiring properties of a generalized CGM (gCGM), where the constraint is replaced by a penalty function based on a gauge penalty; this can be done without significantly increasing the per-iteration computation, and applies to general notions of sparsity. Without assuming bounded iterates, we show ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/t","element":"span"},{"text":") convergence of the function values and gap of gCGM. We couple this with a safe screening rule, and show that at a rate ","element":"span"},{"style":{"height":16.09},"width":139.67,"height":40.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-0.png","element":"img","alt":"O(1/(tδ2","inline":true},{"text":")), the screened support matches the support at the solution, where ","element":"span"},{"style":{"height":12.8},"width":56.94,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-1.png","element":"img","alt":" δ ≥","inline":true,"padRight":true},{"text":"0 measures how close the problem is to being degenerate. 
In our experiments, we show that the gCGM for these modified penalties has feature selection properties similar to those of common penalties, but with potentially more stability over the choice of hyperparameter.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"The conditional gradient method (CGM) is an iterative method with a particularly cheap per-iteration cost, and is thus favored in large-scale machine learning applications. A generalized CGM (gCGM) minimizes over the regularized convex problem","element":"span"}],[{"id":"id-11","style":{"width":"59%"},"width":1122,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is a smooth convex function and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"promotes structural properties. At each iteration, the method updates the primal variable ","element":"span"},{"style":{"height":14.19},"width":59.27,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-3.png","element":"img","alt":" x(t)","inline":true,"padRight":true},{"text":"as","element":"span"}],[{"id":"id-35","style":{"width":"67%"},"width":1263,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.98},"width":99.19,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-5.png","element":"img","alt":" θ(t) ∈","inline":true,"padRight":true},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1] is a pre-determined decaying sequence. 
If ","element":"span"},{"style":{"height":13.99},"width":119.56,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-6.png","element":"img","alt":" h = ιP","inline":true,"padRight":true},{"text":"the indicator function for a compact convex set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":", then this iteration scheme reduces to the vanilla CGM (vCGM) for constrained optimization; the main extension in gCGM is to solve unconstrained (but penalized) problems, where the iterates are not forced to stay within a specified bounded set. Specifically, we consider ","element":"span"},{"style":{"height":16},"width":257.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-7.png","element":"img","alt":" h(x) = φ(κP(x","inline":true},{"text":")), where ","element":"span"},{"style":{"height":14.79},"width":203.21,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-8.png","element":"img","alt":" φ : R+ → R","inline":true,"padRight":true},{"text":"is a monotonically nondecreasing function and ","element":"span"},{"style":{"height":10.39},"width":46.96,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-9.png","element":"img","alt":" κP","inline":true,"padRight":true},{"text":"is the gauge penalty function induced by “nice” sets ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"; overall, this penalty encourages sparsity in the minimizer ","element":"span"},{"style":{"height":10.98},"width":38.78,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/0-10.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"text":"with respect to the extremal vertices of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1.1 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Related work","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Applications. ","element":"span"},{"text":"A main use case of CGMs is in finding generalized sparse solutions to convex losses ","element":"span"},{"href":"#id-0","referenceIndex":29,"text":"Jaggi ","element":"a"},{"href":"#id-0","referenceIndex":29,"text":"(2013)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":13,"text":"Chandrasekaran et al. ","element":"a"},{"href":"#id-1","referenceIndex":13,"text":"(2012)","element":"a"},{"text":", where the ","element":"span"},{"style":{"height":7.6},"width":32.6,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/1-0.png","element":"img","alt":" ℓ1","inline":true},{"text":"-norm penalty in promoting element-wise sparsity ","element":"span"},{"href":"#id-2","referenceIndex":51,"text":"Tibshirani ","element":"a"},{"href":"#id-2","referenceIndex":51,"text":"(1996)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-3","referenceIndex":16,"text":"Donoho ","element":"a"},{"href":"#id-3","referenceIndex":16,"text":"(2006)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":12,"text":"Cand`es and Tao ","element":"a"},{"href":"#id-4","referenceIndex":12,"text":"(2005)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":11,"text":"Cand`es and Romberg ","element":"a"},{"href":"#id-5","referenceIndex":11,"text":"(2006) ","element":"a"},{"text":"is generalized to gauge functions ","element":"span"},{"style":{"height":10.39},"width":46.96,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/1-1.png","element":"img","alt":"κP","inline":true,"padRight":true},{"text":"that promote sparsity with respect to “atoms”, which are the lowest dimensional facets of a convex set 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":". This generalizes sparse optimization to applications such as low-rank matrix optimization ","element":"span"},{"href":"#id-6","referenceIndex":57,"text":"Yu et al. ","element":"a"},{"href":"#id-6","referenceIndex":57,"text":"(2017)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":22,"text":"Freund et al. ","element":"a"},{"href":"#id-7","referenceIndex":22,"text":"(2017) ","element":"a"},{"text":"and grouped feature extraction ","element":"span"},{"href":"#id-8","referenceIndex":52,"text":"Vinyes and Obozinski ","element":"a"},{"href":"#id-8","referenceIndex":52,"text":"(2017)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":58,"text":"Zeng and Figueiredo ","element":"a"},{"href":"#id-9","referenceIndex":58,"text":"(2014)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":5,"text":"Bondell and Reich ","element":"a"},{"href":"#id-10","referenceIndex":5,"text":"(2008)","element":"a"},{"text":". Additionally, these atoms may be feasible solutions to combinatorial problems, and ","element":"span"},{"href":"#id-11","text":"(1) ","element":"a"},{"text":"may be a convex relaxation, such as in submodular optimization ","element":"span"},{"href":"#id-12","referenceIndex":2,"text":"Bach ","element":"a"},{"href":"#id-12","referenceIndex":2,"text":"(2010) ","element":"a"},{"text":"and object tracking ","element":"span"},{"href":"#id-13","referenceIndex":14,"text":"Chari ","element":"a"},{"href":"#id-13","referenceIndex":14,"text":"et al. ","element":"a"},{"href":"#id-13","referenceIndex":14,"text":"(2015)","element":"a"},{"text":". Other machine learning applications involving the CGM include graphical models ","element":"span"},{"href":"#id-14","referenceIndex":31,"text":"Krishnan ","element":"a"},{"href":"#id-14","referenceIndex":31,"text":"et al. 
","element":"a"},{"href":"#id-14","referenceIndex":31,"text":"(2015)","element":"a"},{"text":", multitask learning ","element":"span"},{"href":"#id-15","referenceIndex":48,"text":"Sener and Koltun ","element":"a"},{"href":"#id-15","referenceIndex":48,"text":"(2018)","element":"a"},{"text":", SVMs ","element":"span"},{"href":"#id-16","referenceIndex":33,"text":"Lacoste-Julien et al. ","element":"a"},{"href":"#id-16","referenceIndex":33,"text":"(2012)","element":"a"},{"text":", particle filtering ","element":"span"},{"href":"#id-17","referenceIndex":34,"text":"Lacoste-Julien et al. ","element":"a"},{"href":"#id-17","referenceIndex":34,"text":"(2015)","element":"a"},{"text":", and deep learning ","element":"span"},{"href":"#id-18","referenceIndex":44,"text":"Ping et al. ","element":"a"},{"href":"#id-18","referenceIndex":44,"text":"(2016)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-19","referenceIndex":4,"text":"Berrada et al. ","element":"a"},{"href":"#id-19","referenceIndex":4,"text":"(2018)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Safe screening. ","element":"span"},{"text":"Safe screening rules for LASSO were first proposed by ","element":"span"},{"href":"#id-20","referenceIndex":24,"text":"Ghaoui et al. ","element":"a"},{"href":"#id-20","referenceIndex":24,"text":"(2012)","element":"a"},{"text":", and have since been extended to a number of smooth losses and generalized penalties ","element":"span"},{"href":"#id-21","referenceIndex":19,"text":"Fercoq et al. ","element":"a"},{"href":"#id-21","referenceIndex":19,"text":"(2015)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-22","referenceIndex":56,"text":"Xiang and ","element":"a"},{"href":"#id-22","referenceIndex":56,"text":"Ramadge ","element":"a"},{"href":"#id-22","referenceIndex":56,"text":"(2012)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-23","referenceIndex":54,"text":"Wang et al. 
","element":"a"},{"href":"#id-23","referenceIndex":54,"text":"(2014)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Liu et al. ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2013)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-25","referenceIndex":37,"text":"Malti and Herzet ","element":"a"},{"href":"#id-25","referenceIndex":37,"text":"(2016)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-26","referenceIndex":45,"text":"Raj et al. ","element":"a"},{"href":"#id-26","referenceIndex":45,"text":"(2016)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-27","referenceIndex":40,"text":"Ndiaye ","element":"a"},{"href":"#id-27","referenceIndex":40,"text":"et al. ","element":"a"},{"href":"#id-27","referenceIndex":40,"text":"(2015)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-28","referenceIndex":55,"text":"Wang et al. ","element":"a"},{"href":"#id-28","referenceIndex":55,"text":"(2013)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-29","referenceIndex":6,"text":"Bonnefoy et al. ","element":"a"},{"href":"#id-29","referenceIndex":6,"text":"(2015)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-30","referenceIndex":59,"text":"Zhou and Zhao ","element":"a"},{"href":"#id-30","referenceIndex":59,"text":"(2015)","element":"a"},{"text":". Rules for “group” testing ","element":"span"},{"href":"#id-31","referenceIndex":28,"text":"Herzet and Dr´emeau ","element":"a"},{"href":"#id-31","referenceIndex":28,"text":"(2018) ","element":"a"},{"text":"and sample screening ","element":"span"},{"href":"#id-32","referenceIndex":49,"text":"Shibagaki et al. ","element":"a"},{"href":"#id-32","referenceIndex":49,"text":"(2016)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-33","referenceIndex":43,"text":"Ogawa et al. ","element":"a"},{"href":"#id-33","referenceIndex":43,"text":"(2013) ","element":"a"},{"text":"have also been considered. 
An interesting related work is the “stingy coordinate descent” method ","element":"span"},{"href":"#id-34","referenceIndex":30,"text":"Johnson and Guestrin ","element":"a"},{"href":"#id-34","referenceIndex":30,"text":"(2017) ","element":"a"},{"text":"for LASSO, which optimizes the sparse regularized problem in a CGM-like manner, but uses screening to dynamically skip steps; methods of this kind can also be extended to gCGM for generalized atoms.","element":"span"}],[{"text":"A key challenge in penalized sparsity problems is that when the dual is constrained, the corresponding dual variable may not be feasible, and thus the computed gap is +","element":"span"},{"style":{"height":7.2},"width":40,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/1-2.png","element":"img","alt":"∞","inline":true},{"text":". In this context, gap-safe screening methods offer a number of solutions, such as scaling or projecting to acquire a dual feasible candidate. We do not attempt to remedy this problem; in fact, in gCGM, the typical LASSO penalty presents a fundamental implementation issue, in that if ","element":"span"},{"style":{"height":16},"width":209.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/1-3.png","element":"img","alt":" h(x) = ∥x∥1","inline":true},{"text":", then the problem ","element":"span"},{"href":"#id-35","text":"(2) ","element":"a"},{"text":"can easily be unbounded. 
By requiring curvature of ","element":"span"},{"style":{"height":16},"width":57.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/1-4.png","element":"img","alt":" φ(ξ","inline":true},{"text":") for large enough ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/1-5.png","element":"img","alt":" ξ","inline":true},{"text":", we ensure that the dual problem is unconstrained, and the natural dual candidate ","element":"span"},{"style":{"height":18.18},"width":275.22,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/1-6.png","element":"img","alt":" z(t) = −∇f(x(t)","inline":true},{"text":") does not need to be adjusted to ensure bounded subproblems ","element":"span"},{"href":"#id-35","text":"(2)","element":"a"},{"text":". This ensures that gCGM is well-defined and converges to the solution; additionally, it allows easier gap calculations. These curvature conditions will be elaborated on in later sections.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Conditional gradient methods. 
","element":"span"},{"text":"The vCGM, also called the Frank-Wolfe method ","element":"span"},{"href":"#id-36","referenceIndex":20,"text":"Frank and Wolfe ","element":"a"},{"href":"#id-36","referenceIndex":20,"text":"(1956)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-37","referenceIndex":18,"text":"Dunn and Harshbarger ","element":"a"},{"href":"#id-37","referenceIndex":18,"text":"(1978)","element":"a"},{"text":", considers minimizing ","element":"span"},{"href":"#id-11","text":"(1) ","element":"a"},{"text":"as a constrained optimization problem (where ","element":"span"},{"style":{"height":16},"width":234.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/1-7.png","element":"img","alt":"h(x) = ιCP(x","inline":true},{"text":") for some scaling ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C > ","element":"span"},{"text":"0). The method is particularly useful when computing the supporting hyperplane in ","element":"span"},{"href":"#id-35","text":"(2) ","element":"a"},{"text":"is computationally simple (e.g., when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"is the unit ball of the ","element":"span"},{"style":{"height":7.6},"width":32.61,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/1-8.png","element":"img","alt":" ℓ1","inline":true},{"text":"-norm or the nuclear norm). 
Thus, CGM is widely considered in the context of generalized sparse optimization ","element":"span"},{"href":"#id-38","referenceIndex":27,"text":"Hazan ","element":"a"},{"href":"#id-38","referenceIndex":27,"text":"(2008)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-39","referenceIndex":15,"text":"Clarkson ","element":"a"},{"href":"#id-39","referenceIndex":15,"text":"(2010)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-0","referenceIndex":29,"text":"Jaggi ","element":"a"},{"href":"#id-0","referenceIndex":29,"text":"(2013)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-40","referenceIndex":50,"text":"Tewari et al. ","element":"a"},{"href":"#id-40","referenceIndex":50,"text":"(2011)","element":"a"},{"text":", with many variations such as backward steps ","element":"span"},{"href":"#id-41","referenceIndex":32,"text":"Lacoste-Julien and ","element":"a"},{"href":"#id-41","referenceIndex":32,"text":"Jaggi ","element":"a"},{"href":"#id-41","referenceIndex":32,"text":"(2015)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-42","referenceIndex":46,"text":"Rao et al. ","element":"a"},{"href":"#id-42","referenceIndex":46,"text":"(2015) ","element":"a"},{"text":"and fully-corrective steps ","element":"span"},{"href":"#id-43","referenceIndex":53,"text":"Von Hohenbalken ","element":"a"},{"href":"#id-43","referenceIndex":53,"text":"(1977)","element":"a"},{"text":", and connections to other methods like mirror descent ","element":"span"},{"href":"#id-44","referenceIndex":1,"text":"Bach ","element":"a"},{"href":"#id-44","referenceIndex":1,"text":"(2015)","element":"a"},{"text":", cutting plane method ","element":"span"},{"href":"#id-45","referenceIndex":60,"text":"Zhou et al. 
","element":"a"},{"href":"#id-45","referenceIndex":60,"text":"(2018)","element":"a"},{"text":", and greedy coordinate-wise methods ","element":"span"},{"href":"#id-39","referenceIndex":15,"text":"Clarkson ","element":"a"},{"href":"#id-39","referenceIndex":15,"text":"(2010)","element":"a"},{"text":".","element":"span"}],[{"text":"In comparison, gCGM (where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") may be unconstrained) has been much less studied, and has appeared under different names, like regularized coordinate minimization ","element":"span"},{"href":"#id-46","referenceIndex":17,"text":"Dudik et al. ","element":"a"},{"href":"#id-46","referenceIndex":17,"text":"(2012)","element":"a"},{"text":". An ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/t","element":"span"},{"text":") convergence rate has been shown for specific smooth functions ","element":"span"},{"href":"#id-47","referenceIndex":39,"text":"Mu et al. ","element":"a"},{"href":"#id-47","referenceIndex":39,"text":"(2016)","element":"a"},{"text":", with bounded assumptions on iterates ","element":"span"},{"href":"#id-44","referenceIndex":1,"text":"Bach ","element":"a"},{"href":"#id-44","referenceIndex":1,"text":"(2015)","element":"a"},{"text":", or with improvement steps to ensure boundedness of sublevel sets ","element":"span"},{"href":"#id-6","referenceIndex":57,"text":"Yu et al. ","element":"a"},{"href":"#id-6","referenceIndex":57,"text":"(2017)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-48","referenceIndex":25,"text":"Harchaoui ","element":"a"},{"href":"#id-48","referenceIndex":25,"text":"et al. ","element":"a"},{"href":"#id-48","referenceIndex":25,"text":"(2015)","element":"a"},{"text":". 
When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is quadratic and for a special form of ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/1-9.png","element":"img","alt":" φ","inline":true},{"text":", the gCGM can be shown to be equivalent to a form of the iterative shrinkage method, and under proper problem conditioning, has linear convergence ","element":"span"},{"href":"#id-49","referenceIndex":9,"text":"Bredies et al. ","element":"a"},{"href":"#id-49","referenceIndex":9,"text":"(2009)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-50","referenceIndex":8,"text":"Bredies and Lorenz ","element":"a"},{"href":"#id-50","referenceIndex":8,"text":"(2008)","element":"a"},{"text":". We also give an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/t","element":"span"},{"text":") convergence rate on objective function values and minimum gap convergence, but relinquish any assumption on boundedness of iterates.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Contributions","element":"span"}],[{"text":"We analyze the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"convergence ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"support recovery ","element":"span"},{"text":"properties of the gCGM for ","element":"span"},{"href":"#id-11","text":"(1)","element":"a"},{"text":", where ","element":"span"},{"style":{"height":16},"width":259.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-0.png","element":"img","alt":" h(x) = φ(κP(x","inline":true},{"text":")) involves only modifications of a gauge function 
","element":"span"},{"style":{"height":16},"width":88.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-1.png","element":"img","alt":" κP(x","inline":true},{"text":"). We assume that the loss function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smooth, the function ","element":"span"},{"style":{"height":16},"width":57.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-2.png","element":"img","alt":" φ(ξ","inline":true},{"text":") grows at least quadratically when ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-3.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"is large, and the set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"is convex and compact. Our contribution is threefold.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Without boundedness assumptions on iterates","element":"span"},{"text":", the function value error and minimum duality gap of gCGM converge as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/t","element":"span"},{"text":").","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"We provide a safe dual screening rule for any intermediate variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". 
This rule is algorithm-agnostic, and generalizes SAFE screening rules for LASSO to any gauge function and any case where ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-4.png","element":"img","alt":" φ","inline":true,"padRight":true},{"text":"is monotonically nondecreasing, in particular to cases where the dual is unconstrained and thus always feasible.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"Finally, by bounding the gradient error with the gap, we give a mechanism for deriving manifold identification rates for any version of gCGM where minimum gap rates are known.","element":"span"}],[{"text":"Additionally, our proof technique takes a convex analysis viewpoint, in that we measure all distances and errors in terms of gauges and support functions of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"(sometimes symmetrized). This is done for two reasons: first, to ensure that all analysis is linearly invariant (in a similar spirit to ","element":"span"},{"href":"#id-41","referenceIndex":32,"text":"Lacoste-Julien and Jaggi ","element":"a"},{"href":"#id-41","referenceIndex":32,"text":"(2015)","element":"a"},{"text":"); and second, for increased interpretability, as connections can be drawn to the much more intuitive (but restrictive) case of ","element":"span"},{"style":{"height":16},"width":185.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-5.png","element":"img","alt":" κP = ∥ · ∥1","inline":true,"padRight":true},{"text":"in sparse optimization (which is more commonly considered in the screening literature). 
All proofs are given in the appendix.","element":"span"}]]},{"heading":"2 Preliminaries","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Generalized sparse optimization","element":"span"}],[{"text":"Define a finite set of points ","element":"span"},{"style":{"height":17.38},"width":393.76,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-6.png","element":"img","alt":" P0 = {p1, ..., pm} ⊂ Rd","inline":true},{"text":", and its convex hull as ","element":"span"},{"style":{"height":15.6},"width":597.44,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-7.png","element":"img","alt":" P = conv(P0); since m is finite, P","inline":true,"padRight":true},{"text":"is a convex and compact set. We consider problems of the form","element":"span"}],[{"id":"id-52","style":{"width":"62%"},"width":1164,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.79},"width":227.96,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-9.png","element":"img","alt":" φ : R+ → R+","inline":true,"padRight":true},{"text":"is a monotonically nondecreasing function. 
The function","element":"span"}],[{"style":{"width":"66%"},"width":1244,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-10.png","element":"img"}],[{"text":"is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"gauge function ","element":"span"},{"text":"of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"; in particular, it measures the “size” of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"by how much the set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"must be expanded (or can be contracted) to include ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", and generalizes norms to any positively homogeneous and subadditive function ","element":"span"},{"href":"#id-51","referenceIndex":21,"text":"Freund ","element":"a"},{"href":"#id-51","referenceIndex":21,"text":"(1987)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":13,"text":"Chandrasekaran et al. ","element":"a"},{"href":"#id-1","referenceIndex":13,"text":"(2012)","element":"a"},{"text":". 
We define the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"support of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with respect to ","element":"span"},{"style":{"height":13.19},"width":43.72,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-11.png","element":"img","alt":"P0","inline":true,"padRight":true},{"text":"(denoted ","element":"span"},{"style":{"height":16.7},"width":160.16,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-12.png","element":"img","alt":" suppP(x","inline":true},{"text":")) as the set of ","element":"span"},{"style":{"height":10},"width":31.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-13.png","element":"img","alt":" pi","inline":true},{"text":"’s in ","element":"span"},{"href":"#id-52","text":"(5) ","element":"a"},{"text":"for which ","element":"span"},{"style":{"height":11.19},"width":73.1,"height":27.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-14.png","element":"img","alt":" ci >","inline":true,"padRight":true},{"text":"0. Such a set may not be uniquely defined, but we consider ","element":"span"},{"style":{"fontStyle":"italic"},"text":"support recovery ","element":"span"},{"text":"achieved if one such set is revealed.","element":"span"}],[{"text":"Gauge functions can be seen as generalized versions of the ","element":"span"},{"style":{"height":7.6},"width":32.6,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-15.png","element":"img","alt":" ℓ1","inline":true},{"text":"-norm, which is a convex promoter of nonzero vector sparsity. 
In particular, if ","element":"span"},{"style":{"height":17.9},"width":265.2,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-16.png","element":"img","alt":" P0 = {±ek}dk=1 ","inline":true,"padRight":true},{"text":"is the signed standard basis, then we exactly recover ","element":"span"},{"style":{"height":16},"width":234.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-17.png","element":"img","alt":"κP(x) = ∥x∥1","inline":true},{"text":". More generally, if ","element":"span"},{"style":{"height":13.19},"width":230.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-18.png","element":"img","alt":" P0 contains d","inline":true,"padRight":true},{"text":"vectors spanning ","element":"span"},{"style":{"height":13.38},"width":45.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-19.png","element":"img","alt":" Rd","inline":true},{"text":", then with the matrix ","element":"span"},{"style":{"height":15.6},"width":270.2,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-20.png","element":"img","alt":" P = (p1, ..., pd),","inline":true},{"style":{"height":17.39},"width":311.66,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-21.png","element":"img","alt":"κP(x) = ∥P −1x∥1","inline":true},{"text":", which promotes vectors ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Pc ","element":"span"},{"text":"whose pre-image ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"is sparse. 
But gauges also encompass more general scenarios, such as seminorms (e.g., total variation norm), non-polyhedral norms (e.g., nuclear norm), and conic constraints; they can also be manipulated to include ordering, such as with the OWL norm ","element":"span"},{"href":"#id-9","referenceIndex":58,"text":"(Zeng and Figueiredo, ","element":"a"},{"href":"#id-9","referenceIndex":58,"text":"2014)","element":"a"},{"text":", and discover groupings with the OSCAR norm ","element":"span"},{"href":"#id-10","referenceIndex":5,"text":"(Bondell and Reich, ","element":"a"},{"href":"#id-10","referenceIndex":5,"text":"2008)","element":"a"},{"text":".","element":"span"}],[{"text":"A “dual gauge” can be constructed as the support function","element":"span"}],[{"id":"id-53","style":{"width":"57%"},"width":1085,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/2-22.png","element":"img"}],[{"text":"In particular, if ","element":"span"},{"style":{"height":10.39},"width":46.96,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-0.png","element":"img","alt":" κP","inline":true,"padRight":true},{"text":"is a norm, then ","element":"span"},{"style":{"height":9.99},"width":46.77,"height":24.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-1.png","element":"img","alt":" σP","inline":true,"padRight":true},{"text":"is the usual dual norm. 
Finding an optimal variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-53","text":"(6) ","element":"a"},{"text":"is key in computing ","element":"span"},{"href":"#id-35","text":"(2)","element":"a"},{"text":", and properties of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"can be used to reveal the support of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":".","element":"span"}],[{"id":"id-77","style":{"fontWeight":"bold"},"text":"Property 1 ","element":"span"},{"text":"(Support optimality condition)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":10},"width":31.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-2.png","element":"img","alt":" pi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is in the support of ","element":"span"},{"style":{"height":10.98},"width":38.78,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-3.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"a minimizer of ","element":"span"},{"href":"#id-52","text":"(4)","element":"a"},{"style":{"fontStyle":"italic"},"text":", then","element":"span"}],[{"style":{"width":"27%"},"width":518,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-4.png","element":"img"}],[{"style":{"height":14},"width":375.26,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-5.png","element":"img","alt":"Example: ℓ1 norm.","inline":true,"padRight":true},{"text":"Consider the 
problem","element":"span"}],[{"style":{"width":"24%"},"width":454,"height":145,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-6.png","element":"img"}],[{"text":"In this case, ","element":"span"},{"style":{"height":16},"width":198.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-7.png","element":"img","alt":" σP = ∥ · ∥∞","inline":true,"padRight":true},{"text":"is the dual norm of ","element":"span"},{"style":{"height":16},"width":182.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-8.png","element":"img","alt":" κP = ∥ · ∥1","inline":true},{"text":". Then, by setting the optimality condition 0 ","element":"span"},{"style":{"height":15.6},"width":152.73,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-9.png","element":"img","alt":" ∈ ∂g(x∗)","inline":true}],[{"style":{"width":"69%"},"width":1311,"height":189,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-10.png","element":"img"}],[{"text":"In words, the gradient of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"along a coordinate for which the optimal variable is nonsmooth with respect to ","element":"span"},{"style":{"height":10.39},"width":46.96,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-11.png","element":"img","alt":"κP","inline":true,"padRight":true},{"text":"is allowed “wiggle room”; in contrast, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") is smooth in the direction of ","element":"span"},{"style":{"height":9.19},"width":33.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-12.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"then the gradient is fixed. 
In terms of support recovery, max","element":"span"},{"style":{"height":16.15},"width":230.11,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-13.png","element":"img","alt":"i |z∗i | = ∥x∗∥1","inline":true,"padRight":true},{"text":"and, additionally, if ","element":"span"},{"style":{"height":16.15},"width":210.82,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-14.png","element":"img","alt":" |z∗i | < ∥x∗∥1","inline":true,"padRight":true},{"text":"then it must be that ","element":"span"},{"style":{"height":15.14},"width":124.13,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-15.png","element":"img","alt":" x∗i = 0.","inline":true},{"text":"More generally, and viewed geometrically, the condition ","element":"span"},{"style":{"height":17.53},"width":247.03,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-16.png","element":"img","alt":" pTi z∗ = σP(z∗","inline":true},{"text":") says that at the optimum, the gradient in the ","element":"span"},{"text":"direction of ","element":"span"},{"style":{"height":10},"width":31.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-17.png","element":"img","alt":" pi","inline":true,"padRight":true},{"text":"is as steep as allowable; ","element":"span"},{"style":{"height":10.98},"width":38.78,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-18.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"text":"wants to keep going in this direction, but is blocked because of a constraint or nonsmooth penalty. 
For gauges, this non-smoothness only happens when the contribution of ","element":"span"},{"style":{"height":10},"width":31.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-19.png","element":"img","alt":" pi","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is 0, thus translating to a support recovery property.","element":"span"}],[{"text":"The proof follows from convex analysis principles describing the dual behaviors of ","element":"span"},{"style":{"height":15.6},"width":453.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-20.png","element":"img","alt":" κP(x∗) and σP(−∇f(x∗)).","inline":true,"padRight":true},{"text":"The property itself serves as the main principle behind dual screening methods; by identifying ","element":"span"},{"style":{"height":10},"width":31.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-21.png","element":"img","alt":" pi","inline":true},{"text":"’s that are sufficiently far from the maximum value, we can guess that such ","element":"span"},{"style":{"height":10},"width":31.04,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-22.png","element":"img","alt":" pi","inline":true},{"text":"’s do not appear in the support of ","element":"span"},{"style":{"height":10.99},"width":38.78,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-23.png","element":"img","alt":" x∗","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Noncompact ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"text":"In practice, recession directions in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"may be desirable to allow for unpenalized directions. For example, in the total variation norm, which promotes smoothness, ","element":"span"},{"style":{"height":16},"width":350.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-24.png","element":"img","alt":" κP(x) = 0 if x = β1","inline":true},{"text":". In this case, a finite ","element":"span"},{"style":{"height":16},"width":84.45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-25.png","element":"img","alt":" σP(z","inline":true},{"text":") constrains ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"to be in the nullspace of all such recession directions. In gCGM, such gauges are problematic because the solution to the generalized subproblem ","element":"span"},{"href":"#id-35","text":"(2) ","element":"a"},{"text":"is unbounded if ","element":"span"},{"style":{"height":17.39},"width":433.01,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-26.png","element":"img","alt":" ∇f(x)T c ̸= 0 for any c in","inline":true,"padRight":true},{"text":"a recession direction. Therefore we assume ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"to be compact.","element":"span"}],[{"text":"0 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"on the boundary of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"text":"It may be desirable to have ","element":"span"},{"style":{"height":10.39},"width":46.96,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-27.png","element":"img","alt":" κP","inline":true,"padRight":true},{"text":"partially enforce conic constraints as well, such as in semidefinite optimization where ","element":"span"},{"style":{"height":16},"width":264.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-28.png","element":"img","alt":" κP(X) = tr(X","inline":true},{"text":") + ","element":"span"},{"style":{"height":16.79},"width":117.16,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-29.png","element":"img","alt":" ι·⪰0(X","inline":true},{"text":") promotes low-rank positive semidefinite matrices. In this case, since no negative definite elements are in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":", 0 must be on the boundary of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":". In the dual, this corresponds to a recession direction, as any negative definite matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"necessarily has ","element":"span"},{"style":{"height":16},"width":194.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-30.png","element":"img","alt":" σP(Z) = 0.","inline":true,"padRight":true},{"text":"This scenario affects neither the effectiveness nor the analysis of gCGM; in particular, if ","element":"span"},{"href":"#id-35","text":"(2) ","element":"a"},{"text":"ever returns ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"= 0, then optimality is achieved.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Infinite atomic sets. 
","element":"span"},{"text":"We assume that ","element":"span"},{"style":{"height":13.19},"width":43.72,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-31.png","element":"img","alt":" P0","inline":true,"padRight":true},{"text":"is a finite set. In low-rank matrix completion, for which CGMs are frequently used, the nuclear norm acts as the gauge function over the set of rank-1 norm-1 matrices, which is a compact but uncountably infinite set. In fact, the gCGM is still well-defined in this case, and all of the results in this paper are consistent. However, since there are no isolated points in ","element":"span"},{"style":{"height":13.19},"width":43.72,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-32.png","element":"img","alt":" P0","inline":true},{"text":", it is impossible to guarantee finite-time exact support recovery (and in fact ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/3-33.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"defined below is always 0). 
Thus, although safe screening rules do apply in this case, without modification they may not provide practical advantages.","element":"span"}],[{"text":"Gauges and support functions for convex sets are fundamental objects in convex analysis, and are discussed further by ","element":"span"},{"href":"#id-54","referenceIndex":47,"text":"Rockafellar ","element":"a"},{"href":"#id-54","referenceIndex":47,"text":"(1970)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-55","referenceIndex":7,"text":"Borwein and Lewis ","element":"a"},{"href":"#id-55","referenceIndex":7,"text":"(2010)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-51","referenceIndex":21,"text":"Freund ","element":"a"},{"href":"#id-51","referenceIndex":21,"text":"(1987)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-56","referenceIndex":23,"text":"Friedlander et al. ","element":"a"},{"href":"#id-56","referenceIndex":23,"text":"(2014)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Duality","element":"span"}],[{"text":"The Fenchel dual of ","element":"span"},{"href":"#id-52","text":"(4) ","element":"a"},{"text":"can be computed as","element":"span"}],[{"style":{"width":"100%"},"width":1876,"height":336,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-0.png","element":"img"}],[{"text":"is nonnegative and 0 only at optimality. 
Since at optimality ","element":"span"},{"style":{"height":16},"width":250.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-1.png","element":"img","alt":" z∗ = −∇f(x∗","inline":true},{"text":"), a reasonable measure of suboptimality for a nonoptimal ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is ","element":"span"},{"style":{"height":16},"width":253.13,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-2.png","element":"img","alt":" gap(x, −∇f(x","inline":true},{"text":")). In particular,","element":"span"}],[{"style":{"width":"56%"},"width":1062,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-3.png","element":"img"}],[{"text":"can be used as a computable residual measure for both convergence tracking and screening rules; here,","element":"span"}],[{"style":{"width":"99%"},"width":1871,"height":258,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"2.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Generalized CGM (gCGM)","element":"span"}],[{"text":"There are many ways of solving problems of the form ","element":"span"},{"href":"#id-52","text":"(4)","element":"a"},{"text":", and our dual screening results and manifold identification results are in fact method-agnostic. Here, we investigate the gCGM, whose per-iteration cost is almost as cheap as that of the vCGM. 
In particular, if we decompose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"in terms of its gauge value ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-5.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"and normalized direction ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":", then their minimizations can be done independently. Explicitly, step ","element":"span"},{"href":"#id-35","text":"(2) ","element":"a"},{"text":"can be summarized in two steps, with ","element":"span"},{"style":{"height":14},"width":141.45,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-6.png","element":"img","alt":" s = ξ · ˆs","inline":true},{"text":", and","element":"span"}],[{"id":"id-85","style":{"width":"68%"},"width":1291,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-7.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.6},"width":234.96,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-8.png","element":"img","alt":" LMOP(z) :=","inline":true,"padRight":true},{"text":"argmax","element":"span"}],[{"text":"LMO returns a finite ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"; however, the minimization for ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-9.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"is more complicated. 
As a simple example, consider gCGM applied to the one-dimensional problem","element":"span"}],[{"style":{"width":"26%"},"width":500,"height":145,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-10.png","element":"img"}],[{"text":"At the very first step, ","element":"span"},{"style":{"height":16},"width":192.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-11.png","element":"img","alt":" f ′(0) = −c","inline":true},{"text":", and if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"style":{"fontStyle":"italic"},"text":"| ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"1 then ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-12.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"is unbounded. Therefore, further conditions on ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-13.png","element":"img","alt":" φ","inline":true,"padRight":true},{"text":"must be imposed.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Generalized penalty","element":"span"}],[{"text":"The function ","element":"span"},{"style":{"height":14.79},"width":227.74,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-14.png","element":"img","alt":" φ : R+ → R+","inline":true,"padRight":true},{"text":"facilitates the transition of ","element":"span"},{"href":"#id-52","text":"(4) ","element":"a"},{"text":"from penalized to constrained optimization. 
When ","element":"span"},{"style":{"height":16},"width":147.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-15.png","element":"img","alt":"φ(ξ) = ξ","inline":true},{"text":", then ","element":"span"},{"href":"#id-52","text":"(4) ","element":"a"},{"text":"is a typical sparse regularized problem; at the other extreme, ","element":"span"},{"style":{"height":16.79},"width":208.23,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-16.png","element":"img","alt":" φ(ξ) = ιξ≤C","inline":true,"padRight":true},{"text":"an indicator function can constrain ","element":"span"},{"style":{"height":11.6},"width":132.81,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/4-17.png","element":"img","alt":" x ∈ CP","inline":true},{"text":", reducing everything to the vCGM case (vanilla CGM).","element":"span"}],[{"id":"id-57","style":{"fontWeight":"bold"},"text":"Assumption 1 ","element":"span"},{"text":"(Well-defined gCGM)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The function ","element":"span"},{"style":{"height":14.79},"width":227.96,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-0.png","element":"img","alt":" φ : R+ → R+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is monotonically nondecreasing over all ","element":"span"},{"style":{"height":14},"width":61.33,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-1.png","element":"img","alt":"ξ ≥","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Moreover, the set of subdifferentials of ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-2.png","element":"img","alt":" φ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is not upper bounded:","element":"span"}],[{"style":{"width":"64%"},"width":1217,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-3.png","element":"img"}],[{"id":"id-59","style":{"fontWeight":"bold"},"text":"Assumption 2 ","element":"span"},{"text":"(Convergence of gCGM)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The function ","element":"span"},{"style":{"height":14.79},"width":254.02,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-4.png","element":"img","alt":" φ : R+ → R+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is lower bounded by a quadratic function","element":"span"}],[{"style":{"width":"58%"},"width":1087,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":13.59},"width":87.24,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-6.png","element":"img","alt":" µφ >","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14},"width":39.74,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-7.png","element":"img","alt":" φ0","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"id":"id-79","style":{"fontWeight":"bold"},"text":"Property 2 ","element":"span"},{"text":"(Well-defined and converging gCGM)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assumption ","element":"span"},{"href":"#id-57","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"ensures that the conjugate function","element":"span"}],[{"style":{"width":"60%"},"width":1132,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"is finite-valued and attained for all ","element":"span"},{"style":{"height":12.8},"width":64.29,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-9.png","element":"img","alt":" ν ≥","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". Moreover, there always exists a finite maximizer ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-10.png","element":"img","alt":" ξ","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"width":"100%"},"width":1874,"height":394,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-11.png","element":"img"}],[{"text":"In particular, in the case that ","element":"span"},{"style":{"height":14.4},"width":371.45,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-12.png","element":"img","alt":" α = 1, then β → +∞","inline":true},{"text":", and the function","element":"span"}],[{"style":{"width":"34%"},"width":650,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-13.png","element":"img"}],[{"text":"As shown earlier, when ","element":"span"},{"style":{"height":11.6},"width":449.09,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-14.png","element":"img","alt":" α = 1 then whenever ν >","inline":true,"padRight":true},{"text":"1 then 
","element":"span"},{"style":{"height":16},"width":221.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-15.png","element":"img","alt":" φ∗(ν) = +∞","inline":true},{"text":"; we exclude this case, as gCGM will not converge. When ","element":"span"},{"style":{"height":14},"width":146.83,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-16.png","element":"img","alt":" α ≥ 2, φ","inline":true,"padRight":true},{"text":"is strongly convex and we can show ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/t","element":"span"},{"text":") convergence of gCGM. When 1 ","element":"span"},{"style":{"height":9.6},"width":109.79,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-17.png","element":"img","alt":" < α <","inline":true,"padRight":true},{"text":"2, ","element":"span"},{"style":{"height":16},"width":78.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-18.png","element":"img","alt":" φ∗(ν","inline":true},{"text":") is finite and the iterates are well-defined, but the method may converge or diverge.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Example: Barrier functions. 
","element":"span"},{"text":"Consider","element":"span"}],[{"id":"id-75","style":{"width":"67%"},"width":1271,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-19.png","element":"img"}],[{"text":"which is a log-barrier penalization function for ","element":"span"},{"style":{"height":14},"width":103.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-20.png","element":"img","alt":" ξ ≤ C","inline":true},{"text":"; as ","element":"span"},{"style":{"height":16},"width":239.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-21.png","element":"img","alt":" β → +∞, φ(ξ","inline":true},{"text":") approaches the indicator function for this constraint. Its conjugate is","element":"span"}],[{"style":{"width":"29%"},"width":561,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-22.png","element":"img"}],[{"text":"achieved at ","element":"span"},{"style":{"height":17.38},"width":285.79,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-23.png","element":"img","alt":" ξ = C2βν/(Cβν","inline":true,"padRight":true},{"text":"+ 1). 
For all ","element":"span"},{"style":{"height":14.4},"width":195.19,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-24.png","element":"img","alt":" C > 0, β >","inline":true,"padRight":true},{"text":"0, and ","element":"span"},{"style":{"height":17.38},"width":239.44,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-25.png","element":"img","alt":" ν ̸= −(Cβ)−1","inline":true},{"text":", both ","element":"span"},{"style":{"height":14.18},"width":39.74,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-26.png","element":"img","alt":" φ∗","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":35.26,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-27.png","element":"img","alt":" ξ∗","inline":true,"padRight":true},{"text":"exist and are finite. Note also the implicit constraint, as ","element":"span"},{"style":{"height":16},"width":127.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-28.png","element":"img","alt":" φ(κP(x","inline":true},{"text":")) is finite only if ","element":"span"},{"style":{"height":11.6},"width":132.81,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-29.png","element":"img","alt":" x ∈ CP","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Generalized smoothness","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 1. 
","element":"span"},{"text":"A function ","element":"span"},{"style":{"height":16.58},"width":195.36,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-30.png","element":"img","alt":" f : Rd → R","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smooth with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"if for all ","element":"span"},{"style":{"height":16.58},"width":155.94,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-31.png","element":"img","alt":" x, y ∈ Rd","inline":true},{"text":":","element":"span"}],[{"style":{"width":"71%"},"width":1346,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/5-32.png","element":"img"}],[{"text":"The purpose of this generalized notion is that sometimes, given the data, tighter bounds can be computed (see, e.g., ","element":"span"},{"href":"#id-58","referenceIndex":42,"text":"Nutini et al., ","element":"a"},{"href":"#id-58","referenceIndex":42,"text":"2015)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Example: Quadratic function. 
","element":"span"},{"text":"Suppose that","element":"span"}],[{"style":{"width":"20%"},"width":391,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-0.png","element":"img"}],[{"text":"Then","element":"span"}],[{"style":{"width":"40%"},"width":750,"height":157,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-1.png","element":"img"}],[{"text":"While norm bounds would give ","element":"span"},{"style":{"height":15.79},"width":314.74,"height":39.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-2.png","element":"img","alt":" d2L1 ≥ dL2 ≥ L∞","inline":true},{"text":", the actual values in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"might lead to tighter inequalities.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Example: Linear model. ","element":"span"},{"text":"Suppose that","element":"span"}],[{"style":{"width":"19%"},"width":362,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-3.png","element":"img"}],[{"text":"for some convex, smooth twice-differentiable function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"text":"(e.g., logistic or exponential regression). Then","element":"span"}],[{"style":{"width":"30%"},"width":570,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Equivalence to usual smoothness. 
","element":"span"},{"text":"Suppose that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-5.png","element":"img","alt":" L2","inline":true},{"text":"-smooth in the usual sense (with respect to ","element":"span"},{"style":{"height":16},"width":84.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-6.png","element":"img","alt":"∥ · ∥2","inline":true},{"text":"). Then since ","element":"span"},{"style":{"height":16},"width":342.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-7.png","element":"img","alt":" diam(P)κP ≥ ∥x∥2","inline":true},{"text":", it follows that ","element":"span"},{"style":{"height":16},"width":284.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-8.png","element":"img","alt":" L ≤ diam(P)L2","inline":true},{"text":". In this way, we refine the analysis of gCGM by absorbing the usual “set size” term into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":", which in certain cases may be much smaller than ","element":"span"},{"style":{"height":16},"width":203.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-9.png","element":"img","alt":"diam(P)L2","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.6 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Invariance","element":"span"}],[{"text":"One appealing feature of the vCGM is that the iteration scheme and analysis can be done in a way that is invariant to both linear scaling and translation. 
Specifically, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ax ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":"), then the two problems","element":"span"}],[{"style":{"width":"31%"},"width":589,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-10.png","element":"img"}],[{"text":"are equivalent. However, when the gauge function is not used as an indicator, this translation invariance vanishes; in general, ","element":"span"},{"style":{"height":17.28},"width":379.31,"height":43.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-11.png","element":"img","alt":" κP(x) ≠ κP+{b}(x + b","inline":true},{"text":"). Therefore, the generalized problem formulation ","element":"span"},{"href":"#id-52","text":"(4) ","element":"a"},{"text":"is invariant only under linear maps, not translations, and our analysis preserves only this linear invariance.","element":"span"}],[{"id":"id-81","style":{"fontWeight":"bold"},"text":"Property 3 ","element":"span"},{"text":"(Invariance)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider two equivalent problems where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ax","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"style":{"fontStyle":"italic"},"text":":","element":"span"}],[{"style":{"width":"32%"},"width":618,"height":151,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ax","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"style":{"fontStyle":"italic"},"text":"optimizes (P1) ","element":"span"},{"style":{"height":8.8},"width":127.39,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-13.png","element":"img","alt":" ⇐⇒ w","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"optimizes (P2),","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"κ","element":"span"},{"style":{"height":8.4},"width":24,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-14.png","element":"img","alt":"P","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"style":{"height":16},"width":117.59,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-15.png","element":"img","alt":") = κQ","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"w","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":16},"width":499.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-16.png","element":"img","alt":" σP(−∇f(x)) = σQ(−∇g(w),","inline":true}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":16},"width":721.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-17.png","element":"img","alt":" LMOQ(−∇g(w)) = A LMOP(−∇f(x)),","inline":true}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"style":{"fontStyle":"italic"},"text":"if and only if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":16},"width":720.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/6-18.png","element":"img","alt":" and gap(x, −∇f(x)) = gap(w, −∇g(w)).","inline":true}]]},{"heading":"3 Main results","paragraphs":[[{"text":"In this section we give the main theoretical contributions: convergence rate, dual screening rule, and support identification complexity. These results all derive from some simple observations:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"The minimum duality gap at ","element":"span"},{"style":{"height":14.18},"width":59.27,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-0.png","element":"img","alt":" x(t)","inline":true,"padRight":true},{"text":"converges to 0 as ","element":"span"},{"style":{"height":14.58},"width":162.48,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-1.png","element":"img","alt":" x(t) → x∗","inline":true,"padRight":true},{"text":"an optimal primal variable.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"The gradient error can be upper bounded by the gap, and support recovery is guaranteed when it is smaller than a problem-dependent constant, which is difficult to compute in practice.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"Without knowing this constant, one can still give partial support guarantees, which is used to construct screening rules.","element":"span"}],[{"text":"We now state these points formally; all proofs are given in the appendix.","element":"span"}],[{"id":"id-67","style":{"fontWeight":"bold"},"text":"Theorem 1 
","element":"span"},{"text":"(Convergence)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that ","element":"span"},{"style":{"height":14.18},"width":59.27,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-2.png","element":"img","alt":" x(t) ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are the iterates of gCGM for which ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"height":14.79},"width":486.82,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-3.png","element":"img","alt":"�P := P ∪ −P, φ : R+ → R+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is monotonically nondecreasing, and satisfies Assumptions ","element":"span"},{"href":"#id-57","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-59","style":{"fontStyle":"italic"},"text":"2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":13.59},"width":87.23,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-4.png","element":"img","alt":"µφ >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". Take ","element":"span"},{"style":{"height":18.18},"width":181.24,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-5.png","element":"img","alt":" θ(t) = 2/(t","inline":true,"padRight":true},{"text":"+ 1)","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then","element":"span"}],[{"style":{"width":"66%"},"width":1239,"height":179,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-6.png","element":"img"}],[{"text":"A key difference between this result and previous works is that we do not assume or enforce bounded iterates.","element":"span"}],[{"text":"The scaled gradient error will serve as our primary “residual quantity” in measuring distance to support recovery:","element":"span"}],[{"style":{"width":"29%"},"width":553,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-7.png","element":"img"}],[{"text":"and the symmetrization ","element":"span"},{"style":{"height":11.6},"width":257.38,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-8.png","element":"img","alt":"�P := P ∪ −P","inline":true,"padRight":true},{"text":"ensures that ","element":"span"},{"style":{"height":18.19},"width":412.88,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-9.png","element":"img","alt":" σ �P(z − z∗) = σ �P(z∗ − z","inline":true},{"text":"), bounding errors in both ","element":"span"},{"text":"directions.","element":"span"}],[{"id":"id-68","style":{"fontWeight":"bold"},"text":"Lemma 1 ","element":"span"},{"text":"(Gap bounds residual)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any primal feasible variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"id":"id-62","style":{"width":"28%"},"width":530,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-10.png","element":"img"}],[{"text":"Figure ","element":"span"},{"href":"#id-60","text":"1 ","element":"a"},{"text":"gives a cartoon intuition as to what a small residual buys us. 
In particular, if ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-11.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"is larger than 2","element":"span"},{"style":{"fontWeight":"bold"},"text":"res","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"), then a maximal element of ","element":"span"},{"style":{"height":17.38},"width":282.32,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-12.png","element":"img","alt":" {−∇f(x∗)T pk}k","inline":true,"padRight":true},{"text":"must also be a maximal element of ","element":"span"},{"style":{"height":17.38},"width":264,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-13.png","element":"img","alt":" {−∇f(x)T pk}k","inline":true},{"text":". Since we can observe a bound on ","element":"span"},{"style":{"fontWeight":"bold"},"text":"res","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"), it is now possible to exclude atoms that are definitively ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"text":"in ","element":"span"},{"style":{"height":16.7},"width":175.62,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-14.png","element":"img","alt":"suppP(x∗","inline":true},{"text":").","element":"span"}],[{"id":"id-63","style":{"fontWeight":"bold"},"text":"Theorem 2 ","element":"span"},{"text":"(Dual screening)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"height":11.6},"width":30,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-15.png","element":"img","alt":"�P","inline":true},{"style":{"fontStyle":"italic"},"text":". Then for any ","element":"span"},{"style":{"height":14},"width":251.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-16.png","element":"img","alt":" x, any p ∈ P0,","inline":true}],[{"style":{"width":"72%"},"width":1357,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"implies that ","element":"span"},{"style":{"height":16.7},"width":244.38,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-18.png","element":"img","alt":" p ̸∈ suppP(x∗","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":10.98},"width":38.78,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-19.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the optimal variable in ","element":"span"},{"href":"#id-52","text":"(4)","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"width":"68%"},"width":1283,"height":172,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-20.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"(Practical considerations)","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"Some things to note about this screening method:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"Computing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"may be challenging, depending on ","element":"span"},{"style":{"height":10.39},"width":46.96,"height":25.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/7-21.png","element":"img","alt":" κP","inline":true},{"text":"; as shown previously, at the very least it may require a full pass over the data. However, this is a one-time calculation per dataset, and can be estimated if data are assumed to be drawn from specific distributions (as in sensing applications).","element":"span"}],[{"style":{"width":"59%"},"width":1113,"height":478,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-0.png","element":"img"}],[{"text":"Figure 1: ","element":"figcaption","subtype":"caption"},{"id":"id-60","style":{"fontWeight":"bold"},"text":"Support recovery. ","element":"figcaption","subtype":"caption"},{"text":"The constant ","element":"figcaption","subtype":"caption"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-1.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"differentiates maximal in-support values from the largest non-support value, as in ","element":"figcaption","subtype":"caption"},{"href":"#id-61","text":"(15)","element":"a","subtype":"caption"},{"text":". 
","element":"figcaption","subtype":"caption"},{"style":{"height":18.18},"width":201.1,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-2.png","element":"img","alt":" ϵ = res(x(t)","inline":true},{"text":") for some current (non-optimal) iterate ","element":"figcaption","subtype":"caption"},{"style":{"height":18.18},"width":472.04,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-3.png","element":"img","alt":" x(t). Denote z∗ = −∇f(x∗)","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":18.18},"width":855.16,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-4.png","element":"img","alt":" z(t) = −∇f(x(t)). Suppose σP(z(t)) = σP(z∗) + ϵ","inline":true,"padRight":true},{"text":"(illustrated as ","element":"figcaption","subtype":"caption"},{"style":{"height":11.59},"width":104.9,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-5.png","element":"img","alt":" s1 + ϵ","inline":true},{"text":"), its largest possible value. Then it is possible that some ","element":"figcaption","subtype":"caption"},{"style":{"height":16.3},"width":244.09,"height":40.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-6.png","element":"img","alt":" p ∈ suppP(x∗","inline":true},{"text":") exists where ","element":"figcaption","subtype":"caption"},{"style":{"height":18.19},"width":384.44,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-7.png","element":"img","alt":" pT z(t) = σP(z(t)) − 2ϵ","inline":true},{"text":"; thus, a safe screening rule can at largest be a threshold at ","element":"figcaption","subtype":"caption"},{"style":{"height":18.19},"width":228.6,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-8.png","element":"img","alt":" σP(z(t)) − 2ϵ","inline":true},{"text":". 
This rule eliminates all false negatives. To ensure no false positives, the largest possible non-optimal non-support value (","element":"figcaption","subtype":"caption"},{"style":{"height":11.59},"width":105.66,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-9.png","element":"img","alt":"s2 + ϵ","inline":true},{"text":") must be smaller than the screened point. This can only happen if ","element":"figcaption","subtype":"caption"},{"style":{"height":12.4},"width":112.28,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-10.png","element":"img","alt":" δ > 4ϵ","inline":true},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"If ","element":"span"},{"style":{"height":13.19},"width":43.72,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-11.png","element":"img","alt":" P0","inline":true,"padRight":true},{"text":"is large (such as in submodular optimization) then checking condition ","element":"span"},{"href":"#id-62","text":"(14) ","element":"a"},{"text":"for each atom at each iteration is also cumbersome. 
However, if the screening is aggressive, then after a few iterations the list of candidate atoms to check shrinks quickly as well.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"Computing the gap, in comparison, is almost automatic in gCGM, given that ","element":"span"},{"style":{"height":18.18},"width":275.84,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-12.png","element":"img","alt":" z(t) = −∇f(x(t)","inline":true,"padRight":true},{"text":"is the (always feasible) dual candidate and ","element":"span"},{"style":{"height":14.18},"width":55.18,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-13.png","element":"img","alt":" s(t) ","inline":true,"padRight":true},{"text":"is already computed. By contrast, for a different dual candidate, the term ","element":"span"},{"style":{"height":15.6},"width":231.2,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-14.png","element":"img","alt":" f(x) + f ∗(−z","inline":true},{"text":") is not easily upper bounded, and depending on the choice of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"may be difficult to compute in practice.","element":"span"}],[{"text":"The “safeness” of the screening rule (Theorem ","element":"span"},{"href":"#id-63","text":"2) ","element":"a"},{"text":"ensures that ","element":"span"},{"style":{"height":18.88},"width":295.02,"height":47.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-15.png","element":"img","alt":" S(t) ⊇ suppP(x∗","inline":true},{"text":"), for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
For support identification, we would like to find a ","element":"span"},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"where for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > ","element":"span"},{"text":"¯","element":"span"},{"style":{"height":18.88},"width":329.28,"height":47.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-16.png","element":"img","alt":"t, S(t) = suppP(x∗","inline":true},{"text":"). Note that with a deterministically decaying sequence for ","element":"span"},{"style":{"height":14.18},"width":56.31,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-17.png","element":"img","alt":" θ(t)","inline":true},{"text":", finite-time support recovery ","element":"span"},{"style":{"fontStyle":"italic"},"text":"without ","element":"span"},{"text":"screening is impossible, since the weight of any atom erroneously selected early on never fully vanishes. Even with screening, it is still not automatically guaranteed that such a finite ","element":"span"},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"exists, since the problem itself may be degenerate ","element":"span"},{"href":"#id-64","referenceIndex":35,"text":"Lewis and Wright ","element":"a"},{"href":"#id-64","referenceIndex":35,"text":"(2011)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-65","referenceIndex":26,"text":"Hare ","element":"a"},{"href":"#id-65","referenceIndex":26,"text":"(2011)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-66","referenceIndex":10,"text":"Burke and Moré ","element":"a"},{"href":"#id-66","referenceIndex":10,"text":"(1988)","element":"a"},{"text":". 
This occurs when ","element":"span"},{"style":{"height":14.8},"width":218.89,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-18.png","element":"img","alt":" δ = 0, where","inline":true}],[{"id":"id-61","style":{"width":"71%"},"width":1331,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-19.png","element":"img"}],[{"text":"is a problem-dependent (algorithm-independent) quantity.","element":"span"}],[{"id":"id-69","style":{"fontWeight":"bold"},"text":"Theorem 3 ","element":"span"},{"text":"(Support identification of screened gCGM)","element":"span"},{"style":{"height":14.8},"width":876.22,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-20.png","element":"img","alt":". Assume f is L-smooth with respect to �P. Then","inline":true}],[{"style":{"width":"99%"},"width":1868,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-21.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"which, under the assumptions of Theorem ","element":"span"},{"href":"#id-67","style":{"fontStyle":"italic"},"text":"1, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"happens at a rate ","element":"span"},{"style":{"height":17.39},"width":205.1,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-22.png","element":"img","alt":" t = O(1/(δ2","inline":true},{"text":"))","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"The proof follows from the gap bound (Lemma ","element":"span"},{"href":"#id-68","text":"1)","element":"a"},{"text":", gap rate (Theorem ","element":"span"},{"href":"#id-67","text":"1)","element":"a"},{"text":", and scrutiny of Figure ","element":"span"},{"href":"#id-60","text":"1; ","element":"a"},{"text":"specifically, when 
","element":"span"},{"style":{"height":16},"width":108.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-23.png","element":"img","alt":" ϵ < δ/","inline":true},{"text":"4, then any rule that screens away elements that are more than 2","element":"span"},{"style":{"height":18.18},"width":337.54,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-24.png","element":"img","alt":"ϵ from σP(x(t)) will","inline":true,"padRight":true},{"text":"screen away ","element":"span"},{"style":{"fontStyle":"italic"},"text":"all ","element":"span"},{"text":"the non-support elements.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"(Generality)","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"Note that Theorems ","element":"span"},{"href":"#id-63","text":"2 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-69","text":"3 ","element":"a"},{"text":"impose no conditions on the sequence ","element":"span"},{"style":{"height":14.19},"width":61.88,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-25.png","element":"img","alt":" θ(k)","inline":true},{"text":", or choice of ","element":"span"},{"style":{"height":14},"width":34.74,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-26.png","element":"img","alt":" φ,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", etc., except ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":". 
In other words, for any method where ","element":"span"},{"style":{"height":15.6},"width":103.08,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-27.png","element":"img","alt":" ϵ(t) ≥","inline":true,"padRight":true},{"text":"min","element":"span"},{"style":{"height":18.98},"width":424.12,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/8-28.png","element":"img","alt":"i≤t gap(x(i), ∇f(x(i))) is","inline":true,"padRight":true},{"text":"known, then a corresponding screening rule and support identification rate automatically follow.","element":"span"}],[{"style":{"width":"57%"},"width":1081,"height":718,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-0.png","element":"img"}],[{"text":"Figure 2: ","element":"figcaption","subtype":"caption"},{"id":"id-70","style":{"height":14.8},"width":643.35,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-1.png","element":"img","alt":" Duality gap for varying p. λ = 0.","inline":true},{"text":"01. For ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p < ","element":"figcaption","subtype":"caption"},{"text":"1","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"5, the method diverged. 
","element":"figcaption","subtype":"caption"},{"style":{"height":12.4},"width":145.18,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-2.png","element":"img","alt":" p = +∞","inline":true,"padRight":true},{"text":"corresponds to vCGM.","element":"figcaption","subtype":"caption"}]]},{"heading":"4 Experiments","paragraphs":[[{"text":"We consider sparse logistic regression","element":"span"}],[{"id":"id-74","style":{"width":"71%"},"width":1331,"height":136,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-3.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.6},"width":65.31,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-4.png","element":"img","alt":" λ >","inline":true,"padRight":true},{"text":"0 controls the weighting of the penalty term and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C > ","element":"span"},{"text":"0 the magnification of ","element":"span"},{"style":{"height":16.58},"width":360.18,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-5.png","element":"img","alt":" P. Here, ai ∈ Rd are","inline":true,"padRight":true},{"text":"data vectors and ","element":"span"},{"style":{"height":16},"width":207.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-6.png","element":"img","alt":" bi ∈ {−1, 1}","inline":true,"padRight":true},{"text":"are binary labels. 
In all cases we run gCGM with ","element":"span"},{"style":{"height":18.18},"width":181.24,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-7.png","element":"img","alt":" θ(t) = 2/(t","inline":true,"padRight":true},{"text":"+ 1).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Synthetic experiments","element":"span"}],[{"text":"First, we generate ","element":"span"},{"style":{"height":16.17},"width":146.86,"height":40.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-8.png","element":"img","alt":" ai ∈ R50","inline":true,"padRight":true},{"text":"as i.i.d. standard Gaussian vectors, and fix ","element":"span"},{"style":{"height":14.8},"width":349.15,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-9.png","element":"img","alt":" bi = 1, for i = 1, ...,","inline":true,"padRight":true},{"text":"100, and analyze the numerical behavior of gCGM when ","element":"span"},{"style":{"height":17.38},"width":1059.76,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-10.png","element":"img","alt":" φ(ξ) = p−1ξp; we fix C = 1 here. The duality gap for different","inline":true,"padRight":true},{"text":"choices of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"is plotted in Figure ","element":"span"},{"href":"#id-70","text":"2, ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":10.8},"width":106.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-11.png","element":"img","alt":" λ = 0.","inline":true},{"text":"01. 
For low values of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"we observe some numerical instability in the early iterates, as for “flatter” penalty functions the new steps ","element":"span"},{"style":{"height":14.18},"width":55.17,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-12.png","element":"img","alt":" s(t) ","inline":true,"padRight":true},{"text":"can be very large. Figure ","element":"span"},{"href":"#id-71","text":"3 ","element":"a"},{"text":"compares different problem residuals with ","element":"span"},{"style":{"height":14},"width":237.75,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-13.png","element":"img","alt":" p = 2, λ = 1.","inline":true},{"text":"0. In particular, we are able to verify our ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/t","element":"span"},{"text":") bound on all residuals, though it is clear that for this example, the gap is converging much more slowly than the gradient error ","element":"span"},{"style":{"height":18.18},"width":217.59,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-14.png","element":"img","alt":" σ(z(t) − z∗)2","inline":true},{"text":", which decays almost twice as fast; this is why our screening rule, though safe, can be pessimistic in practice. Finally, Figure ","element":"span"},{"href":"#id-72","text":"4 ","element":"a"},{"text":"shows the evolution of the support size for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 2 and varying values of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-15.png","element":"img","alt":" λ","inline":true},{"text":". 
In general, a larger value of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-16.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"causes aggressive screening early on, while for smaller values of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-17.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"screening may be much slower, despite arriving at about the same final sparsity level.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"MNIST classification of 4’s vs 9’s","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-73","text":"5 ","element":"a"},{"text":"shows the screening behavior of ","element":"span"},{"href":"#id-74","text":"(17) ","element":"a"},{"text":"on the binary classification problem of disambiguating 4’s and 9’s in the MNIST handwriting dataset. We experiment with three schemes: one-norm squared regularization (","element":"span"},{"style":{"height":19.37},"width":236.98,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-18.png","element":"img","alt":"h(x) = λ2 ∥x∥21","inline":true},{"text":"), one-norm ball constraint (","element":"span"},{"style":{"height":16},"width":234.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/9-19.png","element":"img","alt":"h(x) = ιCP(x","inline":true},{"text":")), and log barrier ","element":"span"},{"href":"#id-75","text":"(12)","element":"a"},{"text":". 
All experiments are halted ","element":"span"},{"text":"at 10,000 iterations for fair comparison.","element":"span"}],[{"style":{"width":"58%"},"width":1097,"height":828,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/10-0.png","element":"img"}],[{"text":"Figure 3: ","element":"figcaption","subtype":"caption"},{"id":"id-71","style":{"height":14},"width":454.46,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/10-1.png","element":"img","alt":" Residuals. p = 2, λ = 1.","inline":true},{"text":"0. The objective error and gap decay at the computed rate of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"O","element":"figcaption","subtype":"caption"},{"text":"(1","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"/t","element":"figcaption","subtype":"caption"},{"text":"). The gradient error decays as ","element":"figcaption","subtype":"caption"},{"style":{"height":17.75},"width":134.06,"height":44.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/10-2.png","element":"img","alt":" O(1/√t","inline":true},{"text":"). 
In fact when the gradient error ","element":"figcaption","subtype":"caption"},{"style":{"height":18.18},"width":183.82,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/10-3.png","element":"img","alt":" σ(z(t) − z∗","inline":true},{"text":") dips below ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":39.21,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/10-4.png","element":"img","alt":" δ/","inline":true},{"text":"4, support error is 0; unfortunately, the gap takes longer to reach this point.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"57%"},"width":1076,"height":815,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/10-5.png","element":"img"}],[{"text":"Figure 4: ","element":"figcaption","subtype":"caption"},{"id":"id-72","style":{"height":18.18},"width":657.3,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/10-6.png","element":"img","alt":" Screening. p = 2, and we plot |S(t)|","inline":true,"padRight":true},{"text":"the number of unscreened variables at each iteration. We observe more aggressive screening for larger ","element":"figcaption","subtype":"caption"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/10-7.png","element":"img","alt":" λ","inline":true},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"There are two major observations. First, the yellow curve (observed sparsity) is often much lower than the red curve (guarantee-able sparsity). 
This is because when ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-0.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"is small or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"is big, the gap converges slowly, and the condition ","element":"span"},{"style":{"height":18.18},"width":550.53,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-1.png","element":"img","alt":" gap(x(t)) < δ/4 requires t > 10,","inline":true,"padRight":true},{"text":"000 (our stopping condition). However, that is the tradeoff required for “safety”.","element":"span"}],[{"text":"Second, the red curve (guarantee-able sparsity) is only small when the blue curve (misclassification rate) is higher, suggesting an inherent performance/sparsity tradeoff. This tradeoff is in fact observed for all three choices of ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-2.png","element":"img","alt":" φ","inline":true,"padRight":true},{"text":"and suggests that in general, the MNIST classification task performs best without extreme sparsity.","element":"span"}]]},{"heading":"5 Conclusion","paragraphs":[[{"text":"We have given a gap-based safe screening rule for a family of sparse optimization problems, for various types of sparse penalties and atoms. We analyze this in the context of the gCGM, and give rates for convergence and support identification for nondegenerate problems. 
In particular, the generalization over atom type and choice of ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-3.png","element":"img","alt":" φ","inline":true,"padRight":true},{"text":"allows for a much richer collection of sparse models, interpolating between the piece-wise linear unconstrained LASSO penalty and the hard norm ball constraint. These penalties differ in their sensitivity toward hyperparameters, and may be more suited to a wider range of applications.","element":"span"}],[{"text":"A key promise of these rules is that, in the spirit of ","element":"span"},{"href":"#id-20","referenceIndex":24,"text":"Ghaoui et al. ","element":"a"},{"href":"#id-20","referenceIndex":24,"text":"(2012)","element":"a"},{"text":", screening is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"safe","element":"span"},{"text":", i.e., no true nonzero will be wrongly called a zero. However, in practice this rule may be pessimistic, first because the gap may serve as an overly pessimistic upper bound of the gradient error, and second because sparsity in the true solution of the optimization problem may be overkill for sparsity of a solution that generalizes well for the machine learning task.","element":"span"}],[{"text":"Still, there are practical advantages. A sparsity guarantee gives storage benefits; a model trained on a large server can be moved to a mobile device, for example, with no need for heuristic thresholding or rounding. 
And, if ","element":"span"},{"style":{"height":13.19},"width":43.72,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-4.png","element":"img","alt":" P0","inline":true,"padRight":true},{"text":"is very large, then screening can greatly improve the runtime of the linear minimization oracle (LMO), used in each step; since the rules are safe, this can be done without disrupting any convergence guarantees.","element":"span"}]]},{"heading":"Appendix A Helpful facts","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Lemma 2 ","element":"span"},{"text":"(Relationship of ","element":"span"},{"style":{"height":10.39},"width":46.96,"height":25.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-5.png","element":"img","alt":" κP","inline":true,"padRight":true},{"text":"to ","element":"span"},{"style":{"height":16},"width":258.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-6.png","element":"img","alt":" ∥ · ∥2). Denote","inline":true}],[{"style":{"width":"63%"},"width":1198,"height":203,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Using another classical definition for gauge functions,","element":"span"}],[{"style":{"width":"59%"},"width":1115,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.19},"width":41.18,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-9.png","element":"img","alt":" Br","inline":true,"padRight":true},{"text":"is the smallest Euclidean ball of radius ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"that includes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"; that is, ","element":"span"},{"style":{"height":16},"width":216.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-10.png","element":"img","alt":" r ≤ diam(P","inline":true},{"text":").","element":"span"}],[{"text":"We denote the subdifferential of a convex function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"as ","element":"span"},{"style":{"height":16},"width":85.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-11.png","element":"img","alt":" ∂f(x","inline":true},{"text":"), and the normal cone of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"as ","element":"span"},{"style":{"height":16.4},"width":98.09,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-12.png","element":"img","alt":"NP(x","inline":true},{"text":"); see ","element":"span"},{"href":"#id-54","referenceIndex":47,"text":"Rockafellar 
","element":"a"},{"href":"#id-54","referenceIndex":47,"text":"(1970)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 3 ","element":"span"},{"text":"(Conjugate of nested function)","element":"span"},{"style":{"height":16},"width":1135.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-13.png","element":"img","alt":". If g(x) = φ(κP(x)) and φ is monotonically nondecreasing, then","inline":true},{"style":{"height":16},"width":329.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/11-14.png","element":"img","alt":"g∗(z) = φ∗(σP(z)).","inline":true}],[{"style":{"width":"56%"},"width":1068,"height":1810,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/12-0.png","element":"img"}],[{"text":"Figure 5: ","element":"figcaption","subtype":"caption"},{"id":"id-73","style":{"fontWeight":"bold"},"text":"MNIST ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"experiment. ","element":"figcaption","subtype":"caption"},{"text":"Solid/dashed blue lines are train/test misclassification rates. Solid/dashed/dotted red lines are number of unscreened features at 10000 / 5000 / 1000 iterations; it is possible that more features would be screened away after more iterations, as the gap converges very slowly for small ","element":"figcaption","subtype":"caption"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/12-1.png","element":"img","alt":" λ","inline":true},{"text":". Green square line plots the number of nonzeros of ","element":"figcaption","subtype":"caption"},{"style":{"height":14.19},"width":126.72,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/12-2.png","element":"img","alt":" x(10000)","inline":true},{"text":", which is observed to be stable. 
(","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Top","element":"figcaption","subtype":"caption"},{"text":") ","element":"figcaption","subtype":"caption"},{"style":{"height":19.37},"width":236.96,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/12-3.png","element":"img","alt":"h(x) = λ2 ∥x∥21","inline":true},{"text":". (","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":403.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/12-4.png","element":"img","alt":"Middle) h(x) = ιCP(x","inline":true},{"text":"). (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Bottom","element":"figcaption","subtype":"caption"},{"text":") Log barrier function ","element":"figcaption","subtype":"caption"},{"href":"#id-75","text":"(12) ","element":"a","subtype":"caption"},{"text":"where ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"C ","element":"figcaption","subtype":"caption"},{"text":"= 10.","element":"figcaption","subtype":"caption"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"From the definitions, we have:","element":"span"}],[{"style":{"width":"78%"},"width":1463,"height":259,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-0.png","element":"img"}],[{"id":"id-78","style":{"fontWeight":"bold"},"text":"Lemma 4 ","element":"span"},{"text":"(Chain rule for subdifferential ","element":"span"},{"href":"#id-76","referenceIndex":3,"text":"(Bauschke and Combettes ","element":"a"},{"href":"#id-76","referenceIndex":3,"text":"(2011)","element":"a"},{"text":", Corollary 16.72.))","element":"span"},{"style":{"height":14},"width":279.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-1.png","element":"img","alt":". 
Let f : H → R","inline":true},{"style":{"height":15.6},"width":1875.53,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-2.png","element":"img","alt":"be continuous and convex, and let φ : R → R be increasing on range(f). Suppose that (ri(range f) + R++) ∩","inline":true}],[{"style":{"width":"71%"},"width":1337,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 5 ","element":"span"},{"text":"(Gap in primal form)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"everywhere differentiable,","element":"span"}],[{"style":{"width":"73%"},"width":1375,"height":204,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"By construction of ","element":"span"},{"style":{"height":17.38},"width":541.02,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-5.png","element":"img","alt":" s, ∇f(x)T s+h(s) = h∗(−∇f(x","inline":true},{"text":")). And, in general, for convex lower semicontinuous ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") + ","element":"span"},{"style":{"height":17.39},"width":380.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-6.png","element":"img","alt":" f ∗(∇f(x)) = xT ∇f(x","inline":true},{"text":"). 
The rest follows from substitution.","element":"span"}]]},{"heading":"Appendix B Proofs from Section 2","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Property ","element":"span"},{"href":"#id-77","style":{"fontWeight":"bold"},"text":"1 ","element":"a"},{"text":"(Support optimality condition)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":10},"width":31.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-7.png","element":"img","alt":" pi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is in the support of ","element":"span"},{"style":{"height":10.98},"width":38.78,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-8.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"a minimizer of","element":"span"}],[{"style":{"width":"23%"},"width":449,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is everywhere differentiable and ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-10.png","element":"img","alt":" φ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfies Assumptions ","element":"span"},{"href":"#id-57","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-59","style":{"fontStyle":"italic"},"text":"2, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"then 
","element":"span"},{"style":{"height":17.39},"width":479.1,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-11.png","element":"img","alt":" −∇f(x∗)T pi = σP(−∇f(x∗","inline":true},{"text":"))","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Without loss of generality, we assume 0 ","element":"span"},{"style":{"height":11.6},"width":73.08,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-12.png","element":"img","alt":" ∈ P","inline":true},{"text":", since ","element":"span"},{"style":{"height":12.88},"width":232.04,"height":32.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-13.png","element":"img","alt":" κP = κP∪{0}","inline":true},{"text":". Denote ","element":"span"},{"style":{"height":16},"width":245.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-14.png","element":"img","alt":" z∗ = −∇f(x∗","inline":true},{"text":"). Now, applying Lemma ","element":"span"},{"href":"#id-78","text":"4, ","element":"a"},{"text":"the optimality condition for ","element":"span"},{"href":"#id-52","text":"(4) ","element":"a"},{"text":"is","element":"span"}],[{"style":{"width":"56%"},"width":1066,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-15.png","element":"img"}],[{"text":"for some ","element":"span"},{"style":{"height":16},"width":625.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-16.png","element":"img","alt":" α ∈ ∂φ(ξ) with ξ = κP(x∗). Since φ","inline":true,"padRight":true},{"text":"is monotonically nondecreasing over ","element":"span"},{"style":{"height":16.18},"width":435.5,"height":40.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-17.png","element":"img","alt":" R+, α ≥ 0. 
If α = 0 then","inline":true,"padRight":true},{"text":"the property is trivially true. Now consider ","element":"span"},{"style":{"height":9.6},"width":67.7,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-18.png","element":"img","alt":" α >","inline":true,"padRight":true},{"text":"0. Noting that ","element":"span"},{"style":{"height":10.39},"width":165.67,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-19.png","element":"img","alt":" κP = σP◦","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":11.78},"width":47,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-20.png","element":"img","alt":" P◦","inline":true,"padRight":true},{"text":"is the polar set of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"48%"},"width":918,"height":137,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-21.png","element":"img"}],[{"text":"Now take the conic decomposition ","element":"span"},{"style":{"height":17.6},"width":258.12,"height":43.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-22.png","element":"img","alt":" x∗ = ∑_{i=1}^m c_i p_i","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":12.8},"width":72.58,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-23.png","element":"img","alt":" ci ≥","inline":true,"padRight":true},{"text":"0, and","element":"span"}],[{"style":{"width":"38%"},"width":715,"height":185,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-24.png","element":"img"}],[{"text":"which holds with equality if and only if ","element":"span"},{"style":{"height":17.53},"width":238.34,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-25.png","element":"img","alt":" pTi z∗ = 
σP(z∗","inline":true},{"text":")) whenever ","element":"span"},{"style":{"height":11.19},"width":72.58,"height":27.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-26.png","element":"img","alt":" ci >","inline":true,"padRight":true},{"text":"0.","element":"span"}],[{"style":{"width":"1%"},"width":28,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/13-27.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Property ","element":"span"},{"href":"#id-79","style":{"fontWeight":"bold"},"text":"2 ","element":"a"},{"text":"(Well-defined and converging gCGM)","element":"span"},{"style":{"fontWeight":"bold"},"text":".","element":"span"}],[{"style":{"width":"96%"},"width":1806,"height":513,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Assumption 1. ","element":"span"},{"text":"Since ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-1.png","element":"img","alt":" φ","inline":true,"padRight":true},{"text":"has nonempty domain, ","element":"span"},{"style":{"height":16},"width":239.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-2.png","element":"img","alt":" φ∗(ν) > −∞","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-3.png","element":"img","alt":" ν","inline":true},{"text":". 
","element":"span"},{"text":"It can be shown that ","element":"span"},{"style":{"height":16},"width":219.41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-4.png","element":"img","alt":"φ∗(ν) < +∞","inline":true,"padRight":true},{"text":"whenever there exists a finite ","element":"span"},{"style":{"height":14},"width":61.34,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-5.png","element":"img","alt":" ξ ≥","inline":true,"padRight":true},{"text":"0 where ","element":"span"},{"style":{"height":16},"width":151.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-6.png","element":"img","alt":" ν ∈ ∂φ(ξ","inline":true},{"text":"), since then","element":"span"}],[{"id":"id-80","style":{"width":"19%"},"width":362,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-7.png","element":"img"}],[{"text":"Now define ","element":"span"},{"style":{"height":16},"width":141.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-8.png","element":"img","alt":" S := [φ′","inline":true},{"text":"(0)","element":"span"},{"style":{"height":12.4},"width":89.23,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-9.png","element":"img","alt":", +∞","inline":true},{"text":"). 
By the assumptions on ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-10.png","element":"img","alt":" φ","inline":true},{"text":", for any ","element":"span"},{"style":{"height":11.6},"width":96.95,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-11.png","element":"img","alt":" ν ∈ S","inline":true},{"text":", there exists some finite ","element":"span"},{"style":{"height":14},"width":61.33,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-12.png","element":"img","alt":" ξ ≥","inline":true,"padRight":true},{"text":"0 where ","element":"span"},{"style":{"height":16},"width":151.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-13.png","element":"img","alt":" ν ∈ ∂φ(ξ","inline":true},{"text":").","element":"span"}],[{"style":{"width":"58%"},"width":1103,"height":210,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-14.png","element":"img"}],[{"text":"Therefore there always exists a finite maximizer ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-15.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"of ","element":"span"},{"href":"#id-80","text":"(19)","element":"a"},{"text":"; since also ","element":"span"},{"style":{"height":16},"width":78.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-16.png","element":"img","alt":" φ∗(ν","inline":true},{"text":") is not ","element":"span"},{"style":{"height":10.8},"width":71,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-17.png","element":"img","alt":" ±∞","inline":true},{"text":", then ","element":"span"},{"href":"#id-80","text":"(19) ","element":"a"},{"text":"is always 
attained.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Assumption 2. ","element":"span"},{"text":"Assume that ","element":"span"},{"style":{"height":14},"width":39.74,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-18.png","element":"img","alt":" φ0","inline":true,"padRight":true},{"text":"is as large as possible; e.g., there exists some finite ","element":"span"},{"style":{"height":14},"width":33.43,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-19.png","element":"img","alt":" ξ0","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":17.39},"width":202.47,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-20.png","element":"img","alt":"φ(ξ0) = µξ20","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":14},"width":39.74,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-21.png","element":"img","alt":" φ0","inline":true},{"text":". Then for all ","element":"span"},{"style":{"height":14},"width":105.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-22.png","element":"img","alt":" ξ ≥ ξ0","inline":true},{"text":", for all ","element":"span"},{"style":{"height":16},"width":151.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-23.png","element":"img","alt":" ν ∈ ∂φ(ξ","inline":true},{"text":"),","element":"span"}],[{"style":{"width":"94%"},"width":1771,"height":519,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-24.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Property ","element":"span"},{"href":"#id-81","style":{"fontWeight":"bold"},"text":"3 ","element":"a"},{"text":"(Invariance)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider two equivalent problems where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ax","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"style":{"fontStyle":"italic"},"text":":","element":"span"}],[{"style":{"width":"32%"},"width":618,"height":154,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/14-25.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ax","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"width":"78%"},"width":1464,"height":1581,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/15-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":16},"width":678.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/15-1.png","element":"img","alt":" LMOQ(−∇g(w)) = A 
LMOP(−∇f(x","inline":true},{"text":"))","element":"span"}],[{"style":{"width":"83%"},"width":1571,"height":333,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/15-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ax ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth and ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/15-3.png","element":"img","alt":" µ","inline":true},{"style":{"fontStyle":"italic"},"text":"-strongly convex with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"style":{"fontStyle":"italic"},"text":"iff ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth and ","element":"span"},{"style":{"height":14},"width":172.64,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/15-4.png","element":"img","alt":" µ-strongly","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"convex with respect to 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"}],[{"style":{"width":"78%"},"width":1476,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/15-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":16},"width":643.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/16-0.png","element":"img","alt":" gap(x, −∇f(x)) = gap(w, −∇g(w)).","inline":true}],[{"style":{"width":"86%"},"width":1627,"height":449,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/16-1.png","element":"img"}]]},{"heading":"Appendix C Generalized smoothness","paragraphs":[[{"text":"The following bound holds for any closed convex ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":", which may or may not be compact or symmetric.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 6 ","element":"span"},{"text":"(Smoothness equivalences)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"height":10.39},"width":46.96,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/16-2.png","element":"img","alt":" κP","inline":true},{"style":{"fontStyle":"italic"},"text":":","element":"span"}],[{"id":"id-83","style":{"width":"71%"},"width":1332,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/16-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then the following also holds:","element":"span"}],[{"id":"id-84","style":{"width":"97%"},"width":1819,"height":367,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/16-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof largely follows from ","element":"span"},{"href":"#id-82","referenceIndex":41,"text":"Nesterov ","element":"a"},{"href":"#id-82","referenceIndex":41,"text":"(2013)","element":"a"},{"text":", mildly adapted.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"First prove ","element":"span"},{"href":"#id-83","text":"(20) ","element":"a"},{"style":{"height":8.8},"width":40,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/16-5.png","element":"img","alt":" ⇒","inline":true,"padRight":true},{"href":"#id-84","text":"(21)","element":"a"},{"text":". 
Construct ","element":"span"},{"style":{"height":17.38},"width":394.62,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/16-6.png","element":"img","alt":" g(x) = f(x) − xT ∇f(y","inline":true},{"text":"), which is convex, also ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smooth, and has minimum at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":". Then, for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"74%"},"width":1396,"height":806,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/16-7.png","element":"img"}],[{"style":{"width":"82%"},"width":1551,"height":674,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"Now prove ","element":"span"},{"href":"#id-83","text":"(20) ","element":"a"},{"style":{"height":8.8},"width":40,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-1.png","element":"img","alt":" ⇒","inline":true,"padRight":true},{"href":"#id-84","text":"(22)","element":"a"},{"text":". Using the same ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"text":"as before, consider","element":"span"}],[{"style":{"width":"94%"},"width":1772,"height":1079,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Corollary 1 ","element":"span"},{"text":"(Uniqueness of gradient)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"href":"#id-83","text":"(20) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds and ","element":"span"},{"text":"0 ","element":"span"},{"style":{"height":11.6},"width":136.63,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-3.png","element":"img","alt":" ∈ int P","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":16},"width":95.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-4.png","element":"img","alt":" ∇f(x","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is unique at the optimum.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Assume that ","element":"span"},{"style":{"height":16},"width":208.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-5.png","element":"img","alt":" f(x) = f(x∗","inline":true},{"text":") for some ","element":"span"},{"style":{"height":15.2},"width":164.36,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-6.png","element":"img","alt":" x ≠ x∗, x","inline":true,"padRight":true},{"text":"feasible. 
Then by optimality conditions, ","element":"span"},{"style":{"height":17.39},"width":152.1,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-7.png","element":"img","alt":"∇f(x∗)T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16},"width":170.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-8.png","element":"img","alt":"x∗ − x) ≤","inline":true,"padRight":true},{"text":"0, and thus","element":"span"}],[{"style":{"width":"56%"},"width":1064,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-9.png","element":"img"}],[{"text":"which implies that ","element":"span"},{"style":{"height":15.6},"width":736.7,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-10.png","element":"img","alt":" σP(∇f(x) − ∇f(x∗)) = 0. Since 0 ∈ int P","inline":true},{"text":", this can only happen if ","element":"span"},{"style":{"height":15.6},"width":302.25,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/17-11.png","element":"img","alt":" ∇f(x) = ∇f(x∗).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Lemma 7 ","element":"span"},{"text":"(Hessian sufficient condition)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For some closed convex set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"style":{"fontStyle":"italic"},"text":", and some convex twice differentiable function ","element":"span"},{"style":{"height":14},"width":198.47,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-0.png","element":"img","alt":" f : Rn → R","inline":true},{"style":{"fontStyle":"italic"},"text":", suppose that","element":"span"}],[{"style":{"width":"71%"},"width":1333,"height":198,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"By definition and positive homogeneity of ","element":"span"},{"style":{"height":16},"width":88.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-2.png","element":"img","alt":" κP(x","inline":true},{"text":"), more generally","element":"span"}],[{"style":{"width":"34%"},"width":643,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-3.png","element":"img"}],[{"text":"Then","element":"span"}],[{"style":{"width":"98%"},"width":1851,"height":404,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-4.png","element":"img"}],[{"id":"id-86","style":{"fontWeight":"bold"},"text":"Lemma 8 ","element":"span"},{"text":"(Gradient suboptimality bound)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that ","element":"span"},{"style":{"height":10.98},"width":85.21,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-5.png","element":"img","alt":" x∗ =","inline":true,"padRight":true},{"text":"argmin","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"with respect to ","element":"span"},{"style":{"height":9.99},"width":46.77,"height":24.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-6.png","element":"img","alt":" σP","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is convex. Then","element":"span"}],[{"style":{"width":"52%"},"width":975,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"For any ","element":"span"},{"style":{"height":16},"width":159.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-8.png","element":"img","alt":" α ∈ ∂h(x","inline":true},{"text":") and ","element":"span"},{"style":{"height":16},"width":193.27,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-9.png","element":"img","alt":" α∗ ∈ ∂h(x∗","inline":true},{"text":"),","element":"span"}],[{"style":{"width":"95%"},"width":1795,"height":342,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-10.png","element":"img"}],[{"text":"we derive (a) from expansiveness, and (b) from convexity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":".","element":"span"}]]},{"heading":"Appendix D Proofs from Section 3","paragraphs":[[{"text":"The following Lemma will be 
used in computing the objective value bound.","element":"span"}],[{"id":"id-87","style":{"fontWeight":"bold"},"text":"Lemma 9 ","element":"span"},{"text":"(One step value bound)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"height":10.39},"width":46.96,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-11.png","element":"img","alt":" κP","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14},"width":62.02,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-12.png","element":"img","alt":" φ µ","inline":true},{"style":{"fontStyle":"italic"},"text":"-convex,","element":"span"}],[{"style":{"width":"99%"},"width":1869,"height":302,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/18-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"we take the sequence ","element":"span"},{"style":{"height":18.19},"width":181.23,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-0.png","element":"img","alt":" θ(t) = 2/(t","inline":true,"padRight":true},{"text":"+ 1)","element":"span"},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"text":"¯","element":"span"},{"text":"∆","element":"span"},{"style":{"height":11.2},"width":36.5,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-1.png","element":"img","alt":"(t)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"represents an averaged 
suboptimality:","element":"span"}],[{"style":{"width":"67%"},"width":1257,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"For one step, at ","element":"span"},{"style":{"height":14.18},"width":135.18,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-3.png","element":"img","alt":" x = x(t)","inline":true},{"text":", define","element":"span"}],[{"style":{"width":"35%"},"width":657,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-4.png","element":"img"}],[{"style":{"height":14.58},"width":233.53,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-5.png","element":"img","alt":"and ∆ = ∆(t)","inline":true},{"text":". Since, by smoothness of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"74%"},"width":1390,"height":313,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-6.png","element":"img"}],[{"text":"then","element":"span"}],[{"style":{"width":"80%"},"width":1500,"height":379,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-7.png","element":"img"}],[{"text":"and therefore","element":"span"}],[{"style":{"height":18.18},"width":216.8,"height":45.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-8.png","element":"img","alt":"A = ∇f(x)T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":6.8},"width":90.39,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-9.png","element":"img","alt":"s − x","inline":true},{"text":") + ","element":"span"},{"style":{"height":16},"width":687.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-10.png","element":"img","alt":" h(s) − h(x) = −f(x) − h(x) 
− f ∗(∇f(x","inline":true},{"text":") + ","element":"span"},{"style":{"height":16},"width":593.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-11.png","element":"img","alt":" h∗(−∇f(x)) = −gap(x, −∇f(x)).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Term ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"text":"By convexity and homogeneity of ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-12.png","element":"img","alt":" κ","inline":true},{"text":",","element":"span"}],[{"style":{"width":"98%"},"width":1842,"height":366,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-13.png","element":"img"}],[{"text":"Taking ","element":"span"},{"style":{"height":21.28},"width":169.17,"height":53.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-14.png","element":"img","alt":" θ(t) = 2t+1","inline":true},{"text":", then","element":"span"}],[{"style":{"width":"29%"},"width":547,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-15.png","element":"img"}],[{"text":"so","element":"span"}],[{"style":{"width":"42%"},"width":793,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/19-16.png","element":"img"}],[{"text":"By optimality conditions on the update for ","element":"span"},{"style":{"height":14.19},"width":55.18,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-0.png","element":"img","alt":" s(t)","inline":true,"padRight":true},{"text":"(","element":"span"},{"href":"#id-85","text":"(8) ","element":"a"},{"text":"in main 
text),","element":"span"}],[{"style":{"width":"87%"},"width":1637,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-1.png","element":"img"}],[{"text":"where (a) follows from Assumption ","element":"span"},{"href":"#id-59","text":"2. ","element":"a"},{"text":"Then","element":"span"}],[{"style":{"width":"77%"},"width":1457,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-2.png","element":"img"}],[{"text":"where (b) follows from Lemma ","element":"span"},{"href":"#id-86","text":"8. ","element":"a"},{"text":"Overall this gives","element":"span"}],[{"style":{"width":"92%"},"width":1730,"height":321,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-3.png","element":"img"}],[{"text":"where (c) comes from (","element":"span"},{"style":{"height":18.17},"width":396.55,"height":45.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-4.png","element":"img","alt":"Σ_{i=1}^m c_i)^2 ≤ m Σ_{i=1}^m c_i^2","inline":true,"padRight":true},{"text":".","element":"span"}],[{"id":"id-89","style":{"fontWeight":"bold"},"text":"Lemma 10 ","element":"span"},{"text":"(Objective value bound)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"height":14.79},"width":348.4,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-5.png","element":"img","alt":"�P and φ : R+ → R+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is monotonically increasing and ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-6.png","element":"img","alt":" µ","inline":true},{"style":{"fontStyle":"italic"},"text":"-strongly convex, then the objective error decreases as","element":"span"}],[{"style":{"width":"23%"},"width":432,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Take ","element":"span"},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > ","element":"span"},{"text":"12","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"large enough so that for all ","element":"span"},{"style":{"height":12.8},"width":56.46,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-8.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", ","element":"span"},{"style":{"height":24.58},"width":297.6,"height":61.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-9.png","element":"img","alt":"3L22µ2 (θ(t))2 ≤ θ(t)/","inline":true},{"text":"3. 
Then define","element":"span"}],[{"style":{"width":"41%"},"width":781,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-10.png","element":"img"}],[{"text":"Then, using Lemma ","element":"span"},{"href":"#id-87","text":"9","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"45%"},"width":855,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-11.png","element":"img"}],[{"text":"We now pick ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"large enough such that for all ","element":"span"},{"style":{"height":12.8},"width":56.46,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-12.png","element":"img","alt":" t ≤","inline":true,"padRight":true},{"text":"¯","element":"span"},{"style":{"height":18.19},"width":675.71,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-13.png","element":"img","alt":"t, ∆(t) ≤ G/t, and G > 24A. Since ∆(t) ","inline":true,"padRight":true},{"text":"is always a bounded quantity, this is always possible. Then, for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t < ","element":"span"},{"text":"¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"69%"},"width":1303,"height":298,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-14.png","element":"img"}],[{"text":"Now we make an inductive step. 
Suppose that for some ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", ∆","element":"span"},{"style":{"height":18.18},"width":182.65,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-15.png","element":"img","alt":"(t′) < G/t′","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":12.8},"width":92.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-16.png","element":"img","alt":" t′ ≤ t","inline":true},{"text":". Pick ","element":"span"},{"style":{"height":18.18},"width":182.72,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/20-17.png","element":"img","alt":" θ(t) = 2/(t","inline":true,"padRight":true},{"text":"+ 1).","element":"span"}],[{"text":"Then","element":"span"}],[{"style":{"width":"54%"},"width":1019,"height":628,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/21-0.png","element":"img"}],[{"text":"which satisfies the inductive step.","element":"span"}],[{"text":"The following is a generalized and modified version of a proof segment from ","element":"span"},{"href":"#id-0","referenceIndex":29,"text":"Jaggi ","element":"a"},{"href":"#id-0","referenceIndex":29,"text":"(2013)","element":"a"},{"text":", which will be used for proving ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/t","element":"span"},{"text":") gap convergence.","element":"span"}],[{"id":"id-88","style":{"fontWeight":"bold"},"text":"Lemma 11. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Pick some ","element":"span"},{"text":"0 ","element":"span"},{"style":{"height":13.19},"width":175.65,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/21-1.png","element":"img","alt":" < T2 < T1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and pick","element":"span"}],[{"style":{"width":"79%"},"width":1494,"height":445,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/21-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Using the integral rule, we see that","element":"span"}],[{"style":{"width":"54%"},"width":1019,"height":285,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/21-3.png","element":"img"}],[{"text":"This yields","element":"span"}],[{"style":{"width":"87%"},"width":1641,"height":562,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/21-4.png","element":"img"}],[{"id":"id-90","style":{"fontWeight":"bold"},"text":"Lemma 12 ","element":"span"},{"text":"(Generalized non-monotonic gap bound)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"∆","element":"span"},{"style":{"height":21.28},"width":477.94,"height":53.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-0.png","element":"img","alt":"(t) := g(x(t)) − g(x∗) ≤ G1t+D","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":13.19},"width":47.33,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-1.png","element":"img","alt":" G1","inline":true},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":21.28},"width":179.17,"height":53.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-2.png","element":"img","alt":" θ(t) = G2t+D","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":13.19},"width":47.34,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-3.png","element":"img","alt":" G2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"style":{"fontStyle":"italic"},"text":", and","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"∆","element":"span"},{"style":{"height":14.18},"width":203.27,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-4.png","element":"img","alt":"(t+1) − ∆(k)","inline":true},{"text":"(1 + ","element":"span"},{"style":{"height":18.18},"width":361.07,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-5.png","element":"img","alt":" αθ(k)) ≤ −θ(k)gap(k","inline":true},{"text":") + 
(","element":"span"},{"style":{"height":18.18},"width":387.87,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-6.png","element":"img","alt":"θ(k))2G3 for some G3,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"then for","element":"span"}],[{"style":{"width":"80%"},"width":1516,"height":742,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-7.png","element":"img"}],[{"text":"Picking ","element":"span"},{"style":{"height":17.34},"width":756.76,"height":43.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-8.png","element":"img","alt":" C1 = G1, C2 = αG1G2 + G3G22, C3 = G2G4","inline":true},{"text":", and invoking Lemma ","element":"span"},{"href":"#id-88","text":"11, ","element":"a"},{"text":"this yields that ∆","element":"span"},{"style":{"height":17.39},"width":162.94,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-9.png","element":"img","alt":"(t+1) < 0,","inline":true,"padRight":true},{"text":"which is impossible. Therefore, the assumption must not be true.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem ","element":"span"},{"href":"#id-67","style":{"fontWeight":"bold"},"text":"1 ","element":"a"},{"text":"(Convergence)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that ","element":"span"},{"style":{"height":14.18},"width":59.27,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-10.png","element":"img","alt":" x(t) ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are the iterates of gCGM for which ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"height":14.79},"width":285.44,"height":36.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-11.png","element":"img","alt":"�P, φ : R+ → R+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is monotonically increasing and ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-12.png","element":"img","alt":" µ","inline":true},{"style":{"fontStyle":"italic"},"text":"-strongly convex. Take ","element":"span"},{"style":{"height":18.19},"width":181.24,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-13.png","element":"img","alt":" θ(t) = 4/(t","inline":true,"padRight":true},{"text":"+ 2)","element":"span"},{"style":{"fontStyle":"italic"},"text":". Then","element":"span"}],[{"style":{"width":"66%"},"width":1239,"height":202,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-14.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"The proof follows from Lemmas ","element":"span"},{"href":"#id-89","text":"10 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-90","text":"12.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Lemma ","element":"span"},{"href":"#id-68","style":{"fontWeight":"bold"},"text":"1 ","element":"a"},{"text":"(Gap bounds residual)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any primal feasible variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"width":"28%"},"width":530,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-15.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Taking ","element":"span"},{"style":{"height":16},"width":254.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-16.png","element":"img","alt":" g(x) = φ(κP(x","inline":true},{"text":")), we have","element":"span"}],[{"style":{"width":"47%"},"width":897,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-17.png","element":"img"}],[{"text":"Additionally, by Fenchel-Young, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") + ","element":"span"},{"style":{"height":17.38},"width":281.93,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-18.png","element":"img","alt":" f ∗(−u) ≥ −xT u","inline":true,"padRight":true},{"text":"Therefore","element":"span"}],[{"style":{"width":"54%"},"width":1019,"height":185,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-19.png","element":"img"}],[{"text":"for some 
","element":"span"},{"style":{"height":16},"width":298.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/22-20.png","element":"img","alt":" w ∈ ∂(φ ◦ κP)(x∗","inline":true},{"text":")).","element":"span"}],[{"text":"Take ","element":"span"},{"style":{"height":16},"width":202.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/23-0.png","element":"img","alt":" u = −∇f(x","inline":true},{"text":"). Then","element":"span"}],[{"style":{"width":"85%"},"width":1606,"height":324,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/23-1.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Theorem ","element":"span"},{"href":"#id-63","style":{"fontWeight":"bold"},"text":"2 ","element":"a"},{"text":"(Dual screening)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"height":11.6},"width":30,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/23-2.png","element":"img","alt":"�P","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then for any ","element":"span"},{"style":{"height":14},"width":251.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/23-3.png","element":"img","alt":" x, any p ∈ P0,","inline":true}],[{"id":"id-91","style":{"width":"72%"},"width":1357,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/23-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"implies that ","element":"span"},{"style":{"height":16.69},"width":244.38,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/23-5.png","element":"img","alt":" p ∉ suppP(x∗","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":10.98},"width":38.78,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/23-6.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the optimal variable in ","element":"span"},{"href":"#id-52","text":"(4)","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"From Lemma ","element":"span"},{"href":"#id-68","text":"1, ","element":"a"},{"text":"we have that when condition ","element":"span"},{"href":"#id-91","text":"(25) ","element":"a"},{"text":"holds,","element":"span"}],[{"style":{"width":"46%"},"width":876,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/23-7.png","element":"img"}],[{"text":"Then, by the triangle inequality,","element":"span"}],[{"style":{"width":"50%"},"width":953,"height":163,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/23-8.png","element":"img"}],[{"text":"Thus, by Property ","element":"span"},{"href":"#id-77","text":"1, ","element":"a"},{"style":{"height":16.7},"width":244.38,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/23-9.png","element":"img","alt":" p ∉ suppP(x∗","inline":true},{"text":").","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-44","text":"Bach, F. (2015). ","element":"span"},{"text":"Duality between subgradient and conditional gradient methods. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 25(1):115–129.","element":"span"}],[{"id":"id-12","text":"Bach, F. R. (2010). Structured sparsity-inducing norms through submodular functions. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 118–126.","element":"span"}],[{"id":"id-76","text":"Bauschke, H. H. and Combettes, P. L. (2011). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Convex Analysis and Monotone Operator Theory in Hilbert Spaces","element":"span"},{"text":", volume 408. Springer, 2nd edition.","element":"span"}],[{"id":"id-19","text":"Berrada, L., Zisserman, A., and Kumar, M. P. (2018). Deep Frank-Wolfe for neural network optimization. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1811.07591","element":"span"},{"text":".","element":"span"}],[{"id":"id-10","text":"Bondell, H. D. and Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised ","element":"span"},{"text":"clustering of predictors with OSCAR. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Biometrics","element":"span"},{"text":", 64(1):115–123.","element":"span"}],[{"id":"id-29","text":"Bonnefoy, A., Valentin, E., Liva, R., and Gribonval, R. (2015). Dynamic screening: Accelerating first-order ","element":"span"},{"text":"algorithms for the LASSO and group-LASSO. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Signal Processing","element":"span"},{"text":", 63(19):5121–5132.","element":"span"}],[{"id":"id-55","text":"Borwein, J. and Lewis, A. S. (2010). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Convex Analysis and Nonlinear Optimization: Theory and Examples","element":"span"},{"text":". Springer Science and Business Media.","element":"span"}],[{"id":"id-50","text":"Bredies, K. and Lorenz, D. A. (2008). Iterated hard shrinkage for minimization problems with sparsity ","element":"span"},{"text":"constraints. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Scientific Computing","element":"span"},{"text":", 30(2):657–683.","element":"span"}],[{"id":"id-49","text":"Bredies, K., Lorenz, D. A., and Maass, P. (2009). A generalized conditional gradient method and its connection ","element":"span"},{"text":"to an iterative shrinkage method. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computational Optimization and Applications","element":"span"},{"text":", 42(2):173–193.","element":"span"}],[{"id":"id-66","text":"Burke, J. V. and Moré, J. J. (1988). On the identification of active constraints. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Numerical Analysis","element":"span"},{"text":", 25(5):1197–1211.","element":"span"}],[{"id":"id-5","text":"Candès, E. and Romberg, J. (2006). Robust signal recovery from incomplete observations. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2006 International Conference on Image Processing","element":"span"},{"text":", pages 1281–1284. IEEE.","element":"span"}],[{"id":"id-4","text":"Candès, E. J. and Tao, T. (2005). Decoding by linear programming. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Information Theory","element":"span"},{"text":", 51(12):4203–4215.","element":"span"}],[{"id":"id-1","text":"Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. (2012). The convex geometry of linear ","element":"span"},{"text":"inverse problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Foundations of Computational Mathematics","element":"span"},{"text":", 12(6):805–849.","element":"span"}],[{"id":"id-13","text":"Chari, V., Lacoste-Julien, S., Laptev, I., and Sivic, J. (2015). On pairwise costs for network flow multi-object ","element":"span"},{"text":"tracking. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 5537–5545.","element":"span"}],[{"id":"id-39","text":"Clarkson, K. L. (2010). Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACM Transactions on Algorithms (TALG)","element":"span"},{"text":", 6(4):63.","element":"span"}],[{"id":"id-3","text":"Donoho, D. L. (2006). Compressed sensing. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Information Theory","element":"span"},{"text":", 52(4):1289–1306.","element":"span"}],[{"id":"id-46","text":"Dudik, M., Harchaoui, Z., and Malick, J. (2012). Lifted coordinate descent for learning with trace-norm ","element":"span"},{"text":"regularization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Artificial Intelligence and Statistics","element":"span"},{"text":", pages 327–336.","element":"span"}],[{"id":"id-37","text":"Dunn, J. C. and Harshbarger, S. (1978). Conditional gradient algorithms with open loop step size rules. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Mathematical Analysis and Applications","element":"span"},{"text":", 62(2):432–444.","element":"span"}],[{"id":"id-21","text":"Fercoq, O., Gramfort, A., and Salmon, J. (2015). Mind the duality gap: safer rules for the Lasso. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1505.03410","element":"span"},{"text":".","element":"span"}],[{"id":"id-36","text":"Frank, M. and Wolfe, P. (1956). An algorithm for quadratic programming. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Naval Research Logistics Quarterly","element":"span"},{"text":", 3(1-2):95–110.","element":"span"}],[{"id":"id-51","text":"Freund, R. M. (1987). Dual gauge programs, with applications to quadratic programming and the minimum-","element":"span"},{"text":"norm problem. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematical Programming","element":"span"},{"text":", 38(1):47–67.","element":"span"}],[{"id":"id-7","text":"Freund, R. M., Grigas, P., and Mazumder, R. (2017). An extended Frank–Wolfe method with in-face directions, ","element":"span"},{"text":"and its application to low-rank matrix completion. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 27(1):319–346.","element":"span"}],[{"id":"id-56","text":"Friedlander, M. P., Macedo, I., and Pong, T. K. (2014). Gauge optimization and duality. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 24(4):1999–2022.","element":"span"}],[{"id":"id-20","text":"Ghaoui, L. E., Viallon, V., and Rabbani, T. (2012). Safe feature elimination for the Lasso and sparse ","element":"span"},{"text":"supervised learning problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Pacific Journal of Optimization","element":"span"},{"text":".","element":"span"}],[{"id":"id-48","text":"Harchaoui, Z., Juditsky, A., and Nemirovski, A. (2015). Conditional gradient algorithms for norm-regularized ","element":"span"},{"text":"smooth convex optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematical Programming","element":"span"},{"text":", 152(1-2):75–112.","element":"span"}],[{"id":"id-65","text":"Hare, W. (2011). Identifying active manifolds in regularization problems. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Fixed-Point Algorithms for Inverse Problems in Science and Engineering","element":"span"},{"text":", pages 261–271. Springer.","element":"span"}],[{"id":"id-38","text":"Hazan, E. (2008). Sparse approximate solutions to semidefinite programs. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Latin American Symposium on Theoretical Informatics","element":"span"},{"text":", pages 306–316. Springer.","element":"span"}],[{"id":"id-31","text":"Herzet, C. and Drémeau, A. (2018). Joint screening tests for Lasso. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","element":"span"},{"text":", pages 4084–4088. 
IEEE.","element":"span"}],[{"id":"id-0","text":"Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", pages 427–435.","element":"span"}],[{"id":"id-34","text":"Johnson, T. B. and Guestrin, C. (2017). Stingy CD: safely avoiding wasteful updates in coordinate descent. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Machine Learning-Volume 70","element":"span"},{"text":", pages 1752–1760.","element":"span"}],[{"id":"id-14","text":"Krishnan, R. G., Lacoste-Julien, S., and Sontag, D. (2015). Barrier Frank-Wolfe for marginal inference. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 532–540.","element":"span"}],[{"id":"id-41","text":"Lacoste-Julien, S. and Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization ","element":"span"},{"text":"variants. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 496–504.","element":"span"}],[{"id":"id-16","text":"Lacoste-Julien, S., Jaggi, M., Schmidt, M., and Pletscher, P. (2012). Block-coordinate Frank-Wolfe optimization ","element":"span"},{"text":"for structural SVMs. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1207.4747","element":"span"},{"text":".","element":"span"}],[{"id":"id-17","text":"Lacoste-Julien, S., Lindsten, F., and Bach, F. (2015). Sequential kernel herding: Frank-Wolfe optimization ","element":"span"},{"text":"for particle filtering. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1501.02056","element":"span"},{"text":".","element":"span"}],[{"id":"id-64","text":"Lewis, A. S. and Wright, S. J. (2011). 
Identifying activity. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 21(2):597–614.","element":"span"}],[{"id":"id-24","text":"Liu, J., Zhao, Z., Wang, J., and Ye, J. (2013). Safe screening with variational inequalities and its application ","element":"span"},{"text":"to Lasso. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1307.7577","element":"span"},{"text":".","element":"span"}],[{"id":"id-25","text":"Malti, A. and Herzet, C. (2016). Safe screening tests for Lasso based on firmly non-expansiveness. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","element":"span"},{"text":", pages 4732–4736. IEEE.","element":"span"}],[{"text":"Mirrokni, V., Leme, R. P., Vladu, A., and Wong, S. C.-w. (2017). Tight bounds for approximate Carathéodory ","element":"span"},{"text":"and beyond. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Machine Learning","element":"span"},{"text":", volume 70, pages 2440–2448.","element":"span"}],[{"id":"id-47","text":"Mu, C., Zhang, Y., Wright, J., and Goldfarb, D. (2016). Scalable robust matrix recovery: Frank–Wolfe meets ","element":"span"},{"text":"proximal methods. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Scientific Computing","element":"span"},{"text":", 38(5):A3291–A3317.","element":"span"}],[{"id":"id-27","text":"Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. (2015). Gap safe screening rules for sparse multi-task ","element":"span"},{"text":"and multi-class models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 811–819.","element":"span"}],[{"id":"id-82","text":"Nesterov, Y. (2013). 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Introductory Lectures on Convex Optimization: A Basic Course","element":"span"},{"text":". Springer Science and Business Media.","element":"span"}],[{"id":"id-58","text":"Nutini, J., Schmidt, M., Laradji, I., Friedlander, M., and Koepke, H. (2015). Coordinate descent converges ","element":"span"},{"text":"faster with the Gauss-Southwell rule than random selection. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1632–1641.","element":"span"}],[{"id":"id-33","text":"Ogawa, K., Suzuki, Y., and Takeuchi, I. (2013). Safe screening of non-support vectors in pathwise SVM ","element":"span"},{"text":"computation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1382–1390.","element":"span"}],[{"id":"id-18","text":"Ping, W., Liu, Q., and Ihler, A. T. (2016). Learning infinite RBMs with Frank-Wolfe. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 3063–3071.","element":"span"}],[{"id":"id-26","text":"Raj, A., Olbrich, J., Gärtner, B., Schölkopf, B., and Jaggi, M. (2016). Screening rules for convex problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1609.07478","element":"span"},{"text":".","element":"span"}],[{"id":"id-42","text":"Rao, N., Shah, P., and Wright, S. (2015). Forward-backward greedy algorithms for atomic norm regularization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Signal Processing","element":"span"},{"text":", 63(21):5798–5811.","element":"span"}],[{"id":"id-54","text":"Rockafellar, R. T. (1970). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Convex Analysis","element":"span"},{"text":", volume 28. 
Princeton University Press.","element":"span"}],[{"id":"id-15","text":"Sener, O. and Koltun, V. (2018). Multi-task learning as multi-objective optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 527–538.","element":"span"}],[{"id":"id-32","text":"Shibagaki, A., Karasuyama, M., Hatano, K., and Takeuchi, I. (2016). Simultaneous safe screening of features ","element":"span"},{"text":"and samples in doubly sparse modeling. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1577–1586.","element":"span"}],[{"id":"id-40","text":"Tewari, A., Ravikumar, P. K., and Dhillon, I. S. (2011). Greedy algorithms for structurally constrained high ","element":"span"},{"text":"dimensional problems. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 882–890.","element":"span"}],[{"id":"id-2","text":"Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of the Royal Statistical Society: Series B (Methodological)","element":"span"},{"text":", 58(1):267–288.","element":"span"}],[{"id":"id-8","text":"Vinyes, M. and Obozinski, G. (2017). Fast column generation for atomic norm regularization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":".","element":"span"}],[{"id":"id-43","text":"Von Hohenbalken, B. (1977). Simplicial decomposition in nonlinear programming algorithms. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematical Programming","element":"span"},{"text":", 13(1):49–68.","element":"span"}],[{"id":"id-23","text":"Wang, J., Zhou, J., Liu, J., Wonka, P., and Ye, J. (2014). A safe screening rule for sparse logistic regression. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 1053–1061.","element":"span"}],[{"id":"id-28","text":"Wang, J., Zhou, J., Wonka, P., and Ye, J. (2013). LASSO screening rules via dual polytope projection. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 1070–1078.","element":"span"}],[{"id":"id-22","text":"Xiang, Z. J. and Ramadge, P. J. (2012). Fast Lasso screening tests based on correlations. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","element":"span"},{"text":", pages 2137–2140. IEEE.","element":"span"}],[{"id":"id-6","text":"Yu, Y., Zhang, X., and Schuurmans, D. (2017). Generalized conditional gradient for sparse estimation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Journal of Machine Learning Research","element":"span"},{"text":", 18(1):5279–5324.","element":"span"}],[{"id":"id-9","text":"Zeng, X. and Figueiredo, M. A. (2014). The ordered weighted ","element":"span"},{"style":{"height":7.6},"width":32.6,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.09718/images/26-0.png","element":"img","alt":" ℓ1","inline":true,"padRight":true},{"text":"norm: Atomic formulation, projections, and algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1409.4271","element":"span"},{"text":".","element":"span"}],[{"id":"id-30","text":"Zhou, Q. and Zhao, Q. (2015). 
Safe subspace screening for nuclear norm regularized least squares problems. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1103–1112.","element":"span"}],[{"id":"id-45","text":"Zhou, S., Gupta, S., and Udell, M. (2018). Limited memory Kelley’s method converges for composite convex ","element":"span"},{"text":"and submodular objectives. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 4414–4424.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]