A Light Touch for Heavily Constrained SGD
2015·Arxiv
Abstract

Minimizing empirical risk subject to a set of constraints can be a useful strategy for learning restricted classes of functions, such as monotonic functions, submodular functions, classifiers that guarantee a certain class label for some subset of examples, etc. However, these restrictions may result in a very large number of constraints. Projected stochastic gradient descent (SGD) is often the default choice for large-scale optimization in machine learning, but requires a projection after each update. For heavily-constrained objectives, we propose an efficient extension of SGD that stays close to the feasible region while only applying constraints probabilistically at each iteration. Theoretical analysis shows a compelling trade-off between per-iteration work and the number of iterations needed on problems with a large number of constraints.

Many machine learning problems can benefit from the addition of constraints. For example, one can learn monotonic functions by adding appropriate constraints to ensure or encourage positive derivatives everywhere [e.g. Archer and Wang, 1993, Sill, 1998, Spouge et al., 2003, Daniels and Velikova, 2010, Gupta et al., 2016]. Submodular functions can often be learned from noisy examples by imposing constraints to ensure submodularity holds. Another example occurs when one wishes to guarantee that a classifier will correctly label certain “canonical” examples, which can be enforced by constraining the function values on those examples. See Qu and Hu [2011] for some other examples of constraints useful in machine learning.

However, these practical uses of constraints in machine learning are impractical in that the number of constraints may be very large, and scale poorly with the number of features d or number of training samples n. In this paper we propose a new strategy for tackling such heavily-constrained problems, with guarantees and compelling convergence rates for large-scale convex problems.

A standard approach for large-scale empirical risk minimization is projected stochastic gradient descent [e.g. Zinkevich, 2003, Nemirovski et al., 2009]. Each SGD iteration is computationally cheap, and the algorithm converges quickly to a solution good enough for machine learning needs. However, this algorithm requires a projection onto the feasible region after each stochastic gradient step, which can be prohibitively slow if there are many non-trivial constraints, and is not easy to parallelize. Recently, Frank-Wolfe-style algorithms [e.g. Hazan and Kale, 2012, Jaggi, 2013] have been proposed that remove the projection, but require a constrained linear optimization at each iteration.

We propose a new strategy for large-scale constrained optimization that, like Mahdavi et al. [2012], moves the constraints into the objective and finds an approximate solution of the resulting unconstrained problem, projecting the (potentially-infeasible) result onto the constraints only once, at the end. Their work focused on handling only one constraint, but as they noted, multiple constraints  g1(x) ≤ 0, g2(x) ≤ 0, . . . , gm(x) ≤ 0can be reduced to one constraint by replacing the m constraints with their maximum:  maxi gi(x) ≤ 0. However, this still requires that all m constraints be checked at every iteration. In this paper, we focus on the computational complexity as a function of the number of constraints m, and show that it is possible to achieve good convergence rates without checking constraints so often.

The key challenge to handling a large number of constraints is determining which constraints are active at the optimum of the constrained problem, which is likely to be only a small fraction of the total constraint set. For example, for linear inequality constraints on a d-dimensional problem, no more than d of the constraints will be active at the optimum, and furthermore, once the active constraints are known, the problem reduces to solving the unconstrained problem that results from projecting onto them, which is typically vastly easier.

To identify and focus on the important constraints, we propose learning a probability distribution over the m constraints that concentrates on the most-violated, and sampling constraints from this evolving distribution at each iteration. We call this approach LightTouch because at each iteration only a few constraints are checked, and the solution is only nudged toward the feasible set. LightTouch is suitable for convex problems, but we also propose a variant, MidTouch, that enjoys a superior convergence rate on strongly convex problems. These two algorithms are introduced and analyzed in Section 3.

Our proposed strategy removes the per-iteration m-dependence on the number of constraint evaluations. LightTouch and MidTouch do need more iterations to converge, but each iteration is faster, resulting in a net performance improvement. To be precise, we show that the total number of constraint checks required to achieve  ǫ-suboptimality when optimizing a non-strongly convex objective decreases from  O(m/ǫ2)to ˜O((ln m)/ǫ2 + m(ln m)3/2/ǫ3/2)—notice that the m-dependence of the dominant (in  ǫ) term has decreased from m to ln m. For a  λ-strongly convex objective, the dominant (again in) term in our bound on the number of constraint checks decreases from  O(m/λ2ǫ)to ˜O((ln m)/λ2ǫ), but like the non-strongly convex result this bound contains lower-order terms with worse m-dependencies. A more careful comparison of the performance of our algorithms can be found in Section 4.

While they check fewer than m constraints per iteration, these algorithms do need to pay a O(m) per-iteration arithmetic cost. When each constraint is expensive to check, this cost can be neglected. However, when the constraints are simple to check (e.g. box constraints, or the lattice monotonicity constraints considered in our experiments), it can be partially addressed by transforming the problem into an equivalent one with fewer more costly constraints. This, as well as other practical considerations, are discussed in Section 5.

Experiments on a large-scale real-world heavily-constrained ranking problem show that our proposed approach works well in practice. This problem was too large for a projected SGD implementation using an off-the-shelf quadratic programming solver to perform projections, but was amenable to an approach based on a fast approximate projection routine tailored to this particular constraint set. Measured in terms of runtime, however, LightTouch was still significantly faster. Each constraint in this problem is trivial, requiring only a single comparison operation to check, so the aforementioned O(m) arithmetic cost of LightTouch is a significant issue. Despite this, LightTouch was roughly as fast as the Mahdavi et al. [2012]-like algorithm FullTouch. In light of other experiments showing that LightTouch checks dramatically fewer constraints in total than FullTouch, we believe that LightTouch is well-suited to machine learning problems with many nontrivial constraints.

Consider the constrained optimization problem:

image

where  W ⊆ Rdis bounded, closed and convex, and  f : W → Rand all  gi : W → Rare convex (our notation is summarized in Table 1). We assume that W is a simple object, e.g. an  ℓ2ball, onto which it is inexpensive to project, and that the “trickier” aspects of the domain are specified via the constraints  gi(w) ≤ 0. Notice that we

Table 1: Key notation.

image

consider constraints written in terms of arbitrary convex functions, and are not restricted to e.g. only linear or quadratic constraints.

2.1 FullTouch: A Relaxation with a Feasible Minimizer

We build on the approach of Mahdavi et al. [2012] to relax Equation 1. Defining  g(w) = maxi gi(w)and introducing a Lagrange multiplier  αyields the equivalent optimization problem:

image

Directly optimizing over w and  αis problematic because the optimal value for  αis infinite for any w that violates a constraint. Instead, we follow Mahdavi et al. [2012, Section 4.2] in relaxing the problem by adding an upper bound of γon  α, and using the fact that  max0≤α≤γ αg(w) = γ max(0, g(w)).

In the following lemma, we show that, with the proper choice of  γ, any minimizer of this relaxed objective is a feasible solution of Equation 1, indicating that using stochastic gradient descent (SGD) to minimize the relaxation (h(w) in the lemma below) will be effective.

Lemma 1. Suppose that f is  Lf-Lipschitz, i.e.  |f(w) − f(w′)| ≤ Lf ∥w − w′∥2for all  w, w′ ∈ W, and that there is a constant  ρ > 0such that if g(w) = 0 then�� ˇ∇��2 ≥ ρfor all  ˇ∇ ∈ ∂g(w), where  ∂g(w)is the subdifferential of g(w).

image

image

If  γ > Lf/ρ, then for any infeasible w (i.e. for which g(w) > 0):

image

where  Πg (w)is the projection of w onto the set  {w ∈ W : g(w) ≤ 0}w.r.t. the Euclidean norm.

Proof. In Appendix C.

The strategy of applying SGD to h(w), detailed in Algorithm 1, which we call FullTouch, has the same “flavor” as the algorithms proposed by Mahdavi et al. [2012], and we use it as a baseline comparison point for our other algorithms.

Application of a standard SGD bound to FullTouch shows that it converges at a rate with no explicit dependence on the number of constraints m, measured in terms of the number of iterations required to achieve some desired suboptimality (see Appendix C.1), although the  γparameter can introduce an implicit d or m-dependence, depending on the constraints (discussed in Section 2.2). The main drawback of FullTouch is that each iteration is expensive, requiring the evaluation of all m constraints, since differentiation of g requires first identifying the most-violated. This is the key issue we tackle with the LightTouch algorithm proposed in Section 3.

2.2 Constraint-Dependence of γ

The conditions on Lemma 1 were stated in terms of g, instead of the individual  gis, because it is difficult to provide suitable conditions on the “component” constraints without accounting for their interactions.

For a point w where two or more constraints intersect, the subdifferential of g(w) consists of all convex combinations of subgradients of the intersecting constraints, with the consequence that even if each of the subgradients of the gi(w)s has norm at least  ρ′, subgradients of g(w) will generally have norms smaller than  ρ′. Exactly how much smaller depends on the particular constraints under consideration. We illustrate this phenomenon with the following examples, but note that, in practice,  γshould be chosen experimentally for any particular problem, so the question of the d and m-dependence of  γis mostly of theoretical interest.

Box Constraints Consider the m = 2d box constraints  gi(w) = −wi − 1and  gi+d(w) = wi − 1, all of which have gradients of norm 1. At most d constraints can intersect (at a corner of the  [−1, 1]dbox), all of which are mutually orthogonal, so the norm of any convex combination of their gradients is lower bounded by that of their average, ρ = 1/√d. Hence, one should choose  γ >√d Lf.

As in the above example,  γ ∝�min(m, d)will suffice when the subgradients of intersecting constraints are at least orthogonal, and  γcan be smaller if they always have positive inner products. However, if subgradients of intersecting constraints tend to point in opposing directions, then  γmay need to be much larger, as in our next example:

Ordering Constraints Suppose the  m = d − 1constraints order the components of w as  w1 ≤ w2 ≤ · · · ≤ wd, for which  gi(w) = (wi − wi+1)/√2, gradients of which again have norm 1. All of these constraints may be active simultaneously, in which case there is widespread cancellation in the average gradient  (e1 − ed)/(m√2), where  eiis the ith standard unit basis vector. The norm of this average gradient is  ρ = 1/m, so we should choose  γ > (d − 1)Lf.

In light of this example, one begins to wonder if a suitable  γwill necessarily exist—fortunately, the convexity of g enables us to prove a trivial bound as long as g(v) is strictly negative for some  v ∈ W:

Lemma 2. Suppose that there exists a  v ∈ Wfor which g(v) < 0, and let  Dw ≥ supw,w′∈W ∥w − w′∥2bound the diameter of W. Then  ρ = −g(v)/Dwsatisfies the conditions of Lemma 1.

Proof. Let  w ∈ Wbe a point for which g(w) = 0, and ˇ∇ ∈ ∂g(w)an arbitrary subgradient. By convexity,  g(v) ≥g(w) +�v − w, ˇ∇�. The Cauchy-Schwarz inequality then gives that:

image

from which the claim follows immediately.

Linear Constraints Consider the constraints  Aw ⪯ b, with each row of A having unit norm,  bmin = mini bi > 0, and W being the  ℓ2ball of radius r. It follows from Lemma 2 that  γ > (2r/bmin)Lfsuffices. Notice that the earlier box constraint example satisfies these assumptions (with  bmin = 1and  r =√d).

As the above examples illustrate, subgradients of g will be large at the boundary if subgradients of the  gis are large, and the constraints intersect at sufficiently shallow angles that, representing boundary subgradients of g as convex combinations of subgradients of the  gis, the components reinforce each other, or at least do not cancel too much. This requirement is related to the linear regularity assumption introduced by Bauschke [1996], and considered recently by Wang et al. [2015].

This section presents the main contribution of this paper: an algorithm that stochastically samples a small subset of the m constraints at each SGD iteration, updates the parameters based on the subgradients of the sampled constraints, and carefully learns the distribution over the constraints to produce a net performance gain.

We first motivate the approach by considering an oracle, then explain the algorithm and present convergence results for the convex (Section 3.2) and strongly convex (Section 3.3) cases.

3.1 Wanted: An Oracle For the Most Violated Constraint

Because FullTouch only needs to differentiate the most violated constraint at each iteration, it follows that if one had access to an oracle that identified the most-violated constraint, then the overall convergence rate (including the cost of each iteration) could only depend on m through  γ. This motivates us to learn to predict the most-violated constraint, ideally at a significantly better than linear-in-m rate.

To this end, we further relax the problem of minimizing h(w) (defined in Lemma 1) by replacing  γ max(0, g(w))with maximization over a probability distribution (as in Clarkson et al. [2010]), yielding the equivalent convex-linear

image

optimization problem:

image

Here,  ∆mis the m-dimensional simplex. We propose optimizing over w and p jointly, thereby learning the most-violated constraint, represented by the multinoulli distribution p over constraint indices, at the same time as we optimize over w.

3.2 LightTouch: Stochastic Constraint Handling

To optimize Equation 3, our proposed algorithm (Algorithm 2, LightTouch) iteratively samples stochastic gradients ˇ∆(t)ww.r.t. w and  ˆ∆(t)pw.r.t. p of ˜h(w, p), and then takes an SGD step on w and a multiplicative step on p:

image

where the exp and ln of the p-update are performed element-wise,  Πwprojects onto W w.r.t. the Euclidean norm, and Πponto  ∆mvia normalization (i.e. dividing its parameter by its sum).

The key to getting a good convergence rate for this algorithm is to choose ˇ∆wand ˆ∆psuch that they are both inexpensive to compute, and tend to have small norms. For ˇ∆w, this can be accomplished straightforwardly, by sampling a constraint index i according to p, and taking:

image

where ˇ∆is a stochastic subgradient of f and ˇ∇ max(0, gi(w))is a subgradient of  max(0, gi(w)). Calculating each such ˇ∆wrequires differentiating only one constraint, and it is easy to verify that ˇ∆wis a subgradient of ˜hw.r.t. w in expectation over ˇ∆and i. Taking  Gfto be a bound on the norm of ˇ∆and  Ggon the norms of subgradients of the  gis shows that ˇ∆w’s norm is bounded by  Gf + γGg.

For ˆ∆p, some care must be taken. Simply sampling a constraint index j uniformly and defining:

image

where  ejis the jth m-dimensional standard unit basis vector, does produce a ˆ∆pthat in expectation is the gradient of ˜hw.r.t. p, but it has a norm bound proportional to m. Such potentially large stochastic gradients would result in the number of iterations required to achieve some target suboptimality being proportional to  m2in our final bound.

A typical approach to reducing the variance (and hence the expected magnitude) of ˆ∆pis minibatching: instead of sampling a single constraint index j at every iteration, we could instead sample a subset S of size |S| = k without replacement, and use:

image

This is effective, but not enough, because reducing the variance by a factor of k via minibatching requires that we check k times more constraints. For this reason, in addition to minibatching, we center the stochastic gradients, as is done by the well-known SVRG algorithm [Johnson and Zhang, 2013], by storing a gradient estimate  γµwith  µ ∈ Rm, at each iteration sampling a set S of size |S| = k uniformly without replacement, and computing:

image

We then update the jth coordinate of  µto be  µj = max {0, gj(w)}for every  j ∈ S. The norms of the resulting stochastic gradients will be small if  γµis a good estimate of the gradient, i.e.  µj ≈ max(0, gj(w)).

The difference between  µjand  max(0, gj(w))can be bounded in terms of how many consecutive iterations may have elapsed since  µjwas last updated. It turns out (see Lemma 4 in Appendix C.2) that this quantity can be bounded uniformly by O((m/k) ln(mT )) with high probability, which implies that if the  gis are  Lg-Lipschitz, then  |gj(w) − µj| ≤ Lgη(Gf + γGg)O((m/k) ln(mT )), since at most O((m/k) ln(mT )) updates of magnitude η(Gf + γGg)may have occurred since  µjwas last updated. Choosing  η ∝ 1/√T, as is standard, moves this portion (the “variance portion”) of the ˆ∆p-dependence out of the dominant  O(1/√T)term and into a subordinate term in our final bound.

The remainder of the ˆ∆p-dependence (the “mean portion”) depends on the norm of  E[ ˆ∆p] = γ �j ej max(0, gj(w)). It is here that our use of multiplicative p-updates becomes significant, because with such updates the relevant norm is the  ℓ∞norm, instead of e.g. the  ℓ2norm (as would be the case if we updated p using SGD), thus we can bound ���E[ ˆ∆p]���∞with no explicit m-dependence.

The following theorem on the convergence rate of LightTouch is proved by applying a mirror descent bound for saddle point problems while bounding the stochastic gradient norms as described above.

Theorem 1. Suppose that the conditions of Lemma 1 apply, with  g(w) = maxi(gi(w)). Define  Dw ≥max{1, ∥w − w′∥2}as a bound on the diameter of W (notice that we also choose  Dwto be at least 1), Gf ≥�� ˇ∆(t)��2and  Gg ≥�� ˇ∇ max(0, gi(w))��2as uniform upper bounds on the (stochastic) gradient magnitudes of f and the  gis, respectively, for all  i ∈ {1, . . . , m}and  w, w′ ∈ W. We also assume that all  gis are  Lg-Lipschitz w.r.t.  ∥·∥2, i.e.  |gi(w) − gi(w′)| ≤ Lg ∥w − w′∥2. Our result will be expressed in terms of a total iteration count  Tǫsatisfying:

image

Define:

image

If  k ≤ m, then we optimize Equation 1 using  Tǫiterations of Algorithm 2 (LightTouch), basing the stochastic gradients w.r.t. p on k constraints at each iteration, and using the step size:

image

If k > m, then LightTouch would check more than m constraints per iteration anyway, so we instead use  Tǫiterations of Algorithm 1 (FullTouch) with the step size:

image

In either case, we perform  Tǫiterations, requiring a total of  Cǫ“constraint checks” (evaluations or differentiations of a single  gi):

image

and with probability  1 − δ:

image

where  w∗ ∈ {w ∈ W : ∀i.gi(w) ≤ 0}is an arbitrary constraint-satisfying reference vector.

Proof. In Appendix C.2.

The most important thing to notice about this theorem is that the dominant terms in the bounds on the number of iterations and number of constraint checks are roughly  γ2 ln mtimes the usual  1/ǫ2convergence rate for SGD on a non-strongly convex objective. The lower-order terms have a worse m-dependence, however, with the result that, as the desired suboptimality  ǫshrinks, the algorithm performs fewer constraint checks per iteration until ultimately (once ǫis on the order of  1/m2) only a constant number are checked during each iteration.

3.3 MidTouch: Strong Convexity

To this point, we have only required that the objective function f be convex. However, roughly the same approach also works when f is taken to be  λ-strongly convex, although we have only succeeded in proving an in-expectation result, and the algorithm, Algorithm 3 (MidTouch), differs from LightTouch not only in that the w updates use a  1/λtstep size, but also in being a two-phase algorithm, the first of which, like FullTouch, checks every constraint at each iteration, and the second of which, like LightTouch with k = 1, checks only two. The following theorem bounds the convergence rate if we perform  T1 ≈ mτ 2iterations in the first phase and  T2 ≈ τ3in the second, where the parameter τdetermines the total number of iterations performed:

image

Theorem 2. Suppose that the conditions of Lemma 1 apply, with  g(w) = maxi(gi(w)). Define  Gf ≥�� ˇ∆(t)��2and  Gg ≥�� ˇ∇ max(0, gi(w))��2as uniform upper bounds on the (stochastic) gradient magnitudes of f and the  gis, respectively, for all  i ∈ {1, . . . , m}. We also assume that f is  λ-strongly convex, and that all  gis are  Lg-Lipschitz w.r.t. ∥·∥2, i.e.  |gi(w) − gi(w′)| ≤ Lg ∥w − w′∥2for all  w, w′ ∈ W. If we run Algorithm 3 (MidTouch) with the p-update step size  η = λ/2γ2L2gfor  Tǫ1iterations in the first phase and  Tǫ2in the second:

image

requiring a total of  Cǫ“constraint checks” (evaluations or differentiations of a single  gi):

image

Table 2: Comparison of the number of iterations, and number of constraint checks, required to achieve  ǫ-suboptimality with high probability when optimizing a non-strongly-convex objective, up to constant and logarithmic factors, dropping the  Lg, Gfand  Ggdependencies, and ignoring the one-time cost of projecting the final result in FullTouch and LightTouch. For LLO-FW, the parameter to the local linear oracle has magnitude  O(√dν). See Section 4, Appendix C, the non-smooth stochastic result of Hazan and Kale [2012, Theorem 4.3], and Garber and Hazan [2013, Theorem 2]. Notice that because this table compares upper bounds to upper bounds, subsequent work may improve these bounds further.

image

where  w∗ = argmin{w∈W:∀i.gi(w)≤0} f(w)is the optimal constraint-satisfying reference vector.

Proof. In Appendix D.

Notice that the above theorem bounds not the suboptimality of  Πg( ¯w), but rather its squared Euclidean distance from w∗, for which reason the denominator of the highest order term depends on  λ2rather than  λ. Like Theorem 1 in the non-strongly convex case, the dominant terms above, both in terms of the total number of iterations and number of constraint checks, match the usual  1/ǫconvergence rate for unconstrained strongly-convex SGD with an additional γ2 ln mfactor, while the lower-order terms have a worse m-dependence. As before, fewer constraint checks will be performed per iteration as  ǫshrinks, reaching a constant number (on average) once  ǫis on the order of  1/m6.

Table 2 compares upper bounds on the convergence rates and per-iteration costs when applied to a convex (but not necessarily strongly convex) problem for LightTouch, FullTouch, projected SGD, the online Frank-Wolfe algorithm of Hazan and Kale [2012], and a Frank-Wolfe-like online algorithm for optimization over a polytope [Garber and Hazan, 2013]. The latter algorithm, which we refer to as LLO-FW, achieves convergence rates comparable to projected SGD, but uses a local linear oracle instead of a projection or full linear optimization. To simplify the presentation, the dependencies on  Lg, Gfand  Gghave been dropped—please refer to Theorems 1 and 2 and the cited references for the complete statements. Table 3 contains the same comparison (without online Frank-Wolfe) for  λ-strongly convex problems.

At each iteration, all of these algorithms must find a stochastic subgradient of f. In addition, each iteration of LightTouch and MidTouch must perform O(m) arithmetic operations (for the m-dimensional vector operations used when updating p)—this issue will be discussed further in Section 5.1. However, projected SGD must project its iterate onto the constraints w.r.t. the Euclidean norm, online Frank-Wolfe must perform a linear optimization subject to the constraints, and LLO-FW must evaluate a local linear oracle, which amounts to essentially local linear optimization.

Table 3: Same as Table 2, except that the results bound the number of iterations or constraint checks required to achieve  E[∥w − w∗∥22] ≤ ǫ, and the objective function is assumed to be  λ-strongly convex. The bound given for FullTouch assumes that the constant  ηused in Algorithm 1 has been replaced with the standard decreasing  1/λtstep size used in strongly-convex SGD. The MidTouch bounds each contain four terms, listed in order of most-to-least dominant (in  ǫ). For LLO-FW, the parameter to the local linear oracle has magnitude  O(√dν). See Section 4, Appendix D, and Garber and Hazan [2013, Theorem 3].

image

LightTouch, MidTouch and FullTouch share the same  γ-dependence, but the m-dependence of the convergence rate of LightTouch and MidTouch is logarithmically worse. The number of constraint evaluations, however, is better: in the non-strongly convex case, ignoring all but the m and  ǫdependencies, FullTouch will check  O(m/ǫ2)constraints, while LightTouch will check only ˜O((ln m)/ǫ2+m/ǫ3/2), a significant improvement when is small. Hence, particularly for problems with many expensive-to-evaluate constraints, one would expect LightTouch to converge much more rapidly. Likewise, for  λ-strongly convex optimization, the dominant (in) terms in the bounds on the number of constraint evaluations go as  m/ǫfor FullTouch, and as  (ln m)/ǫfor MidTouch, although the lower-order terms in the MidTouch bound are significantly more complex than in the non-strongly convex case (see Table 3 for full details).

Comparing with projected SGD, online Frank-Wolfe and LLO-FW is less straightforward, not only because we’re comparing upper bounds to upper bounds (with all of the uncertainty that this entails), but also because we must relate the value of  γto the cost of performing the required projection, constrained linear optimization or local linear oracle evaluation. We note, however, that for non-strongly convex optimization, the  ǫ-dependence of the convergence rate bound is worse for online Frank-Wolfe (1/ǫ3) than for the other algorithms (1/ǫ2), and that unless the constraints have some special structure, performing a projection can be a very expensive operation.

For example, with general linear inequality constraints, each constraint check performed by LightTouch, MidTouch or FullTouch requires O(d) time, whereas each linear program optimized by online Frank-Wolfe could be solved in  O(d2m)time [Nemirovski, 2004, Chapter 10.1], and each projection performed by SGD in  O((dm)3/2)time [Goldfarb and Liu, 1991]. When the constraints are taken to be arbitrary convex functions, instead of linear functions, projections may be even more difficult.

We believe that in many cases  γ2will be roughly on the order of the dimension d, or number of constraints m, whichever is smaller, although it can be worse for difficult constraint sets (see Section 2.2). In practice, we have found that a surprisingly small  γ—we use  γ = 1in our experiments (Section 6)—often suffices to result in convergence to a feasible solution. With this in mind, and in light of the fact that a fast projection, linear optimization, or local linear oracle evaluation may only be possible for particular constraint sets, we believe that our algorithms compare favorably with the alternatives.

image

Algorithms 2 and 3 were designed primarily to be easy to analyze, but in real world-applications we recommend making a few tweaks to improve performance. The first of these is trivial: using a decreasing w-update step size η(t)w = ηw/√twhen optimizing a non-strongly convex objective, and  η(t)w = ηw/tfor a strongly-convex objective. In both cases we continue to use a constant p-update step size  ηp. This change, as well as that described in Section 5.2, is included in Algorithm 4.

5.1 Constraint Aggregation

A natural concern about Algorithms 2 and 3 is that O(m) arithmetic operations are performed per iteration, even when only a few constraints are checked. When each constraint is expensive, this is a minor issue, since this cost will be “drowned out” by that of checking the constraints. However, when the constraints are very cheap, and the O(m) arithmetic cost compares disfavorably with the cost of checking a handful of constraints, it can become a bottleneck.

Our solution to this issue is simple: transform a problem with a large number of cheap constraints into one with a smaller number of more expensive constraints. To this end, we partition the constraint indices 1, . . . , m into  ˜msets {Mi}of size at most  ⌈m/ ˜m⌉, defining  ˜gi(w) = maxj∈Mi gj(w), and then apply LightTouch or MidTouch on the  ˜maggregated constraints  ˜gi(w) ≤ 0. This makes each constraint check  ⌈m/ ˜m⌉times more expensive, but reduces the dimension of p from m to  ˜m, shrinking the per-iteration arithmetic cost to  O( ˜m).

5.2 Automatic Minibatching

Because LightTouch takes a minibatch size k as a parameter, and the constants from which we derive the recommended choice of k (Theorem 1) are often unknown, a user is in the uncomfortable position of having to perform a parameter search not only over the step sizes  ηwand  ηp, but also the minibatch size. Furthermore, the fact that the theoreticallyrecommended k is a decreasing function of T indicates that it might be better to check more constraints in early iterations, and fewer in later ones. Likewise, MidTouch is structured as a two-phase algorithm, in which every iteration checks every constraint in the first phase, and only a constant number in the second, but it seems more sensible for the number of constraint checks to decrease gradually over time.

In addition, for both algorithms, it would be desirable to support separate minibatching of the loss and constraint stochastic subgradients (w.r.t. w), in which case there would be three minibatching parameters to determine:  kf, kgand  kp. This makes things even harder for the user, since now there are three additional parameters that must be specified.

To remove the need to specify any minibatch-size hyperparameters, and to enable the minibatch sizes to change from iteration-to-iteration, we propose a heuristic that will automatically determine the minibatch sizes  k(t)f,  k(t)gand  k(t)pfor each of the stochastic gradient components at each iteration. Intuitively, we want to choose minibatch sizes in such a way that the stochastic gradients are both cheap to compute and have low variance. Our proposed heuristic does this by trading-off the computational cost and “bound impact” of the overall stochastic gradient, where the “bound impact” is a variance-like quantity that approximates the impact that taking a step with particular minibatch sizes has on the relevant convergence rate bound.

Suppose that we’re about to perform the tth iteration, and know that a single stochastic subgradient ˇ∆of f(w) (corresponding to the loss portion of ˇ∆w) has variance (more properly, covariance matrix trace)  ¯v(t)fand requires a computational investment of  ¯c(t)funits. Similarly, if we define ˇ∆gby sampling  i ∼ pand taking ˇ∆g = γ ˇ∇ max{0, gi(w)}(corresponding to the constraint portion of ˇ∆w), then we can define variance and cost estimates of ˇ∆gto be  ¯v(t)gand ¯c(t)g, respectively. Likewise, we take  ¯v(t)pand  ¯c(t)pto be estimates of the variance and cost of a (non-minibatched version of) ˆ∆p.

In all three cases, the variance and cost estimates are those of a single sample, implying that a stochastic subgradient of f(w) averaged over a minibatch of size  k(t)fwill have variance  ¯v(t)f /k(t)fand require a computational investment of ¯c(t)f k(t)f, and likewise for the constraints and distribution. In the context of Algorithm 4, with minibatch sizes of  k(t)f, k(t)gand  k(t)p, we define the overall bound impact b and computational cost c of a single update as:

image

We should emphasize that the above definition of b is merely a useful approximation of how these quantities truly affect our bounds.

Given the three variance and three cost estimates, we choose minibatch sizes in such a way as to minimize both the computational cost and bound impact of an update. Imagine that we are given a fixed computational budget c. Then our goal will be to choose the minibatch sizes in such a way that b is minimized for this budget, a problem that is easily solved in closed form:

image

We propose choosing the proportionality constant (and thereby the cost budget c) in such a way that  k(t)f = 2(enabling us to calculate sample variances, as explained below), and round the two other sizes to the nearest integers, lowerbounding each so that  k(t)g ≥ 2and  k(t)p ≥ 1.

While the variances and costs are not truly known during optimization, they are easy to estimate from known quantities. For the costs  ¯c(t)f,  ¯c(t)gand  ¯c(t)p, we simply time how long each past stochastic gradient calculation has taken, and then average them to estimate the future costs. For the variances  ¯v(t)fand  ¯v(t)g, we restrict ourselves to minibatch sizes k(t)f , k(t)g ≥ 2, calculate the sample variances  v(t)fand  v(t)gof the stochastic gradients at each iteration, and then average over all past iterations (either uniformly, or a weighted average placing more weight on recent iterations).

For  ¯v(t)p, the situation is a bit more complicated, since the p-updates are multiplicative (so we should use an  ℓ∞variance) and centered as in Equation 4. Upper-bounding the  ℓ∞norm with the  ℓ2norm and using the fact that the minibatch  S(t)is independently sampled yields the following crude estimate:

image

We again average  v(t)pacross past iterations to estimate  ¯v(t)p.

We validated the performance of our practical variant of LightTouch (Algorithm 4) on a YouTube ranking problem in the style of Joachims [2002], in which the task is to predict what a user will watch next, given that they have just viewed a certain video. In this setting, a user has just viewed video a, was presented with a list of candidate videos to watch next, and clicked on  b+, with  b−being the video immediately preceding  b+in the list (if  b+was the first list element, then the example is thrown out).

We used an anonymized proprietary dataset consisting of n = 612 587 training pairs of feature vectors  (x+, x−), where  x+is a vector of 12 features summarizing the similarity between a and  b+, and  x−between a and  b−.

We treat this as a standard pairwise ranking problem, for which the goal is to estimate a function  f(Φ(x)) = ⟨w, Φ(x)⟩such that  f(Φ(x+)) > f(Φ(x−))for as many examples as possible, subject to the appropriate regularization (or, in this case, constraints). Specifically, the (unconstrained) learning task is to minimize the average empirical hinge loss:

image

All twelve of the features were designed to provide positive evidence—in other words, if any one increases (holding the others fixed), then we expect  f(Φ(x))to increase. We have found that using constraints to enforce this monotonicity property results in a better model in practice.

We define  Φ(·)as in lattice regression using simplex interpolation [Garcia et al., 2012, Gupta et al., 2016], an approach which works well at combining a small number of informative features, and more importantly (for our purposes) enables one to force the learned function to be monotonic via linear inequality constraints on the parameters. For the resulting problem, the feature vectors have dimension  d = 212 = 4096, we chose W to be defined by the box constraints  −10 ≤ wi ≤ 10in each of the 4096 dimensions, and the total number of monotonicity-enforcing linear inequality constraints is m = 24 576.

Every  Φ(x)contains only d + 1 = 13 nonzeros and can be computed in O(d ln d) time. Hence, stochastic gradients of f are inexpensive to compute. Likewise, checking a monotonicity constraint only requires a single comparison between two parameter values, so although there are a large number of them, each constraint is very inexpensive to check.

6.1 Implementations

We implemented all algorithms in C++. Before running our main experiments, we performed crude parameter searches on a power-of-four grid (i.e.  . . . , 1/16, 1/4, 1, 4, 16, . . .). For each candidate value we performed roughly 10 000 iterations, and chose the parameter that appeared to result in the fastest convergence in terms of the objective function.

LightTouch Our implementation of LightTouch includes all of the suggested changes of Section 5, including the constraint aggregation approach of Section 5.1, although we used no aggregation until our timing comparison (Section 6.3). For automatic minibatching, we took weighted averages of the variance estimates as  ¯v(t+1) ∝ v(t) + ν¯v(t). We found that up-weighting recent estimates (taking  ν < 1) resulted in a noticeable improvement, but that the precise value of  νmattered little (we used  ν = 0.999). Based on the grid search described above, we chose  γ = 1, ηw = 16and  ηp = 1/16.

FullTouch Our FullTouch implementation differs from that in Algorithm 1 only in that we used a decreasing step size η(t)w = ηw/√t. As with LightTouch, we chose  γ = 1and  ηw = 16based on a grid search.

ProjectedSGD We implemented Euclidean projections onto lattice monotonicity constraints using IPOPT [W¨achter and Biegler, 2006] to optimize the resulting sparse 4096-dimensional quadratic program. However, the use of a QP solver for projected SGD—a very heavyweight solution—resulted in an implementation that was too slow to experiment with, requiring nearly four minutes per projection (observe that our experiments each ran for millions of iterations).

ApproxSGD This is an approximate projected SGD implementation using the fast approximate update procedure described in Gupta et al. [2016], which is an active set method that, starting from the current iterate, moves along the boundary of the feasible region, adding constraints to the active set as they are encountered, until the desired step is exhausted (this is reminiscent of the local linear oracles considered by Garber and Hazan [2013]). This approach is particularly well-suited to this particular constraint set because (1) when checking constraints for possible inclusion in the active set, it exploits the sparsity of the stochastic gradients to only consider monotonicity constraints which could possibly be violated, and (2) projecting onto an intersection of active monotonicity constraints reduces to uniformly averaging every set of parameters that are “linked together” by active constraints. Like the other algorithms, we used step sizes of  η(t)w = ηw/√tand chose  ηw = 64based on the grid search (recall that  ηw = 16was better for the other two algorithms).

In every experiment we repeatedly looped over a random permutation of the training set, and generated plots by averaging over 5 such runs (with the same 5 random permutations) for each algorithm.

6.2 Constraint-check Comparison

In our first set of experiments, we compared the performance of LightTouch, FullTouch and ApproxSGD in terms of the number of stochastic subgradients of f drawn, and the number of constraints checked. Because LightTouch’s automatic minibatching fixes  k(t)f = 2(with the other two minibatch sizes being automatically determined), in these experiments we used minibatch sizes of 2 for FullTouch and ApproxSGD, guaranteeing that all three algorithms observe the same number of stochastic subgradients of f at each iteration.

The left-hand plot of Figure 1 shows that all three algorithms converge at roughly comparable per-iteration rates, with ApproxSGD having a slight advantage over FullTouch, which itself converges a bit more rapidly than LightTouch. The right-hand plot shows a striking difference, however—LightTouch reaches a near-optimal solution having checked more than  10×fewer constraints than FullTouch. Notice that we plot the suboptimalities of the projected iterates Πw(w(t))rather than of the  w(t)s themselves, in order to emulate the final projection (line 7 of Algorithm 1 and 17 of Algorithm 4), and guarantee that we only compare the average losses of feasible intermediate solutions.

In Figure 2, we explore how well our algorithms enforce feasibility, and how effective automatic minibatching is at choosing minibatch sizes. The left-hand plot shows that both FullTouch has converged to a nearly-feasible solution

image

Figure 1: Comparison of convergence rates of LightTouch, FullTouch and ApproxSGD on the YouTube ranking problem of Section 6. The two plots show the objective function (average training hinge loss)  f(Πg(w(t)))as a function of the number of iterations, and as a function of the total number of times a single constraint function  giwas evaluated or differentiated, respectively.

after roughly 10 000 iterations, and LightTouch (unsurprisingly) takes more, perhaps 100 000 or so. In the right-hand plot, we see that, in line with our expectations (see Section 5.2), LightTouch’s automatic minibatching results in very few constraints being checked in late iterations.

6.3 Timing Comparison

image

Figure 3: Plot of the objective function (average training hinge loss)  f(Πg(w(t)))as a function of runtime for our implementations of LightTouch, FullTouch and ApproxSGD, on the YouTube ranking problem of Section 6.

Our final experiment compared the wall-clock runtimes of our implementations. Note that, because each monotonicity constraint can be checked with only a single comparison (compare with e.g. O(d) arithmetic operations for a dense linear inequality constraint), the O(m) arithmetic cost of maintaining and updating the probability distribution p over the constraints is significant. Hence, in terms of the constraint costs, this is nearly a worse-case problem for LightTouch. We experimented with power-of-4 constraint aggregate sizes (Section 5.1), and found that using  ˜m = 96aggregated constraints, each of size 256, worked best.

FullTouch, without minibatching, draws a single stochastic subgradient of f and checks every constraint at each iteration. However, it would seem to be more efficient to use minibatching to look at more stochastic subgradients at each iteration, and therefore fewer constraints per stochastic subgradient of f. Hence, for FullTouch, we again searched over power-of-4 minibatch sizes, and found that 16 worked best.

For ApproxSGD, the situation is less clear-cut. On the one hand, increasing the minibatch size results in fewer approximate projections being performed per stochastic subgradient of f. On the other, averaging more stochastic

image

Figure 2: Comparison constraint-handling of LightTouch and FullTouch on the YouTube ranking problem of Section 6. The two plots show the constraint violation magnitude  max{0, g(w(t))}, and the average number of constraints checked per iteration up to this point, respectively, both as functions of the number of iterations.

subgradients results in less sparsity, slowing down the approximate projection. We found that the latter consideration wins out—after searching again over power-of-4 minibatch sizes, we found that a minibatch size of 1 (i.e. no minibatching) worked best.

Figure 3 contains the results of these experiments, showing that both FullTouch and LightTouch converge significantly faster than ApproxSGD. Interestingly, ApproxSGD is rather slow in early iterations (clipped off in plot), but accelerates in later iterations. We speculate that the reason for this behavior is that, close to convergence, the steps taken at each iteration are smaller, and therefore the active sets constructed during the approximate projection routine do not grow as large. FullTouch enjoys a small advantage over LightTouch until both algorithms are very close to convergence, but based on the results of Section 6.2, we believe that this advantage would reverse if there were more constraints, or if the constraints were more expensive to check.

We have proposed an efficient strategy for large-scale heavily constrained optimization, building on the work of Mahdavi et al. [2012], and analyze its performance, demonstrating that, asymptotically, our approach requires many fewer constraint checks in order to converge.

We build on these theoretical results to propose a practical variant. The most significant of these improvements is based on the observation that our algorithm takes steps based on three separate stochastic gradients, and that trading off the variances of computational costs of these three components is beneficial. To this end, we propose a heuristic for dynamically choosing minibatch sizes in such a way as to encourage faster convergence at a lower computational cost.

Experiments on a real-world 4096-dimensionalmachine learning problem with 24 576 constraints and 612 587 training examples—too large for a QP-based implementation of projected SGD—showed that our proposed method is effective. In particular, we find that, in practice, our technique checks fewer constraints per iteration than competing algorithms, and, as expected, checks ever fewer as optimization progresses.

We thank Kevin Canini, Mahdi Milani Fard, Andrew Frigyik, Michael Friedlander and Seungil You for helpful discussions, and proofreading earlier drafts.

N. P. Archer and S. Wang. Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Decision Sciences, 24:60–75, 1993.

H. H. Bauschke. Projection Algorithms and Monotone Operators. Ph.D. Thesis, Simon Fraser University, Canada, 1996.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, May 2003.

K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS’10, pages 449–457, Washington, DC, USA, 2010. IEEE Computer Society.

H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Trans. Neural Networks, 21 (6):906–917, 2010.

K. Dzhaparidze and J. H. van Zanten. On Bernstein-type inequalities for martingales. Stochastic Processes and their Applications, 93(1):109–117, May 2001.

D. Garber and E. Hazan. Playing non-linear games with linear oracles. In FOCS, pages 420–428. IEEE Computer Society, 2013.

E. K. Garcia, R. Arora, and M. Gupta. Optimized regression for efficient function evaluation. IEEE Trans. Image Processing, 21(9):4128–4140, Sept. 2012.

D. Goldfarb and S. Liu. An O(Ln3) primal interior point algorithm for convex quadratic programming. Math. Program., 49(3):325–340, Jan. 1991.

M. R. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. van Esbroeck. Monotonic calibrated interpolated look-up tables. JMLR, 17(109):1–47, 2016.

E. Hazan and S. Kale. Projection-free online learning. In ICML’12, 2012.

M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML’13, volume 28, pages 427– 435, 2013.

T. Joachims. Optimizing search engines using clickthrough data. In KDD’02, pages 133–142, New York, NY, USA, 2002.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS’13, pages 315–323. 2013.

M. Mahdavi, T. Yang, R. Jin, S. Zhu, and J. Yi. Stochastic gradient descent with only one projection. In NIPS’12, pages 494–502. 2012.

A. Nemirovski. Lecture notes: Interior point polynomial time methods in convex programming. 2004. URL http://www2.isye.gatech.edu/˜nemirovs/Lect_IPM.pdf.

A. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization. John Wiley & Sons Ltd, 1983.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic pro- gramming. SIAM Journal on Optimization, 19(4):1574–1609, January 2009.

Y. Qu and B. Hu. Generalized constraint neural network regression model subject to linear priors. IEEE Trans. on Neural Networks, 22(11):2447–2459, 2011.

A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In NIPS’13, pages 3066–3074. 2013.

S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Mathematical Programming, 127(1):3–30, March 2011.

J. Sill. Monotonic networks. Advances in Neural Information Processing Systems (NIPS), 1998.

J. Spouge, H. Wan, and W. J. Wilbur. Least squares isotonic regression in two dimensions. Journal of Optimization Theory and Applications, 117(3):585–605, 2003.

N. Srebro, K. Sridharan, and A. Tewari. On the universality of online mirror descent. In NIPS’11, 2011.

A. W¨achter and L. T. Biegler. On the implementation of a primal-dual interior point filter line search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.

M. Wang, Y. Chen, J. Liu, and Y. Gu. Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints. ArXiv e-prints, Nov. 2015.

Wikipedia. Coupon collector’s problem Wikipedia, the free encyclopedia, 2014. URL http://en.wikipedia.org/wiki/Coupon_collector%27s_problem. [Online; accessed 20-November-2014].

Wikipedia. Properties of polynomial roots Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Properties_of_polynomial_roots. [Online; accessed 27-March-2015].

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML’03, 2003.

Table 4: New notation in Appendix A.

image

Mirror descent [Nemirovski and Yudin, 1983, Beck and Teboulle, 2003] is a meta-algorithm for stochastic optimization (more generally, online regret minimization) which performs gradient updates with respect to a meta-parameter, the distance generating function (d.g.f.). The two most widely-used d.g.f.s are the squared Euclidean norm and negative Shannon entropy, for which the resulting MD instantiations are stochastic gradient descent (SGD) and a multiplicative updating algorithm, respectively. These are precisely the two d.g.f.s which our constrained algorithm will use for the updates of w and p. We’ll here give a number of results which differ only slightly from “standard” ones, beginning with a statement of an online MD bound adapted from Srebro et al. [2011]:

Theorem 3. Let  ∥·∥and  ∥·∥∗be a norm and its dual. Suppose that the distance generating function (d.g.f.)  Ψis 1-strongly convex w.r.t.  ∥·∥. Let  Ψ∗be the convex conjugate of  Ψ, and take  BΨ(w|w′) = Ψ(w) − Ψ(w′) −⟨∇Ψ(w′), w − w′⟩to be the associated Bregman divergence.

Take  ft : W → Rto be a sequence of convex functions on which we perform T iterations of mirror descent starting from  w(1) ∈ W:

image

where ˇ∇ft(w(t)) ∈ ∂ft(w(t))is a subgradient of  ftat  w(t). Then:

image

where  w∗ ∈ Wis an arbitrary reference vector.

Proof. This proof is essentially the same as that of Srebro et al. [2011, Lemma 2]. By convexity:

image

By H¨older’s inequality,  ⟨w′, w⟩ ≤ ∥w′∥ ∥w∥∗. Also,  Ψ(w) = supv(⟨v, w⟩−Ψ∗(v))is maximized when  ∇Ψ∗(v) = w, so  ∇Ψ(∇Ψ∗(v)) = v. These results combined with the definition of  ˜w(t+1)give:

image

Using Young’s inequality and the definition of the Bregman divergence:

image

Applying the 1-strong convexity of  Ψto cancel the��w(t) − ˜w(t+1)��2 /2and  BΨ( ˜w(t+1) | w(t))terms:

image

Summing over t, using the nonnegativity of  BΨ, and dividing through by  ηTgives the claimed result.

It is straightforward to transform Theorem 3 into an in-expectation result for stochastic subgradients:

Corollary 1. Take  ft : W → Rto be a sequence of convex functions, and F a filtration. Suppose that we perform T iterations of stochastic mirror descent starting from  w(1) ∈ W, using the definitions of Theorem 3:

image

where ˇ∆(t)is a stochastic subgradient of  ft, i.e.  E[ ˇ∆(t) | Ft−1] ∈ ∂ft(w(t)), and ˇ∆(t)is  Ft-measurable. Then:

image

where  w∗ ∈ Wis an arbitrary reference vector.

Proof. Define ˜ft (w) =� ˇ∆(t), w�, and observe that applying the non-stochastic MD algorithm of Theorem 3 to the sequence of functions ˜ftresults in the same sequence of iterates  w(t)as does applying the above stochastic MD update to the sequence of functions  ft. Hence:

image

By convexity,  ft(w(t)) − ft(w∗) ≤�E[ ˇ∆(t) | Ft−1], w(t) − w∗�, while ˜ft(w(t)) − ˜ft(w∗) =� ˇ∆(t), w(t) − w∗�by definition. Taking expectations of both sides of Equation 5 and plugging in these inequalities yields the claimed result.

We next prove a high-probabilityanalogue of the Corollary 1, based on a martingale bound of Dzhaparidze and van Zanten [2001]:

Corollary 2. In addition to the assumptions of Corollary 1, suppose that, with probability  1 − δσ, σsatisfies the following uniformly for all  t ∈ {1, . . ., T }:

image

Then, with probability  1 − δσ − δ, the above  σbound will hold, and:

image

where  w∗ ∈ Wis an arbitrary reference vector and  R∗ ≥ supw∈W ∥w − w∗∥bounds the radius of W centered on w∗.

Proof. Define ˜ft (w) =� ˇ∆(t), w�as in the proof of Corollary 1, and observe that Equation 5 continues to apply. Define a sequence of random variables  M0 = 0, Mt = Mt−1 +�E[ ˇ∆(t) | Ft−1] − ˇ∆(t), w(t) − w∗�, and notice that M forms a martingale w.r.t. the filtration F. From this definition, H¨older’s inequality gives that:

image

the above holding with probability  1 − δσ. Plugging  a = R∗σand  L = T R2∗σ2into the Bernstein-type martingale inequality of Dzhaparidze and van Zanten [2001, Theorem 3.3] gives:

image

Solving for  ǫusing the quadratic formula and upper-bounding gives that, with probability  1 − δσ − δ:

image

As in the proof of Corollary 1,  ft(w(t)) − ft(w∗) ≤ �E[ ˇ∆(t) | Ft−1], w(t) − w∗�, while ˜ft(w(t)) − ˜ft(w∗) =� ˇ∆(t), w(t) − w∗�by definition, which combined with Equation 5 yields the claimed result.

Algorithm 2 (LightTouch) jointly optimizes over two sets of parameters, for which the objective is convex in the first and linear (hence concave) in the second. The convergence rate will be determined from a saddle-point bound, which we derive from Corollary 2 by following Nemirovski et al. [2009], Rakhlin and Sridharan [2013], and simply applying it twice:

Corollary 3. Let  ∥·∥wand  ∥·∥αbe norms with duals  ∥·∥w∗and  ∥·∥α∗. Suppose that  Ψwand  Ψαare 1-strongly convex w.r.t.  ∥·∥wand  ∥·∥α, have convex conjugates  Ψ∗wand  Ψ∗α, and associated Bregman divergences  BΨwand BΨα, respectively.

Take  f : W × A → Rto be convex in its first parameter and concave in its second, let F be a filtration, and suppose that we perform T iterations of MD:

image

where ˇ∆(t)wis a stochastic subgradient of  f(w(t), α(t))w.r.t. its first parameter, and  ˆ∆(t)αa stochastic supergradient w.r.t. its second, with both ˇ∆(t)wand  ˆ∆(t)αbeing  Ft-measurable. We assume that, with probabilities  1−δσwand  1−δσα(respectively),  σ2wand  σ2αsatisfy the following uniformly for all  t ∈ {1, . . ., T }:

image

Under these conditions, with probability  1 − δσw − δσα − 2δ, the above  σwand  σαbounds will hold, and:

image

where  w∗ ∈ Wand  α∗ ∈ Aare arbitrary reference vectors, and  Rw∗ ≥ ∥w − w∗∥wand  Rα∗ ≥ ∥α − α∗∥αbound the radii of W and A centered on  w∗and  α∗, respectively.

Proof. This is a convex-concave saddle-point problem, which we will optimize by playing two convex optimization algorithms against each other, as in Nemirovski et al. [2009], Rakhlin and Sridharan [2013]. By Corollary 2, with probability  1 − δσw − δand  1 − δσα − δ, respectively:

image

Adding these two inequalities gives the claimed result.

For  λ-strongly convex objective functions, we can achieve a faster convergence rate for SGD by using the step sizes ηt = 1/λt. Our eventual algorithm (Algorithm 3) for strongly-convex heavily-constrained optimization will proceed in two phases, with the second phase “picking up” where the first phase “left off”, for which reason we present a convergence rate, based on Shalev-Shwartz et al. [2011, Lemma 2], that effectively starts at iteration  T0by using the step sizes  ηt = 1/λ(T0 + t):

Theorem 4. Take  ft : W → Rto be a sequence of  λ-strongly convex functions on which we perform T iterations of stochastic gradient descent starting from  w(1) ∈ W:

image

where ˇ∇ft�w(t)� ∈ ∂ft�w(t)�is a subgradient of  ftat  w(t), and�� ˇ∇ft�w(t)���2 ≤ Gfor all t. If we choose ηt = 1λ(T0+t)for some  T0 ∈ N, then:

image

where  w∗ ∈ Wis an arbitrary reference vector and  G ≥�� ˇ∇ft�w(t)���2bounds the subgradient norms for all t.

Proof. This is nothing but a small tweak to Shalev-Shwartz et al. [2011, Lemma 2]. Starting from Equations 10 and 11 of that proof:

image

Taking  ηt = 1λ(T0+t):

image

Dividing through by T , simplifying and bounding yields the claimed result.

As we did Appendix A, we convert this into a result for stochastic subgradients:

Corollary 4. Take  ft : W → Rto be a sequence of  λ-strongly convex functions, and F a filtration. Suppose that we perform T iterations of stochastic gradient descent starting from  w(1) ∈ W:

image

where ˇ∆(t)is a stochastic subgradient of  ft, i.e.  E[ ˇ∆(t) | Ft−1] ∈ ∂ft(w(t)), and ˇ∆(t)is  Ft-measurable. If we choose ηt = 1λ(T0+t)for some  T0 ∈ N, then:

image

where  w∗ ∈ Wis an arbitrary reference vector and  G ≥�� ˇ∆(t)��2bounds the stochastic subgradient norms for all t.

Proof. Same proof technique as Corollary 1, but based on Theorem 4 rather than Theorem 3.

We now use this result to prove an in-expectation saddle point bound:

Corollary 5. Let  ∥·∥αand  ∥·∥α∗be a norm and its dual. Suppose that  Ψαis 1-strongly convex w.r.t.  ∥·∥α, and has convex conjugate  Ψ∗αand associated Bregman divergence  BΨα.

Take  f : W × A → Rto be  λ-strongly convex in its first parameter and concave in its second, let F be a filtration, and suppose that we perform T iterations of SGD on w and MD on  α:

image

where ˇ∆(t)wis a stochastic subgradient of  f(w(t), α(t))w.r.t. its first parameter, and  ˆ∆(t)αa stochastic supergradient w.r.t. its second, with both ˇ∆(t)wand  ˆ∆(t)αbeing  Ft-measurable. Then:

image

where  w∗ ∈ Wand  α∗ ∈ Aare arbitrary reference vectors, and  Gw ≥��� ˇ∆(t)w���2bounds the stochastic subgradient norms w.r.t. w for all t.

Proof. As we did in the proof of Corollary 3, we will play two convex optimization algorithms against each other. By Corollaries 4 and 1:

image

Adding these two inequalities gives the claimed result.

We begin by proving that, if  γis sufficiently large, then optimizing the relaxed objective, and projecting the resulting solution, will bring us close to the optimum of the constrained objective.

Lemma 1. In the setting of Section 2, suppose that f is  Lf-Lipschitz, i.e.  |f(w) − f(w′)| ≤ Lf ∥w − w′∥2for all w, w′ ∈ W, and that there is a constant  ρ > 0such that if g(w) = 0 then�� ˇ∇��2 ≥ ρfor all  ˇ∇ ∈ ∂g(w), where  ∂g(w)is the subdifferential of g(w).

image

If  γ > Lf/ρ, then for any infeasible w (i.e. for which g(w) > 0):

image

where  Πg (w)is the projection of w onto the set  {w ∈ W : g(w) ≤ 0}w.r.t. the Euclidean norm.

Proof. Let  w ∈ Wbe an arbitrary infeasible point. Because f is  Lf-Lipschitz:

image

Since  Πg(w)is the projection of w onto the constraints w.r.t. the Euclidean norm, we must have by the first order optimality conditions that there exists a  ν ≥ 0such that:

image

This implies that  w − Πg(w)is a scalar multiple of some ˇ∇ ∈ ∂g(Πg(w)). Because g is convex and  Πg (w)is on the boundary,  g(w) ≥ g(Πg(w)) +� ˇ∇, w − Πg(w)�=� ˇ∇, w − Πg(w)�, so:

image

Combining the definition of h with Equations 6 and 7 yields:

image

Both claims follow immediately if  γρ > Lf.

C.1 Analysis of FullTouch

We’ll now use Lemma 1 and Corollary 2 to bound the convergence rate of SGD on the function h of Lemma 1 (this is FullTouch). Like the algorithm itself, the convergence rate is little different from that found by Mahdavi et al. [2012] (aside from the bound on  ∥ ¯w − Πg( ¯w)∥2), and is included here only for completeness.

Lemma 3. Suppose that the conditions of Lemma 1 apply, with  g(w) = maxi(gi(w)). Define  Dw ≥supw,w′∈W ∥w − w′∥2as the diameter of  W, Gf ≥ �� ˇ∆(t)��2and  Gg ≥ �� ˇ∇ max(0, gi(w))��2as uniform upper bounds on the (stochastic) gradient magnitudes of f and the  gis, respectively.

If we optimize Equation 1 using Algorithm 1 (FullTouch) with the step size:

image

then with probability  1 − δ:

image

where  w∗ ∈ {w ∈ W : ∀i.gi(w) ≤ 0}is an arbitrary constraint-satisfying reference vector, and:

image

Proof. We choose  Ψ(w) = ∥w∥22 /2, for which the mirror descent update rule is precisely SGD. Because  Ψwis (half of) the squared Euclidean norm, it is trivially 1-strongly convex w.r.t. the Euclidean norm, so  ∥·∥ = ∥·∥∗ = ∥·∥2. Furthermore,  BΨ(w∗ | w(1)) ≤ D2w/2and  R∗ ≤ Dw.

We may upper bound the 2-norm of our stochastic gradients as��� ˇ∆(t)w���2 ≤ Gf + γGg. Only the f-portion of the objective is stochastic, so the error of the ˇ∆(t)ws can be trivially upper bounded, with probability 1, with  σ = 2Gf. Hence, by Corollary 2 (taking  Ftto be e.g. the smallest  σ-algebra making ˇ∆(t), . . . , ˇ∆(t)measurable), with probability 1 − δ:

image

Plugging in the definition of  η, moving the average defining  ¯winside h by Jensen’s inequality, substituting  f(w∗) =h(w∗)because  w∗satisfies the constraints, applying Lemma 1 and simplifying yields the claimed result.

In terms of the number of iterations required to achieve some desired level of suboptimality, this bound on  UFmay be expressed as:

Theorem 5. Suppose that the conditions of Lemmas 1 and 3 apply, and that  ηis as defined in Lemma 3.

If we optimize Equation 1 using  Tǫiterations of Algorithm 1 (FullTouch):

image

then  UF ≤ ǫwith probability  1 − δ. where  w∗ ∈ {w ∈ W : ∀i.gi(w) ≤ 0}is an arbitrary constraint-satisfying reference vector.

Proof. Based on the bound of Lemma 3, define:

image

and consider the polynomial  0 = ax2 + bx + c. Roots of this polynomial are xs for which  UF = ǫ, while for xs larger than any root we’ll have that  UF ≤ ǫ. Hence, we can bound the T required to ensure  ǫ-suboptimality by bounding the roots of this polynomial. By the Fujiwara bound [Wikipedia, 2015]:

image

giving the claimed result.

C.2 Analysis of LightTouch

Because we use the reduced-variance algorithm of Johnson and Zhang [2013], and therefore update the remembered gradient  µone random coordinate at a time, we must first bound the maximum number of iterations over which a coordinate can go un-updated:

Lemma 4. Consider a process which maintains a sequence of vectors  s(t) ∈ Nmfor  t ∈ {1, . . . , T }, where  s(1)is initialized to zero and  s(t+1)is derived from  s(t)by independently sampling  k = |St| ≤ mrandom indices  St ⊆{1, . . . , m} uniformly without replacement, and then setting  s(t+1)j = tfor  j ∈ Stand  s(t+1)j = s(t)jfor  j /∈ St. Then, with probability  1 − δ:

image

Table 5: New notation in Appendix C.2.

image

Proof. This is closely related to the “coupon collector’s problem” [Wikipedia, 2014]. We will begin by partitioning time into contiguous size-n chunks, with 1, . . . , n forming the first chunk, n + 1, . . . , 2n the second, and so on.

Within each chunk the probability that any particular index was never sampled is  ((m − k)/m)n, so by the union bound the probability that any one of the m indices was never sampled is bounded by  m((m − k)/m)n:

image

Define  n = ⌈(m/k) ln(2mT/δ)⌉, so:

image

This shows that for this choice of n, the probability of there existing an index which is never sampled in some particular batch is bounded by  δ/2T. By the union bound, the probability of any of  ⌈T/n⌉batches containing an index which is never sampled is bounded by  (δ/2T )⌈T/n⌉ ≤ (δ/2n) + (δ/2T ) ≤ δ.

If every index is sampled within every batch, then over the first  n⌈T/n⌉ ≥ Tsteps, the most steps which could elapse over which a particular index is not sampled is  2n − 2(if the index is sampled on the first step of one chunk, and the last step of the next chunk), which implies the claimed result.

We now combine this bound with Corollary 2 and make appropriate choices of the two d.g.f.s to yield a bound on the LightTouch convergence rate:

Lemma 5. Suppose that the conditions of Lemma 1 apply, with  g(w) = maxi(gi(w)). Define  Dw ≥supw,w′∈W max{1, ∥w − w′∥2}as a bound on the diameter of W (notice that we also choose  Dwto be at least  1), Gf ≥ �� ˇ∆(t)��2and  Gg ≥ �� ˇ∇ max(0, gi(w))��2as uniform upper bounds on the (stochastic) gradient magnitudes of f and the  gis, respectively, for all  i ∈ {1, . . . , m}. We also assume that all  gis are  Lg-Lipschitz w.r.t. ∥·∥2, i.e.  |gi(w) − gi(w′)| ≤ Lg ∥w − w′∥2for all  w, w′ ∈ W.

Define:

image

If  k ≤ mand we optimize Equation 1 using Algorithm 2 (LightTouch), basing the stochastic gradients w.r.t. p on k constraints at each iteration, and using the step size:

image

then it holds with probability  1 − δthat:

image

where  w∗ ∈ {w ∈ W : ∀i.gi(w) ≤ 0}is an arbitrary constraint-satisfying reference vector, and:

image

If k > m, then we should fall-back to using FullTouch, in which case the result of Lemma 3 will apply.

Proof. We choose  Ψw(w) = ∥w∥22 /2and  Ψp(p) = �mi=1 pi ln pito be the squared Euclidean norm divided by 2 and the negative Shannon entropy, respectively, which yields the updates of Algorithm 2. We assume that the ˇ∆(t)s are random variables on some probability space (depending on the source of the stochastic gradients of f), and likewise the  its and  jts on another, so  Ftmay be taken to be the product of the smallest  σ-algebras which make ˇ∆(1), . . . , ˇ∆(t)and  i1, j1, . . . , it, jtmeasurable, respectively, with conditional expectations being taken w.r.t. the product measure. Under the definitions of Corollary 3 (taking  α = p), with probability  1 − δσw − δσp − 2δ′:

image

As in the proof of Lemma 3,  Ψwis 1-strongly convex w.r.t. the Euclidean norm, so  ∥·∥w = ∥·∥w∗ = ∥·∥2, BΨw(w∗ |w(1)) ≤ D2w/2and  Rw∗ ≤ Dw. Because  Ψpis the negative entropy, which is 1-strongly convex w.r.t. the 1-norm (this is Pinsker’s inequality),  ∥·∥p = ∥·∥1and  ∥·∥p∗ = ∥·∥∞, implying that  Rp∗ = 1. Since  p(1)is initialized to the uniform distribution,  BΨp(p∗ | p(1)) = DKL(p∗ | p(1)) ≤ ln m.

The stochastic gradient definitions of Algorithm 2 give that��� ˇ∆(t)w���w∗ ≤ Gf + γGgand  σw ≤ 2(Gf + γGg)with probability  1 = 1 − δσwby the triangle inequality, and ˜h(w∗, p(t)) = f(w∗)because  w∗satisfies the constraints. All of these facts together give that, with probability  1 − δσp − δ′:

image

We now move the average defining  ¯winside ˜h(which is convex in its first parameter) by Jensen’s inequality, and use the fact that there exists a  p∗such that ˜h(w, p∗) = h(w)to apply Lemma 1:

image

By the triangle inequality and the fact that  (a + b)2 ≤ 2a2 + 2b2:

image

Substituting into Equation 9 and using the fact that  a + b ≤ (√a +√b)2:

image

We will now turn our attention to the problem of bounding  σp. Notice that because we sample i.i.d.  jts uniformly at every iteration, they form an instance of the process of Lemma 4 with  µ(t)j = max(0, gj(w(s(t)j ))), showing that with probability  1 − δσp:

image

By the definition of ˆ∆(t)p(Algorithm 2):

image

where in the second step we used the definition of the  ∞-norm, in the third we used the Lipschitz continuity of the gis (and hence of their positive parts), in the fourth we bounded the distance between two iterates with the number of iterations times a bound on the total step size, and in the fifth we used Equation 11. This shows that we may define:

image

and it will satisfy the conditions of Corollary 3. Notice that, due to the  ηfactor,  σpwill be decreasing in T . Substituting the definitions of  ηand  σpinto Equation 10, choosing  δσp = δ′ = δ/3and using the assumption that  Dw ≥ 1gives

that with probability  1 − δ:

image

Rounding up the constant terms:

image

Substituting the definition of k, simplifying and bounding yields the claimed result.

In terms of the number of iterations required to achieve some desired level of suboptimality, this bound on  ULand the bound of Lemma 3 on  UFmay be combined to yield the following:

Theorem 1. Suppose that the conditions of Lemmas 1 and 5 apply. Our result will be expressed in terms of a total iteration count  Tǫsatisfying:

image

Define k in terms of  Tǫas in Lemma 5. If  k ≤ m, then we optimize Equation 1 using  Tǫiterations of Algorithm 2 (LightTouch) with  ηas in Lemma 5. If k > m, then we use  Tǫiterations of Algorithm 1 (FullTouch) with  ηas in Lemma 3. In either case, we perform  Tǫiterations, requiring a total of  Cǫ“constraint checks” (evaluations or differentiations of a single  gi):

image

and with probability  1 − δ:

image

where  w∗ ∈ {w ∈ W : ∀i.gi(w) ≤ 0}is an arbitrary constraint-satisfying reference vector.

Proof. Regardless of the value of k, it follows from Lemmas 5 and 3 that:

image

As in the proof of Theorem 5, we define:

image

and consider the polynomial  0 = ax2 + bx + c. Any upper bound on all roots  x =√Tof this polynomial will result in a lower-bound the values of T for which  UL, UF ≤ ǫwith probability  1 − δ. By the Fujiwara bound [Wikipedia, 2015]:

image

giving the claimed bound on  Tǫ. For  Cǫ, we observe that we will perform no more than k + 1 constraint checks at each iteration (k + 1 by LightTouch if  k ≤ m, and m + 1 by FullTouch if k > m), and substitute the above bound on Tǫinto the definition of k, yielding:

image

giving the claimed result (notice the √1 + ln Tǫfactor on the RHS, for which reason we have a ˜Obound on  Cǫ, instead of O).

We now move on to the analysis of our LightTouch variant for  λ-strongly convex objectives, Algorithm 3 (MidTouch). While we were able to prove a high-probability bound for LightTouch, we were unable to do so for MidTouch, because the extra terms resulting from the use of a Bernstein-type martingale inequality were too large (since the other terms shrank as a result of the strong convexity assumption). Instead, we give an in-expectation result, and leave the proof of a corresponding high-probability bound to future work.

Our first result is an analogue of Lemmas 3 and 5, and bounds the suboptimality achieved by MidTouch as a function of the iteration counts  T1and  T2of the two phases:

Lemma 6. Suppose that the conditions of Lemma 1 apply, with  g(w) = maxi(gi(w)). Define  Gf ≥ �� ˇ∆(t)��2and  Gg ≥�� ˇ∇ max(0, gi(w))��2as uniform upper bounds on the (stochastic) gradient magnitudes of f and the  gis, respectively, for all  i ∈ {1, . . . , m}. We also assume that f is  λ-strongly convex, and that all  gis are  Lg-Lipschitz w.r.t. ∥·∥2, i.e.  |gi(w) − gi(w′)| ≤ Lg ∥w − w′∥2for all  w, w′ ∈ W.

If we optimize Equation 1 using Algorithm 3 (MidTouch) with the p-update step size  η = λ/2γ2L2g, then:

image

where  w∗ = argmin{w∈W:∀i.gi(w)≤0} f(w)is the optimal constraint-satisfying reference vector.

Proof. As in the proof of Lemma 3, the first phase of Algorithm 3 is nothing but (strongly convex) SGD on the overall objective function h, so by Corollary 4:

image

so by Jensen’s inequality:

image

For the second phase, as in the proof of Lemma 5, we choose  Ψp(p) = �mi=1 pi ln pito be negative Shannon entropy, which yields the second-phase updates of Algorithm 3. By Corollary 5:

image

As before,  ∥·∥p = ∥·∥1, ∥·∥p∗ = ∥·∥∞, and  BΨp(p∗ | p(T1+1)) = DKL(p∗ | p(T1+1)) ≤ ln m. Hence:

image

Since h is  λ-strongly convex and  w∗is optimal,��w(T1+1) − w∗��22 ≤ 2λ(h(w(T1+1)) − h(w∗)). By Equation 12:

image

Since the (uncentered) second moment is equal to the mean plus the variance, and using the fact that ˜h(w∗, p(t)) =f(w∗)since all constraints are satisfied at  w∗:

image

image

the first step using the fact that the  gjs are  Lg-Lipschitz and Jensen’s inequality. For the second step, we choose  p∗such that  w∗, p∗is a minimax optimal pair (recall that  w∗is optimal by assumption), and use the  λ-strong convexity of ˜h. Substituting into Equation 13 and using the fact that ˜h(w∗, p∗) = f(w∗):

image

We now follow the proof of Lemma 5 and bound  σ2p. By the definition of  ˆ∆(t)p(Algorithm 3):

image

The indices j are sampled uniformly, so the maximum time  maxj(t − s(t)j )since we last sampled the same index is an instance of the coupon collector’s problem Wikipedia [2014]. Because the  gjs are  Lg-Lipschitz:

image

the second step because, between iteration  s(t)jand iteration t we will perform  t − s(t)jupdates of magnitude at most Gw/λT1, and the third step because, as an instance of the coupon collector’s problem,  maxj(t − s(t)j )has expectation mHm ≤ m + m ln m (Hmis the mth harmonic number) and variance  m2π2/6. Substituting into Equation 14:

image

By the  λ-strong convexity of ˜h:

image

Using the facts that  ∥Πg( ¯w) − w∗∥ ≤ ∥ ¯w − w∗∥because  w∗is feasible, and that  Gw = Gf + γGg, completes the proof.

We now move on to the main result: a bound on the number of iterations (equivalently, the number of stochastic loss gradients) and constraint checks required to achieve  ǫ-suboptimality:

Theorem 2. Suppose that the conditions of Lemmas 1 and 6 apply, with the p-update step size  ηas defined in Lemma 6. If we run Algorithm 3 (MidTouch) for  Tǫ1iterations in the first phase and  Tǫ2in the second:

image

requiring a total of  Cǫ“constraint checks” (evaluations or differentiations of a single  gi):

image

where  w∗ = argmin{w∈W:∀i.gi(w)≤0} f(w)is the optimal constraint-satisfying reference vector.

image

By Lemma 6, the above definitions imply that:

image

Defining  ǫ = E�∥Πg( ¯w) − w∗∥22�and rearranging:

image

We will now upper-bound all roots of the above equation with a quantity  τǫ, for which all  τ ≥ τǫwill result in ǫ-suboptimality. By the Fujiwara bound [Wikipedia, 2015], and including the constraint that  τ ≥ 1:

image

Substituting the above bound on  τǫinto the definitions of  T1and  T2gives the claimed magnitudes of these  Tǫ1and Tǫ2, and using the fact that the  Cǫ = O(mTǫ1 + Tǫ2)gives the claimed bound on  Cǫ.


designed for accessibility and to further open science