A Light Touch for Heavily Constrained SGD

2015·Arxiv

Abstract

Abstract

Minimizing empirical risk subject to a set of constraints can be a useful strategy for learning restricted classes of functions, such as monotonic functions, submodular functions, classifiers that guarantee a certain class label for some subset of examples, etc. However, these restrictions may result in a very large number of constraints. Projected stochastic gradient descent (SGD) is often the default choice for large-scale optimization in machine learning, but requires a projection after each update. For heavily-constrained objectives, we propose an efficient extension of SGD that stays close to the feasible region while only applying constraints probabilistically at each iteration. Theoretical analysis shows a compelling trade-off between per-iteration work and the number of iterations needed on problems with a large number of constraints.

1 Introduction

Many machine learning problems can benefit from the addition of constraints. For example, one can learn monotonic functions by adding appropriate constraints to ensure or encourage positive derivatives everywhere [e.g. Archer and Wang, 1993, Sill, 1998, Spouge et al., 2003, Daniels and Velikova, 2010, Gupta et al., 2016]. Submodular functions can often be learned from noisy examples by imposing constraints to ensure submodularity holds. Another example occurs when one wishes to guarantee that a classifier will correctly label certain “canonical” examples, which can be enforced by constraining the function values on those examples. See Qu and Hu [2011] for some other examples of constraints useful in machine learning.

However, these practical uses of constraints in machine learning are impractical in that the number of constraints may be very large, and scale poorly with the number of features d or number of training samples n. In this paper we propose a new strategy for tackling such heavily-constrained problems, with guarantees and compelling convergence rates for large-scale convex problems.

A standard approach for large-scale empirical risk minimization is projected stochastic gradient descent [e.g. Zinkevich, 2003, Nemirovski et al., 2009]. Each SGD iteration is computationally cheap, and the algorithm converges quickly to a solution good enough for machine learning needs. However, this algorithm requires a projection onto the feasible region after each stochastic gradient step, which can be prohibitively slow if there are many non-trivial constraints, and is not easy to parallelize. Recently, Frank-Wolfe-style algorithms [e.g. Hazan and Kale, 2012, Jaggi, 2013] have been proposed that remove the projection, but require a constrained linear optimization at each iteration.

We propose a new strategy for large-scale constrained optimization that, like Mahdavi et al. [2012], moves the constraints into the objective and finds an approximate solution of the resulting unconstrained problem, projecting the (potentially-infeasible) result onto the constraints only once, at the end. Their work focused on handling only one constraint, but as they noted, multiple constraints can be reduced to one constraint by replacing the m constraints with their maximum: . However, this still requires that all m constraints be checked at every iteration. In this paper, we focus on the computational complexity as a function of the number of constraints m, and show that it is possible to achieve good convergence rates without checking constraints so often.

The key challenge to handling a large number of constraints is determining which constraints are active at the optimum of the constrained problem, which is likely to be only a small fraction of the total constraint set. For example, for linear inequality constraints on a d-dimensional problem, no more than d of the constraints will be active at the optimum, and furthermore, once the active constraints are known, the problem reduces to solving the unconstrained problem that results from projecting onto them, which is typically vastly easier.

To identify and focus on the important constraints, we propose learning a probability distribution over the m constraints that concentrates on the most-violated, and sampling constraints from this evolving distribution at each iteration. We call this approach LightTouch because at each iteration only a few constraints are checked, and the solution is only nudged toward the feasible set. LightTouch is suitable for convex problems, but we also propose a variant, MidTouch, that enjoys a superior convergence rate on strongly convex problems. These two algorithms are introduced and analyzed in Section 3.

Our proposed strategy removes the per-iteration m-dependence on the number of constraint evaluations. LightTouch and MidTouch do need more iterations to converge, but each iteration is faster, resulting in a net performance improvement. To be precise, we show that the total number of constraint checks required to achieve -suboptimality when optimizing a non-strongly convex objective decreases from to —notice that the m-dependence of the dominant (in ) term has decreased from m to ln m. For a -strongly convex objective, the dominant (again in) term in our bound on the number of constraint checks decreases from to , but like the non-strongly convex result this bound contains lower-order terms with worse m-dependencies. A more careful comparison of the performance of our algorithms can be found in Section 4.

While they check fewer than m constraints per iteration, these algorithms do need to pay a O(m) per-iteration arithmetic cost. When each constraint is expensive to check, this cost can be neglected. However, when the constraints are simple to check (e.g. box constraints, or the lattice monotonicity constraints considered in our experiments), it can be partially addressed by transforming the problem into an equivalent one with fewer more costly constraints. This, as well as other practical considerations, are discussed in Section 5.

Experiments on a large-scale real-world heavily-constrained ranking problem show that our proposed approach works well in practice. This problem was too large for a projected SGD implementation using an off-the-shelf quadratic programming solver to perform projections, but was amenable to an approach based on a fast approximate projection routine tailored to this particular constraint set. Measured in terms of runtime, however, LightTouch was still significantly faster. Each constraint in this problem is trivial, requiring only a single comparison operation to check, so the aforementioned O(m) arithmetic cost of LightTouch is a significant issue. Despite this, LightTouch was roughly as fast as the Mahdavi et al. [2012]-like algorithm FullTouch. In light of other experiments showing that LightTouch checks dramatically fewer constraints in total than FullTouch, we believe that LightTouch is well-suited to machine learning problems with many nontrivial constraints.

2 Heavily Constrained SGD

Consider the constrained optimization problem:

where is bounded, closed and convex, and and all are convex (our notation is summarized in Table 1). We assume that W is a simple object, e.g. an ball, onto which it is inexpensive to project, and that the “trickier” aspects of the domain are specified via the constraints . Notice that we

Table 1: Key notation.

consider constraints written in terms of arbitrary convex functions, and are not restricted to e.g. only linear or quadratic constraints.

2.1 FullTouch: A Relaxation with a Feasible Minimizer

We build on the approach of Mahdavi et al. [2012] to relax Equation 1. Defining and introducing a Lagrange multiplier yields the equivalent optimization problem:

Directly optimizing over w and is problematic because the optimal value for is infinite for any w that violates a constraint. Instead, we follow Mahdavi et al. [2012, Section 4.2] in relaxing the problem by adding an upper bound of on , and using the fact that .

In the following lemma, we show that, with the proper choice of , any minimizer of this relaxed objective is a feasible solution of Equation 1, indicating that using stochastic gradient descent (SGD) to minimize the relaxation (h(w) in the lemma below) will be effective.

Lemma 1. Suppose that f is -Lipschitz, i.e. for all , and that there is a constant such that if g(w) = 0 thenfor all , where is the subdifferential of g(w).

If , then for any infeasible w (i.e. for which g(w) > 0):

where is the projection of w onto the set w.r.t. the Euclidean norm.

Proof. In Appendix C.

The strategy of applying SGD to h(w), detailed in Algorithm 1, which we call FullTouch, has the same “flavor” as the algorithms proposed by Mahdavi et al. [2012], and we use it as a baseline comparison point for our other algorithms.

Application of a standard SGD bound to FullTouch shows that it converges at a rate with no explicit dependence on the number of constraints m, measured in terms of the number of iterations required to achieve some desired suboptimality (see Appendix C.1), although the parameter can introduce an implicit d or m-dependence, depending on the constraints (discussed in Section 2.2). The main drawback of FullTouch is that each iteration is expensive, requiring the evaluation of all m constraints, since differentiation of g requires first identifying the most-violated. This is the key issue we tackle with the LightTouch algorithm proposed in Section 3.

2.2 Constraint-Dependence of γ

The conditions on Lemma 1 were stated in terms of g, instead of the individual s, because it is difficult to provide suitable conditions on the “component” constraints without accounting for their interactions.

For a point w where two or more constraints intersect, the subdifferential of g(w) consists of all convex combinations of subgradients of the intersecting constraints, with the consequence that even if each of the subgradients of the s has norm at least , subgradients of g(w) will generally have norms smaller than . Exactly how much smaller depends on the particular constraints under consideration. We illustrate this phenomenon with the following examples, but note that, in practice, should be chosen experimentally for any particular problem, so the question of the d and m-dependence of is mostly of theoretical interest.

Box Constraints Consider the m = 2d box constraints and , all of which have gradients of norm 1. At most d constraints can intersect (at a corner of the box), all of which are mutually orthogonal, so the norm of any convex combination of their gradients is lower bounded by that of their average, . Hence, one should choose .

As in the above example, will suffice when the subgradients of intersecting constraints are at least orthogonal, and can be smaller if they always have positive inner products. However, if subgradients of intersecting constraints tend to point in opposing directions, then may need to be much larger, as in our next example:

Ordering Constraints Suppose the constraints order the components of w as , for which , gradients of which again have norm 1. All of these constraints may be active simultaneously, in which case there is widespread cancellation in the average gradient , where is the ith standard unit basis vector. The norm of this average gradient is , so we should choose .

In light of this example, one begins to wonder if a suitable will necessarily exist—fortunately, the convexity of g enables us to prove a trivial bound as long as g(v) is strictly negative for some :

Lemma 2. Suppose that there exists a for which g(v) < 0, and let bound the diameter of W. Then satisfies the conditions of Lemma 1.

Proof. Let be a point for which g(w) = 0, and an arbitrary subgradient. By convexity, . The Cauchy-Schwarz inequality then gives that:

from which the claim follows immediately.

Linear Constraints Consider the constraints , with each row of A having unit norm, , and W being the ball of radius r. It follows from Lemma 2 that suffices. Notice that the earlier box constraint example satisfies these assumptions (with and ).

As the above examples illustrate, subgradients of g will be large at the boundary if subgradients of the s are large, and the constraints intersect at sufficiently shallow angles that, representing boundary subgradients of g as convex combinations of subgradients of the s, the components reinforce each other, or at least do not cancel too much. This requirement is related to the linear regularity assumption introduced by Bauschke [1996], and considered recently by Wang et al. [2015].

3 A Light Touch

This section presents the main contribution of this paper: an algorithm that stochastically samples a small subset of the m constraints at each SGD iteration, updates the parameters based on the subgradients of the sampled constraints, and carefully learns the distribution over the constraints to produce a net performance gain.

We first motivate the approach by considering an oracle, then explain the algorithm and present convergence results for the convex (Section 3.2) and strongly convex (Section 3.3) cases.

3.1 Wanted: An Oracle For the Most Violated Constraint

Because FullTouch only needs to differentiate the most violated constraint at each iteration, it follows that if one had access to an oracle that identified the most-violated constraint, then the overall convergence rate (including the cost of each iteration) could only depend on m through . This motivates us to learn to predict the most-violated constraint, ideally at a significantly better than linear-in-m rate.

To this end, we further relax the problem of minimizing h(w) (defined in Lemma 1) by replacing with maximization over a probability distribution (as in Clarkson et al. [2010]), yielding the equivalent convex-linear

optimization problem:

Here, is the m-dimensional simplex. We propose optimizing over w and p jointly, thereby learning the most-violated constraint, represented by the multinoulli distribution p over constraint indices, at the same time as we optimize over w.

3.2 LightTouch: Stochastic Constraint Handling

To optimize Equation 3, our proposed algorithm (Algorithm 2, LightTouch) iteratively samples stochastic gradients w.r.t. w and w.r.t. p of , and then takes an SGD step on w and a multiplicative step on p:

where the exp and ln of the p-update are performed element-wise, projects onto W w.r.t. the Euclidean norm, and onto via normalization (i.e. dividing its parameter by its sum).

The key to getting a good convergence rate for this algorithm is to choose and such that they are both inexpensive to compute, and tend to have small norms. For , this can be accomplished straightforwardly, by sampling a constraint index i according to p, and taking:

where is a stochastic subgradient of f and is a subgradient of . Calculating each such requires differentiating only one constraint, and it is easy to verify that is a subgradient of w.r.t. w in expectation over and i. Taking to be a bound on the norm of and on the norms of subgradients of the s shows that ’s norm is bounded by .

For , some care must be taken. Simply sampling a constraint index j uniformly and defining:

where is the jth m-dimensional standard unit basis vector, does produce a that in expectation is the gradient of w.r.t. p, but it has a norm bound proportional to m. Such potentially large stochastic gradients would result in the number of iterations required to achieve some target suboptimality being proportional to in our final bound.

A typical approach to reducing the variance (and hence the expected magnitude) of is minibatching: instead of sampling a single constraint index j at every iteration, we could instead sample a subset S of size |S| = k without replacement, and use:

This is effective, but not enough, because reducing the variance by a factor of k via minibatching requires that we check k times more constraints. For this reason, in addition to minibatching, we center the stochastic gradients, as is done by the well-known SVRG algorithm [Johnson and Zhang, 2013], by storing a gradient estimate with , at each iteration sampling a set S of size |S| = k uniformly without replacement, and computing:

We then update the jth coordinate of to be for every . The norms of the resulting stochastic gradients will be small if is a good estimate of the gradient, i.e. .

The difference between and can be bounded in terms of how many consecutive iterations may have elapsed since was last updated. It turns out (see Lemma 4 in Appendix C.2) that this quantity can be bounded uniformly by O((m/k) ln(mT )) with high probability, which implies that if the s are -Lipschitz, then , since at most O((m/k) ln(mT )) updates of magnitude may have occurred since was last updated. Choosing , as is standard, moves this portion (the “variance portion”) of the -dependence out of the dominant term and into a subordinate term in our final bound.

The remainder of the -dependence (the “mean portion”) depends on the norm of . It is here that our use of multiplicative p-updates becomes significant, because with such updates the relevant norm is the norm, instead of e.g. the norm (as would be the case if we updated p using SGD), thus we can bound with no explicit m-dependence.

The following theorem on the convergence rate of LightTouch is proved by applying a mirror descent bound for saddle point problems while bounding the stochastic gradient norms as described above.

Theorem 1. Suppose that the conditions of Lemma 1 apply, with . Define as a bound on the diameter of W (notice that we also choose to be at least 1), and as uniform upper bounds on the (stochastic) gradient magnitudes of f and the s, respectively, for all and . We also assume that all s are -Lipschitz w.r.t. , i.e. . Our result will be expressed in terms of a total iteration count satisfying:

Define:

If , then we optimize Equation 1 using iterations of Algorithm 2 (LightTouch), basing the stochastic gradients w.r.t. p on k constraints at each iteration, and using the step size:

If k > m, then LightTouch would check more than m constraints per iteration anyway, so we instead use iterations of Algorithm 1 (FullTouch) with the step size:

In either case, we perform iterations, requiring a total of “constraint checks” (evaluations or differentiations of a single ):

and with probability :

where is an arbitrary constraint-satisfying reference vector.

Proof. In Appendix C.2.

The most important thing to notice about this theorem is that the dominant terms in the bounds on the number of iterations and number of constraint checks are roughly times the usual convergence rate for SGD on a non-strongly convex objective. The lower-order terms have a worse m-dependence, however, with the result that, as the desired suboptimality shrinks, the algorithm performs fewer constraint checks per iteration until ultimately (once is on the order of ) only a constant number are checked during each iteration.

3.3 MidTouch: Strong Convexity

To this point, we have only required that the objective function f be convex. However, roughly the same approach also works when f is taken to be -strongly convex, although we have only succeeded in proving an in-expectation result, and the algorithm, Algorithm 3 (MidTouch), differs from LightTouch not only in that the w updates use a step size, but also in being a two-phase algorithm, the first of which, like FullTouch, checks every constraint at each iteration, and the second of which, like LightTouch with k = 1, checks only two. The following theorem bounds the convergence rate if we perform iterations in the first phase and in the second, where the parameter determines the total number of iterations performed:

Theorem 2. Suppose that the conditions of Lemma 1 apply, with . Define and as uniform upper bounds on the (stochastic) gradient magnitudes of f and the s, respectively, for all . We also assume that f is -strongly convex, and that all s are -Lipschitz w.r.t. , i.e. for all . If we run Algorithm 3 (MidTouch) with the p-update step size for iterations in the first phase and in the second:

requiring a total of “constraint checks” (evaluations or differentiations of a single ):

Table 2: Comparison of the number of iterations, and number of constraint checks, required to achieve -suboptimality with high probability when optimizing a non-strongly-convex objective, up to constant and logarithmic factors, dropping the and dependencies, and ignoring the one-time cost of projecting the final result in FullTouch and LightTouch. For LLO-FW, the parameter to the local linear oracle has magnitude . See Section 4, Appendix C, the non-smooth stochastic result of Hazan and Kale [2012, Theorem 4.3], and Garber and Hazan [2013, Theorem 2]. Notice that because this table compares upper bounds to upper bounds, subsequent work may improve these bounds further.

where is the optimal constraint-satisfying reference vector.

Proof. In Appendix D.

Notice that the above theorem bounds not the suboptimality of , but rather its squared Euclidean distance from , for which reason the denominator of the highest order term depends on rather than . Like Theorem 1 in the non-strongly convex case, the dominant terms above, both in terms of the total number of iterations and number of constraint checks, match the usual convergence rate for unconstrained strongly-convex SGD with an additional factor, while the lower-order terms have a worse m-dependence. As before, fewer constraint checks will be performed per iteration as shrinks, reaching a constant number (on average) once is on the order of .

4 Theoretical Comparison

Table 2 compares upper bounds on the convergence rates and per-iteration costs when applied to a convex (but not necessarily strongly convex) problem for LightTouch, FullTouch, projected SGD, the online Frank-Wolfe algorithm of Hazan and Kale [2012], and a Frank-Wolfe-like online algorithm for optimization over a polytope [Garber and Hazan, 2013]. The latter algorithm, which we refer to as LLO-FW, achieves convergence rates comparable to projected SGD, but uses a local linear oracle instead of a projection or full linear optimization. To simplify the presentation, the dependencies on and have been dropped—please refer to Theorems 1 and 2 and the cited references for the complete statements. Table 3 contains the same comparison (without online Frank-Wolfe) for -strongly convex problems.

At each iteration, all of these algorithms must find a stochastic subgradient of f. In addition, each iteration of LightTouch and MidTouch must perform O(m) arithmetic operations (for the m-dimensional vector operations used when updating p)—this issue will be discussed further in Section 5.1. However, projected SGD must project its iterate onto the constraints w.r.t. the Euclidean norm, online Frank-Wolfe must perform a linear optimization subject to the constraints, and LLO-FW must evaluate a local linear oracle, which amounts to essentially local linear optimization.

Table 3: Same as Table 2, except that the results bound the number of iterations or constraint checks required to achieve , and the objective function is assumed to be -strongly convex. The bound given for FullTouch assumes that the constant used in Algorithm 1 has been replaced with the standard decreasing step size used in strongly-convex SGD. The MidTouch bounds each contain four terms, listed in order of most-to-least dominant (in ). For LLO-FW, the parameter to the local linear oracle has magnitude . See Section 4, Appendix D, and Garber and Hazan [2013, Theorem 3].

LightTouch, MidTouch and FullTouch share the same -dependence, but the m-dependence of the convergence rate of LightTouch and MidTouch is logarithmically worse. The number of constraint evaluations, however, is better: in the non-strongly convex case, ignoring all but the m and dependencies, FullTouch will check constraints, while LightTouch will check only , a significant improvement when is small. Hence, particularly for problems with many expensive-to-evaluate constraints, one would expect LightTouch to converge much more rapidly. Likewise, for -strongly convex optimization, the dominant (in) terms in the bounds on the number of constraint evaluations go as for FullTouch, and as for MidTouch, although the lower-order terms in the MidTouch bound are significantly more complex than in the non-strongly convex case (see Table 3 for full details).

Comparing with projected SGD, online Frank-Wolfe and LLO-FW is less straightforward, not only because we’re comparing upper bounds to upper bounds (with all of the uncertainty that this entails), but also because we must relate the value of to the cost of performing the required projection, constrained linear optimization or local linear oracle evaluation. We note, however, that for non-strongly convex optimization, the -dependence of the convergence rate bound is worse for online Frank-Wolfe () than for the other algorithms (), and that unless the constraints have some special structure, performing a projection can be a very expensive operation.

For example, with general linear inequality constraints, each constraint check performed by LightTouch, MidTouch or FullTouch requires O(d) time, whereas each linear program optimized by online Frank-Wolfe could be solved in time [Nemirovski, 2004, Chapter 10.1], and each projection performed by SGD in time [Goldfarb and Liu, 1991]. When the constraints are taken to be arbitrary convex functions, instead of linear functions, projections may be even more difficult.

We believe that in many cases will be roughly on the order of the dimension d, or number of constraints m, whichever is smaller, although it can be worse for difficult constraint sets (see Section 2.2). In practice, we have found that a surprisingly small —we use in our experiments (Section 6)—often suffices to result in convergence to a feasible solution. With this in mind, and in light of the fact that a fast projection, linear optimization, or local linear oracle evaluation may only be possible for particular constraint sets, we believe that our algorithms compare favorably with the alternatives.

5 Practical Considerations

Algorithms 2 and 3 were designed primarily to be easy to analyze, but in real world-applications we recommend making a few tweaks to improve performance. The first of these is trivial: using a decreasing w-update step size when optimizing a non-strongly convex objective, and for a strongly-convex objective. In both cases we continue to use a constant p-update step size . This change, as well as that described in Section 5.2, is included in Algorithm 4.

5.1 Constraint Aggregation

A natural concern about Algorithms 2 and 3 is that O(m) arithmetic operations are performed per iteration, even when only a few constraints are checked. When each constraint is expensive, this is a minor issue, since this cost will be “drowned out” by that of checking the constraints. However, when the constraints are very cheap, and the O(m) arithmetic cost compares disfavorably with the cost of checking a handful of constraints, it can become a bottleneck.

Our solution to this issue is simple: transform a problem with a large number of cheap constraints into one with a smaller number of more expensive constraints. To this end, we partition the constraint indices 1, . . . , m into sets of size at most , defining , and then apply LightTouch or MidTouch on the aggregated constraints . This makes each constraint check times more expensive, but reduces the dimension of p from m to , shrinking the per-iteration arithmetic cost to .

5.2 Automatic Minibatching

Because LightTouch takes a minibatch size k as a parameter, and the constants from which we derive the recommended choice of k (Theorem 1) are often unknown, a user is in the uncomfortable position of having to perform a parameter search not only over the step sizes and , but also the minibatch size. Furthermore, the fact that the theoreticallyrecommended k is a decreasing function of T indicates that it might be better to check more constraints in early iterations, and fewer in later ones. Likewise, MidTouch is structured as a two-phase algorithm, in which every iteration checks every constraint in the first phase, and only a constant number in the second, but it seems more sensible for the number of constraint checks to decrease gradually over time.

In addition, for both algorithms, it would be desirable to support separate minibatching of the loss and constraint stochastic subgradients (w.r.t. w), in which case there would be three minibatching parameters to determine: and . This makes things even harder for the user, since now there are three additional parameters that must be specified.

To remove the need to specify any minibatch-size hyperparameters, and to enable the minibatch sizes to change from iteration-to-iteration, we propose a heuristic that will automatically determine the minibatch sizes , and for each of the stochastic gradient components at each iteration. Intuitively, we want to choose minibatch sizes in such a way that the stochastic gradients are both cheap to compute and have low variance. Our proposed heuristic does this by trading-off the computational cost and “bound impact” of the overall stochastic gradient, where the “bound impact” is a variance-like quantity that approximates the impact that taking a step with particular minibatch sizes has on the relevant convergence rate bound.

Suppose that we’re about to perform the tth iteration, and know that a single stochastic subgradient of f(w) (corresponding to the loss portion of ) has variance (more properly, covariance matrix trace) and requires a computational investment of units. Similarly, if we define by sampling and taking (corresponding to the constraint portion of ), then we can define variance and cost estimates of to be and , respectively. Likewise, we take and to be estimates of the variance and cost of a (non-minibatched version of) .

In all three cases, the variance and cost estimates are those of a single sample, implying that a stochastic subgradient of f(w) averaged over a minibatch of size will have variance and require a computational investment of , and likewise for the constraints and distribution. In the context of Algorithm 4, with minibatch sizes of , and , we define the overall bound impact b and computational cost c of a single update as:

We should emphasize that the above definition of b is merely a useful approximation of how these quantities truly affect our bounds.

Given the three variance and three cost estimates, we choose minibatch sizes in such a way as to minimize both the computational cost and bound impact of an update. Imagine that we are given a fixed computational budget c. Then our goal will be to choose the minibatch sizes in such a way that b is minimized for this budget, a problem that is easily solved in closed form:

We propose choosing the proportionality constant (and thereby the cost budget c) in such a way that (enabling us to calculate sample variances, as explained below), and round the two other sizes to the nearest integers, lowerbounding each so that and .

While the variances and costs are not truly known during optimization, they are easy to estimate from known quantities. For the costs , and , we simply time how long each past stochastic gradient calculation has taken, and then average them to estimate the future costs. For the variances and , we restrict ourselves to minibatch sizes , calculate the sample variances and of the stochastic gradients at each iteration, and then average over all past iterations (either uniformly, or a weighted average placing more weight on recent iterations).

For , the situation is a bit more complicated, since the p-updates are multiplicative (so we should use an variance) and centered as in Equation 4. Upper-bounding the norm with the norm and using the fact that the minibatch is independently sampled yields the following crude estimate:

We again average across past iterations to estimate .

6 Experiments

We validated the performance of our practical variant of LightTouch (Algorithm 4) on a YouTube ranking problem in the style of Joachims [2002], in which the task is to predict what a user will watch next, given that they have just viewed a certain video. In this setting, a user has just viewed video a, was presented with a list of candidate videos to watch next, and clicked on , with being the video immediately preceding in the list (if was the first list element, then the example is thrown out).

We used an anonymized proprietary dataset consisting of n = 612 587 training pairs of feature vectors , where is a vector of 12 features summarizing the similarity between a and , and between a and .

We treat this as a standard pairwise ranking problem, for which the goal is to estimate a function such that for as many examples as possible, subject to the appropriate regularization (or, in this case, constraints). Specifically, the (unconstrained) learning task is to minimize the average empirical hinge loss:

All twelve of the features were designed to provide positive evidence—in other words, if any one increases (holding the others fixed), then we expect to increase. We have found that using constraints to enforce this monotonicity property results in a better model in practice.

We define as in lattice regression using simplex interpolation [Garcia et al., 2012, Gupta et al., 2016], an approach which works well at combining a small number of informative features, and more importantly (for our purposes) enables one to force the learned function to be monotonic via linear inequality constraints on the parameters. For the resulting problem, the feature vectors have dimension , we chose W to be defined by the box constraints in each of the 4096 dimensions, and the total number of monotonicity-enforcing linear inequality constraints is m = 24 576.

Every contains only d + 1 = 13 nonzeros and can be computed in O(d ln d) time. Hence, stochastic gradients of f are inexpensive to compute. Likewise, checking a monotonicity constraint only requires a single comparison between two parameter values, so although there are a large number of them, each constraint is very inexpensive to check.

6.1 Implementations

We implemented all algorithms in C++. Before running our main experiments, we performed crude parameter searches on a power-of-four grid (i.e. ). For each candidate value we performed roughly 10 000 iterations, and chose the parameter that appeared to result in the fastest convergence in terms of the objective function.

LightTouch Our implementation of LightTouch includes all of the suggested changes of Section 5, including the constraint aggregation approach of Section 5.1, although we used no aggregation until our timing comparison (Section 6.3). For automatic minibatching, we took weighted averages of the variance estimates as . We found that up-weighting recent estimates (taking ) resulted in a noticeable improvement, but that the precise value of mattered little (we used ). Based on the grid search described above, we chose and .

FullTouch Our FullTouch implementation differs from that in Algorithm 1 only in that we used a decreasing step size . As with LightTouch, we chose and based on a grid search.

ProjectedSGD We implemented Euclidean projections onto lattice monotonicity constraints using IPOPT [W¨achter and Biegler, 2006] to optimize the resulting sparse 4096-dimensional quadratic program. However, the use of a QP solver for projected SGD—a very heavyweight solution—resulted in an implementation that was too slow to experiment with, requiring nearly four minutes per projection (observe that our experiments each ran for millions of iterations).

ApproxSGD This is an approximate projected SGD implementation using the fast approximate update procedure described in Gupta et al. [2016], which is an active set method that, starting from the current iterate, moves along the boundary of the feasible region, adding constraints to the active set as they are encountered, until the desired step is exhausted (this is reminiscent of the local linear oracles considered by Garber and Hazan [2013]). This approach is particularly well-suited to this particular constraint set because (1) when checking constraints for possible inclusion in the active set, it exploits the sparsity of the stochastic gradients to only consider monotonicity constraints which could possibly be violated, and (2) projecting onto an intersection of active monotonicity constraints reduces to uniformly averaging every set of parameters that are “linked together” by active constraints. Like the other algorithms, we used step sizes of and chose based on the grid search (recall that was better for the other two algorithms).

In every experiment we repeatedly looped over a random permutation of the training set, and generated plots by averaging over 5 such runs (with the same 5 random permutations) for each algorithm.

6.2 Constraint-check Comparison

In our first set of experiments, we compared the performance of LightTouch, FullTouch and ApproxSGD in terms of the number of stochastic subgradients of f drawn, and the number of constraints checked. Because LightTouch’s automatic minibatching fixes (with the other two minibatch sizes being automatically determined), in these experiments we used minibatch sizes of 2 for FullTouch and ApproxSGD, guaranteeing that all three algorithms observe the same number of stochastic subgradients of f at each iteration.

The left-hand plot of Figure 1 shows that all three algorithms converge at roughly comparable per-iteration rates, with ApproxSGD having a slight advantage over FullTouch, which itself converges a bit more rapidly than LightTouch. The right-hand plot shows a striking difference, however—LightTouch reaches a near-optimal solution having checked more than fewer constraints than FullTouch. Notice that we plot the suboptimalities of the projected iterates rather than of the s themselves, in order to emulate the final projection (line 7 of Algorithm 1 and 17 of Algorithm 4), and guarantee that we only compare the average losses of feasible intermediate solutions.

In Figure 2, we explore how well our algorithms enforce feasibility, and how effective automatic minibatching is at choosing minibatch sizes. The left-hand plot shows that both FullTouch has converged to a nearly-feasible solution

Figure 1: Comparison of convergence rates of LightTouch, FullTouch and ApproxSGD on the YouTube ranking problem of Section 6. The two plots show the objective function (average training hinge loss) as a function of the number of iterations, and as a function of the total number of times a single constraint function was evaluated or differentiated, respectively.

after roughly 10 000 iterations, and LightTouch (unsurprisingly) takes more, perhaps 100 000 or so. In the right-hand plot, we see that, in line with our expectations (see Section 5.2), LightTouch’s automatic minibatching results in very few constraints being checked in late iterations.

6.3 Timing Comparison

Figure 3: Plot of the objective function (average training hinge loss) as a function of runtime for our implementations of LightTouch, FullTouch and ApproxSGD, on the YouTube ranking problem of Section 6.

Our final experiment compared the wall-clock runtimes of our implementations. Note that, because each monotonicity constraint can be checked with only a single comparison (compare with e.g. O(d) arithmetic operations for a dense linear inequality constraint), the O(m) arithmetic cost of maintaining and updating the probability distribution p over the constraints is significant. Hence, in terms of the constraint costs, this is nearly a worse-case problem for LightTouch. We experimented with power-of-4 constraint aggregate sizes (Section 5.1), and found that using aggregated constraints, each of size 256, worked best.

FullTouch, without minibatching, draws a single stochastic subgradient of f and checks every constraint at each iteration. However, it would seem to be more efficient to use minibatching to look at more stochastic subgradients at each iteration, and therefore fewer constraints per stochastic subgradient of f. Hence, for FullTouch, we again searched over power-of-4 minibatch sizes, and found that 16 worked best.

For ApproxSGD, the situation is less clear-cut. On the one hand, increasing the minibatch size results in fewer approximate projections being performed per stochastic subgradient of f. On the other, averaging more stochastic

Figure 2: Comparison constraint-handling of LightTouch and FullTouch on the YouTube ranking problem of Section 6. The two plots show the constraint violation magnitude , and the average number of constraints checked per iteration up to this point, respectively, both as functions of the number of iterations.

subgradients results in less sparsity, slowing down the approximate projection. We found that the latter consideration wins out—after searching again over power-of-4 minibatch sizes, we found that a minibatch size of 1 (i.e. no minibatching) worked best.

Figure 3 contains the results of these experiments, showing that both FullTouch and LightTouch converge significantly faster than ApproxSGD. Interestingly, ApproxSGD is rather slow in early iterations (clipped off in plot), but accelerates in later iterations. We speculate that the reason for this behavior is that, close to convergence, the steps taken at each iteration are smaller, and therefore the active sets constructed during the approximate projection routine do not grow as large. FullTouch enjoys a small advantage over LightTouch until both algorithms are very close to convergence, but based on the results of Section 6.2, we believe that this advantage would reverse if there were more constraints, or if the constraints were more expensive to check.

7 Conclusions

We have proposed an efficient strategy for large-scale heavily constrained optimization, building on the work of Mahdavi et al. [2012], and analyze its performance, demonstrating that, asymptotically, our approach requires many fewer constraint checks in order to converge.

We build on these theoretical results to propose a practical variant. The most significant of these improvements is based on the observation that our algorithm takes steps based on three separate stochastic gradients, and that trading off the variances of computational costs of these three components is beneficial. To this end, we propose a heuristic for dynamically choosing minibatch sizes in such a way as to encourage faster convergence at a lower computational cost.

Experiments on a real-world 4096-dimensionalmachine learning problem with 24 576 constraints and 612 587 training examples—too large for a QP-based implementation of projected SGD—showed that our proposed method is effective. In particular, we find that, in practice, our technique checks fewer constraints per iteration than competing algorithms, and, as expected, checks ever fewer as optimization progresses.

Acknowledgments

We thank Kevin Canini, Mahdi Milani Fard, Andrew Frigyik, Michael Friedlander and Seungil You for helpful discussions, and proofreading earlier drafts.

References

N. P. Archer and S. Wang. Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Decision Sciences, 24:60–75, 1993.

H. H. Bauschke. Projection Algorithms and Monotone Operators. Ph.D. Thesis, Simon Fraser University, Canada, 1996.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, May 2003.

K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS’10, pages 449–457, Washington, DC, USA, 2010. IEEE Computer Society.

H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Trans. Neural Networks, 21 (6):906–917, 2010.

K. Dzhaparidze and J. H. van Zanten. On Bernstein-type inequalities for martingales. Stochastic Processes and their Applications, 93(1):109–117, May 2001.

D. Garber and E. Hazan. Playing non-linear games with linear oracles. In FOCS, pages 420–428. IEEE Computer Society, 2013.

E. K. Garcia, R. Arora, and M. Gupta. Optimized regression for efficient function evaluation. IEEE Trans. Image Processing, 21(9):4128–4140, Sept. 2012.

D. Goldfarb and S. Liu. An O(Ln3) primal interior point algorithm for convex quadratic programming. Math. Program., 49(3):325–340, Jan. 1991.

M. R. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. van Esbroeck. Monotonic calibrated interpolated look-up tables. JMLR, 17(109):1–47, 2016.

E. Hazan and S. Kale. Projection-free online learning. In ICML’12, 2012.

M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML’13, volume 28, pages 427– 435, 2013.

T. Joachims. Optimizing search engines using clickthrough data. In KDD’02, pages 133–142, New York, NY, USA, 2002.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS’13, pages 315–323. 2013.

M. Mahdavi, T. Yang, R. Jin, S. Zhu, and J. Yi. Stochastic gradient descent with only one projection. In NIPS’12, pages 494–502. 2012.

A. Nemirovski. Lecture notes: Interior point polynomial time methods in convex programming. 2004. URL .

A. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization. John Wiley & Sons Ltd, 1983.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic pro- gramming. SIAM Journal on Optimization, 19(4):1574–1609, January 2009.

Y. Qu and B. Hu. Generalized constraint neural network regression model subject to linear priors. IEEE Trans. on Neural Networks, 22(11):2447–2459, 2011.

A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In NIPS’13, pages 3066–3074. 2013.

S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Mathematical Programming, 127(1):3–30, March 2011.

J. Sill. Monotonic networks. Advances in Neural Information Processing Systems (NIPS), 1998.

J. Spouge, H. Wan, and W. J. Wilbur. Least squares isotonic regression in two dimensions. Journal of Optimization Theory and Applications, 117(3):585–605, 2003.

N. Srebro, K. Sridharan, and A. Tewari. On the universality of online mirror descent. In NIPS’11, 2011.

A. W¨achter and L. T. Biegler. On the implementation of a primal-dual interior point filter line search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.

M. Wang, Y. Chen, J. Liu, and Y. Gu. Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints. ArXiv e-prints, Nov. 2015.

Wikipedia. Coupon collector’s problem — Wikipedia, the free encyclopedia, 2014. URL http://en.wikipedia.org/wiki/Coupon_collector%27s_problem. [Online; accessed 20-November-2014].

Wikipedia. Properties of polynomial roots — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Properties_of_polynomial_roots. [Online; accessed 27-March-2015].

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML’03, 2003.

Table 4: New notation in Appendix A.

A Mirror Descent

Mirror descent [Nemirovski and Yudin, 1983, Beck and Teboulle, 2003] is a meta-algorithm for stochastic optimization (more generally, online regret minimization) which performs gradient updates with respect to a meta-parameter, the distance generating function (d.g.f.). The two most widely-used d.g.f.s are the squared Euclidean norm and negative Shannon entropy, for which the resulting MD instantiations are stochastic gradient descent (SGD) and a multiplicative updating algorithm, respectively. These are precisely the two d.g.f.s which our constrained algorithm will use for the updates of w and p. We’ll here give a number of results which differ only slightly from “standard” ones, beginning with a statement of an online MD bound adapted from Srebro et al. [2011]:

Theorem 3. Let and be a norm and its dual. Suppose that the distance generating function (d.g.f.) is 1-strongly convex w.r.t. . Let be the convex conjugate of , and take to be the associated Bregman divergence.

Take to be a sequence of convex functions on which we perform T iterations of mirror descent starting from :

where is a subgradient of at . Then:

where is an arbitrary reference vector.

Proof. This proof is essentially the same as that of Srebro et al. [2011, Lemma 2]. By convexity:

By H¨older’s inequality, . Also, is maximized when , so . These results combined with the definition of give:

Using Young’s inequality and the definition of the Bregman divergence:

Applying the 1-strong convexity of to cancel theand terms:

Summing over t, using the nonnegativity of , and dividing through by gives the claimed result.

It is straightforward to transform Theorem 3 into an in-expectation result for stochastic subgradients:

Corollary 1. Take to be a sequence of convex functions, and F a filtration. Suppose that we perform T iterations of stochastic mirror descent starting from , using the definitions of Theorem 3:

where is a stochastic subgradient of , i.e. , and is -measurable. Then:

where is an arbitrary reference vector.

Proof. Define , and observe that applying the non-stochastic MD algorithm of Theorem 3 to the sequence of functions results in the same sequence of iterates as does applying the above stochastic MD update to the sequence of functions . Hence:

By convexity, , while by definition. Taking expectations of both sides of Equation 5 and plugging in these inequalities yields the claimed result.

We next prove a high-probabilityanalogue of the Corollary 1, based on a martingale bound of Dzhaparidze and van Zanten [2001]:

Corollary 2. In addition to the assumptions of Corollary 1, suppose that, with probability satisfies the following uniformly for all :

Then, with probability , the above bound will hold, and:

where is an arbitrary reference vector and bounds the radius of W centered on .

Proof. Define as in the proof of Corollary 1, and observe that Equation 5 continues to apply. Define a sequence of random variables , and notice that M forms a martingale w.r.t. the filtration F. From this definition, H¨older’s inequality gives that:

the above holding with probability . Plugging and into the Bernstein-type martingale inequality of Dzhaparidze and van Zanten [2001, Theorem 3.3] gives:

Solving for using the quadratic formula and upper-bounding gives that, with probability :

As in the proof of Corollary 1, , while by definition, which combined with Equation 5 yields the claimed result.

Algorithm 2 (LightTouch) jointly optimizes over two sets of parameters, for which the objective is convex in the first and linear (hence concave) in the second. The convergence rate will be determined from a saddle-point bound, which we derive from Corollary 2 by following Nemirovski et al. [2009], Rakhlin and Sridharan [2013], and simply applying it twice:

Corollary 3. Let and be norms with duals and . Suppose that and are 1-strongly convex w.r.t. and , have convex conjugates and , and associated Bregman divergences and , respectively.

Take to be convex in its first parameter and concave in its second, let F be a filtration, and suppose that we perform T iterations of MD:

where is a stochastic subgradient of w.r.t. its first parameter, and a stochastic supergradient w.r.t. its second, with both and being -measurable. We assume that, with probabilities and (respectively), and satisfy the following uniformly for all :

Under these conditions, with probability , the above and bounds will hold, and:

where and are arbitrary reference vectors, and and bound the radii of W and A centered on and , respectively.

Proof. This is a convex-concave saddle-point problem, which we will optimize by playing two convex optimization algorithms against each other, as in Nemirovski et al. [2009], Rakhlin and Sridharan [2013]. By Corollary 2, with probability and , respectively:

Adding these two inequalities gives the claimed result.

B SGD for Strongly-Convex Functions

For -strongly convex objective functions, we can achieve a faster convergence rate for SGD by using the step sizes . Our eventual algorithm (Algorithm 3) for strongly-convex heavily-constrained optimization will proceed in two phases, with the second phase “picking up” where the first phase “left off”, for which reason we present a convergence rate, based on Shalev-Shwartz et al. [2011, Lemma 2], that effectively starts at iteration by using the step sizes :

Theorem 4. Take to be a sequence of -strongly convex functions on which we perform T iterations of stochastic gradient descent starting from :

where is a subgradient of at , andfor all t. If we choose for some , then:

where is an arbitrary reference vector and bounds the subgradient norms for all t.

Proof. This is nothing but a small tweak to Shalev-Shwartz et al. [2011, Lemma 2]. Starting from Equations 10 and 11 of that proof:

Taking :

Dividing through by T , simplifying and bounding yields the claimed result.

As we did Appendix A, we convert this into a result for stochastic subgradients:

Corollary 4. Take to be a sequence of -strongly convex functions, and F a filtration. Suppose that we perform T iterations of stochastic gradient descent starting from :

where is a stochastic subgradient of , i.e. , and is -measurable. If we choose for some , then:

where is an arbitrary reference vector and bounds the stochastic subgradient norms for all t.

Proof. Same proof technique as Corollary 1, but based on Theorem 4 rather than Theorem 3.

We now use this result to prove an in-expectation saddle point bound:

Corollary 5. Let and be a norm and its dual. Suppose that is 1-strongly convex w.r.t. , and has convex conjugate and associated Bregman divergence .

Take to be -strongly convex in its first parameter and concave in its second, let F be a filtration, and suppose that we perform T iterations of SGD on w and MD on :

where is a stochastic subgradient of w.r.t. its first parameter, and a stochastic supergradient w.r.t. its second, with both and being -measurable. Then:

where and are arbitrary reference vectors, and bounds the stochastic subgradient norms w.r.t. w for all t.

Proof. As we did in the proof of Corollary 3, we will play two convex optimization algorithms against each other. By Corollaries 4 and 1:

Adding these two inequalities gives the claimed result.

C Analyses of FullTouch and LightTouch

We begin by proving that, if is sufficiently large, then optimizing the relaxed objective, and projecting the resulting solution, will bring us close to the optimum of the constrained objective.

Lemma 1. In the setting of Section 2, suppose that f is -Lipschitz, i.e. for all , and that there is a constant such that if g(w) = 0 thenfor all , where is the subdifferential of g(w).

If , then for any infeasible w (i.e. for which g(w) > 0):

where is the projection of w onto the set w.r.t. the Euclidean norm.

Proof. Let be an arbitrary infeasible point. Because f is -Lipschitz:

Since is the projection of w onto the constraints w.r.t. the Euclidean norm, we must have by the first order optimality conditions that there exists a such that:

This implies that is a scalar multiple of some . Because g is convex and is on the boundary, , so:

Combining the definition of h with Equations 6 and 7 yields:

Both claims follow immediately if .

C.1 Analysis of FullTouch

We’ll now use Lemma 1 and Corollary 2 to bound the convergence rate of SGD on the function h of Lemma 1 (this is FullTouch). Like the algorithm itself, the convergence rate is little different from that found by Mahdavi et al. [2012] (aside from the bound on ), and is included here only for completeness.

Lemma 3. Suppose that the conditions of Lemma 1 apply, with . Define as the diameter of and as uniform upper bounds on the (stochastic) gradient magnitudes of f and the s, respectively.

If we optimize Equation 1 using Algorithm 1 (FullTouch) with the step size:

then with probability :

where is an arbitrary constraint-satisfying reference vector, and:

Proof. We choose , for which the mirror descent update rule is precisely SGD. Because is (half of) the squared Euclidean norm, it is trivially 1-strongly convex w.r.t. the Euclidean norm, so . Furthermore, and .

We may upper bound the 2-norm of our stochastic gradients as. Only the f-portion of the objective is stochastic, so the error of the s can be trivially upper bounded, with probability 1, with . Hence, by Corollary 2 (taking to be e.g. the smallest -algebra making measurable), with probability :

Plugging in the definition of , moving the average defining inside h by Jensen’s inequality, substituting because satisfies the constraints, applying Lemma 1 and simplifying yields the claimed result.

In terms of the number of iterations required to achieve some desired level of suboptimality, this bound on may be expressed as:

Theorem 5. Suppose that the conditions of Lemmas 1 and 3 apply, and that is as defined in Lemma 3.

If we optimize Equation 1 using iterations of Algorithm 1 (FullTouch):

then with probability . where is an arbitrary constraint-satisfying reference vector.

Proof. Based on the bound of Lemma 3, define:

and consider the polynomial . Roots of this polynomial are xs for which , while for xs larger than any root we’ll have that . Hence, we can bound the T required to ensure -suboptimality by bounding the roots of this polynomial. By the Fujiwara bound [Wikipedia, 2015]:

giving the claimed result.

C.2 Analysis of LightTouch

Because we use the reduced-variance algorithm of Johnson and Zhang [2013], and therefore update the remembered gradient one random coordinate at a time, we must first bound the maximum number of iterations over which a coordinate can go un-updated:

Lemma 4. Consider a process which maintains a sequence of vectors for , where is initialized to zero and is derived from by independently sampling random indices {1, . . . , m} uniformly without replacement, and then setting for and for . Then, with probability :

Table 5: New notation in Appendix C.2.

Proof. This is closely related to the “coupon collector’s problem” [Wikipedia, 2014]. We will begin by partitioning time into contiguous size-n chunks, with 1, . . . , n forming the first chunk, n + 1, . . . , 2n the second, and so on.

Within each chunk the probability that any particular index was never sampled is , so by the union bound the probability that any one of the m indices was never sampled is bounded by :

Define , so:

This shows that for this choice of n, the probability of there existing an index which is never sampled in some particular batch is bounded by . By the union bound, the probability of any of batches containing an index which is never sampled is bounded by .

If every index is sampled within every batch, then over the first steps, the most steps which could elapse over which a particular index is not sampled is (if the index is sampled on the first step of one chunk, and the last step of the next chunk), which implies the claimed result.

We now combine this bound with Corollary 2 and make appropriate choices of the two d.g.f.s to yield a bound on the LightTouch convergence rate:

Lemma 5. Suppose that the conditions of Lemma 1 apply, with . Define as a bound on the diameter of W (notice that we also choose to be at least and as uniform upper bounds on the (stochastic) gradient magnitudes of f and the s, respectively, for all . We also assume that all s are -Lipschitz w.r.t. , i.e. for all .

Define:

If and we optimize Equation 1 using Algorithm 2 (LightTouch), basing the stochastic gradients w.r.t. p on k constraints at each iteration, and using the step size:

then it holds with probability that:

where is an arbitrary constraint-satisfying reference vector, and:

If k > m, then we should fall-back to using FullTouch, in which case the result of Lemma 3 will apply.

Proof. We choose and to be the squared Euclidean norm divided by 2 and the negative Shannon entropy, respectively, which yields the updates of Algorithm 2. We assume that the s are random variables on some probability space (depending on the source of the stochastic gradients of f), and likewise the s and s on another, so may be taken to be the product of the smallest -algebras which make and measurable, respectively, with conditional expectations being taken w.r.t. the product measure. Under the definitions of Corollary 3 (taking ), with probability :

As in the proof of Lemma 3, is 1-strongly convex w.r.t. the Euclidean norm, so and . Because is the negative entropy, which is 1-strongly convex w.r.t. the 1-norm (this is Pinsker’s inequality), and , implying that . Since is initialized to the uniform distribution, .

The stochastic gradient definitions of Algorithm 2 give thatand with probability by the triangle inequality, and because satisfies the constraints. All of these facts together give that, with probability :

We now move the average defining inside (which is convex in its first parameter) by Jensen’s inequality, and use the fact that there exists a such that to apply Lemma 1:

By the triangle inequality and the fact that :

Substituting into Equation 9 and using the fact that :

We will now turn our attention to the problem of bounding . Notice that because we sample i.i.d. s uniformly at every iteration, they form an instance of the process of Lemma 4 with , showing that with probability :

By the definition of (Algorithm 2):

where in the second step we used the definition of the -norm, in the third we used the Lipschitz continuity of the s (and hence of their positive parts), in the fourth we bounded the distance between two iterates with the number of iterations times a bound on the total step size, and in the fifth we used Equation 11. This shows that we may define:

and it will satisfy the conditions of Corollary 3. Notice that, due to the factor, will be decreasing in T . Substituting the definitions of and into Equation 10, choosing and using the assumption that gives

that with probability :

Rounding up the constant terms:

Substituting the definition of k, simplifying and bounding yields the claimed result.

In terms of the number of iterations required to achieve some desired level of suboptimality, this bound on and the bound of Lemma 3 on may be combined to yield the following:

Theorem 1. Suppose that the conditions of Lemmas 1 and 5 apply. Our result will be expressed in terms of a total iteration count satisfying:

Define k in terms of as in Lemma 5. If , then we optimize Equation 1 using iterations of Algorithm 2 (LightTouch) with as in Lemma 5. If k > m, then we use iterations of Algorithm 1 (FullTouch) with as in Lemma 3. In either case, we perform iterations, requiring a total of “constraint checks” (evaluations or differentiations of a single ):

and with probability :

where is an arbitrary constraint-satisfying reference vector.

Proof. Regardless of the value of k, it follows from Lemmas 5 and 3 that:

As in the proof of Theorem 5, we define:

and consider the polynomial . Any upper bound on all roots of this polynomial will result in a lower-bound the values of T for which with probability . By the Fujiwara bound [Wikipedia, 2015]:

giving the claimed bound on . For , we observe that we will perform no more than k + 1 constraint checks at each iteration (k + 1 by LightTouch if , and m + 1 by FullTouch if k > m), and substitute the above bound on into the definition of k, yielding:

giving the claimed result (notice the factor on the RHS, for which reason we have a bound on , instead of O).

D Analysis of MidTouch

We now move on to the analysis of our LightTouch variant for -strongly convex objectives, Algorithm 3 (MidTouch). While we were able to prove a high-probability bound for LightTouch, we were unable to do so for MidTouch, because the extra terms resulting from the use of a Bernstein-type martingale inequality were too large (since the other terms shrank as a result of the strong convexity assumption). Instead, we give an in-expectation result, and leave the proof of a corresponding high-probability bound to future work.

Our first result is an analogue of Lemmas 3 and 5, and bounds the suboptimality achieved by MidTouch as a function of the iteration counts and of the two phases:

Lemma 6. Suppose that the conditions of Lemma 1 apply, with . Define and as uniform upper bounds on the (stochastic) gradient magnitudes of f and the s, respectively, for all . We also assume that f is -strongly convex, and that all s are -Lipschitz w.r.t. , i.e. for all .

If we optimize Equation 1 using Algorithm 3 (MidTouch) with the p-update step size , then:

where is the optimal constraint-satisfying reference vector.

Proof. As in the proof of Lemma 3, the first phase of Algorithm 3 is nothing but (strongly convex) SGD on the overall objective function h, so by Corollary 4:

so by Jensen’s inequality:

For the second phase, as in the proof of Lemma 5, we choose to be negative Shannon entropy, which yields the second-phase updates of Algorithm 3. By Corollary 5:

As before, , and . Hence:

Since h is -strongly convex and is optimal,. By Equation 12:

Since the (uncentered) second moment is equal to the mean plus the variance, and using the fact that since all constraints are satisfied at :

the first step using the fact that the s are -Lipschitz and Jensen’s inequality. For the second step, we choose such that is a minimax optimal pair (recall that is optimal by assumption), and use the -strong convexity of . Substituting into Equation 13 and using the fact that :

We now follow the proof of Lemma 5 and bound . By the definition of (Algorithm 3):

The indices j are sampled uniformly, so the maximum time since we last sampled the same index is an instance of the coupon collector’s problem Wikipedia [2014]. Because the s are -Lipschitz:

the second step because, between iteration and iteration t we will perform updates of magnitude at most , and the third step because, as an instance of the coupon collector’s problem, has expectation is the mth harmonic number) and variance . Substituting into Equation 14:

By the -strong convexity of :

Using the facts that because is feasible, and that , completes the proof.

We now move on to the main result: a bound on the number of iterations (equivalently, the number of stochastic loss gradients) and constraint checks required to achieve -suboptimality:

Theorem 2. Suppose that the conditions of Lemmas 1 and 6 apply, with the p-update step size as defined in Lemma 6. If we run Algorithm 3 (MidTouch) for iterations in the first phase and in the second:

requiring a total of “constraint checks” (evaluations or differentiations of a single ):

where is the optimal constraint-satisfying reference vector.

By Lemma 6, the above definitions imply that:

Defining and rearranging:

We will now upper-bound all roots of the above equation with a quantity , for which all will result in -suboptimality. By the Fujiwara bound [Wikipedia, 2015], and including the constraint that :

Substituting the above bound on into the definitions of and gives the claimed magnitudes of these and , and using the fact that the gives the claimed bound on .

designed for accessibility and to further open science