35:[["$","audio",null,{"id":"tts"}],["$","$L3a",null,{"paperID":"1710.10345","publisher":"arxiv","paperJSON":{"title":"The Implicit Bias of Gradient Descent on Separable Data","paperID":"1710.10345","avgLineHeight":13.56,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods.","element":"span"}],[{"style":{"width":"88%"},"width":1532,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/0-0.png","element":"img"}]]},{"heading":"1. 
Introduction","paragraphs":[[{"text":"It is becoming increasingly clear that implicit biases introduced by the optimization algorithm play a crucial role in deep learning and in the generalization ability of the learned models ","element":"span"},{"href":"#id-0","referenceIndex":14,"text":"(Neyshabur et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":14,"text":"2014","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":15,"text":"2015","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":23,"text":"Zhang et al., ","element":"a"},{"href":"#id-2","referenceIndex":23,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-3","referenceIndex":11,"text":"Keskar et al., ","element":"a"},{"href":"#id-3","referenceIndex":11,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":16,"text":"Neyshabur et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":16,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":22,"text":"Wilson et al., ","element":"a"},{"href":"#id-5","referenceIndex":22,"text":"2017","element":"a"},{"text":"). In particular, minimizing the training error, without explicit regularization, over models with more parameters and capacity than the number of training examples, often yields good generalization. This is despite the empirical optimization problem being highly underdetermined. That is, there are many global minima of the training objective, most of which will not generalize well, but the optimization algorithm (e.g. gradient descent) biases us toward a particular minimum that does generalize well. 
Unfortunately, we still do not have a good understanding of the biases introduced by different optimization algorithms in different situations.","element":"span"}],[{"text":"We do have an understanding of the implicit regularization introduced by early stopping of stochastic methods or, at an extreme, of one-pass (no repetition) stochastic gradient descent ","element":"span"},{"href":"#id-6","referenceIndex":7,"text":"(Hardt et al., ","element":"a"},{"href":"#id-6","referenceIndex":7,"text":"2016","element":"a"},{"text":"). However, as discussed above, in deep learning we often benefit from implicit bias even when optimizing the training error to convergence (without early stopping) using stochastic or batch methods. For loss functions with attainable, finite minimizers, such as the squared loss, we have some","element":"span"}],[{"style":{"width":"70%"},"width":1225,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/0-1.png","element":"img"}],[{"text":"License: CC-BY 4.0, see ","element":"span"},{"href":"https://creativecommons.org/licenses/by/4.0/","text":"https://creativecommons.org/licenses/by/4.0/","element":"a"},{"text":". Attribution requirements are provided at ","element":"span"},{"href":"http://jmlr.org/papers/v19/18-188.html","text":"http://jmlr.org/papers/v19/18-188.html","element":"a"},{"text":".","element":"span"}],[{"text":"understanding of this: in particular, when minimizing an underdetermined least squares problem using gradient descent starting from the origin, it can be shown that we will converge to the minimum Euclidean norm solution. However, the logistic loss, and its generalization the cross-entropy loss which is often used in deep learning, do not admit finite minimizers on separable problems. 
Instead, to drive the loss toward zero and thus minimize it, the norm of the predictor must diverge toward infinity.","element":"span"}],[{"text":"Do we still benefit from implicit regularization when minimizing the logistic loss on separable data? Clearly the norm of the predictor itself is not minimized, since it grows to infinity. However, for prediction, only the direction of the predictor, i.e. the normalized ","element":"span"},{"style":{"height":17.6},"width":246.16,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-0.png","element":"img","alt":" w(t)/ ∥w(t)∥","inline":true},{"text":", is important. How does ","element":"span"},{"style":{"height":17.6},"width":579.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-1.png","element":"img","alt":" w(t)/ ∥w(t)∥ behave as t → ∞","inline":true,"padRight":true},{"text":"when we minimize the logistic (or similar) loss using gradient descent on separable data, i.e., when it is possible to get zero misclassification error and thus drive the loss to zero?","element":"span"}],[{"text":"In this paper, we show that even without any explicit regularization, for all linearly separable datasets, when minimizing logistic regression problems using gradient descent, we have that ","element":"span"},{"style":{"height":17.6},"width":246.16,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-2.png","element":"img","alt":"w(t)/ ∥w(t)∥","inline":true,"padRight":true},{"text":"converges to the ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-3.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"maximum margin separator, i.e. to the solution of the hard margin SVM for homogeneous linear predictors. 
This happens even though neither the norm ","element":"span"},{"style":{"height":17.6},"width":160.92,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-4.png","element":"img","alt":" ∥w∥, nor","inline":true,"padRight":true},{"text":"the margin constraint, are part of the objective or explicitly introduced into optimization. More generally, we show the same behavior for generalized linear problems with any smooth, monotone strictly decreasing, lower bounded loss with an exponential tail. Furthermore, we characterize the rate of this convergence, and show that it is rather slow: for almost all datasets, the distance to the max-margin predictor decreases only as ","element":"span"},{"text":"O","element":"span"},{"text":"(1","element":"span"},{"text":"/ ","element":"span"},{"text":"log(","element":"span"},{"text":"t","element":"span"},{"text":"))","element":"span"},{"text":", and for some degenerate datasets, the rate further slows down to ","element":"span"},{"text":"O","element":"span"},{"text":"(log log(","element":"span"},{"text":"t","element":"span"},{"text":")","element":"span"},{"text":"/ ","element":"span"},{"text":"log(","element":"span"},{"text":"t","element":"span"},{"text":"))","element":"span"},{"text":". This explains why the predictor continues to improve even when the training loss is already extremely small. We emphasize that this bias is specific to gradient descent, and changing the optimization algorithm, e.g. using adaptive learning rate methods such as ADAM ","element":"span"},{"href":"#id-7","referenceIndex":12,"text":"(Kingma and Ba","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":12,"text":"2015","element":"a"},{"text":"), changes this implicit bias.","element":"span"}]]},{"heading":"2. 
Main Results","paragraphs":[[{"text":"Consider a dataset ","element":"span"},{"style":{"height":21.86},"width":479.76,"height":54.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-5.png","element":"img","alt":" {xn, yn}Nn=1, with xn ∈ Rd","inline":true,"padRight":true},{"text":"and binary labels ","element":"span"},{"style":{"height":17.6},"width":237.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-6.png","element":"img","alt":" yn ∈ {−1, 1}","inline":true},{"text":". We analyze learning ","element":"span"},{"text":"by minimizing an empirical loss of the form","element":"span"}],[{"id":"id-9","style":{"width":"64%"},"width":1114,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-7.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.94},"width":140.4,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-8.png","element":"img","alt":" w ∈ Rd ","inline":true,"padRight":true},{"text":"is the weight vector. 
To simplify notation, we assume that all the labels are positive: ","element":"span"},{"style":{"height":16.4},"width":218.32,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-9.png","element":"img","alt":"∀n : yn = 1","inline":true,"padRight":true},{"text":"— this is true without loss of generality, since we can always re-define ","element":"span"},{"style":{"height":12.4},"width":211.16,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-10.png","element":"img","alt":" ynxn as xn.","inline":true}],[{"text":"We are particularly interested in problems that are linearly separable and in which the loss is smooth, strictly decreasing, and non-negative:","element":"span"}],[{"id":"id-8","text":"Assumption 1 ","element":"span"},{"text":"The dataset is linearly separable: ","element":"span"},{"style":{"height":16.82},"width":569.72,"height":42.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-11.png","element":"img","alt":" ∃w∗ such that ∀n : w⊤∗ xn > 0 .","inline":true}],[{"text":"Assumption 2 ","element":"span"},{"style":{"height":17.6},"width":84.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-12.png","element":"img","alt":" ℓ (u)","inline":true,"padRight":true},{"text":"is a positive, differentiable, monotonically decreasing to zero","element":"span"},{"text":"1","element":"span"},{"text":", (so ","element":"span"},{"style":{"height":17.6},"width":223.12,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-13.png","element":"img","alt":" ∀u : ℓ (u) >","inline":true},{"style":{"height":17.6},"width":1008.08,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-14.png","element":"img","alt":"0, ℓ′ (u) < 0, limu→∞ ℓ (u) = limu→∞ ℓ′ (u) = 0), a β","inline":true},{"text":"-smooth function, i.e. 
its derivative is ","element":"span"},{"style":{"height":16.4},"width":41.88,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-15.png","element":"img","alt":" β-","inline":true,"padRight":true},{"text":"Lipschitz, and ","element":"span"},{"style":{"height":17.63},"width":443,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/1-16.png","element":"img","alt":" lim supu→−∞ ℓ′ (u) < 0.","inline":true}],[{"text":"Assumption ","element":"span"},{"href":"#id-8","text":"2 ","element":"a"},{"text":"includes many common loss functions, including the logistic, exp-loss","element":"span"},{"text":"2 ","element":"span"},{"text":"and probit losses. Assumption ","element":"span"},{"href":"#id-8","text":"2 ","element":"a"},{"text":"implies that ","element":"span"},{"style":{"height":19.15},"width":398.6,"height":47.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-0.png","element":"img","alt":" L (w) is a βσ2max (X )","inline":true},{"text":"-smooth function, where ","element":"span"},{"style":{"height":17.6},"width":285.92,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-1.png","element":"img","alt":" σmax (X ) is the","inline":true,"padRight":true},{"text":"maximal singular value of the data matrix ","element":"span"},{"style":{"height":15.94},"width":210.2,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-2.png","element":"img","alt":" X ∈ Rd×N.","inline":true}],[{"text":"Under these conditions, the infimum of the optimization problem is zero, but it is not attained at any finite ","element":"span"},{"text":"w","element":"span"},{"text":". Furthermore, no finite critical point ","element":"span"},{"text":"w ","element":"span"},{"text":"exists. We consider minimizing eq. 
","element":"span"},{"href":"#id-9","text":"1 ","element":"a"},{"text":"using Gradient Descent (GD) with a fixed learning rate ","element":"span"},{"style":{"height":15.6},"width":109.92,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-3.png","element":"img","alt":" η, i.e.,","inline":true,"padRight":true},{"text":"with steps of the form:","element":"span"}],[{"id":"id-10","style":{"width":"85%"},"width":1482,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-4.png","element":"img"}],[{"text":"We do not require convexity. Under Assumptions 1 and 2, gradient descent converges to the global minimum (i.e. to zero loss) even without it:","element":"span"}],[{"id":"id-16","text":"Lemma 1 ","element":"span"},{"text":"Let ","element":"span"},{"text":"w ","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"be the iterates of gradient descent (eq. ","element":"span"},{"href":"#id-10","text":"2) ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":19.14},"width":463.2,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-5.png","element":"img","alt":" η < 2β−1σ−2max (X ) and","inline":true,"padRight":true},{"text":"any starting point ","element":"span"},{"text":"w","element":"span"},{"text":"(0)","element":"span"},{"text":". 
Under Assumptions ","element":"span"},{"href":"#id-8","text":"1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-8","text":"2, ","element":"a"},{"text":"we have: (1) ","element":"span"},{"style":{"height":17.6},"width":489.24,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-6.png","element":"img","alt":" limt→∞ L (w (t)) = 0, (2)","inline":true},{"style":{"height":17.6},"width":1084.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-7.png","element":"img","alt":"limt→∞ ∥w (t)∥ = ∞, and (3) ∀n : limt→∞ w (t)⊤ xn = ∞.","inline":true}],[{"text":"Proof Since the data is linearly separable, ","element":"span"},{"style":{"height":15.09},"width":77.96,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-8.png","element":"img","alt":" ∃w∗","inline":true,"padRight":true},{"text":"which linearly separates the data, and therefore","element":"span"}],[{"style":{"width":"38%"},"width":672,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-9.png","element":"img"}],[{"text":"For any finite ","element":"span"},{"text":"w","element":"span"},{"text":", this sum cannot be equal to zero, as a sum of negative terms, since ","element":"span"},{"style":{"height":16.62},"width":288.88,"height":41.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-10.png","element":"img","alt":" ∀n : w⊤∗ xn > 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.6},"width":280.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-11.png","element":"img","alt":" ∀u : ℓ′ (u) < 0","inline":true},{"text":". 
Therefore, there are no finite critical points ","element":"span"},{"text":"w","element":"span"},{"text":", for which ","element":"span"},{"style":{"height":17.6},"width":327.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-12.png","element":"img","alt":" ∇L (w) = 0. But","inline":true,"padRight":true},{"text":"gradient descent on a smooth loss with an appropriate stepsize is always guaranteed to converge to a critical point: ","element":"span"},{"style":{"height":17.6},"width":463.64,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-13.png","element":"img","alt":" ∇L (w (t)) → 0 (see, e.g.","inline":true,"padRight":true},{"text":"Lemma ","element":"span"},{"href":"#id-11","text":"10 ","element":"a"},{"text":"in Appendix ","element":"span"},{"href":"#id-12","text":"A.4, ","element":"a"},{"text":"slightly adapted from ","element":"span"},{"href":"#id-13","referenceIndex":4,"text":"Ganti ","element":"a"},{"href":"#id-13","referenceIndex":4,"text":"(2015","element":"a"},{"text":"), Theorem 2). This necessarily implies that ","element":"span"},{"style":{"height":17.6},"width":825.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-14.png","element":"img","alt":" ∥w (t)∥ → ∞ while ∀n : w (t)⊤ xn > 0 for","inline":true,"padRight":true},{"text":"large enough ","element":"span"},{"text":"t","element":"span"},{"text":"—since only then ","element":"span"},{"style":{"height":31.6},"width":360.4,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-15.png","element":"img","alt":" ℓ′ �w (t)⊤ xn�→ 0","inline":true},{"text":". 
Therefore, ","element":"span"},{"style":{"height":17.6},"width":201.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-16.png","element":"img","alt":" L (w) → 0","inline":true},{"text":", so GD converges to the global minimum.","element":"span"}],[{"text":"The main question we ask is: can we characterize the direction in which ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"diverges? That is, does the limit ","element":"span"},{"style":{"height":17.6},"width":417.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-17.png","element":"img","alt":" limt→∞ w (t) / ∥w (t)∥","inline":true,"padRight":true},{"text":"always exist, and if so, what is it?","element":"span"}],[{"text":"In order to analyze this limit, we will need to make a further assumption on the tail of the loss function:","element":"span"}],[{"id":"id-14","text":"Definition 2 ","element":"span"},{"text":"A function ","element":"span"},{"text":"f ","element":"span"},{"text":"(","element":"span"},{"text":"u","element":"span"},{"text":") ","element":"span"},{"text":"has a “tight exponential tail”, if there exist positive constants ","element":"span"},{"style":{"height":12},"width":279.44,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-18.png","element":"img","alt":" c, a, µ+, µ−, u+","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.09},"width":223.76,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-19.png","element":"img","alt":" u− such that","inline":true}],[{"style":{"width":"46%"},"width":802,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-20.png","element":"img"}],[{"id":"id-15","text":"Assumption 3 ","element":"span"},{"text":"The negative loss derivative 
","element":"span"},{"style":{"height":17.6},"width":129.32,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-21.png","element":"img","alt":" −ℓ′ (u)","inline":true,"padRight":true},{"text":"has a tight exponential tail (Definition ","element":"span"},{"href":"#id-14","text":"2)","element":"a"},{"text":".","element":"span"}],[{"text":"For example, the exponential loss ","element":"span"},{"style":{"height":17.6},"width":223.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-22.png","element":"img","alt":" ℓ (u) = e−u ","inline":true,"padRight":true},{"text":"and the commonly used logistic loss ","element":"span"},{"style":{"height":17.6},"width":137.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-23.png","element":"img","alt":" ℓ (u) =","inline":true},{"style":{"height":17.6},"width":241.16,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-24.png","element":"img","alt":"log (1 + e−u)","inline":true,"padRight":true},{"text":"both follow this assumption with ","element":"span"},{"text":"a ","element":"span"},{"text":"= ","element":"span"},{"text":"c ","element":"span"},{"text":"= 1","element":"span"},{"text":". 
We will assume ","element":"span"},{"text":"a ","element":"span"},{"text":"= ","element":"span"},{"text":"c ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"— without loss of generality, since these constants can always be absorbed by re-scaling ","element":"span"},{"style":{"height":16.4},"width":167.96,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/2-25.png","element":"img","alt":" xn and η.","inline":true}],[{"text":"We are now ready to state our main result:","element":"span"}],[{"id":"id-19","text":"Theorem 3 ","element":"span"},{"text":"For any dataset which is linearly separable (Assumption ","element":"span"},{"href":"#id-8","text":"1)","element":"a"},{"text":", any ","element":"span"},{"style":{"height":16.4},"width":26,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-0.png","element":"img","alt":" β","inline":true},{"text":"-smooth decreasing loss function (Assumption ","element":"span"},{"href":"#id-8","text":"2) ","element":"a"},{"text":"with an exponential tail (Assumption ","element":"span"},{"href":"#id-15","text":"3)","element":"a"},{"text":", any stepsize ","element":"span"},{"style":{"height":19.15},"width":355.88,"height":47.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-1.png","element":"img","alt":" η < 2β−1σ−2max (X )","inline":true,"padRight":true},{"text":"and any starting point ","element":"span"},{"text":"w","element":"span"},{"text":"(0)","element":"span"},{"text":", the gradient descent iterates (as in eq. 
","element":"span"},{"href":"#id-10","text":"2) ","element":"a"},{"text":"will behave as:","element":"span"}],[{"id":"id-20","style":{"width":"62%"},"width":1084,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.28},"width":198.44,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-3.png","element":"img","alt":" ˆw is the L2","inline":true,"padRight":true},{"text":"max margin vector (the solution to the hard margin SVM):","element":"span"}],[{"id":"id-18","style":{"width":"68%"},"width":1182,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-4.png","element":"img"}],[{"text":"and the residual grows at most as ","element":"span"},{"style":{"height":17.6},"width":560.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-5.png","element":"img","alt":" ∥ρ (t)∥ = O(log log(t)), and so","inline":true}],[{"style":{"width":"22%"},"width":391,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-6.png","element":"img"}],[{"text":"Furthermore, for almost all data sets (all except measure zero), the residual ","element":"span"},{"style":{"height":17.6},"width":72.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-7.png","element":"img","alt":" ρ(t)","inline":true,"padRight":true},{"text":"is bounded.","element":"span"}],[{"text":"Proof Sketch (complete proof in the appendix) ","element":"span"},{"text":"We first understand intuitively why an exponential tail of the loss entail asymptotic convergence to the max margin vector: Assume for simplicity that ","element":"span"},{"style":{"height":17.6},"width":217.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-8.png","element":"img","alt":" ℓ (u) = e−u ","inline":true,"padRight":true},{"text":"exactly, and examine the 
asymptotic regime of gradient descent in which ","element":"span"},{"style":{"height":17.6},"width":394.88,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-9.png","element":"img","alt":"∀n : w (t)⊤ xn → ∞","inline":true},{"text":", as is guaranteed by Lemma ","element":"span"},{"href":"#id-16","text":"1. ","element":"a"},{"text":"Suppose ","element":"span"},{"style":{"height":17.6},"width":267.76,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-10.png","element":"img","alt":" w (t) / ∥w (t)∥","inline":true,"padRight":true},{"text":"converges to some limit ","element":"span"},{"style":{"height":10.69},"width":70.48,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-11.png","element":"img","alt":" w∞","inline":true,"padRight":true},{"text":"so we can write ","element":"span"},{"style":{"height":17.6},"width":1168.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-12.png","element":"img","alt":" w (t) = g (t) w∞ + ρ (t) such that g (t) → ∞, ∀n :x⊤n w∞ > 0,","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.6},"width":422.32,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-13.png","element":"img","alt":" limt→∞ ρ (t) /g (t) = 0","inline":true},{"text":". The gradient can then be written as:","element":"span"}],[{"id":"id-17","style":{"width":"97%"},"width":1691,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-14.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":17.6},"width":192.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-15.png","element":"img","alt":" g(t) → ∞","inline":true,"padRight":true},{"text":"and the exponents become more negative, only those samples with the largest (i.e., least negative) exponents will contribute to the gradient. 
These are precisely the samples with the smallest margin ","element":"span"},{"style":{"height":16.02},"width":275.4,"height":40.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-16.png","element":"img","alt":" argminnw⊤∞xn","inline":true},{"text":", aka the “support vectors”. The negative gradient (eq. ","element":"span"},{"href":"#id-17","text":"5) ","element":"a"},{"text":"would then asymptotically become a non-negative linear combination of support vectors. The limit ","element":"span"},{"style":{"height":10.69},"width":70.48,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-17.png","element":"img","alt":"w∞","inline":true,"padRight":true},{"text":"will then be dominated by these gradients, since any initial conditions become negligible as ","element":"span"},{"style":{"height":17.6},"width":253.76,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-18.png","element":"img","alt":"∥w (t)∥ → ∞","inline":true,"padRight":true},{"text":"(from Lemma ","element":"span"},{"href":"#id-16","text":"1)","element":"a"},{"text":". Therefore, ","element":"span"},{"style":{"height":10.69},"width":70.48,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-19.png","element":"img","alt":" w∞","inline":true,"padRight":true},{"text":"will also be a non-negative linear combination of support vectors, and so will its scaling ","element":"span"},{"style":{"height":20.8},"width":459.68,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-20.png","element":"img","alt":" ˆw = w∞/�minn w⊤∞xn�","inline":true},{"text":". We therefore have:","element":"span"}],[{"id":"id-25","style":{"width":"92%"},"width":1602,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-21.png","element":"img"}],[{"text":"These are precisely the KKT conditions for the SVM problem (eq. 
","element":"span"},{"href":"#id-18","text":"4) ","element":"a"},{"text":"and we can conclude that ","element":"span"},{"style":{"height":12.59},"width":37,"height":31.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-22.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"text":"is indeed its solution and ","element":"span"},{"style":{"height":10.69},"width":70.48,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-23.png","element":"img","alt":" w∞","inline":true,"padRight":true},{"text":"is thus proportional to it.","element":"span"}],[{"text":"To prove Theorem ","element":"span"},{"href":"#id-19","text":"3 ","element":"a"},{"text":"rigorously, we need to show that ","element":"span"},{"style":{"height":17.6},"width":268.24,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-24.png","element":"img","alt":" w (t) / ∥w (t)∥","inline":true,"padRight":true},{"text":"has a limit, that ","element":"span"},{"style":{"height":12.8},"width":84,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-25.png","element":"img","alt":" ∀n :","inline":true},{"style":{"height":17.6},"width":550.76,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-26.png","element":"img","alt":"w⊤∞xn > 0, that g (t) = log (t)","inline":true,"padRight":true},{"text":"and to bound the effect of various residual errors, such as gradients ","element":"span"},{"text":"of non-support vectors and the fact that the loss is only approximately exponential. To do so, we substitute eq. ","element":"span"},{"href":"#id-20","text":"3 ","element":"a"},{"text":"into the gradient descent dynamics (eq. 
","element":"span"},{"href":"#id-10","text":"2)","element":"a"},{"text":", with ","element":"span"},{"style":{"height":15.28},"width":175.24,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-27.png","element":"img","alt":" w∞ = ˆw","inline":true,"padRight":true},{"text":"being the max margin vector and ","element":"span"},{"text":"g","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") = log ","element":"span"},{"text":"t","element":"span"},{"text":". We then show that, except when certain degeneracies occur, the increment in the norm of ","element":"span"},{"style":{"height":17.6},"width":83.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-28.png","element":"img","alt":" ρ (t)","inline":true,"padRight":true},{"text":"is bounded by ","element":"span"},{"style":{"height":15.09},"width":606.16,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-29.png","element":"img","alt":" C1t−ν for some C1 > 0 and ν > 1","inline":true},{"text":", which is a converging series. This happens because the increment in the max margin term, ","element":"span"},{"style":{"height":19.14},"width":610.04,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-30.png","element":"img","alt":" ˆw [log (t + 1) − log (t)] ≈ ˆwt−1,","inline":true,"padRight":true},{"text":"cancels out the dominant ","element":"span"},{"style":{"height":15.14},"width":59.24,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-31.png","element":"img","alt":" t−1 ","inline":true,"padRight":true},{"text":"term in the gradient ","element":"span"},{"style":{"height":17.6},"width":235.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-32.png","element":"img","alt":" −∇L (w (t))","inline":true,"padRight":true},{"text":"(eq. 
","element":"span"},{"href":"#id-17","text":"5 ","element":"a"},{"text":"with ","element":"span"},{"text":"g ","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") = log (","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":16.42},"width":226.52,"height":41.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/3-33.png","element":"img","alt":"w⊤∞xn = 1).","inline":true}],[{"text":"Degenerate and Non-Degenerate Data Sets ","element":"span"},{"text":"An earlier conference version of this paper ","element":"span"},{"href":"#id-21","referenceIndex":20,"text":"(Soudry et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":20,"text":"2018","element":"a"},{"text":") included a partial version of Theorem ","element":"span"},{"href":"#id-19","text":"3, ","element":"a"},{"text":"which only applies to almost all data sets, in which case we can ensure the residual ","element":"span"},{"style":{"height":17.6},"width":72.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-0.png","element":"img","alt":" ρ(t)","inline":true,"padRight":true},{"text":"is bounded. This partial statement (for almost all data sets) is restated and proved as Theorem ","element":"span"},{"href":"#id-22","text":"9 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"A. ","element":"span"},{"text":"It applies, e.g. with probability one for data sampled from any absolutely continuous distribution. 
It does not apply in “degenerate” cases where some of the support vectors ","element":"span"},{"style":{"height":10.69},"width":47.4,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-1.png","element":"img","alt":" xn","inline":true,"padRight":true},{"text":"(for which ","element":"span"},{"style":{"height":15.28},"width":204.88,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-2.png","element":"img","alt":" ˆw⊤xn = 1","inline":true},{"text":") are associated with dual variables that are zero (","element":"span"},{"style":{"height":14.29},"width":130.96,"height":35.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-3.png","element":"img","alt":"αn = 0","inline":true},{"text":") in the dual optimum of ","element":"span"},{"href":"#id-18","text":"4. ","element":"a"},{"text":"As we show in Appendix ","element":"span"},{"text":"B, ","element":"span"},{"text":"this only happens on measure zero data sets. Here, we prove the more general result which applies for all data sets, including degenerate data sets. To do so, in Theorem ","element":"span"},{"href":"#id-23","text":"13 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"C ","element":"span"},{"text":"we provide a more complete characterization of the iterates ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"that explicitly specifies all unbounded components even in the degenerate case. 
We then prove the Theorem by plugging in this more complete characterization and showing that the residual is bounded, thus also establishing Theorem ","element":"span"},{"href":"#id-19","text":"3.","element":"a"}],[{"text":"Parallel Work on the Degenerate Case ","element":"span"},{"text":"Following the publication of our initial version, and while preparing this revised version for publication, we learned of parallel work by Ziwei Ji and Matus Telgarsky that also closes this gap. ","element":"span"},{"href":"#id-24","referenceIndex":10,"text":"Ji and Telgarsky ","element":"a"},{"href":"#id-24","referenceIndex":10,"text":"(2018","element":"a"},{"text":") provide an analysis of the degenerate case, establishing convergence to the max margin predictor by showing that","element":"span"},{"style":{"height":32.53},"width":255.55,"height":81.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-4.png","element":"img","alt":"∥ w(t)/∥w(t)∥ − ˆw/∥ˆw∥ ∥","inline":true}],[{"style":{"width":"99%"},"width":1724,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-5.png","element":"img"}],[{"text":"the convergence is actually quadratically faster (see Section ","element":"span"},{"text":"3)","element":"span"},{"text":". However, Ji and Telgarsky go even further and provide a characterization also when the data is non-separable but ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"still goes to infinity.","element":"span"}],[{"text":"More Refined Analysis of the Residual ","element":"span"},{"text":"In some non-degenerate cases, we can further characterize the asymptotic behavior of ","element":"span"},{"style":{"height":17.6},"width":83.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-6.png","element":"img","alt":" ρ (t)","inline":true},{"text":". 
To do so, we need to refer to the KKT conditions (eq. ","element":"span"},{"href":"#id-25","text":"6) ","element":"a"},{"text":"of the SVM problem (eq. ","element":"span"},{"href":"#id-18","text":"4) ","element":"a"},{"text":"and the associated support vectors ","element":"span"},{"style":{"height":16.62},"width":357,"height":41.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-7.png","element":"img","alt":" S = argminn ˆw⊤xn","inline":true},{"text":". We then have the following Theorem, proved in Appendix ","element":"span"},{"text":"A:","element":"span"}],[{"id":"id-53","text":"Theorem 4 ","element":"span"},{"text":"Under the conditions and notation of Theorem ","element":"span"},{"href":"#id-19","text":"3, ","element":"a"},{"text":"for almost all datasets, if in addition the support vectors span the data (i.e. ","element":"span"},{"style":{"height":17.6},"width":607.12,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-8.png","element":"img","alt":" rank (XS) = rank (X), where XS","inline":true,"padRight":true},{"text":"is a matrix whose columns are only those data points ","element":"span"},{"style":{"height":17.6},"width":941.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-9.png","element":"img","alt":" xn s.t. 
ˆw⊤xn = 1), then limt→∞ ρ (t) = ˜w, where ˜w","inline":true,"padRight":true},{"text":"is a solution to","element":"span"}],[{"id":"id-54","style":{"width":"66%"},"width":1153,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-10.png","element":"img"}],[{"text":"Analogies with Boosting ","element":"span"},{"text":"Perhaps most similar to our study is the line of work on understanding AdaBoost in terms of its implicit bias toward large ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-11.png","element":"img","alt":" L1","inline":true},{"text":"-margin solutions, starting with the seminal work of ","element":"span"},{"href":"#id-26","referenceIndex":19,"text":"Schapire et al. ","element":"a"},{"href":"#id-26","referenceIndex":19,"text":"(1998","element":"a"},{"text":"). Since AdaBoost can be viewed as coordinate descent on the exponential loss of a linear model, these results can be interpreted as analyzing the bias of coordinate descent, rather than gradient descent, on a monotone decreasing loss with an exact exponential tail. Indeed, with small enough step sizes, such a coordinate descent procedure does converge precisely to the maximum ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-12.png","element":"img","alt":" L1","inline":true},{"text":"-margin solution ","element":"span"},{"href":"#id-27","referenceIndex":24,"text":"(Zhang et al., ","element":"a"},{"href":"#id-27","referenceIndex":24,"text":"2005","element":"a"},{"text":"; ","element":"span"},{"href":"#id-28","referenceIndex":21,"text":"Telgarsky","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":21,"text":"2013","element":"a"},{"text":"). 
In fact, ","element":"span"},{"href":"#id-28","referenceIndex":21,"text":"Telgarsky ","element":"a"},{"href":"#id-28","referenceIndex":21,"text":"(2013","element":"a"},{"text":") also generalizes these results to other losses with tight exponential tails, similar to the class of losses we consider here.","element":"span"}],[{"text":"Also related is the work of ","element":"span"},{"href":"#id-29","referenceIndex":18,"text":"Rosset et al. ","element":"a"},{"href":"#id-29","referenceIndex":18,"text":"(2004","element":"a"},{"text":"). They considered the regularization path ","element":"span"},{"style":{"height":10.88},"width":105.52,"height":27.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-13.png","element":"img","alt":" wλ =","inline":true},{"style":{"height":21.25},"width":426.48,"height":53.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-14.png","element":"img","alt":"arg min L(w)+λ ∥w∥pp ","inline":true,"padRight":true},{"text":"for similar loss functions as we do, and showed that ","element":"span"},{"style":{"height":20.78},"width":387.08,"height":51.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-15.png","element":"img","alt":" limλ→0 wλ/ ∥wλ∥p is","inline":true,"padRight":true},{"text":"proportional to the maximum ","element":"span"},{"style":{"height":17.09},"width":47.76,"height":42.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-16.png","element":"img","alt":" Lp","inline":true,"padRight":true},{"text":"margin solution. That is, they showed how adding infinitesimal ","element":"span"},{"style":{"height":17.09},"width":47.76,"height":42.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/4-17.png","element":"img","alt":" Lp","inline":true,"padRight":true},{"text":"(e.g. 
","element":"span"},{"style":{"height":15.09},"width":185.96,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-0.png","element":"img","alt":" L1 and L2","inline":true},{"text":") regularization to logistic-type losses gives rise to the corresponding max-margin predictor.","element":"span"},{"text":"3 ","element":"span"},{"text":"However, ","element":"span"},{"href":"#id-29","referenceIndex":18,"text":"Rosset et al. ","element":"a"},{"text":"do not consider the effect of the optimization algorithm, and instead add explicit regularization. Here we are specifically interested in the bias implied by the algorithm not by adding (even infinitesimal) explicit regularization. We see that coordinate descent gives rise to the max ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-1.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"margin predictor, while gradient descent gives rise to the max ","element":"span"},{"style":{"height":14.69},"width":155.48,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-2.png","element":"img","alt":" L2 norm","inline":true,"padRight":true},{"text":"predictor. 
In Section ","element":"span"},{"href":"#id-30","text":"4.3 ","element":"a"},{"text":"and in follow-up work ","element":"span"},{"href":"#id-31","referenceIndex":6,"text":"(Gunasekar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":6,"text":"2018","element":"a"},{"text":") we discuss also other optimization algorithms, and their implied biases.","element":"span"}],[{"text":"Non-homogeneous linear predictors ","element":"span"},{"text":"In this paper we focused on homogeneous linear predictors of the form ","element":"span"},{"style":{"height":8},"width":92.28,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-3.png","element":"img","alt":" w⊤x","inline":true},{"text":", similarly to previous works (e.g., ","element":"span"},{"href":"#id-29","referenceIndex":18,"text":"Rosset et al. ","element":"a"},{"href":"#id-29","referenceIndex":18,"text":"(2004","element":"a"},{"text":"); ","element":"span"},{"href":"#id-28","referenceIndex":21,"text":"Telgarsky ","element":"a"},{"href":"#id-28","referenceIndex":21,"text":"(2013","element":"a"},{"text":")). Specifically, we did not have the common intercept term: ","element":"span"},{"style":{"height":14},"width":168.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-4.png","element":"img","alt":" w⊤x + b","inline":true},{"text":". One may be tempted to introduce the intercept in the usual way, i.e., by extending all the input vectors ","element":"span"},{"style":{"height":10.69},"width":47.4,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-5.png","element":"img","alt":" xn","inline":true,"padRight":true},{"text":"with an additional ","element":"span"},{"style":{"height":12},"width":145.08,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-6.png","element":"img","alt":"′1′ com-","inline":true,"padRight":true},{"text":"ponent. 
In this extended input space, naturally, all our results hold. Therefore, we converge in direction to the ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-7.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"max margin solution (eq. ","element":"span"},{"href":"#id-18","text":"4) ","element":"a"},{"text":"in the extended space. However, if we translate this solution to the original ","element":"span"},{"text":"x ","element":"span"},{"text":"space we obtain","element":"span"}],[{"style":{"width":"41%"},"width":721,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-8.png","element":"img"}],[{"text":"which is not the ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-9.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"max margin (SVM) solution","element":"span"}],[{"style":{"width":"36%"},"width":637,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-10.png","element":"img"}],[{"text":"where we do not have a ","element":"span"},{"style":{"height":15.14},"width":35.72,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-11.png","element":"img","alt":" b2 ","inline":true,"padRight":true},{"text":"penalty in the objective.","element":"span"}]]},{"heading":"3. Implications: Rates of convergence","paragraphs":[[{"text":"The solution in eq. 
","element":"span"},{"href":"#id-20","text":"3 ","element":"a"},{"text":"implies that ","element":"span"},{"style":{"height":17.6},"width":267.76,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-12.png","element":"img","alt":" w (t) / ∥w (t)∥","inline":true,"padRight":true},{"text":"converges to the normalized max margin vector ","element":"span"},{"style":{"height":17.6},"width":165.6,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-13.png","element":"img","alt":" ˆw/ ∥ ˆw∥ .","inline":true,"padRight":true},{"text":"Moreover, this convergence is very slow— logarithmic in the number of iterations. Specifically, our results imply the following tight rates of convergence:","element":"span"}],[{"id":"id-34","text":"Theorem 5 ","element":"span"},{"text":"Under the conditions and notation of Theorem ","element":"span"},{"href":"#id-19","text":"3, ","element":"a"},{"text":"for any linearly separable data set, the normalized weight vector converges to the normalized max margin vector in ","element":"span"},{"style":{"height":14.69},"width":151.52,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-14.png","element":"img","alt":" L2 norm","inline":true}],[{"id":"id-132","style":{"width":"69%"},"width":1201,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-15.png","element":"img"}],[{"text":"with this rate improving to ","element":"span"},{"text":"O","element":"span"},{"text":"(1","element":"span"},{"text":"/ ","element":"span"},{"text":"log(","element":"span"},{"text":"t","element":"span"},{"text":")) ","element":"span"},{"text":"for almost every dataset; and in angle","element":"span"}],[{"id":"id-36","style":{"width":"71%"},"width":1232,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-16.png","element":"img"}],[{"text":"with this rate improving to 
","element":"span"},{"style":{"height":19.91},"width":244.04,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-17.png","element":"img","alt":" O(1/ log2(t))","inline":true,"padRight":true},{"text":"for almost every dataset; and the margin converges as","element":"span"}],[{"id":"id-32","style":{"width":"69%"},"width":1203,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/5-18.png","element":"img"}],[{"text":"On the other hand, the loss itself decreases as","element":"span"}],[{"id":"id-33","style":{"width":"61%"},"width":1058,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-0.png","element":"img"}],[{"text":"All the rates in the above Theorem are a direct consequence of Theorem ","element":"span"},{"href":"#id-19","text":"3, ","element":"a"},{"text":"except for avoiding the ","element":"span"},{"text":"log log ","element":"span"},{"text":"t ","element":"span"},{"text":"factor for the degenerate cases in eq. ","element":"span"},{"href":"#id-32","text":"10 ","element":"a"},{"text":"and eq. ","element":"span"},{"href":"#id-33","text":"11 ","element":"a"},{"text":"(i.e., establishing that the rates ","element":"span"},{"text":"1","element":"span"},{"text":"/ ","element":"span"},{"text":"log ","element":"span"},{"text":"t ","element":"span"},{"text":"and ","element":"span"},{"text":"1","element":"span"},{"text":"/t ","element":"span"},{"text":"always hold)—this additional improvement is a consequence of the more complete characterization of Theorem ","element":"span"},{"href":"#id-23","text":"13. ","element":"a"},{"text":"Full details are provided in Appendix ","element":"span"},{"text":"D. 
","element":"span"},{"text":"In this appendix, we also provide a simple construction showing all the rates in Theorem ","element":"span"},{"href":"#id-34","text":"5 ","element":"a"},{"text":"are tight (except possibly for the ","element":"span"},{"text":"log log ","element":"span"},{"text":"t ","element":"span"},{"text":"factors).","element":"span"}],[{"text":"The sharp contrast between the tight logarithmic and ","element":"span"},{"text":"1","element":"span"},{"text":"/t ","element":"span"},{"text":"rates in Theorem ","element":"span"},{"href":"#id-34","text":"5 ","element":"a"},{"text":"implies that the convergence of ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"to the max-margin ","element":"span"},{"style":{"height":12.59},"width":37,"height":31.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-1.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"text":"can be logarithmic in the loss itself, and we might need to wait until the loss is exponentially small in order to be close to the max-margin solution. This can help explain why continuing to optimize the training loss, even after the training error is zero and the training loss is extremely small, still improves generalization performance: our results suggest that the margin could still be improving significantly in this regime.","element":"span"}],[{"text":"A numerical illustration of the convergence is depicted in Figure ","element":"span"},{"href":"#id-35","text":"1. 
","element":"a"},{"text":"As predicted by the theory, the norm ","element":"span"},{"style":{"height":17.6},"width":130.48,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-2.png","element":"img","alt":" ∥w(t)∥","inline":true,"padRight":true},{"text":"grows logarithmically (note the semi-log scaling), and ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"converges to the max-margin separator, but only logarithmically, while the loss itself decreases very rapidly (note the log-log scaling).","element":"span"}],[{"text":"An important practical consequence of our theory is that although the margin of ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"keeps improving, and so we can expect the population (or test) misclassification error of ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"to improve for many datasets, the same cannot be said about the expected population loss (or test loss)! At the limit, the direction of ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"will converge toward the max margin predictor ","element":"span"},{"style":{"height":16.59},"width":427.12,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-3.png","element":"img","alt":" ˆw. Although ˆw has zero","inline":true,"padRight":true},{"text":"training error, it will not generally have zero misclassification error on the population, or on a test or a validation set. 
Since the norm of ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"will increase, if we use the logistic loss or any other convex loss, the loss incurred on those misclassified points will also increase. More formally, consider the logistic loss ","element":"span"},{"style":{"height":17.6},"width":365,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-4.png","element":"img","alt":" ℓ(u) = log(1 + e−u)","inline":true,"padRight":true},{"text":"and define also the hinge-at-zero loss ","element":"span"},{"style":{"height":17.6},"width":478.88,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-5.png","element":"img","alt":" h(u) = max(0, −u). Since","inline":true},{"style":{"height":12.59},"width":37,"height":31.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-6.png","element":"img","alt":"ˆw","inline":true,"padRight":true},{"text":"classifies all training points correctly, we have that on the training set ","element":"span"},{"style":{"height":22.05},"width":402.68,"height":55.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-7.png","element":"img","alt":"∑Nn=1 h( ˆw⊤xn) = 0.","inline":true,"padRight":true},{"text":"However, on the population we would expect some errors and so ","element":"span"},{"style":{"height":17.6},"width":567.76,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-8.png","element":"img","alt":" E[h( ˆw⊤x)] > 0. 
Since w(t) ≈","inline":true},{"style":{"height":17.6},"width":867.36,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-9.png","element":"img","alt":"ˆw log t and ℓ(αu) → αh(u) as α → ∞, we have:","inline":true}],[{"style":{"width":"84%"},"width":1468,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-10.png","element":"img"}],[{"text":"That is, the population loss increases logarithmically while the margin and the population misclassification error improve. Roughly speaking, the improvement in misclassification does not outweigh the increase in the loss of those points still misclassified.","element":"span"}],[{"text":"The increase in the test loss is practically important because the loss on a validation set is frequently used to monitor progress and decide on stopping. Similar to the population loss, the validation loss will increase logarithmically with ","element":"span"},{"text":"t","element":"span"},{"text":", if there is at least one sample in the validation set which is classified incorrectly by the max margin vector (since we would not expect zero validation error). 
More precisely, as a direct consequence of Theorem ","element":"span"},{"href":"#id-19","text":"3 ","element":"a"},{"text":"(as shown in Appendix ","element":"span"},{"text":"D)","element":"span"},{"text":":","element":"span"}],[{"text":"Corollary 6 Let ","element":"span"},{"style":{"height":12.8},"width":18,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-11.png","element":"img","alt":" ℓ","inline":true,"padRight":true},{"text":"be the logistic loss, and ","element":"span"},{"text":"V ","element":"span"},{"text":"be an independent validation set, for which ","element":"span"},{"style":{"height":13.2},"width":140.88,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-12.png","element":"img","alt":" ∃x ∈ V","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":13.39},"width":172.24,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-13.png","element":"img","alt":" x⊤ ˆw < 0","inline":true},{"text":". 
Then the validation loss increases as","element":"span"}],[{"style":{"width":"47%"},"width":813,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/6-14.png","element":"img"}],[{"style":{"width":"98%"},"width":1704,"height":544,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-0.png","element":"img"}],[{"id":"id-35","text":"Figure 1: Visualization of our main results on a synthetic dataset in which the ","element":"figcaption","subtype":"caption"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-1.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"max margin vector ","element":"figcaption","subtype":"caption"},{"style":{"height":12.59},"width":37,"height":31.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-2.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"text":"is precisely known. (A) The dataset (positive and negative samples (","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":169.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-3.png","element":"img","alt":"y = ±1)","inline":true,"padRight":true},{"text":"are respectively denoted by ","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":192.64,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-4.png","element":"img","alt":"′+′ and ′◦′","inline":true},{"text":"), max margin separating hyperplane (black line), and the asymptotic solution of GD (dashed blue). 
For both GD and GD with momentum (GDMO), we show: (B) The norm of ","element":"figcaption","subtype":"caption"},{"text":"w ","element":"figcaption","subtype":"caption"},{"text":"(","element":"figcaption","subtype":"caption"},{"text":"t","element":"figcaption","subtype":"caption"},{"text":")","element":"figcaption","subtype":"caption"},{"text":", normalized so it would equal ","element":"figcaption","subtype":"caption"},{"text":"1 ","element":"figcaption","subtype":"caption"},{"text":"at the last iteration, to facilitate comparison. As expected (eq. ","element":"figcaption","subtype":"caption"},{"href":"#id-20","text":"3)","element":"a","subtype":"caption"},{"text":", the norm increases logarithmically; (C) the training loss. As expected, it decreases as ","element":"figcaption","subtype":"caption"},{"style":{"height":15.14},"width":59.24,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-5.png","element":"img","alt":" t−1 ","inline":true,"padRight":true},{"text":"(eq. ","element":"figcaption","subtype":"caption"},{"href":"#id-33","text":"11)","element":"a","subtype":"caption"},{"text":"; and (D&E) the angle and margin gap of ","element":"figcaption","subtype":"caption"},{"style":{"height":17.6},"width":241,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-6.png","element":"img","alt":" w (t) from ˆw","inline":true,"padRight":true},{"text":"(eqs. ","element":"figcaption","subtype":"caption"},{"href":"#id-36","text":"9 ","element":"a","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"href":"#id-32","text":"10)","element":"a","subtype":"caption"},{"text":". As expected, these are logarithmically decreasing to zero. 
Implementation details: The dataset includes four support vectors: ","element":"figcaption","subtype":"caption"},{"style":{"height":17.6},"width":1394.12,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-7.png","element":"img","alt":" x1 = (0.5, 1.5) , x2 = (1.5, 0.5) with y1 = y2 = 1, and x3 = −x1, x4 = −x2","inline":true,"padRight":true},{"text":"with ","element":"figcaption","subtype":"caption"},{"style":{"height":16.4},"width":397.16,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-8.png","element":"img","alt":" y3 = y4 = −1 (the L2","inline":true,"padRight":true},{"text":"normalized max margin vector is then ","element":"figcaption","subtype":"caption"},{"style":{"height":19.6},"width":371.44,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-9.png","element":"img","alt":" ˆw = (1, 1) /√2 with","inline":true,"padRight":true},{"text":"margin equal to","element":"figcaption","subtype":"caption"},{"style":{"height":18.4},"width":220.72,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-10.png","element":"img","alt":"√2 ), and 12","inline":true,"padRight":true},{"text":"other random datapoints (","element":"figcaption","subtype":"caption"},{"text":"6 ","element":"figcaption","subtype":"caption"},{"text":"from each class), that are not on the margin. 
We used a learning rate ","element":"figcaption","subtype":"caption"},{"style":{"height":19.13},"width":617.96,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-11.png","element":"img","alt":" η = 1/σ2max (X), where σ2max (X)","inline":true,"padRight":true},{"text":"is the maximal ","element":"figcaption","subtype":"caption"},{"text":"singular value of ","element":"figcaption","subtype":"caption"},{"style":{"height":15.6},"width":407.92,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-12.png","element":"img","alt":" X, momentum γ = 0.9","inline":true,"padRight":true},{"text":"for GDMO, and initialized at the origin.","element":"figcaption","subtype":"caption"}],[{"text":"This behavior might cause us to think we are over-fitting or otherwise encourage us to stop the optimization. However, this increase does not actually represent the model getting worse, merely ","element":"span"},{"style":{"height":17.6},"width":130.48,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-13.png","element":"img","alt":"∥w(t)∥","inline":true,"padRight":true},{"text":"getting larger, and in fact the model might be getting better (increasing the margin and possibly decreasing the error rate).","element":"span"}]]},{"heading":"4. Extensions","paragraphs":[[{"text":"4.1 Multi-Class Classification with Cross-Entropy Loss","element":"span"}],[{"text":"So far, we have discussed the problem of binary classification, but in many practical situations, we have more than two classes. For multi-class problems, the labels are the class indices ","element":"span"},{"style":{"height":13.6},"width":93.32,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-14.png","element":"img","alt":" yn ∈","inline":true},{"style":{"height":17.6},"width":332.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-15.png","element":"img","alt":"[K] ≜ {1, . . 
. , K}","inline":true,"padRight":true},{"text":"and we learn a predictor ","element":"span"},{"style":{"height":10.88},"width":54.48,"height":27.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-16.png","element":"img","alt":" wk","inline":true,"padRight":true},{"text":"for each class ","element":"span"},{"style":{"height":17.6},"width":148.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-17.png","element":"img","alt":" k ∈ [K]","inline":true},{"text":". A common loss function in multi-class classification is the following cross-entropy loss with a softmax output, which is a generalization of the logistic loss:","element":"span"}],[{"id":"id-38","style":{"width":"76%"},"width":1328,"height":133,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/7-18.png","element":"img"}],[{"text":"What do the linear predictors ","element":"span"},{"style":{"height":17.6},"width":106.76,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-0.png","element":"img","alt":" wk(t)","inline":true,"padRight":true},{"text":"converge to if we minimize the cross-entropy loss by gradient descent on the predictors? In Appendix ","element":"span"},{"text":"E ","element":"span"},{"text":"we analyze this problem for separable data and show that again, the predictors diverge to infinity and the loss converges to zero. 
Next, to determine the direction to which these predictors converge, we define ","element":"span"},{"style":{"height":15.47},"width":54.48,"height":38.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-1.png","element":"img","alt":" ˆwk","inline":true,"padRight":true},{"text":"as the solution of the ","element":"span"},{"text":"K","element":"span"},{"text":"-class SVM:","element":"span"}],[{"id":"id-37","style":{"width":"84%"},"width":1459,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-2.png","element":"img"}],[{"text":"for each ","element":"span"},{"style":{"height":18.29},"width":1419.28,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-3.png","element":"img","alt":" k ∈ [K], define Sk = arg minn( ˆwyn − ˆwk)⊤xn = {n : ( ˆwyn − ˆwk)⊤xn = 1}","inline":true},{"text":", i.e., the ","element":"span"},{"style":{"height":15.53},"width":56,"height":38.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-4.png","element":"img","alt":"kth ","inline":true,"padRight":true},{"text":"class support vectors, and define ","element":"span"},{"style":{"height":13.28},"width":76.08,"height":33.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-5.png","element":"img","alt":" αn,k","inline":true,"padRight":true},{"text":"as some positive dual variables for ","element":"span"},{"style":{"height":15.28},"width":44.4,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-6.png","element":"img","alt":" Sk","inline":true,"padRight":true},{"text":"that together satisfy the ","element":"span"},{"text":"K","element":"span"},{"text":"-class SVM KKT conditions. Using these definitions, we prove the following theorem:","element":"span"}],[{"id":"id-139","text":"Theorem 7 ","element":"span"},{"text":"For all multiclass datasets which are linearly separable (i.e. the constraints in eq. 
","element":"span"},{"href":"#id-37","text":"14 ","element":"a"},{"text":"below are feasible) and for which the equation","element":"span"}],[{"id":"id-39","style":{"width":"78%"},"width":1365,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-7.png","element":"img"}],[{"text":"has a solution ","element":"span"},{"style":{"height":20.51},"width":162.44,"height":51.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-8.png","element":"img","alt":" { ˜wk}Kk=1","inline":true},{"text":", the following holds: for any starting point ","element":"span"},{"text":"w","element":"span"},{"text":"(0) ","element":"span"},{"text":"and any small enough ","element":"span"},{"text":"stepsize, the iterates of gradient descent on eq. ","element":"span"},{"href":"#id-38","text":"13 ","element":"a"},{"text":"will behave as:","element":"span"}],[{"id":"id-140","style":{"width":"64%"},"width":1113,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-9.png","element":"img"}],[{"text":"where the residual ","element":"span"},{"style":{"height":17.6},"width":96.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/8-10.png","element":"img","alt":" ρk(t)","inline":true,"padRight":true},{"text":"is bounded.","element":"span"}],[{"text":"Note that here we had to assume eq. ","element":"span"},{"href":"#id-39","text":"15 ","element":"a"},{"text":"has a solution. In the binary case, we could prove that this equation has a solution for almost every dataset. In the original version of this manuscript, we incorrectly assumed that this proof in the binary case carries over to the multiclass case (as was pointed out to us by Yutong Wang). We therefore added the assumption that eq. ","element":"span"},{"href":"#id-39","text":"15 ","element":"a"},{"text":"has a solution. 
We conjecture this assumption should also be true for almost all datasets in the multiclass case (see Appendix H in ","element":"span"},{"href":"#id-40","referenceIndex":17,"text":"Ravi et al. ","element":"a"},{"href":"#id-40","referenceIndex":17,"text":"(2024","element":"a"},{"text":")), but we leave this proof for future work.","element":"span"}],[{"text":"4.2 Deep networks","element":"span"}],[{"text":"So far we have only considered linear prediction. Naturally, it is desirable to generalize our results also to non-linear models and especially multi-layer neural networks.","element":"span"}],[{"text":"Even without a formal extension and description of the precise bias, our results already shed light on how minimizing the cross-entropy loss with gradient descent can have a margin-maximizing effect, how the margin might improve only logarithmically slowly, and why it might continue to improve even as the validation loss increases. These effects are demonstrated in Figure ","element":"span"},{"href":"#id-41","text":"2 ","element":"a"},{"text":"and Table ","element":"span"},{"href":"#id-42","text":"1 ","element":"a"},{"text":"which portray typical training of a convolutional neural network using unregularized gradient descent","element":"span"},{"text":"4","element":"span"},{"text":". As can be seen, the norm of the weights increases, but the validation error continues decreasing, albeit very slowly (as predicted by the theory), even after the training error is zero and the training loss is extremely small. We can now understand how even though the loss is already extremely small, some sort of margin might be gradually improving as we continue optimizing. 
We can also observe how the validation loss increases despite the validation error decreasing, as discussed in Section ","element":"span"},{"text":"3.","element":"span"}],[{"style":{"width":"97%"},"width":1691,"height":421,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/9-0.png","element":"img"}],[{"id":"id-41","text":"Figure 2: Training of a convolutional neural network on CIFAR10 using stochastic gradient de-","element":"figcaption","subtype":"caption"}],[{"style":{"width":"89%"},"width":1553,"height":709,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/9-1.png","element":"img"}],[{"id":"id-42","text":"Table 1: Sample values from various epochs in the experiment depicted in Fig. ","element":"figcaption","subtype":"caption"},{"href":"#id-41","text":"2.","element":"a","subtype":"caption"}],[{"text":"As an initial advance toward tackling deep networks, we can point out that for several special cases, our results may be directly applied to multi-layered networks. First, somewhat trivially, our results may be applied directly to the last weight layer of a neural network if the last hidden layer becomes fixed and linearly separable after a certain number of iterations. 
This can become true, either approximately, if the input to the last hidden layer is normalized (e.g., using batch norm), or exactly, if the last hidden layer is quantized ","element":"span"},{"href":"#id-43","referenceIndex":9,"text":"(Hubara et al., ","element":"a"},{"href":"#id-43","referenceIndex":9,"text":"2018","element":"a"},{"text":").","element":"span"}],[{"text":"Second, as we show next, our results may be applied exactly on deep networks if only a single weight layer is being optimized, and, furthermore, after a sufficient number of iterations, the activation units stop switching and the training error goes to zero.","element":"span"}],[{"text":"Corollary 8 We examine a multilayer neural network with component-wise ReLU functions ","element":"span"},{"text":"f ","element":"span"},{"text":"(","element":"span"},{"text":"z","element":"span"},{"text":") = max [","element":"span"},{"text":"z, ","element":"span"},{"text":"0]","element":"span"},{"text":", and weights ","element":"span"},{"style":{"height":21.86},"width":161.96,"height":54.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/9-2.png","element":"img","alt":" {Wl}Ll=1","inline":true},{"text":". 
Given input ","element":"span"},{"style":{"height":10.69},"width":47.4,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/9-3.png","element":"img","alt":" xn","inline":true,"padRight":true},{"text":"and target ","element":"span"},{"style":{"height":17.6},"width":245.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/9-4.png","element":"img","alt":" yn ∈ {−1, 1}","inline":true},{"text":", the DNN produces a ","element":"span"},{"text":"scalar output","element":"span"}],[{"style":{"width":"99%"},"width":1727,"height":267,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/9-5.png","element":"img"}],[{"text":"signs, then ","element":"span"},{"style":{"height":17.6},"width":269.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-0.png","element":"img","alt":" wl(t)/ ∥wl(t)∥","inline":true,"padRight":true},{"text":"converges to","element":"span"}],[{"style":{"width":"41%"},"width":715,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-1.png","element":"img"}],[{"text":"Proof We examine the output of the network given a single input ","element":"span"},{"style":{"height":14.8},"width":262.52,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-2.png","element":"img","alt":" xn, for t > t0.","inline":true,"padRight":true},{"text":"Since the ReLU inputs do not switch signs, we can write ","element":"span"},{"style":{"height":10.88},"width":36.4,"height":27.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-3.png","element":"img","alt":" vl","inline":true},{"text":", the output of layer ","element":"span"},{"text":"l","element":"span"},{"text":", as","element":"span"}],[{"style":{"width":"26%"},"width":456,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-4.png","element":"img"}],[{"text":"where we defined 
","element":"span"},{"style":{"height":17.68},"width":269.04,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-5.png","element":"img","alt":" Al,n for l < L","inline":true,"padRight":true},{"text":"as a diagonal 0-1 matrix, whose diagonal holds the ReLU slopes at layer ","element":"span"},{"style":{"height":17.49},"width":465.52,"height":43.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-6.png","element":"img","alt":" l, sample n, and AL,n = 1","inline":true},{"text":". Additionally, we define","element":"span"}],[{"style":{"width":"51%"},"width":892,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-7.png","element":"img"}],[{"text":"Using this notation we can write","element":"span"}],[{"id":"id-44","style":{"width":"82%"},"width":1431,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-8.png","element":"img"}],[{"text":"This implies that","element":"span"}],[{"style":{"width":"49%"},"width":860,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-9.png","element":"img"}],[{"text":"which is the same as the original linear problem. Since the loss converges to zero, the dataset ","element":"span"},{"style":{"height":20.42},"width":239.72,"height":51.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-10.png","element":"img","alt":"{˜xl,n, yn}Nn=1 ","inline":true,"padRight":true},{"text":"must be linearly separable. Applying Theorem ","element":"span"},{"href":"#id-19","text":"3, ","element":"a"},{"text":"and recalling that ","element":"span"},{"style":{"height":18.58},"width":272.08,"height":46.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-11.png","element":"img","alt":" u(wl) = ˜x⊤l wl","inline":true,"padRight":true},{"text":"from eq. 
","element":"span"},{"href":"#id-44","text":"17, ","element":"a"},{"text":"we prove this corollary.","element":"span"}],[{"text":"Importantly, this case is non-convex, unless we are optimizing the last layer. Note we assumed ReLU functions for simplicity, but this proof can be easily generalized to any other piecewise linear constant activation functions (e.g., leaky ReLU, max-pooling).","element":"span"}],[{"text":"Lastly, a follow-up work ","element":"span"},{"href":"#id-45","referenceIndex":2,"text":"(Gunasekar et al., ","element":"a"},{"href":"#id-45","referenceIndex":2,"text":"2018b","element":"a"},{"text":"), given a few additional assumptions, extended our results to linear predictors which can be written as a homogeneous polynomial in the parameters. These results seem to indicate that, in many cases, GD operating on exp-tailed loss with positively homogeneous predictors tends toward a specific direction. This is the direction of the max margin predictor minimizing the ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-12.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"norm in the parameter space. It is not yet clear how to generally translate such an implicit bias in the parameter space to the implicit bias in the predictor space, except in special cases, such as deep linear neural nets, as we have shown in ","element":"span"},{"href":"#id-45","referenceIndex":2,"text":"(Gunasekar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":2,"text":"2018b","element":"a"},{"text":"). 
Moreover, in non-linear neural nets, there are many equivalent max-margin solutions which minimize the ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/10-13.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"norm of the parameters. Therefore, it is natural to expect that GD would have additional implicit biases, which select a specific subset of these solutions.","element":"span"}],[{"style":{"width":"98%"},"width":1711,"height":542,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/11-0.png","element":"img"}],[{"id":"id-49","text":"Figure 3: Same as Fig. ","element":"figcaption","subtype":"caption"},{"href":"#id-35","text":"1, ","element":"a","subtype":"caption"},{"text":"except we multiplied all ","element":"figcaption","subtype":"caption"},{"style":{"height":10.69},"width":41.96,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/11-1.png","element":"img","alt":" x2","inline":true,"padRight":true},{"text":"values in the dataset by ","element":"figcaption","subtype":"caption"},{"text":"20","element":"figcaption","subtype":"caption"},{"text":", and also trained using ADAM. The final weight vector produced after ","element":"figcaption","subtype":"caption"},{"style":{"height":15.14},"width":104.36,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/11-2.png","element":"img","alt":" 2·10^6 ","inline":true,"padRight":true},{"text":"epochs of optimization using ADAM (red dashed line) does not converge to the L2 max margin solution (black line), in contrast to GD (blue dashed line), or GDMO.","element":"figcaption","subtype":"caption"}],[{"id":"id-30","text":"4.3 Other optimization methods","element":"span"}],[{"text":"In this paper we examined the implicit bias of gradient descent. 
Different optimization algorithms exhibit different biases, and understanding these biases and how they differ is crucial to understanding and constructing learning methods attuned to the inductive biases we expect. Can we characterize the implicit bias and convergence rate of other optimization methods?","element":"span"}],[{"text":"In Figure ","element":"span"},{"href":"#id-35","text":"1 ","element":"a"},{"text":"we see that adding momentum does not qualitatively affect the bias induced by gradient descent. In Figure ","element":"span"},{"href":"#id-46","text":"4 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"F ","element":"span"},{"text":"we also repeat the experiment using stochastic gradient descent, and observe a similar asymptotic bias (this was later proved in ","element":"span"},{"href":"#id-47","referenceIndex":1,"text":"Nacson et al. ","element":"a"},{"href":"#id-47","referenceIndex":1,"text":"(2018","element":"a"},{"text":")). This is consistent with the fact that momentum, acceleration and stochasticity do not change the bias when using gradient descent to optimize an underdetermined least squares problem. It would be beneficial, though, to rigorously understand how much we can generalize our result to gradient descent variants, and how the convergence rates might change in these cases.","element":"span"}],[{"text":"On the other hand, as an example of how changing the optimization algorithm does change the bias, consider adaptive methods, such as AdaGrad ","element":"span"},{"href":"#id-48","referenceIndex":3,"text":"(Duchi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-48","referenceIndex":3,"text":"2011","element":"a"},{"text":") and ADAM ","element":"span"},{"href":"#id-7","referenceIndex":12,"text":"(Kingma and Ba, ","element":"a"},{"href":"#id-7","referenceIndex":12,"text":"2015","element":"a"},{"text":"). 
In Figure ","element":"span"},{"href":"#id-49","text":"3 ","element":"a"},{"text":"we show the predictors obtained by ADAM and by gradient descent on a simple dataset. Both methods converge to zero training error solutions. But although gradient descent converges to the ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/11-3.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"max margin predictor, as predicted by our theory, ADAM does not. The implicit bias of adaptive methods has in fact been a recent topic of interest, with ","element":"span"},{"href":"#id-50","referenceIndex":8,"text":"Hoffer et al. ","element":"a"},{"href":"#id-50","referenceIndex":8,"text":"(2017","element":"a"},{"text":") and ","element":"span"},{"href":"#id-5","referenceIndex":22,"text":"Wilson et al. ","element":"a"},{"href":"#id-5","referenceIndex":22,"text":"(2017","element":"a"},{"text":") suggesting they lead to worse generalization, and ","element":"span"},{"href":"#id-5","referenceIndex":22,"text":"Wilson et al. ","element":"a"},{"href":"#id-5","referenceIndex":22,"text":"(2017","element":"a"},{"text":") providing examples of the differences in the bias for linear regression problems with the squared loss. Can we characterize the bias of adaptive methods for logistic regression problems? Can we characterize the bias of other optimization methods, providing a general understanding linking optimization algorithms with their biases?","element":"span"}],[{"text":"A follow-up paper ","element":"span"},{"href":"#id-31","referenceIndex":6,"text":"(Gunasekar et al., ","element":"a"},{"href":"#id-31","referenceIndex":6,"text":"2018","element":"a"},{"text":") provided initial answers to these questions. ","element":"span"},{"href":"#id-31","referenceIndex":6,"text":"Gunasekar et al. 
","element":"a"},{"href":"#id-31","referenceIndex":6,"text":"(2018","element":"a"},{"text":") derived a precise characterization of the limit direction of steepest descent for general norms when optimizing the exp-loss, and showed that for adaptive methods such as AdaGrad the limit direction can depend on the initial point and step size and is thus not as predictable and robust as with non-adaptive methods.","element":"span"}],[{"text":"4.4 Other loss functions","element":"span"}],[{"text":"In this work we focused on loss functions with an exponential tail and observed a very slow, logarithmic convergence of the normalized weight vector to the ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/12-0.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"max margin direction. A natural question that follows is how this behavior changes with other types of loss function tails. Specifically, does the normalized weight vector always converge to the ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/12-1.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"max margin solution? How is the convergence rate affected? Can we improve the convergence rate beyond the logarithmic rate found in this work?","element":"span"}],[{"text":"In a follow-up work, ","element":"span"},{"href":"#id-51","referenceIndex":13,"text":"Nacson et al. ","element":"a"},{"href":"#id-51","referenceIndex":13,"text":"(2018","element":"a"},{"text":") provided partial answers to these questions. 
They proved that the exponential tail has the optimal convergence rate, for tails for which ","element":"span"},{"style":{"height":17.6},"width":243.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/12-2.png","element":"img","alt":" ℓ′(u) is of the","inline":true,"padRight":true},{"text":"form ","element":"span"},{"style":{"height":17.6},"width":450.16,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/12-3.png","element":"img","alt":" exp(−uν) with ν > 0.25","inline":true},{"text":". They then conjectured, based on heuristic analysis, that the exponential tail is optimal among all possible tails. Furthermore, they demonstrated that polynomial or heavier tails do not converge to the max margin solution. Lastly, for the exponential loss they proposed a normalized gradient scheme which can significantly improve convergence rate, achieving ","element":"span"},{"style":{"height":19.41},"width":259.16,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/12-4.png","element":"img","alt":"O(log(t)/√t).","inline":true}],[{"text":"4.5 Matrix Factorization","element":"span"}],[{"text":"With multi-layered neural networks in mind, ","element":"span"},{"href":"#id-52","referenceIndex":5,"text":"Gunasekar et al. ","element":"a"},{"href":"#id-52","referenceIndex":5,"text":"(2017","element":"a"},{"text":") recently embarked on a study of the implicit bias of under-determined matrix factorization problems, where the squared loss of the linear observation of a matrix is minimized by gradient descent on its factorization. Since a matrix factorization can be viewed as a two-layer network with linear activations, this is perhaps the simplest deep model one can study in full, and can thus provide insight and direction to studying more complex neural networks. ","element":"span"},{"href":"#id-52","referenceIndex":5,"text":"Gunasekar et al. 
","element":"a"},{"text":"conjectured, and provided theoretical and empirical evidence, that gradient descent on the factorization for an under-determined problem converges to the minimum nuclear norm solution, but only if the initialization is infinitesimally close to zero and the step-sizes are infinitesimally small. With finite step-sizes or finite initialization, ","element":"span"},{"href":"#id-52","referenceIndex":5,"text":"Gunasekar et al. ","element":"a"},{"text":"could not characterize the bias.","element":"span"}],[{"text":"The follow-up paper ","element":"span"},{"href":"#id-31","referenceIndex":6,"text":"(Gunasekar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":6,"text":"2018","element":"a"},{"text":") studied this same problem with exponential loss instead of squared loss. Under additional assumptions on the asymptotic convergence of update directions and gradient directions, they were able to relate the direction of gradient descent iterates on the factorized parameterization asymptotically to the maximum margin solution with unit nuclear norm. Unlike the case of squared loss, the results for exponential loss are independent of initialization and require only mild conditions on the step size. Here again, we see the asymptotic nature of exponential loss on separable data nullifying the initialization effects, thereby making the analysis simpler compared to squared loss.","element":"span"}]]},{"heading":"5. Summary","paragraphs":[[{"text":"We characterized the implicit bias induced by gradient descent on homogeneous linear predictors when minimizing smooth monotone loss functions with an exponential tail. This is the type of loss commonly being minimized in deep learning. We can now rigorously understand:","element":"span"}],[{"text":"1. 
How gradient descent, without early stopping, induces implicit ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/12-5.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization and converges to the maximum ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/12-6.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"margin solution, when minimizing for binary classification with logistic loss, exp-loss, or other exponential tailed monotone decreasing loss, as well as for multi-class classification with cross-entropy loss. Notably, even though the logistic loss and the exp-loss behave very differently on non-separable problems, they exhibit the same behavior for separable problems. This implies that the non-tail part does not affect the bias. The bias is also independent of the step-size used (as long as it is small enough to ensure convergence) and is also independent of the initialization (unlike for least squares problems).","element":"span"}],[{"text":"2. The convergence of the direction of gradient descent updates to the maximum ","element":"span"},{"style":{"height":16},"width":187.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/13-0.png","element":"img","alt":" L2 margin","inline":true,"padRight":true},{"text":"solution, however, is very slow compared to the convergence of the training loss, which explains why it is worthwhile continuing to optimize long after we have zero training error, and even when the loss itself is already extremely small.","element":"span"}],[{"text":"3. We should not rely on plateauing of the training loss or on the loss (logistic or exp or cross-entropy) evaluated on validation data, as measures to decide when to stop. 
Instead, we should look at the ","element":"span"},{"text":"0","element":"span"},{"text":"–","element":"span"},{"text":"1 ","element":"span"},{"text":"error on the validation dataset. We might improve the validation and test errors even when the decrease in the training loss is tiny and even when the validation loss itself increases.","element":"span"}],[{"text":"Perhaps the fact that gradient descent leads to a max ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/13-1.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"margin solution is not a big surprise to those for whom the connection between ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/13-2.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization and gradient descent is natural. Nevertheless, we are not familiar with any prior study or mention of this fact, let alone a rigorous analysis and study of how this bias is exact and independent of the initial point and the step-size. Furthermore, we also analyze the rate at which this happens, leading to the novel observations discussed above. Even more importantly, we hope that our analysis can open the door to further analysis of different optimization methods or in different models, including deep networks, where implicit regularization is not well understood even for least squares problems, or where we do not have such a natural guess as for gradient descent on linear problems. Analyzing gradient descent on logistic/cross-entropy loss is not only arguably more relevant than the least squares loss, but might also be technically easier.","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"The authors are grateful to J. Lee and C. Zeno for helpful comments on the manuscript. 
The research of DS was supported by the Israel Science Foundation (grant No. 31/1031) and by the Taub Foundation, and that of NS by the National Science Foundation.","element":"span"}]]},{"heading":"Appendix A. Proof of Theorems 3 and 4 for almost every dataset","paragraphs":[[{"text":"In the following sub-sections we first prove Theorem ","element":"span"},{"href":"#id-22","text":"9 ","element":"a"},{"text":"below, which is a version of Theorem ","element":"span"},{"href":"#id-19","text":"3, ","element":"a"},{"text":"specialized for almost every dataset. We then prove Theorem ","element":"span"},{"href":"#id-53","text":"4 ","element":"a"},{"text":"(which is already stated for almost every dataset).","element":"span"}],[{"id":"id-22","text":"Theorem 9 ","element":"span"},{"text":"For almost every dataset which is linearly separable (Assumption ","element":"span"},{"href":"#id-8","text":"1)","element":"a"},{"text":", any ","element":"span"},{"style":{"height":16.4},"width":166.96,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-0.png","element":"img","alt":" β-smooth","inline":true,"padRight":true},{"text":"decreasing loss function (Assumption ","element":"span"},{"href":"#id-8","text":"2) ","element":"a"},{"text":"with an exponential tail (Assumption ","element":"span"},{"href":"#id-15","text":"3)","element":"a"},{"text":", any stepsize ","element":"span"},{"style":{"height":13.6},"width":71.44,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-1.png","element":"img","alt":" η <","inline":true},{"style":{"height":19.14},"width":273.8,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-2.png","element":"img","alt":"2β−1σ−2max (X )","inline":true,"padRight":true},{"text":"and any starting point ","element":"span"},{"text":"w","element":"span"},{"text":"(0)","element":"span"},{"text":", the gradient descent iterates (as in eq. 
","element":"span"},{"href":"#id-10","text":"2) ","element":"a"},{"text":"will behave ","element":"span"},{"text":"as:","element":"span"}],[{"style":{"width":"62%"},"width":1084,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-3.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.28},"width":198.44,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-4.png","element":"img","alt":" ˆw is the L2","inline":true,"padRight":true},{"text":"max margin vector","element":"span"}],[{"style":{"width":"42%"},"width":727,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-5.png","element":"img"}],[{"text":"the residual ","element":"span"},{"style":{"height":17.6},"width":72.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-6.png","element":"img","alt":" ρ(t)","inline":true,"padRight":true},{"text":"is bounded, and so","element":"span"}],[{"style":{"width":"22%"},"width":391,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-7.png","element":"img"}],[{"text":"In the following proofs, for any solution ","element":"span"},{"text":"w ","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":")","element":"span"},{"text":", we define","element":"span"}],[{"style":{"width":"29%"},"width":505,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":12.99},"width":156.04,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-9.png","element":"img","alt":" ˆw and ˜w","inline":true,"padRight":true},{"text":"follow the conditions of Theorems ","element":"span"},{"href":"#id-19","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-53","text":"4, ","element":"a"},{"text":"i.e. 
","element":"span"},{"style":{"height":15.28},"width":194.6,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-10.png","element":"img","alt":" ˆw is the L2","inline":true,"padRight":true},{"text":"is the max margin vector defined above, and ","element":"span"},{"style":{"height":11.79},"width":37,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-11.png","element":"img","alt":" ˜w","inline":true,"padRight":true},{"text":"is a vector which satisfies eq. ","element":"span"},{"href":"#id-54","text":"7:","element":"a"}],[{"id":"id-55","style":{"width":"67%"},"width":1159,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-12.png","element":"img"}],[{"text":"where we recall that we denoted ","element":"span"},{"style":{"height":18.62},"width":233.16,"height":46.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-13.png","element":"img","alt":" XS ∈ Rd×|S| ","inline":true,"padRight":true},{"text":"as the matrix whose columns are the support vectors, a subset ","element":"span"},{"style":{"height":17.6},"width":290.32,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-14.png","element":"img","alt":" S ⊂ {1, . . . , N}","inline":true,"padRight":true},{"text":"of the columns of ","element":"span"},{"style":{"height":19.54},"width":493.4,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-15.png","element":"img","alt":" X = [x1, . . . 
, xN] ∈ Rd×N.","inline":true}],[{"text":"In Lemma ","element":"span"},{"text":"12 ","element":"span"},{"text":"(Appendix ","element":"span"},{"text":"B) ","element":"span"},{"text":"we prove that for almost every dataset ","element":"span"},{"style":{"height":8},"width":33,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-16.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"is uniquely defined, there are no more than ","element":"span"},{"text":"d ","element":"span"},{"text":"support vectors and ","element":"span"},{"style":{"height":16.8},"width":286.76,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-17.png","element":"img","alt":" αn ≠ 0, ∀n ∈ S","inline":true},{"text":". Therefore, eq. ","element":"span"},{"href":"#id-55","text":"19 ","element":"a"},{"text":"is well-defined in those cases. If the support vectors do not span the data, then the solution ","element":"span"},{"style":{"height":11.79},"width":37,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-18.png","element":"img","alt":" ˜w","inline":true,"padRight":true},{"text":"to eq. ","element":"span"},{"href":"#id-55","text":"19 ","element":"a"},{"text":"might not be unique. In this case, we can use any such solution in the proof.","element":"span"}],[{"text":"We furthermore denote the minimum margin to a non-support vector as:","element":"span"}],[{"id":"id-58","style":{"width":"60%"},"width":1039,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-19.png","element":"img"}],[{"text":"and by ","element":"span"},{"style":{"height":15.6},"width":270.08,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-20.png","element":"img","alt":" Ci,ǫi,ti (i ∈ N","inline":true},{"text":") various positive constants which are independent of ","element":"span"},{"text":"t","element":"span"},{"text":". 
Lastly, we define ","element":"span"},{"style":{"height":17.82},"width":212.4,"height":44.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-21.png","element":"img","alt":"P1 ∈ Rd×d ","inline":true,"padRight":true},{"text":"as the orthogonal projection matrix","element":"span"},{"text":"5 ","element":"span"},{"text":"to the subspace spanned by the support vectors (the columns of ","element":"span"},{"style":{"height":18.02},"width":422.12,"height":45.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-22.png","element":"img","alt":" XS), and ¯P1 = I − P1","inline":true,"padRight":true},{"text":"as the complementary projection (to the left nullspace of ","element":"span"},{"style":{"height":15.2},"width":88.28,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/14-23.png","element":"img","alt":"XS).","inline":true}],[{"text":"A.1 Simple proof of Theorem ","element":"span"},{"href":"#id-22","text":"9","element":"a"}],[{"text":"In this section we first examine the special case that ","element":"span"},{"style":{"height":17.6},"width":209.12,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-0.png","element":"img","alt":" ℓ (u) = e−u ","inline":true,"padRight":true},{"text":"and take the continuous time limit of gradient descent: ","element":"span"},{"style":{"height":15.6},"width":184.72,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-1.png","element":"img","alt":" η → 0 , so","inline":true}],[{"style":{"width":"23%"},"width":411,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-2.png","element":"img"}],[{"text":"The proof in this case is rather short and self-contained (i.e., does not rely on any previous results), and so it helps to clarify the main ideas of the general (more complicated) proof which we will give in the next 
sections.","element":"span"}],[{"id":"id-56","style":{"width":"96%"},"width":1661,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-3.png","element":"img"}],[{"text":"Our goal is to show that ","element":"span"},{"style":{"height":17.6},"width":121.36,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-4.png","element":"img","alt":" ∥r (t)∥","inline":true,"padRight":true},{"text":"is bounded, and therefore ","element":"span"},{"style":{"height":17.6},"width":323.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-5.png","element":"img","alt":" ρ (t) = r (t) + ˜w","inline":true,"padRight":true},{"text":"is bounded. Eq. ","element":"span"},{"href":"#id-56","text":"21 ","element":"a"},{"text":"implies that","element":"span"}],[{"style":{"width":"72%"},"width":1247,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-6.png","element":"img"}],[{"text":"and therefore","element":"span"}],[{"id":"id-57","style":{"width":"87%"},"width":1514,"height":567,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-7.png","element":"img"}],[{"text":"where in the last equality we used eq. ","element":"span"},{"href":"#id-56","text":"21 ","element":"a"},{"text":"and decomposed the sum over support vectors ","element":"span"},{"text":"S ","element":"span"},{"text":"and non-support vectors. We examine both bracketed terms. Recall that ","element":"span"},{"style":{"height":15.39},"width":540.04,"height":38.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-8.png","element":"img","alt":" ˆw⊤xn = 1 for n ∈ S, and that","inline":true,"padRight":true},{"text":"we defined (in eq. 
","element":"span"},{"href":"#id-55","text":"19) ","element":"a"},{"style":{"height":20.8},"width":706.12,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-9.png","element":"img","alt":" ˜w so that ∑n∈S exp(− ˜w⊤xn)xn = ˆw","inline":true},{"text":". Thus, the first bracketed term in eq. ","element":"span"},{"href":"#id-57","text":"23 ","element":"a"},{"text":"can be written as","element":"span"}],[{"id":"id-59","style":{"width":"88%"},"width":1526,"height":248,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-10.png","element":"img"}],[{"text":"since ","element":"span"},{"style":{"height":17.6},"width":375.76,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-11.png","element":"img","alt":" ∀z, z (e−z − 1) ≤ 0","inline":true},{"text":". Furthermore, since ","element":"span"},{"style":{"height":20.02},"width":879.32,"height":50.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-12.png","element":"img","alt":" ∀z e−zz ≤ 1 and θ = minn∉S x⊤n ˆw > 1 (eq.","inline":true,"padRight":true},{"href":"#id-58","text":"20)","element":"a"},{"text":", the second bracketed term in eq. ","element":"span"},{"href":"#id-57","text":"23 ","element":"a"},{"text":"can be upper bounded by","element":"span"}],[{"id":"id-60","style":{"width":"96%"},"width":1675,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-13.png","element":"img"}],[{"text":"Substituting eq. ","element":"span"},{"href":"#id-59","text":"24 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-60","text":"25 ","element":"a"},{"text":"into eq. 
","element":"span"},{"href":"#id-57","text":"23 ","element":"a"},{"text":"and integrating, we obtain that ","element":"span"},{"style":{"height":15.6},"width":289,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-14.png","element":"img","alt":" ∃C, C′ such that","inline":true}],[{"style":{"width":"59%"},"width":1033,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/15-15.png","element":"img"}],[{"text":"since ","element":"span"},{"style":{"height":13.2},"width":103.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-0.png","element":"img","alt":" θ > 1","inline":true,"padRight":true},{"text":"(eq. ","element":"span"},{"href":"#id-58","text":"20)","element":"a"},{"text":". Thus, we showed that ","element":"span"},{"text":"r","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"is bounded, which completes the proof for the special case. ","element":"span"},{"style":{"height":0},"width":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-1.png","element":"img","alt":" ■","inline":true}],[{"text":"A.2 Complete proof of Theorem ","element":"span"},{"href":"#id-22","text":"9","element":"a"}],[{"text":"Next, we give the proof for the general case (non-infinitesimal step size, and exponentially-tailed functions). 
Though it relies on an analysis similar to that of the special case examined in the previous section, it is somewhat more involved since we have to bound additional terms.","element":"span"}],[{"text":"First, we state two auxiliary lemmata, which are proven below in appendix sections ","element":"span"},{"href":"#id-12","text":"A.4 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-61","text":"A.5:","element":"a"}],[{"id":"id-11","text":"Lemma 10 ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":17.6},"width":225.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-2.png","element":"img","alt":" L (w) be a β","inline":true},{"text":"-smooth non-negative objective. If ","element":"span"},{"style":{"height":18.74},"width":173.48,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-3.png","element":"img","alt":" η < 2β−1","inline":true},{"text":", then, for any ","element":"span"},{"text":"w","element":"span"},{"text":"(0)","element":"span"},{"text":", with the GD sequence","element":"span"}],[{"style":{"width":"100%"},"width":1729,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-4.png","element":"img"}],[{"id":"id-66","text":"Lemma 11 ","element":"span"},{"text":"We have","element":"span"}],[{"id":"id-65","style":{"width":"88%"},"width":1532,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-5.png","element":"img"}],[{"text":"Additionally, ","element":"span"},{"style":{"height":15.6},"width":296.36,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-6.png","element":"img","alt":" ∀ǫ1 > 0 , ∃C2, t2","inline":true},{"text":", such that ","element":"span"},{"style":{"height":16},"width":192.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-7.png","element":"img","alt":" ∀t > t2, 
if","inline":true}],[{"id":"id-64","style":{"width":"57%"},"width":1000,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-8.png","element":"img"}],[{"text":"then the following improved bound holds","element":"span"}],[{"id":"id-69","style":{"width":"71%"},"width":1228,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-9.png","element":"img"}],[{"text":"Our goal is to show that ","element":"span"},{"style":{"height":17.6},"width":121.36,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-10.png","element":"img","alt":" ∥r (t)∥","inline":true,"padRight":true},{"text":"is bounded, and therefore ","element":"span"},{"style":{"height":17.6},"width":325.96,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-11.png","element":"img","alt":" ρ (t) = r (t) + ˜w","inline":true,"padRight":true},{"text":"is bounded. To show this, we will upper bound the following equation","element":"span"}],[{"style":{"width":"88%"},"width":1523,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-12.png","element":"img"}],[{"text":"First, we note that the first term in this equation can be upper-bounded by","element":"span"}],[{"id":"id-63","style":{"width":"89%"},"width":1554,"height":405,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-13.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we used eq. ","element":"span"},{"href":"#id-56","text":"21, ","element":"a"},{"text":"in ","element":"span"},{"text":"(2) ","element":"span"},{"text":"we used eq. 
","element":"span"},{"href":"#id-10","text":"2, ","element":"a"},{"text":"and in ","element":"span"},{"style":{"height":17.6},"width":757.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-14.png","element":"img","alt":" (3) we used ∀x > 0 : x ≥ log (1 + x) > 0,","inline":true,"padRight":true},{"text":"and also that","element":"span"}],[{"style":{"width":"75%"},"width":1306,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/16-15.png","element":"img"}],[{"id":"id-62","style":{"width":"99%"},"width":1727,"height":270,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-0.png","element":"img"}],[{"text":"Substituting eq. ","element":"span"},{"href":"#id-62","text":"33 ","element":"a"},{"text":"into eq. ","element":"span"},{"href":"#id-63","text":"31, ","element":"a"},{"text":"and recalling that a ","element":"span"},{"style":{"height":12.33},"width":60.24,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-1.png","element":"img","alt":" t−ν ","inline":true,"padRight":true},{"text":"power series converges for any ","element":"span"},{"style":{"height":14.4},"width":184.16,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-2.png","element":"img","alt":" ν > 1, we","inline":true,"padRight":true},{"text":"can find ","element":"span"},{"style":{"height":15.09},"width":217.96,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-3.png","element":"img","alt":" C0 such that","inline":true}],[{"id":"id-68","style":{"width":"85%"},"width":1477,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-4.png","element":"img"}],[{"text":"Note that this equation also implies that ","element":"span"},{"style":{"height":15.09},"width":59.24,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-5.png","element":"img","alt":" 
∀ǫ0","inline":true}],[{"id":"id-70","style":{"width":"72%"},"width":1250,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-6.png","element":"img"}],[{"text":"Next, we would like to bound the second term in eq. ","element":"span"},{"href":"#id-64","text":"30. ","element":"a"},{"text":"From eq. ","element":"span"},{"href":"#id-65","text":"27 ","element":"a"},{"text":"in Lemma ","element":"span"},{"href":"#id-66","text":"11, ","element":"a"},{"text":"we can find ","element":"span"},{"style":{"height":15.6},"width":427.2,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-7.png","element":"img","alt":"t1, C1 such that ∀t > t1:","inline":true}],[{"id":"id-67","style":{"width":"78%"},"width":1356,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-8.png","element":"img"}],[{"text":"Thus, by combining eqs. ","element":"span"},{"href":"#id-67","text":"36 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-68","text":"34 ","element":"a"},{"text":"into eq. ","element":"span"},{"href":"#id-64","text":"30, ","element":"a"},{"text":"we find","element":"span"}],[{"style":{"width":"43%"},"width":746,"height":365,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-9.png","element":"img"}],[{"text":"which is bounded, since ","element":"span"},{"style":{"height":13.2},"width":115.6,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-10.png","element":"img","alt":" θ > 1","inline":true,"padRight":true},{"text":"(eq. ","element":"span"},{"href":"#id-58","text":"20) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":222.64,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-11.png","element":"img","alt":" µ−, µ+ > 0","inline":true,"padRight":true},{"text":"(Definition ","element":"span"},{"href":"#id-14","text":"2)","element":"a"},{"text":". 
Therefore, ","element":"span"},{"style":{"height":17.6},"width":164.84,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-12.png","element":"img","alt":" ∥r (t)∥ is","inline":true,"padRight":true},{"text":"bounded. ","element":"span"},{"style":{"height":0},"width":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-13.png","element":"img","alt":" ■","inline":true}],[{"text":"A.3 Proof of Theorem ","element":"span"},{"href":"#id-53","text":"4","element":"a"}],[{"text":"All that remains now is to show that ","element":"span"},{"style":{"height":17.6},"width":1076,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-14.png","element":"img","alt":" ∥r (t)∥ → 0 if rank (XS) = rank (X), and that ˜w is unique","inline":true,"padRight":true},{"text":"given ","element":"span"},{"text":"w ","element":"span"},{"text":"(0)","element":"span"},{"text":". To do so, this proof will continue where the proof of Theorem ","element":"span"},{"href":"#id-19","text":"3 ","element":"a"},{"text":"stopped, using notations and equations from that proof.","element":"span"}],[{"text":"Since ","element":"span"},{"text":"r ","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"has a bounded norm, its two orthogonal components ","element":"span"},{"style":{"height":19.22},"width":534.64,"height":48.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-15.png","element":"img","alt":" r (t) = P1r (t) + ¯P1r (t) also","inline":true,"padRight":true},{"text":"have bounded norms (recall that ","element":"span"},{"style":{"height":18.02},"width":123.56,"height":45.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-16.png","element":"img","alt":" P1, ¯P1","inline":true,"padRight":true},{"text":"were defined in the beginning of appendix section ","element":"span"},{"text":"A)","element":"span"},{"text":". 
From eq. ","element":"span"},{"href":"#id-10","text":"2, ","element":"a"},{"style":{"height":17.6},"width":144.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-17.png","element":"img","alt":" ∇L (w)","inline":true,"padRight":true},{"text":"is spanned by the columns of ","element":"span"},{"style":{"height":17.6},"width":522.92,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-18.png","element":"img","alt":" X. If rank (XS) = rank (X)","inline":true},{"text":", then it is also spanned by the columns of ","element":"span"},{"style":{"height":19.41},"width":495.76,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-19.png","element":"img","alt":" XS, and so ¯P1∇L (w) = 0","inline":true},{"text":". Therefore, ","element":"span"},{"style":{"height":19.41},"width":130.76,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-20.png","element":"img","alt":"¯P1r (t)","inline":true,"padRight":true},{"text":"is not updated during GD, and remains constant. Since ","element":"span"},{"style":{"height":11.79},"width":37,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-21.png","element":"img","alt":" ˜w","inline":true,"padRight":true},{"text":"in eq. ","element":"span"},{"href":"#id-56","text":"21 ","element":"a"},{"text":"is also bounded, we can absorb this constant ","element":"span"},{"style":{"height":19.41},"width":262.12,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-22.png","element":"img","alt":"¯P1r (t) into ˜w","inline":true,"padRight":true},{"text":"without affecting eq. 
","element":"span"},{"href":"#id-54","text":"7 ","element":"a"},{"text":"(since ","element":"span"},{"style":{"height":19.23},"width":454.96,"height":48.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-23.png","element":"img","alt":" ∀n ∈ S : x⊤n ¯P1r (t) = 0","inline":true},{"text":"). Thus, without loss of generality, we can ","element":"span"},{"text":"assume that ","element":"span"},{"style":{"height":17.6},"width":277.88,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-24.png","element":"img","alt":" r (t) = P1r (t).","inline":true}],[{"style":{"width":"65%"},"width":1139,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/17-25.png","element":"img"}],[{"text":"By contradiction, we assume that the complementary set is not finite,","element":"span"}],[{"style":{"width":"39%"},"width":686,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-0.png","element":"img"}],[{"text":"Additionally, the set ","element":"span"},{"text":"T ","element":"span"},{"text":"is not finite: if it were finite, it would have had a finite maximal point ","element":"span"},{"style":{"height":15.49},"width":170.92,"height":38.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-1.png","element":"img","alt":"tmax ∈ T","inline":true,"padRight":true},{"text":", and then, combining eqs. 
","element":"span"},{"href":"#id-69","text":"29, ","element":"a"},{"href":"#id-64","text":"30, ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-68","text":"34, ","element":"a"},{"text":"we would find that ","element":"span"},{"style":{"height":15.09},"width":177.36,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-2.png","element":"img","alt":" ∀t > tmax","inline":true}],[{"style":{"width":"94%"},"width":1636,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-3.png","element":"img"}],[{"text":"which is impossible since ","element":"span"},{"style":{"height":20.67},"width":220.72,"height":51.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-4.png","element":"img","alt":" ∥r (t)∥2 ≥ 0","inline":true},{"text":". Furthermore, eq. ","element":"span"},{"href":"#id-68","text":"34 ","element":"a"},{"text":"implies that","element":"span"}],[{"style":{"width":"38%"},"width":665,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-5.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"h ","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"is a positive monotone function decreasing to zero. Let ","element":"span"},{"style":{"height":14.4},"width":70.24,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-6.png","element":"img","alt":" t3, t","inline":true,"padRight":true},{"text":"be any two points such that ","element":"span"},{"style":{"height":19.22},"width":854.44,"height":48.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-7.png","element":"img","alt":"t3 < t, {t3, t3 + 1, . . . t} ⊂ ¯T , and (t3 − 1) ∈ T","inline":true,"padRight":true},{"text":". 
For all such ","element":"span"},{"style":{"height":15.2},"width":299.84,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-8.png","element":"img","alt":" t3 and t, we have","inline":true}],[{"id":"id-71","style":{"width":"91%"},"width":1580,"height":514,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-9.png","element":"img"}],[{"text":"Also, recall that ","element":"span"},{"style":{"height":13.89},"width":146.6,"height":34.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-10.png","element":"img","alt":" t3 > t0","inline":true},{"text":", so from eq. ","element":"span"},{"href":"#id-70","text":"35, ","element":"a"},{"text":"we have that ","element":"span"},{"style":{"height":17.6},"width":687.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-11.png","element":"img","alt":" |∥r (t3)∥ − ∥r (t3 − 1)∥| < ǫ0. Since","inline":true},{"style":{"height":17.6},"width":485.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-12.png","element":"img","alt":"∥r (t3 − 1)∥ < ǫ1 (from T","inline":true,"padRight":true},{"text":"definition), we conclude that ","element":"span"},{"style":{"height":17.6},"width":339.08,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-13.png","element":"img","alt":" ∥r (t3)∥ ≤ ǫ1 + ǫ0","inline":true},{"text":". Moreover, since ","element":"span"},{"style":{"height":16.01},"width":37,"height":40.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-14.png","element":"img","alt":"¯T","inline":true,"padRight":true},{"text":"is an infinite set, we can choose ","element":"span"},{"style":{"height":13.89},"width":32.84,"height":34.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-15.png","element":"img","alt":" t3","inline":true,"padRight":true},{"text":"as large as we want. 
This implies that ","element":"span"},{"style":{"height":15.09},"width":149.2,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-16.png","element":"img","alt":" ∀ǫ2 > 0","inline":true,"padRight":true},{"text":"we can find ","element":"span"},{"style":{"height":13.89},"width":32.84,"height":34.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-17.png","element":"img","alt":" t3","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":17.6},"width":413,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-18.png","element":"img","alt":" ǫ2 > h (t3), since h (t)","inline":true,"padRight":true},{"text":"is a monotonically decreasing function. Therefore, from eq. ","element":"span"},{"href":"#id-71","text":"37,","element":"a"}],[{"style":{"width":"100%"},"width":1734,"height":340,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-19.png","element":"img"}],[{"id":"id-12","text":"A.4 Proof of Lemma ","element":"span"},{"href":"#id-11","text":"10","element":"a"}],[{"text":"Lemma 10 Let ","element":"span"},{"style":{"height":17.6},"width":225.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-20.png","element":"img","alt":" L (w) be a β","inline":true},{"text":"-smooth non-negative objective. 
If ","element":"span"},{"style":{"height":18.74},"width":173.48,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-21.png","element":"img","alt":" η < 2β−1","inline":true},{"text":", then, for any ","element":"span"},{"text":"w","element":"span"},{"text":"(0)","element":"span"},{"text":", with the GD sequence","element":"span"}],[{"style":{"width":"100%"},"width":1729,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/18-22.png","element":"img"}],[{"text":"This proof is a slightly modified version of the proof of Theorem 2 in ","element":"span"},{"href":"#id-13","referenceIndex":4,"text":"(Ganti","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":4,"text":"2015","element":"a"},{"text":"). Recall a well-known property of ","element":"span"},{"style":{"height":16.4},"width":26,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-0.png","element":"img","alt":" β","inline":true},{"text":"-smooth functions:","element":"span"}],[{"style":{"width":"76%"},"width":1324,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-1.png","element":"img"}],[{"text":"From the ","element":"span"},{"style":{"height":16.4},"width":26,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-2.png","element":"img","alt":" β","inline":true},{"text":"-smoothness of ","element":"span"},{"text":"L ","element":"span"},{"text":"(","element":"span"},{"text":"w","element":"span"},{"text":")","element":"span"}],[{"style":{"width":"95%"},"width":1652,"height":544,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-3.png","element":"img"}],[{"text":"which implies","element":"span"}],[{"style":{"width":"87%"},"width":1507,"height":149,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-4.png","element":"img"}],[{"text":"The right hand side is upper bounded by a finite constant, 
since ","element":"span"},{"style":{"height":17.6},"width":683.96,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-5.png","element":"img","alt":" L (w (0)) < ∞ and 0 ≤ L (w (t + 1)).","inline":true,"padRight":true},{"text":"This implies","element":"span"}],[{"style":{"width":"26%"},"width":462,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-6.png","element":"img"}],[{"text":"and therefore ","element":"span"},{"style":{"height":20.86},"width":400.24,"height":52.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-7.png","element":"img","alt":" ∥∇L (w (t))∥2 → 0. ■","inline":true}],[{"id":"id-61","text":"A.5 Proof of Lemma ","element":"span"},{"href":"#id-66","text":"11","element":"a"}],[{"text":"Recall that we defined ","element":"span"},{"style":{"height":17.6},"width":804.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-8.png","element":"img","alt":" r (t) = w (t) − ˆw log t − ˜w, with ˆw and ˜w","inline":true,"padRight":true},{"text":"satisfying the conditions of Theorems ","element":"span"},{"href":"#id-19","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-53","text":"4, ","element":"a"},{"text":"i.e., ","element":"span"},{"style":{"height":15.28},"width":198.92,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-9.png","element":"img","alt":" ˆw is the L2","inline":true,"padRight":true},{"text":"max margin vector (eq. ","element":"span"},{"href":"#id-18","text":"4)","element":"a"},{"text":", and eq. 
","element":"span"},{"href":"#id-54","text":"7 ","element":"a"},{"text":"holds","element":"span"}],[{"style":{"width":"33%"},"width":585,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-10.png","element":"img"}],[{"text":"Lemma 11 We have","element":"span"}],[{"style":{"width":"88%"},"width":1532,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-11.png","element":"img"}],[{"text":"Additionally, ","element":"span"},{"style":{"height":15.6},"width":296.36,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-12.png","element":"img","alt":" ∀ǫ1 > 0 , ∃C2, t2","inline":true},{"text":", such that ","element":"span"},{"style":{"height":16},"width":192.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-13.png","element":"img","alt":" ∀t > t2, if","inline":true}],[{"style":{"width":"57%"},"width":1000,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-14.png","element":"img"}],[{"text":"then the following improved bound holds","element":"span"}],[{"style":{"width":"70%"},"width":1228,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/19-15.png","element":"img"}],[{"text":"From Lemma ","element":"span"},{"href":"#id-16","text":"1, ","element":"a"},{"style":{"height":17.6},"width":546.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/20-0.png","element":"img","alt":" ∀n : limt→∞ w (t)⊤ xn = ∞","inline":true},{"text":". 
In addition, from assumption ","element":"span"},{"href":"#id-15","text":"3 ","element":"a"},{"text":"the negative loss derivative ","element":"span"},{"style":{"height":17.6},"width":129.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/20-1.png","element":"img","alt":" −ℓ′ (u)","inline":true,"padRight":true},{"text":"has an exponential tail ","element":"span"},{"style":{"height":12.33},"width":66.56,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/20-2.png","element":"img","alt":" e−u ","inline":true,"padRight":true},{"text":"(recall we assume ","element":"span"},{"text":"a ","element":"span"},{"text":"= ","element":"span"},{"text":"c ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"without loss of generality). Combining both facts, we have positive constants ","element":"span"},{"style":{"height":16.4},"width":552.08,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/20-3.png","element":"img","alt":" µ−, µ+, t− and t+ such that ∀n","inline":true}],[{"id":"id-75","style":{"width":"91%"},"width":1575,"height":174,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/20-4.png","element":"img"}],[{"text":"Next, we examine the expression we wish to bound, recalling that ","element":"span"},{"style":{"height":17.6},"width":508.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/20-5.png","element":"img","alt":" r (t) = w (t) − ˆw log t − ˜w:","inline":true}],[{"id":"id-72","style":{"width":"82%"},"width":1435,"height":536,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/20-6.png","element":"img"}],[{"text":"where in the last line we used eqs. 
","element":"span"},{"href":"#id-25","text":"6 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-54","text":"7 ","element":"a"},{"text":"to obtain","element":"span"}],[{"style":{"width":"44%"},"width":776,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/20-7.png","element":"img"}],[{"text":"We examine the three terms in eq. ","element":"span"},{"href":"#id-72","text":"41. ","element":"a"},{"text":"The first term can be upper bounded by","element":"span"}],[{"id":"id-79","style":{"width":"72%"},"width":1247,"height":412,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/20-8.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we used that ","element":"span"},{"style":{"height":17.7},"width":377.68,"height":44.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/21-0.png","element":"img","alt":"¯P1 ˆw = ¯P1XSα = 0","inline":true,"padRight":true},{"text":"from eq. 
","element":"span"},{"href":"#id-25","text":"6, ","element":"a"},{"text":"and in ","element":"span"},{"text":"(2) ","element":"span"},{"text":"we used that ","element":"span"},{"style":{"height":17.6},"width":290.36,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/21-1.png","element":"img","alt":" ˆw⊤r (t) = o (t),","inline":true,"padRight":true},{"text":"since","element":"span"}],[{"style":{"width":"100%"},"width":1730,"height":928,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/21-2.png","element":"img"}],[{"text":"Therefore, ","element":"span"},{"style":{"height":17.82},"width":154.08,"height":44.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/21-3.png","element":"img","alt":" ∀t > t′+:","inline":true}],[{"id":"id-74","style":{"width":"99%"},"width":1728,"height":1163,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/21-4.png","element":"img"}],[{"text":"We examine each term ","element":"span"},{"text":"n ","element":"span"},{"text":"in this sum, and divide into two cases, depending on the sign of ","element":"span"},{"style":{"height":17.62},"width":143.48,"height":44.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/21-5.png","element":"img","alt":" x⊤n r (t).","inline":true}],[{"id":"id-73","style":{"width":"96%"},"width":1661,"height":175,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-0.png","element":"img"}],[{"text":"We further divide into cases:","element":"span"}],[{"text":"1. If","element":"span"},{"style":{"height":22.16},"width":390.16,"height":55.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-1.png","element":"img","alt":"��x⊤n r(t)�� ≤ C0t−0.5µ+","inline":true},{"text":", then we can upper bound eq. 
","element":"span"},{"href":"#id-73","text":"46 ","element":"a"},{"text":"with","element":"span"}],[{"id":"id-82","style":{"width":"70%"},"width":1218,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-2.png","element":"img"}],[{"text":"2. If","element":"span"},{"style":{"height":21.97},"width":390.16,"height":54.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-3.png","element":"img","alt":"��x⊤n r(t)�� > C0t−0.5µ+","inline":true},{"text":", then we can find ","element":"span"},{"style":{"height":16.82},"width":144.08,"height":42.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-4.png","element":"img","alt":" t′′+ > t′+ ","inline":true,"padRight":true},{"text":"to upper bound eq. ","element":"span"},{"href":"#id-73","text":"46 ","element":"a"},{"style":{"height":18.02},"width":154.08,"height":45.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-5.png","element":"img","alt":" ∀t > t′′+:","inline":true}],[{"id":"id-80","style":{"width":"100%"},"width":1733,"height":412,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-6.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we used the fact that ","element":"span"},{"style":{"height":20.35},"width":1013.2,"height":50.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-7.png","element":"img","alt":" e−x ≤ 1 − x + x2 for x ≥ 0 and in (2) we defined t′′+ so","inline":true,"padRight":true},{"text":"that the previous expression is negative — since ","element":"span"},{"style":{"height":15.14},"width":127.6,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-8.png","element":"img","alt":" t−0.5µ+ ","inline":true,"padRight":true},{"text":"decreases slower than 
","element":"span"},{"style":{"height":12.34},"width":99.8,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-9.png","element":"img","alt":" t−µ+.","inline":true}],[{"text":"3. If","element":"span"},{"style":{"height":22.16},"width":247.4,"height":55.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-10.png","element":"img","alt":"��x⊤n r(t)�� ≥ ǫ2","inline":true},{"text":", then we define ","element":"span"},{"style":{"height":25.38},"width":1084.76,"height":63.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-11.png","element":"img","alt":" t′′′+ > t′′+ such that t′′′+ > exp�minn ˜w⊤xn� �e0.5ǫ2 − 1�−1/µ+,","inline":true,"padRight":true},{"text":"and therefore ","element":"span"},{"style":{"height":20.8},"width":1135.64,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-12.png","element":"img","alt":" ∀t > t′′′+, we have�1 + t−µ+ exp�−µ+ ˜w⊤xn��e−ǫ2 < e−0.5ǫ2 .","inline":true}],[{"id":"id-78","style":{"width":"99%"},"width":1728,"height":290,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-13.png","element":"img"}],[{"text":"1. If","element":"span"},{"style":{"height":22.16},"width":398.36,"height":55.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-14.png","element":"img","alt":"��x⊤n r(t)�� ≤ C0t−0.5µ−","inline":true},{"text":", then, since ","element":"span"},{"style":{"height":31.6},"width":388.24,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-15.png","element":"img","alt":" −ℓ′ �w (t)⊤ xn�> 0","inline":true},{"text":", we can upper bound term ","element":"span"},{"text":"n ","element":"span"},{"text":"in eq. 
","element":"span"},{"href":"#id-74","text":"45 ","element":"a"},{"text":"with","element":"span"}],[{"id":"id-83","style":{"width":"82%"},"width":1419,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-16.png","element":"img"}],[{"text":"2. If","element":"span"},{"style":{"height":21.97},"width":398.36,"height":54.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-17.png","element":"img","alt":"��x⊤n r (t)�� > C0t−0.5µ−","inline":true,"padRight":true},{"text":", then, using eq. ","element":"span"},{"href":"#id-75","text":"40 ","element":"a"},{"text":"we upper bound term ","element":"span"},{"text":"n ","element":"span"},{"text":"in eq. ","element":"span"},{"href":"#id-74","text":"45 ","element":"a"},{"text":"with","element":"span"}],[{"id":"id-77","style":{"width":"94%"},"width":1626,"height":320,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/22-18.png","element":"img"}],[{"text":"Next, we will show that ","element":"span"},{"style":{"height":16.62},"width":174.8,"height":41.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-0.png","element":"img","alt":" ∃t′− > t− ","inline":true,"padRight":true},{"text":"such that the last expression is strictly negative ","element":"span"},{"style":{"height":16.62},"width":159.8,"height":41.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-1.png","element":"img","alt":" ∀t > t′−.","inline":true,"padRight":true},{"text":"Let ","element":"span"},{"text":"M > ","element":"span"},{"text":"1 ","element":"span"},{"text":"be some arbitrary constant. 
Then, since","element":"span"},{"style":{"height":32.4},"width":673.84,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-2.png","element":"img","alt":"(t−1e− ˜w⊤xn exp(−r (t)⊤ xn))µ− =","inline":true}],[{"id":"id-76","style":{"width":"93%"},"width":1623,"height":935,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-3.png","element":"img"}],[{"text":"since","element":"span"},{"style":{"height":22.16},"width":957.16,"height":55.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-4.png","element":"img","alt":"|x⊤n r (t)| > C0t−0.5µ−, x⊤n r (t) < 0 and ex ≥ 1 + x","inline":true},{"text":". In this case the last line is strictly ","element":"span"},{"text":"larger than ","element":"span"},{"text":"1 ","element":"span"},{"text":"for sufficiently large ","element":"span"},{"text":"t","element":"span"},{"text":". Therefore, after we substitute eqs. ","element":"span"},{"href":"#id-76","text":"52 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-76","text":"53 ","element":"a"},{"text":"into ","element":"span"},{"href":"#id-77","text":"51, ","element":"a"},{"text":"we find that ","element":"span"},{"style":{"height":16.82},"width":738.2,"height":42.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-5.png","element":"img","alt":" ∃t′− > tM > t− such that ∀t > t′−, term k","inline":true,"padRight":true},{"text":"in eq. ","element":"span"},{"href":"#id-74","text":"45 ","element":"a"},{"text":"is strictly negative","element":"span"}],[{"id":"id-81","style":{"width":"71%"},"width":1233,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-6.png","element":"img"}],[{"text":"3. 
If","element":"span"},{"style":{"height":22.16},"width":250.28,"height":55.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-7.png","element":"img","alt":"|x⊤k r(t)| ≥ ǫ2","inline":true,"padRight":true},{"text":", which is a special case of the previous case (","element":"span"},{"style":{"height":22.16},"width":506.32,"height":55.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-8.png","element":"img","alt":"|x⊤k r (t)| > C0t−0.5µ−) then","inline":true},{"style":{"height":16.82},"width":146.96,"height":42.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-9.png","element":"img","alt":"∀t > t′−","inline":true},{"text":", either eq. ","element":"span"},{"href":"#id-76","text":"52 ","element":"a"},{"text":"or ","element":"span"},{"href":"#id-76","text":"53 ","element":"a"},{"text":"holds. Furthermore, in this case, ","element":"span"},{"style":{"height":16.82},"width":512.08,"height":42.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-10.png","element":"img","alt":" ∃t′′− > t′− and M′′ > 1 such","inline":true,"padRight":true},{"text":"that ","element":"span"},{"style":{"height":16.81},"width":140.24,"height":42.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-11.png","element":"img","alt":" ∀t > t′′− ","inline":true,"padRight":true},{"text":"eq. ","element":"span"},{"href":"#id-76","text":"53 ","element":"a"},{"text":"can be lower bounded by","element":"span"}],[{"style":{"width":"93%"},"width":1620,"height":279,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-12.png","element":"img"}],[{"text":"To conclude, we choose ","element":"span"},{"style":{"height":20.8},"width":338.4,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-13.png","element":"img","alt":" t0 = max(t′′′+, t′′−):","inline":true}],[{"text":"1. 
If ","element":"span"},{"style":{"height":17.6},"width":267.08,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-14.png","element":"img","alt":" ∥P1r (t)∥ ≥ ǫ1","inline":true,"padRight":true},{"text":"(as in Eq. ","element":"span"},{"href":"#id-64","text":"28)","element":"a"},{"text":", we have that","element":"span"}],[{"style":{"width":"91%"},"width":1579,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-15.png","element":"img"}],[{"text":"where in ","element":"span"},{"style":{"height":18},"width":740.84,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-16.png","element":"img","alt":" (1) we used P⊤1 xn = xn ∀n ∈ S, in (2)","inline":true,"padRight":true},{"text":"we denoted by ","element":"span"},{"style":{"height":17.6},"width":187.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-17.png","element":"img","alt":" σmin (XS)","inline":true},{"text":", the minimal ","element":"span"},{"text":"non-zero singular value of ","element":"span"},{"style":{"height":14.69},"width":59.92,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-18.png","element":"img","alt":" XS","inline":true,"padRight":true},{"text":"and used eq. ","element":"span"},{"href":"#id-64","text":"28. 
","element":"a"},{"text":"Therefore, for some ","element":"span"},{"style":{"height":21.97},"width":316.72,"height":54.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/23-19.png","element":"img","alt":" k,��x⊤k r�� ≥ ǫ2 ≜","inline":true}],[{"style":{"width":"93%"},"width":1615,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-0.png","element":"img"}],[{"style":{"height":20.8},"width":697.64,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-1.png","element":"img","alt":"η exp�− maxn ˜w⊤xn� �1 − e−0.5ǫ2�ǫ2","inline":true,"padRight":true},{"text":"(eq. ","element":"span"},{"href":"#id-78","text":"49)","element":"a"},{"text":". Then we find that eq. ","element":"span"},{"href":"#id-74","text":"45 ","element":"a"},{"text":"can be upper bounded by ","element":"span"},{"style":{"height":20.8},"width":487.88,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-2.png","element":"img","alt":" −C′′0 t−1 + o�t−1�, ∀t > t0","inline":true},{"text":", given eq. ","element":"span"},{"href":"#id-64","text":"28. ","element":"a"},{"text":"Substituting this result, together with eqs. ","element":"span"},{"href":"#id-79","text":"42 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-74","text":"44 ","element":"a"},{"text":"into eq. 
","element":"span"},{"href":"#id-72","text":"41, ","element":"a"},{"text":"we obtain ","element":"span"},{"style":{"height":15.09},"width":130.76,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-3.png","element":"img","alt":" ∀t > t0","inline":true}],[{"style":{"width":"48%"},"width":841,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-4.png","element":"img"}],[{"text":"This implies that ","element":"span"},{"style":{"height":17.01},"width":437.96,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-5.png","element":"img","alt":" ∃C2 < C′′0 and ∃t2 > t0","inline":true,"padRight":true},{"text":"such that eq. ","element":"span"},{"href":"#id-69","text":"29 ","element":"a"},{"text":"holds. This implies also that eq. ","element":"span"},{"href":"#id-65","text":"27 ","element":"a"},{"text":"holds for ","element":"span"},{"style":{"height":17.6},"width":280.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-6.png","element":"img","alt":" ∥P1r (t)∥ ≥ ǫ1.","inline":true}],[{"text":"2. Otherwise, if ","element":"span"},{"style":{"height":17.6},"width":267.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-7.png","element":"img","alt":" ∥P1r (t)∥ < ǫ1","inline":true},{"text":", we find that ","element":"span"},{"style":{"height":15.09},"width":130.76,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-8.png","element":"img","alt":" ∀t > t0","inline":true,"padRight":true},{"text":", each term in eq. ","element":"span"},{"href":"#id-74","text":"45 ","element":"a"},{"text":"can be upper bounded by either zero (eqs. 
","element":"span"},{"href":"#id-80","text":"48 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-81","text":"54)","element":"a"},{"text":", or terms proportional to ","element":"span"},{"style":{"height":15.14},"width":170.8,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-9.png","element":"img","alt":" t−1−1.5µ+ ","inline":true,"padRight":true},{"text":"(eq. ","element":"span"},{"href":"#id-82","text":"47) ","element":"a"},{"text":"or ","element":"span"},{"style":{"height":18.74},"width":263.96,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-10.png","element":"img","alt":" t−1−0.5µ−, (eq.","inline":true,"padRight":true},{"href":"#id-83","text":"50)","element":"a"},{"text":". Combining this together with eqs. ","element":"span"},{"href":"#id-79","text":"42, ","element":"a"},{"href":"#id-74","text":"44 ","element":"a"},{"text":"into eq. ","element":"span"},{"href":"#id-72","text":"41 ","element":"a"},{"text":"we obtain (for some positive constants ","element":"span"},{"style":{"height":15.6},"width":354.84,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-11.png","element":"img","alt":" C3, C4, C5, and C6)","inline":true}],[{"style":{"width":"85%"},"width":1479,"height":148,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-12.png","element":"img"}]]},{"heading":"Appendix B. Generic solutions of the KKT conditions in eq. 6","paragraphs":[[{"text":"Lemma 12 For almost all datasets there is a unique ","element":"span"},{"style":{"height":8},"width":33,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-13.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"which satisfies the KKT conditions (eq. 
","element":"span"},{"href":"#id-25","text":"6)","element":"a"},{"text":":","element":"span"}],[{"style":{"width":"84%"},"width":1469,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-14.png","element":"img"}],[{"text":"Furthermore, in this solution ","element":"span"},{"style":{"height":16.8},"width":526.44,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-15.png","element":"img","alt":" αn ̸= 0 if ˆw⊤xn = 1, i.e., xn","inline":true,"padRight":true},{"text":"is a support vector (","element":"span"},{"style":{"height":13.2},"width":112.04,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-16.png","element":"img","alt":"n ∈ S","inline":true},{"text":"), and there are at most ","element":"span"},{"text":"d ","element":"span"},{"text":"such support vectors.","element":"span"}],[{"text":"For almost every set ","element":"span"},{"text":"X","element":"span"},{"text":", no more than ","element":"span"},{"style":{"height":16.4},"width":203.4,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-17.png","element":"img","alt":" d points xn","inline":true,"padRight":true},{"text":"can be on the same hyperplane. Therefore, since all support vectors must lie on the same hyperplane, there can be at most ","element":"span"},{"text":"d ","element":"span"},{"text":"support vectors, for almost every ","element":"span"},{"text":"X","element":"span"},{"text":".","element":"span"}],[{"text":"Given the set of support vectors, ","element":"span"},{"text":"S","element":"span"},{"text":", the KKT conditions of eq. 
","element":"span"},{"href":"#id-25","text":"6 ","element":"a"},{"text":"entail that ","element":"span"},{"style":{"height":17.6},"width":358.96,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-18.png","element":"img","alt":" αn = 0 if n /∈ S and","inline":true}],[{"id":"id-85","style":{"width":"62%"},"width":1087,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-19.png","element":"img"}],[{"text":"where we denoted ","element":"span"},{"style":{"height":11.09},"width":155.4,"height":27.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-20.png","element":"img","alt":" αS as α","inline":true,"padRight":true},{"text":"restricted to the support vector components. For almost every set ","element":"span"},{"text":"X","element":"span"},{"text":", since ","element":"span"},{"style":{"height":21.12},"width":479.88,"height":52.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-21.png","element":"img","alt":" d ≥ |S|, X⊤S XS ∈ R|S|×|S|","inline":true,"padRight":true},{"text":"is invertible. 
Therefore, ","element":"span"},{"style":{"height":10.69},"width":55.6,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-22.png","element":"img","alt":" αS","inline":true,"padRight":true},{"text":"has the unique solution","element":"span"}],[{"id":"id-84","style":{"width":"61%"},"width":1055,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-23.png","element":"img"}],[{"text":"This implies that ","element":"span"},{"style":{"height":15.2},"width":215.4,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-24.png","element":"img","alt":" ∀n ∈ S, αn","inline":true,"padRight":true},{"text":"is equal to a rational function in the components of ","element":"span"},{"style":{"height":14.8},"width":265.36,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-25.png","element":"img","alt":" XS, i.e., αn =","inline":true},{"style":{"height":17.6},"width":611.88,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-26.png","element":"img","alt":"pn (XS) /qn (XS), where pn and qn","inline":true,"padRight":true},{"text":"are polynomials in the components of ","element":"span"},{"style":{"height":14.69},"width":59.92,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-27.png","element":"img","alt":" XS","inline":true},{"text":". 
Therefore, if ","element":"span"},{"style":{"height":14.4},"width":141.08,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-28.png","element":"img","alt":" αn = 0,","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":17.6},"width":222.16,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-29.png","element":"img","alt":" pn (XS) = 0","inline":true},{"text":", so the components of ","element":"span"},{"style":{"height":14.69},"width":59.92,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-30.png","element":"img","alt":" XS","inline":true,"padRight":true},{"text":"must be at a root of the polynomial ","element":"span"},{"style":{"height":11.6},"width":43.08,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-31.png","element":"img","alt":" pn","inline":true},{"text":". The roots of the polynomial ","element":"span"},{"style":{"height":11.6},"width":43.08,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-32.png","element":"img","alt":" pn","inline":true,"padRight":true},{"text":"have measure zero, unless ","element":"span"},{"style":{"height":17.6},"width":615.24,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-33.png","element":"img","alt":" ∀XS : pn (XS) = 0. 
However, pn","inline":true,"padRight":true},{"text":"cannot be identically equal to zero, since, for example, if ","element":"span"},{"style":{"height":21.66},"width":1091.92,"height":54.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-34.png","element":"img","alt":" X⊤S = [I|S|×|S|, 0|S|×(d−|S|)], then X⊤S XS = I|S|×|S|, and so","inline":true,"padRight":true},{"text":"in this case ","element":"span"},{"style":{"height":16.8},"width":365.68,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/24-35.png","element":"img","alt":" ∀n ∈ S, αn = 1 ≠ 0","inline":true},{"text":", from eq. ","element":"span"},{"href":"#id-84","text":"58.","element":"a"}],[{"text":"Therefore, for a given ","element":"span"},{"text":"S","element":"span"},{"text":", the event that “eq. ","element":"span"},{"href":"#id-85","text":"57 ","element":"a"},{"text":"has a solution with a zero component” has measure zero. Moreover, the union of these events, for all possible ","element":"span"},{"text":"S","element":"span"},{"text":", also has measure zero, as a finite union of measure-zero sets (there are only finitely many possible sets ","element":"span"},{"style":{"height":17.6},"width":415.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-0.png","element":"img","alt":" S ⊂ {1, . . . , N} ). This","inline":true,"padRight":true},{"text":"implies that, for almost all datasets ","element":"span"},{"style":{"height":17.6},"width":447.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-1.png","element":"img","alt":" X, αn = 0 only if n ∉ S","inline":true},{"text":". 
Furthermore, for almost all datasets the solution ","element":"span"},{"style":{"height":8},"width":33,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-2.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"is unique: for each dataset, ","element":"span"},{"text":"S ","element":"span"},{"text":"is uniquely determined, and given ","element":"span"},{"text":"S ","element":"span"},{"text":", the solution of eq. ","element":"span"},{"href":"#id-85","text":"57 ","element":"a"},{"text":"is uniquely given by eq. ","element":"span"},{"href":"#id-84","text":"58. ","element":"a"},{"style":{"height":0},"width":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-3.png","element":"img","alt":" ■","inline":true}]]},{"heading":"Appendix C. Completing the proof of Theorem 3 for zero measure cases","paragraphs":[[{"text":"In the preceding Appendices, we established Theorem ","element":"span"},{"href":"#id-53","text":"4, ","element":"a"},{"text":"which applies only when all support vectors are associated with non-zero coefficients. This characterizes almost all data sets, i.e., all except for a set of measure zero. We now turn to presenting and proving a more complete characterization of the limit behaviour of gradient descent, which covers all data sets, including those degenerate data sets not covered by Theorem ","element":"span"},{"href":"#id-53","text":"4, ","element":"a"},{"text":"thus establishing Theorem ","element":"span"},{"href":"#id-19","text":"3.","element":"a"}],[{"text":"In order to do so, we first have to introduce additional notation and a recursive treatment of the data set. 
We will define a sequence of data sets ","element":"span"},{"style":{"height":20.77},"width":150.8,"height":51.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-4.png","element":"img","alt":"¯PmX ¯Sm","inline":true,"padRight":true},{"text":"obtained by considering only a subset ","element":"span"},{"style":{"height":17.5},"width":56.4,"height":43.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-5.png","element":"img","alt":"¯Sm","inline":true,"padRight":true},{"text":"of the points, and projecting them using the projection matrix ","element":"span"},{"style":{"height":17.7},"width":64.08,"height":44.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-6.png","element":"img","alt":"¯Pm","inline":true},{"text":". We start, for ","element":"span"},{"text":"m ","element":"span"},{"text":"= 0","element":"span"},{"text":", with the full original data set, i.e. ","element":"span"},{"style":{"height":19.41},"width":587.76,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-7.png","element":"img","alt":"¯S0 = {1, . . . , N} and ¯P0 = Id×d","inline":true},{"text":". 
We then define ","element":"span"},{"style":{"height":15.28},"width":66.48,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-8.png","element":"img","alt":" ˆwm","inline":true,"padRight":true},{"text":"as the max margin predictor for ","element":"span"},{"style":{"height":21.15},"width":322.08,"height":52.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-9.png","element":"img","alt":"¯Pm−1X ¯Sm−1, i.e.:","inline":true}],[{"id":"id-86","style":{"width":"78%"},"width":1363,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-10.png","element":"img"}],[{"text":"In particular, ","element":"span"},{"style":{"height":15.28},"width":53.48,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-11.png","element":"img","alt":" ˆw1","inline":true,"padRight":true},{"text":"is the max margin predictor for the original data set. We then denote by ","element":"span"},{"style":{"height":18.75},"width":56.4,"height":46.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-12.png","element":"img","alt":" S+m ","inline":true,"padRight":true},{"text":"the indices ","element":"span"},{"text":"of non-support vectors for ","element":"span"},{"href":"#id-86","text":"59, ","element":"a"},{"style":{"height":15.09},"width":56.4,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-13.png","element":"img","alt":" Sm","inline":true,"padRight":true},{"text":"the indices of support vectors of ","element":"span"},{"href":"#id-86","text":"59 ","element":"a"},{"text":"with non-zero coefficients for the dual variables corresponding to the margin constraints (for some dual solution), and ","element":"span"},{"style":{"height":17.5},"width":183.88,"height":43.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-14.png","element":"img","alt":"¯Sm the set","inline":true,"padRight":true},{"text":"of support vectors 
with zero coefficients. That is:","element":"span"}],[{"id":"id-105","style":{"width":"91%"},"width":1586,"height":393,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-15.png","element":"img"}],[{"text":"The problematic degenerate case, not covered by the analysis of Theorem ","element":"span"},{"href":"#id-53","text":"4, ","element":"a"},{"text":"is when there are support vectors with zero coefficients, i.e., when ","element":"span"},{"style":{"height":18.82},"width":150.16,"height":47.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-16.png","element":"img","alt":"¯Sm ̸= ∅","inline":true},{"text":". In this case we recurse on these zero-coefficient support vectors (i.e., on ","element":"span"},{"style":{"height":17.5},"width":56.4,"height":43.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-17.png","element":"img","alt":"¯Sm","inline":true},{"text":"), but only consider their components orthogonal to the non-zero-coefficient support vectors (i.e., not spanned by points in ","element":"span"},{"style":{"height":15.09},"width":56.4,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-18.png","element":"img","alt":" Sm","inline":true},{"text":"). That is, we project using:","element":"span"}],[{"style":{"width":"100%"},"width":1734,"height":291,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/25-19.png","element":"img"}],[{"text":"denote the stopping stage ","element":"span"},{"text":"M","element":"span"},{"text":"—that is, ","element":"span"},{"text":"M ","element":"span"},{"text":"is the minimal ","element":"span"},{"style":{"height":17.51},"width":355.12,"height":43.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-0.png","element":"img","alt":" m such that ¯Sm = ∅","inline":true},{"text":". 
Our characterization will be in terms of the sequence ","element":"span"},{"style":{"height":15.79},"width":223.16,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-1.png","element":"img","alt":" ˆw1, . . . , ˆwM","inline":true},{"text":". As established in Lemma ","element":"span"},{"text":"12 ","element":"span"},{"text":"of Appendix ","element":"span"},{"text":"B, ","element":"span"},{"text":"for almost all data sets we will not have support vectors with zero coefficients, and so we will have ","element":"span"},{"text":"M ","element":"span"},{"text":"= 1","element":"span"},{"text":", so that the characterization depends only on the max margin predictor ","element":"span"},{"style":{"height":15.28},"width":53.48,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-2.png","element":"img","alt":" ˆw1","inline":true,"padRight":true},{"text":"of the original data set. But, even for the measure-zero set of data sets in which ","element":"span"},{"text":"M > ","element":"span"},{"text":"1","element":"span"},{"text":", we provide the following more complete characterization:","element":"span"}],[{"id":"id-23","text":"Theorem 13 ","element":"span"},{"text":"For all datasets which are linearly separable (Assumption ","element":"span"},{"href":"#id-8","text":"1) ","element":"a"},{"text":"and given a ","element":"span"},{"style":{"height":16.4},"width":166.96,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-3.png","element":"img","alt":" β-smooth","inline":true,"padRight":true},{"text":"loss function (Assumption ","element":"span"},{"href":"#id-8","text":"2) ","element":"a"},{"text":"with an exponential tail (Assumption ","element":"span"},{"href":"#id-15","text":"3)","element":"a"},{"text":", gradient descent (as in eq. 
","element":"span"},{"href":"#id-10","text":"2) ","element":"a"},{"text":"with step size ","element":"span"},{"style":{"height":19.15},"width":355.88,"height":47.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-4.png","element":"img","alt":" η < 2β−1σ−2max (X )","inline":true,"padRight":true},{"text":"and any starting point ","element":"span"},{"text":"w","element":"span"},{"text":"(0)","element":"span"},{"text":", the iterates of gradient descent can ","element":"span"},{"text":"be written as:","element":"span"}],[{"id":"id-134","style":{"width":"100%"},"width":1729,"height":259,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-5.png","element":"img"}],[{"text":"residual ","element":"span"},{"style":{"height":17.6},"width":83.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-6.png","element":"img","alt":" ρ (t)","inline":true,"padRight":true},{"text":"is bounded.","element":"span"}],[{"text":"C.1 Auxiliary notation","element":"span"}],[{"text":"We say that a function ","element":"span"},{"style":{"height":16.4},"width":193.76,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-7.png","element":"img","alt":" f : N → R","inline":true,"padRight":true},{"text":"is absolutely summable if ","element":"span"},{"style":{"height":19.25},"width":320,"height":48.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-8.png","element":"img","alt":"�∞t=1 |f (t)| < ∞","inline":true},{"text":", and then we denote ","element":"span"},{"style":{"height":17.6},"width":183.08,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-9.png","element":"img","alt":"f (t) ∈ L1","inline":true},{"text":". 
Furthermore, we define","element":"span"}],[{"style":{"width":"68%"},"width":1176,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.68},"width":247.44,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-11.png","element":"img","alt":" ˜wm and ˇwk,m","inline":true,"padRight":true},{"text":"are defined next, and additionally, we denote","element":"span"}],[{"style":{"width":"14%"},"width":259,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-12.png","element":"img"}],[{"text":"We define, ","element":"span"},{"style":{"height":15.2},"width":231.12,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-13.png","element":"img","alt":" ∀m ≥ 1, ˜wm","inline":true,"padRight":true},{"text":"as the solution of","element":"span"}],[{"id":"id-89","style":{"width":"84%"},"width":1463,"height":135,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-14.png","element":"img"}],[{"text":"such that","element":"span"}],[{"id":"id-91","style":{"width":"66%"},"width":1148,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-15.png","element":"img"}],[{"text":"The existence and uniqueness of the solution, ","element":"span"},{"style":{"height":14.48},"width":66.48,"height":36.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-16.png","element":"img","alt":" ˜wm","inline":true,"padRight":true},{"text":"are proved in appendix section ","element":"span"},{"href":"#id-87","text":"C.4. 
","element":"a"},{"text":"Lastly, we define, ","element":"span"},{"style":{"height":17.68},"width":341.52,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-17.png","element":"img","alt":" ∀m > k ≥ 1, ˇwk,m","inline":true,"padRight":true},{"text":"as the solution of","element":"span"}],[{"id":"id-101","style":{"width":"88%"},"width":1522,"height":157,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-18.png","element":"img"}],[{"text":"such that","element":"span"}],[{"id":"id-90","style":{"width":"67%"},"width":1169,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/26-19.png","element":"img"}],[{"text":"The existence and uniqueness of the solution ","element":"span"},{"style":{"height":16.67},"width":94.8,"height":41.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-0.png","element":"img","alt":" ˇwk,m","inline":true,"padRight":true},{"text":"are proved in appendix section ","element":"span"},{"href":"#id-88","text":"C.5. ","element":"a"},{"text":"Together, eqs. ","element":"span"},{"href":"#id-89","text":"63-","element":"a"},{"href":"#id-90","text":"66 ","element":"a"},{"text":"entail the existence of a unique decomposition, ","element":"span"},{"style":{"height":14.8},"width":167.04,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-1.png","element":"img","alt":" ∀m ≥ 1 :","inline":true}],[{"id":"id-113","style":{"width":"90%"},"width":1558,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-2.png","element":"img"}],[{"text":"given the constraints in eqs. 
","element":"span"},{"href":"#id-91","text":"64 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-90","text":"66 ","element":"a"},{"text":"hold.","element":"span"}],[{"text":"C.2 Proof of Theorem ","element":"span"},{"href":"#id-23","text":"13","element":"a"}],[{"text":"In the following proofs, for any solution ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":")","element":"span"},{"text":", we define","element":"span"}],[{"style":{"width":"55%"},"width":953,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-3.png","element":"img"}],[{"text":"noting that","element":"span"}],[{"style":{"width":"31%"},"width":538,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-4.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-92","style":{"width":"69%"},"width":1201,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.79},"width":37,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-6.png","element":"img","alt":" ˜w","inline":true,"padRight":true},{"text":"follow the conditions of Theorem ","element":"span"},{"href":"#id-23","text":"13. ","element":"a"},{"text":"Our goal is to show that ","element":"span"},{"style":{"height":17.6},"width":113.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-7.png","element":"img","alt":" ∥r(t)∥","inline":true,"padRight":true},{"text":"is bounded. 
To show this, we will upper bound the following equation","element":"span"}],[{"id":"id-97","style":{"width":"86%"},"width":1498,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-8.png","element":"img"}],[{"text":"First, we note that ","element":"span"},{"style":{"height":15.09},"width":368.36,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-9.png","element":"img","alt":" ∃t0 such that ∀t > t0","inline":true,"padRight":true},{"text":"the first term in this equation can be upper bounded by","element":"span"}],[{"id":"id-94","style":{"width":"97%"},"width":1680,"height":549,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-10.png","element":"img"}],[{"text":"where in (1) we used eq. ","element":"span"},{"href":"#id-92","text":"68, ","element":"a"},{"text":"in (2) we used eq. ","element":"span"},{"href":"#id-10","text":"2 ","element":"a"},{"text":"and in (3) we used ","element":"span"},{"style":{"height":17.6},"width":544.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-11.png","element":"img","alt":" ∀x > 0 : x ≥ log (1 + x) > 0,","inline":true,"padRight":true},{"text":"and also using ","element":"span"},{"style":{"height":17.6},"width":307.6,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-12.png","element":"img","alt":" ℓ′(w(t)⊤xn) < 0","inline":true,"padRight":true},{"text":"for large enough ","element":"span"},{"text":"t","element":"span"},{"text":", we have that","element":"span"}],[{"style":{"width":"102%"},"width":1769,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-13.png","element":"img"}],[{"text":"which is negative for sufficiently large ","element":"span"},{"style":{"height":20.8},"width":408.8,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-14.png","element":"img","alt":" t0 (since log�1 + 
t−1�","inline":true},{"text":"decreases as ","element":"span"},{"style":{"height":15.14},"width":59.24,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-15.png","element":"img","alt":" t−1","inline":true},{"text":", which is slower than ","element":"span"},{"style":{"height":18},"width":810.2,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/27-16.png","element":"img","alt":" 1/ (t log (t))), ∀n : ˆw⊤1 xn ≥ 1 and ℓ′(u) ≤ 0.","inline":true}],[{"text":"Also, from Lemma ","element":"span"},{"href":"#id-11","text":"10 ","element":"a"},{"text":"we know that:","element":"span"}],[{"id":"id-93","style":{"width":"76%"},"width":1324,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-0.png","element":"img"}],[{"text":"Substituting eq. ","element":"span"},{"href":"#id-93","text":"72 ","element":"a"},{"text":"into eq. ","element":"span"},{"href":"#id-94","text":"70, ","element":"a"},{"text":"and recalling that ","element":"span"},{"style":{"height":17.6},"width":259.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-1.png","element":"img","alt":" t−ν1 log−ν2 (t)","inline":true,"padRight":true},{"text":"converges for any ","element":"span"},{"style":{"height":16.4},"width":269.2,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-2.png","element":"img","alt":" ν1 > 1 and any","inline":true}],[{"id":"id-96","style":{"width":"99%"},"width":1727,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-3.png","element":"img"}],[{"id":"id-99","text":"Also, in the next subsection we will prove that","element":"span"}],[{"text":"Lemma 14 Let ","element":"span"},{"style":{"height":17.6},"width":289.64,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-4.png","element":"img","alt":" κ1 (t) and κ2 (t)","inline":true,"padRight":true},{"text":"be functions in 
","element":"span"},{"style":{"height":15.09},"width":144.88,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-5.png","element":"img","alt":" L1, then","inline":true}],[{"id":"id-95","style":{"width":"75%"},"width":1298,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-6.png","element":"img"}],[{"text":"Thus, by combining eqs. ","element":"span"},{"href":"#id-95","text":"74 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-96","text":"73 ","element":"a"},{"text":"into eq. ","element":"span"},{"href":"#id-97","text":"69, ","element":"a"},{"text":"we find","element":"span"}],[{"style":{"width":"59%"},"width":1026,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-7.png","element":"img"}],[{"text":"On this result we apply the following lemma (with ","element":"span"},{"style":{"height":17.6},"width":804.88,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-8.png","element":"img","alt":" φ (t) = ∥r(t)∥, h (t) = 2κ1 (t), and z (t) =","inline":true},{"style":{"height":17.6},"width":277.16,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-9.png","element":"img","alt":"κ0 (t) + 2κ2 (t)","inline":true},{"text":"), which we prove in appendix ","element":"span"},{"href":"#id-98","text":"C.6:","element":"a"}],[{"id":"id-130","text":"Lemma 15 ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":17.6},"width":297.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-10.png","element":"img","alt":" φ (t) , h (t) , z (t)","inline":true,"padRight":true},{"text":"be three functions from ","element":"span"},{"style":{"height":17.49},"width":440.84,"height":43.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-11.png","element":"img","alt":" N to R≥0, and C1, C2, C3","inline":true,"padRight":true},{"text":"be three positive 
constants. Then, if ","element":"span"},{"style":{"height":19.25},"width":492,"height":48.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-12.png","element":"img","alt":"∑∞t=1 h (t) ≤ C1 < ∞, and","inline":true}],[{"id":"id-131","style":{"width":"70%"},"width":1215,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-13.png","element":"img"}],[{"text":"we have","element":"span"}],[{"style":{"width":"66%"},"width":1144,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-14.png","element":"img"}],[{"text":"and obtain that","element":"span"}],[{"style":{"width":"60%"},"width":1048,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-15.png","element":"img"}],[{"text":"since we assumed that ","element":"span"},{"style":{"height":17.6},"width":441.32,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-16.png","element":"img","alt":" ∀i = 0, 1, 2 : κi (t) ∈ L1","inline":true},{"text":". This completes our proof. ","element":"span"},{"style":{"height":0},"width":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-17.png","element":"img","alt":" ■","inline":true}],[{"text":"C.3 Proof of Lemma ","element":"span"},{"href":"#id-99","text":"14","element":"a"}],[{"id":"id-100","text":"Before we prove Lemma ","element":"span"},{"href":"#id-99","text":"14, ","element":"a"},{"text":"we prove the following auxiliary lemma:","element":"span"}],[{"text":"Lemma 16 Consider the function ","element":"span"},{"style":{"height":20.3},"width":1128.08,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-18.png","element":"img","alt":" f(t) = t−ν1(log(t))−ν2(log log(t))−ν3 . . . (log◦M(t))−νM+1. 
If","inline":true},{"style":{"height":16.82},"width":592.24,"height":42.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-19.png","element":"img","alt":"∃m0 ≤ M + 1 such that νm0 > 1","inline":true,"padRight":true},{"text":"and for all ","element":"span"},{"style":{"height":17.6},"width":616.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-20.png","element":"img","alt":" m′ < m0,νm′ = 1, then f(t) ∈ L1.","inline":true}],[{"text":"Proof To prove Lemma ","element":"span"},{"href":"#id-100","text":"16, ","element":"a"},{"text":"we will show that the improper integral ","element":"span"},{"style":{"height":22.9},"width":500.36,"height":57.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-21.png","element":"img","alt":"∫_{t1}^∞ f(t)dt for any t1 > 0 is","inline":true,"padRight":true},{"text":"bounded, i.e., ","element":"span"},{"style":{"height":22.9},"width":459.28,"height":57.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-22.png","element":"img","alt":" ∀t1 > 0, ∫_{t1}^∞ f(t)dt < C","inline":true},{"text":". Using the integral test for convergence (or Maclaurin–","element":"span"},{"text":"Cauchy test) this in turn implies that ","element":"span"},{"style":{"height":20.97},"width":783.8,"height":52.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/28-23.png","element":"img","alt":" ∀t1 > 0, ∑_{t=t1}^∞ f(t) < C, and thus f(t) ∈ L1.","inline":true}],[{"text":"First, if ","element":"span"},{"style":{"height":16.81},"width":1260.88,"height":42.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-0.png","element":"img","alt":" m0 > 1, then ν1 = ν2 . . . = νm0−1 = 1 and νm0 = 1+ε for some ε > 0","inline":true},{"text":". 
Using change of variables ","element":"span"},{"style":{"height":20.91},"width":487.04,"height":52.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-1.png","element":"img","alt":" y = log◦(m0−1)(t), we have","inline":true}],[{"style":{"width":"62%"},"width":1088,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-2.png","element":"img"}],[{"text":"and for all ","element":"span"},{"style":{"height":32.18},"width":1245,"height":80.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-3.png","element":"img","alt":" m > m0,�log◦(m−1)(t)�−νm =�log◦(m−m0)(y)�−νm ≤ (log(y))|νm|","inline":true},{"text":". Thus, denoting","element":"span"}],[{"style":{"width":"99%"},"width":1727,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-4.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":31.98},"width":1585.36,"height":79.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-5.png","element":"img","alt":" m0 = 1, we have ν1 = 1 + ǫ for some ǫ > 0, and for m > 1,�log◦(m−1)(t)�−νm ≤","inline":true}],[{"style":{"height":22.07},"width":203.4,"height":55.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-6.png","element":"img","alt":"(log(t))|νm|","inline":true},{"text":". 
Thus, denoting ","element":"span"},{"style":{"height":28.48},"width":976.28,"height":71.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-7.png","element":"img","alt":" ν̃ = ∑_{m=2}^{M+1} |ν_m|, we have ∫_{t1}^∞ f(t) dt ≤ ∫_{t1}^∞ (log(t))^ν̃ / t^{1+ε} dt.","inline":true}],[{"text":"Thus, for any ","element":"span"},{"style":{"height":10.69},"width":55.4,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-8.png","element":"img","alt":" m0","inline":true},{"text":", we only need to show that for all ","element":"span"},{"style":{"height":28.48},"width":746.8,"height":71.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-9.png","element":"img","alt":" t1 > 0, ε > 0 and ν̃ > 0, ∫_{t1}^∞ (log(t))^ν̃ / t^{1+ε} dt <","inline":true},{"style":{"height":8},"width":55.68,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-10.png","element":"img","alt":"∞.","inline":true}],[{"text":"Let us now look at ","element":"span"},{"style":{"height":28.48},"width":1289.68,"height":71.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-11.png","element":"img","alt":"∫_{t1}^∞ (log(t))^ν̃ / t^{1+ε} dt. Using u = (log(t))^ν̃ and dv = 1/t^{1+ε}, we have du =","inline":true},{"style":{"height":22.42},"width":539.04,"height":56.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-12.png","element":"img","alt":"ν̃ t^{−1} (log(t))^{ν̃−1} and v = −1/(ε t^ε)","inline":true,"padRight":true},{"text":". 
Using integration by parts,","element":"span"},{"style":{"height":19.6},"width":540.32,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-13.png","element":"img","alt":"�udv = uv −�vdu, we have","inline":true}],[{"style":{"width":"53%"},"width":921,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-14.png","element":"img"}],[{"text":"Recursing the above equation ","element":"span"},{"text":"K ","element":"span"},{"text":"times such that ","element":"span"},{"style":{"height":12.8},"width":235.6,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-15.png","element":"img","alt":" ˜ν − K < 0","inline":true},{"text":", we have positive constants ","element":"span"},{"style":{"height":14.8},"width":304.24,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-16.png","element":"img","alt":"c0, c1, . . . cK > 0","inline":true,"padRight":true},{"text":"independent of ","element":"span"},{"text":"t","element":"span"},{"text":", such that","element":"span"}],[{"style":{"width":"88%"},"width":1533,"height":607,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-17.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"(1) ","element":"span"},{"text":"follows as ","element":"span"},{"style":{"height":26.96},"width":548.36,"height":67.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-18.png","element":"img","alt":"�K−1k=0 ck(log(t))˜ν−kǫtǫ t→∞→ 0, (2)","inline":true,"padRight":true},{"text":"follows as ","element":"span"},{"text":"K ","element":"span"},{"text":"is chosen such that ","element":"span"},{"style":{"height":12.8},"width":204.88,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-19.png","element":"img","alt":" ˜ν − K < 0","inline":true,"padRight":true},{"text":"and hence for all 
","element":"span"},{"style":{"height":21.26},"width":415.6,"height":53.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-20.png","element":"img","alt":" t > 0, (log(t))˜ν−K < 1","inline":true},{"text":". This completes the proof of the lemma.","element":"span"}],[{"text":"Lemma 14 Let ","element":"span"},{"style":{"height":17.6},"width":289.64,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-21.png","element":"img","alt":" κ1 (t) and κ2 (t)","inline":true,"padRight":true},{"text":"be functions in ","element":"span"},{"style":{"height":15.09},"width":144.88,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-22.png","element":"img","alt":" L1, then","inline":true}],[{"style":{"width":"75%"},"width":1298,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/29-23.png","element":"img"}],[{"text":"Proof Recall that we defined","element":"span"}],[{"id":"id-102","style":{"width":"60%"},"width":1040,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-0.png","element":"img"}],[{"text":"where","element":"span"}],[{"id":"id-108","style":{"width":"70%"},"width":1221,"height":283,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-1.png","element":"img"}],[{"text":"with ","element":"span"},{"style":{"height":17.87},"width":337.68,"height":44.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-2.png","element":"img","alt":" ˆwm, ˜wm and ˇwk,m","inline":true,"padRight":true},{"text":"defined in eqs. ","element":"span"},{"href":"#id-86","text":"59, ","element":"a"},{"href":"#id-89","text":"63 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-101","text":"65, ","element":"a"},{"text":"respectively. 
We note that","element":"span"}],[{"id":"id-103","style":{"width":"71%"},"width":1233,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-3.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"72%"},"width":1260,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-4.png","element":"img"}],[{"text":"Additionally, we define ","element":"span"},{"style":{"height":17.78},"width":251.08,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-5.png","element":"img","alt":" Ch, C′h so that","inline":true}],[{"id":"id-109","style":{"width":"70%"},"width":1216,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-6.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-107","style":{"width":"78%"},"width":1365,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-7.png","element":"img"}],[{"text":"We wish to calculate","element":"span"}],[{"id":"id-104","style":{"width":"84%"},"width":1466,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-8.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we used eq. ","element":"span"},{"href":"#id-102","text":"79 ","element":"a"},{"text":"and in ","element":"span"},{"text":"(2) ","element":"span"},{"text":"we used the definition of GD in eq. ","element":"span"},{"href":"#id-10","text":"2. ","element":"a"},{"text":"We can bound the second term using the Cauchy-Schwarz inequality and eq. ","element":"span"},{"href":"#id-103","text":"82:","element":"a"}],[{"style":{"width":"89%"},"width":1548,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-9.png","element":"img"}],[{"text":"Next, we examine the remaining term in eq. 
","element":"span"},{"href":"#id-104","text":"86","element":"a"}],[{"id":"id-106","style":{"width":"86%"},"width":1496,"height":553,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/30-10.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we recall from eq. ","element":"span"},{"href":"#id-105","text":"60 ","element":"a"},{"text":"that ","element":"span"},{"style":{"height":18.56},"width":134.16,"height":46.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-0.png","element":"img","alt":" Sm, S+m ","inline":true,"padRight":true},{"text":"are mutually exclusive and ","element":"span"},{"style":{"height":19.75},"width":402.2,"height":49.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-1.png","element":"img","alt":" ∪Mm=1Sm ∪ S+m = [N].","inline":true,"padRight":true},{"text":"Next, we upper bound the three terms in eq. ","element":"span"},{"href":"#id-106","text":"87. ","element":"a"},{"text":"To bound the first term in eq. ","element":"span"},{"href":"#id-106","text":"87 ","element":"a"},{"text":"we use the Cauchy-Schwarz inequality and eq. ","element":"span"},{"href":"#id-107","text":"85","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"91%"},"width":1591,"height":159,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-2.png","element":"img"}],[{"text":"In bounding the second term in eq. 
","element":"span"},{"href":"#id-106","text":"87, ","element":"a"},{"text":"note that for a loss with a tight exponential tail, since ","element":"span"},{"style":{"height":35.3},"width":1735.52,"height":88.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-3.png","element":"img","alt":" w (t)⊤ xn →∞","inline":true},{"text":", for large enough ","element":"span"},{"style":{"height":17.6},"width":1370.32,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-4.png","element":"img","alt":" t0, we have −ℓ′(w (t)⊤ xn) ≤ (1+exp(−µ+w (t)⊤ xn)) exp(−w (t)⊤ xn) ≤","inline":true},{"style":{"height":17.6},"width":585.32,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-5.png","element":"img","alt":"2 exp(−w (t)⊤ xn) for all t > t0","inline":true},{"text":". The second term in eq. ","element":"span"},{"href":"#id-106","text":"87 ","element":"a"},{"text":"can be bounded by the following set of inequalities, for ","element":"span"},{"style":{"height":14},"width":119.96,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-6.png","element":"img","alt":" t > t0,","inline":true}],[{"style":{"width":"94%"},"width":1642,"height":878,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-7.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we used eqs. ","element":"span"},{"href":"#id-102","text":"79 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-108","text":"80, ","element":"a"},{"text":"in ","element":"span"},{"text":"(2) ","element":"span"},{"text":"we used that ","element":"span"},{"style":{"height":17.62},"width":713.12,"height":44.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-8.png","element":"img","alt":" ∀x : xe−x ≤ 1 and x⊤n r (t) ≥ 0, (3) we","inline":true,"padRight":true},{"text":"used eq. 
","element":"span"},{"href":"#id-109","text":"84 ","element":"a"},{"text":"and in ","element":"span"},{"text":"(4) ","element":"span"},{"text":"we denoted ","element":"span"},{"style":{"height":20.46},"width":487.12,"height":51.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-9.png","element":"img","alt":" θm = minn∈S+m ˆw⊤mxn > 1","inline":true,"padRight":true},{"text":"and the last line is integrable based ","element":"span"},{"text":"on Lemma ","element":"span"},{"href":"#id-100","text":"16.","element":"a"}],[{"text":"Next, we bound the last term in eq. ","element":"span"},{"href":"#id-106","text":"87. ","element":"a"},{"text":"For exponential tailed losses (Assumption ","element":"span"},{"href":"#id-15","text":"3)","element":"a"},{"text":", since ","element":"span"},{"style":{"height":17.6},"width":275.84,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-10.png","element":"img","alt":"w(t)⊤xn → ∞","inline":true},{"text":", we have positive constants ","element":"span"},{"style":{"height":16.4},"width":631.76,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-11.png","element":"img","alt":" µ−, µ+ > 0, t− and t+ such that ∀n","inline":true}],[{"style":{"width":"80%"},"width":1397,"height":174,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-12.png","element":"img"}],[{"text":"We define ","element":"span"},{"style":{"height":17.6},"width":141.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-13.png","element":"img","alt":" γn(t) as","inline":true}],[{"style":{"width":"96%"},"width":1661,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/31-14.png","element":"img"}],[{"text":"From this result, we have the following set of 
inequalities:","element":"span"}],[{"id":"id-110","style":{"width":"103%"},"width":1794,"height":922,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/32-0.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we used eqs. ","element":"span"},{"href":"#id-102","text":"79 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-108","text":"80, ","element":"a"},{"text":"and in ","element":"span"},{"style":{"height":18.48},"width":486.64,"height":46.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/32-1.png","element":"img","alt":" (2) we used Pk−1 ˇwk,m = 0","inline":true,"padRight":true},{"text":"from eq. ","element":"span"},{"href":"#id-90","text":"66 ","element":"a"},{"text":"(so ","element":"span"},{"style":{"height":16.88},"width":211.6,"height":42.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/32-2.png","element":"img","alt":" x⊤n ˇwk,l = 0","inline":true,"padRight":true},{"text":"if ","element":"span"},{"text":"m < k","element":"span"},{"text":") and in ","element":"span"},{"text":"(3) ","element":"span"},{"text":"defined","element":"span"}],[{"style":{"width":"74%"},"width":1286,"height":133,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/32-3.png","element":"img"}],[{"text":"Note ","element":"span"},{"style":{"height":17.68},"width":379.6,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/32-4.png","element":"img","alt":" ∃tψ such that ∀t > tψ","inline":true},{"text":", we can bound ","element":"span"},{"style":{"height":17.6},"width":171.76,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/32-5.png","element":"img","alt":" ψm (t) by","inline":true}],[{"id":"id-112","style":{"width":"72%"},"width":1254,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/32-6.png","element":"img"}],[{"text":"Thus, the third term in 
","element":"span"},{"href":"#id-106","text":"87 ","element":"a"},{"text":"is given by","element":"span"}],[{"id":"id-111","style":{"width":"96%"},"width":1671,"height":691,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/32-7.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"(1) ","element":"span"},{"text":"follows from the bound in eq. ","element":"span"},{"href":"#id-110","text":"90.","element":"a"}],[{"text":"We examine the first term in eq. ","element":"span"},{"href":"#id-111","text":"93","element":"a"}],[{"style":{"width":"71%"},"width":1245,"height":264,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-0.png","element":"img"}],[{"style":{"height":17.68},"width":228.88,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-1.png","element":"img","alt":"∀t > t1 > tψ","inline":true},{"text":", where we will determine ","element":"span"},{"style":{"height":13.89},"width":32.84,"height":34.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-2.png","element":"img","alt":" t1","inline":true,"padRight":true},{"text":"later. 
We have the following for all ","element":"span"},{"style":{"height":17.6},"width":162.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-3.png","element":"img","alt":" m ∈ [M]","inline":true}],[{"style":{"width":"107%"},"width":1868,"height":668,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-4.png","element":"img"}],[{"text":"where we set ","element":"span"},{"style":{"height":15.09},"width":424.04,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-5.png","element":"img","alt":" t1 > 0 such that ∀t > t1","inline":true,"padRight":true},{"text":"the term in the square bracket is positive and","element":"span"}],[{"style":{"width":"28%"},"width":496,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-6.png","element":"img"}],[{"text":"in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we used that since ","element":"span"},{"style":{"height":14.34},"width":249.64,"height":35.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-7.png","element":"img","alt":" e−x ≥ 1 − x","inline":true},{"text":", and also from using ","element":"span"},{"style":{"height":17.6},"width":388.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-8.png","element":"img","alt":" e−xx ≤ 1 and in (2)","inline":true,"padRight":true},{"text":"we use that ","element":"span"},{"style":{"height":14.8},"width":163.6,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-9.png","element":"img","alt":"∀x ≥ −1","inline":true,"padRight":true},{"text":"we have that ","element":"span"},{"style":{"height":19.14},"width":605.2,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/33-10.png","element":"img","alt":" e−x ≤ 1 − x + x2 and ψm (t) ≤ 1","inline":true,"padRight":true},{"text":"from eq. 
","element":"span"},{"href":"#id-112","text":"92.","element":"a"}],[{"text":"We examine the second term in eq. ","element":"span"},{"href":"#id-111","text":"93 ","element":"a"},{"text":"using the decomposition of ","element":"span"},{"style":{"height":15.28},"width":66.48,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/34-0.png","element":"img","alt":" ˆwm","inline":true,"padRight":true},{"text":"from eq. ","element":"span"},{"href":"#id-113","text":"67","element":"a"}],[{"id":"id-114","style":{"width":"110%"},"width":1906,"height":1278,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/34-1.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we used eq. ","element":"span"},{"href":"#id-113","text":"67, ","element":"a"},{"text":"in ","element":"span"},{"text":"(2) ","element":"span"},{"text":"we re-arranged the order of summation in the last term, and in ","element":"span"},{"text":"(3) ","element":"span"},{"text":"we just use a change of variables.","element":"span"}],[{"text":"Next, we examine ","element":"span"},{"style":{"height":18.29},"width":583.92,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/34-2.png","element":"img","alt":" Γm,n(t) for each m and n ∈ Sm","inline":true,"padRight":true},{"text":"in eq. ","element":"span"},{"href":"#id-114","text":"95. 
","element":"a"},{"text":"Note that ","element":"span"},{"style":{"height":17.68},"width":340.84,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/34-3.png","element":"img","alt":" ∃t2 > tψ such that","inline":true}],[{"style":{"width":"72%"},"width":1253,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/34-4.png","element":"img"}],[{"text":"In this case, ","element":"span"},{"style":{"height":15.09},"width":131.24,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/34-5.png","element":"img","alt":" ∀t > t2","inline":true}],[{"style":{"width":"97%"},"width":1689,"height":182,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/34-6.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"(1) ","element":"span"},{"text":"follows from the definition of ","element":"span"},{"style":{"height":15.2},"width":196.72,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/34-7.png","element":"img","alt":" t2, wherein","inline":true}],[{"style":{"width":"95%"},"width":1645,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/34-8.png","element":"img"}],[{"style":{"width":"98%"},"width":1701,"height":706,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/35-0.png","element":"img"}],[{"text":"where in ","element":"span"},{"style":{"height":17.6},"width":407.92,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/35-1.png","element":"img","alt":" (1), we use ψm(t) ≤ 1","inline":true,"padRight":true},{"text":"from eq. ","element":"span"},{"href":"#id-112","text":"92 ","element":"a"},{"text":"and eq. 
","element":"span"},{"href":"#id-102","text":"79, ","element":"a"},{"text":"in ","element":"span"},{"text":"(2) ","element":"span"},{"text":"we used the bound on ","element":"span"},{"style":{"height":15.09},"width":57.84,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/35-2.png","element":"img","alt":" hm","inline":true,"padRight":true},{"text":"from eq. ","element":"span"},{"href":"#id-109","text":"84, ","element":"a"},{"text":"in ","element":"span"},{"text":"(3) ","element":"span"},{"text":"for some large enough ","element":"span"},{"style":{"height":32.62},"width":857.2,"height":81.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/35-3.png","element":"img","alt":" t′+ > t+, we have exp(µ+Ch∥xn∥)(�m−1r=1 log◦r(t))µ+ ≤ C1, and","inline":true,"padRight":true},{"text":"for the second term we used the inequality ","element":"span"},{"style":{"height":19.14},"width":857,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/35-4.png","element":"img","alt":" e−x ≤ 1 − x + 0.5x2 for x > 0, and (4) holds","inline":true,"padRight":true},{"text":"asymptotically for ","element":"span"},{"style":{"height":16.62},"width":134.48,"height":41.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/35-5.png","element":"img","alt":" t > t′′+ ","inline":true,"padRight":true},{"text":"for large enough ","element":"span"},{"style":{"height":20.16},"width":410.32,"height":50.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/35-6.png","element":"img","alt":" t′′+ > t′+ as C0t−0.5µ+","inline":true,"padRight":true},{"text":"converges more slowly than ","element":"span"},{"style":{"height":19.34},"width":286.04,"height":48.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/35-7.png","element":"img","alt":"0.5C20t−µ+ to 
0.","inline":true}],[{"style":{"width":"114%"},"width":1972,"height":1415,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/35-8.png","element":"img"}],[{"style":{"width":"102%"},"width":1770,"height":2239,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/36-0.png","element":"img"}],[{"text":"where the last inequality follows as for large enough ","element":"span"},{"style":{"height":30.91},"width":615.76,"height":77.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/36-1.png","element":"img","alt":" t′′− > t′−, we have exp(−Ch∥xn∥)τt �m−2r=1 log◦r(t) ≤","inline":true},{"style":{"height":15.09},"width":60.92,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/36-2.png","element":"img","alt":"C2.","inline":true}],[{"style":{"width":"94%"},"width":1634,"height":454,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-0.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we dropped the other positive terms, and ","element":"span"},{"text":"(2) ","element":"span"},{"text":"follows for large enough ","element":"span"},{"style":{"height":15.62},"width":148.4,"height":39.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-1.png","element":"img","alt":" t′′′− > t′′−","inline":true,"padRight":true},{"text":"as the ","element":"span"},{"style":{"height":34.98},"width":510.2,"height":87.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-2.png","element":"img","alt":" C0 log�log◦(m−1)(t)�−0.5˜µ−","inline":true,"padRight":true},{"text":"converges to ","element":"span"},{"text":"0 ","element":"span"},{"text":"more slowly than the other negative terms.","element":"span"}],[{"style":{"width":"94%"},"width":1632,"height":288,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-3.png","element":"img"}],[{"text":"Collecting all the terms from the above special 
cases, and substituting back into eq. ","element":"span"},{"href":"#id-104","text":"86, ","element":"a"},{"text":"we note that all terms are either negative, in ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-4.png","element":"img","alt":" L1","inline":true},{"text":", or of the form ","element":"span"},{"style":{"height":17.6},"width":534.44,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-5.png","element":"img","alt":" f (t) ∥r (t)∥, where f (t) ∈ L1","inline":true},{"text":", thus proving the lemma.","element":"span"}],[{"id":"id-87","text":"C.4 Proof of the existence and uniqueness of the solution to eqs. ","element":"span"},{"href":"#id-89","text":"63-","element":"a"},{"href":"#id-91","text":"64","element":"a"}],[{"text":"We wish to prove that ","element":"span"},{"style":{"height":14.8},"width":167.04,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-6.png","element":"img","alt":" ∀m ≥ 1 :","inline":true}],[{"id":"id-116","style":{"width":"72%"},"width":1254,"height":135,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-7.png","element":"img"}],[{"text":"such that","element":"span"}],[{"id":"id-115","style":{"width":"66%"},"width":1147,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-8.png","element":"img"}],[{"text":"we have a unique solution. From eq. ","element":"span"},{"href":"#id-115","text":"102, ","element":"a"},{"text":"we can modify eq. 
","element":"span"},{"href":"#id-116","text":"101 ","element":"a"},{"text":"to","element":"span"}],[{"style":{"width":"51%"},"width":893,"height":134,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-9.png","element":"img"}],[{"text":"To prove this, without loss of generality, and with a slight abuse of notation, we will denote ","element":"span"},{"style":{"height":15.09},"width":106.28,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-10.png","element":"img","alt":" Sm as","inline":true},{"style":{"height":31.6},"width":1031.12,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-11.png","element":"img","alt":"S1, ¯Pm−1xn as xn and βn = exp�− �m−1k=1 ˜w⊤k ¯Pk−1xn�","inline":true},{"text":", so we can write the above equation as","element":"span"}],[{"style":{"width":"32%"},"width":570,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-12.png","element":"img"}],[{"text":"In the following Lemma ","element":"span"},{"href":"#id-117","text":"17 ","element":"a"},{"text":"we prove this equation ","element":"span"},{"style":{"height":24.61},"width":208.28,"height":61.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/37-13.png","element":"img","alt":" ∀β ∈ R|S1|>0 .","inline":true}],[{"id":"id-117","style":{"width":"99%"},"width":1728,"height":203,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-0.png","element":"img"}],[{"text":"and for ","element":"span"},{"style":{"height":19.55},"width":538.48,"height":48.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-1.png","element":"img","alt":" ∀z ∈ Rd such that z⊤XS1 = 0","inline":true,"padRight":true},{"text":"we would have ","element":"span"},{"style":{"height":16.59},"width":178.04,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-2.png","element":"img","alt":" ˜w⊤1 z = 
0.","inline":true}],[{"text":"Proof Let ","element":"span"},{"style":{"height":19.55},"width":970.8,"height":48.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-3.png","element":"img","alt":" K = rank (XS1). Let and U = [u1, . . . , ud] ∈ Rd×d ","inline":true,"padRight":true},{"text":"be a set of orthonormal vectors (i.e., ","element":"span"},{"style":{"height":12.4},"width":346.84,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-4.png","element":"img","alt":" UU⊤ = U⊤U = I","inline":true},{"text":") such that ","element":"span"},{"style":{"height":17.6},"width":372.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-5.png","element":"img","alt":" u1 = ˆw1/ ∥ˆw1∥, and","inline":true}],[{"id":"id-121","style":{"width":"73%"},"width":1269,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-6.png","element":"img"}],[{"text":"while","element":"span"}],[{"id":"id-118","style":{"width":"66%"},"width":1154,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-7.png","element":"img"}],[{"text":"In other words, ","element":"span"},{"style":{"height":10.69},"width":44.84,"height":26.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-8.png","element":"img","alt":" u1","inline":true,"padRight":true},{"text":"is in the direction of ","element":"span"},{"style":{"height":17.6},"width":306.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-9.png","element":"img","alt":" ˆw1, [u1, . . . , uK]","inline":true,"padRight":true},{"text":"are in the space spanned by the columns of ","element":"span"},{"style":{"height":17.62},"width":445.44,"height":44.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-10.png","element":"img","alt":" XS1, and [uK+1, . . . 
, ud]","inline":true,"padRight":true},{"text":"are orthogonal to the columns of ","element":"span"},{"style":{"height":16.42},"width":87.8,"height":41.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-11.png","element":"img","alt":" XS1.","inline":true}],[{"text":"We define ","element":"span"},{"style":{"height":15.09},"width":518.12,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-12.png","element":"img","alt":" vn = U⊤xn and s = U⊤ ˜w1","inline":true},{"text":". Note that ","element":"span"},{"style":{"height":17.49},"width":498.44,"height":43.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-13.png","element":"img","alt":" ∀i > K : vi,n = 0 ∀n ∈ S1","inline":true,"padRight":true},{"text":"from eq. ","element":"span"},{"href":"#id-118","text":"105, ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":15.09},"width":310.48,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-14.png","element":"img","alt":" ∀i > K : si = 0","inline":true},{"text":", since for ","element":"span"},{"style":{"height":19.55},"width":550.96,"height":48.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-15.png","element":"img","alt":" ∀z ∈ Rd such that z⊤XS1 = 0","inline":true,"padRight":true},{"text":"we would have ","element":"span"},{"style":{"height":17.01},"width":317.72,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-16.png","element":"img","alt":" ˜w⊤1 z = 0. 
Lastly,","inline":true,"padRight":true},{"text":"equation ","element":"span"},{"href":"#id-117","text":"103 ","element":"a"},{"text":"becomes","element":"span"}],[{"style":{"width":"69%"},"width":1209,"height":153,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-17.png","element":"img"}],[{"text":"Multiplying by ","element":"span"},{"style":{"height":12.4},"width":54.4,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-18.png","element":"img","alt":" U⊤ ","inline":true,"padRight":true},{"text":"from the left, we obtain","element":"span"}],[{"style":{"width":"53%"},"width":933,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-19.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":17.6},"width":287.92,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-20.png","element":"img","alt":" u1 = ˆw1/ ∥ˆw1∥","inline":true},{"text":", we have that","element":"span"}],[{"id":"id-119","style":{"width":"78%"},"width":1359,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-21.png","element":"img"}],[{"text":"We recall that ","element":"span"},{"style":{"height":25.22},"width":1040.84,"height":63.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-22.png","element":"img","alt":" v1,n = ˆw⊤1 xn/ ∥ˆw1∥ = 1/ ∥ˆw1∥ , ∀n ∈ S1. Given {sj}Kj=2","inline":true},{"text":", we examine eq. 
","element":"span"},{"href":"#id-119","text":"107 ","element":"a"},{"text":"for ","element":"span"},{"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"61%"},"width":1056,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-23.png","element":"img"}],[{"text":"This equation always has the unique solution","element":"span"}],[{"id":"id-120","style":{"width":"99%"},"width":1727,"height":445,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/38-24.png","element":"img"}],[{"text":"multiplying by ","element":"span"},{"style":{"height":17.6},"width":455.92,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-0.png","element":"img","alt":" exp (s1/ ∥ˆw1∥) we obtain","inline":true}],[{"style":{"width":"65%"},"width":1134,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-1.png","element":"img"}],[{"text":"where we defined","element":"span"}],[{"style":{"width":"49%"},"width":854,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-2.png","element":"img"}],[{"text":"Therefore, any critical point of ","element":"span"},{"style":{"height":17.6},"width":265.64,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-3.png","element":"img","alt":" E (s2, . . . , sK)","inline":true,"padRight":true},{"text":"would be a solution of eq. ","element":"span"},{"href":"#id-120","text":"109 ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":14.8},"width":236.6,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-4.png","element":"img","alt":" 2 ≤ i ≤ K,","inline":true,"padRight":true},{"text":"and substituting this solution into eq. 
","element":"span"},{"href":"#id-120","text":"108 ","element":"a"},{"text":"we obtain ","element":"span"},{"style":{"height":17.6},"width":594.92,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-5.png","element":"img","alt":" s1. Since βn > 0, E (s2, . . . , sK)","inline":true,"padRight":true},{"text":"is a convex function, as a positive linear combination of convex functions (exponentials). Therefore, any finite critical point is a global minimum. All that remains is to show that a finite minimum exists and that it is unique.","element":"span"}],[{"text":"From the definition of ","element":"span"},{"style":{"height":25.73},"width":789.96,"height":64.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-6.png","element":"img","alt":" S1, ∃α ∈ R|S1|>0 such that ˆw1 = ∑n∈S1 αnxn","inline":true,"padRight":true},{"text":". Multiplying this equation ","element":"span"},{"text":"by ","element":"span"},{"style":{"height":12.4},"width":54.4,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-7.png","element":"img","alt":" U⊤ ","inline":true,"padRight":true},{"text":"we obtain that ","element":"span"},{"style":{"height":24.61},"width":571.36,"height":61.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-8.png","element":"img","alt":" ∃α ∈ R|S1|>0 such that 2 ≤ i ≤ K","inline":true}],[{"style":{"width":"58%"},"width":1011,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-9.png","element":"img"}],[{"text":"Therefore, ","element":"span"},{"style":{"height":18.29},"width":335.56,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-10.png","element":"img","alt":" ∀ (s2, . . . 
, sK) ̸= 0","inline":true,"padRight":true},{"text":"we have that","element":"span"}],[{"id":"id-122","style":{"width":"64%"},"width":1112,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-11.png","element":"img"}],[{"text":"Recall, from eq. ","element":"span"},{"href":"#id-121","text":"104 ","element":"a"},{"text":"that ","element":"span"},{"style":{"height":24.26},"width":1256.12,"height":60.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-12.png","element":"img","alt":" ∀ (s2, . . . , sK) ̸= 0, ∃n ∈ S1 : ∑Kj=2 sjvj,n ̸= 0, and that αn > 0.","inline":true,"padRight":true},{"text":"Therefore, eq. ","element":"span"},{"href":"#id-122","text":"111 ","element":"a"},{"text":"implies that ","element":"span"},{"style":{"height":24.45},"width":1162.6,"height":61.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-13.png","element":"img","alt":" ∃n ∈ S1 such that ∑Kj=2 sjvj,n > 0 and also ∃m ∈ S1 such that","inline":true},{"style":{"height":24.45},"width":319.16,"height":61.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-14.png","element":"img","alt":"∑Kj=2 sjvj,m < 0.","inline":true}],[{"text":"Thus, in any direction we take a limit in which ","element":"span"},{"style":{"height":17.6},"width":477.76,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-15.png","element":"img","alt":" |si| → ∞ ∀2 ≤ i ≤ K","inline":true},{"text":", we obtain that ","element":"span"},{"style":{"height":17.6},"width":393.92,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-16.png","element":"img","alt":"E (s2, . . . , sK) → ∞","inline":true},{"text":", since at least one exponential in the sum diverges. Since ","element":"span"},{"style":{"height":17.6},"width":321.79,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-17.png","element":"img","alt":" E (s2, . . . 
, sK), is","inline":true,"padRight":true},{"text":"a continuous function, it has a finite global minimum. This proves the existence of a finite solution. To prove uniqueness, we show that the function is strictly convex because its Hessian is (strictly) positive definite, i.e., that the following expression is strictly positive:","element":"span"}],[{"style":{"width":"60%"},"width":1038,"height":481,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-18.png","element":"img"}],[{"text":"The last expression is indeed strictly positive since ","element":"span"},{"style":{"height":24.45},"width":841.88,"height":61.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-19.png","element":"img","alt":" ∀q ̸= 0, ∃n ∈ S1 : ∑Kj=2 qjvj,n ̸= 0, from eq.","inline":true,"padRight":true},{"href":"#id-121","text":"104. ","element":"a"},{"text":"Thus, there exists a unique solution ","element":"span"},{"style":{"height":14.48},"width":66.2,"height":36.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/39-20.png","element":"img","alt":" ˜w1.","inline":true}],[{"id":"id-88","text":"C.5 Proof of the existence and uniqueness of the solution to eqs. 
","element":"span"},{"href":"#id-101","text":"65-","element":"a"},{"href":"#id-90","text":"66","element":"a"}],[{"text":"Lemma 18 For ","element":"span"},{"style":{"height":14.8},"width":225.04,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-0.png","element":"img","alt":" ∀m > k ≥ 1","inline":true},{"text":", the equations","element":"span"}],[{"id":"id-125","style":{"width":"87%"},"width":1521,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-1.png","element":"img"}],[{"text":"under the constraints","element":"span"}],[{"id":"id-124","style":{"width":"67%"},"width":1161,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-2.png","element":"img"}],[{"text":"have a unique solution ","element":"span"},{"style":{"height":16.67},"width":107.48,"height":41.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-3.png","element":"img","alt":" ˇwk,m.","inline":true}],[{"text":"Proof For this proof we denote ","element":"span"},{"style":{"height":16.99},"width":74.56,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-4.png","element":"img","alt":" XSk","inline":true,"padRight":true},{"text":"as the matrix whose columns are ","element":"span"},{"style":{"height":17.6},"width":231.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-5.png","element":"img","alt":" {xn|n ∈ Sk}","inline":true},{"text":", the orthogonal projection matrix ","element":"span"},{"style":{"height":18.82},"width":1217.2,"height":47.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-6.png","element":"img","alt":" Qk = Pk ¯Pk−1 where QkQm = 0 ∀k ̸= m, Qk ¯Pm = 0 ∀k < m, and","inline":true}],[{"id":"id-126","style":{"width":"69%"},"width":1202,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-7.png","element":"img"}],[{"text":"We will write 
","element":"span"},{"style":{"height":20.42},"width":1107.52,"height":51.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-8.png","element":"img","alt":" ˇwk,m = Wk,muk,m , where uk,m ∈ Rdk and Wk,m ∈ Rd×dk ","inline":true,"padRight":true},{"text":"is a full rank matrix such that ","element":"span"},{"style":{"height":17.68},"width":400.72,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-9.png","element":"img","alt":" QkWk,m = Wk,m, so","inline":true}],[{"id":"id-123","style":{"width":"68%"},"width":1185,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-10.png","element":"img"}],[{"text":"and, furthermore,","element":"span"}],[{"id":"id-127","style":{"width":"74%"},"width":1286,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-11.png","element":"img"}],[{"text":"Recall that ","element":"span"},{"style":{"height":19.11},"width":1084.84,"height":47.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-12.png","element":"img","alt":" ∀m : ¯PmPm = 0 and ∀k ≥ 1, ∀n ∈ Sm ¯Pm+kxn = 0","inline":true},{"text":". Therefore, ","element":"span"},{"style":{"height":17.94},"width":201.08,"height":44.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-13.png","element":"img","alt":" ∀v ∈ Rd ,","inline":true},{"style":{"height":20.11},"width":740.4,"height":50.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-14.png","element":"img","alt":"Pk−1Qkv = 0, ¯PkQkv = 0. Thus, ˇwk,m","inline":true,"padRight":true},{"text":"eq. ","element":"span"},{"href":"#id-123","text":"115 ","element":"a"},{"text":"implies the constraints in eq. 
","element":"span"},{"href":"#id-124","text":"113 ","element":"a"},{"text":"hold.","element":"span"}],[{"text":"Next, we prove the existence and uniqueness of the solution ","element":"span"},{"style":{"height":17.68},"width":541.56,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-15.png","element":"img","alt":" ˇwk,m for each k = 1, . . . , m","inline":true,"padRight":true},{"text":"separately. We multiply eq. ","element":"span"},{"href":"#id-125","text":"112 ","element":"a"},{"text":"from the left by the identity matrix, decomposed to orthogonal projection matrices as in eq. ","element":"span"},{"href":"#id-126","text":"114. ","element":"a"},{"text":"Since each matrix projects to an orthogonal subspace, we can solve each product separately.","element":"span"}],[{"text":"The product with ","element":"span"},{"style":{"height":17.5},"width":64.08,"height":43.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-16.png","element":"img","alt":"¯Pm","inline":true,"padRight":true},{"text":"is equal to zero for both sides of the equation. The product with ","element":"span"},{"style":{"height":16},"width":100.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-17.png","element":"img","alt":" Qk is","inline":true,"padRight":true},{"text":"equal to","element":"span"}],[{"style":{"width":"78%"},"width":1365,"height":157,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-18.png","element":"img"}],[{"text":"Substituting eq. 
","element":"span"},{"href":"#id-123","text":"115, ","element":"a"},{"text":"and multiplying by ","element":"span"},{"style":{"height":19.58},"width":110.16,"height":48.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-19.png","element":"img","alt":" W⊤k,m ","inline":true,"padRight":true},{"text":"from the right, we obtain","element":"span"}],[{"id":"id-129","style":{"width":"101%"},"width":1751,"height":200,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/40-20.png","element":"img"}],[{"text":"Denoting ","element":"span"},{"style":{"height":18.82},"width":279.72,"height":47.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-0.png","element":"img","alt":" Ek ∈ R|Sk|×|Sk| ","inline":true,"padRight":true},{"text":"as diagonal matrix for which ","element":"span"},{"style":{"height":21.26},"width":450.08,"height":53.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-1.png","element":"img","alt":" Enn,k = exp�− 12 ˜w⊤xn�","inline":true},{"text":", the matrix in the square bracket in the left hand side can be written as","element":"span"}],[{"id":"id-128","style":{"width":"67%"},"width":1175,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-2.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":20.8},"width":461,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-3.png","element":"img","alt":" rank�AA⊤�= rank (A)","inline":true,"padRight":true},{"text":"for any matrix ","element":"span"},{"text":"A","element":"span"},{"text":", the rank of this matrix is equal to","element":"span"}],[{"style":{"width":"54%"},"width":939,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-4.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we used that 
","element":"span"},{"style":{"height":14.88},"width":51.12,"height":37.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-5.png","element":"img","alt":" Ek","inline":true,"padRight":true},{"text":"is diagonal and non-zero, and in ","element":"span"},{"text":"(2) ","element":"span"},{"text":"we used eq. ","element":"span"},{"href":"#id-127","text":"116. ","element":"a"},{"text":"This implies that the ","element":"span"},{"style":{"height":15.28},"width":130.32,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-6.png","element":"img","alt":" dk ×dk","inline":true,"padRight":true},{"text":"matrix in eq. ","element":"span"},{"href":"#id-128","text":"118 ","element":"a"},{"text":"is full rank, and so eq. ","element":"span"},{"href":"#id-129","text":"117 ","element":"a"},{"text":"has a unique solution ","element":"span"},{"style":{"height":13.28},"width":86.16,"height":33.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-7.png","element":"img","alt":" uk,m","inline":true},{"text":". Therefore, there exists a unique solution ","element":"span"},{"style":{"height":16.67},"width":107,"height":41.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-8.png","element":"img","alt":" ˇwk,m.","inline":true}],[{"id":"id-98","text":"C.6 Proof of Lemma ","element":"span"},{"href":"#id-130","text":"15","element":"a"}],[{"text":"Lemma 15 Let ","element":"span"},{"style":{"height":17.6},"width":297.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-9.png","element":"img","alt":" φ (t) , h (t) , z (t)","inline":true,"padRight":true},{"text":"be three functions from ","element":"span"},{"style":{"height":17.49},"width":440.84,"height":43.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-10.png","element":"img","alt":" N to R≥0, and C1, C2, C3","inline":true,"padRight":true},{"text":"be three positive constants. 
Then, if ","element":"span"},{"style":{"height":19.25},"width":492,"height":48.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-11.png","element":"img","alt":"∑∞t=1 h (t) ≤ C1 < ∞, and","inline":true}],[{"style":{"width":"70%"},"width":1215,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-12.png","element":"img"}],[{"text":"we have","element":"span"}],[{"style":{"width":"66%"},"width":1144,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-13.png","element":"img"}],[{"text":"Proof We define ","element":"span"},{"style":{"height":17.6},"width":359.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-14.png","element":"img","alt":" ψ (t) = z (t) + h (t)","inline":true},{"text":", and start from eq. ","element":"span"},{"href":"#id-131","text":"75","element":"a"}],[{"style":{"width":"72%"},"width":1256,"height":542,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/41-15.png","element":"img"}],[{"text":"we keep iterating eq. ","element":"span"},{"href":"#id-131","text":"75, ","element":"a"},{"text":"until we obtain","element":"span"}],[{"style":{"width":"77%"},"width":1340,"height":894,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/42-0.png","element":"img"}],[{"text":"Therefore, the Lemma holds with ","element":"span"},{"style":{"height":17.6},"width":815,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/42-1.png","element":"img","alt":" C2 = (φ (1) + C) exp (C) and C3 = exp (C).","inline":true}]]},{"heading":"Appendix D. 
Calculation of convergence rates","paragraphs":[[{"text":"In this section we calculate the various rates mentioned in section ","element":"span"},{"text":"3.","element":"span"}],[{"text":"D.1 Proof of Theorem ","element":"span"},{"href":"#id-34","text":"5","element":"a"}],[{"text":"From Theorems ","element":"span"},{"href":"#id-53","text":"4 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-23","text":"13, ","element":"a"},{"text":"we can write ","element":"span"},{"style":{"height":17.6},"width":623.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-0.png","element":"img","alt":" w (t) = ˆw log t+ρ (t), where ρ (t)","inline":true,"padRight":true},{"text":"has a bounded norm for almost all datasets, while in the zero measure case ","element":"span"},{"style":{"height":17.6},"width":83.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-1.png","element":"img","alt":" ρ (t)","inline":true,"padRight":true},{"text":"contains additional ","element":"span"},{"text":"O","element":"span"},{"text":"(log log(","element":"span"},{"text":"t","element":"span"},{"text":")) ","element":"span"},{"text":"components which are orthogonal to the support vectors in ","element":"span"},{"style":{"height":15.09},"width":43.4,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-2.png","element":"img","alt":" S1","inline":true},{"text":", and, asymptotically, have a positive angle with the other support vectors. In this section we first calculate the various convergence rates for the non-degenerate case of Theorem ","element":"span"},{"href":"#id-53","text":"4, ","element":"a"},{"text":"and then write the correction in the zero measure cases, if there is such a correction.","element":"span"}],[{"text":"First, we calculate the normalized weight vector (eq. 
","element":"span"},{"href":"#id-132","text":"8)","element":"a"},{"text":", for almost every dataset:","element":"span"}],[{"id":"id-133","style":{"width":"104%"},"width":1804,"height":945,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-3.png","element":"img"}],[{"text":"where to obtain eq. ","element":"span"},{"href":"#id-133","text":"119 ","element":"a"},{"text":"we used ","element":"span"},{"style":{"height":25.6},"width":584.96,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-4.png","element":"img","alt":"1√1+x = 1 − 12x + 34x2 + O�x3�","inline":true},{"text":", and in the last line we used the fact that ","element":"span"},{"style":{"height":17.6},"width":83.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-5.png","element":"img","alt":" ρ (t)","inline":true,"padRight":true},{"text":"has a bounded norm for almost every dataset. Thus, in this case","element":"span"}],[{"style":{"width":"34%"},"width":603,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-6.png","element":"img"}],[{"text":"For the measure zero cases, we instead have from eq. ","element":"span"},{"href":"#id-134","text":"62, ","element":"a"},{"style":{"height":22.05},"width":623,"height":55.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-7.png","element":"img","alt":" w(t) = �Mm=1 ˆw log◦m(t) + ρ(t),","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":17.6},"width":115.6,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-8.png","element":"img","alt":" ∥ρ(t)∥","inline":true,"padRight":true},{"text":"is bounded (Theorem ","element":"span"},{"href":"#id-19","text":"3)","element":"a"},{"text":". 
Let ","element":"span"},{"style":{"height":22.05},"width":601.64,"height":55.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-9.png","element":"img","alt":" ˜ρ(t) = ∑Mm=2 ˆw log◦m(t) + ρ(t)","inline":true},{"text":", such that ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") = ","element":"span"},{"style":{"height":17.6},"width":748.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-10.png","element":"img","alt":"ˆw log(t) + ˜ρ(t) with ˜ρ(t) = O(log log(t))","inline":true},{"text":". Repeating the same calculations as above, we have for the degenerate cases,","element":"span"}],[{"style":{"width":"37%"},"width":642,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/43-11.png","element":"img"}],[{"text":"Next, we use eq. ","element":"span"},{"href":"#id-133","text":"119 ","element":"a"},{"text":"to calculate the angle (eq. ","element":"span"},{"href":"#id-36","text":"9)","element":"a"}],[{"style":{"width":"109%"},"width":1901,"height":459,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/44-0.png","element":"img"}],[{"text":"for almost every dataset. Thus, in this case","element":"span"}],[{"style":{"width":"28%"},"width":494,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/44-1.png","element":"img"}],[{"text":"Repeating the same calculation for the measure zero case, we have instead","element":"span"}],[{"style":{"width":"36%"},"width":625,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/44-2.png","element":"img"}],[{"text":"Next, we calculate the margin (eq. 
","element":"span"},{"href":"#id-32","text":"10)","element":"a"}],[{"id":"id-135","style":{"width":"80%"},"width":1399,"height":403,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/44-3.png","element":"img"}],[{"text":"for almost every dataset, where in eq. ","element":"span"},{"href":"#id-135","text":"120 ","element":"a"},{"text":"we used eq. ","element":"span"},{"href":"#id-58","text":"20. ","element":"a"},{"text":"Interestingly the measure zero case has a similar convergence rate, since after a sufficient number of iterations, the ","element":"span"},{"text":"O","element":"span"},{"text":"(log log(","element":"span"},{"text":"t","element":"span"},{"text":")) ","element":"span"},{"text":"correction is orthogonal to ","element":"span"},{"style":{"height":17.62},"width":566.12,"height":44.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/44-4.png","element":"img","alt":" xk, where k = argminnx⊤n w(t)","inline":true},{"text":". Thus, for all datasets,","element":"span"}],[{"style":{"width":"67%"},"width":1172,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/44-5.png","element":"img"}],[{"text":"Calculation of the training loss (eq. ","element":"span"},{"href":"#id-33","text":"11)","element":"a"},{"text":":","element":"span"}],[{"style":{"width":"90%"},"width":1568,"height":568,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/44-6.png","element":"img"}],[{"text":"Thus, for all datasets ","element":"span"},{"style":{"height":19.14},"width":357.32,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-0.png","element":"img","alt":" L (w (t)) = O(t−1)","inline":true},{"text":". 
Note that the zero measure case has the same behavior, since after a sufficient number of iterations, the ","element":"span"},{"text":"O","element":"span"},{"text":"(log log(","element":"span"},{"text":"t","element":"span"},{"text":")) ","element":"span"},{"text":"correction has a non-negative angle with all the support vectors.","element":"span"}],[{"text":"Next, we give an example demonstrating that the bounds above, for the non-degenerate case, are strict. Consider optimization with an exponential loss ","element":"span"},{"style":{"height":17.6},"width":218.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-1.png","element":"img","alt":" ℓ (u) = e−u","inline":true},{"text":", and a single data point ","element":"span"},{"text":"x ","element":"span"},{"text":"= (1","element":"span"},{"text":", ","element":"span"},{"text":"0)","element":"span"},{"text":". In this case ","element":"span"},{"style":{"height":17.6},"width":453.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-2.png","element":"img","alt":" ˆw = (1, 0) and ∥ ˆw∥ = 1","inline":true},{"text":". 
We take the limit ","element":"span"},{"style":{"height":15.6},"width":118.96,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-3.png","element":"img","alt":" η → 0","inline":true},{"text":", and obtain the continuous time version of GD:","element":"span"}],[{"style":{"width":"37%"},"width":647,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-4.png","element":"img"}],[{"text":"We can analytically integrate these equations to obtain","element":"span"}],[{"style":{"width":"52%"},"width":903,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-5.png","element":"img"}],[{"text":"Using this example with ","element":"span"},{"style":{"height":17.6},"width":200.08,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-6.png","element":"img","alt":" w2 (0) > 0","inline":true},{"text":", it is easy to see that the above upper bounds are strict in the non-degenerate case. ","element":"span"},{"style":{"height":0},"width":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-7.png","element":"img","alt":" ■","inline":true}],[{"text":"D.2 Validation error lower bound","element":"span"}],[{"text":"Lastly, recall that ","element":"span"},{"text":"V ","element":"span"},{"text":"is a set of indices for validation set samples. 
We calculate the validation loss for logistic loss, in the case where the ","element":"span"},{"style":{"height":14.69},"width":46.76,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-8.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"max margin vector has some classification errors on the validation set, i.e., ","element":"span"},{"style":{"height":15.47},"width":382.68,"height":38.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-9.png","element":"img","alt":" ∃k ∈ V : ˆw⊤xk < 0:","inline":true}],[{"style":{"width":"90%"},"width":1558,"height":558,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-10.png","element":"img"}],[{"text":"Thus, for all datasets ","element":"span"},{"style":{"height":17.6},"width":450.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-11.png","element":"img","alt":" Lval (w (t)) = Ω(log(t)).","inline":true}]]},{"heading":"Appendix E. Softmax output with cross-entropy loss","paragraphs":[[{"text":"We examine multiclass classification. In this case, the labels are the class index ","element":"span"},{"style":{"height":17.6},"width":372.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-12.png","element":"img","alt":" yn ∈ {1, . . . 
, K} and","inline":true,"padRight":true},{"text":"we have a weight matrix ","element":"span"},{"style":{"height":19.14},"width":825.08,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-13.png","element":"img","alt":" W ∈ RK×d with wk being the k-th row of W.","inline":true}],[{"text":"Furthermore, we define ","element":"span"},{"style":{"height":20.8},"width":296.96,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-14.png","element":"img","alt":" w = vec(W⊤)","inline":true},{"text":", a basis vector ","element":"span"},{"style":{"height":20.32},"width":632.08,"height":50.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-15.png","element":"img","alt":" ek ∈ RK so that (ek)i = δki, and","inline":true,"padRight":true},{"text":"the matrix ","element":"span"},{"style":{"height":18.02},"width":839.92,"height":45.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-16.png","element":"img","alt":" Ak ∈ RdK×d so that Ak = ek ⊗ Id, where ⊗","inline":true,"padRight":true},{"text":"is the Kronecker product and ","element":"span"},{"style":{"height":15.28},"width":148.64,"height":38.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-17.png","element":"img","alt":" Id is the","inline":true,"padRight":true},{"text":"d","element":"span"},{"text":"-dimensional identity matrix. 
Note that ","element":"span"},{"style":{"height":17.58},"width":229.4,"height":43.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-18.png","element":"img","alt":" A⊤k w = wk.","inline":true}],[{"text":"Consider the cross entropy loss with softmax output","element":"span"}],[{"style":{"width":"44%"},"width":771,"height":133,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/45-19.png","element":"img"}],[{"text":"Using our notation, this loss can be re-written as","element":"span"}],[{"id":"id-136","style":{"width":"76%"},"width":1323,"height":285,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-0.png","element":"img"}],[{"text":"Therefore","element":"span"}],[{"style":{"width":"69%"},"width":1207,"height":283,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-1.png","element":"img"}],[{"text":"If, again, we make the assumption that the data is linearly separable, i.e., in our notation","element":"span"}],[{"id":"id-137","style":{"width":"78%"},"width":1357,"height":277,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-2.png","element":"img"}],[{"text":"is strictly negative for any finite ","element":"span"},{"text":"w","element":"span"},{"text":". However, from Lemma ","element":"span"},{"href":"#id-11","text":"10, ","element":"a"},{"text":"in gradient descent with an appropriately small learning rate, we have that ","element":"span"},{"style":{"height":17.6},"width":294.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-3.png","element":"img","alt":" ∇L (w (t)) → 0","inline":true},{"text":". 
This implies that: ","element":"span"},{"style":{"height":17.6},"width":435.76,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-4.png","element":"img","alt":" ∥w (t)∥ → ∞, and ∀k ≠","inline":true},{"style":{"height":17.6},"width":676.64,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-5.png","element":"img","alt":"yn, ∃r : w (t)⊤ (Ar − Ak) xn → ∞","inline":true},{"text":", which implies ","element":"span"},{"style":{"height":18.29},"width":762.08,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-6.png","element":"img","alt":" ∀k ≠ yn, maxk w (t)⊤ (Ak − Ayn) xn →","inline":true},{"style":{"height":8},"width":78.08,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-7.png","element":"img","alt":"−∞","inline":true},{"text":". Examining the loss (eq. ","element":"span"},{"href":"#id-136","text":"122) ","element":"a"},{"text":"we find that ","element":"span"},{"style":{"height":17.6},"width":263.08,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-8.png","element":"img","alt":" L (w (t)) → 0","inline":true,"padRight":true},{"text":"in this case. Thus, we arrive at a Lemma equivalent to Lemma ","element":"span"},{"href":"#id-16","text":"1, ","element":"a"},{"text":"for this case:","element":"span"}],[{"id":"id-138","text":"Lemma 19 ","element":"span"},{"text":"Let ","element":"span"},{"text":"w ","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"be the iterates of gradient descent (eq. 
","element":"span"},{"href":"#id-10","text":"2) ","element":"a"},{"text":"with an appropriately small learning rate, for cross-entropy loss operating on a softmax output, under the assumption of strict linear separability (Assumption ","element":"span"},{"href":"#id-137","text":"4)","element":"a"},{"text":", then: (1) ","element":"span"},{"style":{"height":17.6},"width":1045.08,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-9.png","element":"img","alt":" limt→∞ L (w (t)) = 0, (2) limt→∞ ∥w (t)∥ = ∞, and (3)","inline":true},{"style":{"height":18.29},"width":911,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-10.png","element":"img","alt":"∀n, k ̸= yn : limt→∞ w (t)⊤ (Ayn − Ak) xn = ∞.","inline":true}],[{"text":"Using Lemma ","element":"span"},{"href":"#id-11","text":"10 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-138","text":"19, ","element":"a"},{"text":"we prove the following Theorem (equivalent to Theorem ","element":"span"},{"href":"#id-22","text":"9) ","element":"a"},{"text":"in the next section:","element":"span"}],[{"text":"Theorem 7 For all multiclass datasets which are linearly separable (i.e. the constraints in eq. 
","element":"span"},{"href":"#id-37","text":"14 ","element":"a"},{"text":"below are feasible) and for which the equation","element":"span"}],[{"style":{"width":"78%"},"width":1365,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-11.png","element":"img"}],[{"text":"has a solution ","element":"span"},{"style":{"height":20.32},"width":162.44,"height":50.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-12.png","element":"img","alt":" { ˜wk}Kk=1","inline":true},{"text":", the following holds: for any starting point ","element":"span"},{"text":"w","element":"span"},{"text":"(0) ","element":"span"},{"text":"and any small enough ","element":"span"},{"text":"stepsize, the iterates of gradient descent on eq. ","element":"span"},{"href":"#id-38","text":"13 ","element":"a"},{"text":"will behave as:","element":"span"}],[{"style":{"width":"64%"},"width":1113,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-13.png","element":"img"}],[{"text":"where the residual ","element":"span"},{"style":{"height":17.6},"width":96.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/46-14.png","element":"img","alt":" ρk(t)","inline":true,"padRight":true},{"text":"is bounded.","element":"span"}],[{"text":"E.1 Notations and Definitions","element":"span"}],[{"text":"To prove Theorem ","element":"span"},{"href":"#id-139","text":"7 ","element":"a"},{"text":"we require additional notation. we define ","element":"span"},{"style":{"height":18.48},"width":411.24,"height":46.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-0.png","element":"img","alt":" ˜xn,k ≜ (Ayn − Ak)xn","inline":true},{"text":". Using this notation, we can re-write eq. 
","element":"span"},{"href":"#id-37","text":"14 ","element":"a"},{"text":"(K-class SVM) as","element":"span"}],[{"style":{"width":"72%"},"width":1259,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-1.png","element":"img"}],[{"text":"From the KKT optimality conditions, we have for some ","element":"span"},{"style":{"height":16.88},"width":169.4,"height":42.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-2.png","element":"img","alt":" αn,k ≥ 0,","inline":true}],[{"id":"id-151","style":{"width":"65%"},"width":1131,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-3.png","element":"img"}],[{"text":"In addition, for each of the K classes, we define ","element":"span"},{"style":{"height":18.29},"width":582.6,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-4.png","element":"img","alt":" Sk = arg minn( ˆwyn − ˆwk)⊤xn","inline":true,"padRight":true},{"text":"(the k’th class","element":"span"}],[{"text":"support vectors).","element":"span"}],[{"text":"Using this definition, we define ","element":"span"},{"style":{"height":20.93},"width":304.2,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-5.png","element":"img","alt":" XSk ∈ RdK×|Sk| ","inline":true,"padRight":true},{"text":"as the matrix which columns are ","element":"span"},{"style":{"height":17.68},"width":273.08,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-6.png","element":"img","alt":" ˜xn,k, ∀n ∈ Sk.","inline":true,"padRight":true},{"text":"We also define ","element":"span"},{"style":{"height":29.86},"width":137.8,"height":74.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-7.png","element":"img","alt":" S ≜ K�","inline":true}],[{"text":"We recall that we defined 
","element":"span"},{"style":{"height":18.02},"width":394.8,"height":45.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-8.png","element":"img","alt":" W ∈ RK×d with wk","inline":true,"padRight":true},{"text":"being the k-th row of ","element":"span"},{"style":{"height":17.6},"width":445.88,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-9.png","element":"img","alt":" W and w = vec(W⊤).","inline":true}],[{"text":"Similarly, we define:","element":"span"}],[{"style":{"width":"52%"},"width":903,"height":211,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-10.png","element":"img"}],[{"text":"Using our notations, eq. ","element":"span"},{"href":"#id-140","text":"16 ","element":"a"},{"text":"can be re-written as ","element":"span"},{"style":{"height":17.6},"width":568.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-11.png","element":"img","alt":" w = ˆw log(t) + ρ(t) when ρ(t)","inline":true,"padRight":true},{"text":"is bounded. 
For any solution ","element":"span"},{"text":"w","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":")","element":"span"},{"text":", we define","element":"span"}],[{"id":"id-142","style":{"width":"64%"},"width":1110,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-12.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":12.59},"width":37,"height":31.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-13.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"text":"is the concatenation of ","element":"span"},{"style":{"height":15.79},"width":184.08,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-14.png","element":"img","alt":" ˆw1, ..., ˆwk","inline":true,"padRight":true},{"text":"which form the K-class SVM solution, so","element":"span"}],[{"id":"id-150","style":{"width":"78%"},"width":1359,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-15.png","element":"img"}],[{"text":"and ","element":"span"},{"style":{"height":11.79},"width":37,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-16.png","element":"img","alt":" ˜w","inline":true,"padRight":true},{"text":"satisfies the equation:","element":"span"}],[{"id":"id-152","style":{"width":"74%"},"width":1290,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-17.png","element":"img"}],[{"text":"As we assume in the Theorem, this equation has a solution.","element":"span"}],[{"text":"For each of the K classes, we define ","element":"span"},{"style":{"height":19.74},"width":206.64,"height":49.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-18.png","element":"img","alt":" Pk1 ∈ Rd×d","inline":true,"padRight":true},{"text":"as the orthogonal projection matrix to the subspace ","element":"span"},{"text":"spanned by the support vectors 
of the k’th class, and ","element":"span"},{"style":{"height":19.94},"width":235.92,"height":49.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-19.png","element":"img","alt":"¯Pk1 = I − Pk1 ","inline":true,"padRight":true},{"text":"as the complementary projection. ","element":"span"},{"text":"Finally, we define ","element":"span"},{"style":{"height":17.82},"width":620.88,"height":44.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-20.png","element":"img","alt":" P1 ∈ RKd×Kd and ¯P1 ∈ RKd×Kd ","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"59%"},"width":1025,"height":151,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-21.png","element":"img"}],[{"text":"In the following section we will also use ","element":"span"},{"style":{"height":18.45},"width":84.2,"height":46.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/47-22.png","element":"img","alt":" 1{A}","inline":true},{"text":", the indicator function, which is ","element":"span"},{"text":"1 ","element":"span"},{"text":"if ","element":"span"},{"text":"A ","element":"span"},{"text":"is satisfied and 0 otherwise.","element":"span"}],[{"id":"id-147","text":"E.2 Auxiliary Lemma","element":"span"}],[{"text":"Lemma 20 We have","element":"span"}],[{"id":"id-146","style":{"width":"81%"},"width":1408,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-0.png","element":"img"}],[{"text":"Additionally, ","element":"span"},{"style":{"height":15.6},"width":289.16,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-1.png","element":"img","alt":" ∀ε1 > 0, ∃C2, t2","inline":true},{"text":", such that ","element":"span"},{"style":{"height":15.09},"width":130.76,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-2.png","element":"img","alt":" ∀t > t2","inline":true},{"text":", 
if","element":"span"}],[{"id":"id-145","style":{"width":"57%"},"width":993,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-3.png","element":"img"}],[{"text":"then we can improve this bound to","element":"span"}],[{"style":{"width":"69%"},"width":1206,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-4.png","element":"img"}],[{"text":"We prove the Lemma below, in appendix section ","element":"span"},{"href":"#id-141","text":"E.4","element":"a"}],[{"text":"E.3 Proof of Theorem ","element":"span"},{"href":"#id-139","text":"7","element":"a"}],[{"text":"Our goal is to show that ","element":"span"},{"text":"||","element":"span"},{"text":"r","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":")","element":"span"},{"text":"|| ","element":"span"},{"text":"is bounded, and therefore ","element":"span"},{"style":{"height":17.6},"width":295.24,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-5.png","element":"img","alt":" ρ(t) = r(t) + ˜w","inline":true,"padRight":true},{"text":"is bounded. To show this, we will upper bound the following equation","element":"span"}],[{"style":{"width":"88%"},"width":1522,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-6.png","element":"img"}],[{"text":"First, we note that the first term in this equation can be upper bounded by","element":"span"}],[{"id":"id-144","style":{"width":"89%"},"width":1553,"height":406,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-7.png","element":"img"}],[{"text":"where in (1) we used eq. 
","element":"span"},{"href":"#id-142","text":"125, ","element":"a"},{"text":"in (2) we used eq 2.2, and in (3) we used ","element":"span"},{"style":{"height":17.6},"width":515.96,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-8.png","element":"img","alt":" ∀x > 0 : x ≥ log(1+x) > 0,","inline":true,"padRight":true},{"text":"and also that","element":"span"}],[{"id":"id-143","style":{"width":"100%"},"width":1729,"height":469,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-9.png","element":"img"}],[{"text":"Substituting eq. ","element":"span"},{"href":"#id-143","text":"134 ","element":"a"},{"text":"into eq. ","element":"span"},{"href":"#id-144","text":"132, ","element":"a"},{"text":"and recalling that a ","element":"span"},{"style":{"height":12.34},"width":60.24,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-10.png","element":"img","alt":" t−ν ","inline":true,"padRight":true},{"text":"power series converges for any ","element":"span"},{"style":{"height":14.4},"width":177.44,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-11.png","element":"img","alt":" ν > 1, we","inline":true,"padRight":true},{"text":"can find ","element":"span"},{"style":{"height":15.09},"width":217.96,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-12.png","element":"img","alt":" C0 such that","inline":true}],[{"id":"id-149","style":{"width":"85%"},"width":1476,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/48-13.png","element":"img"}],[{"text":"Note that this equation also implies that ","element":"span"},{"style":{"height":15.09},"width":59.24,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-0.png","element":"img","alt":" 
∀ǫ0","inline":true}],[{"style":{"width":"72%"},"width":1249,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-1.png","element":"img"}],[{"text":"Next, we would like to bound the second term in eq. ","element":"span"},{"href":"#id-145","text":"131. ","element":"a"},{"text":"From eq. ","element":"span"},{"href":"#id-146","text":"128 ","element":"a"},{"text":"in Lemma ","element":"span"},{"href":"#id-147","text":"20, ","element":"a"},{"text":"we can find ","element":"span"},{"style":{"height":15.6},"width":427.2,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-2.png","element":"img","alt":"t1, C1 such that ∀t > t1:","inline":true}],[{"id":"id-148","style":{"width":"70%"},"width":1228,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-3.png","element":"img"}],[{"text":"Thus, by combining eqs. ","element":"span"},{"href":"#id-148","text":"137 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-149","text":"135 ","element":"a"},{"text":"into eq. ","element":"span"},{"href":"#id-145","text":"131, ","element":"a"},{"text":"we find:","element":"span"}],[{"style":{"width":"34%"},"width":593,"height":362,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-4.png","element":"img"}],[{"text":"which is bounded, since ","element":"span"},{"style":{"height":13.2},"width":101.68,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-5.png","element":"img","alt":" θ > 1","inline":true,"padRight":true},{"text":"(eq. ","element":"span"},{"href":"#id-150","text":"126)","element":"a"},{"text":". 
Therefore, ","element":"span"},{"text":"||","element":"span"},{"text":"r","element":"span"},{"text":"(","element":"span"},{"text":"t","element":"span"},{"text":")","element":"span"},{"text":"|| ","element":"span"},{"text":"is bounded.","element":"span"}],[{"id":"id-141","text":"E.4 Proof of Lemma ","element":"span"},{"href":"#id-147","text":"20","element":"a"}],[{"text":"Lemma 20 We have","element":"span"}],[{"style":{"width":"81%"},"width":1408,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-6.png","element":"img"}],[{"text":"Additionally, ","element":"span"},{"style":{"height":15.6},"width":289.16,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-7.png","element":"img","alt":" ∀ǫ1 > 0, ∃C2, t2","inline":true},{"text":", such that ","element":"span"},{"style":{"height":15.09},"width":130.76,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-8.png","element":"img","alt":" ∀t > t2","inline":true},{"text":", such that if","element":"span"}],[{"style":{"width":"57%"},"width":993,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-9.png","element":"img"}],[{"text":"then we can improve this bound to","element":"span"}],[{"style":{"width":"69%"},"width":1206,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-10.png","element":"img"}],[{"text":"We wish to bound ","element":"span"},{"style":{"height":17.6},"width":401.96,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-11.png","element":"img","alt":" (r(t + 1) − r(t))⊤r(t)","inline":true},{"text":". 
First, we recall we defined ","element":"span"},{"style":{"height":18.48},"width":418.04,"height":46.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-12.png","element":"img","alt":" ˜xn,k ≜ (Ayn − Ak)xn.","inline":true}],[{"id":"id-157","style":{"width":"96%"},"width":1666,"height":443,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-13.png","element":"img"}],[{"text":"where in the last line we used eqs. ","element":"span"},{"href":"#id-151","text":"124 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-152","text":"127 ","element":"a"},{"text":"to obtain","element":"span"}],[{"style":{"width":"78%"},"width":1350,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-14.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":18.45},"width":84.2,"height":46.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/49-15.png","element":"img","alt":" 1{A}","inline":true,"padRight":true},{"text":"is the indicator function which is ","element":"span"},{"text":"1 ","element":"span"},{"text":"if ","element":"span"},{"text":"A ","element":"span"},{"text":"is satisfied and 0 otherwise.","element":"span"}],[{"text":"The first term can be upper bounded by","element":"span"}],[{"id":"id-155","style":{"width":"72%"},"width":1246,"height":413,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/50-0.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(1) ","element":"span"},{"text":"we used that ","element":"span"},{"style":{"height":17.6},"width":366.44,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/50-1.png","element":"img","alt":" P2 ˆw = 0, and in (2)","inline":true,"padRight":true},{"text":"we used that ","element":"span"},{"style":{"height":17.6},"width":390.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/50-2.png","element":"img","alt":" ˆw⊤r 
(t) = o (t), since","inline":true}],[{"style":{"width":"100%"},"width":1729,"height":1022,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/50-3.png","element":"img"}],[{"text":"We examine each term ","element":"span"},{"text":"n ","element":"span"},{"text":"in the sum:","element":"span"}],[{"id":"id-153","style":{"width":"99%"},"width":1729,"height":1720,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/51-0.png","element":"img"}],[{"text":"Recalling that ","element":"span"},{"style":{"height":17.6},"width":509,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/52-0.png","element":"img","alt":" w(t) = ˆw log(t) + ˜w + r(t)","inline":true},{"text":", eq. ","element":"span"},{"href":"#id-153","text":"142 ","element":"a"},{"text":"can be upper bounded by","element":"span"}],[{"style":{"width":"99%"},"width":1722,"height":822,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/52-1.png","element":"img"}],[{"text":"where in (1) we used ","element":"span"},{"href":"#id-150","style":{"height":32.4},"width":1355.64,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/52-2.png","element":"img","alt":" xe−x < 1, ∀x : (e−x − 1)x < 0, θ = mink�minn/∈Sk ˜x⊤n,k ˆw�> 1 (eq. 
126)","inline":true,"padRight":true},{"text":"and denoted:","element":"span"}],[{"style":{"width":"92%"},"width":1600,"height":346,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/52-3.png","element":"img"}],[{"text":"We use the fact that ","element":"span"},{"style":{"height":17.6},"width":378.64,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/52-4.png","element":"img","alt":" ∀x : (e−x − 1)x < 0","inline":true,"padRight":true},{"text":"and therefore ","element":"span"},{"style":{"height":17.6},"width":139.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/52-5.png","element":"img","alt":" ∀(n, k):","inline":true}],[{"style":{"width":"87%"},"width":1514,"height":175,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/52-6.png","element":"img"}],[{"text":"to show that ","element":"span"},{"style":{"height":17.6},"width":75.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/52-7.png","element":"img","alt":" φ(t)","inline":true,"padRight":true},{"text":"is strictly negative. If ","element":"span"},{"style":{"height":19.57},"width":193.36,"height":48.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/52-8.png","element":"img","alt":" ˜x⊤n,k1r ≥ 0","inline":true,"padRight":true},{"text":"then from the last two equations:","element":"span"}],[{"style":{"width":"99%"},"width":1728,"height":610,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/52-9.png","element":"img"}],[{"text":"and therefore","element":"span"}],[{"style":{"width":"100%"},"width":1733,"height":414,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-0.png","element":"img"}],[{"text":"The first term is negative and the second is positive. 
From Lemma ","element":"span"},{"href":"#id-138","text":"19 ","element":"a"},{"style":{"height":18.29},"width":317.12,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-1.png","element":"img","alt":" w(t)⊤˜xn,r1 → ∞","inline":true},{"text":". Therefore ","element":"span"},{"style":{"height":19.82},"width":831.56,"height":49.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-2.png","element":"img","alt":"∃t3 so that ∀t > t3 : exp(−w(t)⊤˜xn,r1) < K2 ","inline":true,"padRight":true},{"text":"and therefore this sum is strictly negative since","element":"span"}],[{"style":{"width":"97%"},"width":1686,"height":281,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-3.png","element":"img"}],[{"text":"2. If ","element":"span"},{"style":{"height":17.01},"width":138.36,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-4.png","element":"img","alt":" n ∈ Sk1","inline":true,"padRight":true},{"text":"then we examine the sum","element":"span"}],[{"id":"id-154","style":{"width":"99%"},"width":1728,"height":572,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-5.png","element":"img"}],[{"text":"where in the last transition we used Lemma ","element":"span"},{"href":"#id-138","text":"19. ","element":"a"},{"text":"b. If ","element":"span"},{"style":{"height":20.98},"width":293.48,"height":52.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-6.png","element":"img","alt":" |˜x⊤n,k1r(t)| ≤ C0","inline":true,"padRight":true},{"text":"then we can find constant ","element":"span"},{"style":{"height":15.09},"width":48.2,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-7.png","element":"img","alt":" C5","inline":true,"padRight":true},{"text":"so that eq. 
","element":"span"},{"href":"#id-154","text":"146 ","element":"a"},{"text":"can be upper bounded by","element":"span"}],[{"style":{"width":"87%"},"width":1514,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-8.png","element":"img"}],[{"text":"since ","element":"span"},{"style":{"height":20.78},"width":556.04,"height":51.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-9.png","element":"img","alt":" −˜x⊤n,r1r(t) ≤ −˜x⊤n,k1r(t) ≤ C0","inline":true,"padRight":true},{"text":"and by definition, ","element":"span"},{"style":{"height":18.48},"width":395.96,"height":46.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-10.png","element":"img","alt":" ∀(n, k) : ˆw⊤˜xn,k ≥ 1.","inline":true,"padRight":true},{"text":"Therefore, eq. ","element":"span"},{"href":"#id-153","text":"142 ","element":"a"},{"text":"can be upper bounded by","element":"span"}],[{"style":{"width":"69%"},"width":1200,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-11.png","element":"img"}],[{"text":"If, in addition, ","element":"span"},{"style":{"height":20.78},"width":593.68,"height":51.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-12.png","element":"img","alt":" ∃k, n ∈ Sk : |˜x⊤n,kr(t)| > ǫ2 then","inline":true}],[{"style":{"width":"88%"},"width":1529,"height":227,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/53-13.png","element":"img"}],[{"text":"and we can improve this bound to","element":"span"}],[{"id":"id-156","style":{"width":"100%"},"width":1729,"height":445,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/54-0.png","element":"img"}],[{"text":"where in ","element":"span"},{"style":{"height":18.48},"width":857.48,"height":46.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/54-1.png","element":"img","alt":" (1) we used P⊤1 ˜xn,k = ˜xn,k ∀k, n ∈ Sk, in 
(2)","inline":true,"padRight":true},{"text":"we denoted by ","element":"span"},{"style":{"height":17.6},"width":187.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/54-2.png","element":"img","alt":" σmin (XS)","inline":true},{"text":", the minimal ","element":"span"},{"text":"non-zero singular value of ","element":"span"},{"style":{"height":14.69},"width":59.92,"height":36.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/54-3.png","element":"img","alt":" XS","inline":true,"padRight":true},{"text":"and used eq. ","element":"span"},{"href":"#id-145","text":"129. ","element":"a"},{"text":"Therefore, for some ","element":"span"},{"style":{"height":32.53},"width":414.16,"height":81.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/54-4.png","element":"img","alt":" (n, k),��˜x⊤n,kr�� ≥ ǫ2 ≜","inline":true},{"style":{"height":21.45},"width":677,"height":53.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/54-5.png","element":"img","alt":"|S|−1 σ2min (XS) ǫ21. If ||P1r(t)|| ≥ ǫ1","inline":true},{"text":", then combining eq. ","element":"span"},{"href":"#id-155","text":"140 ","element":"a"},{"text":"with eq. ","element":"span"},{"href":"#id-156","text":"151 ","element":"a"},{"text":"we find that eq. ","element":"span"},{"href":"#id-157","text":"138 ","element":"a"},{"text":"can be upper bounded by:","element":"span"}],[{"style":{"width":"99%"},"width":1727,"height":365,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/54-6.png","element":"img"}],[{"text":"Therefore, ","element":"span"},{"style":{"height":15.09},"width":271.88,"height":37.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/54-7.png","element":"img","alt":" ∃t1 > 0 and C1","inline":true,"padRight":true},{"text":"such that eq. ","element":"span"},{"href":"#id-146","text":"128 ","element":"a"},{"text":"holds.","element":"span"}]]},{"heading":"Appendix F. 
An experiment with stochastic gradient descent","paragraphs":[[{"style":{"width":"98%"},"width":1704,"height":544,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1710.10345/images/54-8.png","element":"img"}],[{"id":"id-46","text":"Figure 4: Same as Fig. ","element":"figcaption","subtype":"caption"},{"href":"#id-35","text":"1, ","element":"a","subtype":"caption"},{"text":"except stochastic gradient descent is used (with mini-batches of size 4), instead of GD.","element":"figcaption","subtype":"caption"}]]},{"heading":"References","paragraphs":[[{"id":"id-47","text":"Mor Shpigel Nacson, Nati Srebro, and Daniel Soudry. Stochastic Gradient Descent on Separable ","element":"span"},{"text":"Data: Exact Convergence with a Fixed Learning Rate. AISTATS, 2019.","element":"span"}],[{"id":"id-45","text":"Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit Bias of Gradient Descent ","element":"span"},{"text":"on Linear Convolutional Networks. NIPS, 2018.","element":"span"}],[{"id":"id-48","text":"John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and ","element":"span"},{"text":"stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.","element":"span"}],[{"id":"id-13","text":"Radha Krishna Ganti. EE6151, Convex optimization algorithms. Unconstrained minimization: Gra- ","element":"span"},{"text":"dient descent algorithm, 2015. URL","element":"span"}],[{"id":"id-52","text":"Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Sre- ","element":"span"},{"text":"bro. Implicit Regularization in Matrix Factorization. NIPS, pages 1–10, 2017.","element":"span"}],[{"id":"id-31","text":"Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in ","element":"span"},{"text":"terms of optimization geometry. ICML, 2018.","element":"span"}],[{"id":"id-6","text":"Moritz Hardt, Benjamin Recht, and Y Singer. 
Train faster, generalize better: Stability of stochastic ","element":"span"},{"text":"gradient descent. ICML, pages 1–24, 2016.","element":"span"}],[{"id":"id-50","text":"Elad Hoffer, Itay Hubara, and D. Soudry. Train longer, generalize better: closing the generalization ","element":"span"},{"text":"gap in large batch training of neural networks. In NIPS, pages 1–13, may 2017.","element":"span"}],[{"id":"id-43","text":"I Hubara, M Courbariaux, D. Soudry, R El-yaniv, and Y Bengio. Quantized Neural Networks: ","element":"span"},{"text":"Training Neural Networks with Low Precision Weights and Activations. JMLR, 2018.","element":"span"}],[{"id":"id-24","text":"Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. Communi- ","element":"span"},{"text":"cated by the authors, 2018.","element":"span"}],[{"id":"id-3","text":"Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Pe- ","element":"span"},{"text":"ter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, pages 1–16, 2017.","element":"span"}],[{"id":"id-7","text":"Diederik P Kingma and Jimmy Lei Ba. Adam: a Method for Stochastic Optimization. In ","element":"span"},{"text":"ICLR, pages 1–13, 2015.","element":"span"}],[{"id":"id-51","text":"Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Conver- ","element":"span"},{"text":"gence of Gradient Descent on Separable Data. AISTATS, pages 1–45, 2019.","element":"span"}],[{"id":"id-0","text":"Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On ","element":"span"},{"text":"the role of implicit regularization in deep learning. arXiv:1412.6614, 2014.","element":"span"}],[{"id":"id-1","text":"Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimiza- ","element":"span"},{"text":"tion in deep neural networks. 
In NIPS, 2015.","element":"span"}],[{"id":"id-4","text":"Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring Gen- ","element":"span"},{"text":"eralization in Deep Learning. arXiv, jun 2017.","element":"span"}],[{"id":"id-40","text":"Hrithik Ravi, Clayton Scott, Daniel Soudry, and Yutong Wang. The implicit bias of gradient descent ","element":"span"},{"text":"on separable multiclass data. In NeurIPS, 2024.","element":"span"}],[{"id":"id-29","text":"Saharon Rosset, Ji Zhu, and Trevor J Hastie. Margin Maximizing Loss Functions. In ","element":"span"},{"text":"NIPS, pages 1237–1244, 2004.","element":"span"}],[{"id":"id-26","text":"Robert E Schapire, Yoav Freund, Peter Bartlett, Wee Sun Lee, et al. Boosting the margin: A new ","element":"span"},{"text":"explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.","element":"span"}],[{"id":"id-21","text":"Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, and N Srebro. The Implicit Bias of Gradient ","element":"span"},{"text":"Descent on Separable Data. In ICLR, 2018.","element":"span"}],[{"id":"id-28","text":"Matus Telgarsky. Margins, shrinkage and boosting. In ","element":"span"},{"text":"Proceedings of the 30th International Conference on Machine Learning, Volume 28, pages II–307. JMLR.org, 2013.","element":"span"}],[{"id":"id-5","text":"Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. ","element":"span"},{"text":"The Marginal Value of Adaptive Gradient Methods in Machine Learning. arXiv, pages 1–14, 2017.","element":"span"}],[{"id":"id-2","text":"Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding ","element":"span"},{"text":"deep learning requires rethinking generalization. In ICLR, 2017.","element":"span"}],[{"id":"id-27","text":"Tong Zhang, Bin Yu, et al. Boosting with early stopping: Convergence and consistency. 
","element":"span"},{"text":"The Annals of Statistics, 33(4):1538–1579, 2005.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]