35:[["$","audio",null,{"id":"tts"}],["$","$L3a",null,{"paperID":"1810.03037","publisher":"arxiv","paperJSON":{"title":"Why do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem","paperID":"1810.03037","avgLineHeight":11.95,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Empirical evidence suggests that neural networks with ReLU activations generalize better with over-parameterization. However, there is currently no theoretical analysis that explains this observation. In this work, we provide theoretical and empirical evidence that, in certain cases, overparameterized convolutional networks generalize better than small networks because of an interplay between weight clustering and feature exploration at initialization. We demonstrate this theoretically for a 3-layer convolutional neural network with max-pooling, in a novel setting which extends the XOR problem. We show that this interplay implies that with overparameterization, gradient descent converges to global minima with better generalization performance compared to global minima of small networks. Empirically, we demonstrate these phenomena for a 3-layer convolutional neural network in the MNIST task.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Most successful deep learning models use more parameters than needed to achieve zero training error. This is typically referred to as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"overparameterization","element":"span"},{"text":". Indeed, it can be argued that overparameterization is one of the key techniques that has led to the remarkable success of neural networks. 
However, there is still no theoretical account for its effectiveness.","element":"span"}],[{"text":"One very intriguing observation in this context is that overparameterized networks with ReLU activations, which are trained with gradient based methods, often exhibit better generalization error than smaller networks ","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"(Neyshabur et al., ","element":"a"},{"href":"#id-0","referenceIndex":15,"text":"2014, ","element":"a"},{"href":"#id-1","referenceIndex":16,"text":"2018; ","element":"a"},{"href":"#id-2","referenceIndex":17,"text":"Novak et al., ","element":"a"},{"href":"#id-2","referenceIndex":17,"text":"2018)","element":"a"},{"text":". ","element":"span"},{"text":"In particular, it often happens that two networks, one with ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/0-0.png","element":"img","alt":" N1","inline":true,"padRight":true},{"text":"neurons and one with ","element":"span"},{"style":{"height":13.19},"width":151.05,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/0-1.png","element":"img","alt":" N2 > N1","inline":true,"padRight":true},{"text":"neurons achieve zero training error, but the larger network has better test error. ","element":"span"},{"text":"This somewhat counter-intuitive observation suggests that first-order methods which are trained on overparameterized networks have an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"inductive bias ","element":"span"},{"text":"towards solutions with better generalization performance. Understanding this inductive bias is a necessary step towards a full understanding of neural networks in practice.","element":"span"}],[{"text":"Providing theoretical guarantees for overparameterization is extremely challenging due to two main reasons. 
","element":"span"},{"text":"First, to show a generalization gap between smaller and larger models, one needs to prove that large networks have better sample complexity than smaller ones. However, current generalization bounds that are based on complexity measures do not offer such guarantees.","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/0-2.png","element":"img","alt":"1","inline":true,"padRight":true},{"text":"Second, analyzing convergence of first-order methods on networks with ReLU activations is a major challenge. Indeed, there are no optimization guarantees even for simple learning tasks such as the classic two dimensional XOR problem. Given these difficulties, it is natural to analyze a simplified scenario, which ideally shares various features with real-world settings.","element":"span"}],[{"text":"In this work we follow this approach and show that a possible explanation for the success of overparameterization is a combination of two effects: weight exploration and weight clustering. Weight","element":"span"}],[{"id":"id-4","style":{"width":"43%"},"width":768,"height":657,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/1-0.png","element":"img"}],[{"text":"Figure 1: ","element":"figcaption","subtype":"caption"},{"text":"overparameterization improves generalization in the XORD problem. ","element":"figcaption","subtype":"caption"},{"text":"The network in Eq. ","element":"figcaption","subtype":"caption"},{"href":"#id-3","text":"2 ","element":"a","subtype":"caption"},{"text":"is trained on data from the XORD problem (see Sec. ","element":"figcaption","subtype":"caption"},{"text":"4)","element":"span","subtype":"caption"},{"text":". The figure shows the test error obtained for different number of channels ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k","element":"figcaption","subtype":"caption"},{"text":". 
The blue curve shows test error when restricting to cases where training error was zero. It can be seen that increasing the number of channels improves the generalization performance. Experimental details are provided in the supplementary material.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"0%"},"width":6,"height":5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/1-1.png","element":"img"}],[{"text":"exploration refers to the fact that larger models explore the set of possible weights more effectively since they have more neurons in each layer. Weight clustering is an effect we demonstrate here, which refers to the fact that weight vectors in the same layer tend to cluster around a small number of prototypes.","element":"span"}],[{"text":"To see ","element":"span"},{"style":{"fontStyle":"italic"},"text":"informally ","element":"span"},{"text":"how these effects act in the case of overparameterization, consider a binary classification problem and a training set. The training set typically contains multiple patterns that discriminate between the two classes. The smaller network will find detectors (e.g., convolutional filters) for a subset of these patterns and reach zero training error, but will not generalize well because it is missing some of the patterns. This is a result of an under-exploration effect for the small net. On the other hand, the larger net has better exploration and will find more relevant detectors for classification. Furthermore, due to the clustering effect its weight vectors will be close to a small set of prototypes. Therefore the effective capacity of the overall model will be restricted, leading to good generalization.","element":"span"}],[{"text":"The network we study here includes some key architectural components used in modern machine learning models. 
","element":"span"},{"text":"Specifically, it consists of a convolution layer with a ReLU activation function, followed by a max-pooling operation, and a fully-connected layer. This is a key component of most machine-vision models, since it can be used to detect patterns in an input image. We are also not aware of any theoretical guarantees for a network of this structure.","element":"span"}],[{"text":"For this architecture, we consider the problem of detecting two-dimensional binary patterns in a high-dimensional input vector. The patterns we focus on are the XOR combination (i.e., (1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) or (","element":"span"},{"style":{"height":14},"width":99.63,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/1-2.png","element":"img","alt":"−1, −","inline":true},{"text":"1)). This problem is a high-dimensional extension of the XOR problem. We refer to it as the “XOR Detection problem” (XORD). One advantage of this setting is that it nicely exhibits the phenomenon of overparameterization empirically, and is therefore a good test-bed for understanding overparameterization. Fig. ","element":"span"},{"href":"#id-4","text":"1 ","element":"a"},{"text":"shows the result of learning the XORD problem with the above network and different numbers of channels. It can be seen that increasing the number of channels improves test error.","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/1-3.png","element":"img","alt":"2","inline":true}],[{"text":"Motivated by these empirical observations, we present a theoretical analysis of optimization and generalization in the XORD problem. 
Under certain distributional assumptions, we will show that overparameterized networks enjoy a combination of better exploration of features at initialization and clustering of weights, leading to better generalization.","element":"span"}],[{"text":"Importantly, we show empirically that our insights from the XORD problem transfer to other settings. In particular, we see a similar phenomenon when learning on the MNIST data, where we verify that weights are clustered at convergence and that large networks explore weights better.","element":"span"}],[{"text":"Finally, another contribution of our work is the first proof of convergence of gradient descent in the classic XOR problem with inputs in ","element":"span"},{"style":{"height":17.39},"width":106.77,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-0.png","element":"img","alt":" {±1}2","inline":true},{"text":". The proof is simple and conveys the key insights of the analysis of the general XORD problem. See Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"for further details.","element":"span"}]]},{"heading":"2 Related Work","paragraphs":[[{"text":"In recent years there have been many works on theoretical aspects of deep learning. We will refer to those that are most relevant to this work. First, we note that we are not aware of any work that shows that generalization performance provably improves with over-parameterization. 
This distinguishes our work from all previous works.","element":"span"}],[{"text":"Several works study convolutional networks with ReLU activations and their properties ","element":"span"},{"href":"#id-5","referenceIndex":6,"text":"(Du et al., ","element":"a"},{"href":"#id-5","referenceIndex":6,"text":"2017a,","element":"a"},{"href":"#id-6","referenceIndex":7,"text":"b; ","element":"a"},{"href":"#id-7","referenceIndex":2,"text":"Brutzkus & Globerson, ","element":"a"},{"href":"#id-7","referenceIndex":2,"text":"2017)","element":"a"},{"text":". All of these works consider convolutional networks with a single channel. Recently, there have been numerous works that provide guarantees for gradient-based methods in general settings ","element":"span"},{"href":"#id-8","referenceIndex":4,"text":"(Daniely, ","element":"a"},{"href":"#id-8","referenceIndex":4,"text":"2017; ","element":"a"},{"href":"#id-9","referenceIndex":12,"text":"Li & Liang, ","element":"a"},{"href":"#id-9","referenceIndex":12,"text":"2018; ","element":"a"},{"href":"#id-10","referenceIndex":9,"text":"Du et al., ","element":"a"},{"href":"#id-10","referenceIndex":9,"text":"2018b,","element":"a"},{"href":"#id-11","referenceIndex":8,"text":"a; ","element":"a"},{"href":"#id-12","referenceIndex":1,"text":"Allen-Zhu et al., ","element":"a"},{"href":"#id-12","referenceIndex":1,"text":"2018)","element":"a"},{"text":". However, their analysis holds for over-parameterized networks with an extremely large number of neurons that are not used in practice (e.g., the number of neurons is a very large polynomial of certain problem parameters). Furthermore, we consider a 3-layer convolutional network with max-pooling which is not studied in these works.","element":"span"}],[{"href":"#id-13","referenceIndex":19,"text":"Soltanolkotabi et al. 
","element":"a"},{"href":"#id-13","referenceIndex":19,"text":"(2018)","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":5,"text":"Du & Lee ","element":"a"},{"href":"#id-14","referenceIndex":5,"text":"(2018) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-15","referenceIndex":14,"text":"Li et al. ","element":"a"},{"href":"#id-15","referenceIndex":14,"text":"(2017) ","element":"a"},{"text":"study the role of over-parameterization in the case of quadratic activation functions. ","element":"span"},{"href":"#id-16","referenceIndex":3,"text":"Brutzkus et al. ","element":"a"},{"href":"#id-16","referenceIndex":3,"text":"(2018) ","element":"a"},{"text":"provide generalization guarantees for over-parameterized networks with Leaky ReLU activations on linearly separable data. ","element":"span"},{"href":"#id-1","referenceIndex":16,"text":"Neyshabur ","element":"a"},{"href":"#id-1","referenceIndex":16,"text":"et al. ","element":"a"},{"href":"#id-1","referenceIndex":16,"text":"(2018) ","element":"a"},{"text":"prove generalization bounds for neural networks. However, these bounds are empirically vacuous for over-parameterized networks and they do not prove that networks found by optimization algorithms give low generalization bounds.","element":"span"}]]},{"heading":"3 Warm up: the XOR Problem","paragraphs":[[{"text":"We begin by studying the simplest form of our model: the classic XOR problem in two dimensions.","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-1.png","element":"img","alt":"3","inline":true,"padRight":true},{"text":"We will show that this problem illustrates the key phenomena that allow overparameterized networks to perform better than smaller ones. Namely, exploration at initialization and clustering during training. 
For the XOR problem, this will imply that overparameterized networks have better ","element":"span"},{"style":{"fontStyle":"italic"},"text":"optimization ","element":"span"},{"text":"performance. In later sections, we will show that the same phenomena occur for higher dimensions in the XORD problem and imply better ","element":"span"},{"style":{"fontStyle":"italic"},"text":"generalization ","element":"span"},{"text":"of global minima for overparameterized convolutional networks.","element":"span"}],[{"id":"id-21","style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Problem Formulation","element":"span"}],[{"text":"In the XOR problem, we are given a training set ","element":"span"},{"style":{"height":20.4},"width":631.93,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-2.png","element":"img","alt":" S = {(xi, yi)}4i=1 ⊆ {±1}2 × {±1}2","inline":true,"padRight":true},{"text":"consisting of ","element":"span"},{"text":"points ","element":"span"},{"style":{"height":9.59},"width":42.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-3.png","element":"img","alt":" x1","inline":true,"padRight":true},{"text":"= (1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1), ","element":"span"},{"style":{"height":9.59},"width":42.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-4.png","element":"img","alt":" x2","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":14},"width":61.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-5.png","element":"img","alt":"−1,","inline":true,"padRight":true},{"text":"1), ","element":"span"},{"style":{"height":9.59},"width":42.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-6.png","element":"img","alt":" 
x3","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":14},"width":99.63,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-7.png","element":"img","alt":"−1, −","inline":true},{"text":"1), ","element":"span"},{"style":{"height":9.59},"width":42.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-8.png","element":"img","alt":" x4","inline":true,"padRight":true},{"text":"= (1","element":"span"},{"style":{"height":7.6},"width":48.71,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-9.png","element":"img","alt":", −","inline":true},{"text":"1) with labels ","element":"span"},{"style":{"height":10},"width":35.54,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-10.png","element":"img","alt":" y1","inline":true,"padRight":true},{"text":"= 1, ","element":"span"},{"style":{"height":14},"width":210.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-11.png","element":"img","alt":" y2 = −1, y3","inline":true,"padRight":true},{"text":"= 1 and ","element":"span"},{"style":{"height":10},"width":131.77,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-12.png","element":"img","alt":" y4 = −","inline":true},{"text":"1, respectively. 
Our goal is to learn the XOR function ","element":"span"},{"style":{"height":14.19},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-13.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":": ","element":"span"},{"style":{"height":17.39},"width":271.7,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-14.png","element":"img","alt":" {±1}2 → {±1}","inline":true},{"text":", such that ","element":"span"},{"style":{"height":16},"width":94.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-15.png","element":"img","alt":"f ∗(xi","inline":true},{"text":") = ","element":"span"},{"style":{"height":10},"width":30.54,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-16.png","element":"img","alt":" yi","inline":true,"padRight":true},{"text":"for 1 ","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/2-17.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4, with a neural network and gradient descent.","element":"span"}],[{"id":"id-23","style":{"width":"88%"},"width":1558,"height":344,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-0.png","element":"img"}],[{"text":"Figure 2: ","element":"figcaption","subtype":"caption"},{"text":"Overparameterization and optimization in the XOR problem. 
The vectors in blue are the vectors ","element":"figcaption","subtype":"caption"},{"style":{"height":17.89},"width":65.88,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-1.png","element":"img","alt":"w(i)t","inline":true,"padRight":true},{"text":"and in red are the vectors ","element":"figcaption","subtype":"caption"},{"style":{"height":17.89},"width":70.94,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-2.png","element":"img","alt":" u(i)t .","inline":true,"padRight":true},{"text":"(a) Exploration at initialization (t=0) for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"= 50 (Lemma ","element":"figcaption","subtype":"caption"},{"href":"#id-17","text":"3.1) ","element":"a","subtype":"caption"},{"text":"(b) Clustering and convergence to global minimum for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"= 50 (Lemma ","element":"figcaption","subtype":"caption"},{"href":"#id-18","text":"3.2 ","element":"a","subtype":"caption"},{"text":"and Theorem ","element":"figcaption","subtype":"caption"},{"href":"#id-19","text":"3.3) ","element":"a","subtype":"caption"},{"text":"(c) Non-sufficient exploration at initialization (t=0) for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"= 2 (Theorem ","element":"figcaption","subtype":"caption"},{"href":"#id-20","text":"3.4)","element":"a","subtype":"caption"},{"text":". 
(d) Convergence to local minimum (Theorem ","element":"figcaption","subtype":"caption"},{"href":"#id-20","text":"3.4)","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Neural Architecture: ","element":"span"},{"text":"For this task we consider the following two-layer fully connected network.","element":"span"}],[{"style":{"width":"70%"},"width":1241,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-3.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.19},"width":195.06,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-4.png","element":"img","alt":" W ∈ R2k×2","inline":true,"padRight":true},{"text":"is the weight matrix whose rows are the ","element":"span"},{"style":{"height":14.19},"width":69.97,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-5.png","element":"img","alt":" w(i)","inline":true,"padRight":true},{"text":"vectors followed by the ","element":"span"},{"style":{"height":14.19},"width":62.87,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-6.png","element":"img","alt":" u(i)","inline":true,"padRight":true},{"text":"vectors, and ","element":"span"},{"style":{"height":16},"width":62.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-7.png","element":"img","alt":"σ(x","inline":true},{"text":") = max","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", x","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"is the ReLU activation applied element-wise. 
We note that ","element":"span"},{"style":{"height":14.19},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-8.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"can be implemented with this network for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2 and this is the minimal ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"for which this is possible. Thus we refer to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k > ","element":"span"},{"text":"2 as the overparameterized case.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training Algorithm: ","element":"span"},{"text":"The parameters of the network ","element":"span"},{"style":{"height":16},"width":109.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-9.png","element":"img","alt":" NW (x","inline":true},{"text":") are learned using gradient descent","element":"span"}],[{"text":"on the hinge loss objective. We use a constant learning rate ","element":"span"},{"style":{"height":18.38},"width":111.64,"height":45.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-10.png","element":"img","alt":" η ≤ cηk","inline":true,"padRight":true},{"text":", where ","element":"span"},{"style":{"height":19.37},"width":113.84,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-11.png","element":"img","alt":" cη < 12","inline":true},{"text":". 
The parameters ","element":"span"},{"style":{"height":13.19},"width":65.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-12.png","element":"img","alt":"NW","inline":true,"padRight":true},{"text":"are initialized as IID Gaussians with zero mean and standard deviation ","element":"span"},{"style":{"height":19.03},"width":190.84,"height":47.57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-13.png","element":"img","alt":" σg ≤ cη16k3/2","inline":true,"padRight":true},{"text":". We consider ","element":"span"},{"text":"the hinge-loss objective:","element":"span"}],[{"style":{"width":"35%"},"width":629,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-14.png","element":"img"}],[{"text":"where optimization is only over the first layer of the network. We note that for ","element":"span"},{"style":{"height":13.2},"width":69.7,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-15.png","element":"img","alt":" k ≥","inline":true,"padRight":true},{"text":"2 any global minimum ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-16.png","element":"img","alt":" ℓ","inline":true,"padRight":true},{"text":"satisfies ","element":"span"},{"style":{"height":16},"width":74.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-17.png","element":"img","alt":" ℓ(W","inline":true},{"text":") = 0 and sign(","element":"span"},{"style":{"height":16},"width":121.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-18.png","element":"img","alt":"NW (xi","inline":true},{"text":")) = 
","element":"span"},{"style":{"height":16},"width":94.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-19.png","element":"img","alt":" f ∗(xi","inline":true},{"text":") for 1 ","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-20.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Notations: ","element":"span"},{"text":"We will need the following notations. Let ","element":"span"},{"style":{"height":13.19},"width":49.64,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-21.png","element":"img","alt":" Wt","inline":true,"padRight":true},{"text":"be the weight matrix at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"of gradient descent. For 1 ","element":"span"},{"style":{"height":13.2},"width":129.93,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-22.png","element":"img","alt":" ≤ i ≤ k","inline":true},{"text":", denote by ","element":"span"},{"style":{"height":20.6},"width":165.9,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-23.png","element":"img","alt":" w(i)t ∈ R2","inline":true,"padRight":true},{"text":"the ","element":"span"},{"style":{"height":13.38},"width":44.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-24.png","element":"img","alt":" ith","inline":true,"padRight":true},{"text":"weight vector at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
Similarly we define ","element":"span"},{"style":{"height":20.6},"width":158.8,"height":51.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-25.png","element":"img","alt":" u(i)t ∈ R2","inline":true,"padRight":true},{"text":"to be the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"+","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"weight vector at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". For each point ","element":"span"},{"style":{"height":13.19},"width":114.24,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-26.png","element":"img","alt":" xi ∈ S","inline":true,"padRight":true},{"text":"define the following sets of neurons:","element":"span"}],[{"style":{"width":"30%"},"width":541,"height":160,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-27.png","element":"img"}],[{"text":"and for each iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", let ","element":"span"},{"style":{"height":16},"width":63.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-28.png","element":"img","alt":" ai(t","inline":true},{"text":") be the number of iterations 0 ","element":"span"},{"style":{"height":12.8},"width":134.77,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-29.png","element":"img","alt":" ≤ t′ ≤ t","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":17.32},"width":233.38,"height":43.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/3-30.png","element":"img","alt":" yiNWt′ (xi) <","inline":true,"padRight":true},{"text":"1.","element":"span"}],[{"id":"id-30","style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Over-parameterized 
Networks Optimize Well","element":"span"}],[{"text":"In this section we assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k > ","element":"span"},{"text":"16. The following lemma shows that with high probability, for every training point, overparameterized networks are initialized at directions that have positive correlation with the training point. The proof uses a standard measure concentration argument. We refer to this as “exploration” as it lets the optimization procedure explore these parts of weight space.","element":"span"}],[{"id":"id-17","style":{"fontWeight":"bold"},"text":"Lemma 3.1. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Exploration at Initialization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":13.38},"width":119.24,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-0.png","element":"img","alt":" − 8e−8","inline":true},{"style":{"fontStyle":"italic"},"text":", for all ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-1.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4","element":"span"}],[{"style":{"width":"40%"},"width":708,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-2.png","element":"img"}],[{"text":"Next, we show an example of the weight dynamics which imply that the weights tend to cluster around a few directions. 
The proof uses the fact that with high probability the initial weights have small norm and proceeds by induction on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"to show the dynamics.","element":"span"}],[{"id":"id-18","style":{"width":"99%"},"width":1754,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-4.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"there exists a vector ","element":"span"},{"style":{"height":9.99},"width":36.06,"height":24.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-5.png","element":"img","alt":" vt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":11.19},"width":143.72,"height":27.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-6.png","element":"img","alt":" vt · xi >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":16},"width":221.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-7.png","element":"img","alt":" |vt · x2| < 2η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":20.68},"width":354.3,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-8.png","element":"img","alt":" w(j)t = ai(t)ηxi + vt","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"The sequence 
","element":"span"},{"style":{"height":16.79},"width":172.51,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-9.png","element":"img","alt":" {ai(t)}t≥0","inline":true,"padRight":true},{"text":"is non-decreasing and it can be shown that ","element":"span"},{"style":{"height":9.19},"width":32.06,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-10.png","element":"img","alt":" ai","inline":true},{"text":"(0) = 1 with high probability. Therefore, the above lemma shows that for all ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-11.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":20.68},"width":123.7,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-12.png","element":"img","alt":"i), w(j)t","inline":true,"padRight":true},{"text":"tends to cluster around ","element":"span"},{"style":{"height":9.59},"width":37.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-13.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"increases. 
Since with probability 1, ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-14.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":104.7,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-15.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(3) = [","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"], the above lemma characterizes the dynamics of all ","element":"span"},{"text":"filters ","element":"span"},{"style":{"height":20.6},"width":73.49,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-16.png","element":"img","alt":" w(j)t","inline":true,"padRight":true},{"text":". In the supplementary material we show a similar result for the filters ","element":"span"},{"style":{"height":20.6},"width":66.39,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-17.png","element":"img","alt":" u(j)t","inline":true,"padRight":true},{"text":".","element":"span"}],[{"text":"By applying both of the above lemmas, it can be shown that for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k > ","element":"span"},{"text":"16 gradient descent converges to a global minimum with high probability and that the weights are clustered at convergence.","element":"span"}],[{"id":"id-19","style":{"width":"100%"},"width":1757,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-18.png","element":"img"}],[{"style":{"height":25.45},"width":171.44,"height":63.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-19.png","element":"img","alt":"T ≤ 16√k√k−2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"iterations, gradient descent converges to a global minimum 
","element":"span"},{"style":{"height":13.19},"width":60.64,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-20.png","element":"img","alt":" WT","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". Furthermore, for ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-21.png","element":"img","alt":" i ∈ {1, 3}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and all ","element":"span"},{"style":{"height":18.27},"width":139.78,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-22.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", the angle between ","element":"span"},{"style":{"height":21.36},"width":73.49,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-23.png","element":"img","alt":" w(j)T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":9.59},"width":37.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-24.png","element":"img","alt":" xi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is at most ","element":"span"},{"text":"arccos","element":"span"},{"style":{"height":28.8},"width":114.5,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-25.png","element":"img","alt":"((1−2cη)/(1+cη))","inline":true}],[{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":21.36},"width":66.39,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-26.png","element":"img","alt":" 
u(j)T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Small Networks Fail to Optimize","element":"span"}],[{"text":"In contrast to the case of large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", we show that for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2, the initialization does not explore all directions, leading to convergence to a suboptimal solution.","element":"span"}],[{"id":"id-20","style":{"fontWeight":"bold"},"text":"Theorem 3.4. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Insufficient Exploration at Initialization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"75","element":"span"},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":16},"width":161.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-27.png","element":"img","alt":" i ∈ {1, 3}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-28.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-29.png","element":"img","alt":" ∅","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"or 
","element":"span"},{"style":{"height":16},"width":161.81,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-30.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":18.27},"width":56.56,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-31.png","element":"img","alt":" U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-32.png","element":"img","alt":" ∅","inline":true},{"style":{"fontStyle":"italic"},"text":". As a result, with probability ","element":"span"},{"style":{"height":13.2},"width":72.99,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-33.png","element":"img","alt":"≥ 0.","inline":true},{"text":"75","element":"span"},{"style":{"fontStyle":"italic"},"text":", gradient descent converges to a model which errs on at least one input pattern.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiments","element":"span"}],[{"text":"In this section we empirically demonstrate the theoretical results. We implemented the learning setting described in Sec. 
","element":"span"},{"href":"#id-21","text":"3.1 ","element":"a"},{"text":"and conducted two experiments: one with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 50 and one with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2. We note that for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2 the XOR function ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-34.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"can be realized by the network in Eq. ","element":"span"},{"href":"#id-22","text":"6. ","element":"a"},{"text":"Figure ","element":"span"},{"href":"#id-23","text":"2 ","element":"a"},{"text":"shows the results. It can be seen that our theory nicely predicts the behavior of gradient descent. For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 50 we see the effect of exploration at initialization and clustering, which imply convergence to a global minimum. In contrast, the small network does not explore all directions at initialization and therefore converges to a local minimum. This is despite the fact that it has sufficient expressive power to implement ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/4-35.png","element":"img","alt":" f ∗","inline":true},{"text":".","element":"span"}]]},{"heading":"4 The XORD Problem","paragraphs":[[{"text":"In the previous section we analyzed the XOR problem, showing that using a large number of channels allows gradient descent to learn the XOR function. ","element":"span"},{"text":"This allowed us to understand the effect of overparameterization on optimization. 
However, it did not let us study generalization because in the learning setting all four examples were given, so that any model with zero training error also had zero test error.","element":"span"}],[{"text":"In order to study the effect of overparameterization on generalization we consider a more general setting, which we refer to as the XOR Detection problem (XORD). As can be seen in Fig. ","element":"span"},{"href":"#id-4","text":"1, ","element":"a"},{"text":"in the XORD problem large networks generalize better than smaller ones. This is despite the fact that small networks can reach zero training error. Our goal is to understand this phenomenon from a theoretical perspective.","element":"span"}],[{"text":"In this section, we define the XORD problem. We begin with some notation and definitions. We consider a classification problem in the space ","element":"span"},{"style":{"height":17.38},"width":123.66,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-0.png","element":"img","alt":" {±1}2d","inline":true},{"text":", for ","element":"span"},{"style":{"height":13.2},"width":69.09,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-1.png","element":"img","alt":" d ≥","inline":true,"padRight":true},{"text":"1. 
Given a vector ","element":"span"},{"style":{"height":17.38},"width":211.19,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-2.png","element":"img","alt":" x ∈ {±1}2d","inline":true},{"text":", we consider its partition into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"sets of two coordinates as follows ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"= (","element":"span"},{"style":{"height":10.4},"width":156.04,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-3.png","element":"img","alt":"x1, ..., xd","inline":true},{"text":") where ","element":"span"},{"style":{"height":17.38},"width":195.01,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-4.png","element":"img","alt":" xi ∈ {±1}2","inline":true},{"text":". We refer to each such ","element":"span"},{"style":{"height":9.59},"width":37.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-5.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"as a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"pattern ","element":"span"},{"text":"in ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Neural Architecture: ","element":"span"},{"text":"We consider learning with the following three-layer neural net model. 
The","element":"span"}],[{"text":"first layer is a convolutional layer with non-overlapping filters and multiple channels, the second layer is max pooling and the third layer is a fully connected layer with 2","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"hidden neurons and weights fixed to values ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-6.png","element":"img","alt":" ±","inline":true},{"text":"1. Formally, for an input ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"= (","element":"span"},{"style":{"height":17.38},"width":283.48,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-7.png","element":"img","alt":"x1, ..., xd) ∈ R2d","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":15.77},"width":133.02,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-8.png","element":"img","alt":" xi ∈ R2","inline":true},{"text":", the output of the network is denoted by ","element":"span"},{"style":{"height":16},"width":109.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-9.png","element":"img","alt":" NW (x","inline":true},{"text":") and is given by:","element":"span"}],[{"id":"id-3","style":{"width":"74%"},"width":1305,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-10.png","element":"img"}],[{"text":"where notation is as in the XOR problem.","element":"span"}],[{"id":"id-45","style":{"fontWeight":"bold"},"text":"Remark 4.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Because there are only ","element":"span"},{"text":"4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"different patterns, the network is limited in terms of the number of rules it can implement. 
Specifically, it is easy to show that its VC dimension is at most ","element":"span"},{"text":"15 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(see supplementary material). Despite this limited expressive power, there is a generalization gap between small and large networks in this setting, as can be seen in Fig. ","element":"span"},{"href":"#id-4","style":{"fontStyle":"italic"},"text":"1, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and in our analysis below.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Data Generating Distribution: ","element":"span"},{"text":"Next we define the classification rule we will focus on. Define the","element":"span"}],[{"text":"four two-dimensional binary patterns ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-11.png","element":"img","alt":" p1","inline":true,"padRight":true},{"text":"= (1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1)","element":"span"},{"style":{"height":11.1},"width":57.66,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-12.png","element":"img","alt":", p2","inline":true,"padRight":true},{"text":"= (1","element":"span"},{"style":{"height":16},"width":141.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-13.png","element":"img","alt":", −1), p3","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":16},"width":192.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-14.png","element":"img","alt":"−1, −1), p4","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":14},"width":61.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-15.png","element":"img","alt":"−1,","inline":true,"padRight":true},{"text":"1). 
Define ","element":"span"},{"style":{"height":16.79},"width":285.06,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-16.png","element":"img","alt":"Ppos = {p1, p3}","inline":true,"padRight":true},{"text":"to be the set of positive patterns and ","element":"span"},{"style":{"height":16.79},"width":289.23,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-17.png","element":"img","alt":" Pneg = {p2, p4}","inline":true,"padRight":true},{"text":"to be the set of negative patterns. Define the classification rule:","element":"span"}],[{"style":{"width":"71%"},"width":1257,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-18.png","element":"img"}],[{"text":"Namely, ","element":"span"},{"style":{"height":14.19},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-19.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"detects whether a positive pattern appears in the input. For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= 1, ","element":"span"},{"style":{"height":14.19},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-20.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"is the XOR classifier in Sec. 
","element":"span"},{"text":"3.","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"be a distribution over ","element":"span"},{"style":{"height":16},"width":175.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-21.png","element":"img","alt":" X × {±1}","inline":true,"padRight":true},{"text":"such that for all (","element":"span"},{"style":{"height":16},"width":169.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-22.png","element":"img","alt":"x, y) ∼ D","inline":true,"padRight":true},{"text":"we have ","element":"span"},{"style":{"height":16},"width":162.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-23.png","element":"img","alt":" y = f ∗(x","inline":true},{"text":"). We say that a point (","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", y","element":"span"},{"text":") is positive if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"= 1 and negative otherwise. 
Let ","element":"span"},{"style":{"height":14.79},"width":55.74,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-24.png","element":"img","alt":" D+","inline":true,"padRight":true},{"text":"be the marginal distribution over ","element":"span"},{"style":{"height":17.38},"width":123.66,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-25.png","element":"img","alt":" {±1}2d","inline":true,"padRight":true},{"text":"of positive points and ","element":"span"},{"style":{"height":13.19},"width":55.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-26.png","element":"img","alt":" D−","inline":true,"padRight":true},{"text":"be the marginal distribution of negative points.","element":"span"}],[{"text":"For each point ","element":"span"},{"style":{"height":17.38},"width":203.67,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-27.png","element":"img","alt":" x ∈ {±1}2d","inline":true},{"text":", define ","element":"span"},{"style":{"height":13.19},"width":46.58,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-28.png","element":"img","alt":" Px","inline":true,"padRight":true},{"text":"to be the set of unique two-dimensional patterns that the point ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"contains, namely ","element":"span"},{"style":{"height":16.79},"width":386.88,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-29.png","element":"img","alt":" Px = {i | ∃j, xj = pi}","inline":true},{"text":". 
In the following definition we introduce the notion of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"diverse ","element":"span"},{"text":"points, which will play a key role in our analysis.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 4.2 ","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"Diverse Points). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"We say that a positive point ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is diverse if ","element":"span"},{"href":"#id-24","style":{"height":17.38},"width":302.18,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-30.png","element":"img","alt":" Px = {1, 2, 3, 4}.4","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"We say that a negative point ","element":"span"},{"text":"(","element":"span"},{"style":{"height":10.4},"width":74.98,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-31.png","element":"img","alt":"x, −","inline":true},{"text":"1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is diverse if ","element":"span"},{"style":{"height":16},"width":198.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-32.png","element":"img","alt":" Px = {2, 4}","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"width":"96%"},"width":1694,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-33.png","element":"img"}],[{"text":"if both ","element":"span"},{"style":{"height":14.79},"width":57.99,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-34.png","element":"img","alt":" 
D+","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":57.99,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-35.png","element":"img","alt":" D−","inline":true,"padRight":true},{"text":"are uniform, then by the inclusion-exclusion principle it follows that ","element":"span"},{"style":{"height":10.79},"width":45.05,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-36.png","element":"img","alt":" p+","inline":true,"padRight":true},{"text":"= 1 ","element":"span"},{"style":{"height":4.4},"width":31,"height":11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-37.png","element":"img","alt":" −","inline":true}],[{"id":"id-24","style":{"width":"40%"},"width":704,"height":65,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/5-38.png","element":"img"}],[{"id":"id-25","style":{"width":"88%"},"width":1558,"height":344,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-0.png","element":"img"}],[{"text":"Figure 3: ","element":"figcaption","subtype":"caption"},{"text":"Overparameterization and generalization in the XORD problem. The vectors in blue are the vectors ","element":"figcaption","subtype":"caption"},{"style":{"height":17.89},"width":65.88,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-1.png","element":"img","alt":"w(i)t","inline":true,"padRight":true},{"text":"and in red are the vectors ","element":"figcaption","subtype":"caption"},{"style":{"height":17.89},"width":59.43,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-2.png","element":"img","alt":" u(i)t ","inline":true,"padRight":true},{"text":". 
(a) Exploration at initialization (t=0) for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"= 100. (b) Clustering and ","element":"figcaption","subtype":"caption"},{"text":"convergence to a global minimum that recovers ","element":"figcaption","subtype":"caption"},{"style":{"height":13.29},"width":131.25,"height":33.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-3.png","element":"img","alt":" f ∗ for k","inline":true,"padRight":true},{"text":"= 100. (c) Insufficient exploration at initialization (t=0) for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"= 2. (d) Convergence to a global minimum with non-zero test error for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"= 2.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Learning Setup: ","element":"span"},{"text":"Our analysis will focus on the problem of learning ","element":"span"},{"style":{"height":14.19},"width":39.8,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-4.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"from training data with","element":"span"}],[{"text":"the three-layer neural net model in Eq. ","element":"span"},{"href":"#id-3","text":"2. ","element":"a"},{"text":"The learning algorithm will be gradient descent, randomly initialized. As in any learning task in practice, ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-5.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"is unknown to the training algorithm. 
Our goal is to analyze the performance of gradient descent when given data that is labeled with ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-6.png","element":"img","alt":" f ∗","inline":true},{"text":". We assume that we are given a training set ","element":"span"},{"style":{"height":17.38},"width":568.14,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-7.png","element":"img","alt":" S = S+ ∪ S− ⊆ {±1}2d × {±1}2","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":14.79},"width":49.44,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-8.png","element":"img","alt":" S+","inline":true,"padRight":true},{"text":"consists of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"IID points drawn from ","element":"span"},{"style":{"height":14.79},"width":55.74,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-9.png","element":"img","alt":" D+","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":49.44,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-10.png","element":"img","alt":" S−","inline":true,"padRight":true},{"text":"consists of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"IID points drawn from ","element":"span"},{"style":{"height":16.18},"width":84.71,"height":40.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-11.png","element":"img","alt":" D−.5","inline":true}],[{"text":"Importantly, we note that the function ","element":"span"},{"style":{"height":14.19},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-12.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"can be 
realized by the above network with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2. Indeed, the network ","element":"span"},{"style":{"height":13.19},"width":65.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-13.png","element":"img","alt":" NW","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":14.18},"width":74.58,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-14.png","element":"img","alt":" w(1)","inline":true,"padRight":true},{"text":"= 3","element":"span"},{"style":{"height":18.08},"width":142.31,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-15.png","element":"img","alt":"p1, w(2)","inline":true,"padRight":true},{"text":"= 3","element":"span"},{"style":{"height":18.08},"width":135.21,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-16.png","element":"img","alt":"p3, u(1)","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":18.08},"width":135.21,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-17.png","element":"img","alt":" p2, u(2)","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-18.png","element":"img","alt":" p4","inline":true,"padRight":true},{"text":"satisfies sign (","element":"span"},{"style":{"height":16},"width":109.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-19.png","element":"img","alt":"NW (x","inline":true},{"text":")) = ","element":"span"},{"style":{"height":16},"width":83.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-20.png","element":"img","alt":" f ∗(x","inline":true},{"text":") for all 
","element":"span"},{"style":{"height":17.38},"width":201.52,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-21.png","element":"img","alt":" x ∈ {±1}2d","inline":true},{"text":". It can be seen that for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1, ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-22.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"cannot be realized. Therefore, any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k > ","element":"span"},{"text":"2 is an overparameterized setting.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training Algorithm: ","element":"span"},{"text":"We will use gradient descent to optimize the following hinge-loss function.","element":"span"}],[{"id":"id-29","style":{"width":"73%"},"width":1285,"height":240,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-23.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":14},"width":64.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-24.png","element":"img","alt":" γ ≥","inline":true,"padRight":true},{"text":"1.","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-25.png","element":"img","alt":"6","inline":true,"padRight":true},{"text":"We assume that gradient descent runs with a constant learning rate ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-26.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"and that the weights are randomly initialized as IID Gaussian random variables with mean 0 and standard deviation 
","element":"span"},{"style":{"height":11.59},"width":38.77,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-27.png","element":"img","alt":" σg","inline":true},{"text":". Furthermore, only the weights of the first layer, the convolutional filters, are trained.","element":"span"},{"style":{"height":8},"width":16,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-28.png","element":"img","alt":"7","inline":true,"padRight":true},{"text":"As in Section ","element":"span"},{"text":"3, ","element":"span"},{"text":"we will use the notations ","element":"span"},{"style":{"height":20.6},"width":152.64,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-29.png","element":"img","alt":" Wt, w(i)t","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":20.6},"width":62.87,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-30.png","element":"img","alt":" u(i)t","inline":true,"padRight":true},{"text":"for the weights at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"of gradient descent. ","element":"span"},{"text":"At each iteration (starting from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0), gradient descent performs the update ","element":"span"},{"style":{"height":19.77},"width":323.26,"height":49.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-31.png","element":"img","alt":" Wt+1 = Wt − η ∂ℓ∂W","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":13.19},"width":49.63,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/6-32.png","element":"img","alt":"Wt","inline":true},{"text":").","element":"span"}]]},{"heading":"5 XORD on Decoy Sets","paragraphs":[[{"text":"In Fig. 
","element":"span"},{"href":"#id-4","text":"1 ","element":"a"},{"text":"we showed that the XORD problem exhibits better generalization for overparameterized models. Here we will empirically show how this comes about due to the effects of clustering and exploration. We compare two networks as in Sec. ","element":"span"},{"text":"4. ","element":"span"},{"text":"The first has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2 (i.e., four hidden neurons) and the second has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 100. As mentioned earlier, both these nets can achieve zero test error on the XORD problem.","element":"span"}],[{"text":"We consider a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"diverse ","element":"span"},{"text":"training set, namely, one which contains only diverse points. The set has 6 positive diverse points and 6 negative diverse points. Each positive point contains all the patterns ","element":"span"},{"style":{"height":16},"width":260.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-0.png","element":"img","alt":"{p1, p2, p3, p4}","inline":true,"padRight":true},{"text":"and each negative point contains all the patterns ","element":"span"},{"style":{"height":16},"width":141.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-1.png","element":"img","alt":" {p2, p4}","inline":true},{"text":". 
","element":"span"},{"text":"Note that in order to achieve zero training error on this set, a network needs only to detect ","element":"span"},{"style":{"fontStyle":"italic"},"text":"at least ","element":"span"},{"text":"one of the patterns ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-2.png","element":"img","alt":"p1","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-3.png","element":"img","alt":" p3","inline":true},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"at least ","element":"span"},{"text":"one of the patterns ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-4.png","element":"img","alt":" p2","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-5.png","element":"img","alt":" p4","inline":true},{"text":". 
For example, a network with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2 and filters ","element":"span"},{"style":{"height":14.18},"width":74.58,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-6.png","element":"img","alt":"w(1)","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":14.18},"width":74.58,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-7.png","element":"img","alt":" w(2)","inline":true,"padRight":true},{"text":"= 3","element":"span"},{"style":{"height":18.08},"width":135.33,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-8.png","element":"img","alt":"p1, u(1)","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":14.18},"width":67.48,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-9.png","element":"img","alt":" u(2)","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-10.png","element":"img","alt":" p2","inline":true},{"text":", has zero training loss. However, this network will not generalize to non-diverse points, where only a subset of the patterns appears. Thus we refer to it as a “decoy” training set.","element":"span"}],[{"text":"Fig. ","element":"span"},{"href":"#id-25","text":"3 ","element":"a"},{"text":"shows the results of training the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2 and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 100 networks on the decoy training set. Both networks reach zero training error. However, the larger network learns the XORD function exactly, whereas the smaller network does not, and will therefore misclassify certain data points. As Fig. 
","element":"span"},{"href":"#id-25","text":"3 ","element":"a"},{"text":"clearly shows, the reason for the failure of the smaller network is that at initialization there is insufficient exploration of weight space. On the other hand, the larger network both explores well at initialization, and converges to clustered weights corresponding to all relevant patterns.","element":"span"}],[{"text":"The above observations are for a training set that contains only diverse points. However, there are other decoy training sets which also contain non-diverse points (see supplementary for an example). We also note that in the experiments in Fig. ","element":"span"},{"href":"#id-4","text":"1, ","element":"a"},{"text":"we ran gradient descent on various training sets which do not contain only diverse points. The generalization gap that we observe for solutions with zero training error suggests the existence of other decoy training sets.","element":"span"}]]},{"heading":"6 XORD Theoretical Analysis","paragraphs":[[{"text":"In Sec. ","element":"span"},{"text":"5 ","element":"span"},{"text":"we saw a case where overparameterized networks generalize better than smaller ones. This was due to the fact that the training set was a “decoy” in the sense that it could be explained by a subset of the discriminative patterns. Due to the under-exploration of weights in the smaller model, this led to zero training error but non-zero test error.","element":"span"}],[{"text":"We proceed to formulate this intuition. 
Our theoretical results will show that for diverse training sets, networks with ","element":"span"},{"style":{"height":13.2},"width":64.07,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-11.png","element":"img","alt":" k ≥","inline":true,"padRight":true},{"text":"120 will converge with high probability to a solution with zero training error that recovers ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-12.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"(Sec. ","element":"span"},{"href":"#id-26","text":"6.1)","element":"a"},{"text":". On the other hand, networks with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2 will converge with constant probability to zero training error solutions which do not recover ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-13.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"(Sec. ","element":"span"},{"href":"#id-27","text":"6.2)","element":"a"},{"text":". Finally, we show that in a PAC setting these results imply a sample complexity gap between large and small networks (Sec. ","element":"span"},{"href":"#id-28","text":"6.3)","element":"a"},{"text":".","element":"span"}],[{"text":"We assume that the training set consists of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"positive diverse points and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"negative diverse points. 
For the analysis, without loss of generality, we can assume that the training set consists of one positive diverse point ","element":"span"},{"style":{"height":12.98},"width":51.26,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-14.png","element":"img","alt":" x+","inline":true,"padRight":true},{"text":"and one negative diverse point ","element":"span"},{"style":{"height":8.98},"width":51.26,"height":22.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-15.png","element":"img","alt":" x−","inline":true},{"text":". This follows since the network and its gradient take the same value on any two positive diverse points and on any two negative diverse points. Therefore, this holds for the loss function in Eq. ","element":"span"},{"href":"#id-29","text":"4 ","element":"a"},{"text":"as well.","element":"span"}],[{"text":"For the analysis, we need a few more definitions. Define the following sets for each 1 ","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-16.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4:","element":"span"}],[{"style":{"width":"35%"},"width":624,"height":215,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-17.png","element":"img"}],[{"text":"For each set of binary patterns ","element":"span"},{"style":{"height":17.38},"width":192.19,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-18.png","element":"img","alt":" A ⊆ {±1}2","inline":true,"padRight":true},{"text":"define ","element":"span"},{"style":{"height":10},"width":44.04,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-19.png","element":"img","alt":" pA","inline":true,"padRight":true},{"text":"to be the probability of sampling a point ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x 
","element":"span"},{"text":"such that ","element":"span"},{"style":{"height":13.99},"width":131.9,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-20.png","element":"img","alt":" Px = A","inline":true},{"text":". Let ","element":"span"},{"style":{"height":16},"width":608.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-21.png","element":"img","alt":" A1 = {2}, A2 = {4}, A3 = {2, 4, 1}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":236.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-22.png","element":"img","alt":" A4 = {2, 4, 3}","inline":true},{"text":". The following quantity will be useful in our analysis:","element":"span"}],[{"id":"id-34","style":{"width":"57%"},"width":1002,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-23.png","element":"img"}],[{"text":"Finally, we let ","element":"span"},{"style":{"height":16.98},"width":77.02,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-24.png","element":"img","alt":" a+(t","inline":true},{"text":") be the number of iterations 0 ","element":"span"},{"style":{"height":12.8},"width":134.77,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-25.png","element":"img","alt":" ≤ t′ ≤ t","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":18.31},"width":246.85,"height":45.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-26.png","element":"img","alt":" NWt′ (x+) < γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.8},"width":59.31,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-27.png","element":"img","alt":" c 
≤","inline":true,"padRight":true},{"text":"10","element":"span"},{"style":{"height":7.6},"width":56.81,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/7-28.png","element":"img","alt":"−10","inline":true,"padRight":true},{"text":"be a negligible constant.","element":"span"}],[{"id":"id-26","style":{"fontWeight":"bold"},"text":"6.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Overparameterized Network","element":"span"}],[{"text":"As in Sec. ","element":"span"},{"href":"#id-30","text":"3.2, ","element":"a"},{"text":"we will show that both exploration at initialization and clustering will imply good performance of overparameterized networks. ","element":"span"},{"text":"Concretely, they will imply convergence to a global minimum that recovers ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-0.png","element":"img","alt":" f ∗","inline":true},{"text":". 
However, the analysis in XORD is significantly more involved.","element":"span"}],[{"text":"We assume that ","element":"span"},{"style":{"height":13.2},"width":71.14,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-1.png","element":"img","alt":" k ≥","inline":true,"padRight":true},{"text":"120 and gradient descent runs with parameters ","element":"span"},{"style":{"height":18.38},"width":122.52,"height":45.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-2.png","element":"img","alt":" η = cηk","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":19.37},"width":156.53,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-3.png","element":"img","alt":" cη ≤ 1410","inline":true},{"text":", ","element":"span"},{"style":{"height":23.52},"width":167.4,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-4.png","element":"img","alt":"σg ≤ cη16k32","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":64.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-5.png","element":"img","alt":" γ ≥","inline":true,"padRight":true},{"text":"8.","element":"span"}],[{"text":"In the analysis there are several instances of exploration and clustering effects. ","element":"span"},{"text":"Due to space limitations, here we will show one such instance. In the following lemma we show an example of exploration at initialization. The proof is a direct application of a concentration bound.","element":"span"}],[{"id":"id-31","style":{"fontWeight":"bold"},"text":"Lemma 6.1. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Exploration. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"text":"1","element":"span"},{"style":{"height":13.38},"width":112.26,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-6.png","element":"img","alt":"−4e−8","inline":true},{"style":{"fontStyle":"italic"},"text":", it holds that","element":"span"},{"style":{"height":19.96},"width":94.74,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-7.png","element":"img","alt":"��W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":103.59,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-8.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.97},"width":144.52,"height":49.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-9.png","element":"img","alt":"�� − k2�� ≤","inline":true,"padRight":true},{"text":"2","element":"span"},{"style":{"height":16},"width":54.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-10.png","element":"img","alt":"√k","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"Next, we characterize the dynamics of filters in ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-11.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":103.6,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-12.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(3) for all 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":".","element":"span"}],[{"id":"id-32","style":{"width":"99%"},"width":1754,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-14.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"there exists a vector ","element":"span"},{"style":{"height":9.99},"width":36.06,"height":24.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-15.png","element":"img","alt":" vt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":12.7},"width":141.41,"height":31.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-16.png","element":"img","alt":" vt · pi >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":16},"width":219.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-17.png","element":"img","alt":" |vt · p2| < 2η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":20.68},"width":365.18,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-18.png","element":"img","alt":" w(j)t = a+(t)ηpi + vt","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"We note that 
","element":"span"},{"style":{"height":16.98},"width":77.02,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-19.png","element":"img","alt":" a+(t","inline":true},{"text":") is a non-decreasing sequence such that ","element":"span"},{"style":{"height":12.98},"width":46.06,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-20.png","element":"img","alt":" a+","inline":true},{"text":"(0) = 1 with high probability. Therefore, the above lemma suggests that the weights in ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-21.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":104.38,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-22.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(3) tend to get clustered as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"increases.","element":"span"}],[{"text":"By combining Lemma ","element":"span"},{"href":"#id-31","text":"6.1, ","element":"a"},{"text":"Lemma ","element":"span"},{"href":"#id-32","text":"6.2 ","element":"a"},{"text":"and other similar lemmas given in the supplementary (for other sets ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-23.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":18.27},"width":103.49,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-24.png","element":"img","alt":"i), U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")), the following convergence theorem can be shown. 
The proof consists of a ","element":"span"},{"text":"careful and lengthy analysis of the dynamics of gradient descent and is given in the supplementary.","element":"span"}],[{"id":"id-35","style":{"fontWeight":"bold"},"text":"Theorem 6.3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"style":{"height":19.2},"width":144,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-25.png","element":"img","alt":"�1 − c −","inline":true,"padRight":true},{"text":"16","element":"span"},{"style":{"height":19.2},"width":79.34,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-26.png","element":"img","alt":"e−8�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"after running gradient descent for ","element":"span"},{"style":{"height":13.2},"width":81.5,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-27.png","element":"img","alt":" T ≥","inline":true}],[{"style":{"width":"99%"},"width":1755,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-28.png","element":"img"}],[{"style":{"height":17.38},"width":198.62,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-29.png","element":"img","alt":"x ∈ {±1}2d","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Furthermore, for ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-30.png","element":"img","alt":" i ∈ {1, 3}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and all ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-31.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", the angle between ","element":"span"},{"style":{"height":21.36},"width":73.49,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-32.png","element":"img","alt":" w(j)T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":11.09},"width":34.94,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-33.png","element":"img","alt":" pi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is at most ","element":"span"},{"text":"arccos","element":"span"},{"style":{"height":28.8},"width":157.86,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-34.png","element":"img","alt":"�γ−1−2cηγ−1+cη","inline":true}],[{"text":"This result shows that if the training set consists only of diverse points, then with high probability over the initialization, overparameterized networks converge to a global minimum which realizes ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-35.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"in a constant number of iterations.","element":"span"}],[{"id":"id-27","style":{"fontWeight":"bold"},"text":"6.2 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Small Network","element":"span"}],[{"text":"Next we consider the case of the small network ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2, and show that it has inferior generalization due to under-exploration. We assume that gradient descent runs with parameter values of ","element":"span"},{"style":{"height":11.59},"width":83.83,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-36.png","element":"img","alt":" η, σg","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-37.png","element":"img","alt":"γ","inline":true,"padRight":true},{"text":"which are similar to those in the previous section but range over a slightly broader set of values (see supplementary for details). The main result of this section shows that with constant probability, gradient descent converges to a global minimum that does not recover ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-38.png","element":"img","alt":" f ∗","inline":true},{"text":".","element":"span"}],[{"id":"id-36","style":{"fontWeight":"bold"},"text":"Theorem 6.4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"text":"(1 ","element":"span"},{"style":{"height":6.8},"width":57.85,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-39.png","element":"img","alt":" − c","inline":true},{"text":") ","element":"span"},{"style":{"height":19.37},"width":31.9,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-40.png","element":"img","alt":"3348","inline":true},{"style":{"fontStyle":"italic"},"text":", gradient descent converges to a global minimum ","element":"span"},{"style":{"fontStyle":"italic"},"text":"that does not recover ","element":"span"},{"style":{"height":14.19},"width":39.8,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-41.png","element":"img","alt":" f ∗","inline":true},{"style":{"fontStyle":"italic"},"text":". Furthermore, there exists ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":12.8},"width":100.23,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-42.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that the global minimum misclassifies all points ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":13.99},"width":142.17,"height":34.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-43.png","element":"img","alt":" Px = Ai","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"The proof follows due to an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"under-exploration ","element":"span"},{"text":"effect. 
Concretely, let ","element":"span"},{"style":{"height":21.36},"width":74.58,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-44.png","element":"img","alt":" w(1)T","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":21.36},"width":74.58,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-45.png","element":"img","alt":" w(2)T","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":21.36},"width":67.48,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-46.png","element":"img","alt":" u(1)T","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.36},"width":67.48,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-47.png","element":"img","alt":" u(2)T","inline":true,"padRight":true},{"text":"be the filters of the network at the iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"in which gradient descent converges to a global minimum (convergence occurs with high constant probability). The proof shows that gradient descent will not learn ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-48.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"if one of the following conditions is met: a) ","element":"span"},{"style":{"height":18.7},"width":68.17,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-49.png","element":"img","alt":" W +T","inline":true,"padRight":true},{"text":"(1) = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-50.png","element":"img","alt":" ∅","inline":true},{"text":". 
b) ","element":"span"},{"style":{"height":18.7},"width":68.17,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-51.png","element":"img","alt":" W +T","inline":true,"padRight":true},{"text":"(3) = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-52.png","element":"img","alt":" ∅","inline":true},{"text":". c) ","element":"span"},{"style":{"height":21.36},"width":182.78,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/8-53.png","element":"img","alt":" u(1)T · p2 >","inline":true,"padRight":true},{"text":"0 and","element":"span"}],[{"id":"id-39","style":{"width":"79%"},"width":1405,"height":560,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-0.png","element":"img"}],[{"text":"Figure 4: ","element":"figcaption","subtype":"caption"},{"text":"Clustering and Exploration in MNIST (a) Distribution of angle to closest center in trained and random networks. (b) The plot shows the test error of the small network (4 channels) with standard training (red), the small network that uses clusters from the large network (blue), and the large network (120 channels) with standard training (green). It can be seen that the large network is effectively compressed without losing much accuracy.","element":"figcaption","subtype":"caption"}],[{"style":{"height":21.36},"width":182.94,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-1.png","element":"img","alt":"u(2)T · p2 >","inline":true,"padRight":true},{"text":"0. 
d) ","element":"span"},{"style":{"height":21.36},"width":182.94,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-2.png","element":"img","alt":" u(1)T · p4 >","inline":true,"padRight":true},{"text":"0 and ","element":"span"},{"style":{"height":21.36},"width":182.94,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-3.png","element":"img","alt":" u(2)T · p4 >","inline":true,"padRight":true},{"text":"0. A symmetry argument, based on the symmetry of the initialization and of the training data, then shows that one of the above conditions is met with high constant probability.","element":"span"}],[{"id":"id-28","style":{"fontWeight":"bold"},"text":"6.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"A Sample Complexity Gap","element":"span"}],[{"text":"In the previous analysis we assumed that the training set was diverse. Here we consider the standard PAC setting of a distribution over inputs, and show that overparameterized models indeed enjoy better generalization. 
Recall that the sample complexity ","element":"span"},{"style":{"height":16},"width":103.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-4.png","element":"img","alt":" m(ϵ, δ","inline":true},{"text":") of a learning algorithm is the minimal number of samples required for learning a model with test error at most ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-5.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"with confidence greater than 1 ","element":"span"},{"style":{"height":11.6},"width":58.85,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-6.png","element":"img","alt":" − δ","inline":true,"padRight":true},{"href":"#id-33","referenceIndex":18,"text":"(Shalev-Shwartz & Ben-David, ","element":"a"},{"href":"#id-33","referenceIndex":18,"text":"2014)","element":"a"},{"text":".","element":"span"}],[{"text":"We are interested in the sample complexity of learning with ","element":"span"},{"style":{"height":13.2},"width":64.07,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-7.png","element":"img","alt":" k ≥","inline":true,"padRight":true},{"text":"120 and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2. Denote these two functions by ","element":"span"},{"style":{"height":16},"width":121.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-8.png","element":"img","alt":" m1(ϵ, δ","inline":true},{"text":") and ","element":"span"},{"style":{"height":16},"width":121.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-9.png","element":"img","alt":" m2(ϵ, δ","inline":true},{"text":"). 
The following result states that there is a gap between the sample complexities of the two models, with the larger model in fact requiring fewer samples.","element":"span"}],[{"id":"id-37","style":{"fontWeight":"bold"},"text":"Theorem 6.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be a distribution with parameters ","element":"span"},{"style":{"height":10.79},"width":122.36,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-10.png","element":"img","alt":" p+, p−","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14.18},"width":36.05,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-11.png","element":"img","alt":" p∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"(see Eq. ","element":"span"},{"href":"#id-34","style":{"fontStyle":"italic"},"text":"5)","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
Let ","element":"span"},{"style":{"height":14},"width":147.18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-12.png","element":"img","alt":" δ ≥ 1 −","inline":true},{"style":{"height":16},"width":234.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-13.png","element":"img","alt":"p+p−(1 − c −","inline":true,"padRight":true},{"text":"16","element":"span"},{"style":{"height":13.38},"width":59.46,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-14.png","element":"img","alt":"e−8","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"text":"0 ","element":"span"},{"style":{"height":14.18},"width":147.42,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-15.png","element":"img","alt":" ≤ ϵ < p∗","inline":true},{"style":{"fontStyle":"italic"},"text":". Then ","element":"span"},{"style":{"height":16},"width":179.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-16.png","element":"img","alt":" m1(ϵ, δ) ≤","inline":true,"padRight":true},{"text":"2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"whereas ","element":"span"},{"style":{"height":30.43},"width":395.28,"height":76.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-17.png","element":"img","alt":" m2(ϵ, δ) ≥2 log( 48δ33(1−c))log(p+p−)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-18.png","element":"img","alt":"9","inline":true}],[{"text":"The proof (see supplementary material) follows from Theorem ","element":"span"},{"href":"#id-35","text":"6.3 ","element":"a"},{"text":"and Theorem ","element":"span"},{"href":"#id-36","text":"6.4 ","element":"a"},{"text":"and the fact that the probability to sample a training set with only diverse points is (","element":"span"},{"style":{"height":16},"width":136.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-19.png","element":"img","alt":"p+p−)m","inline":true},{"text":".","element":"span"}],[{"text":"We will illustrate the guarantee of Theorem ","element":"span"},{"href":"#id-37","text":"6.5 ","element":"a"},{"text":"with several numerical examples. Assume that for the distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", the probability to sample a positive point is ","element":"span"},{"style":{"height":19.37},"width":16,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-20.png","element":"img","alt":"12","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":36.04,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-21.png","element":"img","alt":" p∗","inline":true,"padRight":true},{"text":"= min","element":"span"},{"style":{"height":29.2},"width":250.9,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-22.png","element":"img","alt":"�1−p+4 , 1−p−4 �","inline":true}],[{"text":"(it is easy to construct such distributions). 
","element":"span"},{"text":"First, consider the case ","element":"span"},{"style":{"height":10.79},"width":166.84,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-23.png","element":"img","alt":" p+ = p−","inline":true,"padRight":true},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"98 and ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-24.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"= 1 ","element":"span"},{"style":{"height":17.38},"width":280.96,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-25.png","element":"img","alt":" − 0.982(1 − c −","inline":true,"padRight":true},{"text":"16","element":"span"},{"style":{"height":17.38},"width":174.85,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-26.png","element":"img","alt":"e−8) ≤ 0.","inline":true},{"text":"05. Here we get that for any 0 ","element":"span"},{"style":{"height":13.2},"width":163.22,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-27.png","element":"img","alt":" ≤ ϵ < 0.","inline":true},{"text":"005, ","element":"span"},{"style":{"height":16},"width":186.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-28.png","element":"img","alt":" m1(ϵ, δ) ≤","inline":true,"padRight":true},{"text":"2 whereas ","element":"span"},{"style":{"height":16},"width":180.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-29.png","element":"img","alt":"m2(ϵ, δ) ≥","inline":true,"padRight":true},{"text":"129. 
Next, consider the case where ","element":"span"},{"style":{"height":10.79},"width":147.9,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-30.png","element":"img","alt":" p+ = p−","inline":true,"padRight":true},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"92. It follows that for ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-31.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"16 and any 0 ","element":"span"},{"style":{"height":13.2},"width":148.68,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-32.png","element":"img","alt":" ≤ ϵ < 0.","inline":true},{"text":"02 it holds that ","element":"span"},{"style":{"height":16},"width":181.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-33.png","element":"img","alt":" m1(ϵ, δ) ≤","inline":true,"padRight":true},{"text":"2 and ","element":"span"},{"style":{"height":16},"width":181.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-34.png","element":"img","alt":" m2(ϵ, δ) ≥","inline":true,"padRight":true},{"text":"17. 
In contrast, for sufficiently small ","element":"span"},{"style":{"height":10.79},"width":45.05,"height":26.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-35.png","element":"img","alt":" p+","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10},"width":45.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-36.png","element":"img","alt":"p−","inline":true},{"text":", e.g., in which ","element":"span"},{"style":{"height":14.79},"width":195.22,"height":36.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/9-37.png","element":"img","alt":" p+, p− ≤ 0.","inline":true},{"text":"7, our bound does not guarantee a generalization gap.","element":"span"}]]},{"heading":"7 Experiments on MNIST","paragraphs":[[{"text":"We next demonstrate how our theoretical insights from the XORD problem also manifest when learning a neural net on the MNIST dataset. The network we use for learning is quite similar to the one used for XORD. It is a three-layer network: the first layer is a convolution with 3 ","element":"span"},{"style":{"height":8},"width":31,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/10-0.png","element":"img","alt":" ×","inline":true,"padRight":true},{"text":"3 filters and multiple channels (we vary the number of channels), followed by 2 ","element":"span"},{"style":{"height":8},"width":31,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/10-1.png","element":"img","alt":" ×","inline":true,"padRight":true},{"text":"2 max pooling and then a fully connected layer. We use Adam ","element":"span"},{"href":"#id-38","referenceIndex":11,"text":"(Kingma & Ba, ","element":"a"},{"href":"#id-38","referenceIndex":11,"text":"2014) ","element":"a"},{"text":"for optimization. In the supplementary material we show empirical results for other filter sizes. Further details of the experiments are given there. 
Below we show how our two main theoretical insights for XORD are clearly exhibited in the MNIST data.","element":"span"}],[{"text":"We first check the clustering observation, namely that optimization tends to converge to clusters of similar filters. We train the three-layer network described above with 120 channels on 6000 randomly sampled MNIST images. Then, we normalize each filter of the trained network to have unit norm. We then cluster all 120 9-dimensional vectors into four clusters using k-means. Finally, for each filter we calculate its angle with its closest cluster center. In a second experiment we perform exactly the same procedure, but with a network with randomly initialized weights.","element":"span"}],[{"text":"Fig. ","element":"span"},{"href":"#id-39","text":"4a ","element":"a"},{"text":"shows the results for this experiment. It can be clearly seen that in the trained network, most of the 9-dimensional filters have a relatively small angle with their closest center. Furthermore, the distributions of angles to the closest center are significantly different for trained and random networks. This suggests that there is an inductive bias towards solutions with clustered weights, as predicted by the theory.","element":"span"}],[{"text":"We next examine the effect of exploration, namely the degree to which larger models explore useful regions in weight space. ","element":"span"},{"text":"The observation in our theoretical analysis is that both small and large networks can find weights that attain zero training error. But large networks will find a wider variety of weights, which will also generalize better.","element":"span"}],[{"text":"Here we propose to test this via the following setup: first train a large network. 
Then cluster its weights into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"clusters and use the centers to initialize a smaller network with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"filters. If these ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"filters generalize better than ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"filters learned from random initialization, this would suggest that the larger network indeed explored weight space more effectively.","element":"span"}],[{"text":"To apply this idea to MNIST, we trained an “over-parameterized” 3-layer network with 120 channels. We clustered its filters with k-means into 4 clusters and used the cluster centers as initialization for a small network with 4 channels. Then we trained only the fully connected layer and the bias of the first layer in the small network. In Fig. ","element":"span"},{"href":"#id-39","text":"4b ","element":"a"},{"text":"we show that for various training set sizes, the performance of the small network improves with the new initialization and nearly matches the performance of the over-parameterized network. This suggests that the large network explored better features in the convolutional layer than the smaller one.","element":"span"}]]},{"heading":"8 Conclusions","paragraphs":[[{"text":"In this paper we consider a simplified learning task on binary vectors to study generalization of overparameterized networks. ","element":"span"},{"text":"In this setting, we prove that clustering of weights and exploration of the weight space imply better generalization performance for overparameterized networks. 
We empirically verify our findings on the MNIST task.","element":"span"}],[{"text":"We believe that the approach of studying challenging theoretical problems in deep learning through simplified learning tasks can be fruitful. For future work, it would be interesting to consider more complex tasks, e.g., filters of higher dimension or non-binary data, to better understand overparameterization.","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"This research is supported by the Blavatnik Computer Science Research Fund and by the Yandex Initiative in Machine Learning.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-12","text":"Allen-Zhu, Zeyuan, Li, Yuanzhi, and Liang, Yingyu. Learning and generalization in overparameterized ","element":"span"},{"text":"neural networks, going beyond two layers. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1811.04918","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-7","text":"Brutzkus, Alon and Globerson, Amir. Globally optimal gradient descent for a convnet with Gaussian ","element":"span"},{"text":"inputs. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 605–614, 2017.","element":"span"}],[{"id":"id-16","text":"Brutzkus, Alon, Globerson, Amir, Malach, Eran, and Shalev-Shwartz, Shai. ","element":"span"},{"text":"SGD learns over-parameterized networks that provably generalize on linearly separable data. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-8","text":"Daniely, Amit. SGD learns the conjugate kernel class of the network. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pp. 
2422–2430, 2017.","element":"span"}],[{"id":"id-14","text":"Du, Simon S and Lee, Jason D. ","element":"span"},{"text":"On the power of over-parametrization in neural networks with quadratic activation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1803.01206","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-5","text":"Du, Simon S, Lee, Jason D, and Tian, Yuandong. When is a convolutional filter easy to learn? ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1709.06129","element":"span"},{"text":", 2017a.","element":"span"}],[{"id":"id-6","text":"Du, Simon S, Lee, Jason D, Tian, Yuandong, Poczos, Barnabas, and Singh, Aarti. Gradient descent ","element":"span"},{"text":"learns one-hidden-layer CNN: ","element":"span"},{"text":"Don’t be afraid of spurious local minima. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1712.00779","element":"span"},{"text":", 2017b.","element":"span"}],[{"id":"id-11","text":"Du, Simon S, Lee, Jason D, Li, Haochuan, Wang, Liwei, and Zhai, Xiyu. Gradient descent finds global ","element":"span"},{"text":"minima of deep neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1811.03804","element":"span"},{"text":", 2018a.","element":"span"}],[{"id":"id-10","text":"Du, Simon S, Zhai, Xiyu, Poczos, Barnabas, and Singh, Aarti. Gradient descent provably optimizes ","element":"span"},{"text":"over-parameterized neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1810.02054","element":"span"},{"text":", 2018b.","element":"span"}],[{"text":"Hoffer, Elad, Hubara, Itay, and Soudry, Daniel. Fix your classifier: the marginal value of training the ","element":"span"},{"text":"last weight layer. 2018.","element":"span"}],[{"id":"id-38","text":"Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1412.6980","element":"span"},{"text":", 2014.","element":"span"}],[{"id":"id-9","text":"Li, Yuanzhi and Liang, Yingyu. Learning overparameterized neural networks via stochastic gradient ","element":"span"},{"text":"descent on structured data. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1808.01204","element":"span"},{"text":", 2018.","element":"span"}],[{"text":"Li, Yuanzhi and Yuan, Yang. Convergence analysis of two-layer neural networks with relu activation. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pp. 597–607, 2017.","element":"span"}],[{"id":"id-15","text":"Li, Yuanzhi, Ma, Tengyu, and Zhang, Hongyang. Algorithmic regularization in over-parameterized ","element":"span"},{"text":"matrix recovery. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1712.09203","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-0","text":"Neyshabur, Behnam, Tomioka, Ryota, and Srebro, Nathan. In search of the real inductive bias: On ","element":"span"},{"text":"the role of implicit regularization in deep learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1412.6614","element":"span"},{"text":", 2014.","element":"span"}],[{"id":"id-1","text":"Neyshabur, Behnam, Li, Zhiyuan, Bhojanapalli, Srinadh, LeCun, Yann, and Srebro, Nathan. Towards ","element":"span"},{"text":"understanding the role of over-parametrization in generalization of neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1805.12076","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-2","text":"Novak, Roman, Bahri, Yasaman, Abolafia, Daniel A, Pennington, Jeffrey, and Sohl-Dickstein, ","element":"span"},{"text":"Jascha. 
","element":"span"},{"text":"Sensitivity and generalization in neural networks: an empirical study. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1802.08760","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-33","text":"Shalev-Shwartz, Shai and Ben-David, Shai. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Understanding machine learning: From theory to algorithms","element":"span"},{"text":". Cambridge university press, 2014.","element":"span"}],[{"id":"id-13","text":"Soltanolkotabi, Mahdi, Javanmard, Adel, and Lee, Jason D. Theoretical insights into the optimiza- ","element":"span"},{"text":"tion landscape of over-parameterized shallow neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Information Theory","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-40","text":"Vershynin, Roman. High-dimensional probability. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"An Introduction with Applications","element":"span"},{"text":", 2017.","element":"span"}]]},{"heading":"A Experiment in Figure 1","paragraphs":[[{"text":"We tested the generalization performance in the setup of Section","element":"span"},{"style":{"fontWeight":"bold"},"text":"??","element":"span"},{"text":". We considered networks with number of channels 4,6,8,20,50,100 and 200. 
The distribution in this setting has ","element":"span"},{"style":{"height":10.79},"width":45.05,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-0.png","element":"img","alt":" p+","inline":true,"padRight":true},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"5 and ","element":"span"},{"style":{"height":10},"width":45.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-1.png","element":"img","alt":" p−","inline":true,"padRight":true},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"9 and the training sets are of size 12 (6 positive, 6 negative). Note that in this case the training set contains non-diverse points with high probability. The ground-truth network can be realized by a network with 4 channels. For each number of channels we trained a convolutional network 100 times and averaged the results. In each run we sampled a new training set and a new initialization of the weights according to a Gaussian distribution with mean 0 and standard deviation 0.00001. For each number of channels ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":", we ran gradient descent with learning rate ","element":"span"},{"style":{"height":19.37},"width":57.25,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-2.png","element":"img","alt":"0.04c","inline":true,"padRight":true},{"text":"and stopped it if it did not improve the cost for 20 consecutive iterations or if it reached 30000 iterations. The last iteration was taken for the calculations. We plot both the average test error over all 100 runs and the average test error only over the runs that ended at 0% training error. 
In this case, for each number of channels 4,6,8,20,50,100,200 the number of runs in which gradient descent converged to a 0% train error solution is 62, 79, 94, 100, 100, 100, 100, respectively.","element":"span"}]]},{"heading":"B Proofs for Section 3","paragraphs":[[{"text":"In the XOR problem, we are given a training set ","element":"span"},{"style":{"height":20.4},"width":631.94,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-3.png","element":"img","alt":" S = {(xi, yi)}4i=1 ⊆ {±1}2 × {±1}2","inline":true,"padRight":true},{"text":"consisting of ","element":"span"},{"text":"points ","element":"span"},{"style":{"height":9.59},"width":42.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-4.png","element":"img","alt":" x1","inline":true,"padRight":true},{"text":"= (1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1), ","element":"span"},{"style":{"height":9.59},"width":42.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-5.png","element":"img","alt":" x2","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":14},"width":61.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-6.png","element":"img","alt":"−1,","inline":true,"padRight":true},{"text":"1), ","element":"span"},{"style":{"height":9.59},"width":42.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-7.png","element":"img","alt":" x3","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":14},"width":99.63,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-8.png","element":"img","alt":"−1, −","inline":true},{"text":"1), ","element":"span"},{"style":{"height":9.59},"width":42.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-9.png","element":"img","alt":" 
x4","inline":true,"padRight":true},{"text":"= (1","element":"span"},{"style":{"height":7.6},"width":48.71,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-10.png","element":"img","alt":", −","inline":true},{"text":"1) with labels ","element":"span"},{"style":{"height":10},"width":35.54,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-11.png","element":"img","alt":" y1","inline":true,"padRight":true},{"text":"= 1, ","element":"span"},{"style":{"height":14},"width":210.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-12.png","element":"img","alt":" y2 = −1, y3","inline":true,"padRight":true},{"text":"= 1 and ","element":"span"},{"style":{"height":10},"width":131.77,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-13.png","element":"img","alt":" y4 = −","inline":true},{"text":"1, respectively. Our goal is to learn the XOR function ","element":"span"},{"style":{"height":14.19},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-14.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":": ","element":"span"},{"style":{"height":17.39},"width":271.7,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-15.png","element":"img","alt":" {±1}2 → {±1}","inline":true},{"text":", such that ","element":"span"},{"style":{"height":16},"width":94.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-16.png","element":"img","alt":"f ∗(xi","inline":true},{"text":") = ","element":"span"},{"style":{"height":10},"width":30.54,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-17.png","element":"img","alt":" yi","inline":true,"padRight":true},{"text":"for 1 
","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-18.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4, with a neural network and gradient descent.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Neural Architecture: ","element":"span"},{"text":"For this task we consider the following two-layer fully connected network.","element":"span"}],[{"id":"id-22","style":{"width":"70%"},"width":1241,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-19.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.19},"width":195.06,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-20.png","element":"img","alt":" W ∈ R2k×2","inline":true,"padRight":true},{"text":"is the weight matrix whose rows are the ","element":"span"},{"style":{"height":14.19},"width":69.97,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-21.png","element":"img","alt":" w(i)","inline":true,"padRight":true},{"text":"vectors followed by the ","element":"span"},{"style":{"height":14.19},"width":62.87,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-22.png","element":"img","alt":" u(i)","inline":true,"padRight":true},{"text":"vectors, and ","element":"span"},{"style":{"height":16},"width":62.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-23.png","element":"img","alt":"σ(x","inline":true},{"text":") = max","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", x","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"is the ReLU activation applied element-wise. 
We note that ","element":"span"},{"style":{"height":14.19},"width":39.8,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-24.png","element":"img","alt":" f ∗","inline":true,"padRight":true},{"text":"can be implemented with this network for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2 and this is the minimal ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"for which this is possible. Thus we refer to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k > ","element":"span"},{"text":"2 as the overparameterized case.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training Algorithm: ","element":"span"},{"text":"The parameters of the network ","element":"span"},{"style":{"height":16},"width":109.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-25.png","element":"img","alt":" NW (x","inline":true},{"text":") are learned using gradient descent","element":"span"}],[{"text":"on the hinge loss objective. We use a constant learning rate ","element":"span"},{"style":{"height":18.38},"width":111.64,"height":45.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-26.png","element":"img","alt":" η ≤ cηk","inline":true,"padRight":true},{"text":", where ","element":"span"},{"style":{"height":19.37},"width":113.84,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-27.png","element":"img","alt":" cη < 12","inline":true},{"text":". 
The parameters ","element":"span"},{"style":{"height":13.19},"width":65.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-28.png","element":"img","alt":"NW","inline":true,"padRight":true},{"text":"are initialized as IID Gaussians with zero mean and standard deviation ","element":"span"},{"style":{"height":19.03},"width":190.84,"height":47.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-29.png","element":"img","alt":" σg ≤ cη16k3/2","inline":true,"padRight":true},{"text":". We consider ","element":"span"},{"text":"the hinge-loss objective:","element":"span"}],[{"style":{"width":"35%"},"width":629,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-30.png","element":"img"}],[{"text":"where optimization is only over the first layer of the network. We note that for ","element":"span"},{"style":{"height":13.2},"width":69.7,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-31.png","element":"img","alt":" k ≥","inline":true,"padRight":true},{"text":"2 any global minimum ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-32.png","element":"img","alt":" ℓ","inline":true,"padRight":true},{"text":"satisfies ","element":"span"},{"style":{"height":16},"width":74.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-33.png","element":"img","alt":" ℓ(W","inline":true},{"text":") = 0 and sign(","element":"span"},{"style":{"height":16},"width":121.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-34.png","element":"img","alt":"NW (xi","inline":true},{"text":")) = 
","element":"span"},{"style":{"height":16},"width":94.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-35.png","element":"img","alt":" f ∗(xi","inline":true},{"text":") for 1 ","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/12-36.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Notations: ","element":"span"},{"text":"We will need the following notations. Let ","element":"span"},{"style":{"height":13.19},"width":49.64,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-0.png","element":"img","alt":" Wt","inline":true,"padRight":true},{"text":"be the weight matrix at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"of gradient descent. For 1 ","element":"span"},{"style":{"height":13.2},"width":129.93,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-1.png","element":"img","alt":" ≤ i ≤ k","inline":true},{"text":", denote by ","element":"span"},{"style":{"height":20.6},"width":165.9,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-2.png","element":"img","alt":" w(i)t ∈ R2","inline":true,"padRight":true},{"text":"the ","element":"span"},{"style":{"height":13.38},"width":44.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-3.png","element":"img","alt":" ith","inline":true,"padRight":true},{"text":"weight vector at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
Similarly we define ","element":"span"},{"style":{"height":20.6},"width":158.8,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-4.png","element":"img","alt":" u(i)t ∈ R2","inline":true,"padRight":true},{"text":"to be the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"+","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"weight vector at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". For each point ","element":"span"},{"style":{"height":13.19},"width":114.24,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-5.png","element":"img","alt":" xi ∈ S","inline":true,"padRight":true},{"text":"define the following sets of neurons:","element":"span"}],[{"style":{"width":"27%"},"width":484,"height":355,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-6.png","element":"img"}],[{"text":"and for each iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", let ","element":"span"},{"style":{"height":16},"width":63.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-7.png","element":"img","alt":" ai(t","inline":true},{"text":") be the number of iterations 0 ","element":"span"},{"style":{"height":12.8},"width":134.77,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-8.png","element":"img","alt":" ≤ t′ ≤ t","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":17.32},"width":233.39,"height":43.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-9.png","element":"img","alt":" yiNWt′ (xi) <","inline":true,"padRight":true},{"text":"1.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Overparameterized 
Network","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma B.1. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Exploration at initialization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":13.39},"width":119.24,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-10.png","element":"img","alt":" − 8e−8","inline":true},{"style":{"fontStyle":"italic"},"text":", for all ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":13.6},"width":102.82,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-11.png","element":"img","alt":" ≤ j ≤","inline":true,"padRight":true},{"text":"4","element":"span"}],[{"style":{"width":"40%"},"width":718,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Without loss of generality consider","element":"span"},{"style":{"height":19.96},"width":81.46,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-13.png","element":"img","alt":"|W +0","inline":true,"padRight":true},{"text":"(1)","element":"span"},{"style":{"height":19.96},"width":13,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-14.png","element":"img","alt":"|","inline":true},{"text":". 
Since the sign of a one dimensional Gaussian random variable is a Bernoulli random variable, we get by Hoeffding’s inequality","element":"span"}],[{"style":{"width":"46%"},"width":811,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-15.png","element":"img"}],[{"text":"Since","element":"span"},{"style":{"height":19.96},"width":81.46,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-16.png","element":"img","alt":"|W +0","inline":true,"padRight":true},{"text":"(1)","element":"span"},{"style":{"height":19.96},"width":125.73,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-17.png","element":"img","alt":"| + |W +0","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.96},"width":87.42,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-18.png","element":"img","alt":"| = k","inline":true,"padRight":true},{"text":"with probability 1, we get that if","element":"span"},{"style":{"height":19.96},"width":94.74,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-19.png","element":"img","alt":"| |W +0","inline":true,"padRight":true},{"text":"(1)","element":"span"},{"style":{"height":20},"width":229.72,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-20.png","element":"img","alt":"| − k/2| < 2√k","inline":true,"padRight":true},{"text":"then","element":"span"},{"style":{"height":19.96},"width":94.74,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-21.png","element":"img","alt":"| |W +0","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.96},"width":144.52,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-22.png","element":"img","alt":"| − k/2| 
<","inline":true,"padRight":true},{"text":"2","element":"span"},{"style":{"height":16},"width":54.21,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-23.png","element":"img","alt":"√k","inline":true},{"text":". The result now follows by symmetry and the union bound.","element":"span"}],[{"style":{"width":"99%"},"width":1753,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-24.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"be a random variable distributed as ","element":"span"},{"style":{"height":17.38},"width":131.9,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-25.png","element":"img","alt":" N(0, σ2","inline":true},{"text":"). Then by Proposition 2.1.2 in ","element":"span"},{"href":"#id-40","referenceIndex":20,"text":"Vershynin ","element":"a"},{"href":"#id-40","referenceIndex":20,"text":"(2017)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"24%"},"width":422,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-26.png","element":"img"}],[{"text":"Therefore, for all 1 ","element":"span"},{"style":{"height":14},"width":134.9,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-27.png","element":"img","alt":" ≤ j ≤ k","inline":true,"padRight":true},{"text":"and 1 ","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-28.png","element":"img","alt":" ≤ i 
≤","inline":true,"padRight":true},{"text":"4,","element":"span"}],[{"style":{"width":"36%"},"width":643,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-29.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"36%"},"width":635,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-30.png","element":"img"}],[{"text":"The result follows by applying a union bound over all 2","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"weight vectors and the four points ","element":"span"},{"style":{"height":9.59},"width":37.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-31.png","element":"img","alt":" xi","inline":true},{"text":", 1 ","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-32.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4.","element":"span"}],[{"id":"id-41","style":{"fontWeight":"bold"},"text":"Lemma B.3. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Clustering Dynamics. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Lemma ","element":"span"},{"href":"#id-18","style":{"fontStyle":"italic","fontWeight":"bold"},"text":"3.2 ","element":"a"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"restated and extended. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability","element":"span"}],[{"style":{"width":"93%"},"width":1641,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/13-33.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"1. 
For ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-0.png","element":"img","alt":" i ∈ {1, 3}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-1.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", it holds that ","element":"span"},{"style":{"height":20.93},"width":523.92,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-2.png","element":"img","alt":" w(j)t = w(j)0 + ai(t)ηxi + αix2","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"2. For ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-3.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":18.27},"width":123.95,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-4.png","element":"img","alt":" j ∈ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", it holds that ","element":"span"},{"style":{"height":20.93},"width":509.7,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-5.png","element":"img","alt":" u(j)t = u(j)0 + ai(t)ηxi + 
αix1","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"width":"99%"},"width":1756,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-6.png","element":"img"}],[{"text":"cases follow by a symmetry. The proof is by induction. Assume that ","element":"span"},{"style":{"height":17.94},"width":147.87,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-7.png","element":"img","alt":" j ∈ W +t","inline":true,"padRight":true},{"text":"(1). For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0 the ","element":"span"},{"text":"claim holds with ","element":"span"},{"style":{"height":16.94},"width":41.49,"height":42.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-8.png","element":"img","alt":" αt1","inline":true,"padRight":true},{"text":"= 0. For a point (","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", y","element":"span"},{"text":") let ","element":"span"},{"style":{"height":11.2},"width":88.26,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-9.png","element":"img","alt":" ℓ(x,y)","inline":true,"padRight":true},{"text":"= max","element":"span"},{"style":{"height":16},"width":178.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-10.png","element":"img","alt":"{1 − yNW","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". 
Then it holds that","element":"span"}],[{"style":{"height":18.14},"width":157.61,"height":45.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-11.png","element":"img","alt":"∂w(i) (W","inline":true},{"text":") = ","element":"span"},{"style":{"height":21.62},"width":461.6,"height":54.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-12.png","element":"img","alt":" −yσ′(w(i) · x)x1yNW (x)<1","inline":true},{"text":". Assume without loss of generality that ","element":"span"},{"style":{"height":16.94},"width":92.68,"height":42.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-13.png","element":"img","alt":" αt1 >","inline":true,"padRight":true},{"text":"0. Define ","element":"span"},{"style":{"height":21.62},"width":275.69,"height":54.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-14.png","element":"img","alt":"β1 = 1NW (x1)<1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.62},"width":300.6,"height":54.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-15.png","element":"img","alt":" β2 = 1NW (x2)>−1","inline":true},{"text":". 
Using these notations, we have","element":"span"}],[{"style":{"width":"44%"},"width":775,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-16.png","element":"img"}],[{"text":"and for any values of ","element":"span"},{"style":{"height":16},"width":244.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-17.png","element":"img","alt":" β1, β2 ∈ {0, 1}","inline":true,"padRight":true},{"text":"the induction step follows.","element":"span"}],[{"style":{"width":"1%"},"width":28,"height":13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-18.png","element":"img"}],[{"text":"For each point ","element":"span"},{"style":{"height":9.59},"width":37.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-19.png","element":"img","alt":" xi","inline":true},{"text":", define the following sums:","element":"span"}],[{"style":{"width":"28%"},"width":503,"height":548,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-20.png","element":"img"}],[{"text":"We will prove the following lemma regarding ","element":"span"},{"style":{"height":17.94},"width":51.73,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-21.png","element":"img","alt":" S+t","inline":true,"padRight":true},{"text":"(1)","element":"span"},{"style":{"height":14.72},"width":69.44,"height":36.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-22.png","element":"img","alt":", S−t","inline":true,"padRight":true},{"text":"(1)","element":"span"},{"style":{"height":17.94},"width":73.28,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-23.png","element":"img","alt":", 
R+t","inline":true,"padRight":true},{"text":"(1)","element":"span"},{"style":{"height":14.72},"width":73.28,"height":36.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-24.png","element":"img","alt":", R−t","inline":true,"padRight":true},{"text":"(1) for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1. By symmetry, ","element":"span"},{"text":"analogous lemmas follow for ","element":"span"},{"style":{"height":15.2},"width":24.8,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-25.png","element":"img","alt":" i ̸","inline":true},{"text":"= 1.","element":"span"}],[{"id":"id-42","style":{"width":"63%"},"width":1118,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-26.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"1. For all ","element":"span"},{"style":{"height":12.8},"width":56.46,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-27.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":17.94},"width":55.56,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-28.png","element":"img","alt":" R+t","inline":true,"padRight":true},{"text":"(1) + ","element":"span"},{"style":{"height":14.72},"width":55.56,"height":36.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-29.png","element":"img","alt":" R−t","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":14.4},"width":84.06,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-30.png","element":"img","alt":" ≤ kη","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"2. 
Let ","element":"span"},{"style":{"height":12.8},"width":56.46,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-31.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"then ","element":"span"},{"style":{"height":14.72},"width":51.73,"height":36.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-32.png","element":"img","alt":" S−t","inline":true,"padRight":true},{"text":"(1) = 0","element":"span"},{"style":{"fontStyle":"italic"},"text":". Furthermore, if ","element":"span"},{"style":{"height":16},"width":246.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-33.png","element":"img","alt":" −yNWt(x1) <","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":19.87},"width":76.94,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-34.png","element":"img","alt":" S+t+1","inline":true},{"text":"(1) ","element":"span"},{"style":{"height":17.94},"width":93.8,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-35.png","element":"img","alt":" ≥ S+t","inline":true,"padRight":true},{"text":"(1) +","element":"span"},{"style":{"height":19.96},"width":81.45,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-36.png","element":"img","alt":"|W +0","inline":true,"padRight":true},{"text":"(1)","element":"span"},{"style":{"height":19.96},"width":39.92,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-37.png","element":"img","alt":"| η","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Otherwise, if ","element":"span"},{"style":{"height":16},"width":246.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-38.png","element":"img","alt":" −yNWt(x1) ≥","inline":true,"padRight":true},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"then ","element":"span"},{"style":{"height":19.87},"width":76.94,"height":49.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-39.png","element":"img","alt":" S+t+1","inline":true},{"text":"(1) = ","element":"span"},{"style":{"height":17.94},"width":51.73,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-40.png","element":"img","alt":" S+t","inline":true,"padRight":true},{"text":"(1)","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"1. Assume by contradiction that there exists ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > ","element":"span"},{"text":"0, such that ","element":"span"},{"style":{"height":17.94},"width":55.56,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-41.png","element":"img","alt":" R+t","inline":true,"padRight":true},{"text":"(1) + ","element":"span"},{"style":{"height":14.72},"width":55.57,"height":36.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-42.png","element":"img","alt":" R−t","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":14.4},"width":90.69,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-43.png","element":"img","alt":" > kη","inline":true},{"text":". 
It ","element":"span"},{"text":"follows that, without loss of generality, there exists ","element":"span"},{"style":{"height":17.94},"width":139.44,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-44.png","element":"img","alt":" j ∈ U +t","inline":true,"padRight":true},{"text":"(1) such that ","element":"span"},{"style":{"height":28.8},"width":308.82,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-45.png","element":"img","alt":" σ(u(j)t · x1) > η","inline":true},{"text":". However, this contradicts Lemma ","element":"span"},{"href":"#id-41","text":"B.3.","element":"a"}],[{"text":"2. All of the claims are direct consequences of Lemma ","element":"span"},{"href":"#id-41","text":"B.3.","element":"a"}],[{"id":"id-44","style":{"width":"100%"},"width":1757,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-46.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"fontStyle":"italic"},"text":"there were at least ","element":"span"},{"style":{"height":25.45},"width":155.3,"height":63.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-47.png","element":"img","alt":" l ≥ 4√k√k−2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"iterations, in which ","element":"span"},{"style":{"height":16},"width":241.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-48.png","element":"img","alt":" −yNWt(xi) <","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", then it holds that ","element":"span"},{"style":{"height":16},"width":241.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-49.png","element":"img","alt":" −yNWt(xi) ≥","inline":true,"padRight":true},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for all 
","element":"span"},{"style":{"height":13.2},"width":96.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/14-50.png","element":"img","alt":" t ≥ T","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Without loss of generality assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1. By Lemma ","element":"span"},{"href":"#id-42","text":"B.4 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-43","text":"E.3, ","element":"a"},{"text":"with probability","element":"span"}],[{"style":{"height":25.45},"width":225.12,"height":63.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-0.png","element":"img","alt":"√2k√πe8k − 8e−8","inline":true},{"text":", if ","element":"span"},{"style":{"height":16},"width":246.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-1.png","element":"img","alt":" −yNWt(x1) <","inline":true,"padRight":true},{"text":"1 then ","element":"span"},{"style":{"height":19.87},"width":76.94,"height":49.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-2.png","element":"img","alt":" S+t+1","inline":true},{"text":"(1) ","element":"span"},{"style":{"height":17.94},"width":94.21,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-3.png","element":"img","alt":" ≥ S+t","inline":true,"padRight":true},{"text":"(1) +","element":"span"},{"href":"#id-42","style":{"height":28.8},"width":225.26,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-4.png","element":"img","alt":"(k/2 − 2√k)η","inline":true},{"text":". 
Therefore, by Lemma ","element":"span"},{"href":"#id-42","text":"B.4, ","element":"a"},{"text":"for all ","element":"span"},{"style":{"height":13.2},"width":96.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-5.png","element":"img","alt":" t ≥ T","inline":true}],[{"style":{"width":"43%"},"width":772,"height":216,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-6.png","element":"img"}],[{"text":"where the last inequality follows by the assumption on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem B.6. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Convergence and clustering. Theorem ","element":"span"},{"href":"#id-19","style":{"fontStyle":"italic","fontWeight":"bold"},"text":"3.3 ","element":"a"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"restated. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k > ","element":"span"},{"text":"16","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"height":25.45},"width":226.25,"height":63.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-7.png","element":"img","alt":"√2k√πe8k − 8e−8","inline":true},{"style":{"fontStyle":"italic"},"text":", after at most ","element":"span"},{"style":{"height":25.45},"width":175.08,"height":63.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-8.png","element":"img","alt":" T ≤ 16√k√k−2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"iterations, gradient descent converges ","element":"span"},{"style":{"fontStyle":"italic"},"text":"to a global minimum. 
Furthermore, for ","element":"span"},{"style":{"height":16},"width":163.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-9.png","element":"img","alt":" i ∈ {1, 3}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and all ","element":"span"},{"style":{"height":18.27},"width":138.8,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-10.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", the angle between ","element":"span"},{"style":{"height":21.36},"width":73.5,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-11.png","element":"img","alt":" w(j)T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":9.59},"width":37.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-12.png","element":"img","alt":" xi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is at most ","element":"span"},{"text":"arccos","element":"span"},{"style":{"height":28.8},"width":145.81,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-13.png","element":"img","alt":"((1−2cη)/(1+cη))","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Similarly, for ","element":"span"},{"style":{"height":16},"width":161.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-14.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and all ","element":"span"},{"style":{"height":18.27},"width":125.81,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-15.png","element":"img","alt":" j ∈ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", the angle between ","element":"span"},{"style":{"height":21.37},"width":66.39,"height":53.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-16.png","element":"img","alt":" u(j)T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":9.59},"width":37.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-17.png","element":"img","alt":" xi","inline":true}],[{"style":{"fontStyle":"italic"},"text":"is at most ","element":"span"},{"text":"arccos","element":"span"},{"style":{"height":28.8},"width":114.5,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-18.png","element":"img","alt":"((1−2cη)/(1+cη))","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Proposition ","element":"span"},{"href":"#id-44","text":"B.5 ","element":"a"},{"text":"implies that there are at most ","element":"span"},{"style":{"height":25.45},"width":84.7,"height":63.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-19.png","element":"img","alt":"16√k√k−2","inline":true,"padRight":true},{"text":"iterations in which there exists (","element":"span"},{"style":{"height":10.4},"width":87.78,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-20.png","element":"img","alt":"xi, yi","inline":true},{"text":") ","element":"span"},{"text":"such that ","element":"span"},{"style":{"height":16},"width":222.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-21.png","element":"img","alt":" yiNWt(xi) <","inline":true,"padRight":true},{"text":"1. After at most that many iterations, gradient descent converges to a global minimum.","element":"span"}],[{"text":"Without loss of generality, we prove the clustering claim for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1 and all ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-22.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(1). At a global ","element":"span"},{"text":"minimum, ","element":"span"},{"style":{"height":16},"width":203.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-23.png","element":"img","alt":" NWT (x1) ≥","inline":true,"padRight":true},{"text":"1. 
Therefore, by Lemma ","element":"span"},{"href":"#id-41","text":"B.3 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-42","text":"B.4 ","element":"a"},{"text":"it follows that","element":"span"}],[{"style":{"width":"35%"},"width":615,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-24.png","element":"img"}],[{"text":"and thus ","element":"span"},{"style":{"height":22.57},"width":242.18,"height":56.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-25.png","element":"img","alt":" ai(T) ≥ 12cη −","inline":true,"padRight":true},{"text":"1. Therefore, for any ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-26.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(1), the cosine of the angle between ","element":"span"},{"style":{"height":21.36},"width":73.5,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-27.png","element":"img","alt":" w(j)T","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.59},"width":42.26,"height":23.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-28.png","element":"img","alt":" x1","inline":true,"padRight":true},{"text":"is at least","element":"span"}],[{"style":{"width":"55%"},"width":980,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-29.png","element":"img"}],[{"text":"where we used the triangle inequality and Lemma ","element":"span"},{"href":"#id-41","text":"B.3. ","element":"a"},{"text":"The claim follows.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Small Network","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma B.7. 
","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Non-exploration at initialization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"75","element":"span"},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":11.2},"width":54.23,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-30.png","element":"img","alt":" i ∈","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":18.27},"width":68.16,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-31.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-32.png","element":"img","alt":" ∅","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"or ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-33.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":18.27},"width":56.56,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-34.png","element":"img","alt":" U 
+0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-35.png","element":"img","alt":" ∅","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Since the sign of a one dimensional Gaussian random variable is a Bernoulli random variable, the probability that ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-36.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.8},"width":102.36,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-37.png","element":"img","alt":"i) ̸= ∅","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-38.png","element":"img","alt":" i ∈ {1, 3}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.27},"width":56.55,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-39.png","element":"img","alt":" U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.8},"width":102.36,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-40.png","element":"img","alt":"i) ̸= ∅","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-41.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"text":"is 
","element":"span"},{"style":{"height":19.37},"width":16,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-42.png","element":"img","alt":" 14","inline":true},{"text":". The claim follows.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem B.8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2","element":"span"},{"style":{"fontStyle":"italic"},"text":". With probability ","element":"span"},{"style":{"height":13.2},"width":76.93,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-43.png","element":"img","alt":" ≥ 0.","inline":true},{"text":"75","element":"span"},{"style":{"fontStyle":"italic"},"text":", gradient descent converges to a local minimum.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"As in the proof of Theorem ","element":"span"},{"href":"#id-19","text":"3.3, ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":16},"width":163.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-44.png","element":"img","alt":" i ∈ {1, 3}","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-45.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.8},"width":106.39,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-46.png","element":"img","alt":"i) ̸= ∅","inline":true},{"text":", then eventually, ","element":"span"},{"style":{"height":16},"width":224.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-47.png","element":"img","alt":" yiNWt(xi) ≥","inline":true,"padRight":true},{"text":"1. 
","element":"span"},{"text":"Similarly, for ","element":"span"},{"style":{"height":16},"width":169.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-48.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":18.27},"width":56.55,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-49.png","element":"img","alt":" U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.8},"width":112.11,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-50.png","element":"img","alt":"i) ̸= ∅","inline":true},{"text":", then eventually, ","element":"span"},{"style":{"height":16},"width":227.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-51.png","element":"img","alt":" yiNWt(xi) ≥","inline":true,"padRight":true},{"text":"1. However, if without loss of ","element":"span"},{"text":"generality ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-52.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(1) = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-53.png","element":"img","alt":" ∅","inline":true},{"text":", then for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"48%"},"width":848,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-54.png","element":"img"}],[{"text":"Furthermore, there exists the first iteration ","element":"span"},{"style":{"height":10},"width":28.39,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-55.png","element":"img","alt":" 
t′","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":17.32},"width":234.39,"height":43.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-56.png","element":"img","alt":" yiNWt′ (xi) ≥","inline":true,"padRight":true},{"text":"1 for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 3 (since ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-57.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(3) ","element":"span"},{"style":{"height":16.4},"width":63.07,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-58.png","element":"img","alt":" ̸= ∅","inline":true},{"text":") ","element":"span"},{"text":"and any ","element":"span"},{"style":{"height":16},"width":171.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-59.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":18.27},"width":56.56,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-60.png","element":"img","alt":" U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.8},"width":113.6,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-61.png","element":"img","alt":"i) ̸= ∅","inline":true},{"text":". 
Then, in iteration ","element":"span"},{"style":{"height":10},"width":28.39,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-62.png","element":"img","alt":" t′","inline":true,"padRight":true},{"text":"+ 1 for all 1 ","element":"span"},{"style":{"height":13.6},"width":114.07,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-63.png","element":"img","alt":" ≤ j ≤","inline":true,"padRight":true},{"text":"2 it holds that ","element":"span"},{"style":{"height":22.96},"width":175.04,"height":57.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-64.png","element":"img","alt":"u(j)t′+1xi <","inline":true,"padRight":true},{"text":"0 and ","element":"span"},{"style":{"height":22.96},"width":182.14,"height":57.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-65.png","element":"img","alt":" w(j)t′+1xi <","inline":true,"padRight":true},{"text":"0 for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1 or ","element":"span"},{"style":{"height":16},"width":162.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-66.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":18.27},"width":56.56,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-67.png","element":"img","alt":" U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-68.png","element":"img","alt":" ∅","inline":true},{"text":". 
Therefore at ","element":"span"},{"style":{"height":10},"width":28.39,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/15-69.png","element":"img","alt":" t′","inline":true,"padRight":true},{"text":"+ 1 we are ","element":"span"},{"text":"at a local minimum.","element":"span"}],[{"id":"id-46","style":{"width":"35%"},"width":623,"height":539,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-0.png","element":"img"}],[{"text":"Figure 5: ","element":"figcaption","subtype":"caption"},{"text":"Higher confidence of hinge-loss results in better performance in the XORD problem.","element":"figcaption","subtype":"caption"}]]},{"heading":"C Proofs and Experiments for Section 4","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"C.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"VC Dimension","element":"span"}],[{"text":"As noted in Remark ","element":"span"},{"href":"#id-45","text":"4.1, ","element":"a"},{"text":"the VC dimension of the model we consider is at most 15. To see this, we first define for any ","element":"span"},{"style":{"height":17.39},"width":202.63,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-1.png","element":"img","alt":" z ∈ {±1}2d","inline":true,"padRight":true},{"text":"the set ","element":"span"},{"style":{"height":17.39},"width":212.69,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-2.png","element":"img","alt":" Pz ⊆ {±1}2","inline":true,"padRight":true},{"text":"which contains all the distinct two dimensional binary patterns that ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"z ","element":"span"},{"text":"has. 
For example, for a positive diverse point ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"z ","element":"span"},{"text":"it holds that ","element":"span"},{"style":{"height":17.39},"width":215.47,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-3.png","element":"img","alt":" Pz = {±1}2","inline":true},{"text":". Now, for any points ","element":"span"},{"style":{"height":18.18},"width":323.24,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-4.png","element":"img","alt":" z(1), z(2) ∈ {±1}2d","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":14.48},"width":216.3,"height":36.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-5.png","element":"img","alt":" Pz(1) = Pz(2)","inline":true,"padRight":true},{"text":"and for any filter ","element":"span"},{"style":{"height":14.18},"width":127.73,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-6.png","element":"img","alt":" w ∈ R2","inline":true,"padRight":true},{"text":"it holds that max","element":"span"},{"style":{"height":28.8},"width":231.68,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-7.png","element":"img","alt":"j σ�w · z(1)j �","inline":true},{"text":"= max","element":"span"},{"style":{"height":28.8},"width":231.68,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-8.png","element":"img","alt":"j σ�w · z(2)j �","inline":true},{"text":". 
Therefore, for any ","element":"span"},{"style":{"height":18.18},"width":219.21,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-9.png","element":"img","alt":" W, NW (z(1)","inline":true},{"text":") = ","element":"span"},{"style":{"height":18.18},"width":147.91,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-10.png","element":"img","alt":" NW (z(2)","inline":true},{"text":"). Specifically, this implies that if both ","element":"span"},{"style":{"height":14.19},"width":64.14,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-11.png","element":"img","alt":" z(1)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.19},"width":64.14,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-12.png","element":"img","alt":" z(2)","inline":true,"padRight":true},{"text":"are diverse then ","element":"span"},{"style":{"height":18.19},"width":147.91,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-13.png","element":"img","alt":" NW (z(1)","inline":true},{"text":") = ","element":"span"},{"style":{"height":18.19},"width":147.92,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-14.png","element":"img","alt":" NW (z(2)","inline":true},{"text":"). Since there are 15 non-empty subsets of ","element":"span"},{"style":{"height":17.38},"width":106.77,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-15.png","element":"img","alt":" {±1}2","inline":true},{"text":", it follows that for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"the network can shatter a set of at most 15 points, or equivalently, its VC dimension is at most 15. 
Despite these limitations on expressive power, there is a generalization gap between small and large networks in this setting, as can be seen in Figure ","element":"span"},{"href":"#id-4","text":"1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"C.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Hinge Loss Confidence","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-46","text":"5 ","element":"a"},{"text":"shows that setting ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-16.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"= 5 gives better performance than setting ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-17.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"= 1 in the XORD problem. The setting is similar to that of Section ","element":"span"},{"text":"A. ","element":"span"},{"text":"Each point is the average test error over 100 runs.","element":"span"}]]},{"heading":"D Experiments for Section 5","paragraphs":[[{"text":"Here we show an example of a training set that contains a non-diverse negative point. In total, the training set has 6 positive points and 6 negative points. We implemented the setting of Section ","element":"span"},{"text":"4 ","element":"span"},{"text":"and ran gradient descent on this training set. In Figure ","element":"span"},{"href":"#id-47","text":"6 ","element":"a"},{"text":"we show the results. The large network recovers ","element":"span"},{"style":{"height":14.19},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-18.png","element":"img","alt":"f ∗","inline":true},{"text":", while the small network does not. 
This is despite the fact that both networks achieve zero training error.","element":"span"}]]},{"heading":"E Proof of Theorem 6.3","paragraphs":[[{"text":"We first restate the theorem.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem E.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Theorem ","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic","fontWeight":"bold"},"text":"6.3 ","element":"a"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"restated and extended.) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least","element":"span"},{"style":{"height":19.2},"width":144,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-19.png","element":"img","alt":"(1 − c −","inline":true,"padRight":true},{"text":"16","element":"span"},{"style":{"height":19.2},"width":79.34,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-20.png","element":"img","alt":"e−8)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"after running gradient descent for ","element":"span"},{"style":{"height":25.18},"width":273.88,"height":62.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/16-21.png","element":"img","alt":" T ≥ 28(γ+1+8cη)/(cη)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"iterations, it converges to a global minimum which","element":"span"}],[{"id":"id-47","style":{"width":"88%"},"width":1558,"height":347,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-0.png","element":"img"}],[{"text":"Figure 6: ","element":"figcaption","subtype":"caption"},{"text":"Overparameterization and generalization in the XORD problem. 
The vectors in blue are the vectors ","element":"figcaption","subtype":"caption"},{"style":{"height":17.89},"width":65.88,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-1.png","element":"img","alt":"w(i)t","inline":true,"padRight":true},{"text":"and in red are the vectors ","element":"figcaption","subtype":"caption"},{"style":{"height":17.89},"width":59.43,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-2.png","element":"img","alt":" u(i)t ","inline":true,"padRight":true},{"text":". (a) Exploration at initialization (t=0) for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"= 100. (b) Clustering and ","element":"figcaption","subtype":"caption"},{"text":"convergence to a global minimum that recovers ","element":"figcaption","subtype":"caption"},{"style":{"height":13.29},"width":131.25,"height":33.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-3.png","element":"img","alt":" f ∗ for k","inline":true,"padRight":true},{"text":"= 100. (c) Insufficient exploration at initialization (t=0) for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"= 2. 
(d) Convergence to global minimum with non-zero test error for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"= 2.","element":"figcaption","subtype":"caption"}],[{"style":{"fontStyle":"italic"},"text":"satisfies ","element":"span"},{"text":"sign (","element":"span"},{"style":{"height":16},"width":127.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-4.png","element":"img","alt":"NWT (x","inline":true},{"text":")) = ","element":"span"},{"style":{"height":16},"width":83.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-5.png","element":"img","alt":" f ∗(x","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":17.38},"width":202.76,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-6.png","element":"img","alt":" x ∈ {±1}2d","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Furthermore, for ","element":"span"},{"style":{"height":16},"width":164.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-7.png","element":"img","alt":" i ∈ {1, 3}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and all ","element":"span"},{"style":{"height":18.27},"width":139.71,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-8.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the angle between ","element":"span"},{"style":{"height":21.36},"width":73.5,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-9.png","element":"img","alt":" w(j)T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":11.1},"width":34.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-10.png","element":"img","alt":" pi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is at most ","element":"span"},{"text":"arccos","element":"span"},{"style":{"height":28.8},"width":157.86,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-11.png","element":"img","alt":"((γ−1−2cη)/(γ−1+cη))","inline":true}],[{"text":"We first need some notation. 
Define ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-12.png","element":"img","alt":" p1","inline":true,"padRight":true},{"text":"= (1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1)","element":"span"},{"style":{"height":10.4},"width":59.97,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-13.png","element":"img","alt":", x2","inline":true,"padRight":true},{"text":"= (1","element":"span"},{"style":{"height":16},"width":141.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-14.png","element":"img","alt":", −1), p3","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":16},"width":192.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-15.png","element":"img","alt":"−1, −1), p4","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":14},"width":61.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-16.png","element":"img","alt":"−1,","inline":true,"padRight":true},{"text":"1) and the following sets:","element":"span"}],[{"style":{"width":"72%"},"width":1282,"height":261,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-17.png","element":"img"}],[{"text":"We can use these definitions to express the gradient updates more easily. 
Concretely, let ","element":"span"},{"style":{"height":13.6},"width":64.42,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-18.png","element":"img","alt":" j ∈","inline":true},{"style":{"height":17.94},"width":68.17,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-19.png","element":"img","alt":"W +t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16},"width":159.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-20.png","element":"img","alt":"i1) ∩ W −t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":12.79},"width":29.73,"height":31.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-21.png","element":"img","alt":"i2","inline":true},{"text":") then the gradient update is given as follows:","element":"span"},{"style":{"height":7.6},"width":31.9,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-22.png","element":"img","alt":"10","inline":true}],[{"id":"id-57","style":{"width":"99%"},"width":1752,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-23.png","element":"img"}],[{"text":"We denote by ","element":"span"},{"style":{"height":12.99},"width":51.26,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-24.png","element":"img","alt":" x+","inline":true,"padRight":true},{"text":"a positive diverse point and ","element":"span"},{"style":{"height":8.99},"width":51.26,"height":22.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-25.png","element":"img","alt":" x−","inline":true,"padRight":true},{"text":"a negative diverse point. 
Define the following sums for ","element":"span"},{"style":{"height":16},"width":192.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-26.png","element":"img","alt":" φ ∈ {+, −}","inline":true},{"text":":","element":"span"}],[{"style":{"width":"80%"},"width":1407,"height":596,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/17-27.png","element":"img"}],[{"style":{"width":"89%"},"width":1581,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-0.png","element":"img"}],[{"text":"Without loss of generality, we can assume that the training set consists of one positive diverse point ","element":"span"},{"style":{"height":12.99},"width":51.26,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-1.png","element":"img","alt":" x+","inline":true,"padRight":true},{"text":"and one negative diverse point ","element":"span"},{"style":{"height":8.99},"width":51.26,"height":22.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-2.png","element":"img","alt":" x−","inline":true},{"text":". This follows since the network and its gradient have the same value for two different positive diverse points and two different negative points. Therefore, this holds for the loss function defined in Eq. 
","element":"span"},{"href":"#id-29","text":"4 ","element":"a"},{"text":"as well.","element":"span"}],[{"text":"We let ","element":"span"},{"style":{"height":16.98},"width":77.02,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-3.png","element":"img","alt":" a+(t","inline":true},{"text":") be the number of iterations 0 ","element":"span"},{"style":{"height":12.8},"width":134.77,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-4.png","element":"img","alt":" ≤ t′ ≤ t","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":18.31},"width":246.85,"height":45.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-5.png","element":"img","alt":" NWt′ (x+) < γ","inline":true},{"text":".","element":"span"}],[{"text":"We will now proceed to prove the theorem. In Section ","element":"span"},{"href":"#id-48","text":"E.0.1 ","element":"a"},{"text":"we prove results on the filters at initialization. In Section ","element":"span"},{"href":"#id-49","text":"E.0.2 ","element":"a"},{"text":"we prove several lemmas that exhibit the clustering dynamics. 
In Section ","element":"span"},{"href":"#id-50","text":"E.0.3 ","element":"a"},{"text":"we prove upper bounds on ","element":"span"},{"style":{"height":14.72},"width":51.73,"height":36.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-6.png","element":"img","alt":" S−t","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":17.94},"width":56.12,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-7.png","element":"img","alt":" P +t","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.72},"width":56.12,"height":36.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-8.png","element":"img","alt":" P −t","inline":true,"padRight":true},{"text":"for all iterations ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". In Section ","element":"span"},{"href":"#id-51","text":"E.0.4 ","element":"a"},{"text":"we characterize the dynamics of ","element":"span"},{"style":{"height":17.94},"width":51.73,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-9.png","element":"img","alt":" S+t","inline":true,"padRight":true},{"text":"and in Section ","element":"span"},{"href":"#id-52","text":"E.0.5 ","element":"a"},{"text":"we prove an upper bound on it together with ","element":"span"},{"text":"upper bounds on ","element":"span"},{"style":{"height":16.98},"width":143.96,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-10.png","element":"img","alt":" NWt(x+","inline":true},{"text":") and ","element":"span"},{"style":{"height":16},"width":174.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-11.png","element":"img","alt":" −NWt(x−","inline":true},{"text":") for all iterations ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":".","element":"span"}],[{"text":"We 
provide an optimization guarantee for gradient descent in Section ","element":"span"},{"href":"#id-53","text":"E.0.6. ","element":"a"},{"text":"We prove generalization guarantees for the points in the positive class and negative class in Section ","element":"span"},{"href":"#id-54","text":"E.0.7 ","element":"a"},{"text":"and Section ","element":"span"},{"href":"#id-55","text":"E.0.8, ","element":"a"},{"text":"respectively. We complete the proof of the theorem in Section ","element":"span"},{"href":"#id-56","text":"E.0.9 ","element":"a"},{"text":"with proofs for the clustering effect at the global minimum.","element":"span"}],[{"id":"id-48","style":{"fontWeight":"bold"},"text":"E.0.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Initialization Guarantees","element":"span"}],[{"id":"id-66","style":{"width":"99%"},"width":1754,"height":340,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Without loss of generality, consider ","element":"span"},{"style":{"height":19.96},"width":81.46,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-13.png","element":"img","alt":"|W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":103.59,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-14.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.96},"width":13,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-15.png","element":"img","alt":"|","inline":true},{"text":". 
Since ","element":"span"},{"style":{"height":19.51},"width":183.17,"height":48.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-16.png","element":"img","alt":" P[j ∈ W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":103.6,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-17.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.2},"width":17,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-18.png","element":"img","alt":"]","inline":true},{"text":"= ","element":"span"},{"style":{"height":19.37},"width":16,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-19.png","element":"img","alt":"1/2","inline":true},{"text":", we ","element":"span"},{"text":"get by Hoeffding’s inequality","element":"span"}],[{"style":{"width":"54%"},"width":959,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-20.png","element":"img"}],[{"text":"The result now follows by the union bound.","element":"span"}],[{"id":"id-43","style":{"width":"99%"},"width":1753,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-21.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"be a random variable distributed as ","element":"span"},{"style":{"height":17.39},"width":131.9,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-22.png","element":"img","alt":" N(0, σ2","inline":true},{"text":"). 
Then by Proposition 2.1.2 in ","element":"span"},{"href":"#id-40","referenceIndex":20,"text":"Vershynin ","element":"a"},{"href":"#id-40","referenceIndex":20,"text":"(2017)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"24%"},"width":422,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-23.png","element":"img"}],[{"text":"Therefore, for all 1 ","element":"span"},{"style":{"height":14},"width":134.9,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-24.png","element":"img","alt":" ≤ j ≤ k","inline":true,"padRight":true},{"text":"and 1 ","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-25.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4,","element":"span"}],[{"style":{"width":"36%"},"width":641,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-26.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"36%"},"width":633,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-27.png","element":"img"}],[{"text":"The result follows by applying a union bound over all 2","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"weight vectors and the four points ","element":"span"},{"style":{"height":11.1},"width":34.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-28.png","element":"img","alt":" pi","inline":true},{"text":", 1 ","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/18-29.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4.","element":"span"}],[{"text":"From now on we assume that the highly probable event in Lemma ","element":"span"},{"href":"#id-43","text":"E.3 
","element":"a"},{"text":"holds.","element":"span"}],[{"id":"id-58","style":{"height":16.98},"width":1102.37,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-0.png","element":"img","alt":"Lemma E.4. NWt(x+) < 1 and −NWt(x−) < 1 for 0 ≤ t ≤ 2.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"By Lemma ","element":"span"},{"href":"#id-43","text":"E.3 ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"96%"},"width":1691,"height":219,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-1.png","element":"img"}],[{"text":"and similarly ","element":"span"},{"style":{"height":16},"width":236.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-2.png","element":"img","alt":" −NW0(x−) <","inline":true,"padRight":true},{"text":"1. Therefore, by Eq. ","element":"span"},{"href":"#id-57","text":"7 ","element":"a"},{"text":"and Eq. ","element":"span"},{"href":"#id-57","text":"8 ","element":"a"},{"text":"we get:","element":"span"}],[{"text":"1. For ","element":"span"},{"style":{"height":18.27},"width":502.92,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-3.png","element":"img","alt":" i ∈ {1, 3}, l ∈ {2, 4}, j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.25},"width":141.68,"height":40.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-4.png","element":"img","alt":"i) ∩ W −0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":20.93},"width":415.99,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-5.png","element":"img","alt":" w(j)1 = w(j)0 − ηpl + ηpi","inline":true},{"text":".","element":"span"}],[{"text":"2. 
For ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-6.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-7.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":20.93},"width":202.56,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-8.png","element":"img","alt":" w(j)1 = w(j)0","inline":true,"padRight":true},{"text":".","element":"span"}],[{"text":"3. For ","element":"span"},{"style":{"height":18.27},"width":491.3,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-9.png","element":"img","alt":" i ∈ {1, 3}, l ∈ {2, 4}, j ∈ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.25},"width":130.06,"height":40.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-10.png","element":"img","alt":"i) ∩ U −0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":20.93},"width":401.81,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-11.png","element":"img","alt":" u(j)1 = u(j)0 − ηpi + ηpl","inline":true},{"text":".","element":"span"}],[{"text":"4. 
For ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-12.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.27},"width":123.95,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-13.png","element":"img","alt":" j ∈ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":20.93},"width":188.36,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-14.png","element":"img","alt":" u(j)2 = u(j)0","inline":true,"padRight":true},{"text":".","element":"span"}],[{"text":"Applying Lemma ","element":"span"},{"href":"#id-43","text":"E.3 ","element":"a"},{"text":"again and using the fact that ","element":"span"},{"style":{"height":19.37},"width":112.01,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-15.png","element":"img","alt":" η ≤ 18k","inline":true,"padRight":true},{"text":"we have ","element":"span"},{"style":{"height":16.98},"width":238.4,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-16.png","element":"img","alt":" NW1(x+) < γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":236.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-17.png","element":"img","alt":" −NW1(x−) <","inline":true,"padRight":true},{"text":"1. ","element":"span"},{"text":"Therefore we get,","element":"span"}],[{"text":"1. 
For ","element":"span"},{"style":{"height":18.27},"width":502.92,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-18.png","element":"img","alt":" i ∈ {1, 3}, l ∈ {2, 4}, j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.25},"width":141.68,"height":40.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-19.png","element":"img","alt":"i) ∩ W −0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":20.93},"width":202.57,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-20.png","element":"img","alt":" w(j)2 = w(j)0","inline":true,"padRight":true},{"text":"+ 2","element":"span"},{"style":{"height":11.1},"width":56.16,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-21.png","element":"img","alt":"ηpi","inline":true},{"text":".","element":"span"}],[{"text":"2. For ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-22.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-23.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":20.93},"width":202.56,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-24.png","element":"img","alt":" w(j)2 = w(j)0","inline":true,"padRight":true},{"text":".","element":"span"}],[{"text":"3. 
For ","element":"span"},{"style":{"height":18.27},"width":491.3,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-25.png","element":"img","alt":" i ∈ {1, 3}, l ∈ {2, 4}, j ∈ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.25},"width":130.06,"height":40.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-26.png","element":"img","alt":"i) ∩ U −0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":20.93},"width":401.82,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-27.png","element":"img","alt":" u(j)2 = u(j)0 − ηpi + ηpl","inline":true},{"text":".","element":"span"}],[{"text":"4. For ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-28.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.27},"width":123.95,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-29.png","element":"img","alt":" j ∈ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":20.93},"width":188.36,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-30.png","element":"img","alt":" u(j)2 = u(j)0","inline":true,"padRight":true},{"text":".","element":"span"}],[{"text":"As before, by Lemma ","element":"span"},{"href":"#id-43","text":"E.3 ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":16.98},"width":238.4,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-31.png","element":"img","alt":" NW2(x+) < 
γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":236.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-32.png","element":"img","alt":" −NW2(x−) <","inline":true,"padRight":true},{"text":"1.","element":"span"}],[{"id":"id-49","style":{"fontWeight":"bold"},"text":"E.0.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Clustering Dynamics Lemmas","element":"span"}],[{"text":"In the following lemmas we assume that the high-probability event in Lemma ","element":"span"},{"href":"#id-43","text":"E.3 ","element":"a"},{"text":"holds. We therefore do not mention the probability in the statements of the lemmas.","element":"span"}],[{"id":"id-63","style":{"fontWeight":"bold"},"text":"Lemma E.5. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Clustering. Lemma ","element":"span"},{"href":"#id-32","style":{"fontStyle":"italic","fontWeight":"bold"},"text":"6.2 ","element":"a"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"restated and extended. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For all ","element":"span"},{"style":{"height":12.8},"width":59.3,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-33.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"there exists ","element":"span"},{"style":{"height":17.13},"width":37.64,"height":42.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-34.png","element":"img","alt":" αti","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-35.png","element":"img","alt":"i ∈ {1, 3}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":17.13},"width":134.94,"height":42.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-36.png","element":"img","alt":" |αti| ≤ η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and the following holds:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. 
For ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-37.png","element":"img","alt":" i ∈ {1, 3}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-38.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", it holds that ","element":"span"},{"style":{"height":20.93},"width":533.38,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-39.png","element":"img","alt":" w(j)t = w(j)0 + a+(t)ηpi + αtip2","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"2. For ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-40.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-41.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", it holds that ","element":"span"},{"style":{"height":20.93},"width":328.66,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-42.png","element":"img","alt":" w(j)t = w(j)0 + mp2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for 
","element":"span"},{"style":{"height":11.6},"width":110.7,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-43.png","element":"img","alt":" m ∈ Z","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"height":17.94},"width":120.68,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-44.png","element":"img","alt":"3. W +t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") = ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-45.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16},"width":278.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-46.png","element":"img","alt":"i) for i ∈ {1, 3}.","inline":true}],[{"style":{"width":"99%"},"width":1753,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-47.png","element":"img"}],[{"text":"for all ","element":"span"},{"style":{"height":12.8},"width":56.46,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-48.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"1. 
To prove this, we will show by induction on ","element":"span"},{"style":{"height":12.8},"width":56.46,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-49.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"1, that for all ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-50.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":18.27},"width":135.33,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-51.png","element":"img","alt":"i)∩W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"), where ","element":"span"},{"style":{"height":16},"width":158.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-52.png","element":"img","alt":"l ∈ {2, 4}","inline":true,"padRight":true},{"text":"the following holds:","element":"span"}],[{"text":"1. ","element":"span"},{"style":{"height":17.94},"width":135.57,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-53.png","element":"img","alt":" j ∈ W +t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":").","element":"span"}],[{"text":"2. 
","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"w","element":"span"},{"style":{"height":20.6},"width":104.42,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-54.png","element":"img","alt":"(j)t · pl","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"w","element":"span"},{"style":{"height":20.93},"width":175.36,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-55.png","element":"img","alt":"(j)0 · pl − η","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"w","element":"span"},{"style":{"height":20.6},"width":104.42,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-56.png","element":"img","alt":"(j)t · pl","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"w","element":"span"},{"style":{"height":20.6},"width":105.51,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/19-57.png","element":"img","alt":"(0)t · pl","inline":true},{"text":".","element":"span"}],[{"text":"3. ","element":"span"},{"style":{"height":20.93},"width":532.48,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-0.png","element":"img","alt":" w(j)t = w(j)0 + a+(t)ηpi + αip2","inline":true}],[{"text":"4. 
","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"w","element":"span"},{"style":{"height":20.6},"width":180.85,"height":51.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-1.png","element":"img","alt":"(j)t · pi > η","inline":true},{"text":".","element":"span"}],[{"text":"The claim holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1 by the proof of Lemma ","element":"span"},{"href":"#id-58","text":"E.4. ","element":"a"},{"text":"Assume it holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". By the induction hypothesis there exists an ","element":"span"},{"style":{"height":16},"width":170.05,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-2.png","element":"img","alt":" l′ ∈ {2, 4}","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":18.7},"width":135.57,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-3.png","element":"img","alt":" j ∈ W +T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.68},"width":141.68,"height":41.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-4.png","element":"img","alt":"i) ∩ W −T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":10.8},"width":26.67,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-5.png","element":"img","alt":"l′","inline":true},{"text":"). By Eq. 
","element":"span"},{"href":"#id-57","text":"7 ","element":"a"},{"text":"we have,","element":"span"}],[{"id":"id-59","style":{"width":"63%"},"width":1120,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.98},"width":151.22,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-7.png","element":"img","alt":" a = a+(t","inline":true,"padRight":true},{"text":"+ 1) ","element":"span"},{"style":{"height":16.98},"width":114.95,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-8.png","element":"img","alt":" − a+(t","inline":true},{"text":") and ","element":"span"},{"style":{"height":16},"width":194.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-9.png","element":"img","alt":" b ∈ {−1, 0}","inline":true},{"text":". From this follows the third claim of the induction proof and the first claim of the lemma.","element":"span"}],[{"text":"If ","element":"span"},{"style":{"height":21.36},"width":131.86,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-10.png","element":"img","alt":" w(j)T ·pl","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":131.86,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-11.png","element":"img","alt":" w(j)0 ·pl","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":10.8},"width":88.99,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-12.png","element":"img","alt":" l′ = l","inline":true,"padRight":true},{"text":"and either ","element":"span"},{"style":{"height":22.96},"width":155.62,"height":57.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-13.png","element":"img","alt":" w(j)T +1 
·pl","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":131.86,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-14.png","element":"img","alt":" w(j)0 ·pl","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"= 0 or ","element":"span"},{"style":{"height":22.96},"width":155.62,"height":57.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-15.png","element":"img","alt":" w(j)T +1 ·pl","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":196,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-16.png","element":"img","alt":" w(j)0 ·pl −η","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":10.8},"width":101.24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-17.png","element":"img","alt":" b = −","inline":true},{"text":"1. Otherwise, assume that ","element":"span"},{"style":{"height":21.36},"width":129.11,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-18.png","element":"img","alt":" w(j)T ·pl","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":190.48,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-19.png","element":"img","alt":" w(j)0 ·pl −η","inline":true},{"text":". By Lemma ","element":"span"},{"href":"#id-43","text":"E.3 ","element":"a"},{"text":"we have 0 ","element":"span"},{"style":{"height":22.65},"width":289.38,"height":56.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-20.png","element":"img","alt":" < w(j)0 ·pl <√2η4","inline":true,"padRight":true},{"text":". 
","element":"span"},{"text":"Therefore ","element":"span"},{"style":{"height":21.37},"width":307.99,"height":53.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-21.png","element":"img","alt":" −η < w(j)T · pl <","inline":true,"padRight":true},{"text":"0 and ","element":"span"},{"style":{"height":15.2},"width":99.35,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-22.png","element":"img","alt":" l′ ̸= l","inline":true},{"text":". It follows that either ","element":"span"},{"style":{"height":22.96},"width":166.56,"height":57.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-23.png","element":"img","alt":" w(j)T +1 · pl","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":217.89,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-24.png","element":"img","alt":" w(j)0 · pl − η","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"= 0 or ","element":"span"},{"style":{"height":22.96},"width":164.57,"height":57.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-25.png","element":"img","alt":" w(j)T +1 · pl","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":140.82,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-26.png","element":"img","alt":" w(j)0 · pl","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":10.8},"width":106.6,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-27.png","element":"img","alt":" b = −","inline":true},{"text":"1. 
In both cases, we have","element":"span"},{"style":{"height":29.53},"width":269.72,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-28.png","element":"img","alt":"|w(j)T +1 · pl| < η","inline":true},{"text":". Furthermore, by Eq. ","element":"span"},{"href":"#id-59","text":"9, ","element":"a"},{"style":{"height":22.96},"width":484.67,"height":57.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-29.png","element":"img","alt":"w(j)T +1 · pi ≥ w(j)T · pi > η","inline":true},{"text":". ","element":"span"},{"text":"Hence, arg max","element":"span"},{"style":{"height":22.98},"width":270.13,"height":57.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-30.png","element":"img","alt":"1≤l≤4 w(j)T +1 · pl","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"which by definition implies that ","element":"span"},{"style":{"height":20.3},"width":199.98,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-31.png","element":"img","alt":"j ∈ W +T +1(i","inline":true},{"text":"). 
This concludes the proof by induction which shows that ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-32.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":18.02},"width":150.53,"height":45.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-33.png","element":"img","alt":"i) ⊆ W +t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") for all ","element":"span"},{"style":{"height":12.8},"width":56.46,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-34.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"1.","element":"span"}],[{"text":"In order to prove the lemma, it suffices to show that ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-35.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(2)","element":"span"},{"style":{"height":18.27},"width":99.81,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-36.png","element":"img","alt":"∪W +0","inline":true,"padRight":true},{"text":"(4) ","element":"span"},{"style":{"height":17.94},"width":110.24,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-37.png","element":"img","alt":" ⊆ W +t","inline":true,"padRight":true},{"text":"(2)","element":"span"},{"style":{"height":17.94},"width":99.81,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-38.png","element":"img","alt":"∪W +t","inline":true,"padRight":true},{"text":"(4) and prove ","element":"span"},{"text":"the second claim. 
This follows since ","element":"span"},{"style":{"height":20.4},"width":161.64,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-39.png","element":"img","alt":"∪4i=1 W +t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., k","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". We will show by induction on ","element":"span"},{"style":{"height":12.8},"width":57.27,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-40.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"1, ","element":"span"},{"text":"that for all ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-41.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":18.27},"width":103.6,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-42.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(4), the following holds:","element":"span"}],[{"text":"1. 
","element":"span"},{"style":{"height":17.94},"width":135.57,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-43.png","element":"img","alt":" j ∈ W +t","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":17.94},"width":103.59,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-44.png","element":"img","alt":" ∩ W +t","inline":true,"padRight":true},{"text":"(4).","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"height":20.93},"width":328.66,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-45.png","element":"img","alt":" w(j)t = w(j)0 + mp2","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":11.6},"width":110.7,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-46.png","element":"img","alt":" m ∈ Z","inline":true},{"text":".","element":"span"}],[{"text":"The claim holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1 by the proof of Lemma ","element":"span"},{"href":"#id-58","text":"E.4. ","element":"a"},{"text":"Assume it holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". By the induction hypothesis ","element":"span"},{"style":{"height":18.7},"width":135.57,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-47.png","element":"img","alt":" j ∈ W +T","inline":true,"padRight":true},{"text":"(2)","element":"span"},{"style":{"height":18.7},"width":100.01,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-48.png","element":"img","alt":"∩W +T","inline":true,"padRight":true},{"text":"(4). 
Assume without loss of generality that ","element":"span"},{"style":{"height":18.7},"width":135.57,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-49.png","element":"img","alt":" j ∈ W +T","inline":true,"padRight":true},{"text":"(2). This implies that ","element":"span"},{"style":{"height":15.48},"width":135.57,"height":38.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-50.png","element":"img","alt":"j ∈ W −T","inline":true,"padRight":true},{"text":"(2) as well. Therefore, by Eq. ","element":"span"},{"href":"#id-57","text":"7 ","element":"a"},{"text":"we have","element":"span"}],[{"id":"id-60","style":{"width":"63%"},"width":1119,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-51.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":172.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-52.png","element":"img","alt":" a ∈ {0, 1}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":199.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-53.png","element":"img","alt":" b ∈ {0, −1}","inline":true},{"text":". By the induction hypothesis, ","element":"span"},{"style":{"height":22.96},"width":97.82,"height":57.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-54.png","element":"img","alt":" w(j)T +1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":201.84,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-55.png","element":"img","alt":" w(j)0 + mp2","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":11.6},"width":116.34,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-56.png","element":"img","alt":" m ∈ Z","inline":true},{"text":". 
If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"= 1 or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"= 0 we have for ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-57.png","element":"img","alt":" i ∈ {1, 3}","inline":true},{"text":",","element":"span"}],[{"style":{"width":"44%"},"width":780,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-58.png","element":"img"}],[{"text":"where the first inequality follows since ","element":"span"},{"style":{"height":18.7},"width":152.87,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-59.png","element":"img","alt":" j ∈ W +T","inline":true,"padRight":true},{"text":"(2) and the second by Eq. ","element":"span"},{"href":"#id-60","text":"10. ","element":"a"},{"text":"This implies that ","element":"span"},{"style":{"height":20.3},"width":168.61,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-60.png","element":"img","alt":"j ∈ W +T +1","inline":true},{"text":"(2) ","element":"span"},{"style":{"height":20.3},"width":136.64,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-61.png","element":"img","alt":" ∩ W +T +1","inline":true},{"text":"(4).","element":"span"}],[{"style":{"width":"100%"},"width":1757,"height":379,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-62.png","element":"img"}],[{"id":"id-64","style":{"fontWeight":"bold"},"text":"Lemma E.6. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For all ","element":"span"},{"style":{"height":12.8},"width":56.46,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-63.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"we have","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. ","element":"span"},{"style":{"height":20.93},"width":335.67,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-64.png","element":"img","alt":" u(j)t = u(j)0 + mηp2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":11.6},"width":110.7,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-65.png","element":"img","alt":" m ∈ Z","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"height":18.27},"width":109.07,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-66.png","element":"img","alt":"2. 
U +0","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":18.27},"width":91.98,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-67.png","element":"img","alt":" ∪ U +0","inline":true,"padRight":true},{"text":"(4) ","element":"span"},{"style":{"height":17.94},"width":98.62,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-68.png","element":"img","alt":" ⊆ U +t","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":17.94},"width":91.98,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/20-69.png","element":"img","alt":" ∪ U +t","inline":true,"padRight":true},{"text":"(4)","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"width":"99%"},"width":1755,"height":156,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-0.png","element":"img"}],[{"text":"Assume by contradiction that there exists an iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"for which ","element":"span"},{"style":{"height":20.93},"width":486.14,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-1.png","element":"img","alt":" u(j)t = u(j)0 + αtηp2 + βtηpi","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":546.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-2.png","element":"img","alt":"βt ∈ {−1, 1}, αt ∈ Z, i ∈ {1, 3}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":80.08,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-3.png","element":"img","alt":" u(j)t−1","inline":true,"padRight":true},{"text":"= 
","element":"span"},{"style":{"height":20.93},"width":259.98,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-4.png","element":"img","alt":" u(j)0 + αt−1ηp2","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":13.19},"width":158.41,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-5.png","element":"img","alt":" αt−1 ∈ Z","inline":true},{"text":". ","element":"span"},{"href":"#id-61","style":{"height":7.6},"width":31.9,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-6.png","element":"img","alt":"11","inline":true,"padRight":true},{"text":"Since the coefficient of ","element":"span"},{"style":{"height":11.1},"width":34.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-7.png","element":"img","alt":" pi","inline":true,"padRight":true},{"text":"changed in iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", we have ","element":"span"},{"style":{"height":18.27},"width":147.55,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-8.png","element":"img","alt":" j ∈ U +t−1","inline":true},{"text":"(1)","element":"span"},{"style":{"height":18.27},"width":112.52,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-9.png","element":"img","alt":"∪U +t−1","inline":true},{"text":"(3). 
However, this contradicts the claim above ","element":"span"},{"text":"which shows that if ","element":"span"},{"style":{"height":20.93},"width":80.08,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-10.png","element":"img","alt":" u(j)t−1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":259.02,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-11.png","element":"img","alt":" u(j)0 + αt−1ηp2","inline":true},{"text":", then ","element":"span"},{"style":{"height":18.27},"width":147.55,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-12.png","element":"img","alt":" j ∈ U +t−1","inline":true},{"text":"(2) ","element":"span"},{"style":{"height":18.27},"width":115.58,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-13.png","element":"img","alt":" ∪ U +t−1","inline":true},{"text":"(4).","element":"span"}],[{"id":"id-62","style":{"fontWeight":"bold"},"text":"Lemma E.7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":168.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-14.png","element":"img","alt":" i ∈ {1, 3}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":167.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-15.png","element":"img","alt":" l ∈ {2, 4}","inline":true},{"style":{"fontStyle":"italic"},"text":". 
For all ","element":"span"},{"style":{"height":12.8},"width":60.88,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-16.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", if ","element":"span"},{"style":{"height":18.27},"width":132.79,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-17.png","element":"img","alt":" j ∈ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.25},"width":133.59,"height":40.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-18.png","element":"img","alt":"i) ∩ U −0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", then there exists ","element":"span"},{"style":{"height":16},"width":347.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-19.png","element":"img","alt":"at ∈ {0, −1}, bt ∈ N","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":20.93},"width":468.04,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-20.png","element":"img","alt":" u(j)t = u(j)0 + atηpi + btηpl","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"First note that by Eq. 
","element":"span"},{"href":"#id-57","text":"8 ","element":"a"},{"text":"we generally have ","element":"span"},{"style":{"height":20.93},"width":460.61,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-21.png","element":"img","alt":" u(j)t = u(j)0 + αηpi + βηpl","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":14.4},"width":148.43,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-22.png","element":"img","alt":" α, β ∈ Z","inline":true},{"text":". Since ","element":"span"},{"style":{"height":29.53},"width":284.54,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-23.png","element":"img","alt":"��u(j)0 · p1�� ≤√2η4","inline":true,"padRight":true},{"text":", by the gradient update in Eq. ","element":"span"},{"href":"#id-57","text":"8 ","element":"a"},{"text":"it holds that ","element":"span"},{"style":{"height":16},"width":214.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-24.png","element":"img","alt":" at ∈ {0, −1}","inline":true},{"text":". 
Indeed, ","element":"span"},{"style":{"height":9.19},"width":37.06,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-25.png","element":"img","alt":" a0","inline":true,"padRight":true},{"text":"= 0 and by ","element":"span"},{"text":"the gradient update if ","element":"span"},{"style":{"height":9.19},"width":74.01,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-26.png","element":"img","alt":" at−1","inline":true,"padRight":true},{"text":"= 0 or ","element":"span"},{"style":{"height":9.19},"width":160.02,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-27.png","element":"img","alt":" at−1 = −","inline":true},{"text":"1 then ","element":"span"},{"style":{"height":16},"width":212.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-28.png","element":"img","alt":" at ∈ {−1, 0}","inline":true},{"text":".","element":"span"}],[{"text":"Assume by contradiction that there exists an iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > ","element":"span"},{"text":"0 such that ","element":"span"},{"style":{"height":13.19},"width":126.44,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-29.png","element":"img","alt":" bt = −","inline":true},{"text":"1 and ","element":"span"},{"style":{"height":13.19},"width":70.05,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-30.png","element":"img","alt":" bt−1","inline":true,"padRight":true},{"text":"= 0. Note that by Eq. ","element":"span"},{"href":"#id-57","text":"8 ","element":"a"},{"text":"this can only occur if ","element":"span"},{"style":{"height":18.27},"width":197.24,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-31.png","element":"img","alt":" j ∈ U +t−1(l","inline":true},{"text":"). 
","element":"span"},{"text":"We have ","element":"span"},{"style":{"height":20.93},"width":80.09,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-32.png","element":"img","alt":" u(j)t−1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":257.72,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-33.png","element":"img","alt":" u(j)0 + at−1ηpi","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":273.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-34.png","element":"img","alt":"at−1 ∈ {0, −1}","inline":true},{"text":". ","element":"span"},{"text":"Observe that","element":"span"},{"style":{"height":29.53},"width":409.24,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-35.png","element":"img","alt":"��u(j)t−1 · pi�� ≥ ��u(j)0 · pi��","inline":true,"padRight":true},{"text":"by the fact that","element":"span"},{"style":{"height":29.53},"width":297.84,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-36.png","element":"img","alt":"��u(j)0 · pi�� ≤ √2η4","inline":true,"padRight":true},{"text":". 
","element":"span"},{"text":"Since","element":"span"}],[{"style":{"width":"99%"},"width":1755,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-37.png","element":"img"}],[{"id":"id-50","style":{"fontWeight":"bold"},"text":"E.0.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bounding ","element":"span"},{"style":{"height":17.94},"width":56.12,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-38.png","element":"img","alt":" P +t","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":", ","element":"span"},{"style":{"height":14.72},"width":56.12,"height":36.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-39.png","element":"img","alt":" P −t","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"and ","element":"span"},{"style":{"height":14.72},"width":51.73,"height":36.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-40.png","element":"img","alt":" S−t","inline":true}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"Lemma E.8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The following holds","element":"span"}],[{"style":{"width":"100%"},"width":1758,"height":463,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-41.png","element":"img"}],[{"text":"For the third claim, without loss of generality, assume by contradiction that for ","element":"span"},{"style":{"height":17.94},"width":138.84,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-42.png","element":"img","alt":" j ∈ U +t","inline":true,"padRight":true},{"text":"(1) it ","element":"span"},{"text":"holds that","element":"span"},{"style":{"height":29.53},"width":242.83,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-43.png","element":"img","alt":"��u(j)t · p2�� > η","inline":true},{"text":". 
Since","element":"span"},{"style":{"height":29.53},"width":242.83,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-44.png","element":"img","alt":"|u(j)t · p1| < η","inline":true,"padRight":true},{"text":"by Lemma ","element":"span"},{"href":"#id-62","text":"E.7, ","element":"a"},{"text":"it follows that ","element":"span"},{"style":{"height":17.94},"width":127.62,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-45.png","element":"img","alt":" j ∈ U +t","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":17.94},"width":92.71,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-46.png","element":"img","alt":" ∪ U +t","inline":true,"padRight":true},{"text":"(4), a ","element":"span"},{"text":"contradiction. Therefore,","element":"span"},{"style":{"height":29.53},"width":239.14,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-47.png","element":"img","alt":"|u(j)t · p2| ≤ η","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":17.94},"width":123.95,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-48.png","element":"img","alt":" j ∈ U +t","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":17.94},"width":91.98,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-49.png","element":"img","alt":" ∪ U +t","inline":true,"padRight":true},{"text":"(3), from which the claim follows.","element":"span"}],[{"id":"id-51","style":{"fontWeight":"bold"},"text":"E.0.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Dynamics of ","element":"span"},{"style":{"height":17.94},"width":51.73,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-50.png","element":"img","alt":" 
S+t","inline":true}],[{"id":"id-74","style":{"fontWeight":"bold"},"text":"Lemma E.9. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let","element":"span"}],[{"id":"id-61","style":{"width":"77%"},"width":1354,"height":385,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/21-51.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We will prove the claim by induction on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0 this clearly holds. Assume it holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". Let ","element":"span"},{"style":{"height":18.7},"width":151.17,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-0.png","element":"img","alt":" j1 ∈ W +T","inline":true,"padRight":true},{"text":"(1) and ","element":"span"},{"style":{"height":18.7},"width":151.17,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-1.png","element":"img","alt":" j2 ∈ W +T","inline":true,"padRight":true},{"text":"(3). By Eq. 
","element":"span"},{"href":"#id-57","text":"7, ","element":"a"},{"text":"the gradient updates of the corresponding weight ","element":"span"},{"text":"vector are given as follows:","element":"span"}],[{"style":{"width":"64%"},"width":1135,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":167.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-3.png","element":"img","alt":" a ∈ {0, 1}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":302.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-4.png","element":"img","alt":" b1, b2 ∈ {−1, 0, 1}","inline":true},{"text":". By Lemma ","element":"span"},{"href":"#id-63","text":"E.5, ","element":"a"},{"style":{"height":20.3},"width":184.2,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-5.png","element":"img","alt":" j1 ∈ W +T +1","inline":true},{"text":"(1) and ","element":"span"},{"style":{"height":20.3},"width":184.21,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-6.png","element":"img","alt":" j2 ∈ W +T +1","inline":true},{"text":"(3). 
Therefore,","element":"span"}],[{"style":{"width":"89%"},"width":1580,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-7.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"89%"},"width":1580,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-8.png","element":"img"}],[{"text":"By Lemma ","element":"span"},{"href":"#id-63","text":"E.5 ","element":"a"},{"text":"we have","element":"span"},{"style":{"height":19.96},"width":81.46,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-9.png","element":"img","alt":"|W +t","inline":true,"padRight":true},{"text":"(1)","element":"span"},{"style":{"height":19.96},"width":147.87,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-10.png","element":"img","alt":"| = |W +0","inline":true,"padRight":true},{"text":"(1)","element":"span"},{"style":{"height":19.96},"width":13,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-11.png","element":"img","alt":"|","inline":true,"padRight":true},{"text":"and","element":"span"},{"style":{"height":19.96},"width":81.45,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-12.png","element":"img","alt":"|W +t","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.96},"width":147.88,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-13.png","element":"img","alt":"| = |W +0","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.96},"width":13,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-14.png","element":"img","alt":"|","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
It follows that","element":"span"}],[{"style":{"width":"39%"},"width":692,"height":482,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-15.png","element":"img"}],[{"text":"where the second equality follows by the induction hypothesis. This proves the claim.","element":"span"}],[{"id":"id-67","style":{"fontWeight":"bold"},"text":"Lemma E.10. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The following holds:","element":"span"}],[{"style":{"height":16.98},"width":238.32,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-16.png","element":"img","alt":"1. If NWt(x+","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< γ ","element":"span"},{"style":{"height":16},"width":252.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-17.png","element":"img","alt":" and −NWt(x−","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< ","element":"span"},{"text":"1","element":"span"},{"style":{"height":19.87},"width":192.02,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-18.png","element":"img","alt":", then S+t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"style":{"height":17.94},"width":27.3,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-19.png","element":"img","alt":"+t","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"η","element":"span"},{"style":{"height":19.96},"width":81.46,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-20.png","element":"img","alt":"|W +t","inline":true,"padRight":true},{"text":"(1) 
","element":"span"},{"style":{"height":17.94},"width":103.6,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-21.png","element":"img","alt":" ∪ W +t","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.96},"width":25.29,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-22.png","element":"img","alt":"|.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. If ","element":"span"},{"style":{"height":16.99},"width":236.06,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-23.png","element":"img","alt":" NWt(x+) ≥ γ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":234.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-24.png","element":"img","alt":" −NWt(x−) <","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":19.87},"width":76.94,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-25.png","element":"img","alt":" S+t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":17.94},"width":51.73,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-26.png","element":"img","alt":" S+t","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"height":19.87},"width":880.34,"height":49.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-27.png","element":"img","alt":"3. 
If NWt(x+) < γ and −NWt(x−) ≥ 1, then S+t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":17.94},"width":51.73,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-28.png","element":"img","alt":" S+t","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":19.96},"width":109.31,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-29.png","element":"img","alt":" η|W +t","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":17.94},"width":103.6,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-30.png","element":"img","alt":" ∪ W +t","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.96},"width":25.28,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-31.png","element":"img","alt":"|.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"1. 
The equality follows since for each ","element":"span"},{"style":{"height":16},"width":342.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-32.png","element":"img","alt":" i ∈ {1, 3}, l ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.94},"width":135.57,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-33.png","element":"img","alt":" j ∈ W +t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16},"width":140.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-34.png","element":"img","alt":"i) ∩ W −t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":") we have ","element":"span"},{"style":{"height":22.53},"width":86.74,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-35.png","element":"img","alt":"w(j)t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.6},"width":286.95,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-36.png","element":"img","alt":" w(j)t + ηpi − ηpl","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.87},"width":90.14,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-37.png","element":"img","alt":" W +t+1","inline":true},{"text":"(1) ","element":"span"},{"style":{"height":19.87},"width":125.56,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-38.png","element":"img","alt":" ∪ W +t+1","inline":true},{"text":"(3) = ","element":"span"},{"style":{"height":17.94},"width":68.17,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-39.png","element":"img","alt":" W +t","inline":true,"padRight":true},{"text":"(1) 
","element":"span"},{"style":{"height":17.94},"width":103.59,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-40.png","element":"img","alt":" ∪ W +t","inline":true,"padRight":true},{"text":"(3) by Lemma ","element":"span"},{"href":"#id-63","text":"E.5.","element":"a"}],[{"text":"2. In this case for each ","element":"span"},{"style":{"height":16},"width":352.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-41.png","element":"img","alt":" i ∈ {1, 3}, l ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.94},"width":139.66,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-42.png","element":"img","alt":" j ∈ W +t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16},"width":143.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-43.png","element":"img","alt":"i) ∩ W −t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":") we have ","element":"span"},{"style":{"height":22.53},"width":86.74,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-44.png","element":"img","alt":" w(j)t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.6},"width":181.45,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-45.png","element":"img","alt":" w(j)t − ηpl","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.87},"width":90.14,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-46.png","element":"img","alt":" W +t+1","inline":true},{"text":"(1) 
","element":"span"},{"style":{"height":19.87},"width":125.56,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-47.png","element":"img","alt":" ∪ W +t+1","inline":true},{"text":"(3) = ","element":"span"},{"style":{"height":17.94},"width":68.17,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-48.png","element":"img","alt":" W +t","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":17.94},"width":103.59,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-49.png","element":"img","alt":" ∪ W +t","inline":true,"padRight":true},{"text":"(3) by Lemma ","element":"span"},{"href":"#id-63","text":"E.5.","element":"a"}],[{"text":"3. This equality follows since for each ","element":"span"},{"style":{"height":16},"width":382.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-50.png","element":"img","alt":" i ∈ {1, 3}, l ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.94},"width":152,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-51.png","element":"img","alt":" j ∈ W +t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16},"width":148.25,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-52.png","element":"img","alt":"i) ∩ W −t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":") we have ","element":"span"},{"style":{"height":22.53},"width":86.74,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-53.png","element":"img","alt":"w(j)t+1","inline":true,"padRight":true},{"text":"= 
","element":"span"},{"style":{"height":20.6},"width":180.81,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-54.png","element":"img","alt":" w(j)t + ηpi","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.87},"width":90.14,"height":49.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-55.png","element":"img","alt":" W +t+1","inline":true},{"text":"(1) ","element":"span"},{"style":{"height":19.87},"width":125.57,"height":49.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-56.png","element":"img","alt":" ∪ W +t+1","inline":true},{"text":"(3) = ","element":"span"},{"style":{"height":17.94},"width":68.17,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-57.png","element":"img","alt":" W +t","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":17.94},"width":103.6,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-58.png","element":"img","alt":" ∪ W +t","inline":true,"padRight":true},{"text":"(3) by Lemma ","element":"span"},{"href":"#id-63","text":"E.5.","element":"a"}],[{"style":{"width":"1%"},"width":28,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/22-59.png","element":"img"}],[{"id":"id-52","style":{"fontWeight":"bold"},"text":"E.0.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Upper Bounds on ","element":"span"},{"style":{"height":16.98},"width":143.97,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-0.png","element":"img","alt":" NWt(x+","inline":true},{"text":")","element":"span"},{"style":{"fontWeight":"bold"},"text":", ","element":"span"},{"style":{"height":16},"width":174.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-1.png","element":"img","alt":" −NWt(x−","inline":true},{"text":") 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"and ","element":"span"},{"style":{"height":17.94},"width":51.73,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-2.png","element":"img","alt":" S+t","inline":true}],[{"id":"id-68","style":{"fontWeight":"bold"},"text":"Lemma E.11. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"style":{"height":16.99},"width":236.06,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-3.png","element":"img","alt":" NWt(x+) ≥ γ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":234.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-4.png","element":"img","alt":" −NWt(x−) <","inline":true,"padRight":true},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":13.2},"width":240.98,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-5.png","element":"img","alt":" T ≤ t < T + b","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":13.2},"width":59.17,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-6.png","element":"img","alt":" b ≥","inline":true,"padRight":true},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then ","element":"span"},{"style":{"height":18.67},"width":546.86,"height":46.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-7.png","element":"img","alt":"NWT +b(x+) ≤ NWT (x+) − (b −","inline":true,"padRight":true},{"text":"1)","element":"span"},{"style":{"height":19.96},"width":194.68,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-8.png","element":"img","alt":"cη + η|W +0","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":18.27},"width":103.6,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-9.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(4)","element":"span"},{"style":{"height":19.96},"width":13,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-10.png","element":"img","alt":"|","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Define ","element":"span"},{"style":{"height":17.94},"width":55.56,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-11.png","element":"img","alt":" R+t","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":17.94},"width":162.2,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-12.png","element":"img","alt":" Y +t − Z+t","inline":true,"padRight":true},{"text":"where","element":"span"}],[{"style":{"width":"80%"},"width":1410,"height":295,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-13.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":16},"width":298.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-14.png","element":"img","alt":" l ∈ {2, 4}, t = T","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.87},"width":184.43,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-15.png","element":"img","alt":" j ∈ U +t+1(l","inline":true},{"text":"). 
Then, either ","element":"span"},{"style":{"height":17.94},"width":131.9,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-16.png","element":"img","alt":" j ∈ U +t","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":17.94},"width":93.56,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-17.png","element":"img","alt":" ∪ U +t","inline":true,"padRight":true},{"text":"(4) or ","element":"span"},{"style":{"height":17.94},"width":131.9,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-18.png","element":"img","alt":" j ∈ U +t","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":17.94},"width":93.57,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-19.png","element":"img","alt":" ∪ U +t","inline":true,"padRight":true},{"text":"(3). ","element":"span"},{"text":"In the first case, ","element":"span"},{"style":{"height":22.53},"width":79.64,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-20.png","element":"img","alt":" u(j)t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.6},"width":173.62,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-21.png","element":"img","alt":" u(j)t + ηpl","inline":true},{"text":". 
Note that this implies that ","element":"span"},{"style":{"height":17.94},"width":56.56,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-22.png","element":"img","alt":" U +t","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":17.94},"width":92.44,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-23.png","element":"img","alt":" ∪ U +t","inline":true,"padRight":true},{"text":"(4) ","element":"span"},{"style":{"height":19.87},"width":122.92,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-24.png","element":"img","alt":" ⊆ U +t+1","inline":true},{"text":"(2) ","element":"span"},{"style":{"height":19.87},"width":115.59,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-25.png","element":"img","alt":" ∪ U +t+1","inline":true},{"text":"(4) ","element":"span"},{"text":"(since ","element":"span"},{"style":{"height":11.1},"width":33.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-26.png","element":"img","alt":" pl","inline":true,"padRight":true},{"text":"will remain the maximal direction). 
Therefore,","element":"span"}],[{"style":{"width":"86%"},"width":1525,"height":401,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-27.png","element":"img"}],[{"text":"In the second case, where we have ","element":"span"},{"style":{"height":17.94},"width":124.07,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-28.png","element":"img","alt":" j ∈ U +t","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":17.94},"width":92,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-29.png","element":"img","alt":" ∪ U +t","inline":true,"padRight":true},{"text":"(3), it holds that ","element":"span"},{"style":{"height":22.53},"width":79.64,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-30.png","element":"img","alt":" u(j)t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.6},"width":323.46,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-31.png","element":"img","alt":" u(j)t + ηpl, j ∈ U −t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":") ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":22.53},"width":216.71,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-32.png","element":"img","alt":" u(j)t+1 · pl > η","inline":true},{"text":". 
Furthermore, by Lemma ","element":"span"},{"href":"#id-62","text":"E.7, ","element":"a"},{"style":{"height":20.6},"width":205.06,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-33.png","element":"img","alt":" u(j)t · pi < η","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-34.png","element":"img","alt":" i ∈ {1, 3}","inline":true},{"text":". Note that by Lemma ","element":"span"},{"href":"#id-62","text":"E.7, ","element":"a"},{"text":"any ","element":"span"},{"style":{"height":17.94},"width":139.55,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-35.png","element":"img","alt":" j1 ∈ U +t","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":17.94},"width":91.98,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-36.png","element":"img","alt":" ∪ U +t","inline":true,"padRight":true},{"text":"(3) satisfies ","element":"span"},{"style":{"height":19.87},"width":162.7,"height":49.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-37.png","element":"img","alt":" j1 ∈ U +t+1","inline":true},{"text":"(2) ","element":"span"},{"style":{"height":19.87},"width":115.13,"height":49.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-38.png","element":"img","alt":" ∪ U +t+1","inline":true},{"text":"(4). 
By all these observations, we have","element":"span"}],[{"style":{"width":"99%"},"width":1756,"height":691,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-39.png","element":"img"}],[{"text":"in the case that ","element":"span"},{"style":{"height":20.3},"width":197.98,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-40.png","element":"img","alt":" j ∈ W +T +1(l","inline":true},{"text":"), or","element":"span"}],[{"style":{"width":"45%"},"width":801,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-41.png","element":"img"}],[{"text":"if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j /","element":"span"},{"style":{"height":9.6},"width":27,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-42.png","element":"img","alt":"∈","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"style":{"height":20.3},"width":63.58,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/23-43.png","element":"img","alt":"+T +1","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":").","element":"span"}],[{"style":{"width":"99%"},"width":1755,"height":444,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-0.png","element":"img"}],[{"id":"id-65","style":{"fontWeight":"bold"},"text":"Lemma E.12. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"style":{"height":16.98},"width":236.08,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-1.png","element":"img","alt":" NWt(x+) < γ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":234.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-2.png","element":"img","alt":" −NWt(x−) ≥","inline":true,"padRight":true},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":13.2},"width":240.98,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-3.png","element":"img","alt":" T ≤ t < T + b","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":13.2},"width":59.17,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-4.png","element":"img","alt":" b ≥","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then ","element":"span"},{"style":{"height":20.03},"width":663.69,"height":50.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-5.png","element":"img","alt":"−NWT +b(x−) ≤ −NWT (x−) − bη|W +0","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":18.27},"width":103.6,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-6.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(4)","element":"span"},{"style":{"height":19.96},"width":95.24,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-7.png","element":"img","alt":"| + cη","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"width":"80%"},"width":1409,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-8.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"49%"},"width":869,"height":113,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-9.png","element":"img"}],[{"text":"First note that by Lemma ","element":"span"},{"href":"#id-63","text":"E.5 ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":19.87},"width":90.14,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-10.png","element":"img","alt":" W +t+1","inline":true},{"text":"(2) ","element":"span"},{"style":{"height":19.87},"width":128.21,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-11.png","element":"img","alt":" ∪ W +t+1","inline":true},{"text":"(4) = ","element":"span"},{"style":{"height":17.94},"width":68.17,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-12.png","element":"img","alt":" W +t","inline":true,"padRight":true},{"text":"(2) 
","element":"span"},{"style":{"height":17.94},"width":106.24,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-13.png","element":"img","alt":" ∪ W +t","inline":true,"padRight":true},{"text":"(4). Next, for any ","element":"span"},{"style":{"height":16},"width":158.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-14.png","element":"img","alt":"l ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.94},"width":135.57,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-15.png","element":"img","alt":" j ∈ W +t","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":") we have ","element":"span"},{"style":{"height":22.53},"width":86.74,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-16.png","element":"img","alt":" w(j)t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.6},"width":179.81,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-17.png","element":"img","alt":" w(j)t + ηpl","inline":true},{"text":". 
Therefore,","element":"span"}],[{"style":{"width":"99%"},"width":1753,"height":307,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-18.png","element":"img"}],[{"text":"To see this, note that by Lemma ","element":"span"},{"href":"#id-62","text":"E.7 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-64","text":"E.6 ","element":"a"},{"text":"it holds that ","element":"span"},{"style":{"height":21.36},"width":369.22,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-19.png","element":"img","alt":" u(j)T = u(j)0 + aT ηpl","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":223.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-20.png","element":"img","alt":"aT ∈ {−1, 0}","inline":true},{"text":". Hence, ","element":"span"},{"style":{"height":22.96},"width":90.72,"height":57.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-21.png","element":"img","alt":" u(j)T +1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":66.39,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-22.png","element":"img","alt":" u(j)0","inline":true,"padRight":true},{"text":"+","element":"span"},{"style":{"height":11.19},"width":141.68,"height":27.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-23.png","element":"img","alt":"aT +1ηpl","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":263.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-24.png","element":"img","alt":" aT +1 ∈ {−1, 0}","inline":true},{"text":". 
Since ","element":"span"},{"style":{"height":29.53},"width":281.99,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-25.png","element":"img","alt":"|u(j)0 · p2| < √2η/4","inline":true,"padRight":true},{"text":"it follows","element":"span"}],[{"style":{"width":"93%"},"width":1645,"height":211,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-26.png","element":"img"}],[{"text":"if ","element":"span"},{"style":{"height":16},"width":158.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-27.png","element":"img","alt":" l ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.3},"width":187.56,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-28.png","element":"img","alt":" j ∈ U +T +1(l","inline":true},{"text":"), or","element":"span"}],[{"style":{"width":"99%"},"width":1754,"height":396,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/24-29.png","element":"img"}],[{"text":"and since ","element":"span"},{"style":{"height":20.3},"width":101.22,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-0.png","element":"img","alt":" W +T +1","inline":true},{"text":"(1) ","element":"span"},{"style":{"height":20.3},"width":136.67,"height":50.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-1.png","element":"img","alt":" ∪ W +T +1","inline":true},{"text":"(3) = ","element":"span"},{"style":{"height":18.7},"width":68.17,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-2.png","element":"img","alt":" W +T","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.7},"width":103.62,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-3.png","element":"img","alt":" ∪ W +T","inline":true,"padRight":true},{"text":"(3) by Lemma 
","element":"span"},{"href":"#id-63","text":"E.5, ","element":"a"},{"text":"we get ","element":"span"},{"style":{"height":17.21},"width":86.01,"height":43.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-4.png","element":"img","alt":" S−T +b","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":15.48},"width":51.74,"height":38.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-5.png","element":"img","alt":" S−T","inline":true,"padRight":true},{"text":". Hence, we can ","element":"span"},{"text":"conclude that","element":"span"}],[{"id":"id-69","style":{"width":"99%"},"width":1756,"height":532,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-6.png","element":"img"}],[{"text":"The proof of the second claim follows similarly. It holds that ","element":"span"},{"style":{"height":17.59},"width":277.28,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-7.png","element":"img","alt":" −NWT +1(x−) <","inline":true,"padRight":true},{"text":"1+2","element":"span"},{"style":{"height":11.59},"width":33.24,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-8.png","element":"img","alt":"cη","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":16},"width":243.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-9.png","element":"img","alt":" −NWT (x−) <","inline":true,"padRight":true},{"text":"1. 
Otherwise, if ","element":"span"},{"style":{"height":16},"width":234.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-10.png","element":"img","alt":" −NWt(x−) ≥","inline":true,"padRight":true},{"text":"1 for ","element":"span"},{"style":{"height":13.2},"width":239.7,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-11.png","element":"img","alt":" T ≤ t ≤ T + b","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":278.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-12.png","element":"img","alt":" −NWT −1(x−) <","inline":true,"padRight":true},{"text":"1, then ","element":"span"},{"style":{"height":17.68},"width":276.34,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-13.png","element":"img","alt":" −NWT +b(x−) ≤","inline":true,"padRight":true},{"text":"1 + 3","element":"span"},{"style":{"height":11.59},"width":33.24,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-14.png","element":"img","alt":"cη","inline":true}],[{"text":"by Lemma ","element":"span"},{"href":"#id-65","text":"E.12.","element":"a"}],[{"text":"The third claim holds by the following identities and bounds: ","element":"span"},{"style":{"height":16.98},"width":367.94,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-15.png","element":"img","alt":" NWT (x+) − NWT (x−","inline":true},{"text":") = ","element":"span"},{"style":{"height":18.7},"width":154.74,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-16.png","element":"img","alt":" S+T − P +T","inline":true,"padRight":true},{"text":"+","element":"span"},{"style":{"height":15.48},"width":160.23,"height":38.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-17.png","element":"img","alt":"P −T − 
S−T","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":15.48},"width":102.31,"height":38.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-18.png","element":"img","alt":" P −T ≥","inline":true,"padRight":true},{"text":"0, ","element":"span"},{"style":{"height":19.96},"width":375.4,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-19.png","element":"img","alt":"|P +T| ≤ cη, |S−T| ≤ cη","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.98},"width":469.96,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-20.png","element":"img","alt":" NWT (x+) − NWT (x−) ≤ γ","inline":true,"padRight":true},{"text":"+ 1 + 6","element":"span"},{"style":{"height":11.59},"width":33.24,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-21.png","element":"img","alt":"cη","inline":true,"padRight":true},{"text":"by the previous claims.","element":"span"}],[{"id":"id-53","style":{"fontWeight":"bold"},"text":"E.0.6 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Optimization","element":"span"}],[{"text":"We are now ready to prove a global optimality guarantee for gradient descent.","element":"span"}],[{"id":"id-72","style":{"width":"100%"},"width":1758,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-22.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":11.59},"width":171.26,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-23.png","element":"img","alt":"7(γ+1+8cη)","inline":true},{"text":"( ","element":"span"},{"style":{"height":16},"width":136.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-24.png","element":"img","alt":"k2 
−2√k)η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"iterations, gradient descent converges to a global minimum.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"First note that with probability at least 1","element":"span"},{"style":{"height":4.4},"width":31,"height":11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-25.png","element":"img","alt":"−","inline":true}],[{"style":{"height":25.45},"width":216.42,"height":63.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-26.png","element":"img","alt":"√2k√πe8k −4e−8","inline":true,"padRight":true},{"text":"the claims of Lemma ","element":"span"},{"href":"#id-66","text":"E.2 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-43","text":"E.3 ","element":"a"},{"text":"hold. ","element":"span"},{"text":"Now, if gradient descent has not reached a global minimum at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"then either ","element":"span"},{"style":{"height":16.98},"width":236.07,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-27.png","element":"img","alt":"NWt(x+) < γ","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":16},"width":234.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-28.png","element":"img","alt":" −NWt(x−) <","inline":true,"padRight":true},{"text":"1. 
If ","element":"span"},{"style":{"height":16.98},"width":267.06,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-29.png","element":"img","alt":" −NWt(x+) < γ","inline":true,"padRight":true},{"text":"then by Lemma ","element":"span"},{"href":"#id-67","text":"E.10 ","element":"a"},{"text":"it holds that","element":"span"}],[{"id":"id-84","style":{"width":"77%"},"width":1359,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-30.png","element":"img"}],[{"text":"where the last inequality follows by Lemma ","element":"span"},{"href":"#id-66","text":"E.2.","element":"a"}],[{"text":"If ","element":"span"},{"style":{"height":16.98},"width":241.75,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-31.png","element":"img","alt":" NWt(x+) ≥ γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":237.27,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-32.png","element":"img","alt":" −NWt(x−) <","inline":true,"padRight":true},{"text":"1 we have ","element":"span"},{"style":{"height":19.87},"width":76.94,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-33.png","element":"img","alt":" S+t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":17.94},"width":51.74,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-34.png","element":"img","alt":" S+t","inline":true,"padRight":true},{"text":"by Lemma ","element":"span"},{"href":"#id-67","text":"E.10. 
","element":"a"},{"text":"However, by Lemma ","element":"span"},{"href":"#id-68","text":"E.11, ","element":"a"},{"text":"it follows that after 5 consecutive iterations ","element":"span"},{"style":{"height":10.8},"width":203.69,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-35.png","element":"img","alt":" t < t′ < t","inline":true,"padRight":true},{"text":"+ 6 in which ","element":"span"},{"style":{"height":18.31},"width":268.57,"height":45.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-36.png","element":"img","alt":" NWt′ (x+) ≥ γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.32},"width":245.2,"height":43.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-37.png","element":"img","alt":"−NWt′ (x−) <","inline":true,"padRight":true},{"text":"1, we have ","element":"span"},{"style":{"height":18.57},"width":270.11,"height":46.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-38.png","element":"img","alt":" NWt+6(x+) < γ","inline":true},{"text":". To see this, first note that for all ","element":"span"},{"style":{"height":16.99},"width":274.05,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-39.png","element":"img","alt":" t, NWt(x+) ≤ γ","inline":true,"padRight":true},{"text":"+ 3","element":"span"},{"style":{"height":11.59},"width":33.24,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-40.png","element":"img","alt":"cη","inline":true,"padRight":true},{"text":"by Lemma ","element":"span"},{"href":"#id-69","text":"E.13. 
","element":"a"},{"text":"Then, by Lemma ","element":"span"},{"href":"#id-68","text":"E.11 ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"51%"},"width":896,"height":164,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-41.png","element":"img"}],[{"text":"where the second inequality follows by Lemma ","element":"span"},{"href":"#id-66","text":"E.2 ","element":"a"},{"text":"and the last inequality by the assumption on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". Assume by contradiction that GD has not converged to a global minimum after ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":11.59},"width":171.26,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-42.png","element":"img","alt":"7(γ+1+8cη)","inline":true},{"text":"( ","element":"span"},{"style":{"height":16},"width":136.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-43.png","element":"img","alt":"k2 −2√k)η","inline":true,"padRight":true},{"text":"iterations. 
Then, by the above observations, and the fact that ","element":"span"},{"style":{"height":18.27},"width":95.26,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-44.png","element":"img","alt":" S+0 >","inline":true,"padRight":true},{"text":"0 with probability 1, we have","element":"span"}],[{"style":{"width":"27%"},"width":483,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/25-45.png","element":"img"}],[{"text":"However, this contradicts Lemma ","element":"span"},{"href":"#id-69","text":"E.13.","element":"a"}],[{"id":"id-54","style":{"fontWeight":"bold"},"text":"E.0.7 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Generalization on Positive Class","element":"span"}],[{"text":"We will first need the following three lemmas.","element":"span"}],[{"id":"id-73","style":{"width":"62%"},"width":1100,"height":353,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof is similar to the proof of Lemma ","element":"span"},{"href":"#id-66","text":"E.2.","element":"a"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"Lemma E.16. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that gradient descent converged to a global minimum at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then there exists an iteration ","element":"span"},{"style":{"height":13.19},"width":139.11,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-1.png","element":"img","alt":" T2 < T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for which ","element":"span"},{"style":{"height":17.94},"width":144.12,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-2.png","element":"img","alt":" S+t ≥ γ","inline":true,"padRight":true},{"text":"+ 1 ","element":"span"},{"style":{"height":15.59},"width":96.18,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-3.png","element":"img","alt":" − 3cη","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":13.2},"width":122.62,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-4.png","element":"img","alt":" t ≥ T2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and for all ","element":"span"},{"style":{"height":13.19},"width":122.62,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-5.png","element":"img","alt":" t < T2","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":16},"width":234.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-6.png","element":"img","alt":"−NWt(x−) <","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Assume that for all 0 ","element":"span"},{"style":{"height":13.2},"width":148.88,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-7.png","element":"img","alt":" ≤ t ≤ T1","inline":true,"padRight":true},{"text":"it holds that ","element":"span"},{"style":{"height":16.99},"width":236.07,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-8.png","element":"img","alt":" NWt(x+) < γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":234.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-9.png","element":"img","alt":" −NWt(x−) <","inline":true,"padRight":true},{"text":"1. By continuing the calculation of Lemma ","element":"span"},{"href":"#id-58","text":"E.4 ","element":"a"},{"text":"we have the following:","element":"span"}],[{"text":"1. For ","element":"span"},{"style":{"height":18.27},"width":497.27,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-10.png","element":"img","alt":" i ∈ {1, 3}, l ∈ {2, 4}, j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.25},"width":127.56,"height":40.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-11.png","element":"img","alt":"i)∩W −0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":22.96},"width":73.49,"height":57.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-12.png","element":"img","alt":" w(j)T1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":73.5,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-13.png","element":"img","alt":" 
w(j)0","inline":true,"padRight":true},{"text":"+","element":"span"},{"style":{"height":19.37},"width":418.59,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-14.png","element":"img","alt":"T1ηpi− 12(1−(−1)T1)ηpl","inline":true,"padRight":true},{"text":".","element":"span"}],[{"text":"2. For ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-15.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-16.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":22.96},"width":73.5,"height":57.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-17.png","element":"img","alt":" w(j)T1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":73.49,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-18.png","element":"img","alt":" w(j)0","inline":true,"padRight":true},{"text":".","element":"span"}],[{"text":"3. 
For ","element":"span"},{"style":{"height":18.27},"width":491.3,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-19.png","element":"img","alt":" i ∈ {1, 3}, l ∈ {2, 4}, j ∈ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.25},"width":130.06,"height":40.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-20.png","element":"img","alt":"i) ∩ U −0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":22.96},"width":66.39,"height":57.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-21.png","element":"img","alt":" u(j)T1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":279.84,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-22.png","element":"img","alt":" u(j)0 − ηpi + ηpl","inline":true},{"text":".","element":"span"}],[{"text":"4. 
For ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-23.png","element":"img","alt":" i ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.27},"width":123.95,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-24.png","element":"img","alt":" j ∈ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"), it holds that ","element":"span"},{"style":{"height":22.96},"width":66.39,"height":57.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-25.png","element":"img","alt":" u(j)T1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":66.39,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-26.png","element":"img","alt":" u(j)0","inline":true,"padRight":true},{"text":".","element":"span"}],[{"text":"Therefore, there exists an iteration ","element":"span"},{"style":{"height":13.19},"width":39.29,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-27.png","element":"img","alt":" T1","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":19.37},"width":259.78,"height":48.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-28.png","element":"img","alt":" NWT1 (x+) ≥ γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.39},"width":256.74,"height":45.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-29.png","element":"img","alt":" −NWT1 (x−) <","inline":true,"padRight":true},{"text":"1 and for all ","element":"span"},{"style":{"height":16.99},"width":365.78,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-30.png","element":"img","alt":"t < 
T1, NWt(x+) < γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":234.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-31.png","element":"img","alt":" −NWt(x−) <","inline":true,"padRight":true},{"text":"1. Let ","element":"span"},{"style":{"height":13.2},"width":123.3,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-32.png","element":"img","alt":" T2 ≤ T","inline":true,"padRight":true},{"text":"be the first iteration such that ","element":"span"},{"style":{"height":18.38},"width":255.34,"height":45.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-33.png","element":"img","alt":" −NWT2 (x−) ≥","inline":true,"padRight":true},{"text":"1. We claim that for all ","element":"span"},{"style":{"height":13.2},"width":225.82,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-34.png","element":"img","alt":" T1 ≤ t ≤ T2","inline":true,"padRight":true},{"text":"we have ","element":"span"},{"style":{"height":19.37},"width":376.98,"height":48.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-35.png","element":"img","alt":" NWT1 (x+) ≥ γ − 2cη","inline":true},{"text":". It suffices to show that for all ","element":"span"},{"style":{"height":13.2},"width":201.11,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-36.png","element":"img","alt":"T1 ≤ t < T2","inline":true,"padRight":true},{"text":"the following holds:","element":"span"}],[{"text":"1. 
If ","element":"span"},{"style":{"height":16.98},"width":236.06,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-37.png","element":"img","alt":" NWt(x+) ≥ γ","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":18.57},"width":372.82,"height":46.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-38.png","element":"img","alt":" NWt+1(x+) ≥ γ − 2cη","inline":true},{"text":".","element":"span"}],[{"text":"2. If ","element":"span"},{"style":{"height":16.99},"width":236.07,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-39.png","element":"img","alt":" NWt(x+) < γ","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":18.57},"width":392.06,"height":46.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-40.png","element":"img","alt":" NWt+1(x+) ≥ NWt(x+","inline":true},{"text":").","element":"span"}],[{"text":"The first claim follows since at any iteration ","element":"span"},{"style":{"height":16.98},"width":143.97,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-41.png","element":"img","alt":" NWt(x+","inline":true},{"text":") can decrease by at most 2","element":"span"},{"style":{"height":14.4},"width":42.24,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-42.png","element":"img","alt":"ηk","inline":true,"padRight":true},{"text":"= 2","element":"span"},{"style":{"height":11.59},"width":33.24,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-43.png","element":"img","alt":"cη","inline":true},{"text":". 
For the second claim, let ","element":"span"},{"style":{"height":10.8},"width":105.56,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-44.png","element":"img","alt":" t′ < t","inline":true,"padRight":true},{"text":"be the latest iteration such that ","element":"span"},{"style":{"height":18.31},"width":259.68,"height":45.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-45.png","element":"img","alt":" NWt′ (x+) ≥ γ","inline":true},{"text":". Then at iteration ","element":"span"},{"style":{"height":10},"width":28.39,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-46.png","element":"img","alt":" t′","inline":true,"padRight":true},{"text":"it holds that ","element":"span"},{"style":{"height":17.32},"width":254.79,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-47.png","element":"img","alt":" −NWt′ (x−) <","inline":true,"padRight":true},{"text":"1 and ","element":"span"},{"style":{"height":18.31},"width":266.01,"height":45.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-48.png","element":"img","alt":" NWt′ (x+) ≥ γ","inline":true},{"text":". 
","element":"span"},{"text":"Therefore, for all ","element":"span"},{"style":{"height":16},"width":388.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-49.png","element":"img","alt":" i ∈ {1, 3}, l ∈ {2, 4}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.27},"width":127.66,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-50.png","element":"img","alt":"j ∈ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":18.27},"width":131.54,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-51.png","element":"img","alt":"i) ∩ U +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":") it holds that ","element":"span"},{"style":{"height":22.96},"width":90.42,"height":57.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-52.png","element":"img","alt":" u(j)t′+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":21.36},"width":174.19,"height":53.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-53.png","element":"img","alt":" u(j)t′ + ηpl","inline":true},{"text":". 
Hence, by Lemma ","element":"span"},{"href":"#id-64","text":"E.6 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-62","text":"E.7 ","element":"a"},{"text":"it holds ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":20.29},"width":90.49,"height":50.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-54.png","element":"img","alt":" U +t′+1","inline":true},{"text":"(1) ","element":"span"},{"style":{"height":20.29},"width":126.9,"height":50.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-55.png","element":"img","alt":" ∪ U +t′+1","inline":true},{"text":"(3) = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-56.png","element":"img","alt":" ∅","inline":true},{"text":". Therefore, by the gradient update in Eq. ","element":"span"},{"href":"#id-57","text":"8, ","element":"a"},{"text":"for all 1 ","element":"span"},{"style":{"height":14},"width":142.28,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-57.png","element":"img","alt":" ≤ j ≤ k","inline":true},{"text":", and all ","element":"span"},{"style":{"height":12.8},"width":192.44,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-58.png","element":"img","alt":"t′ < t′′ ≤ t","inline":true,"padRight":true},{"text":"we have ","element":"span"},{"style":{"height":22.96},"width":99.21,"height":57.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-59.png","element":"img","alt":" u(j)t′′+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":21.36},"width":66.39,"height":53.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-60.png","element":"img","alt":" u(j)t′′","inline":true,"padRight":true},{"text":", which implies that 
","element":"span"},{"style":{"height":19.91},"width":437.11,"height":49.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-61.png","element":"img","alt":" NWt′′+1(x+) ≥ NWt′′ (x+","inline":true},{"text":"). For ","element":"span"},{"style":{"height":10},"width":107.81,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-62.png","element":"img","alt":" t′′ = t","inline":true,"padRight":true},{"text":"we get ","element":"span"},{"style":{"height":18.57},"width":392.06,"height":46.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-63.png","element":"img","alt":"NWt+1(x+) ≥ NWt(x+","inline":true},{"text":").","element":"span"}],[{"text":"The above argument shows that ","element":"span"},{"style":{"height":19.37},"width":366.41,"height":48.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-64.png","element":"img","alt":" NWT2 (x+) ≥ γ − 2cη","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.39},"width":257.74,"height":45.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-65.png","element":"img","alt":" −NWT2 (x−) ≥","inline":true,"padRight":true},{"text":"1. 
Since ","element":"span"},{"style":{"height":19.37},"width":222.64,"height":48.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-66.png","element":"img","alt":" NWT2 (x+) −","inline":true},{"style":{"height":18.38},"width":164.87,"height":45.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-67.png","element":"img","alt":"NWT2 (x−","inline":true},{"text":") = ","element":"span"},{"style":{"height":20.3},"width":169.42,"height":50.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-68.png","element":"img","alt":" S+T2 − P +T2","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":17.08},"width":383.14,"height":42.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-69.png","element":"img","alt":" P −T2 − S−T2, P −T2, S−T2 ≥","inline":true,"padRight":true},{"text":"0 and ","element":"span"},{"style":{"height":20.63},"width":178.66,"height":51.57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-70.png","element":"img","alt":"|P −T2| ≤ cη","inline":true,"padRight":true},{"text":"it follows that ","element":"span"},{"style":{"height":20.3},"width":139.7,"height":50.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-71.png","element":"img","alt":" S+T2 ≥ γ","inline":true,"padRight":true},{"text":"+ 1 ","element":"span"},{"style":{"height":15.59},"width":93.77,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-72.png","element":"img","alt":" − 3cη","inline":true},{"text":". 
","element":"span"},{"text":"Finally, by Lemma ","element":"span"},{"href":"#id-67","text":"E.10 ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":17.94},"width":128.32,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-73.png","element":"img","alt":" S+t ≥ γ","inline":true,"padRight":true},{"text":"+ 1 ","element":"span"},{"style":{"height":15.59},"width":93.02,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-74.png","element":"img","alt":" − 3cη","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":13.2},"width":106.81,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-75.png","element":"img","alt":" t ≥ T2","inline":true},{"text":".","element":"span"}],[{"id":"id-77","style":{"fontWeight":"bold"},"text":"Lemma E.17. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let","element":"span"}],[{"style":{"width":"61%"},"width":1084,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/26-76.png","element":"img"}],[{"style":{"width":"79%"},"width":1395,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"style":{"height":13.2},"width":73.59,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-1.png","element":"img","alt":" k ≥","inline":true,"padRight":true},{"text":"64 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and gradient descent converged to a global minimum at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Then, ","element":"span"},{"style":{"height":18.7},"width":104.67,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-2.png","element":"img","alt":"X+T ≤","inline":true,"padRight":true},{"text":"34","element":"span"},{"style":{"height":11.59},"width":33.25,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-3.png","element":"img","alt":"cη","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":18.7},"width":100.52,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-4.png","element":"img","alt":" Y +T ≤","inline":true,"padRight":true},{"text":"1 + 38","element":"span"},{"style":{"height":11.59},"width":33.24,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-5.png","element":"img","alt":"cη","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Notice that by the gradient update in Eq. 
","element":"span"},{"href":"#id-57","text":"7 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-43","text":"E.3, ","element":"a"},{"style":{"height":17.94},"width":61.14,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-6.png","element":"img","alt":" X+t","inline":true,"padRight":true},{"text":"can be strictly larger than max","element":"span"},{"style":{"height":19.96},"width":238.11,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-7.png","element":"img","alt":"�X+t−1, η��W +t","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":17.94},"width":103.6,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-8.png","element":"img","alt":" ∪ W +t","inline":true,"padRight":true},{"text":"(4)","element":"span"},{"style":{"height":19.96},"width":36.28,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-9.png","element":"img","alt":"��","inline":true,"padRight":true},{"text":"only if ","element":"span"},{"style":{"height":16.99},"width":296.1,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-10.png","element":"img","alt":" NWt−1(x+) < γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":282.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-11.png","element":"img","alt":" −NWt−1(x−) ≥","inline":true,"padRight":true},{"text":"1. 
","element":"span"},{"text":"Furthermore, in this case ","element":"span"},{"style":{"height":18.27},"width":198.96,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-12.png","element":"img","alt":" X+t − X+t−1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":19.96},"width":109.31,"height":49.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-13.png","element":"img","alt":" η��W +t","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":17.94},"width":103.6,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-14.png","element":"img","alt":" ∪ W +t","inline":true,"padRight":true},{"text":"(4)","element":"span"},{"style":{"height":19.96},"width":13,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-15.png","element":"img","alt":"��","inline":true},{"text":". By Lemma ","element":"span"},{"href":"#id-67","text":"E.10, ","element":"a"},{"style":{"height":17.94},"width":51.74,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-16.png","element":"img","alt":" S+t","inline":true,"padRight":true},{"text":"increases in this case by ","element":"span"},{"style":{"height":19.96},"width":109.31,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-17.png","element":"img","alt":"η��W +t","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":17.94},"width":103.6,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-18.png","element":"img","alt":" ∪ W +t","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.96},"width":13,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-19.png","element":"img","alt":"��","inline":true},{"text":". 
We know by Lemma ","element":"span"},{"href":"#id-70","text":"E.16 ","element":"a"},{"text":"that there exists ","element":"span"},{"style":{"height":13.19},"width":124.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-20.png","element":"img","alt":" T2 < T","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":20.3},"width":136.64,"height":50.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-21.png","element":"img","alt":" S+T2 ≥ γ","inline":true,"padRight":true},{"text":"+ 1 ","element":"span"},{"style":{"height":15.59},"width":93.16,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-22.png","element":"img","alt":" − 3cη","inline":true,"padRight":true},{"text":"and that ","element":"span"},{"style":{"height":16.98},"width":236.07,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-23.png","element":"img","alt":" NWt(x+) < γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":234.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-24.png","element":"img","alt":" −NWt(x−) ≥","inline":true,"padRight":true},{"text":"1 only for ","element":"span"},{"style":{"height":13.19},"width":106.81,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-25.png","element":"img","alt":" t > T2","inline":true},{"text":". 
Since ","element":"span"},{"style":{"height":17.94},"width":128.32,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-26.png","element":"img","alt":" S+t ≤ γ","inline":true,"padRight":true},{"text":"+1+8","element":"span"},{"style":{"height":11.59},"width":33.24,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-27.png","element":"img","alt":"cη","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"by Lemma ","element":"span"},{"href":"#id-69","text":"E.13, ","element":"a"},{"text":"there can only be at most ","element":"span"},{"style":{"height":10.8},"width":61.01,"height":26.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-28.png","element":"img","alt":"11cη","inline":true}],[{"text":"It follows that","element":"span"}],[{"style":{"width":"52%"},"width":925,"height":341,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-29.png","element":"img"}],[{"text":"where the second inequality follows by Lemma ","element":"span"},{"href":"#id-66","text":"E.2 ","element":"a"},{"text":"and the third inequality by the assumption on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". 
At convergence we have ","element":"span"},{"style":{"height":16},"width":152.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-30.png","element":"img","alt":" NWT (x−","inline":true},{"text":") = ","element":"span"},{"style":{"height":15.48},"width":51.74,"height":38.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-31.png","element":"img","alt":" S−T","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":18.81},"width":499.37,"height":47.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-32.png","element":"img","alt":" X+T − Y +T − P −T ≥ −1 − 3cη","inline":true,"padRight":true},{"text":"by Lemma ","element":"span"},{"href":"#id-69","text":"E.13 ","element":"a"},{"text":"(recall ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":14.72},"width":55.56,"height":36.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-33.png","element":"img","alt":" R−t","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":17.94},"width":55.57,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-34.png","element":"img","alt":" R+t","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":17.94},"width":168.36,"height":44.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-35.png","element":"img","alt":" X+t − Y +t","inline":true,"padRight":true},{"text":"). 
Furthermore, ","element":"span"},{"style":{"height":15.49},"width":100.17,"height":38.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-36.png","element":"img","alt":" P −T ≥","inline":true,"padRight":true},{"text":"0 and by Lemma ","element":"span"},{"href":"#id-71","text":"E.8 ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":15.59},"width":140.17,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-37.png","element":"img","alt":" S−T ≤ cη","inline":true},{"text":". Therefore, ","element":"span"},{"text":"we get ","element":"span"},{"style":{"height":18.7},"width":100.52,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-38.png","element":"img","alt":" Y +T ≤","inline":true,"padRight":true},{"text":"1 + 38","element":"span"},{"style":{"height":11.59},"width":33.25,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-39.png","element":"img","alt":"cη","inline":true},{"text":".","element":"span"}],[{"text":"We are now ready to prove the main result of this section.","element":"span"}],[{"id":"id-82","style":{"fontWeight":"bold"},"text":"Proposition E.18. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Define ","element":"span"},{"style":{"height":16},"width":62.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-40.png","element":"img","alt":" β(γ","inline":true},{"text":") = ","element":"span"},{"style":{"height":27.36},"width":127.5,"height":68.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-41.png","element":"img","alt":"γ−40 14 cη39cη+1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". 
Assume that ","element":"span"},{"style":{"height":14},"width":67.5,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-42.png","element":"img","alt":" γ ≥","inline":true,"padRight":true},{"text":"2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":13.2},"width":66.64,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-43.png","element":"img","alt":" k ≥","inline":true,"padRight":true},{"text":"64","element":"span"},{"style":{"height":32.3},"width":176.82,"height":80.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-44.png","element":"img","alt":"((β(γ)+1)/(β(γ)−1))^2","inline":true},{"style":{"fontStyle":"italic"},"text":". Then with","element":"span"}],[{"style":{"width":"100%"},"width":1757,"height":138,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-45.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"With probability at least 1","element":"span"},{"style":{"height":4.4},"width":31,"height":11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-46.png","element":"img","alt":"−","inline":true}],[{"text":"show generalization on positive points. Assume that gradient descent converged to a global minimum at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". Let (","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"z","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) be a positive point. 
Then there exists ","element":"span"},{"style":{"height":16},"width":166.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-47.png","element":"img","alt":" zi ∈ {(1,","inline":true,"padRight":true},{"text":"1)","element":"span"},{"style":{"height":16},"width":188.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-48.png","element":"img","alt":", (−1, −1)}","inline":true},{"text":". Assume without loss of generality that ","element":"span"},{"style":{"height":9.59},"width":34.8,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-49.png","element":"img","alt":" zi","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":14},"width":99.63,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-50.png","element":"img","alt":"−1, −","inline":true},{"text":"1) = ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-51.png","element":"img","alt":" p3","inline":true},{"text":". 
Define","element":"span"}],[{"style":{"width":"79%"},"width":1390,"height":607,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/27-52.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":16.98},"width":513.3,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-0.png","element":"img","alt":" NWT (x+) ≥ γ, −NWT (x−) ≥","inline":true,"padRight":true},{"text":"1,","element":"span"},{"style":{"height":19.97},"width":170.96,"height":49.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-1.png","element":"img","alt":"|P −T| ≤ cη","inline":true,"padRight":true},{"text":"by Lemma ","element":"span"},{"href":"#id-71","text":"E.8 ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":18.7},"width":170.99,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-2.png","element":"img","alt":" P +T , S−T ≥","inline":true,"padRight":true},{"text":"0, we obtain","element":"span"}],[{"id":"id-75","style":{"width":"64%"},"width":1127,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-3.png","element":"img"}],[{"text":"Furthermore, by Lemma ","element":"span"},{"href":"#id-74","text":"E.9 ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"99%"},"width":1752,"height":335,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-4.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":16},"width":62.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-5.png","element":"img","alt":" α(k","inline":true},{"text":") =","element":"span"}],[{"style":{"width":"74%"},"width":1303,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-6.png","element":"img"}],[{"text":"which implies together with Eq. 
","element":"span"},{"href":"#id-75","text":"15 ","element":"a"},{"text":"that ","element":"span"},{"style":{"height":18.7},"width":61.14,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-7.png","element":"img","alt":" X+T","inline":true,"padRight":true},{"text":"(3) ","element":"span"},{"style":{"height":29.43},"width":176.84,"height":73.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-8.png","element":"img","alt":" ≥ γ+1−5cη41+α(k)","inline":true,"padRight":true},{"text":". Therefore,","element":"span"}],[{"id":"id-76","style":{"width":"73%"},"width":1292,"height":290,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-9.png","element":"img"}],[{"text":"where the first inequality is true because","element":"span"}],[{"style":{"width":"96%"},"width":1694,"height":188,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-10.png","element":"img"}],[{"text":"The second inequality in Eq. ","element":"span"},{"href":"#id-76","text":"18 ","element":"a"},{"text":"follows since ","element":"span"},{"style":{"height":18.81},"width":143.95,"height":47.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-11.png","element":"img","alt":" P +T ≤ cη","inline":true,"padRight":true},{"text":"and by applying Lemma ","element":"span"},{"href":"#id-77","text":"E.17. ","element":"a"},{"text":"Finally, the last ","element":"span"},{"text":"inequality in Eq. ","element":"span"},{"href":"#id-76","text":"18 ","element":"a"},{"text":"follows by the assumption on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". 
","element":"span"},{"style":{"height":7.6},"width":31.9,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-12.png","element":"img","alt":"12","inline":true,"padRight":true},{"text":"Hence, ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"z ","element":"span"},{"text":"is classified correctly.","element":"span"}],[{"id":"id-55","style":{"fontWeight":"bold"},"text":"E.0.8 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Generalization on Negative Class","element":"span"}],[{"text":"We will need the following lemmas.","element":"span"}],[{"id":"id-78","style":{"width":"70%"},"width":1236,"height":558,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/28-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof is similar to the proof of Lemma ","element":"span"},{"href":"#id-66","text":"E.2 ","element":"a"},{"text":"and follows from the fact that","element":"span"}],[{"style":{"width":"74%"},"width":1316,"height":344,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-0.png","element":"img"}],[{"id":"id-80","style":{"fontWeight":"bold"},"text":"Lemma E.20. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let","element":"span"}],[{"style":{"width":"76%"},"width":1343,"height":296,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":14},"width":125.7,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-2.png","element":"img","alt":" X, Y ≥","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":19.96},"width":209.11,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-3.png","element":"img","alt":" |X| ≤ η��U +0","inline":true,"padRight":true},{"text":"(2)","element":"span"},{"style":{"height":19.96},"width":242.77,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-4.png","element":"img","alt":"��, |Y | ≤ η��U +0","inline":true,"padRight":true},{"text":"(4)","element":"span"},{"style":{"height":19.96},"width":13,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-5.png","element":"img","alt":"��","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":28.69},"width":282.84,"height":71.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-6.png","element":"img","alt":" X−t −X|U +0 (2)| = Y −t −Y|U +0 (4)|","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"First, we will prove that for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"there exists ","element":"span"},{"style":{"height":13.19},"width":111.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-7.png","element":"img","alt":" at ∈ Z","inline":true,"padRight":true},{"text":"such that for ","element":"span"},{"style":{"height":15.05},"width":140,"height":37.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-8.png","element":"img","alt":" j1 ∈ U −0","inline":true,"padRight":true},{"text":"(2) and ","element":"span"},{"style":{"height":15.05},"width":140,"height":37.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-9.png","element":"img","alt":" j2 ∈ U −0","inline":true,"padRight":true},{"text":"(4) ","element":"span"},{"text":"it holds that ","element":"span"},{"style":{"height":20.93},"width":360.15,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-10.png","element":"img","alt":" u(j1)t = u(j1)0 + atηp2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":360.14,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-11.png","element":"img","alt":" u(j2)t = u(j2)0 − atηp2","inline":true},{"text":". ","element":"span"},{"style":{"height":7.6},"width":31.9,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-12.png","element":"img","alt":"13","inline":true,"padRight":true},{"text":"We will prove this by induction on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":".","element":"span"}],[{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0 this clearly holds. 
Assume it holds for an iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". Let ","element":"span"},{"style":{"height":15.05},"width":139.55,"height":37.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-13.png","element":"img","alt":" j1 ∈ U −0","inline":true,"padRight":true},{"text":"(2) and ","element":"span"},{"style":{"height":15.05},"width":139.55,"height":37.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-14.png","element":"img","alt":" j2 ∈ U −0","inline":true,"padRight":true},{"text":"(4). By ","element":"span"},{"text":"the induction hypothesis, there exists ","element":"span"},{"style":{"height":13.19},"width":121.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-15.png","element":"img","alt":" aT ∈ Z","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":20.93},"width":351.92,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-16.png","element":"img","alt":" u(j1)t = u(j1)0 +atηp2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":351.92,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-17.png","element":"img","alt":" u(j2)t = u(j2)0 −atηp2","inline":true},{"text":". 
Since for all 1 ","element":"span"},{"style":{"height":14},"width":152.53,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-18.png","element":"img","alt":" ≤ j ≤ k","inline":true,"padRight":true},{"text":"it holds that","element":"span"},{"style":{"height":29.53},"width":293.75,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-19.png","element":"img","alt":"��u(j)0 · p2�� < √2η4","inline":true,"padRight":true},{"text":", it follows that either ","element":"span"},{"style":{"height":15.05},"width":56.55,"height":37.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-20.png","element":"img","alt":" U −0","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"style":{"height":14.72},"width":104.5,"height":36.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-21.png","element":"img","alt":" ⊆ U −t","inline":true,"padRight":true},{"text":"(2) and","element":"span"}],[{"style":{"width":"99%"},"width":1756,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-22.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":235.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-23.png","element":"img","alt":" a ∈ {−1, 0, 1}","inline":true},{"text":". 
Hence, ","element":"span"},{"style":{"height":22.53},"width":80.33,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-24.png","element":"img","alt":" u(j1)t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":305.42,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-25.png","element":"img","alt":" u(j1)0 +(at+a)ηp2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":441.34,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-26.png","element":"img","alt":" u(j2)t = u(j2)0 −(at+a)ηp2","inline":true},{"text":". This concludes the proof by induction.","element":"span"}],[{"text":"Now, consider an iteration ","element":"span"},{"style":{"height":18.27},"width":182.16,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-27.png","element":"img","alt":" t, j1 ∈ U +0","inline":true,"padRight":true},{"text":"(2), ","element":"span"},{"style":{"height":18.27},"width":142.36,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-28.png","element":"img","alt":" j2 ∈ U +0","inline":true,"padRight":true},{"text":"(4) and the integer ","element":"span"},{"style":{"height":9.19},"width":33.06,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-29.png","element":"img","alt":" at","inline":true,"padRight":true},{"text":"defined above. 
If ","element":"span"},{"style":{"height":12.8},"width":78.57,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-30.png","element":"img","alt":" at ≥","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"text":"then","element":"span"}],[{"style":{"width":"88%"},"width":1546,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-31.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"88%"},"width":1546,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-32.png","element":"img"}],[{"text":"Define ","element":"span"},{"style":{"height":15.05},"width":150.42,"height":37.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-33.png","element":"img","alt":" X = X−0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.05},"width":142.12,"height":37.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-34.png","element":"img","alt":" Y = Y −0","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":19.96},"width":209.11,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-35.png","element":"img","alt":" |X| ≤ η��U −0","inline":true,"padRight":true},{"text":"(2)","element":"span"},{"style":{"height":19.96},"width":242.6,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-36.png","element":"img","alt":"��, |Y | ≤ η��U 
−0","inline":true,"padRight":true},{"text":"(4)","element":"span"},{"style":{"height":19.96},"width":13,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-37.png","element":"img","alt":"��","inline":true,"padRight":true},{"text":"and","element":"span"}],[{"style":{"width":"78%"},"width":1371,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/29-38.png","element":"img"}],[{"style":{"width":"111%"},"width":1952,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-0.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"111%"},"width":1952,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-1.png","element":"img"}],[{"text":"Define","element":"span"}],[{"style":{"width":"83%"},"width":1469,"height":281,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-2.png","element":"img"}],[{"text":"Since for all 1 ","element":"span"},{"style":{"height":14},"width":138.65,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-3.png","element":"img","alt":" ≤ j ≤ k","inline":true,"padRight":true},{"text":"it holds that","element":"span"},{"style":{"height":29.53},"width":284.5,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-4.png","element":"img","alt":"��u(j)0 · p2�� <√2η4","inline":true,"padRight":true},{"text":", we have ","element":"span"},{"style":{"height":19.96},"width":211.62,"height":49.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-5.png","element":"img","alt":" |X| ≤ η��U −0","inline":true,"padRight":true},{"text":"(2)","element":"span"},{"style":{"height":19.96},"width":246.05,"height":49.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-6.png","element":"img","alt":"��, |Y | ≤ η��U 
−0","inline":true,"padRight":true},{"text":"(4)","element":"span"},{"style":{"height":19.96},"width":13,"height":49.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-7.png","element":"img","alt":"��","inline":true},{"text":". Furthermore,","element":"span"}],[{"style":{"width":"78%"},"width":1376,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-8.png","element":"img"}],[{"text":"which concludes the proof.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma E.21. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let","element":"span"}],[{"style":{"width":"83%"},"width":1468,"height":306,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then for all ","element":"span"},{"style":{"height":19.2},"width":288.95,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-10.png","element":"img","alt":" t, X−t −X−0","inline":true}],[{"style":{"width":"101%"},"width":1782,"height":203,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-11.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0 this clearly holds. Assume it holds for an iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
Let ","element":"span"},{"style":{"height":19.51},"width":157.82,"height":48.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-12.png","element":"img","alt":" j1 ∈�U +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":91.98,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-13.png","element":"img","alt":" ∪ U +0","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.2},"width":114.02,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-14.png","element":"img","alt":"�∩ U −0","inline":true,"padRight":true},{"text":"(2) ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":19.51},"width":158.17,"height":48.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-15.png","element":"img","alt":" j2 ∈�U +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":91.98,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-16.png","element":"img","alt":" ∪ U +0","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.2},"width":119.24,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-17.png","element":"img","alt":"�∩ U −0","inline":true,"padRight":true},{"text":"(4). 
By the induction hypothesis, there exists an integer ","element":"span"},{"style":{"height":12.8},"width":77.34,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-18.png","element":"img","alt":" at ≥","inline":true,"padRight":true},{"text":"0 such ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":20.93},"width":463.62,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-19.png","element":"img","alt":" u(j1)t · p2 = u(j1)0 · p2 + ηat","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":463.62,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-20.png","element":"img","alt":" u(j2)t · p4 = u(j2)0 · p4 + ηat","inline":true},{"text":". Since for all 1 ","element":"span"},{"style":{"height":14},"width":135.39,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-21.png","element":"img","alt":" ≤ j ≤ k","inline":true,"padRight":true},{"text":"it holds that ","element":"span"},{"style":{"height":29.53},"width":281.99,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-22.png","element":"img","alt":"|u(j)0 · p1| < √2η/4","inline":true,"padRight":true},{"text":", it follows that if ","element":"span"},{"style":{"height":12.8},"width":77.16,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-23.png","element":"img","alt":" at ≥","inline":true,"padRight":true},{"text":"1 we have the following update at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"+ 
1:","element":"span"}],[{"style":{"width":"98%"},"width":1739,"height":376,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/30-24.png","element":"img"}],[{"style":{"width":"63%"},"width":1111,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-0.png","element":"img"}],[{"text":"such that ","element":"span"},{"style":{"height":16},"width":168.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-1.png","element":"img","alt":" a ∈ {0, 1}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":303.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-2.png","element":"img","alt":" b1, b2 ∈ {−1, 0, 1}","inline":true},{"text":". Hence, ","element":"span"},{"style":{"height":22.53},"width":151.97,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-3.png","element":"img","alt":" u(j1)t+1 · p2","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":345,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-4.png","element":"img","alt":" u(j1)0 · p2 + η(at + a","inline":true},{"text":") and ","element":"span"},{"style":{"height":22.53},"width":151.97,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-5.png","element":"img","alt":" u(j2)t+1 · p4","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.93},"width":343.6,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-6.png","element":"img","alt":"u(j2)0 · p4 + η(at + a","inline":true},{"text":"). 
This concludes the proof by induction.","element":"span"}],[{"text":"Now, consider an iteration ","element":"span"},{"style":{"height":19.51},"width":202.72,"height":48.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-7.png","element":"img","alt":" t, j1 ∈ (U +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":91.97,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-8.png","element":"img","alt":" ∪ U +0","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.2},"width":120.89,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-9.png","element":"img","alt":") ∩ U −0","inline":true,"padRight":true},{"text":"(2) and ","element":"span"},{"style":{"height":19.51},"width":162.3,"height":48.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-10.png","element":"img","alt":" j2 ∈ (U +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":91.98,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-11.png","element":"img","alt":" ∪ U +0","inline":true,"padRight":true},{"text":"(3)","element":"span"},{"style":{"height":19.2},"width":120.89,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-12.png","element":"img","alt":") ∩ U −0","inline":true,"padRight":true},{"text":"(4) ","element":"span"},{"text":"and the integer ","element":"span"},{"style":{"height":9.19},"width":33.06,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-13.png","element":"img","alt":" at","inline":true,"padRight":true},{"text":"defined above. 
We have,","element":"span"}],[{"style":{"width":"88%"},"width":1546,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-14.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"88%"},"width":1546,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-15.png","element":"img"}],[{"text":"It follows that","element":"span"}],[{"style":{"width":"61%"},"width":1084,"height":418,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-16.png","element":"img"}],[{"text":"which concludes the proof.","element":"span"}],[{"text":"We are now ready to prove the main result of this section.","element":"span"}],[{"id":"id-83","style":{"fontWeight":"bold"},"text":"Proposition E.22. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Define ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-17.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":27.36},"width":124.92,"height":68.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-18.png","element":"img","alt":"1−36 14 cη35cη","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". Assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k > ","element":"span"},{"text":"64","element":"span"},{"style":{"height":32.3},"width":133.46,"height":80.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-19.png","element":"img","alt":"�β+1β−1�2","inline":true},{"style":{"fontStyle":"italic"},"text":". Then with probability at least","element":"span"}],[{"style":{"width":"99%"},"width":1754,"height":137,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-20.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"With probability at least 1","element":"span"},{"style":{"height":4.4},"width":31,"height":11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-21.png","element":"img","alt":"−","inline":true}],[{"style":{"height":25.45},"width":227.98,"height":63.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-22.png","element":"img","alt":"√2k√πe8k −16e−8","inline":true,"padRight":true},{"text":"Proposition ","element":"span"},{"href":"#id-72","text":"E.14 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-78","text":"E.19 ","element":"a"},{"text":"hold. It suffices to show generalization on negative points. Assume that gradient descent converged to a global minimum at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". Let (","element":"span"},{"style":{"height":10.4},"width":72.51,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-23.png","element":"img","alt":"z, −","inline":true},{"text":"1) be a negative point. Assume without loss of generality that ","element":"span"},{"style":{"height":11.1},"width":135.31,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-24.png","element":"img","alt":" zi = p2","inline":true,"padRight":true},{"text":"for all 1 ","element":"span"},{"style":{"height":13.2},"width":129.92,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-25.png","element":"img","alt":" ≤ i ≤ d","inline":true},{"text":". 
Define the following sums for ","element":"span"},{"style":{"height":16},"width":158.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-26.png","element":"img","alt":" l ∈ {2, 4}","inline":true},{"text":",","element":"span"}],[{"style":{"width":"69%"},"width":1225,"height":412,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-27.png","element":"img"}],[{"text":"First, we notice that","element":"span"}],[{"style":{"width":"57%"},"width":1014,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/31-28.png","element":"img"}],[{"id":"id-81","style":{"width":"99%"},"width":1753,"height":280,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-0.png","element":"img"}],[{"text":"We note that by the analysis in Lemma ","element":"span"},{"href":"#id-78","text":"E.19, ","element":"a"},{"text":"it holds that for any ","element":"span"},{"style":{"height":18.27},"width":178.16,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-1.png","element":"img","alt":" t, j1 ∈ U +0","inline":true,"padRight":true},{"text":"(2) and ","element":"span"},{"style":{"height":18.27},"width":139.55,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-2.png","element":"img","alt":" j2 ∈ U +0","inline":true,"padRight":true},{"text":"(4), ","element":"span"},{"text":"either ","element":"span"},{"style":{"height":17.94},"width":155.27,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-3.png","element":"img","alt":" j1 ∈ U +t","inline":true,"padRight":true},{"text":"(2) and ","element":"span"},{"style":{"height":17.94},"width":155.27,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-4.png","element":"img","alt":" j2 ∈ U +t","inline":true,"padRight":true},{"text":"(4), or 
","element":"span"},{"style":{"height":17.94},"width":155.27,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-5.png","element":"img","alt":" j1 ∈ U +t","inline":true,"padRight":true},{"text":"(4) and ","element":"span"},{"style":{"height":17.94},"width":155.27,"height":44.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-6.png","element":"img","alt":" j2 ∈ U +t","inline":true,"padRight":true},{"text":"(2). We assume without loss of ","element":"span"},{"text":"generality that ","element":"span"},{"style":{"height":18.7},"width":139.55,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-7.png","element":"img","alt":" j1 ∈ U +T","inline":true,"padRight":true},{"text":"(2) and ","element":"span"},{"style":{"height":18.7},"width":139.55,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-8.png","element":"img","alt":" j2 ∈ U +T","inline":true,"padRight":true},{"text":"(4). 
It follows that in this case ","element":"span"},{"style":{"height":14.79},"width":80.99,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-9.png","element":"img","alt":" NWT","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.68},"width":144.16,"height":41.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-10.png","element":"img","alt":"z) ≤ S−T","inline":true,"padRight":true},{"text":"+","element":"span"},{"style":{"height":15.48},"width":158.64,"height":38.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-11.png","element":"img","alt":"X−T −Z−T","inline":true,"padRight":true},{"text":"(2)","element":"span"},{"style":{"height":4.4},"width":31,"height":11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-12.png","element":"img","alt":"−","inline":true},{"style":{"height":15.48},"width":56.99,"height":38.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-13.png","element":"img","alt":"Y −T","inline":true,"padRight":true},{"text":"(2). 
","element":"span"},{"href":"#id-79","style":{"height":7.6},"width":31.9,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-14.png","element":"img","alt":" 14","inline":true},{"text":"Otherwise we would replace ","element":"span"},{"style":{"height":15.48},"width":56.99,"height":38.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-15.png","element":"img","alt":" Y −T","inline":true,"padRight":true},{"text":"(2) with ","element":"span"},{"style":{"height":15.48},"width":56.99,"height":38.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-16.png","element":"img","alt":" Y −T","inline":true,"padRight":true},{"text":"(4) and vice versa and continue with the same ","element":"span"},{"text":"proof.","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":16},"width":62.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-17.png","element":"img","alt":" α(k","inline":true},{"text":") =","element":"span"}],[{"style":{"width":"58%"},"width":1019,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-18.png","element":"img"}],[{"text":"and by Lemma ","element":"span"},{"href":"#id-80","text":"E.20 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-78","text":"E.19 ","element":"a"},{"text":"there exists ","element":"span"},{"style":{"height":15.59},"width":118.37,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-19.png","element":"img","alt":" Y ≤ cη","inline":true,"padRight":true},{"text":"such that:","element":"span"}],[{"style":{"width":"42%"},"width":755,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-20.png","element":"img"}],[{"text":"Plugging these inequalities in Eq. 
","element":"span"},{"href":"#id-81","text":"21 ","element":"a"},{"text":"we get:","element":"span"}],[{"style":{"width":"56%"},"width":988,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-21.png","element":"img"}],[{"text":"which implies that","element":"span"}],[{"style":{"width":"94%"},"width":1661,"height":322,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-22.png","element":"img"}],[{"text":"where the last inequality holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k > ","element":"span"},{"text":"64","element":"span"},{"style":{"height":32.3},"width":133.46,"height":80.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-23.png","element":"img","alt":"�β+1β−1�2","inline":true},{"text":". ","element":"span"},{"style":{"height":8},"width":31.9,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-24.png","element":"img","alt":"15","inline":true,"padRight":true},{"text":"Therefore, ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"z ","element":"span"},{"text":"is classified correctly.","element":"span"}],[{"id":"id-56","style":{"fontWeight":"bold"},"text":"E.0.9 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Finishing the Proof","element":"span"}],[{"style":{"width":"99%"},"width":1756,"height":135,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-25.png","element":"img"}],[{"style":{"height":14.4},"width":38.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-26.png","element":"img","alt":"β1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"href":"#id-78","style":{"height":27.36},"width":127.5,"height":68.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-27.png","element":"img","alt":"γ−40 14 cη39cη+1","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":14.4},"width":38.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-28.png","element":"img","alt":" β2","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":27.36},"width":124.93,"height":68.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-29.png","element":"img","alt":"1−36 14 cη35cη","inline":true,"padRight":true},{"text":"and let ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-30.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"= max","element":"span"},{"style":{"height":16},"width":138.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-31.png","element":"img","alt":"{β1, β2}","inline":true},{"text":". For ","element":"span"},{"style":{"height":14},"width":70.3,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-32.png","element":"img","alt":" γ ≥","inline":true,"padRight":true},{"text":"8 and ","element":"span"},{"style":{"height":19.37},"width":153.16,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-33.png","element":"img","alt":" cη ≤ 1410","inline":true,"padRight":true},{"text":"it holds that ","element":"span"},{"text":"64","element":"span"},{"style":{"height":32.3},"width":182.29,"height":80.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-34.png","element":"img","alt":"�β+1β−1�2<","inline":true,"padRight":true},{"text":"120. 
By Proposition ","element":"span"},{"href":"#id-82","text":"E.18 ","element":"a"},{"text":"and Proposition ","element":"span"},{"href":"#id-83","text":"E.22, ","element":"a"},{"text":"it follows that for ","element":"span"},{"style":{"height":13.2},"width":68.96,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-35.png","element":"img","alt":" k ≥","inline":true,"padRight":true},{"text":"120 gradient descent converges to a global minimum which classifies all points correctly.","element":"span"}],[{"text":"We will now prove the clustering effect at a global minimum. ","element":"span"},{"text":"By Lemma ","element":"span"},{"href":"#id-70","text":"E.16 ","element":"a"},{"text":"it holds that ","element":"span"},{"style":{"height":18.7},"width":128.32,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-36.png","element":"img","alt":"S+T ≥ γ","inline":true,"padRight":true},{"text":"+ 1 ","element":"span"},{"style":{"height":15.59},"width":212.27,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-37.png","element":"img","alt":" − 3cη ≥ γ −","inline":true,"padRight":true},{"text":"1. Therefore, by Lemma ","element":"span"},{"href":"#id-63","text":"E.5 ","element":"a"},{"text":"it follows that","element":"span"}],[{"id":"id-79","style":{"width":"73%"},"width":1285,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/32-38.png","element":"img"}],[{"text":"and thus ","element":"span"},{"style":{"height":23.4},"width":268.02,"height":58.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-0.png","element":"img","alt":" a+(T) ≥ γ−12cη −","inline":true,"padRight":true},{"text":"1. 
Therefore, for any ","element":"span"},{"style":{"height":18.27},"width":135.57,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-1.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") such that ","element":"span"},{"style":{"height":16},"width":159.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-2.png","element":"img","alt":" i ∈ {1, 3}","inline":true},{"text":", the cosine of the angle ","element":"span"},{"text":"between ","element":"span"},{"style":{"height":21.36},"width":73.49,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-3.png","element":"img","alt":" w(j)T","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.1},"width":34.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-4.png","element":"img","alt":" pi","inline":true,"padRight":true},{"text":"is at least","element":"span"}],[{"style":{"width":"60%"},"width":1064,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-5.png","element":"img"}],[{"text":"where we used the triangle inequality and Lemma ","element":"span"},{"href":"#id-63","text":"E.5. ","element":"a"},{"text":"The claim follows.","element":"span"}]]},{"heading":"F Proof of Theorem 6.4","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Theorem F.1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Theorem ","element":"span"},{"href":"#id-36","style":{"fontStyle":"italic","fontWeight":"bold"},"text":"6.4 ","element":"a"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"restated","element":"span"},{"style":{"fontStyle":"italic"},"text":") Assume that gradient descent runs with parameters ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-6.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":18.38},"width":29.24,"height":45.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-7.png","element":"img","alt":"cηk","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":24.51},"width":353.54,"height":61.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-8.png","element":"img","alt":" cη ≤ 141, σg ≤ cη16k32","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14},"width":71,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-9.png","element":"img","alt":" γ ≥","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then, with probability at least ","element":"span"},{"text":"(1 ","element":"span"},{"style":{"height":19.37},"width":115.92,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-10.png","element":"img","alt":" − c) 3348","inline":true},{"style":{"fontStyle":"italic"},"text":", gradient descent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"converges to a global minimum that does not recover ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-11.png","element":"img","alt":" f ∗","inline":true},{"style":{"fontStyle":"italic"},"text":". Furthermore, there exists ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":12.8},"width":106.74,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-12.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that the global minimum misclassifies all points ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":13.99},"width":142.17,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-13.png","element":"img","alt":" Px = Ai","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"We refer to Eq. ","element":"span"},{"href":"#id-84","text":"14 ","element":"a"},{"text":"in the proof of Proposition ","element":"span"},{"href":"#id-72","text":"E.14. ","element":"a"},{"text":"To show convergence and provide convergence rates of gradient descent, the proof uses Lemma ","element":"span"},{"href":"#id-66","text":"E.2. 
","element":"a"},{"text":"However, to only show convergence, it suffices to bound the probability that ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-14.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":104.86,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-15.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(3) ","element":"span"},{"style":{"height":16.4},"width":65.22,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-16.png","element":"img","alt":" ̸= ∅","inline":true,"padRight":true},{"text":"and that the initialization satisfies Lemma ","element":"span"},{"href":"#id-43","text":"E.3. ","element":"a"},{"text":"Given that Lemma ","element":"span"},{"href":"#id-43","text":"E.3 ","element":"a"},{"text":"holds (with probability at least 1 ","element":"span"},{"style":{"height":28.8},"width":77.52,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-17.png","element":"img","alt":" −�","inline":true}],[{"style":{"width":"99%"},"width":1755,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-18.png","element":"img"}],[{"text":"and ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-19.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":100.73,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-20.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(3) 
","element":"span"},{"style":{"height":16.4},"width":62.06,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-21.png","element":"img","alt":" ̸= ∅","inline":true,"padRight":true},{"text":"which implies that gradient descent converges to a global minimum. For the ","element":"span"},{"text":"rest of the proof we will condition on the corresponding event. Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"be the iteration in which gradient descent converges to a global minimum. Note that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is a random variable. Denote the network at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":". For all ","element":"span"},{"style":{"height":14.18},"width":134.17,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-22.png","element":"img","alt":" z ∈ R2d","inline":true,"padRight":true},{"text":"denote","element":"span"}],[{"style":{"width":"90%"},"width":1590,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-23.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"denote the event for which at least one of the following holds:","element":"span"}],[{"text":"1. ","element":"span"},{"style":{"height":18.7},"width":68.17,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-24.png","element":"img","alt":" W +T","inline":true,"padRight":true},{"text":"(1) = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-25.png","element":"img","alt":" ∅","inline":true},{"text":".","element":"span"}],[{"text":"2. 
","element":"span"},{"style":{"height":18.7},"width":68.17,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-26.png","element":"img","alt":" W +T","inline":true,"padRight":true},{"text":"(3) = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-27.png","element":"img","alt":" ∅","inline":true},{"text":".","element":"span"}],[{"text":"3. ","element":"span"},{"style":{"height":18.08},"width":182.6,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-28.png","element":"img","alt":" u(1) · p2 >","inline":true,"padRight":true},{"text":"0 and ","element":"span"},{"style":{"height":18.08},"width":182.6,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-29.png","element":"img","alt":" u(2) · p2 >","inline":true,"padRight":true},{"text":"0.","element":"span"}],[{"text":"4. ","element":"span"},{"style":{"height":18.08},"width":182.6,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-30.png","element":"img","alt":" u(1) · p4 >","inline":true,"padRight":true},{"text":"0 and ","element":"span"},{"style":{"height":18.08},"width":182.6,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-31.png","element":"img","alt":" u(2) · p4 >","inline":true,"padRight":true},{"text":"0.","element":"span"}],[{"text":"Our proof will proceed as follows. 
We will first show that if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"occurs then gradient descent does not learn ","element":"span"},{"style":{"height":14.18},"width":39.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-32.png","element":"img","alt":" f ∗","inline":true},{"text":", i.e., the network ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"does not satisfy sign (","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":")) = ","element":"span"},{"style":{"height":16},"width":83.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-33.png","element":"img","alt":" f ∗(x","inline":true},{"text":") for all ","element":"span"},{"style":{"height":17.38},"width":201.02,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-34.png","element":"img","alt":" x ∈ {±1}2d","inline":true},{"text":". Then, we will show that ","element":"span"},{"style":{"height":19.37},"width":174.66,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-35.png","element":"img","alt":" P [E] ≥ 11/12","inline":true},{"text":". This will conclude the proof.","element":"span"}],[{"text":"Assume that one of the first two items in the definition of the event ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"occurs. 
Without loss of generality assume that ","element":"span"},{"style":{"height":18.7},"width":68.17,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-36.png","element":"img","alt":" W +T","inline":true,"padRight":true},{"text":"(1) = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-37.png","element":"img","alt":" ∅","inline":true,"padRight":true},{"text":"and recall that ","element":"span"},{"style":{"height":8.98},"width":51.26,"height":22.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-38.png","element":"img","alt":" x−","inline":true,"padRight":true},{"text":"denotes a negative vector which only contains ","element":"span"},{"text":"the patterns ","element":"span"},{"style":{"height":11.1},"width":99.48,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-39.png","element":"img","alt":" p2, p4","inline":true,"padRight":true},{"text":"and let ","element":"span"},{"style":{"height":14.18},"width":160.63,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-40.png","element":"img","alt":" z+ ∈ R2d","inline":true,"padRight":true},{"text":"be a positive vector which only contains the patterns ","element":"span"},{"style":{"height":11.1},"width":159.02,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-41.png","element":"img","alt":" p1, p2, p4","inline":true},{"text":". 
By the assumption ","element":"span"},{"style":{"height":18.7},"width":68.17,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-42.png","element":"img","alt":" W +T","inline":true,"padRight":true},{"text":"(1) = ","element":"span"},{"style":{"height":13.6},"width":20,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-43.png","element":"img","alt":" ∅","inline":true,"padRight":true},{"text":"and the fact that ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-44.png","element":"img","alt":" p1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":11.1},"width":70.94,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-45.png","element":"img","alt":" −p3","inline":true,"padRight":true},{"text":"it follows that for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2,","element":"span"}],[{"style":{"width":"80%"},"width":1406,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-46.png","element":"img"}],[{"text":"Furthermore, since ","element":"span"},{"style":{"height":12.98},"width":48.8,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-47.png","element":"img","alt":" z+","inline":true,"padRight":true},{"text":"contains more distinct patterns than ","element":"span"},{"style":{"height":8.98},"width":51.26,"height":22.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-48.png","element":"img","alt":" x−","inline":true},{"text":", it follows that for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", 
","element":"span"},{"text":"2,","element":"span"}],[{"style":{"width":"78%"},"width":1378,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/33-49.png","element":"img"}],[{"style":{"width":"99%"},"width":1753,"height":137,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-0.png","element":"img"}],[{"text":"be the negative vector with all of its patterns equal to ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-1.png","element":"img","alt":" p4","inline":true},{"text":". It is clear that ","element":"span"},{"style":{"height":16},"width":162.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-2.png","element":"img","alt":" N(z−) ≥","inline":true,"padRight":true},{"text":"0 and therefore ","element":"span"},{"style":{"height":8.98},"width":48.8,"height":22.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-3.png","element":"img","alt":"z−","inline":true,"padRight":true},{"text":"is not classified correctly. This concludes the first part of the proof. 
We will now proceed to show that ","element":"span"},{"style":{"height":19.37},"width":174.66,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-4.png","element":"img","alt":" P [E] ≥ 11/12","inline":true},{"text":".","element":"span"}],[{"text":"Denote by ","element":"span"},{"style":{"height":13.99},"width":40.89,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-5.png","element":"img","alt":" Ai","inline":true,"padRight":true},{"text":"the event that item ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"in the definition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"occurs, and for an event ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"denote by ","element":"span"},{"style":{"height":11.6},"width":43.88,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-6.png","element":"img","alt":" Ac","inline":true,"padRight":true},{"text":"its complement. 
Thus ","element":"span"},{"style":{"height":10.8},"width":45.72,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-7.png","element":"img","alt":" Ec","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":17.53},"width":124.07,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-8.png","element":"img","alt":" ∩4i=1Aci","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":87.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-9.png","element":"img","alt":" P [Ec","inline":true},{"text":"] = ","element":"span"},{"style":{"height":16},"width":552.59,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-10.png","element":"img","alt":" P [Ac3 ∩ Ac4 | Ac1 ∩ Ac2] P [Ac1 ∩ Ac2","inline":true},{"text":"]. ","element":"span"},{"text":"We will first calculate ","element":"span"},{"style":{"height":16},"width":180,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-11.png","element":"img","alt":" P [Ac1 ∩ Ac2","inline":true},{"text":"]. By Lemma ","element":"span"},{"href":"#id-63","text":"E.5, ","element":"a"},{"text":"we know that for ","element":"span"},{"style":{"height":18.27},"width":254.17,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-12.png","element":"img","alt":" i ∈ {1, 3}, W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") = ","element":"span"},{"style":{"height":18.7},"width":68.17,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-13.png","element":"img","alt":" W +T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"). 
","element":"span"},{"text":"Therefore, it suffices to calculate the probability that ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-14.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":16.4},"width":69.71,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-15.png","element":"img","alt":" ≠ ∅","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-16.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(3) ","element":"span"},{"style":{"height":16.4},"width":69.71,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-17.png","element":"img","alt":" ≠ ∅","inline":true},{"text":", provided that ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-18.png","element":"img","alt":"W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":107.53,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-19.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(3) ","element":"span"},{"style":{"height":16.4},"width":71.9,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-20.png","element":"img","alt":" ≠ ∅","inline":true},{"text":". 
","element":"span"},{"text":"Without conditioning on ","element":"span"},{"style":{"height":18.27},"width":68.17,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-21.png","element":"img","alt":" W +0","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.27},"width":107.53,"height":45.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-22.png","element":"img","alt":" ∪ W +0","inline":true,"padRight":true},{"text":"(3) ","element":"span"},{"style":{"height":16.4},"width":71.9,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-23.png","element":"img","alt":" ≠ ∅","inline":true},{"text":", for each 1 ","element":"span"},{"style":{"height":12.8},"width":117.54,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-24.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4 and ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":13.6},"width":107.68,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-25.png","element":"img","alt":" ≤ j ≤","inline":true,"padRight":true},{"text":"2, the event that ","element":"span"},{"style":{"height":18.27},"width":140.42,"height":45.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-26.png","element":"img","alt":" j ∈ W +0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") holds with probability ","element":"span"},{"style":{"height":19.37},"width":16,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-27.png","element":"img","alt":" 1/4","inline":true},{"text":". 
Since the initializations of the filters ","element":"span"},{"text":"are independent, we have ","element":"span"},{"style":{"height":16},"width":180,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-28.png","element":"img","alt":" P [Ac1 ∩ Ac2","inline":true},{"text":"] = ","element":"span"},{"style":{"height":19.37},"width":81.35,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-29.png","element":"img","alt":" 1/6. 16","inline":true},{"text":" We will show that ","element":"span"},{"style":{"height":16},"width":353.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-30.png","element":"img","alt":" P [Ac3 ∩ Ac4 | Ac1 ∩ Ac2","inline":true},{"text":"] = ","element":"span"},{"style":{"height":19.37},"width":16,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-31.png","element":"img","alt":"1/2","inline":true,"padRight":true},{"text":"by a symmetry argument. ","element":"span"},{"text":"This will finish the proof of the theorem. For the proof, it will be more convenient to denote the matrix of weights at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"as a tuple of 4 vectors, i.e., ","element":"span"},{"style":{"height":28.8},"width":514.28,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-32.png","element":"img","alt":" Wt =(w(1)0 , w(2)0 , u(1)0 , u(2)0 )","inline":true},{"text":". 
Consider two initializations ","element":"span"},{"style":{"height":28.8},"width":541.09,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-33.png","element":"img","alt":"W (1)0 =(w(1)0 , w(2)0 , u(1)0 , u(2)0 )","inline":true},{"text":" and ","element":"span"},{"style":{"height":28.8},"width":572.09,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-34.png","element":"img","alt":" W (2)0 =(w(1)0 , w(2)0 , −u(1)0 , u(2)0 )","inline":true},{"text":" and let ","element":"span"},{"style":{"height":20.6},"width":83.51,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-35.png","element":"img","alt":" W (1)t","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.6},"width":83.51,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-36.png","element":"img","alt":" W (2)t","inline":true,"padRight":true},{"text":"be the corresponding weight values at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". We will prove the following lemma:","element":"span"}],[{"id":"id-85","style":{"width":"99%"},"width":1752,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-37.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We will show this by induction on ","element":"span"},{"style":{"height":13.78},"width":83.22,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-38.png","element":"img","alt":" t. 17","inline":true},{"text":" This holds by definition for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0. Assume it holds for an iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
Denote ","element":"span"},{"style":{"height":22.53},"width":90.14,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-39.png","element":"img","alt":" W (2)t+1","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":10.8},"width":218.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-40.png","element":"img","alt":"z1, z2, v1, v2","inline":true},{"text":"). We need to show that ","element":"span"},{"style":{"height":22.53},"width":412.22,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-41.png","element":"img","alt":" z1 = w(1)t+1, z2 = w(2)t+1","inline":true},{"text":", ","element":"span"},{"style":{"height":22.53},"width":207.02,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-42.png","element":"img","alt":"v1 = −u(1)t+1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":22.53},"width":176.03,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-43.png","element":"img","alt":" v2 = u(2)t+1","inline":true},{"text":". 
By the induction hypothesis it holds that ","element":"span"},{"style":{"height":22.53},"width":172.17,"height":56.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-44.png","element":"img","alt":" NW (1)t (x+","inline":true},{"text":") = ","element":"span"},{"style":{"height":22.53},"width":172.17,"height":56.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-45.png","element":"img","alt":" NW (2)t (x+","inline":true},{"text":") and ","element":"span"},{"style":{"height":21.55},"width":172.17,"height":53.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-46.png","element":"img","alt":"NW (1)t (x−","inline":true},{"text":") = ","element":"span"},{"style":{"height":21.55},"width":172.17,"height":53.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-47.png","element":"img","alt":" NW (2)t (x−","inline":true},{"text":"). This follows since for diverse points (either positive or negative), negating a neuron does not change the function value. Thus, according to Eq. ","element":"span"},{"href":"#id-57","text":"7 ","element":"a"},{"text":"and Eq. ","element":"span"},{"href":"#id-57","text":"8 ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":22.53},"width":181.56,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-48.png","element":"img","alt":" z1 = w(1)t+1","inline":true},{"text":", ","element":"span"},{"style":{"height":22.53},"width":186.84,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-49.png","element":"img","alt":"z2 = w(2)t+1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":22.53},"width":180,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-50.png","element":"img","alt":" v2 = u(2)t+1","inline":true},{"text":". 
We are left to show that ","element":"span"},{"style":{"height":22.53},"width":211,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-51.png","element":"img","alt":" v1 = −u(1)t+1","inline":true},{"text":". This follows from Eq. ","element":"span"},{"href":"#id-57","text":"8 ","element":"a"},{"text":"and the ","element":"span"},{"text":"following facts:","element":"span"}],[{"text":"1. ","element":"span"},{"style":{"height":11.1},"width":165.9,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-52.png","element":"img","alt":" p3 = −p1","inline":true},{"text":".","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"height":11.1},"width":165.9,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-53.png","element":"img","alt":" p2 = −p4","inline":true},{"text":".","element":"span"}],[{"text":"3. arg max","element":"span"},{"style":{"height":13.5},"width":190.32,"height":33.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-54.png","element":"img","alt":"1≤l≤4 u · pl","inline":true,"padRight":true},{"text":"= 1 if and only if arg max","element":"span"},{"style":{"height":13.5},"width":221.32,"height":33.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-55.png","element":"img","alt":"1≤l≤4 −u · pl","inline":true,"padRight":true},{"text":"= 3.","element":"span"}],[{"text":"4. arg max","element":"span"},{"style":{"height":13.5},"width":190.32,"height":33.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-56.png","element":"img","alt":"1≤l≤4 u · pl","inline":true,"padRight":true},{"text":"= 2 if and only if arg max","element":"span"},{"style":{"height":13.5},"width":221.32,"height":33.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-57.png","element":"img","alt":"1≤l≤4 −u · pl","inline":true,"padRight":true},{"text":"= 4.","element":"span"}],[{"text":"5. 
arg max","element":"span"},{"style":{"height":13.9},"width":204.1,"height":34.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-58.png","element":"img","alt":"l∈{2,4} u · pl","inline":true,"padRight":true},{"text":"= 2 if and only if arg max","element":"span"},{"style":{"height":13.9},"width":235.1,"height":34.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-59.png","element":"img","alt":"l∈{2,4} −u · pl","inline":true,"padRight":true},{"text":"= 4.","element":"span"}],[{"text":"We illustrate this with one case; the other cases are similar. Assume, for example, that arg max","element":"span"},{"style":{"height":22.98},"width":233.62,"height":57.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-60.png","element":"img","alt":"1≤l≤4 u(1)t · pl","inline":true,"padRight":true},{"text":"= 3 and arg max","element":"span"},{"style":{"height":23.38},"width":247.4,"height":58.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-61.png","element":"img","alt":"l∈{2,4} u(1)t · pl","inline":true,"padRight":true},{"text":"= 2 and assume without loss of generality that ","element":"span"},{"style":{"height":22.53},"width":172.17,"height":56.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-62.png","element":"img","alt":" NW (1)t (x+","inline":true},{"text":") = ","element":"span"},{"style":{"height":22.53},"width":276.87,"height":56.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-63.png","element":"img","alt":" NW (2)t (x+) < γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.55},"width":172.17,"height":53.87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-64.png","element":"img","alt":" NW (1)t (x−","inline":true},{"text":") = 
","element":"span"},{"style":{"height":21.55},"width":286.3,"height":53.87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-65.png","element":"img","alt":" NW (2)t (x−) > −","inline":true},{"text":"1. Then, by Eq. ","element":"span"},{"href":"#id-57","text":"8, ","element":"a"},{"style":{"height":22.53},"width":79.64,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-66.png","element":"img","alt":" u(1)t+1","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.6},"width":232.39,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-67.png","element":"img","alt":"u(1)t −p3 +p2","inline":true},{"text":". By the induction hypothesis and the above facts it follows that ","element":"span"},{"style":{"height":20.6},"width":358.45,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-68.png","element":"img","alt":" v1 = −u(1)t −p1 +p4","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":22.53},"width":445.75,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-69.png","element":"img","alt":"−u(1)t + p3 − p2 = −u(1)t+1","inline":true},{"text":". 
This concludes the proof.","element":"span"}],[{"text":"Consider an initialization of gradient descent where ","element":"span"},{"style":{"height":20.93},"width":74.58,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-70.png","element":"img","alt":" w(1)0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":74.58,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-71.png","element":"img","alt":" w(2)0","inline":true,"padRight":true},{"text":"are fixed, and both the event that we conditioned on at the beginning of the proof and ","element":"span"},{"style":{"height":15.55},"width":137.8,"height":38.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-72.png","element":"img","alt":" Ac1 ∩ Ac2","inline":true,"padRight":true},{"text":"hold. Define the set ","element":"span"},{"style":{"height":13.19},"width":46.22,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-73.png","element":"img","alt":" B1","inline":true,"padRight":true},{"text":"to be the set of all ","element":"span"},{"text":"pairs of vectors (","element":"span"},{"style":{"height":10.8},"width":99.7,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-74.png","element":"img","alt":"v1, v2","inline":true},{"text":") such that if ","element":"span"},{"style":{"height":20.93},"width":170.12,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-75.png","element":"img","alt":" u(1)0 = v1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":170.12,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-76.png","element":"img","alt":" u(1)0 = v2","inline":true,"padRight":true},{"text":"then at iteration 
","element":"span"},{"style":{"height":18.08},"width":244.71,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/34-77.png","element":"img","alt":" T, u(1) · p2 >","inline":true,"padRight":true},{"text":"0 and ","element":"span"},{"style":{"height":18.08},"width":179.96,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-0.png","element":"img","alt":"u(2) · p2 >","inline":true,"padRight":true},{"text":"0. Note that this definition implies that this initialization satisfies the condition in Lemma ","element":"span"},{"href":"#id-43","text":"E.3 ","element":"a"},{"text":"and leads to a global minimum. Similarly, let ","element":"span"},{"style":{"height":13.19},"width":46.23,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-1.png","element":"img","alt":" B2","inline":true,"padRight":true},{"text":"be the set of all pairs of vectors (","element":"span"},{"style":{"height":10.8},"width":99.7,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-2.png","element":"img","alt":"v1, v2","inline":true},{"text":") such that if ","element":"span"},{"style":{"height":20.93},"width":172.32,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-3.png","element":"img","alt":" u(1)0 = v1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":172.32,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-4.png","element":"img","alt":" u(1)0 = v2","inline":true,"padRight":true},{"text":"then at iteration ","element":"span"},{"style":{"height":18.08},"width":247.5,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-5.png","element":"img","alt":" T, u(1) · p4 >","inline":true,"padRight":true},{"text":"0 and 
","element":"span"},{"style":{"height":18.08},"width":190.88,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-6.png","element":"img","alt":" u(2) · p2 >","inline":true,"padRight":true},{"text":"0. First, if (","element":"span"},{"style":{"height":16},"width":213.59,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-7.png","element":"img","alt":"v1, v2) ∈ B1","inline":true,"padRight":true},{"text":"then (","element":"span"},{"style":{"height":10.8},"width":130.7,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-8.png","element":"img","alt":"−v1, v2","inline":true},{"text":") satisfies the conditions of Lemma ","element":"span"},{"href":"#id-43","text":"E.3. ","element":"a"},{"text":"Second, by Lemma ","element":"span"},{"href":"#id-85","text":"F.2, ","element":"a"},{"text":"it follows that if (","element":"span"},{"style":{"height":16},"width":212.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-9.png","element":"img","alt":"v1, v2) ∈ B1","inline":true,"padRight":true},{"text":"then initializing with (","element":"span"},{"style":{"height":10.8},"width":130.7,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-10.png","element":"img","alt":"−v1, v2","inline":true},{"text":") leads to the same values of ","element":"span"},{"style":{"height":16.99},"width":143.97,"height":42.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-11.png","element":"img","alt":" NWt(x+","inline":true},{"text":") and ","element":"span"},{"style":{"height":16},"width":143.97,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-12.png","element":"img","alt":" NWt(x−","inline":true},{"text":") in all iterations 0 
","element":"span"},{"style":{"height":13.2},"width":138.59,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-13.png","element":"img","alt":" ≤ t ≤ T","inline":true},{"text":". Therefore, initializing with (","element":"span"},{"style":{"height":10.8},"width":130.7,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-14.png","element":"img","alt":"−v1, v2","inline":true},{"text":") leads to convergence to a global minimum with the same value of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"as the initialization with (","element":"span"},{"style":{"height":10.8},"width":99.7,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-15.png","element":"img","alt":"v1, v2","inline":true},{"text":"). Furthermore, if (","element":"span"},{"style":{"height":16},"width":212.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-16.png","element":"img","alt":"v1, v2) ∈ B1","inline":true},{"text":", then by Lemma ","element":"span"},{"href":"#id-85","text":"F.2, ","element":"a"},{"text":"initializing with ","element":"span"},{"style":{"height":20.93},"width":194.11,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-17.png","element":"img","alt":" u(1)0 = −v1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":163.12,"height":52.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-18.png","element":"img","alt":" u(1)0 = v2","inline":true,"padRight":true},{"text":"results in ","element":"span"},{"style":{"height":18.08},"width":180.97,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-19.png","element":"img","alt":" u(1) · p2 <","inline":true,"padRight":true},{"text":"0 and 
","element":"span"},{"style":{"height":18.08},"width":182.6,"height":45.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-20.png","element":"img","alt":" u(2) · p2 >","inline":true,"padRight":true},{"text":"0. It follows that (","element":"span"},{"style":{"height":16},"width":212.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-21.png","element":"img","alt":"v1, v2) ∈ B1","inline":true,"padRight":true},{"text":"if and only if (","element":"span"},{"style":{"height":16},"width":243.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-22.png","element":"img","alt":"−v1, v2) ∈ B2","inline":true},{"text":". For ","element":"span"},{"style":{"height":16},"width":234.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-23.png","element":"img","alt":" l1, l2 ∈ {2, 4}","inline":true,"padRight":true},{"text":"define ","element":"span"},{"style":{"height":28.8},"width":1074.81,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-24.png","element":"img","alt":" Pl1,l2 = P[u(1) · pl1 > 0 ∧ u(2) · pl2 > 0 | Ac1 ∩ Ac2, w(1)0 , w(2)0 ]","inline":true},{"text":". Then, by ","element":"span"},{"text":"symmetry of the initialization and the latter arguments it follows that ","element":"span"},{"style":{"height":15.59},"width":188.88,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-25.png","element":"img","alt":" P2,2 = P4,2","inline":true},{"text":". By similar arguments we can obtain the equalities ","element":"span"},{"style":{"height":15.59},"width":432.77,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-26.png","element":"img","alt":" P2,2 = P4,2 = P4,4 = P2,4","inline":true},{"text":". 
Since these four probabilities sum to 1, each is equal to ","element":"span"},{"style":{"height":19.37},"width":86.38,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-27.png","element":"img","alt":"1/4. 18","inline":true},{"text":" Taking expectations of these ","element":"span"},{"text":"probabilities with respect to the values of ","element":"span"},{"style":{"height":20.93},"width":74.58,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-28.png","element":"img","alt":" w(1)0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":74.58,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-29.png","element":"img","alt":" w(2)0","inline":true,"padRight":true},{"text":"(given that Lemma ","element":"span"},{"href":"#id-43","text":"E.3 ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":15.56},"width":139.53,"height":38.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-30.png","element":"img","alt":" Ac1 ∩ Ac2","inline":true,"padRight":true},{"text":"hold) ","element":"span"},{"text":"and using the law of total expectation, we conclude that","element":"span"}],[{"style":{"width":"68%"},"width":1206,"height":169,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-31.png","element":"img"}],[{"text":"Finally, let ","element":"span"},{"style":{"height":13.19},"width":44.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-32.png","element":"img","alt":" Z1","inline":true,"padRight":true},{"text":"be the set of positive points which contain only the patterns ","element":"span"},{"style":{"height":14.7},"width":243.78,"height":36.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-33.png","element":"img","alt":" p1, p2, p4, Z2","inline":true,"padRight":true},{"text":"be the set of positive points 
which contain only the patterns ","element":"span"},{"style":{"height":11.1},"width":176.85,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-34.png","element":"img","alt":" p3, p2, p4","inline":true},{"text":". Let ","element":"span"},{"style":{"height":13.19},"width":44.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-35.png","element":"img","alt":" Z3","inline":true,"padRight":true},{"text":"be the set which contains the negative point with all patterns equal to ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-36.png","element":"img","alt":" p2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":44.88,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-37.png","element":"img","alt":" Z4","inline":true,"padRight":true},{"text":"be the set which contains the negative point with all patterns equal to ","element":"span"},{"style":{"height":11.1},"width":39.95,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-38.png","element":"img","alt":" p4","inline":true},{"text":". 
By the proof of the previous section, if the event ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"holds, then there exists 1 ","element":"span"},{"style":{"height":12.8},"width":97.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-39.png","element":"img","alt":" ≤ i ≤","inline":true,"padRight":true},{"text":"4, such that gradient descent converges to a solution at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"which errs on all of the points in ","element":"span"},{"style":{"height":13.19},"width":39.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-40.png","element":"img","alt":" Zi","inline":true},{"text":". Therefore, its test error will be at least ","element":"span"},{"style":{"height":14.18},"width":36.05,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-41.png","element":"img","alt":" p∗","inline":true,"padRight":true},{"text":"(recall Eq. ","element":"span"},{"href":"#id-34","text":"5)","element":"a"},{"text":".","element":"span"}]]},{"heading":"G Proof of Theorem 6.5","paragraphs":[[{"text":"Let ","element":"span"},{"style":{"height":16},"width":383.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-42.png","element":"img","alt":" δ ≥ 1 − p+p−(1 − c −","inline":true,"padRight":true},{"text":"16","element":"span"},{"style":{"height":13.38},"width":59.46,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-43.png","element":"img","alt":"e−8","inline":true},{"text":"). 
By Theorem ","element":"span"},{"href":"#id-35","text":"6.3, ","element":"a"},{"text":"given 2 samples, one positive and one negative, with probability at least 1 ","element":"span"},{"style":{"height":16},"width":370.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-44.png","element":"img","alt":" − δ ≤ p+p−(1 − c −","inline":true,"padRight":true},{"text":"16","element":"span"},{"style":{"height":13.38},"width":59.46,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-45.png","element":"img","alt":"e−8","inline":true},{"text":"), gradient descent will converge to a global minimum that has 0 test error. Therefore, for all ","element":"span"},{"style":{"height":12.8},"width":58.24,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-46.png","element":"img","alt":" ϵ ≥","inline":true,"padRight":true},{"text":"0, ","element":"span"},{"style":{"height":16},"width":161.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-47.png","element":"img","alt":" m(ϵ, δ) ≤","inline":true,"padRight":true},{"text":"2. 
On the other hand, by Theorem ","element":"span"},{"href":"#id-36","text":"6.4, ","element":"a"},{"text":"if ","element":"span"},{"style":{"height":30.43},"width":293.31,"height":76.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-48.png","element":"img","alt":" m < 2 log(48δ/(33(1−c))) / log(p+p−)","inline":true,"padRight":true},{"text":"then with probability greater than","element":"span"}],[{"style":{"width":"31%"},"width":550,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-49.png","element":"img"}],[{"text":"gradient descent converges to a global minimum with test error at least ","element":"span"},{"style":{"height":14.18},"width":36.05,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-50.png","element":"img","alt":" p∗","inline":true},{"text":". It follows that for 0 ","element":"span"},{"style":{"height":12.8},"width":31,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-51.png","element":"img","alt":" ≤","inline":true},{"style":{"height":30.43},"width":509.44,"height":76.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/35-52.png","element":"img","alt":"ϵ < p∗, m(ϵ, δ) ≥ 2 log(48δ/(33(1−c))) / log(p+p−)","inline":true,"padRight":true},{"text":".","element":"span"}]]},{"heading":"H Experiments for Section 7","paragraphs":[[{"text":"We first provide several details on the experiments in Section ","element":"span"},{"text":"7. ","element":"span"},{"text":"We trained the overparameterized network with 120 channels once for each training set size and recorded the clustered weights. We used Adam for optimization with a batch size of one-tenth of the training set size. We used a learning rate of 0.01 and initialized the weights from a truncated normal distribution with standard deviation 0.05. 
For the small network with random initialization we used the same optimization method and batch sizes","element":"span"}],[{"id":"id-86","style":{"width":"84%"},"width":1481,"height":631,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/36-0.png","element":"img"}],[{"text":"Figure 7: ","element":"figcaption","subtype":"caption"},{"text":"Clustering and Exploration in MNIST with 4x4 filters (a) Distribution of angle to closest center in trained and random networks. (b) The plot shows the test error of the small network (4 channels) with standard training (red), the small network that uses clusters from the large network (blue), and the large network (120 channels) with standard training (green).","element":"figcaption","subtype":"caption"}],[{"text":"but tried 6 different pairs of values for learning rate and standard deviation: (0.01,0.01), (0.01,0.05), (0.05,0.05), (0.05, 0.01), (0.1,0.5) and (0.1,0.1). For each pair and training set size we trained 20 times and averaged the results. The curve shows the best test accuracy obtained across all learning rate and standard deviation pairs.","element":"span"}],[{"text":"For the small network with cluster initialization we used the same setup as for the small network with random initialization, but only with learning rate 0.01 and standard deviation 0.05. The curve is an average of 20 runs for each training set size.","element":"span"}],[{"text":"We also experimented with other filter sizes in similar setups. Figure ","element":"span"},{"href":"#id-86","text":"7 ","element":"a"},{"text":"shows the results for 4x4 filters and clustering from 120 filters to 4 filters (with 2000 training points). 
Figure ","element":"span"},{"href":"#id-87","text":"8 ","element":"a"},{"text":"shows the results for 7x7 filters and clustering from 120 filters to 4 filters (with 2000 training points).","element":"span"}],[{"id":"id-87","style":{"width":"84%"},"width":1481,"height":631,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1810.03037/images/37-0.png","element":"img"}],[{"text":"Figure 8: ","element":"figcaption","subtype":"caption"},{"text":"Clustering and Exploration in MNIST with 7x7 filters (a) Distribution of angle to closest center in trained and random networks. (b) The plot shows the test error of the small network (4 channels) with standard training (red), the small network that uses clusters from the large network (blue), and the large network (120 channels) with standard training (green).","element":"figcaption","subtype":"caption"}]]}],"_version":"3.3.2"},"paperNode":"$28:props:children:props:children:0:props:product"}]]