36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2002.02247","publisher":"arxiv","paperJSON":{"title":"Almost Sure Convergence of Dropout Algorithms for Neural Networks","paperID":"2002.02247","avgLineHeight":13.54,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"We investigate the convergence and convergence rate of stochastic training algorithms for ","element":"span"},{"text":"Neural Networks (NNs) ","element":"span"},{"text":"that have been inspired by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"(","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"Hinton et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"2012","element":"a"},{"text":"). With the goal of avoiding overfitting during training in ","element":"span"},{"text":"NNs, ","element":"span"},{"text":"dropout algorithms consist in practice of multiplying the weight matrices of a ","element":"span"},{"text":"NN ","element":"span"},{"text":"componentwise by independently drawn random matrices with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":"-valued entries during each iteration of ","element":"span"},{"text":"Stochastic Gradient Descent ","element":"span"},{"text":"(SGD). 
","element":"span"},{"text":"This paper presents a probability theoretical proof that for fully-connected ","element":"span"},{"text":"NNs ","element":"span"},{"text":"with differentiable, polynomially bounded activation functions, if we project the weights onto a compact set when using a dropout algorithm, then the weights of the ","element":"span"},{"text":"NN ","element":"span"},{"text":"converge to a unique stationary point of a projected system of ","element":"span"},{"text":"Ordinary Differential Equations (ODEs).","element":"span"}],[{"text":"After this general convergence guarantee, we go on to investigate the convergence rate of dropout. Firstly, we obtain generic sample complexity bounds for finding ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/0-0.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationary points of smooth nonconvex functions using ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with dropout that explicitly depend on the dropout probability. Secondly, we obtain an upper bound on the rate of convergence of ","element":"span"},{"text":"Gradient Descent (GD) ","element":"span"},{"text":"on the limiting ","element":"span"},{"text":"ODEs ","element":"span"},{"text":"of dropout algorithms for ","element":"span"},{"text":"NNs ","element":"span"},{"text":"with the shape of an arborescence of arbitrary depth and with linear activation functions. 
The latter bound shows that for an algorithm such as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect ","element":"span"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":47,"text":"Wan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":47,"text":"2013","element":"a"},{"text":"), the convergence rate can be impaired exponentially by the depth of the arborescence.","element":"span"}],[{"text":"In contrast, we experimentally observe no such dependence for wide ","element":"span"},{"text":"NNs ","element":"span"},{"text":"with just a few dropout layers. ","element":"span"},{"text":"We also provide a heuristic argument for this observation. ","element":"span"},{"text":"Our results suggest that there is a change of scale of the effect of the dropout probability in the convergence rate that depends on the relative size of the width of the ","element":"span"},{"text":"NN ","element":"span"},{"text":"compared to its depth. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Keywords: ","element":"span"},{"text":"Dropout, Convergence, Neural Networks, Stochastic Approximation, ODE Method","element":"span"}]]},{"heading":"1. Introduction","paragraphs":[[{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"(","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"Hinton et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"2012","element":"a"},{"text":") is a technique to avoid overfitting during training of ","element":"span"},{"text":"NNs ","element":"span"},{"text":"that consists of temporarily ‘dropping’ nodes of the network independently at each step of ","element":"span"},{"text":"SGD. 
","element":"span"},{"text":"While in the original ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"algorithm in ","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"Hinton et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"2012","element":"a"},{"text":") only nodes from the network were dropped, several stochastic training algorithms that avoid overfitting in ","element":"span"},{"text":"NNs ","element":"span"},{"text":"have appeared since then; for example, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect ","element":"span"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":47,"text":"Wan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":47,"text":"2013","element":"a"},{"text":"), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Cutout ","element":"span"},{"text":"(","element":"span"},{"href":"#id-2","referenceIndex":11,"text":"DeVries and ","element":"a"},{"href":"#id-2","referenceIndex":11,"text":"Taylor","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":11,"text":"2017","element":"a"},{"text":"). Figure ","element":"span"},{"href":"#id-3","text":"1 ","element":"a"},{"text":"depicts a ","element":"span"},{"text":"NN ","element":"span"},{"text":"where we use ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect ","element":"span"},{"text":"and drop individual edges","element":"span"}],[{"style":{"width":"36%"},"width":627,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/0-1.png","element":"img"}],[{"text":"instead of nodes. 
In practice, such dropout algorithms consist of multiplying componentwise weight matrices of the ","element":"span"},{"text":"NN ","element":"span"},{"text":"in each iteration by independently drawn random matrices with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":"-valued entries. The elements of these random matrices indicate whether each individual edge or node is filtered (","element":"span"},{"text":"0","element":"span"},{"text":") or is not filtered (","element":"span"},{"text":"1","element":"span"},{"text":") during a training step. ","element":"span"},{"text":"The resulting weight matrices are then used in the backpropagation algorithm for computing the gradient of a ","element":"span"},{"text":"NN. ","element":"span"},{"text":"Mathematically, dropout turns the backpropagation algorithm into a step of a ","element":"span"},{"text":"SGD ","element":"span"},{"text":"in which the primary source of randomness is the ","element":"span"},{"text":"NN’","element":"span"},{"text":"s configuration. 
Under mild independence assumptions, the loss function of dropout is a risk function averaged over all possible ","element":"span"},{"text":"NNs ","element":"span"},{"text":"configurations (","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"Baldi and Sadowski","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"2013","element":"a"},{"text":").","element":"span"}],[{"id":"id-3","style":{"width":"96%"},"width":1661,"height":457,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/1-0.png","element":"img"}],[{"text":"Figure 1: ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(a,b) Dropconnect","element":"figcaption","subtype":"caption"},{"text":"’s training step (","element":"figcaption","subtype":"caption"},{"href":"#id-1","referenceIndex":47,"text":"Wan et al.","element":"a","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"href":"#id-1","referenceIndex":47,"text":"2013","element":"a","subtype":"caption"},{"text":") in a ","element":"figcaption","subtype":"caption"},{"text":"NN ","element":"figcaption","subtype":"caption"},{"text":"with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"L ","element":"figcaption","subtype":"caption"},{"text":"= 3 ","element":"figcaption","subtype":"caption"},{"text":"layers. In this algorithm, on every iteration, a random ","element":"figcaption","subtype":"caption"},{"text":"NN ","element":"figcaption","subtype":"caption"},{"text":"is first generated by removing each edge with probability ","element":"figcaption","subtype":"caption"},{"style":{"height":17.6},"width":167.43,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/1-1.png","element":"img","alt":" p ∈ (0, 1]","inline":true,"padRight":true},{"text":"independently of all other edges. 
The output of this random ","element":"figcaption","subtype":"caption"},{"text":"NN ","element":"figcaption","subtype":"caption"},{"text":"is then used to update all weights using the backpropagation algorithm. ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(c) ","element":"figcaption","subtype":"caption"},{"text":"An example arborescence of depth ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"L ","element":"figcaption","subtype":"caption"},{"text":"= 3","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"An interesting aspect of dropout algorithms is that they lie at the intersection of stochastic optimization and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"percolation theory","element":"span"},{"text":", which investigates properties related to connectedness of random graphs and deterministic (possibly infinite) graphs in which vertices and edges are deleted at random. In the case of dropout, the output of the filtered ","element":"span"},{"text":"NN ","element":"span"},{"text":"with temporarily deleted edges is used to update the weights. 
If dropout filters too many weights, then little information about the input can pass through the network, which will consequently also yield a gradient update for that step that contains little relevant information.","element":"span"}],[{"text":"As an example, we may consider again the networks in Figures ","element":"span"},{"href":"#id-3","text":"1 ","element":"a"},{"text":"(a)–(b) when we use ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect","element":"span"},{"text":", that is, we filter each edge with probability ","element":"span"},{"style":{"height":15.2},"width":102.6,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/1-2.png","element":"img","alt":" 1 − p","inline":true,"padRight":true},{"text":"independently of all other edges. We can observe that the number of paths ","element":"span"},{"style":{"height":12},"width":28,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/1-3.png","element":"img","alt":" χ","inline":true,"padRight":true},{"text":"in Figure ","element":"span"},{"href":"#id-3","text":"1 ","element":"a"},{"text":"(b) that fully traverse the network (","element":"span"},{"style":{"height":16},"width":123.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/1-4.png","element":"img","alt":"χ = 5","inline":true},{"text":") is much smaller than in Figure ","element":"span"},{"href":"#id-3","text":"1 ","element":"a"},{"text":"(a) (","element":"span"},{"style":{"height":17.6},"width":269.96,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/1-5.png","element":"img","alt":"χ = 240). 
In","inline":true,"padRight":true},{"text":"an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-layer ","element":"span"},{"text":"NN ","element":"span"},{"text":"with no biases, a path from the input layer to the output goes through ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"weights that have filters. Then, the probability that a path from input to output stays unfiltered and contributes to a weight update is ","element":"span"},{"style":{"height":18.73},"width":44.96,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/1-6.png","element":"img","alt":" pL","inline":true},{"text":". If we now fix one edge in the path, then the probability of updating its corresponding weight through that path in particular is also ","element":"span"},{"style":{"height":18.73},"width":44.96,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/1-7.png","element":"img","alt":" pL","inline":true},{"text":". There are, however, many other paths in a ","element":"span"},{"text":"NN ","element":"span"},{"text":"passing through a single edge. The probability that one of those paths is not filtered will be large and may compensate the exponential factor ","element":"span"},{"style":{"height":18.73},"width":44.96,"height":46.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/1-8.png","element":"img","alt":" pL","inline":true},{"text":". Considering the connection to bond percolation, one may therefore suspect that dropout algorithms may perform worse than a routine implementation of the backpropagation algorithm. 
However, dropout algorithms usually perform well since they avoid overfitting in ","element":"span"},{"text":"NNs ","element":"span"},{"text":"(","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"Hinton et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"2012","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":43,"text":"Srivastava et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":43,"text":"2014","element":"a"},{"text":"). From the point of view of bond percolation however, this should still come at the cost of slower convergence of dropout algorithms, and conceivably by as much as a factor ","element":"span"},{"style":{"height":18.73},"width":232.13,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-0.png","element":"img","alt":" pL, where L","inline":true,"padRight":true},{"text":"is the number of dropout layers.","element":"span"}],[{"text":"Most theoretical focus has been on the generalization properties of ","element":"span"},{"text":"NNs ","element":"span"},{"text":"trained with dropout algorithms. ","element":"span"},{"text":"We can mention ","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"Hinton et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"2012","element":"a"},{"text":"); ","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"Baldi and Sadowski ","element":"a"},{"text":"(","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"2013","element":"a"},{"text":"); ","element":"span"},{"href":"#id-6","referenceIndex":46,"text":"Wager et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-6","referenceIndex":46,"text":"2013","element":"a"},{"text":"); ","element":"span"},{"href":"#id-5","referenceIndex":43,"text":"Srivastava et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-5","referenceIndex":43,"text":"2014","element":"a"},{"text":"); ","element":"span"},{"href":"#id-7","referenceIndex":3,"text":"Baldi and Sadowski ","element":"a"},{"text":"(","element":"span"},{"href":"#id-7","referenceIndex":3,"text":"2014","element":"a"},{"text":"); ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Cavazza et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-9","referenceIndex":30,"text":"Mianjy et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-9","referenceIndex":30,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-10","referenceIndex":28,"text":"Mianjy and Arora ","element":"a"},{"text":"(","element":"span"},{"href":"#id-10","referenceIndex":28,"text":"2019","element":"a"},{"text":"); ","element":"span"},{"href":"#id-11","referenceIndex":35,"text":"Pal et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-11","referenceIndex":35,"text":"2020","element":"a"},{"text":"); ","element":"span"},{"href":"#id-12","referenceIndex":48,"text":"Wei et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-12","referenceIndex":48,"text":"2020","element":"a"},{"text":"), which we briefly review in Section ","element":"span"},{"href":"#id-13","text":"1.3. ","element":"a"},{"text":"In this paper, however, we investigate dropout from the stochastic optimization perspective. That is, we aim to answer if dropout algorithms converge and study the rate at which they converge, which is expected to depend on the dropout probability. Compared to the study of the generalization properties of dropout, this aim has received less attention in the literature. 
In particular, we can only mention ","element":"span"},{"href":"#id-14","referenceIndex":29,"text":"Mianjy and Arora ","element":"a"},{"text":"(","element":"span"},{"href":"#id-14","referenceIndex":29,"text":"2020","element":"a"},{"text":") and ","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"Senen-Cerda and Sanders ","element":"a"},{"text":"(","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"2022","element":"a"},{"text":"). In ","element":"span"},{"href":"#id-14","referenceIndex":29,"text":"Mianjy and Arora ","element":"a"},{"text":"(","element":"span"},{"href":"#id-14","referenceIndex":29,"text":"2020","element":"a"},{"text":"), a convergence rate for the test error in a classification setting is obtained when training shallow ","element":"span"},{"text":"NNs ","element":"span"},{"text":"with dropout. This rate is, however, independent of the dropout probability. In ","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"Senen-Cerda and Sanders ","element":"a"},{"text":"(","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"2022","element":"a"},{"text":"), a convergence rate for the empirical risk associated with training shallow linear ","element":"span"},{"text":"NNs ","element":"span"},{"text":"with dropout is obtained that depends on the dropout probability. Both results refer to shallow ","element":"span"},{"text":"NNs ","element":"span"},{"text":"where the width of the ","element":"span"},{"text":"NN ","element":"span"},{"text":"plays a role in the convergence rate. We refer to Section ","element":"span"},{"href":"#id-13","text":"1.3 ","element":"a"},{"text":"below for further details.","element":"span"}],[{"text":"From the previous discussion, however, we suspect that dropout also affects the convergence rate in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"deep ","element":"span"},{"text":"NNs ","element":"span"},{"text":"with several layers of dropout. 
In this paper, we investigate this problem. In particular, we provide convergence guarantees for training ","element":"span"},{"text":"NNs ","element":"span"},{"text":"that have several layers of dropout and analyze simplified models for deep ","element":"span"},{"text":"NNs, ","element":"span"},{"text":"for which it is possible to obtain an explicit convergence rate that depends on the dropout probability and depth. We also consider the effect on the sample complexity of using dropout ","element":"span"},{"text":"SGD ","element":"span"},{"text":"and complement the previous results with simulations on realistic ","element":"span"},{"text":"NNs ","element":"span"},{"text":"to examine the convergence rate of dropout empirically.","element":"span"}],[{"text":"Before introducing the results of the paper we briefly define the fundamental concepts related to training of ","element":"span"},{"text":"NNs ","element":"span"},{"text":"with dropout that we will use throughout this paper.","element":"span"}],[{"id":"id-52","style":{"fontWeight":"bold"},"text":"1.1 Dropout and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"SGD","element":"span"}],[{"text":"A ","element":"span"},{"text":"NN ","element":"span"},{"style":{"height":14.7},"width":284.12,"height":36.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-1.png","element":"img","alt":" ΨW : X → Y","inline":true,"padRight":true},{"text":"with weights ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"is typically used to predict output ","element":"span"},{"style":{"height":16},"width":258.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-2.png","element":"img","alt":" Y ∈ Y given","inline":true,"padRight":true},{"text":"input ","element":"span"},{"style":{"height":12.8},"width":144.38,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-3.png","element":"img","alt":" X ∈ 
X","inline":true,"padRight":true},{"text":"both of which are sampled from some joint distribution. For a given loss ","element":"span"},{"style":{"height":14.8},"width":268.95,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-4.png","element":"img","alt":"l : Y × Y → R","inline":true},{"text":", the risk function of ","element":"span"},{"style":{"height":14.7},"width":69.94,"height":36.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-5.png","element":"img","alt":" ΨW","inline":true,"padRight":true},{"text":"is usually defined as","element":"span"}],[{"id":"id-16","style":{"width":"73%"},"width":1273,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-6.png","element":"img"}],[{"text":"where the distribution is usually given by the empirical distribution of a finite number of samples ","element":"span"},{"style":{"height":18.09},"width":403.37,"height":45.22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-7.png","element":"img","alt":" {(xi, yi)}ni=1 ∈ X × Y","inline":true},{"text":". In this case, the risk is an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"empirical risk","element":"span"},{"text":".","element":"span"}],[{"text":"Ideally, the ","element":"span"},{"text":"NN ","element":"span"},{"text":"is operated using weights in the set ","element":"span"},{"style":{"height":17.6},"width":312.12,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-8.png","element":"img","alt":" arg minW U(W).","inline":true,"padRight":true},{"text":"However, the weights are found in practice by using gradient descent or its stochastic variant ","element":"span"},{"text":"SGD, ","element":"span"},{"text":"which aims to minimize the risk in ","element":"span"},{"href":"#id-16","text":"(1) ","element":"a"},{"text":"by updating the weights in the local direction that minimizes the function. 
At time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", the weights ","element":"span"},{"style":{"height":16.33},"width":77.92,"height":40.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-9.png","element":"img","alt":" W [t] ","inline":true,"padRight":true},{"text":"of the ","element":"span"},{"text":"NN ","element":"span"},{"text":"are updated by setting","element":"span"}],[{"id":"id-54","style":{"width":"65%"},"width":1138,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/2-10.png","element":"img"}],[{"text":"Here, ","element":"span"},{"style":{"height":16.01},"width":110.26,"height":40.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-0.png","element":"img","alt":"˜∆[t+1] ","inline":true,"padRight":true},{"text":"is a stochastic estimate of the gradient of ","element":"span"},{"href":"#id-16","text":"(1) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":16.33},"width":117.53,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-1.png","element":"img","alt":" α{t+1} ","inline":true,"padRight":true},{"text":"is a step size that we will specify later. Let ","element":"span"},{"style":{"height":17.6},"width":197.3,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-2.png","element":"img","alt":" BW (X, Y )","inline":true,"padRight":true},{"text":"be the gradient at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"of ","element":"span"},{"href":"#id-16","text":"(1)","element":"a"},{"text":". 
If the input and output samples ","element":"span"},{"style":{"height":19.13},"width":244.23,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-3.png","element":"img","alt":" X[t+1], Y [t+1] ","inline":true,"padRight":true},{"text":"are provided at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", then the update of ","element":"span"},{"text":"SGD ","element":"span"},{"text":"is given by","element":"span"}],[{"id":"id-17","style":{"width":"65%"},"width":1141,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-4.png","element":"img"}],[{"text":"As we have mentioned, dropout filters are applied to some of the weights ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"during training by using matrices of random variables ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":"-valued entries. ","element":"span"},{"text":"Denote by ","element":"span"},{"style":{"height":19.13},"width":386.62,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-5.png","element":"img","alt":"F [t+1], X[t+1], Y [t+1] ","inline":true,"padRight":true},{"text":"the dropout filters and the samples provided to the ","element":"span"},{"text":"SGD ","element":"span"},{"text":"algorithm at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", respectively. 
Compared to ","element":"span"},{"href":"#id-17","text":"(3)","element":"a"},{"text":", a dropout algorithm defines the estimate of the gradient update as","element":"span"}],[{"id":"id-18","style":{"width":"74%"},"width":1282,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":12},"width":34,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-7.png","element":"img","alt":" ⊙","inline":true,"padRight":true},{"text":"denotes the componentwise product.","element":"span"}],[{"text":"Note that in ","element":"span"},{"href":"#id-18","text":"(4) ","element":"a"},{"text":"the filters appear twice. Firstly, they filter the weights ","element":"span"},{"style":{"height":16.33},"width":270.88,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-8.png","element":"img","alt":" W [t] when the","inline":true,"padRight":true},{"text":"gradient is computed depending only on the subnetwork provided by dropping some edges or nodes. Secondly, they filter the updates in ","element":"span"},{"style":{"height":15.93},"width":110.26,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-9.png","element":"img","alt":" ∆[t+1] ","inline":true,"padRight":true},{"text":"since only the remaining weights will be updated. We remark that in this general formulation, other distributions for the filters than those for dropout and dropconnect are allowed. 
For specific examples of distribution of the filter matrices we refer to Section ","element":"span"},{"href":"#id-19","text":"2.3.","element":"a"}],[{"text":"We next present the results of this paper.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1.2 Summary of results","element":"span"}],[{"text":"Our first result is a formal probability theoretical proof that for any (fully connected) ","element":"span"},{"text":"NN ","element":"span"},{"text":"topology and with differentiable polynomially bounded activation functions (see Theorem ","element":"span"},{"href":"#id-20","text":"5)","element":"a"},{"text":", the iterates of projected ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with dropout-like filters converge. In particular, a step of projected ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with dropout is given by","element":"span"}],[{"id":"id-23","style":{"width":"75%"},"width":1314,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.93},"width":110.26,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-11.png","element":"img","alt":" ∆[t+1] ","inline":true,"padRight":true},{"text":"is the estimate of the gradient with dropout in ","element":"span"},{"href":"#id-18","text":"(4) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":15.5},"width":58.7,"height":38.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-12.png","element":"img","alt":" PH","inline":true,"padRight":true},{"text":"is an operator that projects the iterates onto a compact convex set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"(","element":"span"},{"href":"#id-21","referenceIndex":34,"text":"Oymak","element":"a"},{"text":", 
","element":"span"},{"href":"#id-21","referenceIndex":34,"text":"2018","element":"a"},{"text":"). In order to state our first result, we define a ","element":"span"},{"style":{"height":16},"width":676.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-13.png","element":"img","alt":" dropout algorithm’s risk function as","inline":true}],[{"style":{"width":"77%"},"width":1342,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-14.png","element":"img"}],[{"text":"and we will consider ","element":"span"},{"style":{"height":19.13},"width":560.27,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-15.png","element":"img","alt":" l(a, b) = |a − b|2 to be the ℓ2","inline":true},{"text":"-loss. The result is stated informally in the next proposition.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Result 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(Informal statement of Proposition ","element":"span"},{"href":"#id-22","style":{"fontStyle":"italic"},"text":"6.","element":"a"},{"style":{"fontStyle":"italic"},"text":") Under sufficient regularity of the activation functions, bounded moments and independence of random variables and some assumptions on the boundary ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"fontStyle":"italic"},"text":", with update ","element":"span"},{"href":"#id-23","text":"(5)","element":"a"},{"style":{"fontStyle":"italic"},"text":", the weights ","element":"span"},{"style":{"height":20.33},"width":126.26,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-16.png","element":"img","alt":" (W [t])t","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"converge to a unique stationary set of a projected system of 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"ODEs","element":"span"}],[{"id":"id-24","style":{"width":"66%"},"width":1146,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":17.6},"width":195.26,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-18.png","element":"img","alt":" π(W) is a","inline":true,"padRight":true},{"text":"constraint term","element":"span"},{"style":{"fontStyle":"italic"},"text":", which describes the minimum vector required to keep the gradient flow of ","element":"span"},{"style":{"height":13.2},"width":190.11,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/3-19.png","element":"img","alt":" ∇D in H.","inline":true}],[{"text":"This result provides a formal guarantee with the sufficient conditions for dropout algorithms to be well-behaved and at least asymptotically (meaning after sufficiently many iterations) to not suffer from problems that could have arisen from the relation to bond percolation. Moreover, for a wide range of ","element":"span"},{"text":"NNs ","element":"span"},{"text":"and activation functions the function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"is the expectation of the risk over the dropout’s filters distribution, which in our result is not restricted to dropping nodes and can even be coupled to the data. 
This result also shows that ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with dropout converges to the stationary points of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":")","element":"span"},{"text":". While a guarantee is necessary, a convergence rate would yield more insight into the trade-offs of the algorithm, especially its dependence on depth.","element":"span"}],[{"text":"In our second result, we go one step beyond the convergence guarantee and compute a bound for the sample complexity of ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with dropout to reach an ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-0.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationary point of a generic smooth nonconvex function ","element":"span"},{"style":{"height":17.6},"width":610.42,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-1.png","element":"img","alt":" D(W). We say W ∈ W is an ϵ","inline":true},{"text":"-stationary point of ","element":"span"},{"text":"D ","element":"span"},{"text":"if ","element":"span"},{"style":{"height":17.6},"width":296.57,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-2.png","element":"img","alt":" ∥∇D(W)∥2 ≤ ϵ","inline":true,"padRight":true},{"text":"holds. 
Note that stationary points are not necessarily minima, but the sample complexity, understood as the number of iterations ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"required to reach ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-3.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationarity, is usually associated with the complexity of the function to be optimized.","element":"span"}],[{"text":"For a generic smooth nonconvex function ","element":"span"},{"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":")","element":"span"},{"text":", we consider dropout to be ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with the update in ","element":"span"},{"href":"#id-18","text":"(4)","element":"a"},{"text":", where filters ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"are chosen independently at each step and are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":"-valued for each parameter. In our result we assume boundedness and Lipschitzness conditions on ","element":"span"},{"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":")","element":"span"},{"text":". 
Moreover, under some additional assumptions on the loss function, examples of ","element":"span"},{"text":"NNs ","element":"span"},{"text":"with sigmoid activation functions ","element":"span"},{"style":{"height":17.6},"width":437.83,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-4.png","element":"img","alt":" σ(t) = 1/(1 + exp(−t))","inline":true,"padRight":true},{"text":"are also covered by our result. In this particular case, ","element":"span"},{"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"holds with the definition in ","element":"span"},{"href":"#id-24","text":"(6)","element":"a"},{"text":". For the general case we prove the following:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Result 2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(Informal statement of Proposition ","element":"span"},{"href":"#id-25","style":{"fontStyle":"italic"},"text":"7.","element":"a"},{"style":{"fontStyle":"italic"},"text":") Assume that ","element":"span"},{"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"has enough regularity and satisfies some boundedness and Lipschitzness assumptions. Let ","element":"span"},{"style":{"height":16.33},"width":93.44,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-5.png","element":"img","alt":" W {t} ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be iterates of ","element":"span"},{"href":"#id-23","text":"(5)","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
For any ","element":"span"},{"style":{"height":12.8},"width":119,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-6.png","element":"img","alt":" T ∈ N","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"there exist ","element":"span"},{"style":{"height":19.93},"width":638.05,"height":49.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-7.png","element":"img","alt":" c > 0 and c1, c2 > 0 and α{t} = η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"constant such that if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p > c/T","element":"span"},{"style":{"fontStyle":"italic"},"text":", then as ","element":"span"},{"style":{"height":15.2},"width":156.08,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-8.png","element":"img","alt":" T → ∞,","inline":true}],[{"id":"id-27","style":{"width":"76%"},"width":1330,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-9.png","element":"img"}],[{"text":"Hence, at least ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"iterations of dropout-like ","element":"span"},{"text":"SGD ","element":"span"},{"text":"algorithms are required to reach an ","element":"span"},{"style":{"height":20.33},"width":521.21,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-10.png","element":"img","alt":"O((p(c1 + (1 − p)c2)/T)1/4)","inline":true},{"text":"-stationary point of nonconvex smooth functions in expectation. Here, ","element":"span"},{"style":{"height":11.2},"width":93.09,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-11.png","element":"img","alt":" c1, c2","inline":true,"padRight":true},{"text":"are constants depending on the data and function, respectively. 
Compared to the theoretically optimal rate of ","element":"span"},{"style":{"height":20.34},"width":179.18,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-12.png","element":"img","alt":" O(T −1/4)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"text":"SGD ","element":"span"},{"text":"on nonconvex smooth functions (","element":"span"},{"href":"#id-26","referenceIndex":12,"text":"Drori ","element":"a"},{"href":"#id-26","referenceIndex":12,"text":"and Shamir","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":12,"text":"2020","element":"a"},{"text":"), this result shows that dropout changes the optimization landscape and that, depending on the dropout probability, approximate stationary points are easier to find. In this setting, we also consider the complexity when we scale the weights by a factor ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/p ","element":"span"},{"text":"during training, which is commonly done to compensate for the effect of dropout on the convergence rate.","element":"span"}],[{"text":"It must be emphasized that Proposition ","element":"span"},{"href":"#id-25","text":"7 ","element":"a"},{"text":"does not assume much structure on the objective function. As a consequence, although the bound in ","element":"span"},{"href":"#id-27","text":"(8) ","element":"a"},{"text":"holds in some settings with deep ","element":"span"},{"text":"NNs, ","element":"span"},{"text":"the depth of such a ","element":"span"},{"text":"NN ","element":"span"},{"text":"would appear only ","element":"span"},{"style":{"fontStyle":"italic"},"text":"implicitly ","element":"span"},{"text":"in the constants ","element":"span"},{"style":{"height":11.2},"width":93.09,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/4-13.png","element":"img","alt":"c1, c2","inline":true},{"text":". 
To determine the dependence between the convergence rate and the depth of a ","element":"span"},{"text":"NN ","element":"span"},{"style":{"fontStyle":"italic"},"text":"explicitly","element":"span"},{"text":", one must exploit the specific structure of a ","element":"span"},{"text":"NN, ","element":"span"},{"text":"which we leverage in our next result.","element":"span"}],[{"text":"Our third result in this paper is an explicit upper bound for the rate of convergence of regular ","element":"span"},{"text":"GD ","element":"span"},{"text":"on the limiting ","element":"span"},{"text":"ODEs ","element":"span"},{"text":"of dropout algorithms for arborescences (a class of trees, see Figure ","element":"span"},{"href":"#id-3","text":"1c ","element":"a"},{"text":"for an example) of arbitrary depth with linear activation functions ","element":"span"},{"style":{"height":17.6},"width":166.55,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/5-0.png","element":"img","alt":" σ(t) = t.","inline":true,"padRight":true},{"text":"In particular, we will consider the update rule","element":"span"}],[{"id":"id-32","style":{"width":"66%"},"width":1151,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/5-1.png","element":"img"}],[{"text":"Analyzing the convergence of training algorithms on simplified ","element":"span"},{"text":"NNs ","element":"span"},{"text":"with linear activation functions is a common way to gain insight into more complex models, see e.g. 
(","element":"span"},{"href":"#id-28","referenceIndex":1,"text":"Arora et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":1,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-29","referenceIndex":41,"text":"Shamir","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":41,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-30","referenceIndex":5,"text":"Bartlett et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":5,"text":"2018","element":"a"},{"text":"). Even without a dropout algorithm present, this task already provides a substantial theoretical challenge as the optimization landscape is nonconvex. Our choice to restrict the analysis to arborescences allows us to quantitatively tie our upper bound for the convergence rate to the depth and the number of paths within the arborescence. We prove the following:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Result 3 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(Informal statement of Proposition ","element":"span"},{"href":"#id-31","style":{"fontStyle":"italic"},"text":"9.","element":"a"},{"style":{"fontStyle":"italic"},"text":") Assume that the base graph ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"style":{"fontStyle":"italic"},"text":"of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NN ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is an arborescence of depth ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|L","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"| 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"leaves and the filters ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"follow the distribution prescribed by ","element":"span"},{"text":"Dropconnect ","element":"span"},{"style":{"fontStyle":"italic"},"text":"or ","element":"span"},{"text":"Dropout ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with dropout probability ","element":"span"},{"style":{"height":15.2},"width":103.69,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/5-2.png","element":"img","alt":" 1 − p","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"(see Proposition ","element":"span"},{"href":"#id-31","style":{"fontStyle":"italic"},"text":"9)","element":"a"},{"style":{"fontStyle":"italic"},"text":". Then there exist ","element":"span"},{"style":{"height":16.4},"width":402.2,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/5-3.png","element":"img","alt":" α > 0 and 1 > η > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"depending on the initialization such that the iterates of ","element":"span"},{"href":"#id-32","text":"(9) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"style":{"width":"83%"},"width":1446,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/5-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"with","element":"span"}],[{"style":{"width":"62%"},"width":1075,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/5-5.png","element":"img"}],[{"text":"One important consequence of this result is that the convergence rate exponent indeed deteriorates by a factor 
","element":"span"},{"style":{"height":18.74},"width":44.96,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/5-6.png","element":"img","alt":" pL ","inline":true,"padRight":true},{"text":"in these ","element":"span"},{"text":"NNs. ","element":"span"},{"text":"Finally, we complement this result with numerical experiments. We target the dependency of the convergence on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"for more realistic wider and nonlinear networks on commonly used datasets. Perhaps surprisingly, we do not observe an exponential decrease of the convergence rate exponent due to dropout in these simulations. We will offer some heuristic explanation for this result by looking at the update rate of a generic weight.","element":"span"}],[{"text":"Our results lead to the following consequences. First, whenever the iterates of a dropout algorithm with ","element":"span"},{"style":{"height":15.02},"width":35.18,"height":37.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/5-7.png","element":"img","alt":" ℓ2","inline":true},{"text":"-loss are bounded, they are guaranteed to converge to a stationary point of the risk function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"induced by the dropout algorithm. Secondly, we prove rigorously that the convergence rate when training with e.g. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect ","element":"span"},{"text":"can change the convergence rate on the empirical risk depending on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"and in arborescences can decrease by as much as a factor ","element":"span"},{"style":{"height":18.73},"width":44.96,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/5-8.png","element":"img","alt":" pL","inline":true},{"text":". For more realistic wider networks, however, we conduct numerical experiments that suggest that the convergence rate is not necessarily affected by depth as much across different dropout rates ","element":"span"},{"style":{"height":15.2},"width":86.36,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/5-9.png","element":"img","alt":" 1−p","inline":true,"padRight":true},{"text":"in neural networks with just a few layers of dropout.","element":"span"}],[{"text":"Our findings motivate further theoretical study of the convergence rate of dropout for deep and wide networks. We suspect that there is a transition regime of the convergence rate. 
Such a transition would affect the dependence on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"and would be observed when going from narrow networks with many layers of dropout, where the dependence of the rate may be exponential in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", to very wide networks with only a few layers of dropout, where the dependence is no longer exponential.","element":"span"}],[{"id":"id-13","style":{"fontWeight":"bold"},"text":"1.3 Literature overview","element":"span"}],[{"text":"The first description of a dropout algorithm was by ","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"Hinton et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"2012","element":"a"},{"text":"). Diverse variants of the algorithm have appeared since, including versions in which edges are dropped (","element":"span"},{"href":"#id-1","referenceIndex":47,"text":"Wan ","element":"a"},{"href":"#id-1","referenceIndex":47,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":47,"text":"2013","element":"a"},{"text":"); groups of edges are dropped from the input layer (","element":"span"},{"href":"#id-2","referenceIndex":11,"text":"DeVries and Taylor","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":11,"text":"2017","element":"a"},{"text":"); the distribution of the filters is Gaussian (","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"Kingma et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"2015","element":"a"},{"text":"; ","element":"span"},{"href":"#id-34","referenceIndex":31,"text":"Molchanov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":31,"text":"2017","element":"a"},{"text":"); the removal probabilities change adaptively 
(","element":"span"},{"href":"#id-35","referenceIndex":2,"text":"Ba and Frey","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":2,"text":"2013","element":"a"},{"text":"; ","element":"span"},{"href":"#id-36","referenceIndex":27,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":27,"text":"2016","element":"a"},{"text":"); and versions that are suitable for recurrent ","element":"span"},{"text":"NNs ","element":"span"},{"text":"(","element":"span"},{"href":"#id-37","referenceIndex":49,"text":"Zaremba et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":49,"text":"2014","element":"a"},{"text":"; ","element":"span"},{"href":"#id-38","referenceIndex":39,"text":"Semeniuta et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":39,"text":"2016","element":"a"},{"text":"). The performance of the original algorithm has been investigated empirically on datasets (","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"Hinton et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"2012","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":43,"text":"Srivastava et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":43,"text":"2014","element":"a"},{"text":"), and dropout algorithms have found application in e.g. 
image classification (","element":"span"},{"href":"#id-39","referenceIndex":24,"text":"Krizhevsky ","element":"a"},{"href":"#id-39","referenceIndex":24,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":24,"text":"2012","element":"a"},{"text":"), handwriting recognition (","element":"span"},{"href":"#id-40","referenceIndex":36,"text":"Pham et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":36,"text":"2014","element":"a"},{"text":"), heart sound classification (","element":"span"},{"href":"#id-41","referenceIndex":19,"text":"Kay ","element":"a"},{"href":"#id-41","referenceIndex":19,"text":"and Agarwal","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":19,"text":"2016","element":"a"},{"text":"), and drug discovery in cancer research (","element":"span"},{"href":"#id-42","referenceIndex":45,"text":"Urban et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":45,"text":"2018","element":"a"},{"text":").","element":"span"}],[{"text":"Theoretical studies of dropout algorithms have focused on their regularization effect. The effect was first noted by ","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"Hinton et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"2012","element":"a"},{"text":"); ","element":"span"},{"href":"#id-5","referenceIndex":43,"text":"Srivastava et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-5","referenceIndex":43,"text":"2014","element":"a"},{"text":"), and subsequently investigated in-depth for both linear ","element":"span"},{"text":"NNs ","element":"span"},{"text":"as well as nonlinear ","element":"span"},{"text":"NNs ","element":"span"},{"text":"by ","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"Baldi and Sadowski ","element":"a"},{"text":"(","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"2013","element":"a"},{"text":"); ","element":"span"},{"href":"#id-6","referenceIndex":46,"text":"Wager et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-6","referenceIndex":46,"text":"2013","element":"a"},{"text":"); ","element":"span"},{"href":"#id-7","referenceIndex":3,"text":"Baldi and Sadowski ","element":"a"},{"text":"(","element":"span"},{"href":"#id-7","referenceIndex":3,"text":"2014","element":"a"},{"text":"); ","element":"span"},{"href":"#id-12","referenceIndex":48,"text":"Wei et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-12","referenceIndex":48,"text":"2020","element":"a"},{"text":"). Within the context of matrix factorization, it has been shown that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout","element":"span"},{"text":"’s regularization induces a shrinkage and a thresholding of the singular values of the matrix at the optimum (","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Cavazza et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2018","element":"a"},{"text":"). 
Characterizations of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout","element":"span"},{"text":"’s risk function and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout","element":"span"},{"text":"’s regularizer for (usually linear) ","element":"span"},{"text":"NNs ","element":"span"},{"text":"can be found in ","element":"span"},{"href":"#id-9","referenceIndex":30,"text":"Mianjy et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-9","referenceIndex":30,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-10","referenceIndex":28,"text":"Mianjy and Arora ","element":"a"},{"text":"(","element":"span"},{"href":"#id-10","referenceIndex":28,"text":"2019","element":"a"},{"text":"); ","element":"span"},{"href":"#id-11","referenceIndex":35,"text":"Pal et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-11","referenceIndex":35,"text":"2020","element":"a"},{"text":"). Random networks with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"have also been studied in ","element":"span"},{"href":"#id-43","referenceIndex":42,"text":"Sicking et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-43","referenceIndex":42,"text":"2020","element":"a"},{"text":") and in ","element":"span"},{"href":"#id-44","referenceIndex":16,"text":"Huang ","element":"a"},{"href":"#id-44","referenceIndex":16,"text":"et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-44","referenceIndex":16,"text":"2019","element":"a"},{"text":").","element":"span"}],[{"text":"Detailed theoretical investigations into the convergence of dropout algorithms are, however, relatively scarce. 
While we were revising this paper, new results appeared that give insight into the convergence rate of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"in shallow ReLU ","element":"span"},{"text":"NNs ","element":"span"},{"text":"for a classification task (","element":"span"},{"href":"#id-14","referenceIndex":29,"text":"Mianjy and Arora","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":29,"text":"2020","element":"a"},{"text":"). In ","element":"span"},{"href":"#id-14","referenceIndex":29,"text":"Mianjy and Arora ","element":"a"},{"text":"(","element":"span"},{"href":"#id-14","referenceIndex":29,"text":"2020","element":"a"},{"text":"), it is shown that ","element":"span"},{"style":{"height":17.6},"width":129.26,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/6-0.png","element":"img","alt":" O(1/ϵ)","inline":true,"padRight":true},{"text":"iterations of ","element":"span"},{"text":"SGD ","element":"span"},{"text":"are required to reach ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/6-1.png","element":"img","alt":" ϵ","inline":true},{"text":"-suboptimality for the test error; interestingly, this number is independent of the dropout probability because of their assumption that the data distribution is separable by a margin in a particular reproducing kernel Hilbert space. In contrast, in our generic convergence result we do not assume structure on the predictor or data and look instead at the number of iterations required to reach ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/6-2.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationarity in nonconvex functions using dropout-like ","element":"span"},{"text":"SGD. 
","element":"span"},{"text":"A study of the asymptotic convergence rate of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect ","element":"span"},{"text":"on shallow linear neural networks has also appeared recently (","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"Senen-Cerda and Sanders","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"2022","element":"a"},{"text":"). There, for wide shallow linear networks with width ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"and dropout probability ","element":"span"},{"style":{"height":16},"width":332.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/6-3.png","element":"img","alt":" 1 − p > 0 a local","inline":true,"padRight":true},{"text":"convergence rate close to a minimum of ","element":"span"},{"style":{"height":17.6},"width":645.11,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/6-4.png","element":"img","alt":" O(p(1 − p)/(pD + 1 − p)) is found.","inline":true,"padRight":true},{"text":"Finally, it must be noted that convergence properties have been thoroughly studied within the context of ","element":"span"},{"text":"NNs ","element":"span"},{"text":"trained without dropout algorithms, see e.g. ","element":"span"},{"href":"#id-28","referenceIndex":1,"text":"Arora et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-28","referenceIndex":1,"text":"2019","element":"a"},{"text":"); ","element":"span"},{"href":"#id-29","referenceIndex":41,"text":"Shamir ","element":"a"},{"text":"(","element":"span"},{"href":"#id-29","referenceIndex":41,"text":"2019","element":"a"},{"text":"); ","element":"span"},{"href":"#id-45","referenceIndex":50,"text":"Zou et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-45","referenceIndex":50,"text":"2020","element":"a"},{"text":"); ","element":"span"},{"href":"#id-46","referenceIndex":13,"text":"Gao et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-46","referenceIndex":13,"text":"2021","element":"a"},{"text":") and references therein.","element":"span"}],[{"text":"Dropout algorithms can, by construction, be understood as forms of ","element":"span"},{"text":"SGD. ","element":"span"},{"text":"More generally, dropout algorithms are all stochastic approximation algorithms. The first stochastic approximation algorithms were introduced by ","element":"span"},{"href":"#id-47","referenceIndex":37,"text":"Robbins and Monro ","element":"a"},{"text":"(","element":"span"},{"href":"#id-47","referenceIndex":37,"text":"1951","element":"a"},{"text":"); ","element":"span"},{"href":"#id-48","referenceIndex":20,"text":"Kiefer and Wolfowitz ","element":"a"},{"text":"(","element":"span"},{"href":"#id-48","referenceIndex":20,"text":"1952","element":"a"},{"text":"), and have been the subject of an enormous literature due to their ubiquity. 
For overviews and their application to ","element":"span"},{"text":"NNs, ","element":"span"},{"text":"we refer to books by ","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"Kushner and Yin ","element":"a"},{"text":"(","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"2003","element":"a"},{"text":"); ","element":"span"},{"href":"#id-50","referenceIndex":7,"text":"Borkar ","element":"a"},{"text":"(","element":"span"},{"href":"#id-50","referenceIndex":7,"text":"2009","element":"a"},{"text":"); ","element":"span"},{"href":"#id-51","referenceIndex":6,"text":"Bertsekas and Tsitsiklis ","element":"a"},{"text":"(","element":"span"},{"href":"#id-51","referenceIndex":6,"text":"1995","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A word on notation","element":"span"}],[{"text":"In this paper we index deterministic sequences with curly brackets: ","element":"span"},{"style":{"height":19.53},"width":178.13,"height":48.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-0.png","element":"img","alt":" α{1}, β{1}","inline":true},{"text":", etc. This distinguishes them from sequences of random variables, which we index using square brackets, e.g. ","element":"span"},{"style":{"height":19.13},"width":263.44,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-1.png","element":"img","alt":" X[1], Y [1], etc.","inline":true}],[{"text":"Deterministic vectors are written in lower case like ","element":"span"},{"style":{"height":15.93},"width":127.79,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-2.png","element":"img","alt":" x ∈ Rd","inline":true},{"text":", but an exception is made for random variables (which are always capitalized). Matrices are also always capitalized. 
For a function ","element":"span"},{"style":{"height":12.4},"width":194.96,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-3.png","element":"img","alt":" σ : R → R","inline":true,"padRight":true},{"text":"and a matrix ","element":"span"},{"style":{"height":18.33},"width":347.13,"height":45.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-4.png","element":"img","alt":" A ∈ Ra×b, a, b ≥ 1","inline":true},{"text":", we denote by ","element":"span"},{"style":{"height":17.6},"width":93.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-5.png","element":"img","alt":" σ(A)","inline":true,"padRight":true},{"text":"the matrix with ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-6.png","element":"img","alt":" σ","inline":true,"padRight":true},{"text":"applied componentwise to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":". Subscripts will be used to denote the entries of any tensor, e.g. ","element":"span"},{"style":{"height":18.04},"width":300.11,"height":45.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-7.png","element":"img","alt":"xi, Ai,j, or Ti,j,l","inline":true},{"text":". 
For any vector ","element":"span"},{"style":{"height":18.33},"width":273.38,"height":45.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-8.png","element":"img","alt":" x ∈ Rd, the ℓ2","inline":true},{"text":"-norm is defined as ","element":"span"},{"style":{"height":22},"width":440.84,"height":55.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-9.png","element":"img","alt":" ∥x∥_2 ≜ (∑_{i=1}^{d} |x_i|^2)^{1/2}.","inline":true,"padRight":true},{"text":"For any matrix ","element":"span"},{"style":{"height":15.93},"width":176.92,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-10.png","element":"img","alt":" A ∈ Ra×b","inline":true},{"text":", the Frobenius norm is defined as ","element":"span"},{"style":{"height":24.4},"width":596.03,"height":61.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-11.png","element":"img","alt":" ∥A∥_F ≜ (∑_{i=1}^{a} ∑_{j=1}^{b} |A_{i,j}|^2)^{1/2}.","inline":true,"padRight":true},{"text":"For two matrices ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A, B","element":"span"},{"text":", the Hadamard (componentwise) product is denoted by ","element":"span"},{"style":{"height":14.4},"width":133.34,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-12.png","element":"img","alt":" A ⊙ B.","inline":true}],[{"text":"Let ","element":"span"},{"style":{"height":15.82},"width":57.52,"height":39.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-13.png","element":"img","alt":" N+","inline":true,"padRight":true},{"text":"be the strictly positive integers and ","element":"span"},{"style":{"height":17.6},"width":562.74,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-14.png","element":"img","alt":" N0 ≜ N+ ∪ {0}. 
For l ∈ N+","inline":true},{"text":", we denote ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"] = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , l","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". For a function ","element":"span"},{"style":{"height":19.13},"width":222.32,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-15.png","element":"img","alt":" g ∈ C2(Rn)","inline":true},{"text":", we denote the gradient and Hessian of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"text":"with respect to the Euclidean norm ","element":"span"},{"style":{"height":19.89},"width":502.45,"height":49.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-16.png","element":"img","alt":" ∥·∥2 in Rn by ∇g and ∇2g","inline":true},{"text":", respectively.","element":"span"}]]},{"heading":"2. 
Model","paragraphs":[[{"text":"We now formally define ","element":"span"},{"text":"NNs, ","element":"span"},{"text":"which we depicted in Figure ","element":"span"},{"href":"#id-3","text":"1, ","element":"a"},{"text":"as well as the class of activation functions that we will use for the convergence guarantee in our first result below.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.1 Neural networks, and their structure","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"denote the number of layers in the ","element":"span"},{"text":"NN, ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":16.22},"width":152.45,"height":40.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-17.png","element":"img","alt":" dl ∈ N+","inline":true,"padRight":true},{"text":"the output dimension of layer ","element":"span"},{"style":{"height":19.18},"width":629.73,"height":47.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-18.png","element":"img","alt":"l = 1, . . . , L. Let Wl+1 ∈ Rdl+1×dl ","inline":true,"padRight":true},{"text":"denote the matrix of weights between layers ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"for ","element":"span"},{"style":{"height":19.53},"width":1658.75,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-19.png","element":"img","alt":" l = 0, 1, . . . , L − 1. Denote W = (WL, . . . , W1) ∈ W with W ≜ RdL×dL−1 × · · · × Rd1×d0","inline":true,"padRight":true},{"text":"the set of all possible weights. 
In this paper, we consider ","element":"span"},{"text":"NNs ","element":"span"},{"text":"without biases.","element":"span"}],[{"style":{"height":12.8},"width":379.84,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-20.png","element":"img","alt":"Definition 4 Let σ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be an activation function ","element":"span"},{"style":{"height":12.8},"width":265.62,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-21.png","element":"img","alt":" σ : R → R. A","inline":true,"padRight":true},{"text":"Neural Network (NN) ","element":"span"},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"layers ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is given by the class of functions ","element":"span"},{"style":{"height":17.84},"width":313.32,"height":44.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-22.png","element":"img","alt":" ΨW : Rd0 → RdL ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"defined iteratively by","element":"span"}],[{"style":{"width":"94%"},"width":1627,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-23.png","element":"img"}],[{"text":"Canonical activation functions include the ","element":"span"},{"text":"Rectified Linear Unit (ReLU) ","element":"span"},{"text":"function ","element":"span"},{"style":{"height":17.6},"width":122.33,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-24.png","element":"img","alt":" σ(t) =","inline":true,"padRight":true},{"text":"max","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", t","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":", the sigmoid function 
","element":"span"},{"style":{"height":18.73},"width":363.37,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-25.png","element":"img","alt":" σ(t) = 1/(1 + e−t)","inline":true},{"text":", and the linear function ","element":"span"},{"style":{"height":17.6},"width":243.14,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-26.png","element":"img","alt":" σ(t) = t. In","inline":true,"padRight":true},{"text":"Sections ","element":"span"},{"text":"2 ","element":"span"},{"text":"and ","element":"span"},{"text":"3 ","element":"span"},{"text":"we restrict to the case that ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-27.png","element":"img","alt":" σ","inline":true,"padRight":true},{"text":"belongs to a class of polynomially bounded differentiable functions.","element":"span"}],[{"id":"id-20","style":{"height":12.8},"width":563.62,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-28.png","element":"img","alt":"Definition 5 For σ : R → R","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"differentiable, denote the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"style":{"fontStyle":"italic"},"text":"th derivative of ","element":"span"},{"style":{"height":19.53},"width":346.62,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-29.png","element":"img","alt":" σ by σ(l). 
The set","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"of polynomially bounded maps with continuous derivatives up to order ","element":"span"},{"style":{"height":14.62},"width":122.75,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-30.png","element":"img","alt":" r ∈ N0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is given by","element":"span"}],[{"style":{"width":"82%"},"width":1435,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-31.png","element":"img"}],[{"text":"Note that the linear and sigmoid activation functions both belong to ","element":"span"},{"style":{"height":18.37},"width":309.28,"height":45.93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-32.png","element":"img","alt":" CrPB(R) for any","inline":true},{"style":{"height":14.62},"width":126.4,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-33.png","element":"img","alt":"r ∈ N0","inline":true},{"text":". Also, any polynomial activation function ","element":"span"},{"style":{"height":17.6},"width":230.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-34.png","element":"img","alt":" P(x) ∈ R[x]","inline":true,"padRight":true},{"text":"belongs to ","element":"span"},{"style":{"height":24.29},"width":314.59,"height":60.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/7-35.png","element":"img","alt":" Cdeg(P)PB (R). The","inline":true,"padRight":true},{"text":"ReLU ","element":"span"},{"text":"activation function is not in ","element":"span"},{"style":{"height":18.37},"width":460.59,"height":45.93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-0.png","element":"img","alt":" CrPB(R) for any r ∈ N0","inline":true},{"text":". 
However, because the class ","element":"span"},{"style":{"height":18.37},"width":145.68,"height":45.93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-1.png","element":"img","alt":"CrPB(R)","inline":true,"padRight":true},{"text":"contains polynomials of any degree, we can approximate cases such as ","element":"span"},{"text":"ReLU ","element":"span"},{"text":"by ","element":"span"},{"text":"using, e.g., the softplus activation function ","element":"span"},{"style":{"height":17.6},"width":505.51,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-2.png","element":"img","alt":" σt(x) = log(1 + exp(tx))/t","inline":true},{"text":", which satisfies that ","element":"span"},{"style":{"height":17.6},"width":771.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-3.png","element":"img","alt":"limt→∞ σt(x) = ReLU(x) for every x ∈ R","inline":true},{"text":". Note that the softplus activation function belongs to ","element":"span"},{"style":{"height":19.91},"width":157.65,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-4.png","element":"img","alt":" C2PB(R).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"2.2 Backpropagation, and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"SGD","element":"span"}],[{"text":"In Section ","element":"span"},{"href":"#id-52","text":"1.1 ","element":"a"},{"text":"we have defined the risk ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"that in the previous notation now depends on a loss ","element":"span"},{"style":{"height":15.53},"width":366.06,"height":38.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-5.png","element":"img","alt":" l : RdL × RdL → R","inline":true},{"text":". 
Throughout this article, we will specify the Euclidean ","element":"span"},{"style":{"height":15.02},"width":150.26,"height":37.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-6.png","element":"img","alt":" ℓ2-norm","inline":true},{"style":{"height":19.41},"width":335.16,"height":48.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-7.png","element":"img","alt":"l(x, y) ≜ ∥x − y∥22 ","inline":true,"padRight":true},{"text":"as our loss function of interest without loss of generality. ","element":"span"},{"style":{"height":8.4},"width":17,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-8.png","element":"img","alt":" 1","inline":true}],[{"text":"Furthermore, in the definition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-16","text":"(1)","element":"a"},{"text":", we make no distinction between an oracle risk function or empirical risk function. Both situations are covered by the definition in ","element":"span"},{"href":"#id-16","text":"(1)","element":"a"},{"text":". Hence, our results cover the empirical risk case when we have a finite number of samples, as well as the online learning case, where a new sample is provided at each step of ","element":"span"},{"text":"SGD. 
","element":"span"},{"text":"What we do assume is that one has the ability to repeatedly draw independent and identically distributed samples either distribution.","element":"span"}],[{"text":"In an attempt to find a critical point in the set ","element":"span"},{"style":{"height":17.6},"width":300.14,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-9.png","element":"img","alt":" arg minW U(W)","inline":true},{"text":", as mentioned in ","element":"span"},{"href":"#id-52","text":"(1.1)","element":"a"},{"text":", ","element":"span"},{"text":"SGD ","element":"span"},{"text":"is commonly used. Let ","element":"span"},{"style":{"height":22.02},"width":317.5,"height":55.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-10.png","element":"img","alt":" {(Y [t], X[t])}t∈N+","inline":true,"padRight":true},{"text":"be a sequence of independent copies of ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X, Y ","element":"span"},{"text":")","element":"span"},{"text":", let ","element":"span"},{"style":{"height":16.73},"width":197.81,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-11.png","element":"img","alt":" W [0] ∈ W","inline":true,"padRight":true},{"text":"be an arbitrary nonrandom initialization of the weights. For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , L","element":"span"},{"text":", ","element":"span"},{"style":{"height":16.22},"width":529.48,"height":40.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-12.png","element":"img","alt":"r = 1, . . . , di+1, l = 1, . . . 
, di","inline":true},{"text":", the weights are iteratively updated according to","element":"span"}],[{"id":"id-94","style":{"width":"77%"},"width":1345,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-13.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":22.03},"width":690.46,"height":55.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-14.png","element":"img","alt":" t = 0, 1, 2, et cetera. Here {α{t}}t∈N+","inline":true,"padRight":true},{"text":"denotes a positive, deterministic step size sequence, and the estimate of the gradient ","element":"span"},{"style":{"height":17.6},"width":480.55,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-15.png","element":"img","alt":" BW (·, ·) = ∇W l(ΨW (·), ·)","inline":true,"padRight":true},{"text":"is computed using the backpropagation algorithm, which is given in Definition ","element":"span"},{"href":"#id-53","text":"15 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"A. ","element":"span"},{"text":"The stochastic gradient is an unbiased estimate of the gradient of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":")","element":"span"},{"text":". 
In particular, we have","element":"span"}],[{"style":{"width":"83%"},"width":1444,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-16.png","element":"img"}],[{"id":"id-19","style":{"fontWeight":"bold"},"text":"2.3 Dropout algorithms, and their risk functions","element":"span"}],[{"text":"Dropout algorithms use ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":"-valued random matrices as filters of weights during the backpropagation step of ","element":"span"},{"text":"SGD. ","element":"span"},{"text":"More precisely, we examine the following class of dropout algorithms. Let ","element":"span"},{"style":{"height":19.53},"width":1097.17,"height":48.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-17.png","element":"img","alt":" (F, X, Y ) : Ω → {0, 1}dL×dL−1×. . .×{0, 1}d1×d0×Rd0×RdL ","inline":true,"padRight":true},{"text":"be a random variable on the probability space ","element":"span"},{"style":{"height":17.6},"width":166.64,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-18.png","element":"img","alt":" (Ω, F, P)","inline":true},{"text":". Here, we write ","element":"span"},{"style":{"height":19.53},"width":783.47,"height":48.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-19.png","element":"img","alt":" F = (FL, . . . , F1) and Fi+1 ∈ {0, 1}di+1×di","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":15.2},"width":291.41,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-20.png","element":"img","alt":" i = 0, . . . , L − 1","inline":true},{"text":", similar to how we notate weight matrices. 
Let ","element":"span"},{"style":{"height":22.02},"width":465.79,"height":55.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-21.png","element":"img","alt":" {(F [t], X[t], Y [t])}t∈N+ be","inline":true,"padRight":true},{"text":"a sequence of independent copies of ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"F, X, Y ","element":"span"},{"text":")","element":"span"},{"text":". In tensor notation, the weights are updated by using ","element":"span"},{"href":"#id-54","text":"(2) ","element":"a"},{"text":"with the random direction ","element":"span"},{"style":{"height":15.93},"width":110.27,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-22.png","element":"img","alt":" ∆[t+1] ","inline":true,"padRight":true},{"text":"for dropout given in ","element":"span"},{"href":"#id-18","text":"(4)","element":"a"},{"text":". For each dropout algorithm a different filter distribution will be chosen. 
We can mention a few:","element":"span"}],[{"text":"(i) In canonical ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"(","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"Hinton et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":15,"text":"2012","element":"a"},{"text":"), ","element":"span"},{"style":{"height":18.53},"width":790.11,"height":46.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-23.png","element":"img","alt":" Fi,r,l′ = Fi,r,l ∼ Bernoulli(p) for any l, l′ ∈","inline":true},{"style":{"height":17.6},"width":331.38,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/8-24.png","element":"img","alt":"[di] with p = 1/2.","inline":true}],[{"style":{"width":"93%"},"width":1620,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-0.png","element":"img"}],[{"text":"(iii) In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Cutout ","element":"span"},{"text":"(","element":"span"},{"href":"#id-2","referenceIndex":11,"text":"DeVries and Taylor","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":11,"text":"2017","element":"a"},{"text":"), ","element":"span"},{"style":{"height":18.44},"width":868,"height":46.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-1.png","element":"img","alt":" F1,r,l = 0 whenever |r − S1| < c, c ∈ N+ and","inline":true},{"style":{"height":17.6},"width":919.14,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-2.png","element":"img","alt":"|l − S2| < c with (S1, S2) ∼ Uniform([d1] × [d0]).","inline":true}],[{"text":"In fact, the class of dropout algorithms we consider is quite large. 
For example, ","element":"span"},{"style":{"height":16.33},"width":145.67,"height":40.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-3.png","element":"img","alt":" F [t] can","inline":true,"padRight":true},{"text":"depend on ","element":"span"},{"style":{"height":24.01},"width":367.72,"height":60.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-4.png","element":"img","alt":" (X[t], Y [t]), and F [t]i","inline":true,"padRight":true},{"text":"does not need to have the same distribution as ","element":"span"},{"style":{"height":26.41},"width":251.43,"height":66.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-5.png","element":"img","alt":" F [t]j for i ̸= j.","inline":true,"padRight":true},{"text":"Recall, however, that if for some filter ","element":"span"},{"style":{"height":26.84},"width":488.58,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-6.png","element":"img","alt":" F [t+1]i,r,l = 0 for some i, r, l","inline":true},{"text":", then in ","element":"span"},{"href":"#id-54","text":"(2) ","element":"a"},{"text":", ","element":"span"},{"style":{"height":26.84},"width":186.04,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-7.png","element":"img","alt":" ∆[t]i,r,l = 0","inline":true,"padRight":true},{"text":"and we have ","element":"span"},{"style":{"height":26.84},"width":281.78,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-8.png","element":"img","alt":" W [t]i,r,l = W [t+1]i,r,l ","inline":true,"padRight":true},{"text":". 
In other words, filtered variables are not updated with these ","element":"span"},{"text":"dropout algorithms.","element":"span"}],[{"text":"If ","element":"span"},{"style":{"height":15.93},"width":64.76,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-9.png","element":"img","alt":" F [t] ","inline":true,"padRight":true},{"text":"is independent of ","element":"span"},{"style":{"height":20.33},"width":636.05,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-10.png","element":"img","alt":" (X[t], Y [t]) for each t ∈ N0 and Ω","inline":true,"padRight":true},{"text":"countable, then the dropout algorithm’s risk function in ","element":"span"},{"href":"#id-24","text":"(6) ","element":"a"},{"text":"simplifies to","element":"span"}],[{"style":{"width":"81%"},"width":1413,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-11.png","element":"img"}],[{"text":"Here the sums are over all possible outcomes of the random variables ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"and ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X, Y ","element":"span"},{"text":")","element":"span"},{"text":", respectively. One implication of Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"in Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"is that dropout algorithms of the kind in ","element":"span"},{"href":"#id-54","text":"(2)","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","text":"(4) ","element":"a"},{"text":"converge to a critical point of ","element":"span"},{"href":"#id-24","text":"(6)","element":"a"},{"text":".","element":"span"}]]},{"heading":"3. 
Convergence of projected dropout algorithms","paragraphs":[[{"text":"Our first result pertains to the convergence of dropout algorithms for a wide range of activation functions and dropout filters. While convergence is expected in practice, we prove such convergence rigorously. In order to control the iterates of the stochastic algorithm, we project the iterates into a compact set. The projection assumption is common when investigating the convergence of stochastic algorithms (","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"Kushner and Yin","element":"a"},{"text":", ","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"2003","element":"a"},{"text":"; ","element":"span"},{"href":"#id-50","referenceIndex":7,"text":"Borkar","element":"a"},{"text":", ","element":"span"},{"href":"#id-50","referenceIndex":7,"text":"2009","element":"a"},{"text":"; ","element":"span"},{"href":"#id-51","referenceIndex":6,"text":"Bertsekas and Tsitsiklis","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":6,"text":"1995","element":"a"},{"text":"; ","element":"span"},{"href":"#id-21","referenceIndex":34,"text":"Oymak","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":34,"text":"2018","element":"a"},{"text":"); it essentially bounds the weights. 
For example, for ","element":"span"},{"style":{"height":16.73},"width":158.69,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-12.png","element":"img","alt":" V [t] ∈ R","inline":true,"padRight":true},{"text":"and an update function ","element":"span"},{"style":{"height":20.33},"width":360.34,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-13.png","element":"img","alt":" f : R → R, f(V [t])","inline":true,"padRight":true},{"text":"is projected onto an interval ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"a, b","element":"span"},{"text":"] ","element":"span"},{"text":"by clipping and setting ","element":"span"},{"style":{"height":20.33},"width":619.83,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-14.png","element":"img","alt":" V [t+1] = min{max{f(V [t]), a}, b}","inline":true},{"text":". There are also results involving generalization bounds for ","element":"span"},{"text":"NNs ","element":"span"},{"text":"where bounded weights play a role in controlling the learning capacity of the ","element":"span"},{"text":"NN ","element":"span"},{"text":"(","element":"span"},{"href":"#id-55","referenceIndex":33,"text":"Neyshabur et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":33,"text":"2015","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1 Almost sure convergence","element":"span"}],[{"text":"We first consider the notation and assumptions regarding the projection step of ","element":"span"},{"text":"SGD. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":14.4},"width":149.77,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-15.png","element":"img","alt":"H ⊆ W","inline":true,"padRight":true},{"text":"be a convex compact nonempty set and let ","element":"span"},{"style":{"height":15.5},"width":265.02,"height":38.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-16.png","element":"img","alt":" PH : W → H","inline":true,"padRight":true},{"text":"be the projection onto ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". By compactness and convexity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":", the projection is unique. In a projected dropout algorithm, the weight update in ","element":"span"},{"href":"#id-54","text":"(2) ","element":"a"},{"text":"is replaced by ","element":"span"},{"href":"#id-23","text":"(5)","element":"a"},{"text":". Because of the projection, our analysis will tie the limiting behavior of ","element":"span"},{"href":"#id-23","text":"(5) ","element":"a"},{"text":"to a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"projected ","element":"span"},{"text":"ODE. 
","element":"span"},{"text":"To state such type of ","element":"span"},{"text":"ODE, ","element":"span"},{"text":"we need to define a ","element":"span"},{"style":{"height":17.6},"width":414.67,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-17.png","element":"img","alt":" constraint term π(W)","inline":true},{"text":", which is defined as the minimum vector required to keep the solution of the gradient flow","element":"span"}],[{"id":"id-56","style":{"width":"65%"},"width":1140,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-18.png","element":"img"}],[{"text":"in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". Appendix ","element":"span"},{"text":"C ","element":"span"},{"text":"defines the projection term carefully for the case that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"’s boundary is piecewise smooth. Finally, define the set of stationary points","element":"span"}],[{"style":{"width":"75%"},"width":1300,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/9-19.png","element":"img"}],[{"text":"The set ","element":"span"},{"style":{"height":15.9},"width":55.75,"height":39.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-0.png","element":"img","alt":" SH","inline":true,"padRight":true},{"text":"can be divided into a countable number of disjoint compact and connected subsets ","element":"span"},{"style":{"height":15.6},"width":180.96,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-1.png","element":"img","alt":" S1, S2, · · ·","inline":true,"padRight":true},{"text":", say. 
We choose the following set of assumptions:","element":"span"}],[{"text":"(N1) ","element":"span"},{"style":{"height":19.91},"width":237.48,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-2.png","element":"img","alt":" σ ∈ C2PB(R).","inline":true}],[{"style":{"width":"99%"},"width":1721,"height":474,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-3.png","element":"img"}],[{"text":"(N6) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"−∇","element":"span"},{"style":{"height":17.6},"width":969.69,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-4.png","element":"img","alt":"W D|H(W) + π(W) ̸= 0 whenever ∇W D|H(W) ̸= 0.","inline":true}],[{"text":"We are now in position to state our first result:","element":"span"}],[{"id":"id-22","style":{"height":20.48},"width":590.4,"height":51.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-5.png","element":"img","alt":"Proposition 6 Let {W [t]}t∈N0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the sequence of random variables generated by ","element":"span"},{"href":"#id-23","text":"(5) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"href":"#id-18","text":"(4) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"on a probability space ","element":"span"},{"style":{"height":17.6},"width":166.64,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-6.png","element":"img","alt":" (Ω, F, P)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Under assumptions (N1)–(N4), there is a set ","element":"span"},{"style":{"height":16},"width":187.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-7.png","element":"img","alt":" N ⊂ Ω of","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"probability zero such that for ","element":"span"},{"style":{"height":20.33},"width":335.59,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-8.png","element":"img","alt":" ω ̸∈ N, {W [t](ω)}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"converges to a limit set of the projected ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ODE ","element":"span"},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"href":"#id-56","text":"(16)","element":"a"},{"style":{"fontStyle":"italic"},"text":". If moreover (N5)–(N6) hold, then for almost all ","element":"span"},{"style":{"height":20.33},"width":389.03,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-9.png","element":"img","alt":" ω ∈ Ω, {W [t](ω)}t∈N","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"converges to a unique point in ","element":"span"},{"style":{"height":17.6},"width":482.06,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-10.png","element":"img","alt":" {W ∈ H|∇D|H(W) = 0}.","inline":true}],[{"text":"Theoretically, Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"guarantees that projected dropout algorithms converge for regression with the ","element":"span"},{"style":{"height":15.02},"width":35.18,"height":37.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-11.png","element":"img","alt":" ℓ2","inline":true},{"text":"-norm almost surely. 
","element":"span"},{"text":"Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"implies that if one is using a regular ","element":"span"},{"style":{"fontStyle":"italic"},"text":"nonprojected ","element":"span"},{"text":"dropout algorithm and one sees that the iterates ","element":"span"},{"style":{"height":20.33},"width":179.53,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-12.png","element":"img","alt":" {W [t]}t>0","inline":true,"padRight":true},{"text":"are bounded, then these iterates are in fact converging to a stationary point of ","element":"span"},{"href":"#id-24","text":"(6)","element":"a"},{"text":". Assumptions (N5)– (N6) are technical but are expected to hold in many cases. In particular, (N5) holds for the uniformly convergent approximation to a ","element":"span"},{"text":"ReLU ","element":"span"},{"text":"activation function given by softplus ","element":"span"},{"style":{"height":17.6},"width":524.14,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-13.png","element":"img","alt":"σt(x) = log(1 + exp(tx))/t","inline":true},{"text":", and holds for many smooth activation functions. Also (N6) is expected to hold when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"is generic polytope for which the gradient ","element":"span"},{"style":{"height":12.4},"width":70.35,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-14.png","element":"img","alt":" ∇D","inline":true,"padRight":true},{"text":"is not exactly orthogonal to the normal to the surface.","element":"span"}],[{"text":"Observe also that Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"holds remarkably generally. 
For example, the dependence structure of ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"F, X, Y ","element":"span"},{"text":") ","element":"span"},{"text":"as random variables is not restricted; it covers commonly used dropout algorithms such as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect","element":"span"},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Cutout","element":"span"},{"text":"; and it holds for differentiable activation functions. Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"also covers online and offline learning, depending on the distribution ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-15.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"from which we sample.","element":"span"}],[{"text":"Our proof of Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"is in Appendix ","element":"span"},{"text":"D ","element":"span"},{"text":"and relies on the framework of stochastic approximation in (","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"Kushner and Yin","element":"a"},{"text":", ","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"2003","element":"a"},{"text":", Theorem 2.1, p. 127). 
In the background the stochastic process ","element":"span"},{"style":{"height":20.33},"width":179.53,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/10-16.png","element":"img","alt":" {W [t]}t>0","inline":true,"padRight":true},{"text":"is being scaled in both parameter space and time so that the resulting sample paths provably converge to the gradient flow in ","element":"span"},{"href":"#id-56","text":"(16)","element":"a"},{"text":". Examining the proof, we expect that Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"can be extended to cases where the filters as random variables have finite moments, for example, when they are Gaussian distributed (","element":"span"},{"href":"#id-34","referenceIndex":31,"text":"Molchanov ","element":"a"},{"href":"#id-34","referenceIndex":31,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":31,"text":"2017","element":"a"},{"text":"). Concretely, the proofs of Lemmas ","element":"span"},{"href":"#id-57","text":"17 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-58","text":"18 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"D ","element":"span"},{"text":"rely only on the assumption that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"has finite moments, and may therefore be extended.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2 Generic sample complexity for dropout ","element":"span"},{"style":{"fontWeight":"bold"},"text":"SGD","element":"span"}],[{"text":"Examining Proposition ","element":"span"},{"href":"#id-22","text":"6, ","element":"a"},{"text":"we note that it does not give insight into the convergence rate or the precise stationary point of 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"to which the iterates ","element":"span"},{"style":{"height":20.33},"width":124.13,"height":50.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-0.png","element":"img","alt":" {W [t]}","inline":true,"padRight":true},{"text":"converge. A related goal in stochastic optimization is to ask for the number of iterations of ","element":"span"},{"href":"#id-54","text":"(2) ","element":"a"},{"text":"required to achieve a point close to stationarity in expectation, also referred to the sample complexity of the algorithm. We say ","element":"span"},{"style":{"height":12.8},"width":308.19,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-1.png","element":"img","alt":" W ∈ W is an ϵ","inline":true},{"text":"-stationary point of a differentiable function ","element":"span"},{"text":"D ","element":"span"},{"text":"if ","element":"span"},{"style":{"height":17.6},"width":296.26,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-2.png","element":"img","alt":" ∥∇D(W)∥2 ≤ ϵ","inline":true,"padRight":true},{"text":"holds. 
For nonconvex functions ","element":"span"},{"text":"D ","element":"span"},{"text":"with a Lipschitz continuous gradient ","element":"span"},{"style":{"height":12.8},"width":68.36,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-3.png","element":"img","alt":"∇D","inline":true},{"text":", ","element":"span"},{"text":"SGD ","element":"span"},{"text":"convergence to an ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-4.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationary point in expectation can be achieved in ","element":"span"},{"style":{"height":19.13},"width":131.46,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-5.png","element":"img","alt":" O(ϵ−4)","inline":true,"padRight":true},{"text":"iterations; see ","element":"span"},{"href":"#id-59","referenceIndex":8,"text":"Bottou et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-59","referenceIndex":8,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-26","referenceIndex":12,"text":"Drori and Shamir ","element":"a"},{"text":"(","element":"span"},{"href":"#id-26","referenceIndex":12,"text":"2020","element":"a"},{"text":").","element":"span"}],[{"text":"We will consider nonconvex functions with a Lipschitz continuous gradient and assume that the filters ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"and the data ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"= (","element":"span"},{"style":{"fontStyle":"italic"},"text":"X, Y ","element":"span"},{"text":") ","element":"span"},{"text":"are independent. 
We will also assume that the distribution of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"is well-behaved so as to guarantee that we also have the following relations for the functions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r, ","element":"span"},{"text":"U ","element":"span"},{"text":"and ","element":"span"},{"text":"D","element":"span"},{"text":":","element":"span"}],[{"style":{"width":"82%"},"width":1434,"height":179,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-6.png","element":"img"}],[{"text":"Note that the function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"in this setting includes the loss function formulation from ","element":"span"},{"href":"#id-16","text":"(1) ","element":"a"},{"text":"with","element":"span"}],[{"id":"id-60","style":{"width":"75%"},"width":1303,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-7.png","element":"img"}],[{"text":"and in general, at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"the update rule will be","element":"span"}],[{"style":{"width":"77%"},"width":1340,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-8.png","element":"img"}],[{"text":"In the case of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"dropout","element":"span"},{"text":", for example, we expect that the sample complexity of finding an ","element":"span"},{"style":{"height":8},"width":32.72,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-9.png","element":"img","alt":" ϵ-","inline":true,"padRight":true},{"text":"stationary point for the empirical risk will change depending on the dropout probability 
","element":"span"},{"style":{"height":15.2},"width":94.75,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-10.png","element":"img","alt":" 1−p.","inline":true,"padRight":true},{"text":"In particular, if ","element":"span"},{"style":{"height":17.6},"width":511.42,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-11.png","element":"img","alt":" p ↓ 0 and ∥∇U(W)∥∞ < C","inline":true,"padRight":true},{"text":"holds for any ","element":"span"},{"style":{"height":17.6},"width":630.96,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-12.png","element":"img","alt":" W ∈ W, then ∇D(W) = EF [F ⊙","inline":true},{"style":{"height":17.6},"width":434.01,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-13.png","element":"img","alt":"∇U(F ⊙ W)] = O(pC)","inline":true},{"text":". On the other hand if ","element":"span"},{"style":{"height":16},"width":92.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-14.png","element":"img","alt":" p ↑ 1","inline":true},{"text":", then the variance of ","element":"span"},{"style":{"height":17.6},"width":336.74,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-15.png","element":"img","alt":" F ⊙ ∇U(F ⊙ W),","inline":true,"padRight":true},{"text":"will also be small. We make these intuitions rigorous in the next proposition. 
For some ","element":"span"},{"style":{"height":18.33},"width":440.63,"height":45.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-16.png","element":"img","alt":"N ∈ N, we let W = RN ","inline":true,"padRight":true},{"text":"be the parameter space and ","element":"span"},{"style":{"height":17.53},"width":218.33,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-17.png","element":"img","alt":" z ∈ Z ⊆ Rd ","inline":true,"padRight":true},{"text":"a Lebesgue measurable set. We assume the following:","element":"span"}],[{"style":{"width":"99%"},"width":1719,"height":306,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/11-18.png","element":"img"}],[{"text":"Except for (Q4) and (Q5), all other assumptions are routinely used in sample complexity analysis. While the assumptions of Proposition ","element":"span"},{"href":"#id-25","text":"7 ","element":"a"},{"text":"below hold for general nonconvex smooth functions ","element":"span"},{"text":"D","element":"span"},{"text":", in the case of ","element":"span"},{"text":"NNs ","element":"span"},{"text":"and the setting in ","element":"span"},{"href":"#id-60","text":"(20) ","element":"a"},{"text":"we remark that there are examples ","element":"span"},{"id":"id-63","text":"that satisfy these assumptions such as the following one:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Example 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"In a binary classification setting, the set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is compact, that is, the data pairs ","element":"span"},{"style":{"height":17.6},"width":208.69,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-0.png","element":"img","alt":"(x, y) ∈ Z","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"take values in a compact set where 
","element":"span"},{"style":{"height":17.6},"width":203.27,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-1.png","element":"img","alt":" y ∈ {0, 1}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are labels for the two classes. A ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NN, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"denoted by ","element":"span"},{"style":{"height":20.41},"width":118.68,"height":51.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-2.png","element":"img","alt":"˜ΨW (·)","inline":true},{"style":{"fontStyle":"italic"},"text":", uses sigmoid activation functions ","element":"span"},{"style":{"height":17.6},"width":534.42,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-3.png","element":"img","alt":" σ(t) = 1/1 + exp(−t) with","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"output in ","element":"span"},{"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":". The output of ","element":"span"},{"style":{"height":18.71},"width":69.94,"height":46.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-4.png","element":"img","alt":"˜ΨW","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is then used for binary classification with a logistic map, that is, the predicted probability of belonging to one of the classes is given by ","element":"span"},{"style":{"height":17.6},"width":187.34,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-5.png","element":"img","alt":" ΨW (x) =","inline":true},{"style":{"height":20.41},"width":403.26,"height":51.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-6.png","element":"img","alt":"1/(1 + exp(−˜ΨW (x))","inline":true},{"style":{"fontStyle":"italic"},"text":". 
In this setting, assumptions (Q1)–(Q3) will hold if the loss ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is also smooth (such as the ","element":"span"},{"style":{"height":15.02},"width":35.18,"height":37.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-7.png","element":"img","alt":" ℓ2","inline":true},{"style":{"fontStyle":"italic"},"text":"-loss). In this case, we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") = ","element":"span"},{"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and the constants in (Q1)–(Q5) will also indirectly depend on the depth and width of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NN.","element":"span"}],[{"text":"Regarding (Q4), note that it allows for dependencies between filters. We also assume (Q5) for the sake of simplicity: we could use projected SGD with updates from ","element":"span"},{"href":"#id-23","text":"(5) ","element":"a"},{"text":"instead of (Q5), but using projected SGD would leave the scalings in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"invariant. 
","element":"span"},{"style":{"height":8.4},"width":17,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-8.png","element":"img","alt":"2","inline":true,"padRight":true},{"text":"Recall that ","element":"span"},{"style":{"height":17.6},"width":460.43,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-9.png","element":"img","alt":" D(W) = EF [U(F ⊙ W)]","inline":true},{"text":". The proof the following proposition can be found in Appendix ","element":"span"},{"text":"E.","element":"span"}],[{"style":{"height":20.33},"width":551.64,"height":50.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-10.png","element":"img","alt":"Proposition 7 Let (F [t])t∈N","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a sequence of independent random variables with distribution ","element":"span"},{"style":{"height":16.33},"width":216.59,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-11.png","element":"img","alt":" F. Let W [t] ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be iterates of ","element":"span"},{"href":"#id-60","text":"(21)","element":"a"},{"style":{"fontStyle":"italic"},"text":". Assume (Q1)–(Q5). Define ","element":"span"},{"style":{"height":21.29},"width":514.42,"height":53.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-12.png","element":"img","alt":" J = S2+ 32N2(ℓ2R2+2ℓR).","inline":true,"padRight":true},{"text":"(a) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":19.13},"width":542.67,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-13.png","element":"img","alt":" T ∈ N+. 
If p > Mℓ/(NS2T)","inline":true},{"style":{"fontStyle":"italic"},"text":", then there exists a constant stepsize ","element":"span"},{"style":{"height":19.93},"width":336.78,"height":49.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-14.png","element":"img","alt":" α{t} = η > 0 such","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"that for all ","element":"span"},{"style":{"height":17.6},"width":137.9,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-15.png","element":"img","alt":" t ∈ [T],","inline":true}],[{"style":{"width":"78%"},"width":1365,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-16.png","element":"img"}],[{"text":"(b) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":14.4},"width":112.76,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-17.png","element":"img","alt":" T ≥ 4","inline":true},{"style":{"fontStyle":"italic"},"text":". 
There exists a sequence of decreasing stepsizes satisfying ","element":"span"},{"style":{"height":20.34},"width":356.42,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-18.png","element":"img","alt":" α{t} = 1/(ℓ√t) for","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"all ","element":"span"},{"style":{"height":17.6},"width":314.48,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-19.png","element":"img","alt":" t ∈ [T] such that","inline":true}],[{"id":"id-25","style":{"width":"82%"},"width":1433,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-20.png","element":"img"}],[{"text":"In Proposition ","element":"span"},{"href":"#id-25","text":"7, ","element":"a"},{"text":"we observe that finding approximate stationary points is easier with a larger dropout probability ","element":"span"},{"style":{"height":15.2},"width":103.09,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-21.png","element":"img","alt":" 1 − p","inline":true,"padRight":true},{"text":"for a wide range of filter distributions like those determining ","element":"span"},{"style":{"fontStyle":"italic"},"text":"dropout ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"dropconnect","element":"span"},{"text":", as guaranteed by (Q4). In Proposition ","element":"span"},{"href":"#id-25","text":"7(","element":"a"},{"text":"a) we also see a dependence of the convergence rate on","element":"span"},{"style":{"height":20.8},"width":641.98,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-22.png","element":"img","alt":"√(p(S2 + (1 − p)J)). 
The term pS2 ","inline":true,"padRight":true},{"text":"corresponds to the variance of the gradient due to the distribution of data in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"and decreases with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"; while the term ","element":"span"},{"style":{"height":17.6},"width":182,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-23.png","element":"img","alt":" p(1 − p)J","inline":true,"padRight":true},{"text":"stems from the variance due to dropout. Note that the sum achieves a maximum for ","element":"span"},{"style":{"height":17.6},"width":172.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-24.png","element":"img","alt":" p ∈ (0, 1)","inline":true},{"text":". We note that Proposition ","element":"span"},{"href":"#id-25","text":"7 ","element":"a"},{"text":"does not suggest that the convergence to minima is faster for smaller ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":". In particular, saddle points can become easier to find as ","element":"span"},{"style":{"height":16},"width":90.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/12-25.png","element":"img","alt":"p ↑ 0","inline":true},{"text":". 
As seen later in the numerical experiments with ","element":"span"},{"text":"NNs ","element":"span"},{"text":"in Section ","element":"span"},{"href":"#id-61","text":"5.1, ","element":"a"},{"text":"or in similar work from ","element":"span"},{"href":"#id-14","referenceIndex":29,"text":"Mianjy and Arora ","element":"a"},{"text":"(","element":"span"},{"href":"#id-14","referenceIndex":29,"text":"2020","element":"a"},{"text":"); ","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"Senen-Cerda and Sanders ","element":"a"},{"text":"(","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"2022","element":"a"},{"text":"), the ","element":"span"},{"text":"NN ","element":"span"},{"text":"structure and data distribution can change the convergence rate dependence on the dropout probability. As an example, in ","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"Senen-Cerda and Sanders ","element":"a"},{"text":"(","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"2022","element":"a"},{"text":") it is suggested that the convergence rate dependence on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"and the width of the ","element":"span"},{"text":"NN ","element":"span"},{"text":"can have different regimes depending on whether we are close to a minimum or not. Similarly, smaller ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"does not necessarily improve generalization. In particular, if the dropout probability ","element":"span"},{"style":{"height":15.2},"width":89.4,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-0.png","element":"img","alt":" 1−p","inline":true,"padRight":true},{"text":"is large, the optimization landscape will be flat with many approximate stationary points. 
In this case, ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with dropout with a limited sample complexity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"iterations will not explore the landscape as much as when using a smaller dropout probability. With a flatter landscape in mind, it may be better in the complexity trade-off to use a larger ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"to find an approximate minimum and generalize better, instead of finding a stationary point.","element":"span"}],[{"text":"A possible approach to avoid the flattening of the landscape is to scale the weights appropriately during training. This is, for example, what is done in practice in some implementations of dropout.","element":"span"},{"style":{"height":8.4},"width":17,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-1.png","element":"img","alt":"3 ","inline":true,"padRight":true},{"text":"Assuming (Q4) holds, we consider the update rule","element":"span"}],[{"style":{"width":"77%"},"width":1349,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-2.png","element":"img"}],[{"text":"With ","element":"span"},{"href":"#id-62","text":"(24)","element":"a"},{"text":", the use of filters is compensated by increasing the size of the updates and weights accordingly. In this case, ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with this update rule is actually minimizing the function","element":"span"}],[{"id":"id-62","style":{"width":"58%"},"width":1018,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-3.png","element":"img"}],[{"text":"which also compensates in expectation for the effect of the filters. 
With the update rule in ","element":"span"},{"href":"#id-62","text":"(24)","element":"a"},{"text":", we can again obtain an expression for the complexity of finding an ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-4.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationary point of ","element":"span"},{"style":{"height":20.61},"width":112.76,"height":51.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-5.png","element":"img","alt":"˜D(W)","inline":true},{"text":". The following is proved in Appendix ","element":"span"},{"text":"E:","element":"span"}],[{"style":{"height":20.33},"width":551.64,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-6.png","element":"img","alt":"Proposition 8 Let (F [t])t∈N","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a sequence of independent random variables with distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":". Assume (Q1)–(Q5). Let ","element":"span"},{"style":{"height":16.33},"width":77.92,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-7.png","element":"img","alt":" W [t] ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be iterates of ","element":"span"},{"href":"#id-62","text":"(24)","element":"a"},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"height":19.13},"width":558.12,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-8.png","element":"img","alt":" T ∈ N+. 
If p > Mℓ/(NS2T),","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"then there exists a constant stepsize ","element":"span"},{"style":{"height":19.93},"width":237.77,"height":49.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-9.png","element":"img","alt":" α{t} = η > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":17.6},"width":137.9,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-10.png","element":"img","alt":" t ∈ [T],","inline":true}],[{"style":{"width":"98%"},"width":1699,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-11.png","element":"img"}],[{"text":"Proposition ","element":"span"},{"href":"#id-62","text":"8 ","element":"a"},{"text":"shows that for the scaled dropout ","element":"span"},{"text":"SGD ","element":"span"},{"text":"of ","element":"span"},{"href":"#id-62","text":"(24) ","element":"a"},{"text":"the complexity of finding an ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-12.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationary point monotonically increases with ","element":"span"},{"style":{"height":15.2},"width":99.2,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-13.png","element":"img","alt":" 1 − p","inline":true},{"text":". This result contrasts with Proposition ","element":"span"},{"href":"#id-25","text":"7, ","element":"a"},{"text":"where a different behavior was observed. 
We remark, however, that this result assumes (Q5), which for small ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"cannot realistically hold since a bound ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"for the norm of the weights may also scale by a factor ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/p","element":"span"},{"text":". As with Proposition ","element":"span"},{"href":"#id-25","text":"7, ","element":"a"},{"text":"this result does not imply that good weights ","element":"span"},{"style":{"height":12.8},"width":150.93,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-14.png","element":"img","alt":" W ∈ W","inline":true,"padRight":true},{"text":"become easier to find by using the update ","element":"span"},{"href":"#id-62","text":"(24)","element":"a"},{"text":". Indeed, scaling partially avoids the flattening of the landscape (the Lipschitz constant of ","element":"span"},{"style":{"height":16.61},"width":68.36,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-15.png","element":"img","alt":"∇˜D","inline":true,"padRight":true},{"text":"is scaled by a factor ","element":"span"},{"style":{"height":19.13},"width":82.59,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-16.png","element":"img","alt":" 1/p2","inline":true},{"text":"), but the variance of ","element":"span"},{"text":"SGD ","element":"span"},{"text":"due to dropout is also increased considerably. 
This variance becomes dominant when the dropout rate ","element":"span"},{"style":{"height":16},"width":175.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-17.png","element":"img","alt":" 1 − p ↑ 1","inline":true,"padRight":true},{"text":"due to the inverse dependence on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"in the sample complexity.","element":"span"}],[{"text":"Propositions ","element":"span"},{"href":"#id-25","text":"7 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-62","text":"8 ","element":"a"},{"text":"show that the complexity of finding ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/13-18.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationary points heavily depends on the algorithm used. However, when we restrict the results to deep ","element":"span"},{"text":"NNs ","element":"span"},{"text":"such as with Example ","element":"span"},{"href":"#id-63","text":"1, ","element":"a"},{"text":"the bounds do not provide much information on the dependence of the convergence rate on the depth of the network. This fact also shows the limitations of using a generic sample complexity analysis.","element":"span"}],[{"text":"In order to obtain an explicit convergence rate depending on the depth, we need to use the additional structure of the ","element":"span"},{"text":"NN. ","element":"span"},{"text":"In the next section we will be able to compute the convergence rate to a global minimum for ","element":"span"},{"text":"NNs ","element":"span"},{"text":"that are shaped like arborescences and obtain an explicit bound that depends on the depth of the arborescence and the dropout probability.","element":"span"}]]},{"heading":"4. 
Convergence rate of GD on D(W) for arborescences with linear activation","paragraphs":[[{"text":"We obtained a convergence guarantee as well as a bound for the sample complexity of dropout in the previous section. Next, we focus on the convergence rate of dropout in functions that model the structure of ","element":"span"},{"text":"NNs. ","element":"span"},{"text":"In particular, we will derive an explicit convergence rate for dropout algorithms in the case that we have linear activations ","element":"span"},{"style":{"height":17.6},"width":169.24,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-0.png","element":"img","alt":" σ(z) = z","inline":true,"padRight":true},{"text":"and that the ","element":"span"},{"text":"NN ","element":"span"},{"text":"is structured as an arborescence: see Figure ","element":"span"},{"href":"#id-3","text":"1c. ","element":"a"},{"text":"Specifically, we will study the following regular ","element":"span"},{"text":"GD ","element":"span"},{"text":"algorithm on dropout’s risk function:","element":"span"}],[{"id":"id-64","style":{"width":"74%"},"width":1280,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-1.png","element":"img"}],[{"text":"Here, we keep the step size ","element":"span"},{"style":{"height":12.4},"width":108.86,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-2.png","element":"img","alt":" α > 0","inline":true,"padRight":true},{"text":"fixed. 
Note that this algorithm generates a deterministic sequence ","element":"span"},{"style":{"height":20.48},"width":211.85,"height":51.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-3.png","element":"img","alt":" {W {t}}t∈N0","inline":true,"padRight":true},{"text":"as opposed to a sequence of random variables ","element":"span"},{"style":{"height":20.48},"width":196.79,"height":51.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-4.png","element":"img","alt":" {W [t]}t∈N0","inline":true,"padRight":true},{"text":"as generated by ","element":"span"},{"href":"#id-54","text":"(2) ","element":"a"},{"text":"or ","element":"span"},{"href":"#id-18","text":"(4)","element":"a"},{"text":". ","element":"span"},{"text":"We will use a linear activation function ","element":"span"},{"style":{"height":17.6},"width":170.66,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-5.png","element":"img","alt":" σ(t) = t","inline":true},{"text":", which, combined with the arborescence structure, will allow us to obtain an explicit convergence rate. While the iterates of ","element":"span"},{"href":"#id-64","text":"(27) ","element":"a"},{"text":"are not stochastic, analogous to Proposition ","element":"span"},{"href":"#id-22","text":"6, ","element":"a"},{"text":"the stochastic iterates will converge to a gradient flow of an ","element":"span"},{"text":"ODE, ","element":"span"},{"text":"whose discretization is given in ","element":"span"},{"href":"#id-64","text":"(27)","element":"a"},{"text":". Analyzing ","element":"span"},{"text":"ODEs ","element":"span"},{"text":"related to ","element":"span"},{"text":"NNs ","element":"span"},{"text":"is common in the literature; see ","element":"span"},{"href":"#id-65","referenceIndex":44,"text":"Tarmoun et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-65","referenceIndex":44,"text":"2021","element":"a"},{"text":"); ","element":"span"},{"href":"#id-66","referenceIndex":17,"text":"Jacot et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-66","referenceIndex":17,"text":"2018","element":"a"},{"text":"). For more discussion on the relationship between the iterates of ","element":"span"},{"href":"#id-64","text":"(27) ","element":"a"},{"text":"and dropout we refer to Appendix ","element":"span"},{"text":"B.","element":"span"}],[{"text":"Our main convergence result in Proposition ","element":"span"},{"href":"#id-67","text":"13 ","element":"a"},{"text":"below holds for general distribution functions. However, we show here the cases of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect","element":"span"},{"text":", which are most insightful. We use the following notation adapted from graph theory. Consider a fixed, directed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"base graph ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"= (","element":"span"},{"style":{"fontStyle":"italic"},"text":"E","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V","element":"span"},{"text":") ","element":"span"},{"text":"without cycles in which all paths have length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":", which describes a ","element":"span"},{"text":"NN’","element":"span"},{"text":"s structure as follows. 
Each vertex ","element":"span"},{"style":{"height":12.8},"width":115.34,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-6.png","element":"img","alt":" v ∈ V","inline":true,"padRight":true},{"text":"represents a neuron of the ","element":"span"},{"text":"NN, ","element":"span"},{"text":"and each directed edge ","element":"span"},{"style":{"height":17.6},"width":275.35,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-7.png","element":"img","alt":" e = (u, v) ∈ E","inline":true,"padRight":true},{"text":"indicates that neuron ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u","element":"span"},{"text":"’s output is input to neuron ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":". Note that to each edge ","element":"span"},{"style":{"height":13.2},"width":105.83,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-8.png","element":"img","alt":" e ∈ E","inline":true,"padRight":true},{"text":"in the ","element":"span"},{"text":"NN, ","element":"span"},{"text":"a weight ","element":"span"},{"style":{"height":14.62},"width":151.41,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-9.png","element":"img","alt":" We ∈ R","inline":true,"padRight":true},{"text":"and a filter variable ","element":"span"},{"style":{"height":17.6},"width":287.1,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-10.png","element":"img","alt":" Fe ∈ {0, 1} are","inline":true,"padRight":true},{"text":"associated. We will write ","element":"span"},{"style":{"height":16.73},"width":186.17,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-11.png","element":"img","alt":" W = R|E| ","inline":true,"padRight":true},{"text":"for simplicity. 
For an arborescence ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":", we denote by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":") ","element":"span"},{"text":"the edge set of leaves. Let ","element":"span"},{"style":{"height":13.2},"width":228.32,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-12.png","element":"img","alt":" M > 2δ > 0","inline":true,"padRight":true},{"text":"be real numbers and suppose that we initialize the weights ","element":"span"},{"style":{"height":17.6},"width":159.84,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-13.png","element":"img","alt":" {We}e∈E","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"id":"id-68","style":{"width":"68%"},"width":1180,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/14-14.png","element":"img"}],[{"text":"The proof of Proposition ","element":"span"},{"href":"#id-31","text":"9 ","element":"a"},{"text":"is deferred to Appendix ","element":"span"},{"text":"I; ","element":"span"},{"text":"the proposition follows from our more ","element":"span"},{"id":"id-31","text":"general result in Proposition ","element":"span"},{"href":"#id-67","text":"13.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Proposition 9 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that the base graph ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is an arborescence of depth ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"|L","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"| ","element":"span"},{"style":{"fontStyle":"italic"},"text":"leaves, the activation function ","element":"span"},{"style":{"height":17.6},"width":150.38,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-0.png","element":"img","alt":" σ(t) = t","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is linear, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is independent of ","element":"span"},{"style":{"height":23.2},"width":435.31,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-1.png","element":"img","alt":" (X, Y ), and {W {0}e }e∈E","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is initialized according to ","element":"span"},{"href":"#id-68","text":"(28)","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
If the ","element":"span"},{"style":{"height":17.6},"width":146.69,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-2.png","element":"img","alt":" {Fe}e∈E","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"follow the distribution prescribed by ","element":"span"},{"text":"Dropconnect ","element":"span"},{"style":{"fontStyle":"italic"},"text":"or ","element":"span"},{"text":"Dropout","element":"span"},{"style":{"fontStyle":"italic"},"text":", then there exists ","element":"span"},{"style":{"height":12.4},"width":108.26,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-3.png","element":"img","alt":" α > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that the iterates of ","element":"span"},{"href":"#id-64","text":"(27) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"style":{"width":"81%"},"width":1417,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"with","element":"span"}],[{"id":"id-131","style":{"width":"65%"},"width":1128,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-5.png","element":"img"}],[{"id":"id-88","style":{"fontWeight":"bold"},"text":"4.1 Discussion","element":"span"}],[{"text":"In Proposition ","element":"span"},{"href":"#id-31","text":"9 ","element":"a"},{"text":"we consider the cases of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect","element":"span"},{"text":", in which nodes or edges are dropped with probability ","element":"span"},{"style":{"height":15.2},"width":101.53,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-6.png","element":"img","alt":" 1 − p","inline":true},{"text":", respectively. 
Observe that the convergence rate exponent depends on ","element":"span"},{"style":{"height":19.53},"width":757.03,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-7.png","element":"img","alt":" pL and (2δ2/M 2)2L where 2δ2/M 2 < 1","inline":true},{"text":"; see ","element":"span"},{"href":"#id-68","text":"(28)","element":"a"},{"text":". The first term in particular indicates that as the ","element":"span"},{"text":"NN ","element":"span"},{"text":"becomes deeper, the convergence rate exponent of ","element":"span"},{"text":"GD ","element":"span"},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect ","element":"span"},{"text":"will decrease by a factor ","element":"span"},{"style":{"height":18.73},"width":44.96,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-8.png","element":"img","alt":" pL","inline":true},{"text":". The second term ","element":"span"},{"style":{"height":19.53},"width":221.08,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-9.png","element":"img","alt":" (2δ2/M 2)2L","inline":true,"padRight":true},{"text":"shows the increased difficulty of training deeper ","element":"span"},{"text":"NNs ","element":"span"},{"text":"and has been observed e.g., by ","element":"span"},{"href":"#id-29","referenceIndex":41,"text":"Shamir ","element":"a"},{"text":"(","element":"span"},{"href":"#id-29","referenceIndex":41,"text":"2019","element":"a"},{"text":"); ","element":"span"},{"href":"#id-28","referenceIndex":1,"text":"Arora et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-28","referenceIndex":1,"text":"2019","element":"a"},{"text":"). 
The exponential dependence in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is moreover tight when using ","element":"span"},{"text":"GD ","element":"span"},{"text":"and is intrinsic to the method (","element":"span"},{"href":"#id-29","referenceIndex":41,"text":"Shamir","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":41,"text":"2019","element":"a"},{"text":"). Hence, dropout adds another exponential dependence to the convergence rate in arborescences, which is due to the stochastic nature of the algorithm. In Figure ","element":"span"},{"href":"#id-69","text":"2 ","element":"a"},{"text":"an experiment confirming this intuition on the convergence rate of dropout on a single path for different depths can be seen.","element":"span"}],[{"text":"Finally, our proofs of Proposition ","element":"span"},{"href":"#id-31","text":"9 ","element":"a"},{"text":"and the related more general result in Proposition ","element":"span"},{"href":"#id-67","text":"13 ","element":"a"},{"text":"below can be found in Appendix ","element":"span"},{"text":"H. ","element":"span"},{"text":"The proof strategy is to show that a ","element":"span"},{"text":"Polyak–Łojasiewicz ","element":"span"},{"text":"(PL) ","element":"span"},{"text":"inequality holds, which allows one to obtain convergence rates for ","element":"span"},{"text":"GD ","element":"span"},{"text":"on nonconvex functions (","element":"span"},{"href":"#id-70","referenceIndex":18,"text":"Karimi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-70","referenceIndex":18,"text":"2016","element":"a"},{"text":"). The new part of the argument is that we use conserved quantities and a double induction to identify a compact set in which the iterates remain and simultaneously a ","element":"span"},{"text":"PL ","element":"span"},{"text":"inequality holds. 
The method that we develop and which is sketched in the next subsection depends intricately on the arborescence structure and cannot be readily applied to other cases.","element":"span"}],[{"text":"To compare this result with more realistic models, we will examine the convergence rate of dropout in deep and wide ","element":"span"},{"text":"NNs ","element":"span"},{"text":"in Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"with a heuristic and experimental approach.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.2 Sketch of the proof","element":"span"}],[{"text":"Besides the previous notation, we need to introduce notation corresponding to subgraphs and paths. Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"be the set of all subgraphs of the base layered directed graph ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"vertices, and let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":") ","element":"span"},{"text":"be the set of edges of a subgraph ","element":"span"},{"style":{"height":22.1},"width":364.64,"height":55.26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-10.png","element":"img","alt":" g ∈ G. 
Let Γji(g; e)","inline":true,"padRight":true},{"text":"be defined as the set of ","element":"span"},{"text":"all paths in the directed graph ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"text":"that start at vertex ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":", traverse edge ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e","element":"span"},{"text":", and end at vertex ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":". If the origin or end vertices are in the input or output layer, the subscript or superscript is dropped from the notation, respectively. For every path ","element":"span"},{"style":{"height":17.6},"width":638.34,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-11.png","element":"img","alt":" γ ≜ (γ1, . . . , γL) ∈ Γ(g), we write","inline":true},{"style":{"height":21.05},"width":640.24,"height":52.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-12.png","element":"img","alt":"Pγ ≜ �e∈γ We and Fγ ≜ �e∈γ Fe","inline":true,"padRight":true},{"text":"for notational convenience. Finally, let ","element":"span"},{"style":{"height":17.6},"width":317.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-13.png","element":"img","alt":" GF ≜ (EF , V) be","inline":true,"padRight":true},{"text":"the random subgraph of base graph ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"that has edge set ","element":"span"},{"style":{"height":17.6},"width":402.58,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-14.png","element":"img","alt":" EF ≜ {e ∈ E|Fe = 1}","inline":true},{"text":". 
We denote ","element":"span"},{"style":{"height":21.56},"width":816.99,"height":53.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/15-15.png","element":"img","alt":"µg ≜ P[GF = g], and ηγ ≜ �{g∈G|γ∈Γ(g)} µg","inline":true},{"text":". We first provide an explicit characterization of ","element":"span"},{"text":"dropout’s risk function in ","element":"span"},{"href":"#id-24","text":"(6) ","element":"a"},{"text":"in terms of paths in the graph that describes the structure of","element":"span"}],[{"id":"id-69","style":{"width":"95%"},"width":1659,"height":1509,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/16-0.png","element":"img"}],[{"text":"Figure 2: The average loss depending on the number of steps of ","element":"figcaption","subtype":"caption"},{"text":"SGD ","element":"figcaption","subtype":"caption"},{"text":"with dropout of the function ","element":"figcaption","subtype":"caption"},{"style":{"height":22},"width":462.8,"height":55.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/16-1.png","element":"img","alt":" f(w) = (y − �Li=1 wix)2","inline":true,"padRight":true},{"text":"and its average convergence slope. ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(a) ","element":"figcaption","subtype":"caption"},{"text":"The average loss ","element":"figcaption","subtype":"caption"},{"text":"for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"L ","element":"figcaption","subtype":"caption"},{"text":"= 1","element":"figcaption","subtype":"caption"},{"text":". ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(b) ","element":"figcaption","subtype":"caption"},{"text":"The average loss for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"L ","element":"figcaption","subtype":"caption"},{"text":"= 3","element":"figcaption","subtype":"caption"},{"text":". 
","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(c) ","element":"figcaption","subtype":"caption"},{"text":"The average loss for ","element":"figcaption","subtype":"caption"},{"style":{"height":18},"width":442.38,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/16-2.png","element":"img","alt":" L = 5. (d) The slope β","inline":true,"padRight":true},{"text":"of the fit of ","element":"figcaption","subtype":"caption"},{"style":{"height":16.8},"width":245.07,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/16-3.png","element":"img","alt":" y = −βx + γ","inline":true,"padRight":true},{"text":"for the curves in (a), (b) and (c). The slopes ","element":"figcaption","subtype":"caption"},{"style":{"height":16.4},"width":26,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/16-4.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"for a given ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"l ","element":"figcaption","subtype":"caption"},{"text":"have been normalized at ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p ","element":"figcaption","subtype":"caption"},{"text":"= 1 ","element":"figcaption","subtype":"caption"},{"text":"for comparison across depths ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"L","element":"figcaption","subtype":"caption"},{"text":". Note that for larger ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"L","element":"figcaption","subtype":"caption"},{"text":", the effect of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p ","element":"figcaption","subtype":"caption"},{"text":"becomes also more pronounced. 
This is in agreement with the conclusion in Section ","element":"figcaption","subtype":"caption"},{"text":"4, ","element":"span","subtype":"caption"},{"text":"where we expect a convergence rate depending on ","element":"figcaption","subtype":"caption"},{"style":{"height":18.73},"width":44.95,"height":46.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/16-5.png","element":"img","alt":" pL","inline":true},{"text":". In this case, other effects of depth are also observed, such as a dependence on the initialization.","element":"figcaption","subtype":"caption"}],[{"text":"the ","element":"span"},{"text":"NN. ","element":"span"},{"text":"This is possible since we assume linear activation functions. The following lemma now holds, and is proved in Appendix ","element":"span"},{"text":"F.","element":"span"}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"Lemma 10 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that the base graph ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is a fixed, directed graph without cycles in which all paths have length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and there are ","element":"span"},{"style":{"height":15.1},"width":45.71,"height":37.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-0.png","element":"img","alt":" dL","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"output nodes (N6’), that ","element":"span"},{"style":{"height":17.6},"width":150.38,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-1.png","element":"img","alt":" σ(t) = t","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"(N7), and that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is independent 
of ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X, Y ","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(N8). Then","element":"span"}],[{"id":"id-120","style":{"width":"74%"},"width":1281,"height":143,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Moreover ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") + ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", where","element":"span"}],[{"style":{"width":"94%"},"width":1633,"height":271,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Here, the constants ","element":"span"},{"style":{"height":13.42},"width":107.67,"height":33.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-4.png","element":"img","alt":" ηγ, µγ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"depend explicitly on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"’s distribution and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NN’","element":"span"},{"style":{"fontStyle":"italic"},"text":"s architecture.","element":"span"}],[{"text":"Note that Lemma ","element":"span"},{"href":"#id-71","text":"10 ","element":"a"},{"text":"essentially changes variables to 
rewrite the dropout risk function as a sum over paths instead of a sum over graphs. This representation allows us to clearly identify the regularization term ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":")","element":"span"},{"text":". For example, in the case of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect ","element":"span"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":47,"text":"Wan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":47,"text":"2013","element":"a"},{"text":"), where the filter variables ","element":"span"},{"style":{"height":17.6},"width":146.69,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-5.png","element":"img","alt":" {Fe}e∈E","inline":true,"padRight":true},{"text":"are independent random variables with distribution ","element":"span"},{"text":"Bernoulli(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":")","element":"span"},{"text":", Lemma ","element":"span"},{"href":"#id-71","text":"10 ","element":"a"},{"text":"holds with ","element":"span"},{"style":{"height":20.95},"width":550.7,"height":52.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-6.png","element":"img","alt":" µg = p^{|E(g)|}(1 − p)^{|E(G)|−|E(g)|}","inline":true},{"text":". 
Also note that if for all subgraphs ","element":"span"},{"style":{"height":16.4},"width":102.74,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-7.png","element":"img","alt":" g ∈ G","inline":true,"padRight":true},{"text":"and vertices ","element":"span"},{"style":{"height":17.6},"width":115.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-8.png","element":"img","alt":" i ∈ [d]","inline":true,"padRight":true},{"text":"the number of paths that end at ","element":"span"},{"style":{"height":19.13},"width":415.61,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-9.png","element":"img","alt":" i satisfies |Γi(g)| = 1 ,","inline":true,"padRight":true},{"text":"such as when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"is an arborescence, then for all subgraphs ","element":"span"},{"style":{"height":17.6},"width":603.36,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-10.png","element":"img","alt":" g ∈ G and paths γ ∈ Γ(g) there","inline":true,"padRight":true},{"text":"is only one path ending at a leaf node ","element":"span"},{"style":{"height":17.6},"width":487.96,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-11.png","element":"img","alt":" γL, that is, ΓγL(g) = {γ}.","inline":true}],[{"text":"We now focus on a base graph that is an arborescence of arbitrary depth; see Figure ","element":"span"},{"href":"#id-3","text":"1c. ","element":"a"},{"text":"Hence we now replace (N6’) in Lemma ","element":"span"},{"href":"#id-71","text":"10","element":"a"},{"text":", which assumes a generic graph, by assumption (N6), where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"is specifically an arborescence. 
The following specification of Corollary ","element":"span"},{"href":"#id-72","text":"11 ","element":"a"},{"text":"is also proven in Appendix ","element":"span"},{"text":"F.","element":"span"}],[{"id":"id-72","style":{"fontWeight":"bold"},"text":"Corollary 11 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that the base graph ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is an arborescence of depth ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(N6), and (N7)– (N8) from Lemma ","element":"span"},{"href":"#id-71","style":{"fontStyle":"italic"},"text":"10. ","element":"a"},{"style":{"fontStyle":"italic"},"text":"Then ","element":"span"},{"style":{"height":18.73},"width":639.34,"height":46.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-12.png","element":"img","alt":" D(W) = I(W) + D(W opt), where","inline":true}],[{"style":{"width":"77%"},"width":1345,"height":234,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":21.85},"width":1005.74,"height":54.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-14.png","element":"img","alt":" νγ ≜ ηγE[X2γ0], zγ ≜ E[YγLXγ0]/E[X2γ0] for γ ∈ Γ(G)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Consequently, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") = 0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arborescence.","element":"span"}],[{"text":"The convergence result we are about to show uses the fact that for the system of ","element":"span"},{"text":"ODEs ","element":"span"},{"style":{"height":17.6},"width":431.25,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/17-15.png","element":"img","alt":"dW/ dt = −∇W D(W)","inline":true,"padRight":true},{"text":"there are conserved quantities. Within the proof, these conserved quantities have the crucial role of guaranteeing compactness for the iterates. Specifically, let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":") ","element":"span"},{"text":"denote the leaves of the subtree of ","element":"span"},{"style":{"height":16.4},"width":102.74,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-0.png","element":"img","alt":" g ∈ G","inline":true,"padRight":true},{"text":"rooted at a vertex ","element":"span"},{"style":{"height":17.6},"width":162.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-1.png","element":"img","alt":" f ∈ E(g)","inline":true},{"text":", and define the set of leaves of ","element":"span"},{"style":{"height":18.44},"width":513.22,"height":46.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-2.png","element":"img","alt":" G as L(G) ≜ ∪f∈EL(G; f)","inline":true},{"text":". 
We remark that in the previous notation ","element":"span"},{"style":{"height":17.6},"width":484.22,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-3.png","element":"img","alt":"dL = |L(G)|. For W ∈ W","inline":true,"padRight":true},{"text":"and each non-leaf edge ","element":"span"},{"style":{"height":17.6},"width":226.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-4.png","element":"img","alt":" f ∈ E\\L(G)","inline":true},{"text":", define the quantity","element":"span"}],[{"id":"id-74","style":{"width":"68%"},"width":1182,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-5.png","element":"img"}],[{"text":"Define ","element":"span"},{"style":{"height":25.55},"width":1122.54,"height":63.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-6.png","element":"img","alt":" Cmin ≜ mine∈E\\L(G) Ce and C{t}e = Ce(W {t}) for t ∈ N+","inline":true,"padRight":true},{"text":"as well; we require both later. 
Lemma ","element":"span"},{"href":"#id-73","text":"12 ","element":"a"},{"text":"now proves that the function ","element":"span"},{"style":{"height":17.64},"width":50.19,"height":44.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-7.png","element":"img","alt":" Cf","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-74","text":"(35) ","element":"a"},{"text":"is a conserved quantity; the proof is in Appendix ","element":"span"},{"text":"G.","element":"span"}],[{"id":"id-73","style":{"fontWeight":"bold"},"text":"Lemma 12 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (N2) from Proposition ","element":"span"},{"href":"#id-22","style":{"fontStyle":"italic"},"text":"6, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"(N6) from Corollary ","element":"span"},{"href":"#id-72","style":{"fontStyle":"italic"},"text":"11 ","element":"a"},{"style":{"fontStyle":"italic"},"text":", (N7), (N8) from Lemma ","element":"span"},{"href":"#id-71","style":{"fontStyle":"italic"},"text":"10. ","element":"a"},{"style":{"fontStyle":"italic"},"text":"Then under the negative gradient flow ","element":"span"},{"style":{"height":17.6},"width":393.33,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-8.png","element":"img","alt":" dW/ dt = −∇D(W),","inline":true}],[{"id":"id-126","style":{"width":"54%"},"width":939,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":17.6},"width":239.49,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-10.png","element":"img","alt":" f ∈ E\\L(G).","inline":true}],[{"text":"We are almost in a position to state our second result, but still need to introduce some notation. 
We define the following constants","element":"span"}],[{"style":{"width":"80%"},"width":1387,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-11.png","element":"img"}],[{"text":"for notational convenience. Also, for ","element":"span"},{"style":{"height":15.6},"width":407.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-12.png","element":"img","alt":" 0 < δ < M, we define","inline":true}],[{"style":{"width":"94%"},"width":1637,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-13.png","element":"img"}],[{"text":"a bounded set of parameters in which any weight associated with a leaf is furthermore bounded away from zero. Finally, let","element":"span"}],[{"style":{"width":"89%"},"width":1556,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-14.png","element":"img"}],[{"text":"denote the set of all weight parameters that are ","element":"span"},{"style":{"height":8.4},"width":21,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-15.png","element":"img","alt":" ε","inline":true},{"text":"-close to a critical point and for which the conserved quantities in ","element":"span"},{"href":"#id-74","text":"(35) ","element":"a"},{"text":"deviate by no more than ","element":"span"},{"style":{"height":26.84},"width":155.58,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-16.png","element":"img","alt":" O(C{0}f )","inline":true,"padRight":true},{"text":"from their initial value ","element":"span"},{"style":{"height":26.84},"width":99.11,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-17.png","element":"img","alt":" C{0}f .","inline":true,"padRight":true},{"text":"These deviations are made explicit by the 
intervals","element":"span"}],[{"style":{"width":"97%"},"width":1683,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-18.png","element":"img"}],[{"text":"Our proof shows that the iterates ","element":"span"},{"style":{"height":20.95},"width":194.58,"height":52.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-19.png","element":"img","alt":" {W {t}}t≥0","inline":true,"padRight":true},{"text":"stay in the intersection ","element":"span"},{"style":{"height":17.6},"width":396.94,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/18-20.png","element":"img","alt":" S ∩ B(ε, I), and this","inline":true,"padRight":true},{"text":"implies that the weights (including those associated with the leaves) remain bounded. The ","element":"span"},{"id":"id-67","text":"following now holds, and its proof can be found in Appendix ","element":"span"},{"text":"H.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition 13 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (N2) from Proposition ","element":"span"},{"href":"#id-22","style":{"fontStyle":"italic"},"text":"6, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"(N6) from Corollary ","element":"span"},{"href":"#id-72","style":{"fontStyle":"italic"},"text":"11, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"(N7)–(N8) from Lemma ","element":"span"},{"href":"#id-71","style":{"fontStyle":"italic"},"text":"10, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"that ","element":"span"},{"style":{"height":21.35},"width":1069.25,"height":53.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-0.png","element":"img","alt":" W {0} ∈ S ∩ B(ϵ, I) and ML ≥ |zγ| for all γ ∈ Γ(G)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"(N9), 
that","element":"span"}],[{"href":"#id-71","style":{"height":21.95},"width":542.7,"height":54.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-1.png","element":"img","alt":"2Cmin(W {0}) > δ2 (N10). If","inline":true}],[{"id":"id-125","style":{"width":"97%"},"width":1679,"height":186,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"then the iterates of ","element":"span"},{"href":"#id-64","text":"(27) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"style":{"width":"99%"},"width":1724,"height":165,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-3.png","element":"img"}],[{"text":"Proposition ","element":"span"},{"href":"#id-67","text":"13 ","element":"a"},{"text":"identifies explicitly how the convergence rate of ","element":"span"},{"text":"GD ","element":"span"},{"text":"on a dropout risk function depends on the dropout algorithm and the structure of the arborescence: parameters such as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|L","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":", L ","element":"span"},{"text":"are implicitly present in the constants ","element":"span"},{"style":{"height":17.6},"width":414.32,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-4.png","element":"img","alt":" νmin and ∥ν∥1 in α, τ.","inline":true}],[{"text":"Note that Assumptions (N9)–(N10) are relatively benign. 
These assumptions are for example satisfied when initializing ","element":"span"},{"style":{"height":23.2},"width":675.57,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-5.png","element":"img","alt":" M > W {0}e >√2δ for e ∈ E\\L(G)","inline":true,"padRight":true},{"text":"and setting ","element":"span"},{"style":{"height":17.6},"width":128.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-6.png","element":"img","alt":" |Wl| ≤","inline":true},{"style":{"height":21.01},"width":863.56,"height":52.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-7.png","element":"img","alt":"δ/√|L(G)| for all l ∈ L(G) and ϵ = I(W {0})","inline":true},{"text":", which we assume in Proposition ","element":"span"},{"href":"#id-31","text":"9. ","element":"a"},{"text":"In other words, this initialization sets the weights that are associated with leaves small compared to all other weights.","element":"span"}]]},{"heading":"5. Effect of dropout on the convergence rate in wider networks","paragraphs":[[{"text":"In Proposition ","element":"span"},{"href":"#id-67","text":"13, ","element":"a"},{"text":"we have proven that the convergence rate depends on ","element":"span"},{"style":{"height":18.74},"width":44.96,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-8.png","element":"img","alt":" pL ","inline":true,"padRight":true},{"text":"for ","element":"span"},{"text":"NNs ","element":"span"},{"text":"shaped like arborescences. 
Let ","element":"span"},{"style":{"height":15.02},"width":90.72,"height":37.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-9.png","element":"img","alt":" Gtree","inline":true,"padRight":true},{"text":"be a tree and ","element":"span"},{"style":{"height":17.6},"width":232.33,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-10.png","element":"img","alt":" e ∈ E(Gtree)","inline":true,"padRight":true},{"text":"be an edge. Denote by ","element":"span"},{"style":{"height":20.33},"width":190.51,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-11.png","element":"img","alt":" Γ[t](e) the","inline":true,"padRight":true},{"text":"set of paths passing through ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e ","element":"span"},{"text":"that are not filtered by dropout at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". We observe that at any given time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"of dropout ","element":"span"},{"text":"SGD,","element":"span"}],[{"style":{"width":"71%"},"width":1235,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-12.png","element":"img"}],[{"text":"If we denote by ","element":"span"},{"style":{"height":20.38},"width":393.81,"height":50.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-13.png","element":"img","alt":" tupdate(Gtree) = 1/pL ","inline":true,"padRight":true},{"text":"the average update time for a weight in ","element":"span"},{"style":{"height":15.6},"width":202,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-14.png","element":"img","alt":" Gtree, then","inline":true,"padRight":true},{"text":"we need 
","element":"span"},{"style":{"height":19.53},"width":88.59,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-15.png","element":"img","alt":" 1/pL ","inline":true,"padRight":true},{"text":"more time on average for a given edge to be updated than when we do not use dropout. For wider networks ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":", however, edges can be updated simultaneously and repeatedly via different available paths. By the previous intuition we might still expect that, if the updates are sufficiently independent, the convergence rate depends approximately on ","element":"span"},{"style":{"height":18.44},"width":161.84,"height":46.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-16.png","element":"img","alt":"1/tupdate","inline":true},{"text":". In order to verify this intuition we will determine ","element":"span"},{"style":{"height":16.44},"width":118.2,"height":41.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-17.png","element":"img","alt":" tupdate","inline":true,"padRight":true},{"text":"for ","element":"span"},{"text":"NNs ","element":"span"},{"text":"that are much wider than deep, and later simulate their convergence rates also in realistic settings.","element":"span"}],[{"text":"Suppose now that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"is a graph of a fully-connected ","element":"span"},{"text":"NN ","element":"span"},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"dropout layers each of which has width ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". 
For each of the vertices ","element":"span"},{"style":{"height":13.2},"width":113.31,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-18.png","element":"img","alt":" u ∈ G","inline":true,"padRight":true},{"text":"in a dropout layer, there is an associated dropout filter variable ","element":"span"},{"style":{"height":17.6},"width":544.99,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-19.png","element":"img","alt":" Fu ∼i.i.d. Ber(p) where p > 0","inline":true,"padRight":true},{"text":"is fixed. That is, we use ","element":"span"},{"style":{"fontStyle":"italic"},"text":"dropout","element":"span"},{"text":". Note that any other additional input or output layer without filters only changes the number of paths by a multiplicative factor. Hence, we will restrict to the case that all nodes in the layers have filter variables. In this case, we may consider a path ","element":"span"},{"style":{"height":17.6},"width":475.94,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-20.png","element":"img","alt":" γ = (u1, . . . , uL) as a set","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"vertices—one for each dropout layer—instead of edges. For two paths ","element":"span"},{"style":{"height":16},"width":234.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/19-21.png","element":"img","alt":" γ and δ, we","inline":true,"padRight":true},{"text":"consider their intersection ","element":"span"},{"style":{"height":16},"width":96.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-0.png","element":"img","alt":" γ ∩ δ","inline":true,"padRight":true},{"text":"as the subset of vertices belonging to both paths. 
Hence, ","element":"span"},{"style":{"height":17.6},"width":189.97,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-1.png","element":"img","alt":"|γ ∩ δ| = l","inline":true,"padRight":true},{"text":"implies that the intersection has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"vertices, not necessarily forming a path.","element":"span"}],[{"text":"We remark that we can restrict to the case ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L > ","element":"span"},{"text":"2","element":"span"},{"text":". In the case of one dropout layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"= 1","element":"span"},{"text":", an edge ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e ","element":"span"},{"text":"= (","element":"span"},{"style":{"fontStyle":"italic"},"text":"u, v","element":"span"},{"text":") ","element":"span"},{"text":"connected to a dropout node ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"is updated if and only if the filter ","element":"span"},{"style":{"height":15.6},"width":442.82,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-2.png","element":"img","alt":"Fu = 1, where u ∈ G","inline":true,"padRight":true},{"text":"is the vertex adjacent to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e ","element":"span"},{"text":"with a dropout filter, so that in this case ","element":"span"},{"style":{"height":23.2},"width":1137.71,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-3.png","element":"img","alt":" P[w[t]e is updated] = 1 − p. 
For L = 2, an edge e = (u, v)","inline":true,"padRight":true},{"text":"is updated if and only if ","element":"span"},{"style":{"height":23.2},"width":923.56,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-4.png","element":"img","alt":"Fu = Fv = 1, so that P[w[t]e is updated] = 1 − p2","inline":true},{"text":". Recall that we denote by ","element":"span"},{"style":{"height":17.6},"width":81.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-5.png","element":"img","alt":" Γ(e)","inline":true,"padRight":true},{"text":"the set of ","element":"span"},{"text":"paths ","element":"span"},{"style":{"height":16},"width":122.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-6.png","element":"img","alt":" γ of G","inline":true,"padRight":true},{"text":"passing through ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e","element":"span"},{"text":". For a path ","element":"span"},{"style":{"height":17.6},"width":159.91,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-7.png","element":"img","alt":" γ ∈ Γ(e)","inline":true},{"text":", in the following, we let ","element":"span"},{"style":{"height":21.05},"width":267.62,"height":52.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-8.png","element":"img","alt":" Fγ = ∏u∈γ Fu","inline":true,"padRight":true},{"text":"be the indicator of a path not being filtered. Thus, ","element":"span"},{"style":{"height":17.42},"width":215.88,"height":43.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-9.png","element":"img","alt":" Fγ is 1 if γ","inline":true,"padRight":true},{"text":"is not filtered and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. 
We will use Greek letters for paths and Latin letters for vertices when referring to filters ","element":"span"},{"style":{"height":17.42},"width":47.06,"height":43.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-10.png","element":"img","alt":" Fγ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.62},"width":48.06,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-11.png","element":"img","alt":" Fu","inline":true,"padRight":true},{"text":"respectively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 14 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be a graph of a fully-connected ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NN ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L > ","element":"span"},{"text":"2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"dropout layers, each with the same width ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and with dropout filters ","element":"span"},{"style":{"height":16},"width":291.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-12.png","element":"img","alt":" Fu for u ∈ G.","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"For an edge ","element":"span"},{"style":{"height":17.6},"width":202.31,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-13.png","element":"img","alt":" e ∈ E(G),","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"let 
","element":"span"},{"style":{"height":21.45},"width":375.11,"height":53.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-14.png","element":"img","alt":" FΓ(e) = �γ∈Γ(e) Fγ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"denote the random variable that counts the number of nonfiltered ","element":"span"},{"style":{"fontStyle":"italic"},"text":"traversing paths through ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e","element":"span"},{"style":{"fontStyle":"italic"},"text":". If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L, p ","element":"span"},{"style":{"fontStyle":"italic"},"text":"are fixed, then as ","element":"span"},{"style":{"height":15.2},"width":161.86,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-15.png","element":"img","alt":" D → ∞,","inline":true}],[{"id":"id-78","style":{"width":"67%"},"width":1166,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-16.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"We will use the Paley–Zygmund inequality. 
For a nonnegative random variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"with finite second moment, for any ","element":"span"},{"style":{"height":17.6},"width":184,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-17.png","element":"img","alt":" θ ∈ (0, 1),","inline":true}],[{"id":"id-75","style":{"width":"66%"},"width":1152,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-18.png","element":"img"}],[{"text":"We will use ","element":"span"},{"href":"#id-75","text":"(45) ","element":"a"},{"text":"with the random variable ","element":"span"},{"style":{"height":18.75},"width":91.1,"height":46.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-19.png","element":"img","alt":" FΓ(e)","inline":true},{"text":". The idea is that if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"is much larger than ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":", the average number of paths passing through ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e ","element":"span"},{"text":"is also large. We are using dropout, so the filter variable corresponding to an edge ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e ","element":"span"},{"text":"= (","element":"span"},{"style":{"fontStyle":"italic"},"text":"u, v","element":"span"},{"text":") ","element":"span"},{"text":"will depend on only the vertex ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u","element":"span"},{"text":", that is, ","element":"span"},{"style":{"height":14.62},"width":160.14,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-20.png","element":"img","alt":" Fe = Fu","inline":true},{"text":". 
For counting paths we also need to take into account that the filter ","element":"span"},{"style":{"height":14.62},"width":45.06,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-21.png","element":"img","alt":" Fv","inline":true,"padRight":true},{"text":"will occur in all paths passing through ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e","element":"span"},{"text":". Since only the two vertices ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v ","element":"span"},{"text":"of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e ","element":"span"},{"text":"are fixed we can compute","element":"span"}],[{"id":"id-77","style":{"width":"74%"},"width":1287,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-22.png","element":"img"}],[{"text":"We define the set of broken paths in ","element":"span"},{"style":{"height":17.6},"width":134.68,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-23.png","element":"img","alt":" Γ(e) as","inline":true}],[{"style":{"width":"81%"},"width":1410,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-24.png","element":"img"}],[{"text":"that is, ","element":"span"},{"style":{"height":17.6},"width":183.26,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-25.png","element":"img","alt":" γ ∈ Γb(e)","inline":true,"padRight":true},{"text":"if and only if there exist ","element":"span"},{"style":{"height":17.6},"width":602.42,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-26.png","element":"img","alt":" η, δ ∈ Γ(e) such that γ = η ∩ δ","inline":true},{"text":". 
In particular, ","element":"span"},{"style":{"height":17.6},"width":98.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-27.png","element":"img","alt":"Γb(e)","inline":true,"padRight":true},{"text":"contains paths and unions of vertices of paths that pass through ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e","element":"span"},{"text":". Then we have:","element":"span"}],[{"style":{"width":"85%"},"width":1484,"height":176,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/20-28.png","element":"img"}],[{"id":"id-76","style":{"width":"77%"},"width":1341,"height":632,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-0.png","element":"img"}],[{"text":"where (i) we have first used that ","element":"span"},{"style":{"height":17.42},"width":47.06,"height":43.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-1.png","element":"img","alt":" Fγ","inline":true,"padRight":true},{"text":"are indicators of the occurrence of the paths ","element":"span"},{"style":{"height":17.6},"width":161.88,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-2.png","element":"img","alt":" γ ∈ Γ(e)","inline":true,"padRight":true},{"text":"and that ","element":"span"},{"style":{"height":14.8},"width":107.63,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-3.png","element":"img","alt":"l ≥ 2","inline":true,"padRight":true},{"text":"since vertices ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v ","element":"span"},{"text":"are shared among all paths in ","element":"span"},{"style":{"height":17.6},"width":81.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-4.png","element":"img","alt":" Γ(e)","inline":true},{"text":"; secondly, that we have separated the sum over paths into a path 
","element":"span"},{"style":{"height":11.6},"width":24,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-5.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"and all other paths ","element":"span"},{"style":{"height":12.8},"width":20,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-6.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"that coincide in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"vertices. In (ii) we have computed the probability by noting that for ","element":"span"},{"style":{"height":17.6},"width":614.82,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-7.png","element":"img","alt":" γ and δ such that |γ ∩δ| = l ≥ 2,","inline":true},{"style":{"height":20.55},"width":364.6,"height":51.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-8.png","element":"img","alt":"E[FγFδ] = plp2L−2l","inline":true},{"text":", where the term ","element":"span"},{"style":{"height":18.73},"width":31.96,"height":46.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-9.png","element":"img","alt":" pl ","inline":true,"padRight":true},{"text":"accounts for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"shared filters corresponding to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"shared vertices and ","element":"span"},{"style":{"height":18.73},"width":115.21,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-10.png","element":"img","alt":" p2L−2l ","inline":true,"padRight":true},{"text":"for the remaining products of filters. Note that we have used the independence assumption for filters here. 
(iii) We have used here that ","element":"span"},{"style":{"height":17.6},"width":359.25,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-11.png","element":"img","alt":" η = δ ∩ γ ∈ Γb(e),","inline":true,"padRight":true},{"text":"so that we can separate the previous sum into first, fixing the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"vertices where two paths intersect—including ","element":"span"},{"style":{"height":17.6},"width":683.64,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-12.png","element":"img","alt":" e—with η ∈ Γb(e) such that |η| = l","inline":true},{"text":", and then looking for all possible ","element":"span"},{"style":{"height":17.6},"width":574.02,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-13.png","element":"img","alt":"δ, γ ∈ Γ(e) such that γ ∩ δ = η","inline":true},{"text":". For (iv) we fix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"vertices where ","element":"span"},{"style":{"height":16},"width":142.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-14.png","element":"img","alt":" γ and δ","inline":true,"padRight":true},{"text":"coincide, then there are still ","element":"span"},{"style":{"height":19.53},"width":280.41,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-15.png","element":"img","alt":" (D(D − 1))L−l ","inline":true,"padRight":true},{"text":"possible ordered vertex pairs to choose from all the other vertices where ","element":"span"},{"style":{"height":16},"width":154.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-16.png","element":"img","alt":" γ and δ","inline":true,"padRight":true},{"text":"do not coincide. 
(v) For the remaining sum, for each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"fixed locations (including the vertices of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e","element":"span"},{"text":", which are fixed) we can still choose ","element":"span"},{"style":{"height":15.13},"width":91.18,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-17.png","element":"img","alt":" Dl−2 ","inline":true,"padRight":true},{"text":"remaining possible vertices. Additionally, there are for each ","element":"span"},{"style":{"height":22.56},"width":413.56,"height":56.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-18.png","element":"img","alt":" l, (L−2 choose l−2) distinct l − 2","inline":true,"padRight":true},{"text":"locations for these vertices. Hence, plugging ","element":"span"},{"href":"#id-76","text":"(52) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-77","text":"(46) ","element":"a"},{"text":"into ","element":"span"},{"href":"#id-75","text":"(45) ","element":"a"},{"text":"yields","element":"span"}],[{"style":{"width":"85%"},"width":1471,"height":297,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-19.png","element":"img"}],[{"text":"In particular, setting ","element":"span"},{"style":{"height":18.73},"width":297.68,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-20.png","element":"img","alt":" θ−1 = 2pLDL−2 ","inline":true,"padRight":true},{"text":"and computing the higher order noting that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L > ","element":"span"},{"text":"2","element":"span"},{"text":", we obtain that","element":"span"}],[{"style":{"width":"67%"},"width":1166,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-21.png","element":"img"}],[{"text":"or alternatively noting that 
","element":"span"},{"style":{"height":19.95},"width":1045.33,"height":49.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-22.png","element":"img","alt":" {FΓ(e) ≤ 1/2} = {FΓ(e) = 0}, since FΓ(e) ∈ N we obtain","inline":true}],[{"style":{"width":"68%"},"width":1181,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-23.png","element":"img"}],[{"text":"Finally note that ","element":"span"},{"style":{"height":21.48},"width":436.86,"height":53.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-24.png","element":"img","alt":" 1 − p2 ≤ P[FΓ(e) = 0]","inline":true,"padRight":true},{"text":"since the edge ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e ","element":"span"},{"text":"can be present in a path only if the filters at both vertices of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e ","element":"span"},{"text":"have value ","element":"span"},{"text":"1","element":"span"},{"text":", which occurs with probability ","element":"span"},{"style":{"height":18.33},"width":207.73,"height":45.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-25.png","element":"img","alt":" p2, so that","inline":true},{"style":{"height":21.49},"width":335.24,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/21-26.png","element":"img","alt":"P[FΓ(e) > 0] < p2.","inline":true}],[{"text":"Note that in the proof of Lemma ","element":"span"},{"href":"#id-78","text":"14 ","element":"a"},{"text":"we can recover the scaling ","element":"span"},{"style":{"height":18.73},"width":44.96,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-0.png","element":"img","alt":" pL ","inline":true,"padRight":true},{"text":"that we have seen in Proposition ","element":"span"},{"href":"#id-67","text":"13 ","element":"a"},{"text":"by setting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D 
","element":"span"},{"text":"= 1 ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-76","text":"(52) ","element":"a"},{"text":"and in ","element":"span"},{"href":"#id-76","text":"(50)","element":"a"},{"text":".","element":"span"}],[{"text":"From Lemma ","element":"span"},{"href":"#id-78","text":"14 ","element":"a"},{"text":"we expect that for a wide network with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"layers where ","element":"span"},{"style":{"height":13.6},"width":279.35,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-1.png","element":"img","alt":" D ≫ L and an","inline":true,"padRight":true},{"text":"edge ","element":"span"},{"style":{"height":17.6},"width":168.86,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-2.png","element":"img","alt":" e ∈ E(G)","inline":true},{"text":", we have that","element":"span"}],[{"id":"id-79","style":{"width":"69%"},"width":1197,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-3.png","element":"img"}],[{"text":"If the convergence rate is related to the update rule, then we would expect that for a wide network the rate would be independent of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"which is different from the path network considered in Proposition ","element":"span"},{"href":"#id-67","text":"13. ","element":"a"},{"text":"In the next section we will verify this intuition on real datasets. Note, however, that we do not expect to see the dependence on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"as shown in ","element":"span"},{"href":"#id-79","text":"(58)","element":"a"},{"text":": this heuristic argument provides only the rate at which a weight is updated, and stochastic averaging is not solely driving the convergence rate. 
In particular, from an example for wide shallow linear networks in ","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"Senen-Cerda and Sanders ","element":"a"},{"text":"(","element":"span"},{"href":"#id-15","referenceIndex":40,"text":"2022","element":"a"},{"text":"), close to a critical point of a dropout ","element":"span"},{"text":"ODE, ","element":"span"},{"text":"the dependence scales with a factor ","element":"span"},{"style":{"height":17.6},"width":151.9,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-4.png","element":"img","alt":" p(1 − p)","inline":true,"padRight":true},{"text":"instead of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":". This is due to the fact that for larger ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", there are regions of the landscape close to minima that become flat, as also hinted by Proposition ","element":"span"},{"href":"#id-25","text":"7. ","element":"a"},{"text":"Indeed, when ","element":"span"},{"style":{"height":17.6},"width":496.08,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-5.png","element":"img","alt":" p ↑ 1 the term (1−p)J ↓ 0","inline":true,"padRight":true},{"text":"in the convergence rate of Proposition ","element":"span"},{"href":"#id-25","text":"7 ","element":"a"},{"text":"lowers the complexity of finding an ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-6.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationary point. 
Hence, there are landscape regimes and initialization issues that also account for the convergence rate in ","element":"span"},{"text":"NNs.","element":"span"}],[{"id":"id-61","style":{"fontWeight":"bold"},"text":"5.1 Numerical Experiments","element":"span"}],[{"text":"In this section we run the dropout stochastic gradient descent algorithm numerically,","element":"span"},{"style":{"height":8.8},"width":17,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-7.png","element":"img","alt":"4","inline":true,"padRight":true},{"text":"for different datasets and network architectures. We measure the convergence rate for different widths ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", depths ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":", and dropout probabilities ","element":"span"},{"style":{"height":15.2},"width":99.77,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-8.png","element":"img","alt":" 1 − p","inline":true},{"text":". We then compare these measurements to the bounds on the convergence rates obtained in Section ","element":"span"},{"text":"4. ","element":"span"},{"text":"We use ","element":"span"},{"style":{"height":18.33},"width":228.04,"height":45.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-9.png","element":"img","alt":" Tensorflow5","inline":true,"padRight":true},{"text":"for the implementation.","element":"span"}],[{"text":"5.1.1 Setup","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Datasets. 
","element":"span"},{"text":"We will consider three commonly used datasets of images: the ","element":"span"},{"text":"MNIST","element":"span"},{"style":{"height":19.13},"width":176.45,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-10.png","element":"img","alt":"6 (LeCun","inline":true,"padRight":true},{"href":"#id-80","referenceIndex":26,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-80","referenceIndex":26,"text":"2010","element":"a"},{"text":"), ","element":"span"},{"text":"CIFAR-","element":"span"},{"text":"100-fine","element":"span"},{"style":{"height":8.8},"width":17,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-11.png","element":"img","alt":"7","inline":true},{"text":", and ","element":"span"},{"text":"CIFAR-","element":"span"},{"text":"100-coarse datasets (","element":"span"},{"href":"#id-81","referenceIndex":23,"text":"Krizhevsky","element":"a"},{"text":", ","element":"span"},{"href":"#id-81","referenceIndex":23,"text":"2009","element":"a"},{"text":").","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"NN ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Architecture. ","element":"span"},{"text":"We use as a base architecture a LeNet with 11 layers where the two dense layers have been substituted with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"fully-connected ","element":"span"},{"text":"ReLU ","element":"span"},{"text":"layers of width ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". Each of these layers has dropout with dropout probability ","element":"span"},{"style":{"height":15.2},"width":105.84,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/22-12.png","element":"img","alt":" 1 − p","inline":true},{"text":". 
While larger networks are commonly used in practice, a LeNet architecture is sufficient to test the effect of dropout on the convergence rate, as we verify with the simulations.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Loss. ","element":"span"},{"text":"We use the cross-entropy loss, which is commonly used for classification. For two distributions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"with support on ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"] ","element":"span"},{"text":"labels, the cross-entropy loss is defined as","element":"span"}],[{"style":{"width":"62%"},"width":1088,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Stopping criteria. ","element":"span"},{"text":"In all experiments, we stop after ","element":"span"},{"text":"40 ","element":"span"},{"text":"epochs.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Initialization. ","element":"span"},{"text":"In order to see the convergence rate close to a minimum, 
","element":"span"},{"text":"we first use a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Gaussian initialization","element":"span"},{"text":", that is, we set every weight on the dense layers to ","element":"span"},{"style":{"height":17.24},"width":149.5,"height":43.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-1.png","element":"img","alt":" Wijk ∼","inline":true},{"style":{"height":19.98},"width":334.46,"height":49.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-2.png","element":"img","alt":"Normal(0, 1/√D)","inline":true,"padRight":true},{"text":"independently, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"is the width of the layer. While this initialization is standard, we note that we cannot expect to compare convergence rates for different numbers of layers ","element":"span"},{"style":{"height":17.6},"width":243.92,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-3.png","element":"img","alt":" L ∈ {1, 2, 3}","inline":true,"padRight":true},{"text":"and for different dropout probabilities ","element":"span"},{"style":{"height":15.2},"width":114.24,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-4.png","element":"img","alt":" 1 − p,","inline":true,"padRight":true},{"text":"since the loss functions are also different. In the course of our experiments, we found that there are also many saddle points where ","element":"span"},{"text":"SGD ","element":"span"},{"text":"remains stuck, which complicated the estimation of the convergence rate. 
In order to start approximately at the same neighborhood where the iterates stay and continuously track minima across different choices of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", for each ","element":"span"},{"style":{"height":17.6},"width":244.89,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-5.png","element":"img","alt":"L ∈ {1, 2, 3}","inline":true,"padRight":true},{"text":"we have used a two-step approach in order to avoid areas of the landscape with saddle points. We first run ADAM","element":"span"},{"style":{"height":15.13},"width":125.1,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-6.png","element":"img","alt":"8 for 2","inline":true,"padRight":true},{"text":"epochs with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"1 ","element":"span"},{"text":"and store the weights. Secondly, for each ","element":"span"},{"style":{"height":15.6},"width":122.72,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-7.png","element":"img","alt":" p ∈ P","inline":true,"padRight":true},{"text":"we then perform dropout ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with initialization given by the stored weights. In this manner, we expect that we are approximately “tracking” the same local region across the optimization landscape when we change ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":". Optimization with ADAM is less prone to remain in flat areas of the landscape since it uses a dynamic step size. 
Hence, if after the dynamic step the iterates remain in a part of the landscape with no saddle points that smoothly changes with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", we also expect in this case to obtain comparable convergence rates for ","element":"span"},{"text":"SGD ","element":"span"},{"text":"for each fixed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Step size and batch size. ","element":"span"},{"text":"In each experiment, the step size is given by ","element":"span"},{"style":{"height":18.73},"width":346.23,"height":46.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-8.png","element":"img","alt":" η = 10−5 and the","inline":true,"padRight":true},{"text":"batch size is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"= 1024","element":"span"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Fitting procedure. 
","element":"span"},{"text":"We fix a set of probabilities ","element":"span"},{"style":{"height":17.6},"width":185.23,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-9.png","element":"img","alt":" P ⊂ [0, 1]","inline":true,"padRight":true},{"text":"and depths ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"and for each pair ","element":"span"},{"style":{"height":17.6},"width":272,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-10.png","element":"img","alt":" (p, l) ∈ P × L","inline":true,"padRight":true},{"text":"we run the algorithm above. 
From the value of the loss from all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"iterations of ","element":"span"},{"text":"SGD ","element":"span"},{"style":{"height":19.81},"width":209.74,"height":49.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-11.png","element":"img","alt":" L = (lt)Tt=0 ","inline":true,"padRight":true},{"text":"in one run, we compute a moving average ","element":"span"},{"style":{"height":19.81},"width":283.48,"height":49.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-12.png","element":"img","alt":" a(L)Tt=0, where","inline":true,"padRight":true},{"text":"we average the loss across a window with size given by the number of batches ","element":"span"},{"style":{"height":15.6},"width":214.14,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-13.png","element":"img","alt":" nb required","inline":true,"padRight":true},{"text":"to complete one epoch. In this manner we obtain an average convergence rate and diminish the stochasticity from the dataset. 
We then fit the averaged loss of the iterates ","element":"span"},{"style":{"height":19.81},"width":211.1,"height":49.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-14.png","element":"img","alt":" a(L)Tt=0 for","inline":true,"padRight":true},{"text":"each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"to the function","element":"span"}],[{"style":{"width":"71%"},"width":1240,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-15.png","element":"img"}],[{"text":"We run the experiment ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"= 10 ","element":"span"},{"text":"times for each ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p, l","element":"span"},{"text":") ","element":"span"},{"text":"and obtain an average convergence exponent ","element":"span"},{"style":{"height":22.96},"width":272.99,"height":57.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-16.png","element":"img","alt":" (˜βp,l)(p,l)∈P×L.","inline":true}],[{"text":"5.1.2 Results","element":"span"}],[{"text":"In Figure ","element":"span"},{"href":"#id-82","text":"3 ","element":"a"},{"text":"we can see the plots of ","element":"span"},{"style":{"height":21.45},"width":61.14,"height":53.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-17.png","element":"img","alt":"˜βp,l","inline":true},{"text":". As suspected from the heuristic argument, we do not see an increasingly large dependence on ","element":"span"},{"style":{"height":17.6},"width":867.29,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/23-18.png","element":"img","alt":" p for L = 1, 2 or 3 when D ∈ {50, 100}. 
For","inline":true,"padRight":true},{"text":"the ","element":"span"},{"text":"MNIST ","element":"span"},{"text":"dataset some dependence on the depth can be seen, but this may be due","element":"span"}],[{"id":"id-82","style":{"width":"96%"},"width":1672,"height":1084,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/24-0.png","element":"img"}],[{"text":"Figure 3: The fit ","element":"figcaption","subtype":"caption"},{"style":{"height":21.45},"width":921.97,"height":53.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/24-1.png","element":"img","alt":"˜βp,l for p ∈ {i × 10−1 : i ∈ [10]} and l ∈ {1, 2, 3}","inline":true,"padRight":true},{"text":"for LeNet with different widths ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"D ","element":"figcaption","subtype":"caption"},{"text":"and different datasets. Here ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(a) ","element":"figcaption","subtype":"caption"},{"text":"MNIST ","element":"figcaption","subtype":"caption"},{"text":"with ","element":"figcaption","subtype":"caption"},{"style":{"height":18},"width":234.98,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/24-2.png","element":"img","alt":" D = 50; (a′)","inline":true,"padRight":true},{"text":"MNIST ","element":"figcaption","subtype":"caption"},{"text":"with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"D ","element":"figcaption","subtype":"caption"},{"text":"= 100","element":"figcaption","subtype":"caption"},{"text":"; ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(b) ","element":"figcaption","subtype":"caption"},{"text":"CIFAR-","element":"figcaption","subtype":"caption"},{"text":"100-fine labels with 
","element":"figcaption","subtype":"caption"},{"style":{"height":18},"width":228.08,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/24-3.png","element":"img","alt":" D = 50; (b′)","inline":true,"padRight":true},{"text":"CIFAR-","element":"figcaption","subtype":"caption"},{"text":"100-fine labels with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"D ","element":"figcaption","subtype":"caption"},{"text":"= 100","element":"figcaption","subtype":"caption"},{"text":"; ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(c) ","element":"figcaption","subtype":"caption"},{"text":"CIFAR- ","element":"figcaption","subtype":"caption"},{"text":"100-coarse labels with ","element":"figcaption","subtype":"caption"},{"style":{"height":18},"width":224.59,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/24-4.png","element":"img","alt":" D = 50; (c′","inline":true},{"text":") ","element":"figcaption","subtype":"caption"},{"text":"CIFAR-","element":"figcaption","subtype":"caption"},{"text":"100-coarse labels with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"D ","element":"figcaption","subtype":"caption"},{"text":"= 100","element":"figcaption","subtype":"caption"},{"text":". While for the ","element":"figcaption","subtype":"caption"},{"text":"MNIST ","element":"figcaption","subtype":"caption"},{"text":"dataset there seems to be an increasing dependence of dropout on the convergence rate with the depth ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"L","element":"figcaption","subtype":"caption"},{"text":", for ","element":"figcaption","subtype":"caption"},{"text":"CIFAR ","element":"figcaption","subtype":"caption"},{"text":"no such dependence is observed. 
We remark, however, that in the ","element":"figcaption","subtype":"caption"},{"text":"CIFAR ","element":"figcaption","subtype":"caption"},{"text":"datasets encountering saddle points was more common. For those areas the loss profile is flat and so we expect the fits to be biased towards the origin in some cases.","element":"figcaption","subtype":"caption"}],[{"text":"to other factors that affect the convergence rate, like initialization issues. For the ","element":"span"},{"text":"CIFAR ","element":"span"},{"text":"datasets, convergence is greatly affected by saddle points despite the use of dropout. This is, however, common when using ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with small constant step sizes. In particular, in practical scenarios other schemes that adjust the step size, such as ADAM, may be more appropriate when dealing with deep networks with dropout in different layers. From the experiments we conclude that, despite the stochasticity provided by dropout, the convergence rate is not affected much by a varying dropout probability ","element":"span"},{"style":{"height":15.2},"width":103.05,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/24-5.png","element":"img","alt":" 1 − p","inline":true,"padRight":true},{"text":"in wide networks with just a few dropout layers.","element":"span"}]]},{"heading":"6. Conclusion","paragraphs":[[{"text":"In this paper we have shown with a probability theoretical proof that a large class of dropout algorithms for neural networks converges almost surely to a unique stationary set of a projected system of ","element":"span"},{"text":"ODEs. ","element":"span"},{"text":"The result gives a formal guarantee that these dropout algorithms are well-behaved for a wide range of ","element":"span"},{"text":"NNs ","element":"span"},{"text":"and activation functions, and will at least asymptotically not suffer from issues because of the connection to bond percolation. 
We leave the extension of this result for nonsmooth activation functions such as ReLU for future work. Additionally, we established bounds for the sample complexity of ","element":"span"},{"text":"SGD ","element":"span"},{"text":"with dropout to converge to an ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/25-0.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationary point of a generic nonconvex function. An upper bound to the rate of convergence of ","element":"span"},{"text":"GD ","element":"span"},{"text":"on the limiting ","element":"span"},{"text":"ODE ","element":"span"},{"text":"of dropout algorithms was established as well for arborescences of arbitrary depth with linear activation functions. While ","element":"span"},{"text":"GD ","element":"span"},{"text":"on the limiting ","element":"span"},{"text":"ODE ","element":"span"},{"text":"is not strictly a dropout algorithm, the result is a necessary step towards analyzing the convergence rate of the actual stochastic implementations of dropout algorithms. Finally, Proposition ","element":"span"},{"href":"#id-31","text":"9 ","element":"a"},{"text":"specifically implies that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect ","element":"span"},{"text":"can impair the convergence rate by as much as an exponential factor in the number of layers of thin but deep networks. We have theoretically and experimentally verified this claim in experiments with a path network. This fact is in contrast to wide networks with a few dropout layers where a strong dependence on the dropout probability ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"is not experimentally observed. 
These two observations together imply that there is a change of regime in the convergence rate from networks that are wide with a few dropout layers to thin networks with many dropout layers.","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"We thank the anonymous referees for their feedback. ","element":"span"},{"text":"Their suggestions have led to an improved paper.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-28","text":"Sanjeev Arora, Noah Golowich, Nadav Cohen, and Wei Hu. ","element":"span"},{"text":"A convergence analysis of gradient descent for deep linear neural networks. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"7th International Conference on Learning Representations, ICLR 2019","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-35","text":"Jimmy Ba and Brendan Frey. ","element":"span"},{"text":"Adaptive Dropout for training deep neural networks. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 3084–3092, 2013.","element":"span"}],[{"id":"id-7","text":"Pierre Baldi and Peter Sadowski. The Dropout learning algorithm. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Artificial intelligence","element":"span"},{"text":", 210:78–122, 2014.","element":"span"}],[{"id":"id-4","text":"Pierre Baldi and Peter J. Sadowski. Understanding Dropout. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 2814–2822, 2013.","element":"span"}],[{"id":"id-30","text":"Peter L. Bartlett, David P. Helmbold, and Philip M. Long. Gradient descent with identity ","element":"span"},{"text":"initialization efficiently learns positive-definite linear transformations by deep residual networks. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Computation","element":"span"},{"text":", 31:477–502, 2018.","element":"span"}],[{"id":"id-51","text":"Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming: an overview. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of 1995 34th IEEE Conference on Decision and Control","element":"span"},{"text":", volume 1, pages 560–564. IEEE, 1995.","element":"span"}],[{"id":"id-50","text":"Vivek S. Borkar. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stochastic approximation: a dynamical systems viewpoint","element":"span"},{"text":", volume 48. Springer, 2009.","element":"span"}],[{"id":"id-59","text":"Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale ","element":"span"},{"text":"machine learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Siam Review","element":"span"},{"text":", 60(2):223–311, 2018.","element":"span"}],[{"text":"Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Foundations and ","element":"span"},{"style":{"height":18.4},"width":585.7,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/26-0.png","element":"img","alt":"Trends® in Machine Learning","inline":true},{"text":", 8(3-4):231–357, 2015.","element":"span"}],[{"id":"id-8","text":"Jacopo Cavazza, Pietro Morerio, Benjamin Haeffele, Connor Lane, Vittorio Murino, and ","element":"span"},{"text":"Rene Vidal. Dropout as a low-rank regularizer for matrix factorization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 435–444, 2018.","element":"span"}],[{"id":"id-2","text":"Terrance DeVries and Graham W. Taylor. 
Improved regularization of convolutional neural ","element":"span"},{"text":"networks with cutout. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1708.04552","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-26","text":"Yoel Drori and Ohad Shamir. The complexity of finding stationary points with stochastic ","element":"span"},{"text":"gradient descent. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 2658–2667. PMLR, 2020.","element":"span"}],[{"id":"id-46","text":"Tianxiang Gao, Hailiang Liu, Jia Liu, Hridesh Rajan, and Hongyang Gao. A global conver- ","element":"span"},{"text":"gence theory for deep relu implicit networks via over-parameterization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-106","text":"Piotr Hajłasz. ","element":"span"},{"text":"Whitney’s example by way of Assouad’s embedding. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the American Mathematical Society","element":"span"},{"text":", 131(11):3463–3467, 2003.","element":"span"}],[{"id":"id-0","text":"Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. ","element":"span"},{"text":"Salakhutdinov. ","element":"span"},{"text":"Improving neural networks by preventing co-adaptation of feature detectors. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1207.0580","element":"span"},{"text":", 2012.","element":"span"}],[{"id":"id-44","text":"Wei Huang, Richard Yi Da Xu, Weitao Du, Yutian Zeng, and Yunce Zhao. Mean field theory ","element":"span"},{"text":"for deep dropout networks: digging up gradient backpropagation deeply. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1912.09132","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-66","text":"Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence ","element":"span"},{"text":"and generalization in neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 31, 2018.","element":"span"}],[{"id":"id-70","text":"Hamed Karimi, Julie Nutini, and Mark Schmidt. ","element":"span"},{"text":"Linear convergence of gradient and proximal–gradient methods under the Polyak-łojasiewicz condition. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Joint European Conference on Machine Learning and Knowledge Discovery in Databases","element":"span"},{"text":", pages 795–811. Springer, 2016.","element":"span"}],[{"id":"id-41","text":"Edmund Kay and Anurag Agarwal. Dropconnected neural network trained with diverse ","element":"span"},{"text":"features for classifying heart sounds. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2016 Computing in Cardiology Conference (CinC)","element":"span"},{"text":", pages 617–620. IEEE, 2016.","element":"span"}],[{"id":"id-48","text":"Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression ","element":"span"},{"text":"function. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Mathematical Statistics","element":"span"},{"text":", 23(3):462–466, 1952.","element":"span"}],[{"text":"Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 12 2014.","element":"span"}],[{"id":"id-33","text":"Durk P. Kingma, Tim Salimans, and Max Welling. 
","element":"span"},{"text":"Variational Dropout and the local reparameterization trick. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 2575–2583, 2015.","element":"span"}],[{"id":"id-81","text":"Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.","element":"span"}],[{"id":"id-39","text":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep ","element":"span"},{"text":"convolutional neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 1097–1105, 2012.","element":"span"}],[{"id":"id-49","text":"Harold Kushner and G. George Yin. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stochastic approximation and recursive algorithms and applications","element":"span"},{"text":", volume 35. Springer Science & Business Media, 2003.","element":"span"}],[{"id":"id-80","text":"Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist","element":"span"},{"text":", 2, 2010.","element":"span"}],[{"id":"id-36","text":"Zhe Li, Boqing Gong, and Tianbao Yang. Improved Dropout for shallow and deep learning. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 2523–2531, 2016.","element":"span"}],[{"id":"id-10","text":"Poorya Mianjy and Raman Arora. On Dropout and nuclear norm regularization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 4575–4584, 2019.","element":"span"}],[{"id":"id-14","text":"Poorya Mianjy and Raman Arora. 
On convergence and generalization of dropout training. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 33, 2020.","element":"span"}],[{"id":"id-9","text":"Poorya Mianjy, Raman Arora, and Rene Vidal. On the implicit bias of Dropout. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 3540–3548, 2018.","element":"span"}],[{"id":"id-34","text":"Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational Dropout sparsifies ","element":"span"},{"text":"deep neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Machine Learning-Volume 70","element":"span"},{"text":", pages 2498–2507. JMLR. org, 2017.","element":"span"}],[{"id":"id-92","text":"Anthony P. Morse. The behavior of a function on its critical set. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Annals of Mathematics","element":"span"},{"text":", pages 62–70, 1939.","element":"span"}],[{"id":"id-55","text":"Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in ","element":"span"},{"text":"neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pages 1376–1401, 2015.","element":"span"}],[{"id":"id-21","text":"Samet Oymak. Learning compact neural networks with regularization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 3966–3975, 2018.","element":"span"}],[{"id":"id-11","text":"Ambar Pal, Connor Lane, René Vidal, and Benjamin D. Haeffele. On the regularization prop- ","element":"span"},{"text":"erties of structured dropout. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 7671–7679, 2020.","element":"span"}],[{"id":"id-40","text":"Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. Dropout im- ","element":"span"},{"text":"proves recurrent neural networks for handwriting recognition. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2014 14th International Conference on Frontiers in Handwriting Recognition","element":"span"},{"text":", pages 285–290. IEEE, 2014.","element":"span"}],[{"id":"id-47","text":"Herbert Robbins and Sutton Monro. A stochastic approximation method. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Mathematical Statistics","element":"span"},{"text":", pages 400–407, 1951.","element":"span"}],[{"id":"id-93","text":"Arthur Sard. ","element":"span"},{"text":"The measure of the critical values of differentiable maps. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bulletin of the American Mathematical Society","element":"span"},{"text":", 48(12):883–890, 1942.","element":"span"}],[{"id":"id-38","text":"Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. ","element":"span"},{"text":"Recurrent dropout without memory loss. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers","element":"span"},{"text":", pages 1757–1766, 2016.","element":"span"}],[{"id":"id-15","text":"Albert Senen-Cerda and Jaron Sanders. Asymptotic convergence rate of dropout on shallow ","element":"span"},{"text":"linear neural networks. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems","element":"span"},{"text":", pages 105–106, 2022.","element":"span"}],[{"id":"id-29","text":"Ohad Shamir. Exponential convergence time of gradient descent for one-dimensional deep ","element":"span"},{"text":"linear neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pages 2691–2713, 2019.","element":"span"}],[{"id":"id-43","text":"Joachim Sicking, Maram Akila, Tim Wirtz, Sebastian Houben, and Asja Fischer. Character","element":"span"},{"text":"istics of Monte Carlo dropout in wide neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2007.05434","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-5","text":"Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut","element":"span"},{"text":"dinov. Dropout: a simple way to prevent neural networks from overfitting. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Journal of Machine Learning Research","element":"span"},{"text":", 15(1):1929–1958, 2014.","element":"span"}],[{"id":"id-65","text":"Salma Tarmoun, Guilherme Franca, Benjamin D Haeffele, and Rene Vidal. Understanding ","element":"span"},{"text":"the dynamics of gradient flow in overparameterized linear models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", volume 139 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 10153–10161, 18–24 Jul 2021.","element":"span"}],[{"id":"id-42","text":"Gregor Urban, Kevin Bache, Duc T.T. Phan, Agua Sobrino, Alexander K. 
Shmakov, ","element":"span"},{"text":"Stephanie J. Hachey, Christopher C.W. Hughes, and Pierre Baldi. Deep learning for drug discovery and cancer research: Automated analysis of vascularization images. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","element":"span"},{"text":", 16(3):1029–1035, 2018.","element":"span"}],[{"id":"id-6","text":"Stefan Wager, Sida Wang, and Percy S. Liang. Dropout training as adaptive regularization. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 351–359, 2013.","element":"span"}],[{"id":"id-1","text":"Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of ","element":"span"},{"text":"neural networks using Dropconnect. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1058–1066, 2013.","element":"span"}],[{"id":"id-12","text":"Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects ","element":"span"},{"text":"of dropout. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2002.12915","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-37","text":"Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regulariza","element":"span"},{"text":"tion. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1409.2329","element":"span"},{"text":", 2014.","element":"span"}],[{"id":"id-45","text":"Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-","element":"span"},{"text":"parameterized deep ReLU networks. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 109(3):467–492, 2020.","element":"span"}]]},{"heading":"Appendix A. Backpropagation Algorithm","paragraphs":[[{"text":"We define the backpropagation algorithm used in Section ","element":"span"},{"text":"2 ","element":"span"},{"text":"to compute the estimate of the gradient.","element":"span"}],[{"id":"id-53","style":{"height":19.13},"width":669.27,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-0.png","element":"img","alt":"Definition 15 Assume σ ∈ C1(R)","inline":true},{"style":{"fontStyle":"italic"},"text":". Given weights ","element":"span"},{"style":{"height":12.8},"width":150.32,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-1.png","element":"img","alt":" W ∈ W","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and input–output pair ","element":"span"},{"style":{"height":17.6},"width":144.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-2.png","element":"img","alt":" (x, y) ∈","inline":true}],[{"style":{"width":"99%"},"width":1724,"height":412,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-3.png","element":"img"}],[{"text":"Definition ","element":"span"},{"href":"#id-53","text":"15 ","element":"a"},{"text":"is essentially a computationally efficient way of calculating the gradient ","element":"span"},{"style":{"height":17.6},"width":258.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-4.png","element":"img","alt":"∇l(ΨW (x), y)","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-16","text":"(1)","element":"a"},{"text":", leveraging the ","element":"span"},{"text":"NN’","element":"span"},{"text":"s layered structure together with the chain rule of differentiation to obtain a recursive computation of the partial 
derivatives.","element":"span"}]]},{"heading":"Appendix B. ODE method","paragraphs":[[{"text":"Regarding our second result in Proposition ","element":"span"},{"href":"#id-67","text":"13, ","element":"a"},{"text":"observe that ","element":"span"},{"text":"GD ","element":"span"},{"text":"on a limiting ","element":"span"},{"text":"ODE ","element":"span"},{"text":"is not exactly a dropout algorithm. Analyzing ","element":"span"},{"text":"GD’","element":"span"},{"text":"s convergence rate however is an important stepping stone towards analyzing the convergence rate of dropout algorithms. To see the mathematical relation, consider that any dropout algorithm updates the weights","element":"span"}],[{"id":"id-83","style":{"width":"65%"},"width":1127,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-5.png","element":"img"}],[{"text":"randomly for ","element":"span"},{"style":{"height":14.8},"width":258.8,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-6.png","element":"img","alt":" n = 0, 1, 2, · · ·","inline":true,"padRight":true},{"text":". Here, the ","element":"span"},{"style":{"height":16.33},"width":82.56,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-7.png","element":"img","alt":" α{n} ","inline":true,"padRight":true},{"text":"denote the step sizes of the algorithm, and the ","element":"span"},{"style":{"height":15.93},"width":118.58,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-8.png","element":"img","alt":"∆[n+1] ","inline":true,"padRight":true},{"text":"represent the random directions that result from the act of dropping weights. 
As we show in this paper, under assumptions of independence these random directions satisfy","element":"span"}],[{"style":{"width":"71%"},"width":1243,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-9.png","element":"img"}],[{"text":"for some continuous, differentiable function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":")","element":"span"},{"text":". Observe that the algorithm in ","element":"span"},{"href":"#id-83","text":"(62) ","element":"a"},{"text":"satisfies ","element":"span"},{"style":{"height":20.33},"width":1659.98,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-10.png","element":"img","alt":" W [n+1] = W [n]+α{n}(−∇D(W [n])+M[n+1]) where M[n+1] = E[∆[n+1] | W [0], . . . , W [n]]−","inline":true},{"style":{"height":15.93},"width":118.58,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-11.png","element":"img","alt":"∆[n+1] ","inline":true,"padRight":true},{"text":"describes a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"martingale difference ","element":"span"},{"text":"sequence. The conditional expectation of this martingale difference sequence given the past ","element":"span"},{"style":{"height":19.13},"width":417.84,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-12.png","element":"img","alt":" W [0], . . . 
, W [n] is zero.","inline":true}],[{"text":"For diminishing step sizes ","element":"span"},{"style":{"height":16.33},"width":82.56,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-13.png","element":"img","alt":" α{n}","inline":true},{"text":", we can consequently view dropout algorithms as in ","element":"span"},{"href":"#id-83","text":"(62) ","element":"a"},{"text":"as being noisy discretizations of the ordinary differential equation","element":"span"}],[{"id":"id-84","style":{"width":"60%"},"width":1049,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/29-14.png","element":"img"}],[{"text":"In fact, we employ the so-called ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ordinary differential equation method ","element":"span"},{"text":"(","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"Kushner and Yin","element":"a"},{"text":", ","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"2003","element":"a"},{"text":"; ","element":"span"},{"href":"#id-50","referenceIndex":7,"text":"Borkar","element":"a"},{"text":", ","element":"span"},{"href":"#id-50","referenceIndex":7,"text":"2009","element":"a"},{"text":"), which formally establishes that the random iterates in ","element":"span"},{"href":"#id-83","text":"(62) ","element":"a"},{"text":"follow the trajectories of the gradient flow in ","element":"span"},{"href":"#id-84","text":"(64)","element":"a"},{"text":". 
Hence, after sufficiently many iterations ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"and for a sufficiently small step size ","element":"span"},{"style":{"height":8.4},"width":28,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-0.png","element":"img","alt":" α","inline":true},{"text":", the convergence rate of the deterministic ","element":"span"},{"text":"GD ","element":"span"},{"text":"algorithm","element":"span"}],[{"style":{"width":"66%"},"width":1158,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-1.png","element":"img"}],[{"text":"gives insight into the convergence rate of the stochastic dropout algorithm in ","element":"span"},{"href":"#id-83","text":"(62)","element":"a"},{"text":".","element":"span"}]]},{"heading":"Appendix C. Projection operator","paragraphs":[[{"text":"We define here the projection operator ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-2.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"used in Section ","element":"span"},{"text":"3. ","element":"span"},{"text":"Say that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"is defined by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"smooth constraints ","element":"span"},{"style":{"height":16},"width":505.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-3.png","element":"img","alt":" qi : W → R, i = 1, . . . , l","inline":true,"padRight":true},{"text":"satisfying ","element":"span"},{"style":{"height":17.6},"width":624.77,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-4.png","element":"img","alt":" q1(W) ≤ 0, . . . 
, ql(W) ≤ 0, i.e.,","inline":true},{"style":{"height":17.6},"width":671.46,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-5.png","element":"img","alt":"H = {W ∈ W : qi(W) ≤ 0 ∀i ∈ [l]}","inline":true},{"text":". Denote by ","element":"span"},{"style":{"height":17.6},"width":195.46,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-6.png","element":"img","alt":" ∇D|H(W)","inline":true,"padRight":true},{"text":"the gradient of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"restricted to ","element":"span"},{"style":{"height":15.1},"width":315.24,"height":37.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-7.png","element":"img","alt":" H and let TWW","inline":true,"padRight":true},{"text":"be the tangent space of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":". Suppose that ","element":"span"},{"style":{"height":17.6},"width":423.54,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-8.png","element":"img","alt":" ∇qi(W) ̸= 0 whenever","inline":true},{"style":{"height":17.6},"width":198.89,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-9.png","element":"img","alt":"qi(W) = 0","inline":true},{"text":", and that these are linearly independent. 
At any point ","element":"span"},{"style":{"height":14},"width":167.68,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-10.png","element":"img","alt":" W ∈ ∂H","inline":true},{"text":", we define the outer normal cone","element":"span"}],[{"style":{"width":"85%"},"width":1480,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-11.png","element":"img"}],[{"text":"We also assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"is upper semicontinuous, i.e., if ","element":"span"},{"style":{"height":20.41},"width":607.14,"height":51.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-12.png","element":"img","alt":"˜W ∈ BH(W, δ), where BH(W, δ)","inline":true,"padRight":true},{"text":"is the ball of radius ","element":"span"},{"style":{"height":13.2},"width":119.64,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-13.png","element":"img","alt":" δ > 0","inline":true,"padRight":true},{"text":"centered at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"and intersected with ","element":"span"},{"style":{"height":17.6},"width":455.73,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-14.png","element":"img","alt":" H, then C(W) = ∩δ>0","inline":true},{"style":{"height":24.98},"width":1275,"height":62.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-15.png","element":"img","alt":"(∪ ˜W∈BH(W,δ) C( ˜W)). 
Let π(W) ≜ −t1[W ∈ ∂H] with t ∈ C(W)","inline":true,"padRight":true},{"text":"minimal to resolve the violated constraints of ","element":"span"},{"style":{"height":17.6},"width":895.32,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-16.png","element":"img","alt":" D|H(W) at W ∈ ∂H so that D|H(W) + π(W)","inline":true,"padRight":true},{"text":"points inside ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". In particular, we have","element":"span"}],[{"style":{"width":"99%"},"width":1725,"height":210,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-17.png","element":"img"}]]},{"heading":"Appendix D. Proof of Proposition 6","paragraphs":[[{"text":"The proof of Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"relies on the framework of stochastic approximation in ","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"Kushner ","element":"a"},{"href":"#id-49","referenceIndex":25,"text":"and Yin ","element":"a"},{"text":"(","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"2003","element":"a"},{"text":"). Specifically, Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"follows from Theorem 2.1 on p. 127 if we can show that its conditions (A2.1)–(A2.6) on p. 126 are satisfied. 
In the notation of Sections ","element":"span"},{"text":"2 and ","element":"span"},{"text":"3, ","element":"span"},{"text":"these conditions read:","element":"span"}],[{"style":{"width":"71%"},"width":1759,"height":666,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/30-18.png","element":"img"}],[{"style":{"width":"71%"},"width":1758,"height":391,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-0.png","element":"img"}],[{"text":"For the reader's convenience, we next state Theorem 2.1 by ","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"Kushner and Yin ","element":"a"},{"text":"(","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"2003","element":"a"},{"text":") in the notation of this paper. Their result requires some additional notation, as it characterizes the limiting behavior of the iterates of","element":"span"}],[{"id":"id-85","style":{"width":"81%"},"width":1416,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-1.png","element":"img"}],[{"text":"For any sequence of step sizes ","element":"span"},{"style":{"height":16.33},"width":82.56,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-2.png","element":"img","alt":" α{n} ","inline":true,"padRight":true},{"text":"satisfying (A2.4), define ","element":"span"},{"style":{"height":21.6},"width":550.49,"height":54.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-3.png","element":"img","alt":" t0 = 0 and tn = ∑_{i=0}^{n−1} α{i}.","inline":true,"padRight":true},{"text":"Define the continuous-time interpolation","element":"span"}],[{"style":{"width":"69%"},"width":1208,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-4.png","element":"img"}],[{"text":"as well as for 
","element":"span"},{"style":{"height":14.62},"width":151.75,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-5.png","element":"img","alt":" m ∈ N0","inline":true},{"text":", the shifted processes ","element":"span"},{"style":{"height":17.6},"width":856.12,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-6.png","element":"img","alt":" Wm(t) = W0(tm + t) for t ∈ (−∞, ∞). Let","inline":true,"padRight":true},{"text":"furthermore ","element":"span"},{"style":{"height":17.6},"width":1495.87,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-7.png","element":"img","alt":" o(t) = inf{n ∈ N0 : tn ≤ t < tn+1} for t ∈ [0, ∞), and o(t) = 0 for t ∈ (−∞, ∞),","inline":true,"padRight":true},{"text":"and define","element":"span"}],[{"style":{"width":"74%"},"width":1289,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-8.png","element":"img"}],[{"text":"as well as for ","element":"span"},{"style":{"height":14.62},"width":140.16,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-9.png","element":"img","alt":" m ∈ N0","inline":true},{"text":", the shifted processes ","element":"span"},{"style":{"height":24.01},"width":900.19,"height":60.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-10.png","element":"img","alt":" Zm(t) = �o(tm+t)−1i=m for t ∈ [0, ∞) and Zm(t) =","inline":true},{"style":{"height":25.28},"width":677.49,"height":63.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-11.png","element":"img","alt":"− �m−1i=o(tm+t) α{i}Zi for t ∈ (−∞, 0)","inline":true},{"text":". 
The following now holds:","element":"span"}],[{"id":"id-86","style":{"fontWeight":"bold"},"text":"Theorem 16 (A part of Theorem 2.1 by ","element":"span"},{"href":"#id-49","referenceIndex":25,"style":{"fontWeight":"bold"},"text":"Kushner and Yin ","element":"a"},{"style":{"fontWeight":"bold"},"text":"(","element":"span"},{"href":"#id-49","referenceIndex":25,"style":{"fontWeight":"bold"},"text":"2003","element":"a"},{"style":{"fontWeight":"bold"},"text":")) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let conditions","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"(A2.1)–(A2.5) hold for algorithm ","element":"span"},{"href":"#id-85","text":"(71)","element":"a"},{"style":{"fontStyle":"italic"},"text":", with the projection onto ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"style":{"fontStyle":"italic"},"text":"being as described in Appendix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Then there is a set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"style":{"fontStyle":"italic"},"text":"of probability zero such that for ","element":"span"},{"style":{"height":16.8},"width":141.83,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-12.png","element":"img","alt":" ω ̸∈ N","inline":true},{"style":{"fontStyle":"italic"},"text":", the set of functions ","element":"span"},{"style":{"height":17.6},"width":550.48,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-13.png","element":"img","alt":" {Wm(ω, ·), Zm(ω, ·), m < ∞}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is equicontinuous. 
Let ","element":"span"},{"style":{"height":17.6},"width":321.91,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-14.png","element":"img","alt":" (W(ω, ·), Z(ω, ·))","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"denote the limit of some convergent subsequence. Then this pair satisfies the projected ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ODE ","element":"span"},{"href":"#id-56","text":"(16)","element":"a"},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":20.33},"width":195.12,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-15.png","element":"img","alt":"{W [n](ω)}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"converges to some limit set of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ODE ","element":"span"},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"fontStyle":"italic"},"text":". Suppose that (A2.6) holds. 
Then, for almost all ","element":"span"},{"style":{"height":20.33},"width":252.62,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-16.png","element":"img","alt":" ω, {W [n](ω)}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"converges to a unique ","element":"span"},{"style":{"height":15.02},"width":53.28,"height":37.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/31-17.png","element":"img","alt":" Si.","inline":true}],[{"text":"In order to apply Theorem ","element":"span"},{"href":"#id-86","text":"16 ","element":"a"},{"text":"and arrive at Proposition ","element":"span"},{"href":"#id-22","text":"6, ","element":"a"},{"text":"we verify conditions (A2.1)– (A2.6) through Lemmas ","element":"span"},{"href":"#id-57","text":"17–","element":"a"},{"href":"#id-87","text":"19 ","element":"a"},{"text":"shown next in Appendix ","element":"span"},{"href":"#id-88","text":"D.1. ","element":"a"},{"text":"These lemmas are proven in Appendices ","element":"span"},{"href":"#id-89","text":"D.1.1–","element":"a"},{"href":"#id-90","text":"D.1.3, ","element":"a"},{"text":"respectively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.1 Verification of conditions (A2.1)–(A2.6)","element":"span"}],[{"text":"First we assume conditions (N1)–(N3) and we prove that the variance of the random update direction in ","element":"span"},{"href":"#id-18","text":"(4) ","element":"a"},{"text":"is finite. This verifies condition (A2.1). The proof can be found in ","element":"span"},{"id":"id-57","text":"Appendix ","element":"span"},{"href":"#id-89","text":"D.1.1:","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 17 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (N1)–(N3) from Proposition ","element":"span"},{"href":"#id-22","style":{"fontStyle":"italic"},"text":"6. 
","element":"a"},{"style":{"fontStyle":"italic"},"text":"Then ","element":"span"},{"style":{"height":24.04},"width":564.07,"height":60.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-0.png","element":"img","alt":" supt∈N E[∥∆[t+1]i ∥2F] < ∞ for","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , L","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"We prove next that if ","element":"span"},{"style":{"height":18.37},"width":231.8,"height":45.93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-1.png","element":"img","alt":" σ ∈ CrPB(R)","inline":true},{"text":", then the random update direction in ","element":"span"},{"href":"#id-18","text":"(4)","element":"a"},{"text":", conditional ","element":"span"},{"text":"on all prior updates, has conditional expectation ","element":"span"},{"style":{"height":20.33},"width":185.52,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-2.png","element":"img","alt":" ∇D(W [t])","inline":true},{"text":". Lemma ","element":"span"},{"href":"#id-58","text":"18 ","element":"a"},{"text":"verifies conditions (A2.2), (A2.3), and (A2.5) (in particular, here ","element":"span"},{"style":{"height":19.53},"width":153.04,"height":48.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-3.png","element":"img","alt":" β[t] = 0","inline":true},{"text":"). The proof can be found in Appendix ","element":"span"},{"href":"#id-91","text":"D.1.2:","element":"a"}],[{"id":"id-58","style":{"fontWeight":"bold"},"text":"Lemma 18 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (N2)–(N4) from Proposition ","element":"span"},{"href":"#id-22","style":{"fontStyle":"italic"},"text":"6. 
","element":"a"},{"style":{"fontStyle":"italic"},"text":"Then ","element":"span"},{"style":{"height":20.33},"width":593.17,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-4.png","element":"img","alt":" E[∆[t+1]|Ft] = ∇D(W [t]). Fur-","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"thermore, ","element":"span"},{"style":{"height":12.8},"width":427.06,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-5.png","element":"img","alt":" ∇D : W → W is r − 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"times continuously differentiable.","element":"span"}],[{"text":"From these conditions the first part of Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"follows. To prove the second part of Proposition ","element":"span"},{"href":"#id-22","text":"6, ","element":"a"},{"text":"we have to prove that the set of stationary points ","element":"span"},{"style":{"height":15.9},"width":55.76,"height":39.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-6.png","element":"img","alt":" SH","inline":true,"padRight":true},{"text":"is well-behaved in the sense that ","element":"span"},{"style":{"height":17.85},"width":163.48,"height":44.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-7.png","element":"img","alt":" D|Si(W)","inline":true,"padRight":true},{"text":"is constant. If an objective function is sufficiently differentiable, this is guaranteed by the Morse–Sard Theorem (","element":"span"},{"href":"#id-92","referenceIndex":32,"text":"Morse","element":"a"},{"text":", ","element":"span"},{"href":"#id-92","referenceIndex":32,"text":"1939","element":"a"},{"text":"; ","element":"span"},{"href":"#id-93","referenceIndex":38,"text":"Sard","element":"a"},{"text":", ","element":"span"},{"href":"#id-93","referenceIndex":38,"text":"1942","element":"a"},{"text":"). 
In the present case however we must take into account the possibility of an intersection of the set of stationary points with the boundary ","element":"span"},{"style":{"height":14},"width":62.59,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-8.png","element":"img","alt":" ∂H","inline":true},{"text":". Assuming (N4) and (N5) provides sufficient conditions. The proof of Lemma ","element":"span"},{"href":"#id-87","text":"19 ","element":"a"},{"text":"can be found in Appendix ","element":"span"},{"href":"#id-90","text":"D.1.3:","element":"a"}],[{"id":"id-87","style":{"fontWeight":"bold"},"text":"Lemma 19 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If (N2)–(N5) hold, then ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is constant on each ","element":"span"},{"style":{"height":15.02},"width":53.28,"height":37.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-9.png","element":"img","alt":" Si.","inline":true}],[{"text":"Since Conditions (A2.1)–(A2.6) of Thm. 2.1 on p. 
127 in ","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"Kushner and Yin ","element":"a"},{"text":"(","element":"span"},{"href":"#id-49","referenceIndex":25,"text":"2003","element":"a"},{"text":") are now proven satisfied, the proof of Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"is now completed.","element":"span"}],[{"id":"id-89","style":{"width":"83%"},"width":1444,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-10.png","element":"img"}],[{"text":"We need to carefully track all sequences of random variables created by a dropout algorithm throughout this proof, which we state here first explicitly.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 20 (Dropout iterates) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"During its ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"+ 1)","element":"span"},{"style":{"fontStyle":"italic"},"text":"-st ","element":"span"},{"text":"feedforward step","element":"span"},{"style":{"fontStyle":"italic"},"text":", the algorithm iteratively calculates","element":"span"}],[{"style":{"width":"77%"},"width":1340,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":15.2},"width":338.24,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-12.png","element":"img","alt":" i = 1, 2, . . . 
, L − 1","inline":true},{"style":{"fontStyle":"italic"},"text":", to output","element":"span"}],[{"style":{"width":"78%"},"width":1356,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Subsequently for its ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"+ 1)","element":"span"},{"style":{"fontStyle":"italic"},"text":"-st ","element":"span"},{"text":"backpropagation step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the algorithm calculates","element":"span"}],[{"style":{"width":"87%"},"width":1515,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-14.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"iteratively for ","element":"span"},{"style":{"height":16},"width":302.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-15.png","element":"img","alt":" j = L − 1, . . . , 1","inline":true},{"style":{"fontStyle":"italic"},"text":". The algorithm then calculates","element":"span"}],[{"id":"id-95","style":{"width":"69%"},"width":1210,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/32-16.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, L","element":"span"},{"style":{"fontStyle":"italic"},"text":", and finally updates all weights according to ","element":"span"},{"href":"#id-94","text":"(13)","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"The idea of the proof of Lemma ","element":"span"},{"href":"#id-57","text":"17 ","element":"a"},{"text":"is to expand the terms in ","element":"span"},{"style":{"height":24.01},"width":110.27,"height":60.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-0.png","element":"img","alt":" ∆[t+1]i","inline":true,"padRight":true},{"text":"defined in Definition ","element":"span"},{"href":"#id-95","text":"20 ","element":"a"},{"text":"recursively, and identify a polynomial in variables ","element":"span"},{"style":{"height":17.88},"width":752.82,"height":44.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-1.png","element":"img","alt":" {∥Y ∥n2∥X∥m2 }m∈N0 and n = 0, 1, 2. We","inline":true,"padRight":true},{"text":"will use several bounds that pertain to the Frobenius norm, written down in Lemma ","element":"span"},{"href":"#id-96","text":"30 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"J, ","element":"span"},{"text":"and we will iterate these in a moment.","element":"span"}],[{"text":"First, we will prove two bounds on the activation function applied to an arbitrary matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":". Recall that ","element":"span"},{"style":{"height":19.91},"width":231.8,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-2.png","element":"img","alt":" σ ∈ C2PB(R)","inline":true,"padRight":true},{"text":"by assumption (N1). 
There thus (i) exists some ","element":"span"},{"style":{"height":15.6},"width":287.62,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-3.png","element":"img","alt":" C0, k0 > 0 such","inline":true,"padRight":true},{"text":"that ","element":"span"},{"style":{"height":19.53},"width":724.4,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-4.png","element":"img","alt":" |σ(z)| ≤ C0(1 + z2)k0 for all z ∈ R","inline":true},{"text":", and there exists some ","element":"span"},{"style":{"height":15.6},"width":419.05,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-5.png","element":"img","alt":" C1, k1 > 0 such that","inline":true}],[{"style":{"width":"99%"},"width":1719,"height":193,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-6.png","element":"img"}],[{"text":"for some constant ","element":"span"},{"style":{"height":15.02},"width":163.13,"height":37.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-7.png","element":"img","alt":" C2 > 0.","inline":true,"padRight":true},{"text":"Similarly there exists some ","element":"span"},{"style":{"height":18.36},"width":609.04,"height":45.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-8.png","element":"img","alt":" C3 > 0 such that ∥σ′(A)∥F ≤","inline":true},{"style":{"height":19.53},"width":300.89,"height":48.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-9.png","element":"img","alt":"C3(1 + ∥A∥F)k.","inline":true,"padRight":true},{"text":"Note furthermore that (ii) for all ","element":"span"},{"style":{"height":14.8},"width":115.93,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-10.png","element":"img","alt":" l ≥ 0","inline":true},{"text":", by submultiplicativity of the Frobenius 
norm,","element":"span"}],[{"id":"id-97","style":{"width":"95%"},"width":1657,"height":182,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-11.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":23.8},"width":435.2,"height":59.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-12.png","element":"img","alt":" C4 = max{1, Cl/22 } > 0","inline":true},{"text":". Again, a similar bound holds for ","element":"span"},{"style":{"height":8},"width":49.68,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-13.png","element":"img","alt":" σ′.","inline":true,"padRight":true},{"text":"Next, note that we have by (i) submultiplicativity and Lemma ","element":"span"},{"href":"#id-96","text":"30 ","element":"a"},{"text":"that","element":"span"}],[{"style":{"width":"89%"},"width":1549,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-14.png","element":"img"}],[{"text":"The first term is bounded with probability one: ","element":"span"},{"style":{"height":26.84},"width":506.11,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-15.png","element":"img","alt":" F [t]i,r,l ∈ {0, 1} for all i, r, l, t","inline":true},{"text":". For the second ","element":"span"},{"text":"term, consider the following bound:","element":"span"}],[{"id":"id-99","style":{"width":"92%"},"width":1605,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-16.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":14.4},"width":188.28,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-17.png","element":"img","alt":" 1 ≤ i ≤ L","inline":true},{"text":", where we have also used the submultiplicative property. 
For the third term, consider the next bound: (i) recursing ","element":"span"},{"href":"#id-97","text":"(79) ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":26.41},"width":791.34,"height":66.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-18.png","element":"img","alt":" A = I and B = (W [t]j ⊙ F [t+1]j )A[t+1]j−1 etc,","inline":true,"padRight":true},{"text":"we obtain that there exists some ","element":"span"},{"style":{"height":15.02},"width":130.3,"height":37.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-19.png","element":"img","alt":" C5 > 0","inline":true},{"text":", say, so that","element":"span"}],[{"id":"id-98","style":{"width":"94%"},"width":1641,"height":341,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-20.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":16},"width":334.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-21.png","element":"img","alt":" j = 1, 2, . . . , L−1","inline":true},{"text":". 
Similar to the derivation in ","element":"span"},{"href":"#id-98","text":"(82)","element":"a"},{"text":", we obtain instead with ","element":"span"},{"style":{"height":12.8},"width":237.82,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-22.png","element":"img","alt":" σ′ that there","inline":true,"padRight":true},{"text":"exists some ","element":"span"},{"style":{"height":15.02},"width":322.08,"height":37.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-23.png","element":"img","alt":" C6 > 0 such that","inline":true}],[{"id":"id-100","style":{"width":"96%"},"width":1662,"height":133,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/33-24.png","element":"img"}],[{"text":"Recall that ","element":"span"},{"style":{"height":24.01},"width":768.98,"height":60.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-0.png","element":"img","alt":" ∥∆[t+1]i ∥F ≤ ∥F [t+1]i ∥F∥R[t+1]i ∥F∥A[t+1]i−1 ∥F","inline":true},{"text":". This, together with using ","element":"span"},{"href":"#id-99","text":"(81) ","element":"a"},{"text":"repeatedly ","element":"span"},{"text":"for ","element":"span"},{"style":{"height":16},"width":295.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-1.png","element":"img","alt":" j = i, . . . 
, L − 1","inline":true},{"text":", and ","element":"span"},{"href":"#id-98","text":"(82)","element":"a"},{"text":", ","element":"span"},{"href":"#id-100","text":"(83)","element":"a"},{"text":", yields the following inequality","element":"span"}],[{"id":"id-101","style":{"width":"100%"},"width":1736,"height":1405,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-2.png","element":"img"}],[{"text":"Lastly, we bound ","element":"span"},{"style":{"height":24.29},"width":175.43,"height":60.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-3.png","element":"img","alt":" ∥R[t+1]L ∥F","inline":true},{"text":". By applying (i) subadditivity of the norm ","element":"span"},{"style":{"height":18.36},"width":396.62,"height":45.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-4.png","element":"img","alt":" ∥A + B∥F ≤ ∥A∥F +","inline":true},{"style":{"height":18.36},"width":104.93,"height":45.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-5.png","element":"img","alt":"∥B∥F","inline":true,"padRight":true},{"text":"and then using the elementary bound ","element":"span"},{"style":{"height":19.13},"width":359.47,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-6.png","element":"img","alt":" (a+b)2 ≤ 2(a2+b2)","inline":true,"padRight":true},{"text":"as well as submultiplicativity, we obtain","element":"span"}],[{"id":"id-102","style":{"width":"95%"},"width":1658,"height":323,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-7.png","element":"img"}],[{"text":"By combining inequalities ","element":"span"},{"href":"#id-101","text":"(84)","element":"a"},{"text":", ","element":"span"},{"href":"#id-102","text":"(85)","element":"a"},{"text":", and upper bounding the exponent 
","element":"span"},{"style":{"height":15.53},"width":90.47,"height":38.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-8.png","element":"img","alt":" kL−1 ","inline":true,"padRight":true},{"text":"of the term ","element":"span"},{"style":{"height":20.33},"width":256.69,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-9.png","element":"img","alt":"1 + ∥X[t+1]∥F","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-102","text":"(85) ","element":"a"},{"text":"by ","element":"span"},{"style":{"height":24.4},"width":189.82,"height":61.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-10.png","element":"img","alt":" 2 �L−1j=1 kj","inline":true},{"text":", we conclude that","element":"span"}],[{"style":{"width":"98%"},"width":1710,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/34-11.png","element":"img"}],[{"id":"id-103","style":{"width":"96%"},"width":1675,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-0.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , L ","element":"span"},{"text":"and some constants ","element":"span"},{"style":{"height":15.6},"width":117.7,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-1.png","element":"img","alt":" C8, C9","inline":true,"padRight":true},{"text":"and polynomials ","element":"span"},{"style":{"height":17.6},"width":556.19,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-2.png","element":"img","alt":" P1(z1, . . . , zL), P2(z1, . . . , zL),","inline":true,"padRight":true},{"text":"say, the latter both in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"variables. 
Because of the projection and by definition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":", there exists a constant ","element":"span"},{"style":{"height":24.01},"width":498.23,"height":60.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-3.png","element":"img","alt":" M such that ∥W [t]i ∥F ≤ M","inline":true,"padRight":true},{"text":"with probability one for all ","element":"span"},{"style":{"height":15.82},"width":387.18,"height":39.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-4.png","element":"img","alt":" i = 1, . . . , L, t ∈ N+.","inline":true,"padRight":true},{"text":"Furthermore, ","element":"span"},{"style":{"height":24.15},"width":629.5,"height":60.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-5.png","element":"img","alt":" ∥F [t]i ∥F ≤ maxi=0,...,L−1�didi+1","inline":true,"padRight":true},{"text":"with probability one for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , L","element":"span"},{"text":", ","element":"span"},{"style":{"height":15.82},"width":127.39,"height":39.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-6.png","element":"img","alt":"t ∈ N+","inline":true},{"text":". 
These two bounds, together with ","element":"span"},{"href":"#id-103","text":"(86) ","element":"a"},{"text":"and the fact that ","element":"span"},{"style":{"height":15.2},"width":111.35,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-7.png","element":"img","alt":" P1, P2","inline":true,"padRight":true},{"text":"are polynomials, as well as the hypothesis that ","element":"span"},{"style":{"height":17.88},"width":790.01,"height":44.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-8.png","element":"img","alt":" E[∥Y ∥m2 ∥X∥n2] < ∞ ∀m ∈ {0, 1, 2}, n ∈ N0","inline":true},{"text":", implies the result.","element":"span"}],[{"id":"id-91","style":{"width":"79%"},"width":1373,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-9.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":17.6},"width":995.06,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-10.png","element":"img","alt":" i ∈ {1, . . . , L}, r ∈ {1, . . . , di+1} and l ∈ {1, . . . , di}","inline":true},{"text":". 
Recall that ","element":"span"},{"style":{"height":15.02},"width":43.37,"height":37.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-11.png","element":"img","alt":" Ft","inline":true,"padRight":true},{"text":"is the smallest ","element":"span"},{"style":{"height":8},"width":41.5,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-12.png","element":"img","alt":" σ-","inline":true,"padRight":true},{"text":"algebra generated by ","element":"span"},{"style":{"height":20.95},"width":492.94,"height":52.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-13.png","element":"img","alt":" {W [0], (F [s], X[s], Y [s])}s≤t","inline":true},{"text":", and note that ","element":"span"},{"style":{"height":18.55},"width":182.57,"height":46.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-14.png","element":"img","alt":" W [t] is Ft","inline":true},{"text":"-measurable. The (i) ","element":"span"},{"style":{"height":15.02},"width":43.36,"height":37.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-15.png","element":"img","alt":" Ft","inline":true},{"text":"-measurability of ","element":"span"},{"style":{"height":16.33},"width":77.92,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-16.png","element":"img","alt":" W [t] ","inline":true,"padRight":true},{"text":"together with the (ii) hypothesis that the sequence of random variables ","element":"span"},{"style":{"height":22.02},"width":417.78,"height":55.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-17.png","element":"img","alt":" {(F [s], X[s], Y [s])}s∈N+","inline":true,"padRight":true},{"text":"is i.i.d. 
implies that","element":"span"}],[{"id":"id-104","style":{"width":"98%"},"width":1698,"height":444,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-18.png","element":"img"}],[{"text":"Next, we need to check that we can exchange the derivative and expectation. Note that we have the same assumptions ","element":"span"},{"style":{"height":17.88},"width":832.8,"height":44.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-19.png","element":"img","alt":" E[∥Y ∥m2 ∥X∥n2] < ∞ ∀m ∈ {0, 1, 2}, n ∈ N+","inline":true,"padRight":true},{"text":"as for Lemma ","element":"span"},{"href":"#id-57","text":"17, ","element":"a"},{"text":"as well as the assumption that ","element":"span"},{"style":{"height":18.37},"width":256.98,"height":45.93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-20.png","element":"img","alt":" σ ∈ CrPB(R).","inline":true,"padRight":true},{"text":"Therefore, by ","element":"span"},{"href":"#id-103","text":"(86) ","element":"a"},{"text":"in Lemma ","element":"span"},{"href":"#id-57","text":"17 ","element":"a"},{"text":"we have that ","element":"span"},{"style":{"height":26.84},"width":186.26,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-21.png","element":"img","alt":" |∆[t+1]i,r,l | is","inline":true,"padRight":true},{"text":"upper bounded and moreover ","element":"span"},{"style":{"height":26.84},"width":638.1,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-22.png","element":"img","alt":" E[∆[t+1]i,r,l ] ≤ CH for some CH < ∞","inline":true,"padRight":true},{"text":"depending only on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". The ","element":"span"},{"text":"interchange is then warranted by the dominated convergence theorem. 
Hence continuing from ","element":"span"},{"href":"#id-104","text":"(87)","element":"a"},{"text":", we obtain","element":"span"}],[{"style":{"width":"87%"},"width":1515,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-23.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"height":18.37},"width":236.13,"height":45.93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-24.png","element":"img","alt":" σ ∈ CrPB(R)","inline":true},{"text":", then for any multi-index ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"on the set of weights, a bound similar to ","element":"span"},{"href":"#id-103","text":"(86) ","element":"a"},{"text":"holds by the chain rule:","element":"span"}],[{"style":{"width":"84%"},"width":1470,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-25.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.02},"width":161.77,"height":42.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-26.png","element":"img","alt":" P1,s, P2,s","inline":true,"padRight":true},{"text":"are polynomials and ","element":"span"},{"style":{"height":13.02},"width":157.85,"height":32.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-27.png","element":"img","alt":" ns,1, ns,2","inline":true,"padRight":true},{"text":"are the top exponents in the expansion in ","element":"span"},{"style":{"height":17.6},"width":105.21,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-28.png","element":"img","alt":"∥X∥F","inline":true},{"text":". 
Hence, using the assumption ","element":"span"},{"style":{"height":17.88},"width":808.42,"height":44.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-29.png","element":"img","alt":" E[∥Y ∥m2 ∥X∥n2] < ∞ ∀m ∈ {0, 1, 2}, n ∈ N+","inline":true},{"text":", we obtain ","element":"span"},{"text":"for any ","element":"span"},{"style":{"height":13.2},"width":280.6,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-30.png","element":"img","alt":" W ∈ K ⊂ W","inline":true,"padRight":true},{"text":"a compact set that ","element":"span"},{"style":{"height":17.6},"width":585.39,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-31.png","element":"img","alt":" E[|∂sl(Y, ΨW⊙F (X))|] ≤ CK .","inline":true,"padRight":true},{"text":"In particular we can apply the dominated convergence theorem and conclude ","element":"span"},{"style":{"height":19.13},"width":464.64,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-32.png","element":"img","alt":" D(W) ∈ Cr−1(W) with","inline":true},{"style":{"height":17.6},"width":619.66,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/35-33.png","element":"img","alt":"∂sD(W) = E[∂sl(Y, ΨW⊙F (X))].","inline":true}],[{"id":"id-90","style":{"width":"78%"},"width":1362,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-0.png","element":"img"}],[{"text":"We use Sard’s theorem (","element":"span"},{"href":"#id-93","referenceIndex":38,"text":"Sard","element":"a"},{"text":", ","element":"span"},{"href":"#id-93","referenceIndex":38,"text":"1942","element":"a"},{"text":") to prove Lemma ","element":"span"},{"href":"#id-87","text":"19, ","element":"a"},{"text":"which gives sufficient conditions for condition 
(A2.6):","element":"span"}],[{"id":"id-105","href":"#id-93","referenceIndex":38,"style":{"height":18},"width":1169.04,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-1.png","element":"img","alt":"Proposition 21 (Sard, 1942) Let f : M → N be a f ∈ Cr ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"map between manifolds with ","element":"span"},{"style":{"height":17.6},"width":1223.27,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-2.png","element":"img","alt":"dim(M) = m, dim(N) = n. Let Crit(f) = {x ∈ M : ∇f(x) = 0}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the set of critical points of ","element":"span"},{"style":{"height":17.6},"width":668.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-3.png","element":"img","alt":" f. If r > m/n − 1, then f(Crit(f))","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"has measure zero.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-87","style":{"fontStyle":"italic"},"text":"19. ","element":"a"},{"text":"By Lemma ","element":"span"},{"href":"#id-58","text":"18, ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":17.6},"width":302.61,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-4.png","element":"img","alt":" D(W) ∈ Cr(W)","inline":true},{"text":". By assumption (N5) we have that if ","element":"span"},{"style":{"height":17.6},"width":970.47,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-5.png","element":"img","alt":" W ∈ ∂H and D(W) + π(W) = 0, then D(W) = 0","inline":true},{"text":". 
Furthermore ","element":"span"},{"style":{"height":17.42},"width":329.64,"height":43.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-6.png","element":"img","alt":" W ∈ Sj for some","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":", i.e., the critical points of ","element":"span"},{"style":{"height":17.6},"width":1180.51,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-7.png","element":"img","alt":" D(W) + π(W) are {W ∈ W | ∇D(W) = 0} ∩ H. We apply","inline":true,"padRight":true},{"text":"Sard’s theorem (Proposition ","element":"span"},{"href":"#id-105","text":"21) ","element":"a"},{"text":"to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":")","element":"span"},{"text":". We have that if ","element":"span"},{"style":{"height":17.6},"width":573.38,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-8.png","element":"img","alt":" r ≥ dim(W), then D(Si) ⊆ R","inline":true,"padRight":true},{"text":"has measure zero. Since ","element":"span"},{"style":{"height":15.02},"width":38.76,"height":37.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-9.png","element":"img","alt":" Si","inline":true,"padRight":true},{"text":"is connected there is a continuous path ","element":"span"},{"style":{"height":18.44},"width":443.35,"height":46.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-10.png","element":"img","alt":" za,b : [0, 1] → Si joining","inline":true,"padRight":true},{"text":"any two points ","element":"span"},{"style":{"height":15.6},"width":163.71,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-11.png","element":"img","alt":" a, b ∈ Si","inline":true},{"text":". 
By continuity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"we must have then ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":")","element":"span"},{"text":", since otherwise we would have ","element":"span"},{"style":{"height":17.6},"width":393.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-12.png","element":"img","alt":" [D(a), D(b)] ⊆ D(Si)","inline":true,"padRight":true},{"text":"which has positive measure in ","element":"span"},{"text":"R","element":"span"},{"text":". Therefore ","element":"span"},{"style":{"height":17.6},"width":109.12,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-13.png","element":"img","alt":"D(Si)","inline":true,"padRight":true},{"text":"must be a constant.","element":"span"}],[{"text":"Remark that in Lemma ","element":"span"},{"href":"#id-87","text":"19 ","element":"a"},{"text":"the condition ","element":"span"},{"style":{"height":17.6},"width":251.24,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-14.png","element":"img","alt":" r ≥ dim(W)","inline":true,"padRight":true},{"text":"cannot immediately be eliminated. When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r < ","element":"span"},{"text":"dim(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":")","element":"span"},{"text":", there are examples of functions which are not constant on their connected critical sets, see e.g. 
","element":"span"},{"href":"#id-106","referenceIndex":14,"text":"Hajłasz ","element":"a"},{"text":"(","element":"span"},{"href":"#id-106","referenceIndex":14,"text":"2003","element":"a"},{"text":").","element":"span"}]]},{"heading":"Appendix E. Proof of Propositions 7 and 8","paragraphs":[[{"text":"We use standard tools for proving convergence to an ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-15.png","element":"img","alt":" ϵ","inline":true},{"text":"-stationary point (for a reference, see ","element":"span"},{"href":"#id-59","referenceIndex":8,"text":"Bottou et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-59","referenceIndex":8,"text":"2018","element":"a"},{"text":")). We require first the following bounds on the variance induced by dropout.","element":"span"}],[{"id":"id-119","style":{"fontWeight":"bold"},"text":"Lemma 22 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is a random variable satisfying (Q4). If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is a vector of random variables with distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":", then","element":"span"}],[{"style":{"width":"38%"},"width":670,"height":204,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-16.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"We prove first (i). Recall that ","element":"span"},{"style":{"height":19.53},"width":222.18,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-17.png","element":"img","alt":" f ∈ {0, 1}N","inline":true},{"text":". 
If we denote by ","element":"span"},{"style":{"height":16.4},"width":142.74,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-18.png","element":"img","alt":" fi the i","inline":true},{"text":"th entry of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", then note that from (Q4) ","element":"span"},{"style":{"height":17.6},"width":1239.59,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-19.png","element":"img","alt":" P[fi = 1] = p and so E[|fi − E[fi]|] = E[|fi − p|] = 2p(1 − p). From","inline":true,"padRight":true},{"text":"linearity (i) follows. For (ii), we have","element":"span"}],[{"style":{"width":"86%"},"width":1502,"height":180,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/36-20.png","element":"img"}],[{"text":"where in the last inequality we have used the Cauchy–Schwarz inequality.","element":"span"}],[{"text":"In order to prove both Propositions ","element":"span"},{"href":"#id-25","text":"7 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-62","text":"8 ","element":"a"},{"text":"simultaneously, we will temporarily redefine in this section ","element":"span"},{"text":"D ","element":"span"},{"text":"as","element":"span"}],[{"id":"id-109","style":{"width":"67%"},"width":1167,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c > ","element":"span"},{"text":"0 ","element":"span"},{"text":"is a constant. 
Later on we will set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"for Proposition ","element":"span"},{"href":"#id-25","text":"7 ","element":"a"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/p ","element":"span"},{"text":"for Proposition ","element":"span"},{"href":"#id-62","text":"8, ","element":"a"},{"text":"respectively.","element":"span"}],[{"id":"id-107","style":{"fontWeight":"bold"},"text":"Lemma 23 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (Q3) and (Q4), that is, ","element":"span"},{"style":{"height":12.8},"width":149.24,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-1.png","element":"img","alt":" ∇U is ℓ","inline":true},{"style":{"fontStyle":"italic"},"text":"-Lipschitz and the distribution of the filters is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"style":{"fontStyle":"italic"},"text":"-valued. 
Then, ","element":"span"},{"style":{"height":15.13},"width":274.18,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-2.png","element":"img","alt":" ∇D is also c2ℓ","inline":true},{"style":{"fontStyle":"italic"},"text":"-Lipschitz.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"Using (i) Jensen’s inequality with the norm, we have for a fixed ","element":"span"},{"style":{"height":15.6},"width":266.42,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-3.png","element":"img","alt":" w, s ∈ W that","inline":true}],[{"style":{"width":"86%"},"width":1489,"height":503,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-4.png","element":"img"}],[{"text":"where we have also used (ii) the fact that for a vector ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":"-valued vector ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"we have ","element":"span"},{"style":{"height":18},"width":566.07,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-5.png","element":"img","alt":"∥f ⊙ u∥2 ≤ ∥u∥2, (iii) ∇U is ℓ","inline":true},{"text":"-Lipschitz.","element":"span"}],[{"text":"The proof of the following lemma can be found in Appendix ","element":"span"},{"href":"#id-61","text":"E.1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 24 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (Q1)–(Q4), then for any 
","element":"span"},{"style":{"height":17.6},"width":606.66,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-6.png","element":"img","alt":" w ∈ W with ∥w∥2 < R, we have","inline":true}],[{"style":{"width":"94%"},"width":1640,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-7.png","element":"img"}],[{"text":"We obtain in the next lemma a simple bound for the variance of the gradient that depends on the data.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 25 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (Q1)–(Q4), then for any ","element":"span"},{"style":{"height":15.6},"width":310.49,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-8.png","element":"img","alt":" w ∈ W, we have","inline":true}],[{"id":"id-111","style":{"width":"83%"},"width":1441,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-9.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"We use first the definition of ","element":"span"},{"text":"U ","element":"span"},{"text":"as an expectation. 
We have","element":"span"}],[{"style":{"width":"89%"},"width":1555,"height":174,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/37-10.png","element":"img"}],[{"style":{"width":"68%"},"width":1179,"height":315,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-0.png","element":"img"}],[{"text":"where in (i) we have used the upper bound for ","element":"span"},{"style":{"height":17.6},"width":321.82,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-1.png","element":"img","alt":" ∥∇r(w ⊙ cf, z)∥2","inline":true,"padRight":true},{"text":"from (Q2) and in (ii) that since ","element":"span"},{"style":{"height":19.41},"width":883.25,"height":48.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-2.png","element":"img","alt":" fi ∈ {0, 1} for all i ∈ [N], we have ∥f∥22 = ∥f∥1","inline":true,"padRight":true},{"text":"so using linearity with (Q4) the bound ","element":"span"},{"text":"follows.","element":"span"}],[{"text":"By (Q3)–(Q4) and Lemma ","element":"span"},{"href":"#id-107","text":"23, ","element":"a"},{"style":{"height":15.13},"width":187.48,"height":37.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-3.png","element":"img","alt":" ∇D is c2ℓ","inline":true},{"text":"-Lipschitz. 
In this case, we can then use the following common argument: if ","element":"span"},{"style":{"height":15.13},"width":181.8,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-4.png","element":"img","alt":" ∇D is c2ℓ","inline":true},{"text":"-Lipschitz then we have the inequality","element":"span"}],[{"style":{"width":"89%"},"width":1551,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-5.png","element":"img"}],[{"text":"We can then use the definition of ","element":"span"},{"style":{"height":16.33},"width":287.24,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-6.png","element":"img","alt":" W [t+1] to write","inline":true}],[{"id":"id-114","style":{"width":"90%"},"width":1568,"height":174,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-7.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":15.02},"width":221.32,"height":37.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-8.png","element":"img","alt":" Ft be the σ","inline":true},{"text":"-algebra of ","element":"span"},{"style":{"height":20.33},"width":649.49,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-9.png","element":"img","alt":" (W [0], F [1], Z[1], . . . , W [t], F [t], Z[t])","inline":true},{"text":". 
Conditional on ","element":"span"},{"style":{"height":19.13},"width":227.84,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-10.png","element":"img","alt":" Ft, F [t+1] ⊙","inline":true},{"style":{"height":20.34},"width":463.96,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-11.png","element":"img","alt":"∇r(W [t] ⊙ F [t+1], Z[t+1])","inline":true,"padRight":true},{"text":"is an unbiased estimator of ","element":"span"},{"style":{"height":20.34},"width":182.16,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-12.png","element":"img","alt":" ∇D(W [t])","inline":true,"padRight":true},{"text":"so that by linearity","element":"span"}],[{"id":"id-108","style":{"width":"89%"},"width":1541,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-13.png","element":"img"}],[{"text":"Similarly to ","element":"span"},{"href":"#id-108","text":"(97)","element":"a"},{"text":", we can decompose","element":"span"}],[{"id":"id-110","style":{"width":"97%"},"width":1689,"height":837,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/38-14.png","element":"img"}],[{"text":"where in the last step the cross-term vanishes by the independence of ","element":"span"},{"style":{"height":16.33},"width":272.84,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-0.png","element":"img","alt":" Z[t+1] and F [t]","inline":true},{"text":". 
If we take the expectation with respect to ","element":"span"},{"style":{"height":15.93},"width":106.81,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-1.png","element":"img","alt":" Z[t+1] ","inline":true,"padRight":true},{"text":"first, then we find","element":"span"}],[{"style":{"width":"94%"},"width":1642,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-2.png","element":"img"}],[{"text":"Similarly, we can add and subtract ","element":"span"},{"style":{"height":20.33},"width":182.16,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-3.png","element":"img","alt":" ∇D(W [t])","inline":true,"padRight":true},{"text":"in the first term and repeat the argument with the definitions of ","element":"span"},{"style":{"height":12.8},"width":242.72,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-4.png","element":"img","alt":" ∇U and ∇D","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-109","text":"(90)","element":"a"},{"text":", where we take the expectation of ","element":"span"},{"href":"#id-110","text":"(98) ","element":"a"},{"text":"with respect to ","element":"span"},{"style":{"height":15.93},"width":108.02,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-5.png","element":"img","alt":" F [t+1] ","inline":true,"padRight":true},{"text":"instead. A similar cross-term vanishes. 
We then obtain","element":"span"}],[{"id":"id-112","style":{"width":"99%"},"width":1722,"height":271,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-6.png","element":"img"}],[{"text":"Define the constant ","element":"span"},{"style":{"height":21.29},"width":591.95,"height":53.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-7.png","element":"img","alt":" Jc = S2 + 32N2c(ℓ2R2 + 2cℓR)","inline":true,"padRight":true},{"text":"depending on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":". Using the bounds of ","element":"span"},{"text":"Lemma ","element":"span"},{"href":"#id-111","text":"24 ","element":"a"},{"text":"together with assumption (Q5) and Lemma ","element":"span"},{"href":"#id-111","text":"25 ","element":"a"},{"text":"in ","element":"span"},{"href":"#id-112","text":"(100) ","element":"a"},{"text":"we obtain","element":"span"}],[{"id":"id-113","style":{"width":"98%"},"width":1712,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-8.png","element":"img"}],[{"text":"Now substitute ","element":"span"},{"href":"#id-108","text":"(97) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-113","text":"(101) ","element":"a"},{"text":"in ","element":"span"},{"href":"#id-114","text":"(96)","element":"a"},{"text":". 
","element":"span"},{"text":"After taking the expectation, we can use a telescoping sum in ","element":"span"},{"href":"#id-114","text":"(96) ","element":"a"},{"text":"with the previous bounds, which yields","element":"span"}],[{"style":{"width":"98%"},"width":1696,"height":339,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-9.png","element":"img"}],[{"text":"By (Q1) we have ","element":"span"},{"style":{"height":20.33},"width":590.3,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-10.png","element":"img","alt":" E[D(W [0])] − E[D(W [T])] ≤ 2M","inline":true,"padRight":true},{"text":"independently of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":". Assuming that ","element":"span"},{"style":{"height":16.73},"width":122.3,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-11.png","element":"img","alt":" α{t} <","inline":true},{"style":{"height":21.75},"width":315.72,"height":54.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-12.png","element":"img","alt":"1/(c2ℓ) for all t ∈ [T]","inline":true},{"text":", we then have","element":"span"}],[{"id":"id-115","style":{"width":"86%"},"width":1504,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-13.png","element":"img"}],[{"text":"We now proceed with proving Propositions ","element":"span"},{"href":"#id-25","text":"7 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-62","text":"8. 
","element":"a"},{"href":"#id-25","style":{"height":20.33},"width":882.13,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-14.png","element":"img","alt":"Proof of Propositions 7 (a) and 8 : If α{t} = η","inline":true,"padRight":true},{"text":"is a constant in ","element":"span"},{"href":"#id-115","text":"(103)","element":"a"},{"text":", we find","element":"span"}],[{"style":{"width":"81%"},"width":1406,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-15.png","element":"img"}],[{"text":"Minimizing the bound over ","element":"span"},{"style":{"height":12},"width":22,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-16.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"yields that the minimum occurs at ","element":"span"},{"style":{"height":19.13},"width":497.89,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-17.png","element":"img","alt":" η2 = M/(ℓNc4p(S2 + (1 −","inline":true},{"style":{"height":17.6},"width":360.99,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-18.png","element":"img","alt":"p)Jc)T). 
For this η","inline":true},{"text":", the bound reads","element":"span"}],[{"id":"id-116","style":{"width":"80%"},"width":1391,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/39-19.png","element":"img"}],[{"text":"To prove Proposition ","element":"span"},{"href":"#id-25","text":"7 ","element":"a"},{"text":"(a), set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-116","text":"(105) ","element":"a"},{"text":"as well as ","element":"span"},{"style":{"height":21.29},"width":529.95,"height":53.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-0.png","element":"img","alt":" Jc = J = S2 + 32N2(ℓ2R2 +","inline":true},{"style":{"height":17.6},"width":90.47,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-1.png","element":"img","alt":"2ℓR)","inline":true},{"text":". Note finally that the condition ","element":"span"},{"style":{"height":19.13},"width":359.35,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-2.png","element":"img","alt":" η < 1/(c2ℓ) = 1/ℓ","inline":true,"padRight":true},{"text":"is satisfied, for example, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p > ","element":"span"},{"style":{"height":19.13},"width":252.61,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-3.png","element":"img","alt":"Mℓ/(NS2T).","inline":true}],[{"text":"To prove Proposition ","element":"span"},{"href":"#id-62","text":"8, ","element":"a"},{"text":"set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/p ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-116","text":"(105) ","element":"a"},{"text":"as well as 
","element":"span"},{"style":{"height":22.02},"width":501.57,"height":55.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-4.png","element":"img","alt":" J1/p = S2 + 32N2(ℓ2R2 +","inline":true},{"style":{"height":17.6},"width":146.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-5.png","element":"img","alt":"2ℓ/p)/p","inline":true},{"text":". Note finally that the condition ","element":"span"},{"style":{"height":19.13},"width":384.78,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-6.png","element":"img","alt":" η < 1/(c2ℓ) = p2/ℓ","inline":true,"padRight":true},{"text":"is satisfied, for example, if","element":"span"}],[{"style":{"height":19.13},"width":332.75,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-7.png","element":"img","alt":"p > Mℓ/(NS2T).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-25","style":{"fontStyle":"italic"},"text":"7 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"(b): ","element":"span"},{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"and denote ","element":"span"},{"href":"#id-115","style":{"height":18},"width":343.23,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-8.png","element":"img","alt":" J1 = J in (103).","inline":true,"padRight":true},{"text":"We can also set ","element":"span"},{"style":{"height":20.33},"width":282.26,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-9.png","element":"img","alt":"α{t} = 1/(ℓ√t)","inline":true},{"text":". 
It is easily verified that for ","element":"span"},{"style":{"height":14.4},"width":123.56,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-10.png","element":"img","alt":" T ≥ 4:","inline":true}],[{"id":"id-117","style":{"width":"73%"},"width":1278,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-11.png","element":"img"}],[{"text":"Substituting these bounds in ","element":"span"},{"href":"#id-115","text":"(103) ","element":"a"},{"text":"yields the result.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.1 Proof of Lemma ","element":"span"},{"href":"#id-111","style":{"fontWeight":"bold"},"text":"24","element":"a"}],[{"text":"Noting that we have temporarily the definition ","element":"span"},{"style":{"height":17.6},"width":571.98,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-12.png","element":"img","alt":" ∇D(w) = E[cf ⊙ ∇U(w ⊙ cf)]","inline":true,"padRight":true},{"text":"we can write","element":"span"}],[{"style":{"height":32.4},"width":1733.03,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-13.png","element":"img","alt":"E�∥∇D(w) − cf ⊙ ∇U(w ⊙ cf)∥22�= Ef1�∥Ef2[cf2 ⊙ ∇U(w ⊙ cf2) − cf1 ⊙ ∇U(w ⊙ cf1)]∥22�","inline":true}],[{"style":{"width":"98%"},"width":1710,"height":400,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-14.png","element":"img"}],[{"text":"where (i) we have used Jensen’s inequality for a vector-valued random variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":", namely ","element":"span"},{"style":{"height":17.6},"width":343.86,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-15.png","element":"img","alt":"∥E[v]∥2 ≤ E[∥v∥2]","inline":true},{"text":", and (ii) the subadditivity of the norm 
","element":"span"},{"style":{"height":17.6},"width":604.98,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-16.png","element":"img","alt":" ∥a + b∥2 ≤ ∥a∥2 + ∥b∥2 for any","inline":true},{"style":{"height":18.33},"width":176.04,"height":45.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-17.png","element":"img","alt":"a, b ∈ RN","inline":true},{"text":". We now note that","element":"span"}],[{"style":{"width":"98%"},"width":1702,"height":293,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-18.png","element":"img"}],[{"text":"where we have used (i) the Lipschitzness assumption of ","element":"span"},{"style":{"height":12.8},"width":66.36,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-19.png","element":"img","alt":" ∇U","inline":true,"padRight":true},{"text":"from (Q3), and (ii) the facts that ","element":"span"},{"style":{"height":19.41},"width":564.72,"height":48.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-20.png","element":"img","alt":" ∥w∥22 < R2 and ∥f2∥22 = ∥f2∥1","inline":true},{"text":". The latter is true because for any vector ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"with entries ","element":"span"},{"style":{"height":19.41},"width":441.85,"height":48.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-21.png","element":"img","alt":"{−1, 0, 1}, ∥f∥22 = ∥f∥1","inline":true},{"text":". 
We can reason similarly with ","element":"span"},{"style":{"height":16.4},"width":145.92,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-22.png","element":"img","alt":" f1 − f2.","inline":true}],[{"text":"Using (Q2) we can also bound","element":"span"}],[{"style":{"width":"74%"},"width":1296,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-23.png","element":"img"}],[{"text":"Hence, we have in ","element":"span"},{"href":"#id-117","text":"(107) ","element":"a"},{"text":"that","element":"span"}],[{"style":{"width":"89%"},"width":1544,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/40-24.png","element":"img"}],[{"id":"id-118","style":{"width":"64%"},"width":1121,"height":342,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-0.png","element":"img"}],[{"text":"where (i) for a random variable ","element":"span"},{"style":{"height":23.8},"width":904.7,"height":59.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-1.png","element":"img","alt":" v we have E[v]2 ≤ E[v2] and (ii) ∥f2∥1/21 ≤ ∥f2∥1","inline":true,"padRight":true},{"text":"since either ","element":"span"},{"style":{"height":17.6},"width":426.61,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-2.png","element":"img","alt":"∥f2∥1 = 0 or ∥f2∥1 ≥ 1","inline":true},{"text":". We can now add an expectation term in the norm ","element":"span"},{"style":{"height":17.6},"width":339.96,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-3.png","element":"img","alt":" ∥f2−f1∥1 ≤ ∥f2−","inline":true},{"style":{"height":17.6},"width":1727.18,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-4.png","element":"img","alt":"E[f2]∥1+∥f1−E[f1]∥1 and ∥f2∥1 ≤ ∥f2−E[f2]∥1+∥E[f2]∥1. Here, ∥E[f2]∥1 = ∥E[f1]∥1 = pN","inline":true,"padRight":true},{"text":"by (Q4). 
Hence, from ","element":"span"},{"href":"#id-118","text":"(110) ","element":"a"},{"text":"onward we can write","element":"span"}],[{"style":{"width":"98%"},"width":1703,"height":928,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-5.png","element":"img"}],[{"text":"where we have used (i) Lemma ","element":"span"},{"href":"#id-119","text":"22(","element":"a"},{"text":"i), (ii) the independence of ","element":"span"},{"style":{"height":16.4},"width":190.22,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-6.png","element":"img","alt":" f1 from f2","inline":true,"padRight":true},{"text":"and Lemma ","element":"span"},{"href":"#id-119","text":"22(","element":"a"},{"text":"i) again, (iii) Lemma ","element":"span"},{"href":"#id-119","text":"22(","element":"a"},{"text":"ii), and (iv) the bounds ","element":"span"},{"style":{"height":17.6},"width":644.22,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-7.png","element":"img","alt":" 1 + pN < 2N and p(1 − p) ≤ 1/4.","inline":true}]]},{"heading":"Appendix F. Path representation of D(W) – Proofs of Lemma 10 and Corollary 11","paragraphs":[[{"style":{"fontStyle":"italic"},"text":"Proof of ","element":"span"},{"href":"#id-120","text":"(31)","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"Recall that ","element":"span"},{"style":{"height":17.6},"width":256.33,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-8.png","element":"img","alt":" GF = (EF , V)","inline":true,"padRight":true},{"text":"is a random subgraph of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"= (","element":"span"},{"style":{"fontStyle":"italic"},"text":"E","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V","element":"span"},{"text":") ","element":"span"},{"text":"with edge set ","element":"span"},{"style":{"height":17.6},"width":430.49,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-9.png","element":"img","alt":"EF = {e ∈ E : Fe = 1}","inline":true},{"text":". By (i) the law of total expectation, and by (ii) the independence of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"and ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X, Y ","element":"span"},{"text":")","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"70%"},"width":1217,"height":311,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/41-10.png","element":"img"}],[{"id":"id-121","style":{"width":"77%"},"width":1340,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/42-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of ","element":"span"},{"href":"#id-120","text":"(32)","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"Expand ","element":"span"},{"href":"#id-121","text":"(112) ","element":"a"},{"text":"to find","element":"span"}],[{"style":{"width":"99%"},"width":1722,"height":816,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/42-1.png","element":"img"}],[{"text":"after rearranging terms. ","element":"span"},{"text":"This completes Lemma ","element":"span"},{"href":"#id-71","text":"10’","element":"a"},{"text":"s proof after identifying ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"here as the left and right sum, respectively. To prove Corollary ","element":"span"},{"href":"#id-72","text":"11, ","element":"a"},{"text":"note that for an arborescence ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") = 0","element":"span"},{"text":", so we can write","element":"span"}],[{"style":{"width":"89%"},"width":1548,"height":353,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/42-2.png","element":"img"}],[{"text":"Here, (iii) follows because ","element":"span"},{"style":{"height":18.62},"width":714.02,"height":46.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/42-3.png","element":"img","alt":" I(W) ≥ 0 and I(W) = 0 at zγ = Pγ","inline":true},{"text":", what remains must be the optimum. 
This completes the proofs of Lemma ","element":"span"},{"href":"#id-71","text":"10 ","element":"a"},{"text":"and Corollary ","element":"span"},{"href":"#id-72","text":"11.","element":"a"}]]},{"heading":"Appendix G. Conserved quantities – Proof of Lemma 12","paragraphs":[[{"text":"For any edge ","element":"span"},{"style":{"height":16.4},"width":118.33,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/42-4.png","element":"img","alt":" f ∈ E,","inline":true}],[{"style":{"width":"84%"},"width":1470,"height":278,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/42-5.png","element":"img"}],[{"text":"Note that ","element":"span"},{"style":{"height":19.53},"width":271.15,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-0.png","element":"img","alt":" Γ(g; l) = Γl(g)","inline":true,"padRight":true},{"text":"for any leaf ","element":"span"},{"style":{"height":17.6},"width":367.13,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-1.png","element":"img","alt":" l ∈ L(G) and g ∈ G","inline":true},{"text":", and therefore in particular","element":"span"}],[{"style":{"width":"81%"},"width":1410,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-2.png","element":"img"}],[{"text":"Recall that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":") ","element":"span"},{"text":"is the set of leaves of the subtree of the base graph ","element":"span"},{"style":{"height":16.4},"width":347.89,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-3.png","element":"img","alt":" G rooted at f ∈ E.","inline":true,"padRight":true},{"text":"By the fact that 
","element":"span"},{"style":{"height":21.89},"width":323.29,"height":54.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-4.png","element":"img","alt":" {Γl(g; f)}l∈L(G;f)","inline":true,"padRight":true},{"text":"partitions ","element":"span"},{"style":{"height":17.6},"width":500.8,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-5.png","element":"img","alt":" Γ(g; f) for any g ∈ G, viz.,","inline":true}],[{"style":{"width":"93%"},"width":1618,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-6.png","element":"img"}],[{"text":"it follows that","element":"span"}],[{"style":{"width":"64%"},"width":1119,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-7.png","element":"img"}],[{"text":"Note in fact that this proof works for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"any ","element":"span"},{"text":"base graph ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"that has no cycles and only length-","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"paths, so not just an arborescence. This is why we make Assumption (N6’) as opposed to the stronger Assumption (N6) in Corollary ","element":"span"},{"href":"#id-72","text":"11.","element":"a"}]]},{"heading":"Appendix H. 
Proof of Proposition 13","paragraphs":[[{"text":"The proof of Proposition ","element":"span"},{"href":"#id-67","text":"13 ","element":"a"},{"text":"is by double induction on the statements ","element":"span"},{"style":{"height":20.33},"width":376.51,"height":50.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-8.png","element":"img","alt":" A(t) ≡ {I(W {s}) ≤","inline":true},{"style":{"height":20.33},"width":1727.68,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-9.png","element":"img","alt":"I(W {s−1})e−2νminκα, ∀s ∈ [t]} and B(t) ≡ {W {s} ∈ K, ∀s ∈ [t]} where κ > 0 is a free","inline":true,"padRight":true},{"text":"parameter and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"is a compact set which we will define. Concretely, we prove that there exist ","element":"span"},{"style":{"height":17.6},"width":1337.27,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-10.png","element":"img","alt":" α and κ such that A(t)∩B(t) ⇒ B(t+1) and A(t)∩B(t+1) ⇒ A(t+1)","inline":true},{"text":". Appendix ","element":"span"},{"href":"#id-122","text":"H.4 ","element":"a"},{"text":"describes in detail how the upcoming Lemmas ","element":"span"},{"href":"#id-123","text":"26–","element":"a"},{"href":"#id-124","text":"28 ","element":"a"},{"text":"provide sufficient conditions for the induction step. There we also maximize the upper bound on the convergence rate over ","element":"span"},{"style":{"height":11.2},"width":37.14,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-11.png","element":"img","alt":" κ,","inline":true,"padRight":true},{"text":"which gives the rate in ","element":"span"},{"href":"#id-125","text":"(41)","element":"a"},{"text":".","element":"span"}],[{"text":"We start by giving Lemmas ","element":"span"},{"href":"#id-123","text":"26–","element":"a"},{"href":"#id-124","text":"28. 
","element":"a"},{"text":"Recall first the definition of the set ","element":"span"},{"href":"#id-126","style":{"height":18},"width":282.93,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-12.png","element":"img","alt":" B(ϵ, I) in (39).","inline":true}],[{"text":"Here, with a minor abuse of notation, we define also","element":"span"}],[{"style":{"width":"86%"},"width":1502,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-13.png","element":"img"}],[{"id":"id-123","text":"where ","element":"span"},{"style":{"height":19.53},"width":582.9,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-14.png","element":"img","alt":" {γl} ≜ Γl(G) for l ∈ L(G) if G","inline":true,"padRight":true},{"text":"is an arborescence.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 26 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (N2) from Proposition ","element":"span"},{"href":"#id-22","style":{"fontStyle":"italic"},"text":"6 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and (N6) from Corollary ","element":"span"},{"href":"#id-72","style":{"fontStyle":"italic"},"text":"11. 
","element":"a"},{"style":{"fontStyle":"italic"},"text":"Then:","element":"span"}],[{"style":{"width":"97%"},"width":1693,"height":164,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-15.png","element":"img"}],[{"text":"Lemma ","element":"span"},{"href":"#id-123","text":"26 ","element":"a"},{"text":"implies that ","element":"span"},{"style":{"height":17.6},"width":128.97,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-16.png","element":"img","alt":" B(ϵ, I)","inline":true,"padRight":true},{"text":"is compact and that ","element":"span"},{"style":{"height":17.6},"width":200.91,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-17.png","element":"img","alt":" D(W) is β","inline":true},{"text":"-smooth on the compact set ","element":"span"},{"style":{"height":17.6},"width":399.39,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-18.png","element":"img","alt":"K = S ∩ B(ϵ, I), i.e.,","inline":true}],[{"style":{"width":"79%"},"width":1374,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-19.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":15.2},"width":211.18,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-20.png","element":"img","alt":" W, W ′ ∈ K","inline":true},{"text":". Its proof is deferred to Appendix ","element":"span"},{"href":"#id-127","text":"H.1. 
","element":"a"},{"text":"Next, Lemma ","element":"span"},{"href":"#id-128","text":"27 ","element":"a"},{"text":"gives a lower bound on the curvature of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":") ","element":"span"},{"text":"on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"in the direction of ","element":"span"},{"id":"id-128","style":{"height":17.6},"width":152.48,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/43-21.png","element":"img","alt":"∇D(W)","inline":true},{"text":", in the form of a ","element":"span"},{"text":"PL-","element":"span"},{"text":"inequality (","element":"span"},{"href":"#id-70","referenceIndex":18,"text":"Karimi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-70","referenceIndex":18,"text":"2016","element":"a"},{"text":"). Its proof is in Appendix ","element":"span"},{"href":"#id-129","text":"H.2.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 27 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (N2) from Proposition ","element":"span"},{"href":"#id-22","style":{"fontStyle":"italic"},"text":"6 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and (N6) from Corollary ","element":"span"},{"href":"#id-72","style":{"fontStyle":"italic"},"text":"11. 
","element":"a"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":16.73},"width":221.78,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-0.png","element":"img","alt":" W {t} ∈ S ∩","inline":true}],[{"style":{"width":"99%"},"width":1723,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-1.png","element":"img"}],[{"text":"Lemma ","element":"span"},{"href":"#id-124","text":"28 ","element":"a"},{"text":"proves that the conserved quantities of the gradient flow remain bounded under the ","element":"span"},{"text":"GD ","element":"span"},{"text":"algorithm in ","element":"span"},{"href":"#id-64","text":"(27)","element":"a"},{"text":". This lemma allows us to keep track of the iterates in the compact set ","element":"span"},{"style":{"height":17.6},"width":317.26,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-2.png","element":"img","alt":" K = S ∩ B(ϵ, I)","inline":true,"padRight":true},{"text":"by relating them to conserved quantities and exploiting the fact that under ","element":"span"},{"text":"GD ","element":"span"},{"style":{"height":26.84},"width":606.3,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-3.png","element":"img","alt":" |C{t+1}f − C{t}f | has order O(α2)","inline":true},{"text":". Appendix ","element":"span"},{"href":"#id-129","text":"H.2 ","element":"a"},{"text":"contains its proof.","element":"span"}],[{"id":"id-124","style":{"fontWeight":"bold"},"text":"Lemma 28 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (N2) from Proposition ","element":"span"},{"href":"#id-22","style":{"fontStyle":"italic"},"text":"6 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and (N6) from Corollary ","element":"span"},{"href":"#id-72","style":{"fontStyle":"italic"},"text":"11. 
","element":"a"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":19.13},"width":275.66,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-4.png","element":"img","alt":" W {t} ∈ S, and","inline":true},{"style":{"height":26.84},"width":1727.88,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-5.png","element":"img","alt":"C{t}f > 0 for all f ∈ E\\L(G), then 4α2 ∥ν∥1 M2(L−1)�D(W {t})−D(W opt)�≥ |C{t+1}f −C{t}f |.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"A note on the exchange of derivative and expectation in this section. ","element":"span"},{"text":"Whenever we make both Assumption (N2) in Proposition ","element":"span"},{"href":"#id-22","text":"6 ","element":"a"},{"text":"and (N7) in Lemma ","element":"span"},{"href":"#id-71","text":"10, ","element":"a"},{"text":"the exchange of derivative and expectation is warranted. This occurs several times throughout this section. We refer to the proof of Lemma ","element":"span"},{"href":"#id-58","text":"18 ","element":"a"},{"text":"for the details.","element":"span"}],[{"id":"id-127","style":{"fontWeight":"bold"},"text":"H.1 Compactness, and smoothness – Proof of Lemma ","element":"span"},{"href":"#id-123","style":{"fontWeight":"bold"},"text":"26","element":"a"}],[{"text":"In the proof of Lemma ","element":"span"},{"href":"#id-123","text":"26, ","element":"a"},{"text":"we will upper bound the operator norm of the Hessian. Recall that for a symmetric bilinear matrix ","element":"span"},{"style":{"height":23.31},"width":558.73,"height":58.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-6.png","element":"img","alt":" A, ∥A∥op ≜ sup∥v∥2=1 |vT Av|.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of (i)","element":"span"},{"text":". 
By continuity of the conditions in ","element":"span"},{"href":"#id-126","text":"(39)","element":"a"},{"text":", the set ","element":"span"},{"style":{"height":19.95},"width":308.2,"height":49.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-7.png","element":"img","alt":" B(ϵ, {Cf}f∈E\\L)","inline":true,"padRight":true},{"text":"is closed. We need to prove boundedness. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":19.95},"width":428.77,"height":49.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-8.png","element":"img","alt":" W ∈ B(ϵ, {Cf}f∈E\\L)","inline":true},{"text":", and suppose w.l.o.g. ","element":"span"},{"text":"that for some ","element":"span"},{"style":{"height":19.95},"width":1353.37,"height":49.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-9.png","element":"img","alt":" f∗ ∈ E\\L we have |Wf∗| > Q, where Q > maxj∈E\\L,γ∈Γ(G){|Cj| , |zγ|}","inline":true},{"text":". We want to find a path ","element":"span"},{"style":{"height":18.62},"width":448.77,"height":46.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-10.png","element":"img","alt":" γ ∈ Γ(G) such that Pγ","inline":true,"padRight":true},{"text":"is large for a contradiction with the assumption that ","element":"span"},{"style":{"height":17.6},"width":190.9,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-11.png","element":"img","alt":"I(W) ≤ ϵ","inline":true},{"text":". 
By ","element":"span"},{"href":"#id-74","text":"(35)","element":"a"},{"text":", we have the inequality ","element":"span"},{"style":{"height":23.09},"width":545.53,"height":57.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-12.png","element":"img","alt":"�l∈L(G;f∗) W 2l > Q2 − |Cf∗|","inline":true,"padRight":true},{"text":"so that for some ","element":"span"},{"style":{"height":17.6},"width":253.44,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-13.png","element":"img","alt":"l∗ ∈ L(G; f∗)","inline":true,"padRight":true},{"text":"we must have ","element":"span"},{"style":{"height":20.05},"width":567.68,"height":50.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-14.png","element":"img","alt":" W 2l∗ > (Q − |Cf∗|)/ |L(G; f∗)|","inline":true},{"text":". Consequently, we have by ","element":"span"},{"href":"#id-74","text":"(35) ","element":"a"},{"text":"that ","element":"span"},{"style":{"height":21.69},"width":710.99,"height":54.22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-15.png","element":"img","alt":" |We|2 > (Q2 −|Cf∗|)/|L(G; f∗)|−|Ce|","inline":true,"padRight":true},{"text":"for any edge ","element":"span"},{"style":{"height":13.2},"width":97.65,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-16.png","element":"img","alt":" e ∈ γ","inline":true,"padRight":true},{"text":"in any path ","element":"span"},{"style":{"height":19.64},"width":337.51,"height":49.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-17.png","element":"img","alt":" γ ∈ Γl∗(G) except","inline":true,"padRight":true},{"text":"for the edge ","element":"span"},{"style":{"height":16.4},"width":43.06,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-18.png","element":"img","alt":" f∗ ","inline":true,"padRight":true},{"text":"where we have 
","element":"span"},{"style":{"height":18.44},"width":197.7,"height":46.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-19.png","element":"img","alt":" |Wf∗| > Q","inline":true,"padRight":true},{"text":"by assumption. In particular, we have the bound ","element":"span"},{"style":{"height":17.6},"width":244.3,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-20.png","element":"img","alt":"|We| > O(Q)","inline":true,"padRight":true},{"text":"for any edge ","element":"span"},{"style":{"height":13.2},"width":97.65,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-21.png","element":"img","alt":" e ∈ γ","inline":true,"padRight":true},{"text":"for any path ","element":"span"},{"style":{"height":17.6},"width":238.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-22.png","element":"img","alt":" γ ∈ Γ(G; f∗)","inline":true},{"text":". Therefore if we pick ","element":"span"},{"style":{"height":17.6},"width":238.28,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-23.png","element":"img","alt":" γ ∈ Γ(G; f∗)","inline":true,"padRight":true},{"text":"we have","element":"span"}],[{"style":{"width":"80%"},"width":1387,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-24.png","element":"img"}],[{"text":"for sufficiently large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":", which is a contradiction. We must thus have ","element":"span"},{"style":{"height":18.44},"width":383.51,"height":46.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-25.png","element":"img","alt":" |Wf∗| ≤ Q for some","inline":true},{"style":{"height":16},"width":142.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-26.png","element":"img","alt":"Q < ∞","inline":true},{"text":". 
If on the other hand ","element":"span"},{"style":{"height":17.6},"width":607.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-27.png","element":"img","alt":" |Wl| > Q for some l ∈ L(G; f∗)","inline":true},{"text":", by ","element":"span"},{"href":"#id-74","text":"(35) ","element":"a"},{"text":"we must also have ","element":"span"},{"style":{"height":19.98},"width":548.38,"height":49.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-28.png","element":"img","alt":"(Wf∗)2 > Q2 + Cf∗ > O(Q2)","inline":true,"padRight":true},{"text":"for sufficiently large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":". This case is, thus, the same as before. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proof of (ii)","element":"span"},{"text":". Using a regular upper bound to the entries of ","element":"span"},{"style":{"height":19.13},"width":530.89,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-29.png","element":"img","alt":" ∇2I(W) when W ∈ S will","inline":true,"padRight":true},{"text":"suffice. 
Element-wise, we have","element":"span"}],[{"style":{"width":"95%"},"width":1655,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-30.png","element":"img"}],[{"style":{"height":31.6},"width":1494.89,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-31.png","element":"img","alt":"2 �δ∈Γ(G;i)∩Γ(G;j) νδ�PδWiPδWj − PδWiWj (zγ − Pγ)�, if i ̸= j, Γ(G; i) ∩ Γ(G; j) ̸= ∅,","inline":true}],[{"style":{"height":26.03},"width":1081.7,"height":65.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-32.png","element":"img","alt":"2 �γ∈Γ(G;i) νγ( PγWi )2 if i = j,","inline":true}],[{"style":{"width":"65%"},"width":1141,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/44-33.png","element":"img"}],[{"text":"Hence, noting that since we have ","element":"span"},{"style":{"height":18.44},"width":560.89,"height":46.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-0.png","element":"img","alt":" |Wf| ≤ M for all f ∈ E on S","inline":true},{"text":", we can bound ","element":"span"},{"style":{"height":18.62},"width":206.97,"height":46.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-1.png","element":"img","alt":" |Pγ/Wf| ≤","inline":true},{"style":{"height":20.55},"width":353.53,"height":51.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-2.png","element":"img","alt":"ML−1, |zγ| ≤ ML ","inline":true,"padRight":true},{"text":"and the other terms similarly. We upper bound the number of terms in the sum over ","element":"span"},{"style":{"height":18.62},"width":1042.65,"height":46.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-3.png","element":"img","alt":" Γ(G; i) and Γ(G; i) ∩ Γ(G; j) by |Γ(G)| and νγ ≤ νmax","inline":true},{"text":". 
Adding all terms, we obtain that ","element":"span"},{"style":{"height":20.33},"width":399.23,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-4.png","element":"img","alt":" 6νmax |Γ(G)| M2(L−1) ","inline":true,"padRight":true},{"text":"is an upper bound for each of the entries of ","element":"span"},{"style":{"height":19.13},"width":175.49,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-5.png","element":"img","alt":" ∇2I(W).","inline":true,"padRight":true},{"text":"This gives an upper bound ","element":"span"},{"style":{"height":20.95},"width":862.43,"height":52.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-6.png","element":"img","alt":" ∥∇2I(W)∥op ≤ 6|E|νmax |Γ(G)| M2(L−1) in S.","inline":true}],[{"id":"id-129","style":{"fontWeight":"bold"},"text":"H.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"PL-","element":"span"},{"style":{"fontWeight":"bold"},"text":"inequality on a compact set – Proof of Lemma ","element":"span"},{"href":"#id-128","style":{"fontWeight":"bold"},"text":"27","element":"a"}],[{"text":"Recall the definition of a ","element":"span"},{"text":"PL-","element":"span"},{"text":"inequality:","element":"span"}],[{"style":{"height":19.13},"width":979.48,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-7.png","element":"img","alt":"Definition 29 Let u ∈ C2(K, R) where K ⊂ Rn ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is compact and ","element":"span"},{"style":{"height":18},"width":412.56,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-8.png","element":"img","alt":" K\\∂K ̸= ∅. 
Denote","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"by ","element":"span"},{"style":{"height":17.6},"width":357.97,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-9.png","element":"img","alt":" u∗ = minx∈K u(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and suppose that ","element":"span"},{"style":{"height":17.6},"width":242.39,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-10.png","element":"img","alt":" u∗ ∈ K\\∂K","inline":true},{"style":{"fontStyle":"italic"},"text":". We say that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfies a ","element":"span"},{"text":"Polyak– ","element":"span"},{"text":"Łojasiewicz (PL) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"inequality if there exist a ","element":"span"},{"style":{"height":14.3},"width":132.07,"height":35.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-11.png","element":"img","alt":" τK > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"depending only on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that","element":"span"}],[{"id":"id-130","style":{"width":"73%"},"width":1275,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-12.png","element":"img"}],[{"text":"A ","element":"span"},{"text":"PL-","element":"span"},{"text":"inequality together with ","element":"span"},{"style":{"height":16.4},"width":26,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-13.png","element":"img","alt":" β","inline":true},{"text":"-smoothness on a compact set will imply that 
","element":"span"},{"style":{"height":20.34},"width":201.45,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-14.png","element":"img","alt":" D(W {t})−","inline":true},{"style":{"height":18.73},"width":167.02,"height":46.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-15.png","element":"img","alt":"D(W opt)","inline":true,"padRight":true},{"text":"decreases. To see this, note that by (i) ","element":"span"},{"style":{"height":16.4},"width":26,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-16.png","element":"img","alt":" β","inline":true},{"text":"-smoothness, and (ii) the update rule","element":"span"}],[{"style":{"width":"92%"},"width":1592,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-17.png","element":"img"}],[{"text":"If furthermore ","element":"span"},{"style":{"height":17.6},"width":232.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-18.png","element":"img","alt":" α ≤ 1/(2β)","inline":true},{"text":", then also ","element":"span"},{"style":{"height":17.6},"width":327.86,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-19.png","element":"img","alt":" βα − 1 ≤ −1/2.","inline":true,"padRight":true},{"text":"Together with ","element":"span"},{"href":"#id-130","text":"(125)","element":"a"},{"text":", and after rearranging terms, one finds that","element":"span"}],[{"id":"id-144","style":{"width":"91%"},"width":1579,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-20.png","element":"img"}],[{"text":"By (iii) ","element":"span"},{"style":{"height":14.8},"width":473.3,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-21.png","element":"img","alt":" 1 + x ≤ ex for all x ∈ R","inline":true},{"text":", we obtain ","element":"span"},{"href":"#id-131","text":"(29)","element":"a"},{"text":". 
The strategy will now be to prove that there is a ","element":"span"},{"text":"PL-","element":"span"},{"text":"inequality in some compact set, that the iterates remain in that compact set, and that the function is ","element":"span"},{"style":{"height":16.4},"width":192.23,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-22.png","element":"img","alt":" β-smooth.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-128","style":{"fontStyle":"italic"},"text":"27. ","element":"a"},{"text":"First note that if ","element":"span"},{"style":{"height":17.6},"width":465.69,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-23.png","element":"img","alt":" l ∈ L(G) and γ ∈ Γ(G; l)","inline":true},{"text":", the indexes of the weights in the product ","element":"span"},{"style":{"height":24.44},"width":223.48,"height":61.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-24.png","element":"img","alt":" |P {t}γ /W {t}l |","inline":true,"padRight":true},{"text":"belong to the index set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E\\L","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":")","element":"span"},{"text":". 
The proof follows (i) by restricting the sum, and (ii) from the fact that for every path ","element":"span"},{"style":{"height":17.6},"width":173.9,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-25.png","element":"img","alt":" γ ∈ Γ(G)","inline":true,"padRight":true},{"text":"in an arborescence ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":", there is exactly one leaf ","element":"span"},{"style":{"height":19.53},"width":619.22,"height":48.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-26.png","element":"img","alt":" l ∈ L(G) such that γl = γ. Thus","inline":true}],[{"style":{"width":"98%"},"width":1702,"height":311,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-27.png","element":"img"}],[{"text":"where in (iii) we have used the bound ","element":"span"},{"style":{"height":25.55},"width":986.65,"height":63.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-28.png","element":"img","alt":" |W {t}i | ≥ mine∈E\\L(G) |W {t}e | for all i ∈ E\\L(G) and","inline":true,"padRight":true},{"text":"similarly with ","element":"span"},{"style":{"height":18.62},"width":430.42,"height":46.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-29.png","element":"img","alt":" νγ ≥ νmin for γ ∈ Γ(G)","inline":true},{"text":". 
Finally, by ","element":"span"},{"href":"#id-74","text":"(35)","element":"a"},{"text":", we have ","element":"span"},{"style":{"height":25.55},"width":532.26,"height":63.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/45-30.png","element":"img","alt":" mine∈E\\L(G) |W {t}e |2 ≥ C{t}min.","inline":true,"padRight":true},{"text":"This completes the proof.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"H.3 Conserved quantities remain bounded throughout ","element":"span"},{"style":{"fontWeight":"bold"},"text":"GD ","element":"span"},{"style":{"fontWeight":"bold"},"text":"– Proof of Lemma ","element":"span"},{"href":"#id-124","style":{"fontWeight":"bold"},"text":"28","element":"a"}],[{"style":{"height":17.6},"width":475.56,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-0.png","element":"img","alt":"Proof Pick f ∈ E\\L(G)","inline":true},{"text":". By (i) Corollary ","element":"span"},{"href":"#id-72","text":"11, ","element":"a"},{"text":"and (ii) Lemma ","element":"span"},{"href":"#id-73","text":"12, ","element":"a"},{"text":"we have","element":"span"}],[{"id":"id-132","style":{"width":"97%"},"width":1692,"height":883,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-1.png","element":"img"}],[{"id":"id-133","text":"By Cauchy–Schwarz we also have","element":"span"}],[{"style":{"width":"95%"},"width":1656,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-2.png","element":"img"}],[{"text":"If we have ","element":"span"},{"style":{"height":26.84},"width":1028.64,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-3.png","element":"img","alt":" C{t}f > 0, then (W {t}f )2 > (W {t}γL )2 for any γ ∈ Γ(G; f)","inline":true},{"text":". 
Thus, combining the ","element":"span"},{"text":"estimate ","element":"span"},{"href":"#id-132","text":"(129) ","element":"a"},{"text":"with ","element":"span"},{"href":"#id-133","text":"(131) ","element":"a"},{"text":"we obtain","element":"span"}],[{"id":"id-134","style":{"width":"87%"},"width":1517,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-4.png","element":"img"}],[{"text":"Extending the sums in ","element":"span"},{"href":"#id-134","text":"(132) ","element":"a"},{"text":"from ","element":"span"},{"style":{"height":17.6},"width":816.4,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-5.png","element":"img","alt":" Γ(G; f) to Γ(G) and from L(G; f) to L(G)","inline":true},{"text":", respectively, ","element":"span"},{"id":"id-137","text":"yields","element":"span"}],[{"style":{"width":"80%"},"width":1390,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-6.png","element":"img"}],[{"text":"where we have used the bound ","element":"span"},{"style":{"height":19.95},"width":829.92,"height":49.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-7.png","element":"img","alt":" |Wf| ≤ maxe∈E\\L(G) |We| for all f ∈ E\\L(G)","inline":true},{"text":". 
Similarly, using ","element":"span"},{"href":"#id-132","text":"(130) ","element":"a"},{"text":"and the trivial bound ","element":"span"},{"style":{"height":18.62},"width":484.11,"height":46.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-8.png","element":"img","alt":" νγ ≤ ∥ν∥1 for any γ ∈ Γ","inline":true},{"text":", and by absorbing one ","element":"span"},{"style":{"height":17.42},"width":237.63,"height":43.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-9.png","element":"img","alt":" νγ-term into","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":")","element":"span"},{"text":"’s expression, we obtain","element":"span"}],[{"id":"id-139","style":{"width":"80%"},"width":1384,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-10.png","element":"img"}],[{"text":"for the lower bound. Because ","element":"span"},{"style":{"height":16.73},"width":184.64,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-11.png","element":"img","alt":" W {t} ∈ S","inline":true,"padRight":true},{"text":"by assumption, ","element":"span"},{"style":{"height":25.55},"width":634.05,"height":63.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/46-12.png","element":"img","alt":" maxe∈E\\L(G) |W {t}e |2 ≤ M2. This","inline":true,"padRight":true},{"text":"completes the proof.","element":"span"}],[{"id":"id-122","style":{"fontWeight":"bold"},"text":"H.4 Double induction","element":"span"}],[{"text":"We now use Lemmas ","element":"span"},{"href":"#id-123","text":"26–","element":"a"},{"href":"#id-124","text":"28 ","element":"a"},{"text":"together in a double induction to finally prove Proposition ","element":"span"},{"href":"#id-67","text":"13. 
","element":"a"},{"text":"Let ","element":"span"},{"style":{"height":12.4},"width":105.32,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-0.png","element":"img","alt":" κ > 0","inline":true,"padRight":true},{"text":"and denote the statements:","element":"span"}],[{"id":"id-136","style":{"width":"77%"},"width":1334,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-1.png","element":"img"}],[{"text":"We will prove that there exists a ","element":"span"},{"style":{"height":12.4},"width":105.32,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-2.png","element":"img","alt":" κ > 0","inline":true,"padRight":true},{"text":"such that when choosing ","element":"span"},{"style":{"height":8.4},"width":28,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-3.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"appropriately, firstly","element":"span"}],[{"id":"id-135","style":{"width":"62%"},"width":1087,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-4.png","element":"img"}],[{"text":"and secondly,","element":"span"}],[{"style":{"width":"64%"},"width":1123,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-5.png","element":"img"}],[{"style":{"height":17.6},"width":637.65,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-6.png","element":"img","alt":"Step 1: A(t) ∩ B(t) ⇒ B(t + 1).","inline":true,"padRight":true},{"text":"We need to prove that ","element":"span"},{"style":{"height":20.33},"width":606.71,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-7.png","element":"img","alt":" W {t+1} ∈ B(ϵ, I) ∩ S assuming","inline":true,"padRight":true},{"href":"#id-135","text":"(135) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-136","text":"(136)","element":"a"},{"text":". 
","element":"span"},{"text":"Using ","element":"span"},{"href":"#id-137","text":"(133) ","element":"a"},{"text":"from the proof of Lemma ","element":"span"},{"href":"#id-124","text":"28 ","element":"a"},{"text":"repeatedly with the bound ","element":"span"},{"style":{"height":23.2},"width":373.2,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-8.png","element":"img","alt":"maxe∈E |W {t}e | ≤ M","inline":true},{"text":", we obtain","element":"span"}],[{"id":"id-140","style":{"width":"75%"},"width":1310,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-9.png","element":"img"}],[{"text":"By ","element":"span"},{"href":"#id-135","text":"(135)","element":"a"},{"text":", we can upper bound","element":"span"}],[{"style":{"width":"92%"},"width":1592,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-10.png","element":"img"}],[{"text":"If furthermore (C1) ","element":"span"},{"style":{"height":14.22},"width":344.48,"height":35.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-11.png","element":"img","alt":" 0 < 2νminκα < 1","inline":true},{"text":", then (i) the inequality ","element":"span"},{"style":{"height":17.6},"width":501.6,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-12.png","element":"img","alt":" 1/(1 − exp(−2νminκα)) <","inline":true},{"style":{"height":17.6},"width":210.82,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-13.png","element":"img","alt":"1/(νminκα)","inline":true,"padRight":true},{"text":"holds, so that","element":"span"}],[{"id":"id-141","style":{"width":"97%"},"width":1681,"height":175,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-14.png","element":"img"}],[{"text":"In the same manner, we can also prove ","element":"span"},{"href":"#id-138","text":"(141) ","element":"a"},{"text":"for 
","element":"span"},{"style":{"height":26.84},"width":388.52,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-15.png","element":"img","alt":" C{0}f instead of C{0}min","inline":true},{"text":". This yields","element":"span"}],[{"id":"id-138","style":{"width":"73%"},"width":1263,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-16.png","element":"img"}],[{"text":"for any ","element":"span"},{"style":{"height":17.6},"width":238.57,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-17.png","element":"img","alt":" f ∈ E\\L(G)","inline":true},{"text":". Similarly, for a lower bound, we can use ","element":"span"},{"href":"#id-139","text":"(134) ","element":"a"},{"text":"repeatedly together with the bound ","element":"span"},{"href":"#id-140","text":"(140) ","element":"a"},{"text":"and condition (C1) yielding","element":"span"}],[{"id":"id-142","style":{"width":"73%"},"width":1269,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-18.png","element":"img"}],[{"text":"for any ","element":"span"},{"style":{"height":17.6},"width":237.21,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-19.png","element":"img","alt":" f ∈ E\\L(G)","inline":true},{"text":". 
Now, suppose (D1) ","element":"span"},{"style":{"height":24.13},"width":391.78,"height":60.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-20.png","element":"img","alt":" C{0}min − κ1/(L−1) > 0","inline":true,"padRight":true},{"text":"and let (C2) the step size ","element":"span"},{"text":"satisfy","element":"span"}],[{"style":{"width":"68%"},"width":1187,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/47-21.png","element":"img"}],[{"text":"We have (i) by ","element":"span"},{"href":"#id-141","text":"(142) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-142","text":"(143) ","element":"a"},{"text":"that","element":"span"}],[{"style":{"width":"95%"},"width":1650,"height":385,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-0.png","element":"img"}],[{"text":"Then ","element":"span"},{"style":{"height":20.33},"width":321.19,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-1.png","element":"img","alt":" W {t+1} ∈ B(ϵ, I)","inline":true,"padRight":true},{"text":"by ","element":"span"},{"href":"#id-126","text":"(39)","element":"a"},{"text":". Hence, ","element":"span"},{"style":{"height":26.84},"width":242.23,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-2.png","element":"img","alt":" M > W {t+1}f","inline":true}],[{"id":"id-143","style":{"width":"100%"},"width":1734,"height":448,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-3.png","element":"img"}],[{"text":"Suppose now for a moment that (C2) the right-hand side of ","element":"span"},{"href":"#id-143","text":"(146) ","element":"a"},{"text":"is positive for some sufficiently small ","element":"span"},{"style":{"height":8.4},"width":28,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-4.png","element":"img","alt":" α","inline":true},{"text":". 
We could then use the ","element":"span"},{"text":"PL ","element":"span"},{"text":"inequality from Lemma ","element":"span"},{"href":"#id-128","text":"27 ","element":"a"},{"text":"together with ","element":"span"},{"style":{"height":25.55},"width":876.84,"height":63.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-5.png","element":"img","alt":"mine∈E\\L(G) |W {t}e |2(L−1) ≥ (C{t}min)L−1, that is,","inline":true}],[{"id":"id-145","style":{"width":"71%"},"width":1243,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-6.png","element":"img"}],[{"text":"To see how, note that the argumentation around ","element":"span"},{"href":"#id-144","text":"(127) ","element":"a"},{"text":"together with ","element":"span"},{"href":"#id-145","text":"(147) ","element":"a"},{"text":"and (i) the induction hypothesis ","element":"span"},{"style":{"height":20.33},"width":861.58,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-7.png","element":"img","alt":" B(t + 1) we have W {t}, W {t+1} ∈ B(ϵ, I) ∩ S","inline":true,"padRight":true},{"text":"and (ii) the clause (L1)","element":"span"}],[{"id":"id-146","style":{"width":"99%"},"width":1723,"height":416,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-8.png","element":"img"}],[{"text":"where we have also used (iii) the induction hypothesis ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":")","element":"span"},{"text":", i.e., that ","element":"span"},{"style":{"height":20.33},"width":398.66,"height":50.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-9.png","element":"img","alt":" I(W {t}) ≤ I(W {0}) 
·","inline":true},{"style":{"height":15.13},"width":197.3,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-10.png","element":"img","alt":"e−2νminκαt.","inline":true}],[{"text":"We now investigate the exponent in ","element":"span"},{"href":"#id-146","text":"(148) ","element":"a"},{"text":"for a moment. Assuming (C2) and if (C3) the right-hand side of ","element":"span"},{"href":"#id-146","text":"(148) ","element":"a"},{"text":"is furthermore smaller than ","element":"span"},{"style":{"height":20.33},"width":701.74,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-11.png","element":"img","alt":" I(W {0}) exp(−2νminκα(t + 1)), then","inline":true,"padRight":true},{"text":"the induction step would be complete. Note finally that both conditions (C2) and (C3) are satisfied when choosing","element":"span"}],[{"style":{"width":"74%"},"width":1294,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/48-12.png","element":"img"}],[{"text":"or equivalently","element":"span"}],[{"id":"id-147","style":{"width":"68%"},"width":1187,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-0.png","element":"img"}],[{"text":"To also satisfy condition (C1), we thus require that","element":"span"}],[{"style":{"width":"76%"},"width":1327,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Step 3. ","element":"span"},{"text":"Let us summarize. Convergence occurs at rate at most ","element":"span"},{"style":{"height":14.22},"width":154.96,"height":35.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-2.png","element":"img","alt":" 2νminκα","inline":true,"padRight":true},{"text":"if conditions (L1), (D1), (C1)–(C3) hold. 
Hence we have to choose ","element":"span"},{"style":{"height":24.13},"width":712.16,"height":60.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-3.png","element":"img","alt":" κ > 0 such that C{0}min − κL−1 > 0 and","inline":true}],[{"style":{"width":"78%"},"width":1366,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-4.png","element":"img"}],[{"text":"Note that we can maximize the convergence rate ","element":"span"},{"style":{"height":14.22},"width":154.92,"height":35.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-5.png","element":"img","alt":" 2νminακ","inline":true,"padRight":true},{"text":"by maximizing ","element":"span"},{"style":{"height":24.13},"width":198.02,"height":60.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-6.png","element":"img","alt":" κ2(C{0}min −","inline":true},{"style":{"height":20.33},"width":170.67,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-7.png","element":"img","alt":"κ1/(L−1))","inline":true},{"text":", which occurs when ","element":"span"},{"style":{"height":24.13},"width":1147.93,"height":60.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-8.png","element":"img","alt":" κ = (C{0}min)L−1(1 + 1/(2(L − 1)))−(L−1) ≥ e−1/2(C{0}min)L−1.","inline":true,"padRight":true},{"text":"Substituting this in ","element":"span"},{"href":"#id-147","text":"(152) ","element":"a"},{"text":"we require a step size","element":"span"}],[{"style":{"width":"87%"},"width":1517,"height":133,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-9.png","element":"img"}],[{"text":"Finally, we have the bound ","element":"span"},{"style":{"height":20.33},"width":611.09,"height":50.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-10.png","element":"img","alt":" β ≤ 6νmax |E(G)| |Γ(G)| M2(L−1) ","inline":true,"padRight":true},{"text":"from 
Lemma ","element":"span"},{"href":"#id-123","text":"26 ","element":"a"},{"text":"in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":", so that","element":"span"}],[{"style":{"width":"99%"},"width":1713,"height":181,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-11.png","element":"img"}],[{"text":"This completes our proof of Proposition ","element":"span"},{"href":"#id-67","text":"13.","element":"a"}]]},{"heading":"Appendix I. Convergence rate in the case of Dropout and Dropconnect – Proof of Proposition 9","paragraphs":[[{"text":"We consider first the case of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect","element":"span"},{"text":". ","element":"span"},{"text":"We have that ","element":"span"},{"style":{"height":17.6},"width":146.69,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-12.png","element":"img","alt":" {Fe}e∈E","inline":true,"padRight":true},{"text":"are independent and identically distributed ","element":"span"},{"text":"Bernoulli(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":") ","element":"span"},{"text":"random variables. Suppose that the base graph ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"has no cycles and every path is of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":". 
Then by definition in Lemma ","element":"span"},{"href":"#id-71","text":"10, ","element":"a"},{"text":"we have","element":"span"}],[{"id":"id-148","style":{"width":"82%"},"width":1435,"height":242,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-13.png","element":"img"}],[{"text":"where (i) we have used ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect","element":"span"},{"text":"’s distribution on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":".","element":"span"}],[{"text":"Now suppose that additionally we make the stronger assumption that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"is an arborescence. Then by definition in Corollary ","element":"span"},{"href":"#id-72","text":"11 ","element":"a"},{"style":{"height":20.15},"width":253.56,"height":50.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-14.png","element":"img","alt":" νγ = E[X2]ηγ","inline":true},{"text":", and subsequently we can calculate ","element":"span"},{"style":{"height":23.49},"width":1106.26,"height":58.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/49-15.png","element":"img","alt":"∥ν∥1 = E[X2] ∑γ∈Γ(G) ηγ = E[X2] |Γ(G)| pL = E[X2]dLpL.","inline":true}],[{"text":"Now, since by assumption ","element":"span"},{"style":{"height":21.35},"width":1161.9,"height":53.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-0.png","element":"img","alt":" maxγ |zγ| ≤ ML and |Wf| ≤ M for all f ∈ E, then I(W {0}) ≤","inline":true},{"style":{"height":19.53},"width":284.56,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-1.png","element":"img","alt":"O(|Γ(G)| M2L)","inline":true,"padRight":true},{"text":"so that substitution into the definition of 
","element":"span"},{"style":{"height":8.4},"width":28,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-2.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"in Proposition ","element":"span"},{"href":"#id-67","text":"13 ","element":"a"},{"text":"yields","element":"span"}],[{"style":{"width":"59%"},"width":1030,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-3.png","element":"img"}],[{"text":"where we have used that ","element":"span"},{"style":{"height":17.35},"width":211.91,"height":43.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-4.png","element":"img","alt":" Cmin ≤ M2","inline":true},{"text":". Finally multiplying by ","element":"span"},{"style":{"height":8},"width":23,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-5.png","element":"img","alt":" τ","inline":true,"padRight":true},{"text":"gives the rate","element":"span"}],[{"style":{"width":"62%"},"width":1078,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-6.png","element":"img"}],[{"text":"Substituting these results in the rate ","element":"span"},{"style":{"height":8.4},"width":52,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-7.png","element":"img","alt":" τα","inline":true,"padRight":true},{"text":"in Proposition ","element":"span"},{"href":"#id-67","text":"13 ","element":"a"},{"text":"yields the result for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect","element":"span"},{"text":".","element":"span"}],[{"text":"Finally we note that for the case of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout","element":"span"},{"text":", filtering all nodes independently in an arborescence is equivalent to filtering all edges independently except the edge at the root. 
In particular, in ","element":"span"},{"href":"#id-148","text":"(155)","element":"a"},{"text":", we have ","element":"span"},{"style":{"height":19.53},"width":399.43,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-8.png","element":"img","alt":" P[γ ∈ Γ(GF )] = pL−1","inline":true},{"text":". The remaining steps of the proof are then the same as for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropconnect ","element":"span"},{"text":"and comparing ","element":"span"},{"style":{"height":18.73},"width":250.78,"height":46.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-9.png","element":"img","alt":" pL with pL−1 ","inline":true,"padRight":true},{"text":"we can absorb the missing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"factor into the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O ","element":"span"},{"text":"notation, which does not change the order in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":".","element":"span"}]]},{"heading":"Appendix J. 
Inequalities pertaining to the Frobenius norm","paragraphs":[[{"id":"id-96","style":{"fontWeight":"bold"},"text":"Lemma 30 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any matrix ","element":"span"},{"style":{"height":15.54},"width":547.53,"height":38.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-10.png","element":"img","alt":" A ∈ Rm×n and 1 ≤ k < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", it holds that ","element":"span"},{"style":{"height":22.7},"width":333.5,"height":56.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-11.png","element":"img","alt":"∑i,j(1 + A2ij)k ≤","inline":true},{"style":{"height":19.53},"width":306.53,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-12.png","element":"img","alt":"nm(1 + ∥A∥F)2k","inline":true},{"style":{"fontStyle":"italic"},"text":". For any two matrices ","element":"span"},{"style":{"height":16.33},"width":715.57,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-13.png","element":"img","alt":" A ∈ Rm×n, B ∈ Rn×p and 0 ≤ k < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", it holds that ","element":"span"},{"style":{"height":19.53},"width":775.92,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-14.png","element":"img","alt":"(1 + ∥AB∥F)k ≤ (1 + ∥A∥F)k(1 + ∥B∥F)k","inline":true},{"style":{"fontStyle":"italic"},"text":". 
For any two matrices ","element":"span"},{"style":{"height":16.33},"width":249.16,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-15.png","element":"img","alt":" A, B ∈ Rn×m","inline":true},{"style":{"fontStyle":"italic"},"text":", it holds that ","element":"span"},{"style":{"height":18.36},"width":470.99,"height":45.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-16.png","element":"img","alt":"∥A ⊙ B∥F ≤ ∥A∥F ∥B∥F.","inline":true}],[{"style":{"width":"100%"},"width":1732,"height":356,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-17.png","element":"img"}],[{"text":"where (ii) we have used that the function ","element":"span"},{"style":{"height":15.53},"width":40.21,"height":38.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-18.png","element":"img","alt":" zk ","inline":true,"padRight":true},{"text":"is nondecreasing in ","element":"span"},{"style":{"height":14.8},"width":458.55,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-19.png","element":"img","alt":" z ≥ 0 whenever k ≥ 0.","inline":true,"padRight":true},{"text":"Because (iii) for the ","element":"span"},{"style":{"height":15.24},"width":36.18,"height":38.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-20.png","element":"img","alt":" ℓk","inline":true},{"text":"-norm for sequences it holds that ","element":"span"},{"style":{"height":20.05},"width":672.4,"height":50.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-21.png","element":"img","alt":" ∥x∥22k ≤ ∥x∥22 whenever 1 ≤ k < ∞,","inline":true,"padRight":true},{"text":"we obtain","element":"span"}],[{"style":{"width":"79%"},"width":1367,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-22.png","element":"img"}],[{"text":"where (iv) we have used that 
","element":"span"},{"style":{"height":19.53},"width":940.7,"height":48.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-23.png","element":"img","alt":" (1+z2)k ≤ (1+z)2k for all z ≥ 0 whenever k ≥ 0.","inline":true,"padRight":true},{"text":"This proves the first inequality.","element":"span"}],[{"text":"The second inequality is an immediate consequence of the submultiplicativity property of the Frobenius norm and its nonnegativity, i.e.,","element":"span"}],[{"style":{"width":"84%"},"width":1460,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/50-24.png","element":"img"}],[{"text":"Raising both sides to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th power finishes its proof.","element":"span"}],[{"text":"The third inequality follows from nonnegativity of the summands:","element":"span"}],[{"style":{"width":"82%"},"width":1426,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.02247/images/51-0.png","element":"img"}],[{"text":"Each of the inequalities has now been shown.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]