Dropout (Hinton et al., 2012) is a technique to avoid overfitting during training of NNs that consists of temporarily ‘dropping’ nodes of the network independently at each step of SGD. While in the original Dropout algorithm in Hinton et al. (2012) only nodes from the network were dropped, several stochastic training algorithms that avoid overfitting in NNs have appeared since then; for example, Dropconnect (Wan et al., 2013), Cutout (DeVries and Taylor, 2017). Figure 1 depicts a NN where we use Dropconnect and drop individual edges
instead of nodes. In practice, such dropout algorithms consist of multiplying componentwise weight matrices of the NN in each iteration by independently drawn random matrices with {0, 1}-valued entries. The elements of these random matrices indicate whether each individual edge or node is filtered (0) or is not filtered (1) during a training step. The resulting weight matrices are then used in the backpropagation algorithm for computing the gradient of a NN. Mathematically, dropout turns the backpropagation algorithm into a step of a SGD in which the primary source of randomness is the NN’s configuration. Under mild independence assumptions, the loss function of dropout is a risk function averaged over all possible NNs configurations (Baldi and Sadowski, 2013).
Figure 1: (a,b) Dropconnect’s training step (Wan et al., 2013) in a NN with L = 3 layers. In this algorithm, on every iteration, a random NN is first generated by removing each edge with probability independently of all other edges. The output of this random NN is then used to update all weights using the backpropagation algorithm. (c) An example arborescence of depth L = 3.
An interesting aspect of dropout algorithms is that they lie at the intersection of stochastic optimization and percolation theory, which investigates properties related to connectedness of random graphs and deterministic (possibly infinite) graphs in which vertices and edges are deleted at random. In the case of dropout, the output of the filtered NN with temporarily deleted edges is used to update the weights. If dropout filters too many weights, then little information about the input can pass through the network, which will consequently also yield a gradient update for that step that contains little relevant information.
As an example, we may consider again the networks in Figures 1 (a)–(b) when we use Dropconnect, that is, we filter each edge with probability independently of all other edges. We can observe that the number of paths
in Figure 1 (b) that fully transverse the network (
) is much smaller compared to those of Figure 1 (a) (
an L-layer NN with no biases, a path from the input layer to the output goes through L weights that have filters. Then, the probability that a path from input to output stays unfiltered and contributes to a weight update is
. If we now fix one edge in the path, then the probability of updating its corresponding weight through that path in particular is also
. There are, however, many other paths in a NN passing through a single edge. The probability that one of those paths is not filtered will be large and may compensate the exponential factor
. Considering the connection to bond percolation, one may therefore suspect that dropout algorithms may perform worse than a routine implementation of the backpropagation algorithm. However, dropout algorithms usually perform well since they avoid overfitting in NNs (Hinton et al., 2012; Srivastava et al., 2014). From the point of view of bond percolation however, this should still come at the cost of slower convergence of dropout algorithms, and conceivably by as much as a factor
is the number of dropout layers.
Most theoretical focus has been on the generalization properties of NNs trained with dropout algorithms. We can mention Hinton et al. (2012); Baldi and Sadowski (2013); Wager et al. (2013); Srivastava et al. (2014); Baldi and Sadowski (2014); Cavazza et al. (2018); Mianjy et al. (2018); Mianjy and Arora (2019); Pal et al. (2020); Wei et al. (2020), which we briefly review in Section 1.3. In this paper, however, we investigate dropout from the stochastic optimization perspective. That is, we aim to answer if dropout algorithms converge and study the rate at which they converge, which is expected to depend on the dropout probability. Compared to the study of the generalization properties of dropout, this aim has received less attention in the literature. In particular, we can only mention Mianjy and Arora (2020) and Senen-Cerda and Sanders (2022). In Mianjy and Arora (2020), a convergence rate for the test error in a classification setting is obtained when training shallow NNs with dropout. This rate, is, however, independent of the dropout probability. In Senen-Cerda and Sanders (2022), a convergence rate for the empirical risk associated with training shallow linear NNs with dropout is obtained that depends on the dropout probability. Both results refer to shallow NNs where the width of the NN plays a role in the convergence rate. We refer to Section 1.3 below for further details.
From the previous discussion, however, we suspect that there is an effect of dropout in the convergence rate in deep NNs with several layers of dropout. In this paper, we investigate this problem. In particular, we provide convergence guarantees for training NNs that have several layers of dropout and analyze simplified models for deep NNs, for which it is possible to obtain an explicit convergence rate that depends on the dropout probability and depth. We also consider the effect on the sample complexity of using dropout SGD and complement the previous results with simulations on realistic NNs to examine the convergence rate of dropout empirically.
Before introducing the results of the paper we briefly define the fundamental concepts related to training of NNs with dropout that we will use throughout this paper.
1.1 Dropout and SGD
A NN with weights W is typically used to predict output
input
both of which are sampled from some joint distribution. For a given loss
, the risk function of
is usually defined as
where the distribution is usually given by the empirical distribution of a finite number of samples . In this case, the risk is an empirical risk.
Ideally, the NN is operated using weights in the set However, the weights are found in practice by using gradient descent or its stochastic variant SGD, which aims to minimize the risk in (1) by updating the weights in the local direction that minimizes the function. At time t, the weights
of the NN are namely updated by setting
Here, is a stochastic estimate of the gradient of (1) and
is a step size which we will specify later. Let
be the gradient at W of (1). If the input and output samples
are provided at time t, then the update of SGD is given by
As we have mentioned, dropout filters are applied to some of the weights W during training by using matrices of random variables F with {0, 1}-valued entries. Denote by the dropout filters and the samples provided to the SGD algorithm at time t, respectively. Compared to (3), a dropout algorithm defines the estimate of the gradient update as
where denotes the componentwise product.
Note that in (4) the filters appear twice. Firstly, they filter the weights gradient is computed depending only on the subnetwork provided by dropping some edges or nodes. Secondly, they filter the updates in
since only the remaining weights will be updated. We remark that in this general formulation, other distributions for the filters than those for dropout and dropconnect are allowed. For specific examples of distribution of the filter matrices we refer to Section 2.3.
We next present the results of this paper.
1.2 Summary of results
Our first result is a formal probability theoretical proof that for any (fully connected) NN topology and with differentiable polynomially bounded activation functions (see Theorem 5), the iterates of projected SGD with dropout-like filters converge. In particular, a step of projected SGD with dropout is given by
where is the estimate of the gradient with dropout in (4) and
is an operator that projects the iterates onto a compact convex set H (Oymak, 2018). In order to state our first result, we define a
and we will consider -loss. The result is stated informally in the next proposition.
Result 1 (Informal statement of Proposition 6.) Under sufficient regularity of the activation functions, bounded moments and independence of random variables and some assumptions on the boundary H, with update (5), the weights converge to a unique stationary set of a projected system of ODEs
where constraint term, which describes the minimum vector required to keep the gradient flow of
This result provides a formal guarantee with the sufficient conditions for dropout algorithms to be well-behaved and at least asymptotically (meaning after sufficiently many iterations) to not suffer from problems that could have arisen from the relation to bond percolation. Moreover, for a wide range of NNs and activation functions the function D(W) is the expectation of the risk over the dropout’s filters distribution, which in our result is not restricted to dropping nodes and can even be coupled to the data. This result also shows that SGD with dropout converges to the stationary points of D(W). While a guarantee is necessary, a convergence rate would yield more insight into the trade-offs of the algorithm, especially in the dependence on depth.
In our second result, we go one step beyond the convergence guarantee and compute a bound for the sample complexity of SGD with dropout to an -stationary point of a generic smooth nonconvex function
-stationary point of D if
holds. Note that stationary points are not necessarily minima, but the sample complexity, understood as the number of iterations T required to reach
-stationarity, is usually associated with the complexity of the function to be optimized.
For a generic smooth nonconvex function D(W), we consider dropout to be SGD with the update in (4), where filters F are chosen independently at each step and are {0, 1}-valued for each parameter. In our result we assume boundedness and Lipschitzness conditions on D(W). Moreover, under some additional assumptions on the loss function, examples of NNs with sigmoid activation functions are also covered by our result. In this particular case, D(W) = D(W) holds with the definition in (6). For the general case we prove the following:
Result 2 (Informal statement of Proposition 7.) Assume that D(W) has enough regularity and satisfies some boundedness and Lipschitzness assumptions. Let be iterates of (5). For any
there exist
constant such that if p > c/T, then as
Hence, at least T iterations of dropout-like SGD algorithms are required to reach an -stationary point of nonconvex smooth functions in expectation. Here,
are constants depending on the data and function, respectively. Compared to the theoretical optimum rate of
for SGD on nonconvex smooth functions (Drori and Shamir, 2020), this result shows that dropout changes the optimization landscape and approximate stationary points are easier to find depending on the dropout probability. In this setting, we also consider the complexity when we scale the weights by a factor 1/p during training, which is commonly used to compensate the effect of dropout on the convergence rate.
It must be emphasized that Proposition 7 does not assume much structure on the objective function. As consequence, in spite of the fact that the bound in (8) holds in some settings with deep NNs, the depth of such NN would appear only implicitly in the constants . In order to determine the dependence between the convergence rate and the depth of a NN explicitly, one must exploit the specific structure of a NN, which we leverage in our next result.
Our third result in this paper is an explicit upper bound for the rate of convergence of regular GD on the limiting ODEs of dropout algorithms for arborescences (a class of trees, see Figure 1c for an example), of arbitrary depth with linear activation functions In particular, we will consider the update rule
Analyzing the convergence of training algorithms on simplified NNs with linear activation functions is commonly used to gain insight into more complex models, see e.g. (Arora et al., 2019; Shamir, 2019; Bartlett et al., 2018). Even without a dropout algorithm present, this task already provides a substantial theoretical challenge as the optimization landscape is nonconvex. Our choice to restrict the analysis to arborescences allows us to quantitatively tie our upper bound for the convergence rate to the depth and the number of paths within the arborescence. We prove the following:
Result 3 (Informal statement of Proposition 9.) Assume that the base graph G of the NN is an arborescence of depth L with |L(G)| leaves and the filters F follow the distribution prescribed by Dropconnect or Dropout with dropout probability (see Proposition 9). Then there exist
depending on the initialization such that the iterates of (9) satisfy
with
One important consequence of this result is that the convergence rate exponent indeed deteriorates by a factor in these NNs. Finally, we complement this result with numerical experiments. We target the dependency of the convergence on p for more realistic wider and nonlinear networks on commonly used datasets. Perhaps surprisingly, we do not observe an exponential decrease of the convergence rate exponent due to dropout in these simulations. We will offer some heuristic explanation for this result by looking at the update rate of a generic weight.
Our results lead to the following consequences. First, whenever the iterates of a dropout algorithm with -loss are bounded, they are guaranteed to converge to a stationary point of the risk function D(W) induced by the dropout algorithm. Secondly, we prove rigorously that the convergence rate when training with e.g. Dropout or Dropconnect can change the convergence rate on the empirical risk depending on p and in arborescences can decrease by as much as a factor
. For more realistic wider networks, however, we conduct numerical experiments that suggest that the convergence rate is not necessarily affected by depth as much across different dropout rates
in neural networks with just a few layers of dropout.
Our findings motivate further theoretical study of the convergence rate of dropout for deep and wide networks. We suspect that there is a transition regime of the convergence rate. Such transition would affect the dependence on p and would be observed when going from networks with many layers of dropout with small width, where dependence on the rate may be exponential in p, to networks with a few layers of dropout but very wide, where dependence is not exponential anymore.
1.3 Literature overview
The first description of a dropout algorithm was by Hinton et al. (2012). Diverse variants of the algorithm have appeared since, including versions in which edges are dropped (Wan et al., 2013); groups of edges are dropped from the input layer (DeVries and Taylor, 2017); the distribution of the filters are Gaussian (Kingma et al., 2015; Molchanov et al., 2017); the removal probabilities change adaptively (Ba and Frey, 2013; Li et al., 2016); and that are suitable for recurrent NNs (Zaremba et al., 2014; Semeniuta et al., 2016). The performance of the original algorithm has been investigated on datasets (Hinton et al., 2012; Srivastava et al., 2014), and dropout algorithms have found application in e.g. image classification (Krizhevsky et al., 2012), handwriting recognition (Pham et al., 2014), heart sound classification (Kay and Agarwal, 2016), and drug discovery in cancer research (Urban et al., 2018).
Theoretical studies of dropout algorithms have focused on their regularization effect. The effect was first noted by Hinton et al. (2012); Srivastava et al. (2014), and subsequently investigated in-depth for both linear NNs as well as nonlinear NNs by Baldi and Sadowski (2013); Wager et al. (2013); Baldi and Sadowski (2014); Wei et al. (2020). Within the context of matrix factorization, it has been shown that Dropout’s regularization induces a shrinkage and a thresholding of the singular values of the matrix at the optimum (Cavazza et al., 2018). Characterizations of Dropout’s risk function and Dropout’s regularizer for (usually linear) NNs can be found in Mianjy et al. (2018); Mianjy and Arora (2019); Pal et al. (2020). Random networks with Dropout have been also studied in Sicking et al. (2020) and in Huang et al. (2019).
Detailed theoretical investigations into the convergence of dropout algorithms are however relatively scarce. While revising this paper, new results appeared and these now give insight into the convergence rate of Dropout in ReLU shallow NNs for a classification task (Mianjy and Arora, 2020). In Mianjy and Arora (2020), it is shown that iterations of SGD to reach
-suboptimality for the test error are required; interestingly, it is independent of the dropout probability because of their assumption that the data distribution is separable by a margin in a particular Reproducing Kernel Hilbert space. Compared to our generic convergence result, we do not assume structure on the predictor or data and look instead at the iterations required to reach
-stationarity in nonconvex functions using dropout-like SGD. A study of the asymptotic convergence rate of Dropout and Dropconnect on shallow linear neural networks has also appeared recently (Senen-Cerda and Sanders, 2022). There, an asymptotic convergence rate for dropout linear shallow networks is provided. Namely, for wide linear shallow networks with width D and dropout probability
convergence rate close to a minimum of
Finally, it must be noted that convergence properties have been thoroughly studied within the context of NNs being trained without dropout algorithms, see e.g. Arora et al. (2019); Shamir (2019); Zou et al. (2020); Gao et al. (2021) and references therein.
Dropout algorithms can, by construction, be understood as forms of SGD. More generally, dropout algorithms are all stochastic approximation algorithms. The first stochastic approximations algorithms were introduced by Robbins and Monro (1951); Kiefer and Wolfowitz (1952), and have been subject to enormous literature due to their ubiquity. For overviews and their application to NNs, we refer to books by Kushner and Yin (2003); Borkar (2009); Bertsekas and Tsitsiklis (1995).
A word on notation
In this paper we index deterministic sequences with curly brackets: , etc. This distinguishes them from sequences of random variables, which we index using square brackets, e.g.
Deterministic vectors are written in lower case like , but an exception is made for random variables (which are always capitalized). Matrices are also always capitalized. For a function
and a matrix
, we denote by
the matrix with
applied componentwise to A. Subscripts will be used to denote the entries of any tensor, e.g.
. For any vector
-norm is defined as
For any matrix
, the Frobenius norm is defined as
For two matrices A, B, the Hadamard (componentwise) product is denoted by
Let be the strictly positive integers and
, we denote [l] = {1, . . . , l}. For a function
, we denote the gradient and Hessian of g with respect to the Euclidean norm
, respectively.
We now formally define NNs, which we had depicted in Figure 1, as well as the class of activation functions that we will use for the convergence guarantee in our first result below.
2.1 Neural networks, and their structure
Let L denote the number of layers in the NN, and the output dimension of layer
denote the matrix of weights in between layers l and l + 1 for
the set of all possible weights. In this paper, we consider NNs without biases.
be an activation function
Neural Network (NN) with L layers is given by the class of functions
defined iteratively by
Canonical activation functions include the Rectified Linear Unit (ReLU) function max{0, t}, the sigmoid function
, and the linear function
Sections 2 and 3 we restrict to the case that
belongs to a class of polynomially bounded differentiable functions.
differentiable, denote the lth derivative of
of polynomially bounded maps with continuous derivatives up to order
is given by
Note that the linear and sigmoid activation function both belong to . Also, any polynomial activation function
belongs to
ReLU activation function is not in
. However, because the class
contains polynomials of any degree, we can approximate cases such as ReLU by using, e.g., the softplus activation function
, which satisfies that
. Note that the softplus activation function belongs to
2.2 Backpropagation, and SGD
In Section 1.1 we have defined the risk U(W) that in the previous notation now depends on a loss . Throughout this article, we will specify the Euclidean
as our loss function of interest without loss of generality.
Furthermore, in the definition of U(W) in (1), we make no distinction between an oracle risk function or empirical risk function. Both situations are covered by the definition in (1). Hence, our results cover the empirical risk case when we have a finite number of samples, as well as the online learning case, where a new sample is provided at each step of SGD. What we do assume is that one has the ability to repeatedly draw independent and identically distributed samples either distribution.
In an attempt to find a critical point in the set , as mentioned in (1.1), SGD is commonly used. Let
be a sequence of independent copies of (X, Y ), let
be an arbitrary nonrandom initialization of the weights. For i = 1, . . . , L,
, the weights are iteratively updated according to
for denotes a positive, deterministic step size sequence, and the estimate of the gradient
is computed using the backpropagation algorithm, which is given in Definition 15 in Appendix A. The stochastic gradient is an unbiased estimate of the gradient of U(W). In particular, we have
2.3 Dropout algorithms, and their risk functions
Dropout algorithms use {0, 1}-valued random matrices as filters of weights during the backpropagation step of SGD. More precisely, we examine the following class of dropout algorithms. Let be a random variable on the probability space
. Here, we write
for
, similar to how we notate weight matrices. Let
a sequence of independent copies of (F, X, Y ). In tensor notation, the weights are updated by using (2) with the random direction
for dropout given in (4). For each dropout algorithm a different filter distribution will be chosen. We can mention a few:
(i) In canonical Dropout (Hinton et al., 2012),
(iii) In Cutout (DeVries and Taylor, 2017),
In fact, the class of dropout algorithms we consider is quite large. For example, depend on
does not need to have the same distribution as
Recall, however, that if for some filter
, then in (2) ,
and we have
. In other words, filtered variables are not updated with these dropout algorithms.
If is independent of
countable, then the dropout algorithm’s risk function in (6) simplifies to
Here the sums are over all possible outcomes of the random variables F and (X, Y ), respectively. One implication of Proposition 6 in the result of the next Section 3 is that dropout algorithms of the kind in (2), (4) converge to a critical point of (6).
Our first result pertains to the convergence of dropout algorithms for a wide range of activation functions and dropout filters. While convergence is expected in practice, we prove such convergence rigorously. In order to control the iterates of the stochastic algorithm, we project the iterates into a compact set. The projection assumption is common when investigating the convergence of stochastic algorithms (Kushner and Yin, 2003; Borkar, 2009; Bertsekas and Tsitsiklis, 1995; Oymak, 2018); it essentially bounds the weights. For example, for and an update function
is projected onto an interval [a, b] is by clipping and setting
. There are also results involving generalization bounds for NNs where bounded weights play a role in controlling the learning capacity of the NN (Neyshabur et al., 2015).
3.1 Almost sure convergence
We first consider the notation and assumptions regarding the projection step of SGD. Let be a convex compact nonempty set and let
be the projection onto H. By compactness and convexity of H, the projection is unique. In a projected dropout algorithm, the weight update in (2) is replaced by (5). Because of the projection, our analysis will tie the limiting behavior of (5) to a projected ODE. To state such type of ODE, we need to define a
, which is defined as the minimum vector required to keep the solution of the gradient flow
in H. Appendix C defines the projection term carefully for the case that H’s boundary is piecewise smooth. Finally, define the set of stationary points
The set can be divided into a countable number of disjoint compact and connected subsets
, say. We choose the following set of assumptions:
(N1)
(N6) −∇
We are now in position to state our first result:
be the sequence of random variables generated by (5) with (4) on a probability space
. Under assumptions (N1)–(N4) , there is a set
probability zero such that for
converges to a limit set of the projected ODE in (16). If moreover (N5)–(N6) hold, then for almost all
converges to a unique point in
Theoretically, Proposition 6 guarantees that projected dropout algorithms converge for regression with the -norm almost surely. Proposition 6 implies that if one is using a regular nonprojected dropout algorithm and one sees that the iterates
are bounded, then these iterates are in fact converging to a stationary point of (6). Assumptions (N5)– (N6) are technical but are expected to hold in many cases. In particular, (N5) holds for the uniformly convergent approximation to a ReLU activation function given by softplus
, and holds for many smooth activation functions. Also (N6) is expected to hold when H is generic polytope for which the gradient
is not exactly orthogonal to the normal to the surface.
Observe also that Proposition 6 holds remarkably generally. For example, the dependence structure of (F, X, Y ) as random variables is not restricted; it covers commonly used dropout algorithms such as Dropout, Dropconnect, and Cutout; and it holds for differentiable activation functions. Proposition 6 includes also online and offline learning, depending on the distribution from which we sample.
Our proof of Proposition 6 is in Appendix D and relies on the framework of stochastic approximation in (Kushner and Yin, 2003, Theorem 2.1, p. 127). In the background the stochastic process is being scaled in both parameter space and time so that the resulting sample paths provably converge to the gradient flow in (16). Examining the proof, we expect that Proposition 6 can be extended to cases where the filters as random variables have finite moments, for example, when they are Gaussian distributed (Molchanov et al., 2017). Concretely, the proofs of Lemmas 17 and 18 in Appendix D rely only on the assumption that F has finite moments, and may therefore be extended.
3.2 Generic sample complexity for dropout SGD
Examining Proposition 6, we note that it does not give insight into the convergence rate or the precise stationary point of D(W) to which the iterates converge. A related goal in stochastic optimization is to ask for the number of iterations of (2) required to achieve a point close to stationarity in expectation, also referred to the sample complexity of the algorithm. We say
-stationary point of a differentiable function D if
holds. For nonconvex functions D with a Lipschitz continuous gradient
, SGD convergence to an
-stationary point in expectation can be achieved in
iterations; see Bottou et al. (2018); Drori and Shamir (2020).
We will consider nonconvex functions with a Lipschitz continuous gradient and assume that the filters F and the data Z = (X, Y ) are independent. We will also assume that the distribution of Z is well-behaved so as to guarantee that we also have the following relations for the functions r, U and D:
Note that the function r in this setting includes the loss function formulation from (1) with
and in general, at time t the update rule will be
In the case of dropout, for example, we expect that the sample complexity of finding an stationary point for the empirical risk will change depending on the dropout probability
In particular, if
holds for any
. On the other hand if
, then the variance of
will also be small. We make these intuitions rigorous in the next proposition. For some
be the parameter space and
a Lebesgue measurable set. We assume the following:
Except for (Q4) and (Q5), all other assumptions are routinely used in sample complexity analysis. While the assumptions of Proposition 7 below hold for general nonconvex smooth functions D, in the case of NNs and the setting in (20) we remark that there are examples that satisfy these assumptions such as the following one:
Example 1 In a binary classification setting, the set Z is compact, that is, the data pairs take values in a compact set where
are labels for the two classes. A NN, denoted by
, uses sigmoid activation functions
output in R. The output of
is then used for binary classification with a logistic map, that is, the predicted probability of belonging to one of the classes is given by
. In this setting, assumptions (Q1)–(Q3) will hold if the loss l is also smooth (such as the
-loss). In this case, we have D(W) = D(W) and the constants in (Q1)–(Q5) will also indirectly depend on the depth and width of the NN.
Regarding (Q4), note that it allows for dependencies between filters. We also assume (Q5) for the sake of simplicity: we could instead use projected SGD with updates from (5) instead of (Q5), but using projected SGD would leave the scalings in p and T invariant. Recall that
. The proof the following proposition can be found in Appendix E.
be a sequence of independent random variables with distribution
be iterates of (21). Assume (Q1)–(Q5). Define
(a) Let
, then there exists a constant stepsize
that for all
(b) Let . There exists a sequence of decreasing stepsizes satisfying
all
In Proposition 7, we observe that finding approximate stationary points is easier with a larger dropout probability for a wide range of filter distributions like those determining dropout and dropconnect, as guaranteed by (Q4). In Proposition 7(a) we also see a dependence of the convergence rate on
corresponds to the variance of the gradient due the distribution of data in Z and decreases with p; while the term
stems from the variance due to dropout. Note that the sum achieves a maximum for
. We note that Proposition 7 does not suggest that the convergence to minima is faster for smaller p. In particular, saddle points can become easier to find as
. As seen later in the numerical experiments with NNs in Section 5.1, or in similar work from Mianjy and Arora (2020); Senen-Cerda and Sanders (2022), the NN structure and data distribution can change the convergence rate dependence on the dropout probability. As an example, in Senen-Cerda and Sanders (2022) it is suggested that the convergence rate dependence on p and the width of the NN can have different regimes depending on whether we are close to a minimum or not. Similarly, smaller p does not necessarily improve generalization. In particular, if the dropout probability
is large, the optimization landscape will be flat with many approximate stationary points. In this case, SGD with dropout with a limited sample complexity of T iterations will not explore the landscape as much as when using a smaller dropout probability. With a flatter landscape in mind, it may be better in the complexity trade-off to use a larger p for finding an approximate minimum and generalize better instead of finding a stationary point.
A possible approach to avoid the flattening of the landscape is to scale the weights appropriately during training. This is, for example, what is conducted in practice in some implementations of dropout.Assuming (Q4) holds, we consider the update rule
With (24), the use of filters is compensated by increasing the size of the updates and weights accordingly. In this case, SGD with this update rule is actually minimizing the function
which also compensates in expectation the effect of the filters. With the update rule in (24), we can again obtain an expression for the complexity of finding an -stationary point of
. The following is proved in Appendix E:
be a sequence of independent random variables with distribution F. Assume (Q1)–(Q5). Let
be iterates of (24). Let
then there exists a constant stepsize
such that for all
Proposition 8 shows that for the scaled dropout SGD of (24) the complexity of finding an -stationary point monotonically increases with
. This result contrasts with Proposition 7, where a different behavior was observed. We remark, however, that this result assumes (Q5), which for small p cannot realistically hold since a bound R for the norm of the weights may also scale by a factor 1/p. This result, just like with Proposition 7, also does not imply that good weights
become easier to find by using the update (24). Indeed, scaling partially avoids the flattening of the landscape—the Lipschitz constant of
is namely scaled by a factor
—but the variance of SGD due to dropout is also increased considerably. This variance becomes dominant when the dropout rate
due to the inverse dependence on p in the sample complexity.
Propositions 7 and 8 show that the complexity of finding -stationary points heavily depends on the algorithm used. However, when we restrict the results to deep NNs such as with Example 1, the bounds do not provide much information on the dependence of the convergence rate on the depth of the network. This fact also shows the limitations of using a generic sample complexity analysis.
In order to obtain an explicit convergence rate depending on the depth, we need to use the additional structure of the NN. In the next section we will be able to compute the convergence rate to a global minimum for NNs that are shaped like arborescences and obtain an explicit bound that depends on the depth of the arborescence and the dropout probability.
We obtained a convergence guarantee as well as a bound for the sample complexity of dropout in the previous section. Next, we focus on the convergence rate of dropout in functions that model the structure of NNs. In particular, we will derive an explicit convergence rate for dropout algorithms in the case that we have linear activations and that the NN is structured as an arborescence: see Figure 1c. Specifically, we will study the following regular GD algorithm on dropout’s risk function:
Here, we keep the step size fixed. Note that this algorithm generates a deterministic sequence
as opposed to a sequence of random variables
as generated by (2) or (4). We will use a linear activation function
, which combined with the arborescence structure will allow us to obtain an explicit convergence rate. While the iterates of (27) are not stochastic, analogous to Proposition 6, the stochastic iterates will converge to a gradient flow of an ODE, whose discretization is given in (27). Analyzing ODEs related to NNs is common in literature Tarmoun et al. (2021); Jacot et al. (2018). For more discussion on the relationship between the iterates of (27) and dropout we refer to Appendix B.
Our main convergence result in Proposition 13 below holds for general distribution functions. However, we show here the cases of Dropout and Dropconnect, which are most insightful. We use the following notation adapted from graph theory. Consider a fixed, directed base graph G = (E, V) without cycles in which all paths have length L, which describes a NN’s structure as follows. Each vertex represents a neuron of the NN, and each directed edge
indicates that neuron u’s output is input to neuron v. Note that to each edge
in the NN, a weight
and a filter variable
associated. We will write
for simplicity. For an arborescence G, we denote by L(G) the edge set of leaves. Let
be real numbers and suppose that we initialize the weights
as follows:
The proof of Proposition 9 is deferred to Appendix I, which is a consequence of our more general result in Proposition 13.
Proposition 9 Assume that the base graph G is an arborescence of depth L with |L(G)| leaves, the activation function is linear, F is independent of
is initialized according to (28). If the
follow the distribution prescribed by Dropconnect or Dropout, then there exists
such that the iterates of (27) satisfy
with
4.1 Discussion
In Proposition 9 we consider the cases of Dropout and Dropconnect, in which nodes or edges are dropped with probability , respectively. Observe that the convergence rate exponent depends on
; see (28). The first term in particular indicates that as the NN becomes deeper, the convergence rate exponent of GD with Dropout or Dropconnect will decrease by a factor
. The second term
shows the increased difficulty of training deeper NNs and has been observed e.g., by Shamir (2019); Arora et al. (2019). The exponential dependence in L is moreover tight when using GD and is intrinsic to the method (Shamir, 2019). Hence, dropout adds another exponential dependence to the convergence rate in arborescences, which is due to the stochastic nature of the algorithm. In Figure 2 an experiment confirming this intuition on the convergence rate of dropout on a single path for different depths can be seen.
Finally, our proofs of Proposition 9 and the related more general result in Proposition 13 below can be found in Appendix H. The proof strategy is to show that a Polyak–Łojasiewicz (PL) inequality holds, which allows one to obtain convergence rates for GD on nonconvex functions (Karimi et al., 2016). The new part of the argument is that we use conserved quantities and a double induction to identify a compact set in which the iterates remain and simultaneously a PL inequality holds. The method that we develop and which is sketched in the next subsection depends intricately on the arborescence structure and cannot be readily applied to other cases.
To compare this result with more realistic models, we will examine the convergence rate of dropout in deep and wide NNs in Section 5 with a heuristic and experimental approach.
4.2 Sketch of the proof
Besides the previous notation, we need to introduce notation corresponding to subgraphs and paths. Let G be the set of all subgraphs of the base layered directed graph G with d vertices, and let E(g) be the set of edges of a subgraph be defined as the set of all paths in the directed graph g that start at vertex i, traverse edge e, and end at vertex j. If the origin or end vertices are in the input or output layer, the subscript or superscript is dropped from the notation, respectively. For every path
for notational convenience. Finally, let
the random subgraph of base graph G that has edge set
. We denote
. We first provide an explicit characterization of dropout’s risk function in (6) in terms of paths in the graph that describes the structure of
Figure 2: The average loss depending on the number of steps of SGD with dropout of the function and its average convergence slope. (a) The average loss for L = 1. (b) The average loss for L = 3. (c) The average loss for
of the fit of
for the curves in (a), (b) and (c). The slopes
for a given l have been normalized at p = 1 for comparison across depths L. Note that for larger L, the effect of p becomes also more pronounced. This is in agreement with the conclusion in Section 4, where we expect a convergence rate depending on
. In this case, other effects of depth are also observed, such as a dependence on the initialization.
the NN. This is possible since we assume linear activation functions. The following lemma now holds, and is proved in Appendix F.
Lemma 10 Assume that the base graph G is a fixed, directed graph without cycles in which all paths have length L and there are output nodes (N6’), that
(N7), and that F is independent of (X, Y ) (N8). Then
Moreover D(W) = J (W) + R(W), where
Here, the constants depend explicitly on F’s distribution and the NN’s architecture.
Note that Lemma 10 essentially changes variables to rewrite the dropout risk function as a sum over paths instead of a sum over graphs. This representation allows us to clearly identify the regularization term R(W). For example in the case of Dropconnect (Wan et al., 2013), where the filter variables are independent random variables with distribution Bernoulli(p), Lemma 10 holds with
. Also note that if for all subgraphs
and vertices
the number of paths that end at
such as when G is an arborescence, then for all subgraphs
is only one path ending at a leave node
We now focus on a base graph that is an arborescence of arbitrary depth; see Figure 1c. Hence we now replace (N6’) in Lemma 10 that assumes a generic graph by assumption (N6), where G is specifically an arborescence. The following specification of Corollary 11 is also proven in Appendix F.
Corollary 11 Assume that the base graph G is an arborescence of depth L (N6), and (N7)– (N8) from Lemma 10. Then
and . Consequently, R(W) = 0 for an arborescence.
The convergence result we are about to show uses the fact that for the system of ODEs there are conserved quantities. Within the proof, these conserved quantities have the crucial role of guaranteeing compactness for the iterates. Specifically, let L(g; f) denote the leaves of the subtree of
rooted at a vertex
, and define the set of leaves of
. We remark that in the previous notation
and each leaf
, define the quantity
Define also, both of which we require later. Lemma 12 now proves that the function
in (35) is a conserved quantity; the proof is in Appendix G.
Lemma 12 Assume (N2) from Proposition 6, (N6) from Corollary 11 , (N7), (N8) from Lemma 10. Then under the negative gradient flow
for all
We are almost in position to state our second result, but need to introduce still some notation. We define the following constants
for notational convenience. Also, for
a bounded set of parameters where if the weight is associated with a leaf, they are furthermore bounded away from zero. Let finally
denote the set of all weight parameters that are -close to a critical point and for which the conserved quantities in (35) deviate by no more than
from their initial value
These deviations are made explicit by the intervals
Our proof shows that the iterates stay in the intersection
implies that the weights (including those associated with the leaves) remain bounded. The following now holds, and its proof can be found in Appendix H.
Proposition 13 Assume (N2) from Proposition 6, (N6) from Corollary 11, (N7)–(N8) from Lemma 10, that (N9), that
then the iterates of (27) satisfy
Proposition 13 identifies explicitly how the convergence rate of GD on a dropout’s risk function depends on the dropout algorithm and the structure of the arborescence: parameters such as p, |L(G)|, L are implicitly present in the constants
Note that Assumptions (N9)–(N10) are relatively benign. These assumptions are for example satisfied when initializing and setting
, which we assume in Proposition 9. In other words, this initialization sets the weights that are associated with leaves small compared to all other weights.
In Proposition 13, we have proven that the convergence rate depends on for NNs shaped like arborescences. Let
be a tree and
be an edge. Denote by
set of paths passing through e that are not filtered by dropout at time t. We observe that at any given time t of dropout SGD,
If we denote by the average update time for a weight in
we need
more time on average for a given edge to be updated than when we do not use dropout. For wider networks G, however, edges can be updated simultaneously and repeatedly via different available paths. By the previous intuition we might still expect that, if the updates are sufficiently independent, the convergence rate depends approximately on
. In order to verify this intuition we will determine
for NNs that are much wider than deep, and later simulate their convergence rates also in realistic settings.
Suppose now that G is a graph of a fully-connected NN with L dropout layers each of which has width D. For each of the vertices in a dropout layer, there is an associated dropout filter variable
is fixed. That is, we use dropout. Note that any other additional input or output layer without filters only changes the number of paths by a multiplicative factor. Hence, we will restrict to the case that all nodes in the layers have filter variables. In this case, we may consider a path
of L vertices—one for each dropout layer—instead of edges. For two paths
consider their intersection
as the subset of vertices belonging to both paths. Hence,
implies that the intersection has l vertices, not necessarily forming a path.
We remark that we can restrict to the case L > 2. In the case of one dropout layer L = 1, an edge e = (u, v) conected to a dropout node u is updated if and only if the filter is the adjacent vertex to e with a dropout filter, so that in this case
is updated if and only if
. Recall that we denote by
the set of paths
passing through e. For a path
, in the following, we let
be the indicator of a path being filtered. Thus,
is not filtered and 0 otherwise. We will use Greek letters for paths and Latin letters for vertices when referring to filters
and
respectively.
Lemma 14 Let G be a graph of a fully-connected NN with L > 2 dropout layers, each with the same width D and with dropout filters For an edge
let
denote the random variable that counts the number of nonfiltered traversing paths through e. If L, p are fixed, then as
Proof We will use the Paley–Zygmund inequality. For a nonnegative random variable Z with finite second moment, for any
We will use (45) with the random variable . The idea is that if D is much larger than L, the average number of paths passing through e is also large. We are using dropout, so the filter variable corresponding to an edge e = (u, v) will depend on only the vertex u, that is,
. For counting paths we also need to take into account that the filter
will occurring in all paths passing through e. Since only the two vertices u and v of e are fixed we can compute
We define the set of broken paths in
that is, if and only if there exist
. In particular,
contains paths and unions of vertices of paths that pass through e. Then we have:
where (i) we have first used that are indicators for occurring
and that at least
since vertices u and v are shared among all paths in
; secondly, that we have separated the sum over paths into a path
and all other paths
that coincide in l vertices. In (ii) we have computed the probability by noting that for
, where the term
accounts for the l shared filters corresponding to l shared vertices and
for the remaining products of filters. Note that we have used the independence assumption for filters here. (iii) We have used here that
so that we can separate the previous sum into first, fixing the l vertices where two paths intersect—including
, and then looking for all possible
. For (iv) we fix l vertices where
coincide, then there are still
possible ordered vertex pairs to choose from all the other vertices where
do not coincide. (v) For the remaining sum, for each l fixed locations— including the vertices of e, which are fixed—we can still choose
remaining possible vertices. Additionally, there are for each
locations for these vertices. Hence, plugging (52) and (46) into (45) yields
In particular, setting and computing the higher order noting that L > 2, we obtain that
or alternatively noting that
Finally note that since the edge e can be present in a path only if the filters at both vertices of e have value 1, which occurs with probability
Note that in the proof of Lemma 14 we can recover the scaling that we have seen in Proposition 13 by setting D = 1 in (52) and in (50).
From Lemma 14 we expect that for a wide network with L layers where edge
, we have that
If the convergence rate is related to the update rule, then we would expect that for a wide network the rate would be independent of L which is different from the path network considered in Proposition 13. In the next section we will verify this intuition on real datasets. Note, however, that we do not expect to see the dependence on p as shown in (58): this heuristic argument provides only the rate at which a weight is updated, and stochastic averaging is not solely driving the convergence rate. In particular, from an example for wide shallow linear networks in Senen-Cerda and Sanders (2022), close to a critical point of a dropout ODE, the dependence scales with a factor instead of p. This is due to the fact that for larger p, there are regions of the landscape close to minima that become flat, as also hinted by Proposition 7. Indeed, when
in the convergence rate of Proposition 7 lowers the complexity of finding an
-stationary point. Hence, there are landscape regimes and initialization issues that also account for the convergence rate in NNs.
5.1 Numerical Experiments
In this section we conduct the dropout stochastic gradient descent algorithm numerically,for different datasets and network architectures. We measure the convergence rate for differ-ent widths D, depths L, and dropout probabilities
. We then compare these measurements to the bounds on the convergence rates obtained in Section 4. We use
for the implementation.
5.1.1 Setup
Datasets. We will consider three commonly used data sets of images: the MNISTet al., 2010), CIFAR-100-fine
, and CIFAR-100-coarse datasets (Krizhevsky, 2009).
NN Architecture. We use as a base architecture a LeNet with 11 layers where the two dense layers have been substituted with L fully-connected ReLU layers of width D. Each of these layers have dropout with dropout probability . While larger networks are commonly used in practice, a LeNet architecture is sufficient to test the effect of dropout on the convergence rate as we verify with the simulations.
Loss. We use the cross-entropy loss, which is commonly used for classification. For two distributions p and q with support on [n] labels, the cross-entropy loss is defined as
Stopping criteria. In all experiments, we stop after 40 epochs.
Initialization. In order to see the convergence rate close to a minimum. We use first a Gaussian initialization, that is, we set every weight on the dense layers to in an independent manner, where D is the width of the layer. While this initialization is standard, we note that we cannot expect to compare convergence rates for different numbers of layers
and for different dropout probabilities
since the loss functions are also different. In the course of our experiments, we found that there are also many saddle points where SGD remains stuck, which complicated the estimation of the convergence rate. In order to start approximately at the same neighborhood where the iterates stay and continuously track minima across different choices of p, for each
we have used a two-step approach in order to avoid areas of the landscape with saddle points. We first run ADAM
epochs with p = 0.1 and store the weights. Secondly, for each
we then perform dropout SGD with initialization given by the stored weights. In this manner, we expect that we are approximately “tracking” the same local region across the optimization landscape when we change p. Optimization with ADAM is less prone to remain in flat areas of the landscape since it uses a dynamic step size. Hence, if after the dynamic step the iterates remain in a part of the landscape with no saddle points that smoothly changes with p, we also expect in this case to obtain comparable convergence rates for SGD for each fixed L.
Step size and batch size. In each experiment, the step size is given by batch size is b = 1024.
Fitting procedure. We fix a set of probabilities and depths L = {1, 2, 3} and for each pair
we run the algorithm above. From the value of the loss from all T iterations of SGD
in one run, we compute a moving average
we average the loss across a window with size given by the number of batches
to complete one epoch. In this manner we obtain an average convergence rate and diminish the stochasticity from the dataset. We then fit the averaged loss of the iterates
each p and l to the function
We run the experiment R = 10 times for each (p, l) and obtain an average convergence exponent
5.1.2 Results
In Figure 3 we can see the plots of . As suspected from the heuristic argument, we do not see an increasingly large dependence on
the MNIST dataset some dependence on the depth is appreciated, but this may be due
Figure 3: The fit for LeNet with different widths D and different datasets. Here (a) MNIST with
MNIST with D = 100; (b) CIFAR-100-fine labels with
CIFAR-100-fine labels with D = 100; (c) CIFAR- 100-coarse labels with
) CIFAR-100-coarse labels with D = 100. While for the MNIST dataset there seems to be an increasing dependence of dropout on the convergence rate with the depth L, for CIFAR no such dependence is observed. We remark, however, that in the CIFAR datasets encountering saddle points was more common. For those areas the loss profile is flat and so we expect the fits to be biased towards the origin in some cases.
to other factors that affect the convergence rate, like initialization issues. For the CIFAR datasets, convergence is greatly affected by saddlepoints despite the use of dropout. This is, however, common when using SGD with small constant stepsizes. In particular, in practical scenarios other schemes that adjust the stepsize, like e.g. ADAM, may be more appropriate when dealing with deep networks with dropout in different layers. From the experiments it is concluded that despite the stochasticity provided by dropout, the convergence rate is not affected much by a varying dropout probability in wide networks with just few dropout layers.
In this paper we have shown with a probability theoretical proof that a large class of dropout algorithms for neural networks converge almost surely to a unique stationary set of a projected system of ODEs. The result gives a formal guarantee that these dropout algorithms are well-behaved for a wide range of NNs and activation functions, and will at least asymptotically not suffer from issues because of the connection to bond percolation. We leave the extension of this result for nonsmooth activation functions such as ReLU for future work. Additionally, we established bounds for the sample complexity of SGD with dropout to converge to an -stationary point of a generic nonconvex function. An upper bound to the rate of convergence of GD on the limiting ODE of dropout algorithms was established as well for arborescences of arbitrary depth with linear activation functions. While GD on the limiting ODE is not strictly a dropout algorithm, the result is a necessary step towards analyzing the convergence rate of the actual stochastic implementations of dropout algorithms. Finally, Proposition 9 specifically implies that Dropout and Dropconnect can impair the convergence rate by as much as an exponential factor in the number of layers of thin but deep networks. We have theoretically and experimentally verified this claim in experiments with a path network. This fact is in contrast to wide networks with a few dropout layers where a strong dependence on the dropout probability p is not experimentally observed. These two observations together imply that there is a change of regime in the convergence rate from networks that are wide with a few dropout layers to thin networks with many dropout layers.
We thank the anonymous referees for their feedback. Their suggestions have led to an improved paper.
Sanjeev Arora, Noah Golowich, Nadav Cohen, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. In 7th International Conference on Learning Representations, ICLR 2019, 2019.
Jimmy Ba and Brendan Frey. Adaptive Dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pages 3084–3092, 2013.
Pierre Baldi and Peter Sadowski. The Dropout learning algorithm. Artificial intelligence, 210:78–122, 2014.
Pierre Baldi and Peter J. Sadowski. Understanding Dropout. In Advances in Neural Information Processing Systems, pages 2814–2822, 2013.
Peter L. Bartlett, David P. Helmbold, and Philip M. Long. Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks. Neural Computation, 31:477–502, 2018.
Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming: an overview. In Proceedings of 1995 34th IEEE Conference on Decision and Control, volume 1, pages 560–564. IEEE, 1995.
Vivek S. Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. Siam Review, 60(2):223–311, 2018.
Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and , 8(3-4):231–357, 2015.
Jacopo Cavazza, Pietro Morerio, Benjamin Haeffele, Connor Lane, Vittorio Murino, and Rene Vidal. Dropout as a low-rank regularizer for matrix factorization. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pages 435–444, 2018.
Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
Yoel Drori and Ohad Shamir. The complexity of finding stationary points with stochastic gradient descent. In International Conference on Machine Learning, pages 2658–2667. PMLR, 2020.
Tianxiang Gao, Hailiang Liu, Jia Liu, Hridesh Rajan, and Hongyang Gao. A global conver- gence theory for deep relu implicit networks via over-parameterization. In International Conference on Learning Representations, 2021.
Piotr Hajłasz. Whitney’s example by way of Assouad’s embedding. Proceedings of the American Mathematical Society, 131(11):3463–3467, 2003.
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
Wei Huang, Richard Yi Da Xu, Weitao Du, Yutian Zeng, and Yunce Zhao. Mean field theory for deep dropout networks: digging up gradient backpropagation deeply. arXiv preprint arXiv:1912.09132, 2019.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal–gradient methods under the Polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
Edmund Kay and Anurag Agarwal. Dropconnected neural network trained with diverse features for classifying heart sounds. In 2016 Computing in Cardiology Conference (CinC), pages 617–620. IEEE, 2016.
Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.
Durk P. Kingma, Tim Salimans, and Max Welling. Variational Dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
Harold Kushner and G. George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Zhe Li, Boqing Gong, and Tianbao Yang. Improved Dropout for shallow and deep learning. In Advances in Neural Information Processing Systems, pages 2523–2531, 2016.
Poorya Mianjy and Raman Arora. On Dropout and nuclear norm regularization. In International Conference on Machine Learning, pages 4575–4584, 2019.
Poorya Mianjy and Raman Arora. On convergence and generalization of dropout training. Advances in Neural Information Processing Systems, 33, 2020.
Poorya Mianjy, Raman Arora, and Rene Vidal. On the implicit bias of Dropout. In International Conference on Machine Learning, pages 3540–3548, 2018.
Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational Dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2498–2507. JMLR. org, 2017.
Anthony P. Morse. The behavior of a function on its critical set. Annals of Mathematics, pages 62–70, 1939.
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
Samet Oymak. Learning compact neural networks with regularization. In International Conference on Machine Learning, pages 3966–3975, 2018.
Ambar Pal, Connor Lane, René Vidal, and Benjamin D. Haeffele. On the regularization prop- erties of structured dropout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7671–7679, 2020.
Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. Dropout im- proves recurrent neural networks for handwriting recognition. In 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 285–290. IEEE, 2014.
Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
Arthur Sard. The measure of the critical values of differentiable maps. Bulletin of the American Mathematical Society, 48(12):883–890, 1942.
Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1757–1766, 2016.
Albert Senen-Cerda and Jaron Sanders. Asymptotic convergence rate of dropout on shallow linear neural networks. In Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, pages 105–106, 2022.
Ohad Shamir. Exponential convergence time of gradient descent for one-dimensional deep linear neural networks. In Conference on Learning Theory, pages 2691–2713, 2019.
Joachim Sicking, Maram Akila, Tim Wirtz, Sebastian Houben, and Asja Fischer. Character- istics of Monte Carlo dropout in wide neural networks. arXiv preprint arXiv:2007.05434, 2020.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut- dinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Salma Tarmoun, Guilherme Franca, Benjamin D Haeffele, and Rene Vidal. Understanding the dynamics of gradient flow in overparameterized linear models. In International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10153–10161, 18–24 Jul 2021.
Gregor Urban, Kevin Bache, Duc T.T. Phan, Agua Sobrino, Alexander K. Shmakov, Stephanie J. Hachey, Christopher C.W. Hughes, and Pierre Baldi. Deep learning for drug discovery and cancer research: Automated analysis of vascularization images. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(3):1029–1035, 2018.
Stefan Wager, Sida Wang, and Percy S. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pages 351–359, 2013.
Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using Dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of dropout. arXiv preprint arXiv:2002.12915, 2020.
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regulariza- tion. arXiv preprint arXiv:1409.2329, 2014.
Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over- parameterized deep ReLU networks. Machine Learning, 109(3):467–492, 2020.
We define the backpropagation algorithm used in Section 2 to compute the estimate of the gradient.
. Given weights
and input–output pair
Definition 15 is essentially a computationally efficient manner of calculating the gradient in (1), leveraging the NN’s layered structure together with the chain rule of differentation to come to a recursive computation of the partial derivatives.
Regarding our second result in Proposition 13, observe that GD on a limiting ODE is not exactly a dropout algorithm. Analyzing GD’s convergence rate however is an important stepping stone towards analyzing the convergence rate of dropout algorithms. To see the mathematical relation, consider that any dropout algorithm updates the weights
randomly for . Here, the
denote the step sizes of the algorithm, and the
represent the random directions that result from the act of dropping weights. As we will show in this paper under assumptions of independence, these random directions satisfy
for some continuous, differentiable function D(W). Observe that the algorithm in (62) satis-fies describes a martingale difference sequence. This martingale difference sequence’s expectation with respect to the past
For diminishing step sizes , we can consequently view dropout algorithms as in (62) as being noisy discretizations of the ordinary differential equation
In fact, we employ the so-called ordinary differential equation method (Kushner and Yin, 2003; Borkar, 2009), which formally establishes that the random iterates in (62) follow the trajectories of the gradient flow in (64). Hence, after sufficiently many iterations n and for a sufficiently small step size , the convergence rate of the deterministic GD algorithm
gives insight into the convergence rate of the stochastic dropout algorithm in (62).
We define here the projection operator used in Section 3. Say that H is defined by l smooth constraints
satisfying
. Denote by
the gradient of D(W) restricted to
be the tangent space of W at W. Suppose that
, and that these are linearly independent. At any point
, we define the outer normal cone
We also assume that C(W) is upper semicontinuous, i.e., if is the ball of radius
centered at W and intersected with
minimal to resolve the violated constraints of
points inside H. In particular, we have
The proof of Proposition 6 relies on the framework of stochastic approximation in Kushner and Yin (2003). Specifically, Proposition 6 follows from Theorem 2.1 on p. 127 if we can show that its conditions (A2.1)–(A2.6) on p. 126 are satisfied. In the notation of Sections 2, 3, these conditions read:
We next also state for your convenience Theorem 2.1 by Kushner and Yin (2003) in the notation of this paper. Their result does require some notation, as it characterizes the limiting behavior of the iterates of
For any sequence of step sizes satisfying (A2.4), define
Define the continuous-time interpolation
as well as for , the shifted processes
furthermore
and define
as well as for , the shifted processes
. The following now holds:
Theorem 16 (A part of Theorem 2.1 by Kushner and Yin (2003)) Let conditions
(A2.1)–(A2.5) hold for algorithm (71), with the projection onto H being as described in Appendix C. Then there is a set N of probability zero such that for , the set of functions
is equicontinuous. Let
denote the limit of some convergent subsequence. Then this pair satisfies the projected ODE (16), and
converges to some limit set of the ODE in H. Suppose that (A2.6) holds. Then, for almost all
converges to a unique
In order to apply Theorem 16 and arrive at Proposition 6, we verify conditions (A2.1)– (A2.6) through Lemmas 17–19 shown next in Appendix D.1. These lemmas are proven in Appendices D.1.1–D.1.3, respectively.
D.1 Verification of conditions (A2.1)–(A2.6)
First we assume conditions (N1)–(N3) and we prove that the variance of the random update direction in (4) is finite. This verifies condition (A2.1). The proof can be found in Appendix D.1.1:
Lemma 17 Assume (N1)–(N3) from Proposition 6. Then i = 0, 1, . . . , L.
We prove next that if , then the random update direction in (4), conditional on all prior updates, has conditional expectation
. Lemma 18 verifies conditions (A2.2), (A2.3), and (A2.5) (in particular, here
). The proof can be found in Appendix D.1.2:
Lemma 18 Assume (N2)–(N4) from Proposition 6. Then thermore,
times continuously differentiable.
From these conditions the first part of Proposition 6 follows. To prove the second part of Proposition 6, we have to prove that the set of stationary points is well-behaved in the sense that
is constant. If an objective function is sufficiently differentiable, this is guaranteed by the Morse–Sard Theorem (Morse, 1939; Sard, 1942). In the present case however we must take into account the possibility of an intersection of the set of stationary points with the boundary
. Assuming (N4) and (N5) provides sufficient conditions. The proof of Lemma 19 can be found in Appendix D.1.3:
Lemma 19 If (N2)–(N5) hold, then D(W) is constant on each
Since Conditions (A2.1)–(A2.6) of Thm. 2.1 on p. 127 in Kushner and Yin (2003) are now proven satisfied, the proof of Proposition 6 is now completed.
We need to carefully track all sequences of random variables created by a dropout algorithm throughout this proof, which we state here first explicitly.
Definition 20 (Dropout iterates) During its (t + 1)-st feedforward step, the algorithm iteratively calculates
for , to output
Subsequently for its (t + 1)-st backpropagation step the algorithm calculates
iteratively for . The algorithm then calculates
for i = 1, . . . , L, and finally updates all weights according to (13).
The idea of the proof of Lemma 17 is to expand the terms in defined in Definition 20 recursively, and identify a polynomial in variables
will use several bounds that pertain to the Frobenius norm, written down in Lemma 30 in Appendix J, and we will iterate these in a moment.
First, we will prove two bounds on the activation function applied to an arbitrary matrix A. Recall that by assumption (N1). There thus (i) exists some
that
, and there exists some
for some constant Similarly there exists some
Note furthermore that (ii) for all
, by submultiplicativity of the Frobenius norm,
for . Again, a similar bound holds for
Next, note that we have by (i) submultiplicativity and Lemma 30 that
The first term is bounded with probability one: . For the second term, consider the following bound:
for , where we have also used the submultiplicative property. For the third term, consider the next bound: (i) recursing (79) with
we obtain that there exists some
, say, so that
for . Similar to the derivation in (82), we obtain instead with
exists some
Recall that . This, together with using (81) repeat- edly for
, and (82), (83), yields the following inequality
Lastly, we bound . By applying (i) subadditivity of the norm
and then using the elementary bound
as well as submultiplicativity, we obtain
By combining inequalities (84), (85), and upper bounding the exponent of the term
in (85) by
, we conclude that
for i = 1, . . . , L and some constants and polynomials
say, the latter both in L variables. Because of the projection and by definition of H, there exists a constant
with probability one for all
Furthermore,
with probability one for all i = 1, . . . , L,
. These two bounds, together with (86) and the fact that
are polynomials, as well as the hypothesis that
, implies the result.
Let . Recall that
is the smallest
algebra generated by
, and note that
-measurable. The (i)
-measurability of
together with the (ii) hypothesis that the sequences of random variables
is i.i.d. implies that
Next, we need to check that we can exchange the derivative and expectation. Note that we have the same assumptions as for Lemma 17. as well as that
Therefore, by (86) in Lemma 17 we have that
upper bounded and moreover
only dependent on H. The interchange is then warranted by the dominated convergence theorem. Hence continuing from (87), we obtain
If , then for any multi-index s on the set of weights, a bound similar to (86) holds by the chain rule:
where are polynomials and
are the top exponents in the expansion in
. Hence, using the assumption
, we obtain for any
a compact set that
In particular we can apply the dominated convergence theorem and conclude
We use Sard’s theorem (Sard, 1942) to prove Lemma 19, which gives sufficient conditions for condition (A2.6):
map between manifolds with
be the set of critical points of
has measure zero.
Proof of Lemma 19. By Lemma 18, we have . By assumption (N5) we have that if
. Furthermore
j, i.e., the critical points of
Sard’s theorem (Proposition 21) to D(W). We have that if
has measure zero. Since
is connected there is a continuous path
any two points
. By continuity of D(W) we must have then D(a) = D(b), since otherwise we would have
which has positive measure in R. Therefore
must be a constant.
Remark that in Lemma 19 the condition cannot immediately be eliminated. When r < dim(W), there are examples of functions which are not constant on their connected critical sets, see e.g. Hajłasz (2003).
We use standard tools for proving convergence to an -stationary point (for a reference, see Bottou et al. (2018)). We require first the following bounds on the variance induced by dropout.
Lemma 22 Assume that F is a random variable satisfying (Q4). If f is a vector of random variables with distribution F, then
Proof We prove first (i). Recall that . If we denote by
th entry of f, then note that from (Q4)
linearity (i) follows. For (ii), we have
where in the last inequality we have used the Cauchy–Schwartz inequality.
In order to prove both Propositions 7 and 8 simultaneously, we will temporarily redefine in this section D as
where c > 0 is a constant. Later on we will specify both c = 1 for Proposition 7 and c = 1/p for Proposition 8, respectively.
Lemma 23 Assume (Q3) and (Q4), that is, -Lipschitz and the distribution of the filters is {0, 1}-valued. Then,
-Lipschitz.
Proof Using (i) Jensen’s inequality with the norm, we have for a fixed
where we have also used (ii) the fact that for a vector u and {0, 1}-valued vector f we have -Lipschitz.
The proof of the following lemma can be found in Appendix E.1.
Lemma 24 Assume (Q1)–(Q4), then for any
We obtain in the next lemma a simple bound for the variance of the gradient that depends on the data.
Lemma 25 Assume (Q1)–(Q4), then for any
Proof We use first the definition of U as an expectation. We have
where in (i) we have used the upper bound for from (Q2) and in (ii) that since
so using linearity with (Q4) the bound follows.
By (Q3)–(Q4) and Lemma 23, -Lipschitz. In this case, we can then use the following common argument: if
-Lipschitz then we have the inequality
We can then use the definition of
Let -algebra of
. Conditional on
is an unbiased estimator of
so that by linearity
Similarly to (97), we can decompose
where in the last step the cross-term vanishes since, by using the independence assumption of . If we take the expectation with respect to
first, then we find
Similarly, we can add and substract in the first term and repeat the argument with the definitions of
in (90), where we take the expectation of (98) with respect to
instead. A similar cross-term vanishes. We then obtain
Define the constant depending on c. Using the bounds of Lemma 24 together with assumption (Q5) and Lemma 25 in (100) we obtain
Substitute now (97) and (101) in (96). After taking the expectation, we can use a telescopic sum in (96) with the previous bounds, which yields
By (Q1) we have independently of c. Assuming that
, we then have
We not proceed with proving Propositions 7 and 8. is a constant in (103), we find
Minimizing the bound over yields that the minimum occurs at
, the bound reads
For proving Proposition 7 (a), set c = 1 in (105) as well as . Note finally that the condition
is satisfied, for example, if p >
For proving Propostion 8, set c = 1/p in (105) as well as . Note finally that the condition
is satisfied, for example, if
Proof of Proposition 7 (b): Let c = 1 and denote We can also set
. It is easily verified that for
Substituting these bounds in (103) yields the result.
E.1 Proof of Lemma 24
Noting that we have temporarily the definition we can write
where (i) we have used Jensen’s inequality for a vector-valued random variable v, namely , and (ii) the subadditivity of the norm
. We now note that
where we have used (i) the Lipschitzness assumption of from (Q3), and (ii) the facts that
. The latter is true because for any vector f with entries
. We can reason similarly with
Using (Q2) we can also bound
Hence, we have in (107) that
where (i) for a random variable since either
. We can now add an expectation term in the norm
by (Q4). Hence, from (110) onward we can write
where we have used (i) Lemma 22(i), (ii) independence of and Lemma 22(i) again, (iii) Lemma 22(ii), and (iv) bounded
Proof of (31). Recall that is a random subgraph of G = (E, V) with edge set
. By (i) the law of total expectation, and by (ii) independence of F and (X, Y ),
Proof of (32). Expand (112) to find
after rearranging terms. This completes Lemma 10’s proof after identifying J (W) and R(W) here as the left and right sum, respectively. To prove Corollary 11, consider that since for an arborescence R(W) = 0, we can write
Here, (iii) follows because since , what remains must be the optimum. This completes the proofs of Lemma 10 and Corollary 11.
For any edge
Note that for any leaf
, and therefore in particular
Recall that L(G; f) is the set of leaves of the subtree of the base graph By the fact that
partitions
it follows that
Note in fact that this proof works for any base graph G that has no cycles and only length-L paths, so not just an arborescence. This is why we make Assumption (N6’) as opposed to the stronger Assumption (N6) in Corollary 11.
The proof of Proposition 13 is by double induction on the statements parameter and K is a compact set which we will define. Concretely, we prove that there exist
. Appendix H.4 describes in detail how the upcoming Lemmas 26–28 provide sufficient conditions for the induction step. There we also maximize the upper bound on the convergence rate over
which gives the rate in (41).
We start by giving Lemmas 26–28. Recall first the definition of the set
Here, with a minor abuse of notation, we define also
where is an arborescence.
Lemma 26 Assume (N2) from Proposition 6 and (N6) from Corollary 11. Then:
Lemma 26 implies that is compact and that
-smooth on the compact set
for . Its proof is deferred to Appendix H.1. Next, Lemma 27 gives a lower bound on the curvature of D(W) on K in the direction of
, in the form of a PL-inequality (Karimi et al., 2016). Its proof is in Appendix H.2.
Lemma 27 Assume (N2) from Proposition 6 and (N6) from Corollary 11. If
Lemma 28 proves that the conserved quantities of the gradient flow remain bounded under the GD algorithm in (27). This lemma allows us to keep track of the iterates in the compact set by relating them to conserved quantities and exploiting the fact that under GD
. Appendix H.2 contains its proof.
Lemma 28 Assume (N2) from Proposition 6 and (N6) from Corollary 11. If
A note on the exchange of derivative and expectation in this section. Whenever we make both Assumption (N2) in Proposition 6 and (N7) in Lemma 10, the exchange of derivative and expectation is warranted. This occurs several times throughout this section. We refer to the proof of Lemma 18 for the details.
H.1 Compactness, and smoothness – Proof of Lemma 26
In the proof of Lemma 26, we will upper bound the operator norm of the Hessian. Recall that for a symmetric bilinear matrix
Proof of (i). By continuity of the conditions in (39), the set is closed. We need to prove boundedness. Let
, and suppose w.l.o.g. that for some
. We want to find a path
is large for a contradiction with the assumption that
. By (35), we have the inequality
so that for some
we must have
. Consequently, we have by (35) that
for any edge
in any path
for the edge
where we have
by assumption. In particular, we have the bound
for any edge
for any path
. Therefore if we pick
we have
for sufficiently large Q, which is a contradiction. We must thus have . If on the other hand
, by (35) we must also have
for sufficiently large Q. This case is, thus, the same as before. Proof of (ii). Using a regular upper bound to the entries of
suffice. Element-wise, we have
Hence, noting that since we have , we can bound
and the other terms similarly. We upper bound the number of terms in the sum over
. Adding all terms, we obtain that
is an upper bound for each of the entries of
This gives an upper bound
H.2 PL-inequality on a compact set – Proof of Lemma 27
Recall the definition of a PL-inequality:
is compact and
by
and suppose that
. We say that u satisfies a Polyak– Łojasiewicz (PL) inequality if there exist a
depending only on K such that
A PL-inequality together with -smoothness on a compact set will imply that
decreases. To see this, note that by (i)
-smoothness, and (ii) the update rule
If furthermore , then also
Together with (125), and after rearranging terms, one finds that
By (iii) , we obtain (29). The strategy will now be to prove that there is a PL-inequality in some compact set, that the iterates remain in that compact set, and that the function is
Proof of Lemma 27. First note that if , the indexes of the weights in the product
belong to the index set E\L(G). The proof follows (i) by restricting the sum, and (ii) from the fact that for every path
in an arborescence G, there is exactly one leaf
where in (iii) we have used the bound similarly with
. Finally, by (35), we have
This completes the proof.
H.3 Conserved quantities remain bounded throughout GD – Proof of Lemma 28
. By (i) Corollary 11, and (ii) Lemma 12, we have
By Cauchy–Schwartz we also have
If we have . Thus, combining the estimate (129) with (131) we obtain
Extending the sums in (132) from , respectively, yields
where we have used the bound . Similarly, using (130) and the trivial bound
, and by absorbing one
I(W)’s expression, we obtain
for the lower bound. Because by assumption,
completes the proof.
H.4 Double induction
We now use Lemmas 26–28 together in a double induction to finally prove Proposition 13. Let and denote the statements:
We will prove that there exists a such that when choosing
appropriately, firstly
and secondly,
We need to prove that
(135) and (136). Using (133) from the proof of Lemma 28 repeatedly with the bound
, we obtain
By (135), we can upper bound
If furthermore (C1) , then (i) the inequality
holds, so that
In the same manner, we can also prove (141) for . This yields
for any . Similarly, for a lower bound, we can use (134) repeatedly together with the bound (140) and condition (C1) yielding
for any . Now, suppose (D1)
and let (C2) the step size satisfy
We have (i) by (142) and (143) that
Then by (39). Hence,
Suppose now for a moment that (C2) the right-hand side of (146) is positive for some sufficiently small . We could then use the PL inequality from Lemma 27 together with
To see how, note that the argumentation around (127) together with (147) and (i) the induction hypothesis and (ii) the clause (L1)
where we have also used (iii) the induction hypothesis A(t), i.e., that
We now investigate the exponent in (148) for a moment. Assuming (C2) and if (C3) the right-hand side of (148) is furthermore smaller than the induction step would be complete. Note finally that both conditions (C2) and (C3) are satisfied when choosing
or equivalently
To also satisfy condition (C1), we thus require that
Step 3. Let us summarize. Convergence occurs at rate at most if conditions (L1), (D1), (C1)–(C3) hold. Hence we have to choose
Note that we can maximize the convergence rate by maximizing
, which occurs when
Substituting this in (152) we require a step size
Finally, we have the bound from Lemma 26 in S, so that
This completes our proof of Proposition 13.
We consider first the case of Dropconnect. We have that are independent and identically distributed Bernoulli(p) random variables. Suppose that the base graph G has no cycles and every path is of length L. Then by definition in Lemma 10, we have
where (i) we have used Dropconnect’s distribution on F.
Now suppose that additionally we make the stronger assumption that G is an arborescence. Then by definition in Corollary 11 , and subsequently we can calculate
Now, since by assumption so that substitution of in the definition of
in Proposition 13 yields
where we have used that . Finally multiplying by
gives the rate
Substituting these results in the rate in Proposition 13 yields the result for Dropconnect.
Finally we note that for the case of Dropout, filtering all nodes independently in an arborescence is equivalent to filtering all edges independently except the edge at the root. In particular, in (155), we have . The remaining steps of the proof are then the same as for Dropconnect and comparing
we can absorb the missing p factor into the O notation, which does not change the order in L.
Lemma 30 For any matrix , it holds that
. For any two matrices
, it holds that
. For any two matrices
, it holds that
where (ii) we have used that the function is nondecreasing in
Because (iii) for the
-norm for sequences it holds that
we obtain
where (iv) we have used that the function This proves the first inequality.
The second inequality is an immediate consequence of the submultiplicativity property of the Frobenius norm and its positivity, i.e.,
Raising to the k-th power left and right finishes its proof.
The third inequality follows from strict positivity of the summands:
Each of the inequalities has now been shown.