TAdam: A Robust Stochastic Gradient Optimizer
2020·arXiv
Abstract

Machine learning algorithms aim to find patterns in observations, which may include some noise, especially in the robotics domain. To perform well even with such noise, we expect them to be able to detect outliers and discard them when needed. We therefore propose a new stochastic gradient optimization method whose robustness is built directly into the algorithm, using the robust student-t distribution as its core idea. Adam, the popular optimization method, is modified with our method, and the resulting optimizer, called TAdam, is shown to outperform Adam in terms of robustness against noise on diverse tasks, ranging from regression and classification to reinforcement learning problems.

The field of machine learning is undoubtedly dominated by first-order optimization methods based on the gradient descent algorithm and particularly [1], its stochastic variant, the stochastic gradient descent (SGD) method [2]. The popularity of the SGD algorithm comes from its simplicity, its computational efficiency with respect to second-order methods, its applicability to online training and its convergence rate, which is independent of the size of the training set. In addition, SGD has high affinity with deep learning [3], where network parameters are updated by backpropagation of their gradients, and is intensively used to train large deep neural networks.

Despite such established popularity, a specific trait of SGD is its inherent noise, which comes from sampling training points. Even though this stochasticity makes the algorithm more likely to find a global minimum, those fluctuations also slow down the learning process and, furthermore, render the algorithm sensitive to outliers. Indeed, bad estimates of the gradients are likely to produce bad estimates of the minimum.

Many of the new optimizers proposed to improve the SGD algorithm and to tackle complex training scenarios where gradient descent methods behave poorly share the same weakness to aberrant values. Adam (adaptive moment estimation) [4], one of the most widely used and practical optimizers for training deep learning models, is no exception. This is mainly due to the insufficient number of samples implicitly involved in its first moment evaluation.

The weakness to noisy data is particularly important in robot learning, where incomplete, ambiguous and noisy sensor data are inevitable. Furthermore, in order to generate large-scale robot datasets for scaling up robot learning, the ability to use automatically labeled data [5] is important.

image

of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan {ilboudo.wendyam eric.in1, kobayashi,

image

Robust learning methods are therefore needed to deal with potentially noisy labels, and they can improve the performance of low-cost robots that suffer from inaccurate position control and calibration and from noisy executions, without the need for a noise modeling network [6].

Hence, the aim of the present research is to propose a robust version of Adam through the use of robust estimates of the momentum, which is assumed to be the first-order probabilistic moment of the gradients. The key idea for such robust estimates is the use of the student-t distribution, which is a model suitable for the estimates from a few samples [7].

A. Background

a) Stochastic Gradient Descent: Let $x_t$ be a random sample from the data set at iteration $t$, $J_\theta(x_t)$ the objective function evaluated on data $x_t$ with the parameters $\theta$, $g_t = \nabla_\theta J_\theta(x_t)$ its gradient, and $\alpha$ the learning rate. The SGD algorithm [2] updates $\theta_{t-1}$ to $\theta_t$ through the following update rule:

$$\theta_t = \theta_{t-1} - \alpha g_t \tag{1}$$

This algorithm yields at least local minima of $J$ w.r.t. $\theta$.
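
The update rule above can be sketched in a few lines of Python. This is only an illustration: the function name `sgd`, the toy quadratic objective and the chosen learning rate are assumptions made here, not part of the original letter.

```python
import random


def sgd(grad_fn, theta, data, lr=0.1, steps=200, seed=0):
    """Plain SGD for a scalar parameter: theta <- theta - lr * g_t."""
    rng = random.Random(seed)
    for _ in range(steps):
        x = rng.choice(data)        # x_t: one random sample from the data set
        g = grad_fn(theta, x)       # g_t: gradient of J_theta(x_t) w.r.t. theta
        theta = theta - lr * g      # SGD update rule
    return theta


# Toy objective (assumed for illustration): J_theta(x) = 0.5 * (theta - x)^2,
# whose gradient is (theta - x) and whose minimiser is the mean of the data.
data = [1.0, 2.0, 3.0]
theta_hat = sgd(lambda th, x: th - x, theta=0.0, data=data, lr=0.05, steps=2000)
```

Because each step uses a single random sample, `theta_hat` keeps fluctuating around the mean of the data instead of settling exactly on it, which is the inherent noise discussed above.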

b) Improving SGD: Since SGD was first proposed, many ideas have been developed to improve its convergence properties, which are heavily connected to the fluctuations of the gradients during learning. Research aiming to accelerate the convergence rate has followed several approaches, improving i) the update method of the parameters [8], [9], [10], [11]; ii) the adjustment of the learning rate [12], [13], [14], [15]; and iii) the robustness to aberrant values from heavy-tailed data [16], [17], [18], [19]. Those approaches have culminated in effective state-of-the-art first-order optimization methods, ranging from the momentum idea to adaptive learning rate and variance reduction schemes. Below, we review some of the works related to robustness.

B. Previous works

As stated before, SGD is inherently noisy and liable to produce bad estimates of the minimum when facing aberrant gradient estimates. A lot of work has therefore been done to propose more robust methods for efficient machine learning under noise or with heavy-tailed data.

In this review, we ignore the general statistical methods for robust mean estimates [20], such as the median-based estimations [21], [22], [23], due to their practical limitations. Three main approaches are distinguished: a) methods based on direct robust estimates of the loss (or risk) function [24]; b) methods based on robust estimates of the gradients [25], [19], among which falls our algorithm; and c) methods with small learning rates for wrong gradient estimates [18].

a) Robust risk estimation: Those methods usually require the use of all the available data in order to produce, for each parameter, a robust estimate of the loss function to be minimized. An inconvenient trait of this approach is the implicit definition of the robust estimate, which may introduce computational roadblocks. As briefly explained by Holland et al. [19], since the estimates need not be convex even when the loss function is, the nonlinear optimization can be both unstable and costly in high dimensions.

b) Robust gradient descent: This approach usually relies on replacing the empirical mean (first moment) gradient estimate with a more robust alternative, and the methods differ only in how they achieve this objective. Chen et al. [25] proposed the use of the geometric median of the means of the gradients to aggregate multiple candidates. Using the same strategy, Prasad et al. [26] proposed a class of gradient estimators based on the idea that the gradient of a population loss can be regarded as the mean of a multivariate distribution, reducing the problem of gradient estimation to a multivariate mean estimation problem. Very close to our approach, Holland et al. [19] proposed to carefully reduce the effect of aberrant values instead of discarding them, since discarding can also throw away valuable data.

c) Adaptive learning rate: This approach reduces the effect of wrong gradient estimates by reducing the learning rate. One such approach has been proposed by Haimin et al. [18] and shares our objective of producing a robust version of the Adam optimization algorithm. The method employed by Haimin et al. uses an exponential moving average (EMA) of the ratio between the current loss value $l_t$ and the past one $l_{t-1}$ to scale the learning rate. However, this strategy allows the outliers to modify the estimated gradient mean, and then uses the impact of the deviated mean on the loss function to reduce the effect on subsequent updates. The lack of robustness, one of the problems of the EMA scheme, has also been addressed in [16] and [17]. In those methods, the exponential decay parameter of the EMA is increased whenever a value that falls beyond some boundary is encountered. The common drawback of this strategy is that all outlier gradients are treated equally and discretely, without consideration of how far they are from the normal values, and the boundary beyond which a data point is considered an outlier must be set manually before training.

d) Our contribution: To the best of our knowledge, our approach, named TAdam, is the first to employ estimates of the first moment of the student-t distribution to replace the estimates of the Gaussian first moment introduced by Adam through the EMA scheme. The main advantage of this approach is that it relies on the natural robustness of the student-t distribution and its ability to deal with outliers, and that it can easily be reduced to Adam for non-heavy-tailed data.

image

Fig. 1: Sensitivity to outliers in Adam: a regression task with noise drawn from a student-t distribution with degrees of freedom $\nu = 1$, location $\mu = 0$ and scale $\lambda = 0.05$ was conducted with Adam; the predicted curve had a large variance and its accuracy clearly deteriorated due to the noise.

Also, even though we use our method in this letter to modify the popular optimizer Adam, we encourage the reader to keep in mind that it can be integrated into other stochastic gradient descent methods that rely on EMAs, such as RMSProp [14], VSGD-fd [16], Adasecant [17] or Adabound [15].

A. Adaptive moment estimation: Adam

Before describing our proposal, let us introduce Adam [4], the baseline of TAdam. Adam is a popular method that combines the advantages of SGD with momentum along with those of adaptive learning rate methods [14], [13]. Its update rule is implemented as follows:

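$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{2}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{3}$$

$$\theta_t = \theta_{t-1} - \alpha \frac{m_t / (1 - \beta_1^t)}{\sqrt{v_t / (1 - \beta_2^t)} + \epsilon} \tag{4}$$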

where $m_t$ is the first-order moment (i.e., the mean of the gradients) and $v_t$ is the second-order moment utilized to adjust learning rates at time step $t$. $\beta_1$ and $\beta_2$ are the exponential decay rates (by default 0.9 and 0.999, respectively). $\alpha$ is the global learning rate and $\epsilon$ is a small value added to avoid division by zero (typical value of $10^{-8}$).

Although the use of EMAs in equations (2) and (3) smooths the gradients and reduces the fluctuations inherent to SGD, these EMAs are also sensitive to outliers. In particular, with a small value of $\beta_1$ (= 0.9), the momentum $m_t$ is very likely to be pulled away by outliers and to deviate from the true average. This fluctuation makes learning unstable (see Fig. 1), and therefore more robust learning techniques are needed.
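
To make this sensitivity concrete, the following sketch implements one Adam step for a scalar parameter and feeds it a single outlier gradient. It follows the standard Adam formulation with bias correction from Kingma and Ba [4]. The function name `adam_step` and the toy gradient sequence are assumptions made for illustration.

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * g        # EMA of the gradients (first moment)
    v = beta2 * v + (1.0 - beta2) * g * g    # EMA of the squared gradients (second moment)
    m_hat = m / (1.0 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)           # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 51):                       # 50 well-behaved gradients equal to 1
    theta, m, v = adam_step(theta, 1.0, m, v, t)
m_before = m

theta, m, v = adam_step(theta, 100.0, m, v, 51)   # one outlier gradient
m_after = m
```

With $\beta_1 = 0.9$ the outlier enters the first moment with weight 0.1, so `m_after` is more than ten times `m_before` after a single aberrant sample.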

B. Overview

Our proposition relies on the fact that the EMA, like equations (2) and (3), can be regarded as an incremental update law of the mean in the normal distribution with a fixed number of samples. The sensitivity of Adam to aberrant gradient values is therefore just a feature inherited from the normal distribution, which is itself also sensitive to outliers.

image

Fig. 2: Robustness to outliers: the normal distribution (in green) was pulled toward the outliers; in contrast, the student-t distribution (in red) tolerated their existence and hardly moved.

image

In order for Adam to be robust, the gradients must be assumed to come from a robust probability distribution. We therefore propose to replace the normal distribution moment estimates by those from the student-t distribution, which is well known to be a robust probability distribution [7], [27], [28], as shown in Fig. 2, and a general form of the normal distribution. In the next section, we describe how the EMA is replaced using the student-t distribution; the features of our implementation are analyzed afterwards. Pseudocode for TAdam is summarized in Algorithm 1.

C. Formulation

To replace the EMA by the student-t distribution, a new hyperparameter, the degrees of freedom $\nu$ of the student-t distribution, is introduced to control the robustness.

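We can derive the incremental update law of the first moment $\mu$ for the student-t distribution using a maximum log-likelihood estimator. Given $x_1, \ldots, x_n$, $d$-dimensional i.i.d. random samples from a multivariate student-t distribution $p_t$ with the parameters $\mu$, $\Sigma$ and $\nu$, the log-likelihood function is expressed as:

$$\ln \prod_{i=1}^{n} p_t(x_i \mid \mu, \Sigma, \nu) = \sum_{i=1}^{n} \left[ \ln \Gamma\!\left(\frac{\nu + d}{2}\right) - \ln \Gamma\!\left(\frac{\nu}{2}\right) - \frac{d}{2} \ln(\nu\pi) - \frac{1}{2} \ln |\Sigma| - \frac{\nu + d}{2} \ln\!\left(1 + \frac{D_i}{\nu}\right) \right] \tag{5}$$

where $D_i = (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$. Taking the gradient with respect to $\mu$ and setting it equal to 0 gives us:

$$\sum_{i=1}^{n} \frac{\nu + d}{\nu + D_i} \Sigma^{-1} (x_i - \mu) = 0 \tag{6}$$

If we solve this equation for $\mu$, we get the expression of the first moment estimate given $n$ samples:

$$\mu = \frac{1}{W_n} \sum_{i=1}^{n} w_i x_i \tag{7}$$

where $w_i = (\nu + d)/(\nu + D_i)$ and $W_n = \sum_{i=1}^{n} w_i$.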
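
Since the weights $w_i$ themselves depend on $\mu$, the weighted mean above is an implicit expression. One way to illustrate it is a simple fixed-point iteration on one-dimensional data. This is a sketch only: the function name `student_t_location`, the fixed scale of 1, the choice $\nu = 1$ and the sample values are all assumptions made here.

```python
def student_t_location(xs, nu=1.0, scale=1.0, iters=50):
    """Fixed-point iteration mu <- sum(w_i * x_i) / sum(w_i) for 1-D data (d = 1)."""
    d = 1
    mu = sum(xs) / len(xs)                   # start from the ordinary sample mean
    for _ in range(iters):
        # w_i = (nu + d) / (nu + D_i), with D_i the squared scaled distance to mu
        ws = [(nu + d) / (nu + ((x - mu) / scale) ** 2) for x in xs]
        mu = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
    return mu


samples = [0.9, 1.0, 1.1, 0.95, 1.05, 50.0]  # five inliers near 1 and one outlier
sample_mean = sum(samples) / len(samples)    # dragged far away by the outlier
robust_mean = student_t_location(samples)    # stays close to the inliers
```

The outlier receives a weight close to zero, so `robust_mean` remains near 1 while `sample_mean` is above 9.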

By assuming a diagonal distribution and fixing the number of samples (decaying $W_n$), we can derive equation (8), used below in TAdam. Because $\beta_2$ is large (0.999, i.e., about 1000 samples) compared with $\beta_1$ (0.9, i.e., about 10 samples), only the first-order moment in equation (2) is replaced, by the following rule:

$$m_t = \frac{W_{t-1}}{W_{t-1} + w_t} m_{t-1} + \frac{w_t}{W_{t-1} + w_t} g_t \tag{8}$$

where

$$w_t = \frac{\nu + d}{\nu + \sum_{j=1}^{d} \frac{(g_t^j - m_{t-1}^j)^2}{v_{t-1}^j + \epsilon}} \tag{9}$$

$$W_t = \frac{2\beta_1 - 1}{\beta_1} W_{t-1} + w_t \tag{10}$$

$v_{t-1}$ is the unmodified Adam second moment estimate coming from equation (3), and $d$ is the dimension of the gradient $g_t$ (i.e., the number of parameters in subsets such as the layers of a deep network). From now on, the summation in the denominator of $w_t$ is denoted by $D_t$, since it corresponds to the Mahalanobis distance between the gradient $g_t^j$ of the parameter $\theta^j$ and the corresponding previous estimate of the mean, $m_{t-1}^j$, w.r.t. the variance, which is assumed to be the same as Adam's second moment estimate $v_{t-1}^j$. Note that, ultimately, the gradients converge to zero, and therefore the second moment would be consistent with the variance of the gradients.

The power of this update rule is twofold: outlier detection and robustness control. Their details are explained below.

D. The outliers detection

This is performed through $w_t$, the adaptive weight of the mean introduced in equation (8), which depends on the degrees of freedom $\nu$. Again, we can notice that $w_t$ depends on the Mahalanobis distance $D_t$. Hence, outlying gradient values are down-weighted, since their Mahalanobis distances are larger than those of normal values, and their contribution to the momentum update is therefore automatically dampened. Conversely, normal gradients are up-weighted, ultimately by $1 + d/\nu$ when the Mahalanobis distance is zero, although $m$ is left unchanged in that case since $m_{t-1} = g_t$. In short, TAdam automatically and continuously reduces only the adverse effects of the outlier gradients.
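
The down-weighting can be illustrated with a minimal sketch of the first-moment update. It assumes the weight $w_t = (\nu + d)/(\nu + D_t)$, the weighted-mean update with the running sum $W$, and the decay factor $(2\beta_1 - 1)/\beta_1$ for $W$. The function name `tadam_first_moment`, the `eps` safeguard inside the distance and the starting value of `W_prev` are assumptions made here, and this is not the authors' reference implementation.

```python
def tadam_first_moment(m, W, g, v, nu, beta1=0.9, eps=1e-8):
    """One TAdam-style first-moment update for a d-dimensional gradient.

    m, g and v are lists of length d (previous mean, current gradient and
    previous second-moment estimate); W is the running sum of the weights.
    """
    d = len(g)
    # Squared Mahalanobis-type distance between g and the previous mean.
    D = sum((gj - mj) ** 2 / (vj + eps) for gj, mj, vj in zip(g, m, v))
    w = (nu + d) / (nu + D)                        # adaptive weight of the new gradient
    m_new = [(W * mj + w * gj) / (W + w) for mj, gj in zip(m, g)]
    W_new = (2.0 * beta1 - 1.0) / beta1 * W + w    # decayed sum of the weights
    return m_new, W_new, w


beta1 = 0.9
m_prev, v_prev = [1.0], [1.0]        # previous mean and second moment (d = 1)
W_prev = beta1 / (1.0 - beta1)       # assumed steady-state sum of weights
outlier = [100.0]

m_t, _, w_t = tadam_first_moment(m_prev, W_prev, outlier, v_prev, nu=1.0)
m_adam = [beta1 * m_prev[0] + (1.0 - beta1) * outlier[0]]      # plain EMA
m_big_nu, _, _ = tadam_first_moment(m_prev, W_prev, outlier, v_prev, nu=1e12)
```

In this toy case the outlier barely moves `m_t`, whereas the plain EMA jumps to about 10.9. With a very large `nu` the same function reproduces the EMA value.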

E. The robustness control

The Student-t distribution has a controllable robustness and the nice property of approaching the normal distribution as the degrees of freedom grow larger. The same feature is retained in TAdam, as can be seen in equation (9). Namely, when $\nu \to \infty$, we have:

image

In this case, TAdam loses its robustness to outliers, like Adam.

To make TAdam an extended version of Adam, the decay rule in equation (10) is designed to fulfill some requirements. Specifically, if $\nu \to \infty$, the decay rate derived from $W_{t-1}$ and $w_t$ in equation (8) must be consistent with $\beta_1$ at any time.

image

To satisfy such a constant $W$, the decay rate in equation (10) can be derived as follows if the decay rule is given as $W_t \leftarrow \gamma W_{t-1} + w_t$.

image

By the above derivation, TAdam defined by equations (8)–(10) is proved to be an extended version of Adam defined by equation (2) (and equations (3) and (4)).
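
As a worked check of this reduction, the limit and the decay rate can be written out step by step. This is a sketch under two assumptions: that the weight has the form $w_t = (\nu + d)/(\nu + D_t)$, and that $W$ has reached a constant value. It is not a transcription of the original equations.

```latex
\begin{align*}
\lim_{\nu \to \infty} w_t
  &= \lim_{\nu \to \infty} \frac{\nu + d}{\nu + D_t} = 1, \\
\frac{W}{W + 1} = \beta_1
  &\;\Longrightarrow\; W = \frac{\beta_1}{1 - \beta_1}, \\
W = \gamma W + 1
  &\;\Longrightarrow\; \gamma = \frac{W - 1}{W} = \frac{2\beta_1 - 1}{\beta_1}.
\end{align*}
```

For $\beta_1 = 0.9$ this gives $W = 9$ and $\gamma = 8/9$, and the weight placed on the new gradient, $1/(W + 1) = 0.1$, equals Adam's $1 - \beta_1$.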

F. The Regret Bound and TAdam’s Convergence

The convergence of the TAdam algorithm is assured by the two following theorems, whose proofs can be found in the appendix:

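Theorem 1. Let $\{\theta_t\}_0^T$ and $\{v_t\}_0^T$ be the sequences obtained from the TAdam algorithm, with $\alpha_t = \alpha/\sqrt{t}$, $\beta_w = \beta_{1t}$, $E[\beta_w] \le \bar{\beta}_w < 1$ and $\gamma = \bar{\beta}_w/\sqrt{\beta_2} < 1$. Assume that $F$ has a bounded diameter $D_\infty$ and that $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $t \in [T]$ and $\theta \in F$. Then, for $\theta_t$ generated using TAdam (with the AMSGrad [29] scheme), we have the following upper bound on the regret: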

image

Theorem 2. Assume that the gradients $g$ ultimately follow an asymptotic normal distribution $g \in \mathbb{R}^d \sim \mathcal{N}$, according to the central limit theorem; then the Mahalanobis distance appearing in TAdam follows a chi-squared distribution, $D_M^2(g, \mu) = \sum_j (g_j - \mu_j)^2/\sigma_j^2 \sim \chi^2(d)$, and the expected value of the adaptive decay parameter $\beta_w = W_{t-1}/(W_{t-1} + w_t)$ is constrained, for $\beta_1 < 1$, by the following relation:

image

We can see that the difference between the upper bounds of TAdam and Adam lies in the value of $\bar{\beta}_w$, which corresponds to the expected value of the adaptive exponential decay parameter $\beta_w = W_{t-1}/(W_{t-1} + w_t)$. Theorem 2 tells us that, if the gradients are normally distributed, this value is bounded above by $\beta_1$, so that we can recover the same upper bound for TAdam and Adam. However, if we know the exact value of the expected value, a more precise upper bound for the regret can be obtained.

To assess the robustness of TAdam against noisy data, we conducted three types of experiments spanning the main machine learning frameworks, i.e. supervised learning (regression and classification) and reinforcement learning. We compare TAdam mainly with Adam, but also with another robust gradient descent algorithm, RoAdam [18].

A. Robust Supervised Learning

It has been shown [30] that training standard supervised learning algorithms with noisy data results in poor performance and accuracy of the resulting models. In real robotic tasks, it is often unrealistic to assume that the true state is completely observable and noise-free, and perfect supervised signals are difficult to obtain. In the following experiments, TAdam proves useful in increasing the accuracy of the models, even when facing noisy inputs.

1) Robust Regression:

image

Fig. 3: Results of the regression task: (first two figures) loss function w.r.t. the noise probability $p$; in all the noise settings, TAdam outperformed Adam. (Last two figures) Prediction curves after learning; although Adam suffered from a large variance and a poor prediction accuracy under large noise, TAdam relatively succeeded in approximating the ground truth function.

image

Fig. 4: Training and test accuracy (noise-free and noise-included) and loss (noise-free) for ResNet-34 on CIFAR-100.

a) Experimental settings: The regression setting on which we compared TAdam, Adam and RoAdam is as follows. A ground truth function is defined as $f(x) = \sin(2\pi x)$, and we set a fully-connected neural network to approximate it from scattered observations $t$, sampled from the true function with noise. The observations have a probability $p$ of being corrupted by some noise $\zeta$, so that:

image

where $\mathrm{St}(\nu_\zeta, \lambda_\zeta)$ designates a student-t distribution with degrees of freedom $\nu_\zeta$, location 0 and scale $\lambda_\zeta$, and $\mathrm{Bern}(p/100)$ is a Bernoulli distribution whose parameter is the probability $p$, expressed in percent. The model, on the other hand, is a neural network with 5 linear layers, each composed of 50 neurons. The ReLU activation function [31] is used for all the hidden layers, while the loss function for the network is the Mean Squared Error (MSE).

b) Experimental results: The loss against the noise probability $p$ on the regression task is depicted in Fig. 3a and Fig. 3b. Note that 50 trials are conducted for each $p$. As can be seen, TAdam outperformed Adam in all the cases and proves to be more robust than RoAdam. In addition, TAdam managed to resist the effect of the noise as its probability in the observations increased. To visualize the learning results, the predicted curves after learning are also illustrated in Fig. 3c and Fig. 3d. The learning variances of Adam were clearly larger than those of TAdam, and TAdam relatively succeeded in following the ground truth function from the observations even with large noise.
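
The noisy observations described in the experimental settings can be sketched as follows. This is an illustration under assumptions: inputs are drawn uniformly from [0, 1), the noise parameters default to the values quoted in the caption of Fig. 1 ($\nu = 1$, $\lambda = 0.05$), and the helper names `student_t` and `noisy_observations` are hypothetical.

```python
import math
import random


def student_t(rng, nu, scale):
    """Draw from a zero-location Student-t with nu degrees of freedom and a given scale."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)   # chi-squared(nu) = Gamma(shape=nu/2, scale=2)
    return scale * z / math.sqrt(chi2 / nu)


def noisy_observations(n, p, nu=1.0, scale=0.05, seed=0):
    """Sample n points of sin(2*pi*x); each is corrupted with probability p percent."""
    rng = random.Random(seed)
    xs, ts, corrupted = [], [], []
    for _ in range(n):
        x = rng.random()                     # assumed input range: [0, 1)
        hit = rng.random() < p / 100.0       # Bernoulli(p/100) mask
        noise = student_t(rng, nu, scale) if hit else 0.0
        xs.append(x)
        ts.append(math.sin(2.0 * math.pi * x) + noise)
        corrupted.append(hit)
    return xs, ts, corrupted


xs, ts, corrupted = noisy_observations(n=200, p=30)
```

With $\nu = 1$ the noise is Cauchy-like, so a few of the corrupted targets can lie very far from the sine curve, which is the kind of outlier the optimizer has to withstand.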

2) Robust Classification:

a) Experimental settings: Here, we use the same experimental settings as described in [15] and compare Adam, AMSGrad and their T versions, along with RoAdam, on an image classification task on the standard CIFAR-100 dataset. The architecture of the convolutional network involved in the described experiments is ResNet-34 [32]. A fixed budget of 200 epochs is used throughout the training, and the learning rates are reduced by a factor of 10 after 150 epochs. The optimizers are launched with the following hyperparameter values: {learning rate: 0.001}, {betas: (0.99, 0.999)}, and both T algorithms use the default degrees of freedom, i.e. {degrees of freedom = dimension of the gradients}. The third beta value of RoAdam is also set to {0.999}.

b) Experimental results: We first launched a simulation without noise, directly using the unmodified datasets. The results for that simulation are found in Fig. 4a and Fig. 4d. We can see that TAdam and TAMSGrad achieve faster convergence during the training phase than the standard versions, and also show a higher level of generalization during the test phase. The corresponding loss curves, Fig. 4b and Fig. 4e, show that TAMSGrad reaches the lowest point during the training phase, while also keeping a low loss value on the test data. This result indicates that TAMSGrad builds on the combined improvement of the first moment (TAdam) and second moment (AMSGrad) to provide a more stable algorithm that can outperform the others. Next, we applied, with a probability of 25%, a color jittering effect on the training dataset and replaced 20% of the original training data points by fake ones, in order to test the ability of the optimizers to extract the most useful information from corrupted datasets. The results, shown in Fig. 4c and Fig. 4f, highlight the benefits of TAdam over Adam. Indeed, even though the value of beta1 is larger (0.99 instead of the default 0.9), Adam remains sensitive to outliers, while TAdam can ignore them.

image

Fig. 5: Training curves for PPO agent.

B. Robust Reinforcement Learning

Whether it comes from sensors, from bad estimates during learning, or from differing feedback from different human instructors (e.g. non-technical users in real-world robotics situations), noise is inseparable from robot reinforcement learning (RL). In order to test the robustness properties of TAdam in RL tasks, we conducted simulations on six different PyBullet gym environments [33]. The results are summarized in Fig. 5.

a) Experimental settings: Fig. 5 summarizes four trials with four different seeds on each environment. The algorithm employed is proximal policy optimization (PPO) [34], from the Berkeley Artificial Intelligence Research implementation, rlpyt [35], with the following settings:

image

TABLE I: Settings for the RL experiments

No gradient norm clipping was used throughout the simulations, since the property under test is the robustness of the optimizers to aberrant gradient values and their ability to produce good policies. Gradient norm clipping introduces a manually defined heuristic threshold, which depends on the task and on various conditions and, moreover, is applied to every gradient whose norm exceeds it. Such a trick would therefore introduce some undesirable bias in the results.

The simulations involved two different learning rates: the widely used and fine-tuned value for Adam, $3 \times 10^{-4}$, and the default, yet larger, value, $1 \times 10^{-3}$.

b) Experimental results: Searching for the optimal learning rate is commonly known to be a tedious and serious problem in SGD-based algorithms. High learning rates (particularly the default Adam step size, $1 \times 10^{-3}$) are usually not used in reinforcement learning because of the amount of noise coming from the early bootstrapping stage, and also to prevent the agent from reaching a deterministic policy too early. As displayed by the results in Fig. 5, a high learning rate causes Adam to suffer from both these problems and makes it unable to converge to a good policy. On the other hand, TAdam proves to be robust enough to sustain different learning rates, and learns the tasks with both given hyperparameter values. Thanks to its careful updates of the agent, TAdam can still reach a sub-optimal policy that may even be better than the one reached with smaller learning rates (Fig. 5c, 5f). This feature of TAdam not only allows the use of higher learning rates to accelerate the learning process, but also reduces the difficulties related to the tuning of the learning rate, since the default learning rate can be used directly. Also, as stated in the experimental settings, no gradient norm clipping was used during the simulations. Without this trick, we can see that Adam fails altogether on the inverted double pendulum task, while TAdam naturally and automatically ignores or reduces the effect of large gradients, keeping the gradient (momentum) from overshooting during learning and making gradient norm clipping unnecessary.

In this letter, we proposed and described TAdam, a new stochastic gradient optimizer, which makes the Adam algorithm much more robust and provides a way to produce stable and efficient machine learning applications. TAdam is based on the robust mean estimate rule of the Student-t distribution as an alternative to the standard EMA. We verified that TAdam outperformed Adam in terms of robustness on supervised learning (regression and classification) tasks and on reinforcement learning tasks.

In this work, TAdam uses a fixed number of degrees of freedom $\nu$, which is equal to the dimension of the gradients, and therefore has a fixed robustness. A straightforward improvement would therefore be to design a mechanism that automatically updates the parameter $\nu$ during the learning process, according to the presence or absence of outliers.

image

A. Proof of Theorem 1

First, we start by noticing that the basic bound of the regret from the convergence proof by Reddi et al. [29] also holds for TAdam, i.e.:

image

However, to further refine this upper bound, we need to redefine Lemma 2 used in the proof of Reddi et al., since $\beta_{1t} = \beta_w = W_{t-1}/(W_{t-1} + w_t) \le \beta_1$ no longer holds for every time step $t$. For this purpose, we use the expected value of $\beta_w$, $\bar{\beta}_w = E[\beta_w] < 1$, instead of $\beta_1$ to define the upper bound and, following the same process as Reddi et al., define an expression similar to their Lemma 2 in the case of TAdam:

image

Based on this new lemma, the remaining steps are completely identical to the proof of Reddi et al., and the final regret bound of TAdam is given by:

image

B. Proof of Theorem 2

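Assuming that the gradients $g$ ultimately follow an asymptotic normal distribution $g \in \mathbb{R}^d \sim \mathcal{N}(\mu, \Sigma)$, we know that $D_M^2(g, \mu) = \sum_j (g_j - \mu_j)^2/\sigma_j^2 \sim \chi^2(d)$, where $\sigma_j^2$ are the diagonal entries of $\Sigma$ and $d$ is the number of degrees of freedom of the chi-squared distribution. Applying this to the Mahalanobis distance in TAdam, we have: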

image

Now, we know that the expected value of the chi-squared distribution with $d$ degrees of freedom is $E[D_t] = d$ and that the expected value of the inverse-chi-squared distribution with the same degrees of freedom is $E[D_t^{-1}] = 1/(d - 2)$, $\forall d > 2$. We can therefore define:

image

This inequality comes from Jensen's inequality and from the fact that $f(x) = 1/(x + \nu)$ and $f(x) = 1/(x^{-1} + \nu)$ are respectively convex and concave. The expected value of the weights $w_t$ in TAdam can therefore be expressed as:
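
As a numerical sanity check of this step, the two Jensen bounds can be compared with a Monte Carlo estimate. The sketch assumes that the bounds take the form $1/(d + \nu) \le E[1/(D_t + \nu)] \le 1/(d - 2 + \nu)$, which follows from applying Jensen's inequality to the two functions above with $E[D_t] = d$ and $E[D_t^{-1}] = 1/(d - 2)$. The function name, the values $d = \nu = 10$ and the sample size are illustrative choices.

```python
import random


def mean_inverse_shifted_chi2(d, nu, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[1 / (D + nu)] for D ~ chi-squared(d)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        D = rng.gammavariate(d / 2.0, 2.0)   # chi-squared(d) = Gamma(shape=d/2, scale=2)
        total += 1.0 / (D + nu)
    return total / n_samples


d, nu = 10, 10.0                         # illustrative values, with nu equal to d
estimate = mean_inverse_shifted_chi2(d, nu)
lower = 1.0 / (d + nu)                   # Jensen with the convex map x -> 1/(x + nu)
upper = 1.0 / (d - 2 + nu)               # Jensen with the concave map x -> 1/(1/x + nu)
expected_weight = (nu + d) * estimate    # E[w_t] for w_t = (nu + d) / (nu + D_t)
```

Under these assumptions the estimate falls between the two bounds, so the expected weight lies between 1 and $(\nu + d)/(\nu + d - 2)$.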

image

We can then infer the mean of the weighted sum  Wt:

image

where we have defined $a = (2\beta_1 - 1)/\beta_1$ and taken advantage of the monotonic decrease of the sequence $a^t$ towards 0, given that $a < 1$ for $\beta_1 < 1$. We move on to express the upper bound for $E[\beta_w]$, where $\beta_w = W_{t-1}/(W_{t-1} + w_t)$. For this purpose, we make use of the Hartley and Ross unbiased estimator for the mean of the ratio between two random variables [36], [37], which, based on the fact that the covariance between $W_{t-1}$ and $W_{t-1} + w_t$ is positive, gives:

image

The last inequality is drawn from the relations depicted by Eq. 24.

[1] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186, Springer, 2010.

[2] H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951.

[3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

[4] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[5] M. Suchi, T. Patten, D. Fischinger, and M. Vincze, “Easylabel: A semiautomatic pixel-wise object annotation tool for creating robotic rgb-d datasets,” in International Conference on Robotics and Automation (ICRA), pp. 6678–6684, IEEE, 2019.

[6] A. Gupta, A. Murali, D. P. Gandhi, and L. Pinto, “Robot learning in homes: Improving generalization and reducing dataset bias,” in Advances in Neural Information Processing Systems, pp. 9094–9104, 2018.

[7] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.

[8] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.

[9] Y. Nesterov, “A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2),” in Doklady AN USSR, vol. 269, pp. 543–547, 1983.

[10] N. L. Roux, M. Schmidt, and F. R. Bach, “A stochastic gradient method with an exponential convergence rate for finite training sets,” in Advances in neural information processing systems, pp. 2663–2671, 2012.

[11] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Advances in neural information processing systems, pp. 315–323, 2013.

[12] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.

[13] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[14] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.

[15] L. Luo, Y. Xiong, Y. Liu, and X. Sun, “Adaptive gradient methods with dynamic bound of learning rate,” arXiv preprint arXiv:1902.09843, 2019.

[16] T. Schaul and Y. LeCun, “Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients,” arXiv preprint arXiv:1301.3764, 2013.

[17] C. Gulcehre, J. Sotelo, M. Moczulski, and Y. Bengio, “A robust adaptive stochastic gradient method for deep learning,” in 2017 International Joint Conference on Neural Networks (IJCNN), pp. 125–132, IEEE, 2017.

[18] Y. Haimin, P. Zhisong, and T. Qing, “Robust and adaptive online time series prediction with long short-term memory,” Computational Intelligence and Neuroscience, vol. 2017, pp. 1–9, 2017.

[19] M. J. Holland and K. Ikeda, “Efficient learning with robust gradient descent,” Machine Learning, vol. 108, no. 8-9, pp. 1523–1560, 2019.

[20] M. Lerasle and R. I. Oliveira, “Robust empirical mean estimators,” arXiv preprint arXiv:1112.3914, 2011.

[21] S. Minsker et al., “Geometric median and robust estimation in banach spaces,” Bernoulli, vol. 21, no. 4, pp. 2308–2335, 2015.

[22] G. Lugosi and S. Mendelson, “Risk minimization by median-of-means tournaments,” arXiv preprint arXiv:1608.00757, 2016.

[23] D. Hsu and S. Sabato, “Loss minimization and parameter estimation with heavy tails,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 543–582, 2016.

[24] C. Brownlees, E. Joly, G. Lugosi, et al., “Empirical risk minimization for heavy-tailed losses,” The Annals of Statistics, vol. 43, no. 6, pp. 2507–2536, 2015.

[25] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings: Byzantine gradient descent,” ACM SIGMETRICS Performance Evaluation Review, vol. 46, no. 1, pp. 96–96, 2019.

[26] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar, “Robust estimation via robust gradient estimation,” arXiv preprint arXiv:1802.06485, 2018.

[27] O. Arslan, P. D. Constable, and J. T. Kent, “Convergence behavior of the em algorithm for the multivariate t-distribution,” Communications in statistics-theory and methods, vol. 24, no. 12, pp. 2981–3000, 1995.

[28] F. Z. Doğru, Y. M. Bulut, and O. Arslan, “Doubly reweighted estimators for the parameters of the multivariate t-distribution,” Communications in Statistics-Theory and Methods, vol. 47, no. 19, pp. 4751–4771, 2018.

[29] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.

[30] D. F. Nettleton, A. Orriols-Puig, and A. Fornells, “A study of the effect of different types of noise on the precision of supervised learning techniques,” Artificial intelligence review, vol. 33, no. 4, pp. 275–306, 2010.

[31] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and improving convolutional neural networks via concatenated rectified linear units,” in International Conference on Machine Learning, pp. 2217–2225, 2016.

[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

[33] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning.” http://pybullet.org, 2016–2019.

[34] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[35] A. Stooke and P. Abbeel, “rlpyt: A research code base for deep reinforcement learning in pytorch,” arXiv preprint arXiv:1909.01500, 2019.

[36] H. Hartley and A. Ross, “Unbiased ratio estimators,” Nature, vol. 174, no. 4423, pp. 270–271, 1954.

[37] L. A. Goodman and H. O. Hartley, “The precision of unbiased ratio-type estimators,” Journal of the American Statistical Association, vol. 53, no. 282, pp. 491–508, 1958.

