Choosing the Sample with Lowest Loss makes SGD Robust

2020·Arxiv

Abstract

Abstract

The presence of outliers can potentially significantly skew the parameters of machine learning models trained via stochastic gradient descent (SGD). In this paper we propose a simple variant of the simple SGD method: in each step, first choose a set of k samples, then from these choose the one with the smallest current loss, and do an SGD-like update with this chosen sample. Vanilla SGD corresponds to k = 1, i.e. no choice; 2 represents a new algorithm that is however effectively minimizing a non-convex surrogate loss. Our main contribution is a theoretical analysis of the robustness properties of this idea for ML problems which are sums of convex losses; these are backed up with linear regression and small-scale neural network experiments.

1 Introduction

This paper focuses on machine learning problems that can be formulated as optimizing the sum of n convex loss functions:

where F(w) = ) is the sum of convex, continuously differentiable loss functions. Stochastic gradient descent (SGD) is a popular way to solve such problems when n is large; the simplest SGD update is:

where the sample is typically chosen uniformly at random from [n].

However, as is well known, the performance of SGD and most other stochastic optimization methods is highly sensitive to the quality of the available training data. A small fraction of outliers can cause SGD to converge far away from the true optimum. While there has been a significant amount of work on more robust algorithms for special problem classes (e.g. linear regression, PCA etc.) in this paper our objective is to make a modification to the basic SGD method itself; one that can be easily applied to the many settings where vanilla SGD is already used in the training of machine learning models.

We call our method Min-k Loss SGD (MKL-SGD), given below (Algorithm 1). In each iteration, we first choose a set of k samples and then select the sample with the smallest current loss in that set; the gradient of this sample is then used for the update step.

skew are often those with high loss. In this paper, our focus is on the stochastic setting for standard convex functions. We show that it provides a certain degree of robustness against outliers/bad training samples that may otherwise skew the estimate.

Our Contributions

• To keep the analysis simple yet insightful, we define four natural and deterministic problem settings -noiseless with no outliers, noiseless with outliers, and noisy with and without outliers - in which we study the performance of MKL-SGD. In each of these settings the individual losses are assumed to be convex, and the overall loss is additionally strongly convex. We are interested in finding the optimum of the “good” samples, but we do not a-priori know which samples are good and which are outliers.

• The expected MKL-SGD update (over the randomness of sample choice) is not the gradient of the original loss function (as would have been the case with vanilla SGD); it is instead the gradient of a different non-convex surrogate loss, even for the simplest and friendliest setting of noiseless with no outliers. Our first result establishes that this non-convexity however does not yield any bad local minima or fixed points for MKL-SGD in this particular setting, ensuring its success.

• We next turn to the setting of noiseless with outliers, where the surrogate loss can now potentially have many spurious local minima. We show that by picking a value of k high enough (depending on the condition number of the loss functions that we define) the local minima of MKL-SGD closest to better than the (unique) fixed point of SGD.

• We establish the convergence rates of MKL-SGD-with and without outliers - for both the noiseless and noisy settings.

• We back up our theoretical results with both synthetic linear regression experiments that provide insight, as well as encouraging results on the MNIST and CIFAR-10 datasets.

2 Related Work

The related work can be divided into the following four main subparts:

Stochastic optimization and weighted sampling The proposed MKL-SGDalgorithm inherently implements a weighted sampling strategy to pick samples. Weighted sampling is one of the popular variants of SGD that can be used for matching one distribution to another (importance sampling), improving the rate of convergence, variance reduction or all of them and has been considered in [16, 33, 38, 19]. Other popular weighted sampling techniques include [26, 25, 23]. Without the assumption of strong convexity for each the weighted sampling techniques often lead to biased estimators which are difficult to analyze. Another idea that is analogous to weighted sampling includes boosting [11] where harder samples are used to train subsequent classifiers. However, in presence of outliers and label noise, learning the hard samples may often lead to over-fitting the solution to these bad samples. This serves as a motivation for picking samples with the lowest loss in MKL-SGD.

Robust linear regression Learning with bad training samples is challenging and often intractable even for simple convex optimization problems. For example, OLS is quite susceptible to arbitrary corruptions by even a small fraction of outliers. Least Median Squares (LMS) and least trimmed squares (LTS) estimator proposed in [31, 34, 35] are both sample efficient, have a relatively high break-down point, but require exponential running time to converge. [14] provides a detailed survey on some of these robust estimators for OLS problem. Recently, [5, 6, 32, 20, 18] have proposed robust learning algorithms for linear regression which require the computation of gradient over the entire dataset which may be computationally intractable for large datasets. Another line of recent work considers robustness in the high-dimensional setting ([27, 36, 8, 3, 24]) In this version, our focus is on general stochastic optimization in presence of outliers.

Robust optimization Robust optimization has received a renewed impetus following the works in [10, 22, 7]. In most modern machine learning problems, however, simultaneous access to gradients over the entire dataset is time consuming and often, infeasible. [9, 28] provides robust meta-algorithms for stochastic optimization under adversarial corruptions. However, both these algorithms require the computation of one or many principal components per epoch which requires atleast ) computation ([1]). In contrast, MKL-SGDalgorithm runs in O(k) computations per iteration where k is the number of loss evaluations per epoch. In this paper, we don’t consider the stronger adversarial model, our focus is on a tractable method that provides robustness on a simpler corruption model (as defined in the next section).

Label noise in deep learning [2, 21, 4] describe different techniques to learn in presence of label noise and outliers. [30] showed that deep neural networks are robust to random label noise especially for datasets like MNIST and CIFAR10. [15, 29] propose optimization methods based on re-weighting samples that often require significant pre-processing. In this paper, our aim is to propose a computationally inexpensive optimization approach that can also provide a certain degree of robustness.

3 Problem Setup

We make the following assumptions about our problem setting described in 1. Let O be the set of outlier samples; this set is of course unknown to the algorithm. We denote the optimum of the non-outlier samples by , i.e.

In this paper we show that MKL-SGD allows us to estimate without a-priori knowledge of the set O, under certain conditions. We now spell these conditions out.

is convex in w, with Lipschitz continuous gradients with constant .

and define L :=

It is common to also assume strong convexity of the overall loss function ). Here, since we are dropping samples, we need a very slightly stronger assumption.

Assumption 2 (Overall loss). For any size subset S of the samples, we assume the loss function strongly convex in w. Recall that here k is the size of the sample set in the MKL-SGD algorithm.

Lastly, we also assume that all the functions share the same minimum value. Assumption 3 is often satisfied by most standard loss functions with a finite unique minima [12] such as squared loss, hinge loss, etc.

Assumption 3 (Equal minimum values). Each of the functions ) shares the same minimum value min) = min.

We are now in a position to formally define three problem settings we will consider in this paper. For each i let := : ˆw = arg mindenote the set of optimal solutions (there can be more than one because ) is only convex but not strongly convex). Let d(a, S) denote the shortest distance between point a and set S.

Noiseless setting with no outliers: As a first step and sanity check, we consider what happens in the easiest case: where there are no outliers. There is also no “noise”, by which we mean that the optimum we seek is also in the optimal set of every one of the individual sample losses, i.e.

Of course in this case vanilla SGD (and many other methods) will converge to as well; we just study this setting as a first step and also to build insight.

Outlier setting: Finally, we consider the case where a subset O of the samples are outliers. Specifically, we assume that for outlier samples the we seek lies far from their optimal sets, while for the others it is in the optimal sets:

Note that now vanilla SGD on the entire loss function will not converge to .

Noisy setting: As a second step, we consider the case when samples are noisy but there are no outliers. In particular, we model noise by allowing to now be outside of individual optimal sets , but not too far; specifically,

For the noisy setting, we will focus only on the convergence guarantees. We will show that MKL-SGD gets close to in this setting; again in this case vanilla SGD will do so as well for the no outliers setting of course.

4 Understanding MKL-SGD

To build some intuition for MKL-SGD, we describe the notation and look at some simple settings. Recall MKL-SGD takes k samples and then retains the one with lowest current loss; this means it is sampling non-uniformly. For any w, let ) be the sorted order w.r.t. the loss at that w, i.e.

Recall that for a sample to be the one picked by MKL-SGD for updating w, it needs to first be part of the set of k samples, and then have the lowest loss among them. A simple calculation shows that probability that the best sample ) is the one picked by MKL-SGD is given by

In the rest of the paper, we will focus on the “with replacement” scenario for ease of presentation; this choice does not change our main ideas or results. With this notation, we can rewrite the expected update step of MKL-SGD as

For simplicity of notation in the rest of the paper, we will relabel the update term in the above by defining as follows:

Underlying this notation is the idea that, in expectation, MKL-SGD is akin to gradient descent on a surrogate loss function ˜) which is different from the original loss function ); indeed if needed this surrogate loss can be found (upto a constant shift) from the above gradient. We will not do that explicitly here, but instead note that even with all our assumptions, indeed even without any outliers or noise, this surrogate loss can be non-convex. It is thus important to see that MKL-SGD does the right thing in all of our settings, which is what we describe next.

4.1 Noiseless setting with no outliers

As a first step (and for the purposes of sanity check), we look at MKL-SGD in the simplest setting when there are no outliers and no noise. Recall from above that this means that is in the optimal set of every single individual loss ). However as mentioned above, even in this case the surrogate loss can be non-convex, as seen e.g. in Figure 1 for a simple example. However, in the following lemma we show that even though the

Figure 1: Non-convexity of the surface plot with three samples in the two-dimensional noiseless linear regression setting

overall surrogate loss ˜) is non-convex, in this no-noise no-outlier setting it has a special property with regards to the point .

Lemma 1. In the noiseless setting, for any w there exists a 0 such that

In other words, what this lemma says is that on the line between any point w and the point , the surrogate loss function ˜F is convex from any point – even though it is not convex overall. This is akin to the restricted secant inequality condition described in [17, 37]. The following theorem uses this lemma to establish our first result: that in the noiseless setting with no outliers, is the only fixed point (in expectation) of MKL-SGD.

Theorem 1 (Unique stationary point). For the noiseless setting with no outliers, and under assumptions 1 3, the expected MKL-SGD update satisfies ) = 0 if and only if .

4.2 Noiseless setting with Outliers

In presence of outliers, the surrogate loss can have multiple local minima that are far from and indeed potentially even worse than what we could have obtained with vanilla SGD on the original loss function. We now analyze MKL-SGD in the simple setting of symmetric squared loss functions and try to gain useful insights into the landscape of loss function for the scalar setting. We would like to point out that the analysis in the next part serves as a clean template and can be extended for many other standard loss functions used in convex optimization.

Squared loss in the scalar setting Figure 2 will be a handy tool for visualizing and understanding both the notation and results of this subsection. Consider the case where all losses are squared losses, with all the clean samples centered at and all the outliers at , but all having different Lipschitz constants. Specifically, consider:

Let := minand Let := maxand = max= min. Let us define

. We initialize MKL-SGD at , a point where the losses of outlier samples are 0 and

all the clean samples have non-zero losses. As a result at , MKL-SGD has a tendency to pick all the outlier samples with a higher probability than any of the clean samples. This does not bode well for the algorithm since this implies that the final stationary point will be heavily influenced by outliers. Let ¯be the stationary point of MKL-SGD for this scalar case when initialized at .

represent the total probability of picking outliers at the starting point . The maximum possible value that can be attained ˆp over the entire landscape is given as:

The next condition gives a sufficient condition to avoid all the bad local minima are avoided no matter where we initialize. For the simple scalar case, the condition is:

To further elaborate on this, for the loss functions and defined in equations (4) and (5) respectively, if condition 1 is not satisfied, then we cannot say anything about where MKL-SGD converges. However, if condition 1 holds true, then we are in Case 1 (Figure 2), i.e. the stationary point attained by MKL-SGD will be such that it is possible to avoid the existence of the first bad local minima. The first bad local minima occurs by solving the optimization problem where the top-|O| highest probabilities are assigned to the bad samples.

Following the above analysis recursively, we can show that all other subsequent bad local minimas are avoided as well, until we reach the local minima which assigns the largest () probabilities to the clean samples. This indicates that irrespective of where we initialize in the 1D landscape, we are bound to end

Figure 2: Illustration with conditions when bad local minima will or will not exist. Here, we demonstrate that even if we start at an initialization that assigns the highest probabilities to bad samples (red), it is possible to avoid the existence of a bad local minima if Condition 1 is satisfied. Recursively, we show in Lemma 2 that it is possible to avoid all bad local minima and reach a good local minima (where the good samples have the highest probabilities)

up at a local minima with the highest probabilities assigned to the clean samples. In the latter part of this section, we will show that MKL-SGD solution attained when Case 1 holds is provably better than the SGD solution. However, if condition 1 is false (Case 2, Figure 2), then it is possible that MKL-SGD gets stuck at any one of the many local minimas that exist close to the outlier center and we cannot say anything about the relative distance from .

A key takeaway from the above condition is that . For a fixed and n, increasing the parameter k (upto ) in MKL-SGD leads to an increase in ˆp and thus increasing k can lead to the violation of the above condition. This happens because samples with lower loss will be picked with increasing probability as k increases and as a result the propensity of MKL-SGD to converge towards the closest stationary point it encounters is higher.

Squared loss in the vector setting The loss functions are redefined as follows:

Without loss of generality, assume that 2and

. Let ¯w be any stationary attained by MKL-SGD. Suppose be the angle betweenthe line passing through and and the line connecting ¯w and . Let us define := max

starting point . The maximum possible value that can be attained ˆp is given as:

In the subsequent lemma, we show that that it is possible to drift into the ball ) where the clean samples have the highest probability or the lowest loss.

Lemma 2. Consider the loss function and as defined in equations (7) and (9) respectively. Suppose q = cos

from any initialization , for any stationary point ¯w attained by MKL-SGD, we have that ¯)

In other words, initializing at any point in the landscape, the final stationary point attained by MKLSGD will inevitably assign the largest probabilities to the clean samples. The proof is availabe in Appendix Section 8.2.3. For the scalar case, d = 1, we have = 0 . If = 1 and all the outliers are centered at the same point, then in the scalar setting the condition in Lemma 2 reduces to condition 1.

Note that, the above lemma leads to a very strong worst-case guarantee. It states that the farthest optimum will always be within a bowl of distance r from no matter where we initialize. Moreover, as long as the condition is satisfied no matter where the outliers lie (can be adversarially chosen), MKL-SGD always has the propensity to bring the iterates to a ball of radius . However, when the necessary conditions for its convergence are violated, the guarantees are initialization dependent. Thus, all the discussions in the rest of this section will be with respect to these worst case guarantees. However, as we see in the experimental section for both neural networks and linear regression, random initialization also seems to perform better than SGD.

A direct result of Lemma 2 is that higher the condition number of the set of quadratic loss functions, lower is the fraction of outliers the MKL-SGD can tolerate. This is because large results in

a small value of

1 1 + . This implies that ˆp has to be small which in turn requires smaller fractions fo

corruptions, .

The relative distance of the outliers from plays a critical role in the condition for Lemma 2. We know that (0, 1]. = 1 implies the outliers are equidistant from the optimum . Low values of lead to a large q leading to the violation of the condition with ˆp (since RHS in the condition is very small), which implies that one bad outlier can guarantee that the condition in Lemma 2 are violated. The guarantees in the above lemma are only when the outliers are not adversarially chosen to lie at very high relative distances from . One way to avoid the set of outliers far far away from the optimum is to have a filtering step at the start of the algorithm like the one in [9]. We will refer this in Experiments.

Effect of cos At first glance, it may seem that cos = 0 may cause 1 + 0 and since ˆp(w) > 0, the condition in Lemma 2 may never be satisfied. Since, the term cos shows up in the denominator of the loss associated with outlier centered at . Thus, low values of cos implies high value of loss associated with the function centered at which in turn implies the maximum probability attained by that sample can never be in the top-|O| probabilities for that ¯w.

Analysis for the general outlier setting: In this part, we analyze the fixed point equations associated with MKL-SGD and SGD and try to understand the behavior sake of simplicity, we will assume that . Next, we analyze the following two quantities: i) distance of from and distance of the any of the solutions attained by from .

Lemma 3. Let indicate the solution attained SGD. Under assumptions 1-3, there exists an such that for all ,

Using Lemma 1, we will define as follows:

In Appendix Section 8.2.2, we show that 0, however the exact lower bounds for this are loss function dependent. Naively, = min

Lemma 4. Let be any first order stationary point attained by MKL-SGD. Under assumptions 1-3, for a given 1 and as defined in equation (10), there exists a such that for all ,

Finally, we show that any solution attained by MKL-SGD is provably better than the solution attained by SGD. We would like to emphasize that this is a very strong result. The MKL-SGD has numerous local minima and here we show that even the worstsolution attained by MKL-SGD is closer to than the solution attained by SGD. Let us define ) = (1

be the the stationary points attained by SGD and MKL-SGD algorithms respectively for the noiseless setting with outliers. Under assumptions 1-3, for any ) and defined in equation (10), there exists an such that for all and,

For squared loss in scalar setting, we claimed that for a fixed , using a large k may not be a good idea. Here, however once we are in the ball, ), using larger k (any k < n2 ), reduces ) and allows MKL-SGD to get closer to .

The conditions required in Lemma 2 and Theorem 2 enable us to provide guarantees for only a subset of relatively well-conditioned problems. We would like to emphasize that the bounds we obtain are worst case bounds and not in expectation. As we will note in the Section 6 and the Appendix, however these bounds may not be necessary, for standard convex optimization problems MKL-SGD easily outperforms SGD.

5 Convergence Rates

In this section, we go back to the in expectation convergence analysis which is standard for the stochastic settings. For smooth functions with strong convexity, [25, 26] provided guarantees for linear rate of convergence. We restate the theorem here and show that the theorem still holds for the non-convex landscape obtained by MKL-SGD in noiseless setting.

Lemma 5 (Linear Convergence [26])-strongly convex. Set with := argminF(w). Suppose 1sup. Let . After T iterations, SGD satisfies:

where ˆsup=

In the noiseless setting, we have = 0 and so := 0. in (12) is the same as stated in Theorem 1. Even though above theorem is for SGD, it still can be applied to our algorithm 1. At each iteration there exists a parameter that could be seen as the strong convexity parameter (c.f. Lemma 1). For MKL-SGD, the parameter (12) should be min. Thus, MKL-SGD algorithm still guarantees linear convergence result but with an implication of slower speed of convergence than standard SGD.

However, Lemma 5 will not hold for MKL-SGD in noisy setting since there exists no strong convexity parameter. Even for noiseless setting, the rate of convergence for MKL-SGD given in Lemma 5 is not tight. The upper bound in (12) is loosely set to the constant := minfor all the iterations. We fix it by concretely looking at each iteration. We give a general bound for the any stochastic algorithm (c.f. Theorem 3) for both noiseless and noisy setting in absence and presence of outliers.

Suppose at iteration, the stepsize is set as , then conditioned on the current parameter , the expectation of the distance between the and can be upper bounded as:

where

Theorem 3 implies that for any stochastic algorithm in the both noisy and noiseless setting, outliers can make the upper bound () much worse as it produces an extra term (the third term in . However, its impact can be reduced by appropriately setting ), for instance using a larger k in MKL-SGD. In the appendix, we also provide a sufficient condition (Corollary 1 in the Appendix) when MKL-SGD is always better than standard SGD (in terms of its distance from in expectation).

The convergence rate depends on the constant min). Note that this term minnot too small for our algorithm MKL-SGD since it is a minimum among all the good sample (not including the outliers). However, when compared with vanilla SGD where minmindefined in (3) for MKL-SGD, in some sense, could be smaller than 1/N. For instance, in the experiments given in Figure 5 (a)-(c) (Appendix 8.4.1), the slope of SGD is steeper than MKL-SGD, which implies that .

To understand the residual term . Let us take the noiseless setting with outliers for an example. We have ) = 0 for all = 0. Then the term can be reduced to

If we are at the same point for both SGD and MKL-SGD and 1 N for , we have

. It means that MKL-SGD could reach to a neighbor with a radius that is possibly smaller

than vanilla SGD algorithm, with a rate proportional to min) but not necessarily faster than vanilla SGD.

6 Experiments

In this section, we compare the performance of MKL-SGD and SGD for synthetic datasets for linear regression and small-scale neural networks.

6.1 Linear Regression

For simple linear regression, we assume that are sampled from normal distribution with different condition numbers. ) where D is a diagonal matrix such that and = 1 for all i). For the noisy case, we assume additive Gaussian noise with mean 0 and variance 1. We compare the performance of MKL-SGD and SGD for different values of (Fig. 3) under noiseless and noisy settings against varying levels of corruption . It is important to note that different values correspond to different rates of convergence. To ensure fair comparison, we run the algorithms till the error values stop decaying and take the distance of from the exponential moving average of the iterates.

Figure 3: Comparing the performance of MKL-SGD (k = 2) and SGD for different values of in noiseless and noisy linear regression against varying fraction of outliers.

Table 1: Comparing the test accuracy of SGD and MKL-SGD(k = 5/3) over MNIST and CIFAR-10 datasets in presence of corruptions via directed label noise.

6.2 Neural Networks

For deep learning experiments, our results are in presence of corruptions via the directed noise model. In this corruption model, all the samples of class a that are in error are assigned the same wrong label b. This is a stronger corruption model than corruption by random noise (results in Appendix). For the MKL-SGD algorithm, we run a more practical batched (size b) variant such that if k = 2 the algorithm picks b/2 samples out of b sample loss evaluations. The oracle contains results obtained by running SGD over only non-corrupted samples. More experimental results on neural networks for MNIST and CIFAR10 datasets can be found in the Appendix.

MNIST: We train standard 2 layer convolutional network on subsampled MNIST (5000 samples with labels). We train over 80 epochs using an initial learning rate of 0.05 with the decaying schedule of factor 5 after every 30 epochs. The results of the MNIST dataset are averaged over 5 runs.

CIFAR10: We train Resnet-18 [13] on CIFAR-10 (50000 training samples with labels) for over 200 epochs using an initial learning rate of 0.05 with the decaying schedule of factor 5 after every 90 epochs. The reported accuracy is based on the true validation set. The results of the CIFAR-10 dataset are averaged over 3 runs. Lastly, in Fig. 4, we show that for a neural network MKL-SGD typically has a higher training loss but smaller test loss which partially explains its superior generalization performance.

Figure 4: Comparing training loss, test loss and test accuracy of MKL-SGDand SGD. Parameters: = 0.2, k = 2, b = 16. The training loss is lower for SGD which means that SGD overfits to the noisy data. The lower test loss and higher accuracy demonstrates the robustness MKL-SGDprovides for corrupted data.

7 Conclusion and Future Work

In this paper, we propose MKL-SGD that is computationally inexpensive, has linear convergence (upto a certain neighborhood) and is robust against outliers. We analyze MKL-SGD algorithm under noiseless and noisy settings with and without outliers. MKL-SGD outperforms SGD in terms of generalization for both linear regression and neural network experiments. More importantly, MKL-SGD opens up a plethora of challenging questions with respect to understanding convex optimization in a non-convex landscape.

To ensure consistency, i.e. 0, we require that + 1. In all other cases, there will be a non-zero contribution from the outliers which keeps the MKL-SGD solution from exactly converging to . In this paper, we consider unknown should be treated as a hyperparameter. However, if we knew the fraction of corruption, then with the right k and smart initialization, it is possible to guarantee consistency. For neural network experiments in the Appendix, we show that tuning k as a hyperparameter can lead to significant improvements in performance in presence of outliers.

Preliminary experiments indicate that smarter initialization techniques can improve the performance of MKL-SGD. The obvious question then is to provide worst case guarantees for a larger subset of problems using smarter initialization techniques. It will be interesting to analyze the tradeoff between rates of convergence to MKL-SGD and its robustness to outliers. The worst case analysis in the noisy setting with and without outliers also remains an open problem.

References

[1] Farhad Pourkamali Anaraki and Shannon Hughes. Memory and computation efficient pca via very sparse random projections. In International Conference on Machine Learning, pages 1341–1349, 2014.

[2] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

[3] Sivaraman Balakrishnan, Simon S Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Conference on Learning Theory, pages 169–212, 2017.

[4] Yoshua Bengio, J´erˆome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.

[5] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.

[6] Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, pages 2110–2119, 2017.

[7] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47–60. ACM, 2017.

[8] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774–782, 2013.

[9] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.

[10] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.

[11] Yoav Freund, Robert Schapire, and Naoki Abe. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771-780):1612, 1999.

[12] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.

[14] Peter J Huber. Robust statistics. Springer, 2011.

[15] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017.

[16] Herman Kahn and Andy W Marshall. Methods of reducing sample size in monte carlo computations. Journal of the Operations Research Society of America, 1(5):263–278, 1953.

[17] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-�lojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

[18] Sushrut Karmalkar, Adam Klivans, and Pravesh Kothari. List-decodable linear regression. In Advances in Neural Information Processing Systems, pages 7423–7432, 2019.

[19] Angelos Katharopoulos and Fran¸cois Fleuret. Not all samples are created equal: Deep learning with importance sampling. arXiv preprint arXiv:1803.00942, 2018.

[20] Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. arXiv preprint arXiv:1803.03241, 2018.

[21] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.

[22] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.

[23] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 147–156. IEEE, 2013.

[24] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. arXiv preprint arXiv:1805.11643, 2018.

[25] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

[26] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in neural information processing systems, pages 1017–1025, 2014.

[27] Art B Owen. A robust hybrid of lasso and ridge regression. 2007.

[28] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

[29] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.

[30] David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.

[31] Peter J Rousseeuw. Least median of squares regression. Journal of the American statistical association, 79(388):871–880, 1984.

[32] Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pages 5739–5748, 2019.

[33] Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.

[34] Jan V´ıˇsek et al. The least weighted squares ii. consistency and asymptotic normality. Bulletin of the Czech Econometric Society, 9, 2002.

[35] Jan ´Amos V´ıˇsek. The least trimmed squares. part i: Consistency. Kybernetika, 42(1):1–36, 2006.

[36] Huan Xu, Constantine Caramanis, and Shie Mannor. Robust regression and lasso. In Advances in Neural Information Processing Systems, pages 1801–1808, 2009.

[37] Hui Zhang. The restricted strong convexity revisited: analysis of equivalence to error bound and quadratic growth. Optimization Letters, 11(4):817–833, 2017.

[38] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In international conference on machine learning, pages 1–9, 2015.

8 Appendix

8.1 Additional Results for Section 3

The following lemma provides upper bounds on the expected gradient of the worst-possible MKL-SGD solution that lies in a ball around . Simultaneously satisfying the following bound with the one in Lemma 3 may lead to an infeasible set of and . And thus we use Lemma 4 in conjunction with 3.

The proof for lemma 2 can be found in the Appendix Section 8.2.7

8.2 Proofs and supporting lemmas

8.2.1 Proof of Lemma 1

). Let us fix a ). We know that for any strongly convex in w with parameter . This implies

A naive bound for the above Lemma can be:

8.2.2 Proof of Theorem 1

Proof. By the definition of the noiseless framework, is the unique optimum of F(w) and lies in the optimal set of each ). We will prove this theorem by contradiction. Assume there exists some ˆthat also satisfies optimum of ( ˆw) = 0. At ˆw, we have 0 =( ˆ. This implies ˆ.

Theorem 1 and Assumption 2 guarantee that 0. If f(w) is strongly convex and g(w) is convex, then we know that f(w) + g(w) is strongly convex. On similar lines we can show that 0 by splitting the terms in ˜F(w) as ˜)). The first term has 0 (Assumption 2) and the second term has = 0 (since it is convex). Note, is a positive constant independent of w and so the above lemma is for all w.

8.2.3 Proof of Lemma 2

Let ¯w be a stationary point of MKL-SGD. Now, we analyze the loss landscape on the line joining and where is any arbitrary point in the landscape at a distance as far as the farthest outlier from . Let C be a very large number.

The loss functions and are redefined as follows:

where |O| = b such that n = g + b. Let = minand Let = maxand = min, = min. Let us define = .

the outliers have the lowest loss. As stated in the previous lemma, the results hold irrespective of that. This implies:

Without loss of generality assume that the outliers are ordered as follows: .

Now be some point of intersection of function in the set of clean samples and a function in the set ofoutliers to . Let be the angle between the line connecting and to the line connecting to . For any two curves with Lipschitz constants , the halfspaces passing through the weighted mean are also the region where both functions have equal values.

Thus,

Let denote the following ratio:

minima do no exist is ˆ

corresponding and ˜w in the landscape and so now is a function of ˜w as well as is represented in the

theorem statement.

Note: In the vector case, for example there exists a fine tradeoff between how large can be and if for large , the loss corresponding to the outlier will be one of the lowest. Understanding that tradeoff is beyond the scope of this paper.

8.2.4 Proof of Lemma 3

8.2.5 Proof of Lemma 4

8.2.6 Proof of Theorem 2

Proof. There exists an such that in Lemma 3, we have

Combining above equation with Lemma 4, we get

8.2.7 Proof of Lemma 6

Proof. From the definition of good samples in the noiseless setting, we know that ) = 0 . Similarly, for samples belonging to the outlier set, . There exists a ball around the optimum of radius r such that ). Assume that and , such that .

= (1 (21)

8.3 Additional results and proofs for Section 5

Consider the sample size n with bad set(outlier) O and good set G such that . Define

We assume: (1) (Stationary Point) Assume is the solution for the average loss function of good sample such that

(2) (Strong Convexity) ) is strongly convex with parameters i.e.,

(3) (Gradient Lipschitz) ) has Liptchitz gradient i.e.,

Theorem 4. (Distance to

where

Proof. Observe first that for each component function i.e. ,

We next take an expectation with respect to the choice of i conditional on

− 2(1 − sup) (∇) − ∇

Now we first bound Term1 as follows

For Term2 we apply the property of the convex function )

Putting the upper bound of Term1 and Term2 back to (24) gives

where

We have the following corollary that for noiseless setting, if we can have some good initialization, MKLSGD is always better than SGD even the corrupted data is greater than half. For noisy setting, we can also perform better than SGD with one more condition: the noise is not large than the distance . This condition is not mild in the sense that is always greater than for SGD algorithm and for MKL-SGD.

Corollary 1. Suppose we have . At iteration t for , the parameter satisfies supinf). Moreover, assume the noise level at optimal satisfies

Using the same setup, the vanilla SGD and MKL-SGD (K=2) algorithms yield respectively

where

Proof. Start from the inequality 24 in the proof of Theorem 4. We have Term1 as follows:

Putting the terms back to (24), we have for (sup)

where

Now we analyse the term for vanilla SGD and MKL-SGD(K = 2) respectively. For vanilla SGD, we have ) = and ) = 0, which results in

Note that MKL-SGD for K = 2 have

where ) are the indices of data samples for some w:

Suppose the iteration satisfies that ) for . For , we have for

We will have if the following inequality holds

Indeed, for the noise level satisfying (26) we have for ,

Summing up the terms in , we get (30). For the noise level satisfying (27) we have

which results in (30).

8.4 More experimental results

8.4.1 Linear Regression

Here, we show that there exists a tradeoff for MKL-SGD between the rate of convergence and robustness the algorithm provides against outliers depending on the value of the parameter k. Larger the k, more robust is the algorithm, but slower is the rate of convergence. The algorithm outperforms median loss SGD and SGD. We also experimentd with other order statistics and observed that for most general settings MKL-SGD was the best to pick. Note that the outliers are chosen from N(0, 1) distribution.

Figure 5: Comparing the performance of MKL-SGD, SGD and Median loss SGD in the noiseless setting, d = 50.

Figure 6: Comparing the performance of MKL-SGD, SGD and Median loss SGD in the noisy setting, d = 10, Noise variance=0.0001

Figure 7: Comparing the performance of MKL-SGD, SGD and Median loss SGD in the noiseless setting, d = 25, Noise variance=0.01

8.4.2 Neural Network Experiments

Here, we show that in presence of outliers instead of tuning other hyperparameters like learning rate, tuning over k might lead to significant gains in performances for deep neural networks. To illustrate this we play around with two commonly used noise models: random noise and directed noise. In the random noise model, the outlier label is randomly assigned while for the directed noise model for some class ‘a’, the outlier is assigned the same label ‘b’, similarly all the outliers for class ‘b’ are assigned label ‘c’ and so on.

Table 2: In this experiments, we train a standard 2 layer CNN on subsampled MNIST (5000 training samples with labels corrupted using random label noise). We train over 80 epochs using an initial learning rate of 0.05 with the decaying schedule of factor 5 after every 30 epochs. The reported accuracy is based on the true validation set. The results of the MNIST dataset are reported as the mean of 5 runs. For the MKL-SGD algorithm, we introduce a more practical variant that evaluates k sample losses and picks a batch of size where k = 10.

Table 3: In this experiments, we train a standard 2 layer CNN on subsampled MNIST (5000 training samples with labels corrupted using random label noise). We train over 80 epochs using an initial learning rate of 0.05 with the decaying schedule of factor 5 after every 30 epochs. The reported accuracy is based on the true validation set. The results of the MNIST dataset are reported as the mean of 5 runs. For the MKL-SGD algorithm, we introduce a more practical variant that evaluates k sample losses and picks a batch of size where k = 10.

Table 4: In this experiments, we train Resnet 18 on CIFAR-10 (50000 training samples with labels corrupted using directed label noise). We train over 200 epochs using an initial learning rate of 0.05 with the decaying schedule of factor 5 after every 90 epochs. The reported accuracy is based on the true validation set. The results of the CIFAR-10 dataset are reported as the mean of 3 runs. For the MKL-SGD algorithm, we introduce a more practical variant that evaluates k sample losses and picks a batch of size