We study robust least squares regression, where the goal is to robustly estimate a linear predictor from data which is potentially corrupted by an adversary. We focus on the setting where response variables are corrupted via an oblivious adversary. Such a setting has numerous applications such as click-fraud in a typical ads system, ratings-fraud in recommendation systems, as well as the less obvious application of regression with heavy tailed noise.
For the problem of oblivious adversarial corruptions, our goal is to design an estimator that satisfies three key criteria: (a) (statistical efficiency) estimates the optimal solution consistently with nearly optimal statistical rates, (b) (robustness efficiency) allows a high amount of corruption, i.e., fraction of corruptions is computational efficiency) has the same or nearly the same computational complexity as the standard ordinary least squares (OLS) estimator. Most existing techniques do not even provide consistent estimates in this adversary model (Bhatia et al., 2015; Nasrabadi et al., 2011; Nguyen and Tran, 2013; Prasad et al., 2018; Diakonikolas et al., 2018; Wright and Ma, 2010). Bhatia et al. (2017) provides statistically consistent and computationally
Figure 1: The first two plots show the parameter error (y-axis) of various estimators as we vary fraction of corruptions in the robust regression setting (x-axis); noise variance is 0 for the first plot and 1 for the second. Plots indicate that AdaCRR is able to tolerate significantly higher fraction of outliers than most existing methods. The last plot shows parameter error over number of iterations for robust regression, indicating AdaCRR can be upto 100x faster as compared to Huber regression.
efficient estimator, but requires the fraction of corruptions to be less than a small constant (Tsakonas et al. (2014) study Huber-loss based regression to provide nearly optimal statistical rate with nearly optimal fraction of corruptions. But their sample complexity is sub-optimal, and more critically, the algorithm has super-linear computational complexity (in terms of number of points) and is significantly slower than the standard least squares estimator.
So the following is still an open question: “Can we design a linear time consistent estimator for robust regression that allows almost all responses to be corrupted by an oblivious adversary?”
We answer this question in affirmative, i.e., we design a novel outlier removal technique that can ensure consistent estimation at nearly optimal statistical rates, assuming Gaussian data and sub-Gaussian noise. Our results hold as long as the number of points n is larger than the input dimensionality p by logarithmic factors, i.e., , and allows
responses to be corrupted; the number of corrupted responses can be increased to
with a slightly worse generalization error rate.
Our algorithm, which we refer to as AdaCRR 1, uses a similar technique as Bhatia et al. (2015,
2017), where we threshold out points that we estimate as outliers in each iteration. However, we show that fixed thresholding operators as in Bhatia et al. (2015, 2017) can get stuck at poor fixed-points in presence of a large number of outliers (see Section 4). Instead, we rely on an adaptive thresholding operator that uses noise in each iteration to avoid such sub-optimal fixed-points. Similar to Bhatia et al. (2015, 2017), AdaCRR-FC solves a standard OLS problem in each iteration, so the overall complexity is is the number of iterations and
time-complexity of an OLS solver. We show that
iterations are enough to obtain
-optimal solution, i.e., the algorithm is almost as efficient as the standard OLS solvers. Our simulations also demonstrate our claim, i.e., we observe that AdaCRR-FC is significantly more efficient than Huber-loss based approaches (Tsakonas et al., 2014) while still ensuring consistency in presence of a large number of corruptions unlike existing thresholding techniques (Bhatia et al., 2015, 2017) (see Figure 1).
The above result requires which is prohibitively large for high-dimensional problems. Instead, we study the problem with sparsity structure on the regression vector (Wainwright, 2009). That is, we study the problem of sparse linear regression with oblivious response corruptions. We provide first (to the best of our knowledge) consistent estimator for the problem under standard RSC assumptions. Similar to the low-d case, we allow
fraction of points to be corrupted, but the sample complexity requirement is only
is the number of non-zero entries in the optimal sparse regression vector. Existing Huber-loss based estimators (Tsakonas et al., 2014) would be difficult to extend to this setting due to the additional non-smooth
regularization of the regression vector. Existing hard-thresholding based consistent estimators (Bhatia et al., 2017) marginalize out the regression vector, which is possible only in low-d due to the closed form representation of the least squares solution, and hence, do not trivially extend to sparse regression.
Finally, we enhance and apply our technique to the problem of regression with heavy-tailed noise. By treating the tail as oblivious adversarial corruptions, we obtain consistent estimators for a large class of heavy-tailed noise distributions that might not even have well-defined first or second moments. Despite being a well-studied problem, to the best of our knowledge, this is the first such result in this domain of learning with heavy tailed noise. For example, our results provide consistent estimators with Cauchy noise, for which even the mean is not well defined, with rates which are very similar to that of standard sub-Gaussian distributions. In contrast, most existing results (Sun et al., 2018; Hsu and Sabato, 2016) do not even hold for Cauchy noise as they require the variance of the noise to be bounded. Furthermore, existing results mostly rely on median of means technique (Hsu and Sabato, 2016; Lecué and Lerasle, 2017; Prasad et al., 2018), while we present a novel but natural viewpoint of modeling the tail of noise as adversarial but oblivious corruptions.
discusses some of the related works. Section 4 presents our algorithm and discusses why adaptive thresholding is necessary. Our extension to sparse linear regression is presented in Section 6. Section 7 presents our results for the regression with heavy tailed noise problem. We conclude with Section 8. Due to the lack of space, most proofs and experiments are presented in the appendix.
We are given n independent data points sampled from a Gaussian distribution
and their corrupted responses
is the true regression vector,
- the white noise - is independent of
and is sampled from a sub-Gaussian distribution with parameter
is the corruption in the response of
is a sparse corruption set, i.e.,
is independent of
. Apart from this independence we do not impose any restrictions on the values of corruptions added by the adversary. Our goal is to robustly estimate
corrupted data
. In particular, following are the key criteria in evaluating an estimator’s performance:
• Breakdown point: It is the maximum fraction of corruption, , above which the estimator is not guaranteed to recover
with small error, even as
• Statistical rates and sample complexity: We are interested in the generalization error () of the estimator and its scaling with problem dependent quantities like n, p, noise variance
as well as the fraction of corruption
• Computational complexity: The number of computational steps taken to compute the estimator. The goal is to obtain nearly linear time estimators similar to the standard OLS solvers.
As discussed later in the section, our AdaCRR estimator is near optimal with respect to all three criteria above.
Heavy-tailed Regression. We also study the heavy-tailed regression problem where is a heavy-tailed distribution, such as the Cauchy distribution which does not even have bounded first moment. The goal is to design an efficient estimator that provides nearly optimal statistical rates.
Notation. Let be the matrix whose
row is equal to
. For any matrix
and subset
to denote the submatrix of X obtained by selecting the rows corresponding to S. Throughout the paper, we denote vectors by bold-faced letters (a), and matrices by capital letters (
for a positive definite matrix
denotes the
, i.e., the number of non-zero elements in
for a large enough constant
independent of
to denote the set of random variables whose Moment Generating Function (MGF) is less than the MGF of
2.1. Main Results
Robust Regression: For robust regression with oblivious response variable corruptions, we propose the first efficient consistent estimator with break-down point of 1. That is,
Theorem 1 (Robust Regression) Let observations generated from the oblivious adversary model, i.e.,
is selected independently of
. Suppose AdaCRR-FC is run for T iterations with appropariate choice of hyperparameters. Then with probability at least
-th iterate
by the AdaCRR-FC algorithm satisfies:
Remarks: a) AdaCRR-FC solves an OLS problem in each iteration and the number of iterations is , so the overall time complexity of the algorithm is still nearly linear in n. In contrast, standard Huber-loss or
loss based methods (Tsakonas et al., 2014; Nasrabadi et al., 2011) have iteration complexity of
-suboptimality and require
, which implies super-linear
time complexity. Our experiments (Section K.1) also agree with this observation.
b) Break-down point of AdaCRR-FC satisfies:
. In contrast, similar consistent estimator by Bhatia et al. (2017) requires
. In fact, Proposition 3 shows that fixed hard thresholding operators like the ones used by (Bhatia et al., 2015, 2017) cannot provide consistent estimator for
; instead, we propose and analyze a randomized and adaptive thresholding operator (Algorithm 4) to avoid sub-optimal fixed-points.
c) Generalization error of AdaCRR-FC is , which is information theoretically
Table 1: Comparison of various approaches for regression under oblivious adversary model. The computational rates represents the time taken by estimator to compute an -approximate solution.
optimal up to factors. In contrast, most of the existing analysis for
-loss do not guarantee such consistent estimators (Nasrabadi et al., 2011; Wright and Ma, 2010; Nguyen and Tran, 2013). d) Our result is presented for Gaussian covariates and sub-Gaussian response noise. However, the technique is significantly more general and can apply to a large class of sub-Gaussian data distributions. Furthermore, we can relax the assumption on independence of
. It suffices to have
e) Sample complexity of AdaCRR-FC is nearly optimal and can be improved to
-sparse estimators with the data that satisfies RSC/RSS (Theorem 5).
See Table 1 for a detailed comparison with the existing works.
Regression with Heavy-tailed Noise: We present our result for regression with heavy-tailed noise.
the -th iterate of AdaCRR-FC guarantees the following with probability
Remarks: a) Note that our technique does not even require the first moment to exist. In contrast, existing results hold only when the variance is bounded (Hsu and Sabato, 2016). In fact, the general requirement on distribution of is significantly weaker and holds for almost every distribution whose parameters are independent of n. Also, we present a similar result for mean estimation with symmetric noise
b) For Cauchy noise (Johnson et al., 2005) with location parameter 0, and scale parameter can guarantee error rate of
despite unbounded variance which precludes most of the existing results. Our empirical results also agree with the theoretical claims, i.e., they show small generalization error for AdaCRR-FC while almost trivial error for several heavy-tailed regression algorithms (see Figure 5).
c) Similar to robust regression, the estimator is nearly linear in n, p. Moreover, we can extend our analysis to sparse linear regression with heavy-tailed response noise.
The problems of robust regression and heavy tailed regression have been extensively studied in the fields of robust statistics and statistical learning theory. We now review some of the relevant works in the literature and discuss their applicability to our setup.
Robust Regression. The problem of response corrupted robust regression can be written as the following equivalent optimization problems:
The problem is NP-hard in general due to it’s combinatorial nature (Studer et al., 2012). Rousseeuw (1984) introduced the Least Trimmed Squares (LTS) estimator which computes OLS estimator over all subsets of points and selects the best estimator. Naturally, the estimator’s computational complexity is exponential in n and is not practical. There are some practical variants like RANSAC (Fischler and Bolles, 1981) but they are mostly heuristics and do not come with strong guarantees.
A number of approaches have been proposed which relax (2) with Wright and Ma,
2010) or Huber loss (Huber et al., 1973). Tsakonas et al. (2014) analyze Huber regression estimator under the oblivious adversary model and show that it tolerates any constant fraction of corruptions, while being consistent. However, their analysis requires Wright and Ma (2010); Nasrabadi et al. (2011) also study convex relaxations of (2), albeit in the sparse regression setting. While their estimators tolerate any constant fraction of corruptions, they do not guarantee consistency in presence of white noise. Statistical properties aside, a major drawback of Huber’s M-estimator and other convex relaxation based approaches is that they are computationally expensive due to sublinear convergence rates to the global optimum. Another class of approaches use greedy or local search heuristics to approximately solve the
constrained objectives. For example, the estimator of Bhatia et al. (2017) uses alternating minimization to optimize objective (2). While this estimator is consistent and converges linearly to the optimal solution, it only tolerates a small fraction of corruptions and breaks down when
is greater than a small constant.
Another active line of research on robust regression has focused on handling more challenging adversary models. One such popular model is the malicious adversary model, where the adversary looks at the data before adding corruptions. Recently there has been a flurry of research on designing robust estimators that are both computationally and statistically efficient in this setting (Bhatia et al., 2015; Prasad et al., 2018; Diakonikolas et al., 2018; Klivans et al., 2018). While the approach by Bhatia et al. (2015) is based on an alternating minimization procedure, Prasad et al. (2018) and Diakonikolas et al. (2018) derive robust regression estimators based on robust mean estimation (Lai et al., 2016; Diakonikolas et al., 2016). However, for such an adaptive adversary, we cannot expect to achieve consistent estimator. In fact, it is easy to show that we cannot expect to obtain generalization error better than is the fraction of corruptions and
noise variance. Furthermore, as we show in our experiments, these techniques fail to recover the parameter vector in the oblivious adversary model when the fraction of corruption is
Heavy-tailed Regression. Robustness to heavy-tailed noise distribution is another regression setting that is actively studied in the statistics community. The objective here is to construct estimators which work without the sub-Gaussian distributional assumptions that are typically imposed on the data distribution, and allow it to be a heavy tailed distribution. For the setting where the noise is heavy-tailed with bounded variance, Huber’s estimator is known to achieve subGaussian style rates (Fan et al., 2017; Sun et al., 2018). Several other non-convex losses such as Tukey’s biweight and Cauchy loss have also been proposed for this setting (see Loh et al., 2017, and references therein). For the case where both the covariates and noise are heavy-tailed, several recent works have proposed computationally efficient estimators that achieve sub-Gaussian style rates (Hsu and Sabato, 2016; Lecué and Lerasle, 2017; Prasad et al., 2018). As noted earlier, all of these results require bounded variance. Moreover, many of the Huber-loss style estimators typically do not have linear time computational complexity. In contrast, our result holds even if the
ment of noise is bounded where
is any arbitrary small constant. Furthermore, the estimation algorithm is nearly linear in the number of data points as well as data dimensionality.
In this section we describe our algorithm AdaCRR (see Algorithm 1), for estimating the regression vector in the oblivious adversary model. At a high level, AdaCRR uses alternating minimization to optimize objective (2). That is, AdaCRR maintains an estimate of the coefficient vector set of corrupted responses
, and alternatively updates them at every iteration.
Updating Given any subset
is updated using the points in
. We study two variants of AdaCRR which differ in how we update
. In AdaCRR-FC (Algorithm 2) we perform a fully corrective linear regression step on points from
. In AdaCRR-GD (Algorithm 3) we take a gradient descent step to update
. While these two variants have similar statistical properties, the GD variant is computationally more efficient, especially for large n and p.
Updating For any given
, AdaCRR updates
using a novel hard thresholding procedure, which adds all the points whose absolute residual is larger than an adaptively chosen threshold, to the set
. Hard thresholding based algorithms for robust regression have been explored in the literature (Bhatia et al., 2017, 2015) but they use thresholding with a fixed threshold or at a fixed level and are unable to guarantee
break-down point. In fact, as we show in Proposition 3, such fixed hard thresholding operators cannot in general tolerate such large fraction of corruption.
In contrast, our hard thresholding routine (detailed in Section 4.1) selects the threshold adaptively and adds randomness to escape bad fixed points. While randomness has proven to be useful in escaping first and second order saddle points in unconstrained optimization (Ge et al., 2015; Jin et al., 2017), to the best of our knowledge, our result is the first such result for a constrained optimization problem with randomness in the projection step.
Before we proceed, note that Algorithm 1 relies on a new set of samples for each iteration. This ensures independence of the current iterate from the samples and is done mainly for theoretical convenience. We believe this can be eliminated using more complex arguments in the analysis.
4.1. AdaHT: Adaptive Hard Thresholding Operator
In this section we describe our hard thresholding operator AdaHT. There are two key steps involved in AdaHT, which we describe below. Consider the call to AdaHT in iteration of Algorithm 1.
Interval Selection. In the first step we find an interval on positive real line which acts as a “crude” threshold for our hard thresholding operator. We partition the positive real line into intervals of width . We then place points in these intervals based on the magnitude of their residuals. Finally, we pick the smallest j such that the
interval has fewer than
elements in it, for some
such that
be the chosen interval. This interval acts as a crude threshold. All the points to the left of
interval are considered as un-corrupted points and added to
7-9, Algorithm 4); all the points to the right of
interval are considered as corrupted points. The goal of such interval selection is to ensure: a) all the true un-corrupted points lie to the left of the interval and are included in
, and, b) not many points fall in interval
so that a large fraction of the points that are in set
remain independent of each other. This independence allow us to exploit sub-Gaussian concentration results rather than employing a worse case-bounds and helps achieve optimal consistent rates.
Let be a constant. Then, we select interval length as:
where, are approximate upper bounds of
where . In Section 5 we show that for appropriate choice of
, all the true un-corrupted points lie to the left of
interval. In Appendix H we present techniques to estimate
with a constant
. Estimating the noise variance
) is significantly more tricky and it is not clear if it is even possible apriori. So, in practice one can either use prior knowledge or treat
as a hyper-parameter that is selected using cross-validation.
Points in Selected Interval. This step decides inclusion in of points which fall in the selected interval
be the mid-point of this interval. For each point in this interval we sample
uniformly from
, for some universal constant a P p0, 0.1s. If the magnitude of its residual is smaller than
we consider it as un-corrupted and add it to
10-15, Algorithm 4). As we show in the proof of Theorem 1, this additional randomness is critical in avoiding poor fixed points and in obtaining the desired statistical rates for the problem.
4.2. Fixed Hard Thresholding doesn’t work
In this section we show that algorithms of Bhatia et al. (2015, 2017) that rely on fixed hard thresholding operators pruning out a fixed number of elements, need not recover the true parameter when
. We prove this for TORRENT (Bhatia et al., 2015); proof for CRR (Bhatia et al., 2017) can be similarly worked out.
TORRENT is based on a similar alternating minimization procedure as AdaCRR, but differs from it in the subset selection routine: instead of adaptive hard thresholding, TORRENT always chooses the smallest elements from the residual vector
. The following proposition provides an example where TORRENT fails to recover the underlying estimate for
Proposition 3 (Lower Bound for TORRENT) Let otherwise. Consider the limit as
and suppose
which is far from
) such that if TORRENT is initialized at w, it remains at w even after infinite iterations.
See Appendix A for a detailed proof of the proposition. Figure on the right shows the performance of TORRENT on the 1-d regression problem described in Proposition 3. The x-axis denotes the initial point while the y-axis denotes the point of convergence of TORRENT. Clearly TORRENT fails with several initializations despite
In this section we provide an outline of the proof of our main result stated in Theorem 1. We prove a more general result in Theorem 4 from which Theorem 1 follows readily.
Theorem 4 (AdaCRR-FC for Robust Regression) Consider the setting of Theorem 1. Set for a universal constant
given s.t.
. Then the iterates
of AdaCRR-FC (Algorithm 2) executed with the above given hyperparameters, satisfy the following (w.p.
where break-down point Proof (Sketch) Consider the
iteration of AdaCRR-FC. We first divide
the following mutually exclusive sets:
where is as defined in Line 6, Algorithm 4. Note that
contains the egregious outliers in
contains all the “true” uncorrupted points. Our proof first shows that
satisfies the properties described in Section 4.1. Specifically, the output
satisfies: a)
. Next, we show that
can be written in terms of
The rest of the proof focuses on bounding the two terms in the RHS of the above equation. To bound the first term involving we use the observation that
are independent of
and rely on concentration properties of sub-Gaussian random variables. To bound the other term involving
, we rely on a crucial property of our algorithm which guarantees
perform a worst case analysis to bound
. See Appendix I for a detailed proof.
Discussion: Theorem 4 characterizes both the computational as well as statistical guarantees of AdaCRR-FC. More specifically, consider setting . Then, if the number of samples
, AdaCRR-FC after
iterations (and hence nearly linear computation time) produces an iterate
satisfying,
for is a universal constant. This shows that constant-factor estimates of
suffices to achieve information theoretically optimal rates, up to
factors, even with
fraction of corruptions. In fact, AdaCRR-FC can tolerate
fraction of corruptions by setting
, although with a slightly worse parameter estimation error.
In this section, we extend our algorithm to the problem of sparse regression with oblivious response variable corruptions. In this setting the dimension of the data p is allowed to exceed the sample size , the linear regression model is unidentifiable. Consequently, to make the model identifiable, certain structural assumptions need to be imposed on the parameter vector
Following Wainwright (2009), this work assumes that
-sparse i.e. has at most
entries. Our objective now is to recover a sparse
with small generalization error. In this setting, we modify the update step of w in Algorithm 1 as follows
and start the algorithm at . We refer to this modified algorithm as AdaCRR-HD. A huge number of techniques have been proposed to solve the above optimization problem efficiently. Under certain properties of the design matrix (Restricted Eigenvalue property), these techniques estimate
up to statistical precision, using just
samples. In this work we use the Iterative Hard Thresholding (IHT) technique of Jain et al. (2014) to solve the above problem. More details about the IHT Algorithm can be found in Appendix C.
Theorem 5 (AdaCRR-HD for Sparse Robust Regression) Consider the setting of Theorem 1. In addition assume . Use IHT (Algorithm 6) to solve (7) in each iteration of AdaCRR-HD with parameter
all the other parameters as in Theorem 4. Then the
iterate produced by AdaCRR-HD, for
satisfies the following bound with probability at least
We would like to highlight nearly linear sample complexity in for well-conditioned covariates. Furthermore, the total time complexity of the algorithm is still nearly linear in n and p. Finally, to the best of our knowledge, this is the first result for the sparse regression setting with oblivious response corruptions and break-down point
In this section we consider the problem of linear regression with heavy-tailed noise. We consider the heavy-tailed model from Section 2 where we observe n i.i.d samples from the linear model: is sampled from a heavy-tailed distribution. We now show that our estimator from Section 4 can be adapted to this setting to estimate
with sub-Gaussian error rates, even when the noise lacks the first moment.
In this setting, although there is no adversary corrupting the data, we consider any point with noise greater than a threshold as a “corrupted” point, and try not to use these points to estimate
That is, we decompose
. Note that this implies dependence between
, but as we show later in Appendix D, our proof still goes through with minor modifications, and in fact, provides similar rates as the case where
is sampled from a Gaussian distribution. Below, we provide a more general result than Theorem 2, from which Theorem 2 follows by appropriate choice of
, the tail probability of
Theorem 6 (AdaCRR-FC for Heavy-tailed Noise) Consider the setting of Theorem 2. Let be any threshold and
that satisfies (4), set
Then, for any
and
iterate produced by AdaCRR-FC executed with above
parameters and satisfies the following w.p.
Note that if the distribution of is independent of n, we should always be able to find constants
and
to obtain nearly optimal rates. We instantiate this claim for the popular Cauchy noise, for which the existing results do not even apply due to unbounded variance.
Corollary 7 (Cauchy noise) Consider the similar setting as in Theorem 2. Suppose the noise follows a Cauchy distribution with location parameter 0 and scale parameter . Then, the
iterate of AdaCRR-FC, for
satisfies the following, w.p.
We would like to note that despite sub-Gaussian style rates for Cauchy noise, the sample and time complexity of the algorithm is still nearly optimal.
Mean estimation: Although our result holds for regression, we can extend our result to solve the mean estimation problem as well. That is, suppose is the mean of a distribution and
is a zero mean random variable which follows a heavy-tailed distribution. Then by using a simple symmetrization reduction, we can show that we can compute
This result seems to be counter-intuitive as Devroye et al. (2016) derive lower bounds for heavy tailed mean estimation and show that over the set of all moment bounded distributions, no estimator can achieve faster rates than
while we can obtain
rates. However, we additionally require noise distribution to be symmetric, while the lower bound construction uses asymmetric noise distribution. We further discuss this problem in Appendix G. Similarly, our result avoids regression lower-bound by Sun et al. (2018), as we do not estimate the bias term in our regression model.
In this paper, we studied the problem of response robust regression with oblivious adversary. For this problem, we presented a simple outlier removal based algorithm that uses a novel randomized and adaptive thresholding algorithm. We proved that our algorithm provides a consistent estimator with break-down point (fraction of corruptions) of while still ensuring a nearly linear-time computational complexity. Empirical results on synthetic data agrees with our results and show computational advantage of our algorithm over Huber-loss based algorithms (Tsakonas et al., 2014) as well as better break-down point than thresholding techniques (Bhatia et al., 2015, 2017). We also provided an extension of our approach to the high-dimensional setting. Finally, our technique extends to the problem of linear regression with heavy-tailed noise, where we provide nearly optimal rates for a general class of noise distributions that need not have a well-defined first moment.
The finite sample break-down point of our method is which is still sub-optimal compared to the information theoretic limit of
. Obtaining efficient estimators for nearly optimal break-down point is an interesting open question. Furthermore, our algorithm requires an approximate estimate of noise variance
which can sometimes be difficult to select in practice. A completely parameter-free algorithm for robust regression (similar to OLS) is an interesting research direction that should have significant impact in practice as well.
Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45, 2010.
Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.
Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, pages 2110–2119, 2017.
Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Linden- strauss. Random Structures & Algorithms, 22(1):60–65, 2003.
Luc Devroye, Matthieu Lerasle, Gabor Lugosi, Roberto I Oliveira, et al. Sub-Gaussian mean esti- mators. The Annals of Statistics, 44(6):2695–2725, 2016.
Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.
Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stew- art. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
Lee H Dicker. Variance estimation in high-dimensional linear models. Biometrika, 101(2):269–284, 2014.
Jianqing Fan, Shaojun Guo, and Ning Hao. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):37–65, 2012.
Jianqing Fan, Quefeng Li, and Yuyan Wang. Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):247–265, 2017.
Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24 (6):381–395, 1981.
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points-online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
Frank R Hampel. A general qualitative definition of robustness. The Annals of Mathematical Statistics, pages 1887–1896, 1971.
Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.
Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of sub-Gaussian random vectors. Electronic Communications in Probability, 17, 2012.
Peter J Huber et al. Robust regression: asymptotics, conjectures and monte carlo. The Annals of Statistics, 1(5):799–821, 1973.
Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017.
Norman L Johnson, Adrienne W Kemp, and Samuel Kotz. Univariate Discrete Distributions, volume 444. John Wiley & Sons, 2005.
Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regres- sion. arXiv preprint arXiv:1803.03241, 2018.
Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665– 674. IEEE, 2016.
Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selec- tion. Annals of Statistics, pages 1302–1338, 2000.
Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: theory and practice. arXiv preprint arXiv:1711.10306, 2017.
Po-Ling Loh et al. Statistical consistency and asymptotic normality for high-dimensional robust m-estimators. The Annals of Statistics, 45(2):866–896, 2017.
Mark Schmidt. Matlab code by Mark Schmidt, 2006. URL https://www.cs.ubc.ca/~schmidtm/Software/code.html.
Nasser M Nasrabadi, Trac D Tran, and Nam Nguyen. Robust lasso with missing and grossly cor- rupted observations. In Advances in Neural Information Processing Systems, pages 1881–1889, 2011.
Nam H Nguyen and Trac D Tran. Exact recoverability from dense corrupted observations via minimization. IEEE transactions on information theory, 59(4):2017–2035, 2013.
Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust esti- mation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
Peter J Rousseeuw. Least median of squares regression. Journal of the American statistical association, 79(388):871–880, 1984.
Christoph Studer, Patrick Kuppinger, Graeme Pope, and Helmut Bolcskei. Recovery of sparsely corrupted signals. IEEE Transactions on Information Theory, 58(5):3115–3130, 2012.
Qiang Sun, Wen-Xin Zhou, and Jianqing Fan. Adaptive Huber regression. Journal of the American Statistical Association, (just-accepted):1–35, 2018.
Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
Efthymios Tsakonas, Joakim Jaldén, Nicholas D Sidiropoulos, and Björn Ottersten. Convergence of the huber regression m-estimate in the presence of dense outliers. IEEE Signal Processing Letters, 21(10):1211–1214, 2014.
Roman Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655–686, 2012.
Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using -constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5): 2183–2202, 2009.
John Wright and Yi Ma. Dense error correction via -minimization. IEEE Transactions on Information Theory, 56(7):3540–3560, 2010.
Allen Y Yang, Zihan Zhou, Arvind Ganesh Balasubramanian, S Shankar Sastry, and Yi Ma. Fast -minimization algorithms for robust face recognition. IEEE Transactions on Image Processing, 22(8):3234–3246, 2013.
Shuheng Zhou. Restricted eigenvalue conditions on subgaussian random matrices. arXiv preprint arXiv:0912.4045, 2009.
Before we present the proof of the Theorem, we introduce some notation and present useful intermediate results which we require in our proof. The proofs of all the Lemmas in this section can be found in Appendix I.
Notation Recall that are the new points obtained in
iteration of Algorithm 1. Let
be the corruption vector added to these points and
be the noise vector. Let
be obtained from
by applying the whitening transformation:
Let be the set of un-corrupted points in
be the output of AdaHT in the
iteration of AdaCRR-FC and
be the interval chosen. For any
matrix with
as rows. Finally, let us define
B.1. Intermediate Results
Lemma 8 The input to AdaHT can be written in terms of
The following Lemma obtains a bound on , the interval number, chosen by Algorithm 4.
Lemma 9 (Interval Number) Let be the interval chosen by AdaHT in the
iteration of AdaCRR-FC. Then
The following Lemma presents a condition on which ensures that all the uncorrupted points fall to the left of
Lemma 10 (Interval Length) Consider the iteration of AdaCRR-FC. Suppose AdaHT is run with the interval length
such that:
. Define sets
, which are subsets of points in
, as follows:
Then the following statements hold with probability at least
Moreover, all the points in fall in the
B.2. Main Argument
We first prove the following Lemma, which obtains a bound on the progress made by AdaCRR-FC
in each iteration, assuming . In Section B.2.2 we use this Lemma to prove Theorem 4.
Lemma 11 Consider the setting of Theorem 4. Let
where
Proof Consider the iteration of AdaCRR-FC. We divide the
into the following mutually exclusive sets
where is as defined by Line 6 of Algorithm 4. We now highlight some key properties of the sets
, which follow from Lemma 10 and hold with probability at least
2. S
3. Since |
4. Since any point in lies in the
interval (see Lemma 10), we have
5. For a given set of points , there are at most
possible choices for
; one for each possible choice of
We often use the above properties in the proof. Using sets , we rewrite
So we have:
We now derive high probability upper bounds for each of the terms in the above expression.
Bounding
We now use the concentration properties of the smallest eigenvalue of covariance matrix to bound the above quantity. Using Lemma 32 in Appendix J, we obtain the following inequality, which holds with probability at least
Bounding as follows:
Note that
First, note that the distribution of covariates (x) and dense noise of points in
is the same and equal to their corresponding distributions on entire data. This follows from the fact that
is formed based on the magnitude of corruptions
, which is chosen independent of the data. We use this observation to derive upper bound for
. Using chi-square concentration result from Lemma 30 in Appendix J, we obtain the following upper bound for
Combining this result with the upper bound on , we have (w.p.
where the third equality follows from the fact that and the definition of
equality follows from
. This shows that with probability at least
Bounding We use a similar technique as above to bound
. We first upper bound
To bound we make use of the fact that
is independent of
. Conditioned on
follows a Gaussian distribution with covariance
. Using concentration results for sum of chi-square random variables (see Lemma 30) along with upper bound on
Lemma 31), we can show that for any given
, the following holds with probability at least
Taking a union bound over all possible choices of j, we obtain the following bound, which holds with probability at least
Bounding requires more careful arguments that we present in Section B.2.1 where by using Equation (18), Lemma 14, and Lemma 18, we have w.p.
Combining the bounds in Equations (13), (14), (15), we get the following bound, which holds with probability at least
This finishes the proof of the Lemma. What remains now is to bound , which we do in Section B.2.1.
B.2.1. BOUNDING
We first re-write
where is sampled uniformly from
. First, note that
depends on
- the interval chosen by AdaHT in the first iteration - which in turn depends on
. This dependence of
complicates the analysis. So to simplify the analysis, we bound
by bounding the quantity over all intervals. The bound would then follow by union bound over all possible intervals, whose number is bounded by
To this end, we first define
where . Note that if
. Now by taking supremum over all j’s and using triangular inequality, we have:
(18) The first term above is bounded by Lemma 14 and the second term by Lemma 18. Finally, taking a union bound over all possible choices of j, we get a high probability upper bound for
Expectation of Before bounding the expectation of
we present two auxiliary Lemmas, the proofs of which can be found in Appendix I.
Lemma 12 Let be a random vector. For any given
and scalars
consider the following random vector:
Then the expected value of a
satisfies:
Lemma 13 Let be uniformly sampled from
and is independent of
. For any given
, consider the following random vector:
Then the expected value of
such that
We are now ready to present the main result on expectation of
Lemma 14 (Expectation of Conditioned on
, the following holds w.p. at least
Proof Using the expression for in Equation (17), we have
where . Invoking Lemma 13 with
Moreover, by definition. Furthermore, using standard concentration of sub-Gaussian random variables, we have:
with probability at least
. Using these two observations, we get (w.p.
Concentration of We first present some auxiliary Lemmas which will help us derive concentration results for
. The following Lemmas help us show that
is a sub-Gaussian random variable. The proofs of these Lemmas can be found in Appendix I.
Lemma 15 Let be a random vector. For any given vector
, and scalars b, v, the following random vector is sub-Gaussian:
Moreover, there exists a universal constant
, such that the following holds for any
Lemma 16 Let be uniformly sampled from
and is independent of
. For any given vector
, the following random vector is sub-Gaussian:
Moreover, we have:
Lemma 17 Let be independent samples from
be independent samples from the uniform distribution on rs, ts and are independent of
. For any given vectors
, the following random vector is sub-Gaussian:
Moreover, with probability at least
is a universal constant.
Lemma 18 (Concentration of Conditioned on
, the following holds w.p.
Proof The proof follows similar path to the proof of Lemma 17. From Equation (17) we have the following expression for
From Lemma 17 we know that conditioned on is a sub-gaussian random variable. Moreover, the following bound holds with probability at least
where the third equality follows from the fact that w.p.
; recall using sub-Gaussian tail bounds, we have (w.p.
We now proceed to the proof of Theorem 4. To prove Theorem 4, we prove the following bound on , which is slightly stronger than the bound in Theorem 4:
Theorem 4 directly follows from the above result by observing that and
. We use induction on iteration t to prove this result.
Base Case (First note that by definition of the interval length
in Equation (3), we have
So from Lemma 11 we have the following bound on , which holds with probability at least
Using definitions of in Theorem 4, and
we get:
Induction Step. Suppose the Theorem holds for , we show that it also holds for
1, with high probability. We first show that
. Consider the
where the first inequality follows from the fact that , the second inequality uses the bound
on in inequality (19), and the last inequality holds whenever
for some universal constant c. This shows that , with probability at least
. We now use Lemma 11 to get the following bound on
Using definitions of in Theorem 4, and
, and the bound on
from Equation (19), we get:
The AdaCRR-HD algorithm for consistent robust sparse regression is given in Algorithm 5. Before we present the proof of Theorem 5, we first recall some notation and introduce some additional ingredients which we require for the proof of the Theorem.
Notation. Recall that are approximate upper bounds of
which satisfy the following inequalities
The interval length we choose in this setting is given by:
be the sparsity of . The rest of the notation is same as in Theorems 4, which we recall here for convenience. Let
be the interval chosen in
iteration. Let
be the midpoint of
interval. Let
be the set of un-corrupted points in
the output of AdaHT in the
iteration of AdaCRR-HD. Let
, for some universal constant
C.1. Background on Iterative Hard Thresholding (IHT)
The IHT algorithm for solving the following sparse regression problem is given in Algorithm 6
Jain et al. (2014) show that if the design matrix X satisfies the Restricted Strong Convexity (RSC) and Restricted Strong Smoothness (RSS) properties (defined below), then IHT can efficiently solve the above optimization problem.
Definition C.1 (RSC Property). A matrix is said to satisfy Restricted Strong Convexity (RSC) at sparsity level k with strong convexity constraint
if the following holds for all
s.t.
Definition C.2 (RSS Property). A matrix is said to satisfy Restricted Strong Smoothness (RSS) at sparsity level k with strong smoothness constraint
if the following holds for all
s.t.
Theorem 19 (Theorem 1, Jain et al. (2014)) Let X have RSS, RSC parameters given by . Suppose IHT is run with
iterate of the
IHT algorithm, for
When the covariates are sampled from a Gaussian distribution, the following result of Agarwal et al. (2010) implies that X satisfies RSC and RSS properties with high probability.
Theorem 20 (Agarwal et al. (2010)) Suppose the rows of are sampled from a Gaussian distribution with covariance
. Then the following statements hold with probability at least
The following Corollary follows immediately from the above Theorem.
Corollary 21 Suppose the rows of are sampled from a Gaussian distribution with covariance
. Then with probability at least
satisfies RSC, RSS properties at sparsity level k, with strong convexity constraint
and strong smoothness constraint
C.2. Main Argument
We first prove the following more general result for any then readily follows by substituting
in the following Theorem.
Theorem 22 Let observations generated from the oblivious adversary model and let
be such that
. Suppose AdaCRR-HD is run for T iterations. Suppose the sparsity k in the call to IHT is such that
and the hyperparameters of AdaHT are set as follows:
for some universal constant . Then the iterates
produced by AdaCRR-HD satisfy
with probability greater than
The proof of Theorem 22 uses the same arguments as in the proof of Theorem 4. So we only focus on the key differences in the proof. That is, we use the same induction based argument for getting the final bound. So, we only need to prove the following lemma:
Lemma 23 Consider the setting of Theorem 22. Let
where
Proof Similar to proof of Theorem 4, we divide into mutually exclusive sets
Since , using similar argument as in Lemma 10, it is easy to verify that the sets
satisfy the following key properties, with probability at least
1.
2. S
3. |
We use the above properties to first show that the input to IHT satisfies RSC, RSS properties. Since
Since is chosen by an oblivious adversary, the rows of
are i.i.d and follow a Gaussian distribution with covariance
. Using Corollary 21 on the RHS of the above equation we obtain
This shows that satisfies RSC property with
similar argument shows that
satisfies RSS property
We now use the convergence properties of IHT presented in Theorem 19 to obtain a bound on . Suppose the sparsity k in the call to IHT in Algorithm 1 is such that
Rearranging terms in the above expression gives us
We now use the Restricted Eigenvalue property of to lower bound the LHS of the above equation. Using Lemma 35 in Appendix J we obtain the following lower bound on
which holds with probability at least
for some constant , which depends on
. We now bound RHS of (20). Let
denote the set of non-zero indices of
Note that be a sub-matrix obtained from
by selecting rows corresponding to
and columns corresponding to
be the sub-matrix of
restricted to rows and columns corresponding to
Plugging the previous two results in (20), and using assumption about , we get (w.p.
for some universal constant c. The rest of the proof focuses on bounding is similar to the proofs of Theorem 4. Using sets
, we rewrite
We now bound each of the terms in the RHS of the above equation. Note that since is a random quantity, we take a union bound over all possible
while bounding these terms. So we have an additional log p term in our bounds.
Using similar techniques as in the proof of Theorem 4 for bounding show that the following hold with probability at least
Combining the bounds in Equations (21) and (22), we get the following bound, which holds with probability at least
We first introduce some notation that we use in the proof.
Notation. Recall that is an approximate upper bound of
which satisfies the following inequality:
Define the tail probability of the noise distribution as:
Note that the key idea behind adapting Algorithm 1 to heavy-tailed setting is to consider all the points with noise magnitude less than
as un-corrupted points and try to use only these points to estimate the parameter vector. Accordingly, we define
, the set of “un-corrupted” points in
is the noise vector corresponding to the points in
be the output of Algorithm 4 in the
iteration of Algorithm 1. The rest of the notation is same as in Theorem 4, which we recall next. Finally, we define
some constant
D.1. Intermediate Results
The following Lemma, which is similar to Lemma 10, provides condition on which ensures all the “uncorrupted” points lie to the left of the
Lemma 24 (Interval Length) Consider the iteration of AdaCRR-FC. Suppose AdaHT is run with the interval length
. Define sets
, which are subsets of points in
, as follows:
Then the following statements hold with probability at least
Moreover, all the points in fall in the
Proof The proof of the Lemma uses the exact same arguments as in the proof of Lemma 10 and relies on the following bound on the residual of uncorrupted points
where the last inequality holds with probability at least and follows from the concentration properties of sub-Gaussian random variables.
D.2. Main Argument
We first prove the following more general result for any then readily follows from this by substituting
Theorem 25 Let observations generated from a linear model with heavy tailed noise. Suppose AdaCRR-FC is run for T iterations. Suppose that the hyperparameters of AdaHT are set as follows:
for some universal constant . Then the iterates
produced by AdaCRR-FC satisfy:
with probability greater than
The proof of the Theorem uses the same arguments as in the proof of Theorem 4. Specifically, we use the same induction based argument for getting the final bound. So, we only need to prove the following lemma.
Lemma 26 Consider the setting of Theorem 25. Let
where Proof Consider the
iteration of AdaCRR-FC. We divide
into mutually exclusive sets
Since , the sets defined above satisfy the following properties (this follows from Lemma 24)
Using sets , we now rewrite
Note that above corresponds to
that appears in the proof of Lemma 11. We now use similar techniques as in the proof of Lemma 11 for bounding
Bounding as follows
Note that is the bucket chosen by AdaCRR-FC in
iteration of Algorithm 1.
can be upper bounded as
Note that the covariates (are still distributed according to Gaussian distribution. This follows from the fact that
is formed based on the noise magnitude
, which is independent of the covariates. We use this observation to derive an upper bound for
. Using chi-square concentration result from Lemma 30 in Appendix J, we obtain the following upper bound for
which holds with probability at least
Combining this result with the upper bound on , we obtain the following bound on
holds with probability at least
where the third equality follows from the fact that . This shows that with probability at least
Bounding corresponds to the term
that appears in the proof of Lemma 11. Using similar techniques as in Lemma 11,
can be bounded as
which holds with probability at least . Combining the above bounds for
the following bound on
, which holds with probability at least
Let E be the even that . From the definition of
, we know that
. We now lower bound
as follows:
Since Finally, to prove the Theorem, we substitute the above bound on
in Theorem 6.
The CDF of a Cauchy random variable z with location parameter 0 and scale parameter by:
So, the tail probability
is given by
This can equivalently be written as:
. Replacing
setting
in the result of Theorem 6 proves the Corollary.
In this section we consider the 1d heavy-tailed mean estimation problem and provide intuition for why AdaCRR-FC estimates the mean of any distribution which is symmetric around its mean as remarked in Section 7. Let , be the input to AdaCRR-FC. Note that
are related as
where is independent of
. Next, observe that AdaCRR-FC is “sign invariant”; i.e., changing
doesn’t change the course of the algorithm and the iterates produced by the algorithm will exactly be the same in both cases. So we can randomly choose n{2 data points and flip their sign without effecting the outcome of the algorithm. Let
be the resulting data points, where
are related as follows
Note that the distribution of is given by
Since the distribution of is symmetric around origin,
has the same distribution as
over,
remains independent of
even after the transformation; a simple calculation shows that
. This shows that
can be viewed as being generated from a different linear model in which the covariates are sampled from a sub-Gaussian distribution with mean 0 and variance 1 and the noise is independent of covariates and is sampled from a heavy tailed distribution. So, we can use the exact same arguments as in the proof of Theorem 6, where we assumed the covariates are Gaussian, to show that AdaCRR-FC estimates the mean of any symmetric heavy tailed distribution, with bounded first moment, at sub-gaussian rates.
H.1. Estimating
In this section we present techniques to estimate settings.
We begin with low dimensional setting where and provide techniques to estimate
two different initializations of w. In the following Proposition we provide a constant factor upper bound for
is intialized at the OLS solution.
Proposition 27 Suppose we start Algorithm 1 at . Consider the following estimate for
where
the following inequalities with probability at least
Proof Recall that we start our algorithm at is given by
where . We first study
and understand its lower and upper bounds. We use concentration properties of Gaussian random variables to derive these bounds.
written as
Note that, conditioned on , the elements of
are i.i.d Gaussian random variables with mean 0 and variance
. Using Chi-squared concentration results (see Lemma 30) we get the following lower and upper bounds for
, which hold with probability at least
Note that acts as a remainder term and is of a smaller order than
consider the term
Using concentration properties of eigenvalues of covariance matrix (see Lemma 32), we get the following bound on , which holds with probability at least
This gives us the following upper bound for , which holds with probability at least
We now substitute the above bounds on in Equation (25). Simplifying the resulting terms and using the assumption that
, we get the following bound for
, which holds with probability at least
The above bound shows that to estimate , it suffices to have a good estimate of
We now show that
is a good estimate of
. Note that
can written as:
where is the projection matrix onto the column span of
Note that
is a random projection matrix which projects any given n-dimensional vector onto a random p-dimensional subspace, with every subspace equally likely. So for any fixed vector u,
has the same distribution as
is sampled uniformly from the unit sphere in
is the subvector of v corresponding the first p coordinates. We now use Lemma 2.2 of Dasgupta and Gupta (2003) which provides a high probability bound on
to get the following bound on
, which holds with probability at least
This gives us the following bound on the norm of the residual vector
Substituting this in Equation (26) and simplifying the resulting terms, we get the following bounds for in terms of the residual vector
We now consider the case where . In this case,
. So the problem of estimating
is equivalent to the problem of estimating the signal strength
a well studied problem in statistics and a number of estimators have been proposed, which work under fairly mild conditions on the distributions of
, and references therein). Here we consider the following estimator for
The following Proposition shows that provides a good approximation of
, when the noise is not too strong compared to the signal strength.
Proposition 28 Suppose the noise and corruptions are such that
Proof First note that can be rewritten as
where is the projection matrix onto the column span of
This gives us the following upper and lower bound for
We now bound each of the terms in the upper and lower bounds of
• Recall that in Proposition 27 we showed the following bound for holds with probability at least
where the last inequality follows from concentration properties of chi-square random variables (see Lemma 30) and holds with probability at least
• Note that each entry of is a Gaussian random variable with mean 0 and variance
Using concentration properties of chi-square random variables (Lemma 30), we get the following bounds for
, which hold with probability at least
Substituting these bounds in the previous equation gives us the following upper and lower bounds on , which hold with probability at least
Discussion. When , we can see that
is a good upper bound of
Condition (27) holds when the noise and corruptions aren’t too strong compared to the signal strength. It imposes a bound on the norm of corruptions and requires to be bounded by
. This is a very mild assumption and holds even if the adversary adds
corruptions to each data point.
In this setting we consider the case where AdaCRR-HD is initialized at 0 and estimate which is equal to the signal strength
. The problem of estimating signal strength and noise variance in sparse regression setting is well studied (see Sun and Zhang, 2012; Fan et al., 2012, and references therein). In this work we use the estimator of Sun and Zhang (2012) to first estimate the variance of
, and then use it to estimate the signal strength
estimator of Sun and Zhang (2012) solves the following scaled sparse linear regression problem to estimate the noise variance
For
. To be more precise,
satisfies the following bound:
The following Proposition, which is similar to Proposition 28, shows that is a good estimate of
Proposition 29 Let the noise be sampled from
the Gaussian noise and corruptions are such that,
for some universal constant
Proof We begin by deriving lower and upper bounds for can be written as:
Note that the elements of v are independent and . Moreover,
. Using concentration properties of Gaussian and chi-squared random variables (see Lemma 30), we get the following bounds which hold with probability at least
for some universal constant . Using the above bounds together with Equation (28) gives us the following upper and lower bounds for
The Theorem follows by observing that
H.2. Analysis of AdaCRR-GD
In this section we provide an informal argument for why AdaCRR-GD obtains similar guarantees as AdaCRR-FC. Here, we work in the oblivious adversary model described in Section 2. Consider the update step of AdaCRR-GD described in Algorithm 3:
Letting , we rewrite the above equation as:
One step progress. We now use the same proof technique as used in the proof of Theorem 4 (see Lemma 11) to obtain a bound on the one-step progress made by AdaCRR-GD. We first bound
Supose . Then from Lemma 32 on concentration of largest eigenvalue of covariance matrix, and the fact that
, we get the following bound (w.p
where
Lets suppose the interval length we choose in AdaHT is such that Then from Lemma 10 we know that none of the uncorrupted points get thresholded by AdaHT, in the
iteration of AdaCRR-GD. Now, consider the sets
defined in the proof of Lemma 11. From Lemma 10 we also know that
. We now decompose
in terms of
as follows:
Note that these are the exactly the same terms which appear in the proof of Lemma 11. Using the results from Lemma 11, we get the following bound on
Bound on Now, suppose the interval length
is chosen as in Equation (3), with
, for some universal constant
. Then, using the same induction argument as in the proof of Theorem 4, we get the following bound on
Discussion. The above bound shows that AdaCRR-GD achieves similar error guarantees as AdaCRRFC in oblivious adversary model. However, the fraction of corruptions that AdaCRR-GD can tolerate depends on the condition number
I.1. Proof of Lemma 8
From the definition of we know that:
Substituting
we get:
I.2. Proof of Lemma 9
The proof is based on a simple counting argument. Suppose is greater than
number of points in bucket j. We know that
. The number of points in the first
buckets can be lower bounded as:
However, the total number of points in . This shows that
can’t be greater than
I.3. Proof of Lemma 10
Let be the input to AdaHT in
iteration of AdaCRR-FC:
is independent of
and since sum of two independent sub-gaussian random variables is a sub-gaussian random variable, it is easy to see that
sub-gaussian and satisfies the following tail bounds:
Set Now, consider the residual of points in
where the second inequality follows from the definition of and the above concentration bound on
, and the last inequality follows from the fact that
show that any point with a residual larger than
will never be thresholded; that is, the point will never be added to
. Any point with residual larger than
can either lie to the right or inside the
interval. If it is to the right, then it will not be added to
. If it lies in the interval, we uniformly sample
and add the point to
Clearly, this can never hold for the points in
. This shows that
Set Now, consider the residual of points in
Using a similar argument as above, we can show that any point with residual smaller than will always be added to
. This shows that
is a subset of
Set be the center of
interval, which is given by:
that
is given by:
We first obtain an upper bound for the residual of points in
Next we obtain a lower bound for the residual of points in
This shows that all the points in fall in the
interval. This finishes the proof of the Proposition.
I.4. Proof of Lemma 12
First note that can be rewritten as a sum of two independent random variables:
Since
Since z is a standard normal random variable, we get the following closed form expression for Eras:
I.5. Proof of Lemma 13
From Lemma 12 we know that:
Using this expression we now compute Eras:
for some
I.6. Proof of Lemma 15
First note that x can be rewritten as: can be rewritten as:
We now show that are both sub-Gaussian random vectors.
To show that
is a sub-Gaussian random vector, it suffices to show that
is a sub-Gaussian random variable. Let
. Note that,
Some computation shows that we get the following tail bound for This shows that for any
This shows that is a sub-Gaussian random vector2.
It is easy to see that
. We now bound its MGF. Let
be any vector
Since
This shows that is a sub-Gaussian random vector. We now use the result that the sum of two dependent sub-Gaussian random vectors is also a subGaussian random vector. As a result
is also a sub-Gaussian random vector, which satisfies the following for any
for some universal constant . This shows that a is a sub-Gaussian random vector.
I.7. Proof of Lemma 16
First note that can be written as:
Using Lemma 15 to bound
We now bound the expectation in the RHS of the above equation. From Lemma 12 we know that , for some random variable
. This shows that
bounded random variable which satisfies:
for some random variable . Since every bounded random variable is sub-Gaussian we have:
Substituting this in Equation (29) we get:
I.8. Proof of Lemma 17
Define random variable are independent sub-Gaussian random variable, a is also a sub-Gaussian random vector which satisfies the following inequality for any
for some universal constant c. To get the tail bound for the norm of a, we the following result from Hsu et al. (2012). Suppose a is a sub-Gaussian vector which satisfies the following for any satisfies the following tail bound for any
Setting gives us the required bound.
In this section we state some well-known concentration results of Gaussian and sub-gaussian ran- dom variables.
Lemma 30 (Laurent and Massart (2000)) Let standard normal random variables. Let
be non-negative. Let
. Then, the following holds for any positive t:
Lemma 31 (Hsu et al. (2012)) Suppose a is a sub-Gaussian vector which satisfies the following for any satisfes the following tail bound for any
Lemma 32 Let be the matrix of covariates with columns sampled from Np0, Iq. Then, for any
, with probability at least
, the following statements are true:
Lemma 33 (Bhatia et al. (2015)) Let be the matrix of covariates with columns sampled from Np0, Iq. Then, for any
, with probability at least
, the following statement is true:
Lemma 34 (Proposition 2.1, Vershynin (2012)) Consider independent random vectors in
, which have sub-gaussian distribution with parameter
Then for any
with probability at least
Figure 2: Performance of various estimators as we vary n in the robust regression setup. In the noiseless setting , AdaCRR consistently outperforms baseline methods while in the setting with
regression is able to compete with AdaCRR in terms of recovery guarantees.
Figure 3: Performance of various estimators as we vary fraction of corruptions in the robust regression setting. As stated in Theorem 1, AdaCRR is able to tolerate
fraction of outliers while most other competing methods tend to have a smaller breakdown point as compared to AdaCRR.
Lemma 35 (Raskutti et al. (2010); Zhou (2009)) Let be a matrix with rows
sampled from a Gaussian distribution with mean 0 and covariance
satisfies the following restricted eigenvalue (RE) property with probability at least
where is a constant that depends on
In this section, we compare the performance of AdaCRR with baselines, for the problems of robust regression as well as heavy-tailed regression.
K.1. Robust Regression
In this section, we focus on experiments related to the robust regression problem described in Section 2. We begin by describing the problem setup as well as the baseline algorithms before proceeding to a discussion of the experimental findings.
Experiment Setup. For all the experiments, the feature vectors . For experiments with condition number
was set to identity; whenever the condition number
was chosen to be a diagonal matrix with entries uniformly sampled from p0, 1q and rescaled to ensure that the maximum and the minimum entry are
respectively. Further for any value of
, the corruption vector when
was set as follows: a random set of n{4 corruptions were set to 1000, another n{4 were set to?1000 and the remaining were uniformly sampled from
, we set the corruption vector as follows: a random set of n{4 corruptions were set to 1, another n{4 were set to 1{?n and the remaining were set to 1{n. Each experiment was repeated for 5 runs and we report the mean performance of the methods across these runs.
AdaCRR. We use gradient descent to solve the least squares optimization problem that arises in the update step of w in AdaCRR-FC, and take 5 gradient descent steps. Since the least squares loss is strongly convex and smooth, a small number of GD steps is enough to take us close to the optimal solution.
Baseline Algorithms. We compare the two variants of our proposed estimator AdaCRR-FC and AdaCRR-GD with various baseline alogrithms including a) Ordinary Least Squares (OLS) estimator, b) TORRENT estimator (Bhatia et al., 2015), c) CRR estimator (Bhatia et al., 2017), d) Robust Gradient estimator (RobGrad) (Prasad et al., 2018) using the mean estimator of Lai et al. (2016) to robustly estimate the gradients, e) Huber regression and f) -regression (L1-DALM). For TOR- RENT, CRR and RobGrad, we use the implementations provided by the authors. For Huber Regression we compared the matlab in-built implementation, which uses an iteratively rewieghted least squares solver (IRLS), with the library by Mark Schmidt (2006), which has various solvers for Huber Regression (e.g., sub-gradient solver, L-BFGS solver), and found the L-BFGS solver to be more stable in its recovery properties than the other techniques. For
regression, we use the DALM solver by Yang et al. (2013); for a detailed evaluation of various solvers
-regression we refer the reader to Bhatia et al. (2015). The hyperparameters of the the baseline algorithms were tuned using a fine grid-search while those for AdaCRR were fixed to a default setting: a) the hyperparameters
in AdaHT are set as 1{18, 4 respectively, b)
in the interval length computation is set to
- the upper bound of
- is estimated using the techniques described in Appendix H, c) the upper bound of
, and d) the interval length was computed as
. Note that this expression differs from the expression in Equation (3) in the leading constants. This is because the expression in Equation (3) is an optimistic estimate of the “optimal” interval length, which ensures none of the un-corrupted points are left out. In practice, one can use smaller interval lengths and still ensure this property holds.
Recovery guarantees. Figures 2 and 3 show the performance of various estimators for the above designed setup where we measure the performance in terms of the estimation error
Figure 2 exhibits this performance as we vary the total number of datapoints, n, keeping other parameters fixed. a) In the noiseless setting, , we obser that both AdaCRR-FC and AdaCRRGD are consistently able to achieve much lower error rate as compared to other baseline methods, even when they were run to convergence. The L1-DALM method could not be scaled to more than 20000 datapoints since it required the computation of a
matrix which lead to memory errors. b) In the noisy setting,
, the recovery error of Huber regression method was competitive with those of AdaCRR while all other methods typically had worse performance in comparison.
Figure 3 shows the recover error with variation in the level of corruption , keeping other parameters fixed. Both in the noisy and noiseless settings, AdaCRR is able to tolerate much higher levels of corruption as compared to other baselines, with huber regression becoming competitive in the setting with
. This is in accordance with the theoretical guarantees of AdaCRR which show that it can tolerate a level of corruption
Computational Efficiency. Figure 4 compares the computational efficiency of AdaCRR with Huber regression, which had competitive recovery error in a few settings. It shows the variation in recovery error with the number of iteration of the algorithm for
Figure 4: Error vs. iteration number for robust regression. Corruptions are sampled uniformly from r0, 100s. Both the algorithms are intialized at the same point. AdaCRR can be upto 100x faster as compared to Huber regression while achieving much better recovery errors.
Figure 5: Performance of various estimators on linear regression with cauchy noise. Both AdaCRR and Huber regression are able to offer consistent estimators for this regression problem where the noise variables do not have a well-defined first moment. As before, AdaCRR exhibits linear convergence to the optima and can be orders of magnitude faster than Huber regression.
iteration of AdaCRR-GD is a simple matrix-vector multiplication with cost Opndq which is atleast that incurred by the LBFGS solver for huber regression. AdaCRR-GD and AdaCRR-FC can be upto 100x faster as compared to stable solvers for huber regression and are in accordance with the theoretically proven linear convergence for the optimization procedure.
K.2. Heavy-tailed Regression
In this section, we focus on experiments related to the heavy-tailed regression problem described in Section 2 with an emphasis on Cauchy regression, wherein the noise variables are sampled i.i.d. from a Cauchy distribution.
Experiment Setup. For all the experiments, the feature vectors was set similar to the robust regression experiments in Section K.1. We set the scale parameter (
of the Cauchy distribution to 1. Each experiment was repeated for 5 runs and we report the mean performance of the methods across these runs.
Baseline Algorithms. We compare the two variants of our proposed estimator AdaCRR-FC and AdaCRR-GD with various baseline alogrithms including a) Ordinary Least Squares (OLS) estimator, b) Robust Gradient estimator (RobGrad) (Prasad et al., 2018) which uses the median-of-means estimator for estimating mean gradient, c) Huber regression d) Sparse regression (LASSO) (Tibshirani, 1996), e) Ridge Regression and f) -regression (L1-DALM). For RobGrad, we use the implementations provided by the authors. The hyperparameters of the the baseline algorithms were tuned using
a fine grid-search while those for AdaCRR were fixed to a default setting described above, with
Recovery guarantees. Figure 5 shows the variation of parameter recovery error number of data points n are varied for the above setup. AdaCRR-FC, AdaCRR-GD and the huber regression estimator are able to achieve consistent estimation of the parameter vector with the error exhibiting a decreasing trend with increasing datapoints even when the Cauchy noise does not have a well-defined first moment. Other baseline estimators do not exhibit this consistency and tend to flatten out in terms of recovery error. The high fluctuations in the OLS and Ridge estimators can be attributed to the heavy-tailed Cauchy distribution.