Private Stochastic Convex Optimization: Efficient Algorithms for Non-smooth Objectives

2020·Arxiv

Abstract

Abstract

In this paper, we revisit the problem of private stochastic convex optimization. We propose an algorithm based on noisy mirror descent, which achieves optimal rates both in terms of statistical complexity and number of queries to a first-order stochastic oracle in the regime when the privacy parameter is inversely proportional to the number of samples.

1 Introduction

Modern machine learning systems often leverage data that are generated ubiquitously and seamlessly through devices such as smartphones, cameras, microphones, or user’s weblogs, transaction logs, social media, etc. Much of this data is private, and releasing models trained on such data without serious privacy considerations can reveal sensitive information (Narayanan and Shmatikov, 2008; Sweeney, 1997). Consequently, much emphasis has been placed in recent years on machine learning under the constraints of a robust privacy guarantee. One such notion that has emerged as a de facto standard is that of differential privacy.

Informally, differential privacy provides a quantitative assessment of how different are the outputs of a randomized algorithm when fed two very similar inputs. If small changes in the input do not manifest as drastically different outputs, then it is hard to discern much information about the inputs solely based on the outputs of the algorithm. In the context of machine learning, this implies that if the learning algorithm is not overly sensitive to any single datum in the training set, then releasing the trained model should preserve the privacy of the training data. This requirement, apriori, seems compatible with the goal of learning, which is to find a model that generalizes well on the population and does not overfit to the given training sample. It seems reasonable then to argue that privacy is not necessarily at odds with generalization, especially when large training sets are available.

We take the following stochastic optimization view of machine learning, where the goal is to find a predictor that minimizes the expected loss (aka risk)

based on i.i.d. samples from the source distribution D, and full knowledge of the instantaneous objective function and the hypothesis class W. We are particularly interested in convex learning problems where the hypothesis class is a convex set and the loss function is a convex function in the first argument for all . We seek a learning algorithm that uses the smallest possible number of samples and the least runtime and returns for a user specified , while guaranteeing differential privacy (see Section 2.2 for a formal definition).

A natural approach to solving Problem 1 is sample average approximation (SAA), or empirical risk minimization (ERM), where we instead minimize an empirical approximation of the objective based on the i.i.d. sample. Empirical risk minimization for convex learning problems has been studied in the context of differential privacy by several researchers including Bassily et al. (2019) who give statistically efficient algorithms with oracle complexity matching that of optimal non-private ERM.

An alternative approach to solving Problem 1 is stochastic approximation (SA), wherein rather than form an approximation of the objective, the goal is to directly minimize the true risk. The learning algorithm is an iterative algorithm that processes a single sample from the population in each iteration to perform an update. Stochastic gradient descent (SGD), for instance, is a classic SA algorithm. Recent work of Feldman et al. (2020) gives optimal rates for convex learning problems (Problem 1) using stochastic approximation for smooth loss functions; however, they leave open the question of optimal rates for non-smooth convex learning problems which include a large class of learning problems, including, for example, support vector machines. In this work, we focus on non-smooth convex learning problems and propose a simple algorithm which achieves nearly optimal rates in the special case where the privacy guarantee scales inversely with the number of observed samples. There are alternative approaches; for further discussion we refer the reader to Section 6.

2 Notation and Preliminaries

We consider the general learning setup of Vapnik (2013). Let Z be the sample space and let D be an unknown distribution over Z. We are given a sample drawn identically and independently (i.i.d) from D – the samples may correspond to (feature, label) tuples as in supervised learning, or to features in unsupervised learning. We assume that loss a convex function in its first argument w and the hypothesis set W is a convex set. Given n samples, the goal is to minimize the population risk (Problem 1).

We assume that the hypothesis class W is a convex, closed and bounded set in equipped with norm , such that . We recall that the dual space of is the set of all linear functionals over it; the dual norm, denoted by , is defined as where h is a element of the dual space . In general we will use to denote the when there is no risk of confusion.

We do not assume that f(w, z) is necessarily differentiable and will denote an arbitrary subgradient as . We assume that -lipschitz with respect to the dual norm i.e., . For convex f, this implies that sub-gradients with respect to w, are bounded as . A popular instance of the above is the setup, which considers norm in primal space and the corresponding in the dual space such that . A function -strongly smooth (or just if . For convex functions, this is equivalent to the condition . A function g is -strongly convex if . Finally, we use to suppress poly-logarithmic factors in the complexity.

2.1 Mirror descent and potential functions

We recall some basics of convex duality, which will help us set up the mirror descent algorithm and analysis. For a convex function , we define its conjugate as . Moreover, denotes Bregman divergence induced by defined as

For a convex set W, we denote by the indicator function of the set otherwise. The following result from Kakade et al. (2009) establishes a natural relation between strong convexity and strong smoothness of a potential function and its conjugate, respectively.

Theorem 1 (Theorem 6 from Kakade et al. (2009))is a closed convex function. Then -strongly convex with respect to a norm -strongly smooth with respect to the dual norm

2.2 Differential privacy

We seek to design algorithms for solving the stochastic convex optimization problem (Problem 1) with the additional constraint that the algorithm’s output is differentially private.

3 Related Work

In convex learning and optimization, the following four classes of functions are widely studied: (a) L-lispchitz convex functions, (b) -smooth and convex functions, (c) -strongly convex functions, and (d) -smooth and -strongly convex functions. In the computational framework of first order stochastic oracle, algorithms with optimal oracle complexity for all these classes of functions have long been known (Agarwal et al., 2009). However, the landscape of known results changes with the additional constraint of privacy. We can trace two approaches to solving the private version of Problem 1. The first is private ERM (Chaudhuri et al., 2011; Bassily et al., 2014; Feldman et al., 2018; Bassily et al., 2019) and the second is private stochastic approximation (Feldman et al., 2020). Chaudhuri et al. (2011) begin the study of private ERM by constructing algorithms based on output perturbation and objective perturbation. Under the assumption that the stochastic gradients are -Lipschitz continuous, the output perturbation bounds achieve excess population risk of is the Lipschitz constant of the loss function and D measures the diameter of the hypothesis class in the respective norm. The objective perturbation bounds have a similar form. Bassily et al. (2014) showed tight upper and lower bounds for the excess empirical risk. They also showed bounds for the excess population risk when the loss function does not have Lipschitz continuous gradients, achieving a rate of . Their techniques appeal to uniform convergence i.e. bounding , and convert the guarantees on excess empirical risk to get a bound on the excess population risk. These guarantees, however, were sub-optimal. Bassily et al. (2019) improved these to get optimal bounds on excess population risk, leveraging connections between algorithmic stability and generalization. The algorithms given by Bassily et al. (2019) are sample efficient but their runtimes are superlinear (in the number of samples), whereas the non-private counterparts run in linear time. In a follow-up work, Feldman et al. (2020) improved the runtime of some of these algorithms without sacrificing statistical efficiency; however, the authors require that the stochastic gradients are Lipschitz continuous. Essentially, the statistical complexity of private stochastic convex optimization has been resolved, but some questions about computational efficiency still remain open. We begin with a discussion of different settings for the population loss in subsequent paragraphs, describe what is already known, and what are the avenues for improvement.

Non-smooth Lipschitz Convex. For the class of L-Lipschitz convex functions, Bassily et al. (2014) improved upon Chaudhuri et al. (2011) and gave optimal bounds on excess empirical risk of . They then appeal to uniform convergence to convert the guarantees on excess empirical risk to get an excess population risk of . This is sub-optimal and was very recently improved by Bassily et al. (2019) using connections between algorithmic stability and generalization to get . This is optimal since is the optimal excess risk without privacy

constraints, and

tion. This resolves the statistical complexity of private convex learning and shows that in various regimes of interest, e.g., when , the constraint of privacy has no effect on utility. However, the algorithm proposed by Bassily et al. (2019) is based on Moreau smoothing and proximal methods, and requires stochastic gradient computations. This rate vastly exceeds the gradient computations needed for non-private stochastic convex optimization which are of the order O(n). The computational complexity was improved by Feldman et al. (2020) to using a regularized ERM algorithm that runs in phases and after each phase, noise is added to the solution (output perturbation) and used as regularization for the next phase.

queries to the stochastic gradient oracle. This, again, was improved in a later work of Feldman et al. (2020) to O(n) stochastic gradient queries. Note that even for non-private stochastic optimization O(n) stochastic gradient computations are necessary, so Feldman et al. (2020) achieve optimal statistical and oracle complexity for the smooth Lipschitz convex functions. The algorithm they present is an instance of a single-pass noisy SGD, and the guarantees hold for the last iterate.

-strongly convex functions, Bassily et al. (2014) gave an algorithm with optimal excess empirical risk which is in . However, as in the non-strongly convex case, the corresponding excess population risk is , which is sub-optimal. For this case, Feldman et al. (2020) proposed an algorithm which

achieves optimal rates in O(n) time for smooth functions but for non-smooth functions.

4 Algorithm and Utility Analysis

In this section, we describe the proposed algorithm and provide convergence guarantees. Recall that we are given a set of drawn i.i.d. from D. We consider an iterative algorithm, wherein, at time t, we sample index uniformly at random from the set of

indices, [n]. Note that is a random variable; we denote the realization of . Through the run of the algorithm, we maintain a set, , of indices observed until time

If sample that has not been seen and processed prior to the then we proceed with a projected gradient descent step using the noisy gradient , then the algorithm simply perturbs the current iterate, , and projects on to the set W. We remark that the noise-only step is important for the privacy analysis, as it allows for privacy amplification by sub-sampling.

The algorithm terminates when half of the points in the training set have been processed at least once, i.e., the size of is greater than or equal to n/2. We denote this stopping time by We refer the reader to Algorithm 1 for the pseudo-code.

Next, we establish the utility guarantee for Algorithm 1. We first show that is finite almost surely and bounded by O(n) with probability . The proof follows a standard bins-and-balls argument.

with probability it holds that , which implies

We proceed with bounding the regret of Algorithm 1. Given a sequence of convex functions , we bound the regret

where the expectation is with respect to any randomness in the algorithm and randomness of assume full access to the gradient of and that the diameter of

Our setup fits the popular framework of Online Stochastic Mirror Descent (OSMD) algorithm, wherein, given a strictly convex potential , the updates are given as

We utilize the following result.

Theorem 3 (Theorem 5.5 (Bubeck and Cesa-Bianchi, 2012))be a sequence of functions from is an unbiased estimator of initializes OSMD 2 with , then the expected pseudo-regret is bounded as

For any fixed , we can view Algorithm 1 as an instance of OSMD, with

and implies that for any is the strong-convexity parameter of the potential (in this case ) – using this result with Theorem 3, and taking expectation with respect to the randomness in that for any

Since , we have the following corollary.

Corollary 1. Suppose the diameter of W is bounded by -Lipschitz function for all . Running Algorithm 1 for iterations with noise sequence and step size guarantees that

Proof. Combine Equation 3, Wald’s lemma and the assumptions of the theorem to get

Using the bound on from Theorem 2 together with the step-size choice in the corollary claim finishes the proof.

The result now follows using the Corollary above and the following lemma.

Lemma 1. For the iterates of Algorithm 1 it holds that where is the output of the algorithm.

be the random variable measuring the size of . It then holds that stopping time for the process . Further, denote by the vector of indices consisting of . Let us focus on the term Let be the measure capturing all the randomness after T iterations except for the randomness with respect to the data distribution D. Furthermore, let the indicator of the input event. We now show that essentially follows using the tower rule of expectation. We have the following

where in the second equality, we condition on , and write the conditional expectation as total expectation as:

we use the fact that the iterates are fixed, and take the expectation of with respect to the data generating distribution . The rest of the steps collapses the conditional expectation back to a total expectation, and finally the last inequality holds using convexity. Using Corollary 1 we get

where in the second inequality we have used Wald’s lemma to guarantee

Corollary 1 with Lemma 1 gives the utility guarantee.

Theorem 4. Suppose the elements in W are bounded in norm by function for all . Running Algorithm 1 for iterations with noise sequence step size guarantees that

5 Privacy Proof

In this section, we establish the differential privacy of Algorithm 1. The privacy proof essentially follows the analysis of noisy-SGD in Bassily et al. (2014), but stated in full detail, for completeness and to provide a self-contained presentation. The basic idea is as follows. We view Algorithm 1 as a variant of noisy stochastic gradient descent on the ERM problem over Indeed, but for the step where we update the iterate only with noise and do not compute a gradient, the proposed algorithm would be exactly equivalent to noisy SGD. Prior work shows that using noisy SGD to solve the ERM problem enjoys nicer differential privacy due to privacy amplification via subsampling. Intuitively, Algorithm 1 should also benefit from privacy amplification, and the algorithm should not suffer from privacy loss in steps where we use the zero gradient. We formalize this intuition using the standard analysis for Rènyi differential privacy (Wang et al., 2019) and properties of Rènyi divergence (Mironov, 2017).

We first introduce additional notation. Let be a function which describes the iteration of the algorithm – it takes as input, an iterate of indices of data points, and a subset ; it outputs an iterate and set of indices . We assume that S is totally ordered according to the indices of the datapoints. Further let denote the set of indices of datapoints in denote the first and second outputs of . Note that in the algorithm, the ’s are composed – we initialize is fixed, therefore, acts on the output of . In general, we have that

Formally, given , we define random variable

where is the following gradient operator:

In Algorithm 1, m = 1, and in general m is just the size of the sub-sampled set from S, which we use to construct a mini-batched gradient.

The input space of has measure Unif([n], m) is the sub-sampling measure, which sub-samples m out of n data points uniformly

randomly. We now construct a vector concatenating the first outputs of all these

We claim that is differentially private.

To prove the above theorem we are first going to show that the first coordinate of each sufficiently differentially private.

Proof. Consider the two differing datasets and let the index at which they differ be Let p = m/|S|. Denote the random subsample from respectively Under the uniform random sampling we have . This holds because the sampling of indices only depends on the size . The proof follows the standard privacy amplification by sub-sampling argument , and any measurable (with respect to the Lebesgue measure on the space define the measures

First we note that because the gradient descent step is restricted only to points indexed by and the differing point does not belong to that set. We also have because in this case is not part of the subsampled points. We consider two cases: . In the first case , and we have

perfect (0, 0)-DP. In the other case, we have

where in the first inequality we have used the following DP-guarantee of subsampled sets it holds that the mechanism -DP as it is a step of noisy projected gradient descent with gradients bounded in norm by L and Gaussian noise with sufficient variance. We use the DP guarantee twice – once on the neighboring datasets get the inequality and once on the neighboring dataset to get the inequality . In the second application, note that R(E) contains events when a previously seen point is sampled; however the noise-only-step ensures that it also is a Gaussian with the same variance. In the second inequality we use the fact that a convex combination of two numbers is greater than their minimum and in the last inequality we use the standard relation Combining the two cases finishes the proof.

We can now prove Theorem 5.

Proof of Theorem 5. The proof essentially follows the proof of Theorem 3.3 (Dwork et al., 2010). Let . Condition on the event that throughout the -DP holds for all of the mechanisms . This event fails with probability and will be accounted for in the final DP bound. Define the set of events on which -DP does not holds as

The goal is to show that the random variable taking values . Further let w be a fixed realization of

recall we conditioned on the event that each of the mechanisms -DP. Define the random variable which takes values

We have just shown that switching we can show in the exact same way that implies that the random variable is a.s. bounded by . Further using the fact that -DP implies boundedness in the -divergence we also have

Lemma 3.2 in (Dwork et al., 2010) now implies that the KL-divergence between is bounded as . Together with the defini- tion of this implies

Because the filtration generated by includes the one generated by now apply Azuma’s inequality on the martingale difference sequence , which we have just shown it is bounded as the KL-term is bounded by ’s are bounded by

Because for it holds that . This in hand implies which shows that for a fixed

Combining the utility and privacy guarantees yields the main result, stated below.

Theorem 6. Let f(w, z) be a convex L-Lipschitz function in its first argument, . Furthermore, assume that the diameter of W is bounded by , Algorithm 1 run with step size

which satisfies – differential privacy, for any and its accuracy is bounded as,

Proof. Condition on the event that . Using Theorem 5, with implies that Algorithm 1 is -DP under the condition that . The event happens with probability according to Theorem 2, which implies that Algorithm 1 is -DP. Finally, the convergence guarantee follows from Theorem 4.

-DP, for any , we can set . Using the condition , we get that The utility guarantee

6 Other Approaches

We now briefly discuss a few other approaches which would also recover a convergence rate of . The first approach uses the standard decomposition of excess population risk to generalization gap + excess empirical risk. As pointed out in Bassily et al. (2019); Feldman et al. (2020), it is possible to use the optimal rates for ERM from Bassily et al. (2014) together with generalization propties of DP from Dwork et al. (2015) to bound the generalization gap, to guarantee a rate of , which evaluates toin the regime

. Even though the runtime of noisy-SGD for ERM, stated in Bassily et al. (2014) is , it can be shown that their result holds with only

– therefore, in the regime , it is a linear time algorithm. Second, it may be possible to use amplification by subsampling in Algorithm 1 of Feldman et al.

(2020) instead of amplification by iteration. This would discard the smoothness requirement for and perhaps yield the same result in Theorem 6. We note that Algorithm 1 is much simpler and does not require the complicated mini-batch schedule from Algorithm 1 of Feldman et al. (2020).

7 Conclusion

In this paper, we studied the problem of private stochastic convex optimization for non-smooth objectives. We proposed a noisy version of the OSMD algorithm and presented convergence and privacy guarantees for the achieves statistically optimal rates, while running in linear time, and is differentially private as long as the DP parameter is sufficiently small. Algorithm 1 can easily be extended to geometries induced by arbitrary strictly convex potentials We leave it as future work to explore what kind of privacy guarantees can be retained if the privacy mechanism is tailored towards the geometry induced by . Finally, we think it should be possible to extend our analysis to the case when is strongly convex for all and achieve optimal statistical rates in linear time.

References

Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information- theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, pages 11279–11288, 2019.

Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi- armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.

Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 2011.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.

Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and Differential Privacy. In FOCS, pages 51–60, 2010.

Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 117–126. ACM, 2015.

Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 521–532, 2018.

Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.

Sham Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished Manuscript, http://ttic. uchicago. edu/shai/papers/KakadeShalevTewari09. pdf, 2(1), 2009.

Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.

Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 111–125. IEEE, 2008.

Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine & Ethics, 25(2-3):98–110, 1997.

Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.

Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235. PMLR, 2019.

A Proof of Theorem 2

Proof. We begin by fixing a time horizon T and showing that with high probability appropriately chosen be the function which counts the unique draws among . Further, let be the indicator random variable of the event that the i-th datapoint is drawn at least ones in the T rounds, i.e., . We can write the expectation of q as

Setting already shows that the number of unique elements after is at least n/2 and thus the stopping time condition is reached. Next, we show concentration around the expectation of q. Note that if a single element is changed by the value of q will not change by more than 1 i.e.,

This allows us to use McDiarmid’s inequality to claim that

This implies that with probability at least

Setting we have that for any it holds that with probability

To get the bound in expectation we note that . On the other hand we know from the above derivation, that for it holds that . This implies