b

DiscoverSearch
About
My stuff
Private Stochastic Convex Optimization: Efficient Algorithms for Non-smooth Objectives
2020·arXiv
Abstract
Abstract

In this paper, we revisit the problem of private stochastic convex optimization. We propose an algorithm based on noisy mirror descent, which achieves optimal rates both in terms of statistical complexity and number of queries to a first-order stochastic oracle in the regime when the privacy parameter is inversely proportional to the number of samples.

Modern machine learning systems often leverage data that are generated ubiquitously and seamlessly through devices such as smartphones, cameras, microphones, or user’s weblogs, transaction logs, social media, etc. Much of this data is private, and releasing models trained on such data without serious privacy considerations can reveal sensitive information (Narayanan and Shmatikov, 2008; Sweeney, 1997). Consequently, much emphasis has been placed in recent years on machine learning under the constraints of a robust privacy guarantee. One such notion that has emerged as a de facto standard is that of differential privacy.

Informally, differential privacy provides a quantitative assessment of how different are the outputs of a randomized algorithm when fed two very similar inputs. If small changes in the input do not manifest as drastically different outputs, then it is hard to discern much information about the inputs solely based on the outputs of the algorithm. In the context of machine learning, this implies that if the learning algorithm is not overly sensitive to any single datum in the training set, then releasing the trained model should preserve the privacy of the training data. This requirement, apriori, seems compatible with the goal of learning, which is to find a model that generalizes well on the population and does not overfit to the given training sample. It seems reasonable then to argue that privacy is not necessarily at odds with generalization, especially when large training sets are available.

We take the following stochastic optimization view of machine learning, where the goal is to find a predictor that minimizes the expected loss (aka risk)

image

based on i.i.d. samples from the source distribution D, and full knowledge of the instantaneous objective function  f(·, ·)and the hypothesis class W. We are particularly interested in convex learning problems where the hypothesis class is a convex set and the loss function  f(·, z)is a convex function in the first argument for all  z ∈ Z. We seek a learning algorithm that uses the smallest possible number of samples and the least runtime and returns  ˜w such that F(˜w) ≤ infw∈W F(w)+α,for a user specified  α > 0, while guaranteeing differential privacy (see Section 2.2 for a formal definition).

A natural approach to solving Problem 1 is sample average approximation (SAA), or empirical risk minimization (ERM), where we instead minimize an empirical approximation of the objective based on the i.i.d. sample. Empirical risk minimization for convex learning problems has been studied in the context of differential privacy by several researchers including Bassily et al. (2019) who give statistically efficient algorithms with oracle complexity matching that of optimal non-private ERM.

An alternative approach to solving Problem 1 is stochastic approximation (SA), wherein rather than form an approximation of the objective, the goal is to directly minimize the true risk. The learning algorithm is an iterative algorithm that processes a single sample from the population in each iteration to perform an update. Stochastic gradient descent (SGD), for instance, is a classic SA algorithm. Recent work of Feldman et al. (2020) gives optimal rates for convex learning problems (Problem 1) using stochastic approximation for smooth loss functions; however, they leave open the question of optimal rates for non-smooth convex learning problems which include a large class of learning problems, including, for example, support vector machines. In this work, we focus on non-smooth convex learning problems and propose a simple algorithm which achieves nearly optimal rates in the special case where the privacy guarantee scales inversely with the number of observed samples. There are alternative approaches1; for further discussion we refer the reader to Section 6.

We consider the general learning setup of Vapnik (2013). Let Z be the sample space and let D be an unknown distribution over Z. We are given a sample  z1, z2, . . . , zndrawn identically and independently (i.i.d) from D – the samples  zimay correspond to (feature, label) tuples as in supervised learning, or to features in unsupervised learning. We assume that loss  f : W × Z → R isa convex function in its first argument w and the hypothesis set W is a convex set. Given n samples, the goal is to minimize the population risk (Problem 1).

We assume that the hypothesis class W is a convex, closed and bounded set in  Rd equipped with norm  ∥·∥, such that  ∥w∥ ≤ D for all w ∈ W. We recall that the dual space of  (Rd, ∥·∥)is the set of all linear functionals over it; the dual norm, denoted by  ∥·∥∗, is defined as  ∥h∥∗ = min∥w∥≤1 h(w),where h is a element of the dual space  (Rd, ∥·∥∗). In general we will use  ∥ · ∥to denote the  ℓ2 normwhen there is no risk of confusion.

We do not assume that f(w, z) is necessarily differentiable and will denote an arbitrary subgradient as  ∇f(w, z). We assume that  f(·, z) is L-lipschitz with respect to the dual norm  ∥·∥∗,i.e.,  |f(w1, z) − f(w2, z)| ≤ L ∥w1 − w2∥ for all z. For convex f, this implies that sub-gradients with respect to w, are bounded as  ∥∇f(w, z)∥∗ ≤ L, ∀ w ∈ W, z ∈ Z. A popular instance of the above is the  ℓp/ℓqsetup, which considers  ℓpnorm in primal space and the corresponding  ℓq normin the dual space such that 1p + 1q = 1. A function  g(·) is β-strongly smooth (or just  β-smooth)if  ∥∇g(w1) − ∇g(w2)∥∗ ≤ β ∥w1 − w2∥ , ∀w1, w2 ∈ W. For convex functions, this is equivalent to the condition  g(w2) ≤ g(w1) + ⟨∇g(w1), w2 − w1⟩ + β2 ∥w1 − w2∥2, ∀w1, w2 ∈ W. A function g is λ-strongly convex if  g(w2) ≥ g(w1) + ⟨∇g(w1), w2 − w1⟩ + λ2 ∥w1 − w2∥2 , ∀w1, w2 ∈ W. Finally, we use ˜O(·)to suppress poly-logarithmic factors in the complexity.

2.1 Mirror descent and potential functions

We recall some basics of convex duality, which will help us set up the mirror descent algorithm and analysis. For a convex function  Φ : Rd → R, we define its conjugate  Φ∗ : Rd → R ∪ {∞}as  Φ∗(Y ) = supX⟨Y, X⟩ − Φ(X). Moreover,  DΦ(z||y)denotes Bregman divergence induced by  Φ,defined as

image

For a convex set W, we denote by  IWthe indicator function of the set  W, IW(z) = 0 if z ∈ W, and∞otherwise. The following result from Kakade et al. (2009) establishes a natural relation between strong convexity and strong smoothness of a potential function and its conjugate, respectively.

Theorem 1 (Theorem 6 from Kakade et al. (2009)). Assume Φis a closed convex function. Then Φ is α-strongly convex with respect to a norm  ∥ · ∥ iff Φ∗ is 1α-strongly smooth with respect to the dual norm  ∥ · ∥∗.

2.2 Differential privacy

We seek to design algorithms for solving the stochastic convex optimization problem (Problem 1) with the additional constraint that the algorithm’s output is differentially private.

image

In convex learning and optimization, the following four classes of functions are widely studied: (a) L-lispchitz convex functions, (b)  β-smooth and convex functions, (c)  λ-strongly convex functions, and (d)  β-smooth and  λ-strongly convex functions. In the computational framework of first order stochastic oracle, algorithms with optimal oracle complexity for all these classes of functions have long been known (Agarwal et al., 2009). However, the landscape of known results changes with the additional constraint of privacy. We can trace two approaches to solving the private version of Problem 1. The first is private ERM (Chaudhuri et al., 2011; Bassily et al., 2014; Feldman et al., 2018; Bassily et al., 2019) and the second is private stochastic approximation (Feldman et al., 2020). Chaudhuri et al. (2011) begin the study of private ERM by constructing algorithms based on output perturbation and objective perturbation. Under the assumption that the stochastic gradients are  β-Lipschitz continuous, the output perturbation bounds achieve excess population risk of O(LD max(1/√n, d/(nǫ), d√β/(n2/3ǫ))), where Lis the Lipschitz constant of the loss function and D measures the diameter of the hypothesis class in the respective norm. The objective perturbation bounds have a similar form. Bassily et al. (2014) showed tight upper and lower bounds for the excess empirical risk. They also showed bounds for the excess population risk when the loss function does not have Lipschitz continuous gradients, achieving a rate of  O(d1/4/(√nǫ)). Their techniques appeal to uniform convergence i.e. bounding  supw∈W F(w) − �F(w), and convert the guarantees on excess empirical risk to get a bound on the excess population risk. These guarantees, however, were sub-optimal. Bassily et al. (2019) improved these to get optimal bounds on excess population risk, leveraging connections between algorithmic stability and generalization. The algorithms given by Bassily et al. (2019) are sample efficient but their runtimes are superlinear (in the number of samples), whereas the non-private counterparts run in linear time. In a follow-up work, Feldman et al. (2020) improved the runtime of some of these algorithms without sacrificing statistical efficiency; however, the authors require that the stochastic gradients are Lipschitz continuous. Essentially, the statistical complexity of private stochastic convex optimization has been resolved, but some questions about computational efficiency still remain open. We begin with a discussion of different settings for the population loss in subsequent paragraphs, describe what is already known, and what are the avenues for improvement.

Non-smooth Lipschitz Convex. For the class of L-Lipschitz convex functions, Bassily et al. (2014) improved upon Chaudhuri et al. (2011) and gave optimal bounds on excess empirical risk of O� dǫn�. They then appeal to uniform convergence to convert the guarantees on excess empirical risk to get an excess population risk of  O�max�d1/4√n ,√dǫn��. This is sub-optimal and was very recently improved by Bassily et al. (2019) using connections between algorithmic stability and generalization to get  O�max� 1√n,√dǫn��. This is optimal since 1√n is the optimal excess risk without privacy

constraints, and

tion. This resolves the statistical complexity of private convex learning and shows that in various regimes of interest, e.g., when  d = Θ(n) and ǫ = Θ(1), the constraint of privacy has no effect on utility. However, the algorithm proposed by Bassily et al. (2019) is based on Moreau smoothing and proximal methods, and requires  O(n5)stochastic gradient computations. This rate vastly exceeds the gradient computations needed for non-private stochastic convex optimization which are of the order O(n). The computational complexity was improved by Feldman et al. (2020) to  O(n2) byusing a regularized ERM algorithm that runs in phases and after each phase, noise is added to the solution (output perturbation) and used as regularization for the next phase.

image

queries to the stochastic gradient oracle. This, again, was improved in a later work of Feldman et al. (2020) to O(n) stochastic gradient queries. Note that even for non-private stochastic optimization O(n) stochastic gradient computations are necessary, so Feldman et al. (2020) achieve optimal statistical and oracle complexity for the smooth Lipschitz convex functions. The algorithm they present is an instance of a single-pass noisy SGD, and the guarantees hold for the last iterate.

(Smooth and Non-smooth) Lipschitz Strongly Convex. For L-Lispchitz λ-strongly convex functions, Bassily et al. (2014) gave an algorithm with optimal excess empirical risk which is in ˜O� dλn2ǫ2�. However, as in the non-strongly convex case, the corresponding excess population risk is ˜O� √dλǫn�, which is sub-optimal. For this case, Feldman et al. (2020) proposed an algorithm which

achieves optimal rates in O(n) time for smooth functions but  O(n2)for non-smooth functions.

In this section, we describe the proposed algorithm and provide convergence guarantees. Recall that we are given a set of  n samples (z1, . . . , zn)drawn i.i.d. from D. We consider an iterative algorithm, wherein, at time t, we sample index  Ytuniformly at random from the set of

indices, [n]. Note that  Ytis a random variable; we denote the realization of  Yt as yt. Through the run of the algorithm, we maintain a set,  Ft, of indices observed until time  t, i.e., Ft = {yτ | τ < t}.

If  yt /∈ Ft, i.e., zyt is a freshsample that has not been seen and processed prior to the  tth iteration,then we proceed with a projected gradient descent step using the noisy gradient  ∇f(wt, zyt) + ξt. Ifyt ∈ Ft, then the algorithm simply perturbs the current iterate,  wt, and projects on to the set W. We remark that the noise-only step is important for the privacy analysis, as it allows for privacy amplification by sub-sampling.

The algorithm terminates when half of the points in the training set have been processed at least once, i.e., the size of  Ftis greater than or equal to n/2. We denote this stopping time by  τ.We refer the reader to Algorithm 1 for the pseudo-code.

image

Next, we establish the utility guarantee for Algorithm 1. We first show that  τis finite almost surely and bounded by O(n) with probability  1 − O(exp (−n)). The proof follows a standard bins-and-balls argument.

Theorem 2. For any n ≥ 16with probability  1 − 2 exp (−n/16)it holds that  τ ≤ 2n, which implies E[τ] ≤ O(n).

We proceed with bounding the regret of Algorithm 1. Given a sequence of convex functions ˜ft : W → R, where ft(·) = f(·, z), we bound the regret

image

where the expectation is with respect to any randomness in the algorithm and randomness of ˜ft’s. Weassume full access to the gradient of ˜ftand that the diameter of  W is D, i.e., ∀w, v ∈ W, ∥w−v∥ ≤ D.

Our setup fits the popular framework of Online Stochastic Mirror Descent (OSMD) algorithm, wherein, given a strictly convex potential  Φ : Rd → R, the updates are given as

image

We utilize the following result.

Theorem 3 (Theorem 5.5 (Bubeck and Cesa-Bianchi, 2012)). Let f1, . . . , fτbe a sequence of functions from  Rd to R. Suppose ∇ ˜ftis an unbiased estimator of  ∇ft, i.e., E[∇ ˜ft] = ∇ft. If oneinitializes OSMD 2 with  w1 = arg minw∈W Φ(w), then the expected pseudo-regret ¯R(τ)is bounded as

image

For any fixed  τ < ∞, we can view Algorithm 1 as an instance of OSMD, with  Φ ≡ 12∥ · ∥22, and

image

and  ft(w) = Eξt[ ˜ft(w)]. Theorem 1implies that for any  z, y ∈ Rd DΦ∗(z||y) ≤ 1/α∥z − y∥2∗, whereαis the strong-convexity parameter of the potential (in this case  α = 1) – using this result with Theorem 3, and taking expectation with respect to the randomness in  τ and ξt, t ∈ [τ], we havethat for any  u ∈ W

image

Since  ξ ∼ N(0, σI), we have the following corollary.

Corollary 1. Suppose the diameter of W is bounded by  D and that f(·, zt) is an L-Lipschitz function for all  t ∈ [n]. Running Algorithm 1 for  τiterations with noise sequence  ξt ∼ N(0, σI)and step size ηt = D√n(L+σ√d) guarantees that

image

Proof. Combine Equation 3, Wald’s lemma and the assumptions of the theorem to get

image

Using the bound on  E[τ] ≤ O(n)from Theorem 2 together with the step-size choice in the corollary claim finishes the proof.

The result now follows using the Corollary above and the following lemma.

Lemma 1. For the iterates  {wt}τt=1 of Algorithm 1 it holds that  E[�t∈Fτ f(wt, zYt)] ≥ nE[F(�wτ)],where  �wτis the output of the algorithm.

Proof. Let St = |Ft|be the random variable measuring the size of  Ft. It then holds that  τ is astopping time for the process  {St}∞t=1, i.e., τ = min{t : St = n}. Further, denote by  y = (y1, . . . , yT )the vector of indices consisting of  y1, . . . , yT for some T. Let us focus on the term  E[�t∈Fτ f(wt, xYt)].Let  µ = N(000, σ2Id)T × Unif[n]T be the measure capturing all the randomness after T iterations except for the randomness with respect to the data distribution D. Furthermore, let  χ(·) denotesthe indicator of the input event. We now show that  E[�t∈Fτ f(wt, zYt)] = E��t∈Fτ F(wt)�, whichessentially follows using the tower rule of expectation. We have the following

image

where in the second equality, we condition on  τ, Fτ, and write the conditional expectation as total expectation as:  E��t∈Fτ f(wt, zYt)|τ, Fτ�= E�χ(τ=T,FT =y) �t∈FT f(wt,zYt)P[τ=T,FT =y]

we use the fact that the iterates  wtare fixed, and take the expectation of  f(wt, zyt)with respect to the data generating distribution  D, yielding F(wt). The rest of the steps collapses the conditional expectation back to a total expectation, and finally the last inequality holds using convexity. Using Corollary 1 we get

image

where in the second inequality we have used Wald’s lemma to guarantee  E[�τt=1⟨ξt, wt⟩] = 0 andE[�τt=1⟨ξt, u⟩] = 0.

image

Corollary 1 with Lemma 1 gives the utility guarantee.

Theorem 4. Suppose the elements in W are bounded in norm by  D and that f(·, zt) is an L-Lipschitzfunction for all  t ∈ [n]. Running Algorithm 1 for  τiterations with noise sequence  ξt ∼ N(0, σI) andstep size  ηt = D√n(L+σ√d) guarantees that

image

In this section, we establish the differential privacy of Algorithm 1. The privacy proof essentially follows the analysis of noisy-SGD in Bassily et al. (2014), but stated in full detail, for completeness and to provide a self-contained presentation. The basic idea is as follows. We view Algorithm 1 as a variant of noisy stochastic gradient descent on the ERM problem over  n samples {zi}ni=1.Indeed, but for the step where we update the iterate only with noise and do not compute a gradient, the proposed algorithm would be exactly equivalent to noisy SGD. Prior work shows that using noisy SGD to solve the ERM problem enjoys nicer differential privacy due to privacy amplification via subsampling. Intuitively, Algorithm 1 should also benefit from privacy amplification, and the algorithm should not suffer from privacy loss in steps where we use the zero gradient. We formalize this intuition using the standard analysis for Rènyi differential privacy (Wang et al., 2019) and properties of Rènyi divergence (Mironov, 2017).

We first introduce additional notation. Let  Mi : W × [n]∗ × S → W × [n]∗ be a function which describes the  iit iteration of the algorithm – it takes as input, an iterate  wi ∈ W, a set, Fi ⊆ [n]∗,of indices of data points, and a subset  Si ⊆ S; it outputs an iterate  wi+1 ∈ Wand set of indices Fi+1 ⊆ [n]∗. We assume that S is totally ordered according to the indices of the datapoints. Further let  Sidenote the set of indices of datapoints in  Si. Let M1i and M2i denote the first and second outputs of  Mi, i.e., M1i (wi, Fi, S) = wi+1 and M2i (wi, Fi, S) = Fi+1. Note that in the algorithm, the  Mi’s are composed – we initialize  F1 = ∅ and w1is fixed, therefore,  M1(w1, ∅, S) = (w2, F2).M2acts on the output of  M1(w1, ∅, S) as M2((M1(w1, ∅, S)), S) = (w3, F3). In general, we have that

image

Formally, given  w ∈ W, m ∈ N, and F ∈ [n]∗, we define random variable  Mi(w, F, S) : Rd ×[n]m →

image

where  giis the following gradient operator:

image

In Algorithm 1, m = 1, and in general m is just the size of the sub-sampled set from S, which we use to construct a mini-batched gradient.

The input space of  Mi(wi, Fi, S), which is Rd×[n]m has measure  N(0, σ2I)×Unif([n], m), whereUnif([n], m) is the sub-sampling measure, which sub-samples m out of n data points uniformly

randomly. We now construct a vector concatenating the first outputs of all these  Mi’s. Let

image

We claim that  Mτ(w1, S)is differentially private.

image

To prove the above theorem we are first going to show that the first coordinate of each  Mi issufficiently differentially private.

image

Proof. Consider the two differing datasets  S and S′ and let the index at which they differ be  ρ.Let p = m/|S|. Denote the random subsample from  S and S′ as ˜S and ˜S′ respectively Under the uniform random sampling we have ˜S = ˜S′. This holds because the sampling of indices only depends on the size  n of S and S′. The proof follows the standard privacy amplification by sub-sampling argument 2. For fixed w and F, and any measurable (with respect to the Lebesgue measure on the space  Rd) E ⊆ Wdefine the measures

image

First we note that  Q(E) = Q′(E)because the gradient descent step is restricted only to points indexed by  [ ˜S] \Fand the differing point  ρdoes not belong to that set. We also have  R(E) = R′(E)because in this case  ρis not part of the subsampled points. We consider two cases:  ρ ∈ F andρ ̸∈ F. In the first case  P�M1i (w, F, S) ∈ E�= Q(E) = Q′(E) = P�M1i (w, F, S′) ∈ E�, and we have

perfect (0, 0)-DP. In the other case, we have

image

where in the first inequality we have used the following DP-guarantee of  M1(w, F, S). For anysubsampled sets ˜S ⊆ S and ˜S′ ⊆ S′ it holds that the mechanism  M1 is (˜ǫ, δ)-DP as it is a step of noisy projected gradient descent with gradients bounded in norm by L and Gaussian noise with sufficient variance. We use the DP guarantee twice – once on the neighboring datasets  S and S′ toget the inequality  P(E) ≤ e˜ǫP ′(E)and once on the neighboring dataset  S \{zρ}to get the inequality P(E) ≤ e˜ǫR(E). In the second application, note that R(E) contains events when a previously seen point is sampled; however the noise-only-step ensures that it also is a Gaussian with the same variance. In the second inequality we use the fact that a convex combination of two numbers is greater than their minimum and in the last inequality we use the standard relation  1 + x ≤ ex.Combining the two cases finishes the proof.

We can now prove Theorem 5.

Proof of Theorem 5. The proof essentially follows the proof of Theorem 3.3 (Dwork et al., 2010). Let  ǫ0 = 1|S|(e˜ǫ − 1), δ0 = 1|S|δ, let ǫ1 = eǫ0 − 1 and let ǫ′ = ǫ0�2τ log (1/δ′) + τǫ0ǫ1. Condition on the event that throughout the  τ rounds, ǫ0-DP holds for all of the mechanisms  M1i . This event fails with probability  τδ0and will be accounted for in the final DP bound. Define the set of events on which  ǫ′-DP does not holds as

image

The goal is to show that  P [Mτ(w1, S) ∈ B] ≤ δ′. Let W = Mτ(w1, S) = [W1, . . . , Wτ], where Wi isthe random variable taking values  wi, and let W′ = Mτ(w1, S′). Further let w be a fixed realization of  W, i.e., w = [w1, . . . , wτ]. We have

image

recall we conditioned on the event that each of the mechanisms  M1i are ǫ0-DP. Define the random variable  ci−1(W2, . . . , Wi, W′2, . . . , W′i, Y1, . . . , Yi−1)which takes values

image

We have just shown that  ci−1(w2, . . . , wi, w2, . . . , wi, y1, . . . , yi−1) ≤ ǫ0 for any w2:i, y1:i−1. Byswitching  Wi with W′i we can show in the exact same way that  −ci−1(w2:i, w2:i, y1:i−1) ≤ ǫ0 whichimplies that the random variable  ci−1(W2:i, W′2:i, Y1:i−1)is a.s. bounded by  ǫ0. Further using the fact that  ǫ0-DP implies boundedness in the  ∞-divergence we also have

image

Lemma 3.2 in (Dwork et al., 2010) now implies that the KL-divergence between  M1i (wi, Fi, S) andM1i (wi, Fi, S′)is bounded as  DKL(M1i (wi, Fi, S)||M1i (wi, Fi, S′)) ≤ ǫ1ǫ0. Together with the defini- tion of  ci−1this implies

image

Because the filtration generated by  W2:i−1, Y1:i−1includes the one generated by  c1:i−1 we cannow apply Azuma’s inequality on the martingale difference sequence  ci−1(W2:i, W′2:i, Y1:i−1) −DKL(M1i (wi, Fi, S)||M1i (wi, Fi, S′)), which we have just shown it is bounded as the KL-term is bounded by  ǫ0ǫ1 and ci’s are bounded by  ǫ0, and get

image

Because for  ˜ǫ ≤ 1.256it holds that  e˜ǫ − 1 ≤ 2˜ǫ we have ǫ0 ≤ 2˜ǫ|S|. This in hand implies  ǫ1 ≤ 2ǫ0which shows that for a fixed  τ Algorithm 1 is (2˜ǫ√2τ log(1/δ′)|S| + 4τ|S|2ǫ2, τ|S|δ + δ′)-DP.

Combining the utility and privacy guarantees yields the main result, stated below.

Theorem 6. Let f(w, z) be a convex L-Lipschitz function in its first argument,  w ∈ W, for allz ∈ Z. Furthermore, assume that the diameter of W is bounded by  D. For any n ≥ 16, given0 < ǫ ≤ 12√n, δ > 0, Algorithm 1 run with step size  η = D√n(L+σ√d) and σ = 8L√log(1/δ)√nǫ outputs �wτ

which satisfies  (4ǫ(�log (1/δ′) + 2), δ + δ′ + 2 exp (−n/16))– differential privacy, for any  δ′ > 0,and its accuracy is bounded as,

image

Proof. Condition on the event that  τ ≤ 2n. Using Theorem 5, with  ˜ǫ = √nǫimplies that Algorithm 1 is  (4ǫ(�log (1/δ′) + 2), δ + δ′)-DP under the condition that  ǫ ≤ 1.256√n . The event  τ > 2nhappens with probability  −2 exp (−n/16)according to Theorem 2, which implies that Algorithm 1 is  (4ǫ(�log (1/δ′) + 2), δ + δ′ + 2 exp (−n/16))-DP. Finally, the convergence guarantee follows from Theorem 4.

Remark 1. To get (¯ǫ, ¯δ)-DP, for any  6 exp (−n/16) ≤ ¯δ ≤ 3e−4, we can set  δ′ = δ = ¯δ3, andǫ = ¯ǫ8√log(1/δ′). Using the condition  ǫ ≤ 1√n, we get that ¯ǫ�log(3/¯δ) ≤ 8√n The utility guarantee

image

We now briefly discuss a few other approaches which would also recover a convergence rate of ˜O(1/√n+√d/(ǫn)) for ǫ ≤ O(�1/n). The first approach uses the standard decomposition of excess population risk to generalization gap + excess empirical risk. As pointed out in Bassily et al. (2019); Feldman et al. (2020), it is possible to use the optimal rates for ERM from Bassily et al. (2014) together with generalization propties of DP from Dwork et al. (2015) to bound the generalization gap, to guarantee a rate of  O�max�d1/4/√n,√d/(nǫ)��, which evaluates to�d/nin the regime

ǫ ≤ O(1/√n). Even though the runtime of noisy-SGD for ERM, stated in Bassily et al. (2014) is O(n2), it can be shown that their result holds with only  O�(nǫ)2√d

– therefore, in the regime  ǫ ≲ 1√n, it is a linear time algorithm. Second, it may be possible to use amplification by subsampling in Algorithm 1 of Feldman et al.

(2020) instead of amplification by iteration. This would discard the smoothness requirement for f(·, z)and perhaps yield the same result in Theorem 6. We note that Algorithm 1 is much simpler and does not require the complicated mini-batch schedule from Algorithm 1 of Feldman et al. (2020).

In this paper, we studied the problem of private stochastic convex optimization for non-smooth objectives. We proposed a noisy version of the OSMD algorithm and presented convergence and privacy guarantees for the  ℓ2 geometry. Algorithm 1achieves statistically optimal rates, while running in linear time, and is differentially private as long as the DP parameter is sufficiently small. Algorithm 1 can easily be extended to geometries induced by arbitrary strictly convex potentials  Φ.We leave it as future work to explore what kind of privacy guarantees can be retained if the privacy mechanism is tailored towards the geometry induced by  Φ. Finally, we think it should be possible to extend our analysis to the case when  f(·, zi)is strongly convex for all  i ∈ [n]and achieve optimal statistical rates in linear time.

Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information- theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, pages 11279–11288, 2019.

Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi- armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.

Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 2011.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.

Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and Differential Privacy. In FOCS, pages 51–60, 2010.

Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 117–126. ACM, 2015.

Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 521–532, 2018.

Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.

Sham Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished Manuscript, http://ttic. uchicago. edu/shai/papers/KakadeShalevTewari09. pdf, 2(1), 2009.

Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.

Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 111–125. IEEE, 2008.

Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine &amp; Ethics, 25(2-3):98–110, 1997.

Vladimir Vapnik. The nature of statistical learning theory. Springer science &amp; business media, 2013.

Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235. PMLR, 2019.

Proof. We begin by fixing a time horizon T and showing that with high probability  τ ≤ T forappropriately chosen  T. Let q : [n]T → Nbe the function which counts the unique draws among (Y1, . . . , YT ). Further, let  Zibe the indicator random variable of the event that the i-th datapoint is drawn at least ones in the T rounds, i.e.,  ∃j ∈ [T] : Yj = i. We can write the expectation of q as

image

Setting  T ≥ nalready shows that the number of unique elements after  T ≥ nis at least n/2 and thus the stopping time condition is reached. Next, we show concentration around the expectation of q. Note that if a single element  yiis changed by  y′i the value of q will not change by more than 1 i.e.,

image

This allows us to use McDiarmid’s inequality to claim that

image

This implies that with probability at least  1 − δ we have

image

Setting  T = 2n and δ = 2 exp (−n/16)we have that for any  n ≥ 16it holds that with probability

image

To get the bound in expectation we note that  P [τ ≥ T] = P [q(Y1, . . . , YT ) ≤ n/2]. On the other hand we know from the above derivation, that for  T ≥ 2nit holds that  P [q(Y1, . . . , YT ) ≤ n/2] ≤2e−T/16. This implies

image


Designed for Accessibility and to further Open Science