Privacy Amplification of Iterative Algorithms via Contraction Coefficients

2020·Arxiv

Abstract

Abstract

We investigate the framework of privacy amplification by iteration, recently proposed by Feldman et al., from an informationtheoretic lens. We demonstrate that differential privacy guarantees of iterative mappings can be determined by a direct application of contraction coefficients derived from strong data processing inequalities for f-divergences. In particular, by generalizing the Dobrushin’s contraction coefficient for total variation distance to an f-divergence known as -divergence, we derive tighter bounds on the differential privacy parameters of the projected noisy stochastic gradient descent algorithm with hidden intermediate updates.

I. INTRODUCTION AND MOTIVATION

Differential privacy (DP) [1, 2] has become the standard definition for designing privacy-preserving machine learning algorithms. One reason for its success is its operational significance, which can be best described in terms of binary hypothesis testing (see, e.g., [3, 4]). Nevertheless, it is often difficult to compute DP guarantees for applications where a high number of data accesses is needed for a single analysis [5, 6]. To obtain the DP parameters in such applications, which include machine learning models trained using stochastic gradient descent (SGD), one needs to resort to composition theorems which are often loose due to their generality. As a remedy, several variants of DP have been recently proposed [7–10] based on Rényi divergence. These variants enjoy better composition properties. Among these variants, Rényi DP (RDP) has proven to be effective in studying private deep learning algorithms [5] especially when paired with sub-sampling techniques [11].

Recently, the new framework of privacy amplification by iteration was proposed by Feldman et al. [12] as an alternative to privacy amplification by sub-sampling. This framework possesses several advantages which makes it well-suited for determining and enforcing privacy in distributed settings where data samples are stored locally by each user. Existing private algorithms based on sub-sampling require hiding the set of users participating in each update step of the model. This requirement, however, dictates either all data samples be stored centrally (i.e., no distributed setting) or all-to-all communication (i.e., excessive communication complexity). The new framework of privacy amplification by iteration relaxes these issues; it does not require the order of participating users to be random or hidden. On the other hand, it requires that all intermediate updates be hidden until a certain number of update steps are applied (e.g., not disclosing model update of SGD before a pre-specified step, say, n-th step).

Since the intermediate updates are assumed to be hidden, one can view an iterative process as a concatenation of channels. To see this, let be a sequence of mapping and the update rule be given by

where and are i.i.d. copies of a noise distribution . Let be the output of the same process started at . Letting and be the distributions of and , the strong data processing inequality (SDPI) for f-divergences (see, e.g., [13, 14]) implies that

where is an f-divergence and is the contraction coefficient (also known as strong data processing constant) of the Markov kernel under f-divergence (see Section III for details). By exploiting the connection between DP and a certain f-divergence known as -divergence, we build upon (2) to obtain bounds for DP parameters of iterative processes. Specifically, we study the noisy stochastic gradient descent algorithm and obtain tighter bounds for its DP parameters than those provided currently in the literature [12, 15]. To do so, we obtain a closed-form expression for the contraction coefficient of Markov kernels under -divergence that generalizes the well-known Dobrushin’s theorem [16].

Our approach is inspired by the original work of Feldman et al. [12]. They adopted RDP as the measure of privacy and proved the following SDPI result [12, Theorem 1] for the Rényi divergence of order : For the iterative process described in (1) with the Gaussian distribution ,

Despite its tractability, RDP lacks a clear operational interpretation. As a result, RDP guarantees are usually translated to DP guarantees via a transformation which is known to be loose, see, e.g., [7, Proposition 3].

As a special case of iterative processes, we consider the noisy SGD algorithm with Laplacian or Gaussian perturbation. Our empirical analyses show that the DP parameters of noisy SGD obtained by our approach are smaller than that of [12, 15] (after applying the RDP to DP transformation). To capture common practice in machine learning applications, the input alphabet of the Markov kernels in this work are assumed to be compact. As a result, our analysis of contraction coefficients of such kernels is akin to the analysis of input-constrained channels performed by [17].

II. BACKGROUND

In this section, we briefly review privacy mechanisms, f-divergences and contraction coefficients. We also review a relation between DP and -divergence.

A. Privacy Mechanisms

The following examples describe two typical privacy mechanisms used in machine learning.

Example 1. (Private Queries) Let X be an arbitrary alphabet. A query is a function f that takes a sample and produces a response y in the space of responses Y. In this setting, a privacy mechanism K takes a response and produces another (random) response in the same space. In general, a privacy mechanism can be described by a Markov kernel , i.e., a channel with the same input and output space Y, where P(Y) denotes the set of probability measures over Y. Thus, the private query, say M, is a random variable satisfying .

Example 2. (Stochastic Optimization) Let Y denote a parameter space, e.g., the coefficients in a linear regression model. Given a dataset , typical stochastic optimization methods take an initial point and further refine it through a random optimization process. The latter process typically depends on the dataset D and can be encoded by a Markov kernel . Furthermore, in some cases it is of iterative form, e.g., stochastic gradient descent, and the kernel can be decomposed as . Here, the randomness of the initial point and the optimization process may provide some level of privacy.

input and producing an element in a given set Y as output. Furthermore, we assume that any privacy mechanism, say M, is a random variable satisfying

where the measure and the kernel may depend on D, i.e., and .

B. f-Divergence and Contraction Coefficients

Given a convex function with f(1) = 0, f-divergence between two probability measures and is defined in [18, 19] as

Let be a Markov kernel. Following the definition from Ahslwede and Gács [20], we define the contraction coefficient (or strong data processing coefficient) of K under f-divergence as

This quantity has been studied for several f-divergences, e.g., KL-divergence for which -divergence for which , and also total variation distance for which . In particular, Dobrushin [16] showed that

where denotes the total variation distance between and . It is worth noting that (4) has been extensively used in information theory [17, 21], statistics [22] and graph theory [23, 24].

Fig. 1. The schematic representation of the projected noisy stochastic gradient descent described algorithm in Algorithm 1. Given an arbitrary distribution on K and dataset -th iteration is encoded by a projected additive kernel . The output distribution of kernel

C. Differential Privacy and Eγ-Divergence

For an arbitrary alphabet X, let be the set of all datasets of size n. By definition, two datasets D and are neighboring, denoted as , if their Hamming distance is equal to one. Given a randomized mechanism M, we let be the distribution of M(D), the output of M with as the input. For and , a mechanism M is said to be -differentially

for every measurable set and neighboring datasets . When , we simply say that M is -DP. The definition of -DP given in (5) can be reformulated in terms of -divergence, also known as hockey-stick divergence [25–27]. Given -divergence between two probability distributions P and Q is defined as

where and denotes the information density between P and Q. The -divergence is in fact an f-divergence associated with and it also satisfies that . The next theorem provides a relation between this divergence and -DP.

Theorem 1 ([28, 29]). A mechanism M is -DP if and only if, for all ,

By relating DP to -divergence, this theorem enables us to invoke the SDPI relationship (2), specialized to -divergence, to obtain the DP parameters and of iterative processes. To do so, we first need to compute the contraction coefficient under -divergence, which is addressed in the next section.

III. CONTRACTION OF Eγ-DIVERGENCE

In this section we establish a closed-form expression for the contraction coefficient of kernels under -divergence that generalizes the Dobrushin’s theorem in (4). We then instantiate this expression to introduce a family of practically-appealing kernels with compact input alphabet. For ease of notation, we let .

Theorem 2. For any , we have

Note that the Dobrushin’s theorem [16] given in (4) corresponds to the special case of in Theorem 2. This theorem implies that in order to compute the contraction coefficient of a Markov kernel K under -divergence, one needs to compute -divergence between and for any . The following lemmas are useful for such task. For and v > 0, let L(m, v) denote the Laplace distribution with mean m and variance .

For and , let denote the multivariate Gaussian distribution with mean m and covariance matrix .

Lemma 2. For , we have

The previous lemma motivates the following definition.

Definition 1. For , we define by

where is any vector of unit norm.

With this definition at hand, we can write

It is worth pointing out that has a similar role as the function introduced by Feldman et al. in [12].

The additive Gaussian kernel is the kernel determined by for some . This kernel models the privacy mechanisms which map with . An application of Theorem 2 and Lemma 2 shows that, under -divergence, the contraction coefficient of the additive Gaussian kernel is trivial1, i.e., . A similar conclusion holds for the additive Laplace kernel2 which is determined by K(y) = L(y, v) for some v > 0.

Fortunately, the input and output of kernels appearing in applications tend to be bounded. Think, for example, of the kernel which models the update of the weights of a neural networks during its training. In this case, the weights are bounded either by design or by regularization mechanisms. Motivated by this observation, we say that a kernel is the projected additive Gaussian kernel if it models the mechanism which maps where is compact and convex, is the projection operator onto K and for some . Similarly, we say that a kernel is the projected additive Laplace kernel if it models the mechanism which maps where for some v > 0. These construction of kernels will be instrumental in the analysis of privacy guarantee of iterative processes in the next section.

IV. ANALYSIS OF ITERATIVE MECHANISMS VIA CONTRACTION COEFFICIENTS

In this section, we consider iterative processes that can be decomposed into projected additive kernels. This constraint allows us to analyze the evolution of such processes through the lens of contraction coefficients.

A. General Setting

Recall the stochastic optimization setting in Example 2, where is a parameter space and is a dataset. In this context, an iterative stochastic optimization method is fully characterized by a set of kernels with . The following lemma provides an upper bound for the f-divergence between the parameters returned when using two neighboring datasets.

Lemma 3. Let and be a family of kernels over Y. If and are neighbouring datasets with for some , then

where .

While the previous lemma follows from a routine application of the strong data processing inequality, it provides a natural framework to study the privacy guarantees of iterative optimization methods. In the following, we use it to study the privacy properties of the projected noisy stochastic gradient descent algorithm and some of its variations.

B. Projected Noisy Stochastic Gradient Descent

We now apply Lemma 3 to study the projected noisy stochastic gradient descent (PNSGD) algorithm under two different noise densities: Laplacian and Gaussian.

Assume that K is a compact and convex subset of and that is a loss function differentiable in its first argument. In the literature it is customary to assume regularity conditions on the loss function [12, 15]. We make the following assumptions on the loss function:

• is L-Lipschitz for all ,

• is -Lipschitz for all ,

• is -strongly convex for all .

For a given a dataset , the PNSGD algorithm starts from a given point (an arbitrary initial distribution) and then updates it according to the rule

where is the projection operator onto is the learning rate and is a collection of i.i.d. noise variables sampled from a distribution absolutely continuous w.r.t. the Lebesgue measure. The PNSGD algorithm is summarized in Algorithm 1.

For any , let . Notice that is encoded by the projected additive Laplacian (resp., Gaussian) kernel if Z is Laplacian (resp., Gaussian) noise variable. Given the dataset , one can therefore view the i-th iteration of the PNSGD algorithm as a projected kernel that models the mapping

where Z is the common distribution of . If are produced by the PNSGD algorithm with , then, for all , we have

This allows us to express the PNSGD algorithm as a concatenation of n channels, as illustrated in Fig. 1.

Before delving into the privacy analysis of PNSGD, it is important to pause and adapt the definition of differential privacy to PNSGD setting. We recall from [12] that a mechanism M is -DP for its ith input if for any pair of datasets D and differing on the i-th coordinate. Specializing Lemma 3 to -divergence, we can say that the PNSGD algorithm is -DP for its i-th input if

C. Laplacian Projected Noisy Stochastic Gradient Descent

Here, we consider the PNSGD algorithm with Laplacian noise; i.e., for some v > 0. The following theorem establishes the -DP property of such algorithm. For and given in Section IV-B, define

Theorem 3. Assume that K = [a, b] for some a < b. Then PNSGD algorithm with Laplace noise is -DP for its i-th input where and

Consequently, we have if .

Fig. 2. The privacy parameters of PNSGD with Gaussian noise having and loss function with parameter computed using both Theorem 4 and Balle et al. [15, Theorem 5] for i = 1, i = 20, and i = 39 in dataset of size n = 40. Other parameters are as follows:

The use of Laplacian noise to provide -DP for SGD algorithms was extensively studied, see e.g., [31–33]. Unlike previous results, Theorem 3 is the first result regarding the privacy guarantees of PNSGD with Laplacian noise in the distributed-oriented framework proposed by Feldman et al. [12]. It is worth pointing out that our approach seems conceptually simpler than the approaches employed in [12, 15].

D. Gaussian Projected Noisy Stochastic Gradient Descent

Next, we assume that the noise distribution in the PNSGD algorithm is Gaussian, i.e., .

Theorem 4. Let be a compact and convex set. The PNSGD algorithm with Gaussian noise is -DP for its i-th input where and

Compared to Laplacian, the Gaussian perturbation has a better utility in high-dimensional setting, as illustrated in [31]. Hence, it has extensively appeared in DP literature as a de facto mechanism for providing privacy guarantees in training deep learning models [5]. Gaussian distribution is, in particular, appealing in the case of RDP as the Rényi divergence between two Gaussian distributions has a simple form (as opposed to -divergence). This intuition, among others, led Feldman et al. [12] and Balle et al. [15] to adopt RDP to examine the PNSGD algorithm with Gaussian noise in the framework of privacy amplification by iteration. While the former studied the problem for cases where M = 1 (i.e., ), the latter assumed M < 1 (i.e., ) and derived strictly better bounds for RDP guarantees. In fact, [15, Theorem 5] reduces to [12, Theorem 23] when . We wish to compare Theorem 4 with these results with or without strong convexity. To do so, we first need to convert the RDP guarantee given in [15, Theorem 5] to -DP. This conversion is a standard practice in DP literature and follows from an straightforward application of [7, Proposition 3].

where if and if i = n.

Note that in Theorem 4 is given in terms of Q function and hence it is challenging to analytically compare with . Nevertheless, we provide several numerical comparisons. In Fig. 2, we compare in Theorem 4 with in Proposition 1 for the first (i = 1), middle (i = 20) and the second to last (i = 39) individuals in a dataset of size n = 40 and with the assumption that the loss function is not strictly convex (i.e., ). As clearly seen, our approach outperfoms [12] especially for the individuals whose records were processed later in the algorithm.

In Fig. 3, we focus on the effect of strong convexity parameter on the privacy guarantee. We again depict and for the second half of the dataset: i = 20, i = 30, and i = 39 in a dataset of size n = 40 and . Here, we assume that the loss function is strictly convex with parameter . As observed in this case, Theorem 4 provides better privacy in the high privacy region (i.e., small ) as well as for the individuals who appear later in the dataset for all privacy region.

E. Randomly Stopped PNSGD Algorithm

We end this section by pointing out a potential shortcoming of Theorems 3 and 4 (and in general the privacy amplification by iteration framework): different individuals participating in the dataset experience different privacy guarantees; that is, individuals

Fig. 3. The privacy parameters of PNSGD with Gaussian noise having and strongly convex loss function (), computed using both Theorem 4 and Balle et al. [15, Theorem 5] for i = 20, i = 30, and i = 39 in a dataset of size n = 40. Other parameters are as follows:

whose records were processed earlier experience higher privacy guarantee. This may not be justified in practice. To address this issue, we follow [12] to consider the random stopping for the PNSGD algorithm: namely, instead of iterating for n steps, we pick a random time T uniformly on [n], stop the algorithm after T steps and then output . The following theorem illuminates that such algorithm in fact uniformizes the privacy guarantee among all individuals.

Theorem 5. Let be a compact and convex set. The randomly-stopped PNSGD algorithm with Gaussian noise is -DP with and

The randomly stopped PNSGD was first proposed by Feldman et al. [12] where they derived its RDP guarantee in [12, Theorem 26] only if satisfies a certain constraint. This constraint is due to the non-convexity of the map . In contrast, since is jointly convex (as for any other f-divergences), Theorem 5 holds for any .

Another approach to address the non-uniformity of privacy guarantees is to permute the dataset first, via a random permutation and then feed it to the PNSGD algorithm. We will examine this approach in our future work.

REFERENCES

[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Proc. Theory of Cryptography (TCC), Berlin, Heidelberg, 2006, pp. 265–284.

[2] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, “Our data, ourselves: Privacy via distributed noise generation,” in EUROCRYPT, S. Vaudenay, Ed., 2006, pp. 486–503.

[3] L. Wasserman and S. Zhou, “A statistical framework for differential privacy,” Journal of the American Statistical Association, vol. 105, no. 489, pp. 375–389, 2010.

[4] P. Kairouz, S. Oh, and P. Viswanath, “The composition theorem for differential privacy,” IEEE Trans. Inf. Theory, vol. 63, no. 6, pp. 4037–4049, June 2017.

[5] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proc. ACM SIGSAC CCS, 2016, pp. 308–318.

[6] B. Balle, G. Barthe, and M. Gaboardi, “Privacy amplification by subsampling: Tight analyses via couplings and divergences,” in NeurIPS, 2018, pp. 6280–6290.

[7] I. Mironov, “Rényi differential privacy,” in Proc. Computer Security Found. (CSF), 2017, pp. 263–275.

[8] C. Dwork and G. N. Rothblum, “Concentrated differential privacy,” vol. abs/1603.01887, 2016. [Online]. Available: http://arxiv.org/abs/1603.01887

[9] M. Bun and T. Steinke, “Concentrated differential privacy: Simplifications, extensions, and lower bounds,” in Theory of Cryptography, 2016, pp. 635–658.

[10] M. Bun, C. Dwork, G. N. Rothblum, and T. Steinke, “Composable and versatile privacy via truncated cdp,” in STOC, 2018, pp. 74–86.

[11] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, “What can we learn privately?” SIAM J. Comput., vol. 40, no. 3, pp. 793–826, Jun. 2011.

[12] V. Feldman, I. Mironov, K. Talwar, and A. Thakurta, “Privacy amplification by iteration,” FOCS, pp. 521–532, 2018.

[13] M. Raginsky, “Strong data processing inequalities and-sobolev inequalities for discrete channels,” IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3355–3389, June 2016.

[14] A. Makur and L. Zheng, “Bounds between contraction coefficients,” 2018. [Online]. Available: http://arxiv.org/abs/1510.01844

[15] B. Balle, G. Barthe, M. Gaboardi, and J. Geumlek, “Privacy amplification by mixing and diffusion mechanisms,” in NeurIPS, 2019, pp. 13 277–13 287.

[16] R. L. Dobrushin, “Central limit theorem for nonstationary markov chains. I,” Theory Probab. Appl., vol. 1, no. 1, pp. 65–80, 1956.

[17] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 35–55, Jan 2016.

[18] S. M. Ali and S. D. Silvey, “A general class of coefficients of divergence of one distribution from another,” Journal of Royal Statistics, vol. 28, pp. 131–142, 1966.

[19] I. Csiszár, “Information-type measures of difference of probability distributions and indirect observations,” Studia Sci. Math. Hungar., vol. 2, pp. 299–318, 1967.

[20] R. Ahlswede and P. Gács, “Spreading of sets in product spaces and hypercontraction of the markov operator,” Ann. Probab., vol. 4, no. 6, pp. 925–939, 12 1976.

[21] I. Sason and S. Verdú, “f -divergence inequalities,” IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 5973–6006, Nov 2016.

[22] A. Kontorovich and M. Raginsky, “Concentration of measure without independence: A unified approach via the martingale method,” in Convexity and Concentration. Springer New York, 2017, pp. 183–210.

[23] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov chains and mixing times. American Mathematical Society, 2006.

[24] P. M. del Moral, M. Ledoux, and L. Miclo, “On contraction properties of markov kernels,” Probability Theory and Related Fields, vol. 126, pp. 395–420, 2003.

[25] Y. Polyanskiy, H. V. Poor, and S. Verdu, “Channel coding rate in the finite blocklength regime,” IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.

[26] N. Sharma and N. A. Warsi, “Fundamental bound on the reliability of quantum information transmission,” CoRR, vol. abs/1302.5281, 2013. [Online]. Available: http://arxiv.org/abs/1302.5281

[27] I. Csiszár and P. C. Shields, “Information theory and statistics: A tutorial,” Commun. Inf. Theory, vol. 1, no. 4, pp. 417–528, Dec. 2004.

[28] G. Barthe and F. Olmedo, “Beyond differential privacy: Composition theorems and relational logic for f-divergences between probabilistic programs,” in Proc. ICALP, 2013, pp. 49–60.

[29] B. Balle and Y.-X. Wang, “Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising,” in ICML, vol. 80, 10–15 July 2018, pp. 394–403.

[30] J. Cohen, J. Kemperman, and G. Zb˘aganu, Comparisons of Stochastic Matrices, with Applications in Information Theory, Statistics, Economics, and Population Sciences. Birkhäuser, 1998.

[31] X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. Naughton, “Bolt-on differential privacy for scalable stochastic gradient descent- based analytics,” in SIGMOD, 2017, pp. 1307–1322.

[32] R. Bassily, A. Smith, and A. Thakurta, “Private empirical risk minimization, revisited,” in ICML 2014 Workshop on Learning, Security and Privacy, Beijing, China, 25 Jun 2014.

[33] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentially private empirical risk minimization,” Journal of Machine Learning Research, vol. 12, no. Mar, pp. 1069–1109, 2011.

[34] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, 1st ed. Springer Publishing Company, Incorporated, 2014.

[35] S. Bubeck, Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 2015, vol. 8, no. 3-4.

APPENDIX

Proof of Theorem 2. Balle et al. [15] recently showed that, for ,

where . Now we show that the above inequality is indeed an equality. Let be such that . We define and where and . In particular,

It is straightforward to verify that and . Hence, by (6),

Therefore, we obtain that

Since and are arbitrary, we conclude that the reversed version of inequality (17) holds true and hence

with . It is easy to verify that for fixed and is continuous and decreasing. Hence, by taking the limit as in (18), the result follows.

Proof of Lemma 1. For ease of notation, let for i = 1, 2. It follows from the definition that

where (assuming ). Clearly, the above integral is non-zero only if . With this in mind, we can write

The case is similar. In general, we can write

Proof of Lemma 2. Recall that with i = 1, 2. A direct computation shows that

Thus, if we let , then

Therefore, by expression for given in (7),

as desired.

Proof of Theorem 3. Let and be two neighbouring datasets with . By Theorem 1, it is enough to show that

where is the initial distribution of PNSGD and models the update rule in (12). Following (13), we obtain that

where is a projected additive Laplacian kernel, and

We begin by recalling Lemma 1 and the data processing inequality to bound as

To refine the above bound for , we resort to the following standard result in convex optimization, see, e.g., [34] or [35, Theorem 3.12]. Notice that we say a function is -smooth if is -Lipschitz.

Lemma 4. Let be a convex set and suppose is -smooth and -strongly convex. If , then the map is M-Lipschitz on K with .

On the other hand, we use Jensen’s inequality to compute :

where the first inequality is due to the Jensen’s inequality (note that is jointly convex for all as for any other f-divergences) and the last inequality comes from the fact that

where we use the fact that is L-Lipschitz for all . Plugging (20) and (21) into (19), we conclude the proof.

Proof of Theorem 4. Let and be two neighbouring datasets with . By Theorem 1, it is enough to show that

where is the initial distribution of PNSGD and models the update rule in (12). Following (13), we obtain that

where is a projected additive Gaussian kernel, and We begin by recalling Lemma 2 and data processing inequality to bound , for , as

where the first inequality is due to the Jensen’s inequality (note that is jointly convex for all as for any other f-divergences) and the last inequality and the second inequality stems from the following

Plugging (23) and (24) into (22), we conclude the proof.

Proof of Proposition 1. First note that both [12, Theorem 23] and [15, Theorem 5] quantifies the RDP guarantee of PNSGD. However, as the latter strictly improves the former under the assumption of strong convexity, we only consider the former and convert it to DP guarantee. Given , a mechanism M is said to be -RDP [7] if for all where is the output distribution of M running on the dataset D. As before, this definition can be adapted as follows: M is said to be -RDP for its i-th input if the above holds for dataset D and differing in the i-th coordinate. It is shown in [7, Proposition 3] and [5, Theorem 2] that

for and for i = n, where M is defined in Lemma 4. We use (25) to obtain the best DP parameter: PNSGD is -DP for

Solving this minimization problem, we obtain that the minimizer is and the minimum value is

Notice that since , we must have .

Proof of Theorem 5. Let T be a uniform random variable on the set [n]. We assume that the PNSGD algorithm stops at the random time T. Recall that we are given two dataset and with for all except j = i. Let and be the output distribution of the randomly-stopped PNSGD algorithm running on D and , respectively. We can write

and

Hence, the convexity of and Jensen’s inequality imply that

Finally, by applying Theorem 4,

as we wanted to prove.