b

DiscoverSearch
About
My stuff
Privacy Amplification of Iterative Algorithms via Contraction Coefficients
2020·arXiv
Abstract
Abstract

We investigate the framework of privacy amplification by iteration, recently proposed by Feldman et al., from an informationtheoretic lens. We demonstrate that differential privacy guarantees of iterative mappings can be determined by a direct application of contraction coefficients derived from strong data processing inequalities for f-divergences. In particular, by generalizing the Dobrushin’s contraction coefficient for total variation distance to an f-divergence known as  Eγ-divergence, we derive tighter bounds on the differential privacy parameters of the projected noisy stochastic gradient descent algorithm with hidden intermediate updates.

Differential privacy (DP) [1, 2] has become the standard definition for designing privacy-preserving machine learning algorithms. One reason for its success is its operational significance, which can be best described in terms of binary hypothesis testing (see, e.g., [3, 4]). Nevertheless, it is often difficult to compute DP guarantees for applications where a high number of data accesses is needed for a single analysis [5, 6]. To obtain the DP parameters in such applications, which include machine learning models trained using stochastic gradient descent (SGD), one needs to resort to composition theorems which are often loose due to their generality. As a remedy, several variants of DP have been recently proposed [710] based on Rényi divergence. These variants enjoy better composition properties. Among these variants, Rényi DP (RDP) has proven to be effective in studying private deep learning algorithms [5] especially when paired with sub-sampling techniques [11].

Recently, the new framework of privacy amplification by iteration was proposed by Feldman et al. [12] as an alternative to privacy amplification by sub-sampling. This framework possesses several advantages which makes it well-suited for determining and enforcing privacy in distributed settings where data samples are stored locally by each user. Existing private algorithms based on sub-sampling require hiding the set of users participating in each update step of the model. This requirement, however, dictates either all data samples be stored centrally (i.e., no distributed setting) or all-to-all communication (i.e., excessive communication complexity). The new framework of privacy amplification by iteration relaxes these issues; it does not require the order of participating users to be random or hidden. On the other hand, it requires that all intermediate updates be hidden until a certain number of update steps are applied (e.g., not disclosing model update of SGD before a pre-specified step, say, n-th step).

Since the intermediate updates are assumed to be hidden, one can view an iterative process as a concatenation of channels. To see this, let  {ψt}nt=1be a sequence of mapping and the update rule be given by

image

where  Y0 = y0 ∈ Rdand  {Zt}nt=1are i.i.d. copies of a noise distribution  PZ. Let  {Y ′t }nt=1be the output of the same process started at  Y ′0 = y′0 ∈ Rd. Letting  µtand  νtbe the distributions of  Ytand  Y ′t, the strong data processing inequality (SDPI) for f-divergences (see, e.g., [13, 14]) implies that

image

where  Dfis an f-divergence and  ηf(Kt)is the contraction coefficient (also known as strong data processing constant) of the Markov kernel  Kt(y) := PYt|Yt−1=y = PZ+ψt(y)under f-divergence (see Section III for details). By exploiting the connection between DP and a certain f-divergence known as  Eγ-divergence, we build upon (2) to obtain bounds for DP parameters of iterative processes. Specifically, we study the noisy stochastic gradient descent algorithm and obtain tighter bounds for its DP parameters than those provided currently in the literature [12, 15]. To do so, we obtain a closed-form expression for the contraction coefficient of Markov kernels under  Eγ-divergence that generalizes the well-known Dobrushin’s theorem [16].

Our approach is inspired by the original work of Feldman et al. [12]. They adopted RDP as the measure of privacy and proved the following SDPI result [12, Theorem 1] for the Rényi divergence of order  α > 1: For the iterative process described in (1) with  PZthe Gaussian distribution  N(0, σ2Id),

image

Despite its tractability, RDP lacks a clear operational interpretation. As a result, RDP guarantees are usually translated to DP guarantees via a transformation which is known to be loose, see, e.g., [7, Proposition 3].

As a special case of iterative processes, we consider the noisy SGD algorithm with Laplacian or Gaussian perturbation. Our empirical analyses show that the DP parameters of noisy SGD obtained by our approach are smaller than that of [12, 15] (after applying the RDP to DP transformation). To capture common practice in machine learning applications, the input alphabet of the Markov kernels in this work are assumed to be compact. As a result, our analysis of contraction coefficients of such kernels is akin to the analysis of input-constrained channels performed by [17].

In this section, we briefly review privacy mechanisms, f-divergences and contraction coefficients. We also review a relation between DP and  Eγ-divergence.

A. Privacy Mechanisms

The following examples describe two typical privacy mechanisms used in machine learning.

Example 1. (Private Queries) Let X be an arbitrary alphabet. A query is a function f that takes a sample  D ∈ X nand produces a response y in the space of responses Y. In this setting, a privacy mechanism K takes a response  y ∈ Yand produces another (random) response in the same space. In general, a privacy mechanism can be described by a Markov kernel  K : Y → P(Y), i.e., a channel with the same input and output space Y, where P(Y) denotes the set of probability measures over Y. Thus, the private query, say M, is a random variable satisfying  M(D) ∼ K(f(D)).

Example 2. (Stochastic Optimization) Let Y denote a parameter space, e.g., the coefficients in a linear regression model. Given a dataset  D = {x1, . . . , xn} ∈ X n, typical stochastic optimization methods take an initial point  Y0 ∼ µ0 ∈ P(Y)and further refine it through a random optimization process. The latter process typically depends on the dataset D and can be encoded by a Markov kernel  KD : Y → P(Y). Furthermore, in some cases it is of iterative form, e.g., stochastic gradient descent, and the kernel  KDcan be decomposed as  KD = Kx1 · · · Kxn. Here, the randomness of the initial point and the optimization process may provide some level of privacy.

image

input and producing an element in a given set Y as output. Furthermore, we assume that any privacy mechanism, say M, is a random variable satisfying

image

where the measure  µ0 ∈ P(Y)and the kernel  K : Y → P(Y)may depend on D, i.e.,  µ0 = (µ0)Dand  K = KD.

B. f-Divergence and Contraction Coefficients

Given a convex function  f : (0, ∞) → Rwith f(1) = 0, f-divergence between two probability measures  µand  νis defined in [18, 19] as

image

Let  K : Y �→ P(Y)be a Markov kernel. Following the definition from Ahslwede and Gács [20], we define the contraction coefficient (or strong data processing coefficient) of K under f-divergence as

image

This quantity has been studied for several f-divergences, e.g., KL-divergence for which  f(t) = t log(t), χ2-divergence for which  f(t) = (t − 1)2, and also total variation distance for which  f(t) = 12|t − 1|. In particular, Dobrushin [16] showed that

image

where  TV(µ, ν)denotes the total variation distance between  µand  ν. It is worth noting that (4) has been extensively used in information theory [17, 21], statistics [22] and graph theory [23, 24].

image

Fig. 1. The schematic representation of the projected noisy stochastic gradient descent described algorithm in Algorithm 1. Given  µ0an arbitrary distribution on K and dataset  D = {x1, . . . , xn}, the i-th iteration is encoded by a projected additive kernel  Kxi given byy �→ ΠK(ψxi(y) + ηZi) where ψxi(y) = y − η∇yℓ(y, xi). The output distribution of kernel  Kxi is µi = µ0Kx1 . . . Kxi−1.

C. Differential Privacy and Eγ-Divergence

For an arbitrary alphabet X, let  X nbe the set of all datasets of size n. By definition, two datasets D and  D′are neighboring, denoted as  D ∼ D′, if their Hamming distance is equal to one. Given a randomized mechanism M, we let  PDbe the distribution of M(D), the output of M with  D ∈ X nas the input. For  ε ≥ 0and  δ ∈ [0, 1], a mechanism M is said to be  (ε, δ)-differentially

image

for every measurable set  A ⊂ Yand neighboring datasets  D ∼ D′. When  δ = 0, we simply say that M is  ε-DP. The definition of  (ε, δ)-DP given in (5) can be reformulated in terms of  Eγ-divergence, also known as hockey-stick divergence [2527]. Given  γ ≥ 1, Eγ-divergence between two probability distributions P and Q is defined as

image

where  [b]+ = max{0, b}and  ıP ∥Q(t) := log dPdQ(t)denotes the information density between P and Q. The  Eγ-divergence is in fact an f-divergence associated with  f(t) = (t − γ)+and it also satisfies that  E1(P∥Q) = TV(P, Q). The next theorem provides a relation between this divergence and  (ε, δ)-DP.

Theorem 1 ([28, 29]). A mechanism M is  (ε, δ)-DP if and only if, for all  D ∼ D′,

image

By relating DP to  Eγ-divergence, this theorem enables us to invoke the SDPI relationship (2), specialized to  Eγ-divergence, to obtain the DP parameters  εand  δof iterative processes. To do so, we first need to compute the contraction coefficient under Eγ-divergence, which is addressed in the next section.

In this section we establish a closed-form expression for the contraction coefficient of kernels under  Eγ-divergence that generalizes the Dobrushin’s theorem in (4). We then instantiate this expression to introduce a family of practically-appealing kernels with compact input alphabet. For ease of notation, we let  ηγ(K) := ηEγ(K).

Theorem 2. For any  γ ≥ 1, we have

image

Note that the Dobrushin’s theorem [16] given in (4) corresponds to the special case of  γ = 1in Theorem 2. This theorem implies that in order to compute the contraction coefficient of a Markov kernel K under  Eγ-divergence, one needs to compute Eγ-divergence between  K(y1)and  K(y2)for any  y1, y2 ∈ Y. The following lemmas are useful for such task. For  m ∈ Rand v > 0, let L(m, v) denote the Laplace distribution with mean m and variance  2v2.

image

For  m ∈ Rdand  σ > 0, let  N(m, σ2Id)denote the multivariate Gaussian distribution with mean m and covariance matrix σ2Id.

Lemma 2. For  Ni = N(mi, σ2Id), i = 1, 2, we have

image

The previous lemma motivates the following definition.

Definition 1. For  γ ≥ 1, we define  θγ : [0, ∞) → [0, 1]by

image

where  u ∈ Rdis any vector of unit norm.

With this definition at hand, we can write

image

It is worth pointing out that  θγhas a similar role as the function  Rαintroduced by Feldman et al. in [12].

The additive Gaussian kernel  K : Rd → Rdis the kernel determined by  K(y) = N(y, σ2Id)for some  σ > 0. This kernel models the privacy mechanisms which map  y �→ y + Zwith  Z ∼ N(0, σ2Id). An application of Theorem 2 and Lemma 2 shows that, under  Eγ-divergence, the contraction coefficient of the additive Gaussian kernel is trivial1, i.e.,  ηγ(K) = 1. A similar conclusion holds for the additive Laplace kernel2 which is determined by K(y) = L(y, v) for some v > 0.

Fortunately, the input and output of kernels appearing in applications tend to be bounded. Think, for example, of the kernel which models the update of the weights of a neural networks during its training. In this case, the weights are bounded either by design or by regularization mechanisms. Motivated by this observation, we say that a kernel  K : K → Kis the projected additive Gaussian kernel if it models the mechanism which maps  y �→ ΠK(y + Z)where  K ⊂ Rdis compact and convex, ΠKis the projection operator onto K and  Z ∼ N(0, σ2Id)for some  σ > 0. Similarly, we say that a kernel  K : K → Kis the projected additive Laplace kernel if it models the mechanism which maps  y �→ ΠK(y + L)where  L ∼ L(0, v)for some v > 0. These construction of kernels will be instrumental in the analysis of privacy guarantee of iterative processes in the next section.

In this section, we consider iterative processes that can be decomposed into projected additive kernels. This constraint allows us to analyze the evolution of such processes through the lens of contraction coefficients.

A. General Setting

Recall the stochastic optimization setting in Example 2, where  Y ⊂ Rdis a parameter space and  D = {x1, . . . , xn}is a dataset. In this context, an iterative stochastic optimization method is fully characterized by a set of kernels  {Kx : x ∈ X}with  Kx : Y → P(Y). The following lemma provides an upper bound for the f-divergence between the parameters returned when using two neighboring datasets.

Lemma 3. Let  µ0 ∈ P(Y)and  {Kx : x ∈ X}be a family of kernels over Y. If  D = {x1, . . . , xn}and  D′ = {x′1, . . . , x′n}are neighbouring datasets with  xi ̸= x′ifor some  i ∈ [n], then

image

where  µi−1 := µ0Kx1 · · · Kxi−1.

While the previous lemma follows from a routine application of the strong data processing inequality, it provides a natural framework to study the privacy guarantees of iterative optimization methods. In the following, we use it to study the privacy properties of the projected noisy stochastic gradient descent algorithm and some of its variations.

B. Projected Noisy Stochastic Gradient Descent

We now apply Lemma 3 to study the projected noisy stochastic gradient descent (PNSGD) algorithm under two different noise densities: Laplacian and Gaussian.

Assume that K is a compact and convex subset of  Rdand that  ℓ : K × X → R+is a loss function differentiable in its first argument. In the literature it is customary to assume regularity conditions on the loss function [12, 15]. We make the following assumptions on the loss function:

 y �→ ℓ(y, x)is L-Lipschitz for all  x ∈ X,

 y �→ ∇yℓ(y, x)is  β-Lipschitz for all  x ∈ X,

 y �→ ℓ(y, x)is  ρ-strongly convex for all  x ∈ X.

For a given a dataset  D = {x1, . . . , xn}, the PNSGD algorithm starts from a given point  Y0 ∼ µ0 ∈ P(K)(an arbitrary initial distribution) and then updates it according to the rule

image

where  ΠK : Rd → Kis the projection operator onto  K, η > 0is the learning rate and  {Zt}nt=1is a collection of i.i.d. noise variables sampled from a distribution absolutely continuous w.r.t. the Lebesgue measure. The PNSGD algorithm is summarized in Algorithm 1.

image

For any  x ∈ X, let  ψx(y) := y − η∇yℓ(y, x). Notice that  y �→ ΠK(ψx(y) + ηZ)is encoded by the projected additive Laplacian (resp., Gaussian) kernel if Z is Laplacian (resp., Gaussian) noise variable. Given the dataset  D = {x1, . . . , xn}, one can therefore view the i-th iteration of the PNSGD algorithm as a projected kernel  Kxi : K → P(K)that models the mapping

image

where Z is the common distribution of  {Zt}nt=1. If  Y1, . . . , Ynare produced by the PNSGD algorithm with  Y0 ∼ µ0, then, for all  t ∈ [n], we have

image

This allows us to express the PNSGD algorithm as a concatenation of n channels, as illustrated in Fig. 1.

Before delving into the privacy analysis of PNSGD, it is important to pause and adapt the definition of differential privacy to PNSGD setting. We recall from [12] that a mechanism M is  (ε, δ)-DP for its ith input if  Eeε(PD∥PD′) ≤ δfor any pair of datasets D and  D′differing on the i-th coordinate. Specializing Lemma 3 to  Eγ-divergence, we can say that the PNSGD algorithm is  (ε, δ)-DP for its i-th input if

image

C. Laplacian Projected Noisy Stochastic Gradient Descent

Here, we consider the PNSGD algorithm with Laplacian noise; i.e.,  Z ∼ L(0, v)for some v > 0. The following theorem establishes the  ε-DP property of such algorithm. For  L, β,and  ρgiven in Section IV-B, define

image

Theorem 3. Assume that K = [a, b] for some a < b. Then PNSGD algorithm with Laplace noise is  (ε, δ)-DP for its i-th input where  ε ≥ 0and

image

Consequently, we have  δ = 0if  ε ≥ min{ 2Lv , M(b−a)ηv }.

image

Fig. 2. The privacy parameters  ε and δof PNSGD with Gaussian noise having  σ = 2and loss function with parameter  L = 1 and β = 0.5,computed using both Theorem 4 and Balle et al. [15, Theorem 5] for i = 1, i = 20, and i = 39 in dataset of size n = 40. Other parameters are as follows:  η = 0.5, ρ = 0, and DK = 1.

The use of Laplacian noise to provide  ε-DP for SGD algorithms was extensively studied, see e.g., [3133]. Unlike previous results, Theorem 3 is the first result regarding the privacy guarantees of PNSGD with Laplacian noise in the distributed-oriented framework proposed by Feldman et al. [12]. It is worth pointing out that our approach seems conceptually simpler than the approaches employed in [12, 15].

D. Gaussian Projected Noisy Stochastic Gradient Descent

Next, we assume that the noise distribution in the PNSGD algorithm is Gaussian, i.e.,  Z ∼ N(0, σ2Id).

Theorem 4. Let  K ⊂ Rdbe a compact and convex set. The PNSGD algorithm with Gaussian noise is  (ε, δ)-DP for its i-th input where  ε ≥ 0and

image

Compared to Laplacian, the Gaussian perturbation has a better utility in high-dimensional setting, as illustrated in [31]. Hence, it has extensively appeared in DP literature as a de facto mechanism for providing privacy guarantees in training deep learning models [5]. Gaussian distribution is, in particular, appealing in the case of RDP as the Rényi divergence between two Gaussian distributions has a simple form (as opposed to  Eγ-divergence). This intuition, among others, led Feldman et al. [12] and Balle et al. [15] to adopt RDP to examine the PNSGD algorithm with Gaussian noise in the framework of privacy amplification by iteration. While the former studied the problem for cases where M = 1 (i.e.,  ρ = 0), the latter assumed M < 1 (i.e.,  ρ > 0) and derived strictly better bounds for RDP guarantees. In fact, [15, Theorem 5] reduces to [12, Theorem 23] when  ρ = 0. We wish to compare Theorem 4 with these results with or without strong convexity. To do so, we first need to convert the RDP guarantee given in [15, Theorem 5] to  (ε, δ)-DP. This conversion is a standard practice in DP literature and follows from an straightforward application of [7, Proposition 3].

image

where  κ = 2L2(n−i)σ2 M (n−i+1)if  i ∈ [n − 1]and  κ = 2 L2σ2if i = n.

Note that  δin Theorem 4 is given in terms of Q function and hence it is challenging to analytically compare  δwith ˜δ. Nevertheless, we provide several numerical comparisons. In Fig. 2, we compare  δin Theorem 4 with ˜δin Proposition 1 for the first (i = 1), middle (i = 20) and the second to last (i = 39) individuals in a dataset of size n = 40 and  σ = 2with the assumption that the loss function is not strictly convex (i.e.,  ρ = 0). As clearly seen, our approach outperfoms [12] especially for the individuals whose records were processed later in the algorithm.

In Fig. 3, we focus on the effect of strong convexity parameter  ρon the privacy guarantee. We again depict  δand ˜δfor the second half of the dataset: i = 20, i = 30, and i = 39 in a dataset of size n = 40 and  σ = 1. Here, we assume that the loss function is strictly convex with parameter  ρ = 0.4. As observed in this case, Theorem 4 provides better privacy in the high privacy region (i.e., small  ε) as well as for the individuals who appear later in the dataset for all privacy region.

E. Randomly Stopped PNSGD Algorithm

We end this section by pointing out a potential shortcoming of Theorems 3 and 4 (and in general the privacy amplification by iteration framework): different individuals participating in the dataset experience different privacy guarantees; that is, individuals

image

Fig. 3. The privacy parameters  ε and δof PNSGD with Gaussian noise having  σ = 1and strongly convex loss function (ρ = 0.4), computed using both Theorem 4 and Balle et al. [15, Theorem 5] for i = 20, i = 30, and i = 39 in a dataset of size n = 40. Other parameters are as follows:  η = 0.7, L = 1, β = 0.3, and DK = 1.

whose records were processed earlier experience higher privacy guarantee. This may not be justified in practice. To address this issue, we follow [12] to consider the random stopping for the PNSGD algorithm: namely, instead of iterating for n steps, we pick a random time T uniformly on [n], stop the algorithm after T steps and then output  YT. The following theorem illuminates that such algorithm in fact uniformizes the privacy guarantee among all individuals.

Theorem 5. Let  K ⊂ Rdbe a compact and convex set. The randomly-stopped PNSGD algorithm with Gaussian noise is (ε, δ)-DP with  ε ≥ 0and

image

The randomly stopped PNSGD was first proposed by Feldman et al. [12] where they derived its RDP guarantee in [12, Theorem 26] only if  σsatisfies a certain constraint. This constraint is due to the non-convexity of the map  (µ, ν) �→ Dα(µ∥ν). In contrast, since  (µ, ν) �→ Eγ(µ∥ν)is jointly convex (as for any other f-divergences), Theorem 5 holds for any  σ.

Another approach to address the non-uniformity of privacy guarantees is to permute the dataset first, via a random permutation and then feed it to the PNSGD algorithm. We will examine this approach in our future work.

[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Proc. Theory of Cryptography (TCC), Berlin, Heidelberg, 2006, pp. 265–284.

[2] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, “Our data, ourselves: Privacy via distributed noise generation,” in EUROCRYPT, S. Vaudenay, Ed., 2006, pp. 486–503.

[3] L. Wasserman and S. Zhou, “A statistical framework for differential privacy,” Journal of the American Statistical Association, vol. 105, no. 489, pp. 375–389, 2010.

[4] P. Kairouz, S. Oh, and P. Viswanath, “The composition theorem for differential privacy,” IEEE Trans. Inf. Theory, vol. 63, no. 6, pp. 4037–4049, June 2017.

[5] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proc. ACM SIGSAC CCS, 2016, pp. 308–318.

[6] B. Balle, G. Barthe, and M. Gaboardi, “Privacy amplification by subsampling: Tight analyses via couplings and divergences,” in NeurIPS, 2018, pp. 6280–6290.

[7] I. Mironov, “Rényi differential privacy,” in Proc. Computer Security Found. (CSF), 2017, pp. 263–275.

[8] C. Dwork and G. N. Rothblum, “Concentrated differential privacy,” vol. abs/1603.01887, 2016. [Online]. Available: http://arxiv.org/abs/1603.01887

[9] M. Bun and T. Steinke, “Concentrated differential privacy: Simplifications, extensions, and lower bounds,” in Theory of Cryptography, 2016, pp. 635–658.

[10] M. Bun, C. Dwork, G. N. Rothblum, and T. Steinke, “Composable and versatile privacy via truncated cdp,” in STOC, 2018, pp. 74–86.

[11] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, “What can we learn privately?” SIAM J. Comput., vol. 40, no. 3, pp. 793–826, Jun. 2011.

[12] V. Feldman, I. Mironov, K. Talwar, and A. Thakurta, “Privacy amplification by iteration,” FOCS, pp. 521–532, 2018.

[13] M. Raginsky, “Strong data processing inequalities andφ-sobolev inequalities for discrete channels,” IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3355–3389, June 2016.

[14] A. Makur and L. Zheng, “Bounds between contraction coefficients,” 2018. [Online]. Available: http://arxiv.org/abs/1510.01844

[15] B. Balle, G. Barthe, M. Gaboardi, and J. Geumlek, “Privacy amplification by mixing and diffusion mechanisms,” in NeurIPS, 2019, pp. 13 277–13 287.

[16] R. L. Dobrushin, “Central limit theorem for nonstationary markov chains. I,” Theory Probab. Appl., vol. 1, no. 1, pp. 65–80, 1956.

[17] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 35–55, Jan 2016.

[18] S. M. Ali and S. D. Silvey, “A general class of coefficients of divergence of one distribution from another,” Journal of Royal Statistics, vol. 28, pp. 131–142, 1966.

[19] I. Csiszár, “Information-type measures of difference of probability distributions and indirect observations,” Studia Sci. Math. Hungar., vol. 2, pp. 299–318, 1967.

[20] R. Ahlswede and P. Gács, “Spreading of sets in product spaces and hypercontraction of the markov operator,” Ann. Probab., vol. 4, no. 6, pp. 925–939, 12 1976.

[21] I. Sason and S. Verdú, “f -divergence inequalities,” IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 5973–6006, Nov 2016.

[22] A. Kontorovich and M. Raginsky, “Concentration of measure without independence: A unified approach via the martingale method,” in Convexity and Concentration. Springer New York, 2017, pp. 183–210.

[23] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov chains and mixing times. American Mathematical Society, 2006.

[24] P. M. del Moral, M. Ledoux, and L. Miclo, “On contraction properties of markov kernels,” Probability Theory and Related Fields, vol. 126, pp. 395–420, 2003.

[25] Y. Polyanskiy, H. V. Poor, and S. Verdu, “Channel coding rate in the finite blocklength regime,” IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.

[26] N. Sharma and N. A. Warsi, “Fundamental bound on the reliability of quantum information transmission,” CoRR, vol. abs/1302.5281, 2013. [Online]. Available: http://arxiv.org/abs/1302.5281

[27] I. Csiszár and P. C. Shields, “Information theory and statistics: A tutorial,” Commun. Inf. Theory, vol. 1, no. 4, pp. 417–528, Dec. 2004.

[28] G. Barthe and F. Olmedo, “Beyond differential privacy: Composition theorems and relational logic for f-divergences between probabilistic programs,” in Proc. ICALP, 2013, pp. 49–60.

[29] B. Balle and Y.-X. Wang, “Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising,” in ICML, vol. 80, 10–15 July 2018, pp. 394–403.

[30] J. Cohen, J. Kemperman, and G. Zb˘aganu, Comparisons of Stochastic Matrices, with Applications in Information Theory, Statistics, Economics, and Population Sciences. Birkhäuser, 1998.

[31] X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. Naughton, “Bolt-on differential privacy for scalable stochastic gradient descent- based analytics,” in SIGMOD, 2017, pp. 1307–1322.

[32] R. Bassily, A. Smith, and A. Thakurta, “Private empirical risk minimization, revisited,” in ICML 2014 Workshop on Learning, Security and Privacy, Beijing, China, 25 Jun 2014.

[33] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentially private empirical risk minimization,” Journal of Machine Learning Research, vol. 12, no. Mar, pp. 1069–1109, 2011.

[34] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, 1st ed. Springer Publishing Company, Incorporated, 2014.

[35] S. Bubeck, Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 2015, vol. 8, no. 3-4.

Proof of Theorem 2. Balle et al. [15] recently showed that, for  δ ∈ (0, 1),

image

where  γ′ := 1 + γ−1δ. Now we show that the above inequality is indeed an equality. Let  y1, y2 ∈ Ybe such that  y1 ̸= y2. We define  µδ = ¯δδy0 + δδy1and  νδ = (¯δ/γ)δy0 + (1 − ¯δ/γ)δy2where ¯δ := 1 − δand  y0 /∈ {y1, y2}. In particular,

image

It is straightforward to verify that  µδK = ¯δK(y0) + δK(y1)and  νδK = (¯δ/γ)K(y0) + (1 − ¯δ/γ)K(y2). Hence, by (6),

image

Therefore, we obtain that

image

Since  y1and  y2are arbitrary, we conclude that the reversed version of inequality (17) holds true and hence

image

with  γ′ = 1 + γ−1δ. It is easy to verify that for fixed  µand  ν, γ �→ Eγ(µ||ν)is continuous and decreasing. Hence, by taking the limit as  δ → 1in (18), the result follows.

Proof of Lemma 1. For ease of notation, let  Li = L(mi, v)for i = 1, 2. It follows from the definition that

image

where  ˜m = m2 − m1(assuming  m2 ≥ m1). Clearly, the above integral is non-zero only if  ˜m ≥ v log γ. With this in mind, we can write

image

The case  m1 ≥ m2is similar. In general, we can write

image

Proof of Lemma 2. Recall that  Ni = N(mi, σ2Id)with i = 1, 2. A direct computation shows that

image

Thus, if we let  β = ∥m1−m2∥σ, then

image

Therefore, by expression for  Eγgiven in (7),

image

as desired.

Proof of Theorem 3. Let  D = {xt}nt=1and  D′ = {x′t}nt=1be two neighbouring datasets with  xi ̸= x′i. By Theorem 1, it is enough to show that

image

where  µ0is the initial distribution of PNSGD and  Kxmodels the update rule in (12). Following (13), we obtain that

image

where  Kxtis a projected additive Laplacian kernel,  ζi = µ0Kx1 . . . Kxiand  ωi = µ0Kx′1 . . . Kx′i.

We begin by recalling Lemma 1 and the data processing inequality to bound  ηeε(Kxt)as

image

To refine the above bound for  ηeε, we resort to the following standard result in convex optimization, see, e.g., [34] or [35, Theorem 3.12]. Notice that we say a function  f : K → Rdis  β-smooth if  a �→ ∇f(a)is  β-Lipschitz.

Lemma 4. Let  K ⊂ Rdbe a convex set and suppose  f : K → Rdis  β-smooth and  ρ-strongly convex. If  η < 2β+ρ, then the map  ψ(a) = a − η∇f(a)is M-Lipschitz on K with  M =�1 − 2ηβρβ+ρ.

image

On the other hand, we use Jensen’s inequality to compute  Eγ(ζi∥ωi):

image

where the first inequality is due to the Jensen’s inequality (note that  (µ, ν) �→ Eγ(µ∥ν)is jointly convex for all  γ > 1as for any other f-divergences) and the last inequality comes from the fact that

image

where we use the fact that  y �→ ℓ(y, x)is L-Lipschitz for all  x ∈ X. Plugging (20) and (21) into (19), we conclude the proof.

Proof of Theorem 4. Let  D = {xt}nt=1and  D′ = {x′t}nt=1be two neighbouring datasets with  xi ̸= x′i. By Theorem 1, it is enough to show that

image

where  µ0is the initial distribution of PNSGD and  Kxmodels the update rule in (12). Following (13), we obtain that

image

where  Kxtis a projected additive Gaussian kernel,  ζxi = µ0Kx1 . . . Kxi = µi−1Kxiand  ωxi = µ0Kx′1 . . . Kx′i = µi−1Kx′i.We begin by recalling Lemma 2 and data processing inequality to bound  ηγ(Kxt), for  γ = eε, as

image

image

where the first inequality is due to the Jensen’s inequality (note that  (µ, ν) �→ Eγ(µ∥ν)is jointly convex for all  γ > 1as for any other f-divergences) and the last inequality and the second inequality stems from the following

image

Plugging (23) and (24) into (22), we conclude the proof.

Proof of Proposition 1. First note that both [12, Theorem 23] and [15, Theorem 5] quantifies the RDP guarantee of PNSGD. However, as the latter strictly improves the former under the assumption of strong convexity, we only consider the former and convert it to DP guarantee. Given  α > 1, a mechanism M is said to be  (α, λ)-RDP [7] if  Dα(PD∥PD′) ≤ λfor all  D ∼ D′where  PDis the output distribution of M running on the dataset D. As before, this definition can be adapted as follows: M is said to be  (α, λ)-RDP for its i-th input if the above holds for dataset D and  D′differing in the i-th coordinate. It is shown in [7, Proposition 3] and [5, Theorem 2] that

image

for  i ∈ [n − 1]and  κ = 2αL2for i = n, where M is defined in Lemma 4. We use (25) to obtain the best DP  δparameter: PNSGD is  (ε, ˜δ)-DP for

image

Solving this minimization problem, we obtain that the minimizer is  α∗ = ε+κ2κand the minimum value is

image

Notice that since  α > 1, we must have  ε > κ.

Proof of Theorem 5. Let T be a uniform random variable on the set [n]. We assume that the PNSGD algorithm stops at the random time T. Recall that we are given two dataset  D = {x1, . . . , xn}and  D′ = {x′1, . . . , x′n}with  xj = x′jfor all  j ∈ [n]except j = i. Let  µTand  νTbe the output distribution of the randomly-stopped PNSGD algorithm running on D and  D′, respectively. We can write

image

and

image

Hence, the convexity of  (P, Q) �→ Eγ(P∥Q)and Jensen’s inequality imply that

image

Finally, by applying Theorem 4,

image

as we wanted to prove.


Designed for Accessibility and to further Open Science