b

DiscoverSearch
About
My stuff
An Adaptive and Fast Convergent Approach to Differentially Private Deep Learning
2019·arXiv
Abstract
Abstract

With the advent of the era of big data, deep learning has become a prevalent building block in a variety of machine learning or data mining tasks, such as signal processing, network modeling and traffic analysis, to name a few. The massive user data crowdsourced plays a crucial role in the success of deep learning models. However, it has been shown that user data may be inferred from trained neural models and thereby exposed to potential adversaries, which raises information security and privacy concerns. To address this issue, recent studies leverage the technique of differential privacy to design private-preserving deep learning algorithms. Albeit successful at privacy protection, differential privacy degrades the performance of neural models. In this paper, we develop ADADP, an adaptive and fast convergent learning algorithm with a provable privacy guarantee. ADADP significantly reduces the privacy cost by improving the convergence speed with an adaptive learning rate and mitigates the negative effect of differential privacy upon the model accuracy by introducing adaptive noise. The performance of ADADP is evaluated on real-world datasets. Experiment results show that it outperforms state-of-the-art differentially private approaches in terms of both privacy cost and model accuracy.

Index Terms—crowdsourcing, information security and privacy, differential privacy, deep learning, adaptive, fast convergent

A. Background and Motivation

The past decade has witnessed the remarkable success of deep learning techniques in various machine learning / data mining tasks, such as signal processing [1], network modeling [2] and traffic analysis [3]. The great success relies heavily on the massive collection of user data, which, however, often raise severe privacy and security issues. For example, Fredrikson et al. [4], demonstrates that the individual privacy information in the training dataset can be recovered by repeatedly querying the output probabilities of a disease recognition classifier built upon a convolutional neural network (CNN). Existing privacy concerns are likely to discourage users from sharing their data and thereby obstruct the future development of deep learning itself. This paper studies the problem of user privacy protection in the training process of neural models. We consider the white-box scenario where an adversary has access to the parameters of a trained model. In this scenario, many service providers allow users to download models to their personal devices (e.g., computers and smart phones), and malicious users could analyze the parameters of the model which may expose personal information in the training dataset.

B. Limitations of Prior Art

To address the privacy issue, several differential privacy (DP) [5] based approaches were proposed, which may be classified into two categories: data obfuscation and gradient obfuscation. Data obfuscation based approaches obfuscate data with noise prior to potential exposure of sensitive information [6], [7]. These approaches may suffer from significant accuracy degradation of the trained model. The reason is that to guarantee the differential privacy bound, the added noise may be excessively intense and make differently labeled training instances almost indistinguishable. In contrast to data obfuscation, gradient obfuscation based approaches add noise to the gradient in the training process [8], [9], [10], [11], [12], [13]. However, they may not circumvent the accuracy degradation issue completely. Although some methods aim to improve the accuracy of gradient obfuscation [11], [13], they have three key limitations. First, the privacy cost is high because the convergence speed of these methods is slow while the privacy cost is accumulated for each gradient calculation. Second, the accuracy still cannot meet the highprecision requirements of many applications since they add identically distributed noise to all components of the gradient which results in large distortion of the original gradient. Third, these methods are computationally inefficient because they need to evaluate the model multiple times or solve a large-scale optimization problem per iteration which make the task computationally prohibitive.

C. Proposed Approach

In this paper, we propose ADADP, an adaptive and fast convergent approach to differentially private deep learning. Our key observation is that different components of the gradient have inhomogeneous sensitivity to the training data. In light of this observation, ADADP mitigates the influence of noise on the model performance by adaptively sampling noise from different Gaussian distributions, based on the sensitivity of each component. In the first stage, ADADP adjusts the learning rate in an adaptive manner based on historical gradients such that infrequently updated components tend to have a larger learning rate. Then, ADADP samples noise from different Gaussian distributions according to the sensitivity of each gradient component and constructs differentially private gradients by adding the sensitivity-dependent noise to the original gradient. In this way, Gaussian noise with a lower variance is added to components with a smaller sensitivity.

Compared to existing data obfuscation and gradient obfuscation based methods, ADADP has three key advantages.

First, the privacy cost of ADADP is low because it exhibits remarkable improvement in the convergence speed due to an adaptive learning rate. Since the privacy cost is accumulated on each gradient update, a faster rate of convergence indicates that training a model using ADADP incurs lower privacy cost.

Second, ADADP achieves both a provable privacy guarantee and a comparable accuracy to non-differentially private models simultaneously. We will show later that this is attained by adding adaptive noise to different gradient components, depending on their sensitivity. As the model converges, we will see a decrease in the expected sensitivity of each gradient component, which thereby reduces the variance of noise distribution. In other words, in contrast to prior works, the noise distribution of ADADP is adaptive to not only different gradient components but also different training iterations.

Third, ADADP is computationally efficient since it does not need to solve any optimization problem to determine the noise distribution at each iteration, in sharp contrast to [13] that requires solving a large-scale non-convex optimization problems. We design an efficient scheme for adjusting the noise distribution and the scheme is evaluated via both theoretical analysis and numerical experiments.

D. Technical Challenges and Solutions

First, it is technically challenging to mathematically analyze the influence of the noise distribution upon the prediction performance, in light of the complicated nature of deep neural networks. As an alternative, we analyze the sufficient and necessary condition for noise distributions to guarantee the target differential privacy level. The condition involves an inequality with respect to the sensitivity of each gradient component and the variance of each corresponding Gaussian distribution. Based on the analysis, we conduct an experiment to compare the influence of different noise distributions upon the original function which can be seen as a query on a dataset. According to the theoretical and experiment results, we propose a heuristic that adapts the noise distributions to the sensitivity of each gradient component. Finally, we perform another experiment to verify the effectiveness of this heuristic according to the influence of noise on the gradient descent algorithm which is widely used for optimizing deep learning models.

Another technical challenge is to compute the privacy cost without any assumptions on the parameters such as the noise level and the sampling ratio. This is in sharp contrast to prior methods. For example, the moments accountant method requires the noise level  σ ≥ 1and the sampling ratio  q < 116σ[10]. To remove the assumptions, we use a technique termed subsampled R´enyi differential privacy (RDP) [14], which computes the privacy cost by analyzing the privacy amplification in the subsampling scenario. And importantly, it requires no assumption on the parameters in the analysis [15].

E. Summary of Experiment Results

We evaluated the privacy cost, the accuracy and the computational efficiency of ADADP and baselines on two real datasets: MNIST [16] and CIFAR-10 [17]. The results show that ADADP outperforms state-of-the-art methods with 50% privacy cost reduction. On MNIST and CIFAR-10, ADADP achieves an improvement of up to 4.3% and 3.5%, respectively, in accuracy over state-of-the-art methods. Our experimental results also show that the processing time of ADADP at each iteration is much less than that of [13], [11], validating the computational efficiency of ADADP.

To protect sensitive information in crowdsourced data collection, differentially private crowdsourcing mechanisms were designed [18], [19], [20], [21]. To preclude personal information from being inferred and/or identified from neural models [4], a line of works emerged which applied DP to deep learning [8], [9], [22], [10], [23], [24], [25], [26], [11], [12], [27], [28], [13]. For instance, [24], [26] integrate DP into a teacher-student framework to protect data on student nodes. [9], [8], [29], [27] study transferring features or gradients with privacy control in collaborative deep learning. [18], [19], [20], [21] apply DP to personal data collection. In the above works, black-box attacks are (implicitly) assumed, i.e., the learned model is inaccessible to adversaries. Other works studied privacy protection in a more realistic white-box model where adversaries may have full knowledge of the model [10], [25], [11], [12], [13]. [10] proposes a differentially private gradient descent algorithm DPSGD by adding Gaussian noise to the gradient. [12] uses an adaptive learning rate to improve the convergence rate and reduce the privacy cost. [25] introduces the Laplace mechanism such that the privacy budget consumption is independent of the number of training steps. [11] allocates different privacy budgets to each training iteration to counteract the influence of noise on the gradient. In all aforementioned works, however, the noise on each gradient component follows the same probability distribution. As a result, the original gradient is distorted to a large extent. Although [13] samples noise from different distributions for each gradient component, solving a large-scale optimization is required at every step.

To reduce the privacy cost, ADADP uses an adaptive learning rate for acceleration of the convergence. Additionally, ADADP adds inhomogeneous and adaptive noise to different coordinates of the gradient based on their sensitivity in order to mitigate the influence of noise on the model performance. The next two subsections elaborate the adaptive learning rate and noise respectively. Then we present ADADP and show its differential privacy guarantee.

A. Adaptive Learning Rate

The most popular method for training deep models is the gradient-descent-type algorithms. They iteratively update the parameters of a model by moving them in the direction opposite to the gradient of the loss function evaluated on the training data. The loss function L is the difference between the predictions and the true labels. To minimize the loss L(θ), stochastic gradient descent (SGD) randomly chooses a subset of training data (denoted by S) at each iteration and performs the update  θ ← θ − η 1|S|�i∈S ∇θL(θ, xi). SGD poses several challenges, e.g., selection of a proper learning rate and avoidance of local minimum traps.

More advanced optimizers, including RMSPROP, ADAM, ADADELTA and NADAM, are proposed to address the above issues. They adjust the learning rate on a per-parameter basis in an adaptive manner and scale the coordinates of the gradient according to historical data.

ADADP uses an adaptive strategy similar to that of RMSPROP. Nevertheless, we would like to note that the framework proposed in this paper is applicable to other adaptive gradient-descent-type algorithms. Recall the update of RMSPROP

image

where  θtdenotes the parameters at step  t, gtdenotes the original gradient,  ηis the learning rate, and  ϵ0is the smoothing term (in case that the denominator is 0).

In ADADP, let  ∆θtdenote the update term at step t and we have  ∆θt = ˜gt√E[˜g2]t+ϵ0where  ˜gtis the noisy gradient which has been added Gaussian noise. In other words, ADADP uses the denominator�E[˜g2]t + ϵ0to adjust the learning rate  ηin an adaptive manner.

B. Adaptive Noise

The intuition of adaptive noise is that different coordinates of the gradient exhibit inhomogeneous sensitivities due to their different values. It significantly affects the direction of the gradient if noise with higher intensity is added to coordinates with a smaller value, and vice versa. In light of this intuition, ADADP clips the gradient and adds Gaussian noise with a smaller/larger variance to dimensions of the clipped gradient with a smaller/larger sensitivity.

Formally, we define the  ℓ2-sensitivity of a function f as ∆f = maxD,D′∥f(D) − f(D′)∥2where  D, D′are two datasets which differ in only one record. Correspondingly, adaptive noise is sampled from Gaussian distributions with different variances based on the  ℓ2-sensitivity of each dimension. In the meanwhile, adaptive noise must satisfy the constraint of differential privacy. To guarantee this, we first present the following lemma which provides the theoretical foundation of adaptive noise.

Lemma 1. Suppose mechanism M(D) = f(D)+Z, where D is the input dataset,  f(·)is a m-dimension function that  ∆f ≤

1 and noise Z ∼ N(0, σ2), is (ϵ, δ)-differentially private, then

mechanism  M ′(D) = f ′(D) + Z′, where  f ′(·)is also a m-dimension function,  Z′ = (z′1, . . . , z′m)Tand  ∀i ∈ [m], z′i ∼N(0, σ2i ), is also  (ϵ, δ)-differentially private if  �mi=1s2iσ2i ≤ 1σ2∗where  siis the  ℓ2-sensitivity of the  ithdimension of  f ′(·).

To prove Lemma 1, we need an auxiliary result which involves the sufficient and necessary condition on the privacy loss variable to satisfy  (ϵ, δ)-DP.

Lemma 2 (Analytical differential privacy [30]). A mechanism M is  (ϵ, δ)-DP if and only if for each  D, D′, the following holds:

image

where  lM,D,D′is the privacy loss variable defined by ln Pr(M(D)=o)Pr(M(D′)=o).

We are now ready to present the proof of Lemma 1.

Proof. The first step is to show that  lM,D,D′is a Gaussian random variable. Let  (r1, . . . , rm) = o − f(D). We consider the worst case of  lM,D,D′. In this case, we have (s1, . . . , sm) = f(D)−f(D′), which yields  (r1+s1, . . . , rm+sm) = o − f(D′). As a result, the following equations hold

image

In light of  rj ∼ N(0, σ2j ), we obtain that  lM,D,D′ =�mj s2j2σ2j +�mj rjsjσ2jalso obeys a Gaussian distribution. Specif- ically, we have  lM,D,D′ ∼ N(�mj s2j2σ2j , �mj s2jσ2j ). If we let H

image

The second step is to show that  Pr (lM,D,D′ ≥ ϵ) −eϵ Pr (lM,D′,D ≤ −ϵ)is monotonically increasing in H. Let us compute the first term

image

where A(H) =

re-written as

image

where  A′(H) = −ϵ√H −√H2.Then the derivative of Pr (lM,D,D′ ≥ ϵ) −eϵ Pr (lM,D′,D ≤ −ϵ)about H is:

image

Since  (A′(H))2 = (A(H))2 + 2ϵ, then we have

image

Therefore,  Pr (lM,D,D′ ≥ ϵ)−eϵ Pr (lM,D′,D ≤ −ϵ)is monotonically increasing with H.

Then if a function  f(·)that  ∆f ≤ 1satisfies  (ϵ, δ)-DP with σ∗, the privacy loss variable  l∗M,D,D′must satisfy Eq. (2). Note that  l∗M,D,D′is a Gaussian variable that  l∗M,D,D′ ∼ N( H∗2 , H∗)where  H∗ = �mj (s∗j )2σ2∗ = 1σ2∗. Since the left part of Eq. (2) is monotonically increasing with H, Eq. (2) will hold for  H′if H′ ≤ H∗, namely �mj s2jσ2j ≤ 1σ2∗.

Lemma 1 shows the condition of differential privacy when we add noise from Gaussian distributions with different variances to different dimensions of a query function. For example, suppose we have a 2-dimension query function  f ′(·), and s1 = 12.0, s2 = 6.0. If mechanism M satisfies  (ϵ, δ)-DP with σ∗ = 1.0, then mechanism  M ′satisfies the same  (ϵ, δ)-DP with  σ1 = 17.0and  σ2 = 8.5since 12.0217.02 + 6.028.52 ≤ 11.02. Namely, we sample noise with standard deviation 17.0 for the first dimension and 8.5 for the second dimension of  f ′(·). In contrast, previous methods sample noise from the same Gaussian distribution  N(0, 13.52)for each dimension of  f ′(·)since 12.0213.52 + 6.0213.52 ≤ 11.02.

Continue with the above example, we conducted a numerical experiment to compare the influence of different noise distributions on the original query function. Specifically, we calculate cosine similarity between the noisy result (M ′(D)) and the original result (f ′(D)). As a metric of such influence, the higher similarity two results have, the less the influ-ence of noise on the original query function. Without loss of generality, we assume  f ′(D) = (10.0, 5.0)Tgiven that s1 = 12.0, s2 = 6.0. We then sample noise 10000 times and compute the average cosine similarity. When we set  σ1 = 17.0and  σ2 = 8.5, the average cosine similarity between the noisy result and the original result is 0.52. However, when we set σ1 = σ2 = 13.5, the average is only 0.36. Another strategy is to set  σ1 = 12.4and  σ2 = 24.8, the cosine similarity in this scenario is reduced to 0.28. Note that like the third strategy, the optimization techniques in [13] tend to sample noise with higher variance for dimensions with smaller sensitivity. In other words, this numeric experiment shows that coordinates of the query function with larger sensitivity can tolerate noise with higher variance.

In the scenario of training a deep learning model, we consider the gradient at each iteration as a query function of the training dataset. Specifically, given the gradient  gtat the tthiteration, we still use  sito denote the  ℓ2-sensitivity of the ithdimension of  gt. Then depending on Lemma 1, a Gaussian mechanism  Mt(x) = gt(x) + Z, where  Z = (z1, . . . , zm)Tand  ∀i ∈ [m], zi ∼ N(0, σ2i ), satisfies  (ϵ, δ)-DP if  �mi=1s2iσ2i ≤

1σ∗2where  ϵ, δ, σ∗will be determined later. The remained problem is how to calculate  siand  σi. Note that RMSPROP uses the term  E[g2]t−1to estimate the average square of historical gradients. We observe that historical gradients can also be used to estimate the current gradient. For example, if E[g2]it−1 > E[g2]jt−1, then it is quite possible that  |git|> |gjt |. In other words,  E[g2]t−1can be considered as a kind of prior knowledge about the value of  |gt|. To distinguish from the term used in the adaptive learning rate, we use  E′[g2]to denote this

prior knowledge which can be computed as follows:

image

Then depending on the prior knowledge,  sican be calculated as  β�E′[g2]it−1where  β > 0is a parameter. Since  siis the  ℓ2-sensitivity of the  ithdimension, we clip each coordinate of the original gradient to guarantee this. Specifically, the  ithdimension of the clipped gradient (denoted by  ¯git) is calculated as:

image

In contrast with previous methods which clip the gradient with a global  ℓ2-norm upper bound C such that  ∥gt∥2≤ C, we call our method as local clipping since it operates on each coordinate separately and call  βas local clipping factor. Then

considering �mi=1s2iσ2i ≤ 1σ∗2, we have  σi = βσ∗�mE′[g2]it−1where m is the number of dimensions. This heuristic means that for the  ithdimension of the gradient, the standard deviation of Gaussian noise (σi) is scale to the  ℓ2-sensitivity of the  ithdimension (β�E′[g2]it−1). On the other hand, as the training tends to converge, the gradual decline of�E′[g2]it−1will lead to the decrease of each  σi. That means, the noise distributions are not only adaptive to different dimensions of the gradient, but also adaptive to different iterations of the training.

Since we have  E′[g2]0 = #»0at the first iteration, this value cannot be used to clip the gradient which will cause  si = 0. We set another parameter G called local clipping threshold and ADADP applies local clipping only when  Var[E′[g2]] > G. In other words, as a kind of prior knowledge about the gradient gt, E′[g2]t−1lacks sufficient information at the beginning of the training. During this phase, ADADP applies global clipping as follows:

image

where C is a parameter called  ℓ2-norm clipping bound. After several iterations,  E′[g2]will become more stable and ADADP will perform local clipping when  Var[E′[g2]] > G.

To gain more insight into the advantages of adaptive noise, we illustrate and compare the updates of ADADP, DPSGD, and their non-private counterparts by testing them on the Beale function  f(x, y) = (1.5−x+xy)2+(2.25−x+xy2)2+(2.625−x + xy3)3, as shown in Figs. 1a and 1b. In this experiment, we set  γ = 0.1, γ′ = 0.9, β = 1.5and  G = 10−6.

We observe that the trajectory of DPSGD exhibits a remarkable deviation from that of its non-private version, while ADADP and its non-private version display similar trajectories. Quantitatively, We define the distance between two trajectories P = {pi}ni=1and  Q = {qi}ni=1of the same length as D(P, Q) = 1|P |�ni=1 ∥pi − qi∥2. Note that the length of a trajectory is the number of training iterations. The distance between the trajectories of DPSGD and its non-private version is 0.90 and the distance between ADADP and its non-private version is 0.21. Experiments on all widely used test functions for optimization [31] also show similar results.

The above experiment suggests that adaptive noise can mitigate the deviation of the noisy result from the original result and the performance of ADADP is more robust to the privacy-protecting Gaussian noise than DPSGD. Recall that DPSGD adds Gaussian noise with the same intensity to all dimensions. In contrast, ADADP updates each dimension separately and the intensity of the added Gaussian noise relies on the sensitivity of each dimension of the gradient.

image

Figure 1: The trajectories of ADADP, DPSGD and corresponding non-private algorithms on Beale function with 150 training iterations.

C. Algorithm and Main Results

Now we present ADADP in Algorithm 1. Let  θ0 ={θ10, θ20, . . . , θn0 }denote the initial values of the n trainable pa- rameters. Note that, a lot is a sample from the training dataset which is different from batch. There can be several batches in a lot and the lot size is the batch size multiplied with the number of batches in it. At each step t, we compute the gradient with per-example gradient algorithm [32] (Line 5). Then for each dimension of the gradient, we perform the local clipping at line Line 9 and Line 10 if the local clipping threshold is satisfied (Line 7). After that, we calculate  σiand add Gaussian noise to the  ithdimension of the gradient (Line 12 and Line 13). If the local clipping threshold is not achieved, we perform global clipping and add non-adaptive noise at Line 16 and Line 18. Then we update the two expectations used in the adaptive learning rate and the adaptive noise correspondingly (Line 22 and Line 23). Afterwards, we update the parameters at the  tthiteration with  ∆ˆθt(Line 26). After T training steps, ADADP returns the final values  θT = {θ1T , θ2T , . . . , θmT }(Line 28).Before presenting our main results on the privacy guarantee of ADADP, let us review the definition of differential privacy.

Definition 1 (Differential Privacy [5], [10]). A randomized mechanism M satisfies  (ϵ, δ)-differential privacy, if for any two neighboring datasets D and  D′that differ only in one tuple, and for any possible subset of outputs O of M, we have

image

where  Pr(·)denotes the probability of an event. If  δ = 0, Mis said to satisfy  ϵ-differential privacy.

We now show the privacy guarantee of ADADP in Theorem 1.

Theorem 1. Let �·

B(l) = �li=0(−1)i�li�e(i−1)i/(2σ2∗). Given a privacy budget

(ϵ, δ), the sampling ratio  q = LNand any integer  α ≥ 2, if the

image

noise scale  σiin Algorithm 1 satisfies �mi s2iσ2i ≤ 1σ2∗and  σ∗satisfies

image

then Algorithm 1 is  (ϵ, δ)-differentially private.

To prove Theorem 1, we use the techniques of RDP to analyze the privacy cost of the composition of Gaussian mechanisms. The results that we obtain via RDP are later translated to DP.

Definition 2 (R´enyi Divergence [14]). Given two probability distributions P and Q, the R´enyi divergence between P and Q with order  α > 1is defined by

image

Definition 3 (R´enyi differential privacy [14]). A randomized mechanism  f : D → Rdis said to be  (α, ϵ)-R´enyi differential private if for any adjacent datasets  D, D′, it holds that

image

Our proofs are based on the three key properties of RDP, as stated in Lemmas 3 to 5.

Lemma 3 (RDP of Gaussian Mechanism [14]). If  ∥f(·)∥2≤ 1, then the Gaussian mechanism  M(D) = f(D) + N(0, σ2)satisfies  (α, α/(2σ2))-RDP.

Lemma 4 (Composition of RDP [14]). For two randomized mechanisms f, g such that f is  (α, ϵ1)-RDP and g is  (α, ϵ2)-RDP, the composition of f and g which is defined as (X, Y ) (a sequence of results), where  X ∼ fand  Y ∼ g, satisfies (α, ϵ1 + ϵ2)-RDP.

Lemma 5 (Translation from RDP to DP [14]). If a randomized mechanism  f : D → Rdsatisfies  (α, ϵ)-RDP, then it satisfies (ϵ + log 1/δα−1 , δ)-DP where  0 < δ < 1.

Lemma 6 analyzes the privacy amplification in the subsampling setting.

Lemma 6 (RDP of subsampled Gaussian Mechanism [15]). Define function  B(ϵ, l) = �li=0(−1)i�li�e(i−1)ϵ(i). Given a dataset D of n records and a Gaussian mechanism f satisfying  (α, ϵ(α))-RDP, define a new randomized mechanism f ◦subsampleas: (1) subsample m records where q = m/n, (2) apply these m records as the input of mechanism f, then for any integer  α > 1, f ◦ subsamplesatisfies  (α, ϵ′(α))-RDP, where:

image

Before showing the proof of Theorem 1, we need to establish the following important lemma. In Lemma 7, we bound the noise level  σof a Gaussian mechanism in the subsampling setting.

Lemma 7. Given the sampling probability q = L/N, the number of steps T, the Gaussian mechanism  M = f(·) +N(0, σ2)where  ∥f(·) ∥2≤ 1, then the composition of T these mechanisms satisfies  (ϵ, δ)-differentially private if  σsatisfies:

image

where  αcan be any integer satisfying  α ≥ 2and the function B is defined as  B(l) = �li=0(−1)i�li�e(i−1)i/(2σ2).

Proof. By Lemma 3, we could compute  B(ϵ, l)defined in Lemma 6 on Gaussian mechanisms as

image

Since a lot used in ADADP is a subsample of the training dataset, each step is  (α, ϵ′(α))-RDP where  ϵ′(α)satisfies (8) in Lemma 6. Then after T training steps of ADADP, we could obtain the total privacy cost  (α, Tϵ′(α))via composing such T subsampled Gaussian mechanisms depending on Lemma 4. Then by substituting the function  B(ϵ, l)in (8) with (11), Tϵ′(α)can be clearly expressed as

image

At last, through converting  (α, Tϵ′(α))to DP representation via Lemma 5, ADADP satisfies  (Tϵ′(α) + log(1/δ)α−1 , δ)-DP. Let Tϵ′(α)+ log(1/δ)α−1 ≤ ϵwhere  ϵis the given privacy budget and combine this with (12) as follows:

image

then the proof is completed.

Now we are ready to prove Theorem 1.

Proof. Combining Lemma 1 and Lemma 7, Theorem 1 is easy to prove. Suppose we use M(D) = f(D) + Z as a Gaussian mechanism at training step t in Algorithm 1, then the composition of such T mechanisms satisfies Eq. (9). Based on Lemma 1, if we use  M ′(D) = f ′(D)+Z′at each training step t in which �mi=1s2iσ2i ≤ 1σ2∗, then  M ′(D)satisfies the same differential privacy guarantee as M(D). Therefore, the proof of Lemma 7 is also suitable for  M ′(D). Combine this fact with the post-processing property of DP [33], the proof is completed.

image

We would like to remark that Theorem 1 covers the realm of small noise and high sampling ratio that the moments accountant method [10] omits (which requires  σ ≥ 1and q < 116σ). For instance, if we train a model using ADADP with L = 600 and  σ∗ = 0.9on a dataset of N = 60000 examples, Theorem 1 bounds the privacy cost by choosing the optimal  α = 6and implies that the model achieves (4.0, 10−5)-differential privacy after 1800 training steps.

A. Evaluation Setup

We evaluate the privacy cost, the accuracy and the computational efficiency of ADADP compared with state-of-the-art methods: DPSGD [10], AGD [11] and DPOPT [13] on two real datasets: MNIST and CIFAR-10. We implemented all these algorithms using TensorFlow [34] with a GTX 1080Ti GPU.

MNIST is a standard dataset for handwritten digit recognition, which consists of 60,000 training examples and 10,000 testing examples. Each example is a  28×28gray-level image. CIFAR-10 consists of 60,000 labeled examples of  32 × 32RGB images. There are 50,000 training images and 10,000 testing images.

The model for MNIST task first performs a 60-dimensional differentially private PCA (DPPCA) projection [35] and then applies a single 1,000-unit ReLU hidden layer [36]. For the CIFAR-10 task, we use a variant of AlexNet [37], which contains two convolutional layers followed by two fully connected layers and one softmax layer. The first convolutional layer includes 64 filters of size  5 × 5with stride 1. The layer is followed by the ReLU activation,  2 × 2max pooling, and local response normalization. The structure of the second convolutional layer is identical to that of the first one except that the local response normalization is performed before max pooling. The output is then flattened into a vector as the input for the following fully connected layer.

As outlined in Algorithm 1, we compute gradients for each training example. However, it is prohibitive to compute per-example gradients due to the parameter sharing scheme of convolutional layers. Since convolutional layers are shown to be well transferred [38], we pre-train the model on CIFAR-100 and initialize the network with the trained parameters. When we train it on CIFAR-10, the parameters of convolutional layers are maintained and updates happen to the fully connected layers and the softmax layers.

We use the result of Theorem 1 to calculate the privacy cost. Specifically, given  ϵ, σ∗, and q, at each iteration, we select  α’s from {2, 3, . . . , 64} and determine the smallest  δ∗that satisfies (7) in Theorem 1. The privacy cost is the pair  (ϵ, δ∗).For ADADP, we set  γ = 0.1, γ′ = 0.9and  G = 10−6in all experiments. For AGD, since it was only evaluated on shallow machine learning tasks in [11], its privacy computation is not suitable for deep learning as it does not consider the privacy amplification due to subsampling. Therefore, we use Theorem 1 to compute the privacy guarantee for AGD to give a fair comparison. To implement DPOPT1, we use the projected gradient descent algorithm and optimize the objective function with 50 steps as the same as [13].

B. Privacy Cost

image

Table I: Results on the privacy cost of DPSGD and ADADP

To illustrate the trade-off between the privacy cost and the accuracy, we measured the privacy cost of ADADP and DPSGD to attain a pre-specified accuracy level. We set the noise level  σ∗ = 3.0and  σp = 6.0on MNIST and  σ∗ = 4.0on CIFAR-10, where  σpis the noise level for DPPCA [35].

Compared with DPSGD (as a representative method with a non-adaptive learning rate), ADADP achieves an average reduction of 54% and 46% in privacy cost on MNIST and CIFAR-10 respectively. Table I summarizes the results, where ϵDand  ϵAdenote the minimum privacy cost required by DPSGD and ADADP, respectively, to attain the pre-specified accuracy level. We observe that ADADP always requires much lower privacy cost than DPSGD to achieve the same accuracy level. This is mainly due to the faster convergence and fewer training steps of ADADP.

C. Accuracy

Furthermore, given a fixed privacy budget, we evaluate the accuracy of ADADP, DPSGD, AGD and DPOPT on both MNIST and CIFAR-10 datasets under the high privacy level, the medium privacy level and the low privacy level respectively. In all experiments on MNIST, we set the lot size L = 600 and  ℓ2-norm clipping bound C = 4.0. For ADADP, we set the local clipping factor  β = 1.2. The noise levels  (σ∗, σp)for training the fully connected layer and the DPPCA layer are set to (8, 16), (4, 7), and (2, 4), respectively, for the three experiments. When testing DPSGD and DPOPT, we set the initial learning rate to 0.1 and linearly drop it down to 0.052 after 10 epochs and then keep it constant. When applying ADADP, we set the learning rate  ηto 0.002 and keep it unchanged. On CIFAR-10, we set the noise level  σ∗ = 6.0, lot size  L = 2000, ℓ2-norm clipping bound C = 3.0 for fully connected layers and C = 6.0 for the softmax layer. For ADADP, we set the local clipping factor  β = 1.2. For DPSGD and DPOPT, we set the initial learning rate to 0.1 and apply an exponentially decaying schedule in the first experiment and set the learning rate to 0.052 in the last two experiments. For ADADP, the learning rate  ηis set to 0.001.

The accuracy achieved by ADADP significantly outperforms that of DPSGD, AGD and DPOPT under all three privacy levels on both MNIST and CIFAR-10 datasets. Fig. 2 illustrates how the accuracy varies with the number of epochs and  δ∗on MNIST. The gray line and the black line (please refer to the right vertical axis) denote the accumulating privacy cost for AGD and other methods respectively in terms of  δ∗given a fixed  ϵ. We observe that the final test accuracy of ADADP on the MNIST achieves an increase of 5.9%, 2.7% and 1.0% respectively compared with DPSGD, 8.8%, 9.5% and 6.7% respectively compared with AGD, 4.3%, 1.5% and 0.7% respectively compared with DPOPT. The results on CIFAR-10 are shown in Fig. 3. In all three settings, ADADP achieves an accuracy increase of 3.2%, 5.3% and 4.8% respectively compared with DPSGD, 8.5%, 8.0% and 6.9% respectively compared with AGD, 2.1%, 3.5% and 3.2% respectively compared with DPOPT.

We now analyze how we achieve higher accuracy than each state-of-the-art method respectively. Compared with DPSGD, the performance improvement of ADADP is achieved by both the adaptive learning rate and the adaptive noise. As to AGD,  σin this algorithm can only decrease gradually which cannot be reversed such that the privacy budget is consumed too fast. Generally, a few epochs of training are not enough to guarantee the convergence for deep learning models. Therefore, AGD performs poorly in deep learning. For DPOPT, on the one hand, ADADP adopts an adaptive

image

Figure 2: Results on the accuracy for different privacy levels on MNIST dataset.

image

Figure 3: Results on the accuracy for different privacy levels on CIFAR-10 dataset.

learning rate to improve the convergence. On the other hand, as our first numeric experiment indicates, ADADP samples noise with higher variance for dimensions of the gradient with larger sensitivity which mitigates the influence of noise significantly. In addition, the expected variance of each noise distribution will gradually decrease as the training progresses.

D. Computational Efficiency

image

Table II: Results on the computational efficiency.

In addition, we evaluate the average processing time of ADADP, DPSGD, AGD and DPOPT per iteration, as a measure of computational efficiency. The time involves processing a whole batch where we set the batch size as 120 on MNIST and 32 on CIFAR-10 respectively. Other settings are the same as Fig. 2a and Fig. 3a respectively.

ADADP is much more computationally efficient than both AGD and DPOPT. The results are shown in Table II. We observe that for each iteration, the average processing time of ADADP is close to that of DPSGD which is far shorter than the other two algorithms. Note that AGD repeatedly evaluates the model multiple times to obtain the best update step size while DPOPT needs to solve a non-convex optimization problem with variables as many as the parameters of the deep learning model. In contrast to AGD and DPOPT, ADADP adopts a heuristic to avoid heavy computation at each iteration.

E. Micro Benchmarks

Considering there are two components in ADADP, namely the adaptive learning rate and the adaptive noise, we study their independent contribution to the final performance in this subsection. We call the methods with only one component ADAL (only adaptive learning rate) and ADAN (only adaptive noise) respectively. ADAL uses global clipping method to clip the original gradient and samples noise for each dimension from the same Gaussian distribution. The only difference between ADAL and DPSGD is that ADAL adjusts the learning rate based on Eq. (1). As to ADAN, we implement it directly by removing the adaptive learning rate part in ADADP. For the experiment on MNIST, the learning rate is set to 0.1 and 0.001 for ADAN and ADAL respectively. Other settings are the same as Fig. 2a. On CIFAR-10, the learning rate is set to 0.05 and 0.001 for ADAN and ADAL respectively and other settings are the same as Fig. 3a.

Both adaptive components contribute to the performance gain of ADADP while the adaptive noise component contributes more. As illustrated in Fig. 4a and Fig. 4b, we observe that ADAN achieves a higher accuracy than ADAL, showing that, compared with the adaptive learning rate, the adaptive noise component has a more significant impact on the performance of ADADP.

Since ADADP depends on whether the estimation of �E′[g2]t−1is accurate for  |gt|, we further conducted another experiment to verify its effectiveness under the settings as the same as Fig. 2a.

The estimation given by�E′[g2]t−1is accurate and can reflect the changing trend of  |gt|. The results are shown in Fig. 4c, from which we observe that the distribution of �E′[g2]t−1(the figure below) is close to that of  |gt|(the figure above) at each iteration. Also, from the  100thiteration to the  1000thiteration, both the gradient value and the estimated value show a declining trend.

Additionally, as aforementioned, given the expected sensitivity of the gradient decreases with the training progresses,

image

Figure 4: The micro benchmarks of key components in ADADP.

image

Figure 5: The sensitivity of different parameters on the accuracy of ADADP.

each  σiwill also decrease gradually. We select the  σiin ADADP at the  100thiteration and the  1000thiteration respectively to illustrate this property. All the settings are the same as Fig. 2a.

The standard deviation of most noise distributions for each coordinate of the gradient gradually decreases as the training progresses. As shown in Fig. 4d (the area of each circle is proportional to the value of  σi), we observe that most  σiat the  1000thiteration (the figure below) are more concentrated in the range less than 10 while there are many  σidistributed from 10 to 20 at the  100thiteration (the figure above). Therefore, the noise distributions in ADADP are adaptive not only to each dimension of the gradient, but also to different iterations during training.

F. Impact of Parameters

In this set of experiments on the MNIST dataset, we study the impact of the learning rate, the local clipping factor, the noise level and the lot size on the accuracy. In all experiments, if not specified, we set the learning rate  η = 0.002, lot size L = 600, local clipping factor  β = 1.2, ℓ2-norm clipping bound C = 4.0, noise level  σ∗ = 8.0, σp = 16.0, and privacy level  ϵ = 0.5, δ = 1e−5.

Learning Rate. As shown in Fig. 5a, the accuracy stays consistently above 93%, irrespective of the learning rate ranging from  0.08×10−2to  0.5×10−2. When the learning rate is lower than  2 × 10−2, the accuracy increases with the learning rate. When it is higher than  2 × 10−2, a higher learning rate results in a lower accuracy.

Local Clipping Factor. The local clipping factor  βcontrols the scale of each dimension of the gradient. Note that the estimation�E[′[g2]t−1]is not equal to  |gt|, a smaller  βdrops more information of the gradient since some dimensions will be clipped as  sior  −si. Fig. 5b shows that the performance of ADADP is relatively resistant to the local clipping factor and a reasonable setting for  βis to take a value slightly greater than one. However, due to the adaptive noise scheme of ADADP, a large value of  βwill not raise the noise level drastically (especially because when the gradient is close to 0 as the training converges,  siwill also be close to 0 whatever  βtakes).

Noise Scale. The noise scale determines the amount of Gaussian noise added to the update term at each step. Although a smaller noise scale mitigates the effect of noises, it results in fewer training steps and the model is hard to converge. Meanwhile, setting a larger noise scale allows more training steps, but excessive noise will ruin the original gradient. Results achieved with different noise scales are shown in Fig. 5c, where our model attains the most superior performance with σ∗ = 7.0. Compared with DPSGD, which reaches the highest accuracy at  σ∗ = 4.0[10], we attribute the much larger value of  σ∗of ADADP to adaptive noises which alleviate the impact of the differential privacy mechanism and a larger noise scale yields more training steps.

Lot Size. The lot size controls the sampling ratio. A large lot size yields a higher sampling ratio and reduces the number of training steps. If the noise intensity is fixed, a smaller lot size results in a more notable effect of Gaussian noise. Fig. 5d shows the accuracy vs. the lot size. It can be observed that as the lot size grows, the performance increases initially, peaks at L = 800, and finally declines. The result agrees with the above analysis that too high too low sampling ratio will incur performance degradation.

In this paper, our key contribution is three-fold. First, we propose a differentially private deep learning algorithm which leads to a faster convergence and higher accuracy in comparison with prior methods. Second, we intuitively analyze advantages of ADADP over DPSGD and mathematically prove that ADADP satisfies differential privacy by more advanced analytic tools. Third, we applied ADADP, DPSGD, AGD and DPOPT for model training of deep learning networks with real-world datasets and experimentally evaluate the better performance of ADADP.

[1] X. Wang, L. Gao, and S. Mao, “Csi phase fingerprinting for indoor localization with a deep learning approach,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 1113–1123, 2016.

[2] J. Wang, J. Tang, Z. Xu, Y. Wang, G. Xue, X. Zhang, and D. Yang, “Spatiotemporal modeling and prediction in cellular networks: A big data enabled deep learning approach,” in Proceddings of 36th Annual IEEE International Conference on Computer Communications (INFOCOM), 2017, pp. 1–9.

[3] Y. Zhou, M. Han, L. Liu, J. S. He, and Y. Wang, “Deep learning approach for cyberattack detection,” in Proceddings of 37th IEEE International Workshop on Computer Communications (INFOCOM), 2018, pp. 262– 267.

[4] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart, “Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing,” in Proceedings of 23rd USENIX Security Symposium (USENIX Security), 2014, pp. 17–32.

[5] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of cryptography conference. Springer, 2006, pp. 265–284.

[6] R. Bost, R. A. Popa, S. Tu, and S. Goldwasser, “Machine learning classification over encrypted data.” in Proceedings of 22nd Annual Network & Distributed System Security Symposium (NDSS), vol. 4324, 2015, p. 4325.

[7] T. Zhang, “Privacy-preserving machine learning through data obfusca- tion,” arXiv preprint arXiv:1807.01860, 2018.

[8] P. Kairouz, S. Oh, and P. Viswanath, “Secure multi-party differential privacy,” in 29th Conference on Neural Information Processing Systems (NeurIPS), 2015, pp. 1999–2007.

[9] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of 22nd ACM Conference on Computer and Communications Security (CCS), 2015, pp. 1310–1321.

[10] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Tal- war, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of 23rd ACM Conference on Computer and Communications Security (CCS), 2016, pp. 308–318.

[11] J. Lee and D. Kifer, “Concentrated differentially private gradient descent with adaptive per-iteration privacy budget,” in Proceedings of 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2018, pp. 1656–1665.

[12] A. Koskela and A. Honkela, “Learning rate adaptation for differentially private stochastic gradient descent,” arXiv preprint arXiv:1809.03832, 2018.

[13] L. Xiang, J. Yang, and B. Li, “Differentially-private deep learning from an optimization perspective,” in Proceedings of 38th Annual IEEE Conference on Computer Communications (INFOCOM), 2019, pp. 559– 567.

[14] I. Mironov, “R´enyi differential privacy,” in Proceedings of 30th IEEE Computer Security Foundations Symposium (CSF), 2017, pp. 263–275.

[15] Y.-X. Wang, B. Balle, and S. Kasiviswanathan, “Subsampled r´enyi differential privacy and analytical moments accountant,” arXiv preprint arXiv:1808.00087, 2018.

[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[17] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.

[18] L. Zhang, Y. Li, X. Xiao, X.-Y. Li, J. Wang, A. Zhou, and Q. Li, “Crowdbuy: Privacy-friendly image dataset purchasing via crowdsourcing,” in Proceedings of 37th IEEE Annual Conference on Computer Communications (INFOCOM), 2018, pp. 2735–2743.

[19] W. Jin, M. Xiao, M. Li, and L. Guo, “If you do not care about it, sell it: Trading location privacy in mobile crowd sensing,” in Proceedings of 38th IEEE Annual Conference on Computer Communications (INFOCOM), 2019, pp. 1045–1053.

[20] C. Niu, Z. Zheng, S. Tang, X. Gao, and F. Wu, “Making big money from small sensors: Trading time-series data under pufferfish privacy,” in Proceedings of 38th IEEE Annual Conference on Computer Communications (CCS). IEEE, 2019, pp. 568–576.

[21] H. Li, Y. Yang, Y. Dou, J.-M. J. Park, and K. Ren, “Pedss: Privacy enhanced and database-driven dynamic spectrum sharing,” in Proceedings of 38th IEEE Annual Conference on Computer Communications (INFOCOM). IEEE, 2019, pp. 1477–1485.

[22] N. Phan, Y. Wang, X. Wu, and D. Dou, “Differential privacy preservation for deep auto-encoders: An application of human behavior prediction,” in Proceedings of 30th AAAI Conference on Artifacial Intelligence (AAAI), vol. 16, 2016, pp. 1309–1316.

[23] A. Papadimitriou, A. Narayan, and A. Haeberlen, “DStress: Efficient differentially private computations on distributed data.” in Proceedings of the 12th European Conference on Computer Systems (EuroSys), 2017, pp. 560–574.

[24] N. Papernot, M. Abadi, ´U. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-supervised knowledge transfer for deep learning from private training data,” in Proceedings of 5th International Conference on Learning Representations (ICLR), 2017.

[25] N. Phan, X. Wu, H. Hu, and D. Dou, “Adaptive laplace mechanism: differential privacy preservation in deep learning,” in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2017, pp. 385–394.

[26] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, and ´U. Erlingsson, “Scalable private learning with PATE,” in Proceedings of 6th International Conference on Learning Representations (ICLR), 2018.

[27] S. Collet, R. Dadashi, Z. N. Karam, C. Liu, P. Sobhani, Y. Vahlis, and J. C. Zhang, “Boosting model performance through differentially private model aggregation,” arXiv preprint arXiv:1811.04911, 2018.

[28] L. Wang, G. Qin, D. Yang, X. Han, and X. Ma, “Geographic differential privacy for mobile crowd coverage maximization,” in Proceedings of 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.

[29] Y. Mao, S. Yi, Q. Li, J. Feng, F. Xu, and S. Zhong, “A privacy-preserving deep learning approach for face recognition with edge computing,” in USENIX Workshop on Hot Topics in Edge Computing (HotEdge), 2018.

[30] B. Balle and Y.-X. Wang, “Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising,” arXiv preprint arXiv:1805.06530, 2018.

[31] “Test functions for optimization Wikipedia, the free encyclopedia,” http://en.wikipedia.org/w/index.php?title=Test functions for optimization.

[32] I. Goodfellow, “Efficient per-example gradient computations,” arXiv preprint arXiv:1510.01799, 2015.

[33] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.

[34] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in Proceedings of 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), vol. 16, 2016, pp. 265–283.

[35] C. Dwork, K. Talwar, A. Thakurta, and L. Zhang, “Analyze gauss: optimal bounds for privacy-preserving principal component analysis,” in Proceedings of 46th Annual ACM Symposium on Theory of Computing (STOC), 2014, pp. 11–20.

[36] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of 27th International Conference on Machine Learning (ICML), 2010, pp. 807–814.

[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of 26th Conference on Neural Information Processing Systems (NeurIPS), 2012, pp. 1097–1105.

[38] K. Jarrett, K. Kavukcuoglu, Y. LeCun et al., “What is the best multistage architecture for object recognition?” in Proceedings of IEEE 12th International Conference on Computer Vision (ICCV), 2009, pp. 2146– 2153.


Designed for Accessibility and to further Open Science