b

DiscoverSearch
About
My stuff
Private Machine Learning via Randomised Response
2020·arXiv
ABSTRACT
ABSTRACT

We introduce a general learning framework for private machine learning based on randomised response. Our assumption is that all actors are potentially adversarial and as such we trust only to release a single noisy version of an individual’s datapoint. Our approach forms a consistent way to estimate the true underlying machine learning model and we demonstrate this in the case of logistic regression.

Our desire is to develop a strategy for machine learning driven by the requirement that private data should be shared as little as possible and that no-one can be trusted with an individual’s data, neither a data collector/aggregator, nor the machine learner that tries to fit a model.

Randomised Response, see for example Warner (1965), is relevant in this context in which a datapoint xnis replaced with a randomised ‘noisy’ version  ˜xn. A classical example is voting in an election in which an individual voter votes for one of two candidates A or B and is asked to lie (with probability p) about whom they voted for . This results in noisy data and estimating the fraction  fAof voters that voted for candidate A based on this noisy data

image

can give a potentially significantly incorrect estimate. As Warner (1965) showed, since we know the probabilistic mechanism that generated the noisy data, a better estimate of the fraction of voters voting for candidate A is given by

image

In a machine learning context, the kind of scenario we envisage is that users may have labelled face images as “happy" or “sad" on their mobile phones and the company MugTome wishes to train a “happy/sad" face classifier; however, users do not wish to send the raw face images to MugTome and also wish to be able to plausibly deny which label they gave any training image. To preserve privacy, each user will send to MugTome only a single corrupted datapoint — a single corrupted image and a single corrupted label.

It is straightforward to extend our approach to deal with users sending multiple corrupted datapoints. However, since MugTome will potentially then know which corrupted datapoints belong to each user, they will have more information to help reveal the underlying clean datapoint. Since we assume we cannot trust MugTome, MugTome may attempt to recover the underlying true datapoint. For example, if a user sends three class labels  c1, c2, c3, ci ∈ {0, 1}, then MugTome can have a good guess of the underlying true class label by simple taking the majority class  c = I (c1 + c2 + c3 > 2). Indeed, in general, if M corrupted datapoints are independently generated for a user, then MugTome’s ability to reveal the true class (or attribute) increases dramatically. For example, if MugTome know the corruption mechanism  p(cm|ctrue)the posterior of the class is given by

image

where  p(ctrue)is the prior belief on the true class. This posterior distribution concentrates exponentially quickly (in M) around the true value  ctrue. Similarly, if a pollster asks each voter three times what they voted, then the questioner would have a very good idea of the true vote of each voter; to protect the voter’s privacy, the voter would then have to trust that the pollster either does not pass on any information that states that the three votes came from the same person or that the pollster doesn’t attempt themselves to figure out what the voter voted for.

Similarly, in a medical setting in which a patient privately owns a datapoint, releasing M synthetic versions (corruptions) of that datapoint can compromise privacy if which synthetic datapoints belong to each person is also known. To guarantee that privacy is retained would require patients to trust people with their data, namely that any data aggregation process will remove their patient ID. However, this is something out of the control of the patient and as such we do not consider generating multiple synthetic datapoints (see for example Bindschaedler et al. (2017)) a ‘safe’ mechanism.

For these reasons, we wish to make a process in which an individual only reveals a single corrupted datapoint; from that point onwards in the machine learning training process, no other trust in that process is required. To motivate our general approach to private machine learning we discuss the voting example in more detail in section(3). Connections to other forms of privacy preserving machine learning are discussed in section(7). The justification for our approach hinges on the properties of the Spread Divergence, which we review in the following section.

Throughout we use the notation p(X = x) for a random variable X in state x. However, to reduce notational overhead, where unambiguous, we write simply p(x).

A divergence D(p||q) (see, for example Dragomir (2005)) is a measure of the difference between two distributions p and q with the property

D(p||q) ≥ 0 and D(p||q) = 0 ⇔ p = q (4)An important class is the f-divergence, defined as

image

where f(x) is a convex function with f(1) = 0. A special case of an f-divergence is the well-known Kullback-Leibler divergence KL(p||q) = Ep(x)�log p(x)q(x)�which is widely used to train models using maximum likelihood. For the Spread Divergence, from q(x) and p(x) we define new distributions ˜q(˜x)and  ˜p(˜x)that have the same support. Using the notation�xto denote integration�(·) dxfor continuous  x, and �x∈X for discrete x with domain X, we define a random variable  ˜xwith the same domain as x and distributions

image

where  p(˜x|x)‘spreads’ the mass of p and q such that  ˜p(˜x)and  ˜q(˜x)have the same support. For example, if we use a Gaussian  p(˜x|x) = N�˜x x, σ2�, then ˜p and ˜qboth have support R. The spread divergence has a requirement on the noise  p(˜x|x), namely that D(˜p||˜q) = 0 ⇔ p = q; that is, if the divergence of the spreaded distributions is zero, then the original non-spreaded distribution will match. As shown in Zhang et al. (2018) this is guaranteed for certain ‘spread noise’ distributions. In particular, for continuous x and  ˜xof the same dimension and injective function f, a sufficient condition for a valid spread noise  p(˜x|x) = K(˜x − f(x))is that the kernel K(x) has strictly positive Fourier Transform. For discrete variables, a sufficient condition is that  p(˜x = i|x = j) = Pij is thatPij > 0and the matrix P is square and invertible.

Spread divergences have a natural connection to privacy preservation and Randomised Response (Warner, 1965). The spread divergence suggests a general strategy to perform private machine learning. We first express the machine learning problem as  minθ Df(p(X)||pθ(X))for a specified model  pθ(X). Then, given only noisy data ˜X, we fit the model by  minθ Df�˜p( ˜X)||˜pθ( ˜X)�. To explain in more detail how this works, we first describe randomised response in a classical voting context and then justify how to generalise this to principled training of machine learning models based on corrupted data.

There are two candidates in an election, candidate “one" and candidate “zero" and Alice would like to know the fraction of voters that voted for candidate “one". We write the dataset of voting as a collection of binary values  {x1, . . . , xN}, xn ∈ {0, 1}.

3.1 LEARNING θ USING CLEAN DATA

If we assume that Alice has full knowledge of which candidate each voter voted for, then clearly Alice may simply count the fraction of people that voted for “one” and set

image

It will be useful to first consider how to arrive at the same result from a modelling perspective. We can consider an independent Bernoulli model

image

where

image

so that

image

We also construct an empirical data distribution that places mass only on the observed joint state, namely

image

where  δ(x, x′)is the Kronecker delta function. Then

image

where

image

and minimising KL(ˆp||pθ)(or maximising  LN(θ)) with respect to  θrecovers the fraction of votes that are 1, equation(7). This shows how we can frame estimating the quantity  θfrom uncorrupted private data as a divergence minimisation problem.

3.2 LEARNING θ USING CORRUPTED DATA

Returning to the privacy setting, Bob would also like to know the fraction of votes that are 1. However, Alice does not want to send to Bob the raw data  x1, . . . , xNsince the votes of any individual should not be explicitly revealed. To preserve privacy, Alice sends noisy data  ˜x1, . . . , ˜xNto Bob. In this case we draw a single joint sample  ˜x1, . . . , ˜xNfrom the distribution

image

where the ‘spread noise’ model is  p( ˜Xn = i|Xn = j) = Pij. Hence, if  xn = 0Alice draws a sample ˜xn = 0with probability  P00 and ˜xn = 1with probability  P10.

Given a sampled noisy dataset  ˜x1, . . . , ˜xNwe form an empirical spreaded data distribution

image

Similarly, the corrupted joint model is given by

image

where

image

On receiving the noisy dataset  ˜x1, . . . , ˜xN, Bob can try to estimate  θby minimising

image

with respect to  θ. Equivalently, he may maximise the scaled spread log likelihood

image

For this simple model, Bob can easily explicitly calculate

image

Similarly,  ˜pθ(˜x = 0) = P01θ + P00 (1 − θ). In this case, equation(20) becomes

image

where

image

Using ˜f0 + ˜f1 = 1, P00 + P10 = 1, P01 + P11 = 1, the maximum of the spread log likelihood is at

image

which forms Bob’s estimate of the underlying fraction of voters that voted for candidate “one”.

For example, if there were no noise  P10 = P01 = 0, Bob would estimate  θ = ˜f1, simply recovering the fraction of votes that are 1 in the original data. In the limit of a large number of votes  N → ∞and true probability  θ0of a voter voting for candidate “one”, then ˜f0tends to  P00(1 − θ0) + P01θ0and Bob’s estimate recovers the true underlying voting probability  θ = θ0. Hence, even though Bob only receives a corrupted set of votes, in the limit of a large number of votes, he can nevertheless estimate the true fraction of people that voted for candidate “one”.

The above example suggests a general strategy to perform private machine learning:

1. Phrase problem as likelihood maximisation: We first assume that a machine learning task for private data  x1, . . . , xNcan be expressed as learning a data model  pθ(X)by optimising an objective

image

image

where  p( ˜X|X)is a defined spread noise distribution and known by both the owner of the private data and the receiver of the corrupted data. To do this, we go through each element of the dataset  xnand replace it with a corruption  ˜xnsampled from  p( ˜Xn = ˜xn|Xn = xn).

3. Send data to learner: We then send to the learner the corrupted dataset  ˜x1, . . . , ˜xN, the model to be learned  pθ(X)and the corruption probability  p( ˜X|X).

4. Estimate  θfrom corrupted data: Having received the corrupted data  ˜x1, . . . , ˜xN, the learner fits  θby maximising the objective

image

4.1 JUSTIFICATION

If we assume that each element  xnof the training data  x1, . . . , xNis identically and independently sampled from a model  pθ0(Xn = xn), then each corrupted observation  ˜xnis a sample from the same distribution given by

image

By the law of large numbers the objective equation(27) approaches its average over the data generating mechanism

image

and maximising the spread likelihood objective ˜LN(θ)becomes equivalent to minimising

image

Provided that the spread noise is valid (see section(2)), then

image

for an identifiable model  pθ. Thus

image

is a consistent estimator.

This means that (in the large data limit and assuming the training data is generated from the model), even though we only train on corrupted data, we are optimising an objective ˜LN(θ)which has a global minimum close to that of the objective on uncorrupted data  LN(θ). Indeed, the estimator is consistent in the sense that as the amount of training data increases, we will recover the true clean data generating mechanism. Hence, provided that the corruption process is based on spread noise, then we can still learn the model parameters  θeven by training on only corrupted data. In our motivating voting scenario in section(3), we saw explicitly that the estimate  θof the true underlying voting fraction is consistent and indeed, this is a general property of our approach.

image

Figure 1: Training based on the reconstruction approach, section(4.3) for the model with binary variable  p(x = 1) = θ, p(x = 0) = 1 − θ. In each case we plot along the x-axis the true  θ0from 0 to 1 and on the y-axis the value of  θthat maximises  J∞(θ). In each plot we use a different flip probability. For a consistent estimator we would require that each plot is a straight x = y line, which only occurs in the case of no noise,  pf = 0.

4.2 TRAINING ON NOISE ONLY

A common approach in private machine learning is to form synthetic (noisy, corrupted) data and then simply train the standard model on this noisy data — see for example Li et al. (2019). In our notation, this would be equivalent to maximising the likelihood

image

As above, assuming that the training data is generated from an underlying model  pθ0(Xn = xn), bythe law of large numbers,

image

In general, the optimum of this objective does not occur when  θ = θ0and therefore training on noisy data alone does not form a consistent estimator of the true underlying model.

We discuss learning with noisy labels more extensively in the context of logistic regression in section(B) in which we show that provided the label flip noise is not too high  p0→1 + p1→0 < 1, andfor zero mean isotropically Gaussian distributed inputs, maximum likelihood training with corrupted class labels does form a consistent estimator. Hence, whilst one cannot guarantee that maximum likelihood training of logistic regression on noisy data will result in a consistent estimator, there are special situations in which this may work.

4.3 RECONSTRUCTION APPROACH

A seemingly natural alternative to our method is to attempt to reconstruct the clean datapoint from the noisy datapoint and use that within a standard learning framework. This approach would give an objective

image

Here we need to define a posterior distribution  p(xn|˜xn)to reconstruct the clean datapoint. Since the learner only has knowledge of the prior  pθ(x)it is natural to set

image

By the law of large numbers  JNconverges to its expectation with respect to the true data generating mechanism  pθ0(˜x) =�p(˜x|x)pθ0(x), so that

image

In general, the optimum of  J∞(θ)is not at  θ = θ0. To demonstrate this, we plot in figure(1) the optimal  θfor a simple Bernoulli model for which we can calculate  J∞(θ)exactly. As we see, for all but zero flip noise,  pf = 0, the estimator does not correctly identify the underlying probabilty generating mechanism. For this reason, we do not pursue this approach further.

4.4 OTHER DIVERGENCES

An extension of the above is to learn  θby minimising other f-divergences

image

However, this generalisation to any f-divergence is harder to justify since the expectation of this objective (by averaging over the noise realisations)

image

will not in general give a divergence between spreaded distributions. This means that in the limit of a large number of datapoints, it is not guaranteed to recover the true data generating process, except for special choices of the f-divergence, such as the KL divergence. We leave a discussion of this for future work.

As an application of the above framework to a standard machine learning model, we now discuss how to form a private version of logistic regression.

Returning to our motivating example, users may have labelled face images as “happy" or “sad" on their mobile phones and the company MugTome wishes to train a “happy/sad" face classifier; however, users do not wish to send the raw face images to MugTome and also wish to be able to plausibly deny which label they gave any training image.

In this case we have a set of training data  x1, . . . , xN, xn ∈ RDand corresponding binary class labels  c1, . . . , cN, cn ∈ {0, 1}. We wish to fit a logistic regression model

image

where  φ(x) = 1/(1 + e−x)is the logistic function. We follow the general approach outlined in section(4).

1. The model: For observation (x, c) and parameter  θpθ(c, x) = pθc(c|x)pθx(x) (42)where  pθc(c|x)is the standard logistic regression model above and  pθx(x)is a model of the input x. The training objective is

image

We note that this is a separable objective for  LN(θ) = LcN(θc) + LxN(θx), in which the logistic regression parameters  θcare conditionally independent (conditioned on the training data) of the input parameters  θx.

image

3. Send to learner corrupted data and model: The corrupted labels and inputs are sent to the learner  (˜c1, ˜x1), . . . , (˜cN, ˜xN)along with the model  pθc(c|x), pθx(x)and corruption process  p(˜c|c), p(˜x|x).

4. Learn the model parameters  θ: The spread log likelihood is

image

Unfortunately, in all but special cases, the integral (for continuous x) or sum (for discrete x) required to evaluate ˜Lis not tractable and numerical approximation is required. For this stage, there are many options available and we present below the approach taken in the experiments.

Interestingly, we note that, unlike training on clean data, the objective ˜L(θ)is not separable into a function of  θcplus a function of  θx, meaning that learning the class prediction parameter  θcis coupled with learning the input distribution parameter  θx.

5.1 IMPLEMENTATION

In general, the spread noise defines a distribution on a pair of spread variables  p(˜c, ˜x|c, x)and the full joint distribution, including the original model is

image

For continuous x, the spread likelihood is then obtained from

image

In general, this sum/integral over x is intractable due to the high-dimensionality of x. We use a standard approach to lower bound the log likelihood (for a single datapoint) by

image

where q is a distribution chosen to make the bound tight, see for example Barber (2012). This allows us to use an EM-style procedure in which we iterate between the two steps : (M-step) fix q and optimise  θand (E-step) fix  θand update q.

image

An advantage of this approach is that  E(θ; q)is separable and we can update the class prediction parameter  θcindependently of the input distribution parameter  θx.

In practice we will typically only do a partial optimisation (gradient ascent step) over  θto guarantee an increase in the energy.

2. Iteration k E-step: The bound is tightest when q is set to the posterior (see for example Barber (2012)),

image

image

for some function f(x, c). Assuming that the posterior will be reasonably peaked around the noisy data we use sampling with an importance distribution

image

Choosing

image

for normalising functions  Zρ(˜c), Zρ(˜x)we then run a standard importance sampling approximation (see section(A)). For a given noisy datapoint  (˜cn, ˜xn)we generate a set of  S samples c1n, . . . , cSn fromρ(c|˜cn)and samples  x1n, . . . , xSn from ρ(x|˜xn)and compute the importance weights

image

The energy equation(52) separates into two independent terms (see section(A))

image

and

image

Equation(60) is a weighted version of the standard logistic regression log likelihood,  Lc(θc)in equation(44); similarly equation(61) is a weighted version of  Lx(θx). The advantage therefore is that, given the importance samples, the learning procedure for  θrequires only a minor modification of the standard maximum likelihood training procedure on clean data.

The full procedure is that we randomly initialise  θand then, for each datapoint n, draw samples and accumulate the gradient across samples and datapoints. After doing a gradient ascent step in  θ, we update the importance distributions and repeat until convergence.

The Importance Sampling approximation is a convenient approach, motivated by the assumption that corrupted datapoints will be close to their uncorrupted counterparts. Whilst we used a bound as part of the approximation, this is not equivalent to using a parametric q distribution; by sampling we form a consistent estimator of the tightest possible lower bound. In other words, we are simply using Importance Sampling to estimate the expectations required within a standard Expectation Maximisation algorithm, see for example Barber (2012). We also tried learning a parametric q, similar to standard variational approaches to log likelihood maximisation, but didn’t find any improvement on the Importance Sampling approach.

5.2 LEARNING THE PRIOR

If we have access to clean data, the optimal input model  pθx(x)can be learned from maximising the likelihood  Lx(θx). However, our general assumption is that we will never have access to clean data. There may be situations in which the learner has a good model of  pθx(x), without compromising privacy (for example a publicly available dataset for a similar prediction problem might be available) in which case it makes sense to set the prior to this known model.

In the absence of a suitable prior we can attempt to learn  pθx(x)from the corrupted data by maximising ˜L(θ). For simplicity we assume a factorised model and for a D-dimensional input vector x =

image

for a collection of learnable univariate distributions p(x[d]|d), d = 1, . . . , D. Under this assumption, and using the Importance Sampling approach in equation(61), this means that p(x[d]|d) can be learned by maximising

image

Since this is a separable objective, we can learn each  p(xns [d]|d)independently.

For simplicity, we assume a discrete distribution for x[d] that contains K states (or bins). Then

image

where  I (xns [d] ∈ k)is 1 if the sample  xns [d]is in the  kthstate and 0 otherwise. Optimising with respect to p(k|d) we obtain

image

For the M-step of the algorithm we then make a gradient update for  θcand update the prior using equation(65).

We implemented our approach in section(5) to train a logistic regression classifier to distinguish between the MNIST digits 7 and 9 based on noisy data (250 train and 900 test examples from each class). We chose to train on a small dataset since this constitutes the most challenging scenario and helps highlight potential differences between rival approaches. The MNIST images x have pixesl with 256 states and we used a discrete distribution to model x.

For our experiments we assume a corruption model  p(˜c|c)that flips the label  0 → 1 and 1 → 0 withprobability  pfwith probability  pf. We also assume here for simplicity assume a factorised input corruption model  p(˜x|x) = �Dd=1 p(˜x[d]|x[d])in which with probability  1 − pfand uniformly from the other states of that pixel with probability  pf.

In this case, computing the Importance Sampling distribution is straightforward since the posterior is factorised over the image pixels. We considered three settings for the prior (required to compute the Importance Sampling distribution) : (i) flat prior, (ii) learned prior using EM, (iii) true factorised prior based on computing the marginal distribution of each pixel on the training dataset. In the ‘true prior’ case we assumed that we know the true marginal distribution of each pixel p(x[d]|d) – in general, this information would be private, but it is interesting to consider how much improvement is obtained by knowing this quantity.

We compare the following approaches:

Log Reg Clean Data We trained logistic regression on clean data. This sets an upper limit on the expected performance.

Log Reg on Noisy Data We trained a standard logistic regression model but using the corrupted data. This forms a simple baseline comparison.

Spread Log Reg with Learned Prior We used our Spread Likelihood approach to learn the prior.

image

Figure 2: (a) An example MNIST “7” alongside its noisy examples (b)  pf = 0.1, (c)  pf = 0.2, (d) pf = 0.3, (e) pf = 0.4) which is sent to Mugshot.com noise. Each pixel remains in state  1 − pf andis otherwise sampled uniformly from the available states of that pixel. The bottom row shows an example of a clean “9” (f) and corruptions.

Spread Log with ‘True Prior’ In general our assumption is that the true prior will not be known (since this requires users to release their private data to the prior learner). However, this forms an interesting comparison and expected upper bound on the performance of the spread approach.

Spread Log with Flat Prior In this case we used an informative, flat prior on all pixel states.

We ran 10 experiments for each level of flip noise  pffrom 0.1, 0.2, 0.3, 0.4 and then tested the prediction accuracy of the learned logistic classifiers on clean hold out data, see figure(3).

For all but small noise levels, the results show the superiority of the spread learning approach over simply training on noisy data, consistent with our theory that training the standard model on noisy data does not in general form a consistent estimator. The best performing spread approach is that which uses the true prior – however, in general this true prior will not be available. For this experiment, there appears to be little difference between using a flat prior and a learned prior.

The performance of standard logistic regression training but using corrupted data is surprisingly effective, at least at low noise levels. However the performance degrades quickly for higher noise levels. A partial explanation for why logistic regression may give good results simply trained on noisy data is given in section(B).

6.1 GAUSSIAN INPUT PRIOR p(x)

We also demonstrate here training logistic regression treating the pixels as continuous. If we an independent have (per pixel) a Gaussian prior

image

and independent Gaussian spread noise

image

then using the Importance Sampling posterior is

image

where

image

We also used Gaussian spread noise to corrupt the images and train a binary classifier to distinguish between the MNIST digits 7 and 9 based on noisy data (4500 train and 900 test examples from each class). For simplicity, we assumed factorised distributions with prior  p(xi) = N�xi µi, ¯σ2i�,

image

Figure 3: The test accuracy (on clean data) of the trained logistic regression models, averaged over 10 different randomly chosen training datasets of 500 datapoints. The x-axis is the corruption probability pf. “log reg”: Standard logistic regression training on clean data; “SD true prior”: spread divergence training approach with true prior; “SD learn prior”: spread approach with learned prior; “SD flat prior”: spread approach with flat prior; “log reg noise”: standard logistic regression training but trained on noisy data.

p(˜xi|xi) = N�˜xi xi, σ2i�. We chose spread flip noise  pf = 0.2for the class labels and uniform spread noise with variance  σ2i = 0.1; the prior p(x) was set to be quite uninformative with mean µi = 0and variance  ¯σ2i = 10. This level of noise means that approximately 20% of the class labels are incorrect in the data passed to MugTome and the associated image is significantly blurred, see figure(4). For standard logistic regression we found that for a learning rate of 0.2, 400 iterations gave the best performance, with 95.5% train accuracy and 95.7% test accuracy. Using our Importance Sampling scheme with S = 2 samples per noisy datapoint, the trained  θwhen tested on clean images had 94.4% test accuracy. This shows that despite the high level of class label and image noise, MugTome are able to learn an effective classifier, preserving the privacy of the users. The loss in test and training accuracy, despite this high noise level is around a modest 1%. When using higher spread noise with variance  σ2 = 0.5, the  θlearned on the noisy data had a clean data test accuracy 93%, which is also a modest decrease in accuracy for a significant increase in privacy.

For future work it would be interesting to consider other forms of noise, for example downsampling images. However, downsampling does not form an injective mapping and as such we cannot guarantee that we can find a consistent estimator for the underlying model.

There are many forms of private machine learning. Some attempt to transform a datapoint x to a form x′such that a protected attribute a (such as gender) cannot be recovered from  x′, yet the prediction

image

Figure 4: (a) An example MNIST “7” alongside its noisy example which is sent to MugTome with Gaussian noise variance (b)  σ2 = 0.1 and (c) σ2 = 0.5; (d,e,f) similarly for an MNIST “9”.

of an output y (for example using  p(y|x′)) is retained. For example this could be achieved by using a loss function such as (see for example Li et al. (2019))

image

where n is the data index,  y(x′; θ)is a function that takes input  x′ and outputs a prediction  y; a(x′; φ)is a function that takes input  x′and outputs an attribute prediction a and  x′ = f(x; ψ)gives a representation of the input;  Ly, Laare loss functions. In this protected attribute setting, typically some form of the clean dataset is required to learn the parameters  θ, φ, ψ.

Another common form of private machine learning is based on differential privacy (Dwork & Roth, 2014), with the aim to make it difficult to discern whether a datapoint  xnwas used to train the predictor  y(x; θ). That is, given a trained model, differential privacy attempts to restrict the ability to differentiate whether any individual’s datum was used to train the model.

A closely related concept to randomised response is that of plausible deniability, namely privately corrupting a datapoint  xnsuch that no-one (except the datapoint provider) can confidently state what the original (private) value of  xnis. Recently Bindschaedler et al. (2017) used this to create synthetic datapoints, which were subsequently used with a standard machine learning training approach. The authors showed that generating synthetic data  ˜xfrom a distribution  p(˜x|x)that takes dependency amongst the elements of the vector x results in better machine learning predictors than sampling from a factorised distribution. In synthetic data generation approaches the assumption is that the statistical characteristics are similar to the real data. However, care is required since if the generating mechanism is powerful, it may generate data which is very similar to the private data.

In general these synthetic data generating approaches do not take into consideration when learning the parameters of the machine learning model what that synthetic data generation mechanism is. This is analogous to simply using the corrupted votes to directly estimate the fraction of voters that voted for a candidate, equation(1), rather than using knowledge of the data generation approach, equation(2).

We discussed a general privacy preserving mechanism based on random response in a datapoint is replaced by a corrupted versions. We showed that, provided the corruption process is a valid spread noise, then a maximum likelihood approach forms a consistent estimator. That is, even though the model is only trained on corrupted, synthetic data, it is possible to recover the true underlying data genering mechnanism on the clean data. We applied this approach to a simple logistic regression model, showing that the approach can work well, even with high levels of noise. The approach is readily applicable to a large class of much more complex models and other divergences.

I would like to thank Xijie Hou for useful discussions.

D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, New York, NY, USA, 2012. ISBN 0521518148, 9780521518147.

Vincent Bindschaedler, Reza Shokri, and Carl A. Gunter. Plausible deniability for privacy-

image

S. S. Dragomir. Some general divergence measures for probability distributions. Acta Mathematica Hungarica, 109(4):331–345, Nov 2005. ISSN 1588-2632. doi: 10.1007/s10474-005-0251-6.

C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407, 2014. ISSN 1551-305X. doi: 10.1561/0400000042.

Ang Li, Jiayi Guo, Huanrui Yang, and Yiran Chen. Deepobfuscator: Adversarial training framework for privacy-preserving image classification, 2019.

S. L. Warner. Randomised response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.

Mingtian Zhang, Peter Hayes, Tom Bird, Raza Habib, and David Barber. Spread divergences, 2018.

A PRIVACY PRESERVING LOGISTIC REGRESSION

The posterior is given by p(c, x|˜c, ˜x) = p(˜c|c)p(˜x|x)p(c|x)p(x)�c�x p(˜c|c)p(˜x|x)p(c|x)p(x) = p(˜c|c)p(˜x|x)p(c|x)p(x)Z(˜c, ˜x) (71)

For the learning, we need to take expectations �

image

We use importance sampling to approximate this expectation

image

Choosing ρ(c|˜c) = p(˜c|c)�c p(˜c|c) = p(˜c|c)Zρ(˜c) , ρ(x|˜x) = p(˜x|x)p(x)�x p(˜x|x)p(x) = p(˜x|x)p(x)Zρ(˜x) (75)

we have �

image

Here Z(˜c, ˜x) =�

image

Putting this together and using the same samples to estimate the numerator and denominator expecta-

tions, �

image

for importance weight w(s) = p(cs|xs)�s p(cs|xs) (83)

For the logistic regression case, we have f(x, c) = log p(c|x) = log φ�(2c − 1)θTx� (84)

where  φ(x) = 1/(1 + exp(−x)).

The variational lower bound then becomes

image

where, for a given noisy datapoint  ˜cn, ˜xnwe generate a set of S samples  c1n, . . . , cSnfrom  ρ(c|˜cn)

and samples  x1n, . . . , xSn from ρ(x|˜xn)

image

B TRAINING ON NOISY DATA

A common approach in private machine learning is to train the standard model based on noisy alone,

corresponding to maximising  L′N(θ), equation(35). As we discussed, this does not in general give a

consistent estimator of the true underlying model. To show this, we consider a logistic regression

in which only the class labels c are corrupted with probabilities  p0→1 ≡ p(˜c = 1|c = 0)and in the

same state with  p1→1 ≡ p(˜c = 1|c = 1), leaving the inputs x uncorrupted. In this case L′N(θ) = 1N

image

If we assume that the true labels are drawn from an underlying model p(c = 1|x) = φ�θT0x� (88)

then the probability of a corrupted label is given by p(˜c = 1|x) = p1→1φ�θT0x�+ p0→1�1 − φ�θT0x�� (89)

and by the law of large numbers,  L′N tends toL′∞(θ) ≡ Ep(x)�p(˜c = 1|x) log φ(θTcx) + (1 − p(˜c = 1|x)) log�1 − φ(θTcx))�� (90)

Taking the gradient wrt  θc, we obtain gθ = Ep(x)��p(˜c = 1|x)(1 − φ(θTcx))x − (1 − p(˜c = 1|x))φ(θTcx)�x� (91)

A sufficient condition for the gradient to be zero is p(˜c = 1|x)(1 − φ(θTcx)) = (1 − p(˜c = 1|x))φ(θTcx) (92)

That is p(˜c = 1|x) = φ(θTcx) (93)

which is p1→1φ�θT0x�+ p0→1�1 − φ�θT0x��= φ(θTcx) (94)

In general, equation(94) does not have a solution at  θc = θ0.

To understand when the objective equation(90) has an optimum, we assume the data is drawn from

p(x) = N (x µ, Σ), then, defining z1 = θT0x, z2 = θTcx (95)

Since x is Gaussian distributed, z is also Gaussian distributed with E [z1] = θT0µ, E [z2] = θTµ (96)

E [z1z2] − E [z1]E [z2] = θTΣθ0 (97)E�z21�− E [z1]2 = θT0Σθ0 (98)E�z22�− E [z2]2 = θTΣθ (99)

We can then write the large data limit log likelihood as a two dimensional expectation

image

For simplicity, consider  µ = 0, Σ = s2I, θT0θ0 = 1, θTθ = 1, θT0θ = cos(α). It is straightforward

to show that in this case the gradient with respect to  αis zero when  α = 0, namely when  θ = θ0.

However, in general, for non-isotropic data covariance  Σ, the gradient is non-zero at  θ = θ0.

To derive the above result, we note that the covariance for z in this case is simply

image

We now use the decomposition  C = MM T, with Cholesky factor

image

Then drawing a sample from z is equivalent to  z ∼ Mϵ for ϵ ∼ N (ϵ 0, I). Definingγ(x) = p1→1φ(x) + p0→1(1 − φ(x)) (103)

we can then write the expected log likelihood as a function of  φ:

image

where the functions are defined as Z1(ϵ1) = sϵ1, Z2(ϵ1, ϵ2) = s (ϵ1 cos α + ϵ2 sin α) (105)

Differentiating  L′∞(α)with respect to  α, we obtain sEN (ϵ 0,I) [(γ(Z1(ϵ1)) − φ(Z2(ϵ1, ϵ2))) (ϵ2 cos α − ϵ1 sin α)] (106)

When  α = 0we note that  Z2(ϵ1, ϵ2)is independent of  ϵ2and that the above is therefore is zero.

Hence ˜L′∞(α)has zero gradient at  α = 0.

It is straightforward to show that the second derivative of  L′∞(α)(evaluated at  α = 0) issEN (ϵ1 0,1) [−ϵ1γ(sϵ1)] − s2EN (ϵ1 0,1) [φ(sϵ1)(1 − φ(sϵ1))] (107)

The second term in equation(107) above is clearly negative. Using integration by parts (and noting

that we may assume s > 0), one may easily show that the first term is also negative provided that

p1→1 > p0→1.

Hence we arrive at the (perhaps surprising) result that for zero mean isotropic Gaussian distributed

input data, training on noisy data  (˜c, x)in which the class labels have been flipped with some

probability, results in a consistent estimator for  θ0, provided the flip noise is not too high, namely

p1→1 > p0→1, or equivalently,  p0→1 + p1→0 < 1. This result holds even in the case of asymmetric

flip noise  p0→1 ̸= p1→0.

More generally, even if the data p(x) is not Gaussian distributed, from the Central Limit Theorem,

p(z) is likely to be close to Gaussian distributed for high dimensional inputs. Hence, for input data x

that is roughly isotropically distributed, we can expect that training using maximum likelihood for

any classifier of the form  p(c = 1|x) = φ�θTx�will likely be close to recovering the true  θ0that

generated the data (in the limit of a large number of datapoints).

The above analysis considered only noise on the class label. If, independently of the class label we

add isotropic Gaussian noise to the observations, then the projection z will still be isotropic Gaussian

distributed for Gaussian inputs p(x) and the above argument trivially extends to this case as well.

Hence, one can expect training (using standard logistic regression but with corrupted inputs and

flipped labels) to be partially successful at recovering the true data generating process provided that

the input data is close to isotropically distributed, motivating a whitening pre-processing step of the

input data.


Designed for Accessibility and to further Open Science