Our desire is to develop a strategy for machine learning driven by the requirement that private data should be shared as little as possible and that no-one can be trusted with an individual’s data, neither a data collector/aggregator, nor the machine learner that tries to fit a model.
Randomised Response, see for example Warner (1965), is relevant in this context in which a datapoint is replaced with a randomised ‘noisy’ version
. A classical example is voting in an election in which an individual voter votes for one of two candidates A or B and is asked to lie (with probability p) about whom they voted for . This results in noisy data and estimating the fraction
of voters that voted for candidate A based on this noisy data
can give a potentially significantly incorrect estimate. As Warner (1965) showed, since we know the probabilistic mechanism that generated the noisy data, a better estimate of the fraction of voters voting for candidate A is given by
In a machine learning context, the kind of scenario we envisage is that users may have labelled face images as “happy" or “sad" on their mobile phones and the company MugTome wishes to train a “happy/sad" face classifier; however, users do not wish to send the raw face images to MugTome and also wish to be able to plausibly deny which label they gave any training image. To preserve privacy, each user will send to MugTome only a single corrupted datapoint — a single corrupted image and a single corrupted label.
It is straightforward to extend our approach to deal with users sending multiple corrupted datapoints. However, since MugTome will potentially then know which corrupted datapoints belong to each user, they will have more information to help reveal the underlying clean datapoint. Since we assume we cannot trust MugTome, MugTome may attempt to recover the underlying true datapoint. For example, if a user sends three class labels , then MugTome can have a good guess of the underlying true class label by simple taking the majority class
. Indeed, in general, if M corrupted datapoints are independently generated for a user, then MugTome’s ability to reveal the true class (or attribute) increases dramatically. For example, if MugTome know the corruption mechanism
the posterior of the class is given by
where is the prior belief on the true class. This posterior distribution concentrates exponentially quickly (in M) around the true value
. Similarly, if a pollster asks each voter three times what they voted, then the questioner would have a very good idea of the true vote of each voter; to protect the voter’s privacy, the voter would then have to trust that the pollster either does not pass on any information that states that the three votes came from the same person or that the pollster doesn’t attempt themselves to figure out what the voter voted for.
Similarly, in a medical setting in which a patient privately owns a datapoint, releasing M synthetic versions (corruptions) of that datapoint can compromise privacy if which synthetic datapoints belong to each person is also known. To guarantee that privacy is retained would require patients to trust people with their data, namely that any data aggregation process will remove their patient ID. However, this is something out of the control of the patient and as such we do not consider generating multiple synthetic datapoints (see for example Bindschaedler et al. (2017)) a ‘safe’ mechanism.
For these reasons, we wish to make a process in which an individual only reveals a single corrupted datapoint; from that point onwards in the machine learning training process, no other trust in that process is required. To motivate our general approach to private machine learning we discuss the voting example in more detail in section(3). Connections to other forms of privacy preserving machine learning are discussed in section(7). The justification for our approach hinges on the properties of the Spread Divergence, which we review in the following section.
Throughout we use the notation p(X = x) for a random variable X in state x. However, to reduce notational overhead, where unambiguous, we write simply p(x).
A divergence D(p||q) (see, for example Dragomir (2005)) is a measure of the difference between two distributions p and q with the property
DAn important class is the f-divergence, defined as
where f(x) is a convex function with f(1) = 0. A special case of an f-divergence is the well-known Kullback-Leibler divergence KLwhich is widely used to train models using maximum likelihood. For the Spread Divergence, from q(x) and p(x) we define new distributions
and
that have the same support. Using the notation
to denote integration
for continuous
for discrete x with domain X, we define a random variable
with the same domain as x and distributions
where ‘spreads’ the mass of p and q such that
and
have the same support. For example, if we use a Gaussian
both have support R. The spread divergence has a requirement on the noise
, namely that D
; that is, if the divergence of the spreaded distributions is zero, then the original non-spreaded distribution will match. As shown in Zhang et al. (2018) this is guaranteed for certain ‘spread noise’ distributions. In particular, for continuous x and
of the same dimension and injective function f, a sufficient condition for a valid spread noise
is that the kernel K(x) has strictly positive Fourier Transform. For discrete variables, a sufficient condition is that
and the matrix P is square and invertible.
Spread divergences have a natural connection to privacy preservation and Randomised Response (Warner, 1965). The spread divergence suggests a general strategy to perform private machine learning. We first express the machine learning problem as for a specified model
. Then, given only noisy data
, we fit the model by
. To explain in more detail how this works, we first describe randomised response in a classical voting context and then justify how to generalise this to principled training of machine learning models based on corrupted data.
There are two candidates in an election, candidate “one" and candidate “zero" and Alice would like to know the fraction of voters that voted for candidate “one". We write the dataset of voting as a collection of binary values
3.1 LEARNING θ USING CLEAN DATA
If we assume that Alice has full knowledge of which candidate each voter voted for, then clearly Alice may simply count the fraction of people that voted for “one” and set
It will be useful to first consider how to arrive at the same result from a modelling perspective. We can consider an independent Bernoulli model
where
so that
We also construct an empirical data distribution that places mass only on the observed joint state, namely
where is the Kronecker delta function. Then
where
and minimising KL(or maximising
) with respect to
recovers the fraction of votes that are 1, equation(7). This shows how we can frame estimating the quantity
from uncorrupted private data as a divergence minimisation problem.
3.2 LEARNING θ USING CORRUPTED DATA
Returning to the privacy setting, Bob would also like to know the fraction of votes that are 1. However, Alice does not want to send to Bob the raw data since the votes of any individual should not be explicitly revealed. To preserve privacy, Alice sends noisy data
to Bob. In this case we draw a single joint sample
from the distribution
where the ‘spread noise’ model is . Hence, if
Alice draws a sample
with probability
with probability
Given a sampled noisy dataset we form an empirical spreaded data distribution
Similarly, the corrupted joint model is given by
where
On receiving the noisy dataset , Bob can try to estimate
by minimising
with respect to . Equivalently, he may maximise the scaled spread log likelihood
For this simple model, Bob can easily explicitly calculate
Similarly, . In this case, equation(20) becomes
where
Using , the maximum of the spread log likelihood is at
which forms Bob’s estimate of the underlying fraction of voters that voted for candidate “one”.
For example, if there were no noise , Bob would estimate
, simply recovering the fraction of votes that are 1 in the original data. In the limit of a large number of votes
and true probability
of a voter voting for candidate “one”, then
tends to
and Bob’s estimate recovers the true underlying voting probability
. Hence, even though Bob only receives a corrupted set of votes, in the limit of a large number of votes, he can nevertheless estimate the true fraction of people that voted for candidate “one”.
The above example suggests a general strategy to perform private machine learning:
1. Phrase problem as likelihood maximisation: We first assume that a machine learning task for private data can be expressed as learning a data model
by optimising an objective
where is a defined spread noise distribution and known by both the owner of the private data and the receiver of the corrupted data. To do this, we go through each element of the dataset
and replace it with a corruption
sampled from
3. Send data to learner: We then send to the learner the corrupted dataset , the model to be learned
and the corruption probability
4. Estimate from corrupted data: Having received the corrupted data
, the learner fits
by maximising the objective
4.1 JUSTIFICATION
If we assume that each element of the training data
is identically and independently sampled from a model
, then each corrupted observation
is a sample from the same distribution given by
By the law of large numbers the objective equation(27) approaches its average over the data generating mechanism
and maximising the spread likelihood objective becomes equivalent to minimising
Provided that the spread noise is valid (see section(2)), then
for an identifiable model
is a consistent estimator.
This means that (in the large data limit and assuming the training data is generated from the model), even though we only train on corrupted data, we are optimising an objective which has a global minimum close to that of the objective on uncorrupted data
. Indeed, the estimator is consistent in the sense that as the amount of training data increases, we will recover the true clean data generating mechanism. Hence, provided that the corruption process is based on spread noise, then we can still learn the model parameters
even by training on only corrupted data. In our motivating voting scenario in section(3), we saw explicitly that the estimate
of the true underlying voting fraction is consistent and indeed, this is a general property of our approach.
Figure 1: Training based on the reconstruction approach, section(4.3) for the model with binary variable . In each case we plot along the x-axis the true
from 0 to 1 and on the y-axis the value of
that maximises
. In each plot we use a different flip probability. For a consistent estimator we would require that each plot is a straight x = y line, which only occurs in the case of no noise,
4.2 TRAINING ON NOISE ONLY
A common approach in private machine learning is to form synthetic (noisy, corrupted) data and then simply train the standard model on this noisy data — see for example Li et al. (2019). In our notation, this would be equivalent to maximising the likelihood
As above, assuming that the training data is generated from an underlying model the law of large numbers,
In general, the optimum of this objective does not occur when and therefore training on noisy data alone does not form a consistent estimator of the true underlying model.
We discuss learning with noisy labels more extensively in the context of logistic regression in section(B) in which we show that provided the label flip noise is not too high for zero mean isotropically Gaussian distributed inputs, maximum likelihood training with corrupted class labels does form a consistent estimator. Hence, whilst one cannot guarantee that maximum likelihood training of logistic regression on noisy data will result in a consistent estimator, there are special situations in which this may work.
4.3 RECONSTRUCTION APPROACH
A seemingly natural alternative to our method is to attempt to reconstruct the clean datapoint from the noisy datapoint and use that within a standard learning framework. This approach would give an objective
Here we need to define a posterior distribution to reconstruct the clean datapoint. Since the learner only has knowledge of the prior
it is natural to set
By the law of large numbers converges to its expectation with respect to the true data generating mechanism
In general, the optimum of is not at
. To demonstrate this, we plot in figure(1) the optimal
for a simple Bernoulli model for which we can calculate
exactly. As we see, for all but zero flip noise,
, the estimator does not correctly identify the underlying probabilty generating mechanism. For this reason, we do not pursue this approach further.
4.4 OTHER DIVERGENCES
An extension of the above is to learn by minimising other f-divergences
However, this generalisation to any f-divergence is harder to justify since the expectation of this objective (by averaging over the noise realisations)
will not in general give a divergence between spreaded distributions. This means that in the limit of a large number of datapoints, it is not guaranteed to recover the true data generating process, except for special choices of the f-divergence, such as the KL divergence. We leave a discussion of this for future work.
As an application of the above framework to a standard machine learning model, we now discuss how to form a private version of logistic regression.
Returning to our motivating example, users may have labelled face images as “happy" or “sad" on their mobile phones and the company MugTome wishes to train a “happy/sad" face classifier; however, users do not wish to send the raw face images to MugTome and also wish to be able to plausibly deny which label they gave any training image.
In this case we have a set of training data and corresponding binary class labels
. We wish to fit a logistic regression model
where is the logistic function. We follow the general approach outlined in section(4).
1. The model: For observation (x, c) and parameter where
is the standard logistic regression model above and
is a model of the input x. The training objective is
We note that this is a separable objective for , in which the logistic regression parameters
are conditionally independent (conditioned on the training data) of the input parameters
3. Send to learner corrupted data and model: The corrupted labels and inputs are sent to the learner along with the model
and corruption process
4. Learn the model parameters : The spread log likelihood is
Unfortunately, in all but special cases, the integral (for continuous x) or sum (for discrete x) required to evaluate is not tractable and numerical approximation is required. For this stage, there are many options available and we present below the approach taken in the experiments.
Interestingly, we note that, unlike training on clean data, the objective is not separable into a function of
plus a function of
, meaning that learning the class prediction parameter
is coupled with learning the input distribution parameter
5.1 IMPLEMENTATION
In general, the spread noise defines a distribution on a pair of spread variables and the full joint distribution, including the original model is
For continuous x, the spread likelihood is then obtained from
In general, this sum/integral over x is intractable due to the high-dimensionality of x. We use a standard approach to lower bound the log likelihood (for a single datapoint) by
where q is a distribution chosen to make the bound tight, see for example Barber (2012). This allows us to use an EM-style procedure in which we iterate between the two steps : (M-step) fix q and optimise and (E-step) fix
and update q.
An advantage of this approach is that is separable and we can update the class prediction parameter
independently of the input distribution parameter
In practice we will typically only do a partial optimisation (gradient ascent step) over to guarantee an increase in the energy.
2. Iteration k E-step: The bound is tightest when q is set to the posterior (see for example Barber (2012)),
for some function f(x, c). Assuming that the posterior will be reasonably peaked around the noisy data we use sampling with an importance distribution
Choosing
for normalising functions we then run a standard importance sampling approximation (see section(A)). For a given noisy datapoint
we generate a set of
and samples
and compute the importance weights
The energy equation(52) separates into two independent terms (see section(A))
and
Equation(60) is a weighted version of the standard logistic regression log likelihood, in equation(44); similarly equation(61) is a weighted version of
. The advantage therefore is that, given the importance samples, the learning procedure for
requires only a minor modification of the standard maximum likelihood training procedure on clean data.
The full procedure is that we randomly initialise and then, for each datapoint n, draw samples and accumulate the gradient across samples and datapoints. After doing a gradient ascent step in
, we update the importance distributions and repeat until convergence.
The Importance Sampling approximation is a convenient approach, motivated by the assumption that corrupted datapoints will be close to their uncorrupted counterparts. Whilst we used a bound as part of the approximation, this is not equivalent to using a parametric q distribution; by sampling we form a consistent estimator of the tightest possible lower bound. In other words, we are simply using Importance Sampling to estimate the expectations required within a standard Expectation Maximisation algorithm, see for example Barber (2012). We also tried learning a parametric q, similar to standard variational approaches to log likelihood maximisation, but didn’t find any improvement on the Importance Sampling approach.
5.2 LEARNING THE PRIOR
If we have access to clean data, the optimal input model can be learned from maximising the likelihood
. However, our general assumption is that we will never have access to clean data. There may be situations in which the learner has a good model of
, without compromising privacy (for example a publicly available dataset for a similar prediction problem might be available) in which case it makes sense to set the prior to this known model.
In the absence of a suitable prior we can attempt to learn from the corrupted data by maximising
. For simplicity we assume a factorised model and for a D-dimensional input vector x =
for a collection of learnable univariate distributions p(x[d]|d), d = 1, . . . , D. Under this assumption, and using the Importance Sampling approach in equation(61), this means that p(x[d]|d) can be learned by maximising
Since this is a separable objective, we can learn each independently.
For simplicity, we assume a discrete distribution for x[d] that contains K states (or bins). Then
where is 1 if the sample
is in the
state and 0 otherwise. Optimising with respect to p(k|d) we obtain
For the M-step of the algorithm we then make a gradient update for and update the prior using equation(65).
We implemented our approach in section(5) to train a logistic regression classifier to distinguish between the MNIST digits 7 and 9 based on noisy data (250 train and 900 test examples from each class). We chose to train on a small dataset since this constitutes the most challenging scenario and helps highlight potential differences between rival approaches. The MNIST images x have pixesl with 256 states and we used a discrete distribution to model x.
For our experiments we assume a corruption model that flips the label
probability
with probability
. We also assume here for simplicity assume a factorised input corruption model
in which with probability
and uniformly from the other states of that pixel with probability
In this case, computing the Importance Sampling distribution is straightforward since the posterior is factorised over the image pixels. We considered three settings for the prior (required to compute the Importance Sampling distribution) : (i) flat prior, (ii) learned prior using EM, (iii) true factorised prior based on computing the marginal distribution of each pixel on the training dataset. In the ‘true prior’ case we assumed that we know the true marginal distribution of each pixel p(x[d]|d) – in general, this information would be private, but it is interesting to consider how much improvement is obtained by knowing this quantity.
We compare the following approaches:
Log Reg Clean Data We trained logistic regression on clean data. This sets an upper limit on the expected performance.
Log Reg on Noisy Data We trained a standard logistic regression model but using the corrupted data. This forms a simple baseline comparison.
Spread Log Reg with Learned Prior We used our Spread Likelihood approach to learn the prior.
Figure 2: (a) An example MNIST “7” alongside its noisy examples (b) , (c)
, (d)
) which is sent to Mugshot.com noise. Each pixel remains in state
is otherwise sampled uniformly from the available states of that pixel. The bottom row shows an example of a clean “9” (f) and corruptions.
Spread Log with ‘True Prior’ In general our assumption is that the true prior will not be known (since this requires users to release their private data to the prior learner). However, this forms an interesting comparison and expected upper bound on the performance of the spread approach.
Spread Log with Flat Prior In this case we used an informative, flat prior on all pixel states.
We ran 10 experiments for each level of flip noise from 0.1, 0.2, 0.3, 0.4 and then tested the prediction accuracy of the learned logistic classifiers on clean hold out data, see figure(3).
For all but small noise levels, the results show the superiority of the spread learning approach over simply training on noisy data, consistent with our theory that training the standard model on noisy data does not in general form a consistent estimator. The best performing spread approach is that which uses the true prior – however, in general this true prior will not be available. For this experiment, there appears to be little difference between using a flat prior and a learned prior.
The performance of standard logistic regression training but using corrupted data is surprisingly effective, at least at low noise levels. However the performance degrades quickly for higher noise levels. A partial explanation for why logistic regression may give good results simply trained on noisy data is given in section(B).
6.1 GAUSSIAN INPUT PRIOR p(x)
We also demonstrate here training logistic regression treating the pixels as continuous. If we an independent have (per pixel) a Gaussian prior
and independent Gaussian spread noise
then using the Importance Sampling posterior is
where
We also used Gaussian spread noise to corrupt the images and train a binary classifier to distinguish between the MNIST digits 7 and 9 based on noisy data (4500 train and 900 test examples from each class). For simplicity, we assumed factorised distributions with prior ,
Figure 3: The test accuracy (on clean data) of the trained logistic regression models, averaged over 10 different randomly chosen training datasets of 500 datapoints. The x-axis is the corruption probability . “log reg”: Standard logistic regression training on clean data; “SD true prior”: spread divergence training approach with true prior; “SD learn prior”: spread approach with learned prior; “SD flat prior”: spread approach with flat prior; “log reg noise”: standard logistic regression training but trained on noisy data.
. We chose spread flip noise
for the class labels and uniform spread noise with variance
; the prior p(x) was set to be quite uninformative with mean
and variance
. This level of noise means that approximately 20% of the class labels are incorrect in the data passed to MugTome and the associated image is significantly blurred, see figure(4). For standard logistic regression we found that for a learning rate of 0.2, 400 iterations gave the best performance, with 95.5% train accuracy and 95.7% test accuracy. Using our Importance Sampling scheme with S = 2 samples per noisy datapoint, the trained
when tested on clean images had 94.4% test accuracy. This shows that despite the high level of class label and image noise, MugTome are able to learn an effective classifier, preserving the privacy of the users. The loss in test and training accuracy, despite this high noise level is around a modest 1%. When using higher spread noise with variance
, the
learned on the noisy data had a clean data test accuracy 93%, which is also a modest decrease in accuracy for a significant increase in privacy.
For future work it would be interesting to consider other forms of noise, for example downsampling images. However, downsampling does not form an injective mapping and as such we cannot guarantee that we can find a consistent estimator for the underlying model.
There are many forms of private machine learning. Some attempt to transform a datapoint x to a form such that a protected attribute a (such as gender) cannot be recovered from
, yet the prediction
Figure 4: (a) An example MNIST “7” alongside its noisy example which is sent to MugTome with Gaussian noise variance (b) ; (d,e,f) similarly for an MNIST “9”.
of an output y (for example using ) is retained. For example this could be achieved by using a loss function such as (see for example Li et al. (2019))
where n is the data index, is a function that takes input
and outputs a prediction
is a function that takes input
and outputs an attribute prediction a and
gives a representation of the input;
are loss functions. In this protected attribute setting, typically some form of the clean dataset is required to learn the parameters
Another common form of private machine learning is based on differential privacy (Dwork & Roth, 2014), with the aim to make it difficult to discern whether a datapoint was used to train the predictor
. That is, given a trained model, differential privacy attempts to restrict the ability to differentiate whether any individual’s datum was used to train the model.
A closely related concept to randomised response is that of plausible deniability, namely privately corrupting a datapoint such that no-one (except the datapoint provider) can confidently state what the original (private) value of
is. Recently Bindschaedler et al. (2017) used this to create synthetic datapoints, which were subsequently used with a standard machine learning training approach. The authors showed that generating synthetic data
from a distribution
that takes dependency amongst the elements of the vector x results in better machine learning predictors than sampling from a factorised distribution. In synthetic data generation approaches the assumption is that the statistical characteristics are similar to the real data. However, care is required since if the generating mechanism is powerful, it may generate data which is very similar to the private data.
In general these synthetic data generating approaches do not take into consideration when learning the parameters of the machine learning model what that synthetic data generation mechanism is. This is analogous to simply using the corrupted votes to directly estimate the fraction of voters that voted for a candidate, equation(1), rather than using knowledge of the data generation approach, equation(2).
We discussed a general privacy preserving mechanism based on random response in a datapoint is replaced by a corrupted versions. We showed that, provided the corruption process is a valid spread noise, then a maximum likelihood approach forms a consistent estimator. That is, even though the model is only trained on corrupted, synthetic data, it is possible to recover the true underlying data genering mechnanism on the clean data. We applied this approach to a simple logistic regression model, showing that the approach can work well, even with high levels of noise. The approach is readily applicable to a large class of much more complex models and other divergences.
I would like to thank Xijie Hou for useful discussions.
D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, New York, NY, USA, 2012. ISBN 0521518148, 9780521518147.
Vincent Bindschaedler, Reza Shokri, and Carl A. Gunter. Plausible deniability for privacy-
S. S. Dragomir. Some general divergence measures for probability distributions. Acta Mathematica Hungarica, 109(4):331–345, Nov 2005. ISSN 1588-2632. doi: 10.1007/s10474-005-0251-6.
C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407, 2014. ISSN 1551-305X. doi: 10.1561/0400000042.
Ang Li, Jiayi Guo, Huanrui Yang, and Yiran Chen. Deepobfuscator: Adversarial training framework for privacy-preserving image classification, 2019.
S. L. Warner. Randomised response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
Mingtian Zhang, Peter Hayes, Tom Bird, Raza Habib, and David Barber. Spread divergences, 2018.
A PRIVACY PRESERVING LOGISTIC REGRESSION
The posterior is given by p
For the learning, we need to take expectations
We use importance sampling to approximate this expectation �
Choosing
we have
Here
Putting this together and using the same samples to estimate the numerator and denominator expecta-
tions,
for importance weight w
For the logistic regression case, we have
where
The variational lower bound then becomes
where, for a given noisy datapoint we generate a set of S samples
from
and samples
B TRAINING ON NOISY DATA
A common approach in private machine learning is to train the standard model based on noisy alone,
corresponding to maximising , equation(35). As we discussed, this does not in general give a
consistent estimator of the true underlying model. To show this, we consider a logistic regression
in which only the class labels c are corrupted with probabilities and in the
same state with , leaving the inputs x uncorrupted. In this case
If we assume that the true labels are drawn from an underlying model
then the probability of a corrupted label is given by p
and by the law of large numbers,
Taking the gradient wrt , we obtain
A sufficient condition for the gradient to be zero is
That is
which is p
In general, equation(94) does not have a solution at
To understand when the objective equation(90) has an optimum, we assume the data is drawn from
, then, defining z
Since x is Gaussian distributed, z is also Gaussian distributed with
We can then write the large data limit log likelihood as a two dimensional expectation
For simplicity, consider . It is straightforward
to show that in this case the gradient with respect to is zero when
, namely when
.
However, in general, for non-isotropic data covariance , the gradient is non-zero at
To derive the above result, we note that the covariance for z in this case is simply
We now use the decomposition , with Cholesky factor
Then drawing a sample from z is equivalent to
we can then write the expected log likelihood as a function of
where the functions are defined as
Differentiating with respect to
, we obtain
When we note that
is independent of
and that the above is therefore is zero.
Hence has zero gradient at
It is straightforward to show that the second derivative of (evaluated at
The second term in equation(107) above is clearly negative. Using integration by parts (and noting
that we may assume s > 0), one may easily show that the first term is also negative provided that
Hence we arrive at the (perhaps surprising) result that for zero mean isotropic Gaussian distributed
input data, training on noisy data in which the class labels have been flipped with some
probability, results in a consistent estimator for , provided the flip noise is not too high, namely
, or equivalently,
. This result holds even in the case of asymmetric
flip noise
More generally, even if the data p(x) is not Gaussian distributed, from the Central Limit Theorem,
p(z) is likely to be close to Gaussian distributed for high dimensional inputs. Hence, for input data x
that is roughly isotropically distributed, we can expect that training using maximum likelihood for
any classifier of the form will likely be close to recovering the true
that
generated the data (in the limit of a large number of datapoints).
The above analysis considered only noise on the class label. If, independently of the class label we
add isotropic Gaussian noise to the observations, then the projection z will still be isotropic Gaussian
distributed for Gaussian inputs p(x) and the above argument trivially extends to this case as well.
Hence, one can expect training (using standard logistic regression but with corrupted inputs and
flipped labels) to be partially successful at recovering the true data generating process provided that
the input data is close to isotropically distributed, motivating a whitening pre-processing step of the
input data.