Passwords and long account and credit-card numbers are the dominant security measures, not because they are the most secure, but because they are the most conveniently implemented. Some data demands the highest levels of security and privacy protections, while for other data processing efficiency and sheer convenience are paramount — some security is better than none (which tends to be the alternative). The present paper proposes privacy-preserving, secure multiparty computations performed solely in the IEEE standard double-precision arithmetic that dominates most platforms for numerical computations. The scheme amounts to lossy, leaky cryptography, with the loss of accuracy and leakage of information carefully controlled via mathematical analysis and rigorous proofs. Information loss balances against roundoff error, providing perfect privacy at a specified finite precision of computations.
Perfect privacy at a given precision is when the information leakage is less than the specified precision (precision being limited due to roundoff error). The present paper provides perfect privacy at a precision of about 10in the IEEE standard double-precision arithmetic of [11]; observing the encrypted outputs leaks no more than a millionth of a bit per input real number, whereas roundoff alters the results by around one part in a hundred thousand. In a megapixel image, the encrypted image would leak at most a single bit — enough information to discern whether the original image is dim or bright, perhaps, but no more. Performing all computations in floating-point arithmetic facilitates implementations on existing hardware, including both commodity centralprocessing units (CPUs) and graphics-processing units (GPUs), whereas alternative methods based on integer modular arithmetic could require difficult specialized optimizations to attain performance on par with the scheme proposed in this paper (even then, schemes based on integer modular arithmetic would have to contend with tricky issues of discretization and precision in order to handle the real numbers required for machine learning and statistics).
The algorithms and analysis consider the traditional honest-but-curious model of threats: we assume that the multiple parties follow the agreed-upon protocols correctly but may try to glean information from data they observe; the secure multiparty computations prevent any of the parties from gleaning much information without all parties conspiring together to break the scheme.
Our analysis provides no guarantees about what information leaks when all parties collude to reveal encrypted results. If all parties conspire to collect together all their shares or send them to a collecting agency, then the unified collection will reveal the secrets of whatever results get collected. If the collected information results from training a machine-learned model, then revealing the trained model can compromise the confidentiality of the data used to train that model, unless the model is differentially private. Ensuring privacy even after revealing the results of secure multiparty computations is complementary to securing the intermediate computations. The present paper only guarantees the privacy of the intermediate computations, providing no guarantees about what leaks when all parties collude to reveal the final results of their secure multiparty computations.
This paper has the following structure: Section 2 introduces secure multiparty computations in floating-point arithmetic, reviewing classical methods such as additive sharing and Beaver multiplication. Section 3 upper-bounds the amount of information that can leak, referring to Appendices A, B, and C for full, rigorous proofs. Section 4 reviews techniques for efficient, highly accurate polynomial approximations to many real functions of interest (notably those in Table 3). Section 5 validates an implementation on synthetic examples and illustrates its performance on real measured data, too; the examples apply various generalized linear models, including ordinary linear least-squares regression, binary and multinomial logistic regression, probit regression, and Poisson regression. Appendices D, E, and F very briefly review Chebyshev series, minibatched stochastic gradient descent, and generalized linear models, respectively — readers may wish to refer to those appendices as concise refreshers.
Throughout, all numbers and random variables are real-valued, even when not stated explicitly.
Secure multiparty computations allow holders of sensitive data to securely distribute so-called shares of their data to multiple parties such that the multiple parties can process the data without revealing the data and can only reconstruct the data by colluding to put back together (that is, to sum) all the shares. We briefly review an arithmetic scheme in the present section. The arithmetic scheme supports addition and multiplication, as discussed in the present section, as well as functions that polynomials can approximate accurately, as discussed in Section 4 below. To be concrete and simplify the presentation, we focus first on secure two-party computations, in Subsection 2.1, then sketch an extension to several parties in Subsection 2.2.
2.1 Two-party computations
We can hide a matrix X by masking with a random matrix Y , so that one party holds the other party holds Y . Given other data, say U, and another random matrix V hiding U, so that one party holds
and the other party holds V , the parties can then independently form the sums (
required to reconstruct
if all these matrices have the same dimensions. This is known as “additive sharing,” as described, for example, by [2]. Additive sharing thus supports privacy-preserving addition of U and X.
Table 1: Ledgers for two parties in a Beaver multiplication of U and X
As introduced by [1], used by [2], and reviewed in Table 1, privacy-preserving multiplication of U and X is also possible whenever U and X are matrices such that their product UX is well-defined. Summing across the two parties in line 10 of Table 1 would yield , as desired. Notice that
in lines 8 and 9 of Table 1 are still masked by the random matrices P and R whose values are unknown to the two parties. The all-reduce in lines 8 and 9 requires the parties to conspire and distribute to each other the results of summing their respective shares of lines 6 and 7, but does not require the parties to get shares of any secret distributed data — that was necessary only for the original data U and X being multiplied (and this “original” data can be the result of prior secure multiparty computations). Also, all values on lines 3–5 are independent of U and X, so can be precomputed and stored on disk. As with addition, multiplication via the Beaver scheme never requires securely distributing secret shares, so long as the input data and so-called Beaver triples (P, R, PR) from Table 1 have already been securely distributed into shares. The secure distribution of shares is completely separate from the processing of the resulting distributed data set. Communication among the parties during processing is solely via the all-reduce in lines 8 and 9.
Table 1’s scheme also works with matrix multiplication replaced by convolution of sequences.
Table 1 simplifies to Table 2 for the case when U and X are scalars such that U = X. Summing across the two parties in line 6 of Table 2 yields , as desired. Notice that this recovers
from a sum involving several terms as large as
and
(as well as
cancels when summing across the two parties in line 6), then we obtain
to precision upper-bounded by 6
denotes the machine precision (
is approximately 2
in the IEEE standard double-precision arithmetic of [11]). Theorem 3 proves that the information leakage may be as large as 6
are distributed uniformly over [
is distributed uniformly over [
] (similarly, P, Q, R, S, V , and Y in Table 1 should be distributed uniformly over [
should be distributed uniformly over [
Balancing the roundoff bound with the information bound requires 6
(so
for IEEE standard double-precision arithmetic).
Table 2: Ledgers for two parties in a Beaver squaring of X
2.2 Several-party computations
Extending Subsection 2.1 beyond two parties is straightforward. The steps in the algorithms, summarized in the columns labeled “source” in Tables 1 and 2, stay as they were (the division by 2 in the last line of Table 1 and in the last line of Table 2 becomes division by the number of parties). Distributing additive shares of data across several parties works as follows: for each piece of data to be distributed (in machine learning, a “piece” may naturally be a sample or example from the collection of all samples or examples), we generate n independent and identically distributed random matrices is the number of parties (the number of parties need not relate to the total number of pieces, samples, or examples of data being distributed). We randomly permute the parties and then distribute to them (in that random order)
is the piece of data being shared. We generate different independent random variables and random permutations for different pieces of data. The distribution of the difference between independent random matrices drawn from the same distribution is the same for each party, making this an especially simple generalization to the case of several parties. Distributing additive shares to several parties leaks somewhat more information than limiting to only two parties; the present paper focuses on the case of two parties for simplicity.
This section bounds the amount of information-theoretic entropy that can leak when adding noise for masking, drawing heavily on canonical concepts from information theory, as detailed, for example, by [5].
We denote by X the scalar random variable that we want to hide, and by Y an independent variate that we add to X to effect the hiding. To simplify the analysis, we assume that the distribution of Y arises from a probability density function. Then, revealing X + Y leaks the following number of bits of information about X:
In the left-hand side of (1), H denotes the Shannon entropy measured in bits if the distribution of X is discrete, and the differential entropy measured in bits (rather than nats) if the distribution of X is continuous. In the right-hand sides of (1), H denotes the differential entropy measured in bits (not nats); I denotes the mutual information. Given a prior on X, Bayes’ Rule yields the full posterior distribution for X given X + Y ; the information gain (or loss or leakage) defined in (1) is a summary statistic characterizing the divergence of the posterior from the prior.
Recall that mutual information is the fundamental limit on how much information can be gleaned from observing the outputs of a noisy channel; in our setting, we purposefully add noise in order to reveal only the results of communications via a (very) noisy channel, purposefully obscuring with noise the signal containing data being kept confidential and secure.
The information leakage is at most is distributed uniformly over [
], as stated in the following theorem and proven in Appendix A:
Theorem 1. Suppose that X and Y are independent scalar random variables and positive real numbers such that
is distributed uniformly over [
. Then, the information leaked about X from observing X + Y satisfies
where I denotes the mutual information between X and X + Y , measured in bits (not nats); the mutual information satisfies (1), which expresses I as a change in entropy. The inequality in (2) is an equality when times a Rademacher variate.
The following theorem, proven in Appendix B, states that the information leakage from hiding data multiple times is at most the sum of the information leaking from each individual hiding.
Theorem 2. Suppose that X, Y , and Z are independent scalar random variables. Then,
where I denotes the mutual information.
The procedures of Tables 1 and 2 can also leak information, but not much — consider Table 2: Party 2 observes nothing about the input data X other than , and Theorem 1 bounds how much information that reveals about X. Party 1 observes
2. The following theorem, proven in Appendix C, bounds how much information about X these observations reveal.
Theorem 3. Suppose that X, Y , P, Q, and T are independent scalar random variables and a positive real number such that
, the random variable T is distributed uniformly over [
are distributed uniformly over [
where I denotes the mutual information measured in bits, and its arguments (aside from X) are the observations in Table 2 under the column for “party 1.”
The following theorem states the classical data-processing inequality:
Theorem 4. Suppose that random matrices X and Z are conditionally independent given a random matrix Y . Then,
where I denotes the mutual information.
Table 3: Subsections providing polynomial approximations to various functions
The following is a corollary of Theorem 4:
Corollary 5. Suppose that X and Y are random matrices and f is a deterministic function. Then,
where I denotes the mutual information.
Combining (1) and (7) shows that even very complicated manipulations such as the iterations in Section 4 below cause no further information leakage, despite changing the added noise in some highly nonlinear fashion: (7) guarantees that no information leaks beyond the individual maskings obeying (2)–(4), so long as the manipulations are deterministic algorithms (or randomized algorithms with randomization independent of the data being masked).
Polynomials can approximate many functions of interest in machine learning, allowing the accurate approximation of those functions using only additions and multiplications. Section 2 above discusses schemes that multiple parties can use to perform additions and multiplications securely. The present section describes polynomial approximations useful in tandem with the schemes of Section 2. Subsection 4.1 leverages the method of Newton and Raphson. Subsection 4.2 uses Chebyshev series. Subsection 4.3 utilizes Pad´e approximation, in the method of scaling and squaring. The method of Newton and Raphson tends to be the most efficient, while Chebyshev series apply to a much broader class of functions. The method of scaling and squaring is for exponentiation. Table 3 lists the functions that each subsection treats.
4.1 Newton iterations
Various iterations derived from the Newton method for finding zeros of functions allow the computation of functions such as sgn(using only additions and multiplications (not requiring any divisions or square roots); in this subsection, x denotes a real number.
According to [12], the Newton-Schulz iterations for computing sgn(x) are
with (and the desired loss of accuracy relative to the machine precision is less than a factor of
According to [12], the Schulz (or Newton) iterations for computing 1/x when x > 0 are
with (and then adjusting the resulting reciprocal) is important to align with the domain of convergence and high accuracy illustrated in Figure 1; and similar observations pertain to the rest of the iterations of the present subsection. In Subsection 4.3 below, x can range from 1 to the number of terms in the softmax (so requires scaling by the reciprocal of the number of terms in the softmax).
According to [9], the Newton iterations for computing 1
with
with
Figures 1–4 illustrate the errors obtained from (8)–(11); note that the scale of the vertical axes in Figures 1–3 involve 1e–16. In the figures, the tilde denotes the approximation computed via the Newton iterations (8)–(11); for example, approximates 1/x.
A common operation in the deep learning of [13] and others is the rectified linear unit
easily obtained from (8).
4.2 Chebyshev series
Chebyshev series provide efficient approximations to smooth functions using only additions and multiplications. The approximations are especially efficient for odd functions, such as
in this subsection, x denotes a real number. The function in (13) is the hyperbolic tangent. The function in (14) is a constant plus the standard logistic function, familiar from logistic regression. The function in (15) is a constant plus the cumulative distribution function for the standard normal distribution, familiar from probit regression. Performing logistic regression or probit regression by
Figure 4: Absolute error in computation of |x| = x sgn(x) with 60 iterations of (8); the figure superimposes a white curve over a black curve, where the white curve uses to start (8) while the black curve uses
, both with
maximizing the log-likelihood relies on the evaluation of (14) or (15), respectively, at least when using a gradient-based optimizer, the method of Newton and Raphson, or the method of scoring. Details about these functions and regressions are available, for example, in the monograph of [14].
Appendix D reviews algorithms for computing approximations via Chebyshev series of odd functions, with accuracy determined via two parameters, n and z, where the degree of the (odd) approximating polynomial is 2] is the interval over which the approximation is valid. Setting n = 50 and z = 10 yields 7-digit accuracy for the approximation of (13); setting n = 22 and z = 5 yields 4-digit accuracy for the approximation of (14); and setting n = 34 and z = 10 yields 5-digit accuracy for the approximation of (15). In Section 5 below, we err on the side of caution, defaulting to n = 60 and z = 20 for (13) and (14) and to n = 50 and z = 20 for (15), while also discussing the results from other choices.
4.3 Softmax
As reviewed, for example, by [13], a common operation in machine learning is the so-called “softmax” transforming n non-positive real numbers positive real numbers exp(
, . . . , exp(
). These are the probabilities at unit temperature in the Gibbs distribution associated with energies
. . . ,
is the partition function. Once we have computed the exponentials, summation yields Z directly; the secure multiparty computations of Section 2 support such summation. Division by Z is available via the iterations in (9) of Subsection 4.1.
Thus, given real numbers 0, and a real number
0
1, we would like to calculate exp(x) to precision
We use the method of scaling and squaring, as reviewed, for example, by [10]. If we let n be the least integer that is at least log
), then squaring exp(
) yields exp(
), squaring exp(
) yields exp(
and so on, so that n successive squarings will yield exp(x); further, 1+
approximates exp(
that is,
while
so n successive squarings of 1+yields exp(x) to relative accuracy
(or better). We use n = 20 successive squarings in all numerical experiments of Section 5 below.
Given a real number , less than 6
bits can leak from computing the approximation to exp(x) if we add to 1 +
a random variable distributed uniformly over [
double the width of the added noise upon each of the n squarings, in accord with Theorems 1, 2, and 3 of Section 3.
Clearly, we can enforce that a real number x be non-positive by applying ) from (12). Computing the softmax of real numbers that may not necessarily be non-positive is also possible, even without risk of leaking any information beyond the case for non-positive numbers. Indeed, we can replace each real number
), without altering the probability distribution produced by the softmax; we perform such replacement in our implementation of multinomial logistic regression. When implementing a softmax for multinomial logistic regression, we include a negative offset in the bias for the input to the softmax. That is, we adjust the bias to be a constant amount less, constant over all classes in the classification. Subtracting such a positive constant C tends to make
negative even before replacing each real number
). Subtracting a constant C reduces the sum
); without subtracting the constant, the sum
) can adversely impact accuracy if the sum becomes too large. Subtracting the constant C reduces accuracy by a factor of up to exp(
148 — a tad more than two digits) for our numerical experiments reported in the following section, Section 5.
Via several numerical experiments, this section illustrates the performance of the scheme proposed above. All examples reported in the present section use two parties for the private computations. Subsection 2.2 outlines an extension to several parties. The terminology “in private” refers to computations fully encrypted via the scheme introduced above, while “in public” refers to computations in plaintext. Subsection 5.1 validates the scheme on examples for which the correct answer is known by construction. Subsection 5.2 applies the scheme to classical data sets.
All examples use minibatched stochastic gradient descent to maximize the log-likelihood under the corresponding generalized linear model, at the constant learning rates specified below (except where noted for probit regression). Appendix E briefly reviews stochastic gradient descent with minibatches and weight decay; Appendix F briefly reviews generalized linear models.
In all cases, we learn (that is, fit) not only the vector of weights to which the design matrix gets applied, but also a constant offset known as the “bias” in the literature on stochastic gradient descent. Thus, the linear function of the weight (fitted parameter) vector w in the generalized linear model is actually the affine transform Aw +c, where A is the design matrix and c is the bias vector whose entries are all the same constant offset, learned or fitted together with w during the iterations of stochastic gradient descent (whereas A stays fixed during the iterations). In multinomial logistic regression, the entries of the bias vector c are constant for each class (constant over all covariates and data samples), but the constant may be different for different classes.
In the subsections below, we view linear least-squares regression as a generalized linear model with the link function being the identity; equivalently, we take the parametric family defining the statistical model to be an affine transform of the vector of weights (parameters) plus independent and identically distributed normal random variables. The log-likelihood of a such a model summed over all samples in the design matrix A is simply a constant minus denotes the Euclidean norm, w is the vector of weights (parameters), Aw +c is the affine transform defined by the design matrix A and bias vector c, and b is the vector of targets. For simplicity, when we report the “negative log-likelihood” or “loss” in plots, averaged over the m samples in the design matrix A, we report
, ignoring the constant and factor of 2.
The implementation of encrypted computations builds on CrypTen of [8], which in turn builds on PyTorch. The implementation uses only IEEE standard double-precision arithmetic. All experiments ran on one of Facebook’s computer clusters for research, which enables rapid communication between the multiple parties. In actual deployments, communications between the multiple parties are likely to have high latency, dramatically impacting the speed of the multiparty computations. The speed of such communications would vary significantly between different applications and arrangements, likely requiring separate analyses and characterizations of computational efficiency for different deployments. Yet, while timings are fairly unique to the particular computational environment, the accuracies we report below should be fully representative for most applications.
Figure 5: Euclidean norm of the difference between the ideal normalized weight vector its computed approximation
as a function of the width
of the uniform noise on [
added to the shares of data (the lines for the logit and probit links overlap)
5.1 Validations on synthetic data
For the synthetic examples discussed in the following sub-subsections, we set m = 64 and n = 8 for the numbers of rows and columns in the design matrices being constructed. Figure 5 displays the discrepancy of the computed results from the ideal solution (the ideal is known a-priori to two-digit accuracy by construction in these synthetic examples) as a function of the maximum value of the random variable distributed uniformly over [
]. In accordance with the analysis in Subsection 2.1 above, the figure reports that
works well, yielding the roughly two-digit agreement that would be optimal for the synthetic data sets constructed in the following sub-subsections, so we set
for the remainder of the paper. The logistic and probit regressions reported below both rely on the Chebyshev approximations reviewed in Subsection 4.2 above, with the approximations being valid over the interval [
Subsection 4.2). Figure 6 indicates that the degree 40 approximations suffice for optimal accuracy, while even degrees 6–12 produce reasonably accurate results (accurate enough for most applications in machine learning for prediction, in which only residuals or matching targets matters). For the remainder of the paper, we use 120 terms in the Chebyshev approximation of the logistic function (or tanh), and 100 terms in the Chebyshev approximation of the inverse probit (or the cumulative distribution function of a standard normal variate). For both Figures 5 and 6, we trained for 10,000 iterations with 8 samples in each iteration’s minibatch, thus sweeping through a random permutation of the full synthetic data set 1,250 times (each sweep is known as an “epoch”). The following sub-subsections detail the construction of our synthetic data sets.
Figure 6: Euclidean norm of the difference between the ideal normalized weight vector and its computed approximation
as a function of the number of terms in the Chebyshev approximation of Subsection 4.2 (about half the terms vanish, since the polynomial is odd)
5.1.1 Linear least-squares regression (identity link)
We form the design matrix A and target vector b as follows. We orthonormalize the columns of an + 1) matrix whose entries are all independent and identically distributed (i.i.d.) standard normal variates to obtain an
whose columns are orthonormal (A is the leftmost block of n columns) and an
1 column vector v that is orthogonal to all columns of A and such that
is the remaining column). We define the ideal weights
vector whose entries are i.i.d. standard normal variates. We define
1 column vector
so that by construction
Needless to say, obtaining to its minimum, 10.
After 10,000 iterations (which is 1,250 epochs) with 8 samples per minibatch at the learning rate 3 , the residuals
obtained by training in public and by training in private are both equal to 10.0 to three significant figures. For training in private (training in public yields very similar results), the Euclidean norm of the difference between the ideal w and the computed weight vector x is 0.025 (the Euclidean norm of the difference between
is 0.004), and all entries of the vector
008 (which is
003 when divided by the Euclidean norm of x); that these values are so small certifies the correctness of the training.
5.1.2 Logistic regression (logit link)
We define the ideal weights w to be the result of normalizing an 1 column vector whose entries are i.i.d. standard normal variates, that is, we divide the vector whose entries are i.i.d. by its Euclidean norm, ensuring
The construction of the design matrix A and target vector t is more involved; readers interested only in the results and not the details of the construction may wish to skip to the last two paragraphs of the present sub-subsection.
For 1 column vector
whose entries are i.i.d. standard normal variates, project off the component of
and set
and
for k = 1, 2, . . . , n. For the remaining 20 rows of
20 rows from the result of orthonormalizing the columns of an
matrix whose entries are all i.i.d. standard normal variates.
We construct the 1 column vector
and define the target classes
for j = 1, 2, . . . , m, where “round” rounds to the nearest integer (0 or 1) and is the standard logistic function
Combining (25), (23), (24), (22), and (21) yields
and
for is slightly positive while
is slightly negative, the target classes
defined in (26) will be 1 and 0, respectively, even though the difference between the corresponding (2j)th and (2j+1)th rows of A is small — combining (23), (24), and (21) yields that the Euclidean norm of their difference is 0.04. The decision hyperplane separating class 0 from class 1 will thus have to pass between 10 pairs of points in n-dimensional space (n = 8), with the points in each pair very close to each other (albeit on opposite sides of the decision hyperplane). Classifying all these points correctly hence determines the hyperplane to reasonably high accuracy.
Needless to say, obtaining x = w and c = 0 produces perfect accuracy for the logistic regression which classifies by rounding the result of (27) applied to each entry of Ax+c, as then Ax = Aw = b, and the target classes in t are the result of (27) applied to each entry of b. In fact, obtaining x as any positive multiple of w together with c = 0 yields perfect accuracy — any positive multiple of a vector orthogonal to the hyperplane separating the two classes specifies that same hyperplane.
After 10,000 iterations (which is 1,250 epochs) with 8 samples per minibatch at the learning rate 3, training binary logistic regression in private (training in public produces very similar results) drives the Euclidean norm of the difference between and the ideal
to 0.006 and drives every entry of
to 0.006, where c is the vector of offsets (whose entries are all the same). That these values are so small validates the training. The log-likelihood, averaged over all
088 over the 10,000 iterations (needless to say, the log-likelihood cannot exceed 0). The accuracy becomes exactly perfect (that is, 1).
When training multinomial logistic regression for two classes on the same synthetic data set with the same settings, similar validation attains: for training in private (training in public yields very similar results), the Euclidean norm of the difference between the ideal computed
008, and every entry of
becomes either
where c contains the bias offsets. The log-likelihood, averaged over all m = 64 samples, changes from
047 over the 10,000 iterations, and the accuracy becomes perfect (that is, exactly 1). Of course, using multinomial logistic regression with a binomial distribution rather than standard logistic regression makes no sense, but these results certify the correctness of the multinomial logistic regression nonetheless.
5.1.3 Probit regression (probit link)
The design matrix A and target vector t for probit regression are the same as for logistic regression, but replacing the sigmoid defined in (27) with the inverse probit
which is also known as the cumulative distribution function for standard normal variates. An iteration of stochastic gradient descent involves selecting rows of the design matrix A and collecting them together into a matrix R, as well as collecting together the corresponding targets from t into a vector s (the number of rows is the size of the minibatch). One iteration of stochastic gradient descent for maximizing the log-likelihood in logistic regression updates the weight vector x by adding to x the learning rate times the transpose
applied to the difference between the corresponding target samples
defined in (27) applied to each entry of Rx + c; with a minor abuse of notation, we could write that x updates to
)). Since the sigmoid defined in (30) is numerically very similar to the sigmoid defined in (27) (after scaling such that the variances of the sigmoids are the same), we use the same updating formula for probit regression as for logistic regression, but using the design matrix A associated with probit regression rather than that for logistic regression, and using the sigmoid
associated with probit regression rather than that for logistic regression. A na¨ıve application of stochastic gradient descent for maximizing the log-likelihood in probit regression would update the weight vector in the same direction, but scaled slightly, effectively altering the learning rate a tiny bit from iteration to iteration; we omit the extra computations required to follow the na¨ıve method exactly.
As with logistic regression, obtaining x = w and c = 0 produces perfect accuracy for the probit regression which classifies by rounding the result of (30) applied to each entry of Ax + c. And, again, obtaining x as any positive multiple of w together with c = 0 yields perfect accuracy — any positive multiple of a vector orthogonal to the hyperplane separating the two classes specifies that same hyperplane. After 10,000 iterations (which is 1,250 epochs) with 8 samples per minibatch at the learning rate 3, training in private (training in public produces very similar results) drives the Euclidean norm of the difference between and the ideal
to 0.006 and drives every entry of
to 0.005, where c is the vector of offsets (whose entries are all the same). That these values are so small validates the training. The log-likelihood, averaged over all m = 64 samples, changes from
058 over the 10,000 iterations (needless to say, the log-likelihood cannot exceed 0). The accuracy becomes precisely perfect (that is, 1).
5.1.4 Poisson regression (log link)
We obtain the design matrix A by orthonormalizing the columns of an matrix whose entries are i.i.d. standard normal variates. We define w to be 10 times the result of normalizing an
column vector whose entries are i.i.d. standard normal variates, that is, we divide the vector whose entries are i.i.d. by its Euclidean norm, and then multiply by 10, ensuring
We construct the 1 column vector
where “3” indicates the 1 column vector whose entries are all 3. We then define the target counts
for j = 1, 2, . . . , m, where “round” rounds to the nearest integer (0, 1, 2, . . . ). Using 10 as the norm of w in (31) ensures that the integers vary over a significant range, while using 3 in the right-hand side of (32) ensures that, on average, half will be greater than exp(3)
Having targets that vary over a significant range and with many not too small ensures that the maximum-likelihood estimates of w and 3 in Poisson regression are close to w and 3 with reasonably high accuracy — the discretization from the rounding operation in (33) matters little when many counts are large and spread over a range significantly greater than the discretization spacing.
After 10,000 iterations (which is 1,250 epochs) with 8 samples per minibatch at the learning rate 3 , training in private (training in public produces very similar results) drives the Euclidean norm of the difference between
and the ideal
to 0.003 and drives every entry of
3 to 0.003, where c is the vector of offsets (whose entries are all the same). That these values are so small certifies the correctness of the training, as does the following result: the log-likelihood for the obtained weight vector x and offset c, averaged over all
349 after the 10,000 iterations, which matches the log-likelihood for the ideal weight vector w and ideal offset (3) to three-digit precision.
5.2 Performance on measured data
The following sub-subsections illustrate the application of the scheme proposed above to several benchmark data sets, namely, handwritten digits from MNIST, forest covers from covtype, and the numbers of deaths from horsekicks in corps of the Prussian army over two decades.
5.2.1 MNIST
We use both binary and multinomial logistic regression, as well as probit regression, to analyze a classic data set of handwritten digits, created by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges via merging two sets from the National Institute of Standards and Technology; the mixed NIST set is available at http://yann.lecun.com/exdb/mnist as a training set of 60,000 samples and a testing set of 10,000 examples. Each sample is a centered 28-pixel 28-pixel grayscale image of one of the digits 0–9, together with a label for which one; pixel values can range from 0 to 1. For multinomial logistic regression we use all 10 classes (with one class per digit); for binary logistic and probit regressions, we use the 2 classes corresponding to the digits 0 and 1, for which there are 12665 samples in the training set, and 2115 samples in the testing set. We set the number of samples in a minibatch to 50 for training with all 10 classes, and to 85 for training with only the 2 classes corresponding to the digits 0 and 1. We trained for 20 epochs (that is, 24,000 iterations) at learning rate 10
with all 10 classes, and for 30 epochs (that is, 4,470 iterations) at learning rate 10
with only the 2 classes corresponding to the digits 0 and 1. During training for all 10 classes, we supplemented each iteration of stochastic gradient descent with weight decay of 10
which is equivalent to adding to the objective function being minimized (that is, to the negative log-likelihood) a regularization term of 10
times half the square of the Euclidean norm of the weights. This weight decay has negligible impact on accuracy yet ensures that the constant 5 we subtract from the bias in the stochastic gradient descent is sufficient to make almost all inputs of the softmax be non-positive, as suggested in the last paragraph of Subsection 4.3 above.
Figure 7 plots the results of training and Figure 8 displays the performance of the resulting trained model when applied to the testing set, both as a function of the maximum value random variable distributed uniformly over [
]; the figures show that
works well, in agreement with the analysis in Subsection 2.1 above. Table 4 details the results for
in public produces the same results at the precision reported in the table. For this application to machine learning, even
produces practically perfect predictions during both training and testing. Logistic regression corresponds to the logit link; probit regression corresponds to the probit link. The value of the log-likelihood averaged over the testing set is remarkably similar to the average value over the training set, showing that training generalizes well to the independent testing set.
Figure 9 displays the results of training multinomial logistic regression for all 10 classes (one class per digit), along with applying the resulting trained model to the testing set, as a function of the maximum of the random variable distributed uniformly over [
]; the accuracy exceeds 0.9 from
. The generalization from the training set to the testing set is ideal. At
, the training loss is 0.318, the testing loss is 0.305, the training accuracy is 0.913, and the testing accuracy is 0.917; training in public yields the same results to three-digit precision.
5.2.2 Forest cover type
We use both binary and multinomial logistic regression, as well as probit regression, to analyze standard data on the type of forest cover based on cartographic variates from Jock A. Blackard of the United States Forest Service, Denis J. Dean of the University of Texas at Dallas, and Charles W. Anderson of Colorado State University (with copyright retained by Jock A. Blackard and Colorado State University); the original data is available from the archive of [7] at http:// archive.ics.uci.edu/ml/datasets/covertype and the preprocessed and formatted versions that we use are available from the work of [4] at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/ datasets/binary.html#covtype.binary (for binary classification) and http://www.csie.ntu.
Table 4: Values of the negative log-likelihood (the “loss”) and accuracy averaged over the training samples or testing samples from MNIST, with being the maximal possible value of the random variable distributed uniformly over [
] added to shares of the data
Figure 7: Value of the negative log-likelihood (the “loss”) averaged over the training data from MNIST, after convergence of the training iterations, as a function of the maximum value random variable distributed uniformly on [
] added to the shares of the data
Figure 8: Value of the negative log-likelihood (the “loss”) averaged over the testing data from MNIST, as a function of the maximum value of the random variable distributed uniformly on [
] added to the shares of the data
Figure 9: Accuracy of multinomial logistic regression averaged over the training or testing data from MNIST, as a function of the maximum value of the random variable distributed uniformly on [
] added to the shares of the data
Table 5: Values of the negative log-likelihood (the “loss”) and accuracy averaged over the training samples or testing samples from data on forest cover type, with being the maximal possible value of the random variable distributed uniformly over [
] added to shares of the data
edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#covtype (for the multinomial logistic regression).
We predict one of 7 types of forest (or one of 2 for the binarized data) based on 10 integer-valued covariates (elevation, aspect, slope, horizontal and vertical distances to bodies of water, horizontal distance to roadways, horizontal distance to fire points, and hillshade at 9am, 12pm, and 3pm), as well as one-hot encodings of 4 types of wilderness areas and 40 types of soil. Thus, there are 54 covariates in all, including the one-hot encodings. (A one-hot encoding is a vector whose entries are all 0 except for one 1 in the position corresponding to the associated type.) For normalization, we subtract the mean from each of the integer-valued covariates (not from the one-hot encodings) and then divide by the maximum absolute value. We randomly permute and then partition the data into a training set of 500,000 samples and a testing set of the other 81,012 samples. We trained for 20 epochs (that is, 10,000 iterations) with 1,000 samples per minibatch. The learning rate for the binary classification was 3 (for identity link was 0.1) and for the multi-class (7-class) classification was 1. During training for all 7 classes, we supplemented each iteration of stochastic gradient descent with weight decay of 10, to be consistent with our training for all 10 classes of MNIST in the previous sub-subsection. This weight decay barely impacts accuracy yet makes the constant 5 that we subtract from the bias in the stochastic gradient descent shift almost all inputs of the softmax to be non-positive, as suggested in the last paragraph of Subsection 4.3 above.
Figure 10 displays the results of training and Figure 11 depicts the performance of the resulting trained model when applied to the testing set, both as a function of the width of the random variable distributed uniformly over [
]; the figures show that
works well, in accordance with the analysis in Subsection 2.1 above. Table 5 details the results for
; training in public yields the same results to within
001 of those reported in the table. In fact, even
perfectly fine for this application to machine learning. Logistic regression corresponds to the logit link; probit regression corresponds to the probit link. The value of the log-likelihood averaged over the testing set is reassuringly close to the average value over the training set, demonstrating good generalization from the training set to the testing set.
Figure 12 displays the results of training multinomial logistic regression for all 7 classes, together with applying the resulting trained model to the testing set, as a function of the maximum the random variable distributed uniformly over [
]; the accuracy is excellent from
to
. The generalization from the training set to the testing set is perfect. At
the training loss is 0.705, the testing loss is 0.711, the training accuracy is 0.710, and the testing accuracy is 0.710; training in public produces the same results to within
Figure 10: Value of the negative log-likelihood (the “loss”) averaged over the training data on forest cover type, after convergence of the training iterations, as a function of the maximum value the random variable distributed uniformly on [
] added to the shares of the data
Figure 11: Value of the negative log-likelihood (the “loss”) averaged over the testing data on forest cover type, as a function of the maximum value of the random variable distributed uniformly on [
] added to the shares of the data
Figure 12: Accuracy of multinomial logistic regression averaged over the training or testing data on forest cover type, as a function of the maximum value of the random variable distributed uniformly on [
] added to the shares of the data
5.2.3 Deaths from horsekicks
We use Poisson regression to analyze the classical data from Ladislaus Bortkiewicz tabulating the numbers of deaths from horsekicks in 14 corps of the Prussian army for each of the 20 years from 1875 to 1894, available (with extensive discussion) in the work of [15]. This data is a canonical example of counts which follow the Poisson distribution. We consider four separate Poisson regressions, for the following sets of covariates: (0) no covariates, (1) one-hot encodings of the corps, (2) second-order polynomials of the years, and (3) concatenating the one-hot encodings of the corps and the second-order polynomials of the years. The one-hot encoding of a corps is a vector with 14 entries, one of which is 1 and 13 of which are 0; the position of the entry that is 1 corresponds to the associated corps. The second-order polynomials of the years arise from using as covariate vector a vector with 3 entries, the entries being the constant 1, the number of years beyond 1875, and the square of the number of years beyond 1875; Poisson regression considers linear combinations of these entries, thus forming quadratic functions of the years.
to (3) is 35.3; these deviances refer to the totals over all 280 samples, so are 280 times the average over the samples. The degrees of freedom corresponding to each of these changes is less than the decrease in deviance, indicating some minor heterogeneity in the data. (The baseline deviance generally has no statistical interpretation in terms of a universal distribution such as in deviance when changing covariates are what matter.)
With 14 samples per minibatch, each iteration of stochastic gradient descent takes about 110
seconds when training in private for any of the four sets of covariates; the smallest set (0) takes about 0
seconds less per iteration than the largest set (3). When training in public, each iteration takes about 1
seconds for any of the four sets of covariates. Training in private thus takes about 100 times longer than training in public on standard machines in a cluster for software development at Facebook.
A simple, standard computation yields that the differential entropy, in bits, of a random variable Y distributed uniformly over [
Combining (1), (34), and the upper-bound on H(X + Y ) from the following lemma yields (2), completing the proof of Theorem 1.
Lemma 6. Suppose that X and Y are independent scalar random variables and are positive real numbers such that
is distributed uniformly over [
where H denotes the differential entropy measured in bits (not nats), with equality attained in (35) when times a Rademacher variate, that is, when
with probability 1
with probability 1/2.
Proof. Denoting the cumulative distribution function of X by F, the probability density function of X + Y is
Combining and the definition of a cumulative distribution function yields that F(x) = 0 for
, so (36) becomes (as diagrammed in Figure 13)
Value
Figure 13: Sketch of a visualization for understanding the proof of Lemma 6, in the case where the probability distribution of X is given by an even unimodal probability density
and the entropy in bits of X + Y is
The function ) is minimal for 0 < p < 1 at p = 1/2, so the integral in the right-hand side of (38) is minimal when
, in which case (38) becomes (35) with equality attained when
times a Rademacher variate.
For any scalar random variables X, Y , and Z, the definition of mutual information states
and
where H denotes entropy. Furthermore, the subadditivity of entropy for any arbitrary random variables yields
Taking X, Y , and Z to be independent then yields
Combining (39)–(43) yields (3), completing the proof of Theorem 2.
We suppose that X, Y , P, Q, and T are independent scalar random variables and is a positive real number such that
, the random variable T is distributed uniformly over [
are distributed uniformly over [
]. Then, since (
is just the difference between
, providing no more and no less information than
on their own without (
2 is also merely a deterministic combination of the observations
providing no more and no less information than
on their own without (
2, the mutual information satisfies
Since ) establish a bijection between the pair
and the pair
, it follows that
Combining (3) and the fact that X, Y , Q, P, and T are independent yields
Combining (44)–(46) and (47) from the following lemma, together with (2), yields (4), completing the proof of Theorem 3.
Lemma 7. Suppose that X, P, and T are independent scalar random variables and is a positive real number such that
, the random variable P is distributed uniformly over [
and T is distributed uniformly over [
where I denotes the mutual information measured in bits.
Proof. The proof begins with a string of identities, systematically simplifying (or re-expressing) their right-hand sides. Indeed, since is simply the square of
, it follows that
Since establish a bijection between the pair
and the pair
and
, it follows that
Since is simply the square of
, it follows that
The chain rule for mutual information yields
The definition of mutual information states
where H denotes the differential entropy measured in bits. Since T is independent of X and P, the last term in the right-hand side of (52) is
The fact that T is distributed uniformly over [] yields via a simple, straightforward calculation that
We now upper-bound the first term in the right-hand side of (52), by defining
and noticing
That conditioning never increases entropy yields that the first term in the right-hand side of (52) satisfies
Combining (35) and the fact that T is distributed uniformly over [] yields that, maximizing over all random variables
(even dropping the constraint that
in the maximization),
Combining (52)–(58) yields that, maximizing over any random variable
In this appendix, we review the approximation of odd functions via Chebyshev series, as summarized for general (not necessarily odd) functions in Sections 4–6 of [6]. Given a real-valued differentiable function ] that is odd, that is,
for any real number , we can approximate f via its Chebyshev series, as follows. First, we select a sufficiently large positive integer n and compute
for 1; the approximation will converge as n increases. Having calculated
. . . ,
from (61), we can compute a good approximation to f evaluated at any real number y such that
where ) denotes the Chebyshev polynomial of degree j evaluated at x = y/z.
To calculate the right-hand side of (62) efficiently using only additions and multiplications involving the input
we evaluate the sums
for k = 1, 2, . . . , n, via the following recurrence:
and
started with
and
The final sum is equal to the right-hand side of (62), due to (63) and (64).
As discussed in any standard reference on modern machine learning, such as that of [3], minibatched stochastic gradient descent (SGD) calculates a vector w of parameters that minimizes the expected value is a function of both w and a random vector X. The expected value is known as the “risk” and
is known as the “loss.” SGD minimizes the expected loss without having direct access to the probability distribution of X, instead relying solely on samples of X (usually drawn at random from a so-called “training set”). Minibatched SGD generates a sequence of approximations via iterations,
where is a positive real number known as the “learning rate,”
denotes the gradient with respect to
’s second argument), m is the number of samples in a so-called “minibatch,” and
denote samples from X. The iterations fail to minimize the expected loss when
is constant rather than decaying to 0 as the iterations proceed, but fixing
at a sensibly small value is a common practice in machine learning (and still ensures convergence to the minimum of the empirical risk under suitable conditions on
, that is, to the minimum of the average of the loss, averaged over all samples in a fixed, finite training set).
Minimizing the regularized objective function is a nonnegative real number and
denotes the Euclidean norm of w, is a common way of ensuring that w not become too large. Adding such regularization to SGD is also known as “weight decay,” and the iterations in (71) become
we used some weight decay for the multinomial logistic regression of the measured data in Subsection 5.2 (but used no weight decay for any other results reported above, nor did we use any weight decay for iterative updates to the so-called bias offsets in c from the following appendix).
As discussed in any standard reference on generalized linear models, such as that of [14], a generalized linear model regresses a random vector Y of so-called targets against a matrix X of so-called covariates via the model
where g is known as the “link function,” E(Y |X) is the conditional expectation of the vector Y of targets given X (the matrix of covariates), w is a vector of so-called “weights” (or “parameters”), and c is a vector whose entries are independent of the values of X, known as “biases.” Table 6 lists several special cases of generalized linear models. Given independent samples of pairs ((
), . . . , (
), the standard method of fitting the vectors w of weights and c of biases is to minimize the negative of the natural logarithm of the likelihood, that is, to minimize the empirical risk
) denotes the probability (at the parameter values w and c) of observing
. Minimizing
))) via the minibatched stochastic gradient descent of the previous appendix is another (nearly equivalent) approach.
The probability distribution of Y given X for all of the generalized linear models considered in Table 6, except for probit regression, is a so-called “exponential family” of the form
where h is a nonnegative-valued function, and is known as the “log partition function”:
2 for the normal distribution,
)) for the Bernoulli distribution, and
exp(
) for the Poisson distribution. For all these cases, substituting (74) and taking the gradient with respect to
of both sides of
; combined with (73) this yields that
, so the link g is the inverse of
Combining (74) and the chain rule yields that the gradient with respect to w of ln(p(y|x; w, c)) is
and the gradient with respect to c of ln(p(y|x; w, c)) is
Thus, the stochastic gradient descent of the previous appendix requires nothing more than addition, matrix-vector multiplications, and evaluation of the inverse () of the link g. Needless to say, the inverse of the identity function is the identity function, the inverse of ln is exp, and the inverse of the inverse of the cumulative distribution function Φ for the standard normal distribution is Φ. A simple calculation shows that the inverse of the logit function
)) is the standard logistic function
)) and that the softmax detailed in Subsection 4.3 above inverts ln applied entrywise to a probability vector (the softmax simply applies exp entrywise and then normalizes to form a proper probability distribution).
The probability distribution for probit regression is an exponential family, but not of the form in (74); we handle this special case as detailed in Sub-subsection 5.1.3.
We would like to thank Shubho Sengupta and Andrew Tulloch.
Table 6: Special cases of generalized linear models, where ) denotes the (possibly multivariate) normal distribution with mean
and variance-covariance matrix being the identity matrix I, and Φ is the cumulative distribution function for the standard normal distribution (so Φ
corresponding quantile function, the inverse of Φ); “linear least squares” is also known as “ordinary least squares” or the “general” linear model — a special case of the “generalized” linear model
[1] Beaver, D. (1991) Efficient multiparty protocols using circuit randomization. in Advances in Cryptology — Proceedings of the 1991 Annual International Cryptology Conference, ed. by J. Feigenbaum, vol. 576 of Lecture Notes in Computer Science, pp. 420–432. Springer-Verlag.
[2] Bogdanov, D., Laur, S. & Willemson, J. (2008) Sharemind: a framework for fast privacy-preserving computations. in Computer Security — Proceedings of the 2008 European Symposium on Research in Computer Security, ed. by S. Jajodia, & J. Lopez, vol. 5283 of Lecture Notes in Computer Science, pp. 192–206. Springer-Verlag.
[3] Bottou, L., Curtis, F. E. & Nocedal, J. (2018) Optimization methods for large-scale machine learning. SIAM Rev., 60(2), 223–311.
[4] Chang, C.-C. & Lin, C.-J. (2011) LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3), 1–27, Article no. 27.
[5] Cover, T. M. & Thomas, J. A. (2006) Elements of Information Theory. Wiley-Interscience, 2nd edn.
[6] Curry, B. (2006) Parameter redundancy in neural networks: an application of Chebyshev polynomials. Comput. Management Sci., 4(3), 227–242.
[7] Dua, D. & Graff, C. (2019) UCI Machine Learning Repository. UC-Irvine, School of Information and Computer Sciences, Available at http://archive.ics.uci.edu/ml.
[8] Gunning, D., Hannun, A., Ibrahim, M., Knott, B., van der Maaten, L., Reis, V., Sengupta, S., Venkataraman, S. & Zhou, X. (2019) CrypTen: a new research tool for secure machine learning with PyTorch. https://ai.facebook.com/blog/ crypten-a-new-research-tool-for-secure-machine-learning-with-pytorch.
[9] Guo, C.-H. & Higham, N. J. (2006) A Schur-Newton method for the matrix pth root and its inverse. SIAM J. Matrix Anal. Appl., 28(3), 788–804.
[10] Higham, N. J. (2005) The scaling and squaring method for the matrix exponential revisited. SIAM J. Matrix Anal. Appl., 26(4), 1179–1193.
[11] IEEE Computer Society Microprocessor Standards Committee (2019) IEEE Standard for Floating-Point Arithmetic. Discussion Paper 754-2019, IEEE.
[12] Kenney, C. & Laub, A. J. (1991) Rational iterative methods for the matrix sign function. SIAM J. Matrix Anal. Appl., 12(2), 273–291.
[13] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M. & Huang, F. J. (2006) A tutorial on energy-based learning. in Predicting Structured Data, ed. by G. Bakir, T. Hofmann, B. Sch¨olkopf, A. J. Smola, B. Taskar, & S. V. N. Vishwanathan, Neural Information Processing, pp. 191–246. MIT Press.
[14] McCullagh, P. & Nelder, J. A. (1989) Generalized Linear Models, vol. 37 of CRC Monographs on Statistics and Applied Probability. Chapman and Hall, 2nd edn.
[15] Stigler, S. M. (2019) Data have a limited shelf life. Harvard Data Science Review, 1(2), 1–23, Available at https://hdsr.mitpress.mit.edu/pub/iu26pfw1.