b

DiscoverSearch
About
My stuff
Secure multiparty computations in floating-point arithmetic
2020·arXiv
Abstract
Abstract

Secure multiparty computations enable the distribution of so-called shares of sensitive data to multiple parties such that the multiple parties can effectively process the data while being unable to glean much information about the data (at least not without collusion among all parties to put back together all the shares). Thus, the parties may conspire to send all their processed results to a trusted third party (perhaps the data provider) at the conclusion of the computations, with only the trusted third party being able to view the final results. Secure multiparty computations for privacy-preserving machine-learning turn out to be possible using solely standard floating-point arithmetic, at least with a carefully controlled leakage of information less than the loss of accuracy due to roundoff, all backed by rigorous mathematical proofs of worst-case bounds on information loss and numerical stability in finite-precision arithmetic. Numerical examples illustrate the high performance attained on commodity off-the-shelf hardware for generalized linear models, including ordinary linear least-squares regression, binary and multinomial logistic regression, probit regression, and Poisson regression.

Passwords and long account and credit-card numbers are the dominant security measures, not because they are the most secure, but because they are the most conveniently implemented. Some data demands the highest levels of security and privacy protections, while for other data processing efficiency and sheer convenience are paramount — some security is better than none (which tends to be the alternative). The present paper proposes privacy-preserving, secure multiparty computations performed solely in the IEEE standard double-precision arithmetic that dominates most platforms for numerical computations. The scheme amounts to lossy, leaky cryptography, with the loss of accuracy and leakage of information carefully controlled via mathematical analysis and rigorous proofs. Information loss balances against roundoff error, providing perfect privacy at a specified finite precision of computations.

Perfect privacy at a given precision is when the information leakage is less than the specified precision (precision being limited due to roundoff error). The present paper provides perfect privacy at a precision of about 10−5 in the IEEE standard double-precision arithmetic of [11]; observing the encrypted outputs leaks no more than a millionth of a bit per input real number, whereas roundoff alters the results by around one part in a hundred thousand. In a megapixel image, the encrypted image would leak at most a single bit — enough information to discern whether the original image is dim or bright, perhaps, but no more. Performing all computations in floating-point arithmetic facilitates implementations on existing hardware, including both commodity centralprocessing units (CPUs) and graphics-processing units (GPUs), whereas alternative methods based on integer modular arithmetic could require difficult specialized optimizations to attain performance on par with the scheme proposed in this paper (even then, schemes based on integer modular arithmetic would have to contend with tricky issues of discretization and precision in order to handle the real numbers required for machine learning and statistics).

The algorithms and analysis consider the traditional honest-but-curious model of threats: we assume that the multiple parties follow the agreed-upon protocols correctly but may try to glean information from data they observe; the secure multiparty computations prevent any of the parties from gleaning much information without all parties conspiring together to break the scheme.

Our analysis provides no guarantees about what information leaks when all parties collude to reveal encrypted results. If all parties conspire to collect together all their shares or send them to a collecting agency, then the unified collection will reveal the secrets of whatever results get collected. If the collected information results from training a machine-learned model, then revealing the trained model can compromise the confidentiality of the data used to train that model, unless the model is differentially private. Ensuring privacy even after revealing the results of secure multiparty computations is complementary to securing the intermediate computations. The present paper only guarantees the privacy of the intermediate computations, providing no guarantees about what leaks when all parties collude to reveal the final results of their secure multiparty computations.

This paper has the following structure: Section 2 introduces secure multiparty computations in floating-point arithmetic, reviewing classical methods such as additive sharing and Beaver multiplication. Section 3 upper-bounds the amount of information that can leak, referring to Appendices A, B, and C for full, rigorous proofs. Section 4 reviews techniques for efficient, highly accurate polynomial approximations to many real functions of interest (notably those in Table 3). Section 5 validates an implementation on synthetic examples and illustrates its performance on real measured data, too; the examples apply various generalized linear models, including ordinary linear least-squares regression, binary and multinomial logistic regression, probit regression, and Poisson regression. Appendices D, E, and F very briefly review Chebyshev series, minibatched stochastic gradient descent, and generalized linear models, respectively — readers may wish to refer to those appendices as concise refreshers.

Throughout, all numbers and random variables are real-valued, even when not stated explicitly.

Secure multiparty computations allow holders of sensitive data to securely distribute so-called shares of their data to multiple parties such that the multiple parties can process the data without revealing the data and can only reconstruct the data by colluding to put back together (that is, to sum) all the shares. We briefly review an arithmetic scheme in the present section. The arithmetic scheme supports addition and multiplication, as discussed in the present section, as well as functions that polynomials can approximate accurately, as discussed in Section 4 below. To be concrete and simplify the presentation, we focus first on secure two-party computations, in Subsection 2.1, then sketch an extension to several parties in Subsection 2.2.

2.1 Two-party computations

We can hide a matrix X by masking with a random matrix Y , so that one party holds  X − Y andthe other party holds Y . Given other data, say U, and another random matrix V hiding U, so that one party holds  U − Vand the other party holds V , the parties can then independently form the sums (U − V ) + (X − Y ) and V + Yrequired to reconstruct  U + X = (U − V ) + (X − Y ) + (V + Y )if all these matrices have the same dimensions. This is known as “additive sharing,” as described, for example, by [2]. Additive sharing thus supports privacy-preserving addition of U and X.

image

Table 1: Ledgers for two parties in a Beaver multiplication of U and X

As introduced by [1], used by [2], and reviewed in Table 1, privacy-preserving multiplication of U and X is also possible whenever U and X are matrices such that their product UX is well-defined. Summing across the two parties in line 10 of Table 1 would yield  PR + (U − P)R +P(X − R) + (U − P)(X − R) = UX, as desired. Notice that  U − P and X − Rin lines 8 and 9 of Table 1 are still masked by the random matrices P and R whose values are unknown to the two parties. The all-reduce in lines 8 and 9 requires the parties to conspire and distribute to each other the results of summing their respective shares of lines 6 and 7, but does not require the parties to get shares of any secret distributed data — that was necessary only for the original data U and X being multiplied (and this “original” data can be the result of prior secure multiparty computations). Also, all values on lines 3–5 are independent of U and X, so can be precomputed and stored on disk. As with addition, multiplication via the Beaver scheme never requires securely distributing secret shares, so long as the input data and so-called Beaver triples (P, R, PR) from Table 1 have already been securely distributed into shares. The secure distribution of shares is completely separate from the processing of the resulting distributed data set. Communication among the parties during processing is solely via the all-reduce in lines 8 and 9.

Table 1’s scheme also works with matrix multiplication replaced by convolution of sequences.

Table 1 simplifies to Table 2 for the case when U and X are scalars such that U = X. Summing across the two parties in line 6 of Table 2 yields  P 2 + 2P(X − P) + (X − P)2 = X2, as desired. Notice that this recovers  X2 from a sum involving several terms as large as  P 2; if |X| ≤ 1 < 3 < γand  |P| ≤ γ(as well as  |T| ≤ γ2, where Tcancels when summing across the two parties in line 6), then we obtain  X2 to precision upper-bounded by 6γ2 · ε, where εdenotes the machine precision (εis approximately 2.2 × 10−16 in the IEEE standard double-precision arithmetic of [11]). Theorem 3 proves that the information leakage may be as large as 6/γ if P, Q, and Yare distributed uniformly over [−γ, γ] and Tis distributed uniformly over [−γ2, γ2] (similarly, P, Q, R, S, V , and Y in Table 1 should be distributed uniformly over [−γ, γ] while Tshould be distributed uniformly over [−γ2, γ2]).Balancing the roundoff bound with the information bound requires 6γ2 ·ε = 6/γ, so that γ = 1/ 3√ε(so  γ ≈ 105 for IEEE standard double-precision arithmetic).

image

Table 2: Ledgers for two parties in a Beaver squaring of X

2.2 Several-party computations

Extending Subsection 2.1 beyond two parties is straightforward. The steps in the algorithms, summarized in the columns labeled “source” in Tables 1 and 2, stay as they were (the division by 2 in the last line of Table 1 and in the last line of Table 2 becomes division by the number of parties). Distributing additive shares of data across several parties works as follows: for each piece of data to be distributed (in machine learning, a “piece” may naturally be a sample or example from the collection of all samples or examples), we generate n independent and identically distributed random matrices  Y1, Y2, . . . , Yn, where nis the number of parties (the number of parties need not relate to the total number of pieces, samples, or examples of data being distributed). We randomly permute the parties and then distribute to them (in that random order)  X + Y1 −Y2, Y2 − Y3, Y3 − Y4, . . . , Yn−1 − Yn, Yn − Y1, where Xis the piece of data being shared. We generate different independent random variables and random permutations for different pieces of data. The distribution of the difference between independent random matrices drawn from the same distribution is the same for each party, making this an especially simple generalization to the case of several parties. Distributing additive shares to several parties leaks somewhat more information than limiting to only two parties; the present paper focuses on the case of two parties for simplicity.

This section bounds the amount of information-theoretic entropy that can leak when adding noise for masking, drawing heavily on canonical concepts from information theory, as detailed, for example, by [5].

We denote by X the scalar random variable that we want to hide, and by Y an independent variate that we add to X to effect the hiding. To simplify the analysis, we assume that the distribution of Y arises from a probability density function. Then, revealing X + Y leaks the following number of bits of information about X:

image

In the left-hand side of (1), H denotes the Shannon entropy measured in bits if the distribution of X is discrete, and the differential entropy measured in bits (rather than nats) if the distribution of X is continuous. In the right-hand sides of (1), H denotes the differential entropy measured in bits (not nats); I denotes the mutual information. Given a prior on X, Bayes’ Rule yields the full posterior distribution for X given X + Y ; the information gain (or loss or leakage) defined in (1) is a summary statistic characterizing the divergence of the posterior from the prior.

Recall that mutual information is the fundamental limit on how much information can be gleaned from observing the outputs of a noisy channel; in our setting, we purposefully add noise in order to reveal only the results of communications via a (very) noisy channel, purposefully obscuring with noise the signal containing data being kept confidential and secure.

The information leakage is at most  β/γ bits if |X| ≤ β < γ and Yis distributed uniformly over [−γ, γ], as stated in the following theorem and proven in Appendix A:

Theorem 1. Suppose that X and Y are independent scalar random variables and  β and γ arepositive real numbers such that  |X| ≤ β < γ and Yis distributed uniformly over [−γ, γ]. Then, the information leaked about X from observing X + Y satisfies

image

where I denotes the mutual information between X and X + Y , measured in bits (not nats); the mutual information satisfies (1), which expresses I as a change in entropy. The inequality in (2) is an equality when  X is βtimes a Rademacher variate.

The following theorem, proven in Appendix B, states that the information leakage from hiding data multiple times is at most the sum of the information leaking from each individual hiding.

Theorem 2. Suppose that X, Y , and Z are independent scalar random variables. Then,

image

where I denotes the mutual information.

The procedures of Tables 1 and 2 can also leak information, but not much — consider Table 2: Party 2 observes nothing about the input data X other than  X − P, and Theorem 1 bounds how much information that reveals about X. Party 1 observes  X −Y , P −Q, P 2−T, (X −Y )−(P −Q),X−P, and (P 2−T)+2(P −Q)(X−P)+(X−P)2/2. The following theorem, proven in Appendix C, bounds how much information about X these observations reveal.

Theorem 3. Suppose that X, Y , P, Q, and T are independent scalar random variables and  γ isa positive real number such that  |X| ≤ 1 < 3 < γ, the random variable T is distributed uniformly over [−γ2, γ2], and Y , P, and Qare distributed uniformly over [−γ, γ]. Then,

image

where I denotes the mutual information measured in bits, and its arguments (aside from X) are the observations in Table 2 under the column for “party 1.”

The following theorem states the classical data-processing inequality:

Theorem 4. Suppose that random matrices X and Z are conditionally independent given a random matrix Y . Then,

image

where I denotes the mutual information.

image

Table 3: Subsections providing polynomial approximations to various functions

The following is a corollary of Theorem 4:

Corollary 5. Suppose that X and Y are random matrices and f is a deterministic function. Then,

image

where I denotes the mutual information.

Combining (1) and (7) shows that even very complicated manipulations such as the iterations in Section 4 below cause no further information leakage, despite changing the added noise in some highly nonlinear fashion: (7) guarantees that no information leaks beyond the individual maskings obeying (2)(4), so long as the manipulations are deterministic algorithms (or randomized algorithms with randomization independent of the data being masked).

Polynomials can approximate many functions of interest in machine learning, allowing the accurate approximation of those functions using only additions and multiplications. Section 2 above discusses schemes that multiple parties can use to perform additions and multiplications securely. The present section describes polynomial approximations useful in tandem with the schemes of Section 2. Subsection 4.1 leverages the method of Newton and Raphson. Subsection 4.2 uses Chebyshev series. Subsection 4.3 utilizes Pad´e approximation, in the method of scaling and squaring. The method of Newton and Raphson tends to be the most efficient, while Chebyshev series apply to a much broader class of functions. The method of scaling and squaring is for exponentiation. Table 3 lists the functions that each subsection treats.

4.1 Newton iterations

Various iterations derived from the Newton method for finding zeros of functions allow the computation of functions such as sgn(x), 1/x, and 1/√xusing only additions and multiplications (not requiring any divisions or square roots); in this subsection, x denotes a real number.

According to [12], the Newton-Schulz iterations for computing sgn(x) are

image

with  y0 = x/γ, where |x| ≤ γ(and the desired loss of accuracy relative to the machine precision is less than a factor of  γ).According to [12], the Schulz (or Newton) iterations for computing 1/x when x > 0 are

image

with  y0 = 1. Rescaling x(and then adjusting the resulting reciprocal) is important to align with the domain of convergence and high accuracy illustrated in Figure 1; and similar observations pertain to the rest of the iterations of the present subsection. In Subsection 4.3 below, x can range from 1 to the number of terms in the softmax (so requires scaling by the reciprocal of the number of terms in the softmax).

According to [9], the Newton iterations for computing 1/√x when x > 0 are

image

with  y0 = 1; similarly, the Newton iterations for computing x−1/8 when x > 0 are

image

with  y0 = 1.

Figures 1–4 illustrate the errors obtained from (8)(11); note that the scale of the vertical axes in Figures 1–3 involve 1e–16. In the figures, the tilde denotes the approximation computed via the Newton iterations (8)(11); for example, �1/xapproximates 1/x.

A common operation in the deep learning of [13] and others is the rectified linear unit

image

easily obtained from (8).

4.2 Chebyshev series

Chebyshev series provide efficient approximations to smooth functions using only additions and multiplications. The approximations are especially efficient for odd functions, such as

image

in this subsection, x denotes a real number. The function in (13) is the hyperbolic tangent. The function in (14) is a constant plus the standard logistic function, familiar from logistic regression. The function in (15) is a constant plus the cumulative distribution function for the standard normal distribution, familiar from probit regression. Performing logistic regression or probit regression by

image

image

Figure 4: Absolute error in computation of |x| = x sgn(x) with 60 iterations of (8); the figure superimposes a white curve over a black curve, where the white curve uses  y0 = −x/γto start (8) while the black curve uses  y0 = x/γ, both with  γ = 105

maximizing the log-likelihood relies on the evaluation of (14) or (15), respectively, at least when using a gradient-based optimizer, the method of Newton and Raphson, or the method of scoring. Details about these functions and regressions are available, for example, in the monograph of [14].

Appendix D reviews algorithms for computing approximations via Chebyshev series of odd functions, with accuracy determined via two parameters, n and z, where the degree of the (odd) approximating polynomial is 2n − 1, and [−z, z] is the interval over which the approximation is valid. Setting n = 50 and z = 10 yields 7-digit accuracy for the approximation of (13); setting n = 22 and z = 5 yields 4-digit accuracy for the approximation of (14); and setting n = 34 and z = 10 yields 5-digit accuracy for the approximation of (15). In Section 5 below, we err on the side of caution, defaulting to n = 60 and z = 20 for (13) and (14) and to n = 50 and z = 20 for (15), while also discussing the results from other choices.

4.3 Softmax

As reviewed, for example, by [13], a common operation in machine learning is the so-called “softmax” transforming n non-positive real numbers  x1, x2, . . . , xn into the npositive real numbers exp(x1)/Z, exp(x2)/Z, . . . , exp(xn)/Z, where Z = Z(x1, x2, . . . , xn) = �nk=1 exp(xk). These are the probabilities at unit temperature in the Gibbs distribution associated with energies  −x1, −x2,. . . ,  −xn, where Zis the partition function. Once we have computed the exponentials, summation yields Z directly; the secure multiparty computations of Section 2 support such summation. Division by Z is available via the iterations in (9) of Subsection 4.1.

Thus, given real numbers  x and β such that −β ≤ x ≤0, and a real number  ε such that0  < ε <1, we would like to calculate exp(x) to precision  ε.We use the method of scaling and squaring, as reviewed, for example, by [10]. If we let n be the least integer that is at least log2(2β2/ε), then squaring exp(x/2n) yields exp(x/2n−1), squaring exp(x/2n−1) yields exp(x/2n−2),and so on, so that n successive squarings will yield exp(x); further, 1+x/2n approximates exp(x/2n):

image

that is,

image

while

image

so n successive squarings of 1+x/2n yields exp(x) to relative accuracy  ε(or better). We use n = 20 successive squarings in all numerical experiments of Section 5 below.

Given a real number  γ > 3β, less than 6nβ/γbits can leak from computing the approximation to exp(x) if we add to 1 +  x/2n a random variable distributed uniformly over [−γ/2n, γ/2n], anddouble the width of the added noise upon each of the n squarings, in accord with Theorems 1, 2, and 3 of Section 3.

Clearly, we can enforce that a real number x be non-positive by applying  − ReLU(−x) from (12). Computing the softmax of real numbers that may not necessarily be non-positive is also possible, even without risk of leaking any information beyond the case for non-positive numbers. Indeed, we can replace each real number  xj with xj − �nk=1 ReLU(xk), without altering the probability distribution produced by the softmax; we perform such replacement in our implementation of multinomial logistic regression. When implementing a softmax for multinomial logistic regression, we include a negative offset in the bias for the input to the softmax. That is, we adjust the bias to be a constant amount less, constant over all classes in the classification. Subtracting such a positive constant C tends to make  x1 − C, x2 − C, . . . , xn − Cnegative even before replacing each real number  xj − C with xj − C − �nk=1 ReLU(xk − C). Subtracting a constant C reduces the sum �nk=1 ReLU(xk − C); without subtracting the constant, the sum  �nk=1 ReLU(xk) can adversely impact accuracy if the sum becomes too large. Subtracting the constant C reduces accuracy by a factor of up to exp(C); we set C = 5 (so exp(C) ≈148 — a tad more than two digits) for our numerical experiments reported in the following section, Section 5.

Via several numerical experiments, this section illustrates the performance of the scheme proposed above. All examples reported in the present section use two parties for the private computations. Subsection 2.2 outlines an extension to several parties. The terminology “in private” refers to computations fully encrypted via the scheme introduced above, while “in public” refers to computations in plaintext. Subsection 5.1 validates the scheme on examples for which the correct answer is known by construction. Subsection 5.2 applies the scheme to classical data sets.

All examples use minibatched stochastic gradient descent to maximize the log-likelihood under the corresponding generalized linear model, at the constant learning rates specified below (except where noted for probit regression). Appendix E briefly reviews stochastic gradient descent with minibatches and weight decay; Appendix F briefly reviews generalized linear models.

In all cases, we learn (that is, fit) not only the vector of weights to which the design matrix gets applied, but also a constant offset known as the “bias” in the literature on stochastic gradient descent. Thus, the linear function of the weight (fitted parameter) vector w in the generalized linear model is actually the affine transform Aw +c, where A is the design matrix and c is the bias vector whose entries are all the same constant offset, learned or fitted together with w during the iterations of stochastic gradient descent (whereas A stays fixed during the iterations). In multinomial logistic regression, the entries of the bias vector c are constant for each class (constant over all covariates and data samples), but the constant may be different for different classes.

In the subsections below, we view linear least-squares regression as a generalized linear model with the link function being the identity; equivalently, we take the parametric family defining the statistical model to be an affine transform of the vector of weights (parameters) plus independent and identically distributed normal random variables. The log-likelihood of a such a model summed over all samples in the design matrix A is simply a constant minus  ∥Aw + c − b∥22/2, where ∥ · ∥2denotes the Euclidean norm, w is the vector of weights (parameters), Aw +c is the affine transform defined by the design matrix A and bias vector c, and b is the vector of targets. For simplicity, when we report the “negative log-likelihood” or “loss” in plots, averaged over the m samples in the design matrix A, we report  ∥Aw + c − b∥22/m, ignoring the constant and factor of 2.

The implementation of encrypted computations builds on CrypTen of [8], which in turn builds on PyTorch. The implementation uses only IEEE standard double-precision arithmetic. All experiments ran on one of Facebook’s computer clusters for research, which enables rapid communication between the multiple parties. In actual deployments, communications between the multiple parties are likely to have high latency, dramatically impacting the speed of the multiparty computations. The speed of such communications would vary significantly between different applications and arrangements, likely requiring separate analyses and characterizations of computational efficiency for different deployments. Yet, while timings are fairly unique to the particular computational environment, the accuracies we report below should be fully representative for most applications.

image

Figure 5: Euclidean norm of the difference between the ideal normalized weight vector  w/∥w∥2 andits computed approximation  x/∥x∥2as a function of the width  γof the uniform noise on [−γ, γ]added to the shares of data (the lines for the logit and probit links overlap)

5.1 Validations on synthetic data

For the synthetic examples discussed in the following sub-subsections, we set m = 64 and n = 8 for the numbers of rows and columns in the design matrices being constructed. Figure 5 displays the discrepancy of the computed results from the ideal solution (the ideal is known a-priori to two-digit accuracy by construction in these synthetic examples) as a function of the maximum value γof the random variable distributed uniformly over [−γ, γ]. In accordance with the analysis in Subsection 2.1 above, the figure reports that  γ = 105 works well, yielding the roughly two-digit agreement that would be optimal for the synthetic data sets constructed in the following sub-subsections, so we set  γ = 105 for the remainder of the paper. The logistic and probit regressions reported below both rely on the Chebyshev approximations reviewed in Subsection 4.2 above, with the approximations being valid over the interval [−20, 20] (so z = 20 in the notation ofSubsection 4.2). Figure 6 indicates that the degree 40 approximations suffice for optimal accuracy, while even degrees 6–12 produce reasonably accurate results (accurate enough for most applications in machine learning for prediction, in which only residuals or matching targets matters). For the remainder of the paper, we use 120 terms in the Chebyshev approximation of the logistic function (or tanh), and 100 terms in the Chebyshev approximation of the inverse probit (or the cumulative distribution function of a standard normal variate). For both Figures 5 and 6, we trained for 10,000 iterations with 8 samples in each iteration’s minibatch, thus sweeping through a random permutation of the full synthetic data set 1,250 times (each sweep is known as an “epoch”). The following sub-subsections detail the construction of our synthetic data sets.

image

Figure 6: Euclidean norm of the difference between the ideal normalized weight vector  w/∥w∥2and its computed approximation  x/∥x∥2as a function of the number of terms in the Chebyshev approximation of Subsection 4.2 (about half the terms vanish, since the polynomial is odd)

5.1.1 Linear least-squares regression (identity link)

We form the design matrix A and target vector b as follows. We orthonormalize the columns of an m × (n+ 1) matrix whose entries are all independent and identically distributed (i.i.d.) standard normal variates to obtain an  m × n matrix Awhose columns are orthonormal (A is the leftmost block of n columns) and an  m ×1 column vector v that is orthogonal to all columns of A and such that  ∥v∥2 = 1 (vis the remaining column). We define the ideal weights  w to be an n × 1 columnvector whose entries are i.i.d. standard normal variates. We define  b to be the m ×1 column vector

image

so that by construction

image

Needless to say, obtaining  x = w drives ∥Ax − b∥2to its minimum, 10.

After 10,000 iterations (which is 1,250 epochs) with 8 samples per minibatch at the learning rate 3  × 10−2, the residuals  ∥Ax + c − b∥2obtained by training in public and by training in private are both equal to 10.0 to three significant figures. For training in private (training in public yields very similar results), the Euclidean norm of the difference between the ideal w and the computed weight vector x is 0.025 (the Euclidean norm of the difference between  w/∥w∥2 and x/∥x∥2is 0.004), and all entries of the vector  c are −0.008 (which is  −0.003 when divided by the Euclidean norm of x); that these values are so small certifies the correctness of the training.

5.1.2 Logistic regression (logit link)

We define the ideal weights w to be the result of normalizing an  n ×1 column vector whose entries are i.i.d. standard normal variates, that is, we divide the vector whose entries are i.i.d. by its Euclidean norm, ensuring

image

The construction of the design matrix A and target vector t is more involved; readers interested only in the results and not the details of the construction may wish to skip to the last two paragraphs of the present sub-subsection.

For  j = 1, 2, . . . , 10, we construct an n ×1 column vector  v(j) whose entries are i.i.d. standard normal variates, project off the component of  v(j) along w to obtain u(j),

image

and set

image

and

image

for k = 1, 2, . . . , n. For the remaining  m −20 rows of  A, we use m −20 rows from the result of orthonormalizing the columns of an  m × nmatrix whose entries are all i.i.d. standard normal variates.

We construct the  m ×1 column vector

image

and define the target classes

image

for j = 1, 2, . . . , m, where “round” rounds to the nearest integer (0 or 1) and  σis the standard logistic function

image

Combining (25), (23), (24), (22), and (21) yields

image

and

image

for  j = 1, 2, . . . , 10. Since, for j = 1, 2, . . . , 10, b2jis slightly positive while  b2j+1is slightly negative, the target classes  t2j and t2j+1defined in (26) will be 1 and 0, respectively, even though the difference between the corresponding (2j)th and (2j+1)th rows of A is small — combining (23), (24), and (21) yields that the Euclidean norm of their difference is 0.04. The decision hyperplane separating class 0 from class 1 will thus have to pass between 10 pairs of points in n-dimensional space (n = 8), with the points in each pair very close to each other (albeit on opposite sides of the decision hyperplane). Classifying all these points correctly hence determines the hyperplane to reasonably high accuracy.

Needless to say, obtaining x = w and c = 0 produces perfect accuracy for the logistic regression which classifies by rounding the result of (27) applied to each entry of Ax+c, as then Ax = Aw = b, and the target classes in t are the result of (27) applied to each entry of b. In fact, obtaining x as any positive multiple of w together with c = 0 yields perfect accuracy — any positive multiple of a vector orthogonal to the hyperplane separating the two classes specifies that same hyperplane.

After 10,000 iterations (which is 1,250 epochs) with 8 samples per minibatch at the learning rate 3, training binary logistic regression in private (training in public produces very similar results) drives the Euclidean norm of the difference between  x/∥x∥2and the ideal  w/∥w∥2to 0.006 and drives every entry of  c/∥x∥2to 0.006, where c is the vector of offsets (whose entries are all the same). That these values are so small validates the training. The log-likelihood, averaged over all m = 64 samples, changes from −0.785 to −0.088 over the 10,000 iterations (needless to say, the log-likelihood cannot exceed 0). The accuracy becomes exactly perfect (that is, 1).

When training multinomial logistic regression for two classes on the same synthetic data set with the same settings, similar validation attains: for training in private (training in public yields very similar results), the Euclidean norm of the difference between the ideal  w/∥w∥2 and thecomputed  x/∥x∥2 becomes 0.008, and every entry of  c/∥x∥2becomes either  −0.037 or −0.033,where c contains the bias offsets. The log-likelihood, averaged over all m = 64 samples, changes from  −0.958 to −0.047 over the 10,000 iterations, and the accuracy becomes perfect (that is, exactly 1). Of course, using multinomial logistic regression with a binomial distribution rather than standard logistic regression makes no sense, but these results certify the correctness of the multinomial logistic regression nonetheless.

5.1.3 Probit regression (probit link)

The design matrix A and target vector t for probit regression are the same as for logistic regression, but replacing the sigmoid  σdefined in (27) with the inverse probit

image

which is also known as the cumulative distribution function for standard normal variates. An iteration of stochastic gradient descent involves selecting rows of the design matrix A and collecting them together into a matrix R, as well as collecting together the corresponding targets from t into a vector s (the number of rows is the size of the minibatch). One iteration of stochastic gradient descent for maximizing the log-likelihood in logistic regression updates the weight vector x by adding to x the learning rate  ηtimes the transpose  R⊤ applied to the difference between the corresponding target samples  s and σdefined in (27) applied to each entry of Rx + c; with a minor abuse of notation, we could write that x updates to  x + ηR⊤(s − σ(Rx + c)). Since the sigmoid defined in (30) is numerically very similar to the sigmoid defined in (27) (after scaling such that the variances of the sigmoids are the same), we use the same updating formula for probit regression as for logistic regression, but using the design matrix A associated with probit regression rather than that for logistic regression, and using the sigmoid  σassociated with probit regression rather than that for logistic regression. A na¨ıve application of stochastic gradient descent for maximizing the log-likelihood in probit regression would update the weight vector in the same direction, but scaled slightly, effectively altering the learning rate a tiny bit from iteration to iteration; we omit the extra computations required to follow the na¨ıve method exactly.

As with logistic regression, obtaining x = w and c = 0 produces perfect accuracy for the probit regression which classifies by rounding the result of (30) applied to each entry of Ax + c. And, again, obtaining x as any positive multiple of w together with c = 0 yields perfect accuracy — any positive multiple of a vector orthogonal to the hyperplane separating the two classes specifies that same hyperplane. After 10,000 iterations (which is 1,250 epochs) with 8 samples per minibatch at the learning rate 3, training in private (training in public produces very similar results) drives the Euclidean norm of the difference between  x/∥x∥2and the ideal  w/∥w∥2to 0.006 and drives every entry of  c/∥x∥2to 0.005, where c is the vector of offsets (whose entries are all the same). That these values are so small validates the training. The log-likelihood, averaged over all m = 64 samples, changes from  −0.918 to −0.058 over the 10,000 iterations (needless to say, the log-likelihood cannot exceed 0). The accuracy becomes precisely perfect (that is, 1).

5.1.4 Poisson regression (log link)

We obtain the design matrix A by orthonormalizing the columns of an  m × nmatrix whose entries are i.i.d. standard normal variates. We define w to be 10 times the result of normalizing an  n × 1column vector whose entries are i.i.d. standard normal variates, that is, we divide the vector whose entries are i.i.d. by its Euclidean norm, and then multiply by 10, ensuring

image

We construct the  m ×1 column vector

image

where “3” indicates the  m ×1 column vector whose entries are all 3. We then define the target counts

image

for j = 1, 2, . . . , m, where “round” rounds to the nearest integer (0, 1, 2, . . . ). Using 10 as the norm of w in (31) ensures that the integers  t1, t2, . . . , tmvary over a significant range, while using 3 in the right-hand side of (32) ensures that, on average, half will be greater than exp(3)  ≈ 20.Having targets that vary over a significant range and with many not too small ensures that the maximum-likelihood estimates of w and 3 in Poisson regression are close to w and 3 with reasonably high accuracy — the discretization from the rounding operation in (33) matters little when many counts are large and spread over a range significantly greater than the discretization spacing.

After 10,000 iterations (which is 1,250 epochs) with 8 samples per minibatch at the learning rate 3  × 10−3, training in private (training in public produces very similar results) drives the Euclidean norm of the difference between  x/∥x∥2and the ideal  w/∥w∥2to 0.003 and drives every entry of c −3 to 0.003, where c is the vector of offsets (whose entries are all the same). That these values are so small certifies the correctness of the training, as does the following result: the log-likelihood for the obtained weight vector x and offset c, averaged over all  m = 64 samples, is −2.349 after the 10,000 iterations, which matches the log-likelihood for the ideal weight vector w and ideal offset (3) to three-digit precision.

5.2 Performance on measured data

The following sub-subsections illustrate the application of the scheme proposed above to several benchmark data sets, namely, handwritten digits from MNIST, forest covers from covtype, and the numbers of deaths from horsekicks in corps of the Prussian army over two decades.

5.2.1 MNIST

We use both binary and multinomial logistic regression, as well as probit regression, to analyze a classic data set of handwritten digits, created by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges via merging two sets from the National Institute of Standards and Technology; the mixed NIST set is available at http://yann.lecun.com/exdb/mnist as a training set of 60,000 samples and a testing set of 10,000 examples. Each sample is a centered 28-pixel  ×28-pixel grayscale image of one of the digits 0–9, together with a label for which one; pixel values can range from 0 to 1. For multinomial logistic regression we use all 10 classes (with one class per digit); for binary logistic and probit regressions, we use the 2 classes corresponding to the digits 0 and 1, for which there are 12665 samples in the training set, and 2115 samples in the testing set. We set the number of samples in a minibatch to 50 for training with all 10 classes, and to 85 for training with only the 2 classes corresponding to the digits 0 and 1. We trained for 20 epochs (that is, 24,000 iterations) at learning rate 10−2 with all 10 classes, and for 30 epochs (that is, 4,470 iterations) at learning rate 10−3 with only the 2 classes corresponding to the digits 0 and 1. During training for all 10 classes, we supplemented each iteration of stochastic gradient descent with weight decay of 10−3,which is equivalent to adding to the objective function being minimized (that is, to the negative log-likelihood) a regularization term of 10−3 times half the square of the Euclidean norm of the weights. This weight decay has negligible impact on accuracy yet ensures that the constant 5 we subtract from the bias in the stochastic gradient descent is sufficient to make almost all inputs of the softmax be non-positive, as suggested in the last paragraph of Subsection 4.3 above.

Figure 7 plots the results of training and Figure 8 displays the performance of the resulting trained model when applied to the testing set, both as a function of the maximum value  γ of therandom variable distributed uniformly over [−γ, γ]; the figures show that  γ = 105 works well, in agreement with the analysis in Subsection 2.1 above. Table 4 details the results for  γ = 105; trainingin public produces the same results at the precision reported in the table. For this application to machine learning, even  γ = 106 produces practically perfect predictions during both training and testing. Logistic regression corresponds to the logit link; probit regression corresponds to the probit link. The value of the log-likelihood averaged over the testing set is remarkably similar to the average value over the training set, showing that training generalizes well to the independent testing set.

Figure 9 displays the results of training multinomial logistic regression for all 10 classes (one class per digit), along with applying the resulting trained model to the testing set, as a function of the maximum  γof the random variable distributed uniformly over [−γ, γ]; the accuracy exceeds 0.9 from  γ = 103 to γ = 106. The generalization from the training set to the testing set is ideal. At  γ = 105, the training loss is 0.318, the testing loss is 0.305, the training accuracy is 0.913, and the testing accuracy is 0.917; training in public yields the same results to three-digit precision.

5.2.2 Forest cover type

We use both binary and multinomial logistic regression, as well as probit regression, to analyze standard data on the type of forest cover based on cartographic variates from Jock A. Blackard of the United States Forest Service, Denis J. Dean of the University of Texas at Dallas, and Charles W. Anderson of Colorado State University (with copyright retained by Jock A. Blackard and Colorado State University); the original data is available from the archive of [7] at http:// archive.ics.uci.edu/ml/datasets/covertype and the preprocessed and formatted versions that we use are available from the work of [4] at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/ datasets/binary.html#covtype.binary (for binary classification) and http://www.csie.ntu.

image

Table 4: Values of the negative log-likelihood (the “loss”) and accuracy averaged over the training samples or testing samples from MNIST, with  γ = 105 being the maximal possible value of the random variable distributed uniformly over [−γ, γ] added to shares of the data

image

Figure 7: Value of the negative log-likelihood (the “loss”) averaged over the training data from MNIST, after convergence of the training iterations, as a function of the maximum value  γ of therandom variable distributed uniformly on [−γ, γ] added to the shares of the data

image

Figure 8: Value of the negative log-likelihood (the “loss”) averaged over the testing data from MNIST, as a function of the maximum value  γof the random variable distributed uniformly on [−γ, γ] added to the shares of the data

image

Figure 9: Accuracy of multinomial logistic regression averaged over the training or testing data from MNIST, as a function of the maximum value  γof the random variable distributed uniformly on [−γ, γ] added to the shares of the data

image

Table 5: Values of the negative log-likelihood (the “loss”) and accuracy averaged over the training samples or testing samples from data on forest cover type, with  γ = 105 being the maximal possible value of the random variable distributed uniformly over [−γ, γ] added to shares of the data

edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#covtype (for the multinomial logistic regression).

We predict one of 7 types of forest (or one of 2 for the binarized data) based on 10 integer-valued covariates (elevation, aspect, slope, horizontal and vertical distances to bodies of water, horizontal distance to roadways, horizontal distance to fire points, and hillshade at 9am, 12pm, and 3pm), as well as one-hot encodings of 4 types of wilderness areas and 40 types of soil. Thus, there are 54 covariates in all, including the one-hot encodings. (A one-hot encoding is a vector whose entries are all 0 except for one 1 in the position corresponding to the associated type.) For normalization, we subtract the mean from each of the integer-valued covariates (not from the one-hot encodings) and then divide by the maximum absolute value. We randomly permute and then partition the data into a training set of 500,000 samples and a testing set of the other 81,012 samples. We trained for 20 epochs (that is, 10,000 iterations) with 1,000 samples per minibatch. The learning rate for the binary classification was 3 (for identity link was 0.1) and for the multi-class (7-class) classification was 1. During training for all 7 classes, we supplemented each iteration of stochastic gradient descent with weight decay of 10−3, to be consistent with our training for all 10 classes of MNIST in the previous sub-subsection. This weight decay barely impacts accuracy yet makes the constant 5 that we subtract from the bias in the stochastic gradient descent shift almost all inputs of the softmax to be non-positive, as suggested in the last paragraph of Subsection 4.3 above.

Figure 10 displays the results of training and Figure 11 depicts the performance of the resulting trained model when applied to the testing set, both as a function of the width  γof the random variable distributed uniformly over [−γ, γ]; the figures show that  γ = 105 works well, in accordance with the analysis in Subsection 2.1 above. Table 5 details the results for  γ = 105; training in public yields the same results to within  ±0.001 of those reported in the table. In fact, even  γ = 106 worksperfectly fine for this application to machine learning. Logistic regression corresponds to the logit link; probit regression corresponds to the probit link. The value of the log-likelihood averaged over the testing set is reassuringly close to the average value over the training set, demonstrating good generalization from the training set to the testing set.

Figure 12 displays the results of training multinomial logistic regression for all 7 classes, together with applying the resulting trained model to the testing set, as a function of the maximum  γ ofthe random variable distributed uniformly over [−γ, γ]; the accuracy is excellent from  γ = 103to  γ = 106. The generalization from the training set to the testing set is perfect. At  γ = 105,the training loss is 0.705, the testing loss is 0.711, the training accuracy is 0.710, and the testing accuracy is 0.710; training in public produces the same results to within  ±0.001.

image

Figure 10: Value of the negative log-likelihood (the “loss”) averaged over the training data on forest cover type, after convergence of the training iterations, as a function of the maximum value  γ ofthe random variable distributed uniformly on [−γ, γ] added to the shares of the data

image

Figure 11: Value of the negative log-likelihood (the “loss”) averaged over the testing data on forest cover type, as a function of the maximum value  γof the random variable distributed uniformly on [−γ, γ] added to the shares of the data

image

Figure 12: Accuracy of multinomial logistic regression averaged over the training or testing data on forest cover type, as a function of the maximum value  γof the random variable distributed uniformly on [−γ, γ] added to the shares of the data

5.2.3 Deaths from horsekicks

We use Poisson regression to analyze the classical data from Ladislaus Bortkiewicz tabulating the numbers of deaths from horsekicks in 14 corps of the Prussian army for each of the 20 years from 1875 to 1894, available (with extensive discussion) in the work of [15]. This data is a canonical example of counts which follow the Poisson distribution. We consider four separate Poisson regressions, for the following sets of covariates: (0) no covariates, (1) one-hot encodings of the corps, (2) second-order polynomials of the years, and (3) concatenating the one-hot encodings of the corps and the second-order polynomials of the years. The one-hot encoding of a corps is a vector with 14 entries, one of which is 1 and 13 of which are 0; the position of the entry that is 1 corresponds to the associated corps. The second-order polynomials of the years arise from using as covariate vector a vector with 3 entries, the entries being the constant 1, the number of years beyond 1875, and the square of the number of years beyond 1875; Poisson regression considers linear combinations of these entries, thus forming quadratic functions of the years.

image

to (3) is 35.3; these deviances refer to the totals over all 280 samples, so are 280 times the average over the samples. The degrees of freedom corresponding to each of these changes is less than the decrease in deviance, indicating some minor heterogeneity in the data. (The baseline deviance generally has no statistical interpretation in terms of a universal distribution such as  χ2; changesin deviance when changing covariates are what matter.)

With 14 samples per minibatch, each iteration of stochastic gradient descent takes about 1.7 ×10−2 seconds when training in private for any of the four sets of covariates; the smallest set (0) takes about 0.5 × 10−2 seconds less per iteration than the largest set (3). When training in public, each iteration takes about 1.7 × 10−4 seconds for any of the four sets of covariates. Training in private thus takes about 100 times longer than training in public on standard machines in a cluster for software development at Facebook.

A simple, standard computation yields that the differential entropy, in bits, of a random variable Y distributed uniformly over [−γ, γ] is

image

Combining (1), (34), and the upper-bound on H(X + Y ) from the following lemma yields (2), completing the proof of Theorem 1.

Lemma 6. Suppose that X and Y are independent scalar random variables and  β and γare positive real numbers such that  |X| ≤ β < γ and Yis distributed uniformly over [−γ, γ]. Then,

image

where H denotes the differential entropy measured in bits (not nats), with equality attained in (35) when  X is βtimes a Rademacher variate, that is, when  X = βwith probability 1/2 and X = −βwith probability 1/2.

Proof. Denoting the cumulative distribution function of X by F, the probability density function of X + Y is

image

Combining  |X| ≤ βand the definition of a cumulative distribution function yields that F(x) = 0 for  x < −β and F(x) = 1 for x > β, so (36) becomes (as diagrammed in Figure 13)

image

image

−γ − β −γ −γ + β −β 0 β γ − β γ γ + βValue

Figure 13: Sketch of a visualization for understanding the proof of Lemma 6, in the case where the probability distribution of X is given by an even unimodal probability density

and the entropy in bits of X + Y is

image

The function  p log2(p)+(1−p) log2(1−p) is minimal for 0 < p < 1 at p = 1/2, so the integral in the right-hand side of (38) is minimal when  F(y) = 1/2 for |y| < β, in which case (38) becomes (35) with equality attained when  X is βtimes a Rademacher variate.

For any scalar random variables X, Y , and Z, the definition of mutual information states

image

and

image

where H denotes entropy. Furthermore, the subadditivity of entropy for any arbitrary random variables yields

image

Taking X, Y , and Z to be independent then yields

image

Combining (39)(43) yields (3), completing the proof of Theorem 2.

We suppose that X, Y , P, Q, and T are independent scalar random variables and  γis a positive real number such that  |X| ≤ 1 < 3 < γ, the random variable T is distributed uniformly over [−γ2, γ2], and Y , P, and Qare distributed uniformly over [−γ, γ]. Then, since (X − Y ) − (P − Q)is just the difference between  X − Y and P − Q, providing no more and no less information than X −Y and P −Qon their own without (X −Y )−(P −Q), and (P 2 −T)+2(P −Q)(X −P)+(X −P)2/2 is also merely a deterministic combination of the observations  P 2 − T, P − Q, and X − P,providing no more and no less information than  P 2 − T, P − Q, and X − Pon their own without (P 2 − T) + 2(P − Q)(X − P) + (X − P)2/2, the mutual information satisfies

image

Since  X − Q = (X − P) + (P − Q) and P − Q = (X − Q) − (X − P) establish a bijection between the pair  X − P and X − Qand the pair  X − P and P − Q, it follows that

image

Combining (3) and the fact that X, Y , Q, P, and T are independent yields

image

Combining (44)(46) and (47) from the following lemma, together with (2), yields (4), completing the proof of Theorem 3.

Lemma 7. Suppose that X, P, and T are independent scalar random variables and  γis a positive real number such that  |X| ≤ 1 < 3 < γ, the random variable P is distributed uniformly over [−γ, γ],and T is distributed uniformly over [−γ2, γ2]. Then,

image

where I denotes the mutual information measured in bits.

Proof. The proof begins with a string of identities, systematically simplifying (or re-expressing) their right-hand sides. Indeed, since  X2 − 2XP + P 2 is simply the square of  X − P, it follows that

image

Since  X2−2XP +T = (X2−2XP +P 2)−(P 2−T) and P 2−T = (X2−2XP +P 2)−(X2−2XP +T)establish a bijection between the pair  X2 − 2XP + P 2 and P 2 − Tand the pair  X2 − 2XP + P 2and  X2 − 2XP + T, it follows that

image

Since  X2 − 2XP + P 2 is simply the square of  X − P, it follows that

image

The chain rule for mutual information yields

image

The definition of mutual information states

image

where H denotes the differential entropy measured in bits. Since T is independent of X and P, the last term in the right-hand side of (52) is

image

The fact that T is distributed uniformly over [−γ2, γ2] yields via a simple, straightforward calculation that

image

We now upper-bound the first term in the right-hand side of (52), by defining

image

and noticing

image

That conditioning never increases entropy yields that the first term in the right-hand side of (52) satisfies

image

Combining (35) and the fact that T is distributed uniformly over [−γ2, γ2] yields that, maximizing over all random variables  S such that |S| ≤ β = 1 + 2γ(even dropping the constraint that S = X2 − 2XPin the maximization),

image

Combining (52)(58) yields that, maximizing over any random variable  X such that |X| ≤ 1,

image

Combining (48)(51) and (59) yields (47).

In this appendix, we review the approximation of odd functions via Chebyshev series, as summarized for general (not necessarily odd) functions in Sections 4–6 of [6]. Given a real-valued differentiable function  f on [−z, z] that is odd, that is,

image

for any real number  x such that −z ≤ x ≤ z, we can approximate f via its Chebyshev series, as follows. First, we select a sufficiently large positive integer n and compute

image

for  j = 1, 3, . . . , 2n −1; the approximation will converge as n increases. Having calculated  c1, c3,. . . ,  c2n−1from (61), we can compute a good approximation to f evaluated at any real number y such that  −z ≤ y ≤ z:

image

where  Tj(x) denotes the Chebyshev polynomial of degree j evaluated at x = y/z.

To calculate the right-hand side of (62) efficiently using only additions and multiplications involving the input

image

we evaluate the sums

image

for k = 1, 2, . . . , n, via the following recurrence:

image

and

image

started with

image

image

and

image

The final sum  s2n−1is equal to the right-hand side of (62), due to (63) and (64).

As discussed in any standard reference on modern machine learning, such as that of [3], minibatched stochastic gradient descent (SGD) calculates a vector w of parameters that minimizes the expected value  E(ℓ(X; w)), where ℓis a function of both w and a random vector X. The expected value is known as the “risk” and  ℓis known as the “loss.” SGD minimizes the expected loss without having direct access to the probability distribution of X, instead relying solely on samples of X (usually drawn at random from a so-called “training set”). Minibatched SGD generates a sequence of approximations via iterations,

image

where  ηis a positive real number known as the “learning rate,”  ∂/∂wdenotes the gradient with respect to  w (which is ℓ’s second argument), m is the number of samples in a so-called “minibatch,” and  x(1,k), x(2,k), . . . , x(m,k) denote samples from X. The iterations fail to minimize the expected loss when  ηis constant rather than decaying to 0 as the iterations proceed, but fixing  ηat a sensibly small value is a common practice in machine learning (and still ensures convergence to the minimum of the empirical risk under suitable conditions on  ℓ, that is, to the minimum of the average of the loss, averaged over all samples in a fixed, finite training set).

Minimizing the regularized objective function  E(ℓ(X; w)) + ρ∥w∥22/2, where ρis a nonnegative real number and  ∥w∥2denotes the Euclidean norm of w, is a common way of ensuring that w not become too large. Adding such regularization to SGD is also known as “weight decay,” and the iterations in (71) become

image

we used some weight decay for the multinomial logistic regression of the measured data in Subsection 5.2 (but used no weight decay for any other results reported above, nor did we use any weight decay for iterative updates to the so-called bias offsets in c from the following appendix).

As discussed in any standard reference on generalized linear models, such as that of [14], a generalized linear model regresses a random vector Y of so-called targets against a matrix X of so-called covariates via the model

image

where g is known as the “link function,” E(Y |X) is the conditional expectation of the vector Y of targets given X (the matrix of covariates), w is a vector of so-called “weights” (or “parameters”), and c is a vector whose entries are independent of the values of X, known as “biases.” Table 6 lists several special cases of generalized linear models. Given independent samples of pairs (x(1), y(1)),(x(2), y(2)), . . . , (x(n), y(n)), the standard method of fitting the vectors w of weights and c of biases is to minimize the negative of the natural logarithm of the likelihood, that is, to minimize the empirical risk  − 1n�nk=1 ln(p(y(k)|x(k); w, c)), where p(y(k)|x(k); w, c) denotes the probability (at the parameter values w and c) of observing  y(k) given x(k). Minimizing  −E(ln(p(Y |X; w, c))) via the minibatched stochastic gradient descent of the previous appendix is another (nearly equivalent) approach.

The probability distribution of Y given X for all of the generalized linear models considered in Table 6, except for probit regression, is a so-called “exponential family” of the form

image

where h is a nonnegative-valued function, and  ψis known as the “log partition function”:  ψ(θ) =∥θ∥22/2 for the normal distribution,  ψ(θ) = ln(1+exp(θ)) for the Bernoulli distribution, and  ψ(θ) =exp(θ) for the Poisson distribution. For all these cases, substituting (74) and taking the gradient with respect to  θof both sides of�p(y|x; w, c) dy = 1 yields after a straightforward calculation that∂ψ/∂θ = E(Y |x), where θ = xw + c; combined with (73) this yields that  g(∂ψ/∂θ) = θ, so the link g is the inverse of  ∂ψ/∂θ.

Combining (74) and the chain rule yields that the gradient with respect to w of ln(p(y|x; w, c)) is

image

and the gradient with respect to c of ln(p(y|x; w, c)) is

image

Thus, the stochastic gradient descent of the previous appendix requires nothing more than addition, matrix-vector multiplications, and evaluation of the inverse (∂ψ/∂θ) of the link g. Needless to say, the inverse of the identity function is the identity function, the inverse of ln is exp, and the inverse of the inverse of the cumulative distribution function Φ for the standard normal distribution is Φ. A simple calculation shows that the inverse of the logit function  g(µ) = ln(µ/(1 − µ)) is the standard logistic function  ∂ψ/∂θ = 1/(1 + exp(−θ)) and that the softmax detailed in Subsection 4.3 above inverts ln applied entrywise to a probability vector (the softmax simply applies exp entrywise and then normalizes to form a proper probability distribution).

The probability distribution for probit regression is an exponential family, but not of the form in (74); we handle this special case as detailed in Sub-subsection 5.1.3.

We would like to thank Shubho Sengupta and Andrew Tulloch.

image

Table 6: Special cases of generalized linear models, where  N(µ, I) denotes the (possibly multivariate) normal distribution with mean  µand variance-covariance matrix being the identity matrix I, and Φ is the cumulative distribution function for the standard normal distribution (so Φ−1 is thecorresponding quantile function, the inverse of Φ); “linear least squares” is also known as “ordinary least squares” or the “general” linear model — a special case of the “generalized” linear model

[1] Beaver, D. (1991) Efficient multiparty protocols using circuit randomization. in Advances in Cryptology — Proceedings of the 1991 Annual International Cryptology Conference, ed. by J. Feigenbaum, vol. 576 of Lecture Notes in Computer Science, pp. 420–432. Springer-Verlag.

[2] Bogdanov, D., Laur, S. & Willemson, J. (2008) Sharemind: a framework for fast privacy-preserving computations. in Computer Security — Proceedings of the 2008 European Symposium on Research in Computer Security, ed. by S. Jajodia, & J. Lopez, vol. 5283 of Lecture Notes in Computer Science, pp. 192–206. Springer-Verlag.

[3] Bottou, L., Curtis, F. E. & Nocedal, J. (2018) Optimization methods for large-scale machine learning. SIAM Rev., 60(2), 223–311.

[4] Chang, C.-C. & Lin, C.-J. (2011) LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3), 1–27, Article no. 27.

[5] Cover, T. M. & Thomas, J. A. (2006) Elements of Information Theory. Wiley-Interscience, 2nd edn.

[6] Curry, B. (2006) Parameter redundancy in neural networks: an application of Chebyshev polynomials. Comput. Management Sci., 4(3), 227–242.

[7] Dua, D. & Graff, C. (2019) UCI Machine Learning Repository. UC-Irvine, School of Information and Computer Sciences, Available at http://archive.ics.uci.edu/ml.

[8] Gunning, D., Hannun, A., Ibrahim, M., Knott, B., van der Maaten, L., Reis, V., Sengupta, S., Venkataraman, S. & Zhou, X. (2019) CrypTen: a new research tool for secure machine learning with PyTorch. https://ai.facebook.com/blog/ crypten-a-new-research-tool-for-secure-machine-learning-with-pytorch.

[9] Guo, C.-H. & Higham, N. J. (2006) A Schur-Newton method for the matrix pth root and its inverse. SIAM J. Matrix Anal. Appl., 28(3), 788–804.

[10] Higham, N. J. (2005) The scaling and squaring method for the matrix exponential revisited. SIAM J. Matrix Anal. Appl., 26(4), 1179–1193.

[11] IEEE Computer Society Microprocessor Standards Committee (2019) IEEE Standard for Floating-Point Arithmetic. Discussion Paper 754-2019, IEEE.

[12] Kenney, C. & Laub, A. J. (1991) Rational iterative methods for the matrix sign function. SIAM J. Matrix Anal. Appl., 12(2), 273–291.

[13] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M. & Huang, F. J. (2006) A tutorial on energy-based learning. in Predicting Structured Data, ed. by G. Bakir, T. Hofmann, B. Sch¨olkopf, A. J. Smola, B. Taskar, & S. V. N. Vishwanathan, Neural Information Processing, pp. 191–246. MIT Press.

[14] McCullagh, P. & Nelder, J. A. (1989) Generalized Linear Models, vol. 37 of CRC Monographs on Statistics and Applied Probability. Chapman and Hall, 2nd edn.

[15] Stigler, S. M. (2019) Data have a limited shelf life. Harvard Data Science Review, 1(2), 1–23, Available at https://hdsr.mitpress.mit.edu/pub/iu26pfw1.


Designed for Accessibility and to further Open Science