Closure Properties for Private Classification and Online Prediction

2020·Arxiv

Abstract

Abstract

Let H be a class of boolean functions and consider a composed class that is derived from H using some arbitrary aggregation rule (for example, may be the class of all 3-wise majority-votes of functions in H). We upper bound the Littlestone dimension of in terms of that of H. As a corollary, we derive closure properties for online learning and private PAC learning.

The derived bounds on the Littlestone dimension exhibit an undesirable exponential dependence. For private learning, we prove close to optimal bounds that circumvents this suboptimal dependency. The improved bounds on the sample complexity of private learning are derived algorithmically via transforming a private learner for the original class H to a private learner for the composed class . Using the same ideas we show that any (proper or improper) private algorithm that learns a class of functions H in the realizable case (i.e., when the examples are labeled by some function in the class) can be transformed to a private algorithm that learns the class H in the agnostic case.

1 Introduction

We study closure properties for learnability of binary-labeled hypothesis classes in two related settings: online learning and differentially private PAC learning.

Closure Properties for Online Learning. Let H be a class of experts that can be online learned with vanishing regret. That is, there exists an algorithm A such that given any sequence of T prediction tasks, the number of false predictions made by A is larger by at most R(T) = o(T) than the number of false predictions made by the best expert in H.

Consider a scenario where the sequence of tasks is such that every single expert in H predicts poorly on it, however there is a small unknown set of experts that can predict well by collaborating. More formally, there is an aggregation rule such that the combined expert exhibits accurate predictions on a significant majority of the tasks. For example, a possible aggregation rule G could be the majority-vote of the k experts. Since we assume that the identities of the k experts are not known, it is natural to consider the class which consists of all possible G-aggregations of k experts from H. We study the following question:

Question 1.1. Can the optimal regret with respect to be bounded in terms of that of H?

The Littlestone dimension is a combinatorial parameter that determines online learnability [Littlestone, 1987, Ben-David et al., 2009]. In particular, H is online learnable if and only if it has a finite Littlestone dimension , and the best possible regret R(T) for online learning H satisfies

Furthermore, if it is known that if one of the experts never errs (a.k.a the realizable setting), then the optimal regret is exactly (The regret is called mistake-bound in this context.)

Thus, the above question boils down to asking whether the Littlestone dimension of is bounded by a function of the Littlestone dimension of H. One of the two main results in this work provides an affirmative answer to this question (Theorem 2.1).

We next discuss a variant of this question in the setting of Differentially Private (DP) learning. The two settings of online and DP-learning are intimately related (see, e.g., Bun et al. [2020], Abernethy et al. [2017], Joseph et al. [2019], Gonen et al. [2019]). In particular, both online learning and DP-learning are characterized by the finiteness of the Littlestone dimension [Littlestone, 1987, Ben-David et al., 2009, Bun et al., 2015, Alon et al., 2019, Bun et al., 2020].

Closure Properties for Differentially Private Learning. Imagine the following medical scenario: consider a family H of viruses for which there is an algorithm A that can learn to diagnose any specific virus given enough labeled medical data. Further assume that A has the desired property of being differentially private learning algorithm as defined by [Kasiviswanathan et al., 2011]; that is, it is a PAC learning algorithm in which the privacy of every patient whose data is used during training is guarded in the formal sense of differential privacy [Dwork et al., 2006b].

Assume that an outbreak of a deadly disease has occurred in several locations all over the world and that it is known that is caused by some relatively small, yet unknown group of viruses from H. That is, our prior information is that there are unknown viruses for a relatively small k such that for some rule G. For example, G could be the OR function in which case and only if the patient is infected with at least one of the viruses

It would be highly beneficial if one could use the algorithm A to diagnose in an automated fashion. Moreover, doing it in a private manner could encourage health institutions in the different locations to contribute their patients’ data. This inspires the following question:

Question 1.2. Can one use the algorithm A to privately learn to diagnose ? How does the sample complexity of this learning task scale as a function of G?

Differential Privacy, Online Learning, and the Littlestone Dimension. Question 1.2 and Question 1.1 are equivalent in the sense that both online learning and DP-learning are characterized by the finiteness of the Littlestone dimension [Littlestone, 1987, Ben-David et al., 2009, Bun et al., 2015, Alon et al., 2019, Bun et al., 2020].

Note however that unlike the bounds relating the Littlestone dimension to online learning, which are tight up to logarithmic factors (see (1)), the bounds relating the Littlestone dimension and DP-learning are very far from each other; specifically, if d denotes the Littlestone dimension of H then the lower bound on the sample complexity of privately learning H scales with [Bun et al., 2015, Alon et al., 2019], while the best known2 upper bound scales with exp(d) [Bun et al., 2020].

Thus, while our solution to Question 1.1 yields an affirmative answer to Question 1.2, the implied quantitative bounds are far from being realistically satisfying. Specifically, every finite H is learnable with privacy using O(log |H|) samples [Kasiviswanathan et al., 2011], and so if H is finite and not too large, the bounds implied by the Littlestone dimension are not meaningful. We therefore focus on deriving effective bounds for private learning, which is the content of Theorem 2.3 (see Theorem 7.1 for a precise statement).

Littlestone Classes. It is natural to ask which natural hypothesis classes have bounded Littlestone dimension. First, it holds that , so for finite classes the Littlestone dimension scales rather gracefully with their size.

There are also natural infinite Littlestone classes: for example, let the domain dimensional vector space over some field consist of all affine subspaces of V of dimension . It can be shown here that Ldim(H) = d. (For example, the class of all lines in Littlestone dimension 1.) A bit more generally, any class of hypotheses that can be described by polynomial equalities of a bounded degree has bounded Littlestone dimension. (Observe that if one replaces “equalities” with “inequalities” then the Littlestone dimension may become unbounded, however the VC dimension remains bounded (e.g. Halfspaces).) We note in passing that this can be further generalized to classes that are definable in stable theories, which is a deep and well-explored notion in model theory. We refer the reader to Chase and Freitag [2019], Section 5.1 for such examples.

Organization. Formal statement of our main results and description of our techniques appears in Section 2, specifically, a short overview of the proofs is given in Section 2.1. Definitions and background results are provided in Section 3. The complete proofs appear in the rest of the paper. Closure properties for Littlestone classes is proved in Section 4. The effective bounds for private learning are given in Section 5 and Sections 6 and 7. We note that each of these parts can be read independently of the other.

2 Main Results and Techniques

Let be a boolean function and let be hypothesis classes. Denote by the following class For example, if is the class of all pairwise intersections/conjunctions of a function from and a function from

Theorem 2.1 (A Closure Theorem for the Littlestone Dimension). Let be a boolean function, let be classes, and let

where conceals polynomial factors in log k and log d.

In particular, if . Consequently, if each of the ’s is online learnable then is online learnable. We comment that if the aggregating function G is simple then one can obtain better bounds. For example, if G is a majority-vote, a k-wise OR, or a k-wise AND function then a bound of holds. (See Section 4.2.2.)

Another combinatorial parameter which arises in the relationship between online and DP learning is the threshold dimension: a sequence threshold-shattered by H if there are such that if and only if threshold dimension, T(H) is the maximum size of a sequence that is threshold-shattered by H. The threshold dimension plays a key role in showing that DP learnable classes have a finite Littlestone dimension [Alon et al., 2019]. A classical theorem by Shelah [1978] in model theory shows that the Littlestone and the threshold dimensions are exponentially related.3 In particular if and only if . (See Theorem 3.2 in the preliminaries section.) We prove the following closure theorem in terms of the threshold dimension.

Theorem 2.2 (A Closure Theorem for the Threshold Dimension). Let be a boolean function, let be classes, and let

Moreover, an exponential dependence in t is necessary: for every there exists a class H such that

Note that the bounds in Theorem 2.1 and Theorem 2.2 escalate rapidly with k (the arity of G) and with t. It will be interesting to determine tight bounds.

By Alon et al. [2019], Bun et al. [2020], Theorem 2.1 also implies closure properties for DP-learnable classes. However, the quantitative bounds are even worse: not only do the bounds on the Littlestone dimension of escalate rapidly with d and k, also the quantitative relationship between the Littlestone dimension and DP-learning sample complexity is very loose, and the best bounds exhibit a tower-like gap between the upper and lower bounds. For example, if the class of functions H is finite and its Littlestone dimension is , then the bound of Theorem 2.1 is most likely to be much worse than the generic application of the exponential mechanism, whose sample complexity is the logarithm of the size of the class. We therefore explore the closure properties of differentially-private learning algorithms directly and derive the following bound.

Theorem 2.3 (A Closure Theorem for Private Learning (informal)). Let boolean function. Let be classes that are -differentially private and accurate learnable with sample complexity respectively. Then, -private and -accurate learnable with sample complexity

The exact quantitative satement of the results appears in Theorem 7.1. We remark that closure properties for pure differentially-private learning algorithms (i.e., when ) are implied by the characterization of [Beimel et al., 2019]. Similarly, closure properties for non-private PAC learning are implied by the characterization of their sample complexity in terms of the VC dimension and by the Sauer-Shelah-Perles Lemma [Sauer, 1972]. However, since there is no tight characterization of the sample complexity of approximate differentially-private learning algorithms (i.e., when ), we prove Theorem 2.3 algorithmically by constructing a (non-efficient) learning algorithm for from private learning algorithms for

Beimel et al. [2015] proved that any proper private learning algorithm in the realizable case4 can be transformed into an agnostic5 private learning algorithm, with only a mild increase in the sample complexity. We show that the same result holds even for improper private learning (i.e., when the private learning algorithm can return an arbitrary hypothesis).

Theorem 2.4 (Private Learning Implies Agnostic Private Learning). For every , and every concept class H, if there exists a -differentially private -accurate PAC learner for the hypothesis class H with sample complexity m, then there exists an -differentially private learner for H with sample complexity

Furthermore, if the original learner is proper, then the agnostic learner is proper.

We obtain this result by showing that a variant of the transformation of [Beimel et al., 2015] also works for the improper case; we do not know if the original transformation of [Beimel et al., 2015] also works for the improper case. Our analysis of the transformation for the improper case is more involved than the analysis for the proper case.

2.1 Technical Overview

2.1.1 Closure for Littlestone Dimension

Our proof of Theorem 2.1 exploits tools from online learning. It may be instructive to compare Theorem 2.1 with an analogous result for VC classes: a classical result by Dudley [1978] upper bounds the VC dimension of is the VC dimension of . The argument uses the Sauer-Shelah-Perles Lemma [Sauer, 1972] to bound the growth-rate (a.k.a. shatter function) of some : indeed, if we let then by the definition of the shatter function, , which implies that as stated. It is worth noting that a notion of growth-rate as well as a corresponding variant of the Sauer-Shelah-Perles Lemma also exist for Littlestone classes [Bhaskar, 2017, Chase and Freitag, 2018]. However we are not aware of a way of using it to prove Theorem 2.1.

We take a different approach. We first focus on the case where G is a majority-vote. That is, the class consists of all k-wise majority-votes of experts . We bound the Littlestone dimension of H by exhibiting an online learning algorithm A that learns H in the mistake-bound model with at most mistakes. The derivation of A exploits fundamental tools from online learning such as the Weighted Majority Algorithm by Littlestone and Warmuth [1989] and Online Boosting [Chen et al., 2012, Beygelzimer et al., 2015, Brukhim et al., 2020].

Then, the bound for a general is obtained by expressing G as a formula which only uses majority-votes and negations gates. The exponential dependence in k in the final bound is a consequence of the formula-size which can be exponential in k. We do not know whether this exponential dependence is necessary.

2.1.2 Closure for Threshold Dimension

Our proof of Theorem 2.2 is combinatorial. First, note that an inferior bound follows from Theorem 2.1, using the fact that the Littlestone and threshold dimensions are exponentially related (see Theorem 3.2). However this approach yields a super-exponential bound on

The bound in Theorem 2.2 follows by arguing contra-positively that if then is also “largish” for some . Specifically, if . This is shown using a Ramsey argument that asserts that any large enough sequence that is threshold-shattered by must contain a relatively large subsequence that is threshold-shattered by one of the ’s. Quantitatively, if then there must be a subsequence that is threshold-shattered by one of the

This upper bounds . It is worth noting that, in contrast with Theorem 2.1, an exponential dependence here is inevitable: we prove in Theorem 2.2 that for any t there exists a class This lower bound is achieved by a randomized construction.

2.1.3 Private learning Implies Agnostic Private Learning

We start by describing the transformation of [Beimel et al., 2015] from a proper private learning algorithm of a class H to an agnostic proper private learning algorithm for H. Assume that there is a private learning algorithm A for H with sample complexity m. The transformation takes a sample S of size O(m) and constructs all possible behaviors H of functions in H on the points of the sample (ignoring the labels). By the Sauer-Shelah-Perles Lemma, the number of such behaviors is at most. Then, it finds using the exponential mechanism a behavior that minimizes the empirical error on the sample. (The exponential mechanism is guaranteed to identify a behavior with small empirical error because the number of possible behaviors is relatively small.) Finally, the transformation relabeles the sample applies A on the relabeled sample. If A is a proper learning algorithm then, by standard VC arguments, the resulting algorithm is an agnostic algorithm for H. The privacy guarantees of the resulting algorithm are more delicate, and it is only O(1)-differentially private, even if -differentially private for a small (The difficulty in the privacy analysis is the set of behaviors H is data-dependent. Therefore, the privacy guarantees of the resulting algorithms are not directly implied by those of the exponential mechanism, which assume that the set of possible outcomes is fixed and data-independent.)

When A is improper, we cannot use VC arguments to argue that the resulting algorithm is an agnostic learner. We rather use the generalization properties of differential privacy (proved in [Dwork et al., 2015, Bassily et al., 2016, Rogers et al., 2016, Feldman and Steinke, 2017, Nissim and Stemmer, 2017, Jung et al., 2020]): if a differentially private algorithm has a small empirical error on a sample chosen i.i.d. from some distribution, then it also has a small generalization error on the underlying distribution (even if the labeling hypothesis is chosen after seeing the sample). There are technical issues in applying these results in our case that require some modifications in the transformation.

2.1.4 Closure for Differentially Private Learning

We prove Theorem 2.3 by constructing a private algorithm for the class private learning algorithms for the classes . Algorithm uses the relabeling procedure (the one that we use to transform a private PAC learner into a private agnostic learner) in a new setting.

The input to is a sample labeled by some function in . The algorithm finds hypotheses in steps, where in the i’th step, the algorithm finds a hypothesis have a completion to a hypothesis with small error (assuming that have a good completion).

Each step of is similar to the algorithm for agnostic learning described above. That is, in the first relabels the input sample S using some in a way that guarantees completion to a hypothesis with small empirical error. The relabeling h is chosen using the exponential mechanism with an appropriate score function. The relabeled sample is then fed to the private algorithm for the class to produce a hypothesis and then the algorithm proceeds to the next step i + 1. As in the algorithm for agnostic learning, the proof that the hypothesis returned by the algorithm is easier when the private algorithms for are proper and it is more involved if they are improper.

3 Preliminaries

This section is organized as follows: Section 3.1 contains basic definitions and tools related to the Littlestone dimension and Section 3.2 contains basic definitions and tools related to private learning.

3.1 Preliminaries on the Littlestone Dimension

The Littlestone dimension is a combinatorial parameter that characterizes regret bounds in online learning [Littlestone, 1987, Ben-David et al., 2009]. The definition of this parameter uses the notion of mistake-trees: these are binary decision trees whose internal nodes are labeled by elements of X. Any root-to-leaf path in a mistake tree can be described as a sequence of examples is the label of the i’th internal node in the path, and ’th node in the path is the right child of the i’th node, and otherwise . We say that a tree T is shattered by H if for any root-to-leaf path . The Littlestone dimension of H, denoted by Ldim(H), is the depth of the largest complete tree that is shattered by H.

Definition 3.1 (Subtree). Let T be labeled binary tree. We will use the following notion of a subtree depth h of T by induction on h:

1. Any leaf of T