b

DiscoverSearch
About
My stuff
Decidability of Sample Complexity of PAC Learning in finite setting
2020·arXiv
Abstract
Abstract

In this short note we observe that the sample complexity of PAC machine learning of various concepts, including learning the maximum (EMX), can be exactly determined when the support of the probability measures considered as models satisfies an a-priori bound. This result contrasts with the recently discovered undecidability of EMX within ZFC for finitely supported probabilities (with no a priori bound). Unfortunately, the decision procedure is at present, at least doubly exponential in the number of points times the uniform bound on the support size.

It has been recently discovered [Ben-David et al. 2019] that machine learnability can be undecidable within ZFC (Zermelo-Fraenkel with axiom of Choice). For a given degree of approximation  ǫand value of residual probability  δ, learnability of a concept over a certain class P of probabilities consists of the existence of a learning algorithm, a learner, and a number m such that with m independent observations from anyone of the possible probabilities  P ∈ P, the concept is approximated by the learner within an error of at most  ǫwith P probability larger than 1  − δ. [Ben-David et al. 2019] shows that there are situations in which learnability of a concept is independent of ZFC axioms systems, even restricting to finitely supported probabilities. It thus becomes relevant to explore conditions which ensure decidability of learning.

To this extent, it is convenient to look at things from the opposite side: given a sample size m, an algorithm is not an (ǫ, δ)-learner if there exists a probability P, in the class P, which violates, with m independent observations, the degree of approximation with a probability exceeding the bound on residual probabilities. This way the problem has been turned into that of existence of a probability, within a certain class, satisfying certain additional conditions, besides those required for it to be a probability (and to be in the prescribed class).

To explore decidability, we consider the so called ”discretization trick” ( [Shalev-Shwartz and Ben-David 2014] - Remark 4.1), according to which in virtually all concrete applications various limitations, such as using a computer to handle the data, introduce an a-priori bound on the number of possible states of the system. We then show that in this case, learnability can be expressed in terms of polynomial relations for the probabilities of specific events; we can then refer to the Tarski-Seidenberg Theorem for real closed fields [Bochnak, Coste, Roy 1998], which shows that existence of a solution for any finite set of polynomial relations is decidable. This provides an explicit, albeit computationally very expensive, algorithm for the determination of the sample complexity. Very accurate bounds have been developed for the sample complexity [Hanneke 2016], but, in view of [Ben-David et al. 2019], they provide no guarantee of decidability.

It is interesting to realize that the Tarski-Seidenberg Theorem does not indicate how to find a probability violating a given tentative learner G, nor could it provide any exact method for it, as in general there is no finite algorithm to determine solutions of polynomial equations of degree greater than or equal to 5. It is also interesting to notice that the theories of natural [G¨odel 1931] or rational numbers [Robinson 1949] are not decidable; so, if we insisted in restricting to probabilities taking rational values it would not be clear if learnability is decidable. On the other hand, we are not interested in any such restriction, or in the exact determination of the probabilities, but only in their existence, in order to exclude a tentative learner G, or non existence, in order to assess that G is a learner; and that’s exactly what the Tarski-Seidenberg Theorem guarantees. In other words, decidability is guaranteed if we discretize the inputs, but not the values of the modeling probabilities.

As a related topic, we mention that the issue of existence of a probability satisfying certain requirements can be given an interpretation in terms of Logic. In this context, one develops first the syntax of a logic, i.e. the allowed symbols and formulas; then, the collection of models in which the formulas can be interpreted represents the semantics of a logic. When models are probabilities of a certain class, then a formula is valid if it is true for all probabilities of the class: a logic for finite probabilities with rational coefficients has been developed in [Fagin, Halpern and Megiddo 1990]. Learnability can then be interpreted in terms of validity of the formulas expressing the fact that a certain G is a learner. We briefly discuss and exploit this connection further in Section 4.

image

Sample complexity of agnostic PAC learning is defined as follows. Let X, Y be sets, indicating the set of features and labels, respectively, and let H = {h : X → Y}indicate a hypothesis set. To evaluate a hypothesis, we introduce a loss  L : Y × Y → R+, and a loss function  ℓ : H × X × Ydefined by ℓ(h, (x, y)) := L(h(x), y). When a probability P is defined on  X × Y, and X, Y are two random variables taking values in X and Y, respectively, with joint distribution P, the average loss of a hypothesis h is  EP (h) :=  EP (ℓ(h, (X, Y)). For m ∈ N, a potential m-learner is a function G : (X × Y)m → H, where we denote hG,(x1,y1),...,(xm,ym):= G(((x1, y1), . . . , (xm, ym))). Given a class  P∗of probabilities on (possibly suitable subsets of)  X × Y, and  ǫ, δ >0, an (m, ǫ, δ)-learner with respect to  P∗is a potential m-learner G such that

image

for all  P ∈ P∗. The class H is agnostically PAC learnable with respect to  P∗if for every 0  < ǫ, δ ∈ Rthere exists  m(ǫ, δ) such that for all  m ≥ m(ǫ, δ) there exists an (m, ǫ, δ)-learner G. The sample complexity of agnostically PAC learning the class H with respect to  P∗is the minimum  mHof such  m(ǫ, δ)’s.

Theorem 2.1. The sample complexity of agnostic PAC learnability with respect to the class  P∗of all probabilities with uniformly bounded support is decidable.

Proof. Suppose the support of each probability is bounded by some uniform constantn, and consider first fixed finite sets X and Y such that  |X × Y| ≥n. In this case, the hypothesis class  H = {h′|h′:  X → Y}is finite as well. A finite hypothesis class is learnable ( [Shalev-Shwartz and Ben-David 2014] Cor 4.6), hence for  ǫ, δ >0 there exists  m(ǫ, δ) such that for all  m ≥ m(ǫ, δ) (1) holds for some G.

Next, for each  m < m(ǫ, δ), potential m-learner G, and hypothesis  h′ ∈ H, we say that G is not a (m, ǫ, δ)-learner of  h′if there exists a probability P on X × Ysuch that

image

notice that G is not a (m, ǫ, δ)-learner if there is an  h′ ∈ Hsuch that G is not a learner of  h′; we also say that a probability P satisfying (2) violates G. Condition (2) is polynomial in the probabilities P(A)’s of some events  A ⊆(X × Y)m, in the following sense: for j = 1, . . . , m, let (X × Y)(j)be a copy of  X × Y, and consider events  A(j)(x,y)=  {Xj = x, Yj = y}, (x, y) ∈ (X × Y)(j), where  Xj, Yjare the realizations of the j-th trial; next, let  p(j)(x,y):=  P(A(j)(x,y)). Then, given  h′, the existence of a probability P satisfying (2) is easily seen to

image

      

System (3) is written in terms of unknowns  p(j)(x,y)’s using polynomials and in- dicator functions. In order to apply the results of the next section, we need to eliminate the indicator functions. This can be done by considering subsets A ⊆ (X × Y)m, and then observing that there is a solution to (3) if and only if there is a solution to at least one of the systems in the following collection labeled by  A :⊆ (X × Y)m,

image

Since  EPis an expected value, hence a linear condition, all the relations in each of the systems of the form (4) are polynomial in the variables  p(j)(x,y)’s. We then have at most 2|X ×Y|m(|X × Y|m(ǫ,δ)+  m(ǫ, δ)|X × Y| + m+ 1) polynomial conditions to check. For each  m < m(ǫ, δ), this is is decidable within the theory of real-closed fields by Theorem 3.1 below. Hence the sample complexity of agnostic PAC learnability, with respect to set  P∗of all probabilities on  X × Y, is decidable.

Finally, consider any setX, a finite set of labels ˜Y, and one of the finite subset ˜X ⊆Xsuch that | ˜X טY| ≤n; ˜X טY is viewed as a possible support of a probability violating a potential m-learner G. Consider also an injective maps from ˜X to X, and from ˜Y to Y, respectively, where X and Y are the fixed sets considered above. The action of G on observations from ˜X טY and the existence of a probability P on ˜X טY violating G is determined by systems of the form (4), which are preserved by the above injective maps (assigning probability zero to all points in  X × Ynot in the image of ˜X טY). So, whether G is a learner or not for given  m, ǫ, δcan be determined by the fact that the there is a learner or not on  X × Y, which we have seen is decidable. Hence, sample complexity of agnostic PAC learnability with respect to the class  P∗of all probabilities with support uniformly bounded byn is decidable

image

In general, the exact value of the sample complexity can be only determined by a systematic examination, which is guaranteed to end by the above theorem.

Example 1. Consider learning the maximum in a binary space X, which we can then assume to be X = {0, 1}. Suppose that the hypothesis class is H = {{0}, {1}} and we use the ERM learner.

For  ǫ= 1/3 and  δ= 1/3 we have  mH= 1, while the standard upper bound based on Hoeffding theorem gives  mH ≤ ⌈2 log (2|H|/δ)/ǫ2⌉= 45.

For  ǫ= 1/10, δ= 1/100, the standard upper bound gives  mH ≤1199; some known lower bounds give  mH ≥41, and there is a matching upper bound but with an unkonwn constant [Hanneke 2016]; the sample complexity turns out to be  mH= 539.

lems in finite probabilities by Tarski-Seidenberg Theorem

System (4) is a special case of a general situation which occurs often in elementary probability: there are an unknown probability P; a finite number of events,  A1, . . . , An; and then a finite number R of polynomial relations that are to be satisfied by the probabilities of either the  Ai’s or some of their boolean combinations.

Possibly using the disjunctive normal form, one can always reduce these problems to a collection of polynomial relations, equalities and inequalities, in variables which represent the probabilities of a finite set S (or, equivalently, in terms of the probabilities of the atoms of the normal form).

Expressed in general terms, we arrive at a system of polynomial relations in the variables  pα, α ∈ S = {s1, . . . , s|S|}of the form

image

for some  k ∈ N, where  gr’s are polynomials, p = (ps1, . . . , ps|S|), and  ⊳stands for either of  ≥, =, ̸=.

Algebraic Geometry has developed the appropriate tools to decide if such a system has a solution (see, e.g.. [Bochnak, Coste, Roy 1998] or [Basu, Pollack, Roy 2006]).

Theorem 3.1. The existence and nonexistence of probabilities satisfying system (5), is decidable.

Proof. System (5) determines a semi-algebraic set, which is nonempty if and only if there are solutions satisfying all the equations. Whether a semi-algebraic set is empty or not is decidable with the following decision procedure. First, by the Tarski-Seidenberg Theorem a semi-algebraic set in  Rk+1, k ≥ 2,is non empty

if and only if its projection on  Rkis non empty (see e.g. [Basu, Pollack, Roy 2006], Theorem 2.76); iterating, this procedure reduces the problem to semi-algebraic sets in R. For these, every semi-algebraic set can be decomposed in finitely many basic semi-algebraic sets of the form  {x ∈ R|P(x) = 0, Q(x) > 0 for all  Q ∈ Q}, where P is a polynomial, and Q is a collection of polynomials. Finally, whether each basic semi-algebraic set is non empty can be determined by a General Law of Signs, which consists of checking the signs of suitable combinations of the coefficients of the polynomials (see e.g. [Basu, Pollack, Roy 2006], Lemma 2.74)

. The last step is similar to the methods in Sturm’s Theorem or Descartes Law of Signs.

It is shown in [Fagin, Halpern and Megiddo 1990], Theorem 5.3 that when the coefficient of the polynomials are rational, as they would be in any implementation, there is a procedure, for deciding if a polynomial weight formula is satisfi-able in a (finite) probability space, that runs in polynomial space. It is then easy to see that each of the systems (4) can be expressed as a polynomial weight formula in the language of [Fagin, Halpern and Megiddo 1990], Chapter 5: as set of primitive propositions we take Φ = (X ×Y)mwith boolean operations defined as usual for subsets of (X × Y)m; weight terms are w(((x1, y1), . . . , (xm, ym))) := p(x1,y1),...,(xm,ym), with linear operations and multiplications allowed to make formulas. The semantics to these weight formulas is then given by probabilities on the set (X × Y)m, and hence [Fagin, Halpern and Megiddo 1990], Theorem 5.3 applies to the decision problem of each of the systems (4).

In terms of number of arithmetic operations, on the other hand, the implementation of Tarski-Seidenberg elimination and the General Law of Signs has very high complexity; a slightly better version is cylindrical decomposition (see e.g. [Basu, Pollack, Roy 2006], Ch. 5), which is implemented in various software, but remains doubly exponential in the number of variables and of equations: for the System (5) it takes (km)2O(S)[Basu 2017] operations to decide whether a solution exists. So, the direct calculation of the sample complexity using this method is accessible only for problems with a very small a-priori boundn..

Expressed in terms of the number of pixels and colors in an image the number of arithmetic operations needed to determine the sample complexity of learning a hypothesis class H would be a quadruple exponential; something of the order of  |H|22105mfor a 64  ×64, 16-color image and a sample of size m.

We make noe a partial exploration of the source of the undecidability found in [Ben-David et al. 2019] when learning is seen from the point of view of existence/nonexistence of probabilities satisfying suitable conditions. This examination is hindered in [Ben-David et al. 2019], as in that paper learning is equivalently expressed in terms of compression schemes. The onset of undecidability is partially elucidated by the following.

Corollary 5.1. Let G be a (m, 1/3, 1/3)-learner of EMX for finitely supported probabilities defined on a model M of ZFC satisfying CH; when extended to a model  M ′containing M and satisfying  ¬CH, Gdetermines a system of the form (4), with m =m, which admits a solution.

One example of such extension is obtained from the use of the forcing method [Cohen 1963].

Proof. Consider a model M of ZFC satisfying CH. It is shown in [Ben-David et al. 2019] that there is an m and a (m, 1/3, 1/3) learner G of EMX over the collection of finitely supported probabilities in [0, 1]. For any given finite collection s1, . . . , sn ∈[0, 1] which could be used as support of a finite probability violating G, G determines a finite number of systems of the form (4), with m =m, none of which has a solution (since G is a learner).

Consider now an extension  M ′of M satisfying  ¬CH. Suppose the learner G is extended to the sequences ((x1, y1), . . . , (xn, yn)) such that some of the  xi’s do not belong to M. Then the existence of a probability violating the extension of G is also determined by systems of the form (4), with m =m, but now there must be a solution for at least one of such systems, as the extension of G cannot be a learner.

image

To summarize the results of the paper, we have shown that, aside from the very high computational complexity of the decision procedure, the exact determination of the sample complexity in agnostic PAC learning, including EMX, is decidable under the ”discretization trick” (i.e. when the probabilities are known to be supported on a finite set with an a-priori bounded size). This result contrasts with the undecidability of learning discovered in [Ben-David et al. 2019] for learning the maximum with  ǫ = δ= 1/3 with respect to probabilities supported on a finite set (whose size has no a priori bound). We have also investigated the mechanism by which a learner developed in a model satisfying CH fails in any extension to a model in which CH ceases to hold.

[Ben-David et al. 2019] Shai Ben-David, Pavel Hrubesss, Shay Moran, Amir Shpilka, and Amir Yehudayoff. Learnability can be undecidable. Nature Machine Intelligence 1, 1 (2019), 44-48.

[Bochnak, Coste, Roy 1998] Bochnak, Jacek; Coste, Michel; Roy, Marie-Franccoise. Real Algebraic Geometry. Translated from the 1987 French original. Revised by the authors. Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)], 36. Springer-Verlag, Berlin, 1998.

[Basu 2017] Saugata Basu. 2017. ALGORITHMS IN REAL ALGEBRAIC GEOME- TRY: A SURVEY. Panoramas & Synthses, 51, 2017, 107-153.

[Basu, Pollack, Roy 2006] Saugata Basu, Richard Pollack, and Marie-Franccoise Roy. 2006. Algorithms in Real Algebraic Geometry (Algorithms and Computation in Mathematics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

[Cohen 1963] Cohen, Paul J., 1963. The Independence of the Continuum Hypothesis”. Proceedings of the National Academy of Sciences of the United States of America. 50 (6): 1143-1148.

[E2006] Richard L. Epstein Classical Mathematical Logic: The Semantic Foundations of Logic, Princeton University Press.

[Fagin, Halpern and Megiddo 1990] Fagin, R., Halpern, J. H. and Megiddo N: A Logic for Reasoning about Probabilities, Inform. and Comput., 87, Nos. 1/2, 1990

[G¨odel 1931] Kurt G¨odel (1931), ”¨Uber formal unentscheidbare Stze der Principia Mathematica und verwandter Systeme, I.” Monatshefte f¨ur Mathematik und Physik 38, 173-198.

[GKP1988] G. Georgakopoulos, D. Kavvadias, and C. H. Papadimitriou. Probabilistic satisfiability. Journal of Complexity, 4:1-11, 1988.

[Hanneke 2016] S. Hanneke: The Optimal Sample Complexity of PAC Learning. Jour- nal of Machine Learning Research 17 (2016) 1-15.

[HM2001] Hazewinkel, Michiel, ed. (2001), ”Disjunctive normal form”, Encyclopedia of Mathematics, Springer,

[Robinson 1949] J. Robinson (1949). Definability and decision problems in arithmetic. The Journal of Symbolic Logic, 14, 98-114.

[Shalev-Shwartz and Ben-David 2014] Shalev-Shwartz, S. and Ben-David, S. (2014) Understanding Machine Learning: From theory to algorithms. Cambridge: Cambridge University Press.

Contact address: NYU Abu Dhabi Saadiyat Island P.O Box 129188 Abu Dhabi, UAE email: ag189@nyu.edu


Designed for Accessibility and to further Open Science