Learning the Hypotheses Space from data: Learning Space and U-curve Property

2020·Arxiv

Abstract

Abstract

This paper presents an extension of the classical agnostic PAC learning model in which learning problems are modelled not only by a Hypothesis Space H, but also by a Learning Space L(H), which is a cover of H, constrained by a VC-dimension property, that is a suitable domain for Model Selection algorithms. Our main contribution is a data driven general learning algorithm to perform regularized Model Selection on L(H). A remarkable, formally proved, consequence of this approach are conditions on L(H) and on the loss function that lead to estimated out-of-sample error surfaces which are true U-curves on L(H) chains, enabling a more efficient search on L(H). To our knowledge, this is the first rigorous result asserting that a non exhaustive search of a family of candidate models can return an optimal solution. In this new framework, an U-curve optimization algorithm becomes a natural component of Model Selection, hence of learning algorithms. The abstract general framework proposed here may have important implications on modern learning models and on areas such as Neural Architecture Search.

1 INTRODUCTION

The state-of-the-art in Machine Learning is a model composed by a set H of hypotheses h, which are functions from , to a finite subset , called Hypotheses Space, and a learning algorithm , which processes a training sample of a random vector (X, Y ), with range and unknown joint probability distribution P, and returns seeking to approximate a target hypothesis .

A learning algorithm is an optimization algorithm, whose search space is H, which seeks to minimize an error measure that assesses how good each predicts the values of Y from instances of X. Let be a loss function. The error of a hypothesis is the expected value of the local measures . There are two types of errors: the in-sample error , which is the sample mean of under ; and the out-of-sample error L(h), the expectation of under the unknown joint distribution P. In this context, a target hypothesis is such that its out-of-sample error is minimum in H, i.e., .

Under this learning framework, the out-of-sample error of may be estimated by the expectation of under a sample independent of , called validation sample. The instances in the validation sample are not in the training set, hence they provide information about the out-of-sample performance of , the hypothesis returned by A. Such learning model has an important parameter that is problem-specific: the Hypotheses Space H, which has a strong impact on the generalization quality of the model, that is, the correctness of the classification of non observed instances.

The fundamental result in Machine Learning is the Vapnik-Chervonenkis Theory [1], [2], which implies that a Hypotheses Space H is PAC learnable [3] if, and only if, it has finite VC dimension (). This means that for any unknown joint distribution generalizes, that is, L may be uniformly approximated by with a given precision and great confidence if N is sufficiently large. Therefore, it is possible to learn hypotheses with a finite sample, with precision and confidence dependent on the training sample size and the VC dimension, i.e., complexity, of H.

The VC theory is general, has a structural importance to the field and is an useful guide for modelling practical problems. However, since , the least N under VC theory bounds such that

is not a tight bound, as it is distribution-free, it is usually not a meaningful quantity to real application problems. In fact, the sample size N usually depends on data availability, which may be conditioned upon several factors such as technical difficulties and costs. Thus, parameters are usually predetermined, so the only free variable to be adjusted on VC theory’s bounds is the Hypotheses Space H or, more precisely, its VC dimension. The data-driven selection of H, known in the literature of the field as Model Selection, is an important problem that is usually treated heuristically.

In order to select H from data, it is required a combinatorial algorithm, which searches a family of Hypotheses Spaces seeking to minimize an estimator of the out-of-sample error of the estimated optimal hypothesis of each space. In other words, given a family of Hypotheses Spaces, a sample of size N and a consistent estimator of the out-of-sample error, normally given by an independent validation sample, such algorithm returns the subspace , whose estimated optimal hypothesis is the best estimator, in some sense, for a target hypothesis. On the state-of-the-art solution, the candidate Hypotheses Spaces are usually chosen heuristically. Then, each model is trained with the sample and the estimated optimal hypothesis is evaluated by an estimator of the out-of-sample error and the selected model is the one with minimum estimated out-of-sample error. Finally, this model is retrained with the union of the two data sets, in case it was used an independent validation sample.

The VC dimension of the selected Hypotheses Space will be adequate, in some sense, to learn with N examples for this particular unknown distribution P. If is a chain, then a method based on the Structured Risk Minimization (SRM) Inductive Principle may be applied to solve this problem (see [2, Chapter 4] for more details). Also, the classical problem of feature selection [4], [5] constitutes another framework for Model Selection, in which a family of partially ordered constrained Hypotheses Spaces is generated through elimination of variables. There are specialized solvers to approach this problem, which can explore the Ucurve property observed in the chains of the Boolean lattice of subsets of H which constitutes this search space (see [6], [7], [8], [9] for more details).

A limitation of the feature selection model is that the family of candidate Hypotheses Spaces is too constrained, so it may not be sharp enough for some problems of interest. In this paper, we propose a family of Hypotheses Spaces, called Learning Space, which can be designed with adequate constraints for each class of problems, and has properties which one can take advantage of in order to implement Model Selection algorithms more efficient than an exhaustive search of the candidate models. In this context, the feature selection and SRM routines become particular cases.

We propose an extension of the classical learning model, defining the model composed by a Hypotheses Space H; a Learning Space L(H), which is a cover of H, that satisfies a property regarding the VC dimension of related subspaces; and a learning algorithm , which processes L(H), a training sample and a consistent estimator of P, and returns an M that well approximates, in some sense, a target , for each . Then, among , it selects hypotheses which well approximate the target H.

Our model is based on properties of VC dimension and can be applied in principle to both discrete and continuous learning problems, including various kinds of hypothesis algebraic representations and learning algorithms, e.g., feature selection, neural networks [10] and lattice based methods [8]. The classical learning model requires that a particular Hypotheses Space H, and maybe a chain decomposition of it, is selected a priori, while this new approach has the feature of searching for the target hypotheses in a family of Hypotheses Spaces given by L(H), a Learning Space of H, which has a much richer structure than a single chain, as is usually a complete lattice, a semi-lattice or a poset.

An example of the importance of structuring the Hypotheses Space H and searching for optimal hypothesis in subspaces of it are the pyramid-based learning models which specify the Hypotheses Space through the specifica-tion of properties of the desired hypotheses, e.g., increasing or decreasing hypotheses, and some sampling pyramid, which constrains the set of hypotheses considered, generating enormous equivalence classes in the hypotheses domain, which work as a structure of it (see [11], [12], [13] for more details).

The usual algebraic structure adopted in Machine Learning is the algebraic structure of the learning model, which gives a parametric representation for each hypothesis. Indeed, the hypotheses are characterized by a set of parameters in some algebraic representation structure (ARS). The choice of an ARS has impact on the definition of H and on the architecture of learning algorithms. In this paper, we introduce the Learning Space as a complementary algebraic structure based on a VC dimension decomposition of a Hypotheses Space. Then, we establish a non exhaustive algorithm to find, among the subsets of H in the Learning Space, the ones which have the most adequate VC dimension to estimate a target hypothesis from a sample with size N. Figure 1 shows a possible scheme of the search on a Learning Space. In case the sample size is small relatively to the complexity of H, the Learning Space should model H as sharply as possible by taking into account all available prior information, so this modelling becomes an important part of the problem solution.

Fig. 1. Structured selection of the Hypotheses Space from data based on a Learning Space, in which is an estimator of the joint distribution of the input X and output is an estimator of the out-of-sample error and is given by the expectation of a loss function under . This idea is presented in more details in [14].

Following this Introduction, Section 2 presents the formal definition of Learning Space, as well as examples of discrete and continuous ones. Section 3 discusses a robust Ucurve algorithm for estimating an optimal hypothesis which takes advantage of the Learning Space structure and an U-curve property. Finally, in the Discussion we review the main contributions of this paper and future research perspectives.

2 A GENERAL STRUCTURE FOR HYPOTHESES SPACES

2.1 Hypotheses Spaces

Let X be a random vector and Y a random variable de-fined on a same probability space with ranges , and , respectively, in which Y is assumed to be a finite subset.1 We denote P(x, y) := as the joint probability distribution of (X, Y ) at point . We define a sample as a sequence of independent and identically distributed random vectors defined on with joint distribution P(x, y). Throughout this paper we assume that P(x, y) is unknown, but fixed.

Let H be a general Hypotheses Space, whose typical element h is a function . We only make a mild measurability assumption about H: it is an arbitrary set of functions h with domain X and image Y such that is -measurable. We denote subsets of H as indexed by the positive integers, i.e., . We may also denote a subset of H by M when it has a special property.

For each hypothesis in H, we assign a value indicating the loss incurred by the use of such hypothesis as a predictor for Y from the values of X. Let be a measurable loss function. The out-of-sample error, also known in the literature as risk or loss, of a hypothesis is defined as

in which E means expectation under P. We denote

as the in-sample error of a hypothesis h and

as a consistent estimator of the out-of-sample error of h independent of . We assume that represents the Riemann-Stieltjes integral under an empirical measure of P defined on , the same probability space of , so that it makes sense to say that is independent of . At first, we assume that is also indexed by N and is such that

i.e., converges in distribution to P. may be, for example, the in-sample error of a validation sample independent of the training sample , whose size diverges with N.

The main goal of Machine Learning is to approximate target hypotheses, i.e., hypotheses in H which minimize the out-of-sample error. These hypotheses are in the set

Furthermore, we will also be interested in target hypotheses of subsets of H which are in

depending on the subset. Observe that H may be a proper subset of the space of all functions with domain X and image in Y, so there may exist an , with . However, in this paper, we focus on approximating the best hypotheses in H, so we disregard all hypotheses outside of it.

Under the Empirical Risk Minimization paradigm, the target hypotheses are estimated by the ones in

while the estimated optimal hypotheses of subspaces are in

When we estimate optimal hypotheses based only on , our estimation is prone to overfit [15] sample , so that it is necessary to apply some tool to the estimation process in order to avoid it. For this purpose, we propose a structure of H which, among with a property of some loss functions under this structure, helps to avoid overfitting when estimating, yielding consistent estimators for and an algorithm for a data driven selection of Hypotheses Spaces. This structure depends on the VC dimension of H, which in the binary classification problem, is as follows.

Definition 1. Suppose that , and . The N-th shatter coefficient of H is defined as

in which is the cardinality of a set. The VC dimension of a Hypotheses Space H is the greatest integer such that and is denoted by . If for all integer , we denote .

In case |Y| = 2 and is the simple loss function, the VC dimension assesses the power of the functions in H in classifying instances with values in X into the categories Y. The VC dimension plays a central role on Machine Learning Theory, as it is a measure of the complexity of a Hypotheses Space, which may be employed to bound estimation (generalization) errors involved in the learning of hypotheses. Furthermore, the VC dimension is the main property of Hypotheses Spaces which allows us to define a Learning Space as a decomposition of H based on it: the VC dimension rules the complexity of subspaces, enabling the decomposition of H into subspaces of different complexities. The VC dimension may also be defined for the case in which |Y| > 2 or is not the simple loss function. See [2, Chapter 3] for more details.

2.2 Learning Space

Let H be a general Hypotheses Space with and be a finite subset of the powerset of H, i.e., . We say that the poset is a Learning Space if

We define the VC dimension of L(H) as

for which an upper bound is . On the one hand, for L(H) to be a structuring of H it should cover H, so the need for (i). On the other hand, condition (ii) implies that any element is maximal in the sense that there does not exist such that and . Furthermore, it guarantees that if then the complexity of is greater than that of .

The concept of Learning Space contemplates the search space of Model Selection displayed in Figure 1 if, given , condition (ii) is satisfied, as it is enough to take . Indeed, it is common for one to choose without thinking of it as a decomposition of a Hypotheses Space H. Nevertheless, if condition (ii) is satisfied, then it would be a Learning Space of , so taking H as this union, the only non-trivial condition is (ii).

As L(H) covers H, when one searches for a Hypotheses Space in L(H) on which to learn, no hypothesis of H is lost beforehand, as there is no a priori constraint which exclude hypotheses from H. Such a constraint is added a posteriori, and based on data, as the method to be proposed seeks to select, based solely on data, i.e., learn from data, the Hypotheses Space in L(H) on which to learn, that can be a proper subspace .

Although L(H) is not unique, i.e., there are multiple subsets of P(H) which are Learning Spaces, there are classes of Learning Spaces that have some properties which allow algorithm A to perform a non exhaustive search of L(H) when looking for a target subspace. The main class of Learning Spaces are the Lattice Learning Spaces, which have a lattice structure with useful properties that increase the performance of search algorithms.

Definition 2. Let L(H) be a Learning Space of H. We say that L(H) is a Lattice Learning Space if is a complete lattice,2 in which is the supremum operator, is the infimum operator, O is the least subset and I is the greatest subset of L(H).

In this paper we consider only Lattice Learning Spaces, or posets of one, although the abstract framework is quite general and may be also applied to other cases.

2.3 Building Learning Spaces

The first step in building a Learning Space is fixing an algebraic parametric representation of the hypotheses in H. Some important families of learning models, such as neural networks and lattice functions, have a particular algebraic structure, with a parametric representation and a corresponding optimization algorithm to estimate a target hypothesis from a given sample of size N. In neural networks, the parameters represent the weight of synapsis connections and the minimum bursting threshold. In lattice functions, the parameters are lattice intervals which represent a join or meet minimal representation. In the binary case, the lattice representation derives from Boolean Algebra. In the continuous or integer cases, it derives from lattice algebra. In the discrete case, the representation is finite, since the function domain X is finite. In the continuous case, there are topological conditions on the parametric representation of hypotheses which guarantee finite representations (see [17]). Therefore, all these learning models are parametric, hence a Learning Space L(H) should represent sets of hypotheses in terms of their parameters.

The algebraic structure of may be defined from the learning model and algebraic representation fixed, in the following manner. Let be a poset, in which F is an arbitrary set with finite cardinality. Also, let be a lattice isomorphism from set to , a subset of the powerset of H partially ordered by inclusion. This means that R is bijective and if , then , so that R preserves the partial order on F as the partial order on Im(R) given by inclusion. Then, if

we may define L(H) := Im(R) as a Learning Space of H. Isomorphisms which satisfy these conditions play a central role on the theory, hence we formally define them.

Definition 3. Given a partially ordered set , a Lattice isomorphism , with , which satisfies i) and ii) is called a Learning Space generator.

A Learning Space is completely defined by a triple , in which the elements of F may be interpreted as sets of parameters which describe a subset of hypotheses, i.e., the hypotheses in are represented by the parameters a, so that, in particular, F generates a parametric representation of the functions in H. For this reason, we call a parametric poset of H. Therefore, in general, to build a Learning Space of H we apply a generator to a parametric poset of its hypotheses. Furthermore, since the generator R is an isomorphism, it preserves properties of , hence, for instance, by applying R to , a complete lattice, we obtain a Lattice Learning Space.

2.4 Examples of Learning Spaces

We present some examples of Learning Spaces completely defined by a triple .

Example 1 (Feature Selection). Let H be a space of hypotheses with domain . Let F = P({1, . . . , d}) be the powerset of {1, . . . , d} partially ordered by inclusion, so that is a complete Boolean lattice. Consider the Learning Space generator P(H) given by

in which a , . . . , aand x , . . . , x, . . . , zif, and only if, xfor i = 1, . . . , j, so R(a)

contains the hypotheses which depend only on variables in a. The lattice isomorphism R satisfies condition i), and often satisfies ii), as in many applications the VC dimension is an increasing function of the number of variables, so Im(R) is often a Learning Space.

Example 2 (Partition Lattice). In the learning of hypotheses with finite domain and binary image, a particular structure useful for representing a Learning Space is the Partition Lattice or posets which are sub-graphs of it. Suppose that and let H contain all functions with domain X and image {0, 1}. A partition of X is a set of non empty subsets of X , called blocks, such that every element is in exactly one of these blocks. A partition generates an equivalence relation on X in the sense that x and z in X are -equivalent, i.e., , if, and only if, they are in the same block of partition .

Define is a partition of X} as the set of all partitions of X , partially ordered by defined as

for , which is a complete lattice . See Figure 2 for an example of Partition Lattice. By applying the generator given by

for , we obtain a Lattice Learning Space in which L(H) := Im(R). In this case the parameters of the functions are the elements in their domain X , in contrast, for example, to the variables they depend on as in Example 1. If H is given by the set of Boolean functions we have the special case of a Boolean Partition Lattice, which is studied in [14], where simulation studies involving the U-curve property may be found.

This Learning Space is quite general and haas subsets which are themselves useful Learning Spaces, illustrating how one may drop subsets of L(H) to obtain other Learning Spaces according to the needs of the application at hand. For example, the Feature Selection Learning Space when and Y = {0, 1} is a sub-lattice of the Partition Lattice Learning Space and is represented in orange in Figure 2 when . This Learning Space can also be obtained by applying the generator of Example 1.

Apart from dropping nodes, one can also obtain Learning Spaces for subsets by taking which is often a Learning Space of M. For example, by taking M as the increasing hypotheses in H when , i.e., if , we obtain L(M) composed by the dashed nodes a in Figure 2 whose hypotheses are the ones in . Some nodes may be dropped because is empty or equal to with so we keep only the least space so the structure of L(H) is preserved.

These are some examples of how one can incorporate prior knowledge about the problem at hand into the Learning Space to remodel it accordingly. If one believes that the target hypothesis does not depend on all variables, he may use the Feature Selection Learning Space, while if one believes the target hypothesis is increasing, he may consider the respective subset of the Partition Lattice Learning Space. In both cases, the Learning Space is defined by constraining the Partition Lattice Learning Space according to prior information, so multi-purpose learning algorithms based on the Partition Lattice may be applied on several instances when distinct levels of prior information is available.

Example 3 (Linear Classifiers). Let H be given by the linear classifiers in :

in which and is the function indexed by its parameters . Denoting A = {1, . . . , d}, we consider two distinct Learning Spaces generators: from the Boolean Lattice and from the Partition Lattice of A, in which P(A) is the power set of A and is the set of all partitions of A. The Partition Lattice is represented in Figure 2 for d = 4.

for as a feature selection generator, and define as

for as a generator which equal parameters, i.e., create equivalence classes in A. Both clearly satisfies i) and

so they also satisfy ii). Therefore, these lattice isomorphisms generate two distinct Lattice Learning Spaces for a same Hypotheses Space H and the application at hand will dictate which one is the more suitable to solve the problem. For example, if it is believed that does not depend on all variables, then the Boolean Lattice Learning Space may be preferable; otherwise, one would rather choose the Partition Lattice Learning Space, a subset of it or the intersection of it with a subset if one believes that the target linear classifier satisfies some property.

2.5 The Target Hypotheses Space

The intuition of the new paradigm proposed by this paper is presented in Figure 3, in which the ellipses represent some subspaces in L(H) and their area is proportional to their VC dimension. Assume that H is all we have to learn on, i.e., we are not willing to consider any hypothesis outside of H. Then, if we could choose, we would like to learn on : the subspace in L(H) with least VC dimension which contains . Of course this subspace is dependent on both L(H) and P, i.e., it is not distribution-free, thus we cannot establish beforehand, without looking at data, on which subspace of L(H) we should learn on. Also, even if we looked at data, it could not be possible to search L(H) exhaustively to properly estimate and, in the general case, there

Fig. 2. Partition Lattice for Linear Classifiers with d = 4 or for X = {1, 2, 3, 4}. The tables present the hypotheses in selected subspaces of the Partition Lattice Learning Space for X = {1, 2, 3, 4}.The orange nodes represent the Boolean Lattice of feature selection when points are 1 = (0, 0), 2 = (0, 1), 3 = (1, 0) and 4 = (1, 1). The dashed nodes are the ones in is composed by the increasing hypotheses.

is nothing guaranteeing that it is feasible to estimate anyhow.

Fig. 3. The decomposition of H by an L(H). We omitted some subspaces for a better visualization. In reality, L(H) should cover H.

In fact, when learning on a Hypothesis Space , that is random, since is learned (estimated) from data, we commit three errors which are, for fixed,

(II) > ϵ(III) > ϵ

and the composition of them

that we call Types II, III and IV generalization errors.3 In a broad sense, type (III) error would represent the bias of learning on , while type (II) would represent the variance, and type (IV) would be the error committed when learning on under the Empirical Error Minimization paradigm. Indeed, type (III) error compares the target hypothesis of with the target hypothesis of H, hence any difference between them would be a systematic bias of learning on when compared to learning on H, and type (II) error compares the loss of the estimated hypothesis of and the loss of its target assessing how much the estimated hypothesis vary from the target. These generalization errors are illustrated in Figure 4.

Fig. 4. Types II, III and IV generalization errors when learning on

As is often the case, there will be a bias-variance trade-off that should be minded when learning (on) . Hence, the feasibility of a Model Selection approach depends on algorithms which (a) are more efficient than an exhaustive search of L(H) and (b) guarantee that, when the sample size increases, the types (II) and (III), and consequently type (IV), generalization errors converge to zero, so there is no systematic bias when learning on , and it is statistically consistent to do so.

This paper and a subsequent more technical one [18] present an approach to Model Selection satisfying (a) and (b). In this paper, we discuss the U-curve phenomenon [9] and show how it can be explored, via the solution of an U-curve problem [8] through an U-curve algorithm [6], to estimate a target hypothesis by first learning a Hypothesis Space and then a hypothesis on it in the scheme of Figure 1, without exhaustively searching L(H). Then, in [18] it is rigorously showed, by extending the VC theory, that under mild conditions it is feasible to estimate by an , a random subspace learned from data, which converges to with probability 1 when N tends to infinity, which implies the convergence to zero of the generalization errors. Furthermore, it is showed in [18] that the generalization errors of learning on are lesser than those we commit when learning on H, i.e., by introducing a bias (III), which converges to zero, we decrease the variance (II) of the learning process, so it is more efficient to learn on a subspace learned from data.

In summary, in this paper we show that it may be computationally possible to learn the Hypotheses Space from data, via a Learning Space and an U-curve algorithm, even though this algorithm is NP-hard [19], as it is usually significantly more efficient than an exhaustive search; in the more technical [18] it is showed that it is theoretically feasible to learn the Hypotheses Space from data.

3 U-CURVE OPTIMIZATION ALGORITHM

Given a Learning Space L(H), the challenge is the following: take advantage of the L(H) structure to approximate a target hypothesis by applying an algorithm which searches for such approximation on subspaces with different VC dimensions. This optimization becomes more efficient if one takes advantage of the U-curve phenomenon. In this section, we discuss the U-curve problem in the context of L(H).

3.1 The U-curve Property

The goal of an U-curve learning machine is to approximate a target hypothesis by estimating the optimal hypotheses of each , via the minimization of the in-sample error, and then selecting the estimated optimal hypotheses with the smallest estimated out-of-sample error. Formally, the U-curve learning machine , given and returns

that is, the estimated optimal hypotheses of the subspaces in L(H) with the smallest estimated out-of-sample error.

An algorithm which returns (4) is either a two level combinatorial algorithm or a hybrid of a continuous and a combinatorial algorithm. The first case is when the estimation of the optimal hypotheses of is carried by a combinatorial algorithm; the second is when such estimation is performed by a continuous optimization algorithm. For example, when the hypotheses considered have a discrete domain, i.e., it consists of Boolean or integer input vectors, the algorithm is two level combinatorial, but when the domain is composed by real vectors, and the Hypothesis Space is that of linear classifiers, the optimization in a subspace is continuous.

The U-curve algorithm is a combinatorial algorithm employed to avoid overfitting. If we try to approximate the target hypotheses employing only the in-sample error, then there is no gain in using a Learning Space, since the hypotheses with smallest in-sample error will be , the hypotheses which minimize the in-sample error in H, which may overfit the data if the sample size is not relatively great. On the other hand, by estimating the optimal hypotheses of each subspace in L(H) from the in-sample error, and validating them on an independent estimator of the out-of-sample error, we may select as an approximation to a target hypothesis the estimated optimal of some proper subset of H.

The hypotheses estimated by the U-curve algorithm are, in some sense, compatible with sample size N. On the one hand, if N is great enough, then the U-curve algorithm will return the estimated optimal hypotheses of H, since, by VC theory bounds, the in-sample error will be close to the out-of-sample error. Thus, the optimal hypotheses by the in-sample and estimated out-of-sample error will be the same. On the other hand, if N is not great enough, then the selected hypotheses will be compatible with N, in the sense that the VC dimension of the subspace of L(H) which contains them is compatible with sample size N. In other words, the generalization error of estimating in a subspace with greater VC dimension may be too great, as the in-sample and estimated out-of-sample error differ.

To implement the U-curve algorithm we take advantage of an homonym property which is often satisfied by the estimated out-of-sample error when calculated for the estimated optimal hypotheses of the subspaces in a chain of L(H). Performing an exhaustive search on L(H) to obtain (4) is not feasible, as requires an algorithm of high computational complexity when |L(H)| is large. Nevertheless, we take advantage of the structure of L(H), i.e., the fact that it is a poset, semi-lattice, lattice or chain, in order to, together with the U-curve property of certain Learning Spaces and loss functions, solve optimization problem (4).

We now define the U-curve property and related concepts. For we denote by their distance on the Hasse diagram . We may denote from now on for all when suitable.

Definition 4. A sequence is called a dense chain of L(H) if, and only if, for all .

Definition 5. The subspace is:

• a weak local minimum of a dense chain if , in which ;

• a strong local minimum of L(H) if it is a weak local minimum of all dense chains of L(H) which contain it;

Figure 5 illustrates the concepts of weak and strong local minimums. Observe that, in Definition 5, the representative hypothesis of each is chosen based on the in-sample error, while, in order to establish the minimums, the error is estimated by a consistent estimator of the out-of-sample error, independent of , so to avoid overfitting. Indeed, the sequence is non increasing as . Therefore, if we were to define the local minimums considering only the in-sample error, they would always be the greatest subspace of the chain, so that we would be doomed to overfit. On the other hand, estimating hypotheses based on and validating them on a consistent estimator of L, independent of , is a manner of avoiding overfitting.

An important property of the global minimum is that it is a strong (and weak) local minimum. This strong minimum

Fig. 5. Example of (a) strong and (b) weak local minimums.

was called by [9] minima exhausted in the context of feature selection. Therefore, in order to find we need only to find all strong (or all weak) local minimums of L(H). This search may be performed more efficiently than an exhaustive one if the loss function satisfies either the strong or the weak U-curve property.

Definition 6. A Learning Space L(H) under loss function satisfies the:

• strong U-curve property if every weak local minimum of a dense chain of L(H) is a global minimum of such chain;

• weak U-curve property if every strong local minimum is a global minimum of all dense chains of L(H) which contain it.

We call the properties above U-curve for the plot of , has an U format when calculated for any dense chain if the loss function has the strong U-curve property. It is straightforward that the strong U-curve property implies the weak U-curve property.

The U-curve phenomenon allows us to make the search for more efficient. If the strong U-curve property is satisfied, in order to find we do not need to exhaustively search L(H): we go through every dense chain of L(H) until we find the (only) weak local minimum of it, so that we find every weak local minimum and, therefore, the global minimum. Similarly, if the weak U-curve property is satisfied, in order to find we go through every dense chain of L(H) until we find the (only) strong local minimum of it, so that we find every strong local minimum and, therefore, the global minimum.

Either way, L(H) is not exhaustively searched, as when we find a weak or strong local minimum we do not need to estimate the optimal hypotheses of the remaining subsets of a dense chain, as the strong or weak U-curve property, respectively, ensure that the found local minimum is a global minimum of the dense chain. This is done for feature selection lattices in [6], [7], [8] where more details about the algorithm may be found. The U-curve by minimum exhaustion proposed by [9] solves this problem when the weak U-curve property is satisfied by identifying an exhausted minima (strong local minimum) and doing cuts above and below this node of the feature selection lattice. After repeating this procedure until the search space is empty, the algorithm returns the global minimums. In Figure 6 we present an example of a lattice which satisfies the weak Ucurve property.

Next proposition shows that the Partition Lattice Learning Space satisfies the weak U-curve property if we apply

Fig. 6. Example of a lattice satisfying the weak U-curve property. The number inside each node . The strong local minimums are in green and the weak local minimums are in orange. All strong local minimums are global minimums of all dense chains which contain them, so this is an example of a weak U-curve property configuration.

the simple loss function, hence the class of Learning Spaces which satisfies an U-curve property is not empty, i.e., these concepts appear in real-world applications and are not merely theoretical speculations.

Proposition 1. The Partition Lattice Learning Space under the simple loss function satisfies the weak U-curve property.

In the case of the Partition Lattice Learning Space, being a strong local minimum implies that for every obtained from by breaking one block partition into two, or merging two block partitions into one, has a greater cost than . Then, the key point of the proof is that when we go, say, from partition {a, b, c} to , the estimated out-of-sample error of the estimated optimal hypothesis increases on the same amount as when we go from partition {a, d, e} to . Thus, when we know that is a strong local minimum, we know that any breaking (merging) of block partitions increases the estimated out-of-sample error and, as the increment does not depend on the estimated error of the current partition, but only on what block is broken (merged), it follows that is a global minimum of every dense chain which contains it, as every such that or is given by breaking or merging block partitions, respectively.

We now establish a general sufficient condition for the weak U-curve property motivated by the proof of Proposition 1. For each , a Lattice Learning Space, define

as the subspaces which contain or are contained in M, respectively. Both and are complete lattices. We define for each the upper immediate neighbourhood of relative to M as

∈ C⊂ M

and for each the lower immediate neighbourhood relative to M as

∈ C⊂ M

If Mthen M ⊂ M⊂ Mand if M∈

then . We say that L(H) is Ucurve compatible if for every

{M} or || ≥ 2, ∀ M∈ C\ {M} {M} or || ≥ 2, ∀ M∈ C\ {M}

i.e., L(H) is U-curve compatible if for every the upper (lower) immediate neighbourhood of all subspaces in ) is equal to M or contains at least two distinct subspaces. Both the Boolean and Partition Lattices are U-curve compatible. If L(H) is U-curve compatible, then a simple property of is sufficient for the weak U-curve property.

Theorem 1. Let L(H) be an U-curve compatible Lattice Learning Space. If all such that satisfies

with probability 1, then the weak U-curve property is in force for L(H) under .

Assuming, without loss of generality, that and rewriting (5) as

we observe that the increase on when we go from to is greater than when we go from to . This feature is analogous to that observed on convex functions, that is, the increment of the function increases when its inputs increase, in which increase of inputs in this context is according to relation and the value of when subsets are not related. Hence, we call property (5) Lattice Convexity. As is the case with convex functions, (strong) local minimums are global minimums, but, since L(H) is a poset, this is true only for the subspaces which are related to the local minimum.

Besides being a sufficient condition to the Weak U-curve Property, condition (5) is also a tool for increasing the effi-ciency of an U-curve algorithm. Suppose that and satisfy (5) and that . Then

so after visiting the subspaces and , and noting the increase in the estimated error, one does not need to visit as the estimated error will increase. Therefore, apart from the global pruning of L(H) one does when a strong local minimum is found, it is also possible to perform local pruning, when a subspace which is the supremum of two is not visited when one observes an increase on the estimated error from their infimum to the subspaces. This fact may be explored, together with the U-curve properties, in order to develop more efficient algorithms.

Remark 1. Although the Partition Lattice Learning Space does not satisfy (5) (see technical Remark 3), it satisfies the weak U-curve property for other reasons by Theorem 1. However, there are sub-graphs of it that, chosen in a way such that (5) is satisfied, also have the weak U-curve property due to this sufficient condition (see technical Remark 3). Nonetheless, even when no U-curve property is satisfied, one may still apply an U-curve algorithm and obtain a suitable suboptimal solution. This has been done successfully in feature selection (see [6], [7], [8], [9], [14], [19] for more details).

4 DISCUSSION

For a long time, the communities of Pattern Recognition and Machine Learning have been doing optimization on, mainly Boolean, lattices, to design classifiers or to learn hypotheses. However, we have no knowledge about results establishing lattice and loss function properties which guarantee that the applied procedures really return optimal solutions. In this context, the main contribution of this paper is, to the best of our knowledge, the first rigorous definition of properties which ensure that a non-exhaustive search of a lattice with the purpose of Model Selection can return an optimal result. We established such a property and showed that it does exist by proving the Weak U-curve Property is satisfied by the Partition Lattice Learning Space.

In general lines, the method presented in this paper proposes a data-driven systematic, consistent and feasible approach to Model Selection. Indeed, given a loss function, a sample, an estimator of the out-of-sample error and a Learning Space generator, we proposed a systematic procedure, based on data, to select a model, i.e., a subset in the Learning Space, and then learn a hypothesis of it seeking to approximate a target hypothesis of H. In a subsequent technical article [18], we show that this method is consistent and, due to the U-curve property, it is feasible, in the sense of being more efficient than an exhaustive search on the Learning Space for obtaining optimal solutions, when an Ucurve property is satisfied, and suboptimal, when such a property is not satisfied [6], [9], [14].

The practical application of the abstract theory requires, first, the concrete choice of a parametric representation of the hypotheses. Then, within a suitable algebraic structure of such a representation, one needs to define a Learning Space generator, that is a rule of how to obtain, via constraints on the parameters values, a family of subsets of H, partially ordered by inclusion, with the same algebraic structure considered for the parameters. A canonical practical example of this abstract system is the Boolean lattice for feature selection, in which the hypotheses are represented by the variables they depend on; the Boolean lattice algebra of the subsets of variables is considered; and the Learning Space generator associates each subset of variables with the set of hypotheses which depend only on variables in the set.

Although abstract and general, and even, if at first sight, it may not be clear how to construct Learning Spaces, they emerge naturally on applications on which a meaningful parametric representation is employed. By meaningful we mean that each parameter is identifiable with some concrete concept, that is, the parameters are interpretable. Observe that this is the case of all examples presented above, where the parameters represent variables or points of the classifier domain. Besides facilitating the development of Learning Spaces, the interpretability of the parameters has at least another two properties. First, since one knows what each parameter means, it is possible to transform prior information into constraints on these parameters in order to (a) define a Learning Space generator, (b) replace H by a subset of it and (c) drop sets from the Learning Space (see discussion on Example 2 for a practical example). Second, one may re-parametrize the hypotheses, or consider another constraint on the same parameters, to add other kinds of prior information, thus, generating another Learning Space for the same Hypothesis Space H (see Example 3). Hence, on the one hand, the Learning Space is not unique and there is no general abstract formulae to build just one that would work for all applications, what would be a “canonical Learning Space”. However, on the other hand, this flexibility when defining Learning Spaces makes it a customizable method, which can be instantiated for a large family of learning problems.

The most popular problem in machine learning is the learning of Boolean functions (i.e., defined from to {0, 1}), since they are natural models for a large number of application problems: design of binary and flat grey-scale image operators, digital circuits design, robotics, identifica-tion of genes network, etc. The concept of Learning Space applied to a H formed by Boolean functions and the Ucurve algorithm on the Partition Lattice constitute a general framework to learn them. In a given context, the Hypothesis Space is defined based on constraints on the general set of Booleans functions (e.g., increasing, decreasing, supgenerating, inf-generating, extensive, anti-extensive, etc.). The combination of constraints define particular Hypothesis Spaces of Boolean functions that characterize real world problems and imply a reduction of the Hypothesis Space complexity and consequently the reduction of L(H) to posets of the original Partition Lattice. Applying the U-curve algorithm to such poset, one may obtain optimal or suitable sub-optimal solutions to the practical problem at hand.

Also, there are some interesting perspectives of applying the method to Neural Networks. A given architecture of a Neural Network generates a Hypotheses Space H, formed by all hypotheses obtained wandering all over the Neural Net parameters domain. Of course, such an H has subsets, and families of them satisfy the two axioms of Learning Spaces. Now, if one could develop a Learning Space generator which associates constraints on the architecture’s parameters to subsets of a Learning Space, the abstract method proposed here could be applied to select Neural Net architectures, since the constraints on the parameters would generate sub-architectures, so the Learning Space could be seem as a family of architectures, and the learning on it would be the learning of architectures. This instantiation has potential to become a relevant contribution to the field of Neural Architecture Search [20] since it would constitute a systematic and general approach to architecture learning, which could have major implications to the future of Neural Networks research. However, since each parameter of a Neural Network is not exactly interpretable by itself, due to over-parametrization [21], and the VC dimension of H is not exactly known in general for Neural Nets [22], the instantiation of the method to this case is not immediate and requires a further understanding of the concepts proposed here for Neural Nets. We leave this important study as a relevant topic for future research.

From the viewpoint of Machine Learning Theory, the approach presented here can also be seen as a regularization method. However, we believe that the U-curve algorithm applied to a Learning Space differs from established regularization methods because the optimization occurs in plain sight and is a consequence of a formal model designed to represent an application problem. Indeed, some ordinary regularization procedures are heuristics in which the loss function is penalized by the complexity of the model, and the estimated hypothesis is obtained by a constrained optimization problem which gives few or no reasons to why such model has been selected in detriment of others. In the extended learning model proposed here, regularization is an abstract process that results from the optimization in the Learning Space under a finite sample. In this new learning approach, the search on the Learning Space itself implies the regularization effect. This oversight of the regularization process comes with the cost of computing power, since the search on a Learning Space is a combinatorial algorithm, even under an U-curve property, which will tend to have bigger computational complexity than other regularization procedures. Nevertheless, with the ever growing computing capabilities, in the near future such an increase in computational complexity could be worth it in order to have sharper models for real problems. Besides, the concept of regularization under the proposed framework is quite general, applying for both continuous and discrete models.

In a subsequent paper [18] it is shown that it is theoretically feasible to learn a subspace of a Hypotheses Space from data and that the sample complexity does depend on the target via a bias-variance trade-off. Let us examine a limit case related to this property. Suppose that the target hypothesis is in a subspace with , which is for sure the case if . Consequently, if we could identify this subspace, a sample size compatible with a space of VC dimension , and not necessarily compatible with H, would suffice to well approximate the target. On the one hand, we cannot identify this subspace a priori, as the probability distribution of the data, and thereafter the target, is unknown. On the other hand, we may take advantage of the Learning Space structure and an U-curve property to identify this subspace based on data, i.e., learn the Hypotheses Space from data. This is the central idea of the paradigm presented in Figure 3 and leads to the concept of random hypotheses subspace and bias-variance considerations which are treated in [18].

This paradigm also leads to an important practical property of Machine Learning: the lack of experimental data may be mitigated by a great computational capacity, as one can apply computer power to look for the target subspace compatible with a given small sample under the paradigm in Figure 3. In a context of continuous increasing and popularization of high computational power, this property may be the key to understanding why Machine Learning has become so important in all branches of Science, even in the ones where data is expensive and hard to get.

APPENDIX A PROOF OF THE RESULTS AND TECHNICAL REMARKS

Proof of Proposition 1. We first show that if is a strong local minimum of L(H) and if then . Note that for all we have

This is the case because, if and , then for all and

so that if a = b, then for all . Furthermore, if and then

as is obtained from by partitioning a block of it into two blocks . From (6) we can establish that , for all or for all . Indeed, if

then either

as the ratio in (7) with is a weighted mean of the ratios in (8) with , so that if it is greater than 1/2, as is the case when , the maximum of the ratios in (8) is greater than 1/2, as the maximum is not lesser than the weighted mean. This establishes that at least one of the is equal to .

L(H). Then, if and , denoting A =

as for , in which the inequality follows since is a strong local minimum. We have that

a, with a, a, a, and

so that, as is a strong local minimum, by substituting (10) in (9), we have

which in turn is equal to , where is a partition of , as for all there exists a such that , which follows from the fact that , so we may apply inequality (11) to each parcel of the sum.

To establish that is a global minimum of every dense chain that contains it, it is enough to show that implies . This can be done analogously by supposing that there exists and , when , and proceeding in a similar manner.

Proof of Theorem 1. Let be a strong local minimum. We first show that if then , which implies that M is a global minimum of , as it is its least element. Let be the size of the greatest dense chain in which contains M and define for as

Length of dense chain starting in ending in

We may partition by the greatest position each subset occupies in a dense chain starting in M:

for and C{M}.

By hypothesis, for all as it is a strong local minimum. We proceed by induction. Assume, for a , that for all when . Fix and with . Note that and, as , since L(H) is U-curve compatible, there exists another such that . Therefore

by hypothesis, as for and both are in . From (12) and the induction hypothesis it follows that for all as there is an such that which implies .

With an analogous deduction and the inequality

we can show that if then , which implies that M is also global minimum of , as it is its greatest element.

Remark 2. From the proof of Theorem 1 one can deduce that its hypotheses may be loosened. Condition (5) need not to be satisfied by all such that . It is necessary only that for every and every with , there are , such that , and condition (5) is satisfied by . Also, Theorem 1 may be adapted to L(H) which is not a lattice by adding further constraints to it.

Remark 3. The sufficient condition (5) is not satisfied by the Partition Lattice Learning space. Indeed, it is only true when to get from to we perform breaks on two distinct blocks of partition . If these breaks are on the same block then the condition may fail. However, in view of Remark 2, we may consider sub-graphs of the Partition Lattice Learning Space which satisfy the Weak Ucurve Property due to (5). A further study of this fact is left for a future study.

ACKNOWLEDGMENTS

DM received financial support from CNPq during this research. JB is supported by FAPESP: proc. 14/50937-1 and 15/24485-9.

REFERENCES

[1] V. Vapnik, Statistical learning theory. 1998. Wiley, New York, 1998, vol. 3.

[2] ——, The nature of statistical learning theory. Springer science & business media, 2000.

[3] L. G. Valiant, “A theory of the learnable,” Communications of the ACM, vol. 27, no. 11, pp. 1134–1142, 1984.

[4] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of machine learning research, vol. 3, no. Mar, pp. 1157–1182, 2003.

[5] G. H. John, R. Kohavi, and K. Pfleger, “Irrelevant features and the subset selection problem,” in Machine learning: proceedings of the eleventh international conference, 1994, pp. 121–129.

[6] E. Atashpaz-Gargari, M. S. Reis, U. M. Braga-Neto, J. Barrera, and E. R. Dougherty, “A fast branch-and-bound algorithm for u-curve feature selection,” Pattern Recognition, vol. 73, pp. 172–188, 2018.

[7] M. S. Reis, G. Estrela, C. E. Ferreira, and J. Barrera, “featsel: A framework for benchmarking of feature selection algorithms and cost functions,” SoftwareX, vol. 6, pp. 193–197, 2017.

[8] ——, “Optimal boolean lattice-based algorithms for the u-curve optimization problem,” Information Sciences, vol. 471, pp. 97–114, 2019.

[9] M. Ris, J. Barrera, and D. C. Martins, “U-curve: A branch-and-bound optimization algorithm for u-shaped cost functions on boolean lattices applied to the feature selection problem,” Pattern Recognition, vol. 43, no. 3, pp. 557–568, 2010.

[10] S. Haykin, Neural networks: a comprehensive foundation. Prentice Hall PTR, 1994.

[11] E. R. Dougherty, J. Barrera, G. Mozelle, S. Kim, and M. Brun, “Multiresolution analysis for optimal binary filters,” Journal of Mathematical Imaging and Vision, vol. 14, no. 1, pp. 53–72, 2001.

[12] K. Grauman and T. Darrell, “Pyramid match kernels: Discrim- inative classification with sets of image features (version 2),” Computer Science and Artificial Intelligence Laboratory MIT, 2006.

[13] R. H. Junior, M. Brun, J. Barrera, and E. R. Dougherty, “Mul- tiresolution design of aperture operators,” Journal of Mathematical Imaging and Vision, vol. 16, no. 3, pp. 199–222, 2002.

[14] J. E. S. Castro, “Model selection for learning boolean hypothesis,” Ph.D. dissertation, Universidade de S˜ao Paulo, 2018.

[15] Overfitting., Oxford Dictionaries Online, 2020. [Online]. Available: https://www.lexico.com/definition/overfitting

[16] G. J. F. Banon and J. Barrera, “Decomposition of mappings between complete lattices by mathematical morphology, part i. general lattices,” Signal Processing, vol. 30, no. 3, pp. 299–327, 1993.

[17] ——, “Minimal representations for translation-invariant set map- pings by mathematical morphology,” SIAM Journal on Applied Mathematics, vol. 51, no. 6, pp. 1782–1798, 1991.

[18] D. Marcondes, A. Simonis, and J. Barrera, “Learning the hypothe- ses space from data part ii: Convergence and feasibility,” arXiv preprint arXiv:2001.11578, 2020.

[19] M. S. Reis, “Minimization of decomposable in u-shaped curves functions defined on poset chains–algorithms and applications,” Ph.D. dissertation, Institute of Mathematics and Statistics, University of Sao Paulo, Brazil (in Portuguese), 2012.

[20] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” Journal of Machine Learning Research, vol. 20, pp. 1–21, 2019.

[21] B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro, “The role of over-parametrization in generalization of neural networks,” in International Conference on Learning Representations, 2018.

[22] E. D. Sontag, “Vc dimension of neural networks,” NATO ASI Series F Computer and Systems Sciences, vol. 168, pp. 69–96, 1998.