b

DiscoverSearch
About
My stuff
Theoretical Models of Learning to Learn
2020·arXiv
Abstract
Abstract

A Machine can only learn if it is biased in some way. Typically the bias is supplied by hand, for example through the choice of an appropriate set of features. However, if the learning machine is embedded within an environment of related tasks, then it can learn its own bias by learning sufficiently many tasks from the environment [4, 6]. In this paper two models of bias learning (or equivalently, learning to learn) are introduced and the main theoretical results presented. The first model is a PAC-type model based on empirical process theory, while the second is a hierarchical Bayes model.

image

Hume’s analysis [10] shows that there is no a priori basis for induction. In a machine learning context, this means that a learner must be biased in some way for it to generalise well [11]. Typically such bias is introduced by hand through the skill and insights of experts, but despite many notable successes, this process is limited by the experts’ abilities. Hence a desirable goal is to find ways of automatically learning the bias. Bias learning is a form of learning to learn, and the two expressions will be used interchangeably throughout this document.

The purpose of this chapter is to present an overview of two models of supervised bias learning. The first [4, 3] is based on Empirical Process theory (henceforth the EP model) and the second [6] is based on Bayesian inference and information theory (henceforth the Bayes model). Empirical process theory is a general theory that includes the analysis of pattern classification first introduced by Vapnik and Chervonenkis [13, 12]. Note that these are models of supervised bias learning and as such have little to say about learning to learn in a reinforcement learning setting.

In this introduction a high level overview of the features common to both models will be presented, and then in later sections the details and main results of each model will be discussed.

In ordinary models of machine learning the learner is presented with a single task. Learning the “right bias” in such a model does not really make sense, because the ultimate bias is one which completely solves the task. Thus in single-task learning, bias learning or learning to learn is the same as learning.

In order to learn bias one has introduce extra assumptions about the learning process. The central assumption of both the Bayes model and the EP model of bias learning is that the learner is embedded within an environment of related problems. The learner’s task is to find a bias that is appropriate for the entire environment, not just for a single task.

A simple example of an environment of learning problems with a common bias is handwritten character recognition. A preprocessing stage that identifies and removes any (small) rotations, dilations and translations of an image of a character will be advantageous for recognising all characters. If the set of all individual character recognition problems is viewed as an environment of learning tasks, this preprocessor represents a bias that is appropriate to all tasks in the environment.

Preprocessing can also be viewed as feature extraction, and there are many classes of learning problems that possess common feature sets. For example, one can view face recognition as a collection of related learning problems, one for each possible face classifier, and it is likely that there exists sets of features that are good for learning all faces. A similar conclusion applies to other domains such as speech recognition (all the individual word classifiers may be viewed as separate learning problems possessing a common feature set), fingerprint recognition, and so on. The classical approach to statistical pattern recognition in these domains is to first guess a set of features and then to learn each problem by estimating a simple (say linear) function of the features. The choice of features represents the learner’s bias, thus in bias learning the goal is to get the learner to learn the features instead of guessing them.

In order to perform a theoretical analysis of bias learning, we assume the tasks in the environment are generated according to some underlying probability distribution. For example, if the learner is operating in an environment where it must learn to recognise faces, the distribution over learning tasks will have its support restricted to face recognition type problems. The learner acquires information about the environment by sampling from this distribution to generate multiple learning problems, and then sampling from each learning problem to generate multiple training sets. The learner can then search for bias that is appropriate for learning all the tasks.

In the EP model, the learner is provided with a family of hypothesis spaces and it searches for an hypothesis space that contains good solutions to all the training sets. Such a hypothesis space can then be used to learn novel tasks drawn from the same environment. The key result of the EP model (theorem 2 in section 3) gives a bound on the number of tasks and number of examples of each task required to ensure that a hypothesis space containing good solutions to all training sets will, with high probability, contain good solutions to novel tasks drawn from the same environment. This ability to learn novel tasks after seeing sufficiently many examples of sufficiently many tasks is the formal definition of learning to learn under the EP model.

The Bayes model is the same as the EP model in that the learner is assumed to be embedded within an environment of related tasks and can sample from the environment to generate multiple training sets corresponding to different tasks. However, the Bayes bias learner differs in the way it uses the information from the multiple training sets. In the Bayes model, the distribution over learning tasks in the environment is interpreted as an objective prior distribution. The learner does not know this distribution, but does have some idea of a set Π of possible prior distributions to which the true distribution belongs. The learner starts out with a hyper-prior distribution on Π and based on the data in the training sets, updates the hyper-prior to a hyper-posterior using Bayes’ rule. The hyper-posterior is then used as a prior distribution when learning novel tasks. In section 4 results will be presented showing how the information needed to learn each task (in a Shannon sense) decays to the minimum possible for the environment as the number of tasks and number of examples of each tasks seen already grows. Within the Bayes model, this is the formal definition of learning to learn.

Before moving on to the details of these models, it is worth pausing to assess what bias learning solves, and what it doesn’t—and in a sense can never—solve. On face value, being able to learn the right bias appears to violate Hume’s conclusion that there can be no a priori basis for induction. However this is not the case, for the bias learner learner is still fundamentally limited by the possible choices of bias available. For example, if a learner is learning a set of features for an environment in which there are in fact no small feature sets, then any bias it comes up with (i.e. any feature set) will be a very poor bias for that environment. Thus, there is still guesswork involved in determining the appropriate way to hyper-bias the learner. The main advantage of bias learning is that this hyper-bias can be much weaker than the bias: the right hyper-bias for many environments is just that there exists a set of features, whereas specifying the right bias means actually finding the features.

To understand how bias learning can be modeled from a statistical perspective, it is necessary to first understand how ordinary learning is modeled from a statistical perspective. The empirical process (EP) process approach and the Bayes approach will be discussed in turn.

2.1. The empirical process (EP) approach

The empirical process (EP) approach to modeling ordinary (single-task) learning has the following essential ingredients:

An input space X and an output space Y ,

a probability distribution P on  X × Y,

a loss function  l: Y × Y → R, and

a hypothesis space H which is a set of hypotheses or functions  h: X → Y.

As an example, if the problem is to learn to recognize images of Mary’s face using a neural network, then X would be the set of all images (typically represented as a subset of  Rdwhere each component is a pixel intensity), Y would be the set {0, 1}, and the distribution P would be peaked over images of different faces and the correct class labels. The learner’s hypothesis space H would be a class of neural networks mapping the input space (Rd) to {0, 1}.

image

Using the loss function allows us to present a unified treatment of both concept learning (Y = {0, 1}, l as above), and real-valued function learning (e.g. regression) in which Y = R and  l(y, y′) = (y − y′)2.

The goal of the learner is to select a hypothesis  h ∈ Hwith minimum expected loss:

image

For classifying Mary’s face, the  h ∈ Hwith minimum value of erP (h) is the one that makes the fewest number of mistakes on average. Of course, the learner does not know P and so it cannot search through H for an h minimizing erP (h). In practice, the learner samples repeatedly from the distribution P to generate a training set

image

and instead of minimizing erP (h), the learner searches for an  h ∈ Hminimizing the empirical loss on sample z:

image

Of course, there are more intelligent things to do with the data than simply minimizing empirical error—for example one can add regularisation terms to avoid over-fitting. However we do not consider those issues here as they do not substantially alter the discussion.

Minimizing ˆerz(h) is only a sensible thing to do if there is some guarantee that ˆerz(h) is close to expected loss erP (h). This will in turn depend on the “richness” of the class H and the size of the training set (m). If H contains every possible function then clearly there can never be any guarantee that ˆerz(h) is close to erP (h). The conditions ensuring convergence between ˆerz(h) and erP (h) are by now well understood; in the case Boolean function learning (Y = {0, 1}), convergence is controlled by VCdim(H)—the VC-dimension of H (see e.g. [1, 12]). The following is typical of the theorems in this area.

Theorem 1 Let P be any probability distribution on  X × {0, 1}and suppose z = {(x1, y1), . . . , (xm, ym)}is generated by sampling m times from  X ×{0, 1}according to P. Let d := VCdim(H). Then with probability at least 1  − δ(over the choice of the training set  z), all h ∈ Hwill satisfy

image

There are a number of key points about this theorem:

1. We can never say for certain that erP (h) and ˆerz(h) are close, only that they are close with high probability (1  − δ). This is because no matter how large the training set, there is always a chance that we will get unlucky and generate a highly unrepresentative sample z.

2. Keeping the confidence parameter  δfixed, and ignoring log factors, (5) shows that the difference between the empirical estimate ˆerz(h) and the true loss erP (h) decays like�d/m, uniformlyfor all  h ∈ H. Thus, for sufficiently large training sets z and if d = VCdim(H) is finite, we can be confident that an h with small empirical error will generalise well.

3. If H contains an h with zero error, and the learner always chooses an h consistent with the training set, then the rate of convergence of ˆerz(h) and erP (h) can be improved to d/m.

4. Often results such as theorem 1 are called uniform convergence results, because they provide bounds for all  h ∈ H.

Theorem 1 only provides conditions under which the deviation between erP (h) and ˆerzis small, it does not guarantee that the true error erP (h) will actually be small. This is governed by the choice of H. If H contains a solution with small error and the learner minimizes error on the training set, then with high probability erP (h) will be small. However, a bad choice of H will mean there is no hope of achieving small error. Thus, the bias of the learner in the EP model is represented by the choice of H.

2.2. The Bayes approach

The Bayes approach and the EP approach are not all that different. In fact, the EP approach can be understood as a maximum likelihood approximation to the Bayes solution. The essential ingredients of the Bayes approach to modeling ordinary (single-task) learning are:

An input space X and an output space Y ,

a set of probability distributions  Pθon  X × Y, parameterised by  θ ∈Θ, and

a prior distribution  p(θ) on Θ.

The hypothesis space H of the EP model has been replaced with a set of distributions  {Pθ: θ ∈ Θ}. As with the EP model, the learning task is represented by a distribution  Pθ∗on  X × Y, only this time we assume realizability, i.e. that  θ∗ ∈Θ. The prior distribution  p(θ) represents the learner’s initial beliefs about the relative plausibility of each  Pθ. The richness of the set  {Pθ: θ ∈ Θ}and the prior  p(θ) represent the bias of the learner.

Again the learner does not know  θ∗, but has access to a training set z = {(x1, y1), . . . , (xm, ym)}of pairs (xi, yi) sampled according to  Pθ∗. Upon receipt of the data z, the learner updates its prior distribution  p(θ) to a posterior distribution  p(θ|z) using Bayes rule:

image

Often we are not interested in modeling the input distribution, only the conditional distribution on Y given X. In that case,  p(x, y|θ) will factor into  p(x)p(y|x; θ). The posterior distribution  p(θ|z) can be used to predict the output y of a novel input x∗by averaging:

image

One would hope that as the data increases, predictions made in this way would become increasingly accurate. There are many ways to measure what we mean by “accurate” in this setting. The one considered here is the Kullback-Liebler (KL) divergence between the true distribution  Pθ∗on  X × Y, and the posterior distribution  Pmon  X × Ywith density

image

The KL divergence between  Pθ∗and  Pmis defined to be

image

Note that if  p(x, y|θ) =  p(x)p(y|θ) as above then

image

This form of the KL divergence has a natural interpretation: it is (within one bit) the expected extra number of bits needed to encode the output y using a code generated by the posterior p(y|x; z), over and above what would be required using an optimal code (one generated from  p(y|x; θ∗)). The expectation is over all pairs (x, y) drawn according to the true distribution  Pθ∗. This quantity is only zero if the posterior is equal to the true distribution.

In [8, 2] an analysis of  DK(Pθ∗∥Pm) was given for the limit of large training set size (m). They showed that if Θ is a compact subset of  Rd, and under certain extra restrictions which we won’t discuss here:

image

where o(1/m) stands for a function f(m) for which  mf(m) →0 as  m → ∞.

There is a strong similarity between this result and theorem 1 in the zero error case (see note 3 after the theorem).  DK(Pθ∗∥Pm) is the analogue of  |erz(h)−erP (h)|in this case, but because we have assumed realizability, erz(h) = 0. So theorem 1 says that choosing any hypothesis consistent with the data will guarantee you an error of no more than d/m, where d is the VC dimension of the learner’s hypothesis space. Although the error measure is different, equation (11) says essentially the same thing: if you classify novel data using a posterior generated according to Bayes rule, you will suffer an error of no more than d/m, where now d is the dimension of the Θ.

Recall from the introduction that the main extra assumption of both the Bayes and EP models of bias learning is that the learner is embedded in an environment of related tasks, and can sample from the environment to generate multiple training sets belonging to multiple different tasks. In the EP model of ordinary (single-task) learning, the learning problem is represented by a distribution P on  X × Y. So in the EP model of bias learning, an environment of learning problems is represented by a pair (P, Q) where P is the set of all probability distributions on  X×Y (i.e. Pis the set of all possible learning problems), and Q is a distribution on  P1. Qcontrols which learning problems the learner is likely to see. For example, if the learner is in a face recognition environment, Q will be highly peaked over face-recognition-type problems, whereas if the learner is in a character recognition environment Q will be peaked over character-recognition-type problems.

Recall from the end of section 2.1 that the learner’s bias is represented by its choice of hypothesis space H. So to enable the learner to learn the bias, it is supplied with a family or set of hypothesis spaces H := {H}. As each H is itself a set of functions  h: X → Y , His a set of sets of functions.

Putting this together, formally a learning to learn or bias learning problem consists of:

an input space X and an output space Y ,

a loss function  l: Y × Y → R,

an environment (P, Q) where P is the set of all probability distributions on X × Yand Q is a distribution on P,

a hypothesis space family H = {H} where each  H ∈ His a set of functions h: X → Y.

The goal of a bias learner is to find a hypothesis space  H ∈ Hminimizing the loss (recall equation (2))

image

The only way erQ(H) can be small is if, with high Q-probability, H contains a good solution to any problem P drawn at random according to Q. In this sense erQ(H) measures how appropriate the bias embodied by H is for the environment (P, Q).

In general the learner will not know Q, so it will not be able to find an H minimizing erQ(H) directly. However, the learner can sample from the environment in the following way:

Sample n times from P according to Q to yield: P1, . . . , Pn.

Sample m times from  X × Yaccording to each  Pito yield: zi = {(xi1, yi1) . . . , (xim, yim)}.

The learner receives the (n, m)-sample:

image

Note that an (n, m)-sample is simply n training sets  z1, . . . , znsampled from n different learning tasks  P1, . . . , Pn, where each task is selected according to the environmental probability distribution Q.

Instead of minimizing erQ(H), the learner searches for an  H ∈ Hminimizing the empirical loss on the sample z, where this is defined by:

image

(recall equation (4)). Note that ˆerz(H) is a biased estimate of erQ(H).

The question of generalisation within this framework now becomes: How many tasks (n) and how many examples of each task (m) do we need to ensure that ˆerz(H) and erQ(H) are close with high probability?

Or, informally, how many tasks and how many examples of each task are required to ensure that a hypothesis space with good solutions to all the training tasks will contain good solutions to novel tasks drawn from the same environment?

In order to present the main theorem answering this question, some extra defini-tions must be introduced.

image

For any sequence of n hypotheses (h1, . . . , hn), define (h1, . . . , hn)l: (X × Y )n → Rby

image

We will also use  hlto denote (h1, . . . , hn)l. For any H in the hypothesis space family H, define

image

In the first part of the definition above, hypotheses  h: X → Yare turned into functions  hlmapping  X × Y → Rusing the loss function.  Hlis then just the collection of all such functions where the original hypotheses come from  H. Hlis often called a loss-function class. In our case we are interested in the average loss across n tasks, where each of the n hypotheses is chosen from a fixed hypothesis space H. This motivates the definition of  hland  Hnl. Finally,  Hnlis the collection of all (h1, . . . , hn)l, with the restriction that all  h1, . . . , hnbelong to a single hypothesis space  H ∈ H.

image

image

It is the “size” of  Hnland  H∗that controls how large the (n, m)-sample z must be to ensure erz(H) and erQ(H) are close uniformly over all  H ∈ H. Their size will be defined in terms of certain covering numbers, and in order to define the covering numbers, we need first to define how to measure the distance between elements of Hnland also between elements of  H∗.

Definition 3 Let P = (P1, . . . , Pn) be any sequence of n probability distributions on  X × Y. For any  hl, h′l ∈ Hnl, define

image

It is easily verified that  dPis a pseudo-metric on  Hnl, and similarly that  dQis a pseudo-metric on  H∗. A pseudo-metric is simply a metric without the condition that  ρ(x, y) = 0  ⇒ x = y. For example,  dQ(H∗1, H∗2) could equal 0 simply because the distribution Q puts mass one on some distribution P for which  H∗1(P) =  H∗2(P), and not because  H∗1=  H∗2.

Definition 4 An  ε-cover of (H∗, dQ) is a set  {H∗1, . . . , H∗N}such that for all  H∗ ∈H∗, dQ(H∗, Hi) ≤ εfor some i = 1 . . . N. Let  N(ε, H∗, dQ) denote the size of the smallest such cover. Set

image

We can define  N(ε, Hnl , dP) in a similar way, using  dPin place of  dQ. Again, set:

image

Now we have enough machinery to state the main theorem.

Theorem 2 Let Q be any probability distribution on P, the set of all distributions on  X × Y. Suppose z is an (n, m)-sample generated by sampling n times from P according to Q to give  P1, . . . , Pn, and then sampling m times from each  Pito generate  zi = {(xi1, yi1), . . . , (xim, yim)}, i= 1, . . . , n. Suppose the loss function l: Y × Y → Rhas range [0, 1] (any bounded loss function can be rescaled so this is true). Let H = {H} be any hypothesis space family. If the number of tasks n satisfies

image

For a proof of a similar theorem to this one, see the proof of theorem 7 in [3]. Note that the constants in this theorem have not been heavily optimized.

There are several important points to note about theorem 2:

1. In order to learn to learn (in the sense that erQ(H) and ˆerz(H) are close uniformly over all  H ∈ H), both the number of tasks n and the number of examples of each task m must be sufficiently large.

2. We can never say for certain that erQ(H) and ˆerz(H) are close, only that they are close with high probability (1  − δ). Regardless of the size of the (n, m) sample z, we still might get unlucky and generate unrepresentative learning problems  P1, . . . , Pnor unrepresentative examples of those learning problems, although the chance of being unlucky diminishes as m and n grow.

3. Once the learner has found an  H ∈ Hwith a small value of ˆerz(H), it will then use H to learn novel tasks P drawn according to Q. Assuming that the learner is always able to find an  h ∈ Hminimizing erP (h), theorem 2 tells us that with probability at least 1−δ, the expected value of erP (h) on a novel task P will be less than ˆerz(H) +  ε. Of course, this does not rule out really bad performance on some tasks P. However, the probability of generating such “bad” tasks can be bounded. In particular, note that erQ(H) is just the expected value of the function  H∗over P, and so by Markov’s inequality, for  γ >0,

image

4. Keeping the accuracy and confidence parameters  ε, δfixed, note that the number of examples of each task required for good generalisation obeys

image

So provided ln  C (ε, Hnl) increases sublinearly with n, the upper bound on the number of examples required of each task will decrease as the number of tasks increases. This is discussed further after theorem 3 below.

Theorem 2 only provides conditions under which ˆerz(H) and erQ(H) are close, it does not guarantee that erQ(H) is actually small. This is governed by the choice of H. If H contains a hypothesis space H with a small value of erQ(H), and the learner minimizes error on the (n, m) sample z, then with high probability erQ(H) will be small. However, a bad choice of H will mean there is no hope of finding an H with small error. In this sense the choice of H represents the hyper-bias of the learner.

It may seem that we have simply replaced the problem of selecting the right bias (i.e. selecting the right hypothesis space H) with the equally difficult problem of selecting the right hyper-bias (i.e. the right hypothesis space family H). However, in many cases selecting the right hyper-bias is far easier than selecting the right bias. For example, in section 3.2 we will see how the feature selection problem may be viewed as a bias selection problem. Selecting the right features can be extremely difficult if one knows little about the environment, but specifying only that a set of features should exist and then learning those features is far simpler.

3.1. Learning multiple tasks

It may be that the learner is not interested in learning to learn, but simply wants to learn n tasks from the environment (P, Q). As in the previous section, we assume the learner starts out with a hypothesis space family H, and also that it receives an (n, m)-sample z generated from the n distributions  P1, . . . , Pn. This time, however, the learner is simply looking for n hypotheses (h1, . . . , hn), all contained in the same hypothesis space H, such that the average training set error of the n hypotheses is minimal. Denoting (h1, . . . , hn) by h, this error is defined by

image

For any hypothesis space H, let  Hn:=  {(h1, . . . , hn): hi ∈ H, i= 1, . . . , n}. Let Hn:=  ∪H∈HHn. Hnis simply the set of all possible sequences (h1, . . . , hn) where all the  h′iscome from the same hypothesis space H (recall the definition of  Hnlfor a similar concept). Writing P = (P1, . . . , Pn), the learner’s generalisation error in

this context is measured by the average generalisation error across the n tasks:

image

Recall definition 4 for the meaning of  C(ε, Hnl).

Theorem 3 Let P = (P1, . . . , Pn) be n probability distributions on  X × Yand let z be an (n, m)-sample generated by sampling m times from  X × Yaccording to each  Pi. Suppose the loss function  l: Y × Y → Rhas range [0, 1] (any bounded loss function can be rescaled so this is true). Let H = {H} be any hypothesis space family. If the number of examples m of each task satisfies m ≥max �72nε2ln 4C( ε24, Hnl)δ ,18ε2

image

Notes:

1. Note that the bound on m in theorem 3 is virtually identical to the bound on m in theorem 2.

2. The important thing about the bound on m is that it depends inversely on the number of tasks n (assuming that the first part of the “max” expression is the dominate one). In fact, it is easy to show that for any hypothesis space family H,

image

So keeping the accuracy parameters  εand  δfixed, and plugging (35) into (32), we see that the upper bound on the number of examples required of each task never increases with the number of tasks, and at best decreases as  O� 1n�. Although only an upper bound, this provides a strong hint that learning multiple related tasks should be advantageuos on a “number of examples required per task” basis.

3. In section 3.2 it will be shown that all types of behaviour, from no advantage at all to O( 1n) decrease, are possible.

image

Figure 1. Neural network for feature learning. The feature map is implemented by the first two hidden layers. The n output nodes correspond to the n different tasks in the (n, m)-sample z.

3.2. Feature learning with neural networks

Consider the following quote:

The classical approach to estimating multidimensional functional dependencies is based on the following belief:

Real-life problems are such that there exists a small number of “strong features,” simple functions of which (say linear combinations) approximate well the unknown function. Therefore, it is necessary to carefully choose a low-dimensional feature space and then to use regular statistical techniques to construct an approximation.

(from “The Nature of Statistical Learning Theory”, Vapnik 1996.) It must be pointed out that Vapnik advocates an alternative approach in his book: that of using an extremely large number of simple features but choosing a hypothesis with maximum classification margin. However his approach cannot be viewed as a form of bias learning or learning to learn, whereas the strong feature approach can, so here we will concentrate on the latter.

The aim of this section is to use the ideas of the previous section to show how neural-network feature sets can be learnt for an environment of related tasks.

In general, a set of features may be viewed as a map from the (typically highdimensional) input space  Rdto a much smaller dimensional space  Rk(k ≪ d). Any such bounded feature map can be approximated to arbitrarily high accuracy by a one-hidden-layer neural network with k output nodes. This is illustrated in Figure 1. Fixing the number of nodes in the first hidden layer, let Φw: Rd → Rkdenote the feature map computed by the the neural network with weights w. The set of all such feature maps is  {Φw: w ∈ RW }where W is the number of weights in the first two layers.

For argument’s sake, assume the “simple functions” of the features (mentioned in the above quote) are squashed linear maps. Denoting the k components of the feature map Φwby  φw,1, . . . , φw,k, each setting of the feature map weights generates a hypothesis space  Hwby

image

where  σ: R → Ris the squashing function.  Hwis simply the set of all squashed linear functions of the features Φw. The set of all such hypothesis spaces,

image

is a hypothesis space family.

Finding the right set of features for the environment (P, Q) is equivalent to finding the right hypothesis space  Hw ∈ H.

As in the previous section, the correct set of features may be learnt by finding a hypothesis space with small error on a sufficiently large (n, m)-sample z (recall that an (n, m)-sample is simply n training sets corresponding to n different learning tasks). Specializing to squared loss, in the present framework the error of  Hwon z (equation (14)) is given by

image

Using gradient descent and an n output node network as in figure 1, output weights (α1, . . . , αk) and feature weights w minimizing (38) can be found. For details see [4].

The size of z ensuring that the resulting features will be good for learning novel tasks from the same environment is given by theorem 2. All we have to do is compute the logarithm of the covering numbers  C(ε, Hnl) and  C(ε, H∗). If the feature weights w and the output weights  α1, . . . , αkare bounded, and the squashing function  σis Lipschitz2, then there exist constants  κ, κ′(independent of  ε, Wand k) such that for all  ε >0,

image

(see [5] for a proof). Plugging these expressions into theorem 2 gives the following theorem.

Theorem 4 Let  H = {Hw}be a hypothesis space family where each hypothesis space  Hwis a set of squashed linear maps composed with a neural network feature map, as above. Suppose the number of features is k, and the total number of feature weights is W. Assume all feature weights and output weights are bounded, and the squashing function  σis Lipschitz. Let z be an (n, m)-sample generated from the environment (P, Q). If

image

Notes:

1. Keeping the accuracy paramters  εand  δfixed, the upper bound on the number of examples required of each task behaves like O(k + W/n). The same upper bound also applies in theorem 3.

2. Once the feature map is learnt, only the output weights have to be estimated to learn a novel task. Again keeping the accuracy parameters fixed, this requires no more that O(k) examples.

3. Thus, as the number of tasks learnt increseas, the upper bound on the number of examples required of each task decays to the minimum possible, O(k).

4. If the “small number of strong features” assumption is correct, then k will be small. However, typically we will have very little idea of what the features are, so the size of the feature network will have to be huge, so  W ≫ k.

5. O(k + W/n) decreases most rapidly with increasing n when  W ≫ k, so at least in terms of the upper bound on the number of examples required per task, learning small feature sets is an ideal application for learning to learn.

6. Note that if we do away with the feature map altogether then W = 0 and the upper bound on m becomes O(k), independent of n. So in terms of the upper bound, learning n tasks becomes just as hard as learning one task. At the other extreme, if we fix the output weights then effctively k = 0 and the number of examples required of each task decreases as O(W/n). Thus a range of behaviour in the number of examples required of each task is possible: from no improvement at all to an O(1/n) decrease as the number of tasks n increases.

7. To rigorously conclude that learning n tasks is better than learning one, we would have to show a matching lower bound of Ω(k + W/n) on the number of examples required of each task. Rather than search for lower bounds within the EP model (which are difficult to prove), we discuss a Bayes model of learning to learn in the next section where simultaneous upper and lower bounds appear more naturally.

Recall from section 2.2 that in Bayesian models of ordinary learning the learner’s bias is represented by the space of possible distributions  {Pθ: θ ∈ Θ}along with the choice of prior  p(θ). The learning task P is assumed to be equal to some  Pθ∗where  θ∗ ∈Θ.

Observe that  p(θ) is a subjective prior distribution over a set of distributions  {Pθ}. It is subjective because it simply reflects the prior beliefs of the learner, not some objective stochastic phenomenon. Now note that the environment (P, Q) consists of a set of distributions P, and a distribution Q on P, and furthermore that P can be sampled according to Q to generate multiple tasks  P1, . . . , Pn(recall the discussion in section 3). This makes Q an objective prior distribution. Objective in the sense that it can be sampled, i.e. it corresponds to some objective stochastic phenomenon.

If we assume  P = {Pθ: θ ∈ Θ}(so now P is a restricted subset of all possible distributions on  X × Y), the goal of a bias learner in this framework is to find the right prior distribution Q. To do this, the learner must have available a set of possible prior distributions  {Pπ: π ∈ Π}from which to choose. Each  Pπis a distribution on Θ. We assume realizability, so that  Q = Pπ∗for some  π∗ ∈Π.

To summarize, the Bayes model of learning to learn consists of the following ingredients:

An input space X and an output space Y ,

a set of probability distributions  Pθon  X × Y, parameterised by  θ ∈Θ,

a set of prior distributions  Pπon Θ, parameterised by  π ∈Π.

an objective or environmental prior distribution  Pπ∗where  π∗ ∈Π.

To complete the model, the learner also has a subjective hyper-prior distribution PΠon Π.

The two-tiered structure with a set of possible priors  {Pπ: π ∈ Π}and a hyper-prior  p(π) on Π makes this model an example of a hierarchical Bayesian model [7, 9].

As with the EP model, the learner receives an (n, m)-sample z, generated by first sampling n times from Θ according  Pπ∗to give  θ1, . . . , θn, and then sampling m times from each  X × Yaccording to each  Pθito generate  zi= (xi1, yi1), . . . , (xim, yim). To simplify the notation in this section, let Z :=  X × Yand  zij:= (xij, yij). As it will be necessary to distinguish between (n, m −1) samples and (n, m) samples, this will be made explicit in the notation by writing  z(n,m)instead of z:

image

4.1. Loss as the extra information required to predict the next observation

Recall that in the Bayes model of single task learning (section 2.2), the learner’s loss was measured by the amount of extra information needed to encode novel examples of the task. So one way to measure the advantage in learning n tasks together is by the rate at which the learner’s loss in predicting novel examples decays for each task. This question is similar to that addressed by theorem 3.

So fix the number of tasks n, sample n tasks  θn=  θ1, . . . , θnaccording to the true prior  Pπ∗, and then for each m = 1, 2, . . . the learner has already seen  m −1 examples of each task:

image

where each row is drawn according to  P m−1θi(or equivalently, each column is drawn according to  Pθn). The learner then:

generates the posterior distribution  p(θn|z(n,m−1)) on the set of all n tasks, Θn, according to Bayes’ rule:

image

uses the posterior distribution to generate a predictive distribution on  Zn,

image

and suffers a loss,Ln,m, equal to the expected amount of extra information needed per task to encode a novel example of each task using the predictive distribution  p(zn|z(n,m−1)), over and above the minimum amount of information, i.e. the information required using the true distribution,  p(zn|θn):

image

Note that the loss at the first trial is:

image

where  p(zn) is the learner’s initial distribution on  Znbefore any data has arrived,

image

To understand better the meaning ofLn,m, consider the loss associated with learning a single classification task. In this case  Z = X × {0, 1}. If we assume that only the conditional distribution on class labels is affected by the model, then  p(z|θ) = p(x)p(y|x, θ), and for the predictive distribution,  p(z|zm) =  p(x)p(y|x, zm). Let α(x) := p(y = 1|x, θ) and  β(x) := p(y = 1|x, zm). Substituting these expressions into (46) and simplifying yields

image

The expression in square brackets is zero if  α(x) =  β(x), i.e.if the conditional distributions on class labels are the same for the true and predictive distributions. It increases slowly as  α(x) and  β(x) diverge.

It turns out thatLn,mis difficult to analyse, so instead we look at the cumulative loss over a sequence of trials:

image

i.e. the expected total loss incurred by the learner after m steps of the above process. Note that the expectation is over all sequences of n tasks  θndrawn according to Pπ∗and all (n, k)-samples drawn according to  p(zn|θn).

4.2. (a, b) models

In [6], the asymptotic behaviour ofCn,m,π∗as a function of m was analysed for general hierarchical models. To illustrate the results, and to show how they apply to the feature learning example of section 3.2, here we will restrict our attention to (a, b)-models. The concept of an (a, b)-model was formally defined in [6] as follows:

Definition 5 An (a, b)-model is a hierarchical model in which Π =  Rb, Θ =  Ra ×Rband the following conditions hold:

image

where  δ(·) is the b-dimensional Dirac delta function and  gπis a continuous function on  Rathat also varies continuously with  π.

2. The hyper-prior on Π,  PΠhas continuous density  p(π) and the true prior  π∗has positive density  p(π∗).

3. The conditional distributions  p(z|θ) are twice continuously differentiable functions of  θ.

The definition of an (a, b)-model in [6] contains two technical conditions which have been omitted in the above definition. The interested reader is referred to that paper for a full discussion. Condition 1 of an (a, b)-model formalizes the idea of a hierarchical model in which there are a + b parameters, b of which are effectively hyper-parameters and are fixed by the prior and the remaining a of which are model parameters. Conditions 2 and 3 ensure realizability and an appropriate level of smoothness.

The following definition is needed to state the main theorem.

Definition 6 Let (X, Σ, P) be a measure space and  f, g: N × X → R (Nis the positive integeres) be two real-valued functions on  N × Xsuch that for all  n ∈ N, f(n, ·) and  g(n, ·) are measurable functions on X. Set  Xn:= {x: f(n, x) = g(n, x)} for each  n ∈ N. We say

image

if limn→∞ P(Xn) = 1. This will be abbreviated to f(n, x) .= g(n, x) when X and P are clear from the context.

For the following theorem, fix n and take all limiting behaviour with respect to m.

Theorem 5 ([6],Theorem 6) In an (a, b)-model, the learner’s cumulative risk (50) satisfies

image

Note that equation (54) is an equality. In [6] it was stated as a .= relation but in fact the stronger expression holds.

Comparing (53) and (54), we see that as the number of tasks n increases, the effect of lack of knowledge of the true prior can be made arbitrarily small. Also, learning multiple tasks is most advantageous when  b ≫ a, i.e.when the true model is quite small but our uncertainty concerning the true model is large. This is a similar conclusion to the one reached in section 3.2 (see note 3 after theorem 3.

4.3. Learning features within the Bayes model

Consider the feature learning model of section 3.2 (recall figure 1). In this case, Θ is the set of all possible neural networks implementable by fixing the feature weights and the weights of a single output node. As there are k output weights and W feature weights, Θ =  Rk+W. The realizability assumption means there exists a fixed set of features such that all tasks in the environment can be implemented by composing a squashed linear map with the feature set. Thus, the true prior distribution  p(θ|π∗) is a delta-function over the correct feature weight setting, combined with an appropriate distribution over the k output weights (one that generates the tasks in the environment with the correct frequency). Assuming the output-weight distribution is continuous, the true prior is of the form (51), with a = k and b = W. The set of all such priors is simply the set of all distributions that are a delta function over some feature weight setting, combined with a continuous distribution over the output weights. To complete the model, suppose that for each  θ ∈ Rk+W, p(z|θ) is of the form

image

where  fθ(x) is the output of the network with weights  θand input x, and p(x) is a continuous density on some compact subset of  Rd. Assume the sigmoid on the hidden nodes is  σ(x) = tanh(x), and on the output node is  σ(x) = (1 + tanh(x))/2.

With these assumptions, the neural network feature learning model is a (k, W)-model in the sense of definition 5. Unfortunately, the neural network feature model does not satisfy the technical conditions mentioned after definition 5, so a straightforward application of theorem 5 is not possible. However, the difficulties are not insurmountable (see [6] for the details) and so we obtain the following theorem:

Theorem 6 For the neural-network feature learning model as above, the cumulative risk (50) satisfies

image

Note the reappearance of the factor k + W/n (recall theorem 3 and the notes afterwards). Comparing equations (55) and (56) we see again that the effect of ignorance of the true prior (in this case ignorance of the right features) can be made arbitrarily small by learning sufficiently many tasks. The improvement is greatest when the number of features (k) is small, but our uncertainty as to what the right feature set should be is large.

Two mathematical models of bias learning (or learning to learn) have been discussed: one based on Empirical process theory and the other a Hierarchical Bayesian model. Both models show that if a learning machine is embedded within an environment of related learning tasks, then it can learn its own bias for the environment by learning sufficiently many tasks. Bounds were given on the number of tasks and number of examples of each task needed to ensure good generalisation from a bias learner. Good generalisation in this case means that with high probability the learner’s choice of hypothesis space will contain good solutions to novel tasks drawn from the same environment. The theory was specialised to the case of feature learning with neural networks.

There are many pattern classification problems that can be viewed as consisting of large number of related tasks and that seem to possess small feature sets. Speech recognition, character recognition, face recognition and fingerprint recognition all fit this bill. Feature learning in these domains should be particularly successful.

image

1. Strictly speaking, in order for Q to be well defined we need to specify a  σ-algebra of subsets of P. However, such considerations are beyond the scope of the present discussion. See [3] for more details.

2.  σis Lipschitz if there exists a constant  K such that σ(x, x′) ≤ K|x − x′| for all x, x′ ∈ R.

1. Martin Anthony and Norman Biggs. Computational Learning Theory. Cambridge University Press, Cambridge, 1992.

2. Andrew Barron and Bertrand Clarke. Jeffreys’ Prior is Asymptotically Least Favourable under Entropy Risk. Journal of Statistical Planning and Inference, 41:37–60, 1994.

3. Jonathan Baxter. A Model of Bias Learning. Technical Report LSE-MPS-97, London School of Economics, Centre for Discrete and Applicable Mathematics, November 1995. Submitted for publication. Copy available from  http://keating.anu.edu.au/∼jon/papers/lse.ps.gz.

4. Jonathan Baxter. Learning Internal Representations. In Proceedings of the Eighth International Conference on Computational Learning Theory, pages 311–320. ACM Press, 1995.

5. Jonathan Baxter. Learning Internal Representations. PhD thesis, Department of Mathematics and Statistics, The Flinders University of South Australia, 1995. Copy available from http://keating.anu.edu.au/∼jon/papers/thesis.ps.gz.

image

10. David Hume. A Treatise of Human Nature. 1737. 11. Tom M Mitchell. The need for biases in learning generalisations. In Tom G Dietterich and

image

12. V N Vapnik. Estimation of Dependences based on Empirical Data. Springer-Verlag, New York, 1982. 13. V N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of

image


Designed for Accessibility and to further Open Science