b

DiscoverSearch
About
Introduction to Machine Learning: Class Notes 67577
2009·arXiv
Contents 1 Bayesian Decision Theory page 1 1.1 Independence Constraints 5 1.1.1 Example: Coin Toss 7 1.1.2 Example: Gaussian Density Estimation 7 1.2 Incremental Bayes Classifier 9 1.3 Bayes Classifier for 2-class Normal Distributions 10
Contents 1 Bayesian Decision Theory page 1 1.1 Independence Constraints 5 1.1.1 Example: Coin Toss 7 1.1.2 Example: Gaussian Density Estimation 7 1.2 Incremental Bayes Classifier 9 1.3 Bayes Classifier for 2-class Normal Distributions 10

2 Maximum Likelihood/ Maximum Entropy Duality 12 2.1 ML and Empirical Distribution 12 2.2 Relative Entropy 14 2.3 Maximum Entropy and Duality ML/MaxEnt 15

3 EM Algorithm: ML over Mixture of Distributions 19 3.1 The EM Algorithm: General 21 3.2 EM with i.i.d. Data 24 3.3 Back to the Coins Example 24 3.4 Gaussian Mixture 26 3.5 Application Examples 27 3.5.1 Gaussian Mixture and Clustering 27 3.5.2 Multinomial Mixture and ”bag of words” Application 27

4 Support Vector Machines and Kernel Functions 30 4.1 Large Margin Classifier as a Quadratic Linear Programming 31 4.2 The Support Vector Machine 34 4.3 The Kernel Trick 36 4.3.1 The Homogeneous Polynomial Kernel 37 4.3.2 The non-homogeneous Polynomial Kernel 38 4.3.3 The RBF Kernel 39 4.3.4 Classifying New Instances 39

image

5 Spectral Analysis I: PCA, LDA, CCA 41 5.1 PCA: Statistical Perspective 42 5.1.1 Maximizing the Variance of Output Coordinates 43 5.1.2 Decorrelation: Diagonalization of the Covariance Matrix 46 5.2 PCA: Optimal Reconstruction 47 5.3 The Case n >> m 49 5.4 Kernel PCA 49 5.5 Fisher’s LDA: Basic Idea 50 5.6 Fisher’s LDA: General Derivation 52 5.7 Fisher’s LDA: 2-class 54 5.8 LDA versus SVM 54 5.9 Canonical Correlation Analysis 55

6 Spectral Analysis II: Clustering 58 6.1 K-means Algorithm for Clustering 59 6.1.1 Matrix Formulation of K-means 60 6.2 Min-Cut 62 6.3 Spectral Clustering: Ratio-Cuts and Normalized-Cuts 63 6.3.1 Ratio-Cuts 64 6.3.2 Normalized-Cuts 65

7 The Formal (PAC) Learning Model 69 7.1 The Formal Model 69 7.2 The Rectangle Learning Problem 73 7.3 Learnability of Finite Concept Classes 75 7.3.1 The Realizable Case 76 7.3.2 The Unrealizable Case 77

8 The VC Dimension 80 8.1 The VC Dimension 81 8.2 The Relation between VC dimension and PAC Learning 85

9 The Double-Sampling Theorem 89 9.1 A Polynomial Bound on the Sample Size m for PAC Learning 89 9.2 Optimality of SVM Revisited 95

10 Appendix 97 Bibliography 105

image

During the next few lectures we will be looking at the inference from training data problem as a random process modeled by the joint probability distribution over input (measurements) and output (say class labels) variables. In general, estimating the underlying distribution is a daunting and unwieldy task, but there are a number of constraints or ”tricks of the trade” so to speak that under certain conditions make this task manageable and fairly effective.

To make things simple, we will assume a discrete world, i.e., that the values of our random variables take on a finite number of values. Consider for example two random variables X taking on k possible values  x1, ..., xkand H taking on two values  h1, h2. The values of X could stand for a Body Mass Index (BMI) measurement  weight/height2 of a person and H stands for the two possibilities  h1standing for the ”person being over-weight” and h2as the possibility ”person of normal weight”. Given a BMI measurement we would like to estimate the probability of the person being over-weight.

The joint probability P(X, H) is a two dimensional array (2-way array) with 2k entries (cells). Each training example (xi, hj) falls into one of those cells, therefore  P(X = xi, H = hj) = P(xi, hj) holds the ratio between the number of hits into cell (i, j) and the total number of training examples (assuming the training data arrive i.i.d.). As a result �ij P(xi, hj) = 1.

The projections of the array onto its vertical and horizontal axes by summing over columns or over rows is called marginalization and produces P(hj) = �i P(xi, hj) the sum over the j’th row is the probability  P(H = hj),i.e., the probability of a person being over-weight (or not) before we see any measurement — these are called priors. Likewise,  P(xi) = �j P(xi, hj)is the probability  P(X = xi) which is the probability of receiving such a BMI measurement to begin with — this is often called evidence. Note

image

Fig. 1.1. Joint probability P(X, H) where X ranges over 5 discrete values and H over two values. Each entry contains the number of hits for the cell (xi, hj). Thejoint probability  P(xi, hj) is the number of hits divided by the total number of hits (22). See text for more details.

that, by definition, �j P(hj) = �i P(xi) = 1. In Fig. 1.1 we have that P(h1) = 14/22, P(h2) = 8/22 that is there is a higher prior probability of a person being over-weight than being of normal weight. Also  P(x3) = 7/22is the highest meaning that we encounter BMI =  x3with the highest probability.

image

tween the number of hits in cell (i, j) and the number of hits in the i’th column, i.e., the probability that the outcome is  H = hjgiven the measurement  X = xi. In Fig. 1.1 we have  P(h2 | x3) = 3/7. Note that

image

Likewise, the conditional probability  P(xi | hj) = P(xi, hj)/P(hj) is thenumber of hits in cell (i, j) normalized by the number of hits in the j’th row and represents the probability of receiving BMI =  xigiven the class label H = hj(over-weight or not) of the person. In Fig. 1.1 we have  P(x3 | h2) =3/8 which is the probability of receiving BMI =  x3given that the person is known to be of normal weight. Note that �i P(xi | hj) = 1.

The Bayes formula arises from:

image

from which we get:

image

The left hand side  P(hj | xi) is called the posterior probability and  P(xi | hj)is called the class conditional likelihood. The Bayes formula provides a way to estimate the posterior probability from the prior, evidence and class likelihood. It is useful in cases where it is natural to compute (or collect data of) the class likelihood, yet it is not quite simple to compute directly the posterior. For example, given a measurement ”12” we would like to estimate the probability that the measurement came from tossing a pair of dice or from spinning a roulette table. If x = 12 is our measurement, and  h1stands for ”pair of dice” and  h2for ”roulette” then it is natural to compute the class conditional:  P(”12” | ”pair of dice”) = 1/36 andP(”12” | ”roulette”) = 1/38.Computing the posterior directly is much more difficult. As another example, consider medical diagnosis. Once it is known that a patient suffers from some disease  hj, it is natural to evaluate the probabilities  P(xi | hj) of the emerging symptoms  xi. As a result, in many inference problems it is natural to use the class conditionals as the basic building blocks and use the Bayes formula to invert those to obtain the posteriors.

The Bayes rule can often lead to unintuitive results — the one in particular is known as ”base rate fallacy” which shows how an nonuniform prior can influence the mapping from likelihoods to posteriors. On an intuitive basis, people tend to ignore priors and equate likelihoods to posteriors. The following example is typical: consider the ”Cancer test kit” problem†which has the following features: given that the subject has Cancer ”C”, the probability of the test kit producing a positive decision ”+” is P(+ | C) = 0.98 (which means that  P(− | C) = 0.02) and the probability of the kit producing a negative decision ”-” given that the subject is healthy ”H” is  P(− | H) = 0.97(which means also that P(+ | H) = 0.03). The prior probability of Cancer in the population is P(C) = 0.01. These numbers appear at first glance as quite reasonable, i.e, there is a probability of 98% that the test kit will produce the correct indication given that the subject has Cancer. What we are actually interested in is the probability that the subject has Cancer given that the test kit generated a positive decision, i.e., P(C | +). Using Bayes rule:

image

which means that there is a 26.6% chance that the subject has Cancer given that the test kit produced a positive response — by all means a very poor performance.

If we draw the posteriors  P(h1 |x) and P(h2 | x) using the probability distribution array in Fig. 1.1 we will see that  P(h1 |x) > P(h2 | x) for allvalues of X smaller than a value which is in between  x3 and x4. Therefore the decision which will minimize the probability of misclassification would

be to choose the class with the maximal posterior:

image

which is known as the Maximal A Posteriori (MAP) decision principle. Since P(x) is simply a normalization factor, the MAP principle is equivalent to:

image

In the case where information about the prior P(h) is not known or it is known that the prior is uniform, the we obtain the Maximum Likelihood (ML) principle:

image

The MAP principle is a particular case of a more general principle, known as ”proper Bayes”, where a loss is incorporated into the decision process. Let  l(hi, hj) be the loss incurred by deciding on class  hiwhen in fact  hj isthe correct class. For example, the ”0/1” loss function is:

image

The least-squares loss function is:  l(hi, hj) = ∥hi −hj∥2 typically used when the outcomes are vectors in some high dimensional space rather than class labels. We define the expected risk:

image

The proper Bayes decision policy is to minimize the expected risk:

image

The MAP policy arises in the case  l(hi, hj) is the 0/1 loss function:

image

Thus,

image

image

At this point we may pause and ask what have we obtained? well, not much. Clearly, the inference problem is captured by the joint probability distribution and we do not need all these formulas to see this. How do we obtain the necessary data to fill in the probability distribution array to begin with? Clearly without additional simplifying constraints the task is not practical as the size of these kind of arrays are exponential in the number of variables. There are three families of simplifying constraints used in the literature:

statistical independence constraints, parametric form of the class likelihood  P(xi | hj) where the inference

becomes a density estimation problem, structural assumptions — latent (hidden) variables, graphical models.

Today we will focus on the first of these simplifying constraints — statistical independence properties.

Consider two random variables X and Y . The variables are statistically independent  X⊥Y if P(X | Y ) = P(X) meaning that information about the value of Y does not add anything about X. The independence condition is equivalent to the constraint: P(X, Y ) = P(X)P(Y ). This can be easily proven: if  X⊥Y then P(X, Y ) = P(X | Y )P(Y ) = P(X)P(Y ). On theother hand, if P(X, Y ) = P(X)P(Y ) then

image

Let the values of X range over  x1, ..., xkand the values of Y range over y1, ..., yl. The associated  k × l2-way array,  P(X = xi, Y = yj) is represented by the outer product  P(xi, yj) = P(xi)P(yj) of two vectors P(X) = (P(x1), ..., P(xk)) and P(Y ) = (P(y1), ..., P(yl)). In other words, the 2-way array viewed as a matrix is of rank 1 and is determined by k + l (minus 2 because the sum of each vector is 1) parameters rather than kl (minus 1) parameters.

Likewise, if  X1⊥X2⊥....⊥Xn are nstatistically independent random variables where  Xiranges over  kidiscrete and distinct values, then the n-way array  P(X1, ..., Xn) = P(X1) · ... · P(Xn) is an outer-product of n vectors and is therefore determined by  k1 + ... + kn(minus n) parameters instead of  k1k2...kn(minus 1) parameters†. Viewed as a tensor, the joint probability is a rank 1 tensor. The main point is that the statistical independence assumption reduced the representation of the multivariate joint distribution from exponential to linear size.

Since our variables are typically divided to measurement variables and an output/class variable H (or in general  H1, ..., Hl), it is useful to introduce another, weaker form, of independence known as conditional independence. Variables X, Y are conditionally independent given H, denoted by X⊥Y | H, iff P(X | Y, H) = P(X | H) meaning that given H, the value of Y does not add any information about X. This is equivalent to the condition P(X, Y | H) = P(X | H)P(Y | H). The proof goes as follows:

If P(X | Y, H) = P(X | H), then

image

If P(X, Y | H) = P(X | H)P(Y | H), then

image

Consider as an example, Joe and Mo live on opposite sides of the city. Joe goes to work by train and Mo by car. Let X be the event ”Joe is late to work” and Y be the event ”Mo is late for work”. Clearly X and Y are not independent because there could be other factors. For example, a train strike will cause Joe to be late, but because of the strike there would be extra traffic (people using their car instead of the train) thus causing Mo to be pate as well. Therefore, a third variable H standing for the event ”train strike” would decouple X and Y .

From a computational standpoint, the conditional independence assumption has a similar effect to the unconditional independence. Let X range over k distinct value, Y range over r distinct values and H range over s distinct values. Then P(X, Y, H) is a 3-way array of size  k × r × s. Giventhat  X⊥Y | Hmeans that  P(X, Y | H = hi), a 2-way ”slice” of the 3-way array along the H axis is represented by the outer-product of two vectors P(X | H = hi)P(Y | H = hi). As a result the 3-way array is represented by s(k +r −2) parameters instead of  skr −1. Likewise, if  X1⊥....⊥Xn | H thenthe n-way array  P(X1, ..., Xn | H = hi) (which is a slice along the H axis of the (n + 1)-array  P(X1, ..., Xn, H)) is represented by an outer-product of n vectors, i.e., by  k1 + .. + kn − nparameters.

image

We will use the ML principle to estimate the bias of a coin. Let X be a random variable taking the value {0, 1} and H would be our hypothesis taking a real value in [0, 1] standing for the coin’s bias. If the coin’s bias is q then P(X = 0 | H = q) = q and P(X = 1 | H = q) = 1 − q. We receive m i.i.d. examples  x1, ..., xm where xi ∈ {0, 1}. We wish to determine the value of q. Given that  x1⊥...⊥xm | H, the ML problem we must solve is:

image

Let 0  ≤ λ ≤ mstand for the number of ’0’ instances, i.e.,  λ = |{xi = 0 | i =1, ..., m}|. Therefore our ML problem becomes:

image

Taking the partial derivative with respect to q and setting it to zero:

image

produces the result:

image

So far we considered constraints induced by conditional independent state- ments among the random variables as a means to reduce the space and time complexity of the multivariate distribution array. Another approach would be to assume some form of parametric form governing the entries of the array — the most popular assumption is Gaussian distribution  P(X1, ..., Xn) ∼N(µ, E) with mean vector  µand covariance matrix E. The parameters of the density function are denoted by  θ = (µ, E) and for every vector  x ∈ Rnwe have:

image

Assume we are given an i.i.d sample of  k points S = {x1, ..., xk}, xi ∈ Rn,and we would like to find the Bayes optimal  θ:

image

by maximizing the likelihood (here we are assuming that the the priors  P(θ)are equal, thus the maximum likelihood and the MAP would produce the same result). Because the sample was drawn i.i.d. we can assume that:

image

Let  L(θ) = log P(S | θ) = �i log P(xi | θ) and since Log is monotonously increasing we have that  θ∗ = argmax

be recovered by taking derivatives with respect to  θ, i.e., ∇θL= 0. We have:

image

We will start with a simple scenario where  E = σ2I, i.e., all the covariances are zero and all the variances are equal to  σ2. Thus, E−1 = σ−2I and|E| = σ2n. After substitution (and removal of items which do not depend on  θ) we have:

image

The partial derivative with respect to  µ:

image

from which we obtain:

image

The partial derivative with respect to  σ is:

image

from which we obtain:

image

Note that the reason for dividing by n is due to the fact that  σ21 = ... =

image

In the general case, E is a full rank symmetric matrix, then the derivative of eqn. (1.1) with respect to  µ is:

image

and since  E−1 is full rank we obtain  µ = (1/k) �i xi. For the derivative with respect to E we note two auxiliary items:

image

Using the fact that  x⊤y = trace(xy⊤) we can transform  z⊤E−1z to trace(zz⊤E−1)for any vector z. Given that  E−1 is symmetric, then:

image

Substituting  z = x − µwe obtain:

image

from which we obtain:

image

Consider another application of conditional dependence which is the Bayes incremental rule. Suppose we have processed  n examples X(n) = {X1, ..., Xn}and computed somehow  P(H | X(n)). We are given a new measurement X and wish to compute (update) the posterior  P(H | X(n), X). We will use the chain rule†:

P(X | Y, Z) = P(X, Y, Z)P(Y, Z ) = P(Z | X, Y )P(X | Y )P(Y )P(Z | Y )P(Y ) = P(Z | X, Y )P(X | Y )P(Z | Y )

to obtain:

image

from conditional independence,  P(X | X(n), H) = P(X | H). The term P(X | X(n)) can expanded as follows:

image

After substitution we obtain:

image

The old posterior  P(H | X(n)) is now the prior for the updated formula. Consider the following example†: We have a coin which could be either fair or biased towards Head at a probability of 0.6. Let H = h1be the event that the coin is fair, and  H = h2that the coin is biased. We start with prior probabilities  P(h1) = 0.75 and P(h2) = 0.25 (we have a higher initial belief that the coin is fair). Suppose our first coin toss is a Head, i.e.,  X1 = ”0”.Then,

image

and  P(h2 | x1) = 0.286. Our posterior belief that the coin is fair has gone down after a Head toss. Assume we have another measurement  X2 = ”0”,then:

image

and  P(h2 | x1, x2) = 0.325, thus our belief that the coin is fair continues to go down after Head tosses.

image

For the last topic in this lecture consider the 2-class inference problem. We will encountered this problem in this course in the context of SVM and LDA. In the Bayes framework, if  H = {h1, h2}denotes the ”class member” variable with two possible outcomes, then the MAP decision policy calls for

making the decision based on data x:

image

or in other words the class  h1would be chosen if  P(h1 | x) > P(h2 | x).The decision surface (as a function of x) is therefore described by:

image

The questions we ask here is what would the Bayes optimal decision surface be like if we assume that the two classes are normally distributed with different means and the same covariance matrix? What we will see is that under the condition of equal priors  P(h1) = P(h2) the decision surface is a hyperplane — and not only that, it is the same hyperplane produced by LDA.

image

N(µ2, E), the the Bayes optimal decision surface is a hyperplane  w⊤(x −µ) = 0 where µ = (µ1 + µ2)/2 and w = E−1(µ1 − µ2). In other words, the decision surface is described by:

image

Proof: The decision surface is described by  P(h1 | x) − P(h2 | x) = 0which is equivalent to the statement that the ratio of the posteriors is 1, or equivalently that the log of the ratio is zero, and using Bayes formula we obtain:

image

In other words, the decision surface is described by

image

After expanding the two terms we obtain eqn. (1.2).

image

In the previous lecture we defined the principle of Maximum Likelihood (ML): suppose we have random variables  X1, ..., Xnform a random sample from a discrete distribution whose joint probability distribution is  P(x | φ)where  x = (x1, ..., xn) is a vector in the sample and  φis a parameter from some parameter space (which could be a discrete set of values — say class membership). When  P(x | φ) is considered as a function of  φit is called the likelihood function. The ML principle is to select the value of  φthat maximizes the likelihood function over the observations (training set)  x1, ..., xm.If the observations are sampled i.i.d. (a common, not always valid, assumption), then the ML principle is to maximize:

image

which due to the product nature of the problem it becomes more convenient to maximize the log likelihood. We will take a closer look today at the ML principle by introducing a key element known as the relative entropy measure between distributions.

image

The ML principle states that the empirical distribution of an i.i.d. sequence of examples is the closest possible (in terms of relative entropy which would be defined later) to the true distribution. To make this statement clear let X be a set of symbols  {a1, ..., an} and let P(a | θ) be the probability (belonging to a parametric family with parameter  θ) of drawing a symbol a ∈ X. Let x1, ..., xmbe a sequence of symbols drawn i.i.d. according to P. The occurrence frequency f(a) measures the number of draws of the symbol

image

and let the empirical distribution be defined by

image

The joint probability  P(x1, ..., xm | φ) is equal to the product �i P(xi | φ)which according to the definitions above is equal to:

image

The ML principle is therefore equivalent to the optimization problem:

image

where  Q = {q ∈ Rn : q ≥ 0, �i qi = 1}denote the set of n-dimensional probability vectors (”probability simplex”). Let  pi stand for P(ai | φ) andfi stand for f(ai). Since argmaxxz(x) = argmaxx ln z(x) and given that ln �i pfii = �i fi ln pithe solution to this problem can be found by setting the partial derivative of the Lagrangian to zero:

image

where  λis the Lagrange multiplier associated with the equality constraint �i pi − 1 = 0 and µi ≥0 are the Lagrange multipliers associated with the inequality constraints  pi ≥0. We also have the complementary slackness condition that sets  µi = 0 if pi > 0.

After setting the partial derivative with respect to  pito zero we get:

image

Assume for now that  fi > 0 for i = 1, ..., n. Then from complementary slackness we must have  µi= 0 (because  pi > 0).We are left therefore with the result  pi = (1/λ)fi. Following the constraint �i p1= 1 we obtain λ = �i fi. As a result we obtain:  P(a | φ) = ˆP(a). In case fi= 0 we could use the convention 0 ln 0 = 0 and from continuity arrive to  pi = 0.

We have arrived to the following theorem:

Theorem 1 The empirical distribution estimate ˆP is the unique Maximum Likelihood estimate of the probability model Q on the occurrence frequency f().

This seems like an obvious result but it actually runs deep because the result holds for a very particular (and non-intuitive at first glance) distance measure between non-negative vectors. Let dist(f, p) be some distance measure between the two vectors. The result above states that:

image

for some (family?) of distance measures dist(). It turns out that there is only one†such distance measure, known as the relative-entropy, which satisfies the ML result stated above.

image

The relative-entropy (RE) measure D(x||y) between two non-negative vectors  x, y ∈ Rn is defined as:

image

In the definition we use the convention that 0 ln 00 = 0 and based on con- tinuity that 0 ln 0y = 0 and x ln x0 = ∞. When x, yare also probability vectors, i.e., belong to  Q, then D(x||y) = �i xi ln xiyi is also known as the Kullback-Leibler divergence. The RE measure is not a distance metric as it is not symmetric,  D(x||y) ̸= D(y||x), and does not satisfy the triangle inequality. Nevertheless, it has several interesting properties which make it a fundamental measure in statistical inference.

The relative entropy is always non-negative and is zero if and only if x = y. This comes about from the log-sum inequality:

image

Thus,

image

But  a ln(a/b) ≥ a − b for a, b ≥ 0 iff ln(a/b) ≥ 1 − (b/a) which follows from the inequality ln(x + 1) > x/(x + 1) (which holds for  x > −1 and x ̸= 0).We can state the following theorem:

Theorem 2 Let f ≥ 0be the occurrence frequency on a training sample. ˆP ∈ Qis a ML estimate iff

image

Proof:

image

and

image

There are two (related) interesting points to make here. First, from the proof of Thm. 1 we observe that the non-negativity constraint  p ≥ 0 neednot be enforced - as long as  f ≥0 (which holds by definition) the closest p to f under the constraint �i pi = 1 must come out non-negative. Second,the fact that the closest point p to f comes out as a scaling of f (which is by definition the empirical distribution ˆP) arises because of the relative-entropy measure. For example, if we had used a least-squares distance measure ∥f − p∥2 the result would not be a scaling of f. In other words, we are looking for a projection of the vector f onto the probability simplex, i.e., the intersection of the hyperplane  x⊤1= 1 and the non-negative orthant x ≥0. Under relative-entropy the projection is simply a scaling of f (and this is why we do not need to enforce non-negativity). Under least-sqaures, a projection onto the hyper-plane  x⊤1= 1 could take us out of the non-negative orthant (see Fig. 2.1 for illustration). So, relative-entropy is special in that regard — it not only provides the ML estimate, but also simplifies the optimization process†(something which would be more noticeable when we handle a latent class model next lecture).

image

The relative-entropy measure is not symmetric thus we expect different out- comes of the optimization minx D(x||y) compared to miny D(x||y). The lat-

image

Fig. 2.1. Projection of a non-neagtaive vector f onto the hyperplane �i xi − 1 = 0.Under relative-entropy the projection ˆP is a scaling of f (and thus lives in the probability simplex). Under least-squares the projection  p2lives outside of the probability simplex, i.e., could have negative coordinates.

ter of the two, i.e., minP∈Q D(P0||P), where P0is some empirical evidence and Q is some model, provides the ML estimation. For example, in the next lecture we will consider Q the set of low-rank joint distributions (called latent class model) and see how the ML (via relative-entropy minimization) solution can be found.

Let  H(p) = − �i pi ln pidenote the entropy function. With regard to minx D(x||y) we can state the following observation:

Claim 2

image

Proof:

image

which follows from the condition �i pi = 1.

In other words, the closest distribution to uniform is achieved by maximizing the entropy. To make this interesting we need to add constraints. Consider a linear constraint on  p such as �i αipi = β. To be concrete, con- sider a die with six faces thrown many times and we wish to estimate the probabilities  p1, ..., p6given only the  average �i ipi. Say, the average is 3.5 which is what one would expect from an unbiased die. The  Laplace’s prin-ciple of insufficient reasoning calls for assuming uniformity unless there is additional information (a controversial assumption in some cases). In other words, if we have no information except that each  pi ≥0 and that �i pi = 1we should choose the uniform distribution since we have no reason to choose any other distribution. Thus, employing Laplace’s principle we would say that if the average is 3.5 then the most ”likely” distribution is the uniform. What if  β = 4.2? This kind of problem can be stated as an optimization problem:

image

where  αi = i and β = 4.2. We have now two constraints and with the aid of Lagrange multipliers we can arrive to the result:

image

Note that because of the exponential  pi ≥0 and again ”non-negativity comes for free”†. Following the constraint �i pi= 1 we get exp−(1−λ) =1/ �i expµαifrom which obtain:

image

where Z (a function of  µ) is a normalization factor and  µneeds to be set by using  β(see later). There is nothing special about the uniform distribution, thus we could be seeking a probability vector p as close as possible to some prior probability  p0under the constraints above:

image

with the result:

image

We could also consider adding more linear constraints on p of the form: �i fijpi = bj, j = 1, ..., k. The result would be:

image

Probability distributions of this form are called Gibbs Distributions. In practical applications the linear constraints on p could arise from average information about the system such as temperature of a fluid (where  pi arethe probabilities of the particles moving at various velocities), rainfall data or general environmental data (where  pirepresent the probability of finding animal colonies at discrete locations in a 3D map). A constraint of the form �i fijpi = bjstates that the expectation  Ep[fj] should be equal to the empirical distribution  β = E ˆP [fj] where ˆPis either uniform or given as input. Let

image

and

image

We could therefore consider looking for the ML solution for the parameters µ1, ..., µkof the Gibbs distribution:

image

where if ˆp is uniform then min  D(ˆp||q) can be replaced by max �i ln qi(because  D((1/n)1||x) = − ln(n) − �i ln xi).

As it turns out, the MaxEnt and ML are duals of each other and the intersection of the two sets  P ∩ Qcontains only a single point which solves both problems.

Theorem 3 The following are equivalent:

image

In practice, the duality theorem is used to recover the parameters of the Gibbs distribution using the ML route (second line in the theorem above) — the algorithm for doing so is known as the iterative scaling algorithm (which we will not get into).

image

In Lecture 2 we saw that the Maximum Likelihood (ML) principle over i.i.d. data is achieved by minimizing the relative entropy between a model Q and the occurrence-frequency of the training data. Specifically, let  x1, .., xm bei.i.d. where each  xi ∈ X d is a d-tupple of symbols taken from an alphabet X having n different letters  {a1, ..., an}. Let ˆPbe the empirical joint distribution, i.e., an array with d dimensions where each axis has n entries, i.e., each entry ˆPi1,...,id, where ij = 1, ..., n, represents the (normalized) co-occurrence of the  d-tupe ai1, ..., aidin the training set  x1, ..., xm. We wish to find a joint distribution  P ∗ (also a d-array) which belongs to some model family of distributions Q closest as possible to ˆP in relative-entropy:

image

In this lecture we will focus on a model of distributions Q which represents mixtures of simple distributions H— known as latent class models. A latent class model arises when the joint probability  P(X1, ..., Xd) we observe (i.e., from which ˆP is generated by observing samples  x1, ..., xm) is in fact a marginal of  P(X1, ..., Xd, Y ) where Yis a ”hidden” (or ”latent”) random variable which has k different discrete values  α1, .., αk. Then,

image

The idea is that given the value of the hidden variable H the problem of recovering the model  P(X1, ..., Xd | Y = αj), which belongs to some family of joint distributions H, is a relatively simple problem. To make this idea clearer we consider the following example: Assume we have two coins. The first coin has a probability of heads (”0”) equal to p and the second coin has a probability of heads equal to q. At each trial we choose to toss coin 1 with probability  λand coin 2 with probability 1  − λ. Once a coin has been chosen it is tossed 3 times, producing an observation  x ∈ {0, 1}3. We aregiven a set of such observations  D = {x1, ..., xm}where each observation  xiis a triplet of coin tosses (the same coin). Given D, we can construct the empirical distribution ˆP which is a 2  × 2 ×2 array defined as:

image

Let  yi ∈ {1, 2}be a random variable associated with the observation  xi suchthat  yi = 1 if xiwas generated by coin 1 and  yi = 2 if xiwas generated by coin 2. If we knew the values of  yithen our task would be simply to estimate two separate Bernoulli distributions by separating the triplets generated from coin 1 from those generated by coin 2. Since  yiis not known, we have the marginal:

image

where (x1, x2, x3) ∈ {0, 1}3 is a triplet coin toss and 0  ≤ ni ≤ 3 is thenumber of heads (”0”) in the triplet of tosses. In other words, the likelihood P(x) of triplet of tosses  x = (x1, x2, x3) is a linear combination (”mixture”) of two Bernoulli distributions. Let H stand for Bernoulli distributions:

image

where  u⊗d stands for the outer-product of  u ∈ Rn with itself d times, i.e., an n- way array indexed by  i1, ..., id, where ij ∈ {1, ..., n}, and whose value there is equal to  ui1 · · · uid. The model family Q is a mixture of Bernoulli distributions:

image

where specifically for our coin-toss example becomes:

image

We see therefore that the eight entries of  P ∗ ∈ Qwhich minimizes  D( ˆP||P)over the set Q is determined by three parameters  λ, p, q. For the coin-toss

example this looks like:

image

where  ni123 = i1 + i2 + i3. Trying to work out an algorithm for minimizing the unknown parameters  λ, p, qwould be somewhat ”unpleasant” (and even more so for other families of distributions H) because of the log-over-a-sum present in the optimization function — if we could somehow turn this into a sum-over-log our task would be much easier. We would then be able to turn the problem into a succession of problems over H rather than a single problem over  Q = �j λjH.Another point worth attention is the non-negativity of the output variables — simply minimizing the relative-entropy measure under the constraints of the class model Q would not guarantee a non-negative solution. As we shall see, breaking down the problem into a successions of problems over H would give us the ”non-negativity for free” feature.

The technique for turning the log-over-sum into a sum-over-log as part of finding the ML solution for a mixture model is known as the ExpectationMaximization (EM) algorithm introduced by Dempster, Laird and Rubin in 1977. It is based on two ideas: (i) introduce auxiliary variables, and (ii) use of Jensen’s inequality.

image

Let  D = {x1, ..., xm}represent the training data where  xi ∈ Xis taken from some instance space X which we leave unspecified. For now, we leave matters to be as general as possible and specifically we do not make independence assumptions on the data generation process.

The ML problem is to find a setting of parameters  θwhich maximizes the likelihood  P(x1, ..., xm | θ), namely, we wish to maximize  P(D | θ) overparameters  θ, which is equivalent to maximizing the log-likelihood:

image

where y represents the hidden variables. We will denote  L(θ) = log P(D | θ).Let  q(y | D, θ) be some (arbitrary) distribution of the hidden variables y conditioned on the parameters  θand the input sample  D, i.e., �y q(y | D, θ) =1. We define a  lower bound on L(θ) as follows:

image

The inequality comes from Jensen’s inequality log �j αjaj ≥ �j αj log ajwhen �j αj= 1. What we have obtained is an ”auxiliary” function  Q(q, θ)satisfying

image

for all distributions  q(y | D, θ). The maximization of  Q(q, θ) proceeds by interleaving the variables  q and θas we separately ascend on each set of variables. At the (t + 1) iteration we fix the current value of  θ to be θ(t)of the t’th iteration and maximize  Q(q, θ(t)) over q, and then maximize

image

The strategy of the EM algorithm is to maximize the lower bound  Q(q, θ)with the hope that if we ascend on the lower bound function we will also ascend with respect to  L(θ). The claim below guarantees that an ascend on Q will also generate an ascend on L:

Claim 3 (Jordan-Bishop) The optimal  q(y | D, θ(t))at each step is P(y | D, θ(t)).

Proof: We will show that  Q(P(y | D, θ(t)), θ(t)) = L(θ(t)) which proves the claim since  L(θ) ≥ Q(q, θ) for all q, θ, thus the best q-distribution we can

hope to find is one that makes the lower-bound meet  L(θ) at θ = θ(t).

image

The proof provides also the validity for the approach of ascending along the lower bound  Q(q, θ) because at the point  θ(t) the two functions coincide, i.e., the lower bound function at  θ = θ(t) is equal to  L(θ(t)) therefore if we continue and  ascend along Q(·) we are guaranteedto ascend along  L(θ)as well†— therefore, convergence is guaranteed. It can also be shown (but omitted here) that the point of convergence is a stationary point of  L(θ) (wasshown originally by C.F. Jeff Wu in 1983 years after EM was introduced in 1977) under fairly general conditions. The second step of maximizing over θthen becomes:

image

This defines the EM algorithm. Often the ”Expectation” step is described as taking the expectation of:

image

followed by a Maximization step of finding  θthat maximizes the expectation — hence the term EM for this algorithm.

Eqn. 3.8 describes a principle but not an algorithm because in general, without making assumptions on the statistical relationship between the data points and the hidden variable the problem presented in eqn. 3.8 is unwieldy. We will reduce eqn. 3.8 to something more manageable by making the i.i.d. assumption. This is detailed in the following section.

image

The EM optimization presented in eqn. 3.8 can be simplified if we assume the data points (and the hidden variable values) are i.i.d.

image

and

image

For any  α(yi) we have:

image

this is because �yj P(yj | xj, θ) = 1. Substituting the simplifications above into eqn. 3.8 we obtain:

image

where yi ∈ {α1, ..., αk}.

image

We will apply the EM scheme to our running example of mixture of Bernoulli distributions. We wish to compute

image

and then maximize Q() with respect to  p, q, λ.

image

Q(θ, θ′) =

image

where  θ′ stands for θ(t) and µi = P(yi = 1 | xi, θ′). The values of  µi areknown since  θ′ = (λo, po, qo) are given from the previous iteration. The Bayes formula is used to compute  µi:

image

We wish to compute: maxp,q,λ Q(θ, θ′). The partial derivative with respect

image

from which we obtain the update formula of  λ given µi:

image

The partial derivative with respect to p is:

image

from which we obtain the update formula:

image

Likewise the update rule for q is:

image

To conclude, we start with some initial ”guess” of the values of  p, q, λ, com-pute the values of  µiand update iteratively the values of  p, q, λwhere at the end of each iteration the new values of  µiare computed.

image

The Gaussian mixture model assumes that  P(x) where x ∈ Rd is a linear combination of Gaussian distributions

image

where

image

is Normally distributed with mean  cjand covariance matrix  σ2j I. Let D ={x1, ..., xm}be the i.i.d sample data and we wish to solve for the mean and covariances of the individual Gaussians (the ”factors”) and the mixing coefficients  λj = P(y = j). In order to make clear where the parameters are located we will write  P(x | φj) instead of  P(x | y = j) where φj = (cj, σ2j )are the mean and variance of the j’th factor. We denote by  θthe collection of mixing coefficients  λj and φj, j = 1, ..., k. Let wji be auxiliary variables per point  xiand per factor y = j standing for:

image

The EM step (eqn. 3.9) is:

image

Note the constraint �j λj= 1. The update formula for  wji is done through the use of Bayes formula:

image

where  Ziis a scaling factor so that �j wji = 1.The update formula for  λj, cj, σjfollow by taking partial derivatives of eqn. (3.10) and setting them to zero. Taking partial derivatives with respect

to  λj, cj and σjwe obtain the update rules:

image

In other words, the observations  xiare weighted by  wji before a Gaussian is fitted (k times, one for each factor).

image

The Gaussian mixture model is classically used for clustering applications. In a clustering application one receives a sample of points  x1, ..., xm whereeach point resides in  Rd. The task of the learner (in this case ”unsupervised” learning) is to group the m points into  k sets. Let yi ∈ {1, ..., k} wherei = 1, ..., m stands for the required labeling. The clustering solution is an assignment of values to  y1, ..., ymaccording to some clustering criteria.

In the Gaussian mixture model points are clustered together if they arise from the same Gaussian distribution. The EM algorithm provides a probabilistic assignment  P(yi = j | xi) which we denoted above as  wji .

image

The multinomial mixture (the coins example we toyed with) is typically used for representing ”count” data, such as when representing text documents as high-dimensional vectors. A vector representation of a text document associates a word from a fixed vocabulary to a coordinate entry of the vector. The value of the entry represents the number of times that particular word appeared in the document. If we ignore the order in which the words appeared and count only their frequency, a set of documents  d1, ..., dm and aset of words  w1, ...., wncould be jointly represented by a co-occurence  n×mmatrix  G where Gijcontains the number of times word  wiappeared in document  dj. If we scale  G such that �ij Gij= 1 then we have a distribution P(w, d). This kind of representation of a set of documents is called ”bag of words”.

For purposes of search and filtering it is desired to reveal additional information about words and documents such as to which ”topic” a document belongs to or to which topics a word is associated with. This is similar to a clustering task where documents associated with the same topic are to be clustered together. This can be achieved by considering the topics as the value of a latent variable y:

image

where we made the assumption that  w⊥d | y(i.e., words and documents are conditionally independent given the topic). The conditional independent assumption gives rise to the multinomial mixture model. To be more specific, ley  y ∈ {1, ..., k}denote the k possible topics and let  λj = P(y = j) (notethat �j λj= 1), then the latent class model becomes:

image

Note that P(w | y = j) is a vector which we denote as  uj ∈ Rn and P(d | y =j) is also a vector we denote by  vj ∈ Rm. The term P(w | y = j)P(d | y = j)stands for the outer-product  ujv⊤j of the two vectors, i.e., is a rank-1  n × mmatrix. The Maximum-Likelihood estimation problem is therefore to find vectors  u1, ..., uk and v1, ..., vkand scalars  λ1, ..., λksuch that the empirical distribution represented by the unit scaled matrix G is as close as possible (in relative-entropy measure) to the low-rank matrix �j λjujv⊤j subject tothe constraints of non-negativity and �j λj = 1, uj and vjare unit-scaled as well (1⊤uj = 1⊤vj = 1).

Let  xi = (w(i), d(i)) stand for the i’th example i = 1, ..., q where an example is a pair of word and document where  w(i) ∈ {1, ..., n}is the index to the word alphabet and  d(i) ∈ {1, ..., m}is the index to the document. The EM algorithm involves the following optimization step:

image

An update rule for  ujr (the r’th entry of  uj) is derived below: the derivative

of the Lagrangian is:

image

where N(r) stands for the frequency of the word  wrin all the documents d1, ..., dm. Note that N(r) is the result of summing-up the  r’th row of G andthat the vector N(1), ..., N(n) is the marginal  P(w) = �d P(w, d). Giventhe constraint  1⊤uj= 1 we obtain the update rule:

image

Update rules for the remaining unknowns are similarly derived. Once EM has converged, then �w(i)=r w∗ij is the probability of the word  wr to belongto the j’th topic and �d(i)=s w∗ij is the probability that the s’th document comes from the  j’th topic.

image

In this lecture we begin the exploration of the 2-class hyperplane separation problem. We are given a training set of instances  xi ∈ Rn, i = 1, ..., m,and class labels  yi = ±1 (i.e., the training set is made up of “positive” and “negative” examples). We wish to find a hyperplane direction  w ∈ Rnand an offset scalar  b such that w · xi − b >0 for positive examples and w · xi − b <0 for negative examples — which together means that the margins  yi(w · xi − b) >0 are positive.

Assuming that such a hyperplane exists, clearly it is not unique. We therefore need to introduce another constraint so that we could find the most “sensible” solution among all (infinitley many) possible hyperplanes which separate the training data. Another issue is that the framework is very limited in the sense that for most real-world classification problems it is somewhat unlikely that there would exist a linear separating function to begin with. We therefore need to find a way to extend the framework to include non-linear decision boundaries at a reasonable cost. These two issues will be the focus of this lecture.

Regarding the first issue, since there is more than one separating hyper-plane (assuming the training data is linearly separable) then the question we need to ask ourselves is among all those solutions which of them has the best “generalization” properties? In other words, our goal in constructing a learning machine is not necessarily to do very well (or perfect) on the training data, because the training data is merely a sample of the instance space, and not necessarily a “representative” sample — it is simply a sample. Therefore, doing well on the sample (the training data) does not necessarily guarantee (or even imply) that we will do well on the entire instance space. The goal of constructing a learning machine is to maximize the performance on the test data (the instances we haven’t seen), which in turn means that we wish to generalize “good” classification performance on the training set onto the entire instance space.

A related issue to generalization is that the distribution used to generate the training data is unknown. Unlike the statistical inference material we had so far, this time we will not attempt to estimate the distribution. The reason one can derive optimal learning algorithms yet bypass the need for estimating distributions would be explained later in the course when PAClearning will be introduced. For now we will focus only on the algorithmic aspect of the learning problem.

The idea is to consider a subset  Cγof all hyperplanes which have a fixed margin  γwhere the margin is defined as the distance of the closest training point to the hyperplane:

image

The Support Vector Machine (SVM), first introduced by Vapnik and his colleagues in 1992, seeks a separating hyperplane which simultaneously minimizes the empirical error and maximizes the margin. The idea of maximizing the margin is intuitively appealing because a decision boundary which lies close to some of the training instances is less likely to generalize well because the learning machine will be susceptible to small perturbations of those instance vectors. A formal motivation for this approach is deferred to the PAC-learning material we will introduce later in the course.

4.1 Large Margin Classifier as a Quadratic Linear Programming

We would first like to set up the linear separating hyperplane as an optimization problem which is both consistent with the training data and maximizes the margin induce by the separating hyperplane over all possible consistent hyperplanes.

image

defined by

image

Since we are allowed to scale the parameters w, b at will (note that if  w ·x − b > 0 so is (λw) · x − (λb) > 0 for all λ >0) we can set the distance between the boundary points to the hyperplane to be 1/√w · wby scaling w, b such the point(s) with smallest margin (closest to the hyperplane) will be normalized:  | w·x−b |= 1, therefore the margin is simply 2/√w · w (seeFig. 5.1). Note that argmaxw2/√w · wis equivalent to argmaxw2/(w · w)which in turn is equivalent to argminw12w · w.Since all positive points and negative points should be farther away from the boundary points we also have the separability constraints  w · x − b ≥ 1 when xis a positive instance and  w·x−b ≤ −1 when xis a negative instance. Both separability constraints can be combined:  y(w · x − b) ≥1. Taken together, we have defined the following optimization problem:

image

This type of optimization problem has a quadratic criteria function and linear inequalities and is known in the literature as a Quadratic Linear Programming (QP) type of problem.

This particular QP, however, requires that the training data are linearly separable — a condition which may be unrealistic. We can relax this condition by introducing the concept of a “soft margin” in which the separability holds approximately with some error:

image

Where  νis some pre-defined weighting factor. The (non-negative) variables  ϵiallow data points to be miss-classified thereby creating an approximate separation. Specifically, if  xiis a positive instance (yi= 1) then the “soft” constraint becomes:

image

where if  ϵi= 0 we are back to the original constraint where  xiis either a boundary point or laying further away in the half space assigned to positive instances. When  ϵi >0 the point  xican reside inside the margin or even in the half space assigned to negative instances. Likewise, if  xiis a negative instance (yi = −1) then the soft constraint becomes:

image

image

Fig. 4.1. Separating hyperplane w, b with maximal margin. The boundary points are associated with non-vanishing Lagrange multipliers  µi >0 and margin errors are associated with  ϵi >0 where the criteria function encourages a small number of margin errors.

The criterion function penalizes (the  L1-norm) for non-vanishing  ϵi thus theoverall system will seek a solution with few as possible “margin errors” (see Fig. 5.1). Typically, when possible, an  L1norm is preferable as the  L2 normoverly weighs high magnitude outliers which in some cases can dominate the energy function. Another note to make here is that strictly speaking the ”right thing” to do is to penalize the margin errors based on the  L0norm  ∥ϵ∥00 = |{i : ϵi > 0}|, i.e., the number of non-zero entries, and drop the balancing parameter  ν. This is because it does not matter how far away a point is from the hyperplane — all what matters is whether a point is classified correctly or not (see the definition of empirical error in Lecture 4). The problem with that is that the optimization problem would no longer be convex and non-convex problems are notoriously difficult to solve. Moreover, the class of convex optimization problems (as the one described in Eqn. 4.3) can be solved in polynomial time complexity.

So far we have described the problem formulation which when solved would provide a solution with “sensible” generalization properties. Although we can proceed using an off-the-shelf QLP solver, we will first pursue the ”dual” problem. The dual form will highlight some key properties of the approach and will enable us to extend the framework to handle non-linear decision surfaces at a very little cost. In the appendix we take a brief tour on the basic principles associated with constrained optimization, the Karush-Kuhn-Tucker (KKT) theorem and the dual form. Those are recommended to read before moving to the next section.

image

We return now to the primal problem (eqn. 6.3) representing the maximal margin separating hyperplane with margin errors:

image

We will now derive the Lagrangian Dual of this problem. By doing so a new key property will emerge facilitated by the fact that the criteria function  θ(µ) (note there are no equality constraints thus there is no need for  λ)involves only inner-products of the training instance vectors  xi. This property will form the key of mapping the original input space of dimension n to a higher dimensional space thereby allowing for non-linear decision surfaces for separating the training data.

Note that with this particular problem the strong duality conditions are satisfied because the criteria function and the inequality constraints form a convex set. The Lagrangian takes the following form:

image

Recall that

image

Since the minimum is obtained at the vanishing partial derivatives of the Lagrangian with respect to w, b, the next step would be to evaluate those

constraints and substitute them back into L() to obtain  θ(µ):

image

From the first constraint (4.4) we obtain  w = �i µiyixi, that is, w is de-scribed by a linear combination of a subset of the training instances. The reason that not all instances participate in the linear superposition is due to the KKT conditions:  µi = 0 when yi(w · xi − b) >1, i.e., the instance  xiis classified correctly and is not a boundary point, and conversely,  µi > 0when  yi(w · xi − b) = 1 − ϵi, i.e., when  xiis a boundary point or when xiis a margin error (ϵi >0) — note that for a margin error instance the value of  ϵiwould be the smallest possible required to reach an equality in the constraint because the criteria function penalizes large values of  ϵi. Theboundary points (and the margin errors) are called support vectors thus w is defined by the support vectors only. The third constraint (4.6) is equivalent to the constraint:

image

since  δi ≥0. Also note that if  ϵi >0, i.e., point  xiis a margin-error point, then by KKT conditions we must have  δi= 0. As a result  µi = ν. Therefore based on the values of  µialone we can make the following classifications:

image

Substituting these results/constraints back into the Lagrangian L() we obtain the dual problem:

image

The criterion function  θ(µ) can be written in a more compact manner as follows: Let  M be a l × lmatrix whose entries are  Mij = yiyjxi · xj thenθ(µ) = µ⊤1 − 12µ⊤Mµ where 1is the vector of (1, ..., 1) and µis the vector (µ1, ..., µm) and µ⊤ is the transpose (row vector). Note that M is positive definite, i.e., x⊤Mx >0 for all vectors  x ̸= 0 — a property which will be important later.

The key feature of the dual problem is not so much that it is simpler than the primal (in fact it isn’t since the primal has no equality constraints) or that it has a more “elegant” feel, the key feature is that the problem is completely described by the inner products of the training instances  xi,i = 1, ..., m. This fact will be shown to be a crucial ingredient in the so called “kernel trick” for the computation of inner-products in high dimensional spaces using simple functions defined on pairs of training instances.

image

We ended with the dual formulation of the SVM problem and noticed that the input data vectors  xiare represented by the Gram matrix M. In other words, only inner-products of the input vectors play a role in the dual formulation — there is no explicit use of  xior any other function of  xi besidesinner-products. This observation suggests the use of what is known as the ”kernel trick” to replace the inner-products by non-linear functions.

The common principle of kernel methods is to construct nonlinear variants of linear algorithms by substituting inner-products by nonlinear kernel functions. Under certain conditions this process can be interpreted as mapping of the original measurement vectors (so called ”input space”) onto some higher dimensional space (possibly infinitely high) commonly referred to as the ”feature space”. Mathematically, the kernel approach is defined as follows: let  x1, ..., xlbe vectors in the input space, say  Rn, and con-sider a mapping  φ(x) : Rn → F where Fis an inner-product space. The kernel-trick is to calculate the inner-product in F using a kernel function k : Rn ×Rn → R, k(xi, xj) = φ(xi)⊤φ(xj), while avoiding explicit mappings (evaluation of)  φ().

Common choices of kernel selection include the d’th order polynomial kernels  k(xi, xj) = (x⊤i xj + θ)dand the Gaussian RBF kernels  k(xi, xj) =exp(− 12σ2 ∥xi − xj∥2). If an algorithm can be restated such that the input vectors appear in terms of inner-products only, one can substitute the inner-products by such a kernel function. The resulting kernel algorithm can be interpreted as running the original algorithm on the space F of mapped objects  φ(x).

We know that M of the dual form is positive semi-definite because M

can be written is  M = Q⊤Q where Q = [y1x1, ..., ylxl]. Therefore  x⊤Mx =∥Qx∥2 ≥0 for all choices of x (which means that the eigenvalues of M are non-negative). If the entries of M are to be replaced with  yiyjk(xi, xj) thenthe condition we must enforce on the function k() is that it is a positive definite kernel function. A positive definite function is defined such that for any set of vectors  x1, ..., xqand for any values of q the matrix K whose entries are  Kij = k(xi, xj) is positive semi-definite. Formally, the conditions for admissible kernels k() are known as Mercer’s conditions summarized below:

Theorem 4 (Mercer’s Conditions) Let k(x, y)be symmetric and continuous. The following conditions are equivalent:

image

(iii) for all  {xi}qi=1 and for all q, the matrix  Kij = k(xi, xj)is positive semi-definite.

Perhaps the non-obvious condition is No. 1 which allows for the feature map  φ() to have infinitely many coordinates (a vector in Hilbert space). For example, as we shall see below, the kernel exp(− 12σ2 ∥xi − xj∥2) is an inner- product of two vectors with infinitely many coordinates. We will consider next a number of popular kernels.

image

Let  x, y ∈ Rk and define k(x, y) = (x⊤y)d where d >0 is a natural number. Then, the corresponding feature map  φ(x) has�k+d−1d �= O(kd) coordinates which take the value:

image

where� dn1,...,nk�= d!/(n1! · · · nk!) is the multinomial coefficient (number of ways to distribute d balls into k bins where the j’th bin hold exactly  nj ≥ 0balls):

image

The dimension of the vector space  φ(x) where x ∈ Rk can be measured using the following combinatorial problem: how many arrangements of  k−1partitions to be placed among d items? the answer is�k+d−1k−1 �=�k+d−1d �=O(kd). For example, k = d = 2 gives us :

image

where  φ(x) = (x21, x22,√2x1x2).

image

The feature map  φ(x) contains all monomials whose power is lesser or equal to  d, i.e., �i ni ≤ d. This can be acheived by increasing the dimension to  k + 1 where nk+1is used to fill the gap between �ki=1 ni < d and d.Therefore the dimension of  φ(x) where x ∈ Rk would be�k+dd �. We have:

(x⊤y + θ)d = (x1y1 + ... + xkyk +√θ√θ)d

image

Therefore, the entries of the vector  φ(x) take the values:

image

For example, k = d = 2 gives us :

image

where  φ(x) = (x21, x22,√2x1x2,√2θx1,√2θx2,√θ). In this example,  φ() is amapping from  R2 to R6 and hyperplanes  φ(w)⊤φ(x)−b = 0 in R6 correspondto  conics in R2:

image

Assume we would like to find a separating conic (Parabola, Hyperbola, Ellipse) function rather than a line in  R2. The discussion so far suggests we construct the Gram matrix M in the dual form with the d = 2 polynomial kernel  k(x, y) = (x⊤y+θ)2 for some parameter  θof our choosing. The extra effort we will need to invest is negligible — simply replace every occurrence x⊤i xj with (x⊤i xj + θ)2.

image

The function  k(x, y) = e−∥x−y∥2/2σ2 known as a Radial Basis Function(RBF) is a kernel function but with an infinite expansion. Without loss of generality let  σ= 1, then we have:

image

From which we can see that the entries of the feature map  φ(x) are:

image

By adopting some kernel k() we are in fact mapping  x → φ(x), thus wethen proceed to solve for  φ(w) and busing some QLP solver. The QLP solution of the dual form will yield the solution for the Lagrange multipliers µ1, ..., µm. We saw from eqn. (4.4) that we can express  φ(w) as a function of the (mapped) examples:

image

Rather than explicitly representing  φ(w) — a task which may be prohibitly expensive since in general the dimension of the feature space of a polynomial mapping is�k+dd �— we store all the support vectors (those input vectors with corresponding  µi >0) and use them for the evaluation of test examples:

image

We see that the kernel trick enabled us to look for a non-linear separating surface by making an implicit mapping of the input space onto a higher dimensional feature space using the same dual form of the SVM formulation — the only change required was in the way the Gram matrix was constructed. The price paid for this convenience is to carry all the support vectors at the time of classification f(x).

A couple of notes may be worthwhile at this point. The constant b can be recovered from any of the support vectors. Say,  x+ is a positive support vector (but not a margin error, i.e.,  µi < ν). Then φ(w)⊤φ(x+) − b = 1from which b can be recovered. The second note is that the number of support vectors is typically around 10% of the number of training examples (empirically). Thus the computational load during evaluation of f(x) may be relatively high. Approximations have been proposed in the literature by looking for a reduced number of support vectors (not necessarily aligned with the training set) — but this is beyond the scope of this course.

The kernel trick gained its popularity with the introduction of the SVM but since then has taken a life of its own and has been applied to principal component analysis (PCA), ridge regression, canonical correlation analysis (CCA), QR factorization and the list goes on. We will meet again with the kernel trick later on.

image

In this lecture (and the following one) we will focus on spectral methods for learning. Today we will focus on dimensionality reduction using Principle Component Analysis (PCA), multi-class learning using Linear Discriminant Analysis (LDA) and Canonical Correlation Analysis (CCA). In the next lecture we will focus on spectral clustering methods.

Dimensionality reduction appears when the dimension of the input vector is very large (imagine pixels in an image, for example) while the coordinate measurements are highly inter-dependent (again, imagine the redundancy present among neighboring pixels in an image). High dimensional data impose computational efficiency challenges and often translate to poor generalization abilities of the learning engine (see lectures on PAC). A dimensionality reduction can also be viewed as a feature extraction process where one takes as input a large feature set (the original measurements) and creates from them a much smaller number of new features which are then fed into the learning engine.

In this lecture we will focus on feature extraction from a very specific (and constrained) stanpoint. We would be looking for a mixing (linear combination) of the input coordinates such that we obtain a linear projection from Rn to Rq for some q < n. In doing so we wish to reduce the redundancy while preserving as much as possible the variance of the data. From a statistical standpoint this is achieved by transforming to a new set of variables, called principal components, which are uncorrelated so that the first few retain most of the variation present in all of the original coordinates. For example, in an image processing application the input images are highly redundant where neighboring pixel values are highly correlated. The purpose of feature extraction would be to transform the input image into a vector of output components with the least redundancy possible. Form a geometric standpoint, this is achieved by finding the ”closest” (in least squares sense) linear q-dimensional susbspace to the m sample points S. The new subspace is a lower dimensional ”best approximation” to the sample S. These two, equivalent, perspectives on data compression (dimensionality reduction) form the central idea of principal component analysis (PCA) which probably the oldest (going back to Pearson 1901) and best known of the techniques of multivariate analysis in statistics. The computation of PCA is very simple and the definition is straightforward, but has a wide variety of different applications, a number of different derivations, quite a number of different terminologies (especially outside the statistical literature) and is the basis for quite a number of variations on the basic technique.

We then extend the variance preserving approach for data representation for labeled data sets. We will describe the linear classifier approach (separating hyperplane) form the point of view of looking for a hyperplane such that when the data is projected onto it the separation is maximized (the distance between the class means is maximal) and the data within each class is compact (the variance/spread is minimized). The solution is also produced, just like PCA, by a spectral analysis of the data. This approach goes under the name of Fisher’s Linear Discriminant Analysis (LDA).

What is common between PCA and LDA is (i) the use of spectral matrix analysis — i.e., what can you do with eigenvalues and eigenvectors of matrices representing subspaces of the data? (ii) these techniques produce optimal results for normally distributed data and are very easy to implement. There is a large variety of uses of spectral analysis in statistical and learning literature including spectral clustering, Multi Dimensional Scaling (MDS) and data modeling in general. Another point to note is that this is the first time in the course where the type of data distribution plays a role in the analysis — the two techniques are defined for any distribution but are optimal only under the Gaussian distribution.

We will also describe a non-linear extension of PCA known as KernelPCA, but the focus would be mostly on PCA itself and its analysis from a couple of vantage points: (i) PCA as an optimal reconstruction after a dimension reduction, i.e., data compression, and (ii) PCA for redundancy reduction (decorrelation) of the output components.

image

Let  x1, ..., xm ∈ Rn be our sample data S of vectors in  Rn, arranged as columns of a matrix A. It will be convenient to assume that the data is centered, i.e., � xi= 0. If the data is not centered we can always center it by computing the mean vector  µ = (1/m) �i xiand replace the original data sample with the new sample  xi − µ. In a statistical sense, the coordinates of the vector  x ∈ Rn are considered as random variables, thus a row in the matrix A is the sample of values of a particular random variable, drawn from some unknown probability distribution, associated with the row position. We wish to find vectors  u1, ..., uq(arranged as columns of a matrix U), where  q ≤ min(n, m), such that the new feature measurements  y = U⊤x(who are the result of linear combinations  u⊤1 x, ..., u⊤q xof the original feature measurements x) have certain desirable properties.

The idea property to seek from the new coordinates y is statistical independence, i.e.,  P(y1, .., yq) = P(y1)···P(yq) which would mean that we have removed the redundancy of the original data x in the best possible manner. This goal, however, is too much to ask from a linear transformation and instead we would ask for a weaker property to hold: that the pairwise covariance  cov(yi, yj) = 0 vanishes, i.e., that the covariance matrix on the new coordinates is diagonal. A diagonal covariance insures some redundancy removal, but not as good as statistical independence. However, when the data is Normally distributed  P(x) ∼ N(µ,Σ) with mean  µand covariance Σ, then the transformation which diagonalizes the covariance matrix also guarantees statistical independence. Among all transformations that de-correlate the data we will seek the one that maximizes the spread (variance) of the sample data after being projected onto the new axes vectors.

image

The property we would like to maximize is that the projection of the sample data on the new axes is as spread as possible. To start this analysis, assume q = 1, i.e., the n components of the input vector x are reduced to a single output component  y = u⊤x. We are looking for a single vector  u ∈ Rnwhose direction maximizes the variance of the output component y.

Formally, we are looking for a unit vector u which maximizes �i(u⊤xi)2(see Appendix A for basic statistical definitions and note that E[y] = 0 because �i u⊤xi = u⊤i (�i xi) = 0). In other words, the projected points onto the axis represented by the vector u are as spread as possible (in a least squares sense). In vector notation, the optimization problem takes the following form:

image

The Lagrangian of the problem is:

image

By taking the partial derivative  ∂L/∂u= 0 we obtain the following necessary condition (see Appendix B):

image

which tells us that u is an eigenvector of the  n × n(symmetric and positive definite) matrix  AA⊤. There are n eigenvectors associated with  AA⊤ andwe can easily convince ourselves that we are looking for the one associated with the maximal eigenvalue: substitute  λuinstead of  AA⊤uin the criterion function  u⊤AA⊤u to obtain λ(u⊤u) = λand since the eigenvalues must be positive (since  AA⊤ is positive definite), then the optimum is obtained for the maximal eigenvalue. The leading eigenvector  u of AA⊤ is called the first principal axis of the data sample represented by the columns of the matrix A, and y = u⊤xis called the first principal component of the data sample.

For convenience, we denote  u1 = u and λ1 = λas the leading eigenvector and eigenvalue of  AA⊤. Next, we look for  y2 = u⊤2 x which is uncorrelatedwith  y1 = u⊤1 xand which has maximum variance (and so on for  u3, ..., uq).Two random variables are uncorrelated if their covariance vanishes. By definition of covariance (see Appendix A) we obtain:

image

We can therefore use the condition  u⊤1 u2= 0 to specify zero correlation between  y1, y2. The functional to be optimized becomes:

image

with the Lagrangian being:

image

By taking the partial derivative with respect to  u2we obtain the necessary condition:

image

Multiply the equation by  u1from the left:

image

and noting from above that  u⊤1 AA⊤u2 = u⊤1 u2= 0 we obtain  δ = 0. As aresult we obtain:

image

so once more we have that  λ, u2form an eigenvalue/eigenvector pair of  AA⊤.As before,  λshould be as large as possible from the remaining spectral decomposition. By induction, it can be shown that the remaining principal vectors  u3, ..., uqare the decreasing order eigenvactors of  AA⊤ and the variance of the i’th principal component  yi = u⊤i x is λi.

Taken together, the PCA is the solution of the following optimization problem:

image

It will be useful for later to write the optimization function in a more concise manner as follows. Let  U be the n × qmatrix whose columns are  ui andD = diag(λ1, ..., λq) is an q×qdiagonal matrix and  λ1 ≥ λ2 ≥ ... ≥ λq. Thenfrom above we have that  U⊤U = I and AA⊤U = UD. Using the fact that

image

trace(B) we can convert �i ∥u⊤i A∥2 to trace(U⊤AA⊤U) as follows:

image

Thus, PCA becomes the solution of the following optimization function:

image

The solution, as saw above, is that  U = [u1, ..., uq] consists of the decreasing order eigenvectors of  AA⊤. At the optimum,  trace(U⊤AA⊤U) is equal to trace(D) which is equal to the sum of eigenvalues  λ1 + ... + λq.

It is worthwhile noting that when  q = n, UU⊤ = U⊤U = I, and the PCA transform is a change of basis in  Rn known as Karhunen-Loeve transform.

To conclude, the PCA transform looks for q orthogonal direction vectors (called the principal axes) such that the projection of input sample vectors onto the principal directions has the maximal spread, or equivalently that the variance of the output coordinates  y = U⊤xis maximal. The principal directions are the leading (with respect to descending eigenvalues) q eigenvectors of the matrix  AA⊤. When q = n, the principal directions form a basis of  Rn with the property of de-correlating the data and maximizing the variance of the coordinates of the sample input vectors.

image

In the previous section we saw that PCA generates a new coordinate system y = U⊤xwhere the coordinates  y1, ..., yq of xin the new system are uncorrelated. This means that the covariance matrix over the principle components should be diagonal. In this section we will explore this perspective in more detail.

The covariance matrix  Σxof the sample data  x1, ..., xmwith zero mean is

image

therefore the matrix  AA⊤ we derived above is a scaled version of the covariance of the sample data (see Appendix A). The scale factor 1/m was unimportant in the process above because the eigenvectors are of unit norm, thus any scale of  AA⊤ would produce the same set of eigenvectors.

The off-diagonal entries of the covariance matrix  Σxrepresent the correlation (a measure of statistical dependence) between the i’th and j’th component vectors, i.e., the entries of the input vectors x. The existence of correlations among the components (features) of the input signal is a sign of redundancy, therefore from the point of view of transforming the input representation into one which is less redundant, we would like to find a transformation  y = U⊤xwith an output representation y which is associated with a diagonal covariance matrix  Σy, i.e., the components of y are uncorrelated.

Formally,  Σy = (1/m) �i yiy⊤i = (1/m)U⊤AA⊤U, therefore we wish to find an  n × qmatrix for which  U⊤AA⊤Uis diagonal. If in addition, we would require that the variance of the output coordinates is maximized, i.e.,  trace(U⊤AA⊤U) is maximal (but then we need to constrain the length of the column vectors of U, i.e., set  ∥ui∥= 1) then we would get a unique solution for U where the columns are orthonormal and are defined as the first q eigenvectors of the covariance matrix  Σx. This is exactly the optimization problem defined by eqn. (5.1).

We see therefore that PCA “decorrelates” the input data. Decorrelation and statistical independence are not the same thing. If the coordinates are statistically independent then the covariance matrix is diagonal†, but it does not follow that uncorrelated variables must be statistically independent — covariance is just one measure of dependence. In fact, the covariance is a measure of pairwise dependency only. However, it is a fact that uncorrelated variables are statistically independent if they have a multivariate normal distribution (a Gaussian). In other words, if the sample data x are drawn from a probability distribution p(x) which has Gaussian form, the PCA transforms the sample data into a statistically independent set of variables y = U⊤x. The details are explained below.

Recall that a multivariate normal distribution of the random variables x = (x1, ..., xn)⊤ is defined as  p(x) ≈ N(µ, Σ):

image

Also recall that a linear combination of the variables produces also a normal distribution  N(U⊤µ, U⊤ΣU):

Σy =�

image

therefore choose  U such that Σy = U⊤ΣUis a diagonal matrix  Σy =diag(σ21, ..., σ2n). We have in that case:

image

which can be written as a product of univariate normal distributions  pxi(xi):

image

which proves the assertion that decorrelated normally distributed variables are statistically independent.

image

A different, yet equivalent, perspective on the PCA transformation is as an optimal reconstruction (in a least squares sense) after a dimension reduction. We are given a sample data as before  x1, ..., xmand we are looking for a small number of orthonormal principal vectors  u1, ..., uq where q < min(n, k)which define a q-dimensional linear subspace of  Rn which bestapproximate the original input vectors in a least squares sense. In other words, the projection ˆxiof the sample points  xionto the q-dimensional subspace should minimize �i ∥xi − ˆxi∥2over all possible q-dimensional subspaces of  Rn.

Let U be the subspace spanned by the principal vectors (columns of U) and let  P be the n × nprojection matrix mapping a point  x ∈ Rn onto itsprojection ˆx ∈ U. From the definition of projection, the vector  x − ˆx mustbe orthogonal to the subspace  U. Let y = (y1, ..., yq) be the coordinates of ˆx with respect to the principal vectors, i.e.,  Uy = ˆx. Then, from orthogonality we have that (x − Uy)⊤Uw= 0 for all vectors  w ∈ Rn. Since this is true for all  w then U⊤Uy − U⊤x= 0. Therefore,  y = (U⊤U)−1U⊤x and as aresult the projection matrix P becomes:

image

satisfying  Px = ˆx. In the case the columns of U are orthonormal,  U⊤U = I,we have  P = UU⊤. We are ready now to describe the optimization problem on U: we wish to find an orthonormal set of principal vectors,  U⊤U = I,such that �i ∥xi − UU⊤xi∥2is minimized.

Note that �i ∥xi − UU⊤xi∥2 = ∥A − UU⊤A∥2F where ∥B∥2F = �i,j b2ijis the square Frobenious norm of a matrix. The optimal reconstruction problem therefore becomes:

image

We will show now that:

image

which shows that the optimal reconstruction problem is solved by PCA (recall Eqn. 5.1). From the identity  ∥B∥2F = trace(BB⊤), we have:

image

Expanding the right hand side gives us:

trace((A  − UU⊤A)(A − UU⊤A)⊤) = trace(AA⊤) − trace(AA⊤UU⊤)

image

The second and third term are equal (commutativity of trace) and is also equal to the 4th term due to commutativity of the trace and  U⊤U = I.Taken together:

image

To conclude, we have proven that by taking the first q eigenvectors of  AA⊤we obtain a linear subspace which is as close as possible (in a least squares sense) to the original sample data. Hence, PCA can be viewed as a vehicle for optimal reconstruction after dimension reduction. The optimization problem whose solution is the leading q eigenvectors of  AA⊤ is described in eqn. 5.1:

image

Consider the situation where n, the dimension of the input vectors, is relatively large compared to the number of sample vectors m. For example, consider input vectors representing 50  ×50 sized images of faces, i.e., n = 2500, where m = 100. In other words, we are looking for a small number of “face templates” (known as “eigenfaces”) which approximate well the original set of 100 face images. In this case,  AA⊤ is very large, 2500×2500, yet the number of non-vanishing eigenvalues cannot be higher than 100. Given that the eigendecomposition process is  O(25003), the computational burden would be very high. However, it is possible to perform an eigendecomposition on A⊤A (a 100 ×100 matrix) instead, as shown next.

Let the columns of Q be the first q < m eigenvectors of  A⊤A, i.e., A⊤AQ =QD where D is diagonal containing the corresponding eigenvalues. After pre-multiplying both sides by A we obtain:

image

from which we conclude that AQ contains the first q eigenvectors (but unnormalized) of  AA⊤. We have therefore that  U = AQD− 12 because:

image

where we used the fact that  Q⊤A⊤AQ = D. Note that eigenvalues of  A⊤Aand  AA⊤ are the same (because  AA⊤(AQD− 12 ) = (AQD− 12 )D).

image

We can take the case n >> m described in the previous section one step further and consider such large values of n which are practically uncomputable — a situation which results when mapping the original input vectors to a high dimensional space:  φ(x) where φ : Rn → F for which dim(F) >> n.For example,  φ(x) representing the d’th order monomials of the coordinates of  x, i.e., dim(F) = �n+d−1d �which is exponential in d. The mappings of interest are those which are paired with a non-linear kernel function: k(x, x′) = φ(x)⊤φ(x′).

Performing PCA on  A = [φ(x1), ..., φ(xm)] is equivalent to finding the

non-linear surface in  Rn (the nature of the non-linearity depends on the choice of  φ()) which best approximates the original sample data  x1, ..., xm.The problem is that  AA⊤ is not computable — however  A⊤Ais computable because (A⊤A)ij = k(xi, xj).

From the previous section,  U = AQD− 12 = AVcontains the first q eigenvectors of  AA⊤(where Q and Dare computable). Since A itself is not computable we cannot represent U explicitly, but we can project a new vector  φ(x) onto the principal directions  u1, ..., uqand obtain the principal components, i.e., the output vector  y = U⊤φ(x), as follows.

image

Given the principal components (entries of  y = U⊤φ(x) of φ(x)) we canmeasure, for example, the  distance between φ(x) and the projection ˆφ(x) =UU⊤φ(x) = Uyonto the linear subspace spanned by  u1, ..., uq(without the need to explicitly compute the principal axes  ui), as follows.

image

We now extend the variance preserving approach for data representation for labeled data sets. We will focus on 2-class sets and look for a separating hyperplane:

image

such that x belongs to the first class if f(x) > 0 and x belongs to the second class if f(x) < 0. In the statistical literature this type of function is called a linear discriminant function. The decision boundary is given by the set of points satisfying f(x) = 0 which is a hyperplane. Fisher’s (1936) Linear Discriminant Analysis (LDA) is a variance preserving approach for finding a linear discriminant function.

image

Fig. 5.1. Linear discriminant analysis based on class centers alone is not sufficient. Seeking a projection which maximizes the distance between the projected centers will prefer the horizontal axis over the vertical, yet the two classes overlap on the horizontal axis. The projected distance along the vertical axis is smaller yet the classes are better separated. The conclusion is that the sample variance of the two classes must be taken into consideration as well.

We will then introduce another popular statistical technique called Canonical Correlation Analysis (CCA) for learning the mapping between input and output vectors using the notion ”angle” between subspaces.

What is common in the three techniques PCA, LDA and CCA is the use of spectral matrix analysis — i.e., what can you do with eigenvalues and eigenvectors of matrices representing subspaces of the data? These techniques produce optimal results for normally distributed data and are very easy to implement. There is a large variety of uses of spectral analysis in statistical and learning literature including spectral clustering, Multi Dimensional Scaling (MDS) and data modeling in general.

To appreciate the general idea behind Fisher’s LDA consider Fig. 5.1. Let the centers of classes one and two be denoted by  µ1 and µ2respectively. A linear discriminant function is a projection onto a 1D subspace such that the classes would be separated the most in the 1D subspace. The obvious first step in this kind of analysis is to make sure that the projected centers ˆµ1, ˆµ2would be separated as much as possible. We can easily see that the direction of the 1D subspace should be proportional to  µ1 − µ2as follows:

image

The right-hand term is maximized when  w ≈ µ1 − µ2. As illustrated in Fig. 5.1, this type of consideration is not sufficient to capture separability in the projected subspace because the spread (variance) of the data points around their centers also play an important role. For example, the horizontal axis in the figure separates the centers better than the vertical axis but on the other hand does a worse job in separating the classes themselves because of the way the data points are spread around their centers. The argument in favor of separating the centers would work if the data points were living in a hyper-sphere around the centers, but will not be sufficient otherwise.

The basic idea behind Fisher’s LDA is to consider the sample covariance matrix of the individual classes as well as their centers, in the following way. The optimal 1D projection would that which maximizes the variance of the projected centers while minimizes the variance of the projected data points of each class separately. Mathematically, this idea can be implemented by maximizes the following ratio:

image

and likewise,

image

where ˆx = w⊤∥w∥xi + b.

We will now formalize this approach and derive its solution. We will begin with a general description of a multiclass problem where the sample data points belong to q different classes, and later focus on the case of q = 2.

image

Let the sample data points S be members of  q classes C1, ..., Cq where thenumber of points belonging to class  Ciis denoted by  liand the total number of the training set is  l = �i li. Let µjdenote the center of class  Ci and µdenote the center of the complete training set S:

image

Let  Ajbe the matrix associated with class  Cjwhose columns consists of the mean shifted data points:

image

Then, 1lj AjA⊤j is the covariance matrix associated with class  Cj. Let Sw(where ”w” stands for ”within”) be the sum of the class covariance matrices:

image

From the discussion in the previous section, it is 1∥w∥2 w⊤Sww which wewish to minimize. To see why this is so, note

image

Let B be the matrix holding the class centers:

image

and let  Sb = 1qBB⊤(where ”b” stands for ”between”). From the discussion above it is 1∥w∥2 w⊤Sbw = �i(ˆµi − ˆµ)2which we wish to maximize. Taken together, we wish to maximize the ratio (called ”Rayleigh’s quotient”):

image

The necessary condition for optimality is:

image

From which we obtain the generalized eigensystem:

image

That is, w is the leading eigenvector of  S−1w Sb (assuming Sw is invertible). The general case of finding q such axes involves finding the leading generalized eigenvectors of (Sb, Sw) — the derivation is out of scope of this lecture. Note that since  S−1w Sb is not symmetric there may be no real-value solution, which is a complication will not pursue further in this course. Instead we will focus now on the 2-class (q = 2) setting below.

image

The general derivation is simplified when there are only two classes. The covariance matrix  BB⊤ becomes a rank-1 matrix:

image

As a result,  BB⊤wis a vector in direction  µ1 − µ2. Therefore, the solution for w from eqn. 5.2 is:

image

The decision boundary  w⊤(x − µ) = 0 becomes:

image

This decision boundary will surface again in the course when we consider Bayseian inference. It will be shown that this decision boundary is the Maximum Likelihood solution in the case where the two classes are normally distributed with means  µ1, µ2and with the same covariance matrix  Sw.

image

Both LDA and SVM search for a so called ”optimal” linear discriminant function, what is the difference? The heart of the matter lies in the definition of what constitutes a sufficient compact representation of the data. In LDA the assumption is that each class can be represented by its mean vector and its spread (i.e., covariance matrix). This is true for normally distributed data — but not true in general. This means that we should expect that LDA will produce the optimal discriminant linear function when each of the classes are normally distributed.

With SVM, on the other hand, there is no assumption on how the data is distributed. Instead, the emerging result is that the data is represented by the subset of data points which lie on the boundary between the two classes (the so called support vectors). Rather than making a parametric assumption on how the data can be captured (i.e., mean and covariance) the theory shows that the data can be captured by a special subset of points. The tools, as a result, are naturally more complex (quadratic linear programming versus spectral matrix analysis) — but the advantage is that optimality is guaranteed without making assumptions on the distribution of the data (i.e., distribution free). It can be shown that SVM and LDA would produce the same result if the class data is normally distributed.

image

CCA is a technique for learning a mapping  f(x) = y where x ∈ Rk andy ∈ Rs using the notion of subspace similarity (an extension of the inner product between two vectors) from a training set of (xi, yi), i = 1, ..., n. Sucha mapping, where y can be any point in  Rk as opposed to a discrete set of labels, is often referred to as a ”regression” (as opposed to ”classification”).

Like in PCA and LDA, the approach would be to look for projection axes such that the projection of the input and output vectors on those axes satisfy certain requirements — and like PCA and LDA the tools we would be using is matrix spectral analysis.

It will be convenient to stack our vectors as rows of an input matrix A and output matrix  B. Let A be an n × kmatrix whose rows are  x⊤1 , ..., x⊤n andB is the n × smatrix whose rows are  y⊤1 , ..., y⊤n . Consider vectors  u ∈ Rkand  v ∈ Rs and project the input and output data onto them producing Au = (x⊤1 u, ..., x⊤n u) and Bv. The requirement we would like to place on the projection axes is that  Au ≈ Bv, or in other words that (Au)⊤(Bv)is maximal. The requirement therefore is that the projection of the input points onto the u axis is similar to the projection of the output points onto the v axis. If we extend this notion to multiple axes  u1, ..., uq (notnecessarily orthogonal) and  v1, ..., vq where q ≤ min(k, s) our requirement becomes that the new coordinates of the input points projected onto the subspace spanned by the u vectors are similar to the new coordinates of the output points projected onto the subspace spanned by the v vectors. In other words, we wish to find two q-dimensional subspaces one of  Rk andthe other of  Rs such that the two sets of projected points are as aligned as possible.

CCA goes a step further and makes the assumption that the input/output relationship is solely determined by the relation (angles) between the column spaces of A, B. In other words, the particular columns of A are not really important, what is important is the space  UAspanned by the columns. Since g = Au is a point in  UA(a linear combination of the columns of A) and h = Bv is a point in  UB, then g⊤his the cosine angle, cos(φ)between the two axes provided that we normalize the vectors g and h. If we continue this line of reasoning recursively, we obtain a set of angles 0  ≤ θ1 ≤ ... ≤ θq ≤ (π/2), called ”principal angles”, between the two subspaces uniquely defined as:

image

subject to:

image

As a result, we obtain the following optimization function over axes u, v:

image

To solve this problem we first perform a ”QR” factorization of A and B. A ”QR” factorization of a matrix A is a Grahm-Schmidt process resulting in an orthonormal set of vectors arranged as the columns of a matrix  QA whosecolumn space is equal to the column space of A, and a matrix  RA whichcontains the coefficients of the linear combination of the columns of  QA suchthat  A = QARA. Since orthoganilzation is not unique, the Grahm-Schmidt process perfroms the orthogonalization such that  RAis an upper-diagonal matrix. Likewise let  B = QBRB. Because the column spaces of  A and QAare the same, then for every u there exists a ˆu such that Au = QAˆu. Ouroptimization problem now becomes:

image

The solution of this problem is when ˆu and ˆvare the leading singular vectors of  Q⊤AQB. The singular value decomposition (SVD) of any matrix E is a decomposition  E = UDV ⊤ where the columns of U are the leading eigenvectors of  EE⊤, the rows of  V ⊤ are the leading eigenvectors of  E⊤E andD is a diagonal matrix whose entries are the corresponding square eigenvalues (note that the eigenvalues of  EE⊤ and E⊤Eare the same). The SVD decomposition has the property that if we keep only the first q leading eigenvectors then  UDV ⊤ is the closest (in least squares sense) rank q matrix to E.

Therefore, let ˆUD ˆV ⊤ be the SVD of  Q⊤AQBusing the first q eigenvectors. Then, our sought after axes  U = [u1, ..., uq] is simply  R−1A ˆUand likewise and the axes  V = [v1, ..., vq] is equal to  R−1B ˆV. The axes are called ”canon- ical vectors”, and the vectors  gi = Aui(mutually orthogonal) are called ”variates”. The concept of principal angles is due to Jordan in 1875, where Hotelling in 1936 is the first to introduce the recursive definition above.

Given a new vector  x ∈ Rk the resulting vector y can be found by solving the linear system  U⊤x = V ⊤y(since our assumption is that in the new basis the coordinates of x and y are similar).

To conclude, the relationship between A and B is captured by creating similar variates, i.e., creating subspaces of dimension q such that the projections of the input vectors and the output vectors have similar coordinates. The process for obtaining the two q-dimensional subspaces is by performing a QR factorization of A and B followed by an SVD. Here again the spectral analysis of the input and output data matrices plays a pivoting role in the input/output association.

image

In the previous lecture we ended up with the formulation:

image

and showed the solution G is the leading eigenvectors of the symmetric positive semi definite matrix  K. When K = AA⊤ (sample covariance matrix) with  A = [x1, ..., xm], xi ∈ Rn, those eigenvectors form a basis to a k-dimensional subspace of  Rn which is the closest (in  L2norm sense) to the sample points  xi. The axes (called principal axes)  g1, ..., gkpreserve the variance of the original data in the sense that the projection of the data points on the  g1has maximum variance, projection on  g2has the maximum variance over all vectors orthogonal to  g1, etc. The spectral decomposition of the sample covariance matrix is a way to ”compress” the data by means of linear super-position of the original coordinates  y = G⊤x.

We also ended with a ratio formulation:

image

where  S1, S2where scatter matrices defined such that  w⊤S1wis the variance of class centers (which we wish to maximize) and  w⊤S2wis the sum of within class variance (which we want to minimize). The solution w is the generalized eigenvector  S1w = λS2wwith maximal  λ.

In this lecture we will show additional applications where the search for leading eigenvectors plays a pivotal part of the solution. So far we have seen how spectral analysis relates to PCA and LDA and today we will focus on the classic Data Clustering problem of partitioning a set of points x1, ..., xm into k ≥2 classes, i.e., generating as output indicator variables y1, ..., ym where yi ∈ {1, ..., k}. We will begin with ”K-means” algorithm for clustering and then move on to show how the optimization criteria relates to grapth-theoretic approaches (like Min-Cut, Ratio-Cut, Normalized Cuts) and spectral decomposition.

image

The K-means formulation (originally introduced by [4]) assumes that the clusters are defined by the distance of the points to their class centers only. In other words, the goal of clustering is to find those k mean vectors  c1, ..., ckand provide the cluster assignment  yi ∈ {1, ..., k}of each point  xiin the set. The K-means algorithm is based on an interleaving approach where the cluster assignments  yiare established given the centers and the centers are computed given the assignments. The optimization criterion is as follows:

image

Assume that  c1, ..., ckare given from the previous iteration, then

image

and next assume that  y1, .., ym(cluster assignments) are given, then for any set  S ⊆ {1, ..., m}we have that

image

In other words, given the estimated centers in the current round, the new assignments are computed by the closest center to each point  xi, and thengiven the updated assignments the new centers are estimated by taking the mean of each cluster. Since each step is guaranteed to reduce the optimization energy the process must converge — to some local optimum.

The drawback of the K-means algorithm is that the quality of the local optimum strongly depends on the initial guess (either the centers or the assignments). If we start with a wild guess for the centers it would be fairly unlikely that the process would converge to a good local minimum (i.e. one that is close to the global optimum). An alternative approach would be to define an approximate but simpler problem which has a closed form solution (such as obtained by computing eigenvectors of some matrix). The global optimum of the K-means is an NP-Complete problem (mentioned briefly in the next section).

Next, we will rewrite the K-means optimization criterion in matrix form and see that it relates to the spectral formulation (eqn. 6.1).

image

We rewrite eqn. 6.2 as follows [7]. Instead of carrying the class variables  yiwe define class sets  ψ1, ..., ψk where ψi ⊂ {1, ..., n} with � ψj = {1, ..., n}and  ψi� ψj = ∅. The K-means optimization criterion seeks for the centers and the class sets:

image

Let  lj = |ψj|and following the expansion of the squared norm and dropping x⊤i xiwe end up with an equivalent problem:

image

Next we substitute  cjwith its definition: (1/lj) �i∈ψj xjand obtain a new equivalent formulation where the centers  cjare eliminated form consideration:

image

which is more conveniently written as a maximization problem:

image

Since the resulting formulation involves only inner-products we could have replaced  xi with φ(xi) in eqn. 6.2 where the mapping  φ(·) is chosen such that  φ(xi)⊤φ(xj) can be replaced by some non-linear function  κ(xi, xj) —known as the ”kernel trick” (discussed in previous lectures). Having the ability to map the input vectors onto some high-dimensional space before K-means is applied provides more flexibility and increases our chances of getting out a ”good” clustering from the global K-means solution (again, the local optimum depends on the initial conditions so it could be ”bad”). The RBF kernel is quite popular in this context  κ(xi, xj) = e−∥xi−xj∥2/σ2with  σsome pre-determined parameter. Note that  κ(xi, xj) ∈ (0, 1] whichcan be interpreted loosely as the probability of  xi and xjto be clustered together.

Let  Kij = κ(xi, xj) making K a m × msymmetric positive-semi-definite matrix often referred to as the ”affinity” matrix. Let  F be an n × n matrixwhose entries are  Fij = 1/lr if (i, j) ∈ ψrfor some class  ψr and Fij = 0otherwise. In other words, if we sort the points  xiaccording to cluster membership, then F is a block diagonal matrix with blocks  F1, ..., Fk whereFr = (1/lr)11⊤ is an lr × lrblock of 1’s scaled by 1/lr. Then, Eqn. 6.3 can be written in terms of K as follows:

image

In order to form this as an optimization problem we need to represent the structure of F in terms of constraints. Let  G be an n × kcolumn-scaled indicator matrix:  Gij = (1/�lj) if i ∈ ψj (i.e., xibelongs to the j’th class) and  Gij= 0 otherwise. Let  g1, ..., gkbe the columns of G and it can be easily verified that  grgr⊤ = diag(0, .., Fr, 0, ..,0) therefore  F = �j gjg⊤j = GG⊤.Since trace(AB) = trace(BA) we can now write eqn. 6.4 in terms of G:

image

under conditions on G which we need to further spell out.

We will start with the necessary conditions. Clearly  G ≥0 (has non-negative entries). Because each point belongs to exactly one cluster we must have  G⊤Gij = 0 when i ̸= j and G⊤Gii = (1/li)1⊤1 = 1, thus G⊤G = I.Furthermore we have that the rows and columns of  F = GG⊤ sum up to1, i.e.,  F1 = 1, F ⊤1 = 1which means that F is doubly stochastic which translates to the constraint  GG⊤1 = 1 on G.We have therefore three necessary conditions on  G: (i) G ≥ 0, (ii) G⊤G = I, and (iii)  GG⊤1 = 1.The claim below asserts that these are also sufficient conditions:

Claim 4 The feasibility set of matrices G which satisfy the three conditions G ≥ 0, GG⊤1 = 1 and G⊤G = Iare of the form:

image

Proof: From G ≥ 0 and g⊤r gs = 0 we have that  GirGis = 0, i.e., Ghas a single non-vanishing element in each row. It will be convenient to assume that the points are sorted according to the class membership, thus the columns of G have the non-vanishing entries in consecutive order and let  ljbe the number of non-vanishing entries in column  gj. Let uj thevector of  ljentries holding only the non-vanishing entries of  gj. Then,the doubly stochastic constraint  GG⊤1 = 1results that (1⊤uj)uj = 1 forj = 1, ..., k. Multiplying 1 from both sides yields (1⊤uj)2 = 1⊤1 = lj,therefore  uj = (1/�lj)1.

This completes the equivalence between the matrix formulation:

image

and the original K-means formulation of eqn. 6.2.

We have obtained the same optimization criteria as eqn. 6.1 with additional two constraints: G should be non-negative and  GG⊤ should be doubly stochastic. The constraint  G⊤G = Icomes from the requirement that each point is assigned to one class only. The doubly stochastic constraint comes from a ”class balancing” requirement which we will expand on below.

image

We will arrive to eqn. 6.5 from a graph-theoretic perspective. We start with representing the graph Min-Cut problem in matrix form, as follows. A convenient way to represent the data to be clustered is by an undirected graph with edge-weights where V = {1, ..., m} is the vertex set,  E ⊂ V × Vis the edge set and  κ : E → R+is the positive weight function. Vertices of the graph correspond to data points  xi, edges represent neighborhood relationships, and edge-weights represent the similarity (affinity) between pairs of linked vertices. The weight adjacency matrix K holds the weights where  Kij = κ(i, j) for (i, j) ∈ E and Kij= 0 otherwise.

A cut in the graph is defined between two disjoint sets  A, B ⊂ V , A ∪B = V , is the sum of edge-weights connecting the two sets: cut(A, B) = �i∈A,j∈B Kijwhich is a measure of dissimilarity between the two sets. The Min-Cut problem is to find a minimal weight cut in the graph (can be solved in polynomial time through Max Network Flow solution). The following claim associates algebraic conditions on G with an indicator matrix:

Claim 5 The feasibility set of matrices G which satisfy the three conditions G ≥ 0, G1 = 1 and G⊤G = Dfor some diagonal matrix D are of the form:

image

Proof: Let G = [g1, ..., gk]. From G ≥ 0 and g⊤r gs = 0 we have that GirGis= 0, i.e., G has a single non-vanishing element in each row. From G1 = 1 the single non-vanishing entry of each row must have the value of 1.

In the case of two classes (k = 2), the function  tr(G⊤KG) is equal to �(i,j)∈ψ1 Kij + �(i,j)∈ψ2 Kij. Therefore maxG tr(G⊤KG) is equivalent to minimizing the cut: �i∈ψ1,j∈ψ2 Kij. As a result, the Min-Cut problem is equivalent to solving the optimization problem:

image

We seem to be close to eqn. 6.5 with the difference that G is orthogonal (instead of orthonormal) and the doubly-stochasitc constraint is replaced by G1 = 1. The difference can be bridged by considering a ”balancing” requirement. Min-Cut can produce an unbalanced partition where one set of vertices is very large and the other contains a spurious set of vertices having a small number of edges to the larger set. This is an undesirable outcome in the context of clustering. Consider a ”balancing” constraint G⊤1 = (m/k)1which makes a strict requirement that all the k clusters have an equal number of points. We can relax the balancing constraint slightly by combining the balancing constraint with G1 = 1 into one single constraint GG⊤1 = (m/k)1, i.e., GG⊤ is scaled doubly stochastic. Note that the two conditions  GG⊤1 = (m/k)1 and G⊤G = D result in D = (m/k)I. Thus wepropose the relaxed-balanced hard clustering scheme:

image

The scale m/k is a global scale that can be dropped without affecting the resulting solution, thus the Min-Cut with a relaxed balancing requirement becomes eqn. 6.5 which we saw is equivalent to K-means:

image

We saw above that the doubly-stochastic constraint has to do with a ”bal- ancing” desire. A further relaxation of the balancing desire is to perform the optimization in two steps: (i) replace the affinity matrix K with the closest (under some chosen error measure) doubly-stochastic matrix  K′, (ii) find a solution to the problem:

image

because  GG⊤ should come out close to  K′ (tr(G⊤K′G) = tr(K′GG⊤))and  K′ is doubly-stochastic, then  GG⊤ should come out close to satisfying a doubly-stochastic constraint — this is the motivation behind the 2-step approach. Moreover, we drop the non-negativity constraint  G ≥ 0. Notethat the non-negativity constraint is crucial for the physical interpretation of G; nevertheless, for k = 2 clusters it is possible to make an interpretation, as we shall next. As a result we are left with a spectral decomposition problem of eqn. 6.1:

image

where the columns of G are the leading eigenvectors of  K′. We will refer to the first step as a ”normalization” process and there are two popular normalizations in the literature — one leading to Ratio-Cuts and the other to Normalized-Cuts.

image

Let D = diag(K1) which is a diagonal matrix containing the row sums of K. The Ratio-Cuts normalization is to look for  K′ as the closest doubly-stochastic matrix to K by minimizing the  L1norm — this turns out to be K′ = K − D + I.

Claim 6 (ratio-cut) Let K be a symmetric positive-semi-definite whose values are in the range [0, 1]. The closest doubly stochastic matrix  K′ underthe  L1error norm is

image

∥(K − F)1∥1for any matrix F, we must have:

image

Let  F = K − D + I, then

image

The Laplacian matrix of a graph is  D − K. If vis an eigenvector of the Laplacian  D − Kwith eigenvalue  λ, then vis also an eigenvector of  K′ =K − D + Iwith eigenvalue 1  − λand since (D − K)1= 0 then the smallest eigenvector v = 1 of the Laplacian is the largest of  K′, and the second smallest eigenvector of the Laplacian (the ratio-cut result) corresponds to the second largest eigenvector of  K′. Because the eigenvectors are orthogonal, the second eigenvector must have positive and negative entries (because the inner-product with 1 is zero) — thus the sign of the entries of the second eigenvector determines the class membership.

Ratio-Cuts, the second smallest eigenvector of the Laplacian  D − K, isan approximation due to Hall in the 70s [2] to the Min-Cut formulation. Let  z ∈ Rm determine the class membership such that  xi and xj would beclustered together if  zi and zjhave similar values. This leads to the following optimization problem:

image

The criterion function is equal to (1/2)z⊤(D−K)zand the derivative of the Lagrangian (1/2)z⊤(D − K)z − λ(z⊤z −1) with respect to z gives rise to the necessary condition (D − K)z = λzand the Ratio-Cut scheme follows.

image

Normalized-Cuts looks for the closest doubly-stochastic matrix  K′ in relativeentropy error measure defined as:

image

We will encounter the relative entropy measure in more detail later in the course. We can show that  K′ must have the form ΛKΛ for some diagonal matrix Λ:

Claim 7 The closest doubly-stochastic matrix F under the relative-entropy error measure to a given non-negative symmetric matrix K, i.e., which minimizes:

image

has the form  F = ΛKΛfor some (unique) diagonal matrix Λ.

Proof: The Lagrangian of the problem is:

image

The derivative with respect to  fij is:

image

from which we obtain:

image

Let D1 = diag(eλ1, ..., eλn) and D2 = diag(eµ1, ..., eµn), then we have:

image

Since  F = F ⊤ and Kis symmetric we must have  D1 = D2.

Next, we can show that the diagonal matrix Λ can found by an iterative process where K is replaced by  D−1/2KD−1/2 where Dwas defined above as diag(K1):

Claim 8 For any non-negative symmetric matrix  K(0), iterating the process K(t+1) ← D−1/2K(t)D−1/2 with D = diag(K(t)1)converges to a doubly stochastic matrix.

The proof is based on showing that the permanent increases monotonically, i.e.  perm(K(t+1)) ≥ perm(K(t)). Because the permanent is bounded the process must converge and if the permanent does not change (at the convergence point) the resulting matrix must be doubly stochastic. The resulting doubly stochastic matrix is the closest to K in relative-entropy.

Normalized-Cuts takes the result of the first iteration by replacing K with  K′ = D−1/2KD−1/2 followed by the spectral decomposition (in case of k = 2 classes the partitioning information is found in the second leading eigenvector of  K′ — just like Ratio-Cuts but with a different  K′). Thus, K′in this manner is not the closest doubly-stochastic matrix to K but is fairly close (the first iteration is the dominant one in the process).

Normalized-Cuts, as the second leading eigenvector of  K′ = D−1/2KD−1/2,is an approximation to a ”balanced” Min-Cut described first in [6]. Deriving it from first principles proceeds as follows:

Let  sum(V1, V2) = sumi∈V1,j∈V2Kijbe defined for any two subsets (not necessarily disjoint) of vertices. The normalized-cuts measures the cut cost as a fraction of the total edge connections to all the nodes in the graph:

image

A minimal Ncut partition will no longer favor small isolated points since the cut value would most likely be a large percentage of the total connections from that small set to all the other vertices. A related measure Nassoc(A, B) defined as:

image

reflects how tightly on average nodes within the group are connected to each other. Given that  cut(A, B) = sum(A, V )−sum(A, A) one can easily verify

that:

image

therefore the optimal bi-partition can be represented as maximizing  Nassoc(A, V −A). The Nassoc naturally extends to k > 2 classes (partitions) as follows: Let  ψ1, ..., ψkbe disjoint sets  ∪jψj = V , then:

image

We will now rewrite Nassoc in matrix form and establish equivalence to eqn. 6.7. Let ¯G = [g1, ..., gk] with gj = 1/�sum(ψj, V )(0, ..., 0, 1, ...1, 0., , , 0)with the 1s indicating membership to the j’th class. Note that

image

therefore  trace( ¯G⊤K ¯G) = Nassoc(ψ1, ..., ψk).Note also that  g⊤i Dgi =(1/sum(ψi, V )) �r∈ψi dr= 1, therefore ¯G⊤D ¯G = I. Let G = D1/2 ¯G so wehave that  G⊤G = I and trace(G⊤D−1/2KD−1/2G) = Nassoc(ψ1, ..., ψk).Taken together we have that maximizing Nassoc is equivalent to:

image

where  K′ = D−1/2KD−1/2. Note that this is exactly the K-means matrix setup of eqn. 6.5 where the doubly-stochastic constraint is relaxed into the replacement of  K by K′. The constraint  G ≥0 is then dropped and the resulting solution for G is the k leading eigenvectors of  K′.

We have arrived via seemingly different paths to eqn. 6.8 which after we drop the constraint  G ≥0 we end up with a closed form solution consisting of the k leading eigenvectors of  K′. When k= 2 (two classes) one can easily verify that the partitioning information is fully contained in the second eigenvector. Let  v1, v2be the first leading eigenvectors of  K′. Clearlyv = D1/21is an eigenvector with eigenvalue  λ = 1:

image

In fact  λ= 1 is the largest eigenvalue (left as an exercise) thus  v1 = D1/21 >0. Since  K′ is symmetric the  v⊤2 v1 = 0 thus v2contains positive and negative entries — those are interpreted as indicating class membership (positive to one class and negative to the other).

The case k > 2 is treated as an embedding (also known as Multi-Dimensional Scaling) by re-coordinating the points  xiusing the rows of G. In other

words, the i’th row of G is a representation of  xi in Rk. Under ideal con-ditions where K is block diagonal (the distance between clusters is infinity) the rows associated with points clustered together are identical (i.e., the n original points are mapped to  k points in Rk) [5]. In practice, one performs the iterative K-means in the embedded space.

image

We have see so far algorithms that explicitly estimate the underlying distribution of the data (Bayesian methods and EM) and algorithms that are in some sense optimal when the underlying distribution is Gaussian (PCA, LDA). We have also encountered an algorithm (SVM) that made no assumptions on the underlying distribution and instead tied the accuracy to the margin of the training data.

In this lecture and in the remainder of the course we will address the issue of ”accuracy” and ”generalization” in a more formal manner. Because the learner receives only a finite training sample, the learning function can do very well on the training set yet perform badly on new input instances. What we would like to establish are certain guarantees on the accuracy of the learner measured over all the instance space and not only on the training set. We will then use those guarantees to better understand what the largemargin principle of SVM is doing in the context of generalization.

In the remainder of this lecture we will refer to the following notations: the class of learning functions is denoted by C. A learning functions is often referred to as a ”concept” or ”hypothesis”. A target function  ct ∈ C is afunction that has zero error on all input instances (such a function may not always exist).

image

In many learning situations of interest, we would like to assume that the learner receives m examples sampled by some fixed (yet unknown) distribution D and the learner must do its best with the training set in order to achieve the accuracy and confidence objectives. The Probably Approximate Correct (PAC) model, also known as the ”formal model”, first introduced by Valient in 1984, provides a probabilistic setting which formalizes the notions of accuracy and confidence.

The PAC model makes the following statistical assumption. We assume the learner receives a set  S of m instances x1, ..., xm ∈ Xwhich are sampled randomly and independently according to a distribution D over X. In other words, a random training set S of length m is distributed according to the product probability distribution  Dm. The distribution D is unknown, but we will see that one can obtain useful results by simply assuming that D is fixed — there is no need to attempt to recover D during the learning process. To recap, we make the following three assumptions: (i) D is unkown, (ii) D is fixed throughout the learning process, and (iii) the example instances are sampled independently of each other (are Identically and Independently Distributed — i.i.d.).

We distinguish between the ”realizable” case where a target concept  ct(x)is known to exist, and the unrealizable case, where there is no such guarantee. In the realizable case our training examples are  Z = {(xi, ct(xi)},i = 1, ..., m and D is defined over  X (since yi ∈ Yare given by  ct(xi)). Inthe unrealizable case,  Z = {(xi, yi)} and Dis the distribution over  X × Y(each element is a pair, one from X and the other from Y ).

We next define what is meant by the error induced by a concept function h(x). In the realizable case, given a function  h ∈ C, the error of h is defined with respect to the distribution D:

image

where ind(F) is an indication function which returns ’1’ if the proposition F is true and ’0’ otherwise. The function err(h) is the probability that an instance x sampled according to D will be labeled incorrectly by h(x). Let  ϵ >0 be a parameter given to the learner specifying the ”accuracy” of the learning process, i.e. we would like to achieve  err(h) ≤ ϵ. Note that err(ct) = 0.

In addition, we define a ”confidence” parameter  δ >0, also given to the learner, which defines the probability that  err(h) > ϵ, namely,

image

or equivalently:

image

In other words, the learner is supposed to meet some accuracy criteria but is allowed to deviate from it by some small probability. Finally, the learning algorithm is supposed to be ”efficient” if the running time is polynomial in 1/ϵ, ln(1/δ), nand the size of the concept target function  ct() (measured by the number of bits necessary for describing it, for example).

We will say that an algorithm L learns a concept family C in the formal sense (PAC learnable) if for any  ct ∈ Cand for every distribution D on the instance space X, the algorithm L generates efficiently a concept function h ∈ Csuch that the probability that  err(h) ≤ ϵis at least 1  − δ.

The inclusion of the confidence value  δcould seem at first unnatural. What we desire from the learner is to demonstrate a consistent performance regardless of the training sample Z. In other words, it is not enough that the learner produces a hypothesis h whose accuracy is above threshold, i.e., err(h) ≤ ϵ, for some training sample Z. We would like the accuracy performance to hold under all training samples (sampled from the distribution Dm) — since this requirement could be too difficult to satisfy, the formal model allows for some ”failures”, i.e, situations where  err(h) > ϵ, for sometraining samples Z, as long as those failures are rare and the frequency of their occurrence is controlled (the parameter  δ) and can be as small as we like.

In the unrealizable case, there may be no function  h ∈ C for whicherr(h) = 0, thus we need to define what we mean by the best a learning algorithm can achieve:

image

which is the best that can be done on the concept class C using functions that map between X and Y . Given the desired accuracy  ϵand confidence  δvalues the learner seeks a hypothesis  h ∈ Csuch that:

image

We are ready now to formalize the discussion above and introduce the defi-nition of the formal learning model (Anthony & Bartlett [1], pp. 16):

Definition 1 (Formal Model) Let C be the concept class of functions that map from a set X to Y . A learning algorithm L is a function:

image

from the set of all training examples to C with the following property: given any  ϵ, δ ∈ (0, 1)there is an integer  m0(ϵ, δ)such that if  m ≥ m0 then, forany probability distribution  D on X × Y , if Zis a training set of length m drawn randomly according to the product probability distribution  Dm, thenwith probability of at least 1  − δthe hypothesis  h = L(Z) ∈ C output byL is such that  err(h) ≤ Opt(C) + ϵ. We say that C is learnable (or PAC learnable) if there is a learning algorithm for C.

There are few points to emphasize. The sample size  m0(ϵ, δ) is a suf-ficient sample size for PAC learning C by L and is allowed to vary with ϵ, δ. Decreasing the value of either  ϵ or δmakes the learning problem more difficult and in turn a larger sample size is required. Note however that m0(ϵ, δ) does not dependon the distribution D! that is, a sufficient sample size can be given that will work for any distribution D — provided that D is fixed throughout the learning experience (both training and later for testing). This point is a crucial property of the formal model because if the sufficient sample size is allowed to vary with the distribution D then not only we would need to have some information about the distribution in order to set the sample complexity bounds, but also an adversary (supplying the training set) could control the rate of convergence of L to a solution (even if that solution can be proven to be optimal) and make it arbitrarily slow by suitable choice of D.

What makes the formal model work in a distribution-invariant manner is that it critically depends on the fact that in many interesting learning scenarios the concept class C is not too complex. For example, we will show later in the lecture that any finite concept class  |C| < ∞is learnable, and the sample complexity (in the realizable case) is

image

In the next lecture we will consider concept classes of infinite size and show that despite the fact that the class is infinite it still can be of low complexity!

Before we illustrate the concepts above with an example, there is another useful measure which is the empirical error (also known as the sample error) ˆerr(h) which is defined as the proportion of examples from Z on which h made a mistake:

image

(replace  ct(xi) with yifor the unrealizable case). The situation of bounding the true error err(h) by minimizing the sample error ˆerr(h) is very convenient — we will get to that later.

image

As an illustration of learnability we will consider the problem (introduced in Kearns & Vazirani [3]) of learning an axes-aligned rectangle from positive and negative examples. We will show that the problem is PAC-learnable and find out  m0(ϵ, δ).

In the rectangle learning game we are given a training set consisting of points in the 2D plane with a positive ’+’ or negative ’-’ label. The positive examples are sampled inside the target rectangle (parallel to the main axes) R and the negative examples are sampled outside of R. Given m examples sampled i.i.d according to some distribution D the learner is supposed to generate an approximate rectangle  R′ which is consistent with the training set (we are assuming that R exists) and which satisfies the accuracy and confidence constraints.

We first need to decide on a learning strategy. Since the solution  R′is not uniquely defined given any training set Z, we need to add further constraints to guarantee a unique solution. We will choose  R′ as the axes-aligned concept which gives the tightest fit to the positive examples, i.e., the smallest area axes-aligned rectangle which contains the positive examples. If no positive examples are given then  R′ = ∅. We can also assume that Z contains at least three non-collinear positive examples in order to avoid complications associated with infinitesimal area rectangles. Note that we could have chosen other strategies, such as the middle ground between the tightest fit to the positive examples and the tightest fit (from below) to the negative examples, and so forth. Defining a strategy is necessary for the analysis below — the type of strategy is not critical though.

We next define the error  err(R′) on the concept  R′ generated by our learning strategy. We first note that with the strategy defined above we always have  R′ ⊂ R since R′ is the tightest fit solution which is consistent with the sample data (there could be a positive example outside of  R′ whichis not in the training set). We will define the ”weight” w(E) of a region E in the plane as

image

i.e., the probability that a random point sampled according to the distribution D will fall into the region. Therefore, the error associated with the concept  R′ is

image

image

Fig. 7.1. Given the tightest-fit to positive examples strategy we have that  R′ ⊂ R.The strip  T1has weight  ϵ/4 and the strip  T ′1 is defined as the upper strip covering the area between  R and R′.

and we wish to bound the error  w(R − R′) ≤ ϵwith probability of at least 1  − δafter seeing m examples.

We will divide the region  R − R′ into four strips  T ′1, ..., T ′4 (see Fig.7.1) which overlap at the corners. We will estimate  prob(w(T ′i) ≥ ϵ4) noting that the overlaps between the regions makes our estimates more pessimistic than they truly are (since we are counting the overlapping regions twice) thus making us lean towards the conservative side in our estimations.

Consider the upper strip  T ′1. If w(T ′1 ≤ ϵ4) then we are done. We are however interested in quantifying the probability that this is not the case. Assume  w(T ′1) > ϵ4 and define a strip  T1which starts from the upper axis of R and stretches to the extent such that  w(T1) = ϵ4. Clearly T1 ⊂ T ′1. Wehave that  w(T ′1) > ϵ4 iff T1 ⊂ T ′1. Furthermore:

Claim 9 T1 ⊂ T ′1 iff x1, ..., xm ̸∈ T1.

Proof: If xi ∈ T1the the label must be positive since  T1 ⊂ R. But ifthe label is positive then given our learning strategy of fitting the tightest rectangle over the positive examples, then  xi ∈ R′. Since T1 ̸⊂ R′ it followsthat  xi ̸∈ T1.

We have therefore that  w(T ′1 > ϵ4) iff no point in  T1appears in the sample S = {x1, ..., xm}(otherwise  T1intersects with  R′ and thus T ′1 ⊂ T1). Theprobability that a point sampled according to the distribution D will fall outside of  T1 is 1 − ϵ4. Given the independence assumption (examples are

drawn i.i.d.), we have:

image

Repeating the same analysis to regions  T ′2, T ′3, T ′4 and using the union bound P(A ∪ B) ≤ P(A) + P(B) we come to the conclusion that the probability that any of the four strips of  R − R′ has weight greater that  ϵ/4 is at most 4(1  − ϵ4)m. In other words,

image

We can make the expression more convenient for manipulation by using the inequality  e−x ≥ 1 − x(recall that 1 + (1/n))n < efrom which it follows that (1 +  z)1/z < eand by taking the power of  rz where r ≥0 we obtain (1 +  z)r < erz then set r = 1, z = −x):

image

from which we obtain the bound:

image

To conclude, assuming that the learner adopts the tightest-fit to positive examples strategy and is given at least  m0 = 4ϵ ln 4δ training examples in order to find the axes-aligned rectangle  R′, we can assert that with probability 1  − δthe error associated with  R′ (i.e., the probability that an (m + 1)’thpoint will be classified incorrectly) is at most  ϵ.

We can see form the analysis above that indeed it applies to any distribution D where the only assumption we had to make is the independence of the draw. Also, the sample size m behaves well in the sense that if one desires a higher level of accuracy (smaller  ϵ) or a higher level of confidence (smaller  δ) then the sample size grows accordingly. The growth of m is linear in 1/ϵand linear in ln(1/δ).

image

In the previous section we illustrated the concept of learnability with a particular simple example. We will now focus on applying the learnability model to a more general family of learning examples. We will consider the family of all learning problems over finite concept classes  |C| < ∞. Forexample, the conjunction learning problem (over boolean formulas) with n literals contains only 3n hypotheses because each variable can appear in the conjunction or not and if appears it could be negated or not. We have shown that n is the lower bound on the number of mistakes on the worst case analysis any on-line algorithm can achieve. With the definitions we have above on the formal model of learnability we can perform accuracy and sample complexity analysis that will apply to any learning problem over finite concept classes. This was first introduced by Valiant in 1984.

In the realizable case over  |C| < ∞, we will show that any algorithm L which returns a hypothesis  h ∈ Cwhich is consistent with the training set Z is a learning algorithm for C. In other words, any finite concept class is learnable and the learning algorithms simply need to generate consistent hypotheses. The sample complexity  m0associated with the choice of  ϵ andδcan be shown as equal to: 1ϵ ln |C|δ .

image

h ∈ C that minimizesthe empirical error (the error obtained on Z) is a learning algorithm for C. The sample complexity can be shown as equal to:

image

Let  h ∈ Cbe some consistent hypothesis with the training set Z (we know that such a hypothesis exists, in particular  h = ctthe target concept used for generating Z) and suppose that

image

Then, the probability (with respect to the product distribution  Dm) that hagrees with  cton a random sample of length m is at most (1  − ϵ)m. Usingthe inequality we saw before  e−x ≥ 1 − x we have:

image

We wish to bound the error uniformly, i.e., that  err(h) ≤ ϵfor all concepts h ∈ C. This requires the evaluation of:

image

There at most |C| such functions h, therefore using the Union-Bound the probability that some function in C has error larger than  ϵand is consistent

with  cton a random sample of length m is at most  |C|e−ϵm:

image

For any positive  δ, this probability is less than  δ provided:

image

This derivation can be summarized in the following theorem (Anthony & Bartlett [1], pp. 25):

Theorem 5 Let C be a finite set of functions from X to Y . Let L be an algorithm such that for any m and for any  ct ∈ C, if Zis a training sample  {(xi, ct(xi))}, i = 1, ..., m, then the hypothesis h = L(Z) satisfies h(xi) = ct(xi). Then Lis a learning algorithm for C in the realizable case with sample complexity

image

In the realizable case an algorithm simply needs to generate a consistent hypothesize to be considered a learning algorithm in the formal sense. In the unrealizable situation (a target function  ctmight not exist) an algorithm which minimizes the empirical error, i.e., an algorithm L generates h = L(Z) having minimal sample error:

image

is a learning algorithm for C (assuming finite |C|). This is a particularly useful property given that the true errors of the functions in C are unknown. It seems natural to use the sample errors ˆerr(h) as estimates to the performance of L.

The fact that given a large enough sample (training set Z) then the sample error ˆerr(h) becomes close to the true error err(h) is somewhat of a restatement of the ”law of large numbers” of probability theory. For example, if we toss a coin many times then the relative frequency of ’heads’ approaches the true probability of ’head’ at a rate determined by the law of large numbers. We can bound the probability that the difference between the empirical error and the true error of some  h exceeds ϵusing Hoeffding’s inequality:

Claim 10 Let h be some function from X to Y = {0, 1}. Then

image

for any probability distribution  D, any ϵ > 0and any positive integer m.

Proof: This is a straightforward application of Hoeffding’s inequality to Bernoulli variables. Hoeffding’s inequality says: Let X be a set, D a probability distribution on  X, and f1, ..., fmreal-valued functions  fi : X → [ai, bi]from X to an interval on the real line (ai < bi). Then,

image

where

image

In our case  fi(xi) = 1 iff h(xi) ̸= yi and ai = 0, bi = 1. Therefore(1/m) �i fi(xi) = ˆerr(h) and err(h) = Ex∼D[f(x)].

The Hoeffding bound almost does what we need, but not quite so. What we have is that for any given hypothesis  h ∈ C, the empirical error is close to the true error with high probability. Recall that our goal is to minimize err(h) over all possible  h ∈ Cbut we can access only ˆerr(h). If we can guarantee that the two are close to each other  for every h ∈ C, thenminimizing ˆerr(h) over all h ∈ Cwill approximately minimize err(h). Put formally, in order to ensure that L learns the class C, we must show that

image

In other words, we need to show that the empirical errors converge (at high probability) to the true errors  uniformly over C as m → ∞. If that can be guaranteed, then with (high) probability 1  − δ, for every  h ∈ C,

image

So, since the algorithm L running on training set Z returns h = L(Z) which minimizes the empirical error, we have:

image

which is what is needed in order that L learns C. Thus, what is left is to prove the following claim:

Claim 11

image

Proof: We will use the union bound. Finding the maximum over C is equivalent to taking the union of all the events:

image

using the union-bound and Claim 2, we have:

image

Finally, given that 2|C|e−2ϵ2m ≤ δwe obtain the sample complexity:

image

This discussion is summarized with the following theorem (Anthony & Bartlett [1], pp. 21):

Theorem 6 Let C be a finite set of functions from X to Y = {0, 1}. Let L be an algorithm such that for any m and for any training set  Z = {(xi, yi)},i = 1, ..., m, then the hypothesis L(Z) satisfies:

image

Then L is a learning algorithm for C with sample complexity  m0 = 2ϵ2 ln 2|C|δ .

Note that the main difference with the realizable case (Theorem 1) is the larger 1/ϵ2 rather than 1/ϵ. The realizable case requires a smaller training set since we are estimating a random quantity so the smaller the variance the less data we need.

image

The result of the PAC model (also known as the ”formal” learning model) is that if the concept class C is PAC-learnable then the learning strategy must simply consist of gathering a sufficiently large training sample S of size  m > mo(ϵ, δ), for given accuracy  ϵ >0 and confidence 0  < δ < 1parameters, and finds a hypothesis  h ∈ Cwhich is consistent with S. The learning algorithm is then guaranteed to have a bounded error  err(h) < ϵwith probability 1  − δ. The error measurement includes data not seen by the training phase.

This state of affair also holds (with some slight modifications on the sample complexity bounds) when there is no consistent hypothesis (the unrealizable case). In this case the learner simply needs to minimize the empirical error ˆerr(h) on the sample training data S, and if m is sufficiently large then the learner is guaranteed to have  err(h) < Opt(C)+ϵwith probability 1  − δ. The measure Opt(C) is defined as the minimal  err(g) over all g ∈ C.Note that in the realizable case Opt(C) = 0.

The property of bounding the true error err(h) by minimizing the sample error ˆerr(h) is very convenient. The fundamental question is under what conditions this type of generalization property applies? We saw in the previous lecture that a satisfactorily answer can be provided when the cardinality of the concept space is bounded, i.e.  |C| < ∞, which happens for Boolean concept space for example. In that lecture we have proven that:

image

is sufficient for guaranteeing a learning model in the formal sense, i.e., which has the generalization property described above.

In this lecture and the one that follows we have two goals in mind. First is to generalize the result of finite concept class cardinality to infinite cardinality — note that the bound above is not meaningful when  |C| = ∞.Can we learn in the formal sense any non-trivial infinite concept class? (we already saw an example of a PAC-learnable infinite concept class which is the class of axes aligned rectangles). In order to answer this question we will need to a general measure of concept class complexity which will replace the cardinality term |C| in the sample complexity bound  mo(ϵ, δ). It is tempting to assume that the number of parameters which fully describe the concepts of C can serve as such a measure, but we will show that in fact one needs a more powerful measure called the Vapnik-Chervonenkis (VC) dimension. Our second goal is to pave the way and provide the theoretical foundation for the large margin principle algorithm (SVM) we derived in Lecture 4.

image

The basic principle behind the VC dimension measure is that although C may have infinite cardinality, the restriction of the application of concepts in C to a finite sample S has a finite outcome. This outcome is typically governed by an exponential growth with the size m of the sample S — but not always. The point at which the growth stops being exponential is when the ”complexity” of the concept class C has exhausted itself, in a manner of speaking.

We will assume C is a concept class over the instance space X — both of which can be infinite. We also assume that the concept class maps instances in X to {0, 1}, i.e., the input instances are mapped to ”positive” or ”negative” labels. A training sample S is drawn i.i.d according to some fixed but unknown distribution D and S consists of  m instances x1, ..., xm.In our notations we will try to reserve  c ∈ Cto denote the target concept and  h ∈ C to denote someconcept. We begin with the following definition:

Definition 2

image

which is a set of vectors in  {0, 1}m.

ΠC(S) is set whose members are m-dimensional Boolean vectors induced by functions of C. These members are often called dichotomies or behaviors on S induced or realized by C. If C makes a full realization then ΠC(S) willhave 2m members. An equivalent description is a collection of subsets of S:

image

where each  h ∈ Cmakes a partition of S into two sets — the positive and negative points. The set ΠC(S) contains therefore subsets of S (the positive points of S under h). A full realization will provide �mi=0�mi�= 2m. Wewill use both descriptions of ΠC(S) as a collection of subsets of S and as a set of vectors interchangeably.

Definition 3 If |ΠC(S)| = 2m then Sis considered shattered by C. In other words, S is shattered by C if C realizes all possible dichotomies of S.

Consider as an example a finite concept class  C = {c1, ..., c4}applied to three instance vectors with the results:

image

Then,

image

With these definitions we are ready to describe the measure of concept class complexity.

Definition 4 (VC dimension) The VC dimension of C, noted as V Cdim(C), is the cardinality d of the largest set S shattered by C. If all sets S (arbitrarily large) can be shattered by  C, then V Cdim(C) = ∞.

image

The VC dimension of a class of functions C is the point d at which all samples S with cardinality |S| > d are no longer shattered by C. As long as C shatters S it manifests its full ”richness” in the sense that one can obtain from S all possible results (dichotomies). Once that ceases to hold, i.e., when |S| > d, it means that C has ”exhausted” its richness (complexity). An infinite VC dimension means that C maintains full richness for all sample sizes. Therefore, the VC dimension is a combinatorial measure of a function class complexity.

Before we consider a number of examples of geometric concept classes and their VC dimension, it is important clarify the lower and upper bounds (existential and universal quantifiers) in the definition of VC dimension. The VC dimension is at least d if there exists some sample |S| = d which is shattered by C — this does not mean that all samples of size d are shattered by C. Conversely, in order to show that the VC dimension is at most d, one must show that no sample of size d + 1 is shattered. Naturally, proving an upper bound is more difficult than proving the lower bound on the VC dimension. The following examples are shown in a ”hand waiving” style and are not meant to form rigorous proofs of the stated bounds — they are shown for illustrative purposes only.

Intervals of the real line: The concept class C is governed by two parameters  α1, α2in the closed interval [0, 1]. A concept from this class will tag an input instance 0 < x < 1 as positive if  α1 ≤ x ≤ α2and negative otherwise. The VC dimension is at least 2: select a sample of 2 points  x1, x2positioned in the open interval (0, 1). We need to show that there are values of  α1, α2which realize all the possible four dichotomies (+, +), (−, −), (+, −), (−, +).This is clearly possible as one can place the interval [α1, α2] such the intersection with the interval [x1, x2] is null, (thus producing (−, −)), or to fully include [x1, x2] (thus producing (+, +)) or to partially intersect [x1, x2] suchthat  x1 or x2are excluded (thus producing the remaining two dichotomies). To show that the VC dimension is at most 2, we need to show that any sample of three points  x1, x2, x3on the line (0, 1) cannot be shattered. It is sufficient to show that one of the dichotomies is not realizable: the labeling (+, −,+) cannot be realizable by any interval [α1, α2] — this is because if x1, x3are labeled positive then by definition the interval [α1, α2] must fully include the interval [x1, x3] and since  x1 < x2 < x3 then x2must be labeled positive as well. Thus V Cdim(C) = 2.

Axes-aligned rectangles in the plane: We have seen this concept class in the previous lecture — a point in the plane is labeled positive if it lies in an axes-aligned rectangle. The concept class C is thus governed by 4 parameters. The VC dimension is at least 4: consider a configuration of 4 input points arranged in a cross pattern (recall that we need only to show some sample S that can be shattered). We can place the rectangles (concepts of the class C) such that all 16 dichotomies can be realized (for example, placing the rectangle to include the vertical pair of points and exclude the horizontal pair of points would induce the labeling (+, −, +, −)).It is important to note that in this case, not all configurations of 4 points can be shattered — but to prove a lower bound it is sufficient to show the existence of a single shattered set of 4 points. To show that the VC dimension is at most 4, we need to prove that any set of 5 points cannot be shattered. For any set of 5 points there must be some point that is ”internal”, i.e., is neither the extreme left, right, top or bottom point of the five. If we label this internal point as negative and the remaining 4 points as positive then there is no axes-aligned rectangle (concept) which cold realize this labeling (because if the external 4 points are labeled positive then they must be fully within the concept rectangle, but then the internal point must also be included in the rectangle and thus labeled positive as well).

Separating hyperplanes: Consider first linear half spaces in the plane. The lower bound on the VC dimension is 3 since any three (non-collinear) points in  R2 can be shattered, i.e., all 8 possible labelings of the three points can be realized by placing a separating line appropriately. By having one of the points on one side of the line and the other two on the other side we can realize 3 dichotomies and by placing the line such that all three points are on the same side will realize the 4th. The remaining 4 dichotomies are realized by a sign flip of the four previous cases. To show that the upper bound is also 3, we need to show that no set of 4 points can be shattered. We consider two cases: (i) the four points form a convex region, i.e., lie on the convex hull defined by the 4 points, (ii) three of the 4 points define the convex hull and the 4th point is internal. In the first case, the labeling which is positive for one diagonal pair and negative to the other pair cannot be realized by a separating line. In the second case, a labeling which is positive for the three hull points and negative for the interior point cannot be realize. Thus, the VC dimension is 3 and in general the VC dimension for separating hyperplanes in  Rn is n + 1.

Union of a finite number of intervals on the line: This is an example of a concept class with an infinite VC dimension. For any sample of points on the line, one can place a sufficient number of intervals to realize any labeling.

The examples so far were simple enough that one might get the wrong impression that there is a correlation between the number of parameters required to describe concepts of the class and the VC dimension. As a counter example, consider the two parameter concept class:

image

which has an infinite VC dimension as one can show that for every set of m points on the line one can realize all possible labelings by choosing

a sufficiently large value of  ω(which serves as the frequency of the sync function) and appropriate phase. We conclude this section with the following claim:

Theorem 7 The VC dimension of a finite concept class  |C| < ∞is bounded from above:

image

Proof: if V Cdim(C) = d then there exists at least 2d functions in C because every function induces a labeling and there are at least 2d labelings.Thus, from  |C| ≥ 2d follows that  d ≤ log2 |C|.

image

We saw that the VC dimension is a combinatorial measure of concept class complexity and we would like to have it replace the cardinality term in the sample complexity bound. The first result of interest is to show that if the VC dimension of the concept class is infinite then the class is not PAC learnable.

Theorem 8 Concept class  C with V Cdim(C) = ∞is not learnable in the formal sense.

Proof: Assume the contrary that C is PAC learnable. Let L be the learning algorithm and m be the number of training examples required to learn the concept class with accuracy  ϵ = 0.1 and 1 − δ = 0.9. That is, after seeing at least  m(ϵ, δ) training examples, the learner generates a concept h which satisfies  p(err(h) ≤ 0.1) ≥ 0.9.

Since the VC dimension is infinite there exist a sample set S with 2m instances which is shattered by C. Since the formal model (PAC) applies to any training sample we will use the set S as follows. We will define a probability distribution on the instance space X which is uniform on S (with probability 12m) and zero everywhere else.

Because S is shattered, then any target concept is possible so we will choose our target concept c in the following manner:

image

in other words, the labels  ct(xi) are determined by a coin flip. The learner L selects an i.i.d. sample of m instances ¯S — which due to the structure of D means that the ¯S ⊂ Sand outputs a consistent hypothesis  h ∈ C. Theprobability of error for each  xi ̸∈ ¯S is:

image

The reason for that is because S is shattered by C, i.e., we can select any target concept for any labeling of S (the 2m examples) therefore we could select the labels of the m points not seen by the learner arbitrarily (by flipping a coin). Regardless of h, the probability of mistake is 0.5. The expectation on the error of h is:

image

This is because we have 2m points to sample (according to D as all other points have zero probability) from which the error on half of them is zero (as h is consistent on the training set ¯S) and the error on the remaining half is 0.5. Thus, the average error is 0.25. Note that E[err(h)] = 0.25 for any choice of  ϵ, δas it is based on the sample size m. For any sample size m we can follow the construction above and generate the learning problem such that if the learner produces a consistent hypothesis the expectation of the error will be 0.25.

The result that E[err(h)] = 0.25 is not possible for the accuracy and confidence values we have set: with probability of at least 0.9 we have that err(h) ≤ 0.1 and with probability 0.1 then err(h) = β where 0.1 < β ≤ 1.Taking the worst case of  β= 1 we come up with the average error:

image

We have therefore arrived to a contradiction that C is PAC learnable.

We next obtain a bound on the growth of  |ΠS(C)|when the sample size |S| = m is much larger than the VC dimension V Cdim(C) = d of the concept class. We will need few more definitions:

Definition 5 (Growth function)

image

The measure ΠC(m) is the maximum number of dichotomies induced by C for samples of size m. As long as  m ≤ d then ΠC(m) = 2m. The question is what happens to the growth pattern of ΠC(m) when m > d. We willsee that the growth becomes polynomial — a fact which is crucial for the learnability of C.

Definition 6 For any natural numbers m, d we have the following definition:

image

By induction on m, d it is possible to prove the following:

Theorem 9

image

Proof: by induction on m, d. For details see [[3], pp. 56].

image

polynomial upper bound as follows.

image

Dividing both sides by� dm�d yields:

image

We need one more result before we are ready to present the main result of this lecture:

Theorem 10 (Sauer’s lemma) If V Cdim(C) = d, then for any m, ΠC(m) ≤ Φd(m).

Proof: By induction on both d, m. For details see [[3], pp. 55–56].

Taken together, we have now a fairly interesting characterization on how the combinatorial measure of complexity of the concept class C scales up with the sample size m. When the VC dimension of C is infinite the growth is exponential, i.e., ΠC(m) = 2m for all values of m. On the other hand, when the concept class has a bounded VC dimension  V Cdim(C) = d < ∞then the growth pattern undergoes a discontinuity from an exponential to a polynomial growth:

image

As a direct result of this observation, when m >> d is much larger than d the entropy becomes much smaller than m. Recall than from an information theoretic perspective, the entropy of a random variable Z with discrete values  z1, ..., znwith probabilities  pi, i = 1, ..., nis defined as:

image

where  I(pi) = log2 1pi is a measure of ”information”, i.e., is large when piis small (meaning that there is much information in the occurrence of an unlikely event) and vanishes when the event is certain  pi = 1. Theentropy is therefore the expectation of information. Entropy is maximal for a uniform distribution  H(Z) = log2 n. The entropy in information theory context can be viewed as the number of bits required for coding  z1, ..., zn. Incoding theory it can be shown that the entropy of a distribution provides the lower bound on the average length of any possible encoding of a uniquely decodable code fro which one symbol goes into one symbol. When the distribution is uniform we will need the maximal number of bits, i.e., one cannot compress the data. In the case of concept class C with VC dimension d, we see that one when  m ≤ dall possible dichotomies are realized and thus one will need m bits (as there are 2m dichotomies) for representing all the outcomes of the sample. However, when m >> d only a small fraction of the 2m dichotomies can be realized, therefore the distribution of outcomes is highly non-uniform and thus one would need much less bits for coding the outcomes of the sample. The technical results which follow are therefore a formal way of expressing in a rigorous manner this simple truth — If it is possible to compress, then it is possible to learn. The crucial point is that learnability is a direct consequence of the ”phase transition” (from exponential to polynomial) in the growth of the number of dichotomies realized by the concept class.

In the next lecture we will continue to prove the ”double sampling” theorem which derives the sample size complexity as a function of the VC dimension.

image

In this lecture will use the measure of VC dimension, which is a combinatorial measure of concept class complexity, to bound the sample size complexity.

9.1 A Polynomial Bound on the Sample Size m for PAC Learning

In this section we will follow the material presented in Kearns & Vazirani [3] pp. 57–61 and prove the following:

Theorem 11 (Double Sampling) Let C be any concept class of VC dimension d. Let L be any algorithm that when given a set S of m labeled examples  {xi, c(xi)}i, sampled i.i.d according to some fixed but unknown distribution D over the instance space X, of some concept  c ∈ C, producesas output a concept  h ∈ Cthat is consistent with S. Then L is a learning algorithm in the formal sense provided that the sample size obeys:

image

for some constant  c0 > 0.

The idea behind the proof is to build an ”approximate” concept space which includes concepts arranged such that the distance between the approximate concepts h and the target concept c is at least  ϵ— where distance is de-fined as the weight of the region in X which is in conflict with the target concept. To formalize this story we will need few more definitions. Unless specified otherwise,  c ∈ Cdenotes the target concept and  h ∈ C denotessome concept.

Definition 7

image

c∆his the region in instance space where both concepts do not agree — the error region. The probability that  x ∈ c∆his equal to (by definition) err(h).

Definition 8

image

∆(c) is a set of error regions, one per concept  h ∈ Cover all concepts. The error regions are with respect to the target concept. The set ∆ϵ(c) ⊂ ∆(c)is the set of all error regions whose weight exceeds  ϵ. Recall that weight is defined as the probability that a point sampled according to D will hit the region.

It will be important for later to evaluate the VC dimension of ∆(c). Unlike C, we are not looking for the VC dimension of a class of function but the VC dimension of a set of regions in space. Recall the definition of ΠC(S)from the previous lecture: there were two equivalent definitions one based on a set of vectors each representing a labeling of the instances of S induced by some concept. The second, yet equivalent, definition is based on a set of subsets of S each induced by some concept (where the concept divides the sample points of S into positive and negative labeled points). So far it was convenient to work with the first definition, but for evaluating the VC dimension of ∆(c) it will be useful to consider the second definition:

image

that is, the collection of subsets of S induced by intersections with regions of ∆(c). An intersection between S and a region r is defined as the subset of points from S that fall into r. We can easily show that the VC dimensions of  C and ∆(c) are equal:

Lemma 1

image

Proof: we have that the elements of ΠC(S) and Π∆(c)(S) are susbsets of S, thus we need to show that for every S the cardinality of both sets is equal  |ΠC(S)| = |Π∆(c)(S)|. To do that it is sufficient to show that for every element  s ∈ ΠC(S) there is a unique corresponding element in Π∆(c)(S). Letc∩Sbe the subset of S induced by the target concept c. The set s (a subset of S) is realized by some concept h (those points in S which were labeled positive by h). Therefore, the set  s ∩ (c ∩ S) is the subset of S containing the points that hit the region  h∆cwhich is an element of Π∆(c)(S). Sincethis is a one-to-one mapping we have that  |ΠC(S)| = |Π∆(c)(S)|.

Definition 9 (ϵ-net) For every ϵ > 0, a sample set  S is an ϵ-net for ∆(c)if every region in ϵ(c)is hit by at least one point of S:

image

In other words, if S hits all the error regions in ∆(c) whose weight exceeds ϵ, then S is an ϵ-net. Consider as an example the concept class of intervals on the line [0, 1]. A concept is defined by an interval [α1, α2] such that all points inside the interval are positive and all those outside are negative. Given  c ∈ Cis the target concept and  h ∈ Cis some concept, then the error region  h∆cis the union of two intervals:  I1consists of all points  x ∈ hwhich are not in  c, and I2the interval of all points  x ∈ cbut which are not in h. Assume that the distribution D is uniform (just for the sake of this example) then,  prob(x ∈ I) = |I|which is the length of the interval I. As a result,  err(h) > ϵ if either |I1| > ϵ/2 or |I2| > ϵ/2. The sample set

image

contains sample points from 0 to 1 with increments of  ϵ/2. Therefore, every interval larger than  ϵmust be hit by at least one point from S and by definition  S is an ϵ-net.

It is important to note that if  S forms an ϵ-net then we are guaranteed that  err(h) ≤ ϵ. Let h ∈ Cbe the consistent hypothesis with S (returned by the learning algorithm L). Becuase h is consistent,  h∆c ∈ ∆(c) hasnot been hit by S (recall that  h∆cis the error region with respect to the target concept c, thus if h is consistent then it agrees with c over S and therefore S does not hit  h∆c). Since S forms an ϵ-net for ∆(c) we must have  h∆c ̸∈ ∆ϵ(c) (recall that by definition S hits all error regions with weight larger than  ϵ). As a result, the error region  h∆cmust have a weight smaller than  ϵwhich means that  err(h) ≤ ϵ.

The conclusion is that if we can bound the probability that a random sample  S does not form an ϵ-net for ∆(c), then we have bounded the probability that a concept h consistent with  S has err(h) > ϵ. This is the goal of the proof of the double-sampling theorem which we are about to prove below:

image

dom sample of size m (sampled i.i.d. according to the unknown distribution D) and let A be the event that  S1 does not form an ϵ-net for ∆(c). From the preceding discussion our goal is to upper bound the probability for A to occur, i.e.,  prob(A) ≤ δ.

If A occurs, i.e.,  S1 is not an ϵ-net, then by definition there must be some region  r ∈ ∆ϵ(c) which is not hit by  S1, that is S1 ∩ r = ∅. Note that r = h∆(c) for some concept h which is consistent with  S1. At this point the space of possibilities is infinite, because the probability that we fail to hit h∆(c) in mrandom examples is at most (1−ϵ)m. Thus the probability that we fail to hit  some h∆c ∈ ∆ϵ(c) is bounded from above by  |∆(c)|(1 − ϵ)m— which does not help us due to the fact that  |∆(c)|is infinite. The idea of the proof is to turn this into a finite space by using another sample, as follows.

Let  S2be another random sample of size m. We will select m (for both S1 and S2) to guarantee a high probability that  S2 will hit rmany times. In fact we wish that  S2 will hit r at least ϵm2 with probability of at least 0.5:

image

We will use the Chernoff bound (lower tail) to obtain a bound on the right-hand side term. Recall that if we have m Bernoulli trials (coin tosses) Z1, ..., Zmwith expectation  E(Zi) = pand we consider the random variable Z = Z1 + ... + Zmwith expectation  E(Z) = µ(note that  µ = pm) then forall 0  < ψ <1 we have:

image

Considering the sampling of m examples that form  S2as Bernoulli trials, we have that  µ ≥ ϵm(since the probability that an example will hit r is at least  ϵ) and ψ = 0.5. We obtain therefore:

image

which happens when  m = 8ϵ ln 2 = O(1ϵ).To summarize what we have obtained so far, we have calculated the probability that  S2 will hit r manytimes given that r was fixed using the previous sampling, i.e., given that  S1does not form an  ϵ-net. To formalize this, let B denote the combined event that  S1does not form an  ϵ-event and S2 hits r at least ϵm/2 times. Then, we have shown that for  m = O(1/ϵ) we have:

image

From this we can calculate prob(B):

image

which means that our original goal of bounding prob(A) is equivalent to finding a bound  prob(B) ≤ δ/2 because prob(A) ≤ 2 · prob(B) ≤ δ. Thecrucial point with the new goal is that to analyze the probability of the event B, we need only to consider a finite number of possibilities, namely to consider the regions of

image

This is because the occurrence of the event B is equivalent to saying that there is some  r ∈ Π∆ϵ(c)(S1 ∪ S2) such that  |r| ≥ ϵm/2 (i.e., the region r is hit at least  ϵm/2 times) and  S1 ∩ r = ∅. This is because Π∆ϵ(c)(S1 ∪ S2)contains all the subsets of  S1 ∪ S2realized as intersections over all regions in ∆ϵ(c). Thus even though we have an infinite number of regions we still have a finite number of subsets. We wish therefore to analyze the following probability:

image

Let  S = S1∪S2a random sample of 2m (note that since the sampling is i.i.d. it is equivalent to sampling  S1 and S2separately) and r satisfying  |r| ≥ ϵm/2being fixed. Consider some random partitioning of  S into S1 and S2 andconsider then the problem of estimating the probability that  S1 ∩ r = ∅.This problem is equivalent to the following combinatorial question: we have 2m balls, each colored Red or Blue, with exaclty  l ≥ ϵm/2 Red balls. We divide the 2m balls into groups of equal size  S1 and S2and we are interested in bounding the probability that all of the l balls fall in  S2(that is, the probability that  S1 ∩ r = ∅). This in turn is equivalent to first dividing the 2m uncolored balls into  S1 and S2groups and then randomly choose l of the balls to be colored Red and analyze the probability that all of the Red balls fall into  S2. This probability is exactly

image

This probability was evaluated for a fixed S and r. Thus, the probability that this occurs for  some r ∈ Π∆ϵ(c)(S) satisfying  |r| ≥ ϵm/2 (which is prob(B)) can be calculated by summing over all possible fixed r and applying the

union bound  prob(�i Zi) ≤ �i prob(Zi):

image

from which it follows that:

image

Few comments are worthwhile at this point:

(i) It is possible to show that the upper bound on the sample complexity m is tight by showing that the lower bound on  m is Ω(d/ϵ) (see [[3], pp. 62]).

(ii) The treatment above holds also for the unrealizable case (target concept  c ̸∈ C) with slight modifications to the bound. In this context, the learning algorithm L must simply minimize the sample (empirical) error ˆerr(h) defined:

image

The generalization of the double-sampling theorem (Derroye’82) states that the empirical errors converge uniformly to the true errors:

image

Taken together, we have arrived to a fairly remarkable result. Despite the fact that the distribution D from which the training sample S is drawn from is unknown (but is known to be fixed), the learner simply needs to minimize the empirical error. If the sample size m is large enough the learner is guaranteed to have minimized the true errors for some accuracy and confidence parameters which define the sample size complexity. Equivalently,

image

Not only is the convergence is independent of D but also the rate of convergence is independent (namely, it does not matter where the optimal  h∗is located). The latter is very important because without it one could arbitrarily slow down the convergence rate by maliciously choosing D. The beauty of the results above is that D does not have an effect at all — one simply needs to choose the sample size to be large enough for the accuracy, confidence and VC dimension of the concept class to be learned over.

image

In Lecture 4 we discussed the large margin principle for finding an optimal separating hyperplane. It is natural to ask how does the PAC theory presented so far explains why a maximal margin hyperplane is optimal with regard to the formal sense of learning (i.e. to generalization from empirical errors to true errors)? We saw in the previous section that the sample complexity  m(ϵ, δ, d) depends also on the VC dimension of the concept class — which is n + 1 for hyperplanes in  Rn. Thus, another natural question that may certainly arise is what is the gain in employing the ”kernel trick”? For a fixed m, mapping the input instance space X of dimension n to some higher (exponentially higher) feature space might simply mean that we are compromising the accuracy and confidence of the learner (since the VC dimension is equal to the instance space dimension plus 1).

Given a fixed sample size m, the best the learner can do is to minimize the empirical error and at the same time to try to minimize the VC dimension d of the concept class. The smaller d is, for a fixed m, the higher the accuracy and confidence of the learning algorithm. Likewise, the smaller d is, for a fixed accuracy and confidence values, the smaller sample size is required.

There are two possible ways to decrease d. First is to decrease the dimension n of the instance space X. This amounts to ”feature selection”, namely find a subset of coordinates that are the most ”relevant” to the learning task r perform a dimensionality reduction via PCA, for example. A second approach is to maximize the margin. Let the margin associated with the separating hyperplane h (i.e. consistent with the sample  S) be γ. Let theinput vectors  x ∈ Xhave a bounded norm,  |x| ≤ R. It can be shown that the VC dimension of the concept class  Cγof hyperplanes with margin  γ is:

image

Thus, if the margin is very small then the VC dimension remains n + 1. As the margin gets larger, there comes a point where  R2/γ2 < nand as a result the VC dimension decreases. Moreover, mapping the instance space X to some higher dimension feature space will not change the VC dimension as long as the margin remains the same. It is expected that the margin will not scale down or will not scale down as rapidly as the scaling up of dimension from image space to feature space.

To conclude, maximizing the margin (while minimizing the empirical error) is advantageous as it decreases the VC dimension of the concept class and causes the accuracy and confidence values of the learner to be largely immune to dimension scaling up while employing the kernel trick.

image

image

Let X, Y be two random variables and let f(x, y) be some function on  x ∈X, y ∈ Y , and let p(x, y) be the probability of the event x and y occurring together. The expectation E[f(x, y)] is defined:

image

. The mean, variance and covariance are defined:

image

σxy = Cov(XY ) = E[(x − µx)(y − µy)] =�

image

In vector-matrix notation, let x represent the n random variables of  X1, ..., Xn,i.e.,  x = (x1, ..., xn)⊤ is an instance vector and p(x) is the probability of the instance occurrence. Then the mean is a vector  µand the covariance matrix E are defined:

image

Note that the covariance matrix E is the linear superposition of rank-1 matrices (x−µ)(x−µ)⊤ with coefficients p(x). The diagonal of E containes the variances of the variables  x1, ..., xn.For a uniform distribution and a sample data S consisting of m points, let  A = [x1 − µ, ..., xm − µ] bethe matrix whose columns consist of the points centered around the mean: µ = 1m�i xi. The (sample) covariance matrix is  E = 1mAA⊤.

A0.2 Derivatives of Matrix Operations: Scalar Functions of a Vector

The two most important examples of a scalar function of a vector x are the linear form  a⊤xand the quadratic form  x⊤Axfor some square matrix A.

image

where the derivative  d(x⊤Ax) using the rule of products  d(f · g) = (df) · g +f · (dg) where g = Ax and f = x⊤ and noting that d(Ax) = Adx. Thus, ddx(a⊤x) = a⊤ and ddx(x⊤Ax)) = x⊤(A + A⊤). If Ais symmetric then ddx(x⊤Ax)) = (2Ax)⊤.

image

Consider first the general optimization with equality constraints which gives rise to the notion of Lagrange multipliers.

image

where  f : Rn → R and h : Rn → Rk where his a vector function (h1, ..., hk)each from  Rn to R. We want to derive a necessary and sufficient constraint for a point  xoto be a local minimum subject to the k equality constraints h(x) = 0. Assume that  xo is a regularpoint, meaning that the gradient vectors  ∇hj(x) are linearly independent. Note that  ∇h(xo) is a k×n matrixand the null space of this matrix:

image

defines the tangent plane at the point  xo. We have the following fundamental theorem:

image

in other words, all vectors y spanning the tangent plane at the point  xo arealso perpendicular to the gradient of  f at xo.

image

curve on the surface h(x) = 0, i.e., h(x(t)) = 0. Let  xo = x(0) and y =

dtx(0) the tangent to the curve at  xo. From the definition of tangency, the vector  y lives in null(∇h(xo)), i.e., y · ∇hj(x(0)) = 0, j = 1, ..., k. Sincexo = x(0) is a local extremum of f(x), then

image

where the coefficients  λiare called Lagrange Multipliers and the expression:

image

is called the Lagrangian of the optimization problem (0.1).

image

Consider next the general constrained optimization with inequality constraints (called “non-linear programming”):

image

where  g : Rn → Rs.We will assume that the optimal solution  xo is aregular point which has the following meaning: Let J be the set of indices j such that gj(xo) = 0, then  xois a regular point if the gradient vectors ∇hi(xo), ∇gj(xo), i = 1, ..., k and j ∈ Jare linearly independent. A basic result (we will not prove here) is the Karush-Kuhn-Tucker (KKT) theorem:

Let  xobe a local minimum of the problem and suppose  xois a regular point. Then, there exist  λ1, ..., λk and µ1 ≥ 0, ..., µs ≥ 0such that:

image

Fig. A0.1. Geometric interpreatation of Duality (see text).

image

Note that the condition � µjgj(xo) = 0 is equivalent to the condition that µjgj(xo) = 0 (since  µ ≥ 0 and g(xo) ≤0 thus there sum cannot vanish unless each term vanishes) which in turn implies:  µj = 0 when gj(xo) < 0.The expression

image

is the Lagrangian of the problem (0.2) and the associated condition  µjgj(xo) =0 is called the KKT condition.

The remaining concepts we need are the “duality” and the “Lagrangian Dual” problem.

image

The optimization problem (0.2) is called the “Primal” problem. The Lagrangian Dual problem is defined as:

image

where