Bandit Multiclass Linear Classification for the Group Linear Separable Case

2019·Arxiv

Abstract

Abstract

We consider the online multiclass linear classification under the bandit feedback setting. Beygelzimer, P´al, Sz¨or´enyi, Thiruvenkatachari, Wei, and Zhang [ICML’19] considered two notions of linear separability, weak and strong linear separability. When examples are strongly linearly separable with margin presented an algorithm based on Multiclass Perceptron with mistake bound the number of classes. They employed rational kernel to deal with examples under the weakly linearly separable condition, and obtained the mistake bound of min(this paper, we refine the notion of weak linear separability to support the notion of class grouping, called group weak linear separable condition. This situation may arise from the fact that class structures contain inherent grouping. We show that under this condition, we can also use the rational kernel and obtain the mistake bound of represents the number of groups.

1 Introduction

In an online-learning paradigm, at each time step t, the learner receives a feature vector , makes a prediction ˆ, and obtains a feedback. Note that the learner is playing against an adversary who picks the vector and the correct class from a set of K classes. In the standard full-information feedback setting, the feedback is the correct class , while in the bandit feedback setting, the only feedback is a binary indicator specifying if the learner makes the correct prediction, i.e., ]. The performance of the learner is measured by the total number of mistakes over all the steps.

Typically, the theoretical analysis is carried out under particular linear separability with margin assumptions. Beygelzimer, P´al, Sz¨or´enyi, Thiruvenkatachari, Wei, and Zhang [1] introduced two definitions of linear separability, called strong and weak linear separability. We give a brief summary here (see formal definitions in Section 2.1). For both definitions, there are K vectors defining K hyperplanes. The weak linear separable condition which is similar to standard multiclass linear separability defined in Crammer and Singer [2] ensures that examples from each class lie in the intersection of K halfspaces induced by these hyperplanes. The strong linear separable condition requires that each class is separated by a single hyperplane.

In the full-information feedback setting, Crammer and Singer [2] showed that if all examples are weakly linear separable with margin and have norm at most R, the Multiclass Perceptron algorithm makes at most mistakes. This is tight (up to a constant) since any algorithms must make at least

For the bandit feedback setting [3], Beygelzimer et al. [1] presented an algorithm that make at most ) if the examples are strongly linear separable with margin , paying the price of a factor of K for the bandit feedback setting. They also showed how to extend the algorithm to work with weakly linear separable case using the kernel approach. More specifically, they showed that the examples can be (non-linearly) transformed to higher dimensional space so that they are strongly linear separable with margin (which depends only on and K).

In this paper, we introduce a more refined linear separability condition. Intuitively, the set of weight vectors represents the “directions” of the examples. In this paper, we are interested in the cases where these directions collapse, i.e., while there are K classes of examples, the number of distinct weight vectors required to linearly separate them is less than K. This situation may arise from the fact that class structures contain inherent grouping where intra-group classes can be separated with a single weight vector (or direction). (See Fig. 1, for example.)

More specifically, we consider the case where the classes can be partitioned into L groups, where , such that (1) examples from any two classes in the same group are linearly separable with a margin with a single weight vector, and (2) examples from two classes under different groups are weakly linear separable with a margin. We refer to this condition as the group weakly linear separable condition.

We show that under this refined condition, the same kernel as in [1] can also be used so that the algorithm works in the space where there is (strong) margin that depends on L. Our proofs, as well as that of [1], use the ideas from Klivans and Servedio [4] (which is also based on Beigel et al. [5]).

We note that our key contribution is the mathematical analysis of the margin for group weakly linearly separable examples for the kernelized algorithm in Beygelzimer et al.. This means that everything in their paper works under this group condition (with a better margin bound that depends on L not K).

Section 2 gives definitions and problem settings. Our main result is in Section 3. In particular, Section 3.3 contains our technical theorem that establishes the margin under the transformed inner product space. We provide small examples in Section 4.

2 Deﬁnitions and problem settings

In this section, we review various definitions of linear separability and state a new group weakly linear separable condition, the focus of this work. We also provide a quick review of kernel methods and the Kernelized Bandit Algorithm algorithm used by Beygelzimer et. al. [1].

2.1 Linear separability

We restate the definitions for strong and weak linear separability by Beygelzimer et. al. [1] here. We use the common notation that [K] = {1, 2, . . . , K}.

The examples lie in an inner product space (). Let K be the number of classes and let be a positive real number. Labeled examples

are if there exist vectors such that for all ],

for , and 1.On the other hand, the labeled examples are if there exist vectors such that for all ],

for , and 1.

The strong linear separability also appears in Chen et al. [6]. The weak linear separable condition appears in Crammer and Singer [2].

We now define group weakly linear separability. Let be a partition of [K], i.e.,] for all for , and = [K]. Let g : [] be a mapping function such that iff . We say that the labeled examples

are group weakly linear separable with margin under G if

1. there exist vectors such that 1, and, for all ],

2. there exist vectors such that 1, and, for all ] such that and ) = ), either

Note that vectors ’s define inter-group hyperplanes, while each defines intra-group boundaries. Also note that, to simplify our proofs, the “margin” between intra-group classes is 2; this would create the +and gaps that already exist between groups.

To illustrate the idea, Fig. 1 shows 3 sets of examples.

Figure 1: Three set of examples in showing different linear separable conditions. Thick lines represent class boundaries. (a) Strongly linear separable examples with 3 classes (linearly separable in ). (b) Weakly linear separable examples with 3 classes. (c) Group weakly linear separable examples with 3 groups; group 1 (white) contains 3 classes, group 2 (black) contains 4 classes, and group 3 (gray) contains 1 class.

2.2 Kernel methods

We give an overview of the kernel methods (see [7] for expositions) and the rational kernel [8].

The kernel method is a standard approach to extend linear classification algorithms that use only inner products to handle the notions of “distance” between pairs of examples to nonlinear classification. A positive definite kernel (or kernel) is a function of the form for some set X such that the matrix [is symmetric positive definite for any set of m examples . It is known that for every kernel k, there exists some inner product space () and a feature map such that ) = . Therefore, a linear learning algorithm can essentially non-linearly map every example into V and work in V instead of the original space without explicitly working with using k. This can be very helpful when the dimension of V is infinite.

As in Beygelzimer et al. [1], we use the rational kernel. Assume that examples are in . Denote by B(0, 1) a unit ball centered at 0 in . The rational kernel k : B(0, 1) 1) is defined as

Given ) can be computed in O(d) time.

Let : be the classical real separable Hilbert space equipped with the standard inner product = . We can index the coordinates of by d-tuples () of non-negative integers, the associated feature map : B(0, 1) to k is defined as

where= is the multinomial coefficient. It can be verified that k is the kernel with its feature map to and for any B(0, 1), .

2.3 Multiclass Linear Classification

Beygelzimer et al. [1] presented a learning algorithm for the strongly linearly separable examples using K copies of the Binary Perceptron. They obtained a mistake bound of ) when the examples are from with maximum norm R with margin .

Their approach for dealing the weakly linear separable case is to use the kernel method. They introduced the Kernelized Bandit Algorithm (Algorithm 1) and proved the following theorem.

Theorem 1 (Theorem 4 from [1]). Let X be a non-empty set, let () be an inner product space. Let be a feature map and let , where ) = , be the kernel. If (are labeled examples such that

1. the mapped examples () are strongly linearly separable with margin ,

2. k(x, x), k(x, x), . . . , k(x, x

then the expected number of mistakes that the Kernelized Bandit Algorithm makes is at most (1).

The key theorem for establishing the mistake bound is the following margin transformation theorem based on the rational kernel.

Theorem 2 (Theorem 5 from [1]). (Margin transformation from [1]). Let (B(0, 1) ] be a sequence of labeled examples that is weakly linear separable with margin 0. Let defined as in (1) let

where r = 2log3)+ 1 and log. Then the feature map makes the sequence () strongly linearly separable with margin = max. Also for all 2.

This implies the following mistake bound.

Corollary 1 (Corollary 6 from [1]). (Mistake upper bound from [1]). The mistake bound made by Algorithm 1 when the examples are weakly linearly separable with margin is at most min(2 ).

Beygelzimer et al. [1] gave two margin transformation proofs. In this paper, we only provide one margin transformation based on the Chebyshev polynomials (Theorem 7 from [1]).

2.4 Our contribution

We consider labeled examples with group weakly linearly separable with margin and show that in this case, the rational kernel also transforms the margin and the new margin depends on the number of groups L instead of the number of classes K. More specifically we prove the margin transformation in Theorem 3 and show the mistake bound of in Corollary 2. This can be compared to one of the mistake bound of in [1].

The proofs are fairly technical. We follow the idea in [1] and construct a “good” polynomial that separates examples from one class to the other (strong separation) based on the Chebyshev polynomials [9].

3 Main result

Our main technical result is the following margin transformation using the rational kernel.

Theorem 3. (Margin transformation). Let (B(0, 1) ] be a sequence of labeled examples that is group weakly linear separable with margin 0. Let L be number of group weakly separable such that Let defined as in (1) let

The feature map makes the sequence () strongly linearly separable with margin .

We note that the margin depends on L, the number of groups, instead of K, the number of classes. Using Theorem 3 with Theorem 1 from [1] we obtain the following mistake bound for our algorithm.

Corollary 2. (Mistake bound for group weakly linearly separable case) Let K be positive integer, and be positive real number. The mistake bound made by Algorithm 1 when the examples are group weakly linearly separable with margin with L groups is at most .

Note that multiplicative factor of K is hidden from the second bound of [1] because of the ˜O notation on the exponent. We cannot do that because in our exponent we have only log L which can be much smaller than K. Their actual bound (showing K), which can be compared to ours, is .

3.1 Intra-group boundaries

We first prove a structural property of intra-group classes. The following lemma shows that it is possible to separate one class from the rest in the same group using only lower and upper thresholds. This is independent of the number of classes in that group.

and (2) when ) = g(y) but , either

3.2 Margin transformation

This section is devoted to the proof of Theorem 3. A key property of the space is that it “contains” all multivariate polynomials and the rational kernel k allows us to work in that space. More specifically, by (implicitly) transforming examples to , we can use multivariate polynomials to separate examples from different classes, turning group weakly separability into strong linear separability in . Therefore, to prove the margin transformation, as in [1], we have to (1) establish a separating polynomial and (2) prove the margin bound which depends on the degree and the norm of the polynomials (defined below).

Consider a d-variate polynomial of the form

where the sum ranges over a finite set of d-tuple () of non-negative integers and ’s are real coefficients. We denote the degree of p as deg(p). Following [4], the norm of a polynomial p is defined as

The following lemma from [1] expresses this intuition precisely.

Lemma 2 (from Lemma 9 in [1]). (Norm bound) Let be a multivariate polynomial. There exists such that p(x) = and

As discussed previously, to prove Theorem 3, we need to show the existence of multivariate polynomials that separate one class from the other. Consider class ] in group g(i). Its positive example x, when compared with examples from other group ), satisfies

implying that all examples in class i lie in

while all examples in other groups lie in

When comparing with other classes j in the same group g(i), from Lemma 1, we know that there exists thresholds and that can be used to separate examples from group i, i.e., all its positive examples lie in

while examples from other classes in group g(i) lie in

while examples from other classes in group g(i) lie in

From Lemma 2, for class i, it is enough to establish a multivariate polynomial such that

This is shown in Theorem 4 below. This theorem is fairly technical and is proved in Section 3.3.

Theorem 4. (Polynomial approximation of intersection of halfspaces) Let such that 1. Let such that 1 and 1. Let such that 1. Let (0, 1) and B(0, 1). There exists a multivariate polynomial such that

Proof of Theorem 3. Consider class ]. We will apply Theorem 4. For , let

Also, let = and .

From Theorem 4, there exists a multivariate polynomial such that for all ] and the sequence (), we have

• if , since , and

• if , since .

It is left to check the properties of p. Theorem 4 implies that

By Lemma 2, there exists such that and

We are ready to construct strongly separable vectors for our group weakly separable case in such that + + 1 and for all , and for all , by scaling appropriately as follows. We can let

and

then the theorem follows.

3.3 Separating polynomials

This section proves Theorem 4, i.e., we provide a polynomial that separates one class of examples from the others with degree and norm bounds. As in [1] and [4], we use the Chebyshev polynomials [9] ) defined as follows.

The following two lemmas are from [1].

Lemma 3 (from Lemma 15 in [1]). (Properties of Chebyshev polynomials) Chebyshev polynomials satisfy

1. ) = n for all 0.

2. If 1, the leading coefficient of ) is 2.

3. (cos()) = cos() for all and all 0.

4. (cosh()) = cosh() for all and all 0.

5. 1 for all 1] and all 0.

6. 1 + 1) for all 1 and all 0.

7. (1 +for all 0.

Lemma 4 (from Lemma 14 in [1]). (Properties of norm of polynomials)

1. Let be multivariate polynomials and let p(x) = ) be their product. Then, .

2. Let q be a multivariate polynomial of degree at most s and let p(x) = (. Then, .

3. Let be multivariate polynomials. Then,

Our proof follows the approach in [1].

Proof of Theorem 4. Let log+ 4)and

First, consider the case when

Note that for all ]. Since 1 and 1, we have [0, 1]; thus, (1]. Consider the terms involving and . Since 1, we have that 2 and 2. This implies that 1 2 and 1 2; hence, (2))1] and (2))1]. Therefore,

Now consider the case when

There are two subcases to consider.

Subcase 1: Suppose that for some . In this case, 1 1 + and Lemma 3 (part 6)

Let ) = be the expansion of s-th Chebyshev polynomial. We can bound the term as follows.

and

since ) = + (1 + ) and ) = + (1 ).

The terms and can be analyzed similarly as . We have that

and

Finally,

Figure 2: Group weakly separable dataset in .

Substitutions of r and s finish the proof.

4 Experiments

While we focus mostly on the theoretical aspect of the problem, we performed some experiment to visualize the algorithm.

We generated a dataset in under the group weakly linear separable condition, with K = 9 classes and L = 3 groups with margin = 0.005, shown in Fig 2.

We compared two versions of the bandit multiclass perceptron [1], the standard one and the kernelized one (using the rational kernel). Since the standard one only works with strongly separable case, it would definitely fail in this experiment, but we used it to give an overall sense of improvement for the kernelized version. We ran both algorithms for T = 10steps. For the kernelized version, we conducted 5 experiments, while the linear one we only ran once. Fig. 3 shows the result. The kernelized version made on average

Figure 3: Comparison of the standard algorithm and the kernelized algorithm with T = 10.

130, 884.6 mistakes (13.1%), while the standard one made 835, 848 mistakes (83.6%). Theoretically, the kernelized version should stop making mistakes at some point, but since the number of steps that we ran is too low, we can only see that increasing rate of the number of mistakes decreases over time.

To see the decision boundary, we ploted the contours of the corresponding polynomials for two classes shown in Fig. 4 and Fig. 5. Note that the class in Fig. 5 was much harder to learn as its boundary still overlapped with other classes (i.e., mistakes could still be made).

5 Acknowledgements

We thank Sanparith Marukatat for insightful comments and for pointing out our calculation errors. We also thank Thanawin Rakthanmanon for useful comments. Funding: Both authors are supported by the Thailand Research Fund, Grant RSA-6180074.

References

[1] A. Beygelzimer, D. Pal, B. Szorenyi, D. Thiruvenkatachari, C.-Y. Wei, C. Zhang, Bandit multiclass linear classification: Efficient algorithms for the separable case, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, Vol. 97 of Proceedings of Machine Learning Research, PMLR, Long Beach, California, USA, 2019, pp. 624–633. URL http://proceedings.mlr.press/v97/beygelzimer19a.html

[2] K. Crammer, Y. Singer, Ultraconservative online algorithms for multiclass problems, J. Mach. Learn. Res. 3 (2003) 951–991. doi:10.1162/jmlr.2003.3.4-5.951.

[3] S. M. Kakade, S. Shalev-Shwartz, A. Tewari, Efficient bandit algorithms for online multiclass prediction, in: Proceedings of the 25th International Conference on Machine Learning, ICML ’08, ACM, New York, NY, USA, 2008, pp. 440–447. doi:10.1145/1390156.1390212.

[4] A. R. Klivans, R. A. Servedio, Learning intersections of halfspaces with a margin, Journal of Computer and System Sciences 74 (1) (2008) 35 – 48, learning Theory 2004. doi:https://doi.org/10.1016/j. jcss.2007.04.012.

[5] R. Beigel, N. Reingold, D. Spielman, Pp is closed under intersection, Journal of Computer and System Sciences 50 (2) (1995) 191 – 202. doi:https://doi.org/10.1006/jcss.1995.1017.

Figure 4: The decision contours of a class (in black) of the kernelized algorithm after T = 10steps.

Figure 5: The decision contours of a class (in black) of the kernelized algorithm after T = 10steps.

[6] G. Chen, G. Chen, J. Zhang, S. Chen, C. Zhang, Beyond banditron: A conservative and efficient reduction for online multiclass prediction with bandit setting model, in: ICDM 2009, The Ninth IEEE International Conference on Data Mining, Miami, Florida, USA, 6-9 December 2009, 2009, pp. 71–80. doi:10.1109/ ICDM.2009.36.

[7] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, New York, NY, USA, 2004.

[8] S. Shalev-Shwartz, O. Shamir, K. Sridharan, Learning kernel-based halfspaces with the 0-1 loss, SIAM J. Comput. 40 (6) (2011) 1623–1646. doi:10.1137/100806126. URL https://doi.org/10.1137/100806126

[9] J. Mason, D. Handscomb, Chebyshev Polynomials, CRC Press, 2002.