In an online-learning paradigm, at each time step t, the learner receives a feature vector , makes a prediction ˆ
, and obtains a feedback. Note that the learner is playing against an adversary who picks the vector
and the correct class
from a set of K classes. In the standard full-information feedback setting, the feedback is the correct class
, while in the bandit feedback setting, the only feedback is a binary indicator specifying if the learner makes the correct prediction, i.e.,
]. The performance of the learner is measured by the total number of mistakes over all the steps.
Typically, the theoretical analysis is carried out under particular linear separability with margin assumptions. Beygelzimer, P´al, Sz¨or´enyi, Thiruvenkatachari, Wei, and Zhang [1] introduced two definitions of linear separability, called strong and weak linear separability. We give a brief summary here (see formal definitions in Section 2.1). For both definitions, there are K vectors defining K hyperplanes. The weak linear separable condition which is similar to standard multiclass linear separability defined in Crammer and Singer [2] ensures that examples from each class lie in the intersection of K halfspaces induced by these hyperplanes. The strong linear separable condition requires that each class is separated by a single hyperplane.
In the full-information feedback setting, Crammer and Singer [2] showed that if all examples are weakly linear separable with margin and have norm at most R, the Multiclass Perceptron algorithm makes at most
mistakes. This is tight (up to a constant) since any algorithms must make at least
For the bandit feedback setting [3], Beygelzimer et al. [1] presented an algorithm that make at most ) if the examples are strongly linear separable with margin
, paying the price of a factor of K for the bandit feedback setting. They also showed how to extend the algorithm to work with weakly linear separable case using the kernel approach. More specifically, they showed that the examples can be (non-linearly) transformed to higher dimensional space so that they are strongly linear separable with margin
(which depends only on
and K).
In this paper, we introduce a more refined linear separability condition. Intuitively, the set of weight vectors represents the “directions” of the examples. In this paper, we are interested in the cases where these directions collapse, i.e., while there are K classes of examples, the number of distinct weight vectors required to linearly separate them is less than K. This situation may arise from the fact that class structures contain inherent grouping where intra-group classes can be separated with a single weight vector (or direction). (See Fig. 1, for example.)
More specifically, we consider the case where the classes can be partitioned into L groups, where , such that (1) examples from any two classes in the same group are linearly separable with a margin with a single weight vector, and (2) examples from two classes under different groups are weakly linear separable with a margin. We refer to this condition as the group weakly linear separable condition.
We show that under this refined condition, the same kernel as in [1] can also be used so that the algorithm works in the space where there is (strong) margin that depends on L. Our proofs, as well as that of [1], use the ideas from Klivans and Servedio [4] (which is also based on Beigel et al. [5]).
We note that our key contribution is the mathematical analysis of the margin for group weakly linearly separable examples for the kernelized algorithm in Beygelzimer et al.. This means that everything in their paper works under this group condition (with a better margin bound that depends on L not K).
Section 2 gives definitions and problem settings. Our main result is in Section 3. In particular, Section 3.3 contains our technical theorem that establishes the margin under the transformed inner product space. We provide small examples in Section 4.
In this section, we review various definitions of linear separability and state a new group weakly linear separable condition, the focus of this work. We also provide a quick review of kernel methods and the Kernelized Bandit Algorithm algorithm used by Beygelzimer et. al. [1].
2.1 Linear separability
We restate the definitions for strong and weak linear separability by Beygelzimer et. al. [1] here. We use the common notation that [K] = {1, 2, . . . , K}.
The examples lie in an inner product space (). Let K be the number of classes and let
be a positive real number. Labeled examples
are if there exist vectors
such that for all
],
for , and
1.On the other hand, the labeled examples are
if there exist vectors
such that for all
],
for , and
1.
The strong linear separability also appears in Chen et al. [6]. The weak linear separable condition appears in Crammer and Singer [2].
We now define group weakly linear separability. Let be a partition of [K], i.e.,
] for all
for
, and
= [K]. Let g : [
] be a mapping function such that
iff
. We say that the labeled examples
are group weakly linear separable with margin under G if
1. there exist vectors such that
1, and, for all
],
2. there exist vectors such that
1, and, for all
] such that
and
) =
), either
Note that vectors ’s define inter-group hyperplanes, while each
defines intra-group boundaries. Also note that, to simplify our proofs, the “margin” between intra-group classes is 2
; this would create the +
and
gaps that already exist between groups.
To illustrate the idea, Fig. 1 shows 3 sets of examples.
Figure 1: Three set of examples in showing different linear separable conditions. Thick lines represent class boundaries. (a) Strongly linear separable examples with 3 classes (linearly separable in
). (b) Weakly linear separable examples with 3 classes. (c) Group weakly linear separable examples with 3 groups; group 1 (white) contains 3 classes, group 2 (black) contains 4 classes, and group 3 (gray) contains 1 class.
2.2 Kernel methods
We give an overview of the kernel methods (see [7] for expositions) and the rational kernel [8].
The kernel method is a standard approach to extend linear classification algorithms that use only inner products to handle the notions of “distance” between pairs of examples to nonlinear classification. A positive definite kernel (or kernel) is a function of the form for some set X such that the matrix [
is symmetric positive definite for any set of m examples
. It is known that for every kernel k, there exists some inner product space (
) and a feature map
such that
) =
. Therefore, a linear learning algorithm can essentially non-linearly map every example into V and work in V instead of the original space without explicitly working with
using k. This can be very helpful when the dimension of V is infinite.
As in Beygelzimer et al. [1], we use the rational kernel. Assume that examples are in . Denote by B(0, 1) a unit ball centered at 0 in
. The rational kernel k : B(0, 1)
1)
is defined as
Given ) can be computed in O(d) time.
Let :
be the classical real separable Hilbert space equipped with the standard inner product
=
. We can index the coordinates of
by d-tuples (
) of non-negative integers, the associated feature map
: B(0, 1)
to k is defined as
where=
is the multinomial coefficient. It can be verified that k is the kernel with its feature map
to
and for any
B(0, 1),
.
2.3 Multiclass Linear Classification
Beygelzimer et al. [1] presented a learning algorithm for the strongly linearly separable examples using K copies of the Binary Perceptron. They obtained a mistake bound of ) when the examples are from
with maximum norm R with margin
.
Their approach for dealing the weakly linear separable case is to use the kernel method. They introduced the Kernelized Bandit Algorithm (Algorithm 1) and proved the following theorem.
Theorem 1 (Theorem 4 from [1]). Let X be a non-empty set, let () be an inner product space. Let
be a feature map and let
, where
) =
, be the kernel. If (
are labeled examples such that
1. the mapped examples () are strongly linearly separable with margin
,
2. k(x, x
), k(x
, x
), . . . , k(x
, x
then the expected number of mistakes that the Kernelized Bandit Algorithm makes is at most (1)
.
The key theorem for establishing the mistake bound is the following margin transformation theorem based on the rational kernel.
Theorem 2 (Theorem 5 from [1]). (Margin transformation from [1]). Let (B(0, 1)
] be a sequence of labeled examples that is weakly linear separable with margin
0. Let
defined as in (1) let
where r = 2log
3)
+ 1 and
log
. Then the feature map
makes the sequence (
) strongly linearly separable with margin
= max
. Also for all
2.
This implies the following mistake bound.
Corollary 1 (Corollary 6 from [1]). (Mistake upper bound from [1]). The mistake bound made by Algorithm 1 when the examples are weakly linearly separable with margin is at most min(2
).
Beygelzimer et al. [1] gave two margin transformation proofs. In this paper, we only provide one margin transformation based on the Chebyshev polynomials (Theorem 7 from [1]).
2.4 Our contribution
We consider labeled examples with group weakly linearly separable with margin and show that in this case, the rational kernel also transforms the margin and the new margin depends on the number of groups L instead of the number of classes K. More specifically we prove the margin transformation in Theorem 3 and show the mistake bound of
in Corollary 2. This can be compared to one of the mistake bound of
in [1].
The proofs are fairly technical. We follow the idea in [1] and construct a “good” polynomial that separates examples from one class to the other (strong separation) based on the Chebyshev polynomials [9].
Our main technical result is the following margin transformation using the rational kernel.
Theorem 3. (Margin transformation). Let (B(0, 1)
] be a sequence of labeled examples that is group weakly linear separable with margin
0. Let L be number of group weakly separable such that
Let
defined as in (1) let
The feature map makes the sequence (
) strongly linearly separable with margin
.
We note that the margin depends on L, the number of groups, instead of K, the number of classes. Using Theorem 3 with Theorem 1 from [1] we obtain the following mistake bound for our algorithm.
Corollary 2. (Mistake bound for group weakly linearly separable case) Let K be positive integer, and
be positive real number. The mistake bound made by Algorithm 1 when the examples are group weakly linearly separable with margin
with L groups is at most
.
Note that multiplicative factor of K is hidden from the second bound of [1] because of the ˜O notation on the exponent. We cannot do that because in our exponent we have only log L which can be much smaller than K. Their actual bound (showing K), which can be compared to ours, is .
3.1 Intra-group boundaries
We first prove a structural property of intra-group classes. The following lemma shows that it is possible to separate one class from the rest in the same group using only lower and upper thresholds. This is independent of the number of classes in that group.
and (2) when ) = g(y) but
, either
3.2 Margin transformation
This section is devoted to the proof of Theorem 3. A key property of the space is that it “contains” all multivariate polynomials and the rational kernel k allows us to work in that space. More specifically, by (implicitly) transforming examples to
, we can use multivariate polynomials to separate examples from different classes, turning group weakly separability into strong linear separability in
. Therefore, to prove the margin transformation, as in [1], we have to (1) establish a separating polynomial and (2) prove the margin bound which depends on the degree and the norm of the polynomials (defined below).
Consider a d-variate polynomial of the form
where the sum ranges over a finite set of d-tuple () of non-negative integers and
’s are real coefficients. We denote the degree of p as deg(p). Following [4], the norm of a polynomial p is defined as
The following lemma from [1] expresses this intuition precisely.
Lemma 2 (from Lemma 9 in [1]). (Norm bound) Let be a multivariate polynomial. There exists
such that p(x) =
and
As discussed previously, to prove Theorem 3, we need to show the existence of multivariate polynomials that separate one class from the other. Consider class ] in group g(i). Its positive example x, when compared with examples from other group
), satisfies
implying that all examples in class i lie in
while all examples in other groups lie in
When comparing with other classes j in the same group g(i), from Lemma 1, we know that there exists thresholds and
that can be used to separate examples from group i, i.e., all its positive examples lie in
while examples from other classes in group g(i) lie in
while examples from other classes in group g(i) lie in
From Lemma 2, for class i, it is enough to establish a multivariate polynomial such that
This is shown in Theorem 4 below. This theorem is fairly technical and is proved in Section 3.3.
Theorem 4. (Polynomial approximation of intersection of halfspaces) Let such that
1. Let
such that
1 and
1. Let
such that
1. Let
(0, 1) and
B(0, 1). There exists a multivariate polynomial
such that
Proof of Theorem 3. Consider class ]. We will apply Theorem 4. For
, let
Also, let =
and
.
From Theorem 4, there exists a multivariate polynomial such that for all
] and the sequence (
), we have
• if , since
, and
• if , since
.
It is left to check the properties of p. Theorem 4 implies that
By Lemma 2, there exists such that
and
We are ready to construct strongly separable vectors for our group weakly separable case in such that
+
+
1 and for all
, and for all
, by scaling
appropriately as follows. We can let
and
then the theorem follows.
3.3 Separating polynomials
This section proves Theorem 4, i.e., we provide a polynomial that separates one class of examples from the others with degree and norm bounds. As in [1] and [4], we use the Chebyshev polynomials [9]
) defined as follows.
The following two lemmas are from [1].
Lemma 3 (from Lemma 15 in [1]). (Properties of Chebyshev polynomials) Chebyshev polynomials satisfy
1. ) = n for all
0.
2. If 1, the leading coefficient of
) is 2
.
3. (cos(
)) = cos(
) for all
and all
0.
4. (cosh(
)) = cosh(
) for all
and all
0.
5. 1 for all
1] and all
0.
6. 1 +
1) for all
1 and all
0.
7. (1 +
for all
0.
Lemma 4 (from Lemma 14 in [1]). (Properties of norm of polynomials)
1. Let be multivariate polynomials and let p(x) =
) be their product. Then,
.
2. Let q be a multivariate polynomial of degree at most s and let p(x) = (. Then,
.
3. Let be multivariate polynomials. Then,
Our proof follows the approach in [1].
Proof of Theorem 4. Let log
+ 4)
and
First, consider the case when
Note that for all
]. Since
1 and
1, we have
[0, 1]; thus, (
1]. Consider the terms involving
and
. Since
1, we have that
2 and
2. This implies that 1
2 and 1
2; hence, (
2))
1] and (
2))
1]. Therefore,
Now consider the case when
There are two subcases to consider.
Subcase 1: Suppose that for some . In this case, 1
1 +
and Lemma 3 (part 6)
Let ) =
be the expansion of s-th Chebyshev polynomial. We can bound the term
as follows.
and
since ) =
+ (1 +
) and
) =
+ (1
).
The terms and
can be analyzed similarly as
. We have that
and
Finally,
Figure 2: Group weakly separable dataset in .
Substitutions of r and s finish the proof.
While we focus mostly on the theoretical aspect of the problem, we performed some experiment to visualize the algorithm.
We generated a dataset in under the group weakly linear separable condition, with K = 9 classes and L = 3 groups with margin
= 0.005, shown in Fig 2.
We compared two versions of the bandit multiclass perceptron [1], the standard one and the kernelized one (using the rational kernel). Since the standard one only works with strongly separable case, it would definitely fail in this experiment, but we used it to give an overall sense of improvement for the kernelized version. We ran both algorithms for T = 10steps. For the kernelized version, we conducted 5 experiments, while the linear one we only ran once. Fig. 3 shows the result. The kernelized version made on average
Figure 3: Comparison of the standard algorithm and the kernelized algorithm with T = 10.
130, 884.6 mistakes (13.1%), while the standard one made 835, 848 mistakes (83.6%). Theoretically, the kernelized version should stop making mistakes at some point, but since the number of steps that we ran is too low, we can only see that increasing rate of the number of mistakes decreases over time.
To see the decision boundary, we ploted the contours of the corresponding polynomials for two classes shown in Fig. 4 and Fig. 5. Note that the class in Fig. 5 was much harder to learn as its boundary still overlapped with other classes (i.e., mistakes could still be made).
We thank Sanparith Marukatat for insightful comments and for pointing out our calculation errors. We also thank Thanawin Rakthanmanon for useful comments. Funding: Both authors are supported by the Thailand Research Fund, Grant RSA-6180074.
[1] A. Beygelzimer, D. Pal, B. Szorenyi, D. Thiruvenkatachari, C.-Y. Wei, C. Zhang, Bandit multiclass linear classification: Efficient algorithms for the separable case, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, Vol. 97 of Proceedings of Machine Learning Research, PMLR, Long Beach, California, USA, 2019, pp. 624–633. URL http://proceedings.mlr.press/v97/beygelzimer19a.html
[2] K. Crammer, Y. Singer, Ultraconservative online algorithms for multiclass problems, J. Mach. Learn. Res. 3 (2003) 951–991. doi:10.1162/jmlr.2003.3.4-5.951.
[3] S. M. Kakade, S. Shalev-Shwartz, A. Tewari, Efficient bandit algorithms for online multiclass prediction, in: Proceedings of the 25th International Conference on Machine Learning, ICML ’08, ACM, New York, NY, USA, 2008, pp. 440–447. doi:10.1145/1390156.1390212.
[4] A. R. Klivans, R. A. Servedio, Learning intersections of halfspaces with a margin, Journal of Computer and System Sciences 74 (1) (2008) 35 – 48, learning Theory 2004. doi:https://doi.org/10.1016/j. jcss.2007.04.012.
[5] R. Beigel, N. Reingold, D. Spielman, Pp is closed under intersection, Journal of Computer and System Sciences 50 (2) (1995) 191 – 202. doi:https://doi.org/10.1006/jcss.1995.1017.
Figure 4: The decision contours of a class (in black) of the kernelized algorithm after T = 10steps.
Figure 5: The decision contours of a class (in black) of the kernelized algorithm after T = 10steps.
[6] G. Chen, G. Chen, J. Zhang, S. Chen, C. Zhang, Beyond banditron: A conservative and efficient reduction for online multiclass prediction with bandit setting model, in: ICDM 2009, The Ninth IEEE International Conference on Data Mining, Miami, Florida, USA, 6-9 December 2009, 2009, pp. 71–80. doi:10.1109/ ICDM.2009.36.
[7] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, New York, NY, USA, 2004.
[8] S. Shalev-Shwartz, O. Shamir, K. Sridharan, Learning kernel-based halfspaces with the 0-1 loss, SIAM J. Comput. 40 (6) (2011) 1623–1646. doi:10.1137/100806126. URL https://doi.org/10.1137/100806126
[9] J. Mason, D. Handscomb, Chebyshev Polynomials, CRC Press, 2002.