My stuff
Boosting the kernelized shapelets: Theory and algorithms for local features

We consider binary classification problems using local features of objects. One of motivating applications is time-series classification, where features reflecting some local closeness measure between a time series and a pattern sequence called shapelet are useful. Despite the empirical success of such approaches using local features, the generalization ability of resulting hypotheses is not fully understood and previous work relies on a bunch of heuristics. In this paper, we formulate a class of hypotheses using local features, where the richness of features is controlled by kernels. We derive generalization bounds of sparse ensembles over the class which is exponentially better than a standard analysis in terms of the number of possible local features. The resulting optimization problem is well suited to the boosting approach and the weak learning problem is formulated as a DC program, for which practical algorithms exist. In preliminary experiments on time-series data sets, our method achieves competitive accuracy with the state-of-the-art algorithms with small parameter-tuning cost.

Classifying objects using their “local” patterns is often effective in various applications. For example, in time-series classification problems, a local feature called shapelet is shown to be quite powerful in the data mining literature (Ye and Keogh, 2009; Keogh and Rakthanmanon, 2013; Hills et al., 2014; Grabocka et al., 2014). More precisely, a shapelet  z = (z1, . . . , zℓ) isa real-valued “short” sequence in  Rℓ for some ℓ >1. Given a time-series  x = (x1, . . . , xL), atypical measure of closeness between the time-series x and the shapelet  z is minQj=1 ∥xj:j+ℓ−1−z∥2, where Q = L−ℓ+1 and xj:j+ℓ−1 = (xj, . . . , xj+ℓ−1). Here, the measure focuses on “local” closeness between the time-series and the shapelet. In many time-series classification problems, sparse combinations of features based on the closeness to some shapelets are useful (Grabocka et al., 2015; Renard et al., 2015; Hou et al., 2016). Similar situations could happen in other applications. Say, for image classification problems, template matching is a well-known technique to measure similarity between an image and “(small) template image” in a local sense.

Despite the empirical success of applying local features, theoretical guarantees of such approaches are not fully investigated. In particular, trade-offs of the richness of such local features and the generalization ability are not characterized yet.

In this paper, we formalize a class of hypotheses based on some local closeness. Here, the richness of the class is controlled by associated kernels. We show generalization bounds of ensembles of such local classifiers. Our bounds are exponentially tighter in terms of some parameter than typical bounds obtained by a standard analysis.

Further, for learning ensembles of the kernelized hypotheses, our theoretical analysis suggests a 1-norm constrained optimization problem with infinitely many parameters, for which the dual problem is categorized as a semi-infinite program (see Shapiro, 2009). To obtain approximate solutions of the problem efficiently, we take the approach of boosting (Schapire et al., 1998). In particular, we employ LPBoost (Demiriz et al., 2002), which solve 1-norm constrained soft margin optimization via a column generation approach. As a result, our approach has two stages, where the master problem is a linear program and the sub-problems are difference of convex programs (DC programs), which are non-convex. While it is difficult to solve the sub-problems exactly due to non-convexity, various techniques are investigated for DC programs and we can find good approximate solutions effi-ciently for many cases in practice.

In preliminary experiments on time-series data sets, our method achieves competitive accuracy with the state-of-the-art algorithms using shapelets. While the previous algorithms need careful parameter tuning and heuristics, our method uses less parameter tuning and parameters can be determined in an organized way. In addition, our solutions tend to be sparse and could be useful for domain experts to select good local features.

1.1 Related work

The concept of time-series shapelets was first introduced by Ye and Keogh (2009). The algorithm finds shapelets by using the information gains of potential candidates associated with all the subsequences of the given time series and constructs a decision tree. Shapelet transform (Hills et al., 2014) is a technique combining with shapelets and machine learning. The authors consider the time-series examples as feature vectors defined by the set of local closeness to some shapelets and in order to obtain classification rule, they employed some effective learning algorithms such as linear SVM or random forests. Note that shapelet transform completely separate the phase searching for shapelets from the phases of creating classifi-cation rules. Afterward, many algorithms have been proposed to search the good shapelets efficiently keeping high prediction accuracy in practice (Keogh and Rakthanmanon, 2013; Grabocka et al., 2015; Renard et al., 2015; Karlsson et al., 2016). The algorithms are based on the idea that discriminative shapelets are contained in the training data. This approach, however, might overfit without a regularization. Learning Time-Series Shapelets (LTS) algorithm (Grabocka et al., 2014) is a different approach from such subsequence-based algorithms. LTS approximately solves an optimization problem of learning the best shapelets directly without searching subsequences in a brute-force way. In contrast to the above subsequence-based methods, LTS finds nearly optimal shapelets and achieves higher prediction accuracy than the other existing methods in practice. However, there is no theoretical guarantee of its generalization error.

There are previous results using local features based on kernels (e.g., Odone et al., 2005; Harris; Shimodaira et al., 2001; Zhang et al., 2010). However, their approaches focus on the closeness (similarity) between examples (e.g., subsequence a of example A and subsequence b of example B), not known to capture local similarity.

Let  P ⊆ Rℓ be a set, in which an element is called a pattern. Our instance space X is a set of sequences of patterns in P. For simplicity, we assume that every sequence in X is of the same length. That is,  X ⊆ PQ for some integer Q. We denote by  x = (x(1), . . . , x(Q))an instance sequence in X, where every  x(j) is a pattern in P. The learner receives a labeled sample  S = (((x(1)1 , . . . , x(Q)1 ), y1), . . . , ((x(1)m , . . . , x(Q)m ), ym)) ∈ (X × {−1, 1})m ofsize m, where each labeled instance is independently drawn according to some unknown distribution  D over X × {−1, +1}.

Let K be a kernel over P, which is used to measure the similarity between patterns, and let Φ :  P → Hdenote a feature map associated with the kernel K for a Hilbert space H. That is, K(z, z′) = ⟨Φ(z), Φ(z′)⟩for patterns  z, z′ ∈ P, where ⟨·, ·⟩denotes the inner product over H. The norm induced by the inner product is denoted by  ∥ · ∥Hand satisfies ∥u∥H =�⟨u, u⟩ for u ∈ H.

For each  u ∈ H, we define the base classifier (or the feature), denoted by  hu, as thefunction that maps a given sequence  x = (x(1), . . . , x(Q)) ∈ Xto the maximum of the similarity scores between  u and x(j) over all patterns  x(j) in x. More specifically,


where [Q] denotes the set  {1, 2, . . . , Q}. For a set U ⊆ H, we define the class of base classifiers as


and we denote by conv(HU) the set of convex combinations of base classifiers in  HU. Moreprecisely,


The goal of the learner is to find a final hypothesis  g ∈ conv(HU), so that its generalization error  ED(g) = Pr(x,y)∼D[sign(g(x)) ̸= y] is small.

Example: Learning with time-series shapelets For a typical setting of learning with time-series shapelets, an instance is a sequence of real numbers  x = (x1, x2, . . . , xL) ∈ RLand a base classifier  hs, which is associated with a shapelet (i.e., a “short” sequence of real numbers)  s = (s1, s2, . . . , sℓ) ∈ Rℓ, is defined as


where  x(j) = (xj, xj+1, . . . , xj+ℓ) is the subsequence of  x of length ℓthat begins with jth index1. In our framework, this corresponds to the case where  Q = L − ℓ + 1, P ⊆ Rℓ, andU = {Φ(s) | s ∈ P}.

In this section, we give generalization bounds of the hypothesis classes conv(HU) for various U and K. To derive the bounds, we use the Rademacher and the Gaussian complexity (Bartlett and Mendelson, 2003).

Definition 1 (The Rademacher and the Gaussian complexity, Bartlett and Mendelson, 2003) Given a sample  S = (x1, . . . , xm) ∈ X m, the empirical Rademacher complexity R(H) of a class  H ⊂ {h : X → R} w.r.t. Sis defined as  RS(H) = 1m Eσ [suph∈H�mi=1 σih(xi)],where  σ ∈ {−1, 1}m and each σiis an independent uniform random variable in  {−1, 1}.The empirical Gaussian complexity  GS(H) of H w.r.t. Sis defined similarly but each  σi isdrawn independently from the standard normal distribution.

The following bounds are well-known.


Lemma 2 (Mohri et al., 2012, Corollary 6.1) For fixed  ρ, δ > 0, the following bound holds with probability at least 1  − δ: for all f ∈ conv(H),


where  ES,ρ(f)is the empirical margin loss of f over S, i.e., the ratio of examples for which f has margin  yif(xi) < ρ.

To derive generalization bounds based on the Rademacher or the Gaussian complexity is quite standard in the statistical learning theory literature and applicable to our classes of interest as well. However, a standard analysis provides us sub-optimal bounds.

For example, let us consider the simple case where the class  HUof base classifiers is defined by the linear kernel with U to be the set of vectors in  Rℓ of bounded norm. In this case,  HUcan be viewed as  HU = {max{h1, . . . , hQ} | h1 ∈ H1, . . . , hQ ∈ HQ}, whereHj = {h : x �→ ⟨u, x(j)⟩ | u ∈ U} for every j ∈ [Q].Then, by a standard analysis (see, e.g., Mohri et al., 2012, Lemma 8.1), we have  R(HU) ≤ �Qj=1 R(Hj) = O� Q√m�.However, this bound is weak since it is linear in Q. In the following, we will give an improved bound of ˜O(log Q/√m). The key observation is that if base classes  H1, . . . , HQare “correlated” somehow, one could obtain better Rademacher bounds. In fact, we will exploit some geometric properties among these base classes.

3.1 Main theorem

First, we show our main theoretical result on the generalization bound of conv(HU).

Given a sample  S, let PSbe the set of patterns appearing in S and let Φ(PS) = {Φ(z) |z ∈ PS}. Let Φdiff(PS) = {Φ(z) − Φ(z′) | z, z′ ∈ PS, z ̸= z′}. By viewing each instance v ∈ Φdiff(PS) as a hyperplane  {u | ⟨v, u⟩ = 0}, we can naturally define a partition of the Hilbert space H by the set of all hyperplanes  v ∈ Φdiff(PS). Let Ibe the set of all cells of the partition. Each cell  I ∈ Iis a polyhedron which is defined by a minimal set  VI ⊆ Φdiff(PS)that satisfies  I = �v∈VI{u | ⟨u, v⟩ ≥ 0}. Let


Let  d∗Φ,S be the VC dimension of the set of linear classifiers over the finite set Φdiff(PS),given by  FU = {f : v �→ sign(⟨u, v⟩) | u ∈ U}. Then, our main result is stated as follows:

Theorem 1 Suppose that for any  z ∈ P, ∥Φ(z)∥H ≤ R. Then, for any  ρ > 0 and δ(0 <δ < 1), with probability at least 1  − δ, the following holds for any  g ∈ conv(HU) withU ⊆ {u ∈ H | ∥u∥H ≤ Λ}:


In particular, (i) if Φ is the identity mapping (i.e., the associated kernel is the linear kernel), or (ii) if Φ satisfies that  ⟨Φ(z), Φ(x)⟩is monotone decreasing with respect to  ∥z − x∥2 (e.g.,the mapping defined by the Gaussian kernel) and  U = {Φ(s) | s ∈ Rℓ, ∥Φ(s)∥H ≤ Λ}, thend∗Φ,S can be replaced by  ℓ. (iii) Otherwise,  d∗Φ,S ≤ RΛ/µ∗.

In order to prove the theorem 1, we show several definitions, lemmas and the proofs as following subsection.

3.2 Proof sketch

Definition 2 The set Θof mappings from an instance to a pattern For any  u ∈ U, letθu : [m] → [Q]be a mapping defined by


Lemma 3 Suppose that for any  z ∈ P, ∥Φ(z)∥H ≤ R.Then, the empirical Gaussian complexity of  HUwith respect to  S for U ⊆ {u | ∥u∥H ≤ Λ}is bounded as follows:


The proof is given in the supplemental material.

Thus, it suffices to bound the size  |Θ|. Naively the size  |Θ|is at most  Qm since there are  Qm possible mappings from [m] to [Q]. However, this naive bound is too pessimistic. The basic idea to get better bounds is the following. Fix any  i ∈ [m] and consider points Φ(x1i ), . . . , Φ(xQi ). Then, we define equivalence classes of  u such that θu(i) is the same, which define a Voronoi diagram for the points Φ(x1i ), . . . , Φ(xQi ). Note here that the closeness is measured by the inner product, not a distance. More precisely, let  Vi = (V (1)i , . . . , V (Q)i )be the Voronoi diagram defined as  V (j)i = {u ∈ H | θu(i) = j}. Let us consider the set of intersections  ∩i∈[m]V (ji)ifor all combinations of (j1, . . . , jm) ∈ [Q]m. The key observation is that each non-empty intersection corresponds to a mapping  θ ∈Θ. Thus, we obtain |Θ| = (the number of intersections ∩i∈[m]V (ji)i). In other words, the size of Θ is exactly the number of rooms defined by the intersections of m Voronoi diagrams  V1, . . . , Vm. From nowon, we will derive upper bounds based on this observation.


The proof is shown in the supplemental material.


(ii) if Φ satisfies that  ⟨Φ(z), Φ(x)⟩is monotone decreasing with respect to  ∥z − x∥2 (e.g.,the mapping defined by the Gaussian kernel) and  U = {Φ(s) | s ∈ Rℓ, ∥Φ(s)∥H ≤ Λ},then  |Θ| = O((mQ)2ℓ).


The proof is shown in the supplemental material.

Now we are ready to prove Theorem 1. Proof [Proof of Theorem 1] By using Lemma 1, and 2, we obtain the generalization bound in terms of the Gaussian complexity of H. Then, by applying Lemma 3 and Theorem 2, we complete the proof.

In this section, we formulate an optimization problem to learn ensembles in conv(HU) forU ⊆ {u : ∥u∥H ≤ Λ}.The problem is a 1-norm constrained soft margin optimization problem using hypotheses in  HU, which is a linear program.


where  hu ∈ HUis a given base classifiers,  ν ∈ [0,1] is a constant parameter,  ρ is a targetmargin and ξiis a slack variable. The dual problem (2) is given as follows.




The dual problem is categorized as a semi-infinite program (SIP), since it contains possibly infinitely many constraints. Note that the duality gap is zero since the problem (3) is convex and the optimum is finite (Shapiro, 2009, Theorem 2.2). Our approach is to approximately solve the primal and the dual problems for  U = {u : ∥u∥H ≤ Λ}by solving the sub-problems over a finite subset  U ′ ⊂ U. Such approach is called column generation in linear programming, which add a new constraint (column) to the dual problems and solve them iteratively. LPBoost (Demiriz et al., 2002) is a well known example of the approach in the boosting setting. At each iteration, LPBoost chooses a hypothesis  hu so that hu maximallyviolates the constraints in the current dual problem. This sub-problem is called weak learning in the boosting setting and formulated as follows:


However, the problem (4) cannot be solved directly, since we have only access to U though the associated kernel. Fortunately, the optimal solution  u∗ of the problem can be written as a linear combination of the functions  K(xi, ·) because of the following representer theorem.

Theorem 3 (Representer Theorem) The optimal solution  u∗ of optimization problem (4) has the form of  u∗ = �mi=1�Qj=1 αijK(x(j)i , ·).

The proof is shown in the supplemental material.

By Theorem 3, we can design a weak learner by solving the following equivalent problem:


The optimization problem (5) is difference of convex functions (DC) programming problem and we can obtain local optimum  ǫ-approximately by using DC Algorithm (Tao and Souad, 1988). For the above optimization problem, we replace maxj∈[Q]�mi=1�Qk=1 αikK�x(k)i , x(j)q �

with  λq, then we get the equivalent optimization problem as below.


We show the pseudo code of LPBoost using column generation algorithm, and our weak learning algorithm using DC programming in Algorithm 1 and 2, respectively. In Algorithm 2, the optimization problem (7) can be solved by standard QP solver. It is interesting to note that, if we restrict  αijto unit vector, we can get the optimal  αanalytically in each weak learning step, and the final hypothesis obtained by LPBoost totally behaves like time-series shapelets.

In the following experiments, we show that our methods are practically effective for time-series classification problem as one of the applications.

6.1 Classification accuracy for UCR datasets

We use UCR datasets (Chen et al., 2015), that are often used as benchmark datasets for time-series classification methods. For simplicity, we use several binary classification datasets (of course, our method is applicable to multi-class). The detailed information of the datasets is described in the left side of Table 1. In our experiments, we assume that patterns are subsequence of a time series of length  ℓ.

We set the hyper-parameters as following: The length  ℓof pattern was searched in {0.1, 0.2, 0.3, 0.4}×L, where Lis the length of each time-series dataset, and  ν = {0.2, 0.15, 0.1}.


We found good  ℓ and νthrough a grid search via 5-fold cross validation. We use the Gaussian kernel  K(x, x′) = exp(−σ∥x−x′∥2). For each  ℓ, we choose  σ from {0.0001, 0.001, . . . , 10000},which maximizes the variance of the values of the kernel matrix. We set Λ = 1, and thenumber of maximum iterations of LPBoost is 100, and the number of maximum iterations of DC Programming is 10.

Unfortunately, the quadratical normalization constraints of the optimization problem (6) have highly computational costs. Thus, in practice, we replace the constraint with 1-norm regularization:  ∥α∥1 ≤Λ and solve the linear program. We expect to obtain sparse solutions while that makes the decision space of  αsmall. As a LP solver for the optimization problem of the weak learner and LPBoost, we used the CPLEX software.

The results of classification accuracy are shown in the right side of Table 1. We referred to the experimental result reported by Hou et al. (2016) with regard to the accuracies of the following three state-of-the-arts algorithms, LTS (Grabocka et al., 2014), IG-SVM (Hills et al., 2014), and FLAG (Hou et al., 2016). It is reported that the above algorithms are highly accurate in practice. Particularly, LTS is known as one of the most accurate algorithm in practice, however, it highly and delicately depends on the input hyper-parameters (see, e.g., Wistuba et al., 2015). Many of other existing shapelet-based methods including them have difficulty of adjusting hyper-parameters. However, as shown in Table 1, our algorithm had competitive performance for used datasets with a little effort to parameter search.

Table 1: The detailed information of used datasets and Classification accuracies (%).


6.2 Sparsity and visualization analysis

It is said that shapelets-based hypotheses have great visibility (see, e.g., Ye and Keogh, 2009). Now, we show that the solution obtained by our method has high sparsity and it induces visibility, despite our final hypothesis contains (non-linear) kernels. Let us explain the sparsity using Gun-point dataset as an example. Now, we focus on the final hypothesis:  g = �Tt=1 wtht, where ht = maxj�mk=1�Ql=1 αt,klK(x(l)k , ·(j)). The number of final hypotheses  hts such that  wt ̸= 0 is 5, and the number of non-zero elements αt,kl of suchhts is 26 out of 34000 (0.6%). Actually, for the other datasets, percentages of non-zero elements are from 0.08% to 6%. It is clear that the solution obtained by our method has high sparsity.

Figure 1 is an example of visualization of a final hypothesis obtained by our method for Gun-point time series data. The colored lines shows all of the  x(l)k s in gwhere both  wtand  αt,klare non-zero. Each value of the legends show the multiplication of  wt and αt,kl


Figure 1: Visualization of discriminative pattern for an example of Gun-point data (nega- tive label). Black line is original time series. Positive value (red to yellow line) indicates the contribution rate for positive label, and negative value (light blue to purple line) indicates the contribution rate for negative label.

corresponding to  x(l)k . Since it is hard to visualize the local features over the Hilbert space, we plotted each of them to match the original time series based on Euclidean distance. In contrast to visualization analyses by time-series shapelets (see, e.g., Ye and Keogh, 2009), our visualization, colored lines and plotted position, do not strictly represent the meaning of the final hypothesis because of the non-linear feature mapping. However, we can say that colored lines represent “discriminative pattern” and certainly make important contributions to classification. Thus, we believe that our solutions are useful for domain experts to interpret important patterns.

In this paper, we show generalization error bounds of the hypothesis classes based on local features. Further, we formulate an optimization problem, which fits in the boosting framework. We design an efficient algorithm for the weak learner using DC programming techniques. Experimental results show that our method achieved competitive accuracy with the existing methods with small parameter-tuning costs.

Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, March 2003. ISSN 1532-4435.

Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdul- lah Mueen, and Gustavo Batista. The ucr time series classification archive, July 2015.


A Demiriz, K P Bennett, and J Shawe-Taylor. Linear Programming Boosting via Column Generation. Machine Learning, 46(1-3):225–254, 2002.

Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Learning time-series shapelets. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 392–401, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2956-9.

Josif Grabocka, Martin Wistuba, and Lars Schmidt-Thieme. Scalable discovery of time- series shapelets. CoRR, abs/1503.03238, 2015.

Zellig Harris. Distributional structure. Word, 10(23):146–162, 1954.

Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. Classifi- cation of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4):851–881, July 2014. ISSN 1384-5810.

Lu Hou, James T. Kwok, and Jacek M. Zurada. Efficient learning of timeseries shapelets. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., pages 1209–1215, 2016.

Isak Karlsson, Panagiotis Papapetrou, and Henrik Bostr¨om. Generalized random shapelet forests. Data Min. Knowl. Discov., 30(5):1053–1085, September 2016. ISSN 1384-5810.

Eamonn J. Keogh and Thanawin Rakthanmanon. Fast shapelets: A scalable algorithm for discovering time series shapelets. In Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013. Austin, Texas, USA., pages 668–676, 2013.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

Francesca Odone, Annalisa Barla, and Alessandro Verri. Building kernels from binary strings for image matching. IEEE Trans. Image Processing, 14(2):169–180, 2005.

Xavier Renard, Maria Rifqi, Walid Erray, and Marcin Detyniecki. Random-shapelet: an algorithm for fast shapelet discovery. In 2015 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA’2015), pages 1–10, Paris, France, October 2015. IEEE.

Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wen Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26 (5):1651–1686, 1998.

B. Sch¨olkopf and AJ. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, December 2002.

Alexander Shapiro. Semi-infinite programming, duality, discretization and optimality con- ditions. Optimization, 58(2):133–161, 2009.

Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, and Shigeki Sagayama. Dynamic time- alignment kernel in support vector machine. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01,pages 921–928, Cambridge, MA, USA, 2001. MIT Press.

Pham Dinh Tao and El Bernoussi Souad. Duality in D.C. (Difference of Convex functions) Optimization. Subgradient Methods, pages 277–293. Birkh¨auser Basel, Basel, 1988. ISBN 978-3-0348-9297-1.

Martin Wistuba, Josif Grabocka, and Lars Schmidt-Thieme. Ultra-fast shapelets for time series classification. CoRR, abs/1503.05018, 2015.

Lexiang Ye and Eamonn Keogh. Time series shapelets: A new primitive for data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 947–956, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.

D. Zhang, W. Zuo, D. Zhang, and H. Zhang. Time series classification using support vector machine with gaussian elastic metric kernel. In 2010 20th International Conference on Pattern Recognition, pages 29–32, Aug 2010. doi: 10.1109/ICPR.2010.16.

8.1 Proof of Lemma 3

Proof Since U can be partitioned into  ∪θ∈Θ{u ∈ U | θu = θ},

GS(HU) = 1m Eσ


The first inequality is derived from the relaxation of u, the second inequality is due to Cauchy-Schwarz inequality and the fact  ∥u∥H ≤Λ, and the last inequality is due to Jensen’s inequality. We denote by  K(θ) the kernel matrix such that  K(θ)ij = ⟨Φ(x(θ(i))i ), Φ(x(θ(j))j )⟩.

Then, we have


The first inequality is due to Jensen’s inequality, and the second inequality is due to the fact that the supremum is bounded by the sum. By using the symmetry property of  K(θ),we have �mi,j=1 σiσjK(θ)ij = σ⊤K(θ)σ, which is rewritten as


where  λ1 ≥ · · · ≥ λm ≥0 are the eigenvalues of  K(θ) and V = (v1, . . . , vm) is the orthonormal matrix such that  viis the eigenvector that corresponds to the eigenvalue  λi. By thereproductive property of Gaussian distribution,  V⊤σobeys the same Gaussian distribution as well. So,


Now we replace  σ by σ′ = √1 − cλkσ. Since dσ′ = √1 − cλkdσ, we have:� ∞


because it holds that 1(1−x) ≤ 1 + 2(√2 − 1)x in 0 ≤ x ≤ 12. Further, taking logarithm, dividing the both sides by c and applying ln(1 +  x) ≤ x, we get:


where the last inequality holds since  λ1 = ∥K∥2 ≤ m∥K∥max ≤ R2. By equation (8) and (11), we have:


8.2 Proof of Lemma 4

Proof We will reduce the problem of counting intersections of the Voronoi diagrams to that of counting possible labelings by hyperplanes for some set. Note that for each neighboring Voronoi regions, the border is a part of hyperplane since the closeness is defined in terms of the inner product. So, by simply extending each border to a hyperplane, we obtain intersections of halfspaces defined by the extended hyperplanes. Note that, the size of these intersections gives an upper bound of intersections of the Voronoi diagrams. More precisely, we draw hyperplanes for each pair of points in  PS = {Φ(x(j)i ) | i ∈ [m], j ∈ [Q]}so that each point on the hyperplane has the same inner product between two points. Note that for each pair  z, z′ ∈ PS, the normal vector of the hyperplane is given as  z − z′ (byfixing the sign arbitrary). So, the set of hyperplanes obtained by this procedure is exactly Φdiff(PS). The size of Φdiff(PS) is�mQ2�, which is at most  m2Q2. Now, we consider a “dual” space by viewing each hyperplane as a point and each point in U as a hyperplane. Note that points u (hyperplanes in the dual) in an intersection give the same labeling on the points in the dual domain. Therefore, the number of intersections in the original domain is the same as the number of the possible labelings on Φdiff(PS) by hyperplanes in U. By the classical Sauer’s Lemma and the VC dimension of hyperplanes (see, e.g., Theorem 5.5 in Sch¨olkopf and Smola (2002)), the size is at most  O((m2Q2)d∗Φ,S).

8.3 Proof of Theorem 2

Proof (i) In this case, the Hilbert space His contained in  Rℓ. Then, by the fact that VC dimension  d∗Φ,S is at most ℓand Lemma 4, the statement holds. (ii) f  ⟨Φ(z), Φx⟩is monotone decreasing for  ∥z − x∥, then the following holds:


Therefore, maxu:∥u∥H=1⟨u, Φ(x)⟩ = ∥Φ(x)∥H, where u = Φ(x)∥Φ(x)∥H .It indicates that the number of Voronoi cells made by  V (j)i = {z ∈ Rℓ | j = arg maxk∈[Q] z · x(k)i }corresponds to the ˆV (j)i = {Φ(z) ∈ H | j = arg maxk∈[Q]⟨Φ(z), Φ(x(k)i )⟩}. Then, by following the same argument for the linear kernel case, we get the same statement.

(iii) We follow the argument in Lemma 4. For the set of classifiers  F = {f : Φdiff(PS) →{−1, 1} | f = sign(⟨u, z⟩), ∥u|H, minzinΦdiff(PS) |⟨u, z⟩| = µ}, its VC dimension is known to be at most  RΛ/µ for Φdiff(PS) ⊆ {z | ∥z∥H ≤ R}(see, e.g., Sch¨olkopf and Smola (2002)). By the definition of  µ∗, for each intersections given by hyperplanes, there always exists a point u whose inner product between each hyperplane is at least  µ∗. Therefore, the size of the intersections is bounded by the number of possible labelings in the dual space by U ′′ = {u ∈ H, ∥u∥H ≤ Λ, minzinΦdiff(PS) |⟨u, z⟩| = µ∗}}. Thus we obtain that  d∗Φ,S is at mostRΛ/µ∗ and by Lemma 4, we complete the proof.

8.4 Proof of Theorem 3

Proof We can rewrite the optimization problem (4) by using  θ ∈Θ defined in Subsection 3.2 as follows:


Thus, if we fix  θ ∈Θ, we have a sub-problem. Since the constraint  θ = θucan be written as linear constraints, each sub-problem is equivalent to a convex optimization. Indeed, each sub-problem can be written as the equivalent unconstrained minimization (by neglecting constants in the objective)


where  β and λi,j (i ∈ [m], j ∈ [Q]) are the corresponding positive constants. Now for each sub-problem, we can apply the standard Representer Theorem argument (see, e.g., Mohri et al. (2012)). Let  H1be the subspace  {u ∈ H | u = �mi=1�Qj=1 αijΦ(x(j)i ), αij ∈ R}.We denote  u1as the orthogonal projection of  u onto H1 and any u ∈ Hhas the decomposition  u = u1 + u⊥. Since u⊥ is orthogonal w.r.t.  H1, ∥u∥2H = ∥u1∥2H + ∥u⊥∥2H ≥ ∥u1∥2H. Onthe other hand,�u, Φ�x(j)i ��=�u1, Φ�x(j)i ��. Therefore, the optimal solution of each sub-problem has to be contained in  H1. This implies that the optimal solution, which is the maximum over all solutions of sub-problems, is contained in  H1 as well.

Designed for Accessibility and to further Open Science