Boosting the kernelized shapelets: Theory and algorithms for local features

2017·Arxiv

Abstract

Abstract

We consider binary classification problems using local features of objects. One of motivating applications is time-series classification, where features reflecting some local closeness measure between a time series and a pattern sequence called shapelet are useful. Despite the empirical success of such approaches using local features, the generalization ability of resulting hypotheses is not fully understood and previous work relies on a bunch of heuristics. In this paper, we formulate a class of hypotheses using local features, where the richness of features is controlled by kernels. We derive generalization bounds of sparse ensembles over the class which is exponentially better than a standard analysis in terms of the number of possible local features. The resulting optimization problem is well suited to the boosting approach and the weak learning problem is formulated as a DC program, for which practical algorithms exist. In preliminary experiments on time-series data sets, our method achieves competitive accuracy with the state-of-the-art algorithms with small parameter-tuning cost.

1. Introduction

Classifying objects using their “local” patterns is often effective in various applications. For example, in time-series classification problems, a local feature called shapelet is shown to be quite powerful in the data mining literature (Ye and Keogh, 2009; Keogh and Rakthanmanon, 2013; Hills et al., 2014; Grabocka et al., 2014). More precisely, a shapelet a real-valued “short” sequence in 1. Given a time-series typical measure of closeness between the time-series x and the shapelet ). Here, the measure focuses on “local” closeness between the time-series and the shapelet. In many time-series classification problems, sparse combinations of features based on the closeness to some shapelets are useful (Grabocka et al., 2015; Renard et al., 2015; Hou et al., 2016). Similar situations could happen in other applications. Say, for image classification problems, template matching is a well-known technique to measure similarity between an image and “(small) template image” in a local sense.

Despite the empirical success of applying local features, theoretical guarantees of such approaches are not fully investigated. In particular, trade-offs of the richness of such local features and the generalization ability are not characterized yet.

In this paper, we formalize a class of hypotheses based on some local closeness. Here, the richness of the class is controlled by associated kernels. We show generalization bounds of ensembles of such local classifiers. Our bounds are exponentially tighter in terms of some parameter than typical bounds obtained by a standard analysis.

Further, for learning ensembles of the kernelized hypotheses, our theoretical analysis suggests a 1-norm constrained optimization problem with infinitely many parameters, for which the dual problem is categorized as a semi-infinite program (see Shapiro, 2009). To obtain approximate solutions of the problem efficiently, we take the approach of boosting (Schapire et al., 1998). In particular, we employ LPBoost (Demiriz et al., 2002), which solve 1-norm constrained soft margin optimization via a column generation approach. As a result, our approach has two stages, where the master problem is a linear program and the sub-problems are difference of convex programs (DC programs), which are non-convex. While it is difficult to solve the sub-problems exactly due to non-convexity, various techniques are investigated for DC programs and we can find good approximate solutions effi-ciently for many cases in practice.

In preliminary experiments on time-series data sets, our method achieves competitive accuracy with the state-of-the-art algorithms using shapelets. While the previous algorithms need careful parameter tuning and heuristics, our method uses less parameter tuning and parameters can be determined in an organized way. In addition, our solutions tend to be sparse and could be useful for domain experts to select good local features.

1.1 Related work

The concept of time-series shapelets was first introduced by Ye and Keogh (2009). The algorithm finds shapelets by using the information gains of potential candidates associated with all the subsequences of the given time series and constructs a decision tree. Shapelet transform (Hills et al., 2014) is a technique combining with shapelets and machine learning. The authors consider the time-series examples as feature vectors defined by the set of local closeness to some shapelets and in order to obtain classification rule, they employed some effective learning algorithms such as linear SVM or random forests. Note that shapelet transform completely separate the phase searching for shapelets from the phases of creating classifi-cation rules. Afterward, many algorithms have been proposed to search the good shapelets efficiently keeping high prediction accuracy in practice (Keogh and Rakthanmanon, 2013; Grabocka et al., 2015; Renard et al., 2015; Karlsson et al., 2016). The algorithms are based on the idea that discriminative shapelets are contained in the training data. This approach, however, might overfit without a regularization. Learning Time-Series Shapelets (LTS) algorithm (Grabocka et al., 2014) is a different approach from such subsequence-based algorithms. LTS approximately solves an optimization problem of learning the best shapelets directly without searching subsequences in a brute-force way. In contrast to the above subsequence-based methods, LTS finds nearly optimal shapelets and achieves higher prediction accuracy than the other existing methods in practice. However, there is no theoretical guarantee of its generalization error.

There are previous results using local features based on kernels (e.g., Odone et al., 2005; Harris; Shimodaira et al., 2001; Zhang et al., 2010). However, their approaches focus on the closeness (similarity) between examples (e.g., subsequence a of example A and subsequence b of example B), not known to capture local similarity.

2. Preliminaries

Let be a set, in which an element is called a pattern. Our instance space X is a set of sequences of patterns in P. For simplicity, we assume that every sequence in X is of the same length. That is, for some integer Q. We denote by an instance sequence in X, where every is a pattern in P. The learner receives a labeled sample size m, where each labeled instance is independently drawn according to some unknown distribution

Let K be a kernel over P, which is used to measure the similarity between patterns, and let Φ : denote a feature map associated with the kernel K for a Hilbert space for patterns denotes the inner product over H. The norm induced by the inner product is denoted by and satisfies

For each , we define the base classifier (or the feature), denoted by function that maps a given sequence to the maximum of the similarity scores between over all patterns . More specifically,

where [Q] denotes the set , we define the class of base classifiers as

and we denote by conv() the set of convex combinations of base classifiers in precisely,

The goal of the learner is to find a final hypothesis ), so that its generalization error ] is small.

Example: Learning with time-series shapelets For a typical setting of learning with time-series shapelets, an instance is a sequence of real numbers and a base classifier , which is associated with a shapelet (i.e., a “short” sequence of real numbers) , is defined as

where ) is the subsequence of that begins with jth index. In our framework, this corresponds to the case where

3. Risk bounds of the hypothesis classes

In this section, we give generalization bounds of the hypothesis classes conv() for various U and K. To derive the bounds, we use the Rademacher and the Gaussian complexity (Bartlett and Mendelson, 2003).

Definition 1 (The Rademacher and the Gaussian complexity, Bartlett and Mendelson, 2003) Given a sample , the empirical Rademacher complexity R(H) of a class is defined as where is an independent uniform random variable in The empirical Gaussian complexity is defined similarly but each drawn independently from the standard normal distribution.

The following bounds are well-known.

Lemma 2 (Mohri et al., 2012, Corollary 6.1) For fixed , the following bound holds with probability at least 1

where is the empirical margin loss of f over S, i.e., the ratio of examples for which f has margin

To derive generalization bounds based on the Rademacher or the Gaussian complexity is quite standard in the statistical learning theory literature and applicable to our classes of interest as well. However, a standard analysis provides us sub-optimal bounds.

For example, let us consider the simple case where the class of base classifiers is defined by the linear kernel with U to be the set of vectors in of bounded norm. In this case, can be viewed as Then, by a standard analysis (see, e.g., Mohri et al., 2012, Lemma 8.1), we have However, this bound is weak since it is linear in Q. In the following, we will give an improved bound of ˜). The key observation is that if base classes are “correlated” somehow, one could obtain better Rademacher bounds. In fact, we will exploit some geometric properties among these base classes.

3.1 Main theorem

First, we show our main theoretical result on the generalization bound of conv(

Given a sample be the set of patterns appearing in S and let Φ(. By viewing each instance ) as a hyperplane , we can naturally define a partition of the Hilbert space H by the set of all hyperplanes be the set of all cells of the partition. Each cell is a polyhedron which is defined by a minimal set that satisfies

Let be the VC dimension of the set of linear classifiers over the finite set Φgiven by . Then, our main result is stated as follows:

Theorem 1 Suppose that for any . Then, for any , with probability at least 1 , the following holds for any

In particular, (i) if Φ is the identity mapping (i.e., the associated kernel is the linear kernel), or (ii) if Φ satisfies that is monotone decreasing with respect to the mapping defined by the Gaussian kernel) and can be replaced by . (iii) Otherwise,

In order to prove the theorem 1, we show several definitions, lemmas and the proofs as following subsection.

3.2 Proof sketch

of mappings from an instance to a pattern For any be a mapping defined by

Lemma 3 Suppose that for any Then, the empirical Gaussian complexity of with respect to is bounded as follows:

The proof is given in the supplemental material.

Thus, it suffices to bound the size . Naively the size is at most since there are possible mappings from [m] to [Q]. However, this naive bound is too pessimistic. The basic idea to get better bounds is the following. Fix any ] and consider points Φ(). Then, we define equivalence classes of ) is the same, which define a Voronoi diagram for the points Φ(). Note here that the closeness is measured by the inner product, not a distance. More precisely, let be the Voronoi diagram defined as . Let us consider the set of intersections for all combinations of (. The key observation is that each non-empty intersection corresponds to a mapping Θ. Thus, we obtain ). In other words, the size of Θ is exactly the number of rooms defined by the intersections of m Voronoi diagrams on, we will derive upper bounds based on this observation.

The proof is shown in the supplemental material.

(ii) if Φ satisfies that is monotone decreasing with respect to the mapping defined by the Gaussian kernel) and then

The proof is shown in the supplemental material.

Now we are ready to prove Theorem 1. Proof [Proof of Theorem 1] By using Lemma 1, and 2, we obtain the generalization bound in terms of the Gaussian complexity of H. Then, by applying Lemma 3 and Theorem 2, we complete the proof.

4. Optimization problem formulation

In this section, we formulate an optimization problem to learn ensembles in conv(The problem is a 1-norm constrained soft margin optimization problem using hypotheses in , which is a linear program.

where is a given base classifiers, 1] is a constant parameter, is a slack variable. The dual problem (2) is given as follows.

sub.to

The dual problem is categorized as a semi-infinite program (SIP), since it contains possibly infinitely many constraints. Note that the duality gap is zero since the problem (3) is convex and the optimum is finite (Shapiro, 2009, Theorem 2.2). Our approach is to approximately solve the primal and the dual problems for by solving the sub-problems over a finite subset . Such approach is called column generation in linear programming, which add a new constraint (column) to the dual problems and solve them iteratively. LPBoost (Demiriz et al., 2002) is a well known example of the approach in the boosting setting. At each iteration, LPBoost chooses a hypothesis violates the constraints in the current dual problem. This sub-problem is called weak learning in the boosting setting and formulated as follows:

However, the problem (4) cannot be solved directly, since we have only access to U though the associated kernel. Fortunately, the optimal solution of the problem can be written as a linear combination of the functions ) because of the following representer theorem.

Theorem 3 (Representer Theorem) The optimal solution of optimization problem (4) has the form of

The proof is shown in the supplemental material.

By Theorem 3, we can design a weak learner by solving the following equivalent problem:

5. Algorithm

The optimization problem (5) is difference of convex functions (DC) programming problem and we can obtain local optimum -approximately by using DC Algorithm (Tao and Souad, 1988). For the above optimization problem, we replace max

with , then we get the equivalent optimization problem as below.

We show the pseudo code of LPBoost using column generation algorithm, and our weak learning algorithm using DC programming in Algorithm 1 and 2, respectively. In Algorithm 2, the optimization problem (7) can be solved by standard QP solver. It is interesting to note that, if we restrict to unit vector, we can get the optimal analytically in each weak learning step, and the final hypothesis obtained by LPBoost totally behaves like time-series shapelets.

6. Experiments

In the following experiments, we show that our methods are practically effective for time-series classification problem as one of the applications.

6.1 Classification accuracy for UCR datasets

We use UCR datasets (Chen et al., 2015), that are often used as benchmark datasets for time-series classification methods. For simplicity, we use several binary classification datasets (of course, our method is applicable to multi-class). The detailed information of the datasets is described in the left side of Table 1. In our experiments, we assume that patterns are subsequence of a time series of length

We set the hyper-parameters as following: The length of pattern was searched in is the length of each time-series dataset, and

We found good through a grid search via 5-fold cross validation. We use the Gaussian kernel ). For each , we choose number of maximum iterations of LPBoost is 100, and the number of maximum iterations of DC Programming is 10.

Unfortunately, the quadratical normalization constraints of the optimization problem (6) have highly computational costs. Thus, in practice, we replace the constraint with 1-norm regularization: Λ and solve the linear program. We expect to obtain sparse solutions while that makes the decision space of small. As a LP solver for the optimization problem of the weak learner and LPBoost, we used the CPLEX software.

The results of classification accuracy are shown in the right side of Table 1. We referred to the experimental result reported by Hou et al. (2016) with regard to the accuracies of the following three state-of-the-arts algorithms, LTS (Grabocka et al., 2014), IG-SVM (Hills et al., 2014), and FLAG (Hou et al., 2016). It is reported that the above algorithms are highly accurate in practice. Particularly, LTS is known as one of the most accurate algorithm in practice, however, it highly and delicately depends on the input hyper-parameters (see, e.g., Wistuba et al., 2015). Many of other existing shapelet-based methods including them have difficulty of adjusting hyper-parameters. However, as shown in Table 1, our algorithm had competitive performance for used datasets with a little effort to parameter search.

Table 1: The detailed information of used datasets and Classification accuracies (%).

6.2 Sparsity and visualization analysis

It is said that shapelets-based hypotheses have great visibility (see, e.g., Ye and Keogh, 2009). Now, we show that the solution obtained by our method has high sparsity and it induces visibility, despite our final hypothesis contains (non-linear) kernels. Let us explain the sparsity using Gun-point dataset as an example. Now, we focus on the final hypothesis: ). The number of final hypotheses s such that s is 26 out of 34000 (0.6%). Actually, for the other datasets, percentages of non-zero elements are from 0.08% to 6%. It is clear that the solution obtained by our method has high sparsity.

Figure 1 is an example of visualization of a final hypothesis obtained by our method for Gun-point time series data. The colored lines shows all of the where both and are non-zero. Each value of the legends show the multiplication of

Figure 1: Visualization of discriminative pattern for an example of Gun-point data (nega- tive label). Black line is original time series. Positive value (red to yellow line) indicates the contribution rate for positive label, and negative value (light blue to purple line) indicates the contribution rate for negative label.

corresponding to . Since it is hard to visualize the local features over the Hilbert space, we plotted each of them to match the original time series based on Euclidean distance. In contrast to visualization analyses by time-series shapelets (see, e.g., Ye and Keogh, 2009), our visualization, colored lines and plotted position, do not strictly represent the meaning of the final hypothesis because of the non-linear feature mapping. However, we can say that colored lines represent “discriminative pattern” and certainly make important contributions to classification. Thus, we believe that our solutions are useful for domain experts to interpret important patterns.

7. Conclusion

In this paper, we show generalization error bounds of the hypothesis classes based on local features. Further, we formulate an optimization problem, which fits in the boosting framework. We design an efficient algorithm for the weak learner using DC programming techniques. Experimental results show that our method achieved competitive accuracy with the existing methods with small parameter-tuning costs.

References

Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, March 2003. ISSN 1532-4435.

Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdul- lah Mueen, and Gustavo Batista. The ucr time series classification archive, July 2015.

www.cs.ucr.edu/~eamonn/time_series_data/.

A Demiriz, K P Bennett, and J Shawe-Taylor. Linear Programming Boosting via Column Generation. Machine Learning, 46(1-3):225–254, 2002.

Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Learning time-series shapelets. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 392–401, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2956-9.

Josif Grabocka, Martin Wistuba, and Lars Schmidt-Thieme. Scalable discovery of time- series shapelets. CoRR, abs/1503.03238, 2015.

Zellig Harris. Distributional structure. Word, 10(23):146–162, 1954.

Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. Classifi- cation of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4):851–881, July 2014. ISSN 1384-5810.

Lu Hou, James T. Kwok, and Jacek M. Zurada. Efficient learning of timeseries shapelets. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., pages 1209–1215, 2016.

Isak Karlsson, Panagiotis Papapetrou, and Henrik Bostr¨om. Generalized random shapelet forests. Data Min. Knowl. Discov., 30(5):1053–1085, September 2016. ISSN 1384-5810.

Eamonn J. Keogh and Thanawin Rakthanmanon. Fast shapelets: A scalable algorithm for discovering time series shapelets. In Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013. Austin, Texas, USA., pages 668–676, 2013.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

Francesca Odone, Annalisa Barla, and Alessandro Verri. Building kernels from binary strings for image matching. IEEE Trans. Image Processing, 14(2):169–180, 2005.

Xavier Renard, Maria Rifqi, Walid Erray, and Marcin Detyniecki. Random-shapelet: an algorithm for fast shapelet discovery. In 2015 IEEE International Conference on Data , pages 1–10, Paris, France, October 2015. IEEE.

Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wen Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26 (5):1651–1686, 1998.

B. Sch¨olkopf and AJ. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, December 2002.

Alexander Shapiro. Semi-infinite programming, duality, discretization and optimality con- ditions. Optimization, 58(2):133–161, 2009.

Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, and Shigeki Sagayama. Dynamic time- alignment kernel in support vector machine. In Proceedings of the 14th International pages 921–928, Cambridge, MA, USA, 2001. MIT Press.

Pham Dinh Tao and El Bernoussi Souad. Duality in D.C. (Difference of Convex functions) Optimization. Subgradient Methods, pages 277–293. Birkh¨auser Basel, Basel, 1988. ISBN 978-3-0348-9297-1.

Martin Wistuba, Josif Grabocka, and Lars Schmidt-Thieme. Ultra-fast shapelets for time series classification. CoRR, abs/1503.05018, 2015.

Lexiang Ye and Eamonn Keogh. Time series shapelets: A new primitive for data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 947–956, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.

D. Zhang, W. Zuo, D. Zhang, and H. Zhang. Time series classification using support vector machine with gaussian elastic metric kernel. In 2010 20th International Conference on Pattern Recognition, pages 29–32, Aug 2010. doi: 10.1109/ICPR.2010.16.

8. Supplemental material

8.1 Proof of Lemma 3

Proof Since U can be partitioned into

The first inequality is derived from the relaxation of u, the second inequality is due to Cauchy-Schwarz inequality and the fact Λ, and the last inequality is due to Jensen’s inequality. We denote by the kernel matrix such that

Then, we have

The first inequality is due to Jensen’s inequality, and the second inequality is due to the fact that the supremum is bounded by the sum. By using the symmetry property of we have , which is rewritten as

where 0 are the eigenvalues of ) is the orthonormal matrix such that is the eigenvector that corresponds to the eigenvalue reproductive property of Gaussian distribution, obeys the same Gaussian distribution as well. So,

Now we replace

because it holds that . Further, taking logarithm, dividing the both sides by c and applying ln(1 +

where the last inequality holds since . By equation (8) and (11), we have:

8.2 Proof of Lemma 4

Proof We will reduce the problem of counting intersections of the Voronoi diagrams to that of counting possible labelings by hyperplanes for some set. Note that for each neighboring Voronoi regions, the border is a part of hyperplane since the closeness is defined in terms of the inner product. So, by simply extending each border to a hyperplane, we obtain intersections of halfspaces defined by the extended hyperplanes. Note that, the size of these intersections gives an upper bound of intersections of the Voronoi diagrams. More precisely, we draw hyperplanes for each pair of points in so that each point on the hyperplane has the same inner product between two points. Note that for each pair , the normal vector of the hyperplane is given as fixing the sign arbitrary). So, the set of hyperplanes obtained by this procedure is exactly Φ). The size of Φ, which is at most . Now, we consider a “dual” space by viewing each hyperplane as a point and each point in U as a hyperplane. Note that points u (hyperplanes in the dual) in an intersection give the same labeling on the points in the dual domain. Therefore, the number of intersections in the original domain is the same as the number of the possible labelings on Φ) by hyperplanes in U. By the classical Sauer’s Lemma and the VC dimension of hyperplanes (see, e.g., Theorem 5.5 in Sch¨olkopf and Smola (2002)), the size is at most

8.3 Proof of Theorem 2

Proof (i) In this case, the Hilbert space His contained in . Then, by the fact that VC dimension and Lemma 4, the statement holds. (ii) f is monotone decreasing for , then the following holds:

Therefore, maxIt indicates that the number of Voronoi cells made by corresponds to the ˆ. Then, by following the same argument for the linear kernel case, we get the same statement.

(iii) We follow the argument in Lemma 4. For the set of classifiers , its VC dimension is known to be at most (see, e.g., Sch¨olkopf and Smola (2002)). By the definition of , for each intersections given by hyperplanes, there always exists a point u whose inner product between each hyperplane is at least . Therefore, the size of the intersections is bounded by the number of possible labelings in the dual space by . Thus we obtain that and by Lemma 4, we complete the proof.

8.4 Proof of Theorem 3

Proof We can rewrite the optimization problem (4) by using Θ defined in Subsection 3.2 as follows:

Thus, if we fix Θ, we have a sub-problem. Since the constraint can be written as linear constraints, each sub-problem is equivalent to a convex optimization. Indeed, each sub-problem can be written as the equivalent unconstrained minimization (by neglecting constants in the objective)

where ]) are the corresponding positive constants. Now for each sub-problem, we can apply the standard Representer Theorem argument (see, e.g., Mohri et al. (2012)). Let be the subspace We denote as the orthogonal projection of has the decomposition is orthogonal w.r.t. the other hand,. Therefore, the optimal solution of each sub-problem has to be contained in . This implies that the optimal solution, which is the maximum over all solutions of sub-problems, is contained in

designed for accessibility and to further open science