On Recursive Edit Distance Kernels with Application to Time Series Classification

2010·arXiv

Abstract

1 INTRODUCTION

ELASTIC similarity measures such as Dynamic TimeWarping (DTW) or Edit Distances have proved to be quite efficient compared to non-elastic similarity measures such as Euclidean measures or LP norms when addressing tasks that require the matching of time series data, in particular time series clustering and classifica-tion. A wide scope of applications in different domains, such as physics, chemistry, finance, bio-informatics, network monitoring, etc, have demonstrated the benefits of using elastic measures. A natural question to ask at this point is whether we can develop and use elastic kernel methods based from such elastic distances. This requires to address the existence and construction of Reproducing Edit Distance Hilbert Spaces for a given elastic measure, i.e. a functional vector space endowed with the dot product in which edit distances or time warping algorithms can be defined. Unfortunately it seems that common elastic measures that are derived from DTW or more generally dynamic programming recursive algorithms are not directly induced by an inner product of any sort, even when such measures are metrics. One can conjecture that it is not possible to embed time series in an Hilbert space having a time-warp capability using these classical elastic measures, but nevertheless, we can propose (close) regularized variants for which such kernel construction construction is possible.

This paper aims at (re-)exploring this issue and, following earlier works ([1], [2], [3], [4], [5]) proposes Recursive Edit Distance Kernels (REDK) constructions that try to preserve the properties of elastic measures from which they are derived, while offering the possibility of embedding time series in Time Warped Hilbert Spaces. The main contributions of the paper are as follows

1) we verify the indefiniteness of the main time-warp measures used in the literature,

2) we propose a new method to construct positive definite kernels from classical time-warp or edit measures. This method is quite general and less restrictive than previous ones, although it requires to introduce a recursive regularizing term that allows to prove the definiteness of our kernels as a direct consequence of the Haussler’s convolution theorem,

3) we show that the regularizing approach we develop applies also when a symmetric corridor is used to limit the search space used to evaluate the elastic measure,

4) we experiment and compare the proposed kernels on some time series classification tasks using a large variety of time series data sets to estimate in practice the benefit we can expect from such kernels.

The paper is organized as follows: the second section synthesizes the related works; the third section introduces the notation and mathematical background that is used throughout the paper. The fourth section develops the construction of a general REDK and details some instantiations derived from classical elastic measures. The fifth section gathers classification experimentation on a wide range of time series data and compares REDK accuracies with classical elastic measures. The sixth section proposes a conclusion and further research perspectives. Appendix A states the indefiniteness of classical elastic measures, and appendix B presents the proof of our main results.

2 RELATED WORKS

During the last decades, the use of kernel-based methods in pattern analysis has provided numerous results and fruitful applications in various domains such as biology, statistics, networking, signal processing, etc. Some of these domains, such as bioinformatics, or more generally domains that rely on sequence or time series models, require the analysis and processing of variable length vectors, sequences or timestamped data. Various methods and algorithms have been developed to quantify the similarity of such objects. From the original dynamic programming [6] implementation of the symbolic edit distance [7] by Wagner and Fisher [8], the Smith and Waterman (SW) algorithm [9] has been designed to evaluate the similarity between two symbolic sequences by means of a local gap alignment. More efficient local heuristics have since been proposed to meet the massive symbolic data challenge, such as BLAST [10] or FASTA [11]. Similarly, dynamic time warping measures have been developed to evaluate similarity between digital time series or timestamped data [12], [13], and more recently [14], [15] propose elastic metrics dedicated to such digital data.

Our ability to construct kernels with elastic or time-warp properties from such elastic distances allowing to embed time series into vector spaces (Euclidean or not) has attracted attention (e.g. [2][16][17]). Indeed, signifi-cant benefits can be expected from applications of kernel-based machine learning algorithms to variable length data, or more generally data for which some elastic matching has a meaning. Among the kernel machine algorithms applicable to discrimination or regression tasks, Support Vector Machines (SVM) [18], [19], [20] are still reported to yield state-of-the art performances although their accuracy greatly depends on the exploited kernel.

The definition of good kernels from known elastic or time-warp distances applicable to data of variable lengths has been a major challenge since the 1990s. The notion of ’goodness’ has rapidly been associated to the concept of definiteness. Basically SVM algorithms involve an optimization process whose solution is proved to be uniquely defined if and only if the kernel is positive definite: in that case the objective function to optimize is quadratic and the optimization problem convex. Nevertheless, if the definiteness of kernels is an issue, in practice, many situations exist where definite kernels are not applicable. This seems to be the case for the main elastic measures traditionally used to estimate the similarity of objects of variable lengths. A pragmatic approach consists of using indefinite kernels [2][17][21], although contradictory results have been reported about the impact of definiteness or indefiniteness of kernels on the empirical performances of SVMs. The sub-optimality of the non-convex optimization process is possibly one of the causes leading to these un-guaranteed performances [22], [2]. Regulation procedures have been proposed to locally approximate indefinite kernel functions by definite ones with some benefits. Among others, some approaches apply direct spectral transformations to indefinite kernels. These methods [23] [24] consist in i) flipping the negative eigenvalues or shifting the eigenvalues using the minimal shift value required to make the spectrum of eigenvalues positive, and ii) reconstructing the kernel with the original eigenvectors in order to produce a positive semidefinite kernel. Yet, in general, ’convexification’ procedures are difficult to interpret geometrically and the expected effect of the original indefinite kernel may be lost. Some theoretical highlights have been provided through approaches that consist in embedding the data into a pseudo-Euclidean (pE) space and in formulating the classification problem with an indefinite kernel, such as minimizing the distance between convex hulls formed from the two categories of data embedded in the pE space [17]. The geometric interpretation results in a constructive method allowing for the understanding, and in some cases the prediction of the classification behavior of an indefinite kernel SVM in the corresponding pE space.

Some other works like [25], [26] address the construction of elastic kernels for time series analysis through a decomposition of time series as a sum of local low degree polynomials and, using a resampling process, the piecewise approximation of the time series are embedded into a proper so-called Reproducing Kernel Hilbert Space in which the SVM is learned.

Other approaches try to use directly the elastic distance into the kernel construction, without any approximation or resampling process.These works are based on the work of Haussler on convolution kernels [1] defined on a set of discrete structures such as strings, trees, or graphs. The iterative method that is developed is generative, as it allows for the building of complex kernels from the convolution of simple local kernels. Following the work of Haussler [1], Saigo et al [27] define, from the Smith and Waterman algorithm [9], a kernel to detect local alignment between strings by convolving simpler kernels. These authors show that the Smith and Waterman distance measure, dedicated to determining similar regions between two nucleotide or protein sequences, is not definite, but is nevertheless connected to the logarithm of a point-wise limit of a series of definite convolution kernels. Cuturi et al. [5] have adapted this approach to times series alignments covering the DTW elastic distance. In fact, these previous studies have very general implications, the first one being that classical elastic measures can also be understood as the limit of a series of definite convolution kernels.

In this paper we tackle a new approach to the positive definiteness regularization of elastic kernels that follows directly the lines of the Haussler’s theorem on convolution kernels. The condition to obtain positive definiteness is weaker than the one proposed in previous works [5], although it requires the addition of an explicit recursive regularizing term. This term is easy to evaluate and has the same complexity than the recursive equation that defines the elastic distance. Our regularization strategy is thus quite general, independent from the data (although a parameter can be optimized), and can be applied to a very large family of editing or dynamic time warping like distances. It applies also to some of their variants which exploit a symmetric corridor to speed-up the computation.

3 NOTATIONS AND MATHEMATICAL BACKGROUNDS

We give in this section commonly used definitions, with few details, for metric, kernel and definiteness, sequence set, and classical elastic measures.

3.1 Kernel and definiteness

A very large literature exists on kernels, among which [28], [20] and [29] present a large synthesis of major results. We give hereinafter some basic definitions and some results derived from these previous references.

Definition 3.1: A kernel on a non empty set U refers to a complex (or real) valued symmetric function (or R).

Definition 3.2: Let U be a non empty set. A function is called a positive (resp. negative) definite kernel if and only if it is Hermitian (i.e. where the overline stands for the conjugate number) for all x and y in U and (resp. ), for all n in and .

Definition 3.3: Let U be a non empty set. A function is called a conditionally positive (resp. conditionally negative) definite kernel if and only if it is Hermitian (i.e. for all x and y in U) and (resp. ), for all in N, and with .

In the last two above definitions, it is easy to show that it is sufficient to consider mutually different elements in U, i.e. collections of distinct elements . This is what we will consider for the remaining of the paper.

Definition 3.4: A positive (resp. negative) definite kernel defined on a finite set U is also called a positive (resp. negative) semidefinite matrix. Similarly, a positive (resp. negative) conditionally definite kernel defined on a finite set is also called a positive (resp. negative) conditionally semidefinite matrix.

For convenience sake, we will use p.d. and c.p.d. for positive definite and conditionally positive definite to characterize either a kernel or a matrix having these properties.

Constructing p.d. kernels from c.p.d. kernels is quite straightforward. For instance, if is a c.p.d. kernel on a set U and then, according to [28], is a p.d. kernel, so are and . The converse is also true.

Furthermore, is p.d. for t > 0 if is c.p.d. We will precisely use this last result to construct p.d. kernels from classical elastic distances.

3.2 Sequence set

Definition 3.5: Let U be the set of finite sequences (symbolic sequences or time series): . is a sequence with discrete index varying between 1 and p. We note the empty sequence (with null length) and by convention so that is a member of set U. |A| denotes the length of the sequence A. Let be the set of sequences whose length is shorter or equal to p.

Definition 3.6: Let A be a finite sequence. Let A(i) be the element (symbol or sample) of sequence A. We will consider that where S embeds the multidimensional space variables (either symbolic or numeric) and embeds the time stamp variable, so that we can write A(i) = (a(i), t(i)) where and , with the condition that t(i) > t(j) whenever i > j (time stamps strictly increase in the sequence of samples). with is the subsequence consisting of the through the element (inclusive) of A. So with i > j is the null time series, e.g. . Furthermore, if i > |A| then by convention, .

3.3 General Edit/Elastic distance on a sequence set

Definition 3.7: An edit operation is a pair written . The sequence B results from the application of the edit operation into sequence A, written via , if and for some sub-sequences and . We call a substitution operation if and , a delete operation if , an insert operation if .

For any pair of sequences , for which we con- sider the extensions whose first element is , and for each elementary edit operation related to position in sequence A and to position in sequence B is associated a cost value , or or . To simplify this notation, we will simply write or with the understanding that the suppression cost (respectively the insertion cost ) may depend on the current location j in sequence B (respectively on the location i in sequence A).

Hence we consider that is a function from to .

Definition 3.8: A function is called an edit distance defined on U if, for any pair of sequences , the following recursive equation is satisfied

Note that not all edit/elastic distances are metric. In particular, the dynamic time warping distance does not satisfy the triangle inequality.

3.3.1 Levenshtein distance

The Levenshtein distance has been defined for string matching. For this edit distance, the delete and insert operations induce unitary costs, i.e. while the substitution cost is null if A(p) = B(q) or 1 otherwise. Thus, for this distance, we consider that the functional term for all X and k, since the suppression and insertion costs do not depend on the suppressed or inserted element respectively.

3.3.2 Dynamic time warping

The DTW similarity measure [12][13] is defined according to the previous notations such as:

where and is the LP norm of vector in S, and so for DTW, . Thus, for this distance, for all X and k. Let us note that the time stamp values are not used, therefore the costs of each edit operation involve vectors a and b in S instead of vectors and in . Furthermore, does not comply with the triangle inequality as shown in [14].

3.3.3 Edit Distance with real penalty

with

where g is a constant in S. For this distance, for all X and k.

Note that the time stamp coordinate is not taken into account, therefore is a distance on S but not on . Thus, the cost of each edit operation involves vectors a and b in instead of vectors and in .

According to the authors of ERP [14], the constant g should be set to 0 for some intuitive geometric interpretation and in order to preserve the mean value of the transformed time series when adding gap samples.

3.3.4 Time warp edit distance

Time Warp Edit Distance (TWED) [30], [15] is defined similarly to the edit distance defined for string [7][8]. The similarity between any two time series A and B of finite length, respectively p and q is defined as:

Min

with

where is a positive constant that represents a gap penalty. The time stamps are exploited to evaluate d(A(p), B(q)), in practice, d(A(p), B(q)) = , where is a non negative constant which characterizes the stiffness of the elastic measure along the time axis. Furthermore, for this distance, we consider that the functional term for all X and k, since the suppression and insertion costs do not depend on the suppressed or inserted element respectively.

3.4 Indefiniteness of elastic distance kernels

In appendix A, we give counter examples, one for each previously defined elastic distance, showing that these distances do not lead to definite kernels. This demonstrates that the metric properties of a distance defined on U, in particular the triangle inequality, are not sufficient conditions to establish definiteness (conditionally or not) of the associated distance kernel. One could conjecture that elastic distances cannot be definite (conditionally or not), possibly because of the presence of the max or min operators in the recursive equation. In the following sections, we will see that replacing these min or max operators by a sum operator makes possible, under some conditions, for the construction of series of positive definite kernels whose limit is quite directly connected to the previously addressed elastic distance kernels.

4 CONSTRUCTING POSITIVE DEFINITE KERNELS FROM ELASTIC DISTANCE

The main idea leading to the construction of positive definite kernels from a given elastic distance defined on U is to replace the min or max operator into the recursive equation defining the elastic distance by a summation () operator. Instead of keeping only one of the best alignment paths, the new kernel will sum up the costs of all the existing sub-sequence alignment paths with some weighting factor that will favor good alignments while penalizing bad alignments. In addition, this weighting factor can be optimized. This principle has been applied successfully to the Smith and Waterman symbolic distance which is also known to be indefinite [27], and more recently to dynamic time warping kernels for time series alignment [5]. If, basically, we are following the same objective, the approach that we propose is quite different since it is based on a regularization principle which is simply and directly expressed into the recursive equation defining the elastic distance. First we replace the min (or max) operator by a summation operator and introduce a symmetric corridor function to optionally limit the summation. Then we add a regularizing recursive term such that the proof of the positive definiteness property can be understood as a direct consequence of the Haussler’s convolution theorem, as shown in appendix B. Basically, all is reduced to the proper inventory of pairs of symmetric alignment paths.

4.1 Recursive accumulation of f-cost products

Definition 4.1: A function is called a Recursive Accumulation of f-cost Products (RAfP) if, for any pair of sequences , there exist a function and three indicator functions and such that the following recursive equation is satisfied:

Where c is a non negative constant. This recursive definition relies on an initialization. To that end we set , where is a real non negative constant, typically , and the null sequence. Note that if no path satisfying the indicator functions exists when aligning time series A and B, then C(A, B) = 0.

Furthermore, according to the indicator functions , and , this type of construction sums up the multiplication of the local quantities that we call f-costs for all or for some of the possible alignment paths that exist between the two time series, the concept of alignment path being precisely defined in appendix B (see definition B.3).

In this paper we are specifically concerned with the following two symmetric RAfP families. The first one (Eq. 2), characterized by and , accumulates the f-cost products along all the alignment paths between the two time series (Eq.2) that exist inside the symmetric corridor defined by h(p, q). For the case depicted in Figure 1, h(p, q) = 1, whenever , h(p, q) = 0 otherwise. We denote this RAfP function .

The second RAfP function (Eq. 3) accumulates the f-cost products of all mappings characterized by simultaneous insertions, deletions and substitutions. Hence the substitution, deletion and insertion operations are simultaneously performed on the same index value. This function is thus characterized by , where h is a symmetric function, and , where is the Kronecker’s symbol (Eq.3). We denote this second RAfP function .

(Γ(Ω)) 1)

4.2 Recursive Edit Distance Kernel - Main result

Definition 4.2: We call Recursive Edit Distance Kernel (REDK) the function R.

The following theorem, that relates to the ones presented in [5] and [31], establishes sufficient conditions on for a REDK function to be definite positive, and thus is a basis for the construction of definite REDK. The proof of this theorem, that is given in Appendix B, is a direct consequence of the Hausslers R-convolution theorem [1]. The sufficient condition that is proposed here is quite general and less restrictive than those proposed by [5] and [31]. The avenue that is taken here only requires that the local kernel k(x, y) is PD. In previous works, the condition for positive definiteness is that the local kernel k(x, y) and the ratio k(x, y)/(k(x, y) + 1) are jointly PD. Our new condition is obviously weaker than the one proposed in [9]. Furthermore, our REDK construction applies to a wider range of edit or time warp distances developed either for symbolic sequence or time series matching.

Theorem 4.3: Definiteness of REDK:

REDK is a positive definite kernel on iff the local kernel is a positive definite kernel on .

A sketch of proof for theorem 4.3 is given in the appendix B.

As the cost function is, in general, conditionally negative definite, choosing for f(.) the exponential ensures that is a positive definite kernel [32].

Other functions can be used, such as the Inverse Multi Quadric kernel . As with the

exponential (Gaussian or Laplace) kernel, the Inverse Multi Quadric kernel results in a positive definite matrix with full rank [33] and thus forms an infinite dimension feature space. Note here that Cuturi et al. [5] state that to ensure the definiteness of the DTW-REDK kernel, not only , but also y))) need to be PD kernels, which forms a stronger condition than the one we propose.

4.2.1 Computational cost of REDK

Although the number of paths that are summed up exponentially increases with the lengths |A| and |B| of the time series that are evaluated, the recursive computation of any RAfP function and therefore of K(A, B) lead to a quadratic computational cost O(|A||B|), e.g. if N is the average length of the time series that are considered. This quadratic complexity can be reduced to a linear complexity by limiting the number of alignment paths to consider in the recursion. This can be achieved when using a search corridor [13] as far as the kernel remains symmetric, which is the case when processing time series of equal lengths and restraining the search space using, for instance, a fixed size corridor symmetrically displayed around the diagonal as shown in Fig. 1. Note that any kind of corridor can be specified using the indicator functions and .

4.3 Exponentiated REDK

Definition 4.4: The following equations define the two RAfP function instances entering into the construction of the exponentiated REDK, considering

Fig. 1. Example of alignment paths existing when matching time series A and B. In red dotted line a path whose editing f-cost product is accumulated by and , in blue plain line, a path whose editing cost product is accumulated by but not . A symmetric corridor of any form can be used to reduce the number of admissible alignment paths that are inventoried by a RAfP

where is a stiffness parameter that weights the contribution of the local elementary costs. The larger

is, the more the kernel is selective around the optimal paths. At the limit, when , only the optimal path costs are summed up by the kernel. Note that, as is generally seen, several optimal paths leading to the same global cost exist, therefore does not coincide with the elastic distance that involves the same corresponding elementary costs, although we expect it to be close.

The constant 1/c (typically 1/3) is used to maintain the global accumulation functions and in a range that is computationally acceptable.

The alignment cost of two null time series being 0, we set in this exponentiation context.

Proposition 4.5: Definiteness of the exponentiated REDK instance, :

is positive definite for the cost functions and involved in the computation of the and distances.

The proof of proposition 4.5 is straightforward and is omitted.

The local cost function involved in the computation of the does not respect the condition of theorem 4.3 and thus the local kernel is not guaranteed yet to be p.d. for this distance.

The REDKs constructed from these distances are referred respectively to , in the rest of the paper.

4.3.1 Interpretation of the exponentiated REDK

In a REDK kernel, each alignment path is assigned with a product of f-costs that is the multiplication of the local f-cost functions attached to each edge of the path. For exponentiated REDK, the local f-cost function, e.g. can be interpreted, up to a normalizing constant, as a probability to align symbol A(p) with symbol B(q), and the value affected to each path can be interpreted as the probability of a specific alignment between two sequences. In this case the REDK, that sums up the probability of all possible alignment paths between two sequences, can be interpreted as a matching probability between two sequences. This probabilistic interpretation suggests an analogy between REDK and the alpha-beta algorithm designed to learn HMM models: while the Viterbi’s algorithm that uses a max operator in a dynamic programming implementation (just like the DTW algorithm) evaluates only the probability of the best alignment path, the alpha-beta algorithm is based on the summation of the probabilities of all possible alignment paths. As reported in [27], the main drawbacks of these kind of kernels is the vanishing of the product of local cost functions (that are lower than one) when comparing long sequences. When considering distance-matrix (obtained from pairwise distances on finite sets) this leads to a matrix that suffers from the diagonal dominance problem, i.e. the fact that the kernel value decreases extremely fast when the similarity slightly decreases.

5 CLASSIFICATION EXPERIMENTS

The purpose of this experiment is to estimate the benefit one can expect from using PD elastic kernels instead of indefinite ones in SVM classification. It is clear that the the Sequential Minimal Optimization (SMO) involved in the SVM learning procedure [34] [35] is able to handle indefinite kernels, but, as the convergence toward a global extremum is not guaranteed (the optimization is problem is not convex), we can expect some better accuracy for definite kernels especially when the pairwise distance matrix derived from the train data set is far from definiteness. This is precisely what our experiment targets to demonstrate.

To that end, we empirically evaluate the effectiveness of some REDK kernels comparatively to Gaussian Radial Basis Function (RBF) Kernels or elastic distance substituting kernels [2] using some classification tasks on a set of time series coming from quite different application fields. The classification task we have considered consists of assigning one of the possible categories to an unknown time series for the 20 data sets available at the UCR repository [36]. As time is not explicitly given for these datasets, we used the index value of the samples as the time stamps for the whole experiment.

For each dataset, a training subset (TRAIN) is defined as well as an independent testing subset (TEST). We use the training sets to train two kinds of classifiers:

• the first one is a first near neighbor (1-NN) classifier: first we select a training data set containing time series for which the correct category is known. To assign a category to an unknown time series selected from a testing data set (different from the train set), we select its nearest neighbor (in the sense of a distance or similarity measure) within the training data set, then, assign the associated category to its nearest neighbor. For that experiment, a leave one out procedure is performed on the training dataset to optimized the meta parameters of the considered comparability measure.

• the second one is a SVM classifier [19], [37] configured with a Gaussian RBF kernel whose parameters are C > 0, a trade-off between regularization and constraint violation and that determines the width of the Gaussian function. To determine the C and hyper parameter values, we adopt a 10-folded cross-validation method on each training subset. According to this procedure, given a predefined training set TRAIN and a test set TEST, we adapt the meta parameters based on the training set TRAIN: we first divide TRAIN into 10 stratified subsets ; then for each subset we use it as a new test set, and regard as a new training set; Based on the average error rate obtained on the 10 classification tasks, the optimal values of meta parameters are selected as the ones leading to the minimal average error rate.

We have used the LIBSVM library [38] to implement the SVM classifiers.

5.1 Experimenting with REDK

We tested the exponentiated REDK kernel based on the distance costs. We consider respectively the positive definite kernels. Our experiment compares classification errors on the test data for

• the first near neighbor classifiers based on the distance measures (1-NN , 1-NN and 1-NN ),

• the SVM classifiers using Gaussian distance substituting kernels based on the same distances [21], e.g. SVM , SVM , SVM ,

• and their corresponding REDK kernel, SVM , SVM , SVM .

For and we used the L1-norm, while the L2-norm has been implemented for and , a classical choice for DTW [39].

5.1.1 Meta parameters

For kernel, meta parameter g is optimized for each dataset on the train data by minimizing the classification error rate of a first near neighbor classifier using a Leave One Out (LOO) procedure. For this kernel, g is selected in . This optimized value is also used for comparison with the kernel.

For kernel, meta parameters and are optimized for each dataset on the train data by minimizing the classification error rate of a first near neighbor classi-fier. For our experiment, the stiffness value () is selected from and is selected from {0, .25, .5, .75, 1.0}. If different values lead to the minimal error rate estimated for the training data then the pairs containing the highest value are selected first, then the pair with the highest value is finally selected. These optimized () values are also used for comparability purposes with the kernel.

The kernels exploited by the SVM classifiers are the Gaussian Radial Basis Function (RBF) kernels where stands for , . To limit the search space for the SVM meta parameters, the pairwise distances or kernels values are normalized using the min and max values calculated on the TRAIN data, basically

• for or , and

• δ = exploglogloglogfor δ = REDK, REDK, REDK.

Meta parameter C is selected from , and from . The best values are obtained using the 10-folds cross validation procedure.

For the and kernels, meta parameter is selected from the discrete set G = .

The optimization procedure is as follows:

• for each value in G, we train a SVM clas-sifier on the training dataset using the previously described 10-folded cross validation procedure to select the SVM meta parameters C and and the average of the classification error is recorded.

• the best and values are the ones that lead to the minimal average error.

Fig. 2. Comparison of error rates (in %) between two SVM classifiers, the first one based on the substituting kernel (SVM ), and the second one based on a REDK kernel induced by the distance (SVM ). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that SVM has a lower (resp. higher) error rate than distance SVM

Table 1 gives for each data set and each tested kernel (and ) the corresponding optimized values of the meta parameters.

5.2 Discussion

5.2.1 REDK experiment analysis

Tables 2 and 3 show the classification error rates obtained for the tested methods, e.g. the first near neighbor classifier based on the and distances (1-NN , 1-NN and 1-NN ), the Gaussian RBF kernel SVM based on the same distances (SVM , SVM and SVM ) and Euclidean distance, and the Gaussian RBF kernel SVM based on the REDK kernels (SVM , SVM and SVM ).

In this experiment, we show that the SVM classifiers clearly outperform in general the 1-NN classifiers that are used as a baseline. But the interesting results reported in tables 2 and 3 and figures 2, 3 and 4 is that SVM and SVM perform slightly better than SVM and SVM respectively, and the SVM is clearly much efficient than the SVM . This could come from the fact that and are metrics but not , but another explanation could be related to the the SMO optimization involved into the SVM learning. SVM poorly behaves compared to the other tested classifiers, probably because the SVM opti-

TABLE 1 Meta parameters used in conjunction with and kernels

TABLE 2 Comparative study using the UCR datasets: classification error rates (in %) obtained using the first near neighbor classification rule and a SVM classifier for the and kernels. Two scores are given S1|S2: the first one, S1, is evaluated on the training data, while the second one, S2, is evaluated on the test data. Bold faces indicates the lowest error rate between the pairs of SVMs () and ().

mization process does not perform well. Nevertheless, the kernel based on greatly improves the classification results. To further explore the potential impact of indefiniteness on classification rates, we give in table 4 two quantified hints of deviation to conditionally definiteness for the distance-matrices corresponding to the and distances. Since to be conditionally definite (negative) a pairwise distance-matrix should have a single positive eigenvalue, the first hint is the number of positive eigenvalues #Pev (we give also as a reference the total number of eigenvalues, #Ev).

The second hint,

, where is an eigenvalue of the distance matrix, quan-tifies the weight of the extra positive eigenvalues relatively to the weight of the total number of positive eigenvalues. Therefore, a conditionally definite (negative) matrix should be such that simultaneously #Pev = 1 and .

By examining the distance-matrices corresponding to each training datasets and for each distances , and , we can show that the kernel is much further away from a conditionally defi-

TABLE 3 Comparative study using the UCR datasets: classification error rates (in %) obtained using the first near neighbor classification rule and a SVM classifier for the and kernels. Two scores are given S1|S2: the first one, S1, is evaluated on the training data, while the second one, S2, is evaluated on the test data. Bold faces indicates the lowest error rate between the pair of SVMs ().

Fig. 3. Comparison of error rates (in %) between two SVM classifiers, the first one based on the substituting kernel (SVM ), and the second one based on a REDK induced by the distance (SVM )). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that SVM has a lower (resp. higher) error rate than distance

Fig. 4. Comparison of error rates (in %) between two SVM classifiers, the first one based on the substituting kernel (SVM ), and the second one based on a REDK induced by the distance (SVM ). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that SVM has a lower (resp. higher) error rate than distance SVM

TABLE 4 Analysis of the deviation to conditionally definiteness for the distance-matrices associated to the and distances. We report for each dataset the number of positive eigenvalues (#Pev) relatively to the total number of eigenvalues (#Ev)and the deviation to definiteness estimated as that expresses in %. The expectation is a single positive eigenvalue, #Pev = 1, corresponding to .

nite matrix than the and kernels. The distance that is closer to conditional definiteness is the distance. This is clearly quantifiable by computing the number of positive eigenvalues, as well as their amplitudes, of the pairwise distance matrices. For datasets such as FaceAll, OSULeaf, SwedishLeaf, 50words, Adiac, TwoPatterns, Fish etc., all the three distances lead to indefinite pairwise distance matrices. The REDK regularization brings in general a significant improvement, specifically when the number and amplitudes of the extra positive eigenvalues are high. Furthermore, for datasets of small sizes (such as CBF, Beef, Coffee, OliveOil, etc.), and produce conditionally definite matrices where does not. One can see that the regularization brought by REDK is much more effective on on these data set. Nevertheless, some datasets are better classified by SVM that use directly the distance substituting kernel instead of the derived REDK regularized kernel.

To conclude, our experiment clearly shows that for problems involving distance matrices far from definite-ness (more than 5-10% of eigenvalues are positive, see Table 4), the REDK regularized version outperforms the original elastic distance substituting kernel. This is particularly true for . For , or even distances, most of the training sets lead to pairwise distance-matrices that are either conditionally definite negative or close to CND matrices (with very few extra positive eigenvalues). Nevertheless, in most cases, the REDK-and REDK-outperformed and respectively, showing some robustness of the REDK regularization. The main drawback of REDK is the extra parameter . This extra parameter makes the search for an optimal setting on the train data more difficult and requires more learning data to converge. The trade-off between learning and generalization is therefore more complex to optimize.

6 CONCLUSION

Following the work on convolution kernels [1] and local alignment kernels defined for string processing around the Smith and Waterman algorithm [9] [27], or defined for time series on the basis of the DTW similarity measure [5], we have addressed the definiteness of elastic distances through the construction of positive definite kernels built from the recursive definitions of elastic (editing) distances themselves and achieve some extension and generalization of these previous results.

Contrary to other approaches that rely on the regularization of the Gram-matrix [23][24], this approach does not depend at all on the data, excpet for the optimization of a regularizing parameter (). By adding an extra recursive term we achieve some simple sufficient conditions, that are weaker than those proposed in [27] or [5], allowing to build positive definite exponentiated REDK. These conditions are basically satisfied by any classical elastic distance defined by a recursive equation. In particular they can be applied to the edit distance, the well known Dynamic Time Warping measure and to some variants such as the Edit Distance With Real penalty and the Time Warp Edit Distance, the latter two being metrics as well as the symbolic edit distance. Furthermore, our results apply when a symmetric corridor is used to limit the search space required to evaluate the elastic distance, thus reducing consequently the computation time, which, in the worse case, i.e. without corridor, remains quadratic.

The experiments conducted on a variety of time series datasets show that the positive definite REDKs outperforms in general the indefinite elastic distances from which they are derived, when considering 1-NN and SVM classification tasks. This is mostly striking when the Gram-matrix evaluated on the train data sets is far from definiteness, which is the case when DTW is used. Recent experiment in isolated gesture recognition [40] corroborate this observation.

APPENDIX A INDEFINITENESS OF SOME ELASTIC MEASURES

A.1 The Levenshtein distance

The Levenshtein distance kernel is known to be indefinite. Below, we discuss the first known counter-example produced by [41]. Let us consider the subset of sequences V = {abc, bad, dab, adc, bcd} that leads to the following distance matrix

A.2 The Dynamic Time Warping distance

The DTW kernel is also known not to be conditionally definite. The following example demonstrates this known result. Let us consider the subset of sequences V = {01, 012, 0122, 01222}.

Then the DTW distance matrix evaluated on V is

and consider coefficient vectors C and D in such that

with . Clearly and , showing that has no definiteness.

A.3 The Time Warp Edit Distance

Similarly, it is easy to find simple counter examples that show that TWED kernels are not definite.

Let us consider the subset of sequences V = {010, 012, 103, 301, 032, 123, 023, 003, 302, 321}.

For the TWED metric, with and we get the following matrix:

The eigenvalue spectrum for this matrix is the following:

, . This spectrum contains 2 strictly positive eigenvalues, showing that has no definiteness.

A.4 The Edit Distance with Real Penalty

For the ERP metric, with g = 0.0 we get the following matrix:

The eigenvalue spectrum for this matrix is the following:

, . This spectrum contains 3 strictly positive eigenvalues (although the third positive eigenvalue which is very small could be the result of the imprecision of the used diagonalization algorithm), showing that has no definiteness.

APPENDIX B PROOF OF OUR MAIN RESULT

B.1 Proof of theorem 4.3

Definition B.1: Let be the subset of U containing all the sequences whose lengths are lower or equal to n.

Fig. 5. Example of an alignment path corresponding to the alignment map (0, 0)(0, 1)(1, 2)(1, 3)(2, 4)(3, 4)(4, 5). The white squares correspond to substitution operations and black circles to either deletion or insertion operations.

Definition B.2: For all let be the local kernel defined as follows:

where stands for the null sequence.

Definition B.3: Let be an ordered alignment map between two finite non empty sequences of successive integers of the form 0, .., n. Basically is a finite sequence of pairs of integers for , satisfying the following conditions

As shown in figure 5, there exists a direct correspondence between an alignment map and an alignment path, i.e. a finite sequence of editing operations.

Definition B.4: For all n we introduce two projections, basically two vectorized representations, and induced uniquely by any alignment map .

The principle for the construction of these two unique projections, given any sequence A and any alignment map , is straightforward. With the convention that if k > |A| then and , during the course along of length at step , we apply the following rules:

i) if both indexes and increase , then, we insert in and in .

ii) if only index increases, then we insert in and in , with the convention that, if

iii) if only index increases, then we insert in and in .

iv) when we reach the end of , if the lengths of (respectively ) is shorter than 2n, then we insert into the remaining dimensions.

where is the symbol used for an insertion or deletion operation. This symbol depends on the elastic distance. For DTW, if , or if i > |A|.

Then, for any and , we call the set of projections (or parts) for sequence A induced by . Note that all these projections are sequences whose lengths are 2n. Even

Proposition B.5: If the local kernel y)) is positive definite on then and

is a p.d. kernel on .

Proof of lemma B.5 is a direct consequence of the Haussler’s R-convolution kernel theorem [1]. Indeed, since k(x, y) is a p.d. kernel on , and, considering the sets of parts and associated respectively to the sequences A and B, the conditions for the Haussler’s R-convolution are satisfied.

Note that simply rewrites as

Let be the subset of paths that are restricted by the symmetric corridor induced by any symmetric indicator function h entering into the construction of the RAfP functions and .

Proposition B.6: For any , any , and any

, we have

We state first that such that and . () is a pair of symmetrical paths with respect to the main diagonal.

Thus, since evaluates the sum of the cost-product along all paths in , that is

we can rewrite

Similarly, for we get

Hence, the expected result.

Proposition B.6 states that K(., .) is a finite sum of p.d. kernels defined on . According to the closure properties under the addition of p.d. kernels, we show that K(., .) is a p.d. kernel defined on .

As this result is true for all n, one can see K(., .) as a point-wise limit as n tends toward infinity of a p.d. kernel defined on . This establishes that K(., .B) is a p.d. kernel on .

ACKNOWLEDGMENTS

The authors thank the French Ministry of Research, the Brittany Region, the General Council of Morbihan and the European Regional Development Fund that had partially fund this research. The authors also thank Dr. E. Keogh’s team at UC Riverside for making available the time series data sets that have been used in this study.

REFERENCES

[1] D. Haussler, “Convolution kernels on discrete structures,” University of California at Santa Cruz, Santa Cruz, CA, USA, Technical Report UCS-CRL-99-10, 1999. [Online]. Available: http://citeseer.ist.psu.edu/haussler99convolution.html

[2] B. Haasdonk and C. Bahlmann, “Learning with distance substitution kernels,” in Pattern Recognition, ser. Lecture Notes in Computer Science, C. Rasmussen, H. Blthoff, B. Schlkopf, and M. Giese, Eds. Springer Berlin Heidelberg, 2004, vol. 3175, pp. 220–227.

[3] J. P. Vert, H. Saigo, and T. Akutsu, “Local alignment kernels for biological sequences,” in Kernel Methods in Computational Biology, B. Scholkopf, K. Tsuda, and J. Vert, Eds. MIT Press, 2004, pp. 131–154.

[4] C. Cortes, P. Haffner, and M. Mohri, “Rational kernels: Theory and algorithms,” J. Mach. Learn. Res., vol. 5, pp. 1035–1062, Dec. 2004.

[5] M. Cuturi, J.-P. Vert, . Birkenes, and T. Matsui, “A kernel for time series based on global alignments,” in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 2, April 2007, pp. II–413–II–416.

[6] R. Bellman, Dynamic Programming. Princeton Univ Press, 1957, new Jersey.

[7] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Doklady Akademii Nauk SSSR, 163(4):845-848, 1965 (Russian), vol. 8, pp. 707–710, 1966, english translation in Soviet Physics Doklady, 10(8).

[8] R. A. Wagner and M. J. Fischer, “The string-to-string correction problem,” Journal of the ACM (JACM), vol. 21, pp. 168–173, 1973.

[9] T. Smith and M. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, pp. 195–197, 1981.

[10] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “A basic local alignment search tool,” Journal of Molecular Biology, vol. 215, pp. 403–410, 1990.

[11] W. Pearson, “Rapid and sensitive sequence comparisons with fasp and fasta,” Methods Enzymol, vol. 183, pp. 63–98, 1990.

[12] V. M. Velichko and N. G. Zagoruyko, “Automatic recognition of 200 words,” International Journal of Man-Machine Studies, vol. 2, pp. 223–234, 1970.

[13] H. Sakoe and S. Chiba, “A dynamic programming approach to continuous speech recognition,” in Proceedings of the Seventh International Congress on Acoustics, Budapest, vol. 3. Budapest: Akad´emiai Kiad´o, 1971, pp. 65–69.

[14] L. Chen and R. Ng, “On the marriage of lp-norms and edit distance,” in Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, ser. VLDB ’04. VLDB Endowment, September 2004, pp. 792–803.

[15] P.-F. Marteau, “Time warp edit distance with stiffness adjustment for time series matching,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 2, pp. 306–318, Feb 2009.

[16] A. Hayashi, Y. Mizuhara, and N. Suematsu, “Embedding time series data for classification,” in Machine Learning and Data Mining in Pattern Recognition, ser. Lecture Notes in Computer Science, P. Perner and A. Imiya, Eds. Springer Berlin Heidelberg, 2005, vol. 3587, pp. 356–365.

[17] B. Haasdonk, “Feature space interpretation of svms with in- definite kernels,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 4, pp. 482–492, April 2005.

[18] V. Vapnik, Statistical Learning Theory. Wiley-Interscience, ISBN 0-471-03003-1, 1989.

[19] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ser. COLT ’92. New York, NY, USA: ACM, July 1992, pp. 144–152.

[20] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.

[21] D. Zhang, W. Zuo, D. Zhang, and H. Zhang, “Time series classifi- cation using support vector machine with gaussian elastic metric kernel,” in Proceedings of the 2010 20th International Conference on Pattern Recognition, ser. ICPR ’10. Washington, DC, USA: IEEE Computer Society, August 2010, pp. 29–32.

[22] A. Woznica, A. Kalousis, and M. Hilario, “Distances and (indefi- nite) kernels for sets of objects,” in Data Mining, 2006. ICDM ’06. Sixth International Conference on, Dec. 2006, pp. 1151–1156.

[23] G. Wu, E. Y. Chang, and Z. Zhang, “Learning with non-metric proximity matrices,” in MULTIMEDIA ’05: Proceedings of the 13th annual ACM international conference on Multimedia. New York, NY, USA: ACM, Nov. 2005, pp. 411–414.

[24] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Caz- zanti, “Similarity-based classification: Concepts and algorithms,” J. Mach. Learn. Res., vol. 10, pp. 747–776, Jun. 2009.

[25] K. Sivaramakrishnan and C. Bhattacharyya, “Time series classifi- cation for online tamil handwritten character recognition a kernel based approach,” in Neural Information Processing, ser. Lecture Notes in Computer Science, N. R. Pal, N. Kasabov, R. K. Mudi, S. Pal, and S. K. Parui, Eds. Springer Berlin / Heidelberg, 2004, vol. 3316, pp. 800–805.

[26] K. Kumara, R. Agrawal, and C. Bhattacharyya, “A large margin approach for writer independent online handwriting classifica-tion,” Pattern Recogn. Lett., vol. 29, pp. 933–937, May 2008.

[27] H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu, “Protein homology detection using string alignment kernels,” Bioinformatics, vol. 20, no. 11, pp. 1682–1689, Jul. 2004.

[28] C. Berg, J. P. R. Christensen, and P. Ressel, Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions, ser. Graduate Texts in Mathematics. New York: Springer-Verlag, Apr. 1984, vol. 100.

[29] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. New York, NY, USA: Cambridge University Press, 2004.

[30] P. F. Marteau, “Time warp edit distance,” VALORIA, Universite de Bretagne Sud, Tech. Rep., 2008, technical Report valoriaUBS-2008-3v, http://hal.archives-ouvertes.fr/hal-00258669/fr/.

[31] K. Shin and T. Kuboyama, “A generalization of haussler’s convo- lution kernel mapping kernel and its application to tree kernels,” Journal of Computer Science and Technology, vol. 25, no. 5, pp. 1040– 1054, 2010.

[32] I. J. Schoenberg, “Metric spaces and positive definite functions,” Transactions of the American Mathematical Society, vol. 44, no. 3, pp. 522–536, nov 1938.

[33] C. Micchelli, “Interpolation of scattered data: Distance matrices and conditionally positive definite functions,” Constructive Approximation, vol. 2, no. 1, pp. 11–22, 1986.

[34] J. C. Platt, “Advances in kernel methods,” B. Sch¨olkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA, USA: MIT Press, 1999, ch. Fast Training of Support Vector Machines Using Sequential Minimal Optimization, pp. 185–208.

[35] P.-H. Chen, R.-E. Fan, and C.-J. Lin, “A study on smo-type decomposition methods for support vector machines,” Neural Networks, IEEE Transactions on, vol. 17, no. 4, pp. 893–908, July 2006.

[36] E. J. Keogh, X. Xi, L. Wei, and C. Ratanamahatana, “The UCR time series classification-clustering datasets,” 2006, http://wwwcs.ucr.edu/ eamonn/timeseriesdata/.

[37] V. Vapnik, The Nature of Statistical Learning Theory. SpringerVerlag, 1995.

[38] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

[39] C. A. Ratanamahatana and E. J. Keogh, “Making time-series classification more accurate using learned constraints,” in Proceedings of the Fourth SIAM International Conference on Data Mining (SDM’04), Apr. 2004, pp. 11–22.

[40] P.-F. Marteau and S. Gibet, “Down-Sampling coupled to Elastic Kernel Machines for Efficient Recognition of Isolated Gestures,” in Proceedings of the 2014 22th International Conference on Pattern Recognition, IAPR, Ed. Stockholm, Sweeden: IEEE, Aug. 2014, p. to appear.

[41] C. Cortes, P. Haffner, and M. Mohri, “Positive definite rational kernels,” in Learning Theory and Kernel Machines, ser. Lecture Notes in Computer Science, B. Schlkopf and M. Warmuth, Eds. Springer Berlin Heidelberg, 2003, vol. 2777, pp. 41–56.

Designed for Accessibility and to further Open Science