Classification of Time-Series Images Using Deep Convolutional Neural Networks

2017·arXiv

ABSTRACT

1. INTRODUCTION

A time-series is a sequence of data points (measurements) which has a natural temporal ordering. Many important real-world pattern recognition tasks deal with time-series analysis. Biomedical signals (e.g. EEG and ECG), financial data (e.g. stock market and currency exchange rates), industrial devices (e.g. gas sensors and laser excitation), biometrics (e.g. voice, signature and gesture), video processing, music mining, forecasting and weather are examples of application domains with a time-series nature. Time-series analysis tasks are mainly divided into curve fitting, function approximation, prediction and forecasting, segmentation, classification and clustering. In univariate time-series classification, $x^n \to y^n$, so that the n-th series of length $l$, $x^n = (x^n_1, x^n_2, \dots, x^n_l)$, is associated with a class label $y^n \in \{1, 2, \dots, c\}$. It is worth noting that although this paper mainly focuses on the time-series classification problem, the proposed method can be easily adapted to other tasks such as clustering and anomaly detection.

The existing Time-Series Classification (TSC) methods may be categorized from different perspectives. Regarding the feature types, "frequency-domain" methods include spectral analysis and wavelet analysis, while "time-domain" methods include auto-correlation, auto-regression and cross-correlation analysis. Regarding the classification strategy, methods can also be divided into "instance-based" and "feature-based". The former measure the similarity between an incoming test sample and the training set, and assign the label of the most similar class (the Euclidean-distance-based 1-Nearest Neighbor (1-NN) and Dynamic Time Warping (DTW) are two popular and widely used methods of this category). The latter first transform the time-series into a new space and extract more discriminative and representative features to be used by a pattern classifier, which aims to find the optimal classification boundaries.

Recently, Deep Learning (DL, also known as feature learning or representation learning) models have achieved high recognition rates in computer vision [9, 10, 11] and speech recognition [12, 13]. The Convolutional Neural Network (CNN) is one of the most popular DL models. Unlike the traditional "feature-based" classification framework, a CNN does not require hand-crafted features. The feature learning and classification parts are unified in one model and learned jointly, so their performances are mutually enhanced. Multiple layers of different processing units (e.g. convolution, pooling, sigmoid/hyperbolic tangent squashing, rectifier and normalization) are responsible for learning (representing) a hierarchy of features from low-level to high-level.

Figure 1: From time-series signal to recurrence plot. Left: A simple example of a time-series signal (x) with 12 data points. Middle: The 2D phase space trajectory is constructed from x by the time delay embedding (states in the phase space are shown with bold dots: $s_1: (x_1, x_2)$, $s_2: (x_2, x_3)$, ..., $s_{11}: (x_{11}, x_{12})$). Right: The recurrence plot R is an 11×11 square matrix with $R_{i,j} = \mathrm{dist}(s_i, s_j)$.

This paper investigates the performance of Recurrence Plots (RP) [14] within the deep CNN model for TSC. RP provides a way to visualize the periodic nature of a trajectory through a phase space and enables us to investigate certain aspects of the m-dimensional phase space trajectory through a 2D representation. Because of the recent outstanding results of CNNs on image recognition, we first encode time-series signals as 2D plots, and then treat the TSC problem as a texture recognition task. A CNN model with 2 hidden layers followed by a fully connected layer is used.

2. RELATED WORK

This section briefly reviews the recent deep learning contributions to the TSC task. The application of deep learning to TSC has not been fully explored until recently. There are two main types of approaches when it comes to applying CNNs to TSC: some modify the traditional CNN architecture and use 1D time-series signals as input, while others first transform the 1D signals into 2D matrices and then apply a CNN, as in traditional CNN-based image recognition. A time delay neural network (TDNN) model is adopted for EEG classification in [15]. However, that model has only a single hidden layer and is not deep enough to learn hierarchical features. Lee et al. [16] explored a convolutional deep belief network (CDBN) for audio classification. They applied this unsupervised feature learning in the frequency domain rather than in the time domain. A multi-channel CNN has been proposed to deal with multivariate time-series [17]. Each time-series signal is fed into a separate CNN, and afterwards the outputs of all CNNs are concatenated and fed into a fully connected MLP classifier. Instead of using the raw signal and learning features automatically, Dalto [18] feeds the CNN with variables post-processed by an input variable selection (IVS) algorithm. In [19], audio signals are transformed to the time-frequency domain and fed into a CNN. In another similar work by the same authors [20], a CNN is applied to speech recognition within the framework of a hybrid NN-HMM model. A multi-scale CNN is proposed for TSC in [21]. In order to extract features at different scales and frequencies, it transforms signals by down-sampling and smoothing operators, followed by a local convolution stage. A deep CNN is applied to multichannel time-series signals of human activities in [22]. A sliding window strategy is adopted to split the time-series into a collection of short signal segments. In this way, a 2D representation of a 1D time-series signal is obtained and a CNN model is applied to the 2D matrices (treating them like images).

Perhaps [23, 24] is the most similar research to our work. There, Gramian Angular Fields (GAF) and Markov Transition Fields (MTF) are used to encode time-series signals as images. Afterwards, a Tiled CNN model is applied to classify the time-series images. Another similar work [25] applied visual descriptors such as Gabor and Local Binary Patterns (LBP) to RP in order to extract texture features from time-series, and then used SVM classifiers. The difference between the latter and our work is that they used a traditional classification framework with hand-crafted features and separate feature extraction and classification steps, whereas our proposed CNN-based framework automatically learns the texture features that are useful for the classification layer. The joint learning of feature representation and classifier offered by the CNN increases the classification performance.

Figure 2: The proposed 2-stage CNN architecture for TSC. RP images are resized to 28x28, 56x56 or 64x64 (depending on the data) and fed into the CNN model. This architecture, 32(5)-2-32(5)-2-128-c, consists of 2 convolution, 2 pooling, and 2 fully-connected layers.

3. METHODOLOGY

3.1 Time-Series to Image Encoding

Time-series can be characterized by distinct recurrent behavior such as periodicities and irregular cyclicities. Additionally, the recurrence of states is a typical phenomenon for the dynamic nonlinear systems or stochastic processes that generate time-series. The RP is a visualization tool that aims to explore the m-dimensional phase space trajectory through a 2D representation of its recurrences. The main idea is to reveal at which points a trajectory returns to a previous state, and it can be formulated as:

$$R_{i,j} = \theta(\epsilon - \lVert s_i - s_j \rVert), \qquad s(\cdot) \in \mathbb{R}^m, \qquad i, j = 1, \dots, K \qquad (1)$$

where K is the number of considered states $s$, $\epsilon$ is a threshold distance, $\lVert \cdot \rVert$ a norm and $\theta(\cdot)$ the Heaviside function. The R-matrix contains both texture information (single dots, diagonal lines, and vertical and horizontal lines) and typology information (characterized as homogeneous, periodic, drift and disrupted). For instance, fading to the upper left and lower right corners means that the process contains a trend or drift, while vertical and horizontal lines or clusters show that some states do not change, or change slowly, for some time, which can be interpreted as laminar states. The patterns and information in an RP are not always easy to see and interpret visually.

Figure 1 shows the step-by-step procedure for calculating the RP of a simple time-series signal. First, the 2D phase space trajectory (m = 2) is constructed from the time-series. Then, the R-matrix is calculated based on the closeness of the states in the phase space. It is worth noting that the R-matrix resulting from Formula 1 has only {0,1} values, which is caused by the thresholding parameter $\epsilon$. In order to avoid the information loss caused by binarization of the R-matrix, this paper skips the thresholding step and uses the gray-level texture images. Inspired by the unique texture images obtained from the R-matrices, this paper proposes a TSC pipeline based on the CNN model. First the raw 1D time-series signals are transformed into 2D recurrence texture images, and then both features and classifier are jointly learned in one unified model.
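The encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function names, the Euclidean norm, the sine test signal and the convention that the Heaviside function equals 1 at zero are all assumptions made for the example. Passing `eps=None` skips the thresholding step and returns the gray-level distance matrix.

```python
import numpy as np


def time_delay_embedding(x, m=2, tau=1):
    """Return the K x m matrix of states s_i = (x_i, x_{i+tau}, ..., x_{i+(m-1)tau})."""
    x = np.asarray(x, dtype=float)
    k = len(x) - (m - 1) * tau  # number of states K
    if k < 1:
        raise ValueError("time series too short for this m and tau")
    return np.stack([x[j * tau : j * tau + k] for j in range(m)], axis=1)


def recurrence_plot(x, m=2, tau=1, eps=None):
    """Recurrence matrix of x: binary if eps is given, gray-level distances otherwise."""
    states = time_delay_embedding(x, m, tau)
    diff = states[:, None, :] - states[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # Euclidean norm (assumed), shape K x K
    if eps is None:
        return dist  # unthresholded texture image
    # Heaviside step of (eps - dist), taking theta(0) = 1 (assumed convention)
    return (dist <= eps).astype(np.uint8)


# Hypothetical 12-point signal, mirroring the setting of Figure 1 (m = 2, tau = 1).
x = np.sin(np.linspace(0.0, 4.0 * np.pi, 12))
R = recurrence_plot(x, m=2, tau=1)
R_binary = recurrence_plot(x, m=2, tau=1, eps=0.5)
print(R.shape)
```

With 12 samples, m = 2 and a delay of 1 there are 11 states, so both matrices are 11×11, symmetric, and have a zero-distance main diagonal.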

3.2 CNN for Time-Series Image Classification

There are two aspects of any CNN model that should be considered carefully: i) designing an appropriate architecture, and ii) choosing the right learning algorithm. Both the architecture and the learning rules should be chosen so that they are not only compatible with each other, but also fit the data and the application appropriately.

Figure 3: Application of RP to time-series from the UCR archive: 50words, TwoPatterns, FaceAll, OliveOil and Yoga data (from left to right, respectively).

Architecture. A 2-stage deep CNN model is applied here, with a 1-channel input of size 28×28 and an output layer with c neurons. Each feature learning stage represents a different feature level and consists of convolution (filter), activation, and pooling operators, in that order. The input and output of each layer are called feature maps. A filter layer convolves its input with a set of trainable kernels. The convolutional layer is the core building block of a CNN and exploits spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers. The connections are local, but always extend along the entire depth of the input volume in order to produce the strongest response to a spatially local input pattern. The activation function (such as sigmoid or tanh) introduces non-linearity into the network and allows it to learn complex models. Here we applied ReLU (Rectified Linear Units) because it trains neural networks several times faster [28] without a significant penalty to generalisation accuracy. Pooling (a.k.a. subsampling) reduces the resolution of its input and makes the previously learned features robust to small variations. It combines the outputs of layer i-1 over a local neighborhood into a single input of layer i.
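To make the three operators of one feature learning stage concrete, the following NumPy sketch runs a single convolution, ReLU and max-pooling pass over a random stand-in image. It is illustrative only: the unpadded stride-1 convolution, the 3×3 kernel size, the random weights and all function names are assumptions, and a real model would rely on a deep learning library instead of explicit loops.

```python
import numpy as np


def conv2d_valid(image, kernels, bias):
    """Correlate an (H, W, C) input with (F, k, k, C) kernels; no padding, stride 1."""
    h, w, _ = image.shape
    f, k, _, _ = kernels.shape
    out = np.empty((h - k + 1, w - k + 1, f))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            # Local receptive field that spans the full depth of the input
            patch = image[i : i + k, j : j + k, :]
            out[i, j, :] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2])) + bias
    return out


def relu(x):
    """Rectified linear unit: element-wise max(x, 0)."""
    return np.maximum(x, 0.0)


def max_pool(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w, f = x.shape
    h2, w2 = h // size, w // size
    x = x[: h2 * size, : w2 * size, :]
    return x.reshape(h2, size, w2, size, f).max(axis=(1, 3))


rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))                     # stand-in for a 28x28 RP image
kernels = rng.standard_normal((32, 3, 3, 1)) * 0.1  # 32 random 3x3 kernels (untrained)
bias = np.zeros(32)

stage1 = max_pool(relu(conv2d_valid(image, kernels, bias)), size=2)
print(stage1.shape)
```

Under these assumptions the 28×28 input shrinks to 26×26 after the convolution and to 13×13 after pooling, with one feature map per kernel.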

At the end of the 2-stage feature extraction, the feature maps are flattened and fed into a fully connected (FC) layer for classification. FC layers connect every neuron in one layer to every neuron in the next layer, which in principle is the same as in the traditional multi-layer perceptron (MLP). The proposed pipeline for TSC is shown in Fig. 2.

Learning. Training the above CNN architecture is similar to training MLPs. A gradient-based optimization method (the error back-propagation algorithm) is utilized to estimate the parameters of the model. For faster convergence, stochastic gradient descent (SGD) is used for updating the parameters. The training phase has two main steps: propagation and weight update. Each propagation involves a feedforward pass and an error back-propagation pass. The former determines the feature maps for an input by passing it from layer to layer until the output is reached (left to right in Figure 2). The latter calculates the propagated errors according to the loss function for the predicted output (the error propagates from right to left in Figure 2). The error at each layer is used for calculating the derivatives by means of the chain rule. Once the derivatives of the parameters are obtained, each weight is updated as follows: the weight's output delta and input activation are multiplied to find the gradient of the weight, and then a fraction (the learning rate) of the weight's gradient is subtracted from the weight. This learning cycle is repeated until the network reaches a satisfactory validation error. More details on the CNN architecture and learning algorithm can be found in [29, 30, 31].
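The weight update rule described above can be illustrated on a single layer. The sketch below assumes a softmax output layer trained with cross-entropy loss on one hypothetical sample; it is not the full CNN training loop, and the layer sizes, learning rate and names are chosen only for the example.

```python
import numpy as np


def softmax(z):
    """Softmax with a max-shift for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()


def sgd_step(W, b, a, y_onehot, lr=0.1):
    """One SGD update of a softmax output layer under cross-entropy loss."""
    p = softmax(a @ W + b)         # feedforward pass
    delta = p - y_onehot           # output delta (derivative of the loss w.r.t. the pre-activation)
    grad_W = np.outer(a, delta)    # input activation multiplied by output delta
    loss = -np.log(p[np.argmax(y_onehot)])
    # Subtract a fraction (the learning rate) of the gradient from each parameter
    return W - lr * grad_W, b - lr * delta, loss


rng = np.random.default_rng(0)
a = rng.random(8)                  # hypothetical input activations from the previous layer
y = np.array([0.0, 1.0, 0.0])      # one-hot target for a 3-class problem
W = np.zeros((8, 3))
b = np.zeros(3)

losses = []
for _ in range(20):
    W, b, loss = sgd_step(W, b, a, y, lr=0.1)
    losses.append(loss)
print(losses[0], losses[-1])
```

Each recorded loss is computed before the corresponding update, so the list traces how repeated updates on the same sample reduce the error.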

4. EXPERIMENTS

In order to evaluate the performance of the proposed method, the UCR time-series classification archive [32] is used. To allow a comprehensive evaluation of classification algorithms, this repository contains 85 time-series datasets with different characteristics, i.e. number of classes (2 to 60), number of training samples (16 to 8926) and time-series length (24 to 2709). The datasets are obtained from a variety of real-world applications, ranging from EEG signals to food/beverage flavors, and from electric devices to phoneme signals. We have selected the 20 datasets for which results of the other state-of-the-art algorithms are also reported (particularly the algorithms that we need to compare our results with, see Table 1). The training and testing sets are provided separately to make sure that the results of different algorithms from different studies are comparable. Furthermore, the error rates of "1-NN Euclidean Distance", "1-NN DTW(r), where r is the percentage of time-series length" and "1-NN DTW (no Warping Window)" are reported as the baseline methods. Only one of these 1-NN baselines is included in the table, because it obtains lower error rates than the other two. The application of RP (with the phase space dimension m = 3 and embedding time delay τ) to five of the datasets is shown in Figure 3.

Table 1: Performance (in terms of the error rates) of the proposed method compared to the state-of-the-art TSC algorithms on 20 selected datasets from the UCR archive.
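For readers unfamiliar with the 1-NN DTW baseline named above, the following is a minimal sketch of DTW without a warping window combined with a nearest-neighbour rule. The squared point-wise cost, the final square root, the toy series and the labels are assumptions for illustration; the archive's reference implementation may differ in such details.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic time warping distance with squared point-wise cost and no warping window."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible warping moves
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length series."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))


def one_nn_predict(train_x, train_y, query, distance=dtw_distance):
    """Return the label of the training series closest to the query."""
    dists = [distance(x, query) for x in train_x]
    return train_y[int(np.argmin(dists))]


# Hypothetical toy data: a "bump" and a "dip", and a time-shifted bump as query.
train_x = [[0, 0, 1, 2, 1, 0], [2, 2, 1, 0, 1, 2]]
train_y = ["bump", "dip"]
query = [0, 1, 2, 1, 0, 0]

print(one_nn_predict(train_x, train_y, query))
print(euclidean_distance(train_x[0], query), dtw_distance(train_x[0], query))
```

The shifted bump has a non-zero Euclidean distance to its own class prototype but a zero DTW distance, which is the kind of temporal misalignment that DTW is designed to absorb.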

For comparison purposes, "Number of Wins" and "Average Rank" are calculated for each algorithm. Number of Wins counts the datasets on which a specific algorithm obtains the lowest error rate, while Average Rank is the mean of the error-rate rankings over all datasets. Since some error rates are missing for some datasets and some algorithms, we normalized the measures for each algorithm. An algorithm outperforms the rest if it has the highest number of wins and the lowest average rank.
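The two measures can be computed from a table of error rates as sketched below. How the normalization for missing entries was done is not spelled out in the text, so this sketch makes its own assumptions: missing results are stored as NaN, each dataset is ranked only among the algorithms that report a result on it, tied algorithms share the best rank, and each algorithm's totals are taken over the datasets it reports. The error values are invented for the example.

```python
import numpy as np


def wins_and_average_rank(errors):
    """errors: (n_datasets, n_algorithms) error rates, np.nan where unreported."""
    errors = np.asarray(errors, dtype=float)
    reported = ~np.isnan(errors)
    n_alg = errors.shape[1]
    wins = np.zeros(n_alg, dtype=int)
    rank_sum = np.zeros(n_alg)
    for row, mask in zip(errors, reported):
        vals = row[mask]
        # Rank 1 = lowest error; tied algorithms share the best rank of the tie
        ranks = np.array([1 + np.sum(vals < v) for v in vals], dtype=int)
        rank_sum[mask] += ranks
        wins[mask] += (ranks == 1)
    n_reported = reported.sum(axis=0)
    return wins, n_reported, rank_sum / n_reported


# Invented error rates: 3 datasets (rows) x 3 algorithms (columns).
errors = [
    [0.10, 0.20, np.nan],
    [0.30, 0.25, 0.40],
    [0.05, np.nan, 0.05],
]
wins, n_reported, avg_rank = wins_and_average_rank(errors)
for name, w, n, r in zip(["A", "B", "C"], wins, n_reported, avg_rank):
    print(f"{name}: {w} wins out of {n}, average rank {r:.2f}")
```

Reporting wins as "w out of n" keeps algorithms with missing results comparable with those evaluated on every dataset.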

All experiments are carried out with Python (using Keras, Theano and TensorFlow) on a PC with a 2.4GHz×32 CPU and 32GB memory. Training the CNN requires a fixed-size window as the input layer. In the experiments, inputs of 28×28, 56×56 and 64×64 pixels are used, depending on the data. Both convolutional layers contain 32 feature maps with 3×3 convolution (32×3×3) and MaxPooling of size 2×2. The FC layer contains 128 hidden neurons and c output neurons, with Dropout = 0.5. A standard way to represent this architecture is C1(size)-S1-C2(size)-S2-H-O, where C1 and C2 are the numbers of filters in the first and second layers, size denotes the kernel size, S1 and S2 are the pooling sizes, and H and O represent the numbers of hidden and output neurons in the MLP. Based on this template, the proposed CNN model is denoted as 32(5)-2-32(5)-2-128-c. The "categorical-crossentropy" loss function with the "adam" optimizer is used. The batch size and the number of epochs are chosen from {5, 20} and {50, 250, 1000, 2000}, respectively, based on their performance on the validation set. Since finding the optimal CNN parameters is still an open problem, we followed rules of thumb to choose the parameters. A visualization of the kernels in the 1st and 2nd representation layers of a CNN trained on the "TwoPatterns" data is given in Figure 4.

Figure 4: Visualization of kernels in the 1st (left) and 2nd (right) hidden layers from a CNN trained on the "TwoPatterns" data from the UCR archive.
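The compact architecture notation can be expanded into layer shapes and parameter counts with a small helper, which is useful for checking what a given string implies for each input size. The sketch below is a hypothetical utility, not part of the paper: it assumes unpadded stride-1 convolutions, non-overlapping pooling and a single input channel, and the function name is invented. The kernel size is read from the string, so variants such as 32(3)-... and 32(5)-... can both be expanded.

```python
import re


def expand_architecture(spec, input_size, n_classes, channels=1):
    """Expand 'C1(k)-S1-C2(k)-S2-H-c' into (name, output shape, parameter count) tuples."""
    match = re.fullmatch(r"(\d+)\((\d+)\)-(\d+)-(\d+)\((\d+)\)-(\d+)-(\d+)-c", spec)
    if match is None:
        raise ValueError(f"unrecognised architecture string: {spec}")
    c1, k1, s1, c2, k2, s2, hidden = map(int, match.groups())
    layers = []
    size, depth = input_size, channels
    for name, filters, kernel, pool in (("stage1", c1, k1, s1), ("stage2", c2, k2, s2)):
        size = size - kernel + 1  # unpadded, stride-1 convolution (assumed)
        params = filters * (kernel * kernel * depth + 1)  # weights plus one bias per filter
        layers.append((f"{name}/conv", (size, size, filters), params))
        size //= pool  # non-overlapping pooling (assumed)
        layers.append((f"{name}/pool", (size, size, filters), 0))
        depth = filters
    flat = size * size * depth  # flattened feature maps fed to the FC layer
    layers.append(("fc", (hidden,), flat * hidden + hidden))
    layers.append(("output", (n_classes,), hidden * n_classes + n_classes))
    return layers


# Example: 28x28 input and a hypothetical 4-class dataset.
layers = expand_architecture("32(3)-2-32(3)-2-128-c", input_size=28, n_classes=4)
for name, shape, params in layers:
    print(f"{name:12s} {str(shape):14s} {params}")
```

Under these assumptions most of the parameters sit in the fully connected layer, which is one reason the input resolution has a large effect on model size.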

The error rates of the proposed model are given in Table 1. Besides the comparison with the 1-NN baseline, three state-of-the-art TSC algorithms are also reported: Fast-Shapelets [33], Bag-of-Patterns (BoP, also known as BoF or BoW) [34, 35], and SAX-VSM [36]. Not all of these algorithms report results on all of the selected datasets, so there are some missing elements (shown with "-") in the table. We have also compared our results with two recent deep CNN models, i.e. Multi-scale CNN (MCNN) [21] and GAF-MTF [23, 24]. The former performs various transformations of the time-series, which extract features at different frequency and time scales, embedded in a CNN framework. The latter, which has a similar pipeline to ours, first encodes time-series as images using GAF and MTF, and then applies a Tiled CNN. As shown, the proposed model obtains 10 out of 20 wins and an average rank of 2.15. The second most accurate algorithm is MCNN (6 wins out of 19, and Ave.Rank = 2.36), followed by SAX-VSM (6 wins out of 19, and Ave.Rank = 3.0). The other deep CNN model that uses time-series images (GAF-MTF) ranks 5th. This highlights the fact that choosing the right time-series encoding is crucial for a CNN model. Furthermore, the TFRP algorithm [25] (using RP images in a traditional classification framework with different texture descriptors followed by an SVM classifier) ranks better than GAF-MTF. This suggests that RP introduces more discriminant and informative features than the GAF-MTF transform for TSC. Additionally, comparing the proposed method with TFRP highlights the fact that, for TSC of RP images, the CNN model is more effective than the traditional classification framework.

5. CONCLUSIONS

A novel pipeline for TSC is proposed. Taking advantage of the CNN's high performance on image classification, time-series signals are first transformed into texture images (using RP) and then handled by a deep CNN model. This pipeline offers the following advantages: i) RP enables us to visualize certain aspects of the m-dimensional phase space trajectory through 2D images, and ii) the CNN automatically learns different levels of time-series features and the classifier jointly and in a supervised manner. Experimental results demonstrate the superiority of the proposed pipeline. In particular, comparison with models using RP in the traditional classification framework (e.g. SIFT, Gabor and LBP features with an SVM classifier) and with other CNN-based time-series image classification methods (e.g. GAF-MTF images with a CNN) demonstrates that using RP images with a CNN, as in our proposed model, obtains better results.

As future work, CNN architectures with more feature representation layers should be investigated for more difficult TSC tasks (preferably with more data samples available). Large datasets are needed in order to train deeper architectures. Therefore, adapting the proposed pipeline to TSC with small sample sizes can be another interesting future direction. Exploring different ensemble learning methods [37] for CNNs can also be interesting. We will particularly be investigating the application of output coding [38, 39, 40] to CNNs.

ACKNOWLEDGMENTS

This research is partially supported by the French national research agency (ANR) under the PANDORE grant with reference number ANR-14-CE28-0027.

REFERENCES

[1] Wang, J., Liu, P., She, M., Nahavandi, S., Kouzani, A., Bag-of-words representation for biomedical time series classification. Biomedical Signal Processing and Control 8(6), 634-644, 2013.

[2] Hatami, N., Chira, C., Classifiers with a reject option for early time-series classification. IEEE Sym. on Computational Intelligence and Ensemble Learning (CIEL), 9-16, 2013.

[3] Wang, Z., Oates, T., Pooling sax-bop approaches with boosting to classify multivariate synchronous physiological time series data, FLAIRS Conference, 335-341, 2015.

[4] Jeong, Y., Jeong, M., Omitaomu, O., Weighted dynamic time warping for time series classification, Pattern Recognition 44(9), 2231-2240, 2011.

[5] Xing, Z., Pei, J., Yu, P., Early prediction on time series: A nearest neighbor approach, International Joint Conference on Artificial Intelligence (IJCAI), 1297-1302, 2011.

[6] Eads, D., Glocer, K., Perkins, S., Theiler, J., Grammar-guided feature extraction for time series classification. Proceedings of the Conference on Neural Information Processing Systems (NIPS05), 2005.

[7] Nanopoulos, A., Alcock, R., Manolopoulos, Y., Feature-based classification of time-series data, International Journal of Computer Research 10, 49-61, 2001.

[8] Rodriguez, J., Alonso, C., Interval and dynamic time warping-based decision trees, ACM Symposium On Applied Computing, 548-552, 2004.

[9] Krizhevsky, A. Sutskever, I. Hinton, GE. Classification with deep convolutional neural networks, Proceedings of the Conference on Neural Information Processing Systems (NIPS12), 1097-1105, 2012.

[10] Simonyan, K. Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv preprint arXiv:1409.1556, 2014.

[11] Karpathy, A. Toderici, G. Shetty, S. Leung, T. Sukthankar, R. Fei-Fei, L. Large-scale video classification with convolutional neural networks, Computer Vision and Pattern Recognition (CVPR) Conference, 2014.

[12] Deng et al., Recent advances in deep learning for speech research at Microsoft, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.

[13] Graves, A. Mohamed, A. Hinton, G. Speech recognition with deep recurrent neural networks, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.

[14] Eckmann, J. Kamphorst, S. Ruelle, D., Recurrence plots of dynamical systems, EPL (Europhysics Letters) 4(9), 973, 1987.

[15] Haselsteiner, E. Pfurtscheller, G., Using time-dependent neural networks for EEG classification, IEEE transactions on rehabilitation engineering 8 (4), 457-463, 2000.

[16] Lee, H. Largman, Y. Pham, P. Ng, A. Unsupervised feature learning for audio classification using convolutional deep belief networks, Proceedings of the Conference on Neural Information Processing Systems (NIPS09), 2009.

[17] Zheng, Y. Liu, Q. Chen, E. Ge, Y. Zhao, J. Time Series Classification Using Multi-Channels Deep Convolutional Neural Networks, Web-Age Information Management, 298-310, 2014.

[18] Dalto, M. Deep neural networks for time series prediction with applications in ultra-short-term wind forecasting, IEEE International Conference on Industrial Technology (ICIT), 2015.

[19] Abdel-Hamid, O. Deng, L. Yu, D. Exploring convolutional neural network structures and optimization techniques for speech recognition, Conference of the International Speech Communication Association (Interspeech), 3366-3370, 2013.

[20] Abdel-Hamid, O. Mohamed, A. Jiang, H. Penn, G. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012.

[21] Cui, Z. Chen, W. Chen, Y. Multi-scale convolutional neural networks for time series classification, arXiv:1603.06995, 2016.

[22] Yang, J. Nguyen, M. San, P. Li, X. Krishnaswamy, S. Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition, International Joint Conference on Artificial Intelligence (IJCAI) 2015.

[23] Wang, Z., Oates, T., Imaging Time-Series to Improve Classification and Imputation, International Joint Conference on Artificial Intelligence (IJCAI) 3939-3945, 2015.

[24] Wang, Z., Oates, T. Encoding time series as images for visual inspection and classification using tiled convolutional neural networks, Association for the Advancement of Artificial Intelligence (AAAI) conference, 2015.

[25] Souza, V. Silva, D. Batista, G. Extracting Texture Features for Time Series Classification, 1425-1430, International Conference on Pattern Recognition (ICPR), 2014.

[26] Souza, V. Silva, D. Batista, G. Time Series Classification Using Compression Distance of Recurrence Plots, 687-696, IEEE International Conference on Data Mining (ICDM), 2013.

[27] Marwan, N., Romano, M.C., Thiel, M., Kurths, J., Recurrence plots for the analysis of complex systems, Physics Reports 438(5-6), 237-329, 2007.

[28] Krizhevsky, A. Sutskever, I. Hinton, GE. Imagenet classification with deep convolutional neural networks, Proceedings of the Conference on Neural Information Processing Systems (NIPS12), 1097-1105, 2012.

[29] LeCun, Y. Bottou, L. Bengio, Y. Haffner, P. Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11), 2278-2324, 1998.

[30] Bouvrie, J. Notes on convolutional neural networks, 2006.

[31] LeCun, Y. Bottou, L. Orr, G. Müller, K. Efficient backprop, Neural networks: Tricks of the trade, 9-48, 2012.

[32] Chen, Y., Keogh, E., Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G., The UCR time series classification archive, URL www.cs.ucr.edu/~eamonn/time_series_data/, 2015.

[33] Rakthanmanon, T. Keogh, E. Fast shapelets: A scalable algorithm for discovering time series shapelets, SIAM International Conference on Data Mining, 668-676, 2013.

[34] Lin, J., Khade, R., Li, Y., Rotation-invariant similarity in time series using bag-of-patterns representation, Journal of Intelligent Information Systems 39(2), 287-315, 2012.

[35] Oates, T. et al. Exploiting representational diversity for time series classification, International Conference On Machine Learning And Applications (ICMLA), 2012.

[36] Senin, P., Malinchik, S., Sax-vsm: Interpretable time series classification using sax and vector space model, 1175-1180, International Conference on Data Mining (ICDM), 2013.

[37] Hatami, N. Some proposal for combining ensemble classifiers, PhD thesis University of Cagliari, 2012.

[38] Hatami, N., Thinned-ecoc ensemble based on sequential code shrinking, Expert Systems with Applications 39(1), 936-947, 2012.

[39] Hatami, N., Ebrahimpour, R., Ghaderi, R., Ecoc-based training of neural networks for face recognition, IEEE Conference on Cybernetics and Intelligent Systems , 450-454, 2008.

[40] Armano, G. Chira, C. Hatami, N. Error-Correcting Output Codes for Multi-Label Text Categorization, 26-37, Italian Information Retrieval Workshop (IIR), 2012.
