Generalization of Change-Point Detection in Time Series Data Based on Direct Density Ratio Estimation

2020·Arxiv

Abstract

Abstract

The goal of the change-point detection is to discover changes of time series distribution. One of the state of the art approaches of the change-point detection are based on direct density ratio estimation. In this work we show how existing algorithms can be generalized using various binary classification and regression models. In particular, we show that the Gradient Boosting over Decision Trees and Neural Networks can be used for this purpose. The algorithms are tested on several synthetic and real-world datasets. The results show that the proposed methods outperform classical RuLSIF algorithm. Discussion of cases where the proposed algorithms have advantages over existing methods are also provided.

Keywords: time series, change-point detection, machine learning, neural networks, density ratio

1. Introduction

Abrupt change of time series behaviour is called a change-point. Reasons behind these changes might be different and their detection helps to investigate them and properly react to them. For example, component failures of a system can be accompanied by change-points in time series. Detection of these changes allows to prevent the failure of the system or restore its functionality faster. Another example is signals measured in seismological stations. When seismic wave from an earthquake reaches the station it changes distribution of measured seismogram values. Detection of this change-point helps to predict coming earthquake.

Variety of change-point detection methods were proposed in recent decades and summarised in [1, 2]. This work is focused on a group of algorithms that are based on direct density-ratio estimation. There are two main ways to estimate probability densities ratio for two samples [3, 4]. The first one requires estimation of individual densities for each of the samples. The second way estimates the density ratio directly without estimation of the individual densities. There are a range of algorithms that allow to do this. It was demonstrated in [5] that density ratio can be estimated directly using Kernel Logistic Regression Clas-sifier. Later several other approaches based on kernels were proposed: Kernel Mean Matching (KMM) [6], the KullbackLeibler importance estimation procedure (KLIEP) [7], unconstrained least-squares importance fitting (uLSIF) [8, 9] and extension of uLSIF called relative uLSIF (RuLSIF) [10, 11].

The density ratios allow to calculate of different dissimilarity scores between two samples. These scores are used in change-point detection in time series data. Application of KLIEP method for the change-point detection was demonstrated in [12] with KullbackLeibler divergence as dissimilarity score. It was shown that KLIEP outperforms kernel logistic regression classifier. Application and comparison of KLIEP, uLSIF and RuLSIF methods for change-point detection in time series are provided in [2]. According to [1] these algorithms are considered as one of the best for change-point detection. It was demonstrated in [13] how Convolutional Neural Networks (CNN) can be trained with uLSIF loss function to detect outliers in images. Separation change point detection (SEP) [14] method also uses kernels with modified uLSIF loss function for change-point detection in time series.

One more interesting approach for the change analysis between two samples using decision tree and logistic regression classifiers was demonstrated in [15]. The idea of this method is that if the two samples have different distributions, a classifier will separate them and prediction accuracy will be significantly higher than 50%. However, this method does not estimate the density ratio and it was not demonstrated how it works for change-point detection in time series.

This work describes two approaches for change-point detection in time series data based on direct density ratio estimation. The first approach uses binary classifiers for direct density ratio estimation. The second approach demonstrates how regression models can be used for the ratio estimation. In particular, there are considered algorithms based on Gradient Boosting over Decision Trees and Neural Networks.

The paper is organized as follows. The problem statement and a quality metric are given in section 2. Direct density ratio estimation algorithms and dissimilarity scores are described in section 3 respectively. Section 4 provides description of change-point detection algorithm. Section 5 provides results of change-points detection in several time series. Finally, the conclusion is in section 6.

2. Problem Statement

Similarly to [2] we define the following notations. Consider a d-dimensional time series that is described by a vector of observations at time t. Let’s define sequence of observations for time t with length k as:

Figure 1: Example of change-point detection based on direct density ratio estimation with k = 10 and n = 500. (Top) Time series of the original 1D signal with change-point at quality metric calculation.

Sequence of observations is required to take into account temporal dependencies between these observations. Sample of sequences of size n is defined as:

It is also implied that observation distribution changes at time . The goal is to detect this change. As it will be shown in the next section, change-point detection algorithms are based on comparison of reference ) and test ) samples of the time series. The idea is to estimate dissimilarity score between these two samples. The larger dissimilarity the more likely the change-point occurs at time . Several score definitions are provided in the next section.

Consider a time series example with a change-point at time is shown in Fig. 1. The detection was performed with a sequence size k = 10 and n = 500 sequences in reference and test samples. The figure demonstrates that the dissimilarity score estimated by a change-point detection algorithm is small for times . This corresponds to a case when reference and test samples entirely located before the change-point. Then the score starts to increase when the test sample ) contains observations after and differs from the reference sample ). It reaches its maximum at t = 1500 when the two samples are on opposite sides of the change-point. After that the score reduces until the two samples have all observation after the change-point.

This example demonstrates that the dissimilarity score forms a peak with width of 2n timestamps after the change-point. Quality of the detection depends on relative height of this peak and variance of the score. The change-point is considered as detected at ˆwhen the score reaches a threshold value D(ˆ. If variance of the dissimilarity score is large, the threshold value should be large enough to avoid high false alarm rate. This leads to increasing of the detection time delay ˆ. When the time series distribution changes not very much at , the threshold might appear to be too large and the change-point will not be detected.

To take into account these two effects the time series observation are labeled in the following way. Label 0 is attributed for all timestamps except those with + 2n for all change-points . These observations have label 1 as it is demonstrated in Fig. 1. The change-point detection quality is measured by ROC AUC score that is calculated based on estimated dissimilarity score values and defined labels.

3. Dissimilarity Score Estimation

3.1. RuLSIF-Based Models

Consider a regression problem to estimate the density ratio w(X) directly. In the original RuLSIF [3, 8] algorithm the ratio is defined as:

where ) and ) are distribution densities for test ) and reference ) samples respectively; is an adjustable parameter. Let’s ˆw(X) is estimated ratio by a regression model. The model is fitted by minimizing the following loss function J(X):

where the last term is constant and it can be ignored during the minimization procedure. In discrete case the loss function takes the following form:

where and are numbers of sequences in reference and test samples respectively. We consider several algorithms based on this loss function.

The original RuLSIF algorithm is based on kernel methods. The density ratio ˆw(X) is estimated as:

where ) is a kernel function; () are parameters that are learned from the data samples. The kernel centers are randomly selected from the test sample. According [2] RuLSIF was used with Gaussian kernel for the change-point detection:

where is the kernel width. It is determined using cross-validation procedure. In this work RuLSIF algorithm is taken as baseline as it was described in [2]. It was fitted with 10 gaussian kernels, 1010

and L2 regularization parameter values 1010. The best values of the parameters were selected during the cross-validation procedure.

We propose to use other regression models to avoid limitations of kernel methods. In this work Gradient Boosting over Decision Trees (GBDT-RulSIF) and Neural Network (NN-RuLSIF) regression models are used for the direct density ratio estimation by minimizing the loss function in Eq. 5. GBDTRuLSIF algorithm is shown in Alg. 1. It uses 100 regression trees with maximum depth 6, learning rate 0.2 and 1 in the loss function. NN-RuLSIF is based on one-layer Neural Network with 10 hidden neurons with tanh activation and one output neuron with linear activation function. It was trained during 20 epochs using Adam optimizer with learning rate 9 and batch size 32. In the loss function it is taken 1. The algorithm is described in Alg. 2.

Estimated density ratios ˆw(X) are used to calculate dissimilarity between the reference and test samples. For the RuLSIF-based models the following dissimilarity score is used:

where w(X) is estimated by the RuLSIF-based models using ) and X(t) as reference and test samples respectively; ) is estimated by the models using ) and X(t) as test and reference samples respectively. So, in this case the models are fitted twice.

3.2. Binary Classifiers

In additional to RuLSIF-based methods we demonstrate how binary clas-sifiers can be used for the change-point detection in time series. Suppose the reference sample ) has label ) has label y = 1. Also suppose that a probabilistic binary classifier is used to separate these two samples. The classifier output is denoted as P(y = 1|X) and has the meaning of probability that a sequence of observations X belongs to the test sample. The distribution density ratio w(X) is defined as:

where ) and ) are distribution densities for test and reference samples respectively. In terms of the probabilistic classifier 1) and expressions:

Then the density ratio is defined as follows:

where P(y = 0) and P(y = 1) are estimated as:

where and are sizes of reference and test samples respectively. In this work it is supposed that the samples have the same sizes. Also, from the probabilistic classifier we know that ). Taking into account these two properties we write out the estimated density ratio ˆw(X) expression as:

where f(X) = P(y = 1|X) for simplicity. Different binary classifiers can be used instead of probabilistic one. In this case density ratios are estimated with some inaccuracy. However, it is enough for change-point detection in time series.

To estimate dissimilarity between the reference and test samples in case of using binary classifiers the following score is used:

where KL(P|Q) is the KullbackLeibler divergence that defined as:

where P and Q are probability distributions. Using this definition and Eq. 18 rewrite the score D for discrete case:

Scores based on other f-divergences also can be used. Conventional classifi-cation metrics like ROC AUC or accuracy are also suitable for the dissimilarity measures. However, our experiments show that the proposed score in Eq. 21 is better for the change-point detection problem in time series using binary classifiers.

In this work the following classifiers are used: Gradient Boosting over Decision Trees (GBDT) implemented in XGBoost python library and Neural Network (NN) using PyTorch library. GBDT model was fitted using 100 decision trees with maximum depth 6, learning rate 0.1 and binary cross-entropy loss function. NN model uses one-layer Neural Network with 10 hidden neurons with tanh activation and one output neuron with sigmoid activation function. It was fitted with binary cross-entropy loss function during 20 epochs using Adam optimizer with learning rate 9 and batch size 32.

4. Change-Point Detection Algorithm

Change-point detection algorithm is based on comparison of reference n) and test ) samples in two consecutive time windows. The RuLSIF-based models or binary classifiers are used to estimate dissimilarity score D(t) between the samples for the timestamp t. To do this the samples are splitted into train and validation subsamples. The train subsample is used to fit the models as it was described in Sec. 3. Then, these models are used to estimate the density ratios ˆw(X) on the validation subsample and to calculate the dissimilarity score D(t). This procedure is repeated for all timestamps with time step . If reference and test samples have the same distribution of observations the dissimilarity score will be close to 0. Larger difference between the distributions higher the dissimilarity score. Change-point is considered as detected at time ˆwhen , where is a threshold value. The change-point detection algorithm is described in Alg. 3.

For all experiments presented in this work the following parameter values are used: k = 10, n = 500, m = 1. For experiments with the Kepler data n = 200 was used due to small time intervals between two consecutive change points.

5. Experiments

5.1. Synthetic datasets

Similar to [2] the following synthetic datasets are used to demonstrate and compare change-point detection algorithms:

Dataset 1: The first dataset consists of two components. The first component is generated using 1-dimensional auto-regressive process:

where ) is Gaussian noise with mean and standard deviation changing mean in the following way:

Figure 2: Demonstration of the change-point detection algorithms performance on dataset 1. (Top) The first component of the time series. (Bottom) Results of the detection using RuLSIF and GBDT binary classifier.

where N is an integer that estimated as 2000(1) + 1 2000N. The second component is Gaussian noise that is generated as following:

where ) is Gaussian noise with mean deviation This component is added to demonstrate that nonkernels methods are less sensitive to uninformative input features.

Dataset 2: The second dataset is generated using the same auto-regressive process as for Dataset 1, but change-points are generated by changing standard deviation of the first component with constant mean

Dataset 3: The third dataset is a periodic 1-dimensional time series that generated as:

where ) is Gaussian noise with mean 5 and standard deviation

Figure 3: Demonstration of the change-point detection algorithms performance on dataset 2. (Top) The first component of the time series. (Bottom) Results of the detection using RuLSIF and GBDT binary classifier.

Each dataset was generated 10 times and the algorithms results were averaged. Demonstrations of change-point detection using RuLSIF and GBDT binary classifier for the synthetic datasets are provided in Fig. 2-4. ROC AUC values for all algorithms are presented in Tab. 1. The table shows that all of them work well and all new considered algorithms outperform RuLSIF. This is explained by the fact that RuLSIF is based on kernel methods and uses distances between observations as it is show in Eq. 7. Time series in datasets 1 and 2 contain the noise component that does not provide any information about change-points. This component decreases the dissimilarity score estimated by RuLSIF and makes the change-point detection harder. Algorithms based on Gradient Boosting and Neural Networks classifiers are able to recognize uninformative components and decrease their influence on the dissimilarity score. Moreover, they can estimate importance of all components and use it for better change-point detection.

In general, we recommend to use change-point detection algorithms based on NN or GBDT binary classifiers and regression models with RuLSIF loss function instead of kernel methods in case of high dimensional time series. They have better results when not all time series components describe change-points, less sensitive to uninformative components and are able to recognize more complicated changes of time series distribution.

Figure 4: Demonstration of the change-point detection algorithms performance on dataset 3. (Top) The first component of the time series. (Bottom) Results of the detection using RuLSIF and GBDT binary classifier.

Table 1: ROC AUC values for change-point detection algorithms and for the synthetic datasets.

5.2. Real-world datasets

To demonstrate work of the change-point detection algorithms on real-world data the two following datasets are used:

The Kepler data: Data from the Kepler spacecraft that was launched in March 2009. Its mission was to search for transit-driven exoplanet located within the habitable zones of Sun-like stars. In this work we use the Kepler light curves [16] with Data Conditioning Simple Aperture Photometry (DCSAP) data for 10 stars with exoplanets.

IRIS data: The second dataset consists of data from Incorporated Research Institutions for Seismology (IRIS). For the demonstration we took seismograms [17] recorded on seismological stations over the world for one earthquake.

Example of the Kepler light curve is demonstrated in Fig. 5. The spacecraft measured light flux coming from a star. When an exoplanet goes between the star and the spacecraft the light flux drops as it is shown in the figure. Parameters of this drop allows to estimate properties of the exoplanet. One

Figure 5: (Top) The Kepler light curve for a star with an exoplanet. (Bottom) Dissimilarity score estimated by change-point detection algorithm based on RuLSIF and GBDT classifier.

Table 2: ROC AUC values for change-point detection algorithms and for the real world datasets.

light curve has about 10 such drops and, as result, about 20 change-points. Experiments were done on 10 light curves of different stars and results were averaged. Demonstrations of the change-point detection based on RuLSIF and GBDT classifier are shown in Fig. 5. ROC AUC values for the all algorithms are provided in Tab. 2.

Example of a signal recorded by a seismological station is show in Fig. 6. When seismic wave reaches the station the signal variance increases and changes with time. Before an earthquake detected dissimilarity score takes small values. The score increases and forms the first peak right after the seismic wave registration. Then the score changes with changing of the signal. It has trend to slow down with the seismic wave attenuation as it is demonstrated in the figure. ROC AUC values for the all algorithms are provided in Tab. 2.

6. Conclusion

Two different approaches for change-point detection in time series based on direct density-ratio estimation are presented in the work. It was demonstrated that this problem can be solved using different binary classification and regression algorithms in machine learning, but not only based on kernel methods. In

Figure 6: (Top) IRIS seismogram from one station. (Bottom) Dissimilarity score estimated by change-point detection algorithm based on RuLSIF and GBDT classifier.

particular, change-point detection algorithms based on Gradient Boosting over Decision Trees and Neural Networks classification and regression were discussed in this work. It was shown that the proposed algorithms outperform the classical RuLSIF on a set of synthetic and real-world datasets.

7. Acknowledgments

We wish to thank Denis Derkach and Vladislav Belavin for their useful comments and discussions of this work.

The research was carried out with the financial support of the Ministry of Science and Higher Education of Russian Federation within the framework of the Federal Target Program Research and Development in Priority Areas of the Development of the Scientific and Technological Complex of Russia for 2014-2020. Unique identifier RFMEFI58117X0023, agreement 14.581.21.0023 on 03.10.2017.

References

[1] S. Aminikhanghahi, D. J. Cook, A survey of methods for time series change point detection, Knowledge and Information Systems 51 (2) (2017) 339– 367. doi:10.1007/s10115-016-0987-z. URL https://doi.org/10.1007/s10115-016-0987-z

[2] S. Liu, M. Yamada, N. Collier, M. Sugiyama, Change-point detection in time-series data by relative density-ratio estimation, Neural Networks 43 (2013) 72 – 83. doi:https://doi.org/10.1016/j.neunet.2013.01.012. URL http://www.sciencedirect.com/science/article/pii/ S0893608013000270

[3] T. K. Masashi Sugiyama, Taiji Suzuki, Density Ratio Estimation in Ma- chine Learning, Cambridge University Press, 2012.

[4] M. K. Masashi Sugiyama, Machine Learning in Non-Stationary Environ- ments: Introduction to Covariate Shift Adaptation, The MIT Press, 2012.

[5] S. Bickel, M. Brckner, T. Scheffer, Discriminative learning for differing training and test distributions, Vol. 227, 2007, pp. 81–88. doi:10.1145/ 1273496.1273507.

[6] J. Huang, A. Smola, A. Gretton, K. Borgwardt, B. Sch¨olkopf, Correcting sample selection bias by unlabeled data, in: Advances in Neural Information Processing Systems 19, Max-Planck-Gesellschaft, MIT Press, Cambridge, MA, USA, 2007, pp. 601–608.

[7] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bnau, M. Kawanabe, Direct importance estimation with model selection and its application to covariate shift adaptation., Vol. 20, 2007.

[8] T. Kanamori, S. Hido, M. Sugiyama, A least-squares approach to direct importance estimation, Journal of Machine Learning Research 10 (2009) 1391–1445. doi:10.1145/1577069.1755831.

[9] M. Sugiyama, T. Suzuki, T. Kanamori, Density ratio matching under the bregman divergence: A unified framework of density ratio estimation, Annals of the Institute of Statistical Mathematics 64. doi:10.1007/ s10463-011-0343-8.

[10] T. Kanamori, T. Suzuki, M. Sugiyama, Computational complexity of kernel-based density-ratio estimation: A condition number analysis, Machine Learning 90. doi:10.1007/s10994-012-5323-6.

[11] M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, M. Sugiyama, Relative density-ratio estimation for robust distribution comparison, Neural Comput. 25 (5) (2013) 1324–1370. doi:10.1162/NECO_a_00442. URL https://doi.org/10.1162/NECO_a_00442

[12] Y. Kawahara, M. Sugiyama, Sequential change-point detection based on direct density-ratio estimation, Statistical Analysis and Data Mining 5. doi:10.1002/sam.10124.

[13] H. Nam, M. Sugiyama, Direct density ratio estimation with convolutional neural networks with application in outlier detection, IEICE Transactions on Information and Systems E98.D (2015) 1073–1079. doi:10.1587/ transinf.2014EDP7335.

[14] S. Aminikhanghahi, T. Wang, D. J. Cook, Real-time change point detection with application to smart home time series data, IEEE Transactions on Knowledge and Data Engineering 31 (5) (2019) 1010–1023. doi:10.1109/ TKDE.2018.2850347.

[15] S. Hido, T. Id´e, H. Kashima, H. Kubo, H. Matsuzawa, Unsupervised change analysis using supervised learning, in: Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD’08, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 148–159. URL http://dl.acm.org/citation.cfm?id=1786574.1786592

[16] Kepler and K2 Science Center, Kepler and K2 data products, https:// keplerscience.arc.nasa.gov/data-products.html (2019).

[17] Incorporated Research Institutions for Seismology (IRIS), IRIS seismograms, Viewer: http://ds.iris.edu/gsv/, Builder: http: //services.iris.edu/irisws/timeseries/docs/1/builder/ (2019).