Temporal Tensor Transformation Network for Multivariate Time Series Prediction

2020·Arxiv

Abstract

Abstract

Multivariate time series prediction has applications in a wide variety of domains and is considered to be a very challenging task, especially when the variables have correlations and exhibit complex temporal patterns, such as seasonality and trend. Many existing methods suffer from strong statistical assumptions, numerical issues with high dimensionality, manual feature engineering efforts, and scalability. In this work, we present a novel deep learning architecture, known as Temporal Tensor Transformation Network, which transforms the original multivariate time series into a higher order of tensor through the proposed Temporal-Slicing Stack Transformation. This yields a new representation of the original multivariate time series, which enables the convolution kernel to extract complex and non-linear features as well as variable interactional signals from a relatively large temporal region. Experimental results show that Temporal Tensor Transformation Network outperforms several state-of-the-art methods on window-based predictions across various tasks. The proposed architecture also demonstrates robust prediction performance through an extensive sensitivity analysis.

Index Terms—multivariate time series, prediction, convolution, deep learning, tensor transformation

I. INTRODUCTION

Multivariate time series analysis has gained wide spread applications in many fields, e.g., financial market prediction, weather forecasting, and energy consumption prediction. It is used to model and explain the underlying temporal patterns among a group of time series variables in dynamical systems. Various methods have been proposed to predict multivariate time series based on statistical modeling and deep neural networks.

Classical statistical models assume that the time series is stationary, i.e., the summary statistics of data points are consistent over time. Preprocessing procedures are usually needed to remove trend, seasonality, and other time-dependent structures from the raw series in order to make the data stationary. In addition, these models also assume the independence condition in the underlying linear regression problem, i.e., the random errors in the model are not correlated over time. Autocorrelation and partial autocorrelation functions are usually applied to identify the appropriate order of variables. Constructing statistically meaningful prediction models requires performing various preprocessing, transformations, and feature engineering, which are time consuming and difficult to scale. On the other hand, deep learning-based approaches, such as recurrent neural network, have demonstrated state-of-the-art performance in modeling time series data through the use of stateful models. Recently, convolutional neural network (CNN) has also been applied to predict multivariate time series. Specifically, CNN is used as a stateless model to directly extract features from raw time series and generate predictions, or is used as part of the feature extraction step within a RNN architecture.

Existing work of using CNN for multivariate time series prediction treats the time series as an image. For example, the number of variables1 is equal to the width of the image while the number of time steps is equal to the length of the image. The convolution is conducted over the temporal-variable plane. In this context, one of the key underlying structural premises of the convolution operation assumes a local spatially dependent topology of the data, as opposed to the dense layer where all the inputs are jointly modeled. That is, in a convolutional layer, neurons receive input from a restricted sub-area of the previous layer (aka the receptive field), whereas neurons in the dense layer receives input from the entire previous layer. Therefore, when the CNN kernel convolves over the variable-temporal plane, it is only able to observe a narrow set of variable interactions within a given time window. Alternatively, convolution operations where we consider each feature in the time series as an independent channel also suffers from similar issues where the kernel is only able to observe a small local-window of time steps as its filters convolve over the data. In both scenarios, the limited view of the receptive field renders a local focus of the time series. Different from image processing, where objects in different regions could be quite distinct, time series data tend to be relatively “homogeneous”. The prediction of future values depend more on the global pattern within the historical time window, rather than a local pattern. In this paper, we propose a novel neural network architecture for multivariate time series prediction with a new class of transformation functions known as temporal-slicing stack transformation. These operations transform the original input data structure into a higher-order tensor, where the individual features in the time series are rearranged from a 1D temporal sequence into a 2D matrix. This transformation expands the view of the receptive field. As a result, the convolution is operated on a temporally larger region, which may help capture time-dependent features such as trend and seasonality, as well as variable interactions in a longer range.

The temporal-slicing stack transformation is demonstrated

Fig. 1. Visualization of the Temporal-Slicing Stack Transformation process

by Figure 1, where a window is utilized to slide over the 2D time series data and extract a collection of time series slices. These slices are then correspondingly stacked on top of each other sequentially to form a 3D tensor. Specifically the three dimensions are “features”, “time”, and “stack”. For multivariate time series, the transformation will convert a 2D time series into a 3D tensor. For univariate time series, the transformation will render a 2D matrix. The resulting structure yields an emergent pattern, where a spatial feature extractor, such as CNN, can explicitly model complex and nonlinear autocorrelational-like features. To illustrate this, we use the transformation of univariate time series as an example. Figure 2 shows the transformation results of univariate time series from four different datasets. We also show their corresponding autocorrelation plots. The transformation demonstrates strong patterns for time series of high autocorrelations. Comparing with the original 1D time series, the 2D matrix has much richer information and therefore may enable more informative feature extraction.

The following sections of this paper are organized as follows. We highlight various statistical and deep learning approaches for time series analysis in Section II. Definitions of the temporal tensor transformation, the temporal-slicing stack transformation, and the proposed model architecture are described in Section III. The experimental setup, dataset, metrics, and results are presented in Section IV. In Section V, we perform sensitivity analysis based on experiments with synthesized datasets and controlling the input and output dimensions. We finally conclude and discuss future directions in Section VI.

II. RELATED WORK

In this section, we review the prior work for multivariate time series prediction. We first introduce some of the known classical statistical methods and then present some of the state-of-the-art deep learning methods for time series modeling.

Time series prediction has been well studied with various statistical modeling methods for both univariate and multivariate data. For univariate time series modeling, the Auto Regressive Integrated Moving Average (ARIMA) model is often applied using the Box-Jenkins modeling process [1]. Other variants of the ARIMA models have been proposed to model temporal patterns of the data, for example, seasonality (SARMA) and coefficient dependent periodicity (PARMA) [2]. Furthermore, methods combining ARIMA and neural networks have also been developed to model both linear and non-linear dynamics [3]. For multivariate time series prediction, vector based methods, such as Vector Auto Regression (VAR) methods, extend the Auto Regressive (AR) models for univariate time series modeling [1] [4]. Variants of the VAR model have also been proposed, such as the VARMAX model [5] and the VARX model [6], where the model subsumes properties of the original VAR model and jointly learns interactions between the given variables. Additionally, Gaussian Process [7], as a non-parametric statistical model, is used to predict over a distribution of continuous variables, as opposed to the aforementioned parametric methods. Statistical models often have very high computational complexity and face numerical issues, when the number of variables in the time series is high.

With the advent of deep learning, new neural network architectures have been proposed for multivariate time series prediction. Borovykh et al. [8] develop a multivariate time series model based on the WaveNet architecture [9], which is originally designed for speech audio signal processing. They augment the original architecture by simplifying and optimizing its core algorithms with the dialated convolution to capture long-term multivariate temporal data with noisy signals. Another multivariate time series modeling framework is proposed by Lai et al. [10], namely LSTNet, which combines CNN and RNN to extract hierarchical short-term and long-term temporal dependencies from the time series. In addition to model dependencies, LSTNet also accounts for the autoregressive component of the model as a residual connection between the CNN and LSTM components. Qin et al. [11] develop a dual-stage attention-based neural network architecture for multivariate time series modeling, which utilizes two sets of RNN as an encoder-decoder based architecture. Through each stage of the attention-based network architecture, the method attends to both the feature and temporal dimensions to adaptively select the relevant driving series.

III. MODEL ARCHITECTURE

In this section, we present the proposed Temporal Tensor Transformation Network architecture. First, we introduce the key notations used in the paper as well as the formal problem definition for multivariate time series prediction. Then we introduce the Temporal Tensor Transformation operation. Finally, we present the proposed neural network architecture.

A. Notations

We use x to denote a one-dimensional vector, X to denote a two-dimensional matrix, and X to represent a threedimensional tensor. Scalars with respect to their corresponding indices are denoted by lower-case letters, followed by subscripted letter triplets. For instance, of a 3D-tensor X indicates the (i, j, k)-th scalar value of the tensor X. Further notation and variables will be introduced and defined as they appear in the context.

Fig. 2. Comparison of the autocorrelation plots and the resulting Temporal Tensor Transformations

B. Problem Definition

We now formally define the multivariate time series prediction problem. Given a complete multivariate time series dataset , where is the number of features, and n is the total number of time steps in the multivariate series, our objective is to predict a future series of values up to a defined horizon window. Specifically, given a subset of the time series up to time step T, our model, denoted as function , will take in an input series of , and subsequently generate an output sequence of , where h is the horizon window. Hence, the model can be formulated as the mapping function .

C. Temporal Tensor Transformation

The proposed transformation augments the original data by adding a single higher-order dimension to the original data input. For example, if the input series data is a 1D vector of values (i.e., univariate time series), the transformed data structure will be a 2D matrix structure. Likewise, if the input is a 2D matrix (i.e., multivariate time series), the resulting structure will be a 3D tensor.

We define the Temporal Tensor Transformation as a mapping function , where is the input multivariate time series and the resulting transformation generates a 3D tensor . Here, is the slice window size (i.e., the number of steps within a time window), and o is the number of slices or the stack height of the resulting transformed tensor. The value of o can be computed based on the hyperparameters. Our specific temporal tensor transformation is referred to as the Temporal-Slicing Stack Transformation. Before introducing the Temporal-Slicing Stack operation, we will first introduce the hyperparameters involved in the slicing process.

The window size, denoted by , defines the overall slicing window of the transformation function, and determines one of the major dimensions of the output tensor. The length of the slicing window is fixed by the number of features in the time series, i.e., m. Therefore, the resulting slicing window is of dimension . The stride parameter of the slicing window, denoted by s, indicates how many time steps to advance the slicing window along the temporal dimension. The greater the stride, the lower the overall pattern resolution, due to the loss of information among contiguous values within the time series. The dilation parameter d is similar to the parameter introduced by Yu et al. [12] for dilated CNN, which allows us to slice a wider receptive window size without compromising to limited memory space. The padding value p is similar to how CNN inherently increases the dimensionality of the original data structure. It allow for shift-invariant transformations as well as retaining the dimensional size of the data. Specifically, padding in our context refers to the process of symmetrically appending a set of 1D-vectors of size m to both ends of the input time series matrix along the temporal dimension. However, we hypothesize that such values appended to the time series data may be potentially problematic, due to the risk of contaminating the original series data with noise or out-of-distribution values. Thus, we need to carefully choose the padding values, for example, using the same adjacent values or the local mean of vectors within a predetermined time window.

Given the hyperparameters defined above, we can deterministically derive the number of slices o as follows:

We summarize the Temporal-Slicing Stack Transformation in Algorithm 1. For simplicity, we do not consider padding and dilation in this formulation (i.e. p = 0 and d = 1).

D. Neural Network Architecture

In this section, we describe the deep learning architecture on top of the proposed temporal tensor transformation, which is referred to as TSSNet. The neural network architecture is based on a fairly simple network structure, as shown in Figure 3.

Fig. 3. Temporal Tensor Transformation Network Architecture

1) Temporal-Slicing Stack Transformation: Given the initial input multivariate time series X, we first perform the temporal tensor transformation as described in Algorithm 1. As noted previously, the Temporal-Slicing Stack transformation is de-fined as a mapping of , where the input data is a two dimensional matrix and the output is a three dimensional tensor .

2) Convolutional Neural Network: After transforming the input time series data to , we utilize a CNN to extract features from the tensor along the plane. That is, we treat each feature as a separate channel. The total number of channels is therefore equal to the number of features m. We also fix one dimension of the CNN kernel to be the stack height o. The dimension of the kernel is thus , where k is the width of the convolving kernel. Suppose we have l CNN kernels, the i-th kernel learns the following set of weights:

where denotes the convolution operator, and are the weight and bias parameters, respectively. Note that during this process, we do not apply any sort of activation function as the model is sensitive to any type of operation that can collapse the range of values from the input (e.g., negative values will become zero in RELU). We empirically set the number of kernels to be m. The output feature maps from the convolution operation is subsequently fed into a max pooling operation, which is applied over the same plane. We repeat these two operations twice in a successive manner.

3) Dense Layer: After the features are extracted from the CNN layer, the corresponding feature maps are then flattened out as a single vector. It is then subsequently fed into one fully connected hidden layer, :

where h is the 1D flatted out feature map from the previous CNN feature maps, and and are the weight and bias parameters, respectively. As a heuristic, the dimension of the hidden weight parameters used in this dense layer is usually greater than that of the output layer dimensions. Finally the subsequent latent vector is fed into the output layer, where we learn the following set of parameters:

where and are the weight and bias parameters. The final output is a 1D vector, , which can also be reshaped as a 2D matrix of dimensions .

E. Objective Function

To train the proposed neural network architecture, we minimize the squared error loss function:

where the cost is minimized w.r.t. the parameters , and denotes the Frobenius norm.

F. Optimization Method

To optimize the cost function, we utilize canonical methods for optimizing standard neural network. Specifically, we can apply common gradient-based methods such as stochastic gradient descent (SGD) or the Adam algorithm [13].

IV. EXPERIMENT

We evaluate the performance of the proposed neural network architecture in this section. We first introduce several baseline models used for the benchmark comparison. We then introduce the evaluation metrics and data sets, and finally present the experimental results.

A. Baseline Models

To compare the performance and robustness of our proposed model, we benchmarked against the following methods:

• Vector Auto Regression (VAR)

• Long Short-Term Memory (LSTM)

• Gated Recurrent Unit (GRU)

• 1D Convolutional Neural Network (CNN)

• LSTNet [10] All implementations were developed in PyTorch [14], with the exception of the Vector Autoregression (VAR) model which is

based on the Python StatsModels package [15]. our evaluations are performed across models which are stateful (i.e. VAR and RNN variants), stateless (i.e. CNN variants), and a hybrid of the two (i.e. LSTNet).

For Recurrent Neural Network variants, we implemented a many-to-one single layer vanilla architecture for the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) with both having two fully connected layers appended at the end. Likewise, the 1D Convolutional Neural Network (CNN) architecture utilizes a single 1D CNN and a max pooling block with two fully connected layers. In this architecture, we treat the 2D input time series as a single channel image and perform convolution over the temporal-variable plane. Our LSTNet architecture is based on the code package from Lai et. al [10]. For all the neural network based models, they generate a batched multistep prediction values instead of a single time step output. For Vector Auto Regression, the predictions are generated based on an iterative manner, where the prediction at the current time step is appended to the previous input in order to predict the value at the next time step.

B. Evaluation Metrics

To compare all the methods, we use the following two evaluation metrics. The objective of our models is to minimize the Root Mean Squared Error (RMSE), while jointly maximizing the empirical correlation coefficient (CORR).

Root Mean Square Error (RMSE):

where Y, such that n is the total number of samples being computed. In this formulation we aggregate the error across all of the m different feature for the total h horizon of the predicted values.

Empirical Correlation Coefficient (CORR):

where Y, , , , each denotes the ground truth value, the predicted value, the mean value of the ground truth for each feature, and the mean value of the predicted value for each feature, respectively.

C. Datasets

We use four benchmark datasets across a wide variety of domain applications to compare the performance of all the models. Each dataset presented here has various degrees of non-stationarity. This allows us to evaluate the potential strengths and weaknesses of our proposed model architecture. Due to the memory-constraints, for each dataset, excluding the Exchange Rate dataset, we select at random ten features as a method of dimensionality reduction. We plot the autocorrelational functions for each dataset, as shown in Figure 4.

Fig. 4. Autocorrelation function plots of the raw datasets

1) Electricity Usage [16]: The dataset contains electricity consumption recordings sampled at 15 min intervals between the year of 2012 to 2014 for 321 clients. The dataset was pre-processed to reflect the hourly consumption rate of the electricity usage instead. From the ACF plot, we see that different features have different levels of autocorrelation, notably a strong presence of seasonality as most of the features tend to have a consistent lag at around 24 hours.

2) Foreign Exchange Rates: The data set was collected over a period of 26 years between 1990 and 2016, and consists of the daily exchange rates from eight different countries. The dataset contains a total sample size of 7,588. Out of the four datasets, it does not show strong presence of the auto-correlational signal.

3) Solar Power Production [17]: The dataset comes from the Solar Power Data for Integration Studies, which contains the solar power and hourly day-ahead forecasts for approximately 6,000 simulated PV plants, at a sampling rate of every 5 minutes, in the year of 2006. From the ACF plot, we can see that the selected features all have uniform seasonality patterns with lag values around 140 hours.

4) San Francisco Traffic [18]: The dataset is from the Caltrans Performance Measurement System (PeMS), where traffic data was collected in real-time from over many detectors over the span of one year between 2015 and 2016, in the San Francisco Bay Area freeway. These detectors represent the density of the traffic between 0 and 1, where 0 indicates no traffic and 1 represents high amounts of congestion. From the ACF plot, we can see that most of the selected features have a seasonality with lag values around 160 hours.

For evaluation, we split the dataset into three different partitions in a chronological manner, with the first 60% of the dataset will be used for training the model, the next 20% of the data will be used for computing validation scores, and the remaining 20% of the data will be used to test the model.

D. Hyperparameters

For training models, we tuned the hyperparameters using the Gaussian Process Optimization Framework (GPyOpt) [19]. We focused the parameter optimizations on the TemporalSlicing Stack Transformation, in particular, the window size, , between the range of 5 to 10, and the stride, s, between the range of 1 to 5. Furthermore, we also tuned the learning rate of our model between values of [0, 0.01] and applied gradient clipping value of 10. Furthermore, we correspondingly adjusted the hidden weight dimensions of the model such that the dimension of the first dense layer had a greater number of parameters than the second output dense layer, or the final output dimension of the model. We perform 100 iterations of the modeling process and select the model with the highest empirical correlation coefficient score as our objective value to optimize over.

E. Experimental Results

We present the experimental results on the two defined evaluation metrics in Table I. For each dataset, we provide the model a fixed input of T = 168 time steps and obtained predictions against a horizon window of 15, 30, 60, and 120 time steps. We note that these horizons are multi-time step windowed predictions, as opposed to a single-point estimation. The best performing models with respect to the empirical correlations are in bold for each dataset and horizon. We also provide a comparative graph summary of the Empirical Correlation Coefficient as shown in Figure 5.

Based on the empirical results, our proposed model (denoted by TSSNet) outperforms the baseline metrics for the Electricity, Solar Energy, and Traffic datasets, and outperforms some of the models for the Exchange Rates dataset. We attribute our analysis based on two key observations: i) the use of stateful versus stateless model types and ii) the autocorrelations present in the dataset.

In Figure 5, we observe a noticeable cluster between stateless models (i.e. 1D CNN, LSTNet, and TSSNet) and stateful models (i.e. LSTM, GRU, and VAR) in terms of the overall performance. Furthermore, we observe a sharper performance degradation over larger temporal horizon for stateful models than stateless models. Within stateless models, our proposed model consistently retains an upper bound performance, while the 1D CNN model primarily remains as a lower bound with the LSTNet in between. This is also present in the Exchange Rate dataset.

The other factor affecting the overall performance of our model is the presence of autocorrelations in the dataset, in particular, the seasonality. As previously shown in Figure 4, we have noted that there is a strong presence of autocorrelationa within the Electricity, Solar Energy, and Traffic datasets, while weaker in the Exchange Rate dataset. This realization in context with our empirical results shed light on some observations. For highly autocorrelational data, stateless models tend to perform better over stateful models. However, with weaker autocorrelational signals such as the Exchange Rate dataset, our model performs better than the stateless models, but marginally worse than the stateful model variants.

In particular, our experimental results with 1D CNNs and LSTNet demonstrate that the methods in how convolution is applied over the temporal data can lead to difference in performance. Hence, this comparison indicates that our transformation method enables high efficiency of information gain with respect to the autocorrelational features captured over a large temporal region.

We also plot sample predictions against our test set at horizons h = 120, with an empirical comparison against LSTNet. The results are shown in Figure 6. We have chosen to evaluate LSTNet for its marginally close performance in terms of the empirical correlation coefficient, but having slightly lower RMSE scores as shown in Table I. From Figure 6, we can observe how effectively our model can predict non-linear patterns as the model is able to adapt to most of the acute signals in the data. From the Electricity dataset, we observe much more precise adaptations to the peaks and troughs of the cyclical signals. In the Traffic dataset, the proposed TSSNet is able to capture the sinusoidal patterns, however is slightly less precise to sudden peaks such as those found in the last two cycles of the dataset. In the Solar Energy dataset, we observe a closer fit to the overall shape of the data, as it is able to very closely capture both the non-linear and linear components of the data. However, for the Exchange Rate dataset, we see that both models have a relatively hard time with the predictions due to the high noisy outputs. In TSSNet, we observe that the predictions are centralized around a consistent range of values, while LSTNet’s predictions are sporadically shifting the prediction values from time to time.

V. SENSITIVITY ANALYSIS

In this section, we perform sensitivity analysis to evaluate and understand the advantages and limitations of our proposed architecture. This analysis fixates the intermediate model architecture and manipulates the input and output data parameters in a controlled manner. We observe the resulting effects to the overall performance of the model and draw conclusions on the robustness of our architecture.

In particular, our objectives in this analysis are two folds. First, we empirically study and demonstrate the underlying properties of how our proposed model is able to learn emergent autocorrelational patterns from controlled and synthesized inputs. We subsequently analyze the activation maps to better understand the extracted features. In our second analysis, we evaluate the robustness of the proposed model architecture by varying the input and output dimensions. These two experiments will help us better identify some of the core inner workings and draw insights of the model.

A. Synthesized Data Sensitivity Analysis

In this section, we describe methods used for observing the underlying feature maps that are generated by CNN. We perform these experiments to empirically identify properties of autocorrelational signals such as seasonality and trend. For

TABLE I RMSE AND CORR OF MULTIVARIATE TIME SERIES PREDICTION

Fig. 5. Empirical correlation coefficient plots by prediction horizon

Fig. 6. Comparison of TSSNet and LSTNet predictions

this experiment, we generated a set of synthetic datasets from simple functions which explicitly exhibit properties of seasonality and trend, as demonstrated in Figure 7. Through these controlled and simplified experiments, we can uncover some of the underlying mechanisms for extracting autocorrelational features from the data.

We overfit our models with inputs from known simple deterministic functions, such as the sine and linear functions. We then evaluate the resulting activation maps (after the first convolutional layer) generated by the model, in order to identify which salient features are extracted by the learned CNN filters. Furthermore, to evaluate robustness of our model, we inject noise of various degrees by adding a noise factor within the sine function argument as , where is the ratio of the noise present, and is a random number sampled from a standard normal distribution. For each function, we empirically evaluate the robustness by increasing the degree of from 0 to 0.75.

Based on the experimental formulation above, we demonstrate the feature maps generated by the CNN layer, as presented in Figure 7. Along the vertical axis, we list the different deterministic functions utilized to generate the synthetic dataset, while on the horizontal axis we show the effect over the different rates of noise injection.

Empirically, we found that the CNN filters perform the roles of both a feature extractor as well as a de-noising component of the model. For the first function, y = sin(x), we can observe that the model is able to pick up the uniform repetitive pattern of the data as exhibited by the uniform checkerboard like pattern. Even with the injected noise, we see that the filter is able to de-noise a fair amount of the artifacts from the model and still retain the core checkerboardlike pattern as exhibited when . In the second function, y = sin(x) + x, we observe across all values of , the activation map shows a uniform gradient-like pattern. This indicates that the model attenuates the sin(x) component of the function and focuses only on the linear component of

Fig. 7. CNN feature maps of synthesized function datasets

the model. In essence, the filter treats sin(x) as a constant and emphasizes the monotonically increasing nature of the time series data from the linear x component of the equation. Hence, the resulting activation map reveals a gradient like pattern which indicates an increasing trend pattern. For the third and fourth functions, we observe the same uniform checkerboard pattern similar to the first function. Furthermore, we observe a slight gradient effect indicating that the values on the left region of the plot is darker and the right region is lighter. This implies that the model’s learned filter can extract both the seasonal component as well as the linear trend pattern concurrently from the synthetic dataset. The fourth function contains a slightly darker pattern which is indicative of the extra linear factor.

From these controlled experiments, we empirically demonstrate the effectiveness of using CNNs as feature extractors to identify core autocorrelational components of the model, notably, seasonality and trend from the transformed temporal tensor representations. Furthermore, we also demonstrate that such learned filters are able to act as de-noising filters, which can enable generalizations for noisy input time series.

B. Sensitivity Analysis on Input vs Prediction Horizon

In our previous experiments, we have only evaluated our models against a fixed input size of 168 time steps while we correspondingly varied the output horizon time steps for prediction. In this experiment, we perform a sensitivity analysis by varying both the input and output dimensions. One key motivation behind this experimentation is to assess the balance of the input and output size and how they influence the overall predictive power of the model. Various degrees of temporal granularity, such as daily, weekly, or monthly perspectives of the input data can strongly influence the respective outcome of the prediction due to the observational scope of the information provided to the model. This is particularly important when considering autocorrelational patterns, as long-range dependencies of repetitive patterns can significantly change with respect to the different temporal granularities of the data. As a result, the representation the model learns will also differ and consequently affect the prediction performance. For this analysis, we constrain our experiment to only evaluate the performance of our proposed model architecture (i.e. TSSNet). We vary the input sizes by 32, 64, 128, and 256, while also vary the output horizon sizes by 15, 30, 60, and 120, respectively. We perform experiments with every combination of these input and output dimensions for evaluation.

The experimental results are shown in Table II. We also provide two plots, which evaluate the empirical correlation coefficient performance degradation from the perspective of the input and output dimensions respectively, as shown in Figures 8 and 9. From these plots, we can draw upon several key insights regarding the overall robustness of our model, as well as some heuristics which can potentially help with improving the overall performance of the model.

In Figure 8, we present a plot which demonstrates how the empirical correlation coefficient performance with different input dimensions degrades with respect to the different output horizon dimensions. First, we can observe a similar performance degradation effect shown from our previous experiments. In particular, one key observation to note is the variance of the correlation with respect to each of the different output horizons are greater when the data is not highly autocorrelated. For example, the Exchange Rate time series does not have strong autocorrelation, therefore exhibiting a wider variance of correlation scores for the input window size with respect to the output horizon dimensions. In contrast, for datasets such as the Electricity and Traffic, the overall correlation performance degradation is relatively smaller and its variance bounds are significantly tighter. However, one detail to make note of in Figure 8 is the top performing models for each of the different datasets. While in the Exchange Rate dataset, the model with input window of 32 performs the best, the other datasets mostly have larger input window sizes performing better in general. This results suggest that the overall input size can significantly affect the performance of the model. This also provides further evidence that our proposed model can better extract features from data which exhibits long-range autocorrelational patterns.

In Figure 9, we present a different view of the results

TABLE II INPUT VS HORIZON SENSITIVITY ANALYSIS RESULTS

Fig. 8. Input size vs output horizon correlation degradation

Fig. 9. Output horizon vs input size correlation degradation

which demonstrates how the empirical correlation coefficient performance with different output dimensions is influenced by varying the input sizes. The motivation behind this analysis allows us to evaluate the overall robustness of the different output horizon dimensions of the model’s prediction with respect to varying the overall input window dimensions. Unlike the previous analysis from our first experiment, we focus on empirically evaluating the effect of the benefited information gain the model has obtained when we provide a greater amount of temporal information (i.e., a larger input window). In Figure 9, we also notice very similar effects of the variance bounds of the correlation results with respect to the different input window sizes, as highly autocorrelated data tend to have smaller bounds while non-autocorrelated data have larger variance ranges. In particular, increasing the input window sizes with respect to the different output horizon dimensions, we note that the overall performance degradation is minimal, and in some cases performance improves for datasets which exhibit high autocorrelational properties. In contrast, when the data does not contain any sort of autocorrelational signals (e.g., the Exchange Rate), the model performance drops significantly as we increase the input window sizes. These results empirically reinforce the notion that our model is able to robustly learn long-range temporal dependencies with high autocorrelational features.

VI. CONCLUSION

We propose a neural network architecture for multivariate time series prediction through a new class of transformation function known as Temporal Tensor Transformations. In particular, we present the Temporal-Slicing Stack Transformation which utilizes a slice-based operator to transform the original 2D time series into a 3D tensor. This transformation encodes long-range auto-correlational features that can be extracted by a Convolutional Neural Network. Both experimental results and sensitivity analysis provide strong evidence that the proposed architecture is able to learn non-linear auto-correlational patterns effectively from the data.

For future work, we plan to investigate various components of the proposed architecture. To further understand the underlying mechanisms and sensitivity of the architecture, we aim to carefully identify which hyperparameters are more sensitive with respect to the overall model performance. Additionally, to better improve the prediction performance of the model on data without strong autocorrelation, we will explore the use of hybrid approaches that combine both our stateless approach in feature extraction in conjunction with recurrent neural networks. As an extension to the proposed temporal tensor transformation, we can also explore other types of transformation methods which utilize explicit components, such as multivariate interactions and variable sampling rates among different variables. The core challenges behind these classes of transformation functions lie in how to design new structures that can enable efficient information gain through the use of CNN for feature extractions.

REFERENCES

[1] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: forecasting and control. John Wiley & Sons, 2015.

[2] E. Hannan, “A test for singularities in sydney rainfall,” Australian Journal of Physics, vol. 8, p. 289, 1955.

[3] G. P. Zhang, “Time series forecasting using a hybrid arima and neural network model,” Neurocomputing, vol. 50, pp. 159–175, 2003.

[4] H. L¨utkepohl, New introduction to multiple time series analysis. Springer Science & Business Media, 2005.

[5] A. Milhoj, Multiple Time Series Modeling Using the SAS VARMAX Procedure. SAS Institute, 2016.

[6] H. J. Bierens, “Var models with exogenous variables,” Retrieved October, vol. 10, p. 2015, 2004.

[7] M. Alvarez and N. D. Lawrence, “Sparse convolved gaussian processes for multi-output regression,” in Proceedings of Advances in Neural Information Processing Systems, 2009, pp. 57–64.

[8] A. Borovykh, S. Bohte, and C. W. Oosterlee, “Conditional time se- ries forecasting with convolutional neural networks,” arXiv preprint arXiv:1703.04691, 2017.

[9] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499, 2016.

[10] G. Lai, W.-C. Chang, Y. Yang, and H. Liu, “Modeling long-and short- term temporal patterns with deep neural networks,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 95–104.

[11] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. Cottrell, “A dual-stage attention-based recurrent neural network for time series prediction,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2017, pp. 2627–2633.

[12] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.

[13] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[14] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.

[15] S. Seabold and J. Perktold, “Statsmodels: Econometric and statistical modeling with python,” in Proceedings of the 9th Python in Science Conference, 2010.

[16] “Uci machine learning repository: Electricityloaddia-grams20112014 data set,” https://archive.ics.uci.edu/ml/datasets/ ElectricityLoadDiagrams20112014, accessed: 2019-06-18.

[17] “Solar power data for integration studies - grid modernization - nrel,” https://www.nrel.gov/grid/solar-power-data.html, accessed: 2019-06-18.

[18] “Caltrans pems,” http://pems.dot.ca.gov/, accessed: 2019-06-18.

[19] T. G. authors, “GPyOpt: A bayesian optimization framework in python,” http://github.com/SheffieldML/GPyOpt, 2016.