This paper presents a time series forecasting framework which combines standard forecasting methods and a machine learning model. The inputs to the machine learning model are not lagged values or regular time series features, but instead forecasts produced by standard methods. The machine learning model can be either a convolutional neural network model or a recurrent neural network model. The intuition behind this approach is that forecasts of a time series are themselves good features characterizing the series, especially when the modelling purpose is forecasting. It can also be viewed as a weighted ensemble method. Tested on the M4 competition dataset, this approach outperforms all submissions for quarterly series, and is more accurate than all but the winning algorithm for monthly series.
The competitiveness of neural network (NN) models and other machine learning (ML) models for time series forecasting compared to statistical models has long been questioned by practitioners  . Although in the field of time series forecasting, there is a plethora of literature presenting complex novel models, in practice the performance of ML models is often below expectation . For example, pure ML methods perform poorly in the M4 competition, with none of the five submitted pure ML solutions beating the Comb benchmark, which is the simple arithmetic average of Single, Holt and Damped exponential smoothing .
The poor performance of NN and general ML models in forecasting time series is in sharp contrast with the tremendous success of these models in the areas like computer vision and natural language processing. NN models are known to be capable of extracting features automatically from images for tasks like classification, while in the pre-deep learning era, such tasks were usually solved by feeding hand-crafted features to various ML models . The availability of large dataset such as ImageNet, which has more than 14 million images, is believed to be crucial for the success of NN models in computer vision.
The lack of enough time series data is attributed as a reason for the underachievement of ML models in forecasting . Although it is shown by Zhang et al.  that there exists feedforward NNs that are able to generate many popular time series features such as minimum, maximum and count above mean, without enough data, sophisticated models easily get overfitted and fail short in extracting true patterns. When the search space is so large and the guidance is so little, it is not surprising that ML models are not lucky enough to find the parameters to generalize well.
To make ML models work for time series forecasting, we need to provide more guidance, which can be either more data or better features. But usually it may be difficult or even impractical to collect enough relevant series. Therefore, generating good features seems to be a more viable option. After all, before deep learning gained popularity, the key to the success of many ML systems is feature engineering. Some even believe that “applied machine learning is basically feature engineering” .
This paper presents a time series forecasting framework we call For2For (forecasts to forecast). The framework is composed of standard off-the-shelf forecasting methods and a ML model. The inputs to the ML model are not lagged values or mined features like seasonality strength, but instead forecasts produced by standard methods such as ARIMA and ETS. The intuition behind this approach is that forecasts of a time series are themselves good features characterizing the series, especially when the modelling purpose is forecasting. The ML model can be either a convolutional neural network (CNN) model or a recurrent neural network (RNN) model. In other words, the guidance we provide to the NN models are the forecasts produced by other models, which we will refer to as base models.
Of course, a different way of viewing this approach is that the NN models are trained to combine the forecasts of base models, therefore essentially it is an ensemble model. But in our opinion, it is different from standard combination methods in two ways. First of all, this method does not try to select a single base model (as Talagala et al. ) or to assign weights to base models (as FFORMA ). Instead it tries to combine the forecasts of base models differently at each forecasting point. We will elaborate this further in Section 2. Second, the inputs are forecasts of base models instead of time series features extracted by software packages such as tsfeatures .
This method is tested on the M4 competition dataset and it is more accurate than a simple arithmetic combination of the base models. Furthermore, it beats all submissions for quarterly series, and outperforms all but the winning algorithm for monthly series.
The rest of the paper is organized as follows. Section 2 introduces the framework and presents the structures of the CNN model and the RNN model. The implementation details and testing results are discussed in Section 3 and Section 4 concludes the paper.
Let us first define as a collection of time series to forecast, and the forecasting horizon is set to h. For notational convenience, the first time step to forecast is defined as is known up to t = 0 with n = 1, 2, ..., N.
2.1 The For2For framework
As illustrated in Figure 1, the For2For framework consists of two parts. In the first part, a group of base models are used to forecast time series X, and the forecast results are
where ˆdenotes the forecast of produced by base model . This is common practice in various forecast combination methods   . In the second part of the framework, the forecasts produced by base models B are fed into a NN model N. The NN model can be either a CNN model or a RNN model. The forecast ¯¯¯generated by the NN model N is the final forecast result. Standard training and validation processes are used to build the NN model and are not drawn explicitly in the diagram.
In conventional combination approaches, the final forecast is a linear combination of forecasts generated by base models,
where is the weight assigned to base model when forecasting the series .
In this framework, we do not try to select the single best performing base model, nor do we try to assign weights to each base model. In fact, no weights are explicitly calculated at all. The NN model is trained to learn from the forecasts of base models and then make forecast automatically.
If we insist to view this approach as a weighted combination, then the weights are assigned to individual forecast result at each forecasting step,
This means that the forecasts of a given series by a base model at different t are not necessarily trusted to the same extent in a multi-horizon forecasting task. In other words, the weight w is not just a function of n (which denotes time series) and m (which denotes base model), but also a function of t (forecasting step). It is only when h = 1 is Equation 2 reduced to Equation 1.
Any popular forecasting methods can be included into the set of base models. After applying the base models to a time series , we obtain a matrix ˆ, with the (i, j) element of the matrix being the forecast result at time t = i by model . Other relevant information such as the domain type in the M4 dataset can be appended side by side to this matrix.
2.2 CNN model
The linear layer with a 1 parameter vector on the left simply combines the M base models linearly, and the output is a 1 vector. This layer can also be viewed as a 1 convolutional layer with valid padding and linear activation. Often such a linear model is not adequate, then it is necessary to include more nonlinear mechanisms in the model, and this is exactly what the residual network on the right is intended to do. Intuitively, the four 3 3 convolutional layers with same padding and sigmoid activation are used to capture the difference between forecasts by different base models at different time steps. A fully connected layer which outputs a 1 vector is connected to the last 3 3 convolutional layer. The number of 3 3 convolutional layers can be increased or decreased to reduce bias or variance when necessary. The fully connected layer could also be replaced by a 1 linear layer if there is an obvious overfitting problem. In the extreme case, all the 3 3 convolutional layers and the fully connected layer are removed and the model collapses to a linear combination model.
When there are extra categorical features available, they may be incorporated to the model by a separate embedding layer and the final forecast is the summation of the outputs from the linear part, the residual part and the embedding part.
The CNN model is not flexible enough as to build the computation graph, it is necessary to know the forecasting horizon h beforehand. This is a constraint placed by the fully connected layer. For this reason, when the forecasting horizons are different, as is the case for the series of different frequencies in the M4 competition, one model has to be built for each frequency.
2.3 RNN model
The RNN model considered is more flexible than its CNN counterpart and is more interpretable. Figure 3 demonstrates the folded version of the RNN model considered in this paper, while Figure 4 shows the unfolded version, where ˆis the vector containing the forecasts of series by all the M base models at is the state vector of the RNN cell at 1, ¯) is the final forecast of series at t = i. The initial state is set to zero. The linear layer is used to convert the output vector from the RNN cell to a scalar. In such a structure, instead of making all the h forecasts at the very last step, the forecast is produced one at a time.
Due to the existence of , ¯) is a combination of the forecasts ˆup to t = i. For example, when the model generates the final forecast at t = 2, it does not only look at the forecasts by base models at it is not necessary to know the forecasting horizon h in advance. It is also possible to make predictions for a different h other than the one used for model training.
Extra information available for forecasting can be fed to the model by expanding the input vector to the RNN cell. For example, the domain type of the series in the M4 dataset can be included into the model by first applying an embedding layer to the categorical feature and then appending the embedding results to the forecasts by base models at each time step.
To assess the performance of the proposed method, the M4 dataset is used for experiments. Among the 100,000 time series in the dataset, there are 23,000 yearly series, 24,000 quarterly series, 48,000 monthly series, 359 weekly series, 4227 daily series and 414 hourly series . The forecasting horizons of series of these frequencies are six, eight, 18, 13, 14 and 48, respectively. The series were collected from six domains: micro, industry, macro, finance, demographic and other.
3.1 Base models
Inspired by FFORMA, we choose the same base models as Montero-Manso et al. , with the exception of the na¨ıve method. The reason that we exclude the na¨ıve method is that at the data preprocessing stage, we normalize all the series by their last observations and then apply a log transformation. As a result, the forecasts at any time t by the na¨ıve method is simply zero. Including an all-zero vector into the inputs to a NN model adds no new information.
For the sake of completeness, the eight base models are listed in Table 1. These methods are implemented in the R package forecast  and can be readily used. We follow the practice in FFORMA so that the default settings of the R functions are used and the forecast is replaced by seasonal na¨ıve forecast if an error is reported.
3.2 Preprocessing and training
We take the monthly series as an example to show how the modelling process is carried out. First of all, the last 18 (horizon-number) observations of each of the 48,000 quarterly series are removed. Secondly, base models are applied to each chopped series to make forecasts. Negative forecasts are clipped by 10 as it is pointed out in  that the minimum of all series is 10. The clipped forecasts are normalized by the last observation and then a log transformation is applied.
Figure 1: For2For (forecasts to forecast) framework
Table 1: List of base models and R functions
Figure 2: CNN model structure
Figure 3: RNN model structure, folded version
At the model training stage, one third of randomly chosen series are held out as validation dataset for tuning model hyperparameters. Once the hyperparameters are determined, all the chopped series are used to re-train the final model, which will be used for the actual forecasting. The inputs to the final model are preprocessed forecasts of complete series produced by base models.
For fair comparison with FFORMA, the domain type of the series is not used in the experiments. Due to the stochastic nature of the initialization of model parameters and the fact that the training samples are randomly divided into 10 mini-batches in each training epoch, the final model obtained is not deterministic. To reduce uncertainty, multiple instances of the model are trained concurrently and the average of these instances is taken as the final forecasting result. This model averaging approach was also employed by Smyl in his submission that wins the competition .
3.3 Accuracy measure
Mean absolute error (MAE) is used as the loss function. It is possible to improve the accuracy by using carefully designed custom loss function such as the pinball loss in . In the M4 competition, the accuracy of a model is measured by the symmetric mean absolute percentage error (sMAPE)
and the mean absolute scaled error (MASE)
where X(t) is the true value at point t, ¯X(t) the forecast, h the forecasting horizon, l the number of in-sample data points and p the time interval to compute the difference between successive data points .
The final ranking of a submission in the competition is determined by the overall weight average (OWA) , which is defined by
where the subscript denotes the Na¨ıve 2 method, which amounts to the na¨ıve method applied to seasonally adjusted series. The Na¨ıve 2 method was used in the competition as the benchmark to evaluate the performance of submitted results .
For ease of comparison, Table 2 lists the accuracy measures of the top ranked submissions for series of each frequency in the M4 competition. Since the proposed method has much in common with FFORMA, the performance of the FFORMA method is given in the last column.
3.4 One model for each frequency
In this subsection, independent models are built for series of different frequencies. In the CNN model, four layers of 33 convolutional layers are used for yearly, quarterly, and monthly series. LSTM (long short-term memory) networks are used in the RNN model and the numbers of cell states are three for yearly series, four for quarterly series and six for monthly series.
Figure 4: RNN model structure, unfolded version
Table 2: Performance of top submissions
We train each model eight times with fresh starts and in each run, the model is trained 2000 epochs. The sMAPE, MASE and OWA of each run for each frequency are shown in Table 3. The last column of the table gives the accuracy of the ensemble model. It is clear that model averaging improves the forecasting accuracy. In some cases, the emsemble is more accurate than any of the eight individual runs.
The proposed CNN model would rank 3rd, 2nd and 3rd in terms of sMAPE for yearly, quarterly and monthly series. It would rank 6th, 1st and 3rd in terms of OWA, respectively. From Table 3 it can be seen that the RNN model is more accurate than its CNN counterpart in terms of sMAPE. The benefits of model averaging is even more evident. For yearly, quarterly and monthly series, the RNN model would rank 3rd, 1st and 2nd, respectively, in terms of sMAPE, and would rank 8th, 1st and 2nd, respectively, in terms of OWA.
The CNN and the RNN models for yearly series have very good rankings in terms of sMAPE, but not so in terms of MASE. This is due to the difference of the two measures. The numerators in Equation 3 and Equation 4 are the same, while the denominators are different. For series with large mean value but very small variations, it is possible that the sMAPE of a forecast is very small, but the MASE is excessively high.
In our experiments, forecasts of base models are log transformed after being normalized by the last in-sample observation. Such a combination of preprocessing is well suited for reducing sMAPE. But when the purpose is to minimise MASE, it may be better not to do log transformation, but to simply normalize the forecasts of base models by the denominator of Equation 4. To strike a balance between these two measures, it is possible to combine forecasts under these two preprocessing settings. We do not do so in this paper, as our purpose is to show the generality of the proposed framework for time series forecasting. It should be mentioned that there are many possible ways to improve the implementation.
The proposed models have very similar performance as FFORMA for daily series. They do not perform as well for series with much fewer samples, i.e., for weekly (359 series) and hourly (414 series) data. In these cases, a simple linear model works better as it is less prone to overfitting. Once again, we observe that a large sample size is crucial for NN models to work well for time series forecasting.
In our experiments, one training sample is generated for each series, but it is possible to increase the samples using a stretching window scheme. For a given series X of length l, the one training sample is produced by first taking out the last h observations and then applying the M base models to the first observations. The forecasts by base models are used as the features in the NN model and the last h observations are used as the label. To increase the samples, we can apply the base model to the first observations where k is a positive integer, and use the h observations immediately after as the labels.
3.5 One RNN model for all frequencies
As mentioned in Section 2.2, due to the constraint placed by the fully connected layer, independent CNN models have to be built for series with different forecasting horizons. By contrast, it is possible to build one single RNN model for series of all frequencies. The only restriction is that at the training stage, series of different frequencies have to be in different mini-batches, as the sequence lengths (forecasting horizons) are different.
The full results of such a RNN model with a state size of nine are shown in Table 4. Interestingly, this single RNN model built for all series is more accurate than individual models built specifically for a particular frequency. It achieves slightly better OWA than FFORMA for quarterly, monthly and daily series, and does considerably worse for hourly series. Overall, the RNN model is marginally more accurate than FFORMA and would rank 2nd in terms of
OWA among all the submissions.
We present a time series forecasting framework For2For. In the framework, forecasts produced by standard models are fed to a NN model, which learns how to make final forecast based on forecasts of various sources. This approach can be seen as a combination method, thus in essence the NN model is trained to learn how to combine forecasts made by standard models. The NN model can be either a CNN model with a structure similar to ResNet or a RNN model which makes forecast one at a time.
To evaluate this approach, we test the method on the M4 competition dataset. Both the CNN and the RNN models perform very well for yearly, quarterly and monthly series. When the data sample size is too small, the NN models tend to overfit and do not generalize well. We also build one single RNN model for all frequencies, and it is more accurate than individual models built specifically for a particular frequency.
In the experiments, we don’t use any features other than forecasts generated by base models. In practice, it is certainly possible to combine them. Prediction intervals are not discussed in this paper, and could be a topic of future investigation.
The work of Ying Feng is supported by XJTLU Research Development Fund 18-02-27.
 S. Makridakis, E. Spiliotis, V. Assimakopoulos, Statistical and machine learning forecasting methods: Concerns and ways forward, PloS one 13 (3) (2018) e0194889.
 S. Makridakis, R. J. Hyndman, F. Petropoulos, Forecasting in social settings: the state of the art, International Journal of Forecasting 36 (1) (2020) 15–28.
 H. Hewamalage, C. Bergmeir, K. Bandara, Recurrent neural networks for time series fore- casting: Current status and future directions, arXiv preprint arXiv:1909.00590.
 S. Makridakis, E. Spiliotis, V. Assimakopoulos, The M4 competition: Results, findings, conclusion and way forward, International Journal of Forecasting 34 (4) (2018) 802–808.
 L. Nanni, S. Ghidoni, S. Brahnam, Handcrafted vs. non-handcrafted features for computer vision classification, Pattern Recognition 71 (2017) 158–172.
 R. Zhang, S. Dong, X. Nie, S. Xiao, Forward neural network for time series anomaly detection, arXiv preprint arXiv:1812.08389.
 A. Ng, Machine learning and AI via brain simulations, Accessed: May 3 (2013) 2018.
 T. S. Talagala, R. J. Hyndman, G. Athanasopoulos, Meta-learning how to forecast time series, Tech. rep., Monash University, Department of Econometrics and Business Statistics (2018).
 P. Montero-Manso, G. Athanasopoulos, R. J. Hyndman, T. S. Talagala, FFORMA: Feature-based forecast model averaging, International Journal of Forecasting 36 (1) (2020) 86–92.
 R. J. Hyndman, Y. Kang, T. Talagala, E. Wang, Y. Yang, tsfeatures: Time series feature extraction, R package version 1 (0).
 R. J. Hyndman, G. Athanasopoulos, Forecasting: principles and practice, OTexts, 2018.
 M. Pawlikowski, A. Chorowska, Weighted ensemble of statistical models, International Journal of Forecasting 36 (1) (2020) 93–97.
 K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Pro- ceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 R. J. Hyndman, Y. Khandakar, Automatic time series forecasting: the forecast package for R, Journal of Statistical Software 26 (3) (2008) 1–22.
 S. Smyl, A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting, International Journal of Forecasting 36 (1) (2020) 75–85.