Table 4. MSE and MAE for districts in group B, calculated for all ML models in all variants and ML tasks. The “0” columns represent a model that always predicts 0. The best model in each ML task according to each metric is highlighted in bold.
Table 5. MSE and MAE for districts in group C, calculated for all ML models in all variants and ML tasks. The “0” columns represent a model that always predicts 0. The best model in each ML task according to each metric is highlighted in bold.
Figure 3. From left to right: predictions of the LR, GB and RF models in ML task T6V5 for districts from group C.
Figure 4. Feature importance in districts from group C. From left to right: ML tasks T1 to T7 in variant V5.
The mean squared error (MSE) and mean absolute error (MAE) are smaller for group B districts than for group C districts. This is expected, given that the all-time per-district average negative events (ANE) value is 0.270 for group B and 3.443 for group C.
A closer analysis of the results for group B, with MAE as the quality measure, shows that the strategy of always predicting 0 negative events outperforms the evaluated models in ML tasks T1-T5. Most likely this can be attributed to the low number of negative events in these tasks. Only in tasks T6 and T7, which cover years richer in negative events, do the other models come into play. When MSE is considered, the “0” strategy does not yield satisfactory results at all, and the three analyzed ML models outperform it by a wide margin. However, it is difficult to say which of these models performs best, since the differences in their errors are not clear-cut. One conclusion that can be drawn is that linear regression performs poorly in T1 and T7, that is, in the cases where the least training data is provided.
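The contrast between the two metrics follows from how each one penalizes errors: among constant predictions, MAE is minimized by the median of the target (which is 0 for a sparse count series), whereas MSE is minimized by its mean. The following minimal sketch illustrates this with made-up monthly counts; the data and the helper names `mae` and `mse` are illustrative assumptions and are not taken from the study.

```python
from statistics import mean

def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

# Hypothetical monthly negative-event counts for one sparse district
# (illustrative only; not values from the paper's dataset).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 2, 4]

zero_pred = [0] * len(y_true)             # "always predict 0" strategy
mean_pred = [mean(y_true)] * len(y_true)  # constant prediction at the sample mean

zero_mae, zero_mse = mae(y_true, zero_pred), mse(y_true, zero_pred)
mean_mae, mean_mse = mae(y_true, mean_pred), mse(y_true, mean_pred)

print(f"zero baseline: MAE={zero_mae:.2f}  MSE={zero_mse:.2f}")
print(f"mean baseline: MAE={mean_mae:.2f}  MSE={mean_mse:.2f}")
```

On this toy series the zero baseline wins on MAE (0.60 against 0.96) but loses on MSE (2.00 against 1.64), because the rare nonzero months dominate the squared error.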
More pronounced differences in model performance are visible in the results for group C districts. Here, due to the numerous negative events, the “0” strategy is never preferred. The smallest errors, in both MSE and MAE, are achieved by either Random Forest or Linear Regression, with Gradient Boosting following closely. Figure 3 presents example predictions for district group C made by the three analyzed models using all proposed features, in the machine learning task where all years of the available data were used. The shapes of the plotted predictions are similar, and only the precise error values in Table 5 allow one to conclude which model is superior. This confirms that the models perform similarly.
Figure 4 provides insight into the informativeness of the features fed to the ML models. It compares the 10 most important features according to the Random Forest regression model in each ML task, in the variant of the analysis that uses all proposed features. It can be observed that in tasks T1 to T5 there is one investment feature that seems important for the predictions, namely A2(t-1) as proposed in (4). District ID appears in all tasks, and in T1, where little historical information is present, it seems to be the most important feature. Among the time-related information (TRI) features, the prevailing ones are MMANE-1, NE-1 and NE-11. This indicates that the number of negative events in the previous month, both in the given district and averaged over districts, plays an important role. However, before jumping to conclusions it must be noted that for each ML task the MSE and MAE are extremely similar regardless of the feature set fed to the ML models. For instance, in ML task T6 across all feature set variants V1-V5, the mean and standard deviation of the MAE achieved by the Random Forest model are 3.1699 ± 0.0185, so the standard deviation is only 0.58% of the mean. This shows that changes in the feature set have very little influence on model performance. At the same time, it calls into question the usefulness of the exploratory data analysis carried out to choose the most informative investment-related features, and the overall possibility of extracting information from the available investment data that could improve prediction quality.
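The 0.58% figure is the ratio of the standard deviation to the mean (the coefficient of variation). A minimal sketch of that arithmetic is given below. The first part uses the two values reported above; the per-variant MAE list and the helper `relative_spread` are hypothetical placeholders, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was applied.

```python
from statistics import mean, stdev

def relative_spread(values):
    """Return the sample standard deviation as a fraction of the mean."""
    return stdev(values) / mean(values)

# Mean and standard deviation of MAE reported in the text for
# Random Forest in task T6 across feature set variants V1-V5.
reported_mean, reported_std = 3.1699, 0.0185
reported_ratio = reported_std / reported_mean
print(f"reported relative spread: {reported_ratio:.2%}")

# Hypothetical per-variant MAE values (placeholders, NOT the paper's results),
# showing how the same ratio would be obtained from raw scores.
hypothetical_mae = [3.15, 3.16, 3.17, 3.18, 3.19]
print(f"hypothetical relative spread: {relative_spread(hypothetical_mae):.2%}")
```

A ratio well below 1% means the choice of feature set moves the error by far less than the error itself, which is the basis for the remark about limited feature influence.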
Dividing the dataset into groups of districts according to the all-time per-district average negative events (ANE) value simplifies the machine learning task by immediately excluding from the analysis the 140 out of 400 districts with no negative events. The group of districts with a nonzero but still small number of negative events constitutes a very difficult regression task, in which all evaluated models are in many cases outperformed by the “always predict 0” strategy. Only as the number of negative events rises is it possible to demonstrate the superiority of machine learning models over the naive approach. In this case, our analysis found that the time series data has a strong trend and a visible seasonal component. As a result, predictions of reasonable quality can be achieved simply by using historical data on negative events. Adding features related to investments, or explicitly providing the district identification number to the trained models, does not seem to bring a measurable improvement in performance.
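The grouping step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the label "A" for districts without events, the cut-off `threshold=1.0` between groups B and C, the function name `assign_groups`, and the toy data are all hypothetical, since the text above does not state them.

```python
from statistics import mean

def assign_groups(monthly_counts, threshold=1.0):
    """Map each district ID to a group label based on its all-time ANE.

    The `threshold` separating groups B and C is a hypothetical value;
    the cut-off actually used in the study is not stated in the text.
    """
    groups = {}
    for district, counts in monthly_counts.items():
        ane = mean(counts)  # all-time per-district average of negative events
        if ane == 0:
            groups[district] = "A"  # no negative events: excluded from modelling
        elif ane < threshold:
            groups[district] = "B"  # few events: difficult regression task
        else:
            groups[district] = "C"  # many events: ML models beat the naive baseline
    return groups

# Toy monthly counts per district ID (illustrative only).
monthly_counts = {
    101: [0, 0, 0, 0],
    102: [0, 1, 0, 0],
    103: [3, 5, 2, 4],
}
groups = assign_groups(monthly_counts)
print(groups)
```

With these toy inputs, district 101 falls in group A, 102 in group B and 103 in group C.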
1. Çakıt, E., and Karwowski, W.: Assessing the Relationship between Economic Factors and Adverse Events in an Active War Theater Using Fuzzy Inference System Approach. International Journal of Machine Learning and Computing. 5(3), pp. 252-257 (2015).
2. Çakıt, E., and Karwowski, W.: Fuzzy Inference Modelling with the Help of Fuzzy Clustering for Predicting the Occurrence of Adverse Events in an Active Theater of War. Applied Artificial Intelligence. 29, 945-961 (2015).
3. Çakıt, E., and Karwowski, W.: Understanding the Social and Economic Factors Affecting Adverse Events in an Active Theater of War: A Neural Network Approach. In Advances in Cross-Cultural Decision Making. Springer: Advances in Intelligent Systems and Computing, 610, pp. 215-223 (2017).
4. Zurada, J., et al.: Detecting Adverse Events in an Active Theater of War Using Advanced Computational Intelligence Techniques. In International Conference on Theory and Applications of Fuzzy Systems and Soft Computing. Springer, Cham (2018).
5. Cleveland, R. B., Cleveland, W. S., McRae, J. E., and Terpenning, I. J.: STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics. 6(1), pp. 3-33 (1990). http://bit.ly/stl1990
6. Chen, Y. Y.: Autoregressive Distributed Lag (ADL) Model (2010). http://mail.tku.edu.tw/chenyiyi/ADL.pdf
7. Cromwell, J. B., Labys, W. C., and Terraza, M.: Univariate Tests for Time Series Models. No. 99. Sage (1994).