Prediction of adverse events in Afghanistan: regression analysis of time series data grouped not by geographic dependencies

2020·arXiv

Abstract

The aim of this study was to approach a difficult regression task on highly unbalanced data from an active theater of war in Afghanistan. We focused on predicting the number of negative events, without distinguishing their precise nature, given historical data on investments and negative events for each of 400 predefined districts of Afghanistan. In contrast with previous research on the matter, we propose an approach to time series analysis that benefits from a non-conventional aggregation of these territorial entities. Through initial exploratory data analysis we demonstrate that dividing the data according to our proposal makes it possible to identify strong trend and seasonal components in the selected target variable. Using this approach we also tried to estimate which investment data are most important for prediction performance. Based on our exploratory analysis and previous research we prepared 5 sets of independent variables that were fed to 3 machine learning regression models. The results, expressed as mean absolute and mean square errors, indicate that leveraging historical data on the target variable allows for reasonable performance; however, the other proposed independent variables do not seem to improve prediction quality.

Introduction

In any conflict, the ability to predict negative events before they take place gives an edge over the adversary. Modern machine learning techniques make it possible to analyze historical conflict-related data and deliver high-quality predictions, which is why these methods are very often applied. In this study we aimed to apply machine learning methods to a dataset from Afghanistan that was introduced and analyzed in previous studies (1,2,3,4). The authors of these studies used various approaches and models to solve regression and binary classification tasks formulated on the data.
(1,2,3) addressed regression tasks with 4 dependent variables, namely the number of dead, wounded and hijacked personnel and the overall number of negative events. In (1,2) various fuzzy inference systems were evaluated, in (3) shallow artificial neural networks and multiple linear regression models were employed, and in (4) a binary classification task was solved with 5 model types, namely neural networks, k-nearest neighbors, support vector machines, random forests and C4.5 decision trees. Since the dataset reflects the administrative division of Afghanistan, all of the above studies aggregated the data in the same manner, into districts, provinces and regions. This approach resulted in the difficult problem of sparse target variable vectors; e.g. (4) calculated that 87.23% of the negative events target vector for the whole of Afghanistan consists of “0” values. In our work we propose a method of overcoming this difficulty by aggregating the districts differently and training the selected machine learning models separately for the resulting groups.

Methods

Dataset and Exploratory Data Analysis (EDA)

This study analyzes the data described in (1,2,3,4) and addresses a regression task for one target variable, namely the number of negative events (NE), without distinguishing between types of NE, i.e. dead, wounded or hijacked personnel. Inspired by the observation in (4) that this target variable is a sparse vector, we focused on tackling this problem. Our analysis treats the entire dataset of 33 600 data points as a time series with 400 realizations, one per district, with a time step of 1 month and a duration of 7 years, resulting in 84 data points per district realization. At first we tried to aggregate the realizations in the natural, geographically constrained manner used in (1,2,3,4), namely into the provinces and regions also provided in the dataset. To explore the data grouped in this manner we computed the all-time per-district average negative events value (ANE) and concluded that realizations within these geographic groups varied strongly. This seemed challenging for any prediction model and agreed with the conclusions of (1,2,3,4), so we searched for another approach to grouping the 400 districts.

Figure 1. Number of districts as a function of the all-time per-district average negative events value (ANE).

After creating the ANE histogram presented in figure 1, we found that all realizations can be divided by ANE into 3 groups: A with ANE = 0, B with ANE in the range (0, 2], and C with ANE greater than 2. Group A contains 140 districts, group B 247 districts and group C 13 districts. It can be concluded that 140 of the 400 districts are “silent”, i.e. they always have NE = 0, regardless of external factors.
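The grouping by ANE described above can be sketched in a few lines of pandas. This is only an illustrative sketch, not the code used in the study: the column names `district_id` and `ne`, the function name `assign_ane_groups` and the toy data are assumptions introduced here.

```python
import numpy as np
import pandas as pd


def assign_ane_groups(df: pd.DataFrame) -> pd.Series:
    """Label each district A, B or C from its all-time average NE (ANE).

    Assumes hypothetical columns 'district_id' and 'ne' (monthly NE count).
    """
    ane = df.groupby("district_id")["ne"].mean()
    # A: ANE == 0, B: 0 < ANE <= 2, C: ANE > 2 (conditions are checked in order)
    labels = np.select([ane == 0, ane <= 2], ["A", "B"], default="C")
    return pd.Series(labels, index=ane.index, name="group")


# Toy data: 3 hypothetical districts observed for 4 months each
toy = pd.DataFrame({
    "district_id": [1] * 4 + [2] * 4 + [3] * 4,
    "ne": [0, 0, 0, 0, 0, 1, 0, 3, 4, 2, 5, 1],
})
groups = assign_ane_groups(toy)
group_sizes = groups.value_counts().to_dict()
```

On the real dataset, `group_sizes` would be expected to reproduce the 140 / 247 / 13 split reported above, provided the data are laid out as assumed.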
Always predicting 0 as the number of NE for districts in group A is a perfect solution and does not require any prediction model. In contrast, in groups B and C the share of data points with NE = 0 is 83.49 % and 20.97 %, respectively. This shows that the proposed groups differ strongly and can in fact be treated as different data generating processes that should be modeled separately. Therefore, our aim was to model two separate time series, one for group B and one for group C.

Time series decomposition
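As a rough illustration of the classical additive decomposition considered in this section, the sketch below splits a monthly series into trend, seasonal and residual components and reports the mean absolute residual. It is a hand-written version of the textbook procedure, not the study's implementation; the function name `additive_decompose` and the synthetic series are assumptions, and an even seasonal period is assumed.

```python
import numpy as np
import pandas as pd


def additive_decompose(y: pd.Series, period: int = 12) -> pd.DataFrame:
    """Classical additive decomposition (assumes an even seasonal period)."""
    # Trend: centered 2 x period moving average, undefined at both ends
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    core = np.convolve(y.to_numpy(dtype=float), weights, mode="valid")
    pad = np.full(period // 2, np.nan)
    trend = pd.Series(np.r_[pad, core, pad], index=y.index)

    # Seasonal: mean detrended value per position in the cycle, centered on 0
    pos = np.arange(len(y)) % period
    means = (y - trend).groupby(pos).mean()
    means = means - means.mean()
    seasonal = pd.Series(means.to_numpy()[pos], index=y.index)

    resid = y - trend - seasonal
    return pd.DataFrame({"trend": trend, "seasonal": seasonal, "resid": resid})


# Synthetic 84-month series: linear trend plus a yearly cycle, no noise
months = pd.date_range("2004-01-01", periods=84, freq="MS")
t = np.arange(84)
y = pd.Series(1.0 + 0.05 * t + np.sin(2 * np.pi * t / 12), index=months)

parts = additive_decompose(y)
resid_mae = parts["resid"].abs().mean()  # mean absolute residual (MAE)
relative_error = resid_mae / y.mean()    # MAE as a share of the series mean
```

The statsmodels package named in the Software section provides ready-made routines for classical and STL decomposition, which would be the natural choice for comparing the three methods.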

As part of EDA we created two separate time series, one for group B and one for group C, by computing the Month-by-Month Average Negative Events value over all districts in the group (MMANE B and MMANE C). We then decomposed these time series into trend, seasonal and residual components, in a manner popular e.g. in economic sciences (https://otexts.com/fpp2/classical-decomposition.html). We compared the additive, multiplicative and STL (Cleveland et al., 1990) decomposition methods by the mean absolute residual component, synonymous with mean absolute error (MAE), computed for each method. The results are presented in table 1.

Table 1. MAE for tested decomposition methods for MMANE B and MMANE C time series.

Figure 2. Curves derived from additive decomposition method for MMANE B and MMANE C time series.

Based on figure 2, it can be concluded that in both groups the curves have somewhat similar shapes, indicating a clear increase of MMANE over the years and strong seasonality over the months. Leveraging basic time-related historical information on MMANE therefore allows MMANE to be predicted with an MAE of 0.082 for districts in group B and 0.776 for districts in group C. Given that the average MMANE for groups B and C was 0.270 and 3.443 respectively, this translates to an average prediction error of 30.27 % for group B and 22.54 % for group C.

To preliminarily explore the importance of the investment data present in the dataset, we followed the Distributed Lag (DL) approach proposed by (7): for all investment types, budget amounts and numbers of investment projects we computed past month-by-month values at lags of 1 to 12 months. We then employed a basic linear regression model (LRM) to try to explain the residual component left after additive decomposition using the investment information. Unfortunately, in both district groups B and C each realization was a time series of only 84 data points. To prevent the model from over-fitting to the abundance of investment-related input variables, we fed it a single lagged variable at a time, e.g. the A3 investment type lagged 4 months or the B5 investment budget lagged 8 months. This allowed us to compare the MAE decrease in each case. In most cases the decrease of MAE was close to none; the lagged variables that provided the best MAE decrease for groups B and C are presented in table 2. The most valuable investment information for districts in group B was the number of emergency assistance projects carried out 6 months earlier (coded as A6-6), which allowed for over 5 % improvement in MAE. Analogously, for districts in group C an improvement of over 6 % in MAE was achieved using information on the amount of investments in gender carried out 4 months earlier (coded as B9-4).

Table 2. MAE for additive decomposition and linear regression model, information that allowed best performance and percentage gain achieved by leveraging investment information.

Preparation of data and features fed to machine learning (ML) models

Firstly, we tested autoregressive (AR) models fed only with historical data on the target variable.
Since EDA demonstrated strong trend and seasonal components in the historical information on MMANE, for each data point we prepared the following time related information (TRI) derived from the original data:
• NE value for the analyzed district at lag m, where m ranges from 1 to 12 months;
• 3 month trend of MMANE value for all districts, understood as MMANE(-3m) - MMANE(-1m);
• MMANE value for all districts for last year;
• MMANE value for all districts for last half year;
• MMANE value for all districts for last 3 months;
• MMANE value for all districts for last 1 month;
• month of data instance; and
• year of data instance.
Secondly, we made use of the investment-related information prepared with the DL approach as previously described. In addition, before feeding other data to the machine learning models we applied basic feature engineering methods.
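The time related information listed above could be assembled roughly as follows. This is a sketch under stated assumptions rather than the study's code: the column names (`district_id`, `date`, `ne`), the function name `build_tri` and the output column names are hypothetical, and "MMANE for the last year / half year / 3 months" is interpreted here as the mean of MMANE over the preceding 12 / 6 / 3 months, which the text does not spell out.

```python
import pandas as pd


def build_tri(df: pd.DataFrame, max_lag: int = 12) -> pd.DataFrame:
    """Add lagged NE and MMANE-based features to a per-district monthly table."""
    out = df.sort_values(["district_id", "date"]).copy()

    # Per-district NE at lags of 1..max_lag months
    ne_by_district = out.groupby("district_id")["ne"]
    for m in range(1, max_lag + 1):
        out[f"ne_lag{m}"] = ne_by_district.shift(m)

    # MMANE: month-by-month mean NE over all districts in the table
    mmane = out.groupby("date")["ne"].mean().sort_index()
    past = mmane.shift(1)  # use only months before the target month
    agg = pd.DataFrame({
        "mmane_last1": past,
        "mmane_last3": past.rolling(3).mean(),
        "mmane_last6": past.rolling(6).mean(),
        "mmane_last12": past.rolling(12).mean(),
        # 3 month trend as defined in the text: MMANE(-3m) - MMANE(-1m)
        "mmane_trend3": mmane.shift(3) - mmane.shift(1),
    })
    out = out.join(agg, on="date")

    out["month"] = out["date"].dt.month
    out["year"] = out["date"].dt.year
    # Rows without a full 12-month history cannot be used
    return out.dropna()


# Toy data: 2 hypothetical districts, 24 months, NE growing linearly
dates = pd.date_range("2004-01-01", periods=24, freq="MS")
toy = pd.concat([
    pd.DataFrame({"district_id": d, "date": dates, "ne": [k * t for t in range(24)]})
    for d, k in [(0, 1), (1, 2)]
], ignore_index=True)
tri = build_tri(toy)
first = tri.iloc[0]  # district 0, January 2005
```

All aggregates are shifted so that a row only sees values from earlier months, which avoids leaking the target month into its own features.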

As in the previously described LRM approach, in order for the models not to over-fit to the data, we tried to keep the number of independent variable columns low and considered only selected information regarding investments (SID), namely A6-6 and B9-4, i.e. the two features that were found most informative during EDA. Also, because previous research (4) found the number of Community Development projects carried out a year before the data instance (coded as A2(t-1)) important for predicting the number of NE, we considered this information as well. The importance of the district identification number (ID) lies at the very foundation of our concept for analyzing the data, therefore we also added this feature to the analysis. In order to compare the influence of the selected information on ML model performance, we defined 5 variants of feature sets fed to the ML models:
V1) include only TRI;
V2) include TRI + district identification number (ID);
V3) include TRI + selected investment data (SID);
V4) include TRI + ID + SID;
V5) include TRI + ID + SID + A2(t-1).
Therefore, models trained on the V1 and V2 feature sets can be treated as AR models, whereas models trained on the V3, V4 and V5 feature sets follow the autoregressive distributed lag approach proposed in (6).

Time series data split

It is possible to analyze the dataset as a whole: taking into consideration that it comprises a time series for each district, the data can be divided so that the years 2004-2009 are used for training and 2010 for testing. However, apart from this approach, we also propose splitting the data in another manner. Since the data comprise 7 years of observations, we propose to split them into 6 ML tasks that progressively address longer and longer time periods. Additionally, because the last year witnessed a dramatically increased number of NE compared to earlier years, we added a separate 7th data split that uses only the last 2 years of observations.
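One plausible reading of this splitting scheme is sketched below. Because the exact year assignments of table 3 are not reproduced here, the layout is an assumption: T1 to T6 train on the first k years and test on the following year, and T7 trains on 2009 and tests on 2010. The names `make_splits` and `FEATURE_SETS` are hypothetical.

```python
from typing import Dict, List, Tuple

YEARS = list(range(2004, 2011))  # 7 years of monthly observations, 2004-2010

# Feature-set variants, expressed as the feature blocks named in the text
FEATURE_SETS = {
    "V1": ["TRI"],
    "V2": ["TRI", "ID"],
    "V3": ["TRI", "SID"],
    "V4": ["TRI", "ID", "SID"],
    "V5": ["TRI", "ID", "SID", "A2(t-1)"],
}


def make_splits(years: List[int]) -> Dict[str, Tuple[List[int], List[int]]]:
    """Map task name -> (training years, test years). Layout is an assumption."""
    splits = {}
    # Expanding window: train on the first k years, test on the next year
    for k in range(1, len(years)):
        splits[f"T{k}"] = (years[:k], [years[k]])
    # Extra task restricted to the two most recent years
    splits[f"T{len(years)}"] = ([years[-2]], [years[-1]])
    return splits


splits = make_splits(YEARS)
# One ML task per (district group, data split, feature set) combination
tasks = [(g, t, v) for g in ("B", "C") for t in splits for v in FEATURE_SETS]
```

Under this reading, T6 coincides with the whole-dataset split (2004-2009 for training, 2010 for testing), and the enumeration yields 2 x 7 x 5 = 70 tasks.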
The whole pattern applied for creating the 7 train-test data splits is presented in table 3. Given that we had already divided the data into group B and group C, this constitutes 14 ML tasks. Taking into account the 5 different feature sets, this results in 70 separate ML tasks overall.

Table 3. Data split for 7 machine learning tasks.

Selected machine learning models, metrics and visualization techniques

For the regression task we chose to compare the performance of 3 machine learning models in each of the 70 defined tasks: linear regression (LR) as a baseline, random forest (RF) and gradient boosting (GB). As a result, we ended up comparing the performance of 210 ML model variants. For each defined ML task we also computed metrics for a naive baseline model that always predicts 0 as the NE value.
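A single ML task of this kind could be evaluated along the following lines with scikit-learn. This is a sketch, not the study's pipeline: the hyperparameters are library defaults chosen here for illustration, the data are synthetic, and the `evaluate` helper is hypothetical. Errors are reported as MAE and MSE, the two error measures used in this study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error


def evaluate(X_train, y_train, X_test, y_test, seed=0):
    """Return {model name: (MAE, MSE)} for LR, RF, GB and the always-0 baseline."""
    models = {
        "LR": LinearRegression(),
        "RF": RandomForestRegressor(n_estimators=100, random_state=seed),
        "GB": GradientBoostingRegressor(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        scores[name] = (mean_absolute_error(y_test, pred),
                        mean_squared_error(y_test, pred))
    # Naive baseline: always predict 0 negative events
    zeros = np.zeros(len(y_test))
    scores["0"] = (mean_absolute_error(y_test, zeros),
                   mean_squared_error(y_test, zeros))
    return scores


# Synthetic, non-negative target that depends on the first feature only
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.maximum(0.0, 3.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=300))
scores = evaluate(X[:240], y[:240], X[240:], y[240:])
```

The split in the usage example is chronological in spirit (first rows for training, last rows for testing), mirroring the way the ML tasks keep the test year after the training years.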

We adopted two metrics for the addressed regression tasks, namely mean absolute error (MAE) and mean square error (MSE). To assess model performance visually, we created plots showing the predictions made by chosen ML models in chosen ML tasks. We also benefited from the fact that random forest models develop, during training, rankings that reflect the influence of the independent variables on the target variable. We used these rankings to visualize feature importance for all ML tasks in the V5 variant, which includes all analyzed features.

Software

Data preparation and ML models were implemented in Python using the numpy, pandas, statsmodels and scikit-learn packages. All computations were carried out on the same computing machine.

Results

Table 4. MSE and MAE in districts from group B, calculated for all ML models in all variants and ML tasks. “0” columns represent a model predicting always “0”. Best models in a given ML task according to a given metric are highlighted with bold font.

Table 5. MSE and MAE in districts from group C, calculated for all ML models in all variants and ML tasks. “0” columns represent a model predicting always “0”. Best models in a given ML task according to a given metric are highlighted with bold font.

Figure 3. From left to right: predictions of LR, GB and RF models in ML task T6V5 for districts from group C

Figure 4. Feature importance in districts from group C. From left to right: ML task V5T1-T7

The mean square error (MSE) and mean absolute error (MAE) are smaller for group B districts than for group C districts, which is to be expected given that the all-time per-district average negative events value (ANE) for these groups is 0.270 and 3.443 respectively.

Closer analysis of the results for group B, with MAE as the quality measure, shows that the strategy of always predicting 0 as the number of negative events beats the evaluated models in ML tasks T1-T5. Most likely this can be attributed to the low number of negative events in these tasks. Only in tasks T6 and T7, which address years richer in negative events, do the other models come into play. When considering MSE, the “0” strategy does not bring satisfactory results at all and the 3 analyzed ML models outperform it by a strong margin. However, discussing which of these models performs best is difficult, since the differences in errors are small. One conclusion that can be drawn is that linear regression performs poorly in T1 and T7, that is, in the cases where the least training data is provided.

More pronounced differences between model performance are visible in the results for group C districts. Here, due to numerous negative events, the “0” strategy is never preferred. The smallest errors, both MSE and MAE, are achieved by either Random Forest or Linear Regression, with Gradient Boosting following closely. Figure 3 presents examples of predictions for district group C made by the 3 analyzed models using all proposed features, in the machine learning task where all years of available data were used. The shapes of the plotted predictions are similar, and only the exact errors reported in table 5 make it possible to conclude which model is superior. This confirms that the models perform similarly.

Figure 4 provides insight into the informativeness of the features fed to the ML models. It compares the 10 most important features according to the Random Forest regression model in each ML task, in the variant of analysis that uses all proposed features. It can be observed that in tasks T1 to T5 there is one investment feature that seems important for predictions, namely A2(t-1), as proposed in (4). District ID appears in all tasks, and in T1, where little historical information is present, it seems to be the most important feature. Among the time related information (TRI) features, the prevailing ones are MMANE-1, NE-1 and NE-11. This indicates that the number of negative events in the previous month, both in the given district and averaged over districts, plays an important role. However, before jumping to conclusions it must be noted that for each ML task the MSE and MAE are extremely similar regardless of the feature set fed to the ML models. For instance, in ML task T6 the average and standard deviation of the MAE achieved by the Random Forest model across all feature set variants V1-V5 is 3.1699 ± 0.0185, so the standard deviation is only 0.58 % of the average value. This shows that changes in the feature set have very little influence on model performance. At the same time, it puts a question mark over the usefulness of the exploratory data analysis carried out to choose the most informative investment-related features, and over the overall possibility of extracting information from the available investment data that could improve prediction quality.

Dividing the dataset into groups of districts according to the all-time per-district average negative events value (ANE) simplifies the machine learning task by immediately excluding from analysis the 140 out of 400 districts with no negative events. The group of districts with a non-zero but still small number of negative events constitutes a very difficult regression task, where in many cases all evaluated models are outperformed by the “always predict 0” strategy. Only as the number of negative events rises is it possible to demonstrate the superiority of machine learning models over the naive approach. In this case, our analysis found that the time series data has a strong trend and a visible seasonal component. As a result, reasonable prediction quality can be achieved simply by using historical data on negative events. Adding features regarding investments, or explicitly providing the district identification number to the trained models, does not seem to bring a measurable improvement in performance.

References

1. Çakıt, E., and Karwowski, W.: Assessing the Relationship between Economic Factors and Adverse Events in an Active War Theater Using Fuzzy Inference System Approach. International Journal of Machine Learning and Computing. 5(3), pp.252-257 (2015).

2. Çakıt, E., and Karwowski, W.: Fuzzy Inference Modelling with the Help of Fuzzy Clustering for Predicting the Occurrence of Adverse Events in an Active Theater of War. Applied Artificial Intelligence. 29, 945-961 (2015).

3. Çakıt, E., and Karwowski, W.: Understanding the Social and Economic Factors Affecting Adverse Events in an Active Theater of War: A Neural Network Approach. In Advances in Cross-Cultural Decision Making. Springer: Advances in Intelligent Systems and Computing, 610, pp. 215-223 (2017).

4. Zurada, Jozef, et al. "Detecting Adverse Events in an Active Theater of War Using Advanced Computational Intelligence Techniques." International Conference on Theory and Applications of Fuzzy Systems and Soft Computing. Springer, Cham, 2018.

5. Cleveland, R. B., Cleveland, W. S., McRae, J. E., & Terpenning, I. J. (1990). STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1), 3–33. http://bit.ly/stl1990

6. Chen, Y. Y. (2010). Autoregressive Distributed Lag (ADL) Model. http://mail.tku.edu.tw/chenyiyi/ADL.pdf

7. Cromwell, Jeff B., Walter C. Labys, and Michel Terraza. Univariate tests for time series models. No. 99. Sage, 1994.
