Individual Claims Forecasting with Bayesian Mixture Density Networks

2020·Arxiv

Abstract

Abstract

We introduce an individual claims forecasting framework utilizing Bayesian mixture density networks that can be used for claims analytics tasks such as case reserving and triaging. The proposed approach enables incorporating claims information from both structured and unstructured data sources, producing multi-period cash flow forecasts, and generating different scenarios of future payment patterns. We implement and evaluate the modeling framework using publicly available data.

Keywords claims reserving individual claims reserving loss reserving

1 Introduction

Individual claims reserving has garnered increasing interest in recent years. While the main benefit cited for performing reserving at the claim level over aggregate loss triangle approaches, such as chain ladder and Bornhuetter-Ferguson, is potential improvement in predictive accuracy, especially in environments with changing portfolio mix [6], there are additional practical advantages to forecasting individual claim behavior. These include being able to obtain updated views of portfolio risk as claims are reported and optimize adjuster resource allocation based on severity predictions. Although the benefits of individual claims modeling are promising, it has not yet achieved widespread adoption in practice. One contributing reason for the lack of adoption, we hypothesize, is the absence of a modeling framework with features important to practitioners. We suggest that an effective loss reserves modeling framework should be able to:

• Incorporate arbitrary claims information as predictors, • Produce multi-period forecasts that are sufficiently stable over time, and • Sample different realizations of future payment patterns that encompass both process uncertainty and model risk.

Companies are capturing increasingly diverse data, such as unstructured text from claims adjusters’ notes and photographs of damages, that can potentially be predictive. Timing of cash flows and being able to sample from different future states have both business decision making and regulatory applications. A desirable characteristic of the forecasts is that they are stable over time, as management is averse to volatility in loss reserves figures from one accounting period to the next. A subjective criterion not listed above, which may affect adoption of an approach, is that the models should be implementable without the need for extensive bespoke feature engineering or specification of complex assumptions for underlying stochastic processes.

To the best of our knowledge, no existing loss reserving framework implements all of the above features. In this paper, we propose an extensible individual claims forecasting framework towards satisfying many of these criteria, utilizing ideas from Bayesian neural networks (BNN) [19] and mixture density networks (MDN) [4]. While we discuss these concepts in detail later in the paper, at a high level,

• BNNs are non-linear supervised learning models that capture complex interactions among inputs, with prior distributions on model parameters; and

• MDNs are mixture models for conditional densities, where the mixture model parameters are the outputs of the neural network.

Concretely, our contributions are:

• Development of an individual claims forecasting framework based on Bayesian Mixture Density Networks (BMDN).

• Implementation of the proposed framework using publicly available claims-level data, which provides a baseline for future work to compare against.

2 Related Work

Claims-level reserving is a fast-moving research area, and [6] and [23] provide recent surveys. Many of the current works in the area utilize machine learning (ML) techniques. Wüthrich [25] introduces using machine learning algorithms to incorporate diverse claims characteristics inputs. It demonstrates a simple model where regression trees are used to predict the number of payments, and suggests extensions such as compound modeling to predict severity and bootstrap simulation to obtain prediction uncertainty. More recently, [7, 18, 3] also utilize tree-based techniques for individual claims reserving. Another direction of research for the individual claims reserving problem are the generative approaches of, for example, [2, 20, 21]. In this latter category of methodologies, a set of distributional assumptions are posited for the different drivers of claims, such as time to the next payment and its amount. These distributions are fit to the data, then, with the obtained parameters, the modeler is able to perform simulations of future development paths by sampling from the fitted distributions. While this approach provides a natural way to obtain samples of future cash flow paths, the distributional assumptions may be too rigid in some cases. It is also difficult to incorporate individual claim characteristics; to differentiate among different characteristics, one would have to segment the claims and fit separate models to each group.

In formulating our framework, we also draw inspiration from machine learning approaches to aggregate triangle data, including [9, 8], which embed a classical parametric loss reserving models into neural networks, and the DeepTriangle [16] framework, whose neural network architecture we adapt for individual claims data.

3 Preliminaries

We begin with a description of the loss reserving problem then briefly discuss BNN and MDN, two ideas that we incorporate into our claims forecasting neural network. As neural networks have been utilized and discussed extensively in recent loss reserving literature [9, 8, 16, 26, 10], we defer discussion of neural network fundamentals to those works and the standard reference [11]. To aid the practicing actuary in consuming this paper, we expand on certain concepts when we introduce the proposed neural network architecture in Section 4.4.

We remark that in sections 3.2 and 3.3, we discuss Bayesian inference and mixture density networks, respectively, in general, and provide details on our specific choices of distributions later on in the paper.

3.1 The Loss Reserving Problem

Figure 1 shows a diagram of the development of a typical claim. We first point out that there can be a time difference, known as the reporting lag, between an accident’s occurrence and its reporting. The accidents which have been reported to the insurer, but not yet settled, are known as reported but not settled (RBNS) or, equivalently, incurred but not enough reported (IBNER) claims, while the accidents which have occurred but are yet unknown to the insurer are known as incurred but not reported (IBNR) claims. The reserving actuary is interested in estimating the ultimate loss amounts associated with accidents that have already occurred. As demonstrated in Figure 1, it is possible for a closed claim to re-open, and it is also possible for a claim to be closed without any cash flows.

In this paper, we are concerned with RBNS/IBNER, but not IBNR, claims, due to limitations of available data, as for IBNR claims we do not have individual claim feature information available. Also, the claims we study encompass closed claims to allow for the possibility of claim re-opening.

Figure 1: Development of a claim

3.2 Bayesian Inference on Neural Networks

In this section, we briefly discuss Bayesian inference on neural networks, following the discourse in [12] and [5].

Consider a neural network parameterized by weights w that takes input x and returns output y. We can view this general formulation as a probabilistic model P(y|x, w); as an example, in the case of linear regression where , mean squared loss is specified, and constant variance is assumed, P(y|x, w) corresponds to a Gaussian distribution.

Rather than treating w as a fixed unknown parameter, we adapt a Bayesian perspective and treat w as a random variable with some prior distribution P(w). The task is then to compute the posterior distribution P(w|x, y) given the training data. We can then calculate the posterior predictive distribution ) = )], where is a new data point to be scored and the corresponding unknown response. However, determining P(w|x, y) analytically is intractable, and convergence of Markov chain Monte Carlo (MCMC) to the actual posterior for nontrivial neural networks is too slow to be feasible. Variational inference provides a workaround to this problem by approximating the posterior with a more tractable distribution ), which is often chosen to come from the mean-field family, i.e., but can be more general [5]. We then formulate an optimization problem to find the parameters . Specifically, we minimize the Kullback-Leibler (KL) divergence from the true posterior distribution P to the approximate distribution q. KL divergence is defined in general for probability distributions P to Q as

Our optimization problem can then be stated as

where D = (is the training dataset. We remark that, in Equation 4, the P(D) term resulting from applying Bayes’ theorem to P(w|D) disappears because it is irrelevant to the optimization.

Equation 5 then gives us the optimization objective

In practice, during training, where D is randomly split into M mini-batches , we compute

which we approximate with

where the are sampled independently from ). In other words, we sample from the weights distribution just once for each training sample.

3.3 Mixture Density Networks

In practical applications, the response variables we try to predict often have multimodal distributions and exhibit heteroscedastic errors. This is particular relevant in forecasting claims cash flows since, in a given time period, there could be a large payment or little or no payment. An MDN allows the output to follow a mixture of arbitrary distributions and estimate each of its parameters with the neural network. Recall that a mixture distribution has a distribution function of the form

where each 0, = 1, and n is the number of component distributions . Letting P denote the union of the sets of parameters for the distributions , the neural network must then output n + |P| values. The first n values determine the categorical distribution of the mixing weights, and the |P| outputs parameterize the component distributions.

By obtaining distributions rather than single points as prediction outputs, we also gain a straightforward mechanism to quantify the uncertainty of individual cash flow forecasts.

We emphasize now the difference between the uncertainty captured by specifying a distribution as the neural network’s output, as discussed here, and the uncertainty in the weight distribution, as discussed in Section 3.2. The former corresponds to the irreducible pure randomness, while the latter reflects uncertainty in parameter estimation.

4 Data and Model

In this section, we describe the data used for our experiments and the proposed model. The dataset used and relevant code are available on GitHub.The experiments are implemented using the R programming language [22] and TensorFlow [1].

4.1 Data

We utilize the individual claims history simulator of [10] to generate data for our experiments. For each claim, we have the following static information: line of business, labor sector of the injured, accident year, accident quarter, age of the injured, part of body injured, and the reporting year. In addition, we also have 12 years of claim development information, in the form of cash flows and claims statuses (whether the claim is open or not). The generated claims history exhibit behaviors such as negative cash flows for recoveries and late reporting, which mimic realistic scenarios.

Since we only have claims data and not policy level and exposure data, we study only reported claims.

4.2 Experiment Setup

We first simulate approximately 500,000claims using the simulator, which provides us with the full development history of these claims. The claims cover accident years 1994 to 2005. Since we are concerned with reported claims, we remove claims with report date after 2005, which leaves us with N = 497,516 claims. In this paper, we assume that each claim is fully developed at development year 11 (note that in the dataset the first development year is denoted year 0). More formally, we can represent the dataset as the collection

where ) denote the static claim features, incremental cash flow sequences, and claim status sequences, respectively, and j indexes the claims.

To create the training and testing sets, we select year 2005 as the evaluation year cutoff. For the training set, any cash flow information available after 2005 is removed. In symbols, we have

where AYdenotes the accident year associated with claim j.

For each claim in the training set, we create a training sample for each time period after development year 0. The response variable consists of cash flow information available as of the end of each time period until the evaluation year cutoff, and predictors are derived from information available before the time period. For a claim j, we have the following input-output pairs:

where denotes the latest development year for which data is available for claim j in the training set. As an example, if a claim has an accident year of 2000, five training samples are created. The first training sample has cash flows from 2001 to 2005 for the response and one cash flow value from 2000 for the predictor, while the last training sample has only the cash flow in year 2005 for the response and cash flows from 2000 to 2004 for the predictor.

We note here that we do not predict future claim statuses. As we will discuss in Section 4.4.4, our output distributions, which contain point masses at zero, can accommodate behaviors of both open and closed claims.

The training samples in Equation 12 undergo additional transformations before they are used for training our model. We discuss these transformations in detail in the next section.

4.3 Feature Engineering

We exhibit the predictors used and associated transformations in Table 1. In the raw simulated data, each time period has a cash flow amount along with a claim open status indicator associated with it. We take the cash flow value and derive two variables from it: the paid loss and the recovery, corresponding to the positive part and negative part of the cash flow, respectively. In other words, for each claim and for each time step, at most one of paid loss and recovery can be nonzero. For predictors, both the paid losses and recoveries are centered and scaled with respect to all instances of their values in the training set. The response cash flow values are not normalized.

Claim status indicator, which is a scalar value of 0 or 1, is one-hot encoded (i.e., represented using dummy variables); for each training sample, a claim status value is available for each time step.

Development year is defined as the number of years since the accident occurred. It is then scaled to [0, 1].

The static claim characteristic variables include age, line of business, occupation of the claimant, and the injured body part. Of these, age is numeric while the others are categorical. We center and scale the age variable and integer-index the others, which are fed into embedding layers [13], discussed in the next section.

As noted in the previous section, the response variable and the sequence predictors can have different lengths (in terms of time steps) from one sample to the next. To facilitate computation, we pre-pad and post-pad the sequences with a predetermined masking value (we use 99999) so that all sequences have a fixed length of 11.

Table 1: Predictor variable and transformations.

Figure 2: Claims forecasting model architecture. Note that the weights for the variational dense layers are shared across time steps.

4.4 Claims Forecasting Model

We utilize an encoder-decoder architecture with sequence output similar to the model proposed by [16]. The architecture is illustrated in Figure 2. We first provide a brief overview of the architecture, then provide details on specific components later in the section.

The output in our case is the future sequence of distributions of loss payments, and the input is the cash flow and claim status history along with static claim characteristics.

Static categorical variable inputs, such as labor sector of the injured, are indexed, then connected to embedding layers [13] and sequential data, including payments and claim statuses, are connected to long short-term memory (LSTM) layers [14].

The encoded inputs are concatenated together, then repeated 11 times before being passed to a decoder LSTM layer which returns a sequence. The length of this output sequence is so chosen to match our requirement to forecast a maximum of 11 steps into the future. Each time step of this sequence is connected to two dense variational layers, each of which parameterizes an output distribution, corresponding to paid losses and recoveries, respectively. The weights of the dense variational layer (for paid loss and recovery) are each shared across the time steps. In other words, for each training sample, we output two 11-dimensional random variables, each of which can be considered as a collection of 11 independent but non-identically distributed random variables.

We use the same loss function for each output and weight them equally for model optimization. The loss function is the negative log-likelihood given the target data with an adjustment for variable output lengths which we discuss in detail in Section 4.4.5.

In the remainder of this section, we discuss in more detail embedding layers, LSTM, our choices of the variational and distribution layers, the loss function, and model training.

4.4.1 Embedding Layer

An embedding layer maps each level of a categorical variable to a fixed-length vector in n-dimensional Euclidean space. The value of n is a hyperparameter chosen by the modeler; in our case we select n = 2 for all embedding layers, which means we map each factor level to a point in . In contrast to data-preprocessing dimensionality techniques such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE), the values of the embeddings are learned during training of the neural network.

4.4.2 Long Short-Term Memory

LSTM is a type of recurrent neural network (RNN) architecture that is appropriate for sequential data, such as natural language or time series (our use case). Empirically, LSTM is more apt at handling data with longer term dependencies than simple RNN [17]. In our model, for both of the sequence input encoders and the decoder, we utilize a single layer of LSTM with 3 hidden units. We found that these relatively small modules provided reasonable results, but did not perform comprehensive tuning to test deeper and larger architectures.

4.4.3 Dense Variational Layer

We choose Gaussians for both the prior and surrogate posterior distributions over the weights of the dense layers. The prior distribution is constrained to have zero mean and unit variance while for the surrogate posterior both the mean and variance are trainable. Symbolically, if we let w denote the weights associated with each dense layer, we have

where represents trainable parameters. The KL term associated with estimation, as described in Section 3.2, is added to the neural network loss during optimization. The dense layers each output four units to parameterize the output distributions which we discuss next.

4.4.4 Output Distribution

For each of the 11 time steps we forecast, we parameterize two distributions, one for paid losses and one for recoveries. Each of the distributions is chosen to be a mixture of a shifted log-normal distribution and a deterministic degenerate distribution localized at zero, representing cash flow and no cash flow, respectively. Denoting to be the output of the preceding dense variational layer, we have the following as the distribution function:

We now describe the components of Equation 15. First,

are the normalized mixing weights. ) denotes the distribution function for a log-normal distribution with location and scale parameters , respectively, shifted to the left by 0.001 to accommodate zeros in the data. In other words, if . This modification is necessary because we have zero cash flows in the response sequences, and computing the likelihood becomes problematic since the support of the log-normal distribution does not contain zero.

is the distribution function for the deterministic distribution, and is the sigmoid function defined as

In Equation 15, the constants 0 are included for numerical stability and can be tuned as hyperparameters, although we set them to 0.001 and 0.01, respectively, for our experiments. The purpose of is also numerical stability, as the scale parameter of the log-normal distribution is hard to learn in practice, so we bound it above by a constant. In our experiments we fix it to be 0.7.

The net loss prediction for each development year is then the difference of the paid loss and recovery amounts. We assume that the output distributions, conditional on their parameters, are independent across the time steps, which facilitates the derivation of the loss function in the next section. We remark that this is a weakness of our approach, since the independence assumption is not fulfilled in practice.

4.4.5 Loss Function

We calculate the log-likelihood loss for each output by computing the log probability of the true labels with respect to the output distributions. As mentioned in Section 4.3, the output sequences may contain masking values that need to be adjusted. Specifically, we marginalize out the components with the masking values. Formally, for a given training sample, let M denote the masking constant, ) denote a single output sequence, ), and, abusing notation, also let

The adjusted log-likelihood for a single training sample then becomes

where is nonempty and 11 otherwise.

Hence, the log-likelihood associated with a training sample is the sum of the log probabilities of the nonmasked elements. The negative log-likelihood, summed with the KL term associated with the variational layers, discussed in Section 3.2, comprise the total network loss for optimization. The contribution of a single training sample to the loss is then

where k = 1, 2 correspond to variational layers for parameterizing the paid loss and recovery distributions, respectively.

4.4.6 Training and Scoring

We use a random subset of the training set consisting of 5% of the records as the validation set for determining early stopping and scheduling the learning rate. For optimizing the neural network, we use stochastic gradient descent with an initial learning rate of 0.01 and a minibatch size of 100,000, and halve the learning rate when there is no improvement in the validation loss for five epochs. Training is stopped early when the validation loss does not improve for ten epochs; we cap the maximum number of epochs at 100. Here, an epoch refers to a complete iteration through our training dataset, and the minibatch size refers to how many training samples we randomly sample for each gradient descent step.

To forecast an individual claim, we construct a scoring data point by using all data available to us as of the evaluation date cutoff. In other words, the last elements of the predictor cash flow sequences correspond to the actual value from the cutoff year. Hence, in the output sequence, the first element corresponds to the prediction of the cash flow distribution of the year after the cutoff.

Recall that our model weights are a random variable, so each time we score a new data point, we are sampling from the distribution of model weights and expect to obtain different parameters for the distributions of cash flows. The output distributions themselves are also stochastic and can be sampled from. Following [15], we refer to the former variability as epistemic uncertainty and the latter as aleatoric uncertainty. Epistemic uncertainty reflects uncertainty about our model which can be reduced given more training data, while aleatoric uncertainty reflects the irreducible noise in the observations. In actuarial science literature, these concepts are also known as parameter estimation uncertainty and process uncertainty, respectively [24].

Generating a distribution of future paths of cash flows amounts to repeating the procedure of drawing a model from its distribution, calculating the output distribution parameters using the drawn model, then drawing a sample of cash flows from the distribution. Since we decompose the cash flows into paid losses and recoveries, we can obtain net amounts by subtracting the generated recoveries from the generated paid losses.

If we were only interested in point estimates, we can avoid sampling the output distributions and simply compute their means, since given the model weights we have closed form expressions for the distributions given by Equation 15. Note that we would still have to sample the model weights.

5 Results and Discussion

We evaluate our modeling framework by qualitatively inspecting samples paths generated for an individual claim and also comparing the aggregate estimate of unpaid claims with a chain ladder estimate. We note that, to the best of our knowledge, prior to the current paper, there is not a benchmark for individual claims forecasts. We also discuss the extensibility of our approach to accommodate company specific data elements and expert knowledge.

5.1 Individual Claim Forecasts

Figure 3 shows various posterior densities, obtained by sampling the model weights 1,000 times, of parameters for the output distributions of a single claim. Our model assigns a higher probability of payment in the next year along with more variability around that probability. It can be seen that the expected probability of a payment decreases as we forecast further into the future. Given that a loss payment occurs, we see from the middle and bottom plots that both the expected value and the variability for both the mean and variance of the log-normal distributions increase with time. In other words, for this particular claim, loss payments become less likely as time goes on, but if they occur, they tend to be more severe and the model is less certain about the severity distribution.

Figure 3: Posterior distributions for a single claim at development year 4. (Top) Payment probability. (Middle) Mean of loss payment distribution. (Bottom) Log variance of payment distribution.

Figure 4: 1,000 samples of cumulative cash flow paths for a single claim. The training data includes development year 4 and the predictions begin in development year 5.

In Figure 4 we show plausible future developments of the claim. Note that one thousand samples are drawn, so the handful of scenarios that present large payments represents a small portion of the paths, which is consistent with the distributions of parameters we see above.

5.2 Aggregate estimates

To compute a point estimate for the total unpaid losses, we score each claim once (i.e., sample from the weights distribution once) to obtain a distribution for each time step for paid losses and recoveries. We then compute the means of the paid loss and recovery distributions and take the differences to obtain the net cash flow amount. The final unpaid loss estimate is then the sum of all the net amounts up to and including development year 11. Since there is randomness in the neural network weight initialization, we instantiate and train the model 10 times, and take the average of the predicted future paid losses across the 10-model ensemble. In practice, one could distribute the prediction procedure over a computing cluster and draw more samples for an even more stable estimate.

For the chain ladder development method benchmark, we aggregate the data into a report year triangle (to exclude IBNR), calculate ultimate losses using all-year weighted age-to-age factors, then subtract the already paid amounts to arrive at the unpaid estimate. The unpaid amounts from the two approaches are then compared to the actual unpaid amounts. The results are shown in Table 2.

Table 2: Aggregate forecast results.

While we are not improving on chain ladder estimates at the aggregate level for this particular simulated dataset, our approach is able to provide individual claims forecasts, which are interesting in their own right.

5.3 Extensibility

One of the primary advantages of neural networks is their flexibility. Considering the architecture described in Section 4.4, we can include additional predictors by appending additional input layers. These inputs can be unstructured and we can leverage appropriate techniques to process them before combining with the existing features. For example, we may utilize convolutional layers to transform image inputs and recurrent layers to transform audio or text inputs.

The forms of the output distributions can also be customized. We choose a log-normal mixture for our dataset, but depending on the specific book of business, one may want to specify a different distribution, such as a Gamma or a Gaussian. As we have done in constraining the scale parameter of the log-normal distribution, the modeler also has the ability to bound or fix specific parameters as situations call for them.

6 Conclusion

We have introduced a framework for individual claims forecasting that can be utilized in loss reserving. By leveraging Bayesian neural networks and stochastic output layers, our approach provides ways to learn uncertainty from the data. It is also able to produce cash flow estimates for multiple future time periods. Experiments confirm that the approach gives reasonable results and provides a viable candidate for future work on individual claims forecasting to benchmark against.

There are a few potential avenues for enhancements and extensions, including larger scale simulations to evaluate the quality of uncertainty estimates, evaluating against new datasets, and relaxing simplifying assumptions, such as the independence of time steps to obtain more internally consistent forecasts. One main limitation of the proposed approach and experiments is that we focus on reported claims only. With access to policy level data, a particularly interesting direction of research would be to extend the approach to also estimate IBNR claims.

Acknowledgments

We thank Sigrid Keydana, Daniel Falbel, Mario Wüthrich, and volunteers from the Casualty Actuarial Society (CAS) for helpful discussions. This work is supported by the CAS.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] K. Antonio and R. Plat. Micro-level stochastic loss reserving for general insurance. Scandinavian Actuarial Journal, 2014(7):649–669, 2014.

[3] M. Baudry and C. Y. Robert. A machine learning approach for individual claims reserving in insurance. Applied Stochastic Models in Business and Industry.

[4] C. M. Bishop. Mixture Density Networks. 1994.

[5] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight Uncertainty in Neural Networks. arXiv:1505.05424 [cs, stat], May 2015.

[6] A. Boumezoued and L. Devineau. Individual claims reserving: A survey. Nov. 2017.

[7] F. Duval and M. Pigeon. Individual Loss Reserving Using a Gradient Boosting-Based Approach. Risks, 7(3):79, Sept. 2019.

[8] A. Gabrielli. A Neural Network Boosted Double Over-Dispersed Poisson Claims Reserving Model. SSRN Scholarly Paper ID 3365517, Social Science Research Network, Rochester, NY, Apr. 2019.

[9] A. Gabrielli, R. Richman, and M. V. Wüthrich. Neural network embedding of the over-dispersed Poisson reserving model. Scandinavian Actuarial Journal, pages 1–29, 2019.

[10] A. Gabrielli and M. V Wüthrich. An individual claims history simulation machine. Risks, 6(2):29, 2018.

[11] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT press, 2016.

[12] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.

[13] C. Guo and F. Berkhahn. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737, 2016.

[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[15] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.

[16] K. Kuo. DeepTriangle: A deep learning approach to loss reserving. Risks, 7(3):97, 2019.

[17] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.

[18] O. Lopez, X. Milhaud, and P.-E. Thérond. A Tree-Based Algorithm Adapted to Microlevel Reserving and Long Development Claims. ASTIN Bulletin: The Journal of the IAA, pages 1–22, 2019.

[19] R. M. Neal. Bayesian Learning for Neural Networks. Springer Science & Business Media, Dec. 2012.

[20] M. Pigeon, K. Antonio, and M. Denuit. Individual loss reserving with the multivariate skew normal framework. ASTIN Bulletin: The Journal of the IAA, 43(3):399–428, 2013.

[21] M. Pigeon, K. Antonio, and M. Denuit. Individual loss reserving using paid–incurred data. Insurance: Mathematics and Economics, 58:121–131, 2014.

[22] R. R Development Core Team. R: A Language and Environment for Statistical Computing. R foundation for statistical computing Vienna, Austria, 2011.

[23] G. Taylor. Loss Reserving Models: Granular and Machine Learning Forms. Risks, 7(3):82, 2019.

[24] M. V. Wuthrich. Non-life insurance: Mathematics & statistics. Available at SSRN 2319328, 2017.

[25] M. V. Wüthrich. Machine learning in individual claims reserving. Scandinavian Actuarial Journal, 2018(6):465–480, 2018.

[26] M. V. Wüthrich. Neural networks applied to chain–ladder reserving. European Actuarial Journal, 8(2):407–436, 2018.