CatBoostLSS -- An extension of CatBoost to probabilistic forecasting

2020·arXiv

ABSTRACT

We propose a new framework of CatBoost that predicts the entire conditional distribution of a univariate response variable. In particular, CatBoostLSS models all moments of a parametric distribution (i.e., mean, location, scale and shape [LSS]) instead of the conditional mean only. Choosing from a wide range of continuous, discrete and mixed discrete-continuous distributions, modelling and predicting the entire conditional distribution greatly enhances the flexibility of CatBoost, as it allows one to gain insight into the data generating process and to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. We present both a simulation study and real-world examples that demonstrate the benefits of our approach.

Keywords: CatBoost · Distributional Modelling · Expectile Regression · Probabilistic Forecast · Statistical Machine Learning · Uncertainty Quantification

1 Introduction

To reason rigorously under uncertainty we need to invoke the language of probability (Zhang et al., 2020). Any model that falls short of quantifying the uncertainty attached to its outcome is likely to yield an incomplete and potentially misleading picture. While this is an established consensus in statistics, a common and very persistent misconception is that machine learning models usually lack proper ways of quantifying uncertainty. Although the two terms exist in parallel and are often used interchangeably, the perception that machine learning and statistics comprise non-overlapping sets of techniques persists among both practitioners and academics. This is vividly portrayed by the provocative (and potentially tongue-in-cheek) statement of Brian D. Ripley, made at the useR! 2004 conference in Vienna to illustrate the difference between machine learning and statistics, that ‘machine learning is statistics minus any checking of models and assumptions’ (Zeileis et al., 2016).

In fact, such statements artificially complicate the relationship between statistics and machine learning and are unfortunate at best, as they imply a profound and qualitative distinction between the two disciplines (Januschowski et al., 2020). The paper by Breiman (2001) is a notable exception, as it proposes to differentiate the two based on scientific culture, rather than on methods alone. Both statistics and machine learning create models from data, but for different purposes. There is the statistical culture that is well embedded in statistical theory and that assumes the data to be generated by a stochastic process. Its aim is to draw inference from the sample and to provide insights into the data generating process. Hence, while the emphasis of statistics is on inference, mathematical rigour and elegance, model validity and estimation of model parameters, the algorithmic culture of machine learning is rather concerned with out-of-sample fit, computational performance and function optimization (Januschowski et al., 2020).

While the approaches discussed in Breiman (2001) are an admissible partitioning of the space of how to analyse and model data, more recent advances have gradually made this distinction less clear-cut (see Section 4 and the references therein for an overview). In fact, the current research trend in both statistics and machine learning gravitates towards bringing both disciplines closer together. In an era in which the output of prediction models increasingly needs to be turned into explainable and reliable insights, this is an exceedingly promising and encouraging development, as both disciplines have much to learn from each other. Along with Januschowski et al. (2020), we argue that it is more constructive to seek common ground than it is to introduce artificial boundaries. As such, this paper contributes to further closing the gap between the two cultures by extending statistical boosting to a machine learning approach that accounts for all distributional properties of the data. In particular, we present an extension of CatBoost, which has recently gained much popularity and attention as a competitor to the eminent XGBoost model. We term our model CatBoostLSS, as it combines the accuracy and speed of CatBoost with the flexibility and interpretability of GAMLSS, which allow for the estimation and prediction of the entire conditional distribution. CatBoostLSS allows the user to choose from a wide range of continuous, discrete and mixed discrete-continuous distributions to better adapt to the data at hand, as well as to provide predictive distributions, from which prediction intervals and quantiles can be derived. CatBoostLSS therefore contributes to the growing literature on statistical machine learning that aims at weakening the separation between the ‘Data Modelling Culture’ and the ‘Algorithmic Modelling Culture’, so that models designed mainly for prediction can also be used to describe and explain the underlying data generating process of the response of interest.

The remainder of this paper is organised as follows: Section 2 introduces the reader to distributional modelling and Section 4 presents an overview of related research. In Section 3, we formally introduce CatBoostLSS, while Section 5 presents both a simulation study and real world examples that provide a walk-through of the functionality of our model. Section 6 gives an overview of available software implementations and Section 7 concludes.

2 Distributional Modelling

The ultimate goal of regression analysis is to obtain information about the [entire] conditional distribution of a response given a set of explanatory variables. (Hothorn et al., 2014)

Consulting the literature on machine learning shows that the main focus so far has been on prediction accuracy and estimation speed. In fact, even though machine learning approaches (e.g., Random Forest or Gradient Boosting-type algorithms) outperform many statistical models when it comes to prediction accuracy, the output/forecast of these models provides information about the conditional mean E(Y |X = x) only. As a consequence, this class of models reveals little about other characteristics of the (predicted) distribution and falls short in applications where probabilistic forecasts are required, e.g., for assessing prediction uncertainty in the form of prediction intervals. By focusing on point forecasts and hoping that they materialize, we ignore one of the fundamental principles of nature, namely that the future is uncertain by default and that the best we can hope for when providing forecasts is to properly quantify the uncertainty attached to them. In contrast, probabilistic forecasts are predictions in the form of a probability distribution, rather than a single point estimate. Having estimated and predicted the conditional distribution, we can create sample paths, each of which can be interpreted as a possible realization of the future. In this context, the introduction of Generalised Additive Models for Location Scale and Shape (GAMLSS) by Rigby and Stasinopoulos (2005) has stimulated a lot of research and culminated in a new branch of statistics that focuses on modelling the entire conditional distribution as a function of covariates. This section introduces the reader to the general idea of distributional modelling.

In its original formulation, GAMLSS assume that a univariate response follows a distribution $\mathcal{D}$ that depends on up to four parameters, i.e., $y_i \sim \mathcal{D}(\mu_i, \sigma_i, \nu_i, \tau_i)$, where $\mu_i$ and $\sigma_i$ are location and scale parameters, respectively, while $\nu_i$ and $\tau_i$ correspond to shape parameters such as skewness and kurtosis. Hence, the framework allows us to model not only the mean (or location) but all parameters as functions of explanatory variables. In contrast to Generalised Linear Models (GLM) and Generalised Additive Models (GAM), the assumption that the response belongs to an exponential family type of distribution is relaxed in GAMLSS and replaced by a general distribution family, including highly skewed and/or kurtotic continuous, discrete and mixed discrete-continuous distributions. While the original formulation of GAMLSS in Rigby and Stasinopoulos (2005) suggests that any distribution can be described by location, scale and shape parameters, it is not necessarily true that the distribution at hand is actually characterized by parameters that represent shape, i.e., skewness and kurtosis. Hence, we follow Klein et al. (2015b) and use the terms distributional modelling and GAMLSS interchangeably. From a frequentist point of view, distributional modelling can be formulated as follows

$$y \sim \mathcal{D}\big(h_1(\theta_1) = \eta_1,\; h_2(\theta_2) = \eta_2,\; \ldots,\; h_K(\theta_K) = \eta_K\big) \tag{1}$$

where $\mathcal{D}$ denotes a parametric distribution for the response $y$ that depends on $K$ distributional parameters $\theta_k$, $k = 1, \ldots, K$, with $h_k(\cdot)$ denoting a known monotonic function relating the distributional parameters to a predictor $\eta_k$. In its most generic form, the predictor $\eta_k$ is given by

$$\eta_k = f_k(\mathbf{x}), \quad k = 1, \ldots, K. \tag{2}$$

Within the original distributional regression framework, the functions $f_k(\cdot)$ usually represent an additive GAM-type predictor based on a basis function approach using splines, i.e., $\eta_k = \mathbf{X}\boldsymbol{\beta}_k + \sum_{j} f_{jk}(z_j)$, where $\boldsymbol{\beta}_k$ is a parameter vector modelling linear effects or categorical variables, $\mathbf{X}$ is the corresponding design matrix and the $f_{jk}(\cdot)$ reflect different types of regression effects that model the effect of a continuous covariate $z_j$. However, it is important to stress that, besides its classic representation, the predictor specification in Equation (2) is generic enough to also represent classification and regression trees, which allows us to extend CatBoost to a probabilistic framework. The estimation of distributional regression relies on the availability of first and second order derivatives of the log-likelihood function, which are needed for Fisher-scoring type algorithms. As we will see in Section 3, this is very closely related to the estimation of CatBoost, which we will exploit to arrive at CatBoostLSS.
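
To make the role of the monotonic link function concrete, the following minimal sketch maps unconstrained predictors to valid distributional parameters. The identity link for the location and the log link for the scale are assumptions chosen for illustration, and the names `response_fn` and `to_parameters` are hypothetical.

```python
import math

# Inverse link (response) functions: each maps an unconstrained predictor
# to the admissible range of a distributional parameter.
# Assumed choices for a Normal response: identity for the location and
# exp (the inverse of a log link) for the strictly positive scale.
response_fn = {
    "location": lambda eta: eta,
    "scale": lambda eta: math.exp(eta),
}

def to_parameters(predictors):
    """Apply the inverse link to each predictor, keyed by parameter name."""
    return {name: response_fn[name](eta) for name, eta in predictors.items()}

# Hypothetical predictor values, e.g. the raw output of a tree ensemble.
params = to_parameters({"location": -1.5, "scale": -0.7})
```

Because the exponential is strictly positive, any real-valued predictor yields a valid scale, so a tree-based learner can be fitted on the unconstrained predictor without additional constraints.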

We would like to draw the reader's attention to an implication of modelling and predicting the entire distribution that has received relatively little interest in the machine learning community until very recently (Quiñonero-Candela et al., 2009). Many machine learning algorithms have been proposed and shown to be very successful. One assumption required to guarantee performance in prediction tasks is that the test data have the same distribution as the training data or, more formally, that train and test observations are independent and identically distributed (iid) realizations arising from the same stationary distribution $\mathcal{D}(\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is a vector of distributional parameters. As such, we aim to train a model that, given new inputs from the test set, can accurately predict the corresponding unseen output. In real-world applications, however, distributions are complex and likely to be non-stationary, so that the conditional response distribution can rarely be expected to remain unchanged. If not addressed properly, a shift in the conditional distribution between training and test data may lead to inaccurate parameter estimates and unstable predictions (Quiñonero-Candela et al., 2009). To illustrate the implications of distributional modelling, let us re-visit the concept of stationarity used in time series analysis, with covariates x including time. Most forecasting methods assume that the time series at hand can be rendered approximately stationary using appropriate transformations, e.g., difference-stationary or trend-stationary. In general, one can distinguish two forms of stationarity. The first, and weaker, form is covariance stationarity, which requires the first moment (i.e., the mean) and the auto-covariance not to vary with respect to time. The second, and stricter, form is strong stationarity, which can be formulated as follows

$$F_Y(y_{t_1}, \ldots, y_{t_n}) = F_Y(y_{t_1 + s}, \ldots, y_{t_n + s}) \quad \text{for all } s \text{ and all } t_1, \ldots, t_n, \tag{3}$$

where $F_Y$ is the joint cumulative distribution function of $\{y_t\}$ at times $t_1, \ldots, t_n$. Given that $F_Y$ does not change with a shift in time of $s$, it follows that all parameters of a strictly stationary process are time invariant. However, this is a very restrictive assumption that is likely to be violated in many real-world applications. Along with Quiñonero-Candela et al. (2009), we argue that flexible modelling frameworks are essential for the development of a detailed understanding of the problems attached to modelling non-stationary distributions. As we will see in subsequent sections, distributional modelling in general, and CatBoostLSS in particular, allows us to uncover and analyse the underlying mechanisms of a distribution shift, such as a change in variance, by relating all distributional parameters to explanatory variables. In particular, distributional modelling implies that the observations are independent, but not necessarily identical, realizations $y_i \overset{ind}{\sim} \mathcal{D}\big(\boldsymbol{\theta}(\mathbf{x}_i)\big)$, where all distributional parameters are related to and allowed to change with covariates.

3 CatBoostLSS

In this section, we introduce CatBoostLSS. It is based on the CatBoost (for categorical boosting) algorithm recently introduced by Prokhorenkova et al. (2017) and Dorogush et al. (2018). Several characteristics set CatBoost apart from other existing boosting approaches, namely the implementation of ordered boosting, a permutation-driven alternative to existing approaches, and an efficient algorithm for the vector representation of categorical data that makes CatBoost particularly suitable for handling data sets with many categorical features. Both novelties use random permutations of the training examples to counter the prediction shift caused by a special kind of target leakage present in all existing implementations of gradient boosting algorithms (Prokhorenkova et al., 2017; Dorogush et al., 2018). Even though it provides a flexible interface for parameter tuning, CatBoost is reported to outperform many of the existing state-of-the-art implementations of gradient boosted decision trees, such as XGBoost of Chen and Guestrin (2016), on a diverse set of popular tasks using default parameters only. It has both a CPU and a GPU implementation, which are reported to be faster than other gradient boosting libraries. Depending on the task at hand, CatBoost allows the user to select between gradient and Newton boosting. To establish the connection between GAMLSS and CatBoostLSS, we need to recall that for Newton boosting, the first and second order partial derivatives of the (element-wise) loss function with respect to the fitted label are calculated at each iteration t. As such, Newton boosting amounts to a weighted least-squares regression problem at each iteration, which is solved using base learners (e.g., CART). As a consequence, Newton boosting can be understood as an iterative empirical risk minimization procedure in function space that determines both the step direction and the step length at the same time. This is where CatBoostLSS makes the connection to GAMLSS, as empirical risk minimization and Maximum Likelihood estimation are closely related. Recall from Section 2 that GAMLSS are estimated using the first and second order partial derivatives of the log-likelihood function with respect to the distributional parameter of interest. By selecting an appropriate loss or, equivalently, a log-likelihood function, Maximum Likelihood can be formulated as empirical risk minimization, so that the resulting CatBoost model can be interpreted as a statistical model.
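
As a worked illustration of the derivatives that Newton boosting requires, consider a single observation from a Normal distribution. The choice of the Normal and of a log link for the scale is an assumption made for this example only; the symbols $\ell_i$ and $\eta_{\sigma,i}$ are introduced here for illustration.

```latex
% Log-likelihood contribution of observation i for y_i ~ N(mu_i, sigma_i^2)
\ell_i = -\tfrac{1}{2}\log(2\pi) - \log\sigma_i - \frac{(y_i - \mu_i)^2}{2\sigma_i^2}

% Location (identity link)
\frac{\partial \ell_i}{\partial \mu_i} = \frac{y_i - \mu_i}{\sigma_i^2},
\qquad
\frac{\partial^2 \ell_i}{\partial \mu_i^2} = -\frac{1}{\sigma_i^2}

% Scale with an assumed log link, eta_{sigma,i} = log(sigma_i)
\frac{\partial \ell_i}{\partial \eta_{\sigma,i}} = \frac{(y_i - \mu_i)^2}{\sigma_i^2} - 1,
\qquad
\frac{\partial^2 \ell_i}{\partial \eta_{\sigma,i}^2} = -\frac{2\,(y_i - \mu_i)^2}{\sigma_i^2}
```

The second derivative with respect to the location is strictly negative, so the log-likelihood is concave in the location and each Newton step is well defined; the second derivative with respect to the log-scale vanishes when an observation coincides with its predicted mean, which is one reason why stabilising the Hessian can matter in practice.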

Having established the connection between the estimation of GAMLSS and CatBoost, and hence that CatBoostLSS can be interpreted as a statistical model, we can now introduce CatBoostLSS more formally. Algorithm (1) gives a conceptual overview of the steps involved in estimating our model. We have designed CatBoostLSS in such a way that the initial CatBoost implementation remains unchanged, so that its full functionality is still available. In a sense, CatBoostLSS is a wrapper around CatBoost, where we interpret the loss function from a statistical perspective by formulating empirical risk minimization as Maximum Likelihood estimation. As outlined in Algorithm (1), we first need to specify an appropriate log-likelihood, from which Gradients and Hessians are derived; these represent the partial first and second order derivatives of the log-likelihood with respect to the distributional parameter of interest. In contrast to the approach in Mayr et al. (2012) and Thomas et al. (2018), which uses a component-wise gradient descent algorithm where each of the distributional parameters $\theta_k$, $k = 1, \ldots, K$, is updated successively in each iteration, using the current estimates of the other distributional parameters as input, our approach is a two-step procedure. In the first step, we estimate a separate model for each distributional parameter $\theta_k$, where the unconditional Maximum Likelihood estimates of the parameters $\theta_{-k}$ not currently being estimated are used as offset values. As such, while $\theta_k$ is estimated, the $\theta_{-k}$ are treated as constant. In the second step, once all $\theta_k$ are estimated, we update each parameter by incorporating information from all other parameters until a stopping criterion based on the global deviance is met. Once all parameters are updated and the global deviance has converged, we can draw random samples from the predicted distribution, which allows us to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived.
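
The first step of the two-step procedure can be sketched for the location parameter of a Normal response. This is a minimal sketch and not the authors' implementation: the class name, the data and the fixed scale offset are hypothetical. The method name `calc_ders_range` is the one CatBoost's Python package documents for user-defined objectives, and the sketch assumes CatBoost's convention that the returned derivatives are those of the function to be maximised, here the log-likelihood.

```python
import math

class NormalLocationObjective:
    """Hypothetical step-one objective for the location of a Normal response.

    Mirrors the interface of a CatBoost user-defined objective: a
    `calc_ders_range` method that returns one (first, second) derivative
    pair per observation. The scale is held fixed at a constant offset.
    """

    def __init__(self, sigma):
        # Assumed offset: the unconditional ML estimate of the scale.
        self.sigma = sigma

    def calc_ders_range(self, approxes, targets, weights):
        result = []
        for i, (mu, y) in enumerate(zip(approxes, targets)):
            w = 1.0 if weights is None else weights[i]
            der1 = w * (y - mu) / self.sigma ** 2  # d loglik / d mu
            der2 = -w / self.sigma ** 2            # d^2 loglik / d mu^2
            result.append((der1, der2))
        return result

# Hypothetical responses; the scale offset is the unconditional ML estimate.
y = [2.0, 3.5, 1.0, 4.5]
mean = sum(y) / len(y)
sigma_hat = math.sqrt(sum((v - mean) ** 2 for v in y) / len(y))

objective = NormalLocationObjective(sigma_hat)
# Derivatives evaluated at a constant initial prediction (the sample mean).
ders = objective.calc_ders_range([mean] * len(y), y, None)
```

Under these assumptions, an instance of such a class could be passed as `loss_function` to `CatBoostRegressor` together with an explicit `eval_metric`; that step is not executed here, and a second objective of the same form would be needed for the scale.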

4 Related Research

Reviewing the current literature at the intersection of machine learning and statistics shows that there has been an incredibly rich stream of ideas that aim at bringing the two disciplines closer together. As this section cannot give an exhaustive overview of all approaches, we refer the interested reader to März (2019) and the references therein.

Amongst all of the available implementations, the approach closest to CatBoostLSS is XGBoostLSS, introduced in our related paper März (2019). In fact, CatBoostLSS and XGBoostLSS differ only in the base learner used to model the conditional distribution. The choice of which approach to use depends, as always, on the purpose and problem at hand. CatBoostLSS makes use of CatBoost’s efficient algorithm for representing categorical data, which makes it particularly suitable for handling data sets where the majority of the features are categorical. Also, the fact that CatBoost achieves state-of-the-art prediction performance with no or very little hyper-parameter tuning makes it particularly useful for distributions with many parameters, as the careful selection of hyper-parameters for learning higher conditional moments, e.g., kurtosis or skewness, is crucial for the stability and convergence of the algorithm. While existing GAMLSS frameworks and implementations are supposed to perform well for small to medium sized data sets, CatBoostLSS plays to its strengths in situations where the user faces data sets with hundreds of thousands or even millions of observations.

5 Applications

In the following, we present both a simulation study and real-world examples that demonstrate the functionality of CatBoostLSS. It is important to note that, since CatBoost yields state-of-the-art prediction results without the extensive parameter tuning typically required by other machine learning methods, we estimate all CatBoostLSS models in the following using default hyper-parameter settings.

5.1 Simulation

We start with a simulated data set that exhibits heteroskedasticity, where the interest lies in predicting the 5% and 95% quantiles. The dots in red show points that lie outside the 5% and 95% quantiles, which are indicated by the black dashed lines.

Figure 1: Simulated Train Dataset with 7,000 observations. Points outside the 5% and 95% quantile are coloured in red. The black dashed lines depict the actual 5% and 95% quantiles. Besides the only informative predictor x, we have added noise variables to the design matrix.
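
A data set of this kind can be generated with a few lines of code. The following is a minimal sketch under stated assumptions: the data generating process (a Normal response with constant mean and a standard deviation that grows linearly in x), the number of noise variables and the function name `simulate` are all hypothetical, as the exact simulation design is not reproduced here.

```python
import random
from statistics import NormalDist

def simulate(n, n_noise=3, seed=0):
    """Draw n observations from an assumed heteroskedastic Normal process."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        sigma = 1.0 + 4.0 * x          # assumed: scale increases with x
        dist = NormalDist(10.0, sigma)  # assumed: constant mean of 10
        rows.append({
            "x": x,
            "noise": [rng.uniform(0.0, 1.0) for _ in range(n_noise)],
            "y": rng.gauss(10.0, sigma),
            "q05": dist.inv_cdf(0.05),  # actual conditional 5% quantile
            "q95": dist.inv_cdf(0.95),  # actual conditional 95% quantile
        })
    return rows

train = simulate(7000)
# Share of points outside the actual 5% and 95% quantiles (about 10%).
outside = sum(r["y"] < r["q05"] or r["y"] > r["q95"] for r in train) / len(train)
```

Because the mean does not depend on x in this assumed design, a model for the conditional mean alone cannot recover the widening quantile band; only a model for the scale can.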

As the splitting procedures that are used internally to construct trees can detect changes in the mean only, standard implementations of machine learning models are not able to recognize any distributional changes (e.g., a change of variance), even if these can be related to covariates (Hothorn and Zeileis, 2018). As such, CatBoost does not provide any uncertainty quantification in its current implementation, as the model focuses on predicting the conditional mean E(Y |X = x) only, without any assessment of the full predictive distribution. This is in contrast to CatBoostLSS, where all distributional parameters are modelled as functions of covariates.

Let us now fit CatBoostLSS to the data. In general, the syntax is similar to the original CatBoost implementation. However, the user has to make a distributional assumption by specifying a family in the function call. As the data has been generated by a Normal distribution, we use the Normal as a function input. The user also has the option of providing a list of hyper-parameters that are used for tuning the model. Once the model is trained, we can predict all parameters of the distribution. As CatBoostLSS models the entire conditional distribution, we obtain prediction intervals and quantiles of interest directly from the predicted quantile function. Figure 2 shows the predictions of CatBoostLSS for the 5% and 95% quantile in blue.
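
Once location and scale have been predicted, the quantiles follow from the quantile function of the assumed distribution. The sketch below uses Python's standard library for the Normal case; the predicted values and the function name `predict_quantiles` are hypothetical placeholders, not output of CatBoostLSS.

```python
from statistics import NormalDist

def predict_quantiles(mu, sigma, probs=(0.05, 0.95)):
    """Evaluate the Normal quantile function at the requested probabilities."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf(p) for p in probs]

# Hypothetical predicted parameters for three test observations.
pred_mu = [10.0, 10.2, 9.9]
pred_sigma = [1.0, 2.5, 4.0]

# One (5% quantile, 95% quantile) pair per observation.
intervals = [predict_quantiles(m, s) for m, s in zip(pred_mu, pred_sigma)]
```

The interval width scales with the predicted standard deviation, which is how a model of the scale translates into covariate-dependent prediction intervals.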

Figure 2: Simulated Test Dataset with 3,000 observations. Points outside the conditional 5% and 95% quantile are in red. The black dashed lines depict the actual 5% and 95% quantiles. Conditional 5% and 95% quantile predictions obtained from CatBoostLSS are depicted by the blue lines. Besides the only informative predictor x, we have added noise variables to the design matrix.

Comparing the coverage of the intervals with the nominal level of 90% shows that CatBoostLSS not only correctly models the heteroskedasticity in the data, but also provides a reasonable forecast for the 5% and 95% quantiles. The flexibility of CatBoostLSS also comes from its ability to provide attribute importance, as well as partial dependence plots, for all of the distributional parameters. In the following we only investigate the effect on the conditional variance. Figure 3 shows that CatBoostLSS has identified the only informative predictor x and does not consider any of the noise variables as important features.
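
The coverage comparison amounts to counting how many test responses fall inside their predicted interval. A minimal sketch, with made-up responses and interval bounds and a hypothetical function name:

```python
def empirical_coverage(y, lower, upper):
    """Share of observations that lie inside their prediction interval."""
    inside = sum(lo <= v <= hi for v, lo, hi in zip(y, lower, upper))
    return inside / len(y)

# Made-up test responses with predicted 5% and 95% quantiles.
y_test = [9.1, 12.4, 10.3, 3.0, 10.9]
lower = [8.0, 7.5, 8.4, 6.0, 8.8]
upper = [12.0, 12.9, 11.6, 14.0, 11.2]

coverage = empirical_coverage(y_test, lower, upper)  # 4 of 5 inside
```

For a well-calibrated model evaluated on a sufficiently large test set, this share should be close to the nominal level of the interval.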

Figure 3: Mean Absolute Shapley Value of V(Y |X = x).

The partial dependence plot of V(Y |X = x) shown in Figure 4 indicates that CatBoostLSS also correctly identifies the heteroskedasticity in the data.

Figure 4: Smoothed Partial Dependence Plot of V(Y |X = x).

5.2 Munich Rent

Considering the active discussion around imposing a rent freeze in German cities, we have chosen to re-visit the famous Munich Rent data set commonly used in the GAMLSS literature, as Munich is among the most expensive cities in Germany when it comes to living costs. In this example, we illustrate the functionality of CatBoostLSS using a sample of 2,053 apartments from the data collected for the preparation of the Munich rent index 2003, as shown in Figure 5. As our dependent variable, we select Net rent per square meter in EUR.

Figure 5: Munich Rents per square meter per district.

The first decision one has to make is about choosing an appropriate distribution for the response. As there are many potential candidates, we use an automated approach based on the generalised Akaike information criterion (GAIC).
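
The generalised Akaike information criterion penalises the maximised log-likelihood by a multiple of the number of parameters, so that candidates with different numbers of parameters can be ranked. The sketch below compares two candidates only; the data, the candidate set and the function names are hypothetical, and a penalty of 2 (which reduces the GAIC to the usual AIC) is assumed.

```python
import math

def gaic(log_lik, n_params, penalty=2.0):
    """Generalised AIC: -2 * log-likelihood + penalty * number of parameters."""
    return -2.0 * log_lik + penalty * n_params

def normal_loglik(y):
    """Maximised Normal log-likelihood (mean and variance at their MLEs)."""
    n = len(y)
    mu = sum(y) / n
    var = sum((v - mu) ** 2 for v in y) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def lognormal_loglik(y):
    """Maximised log-normal log-likelihood via the change of variables log(y)."""
    logs = [math.log(v) for v in y]
    return normal_loglik(logs) - sum(logs)

# Made-up positive responses standing in for rents per square meter.
rents = [6.2, 7.9, 8.4, 9.1, 10.5, 11.3, 12.8, 7.1, 9.7, 8.8]

# (log-likelihood, number of distributional parameters) per candidate.
candidates = {
    "Normal": (normal_loglik(rents), 2),
    "Log-Normal": (lognormal_loglik(rents), 2),
}
# Candidates sorted by GAIC, best (smallest) first.
ranking = sorted((gaic(ll, k), name) for name, (ll, k) in candidates.items())
```

With equal parameter counts the ranking reduces to a comparison of log-likelihoods; the penalty becomes decisive when a four-parameter candidate competes against a two-parameter one.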

Table 1: Candidate Response Distributions

Even though Table 1 suggests that the Generalized Beta Type 2 provides the best approximation to the data, we use the more parsimonious Normal distribution, as it has only two distributional parameters, compared to four for the Generalized Beta Type 2. In general, though, CatBoostLSS is flexible enough to allow the user to choose from a wide range of continuous, discrete and mixed discrete-continuous distributions. The good fit of the Normal distribution is also confirmed by the density plot in Figure 6, where the response of the train data is presented as a histogram, while the fitted Normal is shown in red.

Figure 6: Fitted Normal Distribution.

Now that we have specified the distribution, we fit our CatBoostLSS model to the data. The estimated effects presented in Figure 7 indicate that newer flats are on average more expensive, with the variance first decreasing and then increasing again for flats built around 1980 and later. Also, as expected, rents per square meter decrease with increasing size of the apartment.

Figure 7: Estimated Partial Effects.

The diagnostics for CatBoostLSS are based on quantile residuals of the fitted model and are shown in Figure 8.
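
Quantile residuals are obtained by passing each response through its predicted cumulative distribution function and then through the standard Normal quantile function; if the model is adequate, they behave like a standard Normal sample. A minimal sketch for the Normal case, with a hypothetical function name and made-up values:

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def quantile_residual(y, mu, sigma):
    """Standard Normal quantile of the predicted CDF evaluated at y."""
    u = NormalDist(mu, sigma).cdf(y)  # probability integral transform
    return STD_NORMAL.inv_cdf(u)

# Made-up responses with their predicted location and scale.
observations = [(12.0, 10.0, 2.0), (7.5, 9.0, 1.5), (10.0, 10.0, 3.0)]
residuals = [quantile_residual(y, m, s) for y, m, s in observations]
```

For a Normal response the quantile residual coincides with the standardised residual (y - mu) / sigma; the two-stage construction matters for non-Normal families, where the same recipe still yields residuals on the standard Normal scale.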

Figure 8: Quantile Residuals.

The quantile residuals indicate that CatBoostLSS provides a well-calibrated forecast and confirm that our model is a good approximation to the data. CatBoostLSS also allows us to investigate feature importance for all distributional parameters. The top 10 features with the highest Shapley values for both the conditional mean and variance in Figure 9 indicate that yearc and area are the most important variables.

Figure 9: Mean Absolute Shapley Value of E(Y |X = x) and V(Y |X = x).

To get a more detailed overview of which features are most important for our model, we can also plot the SHAP values of every feature for every sample. The plot below sorts features by the sum of SHAP value magnitudes over all samples and uses SHAP values to show the distribution of the impacts each feature has on the model output. The colour represents the feature value (red high, blue low). This reveals for example that newer flats increase rents on average.
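
The ordering used in such a plot can be reproduced from a matrix of SHAP values with one row per sample and one column per feature. In the sketch below the SHAP values are made up, the feature `district` and the function name are hypothetical, and no SHAP library is called.

```python
def mean_abs_importance(shap_values, feature_names):
    """Rank features by the mean absolute SHAP value over all samples."""
    n = len(shap_values)
    scores = {
        name: sum(abs(row[j]) for row in shap_values) / n
        for j, name in enumerate(feature_names)
    }
    # Most important feature first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

feature_names = ["area", "yearc", "district"]
# Made-up SHAP values: rows are samples, columns follow feature_names.
shap_values = [
    [-1.2, 0.8, 0.1],
    [0.9, -0.4, -0.2],
    [-0.3, 1.0, 0.0],
]
ranking = mean_abs_importance(shap_values, feature_names)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so a feature that pushes some predictions up and others down still ranks as important.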

Figure 10: Shapley Values of E(Y |X = x).

We can also visualize all predictions and assess the attribute importance.

Figure 11: Shapley Values of E(Y |X = x).

Besides the global attribute importance, the user might also be interested in the local attribute importance of each individual prediction. This allows us to answer questions like ‘How did the feature values of a single data point affect its prediction?’ For illustration purposes, we select the first predicted rent of the test data set and present the local attribute importance for E(Y |X = x).

Figure 12: Local Shapley Value of E(Y |X = x).

As we have modelled all parameters of the Normal distribution, CatBoostLSS provides a probabilistic forecast, from which any quantity of interest can be derived. Figure 13 shows a random subset of 50 predictions only for ease of readability. The red dots show the actual out of sample rents, while the boxplots visualise the predicted distributions.
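
A boxplot of a probabilistic forecast can be built from random draws of the predicted distribution. The following minimal sketch draws from a Normal with hypothetical predicted parameters and reduces the draws to a five-number summary; the function names are illustrative.

```python
import random
from statistics import quantiles

def sample_forecast(mu, sigma, n_draws=1000, seed=0):
    """Draw random samples from a predicted Normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n_draws)]

def five_number_summary(draws):
    """Minimum, quartiles and maximum, i.e. the ingredients of a boxplot."""
    q1, q2, q3 = quantiles(draws, n=4)
    return min(draws), q1, q2, q3, max(draws)

# Hypothetical predicted location and scale for one apartment.
draws = sample_forecast(mu=9.5, sigma=2.0)
summary = five_number_summary(draws)
```

Sampling is not strictly necessary for the Normal, whose quantiles are available in closed form, but the same recipe carries over to any quantity of interest that lacks an analytical expression.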

Figure 13: Boxplots of Probabilistic Forecasts of Munich Rents.

We can also plot a subset of the forecasted densities and cumulative distributions as shown in Figure 14.

Figure 14: Density and Cumulative Distribution Plots of Probabilistic Forecasts of Munich Rents.

5.2.1 Comparison to other approaches

To evaluate the prediction accuracy of CatBoostLSS, we compare the forecasts of the Munich rent example to the implementations available in , as well as to the Bayesian formulation of GAMLSS implemented in bamlss by Umlauf et al. (2017), to the Distributional Regression Forests of Schlosser et al. (2018) and Schlosser and Zeileis (2019) implemented in distforest, and to ngboost as introduced by Duan et al. (2019). For all approaches that can handle categorical information directly, we use factor coding. We evaluate distributional forecasts in Table 2 using the average Continuous Ranked Probability Scoring Rules (CRPS) and the average Logarithmic Score (LOG), where lower scores indicate a better forecast, along with additional error measures evaluating the mean-prediction accuracy of the models.
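
Both scores have simple expressions when the predictive distribution is Normal. The sketch below uses the closed-form CRPS of a Normal forecast (see Gneiting and Raftery, 2007) and the negative log density as the logarithmic score; the forecasts are made up and the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def crps_normal(y, mu, sigma):
    """Closed-form CRPS of a Normal(mu, sigma) forecast at observation y."""
    z = (y - mu) / sigma
    return sigma * (
        z * (2.0 * STD_NORMAL.cdf(z) - 1.0)
        + 2.0 * STD_NORMAL.pdf(z)
        - 1.0 / math.sqrt(math.pi)
    )

def log_score_normal(y, mu, sigma):
    """Logarithmic score: negative log predictive density at y."""
    return -math.log(NormalDist(mu, sigma).pdf(y))

# Made-up (observation, predicted location, predicted scale) triples.
forecasts = [(9.0, 9.5, 2.0), (12.0, 10.0, 1.5), (7.0, 8.0, 2.5)]
avg_crps = sum(crps_normal(*f) for f in forecasts) / len(forecasts)
avg_log = sum(log_score_normal(*f) for f in forecasts) / len(forecasts)
```

Both scores reward sharp forecasts only when they are also well centred: for an observation equal to the predicted mean, the CRPS grows proportionally with the predicted standard deviation.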

Table 2: Forecast Comparison

Average Continuous Ranked Probability Scoring Rules (CRPS); Average Logarithmic Score (LOG); Mean Absolute Percentage Error (MAPE); Mean Square Error (MSE); Root Mean Square Error (RMSE); Mean Absolute Error (MAE); Median Absolute Error (MEDIAN-AE); Relative Absolute Error (RAE); Root Mean Square Percentage Error (RMSPE); Root Mean Squared Logarithmic Error (RMSLE); Root Relative Squared Error (RRSE); R-Squared/Coefficient of Determination (R2). Best out-of-sample results are marked in bold (lower is better, except R2).

A comparison with its hyper-parameter tuned competitors shows that, across all measures, CatBoostLSS provides a competitive forecast using default hyper-parameter settings.

5.2.2 Expectile Regression

While GAMLSS require the specification of a parametric distribution for the response, it may also be useful to drop this assumption completely and to use models that describe parts of the distribution other than the mean. This may in particular be the case in situations where interest does not lie in identifying covariate effects on a specific parameter of the response distribution, but rather in the relation between covariates and extreme observations in the tails of the distribution. This is feasible using Quantile and Expectile Regression. As with mean regression models, where the conditional mean is modelled as a function of covariates, both Quantile and Expectile Regression relate any specific quantile/expectile of the response to a set of covariates. Consequently, any desired point of the response distribution can be modelled, so that a dense grid of regressions yields a detailed description of the conditional distribution. Therefore, estimating and comparing parameter estimates across a set of different quantiles/expectiles allows for fully characterising the response distribution and for investigating the differential effect that covariates may have on different points of the conditional distribution. For our Munich rent analysis, Quantile/Expectile Regression yields additional insight compared to mean regression models, as it provides a richer description of the relationship between the rent of a flat and its attributes for different quantile/expectile levels.

As CatBoostLSS requires both the Gradient and Hessian to be non-zero, we illustrate the ability of CatBoostLSS to model and provide inference for different parts of the response distribution using Expectile Regression. Plotting the effects across different expectiles shows how the estimated effects, as well as their strengths, vary across the response distribution.
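
The requirement on the Hessian explains the choice of expectiles: the quantile (pinball) loss is piecewise linear, so its second derivative is zero almost everywhere, whereas the expectile loss is an asymmetrically weighted squared error with a strictly positive second derivative. The sketch below states the derivatives of that loss (to be minimised) and uses them in a Newton iteration for a sample expectile; the function names and data are hypothetical.

```python
def expectile_ders(y, pred, tau):
    """Gradient and Hessian of the expectile loss w * (y - pred)**2,
    where w = tau if y > pred and w = 1 - tau otherwise."""
    w = tau if y > pred else 1.0 - tau
    grad = -2.0 * w * (y - pred)
    hess = 2.0 * w  # strictly positive for 0 < tau < 1
    return grad, hess

def sample_expectile(ys, tau, n_iter=50):
    """Newton iterations for the tau-expectile of a sample."""
    e = sum(ys) / len(ys)  # start from the mean
    for _ in range(n_iter):
        ders = [expectile_ders(y, e, tau) for y in ys]
        g = sum(d[0] for d in ders)
        h = sum(d[1] for d in ders)
        e -= g / h  # Newton step; h > 0 so the step is well defined
    return e

ys = [1.0, 2.0, 3.0, 10.0]
low, mid, high = (sample_expectile(ys, t) for t in (0.1, 0.5, 0.9))
```

The 0.5-expectile coincides with the sample mean, which makes expectiles a natural generalisation of mean regression, while levels closer to 0 or 1 move towards the tails of the distribution.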

Figure 15: Estimated Partial Effects across different Expectiles.

Investigating the feature importances across different expectiles allows us to infer the most important covariates for each point of the response distribution, so that, e.g., effects that are more important for expensive rents can be compared to those for affordable rents.

Figure 16: Mean Absolute Shapley Value across different Expectiles.

6 Software implementation

In its current implementation, CatBoostLSS is available in Python and will be made publicly available on the project's Git repository, StatMixedML/CatBoostLSS.

7 Conclusion

There is indeed more to life than mean and variance. A good point at which to start is by replacing them by location and scale and noting that one reason for the stress on mean and variance is the implicit assumption of Gaussianity. Once the assumption of Gaussianity is dropped, attention shifts to estimating [all of] the parameter in a distribution. (Harvey, 2013)

The language of statistics is of a probabilistic nature. Any model that falls short of quantifying the uncertainty attached to its outcome is likely to yield an incomplete and potentially misleading picture. Quantification of uncertainty in general, and probabilistic forecasting in particular, does not just provide an average point forecast, but rather equips the user with a range of outcomes and the probability of each of those occurring. In an effort to bring statistics and machine learning closer together, this paper extends CatBoost to a full probabilistic forecasting framework termed CatBoostLSS. By exploiting its Newton boosting nature and the close connection between empirical risk minimization and Maximum Likelihood estimation, our approach models and predicts the entire conditional distribution, from which prediction intervals and quantiles of interest can be derived. As such, CatBoostLSS provides a comprehensive description of the response distribution, given a set of covariates. By means of a simulation study and real-world examples, we have shown that models designed mainly for prediction can also be used to describe and explain the underlying data generating process of the response of interest.

References

Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231.

Chen, T. and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System.

Dorogush, A. V., Ershov, V., and Gulin, A. (2018). CatBoost: gradient boosting with categorical features support. arXiv preprint, pages 1–7.

Duan, T., Avati, A., Ding, D. Y., Basu, S., Ng, A. Y., and Schuler, A. (2019). NGBoost: Natural Gradient Boosting for Probabilistic Prediction. arXiv preprint, pages 1–10.

Fahrmeir, L. and Kneib, T. (2011). Bayesian smoothing and regression for longitudinal, spatial and event history data, volume 36 of Oxford statistical science series. Oxford University Press, Oxford and New York.

Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression: Models, methods and applications. Springer, Berlin, 1 edition.

Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.

Harvey, A. (2013). Discussion of 'Beyond mean regression'. Statistical Modelling, 13(4):363.

Hothorn, T. (2018). Top-down transformation choice. Statistical Modelling, 18(3-4):274–298.

Hothorn, T., Kneib, T., and Bühlmann, P. (2014). Conditional transformation models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):3–27.

Hothorn, T. and Zeileis, A. (2018). Transformation Forests.

Hutter, M. (2008). Introduction to Statistical Machine Learning. Machine Learning Summer School - MLSS-2008, 2 – 15 March, Kioloa.

Januschowski, T., Gasthaus, J., Wang, Y., Salinas, D., Flunkert, V., Bohlke-Schneider, M., and Callot, L. (2020). Criteria for classifying forecasting methods. International Journal of Forecasting, 36(1):167–177.

Klein, N., Kneib, T., and Lang, S. (2015c). Bayesian Generalized Additive Models for Location, Scale, and Shape for Zero-Inflated and Overdispersed Count Data. Journal of the American Statistical Association, 110(509):405–419.

Klein, N., Kneib, T., Lang, S., and Sohn, A. (2015b). Bayesian structured additive distributional regression with an application to regional income inequality in Germany. The Annals of Applied Statistics, 9(2):1024–1052.

März, A. (2019). XGBoostLSS: An extension of XGBoost to probabilistic forecasting. arXiv preprint, pages 1–23.

Mayr, A., Fenske, N., Hofner, B., Kneib, T., and Schmid, M. (2012). Generalized additive models for location, scale and shape for high-dimensional data - a flexible approach based on boosting. Journal of the Royal Statistical Society, Series C - Applied Statistics, 61(3):403–427.

Milly, P. C. D., Betancourt, J., Falkenmark, M., Hirsch, R. M., Kundzewicz, Z. W., Lettenmaier, D. P., Stouffer, R. J., Dettinger, M. D., and Krysanova, V. (2015). On Critiques of “Stationarity is Dead: Whither Water Management?”. Water Resources Research, 51(9):7785–7789.

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2017). CatBoost: unbiased boosting with categorical features. arXiv preprint, pages 1–23.

Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2009). Dataset shift in machine learning. Neural information processing series. MIT Press, Cambridge, Mass.

Rigby, R. A. and Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):507–554.

Schlosser, L., Hothorn, T., Stauffer, R., and Zeileis, A. (2018). Distributional Regression Forests for Probabilistic Precipitation Forecasting in Complex Terrain.

Schlosser, L. and Zeileis, A. (2019). disttree: Trees and Forests for Distributional Regression.

Serinaldi, F. and Kilsby, C. G. (2015). Stationarity is undead: Uncertainty dominates the distribution of extremes. Advances in Water Resources, 77:17–36.

Sigrist, F. (2019). Gradient and Newton Boosting for Classification and Regression. arXiv preprint, pages 1–42.

Sobotka, F. and Kneib, T. (2012). Geoadditive expectile regression. Computational Statistics & Data Analysis, 56(4):755–767.

Stasinopoulos, M. D., Rigby, R. A., Heller, G. Z., Voudouris, V., and de Bastiani, F. (2017). Flexible Regression and Smoothing: Using GAMLSS in R. Chapman & Hall / CRC The R Series. CRC Press, London.

Thomas, J., Mayr, A., Bischl, B., Schmid, M., Smith, A., and Hofner, B. (2018). Gradient boosting for distributional regression - faster tuning and improved variable selection via noncyclical updates. Statistics and Computing, 28(3):673–687.

Umlauf, N., Klein, N., and Zeileis, A. (2017). BAMLSS: Bayesian Additive Models for Location, Scale and Shape (and Beyond). Journal of Computational and Graphical Statistics.

Villarini, G., Smith, J. A., Serinaldi, F., Bales, J., Bates, P. D., and Krajewski, W. F. (2009). Flood frequency analysis for nonstationary annual peak records in an urban drainage basin. Advances in Water Resources, 32(8):1255–1266.

Waltrup, L. S., Sobotka, F., Kneib, T., and Kauermann, G. (2015). Expectile and quantile regression—David and Goliath? Statistical Modelling, 15(5):433–456.

Zeileis, A., the R community. Contributions (fortunes and/or code) by Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright, Martin Maechler, Kjetil Brinchmann Halvorsen, Kurt Hornik, Duncan Murdoch, Andy Bunn, Ray Brownrigg, Roger Bivand, Spencer Graves, Jim Lemon, Christian Kleiber, David L. Reiner, Berton Gunter, Roger Koenker, Charles Berry, Marc Schwartz, Michael Dewey, Ben Bolker, Peter Dunn, Sarah Goslee, Simon Blomberg, Bill Venables, Roland Rau, Thomas Petzoldt, Rolf Turner, Mark Leeds, Emmanuel Charpentier, Chris Evans, Paolo Sonego, Peter Ehlers, Detlef Steuer, Tal Galili, Greg Snow, Brian D. Ripley, Michael Sumner, David Winsemius, Liviu Andronic, Brian Diggs, Matthieu Stigler, Michael Friendly, Dirk Eddelbuettel, Richard M. Heiberger, Patrick Burns, Dieter Menne, Andrie de Vries, Barry Rowlingson, Renaud Lancelot, R. Michael Weylandt, Jon Olav Skoien, Francois Morneau, Antony Unwin, Joshua Wiley, Terry Therneau, Bryan Hanson, Henrik Singmann, Eduard Szoecs, Gregor Passolt, and John C. Nash. (2016). fortunes: R Fortunes: Brian D. Ripley (about the difference between machine learning and statistics) useR! 2004, Vienna (May 2004).

Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2020). Dive into Deep Learning.

Appendix A Munich Rent Quantile Loss Comparison

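For a given quantile $\rho \in (0,1)$, a target value $y_i$, and a $\rho$-quantile prediction $\hat{y}_i(\rho)$, the $\rho$-quantile loss is defined as $\mathrm{QL}_\rho[y_i, \hat{y}_i(\rho)] = 2\left[\rho\,(y_i - \hat{y}_i(\rho))\,\mathbb{I}_{y_i - \hat{y}_i(\rho) > 0} + (1-\rho)\,(\hat{y}_i(\rho) - y_i)\,\mathbb{I}_{y_i - \hat{y}_i(\rho) \le 0}\right]$. We use a normalized sum of quantile losses, $\sum_i \mathrm{QL}_\rho[y_i, \hat{y}_i(\rho)] / \sum_i |y_i|$. Best out-of-sample results are marked in bold (lower is better).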
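The normalized quantile loss can be sketched as follows. This is a minimal illustration that assumes the common pinball form scaled by a factor of 2 and a normalization by the sum of absolute target values; the function names and the data are hypothetical.

```python
def quantile_loss(y, y_hat, rho):
    """Quantile (pinball) loss for one observation, scaled by 2 (assumed form)."""
    diff = y - y_hat
    if diff > 0:
        # Under-prediction is weighted by rho.
        return 2.0 * rho * diff
    # Over-prediction (or an exact hit) is weighted by 1 - rho.
    return 2.0 * (1.0 - rho) * (-diff)


def normalized_quantile_loss(y_true, y_pred, rho):
    """Sum of quantile losses divided by the sum of absolute targets."""
    total = sum(quantile_loss(y, q, rho) for y, q in zip(y_true, y_pred))
    return total / sum(abs(y) for y in y_true)


# Hypothetical targets and rho-quantile predictions.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 30.0]

loss_50 = normalized_quantile_loss(y_true, y_pred, rho=0.5)
loss_90 = normalized_quantile_loss(y_true, y_pred, rho=0.9)
print(loss_50, loss_90)
```

For rho = 0.5 the loss reduces to the absolute error normalized by the sum of absolute targets, while for other quantiles under- and over-prediction are penalized asymmetrically.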

Table A2: Munich Rent Quantile Loss Comparison - Average Rank
