Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data

2019·Arxiv

Abstract

In this paper we develop a data-driven smoothing technique for high-dimensional and non-linear panel data models. We allow for individual-specific (non-linear) functions, which are estimated with econometric or machine learning methods using weighted observations from other individuals. The weights are determined in a data-driven way: they depend on the similarity between the corresponding functions, measured on the basis of initial estimates. The key feature of the procedure is that it clusters individuals based on the distance / similarity between them, estimated in a first stage. Our estimation method can be combined with various statistical estimation procedures, in particular modern machine learning methods, which are especially fruitful in the high-dimensional case and with complex, heterogeneous data. The approach can be interpreted as “soft clustering”, in contrast to traditional “hard clustering”, which assigns each individual to exactly one group. We conduct a simulation study which shows that prediction can be greatly improved by using our estimator. Finally, we analyze a large data set from didichuxing.com, a leading company in the transportation industry, to predict the gap between supply and demand based on a large set of covariates. Our estimator clearly performs much better in out-of-sample prediction than existing linear panel data estimators.

Key words: Nonlinear Panel Data, Discrete Smoothing, Clustering, Incidental Parameter Problem, Machine Learning, Nonparametric Statistics.

1. Introduction

1.1. Motivation. Panel or longitudinal data are a very important tool for empirical research in economics, social sciences, biostatistics and many other fields. Most of the literature on non- and semi-parametric panel models is based on the assumption that the regression function is the same across individuals. This assumption, however, can be unrealistic in many applications, in particular when the number of observed individuals is large or there is unobserved heterogeneity across individuals. In modern internet data with hundreds of thousands of users and a relatively short longitudinal time frame, it is important to study complicated and non-linear heterogeneity among different individuals.

A classical example is the prediction of demand and supply. More specifically, we will consider data from didichuxing.com (the Chinese counterpart of Uber in the U.S.) and estimate gaps between real-time supply and demand in a metropolitan area in China that is partitioned into 66 districts, with the 24-hour day divided into 144 intervals of 10 minutes each. For companies like didichuxing.com it is crucial to develop a good model for estimating supply and demand gaps in order to provide better dispatching and services and to develop a dynamic pricing strategy.

Obviously, the model for predicting the demand and supply of taxi service at a given location and a specified time spot should not be uniform over all time-location combinations. Even at the same time spot, e.g., 8:00pm - 8:10pm, the supply, the demand, and the gap between them, as well as traffic conditions, vary dramatically across different locations in the city. Also, for the same location, the supply and demand are quite different in different time spots.

For modeling this kind of data set, panel data models are widely used both in the literature and in many empirical applications. However, due to significant heterogeneity across time and locations, the linear panel data model works poorly in predicting supply and demand, both in and out of sample, when the number of observations per unit is relatively small. From a theoretical point of view, nonparametric panel data models are still a challenge, as usually only a limited number of observations per individual is available, which makes nonparametric estimation of the individual regression functions quite imprecise or even impossible. Despite these challenges, one can often observe a certain grouping structure among the models for different individuals in real data sets. For example, it is possible that district one at 8am shares the same or a similar model for the gap between supply and demand with district two at 6pm. On the one hand, such a grouping structure among individual models can be used to improve the prediction for each model by borrowing observations from other similar models. On the other hand, this latent group structure among models is unknown and can hardly be figured out using common knowledge.

In this paper we propose a new adaptive discrete-smoothing (ADS) method which is designed to utilize the latent grouping structure among a large number of nonparametric panel data models. More precisely, the proposed ADS method simultaneously “clusters” the cross-sectional individuals and conducts model estimation.

In a non-parametric panel data model with the fixed effect setting, we assume that there are individual fixed effects which determine the nonparametric regression function of each individual. As we explain later in the model specification part, these individual effects can be interpreted as the indices of a class of functions that determine the shape of the individual regression functions. To characterize the grouping structure among regression functions, we make an essential continuity/smoothness assumption: individuals with close values of the individual effects have similar regression functions. This assumption enables us to effectively address the curse of dimensionality (i.e., the insufficiency of samples for each individual model) in estimating nonparametric panel data models. The main idea behind the construction of our estimator is to use all the observations for estimating each individual model, including the observations from other individuals, which receive a lower weight in estimation. More specifically, we develop a general two-stage estimator, where the first stage estimates the weights that characterize the “similarity” of the underlying regression functions. Given the estimated weight matrix between pairs of individuals, the second stage estimates the regression function for each individual using all the weighted observations.

Our idea is similar in spirit to the “discrete smoothing” idea of Racine and Li (2004). In particular, Racine and Li (2004) studied nonparametric regression with discrete and continuous variables and used fixed weights for observations falling in other “cells”, where each cell corresponds to a combination of categorical variables (“discrete smoothing”). They showed that the mean squared error can be effectively reduced by using the weighted observations across cells. Instead of using fixed weights, our adaptive smoothing approach over discrete cells depends on how similar the corresponding underlying regression functions are. To accomplish this, we introduce data-driven weights that measure the similarity (distance) of regression functions across different individuals. Using observations from other individuals introduces bias on the one hand, but on the other hand the additional observations can reduce variance. By constructing data-driven weights, our method leads to a better bias-variance trade-off and hence faster convergence rates in theory.

Generally speaking, the proposed methodology is applicable to almost any statistical procedure which allows weighted estimation, including kernel estimation, series / sieves, and modern machine learning methods like Lasso, Boosting and many other methods. As the weights employed for the smoothing of the regression functions are data-driven, we call our method “adaptive discrete smoothing” (ADS) in the rest of the paper. Our ADS method applies to both discrete and continuous indices without requiring any prior knowledge of them. As the rise of digitization in many fields leads to complex, high-dimensional data sets, we focus in this paper on Lasso and on panel data models. The methods can also be used for non-parametric regression with both continuous and discrete variables, as considered in Racine and Li (2004). Adaptive discrete smoothing for non-parametric regression is considered in an accompanying paper (Xi Chen (2020)). First results on the theoretical properties of our estimator are available upon request for some estimation methods.

Outline. In Section 2 we give a heuristic motivation for our procedure. In Section 3 the ADS procedure is formally introduced in a general setting and specified for relevant problems. Section 4 presents a set of simulation studies to demonstrate the power of our methods compared to alternatives. In Section 5, we present an empirical application: a study of didichuxing.com’s panel data for predicting the gap between supply and demand. Finally, we conclude.

1.2. Related Literature. Since the use of panel data is popular in many disciplines such as social, medical, and financial studies, the panel data model has always been an active research area (see, e.g., Baltagi and Raj (1992); Wooldridge (2010); Hsiao (2014)). Due to the large volume of work in this area, we only provide a brief survey of closely related works.

Our paper is related to different lines of recent research on panel data models. The first line is the high-dimensional (or diverging dimensional) panel data model (see, e.g., Kock (2013, 2016); Belloni et al. (2014); Zhu (2017)). In addition to parameter estimation and inference, Li et al. (2016) and Qian and Su (2016) further study the estimation of common structural breaks in linear panel data models in the diverging dimensional case.

Moreover, our work also extends non-parametric panel methods and recent approaches employing clustering for the estimation of panel data models. Nonlinear panel data models have been a field of active research; a comprehensive survey is given in Arellano and Bonhomme (2011). Most of the literature on non- and semi-parametric panel models is based on the assumption that the regression function is the same across individuals; see Henderson et al. (2008), Mammen et al. (2009) and Qian and Schmidt (2003) among many others. As argued above, this assumption might not be realistic in many applications. To relax it, recent research has focused on assuming a group structure and employing cluster methods for panel data models. This means that every individual belongs to a group, and the group assignment is unknown and has to be estimated. In such a setting, the number of groups usually needs to be finite and pre-defined. Vogt and Linton (2017) consider a nonparametric regression model in which the observed individuals can be grouped into a number of classes whose members all share the same regression function. This is a special case of the model in Example 1 with discrete indices (see Section 1). They develop a statistical procedure to estimate the unknown group structure and then estimate each regression function by averaging the individual functional estimates within each group. Bonhomme and Manresa (2015) introduce time-varying grouped patterns of heterogeneity in linear panel data models. The parameters are estimated using a grouped fixed effects estimator that minimizes a least squares criterion with respect to all possible groupings of the cross-sectional units based on K-means clustering. Bonhomme et al. (2017) develop two-step and iterative panel data estimators based on a discretization of unobserved heterogeneity employing clustering. Su et al. (2016) provide a novel mechanism for identifying and estimating latent group structures in panel data using penalized techniques. They consider both linear and nonlinear models where the regression coefficients are heterogeneous across groups but homogeneous within a group and the group membership is unknown, and they propose a novel penalty term called classifier-Lasso (C-Lasso). More specifically, assuming that the true regression coefficients $\beta_i$ of the N individuals take K distinct vector values, the C-Lasso imposes the non-convex penalty $\frac{\lambda}{N}\sum_{i=1}^{N}\prod_{k=1}^{K}\|\beta_i - \alpha_k\|$, where both the $\beta_i$ and the $\alpha_k$ are decision variables in the optimization. This mixed additive-multiplicative form of the penalty shrinks each individual coefficient vector $\beta_i$ towards an unknown group-level coefficient vector $\alpha_k$.

Despite the popularity of investigating the grouping effect, there are several limitations in the existing approaches. First, the computation of the estimators is usually quite demanding. For example, some estimators rely on the k-means algorithm to learn the group structure. However, finding the optimal solution to the k-means clustering problem is known to be NP-hard even for two clusters (Aloise et al., 2009). Other estimators either depend on an exhaustive search over group structures or involve solving non-convex optimization problems to identify the group structure. Second, estimators with an explicit clustering step usually require that the number of clusters be known a priori. Approaches have been developed to determine the unknown number of groups, but this is again computationally demanding and might need additional assumptions. Third, it is very plausible that in many applications there are not only group effects but also individual effects. There might not be a hard group structure among different individuals. For example, some individuals might have similar coefficients but not share exactly the same model. Finally, except for Vogt and Linton (2017), most existing research focuses on linear or parametric panel data models and does not deal with general nonparametric regression functions.

To address these challenges, we propose a unified approach for a wide range of panel data models that include both parametric and non-parametric models, continuous and discrete variables, and fixed-dimensional and high-dimensional settings. Instead of enforcing a “hard clustering” that assigns each individual to exactly one group, our approach can be interpreted as a “soft clustering” with weights determined by the similarities between individuals. This soft clustering approach is not only computationally attractive compared to “hard clustering” methods but also avoids fixing the number of groups in advance.

1.3. Important Examples in Econometrics.

Example 1 (Non-linear Panel data). We assume that

$$y_{it} = f(x_{it}, \alpha_i) + \varepsilon_{it}, \qquad i = 1, \ldots, N, \quad t = 1, \ldots, T,$$

where $\varepsilon_{it}$ denotes the error term, which is contemporaneously uncorrelated with the regressors $x_{it}$.

The dependent variable $y_{it}$ depends nonparametrically on the regressors $x_{it}$ and an unobserved “fixed effect” $\alpha_i$. So each individual i might be subject to a different, nonlinear function $f(\cdot, \alpha_i)$ which is unknown and has to be estimated. Different values of $\alpha_i$ lead to different functions $f(\cdot, \alpha_i)$. The effects $\alpha_i$ are not identified, but the functions $f(\cdot, \alpha_i)$ represent the unobserved heterogeneity. One way to interpret this is that the effects simply serve as an index for the nonparametric functions, and different $\alpha_i$ lead to different regression functions. Without loss of generality we assume that $\alpha_i \in [0, 1]$ and define the family of functions indexed by $\alpha$, formally $\mathcal{F} = \{f(\cdot, \alpha) : \alpha \in [0, 1]\}$. For each individual i = 1, 2, ..., N, the fixed effect $\alpha_i$ leads to a response function $f(\cdot, \alpha_i)$. When the $\alpha_i$ take finitely many discrete values, this general model includes the models considered in Vogt and Linton (2017) as special cases. The core idea of our estimator relies on a continuity assumption: individuals with similar values of $\alpha_i$ also have similar regression functions. This assumption will be stated more rigorously in the next sections. In the following comment we will also argue that non-parametric functions allow a linear high-dimensional approximation which is often sparse. As our interest is mainly in modern high-dimensional data sets, we will mainly focus on the high-dimensional setting, keeping in mind that non-linear models might be represented as a high-dimensional problem.

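Comment 1. Approximate Sparse Models. We start with a nonlinear relationship of the form

$$y_i = f(z_i) + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $y_i$ is the outcome variable, $z_i$ is a vector of elementary regressors, $f(z_i)$ is the regression function, and $\varepsilon_i$ are disturbances. Let $x_i = P(z_i)$ be a vector of dimension $p$ that contains a dictionary of possibly technical transformations of $z_i$, including a constant. The values $x_1, \ldots, x_n$ are treated as fixed and normalized. The regression function $f(z_i)$ admits an approximately sparse form if there exists a coefficient vector $\beta_0$ with at most $s$ non-zero components such that

$$f(z_i) = x_i'\beta_0 + r_i, \qquad \Big(\frac{1}{n}\sum_{i=1}^{n} r_i^2\Big)^{1/2} \le C\sqrt{s/n},$$

where $C$ is a constant independent of n.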

The methodology we introduce can be applied directly to nonparametric panel models, e.g. by employing kernel or sieve methods. In the rest of the paper, however, we will focus on linear and high-dimensional panel data models. First, modern data sets are often high-dimensional and therefore of particular interest. Second, non-parametric models can often be represented as approximate sparse models as defined in Comment 1 above.

Example 2 (Linear panel data with heterogeneous coefficients). We assume that

$$y_{it} = x_{it}'\beta(\alpha_i) + \varepsilon_{it},$$

where $\beta(\cdot)$ is a link function that takes the individual effect $\alpha_i$ as an input. That is, $\beta(\alpha_i)$ is a p-dimensional vector indexed by $\alpha_i$, which in the high-dimensional case is assumed to be sparse.

When p is fixed, we can simply estimate the model by linear regression, which means running N linear regressions, one for each individual. When p is high-dimensional, i.e., p >> T, we can estimate such a model with machine learning techniques such as Lasso.

2. Motivation and Heuristic Derivation

To illustrate the core of our idea, we consider the estimation of (nonparametric) regression functions for panel / longitudinal data. For illustration, Figure 1 shows individual-specific functions for four individuals. A naive approach would be to estimate a non-parametric function for each individual separately. But if the number of observations per individual, T, is small, the regression functions cannot be estimated precisely. In Figure 1, the regression functions for individuals 1 and 2 and for individuals 3 and 4 appear similar. Hence, for estimating the regression function of individual 1 it is reasonable to use (“borrow”) observations from individual 2, possibly with a lower weight. This might introduce bias but decreases variance, and can thus lead to a lower MSE. Our idea is to use data-dependent weights that reflect the similarity between the curves of all individuals. If the curves of two individuals, here for example 1 and 2, are similar, the weights should be close to one, as the information in the observations of individual 2 is valuable for estimating the function of individual 1. If the curves are very different, like the functions for 1 and 3, the weights should be small or zero. In the adaptive discrete smoothing procedure we propose, the weights are based on the similarity between the two curves. The similarity is measured by a distance measure / metric applied to initial estimates of the individual regression functions. The weight is then given by an expression in this distance that involves tuning constants. The weights for observations belonging to the individual itself are set to 1. The weights are hence determined in a data-driven / data-dependent way. Based on initial estimates for each individual, a weighting matrix is determined that measures the similarity of the functions. Finally, the individual functions are estimated using the weights. This procedure can be applied to all estimation methods which allow for weighted estimation. As only few observations but many potential covariates might be available for the initial estimates, we focus in this paper on Lasso estimation.
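
As a rough illustration of how pairwise distances between initial estimates can be turned into a weighting matrix, consider the following Python sketch. The weight rule used here (a constant `a` for pairs whose distance is at most a bandwidth `h`, zero otherwise, one on the diagonal) is a hypothetical stand-in chosen for illustration and should not be read as the weight expression of the paper; the names `similarity_weights`, `a` and `h` are likewise assumptions.

```python
import numpy as np

def similarity_weights(dist, a=0.5, h=1.0):
    """Map an (N, N) matrix of pairwise distances to smoothing weights.

    Assumed rule (illustrative only): weight `a` if the distance is at
    most `h`, weight 0 otherwise, and weight 1 on the diagonal.
    """
    dist = np.asarray(dist, dtype=float)
    W = np.where(dist <= h, a, 0.0)
    np.fill_diagonal(W, 1.0)  # an individual's own observations get full weight
    return W

# Toy distances: individuals 0 and 1 are close, individual 2 is far from both.
dist = np.array([[0.0, 0.2, 3.0],
                 [0.2, 0.0, 2.8],
                 [3.0, 2.8, 0.0]])
W = similarity_weights(dist, a=0.5, h=1.0)
```

In this toy case individuals 0 and 1 borrow from each other with weight 0.5, while individual 2 only uses its own observations.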

To make the idea more formal, we first introduce further notation. We consider panel data; in particular, we observe N individuals and for each individual we observe T periods. The dependent variable is denoted as $y_{it}$ and the predictors as $x_{it}$. The covariates $x_{it}$ are $p$-dimensional vectors.

For any i, we assume that $y_{it} = f(x_{it}, \alpha_i) + \varepsilon_{it}$, where the term $\alpha_i$ stands for the individual-specific effect in the non-parametric panel data model. In general, $\alpha_i$ serves as a low-dimensional index of the regression function. It can be multi-dimensional, but in the panel data setting it is usually considered a scalar.

One naive way of predicting the dependent variable is to build individual-level models, i.e. to estimate the regression function of each individual using only that individual's own observations.

However, the naive estimator suffers from several drawbacks. First of all, in most real applications of panel data we face the large-N, small-T problem, and estimating the functions without using other individuals' observations leaves only a sample of size T for estimating each individual function.

In the traditional linear panel data model, the slope coefficient vector is a common parameter across all individuals i, so the estimation process uses the samples of all individuals at the same time. In contrast, the random coefficient model is a model where the individual coefficients are randomly distributed and, without additional assumptions, can only be estimated from each individual's own longitudinal data.

Suppose there exist group structures among the individuals in the panel data. One extreme case is that the number of such groups is finite, which means that there are hidden group labels that we do not observe. If two individuals i and j carry the same label, we say that they belong to the same group, and they share the same regression function. The linear panel data model is a special case because the coefficients are common across all individuals i. A more general case is that the distribution of the individual effects is continuous, so that all individuals are different but might share some similarities. As N becomes larger and larger, for any individual i there must exist other individuals whose individual effects are close to that of individual i. Our idea is to assign proper weights to each other individual j, and such weights should reflect the similarity between individuals i and j.

3. General Discrete Smoothing Estimator for High-Dimensional and Non-Linear Panel Data Models

3.1. Generic ADS Algorithm for Panel Data. In this section we introduce a generic algorithm for adaptive discrete smoothing of panel data. We consider the model $y_{it} = f(x_{it}, \alpha_i) + \varepsilon_{it}$, where the $x_{it}$ are $p$-dimensional vectors of regressors and $f_i = f(\cdot, \alpha_i)$ denotes the regression function of individual i. As mentioned in the previous section, the algorithm consists of three steps: first, computing initial estimates $\hat f_i$; second, constructing the weighting matrix W; third, for all i = 1, . . . , N, weighted estimation of $f_i$, which yields the final estimates $\tilde f_i$. The algorithm is very flexible and can be combined with any estimation method that allows weighted estimation, including ordinary least squares, kernel regression, series regression, maximum likelihood estimation and modern machine learning methods like Lasso, Boosting and Neural Networks. The generic algorithm can be described as follows:

Algorithm 1 (Generic ADS-Algorithm).

(1) (First stage) Construction of initial estimates: use only the T observations of each individual to construct the individual-specific first-step estimators $\hat f_i$, i = 1, 2, ..., N.

(2) (Construction of the weighting matrix) Compute the matrix W such that, for all $i \neq j$, W(i, j) is determined by the distance $d(\hat f_i, \hat f_j)$, where $d$ is a metric. Set W(i, i) = 1 for all i = 1, 2, ..., N.

(3) (Second stage) Weighted estimation of $f_i$ by using all observations with weights given by W(i, j). The final estimator is denoted by $\tilde f_i$.

Comment 2. (1) The choice of the metric / distance measure in Step 2 depends on the estimator. E.g., if OLS or Lasso regression is employed, a natural choice is the Euclidean norm of the difference of the estimated coefficient vectors, $\|\hat\beta_i - \hat\beta_j\|_2$. In the case of kernel or series estimation, one might choose a distance between the estimated functions that is computed with respect to the empirical measure of the regressors.

(2) In the last step (second stage) a weighted regression is conducted. So all units are used for estimation, but with the weights computed in Step 2. Hence, any estimation method supporting weights can be used with our method.
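
The two kinds of distance mentioned in Comment 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the root-mean-square distance over pooled sample points is one possible way to make "distance with respect to the empirical measure" concrete, not necessarily the choice made in the paper.

```python
import numpy as np

def coef_distance(B):
    """Pairwise Euclidean distances between the rows of B (N x p coefficients)."""
    diff = B[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def empirical_l2_distance(fitted):
    """Pairwise root-mean-square distances between fitted functions.

    `fitted` has shape (N, M): row i holds the i-th estimated function
    evaluated at the same M pooled sample points (the empirical measure).
    """
    diff = fitted[:, None, :] - fitted[None, :, :]
    return np.sqrt((diff ** 2).mean(axis=2))

# Linear fits f_i(x) = x'b_i, evaluated on pooled design points X_pool.
rng = np.random.default_rng(0)
B = np.array([[1.0, 0.0], [1.0, 0.1], [-2.0, 3.0]])
X_pool = rng.normal(size=(200, 2))
D_coef = coef_distance(B)
D_func = empirical_l2_distance(B @ X_pool.T)
```

For linear fits both distances order the pairs in the same way here: the first two individuals are close to each other and far from the third.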

Comment 3. A modification of the proposed algorithm is to iterate the calculation of the weights: After the third step the weights are updated with the estimated functions and then Step 3 is repeated. Either one stops then or the updating of the weights is repeated until the change of weights falls below some threshold.

Comment 4 (Parallel computation). From the computation perspective, for both the first and second stage estimation, all N estimators can be constructed in a fully parallel way, and thus the method is computationally attractive.
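
The independence of the N first-stage fits can be exploited directly. The sketch below uses Python's standard `concurrent.futures` with OLS as the first-stage estimator; the function names, the array layout and the choice of a thread pool are assumptions made for illustration, and a process pool or a cluster scheduler could be used instead.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fit_one(data):
    """First-stage OLS for a single individual; data = (X_i, y_i)."""
    X_i, y_i = data
    beta, *_ = np.linalg.lstsq(X_i, y_i, rcond=None)
    return beta

def first_stage_parallel(X, y, max_workers=4):
    """Fit the N individual regressions concurrently.

    X has shape (N, T, p) and y has shape (N, T). Each fit touches only
    its own individual's data, so the tasks are independent.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        betas = list(pool.map(fit_one, zip(X, y)))
    return np.vstack(betas)

rng = np.random.default_rng(1)
N, T, p = 8, 30, 3
X = rng.normal(size=(N, T, p))
true_beta = rng.normal(size=(N, p))
y = np.einsum("ntp,np->nt", X, true_beta)  # noiseless outcomes, for checking
B_hat = first_stage_parallel(X, y)
```

The second stage can be parallelized in the same way, since each weighted fit only needs the shared data and one row of the weighting matrix.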

Comment 5. In principle, the individual fixed effect can be viewed as an unknown category. The ADS algorithm we propose has some similarity with clustering algorithms, e.g., the k-means algorithm, but offers more flexibility and has some advantages. First, the number of groups does not have to be known or specified in advance; rather, the grouping is determined in a data-driven fashion. The proposed estimator can cope with both discrete and “continuous” categories. In Section 4, we will lay out theorems for the ordinary least squares (OLS) and Lasso case to show that the procedure adapts to different situations with different asymptotic properties. Second, the estimator can be interpreted as a “soft clustering” algorithm, as each individual does not have to be assigned to exactly one group (0-1 weights); instead, flexible weights and a continuum of groups are allowed. By this soft clustering of individuals whose functions or coefficients are close enough, each second-stage estimator will be more accurate in terms of the mean squared error (MSE), because the variance of the estimator is reduced compared to the individual estimator.

In the following we will specify the generic algorithm for linear panel data models and estimate them with ols in the low-dimensional setting and with lasso in the high-dimensional setting.

3.2. Linear Panel Data Model. To begin with, let us first consider the linear panel data model from Example 2 (2) in a low-dimensional setting, i.e. where the dimensionality p of each predictor is fixed and independent of the sample size. We illustrate our ADS estimator for this case based on the OLS estimation.

Algorithm 2 (ADS-Algorithm for linear panel data).

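(1) (First stage estimation) Use the OLS estimator to construct the initial individual estimators $\hat\beta_i = (X_i'X_i)^{-1}X_i'y_i$, i = 1, 2, ..., N, where $X_i$ stacks the T regressor vectors of individual i and $y_i$ the corresponding outcomes. Steps 2 and 3 then proceed as in Algorithm 1, with a weighted least squares regression in the second stage.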

When the sample sizes N and T are large, the choice of the tuning constant does not matter for the asymptotic performance of the ADS procedure. In practice, we could simply choose a value in (0, 1); in our study, for example, it is set to 0.5. We further make a few important comments about the algorithm.

Comment 6 (Extension to an iterative approach). As mentioned earlier, the algorithm can be extended by iterating the steps, updating first the estimates and then the weighting matrix. More precisely, for the OLS estimator, after obtaining the second-stage estimates from Step 3, we could update the weight matrix in Step 2 using these estimates and repeat Step 3 to obtain refined estimators. One could do either a one-step correction or repeat Step 2 and Step 3 for several iterations until the change of the weights falls below a certain threshold.
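
A minimal Python sketch of the ADS procedure with OLS, including the optional re-weighting iterations, might look as follows. It is an illustration under stated assumptions rather than the authors' implementation: the weight rule `a * 1{distance <= h}` is hypothetical, as are the function names and the parameters `a`, `h` and `n_iter`.

```python
import numpy as np

def weights_from_coefs(B, a=0.5, h=1.0):
    """Hypothetical weight rule: a * 1{||b_i - b_j|| <= h}, diagonal = 1."""
    D = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
    W = np.where(D <= h, a, 0.0)
    np.fill_diagonal(W, 1.0)
    return W

def ads_ols(X, y, a=0.5, h=1.0, n_iter=1):
    """Two-stage ADS with OLS; n_iter > 1 re-computes weights and refits.

    X: (N, T, p) regressors, y: (N, T) outcomes.
    Returns (first-stage coefficients, final coefficients), both (N, p).
    """
    N = X.shape[0]
    # Step 1: individual OLS using only each individual's own T observations.
    B0 = np.vstack([np.linalg.lstsq(X[i], y[i], rcond=None)[0] for i in range(N)])
    # Per-individual cross-products, reused in every weighted fit.
    XtX = np.einsum("ntp,ntq->npq", X, X)
    Xty = np.einsum("ntp,nt->np", X, y)
    B = B0
    for _ in range(n_iter):
        W = weights_from_coefs(B, a, h)                # Step 2: weighting matrix
        A = np.einsum("ij,jpq->ipq", W, XtX)           # sum_j W(i,j) X_j'X_j
        c = W @ Xty                                    # sum_j W(i,j) X_j'y_j
        B = np.linalg.solve(A, c[..., None])[..., 0]   # Step 3: weighted LS
    return B0, B

# Toy panel: two latent groups with well-separated coefficients, small T.
rng = np.random.default_rng(0)
N, T, p = 60, 10, 3
group = np.arange(N) % 2
beta = np.where(group[:, None] == 0, 1.0, -4.0) * np.ones((N, p))
X = rng.normal(size=(N, T, p))
y = np.einsum("ntp,np->nt", X, beta) + rng.normal(size=(N, T))
B0, B = ads_ols(X, y, a=0.5, h=3.0, n_iter=2)
mse_individual = np.mean((B0 - beta) ** 2)
mse_ads = np.mean((B - beta) ** 2)
```

Setting `a=0` switches the borrowing off and reproduces the individual OLS fits, which is a convenient sanity check. The weighted fit only needs the per-individual cross-products, so re-weighting iterations are cheap once these are computed.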

3.3. High-dimensional Panel Data Models. One interesting special case of Algorithm 1 is the high-dimensional linear regression case, which is the core contribution of the paper. As high-dimensional data are more and more available to researchers, we work out a special version of Algorithm 1 for Lasso estimation of panel data in this subsection.

For high-dimensional linear regression, many estimators have been proposed in the literature, such as Lasso (Tibshirani, 1996), the Dantzig selector (Candès and Tao, 2007), square-root Lasso (Belloni et al., 2011) (or scaled Lasso (Sun and Zhang, 2012)), boosting (Spindler and Luo, 2016), and post-Lasso (Belloni and Chernozhukov, 2013). In principle, our ADS algorithm can be combined with any estimator for high-dimensional linear regression. For ease of illustration, we adopt the most widely used Lasso estimator and present the high-dimensional ADS algorithm next.

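Consider the following high-dimensional model: for $1 \le i \le N$ and $1 \le t \le T$,

$$y_{it} = x_{it}'\beta_i + \varepsilon_{it},$$

where the $\varepsilon_{it}$ are i.i.d. errors drawn from a normal distribution with mean zero and variance $\sigma^2$.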

Comment 7. The distributional assumption on the error terms can be relaxed. For example, serial correlation or heteroskedasticity can be allowed for by employing the theory of self-normalized processes for Lasso as proposed in Belloni et al. (2012).

For each i, the coefficient vector $\beta_i$ depends on the unobserved variable $\alpha_i$, which is a scalar, and the dimension of the problem has to obey an order condition. This can be interpreted in different ways, as pointed out earlier.

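Algorithm 3 (ADS-Algorithm for Lasso). (1) Run N separate Lassos, one for each individual, to obtain the initial estimates. The weighting matrix is then constructed as in Algorithm 1, and the second stage consists of N weighted Lassos, each with its own penalty loading for individual i in the second stage.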

For the high dimensional data, in Algorithm 3, we can replace Lasso with other statistical or machine learning procedures that work well in high-dimensions, e.g., Boosting (Spindler and Luo (2016)), Post-Lasso (Belloni and Chernozhukov (2013)). For brevity, we shall not repeat these procedures in this paper and restrict to the Lasso case.
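
To make the weighted second stage concrete for Lasso, the following Python sketch solves a weighted Lasso problem by cyclic coordinate descent and applies it to every individual with the weights of one row of W. It is an illustration under stated assumptions: the objective normalization, the single penalty level `lam` shared by all individuals, and all function names are choices made here, not taken from the paper.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def weighted_lasso(X, y, w, lam, n_sweeps=500, tol=1e-10):
    """Minimise sum_m w_m (y_m - x_m'b)^2 / (2 sum_m w_m) + lam * ||b||_1
    by cyclic coordinate descent."""
    w = w / w.sum()
    b = np.zeros(X.shape[1])
    resid = y.astype(float).copy()            # residual y - X b at b = 0
    z = (w[:, None] * X ** 2).sum(axis=0)     # weighted squared column norms
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(X.shape[1]):
            if z[j] == 0.0:
                continue
            # Weighted correlation of column j with the partial residual.
            rho = np.dot(w * X[:, j], resid) + z[j] * b[j]
            new = soft_threshold(rho, lam) / z[j]
            if new != b[j]:
                resid -= X[:, j] * (new - b[j])
                max_change = max(max_change, abs(new - b[j]))
                b[j] = new
        if max_change < tol:
            break
    return b

def ads_lasso_second_stage(X, y, W, lam):
    """X: (N, T, p), y: (N, T), W: (N, N). Returns (N, p) coefficients."""
    N, T, p = X.shape
    X_all, y_all = X.reshape(N * T, p), y.reshape(N * T)
    out = np.empty((N, p))
    for i in range(N):
        w = np.repeat(W[i], T)                # W(i, j) for every row of individual j
        keep = w > 0                          # zero-weight rows do not matter
        out[i] = weighted_lasso(X_all[keep], y_all[keep], w[keep], lam)
    return out

# Toy example: all individuals share one sparse coefficient vector.
rng = np.random.default_rng(0)
N, T, p = 6, 40, 5
beta = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(N, T, p))
y = X @ beta + 0.1 * rng.normal(size=(N, T))
W = np.full((N, N), 0.5)
np.fill_diagonal(W, 1.0)
B_tilde = ads_lasso_second_stage(X, y, W, lam=0.05)
```

With the identity matrix as W and `lam=0`, the routine reduces to individual least squares, which gives a simple way to check an implementation. In practice a library Lasso that accepts observation weights could replace the hand-written solver.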

4. Simulation Study

In this section we present results from a simulation study. In particular, we compare our methods to the “frequentist” approach, i.e. estimating individual functions using only the observations of the respective individual. We present results for OLS and Lasso regression.

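4.1. Linear Regression. We consider the following model

$$y_{it} = x_{it}'\beta_i + \varepsilon_{it},$$

where $X_i$ (a $T \times (p+1)$ matrix) collects the regressors of individual i and $x_{it}$ is a $(p+1)$-dimensional vector containing a constant of 1. For the regressors we consider two settings:

(1) iid setting: the regressors are independent.

(2) correlated setting: the regressors have covariance matrix Σ, where Σ has Toeplitz structure with parameter 0.5.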
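
The two design settings can be simulated along the following lines. The sketch assumes Gaussian regressors and the common Toeplitz form with entries 0.5^|j-k|; both are assumptions made for illustration, as are the function names.

```python
import numpy as np

def toeplitz_cov(p, rho=0.5):
    """Covariance matrix with entries rho**|j - k| (assumed Toeplitz form)."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def draw_design(N, T, p, correlated, rng, rho=0.5):
    """Return X of shape (N, T, p + 1) whose first column is the constant 1."""
    cov = toeplitz_cov(p, rho) if correlated else np.eye(p)
    Z = rng.multivariate_normal(np.zeros(p), cov, size=(N, T))
    return np.concatenate([np.ones((N, T, 1)), Z], axis=2)

rng = np.random.default_rng(0)
X_iid = draw_design(N=10, T=20, p=5, correlated=False, rng=rng)
X_corr = draw_design(N=10, T=20, p=5, correlated=True, rng=rng)
```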

We consider two data generating processes. One specifies the $(p+1)$-dimensional coefficient vector $\beta_i$ as a function of a random, unobserved quantity; the other assumes that the $\beta_i$ are drawn from a normal distribution with constant correlation between the same component of different individuals i, but with independent components.

4.1.1. DGP 1. The individual coefficients $\beta_i$ ($(p+1)$-dimensional) are simulated, based on the normal distribution, in such a way that the components are uncorrelated for individual i, but the (same) components are correlated between individuals i and j with a constant correlation: the N-dimensional correlation matrix has this constant in its off-diagonal entries and ones on the diagonal. The vectors collecting the l-th component of all individuals are iid for different components l (l = 1, . . . , p + 1).
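
A possible way to draw coefficients with this structure is sketched below in Python. The zero mean, the unit variances and the value of the correlation parameter `rho` are assumptions for illustration, since they are not fixed by the description above; the function name is hypothetical.

```python
import numpy as np

def draw_equicorrelated_coefs(N, p, rho, rng):
    """Return B of shape (N, p + 1).

    Each column (component l) is an N-vector drawn from N(0, R), where R has
    ones on the diagonal and rho elsewhere; the columns are independent.
    """
    R = np.full((N, N), rho)
    np.fill_diagonal(R, 1.0)
    L = np.linalg.cholesky(R)                 # R = L L'
    Z = rng.standard_normal(size=(N, p + 1))  # independent standard normals
    return L @ Z

rng = np.random.default_rng(0)
B = draw_equicorrelated_coefs(N=50, p=5, rho=0.5, rng=rng)
```

Row i of `B` is the coefficient vector of individual i. Within a row the entries are independent, while each column is correlated across individuals.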

4.1.2. DGP 2. This DGP is similar to DGP 1, but the coefficient vector $\beta_i$ is generated differently, namely as a function of an unobserved quantity $\alpha_i$.

The $\alpha_i$ are drawn iid from a uniform distribution.

4.2. Lasso Estimation / Machine Learning. Here we estimate a model comparable to the linear model above, but we use Lasso and Lasso with discrete smoothing for estimation. Lasso usually assumes a sparse setting, i.e. it is assumed that there is a set of p + 1 potential variables, but only a small subset of s + 1 regressors has non-zero coefficients. As data generating processes we use variants of DGP 1 and DGP 2 described above: the first s + 1 components are simulated in exactly the same way as in DGP 1 and DGP 2, and the coefficients of the remaining variables are set equal to zero. In the simulations we vary n, p, s and T. We decompose the coefficient vector $\beta_i$ into two parts, the first s + 1 components and the remaining components, which are zero. Hence we can formulate the DGPs in the following way:

DGP 3 β

DGP 4. For each of the first s + 1 components, the coefficients of the N individuals are multivariate normally distributed with correlation matrix Σ, where Σ has ones on the diagonal and a constant correlation in the off-diagonal entries. The components are independent.

4.3. Results. In this section we present the simulation results. We generate the data according to the DGPs described above and estimate the models with OLS (DGP 1 and 2) or Lasso (DGP 3 and 4) and with the smoothing versions introduced in this paper. We vary different parameters. For estimation we use data sets of size n, T. The forecasts are evaluated out-of-sample (with the same sample size as used for estimation) according to the mean squared error (MSE). We set the number of repetitions to R = 500. We use a setting where the design matrix X is iid (“X iid”) and a setting where X is correlated (“X corr”), i.e. has Toeplitz structure. In the linear case we vary N, T and p. In the Lasso case we use an exact sparsity design, where p denotes the number of parameters and s the number of non-zero coefficients.

The results can be summarized as follows: when T is small and N is large, our method performs very favorably. When N is small and T is large, discrete smoothing does not add much benefit. This is exactly what one would expect. Overall, the simulation results highlight the usefulness of our approach for empirical applications.

4.3.1. OLS Setting.

Table 1. Simulation Results Linear - out of sample - Setting 1, X iid

Table 2. Simulation Results Linear - out of sample - Setting 1, X cor, p=5

Table 3. Simulation Results Linear - out of sample - Setting 2, p=5

5. Empirical Studies: DIDI data and gap prediction

5.1. Data Set. Didi is a major ride-sharing company in China with more than 400 million users across 400 cities. As the services are based on smartphone applications, the company collects a huge amount of data on a daily basis, with detailed information on requested and provided rides, including geographical information and time stamps. We use the raw data on single rides to estimate the gap in demand, which is defined as the difference between the number of requested rides, i.e. the number of calls in a certain location within a period of 10 minutes, and the number of answers in this location within this period. The gap is an important variable for the company, as good predictions of this variable help to keep drivers available and serve as input for dynamic pricing strategies. The data set contains the following variables:

• location: the location id corresponds to a clustered map, dividing cities in hexagons (66).

• time (Date str, DayM, DayW, TimeP): date, day of month, day of week (1-7), time period of the day (1-144).

• weather: categorical variable on weather conditions

• temperature in a certain location at a certain point of time

• measure of air pollution (continuous)

• requests (demand): number of calls in a certain location in time spans of 10 minutes

• answers (supply): number of answers in a certain location in time spans of 10 minutes

• gap: defined as the difference of requests and answers

• road / traffic conditions (four variables, L1-L4)

• price: the average price / median price within a 10-minute cell at a certain location, computed from the raw data.
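The definition of the gap as requests minus answers per location and 10-minute period can be sketched as follows. The record layout (location id, date string, minute of the day, answered flag) and the function names are hypothetical; the actual raw data format of the Didi data set is not described here.

```python
from collections import defaultdict


def time_period(minute_of_day):
    """Map a minute of the day (0-1439) to a 10-minute period (1-144)."""
    return minute_of_day // 10 + 1


def compute_gaps(rides):
    """Aggregate single rides to gap = requests - answers per cell.

    `rides` is an iterable of (location, date, minute_of_day, answered) tuples,
    where `answered` is True if the request was served. This layout is assumed.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [requests, answers]
    for location, date, minute_of_day, answered in rides:
        cell = (location, date, time_period(minute_of_day))
        counts[cell][0] += 1
        if answered:
            counts[cell][1] += 1
    return {cell: requests - answers for cell, (requests, answers) in counts.items()}


rides = [
    (1, "2016-01-01", 0, True),
    (1, "2016-01-01", 5, False),
    (1, "2016-01-01", 9, False),
    (1, "2016-01-01", 10, True),
]
gaps = compute_gaps(rides)
```

In this small example the first period of location 1 has three requests and one answer, while the second period has one request that was answered.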

We build a model to predict the gaps at the location level, i.e. for each district we estimate a model for the gap. It seems plausible that the demand depends on the location (e.g. city center vs. outskirts). This leads to 66 individual location units in the city we consider for our analysis. The total number of observations for all units is n = 135,674. We split the data set into a training sample (n = 105,000) and a testing sample (n = 30,674), taking the location structure into account. Hence approximately 1,500 observations are available to estimate each model. We model variables like day of month and day of week as categorical, leading to a large set of potential covariates for each unit, namely 46, which is challenging. This situation, which is common in many real-world applications, in particular for data sets collected on the internet, fits the methods we propose well, as additional information from other observations is very valuable in this setting.
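One way to split a sample while respecting the location structure, and to encode calendar variables as categorical, is sketched below. The exact splitting rule used for the Didi data is not stated, so taking the first share of each location's rows as training data is an assumption, as are the field and function names.

```python
from collections import defaultdict


def split_by_location(rows, train_share):
    """Split rows into training and testing parts separately within each location."""
    by_location = defaultdict(list)
    for row in rows:
        by_location[row["location"]].append(row)
    train, test = [], []
    for location_rows in by_location.values():
        cut = int(round(train_share * len(location_rows)))
        train.extend(location_rows[:cut])   # assumed rule: first share is training data
        test.extend(location_rows[cut:])
    return train, test


def one_hot(value, levels):
    """Encode a categorical value as a list of 0/1 indicators, one per level."""
    return [1.0 if value == level else 0.0 for level in levels]


# Hypothetical toy data: two locations with ten observations each.
rows = [{"location": loc, "day_of_week": t % 7 + 1} for loc in (1, 2) for t in range(10)]
train, test = split_by_location(rows, train_share=0.8)
dummies = one_hot(3, range(1, 8))  # day of week 3 out of 1-7
```

Splitting within each location guarantees that every unit appears in both samples, which is needed when a separate model is estimated per unit.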

5.2. Results. We estimate for each unit linear regression models with OLS and Lasso and compare the results with the adaptive discrete smoothing method for each estimator introduced in this paper. Although the OLS estimator is not well defined for units with p > n, it can be used for prediction.
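The remark on prediction with p > n can be illustrated with a minimum-norm least-squares fit. This is one common convention for resolving the non-uniqueness and is an assumption here, not necessarily the one used for the results in this paper. The data are simulated, and the number of observations is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, p = 20, 46  # fewer observations than covariates (n_obs is hypothetical)
X_train = rng.normal(size=(n_obs, p))
y_train = rng.normal(size=n_obs)

# lstsq returns the minimum-norm least-squares solution when the
# system is underdetermined, so a coefficient vector is always available.
beta_hat, _, rank, _ = np.linalg.lstsq(X_train, y_train, rcond=None)

X_new = rng.normal(size=(5, p))
y_pred = X_new @ beta_hat  # predictions are defined although beta is not unique
```

In this regime the fit interpolates the training data exactly, so in-sample errors are uninformative and the out-of-sample MSE is the relevant criterion.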

We apply the ADS procedure to didichuxing.com’s data. This data involves 66 districts. In our study, we found that our ADS procedure outperforms all other methods, measured by both in-sample and out-of-sample MSE. Already in the simulation study we found clear evidence that the ADS procedure does better in MSE compared to traditional methods, in many different settings.

6. Conclusion

In this paper, we propose a novel adaptive discrete smoothing procedure (ADS). Such a procedure is applicable to both non-linear and high-dimensional panel data. The procedure is especially useful when the cross-sectional dimension is large and the longitudinal dimension is short, where building individual models based on individual data is highly imprecise, while heterogeneity across individuals prevents good performance of a uniform model for all individuals. The results of our simulations strongly support our procedure compared to existing econometric and statistical methods. Our procedure can also be interpreted as a clustering method. In future research we would like to derive the theoretical properties (work in progress) and explore more formally the connection to cluster methods.

References

Aloise, Daniel, Deshpande, Amit, Hansen, Pierre and Popat, Preyas. (2009). ‘NP-hardness of Euclidean sum-of-squares clustering’, Machine Learning 75, 245–248.

Arellano, Manuel and Bonhomme, Stéphane. (2011). ‘Nonlinear panel data analysis’, Annu. Rev. Econ. 3(1), 395–424.

Baltagi, B. H. and Raj, B. (1992). ‘A survey of recent theoretical developments in the econometrics of panel data’, Empirical Economics 17(1), 85–109.

Belloni, A. and Chernozhukov, V. (2013). ‘Least Squares After Model Selection in High-dimensional Sparse Models’, Bernoulli 19(2), 521–547.

Belloni, A., Chernozhukov, V. and Wang, L. (2011). ‘Square-Root-LASSO: Pivotal Recovery of Sparse Signals via Conic Programming’, Biometrika 98(4), 791–806. Arxiv, 2010.

Belloni, Alexandre, Chen, Daniel, Chernozhukov, Victor and Hansen, Christian. (2012). ‘Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain’, Econometrica 80, 2369–2429.

Belloni, Alexandre, Chernozhukov, Victor, Hansen, Christian and Kozbur, Damian. (2014). ‘Inference in High Dimensional Panel Models with an Application to Gun Control’, Journal of Business & Economic Statistics 34(4), 590–605.

Bonhomme, Stephane, Lamadon, Thibaut and Manresa, Elena. (2017), Discretizing Unobserved Heterogeneity: Approximate Clustering Methods for Dimension Reduction, Technical report, Institute for Fiscal Studies.

Bonhomme, Stephane and Manresa, Elena. (2015). ‘Grouped Patterns of Heterogeneity in Panel Data’, Econometrica 83(3), 1147–1184.

Candès, E. and Tao, T. (2007). ‘The Dantzig selector: statistical estimation when p is much larger than n’, The Annals of Statistics 35(6), 2313–2351.

Henderson, Daniel J., Carroll, Raymond J. and Li, Qi. (2008). ‘Nonparametric estimation and testing of fixed effects panel data models’, Journal of Econometrics 144(1), 257–275.

Hsiao, Cheng. (2014), Analysis of Panel Data, Econometric Society Monographs, 3rd edn, Cambridge University Press.

Kock, A. B. (2013). ‘Oracle efficient variable selection in random and fixed effects panel data models’, Econometric Theory 29(1), 115–152.

Kock, Anders Bredahl. (2016). ‘Oracle inequalities, variable selection and uniform inference in high-dimensional correlated random effects panel data models’, Journal of Econometrics 195(1), 71–85.

Li, D., Qian, J. and Su, L. (2016). ‘Panel data models with interactive fixed effects and multiple structural breaks’, J. Amer. Statist. Assoc. 111(516), 1804–1819.

Mammen, Enno, Støve, Bård and Tjøstheim, Dag. (2009). ‘Nonparametric Additive Models for Panels of Time Series’, Econometric Theory 25(2), 442–481.

Qian, Hailong and Schmidt, Peter. (2003). ‘Partial GLS Regression’, Economics Letters 79, 385–392.

Qian, J. and Su, L. (2016). ‘Shrinkage estimation of common breaks in panel data models via adaptive group fused lasso’, J. Econometrics 191(1), 86–109.

Racine, Jeff and Li, Qi. (2004). ‘Nonparametric estimation of regression functions with both categorical and continuous data’, Journal of Econometrics 119(1), 99–130.

Spindler, Martin and Luo, Ye. (2016). ‘High-Dimensional L2-Boosting: Rate of Convergence’, arXiv preprint arXiv:1602.08927 .

Su, Liangjun, Shi, Zhentao and Phillips, Peter C. B. (2016). ‘Identifying Latent Structures in Panel Data’, Econometrica 84(6), 2215–2264.

Sun, Tingni and Zhang, Cun-Hui. (2012). ‘Scaled sparse linear regression’, Biometrika 99(4), 879–898.

Tibshirani, R. (1996). ‘Regression Shrinkage and Selection via the Lasso’, Journal of the Royal Statistical Society. Series B (Statistical Methodology) 58, 267–288.

Vogt, Michael and Linton, Oliver. (2017). ‘Classification of non-parametric regression functions in longitudinal data models’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(1), 5–27.

Wooldridge, Jeffrey M. (2010), Econometric Analysis of Cross Section and Panel Data, 2nd edn, The MIT Press.

Chen, Xi, Luo, Ye and Spindler, Martin. (2020), Adaptive Smoothing for Nonparametric Estimation, Technical report.

Zhu, Yinchu. (2017). ‘High-Dimensional Panel Data with Time Heterogeneity: Estimation and Inference’, Available at SSRN: https://ssrn.com/abstract=2665374 .