b

DiscoverSearch
About
My stuff
Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond
2019·arXiv
Abstract
Abstract

We consider estimating a low-dimensional parameter in an estimating equation involving high-dimensional nuisances that depend on the parameter. A central example is the efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference, which involves as a nuisance the covariate-conditional cumulative distribution function evaluated at the quantile to be estimated. Debiased machine learning (DML) is a data-splitting approach to estimating high-dimensional nuisances using flexible machine learning methods, but applying it to problems with parameter-dependent nuisances is impractical. For (L)QTE, DML requires we learn the whole covariate-conditional cumulative distribution function. We instead propose localized debiased machine learning (LDML), which avoids this burdensome step and needs only estimate nuisances at a single initial rough guess for the parameter. For (L)QTE, LDML involves learning just two regression functions, a standard task for machine learning methods. We prove that under lax rate conditions our estimator has the same favorable asymptotic behavior as the infeasible estimator that uses the unknown true nuisances. Thus, LDML notably enables practically-feasible and theoretically-grounded efficient estimation of important quantities in causal inference such as (L)QTEs when we must control for many covariates and/or flexible relationships, as we demonstrate in empirical studies.

In this paper, we consider estimating parameters  θ∗ = (θ∗1, θ∗2) ∈ Θ = Θ1 × Θ2 ⊆ Rddefined as the (unique) solution to the following d-dimensional estimating equation:

image

where  Z ∈ Zare observed random variables with distribution  P, η∗1(Z; θ1) and η∗2(Z) are twounknown nuisance functions, and 0 is the zero vector in  Rd. We hope to estimate  θ∗ based on(Z1, . . . , ZN), Nindependent and identically distributed (i.i.d) draws from the distribution P. As we will show, estimating equations of the form above are prevalent in efficient estimation in causal inference and missing data problems, with quantile treatment effect (QTE) estimation (Section 1.1) as a prominent example, among many others (Section 1.2).

One important feature of Eq. (1) is that the nuisance  η∗1(Z; θ1) depends on the parameters to be estimated, which raises several challenges and causes existing methods to be unstable and computationally burdensome. Specifically, we could potentially use the observed data to estimate the nuisances  η∗1(Z; θ1) and η∗2(Z) and then solve a sample analogue of Eq. (1) based on the estimated nuisances in order to estimate  θ∗, possibly using cross-fitting (Chernozhukov et al. 2018a). However, this requires estimating the nuisance  η∗1(Z; θ1) for all possible θ1, i.e., learning infinitely manyfunctions of Z, and then solving for the root of an estimated function. For example, when estimating QTE (see Section 1.1), this involves estimating a whole conditional cumulative distribution function, or equivalently, infinitely many binary probability regressions (one for each threshold). This can be very unstable, especially in causal inference with observational data where typically a large number of covariates need to be conditioned on to remove confounding. Although one may discretize the space of Θ1and estimate  η∗1(Z; θ1) only for finitely many  θ1, this can still be compu- tationally burdensome when the discrete grid is large, and the resulting estimator can be sensitive to the discretization scheme.

In this paper, we propose a localized debiased machine learning (LDML) approach that only requires estimating  η∗1(Z; θ1) at a single θ1value, without estimating it for all possible values or ad-hoc discretized values of  θ1. Importantly, our estimator is asymptotically equivalent to an oracle estimator that knows the whole continuum of nuisance function  η∗1(Z; θ1) for all θ1 ∈ Θ1. Inother words, asymptotically, our method does not incur any loss even though it only estimates the nuisance function at a single  θ1value. Moreover, estimating this far simpler nuisance reduces to standard classification and regression tasks, i.e., fitting conditional expectations (regression) and conditional binary probabilities (classification), for which many machine learning methods exist. In particular, our approach will be shown to be largely insensitive to how these conditional expectation functions are estimated, so we may directly use off-the-shelf machine learning methods and treat them as black-box regression or classification algorithms (e.g., random forests, gradient boosting, neural networks). Therefore, our proposed method notably enables practical and efficient estimation using time-tested machine learning methods to solve Eq. (1).

In comparison, existing approaches for debiased and efficient estimation with black-box nuisance estimators either focus on settings where nuisances do not depend on target parameters or treat nuisances as abstract objects so that one must estimate a continuum of nuisances when applying to Eq. (1) thus precluding the use of standard machine-learning algorithms for regression and classification (Robins et al. 2008, Zheng and van der Laan 2011, Robins et al. 2013, Chernozhukov et al. 2018a, Bravo et al. 2020). Similarly, existing works specifically on the efficient estimation of QTEs either apply similar debiased approaches using a continuum of nuisances (Belloni et al. 2017, D´ıaz 2017) or use specific non-black-box nuisance estimators like polynomial sieves and local polynomial kernel regression and make explicit smoothness restrictions (Firpo 2007, Fr¨olich and Melly 2013). (We provide an extensive literature review in Section 7.) Compared to these works, our proposal is fully generic, flexible, and machine-learning driven in that it handles many important examples that fit into Eq. (1), as we review in the next two sections; it permits the use of flexible black-box nuisance estimators, since we only require lax rate conditions that are in fact more lax than some previous results; and these black boxes may be standard machine-learning methods for regression and classification, since whenever  η∗1(Z; θ1) for any single  θ1 ∈ Θ1is a conditional expectation, such as in all of the examples we review in the next section, our method only ever has to fit very few conditional expectations.

1.1 Motivating Example: Quantile Treatment Effects

A primary motivation of considering the setting of Eq. (1) is the estimation of QTE. In this case, we consider a population P of units, each associated with some baseline covariates  X ∈ X, two potential outcomes  Y (0), Y (1) ∈ Rfor each of two possible treatments, and a treatment indicator  T ∈ {0, 1}.Since both potential outcomes are included in this description, we refer to P as the complete-data distribution. We are interested in the  γ-quantile of  Y (1): the θ∗1 such that P(Y (1) ≤ θ∗1) = γ(assuming existence and uniqueness) for  γ ∈ (0,1). And, similarly, we are interested in the quantile of Y (0) and in the difference of the quantiles, known as the quantile treatment effect (QTE), but these estimation questions are analogous so for brevity we focus just on  θ∗1, the γ-quantile of Y (1) (see also Remark 2). Compared to the average outcome and the average treatment effect (ATE), the quantile of outcomes and the QTE provide a more robust assessment of the effects of treatment and are very important quantities in program evaluation.

We do not observe the potential outcomes but instead only the realized factual outcome corresponding to the assigned treatment, Y = Y (T). Therefore, we only observe Z = (X, T, Y ), whose distribution P is given by coarsening P via Y = Y (T). Ignorable treatment assignment with respect to X assumes that  Y (1) ⊥⊥ T | X (i.e., no unobserved confounders) and overlap assumes that P(T = 1 | X) > 0, and these together ensure that  θ∗1 is identifiable from observations of Z. Specifically, a straightforward identification is given by the so-called inverse propensity weighting (IPW) equation:

image

Here estimating the propensity score function  η∗2 amounts to learning a conditional probabil- ity function from a binary response, for which many standard machine learning methods exist. Once we construct an estimator ˆη2, we can obtain the standard IPW estimator ˆθIPW1 by solving

image

the particular method used to construct ˆη2and its convergence rate can be slowed down by that of ˆη2, prohibiting the use of general nonparametric machine learning methods and potentially leading to unstable estimates.

Instead, one can alternatively obtain the following estimating equation from the efficient influence function for  θ∗ (e.g., Tsiatis 2006):

image

An important feature of the above is that it satisfies a property known as Neyman orthogonality: the moment  Pψ(Z; θ1, η1(Z; θ1), η2(Z)) has zeroderivatives with respect to the nuisances at  θ∗1, η∗1, η∗2.This means that the estimating equation is robust to small perturbations in the nuisances so that estimation errors therein contribute only to higher-order error terms in the final estimate of  θ∗1.In particular, Chernozhukov et al. (2018a) recently proposed to leverage Neyman orthogonality to enable the use of plug-in machine learning estimates of the nuisances. Their proposal, called debiased machine learning (DML), is as follows: split the data randomly into  K folds, D1, . . . , DK, andthen for each k = 1, . . . , K, use all but the  kth fold to construct nuisance estimates ˆη(k)1 , ˆη(k)2 , andfinally solve the empirical estimating equation 1N�Kk=1�i∈Dk ψ(Zi; θ1, ˆη(k)1 (Zi; θ1), ˆη(k)2 (Zi)) = 0to obtain the estimator ˆθ. They prove that as long as the estimates ˆη(k)1 , ˆη(k)2converge to  η∗1, η∗2faster than  N−1/4, the estimate ˆθ1will have similar behavior to the oracle estimate that solves

�Ni=1 ψ(Zi; θ1, η∗1(Zi; θ1), η∗2(Zi)) = 0, i.e., solving the empirical estimating equation using the true nuisance functions. As a result, the estimate ˆθ1is asymptotically normal and semiparametrically efficient. Since, apart from the mild rate requirement on ˆη(k)1 , ˆη(k)2 , no metric entropy condi- tions are assumed, this allows one to successfully use machine learning methods to learn nuisances and achieve asymptotically normal and efficient estimation.

The problem with this approach for estimating quantiles of outcomes (similarly, QTEs), however, is that it requires the estimation of a very complex nuisance function:  η∗1(Z; θ1) is the wholeconditional cumulative distribution function of a real-valued outcome, potentially conditioned on high-dimensional covariates. While certainly nonparametric methods for estimating conditional distributions exist such as kernel estimators, this learning problem is much harder to do in a flexi-ble, blackbox, machine-learning manner, compared to just estimating a single regression function. This indeed stands in stark contrast to the estimation of ATEs, where applying DML requires a far simpler nuisance function given by the regression of outcome on covariates and treatment, E [Y | X, T], for which a long list of practice-proven machine learning methods can be directly and successfully applied. The key difference is that the nuisance function in ATE estimation does not depend on the estimand and can therefore be estimated in an independent manner whereas the nuisance function in QTE estimation does depend on the estimand. This issue makes DML, despite its theoretical benefits, untenable in practice for the important task of QTE estimation.

The primary goal of this paper can be understood as extending DML to effectively tackle the case where nuisances depend on the estimand by alleviating this dependence via localization. In particular, this will enable efficient estimation of important quantities such as QTEs in the presence of high-dimensional nuisances by using and debiasing black-box machine learning methods for the standard regression task.

The basic idea as it applies to the estimation of the quantile of outcomes, which we will generalize and analyze thoroughly, is as follows. While perhaps inefficient, ˆθIPW1relies only on estimating a binary regression  η∗1. This is amenable to machine learning approaches but may have a slow convergence rate in general. Despite its slow rate, this rough initial guess can sufficiently localize our nuisance estimation and it may suffice to only estimate  η∗1(Z; ˆθIPW1 ), i.e., the nuisance evaluated at just a single initial estimate of  θ1. Then we use this estimated nuisance at this initial estimate of  θ∗1 in place of  η∗1(Z; θ1) when solving the empirical estimating equation for  θ1. For estimating the quantiles, this means we only have to regress the binary response  I[Y ≤ ˆθIPW1 ] on X, treatingˆθIPW1as fixed. In particular, we propose a special three-way data splitting procedure that debiases such plug-in nuisance estimates in order to obtain an estimate for  θ∗ with near-oracle performance.

1.2 Estimating Equations with Incomplete Data under Ignorable Treatment Assignment or Using Instrumental Variables

More generally, we can consider parameters (θ∗1, θ∗2) defined as the solution to the following esti- mating equation on the (unavailable) complete data:

image

for some given functions  U(y; θ1) and V (θ2). Quantile is one example of this. Another example is conditional value at risk (CVaR) of outcomes:  θ∗2 = P[Y (1)I [F1(Y (1)) ≥ γ]]/(1 − γ), where F1 isthe cumulative distribution function of Y (1), that is, the expectation of Y (1) conditioned on being above the  γ-quantile (again, assuming uniqueness). CVaR is also known as expected shortfall, a popular risk measure widely used in risk management and optimization (Rockafellar and Uryasev 2002). Again, we may consider the CVaR of Y (0) and the differences of CVaRs analogously. Letting

image

image

Yet another example is the expectile, a measure for asymmetric risk (Newey and Powell 1987). The γ-expectile of Y (1) is defined by the following asymmetric least squares problem:

image

whose first-order condition gives an estimating equation for complete data:

image

Under ignorable treatment assignment and overlap, a general-purpose Neyman-orthogonal estimating equation for the estimand (θ∗1, θ∗2) defined by Eq. (4) is given by

image

Alternatively, instead of assuming ignorable treatment assignment, we may have access to an instrumental variable (IV). We considier a binary IV denoted as  W ∈ {0, 1}and assume that it satisfies identification conditions in Imbens and Angrist (1994) (namely, for potential treatments T(w) and potential outcomes Y (t, w), we have exclusion  Y (t) := Y (t, w) = Y (t, 1 − w), exogeneity (Y (t), T(w)) ⊥ W | X, overlap P(W = 1 | X) ∈ (0,1), relevance P(T(1) = 1) > P(T(0) = 1), and monotonicity  T(1) ≥ T(0)). We seek to use observations of Z = (X, W, T, Y ) to estimate local parameters defined by the following estimating equation conditionally on the subpopulation of compliers (i.e., T(1) > T(0)):

image

For example, specializing Eq. (8) to the functions  U (y; θ1) , V (θ2) in Eq. (5) gives the local quantile and CVaR, which in turn gives the local QTE (LQTE). In Appendix A, we present the efficient estimating equations for these local parameters and show they also satisfy Neyman orthogonality and involve some estimand-dependent nuisance functions  η∗1(Z; θ1).

In all examples above, the nuisance  η∗1(Z; θ1) depends on the estimand. This occurs whether estimating quantiles, CVaR, or expectiles (more generally, whenever  U(y; θ1) is not linear in  θ1)and whether the identification is via ignorable treatment assignment, ignorable coarsening, or valid IV. And, in such cases, learning  η∗1(Z; θ1) for all θ1is practically difficult, which may involve learning a whole conditional distribution function or a whole continuum of conditional expectation functions given potentially high-dimensional covariates.

Notation. We let d1, d2be the dimensions of  θ∗1, θ∗2, respectively, where  d1 + d2 = d. For f :Rd → Rm, ∂θ⊤f(θ) is the m × d-matrix-valued function with entry ∂fi(θ)∂θjin position (i, j) and ∂θ⊤f(θ)|θ=θ0is its evaluation at  θ0. For g : Rd → R, ∂θ∂θ⊤g (θ) is the d × d-matrix-valued function with entry ∂g(θ)∂θi∂θj in position (i, j). We use σmax (∂θ∂θ⊤g (θ)) to denote its largest singular value. We let  P (Z ∈ A) and E [Z | Z ∈ A] for measurable sets A denote probabilities and expectations with respect to  P. We let Pf(Z) =�f dPfor measurable functions f denote expectations with respect to Z alone, while we let  Ef(Z; Z1, . . . , Zn) denote expectations with respect to Z and the data. Thus, if ˆϕdepends on the data,  Pf(Z; ˆϕ) remains a function of the data while  Ef(Z; ˆϕ)is a number. We let  PNdenote the empirical expectation:  PNf(Z) = 1N�Ni=1 f(Zi) for anymeasurable function f. Moreover, for vector-valued function  f(Z) = (f1(Z), . . . , fd(Z)), we letPf2(Z) := (Pf21 (Z), . . . , Pf2d(Z)). For any  x ∈ Rd, we denote the open ball centered at x with radius  δ as B(x; δ). For p >0 and a probability measure Q, we denote  ∥f∥Q,p =��|f|p dQ�1/p.For a set of functions F, we define the covering number  N(ϵ, F, ∥ · ∥Q,2) as the minimal number N of functions  f1, . . . , fNsuch that supf∈F infi=1,...,N ∥f − fi∥Q,2 ≤ ϵ. For positive deterministic sequence  anand random variable sequence  Xn, Xn = oP(an) means P(|Xn|/an > ϵ) → 0 ∀ϵ > 0 andXn = OP(an) means for any  ϵ >0, there exists M > 0 such that lim supn→∞ P(|Xn|/an ≥ M) ≤ ϵ.

We next present our methodology, first motivating the localization technique, and then explicitly stating our meta-algorithm.

2.1 Motivation

Ideally, if the nuisances  η∗1 and η∗2 were both known, then Eq. (1) suggests that  θ∗could be estimated by solving the following estimating equation:

image

Under standard regularity conditions for Z-estimation (van der Vaart 1998), the resulting oracle estimator ˜θthat solves Eq. (9) is asymptotically linear (and hence√N-consistent and asymptotically normal):

image

Furthermore, if  J∗−1ψ(Z; θ∗, η∗1(Z; θ∗1), η∗2(Z)) is the semiparametrically efficient influence function for  θ∗, then ˜θalso achieves the efficiency lower bound, that is, has minimal asymptotic variance among all regular estimators (van der Vaart 1998).

Since  η∗1 and η∗2 are actually unknown, the oracle estimator ˜θis of course infeasible. Instead, we must estimate the nuisance functions. A direct application of DML would require us to learn the whole functions  η∗1 and η∗2. That is, in order to solve Eq. (9) we would need to estimate infinitely many nuisance functions,  H1 = {η∗1(·, θ1) : θ1 ∈ Θ1}.

To avoid the daunting task of estimating infinitely many nuisances, we will instead attempt to target the following alternative oracle estimating equation

image

Although Eq. (11) appears very similar to Eq. (9), it only involves  η∗1(Z; θ1) at the single valueθ1 = θ∗1, as opposed to the infinitely many possible values for  θ1. In other words, among the whole family of nuisances  H1, only η∗1(Z; θ∗1) ∈ H1is relevant for Eq. (11). This formulation considerably reduces the need of nuisance estimation: now we only need to estimate  η∗1(Z; θ∗1) and η∗2(Z), bothfunctions only of Z.

image

Figure 1: Sketch of the LDML estimation procedure and a possible initial guess estimator. Squiggly arrows “⇝” denote estimation. Plain arrows “→” denote plugging in.

The (infeasible) estimators that solve each of Eqs. (9) and (11) have the same leading asymptotic behavior as long as the respective associated Jacobian matrices coincide, as posited by the following assumption.

Assumption 1 (Invariant Jacobian). ∂θ⊤{P [ψ(Z; θ, η∗1(Z; θ∗1), η∗2(Z))]}|θ=θ∗ = J∗.

In Appendix F, we provide a general sufficient condition for Assumption 1 in terms of a Fr´echet-differentiation variant of the Neyman orthogonality condition (Assumption 2 condition vii.). But it is easy to directly show that this invariant Jacobian assumption holds for estimating equations with incomplete data presented in Section 1.2. In particular, the estimating equation  ψin Eq. (7) satisfies that

image

which does not depend on  η∗1(Z; θ1) at all. Thus whether fixing  η∗1(Z; θ1) at θ1 = θ∗1 or not, the Jacobian matrix of the estimating equation  ψremains the same. This means that solving Eq. (9) or Eq. (11) will have the same asymptotic behavior. Both, however, are infeasible since they involve unknown nuisances. Nonetheless, Eq. (11) motivates a new algorithm that eschews estimating H1 = {η∗1(· ; θ1) : θ1 ∈ Θ1} in full.

2.2 The LDML Meta-Algorithm

Motivated by the new (infeasible) estimating equation in Eq. (11), we propose to estimate  θ∗by the following (feasible) three-way sample splitting method, which we term localized debiased machine learning (LDML). The algorithm has two parts: three-way-cross-fold nuisance estimation and solving the estimating equation.

We start by discussing how we estimate the nuisances that we will then plug into Eq. (11).

Definition 1 (3-way-cross-fold nuisance estimation). Fix integers K ≥ 3, K′ ∈ [1, K − 2].

1. Randomly permute the data indices and let  Dk = {⌈(k − 1)N/K⌉ + 1, . . . , ⌈kN/K⌉}, k =1, . . . , K be a random even K-fold split of the data.

2. For k = 1, . . . , K:

image

For illustration the first iteration of step 2 above is sketched in Fig. 1(a) along with the plugging of estimated nuisances into the estimating equation (see Definitions 2 and 5 below).

Notice that since  DC,1k and DC,2kare disjoint,  η∗1(· ; ˆθ(k)1,init) is a fixed, nonrandom function with respect to the data  DC,2k. That is, the  η∗1 nuisance estimation task in step 2b appears as estimating a  single η∗1(· ; θ′1) ∈ H1 for θ′1 = ˆθ(k)1,init rather than the estimation of  all of H1.

A natural question is, what might be a reasonable initial estimator. In the examples given in Sections 1.1 and 1.2, we can use an IPW estimate for ˆθ(k)1,init (see Fig. 1(b) and Definition 4).

Given these nuisance estimates, we can obtain the LDML estimator for  θ∗ by approximately solving the average of the estimate of Eq. (11) in each fold.

Definition 2 (LDML). We let the estimator ˆθbe given by (approximately) solving

image

In fact, we allow for an approximate least-squares solution, which is useful if the empirical estimating equation has no exact solution. Namely, we let ˆθbe any satisfying

image

In Appendix D Definition 5, we give an alternative LDML estimator obtained by averaging solutions to Eq. (11) estimated in each fold separately. These two LDML estimators are asymptotically equivalent and all results in this paper apply to both, thus we focus on Definition 2 in the main text. Moreover, both estimators depend on the random splitting in Definition 1. To reduce the variance from this, we may aggregate estimates from multiple different sample splitting realizations. See Appendix E for a detailed discussion.

In this section, we provide the sufficient conditions that guarantee the proposed estimator ˆθ inDefinitions 2 and 5 to be consistent and asymptotically normal. In particular, although the proposed estimator relies on plug-in nuisance estimators, it is asymptotically equivalent to the infeasible estimator based on Eq. (9) with true nuisances, that is, it satisfies Eq. (10). While some of our conditions are analogous to those in Chernozhukov et al. (2018a), some are not and our proof takes a different approach that enables weaker conditions for convergence rates of the nuisance estimators. Our asymptotic normality results may be stated uniformly over a sequence of models  PN for anydata generating distribution  P ∈ PN. Our first set of assumptions ensure that  θ∗ is reasonably identified by the given estimating equation for all  P ∈ PN. We also assume that our estimating equation satisfies the Neyman orthogonality condition with respect to a nuisance realization set TN ⊂ [Z → R]2 that contains the nuisance estimates ˆη1(· ; ˆθ1,init) and ˆη2(·) with high probability. Note the set  TNconsists of pairs of functions of the data  Z alone and not θ1. Therefore, we denote members of the set as (η1(· ; θ′1), η2(·)) ∈ TN, where η1(· ; θ′1) is simply understood as a symbol representing of some fixed function of Z alone.

Assumption 2 (Regularity of Estimating Equations). Assume there exist positive constants  c1 toc7such that the following conditions hold for all  P ∈ PN:

image

vi. The nuisance realization set  TNcontains the true nuisance parameters (η∗1(· ; θ∗1), η∗2(·)).Moreover, the parameter space Θ is bounded and for each (η1(· ; θ′1), η2(·)) ∈ TN, the function class  Fη,θ′1 = {Z �→ ψj(Z; θ, η1(Z; θ′1), η2(Z)) : j = 1, . . . , d, θ ∈ Θ}is suitably measurable and its uniform covering entropy satisfies the following condition: for positive constants a, v, and q > 2, supQ log N(ϵ∥Fη,θ′1∥Q,2, Fη,θ′1, ∥ · ∥Q,2) ≤ v log(aϵ) ∀ϵ ∈ (0, 1], where Fη,θ′1 is ameasurable envelope for  Fη,θ′1 that satisfies  ∥Fη,θ′1∥P,q ≤ c7.

vii. ∂r {Pψ(Z; θ∗, η∗1(Z; θ∗1) + r(η1(Z; θ′1) − η∗1(Z; θ∗1)), η∗2(Z) + r(η2(Z) − η∗2(Z))}��r=0 = 0 for all(η1(· ; θ′1), η2(·)) ∈ TN.

Assumption 2 conditions i.–v. constitute standard identification and regularity conditions for Z-estimation (with uniform guarantees; see also Remark 1 below). Assumption 2 condition vi. requires that  ψis a well-estimable function of  θ for any fixedset of nuisances. Importantly, while it imposes a metric entropy condition on  ψ, this condition does not impose metric entropy conditions on our nuisance estimators, so flexible machine learning nuisance estimators are allowed. This assumption is very mild as Θ is finite-dimensional, so it can be ensured by some continuity and compactness condition. Finally, Assumption 2 condition vii. is the Neyman orthogonality condition (Cher- nozhukov et al. 2018a). We will show how these conditions are ensured in the incomplete data setting in Section 1.2.

Our second set of assumptions involve conditions on our nuisance estimators.

Assumption 3 (Nuisance Estimation Conditions). For any P ∈ PN:

i. For some sequence of constants N → 0, the nuisance estimates η(k)1 (· ; ˆθ(k)1,init), ˆη(k)2 (·)) belongto the realization set  TN for all k = 1, . . . , Kwith probability at least 1  − ∆N.

image

Here our condition on  λ′N (θ) differs from the counterpart condition in Chernozhukov et al. (2018a), which also leads to a different proof strategy. Our condition and proof generally requires weaker conditions for convergence rates of nuisance estimators. See the discussions in Appendix H for more details. Moreover, the constants ∆N, δN, τNare all prespecified and do not depend on any particular instance P.

Our key result in this paper is the following theorem, which shows that the asymptotic distribution of our estimator is identical to the (infeasible) oracle estimator solving the estimating equation in Eq. (9) with known nuisances.

Theorem 1 (Asymptotic Behavior of LDML). Assume Assumptions 1 to 3 hold with

image

Then the estimator ˆθgiven in Definition 2 is asymptotically linear and converges to a Gaussian distribution uniformly over  P ∈ PN:

image

The conditions in Eq. (16) and  ρN ≲ δNare fairly mild because Assumption 2 condition vi. requires q > 2 (so N−1/2+1/q →0) and Assumption 3 condition ii. requires  r′N ≤ δNlog N .

Remark 1 (Uniform vs non-uniform convergence). To obtain a non-uniform convergence result, we need only need set  PN = {P}as a constant singleton in Theorem 1. In this case, much of Assumption 2 simplifies: the existence of the constants  c4, c6is trivial, the non-singularity of  J∗ isenough for  c3to exist, and  θ∗ being in the interior of Θ is enough for  c1to exist. Further, we can relax condition iv. by allowing  c5to be zero (in which case we rephrase the asymptotic normality in Theorem 1 by putting Σ on the right-hand side of the limit rather than inverting it). Uniformity, however, is important in practice. Without uniformity, for any given sample size N there may always exist some bad instance such that the normal approximation suggested by the convergence is inaccurate (Kasy 2019).

In the previous section we established the asymptotic normality of the LDML estimator under lax conditions. This suggests that if we can estimate its asymptotic variance, then we can easily construct confidence intervals on  θ. In this section we provide a variance estimator and prove its consistency, resulting in asymptotically calibrated confidence intervals. For DML, Chernozhukov et al. (2018a) provides variance estimates only for estimating functions  ψthat are linear in  θ, whichalready excludes estimand-dependent nuisances. Our results are therefore notable both for handling nonlinear and non-differentiable estimating equations and for handling estimand-dependent nuisances.

Definition 3 (LDML variance estimator). Given ˆθfrom Definition 2 and ˆJ, set

image

We next establish the consistency of �Σ, which relies on the following assumption.

Assumption 4. Assume that  ∥ ˆJ − J∗∥ = ρJ,N ≲ δNand that for some  C, β > 0,

image

Here, Eq. (17) implies continuity  θ �→ ψ(Z; θ, η∗1(Z; θ∗1), η∗2(Z)) in terms of  L2norm in the range space. Note that this condition can be satisfied even if  θ �→ ψ(Z; θ, η∗1(Z; θ∗1), η∗2(Z)) is non-differentiable. For example, in the estimation of QTEs, the efficient estimating equation in Eq. (3) involves the indicator function  I [Y ≤ θ1], so the map  θ �→ ψ(Z; θ, η∗1(Z; θ∗1), η∗2(Z)) is obviously not differentiable. However, the condition in Eq. (17) amounts to

image

In Assumption 5, we will assume that  P (T = 1 | X) ≥ ϵπfor a positive constant  ϵπ. Then thecondition above follows if the cumulative distribution function of Y (1) is smooth enough, so that ��P(Y (1) ≤ θ1) − P(Y (1) ≤ θ∗1)�� ≤ Cϵπ |θ1 − θ∗1|β for any θ1 ∈ B(θ∗1; τN).

Under Assumption 4, we now show that the variance estimator in Definition 3 is consistent and it leads to asymptotically valid confidence intervals.

Theorem 2. Assume the assumptions in Theorem 1 and Assumption 4. Then,

image

Given some  ζ ∈ Rd, the confidence interval  CI := [ζ⊤ˆθ ± Φ−1(1 − α/2)�ζ⊤ ˆΣ2ζ/N] obeys

image

In Definition 3, we assumed that we have a consistent estimator ˆJ for J∗. How to construct such an estimator depends on the problem. When  θ �→ ψ(Z; θ, η∗1(Z; θ∗1), η∗2(Z)) is differentiable, an estimator may easily be constructed as follows:

image

However, the estimating equation for QTE is not differentiable. Thus we rely on deriving the form of  J∗ and estimate it directly, which we discuss in detail in Remark 4.

With finite sample, the variance of the LDML estimator also depends on the uncertain sample splitting in Definition 1. This uncertainty can be additionally accounted for when multiple sample splitting realizations are used, which we discuss in Appendix E.

Remark 2 (Estimating and Conducting Inference on Treatment Effects). Suppose we have two sets of parameters,  θ∗(0), θ∗(1), each identified by its own estimating equation,  ψ(0), ψ(1), and weare interested in estimating the difference,  τ ∗ = θ∗(1) − θ∗(0). For example,  θ∗(0), θ∗(1) can be thequantile and/or CVaR of Y (0), Y (1), respectively, and we are interested in the QTE and/or CVaR treatment effect. To do this, we can concatenate the two estimating equations and augment them with the additional equation  θ∗(1) − θ∗(0) − τ ∗ = 0. Estimating this set of estimating equationswith LDML is equivalent to applying LDML to each of  ψ(0), ψ(1) and letting ˆτbe the difference of the estimates ˆθ(0), ˆθ(1), where we may use the same data and the same folds for the two LDML procedures. For QTE and for other estimating equations with incomplete data, we can even share the nuisance estimates of the propensity score (i.e., ˆη(1),(k)2 = ˆη(0),(k)2in the below equation). The variance estimate one would derive for ˆτfrom the augmented estimating equations is equivalent to

image

In this section, we apply our method and theory to general estimating equations with incomplete data presented in Eq. (4), which subsumes the estimation of QTEs, quantile of potential outcomes, CVaR treatment effect, CVaR of potential outcomes, expectile treatment effect, and expectile of potential outcomes. We will proceed to further specialize these results to quantile and CVaR estimation, deferring the case of expectiles to the appendix (Appendix B). We also defer the case of using IVs to estimate the solution to local estimating equations, such as those that describe the LQTE, to appendix (Appendix A).

As motivated in Section 1.1, under unconfoundedness, there is a very natural initial estimator: the IPW estimator. As we will show, the LDML estimate for this problem using the IPW initial estimator can be computed using just blackbox algorithms for (possibly binary) regression, which is the standard supervised learning task in machine learning. And, under lax conditions, the estimate is efficient, asymptotically normal, and amenable to inference.

Recall that  θis defined by the complete-data estimating equations in Eq. (4), namely,  P[U(Y (1); θ1)+V (θ2)] = 0. Assuming ignorability and overlap, θis identified from the incomplete-data observations Z = (X, T, Y ) where Y = Y (T). In particular, Eq. (7) provides a Neyman-orthogonal estimating equation identifying  θ. For better interpretability, we give our nuisances names: we denote  π∗(t | x) = P(T = t | X = x), µ∗j(x, t; θ1) = E [Uj(Y ; θ1) | X = x, T = t], and µ∗(x, t; θ1) =[µ∗1(x, t; θ1), . . . , µ∗d(x, t; θ1)]⊤.For estimating parameters corresponding to Y (1), our estimandindependent nuisance is the propensity score  η∗2(Z) = π∗(1 | X), and our estimand-dependent nuisance is  η∗1(Z; θ1) = µ∗(X, 1; θ1). The case for Y (0) is symmetric; and it also need the symmet- ric ignorability and overlap assumptions for identifiability:  Y (0) ⊥⊥ T | X and P(T = 1 | X) < 1.Treatment effects (e.g., QTEs) can be estimated by differences of estimates, where we can use the same data, the same fold splits, and the same estimates of  π∗ for both treatments (see Remark 2).

This problem also admits a simpler but unstable (i.e., non-orthogonal) estimating equation using IPW, which suggests a possible initial estimator, using  K′ ≥2 in Definition 1:

Definition 4 (IPW Initial Estimator). For each k = 1, . . . , K and l ∈ Hk,1as in Definition 1, use only the data in  DC,1,lk =�Zi : i ∈ �k′∈Hk,1\{l} Dk′�to construct a propensity score estimator ˆπ(k,l)(1 | ·) for π∗(1 | ·). Then let ˆθ(k)1,init be given by solving the following estimating equation (or, its least squares solution up to approximation error of  ϵN):

image

This procedure is illustrated in Fig. 1(b). Note that, given a fixed  θ′1, both π∗(1 | ·) and µ∗(·, 1; θ′1)are conditional expectations of observable variables given X. Thus, in this setting, the whole LDML estimate using the IPW initial estimate can be computed given just blackbox algorithms for (possibly binary) regression.

5.1 Theoretical Analysis

We first study the LDML estimate for estimating equations with incomplete data by leveraging our general theory in Theorem 1. To this end, we assume a strong form of the overlap condition and specify the convergence rates of the initial estimator and nuisance estimators used. We consider a generic treatment level  t ∈ {0, 1}in these two assumptions.

Assumption 5 (Strong Overlap). Assume that there exists a positive constant  επ > 0 such thatfor any  P ∈ PN, π(t | X) ≥ επalmost surely.

Assumption 6 (Nuisance Estimation Rates). Assume that for any  P ∈ PN: condition i. of Assumption 3 holds for a sequence of constants N → 0; with probability at least 1  − ∆N, ˆπ(k)(t |X) ≥ επfor almost all realizations of X, and

image

The following theorem establishes that the asymptotic distribution of our proposed estimator is similar to the (infeasible) one that solves the semiparametric efficient estimating equation in Eq. (7) with known nuisances. This theorem is proved by verifying conditions in Theorem 1, namely Assumptions 1 to 3.

Theorem 3 (LDML for Estimating Equations with Incomplete Data). Fix t = 1 and let the estimator ˆθbe given by applying Definition 2 to the estimating equation in Eq. (7). Suppose Assumptions 5 and 6 hold and that there exist positive constants  c′, C, and c1 to c7such that for any P ∈ PNthe following conditions hold:

i. Conditions i. (with  c1), ii., v. (with  c5, c6), and vi. (with  c7) of Assumption 2 and condition iii. of Assumption 3 for the estimating equation in Eq. (7).

ii. For  j = 1, . . . , d, θ �→ P [Uj(Y (t); θ1) + V (θ2)]is differentiable at any  θin a compact set Θ, and each component of its gradient is  c′-Lipschitz continuous at  θ∗. Moreover, for any  θ ∈ Θwith  ∥θ − θ∗∥ ≥ c32√dc′ , we have 2∥P [U(Y (t); θ1) + V (θ2)] ∥ ≥ c2.

image

�P�µ∗j(X, t; θ1) − µ∗j(X, t; θ∗1)�2�1/2≤ C∥θ1 − θ∗1∥, �����P�∂θ1µ∗j(X, t; θ1)�2�1/2���� ≤ C,σmax�P�∂θ1∂θ⊤1 µ∗j(X, t; θ1)��≤ C, σmax�∂θ2∂θ⊤2 Vj(θ2)�≤ C.

image

Then ˆθsatisfies the conclusion of Theorem 1 for  ψ(Z; θ∗, η∗1(Z; θ∗1), η∗2(Z))given in Eq. (7).

An analogous result for the estimating equations involving Y (0) holds when we change t = 1 to t = 0 everywhere in Theorem 3. See Remark 2 regarding estimation of the difference of the parameters (i.e., the treatment effects) and inference thereon.

In Theorem 3, conditions ii. and iii. guarantee the identification conditions iii. and iv. of Assumption 2. Condition iv. enables exchange of integration, which together with conditions v., vi., and vii. imply the rate condition ii. of Assumption 3. Note condition vii. permits nonparametric rates for nuisance estimators. Focusing on the order in the sample size and up to polylog factors, the condition allows for  ρπ,Nρµ,N = o(N−1/2), ρπ,Nρθ,N = o(N−1/2), ρπ,N = o(1), ρµ,N = o(1), ρθ,N = o(1).Note the first two restrictions are on products, permitting a trade-off between rates for different nuisances (see also Appendix H).

Remark 3 (Rate Conditions with IPW Initial Estimator). In Appendix C, we prove that if the propensity nuisance estimators used to construct the IPW initial estimators (Definition 4) also have convergence rate  ρπ,N, then the initial estimators’ convergence rates satisfy that  ρθ,N = O (ρπ,N).In this case, we are essentially imposing  ρπ,N = o(N−1/4): condition vii. of Theorem 3 requires ρπ,Nρθ,N = o(N−1/2), so unless  ρθ,Nis somehow even faster than  ρπ,N, we must need both  ρθ,Nand  ρπ,N to be o(N−1/4).

5.2 Quantile and CVaR

Now we consider estimating quantile and (possibly) CVaR based on the semiparametrically efficient estimating equation in Eq. (3). Instantiating Eq. (7) for the simultaneous estimation of quantile and CVaR and rearranging, we obtain the following estimating equation:

image

where η∗1(Z; θ1) =

image

We use  Ft(· | x) and Ft(·) to denote the conditional and unconditional cumulative distribution function of Y (t), respectively: for any  y, Ft(y | x) = P(Y (t) ≤ y | X = x) and Ft(y) = P(Y (t) ≤ y).The following proposition gives the asymptotic behavior of our proposed estimators for the quantile and CVaR of Y (1). This conclusion is proved by verifying all conditions in Theorem 3. Analogous conclusions also hold for Y (0) when all assumptions hold for t = 0 instead of t = 1.

Proposition 1 (LDML for Quantile and CVaR). Fix t = 1 and Let the estimator ˆθbe given by applying Definition 2 to the estimating function in Eq. (18). Suppose Assumptions 5 and 6 hold and there exist positive constants  c′1 ∼ c′5 and C ≥ 1, such that for any  P ∈ PN, the following conditions hold:

i. Conditions i. (with  c1), ii., v. (with  c5, c6) of Assumption 2, condition iii. of Assumption 3, and condition vii. of Theorem 3 for the estimating function in Eq. (18) and the corresponding nuisance estimators.

ii.  Ft(θ1)is twice differentiable with derivatives  ft(θ1), ˙ft(θ1)satisfying 0  < c′1 ≤ ft(θ∗1), ft(θ1) ≤c′2, | ˙ft(θ1)| ≤ c′3 ∀θ1 ∈ Θ1. Moreover,  |Ft(θ∗1) − Ft(θ1)| ≥ c′4 for |θ1 − θ∗1| ≥ c′12c′3 .

iii. At any  θ ∈ B(θ∗; max{4C√dρπ,NδNεπ , ρθ,N}) ∩ Θ, Ft(θ1 | X)is twice differentiable almost surely with first two order derivatives  ft(θ1 | X) and ˙ft(θ1 | X)that satisfy  ft(θ1 | X) ≤ C and| ˙ft(θ1 | X)| ≤ Calmost surely.

image

Remark 4 (Estimating  ft(θ∗1) for Variance Estimation). If we want to conduct inference on the quantile or QTE using our method from Section 4, we need to estimate  ft(θ∗1). We only need to do this consistently, regardless of rate, in order to get the correct asymptotic coverage. One simple approach is to use cross-fitted IPW kernel density estimation at ˆθ1:

image

where  κ(u) is a kernel function such as  κ(u) = (2π)−1/2 exp(−u2/2) and h →0 is a bandwidth. Under Assumption 5,  h ≍ N−1/5 would be the optimal bandwidth. While this together with any consistent estimate ˆπ(k) suffices for asymptotic coverage, the estimate may be unstable. It is therefore recommended to use self-normalization by dividing the above by 1n�Kk=1�i∈Dk I[Ti=1]ˆπ(k)(1|Xi)and to potentially clip propensities.

We first study the behavior of LDML in a simulation study. We then demonstrate its use in estimating the QTE of 401(k) eligibility on net financial assets. In Appendix A, we additionally consider estimating the LQTE of 401(k) participation using eligibility as IV.

6.1 Simulation Study

First, we consider a simulation study to compare the performance of LDML estimates to benchmarks. We consider estimating  θ∗1 as the second tertile of Y (1) from incomplete data. The distri- bution P is as follows. First, we draw 20-dimensional covariates X from the uniform distribution on [0, 1]20. Then, we draw T from Bernoulli(Φ(3(1  − X1 − X3))), where Φ is the standard normal cumulative distribution function, and we draw  Y (1) from N(I[X1+X2 ≤ 1], 2X3). We only observe Y (1) when T = 1.

We consider estimating  θ∗1 using four different methods. First, we consider LDML applied to the efficient estimating equation (Eq. (3)) with  K = 5, K′ = 2, ˆθ(k)1,init estimated using 2-fold cross- fitted IPW with random-forest-estimated propensities, and ˆπ(k)(1 | X), ˆµ(k)(X, 1; ˆθ(k)1,init) similarly estimated by random forests. Second, we consider K = 5-fold cross-fitted IPW with random-forest-estimated propensities. Third, we consider DML with K = 5 and the estimand-dependent nuisance estimated using a discretization approach similar to the suggestion of Belloni et al. (2018): for j = 1, . . . , 99, fix θ1,j to be the j/100 marginal quantile of  Y and fit ˆµ(k)(X, 1; θ1,j) using random forests; then apply DML with the restricted discretized estimand range  {θ1,j : j = 1, . . . , 99}.We refer to this method as DML-D for discretized. Fourth, we consider DML with K = 5 and where the estimand-dependent nuisance is estimated using an approach similar to Meinshausen (2006), Bertsimas and Kallus (2014): namely, fit a random forest regression to the out-of-fold data  {(Xi, Yi) : i /∈ Dk, Ti = 1} to obtain Bregression trees  τj: support(X) → {1, . . . , ℓj}, thenset ˆµ(k)(X, 1; θ1) = �i/∈Dk:Ti=1I[Yi≤θ1]B �Bj=1 I[τj(Xi)=τj(X)]�i′ /∈Dk:Ti′ =1 I[τj(Xi)=τj(Xi′)] for all θ1. We refer to this method as DML-F for forest. For each method, we run it three times with new random fold splits (with the same data) and take the median of the three results to be the estimate.

For each of n = 100, 200, . . . , 25600, we consider 75 replications of drawing a dataset of size n and constructing each of the above four estimates. We plot the mean-squared error of each method and n over the 75 replications in Fig. 2(a). The shaded regions show plus/minus one standard error of this as the sample mean of 75 squared errors. We clearly see that LDML offers significant improvements over the other methods when we use flexible machine learning methods to tackle estimand-dependent nuisances.

In Fig. 2(b), we additionally report the coverage of the true parameter using the standard error estimate proposed in Remark 4 and Appendix E. Namely, for each of the three random runs of each method, we take the sample standard deviation of the estimated influence function evaluated at the final estimand and with the cross-fitted nuisances and divide it by √ntimes an estimate

image

Figure 2: Results for the simulation study. Shaded regions denote plus/minus one standard error for estimated mean-squared error or mean coverage computed over 75 replications.

of  ft(θ∗1) given by cross-fitted IPW kernel density estimation at the estimand. (We do the same for the IPW estimate for the sake of comparison but note IPW’s asymptotic variance may also depend on the propensity-estimation variance, unlike LDML and DML.) We take the median of these standard errors over the three runs and add to it the standard deviation of estimands over the three runs divided by√3. Then we consider the 95% confidence interval given by the estimand plus/minus 1.96 of this estimated standard error. Figure 2(b) shows the sample mean coverage of  θ∗1over the 75 replications, and the shaded region shows plus/minus one standard error of this sample mean. LDML offers good, calibrated coverage (the 100% coverage for some n can be attributed to only observing 75 replications with 95% success probability each). The other methods have poor coverage, which may be attributed to significant bias so that confidence intervals based only on standard errors of the sample average would undercover and even get worse as samples grow and standard errors shrink relative to bias. In particular, the IPW estimate’s convergence directly depends on that of the random-forest-estimated propensities, so convergence may be slower than √nand/or the true standard error may be far larger than that of the cross-fitted sample average. Similarly, using DML to estimate the control-variate term using a discretization or a forest need not converge, so ultimately their convergence and standard errors may be similar to IPW’s, again leading to underconvering.

6.2 Effect of 401(k) Eligibility on Net Financial Assets

Next we consider an empirical case study to demonstrate the estimation of QTE using LDML in practice and with a variety of machine learning nuisance estimators. We use data from Cher- nozhukov and Hansen (2004) to estimate the QTEs of 401(k) retirement plan eligibility on net financial assets. Eligibility for 401(k) (here considered the treatment, T) is not randomly assigned, but is argued in Chernozhukov and Hansen (2004) to be ignorable conditioned on certain covariates: age, income, family size, years of education, marital status, two-earner household status, availability of defined benefit pension plan to household, IRA participation, and home ownership status. Net

Table 1: The QTE of 401(k) eligibility in thousand dollars (and standard error) estimated by LDML using different regression methods, and raw unadjusted differences of quantiles.

image

financial assets (the outcome, Y ) are defined as the sum of IRA and 401(k) balances, bank accounts, and other interest-earning accounts and assets minus non-mortgage debt. While Chernozhukov and Hansen (2004) considered controlling for these in a low-dimensional linear specification, it is not clear whether such is sufficient to account for all confounding. Consequently, Belloni et al. (2017) considered including higher-order terms and interactions, but needed to theoretically construct a continuum of LASSO estimates and may not be able to use generic black-box regression methods. Finally, Chernozhukov et al. (2018a) considered using generic machine learning methods, but only tackled ATE estimation.

In contrast, we will use LDML to estimate and conduct inference on the QTEs of 401(k) eligibility on net assets using a variety of flexible black-box regression methods. First, to understand the effect of different choices in the application of LDML to the problem, we consider estimating the 25%, 50%, and 75% QTE while varying K in {5, 15, 25} and varying the nuisance estimators. We consider estimating both propensity score  η∗2 and conditional cumulative distribution  η∗1 with eachof: boosting (using R package gbm), LASSO (using R package hdm), and a one-hidden-layer neural network (using R package nnet). For LASSO, we use a 275-dimensional expansion of the covariates by considering higher-order terms and interactions. In each instantiation of LDML, we construct folds so to ensure a balanced distribution of treated and untreated units, we let  K′ = (K − 1)/2,we use the IPW initial estimator for ˆθ1,init, we normalize propensity weights to have mean 1 within each treatment group, we use estimates given by solving the grand-average estimating equation as in Definition 2, and for variance estimation we estimate  J∗ using IPW kernel density estimation as in Remark 4. The solution to the LDML-estimated empirical estimating equation must occur at an observed outcome  Yiand that we can find the solution using binary search after sorting the data along outcomes. We re-randomize the fold construction and repeat each instantiation 100 times. We then remove the outlying 2.5% from each end and report ˆθmean, �Σmean as in Appendix E. The resulting estimates and standard errors are shown in Table 1. The estimates appear overall roughly stable across methods and K.

Next, we consider estimating a range of QTEs. We focus on nuisance estimation using LASSO and fix K = 15. We then estimate the 10%, 11%, . . . , 89%, and 90% quantiles and QTEs. We plot the resulting LDML estimates with 90% confidence intervals in Fig. 3 and compare these to the raw unadjusted marginal quantiles within each treatment group.

image

Figure 3: LDML estimates of a range of quantiles and QTEs with confidence 90% intervals and comparison to raw unadjusted marginal quantiles by treatment group.

Semiparametric Estimation, Neyman orthogonality, and Debiased Machine Learning. Our work is closely related to the classical semiparametric estimation literature on constructing √N-consistent and asymptotically normal estimators for low dimensional target parameters in the presence of infinitely dimensional nuisances, typically estimated by conventional nonparametric estimators such as kernel or series estimators (e.g., Newey 1990, 1994, Newey et al. 1998, Ibragimov and Hasminskii 1981, Levit 1976, Bickel et al. 1998, Bickel 1982, Robinson 1988, var der Vaart 1991, Andrews 1994, Robins and Rotnitzky 1995, Linton 1996, Chen et al. 2003, Ai and Chen 2003, van der Laan and Rose 2011, Ai and Chen 2012). Our work builds on the Neyman orthogonality condition introduced by Neyman (1959)). This condition plays a critical role in many works that go beyond such nonparametric estimators, such as targeted learning (e.g., van der Laan and Rose 2011, Van der Laan and Rose 2018), inference for coefficients in high dimensional linear models (e.g., Belloni et al. 2016, 2014c, Zhang and Zhang 2014, Van de Geer et al. 2014, Javanmard and Montanari 2014, Chernozhukov et al. 2015, Ning et al. 2017), and semiparametric estimation with nuisances that involve high dimensional covariates (e.g., Belloni et al. 2017, Smucler et al. 2019, Chernozhukov et al. 2018b, Farrell 2015, Belloni et al. 2014a,b, Bradic et al. 2019, Bravo et al. 2020).

Chernozhukov et al. (2018a) further advocate the use of cross-fitting in addition to orthogonal estimating equations, so that the traditional Donsker assumption on nuisance estimators can be relaxed, and a broad array of black-box machine learning algorithms can be used instead. They refer to this generic approach as DML, which provides a principled framework to estimate low-dimensional target parameters with strong asymptotic guarantees when leveraging modern machine learning methods in nuisance estimation. Similar forms of sample splitting and cross-fitting have also appeared in Klaassen (1987), Zheng and van der Laan (2011), Fan et al. (2012), Bickel (1982), Robins et al. (2013), Schick (1986), Robins et al. (2008, 2017). Since the DML framework was introduced, numerous works have applied it in many different problems, such as heterogeneous treatment effect estimation (Kennedy 2020, Nie and Wager 2017, Curth et al. 2020, Semenova and Chernozhukov 2020, Oprescu et al. 2019, Fan et al. 2020), causal effects of continuous treatments (Colangelo and Lee 2020, Oprescu et al. 2019), instrumental variable estimation (Singh and Sun 2019, Syrgkanis et al. 2019), partial identification (Bonvini and Kennedy 2019, Kallus et al. 2019, Semenova 2017, Yadlowsky et al. 2018), difference-in-difference models (Lu et al. 2019, Chang 2020, Zimmert 2018), off-policy evaluation (Kallus and Uehara 2020, Demirer et al. 2019, Zhou et al. 2018, Athey and Wager 2017), generalized method of moments (Chernozhukov et al. 2016, Belloni et al. 2018), improved machine learning nuisance estimation (Farrell et al. 2018, Cui and Tchetgen 2019), statistical learning with nuisances (Foster and Syrgkanis 2019), causal inference with surrogate observations (Kallus and Mao 2020), linear functional estimation (Chernozhukov et al. 2018d,c, Bradic et al. 2019), etc. Our work complements this line of research by proposing a simple but effective way to handle estimand-dependent nuisances. This type of nuisances frequently appears in efficient estimation of complex causal effects such as QTEs, and applying DML directly would require estimating a continuum of nuisances, which is challenging in practice.

Efficient estimation of (L)QTE. Firpo (2007) first considered efficient estimation of QTE and proposed an IPW estimator based on propensity scores estimated by a logistic sieve estimator. Under strong smoothness conditions, this IPW estimator is√N-consistent and achieves the semi-parametric efficiency bound. Fr¨olich and Melly (2013) consider a weighted estimator for LQTE with weights estimated by local linear regressions using high-order kernels and show that their estimator is also semiparametrically efficient. Although these purely weighted methods bypass the estimation of nuisances that depend on the estimand, their favorable behavior is restricted to certain nonparametric weight estimators and strong smoothness requirements. D´ıaz (2017) proposed a Targeted Minimum Loss Estimator (TMLE) estimator for efficient QTE estimation. Built on the efficient influence function with nuisances that depends on the quantile itself, this estimator requires estimating a whole conditional cumulative distribution function, which as discussed may be very challenging in practice using flexible machine learning methods. Belloni et al. (2017) similarly consider efficient estimation of LQTE with high-dimensional covariates by using a Neyman-orthogonal estimating equation and discretizing a continuum of LASSO estimators for the estimand-dependent nuisance. In contrast, our proposed estimator can leverage a wide variety of flexible machine learning methods for the standard regression task to estimate nuisances, since we require estimating conditional cumulative distribution function only at a single point, which amounts to a binary regression problem.

Estimand-dependent nuisances. Besides (local) quantiles and CVaR, many efficient estimation problems involve nuisances that depends on the estimand (e.g., Tsiatis 2006, Chen et al. 2005). Previous approaches estimate the whole continuum of the estimand-dependent nuisances either by positing simple parametric model for conditional distributions (Tsiatis 2006, Chap 10), using sieve estimators (Chen et al. 2005), or discretizing a hypothetical continuum of regression estimators (Belloni et al. 2017). In contrast, our proposed method obviates the need to estimate infinitely many nuisances by fitting nuisances only at a preliminary estimate of the parameter of interest. This idea was briefly mentioned by Robins et al. (1994), focusing on parametric models for nuisance estimation. Our paper rigorously develops this approach and admits flexible machine learning methods for estimating nuisances that depend on the estimand.

In many causal inference and missing data settings, the efficient influence function involves nuisances that depend on the estimand of interest. A key example provided was that of QTE under ignorable treatment assignment and LQTE estimation using an IV, where in both cases the efficient influence function depends on the conditional cumulative distribution function evaluated at the quantile of interest. This structure, common to many other important problems, makes the application of existing debiased machine learning methods difficult in practice. In quantile estimation, it requires we learn the whole conditional cumulative distribution function. To avoid this difficulty, we proposed the LDML approach, which localized the nuisance estimation step to an initial rough guess of the estimand. This was motivated by the fact that in many applications, the oracle estimating equation is asymptotically equivalent to one where the nuisance is evaluated at the true parameter value, which our localization approach targets. Assuming only standard identification conditions, Neyman orthogonality, and lax rate conditions on our nuisance estimates, we proved the LDML enjoys the same favorable asymptotics as the oracle estimator that solves the estimating equation with the true nuisance functions. This newly enables the practical efficient estimation of important quantities such as QTEs using machine learning.

An interesting future direction is to consider a uniform estimation way over the quantile  γ thoughwe consider the setting in which  τis fixed to emphasize our main point. This might be possible by combining recent technique in quantile regression (Ota et al. 2019, Bradic and Kolar 2017); however, the rigorous result is left as future research.

Chunrong Ai and Xiaohong Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003.

Chunrong Ai and Xiaohong Chen. Semiparametric efficiency bound for models of sequential moment restric- tions containing unknown functions. Journal of Econometrics, 170:442–457, 2012.

Donald W. K. Andrews. Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica, 62(1):43–72, 1994.

Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.

Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives, 28(2):29–50, 2014a.

Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014b.

Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Pivotal estimation via square-root lasso in non- parametric regression. The Annals of Statistics, 42(2):757–788, 2014c.

Alexandre Belloni, Victor Chernozhukov, and Ying Wei. Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics, 34(4):606–619, 2016.

Alexandre Belloni, Victor Chernozhukov, Ivan Fern´andez-Val, and Christian Hansen. Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233–298, 2017.

Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov, Christian Hansen, and Kengo Kato. High- dimensional econometrics and regularized gmm. arXiv preprint arXiv: 1806.01888, 2018.

Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. arXiv preprint arXiv:1402.5481, 2014.

P. J. Bickel. On adaptive estimation. Annals of Statistics, 10(3):647–671, 1982.

P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Springer, 1998.

Matteo Bonvini and Edward H Kennedy. Sensitivity analysis via the proportion of unmeasured confounding. arXiv preprint arXiv:1912.02793, 2019.

Jelena Bradic and Mladen Kolar. Uniform inference for high-dimensional quantile regression: linear func- tionals and regression rank scores. arXiv preprint arXiv:1702.06209, 2017.

Jelena Bradic, Victor Chernozhukov, Whitney K. Newey, and Yinchu Zhu. Minimax semiparametric learning with approximate sparsity. arXiv preprint arXiv:1912.12213, 2019.

Jelena Bradic, Stefan Wager, and Yinchu Zhu. Sparsity double robust inference of average treatment effects. arXiv preprint arXiv:1905.00744, 2019.

Francesco Bravo, Juan Carlos Escanciano, and Ingrid Van Keilegom. Two-step semiparametric empirical likelihood inference. The Annals of Statistics, 48(1):1–26, 2020.

Neng-Chieh Chang. Double/debiased machine learning for difference-in-differences models. The Econometrics Journal, 23(2):177–191, 2020.

Xiaohong Chen, Oliver Linton, and Ingrid Van Keilegom. Estimation of semiparametric models when the criterion function is not smooth. LSE Research Online Documents on Economics, 2003.

Xiaohong Chen, Han Hong, and Elie Tamer. Measurement error models with auxiliary data. The Review of Economic Studies, 72(2):343–366, 2005.

Victor Chernozhukov and Christian Hansen. The effects of 401(k) participation on the wealth distribution: an instrumental quantile regression analysis. Review of Economics and statistics, 86(3):735–751, 2004.

Victor Chernozhukov, Christian Hansen, and Martin Spindler. Valid post-selection and post-regularization inference: An elementary, general approach. 2015.

Victor Chernozhukov, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, and James M Robins. Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033, 2016.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21:C1–C68, 2018a.

Victor Chernozhukov, Denis Nekipelov, Vira Semenova, and Vasilis Syrgkanis. Plug-in regularized estimation of high-dimensional parameters in nonlinear semiparametric models. arXiv preprint arXiv:1806.04823, 2018b.

Victor Chernozhukov, Whitney Newey, James Robins, and Rahul Singh. Double/de-biased machine learning of global and local parameters using regularized riesz representers. arXiv preprint arXiv:1802.08667, 2018c.

Victor Chernozhukov, Whitney K Newey, and Rahul Singh. Learning l2 continuous regression functionals via regularized riesz representers. arXiv preprint arXiv:1809.05224, 8, 2018d.

Kyle Colangelo and Ying-Ying Lee. Double debiased machine learning nonparametric inference with contin- uous treatments. arXiv preprint arXiv:2004.03036, 2020.

Yifan Cui and Eric Tchetgen Tchetgen. Bias-aware model selection for machine learning of doubly robust functionals. arXiv preprint arXiv:1911.02029, 2019.

Alicia Curth, Ahmed M Alaa, and Mihaela van der Schaar. Semiparametric estimation and inference on struc- tural target functions using machine learning and influence functions. arXiv preprint arXiv:2008.06461, 2020.

Mert Demirer, Vasilis Syrgkanis, Greg Lewis, and Victor Chernozhukov. Semi-parametric efficient policy learning with continuous actions. arXiv preprint arXiv:1905.10116, 2019.

Iv´an D´ıaz. Efficient estimation of quantiles in missing data models. Journal of Statistical Planning and Inference, 190:39–51, 2017.

Jianqing Fan, Shaojun Guo, and Ning Hao. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74 (1):37–65, 2012.

Qingliang Fan, Yu-Chin Hsu, Robert P Lieli, and Yichong Zhang. Estimation of conditional average treat- ment effects with high-dimensional data. Journal of Business & Economic Statistics, (just-accepted): 1–39, 2020.

Max H Farrell. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23, 2015.

Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. arXiv preprint arXiv:1809.09953, 2018.

Sergio Firpo. Efficient semiparametric estimation of quantile treatment effects. Econometrica, 75:259–276, 2007.

Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.

Markus Fr¨olich and Blaise St´ephane Melly. Unconditional quantile treatment effects under endogeneity. Journal of Business & Economic Statistics, 31(3):346–357, 2013.

I. A. Ibragimov and R. Z. Hasminskii. Statistical estimation : asymptotic theory. 1981.

Guido W Imbens and Joshua D Angrist. Identification and estimation of local average treatment effects. Econometrica, pages 467–475, 1994.

Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.

Nathan Kallus and Xiaojie Mao. On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv preprint arXiv:2003.12408, 2020.

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.

Nathan Kallus, Xiaojie Mao, and Angela Zhou. Assessing algorithmic fairness with unobserved protected class using data combination. arXiv preprint arXiv:1906.00285, 2019.

Maximilian Kasy. Uniformity and the delta method. Journal of Econometric Methods, 8(1):1–19, 2019.

Edward H Kennedy. Optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497, 2020.

Chris AJ Klaassen. Consistent estimation of the influence function of locally asymptotically linear estimators. The Annals of Statistics, pages 1548–1562, 1987.

B. Ya. Levit. On the efficiency of a class of non-parametric estimates. Theory of Probability and Its Applications, 20(4):723–740, 1976.

Oliver B. Linton. Edgeworth approximation for minpin estimators in semiparametric regression models. Econometric Theory, 12(1):30–60, 1996.

Chen Lu, Xinkun Nie, and Stefan Wager. Robust nonparametric difference-in-differences estimation. arXiv preprint arXiv:1905.11622, 2019.

Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999, 2006.

Whitney K. Newey. Semiparametric efficiency bounds. Journal of Applied Econometrics, 5(2):99–135, 1990.

Whitney K. Newey. The asymptotic variance of semiparametric estimators. Econometrica, 62(6):1349–1382, 1994.

Whitney K Newey and James L Powell. Asymmetric least squares estimation and testing. Econometrica: Journal of the Econometric Society, pages 819–847, 1987.

Whitney K. Newey, Fushing Hsieh, and James Robins. Undersmoothing and bias corrected functional estimation. 1998.

Jerzy Neyman. Optimal asymptotic tests of composite statistical hypotheses. Probability and Statistics, pages 416–44, 1959.

Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.

Yang Ning, Han Liu, et al. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158–195, 2017.

Miruna Oprescu, Vasilis Syrgkanis, and Zhiwei Steven Wu. Orthogonal random forest for causal inference. In International Conference on Machine Learning, pages 4932–4941. PMLR, 2019.

Hirofumi Ota, Kengo Kato, and Satoshi Hara. Quantile regression approach to conditional mode estimation. Electronic Journal of Statistics, 13(2):3120–3160, 2019.

James Robins, Lingling Li, Eric Tchetgen, Aad van der Vaart, et al. Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and statistics: essays in honor of David A. Freedman, pages 335–421. Institute of Mathematical Statistics, 2008.

James M. Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.

James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994.

James M Robins, Peng Zhang, Rajeev Ayyagari, Roger Logan, Eric Tchetgen Tchetgen, Lingling Li, Thomas Lumley, Aad van der Vaart, HEI Health Review Committee, et al. New statistical approaches to semiparametric regression with application to air pollution research. Research report (Health Effects Institute), (175):3, 2013.

James M Robins, Lingling Li, Rajarshi Mukherjee, Eric Tchetgen Tchetgen, Aad van der Vaart, et al. Minimax estimation of a functional on a structured high-dimensional model. The Annals of Statistics, 45(5):1951–1987, 2017.

Peter M Robinson. Root-n-consistent semiparametric regression. Econometrica, 56(4):931–954, 1988.

R Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of banking & finance, 26(7):1443–1471, 2002.

Anton Schick. On asymptotically efficient estimation in semiparametric models. The Annals of Statistics, pages 1139–1151, 1986.

Vira Semenova. Machine learning for set-identified linear models. arXiv preprint arXiv:1712.10024, 2017.

Vira Semenova and Victor Chernozhukov. Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal, 2020.

Rahul Singh and Liyang Sun. De-biased machine learning for compliers. arXiv preprint arXiv:1909.05244, 2019.

Ezequiel Smucler, Andrea Rotnitzky, and James M Robins. A unifying approach for doubly-robust  ℓ1regularized estimation of causal contrasts. arXiv preprint arXiv:1904.03737, 2019.

Vasilis Syrgkanis, Victor Lei, Miruna Oprescu, Maggie Hei, Keith Battocchi, and Greg Lewis. Machine learn- ing estimation of heterogeneous treatment effects with instruments. In Advances in Neural Information Processing Systems, pages 15193–15202, 2019.

Anastasios. Tsiatis. Semiparametric Theory and Missing Data. Springer, New York, 2006.

Sara Van de Geer, Peter B¨uhlmann, Ya’acov Ritov, Ruben Dezeure, et al. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.

Mark J. van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. 2011.

Mark J Van der Laan and Sherri Rose. Targeted learning in data science. Springer, 2018.

A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998. doi: 10.1017/CBO9780511802256.

Aad var der Vaart. On differentiable functionals. Annals of Statistics, 19(1):178–204, 1991.

Steve Yadlowsky, Hongseok Namkoong, Sanjay Basu, John Duchi, and Lu Tian. Bounds on the conditional and average treatment effect with unobserved confounding factors. arXiv preprint arXiv:1808.09521, 2018.

Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional parameters in high dimen- sional linear models. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pages 217–242, 2014.

Wenjing Zheng and Mark J van der Laan. Cross-validated targeted minimum-loss-based estimation. In Targeted Learning, pages 459–474. Springer, 2011.

Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778, 2018.

Michael Zimmert. Efficient difference-in-differences estimation with high-dimensional common trend con- founding. arXiv preprint arXiv:1809.01643, 2018.

In Section 1.2, we mention that without the ignorability assumption, we can rely on an instrumental variable to identify local parameters, namely, solutions  θ∗ = (θ∗1, θ∗2) to the following local estimating equation:

image

We assume standard instrumental variable identification conditions: for potential treatments T(w) and potential outcomes Y (t, w), we have exclusion  Y (t) := Y (t, w) = Y (t, 1 − w), exogeneity (Y (t), T(w)) ⊥ W | X, overlap P(W = 1 | X) ∈ (0,1), relevance P(T(1) = 1) > P(T(0) = 1), and monotonicity  T(1) ≥ T(0). Following Belloni et al. (2017), a Neyman orthogonal estimating equation for  θ∗ is given by

image

where

image

with nuisance functions

image

Here the second estimating equation  E [ψ2(Z; θaux2 , η2(Z))] = 0 identifies the compliance probability,denoted by the following auxiliary parameter  θaux ∗2 :

image

By redefining ˜θ1 = θ1, ˜θ2 = (θ2, θaux2 ), and ˜θ =�˜θ1, ˜θ2�, the estimating equation becomes

image

which apparently fits into our general framework in Eq. (1). Therefore, we can directly apply our LDML algorithm in Section 2.2 to estimate the local parameters  θ∗ = (θ∗1, θ∗2). We can also use the theory in Sections 3 and 4 to analyze the asymptotic distribution of the resulting estimators and estimate their asymptotic variances.

A.1 Estimating Local Quantiles

In particular, we take the local quantile estimation as an example, namely, the solution  θ∗1 to thelocal estimating equation in Eq. (19) with

image

Its orthogonal estimating equation involves the following nuisance functions:

image

For better readability, we denote the event of being a complier, i.e., T(1) > T(0), as C, the nuisance functions as ˜π∗(X) = P(W = 1 | X), ν∗w (X) = P (T = 1 | X, W = w), and ˜µ∗w(X; θ1) =P (T = 1, Y ≤ θ1 | X, W = w) for w ∈ {0, 1}. We fit estimators for the nuisance functions based on the sample-splitting scheme given in Definition 1, which we denote as ˆ˜π(k)(X), ˆν(k)w (X) andˆ˜µ(k)(X; ˆθ1,init) = (ˆ˜µ(k)1 (X; ˆθ1,init), ˆ˜µ(k)0 (X; ˆθ1,init)) respectively for k = 1, . . . , K. Finally, we obtain the estimator ˆθ =�ˆθ1, ˆθaux2 �by searching approximate solutions over Θ = Θ1 × Θ2 ⊆ R × R to theempirical estimating equations in Definition 2 or Definition 5, specialized to Eqs. (20) and (23).

We next assume a strong form of the overlap and relevance assumptions and specify the convergence rates of the initial estimator and nuisance estimators. We again consider a generic treatment level t ∈ {0, 1}in these two assumptions.

Assumption 7 (Strong Overlap and Relevance Assumptions). Assume that there exists a positive constant  ϵ > 0such that for any  P ∈ PN, ϵ ≤ ˜π∗(X) ≤ 1 − ϵholds almost surely, and  θaux ∗2 ≥ ϵ.

Assumption 8 (Nuisance Estimation Rates). Assume that for any P ∈ PN: with probability atleast 1 − ∆N, for w = 0, 1,

image

and ϵ ≤ ˆ˜π(k)(X) ≤ 1 − ϵ, 0 ≤ ˆ˜µ(k)w �X; ˆθ(k)1,init)�≤ 1, 0 ≤ ˆν(k)w (X) ≤ 1 almost surely.

In the following theorem, we derive the asymptotic distribution of the local quantile estimator, which is proved by verifying all assumptions in Theorem 1.

Proposition 2 (LDML for Local Quantile). Fix t = 1 and let Θ = (Θ1, Θ2) ⊆ R2 be a compact set where  θaux2 ≥ ϵ for any θaux2 ∈ Θ2 and ϵgiven in Assumption 7. Let (ˆθ1, ˆθaux2 )be the LDML estimator given in either Definition 2 or Definition 5, specialized to Eqs. (20) and (23). Suppose that there exist constants  c′, Csuch that the following conditions hold for any instance  P ∈ PN:

i. Conditions i. (with  c1), ii., v. (with c5, c6) and condition iii. of Assumption 3 for the estimating equation in Eqs. (20) and (23).

ii. For any  θ1 ∈ Θ1, the distribution function of Y (t) for compliers, denoted as  Ft(θ1 | C), istwice continuously differentible. Its first two order derivatives  ft(θ1 | C) and ˙ft(θ1 | C) satisfythat  ft(θ1 | C) ≤ c′1,��� ˙ft(θ1 | C)��� ≤ c′2 for any θ1 ∈ Θ1, and ft(θ∗1 | C) ≥ c′3 > 0.

image

iv. For any  θ1 ∈ B(θ∗1; max{ 4˜ρπ,Nϵ2(1−ϵ)δN , ρθ,N}) ∩ Θ and w ∈ {0, 1}, the conditional distribution of Y (t) given X, T (w) = 1, denoted as  Ft,w(θ1 | X), is twice differentiable almost surely with

first two order derivatives  ft,w(θ1 | X) and ˙ft,w(θ1 | X)that satisfy  ft,w(θ1 | X) ≤ C and��� ˙ft,w(θ1 | X)��� ≤ Calmost surely.

image

Then (ˆθ1, ˆθaux2 )satisfies the conclusion of Theorem 1 for  ψ(Z; θ∗, η∗1(Z; θ∗1), η∗2(Z))given in Eq. (20) and

image

In particular, the local quantile estimator ˆθ1is asymptotically linear with the following influence function:

image

where  ψ1(Zi; θ∗, η∗1(Zi; θ∗1), η∗2(Zi)) and ψ2(Zi; θaux ∗2 , η∗2(Zi))are given in Eq. (20). Analogous con- clusion for local quantiles of Y (0) holds when all assumptions above hold for t = 0.

A.2 Effect of 401(k) Participation on Net Financial Assets

Next, we estimate the effect of 401(k) participation on net assets. Participation in a 401(k) plan (here considered the treatment, T) is not randomly assigned: individuals with a preference for saving may save more in non-retirement accounts than others whether they were to participate in retirement savings or not. There may be many other confounding factors, such as the possibility of higher financial acumen of savers leading to higher net worth otherwise. It is unlikely that we can control for all these factors using observable covariates. Instead, we rely on instrumenting on eligibility since, as argued in Section 6.2, eligibility is ignorable given covariates. Additionally, one cannot participate if one is ineligible, ensuring monotonicity, and some eligible individuals do

Table 2: The LQTE of 401(k) participation in thousand dollars (and standard error) estimated by LDML using different regression methods, and raw unadjusted differences of marginal quantiles by eligibility.

image

image

Figure 4: LDML estimates of a range of local quantiles and LQTEs with confidence 90% intervals and comparison to raw unadjusted marginal quantiles by treatment group.

participate, ensuring relevance. Assuming that eligibility cannot affect net assets except through its effect on participation, we have that eligibility for a 401(k) (here considered as W) is valid IV. We can therefore use it to estimate local quantiles by and LQTEs of 401(k) Participation on the population of individuals that would participate if eligible.

We use LDML applied to the Neyman orthogonal estimating equation Eqs. (20) and (23). Again, we consider the impact of different choices in the application of LDML. We repeat the same specifi-cation as above, using each possible nuisance estimator to fit the conditional probabilities Eqs. (21) and (24). We display the results for the 25%, 50%, and 75% quantiles while varying K and the nuisance estimators in Table 2. The qualitative results regarding the stability of LDML across methods and K remain the same. Then, focusing as before on nuisance estimation using LASSO and on K = 15, we also estimate a range of local quantiles and QTEs, which we plot along with 90% confidence intervals in Fig. 3. Again, we compare to the raw unadjusted marginal quantiles within each treatment group.

We can also apply our method and analysis to estimating the  γ-expectile θ1 of Y(1), as defined in Eq. (6). Instantiating Eq. (7) for expectiles and rearranging, we get the following efficient estimating function from incomplete data:

image

The next result gives the asymptotic behavior of LDML applied to these equations.

Proposition 3. Fix t = 1 and let the estimator ˆθ1be given by applying either Definition 2 or Definition 5 to the estimating function in Eq. (25). Suppose Assumptions 5 and 6 hold and there exist positive constants  C, c′1, c′2, such that for any  P ∈ PN, the following conditions hold:

i. Conditions i. (with  c1), ii., v. (with  c5, c6) of Assumption 2, condition iii. of Assumption 3, and condition vii. of Theorem 3 for the estimating function in Eq. (25) and the corresponding nuisance estimators.

ii.  Ft(θ1)is continuous at  θ∗1, and | − (1 − 2γ)Ft(θ∗1) − γ| ≥ c′1 > 0. Moreover, for any  θ ∈ Θsuch that  ∥θ − θ∗∥ ≥ c′12 , 2 |P [U(Y (t); θ1)]| ≥ c′2 for U(Y (t); θ1)given in Eq. (6).

iii. At any  θ1 ∈ B(θ∗; max{4C√dρπ,NδNεπ , ρθ,N}) ∩ Θ1 , Ft(θ1 | X)is almost surely differentiable with first-order derivative  ft(θ1 | X), and second-order derivative ˙ft(θ1 | X)that satisfies ft(θ1 | X) ≤ C and��� ˙ft(θ1 | X)��� ≤ Calmost surely;

image

Then ˆθ1satisfies the conclusion of Theorem 1 for  ψ(Z; θ∗1, η∗1(Z; θ∗1), η∗2(Z))given in Eq. (25) and J∗ = −γ − (1 − 2γ)Ft(θ∗1). Analogous conclusion for expectile of Y (0) holds when all assumptions above hold for t = 0.

When constructing confidence intervals, we only need to estimate  Ft(θ∗1) to estimate  J∗. This canbe easily estimated by the inverse propensity reweighted estimator

image

Alternatively, it can be estimated by an imputation estimator based on ˆµ(k) or a LDML estimator that uses both ˆπ(k) and ˆµ(k) (see Remark 4).

In this part, we show that the IPW initial estimator given in Definition 4 can satisfy the conditions on ˆθ1,initin Assumption 6.

Proposition 4 (IPW Initial Estimator Rate). Fix t = 1 and let the initial estimator ˆθ(k)1,init beconstructed according to Definition 4 for k = 1, . . . , K. Assume the following (for t = 1):

i. For each  k ∈ {1, . . . , K} and l ∈ Hk,1, ˆπ(k,l) satisfies the same conditions as for ˆπ(k) inAssumption 6.

image

iii. There exists a nuisance realization set ΠNthat contains the true propensity score  π∗ and alsothe propensity score estimators ˆπ(k,l) for k = 1, . . . , K and l ∈ Hk,1with at least probability 1  − ∆N. Moreover, any  π ∈ ΠNsatisfies that  π(t | X) ≥ ϵπ.

iv. For each  π ∈ ΠN, the function class  Gπ = {(X, T, Y ) �→ I[T=t]π(t|X)Uj(Y ; θ1) + Vj(θ2) : j =1, . . . , d, θ ∈ Θ}is suitably measurable and its uniform covering entropy satisfies the following condition: for positive constants  a′, v′ and q′ > 2, supQ log N(ϵ∥Gπ∥Q,2, Gπ, ∥ · ∥Q,2) ≤v′ log(a′ϵ) ∀ϵ ∈ (0, 1], where Gπis a measurable envelope for  Gπ.There exists a positive constant  c8such that for any  P ∈ PN, ∥Gπ∥P,q′ ≤ c8.

image

Then there exists a constant c that only depends on pre-specified constants in the conditions above such that with probability 1  − c (log N)−1, ρθ,N ≤ 2c−13 �C√dϵ−1π + 1�ρπ,N.

In Remark 3, we discuss the corresponding rate conditions on other nuisance estimators when using the IPW initial estimator, based on the conclusion in Proposition 4.

In Definition 2, we construct an LDML estimator by first averaging estimates of the equation in Eq. (11) over all folds and then solving the grand-average equation approximately. Below we provide an alternative LDML estimator that first solves the estimate of Eq. (11) from each fold separately and then averages these solutions.

Definition 5 (LDML2). For k = 1, . . . , K, construct ˆθ(k) by (approximately) solving

image

In fact, we allow for an approximate least-squares solution, which is useful if the empirical estimating equation has no exact solution. Namely, we let ˆθk be any satisfying

image

Then, we let the final estimator be

image

We can easily follow previous proofs for the LDML estimator in Definition 2 to show that Theorems 1 to 3 and Proposition 1 also apply to ˆθin Definition 5, provided that  ϵNin Eq. (27) is o�N−1/2�(i.e., condition iii. in Assumption 3). For example, we demonstrate this at the end of the proof for Theorem 1. Thus the two LDML estimators in Definition 2 and Definition 5 are asymptotically equivalent.

The proposed LDML estimator ˆθin Definition 2 or Definition 5 relies on nuisance estimates based on random sample splitting (Definition 1). Although the uncertainty due to sample splitting does not affect the asymptotic theory, it may influence the finite-sample performance of the LDML estimator.

To make the results more robust to sample splitting, we may consider aggregating the estimates over different random splitting realizations. In particular, it is possible to use many other different ways of splitting data. For example, in both Definitions 2 and 5 we may average more than just K solutions or equations. For each k, we can permute over all�K−1K′�splits of {1, . . . , K}\{k} into K′and  K − 1 − K′ folds used for fitting ˆθ(k)1,init and ˆη(k)1 (·; ˆθ(k)1,init), ˆη(k)2 . Or, we could even permute over all �K−2K′=1�K−1K′�ways to split {1, . . . , K}\{k} into two. Or, we can even repeat the initial random splitting into K folds many times over and average the resulting estimates from either Definition 5 or 2, or take their median to avoid outliers, or solve the grand-mean of estimating equations. All of these procedures can provide improved finite-sample performance in practice as they can only reduce variance without affecting bias, and we do recommend these, but they have no effect on the leading asymptotic behavior, which remains the same whether you use one or more splits of the data into folds and/or one or more splits of {1, . . . , K} \ {k} into two.

With estimates from multiple random splitting realizations, we may also improve variance estimation and to account for the variance due to random splitting. In particular, letting ˆθs, �Σsbe the parameter and variance estimates for each run of LDML for s = 1, . . . , S, we can let ˆθmean = 1S�Ss=1 ˆθs and �Σmean = 1S�Ss=1(�Σs + 1S (ˆθs − ˆθmean)(ˆθs − ˆθmean)⊤) be the final parameter and variance estimates. Like ˆθmean, the first term in �Σmean reduces the variance in the estimate �Σsitself. The second term in  �Σmean accounts for the variance of ˆθmean due to random splitting. Notice that the second term vanishes as  S → ∞; indeed then ˆθmean has no variance due to random splitting as it is fully averaged over. Because ˆθsare each consistent, the second term also vanishes as N → ∞. Removing the 1S factor in the second term we can instead get an estimate of the variance of each single ˆθs, rather than of ˆθmean, accounting for random splitting. This procedure extends a similar proposal by Chernozhukov et al. (2018a) for inference in linear estimating equations.

The key condition that motivates our LDML approach is the invariant Jacobian condition in Assumption 1. Below Assumption 1, we directly show that this condition is satisfied for efficient estimating equations in incomlete data settings. In the following proposation, we provide a more general sufficient condition for the invariant Jacobian condition.

Proposition 5 (Sufficient Conditions for Invariant Jacobian). Assume that the map (θ, η1(·; θ′1)) �→P [ψ(Z; θ, η1(Z; θ′1), η∗2(Z))]is Fr´echet differentiable at (θ∗, η∗1(·, θ∗1)). Namely, assume that there exists a bounded linear operator  Dη∗1, such that for any (θ, η′1(·, θ′1))within a small open neighborhood N around (θ∗, η∗1(·, θ∗1)),

image

Assume further that there exists C > 0 such that for any (θ, η′1(·, θ′1)) ∈ N

image

Then Assumption 1 is satisfied.

Here the condition in Eq. (29) is an orthogonality condition using the Fr´echet derivative, which is stronger than the Gˆateaux differentiability required in Neyman orthogonality (see Assumption 2 condition vii.). In Eq. (12), we already show that  P [ψ(Z; θ, η1(Z; θ′1), η∗2(Z))] for the efficient estimating function in the incomplete data setting does not depend on  η1at all. Thus, its Fr´echet derivative with respect to  η1trivially exists and is always 0, and therefore our Assumption 1 will be satisfied per Proposition 5.

Our work is closely related to the classical semiparametric estimation literature on constructing √N-consistent and asymptotically normal estimators for low dimensional target parameters in the presence of infinitely dimensional nuisances, typically estimated by conventional nonparametric estimators such as kernel and series estimators (e.g., Newey 1990, 1994, Newey et al. 1998, Ibragimov and Hasminskii 1981, Levit 1976, Bickel et al. 1998, Bickel 1982, Robinson 1988, var der Vaart 1991, Andrews 1994, Robins and Rotnitzky 1995, Linton 1996, Chen et al. 2003, van der Laan and Rose 2011, Ai and Chen 2012). Our work builds on the Neyman orthogonality condition introduced by Neyman (1959) (see Eq. (29) in Proposition 5 below). This condition plays a critical role in many works that go beyond such nonparametric estimators, such as targeted learning (e.g., van der Laan and Rose 2011, Van der Laan and Rose 2018), inference for coefficients in high dimensional linear models (e.g., Belloni et al. 2016, 2014c, Zhang and Zhang 2014, Van de Geer et al. 2014, Javanmard and Montanari 2014, Chernozhukov et al. 2015, Ning et al. 2017), and semiparametric estimation with nuisances that involve high dimensional covariates (e.g., Belloni et al. 2017, Smucler et al. 2019, Chernozhukov et al. 2018b, Farrell 2015, Belloni et al. 2014a,b, Bradic et al. 2019).

Chernozhukov et al. (2018a) further advocate the use of cross-fitting in addition to orthogonal estimating equations, so that the traditional Donsker assumption on nuisance estimators can be relaxed, and a broad array of black-box machine learning algorithms can be used instead. They refer to this generic approach as DML, which provides a principled framework to estimate low-dimensional target parameters with strong asymptotic guarantees when leveraging modern machine learning methods in nuisance estimation. Other forms of sample splitting and cross-fitting have also appeared in Klaassen (1987), Zheng and van der Laan (2011), Fan et al. (2012), Bickel (1982), Robins et al. (2013), Schick (1986), van der Vaart (1998), Robins et al. (2008, 2017). Since the DML framework was introduced, numerous works have applied it in many different problems, such as heterogeneous treatment effect estimation (Kennedy 2020, Nie and Wager 2017, Curth et al. 2020, Semenova and Chernozhukov 2020, Oprescu et al. 2019, Fan et al. 2020), causal effects of continuous treatments (Colangelo and Lee 2020, Oprescu et al. 2019), instrumental variable estimation (Singh and Sun 2019, Syrgkanis et al. 2019), partial identification (Bonvini and Kennedy 2019, Kallus et al. 2019, Semenova 2017, Yadlowsky et al. 2018), difference-in-difference models (Lu et al. 2019, Chang 2020, Zimmert 2018), off-policy evaluation (Kallus and Uehara 2020, Demirer et al. 2019, Zhou et al. 2018, Athey and Wager 2017), generalized method of moments (Chernozhukov et al. 2016, Belloni et al. 2018), improved machine learning nuisance estimation (Farrell et al. 2018, Cui and Tchetgen 2019), statistical learning with nuisances (Foster and Syrgkanis 2019), causal inference with surrogate observations (Kallus and Mao 2020), linear functional estimation (Chernozhukov et al. 2018d,c, Bradic et al. 2019), etc. Our work complements this line of research by proposing a simple but effective way to handle estimand-dependent nuisances. This type of nuisances frequently appears in efficient estimation of complex causal effects such as QTEs, and applying DML directly would require estimating a continuum of nuisances, which is challenging in practice.

Our proof of Theorem 1 and the proof of Theorem 3.3 in Chernozhukov et al. (2018a) are overall similar, but critically differ in Step II. In Step II, both proofs are based on the following decompo-

sition:

image

where

image

and  I5 = OP(δN) is proved analogously in both proofs.

However, our proof and the proof in Chernozhukov et al. (2018a) assume different rate on  λ′N andthus  I4:

image

Under our condition,  I4 ≤�√N∥ˆθ − θ∗∥ + 1�δN, then jointly considering the left hand side and

right hand side in Eq. (30) gives  ∥ˆθ−θ∗∥ = Op(N−1/2), which in turn implies that  I4 = O(δN), andthus the asserted conclusion in Theorem 1. In contrast, the counterpart condition in Chernozhukov et al. (2018a) guarantees that  I4 = O(δN) directly without needing to consider both sides of Eq. (30) jointly.

Now we use the example of estimating equation for incomplete data to show that the condition Eq. (32) in Chernozhukov et al. (2018a) generally requires stronger conditions for the convergence rates of nuisance estimators than our condition Eq. (31).

According to Eq. (44), under suitable regularity conditions,

image

Since Step I in the proof of Theorem 1 already proves that  ∥ˆθ − θ∗∥ ≤ ρπ,NδN , we need ρπ,Nρµ,N ≤δNN−1/2, ρπ,Nρθ,N ≤ δNN−1/2, and ρπ,N ≤ δ2N to guarantee our condtion. Thus our condition in Eq. (31) only requires that the product error rates to vanish faster than  O(N−1/2), which is common in debiased machine learning for linear estimating equation (Chernozhukov et al. 2018a).

In contrast, to guarantee the condition in Chernozhukov et al. (2018a) given in Eq. (32), we need to assume that  ρπ,N ≤ δ3/2N N−1/4, besides the conditions on product error rates. Therefore, following the proof in Chernozhukov et al. (2018a) directly will require the propensity score to converge faster than  O(N−1/4), no matter how fast the initial estimator ˆθ1,initand the regression estimator ˆµ(·, ˆθ1,init) converge.

I.1 Proofs for Section 2

Proof for Proposition 5. For any θ = (θ1, θ2) such that (θ, η∗1(·, θ)) ∈ N, the asserted Fr´echet differentiability and orthogonality condition imply that

image

This means that  J∗ = ∂θ{P [ψ(Z; θ, η∗1(Z; θ1), η∗2(Z))]}|θ=θ∗.

I.2 Proofs for Section 3

Proof for Theorem 1. Fix any sequence  {PN}N≥1that generates the observed data (Z1, . . . , ZN) and satisfies that  PN ∈ PN for all N ≥1. Because this sequence is chosen arbitrarily, to prove that the asserted conclusion holds uniformly over  P ∈ PN, we only need to prove

image

For  k = 1, . . . , K, we use PN,kto represent the empirical average operator based on  Dk. Forexample,  PN,k [ψ(Z; θ∗, η∗1(Z; θ∗1), η∗2(Z))] = 1|Dk|�i∈Dk ψ(Zi; θ∗, η∗1(Zi; θ∗1), η∗2(Zi)).Analogously, PNis the empirical average operator for the whole dataset, i.e.,  PNf(Z) = 1N�Ni=1 f(Zi). GN,kis the empirical process operator √N (PN,k − P).Moreover, for a given  N, PN,k, PN and thepopulation average operator P are all derived from the underlying true distribution  PN, but wesupress such dependence for ease of notation. Throughout the proof, we condition on the event (ˆη1(·, ˆθ1,init), ˆη2(·)) ∈ TN, which happens with at least  PN-probability 1−∆Naccording to Assumption 3 condition i.. All statements involving  o(·), OPN (·) or ≲notations in this proof depend on only constants pre-specified in Assumptions 2 and 3, and do not depend on constants specific to the instance  PN. This should be clear from the proof, and the fact that the maximal inequality in Lemma 6.2 of Chernozhukov et al. (2018a) only depend on pre-specified parameters. Here we prove the asymptotic distribution of ˆθgiven in Definition 2 first.

Step I: Prove a preliminary convergence rate for ˆθ: ∥ˆθ − θ∗∥ ≤ τN with PN-probability 1  − o(1). Here we prove this by showing that with  PN-probability 1  − o(1),

image

so that Assumption 2 implies that

image

Since the singular values of  J∗ are lower bounded by  c3 >0, we can conclude that with  PN-probability 1  − o(1), ∥ˆθ − θ∗∥ ≤ τN for Nexceeding an instance-independent threshold.

In order to prove Eq. (33), we use the following decomposition:

image

Denote

image

Then obviously,

image

To bound (c), note that Eq. (14) implies

image

Thus

(c) ≤ 1K

image

Therefore,

image

Note that Assumption 3 condition ii. implies that  I1,k ≤ δNτNand the Assumption 3 condition iii. implies that  εN ≤ δNN−1/2 = o(τN).

image

for a positive constant  Cq,c7that only depends on  q and c7specified in Assumption 2.

Then conditionally on ˆθ1,init, ˆη(k)1 (Z, ˆθ(k)1,init), ˆη(k)2 (Z), we can use Lemma 6.2 eq. (A.1) in Cher- nozhukov et al. (2018a) to prove that with  PN-probability 1  − o(1),

image

which also holds unconditionally according to Lemma 6.1 of in Chernozhukov et al. (2018a). This further implies that  I2,k ≲ N−1/2 log N(1+N−1/2+1/q) = o(N−1/2 log2 N(1+N−1/2+1/q)) = o(τN).Thus

image

Step II: Linearization and√N−Consistency.In Step I, we proved that  ∥ˆθ − θ∗∥ ≤ τN withPN-probability 1  − o(1). Conditioned on this event, we will show that

image

where

image

Here Assumption 3 condition ii. guarantees that  I4 ≤ δN�1 +√N∥ˆθ − θ∗∥�and the assumption that  εN = δNN−1/2 guarantees that  εNN1/2 ≤ δN. In step III and IV, we will further bound I5,k ≲ ρ′N := (N−1/2+1/q + r′N)log N + r′Nlog1/2(1/r′N) + N−1/2+1/q log(1/r′N) ≲ δN and I3 ≤I4 + 1K�Kk=1 I5,krespectively.

Consequently, with  PN-probability 1  − o(1),

image

and

image

By Assumption 2 condition v. and Markov inequality,  ∥√NJ∗−1PN[ψ(Z; θ∗, η∗1(Z; θ∗1), η∗2(Z))]∥ =OPN (√c6). Thus, with  PN-probability 1  − o(1),

image

Plugging this back into Eq. (36) gives

image

Thus,

image

because  ∥J∗−1∥ ≤ 1/c3 and ∥Σ−1/2∥ ≤ 1/√c5.

Now we prove the decomposition Eq. (35). Note that for any  θ ∈ Θ and (η1(·, θ1), η2) ∈ TN

image

If we apply Eq. (37) with  θ = ˆθ and (η1(·, θ1), η2) equal (ˆη(k)1 (·, ˆθ(k)1,init), ˆη(k)2 ) for the kth fold, and apply Eq. (14), then

image

Here

image

and the second order tayler expansion at  r = 0 gives that for some data-dependent ˜r ∈ (0, 1),

image

where the third equality uses the Neyman orthogonality in Assumption 2 condition vii..

Combining Eq. (38), Eq. (39) and Eq. (40) gives decomposition Eq. (35).

Step III: bounding I5,k. To bound I5,k, we still condition on ˆη(k)1 (·, ˆθ(k)1,init), ˆη(k)2 , and then apply Lemma 6.2 in Chernozhukov et al. (2018a) with function class

F′ˆη(k),ˆθ(k)1,init = {ψj(·; θ, ˆη(k)1 (·, ˆθ(k)1,init), ˆη(k)2 )−ψj(·; θ∗, η∗1(·, θ∗), η∗2) : j = 1, . . . , d, θ  ∈ Θ, ∥θ −θ∗∥ ≤ τN}.

We can verify that  F′ˆη(k),ˆθ(k)1,initsatisfies similar entropy condition with envelope  F1,ˆη(k),ˆθ(k)1,init +F1,η∗,θ∗1.Moreover, Assumption 3 implies that

image

Thus conditionally on ˆθ1,init, ˆη(k)1 (Z, ˆθ(k)1,init), ˆη(k)2 (Z), we can use Lemma 6.2 eq. (A.1) in Cher- nozhukov et al. (2018a) to show that with  PN-probability 1  − o(1),

image

which also holds unconditionally according to Lemma 6.1 in Chernozhukov et al. (2018a) .

Step IV: bounding I3. Let θ = θ∗ − J∗−1PN [ψ(Z; θ∗, η∗1(Z, θ∗), η∗2(Z))].

Since  P [ψ(Z; θ∗, η∗1(Z, θ∗), η∗2(Z))] = 0, J∗is nonsingular with singular values bounded away from 0 by  c3, and ∥PN [ψ(Z; θ∗, η∗1(Z, θ∗), η∗2(Z))] ∥ = OPN (N−1/2), ∥θ − θ∗∥ = OPN (N−1/2) = oPN (τN).According to Assumption 2 condition i.,  θ ∈ Θ with PNprobability 1  − o(1). Therefore,

image

Then apply the linearization Eq. (37) and taylor expansion similar to Eq. (40) with  θ = θ and

image

where the last equality here holds because  PN[ψ(Z; θ, η∗1(Z; θ∗1), η∗2(Z))] + J∗(θ − θ∗) = 0 as aconsequence of the special construction of  θ.

Extension: ˆθ defined in Definition 5.By applying step I to IV to sample estimating equation Eq. (27), we can get that for k = 1, . . . , K,

image

Since K is a fixed integer that does not grow with N, the equation above implies that the asserted conclusion in Theorem 1 also holds for ˆθ = 1K�Kk=1 ˆθ(k).

I.3 Proofs for Section 4

Proof of Theorem 2. We still consider data generating processes  {PN}N≥1defined in the proof for Theorem 1, and define  ⊗a = aa⊤. Now we prove that

image

for any  k ∈ [1, · · · , K].Then, the statement in Theorem 2 is immediately concluded. For all j, l ∈ [1, · · · , d] (d = d1 + d2), Eq. (41) follows once we have  Ijl = OPN (ρ′′N), where

image

Here, to simplify the notation, we use ˆη(k)1 = ˆη(k)1 (Z, ˆθ(k)init), η∗1 = η∗1(Z, θ∗1), ˆη(k)2 = ˆη(k)2 (Z, ˆθ(k)init), η∗2 =η∗2(Z, θ∗2). Obviously we have  Ijl ≤ Ijl,1 + Ijl,2, where

image

and we show that each term here is  Op(ρ′′N).

We first bound  Ijl,2. This is upper bounded as

image

Here, we use the fourth moment assumption in Assumption 4. From conditional Markov inequality, we have  Ijl,2 = OPN (1/N −1/2).

Next, we bound  Ijl,1. Following the proof of Theorem 3.2 (Chernozhukov et al. 2018a), we have

image

In addition, from the fourth moment assumption in Assumption 4

image

It follows from Markov inequality that

image

It remains to bound  RN. We have

image

Then, the first term of Eq. (42) is upper bounded with  PN-probability 1  − o(1) as

image

In the last inequality, we use Lemma 6.2 (Chernozhukov et al. 2018a). Here, the envelops exists since

image

for some constant C depending on  d1, d2. According to condition vi. in Assumption 2, it satisfies the moment condition  ∥F 2η∗,θ∗∥P,q/2 ≤ c1. In addition, the metricy entropy assumption is satisfied since

image

Similarly, with  PN-probability probability 1  − o(1), the second term of Eq. (42) can be upper bounded as follows:

PN,k[∥ψ(Z; ˆθ, ˆη1, ˆη2) − ψ(Z; ˆθ, η∗1, η∗2)∥2]

image

In the last inequality, we use Lemma 6.2 (Chernozhukov et al. 2018a) and Assumption 3. In the end, we have

image

This concludes the proof.

I.4 Proofs for Section 5

Proof for Theorem 3. In this part, we prove the asymptotic distribution of our estimators corresponding to the general estimating equation Eq. (7). We prove this by verifying all conditions in the assumptions for Theorem 1.

Verifying Assumption 1.

image

where the second and third equality follow because  P�I(T=t)−π∗(t|X)π∗(t|X) µ∗(X, t; θ1)�= 0.

Verifying Assumption 2. We first verify conditions iii. and iv. in Assumption 2. We denote that  Jjk(θ) = ∂θ(k)P [Uj(Y (t); θ1) + Vj(θ2)] where θ(k) is the kth component of  θ = (θ1, θ2). Bycondition ii.,  Jjk(θ) is Lipschitz continuous at  θ∗ with Lipschitz constant  c′. So for any  ε > 0, if θbelongs to the open ball  B(θ∗; ϵ/c′), then

image

By first-order Taylor expansion, for any  θ ∈ B(θ∗; δ), there exists  θ ∈ B(θ∗; ∥θ − θ∗∥) such that

image

where the second last inequality holds if we choose  ε ≤ c32√d ≤ 12√dσmin(J(θ∗)), where σmin(J(θ∗))is the smallest singular value of  J(θ∗). Thus

image

image

Moreover, the singular values  J∗ are bounded between  c3, c4according to condition iii..

We then verify condition vii. in Assumption 2: for any (η1(·; θ′1), η2) ∈ TN,

image

Verifying Assumption 3. We take TNto be the set that contains all (µ(·, θ′1), π(·)) that satisfies the following conditions:

image

Then Assumption 6 and condition vii. in Theorem 3 guarantee that the nuisance estimates (ˆµ(, ˆθ1,init), ˆπ) ∈ TNwith probability, namely, condition i. in Assumption 3 is satisfied.

Before verifying other conditions, first note that the condition vi. states that

image

Now we verify the condition on  rN: for any (η1(·; θ′1), η2(·)) = (µ(·, θ′1), π(·)) ∈ TN, and θ ∈ Θ,

image

Thus, the condition on  rNis satisfied with  τN such that τN = 4C√dρπ,NδNεπ .

Next, we verify the condition on  r′N: for any θ such that ∥θ−θ∗∥ ≤ 4C√dρπ,NδNεπ, and any (η1(·; θ′1), η2(·)) =(µ(·, θ′1), π(·)) ∈ TN,

image

Finally, to verify the condition on  λ′N, we note that for any  θ such that ∥θ − θ∗∥ ≤ 4C√dρπ,NδNεπ , andany (η1(·; θ′1), η2(·)) = (µ(·, θ′1), π(·)) ∈ TN

image

− r�µ(X, T; θ′1) − µ∗(X, T; θ∗1)��+�µ∗(X, t; θ∗1) + r�µ(X, t; θ′1) − µ∗(X, t; θ∗1)��+ V (θ∗2 + r(θ2 − θ∗2))�

image

Thus the first-order derivative is

image

×�µ(X, T; θ′1) − µ∗(X, T; θ∗1)��+ P��µ(X, t; θ′1) − µ∗(X, t; θ∗1)��+ ∂θ⊤2 V (θ2)|θ2=θ∗2+r(θ2−θ∗2)(θ2 − θ∗2).

The second order derivative is

image

+P

image

+P

image

Above, we use condition iv. in Theorem 3 to ensure exchange of integration and differentiation so we can get terms  ∂θ⊤1 µ∗(X, T; θ1)|θ1=θ∗1+r(θ1−θ∗1) and ∂2θ1,θ⊤1µ∗(X, T; θ1)|θ1=θ∗1+r(θ1−θ∗1).

image

=

image

=

image

and

image

Thus for any  θ such that ∥θ − θ∗∥ ≤ 4C√dρπ,NδNεπ ,

∥∂2rf(r; θ, µ(X, T; θ′1), π)∥ (43)≤ρπ,Nε3π

image

Given  ρπ,N ≤ δ3Nlog N , when δNlog N ≤ ε2π8C2d and δ2Nlog N ≤ ε3π2C√d, 4C2dε2πδN ρπ,N∥θ − θ∗∥ + C√dε3π ρπ,N∥θ1 − θ∗1∥ ≤δN∥θ − θ∗∥. Moreover, when  ρπ,N(ρµ,N + Cρθ,n) ≤ ε3π3 δNN−1/2, 3ε3π ρπ,N(ρµ,N + Cρθ,n) ≤ δNN−1/2.Consequently,  ∥∂2rf(r; θ, µ(X, T; θ′1), π)∥ ≤ δN(∥θ − θ∗∥ + N−1/2).

Proof for Proposition 1. First note that

image

When  Ft(θ1) is differentiable,  P [U(Y (t); θ1) + V (θ2)] is also differentiable by Leibnitz integral rule, with derivative

image

Thus

image

Now we prove Proposition 1 by verifying the assumptions in Theorem 3.

Verifying condition i in Theorem 3. We only need to verify that condition vi. of Assumption 2 hold. Since Θ is compact,  {y �→ I [y ≤ θ1] , θ ∈ Θ}, {y �→ max{θ1, 11−γ (y − θ1)} − θ2, θ ∈ Θ} areobviously Donsker classes, so condition vi. of Assumption 2 is satisfied.

Verifying conditions ii. and iii. in Theorem 3. It is straightforward to show that  J(θ) isinvertible with the following matrix as its inverse:

image

Note that  σmax(J(θ∗)) ≤ 2 max{ft(θ∗1), 1} ≤ 2 max{c′2, 1} and

image

Thus condition iii. in Theorem 3 is satisfied with  c3 = 12 min{1, 1−γγ c′1, c′1} and c4 = 2 max{1, c′2}.When we estimate quantile only, then only  ft (θ1) in J (θ) matters. Then condition iii. in Theorem 3 is satisfied with  c3 = c′1 and c4 = 2c′2.

Since  ft(θ1) ≤ c′2 and ˙ft(θ1) ≤ c′3, it follows that each element in  J(θ) is Lipschtiz continuous at θ∗ with Lipschitz constant  c′ = max{c′2, c′3}. Moreover, for  θ ∈Θ such that  ∥θ − θ∗∥ ≥ c32√2c′ , wehave 2∥P�U(Y (t); θ1) + V (θ2)�∥ ≥ c′5. This means that condition ii. in Theorem 3 is satisfied with c′ = max{c′2, c′3} and c2 = c′5. When we estimate quantile only, we only require  |F(θ∗1)−F(θ1)| ≥ c′4for  |θ1 − θ∗1| ≥ c′12c′3 . Then condition ii. in Theorem 3 is satisfied with  c′ = c′3 and c2 = 2c′4.

Verifying condition iv. in Theorem 3. This condition can be verified by the following facts: for any  θ1 such that |θ1 − θ∗1| ≤ 4C√dρπ,NδNεπ ,

image

and

|∂rµ∗2(X, t; θ∗1 + r(θ1 − θ∗1))| =����∂r

= |θ1 − θ∗1|����1 − 11 − γ (1 − Ft(θ∗1 + r(θ1 − θ∗1) | X))���� ≤ |θ1 − θ∗1|��∂2rµ∗2(X, t; θ∗1 + r(θ1 − θ∗1))�� = |θ1 − θ∗1|2 |ft(θ∗1 + r(θ1 − θ∗1) | X)| ≤ C |θ1 − θ∗1|2 .

Verifying conditions v. and vi. in Theorem 3. For any (θ1, θ2) ∈ Θ,

image

By first-order Taylor expansion, for any  θ1 such that |θ1 − θ∗1| ≤ max{4C√dρπ,NδNεπ , ρθ,N}, there exists ˜θ1 between θ1 and θ∗1 such that

image

Moreover, for any  θ1 such that |θ1 − θ∗1| ≤ max{4C√dρπ,NδNεπ , ρθ,N},

image

Proof for Proposition 4. We follow the proof of Theorem 1 to consider any sequence of data generating process  PN ∈ PNbut we suppress it for ease of notation. We prove the conclusion for a generic  k ∈ {1, . . . , K}. For l ∈ Hk,1, we denote  PN,l and GN,las the empirical average operator and empirical process operator for data in the  Dl. Throughout the proof, we condition on the event that the convergence rate of propensity score estimator ˆπ(k,l) in mean squared error is  ρπ,N and itis lower bounded by  ϵπ, which holds with at least probability 1  − ∆Naccording to Assumption 6. In this proof, all notations  ≲only involve pre-specified constants and not any instance-dependent constants.

We use the following decomposition analogous to that in Step I of proof for Theorem 1.

image

where

image

Given that condition vi. in Assumption 2 is satisfied for the estimating equation  ψIPW, we canfollow the end of step I in the proof for Theorem 1 to prove that with  PNprobability 1−c (log N)−1

for a constant c that depends on only constants in the assumptions,

image

Therefore, with  PN-probability 1  − c (log N)−1,

image

In the proof of Theorem 3, we have showed that conditions ii. and iii. in Theorem 3 imply that

image

Therefore, with probability 1  − c (log N)−1:

image

I.5 Proofs for Appendix

Proof of Proposition 2. In this part, we prove the asymptotic distribution of our estimator ˆθ =�ˆθ1, ˆθaux2 �∈ Θ1 × Θ2 ⊆ R2corresponding to Eqs. (20) and (23). We denote  θ = (θ1, θaux2 ). Weprove this by verifying all conditions in the assumptions in Theorem 1.

Verifying Assumption 1. Similar to the proof of Theorem 3, we can easily show that

image

Verifying Assumption 2. We first verify conditions iii. and iv. in Assumption 2. We can easily derive that

image

and its Jacobian matrix is given by

image

This means that

image

Therefore,

image

and

image

The latter implies that

image

Therefore, condition iv. in Assumption 2 is satisfied with  c3 = 12 min {c′3, c′3ϵ/γ, 1}, c4 = 2 max�c′1, γϵ , 1�.

Moreover, for any (θ1, θaux2 ) ∈ Θ1 × Θ2 and t = 1, ft(θ1 | C) ≤ c′1,��� ˙ft(θ1 | C)��� ≤ c′2, so we have

image

that entries in  J(θ) are all Lipschtiz with  cLip := max

a valid Lipschitz constant. Moreover, we have 2∥P [ψ(Z; θ, θaux2 , η∗1(Z; θ∗1), η∗2(Z))] ∥ ≥ c2 for allθ = (θ1, θaux2 ) ∈Θ such that  ∥θ −θ∗∥ ≥ c32√dcLip . By following the proof of Theorem 3, we can easily

verify condition iii. in Assumption 2.

Next, we verify condition vi. in Assumption 2. For any fixed  η1 (Z; θ′1) and η2, the class  Fη,θ′1 ={ψj(Z; θ, η1(Z; θ′1), η2(Z)) : j = 1, . . . , d, θ ∈ Θ} depend on θonly through  {I [Y ≤ θ1] : θ1 ∈ Θ1}and  {θaux2 : θaux2 ∈ Θ2}. Since the latter two classes are Donsker class, vi. in Assumption 2 for the function class  Fη,θ′1 = {ψj(Z; θ, η1(Z; θ′1), η2(Z)) : j = 1, . . . , d, θ ∈ Θ}has to be satisfied as well.

Verifying Assumption 3. We take TNto be the set that contains all (η1 (·; θ′1) = ˜µ(·, θ′1), η2 (·) =(νw (·) , ˜π(·))) that satisfies the following conditions: for w = 0, 1,

image

Cϵgiven in Eq. (45).

Then Assumption 8 and Proposition 2 condition v. ensure that the nuisance estimates (ˆµ(, ˆθ1,init), ˆπ) ∈TNwith probability, namely, condition i. in Assumption 3 is satisfied.

Before verifying other conditions, we first note that

image

It follows from Item iv. that for any  θ1 ∈ B(θ∗1; max{ 4˜ρπ,Nϵ2(1−ϵ)δN , ρθ,N}) ∩ Θ,

image

Next, we verify Assumption 3 condition ii..

We first verify the condition on  rN. By following the proof of Theorem 3, we can show that for any (η1(·; θ′1), η2(·)) ∈ TN,

image

The last inequality holds because

image

and so is ˜µw (X; θ1). This means that the condition on  rNis satisfied with  τN = 4˜ρπ,Nϵ2(1−ϵ)δN .

Next, we verify the condition on  r′N. Again, by following the proof of Theorem 3, we have that for

image

and

image

where

image

and

image

Also define

image

Morevoer, when ˜ρπ,N (˜ρµ,N + C ˜ρθ,N) ≤ ϵ4(1−ϵ)38(ϵ3+(1−ϵ)3)δNN−1/2, we have

image

Moreover, we can similarly show that

image

provided that ˜ρπ,N ˜ρν,N ≤ ϵ3(1−ϵ)38(ϵ3+(1−ϵ)3)δNN−1/2.

image

which verifies Assumption 3 condition ii..

Therefore, we have

image

This means that

image

Proof for Proposition 3. We only need to verify the conditions in Theorem 3.

Verifying condition i. in Theorem 3. We only need to verify that condition vi. of Assumption 2 hold. Since Θ is compact,  {y �→ (1 − γ) y − θ1, θ ∈ Θ}, {y �→ (1 − 2γ) max{y − θ1, 0}, θ ∈ Θ} areobviously Donsker classes, condition vi. of Assumption 2 is satisfied.

Verifying condition ii. and iii. in Theorem 3. According to Eq. (6), the estimating function for complete data is given by

image

It follows that

image

Here the differentiability of ∂∂θ1 P [U(Y (t); θ1)] is guaranteed by Leibniz integral rule, the con- tinuity of its derivative at  θ∗1 is guaranteed by the continuity of  Ft(θ1) at θ∗1 , and J(θ∗1) =∂∂θ1 P [U(Y (t); θ1)] |θ1=θ∗1 = −γ − (1 − 2γ)Ft(θ∗1), whose singular value  | − γ − (1 − 2γ)Ft(θ∗1)| isbounded between  c′4 and max{γ, 1 − γ}. Moreover, ∂∂θ1 P [U(Y (t); θ1)] ≤ max{γ, 1 − γ}, whichimplies that  P [U(Y (t); θ1)] is Lipschtiz continuous with Lipschitz constant max{γ, 1 − γ} ≤ 1.Therefore, the constants  c′ in condition ii. and constant  c3in iii. of Theorem 3 can be set as c3 = c′1, c′ = 1. The assumption that ∥θ − θ∗∥ ≥ c32c′ = c′12 , 2P [U(Y (t); θ1)] ≥ c′2 for any θ ∈ Θensures the condition ii. of Theorem 3 with constant  c2 = c′2.

Verifying condition iv. in Theorem 3. Note that for any  θ ∈ B(θ∗; 4C√dρπ,NδNεπ ) ∩ Θ,

image

Thus

image

which trivially imply condition iv. in Theorem 3.

Verifying condition iv in Theorem 3. Again

image

The the asserted assumpton iv means that�P[η∗2,1(Z)]2�1/2 ≤ C and�P[η∗1(Z; θ1)]2�1/2 ≤ C forany  θ ∈ Θ, thus�P [µ∗(X, 1; θ1)]2�1/2is upper bounded by  |1 − γ|+|1 − 2γ| C ≤ 2C for any θ ∈ Θ.

Plus, for any  θ1 ∈ B(θ∗1; max{4C√dρπ,NδNεπ , ρπ,N}) ∩ Θ

image

and there exists ˜θ1 between θ1 and θ∗1 such that

image


Designed for Accessibility and to further Open Science