Regularity and stability of feedback relaxed controls
2020·arXiv
Abstract

This paper proposes a relaxed control regularization with general exploration rewards to design robust feedback controls for multi-dimensional continuous-time stochastic exit time problems. We establish that the regularized control problem admits a Hölder continuous optimal feedback control, and demonstrate that both the value function and the feedback control of the regularized control problem are Lipschitz stable with respect to parameter perturbations. Moreover, we show that a pre-computed feedback relaxed control gives a robust performance in a perturbed system, and derive a first-order sensitivity equation for both the value function and optimal feedback relaxed control. These stability results provide a theoretical justification for recent reinforcement learning heuristics that including an exploration reward in the optimization objective leads to more robust decision making. We finally prove first-order monotone convergence of the value functions for relaxed control problems with vanishing exploration parameters, which subsequently enables us to construct the pure exploitation strategy of the original control problem based on the feedback relaxed controls.

Key words. exploration and exploitation, feedback relaxed control, Lipschitz stability, sensitivity equation, reinforcement learning, Hamilton-Jacobi-Bellman equation.


In this paper, we propose a relaxed control regularization with a class of exploration rewards to design robust feedback controls for multi-dimensional stochastic control problems in a continuous setting. In particular, we shall rigorously demonstrate that the constructed optimal feedback control is Lipschitz stable with respect to perturbations in the underlying model.

Since parameter uncertainty in a given model is practically inevitable, it is essential but challenging to evaluate a priori the performance of a pre-computed feedback control in a perturbed system, and to design feedback policies capable of handling model uncertainty. For instance, let us consider the following infinite-horizon stochastic control problem. Suppose $(\alpha_t)_{t\ge 0}$ is an admissible control process taking values in a finite action space $A$, and the underlying state dynamics follows a controlled stochastic differential equation (SDE) defined as follows: $X^{\alpha,x}_0 = x \in \mathbb{R}^n$, and

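\[
dX^{\alpha,x}_t = b(X^{\alpha,x}_t, \alpha_t)\,dt + \sigma(X^{\alpha,x}_t, \alpha_t)\,dW_t, \quad t > 0,
\]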

where $b : \mathbb{R}^n \times A \to \mathbb{R}^n$ and $\sigma : \mathbb{R}^n \times A \to \mathbb{R}^{n\times n}$ are given coefficients. The aim of the controller is to maximize the total expected discounted reward over all admissible strategies. It is well known (see e.g. [19, Corollary 5.1 on p. 167] and Theorem 2.2 for more precise statements) that, under certain regularity assumptions, the optimal control strategy can be represented as a deterministic function $\alpha^u : \mathbb{R}^n \to A$, called the optimal feedback control, which maps the current state space into the action space. Moreover, one can construct such an optimal feedback control $\alpha^u$ via a verification argument, which consists of solving a nonlinear Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE) arising from the dynamic programming principle for the optimal reward function $u$, and then performing a pointwise maximization of the associated Hamiltonian involving the function $u$ and its derivatives $(\partial_i u, \partial_{ij} u)_{i,j=1}^n$ as follows: for any given $x \in \mathbb{R}^n$,

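\[
\alpha^u(x) \in \arg\max_{\alpha \in A} \Big( \sum_{i,j=1}^n a^{ij}(x,\alpha)\,\partial_{ij}u(x) + \sum_{i=1}^n b^i(x,\alpha)\,\partial_i u(x) - c(x,\alpha)\,u(x) + f(x,\alpha) \Big), \tag{1.1}
\]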

where $a(x,\alpha) = \sigma(x,\alpha)\sigma^T(x,\alpha)/2$, and the functions $c$ and $f$ denote the discount rate and the instantaneous reward, respectively. We refer the reader to Theorems 2.2 and 3.5 for rigorous arguments of the above procedure for control problems of our interest, and to [19, Theorem 5.1 on p. 166] for a general statement.

We observe, however, that the control strategy $\alpha^u$ satisfying (1.1) is in general difficult to implement and unstable to parameter perturbations, which in practice would result in numerical instability of learning algorithms. Due to the finiteness of the action space $A$ and the fact that arg max is a set-valued mapping, a function $\alpha^u : \mathbb{R}^n \to A$ satisfying (1.1) is in general non-unique and merely measurable, and hence it is hard to follow such an irregular strategy in practice. More importantly, the discreteness of the set $A$ implies that the arg max mapping is not continuous (in the sup-norm), which makes the feedback control $\alpha^u$ very sensitive to perturbations of the coefficients $(b, \sigma, c, f)$. In other words, a slight change of the model parameters will result in a significant change of the feedback control, especially in the regions where two or more actions lead to similar performances based on the current model. Since it is difficult to determine the occurrence of such regions a priori, it is unclear how well the control strategy $\alpha^u$ will perform in a real system with the perturbed coefficients $(\tilde b, \tilde\sigma, \tilde c, \tilde f)$, even if $(\tilde b, \tilde\sigma, \tilde c, \tilde f)$ is very close to $(b, \sigma, c, f)$. See the last paragraph of Section 2 for more details on the instability of feedback controls and its practical impact on learning algorithms.
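The sensitivity of the arg max mapping can be illustrated with a minimal numerical sketch. The two Hamiltonian values below are hypothetical numbers chosen for illustration, and the softmax map stands in for an entropy-regularized policy with an assumed weight `eps`; neither is taken from the analysis of this paper.

```python
import numpy as np

def greedy_action(q):
    """Pure exploitation: index of the largest Hamiltonian value."""
    return int(np.argmax(q))

def softmax_policy(q, eps):
    """Entropy-regularized distribution over actions with weight eps > 0."""
    z = (q - np.max(q)) / eps      # shift by the maximum for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Hypothetical Hamiltonian values of two actions at one state, before and
# after a perturbation of size 2e-6 in the model coefficients.
q = np.array([1.0 + 1e-6, 1.0])
q_pert = np.array([1.0 - 1e-6, 1.0])

# The greedy action flips from index 0 to index 1 ...
print(greedy_action(q), greedy_action(q_pert))
# ... while the regularized distribution moves by an amount comparable to
# (size of the perturbation) / eps.
change = np.abs(softmax_policy(q, 0.1) - softmax_policy(q_pert, 0.1)).max()
print(change)
```

In this toy setting the greedy rule changes by a full unit in the sup-norm, whereas the regularized distribution changes only slightly, which is the qualitative behaviour that the stability results of this paper quantify.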

A tremendous amount of effort has been made to overcome the above difficulties, particularly in the (discrete-time) Reinforcement Learning (RL) setting (see e.g. [39]), where the agent seeks (nearly) optimal decisions in a random environment with incomplete information. Generally speaking, the controller must balance between greedily exploiting the available information to choose actions that maximize short-term rewards, and continuously exploring the environment to acquire more knowledge for long-term benefits. In particular, an entropy-regularized formulation has been proposed for solving (discrete-time) RL problems in [46, 33, 21], where the authors incorporate explorations by explicitly including the entropy of the exploration strategy in the optimization objective as a reward function, and balance exploitation and exploration by adjusting a weight imposed on this regularization term. Empirical studies (e.g. [46, 25, 33, 21]) show that such a regularized formulation leads to more robust decision making. Recently, the authors in [42, 43] extended this entropy-regularized formulation to continuous-time RL problems by using the relaxed control framework, and studied the exploration/exploitation trade-off for one-dimensional linear-quadratic (LQ) control problems via explicit solutions. The relaxed control approach has then been extended to (discrete-time) RL problems with mean-field controls in [23].

In this work, we propose an exploratory framework with general exploration rewards to design robust feedback controls for continuous-time stochastic exit time problems with continuous state space and discrete action space. Our formulation extends the relaxed control approach in [42, 43] to multi-dimensional state dynamics and general exploration rewards, including Shannon's differential entropy and other commonly used regularization functions in the optimization literature (see e.g. [15, 45]); see the remark at the end of Section 3 for a detailed comparison among different exploration reward functions.

A major theoretical contribution of this work is a rigorous stability analysis of the regularized control problem and its associated feedback control strategy. Although the entropy-regularized RL formulation has demonstrated remarkable robustness in various empirical studies (e.g. [46, 25, 33, 21, 23, 43]), to the best of our knowledge, there is no published theoretical work on the Lipschitz stability of feedback relaxed controls with respect to parameter uncertainty (even in a discrete-time setting) nor on the Lipschitz stability of the value functions for regularized continuous-time stochastic control problems with general multi-dimensional nonlinear state dynamics. In fact, most existing results on the Lipschitz stability of feedback controls are for LQ control problems with linear state dynamics and quadratic cost functions (see e.g. [31] for discrete-time LQ problems in an ergodic setting and [6] for finite-horizon continuous-time LQ problems). The stability analysis of such problems relies heavily on the linearity of optimal feedback controls and the associated Riccati equations, and hence cannot be directly extended to general nonlinear control problems. We refer the reader also to [2, 30, 3, 4, 7, 8, 27] for the continuity of various stochastic optimization problems, including stochastic control problems and optimal stopping problems, in the underlying processes with respect to the (extended) weak topology.

In this work, we shall close the gap by providing a theoretical justification for recent RL heuristics that including an exploration reward in the optimization objective leads to more robust decision making. In particular, we shall demonstrate that the change in value functions of the regularized control problems (in the $C^{2,\beta}$-norm) depends Lipschitz-continuously on the perturbations of the model parameters, including the coefficients of the state dynamics and reward functions in the optimization objective. We shall also prove that the regularized control problem admits a Hölder continuous feedback control (in contrast, the original control $\alpha^u$ in (1.1) is merely measurable), which is Lipschitz stable (in the $C^\beta$-norm) with respect to parameter perturbations; see Theorem 4.2.

Moreover, this is the first paper which precisely quantifies the performance of a feedback control pre-computed based on a given model in a new multi-dimensional controlled dynamics with perturbed coefficients. We will prove that the gap between the suboptimal reward function achieved by the pre-computed feedback relaxed control and the optimal reward function of the perturbed relaxed control problem depends Lipschitz-continuously on the magnitude of perturbations in the coefficients (see Theorem 4.4). We also establish a first-order sensitivity equation for the value function and feedback control of the perturbed relaxed control problem (see Theorem 5.2 and Remark 5.1), which enables us to quantify the explicit dependence of the Lipschitz stability of feedback controls on the exploration parameter $\varepsilon$ (see Theorem 5.4).

Let us briefly comment on the two main difficulties encountered in the stability analysis of feedback relaxed controls beyond those encountered in the finite-dimensional RL setting (see e.g. [20, 12, 21]) and the LQ setting (see e.g. [31, 6]). As we shall see in (3.6), the feedback relaxed control (in the present continuous setting) is defined as the pointwise maximizer of the associated Hamiltonian, which in general involves not only the value function of the regularized control problem, but also its first and second order derivatives. Hence, besides estimating the sup-norm of the value functions as in the finite-dimensional RL setting, we also need to quantify the impact of parameter uncertainty on the (first and second order) derivatives of the value functions, which are solutions to fully nonlinear HJB PDEs. For continuous-time LQ problems, such an analysis can be greatly simplified by taking advantage of the quadratic structure of the value function, which reduces the study of HJB PDEs to that of Riccati ordinary differential equations. Such a simplification is not possible for general nonlinear control problems, which requires us to derive a precise a priori estimate for the derivatives of solutions to the associated fully nonlinear HJB equations.

Moreover, the Lipschitz stability and the first-order sensitivity analysis of the feedback relaxed controls also require us to establish the regularity of the HJB operator and the arg max-mapping between suitable function spaces for regularized control problems. As already pointed out in [40, 26], the fact that the HJB operator is fully nonlinear (since we allow the diffusion coefficients to be controlled) poses a significant challenge for choosing proper function spaces to simultaneously ensure the differentiability of the fully nonlinear HJB operator and the bounded invertibility of its (Fréchet) derivative, which are essential for deriving the sensitivity equations of the value functions and feedback controls (see Theorem 5.2 and Remark 5.1). Here, by taking advantage of the exploration reward functions, we demonstrate that the HJB operator and the arg max-mapping for the regularized control problem are sufficiently smooth between suitable Hölder spaces, which together with an elliptic regularity estimate leads us to the desired sensitivity results for the feedback relaxed controls; see Remark 4.1 for more details.

Finally, we establish that, as the exploration parameter tends to zero, the value function of the relaxed control problem converges monotonically to that of the classical stochastic control problem with a first-order accuracy (see Theorem 6.1). The convergence of value functions (in the $C^{2,\beta}$-norm) subsequently enables us to deduce a novel uniform convergence result (on compact sets) of the feedback relaxed control to a pure exploitation strategy of the original control problem. We further prove an exact regularization property for a class of reward functions, which allows us to recover the pure exploitation strategy based on the feedback relaxed control without sending the exploration parameter to 0 (see Theorem 6.4).

We organize this paper as follows. Section 2 introduces the stochastic exit control problem, and establishes its connection to HJB equations. In Section 3, we propose a relaxed control regularization involving general exploration reward functions for the stochastic control problem, and establish the Hölder regularity of the feedback relaxed control strategy. Then, for a fixed positive exploration parameter, we prove the Lipschitz stability of the value function and feedback relaxed control with respect to parameter perturbations in Section 4, and derive their first-order sensitivity equations in Section 5. We establish the convergence of value functions and relaxed control strategies for vanishing exploration parameters in Section 6. Appendix A is devoted to the proofs of some technical results.

In this section, we introduce the stochastic exit time problem of our interest, state the main assumptions on its coefficients, and recall its connection with HJB equations. We start with some useful notation which is needed frequently throughout this work.

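For any given multi-index $\beta = (\beta_1, \dots, \beta_n)$ with $\beta_i \in \mathbb{N} \cup \{0\}$, $i = 1, \dots, n$, we define $|\beta| = \sum_{i=1}^n \beta_i$ and $D^\beta\phi = \frac{\partial^{|\beta|}\phi}{\partial x_1^{\beta_1} \cdots \partial x_n^{\beta_n}}$. For any given open subset $O \subset \mathbb{R}^n$, $k \in \mathbb{N} \cup \{0\}$, $\theta \in (0, 1]$, and function $\phi : O \to \mathbb{R}$, we define the following semi-norms:

\[
[\phi]_{k,0;O} = \sup_{x \in O,\, |\beta| = k} |D^\beta\phi(x)|, \qquad
[\phi]_{k,\theta;O} = \sup_{x, y \in O,\, x \ne y,\, |\beta| = k} \frac{|D^\beta\phi(x) - D^\beta\phi(y)|}{|x - y|^\theta}.
\]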

Then we shall denote by $C^k(O)$ the space of $k$-times continuously differentiable functions in $O$ equipped with the norm $|\phi|_{k;O} = \sum_{m=0}^k [\phi]_{m,0;O}$, and by $C^{k,\theta}(O)$ the space consisting of all functions in $C^k(O)$ satisfying $[\phi]_{k,\theta;O} < \infty$, equipped with the norm $|\phi|_{k,\theta;O} = |\phi|_{k;O} + [\phi]_{k,\theta;O}$. When $k = 0$, we use $C^\theta(O)$ to denote $C^{0,\theta}(O)$, and use $|\cdot|_{\theta;O}$ to denote $|\cdot|_{0,\theta;O}$. We shall omit the subscript $O$ in the (semi-)norms if no confusion appears.

Finally, we shall denote by $[a_{ij}]$ the $n \times n$ matrix whose $ij$th entries are given by $a_{ij}$, by $S^n$, $S^n_0$ and $S^n_>$, respectively, the sets of $n \times n$ symmetric, symmetric positive semi-definite and symmetric positive definite matrices, and by $X \ge Y$ in $S^n$ the fact that $X - Y$ is positive semi-definite. For any given $K \in \mathbb{N}$, we denote by $\Delta_K$ the probability simplex in $\mathbb{R}^K$, i.e.,

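\[
\Delta_K = \Big\{ \lambda = (\lambda_1, \dots, \lambda_K)^T \in [0,1]^K \;\Big|\; \sum_{k=1}^K \lambda_k = 1 \Big\}.
\]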

Now we are ready to introduce the control problem of interest. In order to allow irregular feedback control strategies, we consider the following weak formulation of a control problem, which includes the underlying probability space as part of control strategies (see e.g. [44, 19]). See Remark 2.2 for possible extensions to stochastic control problems under strong formulation, for which the underlying probability reference system is fixed.

Definition 2.1. A 5-tuple $\pi = (\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\ge 0}, \mathbb{P}, W)$ is said to be a reference probability system if $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\ge 0}, \mathbb{P})$ is a filtered probability space satisfying the usual conditions, and $W = (W_t)_{t\ge 0}$ is an $\{\mathcal{F}_t\}_{t\ge 0}$-adapted $n$-dimensional Brownian motion. We denote by $\Pi_{\mathrm{ref}}$ the set of all reference probability systems.

Now let $O$ be a given bounded domain in $\mathbb{R}^n$, i.e., a bounded connected open subset of $\mathbb{R}^n$. The aim of the controller is to maximize the expected discounted reward up to the first exit time of a controlled dynamics from the domain $O$. More precisely, let $\pi = (\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\ge 0}, \mathbb{P}, W) \in \Pi_{\mathrm{ref}}$ be a given reference probability system, and $\mathcal{A}_\pi$ be the set of $\{\mathcal{F}_t\}_{t\ge 0}$-progressively measurable processes $\alpha$ taking values in a finite set $A$. For any given initial state $x \in \mathbb{R}^n$ and control $\alpha \in \mathcal{A}_\pi$, we consider the controlled dynamics $X^{\alpha,x}$ satisfying the following SDE: $X^{\alpha,x}_0 = x$ and

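\[
dX^{\alpha,x}_t = b(X^{\alpha,x}_t, \alpha_t)\,dt + \sigma(X^{\alpha,x}_t, \alpha_t)\,dW_t, \quad t > 0, \tag{2.2}
\]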

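where $b : \mathbb{R}^n \times A \to \mathbb{R}^n$ and $\sigma : \mathbb{R}^n \times A \to \mathbb{R}^{n\times n}$ are given Lipschitz continuous functions (see (H.1) for precise conditions), and denote by $\tau^{\alpha,x} := \inf\{t \ge 0 \mid X^{\alpha,x}_t \notin O\}$ the first exit time of the dynamics $X^{\alpha,x}$ from the domain $O$, and by $(\Gamma^{\alpha,x}_t)_{t\in[0,\tau^{\alpha,x}]}$ the controlled discount factor: $\Gamma^{\alpha,x}_t := \exp\big(-\int_0^t c(X^{\alpha,x}_s, \alpha_s)\,ds\big)$ for all $t \in [0, \tau^{\alpha,x}]$. Then, for each given $x \in \overline{O}$, we shall consider the following value function: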

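\[
v(x) = \sup_{\pi \in \Pi_{\mathrm{ref}}} \sup_{\alpha \in \mathcal{A}_\pi} \mathbb{E}\Big[ \int_0^{\tau^{\alpha,x}} \Gamma^{\alpha,x}_t f(X^{\alpha,x}_t, \alpha_t)\,dt + g\big(X^{\alpha,x}_{\tau^{\alpha,x}}\big)\,\Gamma^{\alpha,x}_{\tau^{\alpha,x}} \Big], \tag{2.3}
\]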

where the functions f and g denote, respectively, the running reward and the exit reward. Throughout this work, we shall perform the analysis under the following assumptions on the coefficients:

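H.1. Let $n, K \in \mathbb{N}$, $\mathcal{K} = \{1, \dots, K\}$, $A$ be a set of cardinality $K$, i.e., $A = \{a_k\}_{k\in\mathcal{K}}$, and $O$ be a bounded domain in $\mathbb{R}^n$. There exist constants $\nu, \Lambda > 0$, $\theta \in (0, 1]$ such that the boundary $\partial O$ of $O$ is of class $C^{2,\theta}$, $g \in C^{2,\theta}(O)$, and the functions $b : \mathbb{R}^n \times A \to \mathbb{R}^n$, $\sigma : \mathbb{R}^n \times A \to \mathbb{R}^{n\times n}$,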

image

Remark 2.1. The Lipschitz continuity of $b$ and $\sigma$ on $\mathbb{R}^n$ ensures that, for any given $\pi \in \Pi_{\mathrm{ref}}$, $\alpha \in \mathcal{A}_\pi$ and $x \in \mathbb{R}^n$, the controlled SDE (2.2) admits a unique strong solution. Moreover, the non-degeneracy of $\sigma$ on $\mathbb{R}^n$ ensures that SDEs with non-Lipschitz feedback controls admit a weak solution (cf. Theorems 2.2 and 3.5); see also Lemma 3.1.

As shown in [22, Lemma 6.38], the fact that $\partial O$ is of class $C^{2,\theta}$ ensures that a function in $C^{2,\theta}(O)$ has boundary values in $C^{2,\theta}(\partial O)$, and conversely, any function $\phi \in C^{2,\theta}(\partial O)$ can be extended to a function in $C^{2,\theta}(O)$. Hence, one can introduce a boundary norm $|\cdot|_{2,\theta;\partial O}$ for the space $C^{2,\theta}(\partial O)$, such that for any given $\phi \in C^{2,\theta}(\partial O)$, $|\phi|_{2,\theta;\partial O} = \inf_\Phi |\Phi|_{2,\theta;O}$, where $\Phi \in C^{2,\theta}(O)$ is a global extension of $\phi$ to $\overline{O}$. The space $C^{2,\theta}(\partial O)$ equipped with the norm $|\cdot|_{2,\theta;\partial O}$ is a Banach space (see e.g. the discussions on page 94 in [22]).

To simplify the presentation, we study exit time control problems with Hölder continuous coefficients in this work and analyze classical solutions of associated elliptic HJB equations. Similar results, including the characterization and Lipschitz stability of feedback relaxed controls in Sections 3 and 4, can be obtained for finite horizon control problems with measurable coefficients, whose corresponding parabolic HJB equations admit weak solutions in suitable Sobolev spaces (see [41] for the well-posedness of weak solutions to parabolic HJB equations and [29, Theorem 1 on p. 122] for a generalized Itô's formula). The first-order sensitivity analysis in Section 5 in general can only be performed for classical solutions in Hölder spaces; see Remark 4.1 for details.

The rest of this section is devoted to the connection between the stochastic exit time problem and a Hamilton-Jacobi-Bellman (HJB) boundary value problem, which plays an essential role in the construction of feedback control strategies. More precisely, we now consider the following HJB equation with inhomogeneous Dirichlet boundary data:

\[
H_0(Lu + f) = 0 \ \text{ in } O, \qquad u = g \ \text{ on } \partial O, \tag{2.6}
\]

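where $H_0 : \mathbb{R}^K \to \mathbb{R}$ is the pointwise maximum function, i.e., $H_0(x) = \max_{k\in\mathcal{K}} x_k$ for all $x = (x_1, \dots, x_K)^T \in \mathbb{R}^K$, $f : \overline{O} \to \mathbb{R}^K$ is the function satisfying $f(x) = (f(x, a_k))_{k\in\mathcal{K}}$ for all $x \in \overline{O}$, and $L = (L_k)_{k\in\mathcal{K}}$ is a family of elliptic operators satisfying for all $k \in \mathcal{K}$, $\phi \in C^2(O)$, $x \in O$ that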

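\[
L_k\phi(x) = a^{ij}_k(x)\,\partial_{ij}\phi(x) + b^i_k(x)\,\partial_i\phi(x) - c_k(x)\,\phi(x), \quad \text{with } a_k = \sigma_k\sigma_k^T/2. \tag{2.7}
\]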

Above and hereafter, when there is no ambiguity, we shall denote by $\phi_k(\cdot)$ a generic function $\phi(\cdot, a_k)$ for all $k \in \mathcal{K}$, and adopt the summation convention as in [22, 16], i.e., repeated equal dummy indices indicate summation from 1 to $n$.

Throughout this paper, we shall focus on the classical solution $u \in C(\overline{O}) \cap C^2(O)$ to (2.6) established in the following theorem, which subsequently enables us to characterize optimal feedback controls for (2.3).

Theorem 2.1. Suppose (H.1) holds, and let $M = \sup_{i,j,k} |\sigma^{ij}_k|_{0;O}$. Then the Dirichlet problem (2.6) admits a unique solution $u \in C(\overline{O}) \cap C^2(O)$. Moreover, there exists a constant $\beta_0 =$

image

Proof. We shall only prove the uniqueness of solutions in $C(\overline{O}) \cap C^2(O)$, since the existence of classical solutions in $C^{2,\min(\beta_0,\theta)}(O)$ will be established constructively based on the relaxed control approximation in Theorem 6.1 (see also [16, Theorem 7.5] for a proof of existence based on the method of continuity), and the existence of a Borel measurable function satisfying (2.8) follows directly from the measurable selection theorem (see [1, Theorem 18.19]).

Let $u_1, u_2 \in C(\overline{O}) \cap C^2(O)$ be solutions to (2.6). Then for all $x \in O$, we can deduce from the fundamental theorem of calculus that

image

where $h : [0, T] \times O \to \Delta_K$ is a measurable function, and $\tilde L$ denotes the elliptic operator satisfying for all $\phi \in C^2(O)$ and $x \in O$ that $\tilde L\phi(x) = \eta^T(x) L\phi(x)$ with $\eta(x) = \int_0^1 h(s, x)\,ds$. In particular, the function $h$ can be chosen as the weak limit of the functions $\big([0, T] \times O \ni (s, x) \mapsto (\nabla H^\varepsilon_0)(Lu_2(x) + f(x) + sL(u_1 - u_2)(x)) \in \Delta_K\big)_{\varepsilon>0}$ in $L^2([0, T] \times O)$, where $(H^\varepsilon_0)_{\varepsilon>0}$ is a sequence of smooth approximations of $H_0$ obtained by using the standard mollification argument. Then we can easily show that $\eta(x) \in \Delta_K$ for all $x \in O$, $\tilde L$ is a uniformly elliptic operator, and $\sum_{k=1}^K \eta_k(x) c_k(x) \ge 0$ for all $x \in O$. Hence the classical maximum principle (see e.g. [22, Theorem 3.7]) and $u_1 = u_2$ on $\partial O$ imply that $u_1 = u_2$ on $\overline{O}$, which shows that the Dirichlet problem (2.6) admits at most one solution in $C(\overline{O}) \cap C^2(O)$.

We now present a verification result, i.e., Theorem 2.2, which shows that the classical solution to the HJB equation (2.6) is the value function (2.3), and the Borel measurable function $\alpha^u$ defined as in (2.8) is a feedback control of (2.3). The proof is postponed to Appendix A; it essentially follows from Itô's formula and the existence result of weak solutions to SDEs with non-degenerate diffusion coefficients (see [32, Theorem 1]).

We first recall the definition of optimal feedback control (see e.g. [44, Definition 6.1]).

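Definition 2.2. A Borel measurable function $h : \overline{O} \to A$ is said to be a feedback control of (2.3) if for all $x \in \overline{O}$, there exist $\pi^x = (\Omega^x, \mathcal{F}^x, \{\mathcal{F}^x_t\}_{t\ge 0}, \mathbb{P}^x, W) \in \Pi_{\mathrm{ref}}$ and an $\{\mathcal{F}^x_t\}_{t\ge 0}$-progressively measurable continuous process $(X^x_t)_{t\ge 0}$, such that $X^x_0 = x$, and it holds $\mathbb{P}^x$-a.s. that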

image

O}. A feedback control h is said to be optimal if we have for all  x ∈

image

Theorem 2.2. Suppose (H.1) holds. Let $v : \overline{O} \to \mathbb{R}$ be the value function defined as in (2.3), $u \in C(\overline{O}) \cap C^2(O)$ be the solution to the Dirichlet problem (2.6), and $\alpha^u : \overline{O} \to A$ be a Borel measurable function satisfying (2.8). Then we have $u(x) = v(x)$ for all $x \in \overline{O}$, and $\alpha^u$ is an optimal feedback control of (2.3).

Remark 2.2. As shown in Theorem 2.2, by considering a weak formulation of the stochastic control problem (2.3) with reference probability systems varying in $\Pi_{\mathrm{ref}}$, we can rigorously demonstrate that a measurable function $\alpha^u$ satisfying (2.8) is indeed an optimal feedback control strategy.

One can also consider stochastic exit time problems under a strong formulation, for which we first fix a reference probability system $\pi = (\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\ge 0}, \mathbb{P}, W)$, and the agent only maximizes the reward functional over all admissible control processes in $\mathcal{A}_\pi$. It has been shown in [14, Theorem 2.1] that, if we assume (H.1) and $c > 0$ on $\overline{O} \times A$, then (2.6) satisfies the strong comparison principle, i.e., a comparison result for semicontinuous viscosity solutions. In particular, (H4) in [14] is satisfied since $\partial O \in C^{2,\theta}$ enjoys the exterior ball condition, and (H5) in [14] is satisfied with $\Gamma_{\mathrm{out}} = \partial O$ due to the uniform ellipticity condition (2.4). The strong comparison principle further enables us to show that the value function of the stochastic control problem (under the strong formulation) is the unique continuous viscosity solution to (2.6); see [5, Theorem 3.1]. Since the classical solution $u$ is a viscosity solution of (2.6), we see it is the value function of the stochastic control problem (under the strong formulation), and the strategy $\alpha^u$ defined in (2.8) will lead to the optimal reward. Hence, we can still view the function $\alpha^u$ as an optimal feedback control.

We reiterate that, due to the fact that arg max is a set-valued mapping, the feedback control strategy (2.8) in general is non-unique, discontinuous, and sensitive to the perturbation of the coefficients. For instance, let $K = 2$, and consider the set $G = \{x \in O \mid (L_1 - L_2)u(x) + (f_1 - f_2)(x) = 0\}$ at whose boundary the optimal control $\alpha^u$ in (2.8) could have a jump discontinuity. Except for the trivial case where $\alpha^u$ is a constant on $O$, one can easily deduce from the connectedness of $O$, the fact that $u \in C^2(O)$, and the continuity of the coefficients that the set $G$ is non-empty. Since the boundary of the level set $G$ can have poor regularity, we see the feedback control $\alpha^u$ in general is merely Borel measurable, which makes it substantially difficult to follow the optimal control in practice. Moreover, the discontinuity of $\alpha^u$ also implies that a small perturbation of the coefficients could lead to a significant difference of $\alpha^u$ in the sup-norm, especially near the boundary of the set $G$. It is well known (see e.g. [9, Section 6.4.2] and [24, Figure 4]) that such an instability of feedback controls would result in a numerical instability of the learning process, i.e., the approximate policies generated by an iterative learning algorithm may change substantially from one iteration to the next, and eventually oscillate among several far-from-optimal policies.

In this section, we propose a relaxation of the stochastic exit time problem (2.3), which extends the ideas used in [42] to control problems with multi-dimensional controlled dynamics and general exploration reward functions. As we shall see shortly, the relaxed control problem has a Hölder continuous feedback control strategy, and enjoys better stability with respect to perturbation of the coefficients.

The following technical lemma is essential for the formulation of relaxed control problems with multi-dimensional dynamics, whose proof is included in Appendix A.

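Lemma 3.1. Suppose (H.1) holds. Then there exist unique functions $\tilde b : \mathbb{R}^n \times \Delta_K \to \mathbb{R}^n$ and $\tilde\sigma : \mathbb{R}^n \times \Delta_K \to S^n_>$ such that it holds for all $x \in \mathbb{R}^n$, $\lambda \in \Delta_K$ that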

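\[
\tilde b(x, \lambda) = \sum_{k=1}^K b(x, a_k)\,\lambda_k, \qquad
\tilde\sigma(x, \lambda)\,\tilde\sigma(x, \lambda)^T = \sum_{k=1}^K \sigma(x, a_k)\,\sigma(x, a_k)^T \lambda_k.
\]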

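Moreover, it holds for all $x \in \mathbb{R}^n$, $\lambda \in \Delta_K$ that $\tilde\sigma(x, \lambda) \ge \sqrt{\nu}\, I_n$ and $\sum_{i,j} |\tilde\sigma^{ij}(\cdot, \lambda)|_{0,1} + \sum_i |\tilde b^i(\cdot, \lambda)|_{0,1} < \infty$.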
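The construction behind Lemma 3.1 can be sketched numerically at a single state. The sketch assumes that the relaxed drift is the $\lambda$-weighted average of the drifts and that the relaxed diffusion is the symmetric positive definite square root of the $\lambda$-weighted average of $\sigma\sigma^T$; the function name `relaxed_coefficients` and the two-action model are hypothetical and serve only as an illustration.

```python
import numpy as np

def relaxed_coefficients(b_list, sigma_list, lam):
    """Return (b_tilde, sigma_tilde) at one state for a distribution lam.

    Assumption: b_tilde is the lam-average of the drifts b_k, and sigma_tilde
    is the symmetric positive definite square root of the lam-average of
    sigma_k sigma_k^T.
    """
    lam = np.asarray(lam, dtype=float)
    b_tilde = sum(l * b for l, b in zip(lam, b_list))
    cov = sum(l * s @ s.T for l, s in zip(lam, sigma_list))
    evals, evecs = np.linalg.eigh(cov)              # cov is symmetric
    sigma_tilde = (evecs * np.sqrt(evals)) @ evecs.T  # V diag(sqrt(evals)) V^T
    return b_tilde, sigma_tilde

# Hypothetical two-action model in dimension n = 2 with invertible sigma_k.
b_list = [np.array([1.0, 0.0]), np.array([0.0, -1.0])]
sigma_list = [np.array([[1.0, 0.5], [0.0, 1.0]]),
              np.array([[2.0, 0.0], [0.3, 1.0]])]
lam = [0.25, 0.75]
b_tilde, sigma_tilde = relaxed_coefficients(b_list, sigma_list, lam)
```

The uniqueness claim of the lemma corresponds here to the uniqueness of the positive definite square root of a symmetric positive definite matrix.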

We now proceed to introduce the relaxation of the exit time problem (2.3). Roughly speaking, instead of seeking the optimal feedback action, which maps the current state to a specific action in the space $A$, we seek the optimal feedback control distribution, which is a deterministic mapping from the current state to a probability measure over the space $A$, i.e., $\lambda^* : O \to \mathcal{P}(A)$. Once such a mapping is determined, at each given state, the agent will execute the control by sampling a control action based on the distribution $\lambda^*(x)$. We refer the reader to [42] for a more detailed derivation of the following regularized control problem (3.6) in a one-dimensional setting. Note that the fact that $A$ has cardinality $K < \infty$ enables us to identify the space of probability measures over $A$ as the probability simplex $\Delta_K$.

More precisely, let $\pi = (\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\ge 0}, \mathbb{P}, W) \in \Pi_{\mathrm{ref}}$ be a given reference probability system, and $\mathcal{M}_\pi$ be the set of $\{\mathcal{F}_t\}_{t\ge 0}$-progressively measurable processes $\lambda$ taking values in the set $\Delta_K$. Suppose that (H.1) holds. For any given initial state $x \in \mathbb{R}^n$ and control $\lambda \in \mathcal{M}_\pi$, we consider the controlled diffusion process $X^{\lambda,x}$ satisfying the following SDE: $X^{\lambda,x}_0 = x$ and

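\[
dX^{\lambda,x}_t = \tilde b(X^{\lambda,x}_t, \lambda_t)\,dt + \tilde\sigma(X^{\lambda,x}_t, \lambda_t)\,dW_t, \quad t > 0, \tag{3.1}
\]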

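where $\tilde b : \mathbb{R}^n \times \Delta_K \to \mathbb{R}^n$ and $\tilde\sigma : \mathbb{R}^n \times \Delta_K \to S^n_>$ are the functions defined in Lemma 3.1. We further introduce the first exit time of $X^{\lambda,x}$ from the domain $O$ defined as $\tau^{\lambda,x} := \inf\{t \ge 0 \mid X^{\lambda,x}_t \notin O\}$, and the controlled discount factor $\Gamma^{\lambda,x}_t := \exp\big(-\int_0^t \sum_{k=1}^K c(X^{\lambda,x}_s, a_k)\,\lambda^k_s\,ds\big)$ for all $t \in [0, \tau^{\lambda,x}]$.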
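To make the relaxed dynamics, the exit time and the discount factor concrete, the following is a minimal Euler-Maruyama sketch in dimension one with $O = (0, 1)$. Everything specific in it is an assumption made for illustration: the per-action constants `b`, `sigma`, `c`, the feedback distribution `lam_fn`, the time step, and the one-dimensional form of the relaxed diffusion, taken as the square root of the $\lambda$-average of $\sigma_k^2$.

```python
import numpy as np

def simulate_exit(x0, lam_fn, b, sigma, c, dt=1e-3, t_max=50.0, seed=0):
    """Euler-Maruyama path of a 1-d relaxed SDE on O = (0, 1).

    Returns the approximate exit time, the state at exit, and the discount
    factor at exit. b, sigma, c are arrays of per-action constants
    (a hypothetical model); lam_fn maps a state to a point of the simplex.
    """
    rng = np.random.default_rng(seed)
    x, t, log_gamma = x0, 0.0, 0.0
    while 0.0 < x < 1.0 and t < t_max:
        lam = lam_fn(x)
        drift = lam @ b                       # lam-averaged drift
        vol = np.sqrt(lam @ sigma**2)         # assumed 1-d relaxed diffusion
        log_gamma -= (lam @ c) * dt           # lam-averaged discount rate
        x += drift * dt + vol * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return t, x, np.exp(log_gamma)

# Hypothetical two-action model with a uniform feedback distribution.
t_exit, x_exit, gamma_exit = simulate_exit(
    0.5,
    lambda x: np.array([0.5, 0.5]),
    b=np.array([1.0, -1.0]),
    sigma=np.array([0.5, 1.0]),
    c=np.array([1.0, 1.0]),
)
```

Averaging the discounted running and exit rewards over many such paths would give a Monte Carlo estimate of the reward of a fixed feedback relaxed control, up to the time-discretization bias of the scheme.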

Now let $\rho : \mathbb{R}^K \to \mathbb{R} \cup \{\infty\}$ be a given exploration reward function satisfying $\rho < \infty$ on $\Delta_K$ (precise conditions will be specified in (H.2)). For any given relaxation parameter $\varepsilon > 0$, we consider the following value function: for each $x \in \overline{O}$,

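\[
v^\varepsilon(x) = \sup_{\pi \in \Pi_{\mathrm{ref}}} \sup_{\lambda \in \mathcal{M}_\pi} \mathbb{E}\Big[ \int_0^{\tau^{\lambda,x}} \Gamma^{\lambda,x}_t \Big( \sum_{k=1}^K f(X^{\lambda,x}_t, a_k)\,\lambda^k_t - \varepsilon\rho(\lambda_t) \Big)\,dt + g\big(X^{\lambda,x}_{\tau^{\lambda,x}}\big)\,\Gamma^{\lambda,x}_{\tau^{\lambda,x}} \Big]. \tag{3.2}
\]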

Note that the exploration reward function $\rho$ plays a crucial role in the above relaxed control regularization. If we set the exploration reward function $\rho \equiv 0$ or the relaxation parameter $\varepsilon = 0$, then one can show that Dirac measures supported on the optimal strategies of the original control problem (2.8) (see $\alpha^u$ defined as in (2.8)) are optimal control distributions of the relaxed control problem (3.2), and the value function $v$ in (2.3) will be equal to the value function $v^\varepsilon$ in (3.2) (see Theorems 6.1 and 6.4). Hence, to achieve the stability of the optimal control strategy for the relaxed control problem (3.2), we shall impose the following condition on the reward function $\rho$:

H.2. There exist a convex function $H \in C^2(\mathbb{R}^K)$ and a constant $c_0 > 0$, depending on $K$, such that for all $x, y \in \mathbb{R}^K$, we have $H(x) - c_0 \le \max_{k\in\mathcal{K}} x_k \le H(x)$ and $\rho(y) = \sup_{z\in\mathbb{R}^K} \big(z^T y - H(z)\big)$.

We remark that (H.2) is satisfied by most commonly used reward functions, including Shannon’s differential entropy proposed in [46, 25, 33, 21, 42]. We refer the reader to the discussion at the end of this section for a detailed comparison of different reward functions.
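As a worked instance of (H.2), consider the entropy case under the assumption (ours, for illustration) that $\rho$ is the negative Shannon entropy with natural logarithm on the simplex. The candidate $H$ is then the log-sum-exp function with $c_0 = \ln K$, and the two requirements of (H.2) can be checked directly:

```latex
\begin{align*}
H(x) &= \ln \sum_{k=1}^K e^{x_k}, \qquad c_0 = \ln K, \\
\max_{k} x_k &= \ln e^{\max_k x_k} \;\le\; H(x) \;\le\; \ln\big(K e^{\max_k x_k}\big) = \max_{k} x_k + \ln K, \\
\rho(y) &= \sup_{z \in \mathbb{R}^K} \big( z^T y - H(z) \big)
        = \sum_{k=1}^K y_k \ln y_k \quad \text{for } y \in \Delta_K \quad (\text{with } 0 \ln 0 = 0),
\end{align*}
```

and $\rho(y) = \infty$ for $y \notin \Delta_K$. The second line gives $H(x) - c_0 \le \max_k x_k \le H(x)$, and the third line is the classical convex conjugate of log-sum-exp, which takes values in $[-\ln K, 0]$ on $\Delta_K$.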

Given a function $H : \mathbb{R}^K \to \mathbb{R}$, we define for each $\varepsilon \ge 0$ the function $H_\varepsilon : \mathbb{R}^K \to \mathbb{R}$ such that for all $x = (x_1, \dots, x_K)^T \in \mathbb{R}^K$,

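\[
H_\varepsilon(x) =
\begin{cases}
\varepsilon H(\varepsilon^{-1} x), & \varepsilon > 0, \\
\max_{k\in\mathcal{K}} x_k, & \varepsilon = 0.
\end{cases} \tag{3.3}
\]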

Note that $(H_\varepsilon)_{\varepsilon\ge 0}$ are convex functions if $H$ is a convex function. The next lemma follows directly from (H.2) and standard arguments in convex analysis, whose proof will be given in Appendix A for completeness.

Lemma 3.2. Suppose (H.2) holds. Then:

(1) the function $\rho : \mathbb{R}^K \to \mathbb{R} \cup \{\infty\}$ is convex on $\mathbb{R}^K$, continuous relative to $\Delta_K$, and satisfies that $\rho(y) \in [-c_0, 0]$ for all $y \in \Delta_K$ and $\rho(y) = \infty$ for all $y \in (\Delta_K)^c$,

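(2) it holds for all $x \in \mathbb{R}^K$ and $\varepsilon > 0$ that $H_\varepsilon(x) - \varepsilon c_0 \le H_0(x) \le H_\varepsilon(x)$, $H_\varepsilon(x) = \max_{y\in\Delta_K} \big(y^T x - \varepsilon\rho(y)\big)$, and $(\nabla H_\varepsilon)(x) = \arg\max_{y\in\Delta_K} \big(y^T x - \varepsilon\rho(y)\big)$. Consequently, we have for all $x, y \in \mathbb{R}^K$ and $\varepsilon > 0$ that $|H_\varepsilon(x) - H_\varepsilon(y)| \le |x - y|$.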
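The three statements of part (2) can be checked numerically in the entropy case. The sketch below assumes that $H$ is the log-sum-exp function, so that $H_\varepsilon(x) = \varepsilon \ln \sum_k e^{x_k/\varepsilon}$, $c_0 = \ln K$, $\rho$ is the negative Shannon entropy on the simplex, and $\nabla H_\varepsilon$ is the softmax of $x/\varepsilon$. These closed forms and the helper names are illustrative assumptions rather than definitions taken from the text.

```python
import numpy as np

def H_eps(x, eps):
    """eps * log-sum-exp(x / eps): assumed form of H_eps in the entropy case."""
    m = np.max(x)
    return m + eps * np.log(np.sum(np.exp((x - m) / eps)))

def grad_H_eps(x, eps):
    """Gradient of H_eps, i.e. the softmax of x / eps (a point of the simplex)."""
    w = np.exp((x - np.max(x)) / eps)
    return w / w.sum()

def rho(y):
    """Negative Shannon entropy on the simplex, with 0 log 0 = 0."""
    y = np.asarray(y, dtype=float)
    safe = np.where(y > 0, y, 1.0)            # avoids log(0); contributes 0
    return float(np.sum(y * np.log(safe)))

rng = np.random.default_rng(1)
K, eps = 4, 0.3
x, y = rng.normal(size=K), rng.normal(size=K)
lam = grad_H_eps(x, eps)

gap = H_eps(x, eps) - np.max(x)               # should lie in [0, eps * log K]
attained = lam @ x - eps * rho(lam)           # value of the objective at lam
lipschitz_ok = abs(H_eps(x, eps) - H_eps(y, eps)) <= np.linalg.norm(x - y)
```

Here `attained` agrees with `H_eps(x, eps)` up to rounding, reflecting that the softmax distribution is the maximizer in the variational formula, and `gap` shrinks linearly in `eps`, in line with the first-order bound $0 \le H_\varepsilon - H_0 \le \varepsilon c_0$.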

We proceed to study the corresponding HJB equation of the relaxed control problem (3.2), which plays a crucial role in our subsequent analysis. For each $\lambda = (\lambda_1, \dots, \lambda_K)^T \in \Delta_K$, let $f^\lambda : O \to \mathbb{R}$ be the function satisfying for all $x \in O$ that $f^\lambda(x) = \sum_{k=1}^K f(x, a_k)\,\lambda_k = \lambda^T f(x)$ (with $f$ defined as in (2.6)), and $L^\lambda$ be the elliptic operator satisfying for all $\phi \in C^2(O)$ and $x \in O$ that

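\[
L^\lambda\phi(x) = \tfrac{1}{2}\big(\tilde\sigma\tilde\sigma^T\big)^{ij}(x, \lambda)\,\partial_{ij}\phi(x) + \tilde b^i(x, \lambda)\,\partial_i\phi(x) - \sum_{k=1}^K c(x, a_k)\,\lambda_k\,\phi(x) = \lambda^T L\phi(x), \tag{3.4}
\]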

where we have used the definition of the elliptic operators $L = (L_k)_{k\in\mathcal{K}}$ (cf. (2.7)), and the definition of the functions $\tilde b$ and $\tilde\sigma$ (cf. Lemma 3.1).

Since the diffusion coefficient of SDE (3.1) is non-degenerate (see Lemma 3.1) and all coefficients of the relaxed control problem (3.2) are continuous on $\overline{O} \times \Delta_K$, a formal application of the dynamic programming principle (see e.g. [19, 13] and references within) enables us to associate the relaxed control problem (3.2) with the following HJB equation:

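\[
\max_{\lambda \in \Delta_K} \big( L^\lambda u(x) + f^\lambda(x) - \varepsilon\rho(\lambda) \big) = 0, \quad x \in O; \qquad u(x) = g(x), \quad x \in \partial O.
\]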

Moreover, (3.4) and Lemma 3.2(2) imply that the above Dirichlet problem is equivalent to

\[
H_\varepsilon(Lu + f) = 0 \ \text{ in } O, \qquad u = g \ \text{ on } \partial O, \tag{3.5}
\]

where the function $H_\varepsilon$ is defined as in (3.3), and $L$, $f$ are defined as those in (2.6).

In order to rigorously justify the connection between (3.2) and (3.5), we establish the well-posedness of classical solutions to (3.5) in Theorem 3.4, and then prove a verification result in Theorem 3.5.

We need the following proposition, which gives an a priori estimate of classical solutions to (3.5). We postpone the proof to Appendix A, which adapts the technique in [16, Theorem 7.5 on p. 127] to HJB equations with compact control sets, and reduces the problem to an a priori estimate for HJB equations involving only principal terms.

Proposition 3.3. Suppose (H.1) and (H.2) hold, and let  M = supi,j,k |σijk |0;O. Then there exists a constant  β0 = β0(n, ν, M) ∈ (0, 1), such that it holds for all  β ∈ (0, min(β0, θ)] that, if uε ∈ C2,β(O)is a solution to the Dirichlet problem (3.5) with parameter  ε > 0, then uε satisfies the estimate that  |uε|2,β ≤ C(|g|2,β + εc0 + 1), where the constant C depends only on  n, ν, Λ, β and O.

Theorem 3.4. Suppose (H.1) and (H.2) hold, let ε > 0 and M = supi,j,k |σijk |0;O. Then the Dirichlet problem (3.5) admits a unique solution uε ∈ C(O) ∩ C2(O). Moreover, there exists

image

Proof. One can deduce by similar arguments as those for Theorem 2.1 and the classical maximum principle that (3.5) admits a unique classical solution in  C(O) ∩ C2(O). Moreover, by using the a priori bound of classical solutions in Proposition 3.3, we can establish the existence and regularity of the classical solution  uε to (3.5) based on the method of continuity; see [16, Theorem 5.1 on p. 116].

Now let uε ∈ C2,β(O) be the solution to (3.5) with some β ∈ (0, θ]. The continuity of Lλ, f λ and ρ on ∆K, and Lemma 3.2(2) ensure that the function λuε is well-defined on O, and has the expression λuε = (∇Hε)(Luε + f). Note that it holds for any given φ1, φ2 ∈ Cβ(O) that φ1φ2 ∈ Cβ(O). Hence the Hölder continuity of the coefficients (see (H.1)) implies that Luε + f ∈ Cβ(O, RK). We can then easily deduce from the local Lipschitz continuity of ∇Hε : RK → RK that λuε ∈ Cβ(O, RK).

The next theorem shows that the function (3.6) is an optimal feedback control of (3.2), which is defined similarly to Definition 2.2. The proof of this statement is similar to that of Theorem 2.2 and hence omitted.

Theorem 3.5. Suppose (H.1) and (H.2) hold. Let ε > 0, vε : O → R be the value function defined as in (3.2), uε ∈ C(O) ∩ C2(O) be the solution to the Dirichlet problem (3.5), and λuε : O → ∆K be the function defined as in (3.6). Then uε(x) = vε(x) for all x ∈ O, and λuε is an optimal feedback control of (3.2).

Remark 3.1. Theorem 3.4 shows that the feedback control λuε is uniquely defined and Hölder continuous. This improved regularity makes it easier to implement the relaxed control λuε in practice, compared to the original (merely measurable) feedback control αu (cf. Theorem 2.1).

We end this section with a remark about possible choices of reward functions. Generally speaking, we shall choose a reward function ρ whose generating function H and its gradient ∇H can be efficiently evaluated, such that one can design an efficient algorithm to solve the relaxed control problem (3.2) (see e.g. [46, 25, 33, 21, 26]). A common choice of reward functions in the literature is the following entropy-type reward function (see e.g. [28, 35, 36, 42]):

image

whose generating function is $H_{\mathrm{en}}(x) = \ln \sum_{k=1}^{K} \exp(x_k)$, $x \in \mathbb{R}^K$. One can show that Hen ∈ C∞(RK) ∩ C2,1(RK), and it satisfies (H.2) with c0 = ln K (see e.g. [36]).
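As a numerical sanity check of Lemma 3.2(2) in the entropy case, the sketch below evaluates the scaled generating function and its gradient at a random point. It assumes the scaling Hε(x) = εH(x/ε) for (3.3) (consistent with the identity ∇Hε(x) = ∇H(x/ε) used later in the paper) and the entropy form ρen(y) = Σk yk ln yk for the reward; neither display is reproduced in this section, so both are labelled assumptions, and all function names are illustrative.

```python
import numpy as np

def h_en(x):
    """Entropy generating function H_en(x) = ln(sum_k exp(x_k)), with a max shift."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def grad_h_en(x):
    """Gradient of H_en (the softmax of x); it always lies in the simplex."""
    w = np.exp(x - np.max(x))
    return w / np.sum(w)

def h_eps(x, eps):
    """Scaled function H_eps(x) = eps * H_en(x / eps) (assumed form of (3.3))."""
    return eps * h_en(x / eps)

def rho_en(y):
    """Assumed entropy reward rho_en(y) = sum_k y_k ln y_k on the simplex (0 ln 0 := 0)."""
    y = np.asarray(y, dtype=float)
    pos = y > 0
    return float(np.sum(y[pos] * np.log(y[pos])))

rng = np.random.default_rng(0)
K, eps = 4, 0.3
x = rng.normal(size=K)

lam = grad_h_en(x / eps)             # candidate maximiser of y^T x - eps * rho_en(y)
gap = h_eps(x, eps) - np.max(x)      # H_eps(x) - H_0(x), expected in [0, eps * ln K]
dual = lam @ x - eps * rho_en(lam)   # objective evaluated at the candidate maximiser
```

Under these assumptions `gap` lies between 0 and ε ln K, and `dual` agrees with `h_eps(x, eps)` up to rounding, which is the statement that the softmax attains the maximum in the representation of Hε.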

The advantage of the entropy reward function is that both Hen and ∇Hen are given in closed form, and they can be naturally extended to continuous action spaces A (see e.g. [42]). However, it is important to notice that the evaluation of Hen and ∇Hen involves exponentials. Hence, when the relaxation parameter ε is small, a naive implementation of iterative algorithms for solving (3.5), which in general involves evaluating the value and inverse of Hen and ∇Hen at a large argument z = (Luε(x) + f(x))/ε ∈ RK with x ∈ O, may lead to unreliable results due to unstable floating-point arithmetic; see [10, Example 4.2] and [11] for more details. Moreover, since ∇Hen(x) ∈ (0, 1)K for all x ∈ RK, the optimal relaxed control of (3.2) may converge to the optimal control of (2.3) at a very slow rate as the relaxation parameter ε tends to zero.
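The floating-point issue can be illustrated with a minimal sketch, again assuming the scaling Hε(x) = εHen(x/ε). A direct evaluation overflows in double precision once x/ε exceeds roughly 709, whereas subtracting the largest entry inside the exponentials (the standard log-sum-exp shift, which is a generic remedy and not a technique taken from the paper) keeps the value finite. The function names are hypothetical.

```python
import numpy as np

def h_eps_naive(x, eps):
    """Direct evaluation of eps * ln(sum_k exp(x_k / eps)); overflows for small eps."""
    with np.errstate(over="ignore"):
        return eps * np.log(np.sum(np.exp(x / eps)))

def h_eps_stable(x, eps):
    """Same quantity after subtracting max(x) inside the exponentials."""
    m = np.max(x)
    return m + eps * np.log(np.sum(np.exp((x - m) / eps)))

x = np.array([1.0, 0.5, -2.0])
eps = 1e-3
naive = h_eps_naive(x, eps)     # exp(1000.0) is not representable in double precision
stable = h_eps_stable(x, eps)   # stays within eps * ln(3) of max(x)
```

The shift removes the overflow in the value of Hε, but it does not address the slow convergence of the entropy-based control discussed above: in exact arithmetic the softmax never reaches a vertex of the simplex.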

Alternatively, by virtue of the fact that only the generating function H and its gradient are involved in the HJB equation (3.5) and the feedback control (3.6), we can also obtain a reward function  ρby directly constructing a K-dimensional function H based on a recursive application of smoothing functions for the two-dimensional max function. For instance, we can start with the following two-dimensional smoothing functions (see e.g. [15, 45]): for  x = (x1, x2)T ∈ R2,

image

Then, for any given K ≥ 3, by using the fact that maxk∈K xk = max(maxi∈K1 xi, maxj∈K2 xj), with K1 = {1, . . . , K0}, K2 = {K0+1, . . . , K} and K0 = ⌊(K+1)/2⌋, we can express the K-dimensional max function as a nested application of the two-dimensional max function and one-dimensional identity function. Hence, by replacing the two-dimensional max function with the two-dimensional smoothing function (3.7) (resp. (3.8)) in the recursive expression, we can obtain the K-dimensional smoothing function Hchks ∈ C∞(RK) ∩ C2,1(RK) (resp. Hzang ∈ C2,1(RK)). It has been shown in [10, Lemma 3.3] that for any given K ≥ 2, both functions Hchks and Hzang satisfy (H.2) with c0 = (log2(K − 1) + 1)/2 for Hchks, and c0 = 3(log2(K − 1) + 1)/32 for Hzang.

Note that the evaluation of Hchks, Hzang and their gradients only involves square roots and multiplications, hence they are numerically more stable than the entropy-type smoothing Hen (see [10]). More importantly, since Hzang only modifies the function H0 locally near the non-differentiable points, we can determine the optimal control of (2.3) precisely from the optimal control of (3.2) without sending the relaxation parameter ε to zero (see Theorem 6.4 and Remark 6.2 for details).

Figure 1 compares the functions Hen, Hzang : R3 → R and the reward functions generated by them. One can clearly see from Figure 1 (left) that Hen substantially modifies the pointwise maximum function H0 everywhere, while Hzang only performs a modification of H0 locally near the kinks. For both functions, the difference from H0 peaks around the points where arg maxk∈K xk is not a singleton. Such points correspond to the regions where the agent of the control problem (2.3) cannot make a clear decision based on the current model, since two or more different actions would result in a very similar reward.

Figure 1 (right) depicts the reward functions ρen(y1, y2, y3) and ρzang(y1, y2, y3) with y3 = 1 − y1 − y2, for all (y1, y2) ∈ C := {(y1, y2) ∈ R2 | 0 ≤ y1, y2 ≤ 1, y1 + y2 ≤ 1}. The point (1/3, 1/3, 1/3) corresponds to the pure exploration strategy, i.e., the uniform distribution on the action space A = {a1, a2, a3}, while the vertices of C correspond to the pure exploitation strategies, i.e., the Dirac measures supported on some ai ∈ A. Both functions achieve their minimum around the point (1/3, 1/3, 1/3), which indicates that the exploration reward functions encourage the controller of the relaxed control problem to explore further, especially when it is difficult to choose a unique optimal action based on the current model.

Note that, by comparing the values of the reward functions near the point (1/3, 1/3, 1/3) and near the vertices of C, we see that ρen in general gives more rewards for exploration than ρzang. Consequently, to recover the value function and optimal control of (2.3), we have to take a smaller relaxation parameter for the relaxed problem with ρen than for the one with ρzang, which could cause numerical instability due to the exponentials in Hen and ∇Hen (see e.g. [10]).
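The recursive construction can be sketched as follows. The splitting K0 = ⌊(K+1)/2⌋ is taken from the text; the two-dimensional smoothing `chks2` is an assumed CHKS-type form, (a + b + sqrt((a − b)² + 1))/2, chosen because it reproduces the stated constant c0 = 1/2 for K = 2, and it may differ from the exact display (3.7), which is not reproduced here. The helper names are hypothetical.

```python
import math
import numpy as np

def chks2(a, b):
    """Assumed two-dimensional CHKS-type smoothing of max(a, b)."""
    return 0.5 * (a + b + math.sqrt((a - b) ** 2 + 1.0))

def nested_max(x, h2):
    """Apply h2 recursively over K1 = {1..K0}, K2 = {K0+1..K}, K0 = floor((K+1)/2)."""
    K = len(x)
    if K == 1:
        return float(x[0])              # one-dimensional identity
    K0 = (K + 1) // 2
    return h2(nested_max(x[:K0], h2), nested_max(x[K0:], h2))

rng = np.random.default_rng(1)
K = 5
x = rng.normal(scale=3.0, size=K)

exact = nested_max(x, max)              # nesting the true max recovers max_k x_k
smooth = nested_max(x, chks2)           # K-dimensional smoothing (assumed CHKS form)
c0 = (math.log2(K - 1) + 1) / 2         # constant quoted from [10, Lemma 3.3]
```

With the assumed two-dimensional function each nesting level adds at most 1/2 to the maximum, so the deviation `smooth - exact` stays within [0, c0], in line with the constant quoted from [10].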

image

Figure 1: Comparison of Hen and Hzang and their corresponding reward functions for K = 3.

In this section, we shall fix a relaxation parameter ε > 0 and study the robustness of the feedback control strategy (3.6) for a relaxed control problem associated with a perturbed model. In particular, we shall show that the control strategy (3.6) admits a (locally) Lipschitz continuous dependence on the perturbation of the coefficients, if the reward function is generated by a function H with locally Lipschitz continuous Hessian.

We start by presenting two technical results, which are essential for our subsequent analysis. The first one is due to Nugari [34], which establishes the regularity of Nemytskij operators in Hölder spaces.

Lemma 4.1. Let n, K ∈ N, α ∈ (0, 1], O ⊂ Rn be an open bounded set, φ : RK → R be a continuously differentiable function, and Φ : u ∈ Cα(O, RK) ↦ Φ[u] ∈ Cα(O) be the Nemytskij operator satisfying for all u = (u1, . . . , uK) that Φ[u](x) = φ(u(x)), x ∈ O. Then Φ is well-defined, continuous and bounded. Moreover, if we further suppose ∇φ is locally Lipschitz continuous (resp. φ is twice continuously differentiable), then Φ is locally Lipschitz continuous (resp. continuously differentiable with the Fréchet derivative Φ′[u] = (∇φ)T (u) for all u ∈ Cα(O, RK)).

Remark 4.1. Lemma 4.1 enables us to view the fully nonlinear HJB operator Fε in (3.5) and the value-to-action map uε ↦ λuε defined in (3.6) as differentiable maps between suitable Hölder spaces, which is essential for the sensitivity analysis on the value functions and feedback relaxed controls in Section 5.

Note that in general it is not possible to perform the same first-order sensitivity analysis by interpreting the HJB operator Fε as a map between the Sobolev space W2,p(O) and the Lebesgue space Lq(O). In fact, since the operator Fε : W2,p(O) → Lq(O) in general is only differentiable with p > q (see [40, Theorem 13]), we see the derivative of Fε, which is a second-order linear elliptic operator, is not bijective between W2,p(O) and Lq(O). Consequently, we cannot apply the implicit function theorem to derive the sensitivity equation for the value function (3.2) as in Theorem 5.2.

If the operator Fε is only semilinear, i.e., the diffusion coefficient of (2.2) is uncontrolled, then one can show that Fε is differentiable between W2,p(O) and Lp(O) for 1 < p < ∞, and its derivative is a bijection between the same spaces (see [26] for the case with p = 2). In this case, we can extend Theorem 5.2 and study Lp-perturbation of the coefficients in (3.2).

Now we proceed to introduce a relaxed control problem with a set of perturbed coefficients satisfying the following conditions:

H.3. Let ν > 0, θ ∈ (0, 1] be the constants in (H.1), and Λ′ > 0 be a constant. The functions ˆb : Rn × A → Rn, ˆσ : Rn × A → Rn×n, ˆc : O × A → [0, ∞), ˆf :

image

Let ε > 0 be a fixed relaxation parameter. We shall consider a perturbed control problem (2.3) with the coefficients (ˆb, ˆσ, ˆc, ˆf, ˆg), and its relaxation (see (3.2)) with parameter ε, whose value function is denoted as ˆvε. Then, by using Lemma 3.2, Theorems 3.4 and 3.5, one can verify that, under (H.2) and (H.3), the value function ˆvε is the classical solution ˆuε ∈ C(O) ∩ C2(O) of the following Dirichlet problem:

image

where the function Hε is defined as in (3.3), ˆf : O → RK is the function satisfying ˆf(x) = ( ˆf(x, ak))k∈K for all x ∈ O, and ˆL = ( ˆLk)k∈K is a family of elliptic operators satisfying for all

image

Moreover, we can deduce from (3.6) that, the optimal feedback control of the perturbed relaxed control problem is given by

image

Note that Theorem 3.4 shows that the classical solution ˆuε of (4.1) is in C2,β(O) for some β > 0, so the above function ˆλˆuε is well-defined on ∂O.

The following result shows the (local) Lipschitz dependence of ˆuε − uε and ˆλˆuε − λuε on perturbation of the coefficients, which demonstrates the robustness of the relaxed control problem. For notational simplicity, given the functions (b, σ, c, f, g) and (ˆb, ˆσ, ˆc, ˆf, ˆg) satisfying (H.1) and (H.3) respectively, we shall introduce for each β ∈ (0, θ] the following measurement of perturbations:

image

image

Proof. Throughout this proof, we shall denote by C a generic constant, which depends only on ε, n, K, ν, Λ, Λ′, β, c0, Mg and O, and may take a different value at each occurrence.

The a priori estimate in Proposition 3.3 shows that there exists a constant β0 = β0(n, ν, M) ∈ (0, 1), such that we have for all β ∈ (0, min(β0, θ)] the estimates |uε|2,β, |ˆuε|2,β ≤ C. Moreover, we have by the fundamental theorem of calculus that

image

in O, where η : O → ∆K is the function defined as $\eta := \int_0^1 (\nabla H_\varepsilon)\bigl(s(Lu^\varepsilon + f) + (1-s)(\hat{L}\hat{u}^\varepsilon + \hat{f})\bigr)\,ds$.

Now let β ∈ (0, min(β0, θ)] be a fixed constant. The fact that ∇Hε ∈ C1(RK, ∆K) (see (H.2)), the Hölder continuity of coefficients (see (H.1) and (H.3)), and the a priori estimates of |uε|2,β and |ˆuε|2,β yield the estimate that |η|β ≤ C (see Lemma 4.1). Then, by setting w = uε − ˆuε ∈ C2,β(O), we can deduce from (4.5) that w is the classical solution to the following Dirichlet problem:

image

Hence the fact that  η ∈ Cβ(O, ∆K) and the global Schauder estimate in [22, Theorem 6.6] lead us to the estimate that

image

which, together with the maximum principle (see [22, Theorem 3.7]) and the a priori estimate of |ˆuε|2,β, enables us to conclude that:

image

with the constant Eper,β defined as in (4.3). Now we show the stability of feedback controls. Note that (4.4) implies that

image

The additional assumption that H : RK → R in (H.2) has a locally Lipschitz continuous Hessian implies that ∇Hε is differentiable with locally Lipschitz continuous derivatives, which along with Lemma 4.1 shows that the Nemytskij operator ∇Hε : Cβ(O, RK) → Cβ(O, RK) is locally Lipschitz continuous. Hence there exists a constant C, such that for all perturbed coefficients (ˆb, ˆσ, ˆc, ˆf, ˆg) satisfying (H.3), we have

image

which finishes the desired (local) Lipschitz estimate.

Remark 4.2. The assumption that H : RK → R in (H.2) has a locally Lipschitz continuous Hessian is satisfied by most commonly used functions, including Hen, Hchks and Hzang given in Section 3. In general, if H is merely twice continuously differentiable as in (H.2), we can follow a similar argument and establish that the Hölder norm of the difference between two relaxed control strategies is continuously dependent on the Hölder norms of the perturbations in the coefficients.

Note that the Lipschitz stability result (4.4) in general does not hold for the original control problem (2.3) (or equivalently, ε = 0 in (3.2)). In fact, for any given β ∈ (0, 1), [18, Theorem 2] shows that the Nemytskij operator f ∈ (Cβ(O))K ↦ H0(f) ∈ Cβ(O) is not continuous, which implies that there exists (fm)m∈N∪{∞} ⊂ (Cβ(O))K such that limm→∞ |fm − f∞|β = 0 and |H0(fm) − H0(f∞)|β ≥ 1 for all m ∈ N. Now for each m ∈ N ∪ {∞}, we consider the following simple HJB equation (2.6): ∆um + H0(fm) = 0 in O and um = 0 on ∂O. Hence we have |∆(um − u∞)|β = |H0(fm) − H0(f∞)|β ≥ 1 for all m ∈ N, which implies that the C2,β-norm of the value function (2.3) does not depend continuously on the Cβ-perturbation of the model parameters. See Theorem 5.4 for a precise quantification of the ε-dependence in (4.4).

The remaining part of this section is devoted to an important application of Theorem 4.2, where we shall examine the performance of the control strategy λuε, computed based on the relaxed control problem with the original coefficients (b, σ, c, f, g) (see (3.6)), on a new relaxed control problem with perturbed coefficients satisfying (H.3).

We first observe that, if there exists a classical solution  uε ∈ C(O) ∩ C2(O) to the following

image

with ˆL and ˆf defined as in (4.1), then by using Itô's formula, one can easily show that the reward function vε, resulting from implementing the Hölder continuous feedback control λuε in the relaxed control problem with the coefficients (ˆb, ˆσ, ˆc, ˆf, ˆg), coincides with the function uε (see e.g. Theorems 2.2 and 3.5). On the other hand, we have seen that the (optimal) value function ˆvε of the perturbed relaxed control problem is the classical solution ˆuε to (4.1). Hence it suffices to compare the classical solutions to (4.6) and (4.1).

The following proposition shows that (4.6) indeed admits a unique classical solution.

image

O → ∆K be the function defined as in (3.6), and β0 = β0(n, ν, M) ∈ (0, 1) be the constant in Proposition 3.3. Then the Dirichlet problem (4.6) admits a unique solution uε ∈ C2,min(β0,θ)(O).

image

We are ready to show that the difference between this suboptimal reward function vε and the (optimal) value function ˆvε of the perturbed relaxed control problem depends Lipschitz continuously on the magnitude of perturbations in the coefficients.

image

Hessian, then there exists β0 = β0(n, ν, M) ∈ (0, 1), such that for all β ∈ (0, min(β0, θ)], we have the estimate |ˆuε − uε|2,β ≤ CEper,β, with the constant Eper,β defined as in (4.3), and a constant

image

and C be a generic constant, which depends only on ε, n, K, ν, Λ, Λ′, β, c0, Mg and O, and may

image

which, together with the fact that ˆuε = uε = ˆg on ∂O and the classical maximum principle (see [22, Theorem 3.7]), shows that ˆuε ≥ uε on O.

We now estimate ˆuε − uε by assuming the function H : RK → R in (H.2) has a locally Lipschitz continuous Hessian. By using the definition of the optimal control ˆλˆuε, we have that

image

By subtracting (4.6) from the above equation, we have

image

Note that the a priori estimate in Proposition 3.3 shows that, under (H.1), (H.2) and (H.3), there exists a constant β0 = β0(n, ν, M) ∈ (0, 1), such that we have for all β ∈ (0, min(β0, θ)] the estimates |uε|2,β, |ˆuε|2,β ≤ C, which, along with the fact that ∇Hε ∈ C1(RK) and Lemma 4.1, implies the a priori bounds |ˆλˆuε|β, |λuε|β ≤ C. Hence, for any given β ∈ (0, min(β0, θ)], we can deduce from the Schauder theory in [22, Theorem 6.6] and the maximum principle in [22, Theorem 3.7] that

image

By using the additional assumption that H has a locally Lipschitz continuous Hessian, and the identity (4.7), we can deduce that ρ(∇Hε) : RK → R is continuously differentiable with a locally Lipschitz continuous gradient, from which we can obtain from Lemma 4.1 that for any α ∈ (0, 1], the corresponding Nemytskij operator (ερ)(∇Hε) : Cα(O, RK) → Cα(O, R) is locally Lipschitz continuous. Hence, we can obtain from (4.8) and the definitions of λuε and ˆλˆuε (see (3.6) and (4.2)) that

image

from which we can conclude from the a priori bound of |ˆuε|2,β and Theorem 4.2 the desired estimate |ˆuε − uε|2,β ≤ CEper,β.

In this section, we proceed to derive a first-order Taylor expansion for the value function and the optimal control of the relaxed control problem (3.2) with perturbed coefficients, which subsequently leads us to a first-order approximation of the optimal strategy for the perturbed problem based on the pre-computed optimal control. The sensitivity equation further enables us to quantify the explicit dependence of the Lipschitz stability result (4.4) on the relaxation parameter  ε.

The following proposition establishes the Fréchet differentiability of the fully nonlinear HJB operator with inhomogeneous boundary conditions. For notational simplicity, for any given β ∈ (0, 1], and bounded open subset O ⊂ Rn with C2,β boundary, we shall introduce the Banach space Θβ for the coefficients:

image

equipped with the product norm |·|Θβ, and denote by ϑ = ((ak, bk, ck, fk)k∈K, g) a generic element in Θβ. We also denote by C2,β(∂O) the Banach space of C2,β functions defined on ∂O (see Remark 2.1), and by τD : C2,β(O) → C2,β(∂O) the restriction operator on ∂O. Furthermore, for any given Banach spaces X and Y, we denote by B(X, Y) the Banach space containing all continuous linear mappings from X into Y, equipped with the operator norm.

Proposition 5.1. Suppose (H.2) holds. Let ε > 0, β ∈ (0, 1], O be a bounded domain in Rn with C2,β boundary, Hε : RK → R be the function defined as in (3.3), Θβ be the Banach space defined as in (5.1), and F β : Θβ × C2,β(O) → Cβ(O) × C2,β(∂O) be the following HJB operator:

image

where for any given ϑ = ((ak, bk, ck, fk)k∈K, g) ∈ Θβ, f ϑ = (fk)k∈K ∈ Cβ(O)K, gϑ = g and Lϑ = (Lϑk)k∈K is the family of elliptic operators satisfying $L^{\vartheta}_k\phi = a^{ij}_k\partial_{ij}\phi + b^{i}_k\partial_i\phi - c_k\phi$ for all k ∈ K, φ ∈ C2(O).

C2,β(O), Cβ(O) × C2,β(∂O)) satisfying for all (ϑ, u) ∈ Θβ × C2,β(O), ˜ϑ ∈ Θβ and v ∈ C2,β(O) that

image

Proof. We first write the HJB operator as F β = (F1, F2), where F1 : Θβ × C2,β(O) → Cβ(O) is the composition of the Nemytskij operator Hε : Cβ(O)K → Cβ(O) and the mapping G : (ϑ, u) ∈ Θβ × C2,β(O) ↦ G[ϑ, u] := Lϑu + f ϑ ∈ Cβ(O)K, and F2 : (ϑ, u) ∈ Θβ × C2,β(O) ↦ F2[ϑ, u] := τD(u − gϑ) ∈ C2,β(∂O) is the linear boundary operator.

Since the function Hε is in C2(RK), we can deduce from Lemma 4.1 that the Nemytskij operator Hε : Cβ(O)K → Cβ(O) is well-defined and continuously differentiable with the Fréchet derivative (Hε)′[u] = (∇Hε)T (u) ∈ B(Cβ(O)K, Cβ(O)) for all u ∈ Cβ(O)K.

Moreover, since for any given (ϑ, u) ∈ Θβ × C2,β(O), G[·, u] : Θβ → Cβ(O)K and G[ϑ, ·] : C2,β(O) → Cβ(O)K are affine mappings, one can easily compute the partial derivatives ∂uG : Θβ × C2,β(O) → B(C2,β(O), Cβ(O)K) and ∂ϑG : Θβ × C2,β(O) → B(Θβ, Cβ(O)K) of G as follows: (∂uG)[ϑ, u](v) = Lϑv and (∂ϑG)[ϑ, u](˜ϑ) = L˜ϑu + f˜ϑ for all (ϑ, u) ∈ Θβ × C2,β(O), ˜ϑ ∈ Θβ and v ∈ C2,β(O). Moreover, it is clear that ∂uG and ∂ϑG are both continuous, which implies that G : Θβ × C2,β(O) → Cβ(O)K is continuously differentiable with derivative

image

for all (ϑ, u) ∈ Θβ × C2,β(O), ˜ϑ ∈ Θβ and v ∈ C2,β(O) (see [17, Theorem 7.2-3]).

Therefore, by using the chain rule (see [17, Theorem 7.1-3]), we see the composite mapping F1 : Θβ × C2,β(O) → Cβ(O) is also continuously differentiable with the derivative F′1[ϑ, u] = (Hε)′[G[ϑ, u]]G′[ϑ, u] for all (ϑ, u) ∈ Θβ × C2,β(O). This, along with the fact that F2 : Θβ × C2,β(O) → C2,β(∂O) is a linear operator, enables us to conclude the desired differentiability of the operator F β = (F1, F2).

With the above proposition in hand, we are ready to derive the first-order sensitivity equation for the value function of the relaxed control problem with respect to the parameter perturbations.

Theorem 5.2. Suppose (H.1) and (H.2) hold. Let ε > 0, (Θβ)β∈(0,1] be the Banach spaces defined as in (5.1), ϑ0 = ((σkσTk /2, bk, ck, fk)k∈K, g), uε ∈ C(O) ∩ C2(O) be the solution to the Dirichlet problem (3.5) (with the coefficients ϑ0), and β0 ∈ (0, 1) be the constant in Proposition 3.3.

Then it holds for each β ∈ (0, min(β0, θ)] that there exists a neighborhood V of ϑ0 in Θβ, a neighborhood W of uε in C2,β(O), and a mapping S : V → W satisfying the following properties:

image

(2) S : V → W is continuously differentiable with S[ϑ0 + δϑ] = uε + S′[ϑ0]δϑ + o(|δϑ|Θβ) as |δϑ|Θβ → 0, and for each δϑ ∈ Θβ, δu = S′[ϑ0]δϑ ∈ C2,β(O) is the solution to the following Dirichlet problem:

image

Proof. The desired result comes from a direct application of the implicit function theorem (see [17, Theorem 7.13-1]). Theorem 3.4 shows that the Dirichlet problem (3.5) with the coefficients ϑ0 admits a solution uε ∈ C2,β(O) for each β ∈ (0, min(β0, θ)].

Let β ∈ (0, min(β0, θ)] be a fixed constant. We shall consider the mapping F β : Θβ × C2,β(O) → Cβ(O) × C2,β(∂O) defined as follows:

image

Due to the fact that uε ∈ C2,β(O) satisfies (3.5) with the coefficients ϑ0, we have Hε(Lϑ0uε + f ϑ0) = 0 in O and Hε(Lϑ0uε + f ϑ0) ∈ Cβ(O), which subsequently implies, by continuity, that Hε(Lϑ0uε + f ϑ0) = 0 on the closure of O. The boundary condition of (3.5) implies that τD(uε − gϑ0) = 0 in C2,β(∂O). Hence F β[ϑ0, uε] = 0.

Proposition 5.1 shows that  F β is continuously differentiable on Θβ × C2,β(O), and for each (˜ϑ, v) ∈ Θβ × C2,β(O),

image

where we have used the definition of λuε ∈ Cβ(O, ∆K) (see (3.6)). The classical maximum principle (see e.g. [22, Theorem 3.7]) implies that the map ∂uF β[ϑ0, uε](·) : C2,β(O) → Cβ(O) × C2,β(∂O) is an injection. We now show it is also a surjection. Let ( ˆf, ˆg) ∈ Cβ(O) × C2,β(∂O) be given. Then the assumption that ∂O ∈ C2,β enables us to apply [22, Lemma 6.38] and extend ˆg to a function in C2,β(O), which is still denoted by ˆg. The fact that λuε ∈ Cβ(O, ∆K) (see Theorem 3.4) and the elliptic regularity theory (see [22, Theorem 6.14]) ensure that the Dirichlet problem ∂uF β[ϑ0, uε](v) = ( ˆf, ˆg) admits a unique solution v ∈ C2,β(O). Hence we see ∂uF β[ϑ0, uε] : C2,β(O) → Cβ(O) × C2,β(∂O) is a bijection.

Therefore, the implicit function theorem (see [17, Theorem 7.13-1]) ensures the existence of S ∈ C1(V, W) with derivative S′[ϑ0] = −(∂uF β[ϑ0, uε])−1∂ϑF β[ϑ0, uε] ∈ B(Θβ, C2,β(O)). Hence we have S[ϑ0 + δϑ] = uε + S′[ϑ0]δϑ + o(|δϑ|Θβ) as |δϑ|Θβ → 0. Let δϑ ∈ Θβ and δu = S′[ϑ0]δϑ; the characterization of partial derivatives of F β enables us to conclude that δu satisfies (5.2).
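The formula S′[ϑ0] = −(∂uF β)−1∂ϑF β can be illustrated on a small discrete analogue. The sketch below is a toy example and not the paper's method: it assumes a one-dimensional domain O = (0, 1) with zero boundary data, K = 2 actions with made-up constant coefficients, central finite differences, the entropy choice Hε(z) = ε ln Σk exp(zk/ε), and a plain Newton iteration. Only the running reward f is perturbed, so the linearised equation reduces to Σk λk (Lk δu + δfk) = 0 with λ the computed feedback relaxed control. All names are hypothetical.

```python
import numpy as np

def softmax_rows(z):
    w = np.exp(z - z.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

def h_eps_rows(z, eps):
    """Row-wise eps * ln(sum_k exp(z_k / eps)), computed with a max shift."""
    m = z.max(axis=1)
    return m + eps * np.log(np.exp((z - m[:, None]) / eps).sum(axis=1))

def operators(n, a, b, c):
    """Central-difference matrices for L_k u = a_k u'' + b_k u' - c_k u on (0, 1)."""
    h = 1.0 / (n + 1)
    off = np.ones(n - 1)
    D2 = (np.diag(-2.0 * np.ones(n)) + np.diag(off, 1) + np.diag(off, -1)) / h**2
    D1 = (np.diag(off, 1) - np.diag(off, -1)) / (2.0 * h)
    return [ak * D2 + bk * D1 - ck * np.eye(n) for ak, bk, ck in zip(a, b, c)]

def solve_hjb(L, f, eps, tol=1e-11, max_iter=50):
    """Newton iteration for H_eps(L u + f) = 0 at the interior grid points."""
    u = np.zeros(f.shape[0])
    for _ in range(max_iter):
        z = np.stack([Lk @ u for Lk in L], axis=1) + f       # (n, K): L_k u + f_k
        res = h_eps_rows(z, eps)
        lam = softmax_rows(z / eps)                           # discrete relaxed control
        J = sum(lam[:, [k]] * L[k] for k in range(len(L)))   # sum_k lam_k L_k
        if np.max(np.abs(res)) < tol:
            break
        u = u - np.linalg.solve(J, res)
    return u, lam, J

n, eps = 20, 0.5
x = np.linspace(0.0, 1.0, n + 2)[1:-1]
L = operators(n, a=(1.0, 0.5), b=(0.0, 1.0), c=(0.0, 1.0))
f = np.stack([np.sin(np.pi * x), 1.0 - x], axis=1)
df = np.stack([x * (1.0 - x), np.cos(np.pi * x)], axis=1)    # perturbation direction in f

u0, lam, J = solve_hjb(L, f, eps)
du = np.linalg.solve(J, -(lam * df).sum(axis=1))             # linearised (sensitivity) equation
t = 1e-5
u_t, _, _ = solve_hjb(L, f + t * df, eps)
fd = (u_t - u0) / t                                          # finite-difference quotient
```

In this toy setting the solution `du` of the linearised equation and the difference quotient `fd` should agree up to an error of order t, which is the discrete counterpart of the expansion S[ϑ0 + δϑ] = uε + S′[ϑ0]δϑ + o(|δϑ|).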

Remark 5.1. We can further obtain a first-order expansion of the optimal control λuε in terms of the perturbations of the coefficients. If ε > 0 and the function H in (H.2) is in C3(RK) (cf. Hen and Hchks in Section 3), then Lemma 4.1 shows that ∇Hε : Cα(O, RK) → Cα(O, RK), α ∈ (0, 1], is continuously differentiable with derivative (∇Hε)′[u]h = (∇2Hε)(u)h for all u, h ∈ Cα(O, RK), where ∇2Hε is the Hessian of Hε. Hence, by using the chain rule and Theorem 5.2, we have for all β ∈ (0, min(β0, θ)] that

image

as |δϑ|Θβ → 0, where λS[ϑ0+δϑ] is the optimal feedback control of the relaxed control problem with the perturbed coefficients ϑ0 + δϑ, and δu is the classical solution to (5.2).

With the sensitivity equation (5.2) in hand, we now estimate the precise dependence of δu on the relaxation parameter ε, which strengthens the Lipschitz stability result (4.4) by quantifying the explicit ε-dependence of the (local) Lipschitz constant. Note that Remark 4.2 shows that the value function (2.3) (in the C2,β-norm) does not depend continuously on the Cβ-perturbation of the parameters, which suggests that for a fixed δϑ ∈ Θβ, the | · |2,β-norm of δu will blow up as the parameter ε tends to 0.

Since the Hölder norm of the function λuε in (5.2) tends to infinity as ε → 0, we first present a precise a priori estimate for the classical solutions to linear elliptic equations with ε-dependent coefficients. The proof will be postponed to Appendix A, where we first reduce the equation to a constant coefficient equation involving only second-order terms, and then apply the classical Schauder estimate.
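The pointwise content of this expansion can be checked numerically for the entropy choice. The sketch below assumes Hε(z) = ε ln Σk exp(zk/ε), whose gradient is the softmax of z/ε and whose Hessian is (diag(p) − ppᵀ)/ε with p = ∇Hε(z); the vectors `z` and `h` are made-up stand-ins for Luε + f and its perturbation at a single point x.

```python
import numpy as np

def grad_h_eps(z, eps):
    """Gradient of H_eps for the entropy choice: softmax of z / eps."""
    w = np.exp((z - np.max(z)) / eps)
    return w / np.sum(w)

def hess_h_eps(z, eps):
    """Hessian of H_eps for the entropy choice: (diag(p) - p p^T) / eps."""
    p = grad_h_eps(z, eps)
    return (np.diag(p) - np.outer(p, p)) / eps

eps = 0.5
z = np.array([0.3, -0.1, 0.2])    # stands for (L u + f)(x) at one point x
h = np.array([1.0, -2.0, 0.5])    # stands for the perturbation of L u + f at x

errs = []
for t in (1e-1, 1e-2, 1e-3):
    first_order = grad_h_eps(z, eps) + hess_h_eps(z, eps) @ (t * h)
    errs.append(np.linalg.norm(grad_h_eps(z + t * h, eps) - first_order))
```

The remainders in `errs` shrink roughly by a factor of 100 each time t is reduced by a factor of 10, as expected from a first-order expansion with a twice differentiable gradient.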

Proposition 5.3. Let α ∈ [0, 1], β ∈ (0, 1), ν, Λ > 0, and O be a bounded domain in Rn with C2,β boundary. For every ε ∈ (0, 1], let aε : O → Rn×n, bε :

functions satisfying aε ≥ νIn on O. Suppose that $[a^{ij}_\varepsilon]_0, [b^{i}_\varepsilon]_0, [c_\varepsilon]_0 \le \Lambda$ and $[a^{ij}_\varepsilon]_\beta, [b^{i}_\varepsilon]_\beta, [c_\varepsilon]_\beta \le \Lambda\varepsilon^{-\alpha}$ for all ε ∈ (0, 1] and i, j = 1, . . . , n. Then for every ε ∈ (0, 1], f ∈ Cβ(O) and g ∈ C2,β(O), the Dirichlet problem

image

which applies to relaxed control problems with reward functions generated by Hen, Hchks and Hzang.

Theorem 5.4. Assume the setting of Theorem 5.2 and in addition that the function H : RK → R in (H.2) has a Lipschitz continuous gradient. Let β0 ∈ (0, 1) be the constant in Proposition 3.3 and ¯β0 = min(β0, θ). Then it holds for all ε ∈ (0, 1], β ∈ (0, ¯β0] and δϑ ∈ Θβ that the classical solution δu to the Dirichlet problem (5.2) satisfies the estimate $|\delta u|_{2,\beta} \le C\varepsilon^{-(\beta+2)/\bar\beta_0}|\delta\vartheta|_{\Theta^\beta}$, where C is a constant independent of ε and δϑ.

Proof. Throughout this proof, let C be a generic constant depending possibly on ϑ0 and β, but independent of ε and δϑ. Proposition 3.3 shows that |uε|2,¯β0 ≤ C for all ε ∈ (0, 1], which together with (3.6), the fact that ∇Hε(x) = ∇H(ε−1x) for all x ∈ RK (see (3.3)) and the Lipschitz continuity of ∇H implies that |λuε|0 ≤ C and |λuε|¯β0 ≤ Cε−1 for all ε ∈ (0, 1]. Consequently, we have for all β ∈ (0, ¯β0] and ε ∈ (0, 1] that $|\lambda^{u^\varepsilon}|_{\beta} \le C\,|\lambda^{u^\varepsilon}|_{\bar\beta_0}^{\beta/\bar\beta_0}\,|\lambda^{u^\varepsilon}|_{0}^{(\bar\beta_0-\beta)/\bar\beta_0} \le C\varepsilon^{-\beta/\bar\beta_0}$.

Now let us fix β ∈ (0, ¯β0] and δϑ ∈ Θβ. Since λuε ∈ ∆K on O, we can apply Proposition 5.3 (with α = β/¯β0) to (5.2) and conclude the desired estimate from the following inequality:
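The ε−1 scaling used in this step can be observed numerically. The sketch below assumes the entropy choice, for which ∇Hε(x) = ∇H(x/ε) is the softmax of x/ε, and estimates the Lipschitz constant of ∇Hε by random difference quotients; the sampling scheme and names are illustrative only.

```python
import numpy as np

def softmax(z):
    w = np.exp(z - np.max(z))
    return w / np.sum(w)

def grad_h_eps(x, eps):
    """(grad H_eps)(x) = (grad H)(x / eps) for the entropy choice H = H_en."""
    return softmax(x / eps)

rng = np.random.default_rng(2)
K = 3
ratios = {}
for eps in (1.0, 0.1, 0.01):
    worst = 0.0
    for _ in range(2000):
        x = rng.normal(size=K)
        y = x + 1e-3 * eps * rng.normal(size=K)     # small perturbation of x
        num = np.linalg.norm(grad_h_eps(x, eps) - grad_h_eps(y, eps))
        worst = max(worst, num / np.linalg.norm(x - y))
    ratios[eps] = worst * eps    # bounded in eps if the Lipschitz constant is O(1/eps)
```

For the softmax the rescaled quotients in `ratios` stay below 1/2 for every ε, so the Lipschitz constant of ∇Hε grows at most like ε−1, which is the mechanism behind the bound on the Hölder norm of λuε.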

image

In this section, we analyze the convergence of the relaxed control problem (3.2) to the original control problem (2.3) as the relaxation parameter tends to zero. In particular, with the help of the HJB equations (2.6) and (3.5), we shall establish first-order monotone convergence of the value functions, and also uniform convergence of the feedback controls (in regions where a strict complementary condition is satisfied).

We first study the convergence of the value functions of the relaxed control problems. The following theorem shows that, as the relaxation parameter ε tends to zero, the value function (3.2) converges monotonically to the value function (2.3) in C2,β(O) with first order.

Theorem 6.1. Suppose (H.1) and (H.2) hold. Let β0 ∈ (0, 1) be the constant in Proposition 3.3, and u ∈ C(O) ∩ C2(O) (resp. uε ∈ C(O) ∩ C2(O)) be the solution to (2.6) (resp. (3.5) with parameter ε > 0). Then we have uε1 ≥ uε2 for all ε1 ≥ ε2 > 0. Moreover, it holds for any β ∈ (0, min(β0, θ)) that (uε)ε>0 converges to u in C2,β(O) as ε → 0, and satisfies the estimate:

image

Proof. Let (Fε)ε≥0 be defined as in (2.6) and (3.5), and ε1 ≥ ε2 > 0 be given constants. Lemma 3.2 shows that ρ ≤ 0 on ∆K, and $H_\varepsilon(x) = \max_{y\in\Delta_K}\bigl(y^{T}x - \varepsilon\rho(y)\bigr)$ for all x ∈ RK. Hence, we have Hε1 ≥ Hε2, and

image

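where we write $\eta := \int_0^1 (\nabla H_{\varepsilon_2})\bigl(Lu^{\varepsilon_2} + f + sL(u^{\varepsilon_1} - u^{\varepsilon_2})\bigr)\,ds$. Since η(x) ∈ ∆K for all x ∈ O, we can deduce from the classical maximum principle (see e.g. [22, Theorem 3.7]) that $\inf_{x\in O}(u^{\varepsilon_1} - u^{\varepsilon_2})(x) \ge \inf_{x\in\partial O}(u^{\varepsilon_1} - u^{\varepsilon_2})^{-}(x) = 0$.

Similarly, for any given ε > 0, we can obtain from Lemma 3.2(2) that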

image

where we have $\tilde\eta := \int_0^1 (\nabla H_\varepsilon)\bigl(Lu + f + sL(u^\varepsilon - u)\bigr)\,ds$. By using ak = σk(σk)T /2, (2.4) in (H.1), and the fact that ˜η ∈ ∆K on O, we deduce that $\sum_{k=1}^{K}\tilde\eta_k c_k \ge 0$ and $\sum_{k=1}^{K}\tilde\eta_k a_k \ge (\nu/2) I_n$. Hence the classical maximum principle (see e.g. [22, Theorem 3.7]) and the fact that uε = u on ∂O give us the estimate (6.1).

image

any given β ∈ (0, min(β0, θ)), there exists a subsequence (uεm)m∈N with limm→∞ εm = 0, such that (uεm)m∈N converges in C2,β(O) to some function ¯u and ¯u ∈ C2,min(β0,θ)(O). Since the entire sequence (uε)ε>0 converges monotonically to u, we have u = ¯u and (uε)ε>0 converges to u in C2,β(O) for all β ∈ (0, min(β0, θ)).

Remark 6.1. The estimate (6.1) depends on ε, c0, ν, bik and O in a rather intuitive way. Note that, compared with the original control problem (2.3), the relaxed control problem (3.2) introduces additional randomness for exploration to achieve more robust decisions, especially at regions where two or more strategies lead to similar performances based on the given model (the points at which arg max in (2.8) is not a singleton). The relation (2.8) between feedback controls and the derivatives of value functions further suggests that such regions usually correspond to a sign change of derivatives of value functions.

The exploration surplus in the value functions clearly increases as ε or c0 increases (see Lemma 3.2(1) and Figure 1), since the same level of exploration will bring more rewards. It will also increase with diam(O) as the dynamics will stay in O longer. Furthermore, due to the lack of regularization from the Laplacian operator, a small volatility or a large drift-to-volatility ratio of the underlying model usually leads to a more rapidly changing value function, which increases the occurrence of the uncertain regions and makes the relaxation approach more beneficial.

Now we turn to investigate the convergence of the feedback relaxed control (3.6). To distinguish different convergence behaviours related to reward functions generated by Hen and Hzang, we first introduce the following concept for functions which only modify the pointwise maximum function locally near the kinks.

Definition 6.1. Let n ∈ N. We say a function φ : Rn → R satisfies (Sloc) with constant ϑ ≥ 0, if it holds for all k = 1, . . . , n and x ∈ Rn with xk ≥ xj + ϑ, ∀j ≠ k, that φ(x) = xk.

It is clear that the pointwise maximum function on Rn satisfies (Sloc) with ϑ = 0, and the two-dimensional function Hzang defined in (3.8) satisfies (Sloc) with ϑ = 1/2. The following lemma shows that property (Sloc) is preserved under function composition and scaling, which consequently implies that the recursively constructed K-dimensional Hzang and its corresponding scaled function (Hzang)ε (cf. (3.3)) satisfy (Sloc). The proof follows directly from Definition 6.1, and is included in Appendix A.

Lemma 6.2. (1) For each  n ∈ N, let H(n)0 : Rn → R be the n-dimensional pointwise maxi-

image

(2) If φ : Rn → R satisfies (Sloc) with constant ϑ ≥ 0, then for each ε > 0, the scaled function φε : x ∈ Rn ↦ εφ(ε−1x) ∈ R satisfies (Sloc) with constant εϑ.
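Definition 6.1 and the scaling statement in Lemma 6.2(2) can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function phi below is a hypothetical Zang-type smoothing of the two-dimensional maximum, built from a piecewise quadratic smoothing of max(t, 0), and it is not claimed to coincide with Hzang in (3.8). All function names are ours.

```python
import numpy as np

def zang_plus(t, theta=0.5):
    """C^1 smoothing of max(t, 0) that differs from it only for |t| < theta."""
    t = np.asarray(t, dtype=float)
    quad = (t + theta) ** 2 / (4.0 * theta)
    return np.where(t <= -theta, 0.0, np.where(t >= theta, t, quad))

def phi(x, theta=0.5):
    """Hypothetical smoothed maximum on R^2: phi(x) = x2 + p(x1 - x2)."""
    x = np.asarray(x, dtype=float)
    return x[..., 1] + zang_plus(x[..., 0] - x[..., 1], theta)

def phi_eps(x, eps, theta=0.5):
    """Scaled function x -> eps * phi(x / eps), as in Lemma 6.2(2)."""
    return eps * phi(np.asarray(x, dtype=float) / eps, theta)

def satisfies_sloc(f, const, n_samples=10_000, seed=0):
    """Sampled check of (Sloc) on R^2: f(x) = x_k whenever x_k >= x_j + const."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, size=(n_samples, 2))
    top = x.max(axis=1)
    gap = top - x.min(axis=1)
    mask = gap >= const          # points where one coordinate dominates by const
    return bool(np.allclose(f(x[mask]), top[mask], atol=1e-12))

print(satisfies_sloc(phi, 0.5))                            # constant theta
print(satisfies_sloc(lambda x: phi_eps(x, 0.1), 0.05))     # constant eps * theta
```

A sampled check of this kind cannot prove the property, but it fails quickly if the constant is chosen too small, for instance when the scaled function is tested against the constant 0.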

The following proposition presents several important convergence properties of the functions (∇Hε)ε>0. In the sequel, we shall denote by ek ∈ RK, k ∈ K, the unit vector from the k-th column of the identity matrix IK, and by conv(S) the convex hull of a given set S ⊂ RK.

Proposition 6.3. Suppose (H.2) holds. Let (Hε)ε≥0 be defined as in (3.3), (∂H0)(x) = conv({ek ∈ RK | xk = H0(x), k ∈ K}) for all x ∈ RK, and U = {x ∈ RK | (∂H0)(x) is a singleton}. Then it holds for all x ∈ RK and every compact subset C ⊂ U that

(1) limk→∞ dist((∇Hεk)(xk), (∂H0)(x)) = 0 provided that limk→∞ xk = x and limk→∞ εk = 0+,

(2) (∇Hε)ε>0 converges uniformly to ∂H0 on C as ε → 0. If we further suppose the function H : RK → R in (H.2) satisfies (Sloc) with constant ϑ ≥ 0, then there exists ε0 > 0 such that (∇Hε)(x) = (∂H0)(x) for all x ∈ C and ε ∈ (0, ε0].

Proof. We first establish Property (1) by considering the following function:

image

Note that Lemma 3.2(1) shows that the restriction of ρ to ∆K is continuous, which subsequently implies that φ is a continuous function. Then we can deduce from [1, Theorem 17.31] that the set-valued mapping Ξ : (x, ε) ∈ RK × [0, 1] ⇒ arg maxy∈∆K φ(x, ε, y) ⊂ ∆K is upper hemicontinuous, which along with the fact that Ξ(x, ε) = (∇Hε)(x) for all (x, ε) ∈ RK × (0, 1] (see Lemma 3.2(2)) enables us to deduce limk→∞ dist((∇Hεk)(xk), Ξ(x, 0)) = 0 for any sequences satisfying limk→∞ xk = x and limk→∞ εk = 0+. Property (1) now follows from the fact that Ξ(x, 0) = (∂H0)(x) (see e.g. [37, Theorem 2]).

Now we shall prove Property (2). We first define the set Uk = {x ∈ RK | xk > xj, ∀j ≠ k} for each k ∈ K. It is clear that (Uk)k∈K are disjoint open convex sets, U = ∪k∈KUk, and it holds for all k ∈ K and x ∈ Uk that H0 is differentiable at x with (∇H0)(x) = ek = (∂H0)(x).

Let C ⊂ U be a compact set, then we have C = ∪k∈K(C ∩ Uk) due to U = ∪k∈KUk. Let us fix an arbitrary index k ∈ K. By using the fact that (Uk)k∈K are disjoint open sets, we can deduce that C ∩ Uk is also compact. Since (Hε)ε≥0 are convex and differentiable on Uk and limε→0 Hε(x) = H0(x) for all x ∈ Uk, we can deduce from the convexity of Uk and [38, Theorem 25.7] that (∇Hε)ε>0 converges uniformly to ∇H0 = ∂H0 on C ∩ Uk. Since K is a finite set, we have shown the desired uniform convergence on C.

Moreover, for each k ∈ K, the compactness of C ∩ Uk implies that there exists ε0,k > 0 such that C ∩ Uk ⊂ {x ∈ RK | xk > xj + ε0,k, ∀j ≠ k}. If H satisfies (Sloc) with constant ϑ ≥ 0, then Lemma 6.2(2) shows that for all ε > 0 satisfying εϑ ≤ ε0,k, we have Hε = H0 (and hence ∇Hε = ∇H0) on C ∩ Uk. Hence, by setting ε0 > 0 to be a constant satisfying ε0ϑ ≤ mink∈K ε0,k, we can conclude for all ε ∈ (0, ε0] that ∇Hε = ∇H0 = ∂H0 on C.
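The convergence in Proposition 6.3 can be observed numerically in the entropy-regularized case. The sketch below assumes that Hε is the scaled log-sum-exp function ε log Σk exp(xk/ε), so that ∇Hε is the softmax map; this choice, and the function names, are our assumptions for illustration and are not taken from (3.3). At a point with a unique maximal coordinate the gradient approaches the vertex ek, and at a tie the mass outside the maximizing coordinates vanishes, so the gradient approaches the set (∂H0)(x).

```python
import numpy as np

def grad_H_eps_entropy(x, eps):
    """Softmax of x / eps, the gradient of eps * log(sum_k exp(x_k / eps))."""
    z = np.asarray(x, dtype=float) / eps
    z = z - z.max()                  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def mass_off_argmax(y, x):
    """Total weight that y puts on coordinates that do not maximize x."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(y[x < x.max()]))

x_unique = np.array([1.0, 0.3, -0.2])   # unique maximizer: (dH_0)(x) = {e_1}
x_tie = np.array([1.0, 1.0, 0.0])       # tie: (dH_0)(x) = conv{e_1, e_2}
e1 = np.eye(3)[0]

for eps in (1.0, 0.1, 0.01):
    g = grad_H_eps_entropy(x_unique, eps)
    g_tie = grad_H_eps_entropy(x_tie, eps)
    print(eps, np.linalg.norm(g - e1), mass_off_argmax(g_tie, x_tie))
```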

Now we are ready to present the convergence of the feedback relaxed control (3.6). Note that the Hölder continuity of the relaxed controls (3.6) and the possible discontinuity of the feedback control (2.8) suggest that the sequence (λuε)ε>0 in general does not converge uniformly to αu on O as ε → 0. Thus we shall show that the relaxed controls converge in terms of the Hausdorff metric everywhere in O, and converge uniformly on compact subsets of the following region:

image

where u ∈ C(O) ∩ C2(O) is the solution to (2.6) (or equivalently the value function (2.3) if the function σ ∈ Sn0; see Theorem 2.2), and (Lk)k∈K are the elliptic operators defined as in (2.7). Note that Ost contains the points at which a strict complementary condition is satisfied, i.e., the optimal feedback control strategy of (2.3) is uniquely determined.

Theorem 6.4. Suppose (H.1) and (H.2) hold. Let (λuε)ε>0 be the functions defined as in (3.6) for each ε > 0, u ∈ C(O) ∩ C2(O) be the solution to (2.6), and Ost be the set defined as in (6.2). Then we have for all x ∈ O and (xε)ε>0 ⊂ O with limε→0 xε = x that

image

Moreover, it holds for every compact subset C ⊂ Ost that (λuε)ε>0 converges uniformly to the function λ∗ : x ∈ Ost → eκu(x) ∈ ∆K on C as ε → 0, where κu(x) = arg maxk∈K (Lku(x) + fk(x)) for all x ∈ Ost. If we further suppose the function H : RK → R in (H.2) satisfies (Sloc) with constant ϑ > 0, then there exists ε0 > 0 such that it holds for all ε ∈ (0, ε0] that λuε ≡ λ∗ on C.

Proof. For any given ε > 0, let uε ∈ C(O) ∩ C2(O) be the solution to (3.5). We first prove (6.3) by fixing an arbitrary point x ∈ O. By using (3.6) and Proposition 6.3(1), we see it suffices to show limε→0(Luε(xε) + f(xε)) = Lu(x) + f(x), where L, f are defined as those in (2.6). Then the fact that (uε)ε>0 converges to u uniformly in C2(O) (see Theorem 6.1) and the continuity of coefficients enable us to conclude (6.3).

We now proceed to demonstrate the uniform convergence of (λuε)ε>0 in Ost. Note that for all x ∈ Ost, we have eκu(x) = (∂H0)(Lu(x) + f(x)), where the set-valued mapping ∂H0 : RK ⇒ ∆K is defined as in Proposition 6.3. We further define for any given k ∈ K the set

image

where u ∈ C(O) ∩ C2(O) is the solution to (2.6), and (Lk)k∈K are the elliptic operators defined as in (2.7). The continuity of the coefficients in (Lk)k∈K (see (H.1)) implies that (Ost,k)k∈K are disjoint open sets satisfying Ost = ∪k∈KOst,k.

image

C ∩ Ost,k is a compact set for each k ∈ K. Let k ∈ K be a fixed index. Then the continuity of the coefficients in (Lk)k∈K, the fact that u ∈ C2(O), and the compactness of C ∩ Ost,k imply that there exist constants C1, C2 ∈ (0, ∞) such that we have for all x ∈ C ∩ Ost,k and j ∈ K that,

image

Now by using the fact that (uε)ε>0 converges to u uniformly in C2(O), we can deduce that there exist ε0, C1, C2 > 0 such that the same estimates hold for all (uε)ε∈(0,ε0]. In other words, letting U be the set defined as in Proposition 6.3, we can introduce the compact set

image

and conclude for all  ε ∈ (0, ε0], x ∈ C ∩ Ost,k that Luε(x) + f(x) ∈ Gk and Lu(x) + f(x) ∈ Gk.

image

6.3(2)) ensures that there exists δk > 0, such that we have for all y ∈ Gk and ε < δk that |(∇Hε)(y) − (∂H0)(y)| ≤ η. Hence, by using the fact that ∂H0 = {ek} on Gk, we have for all

image

which shows the uniform convergence of (λuε)ε>0 to λ∗ on C ∩ Ost,k. Since C = ∪k∈K(C ∩ Ost,k) and K is a finite set, we can conclude the desired uniform convergence on C.

Finally, if we further suppose H satisfies (Sloc) with constant ϑ ≥ 0, Proposition 6.3(2) ensures that ∇Hε ≡ ∂H0 on Gk for all small enough ε > 0, which leads to the fact that λuε ≡ λ∗ for all small enough ε > 0 on C and finishes our proof.

Remark 6.2. One can identify the unit vector  ek ∈ ∆K, k ∈ K, as the Dirac measure supported on  {ak}, which shows that, as the relaxation parameter tends to zero, the agent of the relaxed control problem will emphasize more on exploitation, and the relaxed control distribution will collapse to a pure exploitation strategy for the classical control problem.

Note that Theorem 6.4 demonstrates an exact regularization feature of the reward function ρzang generated by Hzang, which means that we can recover the original control strategy in the region Ost based on the feedback relaxed control without sending the relaxation parameter ε to 0. The main intuition of the proof is that the region Ost can be mapped into a finite number of convex sets (i.e., the sets (Uk)k∈K in the proof of Proposition 6.3). Hence, if a reward function only modifies the pointwise maximum function locally near the kinks, then one can employ the local compactness and local convexity structure of Ost and the finiteness of the action set A, and deduce the local exact regularization property in the region Ost.

The exact regularization feature of ρzang helps avoid the possible numerical instability for solving the relaxed control problem (3.2) with an extremely small relaxation parameter. In contrast, the feedback relaxed control λuε based on the entropy reward function ρen is always in (0, 1)K, and the convergence rate to the original control strategy can be arbitrarily slow.
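The contrast between the two reward functions can be made concrete for K = 2. The following Python sketch compares the gradient of a hypothetical Zang-type smoothed maximum (a piecewise quadratic smoothing with constant ϑ = 1/2, which we use as a stand-in and do not claim to equal Hzang in (3.8)) with the softmax gradient that we assume corresponds to the entropy reward. For a fixed gap between the two coordinates, the first gradient equals a vertex of the simplex as soon as εϑ does not exceed the gap, while the second stays strictly inside the simplex for the moderate values of ε used here.

```python
import numpy as np

def grad_zang_type(x, eps, theta=0.5):
    """Gradient of x -> eps * phi(x / eps) with phi(x) = x2 + p(x1 - x2),
    where p is a C^1 piecewise quadratic smoothing of max(t, 0)."""
    s = (x[0] - x[1]) / eps
    dp = float(np.clip((s + theta) / (2.0 * theta), 0.0, 1.0))   # p'(s)
    return np.array([dp, 1.0 - dp])

def grad_entropy(x, eps):
    """Softmax of x / eps (assumed gradient for the entropy reward)."""
    z = (np.asarray(x, dtype=float) - np.max(x)) / eps
    w = np.exp(z)
    return w / w.sum()

x = np.array([0.2, 0.0])   # the first action is strictly better, with gap 0.2
for eps in (1.0, 0.1, 0.01):
    print(eps, grad_zang_type(x, eps), grad_entropy(x, eps))
```

In floating-point arithmetic the softmax weights eventually round to 0 and 1 for very small ε, so the strict inclusion in (0, 1)K should be read as a statement about exact arithmetic.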

To the best of our knowledge, this is the first paper which constructs Lipschitz stable feedback control strategies for general multi-dimensional continuous-time stochastic control problems, and rigorously analyzes the performance of a pre-computed feedback control for a perturbed problem in a continuous setting. We also perform a novel first-order sensitivity analysis for the value function and feedback relaxed control with respect to perturbations in the model parameters, and quantify the explicit dependence of the Lipschitz stability of feedback controls on the exploration parameter. These stability results provide a theoretical justification for recent reinforcement learning heuristics that including an exploration reward in the optimization objective leads to more robust decision making.

A natural next step would be to extend the stability analysis to finite horizon stochastic control problems and mean-field control problems with continuous action spaces (see e.g. [23, 42]). The infinite cardinality of action spaces implies that the corresponding relaxed controls take values in an infinite-dimensional space of probability measures, which poses additional challenges for the analysis of the regularized control problems. For example, infinite-dimensional convex analysis on spaces of measures must be employed to analyze the regularity of the modified Hamiltonians and the well-posedness of the associated HJB equations. Moreover, one must endow the action space of relaxed controls with a suitable metric structure (such as the Wasserstein metric) in order to study the spatial regularity and Lipschitz stability of feedback relaxed controls.

Another interesting direction is to design efficient numerical algorithms for solving the regularized control problems in a continuous setting.

image

be the strong solution to (2.2) with control α, and for all t ≥ 0, let $Z^{\alpha,x}_t = \int_0^t c(X^{\alpha,x}_s, \alpha_s)\,ds$. It is shown in [13, Lemma 3.1] that $E[\exp(\mu\tau^{\alpha,x})] < \infty$ for some constant µ > 0, which implies that $\tau^{\alpha,x} < \infty$ with probability 1. Applying Itô's formula to the function φ(y, z) = u(y) exp(−z), (y, z) ∈ Rn × R, gives us that

image

where LXα,x is the generator of the controlled dynamics Xα,x, and $\Gamma^{\alpha,x}_t = \exp\big(-\int_0^t c(X^{\alpha,x}_s, \alpha_s)\,ds\big)$ for all t ∈ [0, τα,x]. The fact that u is a solution to (2.6) implies that for P-a.s. ω ∈ Ω, and

image

Then, by rearranging the terms, using the fact that $\phi(X^{\alpha,x}_{\tau^{\alpha,x}}, Z^{\alpha,x}_{\tau^{\alpha,x}}) = g(X^{\alpha,x}_{\tau^{\alpha,x}})\,\Gamma^{\alpha,x}_{\tau^{\alpha,x}}$ and taking the supremum over all α ∈ Aπ and π ∈ Πref, we can deduce that u(x) ≥ v(x) for all x ∈ O.

We proceed to show αu is a feedback control of (2.3) (cf. Definition 2.2). Let αu : O → A be a Borel measurable function satisfying (2.8), and α̃u : Rn → A be an extension of αu such that α̃u = αu on O and α̃u = a1 on Oc. We shall consider the functions bα : Rn → Rn, σα : Rn → Sn0 such that bα(x) = b(x, α̃u(x)), σα(x) = σ(x, α̃u(x)) for all x ∈ Rn. The measurability of αu and the continuity of b, σ imply that bα, σα and α̃u are Borel measurable. Then, for any given x ∈ Rn, by using the boundedness of functions bα, σα, and [32, Theorem 1], we can deduce that there exists πx = (Ωx, Fx, {Fxt}t≥0, Px, W) ∈ Πref, and an {Fxt}t≥0-progressively measurable continuous process (Xxt)t≥0, such that Xx0 = x, and

image

Thus we can obtain from the definition of α̃u that (Xxt)t≥0 satisfies (2.9) with h = αu. Moreover, [29, Theorem 2.2.4 on p. 54] implies that $E^{P^x}\big[\int_0^{\tau^{\alpha^u,x}} \big(|b(X^x_s, \alpha^u(X^x_s))| + |\sigma(X^x_s, \alpha^u(X^x_s))|^2\big)\,ds\big] < \infty$, which shows that αu is a feedback control of (2.3).

It remains to show αu is an optimal feedback control. If x ∈ ∂O, we can deduce from the definition that τ α̃u,x = 0, which shows that g(x) = g(Xxτ α̃u,x) = J(x, αu), where J(x, αu) is defined as in (2.10). Similarly, we have for all π ∈ Πref, α ∈ Aπ, x ∈ ∂O that the first exit time of Xα,x from O is 0, i.e., τ α,x = 0, which implies that v(x) = g(x). Hence, we can deduce from the fact that u satisfies the boundary condition of (2.6) that u(x) = g(x) = v(x) = J(x, αu) for all x ∈ ∂O.

For each x ∈ O, let Xx be a progressively measurable continuous process satisfying the SDE (A.3), defined on the reference probability system πx ∈ Πref. The assumption that αu satisfies (2.8) ensures that α̃u(Xx) and Xx attain the equality in (A.2) for P-a.s. ω ∈ Ω, and t ∈ [0, τ α̃u,x(ω)], from which, by using similar arguments as (A.1), we can obtain that u(x) = J(x, αu) (cf. (2.10)). On the other hand, owing to the fact that α̃u(Xx) ∈ Aπx, we have by the definition of v that u(x) ≤ v(x) for all x ∈ O. Combining this with the fact that u(x) ≥ v(x) for all x ∈ O, we can conclude that u(x) = v(x) = J(x, αu) in O, which shows that αu is an optimal feedback control and u ≡ v on O.

Proof of Lemma 3.1. The definition of ∆K and (H.1) clearly imply that the function b̃ is well-defined and enjoys the desired estimates. Hence we shall focus on establishing the properties of the function σ̃.

It has been shown in [17, Theorem 7.14-3] that for any given A ∈ Sn>, there exists a unique matrix A1/2 ∈ Sn> such that A1/2(A1/2)T = A, A1/2 ≥ √µIn if A ≥ µIn, and the mapping Φ : A ∈ Sn> ↦ Φ(A) = A1/2 ∈ Sn> is infinitely differentiable. Note that (2.4) and (2.5) in (H.1) ensure that there exists a constant C ∈ (0, ∞), such that it holds for all x ∈ Rn, λ ∈ ∆K that

image

We now define the function σ̃ : Rn × ∆K → Sn> by $\tilde\sigma(x, \lambda) = \Phi\big(\sum_{k=1}^K \sigma(x, a_k)\sigma(x, a_k)^T \lambda_k\big)$ for all x ∈ Rn, λ ∈ ∆K. The facts that Φ is a smooth function and G is a compact subset of Sn> imply that Φ is bounded and Lipschitz continuous on G. Therefore, we can conclude from (2.4), (2.5), (A.4) and the definition of σ̃ that it holds for all x ∈ Rn, λ ∈ ∆K that σ̃(x, λ) ≥ √ν In and $\sum_{i,j} |\tilde\sigma_{ij}(\cdot, \lambda)|_{0,1} < \infty$.
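The construction of the relaxed diffusion coefficient at a fixed state x can be sketched in a few lines of NumPy. The code below is a minimal illustration, assuming the K matrices σ(x, ak) are supplied as an array; the function name relaxed_sigma and the sample matrices are hypothetical. The symmetric positive definite square root is computed through an eigendecomposition, which is one standard way to evaluate the map Φ.

```python
import numpy as np

def relaxed_sigma(sigmas, lam):
    """Symmetric square root of sum_k lam_k * sigma_k sigma_k^T.

    sigmas: array of shape (K, n, n), the matrices sigma(x, a_k) at a fixed x
    lam:    array of shape (K,), a point of the simplex
    """
    sigmas = np.asarray(sigmas, dtype=float)
    lam = np.asarray(lam, dtype=float)
    # a[i, l] = sum_k lam_k * (sigma_k sigma_k^T)[i, l]
    a = np.einsum("k,kij,klj->il", lam, sigmas, sigmas)
    evals, evecs = np.linalg.eigh(a)              # a is symmetric
    return (evecs * np.sqrt(evals)) @ evecs.T     # V diag(sqrt(evals)) V^T

# Hypothetical data with K = 2 actions in dimension n = 2.
sigmas = np.array([[[1.0, 0.0], [0.0, 2.0]],
                   [[2.0, 1.0], [0.0, 1.0]]])
lam = np.array([0.3, 0.7])
s = relaxed_sigma(sigmas, lam)
print(s)
```

Since the smallest eigenvalue is a concave function of a symmetric matrix, the convex combination inherits the lower bound ν of the individual matrices σσT, which is why the square root is bounded below by √ν.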

Proof of Lemma 3.2. We start by establishing Property (1). Since H : RK → R is a continuous convex function, the representation of ρ in (H.2) and [38, Theorem 12.2] ensure that ρ is a closed convex proper function satisfying

image

The assumption that  H(x) − c0 ≤ maxk∈K xk ≤ H(x) for all x ∈ RK implies that for all  y ∈ RK,

image

which together with the fact that

image

shows that ρ(y) ∈ [−c0, 0] for all y ∈ ∆K and ρ(y) = ∞ for all y ∈ (∆K)c. Finally, since ρ is a closed convex function satisfying {y ∈ RK | ρ(y) < ∞} = ∆K, we can deduce from [38, Theorem 10.2] (∆K is the standard simplex and hence locally simplicial) that the restriction of ρ to ∆K is a continuous function.

We now show Property (2). It is clear from (H.2) and (3.3) that Hε(x) − c0ε ≤ H0(x) ≤ Hε(x) for all x ∈ RK. Note that (A.5) and the fact that ρ = ∞ on (∆K)c imply that for all ε > 0 we have

image

which shows the function ερ is the convex conjugate of Hε, i.e., (Hε)∗ = ερ. Hence, we can further deduce from [38, Theorem 23.5], the differentiability and convexity of Hε that

image

Consequently, we can obtain from the fundamental theorem of calculus and the Cauchy-Schwarz inequality that Hε is Lipschitz continuous with constant LHε = supx∈RK |(∇Hε)(x)| ≤ maxy∈∆K |y|. Note that ∆K is the convex hull of {e1, . . . , eK}, where ek is the unit vector from the k-th column of the identity matrix IK. Hence [38, Theorem 32.2] ensures that maxy∈∆K |y| is attained at {e1, . . . , eK}, which implies that LHε ≤ 1, and finishes the proof of Lemma 3.2.
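The properties established in this proof can be checked numerically in the entropy case. The sketch below assumes H is the log-sum-exp function, so that c0 = log K, ρ is the negative Shannon entropy on ∆K, and ∇Hε is the softmax map; these concrete choices and all function names are our assumptions for illustration. It verifies the conjugate representation of Hε at its maximizer, the sandwich bound between H0 and H0 + c0ε, and the bound |∇Hε| ≤ 1.

```python
import numpy as np

def H_eps(x, eps):
    """Scaled log-sum-exp eps * log(sum_k exp(x_k / eps)), computed stably."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    return float(m + eps * np.log(np.exp((x - m) / eps).sum()))

def grad_H_eps(x, eps):
    """Softmax of x / eps, an element of the simplex."""
    x = np.asarray(x, dtype=float)
    w = np.exp((x - x.max()) / eps)
    return w / w.sum()

def rho(y):
    """Negative entropy sum_k y_k log y_k on the simplex (0 log 0 = 0)."""
    y = np.asarray(y, dtype=float)
    pos = y > 0
    return float(np.sum(y[pos] * np.log(y[pos])))

def dual_objective(x, y, eps):
    """The function y -> <x, y> - eps * rho(y) maximized over the simplex."""
    return float(np.dot(x, y) - eps * rho(y))

rng = np.random.default_rng(0)
x, eps = rng.normal(size=4), 0.3
y_star = grad_H_eps(x, eps)
print(H_eps(x, eps), dual_objective(x, y_star, eps))   # the two values agree
print(np.linalg.norm(y_star))                          # at most 1
```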

Before establishing Proposition 3.3, we first present an a priori estimate for solutions of fully nonlinear equations involving only the second order term.

Lemma A.1. [16, Theorem 7.2 on p. 125] Let O be a bounded connected open subset of Rn, and F : O × Sn → R be a given function. Suppose the function F is differentiable and convex in its second component, and there exist constants λ, Λ > 0 such that $\lambda I_n \le \big[\tfrac{\partial F}{\partial r_{ij}}(x, r)\big] \le \Lambda I_n$ for all (x, r) ∈ O × Sn. Then there exists a constant α = α(n, Λ/λ) ∈ (0, 1) such that for any β ∈ (0, α), if we have in addition that ∂O ∈ C2,β, g ∈ C2,β(O), and there exist constants γ, µ > 0 such that it holds for all x, y ∈ O, r ∈ Sn that |F(x, r) − F(y, r)| ≤ γ(µ + |r|)|x − y|β, then the Dirichlet problem

image

admits a unique solution u ∈ C2,β(O) satisfying the estimate [u]2,β ≤ C(|u|0 + |g|2,β + µ), where the constant C depends only on n, Λ/λ, γ, (α − β)−1 and the C2,β-norm of ∂O.

Now we proceed to prove the a priori estimate for solutions to (3.5).

Proof of Proposition 3.3. Throughout this proof, we shall denote by C a generic constant, which may take a different value at each occurrence. Let  φ ∈ C(O) ∩ C2(O) be a given function, we consider the Dirichlet problem

image

where we define D2u(x) = [∂iju(x)] ∈ Sn, and the function Fφ : O × Sn → R such that for all

image

It follows from (H.2) that Fφ is differentiable and convex in r. Moreover, a straightforward computation shows for all (x, r) ∈ O × Sn that $\big[\tfrac{\partial F_\phi}{\partial r_{ij}}(x, r)\big] = \sum_{k=1}^K \eta_k(x, r)\,a_k(x)$, where we have

image

Note that for each  k ∈ K, the fact that  ak = σσT /2 and (H.1) (see (2.4)-(2.5)) imply that there exists a constant C, depending only on n, such that for all  x ∈ O,

image

which, along with the fact that (η1(x, r), . . . , ηK(x, r))T ∈ ∆K for all (x, r) ∈ O × Sn (see Lemma 3.2(2)), shows that $\tfrac{\nu}{2} I_n \le \big[\tfrac{\partial F_\phi}{\partial r_{ij}}(x, r)\big] \le C I_n$, for some constant C depending only on n and the constant M defined in the statement of Proposition 3.3.

image

3.2(2)) imply that, if the function φ ∈ C2,η(O), 0 < η ≤ θ, then the function Fφ satisfies for all

image

for some constant C depending only on n. Consequently, we can deduce from Lemma A.1 that there exists a constant β0 = β0(n, ν, M) ∈ (0, 1), such that for all β ∈ (0, min(β0, θ)] and φ ∈ C2,β(O), the Dirichlet problem (A.6) admits a unique solution uφ ∈ C2,β(O), and satisfies [uφ]2,β ≤ C(|uφ|0 + |g|2,β + |φ|1,β + 1), where the constant C depends only on n, ν, Λ, β, and O.

Now let uε ∈ C2,β(O), β ∈ (0, min(β0, θ)] be a solution to (3.5). Then it is clear that uε is a solution to the Dirichlet problem: Fuε(x, D2u(x)) = 0 in O and u = g on ∂O. We can then deduce from the above arguments that there exists a constant C, depending only on n, ν, Λ, β and O, such that [uε]2,β ≤ C(|g|2,β + |uε|1,β + 1). Hence by using the interpolation inequality (see [16, Theorem 1.2 on p. 18]), we have |uε|2,β ≤ C(|g|2,β + |uε|0 + 1).

image

from which, by using the classical maximum principle (see e.g. [22, Theorem 3.7]) and the fact that ∇Hε ∈ ∆K (see Lemma 3.2(2)), we can deduce that there exists a constant C = C(n, Λ, O) > 0 such that

image

which together with the fact that |uε|2,β ≤ C(|g|2,β + |uε|0 + 1) leads to the desired estimate.

Proof of Proposition 5.3. The well-posedness of the classical solution  wε follows from the standard elliptic regularity theory (see [22, Theorem 6.14]), hence it suffices to prove the a priori estimate for a fixed  ε > 0.

Let ρ > 0 be a constant whose value will be specified later, and (ξm)Mm=1 be a partition of unity in a domain containing O such that the following properties hold: (1) the support of each function ξm is contained in a ball Bρ(xm) for some xm ∈ Rn; (2) ξm ∈ C∞(Rn) satisfies for all γ ≥ 0 that |ξm|⌊γ⌋,γ−⌊γ⌋ ≤ Cγρ−γ, where ⌊γ⌋ is the integer part of γ and Cγ is a constant independent of m and γ; (3) for each x ∈ O, $\sum_{m=1}^M \xi_m(x) = 1$ and the number of intersected supports of (ξm)Mm=1 at x is bounded by a constant Mn depending only on the dimension n. In the following, we shall denote by w the solution wε, and by C a generic constant independent of α, m and ε.

For each m = 1, . . . , M, we define the function wm = wξm, which satisfies wm = gξm on ∂O and

image

which together with the fact that  ∂ijwm = 0 on

image

Then we can deduce from the interpolation inequality (see [16, Theorem 1.3 on p. 19]) and (A.7) that

image

Note that for all γ ≥ 0, we can obtain from property (2) of (ξm)Mm=1 that |ξm|⌊γ⌋,γ−⌊γ⌋ ≤ Cγ(2CΛε−α)γ/β. Hence by repeatedly applying interpolation inequalities, we can simplify (A.8) into

image

which along with properties (2) and (3) of (ξm)Mm=1 leads to the estimate that

image

Finally, we can conclude from the classical maximum principle (see e.g. [22, Theorem 3.7]) that |w|0 ≤ C(|f|0 + |g|0), which finishes the proof of the desired a priori estimate.

Proof of Lemma 6.2. We first establish Property (1). For any given x = (x1, . . . , xn2+n3)T ∈ Rn2+n3, we write x(1) = (x1, . . . , xn2) ∈ Rn2 and x(2) = (xn2+1, . . . , xn2+n3) ∈ Rn3.

Let x ∈ Rn2+n3 satisfy for some k ∈ {1, . . . , n2 + n3} that xk ≥ maxj≠k xj + c with c = max(ϑ2, ϑ3, c2 + ϑ1, c3 + ϑ1). We assume without loss of generality that k ≤ n2. Then since φ2 satisfies (Sloc) with ϑ2 and c ≥ ϑ2, we have that φ2(x(1)) = xk and φ3(x(2)) ≤ H(n3)0(x(2)) + c3. Moreover, since xk ≥ H(n3)0(x(2)) + c and c ≥ c3 + ϑ1, we see φ2(x(1)) ≥ φ3(x(2)) + ϑ1, which, along with the assumption that φ1 satisfies (Sloc) with ϑ1, implies φ(x) = φ2(x(1)) = xk. Similar arguments show that the same conclusion holds if k ≥ n2 + 1, which enables us to conclude that φ satisfies (Sloc) with c.

Now let x ∈ Rn2+n3 be an arbitrary given point. We have by assumptions that φ2(x(1)) ≤ H(n2)0(x(1)) + c2 and φ3(x(2)) ≤ H(n3)0(x(2)) + c3. Hence, by using the fact that H(2)0 is componentwise increasing and subadditive on R2, we have

image

which finishes the proof of Property (1). Property (2) follows directly from the definition of  φε.

[1] C. D. Aliprantis and K. C. Border, Infinite Dimensional Analysis: A Hitchhiker’s Guide, 3rd ed., Springer-Verlag, Berlin, 2006.

[2] D. Aldous, Weak convergence and the general theory of processes, manuscript, 1981. Available online at https://www.stat.berkeley.edu/~aldous/Papers/weak-gtp.pdf

[3] J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, and M. Eder, All adapted topologies are equal, Probab. Theory Relat. Fields, 178 (2020), pp. 1125–1172.

[4] J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, and J. Wiesel, Estimating processes in adapted Wasserstein distance, preprint, arXiv:2002.07261, 2020.

[5] G. Barles and E. Rouy, A strong comparison result for the Bellman equation arising in stochastic exit time control problems and its applications, Comm. Partial Differential Equations, 23 (1998), pp. 1945–2033.

[6] M. Basei, X. Guo, and A. Hu, Linear quadratic reinforcement learning: Sublinear regret in the episodic continuous-time framework, preprint, arXiv:2006.15316, 2020.

[7] E. Bayraktar, Y. Dolinsky, and J. Guo, Continuity of utility maximization under weak convergence, Math. Financ. Econ., 14 (2020), pp. 725–757.

[8] E. Bayraktar, L. Dolinskyi, and Y. Dolinsky, Extended weak convergence and utility maximisation with proportional transaction costs, Finance Stoch., 24 (2020), pp. 1013–1034.

[9] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.

[10] S. I. Birbil, S.-C. Fang, J. Frenk, and S. Zhang, Recursive approximate of the high dimensional MAX function, Oper. Res. Lett., 33 (2005), pp. 450–458.

[11] P. Blanchard, D. J. Higham, and N. J. Higham, Accurate computation of the log-sum-exp and softmax functions, preprint (2019) arXiv:1909.03469. Accepted in IMA J. Numer. Anal., https://doi.org/10.1093/imanum/draa038.

[12] O. Bokanowski, S. Maroso, and H. Zidani, Some convergence results for Howard’s algorithm, SIAM J. Numer. Anal., 47 (2009), pp. 3001–3026.

[13] R. Buckdahn and T. Y. Nie, Generalized Hamilton-Jacobi-Bellman equations with Dirichlet boundary condition and stochastic exit time optimal control problem, SIAM J. Control Optim., 54 (2016), pp. 602–631.

[14] S. Chaumont, Uniqueness to elliptic and parabolic Hamilton–Jacobi–Bellman equations with non-smooth boundary, C.R. Math. Acad. Sci. Paris, 339 (2004), pp. 555–560.

[15] C. Chen and O. L. Mangasarian, Smoothing methods for convex inequalities and linear complementarity problems, Math. Program., 71 (1995), pp. 51–69.

[16] Y.-Z. Chen and L.-C. Wu, Second Order Elliptic Equations and Elliptic Systems, Transl. Math. Monogr. 174, AMS, Providence, RI, 1998.

[17] P. Ciarlet, Linear and Nonlinear Functional Analysis with Applications, Appl. Math. 130, SIAM, Philadelphia, 2013.

[18] P. Drábek, Continuity of Nemyckij’s operator in Hölder spaces, Comm. Math. Univ. Carolinae, 16 (1975), pp. 37–57.

[19] W. H. Fleming and H. M. Soner, Controlled Markov Processes and Viscosity Solutions, 2nd ed., Springer, New York, 2006.

[20] P. Forsyth and G. Labahn, Numerical methods for controlled Hamilton-Jacobi-Bellman PDEs in finance, J. Comput. Finance, 11 (2007/2008, Winter), pp. 1–43.

[21] M. Geist, B. Scherrer, and O. Pietquin, A theory of regularized Markov decision processes, preprint, arXiv:1901.11275, 2019.

[22] D. Gilbarg and N. Trudinger, Elliptic Partial Differential Equations of Second Order, 2nd edition, Springer-Verlag, Berlin, New York, 1985.

[23] H. Gu, X. Guo, X. Wei, and R. Xu, Dynamic programming principles for learning MFGs, preprint, arXiv:1911.07314, 2019.

[24] X. Guo, A. Hu, R. Xu, and J. Zhang, A general framework for learning mean-field games, preprint, arXiv:2003.06069, 2020.

[25] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, Reinforcement learning with deep energy-based policies, preprint, arXiv:1702.08165, 2017.

[26] K. Ito, C. Reisinger, and Y. Zhang, A neural network based policy iteration algorithm with global H2-superlinear convergence for stochastic games on domains, preprint (2019) arXiv:1906.02304. Accepted in Found. Comput. Math., https://doi.org/10.1007/s10208-020-09460-1.

[27] A. D. Kara and S. Yüksel, Robustness to incorrect system models in stochastic control, SIAM J. Control Optim., 58 (2020), pp. 1144–1182.

[28] B. W. Kort and D. P. Bertsekas, A new penalty function algorithm for constrained minimization, in Proceedings of the 1972 IEEE Conference on Decision and Control, New Orleans, Louisiana, 1972.

[29] N. V. Krylov, Controlled Diffusion Processes, Springer-Verlag, Berlin, 1980.

[30] H.J. Langen, Convergence of dynamic programming models, Math. Oper. Res., 6 (1981), pp. 493–512.

[31] H. Mania, S. Tu, and B. Recht, Certainty equivalence is efficient for linear quadratic control, in Advances in Neural Information Processing Systems, 2019, pp. 10154–10164.

[32] Y. S. Mishura and A. Y. Veretennikov, Existence and uniqueness theorems for solutions of McKean-Vlasov stochastic equations, preprint, arXiv:1603.02212, 2016.

[33] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, Bridging the gap between value and policy based reinforcement learning, preprint, arXiv:1702.08892, 2017.

[34] R. Nugari, Further remarks on the Nemitskii operator in Hölder spaces, Comment. Math. Univ. Carolin., 34 (1993), pp. 89–95.

[35] J. M. Peng, A smoothing function and its applications, in Reformulation: Nonsmooth, Piecewise Smooth, Semismooth and Smoothing Methods, M. Fukushima and L. Qi, ed., Kluwer, Dordrecht, 1998, pp. 293–316.

[36] J. Peng and Z. Lin, A non-interior continuation method for generalized linear complementarity problems, Math. Program., 86 (1999), pp. 533–563.

[37] R. A. Poliquin and R. T. Rockafellar, Proto-derivative formulas for basic subgradient mappings in mathematical programming, Set-Valued Anal., 2 (1994), pp. 275–290.

[38] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.

[39] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.

[40] I. Smears and E. Süli, Discontinuous Galerkin finite element approximation of Hamilton-Jacobi-Bellman equations with Cordes coefficients, SIAM J. Numer. Anal., 52 (2014), pp. 993–1016.

[41] I. Smears and E. Süli, Discontinuous Galerkin finite element methods for time-dependent Hamilton-Jacobi-Bellman equations with Cordes coefficients, Numer. Math., (2015), pp. 1–36.

[42] H. Wang, Z. T. Zariphopoulou, and X. Zhou, Exploration versus exploitation in reinforcement learning: a stochastic control approach, J. Mach. Learn. Res., 21 (2020), pp. 1–34.

[43] H. Wang and X. Zhou, Continuous-time mean-variance portfolio selection: A reinforcement learning framework, Math. Finance, 30 (2020), pp. 1273–1308.

[44] J. Yong and X. Zhou, Stochastic Controls: Hamiltonian Systems and HJB Equations, Springer, New York, 1999.

[45] I. Zang, A smoothing-out technique for min-max optimization, Math. Program., 19 (1980), pp. 61–77.

[46] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, Maximum entropy inverse reinforcement learning, In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.

