In this paper, we propose a relaxed control regularization with a class of exploration rewards to design robust feedback controls for multi-dimensional stochastic control problems in a continuous setting. In particular, we shall rigorously demonstrate that the constructed optimal feedback control is Lipschitz stable with respect to perturbations in the underlying model.
Since parameter uncertainty in a given model is practically inevitable, it is essential but challenging to evaluate a priori the performance of a pre-computed feedback control in a perturbed system, and to design feedback policies capable of handling model uncertainty. For instance, let us consider the following infinite-horizon stochastic control problem. Suppose $(\alpha_t)_{t\ge 0}$ is an admissible control process taking values in a finite action space $A$, and the underlying state dynamics follows a controlled stochastic differential equation (SDE) defined as follows:
$$dX_t = b(X_t,\alpha_t)\,dt + \sigma(X_t,\alpha_t)\,dW_t, \quad t\ge 0; \qquad X_0 = x,$$
where $b$ and $\sigma$ are given coefficients. The aim of the controller is to maximize the total expected discounted reward over all admissible strategies. It is well-known that (see e.g. [19, Corollary 5.1 on p. 167] and Theorem 2.2 for more precise statements), under certain regularity assumptions, the optimal control strategy can be represented as a deterministic function $\alpha^*$, called the optimal feedback control, which maps the current state into the action space. Moreover, one can construct such an optimal feedback control $\alpha^*$ via a verification argument, which consists of solving a nonlinear Hamilton–Jacobi–Bellman (HJB) partial differential equation (PDE) arising from the dynamic programming principle for the optimal reward function $u$, and then performing a pointwise maximization of the associated Hamiltonian involving the function $u$ and its derivatives $(Du, D^2u)$ as follows: for any given state $x$,
$$\alpha^*(x) \in \arg\max_{a\in A}\Big(\tfrac{1}{2}\,\mathrm{tr}\big(\sigma\sigma^{\top}(x,a)\,D^2u(x)\big) + b(x,a)^{\top}Du(x) - c(x,a)\,u(x) + f(x,a)\Big), \tag{1.1}$$
where the functions $c$ and $f$ denote the discount rate and the instantaneous reward, respectively. We refer the reader to Theorems 2.2 and 3.5 for rigorous arguments of the above procedure for control problems of our interest, and to [19, Theorem 5.1 on p. 166] for a general statement.
We observe, however, that a control strategy satisfying (1.1) is in general difficult to implement and unstable with respect to parameter perturbations, which in practice results in numerical instability of learning algorithms. Due to the finiteness of the action space $A$ and the fact that arg max is a set-valued mapping, a function $\alpha^*$ satisfying (1.1) is in general non-unique and merely measurable, and hence it is hard to follow such an irregular strategy in practice. More importantly, the discreteness of the set $A$ implies that the arg max mapping is not continuous (in the sup-norm), which makes the feedback control $\alpha^*$ very sensitive to perturbations of the coefficients $(b,\sigma,c,f)$. In other words, a slight change of the model parameters will result in a significant change of the feedback control, especially in the regions where two or more actions lead to similar performances based on the current model. Since it is difficult to determine the occurrence of such regions a priori, it is unclear how well the control strategy $\alpha^*$ will perform in a real system with the perturbed coefficients $(\tilde b,\tilde\sigma,\tilde c,\tilde f)$, even if $(\tilde b,\tilde\sigma,\tilde c,\tilde f)$ is very close to $(b,\sigma,c,f)$. See the last paragraph of Section 2 for more details on the instability of feedback controls and its practical impact on learning algorithms.
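The instability of the arg max mapping can be illustrated numerically. The following minimal sketch is not part of our analysis: the vector `q` is a hypothetical stand-in for the Hamiltonian evaluated at two actions, and the softmax policy is used as one example of a regularized strategy, with an assumed weight `eps`. It compares how a greedy policy and a softmax policy react to a perturbation of size $10^{-6}$.

```python
import numpy as np

def greedy_policy(q):
    """Pure exploitation: one-hot vector on the first maximizer of q."""
    policy = np.zeros_like(q, dtype=float)
    policy[np.argmax(q)] = 1.0
    return policy

def softmax_policy(q, eps):
    """Entropy-regularized policy: softmax of q / eps (shifted for stability)."""
    w = np.exp((q - np.max(q)) / eps)
    return w / w.sum()

# Two nearly equivalent actions, before and after a tiny model perturbation
# (hypothetical values chosen only for illustration).
q = np.array([1.0, 1.0 - 1e-6])
q_pert = np.array([1.0 - 1e-6, 1.0])

# Sup-norm change of each policy under the perturbation.
greedy_gap = np.abs(greedy_policy(q) - greedy_policy(q_pert)).max()
soft_gap = np.abs(softmax_policy(q, 0.1) - softmax_policy(q_pert, 0.1)).max()

print("greedy policy change :", greedy_gap)
print("softmax policy change:", soft_gap)
```

In this toy example the greedy policy switches entirely from one action to the other, while the change of the softmax policy remains of the order of the perturbation divided by `eps`.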
A tremendous amount of effort has been made to overcome the above difficulties, particularly in the (discrete-time) Reinforcement Learning (RL) setting (see e.g. [39]), where the agent seeks (nearly) optimal decisions in a random environment with incomplete information. Generally speaking, the controller must balance between greedily exploiting the available information to choose actions that maximize short-term rewards, and continuously exploring the environment to acquire more knowledge for long-term benefits. In particular, an entropy-regularized formulation has been proposed for solving (discrete-time) RL problems in [46, 33, 21], where the authors incorporate exploration by explicitly including the entropy of the exploration strategy in the optimization objective as a reward function, and balance exploitation and exploration by adjusting a weight imposed on this regularization term. Empirical studies (e.g. [46, 25, 33, 21]) show that such a regularized formulation leads to more robust decision making. Recently, the authors in [42, 43] extended this entropy-regularized formulation to continuous-time RL problems by using the relaxed control framework, and studied the exploration/exploitation trade-off for one-dimensional linear-quadratic (LQ) control problems via explicit solutions. The relaxed control approach has then been extended to (discrete-time) RL problems with mean-field controls in [23].
In this work, we propose an exploratory framework with general exploration rewards to design robust feedback controls for continuous-time stochastic exit time problems with continuous state space and discrete action space. Our formulation extends the relaxed control approach in [42, 43] to multi-dimensional state dynamics and general exploration rewards, including Shannon’s differential entropy and other commonly used regularization functions in the optimization literature (see e.g. [15, 45]); see the remark at the end of Section 3 for a detailed comparison among different exploration reward functions.
A major theoretical contribution of this work is a rigorous stability analysis of the regularized control problem and its associated feedback control strategy. Although the entropy-regularized RL formulation has demonstrated remarkable robustness in various empirical studies (e.g. [46, 25, 33, 21, 23, 43]), to the best of our knowledge, there is no published theoretical work on the Lipschitz stability of feedback relaxed controls with respect to parameter uncertainty (even in a discrete-time setting) nor on the Lipschitz stability of the value functions for regularized continuous-time stochastic control problems with general multi-dimensional nonlinear state dynamics. In fact, most existing results on the Lipschitz stability of feedback controls are for LQ control problems with linear state dynamics and quadratic cost functions (see e.g. [31] for discrete-time LQ problems in an ergodic setting and [6] for finite-horizon continuous-time LQ problems). The stability analysis of such problems relies heavily on the linearity of optimal feedback controls and the associated Riccati equations, and hence cannot be directly extended to general nonlinear control problems. We refer the reader also to [2, 30, 3, 4, 7, 8, 27] for the continuity of various stochastic optimization problems, including stochastic control problems and optimal stopping problems, in the underlying processes with respect to the (extended) weak topology.
In this work, we shall close the gap by providing a theoretical justification for the recent RL heuristic that including an exploration reward in the optimization objective leads to more robust decision making. In particular, we shall demonstrate that the change in value functions of the regularized control problems (in a suitable Hölder norm) depends Lipschitz-continuously on the perturbations of the model parameters, including the coefficients of the state dynamics and reward functions in the optimization objective. We shall also prove that the regularized control problem admits a Hölder continuous feedback control (in contrast, the original control $\alpha^*$ is merely measurable), which is Lipschitz stable (in a suitable Hölder norm) with respect to parameter perturbations; see Theorem 4.2.
Moreover, this is the first paper which precisely quantifies the performance of a feedback control pre-computed based on a given model in a new multi-dimensional controlled dynamics with perturbed coefficients. We will prove that the gap between the suboptimal reward function achieved by the pre-computed feedback relaxed control and the optimal reward function of the perturbed relaxed control problem depends Lipschitz-continuously on the magnitude of perturbations in the coefficients (see Theorem 4.4). We also establish a first-order sensitivity equation for the value function and feedback control of the perturbed relaxed control problem (see Theorem 5.2 and Remark 5.1), which enables us to quantify the explicit dependence of the Lipschitz stability of feedback controls on the exploration parameter (see Theorem 5.4).
Let us briefly comment on the two main difficulties encountered in the stability analysis of feedback relaxed controls beyond those encountered in the finite-dimensional RL setting (see e.g. [20, 12, 21]) and the LQ setting (see e.g. [31, 6]). As we shall see in (3.6), the feedback relaxed control (in the present continuous setting) is defined as the pointwise maximizer of the associated Hamiltonian, which in general involves not only the value function of the regularized control problem, but also its first and second order derivatives. Hence, besides estimating the sup-norm of the value functions as in the finite-dimensional RL setting, we also need to quantify the impact of parameter uncertainty on the (first and second order) derivatives of the value functions, which are solutions to fully nonlinear HJB PDEs. For continuous-time LQ problems, such an analysis can be greatly simplified by taking advantage of the quadratic structure of the value function, which reduces the study of HJB PDEs to that of Riccati ordinary differential equations. Such a simplification is not possible for general nonlinear control problems, which requires us to derive a precise a priori estimate for the derivatives of solutions to the associated fully nonlinear HJB equations.
Moreover, the Lipschitz stability and the first-order sensitivity analysis of the feedback relaxed controls also require us to establish the regularity of the HJB operator and the arg max-mapping between suitable function spaces for regularized control problems. As already pointed out in [40, 26], the fact that the HJB operator is fully nonlinear (since we allow the diffusion coefficients to be controlled) poses a significant challenge for choosing proper function spaces to simultaneously ensure the differentiability of the fully nonlinear HJB operator and the bounded invertibility of its (Fréchet) derivative, which are essential for deriving the sensitivity equations of the value functions and feedback controls (see Theorem 5.2 and Remark 5.1). Here, by taking advantage of the exploration reward functions, we demonstrate that the HJB operator and the arg max-mapping for the regularized control problem are sufficiently smooth between suitable Hölder spaces, which together with an elliptic regularity estimate leads us to the desired sensitivity results for the feedback relaxed controls; see Remark 4.1 for more details.
Finally, we establish that, as the exploration parameter tends to zero, the value function of the relaxed control problem converges monotonically to that of the classical stochastic control problem with a first-order accuracy (see Theorem 6.1). The convergence of value functions (in a suitable norm) subsequently enables us to deduce a novel uniform convergence result (on compact sets) of the feedback relaxed control to a pure exploitation strategy of the original control problem. We further prove an exact regularization property for a class of reward functions, which allows us to recover the pure exploitation strategy based on the feedback relaxed control without sending the exploration parameter to 0 (see Theorem 6.4).
We organize this paper as follows. Section 2 introduces the stochastic exit control problem, and establishes its connection to HJB equations. In Section 3, we propose a relaxed control regularization involving general exploration reward functions for the stochastic control problem, and establish the Hölder regularity of the feedback relaxed control strategy. Then, for a fixed positive exploration parameter, we prove the Lipschitz stability of the value function and feedback relaxed control with respect to parameter perturbations in Section 4, and derive their first-order sensitivity equations in Section 5. We establish the convergence of value functions and relaxed control strategies for vanishing exploration parameters in Section 6. Appendix A is devoted to the proofs of some technical results.
In this section, we introduce the stochastic exit time problem of our interest, state the main assumptions on its coefficients, and recall its connection with HJB equations. We start with some useful notation which is needed frequently throughout this work.
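For any given multi-index $\beta=(\beta_1,\ldots,\beta_n)$ with nonnegative integer entries, we define $|\beta|=\sum_{i=1}^n\beta_i$ and $D^\beta=\partial^{|\beta|}/(\partial x_1^{\beta_1}\cdots\partial x_n^{\beta_n})$. For any given open subset $O\subset\mathbb{R}^n$, function $\phi:\overline{O}\to\mathbb{R}$, $k\in\mathbb{N}\cup\{0\}$ and $\theta\in(0,1]$, we define the following (semi-)norms:
$$|\phi|_{0;O}=\sup_{x\in O}|\phi(x)|,\qquad [\phi]_{k;O}=\max_{|\beta|=k}|D^\beta\phi|_{0;O},\qquad [\phi]_{k,\theta;O}=\max_{|\beta|=k}\,\sup_{x,y\in O,\,x\neq y}\frac{|D^\beta\phi(x)-D^\beta\phi(y)|}{|x-y|^{\theta}}.$$
Then we shall denote by $C^k(\overline{O})$ the space of $k$-times continuously differentiable functions in $O$ equipped with the norm $|\phi|_{k;O}=\sum_{j=0}^{k}[\phi]_{j;O}$, and by $C^{k,\theta}(\overline{O})$ the space consisting of all functions in $C^k(\overline{O})$ satisfying $[\phi]_{k,\theta;O}<\infty$, equipped with the norm $|\phi|_{k,\theta;O}=|\phi|_{k;O}+[\phi]_{k,\theta;O}$. When $k=0$, we use $C^{\theta}(\overline{O})$ to denote $C^{0,\theta}(\overline{O})$. We shall omit the subscript $O$ in the (semi-)norms if no confusion appears.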
Finally, we shall denote by [matrix whose ijth-entries are given by
and
, respectively, the set of
symmetric, symmetric positive semi-definite and symmetric positive definite matrices, by
the fact that
is positive semi-definite. For any given $K\in\mathbb{N}$, we denote by $\Delta_K$ the probability simplex in $\mathbb{R}^K$.
Now we are ready to introduce the control problem of interest. In order to allow irregular feedback control strategies, we consider the following weak formulation of a control problem, which includes the underlying probability space as part of control strategies (see e.g. [44, 19]). See Remark 2.2 for possible extensions to stochastic control problems under strong formulation, for which the underlying probability reference system is fixed.
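Definition 2.1. A 5-tuple $\nu=(\Omega,\mathcal{F},\{\mathcal{F}_t\}_{t\ge 0},\mathbb{P},W)$ is said to be a reference probability system if $(\Omega,\mathcal{F},\{\mathcal{F}_t\}_{t\ge 0},\mathbb{P})$ is a filtered probability space satisfying the usual condition, and $W$ is an $n$-dimensional Brownian motion with respect to $\{\mathcal{F}_t\}_{t\ge 0}$. We denote by $\Pi$ the set of all reference probability systems.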
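Now let $O$ be a given bounded domain in $\mathbb{R}^n$, i.e., a bounded connected open subset of $\mathbb{R}^n$. The aim of the controller is to maximize the expected discounted reward up to the first exit time of a controlled dynamics from the domain $O$. More precisely, let $\nu=(\Omega,\mathcal{F},\{\mathcal{F}_t\}_{t\ge 0},\mathbb{P},W)\in\Pi$ be a given reference probability system, and $\mathcal{A}_\nu$ be the set of $\{\mathcal{F}_t\}_{t\ge 0}$-progressively measurable processes $\alpha$ taking values in a finite set $A$. For any given initial state $x\in\overline{O}$ and control $\alpha\in\mathcal{A}_\nu$, we consider the controlled dynamics $X^{x,\alpha}$ satisfying the following SDE:
$$dX_t=b(X_t,\alpha_t)\,dt+\sigma(X_t,\alpha_t)\,dW_t,\quad t\ge 0;\qquad X_0=x, \tag{2.2}$$
where $b$ and $\sigma$ are given Lipschitz continuous functions (see (H.1) for precise conditions), and denote by $\tau^{x,\alpha}=\inf\{t\ge 0 : X^{x,\alpha}_t\notin O\}$ the first exit time of the dynamics $X^{x,\alpha}$ from the domain $O$, and by $\Gamma^{x,\alpha}$ the controlled discount factor: $\Gamma^{x,\alpha}_t=\exp\big(-\int_0^t c(X^{x,\alpha}_s,\alpha_s)\,ds\big)$ for all $t\in[0,\tau^{x,\alpha}]$. Then, for each given $x\in\overline{O}$, we shall consider the following value function:
$$v(x)=\sup_{\nu\in\Pi}\,\sup_{\alpha\in\mathcal{A}_\nu}\mathbb{E}\bigg[\int_0^{\tau^{x,\alpha}}\Gamma^{x,\alpha}_t f(X^{x,\alpha}_t,\alpha_t)\,dt+g\big(X^{x,\alpha}_{\tau^{x,\alpha}}\big)\Gamma^{x,\alpha}_{\tau^{x,\alpha}}\bigg], \tag{2.3}$$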
where the functions f and g denote, respectively, the running reward and the exit reward. Throughout this work, we shall perform the analysis under the following assumptions on the coefficients:
is a set of cardinality
a bounded domain in
. There exist constants
such that the boundary
of O is of class
, and the functions
Remark 2.1. The Lipschitz continuity of the drift and diffusion coefficients ensures that, for any given initial state and admissible control, the controlled SDE (2.2) admits a unique strong solution. Moreover, the non-degeneracy of the diffusion coefficient ensures that SDEs with non-Lipschitz feedback controls admit a weak solution (cf. Theorems 2.2 and 3.5); see also Lemma 3.1.
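To make the reward functional in (2.3) concrete, the following minimal sketch simulates one sample of the discounted reward up to the first exit time by an Euler–Maruyama scheme. It is an illustration under simplifying assumptions and not part of our analysis: the state is one-dimensional, the domain is an interval `(lo, hi)`, and all names (`simulate_reward`, `policy`, `t_max`) are hypothetical. The time discretization also means that the exit reward is evaluated at the first grid point outside the domain rather than on its boundary.

```python
import numpy as np

def simulate_reward(x0, policy, b, sigma, c, f, g, lo, hi,
                    dt=1e-2, t_max=50.0, rng=None):
    """One Euler-Maruyama sample of the discounted reward up to exit from (lo, hi)."""
    rng = np.random.default_rng() if rng is None else rng
    x, t = float(x0), 0.0
    discount, reward = 1.0, 0.0   # discount approximates exp(-int_0^t c ds)
    while lo < x < hi and t < t_max:
        a = policy(x)                          # feedback control: state -> action
        reward += discount * f(x, a) * dt      # running reward
        discount *= np.exp(-c(x, a) * dt)      # update the discount factor
        x += b(x, a) * dt + sigma(x, a) * np.sqrt(dt) * rng.standard_normal()
        t += dt
    if not (lo < x < hi):                      # exit reward only if the path exited
        reward += discount * g(x)
    return reward

# Deterministic sanity check (zero diffusion, used only to validate the code):
# unit drift from 0 exits (-1, 1) at time 1, so the undiscounted reward is close to 1.
drift_only = simulate_reward(
    0.0, policy=lambda x: 0,
    b=lambda x, a: 1.0, sigma=lambda x, a: 0.0,
    c=lambda x, a: 0.0, f=lambda x, a: 1.0, g=lambda x: 0.0,
    lo=-1.0, hi=1.0,
)
print(drift_only)
```

Averaging many such samples for a fixed feedback control gives a Monte Carlo estimate of the reward achieved by that control; the supremum in (2.3) is not computed by this sketch.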
As shown in [22, Lemma 6.38], the fact that is of class
ensures that a function in
) has boundary values in
), and conversely, any function
extended to a function in
). Hence, one can introduce a boundary norm
space
), such that for any given
is a global extension of
. The space
) equipped with the norm
is a Banach space (see e.g. the discussions on page 94 in [22]).
To simplify the presentation, we study exit time control problems with Hölder continuous coefficients in this work and analyze classical solutions of associated elliptic HJB equations. Similar results, including the characterization and Lipschitz stability of feedback relaxed controls in Sections 3 and 4, can be obtained for finite horizon control problems with measurable coefficients, whose corresponding parabolic HJB equations admit weak solutions in suitable Sobolev spaces (see [41] for the well-posedness of weak solutions to parabolic HJB equations and [29, Theorem 1 on p. 122] for a generalized Itô’s formula). The first-order sensitivity analysis in Section 5 in general can only be performed for classical solutions in Hölder spaces; see Remark 4.1 for details.
The rest of this section is devoted to the connection between the stochastic exit time problem and a Hamilton-Jacobi-Bellman (HJB) boundary value problem, which plays an essential role in the construction of feedback control strategies. More precisely, we now consider the following HJB equation with inhomogeneous Dirichlet boundary data:
where is the pointwise maximum function, i.e.,
(
is the function satisfying
is a family of elliptic operators satisfying for all
Above and hereafter, when there is no ambiguity, we shall denote by ) a generic function
, and adopt the summation convention as in [22, 16], i.e., repeated equal dummy indices indicate summation from 1 to n.
Throughout this paper, we shall focus on the classical solution to (2.6), whose well-posedness is established in the following theorem, which subsequently enables us to characterize optimal feedback controls for (2.3).
Theorem 2.1. Suppose (H.1) holds, and let . Then the Dirichlet problem (2.6) admits a unique solution
Moreover, there exists a constant
Proof. We shall only prove the uniqueness of solutions in ), since the existence of classical solutions in
) will be established constructively based on the relaxed control approximation in Theorem 6.1 (see also [16, Theorem 7.5] for a proof of existence based on the method of continuity), and the existence of a Borel measurable function satisfying (2.8) follows directly from the measurable selection theorem (see [1, Theorem 18.19]).
Let ) be solutions to (2.6). Then for all
, we can deduce from the fundamental theorem of calculus that
where is a measurable function, and ˜L denotes the elliptic operator satisfying for all
particular, the function h can be chosen as the weak limit of the functions ([0
(
quence of smooth approximations of
obtained by using the standard mollification argument. Then we can easily show that
is a uniformly elliptic operator, and
. Hence the classical maximum principle (see e.g. [22, Theorem 3.7]) and
imply that
, which shows that the Dirichlet problem (2.6) admits at most one solution in
We now present a verification result, i.e., Theorem 2.2, which shows that the classical solution to the HJB equation (2.6) is the value function (2.3), and the Borel measurable function defined as in (2.8) is a feedback control of (2.3). The proof will be postponed to Appendix A, which essentially follows from Itô’s formula and the existence result of weak solutions to SDEs with non-degenerate diffusion coefficients (see [32, Theorem 1]).
We first recall the definition of optimal feedback control (see e.g. [44, Definition 6.1]).
Definition 2.2. A Borel measurable function is said to be a feedback control of (2.3) if for all
, there exists
-progressively measurable continuous process (
, such that
O}. A feedback control h is said to be optimal if we have for all
Theorem 2.2. Suppose (H.1) holds. Let be the value function defined as in (2.3),
be the solution to the Dirichlet problem (2.6), and
be a Borel measurable function satisfying (2.8). Then we have
optimal feedback control of (2.3).
Remark 2.2. As shown in Theorem 2.2, by considering a weak formulation of the stochastic control problem (2.3) with reference probability systems varying in Π, we can rigorously demonstrate that a measurable function
satisfying (2.8) is indeed an optimal feedback control strategy.
One can also consider stochastic exit time problems under a strong formulation, for which we first fix a reference probability system ), and the agent only maximizes the reward functional over all admissible control processes in
. It has been shown in [14, Theorem 2.1] that, if we assume (H.1) and
) satisfies the strong comparison principle i.e., a comparison result for semicontinuous viscosity solutions. In particular, (H4) in [14] is satisfied since
enjoys the exterior ball condition, and (H5) in [14] is satisfied with Γ
due to the uniform ellipticity condition (2.4). The strong comparison principle further enables us to show that the value function of the stochastic control problem (under the strong formulation) is the unique continuous viscosity solution to (2.6); see [5, Theorem 3.1]. Since the classical solution u is a viscosity solution of (2.6), we see it is the value function of the stochastic control problem (under the strong formulation), and the strategy
defined in (2.8) will lead to the optimal reward. Hence, we can still view the function
as an optimal feedback control.
We reiterate that, due to the fact that arg max is a set-valued mapping, the feedback control strategy (2.8) in general is non-unique, discontinuous, and sensitive to the perturbation of the coefficients. For instance, let K = 2, and consider the set 0} at whose boundary the optimal control
) could have a jump discontinuity. Except for the trivial case where
is a constant on O, one can easily deduce from the connectedness of O, the fact that
), and the continuity of the coefficients that the set G is non-empty. Since the boundary of the level set G can have poor regularity, we see that the feedback control in general is merely Borel measurable, which makes it substantially harder to follow the optimal control in practice. Moreover, the discontinuity of the feedback control also implies that a small perturbation of the coefficients could lead to a significant change of the feedback control in the sup-norm, especially near the boundary of the set G. It is well-known (see e.g. [9, Section 6.4.2] and [24, Figure 4]) that such an instability of feedback controls would result in a numerical instability of the learning process, i.e., the approximate policies generated by an iterative learning algorithm may change substantially from one iteration to the next, and eventually oscillate among several far-from-optimal policies.
In this section, we propose a relaxation of the stochastic exit time problem (2.3), which extends the ideas used in [42] to control problems with multi-dimensional controlled dynamics and general exploration reward functions. As we shall see shortly, the relaxed control problem has a Hölder continuous feedback control strategy, and enjoys better stability with respect to perturbation of the coefficients.
The following technical lemma, whose proof is included in Appendix A, is essential for the formulation of relaxed control problems with multi-dimensional dynamics.
Lemma 3.1. Suppose (H.1) holds. Then there exist unique functions ˜˜
such that it holds for all
Moreover, it holds for all
We now proceed to introduce the relaxation of the exit time problem (2.3). Roughly speaking, instead of seeking the optimal feedback action, which maps the current state to a specific action in the space A, we seek the optimal feedback control distribution, which is a deterministic mapping from the current state to a probability measure over the space ). Once such a mapping is determined, at each given state, the agent will execute the control by sampling a control action based on the distribution
). We refer the reader to [42] for a more detailed derivation of the following regularized control problem (3.6) in a one-dimensional setting. Note that the fact that A has cardinality
enables us to identify the space of probability measures over A as the probability simplex ∆
More precisely, let be a given reference probability system, and
be the set of
-progressively measurable processes
taking values in the set ∆
Suppose that (H.1) holds, for any given initial state
, and control
, we consider the controlled diffusion process
satisfying the following SDE:
where ˜are the functions defined in Lemma 3.1. We further introduce the first exit time of
from the domain
and the controlled discount factor Γ
Now let be a given exploration reward function satisfying
∆
(precise conditions will be specified in (H.2)). For any given relaxation parameter
consider the following value function: for each
Note that the exploration reward function plays a crucial role in the above relaxed control regularization. If we set the exploration reward function
0 or the relaxation parameter
then one can show that Dirac measures supported on the optimal strategies of the original control problem (2.8) (see
defined as in (2.8)) are optimal control distributions of the relaxed control problem (3.2), and the value function v in (2.3) will be equal to the value function
(see Theorems 6.1 and 6.4). Hence, to achieve the stability of the optimal control strategy for the relaxed control problem (3.2), we shall impose the following condition on the reward function
H.2. There exists a convex function and a constant
, depending on K, such that for all
We remark that (H.2) is satisfied by most commonly used reward functions, including Shannon’s differential entropy proposed in [46, 25, 33, 21, 42]. We refer the reader to the discussion at the end of this section for a detailed comparison of different reward functions.
Given a function , we define for each
0 the function
for all
Note that the functions so defined are convex if H is a convex function. The next lemma follows directly from (H.2) and standard arguments in convex analysis; its proof is given in Appendix A for completeness.
(1) the function is convex on
, continuous relative to ∆
, and satisfies that
(2) it holds for all ) = arg max
. Consequently, we have for all
and
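For the Shannon entropy reward, the arg max characterization can be checked numerically through the classical Gibbs variational principle: over the probability simplex, the maximizer of $y^\top x+\varepsilon\rho(y)$ with $\rho(y)=-\sum_k y_k\ln y_k$ is the softmax of $x/\varepsilon$, and the maximal value is $\varepsilon\ln\sum_k\exp(x_k/\varepsilon)$. The sketch below is an illustration for this special case only; the function names and the test vector `x` are hypothetical, and the notation $\varepsilon$ for the relaxation parameter is assumed.

```python
import numpy as np

def entropy(y):
    """Shannon entropy -sum y_k log y_k, with the convention 0 log 0 = 0."""
    y = np.asarray(y, dtype=float)
    pos = y > 0
    return -np.sum(y[pos] * np.log(y[pos]))

def smoothed_max(x, eps):
    """eps * log(sum(exp(x / eps))), evaluated with a max-shift for stability."""
    m = np.max(x)
    return m + eps * np.log(np.sum(np.exp((x - m) / eps)))

def softmax(x, eps):
    """Gradient of smoothed_max with respect to x."""
    w = np.exp((x - np.max(x)) / eps)
    return w / w.sum()

def objective(y, x, eps):
    """Regularized objective y.x + eps * entropy(y) on the simplex."""
    return float(np.dot(y, x) + eps * entropy(y))

rng = np.random.default_rng(0)
x, eps = np.array([0.3, -1.2, 0.9]), 0.5

# Value of the objective at the softmax distribution.
best = objective(softmax(x, eps), x, eps)

# No randomly drawn distribution on the simplex should do better.
samples = rng.dirichlet(np.ones(3), size=1000)
worst_violation = max(objective(y, x, eps) for y in samples) - best

print(best, smoothed_max(x, eps), worst_violation)
```

The random search is only a sanity check, not a proof; it confirms that the softmax distribution attains the value of the smoothed maximum and that none of the sampled distributions exceeds it.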
We proceed to study the corresponding HJB equation of the relaxed control problem (3.2), which plays a crucial role in our subsequent analysis. For each be the function satisfying for all
f defined as in (2.6)), and
be the elliptic operator satisfying for all
where we have used the definition of the elliptic operators )), and the definition of the functions ˜
(cf. Lemma 3.1).
Since the diffusion coefficient of SDE (3.1) is non-degenerate (see Lemma 3.1) and all coefficients of the relaxed control problem (3.2) are continuous, a formal application of the dynamic programming principle (see e.g. [19, 13] and references therein) enables us to associate the relaxed control problem (3.2) with the following HJB equation:
Moreover, (3.4) and Lemma 3.2(2) imply that the above Dirichlet problem is equivalent to
where the function is defined as in (3.3), and L, f are defined as those in (2.6).
In order to rigorously justify the connection between (3.2) and (3.5), we establish the well-posedness of classical solutions to (3.5) in Theorem 3.4, and then prove a verification result in Theorem 3.5.
We need the following proposition, which gives an a priori estimate of classical solutions to (3.5). We postpone the proof to Appendix A, which adapts the technique in [16, Theorem 7.5 on p. 127] to HJB equations with compact control sets, and reduces the problem to an a priori estimate for HJB equations involving only principal terms.
Proposition 3.3. Suppose (H.1) and (H.2) hold, and let . Then there exists a constant
, such that it holds for all
is a solution to the Dirichlet problem (3.5) with parameter
satisfies the estimate that
, where the constant C depends only on
Theorem 3.4. Suppose (H.1) and (H.2) hold, let Dirichlet problem (3.5) admits a unique solution
Moreover, there exists
Proof. One can deduce by similar arguments as those for Theorem 2.1 and the classical maximum principle that (3.5) admits a unique classical solution in ). Moreover, by using the a priori bound of classical solutions in Proposition 3.3, we can establish the existence and regularity of the classical solution
) based on the method of continuity; see [16, Theorem 5.1 on p. 116].
Now let ) be the solution to (3.5) with some
]. The continuity of
, and Lemma 3.2(2) ensure that the function
is well-defined on O, and has the expression
). Note that, it holds for any given
). Hence the Hölder continuity of the coefficients (see (H.1)) implies that
). We can then easily deduce from the local Lipschitz continuity of
that
The next theorem shows that the function (3.6) is an optimal feedback control of (3.2), which is defined similarly to Definition 2.2. The proof of this statement is similar to that of Theorem 2.2 and hence omitted.
Theorem 3.5. Suppose (H.1) and (H.2) hold. Let be the value function defined as in (3.2),
be the solution to the Dirichlet problem (3.5), and
be the function defined as in (3.6). Then
is an optimal feedback control of (3.2).
Remark 3.1. Theorem 3.4 shows that the feedback relaxed control is uniquely defined and Hölder continuous. This improved regularity makes it easier to implement the relaxed control in practice, compared to the original (merely measurable) feedback control (cf. Theorem 2.1).
We end this section with a remark about possible choices of reward functions. Generally speaking, we shall choose a reward function whose generating function H and its gradient
can be efficiently evaluated, such that one can design an efficient algorithm to solve the relaxed control problem (3.2) (see e.g. [46, 25, 33, 21, 26]). A common choice of reward functions in the literature is the following entropy-type reward function (see e.g. [28, 35, 36, 42]):
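$$\rho_{\mathrm{en}}(y)=-\sum_{k=1}^{K}y_k\ln y_k,\qquad y\in\Delta_K,$$
whose generating function is $H_{\mathrm{en}}(x)=\ln\big(\sum_{k=1}^{K}\exp(x_k)\big)$. One can show that $H_{\mathrm{en}}$ is smooth, and it satisfies (H.2) with the constant $\ln K$ (see e.g. [36]).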
The advantage of the entropy reward function is that both the reward function and its generating function are given in closed form, and they can be naturally extended to continuous action spaces A (see e.g. [42]). However, it is important to notice that the evaluation of the generating function and its gradient involves exponentials. Hence, when the relaxation parameter is small, a naive implementation of iterative algorithms for solving (3.5), which in general involves evaluating these functions at arguments of large magnitude, may lead to unreliable results due to unstable floating-point arithmetic; see [10, Example 4.2] and [11] for more details. Moreover, the optimal relaxed control of (3.2) may converge to the optimal control of (2.3) at a very slow rate as the relaxation parameter tends to zero.
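The floating-point issue can be reproduced in a few lines. The sketch below assumes that the entropy-type generating function is scaled as $\varepsilon\ln\sum_k\exp(x_k/\varepsilon)$ for a relaxation parameter $\varepsilon$, and compares a naive evaluation with the standard max-shift (log-sum-exp) trick. The shift is a common remedy and not a method taken from this paper; the function names and the vector `x` are hypothetical.

```python
import numpy as np

def smoothed_max_naive(x, eps):
    """Direct evaluation of eps * log(sum(exp(x / eps)))."""
    return eps * np.log(np.sum(np.exp(x / eps)))

def smoothed_max_stable(x, eps):
    """Same quantity, with the maximum subtracted before exponentiating."""
    m = np.max(x)
    return m + eps * np.log(np.sum(np.exp((x - m) / eps)))

x = np.array([1.0, 0.5, -0.2])
eps = 1e-3   # small relaxation parameter: x / eps reaches 1000

with np.errstate(over="ignore"):           # silence the overflow warning
    naive = smoothed_max_naive(x, eps)     # exp(1000) overflows to inf
stable = smoothed_max_stable(x, eps)

print("naive :", naive)
print("stable:", stable)
```

The stable value lies between $\max_k x_k$ and $\max_k x_k+\varepsilon\ln K$, which is the elementary bound satisfied by the log-sum-exp function, whereas the naive evaluation overflows in double precision.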
Alternatively, by virtue of the fact that only the generating function H and its gradient are involved in the HJB equation (3.5) and the feedback control (3.6), we can also obtain a reward function by directly constructing a K-dimensional function H based on a recursive application of smoothing functions for the two-dimensional max function. For instance, we can start with the following two-dimensional smoothing functions (see e.g. [15, 45]): for
Then, for any given $K\ge 3$, by using the fact that $\max(x_1,\ldots,x_K)=\max\big(\max(x_1,\ldots,x_{K-1}),x_K\big)$, we can express the K-dimensional max function as a nested application of the two-dimensional max function and one-dimensional identity function. Hence, by replacing the two-dimensional max function with the two-dimensional smoothing function (3.7) (resp. (3.8)) in the recursive expression, we can obtain the corresponding K-dimensional smoothing function. It has been shown in [10, Lemma 3.3] that for any given $K\ge 2$, both K-dimensional smoothing functions satisfy (H.2).
Note that the evaluation of these smoothing functions and their gradients only involves square roots and multiplications; hence they are numerically more stable than the entropy-type smoothing (see [10]). More importantly, when the smoothing function only modifies the pointwise maximum function locally near its non-differentiable points, we can determine the optimal control of (2.3) precisely from the optimal control of (3.2) without sending the relaxation parameter to zero (see Theorem 6.4 and Remark 6.2 for details).
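The recursive construction can be sketched as follows. As an assumption, we use one common parameterization of the Chen–Harker–Kanzow–Smale smoothing of the two-dimensional max function, $(a+b+\sqrt{(a-b)^2+4\varepsilon^2})/2$; the exact form of (3.7) may differ, and the names `chks_max2` and `chks_max` are hypothetical. The K-dimensional function is obtained by nesting the two-dimensional one, following the identity for the max function stated above.

```python
import numpy as np

def chks_max2(a, b, eps):
    """Square-root smoothing of max(a, b); recovers max(a, b) when eps = 0."""
    return 0.5 * (a + b + np.sqrt((a - b) ** 2 + 4.0 * eps ** 2))

def chks_max(x, eps):
    """K-dimensional smoothing via max(x_1..x_K) = max(max(x_1..x_{K-1}), x_K)."""
    value = x[0]
    for xk in x[1:]:
        value = chks_max2(value, xk, eps)
    return value

x = np.array([0.4, 0.4, -1.0, 0.1])
eps = 0.05

# The smoothing overestimates the max; each nesting level adds at most eps.
gap = chks_max(x, eps) - np.max(x)
print("smoothed max:", chks_max(x, eps), " gap:", gap)
```

Since the two-dimensional smoothing exceeds the max by at most $\varepsilon$ and is nondecreasing in each argument, the nested function exceeds the K-dimensional max by at most $(K-1)\varepsilon$ under this parameterization, and only square roots and multiplications are evaluated.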
Figure 1 compares the functions and the reward functions generated by them. One can clearly see from Figure 1 (left) that
substantially modifies the pointwise maximum function
everywhere, while
only performs a modification of
locally near the kinks. For both functions, the difference from
peaks around the points where arg max
is not a singleton. Such points correspond to the regions where the agent of the control problem (2.3) cannot make a clear decision based on the current model, since two or more different actions would result in a very similar reward.
Figure 1 (right) depicts the reward functions 1
, for all (
(1/3, 1/3, 1/3) corresponds to the pure exploration strategy, i.e., the uniform distribution on the action space
, while the vertices of C correspond to the pure exploitation strategies, i.e., the Dirac measures supported on some
Both functions achieve their minimum around the point (1/3, 1/3, 1/3), which indicates that the exploration reward functions encourage the controller of the relaxed control problem to explore further, especially when it is difficult to choose a unique optimal action based on the current model.
Note that, by comparing the values of the reward functions near the point (1/3, 1/3, 1/3) and near the vertices of C, we see that in general gives more rewards for exploration than
. Consequently, to recover the value function and optimal control of (2.3), we have to take a smaller relaxation parameter for (2.3) with
than that for (2.3) with
, which could cause a numerical instability issue due to the exponentials in
(see e.g. [10]).
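The numerical issue caused by the exponentials can be seen in a small experiment. The sketch below assumes that the entropy-type smoothing takes the standard scaled log-sum-exp form eps * log(sum_k exp(x_k / eps)); the normalization used in the paper may differ, and the function names are hypothetical.

```python
import math

def lse_naive(x, eps):
    """eps * log(sum_k exp(x_k / eps)), evaluated literally."""
    return eps * math.log(sum(math.exp(xi / eps) for xi in x))

def lse_stable(x, eps):
    """Same quantity, with the maximum factored out before exponentiating."""
    m = max(x)
    return m + eps * math.log(sum(math.exp((xi - m) / eps) for xi in x))

x = [1.0, 0.5, 0.0]
for eps in (1e-1, 1e-3):
    try:
        naive = lse_naive(x, eps)
    except OverflowError:
        naive = float("inf")  # exp(1000) exceeds the double-precision range
    print(eps, naive, lse_stable(x, eps))
```

Factoring out the maximum only removes the overflow when evaluating the function itself; it does not change how sharply the associated gradient varies for small relaxation parameters.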
Figure 1: Comparison of and their corresponding reward functions for K = 3.
In this section, we shall fix a relaxation parameter 0 and study the robustness of the feedback control strategy (3.6) for a relaxed control problem associated with a perturbed model. In particular, we shall show that the control strategy (3.6) admits a (locally) Lipschitz continuous dependence on the perturbation of the coefficients, if the reward function is generated by a function H with locally Lipschitz continuous Hessian.
We start by presenting two technical results, which are essential for our subsequent analysis. The first one, due to Nugari [34], establishes the regularity of Nemytskij operators in Hölder spaces.
be an open bounded set,
be a continuously differentiable function, and Φ :
be the Nemytskij operator satisfying for all
is well-defined, continuous and bounded. Moreover, if we further suppose
is locally Lipschitz continuous (resp.
is twice continuously differentiable), then Φ is locally Lipschitz continuous (resp. continuously differentiable with the Fréchet derivative Φ
Remark 4.1. Lemma 4.1 enables us to view the fully nonlinear HJB operator and the value-to-action map
defined in (3.6) as differentiable maps between suitable Hölder spaces, which is essential for the sensitivity analysis on the value functions and feedback relaxed controls in Section 5.
Note that in general it is not possible to perform the same first-order sensitivity analysis by interpreting the HJB operator as a map between the Sobolev space
) and the Lebesgue space
). In fact, since the operator
) in general is only differentiable with p > q (see [40, Theorem 13]), we see the derivative of
, which is a second-order linear elliptic operator, is not bijective between
). Consequently, we cannot apply the implicit function theorem to derive the sensitivity equation for the value function (3.2) as in Theorem 5.2.
If the operator is only semilinear, i.e., the diffusion coefficient of (2.2) is uncontrolled, then one can show that
is differentiable between
derivative is a bijection between the same spaces (see [26] for the case with p = 2). In this case, we can extend Theorem 5.2 and study
-perturbation of the coefficients in (3.2).
Now we proceed to introduce a relaxed control problem with a set of perturbed coefficients satisfying the following conditions:
be the constants in (H.1), and Λ
be a constant. The functions ˆ
Let 0 be a fixed relaxation parameter. We shall consider a perturbed control problem (2.3) with the coefficients (ˆ
), and its relaxation (see (3.2)) with parameter
, whose value function is denoted as ˆ
. Then, by using Lemma 3.2, Theorems 3.4 and 3.5, one can verify that, under (H.2) and (H.3), the value function ˆ
is the classical solution ˆ
following Dirichlet problem:
where the function is defined as in (3.3), ˆ
is the function satisfying ˆf(x) = ( ˆ
is a family of elliptic operators satisfying for all
Moreover, we can deduce from (3.6) that, the optimal feedback control of the perturbed relaxed control problem is given by
Note that Theorem 3.4 shows that the classical solution ˆso the above function ˆ
is well-defined on
The following result shows the (local) Lipschitz dependence of ˆ
bation of the coefficients, which demonstrates the robustness of the relaxed control problem. For notational simplicity, given the functions (
) satisfying (H.1) and (H.3) respectively, we shall introduce for each
] the following measurement of perturbations:
Proof. Throughout this proof, we shall denote by C a generic constant, which depends only on , and may take a different value at each occurrence.
The a priori estimate in Proposition 3.3 shows that there exists a constant (0, 1), such that we have for all
)] the estimates
. Moreover, we have by the fundamental theorem of calculus that
in is the function defined as
Now let )] be a fixed constant. The fact that
the Hölder continuity of coefficients (see (H.1) and (H.3)), and the a priori estimates of
yield the estimate that
(see Lemma 4.1). Then, by setting
we can deduce from (4.5) that w is the classical solution to the following Dirichlet problem:
Hence the fact that ) and the global Schauder estimate in [22, Theorem 6.6] lead us to the estimate that
which, together with the maximum principle (see [22, Theorem 3.7]) and the a priori estimate of , enables us to conclude that:
with the constant defined as in (4.3). Now we show the stability of feedback controls. Note that (4.4) implies that
The additional assumption that ) has a locally Lipschitz continuous Hessian implies that
is differentiable with locally Lipschitz continuous derivatives, which along with Lemma 4.1 shows that the Nemytskij operator
) is locally Lipschitz continuous. Hence there exists a constant C, such that for all perturbed coefficients (ˆ
satisfying (H.3), we have
which finishes the desired (local) Lipschitz estimate.
Remark 4.2. The assumption that ) has a locally Lipschitz continuous Hessian is satisfied by most commonly used functions, including
Section 3. In general, if H is merely twice continuously differentiable as in (H.2), we can follow a similar argument and establish that the Hölder norm of the difference between two relaxed control strategies is continuously dependent on the Hölder norms of the perturbations in the coefficients.
Note that the Lipschitz stability result (4.4) in general does not hold for the original control problem (2.3) (or equivalently, In fact, for any given
rem 2] shows that the Nemytskij operator
) is not continuous, which implies that there exists (
such that lim
. Now for each
, we consider the following simple HJB equation (2.6): ∆
. Hence we have
, which implies that the
of the value function (2.3) does not depend continuously on the
-perturbation of the model parameters. See Theorem 5.4 for a precise quantification of
-dependence in (4.4).
The remaining part of this section is devoted to an important application of Theorem 4.2,
where we shall examine the performance of the control strategy , computed based on the relaxed control problem with the original coefficients (
)), on a new relaxed
control problem with perturbed coefficients satisfying (H.3).
We first observe that, if there exists a classical solution ) to the following
with ˆdefined as in (4.1), then by using Itô’s formula, one can easily show that the reward function
, resulting from implementing the Hölder continuous feedback control
relaxed control problem with the coefficients (ˆ
), coincides with the function
e.g. Theorems 2.2 and 3.5). On the other hand, we have seen that the (optimal) value function ˆ
of the perturbed relaxed control problem is the classical solution ˆ
). Hence it suffices to compare the classical solutions to (4.6) and (4.1).
The following proposition shows that (4.6) indeed admits a unique classical solution.
be the function defined as in (3.6), and
be the constant in Proposition 3.3. Then the Dirichlet problem (4.6) admits a unique solution
We are ready to show that the difference between this suboptimal reward function and the (optimal) value function ˆ
of the perturbed relaxed control problem depends Lipschitz continuously on the magnitude of the perturbations in the coefficients.
Hessian, then there exists , such that for all
the estimate
, with the constant
defined as in (4.3), and a constant
and C be a generic constant, which depends only on
which, together with the fact that ˆand the classical maximum principle (see [22, Theorem 3.7]), shows that ˆ
We now estimate ˆby assuming the function
) has a locally Lipschitz continuous Hessian. By using the definition of the optimal control ˆ
, we have that
By subtracting (4.6) from the above equation, we have
Note that, the a priori estimate in Proposition 3.3 shows that, under (H.1), (H.2) and (H.3), there exists a constant 1), such that we have for all
the estimates
, which, along with the fact that
) and Lemma 4.1, implies the
. Hence, for any given
we can deduce from the Schauder theory in [22, Theorem 6.6] and the maximum principle in [22, Theorem 3.7] that
By using the additional assumption that H has a locally Lipschitz continuous Hessian, and the identity (4.7), we can deduce that is continuously differentiable with a locally Lipschitz continuous gradient, from which, we can obtain from Lemma 4.1 that for any
1], the corresponding Nemytskij operator (
) is locally Lipschitz continuous. Hence, we can obtain from (4.8) and the definitions of
(3.6) and (4.2)) that
from which, we can conclude from the and Theorem 4.2 the desired estimate
In this section, we proceed to derive a first-order Taylor expansion for the value function and the optimal control of the relaxed control problem (3.2) with perturbed coefficients, which subsequently leads us to a first-order approximation of the optimal strategy for the perturbed problem based on the pre-computed optimal control. The sensitivity equation further enables us to quantify the explicit dependence of the Lipschitz stability result (4.4) on the relaxation parameter
The following proposition establishes the Fréchet differentiability of the fully nonlinear HJB operator with inhomogeneous boundary conditions. For notational simplicity, for any given (0, 1], and bounded open subset
boundary, we shall introduce the Banach space Θ
for the coefficients:
equipped with the product norm , and denote by
) a generic element in Θ
. We also denote by
) the Banach space of
functions defined on
(see Remark 2.1), and by
) the restriction operator on
. Furthermore, for any given Banach spaces X and Y , we denote by B(X, Y ) the Banach space containing all continuous linear mappings from X into Y , equipped with the operator norm.
Proposition 5.1. Suppose (H.2) holds. Let be a bounded domain in
be the function defined as in (3.3), Θ
be the Banach space defined as in (5.1), and
be the following HJB operator:
where for any given is the elliptic operators satisfying
satisfying for all (
that
Proof. We first write the HJB operator as the composition of the Nemytskij operator
) and the mapping
Θ
) is the linear boundary operator.
Since the function ), we can deduce from Lemma 4.1 that the Nemytskij operator
) is well-defined and continuously differentiable with the Fréchet derivative (
Moreover, since for any given (are affine mappings, one can easily compute the partial derivatives
Θ
follows: (
˜
). Moreover, it is clear that
are both continuous, which implies that
is continuously differentiable with derivative
for all (, Theorem 7.2-3]).
Therefore, by using the chain rule (see [17, Theorem 7.1-3]), we see the composite mapping ) is also continuously differentiable with the derivative
(
] for all (
). This, along with the fact that
Θ
) is a linear operator, enables us to conclude the desired differentiability of the operator
With the above proposition in hand, we are ready to derive the first-order sensitivity equation for the value function of the relaxed control problem with respect to the parameter perturbations.
Theorem 5.2. Suppose (H.1) and (H.2) hold. Let be the Banach spaces defined as in (5.1),
be the solution to the Dirichlet problem (3.5) (with the coefficients
be the constant in Proposition 3.3.
Then it holds for each that, there exists a neighborhood
neighborhood
, and a mapping
satisfying the following properties:
(2) is continuously differentiable with
, and for each
is the solution to the following Dirichlet problem:
Proof. The desired result comes from a direct application of the implicit function theorem (see [17, Theorem 7.13-1]). Theorem 3.4 shows that the Dirichlet problem (3.5) with the coefficients admits a solution
Let )] be a fixed constant. We shall consider the mapping
) defined as follows:
Due to the fact that ) satisfies (3.5) with the coefficients
0 in
), which subsequently implies that
The boundary condition of (3.5) implies that
Proposition 5.1 shows that is continuously differentiable on Θ
), and for each (˜
where we have used the definition of )). The classical maximum principle (see e.g. [22, Theorem 3.7]) implies that the map
is an injection. We now show it is also a surjection. Let ( ˆ
) be given. Then the assumption that
enables us to apply [22, Lemma 6.38] and extend ˆg to a function in
), which is still denoted by ˆg. The fact that
) (see Theorem 3.4) and the elliptic regularity theory (see [22, Theorem 6.14]) ensure that the Dirichlet problem
) admits a unique solution
Hence we see
) is a bijection.
Therefore, the implicit function theorem (see [17, Theorem 7.13-1]) ensures the existence of ) with derivative
we have
the characterization of partial derivatives of
enables us to conclude that
satisfies (5.2).
Remark 5.1. We can further obtain a first-order expansion of the optimal control in terms of the perturbations of the coefficients. If
0 and the function
and
in Section 3), then Lemma 4.1 shows that
is continuously differentiable with derivative (
where
is the Hessian of
. Hence, by using the chain rule and Theorem 5.2, we have for all
as is the optimal feedback control of the relaxed control problem with the perturbed coefficients
is the classical solution to (5.2).
With the sensitivity equation (5.2) in hand, we now estimate the precise dependence of the relaxation parameter
, which strengthens the Lipschitz stability result (4.4) by quantifying the explicit
-dependence of the (local) Lipschitz constant. Note that Remark 4.2 shows that the value function (2.3) (in the
-norm) does not depend continuously on the
-perturbation of the parameters, which suggests that for a fixed
will blow up as the parameter
tends to 0.
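A toy computation illustrates why such a blow-up is to be expected, at the level of the value-to-action map only. The sketch assumes two actions and an entropy-type smoothing, so that the relaxed control is a softmax of the Hamiltonian values; the function names are hypothetical, and the computation does not reproduce the PDE estimate itself.

```python
import math

def softmax_first_weight(gap, eps):
    """Weight on action 1 when its Hamiltonian value exceeds that of action 2 by `gap`."""
    return 1.0 / (1.0 + math.exp(-gap / eps))

def sensitivity(eps, delta=1e-6):
    """Central finite-difference slope of the weight at a tie (gap = 0)."""
    hi = softmax_first_weight(delta, eps)
    lo = softmax_first_weight(-delta, eps)
    return (hi - lo) / (2.0 * delta)

for eps in (1.0, 0.1, 0.01):
    # The slope at a tie equals 1 / (4 * eps), so it grows without bound as eps -> 0.
    print(eps, sensitivity(eps), 1.0 / (4.0 * eps))
```

In this simplified setting the local Lipschitz constant of the action map scales like the inverse of the relaxation parameter, which is consistent with the qualitative behaviour discussed above.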
Since the Hölder norm of the function ) tends to infinity as
0, we first present a precise a priori estimate for the classical solutions to linear elliptic equations with
coefficients. The proof will be postponed to Appendix A, where we first reduce the equation to a constant coefficient equation involving only second-order terms, and then apply the classical Schauder estimate.
be a bounded domain in
boundary. For every
functions satisfying . Suppose that [
Λ
. Then for every
the Dirichlet problem
which applies to relaxed control problems with reward functions generated by
Theorem 5.4. Assume the setting of Theorem 5.2 and in addition that the function in (H.2) has a Lipschitz continuous gradient. Let
be the constant in Proposition 3.3 and ¯
. Then it holds for all
that, the classical solution
to the Dirichlet problem (5.2) satisfies the estimate
C is a constant independent of
Proof. Throughout this proof, let C be a generic constant depending possibly on independent of
. Proposition 3.3 shows that
1], which together with (3.6), the fact that
)) and the Lipschitz continuity of
implies that
1]. Consequently, we have for all
Now let us fix
O, we can apply Proposition 5.3 (with ) and conclude the desired estimate from the following inequality:
In this section, we analyze the convergence of the relaxed control problem (3.2) to the original control problem (2.3) as the relaxation parameter tends to zero. In particular, with the help of the HJB equations (2.6) and (3.5), we shall establish first-order monotone convergence of the value functions, and also uniform convergence of the feedback controls (in regions where a strict complementary condition is satisfied).
We first study the convergence of the value functions of the relaxed control problems. The following theorem shows that, as the relaxation parameter tends to zero, the value function (3.2) converges monotonically to the value function (2.3) in
) with first order.
Theorem 6.1. Suppose (H.1) and (H.2) hold. Let be the constant in Proposition 3.3, and
) be the solution to (2.6) (resp. (3.5) with parameter
). Then we have
. Moreover, it holds for any
converges to
, and satisfies the estimate:
be defined as in (2.6) and (3.5), and
0 be given constants. Lemma 3.2 shows that
. Hence, we have
where we write can deduce from the classical maximum principle (see e.g. [22, Theorem 3.7]) that inf
Similarly, for any given 0, we can obtain from Lemma 3.2(2) that
where we have ˜and the fact that ˜
, we deduce that
the classical maximum principle (see e.g. [22, Theorem 3.7]) and the fact that
us the estimate (6.1).
any given )), there exists a subsequence (
that (
converges in
) to some function ¯
). Since the entire sequence (
converges monotonically to
converges to u in
Remark 6.1. The estimate (6.1) depends on in a rather intuitive way. Note that, compared with the original control problem (2.3), the relaxed control problem (3.2) introduces additional randomness for exploration to achieve more robust decisions, especially at regions where two or more strategies lead to similar performances based on the given model (the points at which arg max in (2.8) is not a singleton). The relation (2.8) between feedback controls and the derivatives of value functions further suggests that such regions usually correspond to a sign change of derivatives of value functions.
The exploration surplus in the value functions clearly increases as increase (see Lemma 3.2(1) and Figure 1), since the same level of exploration will bring more rewards. It will also increase with diam(O) as the dynamics will stay in O longer. Furthermore, due to the lack of regularization from the Laplacian operator, a small volatility or a large drift-to-volatility ratio of the underlying model usually leads to a more rapidly changing value function, which increases the occurrence of the uncertain regions and makes the relaxation approach more beneficial.
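The first-order, monotone behaviour of the exploration surplus can already be observed pointwise on the smoothed maximum. The sketch below assumes the entropy-type smoothing in its standard log-sum-exp form, for which the surplus over max(x) lies between 0 and eps * log(K); the constants in (6.1) additionally depend on the model and are not reproduced here. The sample values in `x` are arbitrary.

```python
import math

def lse(x, eps):
    """Overflow-safe eps * log(sum_k exp(x_k / eps))."""
    m = max(x)
    return m + eps * math.log(sum(math.exp((xi - m) / eps) for xi in x))

x = [0.7, 0.65, -0.2]  # arbitrary Hamiltonian values for K = 3 actions
eps_grid = [1.0, 0.5, 0.1, 0.01]
gaps = [lse(x, eps) - max(x) for eps in eps_grid]

for eps, gap in zip(eps_grid, gaps):
    # The surplus over max(x) is nonnegative and at most eps * log(K).
    print(eps, gap, eps * math.log(len(x)))
```

The surplus decreases as the relaxation parameter decreases and is largest when two entries of `x` are close, matching the intuition that exploration is rewarded most where the model cannot clearly separate the actions.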
Now we turn to investigate the convergence of the feedback relaxed control (3.6). To distinguish different convergence behaviours related to reward functions generated by introduce the following concept for functions which only modify the pointwise maximum function locally near the kinks.
, we say a function
satisfies (
) with constant
it holds for all
It is clear that the pointwise maximum function on = 0, and the two-dimensional function
defined in (3.8) satisfies (
The following lemma shows that property (
) is preserved under function composition and scaling, which consequently implies that the recursively constructed K-dimensional
and its corresponding scaled function (
)) satisfy (
). The proof follows directly from Definition 6.1, and is included in Appendix A.
Lemma 6.2. (1) For each -dimensional pointwise maxi-
(2) If satisfies (
) with constant
, then for each
, the scaled function
satisfies (
) with constant
The following proposition presents several important convergence properties of the functions (. In the sequel, we shall denote by
, the unit vector from the k-th column of the identity matrix
, and by conv(S) the convex hull of a given set
Proposition 6.3. Suppose (H.2) holds. Let (be defined as in (3.3), (
) is a singleton}. Then it holds for all
and compact subset
(2) (converges uniformly to
. If we further suppose the function
) satisfies (
) with constant
, then there exists
(
Proof. We first establish Property (1) by considering the following function:
Note that Lemma 3.2(1) shows that the restriction of is continuous, which subsequently implies that
is a continuous function. Then we can deduce from [1, Theorem 17.31] that the set-valued mapping Ξ : (
is upper hemicontinuous, which along with the fact that Ξ(
) for all (
1] (see Lemma 3.2(2)) enables us to deduce lim
0)) = 0 for any given lim
lim
. Property (1) now follows from the fact that Ξ(
) (see e.g. [37, Theorem 2]).
Now we shall prove Property (2). We first define the set each
. It is clear that (
are disjoint open convex sets,
, and it holds for all
is differentiable at
Let be a compact set, then we have
us fix an arbitrary index
By using the fact that (
are disjoint open sets, we can deduce that
is also compact. Since (
are convex and differentiable on
lim
, we can deduce from the convexity of
25.7] that (
converges uniformly to
is a finite set, we have shown the desired uniform convergence on C.
Moreover, for each , the compactness of
implies that there exists
that
satisfies (
) with constant
then Lemma 6.2(2) shows that for all
0 satisfying
(and hence
. Hence, by setting
0 to be a constant satisfying
we can conclude for all
Now we are ready to present the convergence of the feedback relaxed control (3.6). Note that the Hölder continuity of the relaxed controls (3.6) and the possible discontinuity of the feedback control (2.8) suggest that the sequence (in general does not converge uniformly to
0. Thus we shall show that the relaxed controls converge in terms of the Hausdorff metric everywhere in O, and converge uniformly on compact subsets of the following region:
where ) is the solution to (2.6) (or equivalently the value function (2.3) if the function
; see Theorem 2.2), and (
are the elliptic operators defined as in (2.7). Note that
contains the points at which a strict complementary condition is satisfied, i.e., the optimal feedback control strategy of (2.3) is uniquely determined.
Theorem 6.4. Suppose (H.1) and (H.2) hold. Let (be the functions defined as in (3.6) for each
be the solution to (2.6), and
be the set defined as in (6.2). Then we have for all
Moreover, it holds for all compact subset converges uniformly to the function
) = arg max
. If we further suppose the function
) satisfies (
) with constant
, then there exists
such that it holds for all
Proof. For any given ) be the solution to (3.5). We first prove (6.3) by fixing an arbitrary point
. By using (3.6) and Proposition 6.3(1), we see it suffices to show lim
are defined as those in (2.6). Then the fact that (
converges to u uniformly in
) (see Theorem 6.1) and the continuity of coefficients enable us to conclude (6.3).
We now proceed to demonstrate the uniform convergence of (. Note that for all
, where the set-valued mapping
is defined as in Proposition 6.3. We further define for any given
where ) is the solution to (2.6), and (
are the elliptic operators defined as in (2.7). The continuity of the coefficients in (
)) implies that (
disjoint open sets satisfying
is a compact set for each
be a fixed index. Then the continuity of the coefficients in (
, the fact that
), and the compactness of
imply that, there exist constants
) such that we have for all
Now by using the fact that (converges to u uniformly in
), we can deduce that there exist
0 such that the same estimates hold for all (
. In other words, let U be the set defined as in Proposition 6.3, we can introduce the compact set
and conclude for all
6.3(2)) ensures that there exists 0, such that we have for all
. Hence, by using the fact that
, we have for all
which shows the uniform convergence of (and K is a finite set, we can conclude the desired uniform convergence on C.
Finally, if we further suppose H satisfies () with constant
0, Proposition 6.3(2) ensures that
for all small enough
0, which leads to the fact that
small enough
and finishes our proof.
Remark 6.2. One can identify the unit vector , as the Dirac measure supported on
, which shows that, as the relaxation parameter tends to zero, the agent of the relaxed control problem will emphasize more on exploitation, and the relaxed control distribution will collapse to a pure exploitation strategy for the classical control problem.
Note that Theorem 6.4 demonstrates an exact regularization feature of the reward function generated by
, which means that we can recover the original control strategy in the region
based on the feedback relaxed control without sending the relaxation parameter
0. The main intuition of the proof is that the region
can be mapped into a finite number of convex sets (i.e., the sets (
in the proof of Proposition 6.3). Hence, if a reward function only modifies the pointwise maximum function locally near the kinks, then one can employ the local compactness and local convexity structure of
and the finiteness of the action set A, and deduce the local exact regularization property in the region
The exact regularization feature of helps avoid the possible numerical instability for solving the relaxed control problem (3.2) with an extremely small relaxation parameter. In contrast, the feedback relaxed control
based on the entropy reward function
is always in (0
and the convergence rate to the original control strategy can be arbitrarily slow.
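The contrast between the two types of reward functions can be illustrated for two actions. The sketch below is hypothetical: `softmax_weight` assumes an entropy-type smoothing, while `local_weight` uses the derivative of an assumed piecewise quartic smoothing of max(t, 0) that is exact for |t| ≥ 1/2; neither is claimed to coincide with the paper's formulas.

```python
import math

def softmax_weight(gap, eps):
    """Weight on the better action under an entropy-regularized (softmax) policy."""
    return 1.0 / (1.0 + math.exp(-gap / eps))

def local_weight(gap, eps):
    """Weight on the better action from d/da [b + eps * p((a - b) / eps)], gap = a - b.

    p is a hypothetical C^2 smoothing of max(t, 0) that equals it for |t| >= 1/2,
    so its derivative is exactly 0 or 1 outside (-1/2, 1/2).
    """
    t = gap / eps
    if t <= -0.5:
        return 0.0
    if t >= 0.5:
        return 1.0
    return -2.0 * t**3 + 1.5 * t + 0.5

gap = 0.2  # margin by which action 1 beats action 2 at some state
for eps in (1.0, 0.3, 0.1):
    # Columns: eps, softmax weight, locally smoothed weight.
    print(eps, softmax_weight(gap, eps), local_weight(gap, eps))
```

With a margin of 0.2, the locally smoothed weight already equals 1 at eps = 0.3, whereas the softmax weight stays strictly below 1 for each of these values of eps.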
To the best of our knowledge, this is the first paper which constructs Lipschitz stable feedback control strategies for general multi-dimensional continuous-time stochastic control problems, and rigorously analyzes the performance of a pre-computed feedback control for a perturbed problem in a continuous setting. We also perform a novel first-order sensitivity analysis for the value function and feedback relaxed control with respect to perturbations in the model parameters, and quantify the explicit dependence of the Lipschitz stability of feedback controls on the exploration parameter. These stability results provide a theoretical justification for recent reinforcement learning heuristics that including an exploration reward in the optimization objective leads to more robust decision making.
A natural next step would be to extend the stability analysis to finite horizon stochastic control problems and mean-field control problems with continuous action spaces (see e.g. [23, 42]). The infinite cardinality of action spaces implies that the corresponding relaxed controls take values in an infinite-dimensional space of probability measures, which poses additional challenges for the analysis of the regularized control problems. For example, infinite-dimensional convex analysis on spaces of measures must be employed to analyze the regularity of the modified Hamiltonians and the well-posedness of the associated HJB equations. Moreover, one must endow the action space of relaxed controls with a suitable metric structure (such as the Wasserstein metric) in order to study the spatial regularity and Lipschitz stability of feedback relaxed controls.
Another interesting direction is to design efficient numerical algorithms for solving the regularized control problems in a continuous setting.
be the strong solution to (2.2) with control , and for all
is shown in [13, Lemma 3.1] that
for some constant
0, which implies that
with probability 1. Applying Itô’s formula to the function
(
, gives us that
where is the generator of the controlled dynamics
for all
]. The fact that u is a solution to (2.6) implies that for
Then, by rearranging the terms, using the fact that and taking the supremum over all
, we can deduce that
We proceed to show is a feedback control of (2.3) (cf. Definition 2.2). Let
a Borel measurable function satisfying (2.8), and ˜
be an extension of
˜
. We shall consider the functions
such that
. The measurability of
the continuity of
imply that
are Borel measurable. Then, for any given
by using the boundedness of functions
, Theorem 1], we can deduce that there exists
-progressively measurable continuous process (
, such that
Thus we can obtain from the definition of ˜. Moreover, [29, Theorem 2.2.4 on p. 54] implies that
, which shows that
is a feedback control of (2.3).
It remains to show is an optimal feedback control. If
, we can deduce from the definition that
= 0, which shows that
) is defined as in (2.10). Similarly, we have for all
that the first exit time of
from O is 0, i.e.,
= 0, which implies that v(x) = g(x). Hence, we can deduce from the fact that u satisfies the boundary condition of (2.6) that
For each be a progressively measurable continuous process satisfying the SDE (A.3), defined on the reference probability system
. The assumption that
ensures that ˜
obtain the equality in (A.2) for
from which, by using similar arguments as (A.1), we can obtain that
On the other hand, owing to the fact that ˜
, we have by the definition of v that
. Combining this with the fact that
conclude that
, which shows that
is an optimal feedback control and
Proof of Lemma 3.1. The definition of ∆) clearly imply that the function ˜b is well-defined and enjoys the desired estimates. Hence we shall focus on establishing the properties of the function ˜
It has been shown in [17, Theorem 7.14-3] that for any given , there exists a unique matrix
, and the mapping Φ :
is infinitely differentiable. Note that (2.4) and (2.5) in (H.1) ensure that there exists a constant
), such that it holds for all
We now define the function ˜all
. The facts that Φ is a smooth function and G is a compact subset of
imply that Φ is bounded and Lipschitz continuous on G. Therefore, we can conclude from (2.4), (2.5), (A.4) and the definition of ˜
that it holds for all
Proof of Lemma 3.2. We start by establishing Property (1). Since is a continuous convex function, the representation of
, Theorem 12.2] ensure that
is a closed convex proper function satisfying
The assumption that implies that for all
which together with the fact that
shows that 0] for all
. Finally, since
closed convex function satisfying
, we can deduce from [38, Theorem 10.2] (∆
is the standard simplex and hence locally simplicial) that the restriction of
a continuous function.
We now show Property (2). It is clear from (H.2) and (3.3) that for all
. Note that (A.5) and the fact that
imply that for all
which shows the function is the convex conjugate of
. Hence, we can further deduce from [38, Theorem 23.5], the differentiability and convexity of
Consequently, we can obtain from the fundamental theorem of calculus and the Cauchy-Schwarz inequality that is Lipschitz continuous with constant
Note that ∆
is the convex hull of
is the unit vector from the k-th column of the identity matrix
, Theorem 32.2] ensures that max
is attained at
, which implies that
1, and finishes the proof of Lemma 3.2.
Before establishing Proposition 3.3, we first present an a priori estimate for solutions of fully nonlinear equations involving only the second order term.
Lemma A.1. [16, Theorem 7.2 on p. 125] Let O be a bounded connected open subset of be a given function. Suppose the function F is differentiable and convex in its second component, and there exist constants
(
. Then there exists a constant
such that for any
if we have in addition that
, and there exist constants
it holds for all
, then the Dirichlet problem
admits a unique solution satisfying the estimate [
the constant C depends only on
Now we proceed to prove the a priori estimate for solutions to (3.5).
Proof of Proposition 3.3. Throughout this proof, we shall denote by C a generic constant, which may take a different value at each occurrence. Let ) be a given function, we consider the Dirichlet problem
where we define , and the function
such that for all
It follows from (H.2) that is differentiable and convex in r. Moreover, a straightforward computation shows for all (
), where we have
Note that for each , the fact that
)) imply that there exists a constant C, depending only on n, such that for all
which, along with the fact that (3.2(2)), shows that
, for some constant C depending only on n and the constant M defined in the statement of Proposition 3.3.
3.2(2)) imply that, if the function , then the function
satisfies for all
for some constant C depending only on n. Consequently, we can deduce from Lemma A.1 that, there exists a constant 1), such that for all
), the Dirichlet problem (A.6) admits a unique solution
), and satisfies [
, where the constant C depends only on
Now let )] be a solution to (3.5). Then it is clear that
a solution to the Dirichlet problem:
. We can then deduce from the above arguments that, there exists a constant C, depending only on
and O, such that [
. Hence by using the interpolation inequality (see [16, Theorem 1.2 on p. 18]), we have
from which, by using the classical maximum principle (see e.g. [22, Theorem 3.7]) and the fact that (see Lemma 3.2(2)), we can deduce that, there exists a constant
that
which together with the fact that leads to the desired estimate.
Proof of Proposition 5.3. The well-posedness of the classical solution follows from the standard elliptic regularity theory (see [22, Theorem 6.14]), hence it suffices to prove the a priori estimate for a fixed
Let 0 be a constant whose value will be specified later, and (
be a partition of unity in a domain containing O such that the following properties hold: (1) the support of each function
is contained in a ball
) satisfies for all
is the integer part of
is a constant independent of m and
; (3) for each
) = 1 and the number of intersected supports of (
at x is bounded by a constant
depending only on the dimension n. In the following, we shall denote by w the solution
a generic constant independent of
For each m = 1, . . . , M, we define the function , which satisfies
and
which together with the fact that
Then we can deduce from the interpolation inequality (see [16, Theorem 1.3 on p. 19]) and (A.7) that
Note that for all 0, we can obtain from property (2) of (
. Hence by repeatedly applying interpolation inequalities, we can simplify (A.8) into
which along with properties (2) and (3) of (leads to the estimate that
Finally, we can conclude from the classical maximum principle (see e.g. [22, Theorem 3.7]) that ), which finishes the proof of the desired a priori estimate.
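The localization used in this proof relies on a partition of unity with small supports and bounded overlap. A minimal one-dimensional Python sketch of such a construction is given below. It assumes a hypothetical domain [0, 1], a standard smooth bump profile, and centers spaced half a radius apart; the proof itself works in n dimensions with balls, and none of the names or values below come from the paper.

```python
import math

def bump(t):
    """Smooth bump function supported in the open interval (-1, 1)."""
    return math.exp(-1.0 / (1.0 - t * t)) if abs(t) < 1.0 else 0.0

def partition_of_unity(x, centers, delta):
    """Values phi_m(x) of the normalized bumps of radius delta at x."""
    raw = [bump((x - c) / delta) for c in centers]
    total = sum(raw)  # positive whenever x is within delta of some center
    return [r / total for r in raw]

# Illustrative setup (assumed): radius delta, centers spaced delta / 2 apart.
delta = 0.1
centers = [m * delta / 2 for m in range(21)]  # covers [0, 1]

# Sample the partition on a grid and record, for each point, the sum of the
# functions and the number of supports that contain the point.
grid = [i / 200 for i in range(201)]
sums = []
overlaps = []
for x in grid:
    phi = partition_of_unity(x, centers, delta)
    sums.append(sum(phi))
    overlaps.append(sum(1 for p in phi if p > 0.0))
```

In this sketch each function is supported in an interval of radius delta, the functions sum to one on [0, 1], and at most four supports meet at any point, independently of delta. Rescaling the bump by delta means that each derivative of order k grows like delta to the power -k, which is the role played by property (2) in the proof.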
Proof of Lemma 6.2. We first establish Property (1). For any given
Let satisfy for some
max(
). We assume without loss of generality that
. Then since
satisfies (
, we have that
Moreover, since
along with the assumption that
satisfies (
arguments show that the same conclusion holds if
+ 1, which enables us to conclude that
satisfies (
Now let be an arbitrary given point. We have by assumptions that
. Hence, by using the fact that
is componentwise increasing and subadditive on
which finishes the proof of Property (1). Property (2) follows directly from the definition of
[1] C. D. Aliprantis and K. C. Border, ed., Springer-Verlag, Berlin, 2006.
[2] D. Aldous, Weak convergence and the general theory of processes, manuscript, 1981. Available online at https://www.stat.berkeley.edu/~aldous/Papers/weak-gtp.pdf
[3] J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, and M. Eder, All adapted topologies are equal, Probab. Theory Relat. Fields, 178 (2020), pp. 1125–1172.
[4] J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, and J. Wiesel, Estimating processes in adapted Wasserstein distance, preprint, arXiv:2002.07261, 2020.
[5] G. Barles and E. Rouy, A strong comparison result for the Bellman equation arising in stochastic exit time control problems and its applications, Comm. Partial Differential Equations, 23 (1998), pp. 1945–2033.
[6] M. Basei, X. Guo, and A. Hu, Linear quadratic reinforcement learning: Sublinear regret in the episodic continuous-time framework, preprint, arXiv:2006.15316, 2020.
[7] E. Bayraktar, Y. Dolinsky, and J. Guo, Continuity of utility maximization under weak convergence, Math. Financ. Econ., 14 (2020), pp. 725–757.
[8] E. Bayraktar, L. Dolinskyi, and Y. Dolinsky, Extended weak convergence and utility maximisation with proportional transaction costs, Finance Stoch., 24 (2020), pp. 1013–1034.
[9] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[10] S. I. Birbil, S.-C. Fang, J. Frenk, and S. Zhang, Recursive approximate of the high dimensional MAX function, Oper. Res. Lett., 33 (2005), pp. 450–458.
[11] P. Blanchard, D. J. Higham, and N. J. Higham, Accurate computation of the log-sum-exp and softmax functions, preprint, arXiv:1909.03469, 2019. Accepted in IMA J. Numer. Anal., https://doi.org/10.1093/imanum/draa038.
[12] O. Bokanowski, S. Maroso, and H. Zidani, SIAM J. Numer. Anal., 47 (2009), pp. 3001–3026.
[13] R. Buckdahn and T. Y. Nie, Generalized Hamilton-Jacobi-Bellman equations with Dirichlet boundary condition and stochastic exit time optimal control problem, SIAM J. Control Optim., 54 (2016), pp. 602–631.
[14] S. Chaumont, Uniqueness to elliptic and parabolic Hamilton–Jacobi–Bellman equations with non-smooth boundary, C. R. Math. Acad. Sci. Paris, 339 (2004), pp. 555–560.
[15] C. Chen and O. L. Mangasarian, Smoothing methods for convex inequalities and linear complementarity problems, Math. Program., 71 (1995), pp. 51–69.
[16] Y.-Z. Chen and L.-C. Wu, Second Order Elliptic Equations and Elliptic Systems, Transl. Math. Monogr. 174, AMS, Providence, RI, 1998.
[17] P. Ciarlet, Linear and Nonlinear Functional Analysis with Applications, Appl. Math. 130, SIAM, Philadelphia, 2013.
[18] P. Drábek, , Comm. Math. Univ. Carolinae, 16 (1975), pp. 37–57.
[19] W. H. Fleming and H. M. Soner, Controlled Markov Processes and Viscosity Solutions, 2nd ed., Springer, New York, 2006.
[20] P. Forsyth and G. Labahn, Numerical methods for controlled Hamilton-Jacobi-Bellman PDEs in finance, J. Comput. Finance, 11 (2007/2008, Winter), pp. 1–43.
[21] M. Geist, B. Scherrer, and O. Pietquin, A theory of regularized Markov decision processes, preprint, arXiv:1901.11275, 2019.
[22] D. Gilbarg and N. Trudinger, Elliptic Partial Differential Equations of Second Order, 2nd edition, Springer-Verlag, Berlin, New York, 1985.
[23] H. Gu, X. Guo, X. Wei, and R. Xu, Dynamic programming principles for learning MFGs, preprint, arXiv:1911.07314, 2019.
[24] X. Guo, A. Hu, R. Xu, and J. Zhang, A general framework for learning mean-field games, preprint, arXiv:2003.06069, 2020.
[25] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, Reinforcement learning with deep energy-based policies, preprint, arXiv:1702.08165, 2017.
[26] K. Ito, C. Reisinger, and Y. Zhang, A neural network based policy iteration algorithm , preprint, arXiv:1906.02304, 2019. Accepted in Found. Comput. Math., https://doi.org/10.1007/s10208-020-09460-1.
[27] A. D. Kara and S. Yüksel, Robustness to incorrect system models in stochastic control, SIAM J. Control Optim., 58 (2020), pp. 1144–1182.
[28] B. W. Kort and D. P. Bertsekas, A new penalty function algorithm for constrained minimization, in Proceedings of the 1972 IEEE Conference on Decision and Control, New Orleans, Louisiana, 1972.
[29] N. V. Krylov, Controlled Diffusion Processes, Springer-Verlag, Berlin, 1980.
[30] H. J. Langen, Convergence of dynamic programming models, Math. Oper. Res., 6 (1981), pp. 493–512.
[31] H. Mania, S. Tu, and B. Recht, Certainty equivalence is efficient for linear quadratic control, in Advances in Neural Information Processing Systems, 2019, pp. 10154–10164.
[32] Y. S. Mishura and A. Y. Veretennikov, Existence and uniqueness theorems for solutions of McKean-Vlasov stochastic equations, preprint, arXiv:1603.02212, 2016.
[33] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, Bridging the gap between value and policy based reinforcement learning, preprint, arXiv:1702.08892, 2017.
[34] R. Nugari, , Comment. Math. Univ. Carolin. 34 (1993) pp. 89–95.
[35] J. M. Peng, A smoothing function and its applications, in Reformulation: Nonsmooth, Piecewise Smooth, Semismooth and Smoothing Methods, M. Fukushima and L. Qi, ed., Kluwer, Dordrecht, 1998, pp. 293–316.
[36] J. Peng and Z. Lin, A non-interior continuation method for generalized linear complementarity problems, Math. Program., 86 (1999), pp. 533–563.
[37] R. A. Poliquin and R. T. Rockafellar, Proto-derivative formulas for basic subgradient mappings in mathematical programming, Set-Valued Anal., 2 (1994), pp. 275–290.
[38] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[39] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[40] I. Smears and E. Süli, Discontinuous Galerkin finite element approximation of Hamilton-Jacobi-Bellman equations with Cordes coefficients, SIAM J. Numer. Anal., 52 (2014), pp. 993–1016.
[41] I. Smears and E. Süli, Discontinuous Galerkin finite element methods for time-dependent Hamilton-Jacobi-Bellman equations with Cordes coefficients, Numer. Math., (2015), pp. 1–36.
[42] H. Wang, T. Zariphopoulou, and X. Zhou, Exploration versus exploitation in reinforcement learning: a stochastic control approach, J. Mach. Learn. Res., 21 (2020), pp. 1–34.
[43] H. Wang and X. Zhou, Continuous-time mean-variance portfolio selection: A reinforcement learning framework, Math. Finance, 30 (2020), pp. 1273–1308.
[44] J. Yong and X. Zhou, Stochastic Controls: Hamiltonian Systems and HJB Equations, Springer, New York, 1999.
[45] I. Zang, A smoothing-out technique for min-max optimization, Math. Program., 19 (1980), pp. 61–77.
[46] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, Maximum entropy inverse reinforcement learning, in AAAI, vol. 8, Chicago, IL, 2008, pp. 1433–1438.