Convergence Analysis of Machine Learning Algorithms for the Numerical Solution of Mean Field Control and Games: I -- The Ergodic Case

2019·arXiv

Abstract

Abstract

We propose two algorithms for the solution of the optimal control of ergodic McKean-Vlasov dynamics. Both algorithms are based on approximations of the theoretical solutions by neural networks, the latter being characterized by their architecture and a set of parameters. This allows the use of modern machine learning tools, and efficient implementations of stochastic gradient descent.

The first algorithm is based on the idiosyncrasies of the ergodic optimal control problem. We provide a mathematical proof of the convergence of the approximation scheme, and we analyze rigorously the approximation by controlling the different sources of error. The second method is an adaptation of the deep Galerkin method to the system of partial differential equations issued from the optimality condition.

We demonstrate the efficiency of these algorithms on several numerical examples, some of them being chosen to show that our algorithms succeed where existing ones failed. We also argue that both methods can easily be applied to problems in dimensions larger than what can be found in the existing literature. Finally, we illustrate the fact that, although the first algorithm is specifically designed for mean field control problems, the second one is more general and can also be applied to the partial differential equation systems arising in the theory of mean field games.

Key words. Ergodic Mean Field Control, Ergodic Mean Field Game, Numerical Solution, Machine Learning, Rate of Convergence AMS subject classification. 65M12, 65M99, 93E20, 93E25

1. Introduction

The purpose of this paper is to develop numerical schemes for the solution of Mean Field Games (MFGs) and Mean Field Control (MFC) problems. The mathematical theory of these problems has attracted a lot of attention in the last decade (see e.g. [28, 14, 8, 16, 17]), and from the numerical standpoint several methods have been proposed, see e.g. [2, 1, 15, 10, 20] and [29, 4, 34, 6] for finite time horizon MFG and MFC respectively, and [5, 13, 11] for stationary MFG. However, despite recent progress, the numerical analysis of these problems is still lagging behind because of their complexity, in particular when the dimension is high. Here, we choose a periodic model to demonstrate that powerful tools developed for machine learning applications can be harnessed to produce efficient numerical schemes performing better than existing technology in the solution of these problems. We derive systematically the mathematical formulation of the optimization problem amenable to the numerical analysis, and we prove rigorously the convergence of a numerical scheme based on feed-forward neural network architectures. Our first method is designed for the optimal control of McKean-Vlasov dynamics, which is the primary purpose of the present work. Besides the intrinsic motivations for this type of problems, a large class of MFGs has a variational structure and can be recast in this form, see e.g. [28, 7]. Furthermore, the second method we present tackles the PDE system characterizing optimality conditions satisfied by the solution, and it can be directly adapted to solve the PDE system arising in MFGs as we shall explain.

In the subsequent analysis of finite horizon mean field control problems, see [18], the thrust of the study will be the numerical solution of Forward-Backward Stochastic Differential Equations (FBSDEs) of the McKean-Vlasov type, which generalizes to the mean-field setting techniques introduced in [24] for standard BSDEs. Indeed, the well established probabilistic approach to MFGs and MFC posits that the search for Nash equilibria for MFGs, as well as the search for optimal controls for MFC problems, can be reduced to the solutions of FBSDEs of this type. See for example the books [16, 17] for a comprehensive expos´e of this approach. Here, we concentrate on the ergodic problem for which we can provide a direct analytic approach. Our mathematical analysis of the model leads to an infinite dimensional optimization problem for which we can identify and implement numerical schemes capable of providing stable numerical solutions. We prove the theoretical convergence of these approximation schemes and we demonstrate the efficiency of their implementations by comparing their outputs to solutions of benchmark models obtained either by analytical formulas or by a deterministic method for Partial Differential Equations (PDEs). Note that when a game is potential, the search for Nash equilibria can be reformulated as an optimal control problem for a central planner, and because of the McKean-Vlasov nature of this control problem, it can be reformulated as a deterministic control problem for which the state is a probability distribution. This control problem can be tackled by dynamic programing, leading to a Hamilton-Jacobi equation. This is the approach taken in [21]. See also [35] for a recent contribution.

The reasons for our choice of the ergodic case as a prime testbed for our numerical schemes are twofold. First, the absence of the time variable lowers the complexity of the problem and gives us the opportunity to avoid the use of FBSDEs and to rely on strong approximation results from the theory of feed-forward neural networks which we use in this paper. Second, the objective function can be expressed as an integral over the state space with respect to the invariant measure of the controlled system, leading to a much simpler deterministic optimization problem. Indeed, after proving that the state dynamics at the optimum are given by a gradient diffusion, we postulate the form of the invariant measure and optimize accordingly. Last, the choice of the ergodic case for the first model which we consider is motivated by a forthcoming work on reinforcement learning [19].

As a final remark we emphasize that, while all the results and numerical implementations concern Markovian controls and equilibria, in most cases, both theoretical and numerical results still hold for controls and strategies being feedback functions of the history of the path of the state of the system. The theoretical extensions are straightforward, and the numerical implementations rely on the so-called recurrent neural networks instead of the standard feed-forward networks. We refrain from discussing these extensions to avoid the extra technicalities, especially in the notations and the statements of the results.

The paper is structured as follows. In Section 2, we present the framework of ergodic mean field control, derive formally necessary optimality conditions and introduce a setting amenable to numerical computations. To wit, we provide the mathematics that lead to natural introductions of the algorithms and their analyses. The first algorithm is presented in Section 3. We study rigorously its convergence and its accuracy by proving bounds on the approximation and estimation errors (see Theorems 6 and 11 respectively). Section 4 is dedicated to the second of our algorithms. It is based on a variation over the deep Galerkin solver for the PDE system stemming from the aforementioned optimality conditions. Computational results are presented in Section 5. They demonstrate the applicability and the performance of our algorithms. Several test cases are considered. They were chosen for comparison purposes because they can be solved either explicitly via analytical formulas, or numerically by classical PDE system solvers like in the case of ergodic mean field games.

Acknowledgements: Both authors were partially supported by ARO grant AWD1005491 and NSF award AWD1005433. Also, we would like to thank two anonymous referees for a rigorous review of the original version of the paper. Their insightful comments helped us improve significantly the quality of the paper.

2. Ergodic Mean Field Control

Since we are not aiming at the greatest possible generality, for the sake of definiteness, we work with a standard infinite horizon drift-controlled Itˆo process:

where -dimensional Wiener process and where we use the notation the law of the random variable . For notational convenience only, we take the volatility coefficient to be 1. We limit ourselves to stationary controls of the form deterministic measurable and time-independent feedback functions taking values in a closed convex subset A of a Euclidean space . We shall assume that the function b is measurable and bounded on . The space is the space of probability measures on having a finite second moment. We assume that it is equipped with the 2-Wasserstein distance and the corresponding Borel -field. Since we consider controls in feedback form, the above controlled state evolution is in fact a stochastic differential equation, but according to Veretennikov’s classical result, this equation:

(1) dX

has a unique strong solution. See for example [39, 40] and [33, Theorem 2]. We say that the feedback control function is admissible if it is continuous and if the solution is ergodic in the sense that it has a unique invariant measure which we denote by converges toward . The ergodic theory of McKean-Vlasov stochastic differential equations has recently received a lot of attention. See for example [12, 40, 41] for some specific ergodicity sufficient conditions.

2.1. Ergodic Mean Field Costs. The goal of the ergodic control problem is to minimize the cost:

For the sake of definiteness, we assume that the running cost function is continuous and bounded. The cost can be rewritten in the form:

if we use the standard notation for the integral of the function with respect to the measure is admissible, the invariant measure appears as the limit as uniformly continuous in the measure argument, uniformly with respect to the other two arguments, then we can take the limit in the formula giving the ergodic cost (2) and obtain:

(3) J

if, for each probability measure , and each time-independent A-valued feedback function on , we use the notation:

The controlled process solving (1) being ergodic, we can characterize the unique invariant probability measure as the solutions of the non-linear Poisson equation:

So the goal of our mean field control problem is to minimize, over the admissible feedback functions the quantity:

with F defined above in (4), and solving the Poisson equation (5).

2.2. The Adjoint Equation and Optimality Conditions. In order to characterize the minima

of the functional J, we compute its Gateaux derivative. To do so, we assume that the function b : is continuously differentiable in the variables and has a continuous functional (i.e. linear) differential in the variable is viewed as an element of the (linear) space of finite signed measures on . We denote this derivative by . We stress that it is different from the Wasserstein derivative or L-derivative in the sense of Lions.

Let be fixed and let provide a small perturbation of . We first compute, at least formally, the derivative of the probability in the direction

(7)

when we view probability measures as elements of the space of finite (signed) measures on Notice that since1, we must have0. The Poisson equation (5) implies:

and from this equality, we find that if the directional derivative (7) exists, it must solve the following

Partial Differential Equation (PDE):

just by dividing both sides by and taking the limit 0. Note that the quantity

is merely the integral of the function with respect to the measure

Before we turn to the objective function J, we introduce the notion of adjoint equation and adjoint function.

Definition 1. For each admissible feedback function (and associated solution of the Poisson equation), we say that the couple is a function on the state space and is a constant, is a couple of adjoint variables if they satisfy the following linear elliptic PDE:

(9) λ

which we call the adjoint equation. Any solution will be denoted by

Recall that here, the derivative is the standard linear functional derivative (of smooth functions on the vector space ), which is a function of x.

Proposition 2. The directional derivative of the cost function J defined in (6) is given by the formula:

where denotes the functional derivative with respect to in the direction , and the Hamiltonian H is defined by:

Proof. Using the definitions (6) and (4) we get:

so that, using Fubini’s theorem we have:

Now, using the adjoint equation (9) and the fact that the integral of is 0, we get:

Finally, using (8) we get:

To complete the proof, we express this directional derivative in terms of the Hamiltonian function defined in (11). The latter can be rewritten as:

and its directional derivative is given by:

Putting together (12) and (13) we get the desired result.

So, at least informally, solving the ergodic McKean-Vlasov control problem reduces to the solution of the system:

Note that the third equation guarantees the criticality of the function if we define the minimized Hamiltonian

the above system can be written as:

Both systems should be completed with appropriate boundary conditions when needed (like for example in the next subsection where we use periodic boundary conditions to analyze the system on the torus) and the following condition:

to guarantee uniqueness for p. Indeed, the above equations (16) can only determine p up to an additive constant.

2.3. A Class of Models Amenable to Numerical Computations. In general, computing the

invariant distribution solving (5) for a given can be costly. For example, it can be estimated by solving the PDE or by using Monte Carlo simulations for the MKV dynamics (1), see e.g. [9]. To simplify the presentation and focus on the main ingredients of the method proposed here, we shall consider a setting in which the optimal invariant distribution as well as the optimal control can both be expressed directly in terms of an adjoint variable.

for a constant and functions ˜satisfying the following assumption.

is of class are Lipschitz in both variables.

Remark 3. The dependence on the measure of the functions f identified in (17) is linear. However, the derivations and the results in the remainder of the paper extend easily to more general classes of dependence. For example, similar proofs can be applied to functions f of the form

In this case, the cost (21) will be a triple integral of the same form which can be analyzed in the same way, even if the numerical computations may be slower.

So it should be clear that many more functions f can be handled as long as the functional derivative can be computed and the cost (21) can be expressed as the multiple integral with respect to the Lebesgue measure over the product of tori, of a function ˜F of the variables of the tori, and the function evaluated on the different tori.

Remark 4. Models with local interactions have frequently been considered in the existing literature, and some of the numerical examples presented in Section 5 do involve local interactions. In these models, the function f is of the form represents the value of the density of the measure at the point x. In most cases, these models are more difficult to study analytically. However in the present situation, they represent a simplification when compared to the models with non local interactions studied here. Indeed, because of the special form (19) of the measures we work with, the fact that f depends only upon the density of at the point x streamlines the formula (21) giving the cost ˜Jphq which is now expressed as a single integral over the variable x only.

Let be a solution to the optimality system (14). In this setting, the third equation of (14) gives . Substituting in the expression for the drift b and the feedback function the state equation (1), we see that at the optimum, we are dealing with a gradient diffusion:

for the function

Accordingly, the invariant measure is necessarily of the form:

Notice that in this case, the optimal control is given by

Hence minimizing the ergodic cost (3) over controls under the constraint coming from the Poisson equation (5) can be rephrased as the problem of minimizing the functional:

over functions h, where:

(22) rFpx, y, q, X, Y, Z

Notice that for all non-zero for every constant for any constant c P R. This is not an issue to guarantee that the optimal value of rJ is close to the optimal value of the original cost J, but it can lead to numerical difficulties. For this reason it is possible to add a term of the form in (21) in order to enforce a normalization condition and hence, uniqueness of the minimizer. The analysis presented below could be adapted to take into account this extra term at the expense of more cumbersome notations, so for the sake of clarity we only consider (21).

From this point on, instead of attempting to solve the system (14) numerically, we search for the right function h in a family of functions parameterized by the parameter Θ. The desired parameter minimizes the functional:

where:

One should think of the function as a computable approximation of us to replace the minimization of the ergodic cost (2) by the minimization:

Notice that the gradient (with respect to the parameter can easily be computed. It reads:

In anticipation of the set-up of next section where we consider our optimization problem on the torus, the double integral appearing in (23) can be viewed as an expectation, and its minimization is screaming for the use of the Robbins-Monro procedure. Moreover, if we use the family by a feed-forward neural network, this minimization can be implemented efficiently with the powerful tools based on the so-called Stochastic Gradient Descent (SGD) developed for the purpose of machine learning.

3. A First Machine Learning Algorithm

We now restrict ourselves to the case of the torus for the purpose of numerical computations. The admissible feedback functions being continuous, the drift a bounded continuous function on the torus and the controlled state process is a Markov process with infinitesimal generator:

The compactness of the state space and the uniform ellipticity of this generator guarantee that this state process is ergodic and that its invariant probability measure density with respect to the Riemannian measure on (which we assumed to be normalized to have total mass 1). Note that, because the torus does not have a boundary, the integration by parts which we used freely in the computations of the above subsection are fully justified in the present situation.

We introduce new notation to define the class of functions which we use for numerical approximation purposes. We denote by:

the set of layer functions with input dimension , output dimension , and activation function denotes the composition of functions. Building on this notation we define:

the set of regression neural networks with hidden layers and one output layer, the activation function of the output layer being the identity . Note that as a general rule, we shall not use the superscript in that case. The number of hidden layers, the numbers of units per layer, and the activation functions (one single function in the present situation), are what is usually called the architecture of the network. Once it is fixed, the actual network function determined by the remaining parameters:

defining the functions respectively. Their set is denoted by Θ. For each the function computed by the network will be denoted by . As it should be clear from the discussion of the previous section, here, we are interested in the case where

Our analysis is based on the following algorithm. In practice, instead of having a fixed number of iterations M, one can use a criterion of the form: at iteration is small enough, stop; otherwise continue.

3.1. The Approximation Estimates. Our goal is now to analyze the error made by the numerical procedure described in Algorithm 1. We split the error into two parts: the approximation error and the estimation error (or generalization error). The approximation error quantifies the error made by shrinking the class of admissible controls (here we use neural networks of a certain architecture instead of all possible feedback controls). The estimation error quantifies the error made by replacing the integrals by averages over a finite number of Monte Carlo samples.

In this section, denotes a 2periodic activation function of class whose Fourier expansion contains 1, i.e.,

More general activation functions (such as the hyperbolic tangent) could probably be considered at the expense of additional technicalities. The choice of this class of activation functions is motivated by the fact that we will use it to build a neural network which can approximate a periodic function together with its first order derivatives (namely,

Approximation Error. The proof of our first estimate is based on the following special case of [32, Theorems 2.3 and 6.1].We state it for the sake of completeness. It provides a neural network approximation for a function and its derivative. For positive integers n and m, and a function let denote the trigonometric degree of approximation of g defined by:

where the infimum is over trigonometric polynomials of degree at most n in each of its m variables.

Theorem 5 (Theorems 2.3 and 6.1 in [32])be of class be positive integers. Then there exist

where c depends on the activation function through ˆbut does not depend on

The workhorse of our control of the approximation error is the following.

Theorem 6. Assume that for some integer , there exists a minimizer over the cost function rJ defined in (21). Assume that . Then, for large enough we have:

The constants in the above big Bachmann - Landau term depend only on the data of the problem as well as ˆnorms of the partial derivatives of of order up to K ` 1 (but they do not depend upon

Remark 7. The exponent in the O term in the statement of the proposition is what is blamed for the so-called curse of dimensionality. In some settings, the constants in the O term can be estimated if bounds on the norms of the partial derivatives of of order up to K ` 1 are known, for instance from a priori estimates on the solution of the PDE system (16).

Proof of Theorem 6. The first inequality holds by definition of and because order to apply the result of Theorem 5 to , we bound from above the right hand sides of (25) and (26). We use the fact that the trigonometric degree of approximation of a function of class of order when using polynomials of degree at most n. More precisely, by [36, Theorem 4.3], if -times continuously differentiable function which is 2-periodic in each variable, then for every positive integer n, there exists a trigonometric polynomial of degree at most n such that

where C depends only on r and on the bounds on the r-th derivatives of f in each direction: . We apply this result, with some integer n to be specified later, to with is of class . By [36, Theorem 4.3] again, since are both of class , we obtain that for any integer N there exist trigonometric polynomials of degree at most

where the constant depends on but not on . This implies in particular (since

Notice that, since , the number of units in the hidden layer is , hence the right hand side in (27) is of order . In other words,

where the constant in the big O might depend on but is independent of Going back to the definition (21) of rJ, we note that, by (27), for all lie in the interval is Lipschitz continuous on this interval with a Lipschitz constant depending only on , we obtain that:

where here and thereafter c denotes a generic constant which depends on the data of the problem as well as K, and bounds on the norms of the partial derivatives of up to order K ` 1, and whose exact value might change from line to line. Moreover,lie in the intervaland is Lipschitz continuous on this interval with a Lipschitz constant depending only on so we also have:

Hence, recalling the definition (22) of rF, one can check after some calculations that for all

which completes the proof.

and if there exists a classical solution to the optimality system (16), then we have:

Proof. Indeed, if ˜and if we have existence of a classical solution to the optimality system (16), in particular if bounded by constants depending only on the data of the problem, we obtain that given by (18) provides a minimizer of . We can then apply Theorem 6 with

Remark 9. For mean field games, existence of classical solutions to the ergodic PDE system has been studied in several settings, see e.g. [27, 28]. To the best of our knowledge, corresponding results do not exist yet for the PDE system arising in the ergodic optimal control of MKV dynamics and this question will be addressed in a future work. In finite time horizon, existence of classical solutions has been studied e.g. in [3].

Estimation Error. We then turn our attention to the estimation (or generalization) error. Let be a fixed positive integer. In the numerical implementation, we do not minimize directly rJ over a set of neural networks with say units. Instead, we minimize over empirical versions computed from Monte Carlo samples. To be specific, for a given sample:

for which the are picked independently and uniformly in , we minimize:

where rF is defined by (22). The intuition is to approximate the double integral over dxdy by an average over L independent Monte Carlo samples , and likewise, the integralby an empirical average over a sample of points uniformly distributed over

Remark 10. The reader may be concerned by the slow rate of convergence of the Monte Carlo approximation. Indeed, this could be an issue as the convergence is only of the order there are numerical approximations of the integrals which converge faster, and for which proven rates of convergence involve the number L of sample points as well as the dimension d of the torus. Many Quasi Monte Carlo methods based on low discrepancy sequences (as opposed to independent identically distributed uniform samples as in the case of plain Monte Carlo computations) provide such improved rates of convergence. Depending upon the choice of the quasi Monte Carlo method, we can guarantee rates of convergence of the order . See for example the excellent survey [23] for details and comparisons between various methods to compute integrals with respect to the Lebesgue measure over high dimensional cubes. While the rate of convergence of the classical Monte Carlo method is independent of the dimension, the advantage of these new rates of convergence is that not only do they guarantee faster convergence, but they also explicitly quantify how higher dimensions can slow down the convergence. We chose to use plain Monte Carlo approximation procedures for the sake of simplicity.

We shall use the following notation. For two positive constants , we denote by set of functions for which there exist satisfying

and such that , where the superscript J denotes the transpose operation. Here,

We are now in a position to prove the following bound on the uniform deviation between rJ and its empirical counterpart pJ. It is our main insight into the estimation error.

be positive constants. We have:

where the expectation is over the samples S as in (30), and the constants in the big Bachmann - Landau term depend only on the data of the problem and on , but neither on L nor on Q.

Proof. First, introducing ghost Monte Carlo samples ˜picked with the same distribution as S, and independent of the latter, we can rewrite (21) as:

where the expectation is over the samples ˜S. Note that the variables ˜do not appear in this expression. We kept them in the ghost sample for the sake of symmetry. Hence, for each fixed

Taking expectation over S we get:

and we analyze separately the contributions of the two double expectations to the value of the above right hand side. By definition of , there exists a constant 0 such that for every and every sample

and given the assumptions on ˜f and the definitions of , one can find a constant such that:

for all . Notice that the constants depend upon , but not on the particular . Using this Lipschitz bound, we get:

To bound from above the right hand side, we follow a pretty standard strategy. First we notice that:

Moreover, we can introduce a family of independent Rademacher random variables (i.e. satisfying 2), independent of the samples . Since the samples

are independent and identically distributed, we have

where we used Khintchine inequality to derive the last inequality.

We now turn our attention to the estimation of the term piiq in (32). Because of the introduction of the ghost samples, for each , the two terms we compute the difference of are independent and identically distributed, so we can rewrite piiq using a family of independent Rademacher random variables independent of the samples in the following way:

where we used the fact that are i.i.d. For each fixed and samples

Since rF p0, 0, 0, 0, 0, 1q is a constant independent of , we denote it momentarily by ease the notation. Moreover, for each fixed sample

where the variable s is introduced to replace the absolute value. For each we define the function

By definition of , it is clear that the range of the map is contained in a hypercube of the form:

where the constant was introduced earlier. Given the assumptions on ˜f and the definitions of , one can construct a real valued function Φ on which is Lipschitz continuous over the whole space and satisfies:

for , and whose Lipschitz constant, say K, depends upon and , but not on the particular . Next, we introduce the set of functions:

We can then rewrite the right hand side of (35) as:

By the form of Talagrand’s contraction lemma given in Corollary 4 of [31], this quantity is bounded from above by:

where, for denotes the k-th component of the vector where the family of random variables ˜is an independent Rademacher family with one extra index. Accordingly, this quantity is bounded from above by:

and we proceed to estimate the 3d ` 4 terms of the outer sum one by one. Notice that for the term does not depend upon z, and we proceed in the following way. The terms corresponding to ) are easy to control since (resp. ) do not depend upon , rendering the supremum irrelevant. Moreover, since the norms of are bounded by , Khintchine inequality gives:

For k , . . . , d

because of the definition of . Consequently, since the above quantity does not depend upon s, the supremum over can be taken over and we have:

where we used (36). Since the derivative of the activation function is Lipschitz (without any loss of generality we use the same constant 0 for its Lipschitz constant), we can use the original version of Talagrand’s contraction lemma to estimate the above right hand side. From [30, Theorem 4.12] with

where the value of the constant 0 changed from line to line, and where we used Cauchy-Schwarz and Khintchine inequalities.

We proceed similarly for the values 2 since the exponential function is Lipschitz on the range of the functions

We now focus on the penultimate term. For 3, the term does not depend upon s so the supremum over can be taken over and we have:

because of Khintchine inequality. Finally, the term corresponding to 4 can easily be bounded in the same way since

This concludes the analysis of piiq, proving that it is bounded from above by a constant times 1{?L. Combining this with (33) and (32), the proof is complete.

4. Application of the Deep Galerkin Method

An alternative way to solve the ergodic mean field control problem is to tackle directly the PDE system (16). In order to do so, we adapt the Deep Galerkin Method (DGM) proposed by Sirignano and Spiliopoulos [37] for a single PDE. The key idea is to rewrite the PDE system as a new minimization problem where the control is the triple and the loss function is the sum of the PDE residuals (plus some terms taking into account the boundary conditions and the normalization conditions). In our setting, this idea can be implemented as follows. To alleviate the notations, we introduce the sets and use the shorthand notation and we set

(37) L

where

and

Each function encodes one of the two PDEs of the optimality system (16) and contains one term for the residual of the PDE, one term for the periodicity condition, and one term for the normalization condition. These terms can be weighted to adjust their relative importance. In any case, note that solves the PDE system (16). Since our primary motivation is the optimal control of MKV dynamics, we present the method in this setting. However the same ideas can be readily applied to other PDE systems by designing differently the loss function. For instance, to solve the PDE system arising in the corresponding stationary MFG, one simply needs to remove the term in (39). For the sake of illustration, we present several examples below in the next section.

We then look for in the form of neural networks, say with fixed architectures and parameterized by respectively. The unknown is replaced by a variable coefficient which is learnt along the way. As in the method discussed in the previous sections, the integrals are interpreted as expectations with respect to a random variable with uniform distribution over one uses SGD to minimize the total loss function. More precisely, for a given is a finite set of points and is a finite set of points in the empirical loss function as follows: for

where

and

One can use SGD to minimize the loss function (40). The approximation power of this method has been discussed in [37] using a universal approximation theorem. However, this type of results does not give any rate of convergence. More precise convergence results could be obtained by the techniques presented in Section 3. In particular, the approximation error can be bounded by combining again Theorems 2.3 and 6.1 in [32]. For instance, if the PDE system (16) has a solution , then there exist neural networks are in . In turn, this property leads to bounds on both the loss function of the algorithm and the error on the value function of the control problem. The detailed analysis is left for future work.

An important advantage of the DGM method is its flexibility and its generality since it can, a priori, be applied to almost any PDE. However, this generality can also be a drawback. Indeed the method does not exploit the structure of the PDE or in our case, of the PDE system under consideration. In generalizing this method to the case of our system, our main challenge was the choice of the relative weights to be assigned to the various terms in the loss function. These coefficients can be used to give more or less importance to some aspects of the solution. For instance, a large weight for the penalization terms ensures that the boundary and normalization conditions are likely to be satisfied at convergence. Also, if the gradient of one of the terms is too small, putting a larger weight can help the gradient descent to make faster progress. But the weights are hyperparameters that need to be tuned in accordance with other hyperparameters. Indeed, if they are not chosen appropriately, the stochastic gradient descent can easily be stuck in local minima. For instance if the weight of the normalization condition is not sufficiently large, the algorithm goes quickly towards a configuration which completely ignores this constraint and stays stuck there (although it should reduce the loss to try to satisfy this constraint). On the other hand, if the weight on the normalization condition is too large, it obfuscates the role of the other terms. Similarly, giving too much weight to one of the two PDEs prevents the neural networks from solving accurately the system. We had to find a good balance between the weights empirically. Furthermore, we found that it is sometimes helpful to adjust them dynamically during training to guide the neural network towards a satisfactory approximation of the solution. Although we are convinced that it can improve the results, we refrained from using this technique in the numerical examples presented below because of its ad hoc nature. In most machine learning applications, characterizing optimal combinations of hyperparameters is a very challenging task. In our case, finding optimal choices of weights and learning rates is an interesting question beyond the scope of the present work. Even without an optimal choice of parameters, we found that from a numerical standpoint, a good choice can be made by monitoring convergence to zero of each term of the loss function (the residuals and the penalty terms).

5. Numerical Results

In this section we present numerical results obtained using implementations of the methods described in the previous sections. Algorithm 1 refers to the method based on minimization of the cost functional

Figure 1. Test case 1. Solution computed by Algorithm 1 and benchmark solution from deterministic method (dashed red line).

introduced in Section 3. Algorithm 2 refers to the DGM method described in Section 4. The implementations have been done in TensorFlow. The details of the neural network architectures are specified below. We used Adam for the gradient-based optimization. The results are presented on the unit torus (i.e., with periodic boundary conditions) instead of as in the text.

5.1. Examples in Dimension 1. For ease of visualization, we start with univariate examples. We first consider models without explicit solutions, and we compare the solutions computed by the two algorithms introduced earlier with a benchmark solution computed by a deterministic method based on a finite difference scheme for the PDE system [2]. For these test cases, we used feed-forward neural networks with 3 hidden layers, each layer having 20 units.

Test case 1: For the sake of illustration we include a model without mean field interaction, say

with

(41) ˜

which has two local minima, one of them being a global minimum. The solution computed with the first algorithm is presented in Figure 1.

Test case 2: Next, we add a mean field interaction term in the cost

with ˜f given by (41). Here stands for the density of the measure . The results are presented in Figure 2. Comparing with the first test case, one sees that due to the mean field term in the cost function, the distribution is less concentrated around the global minimum and part of the mass is transferred to the second local minimum.

Figure 2. Test case 2. Solution computed by Algorithm 1 and benchmark solution from deterministic method (dashed red line).

Test case 3: The DGM method can be used to solve the previous examples, but can also be used to solve other PDE systems, such as the one arising from mean field games. In the MFG setting, the PDE system for the optimality condition takes the following form [27]:

with normalization and boundary conditions, and where the minimized Hamiltonian in (15). Notice that this PDE system (43) is different from the PDE system (16) for mean field control.

Taking b and f as in the previous test, namely (42), yields the solution displayed in Figure 3 (to be compared with the corresponding curves for the MFC model of Test case 2, see Figure 2). Recall that with the DGM method, both are approximated using two separate neural networks. In particular, we see on Figure 3 that after 20000 iterations of SGD, the neural network for p has already roughly learnt the shape of the optimum, whereas the neural network for is still almost flat.

5.2. Multivariate Examples with Explicit Solution. To assess the quality of the proposed algorithms in higher dimension, we introduce simple toy models which can be solved explicitly. These models are very much in the spirit of examples considered in [5]. Let us take:

So 1. The PDE system (16) rewrites:

Figure 3. Test case 3. Solution computed by Algorithm 2 (DGM) and benchmark solution from deterministic method (dashed red line).

Assuming the existence of a smooth enough solution , the first equation allows us to express in terms of p as follows:

The second equation in (45) rewrites

In this case, the PDE system (45) is solved provided the above equation and the second equation in (45) are satisfied, which means that

We consider two specific instances of ˜f for which we are able to obtain closed-form expressions for p.

Test case 4: We consider (44) again but now with ˜f given by

then the solution is given by

of the approximation learnt by our first algorithm towards the analytical solution p is presented in Figure 4. This figure shows the relative -error, which is defined as

Figure 4. Test case 4. Relative -error by Algorithm 1.

In the implementation, this quantity is estimated with 10Monte Carlo samples for each integral. The figure corresponds to one run of SGD and illustrates the fact that the algorithm can be stuck in a local minimum for a certain number of iterations (between roughly iterations 10example) before finding its way out to a better solution. The distribution is deduced from p using the formula (46), which explains why the two convergence curves have the same shape.

For the DGM method, numerical convergence is presented in Figure 5. As in the previous test case, the convergence rate of is quite different because they are approximated by distinct neural networks. In particular, the error on decreases at a lower rate and suffers from a larger noise, which could be due to the form of the solution, see (46). Moreover, since the logarithm of appears in the HJB equation, we were forced to choose a positive-valued function for the activation function of the output layer (instead of the identity). We compared results obtained with the exponential function and the softplus function: . Since they gave similar results, we provide here the ones obtained with the exponential function.

Here, for both methods, we used feed-forward neural networks with 3 hidden layers, each layer having 20 units.

Test case 5: We consider a variant of the previous test case where ˜f is chosen such that

We use this example to study the influence of the number of hidden units and the number of samples in the population on the approximation found by the algorithm. For simplicity and to be consistent with the theoretical bounds provided in the prequel, we consider here neural networks with a single hidden layer. Figure 6 illustrates the dependence on the number of hidden units. As seen on Figure 6B, the error decreases quickly as the number of units grows until 30. However, for a number of units larger than 30, the error almost stagnates. This is due to the fact that the number of samples in the population drawn at each iteration of SGD is kept fixed to 10. The dependence on this number of

Figure 5. Test case 4. Relative error by Algorithm 2 (DGM).

Figure 6. Test case 5. Dependence of the relative error on the number of hidden units (Algorithm 1). The number of samples in the population is 10. The number of units is indicated by “nU” for the figure on the left. On the right, relative error after 10iterations.

samples is illustrated in Figure 7, while keeping the number of units fixed to 60. These numerical results were obtained by averaging over 10 runs of SGD.

Test case 6: We consider the following variant of the PDE system (45), which corresponds to a MFG with the same cost function and dynamics as the MFC problem described above:

Figure 7. Test case 5. Dependence of the relative error on the number of samples in a population (Algorithm 1). The number of hidden units it 60 for all curves. The number of samples is indicated by “nS” for the figure on the left. On the right, we plot the relative error after 10iterations.

This system does not have a variational interpretation so we approach it using Algorithm 2. Although neural networks with simple structures are universal approximators [22, 26], in practice using deep neural networks with a well-suited architecture is crucial to the success of many deep learning applications. Here, we tested two different architectures. The first one is a simple feedforward fully connected architecture, as presented above. The second one is a recurrent neural network architecture, and more specifically the one – referred to as DGM architecture in the sequel – proposed by Sirignano and Spiliopoulos in [37] which is inspired by long short-term memory networks [25] and highway networks [38]. It has been shown empirically to be particularly well adapted to learn how to approximate functions with complex structures as well as derivatives of functions. For the sake of brevity we do not reproduce here the architecture and we refer to [37] for more details about this type of neural networks and its success on some classes of PDEs. In high dimension, we did not obtain good results using the fully connected architecture so we present only numerical results obtained using the second type of architecture.

we used a three layer neural network with 100 units per layer for p and three layer neural network with 200 units per layer for . As for the learning rate, instead of plain Adam optimizer as in the previous case, we used Adam with a piecewise-constant schedule of learning rate decay defined by

used a minibatch of 4096 samples at each iteration. The curves in Figure 8 have been obtained with a single run of the algorithm, by saving the values every 100 iterations and taking a moving average over 10 points to make the curves more readable.

Test case 7: We consider the following example with a quadratic dependence on the distribution, inspired by [1, Section 6] for which there is no analytical solution to the best of our knowledge. Here

Figure 8. Test case 6. error for each function (left) and residuals for each PDE versus number of iterations using Algorithm 2 (right).

we take

with

(48) ˜

whose maximum and minimum values are . We solved this example in dimensions up to using Algorithm 2 and the DGM architecture, with 5. Since the PDEs no longer involve the logarithm of , we use the identity function for the output layer’s activation function. The residuals of the two PDEs are shown in Figure 9, which was obtained by averaging over three runs of the algorithm for each dimension, saving the residuals every 100 iterations, and using a moving average over 10 points to make the curves more readable. We used minibatches of 128 samples, and a DGM architecture with 3 layers and 200 units per layer. The learning rate was updated using Adam optimizer initialized with the value 10. Although the residuals are much smaller in dimension 50, there is no deterioration between 100. These results show that the method provides a good approximate solution without intensive tuning of the hyperparameters. Tuning these hyperparameters (e.g., the architecture, the learning rate and the minibatch size) based on the investigation carried out in the previous test cases should lead to even better results. But to be able to easily compare the results in various dimensions, we decided to keep them fixed. Only for the sake of comparison between dimensions, the average time per iteration is indicated in Table 1. These results have been obtained using a GPU with 2.4 GHz Xeon Broadwell E5-2680 v4 processor. Recall that here the architecture and the minibatch size are fixed, so the increase in computational time is due to the optimization of the neural network and these results show that for this aspect the computational cost scales almost linearly with the dimension.

Table 1. Average time per iteration in seconds for test cases 7 and 8, with d P t2, 10, 50, 100u.

Figure 9. Test case 7. Residuals versus number of iterations using Algorithm 2, for

Test case 8: We consider the following variant which incorporates a non-local term:

with ˜f given by (48). We solved this example in dimensions up to 100 using the DGM method, with 5. The residuals of the two PDEs are shown in Figure 10, which was obtained using the same architecture and the same procedure as for Test case 6. The average time per iteration is indicated in Table 1 and is roughly the same as in Test case 6 because we used 128 samples too to approximate the integral appearing in (49).

6. Conclusion

In this paper, we introduced two numerical algorithms for the solution of the optimal control of ergodic McKean-Vlasov dynamics also known as ergodic mean field control problems. We approximated the theoretical solutions by functions given by neural networks, the latter being determined by their architectures and suitable sets of parameters. This allowed the use of modern machine learning tools, and efficient implementations of stochastic gradient descent.

Figure 10. Test case 8. Residuals versus number of iterations (Algorithm 2), for 2, 10, 50, 100.

The first algorithm is based on the specific structure of the ergodic optimal control problem. We provided a mathematical proof of the convergence of the algorithm, and we analyzed rigorously the numerical scheme by controlling both the approximation and the estimation error. The second method is an adaptation of the deep Galerkin method to the system of partial differential equations issued from the optimality conditions. We showed that it can also be applied to the PDE system arising in mean field games, even when the latter do not have a variational structure.

Our numerical results support the idea that these methods can be used in large dimension. From here, several directions can be contemplated. First, using the same algorithms and architectures, it should be possible to obtain even better results with more time and better hardware (i.e. computational time and power). We view this work as a first step, and we tried to frame it so that it could be accessible to, and reproducible by a large community of researchers.

Finally, we believe that it should be possible to design efficient methods by using known properties of the specific structure of the mean-field PDE system without assuming any knowledge of the form of the solution. The adjoint structure between the HJB and the Poisson equation is a case in point.

References

[1] Y. Achdou, F. Camilli, and I. Capuzzo-Dolcetta. Mean field games: numerical methods for the planning problem. SIAM J. Control Optim., 50(1):77–109, 2012.

[2] Y. Achdou and I. Capuzzo-Dolcetta. Mean field games: numerical methods. SIAM J. Numer. Anal., 48(3):1136–1162, 2010.

[3] Y. Achdou and M. Lauri`ere. On the system of partial differential equations arising in mean field type control. Discrete Contin. Dyn. Syst., 35(9):3879–3900, 2015.

[4] Y. Achdou and M. Lauri`ere. Mean Field Type Control with Congestion (II): An augmented Lagrangian method. Appl. Math. Optim., 74(3):535–578, 2016.

[5] N. Almulla, R. Ferreira, and D. Gomes. Two numerical approaches to stationary mean-field games. Dyn. Games Appl., 7(4):657–682, 2017.

[6] A. Balata, C. Hur´e, M. Lauri`ere, H. Pham, and I. Pimentel. A class of finite-dimensional numerically solvable mckean- vlasov control problems. ESAIM: Proceedings and Surveys, 65:114–144, 2019.

[7] J.-D. Benamou, G. Carlier, and F. Santambrogio. Variational mean field games. In Active particles. Vol. 1. Advances in theory, models, and applications, Model. Simul. Sci. Eng. Technol., pages 141–171. Birkh¨auser/Springer, Cham, 2017.

[8] A. Bensoussan, J. Frehse, and S. C. P. Yam. Mean field games and mean field type control theory. Springer Briefs in Mathematics. Springer, New York, 2013.

[9] M. Bossy and D. Talay. A stochastic particle method for the McKean-Vlasov and the Burgers equation. Math. Comp., 66(217):157–192, 1997.

[10] L. M. Brice˜no Arias, D. Kalise, Z. Kobeissi, M. Lauri`ere, A. Mateos Gonz´alez, and F. J. Silva. On the implementation of a primal-dual algorithm for second order time-dependent mean field games with local couplings. ESAIM: ProcS, 65:330–348, 2019.

[11] L. M. Brice˜no Arias, D. Kalise, and F. J. Silva. Proximal methods for stationary mean field games with local couplings. SIAM J. Control Optim., 56(2):801–836, 2018.

[12] O. Butkovsky. On ergodic properties of nonlinear Markov chains and stochastic McKean - Vlasov equations. Theory of Probability and Applications, Society for Industrial and Applied Mathematics, 58(4):661 – 674, 2014.

[13] S. Cacace, F. Camilli, A. Cesaroni, and C. Marchi. An ergodic problem for mean field games: qualitative properties and numerical simulations. Minimax Theory Appl., 3(2):211–226, 2018.

[14] P. Cardaliaguet. Notes on mean field games. 2013.

[15] E. Carlini and F. J. Silva. A fully discrete semi-Lagrangian scheme for a first order mean field game problem. SIAM J. Numer. Anal., 52(1):45–67, 2014.

[16] R. Carmona and F. Delarue. Probabilistic Theory of Mean Field Games I: Mean Field FBSDEs, Control, and Games. Stochastic Analysis and Applications. Springer Verlag, 2017.

[17] R. Carmona and F. Delarue. Probabilistic Theory of Mean Field Games II: Mean Field Games with Common Noise and Master Equations. Stochastic Analysis and Applications. Springer Verlag, 2017.

[18] R. Carmona and M. Lauri`ere. Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: II - The finite horizon case. Preprint, arXiv:1908.01613.

[19] R. Carmona, M. Lauri`ere, and Z. Tan. Linear-quadratic mean-field reinforcement learning: Convergence of policy gradient methods. Preprint, arXiv:1910.04295.

[20] J.-F. Chassagneux, D. Crisan, and F. Delarue. Numerical method for FBSDEs of McKean-Vlasov type. Ann. Appl. Probab., 29(3):1640–1684, 2019.

[21] Y. Chow, W. Li, S. Osher, and W. Yin. Algorithm for hamilton-jacobi equations in density space via a generalized hopf formula. Technical report, 2018.

[22] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.

[23] J. Dick, F. Kuo, and I. Sloan. High dimensional integration: The quasi-monte carlo way. Acta Numerica, pages 133 – 288, 2013.

[24] W. E, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differ- ential equations and backward stochastic differential equations. Commun. Math. Stat., 5(4):349–380, 2017.

[25] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[26] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.

[27] J.-M. Lasry and P.-L. Lions. Jeux `a champ moyen. I. Le cas stationnaire. C. R. Math. Acad. Sci. Paris, 343(9):619–625, 2006.

[28] J.-M. Lasry and P.-L. Lions. Mean field games. Jpn. J. Math., 2(1):229–260, 2007.

[29] M. Lauri`ere and O. Pironneau. Dynamic programming for mean-field type control. J. Optim. Theory Appl., 169(3):902– 924, 2016.

[30] M. Ledoux and M. Talagrand. Probability in Banach spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)]. Springer-Verlag, Berlin, 1991. Isoperimetry and processes.

[31] A. Maurer. A vector-contraction inequality for Rademacher complexities, 2016.

[32] H. Mhaskar and C. Micchelli. Degree of approximation by neural and translation networks with a single hidden layer. Advances in Applied Mathematics, 16:151–183, 1995.

[33] Y. S. Mishura and A. Y. Veretennikov. Existence and uniqueness theorems for solutions of McKean¯DVlasov stochastic equations. Technical report, 2018.

[34] L. Pfeiffer. Numerical methods for mean-field type optimal control problems. Pure Appl. Funct. Anal., 1(4):629–655, 2016.

[35] L. Ruthotto, S. Osher, W. Li, L. Nurbekyan, and S. Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Technical report, 2019.

[36] M. H. Schultz. -multivariate approximation theory. SIAM J. Numer. Anal., 6:184–209, 1969.

[37] J. Sirignano and K. Spiliopoulos. DGM: a deep learning algorithm for solving partial differential equations. J. Comput. Phys., 375:1339–1364, 2018.

[38] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in neural information processing systems, pages 2377–2385, 2015.

[39] A. Y. Veretennikov. On strong solutions and explicit formulas for solutions of stochastic integral equations. Mathematics of the USSR - Sbornik, 39:387¯D403, 1981.

[40] A. Y. Veretennikov. On ergodic measures for McKean¯DVlasov stochastic equations. In Monte Carlo and Quasi- Monte Carlo Methods 2004, pages 471–486. Springer Verlag, 2006.

[41] P. Yarykin. Stability of the nonlinear stochastic process that approximates the system of interacting Brownian. Theory of Probability and Applications, Society for Industrial and Applied Mathematics, 51(2):387 – 396, 2007.

Designed for Accessibility and to further Open Science