Computation of optimal transport and related hedging problems via penalization and neural networks

2018·Arxiv

Abstract

Abstract

This paper presents a widely applicable approach to solving (multi-marginal, martingale) optimal transport and related problems via neural networks. The core idea is to penalize the optimization problem in its dual formulation and reduce it to a finite dimensional one which corresponds to optimizing a neural network with smooth objective function. We present numerical examples from optimal transport, martingale optimal transport, portfolio optimization under uncertainty and generative adversarial networks that showcase the generality and effectiveness of the approach.

Keywords: optimal transport, robust hedging, numerical method, duality, regularisation, feedforward networks, Knightian uncertainty, distributional robustness

1 Introduction

In this paper we present a penalization method which allows to compute a wide class of optimization problems of the form

by means of neural networks. The most widely known representative of such a functional occurs in the optimal transport problem, to be introduced shortly. More generally, these functionals appear for instance in the representation of coherent risk measures [4] as the worst-case expected loss over a class Q of scenario probabilities, in the representation of nonlinear expectations [41], or as the upper bound of arbitrage-free prices for a contingent claim f, see e.g. [25]. To solve the initial problem we will make use of its dual formulation and restrict to the subclass of those optimization problems which can be realized as a minimal superhedging price

for some , where H is a set of continuous and bounded functions , where the relation of H and Q is given at the beginning of Section 2. A very similar class of optimization problems in an abstract framework of Banach lattices is studied in [21]. Under sufficient regularity conditions the values of the primal problem and its dual problem can be shown to coincide, see e.g. [16] for related pricing-hedging dualities.

A typical example is the Kantorovich relaxation [36] of Monge’s optimal transport problem, where Q is the set of probability measures on a product space with given marginals and , and where H is the set of all continuous and bounded functions and. Further frequently studied problems in this class include multi-marginal optimal transport and Wasserstein distances (see e.g. [5, 50, 51]), martingale optimal transport (see e.g. [7, 26, 31, 33]), value at risk under dependence uncertainty (see e.g. [11, 22, 44]), or calculating worst case copula values and improved Fréchet-Hoeffding bounds (see e.g. [6, 40]). Moreover, serves as a building block for several other problems, like generative adversarial networks (where additionally, the optimization includes generating a distribution, see e.g. [3, 23, 30]), portfolio choice under dependence uncertainty (where additionally, portfolio weights are optimized, see e.g. [10, 43]), or robust optimized certainty equivalents (see e.g. [20]). In these cases, the solution approach presented in this paper is still applicable.

Summary of the approach The goal is to solve numerically. The implementation will build on the dual representation of . The first step is to go over to a finite dimensional setting, where the set H is replaced by a subset :

Theoretically, we will look at a sequence with such that is in a certain sense dense in H. More concretely, can be a set of neural networks with a fixed structure (but unspecified parameter values), and m measures the number of neurons per layer.

To allow for a step-wise updating of the parameters (e.g. by gradient descent methods) for the space , the inequality constraint is penalized. To this end, we introduce a reference on the state space X. Intuitively, this measure will be used to sample points at which the inequality constraint can be tested. Further, we introduce a differentiable and nondecreasing . This leads to the penalized problem

For theoretical considerations we also introduce

Theoretically, we will again consider sequences of penalty functions parametrized by a penalty factor , and use the notation and . Here, an increasing penalty factor can be seen as a more and more precise enforcing of the inequality constraint .

The problems are the ones which are solved numerically. Chapters 2 and 3 study the relation between this problem which is eventually implemented, and the initial problem . To

Figure 1: Occurring problems and their relations. The depicted convergences are studied in Section 2 and, in a more specific context of neural networks, in Section 3.

this end, we analyse how the introduced approximative problems behave for and . Figure 1 summarizes the occurring problems and their relations. Notably, we are only interested in convergence of optimal values, not that of optimizers.

The final step is to find a numerical solution of , which means in practice finding the optimal parameters of the network . We use Tensorflow [1] and the Adam optimizer [38] to this end, and thus mostly regard this step as a black box. We will denote the numerical optimal solution by .

Implementation method: Related literature Penalization of optimal transport problems has been studied in several works (see e.g. [9, 14, 17, 18, 27, 30, 45, 46, 48]). Entropic penalization in particular is applied often, which is in close relation to the Schrödinger problem [39]. Cominetti and San Martín’s work [17] from 1994 on entropic penalization of arbitrary linear programs can be applied to purely discrete optimal transport. The basic idea in [17] is to obtain a strictly convex problem through penalization which can be solved quicker and converges to the initial problem, for an increasing penalty factor. More recently, Cuturi [18] gives an efficient algorithm to compute discrete optimal transport problems with two marginals based on entropic penalization and Sinkhorn’s matrix scaling algorithm. Genevay et al. [27] and Solomon et al. [48] go further in this direction and give algorithms to compute arbitrary optimal transport problems with two marginals, where the algorithm (for the case of continuous marginals) is based on a reproducing kernel Hilbert space approach, and discretization, respectively. In [27] the authors already mention that more general regularizations beyond the entropic one are possible. Among others Benamou et al. [9] and Schmitzer [45] use scaling algorithms related to [18] for a larger class of problems, including for example (discrete) multi-marginal, constrained and unbalanced optimal transport. Carlier et al. [14] show -convergence of the entropic penalized Wasserstein-2 distance to the unpenalized one. The same kind of -convergence is also subject of the studies related to the Schrödinger problem [39], even for more general cost functions. Recent research by Arjovsky et al. [3, 30] inspired by generative adversarial networks include solving a particular optimal transport problem (the Wasserstein-1 distance) based on penalization. In these works, the numerical approach to solve optimal transport problems by parametrization of the dual variables by neural networks originated. Seguy et al. [46] apply a neural network based approach to arbitrary optimal transport problems with two marginals. Their theoretical results are broadly based on entropic penalization, discretization, and weakly continuous dependence of the optimal transport problem on the marginals.

Contribution The current paper gives a unifying numerical solution approach to problems of the form based on penalization and neural networks. The focus lies both on general applicability with respect to the choice of problem, and also on a flexible framework regarding the solution method.

Compared to the existing literature, which often focusses on a single representative (often the optimal transport problem) among problems of the form , our theoretical results are widely applicable. Similarly, the penalization method and the resulting dual relations in this paper allow for many different forms of reference measure and penalty function , while the existing literature is often restricted to uniform or product reference measures, and exponential penalty functions.We show the effects of different reference measures and different penalty functions both theoretically in Theorem 2.2 and practically in the numerical examples in Section 4. In some examples the choice of an appropriate reference measure is crucial, see e.g. Section 4.4. Equation (2.6) of Theorem 2.2 also motivates an updating procedure for the reference measure to reduce the error arising from penalization, which is applied in Section 4.5.

The presented approach is showcased with several examples, which are mostly toy problems taken from existing papers. The reason we use toy problems is to allow for an evaluation of the numerical methods that can be based on analytical solutions.

Structure of the paper In Section 2 we present the theoretical results on approximation and regularization. Section 3 discusses the particular case of as built by multilayer feedforward networks. In Section 4 we illustrate the proposed method with several examples. All proofs are postponed to Section 5.

2 Regularization and approximation of hedging functionals

Let P(X) be the set of all Borel probability measures on a Polish space X, and denote by the linear space of all continuous bounded functions . We consider the superhedging

for , where is a pricing measure and . Throughout this section we assume that H is a linear space which contains the constants (i.e. the constant functions).

In order to derive a dual representation, we assume that is continuous from above, i.e. for every sequence in such that . By the nonlinear Daniell-Stone theorem it has a representation

for all , and the nonempty set for all . In particular . The problems (2.2) and (2.1) are in duality and we refer to (2.2) as the primal and (2.1) as the dual formulation. For the details we refer to the Appendix A. There it is outlined how the duality extends to unbounded functions. However, for the sake of readability we focus on .

The following example illustrates the basic setting:

Example 2.1. Let , and denote by the set of all with first marginal , second marginal , etc. In the following examples, under the assumption that it is straightforward to verify that the corresponding superhedging functional is continuous from above.

where denotes the space of all continuous functions of linear growth corresponding to , see Appendix A. By Strassen’s theorem [49] the set Q is nonempty if the marginals are in convex order.

for some and . For related problems we refer to [6] and the references therein.

2.1 Regularization of the superhedging functional by penalization

Our goal is to regularize the superhedging functional by considering the convolution

where for a , and is a penalty function which is parametrized by . We assume that is a differentiable nondecreasing convex function such that . Its convex conjugate

(b) the with conjugate where for some p > 1.

In case that H = R the functional (2.3) is a so-called optimized certainty equivalent, see Ben-Tal and Teboulle [8]. In the following result we show the dual representation of the regularized superhedging functional and its convergence to .

Theorem 2.2. Let . Suppose there exists such that and. Then

Moreover,

whenever is an -optimizer of (2.2) such that and. If is a minimizer of (2.3) then defined by

is a maximizer of (2.4).

2.2 Approximation of the superhedging functional

In this subsection we consider a sequence of subsets of H, and set . For each , we define the approximated superhedging functional by

For the approximation of by , we need the following density condition on .

Condition (D): For every and holds

In Section 3 we will discuss Condition (D) in the context of multilayer feedforward networks. The condition allows for the following approximation result.

Proposition 2.3. Assume that is a linear space which contains the constants. Under Condition (D) one has

for all .

Given a sampling measure and a parametrized penalty function as in the previous subsection, we define the approximated version of the regularized superhedging functional by

for all . As a consequence of the two approximative steps for in Theorem 2.2 and for in Proposition 2.3 we get the following convergence result.

Proposition 2.4. Suppose that satisfies Condition (D) and for every there exists an -optimizer of (2.4) such that and. Then, for every one has for .

The existence of such -optimizers as required in Theorem 2.2 and Proposition 2.4 is for example established in [12] in the context of multi-marginal optimal transport problems in with absolutely continuous marginals. In general, the existence of such -optimizers crucially depends on the choice of , see also Example 3.6 for a simple illustration.

3 Modelling ﬁnite dimensional subspaces with multilayer feed-forward networks

This section explains the specific choice of approximative subspaces as built by neural networks. Generally, a feasible alternative to neural networks is to build these spaces via basis functions, like polynomials, which is for example pursued in [33] in the context of martingale optimal transport. In contrast to a basis approach, where functions are represented as a weighted sum over fixed basis functions, neural networks rely on the composition of layers of simple functions. This has shown to be an efficient way to approximate a large class of functions with relatively few parameters. Before going into the results, we give the required notation for neural networks.

3.1 Notation

The type of neural networks we consider are fully connected feed-forward neural networks. Those are mappings of the form

where are affine transformations and is a nonlinear activation function that is applied elementwise, i.e. for .

Regarding dimensions, there is an input dimension and a hidden dimension . This means maps from to map from to , and maps from to R. Each affine transformation can trivially be represented as for a matrix and a vector . All these matrices and vectors together are the parameters of the network, which can be regarded as an element of for some .

We will require the sets which contain all feed-forward neural networks with fixed structure (i.e. fixed number of layers and fixed dimensions) but unspecified parameter values. We denote by the sets of possible parameters for a fixed network structure (where formally, D depends on the structure of the network), and by a particular neural network with l layers, input dimension d, hidden dimension m and parameters . We denote the set of all such networks for by .

In the remainder of this section, we work with a fixed number of layers and input dimension, but allow for growing hidden dimension. For different hidden dimensions m, denote by the corresponding parameter sets. We define

We want this definition to be independent of the precise choices of the parameter sets, which is why we make the standing assumption that the sets are growing in m. One way to make this explicit is:

Assumption 3.1. For any and a sequence of parameter sets , where is regarded as a subset of for some , we will always assume that and for all .

The only reason why we do not just set is that in Proposition 3.7 we make the assumption of compact parameter sets. Further, we assume

Assumption 3.2. The activation function is continuous, nondecreasing and satisfies the limit properties and .

3.2 Modelling Hm via neural networks

In the following we assume that H is of the form

where and are continuous functions for all j = 1, . . . , J. This form of H includes many different problems, for instance the ones considered in Example 2.1 (e.g. in (a) one has where denotes the projection on the j-th marginal component).

We approximate H by

and its subspaces

In this context the problems are given by

for all . The final formulation illustrates that the problem is now reduced to a finite dimensional problem of finding the optimal parameters in a neural network. Further, the overall objective depends smoothly on the parameters, and the parameters are unconstrained. In short, problem fits into the framework of machine learning problems that can be numerically solved by standard stochastic gradient descent based methods.

Under the standing Assumptions 3.1 and 3.2, the following lemma establishes situations when Condition (D), which is required for Proposition 2.3, is satisfied in the neural network setting.

Lemma 3.3.

(b) If and for , where is the projection from to j-th marginal component , then satisfies the second part of Condition (D). Further, the second part of Condition (D) is trivially satisfied whenever X is compact.

Notably, part (b) can be seen as a large, but still exemplary case. Intuitively, the second part of Condition (D) is satisfied whenever the space is rich enough.

Remark 3.4. Later in the numerics we will usually work with a ReLU activation function, i.e. max{0, x}. While this does not satisfy the latter limit property of Assumption 3.2, this is easily amendable: Basically, throughout the whole theory the assumptions will only be used to guarantee existence of neural networks with certain properties. Given Assumption 3.2, we will only require two layers (l = 1) to obtain the necessary results. In the numerics however, we use more layers. If more layers are given, one can also bundle several layers and regard them as one layer, with a different activation function. For example:

Whenever is a mapping of the form , an (l+1)-layer network with activation function can represent any function that a two layer network with activation function can represent. For one can easily see that is feasible, which satisfies Assumption 3.2.

3.3 Convergence

In this section we study in what sense converges to for the approximation by neural networks.

First, we study the case of uniform convergence in m and , i.e. conditions for the convergence for . This is subject of Remark 3.5 below, which is a summary of results established in Section 2 and Section 3.2. The two approximative steps leading to uniform convergence are for and for .

On the other hand, sometimes the convergence for is not satisfied even though practically one obtains a good approximation. One such case is given in Example 3.6. Even if uniform convergence does not hold, one can still often connect problems and . This is done by the approximative steps for and for , where the latter is subject of Proposition 3.7. Here, instead of the strong assumption required for , the convergence can be shown by assuming that all parameter sets of the neural networks are compact.

Remark 3.5. Under the assumptions of Lemma 3.3, Proposition 2.3 implies for . Further, given the existence of -optimizers for every as required in Theorem 2.2, convergence for holds. Given both assumptions, Proposition 2.4 yields for . The convergence for is a trivial consequence.

Example 3.6. Let and . Let be the set of all measures in X with first marginal and second marginal , so that

Obviously, . Note that so that .

Consider two possible reference measures, being the uniform distribution on , and . For it is obvious that the existence of -optimizers as required in Theorem 2.2 is given, since itself is the optimizer of . Hence for holds.

On the other hand, there does not exist with , and hence . However, by first approximating by , the functional becomes smoother: Roughly speaking, the marginal constraints are slightly relaxed. This becomes obvious when studying the dual formulations

While one easily finds sequences of functions in so that the values at 0 go to minus infinity but the penalty term stays bounded, this is impossible with functions in given that the activation function is continuous and the parameter sets are compact. So there is hope to establish the convergence for , which will indeed be a consequence of the following result.

Proposition 3.7. Fix . Given that all parameter sets for j = 1, ..., J of the neural networks occurring in are compact and is strictly positive (i.e. gives positive mass to every non-empty open set), it holds for .

4 Numerical Examples

This section aims at showcasing how various frequently studied problems that fall into the theoretical framework of the previous sections can be implemented simply and effectively with neural networks. The examples focus on toy problems that allow an objective evaluation of the numerical results and give the reader an idea about the strengths and weaknesses of the presented approach. We chose a very basic implementation using Tensorflow and the Adam optimizer.As for the network architecture: In all the examples, H is as described in Section 3, and always approximates . To approximate we use a five layer (l = 4 in the previous chapter) ReLU-network with hidden dimension . We did not perform a hyper parameter search to obtain this architecture, but rather oriented ourselves at papers with comparable settings (e.g. at [19, 30, 46]). Notably, increasing the complexity (number of layers or hidden dimension) further did not change the numerical results significantly in the cases tested, so we believe the structure chosen to be adequate for the problems considered.

Simply put, the implementation works as follows: We perform a normal stochastic gradient type optimization (outsourced to the Adam optimizer) for a certain number of iterations to find near optimal parameters of the network. At each iteration during this process, the expectations in the objective function are replaced by averages over a fixed number (called batch size) of random points from the respective distributions. To obtain the numerical approximation of , we finally average the sample objective values over the last roughly 5% of iterations. This is referred to as the dual value. Alternatively, one can use formula (2.6) to obtain sample points from an approximate optimizer of the primal problem and numerically evaluate, which is referred to as the primal value (more details on how to work with such an approximative optimizer is given in Section 4.5). If not stated otherwise, all reported values are dual values.

The numerical procedure we use can likely be improved by fine-tuning parameters or by using more complex network architectures. For example batch normalization is applied in a related setting in [13] which appears to significantly speed up the optimization.

4.1 Optimal transport and Fréchet-Hoeffding bounds

With this first problem, we study the effects of different penalty functions, penalty factor, batch size and number of iterations of the Adam optimizer. Let (where denotes the uniform distribution) and , where is the i-th marginal of . For some fixed , define the function by

The value corresponds to the maximum value of a d-dimensional copula at point z. By the Fréchet-Hoeffding bounds we have an analytical solution to this problem, which is

In Figure 2 we observe how depends on the number of iterations of the Adam optimizer and the batch size. We observe that while higher batch sizes lead to more stable convergence, the speed of convergence appears not strongly related to batch size. This suggests that increasing batch sizes might lead to both quick and finally stable performance.Since penalization appears more

Figure 2: Fréchet-Hoeffding bounds: . Comparison of penalty function and exponential penalty function . The values plotted are running averages over the last 1000 iterations. The dotted red line is the true value . The dotted blue lines are bounds from below for obtained by Equation (2.5) in Theorem 2.2 for the respective choices of .

stable, we will mostly use this penalization for the rest of the applications. Further, the figure illustrates that the numerical solutions appear to approximately obtain the lower bounds for as given by Equation (2.5) in Theorem 2.2. I.e. one approximately has where is an optimizer of

4.2 Multi-marginal optimal transport

The aim of this example is to compare the approach of this paper with existing methods for a numerically challenging problem. Let , where M denotes the number of marginals and D denotes the dimension of each marginal. Let for i = 1, ..., M be K-mixtures of normal distributions with randomly chosen parameters, and define . For let

where we write with i = 1, ..., M, j = 1, ..., D. Note that for two marginals, one has , where is the Wasserstein-p distance with norm on .

In Table 4.2 we compare optimal values to this problem arising from different algorithmic approaches. We compare linear programming methods based on discretization of the marginals, the neural network method presented in this paper, and a reproducing kernel Hilbert space (RKHS) approach as described in [28, Algorithm 3]. For the linear programming methods, we use a maximal number of variables of , which was around the boundary so that Gurobi [32] was

Table 1: Multi-marginal optimal transport: Numerical values for arising from differ-ent numerical schemes. The numbers in brackets are empirical standard deviations over 100 runs. LP denotes linear programming, based on either sampling the marginals randomly (MC) or using the quantization approach from [42, Algorithm 4.5] to approximate marginals (quantization). The neural network (NN) implementation is based on penalization with . For the reproducing kernel Hilbert space solution (RKHS) as described in [28, Algorithm 3] we use a Laplace kernel and the same penalization as for the NN-method. For the final two rows, we report numerical values for . DNC entries did not converge. The final column are analytical reference values given by the comonotone coupling for two marginals.

still able to solve the resulting linear program on our computer. Regarding the RKHS algorithm we have to mention that it is the only method that is not building on an established package like Tensorflow or Gurobi. Hence efficiency with respect to running time and tuning of hyperparameters might be far from optimal for this approach. Notably, switching from exponential penalization as used in [28] to -penalization was already a slight improvement. For the precise specifications of each algorithm and the problem setting, we refer to the code on https://github.com/stephaneckstein/OT_Comparison.

Evaluating the results, we find that the neural network based method and the linear programming method with quantization appear to work best. Surprisingly, even for the case with 10 marginals (where the linear program can only use 4 points to approximate each marginal!), the quantization method achieved a similar value as the neural network method for . We believe the reason is that the function f is very smooth which a quantization approach can exploit. Hence we slightly changed f to in the final two test cases, which makes the function less regular. These are the only cases where the neural network solution and the quantization method strongly differ. In the final case, the quantization approach still has to approximate each marginal distribution with just 4 points, while the neural network method can use millions of points. From this standpoint, one can place higher trust in the neural network solution, even though we have no analytical reference value to compare it against.

Initially, we included a fourth method based on the approach in this paper, but with a polynomial basis instead of neural networks. This performed very badly however (at least when using a standard monomial basis), and hence we omitted the results in this table.

4.3 Martingale optimal transport

In martingale optimal transport, the optimal transport problem is extended by imposing a martingale constraint on top of marginal constraints. Dimensions are regarded as discrete time-steps and the measures in Q are distributions of discrete stochastic processes with fixed marginal distributions as well as the condition that the process is a martingale.

Here, we consider a simple example with d = 2, where an analytical solution is known. This example is taken from [2]. Let and set

For one gets for all . We implement this problem with , where we use the penalty function for different values of . The results are shown in Figure 3. One can see that while for values of up to around 1280, the behavior of the optimal value is approximately as predicted by Equation (2.5) in Theorem 2.2, in that the error decreases by roughly a factor of two if is increased by a factor of two. For larger values of however, numerical instabilities occur and the optimizer cannot find the true optimum. This is indicated by the fact that the value is above .

4.4 Portfolio optimization

Consider a market with two assets, where the distribution of returns for each individual asset is given, but not the joint distribution. An investor wants to maximize his or her worst-case utility from investing into the two assets. Here, the utility of the investor is characterized by a mean-variance objective. While the mean is fully characterized by the marginal distributions, the worst

Figure 3: Martingale optimal transport: Mean numerical optimal values and 95% confidence bounds over 100 independent runs for different values of penalization). The network is trained for 20000 iterations with batch size 1024. The true optimal value of the unpenalized problem is -1.

case considers all possible variances of the portfolio, which depend on the joint distribution of the assets.

The following example is taken from [43]: Let . Let and , where . Let . We will solve the following robust mean-variance portfolio optimization problem

where is the risk aversion. The integral over the term inside the large brackets is the variance of the portfolio. For the analytical solution, see Example 1 of [43]. We implemented the above problem in two ways. For the first, we choose the reference measure . For the second, we use the reference measure i.e. half the product measure, half the perfectly correlated measure. The second version may correspond to our intuition that the optimal coupling should include positive correlation. More precisely: The choice of reference measure always has the implicit objective to lead to narrow bounds in Equation (2.5) in Theorem 2.2. In this example, if one presumes that an optimal measure has mass near the perfectly correlated diagonal, it makes sense to choose a reference measure which puts mass in this region, as does .

Figure 4: Portfolio optimization under dependence uncertainty: As reference measure we take either the product measure or a positively correlated measure. We use penalization with . The network is trained with batch size for 40000 iterations.

The results are reported in Figure 4. As expected, the second version yields results closer to the analytical solution.

4.5 Bounds on the distribution of a sum of dependent random variables

In this section, the objective is to find bounds for the probability for some , where the individual distributions of are known, but not their joint distribution. This problem is in strong relation to calculating worst- and best-case value at risks under dependence uncertainty, see also [20, 22, 44].

Let . For given marginals and , the problem statement is

For simplicity, we consider the case d = 2, and . Let s = 1.9. Every optimal measure gives mass 1/10 uniformly to the line section , while the rest of the mass is irrelevant as long as the marginal condition is satisfied. This leads to an optimal value . For the natural choice of reference measure , every optimal measure is singular with respect to , and thus one can expect high errors by penalization. An implementation with this reference measure and penalization with leads to

Updating the reference measure: To obtain a more accurate value, we make use of Equation (2.6) in Theorem 2.2. Recall that an optimizer of is given by , where is the dual optimizer of . Taking as a reference measure instead of can only reduce the error by penalization, since holds by convexity of . Implementing the problem with as a reference measure has to be done approximately, since the true optimizer is unknown and

Figure 5: Bounds on the distribution of the sum of dependent variables: Sampled points from the numerically optimal measure and the corresponding empirical marginal distributions.

replaced by the numerical optimal solution. We denote the numerically obtained optimal measure by .

To implement requires sampling points from . This is non-trivial since is only given by . We implemented this by an acceptance-rejection method as described in [24]. This is very slow, as the number of rejections increases with the maximum of the Radon-Nikodym derivative. Sampling efficiently in such a situation is difficult, see e.g. [34] for an overview of existing methods and a proposed new one.

Figure 5 illustrates the optimal measure . The measure looks comparable to an optimal solution of , while simultaneously being driven towards the reference measure . One obtains , which is close to the true optimal value 0.1.

In the following, we briefly discuss the rearrangement algorithm [22, 44] which is tailored to this type of problem. In contrast to the presented approach, which relies on sampling from the involved marginal distributions, the rearrangement algorithm mainly relies on the (inverse of) the cumulative distribution function of the marginals. The rearrangement algorithm achieves similar or even better accuracy in higher dimensional settings and with different marginals (e.g. Pareto marginals). The case of higher dimensions scales well with the approach taken here. However, the base time in low dimensions is higher than that of the rearrangement algorithm, and further heavy tailed marginals

like the Pareto distribution can lead to less accuracy.

4.6 Generative Adversarial Networks (GANs)

The objective in GANs is to create new sample points from a measure , of which only an empirical distribution is known (see e.g. [3, 29]). Usually, the measure might refer to the uniform distribution over some very large set of images. The set is then just the uniform distribution over a small subset of these images. The goal is to sample new images that are not already present in the given subset, but that might plausibly have been samples from the measure .

To proceed, we first take some latent probability measure . The goal is to obtain a function G such that and the push-forward measure are close (in a sense to be specified), and thus the pseudo samples for can be obtained by sampling from and applying G. To find such a function G out of a class of functions G, one can only use instead of (thus does not enter the formal problem statement). The closeness of and is measured by different distances in GANs. In the Wasserstein GAN, the first Wasserstein distance (see e.g [51]) is used, and the objective is

The above can be generalized to arbitrary transport distances instead of the first Wasserstein distance. To put this into our setting, let G a set of functions that map into . Let and for define . For a cost function c, arbitrary transport type GANs can be expressed via

If c is a metric, the above corresponds to the Wasserstein GAN.

Since it is difficult to objectively evaluate GAN setups, we omit numerical results in this section. The interested reader can see the code on GitHub, where the toy problems appearing in [30] are implemented for different functions

5 Proofs

5.1 Proof of Theorem 2.2

1) We first show (2.4) by verifying that is real-valued and continuous from above on . To that end, as shown in Appendix A one has for all , so that

so that

for all . This shows that is real-valued on . Further, let be a sequence in such that . For every the monotone convergence theorem implies, so that . Hence, it follows from the nonlinear Daniell-Stone theorem (see Proposition A.1) that

where the convex conjugate is given by

Indeed, the convex conjugate of the convolution is given as the sum of the convex conjugates and . By (5.1) one has if and otherwise. Moreover,

with the convention .

3) Let be a minimizer of (2.3), i.e. . Defining for an arbitrary , the first order condition

This shows that the probability measure with Radon-Nikodým derivative satisfies for all , which in view of (5.1) satisfies . Integrating the identity with w.r.t. , one obtains

which shows that

As a consequence, is a maximizer of (2.4).

5.2 Proof of Proposition 2.3

Fix . That follows from the definition of . Moreover, for every the Condition (D) guarantees and such that and . Hence, and Dini’s lemma implies that is continuous from above on . By Proposition A.1 it follows that

Similar to (A.2) its convex conjugate is given by

It remains to show that for and withthere exists such that. But this follows directly from the first part of Condition (D) for the probability measure . Indeed, there exists a sequence in such that in and in , which shows thatfor n large enough.

5.3 Proof of Proposition 2.4

Observe that

where the first inequality uses that is increasing, the second inequality just drops the constraint , and the third inequality follows from . Fix . By Condition (D) and Theorem 2.2 there exist and such that

for all and . This shows that

for all and , which shows that whenever .

5.4 Proof of Lemma 3.3

(a) From Hornik [35] it follows that is dense in with respect to for every and all j = 1, ..., J. By the triangle inequality and boundedness of , the first part of Condition (D) follows.

(b) If X is compact, the condition is trivially satisfied. Hence assume that and for , where is the projection from to the j-th marginal component in .

Let and . We first fix j, denote by and show that there exists a such that andfor some compact subset

of . Without loss of generality, assume that . This can always be done since the function will be compact-valued and hence for multiple layers, the remaining layers beyond the first can simply approximate the identity function in the supremum norm.Fix such that . By assumption on , for each there exist such that

Now, define and , which is compact. Then one immediately gets and.

5.5 Proof of Proposition 3.7

1) For one fix network , the mapping is pointwise continuous, i.e. it holds for that for all , since is continuous. Further, since we assume that is bounded, the functions are uniformly bounded and hence by dominated convergence one obtains in for all . By the triangle inequality, this continuity transfers to the mapping . Hence, we can write

On the other hand, let be a sequence in with . Our goal is to show that . We assume that otherwise there is nothing to prove. For every there exist and such that

since for all for some constant . In particular, is real-valued as is a continuous function on the compact . By passing to a subsequence we may assume that , and such that in and -a.s. as well as. We next show that is bounded. Suppose by way of contradiction that . Since and is uniformly bounded by compactness of , it follows that

Moreover, in view of (5.2) the sequence

is uniformly integrable in . Hence, it follows from Fatou’s lemma that

which is the desired contradiction. This shows that is bounded and by passing to a subsequence . Finally it follows from Fatou’s lemma that

where if and if x > 0. The second inequality follows because -a.s. as a consequence of the first inequality, and is strictly positive.

A Nonlinear version of the Daniell-Stone theorem

Let X be a Polish space. Given a measurable function , we denote by the Stone vector lattice of all continuous functions such that is bounded. For instance, if is bounded one has , or if on the space contains all continuous functions of linear growth. Further, let be the set of all Borel measures on X which satisfy. The following nonlinear version of the Daniell-Stone theorem follows directly from Proposition 1.1 in [15].

Proposition A.1. Let be an increasingconvex functional which is continuous from above, i.e. for every sequence such that . Then, it has the dual representation

where the convex conjugate is given by .

Continuity from above is strongly related to the concept of tightness, which in the context of risk measures was introduced by Föllmer and Schied, see [25]. Typical examples include transport type problems where tightness is imposed by marginal constraints, see e.g. Bartl et al. [5]. For extensions of the representation (A.1) to upper semicontinuous functions and related pricing-hedging dualities we refer to Cheridito et al. [16].

As an application we consider the superhedging functional

on , where is a probability measure and is a convex cone such that . Straightforward inspection shows that is a real-valued increasing convex functional on . Further, if is continuous from above by Proposition A.1 it has the dual representation (A.1). Its convex conjugate is given by

Since H is a convex cone which contains the constants it follows that whenever is a probability measure such thatfor all , and else. In particular, in case that we conclude the dual representation (2.2).

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] A. Alfonsi, J. Corbetta, and B. Jourdain. Sampling of probability measures in the convex order and approximation of martingale optimal transport problems. 2017.

[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[4] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.

[5] D. Bartl, P. Cheridito, M. Kupper, and L. Tangpi. Duality for increasing convex functionals with countably many marginal constraints. Banach Journal of Mathematical Analysis, 11(1):72–89, 2017.

[6] D. Bartl, M. Kupper, T. Lux, and A. Papapantoleon. Sharpness of improved Fréchet-Hoeffding bounds: an optimal transport approach. arXiv preprint arXiv:1709.00641, 2017.

[7] M. Beiglböck, P. Henry-Labordère, and F. Penkner. Model-independent bounds for option prices: A mass transport approach. Finance and Stochastics, 17(3):477–501, 2013.

[8] A. Ben-Tal and M. Teboulle. An old-new concept of convex risk measures: The optimized certainty equivalent. Mathematical Finance, 17(3):449–476, 2007.

[9] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111– A1138, 2015.

[10] C. Bernard, X. Jiang, and R. Wang. Risk aggregation with dependence uncertainty. Insurance: Mathematics and Economics, 54:93–108, 2014.

[11] C. Bernard, L. Rüschendorf, S. Vanduffel, and J. Yao. How robust is the value-at-risk of credit risk portfolios? The European Journal of Finance, 23(6):507–534, 2017.

[12] U. Bindini. Smoothing operators in multi-marginal optimal transport. arXiv preprint arXiv:1901.07407, 2019.

[13] H. Bühler, L. Gonon, J. Teichmann, and B. Wood. Deep hedging. arXiv preprint arXiv:1802.03042, 2018.

[14] G. Carlier, V. Duval, G. Peyré, and B. Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1385–1418, 2017.

[15] P. Cheridito, M. Kupper, and L. Tangpi. Representation of increasing convex functionals with countably additive measures. arXiv preprint arXiv:1502.05763, 2015.

[16] P. Cheridito, M. Kupper, and L. Tangpi. Duality formulas for robust pricing and hedging in discrete time. SIAM Journal on Financial Mathematics, 8(1):738–765, 2017.

[17] R. Cominetti and J. San Martín. Asymptotic analysis of the exponential penalty trajectory in linear programming. Mathematical Programming, 67(1-3):169–187, 1994.

[18] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.

[19] W. E, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.

[20] S. Eckstein, M. Kupper, and M. Pohl. Robust risk aggregation with neural networks. arXiv preprint arXiv:1811.00304, 2018.

[21] I. Ekren and H. M. Soner. Constrained optimal transport. Archive for Rational Mechanics and Analysis, pages 1–37, 2017.

[22] P. Embrechts, G. Puccetti, and L. Rüschendorf. Model uncertainty and VaR aggregation. Journal of Banking & Finance, 37(8):2750–2764, 2013.

[23] S. Feizi, C. Suh, F. Xia, and D. Tse. Understanding GANs: the LQG setting. arXiv preprint arXiv:1710.10793, 2017.

[24] B. D. Flury. Acceptance–rejection sampling made easy. SIAM Review, 32(3):474–476, 1990.

[25] H. Föllmer and A. Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, 2011.

[26] A. Galichon, P. Henry-Labordere, N. Touzi, et al. A stochastic control approach to no-arbitrage bounds given marginals, with an application to lookback options. The Annals of Applied Probability, 24(1):312–336, 2014.

[27] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3440–3448. Curran Associates, Inc., 2016.

[28] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3440–3448, 2016.

[29] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[30] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

[31] G. Guo and J. Obloj. Computational methods for martingale optimal transport problems. arXiv preprint arXiv:1710.07911, 2017.

[32] L. Gurobi Optimization. Gurobi optimizer reference manual, 2018.

[33] P. Henry-Labordère. Automated option pricing: Numerical methods. International Journal of Theoretical and Applied Finance, 16(08):1350042, 2013.

[34] F. Horger, T. Würfl, V. Christlein, and A. Maier. Deep learning for sampling from arbitrary probability distributions. arXiv preprint arXiv:1801.04211, 2018.

[35] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.

[36] L. V. Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk SSSR, volume 37, pages 199–201, 1942.

[37] H. G. Kellerer. Duality theorems for marginal problems. theorie und verwandte Gebiete, 67(4):399–432, 1984.

[38] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[39] C. Léonard. A survey of the Schrödinger problem and some of its connections with optimal

transport. Discrete & Continuous Dynamical Systems - A, 34(4):1533–1574, 2014.

[40] T. Lux, A. Papapantoleon, et al. Improved Fréchet–Hoeffding bounds on d-copulas and applications in model-free finance. The Annals of Applied Probability, 27(6):3633–3671, 2017.

[41] S. Peng. G-expectation, G-Brownian motion and related stochastic calculus of Itô type. In Stochastic analysis and applications, pages 541–567. Springer, 2007.

[42] G. C. Pflug and A. Pichler. Multistage stochastic optimization. Springer, 2014.

[43] G. C. Pflug and M. Pohl. A review on ambiguity in stochastic portfolio optimization. Set-Valued and Variational Analysis, pages 1–25, 2017.

[44] G. Puccetti and L. Rüschendorf. Computation of sharp bounds on the distribution of a function of dependent risks. Journal of Computational and Applied Mathematics, 236(7):1833–1840, 2012.

[45] B. Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport problems. arXiv preprint arXiv:1610.06519, 2016.

[46] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large-scale optimal transport and mapping estimation. arXiv preprint arXiv:1711.02283, 2017.

[47] S. L. Smith, P.-J. Kindermans, and Q. V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.

[48] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.

[49] V. Strassen. The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):423–439, 1965.

[50] S. Vallender. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications, 18(4):784–786, 1974.

[51] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

designed for accessibility and to further open science