b

DiscoverSearch
About
My stuff
Computation of optimal transport and related hedging problems via penalization and neural networks
2018·arXiv
Abstract
Abstract

This paper presents a widely applicable approach to solving (multi-marginal, martingale) optimal transport and related problems via neural networks. The core idea is to penalize the optimization problem in its dual formulation and reduce it to a finite dimensional one which corresponds to optimizing a neural network with smooth objective function. We present numerical examples from optimal transport, martingale optimal transport, portfolio optimization under uncertainty and generative adversarial networks that showcase the generality and effectiveness of the approach.

Keywords: optimal transport, robust hedging, numerical method, duality, regularisation, feedforward networks, Knightian uncertainty, distributional robustness

In this paper we present a penalization method which allows to compute a wide class of optimization problems of the form

image

by means of neural networks. The most widely known representative of such a functional occurs in the optimal transport problem, to be introduced shortly. More generally, these functionals appear for instance in the representation of coherent risk measures [4] as the worst-case expected loss over a class Q of scenario probabilities, in the representation of nonlinear expectations [41], or as the upper bound of arbitrage-free prices for a contingent claim f, see e.g. [25]. To solve the initial problem φ(f)we will make use of its dual formulation and restrict to the subclass of those optimization problems which can be realized as a minimal superhedging price

image

for some  µ0 ∈ Q, where H is a set of continuous and bounded functions  h : X → R, where the relation of H and Q is given at the beginning of Section 2. A very similar class of optimization problems in an abstract framework of Banach lattices is studied in [21]. Under sufficient regularity conditions the values of the primal problem  supν∈Q�f dνand its dual problem  infh∈H: h≥f�h dµ0can be shown to coincide, see e.g. [16] for related pricing-hedging dualities.

A typical example is the Kantorovich relaxation [36] of Monge’s optimal transport problem, where Q is the set of probability measures on a product space  X = X1 ×X2with given marginals  µ1and  µ2, and where H is the set of all continuous and bounded functions  h(x1, x2) = h1(x1) + h2(x2)and�h dµ0 =�X1 h1 dµ1 +�X2 h2 dµ2. Further frequently studied problems in this class include multi-marginal optimal transport and Wasserstein distances (see e.g. [5, 50, 51]), martingale optimal transport (see e.g. [7, 26, 31, 33]), value at risk under dependence uncertainty (see e.g. [11, 22, 44]), or calculating worst case copula values and improved Fréchet-Hoeffding bounds (see e.g. [6, 40]). Moreover,  φ(f)serves as a building block for several other problems, like generative adversarial networks (where additionally, the optimization includes generating a distribution, see e.g. [3, 23, 30]), portfolio choice under dependence uncertainty (where additionally, portfolio weights are optimized, see e.g. [10, 43]), or robust optimized certainty equivalents (see e.g. [20]). In these cases, the solution approach presented in this paper is still applicable.

Summary of the approach The goal is to solve  φ(f)numerically. The implementation will build on the dual representation of  φ(f). The first step is to go over to a finite dimensional setting, where the set H is replaced by a subset  Hm:

image

Theoretically, we will look at a sequence  (Hm)m∈Nwith  H1 ⊆ H2 ⊆ ... ⊆ Hsuch that  H∞ :=∪m∈NHmis in a certain sense dense in H. More concretely,  Hmcan be a set of neural networks with a fixed structure (but unspecified parameter values), and m measures the number of neurons per layer.

To allow for a step-wise updating of the parameters (e.g. by gradient descent methods) for the space  Hm, the inequality constraint  h ≥ fis penalized. To this end, we introduce a reference probability measure θon the state space X. Intuitively, this measure will be used to sample points at which the inequality constraint  h ≥ fcan be tested. Further, we introduce a differentiable and nondecreasing  penalty function β : R → R+. This leads to the penalized problem

image

For theoretical considerations we also introduce

image

Theoretically, we will again consider sequences of penalty functions  (βγ)γ>0parametrized by a penalty factor  γ, and use the notation  φθ,γ(f) := φθ,βγ(f)and  φmθ,γ(f) := φmθ,βγ(f). Here, an increasing penalty factor can be seen as a more and more precise enforcing of the inequality constraint h ≥ f.

The problems  φmθ,γ(f)are the ones which are solved numerically. Chapters 2 and 3 study the relation between this problem which is eventually implemented, and the initial problem  φ(f). To

image

Figure 1: Occurring problems and their relations. The depicted convergences are studied in Section 2 and, in a more specific context of neural networks, in Section 3.

this end, we analyse how the introduced approximative problems behave for  m → ∞and  γ → ∞. Figure 1 summarizes the occurring problems and their relations. Notably, we are only interested in convergence of optimal values, not that of optimizers.

The final step is to find a numerical solution of  φmθ,γ(f), which means in practice finding the optimal parameters of the network  Hm. We use Tensorflow [1] and the Adam optimizer [38] to this end, and thus mostly regard this step as a black box. We will denote the numerical optimal solution by ˆφmθ,γ(f).

Implementation method: Related literature Penalization of optimal transport problems has been studied in several works (see e.g. [9, 14, 17, 18, 27, 30, 45, 46, 48]). Entropic penalization in particular is applied often, which is in close relation to the Schrödinger problem [39]. Cominetti and San Martín’s work [17] from 1994 on entropic penalization of arbitrary linear programs can be applied to purely discrete optimal transport. The basic idea in [17] is to obtain a strictly convex problem through penalization which can be solved quicker and converges to the initial problem, for an increasing penalty factor. More recently, Cuturi [18] gives an efficient algorithm to compute discrete optimal transport problems with two marginals based on entropic penalization and Sinkhorn’s matrix scaling algorithm. Genevay et al. [27] and Solomon et al. [48] go further in this direction and give algorithms to compute arbitrary optimal transport problems with two marginals, where the algorithm (for the case of continuous marginals) is based on a reproducing kernel Hilbert space approach, and discretization, respectively. In [27] the authors already mention that more general regularizations beyond the entropic one are possible. Among others Benamou et al. [9] and Schmitzer [45] use scaling algorithms related to [18] for a larger class of problems, including for example (discrete) multi-marginal, constrained and unbalanced optimal transport. Carlier et al. [14] show  Γ-convergence of the entropic penalized Wasserstein-2 distance to the unpenalized one. The same kind of  Γ-convergence is also subject of the studies related to the Schrödinger problem [39], even for more general cost functions. Recent research by Arjovsky et al. [3, 30] inspired by generative adversarial networks include solving a particular optimal transport problem (the Wasserstein-1 distance) based on  L2penalization. In these works, the numerical approach to solve optimal transport problems by parametrization of the dual variables by neural networks originated. Seguy et al. [46] apply a neural network based approach to arbitrary optimal transport problems with two marginals. Their theoretical results are broadly based on entropic penalization, discretization, and weakly continuous dependence of the optimal transport problem on the marginals.

Contribution The current paper gives a unifying numerical solution approach to problems of the form  φ(f)based on penalization and neural networks. The focus lies both on general applicability with respect to the choice of problem, and also on a flexible framework regarding the solution method.

Compared to the existing literature, which often focusses on a single representative (often the optimal transport problem) among problems of the form  φ(f), our theoretical results are widely applicable. Similarly, the penalization method and the resulting dual relations in this paper allow for many different forms of reference measure  θand penalty function  βγ, while the existing literature is often restricted to uniform or product reference measures, and exponential penalty functions.1We show the effects of different reference measures and different penalty functions both theoretically in Theorem 2.2 and practically in the numerical examples in Section 4. In some examples the choice of an appropriate reference measure is crucial, see e.g. Section 4.4. Equation (2.6) of Theorem 2.2 also motivates an updating procedure for the reference measure to reduce the error arising from penalization, which is applied in Section 4.5.

The presented approach is showcased with several examples, which are mostly toy problems taken from existing papers. The reason we use toy problems is to allow for an evaluation of the numerical methods that can be based on analytical solutions.

Structure of the paper In Section 2 we present the theoretical results on approximation and regularization. Section 3 discusses the particular case of  Hmas built by multilayer feedforward networks. In Section 4 we illustrate the proposed method with several examples. All proofs are postponed to Section 5.

Let P(X) be the set of all Borel probability measures on a Polish space X, and denote by  Cb(X)the linear space of all continuous bounded functions  f : X → R. We consider the superhedging

image

for  f ∈ Cb(X), where  µ0 ∈ P(X)is a pricing measure and  H ⊆ Cb(X). Throughout this section we assume that H is a linear space which contains the constants (i.e. the constant functions).

In order to derive a dual representation, we assume that  φis continuous from above, i.e.  φ(fn) ↓ 0for every sequence  (fn)in  Cb(X)such that  f n ↓ 0. By the nonlinear Daniell-Stone theorem it has a representation

image

for all  f ∈ Cb(X), and the nonempty set  Q =�µ ∈ P(X) :�h dµ =�h dµ0for all  h ∈ H�. In particular  µ0 ∈ Q. The problems (2.2) and (2.1) are in duality and we refer to (2.2) as the primal and (2.1) as the dual formulation. For the details we refer to the Appendix A. There it is outlined how the duality extends to unbounded functions. However, for the sake of readability we focus on Cb(X).

The following example illustrates the basic setting:

Example 2.1. Let  X = Rd, and denote by  Π(µ1, ..., µd)the set of all  µ ∈ P(Rd)with first marginal µ1, second marginal  µ2, etc. In the following examples, under the assumption that  Q ̸= ∅it is straightforward to verify that the corresponding superhedging functional is continuous from above.

image

where  Cκ(Rd)denotes the space of all continuous functions of linear growth corresponding to  κ(x) := 1 + |x|, see Appendix A. By Strassen’s theorem [49] the set Q is nonempty if the marginals  µ1, . . . , µdare in convex order.

image

for some  g1, . . . , gN ∈ Cb(Rd)and  c1, . . . , cN ∈ R. For related problems we refer to [6] and the references therein.

2.1 Regularization of the superhedging functional by penalization

Our goal is to regularize the superhedging functional  φby considering the convolution

image

where  ψθ,γ(f) :=�βγ(f) dθfor a  sampling measure θ ∈ P(X), and  βγ(x) := 1γ β(γx)is a penalty function which is parametrized by  γ > 0. We assume that  β : R → R+is a differentiable nondecreasing convex function such that  limx→∞ β(x)/x = ∞. Its convex conjugate

image

(b) the  Lp penalty function β(x) = 1p(max{0, x})pwith conjugate  β∗(y) = 1qyqwhere  q = pp−1for some p > 1.

In case that H = R the functional (2.3) is a so-called optimized certainty equivalent, see Ben-Tal and Teboulle [8]. In the following result we show the dual representation of the regularized superhedging functional  φθ,γand its convergence to  φ.

Theorem 2.2. Let  f ∈ Cb(X). Suppose there exists  π ∈ Qsuch that  π ≪ θand�β∗� dπdθ�dθ < ∞. Then

image

Moreover,

image

whenever  µε ∈ Qis an  ε-optimizer of (2.2) such that  µε ≪ θand�β∗γ� dµεdθ�dθ < ∞. If ˆh ∈ His a minimizer of (2.3) then  ˆµ ∈ P(X)defined by

image

is a maximizer of (2.4).

2.2 Approximation of the superhedging functional

In this subsection we consider a sequence  H1 ⊆ H2 ⊆ · · ·of subsets of H, and set  H∞ := �m∈N Hm. For each  m ∈ N ∪ {+∞}, we define the approximated superhedging functional by

image

For the approximation of  φ(f)by  φm(f), we need the following density condition on  H∞.

Condition (D): For every  ε > 0and  µ ∈ P(X)holds

image

In Section 3 we will discuss Condition (D) in the context of multilayer feedforward networks. The condition allows for the following approximation result.

Proposition 2.3. Assume that  H∞is a linear space which contains the constants. Under Condition (D) one has

image

for all  f ∈ Cb(X).

Given a sampling measure  θand a parametrized penalty function  βγas in the previous subsection, we define the approximated version of the regularized superhedging functional by

image

for all  f ∈ Cb(X). As a consequence of the two approximative steps  φθ,γ(f) → φ(f)for  γ → ∞in Theorem 2.2 and  φm(f) → φ(f)for  m → ∞in Proposition 2.3 we get the following convergence result.

Proposition 2.4. Suppose that  H∞satisfies Condition (D) and for every  ε > 0there exists an ε-optimizer  µεof (2.4) such that  µε ≪ θand�β∗� dµεdθ�dθ < +∞. Then, for every  f ∈ Cb(X)one has  φmθ,γ(f) → φ(f)for  min{m, γ} → ∞.

The existence of such  ε-optimizers as required in Theorem 2.2 and Proposition 2.4 is for example established in [12] in the context of multi-marginal optimal transport problems in  Rdwith absolutely continuous marginals. In general, the existence of such  ε-optimizers crucially depends on the choice of  θ, see also Example 3.6 for a simple illustration.

This section explains the specific choice of approximative subspaces as built by neural networks. Generally, a feasible alternative to neural networks is to build these spaces via basis functions, like polynomials, which is for example pursued in [33] in the context of martingale optimal transport. In contrast to a basis approach, where functions are represented as a weighted sum over fixed basis functions, neural networks rely on the composition of layers of simple functions. This has shown to be an efficient way to approximate a large class of functions with relatively few parameters. Before going into the results, we give the required notation for neural networks.

3.1 Notation

The type of neural networks we consider are fully connected feed-forward neural networks. Those are mappings of the form

image

where  Aiare affine transformations and  ϕ : R → Ris a nonlinear activation function that is applied elementwise, i.e.  ϕ((x1, ..., xn)) = (ϕ(x1), ..., ϕ(xn))for  (x1, ..., xn) ∈ Rn.

Regarding dimensions, there is an input dimension  d ∈ Nand a hidden dimension  m ∈ N. This means  A0maps from  Rdto  Rm, A1, ..., Al−1map from  Rmto  Rm, and  Almaps from  Rmto R. Each affine transformation  Ajcan trivially be represented as  Aj(x) = Mjx + bjfor a matrix  Mjand a vector  bj. All these matrices and vectors together are the parameters of the network, which can be regarded as an element of  RDfor some  D ∈ N.

We will require the sets which contain all feed-forward neural networks with fixed structure (i.e. fixed number of layers and fixed dimensions) but unspecified parameter values. We denote by Ξ ⊂ RDthe sets of possible parameters for a fixed network structure (where formally, D depends on the structure of the network), and by  Nl,d,m(ξ) = Al ◦ ϕ ◦ Al−1 ◦ ... ◦ ϕ ◦ A0a particular neural network with l layers, input dimension d, hidden dimension m and parameters  ξ ∈ Ξ. We denote the set of all such networks  Nl,d,m(ξ)for  ξ ∈ Ξby  Nl,d,m(Ξ).

In the remainder of this section, we work with a fixed number of layers and input dimension, but allow for growing hidden dimension. For different hidden dimensions m, denote by  Ξmthe corresponding parameter sets. We define

image

We want this definition to be independent of the precise choices of the parameter sets, which is why we make the standing assumption that the sets  Nl,d,m(Ξm)are growing in m. One way to make this explicit is:

Assumption 3.1. For any  l, d ∈ Nand a sequence of parameter sets  Ξ1, Ξ2, ..., where  Ξmis regarded as a subset of  RDmfor some  Dm ∈ N, we will always assume that  [−m, m]Dm ⊆ Ξmand Nl,d,m(Ξm) ⊂ Nl,d,m+1(Ξm+1)for all  m ∈ N.

The only reason why we do not just set  Ξm ≡ RDmis that in Proposition 3.7 we make the assumption of compact parameter sets. Further, we assume

Assumption 3.2. The activation function  ϕis continuous, nondecreasing and satisfies the limit properties  limx→−∞ ϕ(x) = 0and  limx→+∞ ϕ(x) = 1.

3.2 Modelling Hm via neural networks

In the following we assume that H is of the form

image

where  ej ∈ Cb(X)and  πj : X → Rdjare continuous functions for all j = 1, . . . , J. This form of H includes many different problems, for instance the ones considered in Example 2.1 (e.g. in (a) one has  H = {�dj=1 hj ◦ prk : hj ∈ Cb(R)}where  prj(x) := xjdenotes the projection on the j-th marginal component).

We approximate H by

image

and its subspaces

image

In this context the problems  φmθ,γ(f)are given by

image

for all  f ∈ Cb(X). The final formulation illustrates that the problem  φmθ,γ(f)is now reduced to a finite dimensional problem of finding the optimal parameters in a neural network. Further, the overall objective depends smoothly on the parameters, and the parameters are unconstrained. In short, problem  φmθ,γ(f)fits into the framework of machine learning problems that can be numerically solved by standard stochastic gradient descent based methods.

Under the standing Assumptions 3.1 and 3.2, the following lemma establishes situations when Condition (D), which is required for Proposition 2.3, is satisfied in the neural network setting.

Lemma 3.3.

image

(b) If  X = Rd = Rd1 × ... × RdJ0and  πj = prj, ej = 1for  j = 1, ..., J0 ≤ J, where  prjis the projection from  Rdto j-th marginal component  Rdj, then  H∞satisfies the second part of Condition (D). Further, the second part of Condition (D) is trivially satisfied whenever X is compact.

Notably, part (b) can be seen as a large, but still exemplary case. Intuitively, the second part of Condition (D) is satisfied whenever the space  H∞is rich enough.

Remark 3.4. Later in the numerics we will usually work with a ReLU activation function, i.e.  ϕ(x) =max{0, x}. While this does not satisfy the latter limit property of Assumption 3.2, this is easily amendable: Basically, throughout the whole theory the assumptions will only be used to guarantee existence of neural networks with certain properties. Given Assumption 3.2, we will only require two layers (l = 1) to obtain the necessary results. In the numerics however, we use more layers. If more layers are given, one can also bundle several layers and regard them as one layer, with a different activation function. For example:

image

Whenever  ϕis a mapping of the form  (x1, ..., xm) �→ (ϕ(x1), ..., ϕ(xm)), an (l+1)-layer network with activation function  ϕcan represent any function that a two layer network with activation function ϕcan represent. For  ϕ(x) = max{0, x}one can easily see that  ϕ(x) = min{1, max{0, x}}is feasible, which satisfies Assumption 3.2.

3.3 Convergence

In this section we study in what sense  φmθ,γ(f)converges to  φ(f)for the approximation by neural networks.

First, we study the case of uniform convergence in m and  γ, i.e. conditions for the convergence φmθ,γ(f) → φ(f)for  min{m, γ} → ∞. This is subject of Remark 3.5 below, which is a summary of results established in Section 2 and Section 3.2. The two approximative steps leading to uniform convergence are  φθ,γ(f) → φ(f)for  γ → ∞and  φm(f) → φ(f)for  m → ∞.

On the other hand, sometimes the convergence  φθ,γ(f) → φ(f)for  γ → ∞is not satisfied even though practically one obtains a good approximation. One such case is given in Example 3.6. Even if uniform convergence does not hold, one can still often connect problems  φmθ,γ(f)and  φ(f). This is done by the approximative steps  φm(f) → φ(f)for  m → ∞and  φmθ,γ(f) → φm(f)for  γ → ∞, where the latter is subject of Proposition 3.7. Here, instead of the strong assumption required for φθ,γ(f) → φ(f), the convergence  φmθ,γ(f) → φm(f)can be shown by assuming that all parameter sets of the neural networks are compact.

Remark 3.5. Under the assumptions of Lemma 3.3, Proposition 2.3 implies  φm(f) → φ(f)for m → ∞. Further, given the existence of  ε-optimizers for every  ε > 0as required in Theorem 2.2, convergence  φθ,γ(f) → φ(f)for  γ → ∞holds. Given both assumptions, Proposition 2.4 yields φmθ,γ(f) → φ(f)for  min{m, γ} → ∞. The convergence  φmθ,γ(f) → φθ,γ(f)for  m → ∞is a trivial consequence.

Example 3.6. Let  X = [0, 1]2, µ1 = µ2 = δ0and  f(x1, x2) = −|x1 − x2|. Let  Q = Π(µ1, µ2)be the set of all measures in X with first marginal  µ1and second marginal  µ2, so that

image

Obviously,  φ(f) = f(0, 0) = 0. Note that  Q = {µ1 ⊗ µ2}so that  µ0 = µ1 ⊗ µ2 = δ(0,0).

Consider two possible reference measures,  θ(1) = U([0, 1]2)being the uniform distribution on [0, 1]2, and  θ(2) = µ1 ⊗ µ2 = δ(0,0). For  θ(2)it is obvious that the existence of  ε-optimizers as required in Theorem 2.2 is given, since  θ(2)itself is the optimizer of  φ(f). Hence  φθ(2),γ(f) → φ(f)for  γ → ∞holds.

On the other hand, there does not exist  ν ∈ Π(µ1, µ2)with  ν ≪ θ(1), and hence  φθ(1),γ(f) =−∞. However, by first approximating  φ(f)by  φm(f), the functional becomes smoother: Roughly speaking, the marginal constraints are slightly relaxed. This becomes obvious when studying the dual formulations

image

While one easily finds sequences of functions in  Cb([0, 1])so that the values at 0 go to minus infinity but the penalty term stays bounded, this is impossible with functions in  Nl,1,m(Ξm)given that the activation function is continuous and the parameter sets are compact. So there is hope to establish the convergence  φmθ(1),γ(f) → φm(f)for  γ → ∞, which will indeed be a consequence of the following result.

Proposition 3.7. Fix  m ∈ N. Given that all parameter sets  Ξj,mfor j = 1, ..., J of the neural networks occurring in  Hmare compact and  θis strictly positive (i.e.  θgives positive mass to every non-empty open set), it holds  φmθ,γ(f) → φm(f)for  γ → ∞.

This section aims at showcasing how various frequently studied problems that fall into the theoretical framework of the previous sections can be implemented simply and effectively with neural networks. The examples focus on toy problems that allow an objective evaluation of the numerical results and give the reader an idea about the strengths and weaknesses of the presented approach. We chose a very basic implementation using Tensorflow and the Adam optimizer.2As for the network architecture: In all the examples, H is as described in Section 3, and  Nlk,malways approximates  Cb(Rd). To approximate  Cb(Rd)we use a five layer (l = 4 in the previous chapter) ReLU-network with hidden dimension  64·d. We did not perform a hyper parameter search to obtain this architecture, but rather oriented ourselves at papers with comparable settings (e.g. at [19, 30, 46]). Notably, increasing the complexity (number of layers or hidden dimension) further did not change the numerical results significantly in the cases tested, so we believe the structure chosen to be adequate for the problems considered.

Simply put, the implementation works as follows: We perform a normal stochastic gradient type optimization (outsourced to the Adam optimizer) for a certain number of iterations to find near optimal parameters of the network. At each iteration during this process, the expectations in the objective function are replaced by averages over a fixed number (called batch size) of random points from the respective distributions. To obtain the numerical approximation ˆφmθ,γ(f)of  φmθ,γ(f), we finally average the sample objective values over the last roughly 5% of iterations. This is referred to as the dual value. Alternatively, one can use formula (2.6) to obtain sample points from an approximate optimizer  ν∗of the primal problem and numerically evaluate�f dν∗, which is referred to as the primal value (more details on how to work with such an approximative optimizer  ν∗is given in Section 4.5). If not stated otherwise, all reported values are dual values.

The numerical procedure we use can likely be improved by fine-tuning parameters or by using more complex network architectures. For example batch normalization is applied in a related setting in [13] which appears to significantly speed up the optimization.

4.1 Optimal transport and Fréchet-Hoeffding bounds

With this first problem, we study the effects of different penalty functions, penalty factor, batch size and number of iterations of the Adam optimizer. Let  X = [0, 1]d, θ = U�[0, 1]d�(where  U(·)denotes the uniform distribution) and  Q = {ν ∈ P(X) : νi = U ([0, 1])}, where  νiis the i-th marginal of  ν. For some fixed  z ∈ [0, 1]d, define the function  f : [0, 1]d → R+by3

image

The value  φ(f) = supν∈Q�f dνcorresponds to the maximum value of a d-dimensional copula at point z. By the Fréchet-Hoeffding bounds we have an analytical solution to this problem, which is

image

In Figure 2 we observe how ˆφmθ,γ(f)depends on the number of iterations of the Adam optimizer and the batch size. We observe that while higher batch sizes lead to more stable convergence, the speed of convergence appears not strongly related to batch size. This suggests that increasing batch sizes might lead to both quick and finally stable performance.4Since  L2penalization appears more

image

Figure 2: Fréchet-Hoeffding bounds:  d = 2, z1 = 0.5, z2 = 0.75. Comparison of  L2penalty function  βγ(x) = γ max{0, x}2and exponential penalty function  βγ(x) = exp(γx−1)γ. The values plotted are running averages over the last 1000 iterations. The dotted red line is the true value  φ(f). The dotted blue lines are bounds from below for  φθ,γ(f)obtained by Equation (2.5) in Theorem 2.2 for the respective choices of  γ.

stable, we will mostly use this penalization for the rest of the applications. Further, the figure illustrates that the numerical solutions appear to approximately obtain the lower bounds for  φθ,γ(f)as given by Equation (2.5) in Theorem 2.2. I.e. one approximately has  φ(f) ≈ φmθ,γ(f)+ 1γ�β∗( dˆµdθ )dθwhere  ˆµis an optimizer of  φ(f).5

4.2 Multi-marginal optimal transport

The aim of this example is to compare the approach of this paper with existing methods for a numerically challenging problem. Let  X = (RD)M, where M denotes the number of marginals and D denotes the dimension of each marginal. Let  µifor i = 1, ..., M be K-mixtures of normal distributions with randomly chosen parameters, and define  Q = Π(µ1, ..., µM). For  p, q ≥ 1let

image

where we write  x = (xi,j) ∈ Xwith i = 1, ..., M, j = 1, ..., D. Note that for two marginals, one has −φ(f) = W pp,q(µ1, µ2), where  Wp,qis the Wasserstein-p distance with  Lqnorm on  Rd.

In Table 4.2 we compare optimal values to this problem arising from different algorithmic approaches. We compare linear programming methods based on discretization of the marginals, the neural network method presented in this paper, and a reproducing kernel Hilbert space (RKHS) approach as described in [28, Algorithm 3]. For the linear programming methods, we use a maximal number of variables of  106, which was around the boundary so that Gurobi [32] was

image

Table 1: Multi-marginal optimal transport: Numerical values for  −φ(f)arising from differ-ent numerical schemes. The numbers in brackets are empirical standard deviations over 100 runs. LP denotes linear programming, based on either sampling the marginals randomly (MC) or using the quantization approach from [42, Algorithm 4.5] to approximate marginals (quantization). The neural network (NN) implementation is based on  L2penalization with γ = M · D · 500. For the reproducing kernel Hilbert space solution (RKHS) as described in [28, Algorithm 3] we use a Laplace kernel and the same penalization as for the NN-method. For the final two rows, we report numerical values for  −φ( ˜f). DNC entries did not converge. The final column are analytical reference values given by the comonotone coupling for two marginals.

still able to solve the resulting linear program on our computer. Regarding the RKHS algorithm we have to mention that it is the only method that is not building on an established package like Tensorflow or Gurobi. Hence efficiency with respect to running time and tuning of hyperparameters might be far from optimal for this approach. Notably, switching from exponential penalization as used in [28] to  L2-penalization was already a slight improvement. For the precise specifications of each algorithm and the problem setting, we refer to the code on https://github.com/stephaneckstein/OT_Comparison.

Evaluating the results, we find that the neural network based method and the linear programming method with quantization appear to work best. Surprisingly, even for the case with 10 marginals (where the linear program can only use 4 points to approximate each marginal!), the quantization method achieved a similar value as the neural network method for  −φ(f). We believe the reason is that the function f is very smooth which a quantization approach can exploit. Hence we slightly changed f to ˜fin the final two test cases, which makes the function less regular. These are the only cases where the neural network solution and the quantization method strongly differ. In the final case, the quantization approach still has to approximate each marginal distribution with just 4 points, while the neural network method can use millions of points. From this standpoint, one can place higher trust in the neural network solution, even though we have no analytical reference value to compare it against.

Initially, we included a fourth method based on the approach in this paper, but with a polynomial basis instead of neural networks. This performed very badly however (at least when using a standard monomial basis), and hence we omitted the results in this table.

4.3 Martingale optimal transport

In martingale optimal transport, the optimal transport problem is extended by imposing a martingale constraint on top of marginal constraints. Dimensions are regarded as discrete time-steps and the measures in Q are distributions of discrete stochastic processes  (Xt)t=1,...,dwith fixed marginal distributions as well as the condition that the process is a martingale.

Here, we consider a simple example with d = 2, where an analytical solution is known. This example is taken from [2]. Let  X := [−1, 1] × [−2, 2], θ := U(X)and set

image

For  f = −|x−y|ρone gets  φ(f) = −1for all  ρ > 2. We implement this problem with  ρ = 2.3, where we use the  L2penalty function for different values of  γ. The results are shown in Figure 3. One can see that while for values of  γup to around 1280, the behavior of the optimal value is approximately as predicted by Equation (2.5) in Theorem 2.2, in that the error decreases by roughly a factor of two if  γis increased by a factor of two. For larger values of  γhowever, numerical instabilities occur and the optimizer cannot find the true optimum. This is indicated by the fact that the value ˆφmθ,γ(f)is above  φ(f) = −1.

4.4 Portfolio optimization

Consider a market with two assets, where the distribution of returns for each individual asset is given, but not the joint distribution. An investor wants to maximize his or her worst-case utility from investing into the two assets. Here, the utility of the investor is characterized by a mean-variance objective. While the mean is fully characterized by the marginal distributions, the worst

image

Figure 3: Martingale optimal transport: Mean numerical optimal values and 95% confidence bounds over 100 independent runs for different values of  γ (L2penalization). The network is trained for 20000 iterations with batch size 1024. The true optimal value of the unpenalized problem is -1.

case considers all possible variances of the portfolio, which depend on the joint distribution of the assets.

The following example is taken from [43]: Let  X = [0, 1] × [0, 2]. Let  θ1 = U([0, 1])and θ2 = U([0, 1]) ◦ ϕ−1, where  ϕ(x) = 2x2. Let  Q = {ν ∈ P(X) : ν1 = θ1, ν2 = θ2}. We will solve the following robust mean-variance portfolio optimization problem

image

where  λ ≥ 0is the risk aversion. The integral over the term inside the large brackets is the variance of the portfolio. For the analytical solution, see Example 1 of [43]. We implemented the above problem in two ways. For the first, we choose the reference measure  θ(1) = θ1⊗θ2. For the second, we use the reference measure  θ(2) = 0.5θ(1) + 0.5�U([0, 1]) ◦ (Id, ϕ)−1�,6i.e. half the product measure, half the perfectly correlated measure. The second version may correspond to our intuition that the optimal coupling should include positive correlation. More precisely: The choice of reference measure  θalways has the implicit objective to lead to narrow bounds in Equation (2.5) in Theorem 2.2. In this example, if one presumes that an optimal measure  ν∗ ∈ Qhas mass near the perfectly correlated diagonal, it makes sense to choose a reference measure which puts mass in this region, as does  θ(2).

image

Figure 4: Portfolio optimization under dependence uncertainty: As reference measure we take either the product measure or a positively correlated measure. We use  L2penalization with  γ = 160. The network is trained with batch size  213for 40000 iterations.

The results are reported in Figure 4. As expected, the second version yields results closer to the analytical solution.

4.5 Bounds on the distribution of a sum of dependent random variables

In this section, the objective is to find bounds for the probability  P(X1 +X2 +...+Xd ≥ s)for some s ∈ R, where the individual distributions of  Xiare known, but not their joint distribution. This problem is in strong relation to calculating worst- and best-case value at risks under dependence uncertainty, see also [20, 22, 44].

Let  X = Rd. For given marginals  µ1, ..., µd ∈ P(R)and  Q = Π(µ1, ..., µd), the problem statement is

image

For simplicity, we consider the case d = 2, and  µi = U([0, 1]). Let s = 1.9. Every optimal measure  ν ∈ Qgives mass 1/10 uniformly to the line section  {(x, 1.9 − x) : 0.9 ≤ x ≤ 1}, while the rest of the mass is irrelevant as long as the marginal condition is satisfied. This leads to an optimal value  φ(f) = 0.1. For the natural choice of reference measure  θ = U([0, 1]2), every optimal measure is singular with respect to  θ, and thus one can expect high errors by penalization. An implementation with this reference measure and  L2penalization with  γ = 320leads to ˆφmθ,γ(f) ≈ 0.0881.7

Updating the reference measure: To obtain a more accurate value, we make use of Equation (2.6) in Theorem 2.2. Recall that an optimizer  ˆν ∈ Qof  φθ,γ(f)is given by dˆνdθ = β′(f −ˆh), where  ˆhis the dual optimizer of  φθ,γ(f). Taking  ˆνas a reference measure instead of  θcan only reduce the error by penalization, since  φˆν,γ(f) ≥ φθ,γ(f)holds by convexity of  β∗γ. Implementing the problem with ˆνas a reference measure has to be done approximately, since the true optimizer ˆhis unknown and

image

Figure 5: Bounds on the distribution of the sum of dependent variables: Sampled points from the numerically optimal measure  ν∗and the corresponding empirical marginal distributions.

replaced by the numerical optimal solution. We denote the numerically obtained optimal measure by  ν∗.

To implement  φmν∗,γ(f)requires sampling points from  ν∗. This is non-trivial since  ν∗is only given by dν∗dθ. We implemented this by an acceptance-rejection method as described in [24]. This is very slow, as the number of rejections increases with the maximum of the Radon-Nikodym derivative. Sampling efficiently in such a situation is difficult, see e.g. [34] for an overview of existing methods and a proposed new one.

Figure 5 illustrates the optimal measure  ν∗. The measure  ν∗looks comparable to an optimal solution of  φ(f), while simultaneously being driven towards the reference measure  θ. One obtains ˆφmν∗,γ(f) ≈ 0.0982, which is close to the true optimal value 0.1.

In the following, we briefly discuss the rearrangement algorithm [22, 44] which is tailored to this type of problem. In contrast to the presented approach, which relies on sampling from the involved marginal distributions, the rearrangement algorithm mainly relies on the (inverse of) the cumulative distribution function of the marginals. The rearrangement algorithm achieves similar or even better accuracy in higher dimensional settings and with different marginals (e.g. Pareto marginals). The case of higher dimensions scales well with the approach taken here. However, the base time in low dimensions is higher than that of the rearrangement algorithm, and further heavy tailed marginals

like the Pareto distribution can lead to less accuracy.8

4.6 Generative Adversarial Networks (GANs)

The objective in GANs is to create new sample points from a measure  µ, of which only an empirical distribution  ˜µis known (see e.g. [3, 29]). Usually, the measure  µmight refer to the uniform distribution over some very large set of images. The set  ˜µis then just the uniform distribution over a small subset of these images. The goal is to sample new images that are not already present in the given subset, but that might plausibly have been samples from the measure  µ.

To proceed, we first take some latent probability measure  τ. The goal is to obtain a function G such that  µand the push-forward measure  τ ◦ G−1are close (in a sense to be specified), and thus the pseudo samples for  µcan be obtained by sampling from  τand applying G. To find such a function G out of a class of functions G, one can only use  ˜µinstead of  µ(thus  µdoes not enter the formal problem statement). The closeness of  ˜µand  τ ◦ G−1is measured by different distances in GANs. In the Wasserstein GAN, the first Wasserstein distance  W1(·, ·)(see e.g [51]) is used, and the objective is

image

The above can be generalized to arbitrary transport distances instead of the first Wasserstein distance. To put this into our setting, let G a set of functions that map into  X1. Let  X = X1 × X1and for  G ∈ Gdefine  QG := {ν ∈ P(X) : ν1 = τ ◦ G−1, ν2 = ˜µ}. For a cost function c, arbitrary transport type GANs can be expressed via

image

If c is a metric, the above corresponds to the Wasserstein GAN.

Since it is difficult to objectively evaluate GAN setups, we omit numerical results in this section. The interested reader can see the code on GitHub, where the toy problems appearing in [30] are implemented for different functions  c.9

5.1 Proof of Theorem 2.2

1) We first show (2.4) by verifying that  φθ,γis real-valued and continuous from above on  Cb(X). To that end, as shown in Appendix A one has  φ∗(µ) = suph∈H(�h dµ −�h dµ0)for all  µ ∈ P(X), so that

image

so that

image

for all  f ∈ Cb(X). This shows that  φθ,γis real-valued on  Cb(X). Further, let  (f k)be a sequence in  Cb(X)such that  f k ↓ f. For every  h ∈ Hthe monotone convergence theorem implies�βγ(f k −h) dθ →�βγ(f −h) dθ, so that  φθ,γ(f k) ↓ φθ,γ(f). Hence, it follows from the nonlinear Daniell-Stone theorem (see Proposition A.1) that

image

where the convex conjugate is given by

image

Indeed, the convex conjugate of the convolution  inff∈Cb(X){φ(f) + ψθ,γ(· − f)}is given as the sum of the convex conjugates  φ∗and  ψ∗θ,γ. By (5.1) one has  φ∗(µ) = 0if  µ ∈ Qand  φ∗(µ) = +∞otherwise. Moreover,

image

with the convention  −∞ + ∞ = +∞.

3) Let ˆh ∈ Hbe a minimizer of (2.3), i.e.  φµ,γ(f) =� ˆh dµ0+�βγ(f−ˆh) dθ. Defining  hλ := ˆh+λhfor an arbitrary  h ∈ H, the first order condition

image

This shows that the probability measure  ˆµwith Radon-Nikodým derivative dˆµdθ := β′γ(f −ˆh)satisfies �h dµ0 =�h dˆµfor all  h ∈ H, which in view of (5.1) satisfies  ˆµ ∈ Q. Integrating the identity βγ(x) = xβ′γ(x) − β∗γ(β′γ(x))with  x = f − ˆhw.r.t.  θ, one obtains

image

which shows that

image

As a consequence,  ˆµ ∈ Qis a maximizer of (2.4).

5.2 Proof of Proposition 2.3

Fix  f ∈ Cb(X). That  limm→∞ φm(f) = φ∞(f) ≥ φ(f)follows from the definition of  H∞. Moreover, for every  ε > 0the Condition (D) guarantees  h ∈ H∞and  K ⊆ Xsuch that  1Kc ≤ hand �h dµ0 ≤ ε. Hence,  φ∞(1Kc) ≤�h dµ0 ≤ εand Dini’s lemma implies that  φ∞is continuous from above on  Cb(X). By Proposition A.1 it follows that

image

Similar to (A.2) its convex conjugate is given by

image

It remains to show that for  h ∈ Hand  µ ∈ P(X)with�h dµ −�h dµ0 > 0there exists  h′ ∈ H∞such that�h′ dµ −�h′ dµ0 > 0. But this follows directly from the first part of Condition (D) for the probability measure 12µ + 12µ0. Indeed, there exists a sequence  (hn)in  H∞such that  hn → hin L1(µ)and in  L1(µ0), which shows that�hn dµ −�hn dµ0 > 0for n large enough.

5.3 Proof of Proposition 2.4

Observe that

image

where the first inequality uses that  βγis increasing, the second inequality just drops the constraint h ≥ f, and the third inequality follows from  Hm ⊆ H. Fix  ε > 0. By Condition (D) and Theorem 2.2 there exist  m0 ∈ Nand  γ0 > 0such that

image

for all  m ≥ m0and  γ ≥ γ0. This shows that

image

for all  m ≥ m0and  γ ≥ γ0, which shows that  φmθ,γ(f) → φ(f)whenever  min{m, γ} → ∞.

5.4 Proof of Lemma 3.3

(a) From Hornik [35] it follows that  Nlj,djis dense in  Cb(Rdj)with respect to  L1(ν)for every ν ∈ P(Rdj)and all j = 1, ..., J. By the triangle inequality and boundedness of  ej, the first part of Condition (D) follows.

(b) If X is compact, the condition is trivially satisfied. Hence assume that  X = Rd = Rd1 × ... ×RdJ0and  πj = prj, ej = 1for  j = 1, ..., J0 ≤ J, where  prjis the projection from  Rdto the j-th marginal component in  Rdj.

Let  ε > 0and  ν ∈ P(X). We first fix j, denote by  ν(j) := ν ◦ pr−1jand show that there exists a  h′′j ∈ Nlj,djsuch that  1Kcj ≤ h′′jand�Rdj h′′j dν(j) ≤ 2εfor some compact subset

Kjof  Rdj. Without loss of generality, assume that  lj = 1. This can always be done since the function  h′′jwill be compact-valued and hence for multiple layers, the remaining layers beyond the first can simply approximate the identity function in the supremum norm.10Fix Kj = [−c, +c]djsuch that  ν(j)(Kcj) ≤ ε/(4dj). By assumption on  ϕ, for each  i ∈ {1, . . . , dj}there exist  ai, bi, ai, bi ∈ Rsuch that

image

Now, define  h′′ := �dj=1 h′′j ◦ prj ∈ H∞and  K := �J0j=1 ˜Kj ⊂ X, which is compact. Then one immediately gets  1Kc ≤ h′′and�h′′dν ≤ 2J0ε.

5.5 Proof of Proposition 3.7

1) For one fix network  Nlj,dj,m, the mapping  ξ �→ Nlj,dj,m(ξ)is pointwise continuous, i.e. it holds for  ξn → ξthat  Nlj,dj,m(ξn)(x) → Nlj,dj,m(ξ)(x)for all  x ∈ Rdj, since  ϕis continuous. Further, since we assume that  ϕis bounded, the functions  Nlj,dj,m(ξn)are uniformly bounded and hence by dominated convergence one obtains  Nlj,dj,m(ξn) → Nlj,dj,m(ξ)in  L1(ν)for all  ν ∈ P(Rdj). By the triangle inequality, this continuity transfers to the mapping  (ξ1, ..., ξJ) �→ �Jj=1 ejNlj,dj,m(ξj) ◦ πj. Hence, we can write

image

On the other hand, let  (γn)be a sequence in  R+with  γn → ∞. Our goal is to show that φm(f) ≤ lim infn→∞ φmθ,γn(f). We assume that  lim infn→∞ φmθ,γn(f) < ∞otherwise there is nothing to prove. For every  n ∈ Nthere exist  An ∈ Amand  an ∈ Rsuch that

image

since  βγn(x) ≥ x + cfor all  n ∈ Nfor some constant  c ∈ R. In particular,  lim infn→∞ φmθ,γn(f)is real-valued as  A �→�f − η(A) + c dθis a continuous function on the compact  Am. By passing to a subsequence we may assume that  limn→∞ φmθ,γn(f) = lim infn→∞ φmθ,γn(f), and  An → A ∈ Amsuch that  η(An) → η(A)in  L1(θ)and  θ-a.s. as well as�η(An) dµ0 →�η(A) dµ0. We next show that (an)is bounded. Suppose by way of contradiction that  an → −∞. Since  limx→∞ β(x)/x = ∞and f − η(An)is uniformly bounded by compactness of  Am, it follows that

image

Moreover, in view of (5.2) the sequence

image

is uniformly integrable in  L1(θ). Hence, it follows from Fatou’s lemma that

image

which is the desired contradiction. This shows that  (an)is bounded and by passing to a subsequence an → a ∈ R. Finally it follows from Fatou’s lemma that

image

where  β∞(x) = 0if  x ≤ 0and  β∞(x) = ∞if x > 0. The second inequality follows because η(A) + a ≥ f θ-a.s. as a consequence of the first inequality,  f, η(A) ∈ Cb(X)and  θis strictly positive.

Let X be a Polish space. Given a measurable function  κ : X → [1, ∞), we denote by  Cκ(X)the Stone vector lattice of all continuous functions  f : X → Rsuch that  |f|/κis bounded. For instance, if  κis bounded one has  Cκ(X) = Cb(X), or if  κ(x) = 1 + |x|on  X = Rdthe space  Cκ(Rd)contains all continuous functions  f : Rd → Rof linear growth. Further, let  ca+κ (X)be the set of all Borel measures  µon X which satisfy�κ dµ < ∞. The following nonlinear version of the Daniell-Stone theorem follows directly from Proposition 1.1 in [15].

Proposition A.1. Let  φ: Cκ(X) → Rbe an increasing11convex functional which is continuous from above, i.e.  φ(f n) ↓ 0for every sequence  (f n)such that  f n ↓ 0. Then, it has the dual representation

image

where the convex conjugate  φ∗ : ca+κ (X) → R∪{+∞}is given by  φ∗(µ) = supf∈Cκ(X){�f dµ−φ(f)}.

Continuity from above is strongly related to the concept of tightness, which in the context of risk measures was introduced by Föllmer and Schied, see [25]. Typical examples include transport type problems where tightness is imposed by marginal constraints, see e.g. Bartl et al. [5]. For extensions of the representation (A.1) to upper semicontinuous functions and related pricing-hedging dualities we refer to Cheridito et al. [16].

As an application we consider the superhedging functional

image

on  Cκ(X), where  µ0 ∈ ca+κ (X)is a probability measure and  H ⊆ Cκ(X)is a convex cone such that κ ∈ H. Straightforward inspection shows that  φis a real-valued increasing convex functional on Cκ(X). Further, if  φis continuous from above by Proposition A.1 it has the dual representation (A.1). Its convex conjugate is given by

image

Since H is a convex cone which contains the constants it follows that  φ∗(µ) = 0whenever  µ ∈ ca+κ (X)is a probability measure such that�h dµ =�h dµ0for all  h ∈ H, and  φ∗(µ) = +∞else. In particular, in case that  Cκ(X) = Cb(X)we conclude the dual representation (2.2).

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] A. Alfonsi, J. Corbetta, and B. Jourdain. Sampling of probability measures in the convex order and approximation of martingale optimal transport problems. 2017.

[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[4] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.

[5] D. Bartl, P. Cheridito, M. Kupper, and L. Tangpi. Duality for increasing convex functionals with countably many marginal constraints. Banach Journal of Mathematical Analysis, 11(1):72–89, 2017.

[6] D. Bartl, M. Kupper, T. Lux, and A. Papapantoleon. Sharpness of improved Fréchet-Hoeffding bounds: an optimal transport approach. arXiv preprint arXiv:1709.00641, 2017.

[7] M. Beiglböck, P. Henry-Labordère, and F. Penkner. Model-independent bounds for option prices: A mass transport approach. Finance and Stochastics, 17(3):477–501, 2013.

[8] A. Ben-Tal and M. Teboulle. An old-new concept of convex risk measures: The optimized certainty equivalent. Mathematical Finance, 17(3):449–476, 2007.

[9] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111– A1138, 2015.

[10] C. Bernard, X. Jiang, and R. Wang. Risk aggregation with dependence uncertainty. Insurance: Mathematics and Economics, 54:93–108, 2014.

[11] C. Bernard, L. Rüschendorf, S. Vanduffel, and J. Yao. How robust is the value-at-risk of credit risk portfolios? The European Journal of Finance, 23(6):507–534, 2017.

[12] U. Bindini. Smoothing operators in multi-marginal optimal transport. arXiv preprint arXiv:1901.07407, 2019.

[13] H. Bühler, L. Gonon, J. Teichmann, and B. Wood. Deep hedging. arXiv preprint arXiv:1802.03042, 2018.

[14] G. Carlier, V. Duval, G. Peyré, and B. Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1385–1418, 2017.

[15] P. Cheridito, M. Kupper, and L. Tangpi. Representation of increasing convex functionals with countably additive measures. arXiv preprint arXiv:1502.05763, 2015.

[16] P. Cheridito, M. Kupper, and L. Tangpi. Duality formulas for robust pricing and hedging in discrete time. SIAM Journal on Financial Mathematics, 8(1):738–765, 2017.

[17] R. Cominetti and J. San Martín. Asymptotic analysis of the exponential penalty trajectory in linear programming. Mathematical Programming, 67(1-3):169–187, 1994.

[18] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.

[19] W. E, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.

[20] S. Eckstein, M. Kupper, and M. Pohl. Robust risk aggregation with neural networks. arXiv preprint arXiv:1811.00304, 2018.

[21] I. Ekren and H. M. Soner. Constrained optimal transport. Archive for Rational Mechanics and Analysis, pages 1–37, 2017.

[22] P. Embrechts, G. Puccetti, and L. Rüschendorf. Model uncertainty and VaR aggregation. Journal of Banking & Finance, 37(8):2750–2764, 2013.

[23] S. Feizi, C. Suh, F. Xia, and D. Tse. Understanding GANs: the LQG setting. arXiv preprint arXiv:1710.10793, 2017.

[24] B. D. Flury. Acceptance–rejection sampling made easy. SIAM Review, 32(3):474–476, 1990.

[25] H. Föllmer and A. Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, 2011.

[26] A. Galichon, P. Henry-Labordere, N. Touzi, et al. A stochastic control approach to no-arbitrage bounds given marginals, with an application to lookback options. The Annals of Applied Probability, 24(1):312–336, 2014.

[27] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3440–3448. Curran Associates, Inc., 2016.

[28] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3440–3448, 2016.

[29] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[30] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

[31] G. Guo and J. Obloj. Computational methods for martingale optimal transport problems. arXiv preprint arXiv:1710.07911, 2017.

[32] L. Gurobi Optimization. Gurobi optimizer reference manual, 2018.

[33] P. Henry-Labordère. Automated option pricing: Numerical methods. International Journal of Theoretical and Applied Finance, 16(08):1350042, 2013.

[34] F. Horger, T. Würfl, V. Christlein, and A. Maier. Deep learning for sampling from arbitrary probability distributions. arXiv preprint arXiv:1801.04211, 2018.

[35] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.

[36] L. V. Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk SSSR, volume 37, pages 199–201, 1942.

[37] H. G. Kellerer. Duality theorems for marginal problems.  Zeitschrift für Wahrscheinlichkeits-theorie und verwandte Gebiete, 67(4):399–432, 1984.

[38] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[39] C. Léonard. A survey of the Schrödinger problem and some of its connections with optimal

transport. Discrete & Continuous Dynamical Systems - A, 34(4):1533–1574, 2014.

[40] T. Lux, A. Papapantoleon, et al. Improved Fréchet–Hoeffding bounds on d-copulas and applications in model-free finance. The Annals of Applied Probability, 27(6):3633–3671, 2017.

[41] S. Peng. G-expectation, G-Brownian motion and related stochastic calculus of Itô type. In Stochastic analysis and applications, pages 541–567. Springer, 2007.

[42] G. C. Pflug and A. Pichler. Multistage stochastic optimization. Springer, 2014.

[43] G. C. Pflug and M. Pohl. A review on ambiguity in stochastic portfolio optimization. Set-Valued and Variational Analysis, pages 1–25, 2017.

[44] G. Puccetti and L. Rüschendorf. Computation of sharp bounds on the distribution of a function of dependent risks. Journal of Computational and Applied Mathematics, 236(7):1833–1840, 2012.

[45] B. Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport problems. arXiv preprint arXiv:1610.06519, 2016.

[46] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large-scale optimal transport and mapping estimation. arXiv preprint arXiv:1711.02283, 2017.

[47] S. L. Smith, P.-J. Kindermans, and Q. V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.

[48] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.

[49] V. Strassen. The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):423–439, 1965.

[50] S. Vallender. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications, 18(4):784–786, 1974.

[51] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.


Designed for Accessibility and to further Open Science