b

DiscoverSearch
About
My stuff
Multi-objective Reinforcement Learning with Continuous Pareto Frontier Approximation Supplementary Material
2014·arXiv
Abstract
Abstract

This document contains supplementary material for the paper “Multi–objective Reinforcement Learning with Continuous Pareto Frontier Approximation”, published at the Twenty–Ninth AAAI Conference on Artificial Intelligence (AAAI-15). The paper is about learning a continuous approximation of the Pareto frontier in Multi-Objective Markov Decision Problems (MOMDPs). We propose a policy-based approach that exploits gradient information to generate solutions close to the Pareto ones. Differently from previous policy-gradient multi-objective algorithms, where n optimization routines are use to have n solutions, our approach performs a single gradient-ascent run that at each step generates an improved continuous approximation of the Pareto frontier. The idea is to exploit a gradient-based approach to optimize the parameters of a function that defines a manifold in the policy parameter space so that the corresponding image in the objective space gets as close as possible to the Pareto frontier. Besides deriving how to compute and estimate such gradient, we will also discuss the non-trivial issue of defining a metric to assess the quality of the candidate Pareto frontiers. Finally, the properties of the proposed approach are empirically evaluated on two interesting MOMDPs.

The paper “Multi–objective Reinforcement Learning with Continuous Pareto Frontier Approximation” has been published at the Twenty–Ninth AAAI Conference on Artificial Intelligence (AAAI-15). This supplement follows the same structure of the main article. For each section we report the complete set of proofs and some additional details.

In this section we provide the proof of an extended version of Theorem 1.

Theorem 1. Let T be an open set in  Rb, let  Fρ (T) be a manifold parametrized by a smooth map expressed as composition of maps J and  φρ, (J ◦ φρ : T → Rq). Given a continuous function I defined at each point of  Fρ(T), the integral w.r.t. the volume is given by

image

The associated gradient w.r.t. the map parameters  ρis given component–wise by

image

where  T = DθJ(θ)Dtφρ(t),  ⊗is the Kronecker product,  Nb= 12(Ib2 + Kbb) is a symmetric (b2 × b2) idempotent matrix with rank 12b(b+ 1) and  Kbbis a permutation matrix [1]. Note that

image

Proof. The equation of the performance measure  J(ρ) follows directly from the definition of volume integral of a manifold [2] and the definition of function composition. In the following we give a detailed derivation of the i–th component of the gradient. Let  T = DθJ(θt)Dtφρ(t), then

image

where the pedix t is used to denote the direct or indirect dependence on variable t. While the loss derivative and the determinant derivative can be respectively expanded as

image

where

image

and  ⊗is the Kronecker product,  Nb= 12(Ib2 + Kbb) is a symmetric (b2 × b2) idempotent matrix with rank 12b(b+ 1) and  Kbbis a permutation matrix [1]. The last term to be expanded is  DρiT:= ∂vec (T)∂ρi. We star from a basic property of the differential

image

then, applying the vector operator,

image

Finally, the derivative is given by

image

Note that  Dθ (DθJ(θ)) = ∂vec DθJ(θ)∂θTdo not denote the Hessian matrix. In fact, the Hessian matrix is defined as the derivative of the transpose Jacobian, that is,  HθJ(θ) =  Dθ(DθJ(θ))T. The following equation relates the Hessian matrix to  Dθ (DθJ(θ)):

image

where  p = i + q(m −1), where q is the number of rows of the Jacobian matrix.

Theorem 2. For any MOMDP, the Hessian  HθJ(θ) of the expected discounted reward J w.r.t. the policy parameters  θis a (qd × d) matrix obtained by stacking the Hessian of each component

image

where

image

Proof. The Hessian equation follows form the definition of the gradient  ∇θJ(θ)

image

the log trick and the property that the reward of a trajectory  τis independent from the policy parametrization. Let outline the derivation of the Hessian matrix.

image

Then

image

Recall that, since the probability of trajectory  τunder policy  πθis given by

image

the following equations hold

image

Lemma 4. Given a parametrized policy  π(a|s, θ), under the assumption Assumption 3, the i–th component of the log–Hessian of the expected return can be bounded by

image

Proof. Consider the definition of the Hessian in Equation (1). Under assumption 3, the Hessian components can be bounded by (∀m, n)

image

Theorem 5. Given a parametrized policy  π(a|s, θ), under the assumption Assumption 3, using the following number of H–step trajectories

image

the gradient estimate �HθJi(θ) generated by Equation (1) is such that with probability 1  − δ: ��� �HθJi(θ) − HθJi(θ)���max ≤ ϵi.

Proof. Hoeffding’s inequality implies that,  ∀m, n

image

Solving the equation for N, notice that Lemma 4 provides a bound on each samples, we obtain:

image

In this section we present the most relevant experiments conducted on two domains (a Linear-Quadratic Gaussian regulator and a water reservoir) in order to study the behavior of PMGA algorithm with the different loss functions I proposed in the paper. We show the frontiers obtained with the loss functions described in the paper and in addition we present a normalization that takes into account the area of the approximate Pareto frontier. The area  A(ρ) of a manifold is defined as the volume integral of the unitary function:

image

In the following we propose two different type of normalization:

Using the area  A(ρ) of the frontier  F(ρ) as normalization factor:  In = IA(ρ)−β,

Using a convex combination of both the area and the loss function:  In = w1I + w2A(ρ) with w1 + w2= 1.

The idea of these normalizations is that the loss function I should guarantee the accuracy of the solutions obtained (i.e., only non-dominated solutions), while the area  A(ρ) should provide a complete and uniform covering of the frontier.

In all the following experiments the learning rate was hand-tuned.

2.1 Linear-Quadratic Gaussian regulator

The first case of study is a discrete-time Linear-Quadratic Gaussian regulator (LQG) with multidimensional and continuous state and action spaces [3]. The LQG problem is defined by the following dynamics

image

where  stand  atare n-dimensional column vector (n = m), A, B, Q, R ∈ Rn×n, Qis a symmetric semidefinite matrix and R is a symmetric positive definite matrix. Dynamics are not coupled, that is, A and B are identity matrices. The policy is Gaussian with parameters  θ = vec(K), where  K ∈ Rn×n. Finally, a constant covariance matrix Σ = I has been chosen.

The LQG can be easily extended to account for multi-conflicting objectives. In particular, the problem of minimizing the distance from the origin w.r.t. the i-th axis has been taken into account, considering the cost of the action over the other axes

image

image

Figure 1: Learning processes for the 2-objectives LQG (numbers denote the iteration) obtained through PMGA. In Figures 1(b) and 1(a)  I3is used, respectively with and without forcing the parametrization to pass through extrema. Figure 1(c) shows iterations with  I1(J, pau).

image

Figure 2:  J(ρ) trends with different loss functions for the 2-objectives LQG.

Since the maximization of the i-th objective requires to have null action on the other axes, objectives are conflicting.

As this reward formulation violates the positiveness of matrix  Ri, we change the reward adding an ξ-perturbation

image

2.1.1 2-objectives case results

We first present the results obtained using PMGA algorithm and a parametrization that is not forced to pass through the extrema of the frontier. It is the one presented in the paper and it only limits  θiin the interval [−1,0]:

image

In this case using  I1and  I2the algorithm was not able to learn a good approximation of the Pareto– frontier in terms of accuracy and covering. Using utopia point as reference point for  I1(i.e.,  I1(J, pu)) the frontier learned collapses in one point on the knee of the front. The same behaviour occurs using  I2. Using antiutopia point as reference point for  I1(i.e.,  I1(J, pau)) the solutions returned are dominated and the frontier gets wider and tends to diverge from the true frontier expanding on the opposite half

image

Figure 3: Different views of the frontier obtained by PMGA using  I2and  I1(J, pu) for the 3-objective LGQ.

image

Figure 4: Different views of the frontier obtained by PMGA using normalized  I1(J, pu) for the 3-objective LQG.

space (Figure 2(b) shows the divergent trend of  J(ρ)). These behaviours are not unexpected, considering the definition of the loss functions, as explained in the section of the paper devoted to metrics.

The only loss function able to learn with this parametrization was  I3. Figure 1(a) (presented in the paper) shows a few iterations of the learning process using  λ= 2.5 and starting from  ρ0= [1 2 0 3]T(the algorithm was also able to learn starting from different  ρ0). Figure 2(a) shows the indicator  J(ρ) as function of the iterations. It is possible to notice that it converges to a constant value.

Other experiments were conducted using a different parametrization, forced PMGA approximation to pass through the extrema of the frontier:

image

In this case, besides  I3, also  I2proved to be an effective loss function and they both returned an accurate and wide approximation of the Pareto frontier (Figure 1(b), also presented in the paper, shows the learning process starting from  ρ0= [2 2]T).

I2(J, pau) has still the same behaviour discussed before and the approximate frontier diverges from the true one (Figure 1(c)). This problem can be solved using the first normalization with  β= 0.9 (lower βare not enough to correct the behaviour of the loss function, while using higher  βthe frontier returned is shorter and tends to be a line between the extreme points).

image

Figure 5: Approximations of the Pareto frontier obtained by PMGA using  I1(J, pau) for the 3-objective LQG.

I1(J, pu) has a similar behaviour, as the algorithm returns almost a line between the extreme points in order to reduce the frontier length. Using the first normalization with  β = −1.8 such behaviour disappears and the frontier obtained has accurate solutions and guarantees a complete covering of the true Pareto frontier (its performance are thesame as  I2and  I3).

The second normalization, instead, has a critical problem because of the different magnitudo between the loss function and the area of the frontier, and therefore is difficult to properly choose w. A solution could be to ignore the constraint  w1 + w2= 1, but in the 2-objectives case there is such a difference in the magnutudo that we were not able to find a suitable w.

2.1.2 3-objectives case results

We used a parametrization forced to pass through the extrema of the frontier and that limits  θiin the interval [−1,0]:

image

The initial  ρ0is set to 0. Figure 3 shows the frontiers obtained using  I1(J, pu), with and without normalization. We can clearly

see that solutions tend to concentrate to the center of the frontier, in order to minimize the distance from

the utopia point and the area of the frontier. Normalization is not able to correct this behaviour and the only result is to bump the frontier, slightly increasing its area. This effect seems to be indipendent from the normalization used and from the parameters w and  β(we tried with 1  < β <6 and 10 convex combinations uniformly spaced of w). Loss function  I2has the same behaviour and the frontiers obtained were very similar. Figures 5(a) and 5(b) show the frontier obtained with  I1(J, pau), with and without normalization.

As expected, without normalization (Figure 5(a)) the algorithm tried to produce a frontier as wide as

possible, in order to increase the distance from the antiutopia point. This behaviour led to dominated solutions and the learning process does not converge. Using the first normalization with  β= 2 we were able to correct this behaviour, but the algorithm is still not able to cover the frontier completely (Figure 5(b)). Using smaller  βthe frontier was still too wide and contained dominated solutions, while higher  βled to smaller ones. The second normalization instead was ineffective. This is due, again, to the different magnitudo between the loss function and the area of the frontier, that makes the choice of w critical. Finally Figure 6(a) shows the frontier obtained using  I3with  λ= 135. As expected, this loss function

proved to be the best among the three, returning a good approximation of the Pareto frontier in terms

of accuracy and covering, without using any normalization. Figure 6(b) shows the Pareto frontier in the parameter space.

image

Figure 6: Approximation of the Pareto frontier obtained by PMGA using  I3for the 3-objective LQG.

A water reservoir can be modelled as a MOMDP with a continuous state variable s representing the water volume stored in the reservoir, a continuous action a that controls the water release, a state-transition model that depends also on the stochastic reservoir inflow  ϵ, and a set of conflicting objectives. For a complete description of the problem, the reader can refer to [4].

In this work we consider two objectives: flooding along the lake shores and irrigation supply. The immediate rewards are defined by

image

where  ht+1= st+1/Sis the reservoir level (in the following experiments S = 1), ¯h is the flooding threshold (¯h = 50),  ϱt= max(at, min(¯at, at)) is the release from the reservoir and ¯ϱis the water demand (¯ϱ= 50). R1denotes the negative of the cost due to the flooding excess level and  R2is the negative of the deficit in the water supply.

Like in the original work, the discount factor is set to 1 for all the objectives and initial state is drawn from a finite set. However, different settings are used for learning and evaluation. In the learning phase 100 episodes by 100 steps are used (like in the original work), while the evaluation phase exploits 100, 000 episodes by 100 steps.

Since the problem is continuous we exploit a Gaussian policy model

image

where  ν : S → Rdare the basis functions and  d = |θ|. Since the optimal policies for the objectives are not linear in the state variable, a radial basis approximation is used:  ν(s) =�e−��s − ci��2/wi�di=1 ,where the centres  ciare placed at 0, 50, 120 and 160, and the widths are 50, 20, 40 and 50.

3.0.3 Results

We used the following parametrization, forced to pass near the estreme points of the Pareto frontier:

image

A constant variance  σ= 0.1 has been chosen.

image

Figure 7: Initial and final frontiers for a learning process with  I1(J, pu).

image

Figure 8: Results for the water reservoir domain. Using utopia-based loss function  J(ρ) trend is convergent (on the right) and the frontier returned is comparable to the ones obtained with state-of-the-art algorithms.

In order to show the capability of the approximate algorithm we have decided to test the simplest metric, that is, the utopia–based indicator. We start the learning from an arbitrary parametrization ρ0 = −20. Figure 7 reports the initial and the final frontiers obtained with out algorithm. We can notice that, even starting far from the true Pareto frontier, out algorithm is able to approach it, increasing covering and accuracy of the approximate frontier.

Figure 8(a) reports the final frontier obtained with different algorithms. The approximation obtained by our algorithm is comparable to the other results, however, our approach is able to produce a continuous frontier approximation.

It is important to notice that, due to the fact that the transition function of the domain limits the action in the range of admissible values (at ∈ [at, ¯at]), there are infinite policies with equal performance that allow the agent to release more than the reservoir level or less than zero. To overcome this problem [5] introduces a penalty term p in the reward (p = −max(at − ¯at, at − at)) during the learning phase. With our approach this modification was unnecessary, as the algorithm is able to learn without the penalty. We also tried adding it during the learning phase, but the frontier returned was exactly the same.

In this Section we want to examine more deeply the tuning of mixed metric parameters, in order to provide the reader better insights for a correct use of such metric. PMGA performance, indeed, strongly depends on the indicator used and, thereby, their setting is critical. To be more precise, mixed metric, which obtained the best approximate Pareto–frontiers in the experiments, includes a trade-off between accuracy and covering, expressed by some parameters.

image

Figure 9: Approximate frontiers learned by PMGA using  I3on varying  λ. Figure (a) has dominated solutions and (c) is not wide enough. On the contrary, (b) achieves both accuracy and covering.

The indicator we are going to analyze is

image

where w(J) is a penalization term, i.e., it is a monotonic function that decreases as  I2(J) increases. In the previous Sections we proposed w(J) = 1  − λI2(J). In this way we take advantage of the expansive behavior of the antiutopia–based indicator and the accuracy of the optimality–based indicator  I2. In this Section we are going to study the performance of this metric on varying  λ, proposing a simple tuning process. The idea is to set  λto an initial value (for example 1) and then increase (or dicrease) it if the approximate frontier contains dominated solutions (or is not large enough). Figure 9 shows different approximate frontiers obtained with different  λ. Starting with  λ= 1 the indicator behaves like I1(J, pAU), meaning that  λwas too small. Using  λ= 1.5 (Figure 9(a)) the algorithm converges but the approximate frontier still contains dominated solutions. Increasing  λto 1.5 (Figure 9(b)) dominated solutions disappear. Finally, with  λ= 2.5 (Figure 9(c)) the approximate frontier becomes shorter and Pareto–optimal solutions are discarded, meaning that we increased  λtoo much.

[1] J.R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley Ser. Probab. Statist.: Texts and References Section. Wiley, 1999.

[2] J.R. Munkres. Analysis On Manifolds. Adv. Books Classics Series. Westview Press, 1997.

[3] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

[4] Francesca Pianosi, Andrea Castelletti, and Marcello Restelli. Tree-based fitted q-iteration for multi- objective markov decision processes in water resource management. Journal of Hydroinformatics, 15(2):258–270, 2013.

[5] Simone Parisi, Matteo Pirotta, Nicola Smacchia, Luca Bascetta, and Marcello Restelli. Policy gradient approaches for multi-objective sequential decision making. In IJCNN 2014, Beijing, China, July 6-11, 2014, pages 1–7. IEEE, 2014.


Designed for Accessibility and to further Open Science