Multi-objective Reinforcement Learning with Continuous Pareto Frontier Approximation Supplementary Material

2014·Arxiv

Abstract

Abstract

This document contains supplementary material for the paper “Multi–objective Reinforcement Learning with Continuous Pareto Frontier Approximation”, published at the Twenty–Ninth AAAI Conference on Artificial Intelligence (AAAI-15). The paper is about learning a continuous approximation of the Pareto frontier in Multi-Objective Markov Decision Problems (MOMDPs). We propose a policy-based approach that exploits gradient information to generate solutions close to the Pareto ones. Differently from previous policy-gradient multi-objective algorithms, where n optimization routines are use to have n solutions, our approach performs a single gradient-ascent run that at each step generates an improved continuous approximation of the Pareto frontier. The idea is to exploit a gradient-based approach to optimize the parameters of a function that defines a manifold in the policy parameter space so that the corresponding image in the objective space gets as close as possible to the Pareto frontier. Besides deriving how to compute and estimate such gradient, we will also discuss the non-trivial issue of defining a metric to assess the quality of the candidate Pareto frontiers. Finally, the properties of the proposed approach are empirically evaluated on two interesting MOMDPs.

The paper “Multi–objective Reinforcement Learning with Continuous Pareto Frontier Approximation” has been published at the Twenty–Ninth AAAI Conference on Artificial Intelligence (AAAI-15). This supplement follows the same structure of the main article. For each section we report the complete set of proofs and some additional details.

1 Reparametrization

In this section we provide the proof of an extended version of Theorem 1.

Theorem 1. Let T be an open set in , let ) be a manifold parametrized by a smooth map expressed as composition of maps J and , (). Given a continuous function I defined at each point of ), the integral w.r.t. the volume is given by

The associated gradient w.r.t. the map parameters is given component–wise by

where ), is the Kronecker product, = () is a symmetric () idempotent matrix with rank + 1) and is a permutation matrix [1]. Note that

Proof. The equation of the performance measure ) follows directly from the definition of volume integral of a manifold [2] and the definition of function composition. In the following we give a detailed derivation of the i–th component of the gradient. Let ), then

where the pedix t is used to denote the direct or indirect dependence on variable t. While the loss derivative and the determinant derivative can be respectively expanded as

where

and is the Kronecker product, = () is a symmetric () idempotent matrix with rank + 1) and is a permutation matrix [1]. The last term to be expanded is := . We star from a basic property of the differential

then, applying the vector operator,

Finally, the derivative is given by

Note that )) = do not denote the Hessian matrix. In fact, the Hessian matrix is defined as the derivative of the transpose Jacobian, that is, ) = . The following equation relates the Hessian matrix to )):

where 1), where q is the number of rows of the Jacobian matrix.

Theorem 2. For any MOMDP, the Hessian ) of the expected discounted reward J w.r.t. the policy parameters is a () matrix obtained by stacking the Hessian of each component

where

Proof. The Hessian equation follows form the definition of the gradient )

the log trick and the property that the reward of a trajectory is independent from the policy parametrization. Let outline the derivation of the Hessian matrix.

Then

Recall that, since the probability of trajectory under policy is given by

the following equations hold

Lemma 4. Given a parametrized policy ), under the assumption Assumption 3, the i–th component of the log–Hessian of the expected return can be bounded by

Proof. Consider the definition of the Hessian in Equation (1). Under assumption 3, the Hessian components can be bounded by ()

Theorem 5. Given a parametrized policy ), under the assumption Assumption 3, using the following number of H–step trajectories

the gradient estimate ) generated by Equation (1) is such that with probability 1 :

Proof. Hoeffding’s inequality implies that,

Solving the equation for N, notice that Lemma 4 provides a bound on each samples, we obtain:

2 Experiments

In this section we present the most relevant experiments conducted on two domains (a Linear-Quadratic Gaussian regulator and a water reservoir) in order to study the behavior of PMGA algorithm with the different loss functions I proposed in the paper. We show the frontiers obtained with the loss functions described in the paper and in addition we present a normalization that takes into account the area of the approximate Pareto frontier. The area ) of a manifold is defined as the volume integral of the unitary function:

In the following we propose two different type of normalization:

• Using the area ) of the frontier ) as normalization factor: ,

• Using a convex combination of both the area and the loss function: ) with = 1.

The idea of these normalizations is that the loss function I should guarantee the accuracy of the solutions obtained (i.e., only non-dominated solutions), while the area ) should provide a complete and uniform covering of the frontier.

In all the following experiments the learning rate was hand-tuned.

2.1 Linear-Quadratic Gaussian regulator

The first case of study is a discrete-time Linear-Quadratic Gaussian regulator (LQG) with multidimensional and continuous state and action spaces [3]. The LQG problem is defined by the following dynamics

where and are n-dimensional column vector (is a symmetric semidefinite matrix and R is a symmetric positive definite matrix. Dynamics are not coupled, that is, A and B are identity matrices. The policy is Gaussian with parameters ), where . Finally, a constant covariance matrix Σ = I has been chosen.

The LQG can be easily extended to account for multi-conflicting objectives. In particular, the problem of minimizing the distance from the origin w.r.t. the i-th axis has been taken into account, considering the cost of the action over the other axes

Figure 1: Learning processes for the 2-objectives LQG (numbers denote the iteration) obtained through PMGA. In Figures 1(b) and 1(a) is used, respectively with and without forcing the parametrization to pass through extrema. Figure 1(c) shows iterations with ).

Figure 2: ) trends with different loss functions for the 2-objectives LQG.

Since the maximization of the i-th objective requires to have null action on the other axes, objectives are conflicting.

As this reward formulation violates the positiveness of matrix , we change the reward adding an -perturbation

2.1.1 2-objectives case results

We first present the results obtained using PMGA algorithm and a parametrization that is not forced to pass through the extrema of the frontier. It is the one presented in the paper and it only limits in the interval [0]:

In this case using and the algorithm was not able to learn a good approximation of the Pareto– frontier in terms of accuracy and covering. Using utopia point as reference point for (i.e., )) the frontier learned collapses in one point on the knee of the front. The same behaviour occurs using . Using antiutopia point as reference point for (i.e., )) the solutions returned are dominated and the frontier gets wider and tends to diverge from the true frontier expanding on the opposite half

Figure 3: Different views of the frontier obtained by PMGA using and ) for the 3-objective LGQ.

Figure 4: Different views of the frontier obtained by PMGA using normalized ) for the 3-objective LQG.

space (Figure 2(b) shows the divergent trend of )). These behaviours are not unexpected, considering the definition of the loss functions, as explained in the section of the paper devoted to metrics.

The only loss function able to learn with this parametrization was . Figure 1(a) (presented in the paper) shows a few iterations of the learning process using = 2.5 and starting from = [1 2 0 3](the algorithm was also able to learn starting from different ). Figure 2(a) shows the indicator ) as function of the iterations. It is possible to notice that it converges to a constant value.

Other experiments were conducted using a different parametrization, forced PMGA approximation to pass through the extrema of the frontier:

In this case, besides , also proved to be an effective loss function and they both returned an accurate and wide approximation of the Pareto frontier (Figure 1(b), also presented in the paper, shows the learning process starting from = [2 2]).

) has still the same behaviour discussed before and the approximate frontier diverges from the true one (Figure 1(c)). This problem can be solved using the first normalization with = 0.9 (lower are not enough to correct the behaviour of the loss function, while using higher the frontier returned is shorter and tends to be a line between the extreme points).

Figure 5: Approximations of the Pareto frontier obtained by PMGA using ) for the 3-objective LQG.

) has a similar behaviour, as the algorithm returns almost a line between the extreme points in order to reduce the frontier length. Using the first normalization with 8 such behaviour disappears and the frontier obtained has accurate solutions and guarantees a complete covering of the true Pareto frontier (its performance are thesame as and ).

The second normalization, instead, has a critical problem because of the different magnitudo between the loss function and the area of the frontier, and therefore is difficult to properly choose w. A solution could be to ignore the constraint = 1, but in the 2-objectives case there is such a difference in the magnutudo that we were not able to find a suitable w.

2.1.2 3-objectives case results

We used a parametrization forced to pass through the extrema of the frontier and that limits in the interval [0]:

The initial is set to 0. Figure 3 shows the frontiers obtained using ), with and without normalization. We can clearly

see that solutions tend to concentrate to the center of the frontier, in order to minimize the distance from

the utopia point and the area of the frontier. Normalization is not able to correct this behaviour and the only result is to bump the frontier, slightly increasing its area. This effect seems to be indipendent from the normalization used and from the parameters w and (we tried with 1 6 and 10 convex combinations uniformly spaced of w). Loss function has the same behaviour and the frontiers obtained were very similar. Figures 5(a) and 5(b) show the frontier obtained with ), with and without normalization.

As expected, without normalization (Figure 5(a)) the algorithm tried to produce a frontier as wide as

possible, in order to increase the distance from the antiutopia point. This behaviour led to dominated solutions and the learning process does not converge. Using the first normalization with = 2 we were able to correct this behaviour, but the algorithm is still not able to cover the frontier completely (Figure 5(b)). Using smaller the frontier was still too wide and contained dominated solutions, while higher led to smaller ones. The second normalization instead was ineffective. This is due, again, to the different magnitudo between the loss function and the area of the frontier, that makes the choice of w critical. Finally Figure 6(a) shows the frontier obtained using with = 135. As expected, this loss function

proved to be the best among the three, returning a good approximation of the Pareto frontier in terms

of accuracy and covering, without using any normalization. Figure 6(b) shows the Pareto frontier in the parameter space.

Figure 6: Approximation of the Pareto frontier obtained by PMGA using for the 3-objective LQG.

3 Water Reservoir

A water reservoir can be modelled as a MOMDP with a continuous state variable s representing the water volume stored in the reservoir, a continuous action a that controls the water release, a state-transition model that depends also on the stochastic reservoir inflow , and a set of conflicting objectives. For a complete description of the problem, the reader can refer to [4].

In this work we consider two objectives: flooding along the lake shores and irrigation supply. The immediate rewards are defined by

where = is the reservoir level (in the following experiments S = 1), ¯h is the flooding threshold (¯h = 50), = max()) is the release from the reservoir and ¯is the water demand (¯= 50). denotes the negative of the cost due to the flooding excess level and is the negative of the deficit in the water supply.

Like in the original work, the discount factor is set to 1 for all the objectives and initial state is drawn from a finite set. However, different settings are used for learning and evaluation. In the learning phase 100 episodes by 100 steps are used (like in the original work), while the evaluation phase exploits 100, 000 episodes by 100 steps.

Since the problem is continuous we exploit a Gaussian policy model

where are the basis functions and . Since the optimal policies for the objectives are not linear in the state variable, a radial basis approximation is used: ) =where the centres are placed at 0, 50, 120 and 160, and the widths are 50, 20, 40 and 50.

3.0.3 Results

We used the following parametrization, forced to pass near the estreme points of the Pareto frontier:

A constant variance = 0.1 has been chosen.

Figure 7: Initial and final frontiers for a learning process with ).

Figure 8: Results for the water reservoir domain. Using utopia-based loss function ) trend is convergent (on the right) and the frontier returned is comparable to the ones obtained with state-of-the-art algorithms.

In order to show the capability of the approximate algorithm we have decided to test the simplest metric, that is, the utopia–based indicator. We start the learning from an arbitrary parametrization . Figure 7 reports the initial and the final frontiers obtained with out algorithm. We can notice that, even starting far from the true Pareto frontier, out algorithm is able to approach it, increasing covering and accuracy of the approximate frontier.

Figure 8(a) reports the final frontier obtained with different algorithms. The approximation obtained by our algorithm is comparable to the other results, however, our approach is able to produce a continuous frontier approximation.

It is important to notice that, due to the fact that the transition function of the domain limits the action in the range of admissible values (]), there are infinite policies with equal performance that allow the agent to release more than the reservoir level or less than zero. To overcome this problem [5] introduces a penalty term p in the reward (max()) during the learning phase. With our approach this modification was unnecessary, as the algorithm is able to learn without the penalty. We also tried adding it during the learning phase, but the frontier returned was exactly the same.

4 Metrics I3 tuning

In this Section we want to examine more deeply the tuning of mixed metric parameters, in order to provide the reader better insights for a correct use of such metric. PMGA performance, indeed, strongly depends on the indicator used and, thereby, their setting is critical. To be more precise, mixed metric, which obtained the best approximate Pareto–frontiers in the experiments, includes a trade-off between accuracy and covering, expressed by some parameters.

Figure 9: Approximate frontiers learned by PMGA using on varying . Figure (a) has dominated solutions and (c) is not wide enough. On the contrary, (b) achieves both accuracy and covering.

The indicator we are going to analyze is

where w(J) is a penalization term, i.e., it is a monotonic function that decreases as ) increases. In the previous Sections we proposed w(J) = 1 ). In this way we take advantage of the expansive behavior of the antiutopia–based indicator and the accuracy of the optimality–based indicator . In this Section we are going to study the performance of this metric on varying , proposing a simple tuning process. The idea is to set to an initial value (for example 1) and then increase (or dicrease) it if the approximate frontier contains dominated solutions (or is not large enough). Figure 9 shows different approximate frontiers obtained with different . Starting with = 1 the indicator behaves like ), meaning that was too small. Using = 1.5 (Figure 9(a)) the algorithm converges but the approximate frontier still contains dominated solutions. Increasing to 1.5 (Figure 9(b)) dominated solutions disappear. Finally, with = 2.5 (Figure 9(c)) the approximate frontier becomes shorter and Pareto–optimal solutions are discarded, meaning that we increased too much.

References

[1] J.R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley Ser. Probab. Statist.: Texts and References Section. Wiley, 1999.

[2] J.R. Munkres. Analysis On Manifolds. Adv. Books Classics Series. Westview Press, 1997.

[3] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

[4] Francesca Pianosi, Andrea Castelletti, and Marcello Restelli. Tree-based fitted q-iteration for multi- objective markov decision processes in water resource management. Journal of Hydroinformatics, 15(2):258–270, 2013.

[5] Simone Parisi, Matteo Pirotta, Nicola Smacchia, Luca Bascetta, and Marcello Restelli. Policy gradient approaches for multi-objective sequential decision making. In IJCNN 2014, Beijing, China, July 6-11, 2014, pages 1–7. IEEE, 2014.

designed for accessibility and to further open science