In this section we provide the proof of an extended version of Theorem 1.
Theorem 1. Let T be an open set in , let
) be a manifold parametrized by a smooth map expressed as composition of maps J and
, (
). Given a continuous function I defined at each point of
), the integral w.r.t. the volume is given by
The associated gradient w.r.t. the map parameters is given component–wise by
where ),
is the Kronecker product,
=
(
) is a symmetric (
) idempotent matrix with rank
+ 1) and
is a permutation matrix [1]. Note that
Proof. The equation of the performance measure ) follows directly from the definition of volume integral of a manifold [2] and the definition of function composition. In the following we give a detailed derivation of the i–th component of the gradient. Let
), then
where the pedix t is used to denote the direct or indirect dependence on variable t. While the loss derivative and the determinant derivative can be respectively expanded as
where
and is the Kronecker product,
=
(
) is a symmetric (
) idempotent matrix with rank
+ 1) and
is a permutation matrix [1]. The last term to be expanded is
:=
. We star from a basic property of the differential
then, applying the vector operator,
Finally, the derivative is given by
Note that )) =
do not denote the Hessian matrix. In fact, the Hessian matrix is defined as the derivative of the transpose Jacobian, that is,
) =
. The following equation relates the Hessian matrix to
)):
where 1), where q is the number of rows of the Jacobian matrix.
Theorem 2. For any MOMDP, the Hessian ) of the expected discounted reward J w.r.t. the policy parameters
is a (
) matrix obtained by stacking the Hessian of each component
where
Proof. The Hessian equation follows form the definition of the gradient )
the log trick and the property that the reward of a trajectory is independent from the policy parametrization. Let outline the derivation of the Hessian matrix.
Then
Recall that, since the probability of trajectory under policy
is given by
the following equations hold
Lemma 4. Given a parametrized policy ), under the assumption Assumption 3, the i–th component of the log–Hessian of the expected return can be bounded by
Proof. Consider the definition of the Hessian in Equation (1). Under assumption 3, the Hessian components can be bounded by ()
Theorem 5. Given a parametrized policy ), under the assumption Assumption 3, using the following number of H–step trajectories
the gradient estimate ) generated by Equation (1) is such that with probability 1
:
Proof. Hoeffding’s inequality implies that,
Solving the equation for N, notice that Lemma 4 provides a bound on each samples, we obtain:
In this section we present the most relevant experiments conducted on two domains (a Linear-Quadratic Gaussian regulator and a water reservoir) in order to study the behavior of PMGA algorithm with the different loss functions I proposed in the paper. We show the frontiers obtained with the loss functions described in the paper and in addition we present a normalization that takes into account the area of the approximate Pareto frontier. The area ) of a manifold is defined as the volume integral of the unitary function:
In the following we propose two different type of normalization:
• Using the area ) of the frontier
) as normalization factor:
,
• Using a convex combination of both the area and the loss function: ) with
= 1.
The idea of these normalizations is that the loss function I should guarantee the accuracy of the solutions obtained (i.e., only non-dominated solutions), while the area ) should provide a complete and uniform covering of the frontier.
In all the following experiments the learning rate was hand-tuned.
2.1 Linear-Quadratic Gaussian regulator
The first case of study is a discrete-time Linear-Quadratic Gaussian regulator (LQG) with multidimensional and continuous state and action spaces [3]. The LQG problem is defined by the following dynamics
where and
are n-dimensional column vector (
is a symmetric semidefinite matrix and R is a symmetric positive definite matrix. Dynamics are not coupled, that is, A and B are identity matrices. The policy is Gaussian with parameters
), where
. Finally, a constant covariance matrix Σ = I has been chosen.
The LQG can be easily extended to account for multi-conflicting objectives. In particular, the problem of minimizing the distance from the origin w.r.t. the i-th axis has been taken into account, considering the cost of the action over the other axes
Figure 1: Learning processes for the 2-objectives LQG (numbers denote the iteration) obtained through PMGA. In Figures 1(b) and 1(a) is used, respectively with and without forcing the parametrization to pass through extrema. Figure 1(c) shows iterations with
).
Figure 2: ) trends with different loss functions for the 2-objectives LQG.
Since the maximization of the i-th objective requires to have null action on the other axes, objectives are conflicting.
As this reward formulation violates the positiveness of matrix , we change the reward adding an
-perturbation
2.1.1 2-objectives case results
We first present the results obtained using PMGA algorithm and a parametrization that is not forced to pass through the extrema of the frontier. It is the one presented in the paper and it only limits in the interval [
0]:
In this case using and
the algorithm was not able to learn a good approximation of the Pareto– frontier in terms of accuracy and covering. Using utopia point as reference point for
(i.e.,
)) the frontier learned collapses in one point on the knee of the front. The same behaviour occurs using
. Using antiutopia point as reference point for
(i.e.,
)) the solutions returned are dominated and the frontier gets wider and tends to diverge from the true frontier expanding on the opposite half
Figure 3: Different views of the frontier obtained by PMGA using and
) for the 3-objective LGQ.
Figure 4: Different views of the frontier obtained by PMGA using normalized ) for the 3-objective LQG.
space (Figure 2(b) shows the divergent trend of )). These behaviours are not unexpected, considering the definition of the loss functions, as explained in the section of the paper devoted to metrics.
The only loss function able to learn with this parametrization was . Figure 1(a) (presented in the paper) shows a few iterations of the learning process using
= 2.5 and starting from
= [1 2 0 3]
(the algorithm was also able to learn starting from different
). Figure 2(a) shows the indicator
) as function of the iterations. It is possible to notice that it converges to a constant value.
Other experiments were conducted using a different parametrization, forced PMGA approximation to pass through the extrema of the frontier:
In this case, besides , also
proved to be an effective loss function and they both returned an accurate and wide approximation of the Pareto frontier (Figure 1(b), also presented in the paper, shows the learning process starting from
= [2 2]
).
) has still the same behaviour discussed before and the approximate frontier diverges from the true one (Figure 1(c)). This problem can be solved using the first normalization with
= 0.9 (lower
are not enough to correct the behaviour of the loss function, while using higher
the frontier returned is shorter and tends to be a line between the extreme points).
Figure 5: Approximations of the Pareto frontier obtained by PMGA using ) for the 3-objective LQG.
) has a similar behaviour, as the algorithm returns almost a line between the extreme points in order to reduce the frontier length. Using the first normalization with
8 such behaviour disappears and the frontier obtained has accurate solutions and guarantees a complete covering of the true Pareto frontier (its performance are thesame as
and
).
The second normalization, instead, has a critical problem because of the different magnitudo between the loss function and the area of the frontier, and therefore is difficult to properly choose w. A solution could be to ignore the constraint = 1, but in the 2-objectives case there is such a difference in the magnutudo that we were not able to find a suitable w.
2.1.2 3-objectives case results
We used a parametrization forced to pass through the extrema of the frontier and that limits in the interval [
0]:
The initial is set to 0. Figure 3 shows the frontiers obtained using
), with and without normalization. We can clearly
see that solutions tend to concentrate to the center of the frontier, in order to minimize the distance from
the utopia point and the area of the frontier. Normalization is not able to correct this behaviour and the only result is to bump the frontier, slightly increasing its area. This effect seems to be indipendent from the normalization used and from the parameters w and (we tried with 1
6 and 10 convex combinations uniformly spaced of w). Loss function
has the same behaviour and the frontiers obtained were very similar. Figures 5(a) and 5(b) show the frontier obtained with
), with and without normalization.
As expected, without normalization (Figure 5(a)) the algorithm tried to produce a frontier as wide as
possible, in order to increase the distance from the antiutopia point. This behaviour led to dominated solutions and the learning process does not converge. Using the first normalization with = 2 we were able to correct this behaviour, but the algorithm is still not able to cover the frontier completely (Figure 5(b)). Using smaller
the frontier was still too wide and contained dominated solutions, while higher
led to smaller ones. The second normalization instead was ineffective. This is due, again, to the different magnitudo between the loss function and the area of the frontier, that makes the choice of w critical. Finally Figure 6(a) shows the frontier obtained using
with
= 135. As expected, this loss function
proved to be the best among the three, returning a good approximation of the Pareto frontier in terms
of accuracy and covering, without using any normalization. Figure 6(b) shows the Pareto frontier in the parameter space.
Figure 6: Approximation of the Pareto frontier obtained by PMGA using for the 3-objective LQG.
A water reservoir can be modelled as a MOMDP with a continuous state variable s representing the water volume stored in the reservoir, a continuous action a that controls the water release, a state-transition model that depends also on the stochastic reservoir inflow , and a set of conflicting objectives. For a complete description of the problem, the reader can refer to [4].
In this work we consider two objectives: flooding along the lake shores and irrigation supply. The immediate rewards are defined by
where =
is the reservoir level (in the following experiments S = 1), ¯h is the flooding threshold (¯h = 50),
= max(
)) is the release from the reservoir and ¯
is the water demand (¯
= 50).
denotes the negative of the cost due to the flooding excess level and
is the negative of the deficit in the water supply.
Like in the original work, the discount factor is set to 1 for all the objectives and initial state is drawn from a finite set. However, different settings are used for learning and evaluation. In the learning phase 100 episodes by 100 steps are used (like in the original work), while the evaluation phase exploits 100, 000 episodes by 100 steps.
Since the problem is continuous we exploit a Gaussian policy model
where are the basis functions and
. Since the optimal policies for the objectives are not linear in the state variable, a radial basis approximation is used:
) =
where the centres
are placed at 0, 50, 120 and 160, and the widths are 50, 20, 40 and 50.
3.0.3 Results
We used the following parametrization, forced to pass near the estreme points of the Pareto frontier:
A constant variance = 0.1 has been chosen.
Figure 7: Initial and final frontiers for a learning process with ).
Figure 8: Results for the water reservoir domain. Using utopia-based loss function ) trend is convergent (on the right) and the frontier returned is comparable to the ones obtained with state-of-the-art algorithms.
In order to show the capability of the approximate algorithm we have decided to test the simplest metric, that is, the utopia–based indicator. We start the learning from an arbitrary parametrization . Figure 7 reports the initial and the final frontiers obtained with out algorithm. We can notice that, even starting far from the true Pareto frontier, out algorithm is able to approach it, increasing covering and accuracy of the approximate frontier.
Figure 8(a) reports the final frontier obtained with different algorithms. The approximation obtained by our algorithm is comparable to the other results, however, our approach is able to produce a continuous frontier approximation.
It is important to notice that, due to the fact that the transition function of the domain limits the action in the range of admissible values (]), there are infinite policies with equal performance that allow the agent to release more than the reservoir level or less than zero. To overcome this problem [5] introduces a penalty term p in the reward (
max(
)) during the learning phase. With our approach this modification was unnecessary, as the algorithm is able to learn without the penalty. We also tried adding it during the learning phase, but the frontier returned was exactly the same.
In this Section we want to examine more deeply the tuning of mixed metric parameters, in order to provide the reader better insights for a correct use of such metric. PMGA performance, indeed, strongly depends on the indicator used and, thereby, their setting is critical. To be more precise, mixed metric, which obtained the best approximate Pareto–frontiers in the experiments, includes a trade-off between accuracy and covering, expressed by some parameters.
Figure 9: Approximate frontiers learned by PMGA using on varying
. Figure (a) has dominated solutions and (c) is not wide enough. On the contrary, (b) achieves both accuracy and covering.
The indicator we are going to analyze is
where w(J) is a penalization term, i.e., it is a monotonic function that decreases as ) increases. In the previous Sections we proposed w(J) = 1
). In this way we take advantage of the expansive behavior of the antiutopia–based indicator and the accuracy of the optimality–based indicator
. In this Section we are going to study the performance of this metric on varying
, proposing a simple tuning process. The idea is to set
to an initial value (for example 1) and then increase (or dicrease) it if the approximate frontier contains dominated solutions (or is not large enough). Figure 9 shows different approximate frontiers obtained with different
. Starting with
= 1 the indicator behaves like
), meaning that
was too small. Using
= 1.5 (Figure 9(a)) the algorithm converges but the approximate frontier still contains dominated solutions. Increasing
to 1.5 (Figure 9(b)) dominated solutions disappear. Finally, with
= 2.5 (Figure 9(c)) the approximate frontier becomes shorter and Pareto–optimal solutions are discarded, meaning that we increased
too much.
[1] J.R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley Ser. Probab. Statist.: Texts and References Section. Wiley, 1999.
[2] J.R. Munkres. Analysis On Manifolds. Adv. Books Classics Series. Westview Press, 1997.
[3] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
[4] Francesca Pianosi, Andrea Castelletti, and Marcello Restelli. Tree-based fitted q-iteration for multi- objective markov decision processes in water resource management. Journal of Hydroinformatics, 15(2):258–270, 2013.
[5] Simone Parisi, Matteo Pirotta, Nicola Smacchia, Luca Bascetta, and Marcello Restelli. Policy gradient approaches for multi-objective sequential decision making. In IJCNN 2014, Beijing, China, July 6-11, 2014, pages 1–7. IEEE, 2014.