Deep Reinforcement Learning for Multi-objective Optimization

2019·arXiv

Abstract

I. INTRODUCTION

MULTI-OBJECTIVE optimization problems arise reg-ularly in real-world where two or more objectives are required to be optimized simultaneously. Without loss of generality, an MOP can be defined as follows:

where f(x) consists of M different objective functions and is the decision space. Since the M objectives are usually conflicting with each other, a set of trade-off solutions, termed Pareto optimal solutions, are expected to be found for MOPs.

This paper is partially supported by the National Natural Science Foundation of China (No. 61773390 and No. 71571187).

Kaiwen Li, Tao Zhang and Rui Wang (corresponding author) are with the College of Systems Engineering, National University of Defense Technology, Changsha 410073, PR China, and also with the Hunan Key Laboratory of Multi-energy System Intelligent Interconnection Technology, HKL-MSI2T, Changsha 410073, PR China. (e-mail: kaiwenli nudt@foxmail.com, zhangtao@nudt.edu.cn, ruiwangnudt@gmail.com)

Among MOPs, various multi-objective combinatorial optimization problems have been investigated in recent years. A canonical example is the multi-objective travelling salesman problem (MOTSP), where given n cities and M cost functions to travel from city i to j, one needs to find a cyclic tour of the n cities, minimizing the M cost functions. This is an NP-hard problem even for the single-objective TSP. The best known exact method, i.e., dynamic programming algorithm, requires a complexity of for single-objective TSP. It appears to be much harder for its multi-objective version. Hence, in practice, approximate algorithms are commonly used to solve MOTSPs, i.e., finding near optimal solutions.

During the last two decades, multi-objective evolutionary algorithms (MOEAs) have proven effective in dealing with MOPs since they can obtain a set of solutions in a single run due to their population based characteristic. NSGA-II [1] and MOEA/D [2] are two of the most popular MOEAs which have been widely studied and applied in many real-world applications. The two algorithms as well as their variants have also been applied to solve the MOTSP, see e.g., [3], [4], [5].

In addition, several handcrafted heuristics especially designed for TSP have been studied, such as the Lin-Kernighan heuristic [6] and the 2-opt local search [7]. By adopting these carefully designed tricks, a number of specialized methods have been proposed to solve MOTSP, such as the Pareto local search method (PLS) [8], multiple objective genetic local search algorithm (MOGLS) [9] and other similar variants [10], [11], [12]. More other methods and details can be found in this review [13].

Evolutionary algorithms and/or handcrafted heuristics have long been recognized as suitable methods to handle such problems. However, these algorithms, as iteration-based solvers, have suffered obvious limitations that have been widely discussed [14], [15], [13]. First, to find near-optimal solutions, especially when the dimension of problems is large, a large number of iterations are required for population updating or iterative searching, thus usually leading to a long running time for optimization. Second, once there is a slight change of the problem, e.g., changing the city locations of the MOTSP, the algorithm may need to be re-performed to compute the solutions. When it comes to newly encountered problems, or even new instances of a similar problem, the algorithm needs to be revised to obtain a good result, which is known as the No Free Lunch theorem [16]. Furthermore, such problem specific methods are usually optimized for one task only.

Carefully handcrafted evolution strategies and heuristics can certainly improve the performance. However, the recent advances in machine learning algorithms have shown their

c2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

ability of replacing humans as the engineers of algorithms to solve different problems. Several years ago, most people used man-engineered features in the field of computer vision but now the Deep Neural Networks (DNNs) have become the main techniques. While DNNs focus on making predictions, Deep Reinforcement Learning (DRL) is mainly used to learn how to make decisions. Thereby, we believe that DRL is a possible way of learning how to solve various optimization problems automatically, thus demanding no man-engineered evolution strategies and heuristics.

In this work, we explore the possibility of using DRL to solve MOPs, MOTSP in specific, in an end-to-end manner, i.e., given n cities as input, the optimal solutions can be directly obtained through a forward propagation of the trained network. The network model is trained through the trial and error process of DRL and can be viewed as a black-box heuristic or a meta-algorithm [17] with strong learned heuristics. Because of the exploring characteristic of DRL training, the obtained model can have a strong generalization ability, that is, it can solve the problems that it never saw before.

This work is originally motivated by several recent proposed Neural Network-based single-objective TSP solvers. [18] first proposes a Pointer Network that uses attention mechanism [19] to predict the city permutation. This model is trained in a supervised way that requires enormous TSP examples and their optimal tours as training set. It is hard for use and the supervised training process prevents the model from obtaining better tours than the ones provided in the training set. To resolve this issue, [20] adopts an Actor-Critic DRL training algorithm to train the Point Network with no need of providing the optimal tours. [17] simplifies the Point Network model and adds dynamic elements input to extend the model to solve the Vehicle Routing Problem (VRP). Moreover, the advanced Transformer model is employed to solve the routing problems and proves to be effective as well [21], [22].

The recent progress in solving the TSP by means of DRL is really appealing and inspiring due to its non-iterative yet efficient characteristic and strong generalization ability. However, there are no such studies concerning solving MOPs (or the MOTSP in specific) by DRL-based methods.

Therefore, this paper proposes to use the DRL method to deal with MOPs based on a simple but effective framework. And the MOTSP is taken as a specific test problem to demonstrate its effectiveness.

The main contributions of our work are as follows:

• This work provides a new way of solving the MOP by means of DRL. Some encouragingly new characteristics of the proposed method have been found in comparison with classical methods, e.g., strong generalization ability and fast solving speed.

• With a slight change of the problem instance, the classical methods usually need to be re-conducted from scratch, which is impractical for application especially for large-scale problems. In contrast, our method is robust to problem changes. Once the model is trained, it can scale to problems that the algorithm never saw before in terms of the number and locations of the cities of the MOTSP.

• The proposed method requires much lower running time than the classical methods, since the Pareto optimal solutions can be directly obtained by a simple forward propagation of the trained networks without any population updating or iterative searching procedures.

• Empirical studies show that the proposed method significantly outperforms the classical methods especially for large-scale MOTSPs in terms of both convergence and diversity, while requiring much less running time. It is noted that several papers [23], [24] have introduced the concept of Multi-objective deep reinforcement learning in the field of RL. However, they mainly focus on how to apply RL to control a robot with multiple goals, such as controlling a mountain car or controlling a submarine searching for treasures, as investigated in [23]. These studies are not explicitly proposed to deal with mathematical optimization problems like Eq. (1), and thus is out of the scope of this study. The structure of the paper is organized as follows: Section II-A introduces the general framework of DRL-MOA that describes the idea of using DRL for solving MOPs. And Section II-B elaborates the detailed modelling and training process of solving the specific MOTSP problem by means of the proposed DRL-MOA framework. Finally the effectiveness of the method is demonstrated through experiments in Section III and Section IV.

II. THE DEEP REINFORCEMENT LEARNING-BASED MULTI-OBJECTIVE OPTIMIZATION ALGORITHM

In this section, we propose to solve the MOP by means of DRL based on a simple but effective framework (DRL-MOA). First, the decomposition strategy [2] is adopted to decompose the MOP into a number of subproblems. Each subproblem is modelled as a neural network. Then model parameters of all the subproblems are optimized collaboratively according to the neighborhood-based parameter transfer strategy and the Actor-Critic [25] training algorithm. In particular, MOTSP is taken as a specific problem to elaborate how to model and solve the MOP based on the DRL-MOA.

A. General framework

Decomposition strategy. Decomposition, as a simple yet efficient way to design the multi-objective optimization algorithms, has fostered a number of researches in the community, e.g., Cellular-based MOGA[26], MOEA/D, MOEA/DD [27], DMOEA-C [28] and NSGA-III [29]. The idea of decomposition is also adopted as the basic framework of the proposed DRL-MOA in this work. Specifically, the MOP, e.g., the MOTSP, is explicitly decomposed into a set of scalar optimization subproblems and solved in a collaborative manner. Solving each scalar optimization problem usually leads to a Pareto optimal solution. The desired Pareto Front (PF) can be obtained when all the scalar optimization problems are solved.

In specific, the well-known Weighted Sum [30] approach is employed. Certainly, other scalarizing methods can also be applied, e.g., the Chebyshev and the penalty-based boundary intersection (PBI) method [31], [32]. First a set of uniformly spread weight vectors is given, e.g., (1, 0), (0.9, 0.1), . . . , (0, 1) for a bi-objective problem, as shown in Fig. 1. Here , where M represents the number of objectives. Thus the original MOP is converted into N scalar optimization subproblems by the Weighted Sum approach. The objective function of the subproblem is shown as follows [2]:

Therefore, the PF can be formed by the solutions obtained by solving all the N subproblems.

Fig. 1. Illustration of the decomposition strategy.

Neighborhood-based parameter-transfer strategy. To solve each subproblem by means of DRL, the subproblem is modelled as a neural network. Then the N scalar optimization subproblems are solved in a collaborative manner by the neighborhood-based parameter-transfer strategy, which is introduced as follows.

According to Eq. (2) it is observed that two neighbouring subproblems could have very close optimal solutions [2] as their weight vectors are adjacent. Thus, a subproblem can be solved assisted by the knowledge of its neighboring subproblems.

Fig. 2. Illustration of the parameter-transfer strategy.

Specifically, as the subproblem in this work is modelled as a neural network, the parameters of the network model of the subproblem can be expressed as . Here, represents the parameters of the neural network model that have been optimized already and represents the parameters that are not optimized yet. Assume that the subproblem has been solved, i.e., its network parameters have been optimized to its near optimum. Then the best network parameters obtained in the subproblem are set as the starting point for the network training in the subproblem. Briefly, the network parameters are transferred from the previous subproblem to the next subproblem in a sequence, as depicted in Fig. 2. The neighborhood-based parameter-transfer strategy makes it possible for the training of the DRL-MOA model; otherwise a tremendous amount of time is required for training the N subproblems.

Each subproblem is modelled and solved by the DRL algorithm and all subproblems can be solved in sequence by transferring the network weights. Thus, the PF can be finally approximated according to the obtained model. Employing the decomposition in conjunction with the neighborhood-based parameter transfer strategy, the general framework of DRLMOA is presented in Algorithm 1.

One obvious advantage of the DRL-MOA is its modularity and simplicity for use. For example, the MOTSP can be solved by integrating any of the recently proposed novel DRL-based TSP solvers [17], [21] into the DRL-MOA framework. Also, other problems such as VRP and Knapsack problem can be easily handled with the DRL-MOA framework by simply replacing the model of the subproblem. Moreover, once the trained model is available, the PF can be directly obtained by a simple forward propagation of the model.

The proposed DRL-MOA acts as an outer-loop. The next issue is how to model and solve the decomposed scalar subproblems. Therefore we take the MOTSP as a specific example and introduce how to model and solve the subproblem of MOTSP in the next section.

B. Modelling the subproblem of MOTSP

To solve the MOTSP, we first decompose the MOTSP into a set of subproblems and solve each one collaboratively based on the foregoing DRL-MOA framework. Each subproblem is modelled and solved by means of DRL. This section introduces how to model the subproblem of MOTSP in a neural network manner. Here, a modified Pointer Network similar to [17] is used to model the subproblem and the ActorCritic algorithm is used for training.

1) Formulation of MOTSP: We recall the formulation of an MOTSP. One needs to find a tour of n cities, i.e., a cyclic permutation , to minimize M different cost functions simultaneously:

where is the cost of travelling from city to . The cost functions may for example correspond to tour length, safety index or tourist attractiveness in practical applications.

2) The model: In this part, the above problem is modelled using a modified Pointer Network [17].

First the input and output structure of the network model is introduced: let the given set of inputs be X .= where n is the number of cities. Each is represented by a tupleis the attribute of the city that is used to calculate the cost function. For instance, represents the x-coordinate and y- coordinate of the city and is used to calculate the distance between two cities. Taking a bi-objective TSP as an example where both the two cost functions are defined by the Euclidean distance [13], the input structure is shown in Fig. 3. The input is four-dimensional and consists of total values. Moreover, the output of the model is a permutation of the cities .

To map input X to output Y , the probability chain rule is used:

First an arbitrary city is selected as . At each decoding step , we choose from the available cities . The available cities are updated every time a city is visited. In a nutshell, Eq. (4) provides the probability of selecting the next city according to , i.e., the already visited cities.

Then a modified Pointer network similar to [17] is used to model Eq. (4). Its basic structure is the Sequence-to-Sequence model [33], a recently proposed powerful model in the field of machine translation, which maps one sequence to another. The general Sequence-to-Sequence model consists of two RNN networks, termed encoder and decoder. An encoder RNN encodes the input sequence into a code vector that contains knowledge of the input. Based on the code vector, a decoder RNN is used to decode the knowledge vector to a desired sequence. Thus, the nature of Sequence-to-Sequence model that maps one input sequence to an output sequence is suitable for solving the TSP.

In this work, the architecture of the model is shown in Fig. 4 where the left part is the encoder and the right part is the decoder. The model is elaborated as follows.

Encoder. Encoder is used to condense the input sequence into a vector. Since the coordinates of the cities convey no sequential information [17] and the order of city locations in the inputs is not meaningful, RNN is not used in the encoder in this work. Instead, a simple embedding layer is used to encode the inputs to a vector which can decrease the complexity of the model and reduce the computational cost. Specifically, the 1-dimensional (1-D) convolution layer is used to encode the inputs to a high-dimensional vector [17] (=128 in this paper), as shown in Fig. 4. The number of in-channels equals to the dimension of the inputs. For example, the Euclidean bi-objective TSP has a four-dimensional input as shown in Fig. 3 and thus the number of in-channels is four. And the Encoder finally results to a vector where n indicates the city number. It is noteworthy that the parameters of the 1-D convolution layer are shared amongst all the cities. It means that, no matter how many cities there are, each city shares the same set of parameters which encode the city information to a high-dimensional vector. Thus, the encoder is robust to the number of the cities.

Fig. 3. Input structures of the neural network model for solving Euclidean bi-objective TSPs.

Fig. 4. Illustration of the model structure. Attention mechanism, in conjunc- tion with Encoder and Decoder, produces the probability of selecting the next city.

Decoder. The decoder is used to unfold the obtained high-dimensional vector, which stores the knowledge of inputs, into the output sequence. Different from the encoder, a RNN is required in the decoder as we need to summarize the information of previously selected cities so as to make the decision of . RNN has the ability of memorizing the previous outputs. In this work we adopt the RNN model of GRU (Gated recurrent unit) [34] that has similar performance but fewer parameters than the LSTM (Long Short-Term Memory) which is employed in the original Pointer Network in [17]. It is noted that RNN is not directly used to output the sequence. What we need is the RNN decoder hidden state at decoding step t that stores the knowledge of previous steps . Then and the encoding of the inputs are used together to calculate the conditional probability over the next step of city selection. This calculation is realized by the attention mechanism. As shown in Fig. 4, to select the next city at step t + 1, first we obtain the hidden state through Decoder. In conjunction with , the index of next city can be calculated using the attention mechanism.

Attention mechanism. Intuitively, the attention mechanism calculates how much every input is relevant in the next decoding step t. The most relevant one is given more attention and can be selected as the next visiting city. The calculation is as follows:

where are learnable parameters. is a key variable for calculating as it stores the information of previous steps . Then, for each city j, its is computed by and its encoder hidden state , as shown in Fig. 4. The softmax operator is used to normalize and finally the probability for selecting each city j at step t can be finally obtained. The greedy decoder can be used to select the next city. For example, in Fig. 4, city 2 has the largest and so is selected as the next visiting city. Instead of selecting the city with the largest probability greedily, during training, the model selects the next city by sampling from the probability distribution.

3) Training method: The model of the subproblem is trained using the well-known Actor-critic method similar to [20], [17]. However, as [20], [17] trains the model of single-objective TSP, the training procedure is different for the MOTSP case, as presented in Algorithm 2. Next we briefly introduce the training procedure.

Two networks are required for training: (i) an actor network, which is exactly the Pointer Network in this work, gives the probability distribution for choosing the next action, and (ii) a critic network that evaluates the expected reward given a specific problem sate. The critic network employs the same architecture as the Pointer network’s encoder that maps the encoder hidden state into the critic output.

The training is conducted in an unsupervised way. During the training, we generate the MOTSP instances from distributions . Here, M represents different input features of the cities, e.g., the city locations or the security indices of the cities. For example, for Euclidean instances of a bi-objective TSP, and are both city coordinates and or can be a uniform distribution of .

To train the actor and critic networks with parameters and instances are sampled from for training. For each instance, we use the actor network with current parameters to produce the cyclic tour of the cities and the corresponding reward can be computed. Then the policy gradient is computed in line 11 (refer to [35] for details of the formula derivation of policy gradient) to update the actor network. Here, is the reward approximation of instance n calculated by the critic network. The critic network is then updated in line 12 by reducing the difference between the true observed rewards and the approximated rewards.

Once all of the models of the subproblems are trained, the Pareto optimal solutions can be directly output by a simple forward propagation of the models. Time complexity of a forward calculation of Encoder is . And time complexity of a forward calculation of Decoder is , where is the approximated time complexity of the RNN. Thus the approximated time complexity of using DRL-MOA for solving the MOTSP is where N is the number of subproblems. As a forward propagation of the EncoderDecoder neural network can be quite fast, the solutions can be always obtained within a reasonable time.

III. EXPERIMENTAL SETUPS

The proposed DRL-MOA is tested on bi-objective TSPs. All experiments are conducted on a single GTX 2080Ti GPU. The code is written in Python and is publicly available 1

to reproduce the experimental results and to facilitate future studies. Meanwhile, all experiments of the compared MOEAs are conducted on the standard software platform PlatEMO2 [36] which is written in MATLAB. The compared MOGLS is written in Python3. All of the compared algorithms are run on the Intel 16-Core i7-9800X CPU with 64GB memory.

A. Test Instances

The considered bi-objective TSP instances are described as follows [13]:

Euclidean instances: Euclidean instances are the commonly used test instances for solving MOTSP [13]. Intuitively, both the cost functions are defined by the Euclidean distance. The first cost is defined by the distance between the real coordinates of two cities i, j. The second cost of travelling from city i to city j is defined by another set of virtual coordinates that used to calculate another objective. Thus the input is four-dimensional as shown in Fig. 3.

Mixed instances: In order to test the ability of our model that can adapt to different input structures, Mixed instances with three-dimensional input are tested. Here, the first cost function is still defined by the Euclidean distance between two points that represents the real city location, which is a two-dimensional input with x-coordinate and y-coordinate. Moreover, the second cost of travelling from city i to j is defined by a one-dimensional input. This one-dimensional input of city i can be interpreted as the altitude of city i. Thus the objective is to minimize the altitude variance when travelling between two cities. A smoother tour can be obtained with less altitude variance, therefore, we can reduce the fuel cost or improve the comfort of the journey. Thereby, Mixed instances have a three-dimensional input.

Training set: As an unsupervised learning method, only the model input and the reward function are required during the training process, with no need of the best tours as the labels. Euclidean instances with four-dimensional input and Mixed instances with three-dimensional input are generated as the training set. They are all generated from a uniform distribution of [0, 1].

Test set: The standard TSP test problems kroA and kroB in the TSPLIB library [37] are used to construct the Euclidean test instances kroAB100, kroAB150 and kroAB200 which are commonly used MOTSP test instances [13], [12]. kroA and kroB are two sets of different city locations and used to calculate the two Euclidean costs. For Mixed test instances, randomly generated 40-, 70-, 100-, 150- and 200-city instances are constructed.

In this work, the model is trained on 40-city MOTSP instances and it is used to approximate the PFs of 40-, 70-, 100-, 150- and 200-city test instances.

B. Parameter settings of model and training

Most parameters of the model and training are similar to that in [17] which can solve single-objective TSPs effectively. Specifically, the parameter settings of the network model are shown in TABLE I. represents the dimension of input, i.e., for Euclidean bi-objective TSPs. We employ an one-layer GRU RNN with the hidden size of 128 in the Decoder. For the Critic network, the hidden size is also set to 128.

We train both of the actor and critic networks using the Adam optimizer [38] with learning rate of 0.0001 and batch size of 200. The Xavier initialization method [39] is used to initialize the weights for the first subproblem. Weights for the following subproblems are generated by the introduced neighborhood-based parameter transfer strategy.

In addition, different size of generated instances are required for training different types of models. As compared with the

TABLE I PARAMETER SETTINGS OF THE MODEL. 1D-CONV MEANS THE 1-D CONVOLUTION LAYER. REPRESENTS THE DIMENSION OF INPUT. KERNEL SIZE AND STRIDE ARE ESSENTIAL PARAMETERS OF THE 1-D CONVOLUTION LAYER

Mixed MOTSP problem, the model of Euclidean MOTSP problem requires more weights to be optimized because its dimension of input is larger, thus requiring more training instances in each iteration. In this work, we generate 500,000 instances to train the Euclidean bi-objective TSP and 120,000 instances to train the Mixed one. All the problem instances are generated from an uniform distribution of [0, 1] and used for training for 5 epochs. It costs about 3 hours to train the Mixed instances and 7 hours to train the Euclidean instances. Once the model is trained, it can be used to directly output the Pareto Fronts.

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section, DRL-MOA is compared with classical MOEAs of NSGA-II and MOEA/D on different MOTSP instances. The maximum number of iteration for NSGA-II and MOEA/D is set to 500, 1000, 2000 and 4000 respectively. The population size is set to 100 for NSGA-II and MOEA/D. The number of subproblems for DRL-MOA is set to 100 as well. The Tchebycheff approach, which we found would perform better on MOTSP, is used for MOEA/D. In addition, only the non-dominated solutions are reserved in the final PF.

A. Results on Mixed type bi-objective TSP

We first test the model that is trained on 40-city Mixed type bi-objective TSP instances. The model is then used to approximate the PF of 40-, 70-, 100-, 150- and 200-city instances.

The performance of the PFs obtained by all the compared algorithms on various instances are shown in Fig. 5, 6, 7, 8. It is observed that the trained model can efficiently scale to bi-objective TSP with different number of cities. Although the model is obtained by training on the 40-city instances, it can still exhibit good performances on the 70-, 100-, 150-and 200-city instances. Moreover, the performance indicator of Hypervolume (HV) and the running time that are obtained based on five runs are also listed in Table II.

As shown in Fig. 5, all of the compared algorithms can work well for the small-scale problems, e.g., 40-city instances. By increasing the number of iterations, NSGA-II and MOEA/D even show a better ability of convergence. However, the large number of iterations can lead to a large amount of computing time. For example, 4000 iterations cost 130.2 seconds for MOEA/D and 28.3 seconds for NSGA-II while our method just requires 2.7 seconds.

Fig. 5. A random generated 40-city Mixed bi-objective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, 4000 iterations are applied respectively.

As can be seen in Fig. 6, 7, 8, as the number of cities increases, the competitors of NSGA-II and MOEA/D struggle to converge while the DRL-MOA exhibits a much better ability of convergence.

For 100-city instances in Fig. 6, MOEA/D shows a slightly better performance in terms of convergence than other methods by running 4000 iterations with 140.3 seconds. However, the diversity of solutions found by our method is much better than MOEA/D.

For 150- and 200-city instances as depicted in Fig. 7 and Fig. 8, NSGA-II and MOEA/D exhibit an obviously inferior performance than our method in terms of both the convergence and diversity. Even though the competitors are conducted for 4000 iterations, which is a pretty large number of iterations, DRL-MOA still shows a far better performance than them.

Fig. 6. A random generated 100-city Mixed bi-objective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, 4000 iterations are applied respectively.

In addition, the DRL-MOA achieves the best HV comparing to other algorithms, as shown in TABLE II. Also, its running time is much lower in comparison with the competitors. Overall, the experimental results clearly indicate the effectiveness of DRL-MOA on solving large-scale bi-objective TSPs. The

Fig. 7. A random generated 150-city Mixed bi-objective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, 4000 iterations are applied respectively.

Fig. 8. A random generated 200-city Mixed bi-objective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, 4000 iterations are applied respectively.

brain of the trained model has learned how to select the next city given the city information and the selected cities. Thus it does not suffer the deterioration of performance with the increasing number of cities. In contrast, NSGA-II and MOEA/D fail to converge within a reasonable computing time for large-scale bi-objective TSPs. In addition, the PF obtained by the DRL-MOA method shows a significantly better diversity as compared with NSGA-II and MOEA/D whose PF have a much smaller spread.

B. Results on Euclidean type bi-objective TSP

We then test the model on Euclidean type instances. The DRL-MOA model is trained on 40-city instances and applied to approximate the PF of 40-, 70-, 100-, 150- and 200-city instances. For 100-, 150- and 200-city problems, we adopt the commonly used kroAB100, kroAB150 and kroAB200 instances [13]. The HV indicator and computing time are shown in TABLE III.

Fig. 9, 10 and 11 show the experimental results on kroAB100, kroAB150 and kroAB200 instances. For the kroAB100 instance, by increasing the number of iterations to 4000, NSGA-II, MOEA/D and DRL-MOA achieve a similar level of convergence while MOEA/D performs slightly better. However, MOEA/D performs the worst in terms of diversity with all solutions crowded in a small region and its running time is not acceptable.

TABLE II HV VALUES OBTAINED BY DRL-MOA, NSGA-II AND MOEA/D. INSTANCES OF 40-, 70-, 100-, 150-, 200-CITY MIXED TYPE BI-OBJECTIVE TSP ARE TEST. THE RUNNING TIME IS LISTED. THE BEST HV IS MARKED IN GRAY BACKGROUND AND THE LONGEST RUNNING TIME IS MARKED BOLD

TABLE III HV VALUES OBTAINED BY DRL-MOA, NSGA-II AND MOEA/D. INSTANCES OF 40-, 70-, 100-, 150-, 200-CITY EUCLIDEAN TYPE BI-OBJECTIVE TSP ARE TEST. THE RUNNING TIME IS LISTED. THE BEST HV IS MARKED IN GRAY BACKGROUND AND THE LONGEST RUNNING TIME IS MARKED BOLD

Fig. 9. KroAB100 Euclidean bi-objective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, 4000 iterations are applied respectively.

When the number of cities increases to 150 and 200, DRLMOA significantly outperforms the competitors in terms of both convergence and diversity, as shown in Fig. 10 and 11. Even though 4000 iterations are conducted for NSGA-II and MOEA/D, there is still an obvious gap of performance between the two methods and the DRL-MOA.

In terms of the HV indicator as demonstrated in TABLE III, DRL-MOA performs the best on all instances. And the running time of DRL-MOA is much lower than the compared MOEAs. Increasing the number of iterations for MOEA/D and NSGA-II can certainly improve the performance but would result in a large amount of computing time. It requires more than 150 seconds for MOEA/D to reach an acceptable

Fig. 10. KroAB150 Euclidean bi-objective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, 4000 iterations are applied respectively.

level of convergence. The computing time of NSGA-II is less, approximately 30 seconds, for running 4000 iterations. However, the performance for NSGA-II is always the worst amongst the compared methods.

We further try to evaluate the performance of the model on 500-city instances. The model is still the one that is trained on 40-city instances and it is used to approximated the PF of a 500-city instance. The results are shown in Fig. 12. It is observed that DRL-MOA significantly outperforms the competitors. And the performance gap is especially larger than that on smaller-scale problems.

Fig. 11. KroAB200 Euclidean bi-objective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, 4000 iterations are applied respectively.

Fig. 12. A random generated 500-city Euclidean bi-objective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, 4000 iterations are applied respectively.

C. Extension to MOTSPs with More Objectives

In this section, the efficiency of our method is further evaluated on the 3- and 5-objective TSPs. The model is still trained on 40-city instances and it is used to approximate the PF of 100- and 200-city instances. The 3-objective TSP instances are constructed by combining two 2-dimensional inputs and a 1-dimensional input similar to the Mixed type instances. And the 5-objective TSP instances are constructed in the same way with two 2-dimensional inputs and three 1-dimensional input.

Results of the experiments on the 3-objective TSP are visualized in Fig. 13 and Fig. 14. It can be observed that DRL-MOA significantly outperforms the classical MOEAs on all of the 100- and 200-city instances. Moreover, DRL-MOA performs clearly better on the 200-city instance than on the 100-city instance while the competitors struggle to converge.

In addition, results of the HV values on the 3- and 5-objective TSP instances that are obtained based on five runs are presented in TABLE IV. In can be seen that the DRLMOA outperforms NSGA-II and MOEA/D on all instances. And NSGA-II is still the least effective method.

D. Comparisons with Local-Search-based Methods

In this section, DRL-MOA is compared with the local-search-based method. Ishibuchi and Murata first proposed a

Fig. 13. PFs obtained by DRL-MOA, NSGA-II and MOEA/D on a randomly generated 3-objective 100-city TSP instance.

Fig. 14. PFs obtained by DRL-MOA, NSGA-II and MOEA/D on a randomly generated 3-objective 200-city TSP instance.

Multi-Objective Genetic Local Search Algorithm (MOGLS) [40] for multi-objective combinatorial optimization. It is further improved and specialized to solve the MOTSP in [41], which significantly outperforms the original MOGLS. In this section the DRL-MOA is compared with the improved MOGLS.

Note that local search has been widely developed and various local-search-based methods that uses a number of specialized techniques to improve effectiveness have been proposed these years. However, the goal of our method is not to outperform a non-learned, specialized MOTSP algorithm. Rather, we show the fast solving speed and high generalization ability of our method via the combination of DRL and multi-objective optimization. Thus we did not consider more other local-search-based methods in this work.

TABLE IV HV VALUES OBTAINED BY DRL-MOA, NSGA-II AND MOEA/D ON 3- AND 5-OBJECTIVE TSP INSTANCES. THE BEST HV IS MARKED IN GRAY BACKGROUND.

As no source code is found, the improved MOGLS is implemented by ourselves strictly according to [41]. The algorithm is written in Python and run in the same machine to make fair comparisons and the code is publicly available4.

The parameters, e.g., size of the temporary population and number of initial solution are consistent with the settings in [41]. The local search uses a standard 2-opt algorithm, which is terminated if a pre-specified number of iterations is completed. is used to control the balance of computation time and model performance [40]. The improved MOGLS with is used for comparisons. It should be noted that improving or repeating the local search until no better solution is found might be able to further improve the performance of MOGLS. But it would be quite timeconsuming and not experimented in this work.

Since the local search can further improve the quality of solutions obtained by DRL as reported in [22], we thus use a simple 2-opt local search to post-process the solutions obtained by DRL-MOA, leading to the results of DRL-MOA+LS. It is noted that the 2-opt is conducted only once for each solution and it only costs several seconds in total. Thus results of the experiments on MOGLS-100, MOGLS-200, MOGLS-300, DRL-MOA and DRL-MOA+LS are presented in TABLE V. For clarity, only the PFs obtained by MOGLS-100, MOGLS-200 and our method are visualized in Fig. 15.

It is found that using local search to post-process the solutions can further improve the performance. It can be observed that DRL-MOA+LS outperforms the compared MOGLS on all instances while requiring much less computation time. DRLMOA without local search can also outperform MOGLS on all 200-city instances even the MOGLS has run for 1000 seconds, while DRL-MOA only requires about 13 seconds. Although MOGLS performs slightly better than DRL-MOA on 100-city instances, DRL-MOA can always obtain a comparable result within 7 seconds.

Results of the experiments indicate the fast computing speed and guaranteed performance of the DRL-MOA method. Moreover, using local search to post-process the solutions can further improve the performance.

Fig. 15. PFs obtained by MOGLS-100, MOGLS-200 and DRL-MOA on 200-city bi-objective TSP instances.

TABLE V HV VALUES OF THE PFS OBTAINED BY MOGLS-100, MOGLS-200, MOGLS-300, DRL-MOA AND DRL-MOA+LS. THE BEST AND THE SECOND VEST HVS ARE MARKED IN GRAY AND LIGHT GRAY BACKGROUND. THE LONGEST COMPUTATION TIME IS MARKED BOLD.

E. Effectiveness of parameter-transfer strategy

In this section, the effectiveness of the neighborhood-based parameter-transfer strategy is checked experimentally. At first, performances of the models that are trained with and without the parameter-transfer strategy are compared. They are both trained on 120,000 20-city MOTSP instances for five epochs. PFs obtained by the two models are presented in Fig. 16 (a). It is obvious that performance of the model is dramatically poor if the parameter-transfer strategy is not used for its training.

Moreover, we train the model without applying the parameter-transfer strategy on 240,000 instances for 10 epochs, that is, the model is trained four times longer than before. The result is presented in Fig. 16 (b). It can be seen that, without the parameter-transfer strategy, even if the model is trained four times longer, it still exhibits a poor performance. Thus it is effective and efficient to apply the parameter-transfer strategy for training; otherwise it is impossible to obtain a promising model within a reasonable time.

Fig. 16. Performances of the models that are trained with and without the parameter-transfer strategy.

F. The impact of training on different number of cities

The forgoing models are trained on 40-city instances. In this part, we try to figure out whether there is any difference of the model performance if the model is trained on 20-city instances. Performance of the model that is trained on 20-city instances is presented in Fig. 17 (a).

Fig. 17. The two models trained respectively on 20- and 40-city Euclidean bi-objective TSP instances. They are used to approximate the PF of 40-, 100-, 200-city problems.

It can be observed that the model trained on 20-city instances exhibits an apparently worse performance than the one trained on 40-city instances. A large number of solutions obtained by the 20-city model are crowded in several regions and there are less non-dominated solutions. A possible reason for the deteriorated result is that, when training on 40-city instances, 40 city selecting decisions are made and evaluated in the process of training each instance, which are twice of that when training on 20-city instances. Loosely speaking, if both the two models use 120, 000 instances, the 40-city model are trained based on cities which are twice of that of 20-city model. Therefore, the model trained on 40-city instances is better. And we can simply increase the number of training instances to improve the performance.

Lastly, it is also interesting to see that the solutions output by DRL-MOA are not all non-dominated. Moreover, these solutions are not distributed evenly (being along with the provided search directions). These issues deserve more studies in future.

G. Summary of the results

Observed from the experimental results, we can conclude that the DRL-MOA is able to handle MOTSP both effectively and efficiently. In comparison with classical multi-objective optimization methods, DRL-MOA has shown some encouragingly new characteristics, e.g., strong generalization ability, fast solving speed and promising quality of the solutions, which can be summarized as follows.

• Strong generalization ability. Once the trained model is available, it can scale to newly encountered problems with no need of retraining the model. Moreover, its performance is less affected by the increase of number of cities compared to existing methods.

• A better balance between the solving speed and the quality of solutions. The Pareto optimal solutions can

be always obtained within a reasonable time while the quality of the solutions is still guaranteed.

V. CONCLUSION

Multi-objective optimization, appeared in various disciplines, is a fundamental mathematical problem. Evolutionary algorithms have been recognized as suitable methods to handle such problem for a long time. However, evolutionary algorithms, as iteration-based solvers, are difficult to be used for on-line optimization. Moreover, without the use of a large number of iterations and/or a large population size, evolutionary algorithms can hardly solve large-scale optimization problems [14], [15], [13].

Inspired by the very recent work of Deep Reinforcement Learning (DRL) for single-objective optimization, this study provides a new way of solving the multi-objective optimization problems by means of DRL and has found very encouraging results. In specific, on MOTSP instances, the proposed DRL-MOA significantly outperforms NSGA-II, MOEA/D and MOGLS in terms of the solution convergence, spread performance as well as the computing time, and thus, making a strong claim to use the DRL-MOA, a non-iterative solver, to deal with MOPs in future.

With respect to the future studies, first in the current DRLMOA, a 1-D convolution layer which corresponds to the city information is used as inputs. Effectively, a distance matrix used as inputs can be further studied, i.e., using a 2-D convolution layer. Second, the distribution of the solutions obtained by the DRL-MOA are not as even as expected. Therefore, it is worth investigating how to improve the distribution of the obtained solutions. Overall, multi-objective optimization by DRL is still in its infancy. It is expected that this study will motivate more researchers to investigate this promising direction, developing more advanced methods in future.

REFERENCES

[1] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE transactions on evolutionary computation, vol. 6, no. 2, pp. 182–197, 2002.

[2] Q. Zhang and H. Li, “MOEA/D: A multiobjective evolutionary algorithm based on decomposition,” IEEE Transactions on evolutionary computation, vol. 11, no. 6, pp. 712–731, 2007.

[3] L. Ke, Q. Zhang, and R. Battiti, “MOEA/D-ACO: A multiobjective evolutionary algorithm using decomposition and antcolony,” IEEE transactions on cybernetics, vol. 43, no. 6, pp. 1845–1859, 2013.

[4] B. A. Beirigo and A. G. dos Santos, “Application of NSGA-II framework to the travel planning problem using real-world travel data,” in 2016 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2016, pp. 746–753.

[5] W. Peng, Q. Zhang, and H. Li, “Comparison between MOEA/D and NSGA-II on the multi-objective travelling salesman problem,” in Multiobjective memetic algorithms. Springer, 2009, pp. 309–324.

[6] S. Lin and B. W. Kernighan, “An effective heuristic algorithm for the traveling-salesman problem,” Operations research, vol. 21, no. 2, pp. 498–516, 1973.

[7] D. Johnson, “Local search and the traveling salesman problem,” in Proceedings of 17th International Colloquium on Automata Languages and Programming, Lecture Notes in Computer Science,(Springer-Verlag, Berlin, 1990), 1990, pp. 443–460.

[8] E. Angel, E. Bampis, and L. Gourv`es, “A dynasearch neighborhood for the bicriteria traveling salesman problem,” in Metaheuristics for Multiobjective Optimisation. Springer, 2004, pp. 153–176.

[9] A. Jaszkiewicz, “On the performance of multiple-objective genetic local search on the 0/1 knapsack problem-a comparative experiment,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 4, pp. 402–412, 2002.

[10] L. Ke, Q. Zhang, and R. Battiti, “A simple yet efficient multiobjective combinatorial optimization method using decompostion and pareto local search,” IEEE Trans. Cybern, vol. 44, pp. 1808–1820, 2014.

[11] X. Cai, Y. Li, Z. Fan, and Q. Zhang, “An external archive guided multi- objective evolutionary algorithm based on decomposition for combinatorial optimization,” IEEE Transactions on Evolutionary Computation, vol. 19, no. 4, pp. 508–523, 2014.

[12] X. Cai, H. Sun, Q. Zhang, and Y. Huang, “A grid weighted sum pareto local search for combinatorial multi and many-objective optimization,” IEEE transactions on cybernetics, vol. 49, no. 9, pp. 3586–3598, 2018.

[13] T. Lust and J. Teghem, “The multiobjective traveling salesman problem: a survey and a new approach,” in Advances in Multi-Objective Nature Inspired Computing. Springer, 2010, pp. 119–141.

[14] X. Zhang, Y. Tian, R. Cheng, and Y. Jin, “A Decision Vari- able Clustering-Based Evolutionary Algorithm for Large-scale Manyobjective Optimization,” IEEE Transactions on Evolutionary Computation, vol. In Press, 2016.

[15] M. Ming, R. Wang, and T. Zhang, “Evolutionary many-constraint optimization: An exploratory analysis,” in Evolutionary Multi-Criterion Optimization, K. Deb, E. Goodman, C. A. Coello Coello, K. Klamroth, K. Miettinen, S. Mostaghim, and P. Reed, Eds. Cham: Springer International Publishing, 2019, pp. 165–176.

[16] D. H. Wolpert, W. G. Macready et al., “No free lunch theorems for optimization,” IEEE transactions on evolutionary computation, vol. 1, no. 1, pp. 67–82, 1997.

[17] M. Nazari, A. Oroojlooy, L. Snyder, and M. Tak´ac, “Reinforcement learning for solving the vehicle routing problem,” in Advances in Neural Information Processing Systems, 2018, pp. 9839–9849.

[18] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2692–2700.

[19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.

[20] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, “Neural combinatorial optimization with reinforcement learning,” arXiv preprint arXiv:1611.09940, 2016.

[21] W. Kool, H. van Hoof, and M. Welling, “Attention, learn to solve routing problems!” arXiv preprint arXiv:1803.08475, 2018.

[22] M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.-M. Rousseau, “Learning heuristics for the tsp by policy gradient,” in International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research. Springer, 2018, pp. 170–181.

[23] C.-H. Hsu, S.-H. Chang, J.-H. Liang, H.-P. Chou, C.-H. Liu, S.-C. Chang, J.-Y. Pan, Y.-T. Chen, W. Wei, and D.-C. Juan, “Monas: Multiobjective neural architecture search using reinforcement learning,” arXiv preprint arXiv:1806.10332, 2018.

[24] H. Mossalam, Y. M. Assael, D. M. Roijers, and S. Whiteson, “Multi-objective deep reinforcement learning,” arXiv preprint arXiv:1610.02707, 2016.

[25] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.

[26] T. Murata, H. Ishibuchi, and M. Gen, “Specification of genetic search directions in cellular multi-objective genetic algorithms,” in Evolutionary Multi-Criterion Optimization, E. Zitzler, K. Deb, L. Thiele, C. A. Coello Coello, and D. Corne, Eds. Springer Berlin Heidelberg, 2001, pp. 82–95.

[27] K. Li, K. Deb, Q. Zhang, and S. Kwong, “An evolutionary many- objective optimization algorithm based on dominance and decomposition,” IEEE Transactions on Evolutionary Computation, vol. 19, no. 5, pp. 694–716, 2015.

[28] J. Chen, J. Li, and B. Xin, “Dmoea-C : Decomposition-based multiobjective evolutionary algorithm with the -constraint framework,” IEEE Transactions on Evolutionary Computation, vol. 21, no. 5, pp. 714–730, 2017.

[29] K. Deb and H. Jain, “An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints,” IEEE Transactions on Evolutionary Computation, vol. 18, no. 4, pp. 577–601, 2014.

[30] K. Miettinen, Nonlinear multiobjective optimization. Springer Science & Business Media, 2012, vol. 12.

[31] R. Wang, Z. Zhou, H. Ishibuchi, T. Liao, and T. Zhang, “Localized weighted sum method for many-objective optimization,” IEEE Transactions on Evolutionary Computation, vol. 22, no. 1, pp. 3–18, 2018.

[32] R. Wang, Q. Zhang, and T. Zhang, “Decomposition-based algorithms using pareto adaptive scalarizing methods,” IEEE Transactions on Evolutionary Computation, vol. 20, no. 6, pp. 821–837, 2016.

[33] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.

[34] K. Cho, B. Van Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.

[35] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, 2000, pp. 1008–1014.

[36] Y. Tian, R. Cheng, X. Zhang, and Y. Jin, “Platemo: A matlab platform for evolutionary multi-objective optimization,” IEEE Computational Intelligence Magazine, vol. 12, no. 4, pp. 73–87, 2017.

[37] G. Reinelt, “TSPLIBA traveling salesman problem library,” ORSA journal on computing, vol. 3, no. 4, pp. 376–384, 1991.

[38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[39] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.

[40] H. Ishibuchi and T. Murata, “A multi-objective genetic local search al- gorithm and its application to flowshop scheduling,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 28, no. 3, pp. 392–403, 1998.

[41] A. Jaszkiewicz, “Genetic local search for multi-objective combinatorial optimization,” European journal of operational research, vol. 137, no. 1, pp. 50–71, 2002.

He is a student with the College of Systems Engineering, NUDT. His research interests include prediction technique, multiobjective optimization, reinforcement learning, data mining, and optimization methods on Energy Internet.

He is a Lecturer with the College of Systems Engineering, NUDT. His research interests include evolutionary computation, multiobjective optimization, machine learning, and various applications using evolutionary algorithms.

He is a Professor with the College of Systems Engineering, NUDT. His research interests include multicriteria decision making, optimal scheduling, data mining, and optimization methods on energy Internet network.

Designed for Accessibility and to further Open Science