
Dynamic sub-route-based self-adaptive beam search Q-learning algorithm for traveling salesman problem

  • Jin Zhang ,

    Roles Conceptualization, Investigation, Methodology, Supervision, Writing – review & editing

    zhangjin@henu.edu.cn

    Affiliations School of Computer and Information Engineering, Henan University, Kaifeng, Henan, China, Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, Henan, China

  • Qing Liu,

    Roles Software, Writing – original draft

    Affiliation School of Computer and Information Engineering, Henan University, Kaifeng, Henan, China

  • XiaoHang Han

    Roles Data curation, Validation

    Affiliation School of Computer and Information Engineering, Henan University, Kaifeng, Henan, China

Abstract

In this paper, a dynamic sub-route-based self-adaptive beam search Q-learning (DSRABSQL) algorithm is proposed that provides a reinforcement learning (RL) framework combined with local search to solve the traveling salesman problem (TSP). DSRABSQL builds upon the Q-learning (QL) algorithm. Considering its problems of slow convergence and low accuracy, four strategies within the QL framework are designed first: the weighting function-based reward matrix, the power function-based initial Q-table, a self-adaptive ε-beam search strategy, and a new Q-value update formula. Then, a self-adaptive beam search Q-learning (ABSQL) algorithm is designed. To solve the problem that the sub-route is not fully optimized in the ABSQL algorithm, a dynamic sub-route optimization strategy is introduced outside the QL framework, and then the DSRABSQL algorithm is designed. Experiments are conducted to compare QL, ABSQL, DSRABSQL, our previously proposed variable neighborhood discrete whale optimization algorithm, and two advanced reinforcement learning algorithms. The experimental results show that DSRABSQL significantly outperforms the other algorithms. In addition, two groups of algorithms are designed based on the QL and DSRABSQL algorithms to test the effectiveness of the five strategies. From the experimental results, it can be found that the dynamic sub-route optimization strategy and self-adaptive ε-beam search strategy contribute the most for small-, medium-, and large-scale instances. At the same time, collaboration exists between the four strategies within the QL framework, which increases with the expansion of the instance scale.

1. Introduction

For a set of given cities, the traveling salesman problem (TSP) is finding the shortest route along which a salesman visits all of the cities exactly once before returning to the starting point. The TSP is a well-known combinatorial optimization problem with applications in many fields [1], such as transportation, circuit board design, production scheduling, and logistics distribution.

Since the TSP is a classic NP-hard problem, numerous approaches have been proposed to solve it, most of which are exact or heuristic algorithms. Exact algorithms include branch-and-bound (BnB), cutting-plane, integer programming, and dynamic programming, all of which obtain the global optimal solution by continuous iteration. For example, Pekny et al. (1990) [2] and Pesant et al. (1998) [3] proposed the BnB method and its variants for the TSP, and the famous TSP solver Concorde (http://www.math.uwaterloo.ca/tsp/concorde/index.html) is based on the BnB algorithm. Sanches et al. (2017) [4] proposed a partitioned cross-improvement initial solution method to speed up Concorde. However, since the time cost of exact algorithms increases exponentially with the size of the instance, they are not suitable for large-scale applications.

Heuristic algorithms are widely used because of their high computational efficiency: they can obtain a sub-optimal solution in a reasonable timeframe. Representative heuristic algorithms include the Lin-Kernighan heuristic (LKH), ant colony optimization (ACO), genetic algorithm (GA), particle swarm optimization (PSO), whale optimization algorithm (WOA), and gray wolf optimization (GWO). Building on these, various improved algorithms have been studied for solving the TSP. For example, by modifying the heuristic rules of LKH to improve its search strategy, Helsgaun (2000) [5] proposed an improved LKH algorithm. Ebadinezhad et al. (2020) [6] proposed an ACO algorithm that includes a dynamic evaporation strategy (DEACO). Wang et al. (2022) [7] proposed a fine-grained fast parallel GA based on a ternary optical computer, and Zheng et al. (2022) [8] proposed a transfer learning-based PSO algorithm. Zhang et al. (2020) [9] proposed a variable neighborhood discrete WOA (VDWOA), while Panwar et al. (2021) [10] proposed a novel discrete GWO algorithm. However, heuristic algorithms are mostly based on random search, which lacks both the ability to learn and a theoretical foundation, and they usually require unique heuristic rules for a given problem. For this reason, designing unified solutions for combinatorial optimization problems such as the TSP has become a popular research topic in machine learning.

As one of the main machine-learning methods, reinforcement learning (RL) has strong decision making and autonomous learning capabilities. RL is based on the Markov decision process (MDP), which is a sequential decision mathematical model that has natural similarity to the TSP. Therefore, many recent algorithms have adopted RL for TSP, and they can be divided into three categories.

  1. Deep learning algorithm combined with RL. The deep learning (DL) algorithm combined with RL exploits the perception ability of DL and the decision-making ability of RL. Its fast solution speed and strong generalization ability give this combination great potential in finding approximate TSP solutions. For example, Vinyals et al. (2015) [11] used supervised learning for training in a pointer network for TSP. Bello et al. (2016) [12] used an actor–critic policy gradient method in a recurrent neural network. Dai et al. (2017) [13] proposed an S2V-DQN algorithm using a graph embedding network trained by deep Q-learning. Deudon et al. (2018) [14] trained a neural network for TSP by policy gradient using the reinforcement learning rule with a critic. Ma et al. (2019) [15] used hierarchical RL for training in a graph pointer network for TSP. Stohy et al. (2021) [16] used an actor-critic method for training in a hybrid pointer network. These algorithms use RL to train neural networks by testing a large number of small-scale TSP datasets, which consumes a great deal of time and resources, and the accuracy of the solution is not ideal. The essence of this type of algorithm is deep learning, using RL only as a training method.
  2. Heuristic algorithm combined with RL. The heuristic algorithm combined with RL uses the strong generalization ability of RL to improve the heuristic algorithm to obtain better and faster solutions. For example, Liu et al. (2009) [17] proposed an improved GA algorithm with reinforcement mutation, using the genetic algorithm as the framework and reinforcement learning as a mutation operator, making it essentially a heuristic algorithm. Alipour et al. (2018) [18] proposed a hybridization of GAs and a multi-agent reinforcement learning (MARL) heuristic, which uses GA as a tour improvement heuristic and MARL as a construction heuristic. Costa et al. (2020) [19] proposed a 2-opt heuristic algorithm combined with deep RL for TSP, which essentially enhances the learning process of 2-opt. Combining the advantages of Q-learning (QL), Sarsa, and the Monte Carlo algorithm, Zheng et al. (2020) [20] proposed a variable strategy reinforced (VSR) approach, optimized the k-opt process of LKH based on this, and designed VSR-LKH for TSP. Optimizing the parameters of the biased random-key genetic algorithm (BRKGA) by QL, Chaves et al. (2021) [21] proposed a BRKGA-QL algorithm for TSP. These algorithms are still heuristic, using RL only to optimize parameters or strategies.
  3. RL algorithm. The above two categories of algorithms are not RL; they use RL for training or optimization. Algorithms based purely on RL are less common. For example, Ottoni et al. (2018) [22] proposed two RL algorithms for TSP, QL and Sarsa (state–action–reward–state–action), where the RL parameter estimation is based on a response surface model (RSM). That work only optimizes the parameters of RL and does not improve the algorithm framework, so its results are worse than those of the above two types of algorithms. For this reason, we design improved QL-based algorithms in this paper to find a better QL framework, increase the efficiency of this kind of algorithm, and provide a new way to solve the TSP. The principal contributions are as follows:

We introduce an improved QL algorithm for solving the TSP. A new reward matrix is constructed that is able to evaluate the immediate reward of the action more accurately compared to the original reward matrix. To reduce the size of the initial search space, a new initial Q-table is constructed. To balance exploration and exploitation, a new action selection strategy is designed. To more accurately assess the actions of the agent, a new Q-value update formula is designed. A local search strategy is also designed to enhance the algorithm’s local optimization capability.

The remainder of this paper is organized as follows. Section 2 presents the basic theoretical concepts of RL and constructs a QL algorithm for the TSP; after analysis of that algorithm, some improvements are identified. In Section 3.1, based on four proposed improvement strategies, an adaptive beam search QL algorithm is constructed, and a dynamic sub-route-based adaptive beam search QL algorithm is studied in Section 3.2 to further improve the algorithm's performance. Section 4 discusses our experiments. Section 5 presents conclusions and suggested future work.

2. Principle and methodology of QL algorithm to solve TSP

2.1. Theoretical foundation

RL is an important machine-learning method proposed by Minsky in 1956. Its model is based on the MDP, which shows significant advantages and potential in sequential decision problems. Its powerful exploration and autonomous learning abilities [23] are widely used in game-playing [24–26], robotic control [27, 28], and transportation [29–31].

RL is a methodological framework for learning, prediction, and decision-making in which the agent learns the optimal policy by interacting with the environment [32]. Based on the way actions are selected, RL can be categorized as value-based, policy-based, or actor–critic learning. QL and Sarsa are two representative value-based RL algorithms. Value-based algorithms are applicable to discrete action spaces; moreover, QL requires fewer parameters and has a better exploration ability than Sarsa [33], and is thus more likely to find a better solution. We therefore focus on QL algorithms for the TSP.

As a model-free RL algorithm, QL has the agent continually interact with the environment to obtain rewards, so that the action-selection strategy can be evaluated and optimized to maximize the cumulative reward [34–36].

The QL algorithm can be described as an MDP, which is usually represented by a five-tuple <S, A, P, R, γ> [37]. S is the state space of the environment, and s (s∈S) represents a certain state; the environment is the learning object in which the agent performs actions to change its state. A is the action set of the agent, and a (a∈A) represents a certain action; as a learner, the agent tries all possible actions to cognize the environment and explore the proper actions. P is the matrix of state transition probabilities, and p(s,a)∈P is the probability that the agent performs action a in state s, where the selection of the action is determined by policy π. R is the reward matrix, and r(s,a)∈R is the immediate reward obtained by the agent after performing action a in state s. γ is the discount factor that determines the ratio between immediate and cumulative rewards; a higher value indicates a greater emphasis on the latter.

QL uses the Q-table and reward matrix R to evaluate the cumulative and immediate rewards of actions, respectively. The Q-value is updated as follows. Given environment state st at step t, the agent selects action at under the ε-greedy strategy, the state is transferred to st+1, and an immediate reward r(st,at) is obtained. The Q-value is then updated as Formula (1):

(1) q(st,at) ← q(st,at) + α[r(st,at) + γ·max q(st+1,at+1) − q(st,at)], where the max is taken over all actions at+1

where q(st,at) is the cumulative reward of taking action at in state st; max q(st+1,at+1) is the maximum cumulative reward that can be obtained by taking an action at+1 in state st+1; α is a learning factor, α∈[0,1]; and γ is a discount factor, γ∈[0,1].

The ε-greedy strategy guides the agent to choose the action with the largest Q-value with probability ε and to select one of the remaining actions at random with probability 1−ε.
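As a minimal sketch of these two components (the state/action encoding and the default α, γ here are generic placeholders, not the paper's TSP-specific mapping):

```python
import random

def q_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.1):
    """Formula (1): move q(s,a) toward r + gamma * max_a' q(s',a')."""
    best_next = max((Q[s_next][a2] for a2 in actions_next), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def epsilon_greedy(Q, s, actions, epsilon):
    """The paper's convention: exploit (largest Q-value) with probability
    epsilon; otherwise pick one of the remaining actions at random."""
    best = max(actions, key=lambda a: Q[s][a])
    if random.random() < epsilon or len(actions) == 1:
        return best
    return random.choice([a for a in actions if a != best])
```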

The pseudocode of QL is shown as Algorithm 1.

Algorithm 1. QL

1. Begin

2.     Construct the five-tuple <S, A, P, R, γ>;

3.     Initialize Q-table and reward matrix R;

4.     Define parameters α, γ, ε;

5.     Initialize the maximum number of iterations Imax;

6.     while (current number of iterations < Imax)

7.       Initialize t and st;

8.       while (st is not the terminal state)

9.         Select action at according to ε-greedy policy in state st;

10.         Calculate the immediate reward r(st,at);

11.         Update q(st,at) according to Formula (1);

12.         t = t + 1;

13.       end while

14.     end while

15.     Obtain optimal policy according to Q-table.

16.End

2.2. Problem mapping of QL algorithm for TSP

Solving the TSP using the QL algorithm requires the construction of five-tuples, with the following parameters.

n: total number of cities;

t: number of visited cities, 1 ≤ t ≤ n;

St: set of visited cities after visiting t cities;

st: number of the t-th visited city, st∈St;

At: set of unvisited cities after visiting t cities;

at: number of the next city to be visited after city st, at∈At;

i: number of the current city, 1 ≤ i ≤ n, used as the row index of the matrices P, R, Q, D;

j: number of the next city to be visited from the current city, 1 ≤ j ≤ n, used as the column index of the matrices P, R, Q, D;

Ic: current iteration number;

Imax: maximum iteration number;

P: action selection probability matrix, where p(i,j) is the probability of visiting city j from city i;

R: reward matrix, also known as the immediate reward matrix, where r(i,j) is the immediate reward obtained by visiting city j from city i;

Q: Q-table, also known as the cumulative reward matrix, where q(i,j) is the cumulative reward obtained by visiting city j from city i;

D: distance matrix, where d(i,j) is the distance from city i to city j.

2.3. Construction of proposed QL algorithm

The process of the algorithm for solving the TSP can be described as follows. Starting from city st (the initial value of t is 1), select city at from At according to the ε-greedy strategy and visit it, then determine the immediate and cumulative rewards from st to at. Repeat this process until all cities are visited.

  1. (1) Construct reward matrix R, where each element is calculated as
(2)
  2. (2) Construct the initial Q-table, where each element is initialized to zero:
(3) q(i,j) = 0
  3. (3) Determine the ε-greedy strategy as
(4) p(st,at) = ε, if at is the unvisited city with the largest Q-value; p(st,at) = (1−ε)/(n−t−1), otherwise

Formula (4) shows that we start from city st and visit city at with a certain probability. The city with the largest Q-value has a probability of ε to be visited, while the other (n−t−1) cities have a probability of (1−ε) /(n−t−1) to be visited.

The pseudocode of the QL algorithm for the TSP is shown as Algorithm 2.

Algorithm 2. QL algorithm for TSP

1. Input: TSP instance with n cities

2. Output: optimal solution

3. Begin

4.     Initialize the Q according to Formula (3);

5.     ε = 0.95;

6.     Ic = 0;

7.     while (Ic < Imax)

8.       Randomly select a city to be the initial one and denote it as s1;

9.       S1 = {s1};

10.       A1 = {a1, a2, …, an-1};

11.       t = 1;

12.       while (t ≤ n)

13.         Determine the next city at to be visited from st according to Formula (4);

14.         Calculate the immediate reward r(st,at) from city st to at according to Formula (2);

15.         Update the cumulative reward q(st,at) from city st to at according to Formula (1);

16.         At+1 = At-{at};

17.         st+1 = at;

18.         St+1 = St∪{st+1};

19.         t = t + 1;

20.         end while

21.         Calculate and save the length of the visiting sequence of the n cities in set Sn;

22.         Ic = Ic + 1;

23.       end while

24.       The route corresponding to Sn with the shortest route length will be the optimal one.

25.End
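Algorithm 2 can be condensed into a short runnable sketch. The immediate reward r(i,j) = 1/d(i,j) below is an assumed stand-in for Formula (2), which the source renders only as an image; everything else follows the pseudocode above.

```python
import math
import random

def ql_tsp(coords, iters=300, alpha=0.1, gamma=0.1, eps=0.95, seed=0):
    """Sketch of Algorithm 2: tabular Q-learning over city-to-city moves.
    r(i,j) = 1/d(i,j) is an assumption standing in for Formula (2)."""
    rng = random.Random(seed)
    n = len(coords)
    d = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    Q = [[0.0] * n for _ in range(n)]
    best_tour, best_len = None, float("inf")
    for _ in range(iters):
        s = rng.randrange(n)                      # random starting city s1
        tour, unvisited = [s], set(range(n)) - {s}
        while unvisited:
            i = tour[-1]
            cand = sorted(unvisited)
            greedy = max(cand, key=lambda j: Q[i][j])
            if rng.random() < eps or len(cand) == 1:
                a = greedy                        # exploit with probability eps
            else:
                a = rng.choice([j for j in cand if j != greedy])
            r = 1.0 / d[i][a]                     # assumed Formula (2)
            nxt = max((Q[a][j] for j in unvisited - {a}), default=0.0)
            Q[i][a] += alpha * (r + gamma * nxt - Q[i][a])   # Formula (1)
            tour.append(a)
            unvisited.discard(a)
        length = sum(d[tour[k]][tour[(k + 1) % n]] for k in range(n))
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len
```

On a tiny instance such as the four corners of a unit square, this sketch quickly recovers the perimeter tour.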

2.4. Analysis of results

We selected 18 instances of 30–1655 cities from TSPLIB to test the proposed QL algorithm. The parameters α, γ, and ε of the algorithm were 0.1, 0.1, and 0.95, respectively. The results are compared with those of the VDWOA and DWOA algorithms [9] in the same environment. As only 12 instances are listed in [9], we ran VDWOA and DWOA ourselves on the other six instances; the details are listed in Table 2.

The results are as follows. For instances with fewer than 500 cities, the QL algorithm is 2–40 times faster than the comparative algorithms; for instances with 500–1000 cities, it is 4–60 times faster, and for instances with more than 1000 cities, it is 5–40 times faster. The QL algorithm has distinct advantages in solution time, but the accuracy is relatively poor.

The proposed QL algorithm (Algorithm 2) is only a simple simulation of the traditional QL algorithm (Algorithm 1), which does not consider the characteristics of TSP and the defects of the QL algorithm itself. There are four reasons for its poor accuracy:

  1. Formula (2) for calculating the elements of reward matrix R considers only the distance between the current city i and the next city j to be visited, ignoring the distances between city j and subsequent cities. This is too greedy and leads to a high probability of selecting a locally optimal city.
  2. The Q-table is simply initialized with 0, giving all cities the same initial Q-value, so that the algorithm searches nearly blindly in the early stage, resulting in a large search range and decreasing the convergence speed. Due to the great randomness of city selection in the early stage, the probability of worse cities being selected may be increased, causing the algorithm to easily fall into local optima.
  3. Cities are selected based on an ε-greedy strategy. If ε is too large, the probability of selecting the globally optimal city is reduced and the search scope becomes smaller, causing premature convergence. If ε is too small, the probability of selecting a globally worse city increases and the search scope becomes larger, making convergence difficult.
  4. There is no new city to visit after the last city, and all that remains is to return to the starting city. Thus, there is no need to consider the cumulative reward, and Formula (1) is no longer suitable for updating the Q-value.

3. QL-based algorithms for TSP

We studied several methods to improve the QL algorithm based on the above analysis.

3.1. Self-adaptive beam search QL algorithm

We propose a self-adaptive beam search QL algorithm, ABSQL. Four strategies are introduced: the weighting function-based reward matrix, power function-based initial Q-table, self-adaptive ε-beam search strategy, and a new formula for updating the Q-value.

3.1.1. Weighting function-based reward matrix.

For reason (1) above, we draw on the idea of the finite discount cumulative reward strategy [38]. Based on Formula (2), we introduce dmin and davg to consider the effect of the distance from city j to subsequent cities from the local and global dimensions, respectively, and compute the reward as (5) where θ1 and θ2 control the degree of influence of city i on city j and its successors, respectively, θ1∈(0.5,1), θ2∈(0.5,1); and dmin and davg are the minimum and average distances, respectively, from city j to the other cities to be visited.

The above strategy (denoted as stgy_1) synthetically considers the minimum and average distance between the current city to be visited and subsequent cities. stgy_1 can mitigate the greediness of the original reward matrix and increase the accuracy of rewards to find a higher-quality solution.
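Formula (5) itself appears only as an image in the source, so the sketch below is an illustrative stand-in: it blends the direct distance d(i,j) with dmin and davg using the weights θ1 and θ2, which captures the intent of stgy_1 but may differ from the paper's exact functional form.

```python
def reward(i, j, d, unvisited, theta1=0.8, theta2=0.8):
    """Illustrative stand-in for Formula (5): discount the direct distance
    d(i,j) by the minimum (local) and average (global) distance from j to
    the cities that remain to be visited. Assumed form, not the paper's."""
    rest = [k for k in unvisited if k != j]
    if not rest:                      # j is the last city to be visited
        return 1.0 / d[i][j]
    d_min = min(d[j][k] for k in rest)
    d_avg = sum(d[j][k] for k in rest) / len(rest)
    return 1.0 / (d[i][j] + theta1 * d_min + theta2 * d_avg)
```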

3.1.2. Power function-based initial Q-table.

For reason (2) above, since the distances between cities affect city selection, the reward matrix constructed according to Formula (5) positively affects the selection of cities. Therefore, we introduce a power function so that the initial Q-table is positively correlated with the improved reward matrix. The improved initial Q-table is constructed as (6) where δ is a parameter controlling the correlation between q(i,j) and r(i,j), δ∈(1.0,1.2).

The improved initial Q-table (denoted as stgy_2) can reduce the search scope and enhance the search ability in the early stage, enabling faster convergence and a better-quality solution.
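One plausible reading of Formula (6), assuming the power function is applied elementwise to the improved reward matrix (the exact form is rendered as an image in the source):

```python
def init_q_table(R, delta=1.12):
    """Assumed reading of Formula (6): initialize each q(i,j) as a power of
    the improved reward r(i,j), making the initial Q-table positively
    correlated with the reward matrix (delta in (1.0, 1.2))."""
    return [[r ** delta for r in row] for row in R]
```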

3.1.3. Self-adaptive ε-beam search strategy.

For reason (3) above, considering that the beam search algorithm has the characteristics of fast operation and a tendency to avoid local optima, a self-adaptive ε-beam search strategy is designed. By improving the fixed ε to be self-adaptive and replacing the greedy search with a beam search, the probability of action selection can be properly controlled.

  1. (1) Self-adaptive ε strategy

In the early stage of the algorithm, the search behavior should be relatively random and the search scope should be large. In this stage, the algorithm focuses on exploration to avoid falling into local optima. In the later stage, the search scope should be small, and the algorithm turns to exploitation, while the convergence speed is gradually accelerated. Exploration and exploitation can be controlled by adjusting the value of ε. With a larger value, the algorithm tends more toward exploitation. Therefore, the value of ε should increase with iteration. The improved calculation is (7) where b and c are parameters to control the respective initial and final values of ε.
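Formula (7) is rendered as an image in the source; as a purely illustrative stand-in, the sketch below ramps ε linearly from b toward c (clipped to 1.0) over the iterations, matching the stated intent that ε should increase with iteration. The paper's exact schedule may differ.

```python
def adaptive_epsilon(ic, i_max, b=0.25, c=1.05):
    """Illustrative stand-in for Formula (7): epsilon grows with the
    iteration count ic from roughly b toward c (clipped to 1.0), shifting
    the search from exploration toward exploitation."""
    return min(b + (c - b) * ic / i_max, 1.0)
```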

  1. (2) Beam search strategy

Beam search is a heuristic graph search algorithm [39–41] that gradually prunes poor-quality nodes and retains high-quality ones. It is usually adopted when the solution space of the graph is relatively large, to decrease the space and time cost of the search.

Based on a beam search, we sort Q-values of unvisited cities in descending order, and then select the top several as the next candidates to be visited. Considering the time cost and complexity of the algorithm, the number of candidate cities is set to 2, i.e., the current optimal or the second optimal city is selected as the next visited city. Compared with the greedy strategy, which only keeps the current optimal city, it has a larger search scope, which can increase the possibility of selecting the global optimal city, and makes the algorithm less likely to fall into local optima.

In summary, the self-adaptive ε-beam search strategy (denoted as stgy_3) can be described as (8) where formula (a) indicates that, starting from city st, the city with the maximum Q-value is selected with probability ε′; formula (b) indicates that the city with the second-largest Q-value is selected with probability (1−ε′)·ε′; and formula (c) indicates that each of the other cities in At is selected with probability (1−ε′)²/(n−t−2).

The improved strategy can balance exploration and exploitation, expand the search scope in the early stage, and enhance the search ability in the middle and later stages.
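The selection probabilities described above can be sketched directly (`q_row` holds the Q-values of the current city's row; `candidates` is the set of unvisited cities):

```python
import random

def beam_select(q_row, candidates, eps, rng=random):
    """Self-adaptive epsilon-beam search (Formula (8)): keep the two best
    candidates by Q-value; pick the best with probability eps, the second
    best with probability (1 - eps) * eps, and otherwise one of the
    remaining candidates uniformly at random."""
    ranked = sorted(candidates, key=lambda j: q_row[j], reverse=True)
    u = rng.random()
    if u < eps or len(ranked) == 1:
        return ranked[0]                      # case (a)
    if u < eps + (1 - eps) * eps or len(ranked) == 2:
        return ranked[1]                      # case (b)
    return rng.choice(ranked[2:])             # case (c)
```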

3.1.4. Improvement of Q-value update formula.

For reason (4) above, the Q-value of the last visited city does not contain long-term rewards, so the value calculated by using Formula (1) that contains long-term rewards is inevitably higher than its real Q-value. Furthermore, the influence of such overestimation will gradually accumulate with iteration, making the estimation of the Q-table more inaccurate, affecting the judgment of city selection. Therefore, Formula (1) is no longer suitable for updating the Q-value of the last visited city.

The last visited city has no further city to visit and needs only to return to the starting city; hence, the long-term reward in Formula (1) should not be considered there. The improved Q-value update formula is

(9a) q(st,at) ← q(st,at) + α[r(st,at) + γ·max q(st+1,at+1) − q(st,at)], for t < n
(9b) q(st,at) ← q(st,at) + α[r(st,at) − q(st,at)], for t = n

where formula (b) is used for the last visited city, and formula (a) for all other cities.

The improved Q-value update formula (denoted as stgy_4) can make the values of the Q-table more accurate when evaluating city selection.
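The two-case update can be sketched as follows (`remaining` is the set of cities still unvisited after taking action a; an empty set marks the last visited city):

```python
def update_q(Q, s, a, r, s_next, remaining, alpha=0.1, gamma=0.1):
    """Formula (9): an ordinary step (9a) includes the discounted best
    future value; for the last visited city (9b, `remaining` empty) the
    long-term term is dropped, so q(s,a) is pulled toward the immediate
    reward only."""
    if remaining:                                     # Formula (9a)
        target = r + gamma * max(Q[s_next][j] for j in remaining)
    else:                                             # Formula (9b)
        target = r
    Q[s][a] += alpha * (target - Q[s][a])
```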

The pseudocode of the ABSQL algorithm is presented as Algorithm 3.

Algorithm 3. ABSQL algorithm for TSP

1. Input: TSP instance with n cities

2. Output: optimal solution

3. Begin

4.     Initialize the Q table according to Formula (6);

5.     Ic = 0;

6.     while (Ic < Imax)

7.       Calculate ε´ according to Formula (7);

8.       Select one city randomly to be the starting one and denoted it as s1;

9.       Let S1 = {s1}, A1 = {a1, a2, …, an−1}, and the number of visited cities t = 1;

10.       while (t ≤ n)

11.         Determine the next city at to be visited from st according to Formula (8);

12.         Calculate the immediate reward r(st,at) according to Formula (5);

13.         if (t != n)

14.           Update the cumulative reward q(st,at) from city st to at according to Formula (9a);

15.         else

16.           Update the cumulative reward q(st,at) from city st to at according to Formula (9b);

17.         end if

18.         At+1 = At-{at};

19.         st+1 = at;

20.         St+1 = St∪{st+1};

21.         t = t + 1;

22.       end while

23.       Calculate and save the length of the visiting sequence of the n cities in set Sn;

24.       Ic = Ic + 1;

25.     end while

26.     The route corresponding to Sn with the shortest route length will be the optimal one.

27.End

3.1.5. Analysis of results.

We used the 18 instances in Section 2.4 to test the effectiveness of the proposed ABSQL algorithm through analysis of the length and graph of the route. The parameters α and γ used in the ABSQL algorithm are the same as in the QL algorithm. The newly added parameters θ1, θ2, δ, b and c are 0.8, 0.8, 1.12, 0.25, and 1.05, respectively.

  1. (1) Route length analysis

We compare solutions obtained by the ABSQL algorithm with those of QL and VDWOA, with results as shown in Table 2. Denote min, avg, and max as the minimum, average, and maximum values, respectively, with deviation rates

(10) Smin = (min − opt)/opt × 100%
(11) Savg = (avg − opt)/opt × 100%
(12) Smax = (max − opt)/opt × 100%

where opt is the known optimal solution.
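Reading Formulas (10)–(12) as the usual percentage gaps from the known optimum (an interpretation consistent with how Smin and Savg are used below), the deviation rates can be computed as:

```python
def deviation_rates(lengths, opt):
    """Deviation rates as percentage gaps from the known optimum opt:
    Smin, Savg, and Smax for the minimum, average, and maximum tour
    lengths found over repeated runs."""
    gap = lambda v: (v - opt) / opt * 100.0
    avg = sum(lengths) / len(lengths)
    return gap(min(lengths)), gap(avg), gap(max(lengths))
```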

The comparison of ABSQL and QL shows that ABSQL has a faster solution, the quality of which is greatly improved.

For the 18 instances, the min of ABSQL is better than that of QL, and its Smin is 0–92% less than that of QL. Comparing ABSQL and VDWOA in terms of time, ABSQL is faster than VDWOA on all instances. In terms of solution quality, the following results are obtained.

For 14 small-scale instances with the total number of cities less than 500, the min of ABSQL is slightly worse than that of VDWOA for eight instances, and its Smin is 0.08%–5.7% greater. However, the avg of ABSQL has clear advantages over VDWOA in 10 instances, and its Savg is 0.31%–7.25% less. For four medium- and large-scale instances with more than 500 cities, the min of ABSQL is better than that of VDWOA, and its Smin is 0.67%–5.11% less. Since VDWOA takes too long for medium- and large-scale instances, it cannot obtain enough solutions in a limited time to calculate avg. Therefore, the avg and Savg of the two algorithms are not compared.

The above analysis shows that ABSQL has a better ability than VDWOA for medium- and large-scale instances. However, as comparing opt values reveals, ABSQL still has some deficiencies. For small-scale instances, the Smin of ABSQL is 0–10.95%, and for medium- and large-scale instances, the Smin of ABSQL is 12.34%–14.31%, i.e., the Smin of ABSQL increases with the scale of the problem, so the solution accuracy of ABSQL requires further improvement.

  1. (2) Route diagram analysis

Analyzing the route diagrams obtained from the above instances, a common problem is found: one or more sub-routes are not completely optimized. We select instance bays29 to analyze this problem from two aspects. Fig 1 is an incompletely optimized route diagram, and Fig 2 is an optimal route diagram. The X- and Y-axes represent the abscissa and ordinate, respectively, of each city; circled numbers indicate the serial numbers of cities; solid lines are routes between cities. The three incompletely optimized sub-routes in Fig 1 and their corresponding fully optimized sub-routes in Fig 2 are marked with ovals.

  1) Cross or overlap between edges

In Fig 1, sub-route 11-17-22-14 is incompletely optimized, with a crossover between edges (11,17) and (22,14). In Fig 2, there is no crossover of edges, sub-route 11-22-17-14 is fully optimized, and sub-route 11-17-22-14 can be optimized to 11-22-17-14 by swapping nodes 17 and 22.

  2) No cross or overlap between edges

In Fig 1, there is no crossover or overlap between edges of sub-route 23-8-24-27-16. However, the sub-route is not completely optimized. This is mainly because the sum of the lengths of edges (23,27) and (27,8) is less than that of (16,27) and (27,24). Sub-route 23-27-8-24-16 of Fig 2 is optimal. Sub-route 23-8-24-27-16 can be optimized to 23-27-8-24-16 by swapping nodes 8 and 24, and then nodes 24 and 27. Similarly, sub-route 21-3-29-26 of Fig 1 can be improved to the better sub-route 21-29-3-26 of Fig 2 by swapping nodes 3 and 29.

Through the above analysis, it can be seen that the incomplete optimization of sub-routes is the main reason for the relatively poor solution of the ABSQL algorithm compared with the known optimum. Considering that the route length can be shortened by swapping some nodes of the sub-route, we introduce the 2-opt strategy to further optimize the sub-route.

3.2. Dynamic sub-route-based ABSQL algorithm

To solve the problem of incomplete optimization of sub-routes, static and dynamic sub-route optimization strategies are designed to optimize routes output by ABSQL.

3.2.1. Static sub-route optimization.

In this section, we perform the following operations. Denote the route output by ABSQL as Sn. A partition operation is first performed on Sn as follows. Starting from the first node, successively divide Sn (which has n nodes) into ⌊n/m⌋ sub-routes of m nodes each, denoted sr1, sr2, …, sr⌊n/m⌋. If a sub-route with fewer than m nodes remains, it is denoted rp, and the set of all sub-routes is Sr = {sr1, sr2, …, sr⌊n/m⌋, rp}. Because the number of nodes in each sub-route is fixed, this process is called static sub-route optimization. For sub-routes in Sr with four or more nodes, all nodes except the first and last are selected for 2-opt optimization, and the optimized set is denoted Sr1. The optimization process is as follows.

Step 1: Taking the second node in the sub-route as the starting node, perform (m−2)*(m−3)/2 instances of 2-opt optimization with the other interior nodes;

Step 2: If the length of the sub-route does not decrease after 2-opt optimization, the optimization of the sub-route is stopped; otherwise, the new sub-route is reserved and we return to step 1.
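The partition step and the interior 2-opt loop can be sketched as follows (a straightforward reading of the procedure above; the first and last nodes of each sub-route are left fixed, as in the static strategy):

```python
def partition(route, m):
    """Split a route into consecutive sub-routes of m nodes each; a
    shorter remainder sub-route rp may be left at the end."""
    return [route[i:i + m] for i in range(0, len(route), m)]

def two_opt_inner(route, d):
    """Repeatedly apply 2-opt moves to the interior nodes of `route`
    (endpoints stay fixed) until no move shortens it."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(route) - 2):
            for k in range(i + 1, len(route) - 1):
                # cost of the two edges removed vs. the two edges added
                before = d[route[i - 1]][route[i]] + d[route[k]][route[k + 1]]
                after = d[route[i - 1]][route[k]] + d[route[i]][route[k + 1]]
                if after < before - 1e-12:
                    route[i:k + 1] = reversed(route[i:k + 1])
                    improved = True
    return route
```

For example, on four cities placed on a line, the crossing sub-route 0-2-1-3 is repaired to 0-1-2-3, mirroring the bays29 node-swap discussion above.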

3.2.2. Dynamic sub-route optimization.

Since static sub-route optimization does not consider optimization of the first and last nodes of a sub-route, and the number of nodes in each sub-route is fixed, a dynamic optimization strategy is introduced in which the first and last nodes are also considered. Dynamic optimization is performed on the set Sr1 obtained by static optimization, so the number of nodes in each sub-route can change over the optimization process, which proceeds as follows.

Step 1: Suppose the number of the current optimization round is h, with initial value 1 and h ≤ L. The set of sub-routes after the h-th round of optimization is Srh = {sr1h, sr2h, …, srf(h)−1h, rph}, where sr1h to rph represent the first to f(h)-th sub-routes, respectively; f(h) is calculated according to Formula (13), and L, the total number of optimization rounds, is determined by Formula (14).

(13)(14)

Step 2: All sub-routes in the set Srh except rph are merged in pairs in order; if srf(h)−1h is left over, it is merged with rph;

Step 3: For each merged sub-route obtained in Step 2, select the nodes between the 1/4 and 3/4 positions of its node sequence and perform 2-opt optimization on them to obtain the set Srh+1;

Step 4: If h≤L, then let h = h+1, and return to step 2; otherwise, merge all sub-routes in SrL+1 to obtain the new path.
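One round of the dynamic strategy (Steps 2–3) can be sketched as below. The helpers `merge_pairs` and `middle_segment` are illustrative names; as a simplification of Step 2, this sketch folds any leftover sub-route into the last merged pair rather than tracking rp separately.

```python
def merge_pairs(subroutes):
    """Merge sub-routes pairwise in order; a leftover sub-route joins the last merge."""
    merged = []
    i = 0
    while i + 1 < len(subroutes):
        merged.append(subroutes[i] + subroutes[i + 1])
        i += 2
    if i < len(subroutes):          # odd one out: attach it to the final merged route
        if merged:
            merged[-1] = merged[-1] + subroutes[i]
        else:
            merged.append(subroutes[i])
    return merged

def middle_segment(sub):
    """Index bounds of the 1/4-to-3/4 portion of a merged sub-route (Step 3),
    i.e. the slice on which 2-opt is applied in this round."""
    n = len(sub)
    return n // 4, 3 * n // 4
```

Each round halves the number of sub-routes while doubling their length, so after L rounds the merged pieces cover the whole route, which is why the first and last nodes of the original static sub-routes also get optimized.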

By combining ABSQL with the dynamic sub-route optimization strategy (denoted as stgy_5), a dynamic sub-route-based self-adaptive beam search Q-learning algorithm (DSRABSQL) for TSP is designed.

The pseudocode of the above algorithm is shown as Algorithm 4.

Algorithm 4. DSRABSQL algorithm for TSP

1. Input: Sn of ABSQL algorithm

2. Output: optimal solution

3. Begin

4.     Initialize m;

5.     Obtain Sr from Sn according to the partition operation in Section 3.2.1;

6.     Optimize Sr according to Steps 1 and 2 in Section 3.2.1 to obtain Sr1;

7.     Initialize L according to Formula (14);

8.     Initialize h = 1;

9.     while (h ≤ L)

10.         Execute Steps 2 and 3 of Section 3.2.2;

11.         h = h + 1;

12.     end while

13.     Merge all sub-routes in SrL+1 in order to obtain the optimal route.

14.End

3.2.3. Analysis of results.

We used the 18 instances in Section 2.4 to test the proposed DSRABSQL algorithm. The parameters α, γ, θ1, θ2, δ, b, and c used in DSRABSQL are the same as in ABSQL, with the additional parameter m set to 10. The performance of DSRABSQL is analyzed by comparing its optimal route diagrams with those of ABSQL.

Since most instances have too many cities or an uneven city distribution, their route diagrams are hard to read, so we present the route diagrams of berlin52 and tsp225 for illustration.

Figs 3 and 4 are the optimal routes of berlin52 and tsp225, respectively, as obtained by ABSQL. Figs 5 and 6 are the corresponding routes as obtained by DSRABSQL.

Fig 3. The optimal route of Berlin52 output by ABSQL algorithm.

https://doi.org/10.1371/journal.pone.0283207.g003

Fig 4. The optimal route of Tsp225 output by ABSQL algorithm.

https://doi.org/10.1371/journal.pone.0283207.g004

Fig 5. The optimal route of Berlin52 output by DSRABSQL algorithm.

https://doi.org/10.1371/journal.pone.0283207.g005

Fig 6. The optimal route of Tsp225 output by DSRABSQL algorithm.

https://doi.org/10.1371/journal.pone.0283207.g006

From the analysis of the route diagram, it can be found that the sub-routes marked in Figs 3 and 4 are not completely optimized, and the sub-routes in the corresponding positions of Figs 5 and 6 are completely optimized. Therefore, by enhancing the local optimization ability of ABSQL, DSRABSQL greatly improves the solution accuracy of TSP.

4. Experiments and analysis of results

The simulation is carried out in Spyder on an Intel Core i5-7500 CPU at 3.4 GHz with 8 GB of memory, running Windows 10 (64-bit). We test 18 TSP instances with city scales ranging from 30 to 1655. The instances used in this experiment are taken from the TSPLIB library (http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/). Table 1 lists the simulation parameters of the three algorithms and elaborates on the characteristics of the parameters adopted by each algorithm.

Table 1. Simulation parameters for the QL, ABSQL and DSRABSQL algorithms.

https://doi.org/10.1371/journal.pone.0283207.t001

The values of the parameters listed in Table 1 were obtained through experimental tests, and the specific procedures are as follows.

Procedure 1: Determine α, γ and ε of the QL algorithm.

  1. Set ε based on QL-related books and literature;
  2. Set the initial value, final value and step size of parameters α and γ, and select the optimal values of these parameters according to the test results.

Procedure 2: Determine θ1, θ2, δ, b and c of the ABSQL algorithm.

  1. Set the initial values, final values and step sizes of parameters θ1, θ2 in stgy_1, parameter δ in stgy_2, and parameters b and c in stgy_3 in succession, and select the optimal values of these parameters according to the test results;
  2. The parameters α and γ in stgy_4 inherit the optimal values of α and γ from the QL algorithm.

Procedure 3: Determine parameter m of the DSRABSQL algorithm.

Set the initial value, final value and step size of parameter m in stgy_5, and select the optimal value of these parameters based on the test results.

The QL, ABSQL, and DSRABSQL algorithms are each iterated Imax times, and the results are taken over nt simulation runs for each instance. Imax and nt are calculated according to Formulas (15) and (16).

Table 2 lists min (Smin), avg (Savg), max (Smax), Std, and the average time cost of the 18 instances tested by QL, ABSQL, and DSRABSQL, and the corresponding values of VDWOA and DWOA.

Table 2. Results of proposed algorithms and algorithms in [9] for 18 instances.

https://doi.org/10.1371/journal.pone.0283207.t002

From Table 2, it can be seen that min, avg, max, and Std of DSRABSQL are mostly better than those of VDWOA for 18 instances, which indicates that the solution quality and stability of DSRABSQL are superior to those of VDWOA.

To further analyze the performance of the algorithm, the data of Table 2 are discussed in two aspects: the first is the data comparison of QL, ABSQL and DSRABSQL, and the second is the data comparison between DSRABSQL and VDWOA.

For the first aspect, a small-scale (ch150), a medium-scale (rat575), and a large-scale (fl1400) instance are selected to analyze the performance of DSRABSQL. Figs 7–9 illustrate min, avg, max, Std, and Nmin (the number of mins) obtained by QL, ABSQL, and DSRABSQL on ch150, rat575, and fl1400 under different numbers of iterations. Since the time cost of an algorithm increases with the scale of the instance, ch150, rat575, and fl1400 are tested 50, 20, and 10 times, respectively.

Fig 7. min, avg, max and Std of ch150 obtained by three algorithms.

https://doi.org/10.1371/journal.pone.0283207.g007

Fig 8. min, avg, max and Std for rat575 obtained by three algorithms.

https://doi.org/10.1371/journal.pone.0283207.g008

Fig 9. min, avg, max and Std of fl1400 obtained by three algorithms.

https://doi.org/10.1371/journal.pone.0283207.g009

Fig 10. The value of Nmin for ch150, rat575 and fl1400 obtained by three algorithms.

https://doi.org/10.1371/journal.pone.0283207.g010

A chi-square test is performed on the Nmin values obtained for ch150 under 400 and 500 iterations of DSRABSQL, as shown in Fig 10, where Nmin is the number of mins and It is the number of iterations. With 1 degree of freedom, we assume Hypothesis 0: Nmin and It are correlated.

For this purpose, the sample frequency contingency table of Nmin and It, shown in Table 3, is constructed, and a chi-square test is run on the fourfold data in Table 3, yielding the test statistic χ2 = 6.895.
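For reference, the fourfold (2×2) chi-square statistic has a simple closed form. The sketch below uses hypothetical counts, since Table 3's data are not reproduced here; the paper reports χ2 = 6.895 for its actual table.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 contingency table [[a, b], [c, d]]:
        chi2 = n * (a*d - b*c)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical fourfold counts, purely illustrative:
stat = chi_square_2x2(10, 20, 30, 40)
```

With 1 degree of freedom, the statistic is then compared against the chi-square critical values (e.g., 6.635 at p = 0.01 and 7.879 at p = 0.005), as done in the text below.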

Table 3. Sample frequency contingency table of Nmin and It.

https://doi.org/10.1371/journal.pone.0283207.t003

From Table 2 and Figs 7–10, it can be seen that the min, avg, max, Std, and Nmin of DSRABSQL are better than those of QL and ABSQL, which indicates that the solution quality, stability, and convergence of DSRABSQL are all significantly improved relative to QL and ABSQL throughout the iterative process. Table 3 shows that χ2 ∈ (6.635, 7.879), so the probability p that Hypothesis 0 is true is between 0.99 and 0.995, which indicates that Nmin and It are significantly correlated. Similarly, Nmin and It can be inferred to be correlated for instances rat575 and fl1400. Therefore, the Nmin of DSRABSQL increases with It until it stabilizes; i.e., the probability of DSRABSQL obtaining its minimum value min increases with It until stability. The above analysis shows that DSRABSQL has learning ability that is continuously enhanced over the iterations until stability.

For the second aspect, the Mann-Whitney-Wilcoxon test is used to verify the performance of DSRABSQL relative to VDWOA based on the data in Table 2. A composite parameter Sc is introduced to evaluate the performance of the algorithms on instances of different scales: Sc = 0.5·Smin + 0.4·Savg + 0.1·Smax. (17) Since Smin, Savg, and Smax affect the performance of the algorithm differently, they are assigned the weights 0.5, 0.4, and 0.1, respectively. The smaller Sc is, the better the comprehensive performance of the algorithm.
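Assuming Sc is the plain weighted sum of the three statistics described above (a reading of Formula (17) based on the stated weights, not a reproduction of the authors' code), the metric can be computed as:

```python
def composite_score(s_min, s_avg, s_max, weights=(0.5, 0.4, 0.1)):
    """Weighted composite score Sc; smaller values indicate better
    comprehensive performance. Default weights follow the text:
    0.5 for Smin, 0.4 for Savg, 0.1 for Smax."""
    w_min, w_avg, w_max = weights
    return w_min * s_min + w_avg * s_avg + w_max * s_max
```

For example, an algorithm with Smin = 100, Savg = 110, Smax = 120 scores Sc = 106, so a rival whose Sc is lower is judged better overall even if one of the three statistics is worse.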

The values of Sc for the 18 instances are calculated for DSRABSQL and VDWOA, with results shown in Tables 4 and 5. To verify that DSRABSQL shows more advantages in solving medium- and large-scale instances, the 18 instances are divided into two groups, G1 and G2: G1 includes 14 small-scale instances, and G2 includes 4 medium- and large-scale instances. In group G1, the Sc values of DSRABSQL are marked as Sample 1 and those of VDWOA as Sample 2; in group G2, the Sc values of DSRABSQL are marked as Sample 3 and those of VDWOA as Sample 4. Then, the following hypotheses are made:

Hypothesis 1(a): Sc of DSRABSQL in G1 and G2 is not significantly different from that of VDWOA;

Hypothesis 1(b): Sc of DSRABSQL in G1 and G2 differ significantly from that of VDWOA.

Table 4. Sc and rank of DSRABSQL and VDWOA algorithms in G1.

https://doi.org/10.1371/journal.pone.0283207.t004

Then, the Sc values of Samples 1 and 2 are sorted in ascending order. R1 and R2 are the respective rank sums of Samples 1 and 2. The Mann-Whitney-Wilcoxon test is performed on R1 and R2 to obtain the statistics U1 and U2; the results are shown in Table 4. Similarly, R3 and R4 are the rank sums of Samples 3 and 4, and U3 and U4 are obtained by the same process; the results are shown in Table 5.
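The rank-sum-to-U conversion behind Tables 4 and 5 follows the standard formula U_i = R_i − n_i(n_i + 1)/2. A self-contained sketch (the sample values in the test are hypothetical, not the paper's Sc data):

```python
def mann_whitney_u(x, y):
    """Return (U1, U2) for samples x and y, using average ranks for ties."""
    combined = sorted((v, src) for src, sample in ((0, x), (1, y)) for v in sample)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1                      # tie group spans 1-based ranks i+1 .. j
        avg_rank = (i + 1 + j) / 2      # average rank assigned to the whole group
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    r1 = sum(ranks[k] for k, (_, src) in enumerate(combined) if src == 0)
    n1, n2 = len(x), len(y)
    u1 = r1 - n1 * (n1 + 1) / 2         # U1 = R1 - n1(n1+1)/2
    return u1, n1 * n2 - u1             # U1 + U2 = n1 * n2
```

The smaller of U1 and U2 is then compared against the tabulated Mann-Whitney critical value for the chosen significance level, as done in the text below.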

Table 5. Sc and rank of DSRABSQL and VDWOA algorithm in G2.

https://doi.org/10.1371/journal.pone.0283207.t005

For U1 and U3 in Tables 4 and 5, checking the Mann-Whitney table gives U1 = 27 < 47 = U(α = 0.02) < U(α = 0.025), so U1 remains below the critical value after Bonferroni correction, and U3 = 0 = U(α = 0.05); thus the probability p that Hypothesis 1(b) is true is 0.95. This shows that, compared with VDWOA, the DSRABSQL algorithm has significant advantages on instances of all scales.

Fig 11 also shows Nmin obtained by testing instances kroa100, ch150, d198, and fl417 50 times each with DSRABSQL and VDWOA under the condition of convergence. The results show that the min of DSRABSQL is better than that of VDWOA and that, once the algorithms converge, DSRABSQL has a much higher probability of obtaining min.

Fig 11. The Nmin of DSRABSQL and VDWOA algorithms obtained by testing kroa100, ch150, d198, fl417 50 times, respectively.

https://doi.org/10.1371/journal.pone.0283207.g011

To further verify the significance of the difference between DSRABSQL and VDWOA on groups G1 and G2, Vargha and Delaney's Â statistic was used to compute the effect size, with the following results: Â21 = 0.86 > 0.71 and Â43 = 1 > 0.71. This means that the probability that DSRABSQL outperforms VDWOA on instances of all scales is about 0.86–1. Combined with the conclusions obtained from Fig 10 and Table 2, it can further be concluded that DSRABSQL has better learning ability than the population- and random-search-based VDWOA.
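Vargha and Delaney's Â is the probability that a value drawn from one sample exceeds a value drawn from the other, with ties counted half. A minimal sketch (the samples in the test are hypothetical):

```python
def vargha_delaney_a(x, y):
    """A-hat effect size: P(value from x > value from y), ties counted 0.5.
    0.5 means no effect; values above ~0.71 are conventionally 'large'."""
    greater = sum(1 for a in x for b in y if a > b)
    equal = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * equal) / (len(x) * len(y))
```

Note that since smaller Sc is better, an Â near 1 for VDWOA's Sc over DSRABSQL's Sc (as with Â21 and Â43 here) indicates a large advantage for DSRABSQL.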

The above analysis shows that the performance of DSRABSQL is significantly improved compared with QL, ABSQL, and VDWOA, because DSRABSQL undergoes two stages of improvement based on the QL algorithm. The first stage, from QL to ABSQL, introduces the strategies stgy_1–stgy_4, which are all improvement methods within the QL framework. The second stage, from ABSQL to DSRABSQL, introduces the stgy_5 strategy, an improvement method outside the QL framework. We compare the improvement effects of these strategies in what follows.

  1. Two groups of comparative experiments are designed. Group 1 includes QL and five comparison algorithms, obtained by combining one of the five strategies above with QL and denoted QL+1, …, QL+4, QL+5. QL+1–QL+4 add an improvement within the QL framework, and QL+5 adds the improvement outside the QL framework. Group 2 includes DSRABSQL and five comparison algorithms, obtained by removing one of the five strategies from DSRABSQL and denoted DSR-1, …, DSR-4, DSR-5. DSR-1–DSR-4 retain the improvement outside the QL framework, while DSR-5 retains only the improvements within the QL framework. The relationships between the algorithms and the five strategies are shown in Table 6.
  2. For every algorithm in each group, instances ch150, rat575, and fl1400 are tested 50, 20, and 10 times, respectively. The experimental results for ch150, rat575, and fl1400 are illustrated by the boxplots of Figs 12–14, respectively.
  3. For each comparison algorithm, the effects of the strategies on algorithm improvement are judged from the position distributions of the boxes in these figures. The values of min, avg, and max in the boxplots are converted to Sc for processing, with results shown in Table 7.
Fig 12. Boxplots obtained by testing ch150 for 50 times by each algorithm.

https://doi.org/10.1371/journal.pone.0283207.g012

Fig 13. Boxplots obtained by testing rat575 for 20 times by each algorithm.

https://doi.org/10.1371/journal.pone.0283207.g013

Fig 14. Boxplots obtained by testing fl1400 for 10 times by each algorithm.

https://doi.org/10.1371/journal.pone.0283207.g014

Table 6. Table of relationships between algorithms and five strategies for two experimental groups.

https://doi.org/10.1371/journal.pone.0283207.t006

Table 7. Sc of ch150, rat575 and fl1400 for algorithms of each group.

https://doi.org/10.1371/journal.pone.0283207.t007

According to Figs 12–14, two kinds of comparisons are performed, as follows:

(1) Horizontal comparison

For ch150 in experimental Group 1, the improvement over QL increases sequentially across QL+2, QL+4, QL+1, QL+3, and QL+5. In experimental Group 2, the performance of DSR-2, DSR-4, DSR-1, DSR-3, and DSR-5 deteriorates progressively relative to DSRABSQL.

For rat575 and fl1400 in experimental Group 1, the improvement degree of QL+2, QL+4, QL+1, QL+3, and QL+5 increases sequentially compared with QL. In experimental Group 2, the performance of DSR-2, DSR-4, DSR-1, DSR-5, and DSR-3 deteriorates gradually compared with DSRABSQL.

(2) Vertical comparison

In experimental Group 1, for ch150, rat575, and fl1400, the improvement degree of QL+2, QL+4, QL+1, QL+3, and QL+5 increases sequentially compared with QL, and with expansion of the instance scale, the solution quality of QL+3 is constantly approaching that of QL+5.

In experimental Group 2, for ch150, the performance of DSR-2, DSR-4, DSR-1, DSR-3, and DSR-5 deteriorates gradually compared with DSRABSQL; for rat575 and fl1400, the performance of DSR-2, DSR-4, DSR-1, DSR-5, and DSR-3 deteriorates gradually compared with DSRABSQL. It can also be found that, as the instance scale increases, the gap between DSR-3 and the other four algorithms grows gradually, while the solution quality of DSR-1 and DSR-4 steadily approaches that of DSR-5.

The above analysis shows that for small-scale instances, stgy_5 makes the greatest contribution to both QL and DSRABSQL, indicating its key role in both the early and later improvement stages of the algorithm. For medium- and large-scale instances, stgy_5 contributes most to the QL algorithm and stgy_3 contributes most to DSRABSQL, which indicates that stgy_5 plays a key role in the early improvement stage and stgy_3 in the later improvement stage. In conclusion, stgy_5 has the greatest potential for small-scale instances, and stgy_3 for medium- and large-scale instances. stgy_1 and stgy_4 also have some potential for large-scale instances, while stgy_2 has no obvious influence at any instance scale.

In addition, from the differences in Sc between the algorithms in Table 7, it can be found that collaborations exist between the strategies. To describe this phenomenon in detail, the following data are calculated based on Table 7. Since stgy_5 is the only one of the five strategies outside the QL framework, and it achieves the best improvement without the collaboration of other strategies, we use it as the reference strategy. The difference in Sc between QL+5 and QL+k (k ∈ {1, 2, 3, 4}) is denoted d5-k and calculated according to Formula (18), and the difference in Sc between DSR-5 and DSR-k is denoted d̂5-k and calculated according to Formula (19), where ScQL+k, ScQL+5, ScDSR-k, and ScDSR-5 represent the Sc values of the corresponding instances for QL+k, QL+5, DSR-k, and DSR-5, respectively. For algorithm QL+k, no collaboration exists between stgy_1–stgy_5, but collaboration does exist for DSR-k. These algorithms are divided into four classes according to their adopted strategies, and the relevant data are listed in Table 8.

Table 8. The difference of Sc between stgy_5 and stgy_1–stgy_4 with and without collaboration.

https://doi.org/10.1371/journal.pone.0283207.t008

From Table 8, the following can be observed. For ch150 in class I, the Sc of QL combined with stgy_1 (QL+1) is 2.14% larger than that of QL combined with stgy_5 (QL+5); however, the Sc of QL combined with stgy_1–stgy_4 (DSR-5) is only 1.42% larger than that of QL combined with stgy_2–stgy_5 (DSR-1), which indicates that, with the collaboration of stgy_2, stgy_3, and stgy_4, the difference in improvement effect between stgy_1 and stgy_5 gradually decreases. The same holds for classes II–IV, and similar results can be deduced for rat575 and fl1400.

In other words, without collaboration, the improvement effects of stgy_1–stgy_4 are all worse than that of stgy_5. Under collaboration, however, the difference in improvement effect between stgy_1–stgy_4 and stgy_5 gradually decreases, and the improvement effects of stgy_1 and stgy_3 are sometimes even better than that of stgy_5. This indicates a strong collaboration effect among stgy_1–stgy_4, while stgy_5 is relatively independent.

In class I, for ch150 the Sc of DSR-5 is 1.42% larger than that of DSR-1; for rat575 it is 0.75% larger; and for fl1400 it is 0.2% smaller. This indicates that, with the expansion of the instance scale and the collaboration of stgy_2, stgy_3, and stgy_4, the difference in improvement effect between stgy_1 and stgy_5 decreases until stgy_1 overtakes stgy_5. The same is found for classes II–IV. Therefore, as the instance scale grows, the collaboration effect among stgy_1–stgy_4 increasingly approaches, and eventually exceeds, the improvement effect of stgy_5.

The above analysis shows that DSRABSQL has clear advantages in solving the TSP. To further verify the effectiveness of the algorithm, we select S2V-DQN and the 2-opt heuristic combined with deep reinforcement learning (2-opt-DRL for short) from categories 1 and 2 described in Section 1 to compare with DSRABSQL. Table 9 lists the min and the average of Smin obtained by DSRABSQL and these two algorithms on 25 instances.

Table 9. The min obtained by DSRABSQL, S2V-DQN and 2-opt-DRL algorithms for 25 instances.

https://doi.org/10.1371/journal.pone.0283207.t009

For these 25 instances, DSRABSQL obtains better results than S2V-DQN on 23 instances and better results than 2-opt-DRL on 19 instances. Overall, the average Smin of DSRABSQL is 0.33 times that of S2V-DQN and 0.18 times that of 2-opt-DRL, which shows that the solutions of DSRABSQL retain a clear advantage over those of the RL-based methods.

5. Conclusion and future work

We proposed a dynamic sub-route-based self-adaptive beam search Q-learning algorithm for medium- to large-scale TSPs. On the basis of the QL algorithm, a weighting function-based reward matrix, a power function-based initial Q-table, a self-adaptive ε-beam search strategy, a new Q-value update formula, and a dynamic sub-route optimization strategy were introduced. The first four are improvement strategies within the QL framework, and the last is outside of it.

Compared with the comparative algorithms, DSRABSQL not only has certain advantages in solution time, solution quality, and stability, but also exhibits strong local optimization and good learning abilities. However, it still falls short of the current best heuristic algorithms, such as LKH and DEACO. Therefore, further improvement of DSRABSQL is necessary.

For the five proposed improvement strategies of DSRABSQL, the experimental results show that they accelerate the convergence of the algorithm, enhance its search and local optimization abilities, and balance exploration and exploitation. In addition, there are strong collaboration effects among the four strategies within the QL framework, which grow stronger as the instance scale expands. Although the dynamic sub-route optimization strategy has the most obvious improvement effect on small-scale instances, its relative independence leads to a fixed improvement effect, which becomes limiting as the problem size increases. The self-adaptive ε-beam search strategy not only makes the greatest contribution on medium- and large-scale problems, but also has strong collaboration effects with the other strategies within the QL framework, and its advantage grows with the instance size. Therefore, further improvement of the self-adaptive ε-beam search strategy will be studied.

DSRABSQL still faces challenges in solving larger-scale problems. Moreover, its many manually set parameters increase the complexity of parameter tuning and may result in poor solutions for certain types of problems.

In future work, we hope to improve the self-adaptive ε-beam search strategy, enhance the collaboration effects between strategies, and adjust the parameters of each strategy to find a robust QL framework to improve the solution quality of large-scale problems and obtain a model with a better solution capability.

References

  1. Lawler EL. The traveling salesman problem: a guided tour of combinatorial optimization. Wiley-Interscience Series in Discrete Mathematics. 1986; 18(5): 514–515. https://doi.org/10.1112/blms/18.5.514
  2. Pekny JF, Miller DL. An Exact Parallel Algorithm for the Resource Constrained Traveling Salesman Problem with Application to Scheduling with an Aggregate Deadline. In: ACM Conference on Computer Science; 1990. p. 208–214. https://doi.org/10.1145/100348.100380
  3. Pesant G, Gendreau M, Rousseau PM. An Exact Constraint Logic Programming Algorithm for the Traveling Salesman Problem with Time Windows. Transportation Science. 1998; 32(1): 12–29. https://doi.org/10.1287/trsc.32.1.12
  4. Sanches D, Whitley D, Tinos R. Improving an exact solver for the traveling salesman problem using partition crossover. In: Genetic and Evolutionary Computation Conference; 2017. p. 337–344. https://doi.org/10.1145/3071178.3071304
  5. Helsgaun K. An effective implementation of the Lin–Kernighan traveling salesman heuristic. European Journal of Operational Research. 2000; 126(1): 106–130. https://doi.org/10.1016/S0377-2217(99)00284-2
  6. Ebadinezhad S. DEACO: Adopting dynamic evaporation strategy to enhance ACO algorithm for the traveling salesman problem. Engineering Applications of Artificial Intelligence. 2020; 92: 103649. https://doi.org/10.1016/j.engappai.2020.103649
  7. Wang Z, Shen Y, Li S, Wang S. A fine-grained fast parallel genetic algorithm based on a ternary optical computer for solving traveling salesman problem. The Journal of Supercomputing. 2022: 1–31. https://doi.org/10.1007/s11227-022-04813-9
  8. Zheng R, Zhang Y, Yang K. A transfer learning-based particle swarm optimization algorithm for travelling salesman problem. Journal of Computational Design and Engineering. 2022; 9(3): 933–948. https://doi.org/10.1093/jcde/qwac039
  9. Zhang J, Hong L, Liu Q. An Improved Whale Optimization Algorithm for the Traveling Salesman Problem. Symmetry. 2020; 13(1): 48–60. https://doi.org/10.3390/sym13010048
  10. Panwar K, Deep K. Discrete Grey Wolf Optimizer for symmetric travelling salesman problem. Applied Soft Computing. 2021; 105: 107298. https://doi.org/10.1016/j.asoc.2021.107298
  11. Vinyals O, Fortunato M, Jaitly N. Pointer networks. In: Advances in Neural Information Processing Systems, vol. 28; 2015.
  12. Bello I, Pham H, Le QV, Norouzi M, Bengio S. Neural combinatorial optimization with reinforcement learning. In: 5th International Conference on Learning Representations; 2019. https://doi.org/10.48550/arXiv.1611.09940
  13. Khalil E, Dai H, Zhang Y, Dilkina B, Song L. Learning combinatorial optimization algorithms over graphs. In: Advances in Neural Information Processing Systems, vol. 30; 2017. p. 6348–6358.
  14. Deudon M, Cournut P, Lacoste A, Adulyasak Y, Rousseau LM. Learning heuristics for the TSP by policy gradient. In: International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, vol. 10848; 2018. p. 170–181. https://doi.org/10.1007/978-3-319-93031-2_12
  15. Ma Q, Ge S, He D, Thaker D, Drori I. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In: Association for the Advancement of Artificial Intelligence; 2019. https://doi.org/10.48550/arXiv.1911.04936
  16. Stohy A, Abdelhakam HT, Ali S, Elhenawy M, Hassan AA, Masoud M, et al. Hybrid pointer networks for traveling salesman problems optimization. PLoS One. 2021; 16(12): e0260995. pmid:34905571
  17. Liu F, Zeng G. Study of genetic algorithm with reinforcement learning to solve the TSP. Expert Systems with Applications. 2009; 36(3): 6995–7001. https://doi.org/10.1016/j.eswa.2008.08.026
  18. Alipour MM, Razavi SN, Derakhshi MRF, Balafar MA. A hybrid algorithm using a genetic algorithm and multiagent reinforcement learning heuristic to solve the traveling salesman problem. Neural Computing and Applications. 2018; 30(9): 2935–2951. https://doi.org/10.1007/s00521-017-2880-4
  19. Costa PRdO, Rhuggenaath J, Zhang Y, Akcay A. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In: Asian Conference on Machine Learning, vol. 129; 2020. p. 465–480.
  20. Zheng J, He K, Zhou J, Li CM. Combining reinforcement learning with Lin-Kernighan-Helsgaun algorithm for the traveling salesman problem. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35; 2021. p. 12445–12452. https://doi.org/10.1609/aaai.v35i14.17476
  21. Chaves AA, Lorena L. An Adaptive and Near Parameter-Free BRKGA Using Q-Learning Method. In: 2021 IEEE Congress on Evolutionary Computation; 2021. p. 2331–2338. https://doi.org/10.1109/CEC45853.2021.9504766
  22. Ottoni ALC, Nepomuceno EG, de Oliveira MS. A response surface model approach to parameter estimation of reinforcement learning for the travelling salesman problem. Journal of Control, Automation and Electrical Systems. 2018; 29(3): 350–359. https://doi.org/10.1007/s40313-018-0374-y
  23. Beom HR, Cho HS. A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics. 1995; 25(3): 464–477. https://doi.org/10.1109/21.364859
  24. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, et al. Mastering the game of Go with deep neural networks and tree search. Nature. 2016; 529(7587): 484–489. pmid:26819042
  25. Moravčík M, Schmid M, Burch N, Lisý V, Morrill D, Bard N, et al. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science. 2017; 356(6337): 508–513. pmid:28254783
  26. Ontanón S, Synnaeve G, Uriarte A, Richoux F, Churchill D, Preuss M. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games. 2013; 5(4): 293–311. https://doi.org/10.1109/TCIAIG.2013.2286295
  27. Kober J, Bagnell JA, Peters J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research. 2013; 32(11): 1238–1274. https://doi.org/10.1177/0278364913495721
  28. Ng AY, Kim HJ, Jordan MI, Sastry S. Autonomous helicopter flight via reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 16; 2003.
  29. Wei H, Zheng GJ, Yao HX, Li ZH. IntelliLight: A reinforcement learning approach for intelligent traffic light control. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018. p. 2496–2505. https://doi.org/10.1145/3219819.3220096
  30. Wang D, Qi F. Trajectory planning for a four-wheel-steering vehicle. In: Proceedings 2001 International Conference on Robotics and Automation, vol. 4; 2001. p. 3320–3325. https://doi.org/10.1109/ROBOT.2001.933130
  31. Li MN, Qin ZW, Jiao Y, Yang YD, Wang J, Wang CX, et al. Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In: The World Wide Web Conference; 2019. p. 983–994. https://doi.org/10.1145/3308558.3313433
  32. Akalin N, Loutfi A. Reinforcement learning approaches in social robotics. Sensors. 2021; 21(4): 1292–1328. pmid:33670257
  33. Rummery GA, Niranjan M. On-line Q-learning using connectionist systems. Cambridge: Cambridge University Press; 1994.
  34. Watkins CJCH, Dayan P. Q-learning. Machine Learning. 1992; 8(3–4): 279–292. https://doi.org/10.1007/BF00992698
  35. Zamfirache IA, Precup RE, Roman RC, Petriu EM. Reinforcement Learning-based Control Using Q-learning and Gravitational Search Algorithm with Experimental Validation on a Nonlinear Servo System. Information Sciences. 2021; 583: 99–120. https://doi.org/10.1016/j.ins.2021.10.070
  36. Huynh TN, Do DTT, Lee J. Q-Learning-based parameter control in differential evolution for structural optimization. Applied Soft Computing. 2021; 107: 107464–107477. https://doi.org/10.1016/j.asoc.2021.107464
  37. Liu Q, Zhai JW, Zhang ZZ, Zhong S, Zhou Q, Zhang P. A survey on deep reinforcement learning. Chinese Journal of Computers. 2018; 41(1): 1–27.
  38. Mao GJ, Gu SM. An Improved Q-Learning Algorithm and Its Application in Path Planning. J. Taiyuan Univ. Technol. 2021; 52(1): 91–97.
  39. Lowerre BT. The Harpy speech recognition system. Carnegie Mellon University; 1976.
  40. Zhang JF, Sun YT, Li GC, Wang YH, Sun JB, Li JX. Machine-learning-assisted shear strength prediction of reinforced concrete beams with and without stirrups. Engineering with Computers. 2020; 5: 1–15. https://doi.org/10.1007/s00366-020-01076-x
  41. Kai-Wen L, Tao Z, Rui W, et al. Research reviews of combinatorial optimization methods based on deep reinforcement learning. Acta Automatica Sinica. 2021; 47(11): 2521–2537.